Batched Tree Ensembles

Batched Tree Ensemble Classifier

class snapml.BatchedTreeEnsembleClassifier(base_ensemble=SnapBoostingMachineClassifier(), max_sub_ensembles=10, inner_lr_scaling=0.5, outer_lr_scaling=0.5)

This class enables batched training of a tree ensemble classifier on large datasets. Given a tree ensemble classifier, provided as a base ensemble, the algorithm will split the trees into a number of sub-ensembles. Each sub-ensemble is trained on a different batch of data, and the boosting mechanism is applied across batches to improve accuracy.

Parameters

base_ensemble: {sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.ExtraTreesClassifier, snapml.SnapRandomForestClassifier, snapml.SnapBoostingMachineClassifier, xgboost.XGBClassifier, lightgbm.LGBMClassifier}, default=snapml.SnapBoostingMachineClassifier: The base ensemble that will be split into sub-ensembles and used for batched training.
max_sub_ensembles: int, default=10: The maximum number of sub-ensembles to use. It is recommended to set this parameter roughly equal to the expected number of batches. If more batches are provided than the number of sub-ensembles, the last sub-ensemble will be replaced.
outer_lr_scaling: float, default=0.5: The boosting mechanism across batches uses the learning rate 1.0 / (max_sub_ensembles ** outer_lr_scaling).
inner_lr_scaling: float, default=0.5: If the base ensemble has a learning rate (e.g. it is a boosting machine), that learning rate is scaled by the factor max_sub_ensembles ** inner_lr_scaling.
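
As a quick arithmetic check of the two scaling formulas above, the following sketch evaluates them in plain Python for the default settings. The helper names and the base learning rate of 0.1 are illustrative assumptions, not part of the snapml API.

```python
def outer_learning_rate(max_sub_ensembles, outer_lr_scaling):
    """Learning rate used by the boosting mechanism across batches."""
    return 1.0 / (max_sub_ensembles ** outer_lr_scaling)


def scaled_inner_learning_rate(base_lr, max_sub_ensembles, inner_lr_scaling):
    """Base-ensemble learning rate multiplied by the documented factor."""
    return base_lr * (max_sub_ensembles ** inner_lr_scaling)


# Defaults from the class signature: max_sub_ensembles=10, both scalings 0.5.
outer_lr = outer_learning_rate(10, 0.5)

# base_lr=0.1 is an assumed value chosen only for illustration.
inner_lr = scaled_inner_learning_rate(0.1, 10, 0.5)

print(f"outer learning rate: {outer_lr:.4f}")
print(f"inner learning rate: {inner_lr:.4f}")
```

With the default exponent of 0.5, the outer rate is divided by the square root of max_sub_ensembles and the inner rate is multiplied by it. Setting either exponent to 0 makes the corresponding factor equal to 1, so no scaling is applied.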

Attributes

n_classes_: int: The number of classes.
classes_: ndarray, shape (n_classes,): Set of unique classes.
ensembles_: list: Trained sub-ensembles.

build_ensemble(X, target, weights)

Build a new sub-ensemble and insert it into the model.

Parameters

X: ndarray, shape (n_samples, n_features): Batch of training data.
target: ndarray, shape (n_samples,): Boosting target.
weights: ndarray, shape (n_samples,): Boosting weights.
first_batch: bool: Whether this is the first batch.

fit(X, y, sample_weight=None)

Fit the base ensemble on a batch of data.

Parameters

X: ndarray, shape (n_samples, n_features): Training data.
y: ndarray, shape (n_samples,): Training labels.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.

Returns

self: object: Returns an instance of self.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

partial_fit(X, y, sample_weight=None, classes=None)

Continue training the model with a new batch of data.

Parameters

X: ndarray, shape (n_samples, n_features): Batch of training data.
y: ndarray, shape (n_samples,): Batch of training labels.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.
classes: ndarray, shape (n_classes,), default=None: Set of unique classes across the entire dataset. This argument is only required for the first call to partial_fit.

Returns

self: object: Returns an instance of self.
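
The calling pattern described for partial_fit can be sketched as a small batching loop. The helpers iter_batches and fit_in_batches are hypothetical names introduced here for illustration; the loop relies only on the documented partial_fit(X, y, sample_weight=None, classes=None) signature and passes classes on the first call only.

```python
def iter_batches(X, y, batch_size):
    """Yield consecutive (X_batch, y_batch) slices of at most batch_size rows."""
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        yield X[start:stop], y[start:stop]


def fit_in_batches(model, X, y, classes, batch_size):
    """Feed the data to model.partial_fit one batch at a time."""
    for index, (X_batch, y_batch) in enumerate(iter_batches(X, y, batch_size)):
        if index == 0:
            # The full set of classes is only required on the first call.
            model.partial_fit(X_batch, y_batch, classes=classes)
        else:
            model.partial_fit(X_batch, y_batch)
    return model
```

In practice, model would be a BatchedTreeEnsembleClassifier instance, X and y would be NumPy arrays (which support the same slicing), and classes could be computed over the full label set, for example with numpy.unique.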

predict(X)

Predict class labels.

Parameters

X: ndarray, shape (n_samples, n_features): Samples to be used for prediction.

Returns

pred: ndarray, shape (n_samples,): Predicted class labels.

predict_proba(X)

Predict class probabilities.

Parameters

X: ndarray, shape (n_samples, n_features): Samples to be used for prediction.

Returns

pred: ndarray, shape (n_samples, n_classes): Predicted class probabilities.

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, a harsh metric because it requires the entire label set of each sample to be predicted correctly.

Parameters

X: array-like of shape (n_samples, n_features): Test samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs): True labels for X.
sample_weight: array-like of shape (n_samples,), default=None: Sample weights.

Returns

score: float: Mean accuracy of self.predict(X) with respect to y.
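
To make the returned quantity concrete, here is a hand-rolled mean accuracy in plain Python, with and without sample weights. The function mean_accuracy is a hypothetical helper written for this illustration; it is not the code that score executes.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of samples whose predicted label equals the true label."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# Three of the four predictions are correct.
unweighted = mean_accuracy(y_true, y_pred)

# The misclassified sample carries weight 2, so it counts twice in the total.
weighted = mean_accuracy(y_true, y_pred, sample_weight=[1, 1, 2, 1])

print(unweighted, weighted)
```

Here the unweighted accuracy is 3/4, while giving the misclassified sample a weight of 2 lowers the result to 3/5.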

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.
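
The <component>__<parameter> naming convention can be illustrated by splitting parameter names on the double underscore. The function split_nested_params is a hypothetical helper that only demonstrates the naming scheme; it does not reproduce the estimator's own implementation. The key base_ensemble__n_estimators assumes a base ensemble that exposes an n_estimators parameter, as sklearn.ensemble.RandomForestClassifier does.

```python
def split_nested_params(params):
    """Separate top-level parameters from <component>__<parameter> entries."""
    own, nested = {}, {}
    for key, value in params.items():
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            # Parameter addressed to a nested object, e.g. the base ensemble.
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"max_sub_ensembles": 5, "base_ensemble__n_estimators": 50}
)
print(own)
print(nested)
```

In this example max_sub_ensembles belongs to the batched estimator itself, while n_estimators is routed to the object stored under base_ensemble.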

train_on_batch(X, y, sample_weight=None)

Train on a new batch of data.

Parameters

X: ndarray, shape (n_samples, n_features): Batch of training data.
y: ndarray, shape (n_samples,): Batch of training labels.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.

Batched Tree Ensemble Regressor

class snapml.BatchedTreeEnsembleRegressor(base_ensemble=SnapBoostingMachineRegressor(), max_sub_ensembles=10, inner_lr_scaling=0.5, outer_lr_scaling=0.5)

This class enables batched training of a tree ensemble regressor on large datasets. Given a tree ensemble regressor, provided as a base ensemble, the algorithm will split the trees into a number of sub-ensembles. Each sub-ensemble is trained on a different batch of data, and the boosting mechanism is applied across batches to improve accuracy.
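
The general idea of boosting across batches can be sketched with a toy example. Everything here is an assumption made for illustration: squared loss (so the boosting target is the residual), a constant MeanPredictor standing in for a tree sub-ensemble, and a single learning rate applied to every sub-ensemble. It is not snapml's implementation.

```python
class MeanPredictor:
    """Toy stand-in for a tree sub-ensemble: predicts the mean of its targets."""

    def fit(self, X, target):
        self.value = sum(target) / len(target)
        return self

    def predict(self, X):
        return [self.value] * len(X)


def boosted_predict(sub_ensembles, learning_rate, X):
    """Sum the sub-ensemble outputs, each scaled by the learning rate."""
    totals = [0.0] * len(X)
    for sub in sub_ensembles:
        totals = [t + learning_rate * p for t, p in zip(totals, sub.predict(X))]
    return totals


def train_on_batches(batches, learning_rate):
    """Fit one sub-ensemble per batch on the residual of the current model."""
    sub_ensembles = []
    for X, y in batches:
        current = boosted_predict(sub_ensembles, learning_rate, X)
        # Assumed squared-loss boosting target: what the model still gets wrong.
        residual = [yi - ci for yi, ci in zip(y, current)]
        sub_ensembles.append(MeanPredictor().fit(X, residual))
    return sub_ensembles


# Three batches drawn from the same constant target of 2.0.
batches = [([[0.0], [1.0]], [2.0, 2.0]) for _ in range(3)]
model = train_on_batches(batches, learning_rate=0.5)
prediction = boosted_predict(model, 0.5, [[0.0]])[0]
print(prediction)
```

Each new batch moves the combined prediction closer to the target value, which is the sense in which later sub-ensembles correct the errors left by earlier ones.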

Parameters

base_ensemble: {sklearn.ensemble.RandomForestRegressor, sklearn.ensemble.ExtraTreesRegressor, snapml.SnapRandomForestRegressor, snapml.SnapBoostingMachineRegressor, xgboost.XGBRegressor, lightgbm.LGBMRegressor}, default=snapml.SnapBoostingMachineRegressor: The base ensemble that will be split into sub-ensembles and used for batched training.
max_sub_ensembles: int, default=10: The maximum number of sub-ensembles to use. It is recommended to set this parameter roughly equal to the expected number of batches. If more batches are provided than the number of sub-ensembles, the last sub-ensemble will be replaced.
outer_lr_scaling: float, default=0.5: The boosting mechanism across batches uses the learning rate 1.0 / (max_sub_ensembles ** outer_lr_scaling).
inner_lr_scaling: float, default=0.5: If the base ensemble has a learning rate (e.g. it is a boosting machine), that learning rate is scaled by the factor max_sub_ensembles ** inner_lr_scaling.
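
The replacement rule stated for max_sub_ensembles can be sketched as simple list bookkeeping. The function insert_sub_ensemble is a hypothetical helper, and strings stand in for trained sub-ensembles; the sketch mirrors only the documented behaviour, not the library's internal code.

```python
def insert_sub_ensemble(ensembles, new_ensemble, max_sub_ensembles):
    """Append while there is room; otherwise overwrite the last sub-ensemble."""
    if len(ensembles) < max_sub_ensembles:
        ensembles.append(new_ensemble)
    else:
        ensembles[-1] = new_ensemble
    return ensembles


ensembles = []
for batch_id in range(5):
    # Placeholder strings stand in for sub-ensembles trained on each batch.
    insert_sub_ensemble(ensembles, f"batch {batch_id}", max_sub_ensembles=3)

print(ensembles)
```

With five batches and max_sub_ensembles=3, the first two sub-ensembles are kept and the last slot ends up holding the sub-ensemble from the final batch, which is why setting max_sub_ensembles close to the expected number of batches is recommended.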

Attributes

ensembles_: list: Trained sub-ensembles.

build_ensemble(X, target, weights)

Build a new sub-ensemble and insert it into the model.

Parameters

X: ndarray, shape (n_samples, n_features): Batch of training data.
target: ndarray, shape (n_samples,): Boosting target.
weights: ndarray, shape (n_samples,): Boosting weights.
first_batch: bool: Whether this is the first batch.

fit(X, y, sample_weight=None)

Fit the base ensemble on a batch of data.

Parameters

X: ndarray, shape (n_samples, n_features): Training data.
y: ndarray, shape (n_samples,): Training regression targets.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.

Returns

self: object: Returns an instance of self.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

partial_fit(X, y, sample_weight=None)

Continue training the model with a new batch of data.

Parameters

X: ndarray, shape (n_samples, n_features): Batch of training data.
y: ndarray, shape (n_samples,): Batch of training regression targets.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.

Returns

self: object: Returns an instance of self.

predict(X)

Predict target values.

Parameters

X: ndarray, shape (n_samples, n_features): Samples to be used for prediction.

Returns

pred: ndarray, shape (n_samples,): Predicted target values.

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \(1 - \frac{u}{v}\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative because the model can be arbitrarily worse than a constant predictor. A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters

X: array-like of shape (n_samples, n_features): Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y: array-like of shape (n_samples,) or (n_samples, n_outputs): True values for X.
sample_weight: array-like of shape (n_samples,), default=None: Sample weights.

Returns

score: float: \(R^2\) of self.predict(X) with respect to y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 onwards, to stay consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
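
The definition of \(R^2\) given for score can be checked by hand on a small example. The function r2_by_hand below is a hypothetical, unweighted, single-output helper written in plain Python for illustration; it follows the formula \(1 - \frac{u}{v}\) directly and ignores sample weights.

```python
def r2_by_hand(y_true, y_pred):
    """Compute 1 - u/v with u the residual and v the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]

# Perfect predictions leave no residual.
perfect = r2_by_hand(y_true, y_true)

# Always predicting the mean of y_true makes u equal to v.
constant = r2_by_hand(y_true, [2.5, 2.5, 2.5, 2.5])

# Predictions in reversed order are worse than the constant model.
reversed_order = r2_by_hand(y_true, [4.0, 3.0, 2.0, 1.0])

print(perfect, constant, reversed_order)
```

The three cases correspond to the statements in the description: a perfect model scores 1.0, the constant mean predictor scores 0.0, and a model that is worse than the constant predictor scores below zero.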

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.

train_on_batch(X, y, sample_weight=None)

Train on a new batch of data.

Parameters

X: ndarray, shape (n_samples, n_features): Batch of training data.
y: ndarray, shape (n_samples,): Batch of training regression targets.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.