Batched Tree Ensembles

Batched Tree Ensemble Classifier

class snapml.BatchedTreeEnsembleClassifier(base_ensemble=SnapBoostingMachineClassifier(), max_sub_ensembles=10, inner_lr_scaling=0.5, outer_lr_scaling=0.5)

This class enables batched training of a tree ensemble classifier on large datasets. Given a tree ensemble classifier, provided as a base ensemble, the algorithm will split the trees into a number of sub-ensembles. Each sub-ensemble is trained on a different batch of data, and the boosting mechanism is applied across batches to improve accuracy.

Parameters:

base_ensemble: {sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.ExtraTreesClassifier, snapml.SnapRandomForestClassifier, snapml.SnapBoostingMachineClassifier, xgboost.XGBClassifier, lightgbm.LGBMClassifier}, default=snapml.SnapBoostingMachineClassifier: The base ensemble that will be split into sub-ensembles and used for batched training.
max_sub_ensembles: int, default=10: The maximum number of sub-ensembles to use. It is recommended to set this parameter roughly equal to the expected number of batches. If more batches are provided than the number of sub-ensembles, the last sub-ensemble will be replaced.
outer_lr_scaling: float, default=0.5: The boosting mechanism across batches will use learning rate 1.0/(max_sub_ensembles ** outer_lr_scaling)
inner_lr_scaling: float, default=0.5: If the base ensemble has a learning rate (e.g. it is a boosting machine), the learning rate will be scaled by a factor (max_sub_ensembles ** inner_lr_scaling)
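
As a rough illustration of the two scaling formulas above, the following sketch evaluates them for the default constructor values. It is plain arithmetic rather than a call into snapml; the variable names outer_lr and inner_factor are illustrative, and how the inner factor is applied to the base ensemble's learning rate is left to the library.

```python
# Default constructor values taken from the class signature.
max_sub_ensembles = 10
outer_lr_scaling = 0.5
inner_lr_scaling = 0.5

# Learning rate used by the boosting mechanism across batches.
outer_lr = 1.0 / (max_sub_ensembles ** outer_lr_scaling)

# Factor by which the base ensemble's own learning rate is scaled,
# relevant only if the base ensemble has a learning rate.
inner_factor = max_sub_ensembles ** inner_lr_scaling

print(f"outer learning rate: {outer_lr:.4f}")
print(f"inner scaling factor: {inner_factor:.4f}")
```

With the defaults, the outer learning rate is roughly 0.316 and the inner factor roughly 3.162. Setting either scaling exponent to 0 makes the corresponding value exactly 1.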

Attributes:

n_classes_: int: The number of classes
classes_: ndarray, shape (n_classes,): Set of unique classes
ensembles_: list: Trained sub-ensembles

build_ensemble(X, target, weights)

Build a new sub-ensemble and insert it into model

Parameters:

X: ndarray, shape (n_samples, n_features): Batch of training data.
target: ndarray, shape (n_samples,): Boosting target.
weights: ndarray, shape (n_samples,): Boosting weights.
first_batch: bool: Is this the first batch?

fit(X, y, sample_weight=None)

Fit the base ensemble on a batch of data.

Parameters:

X: ndarray, shape (n_samples, n_features): Training data.
y: ndarray, shape (n_samples,): Training labels.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.

Returns:

self: object: Returns an instance of self.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing: MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict: Parameter names mapped to their values.

partial_fit(X, y, sample_weight=None, classes=None)

Continue training the model with a new batch of data.

Parameters:

X: ndarray, shape (n_samples, n_features): Batch of training data.
y: ndarray, shape (n_samples,): Batch of training labels.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.
classes: ndarray, shape (n_classes,), default=None: Set of unique classes across the entire dataset. This argument is only required for the first call to partial_fit.

Returns:

self: object: Returns an instance of self.
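
The calling pattern described above, with classes supplied only on the first call, can be sketched as follows. This is a minimal sketch under stated assumptions: train_in_batches and RecordingModel are hypothetical names introduced for illustration, and RecordingModel is a stand-in that only records its arguments so the example runs without snapml installed.

```python
def train_in_batches(model, batches, classes):
    """Feed (X, y) batches to model.partial_fit, passing `classes` on the first call only."""
    first = True
    for X, y in batches:
        if first:
            # The full set of classes is only needed on the first call.
            model.partial_fit(X, y, classes=classes)
            first = False
        else:
            model.partial_fit(X, y)
    return model


class RecordingModel:
    """Hypothetical stand-in that records how partial_fit was called."""

    def __init__(self):
        self.calls = []

    def partial_fit(self, X, y, sample_weight=None, classes=None):
        # Store the batch size and the classes argument of each call.
        self.calls.append((len(X), classes))
        return self


# Three small batches of (features, labels); plain lists stand in for ndarrays.
batches = [
    ([[0.0, 1.0], [1.0, 0.0]], [0, 1]),
    ([[0.5, 0.5], [0.2, 0.8]], [1, 0]),
    ([[0.9, 0.1]], [1]),
]
model = train_in_batches(RecordingModel(), batches, classes=[0, 1])
print(model.calls)
```

With snapml installed, a BatchedTreeEnsembleClassifier instance would take the place of RecordingModel, and X, y and classes would be ndarrays as documented.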

predict(X)

Predict class labels

Parameters:

X: ndarray, shape=(n_samples, n_features): Samples to be used for prediction

Returns:

pred: ndarray, shape = (n_samples,): Predicted class labels

predict_proba(X)

Predict class probabilities

Parameters:

X: ndarray, shape=(n_samples, n_features): Samples to be used for prediction

Returns:

pred: ndarray, shape = (n_samples, n_classes): Predicted class probabilities

score(X, y, sample_weight=None)

Return accuracy on provided data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X: array-like of shape (n_samples, n_features): Test samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs): True labels for X.
sample_weight: array-like of shape (n_samples,), default=None: Sample weights.

Returns:

score: float: Mean accuracy of self.predict(X) w.r.t. y.
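
To make the remark about subset accuracy concrete, the sketch below compares it with a per-label accuracy on a tiny multi-label example. The helper names subset_accuracy and per_label_accuracy are illustrative and not part of snapml or scikit-learn; sample weights are ignored for simplicity.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label set is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Fraction of individual labels predicted correctly (for contrast)."""
    total = sum(len(t) for t in y_true)
    correct = sum(
        1 for t, p in zip(y_true, y_pred) for a, b in zip(t, p) if a == b
    )
    return correct / total


# Three samples with three binary labels each; one label of sample 2 is wrong.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]

subset = subset_accuracy(y_true, y_pred)
per_label = per_label_accuracy(y_true, y_pred)
print(subset, per_label)
```

A single wrong label costs the whole sample under subset accuracy (2 of 3 samples correct), while 8 of the 9 individual labels are correct, which is why the metric is described as harsh.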

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → BatchedTreeEnsembleClassifier

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in fit.

Returns:

self: object: The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params: dict: Estimator parameters.

Returns:

self: estimator instance: Estimator instance.

set_partial_fit_request(*, classes: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') → BatchedTreeEnsembleClassifier

Configure whether metadata should be requested to be passed to the partial_fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

classes: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for classes parameter in partial_fit.
sample_weight: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in partial_fit.

Returns:

self: object: The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → BatchedTreeEnsembleClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in score.

Returns:

self: object: The updated object.

train_on_batch(X, y, sample_weight=None)

Train on a new batch of data

Parameters:

X: ndarray, shape (n_samples, n_features): Batch of training data.
y: ndarray, shape (n_samples,): Batch of training labels.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.

Batched Tree Ensemble Regressor

class snapml.BatchedTreeEnsembleRegressor(base_ensemble=SnapBoostingMachineRegressor(), max_sub_ensembles=10, inner_lr_scaling=0.5, outer_lr_scaling=0.5)

This class enables batched training of a tree ensemble regressor on large datasets. Given a tree ensemble regressor, provided as a base ensemble, the algorithm will split the trees into a number of sub-ensembles. Each sub-ensemble is trained on a different batch of data, and the boosting mechanism is applied across batches to improve accuracy.

Parameters:

base_ensemble: {sklearn.ensemble.RandomForestRegressor, sklearn.ensemble.ExtraTreesRegressor, snapml.SnapRandomForestRegressor, snapml.SnapBoostingMachineRegressor, xgboost.XGBRegressor, lightgbm.LGBMRegressor}, default=snapml.SnapBoostingMachineRegressor: The base ensemble that will be split into sub-ensembles and used for batched training.
max_sub_ensembles: int, default=10: The maximum number of sub-ensembles to use. It is recommended to set this parameter roughly equal to the expected number of batches. If more batches are provided than the number of sub-ensembles, the last sub-ensemble will be replaced.
outer_lr_scaling: float, default=0.5: The boosting mechanism across batches will use learning rate 1.0/(max_sub_ensembles ** outer_lr_scaling)
inner_lr_scaling: float, default=0.5: If the base ensemble has a learning rate (e.g. it is a boosting machine), the learning rate will be scaled by a factor (max_sub_ensembles ** inner_lr_scaling)
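
The replacement rule stated for max_sub_ensembles (once all slots are used, the last sub-ensemble is replaced) can be pictured with a toy list. This is only one reading of that sentence, not snapml's actual implementation; insert_sub_ensemble is a hypothetical helper, and integers stand in for trained sub-ensembles.

```python
def insert_sub_ensemble(ensembles, new_ensemble, max_sub_ensembles):
    """Append while there is room; otherwise overwrite the last slot."""
    if len(ensembles) < max_sub_ensembles:
        ensembles.append(new_ensemble)
    else:
        # All slots are used: the most recently added sub-ensemble is replaced.
        ensembles[-1] = new_ensemble
    return ensembles


# Five batches arrive, but only three sub-ensembles are allowed.
ensembles = []
for batch_index in range(5):
    insert_sub_ensemble(ensembles, batch_index, max_sub_ensembles=3)

print(ensembles)
```

Under this reading, the sub-ensembles from the earliest batches are kept and only the final slot changes, which is one reason the documentation recommends setting max_sub_ensembles close to the expected number of batches.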

Attributes:

ensembles_: list: Trained sub-ensembles

build_ensemble(X, target, weights)

Build a new sub-ensemble and insert it into model

Parameters:

X: ndarray, shape (n_samples, n_features): Batch of training data.
target: ndarray, shape (n_samples,): Boosting target.
weights: ndarray, shape (n_samples,): Boosting weights.
first_batch: bool: Is this the first batch?

fit(X, y, sample_weight=None)

Fit the base ensemble on a batch of data.

Parameters:

X: ndarray, shape (n_samples, n_features): Training data.
y: ndarray, shape (n_samples,): Training regression targets.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.

Returns:

self: object: Returns an instance of self.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing: MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params: dict: Parameter names mapped to their values.

partial_fit(X, y, sample_weight=None)

Continue training the model with a new batch of data.

Parameters:

X: ndarray, shape (n_samples, n_features): Batch of training data.
y: ndarray, shape (n_samples,): Batch of training regression targets.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.

Returns:

self: object: Returns an instance of self.

predict(X)

Predict target values

Parameters:

X: ndarray, shape=(n_samples, n_features): Samples to be used for prediction

Returns:

pred: ndarray, shape = (n_samples,): Predicted target values

score(X, y, sample_weight=None)

Return coefficient of determination on test data.

The coefficient of determination, $R^2$, is defined as $(1 - \frac{u}{v})$, where $u$ is the residual sum of squares ((y_true - y_pred)** 2).sum() and $v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a $R^2$ score of 0.0.

Parameters:

X: array-like of shape (n_samples, n_features): Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y: array-like of shape (n_samples,) or (n_samples, n_outputs): True values for X.
sample_weight: array-like of shape (n_samples,), default=None: Sample weights.

Returns:

score: float: $R^2$ of self.predict(X) w.r.t. y.

Notes

The $R^2$ score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
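
The definition of $R^2$ given for score can be checked by hand with a few lines of plain Python. The sketch below follows the formula $1 - \frac{u}{v}$ for a single output without sample weights; the function name r2_manual is illustrative and the data values are made up for the example.

```python
def r2_manual(y_true, y_pred):
    """Compute 1 - u/v for a single output, without sample weights."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - u / v


y_true = [3.0, -0.5, 2.0, 7.0]

# A reasonably good set of predictions.
good = r2_manual(y_true, [2.5, 0.0, 2.0, 8.0])

# A constant model that always predicts the mean of y_true.
mean_value = sum(y_true) / len(y_true)
constant = r2_manual(y_true, [mean_value] * len(y_true))

# A poor constant model far from the data; the score becomes negative.
poor = r2_manual(y_true, [10.0] * len(y_true))

print(good, constant, poor)
```

The three cases mirror the text: a good fit scores close to 1.0, the mean-predicting constant model scores 0.0, and a sufficiently bad model scores below zero.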

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → BatchedTreeEnsembleRegressor

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in fit.

Returns:

self: object: The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params: dict: Estimator parameters.

Returns:

self: estimator instance: Estimator instance.

set_partial_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → BatchedTreeEnsembleRegressor

Configure whether metadata should be requested to be passed to the partial_fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to partial_fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to partial_fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in partial_fit.

Returns:

self: object: The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → BatchedTreeEnsembleRegressor

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in score.

Returns:

self: object: The updated object.

train_on_batch(X, y, sample_weight=None)

Train on a new batch of data

Parameters:

X: ndarray, shape (n_samples, n_features): Batch of training data.
y: ndarray, shape (n_samples,): Batch of training regression targets.
sample_weight: ndarray, shape (n_samples,), default=None: Sample weights to be applied during training.