Batched Tree Ensembles
Batched Tree Ensemble Classifier
- class snapml.BatchedTreeEnsembleClassifier(base_ensemble=SnapBoostingMachineClassifier(), max_sub_ensembles=10, inner_lr_scaling=0.5, outer_lr_scaling=0.5)
This class enables batched training of a tree ensemble classifier on large datasets. Given a tree ensemble classifier, provided as a base ensemble, the algorithm will split the trees into a number of sub-ensembles. Each sub-ensemble is trained on a different batch of data, and the boosting mechanism is applied across batches to improve accuracy.
- Parameters
- base_ensemble: {sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.ExtraTreesClassifier, snapml.SnapRandomForestClassifier, snapml.SnapBoostingMachineClassifier, xgboost.XGBClassifier, lightgbm.LGBMClassifier}, default=snapml.SnapBoostingMachineClassifier
The base ensemble that will be split into sub-ensembles and used for batched training.
- max_sub_ensembles: int, default=10
The maximum number of sub-ensembles to use. It is recommended to set this parameter roughly equal to the expected number of batches. If more batches are provided than the number of sub-ensembles, the last sub-ensemble will be replaced.
- outer_lr_scaling: float, default=0.5
The boosting mechanism across batches uses a learning rate of 1.0 / (max_sub_ensembles ** outer_lr_scaling).
- inner_lr_scaling: float, default=0.5
If the base ensemble has a learning rate (e.g. it is a boosting machine), that learning rate is scaled by a factor of (max_sub_ensembles ** inner_lr_scaling).
- Attributes
- n_classes_: int
The number of classes.
- classes_: ndarray, shape (n_classes,)
Set of unique classes.
- ensembles_: list
Trained sub-ensembles.
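The two scaling formulas given for outer_lr_scaling and inner_lr_scaling can be checked with a few lines of arithmetic. The sketch below is not part of snapml: the helper names and the base learning rate of 0.1 are hypothetical, and it assumes "scaled by a factor" means the base ensemble's learning rate is multiplied by that factor.

```python
# Hypothetical helpers that only reproduce the documented formulas;
# they are not functions provided by snapml.

def outer_learning_rate(max_sub_ensembles, outer_lr_scaling=0.5):
    # Learning rate used by the boosting mechanism across batches.
    return 1.0 / (max_sub_ensembles ** outer_lr_scaling)


def inner_learning_rate(base_lr, max_sub_ensembles, inner_lr_scaling=0.5):
    # Assumption: the base ensemble's learning rate is multiplied by the factor.
    return base_lr * (max_sub_ensembles ** inner_lr_scaling)


# Defaults from the class signature: max_sub_ensembles=10, both scalings 0.5.
print(round(outer_learning_rate(10), 4))       # 0.3162
# base_lr=0.1 is an illustrative value, not a documented default.
print(round(inner_learning_rate(0.1, 10), 4))  # 0.3162
```

With the defaults, each of the 10 sub-ensembles contributes with an outer learning rate of about 0.32, while a boosting base ensemble runs with its own learning rate increased by a factor of about 3.16.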
- build_ensemble(X, target, weights)
Build a new sub-ensemble and insert it into model
- Parameters
- X: ndarray, shape (n_samples, n_features)
Batch of training data.
- target: ndarray, shape (n_samples,)
Boosting target.
- weights: ndarray, shape (n_samples,)
Boosting weights.
- first_batch: bool
Is this the first batch?
- fit(X, y, sample_weight=None)
Fit the base ensemble on a batch of data.
- Parameters
- X: ndarray, shape (n_samples, n_features)
Training data.
- y: ndarray, shape (n_samples,)
Training labels.
- sample_weight: ndarray, shape (n_samples,), default=None
Sample weights to be applied during training.
- Returns
- self: object
Returns an instance of self.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- partial_fit(X, y, sample_weight=None, classes=None)
Continue training the model with a new batch of data.
- Parameters
- X: ndarray, shape (n_samples, n_features)
Batch of training data.
- y: ndarray, shape (n_samples,)
Batch of training labels.
- sample_weight: ndarray, shape (n_samples,), default=None
Sample weights to be applied during training.
- classes: ndarray, shape (n_classes,), default=None
Set of unique classes across the entire dataset. This argument is only required for the first call to partial_fit.
- Returns
- self: object
Returns an instance of self.
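The calling pattern described for partial_fit (pass classes on the first call only, then keep feeding batches) can be sketched as a small driver loop. The function name fit_in_batches and the batches iterable are hypothetical; the only API assumed is the partial_fit(X, y, sample_weight=None, classes=None) signature documented above.

```python
def fit_in_batches(model, batches, classes):
    """Feed (X, y) batches to any estimator exposing partial_fit.

    `model` is assumed to follow the partial_fit signature documented above;
    `batches` is any iterable of (X_batch, y_batch) pairs; `classes` holds the
    unique labels across the entire dataset.
    """
    first_call = True
    for X_batch, y_batch in batches:
        if first_call:
            # The full set of classes is only required on the first call.
            model.partial_fit(X_batch, y_batch, classes=classes)
            first_call = False
        else:
            model.partial_fit(X_batch, y_batch)
    return model
```

With snapml installed, model would typically be a BatchedTreeEnsembleClassifier whose max_sub_ensembles is set roughly equal to the number of batches, as recommended in the parameter description.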
- predict(X)
Predict class labels
- Parameters
- X: ndarray, shape (n_samples, n_features)
Samples to be used for prediction.
- Returns
- pred: ndarray, shape (n_samples,)
Predicted class labels
- predict_proba(X)
Predict class probabilities
- Parameters
- X: ndarray, shape (n_samples, n_features)
Samples to be used for prediction.
- Returns
- pred: ndarray, shape (n_samples, n_classes)
Predicted class probabilities
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters
- X: array-like of shape (n_samples, n_features)
Test samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs)
True labels for X.
- sample_weight: array-like of shape (n_samples,), default=None
Sample weights.
- Returns
- score: float
Mean accuracy of self.predict(X) wrt. y.
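As a rough illustration of what "mean accuracy" means here, the sketch below computes it by hand from true and predicted labels. The helper mean_accuracy is hypothetical, and treating sample_weight as the weights of a weighted average is an assumption about how the weights enter the metric.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    # Fraction of samples whose predicted label equals the true label.
    # Assumption: sample_weight acts as weights in a weighted average.
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = [float(t == p) for t, p in zip(y_true, y_pred)]
    weighted_hits = sum(w * c for w, c in zip(sample_weight, correct))
    return weighted_hits / sum(sample_weight)


# Three of the four predictions match the true labels.
print(mean_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```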
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
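The <component>__<parameter> form is the standard scikit-learn convention for nested objects. The sketch below shows it on a scikit-learn Pipeline, the nested object named in the description; the step names "scale" and "forest" are arbitrary choices made for this example. Whether the batched ensemble exposes its base_ensemble parameters under the same convention is not stated in this reference, so the example does not assume it.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names "scale" and "forest" are illustrative.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier()),
])

# <component>__<parameter>: update n_estimators on the "forest" step.
pipe.set_params(forest__n_estimators=50)

print(pipe.named_steps["forest"].n_estimators)  # 50
```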
- train_on_batch(X, y, sample_weight=None)
Train on a new batch of data
- Parameters
- X: ndarray, shape (n_samples, n_features)
Batch of training data.
- y: ndarray, shape (n_samples,)
Batch of training labels.
- sample_weight: ndarray, shape (n_samples,), default=None
Sample weights to be applied during training.
Batched Tree Ensemble Regressor
- class snapml.BatchedTreeEnsembleRegressor(base_ensemble=SnapBoostingMachineRegressor(), max_sub_ensembles=10, inner_lr_scaling=0.5, outer_lr_scaling=0.5)
This class enables batched training of a tree ensemble regressor on large datasets. Given a tree ensemble regressor, provided as a base ensemble, the algorithm will split the trees into a number of sub-ensembles. Each sub-ensemble is trained on a different batch of data, and the boosting mechanism is applied across batches to improve accuracy.
- Parameters
- base_ensemble: {sklearn.ensemble.RandomForestRegressor, sklearn.ensemble.ExtraTreesRegressor, snapml.SnapRandomForestRegressor, snapml.SnapBoostingMachineRegressor, xgboost.XGBRegressor, lightgbm.LGBMRegressor}, default=snapml.SnapBoostingMachineRegressor
The base ensemble that will be split into sub-ensembles and used for batched training.
- max_sub_ensembles: int, default=10
The maximum number of sub-ensembles to use. It is recommended to set this parameter roughly equal to the expected number of batches. If more batches are provided than the number of sub-ensembles, the last sub-ensemble will be replaced.
- outer_lr_scaling: float, default=0.5
The boosting mechanism across batches uses a learning rate of 1.0 / (max_sub_ensembles ** outer_lr_scaling).
- inner_lr_scaling: float, default=0.5
If the base ensemble has a learning rate (e.g. it is a boosting machine), that learning rate is scaled by a factor of (max_sub_ensembles ** inner_lr_scaling).
- Attributes
- ensembles_: list
Trained sub-ensembles.
- build_ensemble(X, target, weights)
Build a new sub-ensemble and insert it into model
- Parameters
- X: ndarray, shape (n_samples, n_features)
Batch of training data.
- target: ndarray, shape (n_samples,)
Boosting target.
- weights: ndarray, shape (n_samples,)
Boosting weights.
- first_batch: bool
Is this the first batch?
- fit(X, y, sample_weight=None)
Fit the base ensemble on a batch of data.
- Parameters
- X: ndarray, shape (n_samples, n_features)
Training data.
- y: ndarray, shape (n_samples,)
Training regression targets.
- sample_weight: ndarray, shape (n_samples,), default=None
Sample weights to be applied during training.
- Returns
- self: object
Returns an instance of self.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- partial_fit(X, y, sample_weight=None)
Continue training the model with a new batch of data.
- Parameters
- X: ndarray, shape (n_samples, n_features)
Batch of training data.
- y: ndarray, shape (n_samples,)
Batch of training regression targets.
- sample_weight: ndarray, shape (n_samples,), default=None
Sample weights to be applied during training.
- Returns
- self: object
Returns an instance of self.
- predict(X)
Predict target values
- Parameters
- X: ndarray, shape (n_samples, n_features)
Samples to be used for prediction.
- Returns
- pred: ndarray, shape (n_samples,)
Predicted target values
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.
- Parameters
- X: array-like of shape (n_samples, n_features)
Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs)
True values for X.
- sample_weight: array-like of shape (n_samples,), default=None
Sample weights.
- Returns
- score: float
\(R^2\) of self.predict(X) wrt. y.
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
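The definition of \(R^2\) given for score can be worked through by hand for the single-output, unweighted case. The helper r2_from_definition and the sample values are illustrative only and are not part of snapml.

```python
def r2_from_definition(y_true, y_pred):
    # u: residual sum of squares, v: total sum of squares, R^2 = 1 - u / v.
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - u / v


# Illustrative values: u = 1.5 and v = 29.1875.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(round(r2_from_definition(y_true, y_pred), 4))  # 0.9486

# A constant model predicting the mean of y_true scores exactly 0.0.
print(r2_from_definition(y_true, [2.875] * 4))  # 0.0
```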
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
- train_on_batch(X, y, sample_weight=None)
Train on a new batch of data
- Parameters
- X: ndarray, shape (n_samples, n_features)
Batch of training data.
- y: ndarray, shape (n_samples,)
Batch of training regression targets.
- sample_weight: ndarray, shape (n_samples,), default=None
Sample weights to be applied during training.