Boosting Machines
Boosting Machine Classifier
- class snapml.BoostingMachineClassifier(n_jobs=1, num_round=100, max_depth=None, min_max_depth=1, max_max_depth=5, early_stopping_rounds=10, random_state=0, base_score=None, learning_rate=0.1, verbose=False, compress_trees=False, class_weight=None, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_ids=[0], colsample_bytree=1.0, subsample=1.0, lambda_l2=0.0, tree_select_probability=1.0, regularizer=1.0, fit_intercept=False, gamma=1.0, n_components=10)
Boosting machine for binary and multi-class classification tasks.
A heterogeneous boosting machine that mixes binary decision trees (of stochastic max_depth) with linear models with random Fourier features (an approximation of kernel ridge regression). A usage sketch follows the parameter and attribute lists below.
- Parameters
- num_round : int, default=100
Number of boosting iterations.
- learning_rate : float, default=0.1
Learning rate / shrinkage factor.
- random_state : int, default=0
Random seed.
- colsample_bytree : float, default=1.0
Fraction of feature columns used at each boosting iteration.
- subsample : float, default=1.0
Fraction of training examples used at each boosting iteration.
- verbose : bool, default=False
Print information during training.
- lambda_l2 : float, default=0.0
L2-regularization penalty used during tree-building.
- early_stopping_rounds : int, default=10
When a validation set is provided, training will stop if the validation loss does not decrease after a fixed number of rounds.
- compress_trees : bool, default=False
Compress trees after training for fast inference.
- base_score : float, default=None
Base score used to initialize the boosting algorithm. If None, the base score is set to the average target (regression), the logit of the probability of the positive class (binary classification), or zero (multiclass classification).
- class_weight : {‘balanced’, None}, default=None
If set to ‘balanced’, sample weights are applied to account for class imbalance; otherwise no sample weights are used.
- max_depth : int, default=None
If set, sets min_max_depth = max_max_depth = max_depth (all trees have the same depth).
- min_max_depth : int, default=1
Minimum max_depth of trees in the ensemble.
- max_max_depth : int, default=5
Maximum max_depth of trees in the ensemble.
- n_jobs : int, default=1
Number of threads to use during training.
- use_histograms : bool, default=True
Use histograms to accelerate tree-building.
- hist_nbins : int, default=256
Number of histogram bins.
- use_gpu : bool, default=False
Use GPUs for tree-building.
- gpu_ids : array-like of int, default=[0]
Device IDs of the GPUs to use when GPU acceleration is enabled.
- tree_select_probability : float, default=1.0
Probability of selecting a tree (rather than a kernel ridge regressor) at each boosting iteration.
- regularizer : float, default=1.0
L2-regularization penalty for the kernel ridge regressor.
- fit_intercept : bool, default=False
Include an intercept term in the kernel ridge regressor.
- gamma : float, default=1.0
Gaussian kernel parameter.
- n_components : int, default=10
Number of components in the random projection.
- Attributes
- feature_importances_ : array-like, shape=(n_features,)
Feature importances computed across trees.
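A minimal usage sketch, assuming a synthetic dataset generated with scikit-learn (the dataset, the float32 conversion, and the parameter choices are illustrative, not prescriptive):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from snapml import BoostingMachineClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Snap ML operates on dense ndarrays; float32 is assumed here for efficiency
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)

clf = BoostingMachineClassifier(num_round=100, learning_rate=0.1, n_jobs=4)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)         # predicted class labels, shape (n_samples,)
proba = clf.predict_proba(X_test)  # class probabilities, shape (n_samples, 2)
print("test accuracy:", clf.score(X_test, y_test))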
- apply(X)
Map batch of examples to leaf indices and labels.
- Parameters
- X : dense matrix (ndarray)
Batch of examples.
- Returns
- indices : array-like, shape = (n_samples, num_round) or (n_samples, num_round, num_classes)
The leaf indices. The output is 2-dimensional for binary classification and 3-dimensional for multiclass classification.
- labels : array-like, shape = (n_samples, num_round) or (n_samples, num_round, num_classes)
The leaf labels. The output is 2-dimensional for binary classification and 3-dimensional for multiclass classification.
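For example, on the fitted binary classifier clf from the usage sketch above (trained with num_round=100; it is assumed here that the two return values can be unpacked as a tuple):

# Map each test example to a leaf in each of the 100 boosted trees
indices, labels = clf.apply(X_test)
print(indices.shape)  # (n_samples, 100): leaf index per example and boosting round
print(labels.shape)   # (n_samples, 100): leaf label per example and boosting round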
- export_model(output_file, output_type='pmml')
Export a model trained in Snap ML to the given output file, using a format of the given type.
Currently only PMML is supported as an export format; the corresponding output file type to provide to export_model is ‘pmml’.
- Parameters
- output_file : str
Output filename.
- output_type : {‘pmml’}
Output file type.
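A one-line sketch on a fitted model (the filename is illustrative):

# Export a fitted model to PMML (filename is illustrative)
clf.export_model("boosting_machine.pmml", output_type="pmml")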
- fit(X, y, sample_weight=None, X_val=None, y_val=None, sample_weight_val=None, aggregate_importances=True)
Fit the model according to the given train data.
- Parameters
- X : dense matrix (ndarray)
Train dataset.
- y : array-like, shape = (n_samples,)
The target vector corresponding to X.
- sample_weight : array-like, shape = (n_samples,)
Training sample weights.
- X_val : dense matrix (ndarray)
Validation dataset.
- y_val : array-like, shape = (n_samples,)
The target vector corresponding to X_val.
- sample_weight_val : array-like, shape = (n_samples,)
Validation sample weights.
- aggregate_importances : bool, default=True
Aggregate feature importances over boosting rounds.
- Returns
- self : object
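For instance, continuing from the usage sketch above, early stopping against a held-out validation set might look like this (the split and parameter values are illustrative):

from sklearn.model_selection import train_test_split

# Hold out a validation set so early_stopping_rounds can take effect
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, random_state=0)

clf = BoostingMachineClassifier(num_round=500, early_stopping_rounds=10)
clf.fit(X_tr, y_tr, X_val=X_val, y_val=y_val)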
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params : dict
Parameter names mapped to their values.
- import_model(input_file, input_type, tree_format='auto', X=None)
Import a pre-trained boosted ensemble model and optimize the trees for fast inference.
Supported import formats include PMML, ONNX, XGBoost JSON, and LightGBM text. The corresponding input file types to be provided to the import_model function are ‘pmml’, ‘onnx’, ‘xgb_json’, and ‘lightgbm’, respectively.
Depending on how the tree_format argument is set, this function will return a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.
If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.
If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.
By default, tree_format is set to ‘auto’: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.
Note: If the input file contains features that are not supported by the import function, then an exception is thrown indicating the feature and the line number within the input file containing the feature.
- Parameters
- input_file : str
Input filename.
- input_type : {‘pmml’, ‘onnx’, ‘xgb_json’, ‘lightgbm’}
Input file type.
- tree_format : {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}
Tree format.
- X : dense matrix (ndarray)
Optional input dataset used for compressing trees.
- Returns
- self : object
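For example, importing a model previously saved by XGBoost as JSON (the filename is illustrative; the file itself must come from XGBoost's save_model):

from snapml import BoostingMachineClassifier

clf = BoostingMachineClassifier()
# Import a pre-trained XGBoost model saved as JSON (filename is illustrative)
clf.import_model("xgb_model.json", input_type="xgb_json", tree_format="auto")
# The selected optimized format: ‘compress_trees’ or ‘zdnn_tensors’
print(clf.booster_.optimized_tree_format_)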
- optimize_trees(tree_format='auto', X=None)
Optimize the trees in the ensemble for fast inference.
Depending on how the tree_format argument is set, this function will return a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.
If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.
If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.
By default, tree_format is set to ‘auto’: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.
- Parameters
- tree_format : {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}
Tree format.
- X : dense matrix (ndarray)
Optional input dataset used for compressing trees.
- Returns
- self : object
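A sketch on a fitted model; passing a representative batch of examples to guide node clustering is optional, and the batch used here is illustrative:

# Re-optimize the trees for CPU inference using compressed decision trees
clf.optimize_trees(tree_format="compress_trees", X=X_train[:1000])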
- predict(X, n_jobs=None)
Predict class labels
- Parameters
- X : dense matrix (ndarray)
Dataset used for predicting class estimates.
- n_jobs : int
Number of threads to use for prediction.
- Returns
- pred : array-like, shape = (n_samples,)
Returns the predicted class labels.
- predict_proba(X, n_jobs=None)
Predict class label probabilities
- Parameters
- X : dense matrix (ndarray)
Dataset used for predicting class estimates.
- n_jobs : int
Number of threads to use for prediction.
- Returns
- proba : array-like, shape = (n_samples, 2)
Returns the predicted class probabilities.
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for each sample.
- Parameters
- X : array-like of shape (n_samples, n_features)
Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
True labels for X.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
- Returns
- score : float
Mean accuracy of self.predict(X) w.r.t. y.
- set_params(**params)
Set the parameters of this model.
Valid parameter keys can be listed with get_params().
- Returns
- self
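Both methods follow the scikit-learn estimator conventions, e.g.:

clf = BoostingMachineClassifier()
print(clf.get_params()["learning_rate"])   # 0.1 (the default)
clf.set_params(learning_rate=0.05, num_round=200)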
Boosting Machine Regressor
- class snapml.BoostingMachineRegressor(n_jobs=1, num_round=100, objective='mse', max_depth=None, min_max_depth=1, max_max_depth=5, early_stopping_rounds=10, random_state=0, base_score=None, learning_rate=0.1, verbose=False, compress_trees=False, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_id=0, colsample_bytree=1.0, subsample=1.0, lambda_l2=0.0, max_delta_step=0.0, tree_select_probability=1.0, regularizer=1.0, fit_intercept=False, gamma=1.0, n_components=10)
Boosting machine for regression tasks.
A heterogeneous boosting machine that mixes binary decision trees (of stochastic max_depth) with linear models with random Fourier features (an approximation of kernel ridge regression). A usage sketch follows the parameter and attribute lists below.
- Parameters
- num_round : int, default=100
Number of boosting iterations.
- objective : {‘mse’, ‘cross_entropy’, ‘poisson’}, default=’mse’
Training objective.
- learning_rate : float, default=0.1
Learning rate / shrinkage factor.
- random_state : int, default=0
Random seed.
- colsample_bytree : float, default=1.0
Fraction of feature columns used at each boosting iteration.
- subsample : float, default=1.0
Fraction of training examples used at each boosting iteration.
- verbose : bool, default=False
Print information during training.
- lambda_l2 : float, default=0.0
L2-regularization penalty used during tree-building.
- max_delta_step : float, default=0.0
Regularization term to ensure numerical stability when objective is ‘poisson’.
- early_stopping_rounds : int, default=10
When a validation set is provided, training will stop if the validation loss does not decrease after a fixed number of rounds.
- compress_trees : bool, default=False
Compress trees after training for fast inference.
- base_score : float, default=None
Base score used to initialize the boosting algorithm. If None, the base score is set to the average target (regression) or the logit of the probability of the positive class (binary classification).
- max_depth : int, default=None
If set, sets min_max_depth = max_max_depth = max_depth (all trees have the same depth).
- min_max_depth : int, default=1
Minimum max_depth of trees in the ensemble.
- max_max_depth : int, default=5
Maximum max_depth of trees in the ensemble.
- n_jobs : int, default=1
Number of threads to use during training.
- use_histograms : bool, default=True
Use histograms to accelerate tree-building.
- hist_nbins : int, default=256
Number of histogram bins.
- use_gpu : bool, default=False
Use a GPU for tree-building.
- gpu_id : int, default=0
Device ID of the GPU to use during training.
- tree_select_probability : float, default=1.0
Probability of selecting a tree (rather than a kernel ridge regressor) at each boosting iteration.
- regularizer : float, default=1.0
L2-regularization penalty for the kernel ridge regressor.
- fit_intercept : bool, default=False
Include an intercept term in the kernel ridge regressor.
- gamma : float, default=1.0
Gaussian kernel parameter.
- n_components : int, default=10
Number of components in the random projection.
- Attributes
- feature_importances_ : array-like, shape=(n_features,)
Feature importances computed across trees.
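A minimal usage sketch, assuming a synthetic dataset generated with scikit-learn (dataset and parameter choices are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from snapml import BoostingMachineRegressor

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X = X.astype(np.float32)  # dense float32 ndarray assumed for efficiency

reg = BoostingMachineRegressor(num_round=100, objective="mse", n_jobs=4)
reg.fit(X, y)
pred = reg.predict(X)
print("R^2 on training data:", reg.score(X, y))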
- apply(X)
Map batch of examples to leaf indices and labels.
- Parameters
- X : dense matrix (ndarray)
Batch of examples.
- Returns
- indices : array-like, shape = (n_samples, num_round)
The leaf indices.
- labels : array-like, shape = (n_samples, num_round)
The leaf labels.
- export_model(output_file, output_type='pmml')
Export a model trained in Snap ML to the given output file, using a format of the given type.
Currently only PMML is supported as an export format; the corresponding output file type to provide to export_model is ‘pmml’.
- Parameters
- output_file : str
Output filename.
- output_type : {‘pmml’}
Output file type.
- fit(X, y, sample_weight=None, X_val=None, y_val=None, sample_weight_val=None, aggregate_importances=True)
Fit the model according to the given train data.
- Parameters
- X : dense matrix (ndarray)
Train dataset.
- y : array-like, shape = (n_samples,)
The target vector corresponding to X.
- sample_weight : array-like, shape = (n_samples,)
Training sample weights.
- X_val : dense matrix (ndarray)
Validation dataset.
- y_val : array-like, shape = (n_samples,)
The target vector corresponding to X_val.
- sample_weight_val : array-like, shape = (n_samples,)
Validation sample weights.
- aggregate_importances : bool, default=True
Aggregate feature importances over boosting rounds.
- Returns
- self : object
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params : dict
Parameter names mapped to their values.
- import_model(input_file, input_type, tree_format='auto', X=None)
Import a pre-trained boosted ensemble model and optimize the trees for fast inference.
Supported import formats include PMML, ONNX, XGBoost JSON, and LightGBM text. The corresponding input file types to be provided to the import_model function are ‘pmml’, ‘onnx’, ‘xgb_json’, and ‘lightgbm’, respectively.
Depending on how the tree_format argument is set, this function will return a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.
If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.
If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.
By default, tree_format is set to ‘auto’: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.
Note: If the input file contains features that are not supported by the import function, then an exception is thrown indicating the feature and the line number within the input file containing the feature.
- Parameters
- input_file : str
Input filename.
- input_type : {‘pmml’, ‘onnx’, ‘xgb_json’, ‘lightgbm’}
Input file type.
- tree_format : {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}
Tree format.
- X : dense matrix (ndarray)
Optional input dataset used for compressing trees.
- Returns
- self : object
- optimize_trees(tree_format='auto', X=None)
Optimize the trees in the ensemble for fast inference.
Depending on how the tree_format argument is set, this function will return a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.
If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.
If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.
By default, tree_format is set to ‘auto’: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.
- Parameters
- tree_format : {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}
Tree format.
- X : dense matrix (ndarray)
Optional input dataset used for compressing trees.
- Returns
- self : object
- predict(X, n_jobs=None)
Predict estimates
- Parameters
- X : dense matrix (ndarray)
Dataset used for prediction.
- n_jobs : int
Number of threads to use for prediction.
- Returns
- pred : array-like, shape = (n_samples,)
Returns the predictions.
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \(1 - \frac{u}{v}\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
- Parameters
- X : array-like of shape (n_samples, n_features)
Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead, with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
True values for X.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
- Returns
- score : float
\(R^2\) of self.predict(X) w.r.t. y.
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput=’uniform_average’ from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
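For reference, the returned value is the same quantity computed by scikit-learn's r2_score, as this sketch on the fitted regressor reg from the usage example above illustrates:

import numpy as np
from sklearn.metrics import r2_score

# score(X, y) computes R^2 of the predictions with respect to y
assert np.isclose(reg.score(X, y), r2_score(y, reg.predict(X)))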
- set_params(**params)
Set the parameters of this model.
Valid parameter keys can be listed with get_params().
- Returns
- self