Boosting Machines

Boosting Machine Classifier

class snapml.BoostingMachineClassifier(n_jobs=1, num_round=100, max_depth=None, min_max_depth=1, max_max_depth=5, early_stopping_rounds=10, random_state=0, base_score=None, learning_rate=0.1, verbose=False, compress_trees=False, class_weight=None, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_ids=[0], colsample_bytree=1.0, subsample=1.0, lambda_l2=0.0, tree_select_probability=1.0, regularizer=1.0, fit_intercept=False, gamma=1.0, n_components=10)

Boosting machine for binary and multi-class classification tasks.

A heterogeneous boosting machine that mixes binary decision trees (of stochastic max_depth) with linear models on random Fourier features (an approximation of kernel ridge regression).
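The mixing of base learners can be pictured with a small sketch. It is not the snapml implementation: it assumes that a tree is chosen with probability tree_select_probability and that its max_depth is drawn uniformly between min_max_depth and max_max_depth (the text above only says the depth is stochastic). The function name sample_base_learners is hypothetical.

```python
import random


def sample_base_learners(num_round, tree_select_probability=1.0,
                         min_max_depth=1, max_max_depth=5, seed=0):
    """Return one (kind, max_depth) pair per boosting round.

    Assumption: max_depth is drawn uniformly from
    [min_max_depth, max_max_depth]; the real sampling scheme may differ.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(num_round):
        if rng.random() < tree_select_probability:
            # Decision tree with a randomly chosen depth limit.
            schedule.append(("tree", rng.randint(min_max_depth, max_max_depth)))
        else:
            # Linear model on random Fourier features
            # (approximate kernel ridge regression); no depth applies.
            schedule.append(("kernel_ridge", None))
    return schedule


schedule = sample_base_learners(num_round=10, tree_select_probability=0.8)
for kind, depth in schedule:
    print(kind, depth)
```

With the default tree_select_probability=1.0 every round in this sketch yields a tree, so the ensemble reduces to ordinary gradient-boosted trees of varying depth.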

Parameters
num_round : int, default=100

Number of boosting iterations.

learning_rate : float, default=0.1

Learning rate / shrinkage factor.

random_state : int, default=0

Random seed.

colsample_bytree : float, default=1.0

Fraction of feature columns used at each boosting iteration.

subsample : float, default=1.0

Fraction of training examples used at each boosting iteration.

verbose : bool, default=False

Print information during training.

lambda_l2 : float, default=0.0

L2-regularization penalty used during tree-building.

early_stopping_rounds : int, default=10

When a validation set is provided, training stops if the validation loss has not decreased after this number of rounds.

compress_trees : bool, default=False

Compress trees after training for fast inference.

base_score : float, default=None

Base score used to initialize the boosting algorithm. If None, the algorithm initializes the base score to the average target (regression), the logit of the probability of the positive class (binary classification), or zero (multiclass classification).

class_weight : {‘balanced’, None}, default=None

If set to ‘balanced’, sample weights are applied to account for class imbalance; otherwise no sample weights are used.

max_depth : int, default=None

If set, both min_max_depth and max_max_depth are set to max_depth.

min_max_depth : int, default=1

Minimum max_depth of trees in the ensemble.

max_max_depth : int, default=5

Maximum max_depth of trees in the ensemble.

n_jobs : int, default=1

Number of threads to use during training.

use_histograms : bool, default=True

Use histograms to accelerate tree-building.

hist_nbins : int, default=256

Number of histogram bins.

use_gpu : bool, default=False

Use GPU for tree-building.

gpu_ids : array-like of int, default=[0]

Device IDs of the GPUs used when GPU acceleration is enabled.

tree_select_probability : float, default=1.0

Probability of selecting a tree (rather than a kernel ridge regressor) at each boosting iteration.

regularizer : float, default=1.0

L2-regularization penalty for the kernel ridge regressor.

fit_intercept : bool, default=False

Include intercept term in the kernel ridge regressor.

gamma : float, default=1.0

Gaussian kernel parameter.

n_components : int, default=10

Number of components in the random projection.
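To make the roles of gamma and n_components more concrete, here is a minimal pure-Python sketch of random Fourier features for a Gaussian kernel. It assumes the common convention k(x, y) = exp(-gamma * ||x - y||^2), for which the projection weights are drawn from a normal distribution with variance 2 * gamma; whether snapml uses exactly this parameterization is an assumption. The helper names draw_projection and random_fourier_features are hypothetical.

```python
import math
import random


def draw_projection(n_features, n_components, gamma, seed=0):
    """Draw the random weights and phase offsets of the feature map."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)  # assumed convention: w ~ N(0, 2 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(n_features)]
               for _ in range(n_components)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    return weights, offsets


def random_fourier_features(x, weights, offsets):
    """Map x to n_components features z(x) with z(x) . z(y) ~ k(x, y)."""
    scale = math.sqrt(2.0 / len(weights))
    return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(weights, offsets)]


gamma = 1.0
x, y = [0.1, 0.2, 0.3], [0.0, 0.4, 0.1]
weights, offsets = draw_projection(n_features=3, n_components=2000, gamma=gamma)
zx = random_fourier_features(x, weights, offsets)
zy = random_fourier_features(y, weights, offsets)

approx = sum(a * b for a, b in zip(zx, zy))
exact = math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, y)))
print("approximate kernel:", approx)
print("exact kernel:      ", exact)
```

A larger n_components gives a closer approximation of the kernel at the cost of a wider linear model; the default of 10 is a coarse approximation.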

Attributes
feature_importances_ : array-like, shape=(n_features,)

Feature importances computed across trees.
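A minimal usage sketch, using only the constructor arguments, methods and attribute documented on this page. The wrapper function train_classifier and the chosen hyperparameter values are illustrative assumptions; the inputs are assumed to be dense NumPy arrays.

```python
def train_classifier(X_train, y_train, X_test):
    """Fit a BoostingMachineClassifier and return labels and importances.

    X_train and X_test are assumed to be dense ndarrays and y_train a
    1-D array of class labels.
    """
    # Imported inside the function so the sketch can be loaded
    # on machines where snapml is not installed.
    from snapml import BoostingMachineClassifier

    clf = BoostingMachineClassifier(
        num_round=50,        # number of boosting iterations
        learning_rate=0.1,
        min_max_depth=2,     # shallowest depth limit a tree may get
        max_max_depth=6,     # deepest depth limit a tree may get
        random_state=42,
    )
    clf.fit(X_train, y_train)
    labels = clf.predict(X_test)
    # One importance value per input feature.
    return labels, clf.feature_importances_
```

The returned importances have one entry per column of X_train, matching the shape documented above.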

apply(X)

Map batch of examples to leaf indices and labels.

Parameters
X : dense matrix (ndarray)

Batch of examples.

Returns
indices : array-like, shape = (n_samples, num_round) or (n_samples, num_round, num_classes)

The leaf indices. Output is 2-dim for binary classification. Output is 3-dim for multiclass classification.

labels : array-like, shape = (n_samples, num_round) or (n_samples, num_round, num_classes)

The leaf labels. Output is 2-dim for binary classification. Output is 3-dim for multiclass classification.

export_model(output_file, output_type='pmml')

Export a model trained in snapml to the given output file in the given format.

Currently, PMML is the only supported export format. The corresponding output_type value is ‘pmml’.

Parameters
output_file : str

Output filename.

output_type : {‘pmml’}

Output file type.

fit(X, y, sample_weight=None, X_val=None, y_val=None, sample_weight_val=None, aggregate_importances=True)

Fit the model according to the given training data.

Parameters
X : dense matrix (ndarray)

Training dataset.

y : array-like, shape = (n_samples,)

The target vector corresponding to X.

sample_weight : array-like, shape = (n_samples,)

Training sample weights.

X_val : dense matrix (ndarray)

Validation dataset.

y_val : array-like, shape = (n_samples,)

The target vector corresponding to X_val.

sample_weight_val : array-like, shape = (n_samples,)

Validation sample weights.

aggregate_importances : bool, default=True

Aggregate feature importances over boosting rounds.
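As a rough sketch of how the validation arguments combine with early_stopping_rounds, the following wrapper passes a held-out set to fit. The function name fit_with_early_stopping and the value num_round=500 are illustrative assumptions; only arguments documented on this page are used.

```python
def fit_with_early_stopping(X_train, y_train, X_val, y_val, train_weights=None):
    """Fit a classifier that monitors the loss on a validation set.

    All X arguments are assumed to be dense ndarrays.
    """
    # Imported inside the function so the sketch can be loaded
    # on machines where snapml is not installed.
    from snapml import BoostingMachineClassifier

    # A generous round budget; early stopping may end training sooner.
    clf = BoostingMachineClassifier(num_round=500, early_stopping_rounds=10)
    clf.fit(
        X_train,
        y_train,
        sample_weight=train_weights,  # optional per-example training weights
        X_val=X_val,                  # validation data enables early stopping
        y_val=y_val,
    )
    return clf
```

Without X_val and y_val there is no validation loss to monitor, so early_stopping_rounds has nothing to act on.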

Returns
self : object

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

import_model(input_file, input_type, tree_format='auto', X=None)

Import a pre-trained boosted ensemble model and optimize the trees for fast inference.

Supported import formats include PMML, ONNX, XGBoost JSON, and LightGBM text. The corresponding input file types to be provided to the import_model function are ‘pmml’, ‘onnx’, ‘xgb_json’, and ‘lightgbm’, respectively.

Depending on how the tree_format argument is set, this function will return a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.

If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.

If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.

By default, tree_format is set to ‘auto’. A check is performed: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.

Note: If the input file contains features that are not supported by the import function, an exception is raised indicating the feature and the line number within the input file containing the feature.

Parameters
input_file : str

Input filename.

input_type : {‘pmml’, ‘onnx’, ‘xgb_json’, ‘lightgbm’}

Input file type.

tree_format : {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}

Tree format.

X : dense matrix (ndarray)

Optional input dataset used for compressing trees.
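A short sketch of importing an externally trained model, based on the argument values listed above. The wrapper name load_xgboost_model is hypothetical, and the path is assumed to point to a model saved by XGBoost in its JSON format.

```python
def load_xgboost_model(path, X_sample=None):
    """Import an XGBoost JSON model and optimize it for CPU inference.

    X_sample is an optional dense ndarray used while compressing the trees.
    """
    # Imported inside the function so the sketch can be loaded
    # on machines where snapml is not installed.
    from snapml import BoostingMachineClassifier

    clf = BoostingMachineClassifier()
    clf.import_model(path, "xgb_json", tree_format="compress_trees", X=X_sample)
    # The format that was actually selected is recorded on the booster.
    return clf, clf.booster_.optimized_tree_format_
```

Because unsupported features in the input file cause an exception, callers that load untrusted or unusual model files may want to wrap the call in a try/except block.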

Returns
self : object

optimize_trees(tree_format='auto', X=None)

Optimize the trees in the ensemble for fast inference.

Depending on how the tree_format argument is set, this function will return a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.

If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.

If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.

By default, tree_format is set to ‘auto’. A check is performed: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.

Parameters
tree_format : {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}

Tree format.

X : dense matrix (ndarray)

Optional input dataset used for compressing trees.
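The way ‘auto’ resolves to a concrete format can be summarized in a few lines. This is a sketch of the decision rule described above, not snapml code: the function resolve_tree_format and its accelerator_available flag are hypothetical stand-ins for the library's internal hardware check.

```python
VALID_TREE_FORMATS = ("auto", "compress_trees", "zdnn_tensors")


def resolve_tree_format(tree_format="auto", accelerator_available=False):
    """Return the concrete tree format that a given request resolves to."""
    if tree_format not in VALID_TREE_FORMATS:
        raise ValueError(f"unknown tree_format: {tree_format!r}")
    if tree_format != "auto":
        # An explicit choice is used as given.
        return tree_format
    # 'auto': prefer the accelerator format when the hardware is present.
    return "zdnn_tensors" if accelerator_available else "compress_trees"


print(resolve_tree_format("auto", accelerator_available=False))
print(resolve_tree_format("auto", accelerator_available=True))
```

On a machine without the accelerator, ‘auto’ and ‘compress_trees’ therefore lead to the same CPU inference path.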

Returns
self : object

predict(X, n_jobs=None)

Predict class labels.

Parameters
X : dense matrix (ndarray)

Dataset used for predicting class estimates.

n_jobs : int

Number of threads to use for prediction.

Returns
pred : array-like, shape = (n_samples,)

The predicted class labels.

predict_proba(X, n_jobs=None)

Predict class label probabilities.

Parameters
X : dense matrix (ndarray)

Dataset used for predicting class estimates.

n_jobs : int

Number of threads to use for prediction.

Returns
proba : array-like, shape = (n_samples, 2)

The predicted class probabilities.

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

Mean accuracy of self.predict(X) with respect to y.
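For reference, mean accuracy with optional sample weights can be written out in a few lines of plain Python. This is a sketch of the metric itself, assuming the usual weighted-average definition; mean_accuracy is a hypothetical helper, not part of snapml.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of examples whose predicted label is correct."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    # Sum the weights of the correctly classified examples.
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


print(mean_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

With sample weights, a misclassified example lowers the score in proportion to its weight rather than counting as one of n equal errors.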

set_params(**params)

Set the parameters of this model.

Valid parameter keys can be listed with get_params().

Returns
self

Boosting Machine Regressor

class snapml.BoostingMachineRegressor(n_jobs=1, num_round=100, objective='mse', max_depth=None, min_max_depth=1, max_max_depth=5, early_stopping_rounds=10, random_state=0, base_score=None, learning_rate=0.1, verbose=False, compress_trees=False, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_id=0, colsample_bytree=1.0, subsample=1.0, lambda_l2=0.0, tree_select_probability=1.0, regularizer=1.0, fit_intercept=False, gamma=1.0, n_components=10)

Boosting machine for regression tasks.

A heterogeneous boosting machine that mixes binary decision trees (of stochastic max_depth) with linear models on random Fourier features (an approximation of kernel ridge regression).

Parameters
num_round : int, default=100

Number of boosting iterations.

objective : {‘mse’, ‘cross_entropy’}, default=‘mse’

Training objective.

learning_rate : float, default=0.1

Learning rate / shrinkage factor.

random_state : int, default=0

Random seed.

colsample_bytree : float, default=1.0

Fraction of feature columns used at each boosting iteration.

subsample : float, default=1.0

Fraction of training examples used at each boosting iteration.

verbose : bool, default=False

Print information during training.

lambda_l2 : float, default=0.0

L2-regularization penalty used during tree-building.

early_stopping_rounds : int, default=10

When a validation set is provided, training stops if the validation loss has not decreased after this number of rounds.

compress_trees : bool, default=False

Compress trees after training for fast inference.

base_score : float, default=None

Base score used to initialize the boosting algorithm. If None, the algorithm initializes the base score to the average target (regression) or the logit of the probability of the positive class (binary classification).

max_depth : int, default=None

If set, both min_max_depth and max_max_depth are set to max_depth.

min_max_depth : int, default=1

Minimum max_depth of trees in the ensemble.

max_max_depth : int, default=5

Maximum max_depth of trees in the ensemble.

n_jobs : int, default=1

Number of threads to use during training.

use_histograms : bool, default=True

Use histograms to accelerate tree-building.

hist_nbins : int, default=256

Number of histogram bins.

use_gpu : bool, default=False

Use GPU for tree-building.

gpu_id : int, default=0

Device ID of the GPU to use during training.

tree_select_probability : float, default=1.0

Probability of selecting a tree (rather than a kernel ridge regressor) at each boosting iteration.

regularizer : float, default=1.0

L2-regularization penalty for the kernel ridge regressor.

fit_intercept : bool, default=False

Include intercept term in the kernel ridge regressor.

gamma : float, default=1.0

Gaussian kernel parameter.

n_components : int, default=10

Number of components in the random projection.
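The early_stopping_rounds parameter in the list above can be illustrated with a small patience loop. This sketch assumes that patience is counted from the best validation loss seen so far, which is a common convention; the exact stopping criterion used by snapml may differ. The function rounds_before_stop is hypothetical.

```python
def rounds_before_stop(val_losses, early_stopping_rounds=10):
    """Return how many boosting rounds run before early stopping triggers.

    val_losses holds the validation loss observed after each round.
    """
    best_loss = float("inf")
    rounds_since_best = 0
    for round_idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss = loss
            rounds_since_best = 0
        else:
            rounds_since_best += 1
            if rounds_since_best >= early_stopping_rounds:
                return round_idx
    # The loss kept improving often enough; every round was used.
    return len(val_losses)


losses = [1.0, 0.9, 0.8, 0.85, 0.84, 0.83, 0.7]
print(rounds_before_stop(losses, early_stopping_rounds=3))  # 6
```

In this example the later improvement to 0.7 is never reached with a patience of 3, which shows the trade-off: a small value saves training time but can stop before a late improvement.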

Attributes
feature_importances_ : array-like, shape=(n_features,)

Feature importances computed across trees.
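A minimal regression sketch using only arguments and methods documented on this page. The wrapper train_regressor and the hyperparameter values are illustrative assumptions; inputs are assumed to be dense NumPy arrays.

```python
def train_regressor(X_train, y_train, X_test):
    """Fit a BoostingMachineRegressor and return test predictions."""
    # Imported inside the function so the sketch can be loaded
    # on machines where snapml is not installed.
    from snapml import BoostingMachineRegressor

    reg = BoostingMachineRegressor(
        num_round=100,
        objective="mse",
        tree_select_probability=0.9,  # about 1 in 10 rounds uses a kernel ridge regressor
        gamma=1.0,                    # Gaussian kernel parameter
        n_components=10,              # size of the random projection
    )
    reg.fit(X_train, y_train)
    return reg.predict(X_test)
```

Setting tree_select_probability below 1.0 is what makes the ensemble heterogeneous; at the default of 1.0 the kernel-related parameters have no effect because no kernel ridge regressor is ever selected.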

apply(X)

Map batch of examples to leaf indices and labels.

Parameters
X : dense matrix (ndarray)

Batch of examples.

Returns
indices : array-like, shape = (n_samples, num_round)

The leaf indices.

labels : array-like, shape = (n_samples, num_round)

The leaf labels.

export_model(output_file, output_type='pmml')

Export a model trained in snapml to the given output file in the given format.

Currently, PMML is the only supported export format. The corresponding output_type value is ‘pmml’.

Parameters
output_file : str

Output filename.

output_type : {‘pmml’}

Output file type.

fit(X, y, sample_weight=None, X_val=None, y_val=None, sample_weight_val=None, aggregate_importances=True)

Fit the model according to the given training data.

Parameters
X : dense matrix (ndarray)

Training dataset.

y : array-like, shape = (n_samples,)

The target vector corresponding to X.

sample_weight : array-like, shape = (n_samples,)

Training sample weights.

X_val : dense matrix (ndarray)

Validation dataset.

y_val : array-like, shape = (n_samples,)

The target vector corresponding to X_val.

sample_weight_val : array-like, shape = (n_samples,)

Validation sample weights.

aggregate_importances : bool, default=True

Aggregate feature importances over boosting rounds.

Returns
self : object

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

import_model(input_file, input_type, tree_format='auto', X=None)

Import a pre-trained boosted ensemble model and optimize the trees for fast inference.

Supported import formats include PMML, ONNX, XGBoost JSON, and LightGBM text. The corresponding input file types to be provided to the import_model function are ‘pmml’, ‘onnx’, ‘xgb_json’, and ‘lightgbm’, respectively.

Depending on how the tree_format argument is set, this function will return a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.

If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.

If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.

By default, tree_format is set to ‘auto’. A check is performed: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.

Note: If the input file contains features that are not supported by the import function, an exception is raised indicating the feature and the line number within the input file containing the feature.

Parameters
input_file : str

Input filename.

input_type : {‘pmml’, ‘onnx’, ‘xgb_json’, ‘lightgbm’}

Input file type.

tree_format : {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}

Tree format.

X : dense matrix (ndarray)

Optional input dataset used for compressing trees.

Returns
self : object

optimize_trees(tree_format='auto', X=None)

Optimize the trees in the ensemble for fast inference.

Depending on how the tree_format argument is set, this function will return a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.

If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.

If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.

By default, tree_format is set to ‘auto’. A check is performed: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.

Parameters
tree_format : {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}

Tree format.

X : dense matrix (ndarray)

Optional input dataset used for compressing trees.
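A small sketch of optimizing an already fitted model for CPU inference and checking which format was selected. The helper optimize_for_cpu is hypothetical; it only calls optimize_trees and reads the booster_.optimized_tree_format_ attribute described above, and it assumes model is a fitted BoostingMachineRegressor.

```python
def optimize_for_cpu(model, X_train=None):
    """Compress the trees of a fitted model and report the selected format.

    X_train is an optional dense ndarray; when given, it is passed along
    for use during tree compression.
    """
    model.optimize_trees(tree_format="compress_trees", X=X_train)
    # The format chosen by the optimization step is stored on the booster.
    return model.booster_.optimized_tree_format_
```

Passing the training data as X is optional; when it is omitted the trees are still compressed, only without the access statistics a dataset would provide.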

Returns
self : object

predict(X, n_jobs=None)

Predict estimates.

Parameters
X : dense matrix (ndarray)

Dataset used for prediction.

n_jobs : int

Number of threads to use for prediction.

Returns
pred : array-like, shape = (n_samples,)

The predictions.

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \(1 - \frac{u}{v}\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
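The definition above translates directly into code. The following plain-Python sketch computes \(R^2\) from the two sums of squares; r2 is a hypothetical helper written for illustration (it ignores sample weights and assumes y_true is not constant, since \(v\) would then be zero).

```python
def r2(y_true, y_pred):
    """Coefficient of determination, 1 - u / v, for unweighted samples."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares between targets and predictions.
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares of the targets around their mean.
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - u / v


y_true = [1.0, 2.0, 3.0]
print(r2(y_true, [1.0, 2.0, 3.0]))  # 1.0  (perfect predictions)
print(r2(y_true, [2.0, 2.0, 2.0]))  # 0.0  (always predicting the mean)
print(r2(y_true, [3.0, 2.0, 1.0]))  # -3.0 (worse than predicting the mean)
```

The three calls reproduce the cases named in the text: a perfect model, the constant mean predictor, and a model that does worse than the mean.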

Parameters
X : array-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

\(R^2\) of self.predict(X) with respect to y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)

Set the parameters of this model.

Valid parameter keys can be listed with get_params().

Returns
self