Boosting Machines

Boosting Machine Classifier

class snapml.BoostingMachineClassifier(n_jobs=1, num_round=100, max_depth=None, min_max_depth=1, max_max_depth=5, early_stopping_rounds=10, random_state=0, base_score=None, learning_rate=0.1, verbose=False, compress_trees=False, class_weight=None, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_ids=[0], colsample_bytree=1.0, subsample=1.0, lambda_l2=0.0, tree_select_probability=1.0, regularizer=1.0, fit_intercept=False, gamma=1.0, n_components=10)

Boosting machine for binary and multi-class classification tasks.

A heterogeneous boosting machine that mixes binary decision trees (with stochastic max_depth) and linear models on random Fourier features (an approximation of kernel ridge regression).
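The mixing described above can be pictured with a minimal sketch. It only illustrates the idea and is not snapml's internal code: the helper name `sample_base_learners` is hypothetical, and drawing the depth uniformly from the inclusive range [min_max_depth, max_max_depth] is an assumption.

```python
import random


def sample_base_learners(num_round, tree_select_probability=1.0,
                         min_max_depth=1, max_max_depth=5, seed=0):
    """Return one (kind, max_depth) entry per boosting round (illustrative only)."""
    rng = random.Random(seed)
    plan = []
    for _ in range(num_round):
        if rng.random() < tree_select_probability:
            # Assumption: the tree depth limit is drawn uniformly from the
            # inclusive range [min_max_depth, max_max_depth].
            plan.append(("tree", rng.randint(min_max_depth, max_max_depth)))
        else:
            # Otherwise the round fits a linear model on random Fourier features.
            plan.append(("kernel_ridge", None))
    return plan


plan = sample_base_learners(num_round=10, tree_select_probability=0.8)
for kind, depth in plan:
    print(kind, depth)
```

With tree_select_probability=1.0 every round uses a tree, which matches the documented default.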

Parameters

num_round: int, default=100: Number of boosting iterations.
learning_rate: float, default=0.1: Learning rate / shrinkage factor.
random_state: int, default=0: Random seed.
colsample_bytree: float, default=1.0: Fraction of feature columns used at each boosting iteration.
subsample: float, default=1.0: Fraction of training examples used at each boosting iteration.
verbose: bool, default=False: Print information during training.
lambda_l2: float, default=0.0: L2-regularization penalty used during tree-building.
early_stopping_rounds: int, default=10: When a validation set is provided, training stops if the validation loss does not decrease for this number of rounds.
compress_trees: bool, default=False: Compress trees after training for fast inference.
base_score: float, default=None: Base score used to initialize the boosting algorithm. If None, the algorithm initializes the base score to the average target (regression), the logit of the probability of the positive class (binary classification), or zero (multiclass classification).
class_weight: {‘balanced’, None}, default=None: If set to ‘balanced’, sample weights are applied to account for class imbalance; otherwise no sample weights are used.
max_depth: int, default=None: If set, both min_max_depth and max_max_depth are set to max_depth.
min_max_depth: int, default=1: Minimum max_depth of trees in the ensemble.
max_max_depth: int, default=5: Maximum max_depth of trees in the ensemble.
n_jobs: int, default=1: Number of threads to use during training.
use_histograms: bool, default=True: Use histograms to accelerate tree-building.
hist_nbins: int, default=256: Number of histogram bins.
use_gpu: bool, default=False: Use GPU for tree-building.
gpu_ids: array-like of int, default=[0]: Device IDs of the GPUs used when GPU acceleration is enabled.
tree_select_probability: float, default=1.0: Probability of selecting a tree (rather than a kernel ridge regressor) at each boosting iteration.
regularizer: float, default=1.0: L2-regularization penalty for the kernel ridge regressor.
fit_intercept: bool, default=False: Include an intercept term in the kernel ridge regressor.
gamma: float, default=1.0: Gaussian kernel parameter.
n_components: int, default=10: Number of components in the random projection.
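As a worked illustration of two of the parameters above, the sketch below computes the default base score for binary classification (the logit of the positive-class probability) and one common definition of ‘balanced’ sample weights. Both helper names are hypothetical. The clipping constant and the weight formula n_samples / (n_classes * class_count), which follows the scikit-learn convention, are assumptions and may differ from what snapml does internally.

```python
import math
from collections import Counter


def initial_base_score(y):
    """Logit of the positive-class fraction for labels encoded as 0/1."""
    p = sum(y) / len(y)
    # Assumption: clip the fraction so that a single-class sample does not
    # produce log(0) or a division by zero.
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def balanced_sample_weights(y):
    """Per-sample weights inversely proportional to class frequency."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    # Assumption: scikit-learn style 'balanced' weighting.
    return [n_samples / (n_classes * counts[label]) for label in y]


labels = [0, 0, 0, 1]
print(initial_base_score(labels))       # negative: the positive class is rare
print(balanced_sample_weights(labels))  # the rare class gets the larger weight
```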

Attributes

feature_importances_: array-like, shape=(n_features,): Feature importances computed across trees.

apply(X)

Map a batch of examples to leaf indices and labels.

Parameters

X: dense matrix (ndarray): Batch of examples.

Returns

indices: array-like, shape = (n_samples, num_round) or (n_samples, num_round, num_classes): The leaf indices. The output is 2-dimensional for binary classification and 3-dimensional for multiclass classification.
labels: array-like, shape = (n_samples, num_round) or (n_samples, num_round, num_classes): The leaf labels. The output is 2-dimensional for binary classification and 3-dimensional for multiclass classification.

export_model(output_file, output_type='pmml')

Export a model trained in snapml to the given output file, using a format of the given type.

Currently PMML is the only supported export format. The corresponding output_type value to pass to export_model is ‘pmml’.

Parameters

output_file: str: Output filename
output_type: {‘pmml’}: Output file type

fit(X, y, sample_weight=None, X_val=None, y_val=None, sample_weight_val=None, aggregate_importances=True)

Fit the model to the given training data.

Parameters

X: dense matrix (ndarray): Training dataset
y: array-like, shape = (n_samples,): The target vector corresponding to X.
sample_weight: array-like, shape = (n_samples,): Training sample weights
X_val: dense matrix (ndarray): Validation dataset
y_val: array-like, shape = (n_samples,): The target vector corresponding to X_val.
sample_weight_val: array-like, shape = (n_samples,): Validation sample weights
aggregate_importances: bool, default=True: Aggregate feature importances over boosting rounds

Returns

self: object
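A minimal sketch of calling fit with a validation set, so that early_stopping_rounds can take effect. The helpers `split_train_val` and `fit_with_validation` are hypothetical names introduced for illustration; only the constructor arguments and the fit keyword arguments come from the signature documented here. The snapml and NumPy imports are deferred into the function body so that the splitting helper can be used on its own.

```python
import random


def split_train_val(X, y, val_fraction=0.2, seed=0):
    """Shuffle row indices and split them into training and validation parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(indices) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def take(data, idx):
        return [data[i] for i in idx]

    return take(X, train_idx), take(y, train_idx), take(X, val_idx), take(y, val_idx)


def fit_with_validation(X, y):
    """Train a classifier with a held-out validation set (requires snapml)."""
    # Deferred imports: only needed when training is actually run.
    import numpy as np
    from snapml import BoostingMachineClassifier

    X_train, y_train, X_val, y_val = split_train_val(X, y)
    clf = BoostingMachineClassifier(num_round=100, early_stopping_rounds=10,
                                    random_state=0)
    # fit expects dense ndarrays; the validation data is passed by keyword.
    clf.fit(np.asarray(X_train), np.asarray(y_train),
            X_val=np.asarray(X_val), y_val=np.asarray(y_val))
    return clf
```

Because fit returns self, the call can also be chained, for example `clf = BoostingMachineClassifier().fit(X, y)`.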

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

import_model(input_file, input_type, tree_format='auto', X=None)

Import a pre-trained boosted ensemble model and optimize the trees for fast inference.

Supported import formats are PMML, ONNX, XGBoost JSON, and LightGBM text. The corresponding input_type values to pass to import_model are ‘pmml’, ‘onnx’, ‘xgb_json’, and ‘lightgbm’ respectively.

Depending on the value of tree_format, this function produces a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.

If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.

If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.

By default tree_format is set to ‘auto’. A check is performed: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.

Note: If the input file contains features that are not supported by the import function, then an exception is thrown indicating the feature and the line number within the input file containing the feature.

Parameters

input_file: str: Input filename
input_type: {‘pmml’, ‘onnx’, ‘xgb_json’, ‘lightgbm’}: Input file type
tree_format: {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}: Tree format
X: dense matrix (ndarray): Optional input dataset used for compressing trees

Returns

self: object
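A minimal sketch of importing a pre-trained model. The `import_model` call and its input_type values are taken from the description above; the helper names and the mapping from file extension to input_type are hypothetical conveniences, not part of snapml, and the extensions chosen are only an assumed naming convention.

```python
import os

# Hypothetical convention: infer the documented input_type from the extension.
_INPUT_TYPES = {
    ".pmml": "pmml",
    ".onnx": "onnx",
    ".json": "xgb_json",
    ".txt": "lightgbm",
}


def guess_input_type(input_file):
    """Map a filename to one of the documented input_type values."""
    ext = os.path.splitext(input_file)[1].lower()
    if ext not in _INPUT_TYPES:
        raise ValueError("cannot infer input_type from extension %r" % ext)
    return _INPUT_TYPES[ext]


def load_pretrained(input_file, tree_format="auto"):
    """Import a pre-trained ensemble into a classifier (requires snapml)."""
    # Deferred import: only needed when a model is actually loaded.
    from snapml import BoostingMachineClassifier

    clf = BoostingMachineClassifier()
    clf.import_model(input_file, guess_input_type(input_file),
                     tree_format=tree_format)
    return clf
```

After import, the chosen format can be inspected through the `booster_.optimized_tree_format_` attribute mentioned above.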

optimize_trees(tree_format='auto', X=None)

Optimize the trees in the ensemble for fast inference.

Depending on the value of tree_format, this function produces a different optimized model format. This format determines which inference engine is used for subsequent calls to ‘predict’ or ‘predict_proba’.

If tree_format is set to ‘compress_trees’, the model will be optimized for execution on the CPU, using our compressed decision trees approach. Note: if this option is selected, an optional dataset X can be provided, which will be used to predict node access characteristics during node clustering.

If tree_format is set to ‘zdnn_tensors’, the model will be optimized for execution on the IBM z16 AI accelerator, using a matrix-based inference algorithm leveraging the zDNN library.

By default tree_format is set to ‘auto’. A check is performed: if the IBM z16 AI accelerator is available, the model is optimized according to ‘zdnn_tensors’; otherwise it is optimized according to ‘compress_trees’. The selected optimized tree format can be read from the attribute self.booster_.optimized_tree_format_.

Parameters

tree_format: {‘auto’, ‘compress_trees’, ‘zdnn_tensors’}: Tree format
X: dense matrix (ndarray): Optional input dataset used for compressing trees

Returns

self: object

predict(X, n_jobs=None)

Predict class labels.

Parameters

X: dense matrix (ndarray): Dataset used for predicting class estimates.
n_jobs: int: Number of threads to use for prediction.

Returns

pred: array-like, shape = (n_samples,): Returns the predicted class labels

predict_proba(X, n_jobs=None)

Predict class label probabilities.

Parameters

X: dense matrix (ndarray): Dataset used for predicting class estimates.
n_jobs: int: Number of threads to use for prediction.

Returns

proba: array-like, shape = (n_samples, 2): Returns the predicted class probabilities

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters

X: array-like of shape (n_samples, n_features): Test samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs): True labels for X.
sample_weight: array-like of shape (n_samples,), default=None: Sample weights.

Returns

score: float: Mean accuracy of self.predict(X) with respect to y.
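The quantity returned by score can be reproduced by hand. The sketch below is a plain-Python illustration of (optionally weighted) mean accuracy for single-label predictions; the helper name `mean_accuracy` is hypothetical and it is not the code that score runs.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of predictions that match the true labels."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    # Sum the weights of the correctly classified samples.
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(mean_accuracy(y_true, y_pred))
print(mean_accuracy(y_true, y_pred, sample_weight=[1, 1, 2, 0]))
```

In the second call the misclassified third sample carries half of the total weight, so the weighted accuracy is lower than the unweighted one.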

set_params(**params)

Set the parameters of this model.

Valid parameter keys can be listed with get_params().

Returns

self

Boosting Machine Regressor

class snapml.BoostingMachineRegressor(n_jobs=1, num_round=100, objective='mse', max_depth=None, min_max_depth=1, max_max_depth=5, early_stopping_rounds=10, random_state=0, base_score=None, learning_rate=0.1, verbose=False, compress_trees=False, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_id=0, colsample_bytree=1.0, subsample=1.0, lambda_l2=0.0, tree_select_probability=1.0, regularizer=1.0, fit_intercept=False, gamma=1.0, n_components=10)

Boosting machine for regression tasks.

A heterogeneous boosting machine that mixes binary decision trees (with stochastic max_depth) and linear models on random Fourier features (an approximation of kernel ridge regression).
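To make the random Fourier feature idea concrete, the sketch below builds the standard feature map whose inner products approximate a Gaussian kernel. It illustrates the general technique rather than snapml's implementation: the helper names are hypothetical, and the kernel convention exp(-gamma * ||x - y||^2) is an assumption about how gamma is defined.

```python
import math
import random


def make_rff(n_features, n_components=10, gamma=1.0, seed=0):
    """Return a function mapping x to n_components random Fourier features."""
    rng = random.Random(seed)
    # For exp(-gamma * ||x - y||^2) the frequencies are drawn from N(0, 2 * gamma).
    scale = math.sqrt(2.0 * gamma)
    W = [[rng.gauss(0.0, scale) for _ in range(n_features)]
         for _ in range(n_components)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    norm = math.sqrt(2.0 / n_components)

    def transform(x):
        return [norm * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bi)
                for w, bi in zip(W, b)]

    return transform


def gaussian_kernel(x, y, gamma=1.0):
    """Exact Gaussian kernel value, under the assumed gamma convention."""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, y)))


transform = make_rff(n_features=2, n_components=2000, gamma=0.5)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * c for a, c in zip(transform(x), transform(y)))
print(approx, gaussian_kernel(x, y, gamma=0.5))
```

A linear ridge model fitted on `transform(x)` then approximates kernel ridge regression, and a larger n_components gives a closer approximation at higher cost.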

Parameters

num_round: int, default=100: Number of boosting iterations.
objective: {‘mse’, ‘cross_entropy’}, default=‘mse’: Training objective.
learning_rate: float, default=0.1: Learning rate / shrinkage factor.
random_state: int, default=0: Random seed.
colsample_bytree: float, default=1.0: Fraction of feature columns used at each boosting iteration.
subsample: float, default=1.0: Fraction of training examples used at each boosting iteration.
verbose: bool, default=False: Print information during training.
lambda_l2: float, default=0.0: L2-regularization penalty used during tree-building.
early_stopping_rounds: int, default=10: When a validation set is provided, training stops if the validation loss does not decrease for this number of rounds.
compress_trees: bool, default=False: Compress trees after training for fast inference.
base_score: float, default=None: Base score used to initialize the boosting algorithm. If None, the algorithm initializes the base score to the average target (regression) or the logit of the probability of the positive class (binary classification).
max_depth: int, default=None: If set, both min_max_depth and max_max_depth are set to max_depth.
min_max_depth: int, default=1: Minimum max_depth of trees in the ensemble.
max_max_depth: int, default=5: Maximum max_depth of trees in the ensemble.
n_jobs: int, default=1: Number of threads to use during training.
use_histograms: bool, default=True: Use histograms to accelerate tree-building.
hist_nbins: int, default=256: Number of histogram bins.
use_gpu: bool, default=False: Use GPU for tree-building.
gpu_id: int, default=0: Device ID of the GPU to use during training.
tree_select_probability: float, default=1.0: Probability of selecting a tree (rather than a kernel ridge regressor) at each boosting iteration.
regularizer: float, default=1.0: L2-regularization penalty for the kernel ridge regressor.
fit_intercept: bool, default=False: Include an intercept term in the kernel ridge regressor.
gamma: float, default=1.0: Gaussian kernel parameter.
n_components: int, default=10: Number of components in the random projection.
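The early_stopping_rounds parameter above can be read as a patience rule on the validation loss. The sketch below shows one common form of that rule on a precomputed list of losses. The helper name `rounds_until_stop` is hypothetical, and the exact criterion snapml applies (for example, how ties are treated) is an assumption.

```python
def rounds_until_stop(val_losses, early_stopping_rounds=10):
    """Return (rounds_run, best_round) under a simple patience rule."""
    best_loss = float("inf")
    best_round = -1
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            # A strict improvement resets the patience counter.
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= early_stopping_rounds:
            # No improvement for early_stopping_rounds consecutive rounds.
            return rnd + 1, best_round
    return len(val_losses), best_round


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
print(rounds_until_stop(losses, early_stopping_rounds=3))
print(rounds_until_stop(losses, early_stopping_rounds=10))
```

With a patience of 3 the loop stops before reaching the later improvement at the last round, which is the usual trade-off of a small early_stopping_rounds value.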

Attributes

feature_importances_: array-like, shape=(n_features,): Feature importances computed across trees.

apply(X)

Map a batch of examples to leaf indices and labels.

Parameters

X: dense matrix (ndarray): Batch of examples.

Returns

indices: array-like, shape = (n_samples, num_round): The leaf indices.
labels: array-like, shape = (n_samples, num_round): The leaf labels.

export_model(output_file, output_type='pmml')

Export a model trained in snapml to the given output file, using a format of the given type.

Currently PMML is the only supported export format. The corresponding output_type value to pass to export_model is ‘pmml’.

Parameters

output_file: str: Output filename
output_type: {‘pmml’}: Output file type

fit(X, y, sample_weight=None, X_val=None, y_val=None, sample_weight_val=None, aggregate_importances=True)

Fit the model to the given training data.

Parameters

X: dense matrix (ndarray): Training dataset
y: array-like, shape = (n_samples,): The target vector corresponding to X.
sample_weight: array-like, shape = (n_samples,): Training sample weights
X_val: dense matrix (ndarray): Validation dataset
y_val: array-like, shape = (n_samples,): The target vector corresponding to X_val.
sample_weight_val: array-like, shape = (n_samples,): Validation sample weights
aggregate_importances: bool, default=True: Aggregate feature importances over boosting rounds

Returns

self: object

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.