Decision Trees

Decision Tree Classifier

class snapml.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, n_jobs=1, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_id=0, verbose=False)

This class implements a decision tree classifier using the IBM Snap ML library. It can be used for binary and multi-class classification problems.
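A minimal usage sketch, assuming the snapml package and NumPy are installed. The function name demo, the synthetic data, and the hyperparameter values are illustrative assumptions; only the constructor arguments and methods documented in this section are used.

```python
def demo(n_samples=200, n_features=8, seed=0):
    """Train a decision tree on synthetic data and return its training accuracy."""
    # Imports live inside the function so the sketch can be defined
    # even on a machine where snapml is not installed.
    import numpy as np
    from snapml import DecisionTreeClassifier

    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features)).astype(np.float32)
    # Hypothetical binary target: whether the first feature is positive.
    y = (X[:, 0] > 0).astype(np.int64)

    clf = DecisionTreeClassifier(max_depth=4, n_jobs=2, random_state=seed)
    clf.fit(X, y)

    labels = clf.predict(X)        # documented shape: (n_samples,)
    proba = clf.predict_proba(X)   # documented shape: (n_samples, n_classes)
    assert len(labels) == len(proba) == n_samples

    return clf.score(X, y)         # mean accuracy on the training data
```

Calling demo() trains on 200 synthetic samples and returns the mean accuracy reported by score.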

Parameters

criterion : string, default="gini"

The function used to measure the quality of a split. Possible values: “gini” for Gini impurity and “entropy” for information gain. “entropy” is currently not supported.
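As an illustration of the “gini” criterion, the following sketch computes the Gini impurity of a set of labels and the sample-weighted impurity of a candidate split. It is a plain-Python rendering of the textbook definition, not Snap ML's internal implementation; the function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Sample-weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


parent = [0, 0, 1, 1]
pure_split = split_impurity([0, 0], [1, 1])    # both children are pure
mixed_split = split_impurity([0, 1], [0, 1])   # children are as mixed as the parent
```

A lower weighted impurity indicates a better split, so the pure split would be preferred over the mixed one here.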

splitter : string, default="best"

This parameter defines the strategy used to choose the split at each node. Possible values: “best” and “random”. “random” is currently not supported.

max_depth : int or None, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it generates at least min_samples_leaf training samples in each of the left and right branches.

If int, then consider min_samples_leaf as the minimum number.
If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.
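The int and float rules for min_samples_leaf can be sketched as follows. The helper names are hypothetical and the code only mirrors the rule stated above; it is not taken from the library.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Turn the int-or-float parameter into an absolute sample count."""
    if isinstance(min_samples_leaf, int):
        return min_samples_leaf
    # Float values are interpreted as a fraction of the training set size.
    return math.ceil(min_samples_leaf * n_samples)


def split_allowed(n_left, n_right, min_samples_leaf, n_samples):
    """A split is considered only if both branches keep enough samples."""
    minimum = resolve_min_samples_leaf(min_samples_leaf, n_samples)
    return n_left >= minimum and n_right >= minimum
```

With 90 training samples, min_samples_leaf=0.05 resolves to ceil(4.5) = 5, so a split that leaves only 4 samples in one branch is rejected.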

max_features : int, float, string or None, default=None

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.
If float, then consider int(max_features * n_features) features at each split.
If “auto”, then max_features=sqrt(n_features).
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
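The rules listed above for the classifier can be sketched as a small resolver. The function name is hypothetical, and the rounding applied to the “sqrt” and “log2” values (truncation, with a floor of one feature) is an assumption, since the text does not specify it.

```python
import math


def resolve_max_features(max_features, n_features):
    """Map the max_features parameter to a number of features per split."""
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            value = math.sqrt(n_features)
        elif max_features == "log2":
            value = math.log2(n_features)
        else:
            raise ValueError(f"unsupported max_features: {max_features!r}")
        # Assumption: truncate to an integer and keep at least one feature.
        return max(1, int(value))
    if isinstance(max_features, int):
        return max_features
    # Float values are interpreted as a fraction of the feature count.
    return int(max_features * n_features)
```

For a dataset with 64 features, “sqrt” and “auto” resolve to 8, “log2” to 6, and 0.5 to 32.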

random_state : int or None, default=None

If int, random_state is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

n_jobs : int, default=1

The number of CPU threads to use.

use_histograms : bool, default=True

Use histogram-based splits rather than exact splits.

hist_nbins : int, default=256

Number of histogram bins.

use_gpu : bool, default=False

Use GPU acceleration (only supported for histogram-based splits).

gpu_id : int, default=0

Device ID of the GPU which will be used when GPU acceleration is enabled.

verbose : bool, default=False

If True, it prints debugging information while training. Warning: this will increase the training time. For performance evaluation, use verbose=False.

Attributes

classes_ : array of shape (n_classes,): The class labels (single-output problems).
n_classes_ : int: The number of classes (single-output problems).

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given training data.

Parameters

X_train : dense matrix (ndarray): Training dataset.
y_train : array-like, shape = (n_samples,): The target vector corresponding to X_train.
sample_weight : array-like, shape = (n_samples,) or None: Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns

self : object

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict: Parameter names mapped to their values.

predict(X, n_jobs=0)

Predict the class of each sample in X.

Parameters

X : dense matrix (ndarray) or memmap (np.memmap): Dataset used for predicting classes.
n_jobs : int, default=0: Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns

pred : array-like, shape = (n_samples,): The predicted class of each sample.

predict_proba(X, n_jobs=0)

Predict class probabilities.

Parameters

X : dense matrix (ndarray): Dataset used for predicting probabilities.
n_jobs : int, default=0: Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns

proba : array-like, shape = (n_samples, n_classes): The predicted class probabilities of each sample.

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters

X : array-like of shape (n_samples, n_features): Test samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs): True labels for X.
sample_weight : array-like of shape (n_samples,), default=None: Sample weights.

Returns

score : float: Mean accuracy of self.predict(X) with respect to y.
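The mean accuracy returned by score can be sketched in plain Python as follows. The function name is hypothetical, and the handling of sample weights shown here (a weighted fraction of correct predictions) is an assumption about how the weights enter the metric.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of predictions that match the true labels, optionally weighted."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    # Sum the weights of the correctly classified samples.
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)
```

For example, three correct predictions out of four give an accuracy of 0.75 when all samples are weighted equally.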

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
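The <component>__<parameter> naming convention can be illustrated with a small helper that separates an estimator's own parameters from those addressed to nested components. This is only a sketch of the convention, not the implementation behind set_params; the helper and the component name "tree" are hypothetical.

```python
def split_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' entries."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        component, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"verbose": True, "tree__max_depth": 5, "tree__n_jobs": 4}
)
```

Here own holds the parameters of the outer object, while nested groups max_depth and n_jobs under the component they are addressed to.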

Parameters

**params : dict: Estimator parameters.

Returns

self : estimator instance: Estimator instance.

Decision Tree Regressor

class snapml.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, n_jobs=1, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_id=0, verbose=False)

This class implements a decision tree regressor using the IBM Snap ML library. It can be used for regression tasks.

Parameters

criterion : {"mse"}, default="mse"

The function used to measure the quality of a split. The only possible value is “mse” (mean squared error).

splitter : string, default="best"

This parameter defines the strategy used to choose the split at each node. Possible values: “best” and “random”. “random” is currently not supported.

max_depth : int or None, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it generates at least min_samples_leaf training samples in each of the left and right branches.

If int, then consider min_samples_leaf as the minimum number.
If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.

max_features : int, float, string or None, default=None

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.
If float, then consider int(max_features * n_features) features at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.

random_state : int or None, default=None

If int, random_state is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

n_jobs : int, default=1

The number of CPU threads to use.

use_histograms : bool, default=True

Use histogram-based splits rather than exact splits.

hist_nbins : int, default=256

Number of histogram bins.

use_gpu : bool, default=False

Use GPU acceleration (only supported for histogram-based splits).

gpu_id : int, default=0

Device ID of the GPU which will be used when GPU acceleration is enabled.

verbose : bool, default=False

If True, it prints debugging information while training. Warning: this will increase the training time. For performance evaluation, use verbose=False.

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given training data.

Parameters

X_train : dense matrix (ndarray): Training dataset.
y_train : array-like, shape = (n_samples,): The target vector corresponding to X_train.
sample_weight : array-like, shape = (n_samples,) or None: Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns

self : object

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict: Parameter names mapped to their values.

predict(X, n_jobs=0)

Predict the regression value of each sample in X.

Parameters

X : dense matrix (ndarray) or memmap (np.memmap): Dataset used for predicting regression values.
n_jobs : int, default=0: Number of threads used to run inference. By default, inference runs with the maximum number of available threads.

Returns

pred : array-like, shape = (n_samples,): The predicted value of each sample.

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \(1 - \frac{u}{v}\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative because the model can be arbitrarily worse than a constant prediction. A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters

X : array-like of shape (n_samples, n_features): Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y : array-like of shape (n_samples,) or (n_samples, n_outputs): True values for X.
sample_weight : array-like of shape (n_samples,), default=None: Sample weights.

Returns

score : float: \(R^2\) of self.predict(X) with respect to y.
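The definition of \(R^2\) given for score can be written out directly. The following plain-Python sketch uses a hypothetical function name and ignores sample weights; it follows the formula \(1 - \frac{u}{v}\) term by term.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination 1 - u / v for one-dimensional targets."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares between targets and predictions.
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares around the mean of the targets.
    v = sum((t - mean_true) ** 2 for t in y_true)
    # Note: v is zero when all targets are equal, which this sketch does not handle.
    return 1.0 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_sketch(y_true, y_true)
constant = r2_score_sketch(y_true, [2.5, 2.5, 2.5, 2.5])   # always predicts the mean
reversed_fit = r2_score_sketch(y_true, [4.0, 3.0, 2.0, 1.0])
```

The three cases correspond to the statements in the definition: a perfect fit scores 1.0, predicting the mean scores 0.0, and a model worse than the mean scores below zero.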

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params : dict: Estimator parameters.

Returns

self : estimator instance: Estimator instance.