Decision Trees

Decision Tree Classifier

class snapml.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, n_jobs=1, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_id=0, verbose=False)

This class implements a decision tree classifier using the IBM Snap ML library. It can be used for binary and multi-class classification problems.

Parameters
criterion : string, default="gini"

The function used to measure the quality of a split. Possible values: “gini” for Gini impurity and “entropy” for information gain. “entropy” is currently not supported.
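
To make the “gini” criterion concrete, the following minimal sketch computes the textbook Gini impurity, 1 - sum_k p_k^2, for a node and the sample-weighted impurity of a candidate split. It is plain Python written for illustration, not Snap ML's internal implementation; the helper names gini_impurity and split_impurity are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_impurity(left, right):
    """Sample-weighted average Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


# A pure node has impurity 0; a balanced two-class node has impurity 0.5.
print(gini_impurity([0, 0, 0, 0]))      # 0.0
print(gini_impurity([0, 0, 1, 1]))      # 0.5
# A split that separates the two classes perfectly has impurity 0.
print(split_impurity([0, 0], [1, 1]))   # 0.0
```

A lower weighted impurity after the split means a better split under this criterion.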

splitter : string, default="best"

The strategy used to choose the split at each node. Possible values: “best” and “random”. “random” is currently not supported.

max_depth : int or None, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
  • If int, then consider min_samples_leaf as the minimum number.

  • If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.
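
The int/float rule for min_samples_leaf can be sketched as follows. This is an illustrative reading of the description above, not code taken from Snap ML; the helper names resolve_min_samples_leaf and split_is_allowed are hypothetical.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Turn the min_samples_leaf parameter into an absolute sample count."""
    if isinstance(min_samples_leaf, int):
        # An int is used directly as the minimum number of samples per leaf.
        return min_samples_leaf
    # A float is interpreted as a fraction of the training set, rounded up.
    return math.ceil(min_samples_leaf * n_samples)


def split_is_allowed(n_left, n_right, min_leaf):
    """A split is considered only if both branches keep at least min_leaf samples."""
    return n_left >= min_leaf and n_right >= min_leaf


min_leaf = resolve_min_samples_leaf(0.25, n_samples=10)
print(min_leaf)                           # 3, since ceil(0.25 * 10) = ceil(2.5)
print(split_is_allowed(3, 7, min_leaf))   # True
print(split_is_allowed(2, 8, min_leaf))   # False
```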

max_features : int, float, string or None, default=None
The number of features to consider when looking for the best split:
  • If int, then consider max_features features at each split.

  • If float, then consider int(max_features * n_features) features at each split.

  • If “auto”, then max_features=sqrt(n_features).

  • If “sqrt”, then max_features=sqrt(n_features).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.
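
The rules above can be summarized in a small helper. This is a sketch under stated assumptions: the source does not say how sqrt(n_features) and log2(n_features) are rounded, so truncating to an integer and using at least one feature are assumptions made here, and resolve_max_features is a hypothetical name rather than part of the Snap ML API.

```python
import math


def resolve_max_features(max_features, n_features):
    """Number of features examined per split for the classifier's max_features."""
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            # Assumption: truncate to int and use at least one feature.
            return max(1, int(math.sqrt(n_features)))
        if max_features == "log2":
            # Assumption: truncate to int and use at least one feature.
            return max(1, int(math.log2(n_features)))
        raise ValueError(f"unsupported max_features: {max_features!r}")
    if isinstance(max_features, int):
        return max_features
    # A float is a fraction of the total number of features.
    return int(max_features * n_features)


for value in (None, "auto", "sqrt", "log2", 10, 0.5):
    print(value, resolve_max_features(value, n_features=64))
```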

random_state : int or None, default=None

If int, random_state is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

n_jobs : integer, default=1

The number of CPU threads to use.

use_histograms : boolean, default=True

Use histogram-based splits rather than exact splits.

hist_nbins : int, default=256

Number of histogram bins.
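
The difference between exact and histogram-based splits can be illustrated with the candidate thresholds each approach would examine for a single feature. This is only a conceptual sketch: the source does not describe how Snap ML builds its histograms, so the equal-width binning used here is an assumption, and both helper names are hypothetical.

```python
def candidate_thresholds_exact(values):
    """Exact splits: midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def candidate_thresholds_histogram(values, n_bins):
    """Histogram splits: interior edges of n_bins equal-width bins (assumed binning)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return []  # A constant feature offers no split.
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


values = [0.0, 1.0, 2.0, 3.0, 4.0, 8.0]
print(candidate_thresholds_exact(values))         # one candidate per gap between values
print(candidate_thresholds_histogram(values, 4))  # at most n_bins - 1 candidates
```

With histograms, the number of candidate thresholds per feature is bounded by the number of bins rather than by the number of distinct values, which is what makes the approach cheaper on large datasets.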

use_gpu : boolean, default=False

Use GPU acceleration (only supported for histogram-based splits).

gpu_id : int, default=0

Device ID of the GPU which will be used when GPU acceleration is enabled.

verbose : bool, default=False

If True, prints debugging information during training. Warning: this increases the training time. For performance evaluation, use verbose=False.

Attributes
classes_ : array of shape = [n_classes]

The class labels (single-output problem).

n_classes_ : int

The number of classes (single-output problem).

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given training data.

Parameters
X_train : dense matrix (ndarray)

Training dataset.

y_train : array-like, shape = (n_samples,)

The target vector corresponding to X_train.

sample_weight : array-like, shape = [n_samples] or None

Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns
self : object

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

predict(X, n_jobs=None)

Predict class labels.

Parameters
X : dense matrix (ndarray) or memmap (np.memmap)

Dataset used for predicting class labels.

n_jobs : int, default=None

Number of threads used to run inference. By default, the value of the estimator's n_jobs attribute is used.

Returns
pred : array-like, shape = (n_samples,)

The predicted class of each sample.

predict_proba(X, n_jobs=None)

Predict class probabilities.

Parameters
X : dense matrix (ndarray)

Dataset used for predicting probabilities.

n_jobs : int, default=None

Number of threads used to run inference. By default, the value of the estimator's n_jobs attribute is used.

Returns
proba : array-like, shape = (n_samples, n_classes)

The predicted class probabilities for each sample.

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

Mean accuracy of self.predict(X) with respect to y.
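
As a rough illustration of what the returned score measures, the sketch below computes mean accuracy from true and predicted labels in plain Python. It assumes that sample_weight acts as a per-sample weight in a weighted mean, as in scikit-learn's accuracy_score; the function name mean_accuracy is hypothetical and this is not Snap ML's implementation.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if sample_weight is None:
        # Without weights, every sample counts equally.
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(mean_accuracy(y_true, y_pred))                              # 0.75
print(mean_accuracy(y_true, y_pred, sample_weight=[1, 1, 2, 1]))  # 0.6
```

In the second call the misclassified sample carries twice the weight of the others, so the score drops from 3/4 to 3/5.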

set_params(**params)

Set the parameters of this model.

Valid parameter keys can be listed with get_params().

Returns
self

Decision Tree Regressor

class snapml.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, n_jobs=1, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_id=0, verbose=False)

This class implements a decision tree regressor using the IBM Snap ML library. It can be used for regression tasks.

Parameters
criterion : {"mse"}, default="mse"

The function used to measure the quality of a split. The only supported value is “mse” (mean squared error).
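
To illustrate how a mean-squared-error criterion rates a split, the sketch below computes the MSE of a node around its mean target and the reduction obtained by splitting it into two children. This is the standard textbook formulation written in plain Python, not Snap ML's internal code; mse_impurity and mse_reduction are hypothetical names.

```python
def mse_impurity(targets):
    """Mean squared error of the targets around their own mean."""
    n = len(targets)
    if n == 0:
        return 0.0
    mean = sum(targets) / n
    return sum((t - mean) ** 2 for t in targets) / n


def mse_reduction(parent, left, right):
    """Drop in MSE when a node is split into left and right children."""
    n = len(parent)
    weighted_children = (
        len(left) * mse_impurity(left) + len(right) * mse_impurity(right)
    ) / n
    return mse_impurity(parent) - weighted_children


parent = [1.0, 1.0, 3.0, 3.0]
print(mse_impurity(parent))                             # 1.0
print(mse_reduction(parent, [1.0, 1.0], [3.0, 3.0]))    # 1.0, a perfect split
print(mse_reduction(parent, [1.0, 3.0], [1.0, 3.0]))    # 0.0, an uninformative split
```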

splitter : string, default="best"

The strategy used to choose the split at each node. Possible values: “best” and “random”. “random” is currently not supported.

max_depth : int or None, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.

min_samples_leaf : int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
  • If int, then consider min_samples_leaf as the minimum number.

  • If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.

max_features : int, float, string or None, default=None
The number of features to consider when looking for the best split:
  • If int, then consider max_features features at each split.

  • If float, then consider int(max_features * n_features) features at each split.

  • If “auto”, then max_features=n_features.

  • If “sqrt”, then max_features=sqrt(n_features).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.

random_state : int or None, default=None

If int, random_state is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.

n_jobs : integer, default=1

The number of CPU threads to use.

use_histograms : boolean, default=True

Use histogram-based splits rather than exact splits.

hist_nbins : int, default=256

Number of histogram bins.

use_gpu : boolean, default=False

Use GPU acceleration (only supported for histogram-based splits).

gpu_id : int, default=0

Device ID of the GPU which will be used when GPU acceleration is enabled.

verbose : bool, default=False

If True, prints debugging information during training. Warning: this increases the training time. For performance evaluation, use verbose=False.

fit(X_train, y_train, sample_weight=None)

Fit the model according to the given training data.

Parameters
X_train : dense matrix (ndarray)

Training dataset.

y_train : array-like, shape = (n_samples,)

The target vector corresponding to X_train.

sample_weight : array-like, shape = [n_samples] or None

Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.

Returns
self : object

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

predict(X, n_jobs=None)

Predict target values.

Parameters
X : dense matrix (ndarray) or memmap (np.memmap)

Dataset used for predicting target values.

n_jobs : int, default=None

Number of threads used to run inference. By default, the value of the estimator's n_jobs attribute is used.

Returns
pred : array-like, shape = (n_samples,)

The predicted value of each sample.

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative because the model can be arbitrarily worse. A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters
X : array-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

\(R^2\) of self.predict(X) with respect to y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
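
The definition \(R^2 = 1 - \frac{u}{v}\) given for score can be checked with a few lines of plain Python. This minimal sketch handles the unweighted, single-output case only and does not guard against a constant y_true (where \(v = 0\)); the function name r_squared is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - u / v."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares between targets and predictions.
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares of the targets around their mean.
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - u / v


y_true = [1.0, 2.0, 3.0]
print(r_squared(y_true, [1.0, 2.0, 3.0]))  # 1.0, perfect predictions
print(r_squared(y_true, [2.0, 2.0, 2.0]))  # 0.0, always predicting the mean
print(r_squared(y_true, [3.0, 2.0, 1.0]))  # -3.0, worse than predicting the mean
```

The three calls reproduce the cases described above: a best possible score of 1.0, a score of 0.0 for the constant mean predictor, and a negative score for a model that does worse than the mean.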

set_params(**params)

Set the parameters of this model.

Valid parameter keys can be listed with get_params().

Returns
self