Decision Trees
Decision Tree Classifier
- class snapml.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, n_jobs=1, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_id=0, verbose=False)
This class implements a decision tree classifier using the IBM Snap ML library. It can be used for binary and multi-class classification problems.
- Parameters
- criterion : string, default="gini"
The function used to measure the quality of a split. Possible values: "gini" for Gini impurity and "entropy" for information gain. "entropy" is currently not supported.
- splitter : string, default="best"
The strategy used to choose the split at each node. Possible values: "best" and "random". "random" is currently not supported.
- max_depth : int or None, default=None
The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.
- min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
If int, then consider min_samples_leaf as the minimum number.
If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.
- max_features : int, float, string or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then consider int(max_features * n_features) features at each split.
If "auto", then max_features=sqrt(n_features).
If "sqrt", then max_features=sqrt(n_features).
If "log2", then max_features=log2(n_features).
If None, then max_features=n_features.
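The rules above for min_samples_leaf and max_features can be sketched in plain Python. This is an illustration of the documented rules, not Snap ML's internal code: the helper names resolve_min_samples_leaf and resolve_max_features are hypothetical, and truncating the "sqrt" and "log2" results to an integer (with a floor of one feature) is an assumption, since the text above does not say how those values are rounded.

```python
import math


def resolve_min_samples_leaf(min_samples_leaf, n_samples):
    """Turn the int-or-float setting into an absolute sample count."""
    if isinstance(min_samples_leaf, float):
        # Float: a fraction of the training set, rounded up.
        return math.ceil(min_samples_leaf * n_samples)
    # Int: already an absolute count.
    return int(min_samples_leaf)


def resolve_max_features(max_features, n_features):
    """Turn the classifier's max_features setting into a feature count."""
    if max_features is None:
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            value = math.sqrt(n_features)
        elif max_features == "log2":
            value = math.log2(n_features)
        else:
            raise ValueError(f"unknown max_features value: {max_features!r}")
        # Assumption: truncate to an integer and keep at least one feature.
        return max(1, int(value))
    if isinstance(max_features, float):
        # Float: a fraction of the feature count, truncated.
        return int(max_features * n_features)
    # Int: already an absolute count.
    return int(max_features)
```

For example, resolve_min_samples_leaf(0.25, 10) rounds 2.5 up to 3 samples per leaf, while resolve_max_features(0.5, 20) considers 10 features at each split.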
- random_state : int or None, default=None
If int, random_state is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.
- n_jobs : int, default=1
The number of CPU threads to use.
- use_histograms : bool, default=True
Use histogram-based splits rather than exact splits.
- hist_nbins : int, default=256
Number of histogram bins.
- use_gpu : bool, default=False
Use GPU acceleration (only supported for histogram-based splits).
- gpu_id : int, default=0
Device ID of the GPU to use when GPU acceleration is enabled.
- verbose : bool, default=False
If True, print debugging information during training. Warning: this increases training time. For performance evaluation, use verbose=False.
- Attributes
- classes_ : array of shape (n_classes,)
The class labels (single-output problems).
- n_classes_ : int
The number of classes (single-output problems).
- fit(X_train, y_train, sample_weight=None)
Fit the model to the given training data.
- Parameters
- X_train : dense matrix (ndarray)
Training dataset.
- y_train : array-like, shape = (n_samples,)
The target vector corresponding to X_train.
- sample_weight : array-like, shape = (n_samples,) or None
Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.
- Returns
- self : object
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep : bool, default=True
If True, return the parameters for this estimator and for contained subobjects that are estimators.
- Returns
- params : dict
Parameter names mapped to their values.
- predict(X, n_jobs=0)
Predict the class of each sample in X.
- Parameters
- X : dense matrix (ndarray) or memmap (np.memmap)
Dataset used for predicting classes.
- n_jobs : int, default=0
Number of threads used to run inference. By default, inference runs with the maximum number of available threads.
- Returns
- pred : array-like, shape = (n_samples,)
The predicted class of each sample.
- predict_proba(X, n_jobs=0)
Predict class probabilities.
- Parameters
- X : dense matrix (ndarray)
Dataset used for predicting probabilities.
- n_jobs : int, default=0
Number of threads used to run inference. By default, inference runs with the maximum number of available threads.
- Returns
- proba : array-like, shape = (n_samples, n_classes)
The predicted class probabilities of each sample.
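As a minimal sketch of how fit, predict, and predict_proba fit together, the function below trains a classifier and returns both kinds of prediction. It uses only the constructor arguments and methods documented above. The function name fit_and_predict and the specific values max_depth=8, n_jobs=4, and random_state=42 are illustrative choices, and the inputs are assumed to be dense NumPy arrays as the parameter descriptions require.

```python
def fit_and_predict(X_train, y_train, X_test):
    """Train a Snap ML decision tree classifier and predict on X_test.

    Returns a (labels, probabilities) tuple.
    """
    # Imported lazily so this sketch can be loaded without snapml installed.
    from snapml import DecisionTreeClassifier

    # Illustrative settings; every argument is documented in the signature above.
    clf = DecisionTreeClassifier(max_depth=8, n_jobs=4, random_state=42)
    clf.fit(X_train, y_train)

    labels = clf.predict(X_test)       # documented shape: (n_samples,)
    proba = clf.predict_proba(X_test)  # documented shape: (n_samples, n_classes)
    return labels, proba
```

Leaving n_jobs at its default of 0 in predict and predict_proba lets inference use all available threads, independently of the n_jobs value used for training.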
- score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
- Parameters
- X : array-like of shape (n_samples, n_features)
Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
True labels for X.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
- Returns
- score : float
Mean accuracy of self.predict(X) with respect to y.
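To make the returned value concrete, here is a small pure-Python sketch of mean accuracy with optional sample weights. It assumes the usual definition, in which each correct prediction contributes its weight and the total is divided by the sum of all weights. The helper name mean_accuracy is hypothetical and this is not the library's implementation.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if sample_weight is None:
        # Unweighted case: every sample counts equally.
        sample_weight = [1.0] * len(y_true)
    # Sum the weights of the correctly predicted samples.
    correct = sum(
        w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p
    )
    return correct / sum(sample_weight)
```

With y_true = [0, 1, 1, 0] and y_pred = [0, 1, 0, 0], three of four predictions match, so the unweighted accuracy is 0.75. Giving the misclassified third sample a weight of 2 lowers the score to 3/5.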
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters
- **params : dict
Estimator parameters.
- Returns
- self : estimator instance
Estimator instance.
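The <component>__<parameter> naming convention can be illustrated with a short sketch that separates an estimator's own parameters from those addressed to a nested component. The helper group_nested_params and the component name "tree" are hypothetical, and the code only demonstrates the key format. It is not how set_params is implemented.

```python
def group_nested_params(params):
    """Split keys like 'tree__max_depth' into per-component dictionaries.

    Returns (own, nested): parameters of the object itself, and a mapping
    from component name to that component's parameters.
    """
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        name, delim, sub_key = key.partition("__")
        if delim:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested
```

For instance, {"verbose": True, "tree__max_depth": 5} yields {"verbose": True} for the outer object and {"tree": {"max_depth": 5}} for the nested component.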
Decision Tree Regressor
- class snapml.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_leaf=1, max_features=None, random_state=None, n_jobs=1, use_histograms=True, hist_nbins=256, use_gpu=False, gpu_id=0, verbose=False)
This class implements a decision tree regressor using the IBM Snap ML library. It can be used for regression tasks.
- Parameters
- criterion : {"mse"}, default="mse"
The function used to measure the quality of a split. The only listed value is "mse" (mean squared error).
- splitter : string, default="best"
The strategy used to choose the split at each node. Possible values: "best" and "random". "random" is currently not supported.
- max_depth : int or None, default=None
The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_leaf samples.
- min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
If int, then consider min_samples_leaf as the minimum number.
If float, then consider ceil(min_samples_leaf * n_samples) as the minimum number.
- max_features : int, float, string or None, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then consider int(max_features * n_features) features at each split.
If "auto", then max_features=n_features.
If "sqrt", then max_features=sqrt(n_features).
If "log2", then max_features=log2(n_features).
If None, then max_features=n_features.
- random_state : int or None, default=None
If int, random_state is the seed used by the random number generator. If None, the random number generator is the RandomState instance used by np.random.
- n_jobs : int, default=1
The number of CPU threads to use.
- use_histograms : bool, default=True
Use histogram-based splits rather than exact splits.
- hist_nbins : int, default=256
Number of histogram bins.
- use_gpu : bool, default=False
Use GPU acceleration (only supported for histogram-based splits).
- gpu_id : int, default=0
Device ID of the GPU to use when GPU acceleration is enabled.
- verbose : bool, default=False
If True, print debugging information during training. Warning: this increases training time. For performance evaluation, use verbose=False.
- Attributes
- fit(X_train, y_train, sample_weight=None)
Fit the model to the given training data.
- Parameters
- X_train : dense matrix (ndarray)
Training dataset.
- y_train : array-like, shape = (n_samples,)
The target vector corresponding to X_train.
- sample_weight : array-like, shape = (n_samples,) or None
Sample weights. If None, then samples are equally weighted. TODO: Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node.
- Returns
- self : object
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep : bool, default=True
If True, return the parameters for this estimator and for contained subobjects that are estimators.
- Returns
- params : dict
Parameter names mapped to their values.
- predict(X, n_jobs=0)
Predict the regression value of each sample in X.
- Parameters
- X : dense matrix (ndarray) or memmap (np.memmap)
Dataset used for predicting regression estimates.
- n_jobs : int, default=0
Number of threads used to run inference. By default, inference runs with the maximum number of available threads.
- Returns
- pred : array-like, shape = (n_samples,)
The predicted value of each sample.
- score(X, y, sample_weight=None)
Return the coefficient of determination of the prediction.
The coefficient of determination \(R^2\) is defined as \(1 - \frac{u}{v}\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
- Parameters
- X : array-like of shape (n_samples, n_features)
Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead, with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
True values for X.
- sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
- Returns
- score : float
\(R^2\) of self.predict(X) with respect to y.
Notes
The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).
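The definition of \(R^2\) given for score can be checked with a short pure-Python sketch for the single-output, unweighted case. The function name r2_score_sketch is hypothetical. It follows the formula in the text, with u as the residual sum of squares and v as the total sum of squares, and does not handle the degenerate case where all true values are equal (v = 0).

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - u / v (single output, unweighted)."""
    mean_true = sum(y_true) / len(y_true)
    # u: residual sum of squares between true and predicted values.
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # v: total sum of squares of the true values around their mean.
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_sketch(y_true, [1.0, 2.0, 3.0, 4.0])   # u = 0
constant = r2_score_sketch(y_true, [2.5, 2.5, 2.5, 2.5])  # u = v
reversed_ = r2_score_sketch(y_true, [4.0, 3.0, 2.0, 1.0])  # u = 4 * v
```

The three cases match the statements above: a perfect prediction scores 1.0, always predicting the mean of y scores 0.0, and a model worse than the mean (here, the reversed targets) gets a negative score.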
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters
- **params : dict
Estimator parameters.
- Returns
- self : estimator instance
Estimator instance.