H2O ML

R API

satRday_workshop/step_09_stacking.R#L9

# Load the Boston housing data from the mlbench package
library(mlbench)
data("BostonHousing2")

# Start H2O
library(h2o)
h2o.init(nthreads = -1)

Flow

Once an H2O cluster is running, the Flow web interface is available at http://localhost:54321/flow/index.html

Training

GitHub: satRday_workshop

Model stacking (ensemble)

Practical Ensemble Learning by Erin LeDell
H2O Ensemble Homepage
H2O Ensemble Tutorial
LeDell, E. (2015) Scalable Ensemble Learning and Computationally Efficient Variance Estimation (Doctoral Dissertation)
Breiman, L. (1996) Stacked Regressions
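
The references above describe stacking as training a metalearner on cross-validated predictions from a set of base learners. The following is a minimal base-R sketch of that idea, not the H2O Ensemble implementation: the `mtcars` data, the two `lm` base learners, and the plain `lm` metalearner are illustrative assumptions (Breiman's paper constrains the metalearner coefficients to be non-negative, which this sketch does not do).

```r
set.seed(42)
data(mtcars)

n <- nrow(mtcars)
nfolds <- 5
# Randomly assign each row to one of nfolds folds
fold <- sample(rep(seq_len(nfolds), length.out = n))

# Two hypothetical base learners, expressed as model formulas
base_formulas <- list(
  lm_wt = mpg ~ wt,
  lm_hp = mpg ~ hp + I(hp^2)
)

# Level-one data: out-of-fold predictions from every base learner
level_one <- matrix(NA_real_, nrow = n, ncol = length(base_formulas),
                    dimnames = list(NULL, names(base_formulas)))
for (k in seq_len(nfolds)) {
  held_out <- fold == k
  for (name in names(base_formulas)) {
    fit <- lm(base_formulas[[name]], data = mtcars[!held_out, ])
    level_one[held_out, name] <- predict(fit, newdata = mtcars[held_out, ])
  }
}

# Metalearner: regress the response on the cross-validated predictions
meta_data <- data.frame(mpg = mtcars$mpg, level_one)
metalearner <- lm(mpg ~ ., data = meta_data)

# Refit each base learner on the full data for use at prediction time
base_fits <- lapply(base_formulas, function(f) lm(f, data = mtcars))

# Ensemble prediction: base predictions first, then the metalearner
stack_predict <- function(newdata) {
  base_preds <- as.data.frame(lapply(base_fits, predict, newdata = newdata))
  predict(metalearner, newdata = base_preds)
}

ensemble_pred <- stack_predict(mtcars)
```

Using out-of-fold predictions for the level-one data keeps the metalearner from rewarding base learners that merely overfit the training rows.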

GBM

?h2o.gbm

h2o.gbm                  package:h2o                  R Documentation

Gradient Boosted Machines

Description:

 Builds gradient boosted classification trees, and gradient boosted
 regression trees on a parsed data set.

Usage:

 h2o.gbm(x, y, training_frame, model_id, checkpoint, ignore_const_cols = TRUE,
   distribution = c("AUTO", "gaussian", "bernoulli", "multinomial", "poisson",
   "gamma", "tweedie", "laplace", "quantile", "huber"), quantile_alpha = 0.5,
   tweedie_power = 1.5, huber_alpha, ntrees = 50, max_depth = 5,
   min_rows = 10, learn_rate = 0.1, learn_rate_annealing = 1,
   sample_rate = 1, sample_rate_per_class, col_sample_rate = 1,
   col_sample_rate_change_per_level = 1, col_sample_rate_per_tree = 1,
   nbins = 20, nbins_top_level, nbins_cats = 1024, validation_frame = NULL,
   balance_classes = FALSE, class_sampling_factors,
   max_after_balance_size = 1, seed, build_tree_one_node = FALSE,
   nfolds = 0, fold_column = NULL, fold_assignment = c("AUTO", "Random",
   "Modulo", "Stratified"), keep_cross_validation_predictions = FALSE,
   keep_cross_validation_fold_assignment = FALSE,
   score_each_iteration = FALSE, score_tree_interval = 0,
   stopping_rounds = 0, stopping_metric = c("AUTO", "deviance", "logloss",
   "MSE", "AUC", "misclassification", "mean_per_class_error"),
   stopping_tolerance = 0.001, max_runtime_secs = 0, offset_column = NULL,
   weights_column = NULL, min_split_improvement, histogram_type = c("AUTO",
   "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin"),
   max_abs_leafnode_pred)
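
In practice only a handful of these arguments are usually set. The sketch below shows one possible call on the Boston housing data loaded earlier. The choice of `cmedv` as the response, the columns dropped from the predictors, the 80/20 split, and all tuning values are assumptions made for illustration, not settings taken from the workshop script.

```r
library(mlbench)
library(h2o)

data("BostonHousing2")
h2o.init(nthreads = -1)

# Copy the R data frame into the H2O cluster as an H2OFrame
boston <- as.h2o(BostonHousing2)

# Hold out roughly 20% of the rows for testing (assumed split)
splits <- h2o.splitFrame(boston, ratios = 0.8, seed = 1234)
train <- splits[[1]]
test <- splits[[2]]

# Assumed response: corrected median home value; the uncorrected value
# and the identifier-like columns are left out of the predictors
y <- "cmedv"
x <- setdiff(colnames(boston), c("cmedv", "medv", "town", "tract"))

# Fit a regression GBM with illustrative settings
model <- h2o.gbm(x = x, y = y,
                 training_frame = train,
                 distribution = "gaussian",
                 ntrees = 100,
                 max_depth = 5,
                 learn_rate = 0.1,
                 seed = 1234)

# Evaluate on the held-out rows
perf <- h2o.performance(model, newdata = test)
mse <- h2o.mse(perf)
```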

Arguments:

   x: A vector containing the names or indices of the predictor
      variables to use in building the GBM model.

   y: The name or index of the response variable. If the data does
      not contain a header, this is the column index number
      starting at 0, and increasing from left to right. (The
      response must be either a numeric or a categorical
      variable.)

training_frame: An H2OFrame object containing the variables in the model.

model_id: (Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.

checkpoint: Model checkpoint (provide the model_id) to resume training with.

ignore_const_cols: A logical value indicating whether or not to ignore all the constant columns in the training frame.

distribution: A ‘character’ string. The distribution function of the response. Must be one of “AUTO”, “gaussian”, “bernoulli”, “multinomial”, “poisson”, “gamma”, “tweedie”, “laplace”, “quantile” or “huber”.

quantile_alpha: Desired quantile for Quantile regression, must be between 0 and 1.

tweedie_power: Tweedie power for Tweedie regression, must be between 1 and 2.

huber_alpha: Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).

ntrees: A nonnegative integer that determines the number of trees to grow.

max_depth: Maximum depth to grow the tree.

min_rows: Minimum number of rows to assign to terminal nodes.

learn_rate: Learning rate (from ‘0.0’ to ‘1.0’)

learn_rate_annealing: Scale the learning rate by this factor after each tree (e.g., 0.99 or 0.999)

sample_rate: Row sample rate per tree (from ‘0.0’ to ‘1.0’)

sample_rate_per_class: Row sample rate per tree per class (one per class, from ‘0.0’ to ‘1.0’)

col_sample_rate: Column sample rate per split (from ‘0.0’ to ‘1.0’)

col_sample_rate_change_per_level: Relative change of the column sampling rate for every level (from 0.0 to 2.0)

col_sample_rate_per_tree: Column sample rate per tree (from ‘0.0’ to ‘1.0’)

nbins: For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.

nbins_top_level: For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level.

nbins_cats: For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.

validation_frame: An H2OFrame object indicating the validation dataset used to construct the confusion matrix. Defaults to NULL. If left as NULL, this defaults to the training data when ‘nfolds = 0’.

balance_classes: logical, indicates whether or not to balance training data class counts via over/under-sampling (for imbalanced data).

class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.

max_after_balance_size: Maximum relative size of the training data after balancing class counts (can be less than 1.0). Ignored if balance_classes is FALSE, which is the default behavior.

seed: Seed for random numbers (affects sampling).

build_tree_one_node: Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets.

nfolds: (Optional) Number of folds for cross-validation.

fold_column: (Optional) Column with cross-validation fold index assignment per observation

fold_assignment: Cross-validation fold assignment scheme, if fold_column is not specified, must be “AUTO”, “Random”, “Modulo”, or “Stratified”. The Stratified option will stratify the folds based on the response variable, for classification problems.

keep_cross_validation_predictions: Whether to keep the predictions of the cross-validation models

keep_cross_validation_fold_assignment: Whether to keep the cross-validation fold assignment.

score_each_iteration: Attempts to score each tree.

score_tree_interval: Score the model after every so many trees. Disabled if set to 0.

stopping_rounds: Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.

stopping_metric: Metric to use for convergence checking; only used when stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “misclassification”, or “mean_per_class_error”.

stopping_tolerance: Relative tolerance for metric-based stopping criterion (if relative improvement is not at least this much, stop)

max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable.

offset_column: Specify the offset column.

weights_column: Specify the weights column.

min_split_improvement: Minimum relative improvement in squared error reduction for a split to happen.

histogram_type: What type of histogram to use for finding optimal split points. Can be one of “AUTO”, “UniformAdaptive”, “Random”, “QuantilesGlobal” or “RoundRobin”.

max_abs_leafnode_pred: Maximum absolute value of a leaf node prediction.
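
Two of the arguments above, learn_rate_annealing and stopping_rounds, are defined by small numeric rules that are easier to follow in code. The base-R sketch below is one reading of those descriptions, not H2O's own code. It assumes that tree t uses learn_rate * learn_rate_annealing^(t - 1), that the stopping metric is positive and lower is better (for example MSE), and that the moving averages are taken over the last 2k scoring events. The function names are hypothetical.

```r
# Learning rate applied to each tree when the rate is scaled by
# `annealing` after every tree (assumed: tree t uses rate * annealing^(t - 1))
effective_learn_rate <- function(learn_rate, annealing, ntrees) {
  learn_rate * annealing^(seq_len(ntrees) - 1)
}

# Early-stopping check for a positive metric where lower is better.
# `scores` holds the metric from each scoring event, oldest first.
should_stop <- function(scores, stopping_rounds, stopping_tolerance = 0.001) {
  k <- stopping_rounds
  # Disabled when k is 0; cannot trigger before 2k scoring events
  if (k == 0 || length(scores) < 2 * k) return(FALSE)

  recent <- tail(scores, 2 * k)
  # k + 1 simple moving averages of window length k over the last 2k scores
  sma <- vapply(seq_len(k + 1),
                function(i) mean(recent[i:(i + k - 1)]),
                numeric(1))

  reference <- sma[1]        # moving average from k scoring events ago
  best_recent <- min(sma[-1]) # best moving average since then

  # Stop unless the best recent average beats the reference
  # by at least the relative tolerance
  improved <- best_recent < reference * (1 - stopping_tolerance)
  !improved
}

rates <- effective_learn_rate(0.1, 0.99, 3)
still_improving <- should_stop(c(10, 9, 8, 7, 6, 5), stopping_rounds = 2)
plateaued <- should_stop(c(10, 5, 5, 5, 5, 5), stopping_rounds = 2)
```

Here `still_improving` is FALSE because the moving average keeps falling, while `plateaued` is TRUE because the last four scores are identical.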

Details:

 The default distribution function will guess the model type based
 on the response column type. In order to run properly, the
 response column must be numeric for "gaussian" or an enum for
 "bernoulli" or "multinomial".

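The guess described in Details can be pictured with a small base-R sketch. This is an illustration of the stated rule only, not H2O's implementation. The function name is hypothetical, and the split between "bernoulli" (two levels) and "multinomial" (more than two) is an assumption based on the usual meaning of those distributions.

```r
# Guess a distribution from the type of the response column
guess_distribution <- function(y) {
  if (is.factor(y)) {
    # Categorical (enum) response: two classes vs. more than two
    if (nlevels(y) == 2) "bernoulli" else "multinomial"
  } else if (is.numeric(y)) {
    # Numeric response: treat as regression
    "gaussian"
  } else {
    stop("response must be numeric or a factor")
  }
}

guess_numeric <- guess_distribution(c(21.5, 30.1, 18.9))
guess_binary <- guess_distribution(factor(c("yes", "no", "yes")))
guess_multi <- guess_distribution(iris$Species)
```

A practical consequence is that an integer-coded class label has to be converted to a factor (for an H2OFrame, with `as.factor`) before it is treated as a classification target.
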
Published

03 September 2016