Commit eeeac41

Author: Mark Landry
Commit message: Bring up our version to master.
2 parents a86090c + 93e5c01 commit eeeac41

1 file changed

Lines changed: 170 additions & 56 deletions

# Ensembles: Stacking, Super Learner

- Overview
- What is Ensemble Learning?
[...]

In this tutorial, we will discuss ensemble learning with a focus on stacking, a type of ensemble learning also known as Super Learning. We present the H2O implementation of the Super Learner algorithm, called "H2O Ensemble."

Following the introduction to ensemble learning, we will dive into a hands-on code demo of the [h2oEnsemble](https://github.com/h2oai/h2o-3/tree/master/h2o-r/ensemble) R package.


# What is Ensemble Learning?

[...]

## Stacking / Super Learning

Stacking is a broad class of algorithms that involves training a second-level "metalearner" to ensemble a group of base learners. The type of ensemble learning implemented in H2O is called "super learning", "stacked regression" or "stacking."


### Some Background

[Leo Breiman](https://en.wikipedia.org/wiki/Leo_Breiman), known for his work on classification and regression trees and as the creator of the Random Forest algorithm, formalized stacking in his 1996 paper, ["Stacked Regressions"](http://statistics.berkeley.edu/sites/default/files/tech-reports/367.pdf). Although the idea originated with [David Wolpert](https://en.wikipedia.org/wiki/David_Wolpert) in 1992 under the name ["Stacked Generalization"](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.1533), the modern form of stacking that uses internal k-fold cross-validation was Dr. Breiman's contribution.

However, it wasn't until 2007 that the theoretical background for stacking was developed, and the algorithm took on the name "Super Learner". Until then, the mathematical reasons why stacking worked were unknown. The Super Learner algorithm learns the optimal combination of the base learner fits. In an article titled ["Super Learner"](http://dx.doi.org/10.2202/1544-6115.1309), [Mark van der Laan](http://www.stat.berkeley.edu/~laan/Laan/laan.html) et al. proved that the Super Learner ensemble represents an asymptotically optimal system for learning.

### Super Learner Algorithm

Here is an outline of the tasks involved in training and testing a Super Learner ensemble; a toy R sketch of the full procedure follows the outline.

#### Set up the ensemble
- Specify a list of L base algorithms (with a specific set of model parameters).
- Specify a metalearning algorithm.

#### Train the ensemble
- Train each of the L base algorithms on the training set.
- Perform k-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the L algorithms.
- The N cross-validated predicted values from each of the L algorithms can be combined to form a new N x L matrix (where N is the number of rows in the training set). This matrix, along with the original response vector, is called the "level-one" data.
- Train the metalearning algorithm on the level-one data.

The "ensemble model" consists of the L base learning models and the metalearning model.

#### Predict on new data
- To generate ensemble predictions, first generate predictions from the base learners.
- Feed those predictions into the metalearner to generate the ensemble prediction.
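
To make the outline concrete, here is a toy, self-contained R sketch of the stacking recipe (our own illustration using base R `glm`, not the h2oEnsemble implementation; the simulated data, the two base learners, and all object names are made up for this example):

```{r stacking_sketch}
# Toy stacking sketch (illustration only, not the h2oEnsemble code)
set.seed(1)
N <- 100
df <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
df$y <- rbinom(N, 1, plogis(df$x1 - df$x2))  # simulated binary outcome

# Set up the ensemble: L = 2 base "algorithms" plus a GLM metalearner
base_formulas <- list(y ~ x1, y ~ x2)
L <- length(base_formulas)

# Train the ensemble: k-fold CV predictions form the N x L level-one matrix
k <- 5
fold <- sample(rep(1:k, length.out = N))
Z <- matrix(NA_real_, nrow = N, ncol = L)
for (v in 1:k) {
  for (l in 1:L) {
    fit_l <- glm(base_formulas[[l]], data = df[fold != v, ], family = binomial)
    Z[fold == v, l] <- predict(fit_l, newdata = df[fold == v, ], type = "response")
  }
}
level_one <- data.frame(Z, y = df$y)                      # the "level-one" data
meta <- glm(y ~ ., data = level_one, family = binomial)   # the metalearner

# Predict on new data: base learner predictions feed the metalearner
# (here we score the training frame for brevity)
full_fits <- lapply(base_formulas, glm, data = df, family = binomial)
new_Z <- data.frame(sapply(full_fits, predict, newdata = df, type = "response"))
names(new_Z) <- names(level_one)[1:L]
ensemble_pred <- predict(meta, newdata = new_Z, type = "response")
```
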
[...]

The H2O Super Learner ensemble has been implemented as a stand-alone R package called [h2oEnsemble](https://github.com/h2oai/h2o-3/tree/master/h2o-r/ensemble). The package is an extension to the [h2o](https://cran.r-project.org/web/packages/h2o/index.html) R package that allows the user to train an ensemble containing H2O algorithms. As in the **h2o** R package, all of the actual computation in **h2oEnsemble** is performed inside the H2O cluster, rather than in R memory.

The main computational tasks in the Super Learner ensemble algorithm are the training and cross-validation of the base learners and metalearner. Therefore, implementing the "plumbing" of the ensemble in R (rather than in Java) does not incur a loss of performance. All training and data processing are performed in the high-performance H2O cluster.


## Install H2O Ensemble

To install the **h2oEnsemble** package, follow the installation instructions in the [README](https://github.com/h2oai/h2o-3/blob/master/h2o-r/ensemble/README.md#install) file, also documented here for convenience.

### H2O R Package

First, install the H2O R package if you don't already have it installed. The R installation instructions are at [http://h2o.ai/download](http://h2o.ai/download); a sketch of a typical install follows.
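
A minimal sketch, assuming the latest stable h2o release on CRAN is acceptable (the download page above may recommend a newer build with extra steps):

```{r install_h2o}
# Sketch: install the h2o R package from CRAN and verify that it loads.
# The h2o.ai download page may recommend a different (newer) build.
install.packages("h2o")
library(h2o)
packageVersion("h2o")
```
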

### H2O Ensemble R Package

The recommended way of installing the **h2oEnsemble** R package is directly from GitHub using the [devtools](https://cran.r-project.org/web/packages/devtools/index.html) package (however, [H2O World](http://h2oworld.h2o.ai/) tutorial attendees should install the package from the provided USB stick).

#### Install from GitHub
```{r install_h2oEnsemble}
library(devtools)
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")
```


## Demo

This is an example of binary classification using the `h2o.ensemble` function, which is available in **h2oEnsemble**.


### Start H2O Cluster
```{r start_h2o}
library(h2oEnsemble)  # This will load the `h2o` R package as well
h2o.init(nthreads = -1)  # Start an H2O cluster with nthreads = num cores on your machine
h2o.removeAll()  # Clean slate - just in case the cluster was already running
```


### Load Data into H2O Cluster

First, set the path to the directory in which the tutorial is located on the server that runs H2O (here, locally):

```{r set_path}
ROOT_PATH <- "/Users/me/h2oai/world/h2o-world-2015-training/tutorials"
```

Import a sample binary outcome train and test set into the H2O cluster:
```{r import_data}
train <- h2o.importFile(paste0(ROOT_PATH, "/data/higgs_10k.csv"))
test <- h2o.importFile(paste0(ROOT_PATH, "/data/higgs_test_5k.csv"))
y <- "C1"
x <- setdiff(names(train), y)
```
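
If you don't have a local copy of the tutorial data, the same files can be imported directly from a URL, as in an earlier version of this demo:

```{r import_data_url}
# Alternative: import the train and test sets straight from a URL
# (these URLs come from an earlier revision of this tutorial).
train <- h2o.importFile("http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv")
test <- h2o.importFile("http://www.stat.berkeley.edu/~ledell/data/higgs_test_5k.csv")
```
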

For binary classification, the response should be encoded as a factor (aka "enum" in Java). The user can specify column types in the `h2o.importFile` command, or convert the response column as follows:

```{r convert_response}
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])
```


### Specify Base Learners & Metalearner
For this example, we will use the default base learner library, which includes the H2O GLM, Random Forest, GBM and Deep Learning algorithms (all using default model parameter values). We will also use the default metalearner, the H2O GLM.

```{r}
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
             "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"
```
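
Any wrapper can serve as the metalearner; an earlier revision of this tutorial used a deep learning metalearner instead of the GLM. A sketch of that alternative, commented out so the GLM metalearner above stays in effect for the results reported below:

```{r alt_metalearner}
# Alternative from an earlier revision of this tutorial:
# metalearner <- "h2o.deeplearning.wrapper"
```
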


### Train an Ensemble
Train the ensemble using 5-fold CV to generate level-one data. Note that more CV folds will take longer to train, but should increase performance.
```{r train_ensemble}
fit <- h2o.ensemble(x = x, y = y,
                    training_frame = train,
                    family = "binomial",
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5))
```


### Predict
Generate predictions on the test set.
```{r predict_ensemble}
pred <- predict(fit, test)
predictions <- as.data.frame(pred$pred)[,3]  # third column, p1, is P(Y==1)
labels <- as.data.frame(test[,y])[,1]
```

### Model Evaluation

Since the response is binomial, we can use Area Under the ROC Curve (AUC) to evaluate the model performance. We first generate predictions on the test set and then calculate test set AUC using the [cvAUC](https://cran.r-project.org/web/packages/cvAUC/) R package.

```{r ensemble_auc}
# Ensemble test AUC
library(cvAUC)  # Used to calculate test set AUC
cvAUC::AUC(predictions = predictions, labels = labels)
# 0.7888723

# Base learner test AUC (for comparison)
L <- length(learner)
auc <- sapply(seq(L), function(l) cvAUC::AUC(predictions = as.data.frame(pred$basepred)[,l], labels = labels))
data.frame(learner, auc)
#                    learner       auc
# 1          h2o.glm.wrapper 0.6871288
# 2 h2o.randomForest.wrapper 0.7711654
# 3          h2o.gbm.wrapper 0.7817075
# 4 h2o.deeplearning.wrapper 0.7425813
```
Note that the ensemble results above are not reproducible since `h2o.deeplearning` is not reproducible when using multiple cores, and we did not set a seed for `h2o.randomForest.wrapper`; a sketch of seeded wrappers follows.
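
If reproducibility matters, one option is a sketch like the following (our own suggestion, not part of the original demo; it assumes the wrappers forward extra arguments such as `reproducible` to the underlying H2O functions). We pin a `seed` in every wrapper and use H2O Deep Learning's `reproducible = TRUE` mode, which forces single-threaded training and is therefore much slower:

```{r seeded_wrappers}
# Sketch: seeded wrapper variants for reproducible results.
h2o.randomForest.seeded <- function(..., seed = 1)
  h2o.randomForest.wrapper(..., seed = seed)
h2o.gbm.seeded <- function(..., seed = 1)
  h2o.gbm.wrapper(..., seed = seed)
h2o.deeplearning.seeded <- function(..., seed = 1, reproducible = TRUE)
  h2o.deeplearning.wrapper(..., seed = seed, reproducible = reproducible)

learner_seeded <- c("h2o.glm.wrapper", "h2o.randomForest.seeded",
                    "h2o.gbm.seeded", "h2o.deeplearning.seeded")
```
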
Additional note: In a future version, performance metrics such as AUC will be computed automatically, as in the other H2O algos.


### Specifying New Learners

Now let's try again with a more extensive set of base learners. Here is an example of how to generate custom learner wrappers:

```{r custom_learners}
h2o.glm.1 <- function(..., alpha = 0.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.glm.2 <- function(..., alpha = 0.5) h2o.glm.wrapper(..., alpha = alpha)
h2o.glm.3 <- function(..., alpha = 1.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.randomForest.1 <- function(..., ntrees = 200, nbins = 50, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
h2o.randomForest.2 <- function(..., ntrees = 200, sample_rate = 0.75, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, sample_rate = sample_rate, seed = seed)
h2o.randomForest.3 <- function(..., ntrees = 200, sample_rate = 0.85, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, sample_rate = sample_rate, seed = seed)
h2o.randomForest.4 <- function(..., ntrees = 200, nbins = 50, balance_classes = TRUE, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, balance_classes = balance_classes, seed = seed)
h2o.gbm.1 <- function(..., ntrees = 100, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, seed = seed)
h2o.gbm.2 <- function(..., ntrees = 100, nbins = 50, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
h2o.gbm.3 <- function(..., ntrees = 100, max_depth = 10, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, max_depth = max_depth, seed = seed)
h2o.gbm.4 <- function(..., ntrees = 100, col_sample_rate = 0.8, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.5 <- function(..., ntrees = 100, col_sample_rate = 0.7, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.6 <- function(..., ntrees = 100, col_sample_rate = 0.6, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.7 <- function(..., ntrees = 100, balance_classes = TRUE, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, balance_classes = balance_classes, seed = seed)
h2o.gbm.8 <- function(..., ntrees = 100, max_depth = 3, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, max_depth = max_depth, seed = seed)
h2o.deeplearning.1 <- function(..., hidden = c(500,500), activation = "Rectifier", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.2 <- function(..., hidden = c(200,200,200), activation = "Tanh", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.3 <- function(..., hidden = c(500,500), activation = "RectifierWithDropout", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.4 <- function(..., hidden = c(500,500), activation = "Rectifier", epochs = 50, balance_classes = TRUE, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, balance_classes = balance_classes, seed = seed)
h2o.deeplearning.5 <- function(..., hidden = c(100,100,100), activation = "Rectifier", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.6 <- function(..., hidden = c(50,50), activation = "Rectifier", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.7 <- function(..., hidden = c(100,100), activation = "Rectifier", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
```


Let's grab a subset of these learners for our base learner library and re-train the ensemble.

```{r}
learner <- c("h2o.glm.wrapper",
             "h2o.randomForest.1", "h2o.randomForest.2",
             "h2o.gbm.1", "h2o.gbm.6", "h2o.gbm.8",
             "h2o.deeplearning.1", "h2o.deeplearning.6", "h2o.deeplearning.7")
```

Train with the new library:
```{r}
fit <- h2o.ensemble(x = x, y = y,
                    training_frame = train,
                    validation_frame = NULL,
                    family = "binomial",
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5))
```

Evaluate the performance:
```{r}
# Regenerate the test set predictions with the new fit before scoring
pred <- predict(fit, test)
predictions <- as.data.frame(pred$pred)[,3]
cvAUC::AUC(predictions = predictions, labels = labels)
# 0.7904223
```
We see an increase in performance by including a more diverse library.

Base learner test AUC (for comparison):
```{r}
L <- length(learner)
auc <- sapply(seq(L), function(l) cvAUC::AUC(predictions = as.data.frame(pred$basepred)[,l], labels = labels))
data.frame(learner, auc)

#              learner       auc
# 1    h2o.glm.wrapper 0.6871288
# 2 h2o.randomForest.1 0.7809140
# 3 h2o.randomForest.2 0.7835352
# 4          h2o.gbm.1 0.7816863
# 5          h2o.gbm.6 0.7821683
# 6          h2o.gbm.8 0.7804483
# 7 h2o.deeplearning.1 0.7160903
# 8 h2o.deeplearning.6 0.7272538
# 9 h2o.deeplearning.7 0.7379495
```

So what happens to the ensemble if we remove some of the weaker learners? Let's remove the GLM and DL from the learner library and see what happens.

Here is a more stripped-down version of the ensemble:
```{r}
learner <- c("h2o.randomForest.1", "h2o.randomForest.2",
             "h2o.gbm.1", "h2o.gbm.6", "h2o.gbm.8")

fit <- h2o.ensemble(x = x, y = y,
                    training_frame = train,
                    validation_frame = NULL,
                    family = "binomial",
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5))

# Generate predictions on the test set
pred <- predict(fit, test)
predictions <- as.data.frame(pred$pred)[,3]  # third column, p1, is P(Y==1)
labels <- as.data.frame(test[,y])[,1]

# Ensemble test AUC
cvAUC::AUC(predictions = predictions, labels = labels)
# 0.7887694
```

We actually lose performance by removing the weak learners! Even a relatively weak base learner such as the GLM can improve the ensemble: the metalearner learns how much weight to give each learner's predictions, so a diverse library tends to beat a smaller library of only the strongest individual models.


## Roadmap for H2O Ensemble
H2O Ensemble is currently available only via the R API; however, it will be accessible via all of our APIs in a future release. You can follow the progress of H2O Ensemble development on the [H2O JIRA](https://0xdata.atlassian.net/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+PUBDEV+AND+component+%3D+Ensemble) (tickets with the "Ensemble" tag).
