# Ensembles: Stacking, Super Learner
- Overview
- What is Ensemble Learning?

In this tutorial, we will discuss ensemble learning with a focus on a type of ensemble learning called stacking or Super Learning. We present the H2O implementation of the Super Learner algorithm, called "H2O Ensemble."

Following the introduction to ensemble learning, we will dive into a hands-on code demo of the [h2oEnsemble](https://github.com/h2oai/h2o-3/tree/master/h2o-r/ensemble) R package.
# What is Ensemble Learning?

Both bagging and boosting are ensembles that take a collection of weak learners and form a single, strong learner.
## Stacking / Super Learning
Stacking is a broad class of algorithms that involves training a second-level "metalearner" to ensemble a group of base learners. The type of ensemble learning implemented in H2O is called "super learning", "stacked regression" or "stacking."
### Some Background
[Leo Breiman](https://en.wikipedia.org/wiki/Leo_Breiman), known for his work on classification and regression trees and the creator of the Random Forest algorithm, formalized stacking in his 1996 paper, ["Stacked Regressions"](http://statistics.berkeley.edu/sites/default/files/tech-reports/367.pdf). Although the idea originated with [David Wolpert](https://en.wikipedia.org/wiki/David_Wolpert) in 1992 under the name ["Stacked Generalization"](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.1533), the modern form of stacking that uses internal k-fold cross-validation was Dr. Breiman's contribution.

However, it wasn't until 2007 that the theoretical background for stacking was developed, and the algorithm took on the name "Super Learner". Until this time, the mathematical reasons why stacking worked were unknown. The Super Learner algorithm learns the optimal combination of the base learner fits. In an article titled "[Super Learner](http://dx.doi.org/10.2202/1544-6115.1309)," [Mark van der Laan](http://www.stat.berkeley.edu/~laan/Laan/laan.html) et al. proved that the Super Learner ensemble represents an asymptotically optimal system for learning.
### Super Learner Algorithm

Here is an outline of the tasks involved in training and testing a Super Learner ensemble; a small standalone R sketch of the procedure follows the outline.
#### Set up the ensemble
- Specify a list of L base algorithms (with a specific set of model parameters).
- Specify a metalearning algorithm.
#### Train the ensemble
- Train each of the L base algorithms on the training set.
- Perform k-fold cross-validation on each of these learners and collect the cross-validated predicted values from each of the L algorithms.
- The N cross-validated predicted values from each of the L algorithms can be combined to form a new N x L matrix (where N is the number of rows in the training set). This matrix, along with the original response vector, is called the "level-one" data.
- Train the metalearning algorithm on the level-one data.

The "ensemble model" consists of the L base learning models and the metalearning model.
#### Predict on new data
- To generate ensemble predictions, first generate predictions from the base learners.
- Feed those predictions into the metalearner to generate the ensemble prediction.
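
To make these steps concrete, here is a minimal, self-contained sketch of the procedure in plain R with two toy base learners. This is illustrative only; **h2oEnsemble** performs all of these steps for you inside the H2O cluster, and the toy learners and data below are not part of the original tutorial.

```{r superlearner_sketch}
set.seed(1)
N <- 1000; k <- 5
x1 <- rnorm(N); x2 <- rnorm(N)
y <- rbinom(N, 1, plogis(x1 - x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)
folds <- sample(rep(1:k, length.out = N))  # k-fold assignments

# L = 2 toy base learners: logistic regressions on different predictors
base_learners <- list(
  function(d) glm(y ~ x1, family = binomial, data = d),
  function(d) glm(y ~ x2, family = binomial, data = d)
)

# Build the N x L "level-one" matrix of cross-validated predictions
Z <- matrix(NA_real_, nrow = N, ncol = length(base_learners))
for (v in 1:k) {
  for (l in seq_along(base_learners)) {
    fit_l <- base_learners[[l]](dat[folds != v, ])
    Z[folds == v, l] <- predict(fit_l, newdata = dat[folds == v, ], type = "response")
  }
}

# Metalearner: learn how to combine the base learner predictions
meta <- glm(y ~ Z, family = binomial)
```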
The H2O Super Learner ensemble has been implemented as a stand-alone R package called [h2oEnsemble](https://github.com/h2oai/h2o-3/tree/master/h2o-r/ensemble). The package is an extension to the [h2o](https://cran.r-project.org/web/packages/h2o/index.html) R package that allows the user to train an ensemble containing H2O algorithms. As in the **h2o** R package, all of the actual computation in **h2oEnsemble** is performed inside the H2O cluster, rather than in R memory.

The main computational tasks in the Super Learner ensemble algorithm are the training and cross-validation of the base learners and metalearner. Therefore, implementing the "plumbing" of the ensemble in R (rather than in Java) does not incur a loss of performance. All training and data processing are performed in the high-performance H2O cluster.
## Install H2O Ensemble

To install the **h2oEnsemble** package, follow the installation instructions in the [README](https://github.com/h2oai/h2o-3/blob/master/h2o-r/ensemble/README.md#install) file, which are also documented here for convenience.
### H2O R Package

First, install the H2O R package if you don't already have it. The R installation instructions are at [http://h2o.ai/download](http://h2o.ai/download).
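
For example, one option is the CRAN release (a sketch; the download page above may recommend a newer build with its own install command):

```{r install_h2o, eval = FALSE}
install.packages("h2o")  # CRAN release; see http://h2o.ai/download for the latest build
```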
### H2O Ensemble R Package
The recommended way of installing the **h2oEnsemble** R package is directly from GitHub using the [devtools](https://cran.r-project.org/web/packages/devtools/index.html) package (however, [H2O World](http://h2oworld.h2o.ai/) tutorial attendees should install the package from the provided USB stick).
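
A sketch of the **devtools** install, assuming the package subdirectory path used in the h2o-3 repository (check the README for the current command):

```{r install_h2oEnsemble, eval = FALSE}
library(devtools)
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")
```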

After installation, load the **h2oEnsemble** package (which loads **h2o** as a dependency), start an H2O cluster, and import the Higgs demo data. The training-set import and the definition of `ROOT_PATH` are elided in this excerpt, so the lines reconstructing them below are assumptions:

```{r load_data}
library(h2oEnsemble)     # also loads the h2o R package
h2o.init(nthreads = -1)  # start an H2O cluster using all available cores

# ROOT_PATH is assumed to point at the tutorial directory;
# the training-set filename is an assumption (its import line is elided here)
train <- h2o.importFile(paste0(ROOT_PATH, "/data/higgs_train_5k.csv"))
test <- h2o.importFile(paste0(ROOT_PATH, "/data/higgs_test_5k.csv"))

y <- "C1"
x <- setdiff(names(train), y)
```

For binary classification, the response should be encoded as a factor (aka "enum" in Java). You can specify column types in the `h2o.importFile` command, or convert the response column as follows:
```{r convert_response}
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])
```
### Specify Base Learners & Metalearner

For this example, we will use the default base learner library, which includes the H2O GLM, Random Forest, GBM and Deep Learning algorithms (all using default model parameter values). We will also use the default metalearner, the H2O GLM.
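
The code defining the library is elided in this excerpt; based on the wrapper function names that appear in the evaluation output below, it is presumably equivalent to the following sketch:

```{r specify_learners}
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
             "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"
```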
Train the ensemble using 5-fold CV to generate level-one data. Note that more CV folds will take longer to train, but should increase performance.

```{r train_ensemble}
fit <- h2o.ensemble(x = x, y = y,
                    training_frame = train,
                    family = "binomial",
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5))
```
### Predict
Generate predictions on the test set.

```{r predict_ensemble}
pred <- predict(fit, test)
predictions <- as.data.frame(pred$pred)[,3]  # third column, p1, is P(Y==1)
labels <- as.data.frame(test[,y])[,1]
```
### Model Evaluation

Since the response is binomial, we can use Area Under the ROC Curve (AUC) to evaluate model performance. Using the test set predictions generated above, we calculate the test set AUC with the [cvAUC](https://cran.r-project.org/web/packages/cvAUC/) R package.

```{r ensemble_auc}
library(cvAUC)  # Used to calculate test set AUC (cvAUC version >= 1.0.1)

# Ensemble test AUC (the printed value is elided in this excerpt)
AUC(predictions = predictions, labels = labels)

# Base learner test AUC, for comparison (reconstructed lines: pred$basepred
# holds the base learner predictions on the test set)
L <- length(learner)
auc <- sapply(seq(L), function(l) AUC(predictions = as.data.frame(pred$basepred)[, l], labels = labels))
data.frame(learner, auc)
#                     learner       auc
# 1           h2o.glm.wrapper 0.6871288
# 2  h2o.randomForest.wrapper 0.7711654
# 3           h2o.gbm.wrapper 0.7817075
# 4  h2o.deeplearning.wrapper 0.7425813
```
Note that the ensemble results above are not reproducible since `h2o.deeplearning` is not reproducible when using multiple cores, and we did not set a seed for `h2o.randomForest.wrapper`.

Additional note: In a future version, performance metrics such as AUC will be computed automatically, as in the other H2O algos.
### Specifying New Learners
Here is an example of how to generate a base learner library using custom base learners:
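
The code for this step is elided in the excerpt; the sketch below shows the package's idiom, with illustrative (non-original) parameter values: a custom learner is an ordinary R function that calls one of the built-in `h2o.*.wrapper` functions with non-default parameters, and the library is specified as a character vector of function names.

```{r custom_learners}
# Custom base learners: fix non-default parameters of the built-in wrappers
h2o.glm.1 <- function(..., alpha = 0.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.randomForest.1 <- function(..., ntrees = 200, nbins = 50, seed = 1)
  h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
h2o.gbm.1 <- function(..., ntrees = 100, seed = 1)
  h2o.gbm.wrapper(..., ntrees = ntrees, seed = seed)

# The base learner library is a character vector of function names
learner <- c("h2o.glm.1", "h2o.randomForest.1", "h2o.gbm.1")
```

Note that fixing `seed` in the wrappers, as above, also addresses the reproducibility caveat from the previous section.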

So what happens to the ensemble if we remove some of the weaker learners? Let's remove the GLM and DL from the learner library and see what happens.

Here is a more stripped-down version of the ensemble:
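
The code for this experiment is elided in the excerpt; a sketch, keeping only the two strongest base learners from the comparison above (names and values otherwise as before):

```{r smaller_ensemble}
learner <- c("h2o.randomForest.wrapper", "h2o.gbm.wrapper")

fit2 <- h2o.ensemble(x = x, y = y,
                     training_frame = train,
                     family = "binomial",
                     learner = learner,
                     metalearner = metalearner,
                     cvControl = list(V = 5))

pred2 <- predict(fit2, test)
predictions2 <- as.data.frame(pred2$pred)[,3]
AUC(predictions = predictions2, labels = labels)  # resulting AUC elided in the excerpt
```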
We actually lose performance by removing the weak learners!
## Roadmap for H2O Ensemble

H2O Ensemble is currently only available via the R API; however, it will be accessible from all of our APIs in a future release. You can follow the progress of H2O Ensemble development on the [H2O JIRA](https://0xdata.atlassian.net/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+PUBDEV+AND+component+%3D+Ensemble) (tickets with the "Ensemble" tag).