---
title: "H2O World GLM Demo"
author: "Tomas Nykodym"
date: "November 7, 2015"
output: html_document
---
* Introduction
* Installation and Startup
* Cover Type Dataset
* Multinomial Model
* Binomial Model
* Adding extra features
* Multinomial Model Revisited
## Introduction
This tutorial shows how an H2O [GLM](http://en.wikipedia.org/wiki/Generalized_linear_model) model can be used for binary and multi-class classification. It covers usage of H2O from R; a Python version of this tutorial will be available as well, in a separate document. This file is available in plain R, R markdown and regular markdown formats, and the plots are available as PDF files. All documents are available [on Github](https://github.com/h2oai/h2o-world-2015-training/raw/master/tutorials/glm/).
If run from plain R, execute R in the directory of this script. If run from RStudio, be sure to `setwd()` to the location of this script. `h2o.init()` starts H2O in R's current working directory, and `h2o.importFile()` looks for files relative to where H2O was started.
More examples and explanations can be found in our [H2O GLM booklet](http://h2o.ai/resources/) and on our [H2O Github Repository](http://github.com/h2oai/h2o-3/).
### H2O R Package
Load the H2O R package:
```{r load_library}
## R installation instructions are at http://h2o.ai/download
library(h2o)
```
### Start H2O
Start up a 1-node H2O server on your local machine, and allow it to use all CPU cores and up to 2GB of memory:
```{r start_h2o}
h2o.init(nthreads=-1, max_mem_size="2G")
h2o.removeAll() ## clean slate - just in case the cluster was already running
```
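To confirm the cluster came up with the resources we asked for, we can print the cluster information. This quick check is an addition to the original demo:
```{r cluster_info}
h2o.clusterInfo() ## prints the H2O version, number of nodes, total memory and cores
```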
## Cover Type Data
Predicting forest cover type from cartographic variables only (no remotely sensed data).
Let's import the dataset:
```{r import_data}
D = h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
summary(D)
```
We have 11 numeric and two categorical features. The response is "Cover_Type" and it has 7 classes.
Let's split the data into Train/Test/Validation, with Train getting 70% of the rows and Test and Validation 15% each:
```{r split_data}
data = h2o.splitFrame(D,ratios=c(.7,.15),destination_frames = c("train","test","valid"))
names(data) <- c("Train","Test","Valid")
y = "Cover_Type"
x = names(data$Train)
x = x[-which(x==y)]
```
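As a quick sanity check (not part of the original demo), we can confirm the split proportions are roughly 70/15/15. Note that h2o.splitFrame assigns rows probabilistically, so the counts are approximate:
```{r check_split}
## row counts of the three splits
nrow(data$Train)
nrow(data$Test)
nrow(data$Valid)
```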
## Multinomial Model
Now that the data is in, let's run GLM. As mentioned previously, Cover_Type is the response and we use all other columns as predictors.
We have a multi-class problem, so we pick family=multinomial. The L-BFGS solver tends to be faster on multinomial problems, so we pick L-BFGS for our first try.
The rest of the parameters can remain at their defaults.
```{r build_model1}
m1 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS')
h2o.confusionMatrix(m1, valid=TRUE)
```
The model predicts only the majority class, so it's of no use at all! Maybe we regularized it too much; let's try again without regularization:
```{r build_model2}
m2 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS', lambda = 0)
h2o.confusionMatrix(m2, valid=FALSE) # confusion matrix on the training data
h2o.confusionMatrix(m2, valid=TRUE)  # confusion matrix on the validation data
```
No overfitting (train and validation performance are the same), so regularization is not needed in this case.
This model is actually useful. It got a 28% classification error, down from the 51% obtained by predicting the majority class only.
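To see where the 51% baseline comes from, here is a small sketch (an addition to this demo, assuming h2o.table's usual output with a Count column) that derives the majority-class error from the response distribution:
```{r majority_baseline}
## class counts of the response on the training data
cnts = as.data.frame(h2o.table(data$Train$Cover_Type))
## error of a classifier that always predicts the most frequent class
1 - max(cnts$Count) / sum(cnts$Count)
```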
## Binomial Model
Multinomial models are difficult and time-consuming; let's try a simpler binary classification.
We'll take a subset of the data with only class_1 and class_2 (the two majority classes) and build a binomial model deciding between them.
```{r get_and_split_binomial_data}
D_binomial = D[D$Cover_Type %in% c("class_1","class_2"),]
D_binomial$Cover_Type <- h2o.setLevels(D_binomial$Cover_Type, c("class_1","class_2")) # drop the 5 unused levels
# split to train/test/validation again
data_binomial = h2o.splitFrame(D_binomial,ratios=c(.7,.15),destination_frames = c("train_b","test_b","valid_b"))
names(data_binomial) <- c("Train","Test","Valid")
```
We can run binomial model now:
```{r build_binomial_model_1}
m_binomial = h2o.glm(training_frame = data_binomial$Train, validation_frame = data_binomial$Valid, x = x, y = y, family='binomial',lambda=0)
h2o.confusionMatrix(m_binomial, valid = FALSE) # confusion matrix on the training data
h2o.confusionMatrix(m_binomial, valid = TRUE)  # confusion matrix on the validation data
```
The output for a binomial problem is slightly different from the multinomial one. The confusion matrix now has a threshold attached to it.
The model produces probabilities of class_1 and class_2 similarly to the multinomial example earlier; however, this time we have only two classes and we can tune the classification to our needs. The classification errors in the binomial case have special meanings: we call them false positives and false negatives. In reality, each can have a different cost associated with it, and we want to tune our classifier accordingly. A common way to evaluate a binary classifier's performance is to look at its [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), which plots the true positive rate versus the false positive rate. We can plot it from the H2O model output:
```{r build_binomial_model_output_1}
fpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$fpr
tpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$tpr
fpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$fpr
tpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$tpr
plot(fpr,tpr, type='l')
title('ROC curve')
lines(fpr_val,tpr_val,type='l',col='red')
legend("bottomright",c("Train", "Validation"),col=c("black","red"),lty=c(1,1),lwd=c(3,3))
```
The area under the ROC curve (AUC) is a common "goodness" metric for binary classifiers. We got:
```{r build_binomial_model_output_2}
h2o.auc(m_binomial, valid=FALSE) # on the training data
h2o.auc(m_binomial, valid=TRUE)  # on the validation data
```
The default confusion matrix is computed at the threshold that maximizes the [F1 score](https://en.wikipedia.org/wiki/F1_score). We can choose different thresholds; the H2O output shows the optimal thresholds for several common metrics.
```{r build_binomial_model_output_3}
m_binomial@model$training_metrics@metrics$max_criteria_and_metric_scores
```
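If F1 is not the metric we care about, we can read off a different operating point. A minimal sketch, assuming the h2o.find_threshold_by_max_metric helper is available in your version of the package:
```{r pick_threshold}
## threshold maximizing accuracy (instead of F1), taken from the validation metrics
perf = h2o.performance(m_binomial, valid = TRUE)
h2o.find_threshold_by_max_metric(perf, "accuracy")
```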
The model we just built gets a 23% classification error at the F1-optimizing threshold, so there is still room for improvement.
Let's add some features:
* There are 11 numerical predictors in the dataset; we will cut them into intervals and add a categorical variable for each
* We can add interaction terms capturing interactions between categorical variables
Let's make a convenience function that cuts a column into intervals, working on all three of our datasets (Train/Validation/Test).
We'll use h2o.hist to determine the interval boundaries on the Train set (there are many other ways to do this!).
We'll take only the bins with non-trivial support:
```{r cut_column}
cut_column <- function(data, col) {
  # need lower/upper bound due to h2o.cut behavior
  # (points < the first break or > the last break are replaced with missing values)
  min_val = min(data$Train[,col]) - 1
  max_val = max(data$Train[,col]) + 1
  x = h2o.hist(data$Train[,col])
  # use only the breaks with enough support
  breaks = x$breaks[which(x$counts > 1000)]
  # assign level names
  # (note: breaks[2:length(breaks)-1] is breaks[(2:length(breaks))-1], i.e. all but the last break)
  lvls = c("min", paste("i_", breaks[2:length(breaks)-1], sep=""), "max")
  col_cut <- paste(col, "_cut", sep="")
  data$Train[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Train[,col], breaks = c(min_val, breaks, max_val)), lvls)
  # now do the same for test and validation, but using the breaks computed on the training data!
  if(!is.null(data$Test)) {
    min_val = min(data$Test[,col]) - 1
    max_val = max(data$Test[,col]) + 1
    data$Test[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Test[,col], breaks = c(min_val, breaks, max_val)), lvls)
  }
  if(!is.null(data$Valid)) {
    min_val = min(data$Valid[,col]) - 1
    max_val = max(data$Valid[,col]) + 1
    data$Valid[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Valid[,col], breaks = c(min_val, breaks, max_val)), lvls)
  }
  data
}
```
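A standalone call would look like the sketch below (illustrative only and not evaluated here, since add_features later applies it to every numeric column for us):
```{r cut_column_usage, eval=FALSE}
## cut Elevation into intervals and inspect the resulting factor levels
data_binomial <- cut_column(data_binomial, 'Elevation')
h2o.levels(data_binomial$Train$Elevation_cut)
```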
Now let's make a convenience function that generates interaction terms on all three of our datasets. We use h2o.interaction:
```{r generate_interactions}
interactions <- function(data, cols, pairwise = TRUE) {
  iii = h2o.interaction(data = data$Train, destination_frame = "itrain", factors = cols, pairwise = pairwise, max_factors = 1000, min_occurrence = 100)
  data$Train <- h2o.cbind(data$Train, iii)
  if(!is.null(data$Test)) {
    iii = h2o.interaction(data = data$Test, destination_frame = "itest", factors = cols, pairwise = pairwise, max_factors = 1000, min_occurrence = 100)
    data$Test <- h2o.cbind(data$Test, iii)
  }
  if(!is.null(data$Valid)) {
    iii = h2o.interaction(data = data$Valid, destination_frame = "ivalid", factors = cols, pairwise = pairwise, max_factors = 1000, min_occurrence = 100)
    data$Valid <- h2o.cbind(data$Valid, iii)
  }
  data
}
```
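Again, an illustrative (not evaluated) sketch of a standalone call, since add_features below calls this for us:
```{r interactions_usage, eval=FALSE}
## pairwise interactions between the two original categorical columns
data_binomial <- interactions(data_binomial, c("Wilderness_Area","Soil_Type"))
```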
Lastly, let's wrap the feature generation into a separate function, as we will use it again later.
We'll add intervals for each numeric column, and we'll add interactions between pairs of categorical columns.
```{r add_features}
# add features to our cover type example
# let's cut all the numerical columns into intervals and add interactions between categorical terms
add_features <- function(data) {
  names(data) <- c("Train","Test","Valid")
  data = cut_column(data, 'Elevation')
  data = cut_column(data, 'Hillshade_Noon')
  data = cut_column(data, 'Hillshade_9am')
  data = cut_column(data, 'Hillshade_3pm')
  data = cut_column(data, 'Horizontal_Distance_To_Hydrology')
  data = cut_column(data, 'Slope')
  data = cut_column(data, 'Horizontal_Distance_To_Roadways')
  data = cut_column(data, 'Aspect')
  # pairwise interactions between all categorical columns
  interaction_cols = c("Elevation_cut","Wilderness_Area","Soil_Type","Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut","Horizontal_Distance_To_Hydrology_cut","Slope_cut","Horizontal_Distance_To_Roadways_cut","Aspect_cut")
  data = interactions(data, interaction_cols)
  # interactions between the Hillshade columns
  interaction_cols2 = c("Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut")
  data = interactions(data, interaction_cols2, pairwise = FALSE)
  data
}
```
Now we generate the new features and add them to the dataset. We also need to regenerate the predictor names, as we added more columns:
```{r add_features_binomial}
# Add Features
data_binomial_ext <- add_features(data_binomial)
data_binomial_ext$Train <- h2o.assign(data_binomial_ext$Train,"train_b_ext")
data_binomial_ext$Valid <- h2o.assign(data_binomial_ext$Valid,"valid_b_ext")
data_binomial_ext$Test <- h2o.assign(data_binomial_ext$Test,"test_b_ext")
y = "Cover_Type"
x = names(data_binomial_ext$Train)
x = x[-which(x==y)]
```
Let's build the model. We should add some regularization this time because we added correlated variables; let's try the default settings:
```{r build_binomial_ext_1}
m_binomial_1_ext = try(h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial'))
```
Oops, that does not run: we now have more features than the default method can solve with 2GB of RAM. Let's try L-BFGS instead.
```{r build_binomial_ext_2}
m_binomial_1_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS')
h2o.confusionMatrix(m_binomial_1_ext)
h2o.auc(m_binomial_1_ext,valid=TRUE)
```
Not better; maybe too much regularization? Let's pick a smaller lambda and try again.
```{r build_binomial_ext_3}
m_binomial_2_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS', lambda=1e-4)
h2o.confusionMatrix(m_binomial_2_ext, valid=TRUE)
h2o.auc(m_binomial_2_ext,valid=TRUE)
```
Way better: we got an AUC of .91 and a classification error of 0.180838.
We picked our regularization strength arbitrarily. Also, we used only the l2 penalty, but we added a lot of extra features, some of which may be useless.
Maybe we can do better with an l1 penalty.
So we want to run a lambda search to find the optimal penalty strength, and we want a non-zero l1 penalty to get a sparse solution.
We'll use the IRLSM solver this time, as it does much better with lambda search and the l1 penalty.
Recall we were not able to use it before. We can use it now because running with lambda search filters out a large portion of the inactive (coefficient == 0) predictors.
```{r build_binomial_ext_4}
m_binomial_3_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', lambda_search=TRUE)
h2o.confusionMatrix(m_binomial_3_ext, valid=TRUE)
h2o.auc(m_binomial_3_ext,valid=TRUE)
```
Better yet: we have a 17% error and we used only 3000 out of the 7000 features.
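We can verify the sparsity claim by counting the active coefficients ourselves. A quick sketch, assuming h2o.coef returns the full named coefficient vector of the GLM:
```{r count_active}
coefs = h2o.coef(m_binomial_3_ext)
sum(coefs != 0) ## number of active (non-zero) coefficients, including the intercept
length(coefs)   ## total number of coefficients
```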
OK, our new features improved the binomial model significantly; let's revisit our former multinomial model and see if they make a difference there (they should!):
```{r build_multinomial_ext_4}
# Multinomial Model 2
# let's revisit the multinomial case with our new features
data_ext <- add_features(data)
data_ext$Train <- h2o.assign(data_ext$Train,"train_m_ext")
data_ext$Valid <- h2o.assign(data_ext$Valid,"valid_m_ext")
data_ext$Test <- h2o.assign(data_ext$Test,"test_m_ext")
y = "Cover_Type"
x = names(data_ext$Train)
x = x[-which(x==y)]
m2 = h2o.glm(training_frame = data_ext$Train, validation_frame = data_ext$Valid, x = x, y = y,family='multinomial',solver='L_BFGS',lambda=1e-4)
# 21% err down from 28%
h2o.confusionMatrix(m2, valid=TRUE)
```
Improved considerably: 21% error instead of 28%.
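Note that the Test split has been held out the whole time, so as a final step (an addition to the original demo) we can score the extended multinomial model on it for an unbiased error estimate:
```{r final_test_check}
## confusion matrix on the completely held-out test set
h2o.confusionMatrix(h2o.performance(m2, newdata = data_ext$Test))
```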