Commit e97df50

Tomas Nykodym committed: Added Rmd file for glm demo.
1 parent cd02331 commit e97df50

3 files changed
Lines changed: 249 additions & 2 deletions

tutorials/deeplearning/deeplearning.Rmd
Lines changed: 0 additions & 1 deletion

@@ -2,7 +2,6 @@
  * Introduction
  * Installation and Startup
- * Decision Boundaries
  * Cover Type Dataset
  * Exploratory Data Analysis
  * Deep Learning Model
tutorials/glm/glm_h2oworld_demo.R
Lines changed: 1 addition & 1 deletion

@@ -135,7 +135,7 @@ x = names(data_ext$Train)
  x = x[-which(x==y)]
  m2 = h2o.glm(training_frame = data_ext$Train, validation_frame = data_ext$Valid, x = x, y = y,family='multinomial',solver='L_BFGS',lambda=1e-4)
  # 21% err down from 28%
- m2
+ summary(m2)

  ### All done, shutdown H2O
  h2o.shutdown(prompt=FALSE)
Lines changed: 248 additions & 0 deletions

---
title: "H2O World GLM Demo"
author: "Tomas Nykodym"
date: "November 7, 2015"
output: html_document
---

* Introduction
* Installation and Startup
* Cover Type Dataset
* Multinomial Model
* Binomial Model
* Adding extra features
* Multinomial Model Revisited

## Introduction
This tutorial shows how an H2O [GLM](http://en.wikipedia.org/wiki/Generalized_linear_model) model can be used for binary and multi-class classification. It covers usage of H2O from R; a Python version of this tutorial will be available as well in a separate document. This file is available in plain R, R markdown and regular markdown formats, and the plots are available as PDF files. All documents are available [on Github](https://github.com/h2oai/h2o-world-2015-training/raw/master/tutorials/glm/).
If run from plain R, execute R in the directory of this script. If run from RStudio, be sure to setwd() to the location of this script. h2o.init() starts H2O in R's current working directory. h2o.importFile() looks for files from the perspective of where H2O was started.

More examples and explanations can be found in our [H2O GLM booklet](http://h2o.ai/resources/) and on our [H2O Github Repository](http://github.com/h2oai/h2o-3/).

### H2O R Package

Load the H2O R package:

```{r load_library}
## R installation instructions are at http://h2o.ai/download
library(h2o)
```
### Start H2O
Start up a 1-node H2O server on your local machine, and allow it to use all CPU cores and up to 2GB of memory:

```{r start_h2o}
h2o.init(nthreads=-1, max_mem_size="2G")
h2o.removeAll() ## clean slate - just in case the cluster was already running
```
## Cover Type Data
Predicting forest cover type from cartographic variables only (no remotely sensed data).
Let's import the dataset:
```{r import_data}
D = h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
h2o.summary(D)
```
We have 11 numeric and two categorical features. The response is "Cover_Type" and has 7 classes.
Let's split the data into Train/Test/Validation, with Train getting 70% of the rows and Test and Validation 15% each:
```{r split_data}
data = h2o.splitFrame(D,ratios=c(.7,.15),destination_frames = c("train","test","valid"))
names(data) <- c("Train","Test","Valid")
y = "Cover_Type"
x = names(data$Train)
x = x[-which(x==y)]
```
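As an aside, h2o.splitFrame assigns rows by independent random draws, so the realized split is only approximately 70/15/15. A base-R sketch of the same idea on a made-up data.frame (hypothetical, not part of the demo):

```r
# Approximate 70/15/15 split of a plain data.frame, mirroring h2o.splitFrame
set.seed(42)
df <- data.frame(x = rnorm(10000), y = sample(c("a","b"), 10000, replace = TRUE))
r <- runif(nrow(df))                      # one uniform draw per row
splits <- list(Train = df[r < 0.70, ],
               Test  = df[r >= 0.70 & r < 0.85, ],
               Valid = df[r >= 0.85, ])
sapply(splits, nrow)                      # roughly 7000 / 1500 / 1500
```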
## Multinomial Model
Now that the data are in, let's run GLM. As mentioned above, Cover_Type is the response and we use all other columns as predictors.
Since this is a multi-class problem we pick family=multinomial. The L-BFGS solver tends to be faster on multinomial problems, so we pick L-BFGS for our first try.
The rest can remain at the defaults.
```{r build_model1}
m1 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS')
h2o.confusionMatrix(m1, valid=TRUE)
```
The model predicts only the majority class, so it's of no use at all! Maybe we regularized it too much; let's try again without regularization:
```{r build_model2}
m2 = h2o.glm(training_frame = data$Train, validation_frame = data$Valid, x = x, y = y,family='multinomial',solver='L_BFGS', lambda = 0)
h2o.confusionMatrix(m2, valid=FALSE) # get confusion matrix on the training data
h2o.confusionMatrix(m2, valid=TRUE) # get confusion matrix on the validation data
```
There is no overfitting (train and validation performance are about the same), so regularization is not needed in this case.

This model is actually useful. It got 28% classification error, down from the 51% obtained by predicting the majority class only.
## Binomial Model
Multinomial models are difficult and time-consuming, so let's try a simpler binary classification first.
We'll take a subset of the data with only class_1 and class_2 (the two majority classes) and build a binomial model deciding between them.
```{r get_and_split_binomial_data}
D_binomial = D[D$Cover_Type %in% c("class_1","class_2"),]
h2o.setLevels(D_binomial$Cover_Type,c("class_1","class_2"))
# split to train/test/validation again
data_binomial = h2o.splitFrame(D_binomial,ratios=c(.7,.15),destination_frames = c("train_b","test_b","valid_b"))
names(data_binomial) <- c("Train","Test","Valid")
```
We can run the binomial model now:
```{r build_binomial_model_1}
m_binomial = h2o.glm(training_frame = data_binomial$Train, validation_frame = data_binomial$Valid, x = x, y = y, family='binomial',lambda=0)
h2o.confusionMatrix(m_binomial, valid = FALSE) # training confusion matrix
h2o.confusionMatrix(m_binomial, valid = TRUE)  # validation confusion matrix
```
The output for a binomial problem is slightly different from the multinomial one. The confusion matrix now has a threshold attached to it.

The model produces probabilities of class_1 and class_2 similarly to the multinomial example earlier; however, this time we have only two classes and we can tune the classification to our needs. The classification errors in the binomial case have special meaning: we call them false positives and false negatives. In reality, each can have a different cost associated with it and we want to tune our classifier accordingly. The common way to evaluate a binary classifier's performance is to look at its [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), which plots the true positive rate versus the false positive rate. We can plot it from the h2o model output:
```{r build_binomial_model_output_1}
fpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$fpr
tpr = m_binomial@model$training_metrics@metrics$thresholds_and_metric_scores$tpr
fpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$fpr
tpr_val = m_binomial@model$validation_metrics@metrics$thresholds_and_metric_scores$tpr
plot(fpr,tpr, type='l')
title('ROC Curve')
lines(fpr_val,tpr_val,type='l',col='red')
legend("bottomright",c("Train", "Validation"),col=c("black","red"),lty=c(1,1),lwd=c(3,3))
```
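The AUC reported in the next chunk is simply the area under this curve. As a conceptual aside (using made-up ROC points, not the model's actual output), it can be computed with the trapezoidal rule:

```r
# Trapezoidal-rule AUC from (fpr, tpr) points sorted by fpr
auc_trapezoid <- function(fpr, tpr) {
  o <- order(fpr)
  fpr <- fpr[o]; tpr <- tpr[o]
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
# A perfect classifier's ROC goes straight up, then across: AUC = 1
auc_trapezoid(c(0, 0, 1), c(0, 1, 1))   # 1
# A random classifier follows the diagonal: AUC = 0.5
auc_trapezoid(c(0, 1), c(0, 1))         # 0.5
```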
The area under the ROC curve (AUC) is a common "goodness" metric for binary classifiers. We got:

```{r build_binomial_model_output_2}
h2o.auc(m_binomial,valid=FALSE) # on train
h2o.auc(m_binomial,valid=TRUE) # on validation
```
The default confusion matrix is computed at the threshold that maximizes the [F1 score](https://en.wikipedia.org/wiki/F1_score). We can choose different thresholds; the h2o output shows optimal thresholds for some common metrics.
```{r build_binomial_model_output_3}
m_binomial@model$training_metrics@metrics$max_criteria_and_metric_scores
```
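To make the threshold selection concrete, here is a base-R sketch that scans candidate thresholds for the F1-maximizing one (made-up scores and labels, not the demo's data):

```r
# Scan candidate thresholds and return the F1-maximizing one
best_f1_threshold <- function(scores, labels) {   # labels are 0/1
  ths <- sort(unique(scores))
  f1 <- sapply(ths, function(t) {
    pred <- as.integer(scores >= t)
    tp <- sum(pred == 1 & labels == 1)
    fp <- sum(pred == 1 & labels == 0)
    fn <- sum(pred == 0 & labels == 1)
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  })
  ths[which.max(f1)]
}
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7)
labels <- c(0,   0,   1,    1,   1)
best_f1_threshold(scores, labels)   # 0.35: all three positives caught, one false positive
```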
The model we just built gets 23% classification error at the F1-optimizing threshold, so there is still room for improvement.
Let's add some features:

* There are 11 numerical predictors in the dataset; we will cut them into intervals and add a categorical variable for each
* We can add interaction terms capturing interactions between categorical variables
Let's make a convenience function that cuts a column into intervals, working on all three of our datasets (Train/Validation/Test).
We'll use h2o.hist to determine the interval boundaries (there are many more ways to do that!) on the Train set.
We'll take only the bins with non-trivial support:
```{r cut_column}
cut_column <- function(data, col) {
  # need lower/upper bound due to h2o.cut behavior (points < the first break or > the last break are replaced with missing value)
  min_val = min(data$Train[,col])-1
  max_val = max(data$Train[,col])+1
  x = h2o.hist(data$Train[, col])
  # use only the breaks with enough support
  breaks = x$breaks[which(x$counts > 1000)]
  # assign level names
  lvls = c("min",paste("i_",breaks[2:length(breaks)-1],sep=""),"max")
  col_cut <- paste(col,"_cut",sep="")
  data$Train[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Train[,col],breaks=c(min_val,breaks,max_val)),lvls)
  # now do the same for test and validation, but using the breaks computed on the training data!
  if(!is.null(data$Test)) {
    min_val = min(data$Test[,col])-1
    max_val = max(data$Test[,col])+1
    data$Test[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Test[,col],breaks=c(min_val,breaks,max_val)),lvls)
  }
  if(!is.null(data$Valid)) {
    min_val = min(data$Valid[,col])-1
    max_val = max(data$Valid[,col])+1
    data$Valid[,col_cut] <- h2o.setLevels(h2o.cut(x = data$Valid[,col],breaks=c(min_val,breaks,max_val)),lvls)
  }
  data
}
```
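The key point in cut_column is that the breaks are learned on the training set and reused unchanged on test and validation. A base-R analogue of that idea with cut() on a made-up vector (illustrative only, not the demo's data):

```r
# Breaks learned on training values, then reused on new data
train_vals <- c(2950, 2980, 3100, 3200, 3050, 2900, 3150)
breaks <- unique(quantile(train_vals, probs = c(0, .25, .5, .75, 1)))
# widen the outer bounds so out-of-range values don't become NA, as in cut_column above
breaks[1] <- breaks[1] - 1
breaks[length(breaks)] <- breaks[length(breaks)] + 1
new_vals <- c(2905, 3190)
cut(new_vals, breaks = breaks)   # new values fall into the training-derived intervals
```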
Now let's make a convenience function generating interaction terms on all three of our datasets. We use h2o.interaction:
```{r generate_interactions}
interactions <- function(data, cols, pairwise = TRUE) {
  iii = h2o.interaction(data = data$Train, destination_frame = "itrain",factors = cols,pairwise=pairwise,max_factors=1000,min_occurrence=100)
  data$Train <- h2o.cbind(data$Train,iii)
  if(!is.null(data$Test)) {
    iii = h2o.interaction(data = data$Test, destination_frame = "itest",factors = cols,pairwise=pairwise,max_factors=1000,min_occurrence=100)
    data$Test <- h2o.cbind(data$Test,iii)
  }
  if(!is.null(data$Valid)) {
    iii = h2o.interaction(data = data$Valid, destination_frame = "ivalid",factors = cols,pairwise=pairwise,max_factors=1000,min_occurrence=100)
    data$Valid <- h2o.cbind(data$Valid,iii)
  }
  data
}
```
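An interaction term is just a new categorical column whose levels are combinations of the levels of the input columns. A base-R analogue using interaction() on made-up factors (the column names are hypothetical):

```r
# Combine two categorical columns into one interaction factor
wilderness <- factor(c("area1", "area1", "area2", "area2"))
soil       <- factor(c("type_a", "type_b", "type_a", "type_b"))
interaction(wilderness, soil, sep = "_")   # 4 combined levels, e.g. area1_type_a
```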
Lastly, let's wrap the feature generation into a separate function, as we will use it again later.
We'll add intervals for each numeric column and interactions between each pair of categorical columns.
```{r add_features}
# add features to our cover type example
# let's cut all the numerical columns into intervals and add interactions between categorical terms
add_features <- function(data) {
  names(data) <- c("Train","Test","Valid")
  data = cut_column(data,'Elevation')
  data = cut_column(data,'Hillshade_Noon')
  data = cut_column(data,'Hillshade_9am')
  data = cut_column(data,'Hillshade_3pm')
  data = cut_column(data,'Horizontal_Distance_To_Hydrology')
  data = cut_column(data,'Slope')
  data = cut_column(data,'Horizontal_Distance_To_Roadways')
  data = cut_column(data,'Aspect')
  # pairwise interactions between all categorical columns
  interaction_cols = c("Elevation_cut","Wilderness_Area","Soil_Type","Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut","Horizontal_Distance_To_Hydrology_cut","Slope_cut","Horizontal_Distance_To_Roadways_cut","Aspect_cut")
  data = interactions(data, interaction_cols)
  # interactions between Hillshade columns
  interaction_cols2 = c("Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut")
  data = interactions(data, interaction_cols2,pairwise = FALSE)
  data
}
```
Now we generate the new features and add them to the dataset. We also need to regenerate the predictor column names, as we added more columns:
```{r add_features_binomial}
# Add Features
data_binomial_ext <- add_features(data_binomial)
data_binomial_ext$Train <- h2o.assign(data_binomial_ext$Train,"train_b_ext")
data_binomial_ext$Valid <- h2o.assign(data_binomial_ext$Valid,"valid_b_ext")
data_binomial_ext$Test <- h2o.assign(data_binomial_ext$Test,"test_b_ext")
y = "Cover_Type"
x = names(data_binomial_ext$Train)
x = x[-which(x==y)]
```
Let's build the model. We should add some regularization this time because we added correlated variables; let's try the default first:
```{r build_binomial_ext_1}
m_binomial_1_ext = try(h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial'))
```
Oops, that does not run: we now have more features than the default solver can handle with 2GB of RAM. Let's try L-BFGS instead.
```{r build_binomial_ext_2}
m_binomial_1_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS')
h2o.confusionMatrix(m_binomial_1_ext)
h2o.auc(m_binomial_1_ext,valid=TRUE)
```
No better; maybe too much regularization? Let's pick a smaller lambda and try again.
```{r build_binomial_ext_3}
m_binomial_2_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', solver='L_BFGS', lambda=1e-4)
h2o.confusionMatrix(m_binomial_2_ext, valid=TRUE)
h2o.auc(m_binomial_2_ext,valid=TRUE)
```
Much better: we got an AUC of .91 and a classification error of 0.180838.
We picked our regularization strength arbitrarily, and we used only the l2 penalty even though we added a lot of extra features, some of which may be useless.
Maybe we can do better with an l1 penalty.
So let's run a lambda search to find the optimal penalty strength, with a non-zero l1 penalty to get a sparse solution.
We'll use the IRLSM solver this time, as it does much better with lambda search and the l1 penalty.
Recall we were not able to use it before. We can use it now because lambda search filters out a large portion of the inactive (coefficient == 0) predictors.
```{r build_binomial_ext_4}
m_binomial_3_ext = h2o.glm(training_frame = data_binomial_ext$Train, validation_frame = data_binomial_ext$Valid, x = x, y = y, family='binomial', lambda_search=TRUE)
h2o.confusionMatrix(m_binomial_3_ext, valid=TRUE)
h2o.auc(m_binomial_3_ext,valid=TRUE)
```
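For intuition on why the l1 penalty yields a sparse solution: its proximal update is soft-thresholding, which snaps small coefficients to exactly zero. A conceptual base-R sketch of that operator (an illustration of the math only, not H2O's actual implementation):

```r
# Soft-thresholding: the elementwise update behind l1 (lasso) sparsity
soft_threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)
# small coefficients are zeroed out, large ones are shrunk toward zero
soft_threshold(c(-0.3, 0.05, 0.8), lambda = 0.1)   # -0.2, 0.0, 0.7
```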
Better still: we got 17% error and used only 3000 out of 7000 features.
Our new features improved the binomial model significantly, so let's revisit our former multinomial model and see if they make a difference there (they should!):
```{r build_multinomial_ext_4}
# Multinomial Model 2
# let's revisit the multinomial case with our new features
data_ext <- add_features(data)
data_ext$Train <- h2o.assign(data_ext$Train,"train_m_ext")
data_ext$Valid <- h2o.assign(data_ext$Valid,"valid_m_ext")
data_ext$Test <- h2o.assign(data_ext$Test,"test_m_ext")
y = "Cover_Type"
x = names(data_ext$Train)
x = x[-which(x==y)]
m2 = h2o.glm(training_frame = data_ext$Train, validation_frame = data_ext$Valid, x = x, y = y,family='multinomial',solver='L_BFGS',lambda=1e-4)
# 21% err down from 28%
h2o.confusionMatrix(m2, valid=TRUE)
```
The model improved considerably: 21% error instead of 28%.
