
Commit 0be1d66

Author: anqi
Commit message: Finish GLRM tutorial except LaTeX image and additional references
1 parent: 1ebda6b

1 file changed: tutorials/glrm/glrm-tutorial.md
Lines changed: 25 additions & 21 deletions
@@ -19,26 +19,30 @@

 ## Overview

-This tutorial introduces the Generalized Low Rank Model (GLRM), a new machine learning approach for reconstructing missing values and identifying important features in heterogeneous data. It demonstrates how to build a GLRM in H2O and integrate it into a data science pipeline to make better predictions.
+This tutorial introduces the Generalized Low Rank Model (GLRM) [[1](#references)], a new machine learning approach for reconstructing missing values and identifying important features in heterogeneous data. It demonstrates how to build a GLRM in H2O and integrate it into a data science pipeline to make better predictions.

 ## What is a Low Rank Model?

+Across business and research, analysts seek to understand large collections of data with numeric and categorical values. Many entries in such a table may be noisy or even missing altogether. Low rank models facilitate the understanding of tabular data by producing a condensed vector representation for every row and column in the data set.
+
+Specifically, given a data table A with m rows and n columns, a GLRM consists of a decomposition of A into numeric matrices X and Y. The matrix X has the same number of rows as A, but only a low, user-specified number of columns k. The matrix Y has k rows and a number of columns d equal to the total dimension of the embedded features in A. For example, if A has 3 numeric columns and 1 categorical column with 4 distinct levels (e.g., red, yellow, blue and green), then Y will have 7 columns. When A contains only numeric features, the number of columns in A and Y will be identical (d = n).
+
 #### TODO: Finish up this section and include LaTeX image!

-Given a data table A with m rows and n columns, a GLRM will decompose A into numeric matrices X and Y. The matrix X has the same number of rows as A, but only a small, user-specified number of columns k. Similarly, the matrix Y has k rows and the same number of columns as A, when categorical columns are expanded into indicator variables. The number k is chosen to be much less than both m and n, indicating the amount of compression by the low rank model representation: the smaller is k, the more compression.
+Both X and Y have practical interpretations. Each row of Y is an archetypal feature formed from the columns of A, and each row of X corresponds to a row of A projected into this reduced feature space. We can approximately reconstruct A from the matrix product XY, which has rank k. The number k is chosen to be much less than both m and n: a typical value for 1 million rows and 2,000 columns of numeric data is k = 15. The smaller k is, the more compression we gain from our low rank representation.

-Both X and Y have practical interpretations. Each row of Y is an archetypal feature formed from the columns of A, and each row of X corresponds to a row of A projected into this reduced feature space. Thus, we can approximately reconstruct A from the matrix product XY.
+GLRMs are an extension of well-known matrix factorization methods such as Principal Components Analysis (PCA). While PCA is limited to numeric data, GLRMs can handle mixed numeric, categorical, ordinal, and Boolean data with an arbitrary number of missing values. They allow the user to apply regularization to X and Y, imposing restrictions like non-negativity appropriate to a particular data science context. Thus, they are an extremely flexible approach to analyzing and interpreting heterogeneous data sets.

 ## Why use Low Rank Models?

-- **Memory:** By saving only the X and Y matrices, we can significantly reduce the amount of memory required to store a large dataset. A file that is 10 GB can be compressed down to 100 MB. When we need the original data again, we can reconstruct it on the fly from X and Y with minimal loss in accuracy.
+- **Memory:** By saving only the X and Y matrices, we can significantly reduce the amount of memory required to store a large data set. A file that is 10 GB can be compressed down to 100 MB. When we need the original data again, we can reconstruct it on the fly from X and Y with minimal loss in accuracy.
 - **Speed:** We can use GLRM to compress data with high-dimensional, heterogeneous features into a few numeric columns. This leads to a huge speed-up in model building and prediction, especially for machine learning algorithms that scale poorly with the size of the feature space. Below, we will see an example with a 10x speed-up and no accuracy loss in deep learning.
 - **Feature Engineering:** The Y matrix represents the most important combinations of features from the training data. These condensed features, called archetypes, can be analyzed, visualized and incorporated into various data science applications.
-- **Missing Data Imputation:** Reconstructing a dataset from X and Y will automatically impute missing values. This imputation is accomplished by intelligently leveraging the information contained in the known values of each feature, as well as user-provided parameters such as the loss function.
+- **Missing Data Imputation:** Reconstructing a data set from X and Y will automatically impute missing values. This imputation is accomplished by intelligently leveraging the information contained in the known values of each feature, as well as user-provided parameters such as the loss function.

 ## Example 1: Visualizing Walking Stances

-For our first example, we will use data on [Subject 01's walking stances](https://simtk.org/project/xml/downloads.xml?group_id=603) from an experiment carried out by Hamner and Delp (2013) [2]. Each of the 151 row of the dataset contains the (x, y, z) coordinates of major body parts recorded at a specific point in time.
+For our first example, we will use data on [Subject 01's walking stances](https://simtk.org/project/xml/downloads.xml?group_id=603) from an experiment carried out by _Hamner and Delp (2013)_ [[2](#references)]. Each of the 151 rows of the data set contains the (x, y, z) coordinates of major body parts recorded at a specific point in time.

 #### Basic Model Building

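As an aside on the decomposition described in the hunk above: the approximation A ≈ XY can be sketched outside of H2O. The following is a minimal numpy illustration (not H2O code; matrix sizes and names are invented for the example) that computes the best rank-k fit under quadratic loss via truncated SVD and measures the compression:

```python
# Minimal numpy sketch of A ~ X @ Y: X is m x k, Y is k x n, k << min(m, n).
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 20, 5

# Synthetic data: approximately rank-k signal plus small noise.
A = rng.normal(size=(m, k)) @ rng.normal(size=(k, n)) + 0.01 * rng.normal(size=(m, n))

# Truncated SVD yields the best rank-k approximation under quadratic loss.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
X = U[:, :k] * s[:k]   # m x k: each row embeds a row of A
Y = Vt[:k, :]          # k x n: each row is an "archetypal feature"

A_hat = X @ Y          # rank-k reconstruction of A
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
print(f"relative reconstruction error: {rel_err:.4f}")

# Storage shrinks from m*n values to (m + n)*k values.
print("compression ratio:", (m * n) / ((m + n) * k))
```

The smaller k is, the larger the compression ratio, at the cost of a coarser reconstruction.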
@@ -48,11 +52,11 @@ For our first example, we will use data on [Subject 01's walking stances](https:
 pathToData <- "/data/h2o-training/glrm/subject01_walk1.csv"
 gait.hex <- h2o.importFile(path = pathToData, destination_frame = "gait.hex")

-###### Get a summary of the imported dataset.
+###### Get a summary of the imported data set.
 dim(gait.hex)
 summary(gait.hex)

-###### Build a basic GLRM using quadratic loss and no regularization. Since this dataset has no missing values, this is equivalent to principal components analysis (PCA). We skip the first column since it is the time index, set the rank k = 10, and allow the algorithm to run for a maximum of 1,000 iterations.
+###### Build a basic GLRM using quadratic loss and no regularization. Since this data set has no missing values, this is equivalent to principal components analysis (PCA). We skip the first column since it is the time index, set the rank k = 10, and allow the algorithm to run for a maximum of 1,000 iterations.
 gait.glrm <- h2o.glrm(training_frame = gait.hex, cols = 2:ncol(gait.hex), k = 10, loss = "Quadratic",
                       regularization_x = "None", regularization_y = "None", max_iterations = 1000)

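Aside on the hunk above: with quadratic loss and no regularization, the fitted factorization matches the PCA/SVD solution. A hedged numpy sketch of the alternating least squares idea (invented sizes; a simplification, not the H2O implementation) makes this concrete — alternately solving for X with Y fixed and for Y with X fixed drives the objective to the best rank-k error:

```python
# Alternating least squares for min ||A - X @ Y||^2 (no regularization).
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 60, 12, 4
A = rng.normal(size=(m, n))

X = rng.normal(size=(m, k))
Y = rng.normal(size=(k, n))
for _ in range(200):
    # Fix Y, solve for X: Y.T @ X.T ~ A.T in the least-squares sense.
    X = np.linalg.lstsq(Y.T, A.T, rcond=None)[0].T
    # Fix X, solve for Y: X @ Y ~ A in the least-squares sense.
    Y = np.linalg.lstsq(X, A, rcond=None)[0]

als_err = np.linalg.norm(A - X @ Y)

# The best possible rank-k error, from truncated SVD (what PCA would give).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
svd_err = np.linalg.norm(A - (U[:, :k] * s[:k]) @ Vt[:k, :])
print(f"ALS error {als_err:.6f} vs SVD optimum {svd_err:.6f}")
```

After enough iterations the two errors agree, which is why the unregularized quadratic GLRM above behaves like PCA.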
@@ -100,12 +104,12 @@ Suppose that due to a sensor malfunction, our walking stance data has missing va
 pathToMissingData <- "/data/h2o-training/glrm/subject01_walk1_miss15.csv"
 gait.miss <- h2o.importFile(path = pathToMissingData, destination_frame = "gait.miss")

-###### Get a summary of the imported dataset.
+###### Get a summary of the imported data set.
 dim(gait.miss)
 summary(gait.miss)
 sum(is.na(gait.miss))

-###### Build a basic GLRM with quadratic loss and no regularization, validating on our original dataset with no missing values. We change the algorithm initialization method, increase the maximum number of iterations to 2,000, and reduce the minimum step size to 1e-6 to ensure it converges.
+###### Build a basic GLRM with quadratic loss and no regularization, validating on our original data set with no missing values. We change the algorithm initialization method, increase the maximum number of iterations to 2,000, and reduce the minimum step size to 1e-6 to ensure it converges.
 gait.glrm2 <- h2o.glrm(training_frame = gait.miss, validation_frame = gait.hex, cols = 2:ncol(gait.miss), k = 10, init = "SVD", svd_method = "GramSVD",
                        loss = "Quadratic", regularization_x = "None", regularization_y = "None", max_iterations = 2000, min_step_size = 1e-6)
 plot(gait.glrm2)
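Aside on the imputation step above: the idea of fitting the factorization on observed entries and reading missing values off the reconstruction can be sketched in numpy. This is a simplified iterative-SVD stand-in with synthetic data, not the H2O algorithm:

```python
# Low-rank imputation sketch: mask entries, then alternate truncated SVD
# with refilling only the missing cells from the reconstruction.
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 80, 15, 3
A_true = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))  # exactly rank k

mask = rng.random(size=(m, n)) < 0.15   # ~15% of entries "missing"
A_obs = A_true.copy()
A_obs[mask] = np.nan

# Initialize missing cells with column means, then iterate.
col_means = np.nanmean(A_obs, axis=0)
filled = np.where(mask, col_means, A_obs)
for _ in range(100):
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    recon = (U[:, :k] * s[:k]) @ Vt[:k, :]
    filled = np.where(mask, recon, A_obs)  # observed values stay fixed

impute_err = np.linalg.norm((recon - A_true)[mask]) / np.linalg.norm(A_true[mask])
print(f"relative imputation error on missing cells: {impute_err:.6f}")
```

Because the known entries of each feature constrain the low-rank fit, the reconstruction recovers the masked values closely when the data really is (near) low rank.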
@@ -125,13 +129,13 @@ Suppose that due to a sensor malfunction, our walking stance data has missing va

 ## Example 2: Compressing Zip Codes

-For our second example, we will be using two datasets. The first is compliance actions carried out by the U.S. Labor Department's [Wage and Hour Division (WHD)](http://ogesdw.dol.gov/views/data_summary.php) from 2014-2015. This includes information on each investigation, including the zip code tabulation area (ZCTA) at which the firm is located, number of violations found, and civil penalties assessed. We want to predict whether a firm is a repeat and/or willful violator. In order to do this, we need to encode the categorical ZCTA column in a meaningful way. One common approach is to replace ZCTA with indicator variables for every unique level, but due to its high cardinality (there are over 32,000 ZCTAs!), this is slow and leads to overfitting.
+For our second example, we will be using two data sets. The first is compliance actions carried out by the U.S. Labor Department's [Wage and Hour Division (WHD)](http://ogesdw.dol.gov/views/data_summary.php) from 2014-2015. This includes information on each investigation, including the zip code tabulation area (ZCTA) in which the firm is located, the number of violations found, and the civil penalties assessed. We want to predict whether a firm is a repeat and/or willful violator. In order to do this, we need to encode the categorical ZCTA column in a meaningful way. One common approach is to replace ZCTA with indicator variables for every unique level, but due to its high cardinality (there are over 32,000 ZCTAs!), this is slow and leads to overfitting.

-Instead, we will use GLRM to condense ZCTAs into a few numeric columns representing the demographics of that area. Our second dataset is the 2009-2013 [American Community Survey (ACS)](http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk) 5-year estimates of household characteristics. Each row contains information for a unique ZCTA, such as average household size, number of children, education level and ethnicity. By transforming the WHD data with GLRM, we not only address the speed and overfitting issue, but also transfer knowledge between similar ZCTAs in our model.
+Instead, we will use GLRM to condense ZCTAs into a few numeric columns representing the demographics of that area. Our second data set is the 2009-2013 [American Community Survey (ACS)](http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk) 5-year estimates of household characteristics. Each row contains information for a unique ZCTA, such as average household size, number of children, education level and ethnicity. By transforming the WHD data with GLRM, we not only address the speed and overfitting issues, but also transfer knowledge between similar ZCTAs in our model.

 #### Condensing Categorical Data

-###### Initialize the H2O server and import the ACS dataset.
+###### Initialize the H2O server and import the ACS data set.
 library(h2o)
 h2o.init()
 pathToACSData <- "/data/h2o-training/glrm/ACS_13_5YR_DP02_cleaned.zip"
@@ -141,11 +145,11 @@ Instead, we will use GLRM to condense ZCTAs into a few numeric columns represent
 acs_zcta_col <- acs_orig$ZCTA5
 acs_full <- acs_orig[,-which(colnames(acs_orig) == "ZCTA5")]

-###### Get a summary of the ACS dataset.
+###### Get a summary of the ACS data set.
 dim(acs_full)
 summary(acs_full)

-###### Build a GLRM to reduce ZCTA demographics to k = 10 archetypes. We standardize the data before performing the fit to ensure differences in scale between columns don't unduly affect the algorithm. For the loss function, we select quadratic again, but this time, we apply regularization to X and Y in order to sparsify the resulting features.
+###### Build a GLRM to reduce ZCTA demographics to k = 10 archetypes. We standardize the data before performing the fit to ensure differences in scale don't unduly affect the algorithm. For the loss function, we select quadratic again, but this time, apply regularization to X and Y in order to sparsify the compressed features.
 acs_model <- h2o.glrm(training_frame = acs_full, k = 10, transform = "STANDARDIZE",
                       loss = "Quadratic", regularization_x = "Quadratic",
                       regularization_y = "L1", max_iterations = 100, gamma_x = 0.25, gamma_y = 0.5)
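Aside on the hunk above: the reason an L1 penalty on Y "sparsifies" the archetypes can be seen from its proximal operator, soft-thresholding, which snaps small coefficients to exactly zero. A tiny numpy sketch (illustrative only; the values are made up):

```python
# Soft-thresholding: the proximal operator of t * ||.||_1, the building block
# that zeroes out small entries of Y under an L1 penalty.
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry toward zero by t; entries with |v| <= t become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

y = np.array([0.9, -0.03, 0.4, 0.01, -0.7])
print(soft_threshold(y, 0.05))  # the two small entries become exactly 0
```

A quadratic penalty (as used on X here) only shrinks coefficients; it never produces exact zeros, which is why L1 is the choice when sparse archetypes are wanted.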
@@ -172,9 +176,9 @@ Instead, we will use GLRM to condense ZCTAs into a few numeric columns represent

 #### Runtime and Accuracy Comparison

-We now build a deep learning model on the WHD dataset to predict repeat and/or willful violators. For comparison purposes, we train our model using the original data, the data with the ZCTA column replaced by the compressed GLRM representation (the X matrix), and the data with the ZCTA column replaced by all the demographic features in the ACS dataset.
+We now build a deep learning model on the WHD data set to predict repeat and/or willful violators. For comparison purposes, we train our model using the original data, the data with the ZCTA column replaced by the compressed GLRM representation (the X matrix), and the data with the ZCTA column replaced by all the demographic features in the ACS data set.

-###### Import WHD dataset and get a summary.
+###### Import the WHD data set and get a summary.
 pathToWHDData <- "/data/h2o-training/glrm/whd_zcta_cleaned.zip"
 whd_zcta <- h2o.uploadFile(path = pathToWHDData, col.types = c(rep("enum", 7), rep("numeric", 97)))
 dim(whd_zcta)
@@ -185,28 +189,28 @@ We now build a deep learning model on the WHD dataset to predict repeat and/or w
 train <- whd_zcta[split <= 0.8,]
 test <- whd_zcta[split > 0.8,]

-###### Build a deep learning model on original WHD data to predict repeat and/or willful violators. Our response is a categorical column with four levels: N/A = neither repeat nor willful, R = repeat, W = willful, and RW = repeat and willful violator, so we specify a multinomial distribution.
+###### Build a deep learning model on the original WHD data to predict repeat/willful violators. Our response is a categorical column with four levels: N/A = neither repeat nor willful, R = repeat, W = willful, and RW = repeat and willful violator, so we specify a multinomial distribution.
 myY <- "flsa_repeat_violator"
 myX <- setdiff(5:ncol(train), which(colnames(train) == myY))
 orig_time <- system.time(dl_orig <- h2o.deeplearning(x = myX, y = myY, training_frame = train,
                                                      validation_frame = test, distribution = "multinomial",
                                                      epochs = 0.1, hidden = c(50,50,50)))

-###### Replace each ZCTA in the WHD data with the row of the X matrix corresponding to its compressed demographic representation. At the end of the merge, our single categorical column will be replaced by k = 10 numeric columns.
+###### Replace each ZCTA in the WHD data with the row of the X matrix corresponding to its compressed demographic representation. At the end, our single categorical column will be replaced by k = 10 numeric columns.
 zcta_arch_x$zcta5_cd <- acs_zcta_col
 whd_arch <- h2o.merge(whd_zcta, zcta_arch_x, all.x = TRUE, all.y = FALSE)
 whd_arch$zcta5_cd <- NULL
 summary(whd_arch)

-###### Split the reduced WHD data into test and train, and build a deep learning model to predict repeat and/or willful violators.
+###### Split the reduced WHD data into test and train, and build a deep learning model to predict repeat/willful violators.
 train_mod <- whd_arch[split <= 0.8,]
 test_mod <- whd_arch[split > 0.8,]
 myX <- setdiff(5:ncol(train_mod), which(colnames(train_mod) == myY))
 mod_time <- system.time(dl_mod <- h2o.deeplearning(x = myX, y = myY, training_frame = train_mod,
                                                    validation_frame = test_mod, distribution = "multinomial",
                                                    epochs = 0.1, hidden = c(50,50,50)))

-###### Compare the performance between the two models. We see that the model built on the reduced WHD dataset finishes almost 10 times faster than the model using the original dataset, and it yields a lower log-loss error.
+###### Compare the performance of the two models. We see that the model built on the reduced WHD data set finishes almost 10 times faster than the model using the original data set, and it yields a lower log-loss error.
 data.frame(original = c(orig_time[3], h2o.logloss(dl_orig, train = TRUE), h2o.logloss(dl_orig, valid = TRUE)),
            reduced = c(mod_time[3], h2o.logloss(dl_mod, train = TRUE), h2o.logloss(dl_mod, valid = TRUE)),
            row.names = c("runtime", "train_logloss", "test_logloss"))
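Aside on the merge step in the hunk above, in spirit: each high-cardinality category value is swapped for its k-dimensional row of X instead of tens of thousands of indicator columns. A toy plain-Python illustration with hypothetical codes and values (not the tutorial's data or the H2O merge):

```python
# Replace a high-cardinality categorical column with its compressed X-row.
k = 3
# Hypothetical compressed representation: one k-vector per area code.
x_rows = {"94301": [0.8, -0.1, 0.3],
          "10001": [-0.2, 0.5, 0.0]}
assert all(len(v) == k for v in x_rows.values())

records = [{"zcta": "94301", "violations": 2},
           {"zcta": "10001", "violations": 0}]

# Each record's "zcta" column becomes k numeric archetype columns.
encoded = [{**{f"arch{i+1}": v for i, v in enumerate(x_rows[r["zcta"]])},
            "violations": r["violations"]}
           for r in records]
print(encoded[0])
```

The downstream model then sees k numeric columns per row, which is what drives the roughly 10x speed-up reported above.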
