Commit e846d52

Author: anqi
Commit message: Change matrix decomposition image
1 parent eeeac41

2 files changed: 22 additions & 22 deletions

File: tutorials/glrm/glrm-tutorial.md (22 additions & 22 deletions)
@@ -23,18 +23,18 @@ This tutorial introduces the Generalized Low Rank Model (GLRM) [[1](#references)
 Across business and research, analysts seek to understand large collections of data with numeric and categorical values. Many entries in this table may be noisy or even missing altogether. Low rank models facilitate the understanding of tabular data by producing a condensed vector representation for every row and column in the data set.
 
-Specifically, given a data table A with m rows and n columns, a GLRM consists of a decomposition of A into numeric matrices X and Y. The matrix X has the same number of rows as A, but only a small, user-specified number of columns k. The matrix Y has k rows and d columns, where d is equal to the total dimension of the embedded features in A. For example, if A has 4 numeric columns and 1 categorical column with 3 distinct levels (e.g., _red_, _blue_ and _green_), then Y will have 7 columns. When A contains only numeric features, the number of columns in A and Y will be identical.
+Specifically, given a data table A with m rows and n columns, a GLRM consists of a decomposition of A into numeric matrices X and Y. The matrix X has the same number of rows as A, but only a small, user-specified number of columns k. The matrix Y has k rows and d columns, where d is equal to the total dimension of the embedded features in A. For example, if A has 4 numeric columns and 1 categorical column with 3 distinct levels (e.g., _red_, _blue_ and _green_), then Y will have 7 columns. When A contains only numeric features, the number of columns in A and Y is identical, as shown below.
 
 ![GLRM Matrix Decomposition](images/glrm_matrix_decomposition.png)
 
-Both X and Y have practical interpretations. Each row of Y is an archetypal feature formed from the columns of A, and each row of X corresponds to a row of A projected into this reduced feature space. We can approximately reconstruct A from the matrix product XY, which has rank k. The number k is chosen to be much less than both m and n: a typical value for 1 million rows and 2,000 columns of numeric data is k = 15. The smaller k is, the more compression we gain from our low rank representation.
+Both X and Y have practical interpretations. Each row of Y is an archetypal feature formed from the columns of A, and each row of X corresponds to a row of A projected into this reduced dimension feature space. We can approximately reconstruct A from the matrix product XY, which has rank k. The number k is chosen to be much less than both m and n: a typical value for 1 million rows and 2,000 columns of numeric data is k = 15. The smaller k is, the more compression we gain from our low rank representation.
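To make the shapes concrete, the decomposition described above can be sketched in a few lines. This is pure Python (not H2O's implementation, which the tutorial drives from R) and only illustrates the linear algebra: X is m x k, Y is k x d, and their product has rank at most k.

```python
# Sketch of A ~ X * Y. Not H2O's implementation, just the linear algebra:
# X has m rows and k columns, Y has k rows and d columns.

def matmul(X, Y):
    """Multiply an m x k matrix by a k x d matrix (both as lists of rows)."""
    k, d = len(Y), len(Y[0])
    return [[sum(row[j] * Y[j][c] for j in range(k)) for c in range(d)] for row in X]

m, k, d = 4, 2, 3
X = [[1, 0], [0, 1], [1, 1], [2, 1]]   # one condensed row per row of A
Y = [[1, 2, 3], [4, 5, 6]]             # one archetypal feature per row
A_hat = matmul(X, Y)                   # m x d reconstruction of A

# Each reconstructed row is a linear combination of the k archetypes in Y:
# row 2 of X is [1, 1], so its reconstruction is Y[0] + Y[1].
print(A_hat[2])  # [5, 7, 9]
```

Storing X and Y takes m\*k + k\*d numbers instead of the m\*d needed for A; with m = 1,000,000, d = 2,000 and k = 15 that is roughly 15 million numbers versus 2 billion, which is where the compression comes from.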
 
 GLRMs are an extension of well-known matrix factorization methods such as Principal Components Analysis (PCA). While PCA is limited to numeric data, GLRMs can handle mixed numeric, categorical, ordinal and Boolean data with an arbitrary number of missing values. They allow the user to apply regularization to X and Y, imposing restrictions like non-negativity appropriate to a particular data science context. Thus, they are an extremely flexible approach for analyzing and interpreting heterogeneous data sets.
 
 ## Why use Low Rank Models?
 
 - **Memory:** By saving only the X and Y matrices, we can significantly reduce the amount of memory required to store a large data set. A file that is 10 GB can be compressed down to 100 MB. When we need the original data again, we can reconstruct it on the fly from X and Y with minimal loss in accuracy.
-- **Speed:** We can use GLRM to compress data with high-dimensional, heterogeneous features into a few numeric columns. This leads to a huge speed-up in model-building and prediction, especially by machine learning algorithms that scale poorly with the size of the feature space. Below, we will see an example with 10x speed-up and no accuracy loss in deep learning.
+- **Speed:** We can use GLRM to compress data with high-dimensional, heterogeneous features into a few numeric columns. This leads to a huge speed-up in model building and prediction, especially by machine learning algorithms that scale poorly with the size of the feature space. Below, we will see an example with 10x speed-up and no accuracy loss in deep learning.
 - **Feature Engineering:** The Y matrix represents the most important combinations of features from the training data. These condensed features, called archetypes, can be analyzed, visualized and incorporated into various data science applications.
 - **Missing Data Imputation:** Reconstructing a data set from X and Y will automatically impute missing values. This imputation is accomplished by intelligently leveraging the information contained in the known values of each feature, as well as user-provided parameters such as the loss function.
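The imputation bullet can be illustrated in miniature. Once X and Y have been fitted, the product XY is fully dense, so missing cells of A can be read off the reconstruction. A pure-Python sketch with hard-coded X and Y (in the tutorial the fit and prediction are done by `h2o.glrm` and `predict`):

```python
# Sketch: impute missing entries of A from a fitted X and Y.
# X and Y are hard-coded here for illustration; a real GLRM would fit them.
A = [[1.0, 2.0, None],
     [2.0, None, 6.0]]      # None marks a missing entry

X = [[1.0], [2.0]]          # pretend output of a rank k = 1 fit
Y = [[1.0, 2.0, 3.0]]

# The reconstruction X * Y is dense: every cell is defined.
A_hat = [[sum(X[i][j] * Y[j][c] for j in range(len(Y)))
          for c in range(len(Y[0]))] for i in range(len(X))]

# Keep observed values, fill only the missing cells from the reconstruction.
A_imputed = [[A[i][c] if A[i][c] is not None else A_hat[i][c]
              for c in range(len(A[i]))] for i in range(len(A))]

print(A_imputed)  # no None values remain
```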

@@ -44,25 +44,25 @@ For our first example, we will use data on [Subject 01's walking stances](https:
 #### Basic Model Building
 
-###### Initialize the H2O server and import our walking stance data.
+###### Initialize the H2O server and import our walking stance data. We use all available cores on our computer and allocate a maximum of 2 GB of memory to H2O.
 library(h2o)
-h2o.init(nthreads = -1, max_mem_size = "2G") # Use all available cores and 2 GB of memory
+h2o.init(nthreads = -1, max_mem_size = "2G")
 gait.hex <- h2o.importFile(path = "../data/subject01_walk1.csv", destination_frame = "gait.hex")
 
 ###### Get a summary of the imported data set.
 dim(gait.hex)
 summary(gait.hex)
 
-###### Build a basic GLRM using quadratic loss and no regularization. Since this data set has no missing values, this is equivalent to principal components analysis (PCA). We skip the first column since it is the time index, set the rank k = 10, and allow the algorithm to run for a maximum of 1,000 iterations.
+###### Build a basic GLRM using quadratic loss and no regularization. Since this data set contains only numeric features and no missing values, this is equivalent to PCA. We skip the first column since it is the time index, set the rank k = 10, and allow the algorithm to run for a maximum of 1,000 iterations.
 gait.glrm <- h2o.glrm(training_frame = gait.hex, cols = 2:ncol(gait.hex), k = 10, loss = "Quadratic",
 regularization_x = "None", regularization_y = "None", max_iterations = 1000)
 
-###### To ensure our algorithm converged, we should always plot the objective function value per iteration after model-building is complete.
+###### To ensure our algorithm converged, we should always plot the objective function value per iteration after model building is complete.
 plot(gait.glrm)
 
 #### Plotting Archetypal Features
 
-###### The rows of the Y matrix represent the principal stances, or archetypes, that Subject 01 took while walking. We can visualize each of the 10 stances by plotting the (x, y) coordinate weights of every body part.
+###### The rows of the Y matrix represent the principal stances that Subject 01 took while walking. We can visualize each of the 10 stances by plotting the (x, y) coordinate weights of each body part.
 gait.y <- gait.glrm@model$archetypes
 gait.y.mat <- as.matrix(gait.y)
 x_coords <- seq(1, ncol(gait.y), by = 3)
@@ -105,17 +105,17 @@ Suppose that due to a sensor malfunction, our walking stance data has missing va
 ###### Count the total number of missing values in the data set.
 sum(is.na(gait.miss))
 
-###### Build a basic GLRM with quadratic loss and no regularization, validating on our original data set with no missing values. We change the algorithm initialization method, increase the maximum number of iterations to 2,000, and reduce the minimum step size to 1e-6 to ensure it converges.
+###### Build a basic GLRM with quadratic loss and no regularization, validating on our original data set that has no missing values. We change the algorithm initialization method, increase the maximum number of iterations to 2,000, and reduce the minimum step size to 1e-6 to ensure convergence.
 gait.glrm2 <- h2o.glrm(training_frame = gait.miss, validation_frame = gait.hex, cols = 2:ncol(gait.miss), k = 10, init = "SVD", svd_method = "GramSVD",
 loss = "Quadratic", regularization_x = "None", regularization_y = "None", max_iterations = 2000, min_step_size = 1e-6)
 plot(gait.glrm2)
 
 ###### Impute missing values in our training data from X and Y.
 gait.pred2 <- predict(gait.glrm2, gait.miss)
 head(gait.pred2)
-sum(is.na(gait.pred2)) # No missing values in reconstructed data!
+sum(is.na(gait.pred2))
 
-###### Plot original and reconstructed data of the x-coordinate of the left acromium. Red x's mark the points where the training data contains a missing value, so we can see how accurate our imputation is.
+###### Plot the original and reconstructed values of the x-coordinate of the left acromium. Red x's mark the points where the training data contains a missing value, so we can see how accurate our imputation is.
 lacro.pred.df2 <- as.data.frame(gait.pred2$reconstr_L.Acromium.X[1:150])
 matplot(time.df, cbind(lacro.df, lacro.pred.df2), xlab = "Time", ylab = "X-Coordinate of Left Acromium", main = "Position of Left Acromium over Time", type = "l", lty = 1, col = c(1,4))
 legend("topright", legend = c("Original", "Imputed"), col = c(1,4), pch = 1)
@@ -125,26 +125,26 @@ Suppose that due to a sensor malfunction, our walking stance data has missing va
 
 ## Example 2: Compressing Zip Codes
 
-For our second example, we will be using two data sets. The first is compliance actions carried out by the U.S. Labor Department's [Wage and Hour Division (WHD)](http://ogesdw.dol.gov/views/data_summary.php) from 2014-2015. This includes information on each investigation, including the zip code tabulation area (ZCTA) at which the firm is located, number of violations found, and civil penalties assessed. We want to predict whether a firm is a repeat and/or willful violator. In order to do this, we need to encode the categorical ZCTA column in a meaningful way. One common approach is to replace ZCTA with indicator variables for every unique level, but due to its high cardinality (there are over 32,000 ZCTAs!), this is slow and leads to overfitting.
+For our second example, we will be using two data sets. The first is compliance actions carried out by the U.S. Labor Department's [Wage and Hour Division (WHD)](http://ogesdw.dol.gov/views/data_summary.php) from 2014-2015. This includes information on each investigation, including the zip code tabulation area (ZCTA) where the firm is located, number of violations found and civil penalties assessed. We want to predict whether a firm is a repeat and/or willful violator. In order to do this, we need to encode the categorical ZCTA column in a meaningful way. One common approach is to replace ZCTA with indicator variables for every unique level, but due to its high cardinality (there are over 32,000 ZCTAs!), this is slow and leads to overfitting.
 
-Instead, we will use GLRM to condense ZCTAs into a few numeric columns representing the demographics of that area. Our second data set is the 2009-2013 [American Community Survey (ACS)](http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk) 5-year estimates of household characteristics. Each row contains information for a unique ZCTA, such as average household size, number of children, education level and ethnicity. By transforming the WHD data with GLRM, we not only address the speed and overfitting issue, but also transfer knowledge between similar ZCTAs in our model.
+Instead, we will build a GLRM to condense ZCTAs into a few numeric columns representing the demographics of that area. Our second data set is the 2009-2013 [American Community Survey (ACS)](http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk) 5-year estimates of household characteristics. Each row contains information for a unique ZCTA, such as average household size, number of children and education. By transforming the WHD data with our GLRM, we not only address the speed and overfitting issues, but also transfer knowledge between similar ZCTAs in our model.
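The substitution this paragraph describes amounts to a table lookup: each ZCTA maps to its k condensed numeric values. A hypothetical pure-Python sketch (the ZCTA codes and weights below are invented for illustration; the tutorial performs this step in R with `h2o.merge`):

```python
# Hypothetical sketch: swap a high-cardinality ZCTA column for k = 3 numeric
# columns via a lookup table. All codes and weights below are invented.
zcta_to_arch = {
    "10001": [0.7, -0.2, 1.1],
    "94105": [-0.3, 0.9, 0.4],
}

rows = [{"zcta5_cd": "10001", "violations": 2},
        {"zcta5_cd": "94105", "violations": 0}]

condensed = []
for row in rows:
    new_row = {"violations": row["violations"]}
    for i, v in enumerate(zcta_to_arch[row["zcta5_cd"]]):
        new_row["Arch" + str(i + 1)] = v   # k numeric columns replace the category
    condensed.append(new_row)

print(condensed[0])
```

One-hot encoding would instead add one indicator column per unique ZCTA (over 32,000 of them); the lookup adds only k columns regardless of cardinality.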

 #### Condensing Categorical Data
 
-###### Initialize the H2O server and import the ACS data set.
+###### Initialize the H2O server and import the ACS data set. We use all available cores on our computer and allocate a maximum of 2 GB of memory to H2O.
 library(h2o)
-h2o.init(nthreads = -1, max_mem_size = "2G") # Use all available cores and 2 GB of memory
+h2o.init(nthreads = -1, max_mem_size = "2G")
 acs_orig <- h2o.importFile(path = "../data/ACS_13_5YR_DP02_cleaned.zip", col.types = c("enum", rep("numeric", 149)))
 
-###### Save and drop the zip code tabulation area column.
+###### Separate out the zip code tabulation area column.
 acs_zcta_col <- acs_orig$ZCTA5
 acs_full <- acs_orig[,-which(colnames(acs_orig) == "ZCTA5")]
 
 ###### Get a summary of the ACS data set.
 dim(acs_full)
 summary(acs_full)
 
-###### Build a GLRM to reduce ZCTA demographics to k = 10 archetypes. We standardize the data before performing the fit to ensure differences in scale don't unduly affect the algorithm. For the loss function, we select quadratic again, but this time, apply regularization to X and Y in order to sparsify the compressed features.
+###### Build a GLRM to reduce ZCTA demographics to k = 10 archetypes. We standardize the data before model building to ensure a good fit. For the loss function, we select quadratic again, but this time, apply regularization to X and Y in order to sparsify the condensed features.
 acs_model <- h2o.glrm(training_frame = acs_full, k = 10, transform = "STANDARDIZE",
 loss = "Quadratic", regularization_x = "Quadratic",
 regularization_y = "L1", max_iterations = 100, gamma_x = 0.25, gamma_y = 0.5)
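Why the L1 penalty sparsifies Y: proximal-gradient-style solvers, the family H2O's GLRM optimizer is generally described as belonging to, handle an L1 term by soft-thresholding after each gradient step, which snaps small entries to exactly zero. A minimal sketch of that operator (illustrative, not H2O's code):

```python
# Soft-thresholding: the proximal operator of gamma * |y|. L1-regularized
# proximal solvers apply it after each gradient step; entries smaller in
# magnitude than gamma become exactly zero, which is what sparsifies Y.
def soft_threshold(v, gamma):
    if v > gamma:
        return v - gamma
    if v < -gamma:
        return v + gamma
    return 0.0

gamma_y = 0.5                          # mirrors gamma_y = 0.5 in the R call above
y_row = [0.8, -0.3, 0.05, -1.2, 0.4]   # made-up row of Y before the proximal step
y_sparse = [soft_threshold(v, gamma_y) for v in y_row]
print(y_sparse)  # three of the five entries are now exactly 0.0
```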
@@ -174,7 +174,7 @@ Instead, we will use GLRM to condense ZCTAs into a few numeric columns represent
 
 We now build a deep learning model on the WHD data set to predict repeat and/or willful violators. For comparison purposes, we train our model using the original data, the data with the ZCTA column replaced by the compressed GLRM representation (the X matrix), and the data with the ZCTA column replaced by all the demographic features in the ACS data set.
 
-###### Import WHD data set and get a summary.
+###### Import the WHD data set and get a summary.
 whd_zcta <- h2o.importFile(path = "../data/whd_zcta_cleaned.zip", col.types = c(rep("enum", 7), rep("numeric", 97)))
 dim(whd_zcta)
 summary(whd_zcta)
@@ -184,20 +184,20 @@ We now build a deep learning model on the WHD data set to predict repeat and/or
 train <- whd_zcta[split <= 0.8,]
 test <- whd_zcta[split > 0.8,]
 
-###### Build a deep learning model on original WHD data to predict repeat/willful violators. Our response is a categorical column with four levels: N/A = neither repeat nor willful, R = repeat, W = willful, and RW = repeat and willful violator, so we specify a multinomial distribution. We skip the first four columns, which consist of case ID and location information that is already captured by the ZCTA.
+###### Build a deep learning model on the WHD data set to predict repeat/willful violators. Our response is a categorical column with four levels: N/A = neither repeat nor willful, R = repeat, W = willful, and RW = repeat and willful violator. Thus, we specify a multinomial distribution. We skip the first four columns, which consist of the case ID and location information that is already captured by the ZCTA.
 myY <- "flsa_repeat_violator"
 myX <- setdiff(5:ncol(train), which(colnames(train) == myY))
 orig_time <- system.time(dl_orig <- h2o.deeplearning(x = myX, y = myY, training_frame = train,
 validation_frame = test, distribution = "multinomial",
 epochs = 0.1, hidden = c(50,50,50)))
 
-###### Replace each ZCTA in the WHD data with the row of the X matrix corresponding to its compressed demographic representation. At the end, our single categorical column will be replaced by k = 10 numeric columns.
+###### Replace each ZCTA in the WHD data with the row of the X matrix corresponding to its condensed demographic representation. In the end, our single categorical column will be replaced by k = 10 numeric columns.
 zcta_arch_x$zcta5_cd <- acs_zcta_col
 whd_arch <- h2o.merge(whd_zcta, zcta_arch_x, all.x = TRUE, all.y = FALSE)
 whd_arch$zcta5_cd <- NULL
 summary(whd_arch)
 
-###### Split the reduced WHD data into test and train, and build a deep learning model to predict repeat/willful violators.
+###### Split the reduced WHD data into test/train and build a deep learning model to predict repeat/willful violators.
 train_mod <- whd_arch[split <= 0.8,]
 test_mod <- whd_arch[split > 0.8,]
 myX <- setdiff(5:ncol(train_mod), which(colnames(train_mod) == myY))
@@ -211,7 +211,7 @@ We now build a deep learning model on the WHD data set to predict repeat and/or
 whd_acs$zcta5_cd <- NULL
 summary(whd_acs)
 
-###### Split the combined WHD-ACS data into test and train, and build a deep learning model to predict repeat/willful violators.
+###### Split the combined WHD-ACS data into test/train and build a deep learning model to predict repeat/willful violators.
 train_comb <- whd_acs[split <= 0.8,]
 test_comb <- whd_acs[split > 0.8,]
 myX <- setdiff(5:ncol(train_comb), which(colnames(train_comb) == myY))
File: binary image changed (53.1 KB)
