
Commit 59be7fa
Author: anqi
Add comparison of deep learning model on combined WHD and ACS training data
1 parent: 9ce4797

1 file changed: tutorials/glrm/glrm-tutorial.md (33 additions, 21 deletions)
@@ -25,19 +25,19 @@ This tutorial introduces the Generalized Low Rank Model (GLRM) [[1](#references)
 
 Across business and research, analysts seek to understand large collections of data with numeric and categorical values. Many entries in this table may be noisy or even missing altogether. Low rank models facilitate the understanding of tabular data by producing a condensed vector representation for every row and column in the data set.
 
-Specifically, given a data table A with m rows and n columns, a GLRM consists of a decomposition of A into numeric matrices X and Y. The matrix X has the same number of rows as A, but only a low, user-specified number of columns k. The matrix Y has k rows and number of columns d equal to the total dimension of the embedded features in A. For example, if A has 3 numeric columns and 1 categorical column with 4 distinct levels (e.g., red, yellow, blue and green), then Y will have 7 columns. When A contains only numeric features, the number of columns in A and Y will be identical.
+Specifically, given a data table A with m rows and n columns, a GLRM consists of a decomposition of A into numeric matrices X and Y. The matrix X has the same number of rows as A, but only a small, user-specified number of columns k. The matrix Y has k rows and d columns, where d is equal to the total dimension of the embedded features in A. For example, if A has 4 numeric columns and 1 categorical column with 3 distinct levels (e.g., _setosa_, _versicolor_ and _virginica_), then Y will have 7 columns. When A contains only numeric features, the number of columns in A and Y will be identical.
 
 ![GLRM Matrix Decomposition](images/glrm_matrix_decomposition.png)
 
-Both X and Y have practical interpretations. Each row of Y is an archetypal feature formed from the columns of A, and each row of X corresponds to a row of A projected into this reduced feature space. We can approximately reconstruct A from the matrix product XY, which has rank k. The number k is chosen to be much less than both m and n: a typical value for 1 million rows and 2,000 columns of numeric data is k = 15. The smaller is k, the more compression we gain from our low rank representation.
+Both X and Y have practical interpretations. Each row of Y is an archetypal feature formed from the columns of A, and each row of X corresponds to a row of A projected into this reduced feature space. We can approximately reconstruct A from the matrix product XY, which has rank k. The number k is chosen to be much less than both m and n: a typical value for 1 million rows and 2,000 columns of numeric data is k = 15. The smaller k is, the more compression we gain from our low rank representation.
 
-GLRMs are an extension of well-known matrix factorization methods such as Principal Components Analysis (PCA). While PCA is limited to numeric data, GLRMs can handle mixed numeric, categorical, ordinal, and Boolean data with an arbitrary number of missing values. It allows the user to apply regularization to X and Y, imposing restrictions like non-negativity appropriate to a particular data science context. Thus, it is an extremely flexible approach to analyzing and interpreting heterogeneous data sets.
+GLRMs are an extension of well-known matrix factorization methods such as Principal Components Analysis (PCA). While PCA is limited to numeric data, GLRMs can handle mixed numeric, categorical, ordinal and Boolean data with an arbitrary number of missing values. It allows the user to apply regularization to X and Y, imposing restrictions like non-negativity appropriate to a particular data science context. Thus, it is an extremely flexible approach for analyzing and interpreting heterogeneous data sets.
 
 ## Why use Low Rank Models?
 
 - **Memory:** By saving only the X and Y matrices, we can significantly reduce the amount of memory required to store a large data set. A file that is 10 GB can be compressed down to 100 MB. When we need the original data again, we can reconstruct it on the fly from X and Y with minimal loss in accuracy.
 - **Speed:** We can use GLRM to compress data with high-dimensional, heterogeneous features into a few numeric columns. This leads to a huge speed-up in model-building and prediction, especially by machine learning algorithms that scale poorly with the size of the feature space. Below, we will see an example with 10x speed-up and no accuracy loss in deep learning.
-- **Feature Engineering:** The Y matrix represents the most important combinations of features from the training data. These condensed features, called archetypes, can be analyzed, visualized and incorporated into various data science applications.
+- **Feature Engineering:** The Y matrix represents the most important combination of features from the training data. These condensed features, called archetypes, can be analyzed, visualized and incorporated into various data science applications.
 - **Missing Data Imputation:** Reconstructing a data set from X and Y will automatically impute missing values. This imputation is accomplished by intelligently leveraging the information contained in the known values of each feature, as well as user-provided parameters such as the loss function.
 
 ## Example 1: Visualizing Walking Stances
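An aside before the examples: the compression and dimension claims in the introduction above are easy to sanity-check. The sketch below is purely illustrative; it uses the m, n and k values quoted in the intro and the 4-numeric-plus-3-level example for d, and none of it comes from the commit itself.

    # Back-of-the-envelope check of the compression claims above.
    # A is m x n; X is m x k; Y is k x d (d = n when all columns are numeric).
    m <- 1e6; n <- 2000; k <- 15
    cells_A  <- m * n            # 2e9 stored values for A
    cells_XY <- m * k + k * n    # 1.5e7 + 3e4 stored values for X and Y together
    cells_A / cells_XY           # ~133x fewer values to store

    # Dimension bookkeeping for mixed data: 4 numeric columns plus one
    # categorical column with 3 levels embed into d = 4 + 3 = 7 columns of Y.
    d <- 4 + 3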
@@ -48,9 +48,8 @@ For our first example, we will use data on [Subject 01's walking stances](https:
 
 ###### Initialize the H2O server and import our walking stance data.
 library(h2o)
-h2o.init()
-pathToData <- "/data/h2o-training/glrm/subject01_walk1.csv"
-gait.hex <- h2o.importFile(path = pathToData, destination_frame = "gait.hex")
+h2o.init(nthreads = -1, max_mem_size = "2G")    # Use all available cores and 2 GB of memory
+gait.hex <- h2o.importFile(path = "../data/glrm/subject01_walk1.csv", destination_frame = "gait.hex")
 
 ###### Get a summary of the imported data set.
 dim(gait.hex)
@@ -100,13 +99,12 @@ For our first example, we will use data on [Subject 01's walking stances](https:
 
 Suppose that due to a sensor malfunction, our walking stance data has missing values randomly interspersed. We can use GLRM to reconstruct these missing values from the existing data.
 
-###### Import walking stance data containing 15% missing values.
-pathToMissingData <- "/data/h2o-training/glrm/subject01_walk1_miss15.csv"
-gait.miss <- h2o.importFile(path = pathToMissingData, destination_Frame = "gait.miss")
-
-###### Get a summary of the imported data set.
+###### Import walking stance data containing 15% missing values and get a summary.
+gait.miss <- h2o.importFile(path = "../data/glrm/subject01_walk1_miss15.csv", destination_frame = "gait.miss")
 dim(gait.miss)
 summary(gait.miss)
+
+###### Count the total number of missing values in the data set.
 sum(is.na(gait.miss))
 
 ###### Build a basic GLRM with quadratic loss and no regularization, validating on our original data set with no missing values. We change the algorithm initialization method, increase the maximum number of iterations to 2,000, and reduce the minimum step size to 1e-6 to ensure it converges.
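The h2o.glrm call that this heading describes falls between hunks, so the diff does not show it. A hedged sketch consistent with the description: the loss, the absence of regularization, the iteration cap and the step size follow the heading, while k = 10 and init = "SVD" are assumptions rather than values taken from the commit.

    # Hedged sketch of the GLRM described above; k and init are assumed.
    gait.glrm2 <- h2o.glrm(training_frame = gait.miss, validation_frame = gait.hex,
                           k = 10, loss = "Quadratic",
                           regularization_x = "None", regularization_y = "None",
                           init = "SVD", max_iterations = 2000, min_step_size = 1e-6)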
@@ -117,7 +115,7 @@ Suppose that due to a sensor malfunction, our walking stance data has missing va
 ###### Impute missing values in our training data from X and Y.
 gait.pred2 <- predict(gait.glrm2, gait.miss)
 head(gait.pred2)
-sum(is.na(gait.pred2))
+sum(is.na(gait.pred2))    # No missing values in reconstructed data!
 
 ###### Plot original and reconstructed data of the x-coordinate of the left acromium. Red x's mark the points where the training data contains a missing value, so we can see how accurate our imputation is.
 lacro.pred.df2 <- as.data.frame(gait.pred2$reconstr_L.Acromium.X[1:150])
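The plotting code itself is cut off by the hunk boundary. A hedged sketch of a plot matching the description, reusing lacro.pred.df2 from above; every other name and plotting choice here is an assumption.

    # Hedged sketch: original trace, imputed trace, and red x's (pch = 4)
    # at the time points where the training data was missing.
    lacro.df <- as.data.frame(gait.hex$L.Acromium.X[1:150])
    miss_idx <- which(is.na(as.data.frame(gait.miss$L.Acromium.X[1:150])[, 1]))
    plot(1:150, lacro.df[, 1], type = "l",
         xlab = "Time", ylab = "X-Coordinate of Left Acromium")
    lines(1:150, lacro.pred.df2[, 1], col = "blue")
    points(miss_idx, lacro.pred.df2[miss_idx, 1], col = "red", pch = 4)
    legend("topright", legend = c("Original", "Imputed"),
           col = c("black", "blue"), lty = 1)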
@@ -137,9 +135,8 @@ Instead, we will use GLRM to condense ZCTAs into a few numeric columns represent
 
 ###### Initialize the H2O server and import the ACS data set.
 library(h2o)
-h2o.init()
-pathToACSData <- "/data/h2o-training/glrm/ACS_13_5YR_DP02_cleaned.zip"
-acs_orig <- h2o.uploadFile(path = pathToACSData, col.types = c("enum", rep("numeric", 149)))
+h2o.init(nthreads = -1, max_mem_size = "2G")    # Use all available cores and 2 GB of memory
+acs_orig <- h2o.importFile(path = "../data/glrm/ACS_13_5YR_DP02_cleaned.zip", col.types = c("enum", rep("numeric", 149)))
 
 ###### Save and drop the zip code tabulation area column.
 acs_zcta_col <- acs_orig$ZCTA5
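The drop step named in the heading lands just outside the hunk; presumably it is the one-liner below. Note also this commit's switch from h2o.uploadFile to h2o.importFile: uploadFile pushes a file from the R client to the cluster, while importFile has the server read a path it can see itself, which fits the new relative paths.

    # Presumed drop step (elided by the diff): remove the ZCTA column
    # now that it is saved in acs_zcta_col.
    acs_orig$ZCTA5 <- NULL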
@@ -166,6 +163,7 @@ Instead, we will use GLRM to condense ZCTAs into a few numeric columns represent
 (acs_zcta_col == "84104") | # Salt Lake City, UT
 (acs_zcta_col == "94086") | # Sunnyvale, CA
 (acs_zcta_col == "95014")) # Cupertino, CA
+
 city_arch <- as.data.frame(zcta_arch_x[idx,1:2])
 xeps <- (max(city_arch[,1]) - min(city_arch[,1])) / 10
 yeps <- (max(city_arch[,2]) - min(city_arch[,2])) / 10
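For orientation: zcta_arch_x above is the X matrix of a GLRM fit to the ACS data in a portion of the tutorial this commit leaves untouched. A hedged sketch of the usual H2O pattern for obtaining it; k and the regularization choices are illustrative assumptions.

    # Hedged sketch: fit a GLRM to the ACS data, then pull out the X matrix,
    # which H2O stores as a separate frame keyed by representation_name.
    acs_model <- h2o.glrm(training_frame = acs_orig, k = 10,
                          transform = "STANDARDIZE", loss = "Quadratic",
                          regularization_x = "Quadratic", regularization_y = "L1",
                          max_iterations = 100)
    zcta_arch_x <- h2o.getFrame(acs_model@model$representation_name)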
@@ -179,8 +177,7 @@
 We now build a deep learning model on the WHD data set to predict repeat and/or willful violators. For comparison purposes, we train our model using the original data, the data with the ZCTA column replaced by the compressed GLRM representation (the X matrix), and the data with the ZCTA column replaced by all the demographic features in the ACS data set.
 
 ###### Import WHD data set and get a summary.
-pathToWHDData <- "/data/h2o-training/glrm/whd_zcta_cleaned.zip"
-whd_zcta <- h2o.uploadFile(path = pathToWHDData, col.types = c(rep("enum", 7), rep("numeric", 97)))
+whd_zcta <- h2o.importFile(path = "../data/glrm/whd_zcta_cleaned.zip", col.types = c(rep("enum", 7), rep("numeric", 97)))
 dim(whd_zcta)
 summary(whd_zcta)
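The split vector that the next hunk indexes with comes from an earlier, unchanged part of the tutorial. The standard idiom is a uniform random vector with one draw per row; the seed here is an assumption.

    # Presumed creation of the 80/20 split vector used below.
    split <- h2o.runif(whd_zcta, seed = 1234)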

@@ -189,7 +186,7 @@ We now build a deep learning model on the WHD data set to predict repeat and/or
 train <- whd_zcta[split <= 0.8,]
 test <- whd_zcta[split > 0.8,]
 
-###### Build a deep learning model on original WHD data to predict repeat/willful violators. Our response is a categorical column with four levels: N/A = neither repeat nor willful, R = repeat, W = willful, and RW = repeat and willful violator, so we specify a multinomial distribution.
+###### Build a deep learning model on original WHD data to predict repeat/willful violators. Our response is a categorical column with four levels: N/A = neither repeat nor willful, R = repeat, W = willful, and RW = repeat and willful violator, so we specify a multinomial distribution. We skip the first four columns, which consist of case ID and location information that is already captured by the ZCTA.
 myY <- "flsa_repeat_violator"
 myX <- setdiff(5:ncol(train), which(colnames(train) == myY))
 orig_time <- system.time(dl_orig <- h2o.deeplearning(x = myX, y = myY, training_frame = train,
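Between this hunk and the next, the tutorial constructs the reduced frames train_mod and test_mod that appear below; the diff elides that step. By analogy with the combined-data merge added later in this commit, it plausibly looks like the sketch below, where zcta_arch (the X matrix with its ZCTA key re-attached) is an assumed name.

    # Hedged sketch: swap each ZCTA for its GLRM archetypes, then reuse the split.
    whd_arch <- h2o.merge(whd_zcta, zcta_arch, all.x = TRUE, all.y = FALSE)
    train_mod <- whd_arch[split <= 0.8,]
    test_mod  <- whd_arch[split > 0.8,]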
@@ -210,10 +207,25 @@ We now build a deep learning model on the WHD data set to predict repeat and/or
 validation_frame = test_mod, distribution = "multinomial",
 epochs = 0.1, hidden = c(50,50,50)))
 
-###### Compare the performance between the two models. We see that the model built on the reduced WHD data set finishes almost 10 times faster than the model using the original data set, and it yields a lower log-loss error.
+###### Replace each ZCTA in the WHD data with the row of ACS data containing its full demographic information.
+colnames(acs_orig)[1] <- "zcta5_cd"
+whd_acs <- h2o.merge(whd_zcta, acs_orig, all.x = TRUE, all.y = FALSE)
+whd_acs$zcta5_cd <- NULL
+summary(whd_acs)
+
+###### Split the combined WHD-ACS data into test and train, and build a deep learning model to predict repeat/willful violators.
+train_comb <- whd_acs[split <= 0.8,]
+test_comb <- whd_acs[split > 0.8,]
+myX <- setdiff(5:ncol(train_comb), which(colnames(train_comb) == myY))
+comb_time <- system.time(dl_comb <- h2o.deeplearning(x = myX, y = myY, training_frame = train_comb,
+validation_frame = test_comb, distribution = "multinomial",
+epochs = 0.1, hidden = c(50,50,50)))
+
+###### Compare the performance between the three models. We see that the model built on the reduced WHD data set finishes almost 10 times faster than the model using the original data set, and it yields a lower log-loss error. The model with the combined WHD-ACS data set does not improve significantly on this error. We can conclude that our GLRM compressed the ZCTA demographics with little loss of information.
 data.frame(original = c(orig_time[3], h2o.logloss(dl_orig, train = TRUE), h2o.logloss(dl_orig, valid = TRUE)),
 reduced = c(mod_time[3], h2o.logloss(dl_mod, train = TRUE), h2o.logloss(dl_mod, valid = TRUE)),
-row.names = c("runtime", "train_logloss", "test_logloss"))
+combined = c(comb_time[3], h2o.logloss(dl_comb, train = TRUE), h2o.logloss(dl_comb, valid = TRUE)),
+row.names = c("runtime", "train_logloss", "test_logloss"))
 
 ## References