Commit 78843b9

Updates to intro-to-datascience tutorial
1 parent d429684 commit 78843b9

1 file changed

Lines changed: 146 additions & 4 deletions

@@ -1,15 +1,15 @@
1-
# Introduction to Data Science, Machine Learning and Predictive Analytics
1+
# Introduction to Data Science, Machine Learning & Predictive Analytics
22
- Overview
33
- What is Data Science?
44
- Data Science Tasks
55
- Problem Formulation
6-
- Data Processing
6+
- Collect & Process Data
77
- Machine Learning
8-
- Insights and Action
8+
- Insights & Action
99
- What is Machine Learning?
1010
- Machine Learning Tasks
11-
- Classification (Binary and Multiclass)
1211
- Regression
12+
- Classification (Binary and Multiclass)
1313
- Ranking
1414
- Clustering
1515
- Dimensionality Reduction
@@ -18,13 +18,155 @@
1818

1919
## Overview
2020

21+
This tutorial is designed to provide an introduction to concepts in the fields of data science and machine learning. A shorter version of this content, designed for non-data scientists, is available as a [slidedeck](http://www.slideshare.net/0xdata/intro-to-data-science-for-nondata-scientists) from a previous H2O meetup.
22+
2123
## What is Data Science?
2224

25+
One of the earliest uses of the term "data science" occurred in the title of the 1996 [International Federation of Classification Societies (IFCS)](http://www.classification-society.org/ifcs/index.html) conference in Kobe, Japan.
26+
27+
![IFCS Conference 1996](images/datascience_poster_1996ifcs.jpg)
28+
29+
The term re-emerged and was popularized by [William Cleveland](http://www.stat.purdue.edu/~wsc/) (then at Bell Labs) when he published "[Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics](http://www.stat.purdue.edu/~wsc/papers/datascience.pdf)" in 2001.
30+
31+
This publication describes a plan to enlarge the major areas of technical work in the field of statistics. Dr. Cleveland states, "Since the plan is ambitious and implies substantial change, the altered field will be called 'data science.'" The plan sets out six technical areas for a university department and advocates allocating a specific share of resources to research and development in each area, expressed as a percentage of the resources available beyond those needed to teach the courses in the department's curriculum. Those areas are:
32+
- Multidisciplinary Investigations
33+
- Models and Methods for Data
34+
- Computing with Data
35+
- Pedagogy
36+
- Tool Evaluation
37+
- Theory
38+
39+
Since then, use of the term has grown rapidly, with a sharp increase in the past few years (the chart below shows Google Trends for several related terms):
40+
41+
![DS Google Trends](images/data_terms_google_trends.png)
42+
43+
More recently, the term "data science" has come to describe an amalgamation of topics from a variety of technical fields (statistics, machine learning, computer science, engineering, visualization) that are concerned with processing data and learning from data.
44+
45+
More information on the history of the term "data science" is chronicled [here](http://whatsthebigdata.com/2012/04/26/a-very-short-history-of-data-science/).
46+
47+
2348
## Data Science Tasks
2449

50+
Data science can also be defined by the processes that are required to solve problems. Here is a summary of those major tasks:
51+
- Problem Formulation
52+
- Identify an outcome of interest and the type of task (e.g. classification, regression)
53+
- Identify the potential predictor variables
54+
- Identify the independent sampling units
55+
- Collect & Process Data
56+
- Conduct a research experiment (e.g. a clinical trial)
57+
- Collect examples / randomly sample the population
58+
- Transform, clean, impute, filter, aggregate data
59+
- Prepare the data for machine learning (predictors X and outcome Y)
60+
- Machine Learning
61+
- Modeling using a machine learning algorithm (training)
62+
- Model evaluation and comparison
63+
- Sensitivity & Cost Analysis
64+
- Insights & Action
65+
- Translate results into action items
66+
- Feed results into research pipeline
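
The "Collect & Process Data" steps above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended workflow: the records, the column meanings (age, weight, outcome) and the helper name `impute_column` are all hypothetical, and mean imputation is only one of many possible strategies.

```python
from statistics import mean

# Hypothetical raw records: (age, weight_kg, outcome); None marks a missing value.
raw = [
    (34, 70.0, 1),
    (51, None, 0),
    (29, 82.5, 1),
    (None, 64.0, 0),
    (45, 91.0, None),  # outcome missing: this row cannot be used for training
]

# Filter: drop rows that have no recorded outcome.
rows = [r for r in raw if r[2] is not None]


def impute_column(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Impute: fill in missing predictor values column by column.
ages = impute_column([r[0] for r in rows])
weights = impute_column([r[1] for r in rows])

# Prepare the data for machine learning: X holds predictors, Y holds the outcome.
X = [list(pair) for pair in zip(ages, weights)]
Y = [r[2] for r in rows]
```

Here `X` and `Y` play the role of the "X, Y" pair mentioned in the list: one row of predictors per sampling unit, and one outcome value per row.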
67+
2568
## What is Machine Learning?
2669

70+
So, now that we have loosely defined the term "data science", how do we define "machine learning"?
71+
72+
Here are a few definitions of machine learning by the experts.
73+
74+
What is machine learning?
75+
```
76+
"Field of study that gives computers the ability to learn without
77+
being explicitly programmed."
78+
79+
-- Arthur Samuel, 1959
80+
```
81+
82+
Unlike rule-based systems, which require a human expert to hard-code domain knowledge directly into the system, a machine learning algorithm learns how to make decisions from the data alone.
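
To make the contrast concrete, here is a small sketch with made-up data. The spam-by-message-length setup, the cutoff of 50 and the function names are assumptions chosen for illustration only. The first function hard-codes a rule; the second picks a cutoff by minimizing errors on labeled examples.

```python
# Hypothetical labeled examples: (message_length, is_spam)
data = [(12, 0), (15, 0), (22, 0), (35, 1), (41, 1), (58, 1)]


def rule_based(length):
    # A human expert hard-codes the cutoff into the system.
    return 1 if length > 50 else 0


def learn_threshold(examples):
    # Candidate cutoffs are the midpoints between adjacent sorted feature values;
    # keep the candidate that misclassifies the fewest training examples.
    xs = sorted(x for x, _ in examples)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        return sum((x > t) != bool(y) for x, y in examples)

    return min(candidates, key=errors)


threshold = learn_threshold(data)

# Count training errors for each approach.
rule_errors = sum(rule_based(x) != y for x, y in data)
learned_errors = sum((x > threshold) != bool(y) for x, y in data)
```

On this toy data the learned cutoff separates the two classes, while the hand-written rule misclassifies two examples. Real problems rarely separate this cleanly, and training error alone says little about performance on new data.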
83+
84+
85+
What is Machine Learning vs. Statistics?
86+
```
87+
"Machine learning and statistics are closely related fields.
88+
The ideas of machine learning, from methodological principles
89+
to theoretical tools, have had a long pre-history in statistics."
90+
91+
"I personally don't make the distinction between statistics and
92+
machine learning..."
93+
94+
-- Michael I. Jordan, 2014
95+
```
96+
And to put it more bluntly...
97+
```
98+
"When Leo Breiman developed random forests, was he being a
99+
statistician or a machine learner?
100+
101+
When my colleagues and I developed latent Dirichlet allocation,
102+
were we being statisticians or machine learners?
103+
104+
Are the SVM and boosting machine learning while logistic
105+
regression is statistics, even though they're solving essentially
106+
the same optimization problems up to slightly different shapes in
107+
a loss function?
108+
109+
Why does anyone think that these are meaningful distinctions?"
110+
111+
-- Michael I. Jordan, 2014
112+
```
113+
Michael I. Jordan also suggested the term "data science" as a placeholder name for the overall field.
114+
115+
27116
## Machine Learning Tasks
28117

118+
There are a few concepts that you should become familiar with when first exploring machine learning.
119+
120+
- Training Data:
121+
- Features:
122+
- Models:
123+
- Supervised Learning:
124+
- Unsupervised Learning:
125+
126+
### Regression
127+
Regression is the term for training a model to predict a real-valued response (e.g. weight, price, viral load).
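
As a minimal sketch of this definition, the following fits a straight line by ordinary least squares to a handful of made-up points. The height and weight values and the helper name `fit_line` are hypothetical, and simple linear regression is only one of many regression algorithms.

```python
from statistics import mean

# Hypothetical training data: height in cm (feature), weight in kg (real-valued response).
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [52.0, 58.0, 67.0, 73.0, 80.0]


def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Training: estimate the line from the examples.
slope, intercept = fit_line(heights, weights)

# Prediction: apply the fitted line to a new, unseen height.
predicted = slope * 175.0 + intercept
```

The fitted model outputs a number rather than a category, which is what distinguishes regression from classification.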
128+
... TO DO
129+
### Classification (Binary and Multiclass)
130+
TO DO
131+
### Ranking
132+
TO DO
133+
### Clustering
134+
TO DO
135+
### Dimensionality Reduction
136+
Dimension reduction can be useful when .... (TO DO)
137+
138+
There are several ways to reduce the dimensionality of your data, and they fall into two groups of methods: feature selection and feature extraction.
139+
#### Feature Selection
140+
Feature selection is the process of selecting a subset of the original features in the dataset, usually with the intention of producing a better predictive model than one trained on the full set. Feature selection is also known as "variable selection" or "variable subset selection".
141+
142+
In the age of Big Data, many noisy and/or useless variables are recorded and included in a training set. If there are too many of these variables present, some algorithms will have a harder time finding the signal in the noise (i.e. producing a strong predictive model).
143+
144+
Here are a few common feature selection methods:
145+
- Lasso Regression
146+
- Forward and Backward Stepwise Regression
147+
- Forward and Backward Stagewise Regression
148+
- TO DO ...
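
To illustrate how lasso regression can act as a feature selector, here is a small sketch that fits a lasso model by cyclic coordinate descent. Several things are assumptions made for the example: the toy data, the penalty value `lam = 0.1`, the objective scaled as squared error divided by `2n` plus `lam` times the sum of absolute coefficients, and the omission of an intercept (the data are already centered). Coordinate descent is one common way to fit the lasso, not the only one.

```python
def soft_threshold(rho, lam):
    # Shrink rho toward zero by lam; values within [-lam, lam] become exactly zero.
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize (1/(2n)) * sum((y - X @ beta)^2) + lam * sum(|beta|), no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta


# Hypothetical centered data: the response tracks feature 0; feature 1 is noise.
X = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, -1.0]]
y = [-6.2, -2.9, 0.1, 3.1, 5.9]

beta = lasso_coordinate_descent(X, y, lam=0.1)

# Features with a nonzero coefficient are the ones the lasso "selected".
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

The penalty drives the coefficient of the noise feature to exactly zero, so only feature 0 is kept. Larger values of `lam` remove more features, at the cost of shrinking the remaining coefficients further.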
149+
150+
#### Feature Extraction
151+
Feature extraction is a general term for deriving new features from the original dataset. This is usually done to reduce the dimensionality of the data (that is, the number of features or columns). The data transformation may be linear (as in PCA) or nonlinear.
152+
153+
Here are a few feature extraction methods:
154+
- Principal Components Analysis (PCA)
155+
- Linear Discriminant Analysis (LDA)
156+
- Generalized Low Rank Models (GLRM)
157+
- TO DO ...
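
As a sketch of linear feature extraction, the following computes the first principal component of a tiny two-column dataset and projects the data onto it, replacing two columns with one derived feature. The data points and function names are hypothetical, and power iteration is used here only because it needs nothing beyond the standard library. Numerical libraries typically compute PCA with an eigendecomposition or SVD instead.

```python
import math


def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    return cov, centered


def first_principal_component(cov, n_iters=100):
    """Power iteration: repeatedly multiply by cov and renormalize.

    Assumes the all-ones start vector is not orthogonal to the top eigenvector.
    """
    p = len(cov)
    v = [1.0] * p
    for _ in range(n_iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical data: two strongly correlated columns.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]

cov, centered = covariance_matrix(data)
pc = first_principal_component(cov)

# New one-dimensional feature: the projection of each centered row onto pc.
scores = [sum(c[j] * pc[j] for j in range(len(pc))) for c in centered]

# Share of total variance captured by the first component.
p = len(pc)
top_variance = sum(pc[a] * cov[a][b] * pc[b] for a in range(p) for b in range(p))
explained = top_variance / sum(cov[j][j] for j in range(p))
```

Because the two columns move together, a single derived feature retains almost all of the variance, which is the situation in which PCA is most useful for dimensionality reduction.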
158+
159+
160+
161+
### Feature Importance
162+
- TO DO: Random Forest, GLM Coefficients, etc
163+
164+
29165
## Data Science Pipelines
30166

167+
![ML Workflow](images/ml_workflow.png)
168+
169+
170+
## References
171+
[1] [https://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan/ckelmtt](https://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan/ckelmtt)
172+
