Skip to content

Latest commit

 

History

History
 
 

README.md

#Machine Learning Tutorial

This module illustrates some basic machine learning in Python using Sci-Kit Learn.

We do not assume a priori that any single model will be best for the data. Instead, we loop over multiple classifiers and parameterizations. In this way, we can run hundreds of models and select the best one on a variety of metrics, such as precision and recall.

Example Precision-Recall Plot:


Quick Start Guide

Example data for this repository comes from the General Social Survey (GSS) 2014. More notes on the data preprocessing are detailed in the data folder.

To run the example:

git clone --recursive https://github.com/jmausolf/Python_Tutorials
cd Python_Tutorials/Machine_Learning
python run.py

The Details

To modify the default data, outcome variable, or parameters, open the run.py script with your favorite text editor, such as Atom, Sublime, or Vim. Here, you must specify the dataset (as .CSV), the outcome variable you would like to predict (for binary classification), and the features you would like to use to make the predictions.

Default Example:

#Define Data
dataset  = 'file'
outcome  = 'partyid_str_rep'
features = ['age', 'sex', 'race', 'educ', 'rincome']

Once you edit these fields, save the script, and in terminal execute: python run.py

Note:

Your data may have hundreds of features (independent variables/predictors). If you would like to use all of them (and would rather not type write them all out explicitly) simply uses the --all_features option of the magic loop.

python run.py --all_features True

Of course, an overlooked aspect thus far is feature development. The GSS data in the example is not in the ideal form. For example, most of the data is categorical. We might want to make indicators for each categorical column, calculate various aggregations or interactions, among other possibilities. An ideal data pipeline might make the changes to the feature set prior running this script.


Acknowledgements

This tutorial makes use of a modified submodule, originally forked from @rayidghani magicloops. It has been updated to run in Python2 or Python3. In addition, my modified fork modifies the plotting code and several of the functions to take a user-specified dataset, outcome variable, and features.