Topic_Models_for_Text

Welcome to Topic Model Tutorial

This package allows you to analyze a set of .txt text files or a dataset of Twitter tweets using topic model algorithms from [SciKit-Learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition).

Dependencies

To run this package, you will need several Python packages.

This package was written on Mac OS X. It has not been tested on Linux or Windows.
This package requires an Anaconda distribution of Python, either 2.7+ or 3.5+. See https://www.continuum.io/downloads. The following packages should be included with Anaconda, but you can install them if needed:
- scikit-learn: pip install scikit-learn
- NLTK: pip install nltk
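
Before running the tutorial, it can help to confirm that both dependencies are importable. The snippet below is a minimal sketch and not part of the repository; the helper name `missing_packages` and the `REQUIRED` mapping are hypothetical and only illustrate the check.

```python
import importlib.util

# Map the name used with pip to the module name used with `import`.
# This mapping is an assumption for illustration, not taken from the repository.
REQUIRED = {"scikit-learn": "sklearn", "nltk": "nltk"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```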

To Run

Clone this repository and navigate to the tutorial:

git clone https://github.com/jmausolf/Python_Tutorials
cd Python_Tutorials/Topic_Models_for_Text/

Run one of the examples:

Run a Non-Negative Matrix Factorization (NMF) topic model using a TFIDF vectorizer with custom tokenization

# Run the NMF Model on Presidential Speech
python topic_modelr.py "text_tfidf_custom" "nmf" 15 10 2 4 "data/president"

Run a Latent Dirichlet Allocation (LDA) topic model using a TFIDF vectorizer with custom tokenization

# Run the LDA Model on Clinton Tweets
python topic_modelr.py "tweet_tfidf_custom" "lda" 15 5 1 4 "data/twitter"

Diving Into the Code

To learn more about the code, please check out the tutorial.

To get help running this script, consult the help message:

python topic_modelr.py --help

This yields the following:

usage: topic_modelr.py [-h]
                       vectorizer_type topic_clf n_topics n_top_terms
                       req_ngram_range [req_ngram_range ...] file_path

Prepare input file

positional arguments:
  vectorizer_type  Select the desired vectorizer for either text or tweet
                   @ text_tfidf_std       | TFIDF Vectorizer (for text)
                   @ text_tfidf_custom    | TFIDF Vectorizer with Custom Tokenizer (for text)
                   @ text_count_std       | Count Vectorizer

                   @ tweet_tfidf_std      | TFIDF Vectorizer (for tweets)
                   @ tweet_tfidf_custom   | TFIDF Vectorizer with Custom Tokenizer (for tweets)

  topic_clf        Select the desired topic model classifier (clf)
                   @ lda     | Topic Model: LatentDirichletAllocation (LDA)
                   @ nmf     | Topic Model: Non-Negative Matrix Factorization (NMF)
                   @ pca     | Topic Model: Principal Components Analysis (PCA)

  n_topics         Select the number of topics to return (as integer)
                   Note: requires n >= number of text files or tweets

                   Consider the following examples:

                   @ 10     | Example: Returns 5 topics
                   @ 15     | Example: Returns 10 topics

  n_top_terms      Select the number of top terms to return for each topic (as integer)
                   Consider the following examples:

                   @ 10     | Example: Returns 10 terms for each topic
                   @ 15     | Example: Returns 15 terms for each topic

  req_ngram_range  Select the requested 'ngram' or number of words per term
                   @ NG-1:  | ngram of length 1, e.g. "pay"
                   @ NG-2:  | ngram of length 2, e.g. "fair share"
                   @ NG-3:  | ngram of length 3, e.g. "pay fair share"

                   Consider the following ngram range examples:

                   @ [1, 2] | Return ngrams of lengths 1 and 2
                   @ [2, 5] | Return ngrams of lengths 2 through 5

  file_path        Select the desired file path for the data

                   Consider the following ngram range examples:

                   @ data/twitter      | Uses data in the data/twitter subdirectory
                   @ data/president    | Uses data in the data/president subdirectory
                   @ .                 | Uses data in the current directory


optional arguments:
  -h, --help       show this help message and exit
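
The usage summary above implies a command-line interface along the following lines. This is a sketch reconstructed from the help text only, not the parser in topic_modelr.py; the helper names `build_parser` and `to_ngram_range` are hypothetical, and the conversion of `req_ngram_range` into a (min, max) pair is an assumption about how the values are used.

```python
import argparse


def build_parser():
    """Build a parser with the six positional arguments shown in the usage line."""
    parser = argparse.ArgumentParser(
        prog="topic_modelr.py", description="Prepare input file"
    )
    parser.add_argument("vectorizer_type", help="e.g. text_tfidf_custom")
    parser.add_argument("topic_clf", help="lda, nmf, or pca")
    parser.add_argument("n_topics", type=int, help="number of topics")
    parser.add_argument("n_top_terms", type=int, help="terms listed per topic")
    # nargs="+" accepts one or more integers, e.g. "2 4" for lengths 2 through 4.
    parser.add_argument("req_ngram_range", type=int, nargs="+")
    parser.add_argument("file_path", help="directory that holds the data")
    return parser


def to_ngram_range(values):
    """Turn the requested ngram lengths into a (min, max) pair (assumed usage)."""
    return (min(values), max(values))


# Parse the NMF example from the "To Run" section.
args = build_parser().parse_args(
    ["text_tfidf_custom", "nmf", "15", "10", "2", "4", "data/president"]
)
print(args.n_topics, args.n_top_terms, to_ngram_range(args.req_ngram_range))
```

Read this way, the NMF example requests 15 topics, 10 top terms per topic, and ngrams of lengths 2 through 4 from the data/president directory.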
