shrreya24/Amazon-Review-Analysis

NLP-Based Insight Extraction from Amazon Customer Reviews

A production-grade NLP system that transforms large-scale unstructured customer reviews into structured, actionable business insights.

Python 3.8+ License: MIT

🎯 Project Overview

This system processes 568,000+ Amazon Fine Food Reviews to automatically:

  • Perform sentiment analysis at scale
  • Discover topics and themes in customer feedback
  • Extract product aspects and aspect-level sentiments
  • Track sentiment and topic trends over time
  • Generate actionable business insights

Built for: Product teams, business analysts, and data scientists working with large-scale customer feedback

🏗️ System Architecture

Raw Reviews (568K) → Preprocessing → NLP Pipelines → Insights → Streamlit Dashboard
                                     ├─ Sentiment Analysis
                                     ├─ Topic Modeling
                                     ├─ Aspect Extraction
                                     └─ Temporal Analysis

✨ Key Features

🎭 Multi-Level Sentiment Analysis

  • Baseline: VADER, TextBlob for fast processing
  • Advanced: DistilBERT transformer for nuanced understanding
  • Ensemble: Combines approaches for robust predictions
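How the ensemble combines its members is not spelled out here, so the following is only a minimal sketch of one common option: a weighted average of signed polarity scores. The function name `ensemble_sentiment`, the weights, and the 0.05 neutral band (borrowed from the threshold commonly used with VADER's compound score) are assumptions for illustration, not the project's actual logic.

```python
from typing import Dict, Optional


def ensemble_sentiment(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
    neutral_band: float = 0.05,
) -> str:
    """Combine per-model polarity scores in [-1, 1] into a single label.

    `scores` maps a model name to its signed polarity. Weights and the
    neutral band are illustrative assumptions, not project settings.
    """
    if not scores:
        raise ValueError("at least one model score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}  # equal weighting by default
    total = sum(weights[name] for name in scores)
    combined = sum(scores[name] * weights[name] for name in scores) / total
    if combined > neutral_band:
        return "positive"
    if combined < -neutral_band:
        return "negative"
    return "neutral"


# Hypothetical scores for one review. A transformer's class probabilities
# would first be turned into a signed score such as P(pos) - P(neg).
example = {"vader": 0.62, "textblob": 0.10, "distilbert": -0.20}
print(ensemble_sentiment(example))
print(ensemble_sentiment(example, weights={"vader": 1, "textblob": 1, "distilbert": 4}))
```

Giving the transformer a larger weight lets it override the lexicon-based scorers when they disagree, which is the usual motivation for weighting rather than simple voting.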

📊 Topic Modeling

  • LDA: Classical probabilistic topic modeling
  • BERTopic: Modern transformer-based approach
  • Auto-labeled topics mapped to business categories
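As a rough illustration of mapping topics to business categories, the sketch below labels a topic by keyword overlap between its top words and a small category lexicon. The lexicon, the category names, and the `label_topic` helper are hypothetical; the project's real mapping may work differently.

```python
from typing import Dict, List, Set

# Hypothetical category lexicon; the project's real mapping is not shown here.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "Taste & Quality": {"taste", "flavor", "delicious", "fresh", "sweet"},
    "Price & Value": {"price", "cheap", "expensive", "value", "deal"},
    "Packaging": {"package", "box", "bag", "sealed", "damaged"},
    "Delivery": {"shipping", "arrived", "delivery", "late", "fast"},
}


def label_topic(top_words: List[str], min_overlap: int = 1) -> str:
    """Return the category sharing the most keywords with a topic's top words."""
    words = set(top_words)
    best_label, best_overlap = "Other", 0
    for label, keywords in CATEGORY_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:  # strict '>' keeps the first category on ties
            best_label, best_overlap = label, overlap
    return best_label if best_overlap >= min_overlap else "Other"


# Top words as they might come out of LDA or BERTopic for two topics.
print(label_topic(["shipping", "arrived", "box", "late"]))
print(label_topic(["dog", "treats", "puppy"]))
```

Topics that match no category fall back to "Other", which makes unmapped themes easy to spot and review by hand.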

🔍 Aspect-Based Analysis

  • Automatic extraction of product aspects (taste, price, packaging, delivery)
  • Sentiment analysis per aspect
  • Product-level aspect comparisons
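To make the idea of aspect-level sentiment concrete, here is a deliberately simple sketch: it finds sentences that mention an aspect term and scores them with a tiny polarity lexicon. The term lists, the lexicon, and `aspect_sentiments` are assumptions for illustration; a real pipeline would more likely rely on spaCy parsing and a trained sentiment model.

```python
import re
from collections import defaultdict
from typing import Dict

# Hypothetical aspect vocabulary, based on the four aspects listed above.
ASPECT_TERMS = {
    "taste": {"taste", "flavor", "delicious", "bland"},
    "price": {"price", "expensive", "cheap", "overpriced"},
    "packaging": {"packaging", "package", "box", "bag"},
    "delivery": {"delivery", "shipping", "arrived"},
}
# Toy polarity lexicon, for illustration only.
POSITIVE = {"great", "delicious", "good", "fast", "love", "fresh"}
NEGATIVE = {"bad", "bland", "damaged", "late", "overpriced", "stale"}


def aspect_sentiments(review: str) -> Dict[str, int]:
    """Score each mentioned aspect as (#positive - #negative) words,
    counted over the sentences that mention that aspect."""
    result = defaultdict(int)
    for sentence in re.split(r"[.!?]+", review.lower()):
        tokens = set(re.findall(r"[a-z']+", sentence))
        polarity = len(tokens & POSITIVE) - len(tokens & NEGATIVE)
        for aspect, terms in ASPECT_TERMS.items():
            if tokens & terms:
                result[aspect] += polarity
    return dict(result)


print(aspect_sentiments("The flavor is great. The box arrived damaged and late."))
```

Because scoring happens per sentence, one review can be positive on taste and negative on packaging at the same time, which is the point of aspect-level analysis.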

📈 Temporal Trends

  • Sentiment evolution over time
  • Topic drift detection
  • Seasonal pattern analysis
  • Anomaly detection
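One simple way to approach the trend and anomaly items above is to average sentiment per month and flag months that sit far from the overall mean. The sketch below does that with a z-score. The function names and the threshold of 2.0 are assumptions, and the project may use a different detector.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def monthly_mean_sentiment(rows: Iterable[Tuple[int, float]]) -> Dict[str, float]:
    """Group (unix_timestamp, sentiment_score) pairs by YYYY-MM and average them."""
    buckets = defaultdict(list)
    for ts, score in rows:
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        buckets[month].append(score)
    return {month: mean(vals) for month, vals in sorted(buckets.items())}


def flag_anomalies(series: Dict[str, float], z_threshold: float = 2.0) -> List[str]:
    """Return months whose mean is more than z_threshold std devs from the overall mean."""
    values = list(series.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [m for m, v in series.items() if abs(v - mu) / sigma > z_threshold]


# Synthetic series: nine steady months and one sharp drop.
series = {f"2012-{m:02d}": 0.5 for m in range(1, 10)}
series["2012-10"] = -0.5
print(flag_anomalies(series))
```

A global z-score ignores seasonality, so recurring patterns would need a seasonal baseline (for example, comparing each month with the same month in earlier years) before flagging.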

πŸ’‘ Auto-Generated Insights

  • Executive summaries
  • Top complaints and praise themes
  • Actionable recommendations for product teams
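A sketch of how complaint and praise themes could be derived, assuming each review has already been given a sentiment label and a list of aspects: count aspect mentions within each sentiment class and report the most frequent. `top_themes` and `executive_summary` are hypothetical helpers, not functions from this repository.

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[str, List[str]]  # (sentiment label, aspects mentioned)


def top_themes(records: List[Record], sentiment: str, k: int = 3) -> List[Tuple[str, int]]:
    """Most frequently mentioned aspects among reviews with the given sentiment."""
    counts = Counter(
        aspect
        for label, aspects in records
        if label == sentiment
        for aspect in aspects
    )
    return counts.most_common(k)


def executive_summary(records: List[Record]) -> str:
    """One-line summary: share of negative reviews and their top complaints."""
    negative = sum(1 for label, _ in records if label == "negative")
    complaints = ", ".join(f"{a} ({n})" for a, n in top_themes(records, "negative"))
    return f"{negative}/{len(records)} reviews negative; top complaints: {complaints or 'none'}"


records = [
    ("negative", ["packaging", "delivery"]),
    ("negative", ["packaging"]),
    ("positive", ["taste"]),
]
print(executive_summary(records))
```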

πŸ“± Interactive Dashboard

  • Real-time visualizations with Plotly
  • Multi-page Streamlit interface
  • Exportable reports and charts

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • 8GB+ RAM (16GB recommended for full dataset)
  • Kaggle API credentials (for dataset download)

Installation

# Clone or navigate to project directory
cd amazon-produt\ review

# Run automated setup
bash setup.sh

# Activate virtual environment
source venv/bin/activate

Configuration

  1. Update .env file with your settings:
cp .env.example .env
# Edit .env with your Kaggle credentials if needed
  2. Download dataset (if not done during setup):
kaggle datasets download -d snap/amazon-fine-food-reviews -p data/raw --unzip

Usage

Option 1: Run Full Pipeline

# Process 10K sample (fast, for testing)
python pipelines/run_full_pipeline.py --sample_size 10000

# Process full dataset (slow, production)
python pipelines/run_full_pipeline.py --full
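For readers curious how the two flags above might be handled, here is a minimal argparse sketch. Only `--sample_size` and `--full` come from this README. The mutually exclusive grouping, the 10,000-row default when neither flag is given, and the `resolve_row_limit` helper are assumptions and may not match pipelines/run_full_pipeline.py.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run the full review-analysis pipeline")
    # Assumption: the two modes are treated as mutually exclusive.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--sample_size", type=int, default=None,
                       help="number of reviews to sample")
    group.add_argument("--full", action="store_true",
                       help="process the entire dataset")
    return parser.parse_args(argv)


def resolve_row_limit(args: argparse.Namespace, default_sample: int = 10_000) -> Optional[int]:
    """Return the number of rows to process, or None for the whole dataset."""
    if args.full:
        return None
    return args.sample_size if args.sample_size is not None else default_sample


print(resolve_row_limit(parse_args(["--sample_size", "10000"])))
print(resolve_row_limit(parse_args(["--full"])))
```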

Option 2: Run Individual Pipelines

# 1. Preprocessing
python pipelines/run_preprocessing.py

# 2. Sentiment Analysis
python pipelines/run_sentiment_analysis.py

# 3. Topic Modeling
python pipelines/run_topic_modeling.py

# 4. Aspect Analysis
python pipelines/run_aspect_analysis.py

# 5. Temporal Analysis
python pipelines/run_temporal_analysis.py

Launch Dashboard

streamlit run app.py

Then open http://localhost:8501 in your browser.

📁 Project Structure

amazon-produt review/
├── README.md                          # This file
├── requirements.txt                   # Dependencies
├── setup.sh                           # Automated setup
├── .env.example                       # Configuration template
│
├── docs/                              # Detailed documentation
│   ├── PROJECT_OVERVIEW.md
│   ├── SYSTEM_DESIGN.md
│   ├── DATA_DICTIONARY.md
│   ├── NLP_TECHNIQUES.md
│   └── EVALUATION_STRATEGY.md
│
├── data/                              # Data storage
│   ├── raw/                           # Original dataset
│   ├── processed/                     # Cleaned data
│   └── results/                       # Model outputs
│
├── notebooks/                         # Jupyter notebooks
│   ├── 01_eda.ipynb
│   ├── 02_baseline_sentiment.ipynb
│   ├── 03_topic_modeling.ipynb
│   └── 04_aspect_extraction.ipynb
│
├── src/                               # Source code
│   ├── config.py                      # Configuration
│   ├── utils.py                       # Utilities
│   ├── data/                          # Data processing
│   ├── models/                        # NLP models
│   ├── insights/                      # Insight generation
│   └── evaluation/                    # Evaluation metrics
│
├── pipelines/                         # End-to-end pipelines
│   ├── run_preprocessing.py
│   ├── run_sentiment_analysis.py
│   ├── run_topic_modeling.py
│   ├── run_aspect_analysis.py
│   ├── run_temporal_analysis.py
│   └── run_full_pipeline.py
│
├── app.py                             # Streamlit dashboard
└── streamlit_app/                     # Dashboard components
    ├── pages/
    └── components/

📊 Dataset

Source: Amazon Fine Food Reviews

  • Size: 568,454 reviews
  • Time Range: Oct 1999 - Oct 2012
  • Columns: Review text, rating (1-5), product ID, timestamp, helpfulness votes

See DATA_DICTIONARY.md for detailed schema.
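As a quick orientation to the raw data, the sketch below reads rows with the standard csv module and derives a coarse sentiment label from the star rating. The column names are those of the Kaggle Reviews.csv file as commonly published, so verify them against DATA_DICTIONARY.md. The 1-2 / 3 / 4-5 label split and both function names are assumptions, and the sample rows are synthetic.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Dict, Iterator, TextIO


def rating_to_label(score: int) -> str:
    """Assumed mapping: 1-2 stars negative, 3 neutral, 4-5 positive."""
    if score <= 2:
        return "negative"
    if score == 3:
        return "neutral"
    return "positive"


def read_reviews(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one cleaned record per CSV row."""
    for row in csv.DictReader(handle):
        denom = int(row["HelpfulnessDenominator"])
        yield {
            "product_id": row["ProductId"],
            "rating": int(row["Score"]),
            "label": rating_to_label(int(row["Score"])),
            # 'Time' is a Unix timestamp in seconds.
            "date": datetime.fromtimestamp(int(row["Time"]), tz=timezone.utc).date(),
            # Helpfulness ratio is undefined when nobody voted.
            "helpfulness": int(row["HelpfulnessNumerator"]) / denom if denom else None,
            "text": row["Text"],
        }


# Synthetic rows in the expected layout; real use would pass something like
# open("data/raw/Reviews.csv", newline="", encoding="utf-8") instead.
SAMPLE_CSV = (
    "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,"
    "HelpfulnessDenominator,Score,Time,Summary,Text\n"
    "1,P0000001,U0000001,example_user,1,2,5,1350000000,Great,Tasty and fresh.\n"
    "2,P0000002,U0000002,another_user,0,0,1,1325376000,Bad,Arrived stale.\n"
)
for record in read_reviews(io.StringIO(SAMPLE_CSV)):
    print(record["date"], record["label"], record["helpfulness"])
```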

🛠️ Technology Stack

NLP & ML:

  • spaCy, NLTK: Text processing
  • Transformers (HuggingFace): Advanced sentiment
  • Gensim: LDA topic modeling
  • BERTopic: Modern topic modeling
  • scikit-learn: ML utilities

Visualization & UI:

  • Streamlit: Interactive dashboard
  • Plotly: Dynamic charts
  • pyLDAvis: Topic visualization

📈 Performance Metrics

Sentiment Analysis

  • Accuracy: 82%+ on held-out data
  • F1 Score: 0.80 (weighted)
  • Processing Speed: ~500 reviews/second
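For reference, a weighted F1 score averages the per-class F1 scores, each weighted by that class's share of the true labels. The sketch below computes it from scratch on a toy example; it shows the metric's definition and is not the project's evaluation code, which may well call scikit-learn instead.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score


# Toy labels, not project results.
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]
print(f"accuracy={accuracy(y_true, y_pred):.2f}, weighted F1={weighted_f1(y_true, y_pred):.3f}")
```

On imbalanced data such as product reviews, where positive ratings usually dominate, weighted F1 is driven mostly by the majority class, so per-class scores are worth reporting alongside it.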

Topic Modeling

  • LDA Coherence: 0.52 (C_v)
  • BERTopic: 20-25 coherent topics
  • Coverage: 95%+ reviews mapped

Scalability

  • 10K reviews: ~2 minutes
  • 100K reviews: ~15 minutes
  • 568K reviews: ~90 minutes

Benchmarked on: Intel i7, 16GB RAM

🧪 Testing

# Run unit tests
pytest tests/

# Run specific test
pytest tests/test_sentiment.py -v

📖 Documentation

🎓 Research Alignment

This project demonstrates design principles common to large-scale production NLP systems:

  • Scalability: Handles hundreds of thousands of reviews
  • Multi-task NLP: Combines sentiment, topics, and aspects
  • Production-Ready: Config-driven, modular, testable
  • Interpretability: Explainable insights for non-technical stakeholders
  • Reproducibility: Seeded experiments, versioned dependencies

🤝 Contributing

This is an educational/portfolio project. Suggestions and improvements welcome!

📝 License

MIT License - see LICENSE file for details

👤 Author

ML Engineer & NLP Researcher

Built as a demonstration of production-grade NLP and large-scale system design.

🙏 Acknowledgments

  • Dataset: Stanford Network Analysis Project (SNAP)
  • Libraries: HuggingFace, spaCy, Streamlit communities
  • Inspiration: Real-world review analysis systems at Amazon, Google, Meta

Note: This is a research/educational project. For production deployment, consider additional security, privacy, and compliance measures.
