A production-grade NLP system that transforms large-scale unstructured customer reviews into structured, actionable business insights.
This system processes 568,000+ Amazon Fine Food Reviews to automatically:
- Perform sentiment analysis at scale
- Discover topics and themes in customer feedback
- Extract product aspects and aspect-level sentiments
- Track sentiment and topic trends over time
- Generate actionable business insights
Built for: Product teams, business analysts, and data scientists working with large-scale customer feedback.
Raw Reviews (568K) → Preprocessing → NLP Pipelines → Insights → Streamlit Dashboard
                                        ├─ Sentiment Analysis
                                        ├─ Topic Modeling
                                        ├─ Aspect Extraction
                                        └─ Temporal Analysis
- Baseline: VADER, TextBlob for fast processing
- Advanced: DistilBERT transformer for nuanced understanding
- Ensemble: Combines approaches for robust predictions
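One way the ensemble step could combine the three models is a weighted average of per-model polarity scores. The sketch below is illustrative only: the weights, the neutral threshold, and the function names `ensemble_polarity` and `to_label` are assumptions, not the project's actual implementation. It also assumes each model's output has already been mapped to a signed score in [-1, 1] (VADER's compound score and TextBlob's polarity are already in that range; a DistilBERT class probability would need converting).

```python
from typing import Dict

# Hypothetical weights: the transformer score is trusted more than the lexicon baselines.
DEFAULT_WEIGHTS: Dict[str, float] = {"vader": 0.25, "textblob": 0.25, "distilbert": 0.5}


def ensemble_polarity(scores: Dict[str, float],
                      weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted mean of per-model polarity scores, each expected in [-1, 1]."""
    # Only use models that actually produced a score, and renormalise their weights.
    used = {name: w for name, w in weights.items() if name in scores}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no overlapping models between scores and weights")
    return sum(scores[name] * w for name, w in used.items()) / total


def to_label(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score to a label; the 0.05 neutral band is an assumption."""
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    example = {"vader": 0.6, "textblob": 0.2, "distilbert": -0.1}
    print(to_label(ensemble_polarity(example)))
```

Renormalising over the models that are present lets the same function serve a fast baseline-only run (VADER and TextBlob) and a full run that includes the transformer.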
- LDA: Classical probabilistic topic modeling
- BERTopic: Modern transformer-based approach
- Auto-labeled topics mapped to business categories
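Mapping discovered topics to business categories can be approximated by keyword overlap between a topic's top words and a curated keyword set per category. The following is a minimal sketch under that assumption; the category names, keyword sets, and the `label_topic` helper are hypothetical and not taken from the project code.

```python
from typing import Dict, List, Set

# Hypothetical category keyword sets; a real mapping would be curated per product line.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "Taste & Quality": {"taste", "flavor", "delicious", "sweet", "fresh"},
    "Price & Value": {"price", "cheap", "expensive", "value", "deal"},
    "Packaging & Delivery": {"package", "box", "shipping", "arrived", "damaged"},
}


def label_topic(top_words: List[str], min_overlap: int = 1) -> str:
    """Return the category sharing the most keywords with a topic's top words."""
    words = {w.lower() for w in top_words}
    best_category, best_overlap = "Other", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_category, best_overlap = category, overlap
    # Topics with too little overlap fall back to a catch-all bucket.
    return best_category if best_overlap >= min_overlap else "Other"


if __name__ == "__main__":
    print(label_topic(["coffee", "flavor", "taste", "cup"]))
```

The top words would come from whichever topic model is in use (LDA or BERTopic); this sketch only covers the labeling step that follows.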
- Automatic extraction of product aspects (taste, price, packaging, delivery)
- Sentiment analysis per aspect
- Product-level aspect comparisons
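To make the idea of aspect-level sentiment concrete, here is a deliberately simple sketch: it splits a review into sentences, detects aspects by keyword, and scores each sentence with a toy lexicon. The keyword lists, the lexicon, and the `aspect_sentiments` function are all assumptions for illustration; the project's real extractor presumably relies on spaCy and a trained sentiment model instead.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical keyword lists for the four aspects named above.
ASPECT_KEYWORDS: Dict[str, List[str]] = {
    "taste": ["taste", "flavor", "delicious", "bland"],
    "price": ["price", "expensive", "cheap", "cost"],
    "packaging": ["packaging", "package", "box", "bag"],
    "delivery": ["delivery", "shipping", "arrived"],
}
# Toy polarity lexicon, standing in for a real sentiment model.
POSITIVE = {"great", "delicious", "good", "fast", "love"}
NEGATIVE = {"bad", "bland", "expensive", "damaged", "late", "slow"}


def aspect_sentiments(review: str) -> Dict[str, float]:
    """Average a toy per-sentence polarity over the sentences mentioning each aspect."""
    per_aspect = defaultdict(list)
    for sentence in re.split(r"[.!?]+", review.lower()):
        tokens = re.findall(r"[a-z']+", sentence)
        if not tokens:
            continue
        # +1 for each positive word, -1 for each negative word in the sentence.
        polarity = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if any(t in keywords for t in tokens):
                per_aspect[aspect].append(polarity)
    return {aspect: sum(v) / len(v) for aspect, v in per_aspect.items()}


if __name__ == "__main__":
    print(aspect_sentiments("The taste is great. The box arrived damaged."))
```

Averaging these per-review scores by product ID would give the product-level aspect comparison mentioned above.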
- Sentiment evolution over time
- Topic drift detection
- Seasonal pattern analysis
- Anomaly detection
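Anomaly detection on a sentiment time series can be sketched with a rolling z-score: each period is compared against the mean and standard deviation of the preceding window. This is one common approach, shown here as an assumption rather than the method the project actually uses; the window size, threshold, and function name are illustrative.

```python
from statistics import mean, stdev
from typing import List


def rolling_zscore_anomalies(values: List[float], window: int = 6,
                             threshold: float = 3.0) -> List[int]:
    """Return indices whose value deviates from the preceding `window` values
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined, so skip this point
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical monthly mean sentiment with a sharp dip in the seventh month.
    monthly = [0.60, 0.62, 0.61, 0.59, 0.60, 0.61, 0.20, 0.60]
    print(rolling_zscore_anomalies(monthly))
```

Note that an anomalous point inflates the variance of the windows that follow it, so later deviations are harder to flag until it rolls out of the window.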
- Executive summaries
- Top complaints and praise themes
- Actionable recommendations for product teams
- Real-time visualizations with Plotly
- Multi-page Streamlit interface
- Exportable reports and charts
- Python 3.8+
- 8GB+ RAM (16GB recommended for full dataset)
- Kaggle API credentials (for dataset download)
# Clone or navigate to project directory
cd amazon-produt\ review
# Run automated setup
bash setup.sh
# Activate virtual environment
source venv/bin/activate
- Update the .env file with your settings:
cp .env.example .env
# Edit .env with your Kaggle credentials if needed
- Download the dataset (if not done during setup):
kaggle datasets download -d snap/amazon-fine-food-reviews -p data/raw --unzip
# Process 10K sample (fast, for testing)
python pipelines/run_full_pipeline.py --sample_size 10000
# Process full dataset (slow, production)
python pipelines/run_full_pipeline.py --full
# 1. Preprocessing
python pipelines/run_preprocessing.py
# 2. Sentiment Analysis
python pipelines/run_sentiment_analysis.py
# 3. Topic Modeling
python pipelines/run_topic_modeling.py
# 4. Aspect Analysis
python pipelines/run_aspect_analysis.py
# 5. Temporal Analysis
python pipelines/run_temporal_analysis.py
streamlit run app.py
Then open http://localhost:8501 in your browser.
amazon-produt review/
├── README.md                   # This file
├── requirements.txt            # Dependencies
├── setup.sh                    # Automated setup
├── .env.example                # Configuration template
│
├── docs/                       # Detailed documentation
│   ├── PROJECT_OVERVIEW.md
│   ├── SYSTEM_DESIGN.md
│   ├── DATA_DICTIONARY.md
│   ├── NLP_TECHNIQUES.md
│   └── EVALUATION_STRATEGY.md
│
├── data/                       # Data storage
│   ├── raw/                    # Original dataset
│   ├── processed/              # Cleaned data
│   └── results/                # Model outputs
│
├── notebooks/                  # Jupyter notebooks
│   ├── 01_eda.ipynb
│   ├── 02_baseline_sentiment.ipynb
│   ├── 03_topic_modeling.ipynb
│   └── 04_aspect_extraction.ipynb
│
├── src/                        # Source code
│   ├── config.py               # Configuration
│   ├── utils.py                # Utilities
│   ├── data/                   # Data processing
│   ├── models/                 # NLP models
│   ├── insights/               # Insight generation
│   └── evaluation/             # Evaluation metrics
│
├── pipelines/                  # End-to-end pipelines
│   ├── run_preprocessing.py
│   ├── run_sentiment_analysis.py
│   ├── run_topic_modeling.py
│   ├── run_aspect_analysis.py
│   ├── run_temporal_analysis.py
│   └── run_full_pipeline.py
│
├── app.py                      # Streamlit dashboard
└── streamlit_app/              # Dashboard components
    ├── pages/
    └── components/
Source: Amazon Fine Food Reviews
- Size: 568,454 reviews
- Time Range: Oct 1999 - Oct 2012
- Columns: Review text, rating (1-5), product ID, timestamp, helpfulness votes
See DATA_DICTIONARY.md for detailed schema.
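As a rough illustration of the columns listed above, the sketch below parses one row using the column names published with the Kaggle release of the dataset (`Reviews.csv`), where `Time` is a Unix timestamp and helpfulness is stored as a numerator/denominator pair. The sample row is synthetic, and the `parse_row` helper and its output field names are assumptions; DATA_DICTIONARY.md remains the authoritative schema for this project.

```python
import csv
import io
from datetime import datetime, timezone

# Header as published on Kaggle; the single data row is synthetic, not real data.
SAMPLE_CSV = (
    "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,"
    "HelpfulnessDenominator,Score,Time,Summary,Text\n"
    '1,B000000000,U000000000,example_user,3,4,5,1303862400,'
    'Great snack,"Tasty, and arrived quickly."\n'
)


def parse_row(row: dict) -> dict:
    """Convert one raw CSV row into typed fields (hypothetical output names)."""
    denom = int(row["HelpfulnessDenominator"])
    return {
        "product_id": row["ProductId"],
        "rating": int(row["Score"]),  # 1-5 stars
        "timestamp": datetime.fromtimestamp(int(row["Time"]), tz=timezone.utc),
        # Ratio of helpful votes; None when nobody voted.
        "helpfulness": int(row["HelpfulnessNumerator"]) / denom if denom else None,
        "text": row["Text"],
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]

if __name__ == "__main__":
    print(rows[0])
```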
NLP & ML:
- spaCy, NLTK: Text processing
- Transformers (HuggingFace): Advanced sentiment
- Gensim: LDA topic modeling
- BERTopic: Modern topic modeling
- scikit-learn: ML utilities
Visualization & UI:
- Streamlit: Interactive dashboard
- Plotly: Dynamic charts
- pyLDAvis: Topic visualization
- Accuracy: 82%+ on held-out data
- F1 Score: 0.80 (weighted)
- Processing Speed: ~500 reviews/second
- LDA Coherence: 0.52 (C_v)
- BERTopic: 20-25 coherent topics
- Coverage: 95%+ reviews mapped
- 10K reviews: ~2 minutes
- 100K reviews: ~15 minutes
- 568K reviews: ~90 minutes
Benchmarked on: Intel i7, 16GB RAM
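For readers unfamiliar with the weighted F1 score quoted for sentiment above, the following sketch shows how that metric is defined: per-class F1, averaged with weights proportional to each class's share of the true labels. It is a dependency-free illustration with made-up labels, not the project's evaluation code; in practice scikit-learn's `f1_score(..., average="weighted")` computes the same quantity.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Per-class F1 averaged with weights proportional to class support in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n  # weight each class by its number of true instances
    return total / len(y_true)


if __name__ == "__main__":
    # Made-up labels purely for illustration.
    truth = ["pos", "pos", "pos", "neg"]
    preds = ["pos", "pos", "neg", "neg"]
    print(round(weighted_f1(truth, preds), 3))
```

Because review ratings skew heavily positive in most datasets, the weighted average is dominated by the majority class, so it is worth reading alongside per-class scores.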
# Run unit tests
pytest tests/
# Run specific test
pytest tests/test_sentiment.py -v
- PROJECT_OVERVIEW.md - Objectives and motivation
- SYSTEM_DESIGN.md - Architecture deep-dive
- DATA_DICTIONARY.md - Dataset schema
- NLP_TECHNIQUES.md - Methodology details
- EVALUATION_STRATEGY.md - Metrics and validation
This project applies design principles common in large-scale production NLP systems:
- Scalability: Handles hundreds of thousands of reviews
- Multi-task NLP: Combines sentiment, topics, and aspects
- Production-Ready: Config-driven, modular, testable
- Interpretability: Explainable insights for non-technical stakeholders
- Reproducibility: Seeded experiments, versioned dependencies
This is an educational/portfolio project. Suggestions and improvements welcome!
MIT License - see LICENSE file for details
ML Engineer & NLP Researcher
Built as a demonstration of production-grade NLP and large-scale system design.
- Dataset: Stanford Network Analysis Project (SNAP)
- Libraries: HuggingFace, spaCy, Streamlit communities
- Inspiration: Real-world review analysis systems at Amazon, Google, Meta
Note: This is a research/educational project. For production deployment, consider additional security, privacy, and compliance measures.