Apache Parquet is an open-source, columnar storage file format designed for efficient data storage and retrieval. It was developed as part of the Apache Hadoop ecosystem.
Key Characteristics:
Columnar Storage: Data is stored column by column, not row by row
Self-describing: Schema is embedded within the file
Compressed: Built-in compression support (Snappy, GZIP, LZO, etc.)
Splittable: Can be processed in parallel across distributed systems
Why is Parquet Popular?
┌─────────────────────────────────────────────────────────────┐
│ ROW-BASED vs COLUMNAR │
├─────────────────────────────────────────────────────────────┤
│ │
│ ROW-BASED (CSV/JSON): │
│ ┌─────┬───────┬─────┬────────┐ │
│ │ ID │ Name │ Age │ Salary │ → Row 1 │
│ │ ID │ Name │ Age │ Salary │ → Row 2 │
│ │ ID │ Name │ Age │ Salary │ → Row 3 │
│ └─────┴───────┴─────┴────────┘ │
│ │
│ COLUMNAR (Parquet): │
│ ┌─────┐ ┌───────┐ ┌─────┐ ┌────────┐ │
│ │ ID │ │ Name │ │ Age │ │ Salary │ │
│ │ ID │ │ Name │ │ Age │ │ Salary │ │
│ │ ID │ │ Name │ │ Age │ │ Salary │ │
│ └─────┘ └───────┘ └─────┘ └────────┘ │
│ ↓ ↓ ↓ ↓ │
│ Col 1 Col 2 Col 3 Col 4 │
└─────────────────────────────────────────────────────────────┘
Reasons for Popularity:

| Benefit | Description |
|---|---|
| Efficient Compression | Similar data types stored together compress better (up to 75% smaller) |
| Column Pruning | Read only required columns, skipping unnecessary data |
| Predicate Pushdown | Filter data at the storage level before loading it into memory |
| Schema Evolution | Add, remove, or modify columns without rewriting the entire dataset |
| Big Data Ecosystem | Native support in Spark, Hive, Presto, AWS Athena, and BigQuery |
| Fast Aggregations | Analytical queries (SUM, AVG, COUNT) run much faster |
Use Cases for Data Engineers
1. ETL/ELT Pipelines
Raw Data (CSV/JSON) → Transform → Parquet (Data Lake)
2. Data Lake Storage
Store processed data in cloud storage (S3, GCS, Azure Blob)
Partitioned storage for efficient querying
3. Data Warehouse Staging
Intermediate format between source systems and data warehouse
4. Batch Processing
Efficient format for Spark, Hive, and Presto jobs
5. Data Archival
Long-term storage with excellent compression
6. Schema Enforcement
Enforce data types and schema validation
Use Cases for Data Analysts
1. Fast Analytical Queries
Quick aggregations on large datasets
2. Self-Service Analytics
Query directly using SQL engines (Athena, Presto, Trino)
3. Reporting Datasets
Pre-aggregated data for dashboards
4. Historical Analysis
Analyze historical data stored efficiently
5. Ad-hoc Analysis
Explore large datasets with tools like DuckDB
Comparison: Parquet vs JSON vs CSV
Feature Comparison Table
| Feature | Parquet | JSON | CSV |
|---|---|---|---|
| Storage Format | Columnar (Binary) | Row-based (Text) | Row-based (Text) |
| Human Readable | ❌ No | ✅ Yes | ✅ Yes |
| File Size | ⭐ Smallest (70-90% smaller) | Largest | Medium |
| Compression | ⭐ Excellent (built-in) | Poor | Moderate |
| Schema | ⭐ Embedded & enforced | Flexible/None | None (header only) |
| Data Types | ⭐ Rich type support | Limited types | All text (no types) |
| Nested Data | ✅ Excellent support | ✅ Native support | ❌ Not supported |
| Read Speed (Analytics) | ⭐ Very Fast | Slow | Slow |
| Write Speed | Moderate | Fast | Fast |
| Partial Reading | ✅ Column selection | ❌ Full file read | ❌ Full file read |
| Splittable | ✅ Yes | ❌ No (unless line-delimited) | ✅ Yes |
| Big Data Tools | ⭐ Excellent support | Moderate | Moderate |
| Streaming | ❌ No | ✅ Yes | ✅ Yes |
| Schema Evolution | ✅ Yes | ✅ Flexible | ❌ No |
| Query Pushdown | ✅ Yes | ❌ No | ❌ No |
Performance Comparison
| Metric | Parquet | JSON | CSV |
|---|---|---|---|
| Storage Size (1M rows) | ~50 MB | ~500 MB | ~200 MB |
| Read Time (full scan) | 1x (baseline) | 5-10x slower | 3-5x slower |
| Read Time (2 columns) | 0.2x (5x faster than a full scan) | 5-10x slower | 3-5x slower |
| Compression Ratio | 10:1 | 2:1 | 3:1 |
When to Use Each Format
| Format | Best For |
|---|---|
| Parquet | Analytics, data lakes, data warehouses, large datasets, columnar queries |
| JSON | APIs, configuration files, document storage, nested/flexible data, web applications |
| CSV | Data exchange, simple data, spreadsheet compatibility, small datasets, quick exports |
```python
import pyarrow.parquet as pq

# Read metadata without loading data
parquet_file = pq.ParquetFile('large_file.parquet')
print(f"Number of row groups: {parquet_file.num_row_groups}")
print(f"Schema: {parquet_file.schema}")

# Read a specific row group
row_group_0 = parquet_file.read_row_group(0)
print(row_group_0.to_pandas())

# Read multiple row groups
row_groups = parquet_file.read_row_groups([0, 1])
print(row_groups.to_pandas())
```
8. Using DuckDB with Parquet (Fast Analytics)
```python
import duckdb

# Query Parquet files directly with SQL
result = duckdb.query("""
    SELECT department,
           COUNT(*) AS employee_count,
           AVG(salary) AS avg_salary,
           MAX(salary) AS max_salary
    FROM 'comparison.parquet'
    GROUP BY department
    ORDER BY avg_salary DESC
""").df()
print(result)

# Query partitioned Parquet files with a glob pattern
result = duckdb.query("""
    SELECT year, region, SUM(amount) AS total_sales
    FROM 'sales_data/**/*.parquet'
    GROUP BY year, region
    ORDER BY year, total_sales DESC
""").df()
print(result)
```
Industry Use Cases
1. E-Commerce / Retail
| Use Case | Description |
|---|---|
| Order Analytics | Store billions of order records for trend analysis |
```python
# Example: CDR (call detail record) processing pipeline
import pandas as pd

def process_cdr_batch(raw_cdr_path, output_path):
    # Read raw CDR data
    cdr = pd.read_csv(raw_cdr_path)

    # Aggregate per tower and hour
    hourly_summary = cdr.groupby(['cell_tower_id', 'hour']).agg({
        'call_duration': 'sum',
        'data_usage_mb': 'sum',
        'call_id': 'count'
    }).reset_index()

    # Store as Parquet with compression
    hourly_summary.to_parquet(
        output_path,
        compression='snappy',
        index=False
    )
```
5. Logistics & Supply Chain
| Use Case | Description |
|---|---|
| Shipment Tracking | Historical shipment data for analytics |
| Route Optimization | Historical route data analysis |
| Warehouse Analytics | Inventory movement patterns |
| Supplier Performance | Vendor metrics over time |
6. Media & Entertainment
| Use Case | Description |
|---|---|
| Viewing Analytics | Content consumption patterns |
| Ad Performance | Advertisement metrics and attribution |
| Content Catalog | Metadata storage and retrieval |
| User Engagement | Session data and user behavior |
Best Practices Summary
┌────────────────────────────────────────────────────────────┐
│ PARQUET BEST PRACTICES │
├────────────────────────────────────────────────────────────┤
│ │
│ ✅ DO: │
│ • Use partitioning for large datasets │
│ • Choose appropriate compression (snappy for speed, │
│ gzip for size) │
│ • Read only required columns │
│ • Use predicate pushdown for filtering │
│ • Keep row group sizes between 50MB - 200MB │
│ │
│ ❌ DON'T: │
│ • Use Parquet for streaming/real-time data │
│ • Store very small files (overhead not worth it) │
│ • Over-partition (too many small files) │
│ • Ignore schema evolution when modifying columns │
│ │
└────────────────────────────────────────────────────────────┘
Conclusion
| Aspect | Summary |
|---|---|
| When to Use Parquet | Large datasets, analytical queries, data lakes, ETL pipelines |
| Key Benefits | Compression, column pruning, fast analytics, schema support |
| Python Libraries | pandas, pyarrow, fastparquet, duckdb |
| Best For | Read-heavy workloads, aggregations, historical data |
Parquet has become the de facto standard for storing analytical data in modern data platforms due to its efficiency, compatibility with big data tools, and excellent compression characteristics.