apache_parquet.md

Apache Parquet: A Comprehensive Guide

What is Parquet?

Apache Parquet is an open-source, columnar storage file format designed for efficient data storage and retrieval. It was developed as part of the Apache Hadoop ecosystem.

Key Characteristics:

Columnar Storage: Data is stored column by column, not row by row
Self-describing: Schema is embedded within the file
Compressed: Built-in compression support (Snappy, GZIP, LZO, etc.)
Splittable: Can be processed in parallel across distributed systems

Why is Parquet Popular?

┌─────────────────────────────────────────────────────────────┐
│                  ROW-BASED vs COLUMNAR                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ROW-BASED (CSV/JSON):                                      │
│  ┌─────┬───────┬─────┬────────┐                            │
│  │ ID  │ Name  │ Age │ Salary │  → Row 1                   │
│  │ ID  │ Name  │ Age │ Salary │  → Row 2                   │
│  │ ID  │ Name  │ Age │ Salary │  → Row 3                   │
│  └─────┴───────┴─────┴────────┘                            │
│                                                             │
│  COLUMNAR (Parquet):                                        │
│  ┌─────┐ ┌───────┐ ┌─────┐ ┌────────┐                      │
│  │ ID  │ │ Name  │ │ Age │ │ Salary │                      │
│  │ ID  │ │ Name  │ │ Age │ │ Salary │                      │
│  │ ID  │ │ Name  │ │ Age │ │ Salary │                      │
│  └─────┘ └───────┘ └─────┘ └────────┘                      │
│    ↓        ↓        ↓        ↓                            │
│  Col 1   Col 2    Col 3    Col 4                           │
└─────────────────────────────────────────────────────────────┘

Reasons for Popularity:

Benefit	Description
Efficient Compression	Similar data types stored together compress better (up to 75% smaller)
Column Pruning	Read only required columns, skip unnecessary data
Predicate Pushdown	Filter data at storage level before loading into memory
Schema Evolution	Add, remove, or modify columns without rewriting entire dataset
Big Data Ecosystem	Native support in Spark, Hive, Presto, AWS Athena, BigQuery
Fast Aggregations	Analytical queries (SUM, AVG, COUNT) run much faster

Use Cases for Data Engineers

1. ETL/ELT Pipelines

Raw Data (CSV/JSON) → Transform → Parquet (Data Lake)

2. Data Lake Storage

Store processed data in cloud storage (S3, GCS, Azure Blob)
Partitioned storage for efficient querying

3. Data Warehouse Staging

Intermediate format between source systems and data warehouse

4. Batch Processing

Efficient format for Spark, Hive, and Presto jobs

5. Data Archival

Long-term storage with excellent compression

6. Schema Enforcement

Enforce data types and schema validation

Use Cases for Data Analysts

1. Fast Analytical Queries

Quick aggregations on large datasets

2. Self-Service Analytics

Query directly using SQL engines (Athena, Presto, Trino)

3. Reporting Datasets

Pre-aggregated data for dashboards

4. Historical Analysis

Analyze historical data stored efficiently

5. Ad-hoc Analysis

Explore large datasets with tools like DuckDB

Comparison: Parquet vs JSON vs CSV

Feature Comparison Table

Feature	Parquet	JSON	CSV
Storage Format	Columnar (Binary)	Row-based (Text)	Row-based (Text)
Human Readable	❌ No	✅ Yes	✅ Yes
File Size	⭐ Smallest (70-90% smaller)	Largest	Medium
Compression	⭐ Excellent (built-in)	Poor	Moderate
Schema	⭐ Embedded & enforced	Flexible/None	None (header only)
Data Types	⭐ Rich type support	Limited types	All text (no types)
Nested Data	✅ Excellent support	✅ Native support	❌ Not supported
Read Speed (Analytics)	⭐ Very Fast	Slow	Slow
Write Speed	Moderate	Fast	Fast
Partial Reading	✅ Column selection	❌ Full file read	❌ Full file read
Splittable	✅ Yes	❌ No (unless line-delimited)	✅ Yes
Big Data Tools	⭐ Excellent support	Moderate	Moderate
Streaming	❌ No	✅ Yes	✅ Yes
Schema Evolution	✅ Yes	✅ Flexible	❌ No
Query Pushdown	✅ Yes	❌ No	❌ No

Performance Comparison

Metric	Parquet	JSON	CSV
Storage Size (1M rows)	~50 MB	~500 MB	~200 MB
Read Time (full scan)	1x	5-10x slower	3-5x slower
Read Time (2 columns)	0.2x	5-10x slower	3-5x slower
Compression Ratio	10:1	2:1	3:1

When to Use Each Format

Format	Best For
Parquet	Analytics, Data Lakes, Data Warehouses, Large datasets, Columnar queries
JSON	APIs, Configuration files, Document storage, Nested/flexible data, Web applications
CSV	Data exchange, Simple data, Spreadsheet compatibility, Small datasets, Quick exports

Python Implementation

Installation

# Install required libraries
pip install pandas pyarrow fastparquet

Basic Operations

1. Writing Parquet Files

import pandas as pd

# Create sample data
data = {
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer_name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'amount': [1200.50, 800.00, 450.75, 350.00, 75.25],
    'order_date': pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-16', 
                                   '2024-01-17', '2024-01-17'])
}

df = pd.DataFrame(data)

# Write to Parquet (default compression: snappy)
df.to_parquet('orders.parquet', index=False)

# Write with specific compression
df.to_parquet('orders_gzip.parquet', compression='gzip', index=False)
df.to_parquet('orders_snappy.parquet', compression='snappy', index=False)
df.to_parquet('orders_brotli.parquet', compression='brotli', index=False)

print("Parquet files created successfully!")

2. Reading Parquet Files

import pandas as pd

# Read entire Parquet file
df = pd.read_parquet('orders.parquet')
print(df)

# Read only specific columns (Column Pruning)
df_subset = pd.read_parquet('orders.parquet', columns=['order_id', 'amount'])
print(df_subset)

# Read with filters (requires pyarrow)
df_filtered = pd.read_parquet(
    'orders.parquet',
    filters=[('amount', '>', 400)]
)
print(df_filtered)

3. Working with PyArrow Directly

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Create data
data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Product A', 'Product B', 'Product C', 'Product D', 'Product E'],
    'price': [10.5, 20.0, 15.75, 8.25, 12.00],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Food', 'Clothing']
}

df = pd.DataFrame(data)

# Convert to PyArrow Table
table = pa.Table.from_pandas(df)

# Write with PyArrow
pq.write_table(table, 'products.parquet', compression='snappy')

# Read with PyArrow
table_read = pq.read_table('products.parquet')
print(table_read.to_pandas())

# Read metadata
parquet_file = pq.ParquetFile('products.parquet')
print("Schema:", parquet_file.schema)
print("Metadata:", parquet_file.metadata)
print("Number of row groups:", parquet_file.num_row_groups)

4. Partitioned Parquet Files

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create sample sales data
data = {
    'sale_id': range(1, 101),
    'product': ['Laptop', 'Phone', 'Tablet', 'Monitor'] * 25,
    'region': ['North', 'South', 'East', 'West'] * 25,
    'year': [2023, 2024] * 50,
    'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] * 8 + [1, 2, 3, 4],
    'amount': [round(100 + i * 10.5, 2) for i in range(100)]
}

df = pd.DataFrame(data)
table = pa.Table.from_pandas(df)

# Write partitioned Parquet (creates folder structure)
pq.write_to_dataset(
    table,
    root_path='sales_data',
    partition_cols=['year', 'region']
)

# This creates:
# sales_data/
#   ├── year=2023/
#   │   ├── region=North/
#   │   │   └── *.parquet
#   │   ├── region=South/
#   │   │   └── *.parquet
#   │   └── ...
#   └── year=2024/
#       └── ...

# Read partitioned data with filters (partition pruning)
df_filtered = pd.read_parquet(
    'sales_data',
    filters=[('year', '=', 2024), ('region', '=', 'North')]
)
print(df_filtered)

5. Appending Data to Parquet

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Initial data
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': ['A', 'B', 'C']
})

# New data to append
df2 = pd.DataFrame({
    'id': [4, 5, 6],
    'value': ['D', 'E', 'F']
})

# Write initial data
df1.to_parquet('data.parquet', index=False)

# Method 1: Read, concatenate, and write
existing_df = pd.read_parquet('data.parquet')
combined_df = pd.concat([existing_df, df2], ignore_index=True)
combined_df.to_parquet('data.parquet', index=False)

# Method 2: Using ParquetWriter for large files
writer = pq.ParquetWriter('data_appended.parquet', pa.Table.from_pandas(df1).schema)
writer.write_table(pa.Table.from_pandas(df1))
writer.write_table(pa.Table.from_pandas(df2))
writer.close()

6. Compare File Sizes

import pandas as pd
import os
import json

# Create a larger dataset for comparison
n_rows = 100000
data = {
    'id': range(n_rows),
    'name': [f'Customer_{i}' for i in range(n_rows)],
    'email': [f'customer_{i}@email.com' for i in range(n_rows)],
    'age': [25 + (i % 50) for i in range(n_rows)],
    'salary': [50000.0 + (i * 10.5) for i in range(n_rows)],
    'department': ['Sales', 'Marketing', 'Engineering', 'HR', 'Finance'] * (n_rows // 5)
}

df = pd.DataFrame(data)

# Save in different formats
df.to_csv('comparison.csv', index=False)
df.to_json('comparison.json', orient='records')
df.to_parquet('comparison.parquet', index=False)
df.to_parquet('comparison_gzip.parquet', compression='gzip', index=False)

# Compare file sizes
files = ['comparison.csv', 'comparison.json', 'comparison.parquet', 'comparison_gzip.parquet']

print("\n" + "="*50)
print("FILE SIZE COMPARISON (100,000 rows)")
print("="*50)

for file in files:
    size = os.path.getsize(file)
    print(f"{file:30} : {size/1024/1024:.2f} MB")

Example Output:

==================================================
FILE SIZE COMPARISON (100,000 rows)
==================================================
comparison.csv                 : 5.82 MB
comparison.json                : 8.45 MB
comparison.parquet             : 1.23 MB
comparison_gzip.parquet        : 0.89 MB

7. Reading Specific Row Groups

import pyarrow.parquet as pq

# Read metadata without loading data
parquet_file = pq.ParquetFile('large_file.parquet')

print(f"Number of row groups: {parquet_file.num_row_groups}")
print(f"Schema: {parquet_file.schema}")

# Read specific row group
row_group_0 = parquet_file.read_row_group(0)
print(row_group_0.to_pandas())

# Read specific row groups
row_groups = parquet_file.read_row_groups([0, 1])
print(row_groups.to_pandas())

8. Using DuckDB with Parquet (Fast Analytics)

import duckdb

# Query Parquet files directly with SQL
result = duckdb.query("""
    SELECT 
        department,
        COUNT(*) as employee_count,
        AVG(salary) as avg_salary,
        MAX(salary) as max_salary
    FROM 'comparison.parquet'
    GROUP BY department
    ORDER BY avg_salary DESC
""").df()

print(result)

# Query partitioned Parquet
result = duckdb.query("""
    SELECT year, region, SUM(amount) as total_sales
    FROM 'sales_data/**/*.parquet'
    GROUP BY year, region
    ORDER BY year, total_sales DESC
""").df()

print(result)

Industry Use Cases

1. E-Commerce / Retail

Use Case	Description
Order Analytics	Store billions of order records for trend analysis
Customer 360	Unified customer data for segmentation
Inventory Reports	Historical inventory levels and movements
Sales Dashboards	Pre-aggregated data for real-time dashboards

# Example: E-commerce daily sales pipeline
import pandas as pd

# Daily sales ETL
def process_daily_sales(date):
    # Extract from source
    raw_data = pd.read_csv(f'raw_sales_{date}.csv')
    
    # Transform
    processed = raw_data.groupby(['product_id', 'category']).agg({
        'quantity': 'sum',
        'revenue': 'sum',
        'order_id': 'count'
    }).reset_index()
    
    # Load to Parquet (partitioned by date)
    processed['date'] = date
    processed.to_parquet(
        f'data_lake/sales/date={date}/sales.parquet',
        index=False
    )

2. Financial Services

Use Case	Description
Transaction History	Store years of transaction data efficiently
Risk Reporting	Daily risk calculations and historical comparisons
Regulatory Compliance	Long-term data retention with fast retrieval
Portfolio Analytics	Historical portfolio performance analysis

# Example: Financial transaction storage
import pandas as pd
from datetime import datetime

def store_transactions(transactions_df):
    # Add partitioning columns
    transactions_df['year'] = transactions_df['timestamp'].dt.year
    transactions_df['month'] = transactions_df['timestamp'].dt.month
    
    # Write partitioned Parquet
    transactions_df.to_parquet(
        'transactions_lake',
        partition_cols=['year', 'month'],
        index=False
    )

# Query specific month (only reads relevant partition)
df = pd.read_parquet(
    'transactions_lake',
    filters=[('year', '=', 2024), ('month', '=', 1)]
)

3. Healthcare

Use Case	Description
Patient Records	Historical patient data for analysis
Claims Processing	Insurance claims data warehouse
Clinical Analytics	Treatment outcomes and patterns
Operational Reports	Hospital capacity and resource utilization

4. Telecommunications

Use Case	Description
CDR Processing	Call Detail Records storage and analysis
Network Analytics	Network performance metrics
Customer Usage	Data usage patterns and billing
Churn Analysis	Historical customer behavior data

# Example: CDR processing pipeline
import pandas as pd
import pyarrow.parquet as pq

def process_cdr_batch(raw_cdr_path, output_path):
    # Read raw CDR data
    cdr = pd.read_csv(raw_cdr_path)
    
    # Process and aggregate
    hourly_summary = cdr.groupby(['cell_tower_id', 'hour']).agg({
        'call_duration': 'sum',
        'data_usage_mb': 'sum',
        'call_id': 'count'
    }).reset_index()
    
    # Store as Parquet with compression
    hourly_summary.to_parquet(
        output_path,
        compression='snappy',
        index=False
    )

5. Logistics & Supply Chain

Use Case	Description
Shipment Tracking	Historical shipment data for analytics
Route Optimization	Historical route data analysis
Warehouse Analytics	Inventory movement patterns
Supplier Performance	Vendor metrics over time

6. Media & Entertainment

Use Case	Description
Viewing Analytics	Content consumption patterns
Ad Performance	Advertisement metrics and attribution
Content Catalog	Metadata storage and retrieval
User Engagement	Session data and user behavior

Best Practices Summary

┌────────────────────────────────────────────────────────────┐
│                   PARQUET BEST PRACTICES                   │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ✅ DO:                                                    │
│     • Use partitioning for large datasets                  │
│     • Choose appropriate compression (snappy for speed,    │
│       gzip for size)                                       │
│     • Read only required columns                           │
│     • Use predicate pushdown for filtering                 │
│     • Keep row group sizes between 50MB - 200MB            │
│                                                            │
│  ❌ DON'T:                                                 │
│     • Use Parquet for streaming/real-time data             │
│     • Store very small files (overhead not worth it)       │
│     • Over-partition (too many small files)                │
│     • Ignore schema evolution when modifying columns       │
│                                                            │
└────────────────────────────────────────────────────────────┘

Conclusion

Aspect	Summary
When to Use Parquet	Large datasets, analytical queries, data lakes, ETL pipelines
Key Benefits	Compression, column pruning, fast analytics, schema support
Python Libraries	`pandas`, `pyarrow`, `fastparquet`, `duckdb`
Best For	Read-heavy workloads, aggregations, historical data

Parquet has become the de facto standard for storing analytical data in modern data platforms due to its efficiency, compatibility with big data tools, and excellent compression characteristics.

Note: The content you provided was already in Markdown format. I've preserved all the original formatting including headers, tables, code blocks, lists, and special characters. You can copy this directly into any Markdown editor or viewer.

FilesExpand file tree

apache_parquet.md

Latest commit

History

apache_parquet.md

File metadata and controls

Apache Parquet: A Comprehensive Guide

Table of Contents

What is Parquet?

Key Characteristics:

Why is Parquet Popular?

Reasons for Popularity:

Use Cases for Data Engineers

1. ETL/ELT Pipelines

2. Data Lake Storage

3. Data Warehouse Staging

4. Batch Processing

5. Data Archival

6. Schema Enforcement

Use Cases for Data Analysts

1. Fast Analytical Queries

2. Self-Service Analytics

3. Reporting Datasets

4. Historical Analysis

5. Ad-hoc Analysis

Comparison: Parquet vs JSON vs CSV

Feature Comparison Table

Performance Comparison

When to Use Each Format

Python Implementation

Installation

Basic Operations

1. Writing Parquet Files

2. Reading Parquet Files

3. Working with PyArrow Directly

4. Partitioned Parquet Files

5. Appending Data to Parquet

6. Compare File Sizes

7. Reading Specific Row Groups

8. Using DuckDB with Parquet (Fast Analytics)

Industry Use Cases

1. E-Commerce / Retail

2. Financial Services

3. Healthcare

4. Telecommunications

5. Logistics & Supply Chain

6. Media & Entertainment

Best Practices Summary

Conclusion