Commit 9743400: Add Claude Code context files

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Parent: 0ab3cfd

File tree: 31 files changed, +8523 -6 lines changed
.devcontainer/Dockerfile

Lines changed: 1 addition & 1 deletion
```diff
@@ -68,7 +68,7 @@ ENV PATH="$HOME/.cargo/bin:$PATH"
 ## Install Bun.js
 RUN curl -fsSL https://bun.sh/install | bash -s "bun-v1.2.2"
 ENV PATH="$HOME/.bun/bin:$PATH"
-RUN $HOME/.bun/bin/bun install --global @hey-api/openapi-ts
+RUN $HOME/.bun/bin/bun install --global @hey-api/openapi-ts @anthropic-ai/claude-code
 
 RUN \
     rustup install $RUST_VERSION && \
```

.github/workflows/CLAUDE.md

Lines changed: 508 additions & 0 deletions
Large diffs are not rendered by default.

CLAUDE.md

Lines changed: 254 additions & 0 deletions
Large diffs are not rendered by default.

CLAUDE_GUIDELINES.md

Lines changed: 40 additions & 0 deletions
# CLAUDE.md Authoring Guide (for Claude Code)

This repo uses layered `CLAUDE.md` docs so Claude Code can navigate context quickly. Follow these governing principles.

## Purpose and placement

* Use `CLAUDE.md` as the entry point for humans and Claude Code within each directory.
* Write only what’s needed to work effectively in that directory; avoid project history; concise underlying reasoning is OK.

## Layering and scope

* The top level provides a high-level overview: the project’s purpose, key components, core workflows, and dependencies between major areas.
* At the top level, include exactly one short paragraph per important subdirectory describing what lives there, why it exists, and where to start. Keep it concrete but brief.
* Subdirectory docs add progressively more detail the deeper you go. Each level narrows to responsibilities, interfaces, commands, schemas, and caveats specific to that scope.

## DRY information flow

* Do not repeat what a parent `CLAUDE.md` already states about a subdirectory. Instead, link up to the relevant section.
* Put cross-cutting concepts at the highest level that owns them, and reference them from below.
* Keep a single source of truth for contracts, schemas, and commands; everything else links to it.
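A minimal sketch of the link-up pattern (the path and section anchor here are hypothetical, chosen only for illustration):

```markdown
<!-- crates/adapters/CLAUDE.md — hypothetical child doc -->
# adapters

Transport and format adapters for DBSP circuits.
For how adapters fit the overall architecture, see
[../CLAUDE.md](../CLAUDE.md#io-and-integration-crates) rather than
restating it here. This doc covers only adapter-specific interfaces,
commands, and caveats.
```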
## Clarity for Claude Code

* Prefer crisp headings, short paragraphs, and tight bullets over prose.
* Name files, entry points, public interfaces, and primary commands explicitly where they belong.
* Call out constraints, feature flags, performance notes, and “gotchas” near the workflows they affect.

## Maintenance rules

* Update the highest applicable level first; ensure lower levels still defer to it.
* Remove stale sections rather than letting them drift; shorter and correct beats exhaustive and outdated.
* When adding a new directory, add its paragraph to the top level and create its own `CLAUDE.md` that deepens—never duplicates—the parent’s description.

## Quality checklist

* Top level gives a true overview and one concise paragraph per important subdirectory.
* Every subdirectory doc increases detail appropriate to its scope.
* No duplication across levels; links replace repetition.
* Commands, interfaces, and data shapes are precise and current. It is OK to document different arguments for the same command for different use-cases.
* Formatting is skim-friendly and consistent across the repo.

benchmark/CLAUDE.md

Lines changed: 222 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the benchmark directory in this repository.

## Overview

The `benchmark/` directory contains comprehensive benchmarking infrastructure for comparing Feldera's performance against other stream processing systems. It implements industry-standard benchmarks (NEXMark, TPC-H, TikTok) across multiple platforms to provide objective performance comparisons.

## Benchmark Ecosystem

### Supported Systems

The benchmarking framework supports comparative analysis across:

- **Feldera** - Both Rust-native and SQL implementations
- **Apache Flink** - Standalone and Kafka-integrated configurations
- **Apache Beam** - Multiple runners:
  - Direct runner (development/testing)
  - Flink runner
  - Spark runner
  - Google Cloud Dataflow runner

### Benchmark Suites

#### **NEXMark Benchmark**
- **Industry Standard**: Streaming benchmark for auction data processing
- **Queries q0-q22**: Complete suite of streaming analytics queries
- **Realistic Data**: Auction, bidder, and seller event generation
- **Multiple Modes**: Streaming and batch processing modes

#### **TPC-H Benchmark**
- **OLAP Standard**: Traditional analytical processing benchmark
- **22 Queries**: Complex analytical queries for business intelligence
- **Batch Processing**: Focus on analytical query performance

#### **TikTok Benchmark**
- **Custom Workload**: Social media analytics patterns
- **Streaming Focus**: Real-time social media data processing

## Key Development Commands

### Running Individual Benchmarks

```bash
# Basic Feldera benchmark
./run-nexmark.sh --runner=feldera --events=100M

# Compare Feldera vs Flink
./run-nexmark.sh --runner=flink --events=100M

# SQL implementation on Feldera
./run-nexmark.sh --runner=feldera --language=sql

# Batch processing mode
./run-nexmark.sh --batch --events=100M

# Specific query testing
./run-nexmark.sh --query=q3 --runner=feldera

# Core count specification
./run-nexmark.sh --cores=8 --runner=feldera
```
### Running Benchmark Suites

```bash
# Full benchmark suite using Makefile
make -f suite.mk

# Limited runners and modes
make -f suite.mk runners='feldera flink' modes=batch events=1M

# Specific configuration
make -f suite.mk runners=feldera events=100M cores=16
```

### Analysis and Results

```bash
# Generate analysis (requires PSPP/SPSS)
pspp analysis.sps

# View results in CSV format
cat nexmark.csv
```
## Project Structure

### Core Scripts
- `run-nexmark.sh` - Main benchmark execution script
- `suite.mk` - Makefile for running comprehensive benchmark suites
- `analysis.sps` - SPSS/PSPP script for statistical analysis

### Implementation Directories

#### `feldera-sql/`
- **SQL Benchmarks**: Pure SQL implementations of benchmark queries
- **Pipeline Management**: Integration with Feldera's pipeline manager
- **Query Definitions**: Standard benchmark queries in SQL format
- **Table Schemas**: Database schema definitions for benchmarks

#### `flink/` & `flink-kafka/`
- **Flink Integration**: Standalone and Kafka-integrated Flink setups
- **Docker Containers**: Containerized Flink environments
- **Configuration**: Flink-specific performance tuning configurations
- **NEXMark Implementation**: Java-based NEXMark implementation

#### `beam/`
- **Apache Beam**: Multi-runner Beam implementations
- **Language Support**: Java, SQL (Calcite), and ZetaSQL implementations
- **Cloud Integration**: Google Cloud Dataflow configuration
- **Setup Scripts**: Environment preparation and dependency management

## Important Implementation Details

### Performance Optimization

#### **Feldera Optimizations**
- **Storage Configuration**: Uses `/tmp` by default; set `TMPDIR` to point at a real filesystem
- **Multi-threading**: Automatic core detection with a default maximum of 16 cores
- **Memory Management**: Efficient incremental computation with minimal memory overhead

#### **System-Specific Tuning**
- **Flink**: RocksDB and HashMap state backends available
- **Beam**: Multiple language implementations (Java, SQL, ZetaSQL)
- **Cloud**: Optimized configurations for cloud deployments

### Benchmark Modes

#### **Streaming Mode (Default)**
- **Real-time Processing**: Continuous data processing simulation
- **Incremental Results**: Measure throughput and latency
- **Event Generation**: Configurable event rates and patterns

#### **Batch Mode**
- **Analytical Processing**: Traditional batch analytics
- **Complete Data**: Process entire datasets at once
- **Throughput Focus**: Optimized for maximum data processing rates

### Data Generation

- **Configurable Scale**: From 100K to 100M+ events
- **Realistic Patterns**: Auction data with realistic distributions
- **Reproducible**: Deterministic data generation for consistent comparisons

## Development Workflow

### For New Benchmarks

1. Add query definitions to the appropriate `benchmarks/*/queries/` directory
2. Update table schemas in `table.sql` files
3. Implement runner-specific logic in system directories
4. Add the query to the `run-nexmark.sh` query list
5. Test across multiple systems for consistency

### For System Integration

1. Create a system-specific directory (e.g., `newsystem/`)
2. Implement setup and configuration scripts
3. Add a runner option to `run-nexmark.sh`
4. Update the `suite.mk` runner list
5. Document setup requirements
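Steps 1 and 2 of this workflow can be sketched as shell commands (the directory name and setup script contents are illustrative, not an existing system):

```bash
# Create a hypothetical system directory with a placeholder setup script.
mkdir -p newsystem
cat > newsystem/setup.sh <<'EOF'
#!/bin/sh
# Placeholder setup: install dependencies, pull images, etc.
echo "newsystem: environment prepared"
EOF
chmod +x newsystem/setup.sh
./newsystem/setup.sh
```

Steps 3-5 then touch `run-nexmark.sh` and `suite.mk` as described above.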
### Testing Strategy

#### **Correctness Validation**
- **Cross-System Consistency**: Results should match across systems
- **Query Verification**: Validate SQL semantics and outputs
- **Edge Case Testing**: Test with various data sizes and patterns

#### **Performance Analysis**
- **Throughput Measurement**: Events processed per second
- **Latency Analysis**: End-to-end processing delays
- **Resource Usage**: CPU, memory, and I/O utilization
- **Scalability Testing**: Performance across different core counts

### Configuration Management

#### **Environment Variables**
- `TMPDIR` - Storage location for temporary files
- `FELDERA_API_URL` - Pipeline manager endpoint (default: `localhost:8080`)
- Cloud credentials for Dataflow benchmarks
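A sketch of configuring these variables before a run (the `TMPDIR` path is a hypothetical example; the commented invocation mirrors the commands earlier in this file):

```bash
# Point storage at a real filesystem and set the pipeline manager endpoint.
export TMPDIR=/mnt/data/feldera-tmp            # illustrative path, not tmpfs
export FELDERA_API_URL=http://localhost:8080   # default endpoint
echo "storage: $TMPDIR"
echo "endpoint: $FELDERA_API_URL"
# ./run-nexmark.sh --runner=feldera --events=100M
```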
#### **System Requirements**
- **Java 11+** - Required for Beam and Flink
- **Docker** - For containerized system testing
- **Python 3** - For analysis scripts
- **Cloud SDK** - For Google Cloud Dataflow testing

### Results Analysis

#### **Statistical Analysis**
- **PSPP Integration**: Statistical analysis with `analysis.sps`
- **Performance Tables**: Formatted comparison tables
- **Trend Analysis**: Performance trends across system configurations

#### **Output Formats**
- **CSV Results**: Machine-readable performance data
- **Formatted Tables**: Human-readable comparison tables
- **Statistical Reports**: Detailed statistical analysis

## Best Practices

### Benchmark Execution
- **Warm-up Runs**: Allow systems to reach steady state
- **Multiple Iterations**: Run benchmarks multiple times for statistical significance
- **Resource Isolation**: Ensure consistent resource availability
- **Environment Control**: Use consistent hardware and software configurations
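The multiple-iterations practice can be sketched as a small wrapper (the run count is illustrative; the commented invocation mirrors the commands earlier in this file):

```bash
# Repeat a benchmark so the results support statistical analysis.
RUNS=5   # illustrative; pick enough runs for significance
for i in $(seq 1 "$RUNS"); do
  echo "run $i of $RUNS"
  # ./run-nexmark.sh --runner=feldera --events=100M
done
```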
### Performance Comparison
- **Fair Comparison**: Use equivalent configurations across systems
- **Resource Limits**: Apply consistent memory and CPU limits
- **Data Consistency**: Use identical datasets across systems
- **Metric Standardization**: Use consistent performance metrics

### System Setup
- **Documentation**: Follow setup instructions for each system
- **Version Control**: Pin specific versions for reproducible results
- **Configuration**: Use optimized configurations for each system
- **Monitoring**: Monitor resource usage during benchmarks

This benchmarking infrastructure provides comprehensive tools for validating Feldera's performance advantages and identifying optimization opportunities across different workloads and system configurations.

crates/CLAUDE.md

Lines changed: 122 additions & 0 deletions
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the crates directory in this repository.

## Overview

The `crates/` directory contains all Rust crates that make up the Feldera platform. This is a workspace-based project where each crate serves a specific purpose in the overall architecture, from core computational engines to I/O adapters and development tools.

## Crate Descriptions

### Core Engine Crates

**`dbsp/`** - The Database Stream Processor is the heart of Feldera's computational engine. It provides incremental computation capabilities where changes propagate in time proportional to the size of the change rather than the size of the dataset. Contains operators for filtering, mapping, joining, aggregating, and advanced streaming operations, with support for multi-threaded execution and persistent storage.

**`sqllib/`** - Runtime support library providing SQL function implementations for the compiled circuits. Includes aggregate functions (SUM, COUNT, AVG), scalar functions (string manipulation, date/time operations), type conversions, and null handling following SQL standard semantics. Essential for SQL compatibility in the DBSP computational model.

**`feldera-types/`** - Shared type definitions and configuration structures used across the entire platform. Provides pipeline configuration types, transport configurations, data format schemas, error types, and validation frameworks. Serves as the foundational type system ensuring consistency across all Feldera components.

### Service and Management Crates

**`pipeline-manager/`** - HTTP API service for managing the lifecycle of data pipelines. Provides RESTful endpoints for creating, configuring, starting, stopping, and monitoring pipelines. Integrates with PostgreSQL for persistence, handles authentication, and orchestrates the compilation and deployment of SQL programs to DBSP circuits.

**`rest-api/`** - Type definitions and OpenAPI specification generation for the Feldera REST API. Automatically generates machine-readable API specifications from Rust types, ensuring consistency between server implementation and client SDKs. Includes comprehensive request/response schemas and validation rules.

### I/O and Integration Crates

**`adapters/`** - Comprehensive I/O framework providing input and output adapters for DBSP circuits. Supports multiple transport protocols (Kafka, HTTP, File, Redis, S3) and data formats (JSON, CSV, Avro, Parquet). Includes integrated connectors for databases like PostgreSQL and data lake formats like Delta Lake, with fault-tolerant processing and automatic retry logic.

**`adapterlib/`** - Foundational abstractions and utilities for building I/O adapters. Provides generic transport traits, format processing abstractions, circuit catalog interfaces for runtime introspection, and comprehensive error handling frameworks. Enables consistent adapter implementation across different data sources and sinks.

**`iceberg/`** - Apache Iceberg table format integration enabling Feldera to work with modern data lake architectures. Supports schema evolution, time travel queries, and efficient data partitioning. Includes S3 and cloud storage integration with optimized reading patterns for large analytic datasets.

### Storage and Persistence Crates

**`storage/`** - Pluggable storage abstraction layer supporting multiple backends, including memory-based storage for testing and POSIX I/O for production deployments. Provides async file operations, block-level caching, buffer management, and error recovery mechanisms optimized for DBSP's access patterns.

### Mathematical and Type System Crates

**`fxp/`** - High-precision fixed-point arithmetic library for financial and decimal computations. Provides exact decimal arithmetic without floating-point errors, SQL DECIMAL type compatibility, and DBSP integration with zero-copy serialization. Supports both compile-time fixed scales and dynamic precision for flexible numeric processing.

### Benchmarking and Testing Crates

**`nexmark/`** - Industry-standard streaming benchmark suite implementing the NEXMark benchmark queries. Generates realistic auction data with configurable rates and distributions, implements the standard benchmark queries, and provides performance measurement tools for validating DBSP's streaming capabilities.

**`datagen/`** - Test data generation utilities providing realistic datasets for testing and benchmarking. Supports configurable data distributions, correlated data generation, high-throughput batch generation, and deterministic seeded generation for reproducible testing scenarios.

### Development and Tooling Crates

**`fda/`** - Feldera Development Assistant providing CLI tools and an interactive development environment. Includes command-line utilities for common development tasks, an interactive shell for exploratory development, benchmarking tools, and API specification validation. Serves as the primary development companion tool.

**`ir/`** - Multi-level intermediate representation system for the SQL compilation pipeline. Provides a High-level IR (HIR) close to SQL structure, a Mid-level IR (MIR) for optimization, and a Low-level IR (LIR) for code generation. Includes comprehensive analysis frameworks and transformation passes for SQL program compilation.

## Development Workflow

### Building All Crates

```bash
# Build all crates
cargo build --workspace

# Test all crates
cargo test --workspace

# Test documentation (limit threads to prevent OOM)
cargo test --doc -- --test-threads 12

# Check all crates
cargo check --workspace

# Build a specific crate
cargo build -p <crate-name>
```
### Workspace Management

```bash
# List all workspace members
cargo metadata --format-version=1 | jq '.workspace_members'

# Run a command on all workspace members (requires the cargo-workspaces tool)
cargo workspaces exec cargo check

# Update dependencies across the workspace
cargo update
```
### Cross-Crate Dependencies

The crates form a dependency graph where:
- **Core crates** (`dbsp`, `feldera-types`) are dependencies for most other crates
- **Service crates** (`pipeline-manager`, `adapters`) depend on core and utility crates
- **Tool crates** (`fda`, `nexmark`) typically depend on multiple core crates
- **Library crates** (`sqllib`, `adapterlib`) provide functionality to higher-level crates

### Testing Strategy

- **Unit Tests**: Each crate contains comprehensive unit tests
- **Integration Tests**: Cross-crate integration testing
- **Workspace Tests**: Full workspace testing for compatibility
- **Benchmark Tests**: Performance validation across crates

## Best Practices

### Crate Organization
- Each crate has a single, well-defined responsibility
- Dependencies flow from higher-level to lower-level crates
- Shared types and utilities are extracted to common crates
- Feature flags control optional functionality and integrations

### Development Guidelines
- Follow consistent coding patterns across crates
- Use workspace-level dependency management
- Maintain comprehensive documentation for each crate
- Write tests that work both in isolation and as part of the workspace

### Performance Considerations
- Core computational crates (`dbsp`, `sqllib`) are highly optimized
- I/O crates (`adapters`, `storage`) focus on throughput and efficiency
- Tool crates prioritize developer experience over raw performance
- Benchmark crates provide performance validation and regression detection

This workspace architecture enables modular development while maintaining consistency and performance across the entire Feldera platform.
