# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the benchmark directory in this repository.

## Overview

The `benchmark/` directory contains comprehensive benchmarking infrastructure for comparing Feldera's performance against other stream processing systems. It implements industry-standard benchmarks (NEXMark, TPC-H, TikTok) across multiple platforms to provide objective performance comparisons.

## Benchmark Ecosystem

### Supported Systems

The benchmarking framework supports comparative analysis across:

- **Feldera** - Both Rust-native and SQL implementations
- **Apache Flink** - Standalone and Kafka-integrated configurations
- **Apache Beam** - Multiple runners:
  - Direct runner (development/testing)
  - Flink runner
  - Spark runner
  - Google Cloud Dataflow runner

### Benchmark Suites

#### **NEXMark Benchmark**
- **Industry Standard**: Streaming benchmark for auction data processing
- **23 Queries**: Complete suite of streaming analytics queries (q0-q22)
- **Realistic Data**: Auction, bidder, and seller event generation
- **Multiple Modes**: Both streaming and batch processing

#### **TPC-H Benchmark**
- **OLAP Standard**: Traditional analytical processing benchmark
- **22 Queries**: Complex analytical queries for business intelligence
- **Batch Processing**: Focus on analytical query performance

#### **TikTok Benchmark**
- **Custom Workload**: Social media analytics patterns
- **Streaming Focus**: Real-time social media data processing

## Key Development Commands

### Running Individual Benchmarks

```bash
# Basic Feldera benchmark
./run-nexmark.sh --runner=feldera --events=100M

# Compare Feldera vs Flink
./run-nexmark.sh --runner=flink --events=100M

# SQL implementation on Feldera
./run-nexmark.sh --runner=feldera --language=sql

# Batch processing mode
./run-nexmark.sh --batch --events=100M

# Specific query testing
./run-nexmark.sh --query=q3 --runner=feldera

# Core count specification
./run-nexmark.sh --cores=8 --runner=feldera
```

### Running Benchmark Suites

```bash
# Full benchmark suite using Makefile
make -f suite.mk

# Limited runners and modes
make -f suite.mk runners='feldera flink' modes=batch events=1M

# Specific configuration
make -f suite.mk runners=feldera events=100M cores=16
```

### Analysis and Results

```bash
# Generate analysis (requires PSPP/SPSS)
pspp analysis.sps

# View results in CSV format
cat nexmark.csv
```
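
For a quick human-readable view (the file's layout is assumed to be plain comma-separated values, per its name):

```bash
# Align CSV columns for terminal reading
column -s, -t < nexmark.csv | less -S
```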

## Project Structure

### Core Scripts
- `run-nexmark.sh` - Main benchmark execution script
- `suite.mk` - Makefile for running comprehensive benchmark suites
- `analysis.sps` - SPSS/PSPP script for statistical analysis

### Implementation Directories

#### `feldera-sql/`
- **SQL Benchmarks**: Pure SQL implementations of benchmark queries
- **Pipeline Management**: Integration with Feldera's pipeline manager
- **Query Definitions**: Standard benchmark queries in SQL format
- **Table Schemas**: Database schema definitions for benchmarks

#### `flink/` & `flink-kafka/`
- **Flink Integration**: Standalone and Kafka-integrated Flink setups
- **Docker Containers**: Containerized Flink environments
- **Configuration**: Flink-specific performance tuning configurations
- **NEXMark Implementation**: Java-based NEXMark implementation

#### `beam/`
- **Apache Beam**: Multi-runner Beam implementations
- **Language Support**: Java, SQL (Calcite), and ZetaSQL implementations
- **Cloud Integration**: Google Cloud Dataflow configuration
- **Setup Scripts**: Environment preparation and dependency management

## Important Implementation Details

### Performance Optimization

#### **Feldera Optimizations**
- **Storage Configuration**: Uses `/tmp` by default; set `TMPDIR` to a directory on a real filesystem (example below)
- **Multi-threading**: Automatic core detection, defaulting to a maximum of 16 cores
- **Memory Management**: Efficient incremental computation with minimal memory overhead
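
A minimal sketch of pointing benchmark storage at a real filesystem; the directory path here is illustrative, not a repository convention:

```bash
# Use a directory on a persistent disk instead of /tmp
# (path below is an example; any real-filesystem location works)
mkdir -p /mnt/data/feldera-tmp
TMPDIR=/mnt/data/feldera-tmp ./run-nexmark.sh --runner=feldera --events=100M
```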

#### **System-Specific Tuning**
- **Flink**: RocksDB and HashMap state backends available (sketch below)
- **Beam**: Multiple language implementations (Java, SQL, ZetaSQL)
- **Cloud**: Optimized configurations for cloud deployments
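
For Flink, the state backend is normally chosen via the standard `state.backend` key in `flink-conf.yaml`; how this repository's containers mount that file is an assumption here:

```bash
# Hedged sketch: select Flink's state backend (config file path assumed)
echo 'state.backend: rocksdb' >> flink/conf/flink-conf.yaml   # disk-backed, handles larger-than-memory state
# echo 'state.backend: hashmap' >> flink/conf/flink-conf.yaml # in-memory, lower latency
```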

### Benchmark Modes

#### **Streaming Mode (Default)**
- **Real-time Processing**: Continuous data processing simulation
- **Incremental Results**: Throughput and latency measured on continuously updating results
- **Event Generation**: Configurable event rates and patterns

#### **Batch Mode**
- **Analytical Processing**: Traditional batch analytics
- **Complete Data**: Process entire datasets at once
- **Throughput Focus**: Optimized for maximum data processing rates
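
Both modes map onto flags already shown above:

```bash
# Streaming is the default mode
./run-nexmark.sh --runner=feldera --events=100M

# --batch switches to whole-dataset analytical processing
./run-nexmark.sh --batch --runner=feldera --events=100M
```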

### Data Generation

- **Configurable Scale**: From 100K to 100M+ events
- **Realistic Patterns**: Auction data with realistic distributions
- **Reproducible**: Deterministic data generation for consistent comparisons
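
Scaling the workload only requires changing `--events`; the `1M` and `100M` suffixes appear in examples above, while `100K` is assumed from the stated lower bound:

```bash
./run-nexmark.sh --runner=feldera --events=100K   # quick smoke test (suffix assumed)
./run-nexmark.sh --runner=feldera --events=1M     # small run
./run-nexmark.sh --runner=feldera --events=100M   # full-scale run
```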

## Development Workflow

### For New Benchmarks

1. Add query definitions to the appropriate `benchmarks/*/queries/` directory
2. Update table schemas in `table.sql` files
3. Implement runner-specific logic in system directories
4. Add the query to the `run-nexmark.sh` query list
5. Test across multiple systems for consistency
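
A hypothetical walk through steps 1-4; paths beyond those named above are illustrative:

```bash
# 1. Place the new query definition alongside the existing ones
ls benchmarks/*/queries/
# 2. Extend the table schemas
"$EDITOR" table.sql
# 3-4. Find where existing queries are registered, then add the new one
grep -n 'q3' run-nexmark.sh
```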

### For System Integration

1. Create a system-specific directory (e.g., `newsystem/`)
2. Implement setup and configuration scripts
3. Add a runner option to `run-nexmark.sh`
4. Update the `suite.mk` runner list
5. Document setup requirements
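
Once the new runner is wired in (the name below is the hypothetical one from step 1), it can join suite runs through the existing `runners` variable:

```bash
make -f suite.mk runners='feldera flink newsystem' events=1M
```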

### Testing Strategy

#### **Correctness Validation**
- **Cross-System Consistency**: Results should match across systems
- **Query Verification**: Validate SQL semantics and outputs
- **Edge Case Testing**: Test with various data sizes and patterns

#### **Performance Analysis**
- **Throughput Measurement**: Events processed per second
- **Latency Analysis**: End-to-end processing delays
- **Resource Usage**: CPU, memory, and I/O utilization
- **Scalability Testing**: Performance across different core counts
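
Scalability testing can reuse the `--cores` flag shown earlier; a simple sweep, assuming successive runs accumulate into the shared results file:

```bash
# Sweep core counts to measure scaling behavior
for cores in 2 4 8 16; do
  ./run-nexmark.sh --runner=feldera --cores="$cores" --events=100M
done
```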

### Configuration Management

#### **Environment Variables**
- `TMPDIR` - Storage location for temporary files
- `FELDERA_API_URL` - Pipeline manager endpoint (default: localhost:8080)
- Cloud credentials for Dataflow benchmarks
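
For example (the host and port come from the default above; whether the value needs a scheme prefix is an assumption):

```bash
export TMPDIR=/mnt/data/feldera-tmp           # illustrative path
export FELDERA_API_URL=http://localhost:8080  # scheme prefix assumed
./run-nexmark.sh --runner=feldera --language=sql
```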

#### **System Requirements**
- **Java 11+** - Required for Beam and Flink
- **Docker** - For containerized system testing
- **Python 3** - For analysis scripts
- **Cloud SDK** - For Google Cloud Dataflow testing
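
A quick pre-flight check that these are installed:

```bash
java -version      # expect 11 or later for Beam and Flink
docker --version   # containerized runs
python3 --version  # analysis scripts
gcloud version     # Google Cloud SDK, only needed for Dataflow
```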

### Results Analysis

#### **Statistical Analysis**
- **PSPP Integration**: Statistical analysis with `analysis.sps`
- **Performance Tables**: Formatted comparison tables
- **Trend Analysis**: Performance trends across system configurations

#### **Output Formats**
- **CSV Results**: Machine-readable performance data
- **Formatted Tables**: Human-readable comparison tables
- **Statistical Reports**: Detailed statistical analysis

## Best Practices

### Benchmark Execution
- **Warm-up Runs**: Allow systems to reach steady state
- **Multiple Iterations**: Run benchmarks multiple times for statistical significance (see the sketch below)
- **Resource Isolation**: Ensure consistent resource availability
- **Environment Control**: Use consistent hardware and software configurations
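
A minimal sketch for repeated runs; it assumes each invocation appends its measurements to `nexmark.csv` as the results section above suggests:

```bash
# Five iterations for statistical significance
for run in 1 2 3 4 5; do
  ./run-nexmark.sh --runner=feldera --events=100M
done
```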

### Performance Comparison
- **Fair Comparison**: Use equivalent configurations across systems (example below)
- **Resource Limits**: Apply consistent memory and CPU limits
- **Data Consistency**: Use identical datasets across systems
- **Metric Standardization**: Use consistent performance metrics
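
In practice, that means holding every flag constant except the runner:

```bash
# Identical workload and resources; only the system under test changes
./run-nexmark.sh --runner=feldera --cores=8 --events=100M
./run-nexmark.sh --runner=flink --cores=8 --events=100M
```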

### System Setup
- **Documentation**: Follow setup instructions for each system
- **Version Control**: Pin specific versions for reproducible results
- **Configuration**: Use optimized configurations for each system
- **Monitoring**: Monitor resource usage during benchmarks

This benchmarking infrastructure provides comprehensive tools for validating Feldera's performance advantages and identifying optimization opportunities across different workloads and system configurations.