# Felderize — SQL to Feldera SQL Translator

felderize translates SQL from various dialects into valid [Feldera](https://www.feldera.com/) SQL using LLM-based translation with optional compiler validation.

> **Dialects:** Spark SQL is currently the only supported dialect. Support for additional dialects is planned.

## Setup

```bash
cd python/felderize
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

> **Note:** `pip install -e .` is required before running `felderize`. It registers the package and CLI command.

Create a `.env` file:

```bash
ANTHROPIC_API_KEY=your-key-here
FELDERA_COMPILER=/path/to/sql-to-dbsp  # in Feldera repo: ../../sql-to-dbsp-compiler/SQL-compiler/sql-to-dbsp
FELDERIZE_MODEL=claude-sonnet-4-6
```

The `FELDERA_COMPILER` path is required for validation. Without it, translation still works, but the output SQL is not verified. You can also pass it per-command with `--compiler PATH`.

The compiler must be built before use (requires Java 19–21 and Maven):

```bash
cd sql-to-dbsp-compiler
./build.sh
```

## Usage

### Run a built-in example

```bash
# List available examples
felderize spark example

# Translate an example (validates by default)
felderize spark example simple

# Without compiler validation
felderize spark example simple --no-validate

# Log SQL submitted to the validator at each attempt
felderize spark example json --verbose

# Use a specific compiler binary
felderize spark example simple --compiler /path/to/sql-to-dbsp

# Output as JSON
felderize spark example simple --json-output
```

Available examples:

| Name | Description |
|------|-------------|
| `simple` | Date truncation, GROUP BY |
| `strings` | INITCAP, LPAD, NVL, CONCAT_WS |
| `arrays` | array_contains, size, element_at |
| `joins` | Null-safe equality (`<=>`) |
| `windows` | LAG, running SUM OVER |
| `aggregations` | COUNT DISTINCT, AVG, SUM, HAVING |
| `json` | get_json_object → PARSE_JSON + VARIANT access *(combined file)* |
| `topk` | ROW_NUMBER TopK, QUALIFY, datediff *(combined file)* |
| `dates` | to_date → PARSE_DATE, date_format → FORMAT_DATE/EXTRACT *(combined file)* |
| `arithmetic` | pmod, NULLIF division, subtraction *(combined file)* |

The JSON output contains:

```json
{
  "feldera_schema": "...",   // translated DDL (CREATE TABLE statements)
  "feldera_query": "...",    // translated query (CREATE VIEW statements)
  "unsupported": [...],      // unsupported Spark features found
  "warnings": [...],         // non-fatal issues
  "explanations": [...],     // explanations for translation decisions
  "status": "success|unsupported|error"
}
```

### Translate your own SQL

Two input formats are supported:

**Separate schema and query files:**

```bash
felderize spark translate path/to/schema.sql path/to/query.sql
felderize spark translate path/to/schema.sql path/to/query.sql --validate
```

**Single combined file** (CREATE TABLE and CREATE VIEW statements in one file):

```bash
felderize spark translate-file path/to/combined.sql
felderize spark translate-file path/to/combined.sql --validate
```

> **Note:** Running without `--validate` prints a warning — the output SQL has not been verified against the Feldera compiler.
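For reference, a combined input file for `translate-file` might look like the following (the table and view names here are hypothetical, not taken from the package):

```sql
-- Hypothetical combined file: Spark SQL tables first, then views.
CREATE TABLE orders (
    order_id INT,
    amount DOUBLE,
    created_at TIMESTAMP
);

CREATE VIEW daily_totals AS
SELECT date_trunc('DAY', created_at) AS day,
       SUM(amount) AS total
FROM orders
GROUP BY date_trunc('DAY', created_at);
```

felderize translates the CREATE TABLE statements into Feldera DDL and the CREATE VIEW statements into Feldera views.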
Both commands accept:

- `--validate` to validate output against the Feldera compiler (opt-in; `example` validates by default, use `--no-validate` to skip)
- `--compiler PATH` to specify the path to the Feldera compiler binary (overrides the `FELDERA_COMPILER` env var)
- `--model MODEL` to specify the LLM model (overrides the `FELDERIZE_MODEL` env var)
- `--no-docs` to disable Feldera SQL reference docs in the prompt
- `--force-docs` to include docs on the first pass instead of only as a fallback
- `--verbose` to log the SQL submitted to the validator at each repair attempt
- `--json-output` to output results as JSON

## Configuration

Environment variables (set in `.env`):

| Variable | Description | Default |
|---|---|---|
| `ANTHROPIC_API_KEY` | Anthropic API key | (required) |
| `FELDERIZE_MODEL` | LLM model to use (can also be set with `--model`) | (required, set in `.env`) |
| `FELDERA_COMPILER` | Path to the sql-to-dbsp compiler (can also be set with `--compiler`) | (required for validation) |
| `ANTHROPIC_BASE_URL` | Override the Anthropic API base URL (for proxies or alternate endpoints) | (optional) |

## How it works

1. Loads translation rules from a single skill file (`spark/skills/spark_skills.md`)
2. Sends the Spark SQL to the LLM with the rules, validated examples, and relevant Feldera SQL documentation (from `docs.feldera.com/docs/sql/`)
3. Parses the translated Feldera SQL from the LLM response
4. Optionally validates the output against the Feldera compiler, retrying with error feedback if needed

## Support

Contact us at support@feldera.com for assistance with unsupported Spark SQL features.
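The `--json-output` result can be consumed from a script. A minimal sketch, assuming the JSON structure documented above; the `run_felderize` and `extract_sql` helpers are illustrative, not part of the package:

```python
import json
import subprocess


def run_felderize(schema_path, query_path):
    """Run felderize and parse its JSON output (requires felderize on PATH)."""
    out = subprocess.run(
        ["felderize", "spark", "translate", schema_path, query_path, "--json-output"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(out.stdout)


def extract_sql(result):
    """Pull the translated DDL and query out of a result dict, failing loudly."""
    if result["status"] != "success":
        raise RuntimeError(f"translation not successful: {result['status']}")
    # Surface any unsupported Spark features reported by the translator.
    for feature in result.get("unsupported", []):
        print(f"unsupported Spark feature: {feature}")
    return result["feldera_schema"], result["feldera_query"]
```

The two SQL strings returned by `extract_sql` can then be submitted to a Feldera pipeline as its table and view definitions.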