# Felderize — SQL to Feldera SQL Translator felderize translates SQL from various dialects into valid [Feldera](https://www.feldera.com/) SQL using LLM-based translation with optional compiler validation. > **Dialects:** Spark SQL is currently the only supported dialect. Support for additional dialects is planned. ## Setup ```bash cd python/felderize python3 -m venv .venv source .venv/bin/activate pip install -e . ``` > **Note:** `pip install -e .` is required before running `felderize`. It registers the package and CLI command. **The Feldera SQL compiler JAR** (used only for `--validate`; requires Java 19–21 installed): felderize downloads it for you. The first time you run a command with `--validate` and no compiler configured, felderize fetches the latest `sql2dbsp-jar-with-dependencies-*.jar` from [GitHub Releases](https://github.com/feldera/feldera/releases) into `~/.felderize/` and reuses it on later runs. To opt out (e.g. in CI or offline), set `FELDERIZE_AUTO_DOWNLOAD=0`; validation is then skipped unless you point `FELDERA_COMPILER` / `--compiler` at a JAR. To fetch or update it explicitly: ```bash felderize download-compiler ``` This saves the latest JAR to `~/.felderize/` and prints its path. Re-run it any time to pick up a newer release; it reports whether you are already on the latest one. felderize automatically uses the newest JAR cached in `~/.felderize/`, so you do not need to set `FELDERA_COMPILER` unless you want a specific JAR. > **Requirement:** felderize needs compiler **v0.304.0 or newer** — earlier releases lack SQL features felderize relies on (e.g. `div_null`, `MAKE_DATE`). `download-compiler` always fetches the latest release, and felderize warns at validation time if the configured compiler is older than v0.304.0. > **Note:** Java 19, 20, or 21 must be installed and on your `PATH` before running validation. Check with `java -version`. Later versions (22+) are not supported. **Create a `.env` file in the `python/felderize/` directory:** ```bash ANTHROPIC_API_KEY=your-key-here FELDERA_COMPILER=~/.felderize/sql2dbsp-jar-with-dependencies-vX.Y.Z.jar FELDERIZE_MODEL=claude-sonnet-4-6 ``` `ANTHROPIC_API_KEY` and `FELDERIZE_MODEL` are required. `FELDERA_COMPILER` is optional: it is used only for validation, and when unset felderize auto-downloads (and caches) the compiler on first `--validate`. Set it to pin a specific JAR. You can also pass `--compiler PATH` and `--model MODEL` per command. > **Note:** felderize currently requires an Anthropic API key — only Claude models are supported. > **Tip — use a large-context-window model.** Each request bundles the full rule > set (`felderize/skills/spark_skills.md`), validated examples, and (on the retry pass) > Feldera reference documentation *on top of* your schema and query — easily tens > of thousands of tokens, and the `--validate` repair and docs passes add more. > Prefer a model with a large context window (e.g. a recent Claude Sonnet/Opus) > so nothing is truncated and the model has the full rules in view; set it via > `FELDERIZE_MODEL` or `--model`. If a program still doesn't fit, felderize stops > with an `error` asking you to shorten the input (translate fewer views at a > time, drop unused tables, or split it into smaller files). ## Usage ### Run a built-in example ```bash # List available examples felderize spark example # Translate an example (validates by default) felderize spark example simple # Without compiler validation felderize spark example simple --no-validate # Log SQL submitted to the validator at each attempt felderize spark example json --verbose # Use a specific compiler binary felderize spark example simple --compiler /path/to/sql-to-dbsp # Output as JSON felderize spark example simple --json-output ``` Available examples: | Name | Description | |------|-------------| | `simple` | Date truncation, GROUP BY | | `strings` | INITCAP, LPAD, NVL, CONCAT_WS | | `arrays` | array_contains, size, element_at | | `joins` | Null-safe equality (`<=>`) | | `windows` | LAG, running SUM OVER | | `aggregations` | COUNT DISTINCT, AVG, SUM, HAVING | | `json` | get_json_object → PARSE_JSON + VARIANT access *(combined file)* | | `topk` | ROW_NUMBER TopK, QUALIFY, datediff *(combined file)* | | `dates` | to_date → PARSE_DATE, date_format → FORMAT_DATE/EXTRACT *(combined file)* | | `arithmetic` | pmod, NULLIF division, subtraction *(combined file)* | The JSON output contains: ```json { "feldera_schema": "...", // translated DDL (CREATE TABLE statements) "feldera_query": "...", // translated query (CREATE VIEW statements) "unsupported": [...], // Spark features with no Feldera equivalent "warnings": [...], // non-fatal notes (compiler repairs, validation result) "explanations": [...], // explanations for translation decisions "status": "success|unsupported|error" } ``` **`status`:** - `success` — translated, and (with `--validate`) compiled cleanly. - `unsupported` — translated, but some constructs have no Feldera equivalent and were emitted as `CAST(NULL AS )` placeholders (listed in `unsupported`). - `error` — the LLM response couldn't be parsed, or (with `--validate`) the output still failed to compile after the repair attempts. felderize always returns the best-effort SQL in `feldera_query` (salvaging it from the raw response when the reply wasn't valid JSON), with the compiler errors in `warnings` — it may not compile, but it always attempts a translation. ### Translate your own SQL Each form below writes the translated, deployable Feldera SQL to a file. The file leads with a comment header recording the translation status and any unsupported constructs / warnings, so it is self-documenting; the status is also printed to stderr. **Separate schema and query files:** ```bash felderize spark translate schema.sql query.sql --validate -o out.sql # → out.sql (the translated CREATE TABLE + CREATE VIEW) ``` **Single combined file** (CREATE TABLE and CREATE VIEW statements in one file): ```bash felderize spark translate-file combined.sql --validate -o out.sql # → out.sql ``` **Multiple query files against a shared schema** (batch — faster than a shell loop): ```bash felderize spark translate-batch schema.sql queries/*.sql --validate --output-dir out/ # → out/_feldera.sql, one per query ``` **Structured JSON instead of a `.sql` file** (for automation — parse with `jq`): ```bash felderize spark translate schema.sql query.sql --validate --json-output > result.json # → result.json with feldera_schema, feldera_query, status, unsupported, warnings ``` `translate-batch` processes all queries in a single process so doc and example caches stay warm across queries. Omitting `-o` / `--output-dir` / `--json-output` prints the result as readable sections to the terminal instead. > **Note:** Running without `--validate` prints a warning — the output SQL has not been verified against the Feldera compiler. All commands accept: - `--validate` to validate output against the Feldera compiler (opt-in; `example` validates by default, use `--no-validate` to skip) - `--compiler PATH` to specify the path to the Feldera compiler binary (overrides `FELDERA_COMPILER` env var) - `--model MODEL` to specify the LLM model (overrides `FELDERIZE_MODEL` env var) - `--no-docs` to disable Feldera SQL reference docs in the prompt - `--verbose` to log the SQL submitted to the validator at each repair attempt - `--json-output` to output results as JSON (the structured machine interface) - `-o, --output PATH` (`translate` / `translate-file`) to write the translated schema + views to a deployable `.sql` file; the status prints to stderr so stdout/the file stay clean. (`translate-batch` uses `--output-dir` instead.) ### Programmatic API (Python) To call felderize from your own code instead of shelling out to the CLI, use the single entry point `translate_spark_to_feldera`: ```python from felderize import translate_spark_to_feldera, Config, Status cfg = Config.from_env() # reads ANTHROPIC_API_KEY, FELDERA_COMPILER, FELDERIZE_MODEL result = translate_spark_to_feldera( schema_sql, # Spark CREATE TABLE ... DDL (str) query_sql, # Spark CREATE VIEW / SELECT ... (str) cfg, validate=True, # compile against the Feldera compiler and repair (default: False) ) if result.status is Status.SUCCESS: deploy(result.feldera_schema, result.feldera_query) else: # UNSUPPORTED -> NULL-placeholder views (see result.unsupported); # ERROR -> best-effort SQL that did not compile. review(result.unsupported, result.warnings) ``` `TranslationResult` exposes `feldera_schema`, `feldera_query`, `status`, `unsupported`, `warnings`, `explanations`, and `to_dict()`. `validate=False` skips the compiler (faster, but the output is not verified). Runnable examples: ```bash .venv/bin/python examples/api_usage.py # translate one schema + query .venv/bin/python examples/translate_all_examples.py # translate all built-in examples + summary ``` ## Configuration Environment variables (set in `.env`): | Variable | Description | Default | |---|---|---| | `ANTHROPIC_API_KEY` | Anthropic API key | (required) | | `FELDERIZE_MODEL` | LLM model to use (can also be set with `--model`) | (required, set in `.env`) | | `FELDERA_COMPILER` | Path to sql-to-dbsp compiler (can also be set with `--compiler`) | (optional; auto-downloaded when unset) | | `FELDERIZE_AUTO_DOWNLOAD` | Auto-download the compiler on first `--validate` when none is configured. Set to `0`/`false` to disable. | `1` | | `ANTHROPIC_BASE_URL` | Override Anthropic API base URL (for proxies or alternate endpoints) | (optional) | ## Customizing translation You can teach felderize your project-specific patterns by adding **rules** and **examples**. ### Rules Rules tell the LLM how to rewrite specific Spark constructs. Each `.md` file should start with a YAML frontmatter block with a `name:` and `description:` field, followed by plain-markdown bullet points: ```markdown --- name: my-project-rules description: Project-specific Spark-to-Feldera rewrites. --- - **[PROJ-HASH] Internal UDF `my_hash(col)`:** Rewrite as `MD5(CAST(col AS VARCHAR))`. - **[PROJ-ID] `CUSTOM_ID` columns:** Always map to `BIGINT NOT NULL` in Feldera. ``` > **Note:** Frontmatter is recommended. Files without it are still loaded but produce a warning. Place `.md` files in one of these locations — all are loaded **automatically**, no flag needed: | Location | Scope | |---|---| | `~/.felderize/rules/` | All your projects (survives pip upgrades) | | `.felderize/rules/` in your project dir | This project only (commit to git) | Or pass one or more files explicitly (repeatable): ```bash felderize spark translate schema.sql query.sql --rules rules1.md --rules rules2.md ``` ### Examples Examples are validated Spark → Feldera pairs shown to the LLM alongside the built-in ones. The more precise your examples, the better the translation quality for your specific SQL patterns. Each `.md` file must start with a YAML frontmatter block. Use `categories:` to load the example only when those SQL constructs are detected in the query being translated. Omit `categories:` (but keep the frontmatter) to always include the example: ```markdown --- categories: [datetime] --- ### Example: Monthly revenue **Spark SQL:** ```sql SELECT date_trunc('MONTH', ts) AS month, SUM(amount) AS revenue FROM sales GROUP BY date_trunc('MONTH', ts); \``` **Feldera SQL:** ```sql SELECT FLOOR(ts TO MONTH) AS month, SUM(amount) AS revenue FROM sales GROUP BY FLOOR(ts TO MONTH); \``` ``` > **Note:** The frontmatter block (`---`) is required. Files without it are skipped. Valid categories: `aggregates`, `string`, `datetime`, `array`, `json`, `map`, `types`. Place `.md` files in one of these locations — loaded automatically, no flag needed: | Location | Scope | |---|---| | `~/.felderize/examples/` | All your projects (survives pip upgrades) | | `.felderize/examples/` in your project dir | This project only (commit to git) | Or pass individual files or directories explicitly (repeatable, accepts both): ```bash felderize spark translate schema.sql query.sql --examples ex1.md --examples my_examples/ ``` ## How it works felderize translates the whole program (schema + all views) in a single LLM call: 1. Loads translation rules from the skill file (`felderize/skills/spark_skills.md`). 2. Trims the schema to the tables the query actually references, then sends the Spark schema + query to the LLM with the rules and validated examples. 3. Parses the translated Feldera SQL from the response. Constructs with no Feldera equivalent are emitted as `CAST(NULL AS )` placeholders and listed in `unsupported`. 4. With `--validate`, compiles the output against the Feldera compiler and repairs it using the compiler's error feedback for up to a few attempts. If that first pass still doesn't compile, it retries once more with relevant Feldera documentation added to the prompt (from `docs.feldera.com/docs/sql/`); use `--no-docs` to skip the documentation pass. ## Support Contact us at support@feldera.com for assistance with unsupported Spark SQL features.