[migrations] Spark-to-Feldera migration tool PoC. by wilmaontherun · Pull Request #5837 · feldera/feldera

wilmaontherun · 2026-03-16T17:50:22Z

A CLI tool that uses an LLM to translate Spark SQL programs to Feldera SQL and to validate the result syntactically.

Requires an Anthropic API key in felderize/.env.

Describe Manual Test Plan

No automated tests yet. Tested manually using examples in the demo folder.

CLI tool using LLM to translate and syntactically validate Spark SQL programs to Feldera SQL. Signed-off-by: Wilma <wilmaontherun@gmail.com>

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

python/felderize/spark/data/docs/sql/array.md

python/felderize/spark/data/samples/array_lambda.md

python/felderize/spark/data/samples/null_safe_equality.md

python/felderize/spark/data/skills/function-reference/SKILL.md

python/felderize/spark/data/skills/query-rewrite/SKILL.md

python/felderize/spark/translator.py

mihaibudiu · 2026-03-19T16:27:01Z

We should build a library with compatibility functions that people can just reuse, especially if they can be written in SQL.

addressed remaining comments

…oting rule

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

…ing and skills

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

- Added --model CLI option to translate, translate-file, and example commands
- Model and compiler path now read exclusively from .env / CLI flags
- Removed OpenAI provider support (untested)
- Removed hardcoded default compiler path
- Updated README for consistency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Replaced custom semicolon scanner with sqlparse.split(), which handles string literals, comments, and block comments correctly
- Added sqlparse>=0.5.0 to dependencies, removed openai dependency
- Fixed README: clarified FELDERA_COMPILER comment (not a default, just repo location)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
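A short sketch of why a plain split on `;` is not enough, which is the likely motivation for the change above. The sample SQL and the `naive_split` helper are made up for illustration and are not the PR's code. `sqlparse.split()` is the call named in the commit; it is only invoked here if the package happens to be installed.

```python
def naive_split(sql: str) -> list[str]:
    """Split on every semicolon, ignoring SQL quoting rules (the failure mode)."""
    return [part.strip() for part in sql.split(";") if part.strip()]


# Hypothetical input: the first statement has a ';' inside a string literal.
SQL = "SELECT 'a;b' AS s; SELECT 2 AS n;"

pieces = naive_split(SQL)
# Three fragments come back although the input holds only two statements.
print(len(pieces), pieces)

try:
    import sqlparse  # third-party dependency added by this commit
except ImportError:
    sqlparse = None

if sqlparse is not None:
    # sqlparse tokenizes the input first, so a quoted ';' does not end a statement.
    print(sqlparse.split(SQL))
```

The naive version cuts the string literal in half, so any downstream compile step would see broken statements.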

- llm.py: wrap system prompt in cache_control ephemeral block to enable Anthropic prompt caching; add retry with exponential backoff on rate limits
- translator.py: omit examples on first translation attempt (skills only) to reduce token usage and latency (~20s → ~4s for simple queries)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
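The two llm.py changes can be sketched roughly as follows. This is not the code from the PR: `RateLimited`, `cached_system_block`, and `with_backoff` are hypothetical names, and the retry count, base delay, and jitter are assumptions. The dictionary mirrors the documented shape of an Anthropic system block with `cache_control` set to `ephemeral`.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the client's rate-limit error."""


def cached_system_block(prompt: str) -> list[dict]:
    # System prompt as a list of content blocks, marked as cacheable.
    return [{"type": "text", "text": prompt, "cache_control": {"type": "ephemeral"}}]


def with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on RateLimited, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # add a little jitter


# Usage: fail twice, then succeed; record the waits instead of sleeping.
attempts = {"n": 0}
waits = []


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"


result = with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the retry loop testable without real delays.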

- docs.py: replace module-level _FUNC_ANCHORS with per-dir _get_cats_and_anchors() cache
- llm.py: move imports to top level, add unreachable guard
- translator.py: move sqlparse import to top level, fix LLMClient type annotation, remove double-strip
- feldera_client.py: keep f.name usage inside with block
- skills.py: remove redundant intermediate sort
- cli.py: remove untested batch command, fix Status import, add missing --compiler/--model to all commands
- pyproject.toml: remove unused httpx dependency
- README.md: update to reflect removed batch command and full options list
- spark_skills.md: add rewrite rules and unsupported constructs from test investigation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- STRING/TEXT type mapping: STRING→VARCHAR, TEXT→VARCHAR
- Remove duplicate HEX/UNHEX from Hashing section
- Remove CAST(INTERVAL SECOND AS DECIMAL) from Unsupported (contradicted Supported section)
- Window: unify ROWS/RANGE BETWEEN into one Unsupported entry
- split(str,delim,limit): clarify 2-arg form is supported
- LN/LOG10: "runtime error" → "drops the row (WorkerPanic)" for negative input
- TIMESTAMP_NTZ: clarify "replace with TIMESTAMP in DDL"
- FIRST_VALUE/LAST_VALUE notes consistent with Window unsupported section
- Scalar subquery rule: fix incorrect "subquery with FROM → mark unsupported"
- Remove unexplained CREATE TYPE + jsonstring_as_ hint
- trunc(d,'Q'): move to Unsupported (DATE_TRUNC QUARTER fails at runtime)
- make_timestamp: move to Rewritable with PARSE_TIMESTAMP rewrite
- from_unixtime: use TIMESTAMPADD directly (consistent with to_timestamp)
- encode/decode: remove misleading "IS rewritable as CASE WHEN" note
- width_bucket: remove stray extra column
- SIGN: remove misleading "Input/output: DECIMAL" note
- date_format: handle TIMESTAMP input via CAST to DATE
- log(base,x): add examples to reinforce arg swap rule
- [GBD-ARRAY-ORDER]: new GBD entry; annotate ARRAY_UNION/ARRAY_EXCEPT
- Bitwise scalar operators moved to Unsupported section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

- Add ground truth note: all signatures from spark.apache.org/docs/latest/api/sql/index.html
- Fix unix_millis/unix_micros: take timestamp arg (not no-arg current-time)
- Fix pmod: unified formula MOD(MOD(a,ABS(b))+ABS(b),ABS(b)) for all divisor signs
- Move try_divide/try_add/try_subtract/try_multiply to unsupported (semantic mismatch)
- Move map_entries to unsupported (returns array of structs, no Feldera equivalent)
- Fix from_unixtime: note STRING vs TIMESTAMP type difference, mark fmt-arg as unsupported
- Fix posexplode: subtract 1 from ORDINALITY (Spark 0-based, SQL 1-based)
- Add translate warning: REGEXP_REPLACE treats chars as regex patterns
- Fix lpad/rpad: document optional pad arg (defaults to space)
- Fix months_between: add roundOff note, precise fractional example
- Fix trunc WEEK: move to unsupported (same Sunday/Monday mismatch as date_trunc WEEK)
- Move try_* from String to Math in unsupported section
- Add trunc YYYY/MM/MON aliases, to_date using PARSE_TIMESTAMP

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
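The posexplode fix in this commit is an off-by-one adjustment between 0-based and 1-based positions. The sketch below models it in plain Python instead of SQL; the `tags` list is made-up sample data, and `enumerate(..., start=1)` stands in for a 1-based ordinality column.

```python
# Made-up array column value.
tags = ["a", "b", "c"]

# 1-based positions, standing in for UNNEST ... WITH ORDINALITY.
one_based = list(enumerate(tags, start=1))

# Spark's posexplode reports 0-based positions, so subtract 1 from the ordinality.
zero_based = [(ordinality - 1, value) for ordinality, value in one_based]

print(one_based)
print(zero_based)
```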

…vial entries

- Fix json_extraction: lateral alias not supported in Feldera; repeat PARSE_JSON per field
- Fix datediff: use correct DATEDIFF(DAY, start, end) instead of TIMESTAMPDIFF
- Replace null_safe_equality with LOG argument order reversal (critical gotcha)
- Replace nvl_coalesce with LPAD/RPAD rewrite (no native support in Feldera)
- Improve array_map_functions: add element_at(map,key) → map[key], CARDINALITY NULL note
- Add explode_unnest: LATERAL VIEW explode/posexplode/inline → UNNEST patterns
- Add json_extraction: get_json_object → PARSE_JSON + bracket syntax, CTE for GROUP BY
- Remove array_lambda (unsupported-only, no rewrite value)
- Remove row_number_topk (trivial CREATE VIEW wrapper, no translation needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…d, to_date, date_format

- Remove trunc(d,'Q') from supported section: DATE_TRUNC(QUARTER) fails at runtime in Feldera (date_trunc_quarter_DateN missing); unsupported entry at line 716 was already correct
- Fix contains(binary,...): POSITION rewrite works for binary args; only boolean args are truly unsupported
- Fix pmod formula to CASE WHEN MOD(a,b)<0 AND b>0 THEN MOD(a,b)+b ELSE MOD(a,b) END (empirically verified against all sign combinations)
- Fix to_date: use PARSE_DATE not PARSE_TIMESTAMP (panics on date-only strings)
- Fix date_format: FORMAT_DATE truncates time; use CONCAT+EXTRACT for time components; FORMAT_TIMESTAMP does not exist; LPAD does not work here
- Fix JSON lateral alias note: Feldera does NOT support lateral aliases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
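To see where the two pmod rewrites from this PR differ, the sketch below evaluates both formulas in Python. It assumes SQL `MOD` returns a remainder with the sign of the dividend (modeled by the hypothetical `trunc_mod` helper); that is an assumption and has not been checked against Feldera here.

```python
def trunc_mod(a: int, b: int) -> int:
    """Remainder carrying the sign of the dividend (assumed SQL MOD semantics)."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def pmod_abs(a: int, b: int) -> int:
    # Earlier rewrite: MOD(MOD(a, ABS(b)) + ABS(b), ABS(b))
    return trunc_mod(trunc_mod(a, abs(b)) + abs(b), abs(b))


def pmod_case(a: int, b: int) -> int:
    # Later rewrite: CASE WHEN MOD(a,b) < 0 AND b > 0 THEN MOD(a,b) + b ELSE MOD(a,b) END
    m = trunc_mod(a, b)
    return m + b if m < 0 and b > 0 else m


# All four sign combinations of dividend and divisor.
for a, b in [(7, 3), (-7, 3), (7, -3), (-7, -3)]:
    print(a, b, pmod_abs(a, b), pmod_case(a, b))
```

Under that assumption the two formulas agree except when both operands are negative, where the earlier one yields 2 and the later one yields -1 for (-7, -3).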

- nvl_coalesce.md → lpad_rpad.md (contains LPAD/RPAD rewrite)
- null_safe_equality.md → log_arg_order.md (contains LOG arg order reversal)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- left_semi_join: add CREATE TABLE schemas, fix WHERE→ON clause placement
- to_date_date_format (new): PARSE_DATE vs PARSE_TIMESTAMP, FORMAT_DATE+EXTRACT pattern for time components, FORMAT_TIMESTAMP nonexistence
- window_functions (new): ROW_NUMBER TopK, LAG/LEAD, partition SUM; notes on ROWS/RANGE frames unsupported and TopK outer-WHERE requirement
- pmod_try_arithmetic (new): pmod CASE WHEN rewrite, try_divide NULL approximation, try_subtract direct translation with overflow warning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… examples

- windows_query: remove ROWS BETWEEN frame (unsupported in Feldera)
- aggregations_query: replace PERCENTILE_APPROX with STDDEV (no Feldera equivalent)
- json_combined: replace $.items[0] array path with scalar path (array paths unsupported)
- topk_combined: replace Feldera 3-arg DATEDIFF with Spark 2-arg datediff (Spark input)
- Add dates_combined: to_date / date_format Spark input demo
- Add arithmetic_combined: pmod / try_divide / try_subtract Spark input demo

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…eatures

- aggregations: remove COLLECT_LIST (unsupported), STDDEV → SUM
- arithmetic: replace try_divide/try_subtract with NULLIF division and direct subtraction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mihaibudiu · 2026-03-25T17:04:18Z

@anandbraman I think we should both review this PR

mihaibudiu

Maybe @anandbraman and I should make the fixes I recommend and take over this project.

python/felderize/README.md

python/felderize/spark/data/samples/aggregation_date_trunc.md

python/felderize/spark/data/samples/array_map_functions.md

python/felderize/spark/data/samples/json_extraction.md

python/felderize/spark/data/skills/spark_skills.md

python/felderize/spark/cli.py

python/felderize/spark/docs.py

anandbraman · 2026-03-27T15:38:26Z

python/felderize/spark/data/skills/spark_skills.md

+
+| Function | Notes |
+|----------|-------|
+| `split_part(str, delim, n)` | Feldera's `SPLIT_PART` treats the delimiter as a regex — special chars like `.` match any character and produce wrong results. Negative indices not supported. Always mark unsupported. |


Minor correction - feldera split_part delimiter is not a regex.

…eger inputs

- AVG(INT/BIGINT/SMALLINT/TINYINT) is safely rewritable via CAST to DOUBLE
- Remove known_unsupported entries for AVG(INT) tests now covered by the rewrite (cast_to_date_003, select_no_from_002, grouping_sets_002, cube_agg_002, null-handling_015)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Commands are now under a 'spark' subgroup to support future dialects.

felderize spark translate
felderize spark translate-file
felderize spark example

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
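The dialect-then-command layout can be sketched with nested subparsers from the standard library. This is only an illustration of the structure: the PR text does not show which CLI framework the tool uses, the positional `target` argument is a placeholder, and `some-model` is a made-up value. The `--model` and `--compiler` options are the ones named in the commit messages.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="felderize")
    dialects = parser.add_subparsers(dest="dialect", required=True)

    # One subparser per dialect; only 'spark' exists in this PR.
    spark = dialects.add_parser("spark")
    commands = spark.add_subparsers(dest="command", required=True)

    for name in ("translate", "translate-file", "example"):
        cmd = commands.add_parser(name)
        cmd.add_argument("target", nargs="?")  # placeholder positional argument
        cmd.add_argument("--model")
        cmd.add_argument("--compiler")
    return parser


args = build_parser().parse_args(
    ["spark", "translate-file", "query.sql", "--model", "some-model"]
)
print(args.dialect, args.command, args.target, args.model)
```

Adding another dialect later would mean one more `dialects.add_parser(...)` call with its own command set, leaving the `spark` commands untouched.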

skills are dialect-agnostic rules; placing at data/skills/ makes the directory structure cleaner for future dialects alongside spark/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…i-join note

- json_extraction.md: use lateral alias pattern (PARSE_JSON once, reuse)
- spark_skills.md: Feldera supports lateral aliases, update JSON notes
- array_map_functions.md: use COALESCE(CARDINALITY(tags), -1) for exact NULL match
- spark_skills.md: update size() → COALESCE(CARDINALITY, -1) rewrite
- left_semi_join.md: clarify note about right-table filter placement
- README.md: update commands to felderize spark subcommand

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ee79db25981c099d' of github.com:feldera/feldera into felderize

…ries

- Move rewrite-requiring functions from Direct equivalents to Rewritable patterns: LTRIM/RTRIM, NVL/ZEROIFNULL/NULLIFZERO, any→bool_or, collect_list/collect_set, DAY→DAYOFMONTH
- Move direct equivalents out of Rewritable patterns: GROUPING SETS/ROLLUP/CUBE/grouping_id, JOIN USING/NATURAL JOIN, sec/csc/cot
- Fix Higher-order array functions table: add missing header, move exists→ARRAY_EXISTS to Rewritable
- Remove RANGE BETWEEN from Unsupported (it is supported); move TIMESTAMP_NTZ to DDL Rewrites
- Add missing entries: IF(), INSTR(), isnull()/isnotnull()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Skills corrections:

- GBD-FLOAT-GROUP: clarify it affects ORDER BY (not GROUP BY) on special float values
- BINARY→VARBINARY: add type mapping; x'...' hex literals work in both engines
- every/bool_and/bool_or: add note that window functions with ORDER BY BOOLEAN unsupported (#457)
- SPLIT_PART: remove (same in both engines, no special handling needed)
- get_json_object: document that CAST(VARIANT AS VARCHAR) returns NULL for numbers/booleans
- collect_list/ARRAY_AGG: tag with GBD-ARRAY-ORDER; add ORDER BY recommendation
- positive()/negative(): simplify to x/-x (Feldera auto-casts string args)
- hex(x): document UPPER(TO_HEX()) required; TO_HEX accepts VARBINARY only
- split(str, regex): document Feldera SPLIT only supports literal delimiters
- PARSE_JSON lateral alias: reframe as translation warning (don't expose v as view column)
- GROUPING SETS + HAVING: add concrete example of alias collision bug
- VALUES implicit column names: improve example to clearly show wrong vs correct

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Extract long inline notes into tagged behavioral difference blocks (GBD-JSON-CAST, GBD-OUTER-EXPLODE, GBD-PIVOT-NULL, GBD-TIMEZONE, GBD-MONTHS-BETWEEN, GBD-SAFE-CAST)
- Add emoji heading markers (⚠️ 🔄 📌 📝) throughout Rewritable patterns
- Move LOCATE 3-arg and translate() examples to Notes blocks
- Move log() arg-order warning to Notes block
- Restructure UNNEST details, DATE_ADD, named_struct with proper subsections
- Restructure SQL Syntax differences subsections with translation rules and examples
- Clean up Unsupported table: merge duplicate rows, fix malformed split row, reference GBD-NONDETERMINISTIC for uuid()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Fix GROUPING SETS + HAVING example: Spark returns 2 rows not 1 (verified by having_006 test: feldera=0, spark=2)
- Add split_part with regex special char delimiters to Unsupported (verified by string-functions_024: SPLIT_PART('11.12.13', '.', 2) returns '' in Feldera vs '12' in Spark; filed as Feldera bug)
- Expand GBD-ARRAY-ORDER to include ARRAY_INTERSECT, MAP_KEYS, MAP_VALUES, collect_set (verified by test failures in known_unsupported.yaml)
- All 28 validate_skills failures confirmed in known_unsupported.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…eric AS TIMESTAMP) semantics; update README model and ANTHROPIC_BASE_URL Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

wilmaontherun requested review from gz, mihaibudiu and ryzhyk March 16, 2026 17:50

wilmaontherun force-pushed the felderize branch 3 times, most recently from f878931 to 3f816fb Compare March 16, 2026 18:48

[migrations] Spark-to-Feldera migration tool PoC.

046f795

wilmaontherun force-pushed the felderize branch from 4d43839 to 046f795 Compare March 16, 2026 18:56

[ci] apply automatic fixes

bb1b24c

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

mihaibudiu reviewed Mar 16, 2026

wilmaontherun added 3 commits March 18, 2026 21:46

Intermediate progress based on Mihai's comments.

871d79c

fixed comments on skills

e856080

Fixed all comments before we refactor skills

b672679

wilmaontherun and others added 13 commits March 19, 2026 10:26

Merge skills

3d8536f

Fixed the rest of the code w.r.t. new skill file

219f8b7

Revised doc indexing

83d7731

merged skills

98f7dee

addressed remaining comments

add --verbose flag, translate-file, combined demos, and Feldera PK/qu…

56bd911

…oting rule

more demo files

b10f7a4

revised samples and skills

7d56201

[ci] apply automatic fixes

9e17a33

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

add --compiler option, fix no-compiler handling, improve example list…

3a75739

…ing and skills

[ci] apply automatic fixes

ab4f746

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

fixed readme

6672146

[ci] apply automatic fixes

9975f33

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

wilmaontherun force-pushed the felderize branch from 6bd2d52 to 14e7cc6 Compare March 20, 2026 03:07

wilmaontherun and others added 2 commits March 19, 2026 20:17

wilmaontherun and others added 10 commits March 21, 2026 16:56

[ci] apply automatic fixes

80a5e7f

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

Rename misnamed sample files to match their content

05388f9

mihaibudiu requested a review from anandbraman March 25, 2026 17:04

mihaibudiu reviewed Mar 25, 2026

anandbraman reviewed Mar 27, 2026

anandbraman and others added 15 commits March 27, 2026 12:35

split_part skill & base_url config

b438037

Untrack known_unsupported.yaml (ignored by .gitignore)

9de37f9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add dialect subcommand structure: felderize spark <cmd>

e088426

Move data/skills one level above spark/ to felderize root

d596584

Move skills file to spark/skills/

d5bf6e0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge branch 'gh-readonly-queue/main/pr-5953-63f28bb9543f28137c91d7a2…

621dd95

…ee79db25981c099d' of github.com:feldera/feldera into felderize

Fix skills: STDDEV rewrite rule, GBD-REGEX-ESCAPE for RLIKE, CAST(num…

4faca22

…eric AS TIMESTAMP) semantics; update README model and ANTHROPIC_BASE_URL Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

README: note that Spark is current dialect, more planned

c1f740d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[ci] apply automatic fixes

d8a1ed6

Signed-off-by: feldera-bot <feldera-bot@feldera.com>

Conversation

wilmaontherun commented Mar 16, 2026

mihaibudiu commented Mar 19, 2026

mihaibudiu commented Mar 25, 2026

mihaibudiu left a comment

anandbraman Mar 27, 2026

4 participants