Changes from 1 commit (of 45 in this pull request)
046f795
[migrations] Spark-to-Feldera migration tool PoC.
wilmaontherun Mar 16, 2026
bb1b24c
[ci] apply automatic fixes
feldera-bot Mar 16, 2026
871d79c
Intermediate progress based on Mihai's comments.
wilmaontherun Mar 19, 2026
e856080
fixed comments on skills
wilmaontherun Mar 19, 2026
b672679
Fixed all comments before we refactor skills
wilmaontherun Mar 19, 2026
3d8536f
Merge skills
wilmaontherun Mar 19, 2026
219f8b7
Fixed the rest of the code w.r.t. new skill file
wilmaontherun Mar 19, 2026
83d7731
Revised doc indexing
wilmaontherun Mar 19, 2026
98f7dee
merged skills
wilmaontherun Mar 19, 2026
56bd911
add --verbose flag, translate-file, combined demos, and Feldera PK/qu…
wilmaontherun Mar 19, 2026
b10f7a4
more demo files
wilmaontherun Mar 19, 2026
7d56201
revised samples and skills
wilmaontherun Mar 20, 2026
9e17a33
[ci] apply automatic fixes
feldera-bot Mar 20, 2026
3a75739
add --compiler option, fix no-compiler handling, improve example list…
wilmaontherun Mar 20, 2026
ab4f746
[ci] apply automatic fixes
feldera-bot Mar 20, 2026
6672146
fixed readme
wilmaontherun Mar 20, 2026
9975f33
[ci] apply automatic fixes
feldera-bot Mar 20, 2026
14e7cc6
Add --model option, remove OpenAI support and hardcoded compiler path
wilmaontherun Mar 20, 2026
7ac2fb9
Use sqlparse for SQL splitting, fix README inconsistencies
wilmaontherun Mar 20, 2026
6655f7c
Add prompt caching and rate limit retry; skip examples on first pass
wilmaontherun Mar 20, 2026
8fa1210
Clean up code quality: fix imports, types, and consistency issues
wilmaontherun Mar 21, 2026
4358104
Fix spark_skills.md inconsistencies
wilmaontherun Mar 24, 2026
80a5e7f
[ci] apply automatic fixes
feldera-bot Mar 24, 2026
7da4c30
Verify and fix spark_skills.md against Apache Spark SQL reference
wilmaontherun Mar 24, 2026
589406c
Overhaul spark/data/samples: fix errors, add new patterns, remove tri…
wilmaontherun Mar 24, 2026
b53a97a
Fix skills inconsistencies: QUARTER unsupported, contains/binary, pmo…
wilmaontherun Mar 25, 2026
05388f9
Rename misnamed sample files to match their content
wilmaontherun Mar 25, 2026
643ba2b
Improve and expand sample demos
wilmaontherun Mar 25, 2026
d1b0c95
Fix demo files: remove unsupported patterns, add dates and arithmetic…
wilmaontherun Mar 25, 2026
d7057b9
Fix aggregations and arithmetic demos to use only supported Feldera f…
wilmaontherun Mar 25, 2026
b438037
split_part skill & base_url config
anandbraman Mar 27, 2026
aaf02dd
Update AVG(integer) rule: rewrite to AVG(CAST(col AS DOUBLE)) for int…
wilmaontherun Mar 30, 2026
9de37f9
Untrack known_unsupported.yaml (ignored by .gitignore)
wilmaontherun Mar 30, 2026
e088426
Add dialect subcommand structure: felderize spark <cmd>
wilmaontherun Mar 30, 2026
d596584
Move data/skills one level above spark/ to felderize root
wilmaontherun Mar 30, 2026
d5bf6e0
Move skills file to spark/skills/
wilmaontherun Mar 30, 2026
e714f19
Address review comments: lateral aliases, size/CARDINALITY, JSON, sem…
wilmaontherun Mar 30, 2026
621dd95
Merge branch 'gh-readonly-queue/main/pr-5953-63f28bb9543f28137c91d7a2…
wilmaontherun Mar 30, 2026
a84c82f
Reorganize spark_skills.md: fix section placement and add missing ent…
wilmaontherun Mar 31, 2026
06b7bc6
Fix and improve spark_skills.md translation rules
wilmaontherun Apr 2, 2026
425d5ab
Reorganize spark_skills.md: structured subsections with emoji markers
wilmaontherun Apr 2, 2026
2b98d4e
Verify spark_skills.md claims against test evidence; fix errors found
wilmaontherun Apr 2, 2026
4faca22
Fix skills: STDDEV rewrite rule, GBD-REGEX-ESCAPE for RLIKE, CAST(num…
wilmaontherun Apr 2, 2026
c1f740d
README: note that Spark is current dialect, more planned
wilmaontherun Apr 2, 2026
d8a1ed6
[ci] apply automatic fixes
feldera-bot Apr 2, 2026
add --verbose flag, translate-file, combined demos, and Feldera PK/quoting rule
wilmaontherun committed Mar 19, 2026
commit 56bd911b8b5957c5b8710dfab25c147d5cb39e73
37 changes: 33 additions & 4 deletions python/felderize/README.md
@@ -27,16 +27,32 @@ echo 'ANTHROPIC_API_KEY=your-key-here' > .env
# List available examples
felderize example

# Translate an example (validates by default)
felderize example simple

# Without compiler validation
felderize example simple --no-validate

# Log SQL submitted to the validator at each attempt
felderize example json --verbose

# Output as JSON
felderize example simple --json-output
```

Available examples:

| Name | Description |
|------|-------------|
| `simple` | Date truncation, GROUP BY |
| `strings` | INITCAP, LPAD, NVL, CONCAT_WS |
| `arrays` | array_contains, size, element_at |
| `joins` | Null-safe equality (`<=>`) |
| `windows` | LAG, running SUM OVER |
| `aggregations` | COUNT DISTINCT, HAVING (includes unsupported: COLLECT_LIST, PERCENTILE_APPROX) |
| `json` | get_json_object → PARSE_JSON + VARIANT access *(combined file)* |
| `topk` | ROW_NUMBER TopK, QUALIFY, DATEDIFF → TIMESTAMPDIFF *(combined file)* |

The JSON output contains:

```json
…
```

### Translate your own SQL

Two input formats are supported:

**Separate schema and query files:**
```bash
felderize translate path/to/schema.sql path/to/query.sql
felderize translate path/to/schema.sql path/to/query.sql --validate
```

**Single combined file** (CREATE TABLE and CREATE VIEW statements in one file):
```bash
felderize translate-file path/to/combined.sql
felderize translate-file path/to/combined.sql --validate
```

> **Note:** Running without `--validate` prints a warning — the output SQL has not been verified against the Feldera compiler.

Both commands accept `--verbose` to log the SQL submitted to the validator at each repair attempt.
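
The combined-file split can be sketched in a few lines of Python. This is a hypothetical toy (the shipped tool splits statements with sqlparse, and `split_combined` is an invented name), shown only to illustrate how schema and query parts are separated:

```python
import re

def split_combined(sql: str) -> tuple[str, str]:
    # Toy splitter: the naive ';' split breaks on semicolons inside
    # string literals, which is why the real tool uses a SQL parser.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    schema = [s + ";" for s in statements if re.match(r"(?i)CREATE\s+TABLE\b", s)]
    query = [s + ";" for s in statements if re.match(r"(?i)CREATE\s+(TEMPORARY\s+)?VIEW\b", s)]
    return "\n".join(schema), "\n".join(query)

combined = """
CREATE TABLE t (id BIGINT NOT NULL, PRIMARY KEY (id));
CREATE VIEW v AS SELECT id FROM t;
"""
schema_sql, query_sql = split_combined(combined)
```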

### Batch translation

```bash
1 change: 0 additions & 1 deletion python/felderize/pyproject.toml
@@ -25,7 +25,6 @@ felderize = [
"data/skills/**/*.md",
"data/samples/*.md",
"data/demo/*.sql",
"data/demo/expected/*.sql",
]

[project.scripts]
57 changes: 41 additions & 16 deletions python/felderize/spark/cli.py
@@ -22,10 +22,13 @@ def cli():
@click.option("--validate", is_flag=True, help="Validate against Feldera instance")
@click.option("--json-output", is_flag=True, help="Output as JSON")
@click.option("--no-docs", is_flag=True, help="Disable Feldera doc inclusion in prompt")
@click.option("--verbose", is_flag=True, help="Log SQL submitted to validator at each attempt")
def translate(
schema_file: str, query_file: str, validate: bool, json_output: bool, no_docs: bool, verbose: bool
):
"""Translate a single Spark SQL schema + query pair to Feldera SQL."""
if not validate:
click.echo("Warning: running without validation — output SQL is not verified against the Feldera compiler.", err=True)
config = Config.from_env()
schema_sql = Path(schema_file).read_text()
query_sql = Path(query_file).read_text()
@@ -36,6 +39,7 @@ def translate(
config,
validate=validate,
include_docs=not no_docs,
verbose=verbose,
)

if json_output:
@@ -49,8 +53,11 @@
@click.option("--validate", is_flag=True, help="Validate against Feldera instance")
@click.option("--json-output", is_flag=True, help="Output as JSON")
@click.option("--no-docs", is_flag=True, help="Disable Feldera doc inclusion in prompt")
@click.option("--verbose", is_flag=True, help="Log SQL submitted to validator at each attempt")
def translate_file(sql_file: str, validate: bool, json_output: bool, no_docs: bool, verbose: bool):
"""Translate a single combined Spark SQL file (schema + views) to Feldera SQL."""
if not validate:
click.echo("Warning: running without validation — output SQL is not verified against the Feldera compiler.", err=True)
config = Config.from_env()
combined_sql = Path(sql_file).read_text()
schema_sql, query_sql = split_combined_sql(combined_sql)
@@ -61,6 +68,7 @@ def translate_file(
config,
validate=validate,
include_docs=not no_docs,
verbose=verbose,
)

if json_output:
@@ -140,7 +148,8 @@ def batch(data_dir: str, validate: bool, output_dir: str | None, no_docs: bool):
)
@click.option("--json-output", is_flag=True, help="Output as JSON")
@click.option("--no-docs", is_flag=True, help="Disable Feldera doc inclusion in prompt")
@click.option("--verbose", is_flag=True, help="Log SQL submitted to validator at each attempt")
def example(name: str | None, validate: bool, json_output: bool, no_docs: bool, verbose: bool):
"""Run a built-in example translation.

Without NAME, lists available examples. With NAME, translates that example.
@@ -150,43 +159,59 @@
felderize example # list available examples
felderize example simple # translate the 'simple' example
"""
# Discover available examples: schema+query pairs and combined files
pairs: dict[str, tuple[Path, Path] | Path] = {}
for schema_file in sorted(_EXAMPLES_DIR.glob("*_schema.sql")):
example_name = schema_file.name.replace("_schema.sql", "")
query_file = _EXAMPLES_DIR / f"{example_name}_query.sql"
if query_file.is_file():
pairs[example_name] = (schema_file, query_file)
for combined_file in sorted(_EXAMPLES_DIR.glob("*_combined.sql")):
example_name = combined_file.name.replace("_combined.sql", "")
pairs[example_name] = combined_file

if not name:
click.echo("Available examples:\n")
for ex_name, files in pairs.items():
if isinstance(files, Path):
preview = files.read_text().strip().split("\n")[0]
click.echo(f" {ex_name:20s} {preview} [combined]")
else:
preview = files[0].read_text().strip().split("\n")[0]
click.echo(f" {ex_name:20s} {preview}")
click.echo("\nRun one with: felderize example <name>")
return

if name not in pairs:
click.echo(f"Unknown example '{name}'. Available: {', '.join(pairs)}", err=True)
sys.exit(1)

files = pairs[name]
if isinstance(files, Path):
combined_sql = files.read_text()
schema_sql, query_sql = split_combined_sql(combined_sql)
click.echo(f"-- Spark SQL ({name}) --", err=True)
click.echo(combined_sql.strip(), err=True)
else:
schema_file, query_file = files
schema_sql = schema_file.read_text()
query_sql = query_file.read_text()
click.echo(f"-- Spark Schema ({name}) --", err=True)
click.echo(schema_sql.strip(), err=True)
click.echo(f"\n-- Spark Query ({name}) --", err=True)
click.echo(query_sql.strip(), err=True)
click.echo("\nTranslating...\n", err=True)

if not validate:
click.echo("Warning: running without validation — output SQL is not verified against the Feldera compiler.", err=True)
config = Config.from_env()
result = translate_spark_to_feldera(
schema_sql,
query_sql,
config,
validate=validate,
include_docs=not no_docs,
verbose=verbose,
)

if json_output:
21 changes: 0 additions & 21 deletions python/felderize/spark/data/demo/expected/aggregations.sql

This file was deleted.

15 changes: 0 additions & 15 deletions python/felderize/spark/data/demo/expected/arrays.sql

This file was deleted.

23 changes: 0 additions & 23 deletions python/felderize/spark/data/demo/expected/joins.sql

This file was deleted.

18 changes: 0 additions & 18 deletions python/felderize/spark/data/demo/expected/simple.sql

This file was deleted.

18 changes: 0 additions & 18 deletions python/felderize/spark/data/demo/expected/strings.sql

This file was deleted.

17 changes: 0 additions & 17 deletions python/felderize/spark/data/demo/expected/windows.sql

This file was deleted.

60 changes: 60 additions & 0 deletions python/felderize/spark/data/skills/spark_skills.md
@@ -52,6 +52,64 @@ If the compiler reports `Encountered "<" ... ARRAY<VARCHAR>`: rewrite ALL `ARRAY
| `CREATE TEMPORARY VIEW` | → `CREATE VIEW` |
| `USING parquet` / `delta` / `csv` | Remove clause |
| `PARTITIONED BY (...)` | Remove clause |
| `CONSTRAINT name PRIMARY KEY (cols)` | → `PRIMARY KEY (cols)` — drop the `CONSTRAINT name` wrapper; Feldera rejects the named constraint syntax |
| PK column without `NOT NULL` | Add `NOT NULL` — Feldera requires all PRIMARY KEY columns to be NOT NULL |
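
As a rough illustration of the clause-level rewrites in this table, a regex pass could look like the following. This is hypothetical (`rewrite_ddl` is an invented helper; the actual translation is model-driven, and this sketch deliberately skips the NOT NULL rule, which needs column-level analysis):

```python
import re

def rewrite_ddl(sql: str) -> str:
    # Clause-level rewrites from the table above (illustrative only).
    sql = re.sub(r"(?i)CREATE\s+TEMPORARY\s+VIEW", "CREATE VIEW", sql)
    sql = re.sub(r"(?i)\s+USING\s+(parquet|delta|csv)\b", "", sql)
    sql = re.sub(r"(?i)\s+PARTITIONED\s+BY\s*\([^)]*\)", "", sql)
    sql = re.sub(r"(?i)CONSTRAINT\s+\w+\s+(PRIMARY\s+KEY)", r"\1", sql)
    return sql
```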

### PRIMARY KEY rules

Two constraints Feldera enforces that Spark does not:

1. **No `CONSTRAINT name` wrapper** — Feldera rejects `CONSTRAINT pk PRIMARY KEY (col)`. Use bare `PRIMARY KEY (col)`.
2. **All PK columns must be `NOT NULL`** — Feldera rejects nullable PK columns:

```
error: PRIMARY KEY cannot be nullable: PRIMARY KEY column 'borrowerid' has type VARCHAR, which is nullable
```

```sql
-- Spark (both issues)
CREATE TABLE orders (
order_id STRING,
item_id STRING,
CONSTRAINT orders_pk PRIMARY KEY (order_id, item_id)
);

-- Feldera (fixed)
CREATE TABLE orders (
order_id VARCHAR NOT NULL,
item_id VARCHAR NOT NULL,
PRIMARY KEY (order_id, item_id)
);
```

### Reserved words as column names must be quoted

**Only quote column names that are SQL reserved words** — do not quote ordinary identifiers. Quoting non-reserved words is wrong: it makes identifiers case-sensitive and adds unnecessary noise.

Quote a column name only when the compiler rejects it unquoted:
```
error: Error parsing SQL: Encountered ", TimeStamp" at line 33, column 25.
```

When quoting is needed, apply it consistently in both `CREATE TABLE` and every query reference:

```sql
-- Schema: only "TimeStamp" is quoted — it clashes with the TIMESTAMP type keyword
CREATE TABLE events (
id BIGINT NOT NULL,
source VARCHAR,
"TimeStamp" TIMESTAMP,
PRIMARY KEY (id)
);

-- Query: quote "TimeStamp" everywhere it appears, leave other columns unquoted
SELECT e.source, e."TimeStamp" as ts,
MAX(e."TimeStamp") OVER (PARTITION BY e.id) as max_ts
FROM events e
WHERE e."TimeStamp" >= TIMESTAMP '2024-01-01 00:00:00'
```

Known column names that clash with SQL keywords: `TimeStamp`, `Date`, `Time`, `Value`, `Type`, `Name`, `Language`.
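
A hedged sketch of applying this rule mechanically (hypothetical helper, not part of felderize; the keyword set is a partial illustration, not Feldera's full reserved-word list):

```python
# Partial, illustrative keyword set; not the full Feldera reserved-word list.
CLASHING_KEYWORDS = {"TIMESTAMP", "DATE", "TIME", "VALUE", "TYPE", "NAME", "LANGUAGE"}

def quote_if_reserved(column: str) -> str:
    # Double-quote only names that clash with SQL keywords; quoted
    # identifiers become case-sensitive, so quoting everything is wrong.
    if column.upper() in CLASHING_KEYWORDS:
        return f'"{column}"'
    return column
```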

### DDL Examples

@@ -578,6 +636,8 @@ When the Feldera compiler rejects translated SQL, check these common causes first
| `No match found for function signature day(<TIMESTAMP>)` | Used `DAY(ts)` on a TIMESTAMP | Use `DAYOFMONTH(ts)` or `EXTRACT(DAY FROM ts)` |
| `No match found for function signature X` | Function is unsupported | Check this reference; if listed as unsupported, return immediately — do NOT retry |
| `Encountered "<" ... ARRAY<VARCHAR>` | Used Spark array syntax | Rewrite ALL `ARRAY<T>` to `T ARRAY` suffix form |
| `Error parsing SQL: Encountered ", ColumnName"` | Column name is a SQL reserved word | Double-quote the column name in schema and all query references, e.g. `"TimeStamp"` |
| `PRIMARY KEY cannot be nullable: column 'x' has type T, which is nullable` | PK column missing `NOT NULL` | Add `NOT NULL` to every column listed in the PRIMARY KEY |

## Important rules
