Draft
Commits
45 commits
046f795
[migrations] Spark-to-Feldera migration tool PoC.
wilmaontherun Mar 16, 2026
bb1b24c
[ci] apply automatic fixes
feldera-bot Mar 16, 2026
871d79c
Intermediate progress based on Mihai's comments.
wilmaontherun Mar 19, 2026
e856080
fixed comments on skills
wilmaontherun Mar 19, 2026
b672679
Fixed all comments before we refactor skills
wilmaontherun Mar 19, 2026
3d8536f
Merge skills
wilmaontherun Mar 19, 2026
219f8b7
Fixed the rest of the code w.r.t. new skill file
wilmaontherun Mar 19, 2026
83d7731
Revised doc indexing
wilmaontherun Mar 19, 2026
98f7dee
merged skills
wilmaontherun Mar 19, 2026
56bd911
add --verbose flag, translate-file, combined demos, and Feldera PK/qu…
wilmaontherun Mar 19, 2026
b10f7a4
more demo files
wilmaontherun Mar 19, 2026
7d56201
revised samples and skills
wilmaontherun Mar 20, 2026
9e17a33
[ci] apply automatic fixes
feldera-bot Mar 20, 2026
3a75739
add --compiler option, fix no-compiler handling, improve example list…
wilmaontherun Mar 20, 2026
ab4f746
[ci] apply automatic fixes
feldera-bot Mar 20, 2026
6672146
fixed readme
wilmaontherun Mar 20, 2026
9975f33
[ci] apply automatic fixes
feldera-bot Mar 20, 2026
14e7cc6
Add --model option, remove OpenAI support and hardcoded compiler path
wilmaontherun Mar 20, 2026
7ac2fb9
Use sqlparse for SQL splitting, fix README inconsistencies
wilmaontherun Mar 20, 2026
6655f7c
Add prompt caching and rate limit retry; skip examples on first pass
wilmaontherun Mar 20, 2026
8fa1210
Clean up code quality: fix imports, types, and consistency issues
wilmaontherun Mar 21, 2026
4358104
Fix spark_skills.md inconsistencies
wilmaontherun Mar 24, 2026
80a5e7f
[ci] apply automatic fixes
feldera-bot Mar 24, 2026
7da4c30
Verify and fix spark_skills.md against Apache Spark SQL reference
wilmaontherun Mar 24, 2026
589406c
Overhaul spark/data/samples: fix errors, add new patterns, remove tri…
wilmaontherun Mar 24, 2026
b53a97a
Fix skills inconsistencies: QUARTER unsupported, contains/binary, pmo…
wilmaontherun Mar 25, 2026
05388f9
Rename misnamed sample files to match their content
wilmaontherun Mar 25, 2026
643ba2b
Improve and expand sample demos
wilmaontherun Mar 25, 2026
d1b0c95
Fix demo files: remove unsupported patterns, add dates and arithmetic…
wilmaontherun Mar 25, 2026
d7057b9
Fix aggregations and arithmetic demos to use only supported Feldera f…
wilmaontherun Mar 25, 2026
b438037
split_part skill & base_url config
anandbraman Mar 27, 2026
aaf02dd
Update AVG(integer) rule: rewrite to AVG(CAST(col AS DOUBLE)) for int…
wilmaontherun Mar 30, 2026
9de37f9
Untrack known_unsupported.yaml (ignored by .gitignore)
wilmaontherun Mar 30, 2026
e088426
Add dialect subcommand structure: felderize spark <cmd>
wilmaontherun Mar 30, 2026
d596584
Move data/skills one level above spark/ to felderize root
wilmaontherun Mar 30, 2026
d5bf6e0
Move skills file to spark/skills/
wilmaontherun Mar 30, 2026
e714f19
Address review comments: lateral aliases, size/CARDINALITY, JSON, sem…
wilmaontherun Mar 30, 2026
621dd95
Merge branch 'gh-readonly-queue/main/pr-5953-63f28bb9543f28137c91d7a2…
wilmaontherun Mar 30, 2026
a84c82f
Reorganize spark_skills.md: fix section placement and add missing ent…
wilmaontherun Mar 31, 2026
06b7bc6
Fix and improve spark_skills.md translation rules
wilmaontherun Apr 2, 2026
425d5ab
Reorganize spark_skills.md: structured subsections with emoji markers
wilmaontherun Apr 2, 2026
2b98d4e
Verify spark_skills.md claims against test evidence; fix errors found
wilmaontherun Apr 2, 2026
4faca22
Fix skills: STDDEV rewrite rule, GBD-REGEX-ESCAPE for RLIKE, CAST(num…
wilmaontherun Apr 2, 2026
c1f740d
README: note that Spark is current dialect, more planned
wilmaontherun Apr 2, 2026
d8a1ed6
[ci] apply automatic fixes
feldera-bot Apr 2, 2026
fixed comments on skills
wilmaontherun committed Mar 19, 2026
commit e856080069a70dacd34ce5c8fdb0a4c36b2c6ecf
10 changes: 5 additions & 5 deletions python/felderize/spark/data/skills/function-reference/SKILL.md
@@ -89,7 +89,7 @@ These Spark functions exist in Feldera — translate directly:
|-------|---------|-------|
| `YEAR(d)` | Same | |
| `MONTH(d)` | Same | |
| `DAY(d)` | `DAYOFMONTH(d)` | Use EXTRACT for timestamps |
| `DAY(d)` | `DAYOFMONTH(d)` | |
| `HOUR(ts)` | Same | |
| `MINUTE(ts)` | Same | |
| `SECOND(ts)` | Same | |
@@ -185,7 +185,8 @@ These require translation but ARE supported:
| `CUBE(a, b)` | Same | |
| `date_add(d, n)` | `d + INTERVAL 'n' DAY` | time-converter |
| `date_sub(d, n)` | `d - INTERVAL 'n' DAY` | time-converter |
| `datediff(end, start)` | `TIMESTAMPDIFF(DAY, start, end)` | time-converter |
| `datediff(end, start)` | `DATEDIFF(DAY, start, end)` | Feldera DATEDIFF takes 3 args (unit, start, end); Spark takes 2 |
| `months_between(end, start)` | `DATEDIFF(MONTH, start, end)` | Spark returns fractional months; Feldera returns integer months |
| `date_trunc('MONTH', d)` | `DATE_TRUNC(d, MONTH)` | time-converter |
| `date_trunc('MONTH', ts)` | `TIMESTAMP_TRUNC(ts, MONTH)` | time-converter |
| `LPAD(s, n, pad)` | `CASE WHEN LENGTH(s) >= n THEN SUBSTRING(s,1,n) ELSE CONCAT(REPEAT(pad, n-LENGTH(s)), s) END` | query-rewrite |
@@ -199,10 +200,11 @@ These require translation but ARE supported:
| `weekofyear(d)` | `EXTRACT(WEEK FROM d)` | query-rewrite |
| `add_months(d, n)` | `d + INTERVAL 'n' MONTH` | |
| `last_day(d)` | `DATE_TRUNC(d, MONTH) + INTERVAL '1' MONTH - INTERVAL '1' DAY` | |
| `MAKE_DATE(y, m, d)` | `PARSE_DATE('%Y-%m-%d', CONCAT(CAST(y AS VARCHAR), '-', RIGHT(CONCAT('0', CAST(m AS VARCHAR)), 2), '-', RIGHT(CONCAT('0', CAST(d AS VARCHAR)), 2)))` | Zero-pads month/day; years < 1000 may produce wrong results |
| `unix_timestamp(ts)` | `EXTRACT(EPOCH FROM ts)` | |
| `unix_millis(ts)` | `CAST(EXTRACT(EPOCH FROM ts) * 1000 AS BIGINT)` | |
| `unix_micros(ts)` | `CAST(EXTRACT(EPOCH FROM ts) * 1000000 AS BIGINT)` | |
| `from_unixtime(n)` | `TIMESTAMPADD(SECOND, n, DATE '1970-01-01')` | Returns TIMESTAMP; Spark returns STRING in yyyy-MM-dd HH:mm:ss — Feldera has no FORMAT_TIMESTAMP, use CONCAT with YEAR/MONTH/DAY/HOUR/MINUTE/SECOND to match |
| `from_unixtime(n)` | `MAKE_TIMESTAMP(n)` | Define UDF first: `CREATE FUNCTION MAKE_TIMESTAMP(SECONDS BIGINT) RETURNS TIMESTAMP AS TIMESTAMPADD(SECOND, SECONDS, DATE '1970-01-01')` |
| `to_timestamp(n)` (numeric) | `TIMESTAMPADD(SECOND, n, DATE '1970-01-01')` | |
| `to_timestamp(s[, fmt])` | `PARSE_TIMESTAMP(fmt, s)` | Argument order reversed; default fmt is `%Y-%m-%d %H:%M:%S`; translate Java fmt to strftime |
| `map_entries(m)` | `CROSS JOIN UNNEST(m) AS t(k, v)` | Flatten map to rows |
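
As a worked illustration of the date rows above (the `orders` table and its columns are hypothetical):

```sql
-- Spark: datediff/months_between take (end, start)
SELECT datediff(ship_date, order_date)       AS days_to_ship,
       months_between(ship_date, order_date) AS months_to_ship
FROM orders;

-- Feldera: DATEDIFF takes (unit, start, end) — note the extra unit
-- argument and the reversed start/end order. Unlike Spark's
-- months_between, DATEDIFF(MONTH, ...) returns integer months.
SELECT DATEDIFF(DAY, order_date, ship_date)   AS days_to_ship,
       DATEDIFF(MONTH, order_date, ship_date) AS months_to_ship
FROM orders;
```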
@@ -225,8 +227,6 @@ Do NOT attempt to translate these. Return as unsupported immediately.
| `parse_url` | URL parsing |
| `SHA`, `SHA2`, `SHA256` | Hashing |
| `next_day` | Date |
| `MAKE_DATE` | Date constructor |
| `months_between` | Date diff |
| `sequence()` for date ranges | Date generation |
| `CORR` | Statistical aggregate |
| `approx_count_distinct`, `APPROX_DISTINCT` | Approximate aggregate |
125 changes: 11 additions & 114 deletions python/felderize/spark/data/skills/query-rewrite/SKILL.md
@@ -1,6 +1,6 @@
---
name: query-rewrite
description: Rewrites Spark SQL query patterns that have no direct Feldera equivalent into semantically equivalent Feldera SQL. Covers PIVOT, GROUPING SETS, ROLLUP, CUBE, and get_json_object.
description: Rewrites Spark SQL query patterns that have no direct Feldera equivalent into semantically equivalent Feldera SQL. Covers named_struct, nvl2, pmod, LPAD/RPAD, LEFT ANTI JOIN, and weekofyear.
---

# Query Rewrite
@@ -9,113 +9,6 @@ description: Rewrites Spark SQL query patterns that have no direct Feldera equiv

Use this skill when Spark SQL contains query constructs that Feldera does not support directly but can be rewritten to equivalent SQL.

## PIVOT

Feldera does not support `PIVOT` syntax. Rewrite to conditional aggregation with `CASE WHEN`.

### Pattern

Spark:
```sql
SELECT * FROM source
PIVOT (
agg_func(value_col)
FOR pivot_col IN ('val1', 'val2', 'val3')
)
```

Feldera:
```sql
SELECT
<non-pivot columns>,
agg_func(CASE WHEN pivot_col = 'val1' THEN value_col END) AS val1,
agg_func(CASE WHEN pivot_col = 'val2' THEN value_col END) AS val2,
agg_func(CASE WHEN pivot_col = 'val3' THEN value_col END) AS val3
FROM source
GROUP BY <non-pivot columns>
```

### Rules
- Preserve the aggregation function (COUNT, SUM, AVG, etc.).
- Each PIVOT value becomes a separate `CASE WHEN` expression.
- The non-pivot, non-value columns become the GROUP BY columns.
- Column aliases come from the IN list values.
- Double-check GROUP BY spelling — do not introduce typos.

### Example

Input:
```sql
SELECT * FROM (
SELECT team_name, status, ticket_id
FROM support_tickets
) src
PIVOT (
COUNT(ticket_id)
FOR status IN ('OPEN', 'IN_PROGRESS', 'CLOSED')
)
```

Output:
```sql
CREATE VIEW result AS
SELECT
team_name,
COUNT(CASE WHEN status = 'OPEN' THEN ticket_id END) AS OPEN,
COUNT(CASE WHEN status = 'IN_PROGRESS' THEN ticket_id END) AS IN_PROGRESS,
COUNT(CASE WHEN status = 'CLOSED' THEN ticket_id END) AS CLOSED
FROM support_tickets
GROUP BY team_name
```

## GROUPING SETS

Feldera does not support `GROUPING SETS`. Rewrite to `UNION ALL` of separate `GROUP BY` queries.

### Pattern

Spark:
```sql
SELECT a, b, SUM(x) AS total
FROM t
GROUP BY GROUPING SETS ((a, b), (a), ())
```

Feldera:
```sql
SELECT a, b, SUM(x) AS total FROM t GROUP BY a, b
UNION ALL
SELECT a, CAST(NULL AS <type_of_b>) AS b, SUM(x) AS total FROM t GROUP BY a
UNION ALL
SELECT CAST(NULL AS <type_of_a>) AS a, CAST(NULL AS <type_of_b>) AS b, SUM(x) AS total FROM t
```

### Rules
- Each grouping set becomes a separate SELECT with its own GROUP BY.
- Columns not in a grouping set must be NULL with proper CAST to match types.
- All branches must have identical column names, types, and order.
- Use CAST(NULL AS type) rather than bare NULL to avoid type mismatches in UNION ALL.
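
A concrete instance of the pattern, assuming a hypothetical table `sales(region VARCHAR, product VARCHAR, amount DECIMAL(10,2))`:

```sql
-- Spark
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY GROUPING SETS ((region, product), (region), ());

-- Feldera: one SELECT per grouping set, typed NULLs for absent columns
SELECT region, product, SUM(amount) AS total
FROM sales GROUP BY region, product
UNION ALL
SELECT region, CAST(NULL AS VARCHAR) AS product, SUM(amount) AS total
FROM sales GROUP BY region
UNION ALL
SELECT CAST(NULL AS VARCHAR) AS region, CAST(NULL AS VARCHAR) AS product, SUM(amount) AS total
FROM sales;
```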

## ROLLUP

Feldera does not support `ROLLUP`. Expand to equivalent UNION ALL.

`GROUP BY ROLLUP(a, b, c)` is equivalent to:
```
GROUPING SETS ((a, b, c), (a, b), (a), ())
```

Then apply the GROUPING SETS rewrite above.

## CUBE

`GROUP BY CUBE(a, b)` is equivalent to:
```
GROUPING SETS ((a, b), (a), (b), ())
```

Then apply the GROUPING SETS → UNION ALL rewrite above.

Note: the `grouping_id()` function is NOT available in Feldera. If the query uses `grouping_id()`, mark the query as unsupported.
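
For example, a two-column CUBE expands as follows (the `sales` table is hypothetical):

```sql
-- Spark
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY CUBE (region, product);

-- Equivalent to GROUPING SETS ((region, product), (region), (product), ()),
-- so the Feldera rewrite is a four-branch UNION ALL:
SELECT region, product, SUM(amount) AS total
FROM sales GROUP BY region, product
UNION ALL
SELECT region, CAST(NULL AS VARCHAR) AS product, SUM(amount) AS total
FROM sales GROUP BY region
UNION ALL
SELECT CAST(NULL AS VARCHAR) AS region, product, SUM(amount) AS total
FROM sales GROUP BY product
UNION ALL
SELECT CAST(NULL AS VARCHAR) AS region, CAST(NULL AS VARCHAR) AS product, SUM(amount) AS total
FROM sales;
```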

## named_struct
@@ -124,7 +17,7 @@ Spark's `named_struct('field1', val1, 'field2', val2, ...)` creates a struct wit

### Feldera equivalent

Use `ROW(val1, val2, ...)` constructor.
Use `ROW(val1, val2, ...)` constructor, or `CAST(ROW(val1, val2) AS ROW(field1 T1, field2 T2))` to preserve field names, or a user-defined type.

```sql
-- Spark
@@ -138,7 +31,7 @@ ROW(left_id, right_id)
- Drop the field name strings — Feldera ROW constructors are positional.
- Preserve the value expressions in order.
- `ROW(x, y, z)` creates an anonymous struct.
- Field access on ROW values uses dot notation: `row_val.field_name`.
- Field access on named ROW values uses dot notation: `row_val.field_name`. For anonymous structs (no field names), use 1-based index access: `row_val[1]`.

### Example

@@ -148,12 +41,18 @@ SELECT source, COUNT(DISTINCT named_struct('l', left_id, 'r', right_id)) AS uniq
FROM pair_events GROUP BY source
```

Output:
Output (anonymous ROW):
```sql
SELECT source, COUNT(DISTINCT ROW(left_id, right_id)) AS unique_pair_count
FROM pair_events GROUP BY source
```

Output (named fields via CAST):
```sql
SELECT source, COUNT(DISTINCT CAST(ROW(left_id, right_id) AS ROW(l BIGINT, r BIGINT))) AS unique_pair_count
FROM pair_events GROUP BY source
```

## nvl2

Spark's `nvl2(expr, val_if_not_null, val_if_null)` returns `val_if_not_null` when `expr` is not NULL, otherwise `val_if_null`.
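
A minimal sketch of the rewrite as a standard `CASE WHEN` (column names are hypothetical):

```sql
-- Spark
SELECT nvl2(discount_code, price * 0.9, price) AS final_price
FROM orders;

-- Feldera: nvl2(e, a, b) becomes CASE WHEN e IS NOT NULL THEN a ELSE b END
SELECT CASE WHEN discount_code IS NOT NULL THEN price * 0.9 ELSE price END AS final_price
FROM orders;
```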
@@ -184,9 +83,7 @@ This ensures the result is always non-negative. Do NOT mark as unsupported.

## from_unixtime / to_timestamp from epoch

Spark's `from_unixtime(unix_seconds)` and `to_timestamp(unix_seconds)` convert unix epoch seconds to timestamp.

Mark as unsupported — Feldera does not support `to_timestamp(<NUMERIC>)` or `from_unixtime`.
See function-reference skill for rewrites using `TIMESTAMPADD`.
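
A minimal sketch of the epoch-seconds rewrite (the `events` table and `event_epoch` column are hypothetical):

```sql
-- Spark
SELECT to_timestamp(event_epoch) AS event_ts FROM events;

-- Feldera: add the epoch seconds to the Unix epoch origin
SELECT TIMESTAMPADD(SECOND, event_epoch, DATE '1970-01-01') AS event_ts FROM events;
```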

## LPAD / RPAD

8 changes: 4 additions & 4 deletions python/felderize/spark/data/skills/type-converter/SKILL.md
@@ -1,6 +1,6 @@
---
name: type-converter
description: Converts Spark SQL DDL to Feldera-compatible DDL. Covers type mappings (STRING→VARCHAR), DDL clause removal (USING parquet, TEMP VIEW), and schema-level rewrites.
description: Converts Spark SQL DDL to Feldera-compatible DDL. Covers type mappings, DDL clause removal (USING parquet, TEMP VIEW), and schema-level rewrites.
---

# Type Converter
@@ -9,13 +9,13 @@ description: Converts Spark SQL DDL to Feldera-compatible DDL. Covers type mappi

| Spark | Feldera | Notes |
|-------|---------|-------|
| `STRING` | `VARCHAR` | |
| `TEXT` | `VARCHAR` | |
| `STRING` | `STRING` or `VARCHAR` | Both supported natively |
| `TEXT` | `TEXT` or `VARCHAR` | Both supported natively |
| `INT` / `INTEGER` | `INT` | Same |
| `BIGINT` | `BIGINT` | Same |
| `BOOLEAN` | `BOOLEAN` | Same |
| `DECIMAL(p,s)` | `DECIMAL(p,s)` | Same |
| `FLOAT` | `FLOAT` | Same |
| `FLOAT` | `REAL` | Feldera uses REAL instead of FLOAT |
| `DOUBLE` | `DOUBLE` | Same |
| `DATE` | `DATE` | Same |
| `TIMESTAMP` | `TIMESTAMP` | Same |
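
Putting the mappings together on a hypothetical table (the `USING parquet` clause is dropped per the DDL clause-removal rules):

```sql
-- Spark
CREATE TABLE users (
    id BIGINT,
    name STRING,
    score FLOAT,
    created_at TIMESTAMP
) USING parquet;

-- Feldera: STRING may be kept as-is or mapped to VARCHAR; FLOAT becomes REAL
CREATE TABLE users (
    id BIGINT,
    name VARCHAR,
    score REAL,
    created_at TIMESTAMP
);
```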