### `python/felderize/README.md` — 2 additions, 1 deletion
````diff
@@ -18,7 +18,7 @@ Create a `.env` file:
 ```bash
 ANTHROPIC_API_KEY=your-key-here
 FELDERA_COMPILER=/path/to/sql-to-dbsp # in Feldera repo: ../../sql-to-dbsp-compiler/SQL-compiler/sql-to-dbsp
-FELDERIZE_MODEL=claude-sonnet-4-5
+FELDERIZE_MODEL=claude-sonnet-4-6
 ```
 
 The `FELDERA_COMPILER` path is required for validation. Without it, translation still works but output SQL is not verified. You can also pass it per-command with `--compiler PATH`.
````
```diff
@@ -118,6 +118,7 @@ Environment variables (set in `.env`):
 |`ANTHROPIC_API_KEY`| Anthropic API key | (required) |
 |`FELDERIZE_MODEL`| LLM model to use (can also be set with `--model`) | (required, set in `.env`) |
 |`FELDERA_COMPILER`| Path to sql-to-dbsp compiler (can also be set with `--compiler`) | (required for validation) |
+|`ANTHROPIC_BASE_URL`| Override Anthropic API base URL (for proxies or alternate endpoints) | (optional) |
```
### `python/felderize/spark/skills/spark_skills.md` — 12 additions, 7 deletions
```diff
@@ -186,8 +186,8 @@ These Spark functions exist in Feldera — translate directly:
 | Spark | Feldera | Notes |
 |-------|---------|-------|
 |`AVG(col)`|`AVG(CAST(col AS DOUBLE))` if col is integer type; `AVG(col)` otherwise | Integer input: rewrite to return DOUBLE matching Spark. Decimal/float: leave as-is → [GBD-AGG-TYPE] scale mismatch |
-|`STDDEV_SAMP(col)`| Same | → [GBD-AGG-TYPE]: decimal input preserves scale, not widened to DOUBLE |
-|`STDDEV_POP(col)`| Same | → [GBD-AGG-TYPE]: decimal input preserves scale, not widened to DOUBLE |
+|`STDDEV_SAMP(col)`|`STDDEV_SAMP(CAST(col AS DOUBLE))` if col is non-DOUBLE; `STDDEV_SAMP(col)` otherwise | → [GBD-AGG-TYPE]: Spark always returns DOUBLE; Feldera preserves input type. Rewrite for INT/BIGINT/DECIMAL inputs. |
+|`STDDEV_POP(col)`|`STDDEV_POP(CAST(col AS DOUBLE))` if col is non-DOUBLE; `STDDEV_POP(col)` otherwise | → [GBD-AGG-TYPE]: Spark always returns DOUBLE; Feldera preserves input type. Rewrite for INT/BIGINT/DECIMAL inputs. |
 |`every(col)`| Same | Alias for `bool_and` — supported as aggregate; as window function → [GBD-BOOL-WINDOW] |
 |`some(col)`| Same | Supported as aggregate only; as window function → [GBD-BOOL-WINDOW] |
 |`bit_and(col)` — aggregate |`BIT_AND(col)`| Aggregate: bitwise AND over all rows in group |
```
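The widening rewrite in the `AVG` row exists because Spark's `AVG` over integers returns DOUBLE. A small Python sketch of the mismatch (the truncating "unwidened" result is an assumption, illustrating what the table implies happens without the `CAST`):

```python
from statistics import mean

rows = [1, 2, 2]

# Spark: AVG over INT returns DOUBLE — this is what AVG(CAST(col AS DOUBLE))
# reproduces on the Feldera side.
spark_avg = mean(rows)               # 1.666...

# Without the CAST, an integer-typed average keeps the input type and
# drops the fractional part (assumption, for illustration).
unwidened_avg = sum(rows) // len(rows)

print(spark_avg, unwidened_avg)      # 1.6666666666666667 vs 1
```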
```diff
@@ -211,10 +211,14 @@ These Spark functions exist in Feldera — translate directly:
 | Spark | Feldera | Notes |
 |-------|---------|-------|
 |`TRIM(s)`| Same | → [GBD-WHITESPACE] |
-|`RLIKE(s, pattern)`| Same | Infix `s RLIKE pattern` also works |
+|`RLIKE(s, pattern)`| Same | Infix `s RLIKE pattern` also works → [GBD-REGEX-ESCAPE] |
 |`OCTET_LENGTH(s)`| Same | Returns byte length of string |
 |`overlay(str placing repl from pos for len)`|`OVERLAY(str PLACING repl FROM pos FOR len)`| Same syntax |
 
+#### ⚠️ Behavioral differences (Spark vs Feldera)
+
+- **[GBD-REGEX-ESCAPE]** `RLIKE` / `REGEXP_REPLACE` regex pattern escaping: Spark SQL string literals apply Java-style `\\` backslash escaping, so the literal `'\\.'` passes the regex `\.` (escaped dot) to the engine. Feldera SQL follows the SQL standard and does **not** interpret `\\` as an escape, so `'\\.'` reaches the engine as the pattern `\\.` — an escaped backslash followed by any character — which matches differently. Use POSIX character classes instead: `[.]` for literal dot, `[+]` for literal plus, `[*]` for literal star, etc.
```
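The escaping difference can be reproduced with Python's `re` module, which (like both engines) receives whatever pattern string survives the SQL literal layer:

```python
import re

# Spark: Java-style unescaping turns the SQL literal '\\.' into \. before
# it reaches the regex engine — an escaped dot.
spark_pattern = "\\."        # engine sees: \.
assert re.fullmatch(spark_pattern, ".")

# Feldera: no unescaping, so the engine sees \\. — an escaped backslash
# followed by any character. It no longer matches a dot.
feldera_pattern = "\\\\."    # engine sees: \\.
assert re.fullmatch(feldera_pattern, ".") is None
assert re.fullmatch(feldera_pattern, "\\x")  # backslash + any char

# Portable fix: a POSIX character class means "literal dot" in both dialects.
assert re.fullmatch("[.]", ".")
assert re.fullmatch("[.]", "x") is None
```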
```diff
@@ -359,6 +363,8 @@ GROUP BY CAST(v['user_id'] AS VARCHAR);
 |`EXP(x)`| Same | Input/output: DOUBLE |
 |`SIGN(x)`| Same | Returns -1, 0, or 1 |
 |`sec(x)`|`SEC(x)`||
+|`csc(x)`|`CSC(x)`||
+|`cot(x)`|`COT(x)`||
 
 #### ⚠️ Behavioral differences (Spark vs Feldera)
```
```diff
@@ -369,8 +375,6 @@ GROUP BY CAST(v['user_id'] AS VARCHAR);
 #### 📝 Notes
 
 - `ABS`, `POWER`, `SQRT` work identically — no translation needed
-|`csc(x)`|`CSC(x)`||
-|`cot(x)`|`COT(x)`||
 
 #### Null handling
```
```diff
@@ -550,7 +554,7 @@ GROUP BY region
 |`TRY_CAST(expr AS type)`|`SAFE_CAST(expr AS type)`| → [GBD-SAFE-CAST] |
 |`CAST(string AS DATE)`|`CAST(string AS DATE)`| Pass through unchanged — Spark returns NULL for invalid inputs; Feldera may panic at runtime. Use `SAFE_CAST` if NULL-on-failure is required. |
 |`CAST(string AS TIMESTAMP)`|`CAST(string AS TIMESTAMP)`| Pass through unchanged — same rules as CAST to DATE. Use `SAFE_CAST` if NULL-on-failure is required. |
-|`CAST(numeric AS TIMESTAMP)`|`CAST(numeric AS TIMESTAMP)`| Pass through unchanged — Feldera compiler accepts this syntax including edge cases like `CAST(CAST('inf' AS DOUBLE) AS TIMESTAMP)`. Do not mark as unsupported based on runtime semantics — if it compiles, translate it. |
+|`CAST(numeric AS TIMESTAMP)`|`CAST(numeric AS TIMESTAMP)`| ⚠️ **Semantics differ**: Spark interprets the numeric as **seconds** since epoch; Feldera interprets it as **microseconds** since epoch. Results differ by a factor of 1,000,000. Mark unsupported if the numeric value represents seconds (the common Spark case). Only translate if the source is already in microseconds. |
 |`CAST('<value>' AS INTERVAL <unit>)`|`INTERVAL '<value>' <unit>`| For constant strings: drop the CAST, use interval literal directly (`INTERVAL '3' DAY`, `INTERVAL '3-1' YEAR TO MONTH`). For string expressions: `CAST(col AS INTERVAL DAY)` or `INTERVAL col DAY` both work in Feldera when used inside arithmetic (e.g. `d + CAST(col AS INTERVAL DAY)`). |
 |`CAST(INTERVAL '...' <unit> AS <numeric>)`| Same | Pass through unchanged. Single time units (SECOND, MINUTE, HOUR, DAY, MONTH, YEAR) to any numeric type are supported. Compound intervals (`YEAR TO MONTH`, `DAY TO SECOND`) to numeric are unsupported. |
```
```diff
@@ -941,6 +945,7 @@ Rationale: a partial translation produces incorrect results, which is worse than
 |`SELECT CAST(... AS INTERVAL ...)` as final output |`INTERVAL` cannot be a view column output type — mark unsupported even if the literal rewrite applies. Do NOT use `CREATE LOCAL VIEW` as a workaround — it changes semantics. |
 |`CAST(INTERVAL 'x-y' YEAR TO MONTH AS numeric)`| Compound interval (YEAR TO MONTH, DAY TO SECOND, HOUR TO SECOND) to numeric — not supported |
 |`CAST(str AS BOOLEAN)` where `str` contains `\t`, `\n`, `\r` whitespace | → [GBD-WHITESPACE] — Feldera only strips spaces before parsing, so `'\t\t true \n\r '` returns `False`. Mark unsupported if input may contain non-space whitespace. |
+|`CAST(numeric AS TIMESTAMP)` where numeric is **seconds** since epoch | Spark interprets as **seconds**; Feldera interprets as **microseconds** — off by factor of 1,000,000. Mark unsupported when the numeric value represents seconds (the typical Spark use case). |
 |`CAST(numeric AS INTERVAL ...)`| Numeric-to-interval — not supported |
 |`CAST(TIME '...' AS numeric/decimal)`| TIME to numeric or decimal — not supported |
```
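The `[GBD-WHITESPACE]` row on boolean casts comes down to what gets stripped before parsing. Python's `str.strip` can show the two behaviors side by side (strip everything vs. strip spaces only, the Feldera behavior described above):

```python
raw = "\t\t true \n\r "

# Spark strips all leading/trailing whitespace before parsing the boolean.
spark_view = raw.strip()         # 'true' → parses as TRUE

# Feldera strips only spaces, so tabs/newlines survive and the
# cast does not recognize the value as TRUE.
feldera_view = raw.strip(" ")    # '\t\t true \n\r' → not a valid boolean

print(repr(spark_view))          # 'true'
print(repr(feldera_view))        # '\t\t true \n\r'
```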