
Commit 4faca22

wilmaontherun and claude committed:
Fix skills: STDDEV rewrite rule, GBD-REGEX-ESCAPE for RLIKE, CAST(numeric AS TIMESTAMP) semantics; update README model and ANTHROPIC_BASE_URL

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1 parent 2b98d4e commit 4faca22

2 files changed: +14 −8 lines changed


python/felderize/README.md

Lines changed: 2 additions & 1 deletion
@@ -18,7 +18,7 @@ Create a `.env` file:
@@ -18,7 +18,7 @@ Create a `.env` file:
 ```bash
 ANTHROPIC_API_KEY=your-key-here
 FELDERA_COMPILER=/path/to/sql-to-dbsp # in Feldera repo: ../../sql-to-dbsp-compiler/SQL-compiler/sql-to-dbsp
-FELDERIZE_MODEL=claude-sonnet-4-5
+FELDERIZE_MODEL=claude-sonnet-4-6
 ```

 The `FELDERA_COMPILER` path is required for validation. Without it, translation still works but output SQL is not verified. You can also pass it per-command with `--compiler PATH`.
@@ -118,6 +118,7 @@ Environment variables (set in `.env`):
 | `ANTHROPIC_API_KEY` | Anthropic API key | (required) |
 | `FELDERIZE_MODEL` | LLM model to use (can also be set with `--model`) | (required, set in `.env`) |
 | `FELDERA_COMPILER` | Path to sql-to-dbsp compiler (can also be set with `--compiler`) | (required for validation) |
+| `ANTHROPIC_BASE_URL` | Override Anthropic API base URL (for proxies or alternate endpoints) | (optional) |

 ## How it works

python/felderize/spark/skills/spark_skills.md

Lines changed: 12 additions & 7 deletions
@@ -186,8 +186,8 @@ These Spark functions exist in Feldera — translate directly:
 | Spark | Feldera | Notes |
 |-------|---------|-------|
 | `AVG(col)` | `AVG(CAST(col AS DOUBLE))` if col is integer type; `AVG(col)` otherwise | Integer input: rewrite to return DOUBLE matching Spark. Decimal/float: leave as-is → [GBD-AGG-TYPE] scale mismatch |
-| `STDDEV_SAMP(col)` | Same | [GBD-AGG-TYPE]: decimal input preserves scale, not widened to DOUBLE |
-| `STDDEV_POP(col)` | Same | [GBD-AGG-TYPE]: decimal input preserves scale, not widened to DOUBLE |
+| `STDDEV_SAMP(col)` | `STDDEV_SAMP(CAST(col AS DOUBLE))` if col is non-DOUBLE; `STDDEV_SAMP(col)` otherwise | [GBD-AGG-TYPE]: Spark always returns DOUBLE; Feldera preserves input type. Rewrite for INT/BIGINT/DECIMAL inputs. |
+| `STDDEV_POP(col)` | `STDDEV_POP(CAST(col AS DOUBLE))` if col is non-DOUBLE; `STDDEV_POP(col)` otherwise | [GBD-AGG-TYPE]: Spark always returns DOUBLE; Feldera preserves input type. Rewrite for INT/BIGINT/DECIMAL inputs. |
 | `every(col)` | Same | Alias for `bool_and` — supported as aggregate; as window function → [GBD-BOOL-WINDOW] |
 | `some(col)` | Same | Supported as aggregate only; as window function → [GBD-BOOL-WINDOW] |
 | `bit_and(col)` — aggregate | `BIT_AND(col)` | Aggregate: bitwise AND over all rows in group |
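The STDDEV rewrite in the hunk above exists because the sample standard deviation of integer data is almost never integral, so a result typed like the INT/DECIMAL input loses the fractional part that Spark's DOUBLE result keeps. An illustrative Python sketch (editor's addition using the `statistics` module as a stand-in, not engine code):

```python
import statistics

# Integer inputs whose sample standard deviation is non-integral.
vals = [2, 4, 4, 4, 5, 5, 7, 9]

sd_samp = statistics.stdev(vals)   # sample stddev = sqrt(32/7) ~ 2.138
sd_pop = statistics.pstdev(vals)   # population stddev = sqrt(32/8) = 2.0

# Spark's STDDEV_SAMP returns DOUBLE and keeps the ~2.138; a result
# typed like the INT input would collapse it toward 2 — hence the
# CAST(col AS DOUBLE) rewrite before aggregating.
print(sd_samp, sd_pop)
```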
@@ -211,10 +211,14 @@ These Spark functions exist in Feldera — translate directly:
 | Spark | Feldera | Notes |
 |-------|---------|-------|
 | `TRIM(s)` | Same | [GBD-WHITESPACE] |
-| `RLIKE(s, pattern)` | Same | Infix `s RLIKE pattern` also works |
+| `RLIKE(s, pattern)` | Same | Infix `s RLIKE pattern` also works [GBD-REGEX-ESCAPE] |
 | `OCTET_LENGTH(s)` | Same | Returns byte length of string |
 | `overlay(str placing repl from pos for len)` | `OVERLAY(str PLACING repl FROM pos FOR len)` | Same syntax |

+#### ⚠️ Behavioral differences (Spark vs Feldera)
+
+- **[GBD-REGEX-ESCAPE]** `RLIKE` / `REGEXP_REPLACE` regex pattern escaping: Spark SQL string literals apply Java-style `\\` backslash escaping, so `'\\.'` passes the two-character regex `\.` (escaped dot) to the engine. Feldera SQL follows the SQL standard and does **not** interpret `\\` as an escape, so the regex engine receives the three-character pattern `\\.` — a literal backslash followed by any character — which matches differently. Use POSIX character classes instead: `[.]` for literal dot, `[+]` for literal plus, `[*]` for literal star, etc.
+
 #### 📝 Notes

 - `UPPER`, `LOWER`, `LENGTH`, `SUBSTRING`, `CONCAT`, `CONCAT_WS`, `REPLACE`, `REGEXP_REPLACE`, `INITCAP`, `REVERSE`, `REPEAT`, `LEFT`, `RIGHT`, `MD5`, `ASCII`, `CHR` work identically — no translation needed
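The [GBD-REGEX-ESCAPE] difference can be reproduced outside SQL. A small Python sketch (editor's addition, using Python's `re` module as a stand-in regex engine) of what each dialect effectively hands to the regex engine for the literal `'\\.'`:

```python
import re

pattern_source = r"\\."  # the SQL literal '\\.': backslash, backslash, dot

# Spark: Java-style escaping collapses \\ to \, so the engine sees \.
spark_regex = pattern_source.replace("\\\\", "\\")  # -> \.

# Feldera: SQL-standard literals pass the characters through unchanged.
feldera_regex = pattern_source                      # -> \\.

# \. matches a literal dot; \\. matches a backslash plus any character.
assert re.fullmatch(spark_regex, ".")
assert not re.fullmatch(feldera_regex, ".")
assert re.fullmatch(feldera_regex, "\\x")

# The portable fix: a character class reads the same in both dialects.
assert re.fullmatch("[.]", ".")
```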
@@ -232,7 +236,7 @@ These Spark functions exist in Feldera — translate directly:
 | `arrays_overlap(a, b)` | `ARRAYS_OVERLAP(a, b)` | |
 | `array_repeat(val, n)` | `ARRAY_REPEAT(val, n)` | |
 | `array_union(a, b)` | `ARRAY_UNION(a, b)` | [GBD-ARRAY-ORDER]: Feldera returns elements in sorted order; Spark preserves input order. |
-| `array_intersect(a, b)` | `ARRAY_INTERSECT(a, b)` | |
+| `array_intersect(a, b)` | `ARRAY_INTERSECT(a, b)` | [GBD-ARRAY-ORDER]: Feldera returns elements in sorted order; Spark preserves input order. |
 | `array_except(a, b)` | `ARRAY_EXCEPT(a, b)` | [GBD-ARRAY-ORDER]: Feldera returns elements in sorted order; Spark preserves input order. |
 | `array_join(arr, sep)` | `ARRAY_JOIN(arr, sep)` | Alias for ARRAY_TO_STRING |
 | `size(arr)` | `COALESCE(CARDINALITY(arr), -1)` | [GBD-SIZE-NULL] |
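The [GBD-ARRAY-ORDER] difference tagged above is easy to picture with a stand-in. A short Python sketch (editor's illustration, not engine code) of the two orderings for `array_intersect`:

```python
# Example arrays; values chosen so input order differs from sorted order.
a = [5, 1, 3, 1]
b = [3, 5, 9]

# Spark-style array_intersect: distinct elements, first-occurrence order from a.
spark_style = []
for x in a:
    if x in b and x not in spark_style:
        spark_style.append(x)

# Feldera-style: same distinct elements, returned in sorted order.
feldera_style = sorted(set(a) & set(b))

print(spark_style)    # [5, 3]
print(feldera_style)  # [3, 5]
```

Same multiset of results, different order — only queries that observe element order are affected.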
@@ -359,6 +363,8 @@ GROUP BY CAST(v['user_id'] AS VARCHAR);
 | `EXP(x)` | Same | Input/output: DOUBLE |
 | `SIGN(x)` | Same | Returns -1, 0, or 1 |
 | `sec(x)` | `SEC(x)` | |
+| `csc(x)` | `CSC(x)` | |
+| `cot(x)` | `COT(x)` | |

 #### ⚠️ Behavioral differences (Spark vs Feldera)

@@ -369,8 +375,6 @@ GROUP BY CAST(v['user_id'] AS VARCHAR);
 #### 📝 Notes

 - `ABS`, `POWER`, `SQRT` work identically — no translation needed
-| `csc(x)` | `CSC(x)` | |
-| `cot(x)` | `COT(x)` | |

 #### Null handling

@@ -550,7 +554,7 @@ GROUP BY region
 | `TRY_CAST(expr AS type)` | `SAFE_CAST(expr AS type)` | [GBD-SAFE-CAST] |
 | `CAST(string AS DATE)` | `CAST(string AS DATE)` | Pass through unchanged — Spark returns NULL for invalid inputs; Feldera may panic at runtime. Use `SAFE_CAST` if NULL-on-failure is required. |
 | `CAST(string AS TIMESTAMP)` | `CAST(string AS TIMESTAMP)` | Pass through unchanged — same rules as CAST to DATE. Use `SAFE_CAST` if NULL-on-failure is required. |
-| `CAST(numeric AS TIMESTAMP)` | `CAST(numeric AS TIMESTAMP)` | Pass through unchanged — Feldera compiler accepts this syntax including edge cases like `CAST(CAST('inf' AS DOUBLE) AS TIMESTAMP)`. Do not mark as unsupported based on runtime semantics — if it compiles, translate it. |
+| `CAST(numeric AS TIMESTAMP)` | `CAST(numeric AS TIMESTAMP)` | ⚠️ **Semantics differ**: Spark interprets the numeric as **seconds** since epoch; Feldera interprets it as **microseconds** since epoch. Results differ by a factor of 1,000,000. Mark unsupported if the numeric value represents seconds (the common Spark case). Only translate if the source is already in microseconds. |
 | `CAST('<value>' AS INTERVAL <unit>)` | `INTERVAL '<value>' <unit>` | For constant strings: drop the CAST, use interval literal directly (`INTERVAL '3' DAY`, `INTERVAL '3-1' YEAR TO MONTH`). For string expressions: `CAST(col AS INTERVAL DAY)` or `INTERVAL col DAY` both work in Feldera when used inside arithmetic (e.g. `d + CAST(col AS INTERVAL DAY)`). |
 | `CAST(INTERVAL '...' <unit> AS <numeric>)` | Same | Pass through unchanged. Single time units (SECOND, MINUTE, HOUR, DAY, MONTH, YEAR) to any numeric type are supported. Compound intervals (`YEAR TO MONTH`, `DAY TO SECOND`) to numeric are unsupported. |

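The seconds-vs-microseconds gap for `CAST(numeric AS TIMESTAMP)` is a factor of 10⁶. A quick Python sketch (editor's illustration using `datetime` as a stand-in for the two engines):

```python
from datetime import datetime, timezone

n = 1_700_000_000  # e.g. a BIGINT column value

# Spark: CAST(n AS TIMESTAMP) reads n as SECONDS since the Unix epoch.
spark_ts = datetime.fromtimestamp(n, tz=timezone.utc)

# Feldera: the same n is read as MICROSECONDS since the epoch.
feldera_ts = datetime.fromtimestamp(n / 1_000_000, tz=timezone.utc)

print(spark_ts.isoformat())    # 2023-11-14T22:13:20+00:00
print(feldera_ts.isoformat())  # 1970-01-01T00:28:20+00:00
```

The two interpretations land half a century apart, which is why a blind pass-through translation silently corrupts results.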
@@ -941,6 +945,7 @@ Rationale: a partial translation produces incorrect results, which is worse than
 | `SELECT CAST(... AS INTERVAL ...)` as final output | `INTERVAL` cannot be a view column output type — mark unsupported even if the literal rewrite applies. Do NOT use `CREATE LOCAL VIEW` as a workaround — it changes semantics. |
 | `CAST(INTERVAL 'x-y' YEAR TO MONTH AS numeric)` | Compound interval (YEAR TO MONTH, DAY TO SECOND, HOUR TO SECOND) to numeric — not supported |
 | `CAST(str AS BOOLEAN)` where `str` contains `\t`, `\n`, `\r` whitespace | [GBD-WHITESPACE] — Feldera only strips spaces before parsing, so `'\t\t true \n\r '` returns `False`. Mark unsupported if input may contain non-space whitespace. |
+| `CAST(numeric AS TIMESTAMP)` where numeric is **seconds** since epoch | Spark interprets as **seconds**; Feldera interprets as **microseconds** — off by factor of 1,000,000. Mark unsupported when the numeric value represents seconds (the typical Spark use case). |
 | `CAST(numeric AS INTERVAL ...)` | Numeric-to-interval — not supported |
 | `CAST(TIME '...' AS numeric/decimal)` | TIME to numeric or decimal — not supported |