fix(sqlite): translate antlr rune offsets to byte offsets#4492

Open
abgoyal wants to merge 1 commit into sqlc-dev:main from abgoyal:fix/non-ascii-sql-output

abgoyal commented Jun 22, 2026

Summary

Fixes a bug where SQLite code generation produced truncated or corrupt
output whenever the source .sql file contained a non-ASCII character
(em-dash, accented letters, etc.).

The bug

Given a query with a multi-byte character in a comment (here the é in
"café", which is 2 bytes in UTF-8):

-- name: GetItem :one
-- Lookup used by the café menu screen
SELECT id, name FROM items WHERE id = ?;

sqlc generated a truncated SQL constant, dropping the trailing ?:

const getItem = `-- name: GetItem :one
SELECT id, name FROM items WHERE id =
`

In queries that use sqlc.arg() or SELECT *, the misaligned offsets
caused edits to land at the wrong byte positions and produced SQL that
fails to parse (e.g. SEid ...). There was no error at generate time —
the corruption was silent.
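
To make the failure mode concrete, here is a minimal Go sketch of an edit applied at a rune index where a byte offset is expected. The helpers `replaceAt` and `rewriteArg` are hypothetical stand-ins written for this illustration; they are not sqlc's `source.Mutate`, and the query is made up.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// replaceAt overwrites n bytes of src, starting at byte offset off, with repl.
// Hypothetical helper: it only mimics the idea of a byte-offset edit.
func replaceAt(src string, off, n int, repl string) string {
	return src[:off] + repl + src[off+n:]
}

// rewriteArg replaces the first occurrence of target with "?".
// With useRuneIndex set, it deliberately reproduces the bug by passing
// a rune index to code that expects a byte offset.
func rewriteArg(src, target string, useRuneIndex bool) string {
	off := strings.Index(src, target)
	if off < 0 {
		return src
	}
	if useRuneIndex {
		off = utf8.RuneCountInString(src[:off])
	}
	return replaceAt(src, off, len(target), "?")
}

func main() {
	// \u00e9 is the 2-byte letter at the end of "cafe" with an acute accent.
	src := "SELECT 'caf\u00e9' AS label FROM items WHERE id = sqlc.arg(id)"

	fmt.Println(rewriteArg(src, "sqlc.arg(id)", false)) // edit lands on the call
	fmt.Println(rewriteArg(src, "sqlc.arg(id)", true))  // edit lands one byte early
}
```

With a single 2-byte character before the edit, the rune-indexed variant starts one byte too early, swallowing the space before `sqlc.arg` and leaving a stray `)` behind. Nothing panics, which matches the silent corruption described above.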

Root cause

antlr's InputStream stores the input as a []rune and reports all
token positions as rune indices. The rest of sqlc, shared with the
Postgres and MySQL engines, treats query offsets as byte offsets
into the original source:

  • source.Pluck slices the raw SQL for each statement
  • source.Mutate applies parameter / star-expansion edits

For ASCII, rune index == byte offset, so the bug never surfaced. A
multi-byte rune (em-dash = 3 bytes, 1 rune) makes every later rune
index lag its true byte offset, truncating the statement slice and
shifting every edit. Postgres and MySQL are unaffected because they get
byte offsets from libpg_query and TiDB.
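
The gap between the two index spaces is easy to measure in plain Go. The sketch below uses only the standard library; `byteLag` is a hypothetical name chosen for this example, and the queries are illustrative.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLag reports how far a rune index taken at the end of src falls
// short of the corresponding byte offset.
func byteLag(src string) int {
	return len(src) - utf8.RuneCountInString(src)
}

func main() {
	ascii := "SELECT id FROM items;"
	accented := "-- caf\u00e9\nSELECT id FROM items;" // one 2-byte rune
	dash := "-- a \u2014 b\nSELECT id FROM items;"    // one 3-byte rune (U+2014)

	fmt.Println(byteLag(ascii), byteLag(accented), byteLag(dash)) // 0 1 2

	// Treating the rune count as a byte length cuts the slice short:
	// the final ';' is missing from the printed statement.
	n := utf8.RuneCountInString(accented)
	fmt.Printf("%q\n", accented[:n])
}
```

The lag accumulates: each multi-byte rune adds its extra bytes to the error for every position after it, which is why a single character early in a file affects every later statement.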

The fix

Confined to the SQLite engine. We build a rune-index → byte-offset table
for the source and translate offsets at the point they are produced, so
a rune-based offset never escapes the parser:

  • parse.go converts the statement's StmtLocation / StmtLen and
    threads the table into the converter (cc.convertPos).
  • convert.go routes every node Location through a single cc.pos
    helper instead of using the raw antlr token offset.

By the time the AST leaves the parser, all offsets are byte offsets,
matching the invariant the other engines already satisfy. The
translation lives where the offsets originate, so there is no separate
post-processing pass and no reflection.
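
As a rough illustration of the translation table, the following self-contained sketch shows one way to build and use it. The names `buildRuneToByte`, `toByteOffset`, and `byteSpan` are assumptions made for this example and are not necessarily what the PR's parse.go or `cc.pos` look like.

```go
package main

import "fmt"

// buildRuneToByte returns a table t where t[i] is the byte offset of the
// i-th rune of src. A final entry equal to len(src) is appended so that
// an exclusive end index can be translated as well.
func buildRuneToByte(src string) []int {
	table := make([]int, 0, len(src)+1)
	for byteOff := range src { // ranging over a string yields each rune's byte offset
		table = append(table, byteOff)
	}
	return append(table, len(src))
}

// toByteOffset translates a rune index, clamping out-of-range input.
func toByteOffset(table []int, runeIdx int) int {
	if runeIdx < 0 {
		return 0
	}
	if runeIdx >= len(table) {
		return table[len(table)-1]
	}
	return table[runeIdx]
}

// byteSpan converts a (start, length) pair measured in runes into the
// equivalent (start, length) pair measured in bytes.
func byteSpan(table []int, runeStart, runeLen int) (start, length int) {
	start = toByteOffset(table, runeStart)
	end := toByteOffset(table, runeStart+runeLen)
	return start, end - start
}

func main() {
	src := "-- caf\u00e9 \u2014 menu\nSELECT id FROM items WHERE id = ?;"
	table := buildRuneToByte(src)

	// Locate the placeholder the way a []rune-based lexer would see it.
	runeIdx := -1
	for i, r := range []rune(src) {
		if r == '?' {
			runeIdx = i
			break
		}
	}

	b := toByteOffset(table, runeIdx)
	fmt.Printf("rune %d -> byte %d -> %q\n", runeIdx, b, src[b:b+1])
}
```

Building the table is a single pass over the source, and each lookup is a slice index, so translating every node location stays cheap. Converting a length via its two endpoints, as `byteSpan` does, avoids assuming anything about the width of the runes inside the span.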

Testing

  • New regression test internal/engine/sqlite/parse_test.go (fails
    before the fix, passes after).
  • Verified sqlc generate end-to-end on the query above and on a harder
    case (multi-byte string literal before sqlc.arg() params +
    SELECT *); both are correct after the fix, and the latter produced
    corrupt SQL before it.

antlr's InputStream stores the source as a []rune and reports every
token position as a rune index. The rest of sqlc treats query offsets
as byte offsets into the original source string (source.Pluck slices
the raw SQL, source.Mutate applies parameter and star-expansion edits).
For ASCII input the two coincide, so the mismatch was invisible. Any
multi-byte rune in the source (e.g. an em-dash in a comment or string
literal) makes every later rune index fall short of its true byte
offset. That truncated the generated SQL constant and, in queries using
sqlc.arg() or `*`, applied edits at the wrong positions and produced
corrupt SQL.

Translate antlr's rune indices to byte offsets in the SQLite parser, so
that no rune-based offset ever escapes into the compiler: build a
rune-index -> byte-offset table for the source and route the statement's
StmtLocation/StmtLen and every AST node's Location through it (via a
single cc.pos helper). Postgres and MySQL are unaffected since they
receive byte offsets from libpg_query and TiDB.

Add a regression test covering a query with a multi-byte rune in a
comment.