Skip to content

Emit more specific syntax errors for incompatible string prefixes#19466

Closed
dylwil3 wants to merge 6 commits intoastral-sh:mainfrom
dylwil3:prefix-syntax-errors
Closed

Emit more specific syntax errors for incompatible string prefixes#19466
dylwil3 wants to merge 6 commits intoastral-sh:mainfrom
dylwil3:prefix-syntax-errors

Conversation

@dylwil3
Copy link
Copy Markdown
Collaborator

@dylwil3 dylwil3 commented Jul 21, 2025

Following the CPython changes here:

we modify the lexing of double character string prefixes to emit a more useful syntax error in the case of invalid double character prefixes. Our implementation differs slightly from that of CPython in that our error message retains the casing of the prefixes and we emit a separate error for repeated prefixes (like uu"hello").

Example

def foo():
    x = uf"a more complicated {example}"
    return x

Before:

ruff check --no-cache --isolated ex.py
ex.py:2:11: SyntaxError: Simple statements must be separated by newlines or semicolons
  |
1 | def foo():
2 |     x = uf"a more complicated {example}"
  |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3 |     return x
  |

Found 1 error.

After:

cargo run -p ruff -- check --no-cache --isolated ex.py
ex.py:2:9: SyntaxError: "u" and "f" prefixes are incompatible
  |
1 | def foo():
2 |     x = uf"a more complicated {example}"
  |         ^^
3 |     return x
  |

Found 1 error.

Closes #19342

@dylwil3 dylwil3 added the parser Related to the parser label Jul 21, 2025
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jul 21, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

ℹ️ ecosystem check encountered format errors. (no format changes; 1 project error)

openai/openai-cookbook (error)

warning: Detected debug build without --no-cache.
error: Failed to parse examples/mcp/databricks_mcp_cookbook.ipynb:12:1:8: Simple statements must be separated by newlines or semicolons

Formatter (preview)

ℹ️ ecosystem check encountered format errors. (no format changes; 1 project error)

openai/openai-cookbook (error)

ruff format --preview

warning: Detected debug build without --no-cache.
error: Failed to parse examples/mcp/databricks_mcp_cookbook.ipynb:12:1:8: Simple statements must be separated by newlines or semicolons

Copy link
Copy Markdown
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

I'm curious to see how this looks in a parsed AST to better understand the impact of this change.

Comment thread crates/ruff_python_parser/src/lexer.rs Outdated
source: crates/ruff_python_parser/src/lexer.rs
expression: "lex(r#\"uU\"hello\"\"#, Mode::Module, TextSize::default())"
---
## Tokens
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add a parser test. I'm curious to see how the parser recovers overall.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

What happens is that the parser will behave as though the string had no prefixes and parse it as an ordinary string literal. What may be potentially misleading is that the range of this string will include the skipped prefix. So, for example, if you added up the lengths for the content of the string, the quote (single or triple), and the prefix (apparently 0), you would come up short.

I don't think this trips up the parser, and we don't do anything downstream with source that has invalid syntax - i.e. no lint rules are run - so I don't think we have to worry.

But I could try to arrange for the range to exclude the prefix? Then the AST would behave as if there was nothing there - like it does when we skip trivia.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this trips up the parser, and we don't do anything downstream with source that has invalid syntax - i.e. no lint rules are run - so I don't think we have to worry.

I think we do run all token based rules.

But I could try to arrange for the range to exclude the prefix? Then the AST would behave as if there was nothing there - like it does when we skip trivia.

We would need to emit the prefix as an Unknown token because the lexer isn't allowed to skip over any content. I suspect that this will undo much of the improvements you made (as the Unknown token will confuse the parser)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... what if instead we parsed the thing as a string, like we're currently doing, but then expanded the possible prefixes to include invalid ones. So instead of the AST showing prefix: Empty it was like prefix: InvalidPair(first,second)? Then all the characters would be accounted for.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do run all token based rules.

oh interesting! I'll add a test fixture for some of those then

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's certainly worth exploring. One challenge is that the data needs to fit into TokenFlags, which is a u8. The other, that you still need to pick a TokenKind.

The last challenge is that it requires designing how the different StringFlags represent this new state (most of those currently assert that the flags are valid).

I'm not sure I can help you much with this right now as it seems fairly low priority (while a nice improvement).

For now, I suggest making a local improvement (which might be to go with Unknown)

Comment thread crates/ruff_python_parser/src/lexer.rs Outdated
Comment on lines +629 to +630
// need to record it.
self.push_error(LexicalError::new(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to include an explaination why it's okay to drop the unknown token (push_token should probably be marked with #[must_use]).

Is there a risk that any downstream user of the tokens or AST will drip over the fact that the token/ast prefix doesn't match the source prefix?

value: "there",
flags: StringLiteralFlags {
quote_style: Double,
prefix: Empty,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty seems suboptimal. I think we should try to preserve the most important prefix (e.g. if f is present, favor it over u)

Copy link
Copy Markdown
Collaborator Author

@dylwil3 dylwil3 Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the most important prefix for ft"hello {there}"? or bf"hey"?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure for ft but I would ignore the binary prefix for the second case.

My concern with the current solution is that ignoring both prefixes is the least correct. Ignoring just one would be closer to what the user intended.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw in the current state of affairs we are parsing the full prefix as a Name and then the string literal as an ordinary string. So I feel like even if we parsed this as an ordinary string with an invalid prefix it would be an improvement.

but I'll try to think of a way to implement what you're suggesting!

@MichaReiser
Copy link
Copy Markdown
Member

@dylwil3 what's the status of this PR?

@dylwil3 dylwil3 marked this pull request as draft August 18, 2025 12:26
@dylwil3
Copy link
Copy Markdown
Collaborator Author

dylwil3 commented Aug 18, 2025

I'll revert to draft and re-open when I get a chance to visit it again - apologies for cluttering the open PRs!

@dylwil3
Copy link
Copy Markdown
Collaborator Author

dylwil3 commented Dec 10, 2025

Gonna close this since I won't be able to get back to it until higher priority items are cleared - which may not be soon. Others are welcome to pick it up!

@dylwil3 dylwil3 closed this Dec 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parser Related to the parser

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Emit better syntax error for unicode prefix

2 participants