Emit more specific syntax errors for incompatible string prefixes#19466
Emit more specific syntax errors for incompatible string prefixes#19466dylwil3 wants to merge 6 commits intoastral-sh:mainfrom
Conversation
|
MichaReiser
left a comment
There was a problem hiding this comment.
Nice.
I'm curious to see how this looks in a parsed AST to better understand the impact of this change.
| source: crates/ruff_python_parser/src/lexer.rs | ||
| expression: "lex(r#\"uU\"hello\"\"#, Mode::Module, TextSize::default())" | ||
| --- | ||
| ## Tokens |
There was a problem hiding this comment.
Let's also add a parser test. I'm curious to see how the parser recovers overall.
There was a problem hiding this comment.
Added.
What happens is that the parser will behave as though the string had no prefixes and parse it as an ordinary string literal. What may be potentially misleading is that the range of this string will include the skipped prefix. So, for example, if you added up the lengths for the content of the string, the quote (single or triple), and the prefix (apparently 0), you would come up short.
I don't think this trips up the parser, and we don't do anything downstream with source that has invalid syntax - i.e. no lint rules are run - so I don't think we have to worry.
But I could try to arrange for the range to exclude the prefix? Then the AST would behave as if there was nothing there - like it does when we skip trivia.
There was a problem hiding this comment.
I don't think this trips up the parser, and we don't do anything downstream with source that has invalid syntax - i.e. no lint rules are run - so I don't think we have to worry.
I think we do run all token based rules.
But I could try to arrange for the range to exclude the prefix? Then the AST would behave as if there was nothing there - like it does when we skip trivia.
We would need to emit the prefix as an Unknown token because the lexer isn't allowed to skip over any content. I suspect that this will undo much of the improvements you made (as the Unknown token will confuse the parser)
There was a problem hiding this comment.
Hmm... what if instead we parsed the thing as a string, like we're currently doing, but then expanded the possible prefixes to include invalid ones. So instead of the AST showing prefix: Empty it was like prefix: InvalidPair(first,second)? Then all the characters would be accounted for.
There was a problem hiding this comment.
I think we do run all token based rules.
oh interesting! I'll add a test fixture for some of those then
There was a problem hiding this comment.
That's certainly worth exploring. One challenge is that the data needs to fit into TokenFlags, which is a u8. The other, that you still need to pick a TokenKind.
The last challenge is that it requires designing how the different StringFlags represent this new state (most of those currently assert that the flags are valid).
I'm not sure I can help you much with this right now as it seems fairly low priority (while a nice improvement).
For now, I suggest making a local improvement (which might be to go with Unknown)
| // need to record it. | ||
| self.push_error(LexicalError::new( |
There was a problem hiding this comment.
It would be helpful to include an explaination why it's okay to drop the unknown token (push_token should probably be marked with #[must_use]).
Is there a risk that any downstream user of the tokens or AST will drip over the fact that the token/ast prefix doesn't match the source prefix?
| value: "there", | ||
| flags: StringLiteralFlags { | ||
| quote_style: Double, | ||
| prefix: Empty, |
There was a problem hiding this comment.
Empty seems suboptimal. I think we should try to preserve the most important prefix (e.g. if f is present, favor it over u)
There was a problem hiding this comment.
What is the most important prefix for ft"hello {there}"? or bf"hey"?
There was a problem hiding this comment.
Not sure for ft but I would ignore the binary prefix for the second case.
My concern with the current solution is that ignoring both prefixes is the least correct. Ignoring just one would be closer to what the user intended.
There was a problem hiding this comment.
fwiw in the current state of affairs we are parsing the full prefix as a Name and then the string literal as an ordinary string. So I feel like even if we parsed this as an ordinary string with an invalid prefix it would be an improvement.
but I'll try to think of a way to implement what you're suggesting!
|
@dylwil3 what's the status of this PR? |
|
I'll revert to draft and re-open when I get a chance to visit it again - apologies for cluttering the open PRs! |
|
Gonna close this since I won't be able to get back to it until higher priority items are cleared - which may not be soon. Others are welcome to pick it up! |
Following the CPython changes here:
ft""andbt""cases python/cpython#133202we modify the lexing of double character string prefixes to emit a more useful syntax error in the case of invalid double character prefixes. Our implementation differs slightly from that of CPython in that our error message retains the casing of the prefixes and we emit a separate error for repeated prefixes (like
uu"hello").Example
Before:
After:
Closes #19342