Feat/good error message extensions non ascii #12844
3c0162b to
1784558
Compare
rgrinberg
left a comment
I think the approach of separately validating using declarations is rather weird and does not solve the underlying issue in all other stanzas as far as I can tell. We should be handling this sort of stuff at the level of dune_sexp.
Hey @rgrinberg, you're right that this approach is specific to these two declaration types and doesn't solve the general problem. I initially followed the existing pattern for lang declarations (versioned_file_first_line.mll) and applied the same approach to using declarations. The core issue is that non-ASCII characters in version strings get rejected by the s-expression parser before Syntax.Version.decode can run and provide a helpful error message. My current approach pre-validates specific declarations before s-expression parsing to catch these cases early.
I followed the existing pattern in the codebase. The lang dune version already uses versioned_file_first_line.mll, which does exactly this: it scans raw text before s-expression parsing, extracts lang and version as strings, then validates them with better error messages. I extended this same pattern to using declarations with using_declaration_parser.mll. However, I'm happy to rework this to handle it at the dune_sexp level if you can point me in the right direction.
I think the way to go is to modify our existing lexers instead of adding separate validation passes. In short, the lexing stage should reject all non-ASCII files that we cannot handle (even better would be to handle them, of course) and produce appropriate error messages. The most important lexer where this issue is relevant is … A word of caution: this file is quite important to the performance of dune, so I'd recommend some sanity checks to make sure that we haven't considerably slowed down the parsing of valid sexp files.
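The lexer-level approach suggested above can be sketched roughly as follows. This is a minimal hand-written tokenizer, not dune's actual lexer: the `token` type, `Lex_error`, `is_atom_char`, and `tokenize` are all hypothetical names, and the atom rules are simplified (no strings, comments, or escapes). It only illustrates rejecting a non-ASCII byte while scanning an atom and reporting its offset.

```ocaml
(* Hypothetical stand-in for a lexer-level check; not dune's real lexer. *)
type token =
  | Lparen
  | Rparen
  | Atom of string

(* Byte offset of the offending character, plus a message. *)
exception Lex_error of int * string

(* Simplified rule: anything except parentheses and whitespace is an atom
   character. Real s-expression lexers have more cases. *)
let is_atom_char = function
  | '(' | ')' | ' ' | '\t' | '\n' | '\r' -> false
  | _ -> true

let tokenize (s : string) : token list =
  let n = String.length s in
  (* Scan to the end of the atom starting at [i], failing on the first
     byte outside the ASCII range. *)
  let rec atom_end i =
    if i < n && is_atom_char s.[i]
    then (
      if Char.code s.[i] >= 128
      then raise (Lex_error (i, "Invalid atom: contains non-ASCII character(s)."))
      else atom_end (i + 1))
    else i
  in
  let rec go i acc =
    if i >= n
    then List.rev acc
    else (
      match s.[i] with
      | '(' -> go (i + 1) (Lparen :: acc)
      | ')' -> go (i + 1) (Rparen :: acc)
      | ' ' | '\t' | '\n' | '\r' -> go (i + 1) acc
      | _ ->
        let j = atom_end i in
        go j (Atom (String.sub s i (j - i)) :: acc))
  in
  go 0 []
```

In this sketch the check costs one extra comparison per atom byte. Whether an equivalent check in a generated lexer stays cheap on valid files is exactly what the suggested sanity checks would need to confirm.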
Thank you for the clarification! That makes much more sense now. I misunderstood. I will start reworking it with this different approach.
4912f08 to
b489524
Compare
src/dune_lang/dune_project.ml
Outdated
match sexp with
| Atom (loc, A s) ->
  (* Check if version has invalid format (non-ASCII or not X.Y pattern) *)
  let has_invalid_format =
Why do we do this check here? Why not just do it in lexer.mll or wherever else we might be creating an invalid atom?
We need this check at the decoder level (rather than just the lexer) because that's where we have the semantic context to provide helpful, extension-specific hints.
At the decoder level in dune_project.ml, we know:
- This atom is specifically a version for an extension
- Which extension it's for (menhir, melange, etc.)
- What the latest valid version is for that extension
This allows us to provide context-aware error messages like Hint: using menhir 3.0 instead of a generic lexer error.
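A decoder-level check of the kind described above might look like the following sketch. It is not the code from this PR: `latest_versions`, `parse_version`, and `decode_using` are hypothetical names, and the version table is a placeholder whose single entry mirrors the `using menhir 3.0` hint quoted in the comment. The point is only that the decoder knows which extension it is decoding and can attach a hint.

```ocaml
(* Hypothetical table of latest versions. The entry mirrors the hint quoted
   in the discussion; it is not read from dune itself. *)
let latest_versions = [ "menhir", (3, 0) ]

(* Accept exactly "X.Y" where X and Y are non-empty runs of ASCII digits. *)
let parse_version (s : string) : (int * int) option =
  let is_digits t = t <> "" && String.for_all (fun c -> c >= '0' && c <= '9') t in
  match String.split_on_char '.' s with
  | [ major; minor ] when is_digits major && is_digits minor ->
    (match int_of_string_opt major, int_of_string_opt minor with
     | Some a, Some b -> Some (a, b)
     | _ -> None)
  | _ -> None

(* Decode a version for a known extension, adding a hint when one is
   available for that extension. *)
let decode_using ~(extension : string) ~(version : string) : (int * int, string) result =
  match parse_version version with
  | Some v -> Ok v
  | None ->
    let msg = "Invalid version. Version must be two numbers separated by a dot." in
    (match List.assoc_opt extension latest_versions with
     | Some (major, minor) ->
       Error (Printf.sprintf "%s\nHint: using %s %d.%d" msg extension major minor)
     | None -> Error msg)
```

The trade-off raised in the replies still applies: a hint like this has to be written for each place that decodes an atom, whereas a lexer-level error works everywhere but carries no extension-specific context.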
You do have more context, but I think you're going to find it rather tedious to add such hints everywhere. In the end, all the information the user needs is to remove the special characters to form a valid atom.
If you do it at the lexer, the error would be simpler, but it would work everywhere and not just in this one specific case.
Or do you intend to perhaps support non-ascii characters in some places where dune accepts atoms? Then I think it would make sense to handle this stuff at the decoder.
You're right, the lexer-level approach is simpler and more robust. Adding validation everywhere would be tedious and error-prone.
I'm happy to implement the lexer-level solution instead. However, since the idea of providing context-specific hints for version errors was discussed in earlier PRs, can we check with @Alizter as well before reverting to the simpler approach?
cc: @Alizter
b489524 to
512b27f
Compare
src/dune_sexp/versioned_file.ml
Outdated
]
else
  Code_error.raise
    "Atom.parse failed for unexpected reason"
Is it really unexpected? Can't it happen for some other invalid character? A regular error should suffice here.
Ya, my bad, fixed it now.
src/dune_sexp/versioned_file.ml
Outdated
User_error.raise
  ~loc:ver_loc
  [ Pp.text
      "Invalid atom: contains non-ASCII character(s). Atoms must only \
Could you share this error message between the two files?
I shared it through atom.ml, since both files use atom parsing. Could you check?
e61943a to
fb1288e
Compare
rgrinberg
left a comment
LGTM. @Alizter do you intend to review this?
src/dune_sexp/versioned_file.ml
Outdated
let has_non_ascii = String.exists ver ~f:(fun c -> Char.code c >= 128) in
if has_non_ascii
then User_error.raise ~loc:ver_loc [ Pp.text Atom.non_ascii_error_message ]
else User_error.raise ~loc:ver_loc [ Pp.textf "Invalid atom: %S" ver ]
I think you should preserve the message for the else clause:
[ Pp.text "Invalid version. Version must be two numbers separated by a dot." ]
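As a self-contained illustration of the branch being discussed, here is a sketch that uses only the OCaml standard library (4.13 or later for `String.exists`) and returns the message instead of raising. The function names are hypothetical and the message strings are copied from this thread; the real code goes through dune's own `String.exists ~f`, `User_error.raise`, and `Pp`.

```ocaml
(* Wording taken from the messages quoted in this thread. *)
let non_ascii_error_message =
  "Invalid atom: contains non-ASCII character(s). Atoms must only contain ASCII characters."

let invalid_version_message =
  "Invalid version. Version must be two numbers separated by a dot."

(* In UTF-8, every byte of a non-ASCII character is >= 128, so a byte-wise
   scan is enough to detect one. *)
let has_non_ascii (s : string) : bool =
  String.exists (fun c -> Char.code c >= 128) s

(* Choose the message for a version string already known to be invalid:
   the shared non-ASCII message, or the preserved version-format message. *)
let message_for_invalid_version (ver : string) : string =
  if has_non_ascii ver then non_ascii_error_message else invalid_version_message
```

Keeping the else branch on the version-format message means ASCII-only mistakes such as `three` still get the original, more specific wording.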
Made the change.
2a11379 to
1a0d184
Compare
@rgrinberg Yes, I will give it a review.
CR-someday benodiwal: The version_loc is greedy and captures the closing
parenthesis.
Looks like these are OK now?
They are partially fixed. Apparently the fix only covers the extensions part; the first line has to be handled differently. I have explained this in detail in the related issue.
I think we should handle that in a different PR. There I will be able to test this more for extensions as well, along with the fix for the first line.
Error: Invalid version. Version must be two numbers separated by a dot.
Hint: lang dune 3.21
Error: Invalid atom: contains non-ASCII character(s). Atoms must only contain
ASCII characters.
We've lost the hint here, which is fine since this is a different kind of error. I think the hint is still useful in the ASCII case, and looking above at the Ali case, we don't provide one. Could you add another CR about adding that hint to the validation step?
Hey @Alizter, I am afk for a while, will do it in some time.
f15ae27 to
7c3a713
Compare
Hey @Alizter, I have updated the CR-somedays. You can check now. Thanks
7c3a713 to
c1f4aff
Compare
… versions Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
…I characters Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
763c294 to
344ccb4
Compare
Closes #12836
The fix modifies the s-expression lexer to accept non-ASCII characters in atoms, then validates version format at the decoder level where we have semantic context to provide helpful hints.
Previously, non-ASCII characters were rejected by the lexer with a generic "invalid atom" error. Now, the lexer accepts them, and validation occurs at the decoder level where we have context about which extension/lang the version is for. This allows us to provide consistent error messages ("Invalid version. Version must be two numbers separated by a dot.") with helpful, context-specific hints for both ASCII and non-ASCII invalid versions.
Tests cover single extensions, multiple extensions, and various non-ASCII characters including East Asian characters and emoji.