Skip to content

Conversation

@benodiwal
Copy link
Contributor

@benodiwal benodiwal commented Dec 2, 2025

Closes #12836

The fix modifies the s-expression lexer to accept non-ASCII characters in atoms, then validates version format at the decoder level where we have semantic context to provide helpful hints.

Previously, non-ASCII characters were rejected by the lexer with a generic "invalid atom" error. Now, the lexer accepts them, and validation occurs at the decoder level where we have context about which extension/lang the version is for. This allows us to provide consistent error messages ("Invalid version. Version must be two numbers separated by a dot.") with helpful, context-specific hints for both ASCII and non-ASCII invalid versions.

Tests cover single extensions, multiple extensions, and various non-ASCII characters including East Asian characters and emoji.

@benodiwal benodiwal force-pushed the feat/good-error-message-extensions-non-ascii branch from 3c0162b to 1784558 Compare December 2, 2025 09:20
@Alizter Alizter self-requested a review December 2, 2025 10:09
Copy link
Member

@rgrinberg rgrinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the approach of separately validating using declarations is rather weird and does not solve the underlying issue in all other stanzas as far as I can tell. We should be handling this sort of stuff at the level of dune_sexp.

@benodiwal
Copy link
Contributor Author

Hey @rgrinberg, You're right that this approach is specific to these two declaration types and doesn't solve the general problem. I initially followed the existing pattern for lang declarations (versioned_file_first_line.mll) and applied the same approach to using declarations.

The core issue is that non-ASCII characters in version strings get rejected by the s-expression parser before Syntax.Version.decode can run and provide a helpful error message. My current approach pre-validates specific declarations before s-expression parsing to catch these cases early.

@benodiwal
Copy link
Contributor Author

I followed the existing pattern in the codebase. The lang dune version already uses versioned_file_first_line.mll which does exactly this: it scans raw text before s-expression parsing, extracts lang and version as strings, then validates them with better error messages. I extended this same pattern to using declarations with using_declaration_parser.mll.

However I'm happy to rework this to handle it at the dune_sexp level if you can point me in the right direction.

@rgrinberg
Copy link
Member

The lang dune version already uses versioned_file_first_line.mll which does exactly this: it scans raw text before s-expression parsing, extracts lang and version as strings, then validates them with better error messages. I extended this same pattern to using declarations with using_declaration_parser.mll.

The reason versioned_file_first_line is this way is because it cannot assume that it is parsing a sexp file. The first line is what determines the format (language + version) of this document, and what follows after this line isn't necessarily sexp. So this pattern is necessary only to solve this rather specific problem. It is not a general purpose solution.

I think the way to go is to modify our existing lexers instead of adding separate validation passes. In short, the lexing stage should reject all non ascii files that we cannot handle (even better would be to handle them of course) and produce appropriate error messages. The most important lexer where this issue is relevant is dune_sexp/lexer.mll. Solving the problem there would solve the issue for the vast majority of users. Therefore, I would suggest to start with that lexer.

A word of caution: this file is quite important to the performance of dune. So I'd recommend some sanity checks to make sure that we haven't considerably slowed down the parsing of valid sexp files.

@benodiwal
Copy link
Contributor Author

benodiwal commented Dec 2, 2025

Thank you for the clarification! That makes much more sense now. I misunderstood versioned_file_first_line as a general pattern when it's actually solving a very specific problem where the document format itself is unknown until the first line is parsed.

I will start reworking it with this different approach.

@benodiwal benodiwal marked this pull request as draft December 2, 2025 21:26
@benodiwal benodiwal force-pushed the feat/good-error-message-extensions-non-ascii branch 3 times, most recently from 4912f08 to b489524 Compare December 3, 2025 09:31
match sexp with
| Atom (loc, A s) ->
(* Check if version has invalid format (non-ASCII or not X.Y pattern) *)
let has_invalid_format =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we do this check here? Why not just do it in lexer.mll or wherever else we might be creating an invalid atom?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this check at the decoder level (rather than just the lexer) because that's where we have the semantic context to provide helpful, extension-specific hints.

At the decoder level in dune_project.ml, we know:

  • This atom is specifically a version for an extension
  • Which extension it's for (menhir, melange, etc.)
  • What the latest valid version is for that extension

This allows us to provide context-aware error messages like Hint: using menhir 3.0 instead of a generic lexer error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do have more context, but I think you're going to find it rather tedious to add such hints everywhere. In the end, all the information the user needs is to remove the special characters to form a valid atom.

if you do it at the lexer, the error would be simpler, but it would work everywhere and not just in this one specific case.

Or do you intend to perhaps support non-ascii characters in some places where dune accepts atoms? Then I think it would make sense to handle this stuff at the decoder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the lexer-level approach is simpler and more robust. Adding validation everywhere would be tedious and error-prone.
I'm happy to implement the lexer-level solution instead. However, since the idea of providing context-specific hints for version errors was discussed in earlier PRs, can we check with @Alizter as well before reverting to the simpler approach.

cc: @Alizter

@benodiwal benodiwal force-pushed the feat/good-error-message-extensions-non-ascii branch from b489524 to 512b27f Compare December 3, 2025 18:37
@benodiwal benodiwal marked this pull request as ready for review December 4, 2025 11:17
]
else
Code_error.raise
"Atom.parse failed for unexpected reason"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really unexpected? Can't it happen for some other invalid character? A regular error should suffice here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, my bad, fixed it now.

User_error.raise
~loc:ver_loc
[ Pp.text
"Invalid atom: contains non-ASCII character(s). Atoms must only \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you share this error message between the two files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I shared it through atom.ml as both files are using atom parsing, can u check once.

@benodiwal benodiwal force-pushed the feat/good-error-message-extensions-non-ascii branch from e61943a to fb1288e Compare December 4, 2025 15:16
Copy link
Member

@rgrinberg rgrinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @Alizter do you intend to review this?

let has_non_ascii = String.exists ver ~f:(fun c -> Char.code c >= 128) in
if has_non_ascii
then User_error.raise ~loc:ver_loc [ Pp.text Atom.non_ascii_error_message ]
else User_error.raise ~loc:ver_loc [ Pp.textf "Invalid atom: %S" ver ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should preserve the message for the else clause:

[ Pp.text "Invalid version. Version must be two numbers separated by a dot." ]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the change

@benodiwal benodiwal force-pushed the feat/good-error-message-extensions-non-ascii branch from 2a11379 to 1a0d184 Compare December 5, 2025 08:06
@Alizter
Copy link
Collaborator

Alizter commented Dec 5, 2025

@rgrinberg Yes, I will give it a review.

Comment on lines 23 to 24
CR-someday benodiwal: The version_loc is greedy and captures the closing
parenthesis.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these are OK now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are partially fixed. Apparently they are only for the extensions part, for the first line we have to handle it differently. I have explained this in detail in the related issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should handle that in different PR, I will able to test this more for extensions as well there along with fix for first line.

Error: Invalid version. Version must be two numbers separated by a dot.
Hint: lang dune 3.21
Error: Invalid atom: contains non-ASCII character(s). Atoms must only contain
ASCII characters.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've lost the hint here which is fine since this is a different kind of error. I think the hint is still useful in the ASCII case and looking above to the Ali case we don't provide one. Could you add another CR about adding that hint to the validation step?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Alizter, I am afk for a while, will do it in some time.

@benodiwal benodiwal force-pushed the feat/good-error-message-extensions-non-ascii branch from f15ae27 to 7c3a713 Compare December 5, 2025 10:36
@benodiwal
Copy link
Contributor Author

Hey @Alizter, I have updated the CR somedays, you can check now. Thanks

@benodiwal benodiwal force-pushed the feat/good-error-message-extensions-non-ascii branch from 7c3a713 to c1f4aff Compare December 5, 2025 11:28
… versions

Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
…I characters

Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
Signed-off-by: Sachin Beniwal <s474996633@gmail.com>
@benodiwal benodiwal force-pushed the feat/good-error-message-extensions-non-ascii branch from 763c294 to 344ccb4 Compare December 5, 2025 12:52
@Alizter Alizter enabled auto-merge December 5, 2025 12:53
@Alizter Alizter merged commit 15eead0 into ocaml:main Dec 5, 2025
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve error message for non-ASCII characters in extension versions

3 participants