libsyntax: be more accepting of whitespace in lexer #29734
/// this function is Unicode-ignorant; fortunately, the careful design of
/// UTF-8 mitigates this ignorance. In particular, this function only collapses
/// sequences of \n, \r, ' ', and \t, but it should otherwise tolerate Unicode
/// chars. Unsurprisingly, it doesn't do NKF-normalization(?).
Probably should leave the rest of this comment in. The function is more unicode-aware now, but presumably still doesn't do any normalisation.
/// This function is relatively Unicode-ignorant; fortunately, the careful design
/// of UTF-8 mitigates this ignorance. It doesn't do NKF-normalization(?).
Would be ok for you? Or just the normalization sentence?
That looks fine to me.
Test looks fine to me. r=me with the nits addressed.
0680cc9 to de89895
@bors r+ de89895
This is technically a new feature... probably doesn't need a full RFC, but shouldn't it at least be discussed a bit?
Should we cc @nikomatsakis (who suggested accepting all whitespace in #29590)
I guess I don't really have any specific concerns; it's not like we could interpret a character of class whitespace as anything other than whitespace or an error. We could end up with a sort of confusing parse with ZWNJs and other characters like that, but I don't see any reason not to allow users to inflict that on themselves.
Err, actually, correction, I do have a specific objection: UAX #31 specifically recommends not doing this. We should be using the Pattern_White_Space property, not White_Space.
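For readers unfamiliar with the distinction raised above, here is a minimal sketch of how the two properties differ. It assumes modern Rust (the `matches!` macro) and hard-codes the 11 `Pattern_White_Space` code points. The function name `is_pattern_whitespace` is borrowed from this PR, but the body is an illustration only; the PR itself appears to rely on generated tables rather than a hand-written match.

```rust
/// Illustrative only: true for the 11 code points that carry the
/// Unicode `Pattern_White_Space` property.
fn is_pattern_whitespace(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}' // space
            | '\u{0085}' // next line (NEL)
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{2028}' // line separator
            | '\u{2029}' // paragraph separator
    )
}

/// Compares the sketch above with `char::is_whitespace`, which follows
/// the broader `White_Space` property. Returns (White_Space, Pattern_White_Space).
fn classify(c: char) -> (bool, bool) {
    (c.is_whitespace(), is_pattern_whitespace(c))
}
```

The two sets overlap without either containing the other: NO-BREAK SPACE (U+00A0) is `White_Space` only, while LEFT-TO-RIGHT MARK (U+200E) is `Pattern_White_Space` only, so `classify` gives `(true, false)` for the former and `(false, true)` for the latter.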
@bors r-
Sorry, I guess I got a little too zealous there.
My bad. I should have known better than to think anything about unicode was a black-and-white decision!
de89895 to 23d4302
I've updated the branch to accept only The github diff doesn't seem to display the testfile too well, it shows as the below in my local git which shows the test more clearly:

```
- assert_eq!(4 + 7 * 2
-
+assert_eq!(4^L+
-/ 3 * 2 , 4 + 7 * 2 / 3 * 2);
+7 * 2<U+0085>/<U+200E>3<U+200F>*<U+2028>2<U+2029>, 4 + 7 * 2 / 3 * 2);
 }
```
☔ The latest upstream changes (presumably #30187) made this pull request unmergeable. Please resolve the merge conflicts.
What's the status of this PR?
@steveklabnik Needs some feedback on whether the changes are now OK; perhaps this change still needs more discussion, though? Can rebase after that!
}
}

// check if a has *only* trailing whitespace
a_iter.all(|c| is_pattern_whitespace(c))
Is the closure needed? Wouldn't `a_iter.all(is_pattern_whitespace)` work just fine?
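A small sketch of the two spellings side by side, assuming `a_iter` yields `char` values (as a `str::Chars` iterator does). The predicate body and the two wrapper function names are hypothetical stand-ins, not code from this PR.

```rust
/// Stand-in predicate (assumption): the Pattern_White_Space code points.
fn is_pattern_whitespace(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'
            | '\u{0020}'
            | '\u{0085}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{2028}'
            | '\u{2029}'
    )
}

/// Hypothetical wrapper using an explicit closure, as in the reviewed line.
fn only_whitespace_closure(s: &str) -> bool {
    s.chars().all(|c| is_pattern_whitespace(c))
}

/// Hypothetical wrapper passing the function by name, as the review suggests.
/// This works because `fn(char) -> bool` satisfies `FnMut(char) -> bool`.
fn only_whitespace_fn(s: &str) -> bool {
    s.chars().all(is_pattern_whitespace)
}
```

Both forms should behave identically here; the second simply avoids a closure that does nothing but forward its argument.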
As mentioned in rust-lang#29734, the range comparison closure can be improved. The LLVM IR and the assembly from the new version are much simpler, and unfortunately we cannot rely on the compiler to optimise this much, as it would need to know that `lo <= hi`. Besides simpler code, there might also be a performance advantage, although it is unlikely to appear on benchmarks, as we are doing a binary search, which should always involve few comparisons. The code is available on the playpen for ease of comparison: http://is.gd/4raMmH
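The commit message above does not show the code, so the following is only a guess at the general shape of such a simplification, not the actual change. All function names and the table are hypothetical. The two comparators agree whenever `lo <= hi`, and differ otherwise (for example when `hi < c < lo`), which is presumably why the compiler cannot perform the rewrite on its own.

```rust
use std::cmp::Ordering;

/// Hypothetical "before" shape: test containment first, then direction.
fn cmp_range_verbose(c: char, lo: char, hi: char) -> Ordering {
    if lo <= c && c <= hi {
        Ordering::Equal
    } else if hi < c {
        Ordering::Less
    } else {
        Ordering::Greater
    }
}

/// Hypothetical "after" shape: at most two comparisons.
/// Equivalent to the verbose form only when `lo <= hi`.
fn cmp_range_simple(c: char, lo: char, hi: char) -> Ordering {
    if lo > c {
        Ordering::Greater
    } else if hi < c {
        Ordering::Less
    } else {
        Ordering::Equal
    }
}

/// Binary search over a sorted table of inclusive, non-overlapping ranges.
fn in_range_table(c: char, table: &[(char, char)]) -> bool {
    table
        .binary_search_by(|&(lo, hi)| cmp_range_simple(c, lo, hi))
        .is_ok()
}

/// Illustrative table (assumption): Pattern_White_Space as sorted ranges.
const PATTERN_WHITE_SPACE: &[(char, char)] = &[
    ('\u{0009}', '\u{000D}'),
    ('\u{0020}', '\u{0020}'),
    ('\u{0085}', '\u{0085}'),
    ('\u{200E}', '\u{200F}'),
    ('\u{2028}', '\u{2029}'),
];
```

The comparator reports where a table entry sits relative to the target character, so a range entirely below `c` yields `Less` and one entirely above yields `Greater`.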
@Ryman so it now conforms with UAX #31 (as cited by @eefriedman)?
@nikomatsakis I think it partially does, from the guideline:
This PR doesn't change anything about the acceptance of Pattern_Syntax, so I don't think the final note is true, but it does bring us into alignment with the whitespace requirements of the guideline. (unless I've mis-implemented!)
@Ryman great!
and the associated update of tables.rs The last commit is related to my comment to #29734.
This aligns with unicode recommendations and should be stable for all future unicode releases. See http://unicode.org/reports/tr31/#R3. This renames `libsyntax::lexer::is_whitespace` to `is_pattern_whitespace` so potentially breaks users of libsyntax.
23d4302 to 24578e0
Rebased, I think I got all of your suggestions accounted for @ranma42 :)
Yes, thank you! :)
Triage: it seems this PR has been updated after review and is still sitting here almost a month later; is anyone willing to r+?
was this intended to be r+'d? just confirming...
Woops, looks like this managed to slip by.
⌛ Testing commit 24578e0 with merge 8b7c3f2...
update reference for rust-lang#29734
Fixes #29590.
Perhaps this may need more thorough testing?
r? @Aatch