libsyntax: be more accepting of whitespace in lexer #29734

Ryman · 2015-11-10T01:49:09Z

Perhaps this may need more thorough testing?

Aatch · 2015-11-10T01:57:33Z

src/libsyntax/util/parser_testing.rs

-/// this function is Unicode-ignorant; fortunately, the careful design of
-/// UTF-8 mitigates this ignorance.  In particular, this function only collapses
-/// sequences of \n, \r, ' ', and \t, but it should otherwise tolerate Unicode
-/// chars. Unsurprisingly, it doesn't do NKF-normalization(?).


Probably should leave the rest of this comment in. The function is more unicode-aware now, but presumably still doesn't do any normalisation.

/// This function is relatively Unicode-ignorant; fortunately, the careful design
/// of UTF-8 mitigates this ignorance. It doesn't do NKF-normalization(?).

Would this be OK for you? Or should I keep just the normalization sentence?

That looks fine to me.

Aatch · 2015-11-10T02:05:11Z

Test looks fine to me. r=me with the nits addressed.

Aatch · 2015-11-10T03:22:22Z

@bors r+ de89895

eefriedman · 2015-11-10T19:03:40Z

This is technically a new feature... probably doesn't need a full RFC, but shouldn't it at least be discussed a bit?

Ryman · 2015-11-11T00:24:04Z

Should we r- bors if there are concerns?

cc @nikomatsakis (who suggested accepting all whitespace in #29590)

eefriedman · 2015-11-11T00:55:45Z

I guess I don't really have any specific concerns; it's not like we could interpret a character of class whitespace as anything other than whitespace or an error. We could end up with a sort of confusing parse with ZWNJs and other characters like that, but I don't see any reason not to allow users to inflict that on themselves.

eefriedman · 2015-11-11T01:05:48Z

Err, actually, correction, I do have a specific objection: UAX #31 specifically recommends not doing this. We should be using the Pattern_White_Space property, not White_Space.
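To make the distinction concrete, here is a minimal sketch of a Pattern_White_Space check, assuming the fixed set of eleven code points that carry that property in the Unicode Character Database. The function body is an illustration only, not the code from this PR; `char::is_whitespace` is the standard library method that follows the broader White_Space property.

```rust
/// Illustrative helper (not the PR's implementation): true if `c` has the
/// Unicode Pattern_White_Space property.
fn is_pattern_whitespace(c: char) -> bool {
    match c {
        '\u{0009}'..='\u{000D}' // tab, LF, VT, FF, CR
        | '\u{0020}'            // space
        | '\u{0085}'            // next line (NEL)
        | '\u{200E}'            // left-to-right mark
        | '\u{200F}'            // right-to-left mark
        | '\u{2028}'            // line separator
        | '\u{2029}' => true,   // paragraph separator
        _ => false,
    }
}
```

The two properties differ in both directions: U+3000 (ideographic space) is White_Space but not Pattern_White_Space, while U+200E and U+200F are Pattern_White_Space but not White_Space.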

Aatch · 2015-11-11T02:08:25Z

@bors r-

Aatch · 2015-11-11T02:08:41Z

Sorry, I guess I got a little too zealous there.

nikomatsakis · 2015-11-11T23:39:17Z

My bad. I should have known better than to think anything about unicode was a black-and-white decision!

Ryman · 2015-11-12T04:34:50Z

I've updated the branch to accept only PATTERN_WHITE_SPACE. I've assumed that exposing more from librustc_unicode is fine, since it is unstable.

The GitHub diff doesn't display the test file well; my local git shows the test more clearly, as below:

-    assert_eq!(4 + 　7 * 2
-
+assert_eq!(4^L+

-/ 3 * 2 , 4 + 7 * 2 / 3 * 2);
+7   * 2<U+0085>/<U+200E>3<U+200F>*<U+2028>2<U+2029>, 4 + 7 * 2 / 3 * 2);
 }

bors · 2015-12-06T06:49:20Z

☔ The latest upstream changes (presumably #30187) made this pull request unmergeable. Please resolve the merge conflicts.

steveklabnik · 2015-12-31T17:09:40Z

What's the status of this PR?

Ryman · 2016-01-01T18:33:24Z

@steveklabnik It needs some feedback on whether the changes are now OK, though perhaps this change still needs more discussion? I can rebase after that!

ranma42 · 2016-01-02T20:52:54Z

src/libsyntax/util/parser_testing.rs

        }
    }
+
+    // check if a has *only* trailing whitespace
+    a_iter.all(|c| is_pattern_whitespace(c))


Is the closure needed? Wouldn't a_iter.all(is_pattern_whitespace) work just fine?
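A small sketch of the two spellings, assuming `is_pattern_whitespace` has the signature `fn(char) -> bool` and the iterator yields `char` (the helper names `only_trailing_whitespace*` are hypothetical, chosen for illustration):

```rust
/// Illustrative stand-in for the predicate discussed in this thread.
fn is_pattern_whitespace(c: char) -> bool {
    match c {
        '\u{0009}'..='\u{000D}' | '\u{0020}' | '\u{0085}' | '\u{200E}'
        | '\u{200F}' | '\u{2028}' | '\u{2029}' => true,
        _ => false,
    }
}

/// Closure form, as written in the reviewed diff.
fn only_trailing_whitespace_closure<I: Iterator<Item = char>>(mut rest: I) -> bool {
    rest.all(|c| is_pattern_whitespace(c))
}

/// Function-item form: the function is passed directly as the predicate.
fn only_trailing_whitespace<I: Iterator<Item = char>>(mut rest: I) -> bool {
    rest.all(is_pattern_whitespace)
}
```

Both forms behave the same. Passing the function by name only works when its parameter type matches the iterator's item type exactly; a predicate taking `&char`, for example, would still need a closure or an adapter.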

As mentioned in rust-lang#29734, the range comparison closure can be improved. The LLVM IR and the assembly from the new version are much simpler; unfortunately, we cannot rely on the compiler to optimise the old version this far, as it would need to know that `lo <= hi`. Besides the simpler code, there might also be a performance advantage, although it is unlikely to show up in benchmarks: we are doing a binary search, which should always involve only a few comparisons. The code is available on the playpen for ease of comparison: http://is.gd/4raMmH
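For readers without access to the playpen link, the idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from either PR: the table contents and the names `PATTERN_WHITE_SPACE_TABLE` and `in_range_table` are hypothetical, and the table is assumed to hold sorted, non-overlapping inclusive ranges with `lo <= hi`.

```rust
use std::cmp::Ordering;

// Hypothetical range table: sorted, non-overlapping, each pair has lo <= hi.
const PATTERN_WHITE_SPACE_TABLE: &[(char, char)] = &[
    ('\u{0009}', '\u{000D}'),
    ('\u{0020}', '\u{0020}'),
    ('\u{0085}', '\u{0085}'),
    ('\u{200E}', '\u{200F}'),
    ('\u{2028}', '\u{2029}'),
];

/// Binary search for the range containing `c`.
fn in_range_table(c: char, table: &[(char, char)]) -> bool {
    table
        .binary_search_by(|&(lo, hi)| {
            if c < lo {
                Ordering::Greater // this range lies entirely above c
            } else if hi < c {
                Ordering::Less // this range lies entirely below c
            } else {
                Ordering::Equal // lo <= c <= hi
            }
        })
        .is_ok()
}
```

Because each branch tests one bound at a time, the closure never needs the combined check `lo <= c && c <= hi`, and so it does not depend on the compiler proving that `lo <= hi`.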

nikomatsakis · 2016-01-04T20:38:07Z

@Ryman so it now conforms with UAX #31 (as cited by @eefriedman)?

Ryman · 2016-01-07T18:00:59Z

@nikomatsakis I think it partially does, from the guideline:

Pattern_White_Space and Pattern_Syntax Characters: To meet this requirement, an implementation shall use Pattern_White_Space characters as all and only those characters interpreted as whitespace in parsing, and shall use Pattern_Syntax characters as all and only those characters with syntactic use. ...

Note: When meeting this requirement, all characters except those that have the Pattern_White_Space or Pattern_Syntax properties are available for use as identifiers or literals.

This PR doesn't change anything about the acceptance of Pattern_Syntax, so I don't think the final note is true, but it does bring us into alignment with the whitespace requirements of the guideline. (unless I've mis-implemented!)

nikomatsakis · 2016-01-08T22:44:46Z

@Ryman great!

and the associated update of tables.rs The last commit is related to my comment to #29734.

Fixes rust-lang#29590.

This aligns with unicode recommendations and should be stable for all future unicode releases. See http://unicode.org/reports/tr31/#R3. This renames `libsyntax::lexer::is_whitespace` to `is_pattern_whitespace` so potentially breaks users of libsyntax.

Ryman · 2016-01-16T01:02:55Z

Rebased, I think I got all of your suggestions accounted for @ranma42 :)

ranma42 · 2016-01-16T09:43:12Z

Yes, thank you! :)

steveklabnik · 2016-02-08T19:13:16Z

Triage: it seems this PR has been updated after review and is still sitting here almost a month later. Is anyone willing to r+?

alexcrichton · 2016-03-08T01:39:11Z

r? @nikomatsakis

was this intended to be r+'d? just confirming...

Aatch · 2016-03-08T01:45:33Z

@bors r+ 24578e0

Aatch · 2016-03-08T01:45:55Z

Woops, looks like this managed to slip by.

bors · 2016-03-08T04:06:17Z

⌛ Testing commit 24578e0 with merge 8b7c3f2...

@Aatch

libsyntax: be more accepting of whitespace in lexer Fixes #29590. Perhaps this may need more thorough testing? r? @Aatch

bors · 2016-03-08T06:57:37Z

☀️ Test successful - auto-linux-32-nopt-t, auto-linux-32-opt, auto-linux-64-debug-opt, auto-linux-64-nopt-t, auto-linux-64-opt, auto-linux-64-x-android-t, auto-linux-cross-opt, auto-linux-musl-64-opt, auto-mac-32-opt, auto-mac-64-nopt-t, auto-mac-64-opt, auto-mac-ios-opt, auto-win-gnu-32-nopt-t, auto-win-gnu-32-opt, auto-win-gnu-64-nopt-t, auto-win-gnu-64-opt, auto-win-msvc-32-opt, auto-win-msvc-64-opt

update reference for rust-lang#29734

rust-highfive assigned Aatch Nov 10, 2015

Aatch reviewed Nov 10, 2015

Ryman force-pushed the whitespace_consistency branch from 0680cc9 to de89895 Compare November 10, 2015 03:18

Ryman force-pushed the whitespace_consistency branch from de89895 to 23d4302 Compare November 12, 2015 04:25

ranma42 reviewed Jan 2, 2016

ranma42 mentioned this pull request Jan 4, 2016

Some minor cleanup and improvements to unicode.py #30695

Merged

bors added a commit that referenced this pull request Jan 12, 2016

Auto merge of #30695 - ranma42:cleanup-unicode, r=alexcrichton

7cffc9b

and the associated update of tables.rs The last commit is related to my comment to #29734.

Ryman added 2 commits January 14, 2016 22:47

libsyntax: use char::is_whitespace instead of custom implementations

8a27230

Fixes rust-lang#29590.

libsyntax: make matches_codepattern unicode aware

9e3e43f

Ryman force-pushed the whitespace_consistency branch from 23d4302 to 24578e0 Compare January 16, 2016 00:57

alexcrichton assigned nikomatsakis and unassigned Aatch Mar 8, 2016

bors added a commit that referenced this pull request Mar 8, 2016

Auto merge of #29734 - Ryman:whitespace_consistency, r=Aatch

8b7c3f2

libsyntax: be more accepting of whitespace in lexer Fixes #29590. Perhaps this may need more thorough testing? r? @Aatch

bors merged commit 24578e0 into rust-lang:master Mar 8, 2016

durka added a commit to durka/rust that referenced this pull request Jun 15, 2016

update reference for rust-lang#29734

523dbfc

steveklabnik added a commit to steveklabnik/rust that referenced this pull request Jun 27, 2016

Rollup merge of rust-lang#34287 - durka:patch-26, r=steveklabnik

087ac1d

update reference for rust-lang#29734

steveklabnik added a commit to steveklabnik/rust that referenced this pull request Jun 27, 2016

Rollup merge of rust-lang#34287 - durka:patch-26, r=steveklabnik

23725a6

update reference for rust-lang#29734

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Jun 28, 2016

Rollup merge of rust-lang#34287 - durka:patch-26, r=steveklabnik

be1c2b9

update reference for rust-lang#29734

dlrobertson pushed a commit to dlrobertson/rust that referenced this pull request Nov 29, 2018

update reference for rust-lang#29734

fb8ad01
