-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added new regex option RegexOptions.AnyNewLine #1449
Conversation
@shishirchawla I looked into this build failure. Couple of things:
I |
OpenSSL failures are #1129 and can be ignored. The warnings must be unrelated - I didn't look at them though. All this needs is a code review signoff. |
@@ -299,6 +299,39 @@ public static IEnumerable<object[]> Match_Basic_TestData() | |||
|
|||
// Surrogate pairs splitted up into UTF-16 code units. | |||
yield return new object[] { @"(\uD82F[\uDCA0-\uDCA3])", "\uD82F\uDCA2", RegexOptions.CultureInvariant, 0, 2, true, "\uD82F\uDCA2" }; | |||
|
|||
// AnyNewLine (with none of the special characters used as line ending) | |||
yield return new object[] { @"line3\nline4$", "line1\nline2\nline3\nline4", RegexOptions.AnyNewLine, 0, 23, true, "line3\nline4" }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add tests with \z
and it's combinations here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't look at the PR, but one way or another we sohuld have tests that cover the whole of my table. It might be good to include it as a comment even given the amount of noodling we did about it.
Subsequent to this PR, one of us should put a PR up against the API docs to include something like this table.
@shishirchawla, thanks for this. I merged some extensive changes to Regex this morning. I don't think it should much impact your changes, but there are some conflicts. Can you rebase? Then I can review. |
f1822ed
to
55c713b
Compare
Done. |
High-level questions:
|
Good point. I have a summary of Jeffrey Freidl's breakdown of anchor modifiers, here: https://github.com/jzabroski/Home/tree/master/RegularExpressions
I was not a fan of '\r' as new line, as the only popular OS to adopt this in the last ~30 years was "classic Mac OS". Source: Wikipedia's Newline page. However, beggars cannot be choosers, and so if the .NET maintainers wish to allow matching The current line anchor mode in .NET has taken its ancestry from Perl, and was copied by Java, Python, Delphi, and became the PCRE (Perl Compatible Regular Expressions) standard. The mistake, in my opinion, is that PCRE is only useful on UNIX systems and so fixing "end of line" to be Background Noise: In his book, Mastering Regular Expressions, Jeffrey Freidl traces the behavior of how |
Thank you for the follow-up, though some of the questions I asked are I think still unanswered:
|
Yes, it should.
I think this is a matter of taste that no one person should decide, but I was unaware of StreamReader behaving that way, and it does seem nicer from a testing standpoint to have uniform assumptions. Keep in mind, this flag is opt-in only. So, while it might be "slower" to match this way, users have to opt-in, and when they do, it's most likely so they can use the metacharacters to compactly express intent, not worry about performance. Frankly, having loaded NYSE tick data into kdb+, mechanical sympathy between mmap, bit splitter and target data store is tops in optimizing data loads. This is about every day usability and portability to get stuff done when you need a handy regex. |
It may have been me that introduced Admittedly Wikipedia tells me there are several other even more obscure newlines so I guess we would still be drawing a line somewhere. On balance given the name of the option I suggest to include |
if (UseOptionA()) | ||
AddUnitType(UseOptionM() ? RegexNode.AnyEol : RegexNode.AnyEndZ); | ||
else | ||
AddUnitType(UseOptionM() ? RegexNode.Eol : RegexNode.EndZ); | ||
break; | ||
|
||
case '.': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephentoub regarding "should it not also change the newlines that . maps to?", does this look right, assuming that we are allowing '\n', '\r' and '\r\n' ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Unicode spec I linked recommends that .
can match \r\n
as if it was a single character:
Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences, and \u{D A} (CRLF) should match as if it were a single character. (The recommendation that CRLF match as a single character is, however, not required for conformance to RL1.6.)
To achieve that is more complicated than the [^\r\n]
. It suggests this (including all the newline characters):
(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern above is missing a paren. I think in .NET that would be (?:\r\n}|(?!\r\n)[\r-\n\u0085\u2028\u2029])
(edited -- the hyphen means that 0A through 0D are individually all valid newlines)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do other implementations follow this standard in general?
As noted in https://www.regular-expressions.info/dot.html (section "Line Break Characters") -- it varies. According to this, Java is missing \f (formfeed - \u0C) for example. It does seem that Unicode is a safe standard to choose from.
tl;dr: Yes, agree, TR18 seems perfect:
- Nobody can say .NET made an arbitrary, quick decision and missed something obvious
- Easier for new C# developers to on-board from other languages
Thanks for finding this standard. I was unaware of it, despite reading Freidl's Mastering Regular Expressions. I searched through my PDF copy of the book, and he doesn't mention TR18. I guess his book was last updated in 2006, and TR18 was published in 2000, so maybe it's taken time for it to become popular. Amusingly, Oracle's java.util.regex documentation does not mention TR18 but instead says "Go read Mastering Regular Expressions for help using this library" 🤣
Related Documentation
An excellent tutorial and overview of regular expressions is Mastering Regular Expressions, Jeffrey E. F. Friedl, O'Reilly and Associates, 1997.
Jokes aside, Rust's regex 1.1.0 crate, RogueWave's Internationalization module, and the eponymous ICU Regular Expression Engine all support TR18 Level 1, at minimum. The package in the R CRAN repository, stringi, internally uses C bindings to the ICU C Regex Engine. It looks like Python 3.7 considered replacing the re
module with a regex
module that supports ICU as well.
In looking through the .NET implementation, it seems like the one feature necessary for full TR18 compliance would be "character class intersection". That should be a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In looking through the .NET implementation, it seems like the one feature necessary for full TR18 compliance would be "character class intersection". That should be a separate issue.
There's more. Eg., we are also missing symmetric difference ([a-g~~b-h]
), and it looks like we are missing some properties as well, e.g I noticed \p{LC}
is (accidentally?) missing. I only looked briefly though, I think it would be worth doing a more careful look and opening a separate issue(s) for discrepancies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incidentally (?:\r\n}|(?!\r\n)[\r-\n\u0085\u2028\u2029])
will presumably bypass Stephen's ASCII optimizations to character class matching, even though the pattern might be otherwise ASCII. That might justify some tuning later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will presumably bypass Stephen's ASCII optimizations to character class matching
Most of it will still apply, e.g. we'll still generate a fast lookup table for ASCII, and only fall back to the slower path if the actual character being examined isn't ASCII. And it'll only happen if you opt-in to this AnyNewLine feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and also: The Level 1 of the specification seems most solid, whereas Level 2 and Level 3 are undergoing change. Level 3 is proposed to be completely removed in the next version, subject to vendor feedback. They're basically asking, "Is this too crazy for anyone to implement?"
They've seemed to have taken feedback from EcmaScript Tc39 as some basic guideline on what's reasonable, as part of JavaScript's major contribution to humanity in the last decade: making sure you can Regex match emojis via RegEx character class:
const reRgiEmoji = /\p{RGI_Emoji}/u;
I hadn't realized and this wasn't mentioned before, but there is a Unicode spec specifically regarding regular expressions and has a section on newline treatment: http://www.unicode.org/reports/tr18/#RL1.6
Re dot, https://www.regular-expressions.info/dot.html describes some of the varying support of newlines in different engines -- varying combinations of the above list of 6. (Incidentally it also describes What this means for me is -- it makes sense to keep |
Do other implementations follow this standard in general? Seems to me like if we're adding "AnyNewLine" now, we should do our best now to make it as correct as possible. It would be a breaking change in the future to augment the meaning of that with additional characters. |
Sounds fine to me. |
As noted in https://www.regular-expressions.info/dot.html (section "Line Break Characters") -- it varies. According to this, Java is missing \f (formfeed - \u0C) for example. It does seem that Unicode is a safe standard to choose from. |
@shishirchawla I'm guessing @jzabroski might be willing to help push to this branch to help get this PR over the line, if you've set permissions appropriately. |
@shishirchawla I'm sorry I've tormented you. I hate seeing geniuses wait to see their invention unfold to success. Happy to help bring it to life for you. |
@danmosemsft @jzabroski Sorry about the late response, I am actually out traveling and it would be a bit hard for me to update this in the next couple of days, so please feel free to make any changes and complete the PR. I can pick it up again later if required. |
@jzabroski can you successfully push to this branch? If not - I think @shishirchawla has to check the box - or you can continue work in a new PR |
You're just asking me to add:
Additionally, it is worth disclosing that although the rust regex 1.1.0 crate claims to implement TR18 Level 1, the code suggests otherwise, at least as far as R1.6 "Unicode New Line" is concerned: https://github.com/rust-lang/regex/blob/master/src/compile.rs#L271-L286 - it uses just The only code that likely actually implements TR18 Level 1 R1.6 "Unicode New Line" is maybe ICU's C library. |
I think so. I'd like @stephentoub to confirm, since he's got regex top of mind at the moment, and I'm sure we'd all like to avoid asking you guys for yet more iterations 😃 |
I see in rust-lang/regex#244 they don't have a plan to support |
I will also submit a PR to update VSCode ripgrep TypeScript code with this comment: https://xkcd.com/208/ |
@@ -243,6 +243,7 @@ public enum RegexOptions | |||
RightToLeft = 64, | |||
ECMAScript = 256, | |||
CultureInvariant = 512, | |||
AnyNewLine = 1024, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shishirchawla Please make sure to get this new Enum documented, either with triple slash comments on top of the implementation (src file) or directly in the dotnet-api-docs repo.
Sounds right. @jzabroski, should we close this PR and you can open a new one when you're ready? @shishirchawla, thanks for working on this. |
@stephentoub Yes, I think that's fair - let's keep things tidy & close it, as I imagine it adds onerous burden to see open PRs not getting closed. I tried to get to it this weekend but spent ~4 hours troubleshooting a build issue I was expecting to resolve in 2 minutes. :( |
Thanks. Sorry to hear you're having build issues. Please open an issue if they continue and we can get you unblocked. @ViktorHofer |
Thanks - issue is not with this project, but with FluentMigrator project I co-maintain. I pinged rainer for help. Works in Visual Studio, not in JetBrains Rider. |
@@ -1764,6 +1778,7 @@ private static RegexOptions OptionFromCode(char ch) | |||
'd' => RegexOptions.Debug, | |||
#endif | |||
'e' => RegexOptions.ECMAScript, | |||
'a' => RegexOptions.AnyNewLine, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed that the original proposal didn't suggest we add an inline option for AnyNewLine - it's reasonable to do but we'd need tests for it. It looks like we don't have a great set of tests for existing inline regex options other than just setting them once.
Fixes https://github.com/dotnet/corefx/issues/28410 (edit: #25598)
@jzabroski