Added new regex option RegexOptions.AnyNewLine #1449

shishirchawla · 2020-01-08T15:18:18Z

Fixes https://github.com/dotnet/corefx/issues/28410 (edit: #25598)

@jzabroski

jzabroski · 2020-01-08T17:23:40Z

@shishirchawla I looked into this build failure.

Couple of things:

The warnings about not being able to convert from LC_ALL to US-en.UTF-8 are a bit worrisome. Most likely not related to your check-in, probably a pre-existing warning the .NET build team should investigate on OSX.
The failure indicated by https://helix.dot.net/api/2019-06-17/jobs/90a814ab-72c0-4914-941d-6afaafc4a328/workitems/System.Security.Cryptography.OpenSsl.Tests/console suggests some weird stuff regarding how the .NET Runtime build works - there is https://github.com/dotnet/runtime/blob/3b9abae5aae537258611b5750dc7e2963cfc73ed/src/libraries/System.Security.Cryptography.OpenSsl/src/System/Security/Cryptography/RSAOpenSsl.cs but also https://github.com/dotnet/runtime/blob/3b9abae5aae537258611b5750dc7e2963cfc73ed/src/libraries/Common/src/System/Security/Cryptography/RSAOpenSsl.cs

I

danmoseley · 2020-01-08T17:25:05Z

OpenSSL failures are #1129 and can be ignored.

The warnings must be unrelated - I didn't look at them though.

All this needs is a code review signoff.

pgovind · 2020-01-08T23:29:53Z

src/libraries/System.Text.RegularExpressions/tests/Regex.Match.Tests.cs

@@ -299,6 +299,39 @@ public static IEnumerable<object[]> Match_Basic_TestData()

            // Surrogate pairs splitted up into UTF-16 code units.
            yield return new object[] { @"(\uD82F[\uDCA0-\uDCA3])", "\uD82F\uDCA2", RegexOptions.CultureInvariant, 0, 2, true, "\uD82F\uDCA2" };
+
+            // AnyNewLine (with none of the special characters used as line ending)
+            yield return new object[] { @"line3\nline4$", "line1\nline2\nline3\nline4", RegexOptions.AnyNewLine, 0, 23, true, "line3\nline4" };


Should we add tests with \z and its combinations here?

I didn't look at the PR, but one way or another we should have tests that cover the whole of my table. It might be good to include it as a comment even given the amount of noodling we did about it.

dotnet/corefx#41195 (comment)

Subsequent to this PR, one of us should put a PR up against the API docs to include something like this table.
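
Until such an option exists, the end-of-input cases in that table can be approximated with APIs that already ship. The sketch below is only an illustration: it reuses the lookahead from the title of issue #25598, the class and method names are hypothetical, and it assumes that a lone \r, a lone \n, and \r\n should all count as a trailing newline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnyNewLineWorkaround
{
    // Lookahead taken from the title of issue #25598: succeeds at the end of
    // the input, or just before a single trailing "\r", "\n", or "\r\n".
    private const string EndBeforeAnyNewline = @"(?=\r\z|\n\z|\r\n\z|\z)";

    // True when `literal` occurs immediately before the (optional) final newline.
    public static bool EndsWith(string input, string literal) =>
        Regex.IsMatch(input, Regex.Escape(literal) + EndBeforeAnyNewline);

    public static void Main()
    {
        foreach (string ending in new[] { "", "\n", "\r", "\r\n" })
        {
            // Prints True for every ending.
            Console.WriteLine(EndsWith("line3\nline4" + ending, "line4"));
        }

        // Prints False: more text follows the newline.
        Console.WriteLine(EndsWith("line4\r\nline5", "line4"));
    }
}
```

A plain `line4$` would fail the \r and \r\n cases, which is the gap the proposed option is meant to close.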

stephentoub · 2020-01-09T18:17:55Z

@shishirchawla, thanks for this. I merged some extensive changes to Regex this morning. I don't think it should much impact your changes, but there are some conflicts. Can you rebase? Then I can review.

shishirchawla · 2020-01-10T13:28:40Z

@shishirchawla, thanks for this. I merged some extensive changes to Regex this morning. I don't think it should much impact your changes, but there are some conflicts. Can you rebase? Then I can review.

Done.

stephentoub · 2020-01-10T17:12:59Z

High-level questions:

It looks like this change doesn't impact the behavior of .. Historically, . (unless SingleLine is set) has basically been the equivalent of [^\n], i.e. it matches everything except for a new line. That's the same new line that's matched by $ and whatnot. This PR is changing the new lines allowed by $ and friends... should it not also change the newlines that . maps to?
This is allowing a new line to be '\n' and \r\n, but also \r. Are there any platforms in use today that use \r as a new line? Do any other regex implementations by any other modern language / framework allow treating \r as a new line? I'm trying to understand the use case for that.
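
For reference, the current behavior described in these questions can be checked with the shipped API alone. This is a minimal sketch with hypothetical class and method names; it does not use the proposed AnyNewLine option.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotAndDollarDemo
{
    // Without RegexOptions.Singleline, '.' behaves like [^\n]:
    // '\n' is the only character it rejects.
    public static bool DotMatches(char c) => Regex.IsMatch(c.ToString(), ".");

    // Without RegexOptions.Multiline, '$' matches at the very end of the input
    // or just before a final '\n'. A '\r' in front of that '\n' is not skipped.
    public static bool EndsWithWord(string input, string word) =>
        Regex.IsMatch(input, Regex.Escape(word) + "$");

    public static void Main()
    {
        Console.WriteLine(DotMatches('\n'));                   // False
        Console.WriteLine(DotMatches('\r'));                   // True
        Console.WriteLine(EndsWithWord("line4\n", "line4"));   // True
        Console.WriteLine(EndsWithWord("line4\r\n", "line4")); // False
    }
}
```

The last two lines show the portability problem: the same pattern succeeds on LF-terminated input and fails on CRLF-terminated input.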

jzabroski · 2020-01-10T17:42:52Z

High-level questions:

It looks like this change doesn't impact the behavior of .. Historically, . (unless SingleLine is set) has basically been the equivalent of [^\n], i.e. it matches everything except for a new line. That's the same new line that's matched by $ and whatnot. This PR is changing the new lines allowed by $ and friends... should it not also change the newlines that . maps to?

Good point. I have a summary of Jeffrey Friedl's breakdown of anchor modifiers here: https://github.com/jzabroski/Home/tree/master/RegularExpressions

This is allowing a new line to be '\n' and \r\n, but also \r. Are there any platforms in use today that use \r as a new line? Do any other regex implementations by any other modern language / framework allow treating \r as a new line? I'm trying to understand the use case for that.

I was not a fan of '\r' as new line, as the only popular OS to adopt this in the last ~30 years was "classic Mac OS". Source: Wikipedia's Newline page. However, beggars cannot be choosers, and so if the .NET maintainers wish to allow matching \r, I see no great harm.

The current line anchor mode in .NET has taken its ancestry from Perl, and was copied by Java, Python, Delphi, and became the PCRE (Perl Compatible Regular Expressions) standard. The mistake, in my opinion, is that PCRE is only useful on UNIX systems and so fixing "end of line" to be \n isn't portable.

Background Noise: In his book, Mastering Regular Expressions, Jeffrey Friedl traces the behavior of \Z all the way back to Ken Thompson creating ed, and Alfred Aho isolating the search code in that editor to become grep and the introduction of the metacharacters we discuss now. However, grep was not a free-standing library and so until 1986-1987, there was no way for programmers to "call" a regex like a function. It was Larry Wall introducing his conventions in Perl that left us where we are today, where regexes aren't strictly portable across Windows and Linux, which is why I submitted this issue in the first place.

stephentoub · 2020-01-10T18:00:46Z

Thank you for the follow-up, though some of the questions I asked are I think still unanswered:

"should it not also change the newlines that . maps to?" I've not looked to see how complicated this would be, but it could have non-trivial perf implications, as it means . would need to map to something like (?:\n|\r\n) (or (?:\n|\r\n|\r) if we allow \r). However, it seems wrong to me to have the meaning of . diverge from the meaning of the anchors.
"Do any other regex implementations by any other modern language / framework allow treating \r as a new line?" I know we consume \r as a valid newline in StreamReader, but that was also done in the days where there was such an OS. I'm not convinced it's worth incurring the additional costs for something that's not actually meaningful today, though there's an argument to be made for consistency with StreamReader. @JeremyKuhne, do you have an opinion?

jzabroski · 2020-01-10T18:12:39Z

"should it not also change the newlines that . maps to?"

Yes, it should.

"Do any other regex implementations by any other modern language / framework allow treating \r as a new line?"

I think this is a matter of taste that no one person should decide, but I was unaware of StreamReader behaving that way, and it does seem nicer from a testing standpoint to have uniform assumptions.

Keep in mind, this flag is opt-in only. So, while it might be "slower" to match this way, users have to opt-in, and when they do, it's most likely so they can use the metacharacters to compactly express intent, not worry about performance. Frankly, having loaded NYSE tick data into kdb+, mechanical sympathy between mmap, bit splitter and target data store is tops in optimizing data loads. This is about every day usability and portability to get stuff done when you need a handy regex.

danmoseley · 2020-01-10T21:54:34Z

It may have been me that introduced \r into the original conversation by putting it in the lookup table. I guess I figured "any new line" doesn't suggest a subset and maybe there's some datasets that use \r.

Admittedly Wikipedia tells me there are several other even more obscure newlines so I guess we would still be drawing a line somewhere. On balance given the name of the option I suggest to include \r but I could go either way.

shishirchawla · 2020-01-11T10:29:07Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs

+                        if (UseOptionA())
+                            AddUnitType(UseOptionM() ? RegexNode.AnyEol : RegexNode.AnyEndZ);
+                        else
+                            AddUnitType(UseOptionM() ? RegexNode.Eol : RegexNode.EndZ);
                        break;

                    case '.':


@stephentoub regarding "should it not also change the newlines that . maps to?", does this look right, assuming that we are allowing '\n', '\r' and '\r\n' ?

The Unicode spec I linked recommends that . can match \r\n as if it was a single character:

Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences, and \u{D A} (CRLF) should match as if it were a single character. (The recommendation that CRLF match as a single character is, however, not required for conformance to RL1.6.)

To achieve that is more complicated than the [^\r\n]. It suggests this (including all the newline characters):

(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}]

The pattern above is missing a paren. I think in .NET that would be (?:\r\n|(?!\r\n)[\n-\r\u0085\u2028\u2029])

(edited -- the hyphen means that 0A through 0D are individually all valid newlines)
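
As a rough illustration of how that expression behaves in .NET today, the sketch below uses it to split text and count newline sequences. The class and member names are hypothetical, and the pattern is written with the range in ascending order (U+000A through U+000D), since .NET rejects a descending range such as [\r-\n].

```csharp
using System;
using System.Text.RegularExpressions;

public static class Tr18NewlineDemo
{
    // One TR18 newline sequence: CRLF as a unit, otherwise any single
    // character from U+000A..U+000D, U+0085, U+2028, or U+2029.
    public const string NewlineSequence =
        @"(?:\r\n|(?!\r\n)[\n-\r\u0085\u2028\u2029])";

    // Splits on every newline sequence; the group is non-capturing,
    // so the separators are not included in the result.
    public static string[] SplitLines(string text) =>
        Regex.Split(text, NewlineSequence);

    // Counts newline sequences, treating "\r\n" as one.
    public static int CountNewlines(string text) =>
        Regex.Matches(text, NewlineSequence).Count;

    public static void Main()
    {
        string text = "a\r\nb\nc\rd\u2028e";
        Console.WriteLine(SplitLines(text).Length);     // 5
        Console.WriteLine(CountNewlines("x\r\n\r\ny")); // 2
    }
}
```

A simple class such as [\r\n] would report four newlines for the second input, which is why the CRLF alternative comes first.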

Do other implementations follow this standard in general?

As noted in https://www.regular-expressions.info/dot.html (section "Line Break Characters") -- it varies. According to this, Java is missing \f (formfeed - \u000C) for example. It does seem that Unicode is a safe standard to choose from.

tl;dr: Yes, agree, TR18 seems perfect:

Nobody can say .NET made an arbitrary, quick decision and missed something obvious

Easier for new C# developers to on-board from other languages

Thanks for finding this standard. I was unaware of it, despite reading Friedl's Mastering Regular Expressions. I searched through my PDF copy of the book, and he doesn't mention TR18. I guess his book was last updated in 2006, and TR18 was published in 2000, so maybe it's taken time for it to become popular. Amusingly, Oracle's java.util.regex documentation does not mention TR18 but instead says "Go read Mastering Regular Expressions for help using this library" 🤣

Related Documentation

An excellent tutorial and overview of regular expressions is Mastering Regular Expressions, Jeffrey E. F. Friedl, O'Reilly and Associates, 1997.

Jokes aside, Rust's regex 1.1.0 crate, RogueWave's Internationalization module, and the eponymous ICU Regular Expression Engine all support TR18 Level 1, at minimum. The package in the R CRAN repository, stringi, internally uses C bindings to the ICU C Regex Engine. It looks like Python 3.7 considered replacing the re module with a regex module that supports ICU as well.

In looking through the .NET implementation, it seems like the one feature necessary for full TR18 compliance would be "character class intersection". That should be a separate issue.

In looking through the .NET implementation, it seems like the one feature necessary for full TR18 compliance would be "character class intersection". That should be a separate issue.

There's more. E.g., we are also missing symmetric difference ([a-g~~b-h]), and it looks like we are missing some properties as well, e.g. I noticed \p{LC} is (accidentally?) missing. I only looked briefly though; I think it would be worth taking a more careful look and opening separate issue(s) for discrepancies.

Incidentally (?:\r\n|(?!\r\n)[\n-\r\u0085\u2028\u2029]) will presumably bypass Stephen's ASCII optimizations to character class matching, even though the pattern might be otherwise ASCII. That might justify some tuning later.

will presumably bypass Stephen's ASCII optimizations to character class matching

Most of it will still apply, e.g. we'll still generate a fast lookup table for ASCII, and only fall back to the slower path if the actual character being examined isn't ASCII. And it'll only happen if you opt-in to this AnyNewLine feature.

Yes, and also: The Level 1 of the specification seems most solid, whereas Level 2 and Level 3 are undergoing change. Level 3 is proposed to be completely removed in the next version, subject to vendor feedback. They're basically asking, "Is this too crazy for anyone to implement?"

They seem to have taken feedback from ECMAScript TC39 as a basic guideline on what's reasonable, as part of JavaScript's major contribution to humanity in the last decade: making sure you can match emojis via a regex character class:

const reRgiEmoji = /\p{RGI_Emoji}/u;

danmoseley · 2020-01-13T23:00:02Z

I hadn't realized, and this wasn't mentioned before, but there is a Unicode spec specifically regarding regular expressions, and it has a section on newline treatment:

http://www.unicode.org/reports/tr18/#RL1.6
It says:

RL1.6 Line Boundaries

To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PARAGRAPH SEPARATOR (U+2029) and LINE SEPARATOR (U+2028).

Re dot, https://www.regular-expressions.info/dot.html describes some of the varying support of newlines in different engines -- varying combinations of the above list of 6.

(Incidentally it also describes \N which is "dot not affected by single line mode". Normally in single line mode, dot will match line breaks; \N will not. This is not listed in RL1.6 though. It mentions \R)

What this means for me is -- it makes sense to keep \r in since Unicode recommends it -- and at some future point, we may want to add support for \u0085, \u2029 and \u2028 as well. (It might be interesting to read this RL1.6 entirely to compare with our implementation in general.)

stephentoub · 2020-01-13T23:04:16Z

at some future point

Do other implementations follow this standard in general?

Seems to me like if we're adding "AnyNewLine" now, we should do our best now to make it as correct as possible. It would be a breaking change in the future to augment the meaning of that with additional characters.

danmoseley · 2020-01-13T23:04:54Z

Sounds fine to me.

danmoseley · 2020-01-13T23:18:30Z

Do other implementations follow this standard in general?

As noted in https://www.regular-expressions.info/dot.html (section "Line Break Characters") -- it varies. According to this, Java is missing \f (formfeed - \u000C) for example. It does seem that Unicode is a safe standard to choose from.

danmoseley · 2020-01-14T22:58:01Z

@shishirchawla I'm guessing @jzabroski might be willing to help push to this branch to help get this PR over the line, if you've set permissions appropriately.

jzabroski · 2020-01-15T01:59:08Z

@shishirchawla I'm sorry I've tormented you. I hate seeing geniuses wait to see their invention unfold to success. Happy to help bring it to life for you.

shishirchawla · 2020-01-15T05:39:28Z

@danmosemsft @jzabroski Sorry about the late response, I am actually out traveling and it would be a bit hard for me to update this in the next couple of days, so please feel free to make any changes and complete the PR. I can pick it up again later if required.

danmoseley · 2020-01-15T20:55:01Z

@jzabroski can you successfully push to this branch? If not - I think @shishirchawla has to check the box - or you can continue work in a new PR

jzabroski · 2020-01-16T15:18:15Z

@jzabroski can you successfully push to this branch? If not - I think @shishirchawla has to check the box - or you can continue work in a new PR

You're just asking me to add:

TR18-style AnyNewLine
Update UseOptionA() documentation to describe it supports TR18-style line endings
Historically, . (unless SingleLine is set) has basically been the equivalent of [^\n]. If AnyNewLine is set, . should be able to match \r\n as a single character. The pattern should be implemented as (?:\r\n|(?!\r\n)[\n-\r\u0085\u2028\u2029])
New tests to cover TR18-style line endings

Additionally, it is worth disclosing that although the Rust regex 1.1.0 crate claims to implement TR18 Level 1, the code suggests otherwise, at least as far as RL1.6 "Unicode New Line" is concerned: https://github.com/rust-lang/regex/blob/master/src/compile.rs#L271-L286 - it uses just \n

The only code that likely actually implements TR18 Level 1 RL1.6 "Unicode New Line" is maybe ICU's C library. icu4j does not implement this, as far as I can tell. It instead implements a class called UnicodeSet, which is very fast but lacks backreferences and doesn't seem to handle line endings at all (if it does, I missed where in the code it does this).

danmoseley · 2020-01-16T16:42:52Z

I think so. I'd like @stephentoub to confirm, since he's got regex top of mind at the moment, and I'm sure we'd all like to avoid asking you guys for yet more iterations 😃

danmoseley · 2020-01-16T16:59:53Z

although the Rust regex 1.1.0 crate claims to implement TR18 Level 1, the code suggests otherwise, at least as far as RL1.6 "Unicode New Line" is concerned:

I see in rust-lang/regex#244 they don't have a plan to support \r\n. I guess this would affect ripgrep as well (?). I understand that VS Code uses ripgrep and apparently they work around it by normalizing \n to \r?\n (just speculation, I didn't really look, but I did verify that $ seems to match \r\n successfully)
https://github.com/microsoft/vscode/blob/ee0960b25bb95d129c638e2f2781282ab9e60793/src/vs/workbench/services/search/node/ripgrepTextSearchEngine.ts#L533-L536

jzabroski · 2020-01-16T17:14:14Z

https://github.com/microsoft/vscode/blob/ee0960b25bb95d129c638e2f2781282ab9e60793/src/vs/workbench/services/search/node/ripgrepTextSearchEngine.ts#L533-L536

I will also submit a PR to update VSCode ripgrep TypeScript code with this comment: https://xkcd.com/208/

carlossanlop · 2020-01-21T20:46:42Z

src/libraries/System.Text.RegularExpressions/ref/System.Text.RegularExpressions.cs

@@ -243,6 +243,7 @@ public enum RegexOptions
        RightToLeft = 64,
        ECMAScript = 256,
        CultureInvariant = 512,
+        AnyNewLine = 1024,


@shishirchawla Please make sure to get this new Enum documented, either with triple slash comments on top of the implementation (src file) or directly in the dotnet-api-docs repo.

stephentoub · 2020-02-17T15:10:25Z

I'd like @stephentoub to confirm

Sounds right.

@jzabroski, should we close this PR and you can open a new one when you're ready?

@shishirchawla, thanks for working on this.

jzabroski · 2020-02-17T15:17:49Z

@stephentoub Yes, I think that's fair - let's keep things tidy & close it, as I imagine it adds onerous burden to see open PRs not getting closed. I tried to get to it this weekend but spent ~4 hours troubleshooting a build issue I was expecting to resolve in 2 minutes. :(

stephentoub · 2020-02-17T15:19:16Z

Thanks. Sorry to hear you're having build issues. Please open an issue if they continue and we can get you unblocked. @ViktorHofer

jzabroski · 2020-02-17T15:20:54Z

Thanks. Sorry to hear you're having build issues. Please open an issue if they continue and we can get you unblocked. @ViktorHofer

Thanks - issue is not with this project, but with FluentMigrator project I co-maintain. I pinged rainer for help. Works in Visual Studio, not in JetBrains Rider.

danmoseley · 2020-07-27T01:01:26Z

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs

@@ -1764,6 +1778,7 @@ private static RegexOptions OptionFromCode(char ch)
                'd' => RegexOptions.Debug,
 #endif
                'e' => RegexOptions.ECMAScript,
+                'a' => RegexOptions.AnyNewLine,


Just noticed that the original proposal didn't suggest we add an inline option for AnyNewLine - it's reasonable to do but we'd need tests for it. It looks like we don't have a great set of tests for existing inline regex options other than just setting them once.
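
As a hedged sketch of what fuller inline-option coverage might exercise, the example below uses only inline options that already exist (i and m); the proposed 'a' code is deliberately not used, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineOptionDemo
{
    // (?i) turns IgnoreCase on for what follows; (?-i) turns it off again.
    public static bool ToggledMidPattern(string input) =>
        Regex.IsMatch(input, "^(?i)a(?-i)b$");

    // (?i:...) applies the option inside the group only.
    public static bool ScopedToGroup(string input) =>
        Regex.IsMatch(input, "^(?i:a)b$");

    // (?m) lets '$' match before any '\n', not only the final one.
    public static bool MultilineInline(string input) =>
        Regex.IsMatch(input, "(?m)a$");

    public static void Main()
    {
        Console.WriteLine(ToggledMidPattern("Ab")); // True
        Console.WriteLine(ToggledMidPattern("AB")); // False
        Console.WriteLine(ScopedToGroup("aB"));     // False
        Console.WriteLine(MultilineInline("a\nb")); // True
    }
}
```

Tests for a new inline code would presumably follow the same shape: turning the option on, turning it off mid-pattern, and scoping it to a group.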

Dotnet-GitSync-Bot added the area-System.Text.RegularExpressions label Jan 8, 2020

shishirchawla mentioned this pull request Jan 8, 2020

Adds a new regex option - RegexOptions.AnyNewLine. dotnet/corefx#41195

Closed

danmoseley requested review from pgovind and stephentoub January 8, 2020 17:25

pgovind reviewed Jan 8, 2020

jzabroski mentioned this pull request Jan 9, 2020

Official build uptake failure: "Unable to find an entry point named 'AppleCryptoNative_SslCreateContext'", "CryptoNative_Tls13Supported", during dotnet restore #1129

Closed

shishirchawla added 2 commits January 10, 2020 04:57

Merge pull request #1 from dotnet/master

0bdab8a

Sync with dotnet

Added new regex option RegexOptions.AnyNewLine

55c713b

shishirchawla force-pushed the dev/shchawl/anynewline branch from f1822ed to 55c713b Compare January 10, 2020 13:25

shishirchawla commented Jan 11, 2020

carlossanlop reviewed Jan 21, 2020

carlossanlop added the new-api-needs-documentation label Jan 21, 2020

danmoseley assigned jzabroski and shishirchawla Jan 24, 2020

jzabroski mentioned this pull request Feb 17, 2020

Fix dotnet/runtime#1449 jzabroski/Home#3

Open

stephentoub closed this Feb 17, 2020

danmoseley mentioned this pull request Apr 6, 2020

Optimize newline handling for RegexOptions.Multiline #34566

Merged

danmoseley reviewed Jul 27, 2020

danmoseley mentioned this pull request Jul 31, 2020

Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=\r\z|\n\z|\r\n\z|\z) #25598

Open

ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
