Document that for "usual" regex behavior `multiline` is required #232

Hi-Angel · 2024-10-15T19:55:26Z

Description of the change

Regular expression users typically expect that a `$` in a multiline string matches the end of the current line, not the end of the whole string many lines later. This is the default behavior in pretty much every regexp engine: `grep`, `perl`, text editors, you name it… So it is fair to anticipate that expectation and warn the user that passing `multiline` is necessary.
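
For illustration, here is a minimal TypeScript sketch of the JS RegExp behavior described above. It uses plain `RegExp` rather than the PureScript `regex` parser, and the input string and variable names are made up for the example.

```typescript
// Plain JS RegExp semantics, outside of any parser library.
const text: string = "first line\nsecond line";

// Without the `m` flag, `$` matches only at the very end of the input,
// and `.` never crosses a newline, so this pattern cannot match at all.
const withoutMultiline: RegExpExecArray | null = new RegExp("^.*$").exec(text);

// With the `m` flag, `$` also matches right before a line terminator.
const withMultiline: RegExpExecArray | null = new RegExp("^.*$", "m").exec(text);

console.log(withoutMultiline);   // null
console.log(withMultiline?.[0]); // "first line"
```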

Fixes: #231

Checklist:

Added the change to the changelog's "Unreleased" section with a link to this PR and your username
Linked any existing issues or proposals that this pull request should close
Updated or added relevant documentation in the README and/or documentation directory
Added a test for the contribution (if applicable)

Hi-Angel · 2024-10-15T19:56:07Z

So, this is my suggestion to improve the docs. I'd additionally propose replacing the `noFlags` in the example code with `multiline`, WDYT?

Hi-Angel · 2024-10-15T19:57:55Z

Actually, let me edit to show what I mean

Hi-Angel · 2024-10-15T19:58:22Z

Done

jamesdbrock · 2024-10-16T05:53:32Z

src/Parsing/String.purs

@@ -235,7 +238,7 @@ match p = do
 -- | capture the regular expression pattern `x*`.
 -- |
 -- | ```purescript
-- | case regex "x*" noFlags of
+-- | case regex "x*" multiline of


Instead of changing the basic example, can you please write a new example in the Flags section which shows the effect of the multiline flag?

jamesdbrock · 2024-10-16T05:54:38Z

src/Parsing/String.purs

@@ -195,6 +195,9 @@ match p = do

 -- | Compile a regular expression `String` into a regular expression parser.
 -- |
+-- | Note that per JS RegExp semantics matching a single line in a multiline


This is good, thanks, but would you please put it down below in the Flags section?

Hi-Angel · 2024-10-16T11:06:07Z

@jamesdbrock please see if you're okay with the changes. Regarding the example, I realized the existing example passes dotAll, but doesn't really use its functionality. So I replaced it with noFlags, and instead added a third example that makes use of dotAll.

As a matter of fact, the two new examples show the distinction between just multiline and dotAll, as in, that dotAll will match everything including newlines, whereas multiline would only match till end-of-line.
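
The distinction can be sketched in TypeScript against plain JS `RegExp` (not the PureScript examples from the PR). The `firstMatch` helper and the input text are hypothetical names chosen for this illustration; `"m"` and `"s"` are the JS flag letters for multiline and dotAll.

```typescript
const text: string = "first line\nsecond line";

// Hypothetical helper: return the text of the first match, or null.
const firstMatch = (pattern: string, flags: string): string | null => {
  const m = new RegExp(pattern, flags).exec(text);
  return m === null ? null : m[0];
};

// multiline: `$` stops at the end of the current line.
const multilineOnly: string | null = firstMatch("^.*$", "m");

// dotAll: `.` also matches newlines, so `.*` runs to the end of the input.
const dotAllOnly: string | null = firstMatch("^.*$", "s");

console.log(JSON.stringify(multilineOnly)); // "first line"
console.log(JSON.stringify(dotAllOnly));    // "first line\nsecond line"
```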

garyb · 2024-10-16T13:49:30Z

I think these updates are definitely helpful but it might also be good to do something to reinforce the fact that the regex parser only operates from the current parser location. We could explain that by saying it inserts an implicit ^(...) grouping around the provided pattern, and/or show an example where it fails to match something that a pattern might otherwise be expected to match in the absence of ^.

Hi-Angel · 2024-10-16T13:54:30Z

> I think these updates are definitely helpful but it might also be good to do something to reinforce the fact that the regex parser only operates from the current parser location. We could explain that by saying it inserts an implicit ^(...) grouping around the provided pattern, and/or show an example where it fails to match something that a pattern might otherwise be expected to match in the absence of ^.

Well, "operating from the current position" is just how every parser works, and the ^ auto-insertion per my understanding is just an invisible internal detail.

It seems to me that the only user-visible behavior is that a ^ character the user inserts into the regexp will always match, so maybe just add that?
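
A rough TypeScript sketch of the implicit `^(...)` wrapping mentioned above, assuming the parser applies the wrapped pattern to the remaining input. `matchAtPosition` is a hypothetical helper for illustration, not the library's implementation.

```typescript
// Hypothetical helper: try `pattern` at offset `pos` of `input`,
// wrapping it in an implicit ^(...) group and using no flags.
function matchAtPosition(input: string, pos: number, pattern: string): string | null {
  const anchored = new RegExp("^(" + pattern + ")");
  const m = anchored.exec(input.slice(pos));
  return m === null ? null : m[0];
}

const input: string = "foo bar";

// At offset 4 the remaining input is "bar".
const plain: string | null = matchAtPosition(input, 4, "bar");
// A user-supplied ^ is redundant here: it matches at the current position.
const withCaret: string | null = matchAtPosition(input, 4, "^bar");
// From offset 0 the pattern does not skip over "foo " to find "bar".
const fromStart: string | null = matchAtPosition(input, 0, "bar");

console.log(plain, withCaret, fromStart); // bar bar null
```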

garyb · 2024-10-16T13:57:21Z

Yeah, maybe you're right, it's just because of the way it was interacting with multiline stuff that made me think it needed more explanation.

Hi-Angel · 2024-10-16T14:06:19Z

So, I'm writing the note about ^ always matching the beginning, and it got me thinking: is there any case where ^ could be useful at all? For example, is it possible to craft such a combination of flags and regex where ^ would match something in the middle of the string? If not, I presume it's worth mentioning that ^ is practically useless and should not be used.

garyb · 2024-10-16T14:14:13Z

I think in general no, it doesn't matter what you do with ^, it will have no effect on the pattern... but it has just occurred to me that using multiline in this parser violates that assertion that it operates from the current position: the pattern /^foo.*$/m will match "foo" out of "bar\nfoo".
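
The case above can be checked directly against JS `RegExp`. This is a small TypeScript sketch with made-up variable names; it shows only the regex behavior, not the parser.

```typescript
const input: string = "bar\nfoo";

// With the `m` flag, ^ also matches right after a line terminator,
// so the search can succeed in the middle of the input.
const m: RegExpExecArray | null = new RegExp("^foo.*$", "m").exec(input);
const matchedText: string | null = m === null ? null : m[0];
const matchOffset: number | null = m === null ? null : m.index;

// Without the flag, ^ matches only at offset 0, where the input has "bar".
const withoutFlag: RegExpExecArray | null = new RegExp("^foo.*$").exec(input);

console.log(matchedText, matchOffset); // foo 4
console.log(withoutFlag);              // null
```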


Hi-Angel · 2024-10-16T14:28:03Z

Hahah, indeed, odd, even though ^.* would match bar. Alright, I'm not sure what to mention about that corner case, so I just wrote that ^ will match the current position even in the absence of a preceding newline. I wasn't sure where to put it, but I saw there was a text ending with "[match] starting at the current parser position", and from the communication on the linked issue I assumed the text was referring to the situation discussed, so I added it to that paragraph.

Either way, please see if it looks okay or maybe you'd prefer to change something 😊

garyb · 2024-10-16T16:55:50Z

Ah sorry, I left the thought in that comment unfinished. I think maybe we should change the options that can even be used with regex parsing to exclude multiline:

logShow $ runParser "some\nvarious\nlines" (regexP "various$" *> PS.rest)

> (Right "rious\nlines")

This is because it advances the consumed parser position by the length of the first pattern match. It would need to be able to know the offset of that match as well as the length of it to update consumed correctly. We could get that through running search too perhaps, but I think it would be nice to disallow flag(s) that don't make sense for a parser style definition.

global and sticky also don't really make sense here, in that they'll have no effect. (I'm actually not really sure if sticky ever makes sense given the interface we provide for RegExp).
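
The position bug described above can be reproduced in a few lines of TypeScript. This sketch assumes the implicit `^(...)` wrapping mentioned earlier in the thread and mimics a parser that advances only by the match length; the variable names are invented for the illustration.

```typescript
const input: string = "some\nvarious\nlines";

// Assumed shape of the compiled pattern: the user's "various$" wrapped in ^(...).
const m: RegExpExecArray | null = new RegExp("^(various$)", "m").exec(input);
if (m === null) {
  throw new Error("expected the multiline pattern to match");
}

const matchedText = m[0];   // "various"
const matchOffset = m.index; // 5: the match starts after "some\n"

// Advancing by the match length alone ignores the offset,
// so the "rest" starts in the middle of the word.
const restAfterLengthOnly = input.slice(matchedText.length);

console.log(JSON.stringify(restAfterLengthOnly)); // "rious\nlines"
```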

Hi-Angel · 2024-10-16T17:21:55Z

> Ah sorry, I left the thought in that comment unfinished. I think maybe we should change the options that can even be used with regex parsing to exclude multiline:
>
> logShow $ runParser "some\nvarious\nlines" (regexP "various$" *> PS.rest)
>
> > (Right "rious\nlines")
>
> This is because it advances the consumed parser position by the length of the first pattern match. It would need to be able to know the offset of that match as well as the length of it to update consumed correctly. We could get that through running search too perhaps, but I think it would be nice to disallow flag(s) that don't make sense for a parser style definition.

Well, this is an interesting bug, but I don't think it warrants removing multiline, because being able to match at least eol is a heavily used and necessary feature. I am saying this as someone who has a fair amount of experience modifying/writing/debugging Emacs syntax parsers, which are usually regexp-based. At the same time, combining regex and PS.rest sounds like a pretty rare thing to do.

garyb · 2024-10-16T23:28:45Z

The use of PS.rest was to show that the parser state after that multiline match is incorrect.

Hi-Angel · 2024-10-17T04:57:22Z

Ah… Well, I'm not sure what to say… Disallowing multiline sounds very wrong, because that's the OOTB behavior people would expect from regular expressions (per reasons explained in the commit message and first post).

jamesdbrock · 2024-10-17T07:03:23Z

src/Parsing/String.purs

-- |   Left compileError -> unsafeCrashWith $ "xMany failed to compile: " <> compileError
-- |   Right xMany -> runParser "xxxZ" do
-- |     xMany
+-- | example re flags text =


The first example should show the simplest basic usage. Adding this example helper function is an extra indirection.

But the code has always been there, I just moved it to a separate function. I can move the function after the example if you want.

jamesdbrock · 2024-10-17T07:03:55Z

src/Parsing/String.purs

+-- |     Right xMany -> runParser text do
+-- |       xMany
+-- |
+-- | -- Capturing a string per `x*` regex.


For separate examples I think we should have separate markdown code blocks.

Wouldn't it be a lot of duplicate code for a reader to dig through?

jamesdbrock · 2024-10-17T07:07:40Z

src/Parsing/String.purs

-- | a `regex` parser. The other flags will
-- | probably cause surprising behavior and you should avoid them.
+-- | The `dotAll`, `multiline`, `unicode`, and `ignoreCase` flags might make
+-- | sense for a `regex` parser. In fact, per JS RegExp semantics matching a


I think that before we encourage people to use the multiline flag, we should add a bunch of multiline test cases to the test suite to be sure that multiline parsing works the way that we think it does.

Well, that makes the whole PR a moot point. multiline is the behavior users would expect from regular expressions (as explained in the commit message), which is why after stumbling upon odd behavior in the parser and finding out that the usual behavior of regex isn't the one that noFlags gives, I went to document how it should work. Now you're pushing back against that. I don't get it, but you're the maintainer, so ok.

> multiline is the behavior users would expect from regular expressions

To add to that: besides the explanation referred to, there's also the point that explicitly matching newline characters is often frowned upon. This is because there exist at least 3 different styles of newlines. So the usual advice is not to match a \n for example, but to match a $ instead, which again implies multiline.
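
As a small illustration of the newline-style point, the TypeScript sketch below runs one multiline pattern over three inputs that differ only in their line terminator. It relies on JS `RegExp` treating `\n`, `\r` and `\r\n` as line boundaries for `$` under the `m` flag; the sample inputs are invented.

```typescript
// The same logical content with LF, CRLF and CR line endings.
const inputs: string[] = [
  "key=value\nnext",
  "key=value\r\nnext",
  "key=value\rnext",
];

// `$` with the `m` flag matches before any of these terminators,
// so the pattern never has to name a specific newline character.
const firstLines: (string | null)[] = inputs.map((s) => {
  const m = new RegExp("^.*$", "m").exec(s);
  return m === null ? null : m[0];
});

console.log(firstLines); // [ 'key=value', 'key=value', 'key=value' ]
```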

I think personally I just view this parser as serving a different purpose than for what enabling multiline would give it. I view it as similar to satisfy, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.

It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.

(I'm not saying I'm right in this opinion, just giving context for why I wouldn't miss multiline at all).

> I think personally I just view this parser as serving a different purpose than for what enabling multiline would give it. I view it as similar to satisfy, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.

FWIW, the current documentation encourages preferring regex over other parsers. Quoting the relevant block:

> This parser may be useful for quickly consuming a large section of the input String, because in a JavaScript runtime environment a compiled RegExp is a lot faster than a monadic parser built from parsing primitives.

> It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.

Fair, although till now I hadn't even noticed the lack of groups, because the "combinators style" implies that when there's a pattern one wants to retrieve, you'd just consume it separately.

> Well, that makes whole PR a moot point. multiline is the behavior users would expect from regular expressions, which is why after stumbling upon odd behavior in the parser and finding out that the usual behavior of regex isn't the one that noFlags gives, I went to document how it should work.

Do we even understand how it works though? I checked @garyb ’s case and confirmed that

runParser "some\nvarious\nlines" do
  m <- fromRight' unsafeCoerce $ regex "various$" multiline
  r <- rest
  pure $ Tuple m r

gives the result

(Right (Tuple "various" "rious\nlines"))

which is surprising and wrong.

So I think that the current advice in the docs

is pretty good advice until we decide what kind of behavior the multiline flag should have, fix it so that it behaves that way, and test it.

Thanks for calling attention to this, though. It would be nice if the multiline flag just worked in an intuitively correct way, but it currently doesn't.

It's just plain broken with multiline currently. It's advancing consumed by 7 for the "various" match, but should advance it an additional 5 as that's the position that "various" starts (I made sure to vary the line lengths for the example to reinforce how weird the behaviour is and have the resulting state start mid-line, as I tried it with "foo\nbar\nbaz" at first and almost ended up confusing myself 😄)

It is fixable if we want to support it though, we'd have to do something like perform a search to find the offset of the match and include that when updating the consumed position of the parser. Or most optimally, offer another function in purescript-strings that can report the offset along with the match, since the info is there in the returned object in the underlying JS, and then use that in the implementation here.
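
A hedged TypeScript sketch of the offset-aware update described above: the JS match object exposes `index`, so the consumed position can be computed as offset plus length. `consumeWithOffset` and the `Consumed` type are hypothetical names for this illustration, not a proposed API for purescript-parsing or purescript-strings.

```typescript
interface Consumed {
  matched: string; // text of the match
  rest: string;    // input remaining after the match
}

// Hypothetical helper: wrap the pattern in ^(...) and advance by
// offset + length instead of by length alone.
function consumeWithOffset(input: string, pattern: string, flags: string): Consumed | null {
  const m = new RegExp("^(" + pattern + ")", flags).exec(input);
  if (m === null) {
    return null;
  }
  const end = m.index + m[0].length;
  return { matched: m[0], rest: input.slice(end) };
}

const result: Consumed | null = consumeWithOffset("some\nvarious\nlines", "various$", "m");
console.log(JSON.stringify(result)); // {"matched":"various","rest":"\nlines"}
```

Note that this version still skips over `some\n` without reporting it, so whether a `regex` parser should be allowed to jump ahead like that remains a separate design question.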

I made an issue #233

Hi-Angel mentioned this pull request Oct 15, 2024

Regex for "match everything till eol" matches nothing #231

Open

Hi-Angel force-pushed the fix-docs branch from 8baad6d to 73d9955 Compare October 15, 2024 19:58

jamesdbrock reviewed Oct 16, 2024

Hi-Angel force-pushed the fix-docs branch 2 times, most recently from a7eaab0 to 08e6a06 Compare October 16, 2024 11:05

Hi-Angel force-pushed the fix-docs branch from 08e6a06 to d1adc17 Compare October 16, 2024 14:27

jamesdbrock reviewed Oct 17, 2024

jamesdbrock mentioned this pull request Oct 17, 2024

multiline flag doesn't work #233

Open
