Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document that for "usual" regex behavior multiline is required #232

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Hi-Angel
Copy link

Description of the change

Regular expression users typically expect that matching a $ in a multiline string would match the end of current line and not the end of the string past many lines. This is default behavior in pretty much every regexp engine: grep, perl, text editors, you name it… So it is fair to expect such expectation, so warn a user about necessity to pass multiline

Fixes: #231


Checklist:

  • Added the change to the changelog's "Unreleased" section with a link to this PR and your username
  • Linked any existing issues or proposals that this pull request should close
  • Updated or added relevant documentation in the README and/or documentation directory
  • Added a test for the contribution (if applicable)

@Hi-Angel
Copy link
Author

So, this is my suggestion to improve the docs. I'd additionally propose to replace the noFlags in the example code with multiline, WDYT?

@Hi-Angel
Copy link
Author

Actually, let me edit to show what I mean

@Hi-Angel
Copy link
Author

Done

@@ -235,7 +238,7 @@ match p = do
-- | capture the regular expression pattern `x*`.
-- |
-- | ```purescript
-- | case regex "x*" noFlags of
-- | case regex "x*" multiline of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of changing the basic example, can you please instead write a new example in the Flags section which shows the effect of the multiline flag?

@@ -195,6 +195,9 @@ match p = do

-- | Compile a regular expression `String` into a regular expression parser.
-- |
-- | Note that per JS RegExp semantics matching a single line in a multiline
Copy link
Member

@jamesdbrock jamesdbrock Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, thanks, but would you please put it down below in the Flags section?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@Hi-Angel Hi-Angel force-pushed the fix-docs branch 2 times, most recently from a7eaab0 to 08e6a06 Compare October 16, 2024 11:05
@Hi-Angel
Copy link
Author

@jamesdbrock please see if you're okay with changes. Regarding example, I realized the existing example passes dotAll, but doesn't really use its functional. So I replaced it with noFlags, and instead added a third example that makes use of dotAll.

As a matter of fact, the two new examples show the distinction between just multiline and dotAll, as in, that dotAll will match everything including newlines, whereas multiline would only match till end-of-line.

@garyb
Copy link
Member

garyb commented Oct 16, 2024

I think these updates are definitely helpful but it might also be good to do something to reinforce the fact that the regex parser only operates from the current parser location. We could explain that by saying it inserts an implicit ^(...) grouping around the provided pattern, and/or show an example where it fails to match something that a pattern might otherwise be expected to match in the absence of ^.

@Hi-Angel
Copy link
Author

I think these updates are definitely helpful but it might also be good to do something to reinforce the fact that the regex parser only operates from the current parser location. We could explain that by saying it inserts an implicit ^(...) grouping around the provided pattern, and/or show an example where it fails to match something that a pattern might otherwise be expected to match in the absence of ^.

Well, "operating from the current position" is just how every parser works, and the ^ auto-insertion per my understanding is just an invisible internal detail.

It seems to me that the only user-visible behavior is that ^ character that a user might insert into the regexp will always match, so maybe just add that?

@garyb
Copy link
Member

garyb commented Oct 16, 2024

Yeah, maybe you're right, it's just because of the way it was interacting with multiline stuff that made me think it needed more explanation.

@Hi-Angel
Copy link
Author

Hi-Angel commented Oct 16, 2024

So, I'm writing the note about ^ always matching the beginning, and it made me thinking: is there any case where ^ could at all be useful? For example, is it possible to craft such combination of flags and regex where ^ would match something in the middle of the string? If not, I presume worth mentioning that ^ is practically useless and should not be used.

@garyb
Copy link
Member

garyb commented Oct 16, 2024

I think in general no, it doesn't matter what you do with ^, it will have no effect on the pattern... but it has just occurred to me that using multiline in this parser violates that assertion that it operates from the current position: the pattern /^foo.*$/m will match "foo" out of "bar\nfoo".

Regular expression users typically expect that matching a `$` in a
multiline string would match the end of current line and not the end
of the string past many lines. This is default behavior in pretty much
every regexp engine: `grep`, `perl`, text editors, you name it… So it
is fair to expect such expectation, so warn a user about necessity to
pass `multiline`

Fixes: purescript-contrib#231
@Hi-Angel
Copy link
Author

Hahah, indeed, odd, even though ^.* would match bar. Alright, I'm not sure what to mention about that corner-case, so I just wrote that ^ will match the current position even in absence of a preceding newline. I wasn't sure where to put it, but I see there was a text ending with [match] starting at the current parser position, and from the communication on the linked issue I assumed the text was referring to the situation discussed, so I added to that paragraph.

Either way, please see if it looks okay or maybe you'd prefer to change something 😊

@garyb
Copy link
Member

garyb commented Oct 16, 2024

Ah sorry, I left the thought in that comment unfinished. I think maybe we should change the options that can even be used with regex parsing to exclude multiline:

logShow $ runParser "some\nvarious\nlines" (regexP "various$" *> PS.rest)

> (Right "rious\nlines")

This is because it advances the consumed parser position by the length of the first pattern match. It would need to be able to know the offset of that match as well as the length of it to update consumed correctly. We could get that through running search too perhaps, but I think it would be nice to disallow flag(s) that don't make sense for a parser style definition.

global and sticky also don't really make sense here, in that they'll have no effect. (I'm actually not really sure if sticky ever makes sense given the interface we provide for RegExp).

@Hi-Angel
Copy link
Author

Ah sorry, I left the thought in that comment unfinished. I think maybe we should change the options that can even be used with regex parsing to exclude multiline:

logShow $ runParser "some\nvarious\nlines" (regexP "various$" *> PS.rest)

> (Right "rious\nlines")

This is because it advances the consumed parser position by the length of the first pattern match. It would need to be able to know the offset of that match as well as the length of it to update consumed correctly. We could get that through running search too perhaps, but I think it would be nice to disallow flag(s) that don't make sense for a parser style definition.

Well, this is an interesting bug, but I don't think it warrants removing multiline, because being able to match at least eol is very much used and necessary functional. I am saying this as someone who had an assortment of experience modifying/writing/debugging Emacs syntax parsers, which are usually regexp-based. At the same time, combining regex and PS.rest sounds like pretty rare thing to do.

@garyb
Copy link
Member

garyb commented Oct 16, 2024

The use of PS.rest was to show the parser state after that multiline match is incorrect.

@Hi-Angel
Copy link
Author

Ah… Well, I'm not sure what to say… Disallowing multiline sounds very wrong, because that's the OOTB behavior people would expect from regular expressions (per reasons explained in the commit message and first post).

-- | Left compileError -> unsafeCrashWith $ "xMany failed to compile: " <> compileError
-- | Right xMany -> runParser "xxxZ" do
-- | xMany
-- | example re flags text =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first example should show the simplest basic usage. Adding this example helper function is an extra indirection.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the code has always been there, I just moved it to a separate function. I can move the function after example if you want.

-- | Right xMany -> runParser text do
-- | xMany
-- |
-- | -- Capturing a string per `x*` regex.
Copy link
Member

@jamesdbrock jamesdbrock Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For separate examples I think would should have separate markdown code blocks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be a lot of duplicate code for a reader to dig through?

-- | a `regex` parser. The other flags will
-- | probably cause surprising behavior and you should avoid them.
-- | The `dotAll`, `multiline`, `unicode`, and `ignoreCase` flags might make
-- | sense for a `regex` parser. In fact, per JS RegExp semantics matching a
Copy link
Member

@jamesdbrock jamesdbrock Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that before we encourage people to use the multiline flag until we should add a bunch of multiline test cases to the test suite to be sure that multiline parsing works the way that we think it does.

Copy link
Author

@Hi-Angel Hi-Angel Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that makes whole PR a moot point. multiline is the behavior users would expect from regular expressions (as explained in the commit message), which is why after stumbling upon odd behavior in the parser and finding out that the usual behavior of regex isn't the one that noFlags gives, I went to document how it should work. Now you're pushing against. I don't get that, but you're the maintainer, so ok.

Copy link
Author

@Hi-Angel Hi-Angel Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiline is the behavior users would expect from regular expressions

to add to that: besides the explanation referred, there's also a point that explicitly matching newline characters is often frowned upon. This is because there exist at least 3 different styles of newlines. So the usual advice is not to match a \n for example, but to match a $ instead, which again implies multiline.

Copy link
Member

@garyb garyb Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think personally I just view this parser as serving a different purpose than for what enabling multiline would give it. I view it as similar to satisfy, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.

It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.

(I'm not saying I'm right in this opinion, just giving context for why I wouldn't miss multiline at all).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think personally I just view this parser as serving a different purpose than for what enabling multiline would give it. I view it as similar to satisfy, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.

FWIW, the current documentation encourages to prefer regex over other parsers. Quoting the relevant block:

This parser may be useful for quickly consuming a large section of the input String, because in a JavaScript runtime environment a compiled RegExp is a lot faster than a monadic parser built from parsing primitives.


It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.

Fair, although till now I haven't even noticed lack of groups, because the "combinators style" implies that when there's a pattern one wants to retrieve, you'd just consume it separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that makes whole PR a moot point. multiline is the behavior users would expect from regular expressions, which is why after stumbling upon odd behavior in the parser and finding out that the usual behavior of regex isn't the one that noFlags gives, I went to document how it should work.

Do we even understand how it works though? I checked @garyb ’s case and confirmed that

  runParser "some\nvarious\nlines" do
    m <- fromRight' unsafeCoerce $ regex "various$" multiline
    r <- rest
    pure $ Tuple m r

gives the result

(Right (Tuple "various" "rious\nlines"))

which is surprising and wrong.

So I think that the current advice in the docs

image

is pretty good advice until we decide what kind of behavior the multiline flag should have, fix it so that it behaves that way, and test it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for calling attention to this, though. It would be nice if the multiline flag just worked in an intuitively correct way, but it currently doesn't.

Copy link
Member

@garyb garyb Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just plain broken with multiline currently. It's advancing consumed by 7 for the "various" match, but should advance it an additional 5 as that's the position that "various" starts (I made sure to vary the line lengths for the example to reinforce how weird the behaviour is and have the resulting state start mid-line, as tried it with "foo\nbar\nbaz" at first and almost ended up confusing myself 😄)

It is fixable if we want to support it though, we'd have to do something like perform a search to find the offset of the match and include that when updating the consumed position of the parser. Or most optimally, offer another function in purescript-strings that can report the offset along with the match, since the info is there in the returned object in the underlying JS, and then use that in the implementation here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made an issue #233

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Regex for "match everything till eol" matches nothing
3 participants