-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document that for "usual" regex behavior multiline
is required
#232
base: main
Are you sure you want to change the base?
Conversation
So, this is my suggestion to improve the docs. I'd additionally propose to replace the |
Actually, let me edit to show what I mean |
Done |
src/Parsing/String.purs
Outdated
@@ -235,7 +238,7 @@ match p = do | |||
-- | capture the regular expression pattern `x*`. | |||
-- | | |||
-- | ```purescript | |||
-- | case regex "x*" noFlags of | |||
-- | case regex "x*" multiline of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of changing the basic example, can you please instead write a new example in the Flags section which shows the effect of the multiline
flag?
src/Parsing/String.purs
Outdated
@@ -195,6 +195,9 @@ match p = do | |||
|
|||
-- | Compile a regular expression `String` into a regular expression parser. | |||
-- | | |||
-- | Note that per JS RegExp semantics matching a single line in a multiline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good, thanks, but would you please put it down below in the Flags section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
a7eaab0
to
08e6a06
Compare
@jamesdbrock please see if you're okay with changes. Regarding example, I realized the existing example passes As a matter of fact, the two new examples show the distinction between just |
I think these updates are definitely helpful but it might also be good to do something to reinforce the fact that the regex parser only operates from the current parser location. We could explain that by saying it inserts an implicit |
Well, "operating from the current position" is just how every parser works, and the It seems to me that the only user-visible behavior is that |
Yeah, maybe you're right, it's just because of the way it was interacting with multiline stuff that made me think it needed more explanation. |
So, I'm writing the note about |
I think in general no, it doesn't matter what you do with |
Regular expression users typically expect that matching a `$` in a multiline string would match the end of current line and not the end of the string past many lines. This is default behavior in pretty much every regexp engine: `grep`, `perl`, text editors, you name it… So it is fair to expect such expectation, so warn a user about necessity to pass `multiline` Fixes: purescript-contrib#231
Hahah, indeed, odd, even though Either way, please see if it looks okay or maybe you'd prefer to change something 😊 |
Ah sorry, I left the thought in that comment unfinished. I think maybe we should change the options that can even be used with regex parsing to exclude
This is because it advances the
|
Well, this is an interesting bug, but I don't think it warrants removing |
The use of |
Ah… Well, I'm not sure what to say… Disallowing |
-- | Left compileError -> unsafeCrashWith $ "xMany failed to compile: " <> compileError | ||
-- | Right xMany -> runParser "xxxZ" do | ||
-- | xMany | ||
-- | example re flags text = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first example should show the simplest basic usage. Adding this example
helper function is an extra indirection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the code has always been there, I just moved it to a separate function. I can move the function after example if you want.
-- | Right xMany -> runParser text do | ||
-- | xMany | ||
-- | | ||
-- | -- Capturing a string per `x*` regex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For separate examples I think would should have separate markdown code blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be a lot of duplicate code for a reader to dig through?
-- | a `regex` parser. The other flags will | ||
-- | probably cause surprising behavior and you should avoid them. | ||
-- | The `dotAll`, `multiline`, `unicode`, and `ignoreCase` flags might make | ||
-- | sense for a `regex` parser. In fact, per JS RegExp semantics matching a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that before we encourage people to use the multiline
flag until we should add a bunch of multiline
test cases to the test suite to be sure that multiline
parsing works the way that we think it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that makes whole PR a moot point. multiline
is the behavior users would expect from regular expressions (as explained in the commit message), which is why after stumbling upon odd behavior in the parser and finding out that the usual behavior of regex isn't the one that noFlags
gives, I went to document how it should work. Now you're pushing against. I don't get that, but you're the maintainer, so ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multiline
is the behavior users would expect from regular expressions
to add to that: besides the explanation referred, there's also a point that explicitly matching newline characters is often frowned upon. This is because there exist at least 3 different styles of newlines. So the usual advice is not to match a \n
for example, but to match a $
instead, which again implies multiline
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think personally I just view this parser as serving a different purpose than for what enabling multiline
would give it. I view it as similar to satisfy
, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.
It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.
(I'm not saying I'm right in this opinion, just giving context for why I wouldn't miss multiline
at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think personally I just view this parser as serving a different purpose than for what enabling
multiline
would give it. I view it as similar tosatisfy
, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.
FWIW, the current documentation encourages to prefer regex
over other parsers. Quoting the relevant block:
This parser may be useful for quickly consuming a large section of the input
String
, because in a JavaScript runtime environment a compiled RegExp is a lot faster than a monadic parser built from parsing primitives.
It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.
Fair, although till now I haven't even noticed lack of groups, because the "combinators style" implies that when there's a pattern one wants to retrieve, you'd just consume it separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that makes whole PR a moot point.
multiline
is the behavior users would expect from regular expressions, which is why after stumbling upon odd behavior in the parser and finding out that the usual behavior of regex isn't the one thatnoFlags
gives, I went to document how it should work.
Do we even understand how it works though? I checked @garyb ’s case and confirmed that
runParser "some\nvarious\nlines" do
m <- fromRight' unsafeCoerce $ regex "various$" multiline
r <- rest
pure $ Tuple m r
gives the result
(Right (Tuple "various" "rious\nlines"))
which is surprising and wrong.
So I think that the current advice in the docs
is pretty good advice until we decide what kind of behavior the multiline
flag should have, fix it so that it behaves that way, and test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for calling attention to this, though. It would be nice if the multiline
flag just worked in an intuitively correct way, but it currently doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just plain broken with multiline
currently. It's advancing consumed
by 7 for the "various"
match, but should advance it an additional 5 as that's the position that "various"
starts (I made sure to vary the line lengths for the example to reinforce how weird the behaviour is and have the resulting state start mid-line, as tried it with "foo\nbar\nbaz"
at first and almost ended up confusing myself 😄)
It is fixable if we want to support it though, we'd have to do something like perform a search
to find the offset of the match and include that when updating the consumed
position of the parser. Or most optimally, offer another function in purescript-strings
that can report the offset along with the match, since the info is there in the returned object in the underlying JS, and then use that in the implementation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made an issue #233
Description of the change
Regular expression users typically expect that matching a
$
in a multiline string would match the end of current line and not the end of the string past many lines. This is default behavior in pretty much every regexp engine:grep
,perl
, text editors, you name it… So it is fair to expect such expectation, so warn a user about necessity to passmultiline
Fixes: #231
Checklist: