-
Notifications
You must be signed in to change notification settings - Fork 424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for bytes regular expressions #14686
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally happy with the direction but have requested quite a few changes. Once those changes are done, and testing is passing, I think this will be good to go in.
Can you add a test of using regexp I/O to read file contents that are invalid UTF-8? Can you check that it reads OK with a bytes regexp and gives some sort of error when reading with a string regexp?
modules/standard/Regexp.chpl
Outdated
proc compile(pattern: ?t, out error:syserr, posix, literal, nocapture, | ||
/*i*/ ignorecase, /*m*/ multiline, /*s*/ dotnl, | ||
/*U*/ nongreedy): regexp(t) where t==string || t==bytes { | ||
|
||
compilerWarning("'out error: syserr' pattern has been deprecated, use 'throws' function instead"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please just remove this function
modules/standard/Regexp.chpl
Outdated
@@ -447,15 +447,20 @@ class BadRegexpError : Error { | |||
``(?U)``. | |||
|
|||
*/ | |||
proc compile(pattern: string, utf8=true, posix=false, literal=false, nocapture=false, /*i*/ ignorecase=false, /*m*/ multiline=false, /*s*/ dotnl=false, /*U*/ nongreedy=false):regexp throws { | |||
proc compile(pattern: ?t, posix=false, literal=false, nocapture=false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is in a standard module, I think we have some responsibility to try making a deprecation overload. Did you already do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we might as well make the arguments use camelCase while we are deprecating.
@@ -578,15 +572,19 @@ proc string.this(m:reMatch) { | |||
*/ | |||
pragma "ignore noinit" | |||
record regexp { | |||
|
|||
pragma "no doc" | |||
type exprType; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please open a design issue about whether or not regexp
should have the default of string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #14693
modules/standard/Regexp.chpl
Outdated
proc init() { | ||
} | ||
/*proc init() {*/ | ||
/*}*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent is to prevent users from calling another initializer (e.g. the compiler one).
To get this working, you just need to make it proc init(type exprType)
.
if exprType == string then | ||
yield text[pos+1..splitstart]; | ||
else | ||
yield text[(pos+1):int..splitstart:int]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't index into a bytes
with a bytesIndex
? That seems like something we should provide, even if it's just to make generic code like this nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make an issue about this one too? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a PR: #14695
modules/standard/Regexp.chpl
Outdated
@@ -936,29 +916,32 @@ record regexp { | |||
:returns: a tuple containing (new string, number of substitutions made) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure there are many places in the documentation that mention string
that now need adjustment.
modules/standard/Regexp.chpl
Outdated
@@ -1046,20 +1033,24 @@ proc =(ref ret:regexp, x:regexp) | |||
|
|||
// Cast regexp to string. | |||
pragma "no doc" | |||
inline proc _cast(type t, x: regexp) where t == string { | |||
var pattern: string; | |||
inline proc _cast(type t, x: regexp(?exprType)) where t == exprType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where
clauses on these casts slow down compilation.
Can you make 2 versions of this cast, one for string and one for bytes?
inline proc _cast(type t:string, x: regexp(string)) { }
inline proc _cast(type t:bytes, x: regexp(bytes)) { }
modules/standard/Regexp.chpl
Outdated
} | ||
return pattern; | ||
} | ||
// Cast string to regexp | ||
pragma "no doc" | ||
inline proc _cast(type t, x: string) throws where t == regexp { | ||
inline proc _cast(type t, x: ?valType) throws |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
inline proc _cast(type t:regexp(string), x: string) { }
inline proc _cast(type t:regexp(bytes), x: bytes) { }
Also I think we should consider deprecating this one (because compile
is a more reasonable way to write it). Could you open up a design issue asking that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #14694
- Remove deprecated non-throwing `compile` - Deprecate `compile` with `utf8` argument - Make passing `utf8=false` a throwing error - Make new `compile` arguments camelCase and update its doc *only* for that - Add a deprecation test
@mppf I think this is ready for another round of review. Maybe you can check the commits after your initial review one-by-one? Thanks! |
you may have to escape backslashes. For example, to | ||
:arg pattern: the regular expression to compile. This argument can be string | ||
or bytes. See :ref:`regular-expression-syntax` for details. | ||
Note that you may have to escape backslashes. For example, to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is unrelated to your PR, but while updating this documentation could you please emphasize in it that raw string literals exist? The docs here predate the raw string literals and regexps are an improtant motivating example.
// read from it -- following four reads are how these should be read | ||
var r = f.reader(); | ||
var captureString: string; | ||
var captureBytes: bytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some testing with passing a compiled bytes/string regexp to this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Current version of this test does that via the testRead helper.
Skipif bytes regular expression tests #14686 added bytes regular expressions along with some tests, but those tests aren't skipped when there is no RE2. This PR adds those skipifs.
Allow indexing/slicing bytes with byteIndex This PR enables indexing and slicing bytes using `byteIndex` type. `byteIndex` is a type used in string indexing to differentiate codepoint indexing and byte indexing. While writing some generic code using strings and bytes, it'd be nice to be able to index/slice bytes using this type, too. For the motivation see: #14686 (comment) [Reviewed by @mppf] Test: - [x] full standard - [x] full gasnet
Update a call to regexp.compile in mason Fixes an argument name that was causing deprecation warnings due to #14686 Test: make -C tools/mason does not produce the warning with this PR. Trivial, not reviewed.
Allow string-by-def regexps with warning This PR makes `string` default for `regexp` with a deprecation warning for 1.21 Resolves #14693 Resolves #14895 `regexp`s were string-based until #14686 made them generic to support bytes and strings. Under #14693, we reached a consensus that generic `regexp` is what we want going forward, but we need to support `regexp` with string-by-default type for 1.21 with a deprecation warning. This PR also adds deprecation tests [Reviewed by @dlongnecke-cray] Test: - [x] standard local - [x] standard gasnet
This PR adds support for using
bytes
values as regular expressions to be usedwith other
bytes
values.Summary:
Makes
regexp
generic ontype exprType
which can bestring
orbytes
Currently, this PR doesn't assign a default to
exprType
. But we might want tomake
string
the default type. See Default type for regular expressions #14693Removes
stringPart
from Regexp module. This was based on some comments inthe code that says it is not tested, and in some places, support for it is not
fully implemented.
This is not absolutely necessary for this PR to move on and I can revert back.
But it makes the interface and sometimes implementation much easier to remove
that.
Remove the deprecated
compile
without err
, deprecates thecompile
with
utf8
argument and all lowercase arguments.The new
compile
usesutf8
encoding with strings andLatin1
with bytesinternally.
Adds
bytes
methods with regexp arguments.bytes.this(reMatch)
,bytes.search(regexp(bytes))
etcAdds support for doing formatted reads for
regexp(bytes)
.The internal class
channel_regexp_info
now stores onlybytes
. They aredecode
d into strings as necessary when they are extracted.Making this also generic cannot be easily done, I think. Because
readf
needs to create a
channel_regexp_info
instance before starting to read thearguments. And before starting to read the arguments we don't know whether
they are
regexp(bytes)
orregexp(string)
.Note that the type of the format string does not determine the type of the
regular expression. Therefore, one can capture a bytes-based regular
expression using
"%/*/"
and a string-based regular expression usingb"%/*/"
Modifies some tests to test bytes-based regular expressions in the same way
string-based ones are tested.
Adjusts the
RecordParser
module to useregexp(string)
Adjusts mason to use new
compile
Adds tests for non-UTF8 regexp matching and casts.
Minor adjustment in the docs to emphasize that we have raw string/bytes
literals.
Test: