Add support for bytes regular expressions #14686

e-kayrakli · 2019-12-20T02:29:36Z

This PR adds support for using bytes values as regular expressions that are
matched against other bytes values.

Summary:

- Makes regexp generic on the type exprType, which can be string or bytes.

  Currently, this PR doesn't assign a default to exprType, but we might want to
  make string the default type. See Default type for regular expressions #14693
- Removes stringPart from the Regexp module. This was based on some comments in
  the code that say it is not tested and that, in some places, support for it is
  not fully implemented.

  This is not absolutely necessary for this PR to move on, and I can revert it.
  But removing it makes the interface, and sometimes the implementation, much
  simpler.
- Removes the deprecated compile overload with the out error argument, and
  deprecates the compile overload with the utf8 argument and all-lowercase
  argument names.

  The new compile uses UTF-8 encoding with strings and Latin1 with bytes
  internally.
- Adds bytes methods with regexp arguments: bytes.this(reMatch),
  bytes.search(regexp(bytes)), etc.
- Adds support for doing formatted reads for regexp(bytes).

  The internal class channel_regexp_info now stores only bytes. They are
  decoded into strings as necessary when they are extracted.

  Making this class generic as well cannot be done easily, I think, because
  readf needs to create a channel_regexp_info instance before starting to read
  the arguments, and at that point we don't know whether they are
  regexp(bytes) or regexp(string).

  Note that the type of the format string does not determine the type of the
  regular expression. Therefore, one can capture a bytes-based regular
  expression using "%/*/" and a string-based regular expression using
  b"%/*/".
- Modifies some tests to test bytes-based regular expressions in the same way
  string-based ones are tested.
- Adjusts the RecordParser module to use regexp(string).
- Adjusts mason to use the new compile.
- Adds tests for non-UTF8 regexp matching and casts.
- Minor adjustment in the docs to emphasize that we have raw string/bytes
  literals.
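As a rough illustration of the string/bytes split described above, here is a minimal sketch. It assumes the interface as summarized in this description: compile infers the regexp type from the pattern's type, and both string and bytes can be indexed with a reMatch. The variable names and sample inputs are made up for illustration and are not taken from the PR's tests.

```chapel
use Regexp;

// The type of the pattern selects the type of the compiled regexp.
var reString = compile("[0-9]+");   // regexp(string), pattern treated as UTF-8
var reBytes  = compile(b"[0-9]+");  // regexp(bytes), pattern treated as Latin1

// Hypothetical sample inputs, one per expression type.
const s = "abc 123";
const b = b"abc 123";

// A regexp(string) searches a string; a regexp(bytes) searches a bytes.
var ms = reString.search(s);
var mb = reBytes.search(b);

// Indexing with the reMatch extracts the matched portion of the input.
if ms.matched then writeln(s[ms]);
if mb.matched then writeln(b[mb]);
```

Mixing the two, for example searching a bytes value with reString, is not expected to resolve, since the text argument has to match the regexp's exprType.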

Test:

standard
gasnet

mppf

I'm generally happy with the direction but have requested quite a few changes. Once those changes are done, and testing is passing, I think this will be good to go in.

Can you add a test of using regexp I/O to read file contents that are invalid UTF-8? Can you check that it reads OK with a bytes regexp and gives some sort of error when reading with a string regexp?

mppf · 2019-12-20T13:51:28Z

modules/standard/Regexp.chpl

+proc compile(pattern: ?t, out error:syserr, posix, literal, nocapture,
+             /*i*/ ignorecase, /*m*/ multiline, /*s*/ dotnl,
+             /*U*/ nongreedy): regexp(t) where t==string || t==bytes {
+
  compilerWarning("'out error: syserr' pattern has been deprecated, use 'throws' function instead");


Please just remove this function

mppf · 2019-12-20T13:51:58Z

modules/standard/Regexp.chpl

@@ -447,15 +447,20 @@ class BadRegexpError : Error {
                   ``(?U)``.

 */
-proc compile(pattern: string, utf8=true, posix=false, literal=false, nocapture=false, /*i*/ ignorecase=false, /*m*/ multiline=false, /*s*/ dotnl=false, /*U*/ nongreedy=false):regexp throws {
+proc compile(pattern: ?t, posix=false, literal=false, nocapture=false,


Since this is in a standard module, I think we have some responsibility to try making a deprecation overload. Did you already do that?

Also we might as well make the arguments use camelCase while we are deprecating.

mppf · 2019-12-20T13:53:12Z

modules/standard/Regexp.chpl

@@ -578,15 +572,19 @@ proc string.this(m:reMatch) {
  */
 pragma "ignore noinit"
 record regexp {
+
+  pragma "no doc"
+  type exprType;


Please open a design issue about whether or not regexp should have the default of string

mppf · 2019-12-20T13:54:29Z

modules/standard/Regexp.chpl

-  proc init() {
-  }
+  /*proc init() {*/
+  /*}*/


The intent is to prevent users from calling another initializer (e.g. the compiler one).
To get this working, you just need to make it proc init(type exprType).

mppf · 2019-12-20T13:56:58Z

modules/standard/Regexp.chpl

+        if exprType == string then
+          yield text[pos+1..splitstart];
+        else 
+          yield text[(pos+1):int..splitstart:int];


You can't index into a bytes with a byteIndex? That seems like something we should provide, even if it's just to make generic code like this nicer.

Can you make an issue about this one too? Thanks.

Here is a PR: #14695

mppf · 2019-12-20T13:57:27Z

modules/standard/Regexp.chpl

@@ -936,29 +916,32 @@ record regexp {
     :returns: a tuple containing (new string, number of substitutions made)


I'm sure there are many places in the documentation that mention string that now need adjustment.

mppf · 2019-12-20T13:59:00Z

modules/standard/Regexp.chpl

@@ -1046,20 +1033,24 @@ proc =(ref ret:regexp, x:regexp)

 // Cast regexp to string.
 pragma "no doc"
-inline proc _cast(type t, x: regexp) where t == string {
-  var pattern: string;
+inline proc _cast(type t, x: regexp(?exprType)) where t == exprType  {


`where` clauses on these casts slow down compilation.
Can you make 2 versions of this cast, one for string and one for bytes?

inline proc _cast(type t:string, x: regexp(string)) { }
inline proc _cast(type t:bytes, x: regexp(bytes)) { }

mppf · 2019-12-20T13:59:16Z

modules/standard/Regexp.chpl

  }
  return pattern;
 }
 // Cast string to regexp
 pragma "no doc"
-inline proc _cast(type t, x: string) throws where t == regexp {
+inline proc _cast(type t, x: ?valType) throws


Same here:

inline proc _cast(type t:regexp(string), x: string) { }
inline proc _cast(type t:regexp(bytes), x: bytes) { }

Also I think we should consider deprecating this one (because compile is a more reasonable way to write it). Could you open up a design issue asking that?

e-kayrakli · 2019-12-20T23:46:53Z

@mppf I think this is ready for another round of review. Maybe you can check the commits after your initial review one-by-one? Thanks!

mppf · 2019-12-21T00:03:22Z

modules/standard/Regexp.chpl

-                 you may have to escape backslashes. For example, to
+   :arg pattern: the regular expression to compile. This argument can be string
+                 or bytes. See :ref:`regular-expression-syntax` for details.
+                 Note that you may have to escape backslashes. For example, to


I know this is unrelated to your PR, but while updating this documentation could you please emphasize in it that raw string literals exist? The docs here predate the raw string literals, and regexps are an important motivating example.

mppf · 2019-12-21T00:08:37Z

test/regexp/bytes/nonUTFIO.chpl

+  // read from it -- following four reads are how these should be read
+  var r = f.reader();
+  var captureString: string;
+  var captureBytes: bytes;


Can you add some testing with passing a compiled bytes/string regexp to this test?

Done. Current version of this test does that via the testRead helper.

Skipif bytes regular expression tests

#14686 added bytes regular expressions along with some tests, but those tests aren't skipped when there is no RE2. This PR adds those skipifs.

@mppf

Allow indexing/slicing bytes with byteIndex

This PR enables indexing and slicing bytes using the `byteIndex` type. `byteIndex` is a type used in string indexing to differentiate codepoint indexing and byte indexing. While writing some generic code using strings and bytes, it'd be nice to be able to index/slice bytes using this type, too.

For the motivation see: #14686 (comment)

[Reviewed by @mppf]

Test:
- [x] full standard
- [x] full gasnet

Update a call to regexp.compile in mason

Fixes an argument name that was causing deprecation warnings due to #14686

Test: make -C tools/mason does not produce the warning with this PR.

Trivial, not reviewed.

@dlongnecke-cray

Allow string-by-def regexps with warning

This PR makes `string` the default for `regexp` with a deprecation warning for 1.21.

Resolves #14693
Resolves #14895

`regexp`s were string-based until #14686 made them generic to support bytes and strings. Under #14693, we reached a consensus that generic `regexp` is what we want going forward, but we need to support `regexp` with string-by-default type for 1.21 with a deprecation warning.

This PR also adds deprecation tests.

[Reviewed by @dlongnecke-cray]

Test:
- [x] standard local
- [x] standard gasnet

e-kayrakli added 13 commits December 19, 2019 13:51

Remove stringPart

584904b

Add bytes type to where clauses

56b6c73

Make regexp generic on the expresion type, support bytes

ec04c22

Add a comment about encoding, add bytes methods

39eaf93

Add str.foo(re) cases to the test

db41bf0

Adjustments, mainly IO support for bytes regexps

e2c07b0

Few more fixes before valgrind testing

f329c87

Add the missing compopts file

f795261

Fix the segfault -- thanks to Michael

d020535

Cleanup

359704c

Make channel_regexp_info deal with bytes only

74f3ca9

Fix the RecordParser module

ff43de1

Add a workaround for reading enums

e8a43d1

e-kayrakli requested a review from mppf December 20, 2019 02:33

mppf reviewed Dec 20, 2019

e-kayrakli added 4 commits December 20, 2019 10:42

Make channel._format_reader take string/bytes

6288c57

Change the error mode for capturing non-UTF8 data in string

fa23872

Add test

7682c18

Grabbag of deprecation related changes

57ee904

- Remove deprecated non-throwing `compile`
- Deprecate `compile` with `utf8` argument
- Make passing `utf8=false` a throwing error
- Make new `compile` arguments camelCase and update its doc *only* for that
- Add a deprecation test

e-kayrakli mentioned this pull request Dec 20, 2019

Default type for regular expressions #14693

Closed

e-kayrakli added 2 commits December 20, 2019 14:08

Bring back the initializer, minor fix

6b29c4d

Add explicit overloads for casts, and a test

44701bf

e-kayrakli mentioned this pull request Dec 20, 2019

Should we deprecate casting strings to regular expressions? #14694

Closed

Update regexp docs

c178e4f

mppf reviewed Dec 21, 2019

mppf approved these changes Dec 21, 2019

e-kayrakli mentioned this pull request Dec 21, 2019

Allow indexing/slicing bytes with byteIndex #14695

Merged

2 tasks

Update docs to mention raw literals

55da61b

e-kayrakli added 2 commits December 23, 2019 11:13

Modify test to run with precompiled REs

48bb7b8

Add last-resort to the deprecated compile, fix mason to use new compile

acc2d8d

e-kayrakli merged commit 11d1c66 into chapel-lang:master Jan 2, 2020

e-kayrakli deleted the bytes-regexp branch January 2, 2020 17:35

e-kayrakli mentioned this pull request Jan 4, 2020

Skipif bytes regular expression tests #14719

Merged

e-kayrakli mentioned this pull request Jan 16, 2020

Update a call to regexp.compile in mason #14768

Merged

e-kayrakli mentioned this pull request Feb 25, 2020

Allow string-by-def regexps with warning #14982

Merged

2 tasks

jabraham17 mentioned this pull request Nov 9, 2023

What to do with the empty regex initializer? #23822

Closed
