Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for bytes regular expressions #14686

Merged
merged 23 commits into from
Jan 2, 2020

Conversation

e-kayrakli
Copy link
Contributor

@e-kayrakli e-kayrakli commented Dec 20, 2019

This PR adds support for using bytes values as regular expressions to be used
with other bytes values.

Summary:

  • Makes regexp generic on type exprType which can be string or bytes

    Currently, this PR doesn't assign a default to exprType. But we might want to
    make string the default type. See Default type for regular expressions #14693

  • Removes stringPart from Regexp module. This was based on some comments in
    the code that says it is not tested, and in some places, support for it is not
    fully implemented.

    This is not absolutely necessary for this PR to move on and I can revert back.
    But it makes the interface and sometimes implementation much easier to remove
    that.

  • Remove the deprecated compile with out err, deprecates the compile
    with utf8 argument and all lowercase arguments.

    The new compile uses utf8 encoding with strings and Latin1 with bytes
    internally.

  • Adds bytes methods with regexp arguments. bytes.this(reMatch),
    bytes.search(regexp(bytes)) etc

  • Adds support for doing formatted reads for regexp(bytes).

    The internal class channel_regexp_info now stores only bytes. They are
    decoded into strings as necessary when they are extracted.

    Making this also generic cannot be easily done, I think. Because readf
    needs to create a channel_regexp_info instance before starting to read the
    arguments. And before starting to read the arguments we don't know whether
    they are regexp(bytes) or regexp(string).

    Note that the type of the format string does not determine the type of the
    regular expression. Therefore, one can capture a bytes-based regular
    expression using "%/*/" and a string-based regular expression using
    b"%/*/"

  • Modifies some tests to test bytes-based regular expressions in the same way
    string-based ones are tested.

  • Adjusts the RecordParser module to use regexp(string)

  • Adjusts mason to use new compile

  • Adds tests for non-UTF8 regexp matching and casts.

  • Minor adjustment in the docs to emphasize that we have raw string/bytes
    literals.

Test:

  • standard
  • gasnet

@e-kayrakli e-kayrakli requested a review from mppf December 20, 2019 02:33
Copy link
Member

@mppf mppf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally happy with the direction but have requested quite a few changes. Once those changes are done, and testing is passing, I think this will be good to go in.

Can you add a test of using regexp I/O to read file contents that are invalid UTF-8? Can you check that it reads OK with a bytes regexp and gives some sort of error when reading with a string regexp?

proc compile(pattern: ?t, out error:syserr, posix, literal, nocapture,
/*i*/ ignorecase, /*m*/ multiline, /*s*/ dotnl,
/*U*/ nongreedy): regexp(t) where t==string || t==bytes {

compilerWarning("'out error: syserr' pattern has been deprecated, use 'throws' function instead");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please just remove this function

@@ -447,15 +447,20 @@ class BadRegexpError : Error {
``(?U)``.

*/
proc compile(pattern: string, utf8=true, posix=false, literal=false, nocapture=false, /*i*/ ignorecase=false, /*m*/ multiline=false, /*s*/ dotnl=false, /*U*/ nongreedy=false):regexp throws {
proc compile(pattern: ?t, posix=false, literal=false, nocapture=false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is in a standard module, I think we have some responsibility to try making a deprecation overload. Did you already do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we might as well make the arguments use camelCase while we are deprecating.

@@ -578,15 +572,19 @@ proc string.this(m:reMatch) {
*/
pragma "ignore noinit"
record regexp {

pragma "no doc"
type exprType;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open a design issue about whether or not regexp should have the default of string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #14693

proc init() {
}
/*proc init() {*/
/*}*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is to prevent users from calling another initializer (e.g. the compiler one).
To get this working, you just need to make it proc init(type exprType).

if exprType == string then
yield text[pos+1..splitstart];
else
yield text[(pos+1):int..splitstart:int];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't index into a bytes with a bytesIndex ? That seems like something we should provide, even if it's just to make generic code like this nicer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make an issue about this one too? Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a PR: #14695

@@ -936,29 +916,32 @@ record regexp {
:returns: a tuple containing (new string, number of substitutions made)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure there are many places in the documentation that mention string that now need adjustment.

@@ -1046,20 +1033,24 @@ proc =(ref ret:regexp, x:regexp)

// Cast regexp to string.
pragma "no doc"
inline proc _cast(type t, x: regexp) where t == string {
var pattern: string;
inline proc _cast(type t, x: regexp(?exprType)) where t == exprType {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where clauses on these casts slow down compilation.
Can you make 2 versions of this cast, one for string and one for bytes?

 inline proc _cast(type t:string, x: regexp(string)) { }
 inline proc _cast(type t:bytes, x: regexp(bytes)) { }

}
return pattern;
}
// Cast string to regexp
pragma "no doc"
inline proc _cast(type t, x: string) throws where t == regexp {
inline proc _cast(type t, x: ?valType) throws
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

 inline proc _cast(type t:regexp(string), x: string) { }
 inline proc _cast(type t:regexp(bytes), x: bytes) { }

Also I think we should consider deprecating this one (because compile is a more reasonable way to write it). Could you open up a design issue asking that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #14694

- Remove deprecated non-throwing `compile`
- Deprecate `compile` with `utf8` argument
  - Make passing `utf8=false` a throwing error
- Make new `compile` arguments camelCase and update its doc *only* for
  that
- Add a deprecation test
@e-kayrakli
Copy link
Contributor Author

@mppf I think this is ready for another round of review. Maybe you can check the commits after your initial review one-by-one? Thanks!

you may have to escape backslashes. For example, to
:arg pattern: the regular expression to compile. This argument can be string
or bytes. See :ref:`regular-expression-syntax` for details.
Note that you may have to escape backslashes. For example, to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is unrelated to your PR, but while updating this documentation could you please emphasize in it that raw string literals exist? The docs here predate the raw string literals and regexps are an improtant motivating example.

// read from it -- following four reads are how these should be read
var r = f.reader();
var captureString: string;
var captureBytes: bytes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some testing with passing a compiled bytes/string regexp to this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Current version of this test does that via the testRead helper.

@e-kayrakli e-kayrakli merged commit 11d1c66 into chapel-lang:master Jan 2, 2020
@e-kayrakli e-kayrakli deleted the bytes-regexp branch January 2, 2020 17:35
e-kayrakli added a commit that referenced this pull request Jan 4, 2020
Skipif bytes regular expression tests

#14686 added bytes regular expressions along with some tests, but those tests aren't skipped when there is no RE2. This PR adds those skipifs.
e-kayrakli added a commit that referenced this pull request Jan 7, 2020
Allow indexing/slicing bytes with byteIndex

This PR enables indexing and slicing bytes using `byteIndex` type.

`byteIndex` is a type used in string indexing to differentiate codepoint
indexing and byte indexing. While writing some generic code using strings and
bytes, it'd be nice to be able to index/slice bytes using this type, too.

For the motivation see:
#14686 (comment)

[Reviewed by @mppf]

Test:
- [x] full standard
- [x] full gasnet
e-kayrakli added a commit that referenced this pull request Jan 16, 2020
Update a call to regexp.compile in mason

Fixes an argument name that was causing deprecation warnings due to #14686

Test:
make -C tools/mason does not produce the warning with this PR.

Trivial, not reviewed.
e-kayrakli added a commit that referenced this pull request Mar 6, 2020
Allow string-by-def regexps with warning

This PR makes `string` default for `regexp` with a deprecation warning for 1.21

Resolves #14693
Resolves #14895

`regexp`s were string-based until #14686 made them generic to support bytes and
strings. Under #14693, we reached a consensus that generic `regexp` is what we
want going forward, but we need to support `regexp` with string-by-default type
for 1.21 with a deprecation warning.

This PR also adds deprecation tests

[Reviewed by @dlongnecke-cray]

Test:
- [x] standard local
- [x] standard gasnet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants