Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default type for regular expressions #14693

Closed
e-kayrakli opened this issue Dec 20, 2019 · 2 comments · Fixed by #14982
Closed

Default type for regular expressions #14693

e-kayrakli opened this issue Dec 20, 2019 · 2 comments · Fixed by #14982

Comments

@e-kayrakli
Copy link
Contributor

e-kayrakli commented Dec 20, 2019

#14686 will add support for bytes regular expressions. In order to do that it makes regexp record generic on type exprType. The meaningful concrete types in this scenario are regexp(string) and regexp(bytes).

That PR does not assign a default value to exprType, but maybe it should. We have the following options:

  1. No default type -- it'd be nice to have regexp neutral.
    This helps with some of the confusions I was having during the implementation of Add support for bytes regular expressions #14686. For example:
class Message {
  proc parse(r: regexp) { /* bla */ }
}

if regexp doesn't have default type, then, parse can be used with both string and bytes regular expressions. Otherwise, we'd be stuck with string only, or change the prototype to proc parse(r: regexp(?))

This is also a breaking change, but probably not very impactful. Note that you can use regular expressions without using this type explicitly:

var myRe = compile("regular expression string");

gives you a regexp(string). If the literal was bytes then it would give regexp(bytes)

  1. default value is string
    string should be the more common use for this class, and it might make sense to just make it default.

  2. default value is bytes -- just for completeness. I don't think this is a real option

Even though #14686 chooses 1, I don't really have a preference between 1 and 2.

e-kayrakli added a commit that referenced this issue Jan 2, 2020
Add support for bytes regular expressions

This PR adds support for using `bytes` values as regular expressions to be used
with other `bytes` values.

Summary:
- Makes `regexp` generic on `type exprType` which can be `string` or `bytes`
   
  Currently, this PR doesn't assign a default to `exprType`. But we might want to
  make `string` the default type. See #14693 

- Removes `stringPart` from Regexp module. This was based on some comments in
  the code that says it is not tested, and in some places, support for it is not
  fully implemented.

  This is not absolutely necessary for this PR to move on and I can revert back.
  But it makes the interface and sometimes implementation much easier to remove
  that.

- Remove the deprecated `compile` with `out err`, deprecates the `compile`
  with `utf8`  argument and all lowercase arguments.

  The new `compile` uses `utf8` encoding with strings and `Latin1` with bytes
  internally.

- Adds `bytes` methods with regexp arguments. `bytes.this(reMatch)`,
  `bytes.search(regexp(bytes))` etc
  
- Adds support for doing formatted reads for `regexp(bytes)`.

  The internal class `channel_regexp_info` now stores only `bytes`. They are
  `decode`d into strings as necessary when they are extracted.

  Making this also generic cannot be easily done, I think. Because `readf`
  needs to create a `channel_regexp_info` instance before starting to read the
  arguments. And before starting to read the arguments we don't know whether
  they are `regexp(bytes)` or `regexp(string)`.

  Note that the type of the format string does not determine the type of the
  regular expression. Therefore, one can capture a bytes-based regular
  expression using `"%/*/"` and a string-based regular expression using
  `b"%/*/"`

- Modifies some tests to test bytes-based regular expressions in the same way
  string-based ones are tested.

- Adjusts the `RecordParser` module to use `regexp(string)`

- Adjusts mason to use new `compile`

- Adds tests for non-UTF8 regexp matching and casts.

- Minor adjustment in the docs to emphasize that we have raw string/bytes
  literals.

[Reviewed by @mppf]

Test:
- [x] standard
- [x] gasnet
@BryantLam
Copy link

I would go with option 1: no default type. It's the least restrictive option between 1 and 2 if some person wants to make a custom string type for their use case.

@bradcray
Copy link
Member

I think I said this elsewhere, but: I agree that it shouldn't have a default type. I could imagine it having a default type with deprecation warning for the 1.21 release to help people transition their existing codes that relied on it being string-only. I'm imagining this could be supported by having a 0-argument initializer that generated the warning and called a 1-argument initializer passing string in.

e-kayrakli added a commit that referenced this issue Mar 6, 2020
Allow string-by-def regexps with warning

This PR makes `string` default for `regexp` with a deprecation warning for 1.21

Resolves #14693
Resolves #14895

`regexp`s were string-based until #14686 made them generic to support bytes and
strings. Under #14693, we reached a consensus that generic `regexp` is what we
want going forward, but we need to support `regexp` with string-by-default type
for 1.21 with a deprecation warning.

This PR also adds deprecation tests

[Reviewed by @dlongnecke-cray]

Test:
- [x] standard local
- [x] standard gasnet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants