Enhance Regex Support in Standard Library for JS Backend #592

Open: leonfuss wants to merge 20 commits into master from feat/regex/capture_groups

@leonfuss leonfuss commented Sep 16, 2024

This pull request significantly improves regex support in Effekt's standard library for the JavaScript backend. It introduces capture groups, group indices, and regex flags, addressing current limitations and providing a more powerful regex interface.

Key Enhancements:

  1. Support for capture groups and their indices
  2. Implementation of regex flags
  3. Stateful regex object for sequential matching

Current Limitations:
Effekt's regex support is currently limited, lacking support for capture groups and advanced regex features. This implementation bridges this gap for JavaScript environments.

New Features:

  • Capture Groups: Ability to use and access capture groups in regex patterns
  • Group Indices: Retrieve the starting index of each capture group
  • Regex Flags: Support for standard regex flags (e.g., 'i' for case-insensitive matching)
  • Stateful Matching: Global flag ('g') set by default for sequential matching
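
As a rough illustration of the JavaScript behavior these features build on, here is a minimal sketch using the plain `RegExp` API rather than the Effekt interface from this PR. The pattern and input are made up for the example, and the `d` flag assumes Node.js 16 or newer.

```javascript
// Plain JavaScript RegExp behavior, not the Effekt API proposed in this PR.
// Flags: d = generate indices, g = global (stateful), i = case-insensitive.
const re = /(\d+)-([a-z]+)/dgi;
const input = "10-abc 20-DEF";

const first = re.exec(input);
console.log(first[0]);           // full match: "10-abc"
console.log(first[1], first[2]); // capture groups: "10" "abc"
console.log(first.indices[1]);   // [start, end) of group 1: [ 0, 2 ]

// With the 'g' flag the regex object remembers where it stopped (lastIndex),
// so the next exec call continues after the first match.
const second = re.exec(input);
console.log(second[0]);          // "20-DEF" (matched thanks to the 'i' flag)
console.log(second.indices[2]);  // [ 10, 13 ]
```

Each `indices` entry is a half-open `[start, end)` pair, which is the information a start/end range per group would carry.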

Backwards compatibility
This change is not backwards compatible.

Feedback on the API design and usability from experienced Effekt developers is highly appreciated.

Contributor

@jiribenes jiribenes left a comment

Thanks for the contribution! :)
I can't review it fully, but I have a few nitpicks I pointed out in the review: most of them are just stylistic.

What I would really like to see before merging are a few relevant tests in effekt/examples/stdlib/, feel free to make a new folder.
Our tests are always regression tests, so there's an .effekt file and a corresponding .check file. You can either just print out the results (see here) or use the test suite written in Effekt (see here).
The tests are especially important if you're planning to depend on this library going forward -- otherwise we won't notice a regression.

libraries/common/option.effekt (outdated)
libraries/js/text/regex.effekt
libraries/js/text/regex.effekt (outdated)
libraries/js/text/regex.effekt (outdated)
libraries/js/text/regex.effekt (outdated)
libraries/js/text/regex.effekt (outdated)
libraries/js/text/regex.effekt (outdated)
@jiribenes
Contributor

The tests are failing since examples/casestudies/lexer.effekt.md uses text/regex.
Here are some more modules importing text/regex: https://github.com/search?q=repo%3Aeffekt-lang%2Feffekt+%22text%2Fregex%22+path%3Aexamples%2F**%2F*.effekt&type=code

@leonfuss
Author

The tests are failing since examples/casestudies/lexer.effekt.md uses text/regex. Here are some more modules importing text/regex: https://github.com/search?q=repo%3Aeffekt-lang%2Feffekt+%22text%2Fregex%22+path%3Aexamples%2F**%2F*.effekt&type=code

I refactored the lexer example to account for the changes. The other occurrences shouldn't be affected, since I'm only changing the JS backend? Please correct me if I overlooked anything.

@leonfuss
Author

Thank you for your thorough review. I've implemented your suggestions, making the code more idiomatic. The addition of typed flags should notably enhance usability. I plan to add the test cases tomorrow.

jiribenes previously approved these changes Sep 18, 2024
Contributor

@jiribenes jiribenes left a comment

I went through the PR again and LGTM :shipit:
Thank you for the contribution :)

@leonfuss
Author

I ran into some problems designing the test. While neither the LSP server nor I can find any error in the code, the compiler fails with the error [error] Cannot typecheck call. Unfortunately, it doesn't say which type check failed. I have been experimenting for an hour and could not get it working.

@jiribenes If you see any obvious reason why it doesn't type check, I would really appreciate your help :)
Otherwise I will just keep experimenting until I get a more expressive error :/

@jiribenes
Contributor

I'll take a look. This error usually means that UFCS (foo.bar(baz) standing for bar(foo, baz)) failed.

@leonfuss
Author

Is there a way to run certain tests only for the JS backend? I would like to add tests for the new features, but it seems that the Chez backend is not happy with that. At least I have no trouble running the tests successfully on my machine using only the JS backend.

@b-studios
Collaborator

It looks like this test even fails on JS:

https://github.com/effekt-lang/effekt/actions/runs/10919082132/job/30306075187?pr=592#step:11:402

@jiribenes
Contributor

jiribenes commented Sep 18, 2024

Is there a way to run certain tests only for the JS backend? I would like to add tests for the new features, but it seems that the Chez backend is not happy with that. At least I have no trouble running the tests successfully on my machine using only the JS backend.

You can add the file into ignored here for the backends it should not run on together with a comment why it's disabled: https://github.com/effekt-lang/effekt/blob/master/effekt/jvm/src/test/scala/effekt/StdlibTests.scala

So for example for Chez-$something backends:

abstract class StdlibChezTests extends StdlibTests {
  override def ignored: List[File] = List(
    // Not implemented yet
    examplesDir / "stdlib" / "bytes",
    examplesDir / "stdlib" / "io",

    // Not implemented: advanced regex features
    examplesDir / "stdlib" / "string" / "regex.effekt",
  )
}

@jiribenes
Contributor

jiribenes commented Sep 18, 2024

It looks like this test even fails on JS:

https://github.com/effekt-lang/effekt/actions/runs/10919082132/job/30306075187?pr=592#step:11:402

It seems to work for me locally so it might be a NodeJS version issue?
Did the API for regexes change there?

EDIT: it seems that the last part of the test labelled "capture groups" throws some error somewhere.

EDIT 2: I think that NodeJS 12.x does not support the d (GenerateIndices()) flag, since it was only introduced in ES2022 and implemented in NodeJS 16.x according to MDN.

@jiribenes
Contributor

jiribenes commented Sep 18, 2024

Don't know how to resolve this if we want to keep the functionality. We could update NodeJS: that might be a wise move anyway since even the extended support for NodeJS 12.x ended 2 years and 4 months ago (30 Apr 2022). Current oldest somewhat supported LTS version is NodeJS 18.x (where security support ends in 7 months).
[I'd also be fine with updating to "just" NodeJS 16.x]

@leonfuss
Author

EDIT 2: I think that NodeJS 12.x does not support the d (GenerateIndices()) flag, since it was only introduced in ES2022 and implemented in NodeJS 16.x according to MDN.

I just tested it locally and can confirm that. I will try to add a condition so that the d flag only works if the Node version is 16.x or newer, and change the test to run only in that case.

Do you see any problems with that?
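
One alternative to comparing version numbers is to probe the engine for the flag itself. The following is only a sketch of that idea: `supportsIndicesFlag` and `groupRanges` are hypothetical helper names, not functions from this PR. It relies on the fact that the `RegExp` constructor throws a `SyntaxError` for flags it does not know.

```javascript
// Hypothetical helper: detect 'd' flag support by trying to build a regex with it.
function supportsIndicesFlag() {
  try {
    new RegExp("", "d");
    return true;
  } catch (e) {
    return false; // engines without the flag throw a SyntaxError here
  }
}

// Hypothetical helper: return the [start, end) ranges of the match and its
// groups, or null if there is no match or the engine cannot report indices.
function groupRanges(pattern, input) {
  const flags = supportsIndicesFlag() ? "d" : "";
  const m = new RegExp(pattern, flags).exec(input);
  if (m === null) return null;
  return m.indices ? Array.from(m.indices) : null;
}

console.log(supportsIndicesFlag());
console.log(groupRanges("(b)", "abc"));
```

Feature detection of this kind keeps working on runtimes other than Node.js, where a version number would not be meaningful.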

@b-studios
Collaborator

We can just require a newer node version, that shouldn't be a problem.

Comment on lines 203 to 207
function process$version() {
  const v = process.version
  const parts = v.split('.')
  return parseInt(parts[0].replace('v', ''))
}
Contributor

@jiribenes jiribenes Sep 23, 2024

As a tiny nitpick: according to StackOverflow, process.versions.node should have the version without v :)
https://stackoverflow.com/questions/6656324/check-for-current-node-version

(though I haven't tried it on NodeJS 12.x, it might be too new 🤔)
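
For reference, the suggestion could look roughly like the sketch below. `nodeMajorVersion` is a hypothetical name chosen for the example; the only API assumed is `process.versions.node`, which holds the version string without the leading `v`.

```javascript
// process.version looks like "v16.20.2"; process.versions.node looks like
// "16.20.2", so no prefix needs to be stripped before parsing.
function nodeMajorVersion() {
  const major = process.versions.node.split(".")[0];
  return parseInt(major, 10); // explicit radix avoids surprises
}

console.log(nodeMajorVersion());
```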

@b-studios
Collaborator

Thanks for the PR! I think it is great to have improved regex support in the language.

I have one major concern, which is the compatibility with the LLVM backend. In general, I do not care so much anymore about the Chez and ML backends. However, we should aim for feature completeness of the LLVM backend.

Adding JS-only features now will make this more and more difficult in the future.

@phischu what is your opinion when viewing this from the LLVM perspective?

@jiribenes jiribenes mentioned this pull request Sep 23, 2024
@b-studios
Collaborator

Which of the extensions that are proposed in this PR are compatible with the regular expressions in POSIX?

(https://www.man7.org/linux/man-pages/man3/regex.3.html)

@jiribenes
Contributor

jiribenes commented Sep 23, 2024

master is updated to NodeJS 16.x via #599, you can just revert the last commit and rebase :)

@b-studios
Collaborator

To clarify: I am not proposing that you should implement the LLVM part. I just want to know what the intersection between posix regex and your proposed features is.

record Match(matched: String, index: Int)
// matches and groups
record Range(start: Int, end: Int)
record Group(matched: String, index: Option[Range])
Collaborator

Can we just always generate indices and turn this into index: Range?

Author

That's definitely possible and should decrease complexity.

namespace internal {
extern io def unsafeExec(reg: Regex, str: String): RegexMatch = js "regex$exec(${reg}, ${str})"

extern pure def unsafeRegex(str: String, flags: String): Regex = js "new RegExp(${str}, ${flags})"
Collaborator

@b-studios b-studios Sep 23, 2024

This needs to be io (or global, not sure which one this should be), because otherwise the same regex will potentially be inlined over and over again and not have the "stateful" behavior.

Author

@leonfuss leonfuss Oct 28, 2024

Thank you for the clarification. I sometimes still struggle with FFI boundaries in Effekt.
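
The concern in this thread can be reproduced in plain JavaScript. The sketch below assumes that inlining a pure constructor call amounts to creating a new `RegExp` object at every use site; the pattern and input are invented for the example.

```javascript
const input = "a1 b2 c3";

// One shared regex object: lastIndex advances between calls.
const shared = new RegExp("[a-z]\\d", "g");
const sharedResults = [shared.exec(input)[0], shared.exec(input)[0]];
console.log(sharedResults); // [ 'a1', 'b2' ]

// A fresh object per call (roughly what inlining the constructor would do):
// every call starts at lastIndex 0 and finds the first match again.
const fresh = () => new RegExp("[a-z]\\d", "g");
const freshResults = [fresh().exec(input)[0], fresh().exec(input)[0]];
console.log(freshResults); // [ 'a1', 'a1' ]
```

The two results differ only because of object identity, which is why the constructor cannot safely be treated as pure while matching depends on `lastIndex`.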

Comment on lines 108 to 103
* The returned Match object contains:
* - matched: The full matched string.
* - start: The starting index of the match.
* - end: The ending index of the match if `Global` flag is set, otherwise 0
* - groups: A list of Group objects for each capturing group.
Collaborator

Maybe move this documentation to Match and refer to it here?

Also, why is end 0 if global isn't set?
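
One plausible explanation, assuming the end position was read from the regex object's `lastIndex` (the thread does not confirm this): JavaScript only updates `lastIndex` when the `g` or `y` flag is set. A small sketch of that behavior, with a flag-independent way to compute the end:

```javascript
const text = "xx foo";

const plain = /foo/;
const mPlain = plain.exec(text);
console.log(plain.lastIndex); // 0: not updated without 'g' (or 'y')

const withG = /foo/g;
const mGlobal = withG.exec(text);
console.log(withG.lastIndex); // 6: position right after the match

// The end can be derived from the match itself, independent of any flag.
const end = mPlain.index + mPlain[0].length;
console.log(end); // 6
```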

@leonfuss
Author

Which of the extensions that are proposed in this PR are compatible with the regular expressions in POSIX?

(https://www.man7.org/linux/man-pages/man3/regex.3.html)

The POSIX standard for regex includes only basic support for capture groups. Only the extraction of the content is possible here, but generating indices for those is not. Stateful matching is definitely not known in the POSIX world.

Flag compatibility is a bit more difficult. Flags are usually passed as additional command-line arguments, so whether they are compatible depends on the program used.

Does that answer the question?

@leonfuss
Author

you can just revert the last commit and rebase

Thank you :)

@leonfuss leonfuss force-pushed the feat/regex/capture_groups branch 2 times, most recently from e981f37 to 5d34ec2 Compare October 28, 2024 14:36
@leonfuss leonfuss force-pushed the feat/regex/capture_groups branch from 6459232 to aab54e3 Compare October 30, 2024 12:34
@leonfuss
Author

leonfuss commented Nov 6, 2024

I have implemented the following changes as suggested previously:

  1. The Global Flag is now handled directly. The exec function returns a Regex object that captures the string in the capture field. This allows subsequent calls on the returned object to iterate (lazily) through all matches. A shorthand for map on the Regex Object is provided, which maps directly onto the captured Match for an ergonomic API. This also eliminates the need for a global or io modifier to prevent inlining. Both attributes previously prevented the use of regex queries in areas where these effects are not allowed.
  2. The examples have been added to the tests and verified to work as expected.
  3. The GenerateIndices flag has been removed as an option and is now enabled by default.
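
The idea in item 1 can be sketched in JavaScript as follows. This is not the PR's Effekt code: `execFrom` and `allMatches` are hypothetical names, and the sketch assumes the caller's flags do not already contain `g` or `d`. Each result carries the input and the resume position, so no regex object has to keep state between calls.

```javascript
// Hypothetical sketch: a match result that knows how to find the next match.
function execFrom(pattern, flags, input, start = 0) {
  // A fresh regex per step keeps the function free of observable state.
  // 'g' makes exec honor lastIndex; 'd' yields indices (Node.js 16+).
  const re = new RegExp(pattern, flags + "gd");
  re.lastIndex = start;
  const m = re.exec(input);
  if (m === null) return null;
  const [from, to] = m.indices[0];
  return {
    matched: m[0],
    start: from,
    end: to,
    // Resume after this match; step one character past an empty match
    // so that iteration always makes progress.
    next: () => execFrom(pattern, flags, input, to > from ? to : to + 1),
  };
}

// Walks the lazily produced chain of matches and collects the matched text.
function allMatches(pattern, flags, input) {
  const out = [];
  for (let m = execFrom(pattern, flags, input); m !== null; m = m.next()) {
    out.push(m.matched);
  }
  return out;
}

console.log(allMatches("[a-z]\\d", "i", "a1 B2 c3")); // [ 'a1', 'B2', 'c3' ]
```

Because `next` recomputes from explicit arguments, calling it twice on the same result yields the same match, which is what makes such a function safe to inline or call from pure code.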

Please feel free to share your thoughts on the current design.

@leonfuss leonfuss requested a review from b-studios November 6, 2024 15:14
@leonfuss
Author

leonfuss commented Nov 6, 2024

Currently, the tests fail for both LLVM and the Chez backend because they lack the flags interface. To proceed, I will disable the regex tests on those backends. As long as the regex engine of a backend supports flags, implementing those should be straightforward, as the only stateful Flag (Global) is now directly expressed in Effekt.

@leonfuss leonfuss force-pushed the feat/regex/capture_groups branch from 95eebc4 to c6997f0 Compare November 6, 2024 16:37