Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract regex and add fancy_regex backend #908

Closed

Conversation

josephrocca
Copy link
Contributor

This is the first part of one potential approach to getting Wasm and pure rust builds of tokenizers.

Relevant issue: #63

Here's the command to test the fancy_regex backend:

cargo test --no-default-features --features progressbar,http,regex-fancy

It's a bit strange that there's not a nicer way to do that (IIUC), but it seems like this is a known problem.

You can also test both backends at the same time (return values checked against one another for consistency):

cargo test --features regex-all-test

The abstraction layer code was created by the nlprule and syntect project devs, with only minor edits from myself:

Copy link
Collaborator

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi since this is your first contribution, and even if I don't approve of just using a different regex engine, I did review it so you can have in mind what this projects concerns are.

Cheers !

SplitPattern::String(s) => Regex::new(&regex::escape(s))?,
SplitPattern::Regex(r) => Regex::new(r)?,
SplitPattern::String(s) => Regex::new(regex::escape(s)),
SplitPattern::Regex(r) => Regex::new(r.to_owned()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to refrain making everything owned when the compiler asks you too.

It's the easy way out, but the incorrect way out most likely.
Enjoy the ride of understanding lifetimes :) . In my experience, it should be roughly trivial to add lifetimes (when you know how that works), and when it's not, usually the design has some flaw actually (which the compiler warns me about actually)

@@ -3,6 +3,7 @@

.vim
.env
.venv
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. That's a user thing it does not belong here.

Please use a global .gitignore for you (then you also only define it once on your machine).

@@ -59,11 +58,18 @@ cached-path = { version = "0.5", optional = true }
aho-corasick = "0.7"
paste = "1.0.6"
proc_macros = { path = "./src/utils/proc_macros" }
once_cell = "1.8"
cfg-if = "1"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's needed actually.

You can probably write this as pure rust compilation #cfg.

impl Regex {
/// Create a new regex from the pattern string.
///
/// Note that the regex compilation happens on first use, which is why this method does not
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think this library wants to compile the regex on load, not on-use, that changes this to a Result.

It also probably enables your to pass &str since the Regex compilation will create the object without making the copy and doesn't need to keep a reference to the string, (so no lifetime issues here it seems).

@@ -59,11 +58,18 @@ cached-path = { version = "0.5", optional = true }
aho-corasick = "0.7"
paste = "1.0.6"
proc_macros = { path = "./src/utils/proc_macros" }
once_cell = "1.8"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need that either.

Even keeping the idea of fancy-regex, a simple enum is OK.
And even then, we don't want an enum. We want something that uses the correct regex engine ( onig or onig-rs) directly at compilation time.

@@ -0,0 +1,497 @@
// All code and comments below this first line are copied from here (with some edits): https://github.com/bminixhofer/nlprule/blob/main/nlprule/src/utils/regex.rs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, overall refrain from doing things this way.

Copy pasting code from another repo is likely to be the wrong solution.
They have a different context, different goals, you can't just reuse and hope it fits the current projects needs.

Taking an inspiration from other sources is GREAT, and you should definitely continuing looking at other repos for ideas on how to handle problems.

But you cannot blindly apply things, you need to understand the current project scopes, philosophy, concerns.

Rule of thumb:

  • Less changes is better
  • Imitate the style of previous code is better

@josephrocca
Copy link
Contributor Author

@Narsil Thanks for the detailed review! It was helpful. I'll close this per your comment here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants