Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve performance of regexp_count #13364

Merged
merged 3 commits into from
Nov 12, 2024
Merged

improve performance of regexp_count #13364

merged 3 commits into from
Nov 12, 2024

Conversation

Dimchikkk
Copy link
Contributor

@Dimchikkk Dimchikkk commented Nov 11, 2024

Which issue does this PR close?

Closes #13011

Rationale for this change

regexp_count becomes the fastest from regexp functions :)

Are these changes tested?

     Running benches/regx.rs (target/release/deps/regx-47e5923f6186ee59)
regexp_count_1000 string
                        time:   [834.71 µs 875.47 µs 914.87 µs]
                        change: [-82.379% -81.688% -80.955%] (p = 0.00 < 0.05)
                        Performance has improved.

regexp_count_1000 utf8view
                        time:   [813.73 µs 862.53 µs 908.74 µs]
                        change: [-82.999% -82.228% -81.393%] (p = 0.00 < 0.05)
                        Performance has improved.

regexp_like_1000        time:   [2.4620 ms 2.4930 ms 2.5463 ms]
                        change: [-0.1759% +1.1978% +3.3854%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

regexp_match_1000       time:   [2.8397 ms 2.8786 ms 2.9387 ms]
                        change: [-2.2173% +0.7988% +3.4682%] (p = 0.65 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

regexp_replace_1000     time:   [2.3042 ms 2.3113 ms 2.3183 ms]
                        change: [-1.0469% +0.4878% +1.7127%] (p = 0.53 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

Are there any user-facing changes?

No

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Dimchikkk for your contribution.
Do you mean the Entry API misbehaved returning Vacant all the time and forced the regexp pattern to be recompiled?

@Dimchikkk
Copy link
Contributor Author

Thanks @Dimchikkk for your contribution. Do you mean the Entry API misbehaved returning Vacant all the time and forced the regexp pattern to be recompiled?

Hi @comphead ,
No, I haven't found out the root cause why the code using Entry API less performant. I went from another angle... looking into regexp_replace source file (that was like 50% faster) and trying to do things in a similar way & benchmark it.... so I basically rewrote the function that inserts into the cache.

@comphead
Copy link
Contributor

Please rebase from latest main to avoid the CI failure and personally I like the numbers

@Dimchikkk
Copy link
Contributor Author

@comphead I actually found the root cause... it was the cloning of regex that is expensive. Now the numbers even more sexy:
image

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Dimchikkk its great first PR 👍
Since this is a first I'll wait for another member to approve and we can merge it.

@Dandandan if you dont mind to approve?

@comphead
Copy link
Contributor

comphead commented Nov 12, 2024

@comphead I actually found the root cause... it was the cloning of regex that is expensive.

Yeah, that makes much more sense

@comphead comphead merged commit 705dd0e into apache:main Nov 12, 2024
25 checks passed
@Omega359
Copy link
Contributor

Nice find!

@Dimchikkk
Copy link
Contributor Author

Thank you guys, now I am wondering why other regexp functions slower than regexp_count :)

@Dandandan
Copy link
Contributor

Thank you guys, now I am wondering why other regexp functions slower than regexp_count :)

That would be a great thing to check... one thing I saw in the arrow-rs kernels some (string) cloning is happening. Would be great to check & improve!

@comphead
Copy link
Contributor

I'm wondering if other regexp_* functions have the same flaw

@comphead
Copy link
Contributor

Thank you guys, now I am wondering why other regexp functions slower than regexp_count :)

Would be nice if you can check really quick other regexp functions if they can be optimized the same way

alamb pushed a commit to alamb/datafusion that referenced this pull request Nov 13, 2024
* improve performance of regexp_count

* fix clippy

* collect with Int64Array to eliminate one temp Vec

---------

Co-authored-by: Dima <dima.rets@ballys.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve performance of regexp_count
5 participants