Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Rust][DataFusion] Improve like/nlike performance #71

Closed
alamb opened this issue Apr 26, 2021 · 1 comment
Closed

[Rust][DataFusion] Improve like/nlike performance #71

alamb opened this issue Apr 26, 2021 · 1 comment
Labels
arrow Changes to the arrow crate

Comments

@alamb
Copy link
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8681

Currently, the implementation of  like_utf8 and nlike_utf8 is based on regex, which is simple and readable, but poor at the performance.

 

I do some benchmark in [https://github.com/TennyZhuang/like-bench/] , in this repo, I compare three like algorithm.

like(includes partial_like): this is the first naive version, using the recursive approach, which will cause terrible performance on special attack input, such as a%a%a%a%a%a%a%a%b.

like_to_regex: which is almost the same as the current implementation in arrow.

like_optimize: the like problem is similar to glob in shell, so a perfect solution is proposed in [https://research.swtch.com/glob] . The code in the research is written golang but I translate it to rust.

 

In my benchmark result, the recursive solution can be ignored due to bad time complexity lower bound.

the regex solution will cost about 1000x time including regex compiling, and about 4x time without regex compiling then solution 3. And It seems that the code complexity of solution 3 is acceptable.

Everyone can reproduce the benchmark result using this repo with a few codes.

 

I have submitted a PR to TiKV to optimize the like performance ([https://github.com/tikv/tikv/pull/5866/files|https://github.com/tikv/tikv/pull/5866/files)], without UTF-8 support), and add collation support in [https://github.com/tikv/tikv/pull/6592], which can be simply port to data-fusion.

 

 

 

 

@alamb alamb added the arrow Changes to the arrow crate label Apr 26, 2021
@alamb
Copy link
Contributor Author

alamb commented Apr 26, 2021

wrong repo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

No branches or pull requests

2 participants