Improve like/nlike performance #88

alamb · 2021-04-26T13:17:50Z

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8681

Currently, the implementation of like_utf8 and nlike_utf8 is based on regex, which is simple and readable but performs poorly.

I ran some benchmarks in [https://github.com/TennyZhuang/like-bench/]. In this repo, I compare three LIKE algorithms.

`like` (includes partial_like): this is the first, naive version. It uses a recursive approach, which performs terribly on adversarial input such as `a%a%a%a%a%a%a%a%b`.

`like_to_regex`: this is almost the same as the current implementation in arrow.

`like_optimize`: the LIKE problem is similar to glob matching in the shell, for which a very good solution is proposed in [https://research.swtch.com/glob]. The code in that article is written in Go, but I translated it to Rust.
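To make the idea behind `like_optimize` concrete, here is a minimal sketch of the single-restart-point technique described in the glob article, adapted to LIKE wildcards. It is not the code from the benchmark repo or from arrow. The function name `like_match` is hypothetical, it works on raw bytes (so `_` matches one byte, not one UTF-8 character), and it assumes there is no escape character.

```rust
/// Sketch of linear-backtracking LIKE matching on bytes.
/// `%` matches any run of bytes (including none); `_` matches exactly one byte.
/// Assumption: no escape character is supported.
fn like_match(pattern: &[u8], text: &[u8]) -> bool {
    let (mut px, mut tx) = (0usize, 0usize);
    // Restart point recorded at the most recent `%`.
    let (mut next_px, mut next_tx) = (0usize, 0usize);
    let mut have_restart = false;

    while px < pattern.len() || tx < text.len() {
        if px < pattern.len() {
            match pattern[px] {
                b'%' => {
                    // First try matching zero bytes; on a later mismatch,
                    // resume here with the `%` absorbing one more byte.
                    next_px = px;
                    next_tx = tx + 1;
                    have_restart = true;
                    px += 1;
                    continue;
                }
                b'_' => {
                    if tx < text.len() {
                        px += 1;
                        tx += 1;
                        continue;
                    }
                }
                c => {
                    if tx < text.len() && text[tx] == c {
                        px += 1;
                        tx += 1;
                        continue;
                    }
                }
            }
        }
        // Mismatch: fall back to the last `%` only; earlier ones never need revisiting.
        if have_restart && next_tx <= text.len() {
            px = next_px;
            tx = next_tx;
            continue;
        }
        return false;
    }
    true
}
```

Because only the most recent `%` is ever revisited, the worst case is roughly proportional to pattern length times text length rather than exponential, so a call like `like_match(b"a%a%a%a%a%a%a%a%b", text)` stays cheap even when `text` is a long run of `a`.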

In my benchmark results, the recursive solution can be ruled out because of its poor worst-case time complexity.

The regex solution takes about 1000x as long as solution 3 when regex compilation is included, and about 4x as long when it is not. The code complexity of solution 3 also seems acceptable.

Anyone can reproduce the benchmark results using this repo with a few lines of code.

I have submitted a PR to TiKV to optimize LIKE performance ([https://github.com/tikv/tikv/pull/5866/files], without UTF-8 support) and added collation support in [https://github.com/tikv/tikv/pull/6592]. Both can be ported to DataFusion fairly easily.


alamb · 2021-04-26T13:17:52Z

Comment from Andrew Lamb(alamb) @ 2021-04-26T12:30:42.598+0000:

Migrated to github: https://github.com/apache/arrow-rs/issues/71

alamb · 2023-03-10T12:29:12Z

I am pretty sure the arrow-rs kernels now do the standard tricks like using substring matching with %% etc.; see https://github.com/apache/arrow-rs/blob/61c4f12e84330db243789fc98375512d67628e57/arrow-string/src/like.rs#L303-L306, so I am closing this issue.
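As an illustration of the kind of substring trick mentioned above, the following sketch classifies a LIKE pattern and answers the simple cases with plain string operations. It is not the arrow-rs code; `LikePlan`, `plan_like`, and `like_fast` are hypothetical names, and it assumes a backslash is the escape character, so any pattern containing one is sent to the general path.

```rust
/// Hypothetical classification of a LIKE pattern into a cheaper string operation.
#[derive(Debug, PartialEq)]
enum LikePlan<'a> {
    Equals(&'a str),
    StartsWith(&'a str),
    EndsWith(&'a str),
    Contains(&'a str),
    /// Needs a full matcher (regex or a glob-style algorithm).
    General,
}

/// True if `s` has no wildcard and no (assumed) escape character.
fn is_literal(s: &str) -> bool {
    !s.contains(|c: char| matches!(c, '%' | '_' | '\\'))
}

fn plan_like(pattern: &str) -> LikePlan<'_> {
    if is_literal(pattern) {
        return LikePlan::Equals(pattern);
    }
    // `%lit%` -> substring search
    if let Some(inner) = pattern.strip_prefix('%').and_then(|s| s.strip_suffix('%')) {
        if is_literal(inner) {
            return LikePlan::Contains(inner);
        }
    }
    // `lit%` -> prefix test
    if let Some(prefix) = pattern.strip_suffix('%') {
        if is_literal(prefix) {
            return LikePlan::StartsWith(prefix);
        }
    }
    // `%lit` -> suffix test
    if let Some(suffix) = pattern.strip_prefix('%') {
        if is_literal(suffix) {
            return LikePlan::EndsWith(suffix);
        }
    }
    LikePlan::General
}

/// Returns `Some(result)` for the fast cases, `None` when a full matcher is needed.
fn like_fast(pattern: &str, text: &str) -> Option<bool> {
    match plan_like(pattern) {
        LikePlan::Equals(p) => Some(text == p),
        LikePlan::StartsWith(p) => Some(text.starts_with(p)),
        LikePlan::EndsWith(p) => Some(text.ends_with(p)),
        LikePlan::Contains(p) => Some(text.contains(p)),
        LikePlan::General => None,
    }
}
```

The classification depends only on the pattern, so with a scalar pattern it can be done once and reused for every row in a column.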

alamb added the datafusion Changes in the datafusion crate label Apr 26, 2021

alamb mentioned this issue Mar 10, 2023

[EPIC] A list of performance improvement tickets #5546

Open

29 tasks

alamb closed this as completed Mar 10, 2023
