Appropriately fast LIKE pattern compilation #286

Merged
merged 9 commits into from
Sep 24, 2020

Member

@dlurton dlurton commented Sep 21, 2020

Implements #284

This is able to compile the LIKE pattern `%<n>%`, where `<n>` is 8000 `!` characters, in ~60ms on my local machine. The previous implementation took too long to measure on my local machine, even after #279, because it used an unoptimized state machine. I do not know whether the algorithm in this PR has a name, but it does not use a state machine.

I haven't yet analyzed the evaluation-time performance of the compiled pattern. I will need to do that before this can be merged.
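
For readers who want a concrete picture, here is a minimal sketch of compiling a LIKE pattern without a state machine: one left-to-right pass turns the pattern into a flat list of parts, and a small backtracking matcher walks that list. This is not the code from this PR. Every name below (`PatternPart`, `parsePattern`, `matches`) is an assumption made for illustration, and the sketch works on `Char` values rather than Unicode code points.

```kotlin
// Hypothetical sketch, not the PR's implementation.
sealed class PatternPart {
    object AnyOneChar : PatternPart()                        // the `_` wildcard
    object ZeroOrMoreOfAnyChar : PatternPart()               // the `%` wildcard
    data class ExactChars(val text: String) : PatternPart()  // a run of literal characters
}

// Single pass over the pattern, so compilation is linear in the pattern length.
fun parsePattern(pattern: String, escape: Char? = null): List<PatternPart> {
    val parts = ArrayList<PatternPart>()
    val literal = StringBuilder()

    // Emits the pending literal run, if any, as one ExactChars part.
    fun flush() {
        if (literal.isNotEmpty()) {
            parts.add(PatternPart.ExactChars(literal.toString()))
            literal.setLength(0)
        }
    }

    var i = 0
    while (i < pattern.length) {
        val c = pattern[i]
        when {
            // An escaped character is always taken literally.
            escape != null && c == escape && i + 1 < pattern.length -> {
                literal.append(pattern[i + 1])
                i++
            }
            c == '_' -> { flush(); parts.add(PatternPart.AnyOneChar) }
            c == '%' -> { flush(); parts.add(PatternPart.ZeroOrMoreOfAnyChar) }
            else -> literal.append(c)
        }
        i++
    }
    flush()
    return parts
}

// Recursive backtracking matcher: `p` indexes the parts, `i` indexes the input.
fun matches(parts: List<PatternPart>, input: String, p: Int = 0, i: Int = 0): Boolean {
    if (p == parts.size) return i == input.length
    return when (val part = parts[p]) {
        is PatternPart.AnyOneChar ->
            i < input.length && matches(parts, input, p + 1, i + 1)
        is PatternPart.ExactChars ->
            input.startsWith(part.text, i) && matches(parts, input, p + 1, i + part.text.length)
        is PatternPart.ZeroOrMoreOfAnyChar ->
            if (p + 1 == parts.size) true  // a trailing `%` matches whatever is left
            else (i..input.length).any { matches(parts, input, p + 1, it) }
    }
}
```

With this sketch, `matches(parsePattern("a_c%"), "abcdef")` is `true`. The matcher recurses once per pattern part, so its stack depth grows with the number of parts rather than with the length of the literal runs.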

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

In two ways:

- Change fold/union operations to accumulate to a single list.
- Replace *ordered* sets and maps with hash sets and maps.

This results in a > 10x improvement in compiling large like patterns
(i.e. 1000 characters and up).
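
The two changes listed above can be illustrated in general terms. The sketch below is not taken from the project: the function names and the `Int` element type are assumptions chosen only to contrast re-copying an ordered set on every union step with accumulating into one list and building a hash set once.

```kotlin
import java.util.TreeSet

// Hypothetical "before": every step copies the accumulated ordered set and
// unions the next group into the copy, so earlier elements are re-inserted repeatedly.
fun unionRepeatedCopies(groups: List<Set<Int>>): Set<Int> {
    var acc: Set<Int> = TreeSet<Int>()
    for (g in groups) {
        val next = TreeSet<Int>(acc)  // full copy of everything accumulated so far
        next.addAll(g)
        acc = next
    }
    return acc
}

// Hypothetical "after": accumulate into a single list, then build one hash set.
fun unionSingleAccumulator(groups: List<Set<Int>>): Set<Int> {
    val all = ArrayList<Int>()
    for (g in groups) all.addAll(g)
    return HashSet(all)  // no ordering maintained, no per-step copying
}
```

Both functions return equal sets. The second avoids the repeated copying and the cost of keeping elements ordered, which is the kind of saving the description attributes to these two changes.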
@dlurton dlurton requested a review from therapon September 21, 2020 02:41
@codecov-commenter

codecov-commenter commented Sep 21, 2020

Codecov Report

Merging #286 into master will increase coverage by 0.64%.
The diff coverage is 92.07%.

@@             Coverage Diff              @@
##             master     #286      +/-   ##
============================================
+ Coverage     82.44%   83.08%   +0.64%     
- Complexity     1202     1287      +85     
============================================
  Files           155      157       +2     
  Lines          9283     9745     +462     
  Branches       1522     1647     +125     
============================================
+ Hits           7653     8097     +444     
- Misses         1175     1190      +15     
- Partials        455      458       +3     
| Flag | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| #CLI | 18.11% <ø> (ø) | 19.00 <ø> (ø) |
| #EXAMPLES | 76.01% <ø> (ø) | 27.00 <ø> (ø) |
| #LANG | 85.72% <92.07%> (+0.59%) | 1084.00 <21.00> (+85.00) |
| #PTS | 100.00% <ø> (ø) | 0.00 <ø> (ø) |
| #TEST_SCRIPT | 79.68% <ø> (ø) | 157.00 <ø> (ø) |

| Impacted Files | Coverage Δ | Complexity Δ |
| --- | --- | --- |
| ...g/partiql/lang/eval/like/CheckpointIteratorImpl.kt | 88.88% <88.88%> (ø) | 7.00 <7.00> (?) |
| ...tiql/lang/eval/like/CodepointCheckpointIterator.kt | 90.90% <90.90%> (ø) | 8.00 <8.00> (?) |
| lang/src/org/partiql/lang/eval/like/PatternPart.kt | 92.06% <92.06%> (ø) | 0.00 <0.00> (?) |
| ...ng/src/org/partiql/lang/eval/EvaluatingCompiler.kt | 83.59% <94.44%> (-0.05%) | 150.00 <6.00> (ø) |
| lang/src/org/partiql/lang/Exceptions.kt | 83.33% <0.00%> (+1.51%) | 0.00% <0.00%> (ø%) |
| lang/src/org/partiql/lang/syntax/SqlParser.kt | 84.84% <0.00%> (+3.55%) | 287.00% <0.00%> (+70.00%) |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67c0e1c...85d494b. Read the comment docs.

@dlurton
Member Author

dlurton commented Sep 21, 2020

Added some of my own comments which I will address along with the other reviewer's comments.

@dlurton
Member Author

dlurton commented Sep 23, 2020

I've done a fair bit of checking on the evaluation-time performance and memory consumption. Both seem on par with the original LIKE implementation. This is ready to merge, pending review.

therapon
therapon previously approved these changes Sep 24, 2020
Contributor

@therapon therapon left a comment

LGTM.

Not really an issue for me, more of a limitation: the recursive call pattern for dealing with `%` can cause a stack overflow. The pathological example would be a pattern with a long series of leading `%`.


@Test
fun stressTest() {
    // 4000 leading `%` wildcards followed by a single literal character.
    executePattern(parsePattern("%".repeat(4000) + "a", null), "a")
}

Which we could "compile" into a pattern with 1 leading %.
If users end up writing such a pattern (I hope they do not) they can do the rewrite to get around it. :)

@dlurton
Member Author

dlurton commented Sep 24, 2020

I've added a change that treats multiple consecutive `%` as a single `%`, which will mitigate this somewhat.
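
As an illustration of that idea only (not the change that was committed), a pattern string can be normalized so that each run of `%` becomes a single `%`. The helper name `collapsePercents` is hypothetical, and this sketch assumes the pattern has no escape character, since an escaped `%` would have to be left alone.

```kotlin
// Hypothetical helper: collapses each run of consecutive '%' into one '%'.
// Assumes no escape character is in use.
fun collapsePercents(pattern: String): String {
    val out = StringBuilder()
    for (c in pattern) {
        // Skip a '%' that immediately follows a '%' already emitted.
        if (c == '%' && out.isNotEmpty() && out.last() == '%') continue
        out.append(c)
    }
    return out.toString()
}
```

Applied to the stress-test pattern from the review, `collapsePercents("%".repeat(4000) + "a")` yields `"%a"`, so the matcher sees one wildcard instead of 4000.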

alancai98
alancai98 previously approved these changes Sep 24, 2020
@dlurton dlurton requested a review from therapon September 24, 2020 20:20
@dlurton dlurton merged commit c5d72c0 into master Sep 24, 2020
dlurton added a commit that referenced this pull request Sep 29, 2020
Implements #284

Replaces the previous LIKE implementation, which is slow when compiling large patterns containing the wildcard character `%`, with an implementation that compiles patterns in linear time and has similar performance characteristics at evaluation time.
@dlurton dlurton deleted the new-like branch September 29, 2020 22:01
@alancai98 alancai98 linked an issue Sep 30, 2020 that may be closed by this pull request
dlurton added a commit that referenced this pull request Mar 15, 2021
#32 appears to have been inadvertently fixed by #286. 

This commit adds a regression test for #32.
Development

Successfully merging this pull request may close these issues.

Optimize LIKE pattern compilation
4 participants