Skip to content
This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

Replace one regex by pool of regex matchers. #557

Merged
merged 1 commit into from
Nov 28, 2018
Merged

Replace one regex by pool of regex matchers. #557

merged 1 commit into from
Nov 28, 2018

Conversation

kuba--
Copy link
Contributor

@kuba-- kuba-- commented Nov 22, 2018

Signed-off-by: kuba-- kuba@sourced.tech

First approach to close src-d/gitbase#544

@kuba-- kuba-- requested a review from a team November 22, 2018 16:24
@kuba-- kuba-- added the bug Something isn't working label Nov 22, 2018
if err != nil {
return false, err
if re.pool == nil {
re.pool = &sync.Pool{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't work when right is a column, for example. Because we will keep the same right for all cases. This just works for the case when right is constant.

Imagine we have: foo REGEXP bar

First eval:

  • No pool, create a pool where bar is 1

Second eval:

  • There's a pool, get a matcher whose regexp will be 1, but bar is 2 now. The result won't be correct.

There's an example in LIKE, I think, showing how to check if expr is constant to use a different path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but the question is - if it had worked for us?
Because I didn't change any logic. Before we had a just one matcher, now we have a synced pool, so the same bad things may happen. Generally, once you create an Regexp instance, right expression cannot mutate (because we keep pre-compiled matcher).
I think we should handle this on higher level, because compiling regex every time may have a really bad impact on performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behaviour might have been wrong before, but this still needs to be addressed.
How would we handle this on a higher level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so maybe for comparison and for like we'll take following approach:

var canCacheRegex = true
	Inspect(right, func(e sql.Expression) bool {
		if _, ok := e.(*GetField); ok {
			canCacheRegex = false
		}
		return true
	})

So if you can't cache it I'll create a new matcher every time (like it's done in like), but because it will impact performance, so for cache-like expressions we'll go with a pool.
WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean that we can have regexp patterns in a column? All the uses and tests were done with constant regexp patterns:

select
    files.file_path,
    ref_commits.repository_id,
    files.blob_content
FROM
    files
NATURAL JOIN
    commit_files
NATURAL JOIN
    ref_commits
WHERE
    ref_commits.ref_name = 'HEAD'
    AND ref_commits.history_index BETWEEN 0 AND 5
    AND is_binary(blob_content) = false
    AND files.file_path NOT REGEXP '^vendor.*'
    AND (
        blob_content REGEXP '(?i)facebook.*[\'\\"][0-9a-f]{32}[\'\\"]'
        OR blob_content REGEXP '(?i)twitter.*[\'\\"][0-9a-zA-Z]{35,44}[\'\\"]'
        OR blob_content REGEXP '(?i)github.*[\'\\"][0-9a-zA-Z]{35,40}[\'\\"]'
        OR blob_content REGEXP 'AKIA[0-9A-Z]{16}'
        OR blob_content REGEXP '(?i)reddit.*[\'\\"][0-9a-zA-Z]{14}[\'\\"]'
        OR blob_content REGEXP '(?i)heroku.*[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}'
        OR blob_content REGEXP '.*-----BEGIN PRIVATE KEY-----.*'
        OR blob_content REGEXP '.*-----BEGIN RSA PRIVATE KEY-----.*'
        OR blob_content REGEXP '.*-----BEGIN DSA PRIVATE KEY-----.*'
        OR blob_content REGEXP '.*-----BEGIN OPENSSH PRIVATE KEY-----.*'
    );

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can, per example, have as regexp input a column. That means the regular expression could be different every time.

Copy link
Contributor

@ajnavarro ajnavarro Nov 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should add a rule to check this cases; If the regular expression input is a Literal, then use always the same matcher. If not, use a different matcher per Eval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erizocosmico, @ajnavarro, @jfontan - how about this - PTAL.

For non-cached, every time create a new matcher. For cached, check the pool of matchers

Signed-off-by: kuba-- <kuba@sourced.tech>
@ajnavarro ajnavarro merged commit 2e4eb83 into src-d:master Nov 28, 2018
@kuba-- kuba-- deleted the fix-544/oniguruma branch November 29, 2018 22:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

oniguruma segfault
4 participants