Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "like" filter. #3642

Merged
merged 9 commits into from
Nov 4, 2016
Merged

Add "like" filter. #3642

merged 9 commits into from
Nov 4, 2016

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Nov 1, 2016

The main purpose of this is a performance boost over the regex filter (which can do similar stuff) for prefix patterns like "foo%" (see benchmarks below). A secondary purpose is making life easier for people that are familiar with SQL LIKE syntax.

Benchmarks:

  • matchLikeEquals is foo LIKE "1000000"
  • matchSelectorEquals is foo = "1000000" using the existing selector filter
  • matchLikePrefix is foo LIKE "50%"
  • matchRegexPrefix is foo RLIKE "^50.*" using the existing regex filter
Benchmark                                (cardinality)  Mode  Cnt       Score      Error  Units
LikeFilterBenchmark.matchLikeEquals               1000  avgt   10       0.799 ±    0.033  us/op
LikeFilterBenchmark.matchLikeEquals             100000  avgt   10       1.570 ±    0.055  us/op
LikeFilterBenchmark.matchLikeEquals            1000000  avgt   10       1.688 ±    0.046  us/op
LikeFilterBenchmark.matchSelectorEquals           1000  avgt   10       0.838 ±    0.009  us/op
LikeFilterBenchmark.matchSelectorEquals         100000  avgt   10       1.447 ±    0.035  us/op
LikeFilterBenchmark.matchSelectorEquals        1000000  avgt   10       1.825 ±    0.087  us/op
LikeFilterBenchmark.matchLikePrefix               1000  avgt   10       5.214 ±    0.145  us/op
LikeFilterBenchmark.matchLikePrefix             100000  avgt   10     937.082 ±   53.887  us/op
LikeFilterBenchmark.matchLikePrefix            1000000  avgt   10   10594.309 ±  537.874  us/op
LikeFilterBenchmark.matchRegexPrefix              1000  avgt   10     153.203 ±   12.560  us/op
LikeFilterBenchmark.matchRegexPrefix            100000  avgt   10   15425.745 ± 1269.856  us/op
LikeFilterBenchmark.matchRegexPrefix           1000000  avgt   10  150684.500 ± 3378.273  us/op

@gianm gianm added the Feature label Nov 1, 2016
@gianm gianm added this to the 0.9.3 milestone Nov 1, 2016

private final String dimension;
private final String pattern;
private final String escape;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe store escape as Character, since it must be null or 1-symbol string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe add @Nullable, to help IDE to highlight non-checking access?

private final SuffixStyle suffixStyle;
private final Pattern suffixPattern;

public LikeMatcher(final String prefix, final SuffixStyle suffixStyle, final Pattern suffixPattern)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this constructor be public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No good reason; changed to private

MATCH_PATTERN
}

private final String prefix;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nullable?


private final String prefix;
private final SuffixStyle suffixStyle;
private final Pattern suffixPattern;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nullable?

this.suffixPattern = suffixPattern;

if (suffixPattern == null && suffixStyle == SuffixStyle.MATCH_PATTERN) {
throw new ISE("Need suffixPattern for pattern matching!");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe IAE, since we are in constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to IAE


public DruidPredicateFactory predicateFactory(final ExtractionFn extractionFn)
{
return new DruidPredicateFactory()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please give this class a name? Something like class LikeDruidPredicateFactory {...} return new LikeDruidPredicateFactory()? Just one extra LOC, but more readability in traces and logging. I've started fighting anonymous classes here: https://github.com/metamx/druid/commits/more-topn-metrics, this branch is far from merge, but probably new code could avoid this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the anonymous class is easier to read (much less boilerplate code). My personal feeling is the balance between code readability and logging readability is in favor of anonymous classes for things like this. But I'm open to being convinced otherwise or outvoted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, nevermind, I thought you were asking for a normal inner class. But based on "just one extra LOC" I guess you're asking for a local inner class in that method. Sure that would work, although I guess I think they look kind of weird…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm yes I mean local inner class. The name also serves as additional documentation sometimes, e. g. metamx@225c7e9#diff-8da1c4dd87d76bcce224a6be6bf74ed8R615 (another extra LOC justifying using local inner class probably shouldn't be there if this would be a common practice)

}
};
} else {
return new Predicate<String>()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This instance could be cached, or LikeMatcher could implement Predicate itself and return this here. Less indirection is a good thing for performance.

}
};
} else {
return new DruidLongPredicate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This instance could be cached, or LikeMatcher could implement DruidLongPredicate itself and return this here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The predicate returned depends on whether there's an extraction fn or not so it can't just be this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you mean just for the else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the benefit there relative to code complexity is gonna be marginal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok anyway it should (?) be changed in a number of DimFilters, so when and if I will make such PR this could be discussed again


public class LikeFilter implements Filter
{
private static final String HIGHEST_CHARACTER = String.valueOf(Character.MAX_VALUE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why couldn't this constant be char?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just so + works later on

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm Hm + works with chars as well isn't it? And after that I think own constant is not needed, can + Character.MAX_VALUE directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, you're right. I changed it.


// Union bitmaps for all matching dimension values in range
return selector.getBitmapFactory().union(
new Iterable<ImmutableBitmap>()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Iterable chain of nested classes is very noisy. Maybe create a simple List of bitmaps, fill it in a for loop with i between startIndex and endIndex, then return bitmapFactory.union(bitmaps)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the iterator since it can create and union the bitmaps one by one. That means decreased memory use for unioning a large number of bitmaps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm ok, thanks. Could you please document this in the code? Because it is not obvious.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe can create a list of integers from beginIndex to endIndex and then transform it using Guava's lazy transformation functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment.

} else {
// suffixStyle is MATCH_PATTERN
return suffixPattern.matcher(suffix).matches();
return suffixPattern.matcher(Strings.nullToEmpty(s).substring(prefix.length())).matches();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe can just return matches(s); here? (That is what I meant in my earlier comment). It's not clear what would be faster, anyway it should be comparable. But suffixPattern could then be removed, reducing complexity of LikeMatcher.from() and LikeMatcher is general.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet this is faster, since it totally skips even looking at the first prefix.length characters. Calling matches would have to read the first prefix.length characters to verify that they match the prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually I thought substring would return a view but it copies the underlying array. Now I'm not so sure. Okay… I'll get rid of suffixPattern if the benchmarks look equivalent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm right. My concern is substring(). But seems it could be addressed by Matcher m = suffixPattern.matcher(s); return m.find(prefix.length());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More precisely return m.find(prefix.length()) && m.start() == prefix.length();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm I written this comment: #3642 (comment) before I see your last comment in this mini-thread. So I actually agree matching only suffix should be faster. And I suggested a way to avoid both substring and matching a prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, m.find(prefix.length()) && m.start() == prefix.length(); could potentially suffer performance-wise due to spending extra effort looking for substring matches after the prefix.length.

At any rate, I tried that, and didn't see a noticeable difference in performance with really any of these approaches, meaning it's tough to verify guesses like the above. So to keep things simple I'm planning to kill suffixPattern and go with your simple suggestion return matches(s);.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real increase is going to come from extracting prefix and using that for index lookups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I pushed a new version with just pattern, no suffixPattern. benchmarks are the same within the margin of error.

if (suffixStyle == SuffixStyle.MATCH_ANY) {
return true;
} else if (suffixStyle == SuffixStyle.MATCH_EMPTY) {
return suffix.isEmpty();
return s == null || s.length() == prefix.length();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why in this case s == null means a match here? Maybe only if prefix.length() == 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption is you should only call this method if you independently verified the first prefix.length characters are equal to prefix. This check as-is will return the right answer if the precondition is adhered to. If not, that's the caller's problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm seems like if caller is feeding null here while prefix is not empty/null, it's a mistake in caller code, maybe should better throw something like NPE/IAE, rather than silently return true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the caller code is definitely wrong if this is happening. I just wanted to avoid checking unnecessary things here since this code is going to happen in a tight loop.

I changed it a bit to be:

return (s == null ? 0 : s.length()) == prefix.length();

So now the length check is always done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this seems to be the right version.

@@ -75,94 +79,113 @@ public LikeDimFilter(
MATCH_PATTERN
}

// Strings match if they start with "prefix" AND:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these comments could be Javadocs of SuffixStyle enum or constants themselves?

)
{
final StringBuilder prefix = new StringBuilder();
final StringBuilder suffixRegex = new StringBuilder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now this method could be simplified. Instead of accumulating suffixRegex, it could just assign suffixStyle variable, and make a simple constructor call in the end instead of if-else chain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, good point. I removed suffixRegex and replaced it with a variable for the suffix matching style. I also removed boolean extractPrefix since it wasn't doing anything really useful anymore.

@navis
Copy link
Contributor

navis commented Nov 2, 2016

hm.. we have 'like' math function code of which is copied from hive. would it be better to use somewhere validated code?

@gianm
Copy link
Contributor Author

gianm commented Nov 2, 2016

@navis I'm not familiar with hive's version to know if it does the stuff this version does (we need to split out any exact prefix for the index lookup optimization, and we want to support custom "escape" chars for clauses such as foo LIKE "bar@%" ESCAPE "@"). If so and if performance is as good or better… sure. Really I didn't look into it since LIKE parsing is relatively simple.

@navis
Copy link
Contributor

navis commented Nov 3, 2016

I got this is specialized version for bitmap index not like what I've commented above. So forget what I've said.

@gianm
Copy link
Contributor Author

gianm commented Nov 3, 2016

@leventov any further comments?

@leventov
Copy link
Member

leventov commented Nov 3, 2016

@gianm no, LGTM.

Copy link
Member

@nishantmonu51 nishantmonu51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, some more boundary tests would be nice.

ImmutableList.of("2", "4")
);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: can we also add boundary tests for matchAll, matchNone, match against string having wildcard only ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests for matching an empty string with and without extractionFns, and matching with just a wildcard.

@fjy
Copy link
Contributor

fjy commented Nov 3, 2016

👍

@nishantmonu51
Copy link
Member

👍

@nishantmonu51 nishantmonu51 merged commit 2c504b6 into apache:master Nov 4, 2016
fundead pushed a commit to fundead/druid that referenced this pull request Dec 7, 2016
* Add "like" filter.

* Addressed some PR comments.

* Slight simplifications to LikeFilter.

* Additional simplifications.

* Fix comment in LikeFilter.

* Clarify comment in LikeFilter.

* Simplify LikeMatcher a bit.

* No use going through the optimized path if prefix is empty.

* Add more tests.
@gianm gianm deleted the like-filter branch February 9, 2017 07:48
dgolitsyn pushed a commit to metamx/druid that referenced this pull request Feb 14, 2017
* Add "like" filter.

* Addressed some PR comments.

* Slight simplifications to LikeFilter.

* Additional simplifications.

* Fix comment in LikeFilter.

* Clarify comment in LikeFilter.

* Simplify LikeMatcher a bit.

* No use going through the optimized path if prefix is empty.

* Add more tests.
seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this pull request Feb 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants