-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "like" filter. #3642
Add "like" filter. #3642
Conversation
|
||
private final String dimension; | ||
private final String pattern; | ||
private final String escape; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe store escape as Character
, since it must be null or 1-symbol string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also maybe add @Nullable
, to help IDE to highlight non-checking access?
private final SuffixStyle suffixStyle; | ||
private final Pattern suffixPattern; | ||
|
||
public LikeMatcher(final String prefix, final SuffixStyle suffixStyle, final Pattern suffixPattern) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this constructor be public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No good reason; changed to private
MATCH_PATTERN | ||
} | ||
|
||
private final String prefix; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Nullable
?
|
||
private final String prefix; | ||
private final SuffixStyle suffixStyle; | ||
private final Pattern suffixPattern; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Nullable
?
this.suffixPattern = suffixPattern; | ||
|
||
if (suffixPattern == null && suffixStyle == SuffixStyle.MATCH_PATTERN) { | ||
throw new ISE("Need suffixPattern for pattern matching!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe IAE
, since we are in constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to IAE
|
||
public DruidPredicateFactory predicateFactory(final ExtractionFn extractionFn) | ||
{ | ||
return new DruidPredicateFactory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please give this class a name? Something like class LikeDruidPredicateFactory {...} return new LikeDruidPredicateFactory()
? Just one extra LOC, but more readability in traces and logging. I've started fighting anonymous classes here: https://github.com/metamx/druid/commits/more-topn-metrics, this branch is far from merge, but probably new code could avoid this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the anonymous class is easier to read (much less boilerplate code). My personal feeling is the balance between code readability and logging readability is in favor of anonymous classes for things like this. But I'm open to being convinced otherwise or outvoted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, nevermind, I thought you were asking for a normal inner class. But based on "just one extra LOC" I guess you're asking for a local inner class in that method. Sure that would work, although I guess I think they look kind of weird…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gianm yes I mean local inner class. The name also serves as additional documentation sometimes, e. g. metamx@225c7e9#diff-8da1c4dd87d76bcce224a6be6bf74ed8R615 (another extra LOC justifying using local inner class probably shouldn't be there if this would be a common practice)
} | ||
}; | ||
} else { | ||
return new Predicate<String>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This instance could be cached, or LikeMatcher
could implement Predicate
itself and return this
here. Less indirection is a good thing for performance.
} | ||
}; | ||
} else { | ||
return new DruidLongPredicate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This instance could be cached, or LikeMatcher
could implement DruidLongPredicate
itself and return this
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The predicate returned depends on whether there's an extraction fn or not so it can't just be this
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you mean just for the else
…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the benefit there relative to code complexity is gonna be marginal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok anyway it should (?) be changed in a number of DimFilter
s, so when and if I will make such PR this could be discussed again
|
||
public class LikeFilter implements Filter | ||
{ | ||
private static final String HIGHEST_CHARACTER = String.valueOf(Character.MAX_VALUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why couldn't this constant be char
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just so +
works later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gianm Hm +
works with chars as well isn't it? And after that I think own constant is not needed, can + Character.MAX_VALUE
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yeah, you're right. I changed it.
|
||
// Union bitmaps for all matching dimension values in range | ||
return selector.getBitmapFactory().union( | ||
new Iterable<ImmutableBitmap>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This Iterable
chain of nested classes is very noisy. Maybe create a simple List of bitmaps, fill it in a for loop with i between startIndex and endIndex, then return bitmapFactory.union(bitmaps)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the iterator since it can create and union the bitmaps one by one. That means decreased memory use for unioning a large number of bitmaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gianm ok, thanks. Could you please document this in the code? Because it is not obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe can create a list of integers from beginIndex
to endIndex
and then transform it using Guava's lazy transformation functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment.
} else { | ||
// suffixStyle is MATCH_PATTERN | ||
return suffixPattern.matcher(suffix).matches(); | ||
return suffixPattern.matcher(Strings.nullToEmpty(s).substring(prefix.length())).matches(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe can just return matches(s);
here? (That is what I meant in my earlier comment). It's not clear what would be faster, anyway it should be comparable. But suffixPattern
could then be removed, reducing complexity of LikeMatcher.from()
and LikeMatcher
is general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet this is faster, since it totally skips even looking at the first prefix.length
characters. Calling matches
would have to read the first prefix.length
characters to verify that they match the prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, actually I thought substring
would return a view but it copies the underlying array. Now I'm not so sure. Okay… I'll get rid of suffixPattern
if the benchmarks look equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gianm right. My concern is substring()
. But seems it could be addressed by Matcher m = suffixPattern.matcher(s); return m.find(prefix.length());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More precisely return m.find(prefix.length()) && m.start() == prefix.length();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gianm I written this comment: #3642 (comment) before I see your last comment in this mini-thread. So I actually agree matching only suffix should be faster. And I suggested a way to avoid both substring and matching a prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, m.find(prefix.length()) && m.start() == prefix.length();
could potentially suffer performance-wise due to spending extra effort looking for substring matches after the prefix.length
.
At any rate, I tried that, and didn't see a noticeable difference in performance with really any of these approaches, meaning it's tough to verify guesses like the above. So to keep things simple I'm planning to kill suffixPattern
and go with your simple suggestion return matches(s);
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The real increase is going to come from extracting prefix
and using that for index lookups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I pushed a new version with just pattern
, no suffixPattern
. benchmarks are the same within the margin of error.
if (suffixStyle == SuffixStyle.MATCH_ANY) { | ||
return true; | ||
} else if (suffixStyle == SuffixStyle.MATCH_EMPTY) { | ||
return suffix.isEmpty(); | ||
return s == null || s.length() == prefix.length(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why in this case s == null
means a match here? Maybe only if prefix.length() == 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assumption is you should only call this method if you independently verified the first prefix.length
characters are equal to prefix
. This check as-is will return the right answer if the precondition is adhered to. If not, that's the caller's problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gianm seems like if caller is feeding null
here while prefix is not empty/null
, it's a mistake in caller code, maybe should better throw something like NPE/IAE, rather than silently return true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the caller code is definitely wrong if this is happening. I just wanted to avoid checking unnecessary things here since this code is going to happen in a tight loop.
I changed it a bit to be:
return (s == null ? 0 : s.length()) == prefix.length();
So now the length check is always done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this seems to be the right version.
@@ -75,94 +79,113 @@ public LikeDimFilter( | |||
MATCH_PATTERN | |||
} | |||
|
|||
// Strings match if they start with "prefix" AND: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe these comments could be Javadocs of SuffixStyle
enum or constants themselves?
) | ||
{ | ||
final StringBuilder prefix = new StringBuilder(); | ||
final StringBuilder suffixRegex = new StringBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think now this method could be simplified. Instead of accumulating suffixRegex
, it could just assign suffixStyle variable, and make a simple constructor call in the end instead of if-else chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, good point. I removed suffixRegex
and replaced it with a variable for the suffix matching style. I also removed boolean extractPrefix
since it wasn't doing anything really useful anymore.
hm.. we have 'like' math function code of which is copied from hive. would it be better to use somewhere validated code? |
@navis I'm not familiar with hive's version to know if it does the stuff this version does (we need to split out any exact prefix for the index lookup optimization, and we want to support custom "escape" chars for clauses such as |
I got this is specialized version for bitmap index not like what I've commented above. So forget what I've said. |
@leventov any further comments? |
@gianm no, LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, some more boundary tests would be nice.
ImmutableList.of("2", "4") | ||
); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit: can we also add boundary tests for matchAll, matchNone, match against string having wildcard only ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added tests for matching an empty string with and without extractionFns, and matching with just a wildcard.
👍 |
👍 |
* Add "like" filter. * Addressed some PR comments. * Slight simplifications to LikeFilter. * Additional simplifications. * Fix comment in LikeFilter. * Clarify comment in LikeFilter. * Simplify LikeMatcher a bit. * No use going through the optimized path if prefix is empty. * Add more tests.
* Add "like" filter. * Addressed some PR comments. * Slight simplifications to LikeFilter. * Additional simplifications. * Fix comment in LikeFilter. * Clarify comment in LikeFilter. * Simplify LikeMatcher a bit. * No use going through the optimized path if prefix is empty. * Add more tests.
The main purpose of this is a performance boost over the regex filter (which can do similar stuff) for prefix patterns like "foo%" (see benchmarks below). A secondary purpose is making life easier for people that are familiar with SQL LIKE syntax.
Benchmarks:
foo LIKE "1000000"
foo = "1000000"
using the existing selector filterfoo LIKE "50%"
foo RLIKE "^50.*"
using the existing regex filter