-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Composite approach for checking in-filter values set in column dictionary #13133
Conversation
break; | ||
// if the size of in-filter values is less than the threshold percentage of dictionary size, then use binary search | ||
// based lookup per value. The algorithm works well for smaller number of values. | ||
if (size < SORTED_MERGE_RATIO_THRESHOLD * dictionary.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can determine the strategy at the point of Iterator<ImmutableBitmap>
object is instantiated.
And it would be much better if we split the returned Iterator<ImmutableBitmap>
into two inner classes, one is for the binary search, the other is for the sorted merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be much more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, done. Had done it earlier like that, but changed it last moment.
@@ -175,6 +177,8 @@ public String getValue(int index) | |||
extends BaseGenericIndexedDictionaryEncodedIndex<ByteBuffer> implements StringValueSetIndex, Utf8ValueSetIndex | |||
{ | |||
private static final int SIZE_WORTH_CHECKING_MIN = 8; | |||
private static final double SORTED_MERGE_RATIO_THRESHOLD = 0.12D; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some javadoc to explain why the default threshold is 0.12?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could link the issue that has the benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added javadoc explanation for the threshold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments.
@@ -826,4 +862,28 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector) | |||
} | |||
}; | |||
} | |||
|
|||
public class ValueWithIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it would be cleaner to just use a ListIterator
which provides nextIndex()
.
You wouldn't be able to peek the next index though, and you might have to work around that.
(That could be easier to do if we go with @FrankChen021 's suggestion to separate the two kinds of
searches into two different iterables.)
Another alternative could be to just use Pair
but I am not a fan of it.
If you do decide to use this class, however, I would suggest putting as a top level class in druid-core/org.apache.druid.java.util.common
, as other parts of the code might have similar requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the iterator itself, so not needed anymore.
break; | ||
// if the size of in-filter values is less than the threshold percentage of dictionary size, then use binary search | ||
// based lookup per value. The algorithm works well for smaller number of values. | ||
if (size < SORTED_MERGE_RATIO_THRESHOLD * dictionary.size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be much more readable.
@rohangarg - thanks for putting these benchmarks. In terms of query latencies, what difference have you observed? |
a83d939
to
d53680b
Compare
The force-push is done since I rebased from latest master. The new changes are a part of new commit and aren't squashed into the old ones. |
I have not tried benchmarking queries as of now since the amount of difference would depend on the data and the type of query being used. The benchmark added can measure the improvement with in-filter operation independently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've got one stale comment in there and the CI isn't passing for some reason, but the code looks good.
Currently, for checking the in-filter values in a column dictionary we use a binary search per value in the set. That works well for smaller value-sets but starts slowing down as the number of values in the set increase. To accommodate for large value-sets arising from large in-filters or from joins being pushed down as in-filters, we use sorted merge algorithm for merging the set and dictionary for larger values.
The following benchmark was run to find the cutoff point :
This PR has: