[Proposal] Use bitmap iteration for filtered aggregators #6889
Comments
I need to re-evaluate this in the context of the query vectorization proposed in #6794. Solution B yields good improvements with the current one-row-at-a-time query engine, but will likely not work as well with the vectorized query engine, since even low-selectivity filters are likely to have at least one matching row per batch (the default batch size in #6794 is 512 rows). On the other hand, it becomes possible to implement solution A on a per-batch basis. Right now, it looks like #6794 does not use bitmap iteration to build the list of matching rows, but checks the value of each row instead. I'm going to try implementing solution A on top of the #6794 branch.
An aside: right now, for query-level filters, Druid always uses indexes when possible. But this is not always optimal; for some filters, evaluating the index can cost more than simply checking each row during the scan.

Coming back to the proposal at hand: @peferron you point out that filtered aggregators, today, never use indexes. By the rationale above, this is probably sometimes better and sometimes worse than using indexes. (And it seems like your benchmarks show the same thing.)

IMO we should be working to make the behavior here consistent: it shouldn't matter whether a filter is query-level or in an aggregator. In either case Druid should resolve it using some logic that tries to take advantage of the factors mentioned above. But, inevitably, it won't always make the best decision, so I also think we should have some way of 'hinting' Druid to use, or not use, the index for a particular filter.

Putting these two suggestions together: how about we build the hinting system first, which should let users that "know better" tell Druid what to do, and then build the automatic decision-making system afterwards? (With the goal that most users shouldn't have to provide hints.)
Hopefully the gap narrows, closes, or even inverts with vectorization (#6794) and adaptive compression (#6016).
Another thing to consider is that using bitmap indexes isn't always better; see #3878 and related issues, which describe cases where very high cardinality dimensions can negatively impact query speed when bitmap indexes are used first. I'm currently working on a solution to this problem. I still have some experiments to do, but we should definitely try to coordinate, so that the system I propose for deciding whether to evaluate a filter as a pre-filter or a post-filter would also be usable for aggregator filters.
Oof, @gianm beat me to a response; there is some overlap between our comments 😜
Doing solution A on a per-batch basis sounds like a good way to go with vectorized filtered aggregators.
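To make this concrete, here is a minimal sketch of what per-batch solution A could look like, using the org.roaringbitmap Java library. The names, batch layout, and `BATCH_SIZE` constant are illustrative assumptions on my part, not #6794's actual code:

```java
import org.roaringbitmap.PeekableIntIterator;
import org.roaringbitmap.RoaringBitmap;

// Hypothetical sketch of "solution A per batch": within each batch, the list of
// matching row offsets is built from the filter's bitmap, instead of testing
// every row of the batch individually.
public class PerBatchSolutionA {
    static final int BATCH_SIZE = 512; // default batch size proposed in #6794

    public static void main(String[] args) {
        long[] metric = new long[2000];
        for (int i = 0; i < metric.length; i++) metric[i] = i;
        // Rows matching the aggregator's filter (made-up positions).
        RoaringBitmap filter = RoaringBitmap.bitmapOf(3, 700, 701, 1999);

        long sum = 0;
        int[] offsets = new int[BATCH_SIZE];
        PeekableIntIterator it = filter.getIntIterator();

        for (int batchStart = 0; batchStart < metric.length; batchStart += BATCH_SIZE) {
            int batchEnd = Math.min(batchStart + BATCH_SIZE, metric.length);

            // Gather this batch's matching offsets; peekNext() lets us stop at
            // the batch boundary without consuming rows of the next batch.
            int n = 0;
            while (it.hasNext() && it.peekNext() < batchEnd) {
                offsets[n++] = it.next() - batchStart;
            }

            // Aggregate only the matching rows of the batch.
            for (int i = 0; i < n; i++) {
                sum += metric[batchStart + offsets[i]];
            }
        }
        System.out.println("sum = " + sum); // 3 + 700 + 701 + 1999
    }
}
```

The iterator advances strictly forward across batches, so even a low-selectivity filter costs only a boundary check on batches that contain no matches.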
To sum up:

1. Query-level filters always use indexes when possible.
2. Filtered aggregators never use indexes.
3. The behavior should be consistent: it shouldn't matter whether a filter is query-level or in an aggregator.
4. Whatever changes, users should be able to keep the old behavior when it suits them better.
5. We should build the hinting system first, and the automatic decision-making system afterwards.

Is that correct?
@gianm: I experimented with removing the per-row value check in favor of bitmap iteration. Speed-up is about 10% on SQL query 7, which is better than nothing but not groundbreaking. I believe that the speed-up is small because the underlying columns are still read for the full batch, rather than only for the matching rows.
If I understand correctly, depending on compression, encoding and so on, it may be faster to read only a small subset of rows from the underlying columnar storage, rather than the full batch. For example, reading the entire batch may involve decompressing an extra block that doesn't even contain any of the rows that the aggregator needs. But the current design doesn't seem to allow reading less than the full batch.

BTW, let me know if you'd like to take this discussion to #6794 instead, since it's starting to sound like a more general discussion of the vectorization system.
1 & 2 are definitely correct. 3 & 4 are ideals we try to hold ourselves to whenever possible. Sometimes offering options to preserve old behavior isn't feasible for some reason -- maybe the old and new behavior cannot both exist due to a fundamental incompatibility. How to handle that kind of thing tends to be case by case. 5 accurately reflects my opinion at least :)
That's true. In general it's best if we only read what we need. One tricky thing here happens with a query like this:

```sql
SELECT
  SUM(x) FILTER(WHERE y = 1) as agg1,
  SUM(x) FILTER(WHERE y = 2) as agg2
FROM tbl
```

If each aggregator reads only its own matching rows of x, the same blocks of x may end up being decompressed twice; sharing a single read across both aggregators could be cheaper, but then we're back to reading rows that an individual aggregator doesn't need.
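One way this tension could be resolved, sketched here purely as an illustration (my code, not a proposed Druid design): read x once over the union of the two aggregator filters, then dispatch each row to whichever aggregators match it.

```java
import org.roaringbitmap.IntIterator;
import org.roaringbitmap.RoaringBitmap;

// Illustrative sketch: the column x is read once per row in the union of the
// two filters, rather than once per aggregator, at the cost of a per-row
// membership check to route the value to the right aggregator(s).
public class SharedColumnRead {
    public static void main(String[] args) {
        long[] x = {10, 20, 30, 40, 50, 60};
        RoaringBitmap agg1Rows = RoaringBitmap.bitmapOf(1, 3); // rows where y = 1
        RoaringBitmap agg2Rows = RoaringBitmap.bitmapOf(2, 3); // rows where y = 2

        long agg1 = 0, agg2 = 0;
        IntIterator it = RoaringBitmap.or(agg1Rows, agg2Rows).getIntIterator();
        while (it.hasNext()) {
            int row = it.next();
            long value = x[row]; // x is read (decompressed) once for this row
            if (agg1Rows.contains(row)) agg1 += value;
            if (agg2Rows.contains(row)) agg2 += value;
        }
        System.out.println("agg1 = " + agg1 + ", agg2 = " + agg2);
    }
}
```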
Yeah, I think so -- if you reply to vectorization-specific stuff then let's do it over there. (But keep the discussion about harmonizing filtered aggregators & query-level filters here, I think, assuming they don't become too intertwined.)
I'm going to put this on hold until vectorization (#6794) is merged or at least close to its final form, since it's quite clear from this discussion that an alternative index-based strategy for filtered aggregators would be implemented very differently with or without vectorization. The configuration system to let users pick which strategy to use could be implemented now for query filters, though. I'm going to think a bit about that and open a new issue/proposal if needed.
This issue has been marked as stale due to 280 days of inactivity.
Problem summary
Filtered aggregators are slower than query root filters because they don't leverage bitmap iteration. This is a problem because many queries cannot be rewritten to move aggregator filters into the root filter, as detailed below.
Problem details
In Druid parlance, filters are partitioned into pre-filters and post-filters. Pre-filters contain conditions that can use bitmap iteration to jump to the next matching row, such as `my_dimension = my_value`. Post-filters contain conditions that must be evaluated on every row, such as `my_metric > my_value`.

Query root filters are efficient because they indeed use bitmap iteration to jump to the next row that matches the root pre-filters.

Filtered aggregators are less efficient because they evaluate their pre-filters on every row that matches the root filter.
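A minimal sketch of the difference, using the org.roaringbitmap Java library; the row data and bitmaps are made up for illustration, and this is not Druid's actual cursor API:

```java
import org.roaringbitmap.IntIterator;
import org.roaringbitmap.RoaringBitmap;

public class PreFilterVsPostFilter {
    public static void main(String[] args) {
        long[] metric = {10, 20, 30, 40, 50, 60};

        // Pre-filter: the index gives us the matching rows directly, so we
        // jump from match to match and never touch other rows.
        RoaringBitmap preFilter = RoaringBitmap.bitmapOf(1, 4); // my_dimension = my_value
        long sum = 0;
        IntIterator it = preFilter.getIntIterator();
        while (it.hasNext()) {
            sum += metric[it.next()]; // visits only rows 1 and 4
        }
        System.out.println("pre-filtered sum = " + sum);

        // Post-filter: no index applies, so every row must be evaluated.
        long sum2 = 0;
        for (int row = 0; row < metric.length; row++) {
            if (metric[row] > 25) { // my_metric > my_value
                sum2 += metric[row];
            }
        }
        System.out.println("post-filtered sum = " + sum2);
    }
}
```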
The result is that a query written using a root filter, such as `SELECT SUM(x) FROM tbl WHERE y = 1`, can be several times faster than the equivalent query written using a filtered aggregator, such as `SELECT SUM(x) FILTER(WHERE y = 1) FROM tbl`.

The obvious workaround is to rewrite the second form into the first before sending the query to Druid. But this only works on a subset of queries. For example, a query that filters different aggregators on different values, such as `SELECT SUM(x) FILTER(WHERE y = 1), SUM(x) FILTER(WHERE y = 2) FROM tbl`, cannot be rewritten this way.

The next workaround may be to add a unioned root filter (e.g. `WHERE y = 1 OR y = 2`) while keeping the filtered aggregators. But again, this only performs well on the subset of queries where the resulting union has low selectivity. The worst case is if the query also contains an unfiltered aggregation, in which case the root filter cannot be narrowed at all.

The next workaround is to split the query into sub-queries that are sent to Druid separately and processed in parallel. But this is complicated for users to implement, adds processing overhead, and can introduce inconsistency, since the sub-queries may not scan the exact same set of rows, especially in a streaming ingestion scenario.
Overall this means there's no good workaround. Please let me know if I missed one.
Potential solutions
To keep things simple, let's assume that all aggregators are filtered and that only pre-filters exist (no post-filters). Let's also ignore any potential optimizations around shared filters (#4267).
Current implementation (for reference): test each aggregator on each row
Pros: simple to implement, which means fewer bugs and easier maintenance.
Cons: as mentioned earlier, the bitmap value is checked for every single row in the segment. This is inefficient, especially for low-selectivity filters.
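A toy model of this behavior (hypothetical names; Druid actually evaluates a value matcher per row rather than probing a bitmap, so the `contains` call below is just a stand-in for the per-row check):

```java
import org.roaringbitmap.RoaringBitmap;

// Current behavior, modeled: the cursor walks every row, and each filtered
// aggregator re-checks its own filter on each of those rows.
public class PerRowFilteredAggregation {
    public static void main(String[] args) {
        long[] metric = {10, 20, 30, 40, 50, 60};
        RoaringBitmap agg1Filter = RoaringBitmap.bitmapOf(1);    // y = 1
        RoaringBitmap agg2Filter = RoaringBitmap.bitmapOf(2, 5); // y = 2

        long agg1 = 0, agg2 = 0;
        for (int row = 0; row < metric.length; row++) {          // every row is visited...
            if (agg1Filter.contains(row)) agg1 += metric[row];   // ...and checked per aggregator
            if (agg2Filter.contains(row)) agg2 += metric[row];
        }
        System.out.println("agg1 = " + agg1 + ", agg2 = " + agg2);
    }
}
```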
Solution A: compute each aggregator separately
Pros: uses bitmap iteration to skip non-matching rows.
Cons: scans through the data multiple times. Since Druid storage is columnar, this may not be a problem if the filtered aggregators read mutually disjoint sets of columns. But if there's overlap, then data locality suffers, since the same data is decompressed and read multiple times.
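A sketch of solution A under the same toy model as above; note that both passes re-read `metric`, which is where the data-locality concern comes from:

```java
import org.roaringbitmap.IntIterator;
import org.roaringbitmap.RoaringBitmap;

// Solution A, sketched: one full pass per aggregator, each pass driven by
// that aggregator's own bitmap iterator.
public class SolutionA {
    static long aggregate(long[] metric, RoaringBitmap filter) {
        long sum = 0;
        IntIterator it = filter.getIntIterator();
        while (it.hasNext()) {
            sum += metric[it.next()]; // skips directly between matching rows
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] metric = {10, 20, 30, 40, 50, 60};
        // Each call below is a separate scan; overlapping columns get
        // decompressed and read once per aggregator.
        System.out.println(aggregate(metric, RoaringBitmap.bitmapOf(1)));    // y = 1
        System.out.println(aggregate(metric, RoaringBitmap.bitmapOf(2, 5))); // y = 2
    }
}
```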
Solution B: repeatedly find the aggregator with the lowest next matched row
This can be implemented in multiple ways. A typical implementation is via a priority queue holding the next matched row of each aggregator's bitmap.
Pros: uses bitmap iteration to skip non-matching rows, while maintaining single-pass in-order data access.
Cons: increased complexity, and the extra overhead makes it difficult to beat the current implementation in all scenarios.
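A sketch of the min-heap variant of solution B under the same toy model; the `Cursor` class and heap ordering are my own naming, not proposed code:

```java
import java.util.PriorityQueue;
import org.roaringbitmap.IntIterator;
import org.roaringbitmap.RoaringBitmap;

// Solution B, sketched: a min-heap keyed on each aggregator's next matching
// row keeps the scan single-pass and in row order.
public class SolutionB {
    static final class Cursor {
        final int aggIndex;
        final IntIterator it;
        int nextRow;
        Cursor(int aggIndex, IntIterator it) { this.aggIndex = aggIndex; this.it = it; this.nextRow = it.next(); }
        boolean advance() { if (it.hasNext()) { nextRow = it.next(); return true; } return false; }
    }

    public static void main(String[] args) {
        long[] metric = {10, 20, 30, 40, 50, 60};
        RoaringBitmap[] filters = { RoaringBitmap.bitmapOf(1, 4), RoaringBitmap.bitmapOf(2, 4, 5) };
        long[] sums = new long[filters.length];

        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> Integer.compare(a.nextRow, b.nextRow));
        for (int i = 0; i < filters.length; i++) {
            if (!filters[i].isEmpty()) heap.add(new Cursor(i, filters[i].getIntIterator()));
        }

        // Repeatedly pop the aggregator with the lowest next matching row,
        // aggregate that row, advance its iterator, and re-insert. Rows matched
        // by several aggregators surface once per aggregator, but the overall
        // scan remains a single in-order pass over the data.
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            sums[c.aggIndex] += metric[c.nextRow];
            if (c.advance()) heap.add(c);
        }
        System.out.println("sums = " + java.util.Arrays.toString(sums));
    }
}
```

The heap operations are the "extra overhead" mentioned above: each matching row costs a poll and a re-insert, which is why beating the current implementation on high-selectivity filters is hard.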
Early results
I benchmarked a few different implementations of solution B. I'm not linking to the code because I'd rather get feedback about the overall idea first, and the code will get thrown away anyway.
A min-heap implementation was benchmarked on a set of arbitrary test queries, measuring for each query the ratio of avg ns/op compared to unpatched (so `< 1` is faster and `> 1` is slower). The massive speed-ups such as `0.11` and `0.03` happen for test queries that contain low-selectivity filtered aggregators, which makes sense.

Another somewhat weird implementation on top of Roaring yielded less extreme speed-ups, but also smaller regressions.
The last two test queries mimic real queries that we run in production, and get 10-30% speed-ups.
I cut a few corners hacking this in, and I'm not sure what the final performance would look like. I'm especially concerned about extra garbage and memory pressure in some implementations, which may not show up in benchmarks. But at least it seems worth looking into.
Next steps
I'm not that familiar with the Druid code base, so is there any obvious solution or blocker that I missed? Otherwise I'd love to get some general feedback about this idea.
Indexes differentiate Druid from its competition—which is usually faster on raw scan speed—so I think it makes sense to try to squeeze as much out of indexes as possible, especially when implementations as good as Roaring are available.