[Proposal] BufferAggregator support for growable sketches. #8126
Comments
Related: #5335 (comment) and #5335 (comment)
As part of the work we are doing towards integrating Oak (an off-heap based incremental index) into Druid (#5698), we invested some thought on how to bridge the gap between Oak, with its internal memory management, off-heap sketches based on WritableMemory, and Druid aggregators. What we suggest is to have Oak manage the memory, including re-allocation of space when needed. Oak will implement its own WritableMemory and MemoryRequestServer, which are needed for correct behaviour of the sketches with respect to the Oak index. Finally, we suggest having a new aggregator type, WritableMemoryAggregator, that maps WritableMemory to a sketch, works the same way as buffer aggregators do, and does not need to worry about the growing size of sketches. There might be other alternatives for closing this loop; let's discuss them. Does all this make sense @leventov @himanshug ?
The "WritableMemoryAggregator" proposal is the new take on #3892. I support it.
Thanks Roman. Can you summarize any progress made on the Memory aggregator, if any?
2 naive questions:
@Eshcar regarding growth strategy: the current strategy I described was to double the size every time, but I also thought about having the following (notice that the return type is "int" here, not boolean) that could let the aggregator impl control the growth strategy, as it has the best information about what it needs. Later, if needed, the "default" growth strategy could be made pluggable so that extension writers could control that. That said, it doesn't matter for OakIncrementalIndex, as that would choose its own default growth strategy. Regarding using Memory instead of ByteBuffer: I am sure that is a good idea for performance, so instead of adding more methods to …
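The snippet the comment refers to was not captured in this thread. A hedged sketch of the idea, with hypothetical names, might look like the following: a growth hook that returns the requested new capacity in bytes (an int) rather than a boolean, with doubling as the simple default strategy.

```java
// Hypothetical sketch only — the actual code from the comment was not captured.
// The hook returns the new capacity to allocate once `neededBytes` no longer
// fits, letting the aggregator implementation control the growth strategy.
class GrowthStrategySketch
{
  // Doubling default: grow the capacity until the needed size fits.
  static int nextCapacity(int currentCapacity, int neededBytes)
  {
    int capacity = currentCapacity;
    while (capacity < neededBytes) {
      capacity *= 2;
    }
    return capacity;
  }
}
```

Returning an int instead of a boolean is what lets each aggregator pick its own policy, since it knows best how its intermediate state grows.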
@himanshug - What is the context of the current proposal? Does it refer to aggregators that are used only in the context of queries, when querying immutable segments that are cached off-heap? Does it also cover off-heap incremental index roll-up aggregation? If it is the first, then it is reasonable to define an API for an external memory allocator; if it is the second, then what I am suggesting is more relevant. I think that by introducing a new writable memory aggregator we avoid backward compatibility issues, do we not?
There is no bottom line, but there is a kind of "top line" that we want to move toward; no one is available to work on it.
They are used by the off-heap incremental index, and in the implementations of queries such as groupBy and topN.
It's used in the implementation of groupBy queries. It's not used more widely, including during data ingestion, specifically because of the unsolved problem with growable complex aggregations; this is what this proposal is all about.
Then presenting a solution to this problem using Oak to manage the off-heap index, handling growable sketches using Memory, and showing that it performs as well as or better than the existing implementation would be a win-win-win, correct? BTW, if the current off-heap solution is operational, we can compare against it in the current system (cluster) benchmarks that we are running, and compare performance (without sketches at this point). From what I know, the last time we tried to evaluate the off-heap incremental index through a component-level test it crashed, and we were told it is not properly maintained.
Yes. I just want to warn you about trying to bring all of that in at once - the extent of the change may become unbearable. So even if it incurs more work, it's better to make those changes separately: e.g. first, the transition to …
Not sure what you mean by that. It's not possible to use the off-heap incremental index for indexing even when all aggregations are constant-sized - it's not implemented.
Currently, the off-heap incremental index is used only in GroupBy V1, which is itself deprecated and may not work at all. So I don't really even see a point in running benchmarks against the current off-heap incremental index to prove that Oak is better.
Thanks, that was our impression - that the off-heap incremental index is not operational (although it does exist in the code). So, indeed, there is no way to compare against it. I also agree that doing the Oak-sketches-Druid integration in one step might be too complicated. If I now understand correctly, the purpose of the current proposal is to handle the query aggregation problem. Oak might be a solution for this problem too, but this is something we haven't looked at yet.
Yes, the context here applies to every place aggregation happens; also, let it not be dependent on something specific, but allow extensions to eventually add optimal behavior.
The proposal LGTM in general; there are a couple of areas I think need adjustment:
@jon-wei thanks for taking a look, sounds like there is consensus in general.
@himanshug, I'm wondering if you are still interested in pursuing this?
Things would need to change to possibly accommodate https://lists.apache.org/thread.html/r16b2d393f51a55007953a4e24659df171f7aaf9ccb3560716c955b7a%40%3Cdev.druid.apache.org%3E
Motivation
From "IncrementalIndex generally overestimates theta sketch size" #6743:
"Theta sketches have a very large max size by default, relative to typical row sizes (about 250KB with "size" set to the default of 16384). The ingestion-time row size estimator (getMaxBytesPerRowForAggregators in OnheapIncrementalIndex) uses this figure to estimate row sizes when theta sketches are used at ingestion time, leading to way more spills than is reasonable. It would be better to use an estimate based more on actual current size. I'm not sure how to get this, though."
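To make the overestimation concrete, here is illustrative arithmetic only: the ~250KB max-size figure comes from the quote above, while the heap budget and the "actual" sketch size are made-up numbers for the example.

```java
// Illustrative only: shows how a pessimistic per-row size estimate shrinks
// the number of rows that fit in memory before a spill. The 250KB figure is
// from the quoted issue; the 1 GiB budget and 4KB actual size are assumptions.
class SpillEstimate
{
  static long rowsBeforeSpill(long heapBudgetBytes, long estimatedRowBytes)
  {
    return heapBudgetBytes / estimatedRowBytes;
  }
}
```

With a 1 GiB budget, estimating 250KB per row allows only about 4,194 rows before a spill, whereas rows whose sketches actually occupy about 4KB could have held 262,144 — roughly 60x more.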
From "[Proposal] Resizable Buffer in BufferAggregator" #2963:
"In case of complex aggregators like thetaSketch, more than 80% of the time sketches don't grow to full capacity but query processing still reserves the full max size. The idea is to support use of a resizable aggregator by BufferAggregator so that maximum space is not reserved but BufferAggregator should be able to re-allocate the buffer on demand."
Some sketches, e.g. `doublesSketch`, do not have an upper bound on size and can't provide a correct number for `AggregatorFactory.getMaxIntermediateSize()`. The current workaround they use is to pick some number and then fall back to on-heap objects if the sketch grows bigger than that.

Proposed changes
Add the following methods to `AggregatorFactory`:
Add the following methods to `BufferAggregator`:
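The method signatures proposed here were in code blocks that are not preserved in this capture. A plausible, purely hypothetical sketch of the shape they might take (all names are assumptions, not the actual proposal): the factory reports a minimum rather than maximum intermediate size, and the buffer aggregator can report when its state has outgrown its allocation and relocate itself after the caller provides more space.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch only — the exact proposed signatures were elided from
// this capture. Names below are illustrative assumptions.
interface GrowableAggregatorFactorySketch
{
  // Lower bound to allocate up front; overridable per query/indexing task.
  int getMinIntermediateSize();
}

interface GrowableBufferAggregatorSketch
{
  // Returns the number of extra bytes needed at `position`,
  // or 0 if the current allocation still fits.
  int additionalSizeNeeded(ByteBuffer buf, int position);

  // Invoked by the caller after it has moved the state to a larger buffer.
  void relocate(int oldPosition, int newPosition, ByteBuffer oldBuf, ByteBuffer newBuf);
}

// Minimal concrete example: a fixed-size long counter never needs to grow.
class LongStateSketch implements GrowableBufferAggregatorSketch
{
  @Override
  public int additionalSizeNeeded(ByteBuffer buf, int position)
  {
    return 0; // 8 bytes of state, allocated up front, is always enough
  }

  @Override
  public void relocate(int oldPosition, int newPosition, ByteBuffer oldBuf, ByteBuffer newBuf)
  {
    newBuf.putLong(newPosition, oldBuf.getLong(oldPosition));
  }
}
```

Fixed-size aggregators like the counter above would implement the new methods trivially, which is why only variable-size sketch aggregators need real changes.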
Update all of Druid core code to remove usage of `Aggregator` and use `BufferAggregator` in all places, using the newly introduced methods. For example, see the changes made to `OnheapIncrementalIndex` in #8127. Once that is done in all places, the `Aggregator` interface, `AggregatorFactory.factorize(ColumnSelectorFactory)`, and `AggregatorFactory.getMaxIntermediateSize()` can be removed.

At least the `BufferAggregator` implementations that work on top of variable-size sketches should be updated to implement the newly introduced methods. In those extensions, `AggregatorFactory.getMinIntermediateSize()` should return a value that can be overridden by the user per query/indexing task to allow fine tuning.

The following interfaces and classes are introduced to aid usage of the new methods in aggregator extensions.
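The helper interfaces and classes themselves were elided from this capture. `MemoryAllocator` is named in the Future Work section below; as a hedged guess at its shape (not the actual proposal), it might look something like this, backed here by on-heap ByteBuffers purely for illustration:

```java
import java.nio.ByteBuffer;

// Hedged sketch only — the actual interfaces introduced by the proposal are
// not preserved in this capture. `MemoryAllocator` is the name used in the
// Future Work section; everything else here is an assumption.
interface MemoryAllocatorSketch
{
  ByteBuffer allocate(int bytes);

  void free(ByteBuffer buffer);
}

// Trivial on-heap implementation for illustration; an off-heap variant would
// hand out direct buffers and actually release them in free().
class HeapMemoryAllocatorSketch implements MemoryAllocatorSketch
{
  @Override
  public ByteBuffer allocate(int bytes)
  {
    return ByteBuffer.allocate(bytes);
  }

  @Override
  public void free(ByteBuffer buffer)
  {
    // Nothing to do for heap buffers; the GC reclaims them.
  }
}
```

Abstracting allocation behind an interface like this is what would let the same aggregator code run against on-heap buffers in tests and off-heap memory in production.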
Rationale
#2963
Operational impact
None
Test plan (optional)
Unit Tests should cover the changes.
Future Work
I think the `MemoryAllocator` interface might change a bit when we write code to use it in off-heap use cases in querying.

With these changes we can possibly use off-heap memory for aggregators in IncrementalIndex too.
This would also enable removal of the v1 groupBy implementation, which is kept around due to its usage of `Aggregator`, the interface that allowed on-heap growable sketches.