Oak: New Concurrent Key-Value Map #5698
Attached is the refactoring we suggest for the IncrementalIndex module so that it can adopt an alternative index based on Oak. We would be happy to hear your comments. Thanks! |
@sanastas Sounds like a great contribution. Is this library open source? Is it used in other projects? |
We intend to make Oak open source, but it is not there yet. Oak has not yet been used in a real project; we have only measured its performance intensively on different machines and under different workloads. We look forward to seeing how it affects Druid's performance. |
@sanastas interesting proposal, thanks. Notes:
|
@leventov The first thing I see when I read the Memory docs is
that is scary! Are there any plans to support JDK 9 and higher? I propose making sure that is the case before we move out of |
@leventov, thanks for taking a look! You have great comments!
|
The problem is that if the Druid API is going to depend on Memory, then if Oak is used, conversion from ByteBuffer to Memory is required on every read and write access to the NavigableMap. The allocated Memory objects are garbage. I don't think the overhead of those conversions is going to be prohibitive, but it could be considerable, so better avoided. |
Being based on ByteBuffer or Memory is equivalent for Oak. We care a lot about performance, and converting from ByteBuffer to Memory on every access is not an option. When the aggregators move from ByteBuffer to Memory, we will switch Oak to use Memory. For now we just want to get your review feedback and some benchmark performance comparisons. By the way, do you have any ready-made performance benchmarks we can use? Do you use a performance evaluation platform like YCSB or similar? |
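To make the conversion-overhead concern concrete, here is a minimal sketch of ours (not code from the PR): absolute reads and writes against a direct ByteBuffer allocate no per-access objects, whereas a Memory-based map API would require wrapping the buffer on each access, producing short-lived garbage.

```java
import java.nio.ByteBuffer;

// Sketch (ours, illustrative): off-heap access as a ByteBuffer-based map can
// do it. Absolute gets/puts create no per-access objects; a Memory-based API
// would need a wrapper object per access instead.
public class OffHeapAccess {
    // Write then read one int at an absolute offset in an off-heap region.
    static int roundTrip(int v) {
        ByteBuffer region = ByteBuffer.allocateDirect(64); // off-heap allocation
        region.putInt(0, v);      // absolute write: no per-access garbage
        return region.getInt(0);  // absolute read: no per-access garbage
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42)); // prints 42
    }
}
```

The point is only about the steady-state access path; the one-time `allocateDirect` cost is amortized over the region's lifetime.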
Hi again everyone, we will join the Druid developer sync meeting next Tuesday, May 8, to present Oak (briefly), discuss the integration, and answer your questions. Thanks! |
@sanastas You are more than welcome. As for benchmarks, we do not have a YCSB client, but we do have some JMH benchmarks in the Druid repo itself; I think that is a good start. |
I have updated the pull request. This pull request also includes interesting index ingestion benchmark results: |
The last PR is now: #6327 |
Hi everyone! We are back! I have just created a new PR #7676 in order to continue the discussion about the new type of IncrementalIndex based on the Oak concurrent navigable map. For those not yet familiar with Oak: it is an off-heap (direct-allocation) concurrent navigable (ordered) KV-map. More details about Oak can be found in the previous comments in this issue and in the PR #7676 introduction. For those who are familiar with Oak, this is what has been done so far (since this project went on pause): |
The isolated Oak benchmarks (showing the potential of Oak vs. ConcurrentSkipListMap) can be found here: OAK PERFORMANCE.pdf. Please take a look at PR #7676 and raise your questions here or there. Clearly we have more work to do; this is just a start! :) We will join the Druid developer sync meeting next Tuesday, May 21, to discuss the Oak possibility further. |
While measuring Oak's performance we found that the IncrementalIndex module doesn't scale well with multi-threading. We traced the threads' blocking states to two causes: |
The first case is addressed in PR #7838. You are welcome to take a look. Thanks! |
@jihoonson please take a look at the following proposal draft!

PROPOSAL

Motivation

The following are some imperfections of the current Druid IncrementalIndex:
Proposed changes

Oak (Off-heap Allocated Keys) is a scalable concurrent KV-map for real-time analytics, which keeps all its keys and values in off-heap memory. Oak is faster and scales better with the number of CPU cores than popular NavigableMap implementations, e.g., Doug Lea's ConcurrentSkipListMap (Java's default).

Rationale

A discussion of why this particular solution is the best one. One good way to approach this is to discuss alternative solutions that you considered and decided against. This should also include a discussion of any specific benefits or drawbacks you are aware of. Let's follow the initial motivation points: |
Operational impact

The suggested OakIncrementalIndex should live alongside the current solution, giving the ability to switch to OakIncrementalIndex when preferable. As OakIncrementalIndex doesn't affect the on-disk layout, we do not foresee any specific operational impact.

Test plan (optional)

The plan is to add specific unit tests for OakIncrementalIndex and to pass all existing tests on a cluster based on OakIncrementalIndex.

Future work (optional)

As StringIndexer also takes a significant part of IncrementalIndex, we might move this map off-heap as well.

IncrementalIngestionBenchmark results

The attached files show the results of IncrementalIngestionBenchmark when 2 million vs. 3 million rows are inserted, with Druid's native incremental index and with the newly suggested OakIncrementalIndex (data kept off-heap). Druid's incremental index always gets 12GB of on-heap memory. OakIncrementalIndex always gets 8GB on-heap and 4GB off-heap (12GB RAM in total). Please note that the rows are prepared before the benchmark measurement and already take a big chunk of on-heap memory. The graphs show throughput (operations per second), so bigger is better. In all results OakIncrementalIndex shows better performance. When moving from 2M rows to 3M rows (using the same memory limits), Oak shows about 10% performance degradation while IncrementalIndex shows about 60% degradation. Ingestion Throughput 2M rows (2.5GB data) ingested.pdf |
@sanastas thanks! This proposal looks only focused on support of OakIncrementalIndex which looks good to me. I have a question on the benchmark results you posted though. Do you know what caused the performance degradation when the number of rows was increased? 60% degradation sounds pretty huge. |
@jihoonson chiming in here, too. I believe the 60% throughput degradation is attributable to GC. The JVM just doesn't have enough headroom for the heap. The data itself is 3.8 GB (less than 1/3 of the managed RAM), all pre-generated prior to the test. OakIncrementalIndex is largely immune to scaling effects (the 10% slowdown is organic to search as the data structure grows). |
@jihoonson thanks for taking a look! First, indeed this proposal is about supporting OakIncrementalIndex. This is an idea we have been pursuing for a while already. In general, it is about building bigger off-heap IncrementalIndexes and enjoying good performance :) Would you like to work with us on promoting this possibility? Just as a consultant; your insights are very valuable... Second, Oak has a couple of advantages when working with big data. As @ebortnik mentioned, working with off-heap serialized data makes it less affected by the JVM GC. In addition, Oak exploits cache locality for searches. Lastly, Oak works well under multi-threaded contention and scales well with multiple threads. However, in this specific experiment (single thread) the main problem is likely caused by GC. Originally, Druid's IncrementalIndex allocates the (to-be-added) rows on-heap prior to the benchmarks (taking 4GB out of the given 12GB). Then StringIndexer takes more memory to save the String<->Integer translation; let's exaggerate and give it another 4GB. That leaves 4GB for all other on-heap objects, which puts a lot of stress on the GC. The ConcurrentSkipListMap used in Druid's IncrementalIndex is known to be less GC-friendly due to the many small objects it allocates. I believe this is the reason for the huge performance degradation we see. For example, here is a reference from Oracle themselves about the Java Garbage Collector:
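To illustrate the small-objects argument, here is a hedged sketch of ours (the byte layout is hypothetical, not Oak's actual format): a map entry stored as one serialized off-heap buffer, instead of the per-entry node plus boxed key and value objects that a ConcurrentSkipListMap keeps on-heap and that the GC must trace.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// One entry serialized into a single direct buffer. Illustrative layout:
// [keyLen:int][keyBytes][value:long]. The GC never traces inside this
// region, unlike the several small on-heap objects per entry that a
// ConcurrentSkipListMap allocates (nodes, index nodes, boxed values).
public class SerializedEntry {
    static ByteBuffer write(String key, long value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocateDirect(4 + k.length + 8);
        buf.putInt(k.length).put(k).putLong(value);
        return buf;
    }

    static long readValue(ByteBuffer buf) {
        int keyLen = buf.getInt(0);       // absolute reads, no deserialization
        return buf.getLong(4 + keyLen);   // value sits right after the key
    }

    public static void main(String[] args) {
        System.out.println(readValue(write("page", 7L))); // prints 7
    }
}
```

With this layout, millions of entries contribute zero objects to the GC's live set beyond the buffer references themselves, which matches the heap-headroom explanation above.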
And here one can take a look at an experiment with GC and big heap sizes. The conclusion is: |
We would be really glad to hear your thoughts! |
@jihoonson, I appreciate your intention to help with this project. Thanks a lot! In your opinion, what performance result/benchmark (or any other trigger) would indicate that it is worth taking OakIncrementalIndex into Druid? |
I think it would be nice to have two benchmarks. One benchmark would directly compare the performance of Another benchmark, which I think would be nice to have, would measure the performance of |
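A direct map-level comparison could be as simple as the following crude single-threaded harness (ours; the real comparison should use the JMH benchmarks already in the Druid repo). Only the map construction would change when swapping OakMap in for ConcurrentSkipListMap.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Crude stand-in (ours, not the Druid JMH benchmark) for measuring ingest
// throughput of a NavigableMap-based index: upsert rows into the map and
// report operations per second. Swapping the map implementation is the only
// change needed to compare candidates.
public class IngestHarness {
    static double opsPerSecond(int rows) {
        ConcurrentSkipListMap<Integer, Long> index = new ConcurrentSkipListMap<>();
        long start = System.nanoTime();
        for (int i = 0; i < rows; i++) {
            index.merge(i % 1000, 1L, Long::sum); // upsert, like row aggregation
        }
        double secs = (System.nanoTime() - start) / 1e9;
        return rows / secs;
    }

    public static void main(String[] args) {
        System.out.printf("%.0f ops/sec%n", opsPerSecond(100_000));
    }
}
```

A serious comparison would of course need JMH-style warmup, forked JVMs, and multi-threaded writers; this only sketches the shape of the measurement.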
@jihoonson, sounds good!
We found OffheapIncrementalIndex to be unoptimized in some places and not very concurrent (it synchronizes on everything), and it looks like it is not in use. Thus we decided not to compare against it, as even in the tests/benchmarks some exceptions happen with OffheapIncrementalIndex...
Is it something like the two graphs under "IncrementalIngestionBenchmark results" presented above? Is it the latency that you are missing? Or different parameters? |
Thanks for contributing #7676! I've been thinking about what the path to potentially merging #7676 would look like. In Druid, there are currently two categories of code contributions, and for merging consideration, these two categories have different requirements.
The requirements for contrib extensions are looser and could roughly be described as "reasonable implementation and potentially useful for some use cases or experimentation". The contrib extensions aren't actively maintained by the Druid committers, and are generally less extensively tested. It can make sense for a new feature to start out as a contrib extension, and potentially migrate to core as it evolves. Examples of this include the Google Cloud Storage extension and the ORC format extension, which started out as contrib extensions and were recently adopted as core extensions. For the Oak-based incremental index, this path could make sense as well, but Druid does not currently provide an extension point for incremental index implementations. To open that as an extension point would first involve discussion/consensus on whether it's a good idea to have that extension point, and there would also be significant design thought/implementation work required. Given those difficulties, I think it makes sense to think about the path to merging Oak-based incremental index as a core feature. For merging a contribution into core, the requirement is essentially: "Convince Druid committers such that they are willing to take responsibility for and maintain the contribution going forward." At the highest level, setting aside implementation details, I think it'd be helpful to see a comparison of performance metrics between Oak incremental index and the existing implementation on a real cluster. I would try to set up realistic workloads for native batch ingestion and Kafka indexing service ingestion, and gather metrics for the following:
Separately from the incremental index topic, I wonder if OakMap could be used as part of Druid's GroupBy V2 query. There is a class called |
@jon-wei, I couldn't ask for a better start to the conversation! Thank you so much! I totally agree with you: given that Druid does not currently provide an extension point for incremental index implementations, it is preferable to merge the Oak-based incremental index as a core feature. We do not intend to replace the
Can you tell a bit more? Is there any available commonly used dataset? Any specific keys/rows distribution or data generator? Any performance measurement tools commonly used? |
I think it'd be interesting to see performance with data from a cluster that's used for some real world purpose, with some real queries used in that cluster. For performance measurement, you could gather query metrics emitted by druid, and check ingest/persist performance through task logs, profilers, etc. I don't have a specific tool in mind. |
@jon-wei, it would be interesting indeed :) @jihoonson, @b-slim, @leventov, maybe you have ideas to contribute to this discussion? How can we evaluate Oak so that the community is convinced it is worth it? Thanking you all in advance! |
In the meanwhile, we would like to share some insights we had while playing with IncrementalIngestionBenchmark. We continued the experiments presented above under the "IncrementalIngestionBenchmark" title. We compare Druid's native incremental index and the newly suggested OakIncrementalIndex (data kept off-heap). The data distribution/generation is exactly as in IncrementalIngestionBenchmark; we just insert more rows. This time we inserted 6 million rows (about 8GB of data) while giving 24GB of RAM. The number 24GB appears because it is almost the lowest amount that allows Druid's native IncrementalIndex to run properly. Even taking into consideration that in IncrementalIngestionBenchmark the rows are prepared before the benchmark measurement and already take a big chunk of on-heap memory, a 3x memory requirement sounds like a lot... Druid's incremental index gets 24GB of on-heap memory. OakIncrementalIndex gets 16GB on-heap and 8GB off-heap (24GB RAM in total). The results can be seen in the file below. The graph shows throughput (operations per second), so bigger is better. OakIncrementalIndex performs about twice as fast. Ingestion Throughput 6M rows (8GB data) ingested.pdf In order to stress the memory overhead requirement, we ran yet another experiment, this time inserting 7 million rows, which is up to 9GB of data. We gradually increased the memory allowance and present the throughput as a function of total RAM used. Results are in the file below. OakIncrementalIndex's off-heap memory requirement was always 9GB, as that is where the data is written. We started by giving 24GB of total RAM, as this is where OakIncrementalIndex was able to operate, although its throughput was very low. Druid's native IncrementalIndex was unable to operate until 28GB of on-heap memory was allowed. Ingestion 9GB data into Druid.pdf Does the question of metadata memory overhead bother you?
Also, would you be interested in working with bigger IncrementalIndexes, in order to later have fewer flushes to disk (persists), causing fewer merges and thus higher performance? |
I think in the past some have used a portion of the TPC-H data (specifically the lineitem table) for performance evaluations. Beyond gathering performance numbers, I do recommend running the Oak incremental index on a real cluster as part of the testing strategy. |
Some context around the potential impact of Oak on Druid's IncrementalIndex:
Experience and query/ingest-rate metrics in a real cluster is the easiest way to validate all of the above, since the system is fairly complex and there are a lot of potential tradeoffs involved between the various components. If you don't have a real-world dataset available maybe try the publicly available tweets dataset from the Twitter Streaming API. We often use it for test clusters. Some resources for that:
That all being said, you might also want to look at the potential impact of Oak on Druid's groupBy engine. Check out the ConcurrentGrouper class, which groupBy v2 queries use for parallel aggregation. In particular, check out the |
@jon-wei, thank you, we will try to find something about the TPC-H data. @gianm, great to hear from you, Gian, and thank you for your great input! No doubt query performance is important; we are working on it right now and will present the results soon. Also, no doubt a system-level test/performance benchmark is important. We are collecting information about how it should be run to be convincing for the community.
There is no intention to force Druid to work with big incremental indexes; we just wanted to show some cases where Oak's advantage is clear. Ingestion (with Oak) on smaller indexes has the same latency/throughput (as with the current IncrementalIndex), but takes less memory and has the potential for better concurrency if multi-threaded ingestion is used one day.
OakIncrementalIndex can let you handle more rows with less RAM.
We are currently working on query speed. We hope to update you soon. What are the expected query times?
There can be a trade-off, assuming all in all you process X bytes of data. It can be persisted in chunks of X/10 and then merged 10 times, or alternatively it can be persisted in big chunks of X/2 and merged only twice. I am just exaggerating the numbers, and I am not sure this theory will show better performance; it is just something to be checked. Thank you for pointing out the publicly available dataset. We will investigate what we can do. An additional question: how big are the read-only segments? |
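The chunk-size trade-off above can be put into a back-of-envelope model (ours, purely illustrative; the costs and formula are assumptions, not measured Druid behavior): persisting in bigger chunks moves the same number of bytes but pays the fixed per-operation cost fewer times.

```java
// Back-of-envelope model (ours, illustrative) of the flush/merge trade-off:
// X bytes are persisted in chunks, then a single merge pass rewrites them.
// The bytes moved are the same either way; what shrinks with bigger chunks
// is the number of persist operations and their fixed per-operation costs.
public class MergeModel {
    // Total cost = fixed overhead per flush * number of flushes
    //            + bytes persisted once + bytes rewritten by the merge.
    static long totalCost(long totalBytes, long chunkBytes, long perOpOverhead) {
        long flushes = totalBytes / chunkBytes;
        return flushes * perOpOverhead + totalBytes + totalBytes;
    }

    public static void main(String[] args) {
        long x = 10_000_000_000L;                    // 10 GB of data
        System.out.println(totalCost(x, x / 10, 5)); // many small chunks
        System.out.println(totalCost(x, x / 2, 5));  // few big chunks
    }
}
```

Under this toy model the advantage of big chunks is only the per-operation overhead, which is why it needs to be checked experimentally rather than argued on paper.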
Thank you for the valuable input. We will begin benchmarking using a single-machine setup as described in the quickstart guide. We have real-world data from production Flurry server that we can use. It’s about 450M keys comprised of application names and timestamps, and we can generate values in different sizes, depending on what we plan to simulate. Our plan is to use Druid’s REST APIs for (batch and stream) ingestion, aggregation queries, and mixed workloads. We’ll use Druid’s emitted metrics to measure ingestion throughput (via We’d be happy to hear your suggestions for tuning the system, and also suggested queries and metrics you think will be convincing for Oak’s adoption. |
@gianm, @jon-wei, @jihoonson and everyone! Oak has a great ability to scan forward and backward at the same speed! As each backward scan step of Java's ConcurrentSkipListMap costs O(log N), Oak performs almost ten times faster in backward scans. Can you think of any scenario in Druid where a scan (iterator) goes forward and backward, or only backward? Thanks! |
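For reference, both scan directions go through the standard NavigableMap API, so the comparison is purely about per-step cost. A minimal sketch (ours) on ConcurrentSkipListMap:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Forward vs. backward scans via the NavigableMap API. On Java's
// ConcurrentSkipListMap, each backward step is an O(log N) predecessor
// search (the skip list has no back-pointers); that per-step cost is what
// Oak's backward scans avoid.
public class ScanDirections {
    static List<Integer> scan(ConcurrentSkipListMap<Integer, String> map, boolean backward) {
        return new ArrayList<>(backward ? map.descendingKeySet() : map.keySet());
    }

    // Build a map with keys 1..n and scan it backward.
    static List<Integer> demoBackward(int n) {
        ConcurrentSkipListMap<Integer, String> map = new ConcurrentSkipListMap<>();
        for (int i = 1; i <= n; i++) map.put(i, "v" + i);
        return scan(map, true);
    }

    public static void main(String[] args) {
        System.out.println(demoBackward(3)); // prints [3, 2, 1]
    }
}
```

Since the API is symmetric, callers can't tell the directions apart functionally; only the latency differs between implementations.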
Hi again. To back up what I said about memory usage with numbers, please find below a graph presenting measurements of the memory usage of OakIncrementalIndex vs. the existing IncrementalIndex. The experiment measures memory usage for an invocation of IncrementalIngestionBenchmark. The horizontal axis shows the number of rows used in each experiment; the vertical axis shows GBs. The black dashed line presents the data size only: just keys and values (IndexRow and aggregators). The blue line presents the total memory consumption of OakIncrementalIndex (both off-heap and on-heap). The red line presents the total memory consumption of Druid's current IncrementalIndex (on-heap only). One can see that IncrementalIndex's metadata takes almost twice as much as OakIncrementalIndex's metadata. In total, OakIncrementalIndex uses less memory than IncrementalIndex. |
In a previous comment I wrote:
I'm having trouble finding the above mentioned ingestion metrics in the logs (I configured a logging emitter and do see other metrics). It seems Thanks! |
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions. |
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time. |
Our new and improved |
Oak for Druid
Short Summary:
Oak (Off-heap Allocated Keys) is a scalable concurrent KV-map for real-time analytics. Oak is the next generation of our previous research in the KV-map field. The idea arose more than a year ago, and Oak was designed during discussions with @cheddar and @himanshug, so Oak is modeled on the requirements of Druid.
Oak implements the industry standard Java NavigableMap API. It provides strong (atomic) semantics for read, write, read-modify-write, and range query (scan) operations (forward and backward). Oak is optimized for big keys and values, in particular for incremental maintenance of objects (e.g., aggregation). It is faster and scales better with the number of CPU cores than popular NavigableMap implementations, e.g., Doug Lea’s ConcurrentSkipListMap (Java’s default).
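The NavigableMap operations Oak targets can be demonstrated on Java's default implementation, ConcurrentSkipListMap; any map offering the same interface and atomicity is a drop-in at this level (the sketch and its names are ours):

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// The operations named above, shown on Java's default concurrent
// NavigableMap: point writes, an atomic read-modify-write (the pattern
// incremental aggregation needs), and an ordered range query.
public class NavigableOps {
    // Atomically add delta to the counter stored under key.
    static long addAndGet(ConcurrentNavigableMap<String, Long> m, String key, long delta) {
        return m.merge(key, delta, Long::sum); // atomic read-modify-write
    }

    static long demo() {
        ConcurrentNavigableMap<String, Long> m = new ConcurrentSkipListMap<>();
        addAndGet(m, "clicks", 1);
        addAndGet(m, "clicks", 2);  // incremental maintenance of one entry
        addAndGet(m, "views", 10);
        // Range query (scan) over an ordered key interval.
        return m.subMap("a", "z").values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 13
    }
}
```

Oak's claimed advantage is not a richer API but doing these same operations over serialized off-heap data with better scaling across cores.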
We suggest integrating an Oak-based Incremental Index as an alternative to Druid's existing Incremental Index, because Oak is naturally built for off-heap memory allocation, has greater concurrency support, and should show better performance results.
More information and explanations will follow. For an introduction, please take a look at the following files:
OakIntroduction.pdf
OAK Off-Heap Allocated Keys.pptx