
Oak: New Concurrent Key-Value Map #5698

Closed
sanastas opened this issue Apr 25, 2018 · 39 comments

@sanastas

sanastas commented Apr 25, 2018

Oak for Druid Short Summary:

  1. Oak (Off-heap Allocated Keys) is a scalable concurrent KV-map for real-time analytics. Oak is the next generation of our previous research in the KV-map field. The idea arose more than a year ago, and Oak was designed during discussions with @cheddar and @himanshug , so it is modeled on Druid's requirements.

  2. Oak implements the industry standard Java NavigableMap API. It provides strong (atomic) semantics for read, write, read-modify-write, and range query (scan) operations (forward and backward). Oak is optimized for big keys and values, in particular for incremental maintenance of objects (e.g., aggregation). It is faster and scales better with the number of CPU cores than popular NavigableMap implementations, e.g., Doug Lea’s ConcurrentSkipListMap (Java’s default).

We suggest integrating an Oak-based incremental index as an alternative to Druid's existing incremental index, because Oak is naturally built for off-heap memory allocation, has greater concurrency support, and should show better performance.

More information and explanations will follow. For an introduction, please take a look at the following files:
OakIntroduction.pdf
OAK Off-Heap Allocated Keys.pptx
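To make the NavigableMap claim concrete, here is a minimal sketch of the operations Oak provides with atomic semantics (read, write, read-modify-write, forward and backward range scans). Since Oak's own API is not yet published, the sketch uses the JDK's ConcurrentSkipListMap, the implementation Oak is benchmarked against; keys and values are illustrative only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class NavigableMapSemantics {
    public static void main(String[] args) {
        ConcurrentNavigableMap<String, Long> map = new ConcurrentSkipListMap<>();

        // write and read
        map.put("b", 1L);
        map.put("a", 2L);
        map.put("c", 3L);

        // atomic read-modify-write, e.g. incremental aggregation of a metric
        map.merge("a", 10L, Long::sum);        // "a" now maps to 12

        // forward range scan over ["a", "b"]
        for (Map.Entry<String, Long> e : map.subMap("a", true, "b", true).entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }

        // backward (descending) scan over the whole map
        for (Map.Entry<String, Long> e : map.descendingMap().entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```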

@sanastas
Author

Attached is the refactoring we suggest for the IncrementalIndex module so it can adopt another index based on Oak. We would be happy to hear your comments. Thanks!
Incremental Index Refactoring Suggestion.pdf

@b-slim
Contributor

b-slim commented Apr 25, 2018

@sanastas sounds like a great contribution. Is this lib open source? Is it used in other projects?

@sanastas
Author

We intend to make Oak open source, but it is not there yet. Oak has not yet been used in a real project; we have just measured its performance intensively on different machines and under different workloads. We look forward to seeing how it affects Druid's performance.

@leventov
Member

@sanastas interesting proposal, thanks. Notes:

  • There is a plan to globally depart from using ByteBuffer in Druid; we are going to use https://github.com/DataSketches/memory. In order not to do work twice, with conversions between the ByteBuffer and Memory APIs back and forth, maybe you could already base your Oak code on the Memory library. You might want to talk with @leerho about that.
  • Instead of adding more incremental index variants (your refactoring plan suggests the existence of four), it would be better if everything converged on the minimum number of implementations that work reasonably well in all use cases. E.g. I plan to make a change that will leave just one IncrementalIndex implementation (on-heap, but "serialized" in Memory objects), instead of the current two: OnHeap and OffHeap. This is mentioned in the last paragraph of this message: Refactor index merging, replace Rowboats with RowIterators and RowPointers #5335 (comment). However, my idea doesn't touch the present ConcurrentSkipListMap-based machinery. It would be nice to combine those ideas with the concurrency and key handling from Oak, and thus have just one IncrementalIndex. In the materials that you provided, I see mentions of tests with 1K-sized values. For Druid, very small values (8 bytes) are important, so they might need to be tested too.
  • Another thing that might be important - Druid needs variable-sized values, see the second half of this message: Refactor index merging, replace Rowboats with RowIterators and RowPointers #5335 (comment)
  • I don't think there is a big issue with GC in our incremental index structures, because those heaps are quite small, < 10 GB. The bigger issue is the overhead of traditional Java object structures. That is why, in the improvement I planned (see above), I don't intend to move data off-heap, just to reduce the overhead and flatten memory, storing data in Memory objects (read: byte[] buffers). On the other hand, this allows simplifying the concurrency model and taking advantage of Java's GC. Buffers are shared among a small number of key-value pairs (e.g. 64) in order to amortize the overhead of the byte[] and Memory headers themselves.
  • Currently concurrency of incremental indexes is inherently broken, because it doesn't synchronize between updates and reads: see Thread safe reads for aggregators in IncrementalIndex #3956. Would Oak allow to solve this problem? Also see ArrayOfDoublesSketch module #5148 (comment)
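As a rough illustration of the flattening idea above (a sketch only, not Druid code; the 64-entry slab and 8-byte entries are just the numbers mentioned), a shared buffer holding fixed-width serialized values could look like:

```java
import java.nio.ByteBuffer;

// Sketch: a slab that packs 64 fixed-width serialized values into one
// byte[]-backed buffer, so 64 key-value pairs share a single object header
// instead of paying one or more Java objects per pair.
public class FlatSlab {
    static final int ENTRIES = 64;             // pairs per buffer, as suggested above
    static final int ENTRY_BYTES = Long.BYTES; // e.g. one 8-byte aggregate per row

    private final ByteBuffer slab = ByteBuffer.allocate(ENTRIES * ENTRY_BYTES);

    void write(int slot, long value) {
        // absolute put: no per-entry object allocation, just offset arithmetic
        slab.putLong(slot * ENTRY_BYTES, value);
    }

    long read(int slot) {
        return slab.getLong(slot * ENTRY_BYTES);
    }

    public static void main(String[] args) {
        FlatSlab s = new FlatSlab();
        s.write(3, 42L);
        System.out.println(s.read(3));         // prints 42
    }
}
```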

@b-slim
Contributor

b-slim commented Apr 29, 2018

@leventov the first thing i see when i read memory docs is

> This may work with jdk9 but might require the JVM arg `-permit-illegal-access`, `--illegal-access=permit` or equivalent. Proper operation with jdk9 or above is not guaranteed and has not been tested.

That is scary! Are there any plans to support JDK 9 and higher? I propose to make sure that is the case before we move from ByteBuffer to Memory.

@leventov
Member

@b-slim please raise that question in #3892 rather than this thread; it has a different topic.

@sanastas
Author

@leventov , thanks for taking a look! You have great comments!

  • Generally speaking, Oak was written quite a while ago using ByteBuffers. The integration of Oak into Druid is also almost done, and we want to publish it soon for your review. We are good friends with Lee Rhodes and we know about his Memory library. We do not see any special advantage in using Memory over ByteBuffer internally in Oak, but it can definitely be an advantage in Druid. In order not to slow the progress, I suggest we publish the code as we have it, just to get your impression. If required, we can move it to Memory.

  • As we didn't yet feel comfortable deleting the existing implementations, we just added ours on top of them. That can be good for testing and benchmarking against the current implementation; in the future, of course, we can remain with one implementation. Regarding your plan for "just one IncrementalIndex implementation (on-heap, but "serialized" in Memory objects)": you can get that from Oak, as Oak works both off-heap and on-heap, and we can allocate the ByteBuffer/Memory off-heap or on-heap. Clearly, the objects in the ByteBuffer/Memory are serialized. Please note that with Oak you get much more than a memory layout: better concurrency, a richer and more efficient API, fast backward scans, etc.

  • Regarding your concern about small values: we handle them better than ConcurrentSkipListMap as well. Off-heaping is less important with small values, but as mentioned, Oak has more to offer than just off-heaping. Of course, all the benchmarks need to be repeated inside Druid. Here are the results:
    OakMap OnHeap with Integer Values.pdf

  • "Druid needs variable-sized values": Oak definitely supports variable-sized values and keys! This was one of the initial design points.

  • Regarding your comment about GC and flattening. With a 10GB heap, GC efficiency depends on the number of objects: the smaller and more numerous they are, the worse it gets. When working with ConcurrentSkipListMap and small key-value pairs, hundreds of millions of key-value pairs still make young-generation garbage-collection pause times long. ConcurrentSkipListMap has two problems: (1) it needs two objects on expectation to store one key-value pair: one Index (the skiplist's average node height is 1) and the Node itself. (2) Recently inserted key-value pairs and their map structures (Index, Node) are allocated in the young generation. The card table (for the CMS GC algorithm) or RSet (for the G1 GC algorithm) changes frequently under high write throughput, which makes young-generation GC slow. However, I share your concern that we also spend too much memory on object overheads. Oak flattens it all: keys and values are saved in buffers (or any other container), and the index itself uses arrays as much as possible.

  • Finally, Oak clearly takes care of all the concurrency issues. It is designed for concurrent updates and reads of the data it stores.
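To illustrate how variable-sized keys and values can live in a flat buffer without per-entry Java objects, here is a sketch of the general length-prefix technique (an illustration only, not Oak's actual layout; the class and method names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch: variable-sized entries stored in one flat buffer, each entry
// preceded by a 4-byte length prefix; the "handle" to an entry is just
// its integer offset, not a Java object.
public class VarSizedEntries {
    static int append(ByteBuffer buf, byte[] value) {
        int offset = buf.position();
        buf.putInt(value.length);              // length prefix
        buf.put(value);                        // serialized payload
        return offset;                         // handle to the entry
    }

    static byte[] readAt(ByteBuffer buf, int offset) {
        int len = buf.getInt(offset);          // absolute read of the prefix
        byte[] out = new byte[len];
        ByteBuffer view = buf.duplicate();     // independent position/limit
        view.position(offset + Integer.BYTES);
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(1024);
        int a = append(buf, "short".getBytes(StandardCharsets.UTF_8));
        int b = append(buf, "a much longer value".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(readAt(buf, a), StandardCharsets.UTF_8));
        System.out.println(new String(readAt(buf, b), StandardCharsets.UTF_8));
    }
}
```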

@leventov
Member

leventov commented Apr 29, 2018

We do not see any special advantage to use Memory over ByteBuffer internally at Oak, but definitely it can be an advantage in Druid. In order not to slow the progress I would suggest we publish the code as we have it just to get your impression. If required we can move it to Memory.

The problem is that if the Druid API is going to depend on Memory, then if Oak is used, a conversion from ByteBuffer to Memory is required on every read and write access to the NavigableMap. The allocated Memory objects are garbage. I don't think the overhead of those conversions will be prohibitive, but it could be considerable, so it is better avoided.

@sanastas
Author

Being based on ByteBuffer or Memory is equivalent for Oak. We care a lot about performance, and converting from ByteBuffer to Memory on every access is not an option. When the aggregators transition from ByteBuffer to Memory, we will switch Oak to use Memory. For now we just want to get your review feedback and some benchmark performance comparisons.

By the way, do you have any ready performance benchmarks we can use? Do you use a performance evaluation platform like YCSB or similar?

@jihoonson mentioned this issue May 2, 2018
@sanastas
Author

sanastas commented May 6, 2018

Hi again everyone,

We will join the Druid developer sync meeting next Tuesday, May 8, in order to present the Oak (briefly), to discuss the integration, and answer your questions. Thanks!

@b-slim
Contributor

b-slim commented May 6, 2018

@sanastas You are more than welcome. For benchmarks, we do not have a YCSB client, but we do have some JMH benchmarks in the Druid repo itself; I think that is a good start.

@sanastas
Author

Here is the latest pull request: #6184

@sanastas
Author

I have updated the pull request: #6235

This pull request also includes interesting index ingestion benchmark results:
IndexIngestionBenchmarkWithOak.pdf

@sanastas
Author

The last PR is now: #6327

@sanastas
Author

Hi Everyone!

We are back! I have just created a new PR #7676 to continue the discussion about the new type of incremental index based on the Oak concurrent navigable map.

For those not yet familiar with Oak: it is an off-heap (direct-allocation) concurrent navigable (ordered) KV-map. More details about Oak can be found in the previous comments in this issue and in the PR #7676 introduction.

For those who are familiar with Oak, this is what has been done so far (since this project went on pause):

  1. The Druid integration was reconsidered and renewed to impose fewer changes on Druid. (The old refactoring schema no longer applies.) The Oak integration code is rebased on the up-to-date Druid code.
  2. Oak is now connected end-to-end, and a Druid cluster based on OakIncrementalIndex can be created. Details on how to activate it are in PR Add OakIncrementalIndex to Druid #7676.

The isolated Oak benchmarks can be found here (showing the potential of Oak vs ConcurrentSkipListMap): OAK PERFORMANCE.pdf
We will soon publish OakIncrementalIndex JMH benchmark results from Druid. However, those benchmarks are single-threaded, and we might need some modifications in order to show Oak's benefits well.

Please take a look at PR #7676 and raise your questions here or there. Clearly we have more work to do; this is just a start! :) We will join the Druid developer sync meeting next Tuesday, May 21, to discuss the Oak possibility further.

@sanastas
Author

sanastas commented Jun 6, 2019

While measuring Oak performance, we found that the IncrementalIndex module doesn't scale well with multiple threads. We traced the threads’ blocking states to two causes:

  • A monitor in IncrementalIndex that synchronizes access to dimensionDescs
  • A read-write lock in StringDimensionIndexer

The first case is addressed in PR #7838. You are welcome to take a look. Thanks!
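For illustration, the general pattern for removing such a monitor is to replace a synchronized get-or-add on a shared map with ConcurrentHashMap.computeIfAbsent. This is a sketch of the idea only, not the actual change in PR #7838; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a synchronized get-or-add on a shared map serializes every
// ingest thread; ConcurrentHashMap.computeIfAbsent is atomic per key,
// so threads contend only on the first insertion of a given dimension.
public class DimensionRegistry {
    private final Map<String, Integer> dims = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();

    int idFor(String dimension) {
        // atomic per-key get-or-add, no global monitor
        return dims.computeIfAbsent(dimension, d -> nextId.getAndIncrement());
    }

    public static void main(String[] args) {
        DimensionRegistry r = new DimensionRegistry();
        System.out.println(r.idFor("page"));   // 0
        System.out.println(r.idFor("user"));   // 1
        System.out.println(r.idFor("page"));   // 0 again: id is stable per key
    }
}
```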

@sanastas
Author

@jihoonson please take a look at the following proposal draft!

PROPOSAL

Motivation

The following are some imperfections of the current Druid IncrementalIndex:

  1. No way to take the incremental index data off-heap. Off-heap memory allows working with bigger memory chunks in RAM without additional performance degradation.
  2. Big on-heap memory footprint: many objects instead of serialized data.
  3. Frequent and slow segment merges.
  4. Low multi-threaded scalability of queries.

Proposed changes

Oak (Off-heap Allocated Keys) is a scalable concurrent KV-map for real-time analytics, which keeps all its keys and values in off-heap memory. Oak is faster and scales better with the number of CPU cores than popular NavigableMap implementations, e.g., Doug Lea’s ConcurrentSkipListMap (Java’s default).
We suggest integrating an Oak-based incremental index as an alternative to Druid’s existing incremental index, because Oak is naturally built for off-heap memory allocation, allows usage of much more RAM, has greater concurrency support, and should show better performance.

Rationale


Let’s follow the initial motivation points:

  1. Taking data off-heap allows:
  • Working with bigger data without paying the cost of JVM GC slowdown
  • Writing the data off-heap (even with an additional copy to off-heap) while increasing ingestion speed significantly
  2. When working with the IncrementalIndex Druid benchmarks and ingesting (for example) 2GB of data, we found that we needed to give 10GB more (!) in order to see proper results. Please see the IncrementalIngestionBenchmark results below. OakMap serializes the keys and the values (no object overheads). OakMap's metadata is modest and mostly consists of arrays (again, no object overheads). Additionally, OakMap can estimate its off-heap usage precisely.
  3. Segment merging in Druid is quite expensive. Segments are merged after (not big) IncrementalIndexes are flushed to disk and then need to be combined. Oak allows keeping more data in memory with the same (and sometimes much better) performance, so fewer flushes to disk (fewer files requiring merges) may occur. As future work, OakMaps could be merged while still in memory.
  4. Experiments with IncrementalIndex reads are still work in progress. However, comparing the multithreaded scalability of OakMap with ConcurrentSkipListMap, we see that OakMap scales better, so we expect to see that scalability in the Oak-based IncrementalIndex as well.

Operational impact

The suggested OakIncrementalIndex would live alongside the current solution, giving the ability to switch to OakIncrementalIndex when preferable. As OakIncrementalIndex doesn’t affect the on-disk layout, we do not foresee any specific operational impact.

Test plan (optional)

The plan is to add specific unit tests for OakIncrementalIndex and to pass all existing tests on a cluster based on OakIncrementalIndex.

Future work (optional)

As StringIndexer also takes a significant part of IncrementalIndex memory, we might take this map off-heap as well.

IncrementalIngestionBenchmark results

In the attached files, please see the results of IncrementalIngestionBenchmark when 2 million vs. 3 million rows are inserted. This is done with the native Druid incremental index and with the newly suggested OakIncrementalIndex (data taken off-heap). Druid’s incremental index always gets 12GB of on-heap memory. OakIncrementalIndex always gets 8GB on-heap and 4GB off-heap (12GB of RAM in total). Please note that the rows are prepared before the benchmark measurement and already take a big chunk of on-heap memory. The graphs show throughput (number of operations per second), so bigger is better. In all results OakIncrementalIndex shows better performance. When moving from 2M rows to 3M rows (using the same memory limits), Oak shows about 10% performance degradation while IncrementalIndex shows about 60% degradation.

Ingestion Throughput 2M rows (2.5GB data) ingested.pdf
Ingestion Throughput 3M rows (3.8GB data) ingested.pdf

@jihoonson
Contributor

@sanastas thanks! This proposal looks focused only on support for OakIncrementalIndex, which looks good to me. I have a question about the benchmark results you posted, though. Do you know what caused the performance degradation when the number of rows was increased? 60% degradation sounds pretty huge.

@ebortnik

@jihoonson chiming in here, too .. I believe the 60% throughput degradation is attributable to GC. The JVM just doesn't have enough headroom for the heap. The data itself is 3.8 GB (less than 1/3 of the managed RAM), all pre-generated prior to the test. OakIncrementalIndex is pretty immune to scaling (the 10% slowdown is organic to search as the data structure grows).

@sanastas
Author

@jihoonson thanks for taking a look!

First, indeed this proposal is about supporting OakIncrementalIndex. This is the idea we have been pursuing for a while already. In general it is about building bigger off-heap IncrementalIndexes and enjoying good performance :) Would you like to work with us on promoting this possibility? Just as a consultant; your insights are very valuable....

Second, Oak has a couple of advantages when working with big data. As @ebortnik mentioned, working with off-heap serialized data makes it less affected by the JVM GC. In addition, Oak exploits cache locality in its searches. Lastly, Oak works well under multi-threaded contention and scales well with multiple threads. However, in this specific experiment (single thread) the main problem should be GC.

Originally, Druid's IncrementalIndex allocates the (to-be-added) rows on-heap prior to the benchmarks (taking 4GB out of the given 12GB). Then StringIndexer takes more memory to save the String<->Integer translation; let's exaggerate and give it another 4GB. From there, for all other on-heap objects we are left with 4GB, which puts a lot of stress on the GC. The ConcurrentSkipListMap used in Druid's IncrementalIndex is known to be less GC-friendly due to the many small objects it allocates. I believe this is the reason for the huge performance degradation we see.

For example, here is a reference from Oracle themselves on the Java garbage collector:

Garbage collection (GC) reclaims the heap space previously allocated to objects no longer needed. The process of locating and removing the dead objects can stall any application and consume as much as 25 percent throughput.

And here one can take a look at experiments with GC and big heap sizes. The conclusion is:

From conducting numerous tests, we have concluded that unless you are utilizing some off-heap technology, no Garbage Collector provided with JDK will render any kind of stable GC performance with heap sizes larger than 16GB. For example, on 50GB heaps we can often encounter up to 5 minute GC pauses, with average pauses of 2 to 4 seconds.

Would be really glad to hear your thoughts!

@jihoonson
Contributor

@ebortnik @sanastas thank you for the details! The results look interesting.

Would you like to work with us on promoting this possibility?

Surely, let me know if you need anything.

@sanastas
Author

@jihoonson , I appreciate your intention to help with this project. Thanks a lot!

In your opinion, what performance result/benchmark (or any other trigger) would indicate that it is worth taking OakIncrementalIndex into Druid?

@jihoonson
Contributor

I think it would be nice to have two benchmarks. One benchmark directly compares the performance of OakIncrementalIndex and OnheapIncrementalIndex (and optionally OffheapIncrementalIndex, to show how optimized OakIncrementalIndex is at storing data in off-heap memory). The benchmark results would include both ingestion throughput and latency, to see how GC affects ingestion performance.

Another benchmark which I think would be nice to have would measure the performance of Appenderator. Appenderator is a wrapper class around IncrementalIndex that hides the details of persisting intermediary segments to local disk, and it is used by both the native batch index task and the Kafka/Kinesis index tasks. I think it would be a good benchmark to show how OakIncrementalIndex could improve ingestion performance by testing the production code path.

@sanastas
Author

@jihoonson , sounds good!

One benchmark is for directly comparing performance of OakIncrementalIndex and OnheapIncrementalIndex (and optionally OffheapIncrementalIndex to show how optimized OakIncrementalIndex is to store data in off-heap memory).

We have found OffheapIncrementalIndex unoptimized in some places and not very concurrent (it synchronizes on everything), and it looks like it is not in use. Thus we decided not to compare against it, as even in the tests/benchmarks some exceptions happen with OffheapIncrementalIndex.

The benchmark result would include both ingestion throughput and latency to see how GC could affect to the ingestion performance.

Is it something like the two graphs under "IncrementalIngestionBenchmark results" presented above? Is it the latency that is missing? Or different parameters?

@jon-wei
Contributor

jon-wei commented Jun 28, 2019

@sanastas

Thanks for contributing #7676!

I've been thinking about what the path to potentially merging #7676 would look like.

In Druid, there are currently two categories of code contributions, and for merging consideration, these two categories have different requirements.

  • Core Druid and core extensions
  • Contrib extensions

The requirements for contrib extensions are looser and could roughly be described as "reasonable implementation and potentially useful for some use cases or experimentation". The contrib extensions aren't actively maintained by the Druid committers, and are generally less extensively tested.

It can make sense for a new feature to start out as a contrib extension, and potentially migrate to core as it evolves. Examples of this include the Google Cloud Storage extension and the ORC format extension, which started out as contrib extensions and were recently adopted as core extensions.

For the Oak-based incremental index, this path could make sense as well, but Druid does not currently provide an extension point for incremental index implementations. To open that as an extension point would first involve discussion/consensus on whether it's a good idea to have that extension point, and there would also be significant design thought/implementation work required.

Given those difficulties, I think it makes sense to think about the path to merging Oak-based incremental index as a core feature. For merging a contribution into core, the requirement is essentially: "Convince Druid committers such that they are willing to take responsibility for and maintain the contribution going forward."

At the highest level, setting aside implementation details, I think it'd be helpful to see a comparison of performance metrics between Oak incremental index and the existing implementation on a real cluster.

I would try to set up realistic workloads for native batch ingestion and Kafka indexing service ingestion, and gather metrics for the following:

  • Ingestion throughput
  • Query performance (realtime tasks like Kafka indexing service tasks can answer queries)
  • Index persist performance

Separately from the incremental index topic, I wonder if OakMap could be used as part of Druid's GroupBy V2 query. There is a class called ConcurrentGrouper which is responsible for grouping/aggregating rows off-heap, with concurrent writes. This sounds like an area where OakMap could potentially be beneficial. If you're interested, that could be another worthwhile avenue for investigation.

@sanastas
Author

@jon-wei ,

I couldn't ask for a better start of the conversation! Thank you so much!

Totally agree with you: given that Druid does not currently provide an extension point for incremental index implementations, it is favorable to merge the Oak-based incremental index as a core feature. We do not intend to replace OnheapIncrementalIndex; Oak can be an alternative to turn on for some cases. I will also take a look at ConcurrentGrouper.

I would try to set up realistic workloads for native batch ingestion and Kafka indexing service ingestion

Can you tell us a bit more? Is there any commonly used dataset available? Any specific key/row distribution or data generator? Any commonly used performance measurement tools?

@jon-wei
Contributor

jon-wei commented Jul 3, 2019

Can you tell a bit more? Is there any available commonly used dataset? Any specific keys/rows distribution or data generator? Any performance measurement tools commonly used?

I think it'd be interesting to see performance with data from a cluster that's used for some real world purpose, with some real queries used in that cluster.

For performance measurement, you could gather query metrics emitted by druid, and check ingest/persist performance through task logs, profilers, etc. I don't have a specific tool in mind.

@sanastas
Author

sanastas commented Jul 3, 2019

I think it'd be interesting to see performance with data from a cluster that's used for some real world purpose

@jon-wei , it would be interesting indeed :)
However, unfortunately, we do not own any Druid cluster used for a real-world purpose. In order to be realistic, can we relax this requirement? Maybe still some synthetic workload? Or maybe you can help and provide some "real world" workload? Alternatively, maybe someone has the possibility to give a Druid cluster with OakIncrementalIndex a try?

@jihoonson , @b-slim , @leventov maybe you have ideas to contribute to this discussion? How can we evaluate Oak so that the community is convinced it is worth it? Thanking you all in advance!

@sanastas
Author

sanastas commented Jul 3, 2019

In the meanwhile, we would like to share some insights we had while playing with IncrementalIngestionBenchmark. We continued the experiments presented above under the “IncrementalIngestionBenchmark” title. We compare the native Druid incremental index and the newly suggested OakIncrementalIndex (data taken off-heap). The data distribution/generation is exactly as in IncrementalIngestionBenchmark; we just insert more rows.

This time we inserted 6 million rows (about 8GB of data) while giving 24GB of RAM. The number 24GB appears because it is almost the lowest amount allowing the native Druid IncrementalIndex to run properly. Even taking into consideration that, in IncrementalIngestionBenchmark, the rows are prepared before the benchmark measurement and already take a big chunk of on-heap memory, a 3x memory requirement sounds like a lot… Druid’s incremental index gets 24GB of on-heap memory. OakIncrementalIndex gets 16GB on-heap and 8GB off-heap (24GB of RAM in total). The results can be seen in the file below. The graph shows throughput (number of operations per second), so bigger is better. OakIncrementalIndex performs about twice as fast.

Ingestion Throughput 6M rows (8GB data) ingested.pdf

In order to stress the memory overhead requirement, we ran yet another experiment, this time inserting 7 million rows, which is up to 9GB of data. We gradually increased the memory and present the throughput as a function of total RAM used. Results are in the file below. OakIncrementalIndex's off-heap memory requirement was always 9GB, as that is what is written there. We started by giving 24GB of total RAM, as that is where OakIncrementalIndex became able to operate, although its throughput was very low. The native Druid IncrementalIndex was unable to operate until 28GB of on-heap memory was allowed.

Ingestion 9GB data into Druid.pdf

Does the question of metadata memory overhead bother you? Also, would you be interested in working with bigger IncrementalIndexes, in order to have fewer flushes to disk (persists), causing fewer merges and thus higher performance?

@jon-wei
Contributor

jon-wei commented Jul 3, 2019

However, unfortunately, we do not own any Druid cluster used for some real world purpose. In order to be realistic, can we relax this requirement? Maybe still some synthetic workload? Or maybe you can help and provide some "real world" workload? Alternatively, maybe anyone have a possibility to give Druid cluster with OakIncrementalIndex a try?

I think in the past some have used a portion of the TPCH data (specifically the lineitem table) for performance evaluations.

Beyond gathering performance numbers, I do recommend running the Oak incremental index on a real cluster, as part of the testing strategy.

@gianm
Contributor

gianm commented Jul 3, 2019

Some context around the potential impact of Oak on Druid's IncrementalIndex:

  1. There are two things that matter at ingestion time: ingestion throughput (for all forms of ingestion) and query latency (for realtime ingestion only -- batch ingest does not serve queries).
  2. It looks like you've been doing a lot of testing with really big incremental indexes, but it's more normal in Druid land to have smaller ones. There are a couple of reasons for this. One is that bigger indexes take up more memory, and large amounts of memory aren't always available. Another is that querying bigger indexes takes more time. The way Druid's ingestion works is that periodically, the incremental indexes are persisted to disk in Druid's segment format, which is compressed (smaller) and columnar & indexed (faster to query on a per-row basis). There are also multiple persist files and each can be processed in parallel. Keeping around too many rows in memory can negatively impact query speeds. In other words: having a 5 million row incremental index means that queries on realtime data cannot be faster than however long it takes to process those 5 million rows. This latter point matters for realtime ingestion (where queries happen concurrently with ingestion), and so for understanding the impact there, it'd be important to see how long queries take.
  3. For ingestion throughput there are three components that matter: throughput of adding to an incremental index, how long it takes to persist the incremental index to disk, and how long it takes to merge persisted indexes into a final segment at the end of the ingestion cycle. All of them matter, & one reason folks are asking for real-world numbers is to make sure all three of these are being taken into account.

Experience and query/ingest-rate metrics in a real cluster are the easiest way to validate all of the above, since the system is fairly complex and there are a lot of potential tradeoffs involved between the various components. If you don't have a real-world dataset available, maybe try the publicly available tweets dataset from the Twitter Streaming API. We often use it for test clusters. Some resources for that:


That all being said, you might also want to look at the potential impact of Oak on Druid's groupBy engine. Check out the ConcurrentGrouper class, which groupBy v2 queries use for parallel aggregation. In particular, check out the aggregate(KeyType key, int keyHash) method. It is slicing up a buffer and then synchronizing on each slice, a pretty simple strategy that I am sure could be improved on. Maybe Oak could do better. The code path is somewhat similar to IncrementalIndex: both of them involve grouping by time and dimensions, and aggregating using AggregatorFactories.
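A simplified sketch of the per-slice locking scheme described above (an illustration of the general strategy, not the actual ConcurrentGrouper code; the sizes and the single counter per slice are hypothetical):

```java
import java.nio.ByteBuffer;

// Sketch: one big buffer cut into N slices, each guarded by its own lock,
// so two threads contend only when their keys hash to the same slice.
public class SlicedBuffer {
    private final ByteBuffer[] slices;
    private final Object[] locks;

    SlicedBuffer(int totalBytes, int numSlices) {
        ByteBuffer whole = ByteBuffer.allocate(totalBytes);
        slices = new ByteBuffer[numSlices];
        locks = new Object[numSlices];
        int sliceBytes = totalBytes / numSlices;
        for (int i = 0; i < numSlices; i++) {
            whole.position(i * sliceBytes);
            whole.limit((i + 1) * sliceBytes);
            slices[i] = whole.slice();         // independent view of one slice
            whole.clear();
            locks[i] = new Object();
        }
    }

    // Aggregate a long counter held at offset 0 of the slice this key hashes to.
    void add(int keyHash, long delta) {
        int i = Math.floorMod(keyHash, slices.length);
        synchronized (locks[i]) {              // contention is per slice, not global
            ByteBuffer s = slices[i];
            s.putLong(0, s.getLong(0) + delta);
        }
    }

    long get(int keyHash) {
        int i = Math.floorMod(keyHash, slices.length);
        synchronized (locks[i]) {
            return slices[i].getLong(0);
        }
    }

    public static void main(String[] args) {
        SlicedBuffer b = new SlicedBuffer(1024, 4);
        b.add(7, 5);
        b.add(7, 3);
        System.out.println(b.get(7));          // prints 8
    }
}
```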

@sanastas
Copy link
Author

sanastas commented Jul 4, 2019

@jon-wei , thank you, we will try to find something about the TPCH data.

@gianm , Great to hear from you Gian and thank you for your great input!

No doubt query performance is important; we are working on it right now and will present results soon. The system-level test/performance benchmark is also clearly important. We are collecting information on how it should be run so that the results are convincing for the community.

It looks like you've been doing a lot of testing with really big incremental indexes, but it's more normal in Druid land to have smaller ones.

There is no intention to force Druid to work with big incremental indexes; we just wanted to show some cases where Oak's advantage is clear. Ingestion with Oak on smaller indexes has the same latency/throughput as the current IncrementalIndex, but takes less memory and offers potential for better concurrency if multi-threaded ingestion is used one day.

One is that bigger indexes take up more memory, and large amounts of memory aren't always available.

OakIncrementalIndex can let you handle more rows with less RAM.

Another is that querying bigger indexes takes more time

We are currently working on query speed. Hope to update you soon. What are the expected query times?

how long it takes to persist the incremental index to disk, and how long it takes to merge persisted indexes into a final segment at the end of the ingestion cycle.

There can be a trade-off here, assuming you process X bytes of data in total. It can be persisted in chunks of X/10 and then merged across 10 files, or alternatively persisted in big chunks of X/2 and merged across only two. I am exaggerating the numbers, and I am not sure this theory yields better performance in practice; it is just something to be checked.
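The trade-off above can be sketched with a toy cost model. The assumptions are mine, not measured: persist cost is linear in total bytes either way, and merging k persisted files of combined size X costs roughly X * log2(k), as in a k-way merge. Under this model fewer, larger chunks make the merge cheaper, though the model deliberately ignores the extra memory that big chunks hold.

```java
// Toy cost model for the persist-chunk-size trade-off (assumed coefficients).
public class PersistMergeModel {
    static double cost(long totalBytes, int numChunks,
                       double persistPerByte, double mergePerByte) {
        double persist = totalBytes * persistPerByte;                       // same total bytes either way
        double merge = totalBytes * mergePerByte
                * (Math.log(numChunks) / Math.log(2));                      // ~X * log2(k) k-way merge
        return persist + merge;
    }

    public static void main(String[] args) {
        // Two chunks of X/2 vs. ten chunks of X/10:
        System.out.println(cost(1_000L, 2, 1.0, 1.0));   // prints 2000.0
        System.out.println(cost(1_000L, 10, 1.0, 1.0));  // ~4321.9
    }
}
```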

Thank you for pointing out the publicly available dataset. We will investigate what we can do.


One additional question: how big are the read-only segments?

@eranmeir
Copy link

@jon-wei, @gianm

Thank you for the valuable input.

We will begin benchmarking using a single-machine setup as described in the quickstart guide.

We have real-world data from production Flurry server that we can use. It’s about 450M keys comprised of application names and timestamps, and we can generate values in different sizes, depending on what we plan to simulate.

Our plan is to use Druid’s REST APIs for (batch and stream) ingestion, aggregation queries, and mixed workloads. We’ll use Druid’s emitted metrics to measure ingestion throughput (via ingest/rows/output, ingest/persists/time and ingest/merge/time) and query latency (via query/time).

We’d be happy to hear your suggestions for tuning the system, and also suggested queries and metrics you think will be convincing for Oak’s adoption.

@sanastas
Copy link
Author

@gianm , @jon-wei , @jihoonson and everyone!

Oak can scan forward and backward at the same speed! Since each backward scan step in Java's ConcurrentSkipListMap costs O(log N), Oak performs almost ten times faster in backward scans.

Can you think of any scenario in Druid where a scan (iterator) goes both forward and backward, or only backward? Thanks!
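For readers unfamiliar with the asymmetry: skip-list nodes in ConcurrentSkipListMap carry forward pointers only, so a descending iterator must re-locate each predecessor with a fresh O(log N) search, whereas a forward scan follows a next pointer in O(1) per step. A minimal sketch of the backward-scan path (the iteration is still correct, just slower per step):

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class BackwardScanDemo {
    // Build a small map and collect its keys in descending order.
    static String backwardKeys() {
        ConcurrentSkipListMap<Integer, String> map = new ConcurrentSkipListMap<>();
        for (int i = 0; i < 5; i++) map.put(i, "v" + i);
        StringBuilder sb = new StringBuilder();
        // Each step of this descending iterator re-descends the skip list
        // (O(log N) per element), since nodes have no back-pointers.
        for (int k : map.descendingKeySet()) sb.append(k);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(backwardKeys()); // prints 43210
    }
}
```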

@sanastas
Copy link
Author

sanastas commented Jul 21, 2019

Hi again,

Just to back up what I said about memory usage with numbers, attached is a graph comparing the memory usage of OakIncrementalIndex with the existing IncrementalIndex. The experiment measures memory usage during an invocation of IncrementalIngestionBenchmark. The horizontal axis shows the number of rows in each experiment; the vertical axis shows GBs.

The black dashed line presents data size alone: just keys and values (IndexRow and Aggregators). The blue line presents total memory consumption for OakIncrementalIndex (both off-heap and on-heap). The red line presents total memory consumption for Druid's current IncrementalIndex (on-heap only). One can see that IncrementalIndex's metadata takes almost twice as much space as OakIncrementalIndex's metadata. In total, OakIncrementalIndex uses less memory than IncrementalIndex.

Memory Usage (Metadata vs Data).pdf

@eranmeir
Copy link

@jon-wei, @gianm

In a previous comment I wrote:

... We’ll use Druid’s emitted metrics to measure ingestion throughput (via ingest/rows/output, ingest/persists/time and ingest/merge/time) and query latency (via query/time).

I'm having trouble finding the above-mentioned ingestion metrics in the logs (I configured a logging emitter and do see other metrics). It seems RealtimeMetricsMonitor is no longer supported? How would you suggest configuring the benchmark cluster to retrieve the relevant metrics and calculate ingestion throughput?

Thanks!

@stale
Copy link

stale bot commented May 6, 2020

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale
Copy link

stale bot commented Jun 3, 2020

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Jun 3, 2020
@liran-funaro
Copy link
Contributor

liran-funaro commented Jun 4, 2020

Our new and improved OakIncrementalIndex significantly reduces Druid's memory consumption during the ingestion process. We have conducted system-level experiments as requested by the community, which show that OakIncrementalIndex uses 60% less memory and 50% less CPU time than OnheapIncrementalIndex to achieve the same performance.
This translates to nearly double the system's ingestion throughput with the same memory budget, and a 75% increase in throughput with the same CPU-time budget. Please check it out: #9967.
