Querying large datasets takes up a large amount of memory #341
Conversation
This turned out to be due to the multiple levels of caching that we have. The sample data set had 12 million data points spread across ~170 shards. Each shard's Passthrough engine sent a response with 1100 points, and at the coordinator we are buffering responses in a channel of size …
OK, after brainstorming, here's what we came up with. There should be a setting for the number of shards that can be queried in parallel. We should remove the option for per-shard buffer size; that should be something we compute based on the query. For example, if the shard duration is 7d and the group by interval is 1m, we know that we could possibly buffer 1440 * 7 points (1440 minutes per day over 7 days). If responses always bring back 200 points, then the buffer would need to be 1440 * 7 / 200 + 1 in size. Further, if there is no group by time interval, then the number of shards to query in parallel should be 1. That guarantees we never have responses buffering from one shard while another, newer shard is going slow. A sketch of this calculation follows below.
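A minimal sketch of the buffer-size calculation described above; the function and parameter names are illustrative and not InfluxDB's actual API.

```go
package main

import (
	"fmt"
	"time"
)

// responseBufferSize estimates how many responses a single shard can produce
// for a query: (shard duration / group-by interval) points, delivered in
// responses of pointsPerResponse points each, plus one slot for a partial
// final response.
func responseBufferSize(shardDuration, groupByInterval time.Duration, pointsPerResponse int) int {
	if groupByInterval <= 0 {
		// No group by time interval: shards should be queried one at a
		// time, so a single buffered slot is enough.
		return 1
	}
	maxPoints := int(shardDuration / groupByInterval)
	return maxPoints/pointsPerResponse + 1
}

func main() {
	// The example from the comment above: 7d shards, 1m group by interval,
	// 200 points per response -> 1440*7/200 + 1 = 51 slots.
	fmt.Println(responseBufferSize(7*24*time.Hour, time.Minute, 200))
}
```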
Remove the setting for shard query buffer size and add logic for max number of shards to query in parallel.
We shouldn't be dropping responses anymore, since out-of-order response reception is no longer possible. Also fix the logic that decides whether the shards should be queried sequentially or in parallel. The only case where parallel querying is safe is a single time series with aggregation over time only; any other case is currently not safe to run in parallel. A sketch of this check follows below.
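An illustrative sketch (not the actual InfluxDB code) of the rule described above: only a single series, aggregated and grouped by time alone, is queried across shards in parallel; everything else runs sequentially.

```go
package main

import "fmt"

// QuerySpec is a hypothetical, simplified view of a parsed query; the real
// parser types in InfluxDB are different.
type QuerySpec struct {
	SeriesCount     int  // number of time series the query touches
	HasAggregation  bool // e.g. count(), sum(), mean()
	GroupByTimeOnly bool // grouped by time and nothing else
}

// canQueryShardsInParallel encodes the rule from the comment above.
func canQueryShardsInParallel(q QuerySpec) bool {
	return q.SeriesCount == 1 && q.HasAggregation && q.GroupByTimeOnly
}

func main() {
	fmt.Println(canQueryShardsInParallel(QuerySpec{1, true, true}))   // true
	fmt.Println(canQueryShardsInParallel(QuerySpec{2, true, true}))   // false: multiple series
	fmt.Println(canQueryShardsInParallel(QuerySpec{1, false, false})) // false: raw, ungrouped query
}
```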
This patch uses a channel of response channels instead of a slice of response channels, creating a pipeline instead of batches. In other words, before this patch we processed shardConcurrentLimit shards first, then processed the next shardConcurrentLimit. With this patch we constantly have shardConcurrentLimit shards in the pipeline: as soon as we're done with one shard we start querying a new shard, and so on. This provides more parallelism and a cleaner design; a rough sketch of the pattern follows below.
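A rough, self-contained sketch of the channel-of-channels pipeline described above; the names (queryShard, pending, shardConcurrentLimit as a constant) are illustrative rather than the patch's actual identifiers.

```go
package main

import "fmt"

// Response is a stand-in for a shard's query response.
type Response struct {
	ShardID int
	Done    bool
}

// queryShard simulates streaming a shard's responses into its own channel.
func queryShard(id int, out chan<- Response) {
	out <- Response{ShardID: id}
	out <- Response{ShardID: id, Done: true}
	close(out)
}

func main() {
	shardIDs := []int{1, 2, 3, 4, 5}
	const shardConcurrentLimit = 2

	// A channel of response channels: its buffer bounds how many shards can
	// be queried ahead of the shard currently being drained.
	pending := make(chan chan Response, shardConcurrentLimit)

	// Producer: start shards as slots free up, rather than in fixed batches.
	go func() {
		for _, id := range shardIDs {
			out := make(chan Response, 1)
			pending <- out // blocks once the pipeline is full
			go queryShard(id, out)
		}
		close(pending)
	}()

	// Consumer: drain shards in order; finishing one frees a slot so the
	// producer can immediately start the next shard.
	for out := range pending {
		for resp := range out {
			if resp.Done {
				fmt.Println("finished shard", resp.ShardID)
			}
		}
	}
}
```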
select count(value) from some_series
on a series with many points takes up a huge amount of memory. Shouldn't be the case...