Querying large datasets takes up a large amount of memory #341
Conversation
This turned out to be due to the multiple levels of caching that we have. The sample data set had 12 million data points spread across ~170 shards. Each shard's Passthrough engine sent a response with 1100 points, and at the coordinator we are buffering responses in a channel of size …
OK, after brainstorming, here's what we came up with. There should be a setting for the number of shards that can be queried in parallel. We should remove the option for per-shard buffer size; that should be something we compute based on the query. For example, if the shard duration is 7d and the group by interval is 1m, we know that we could possibly buffer 1440 * 7 points (1440 minutes per day over 7 days). If responses always bring back 200 points, then the buffer would need to be 1440 * 7 / 200 + 1 in size. Further, if there is no group by time interval, then the number of shards to query in parallel should be 1. That guarantees we never have responses buffering from one shard while another, newer shard is going slow. A sketch of this calculation follows below.
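A minimal sketch of the buffer-size calculation described above; the function and parameter names are illustrative and not InfluxDB's actual API.

```go
package main

import (
	"fmt"
	"time"
)

// responseBufferSize estimates how many responses a single shard can produce
// for a query: (shard duration / group-by interval) points, delivered in
// responses of pointsPerResponse points each, plus one slot for a partial
// final response.
func responseBufferSize(shardDuration, groupByInterval time.Duration, pointsPerResponse int) int {
	if groupByInterval <= 0 {
		// No group by time interval: shards should be queried one at a
		// time, so a single buffered slot is enough.
		return 1
	}
	maxPoints := int(shardDuration / groupByInterval)
	return maxPoints/pointsPerResponse + 1
}

func main() {
	// The example from the comment above: 7d shards, 1m group by interval,
	// 200 points per response -> 1440*7/200 + 1 = 51 slots.
	fmt.Println(responseBufferSize(7*24*time.Hour, time.Minute, 200))
}
```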
Remove the setting for shard query buffer size and add logic for max number of shards to query in parallel.
We shouldn't be dropping responses anymore, since out-of-order response reception is no longer possible. Also fix the logic that decides whether the shards should be queried sequentially or in parallel. The only case where parallel querying is safe is a single time series with aggregation over time only; any other case is currently not safe to run in parallel. A sketch of this check follows below.
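An illustrative sketch (not the actual InfluxDB code) of the rule described above: only a single series, aggregated and grouped by time alone, is queried across shards in parallel; everything else runs sequentially.

```go
package main

import "fmt"

// QuerySpec is a hypothetical, simplified view of a parsed query; the real
// parser types in InfluxDB are different.
type QuerySpec struct {
	SeriesCount     int  // number of time series the query touches
	HasAggregation  bool // e.g. count(), sum(), mean()
	GroupByTimeOnly bool // grouped by time and nothing else
}

// canQueryShardsInParallel encodes the rule from the comment above.
func canQueryShardsInParallel(q QuerySpec) bool {
	return q.SeriesCount == 1 && q.HasAggregation && q.GroupByTimeOnly
}

func main() {
	fmt.Println(canQueryShardsInParallel(QuerySpec{1, true, true}))   // true
	fmt.Println(canQueryShardsInParallel(QuerySpec{2, true, true}))   // false: multiple series
	fmt.Println(canQueryShardsInParallel(QuerySpec{1, false, false})) // false: raw, ungrouped query
}
```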
This patch uses a channel of response channels instead of a slice of response channels, creating a pipeline instead of batches. In other words, before this patch we processed shardConcurrentLimit shards first, then processed the next shardConcurrentLimit. With this patch we constantly have shardConcurrentLimit shards in the pipeline: as soon as we're done with one shard we start querying a new shard, and so on. This provides more parallelism and a cleaner design; a rough sketch of the pattern follows below.
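A rough, self-contained sketch of the channel-of-channels pipeline described above; the names (queryShard, pending, shardConcurrentLimit as a constant) are illustrative rather than the patch's actual identifiers.

```go
package main

import "fmt"

// Response is a stand-in for a shard's query response.
type Response struct {
	ShardID int
	Done    bool
}

// queryShard simulates streaming a shard's responses into its own channel.
func queryShard(id int, out chan<- Response) {
	out <- Response{ShardID: id}
	out <- Response{ShardID: id, Done: true}
	close(out)
}

func main() {
	shardIDs := []int{1, 2, 3, 4, 5}
	const shardConcurrentLimit = 2

	// A channel of response channels: its buffer bounds how many shards can
	// be queried ahead of the shard currently being drained.
	pending := make(chan chan Response, shardConcurrentLimit)

	// Producer: start shards as slots free up, rather than in fixed batches.
	go func() {
		for _, id := range shardIDs {
			out := make(chan Response, 1)
			pending <- out // blocks once the pipeline is full
			go queryShard(id, out)
		}
		close(pending)
	}()

	// Consumer: drain shards in order; finishing one frees a slot so the
	// producer can immediately start the next shard.
	for out := range pending {
		for resp := range out {
			if resp.Done {
				fmt.Println("finished shard", resp.ShardID)
			}
		}
	}
}
```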
select count(value) from some_series
on a series with many points takes up a huge amount of memory. Shouldn't be the case...