Limit maximum number of output partitions for filesystem exchange #12228
Conversation
What's the limitation? |
@martint The number of files created as well as the number of requests issued to S3 grows quadratically. With 50 partitions and 50 tasks (assuming 1 partition is processed by a single task) a stage creates 2500 files. This feels like an idealistic upper limit of what can be handled reliably. The amount of memory needed for buffering also grows in a similar way. 50 tasks running concurrently producing 50 partitions need |
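To make the arithmetic in the comment above concrete, here is a minimal sketch of the file count per stage. It assumes, as the comment does, that every task writes one file per output partition; the class and method names are hypothetical and are not part of Trino.

```java
// Hypothetical helper for illustration only; not Trino code.
final class ExchangeFileEstimate {
    private ExchangeFileEstimate() {}

    // Assumption from the discussion: each task writes one file per output
    // partition, so a stage creates taskCount * partitionCount files.
    static long filesPerStage(int taskCount, int partitionCount) {
        if (taskCount < 0 || partitionCount < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        // Widen to long before multiplying to avoid int overflow.
        return (long) taskCount * partitionCount;
    }

    public static void main(String[] args) {
        // When tasks and partitions scale together, the file count grows quadratically.
        for (int n : new int[] {10, 50, 100, 200}) {
            System.out.println(n + " tasks x " + n + " partitions -> "
                    + filesPerStage(n, n) + " files");
        }
    }
}
```

With 50 tasks and 50 partitions this gives the 2500 files mentioned above; doubling both to 100 quadruples the count.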
That seems like an environmental/workload constraint, not a fundamental limit of how the file-based exchange works. Do we know that larger than 50 is problematic? How much larger than 50? Does the limit apply to GCS and Azure? Does the limit change if a user only plans to run one query at a time vs multiple queries concurrently? |
The limit of |
I have a concern that |
@jhlodin Good point. I wonder if we should set the default value for hash_partition_count to 50 if task level retries are enabled. |
This would be tricky I think. How would you know in one config what is the value in the other? |
We already do this to automatically disable https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/SystemSessionProperties.java#L1125 We could do something similar for
Ideally it would be great if the defaults worked out of the box as much as possible.
I agree. There are many ways users can misconfigure their clusters. However the |
As an afterthought: in the future we may also want to apply adaptive strategies to determine the number of hash partitions based on runtime information. I wonder if we want to call this property CC: @martint @losipiuk @linzebing @jhlodin ? |
That works from a documentation perspective. Is it appropriate to label that under |
The However the exchange implementation (the file system exchange), which is the only option that is available today, can only reliably handle up to 50 exchange partitions. In the future we may provide a more scalable implementation that would support thousands of partitions (which in turn would open more opportunities for higher scale and adaptivity). At that point we may decide to readjust the default accordingly. However at this point it feels better to set the default based on the exchange capabilities we currently provide. |
A separate config/session property may make sense. Or maybe we can make it static for now and set it to 50 for execution with task retries. We may have a hidden session property just in case, but not document it. Do we benefit from making the value smaller? |
Setting it to a smaller value would reduce the number of files created in S3 and reduce the number of requests being sent. However it will also reduce the maximum query size in terms of memory. |
Why do we not want to document that property @arhimondr ? .. can you please hash this out with @jhlodin and get a doc PR done if necessary (and ideally asap since we might cut release today) |
The file system based exchange manager implementation is not designed to support more than 50 partitions. We shouldn't encourage users to increase this value. |
Sounds good @arhimondr .. makes me wonder if we support upper bounds for parameters in airlift and if we should use that |
@mosabua We do. However the number of output partitions is passed to the exchange manager by the engine, so this change has to be done in that place |
Description
File system exchange is not designed to scale beyond 50 output partitions. Explicit enforcement is needed to keep the exchange subsystem from becoming unstable when it is misconfigured.
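The kind of enforcement described here could look roughly like the sketch below. This is not the code from this PR: the class name, method name, and exception type are assumptions made for illustration, and only the limit of 50 and the leading text of the error message come from this page.

```java
// Hypothetical sketch of an upper-bound check; not the actual change in this PR.
final class OutputPartitionLimit {
    // Limit taken from the PR description.
    static final int MAX_OUTPUT_PARTITIONS = 50;

    // Leading text of the error message quoted in the PR description;
    // the detail appended after it below is an assumption.
    static final String ERROR_MESSAGE = "Max number of output partitions exceeded for exchange";

    private OutputPartitionLimit() {}

    // Returns the count unchanged when it is within the limit, otherwise fails fast
    // so that an oversized exchange is rejected before any files are written.
    static int checkOutputPartitionCount(int outputPartitionCount) {
        if (outputPartitionCount > MAX_OUTPUT_PARTITIONS) {
            throw new IllegalArgumentException(
                    ERROR_MESSAGE + ": " + outputPartitionCount + " > " + MAX_OUTPUT_PARTITIONS);
        }
        return outputPartitionCount;
    }

    public static void main(String[] args) {
        System.out.println(checkOutputPartitionCount(50)); // within the limit
        try {
            checkOutputPartitionCount(100);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing at the point where the engine hands the partition count to the exchange manager matches the remark in the conversation that the count is passed in by the engine, so a config-level upper bound alone would not cover it.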
Fix
Exchange
When trying to use Tardigrade with the query.hash-partition-count configuration property (or the hash_partition_count session property) set to a value higher than 50, users will see an error message (Max number of output partitions exceeded for exchange). Before this change, the cluster could become unstable when the supported number of output partitions was exceeded.
Related issues, pull requests, and links
-
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
( ) Release notes entries required with the following suggested text: