
Limit maximum number of output partitions for filesystem exchange #12228

Merged
merged 1 commit into trinodb:master from number-of-partitions-limit
May 5, 2022

Conversation

arhimondr
Contributor

Description

File system exchange is not designed to scale beyond 50 output partitions. Explicit enforcement is needed to prevent the exchange subsystem from becoming unstable when it is not configured properly.

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Exchange

How would you describe this change to a non-technical end user or system administrator?

When trying to use Tardigrade with the query.hash-partition-count configuration property (or the hash_partition_count session property) set to a value higher than 50, users will see an error message (Max number of output partitions exceeded for exchange). Before this change, exceeding the supported number of output partitions could make the cluster unstable.

Related issues, pull requests, and links

-

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@martint
Member

martint commented May 3, 2022

File system exchange is not designed to scale beyond 50 output partitions. Explicit enforcement is needed to prevent the exchange subsystem from becoming unstable when it is not configured properly.

What's the limitation?

@arhimondr
Contributor Author

@martint The number of files created, as well as the number of requests issued to S3, grows quadratically. With 50 partitions and 50 tasks (assuming one partition is processed by a single task), a stage creates 2,500 files. This feels like an optimistic upper limit of what can be handled reliably. The amount of memory needed for buffering grows in a similar way: 50 tasks running concurrently, each producing 50 partitions, need 25 GB for buffering (5 MB minimal S3 chunk size * 2 for double buffering * number of output partitions * number of tasks).
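To make the arithmetic concrete, here is a back-of-envelope sketch (hypothetical code, not from this PR) that reproduces the numbers above; the provider's minimum chunk size is a parameter, which is also why the bound would shift only by a constant factor on other storage providers.

```java
// Hypothetical estimate of filesystem exchange fan-out costs; the 5 MB
// chunk size and the 50 x 50 stage shape come from the comment above.
public class ExchangeScalingEstimate
{
    public static void main(String[] args)
    {
        long minPartSizeBytes = 5L * 1024 * 1024; // smallest S3 multipart chunk
        int partitions = 50;
        int tasks = 50; // assuming one partition is processed by a single task

        long filesPerStage = (long) partitions * tasks;
        long bufferBytes = 2L * minPartSizeBytes * partitions * tasks; // x2 for double buffering

        System.out.printf("files per stage: %,d%n", filesPerStage);      // 2,500
        System.out.printf("buffer memory: %,d MB%n", bufferBytes >> 20); // 25,000 MB, i.e. ~25 GB
    }
}
```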

@martint
Member

martint commented May 3, 2022

That seems like an environmental/workload constraint, not a fundamental limit of how the file-based exchange works. Do we know that values larger than 50 are problematic? How much larger than 50? Does the limit apply to GCS and Azure? Does the limit change if a user plans to run only one query at a time vs. multiple queries concurrently?

@arhimondr
Contributor Author

The limit of 50 is a default, and it is configurable. The limit may change for GCP / Azure by a constant factor if the smallest chunk allowed by these providers is different. However, the fundamental problem of the number of files, requests, and memory growing quadratically will remain the same.

@jhlodin
Contributor

jhlodin commented May 4, 2022

I have a concern that query.hash-partition-count defaults to 100, but this now enforces a limit of 50 when using the filesystem exchange manager. Is there some compromise between lowering the engine default from 100 partitions and increasing this enforced limit of 50 that allows users to configure a filesystem exchange manager without having to set a custom value for this seemingly unrelated property?

@arhimondr
Contributor Author

@jhlodin Good point. I wonder if we should set the default value for hash_partition_count to 50 if task-level retries are enabled.

CC: @martint @losipiuk @linzebing

@losipiuk
Member

losipiuk commented May 4, 2022

@jhlodin Good point. I wonder if we should set the default value for hash_partition_count to 50 if task-level retries are enabled.

This would be tricky, I think. How would you know in one config what the value is in the other?
Maybe we can just drop this PR and assume that stating in the documentation that you should lower hash_partition_count to 50 is enough.
There are many ways to shoot oneself in the foot with session properties. This PR feels a bit like an overreaction to me.

@arhimondr
Contributor Author

This would be tricky, I think. How would you know in one config what the value is in the other?

We already do this to automatically disable distributed_sort and dynamic_filtering. Basically, we check the value of retry_policy and, based on that value, return a different value for a session property:

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/SystemSessionProperties.java#L1125
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/SystemSessionProperties.java#L1279

We could do something similar for hash_partition_count: either cap the value at 50, throw an exception if the value exceeds 50, or maybe detect that the value was not set explicitly (through the session) and, in that case, override the default to 50.
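For illustration, here is a sketch of that pattern applied to hash_partition_count — a hypothetical fragment in the style of the linked SystemSessionProperties getters, not the merged behavior; the capping choice is just one of the options above.

```java
// Hypothetical getter, modeled on the linked code: derive the effective
// hash partition count from retry_policy. Intended as a fragment inside
// SystemSessionProperties, so Session, RetryPolicy, getRetryPolicy and
// HASH_PARTITION_COUNT refer to the existing types and constants there.
public static int getHashPartitionCount(Session session)
{
    int count = session.getSystemProperty(HASH_PARTITION_COUNT, Integer.class);
    if (getRetryPolicy(session) == RetryPolicy.TASK) {
        // Cap at what the filesystem exchange handles reliably. Alternatives
        // discussed above: throw if count > 50, or override only when the
        // property was not set explicitly on the session.
        return Math.min(count, 50);
    }
    return count;
}
```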

Maybe we can just drop this PR and assume that stating in the documentation that you should lower hash_partition_count to 50 is enough.

Ideally, the defaults would work out of the box as much as possible.

There are many ways to shoot oneself in the foot with session properties. This PR feels a bit like an overreaction to me.

I agree. There are many ways users can misconfigure their clusters. However, the hash_partition_count property is relatively well known and more likely to get adjusted.

@arhimondr
Contributor Author

arhimondr commented May 4, 2022

As an afterthought: query.hash-partition-count (hash_partition_count) has slightly different semantics. Effectively it is a max_hash_partition_count that is specific to streaming execution. I wonder if we should have a dedicated property, say fault-tolerant-execution-hash-partition-count, with a default that makes sense for fault-tolerant execution? What do you guys think?

Also, in the future we may want to apply adaptive strategies to determine the number of hash partitions based on runtime information. I wonder if we want to call this property fault-tolerant-execution-default-hash-partition-count to make it future-proof?

CC: @martint @losipiuk @linzebing @jhlodin
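If such a property were introduced, an airlift-style config sketch might look like the following; the property name is the one proposed above, while the class name and default value are illustrative assumptions, not code from this PR.

```java
import io.airlift.configuration.Config;
import io.airlift.configuration.ConfigDescription;

// Illustrative sketch of the proposed property, not merged code.
public class FaultTolerantExecutionConfig
{
    // Default sized to what the filesystem exchange can reliably handle.
    private int defaultHashPartitionCount = 50;

    public int getDefaultHashPartitionCount()
    {
        return defaultHashPartitionCount;
    }

    @Config("fault-tolerant-execution-default-hash-partition-count")
    @ConfigDescription("Default number of hash partitions under fault-tolerant execution")
    public FaultTolerantExecutionConfig setDefaultHashPartitionCount(int defaultHashPartitionCount)
    {
        this.defaultHashPartitionCount = defaultHashPartitionCount;
        return this;
    }
}
```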

@jhlodin
Contributor

jhlodin commented May 4, 2022

I wonder if we should have a dedicated property, say fault-tolerant-execution-hash-partition-count

That works from a documentation perspective. Is it appropriate to label that under fault-tolerant-execution-, though, when it's specific to the filesystem exchange manager? I lack context from a code perspective, but I wonder if we can nest this with the other exchange.xxxx properties that live in the exchange-manager.properties configuration file.

@arhimondr
Contributor Author

arhimondr commented May 4, 2022

The hash_partition_count session property currently defines how many intermediate partitions have to be created during a hash-based exchange. In the future we may (or may not, depending on the circumstances moving forward) choose to determine the number of exchange partitions automatically, based on runtime statistics, and use this property only as a fallback or a "guideline" under some circumstances.

However, the exchange implementation (the file system exchange), which is the only option available today, can only reliably handle up to 50 exchange partitions. In the future we may provide a more scalable implementation that supports thousands of partitions (which in turn would open up more opportunities for scale and adaptivity). At that point we may decide to re-adjust the default accordingly. For now, however, it feels better to set the default based on the exchange capabilities we currently provide.

@losipiuk
Member

losipiuk commented May 4, 2022

As an afterthought: query.hash-partition-count (hash_partition_count) has slightly different semantics. Effectively it is a max_hash_partition_count that is specific to streaming execution. I wonder if we should have a dedicated property, say fault-tolerant-execution-hash-partition-count, with a default that makes sense for fault-tolerant execution? What do you guys think?

Also, in the future we may want to apply adaptive strategies to determine the number of hash partitions based on runtime information. I wonder if we want to call this property fault-tolerant-execution-default-hash-partition-count to make it future-proof?

CC: @martint @losipiuk @linzebing @jhlodin

A separate config/session property may make sense. Or maybe we can make it static for now and set it to 50 for execution with task retries. We could have a hidden session property just in case, but not document it. Do we benefit from making the value smaller?

@arhimondr
Contributor Author

Do we benefit from making the value smaller?

Setting it to a smaller value would reduce the number of files created in S3 and the number of requests being sent. However, it would also reduce the maximum query size in terms of memory.

@arhimondr arhimondr merged commit 0e019b2 into trinodb:master May 5, 2022
@arhimondr arhimondr deleted the number-of-partitions-limit branch May 5, 2022 17:28
@mosabua
Member

mosabua commented May 5, 2022

Why do we not want to document that property, @arhimondr? Can you please hash this out with @jhlodin and get a doc PR done if necessary (and ideally ASAP, since we might cut a release today)?

@arhimondr
Contributor Author

@mosabua

Why do we not want to document that property, @arhimondr? Can you please hash this out with @jhlodin and get a doc PR done if necessary (and ideally ASAP, since we might cut a release today)?

The file system based exchange manager implementation is not designed to support more than 50 output partitions. We shouldn't encourage users to increase this value.

@mosabua
Member

mosabua commented May 5, 2022

Sounds good, @arhimondr. Makes me wonder if we support upper bounds for parameters in airlift, and whether we should use that.

@arhimondr
Contributor Author

@mosabua We do. However, the number of output partitions is passed to the exchange manager by the engine at runtime, so this check has to be done in that place.
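For reference, the airlift mechanism in question is bean-validation constraints on config getters. A hedged sketch (the property name and class are hypothetical) that also illustrates why it doesn't help here:

```java
import io.airlift.configuration.Config;
import javax.validation.constraints.Max;
import javax.validation.constraints.Min;

// Sketch of airlift's upper-bound support: bean-validation annotations on a
// config getter. This only validates values read from configuration files;
// as noted above, the partition count reaches the exchange manager from the
// engine at runtime, so the enforcement in this PR lives on that path instead.
public class ExampleExchangeConfig
{
    private int maxOutputPartitionCount = 50; // illustrative name and default

    @Min(1)
    @Max(50)
    public int getMaxOutputPartitionCount()
    {
        return maxOutputPartitionCount;
    }

    @Config("exchange.max-output-partition-count") // hypothetical property key
    public ExampleExchangeConfig setMaxOutputPartitionCount(int maxOutputPartitionCount)
    {
        this.maxOutputPartitionCount = maxOutputPartitionCount;
        return this;
    }
}
```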
