
Limit maximum number of output partitions for filesystem exchange #12228

Merged
merged 1 commit into trinodb:master from number-of-partitions-limit
May 5, 2022

Conversation

arhimondr
Contributor

Description

File system exchange is not designed to scale beyond 50 output partitions. Explicit enforcement is needed to prevent the exchange subsystem from becoming unstable when it is not configured properly.

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Exchange

How would you describe this change to a non-technical end user or system administrator?

When trying to use Tardigrade with the query.hash-partition-count configuration property (or the hash_partition_count session property) set to a value higher than 50, users will see an error message (Max number of output partitions exceeded for exchange). Before this change, exceeding the supported number of output partitions could make the cluster unstable.

Related issues, pull requests, and links

-

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@martint
Member

martint commented May 3, 2022

File system exchange is not designed to scale beyond 50 output partitions. Explicit enforcement is needed to prevent the exchange subsystem from becoming unstable when it is not configured properly.

What's the limitation?

@arhimondr
Contributor Author

@martint The number of files created, as well as the number of requests issued to S3, grows quadratically. With 50 partitions and 50 tasks (assuming one partition is processed by a single task), a stage creates 2,500 files. This feels like an optimistic upper limit of what can be handled reliably. The amount of memory needed for buffering grows in a similar way: 50 tasks running concurrently, each producing 50 partitions, need 25 GB for buffering (5 MB minimal S3 chunk size * 2 for double buffering * number of output partitions * number of tasks).
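To make the arithmetic concrete, here is a back-of-envelope sketch (hypothetical code, not from this PR) that reproduces the numbers above; the provider's minimum chunk size is a parameter, which is also why the bound would shift only by a constant factor on other storage providers.

```java
// Hypothetical estimate of filesystem exchange fan-out costs; the 5 MB
// chunk size and the 50 x 50 stage shape come from the comment above.
public class ExchangeScalingEstimate
{
    public static void main(String[] args)
    {
        long minPartSizeBytes = 5L * 1024 * 1024; // smallest S3 multipart chunk
        int partitions = 50;
        int tasks = 50; // assuming one partition is processed by a single task

        long filesPerStage = (long) partitions * tasks;
        long bufferBytes = 2L * minPartSizeBytes * partitions * tasks; // x2 for double buffering

        System.out.printf("files per stage: %,d%n", filesPerStage);      // 2,500
        System.out.printf("buffer memory: %,d MB%n", bufferBytes >> 20); // 25,000 MB, i.e. ~25 GB
    }
}
```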

@martint
Member

martint commented May 3, 2022

That seems like an environmental/workload constraint, not a fundamental limit of how the file-based exchange works. Do we know that values larger than 50 are problematic? How much larger than 50? Does the limit apply to GCS and Azure? Does the limit change if a user plans to run only one query at a time vs. multiple queries concurrently?

@arhimondr
Contributor Author

The limit of 50 is a default, and it is configurable. The limit may change for GCP / Azure by a constant factor if the smallest chunk allowed by these providers is different. However, the fundamental problem of the number of files, requests, and memory growing quadratically will remain the same.

@jhlodin
Contributor

jhlodin commented May 4, 2022

I have a concern that query.hash-partition-count defaults to 100, but this now enforces a limit of 50 when using the filesystem exchange manager. Is there some compromise between lowering the engine default from 100 partitions and increasing this enforced limit of 50 that allows users to configure a filesystem exchange manager without having to set a custom value for this seemingly unrelated property?

@arhimondr
Contributor Author

@jhlodin Good point. I wonder if we should set the default value for hash_partition_count to 50 if task-level retries are enabled.

CC: @martint @losipiuk @linzebing

@losipiuk
Member

losipiuk commented May 4, 2022

@jhlodin Good point. I wonder if we should set the default value for hash_partition_count to 50 if task-level retries are enabled.

This would be tricky, I think. How would you know in one config what the value is in the other?
Maybe we can just drop this PR and assume that stating in the documentation that you should lower hash_partition_count to 50 is enough.
There are many ways to shoot oneself in the foot with session properties. This PR feels a bit like an overreaction to me.

@arhimondr
Contributor Author

This would be tricky, I think. How would you know in one config what the value is in the other?

We already do this to automatically disable distributed_sort and dynamic_filtering. Basically, we check the value of retry_policy and, based on that value, return a different value for a session property:

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/SystemSessionProperties.java#L1125
https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/SystemSessionProperties.java#L1279

We could do something similar for hash_partition_count: either cap the value at 50, throw an exception if the value exceeds 50, or maybe detect that the value was not set explicitly (through the session) and, in that case, override the default to 50.
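For illustration, here is a sketch of that pattern applied to hash_partition_count — a hypothetical fragment in the style of the linked SystemSessionProperties getters, not the merged behavior; the capping choice is just one of the options above.

```java
// Hypothetical getter, modeled on the linked code: derive the effective
// hash partition count from retry_policy. Intended as a fragment inside
// SystemSessionProperties, so Session, RetryPolicy, getRetryPolicy and
// HASH_PARTITION_COUNT refer to the existing types and constants there.
public static int getHashPartitionCount(Session session)
{
    int count = session.getSystemProperty(HASH_PARTITION_COUNT, Integer.class);
    if (getRetryPolicy(session) == RetryPolicy.TASK) {
        // Cap at what the filesystem exchange handles reliably. Alternatives
        // discussed above: throw if count > 50, or override only when the
        // property was not set explicitly on the session.
        return Math.min(count, 50);
    }
    return count;
}
```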

Maybe we can just drop this PR and assume that stating in the documentation that you should lower hash_partition_count to 50 is enough.

Ideally, the defaults would work out of the box as much as possible.

There are many ways to shoot oneself in the foot with session properties. This PR feels a bit like an overreaction to me.

I agree. There are many ways users can misconfigure their clusters. However, the hash_partition_count property is relatively well known and more likely to get adjusted.

@arhimondr
Contributor Author

arhimondr commented May 4, 2022

As an afterthought: query.hash-partition-count (hash_partition_count) has slightly different semantics. Effectively it is a max_hash_partition_count that is specific to streaming execution. I wonder if we should have a dedicated property, say fault-tolerant-execution-hash-partition-count, with a default that makes sense for fault-tolerant execution? What do you guys think?

Also, in the future we may want to apply adaptive strategies to determine the number of hash partitions based on runtime information. I wonder if we want to call this property fault-tolerant-execution-default-hash-partition-count to make it future-proof?

CC: @martint @losipiuk @linzebing @jhlodin
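If such a property were introduced, an airlift-style config sketch might look like the following; the property name is the one proposed above, while the class name and default value are illustrative assumptions, not code from this PR.

```java
import io.airlift.configuration.Config;
import io.airlift.configuration.ConfigDescription;

// Illustrative sketch of the proposed property, not merged code.
public class FaultTolerantExecutionConfig
{
    // Default sized to what the filesystem exchange can reliably handle.
    private int defaultHashPartitionCount = 50;

    public int getDefaultHashPartitionCount()
    {
        return defaultHashPartitionCount;
    }

    @Config("fault-tolerant-execution-default-hash-partition-count")
    @ConfigDescription("Default number of hash partitions under fault-tolerant execution")
    public FaultTolerantExecutionConfig setDefaultHashPartitionCount(int defaultHashPartitionCount)
    {
        this.defaultHashPartitionCount = defaultHashPartitionCount;
        return this;
    }
}
```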

@jhlodin
Contributor

jhlodin commented May 4, 2022

I wonder if we should have a dedicated property, say fault-tolerant-execution-hash-partition-count

That works from a documentation perspective. Is it appropriate to label that under fault-tolerant-execution-, though, when it's specific to the filesystem exchange manager? I lack context from a code perspective, but I wonder if we can nest this with the other exchange.xxxx properties that live in the exchange-manager.properties configuration file.

@arhimondr
Contributor Author

arhimondr commented May 4, 2022

The hash_partition_count session property currently defines how many intermediate partitions have to be created during a hash-based exchange. In the future we may (or may not, depending on the circumstances moving forward) choose to determine the number of exchange partitions automatically, based on runtime statistics, and use this property only as a fallback or a "guideline" under some circumstances.

However, the exchange implementation (the file system exchange), which is the only option available today, can only reliably handle up to 50 exchange partitions. In the future we may provide a more scalable implementation that supports thousands of partitions (which in turn would open up more opportunities for scale and adaptivity). At that point we may decide to re-adjust the default accordingly. For now, however, it feels better to set the default based on the exchange capabilities we currently provide.

@losipiuk
Member

losipiuk commented May 4, 2022

As an afterthought: query.hash-partition-count (hash_partition_count) has slightly different semantics. Effectively it is a max_hash_partition_count that is specific to streaming execution. I wonder if we should have a dedicated property, say fault-tolerant-execution-hash-partition-count, with a default that makes sense for fault-tolerant execution? What do you guys think?

Also, in the future we may want to apply adaptive strategies to determine the number of hash partitions based on runtime information. I wonder if we want to call this property fault-tolerant-execution-default-hash-partition-count to make it future-proof?

CC: @martint @losipiuk @linzebing @jhlodin

A separate config/session property may make sense. Or maybe we can make it static for now and set it to 50 for execution with task retries. We could have a hidden session property just in case, but not document it. Do we benefit from making the value smaller?

@arhimondr
Contributor Author

Do we benefit from making the value smaller?

Setting it to a smaller value would reduce the number of files created in S3 and the number of requests being sent. However, it would also reduce the maximum query size in terms of memory.

@arhimondr arhimondr merged commit 0e019b2 into trinodb:master May 5, 2022
@arhimondr arhimondr deleted the number-of-partitions-limit branch May 5, 2022 17:28
@mosabua
Member

mosabua commented May 5, 2022

Why do we not want to document that property, @arhimondr? Can you please hash this out with @jhlodin and get a doc PR done if necessary (and ideally ASAP, since we might cut a release today)?

@arhimondr
Contributor Author

@mosabua

Why do we not want to document that property, @arhimondr? Can you please hash this out with @jhlodin and get a doc PR done if necessary (and ideally ASAP, since we might cut a release today)?

The file system based exchange manager implementation is not designed to support more than 50 output partitions. We shouldn't encourage users to increase this value.

@mosabua
Member

mosabua commented May 5, 2022

Sounds good, @arhimondr. Makes me wonder if we support upper bounds for parameters in airlift, and whether we should use that.

@arhimondr
Contributor Author

@mosabua We do. However, the number of output partitions is passed to the exchange manager by the engine at runtime, so this check has to be done in that place.
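For reference, the airlift mechanism in question is bean-validation constraints on config getters. A hedged sketch (the property name and class are hypothetical) that also illustrates why it doesn't help here:

```java
import io.airlift.configuration.Config;
import javax.validation.constraints.Max;
import javax.validation.constraints.Min;

// Sketch of airlift's upper-bound support: bean-validation annotations on a
// config getter. This only validates values read from configuration files;
// as noted above, the partition count reaches the exchange manager from the
// engine at runtime, so the enforcement in this PR lives on that path instead.
public class ExampleExchangeConfig
{
    private int maxOutputPartitionCount = 50; // illustrative name and default

    @Min(1)
    @Max(50)
    public int getMaxOutputPartitionCount()
    {
        return maxOutputPartitionCount;
    }

    @Config("exchange.max-output-partition-count") // hypothetical property key
    public ExampleExchangeConfig setMaxOutputPartitionCount(int maxOutputPartitionCount)
    {
        this.maxOutputPartitionCount = maxOutputPartitionCount;
        return this;
    }
}
```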
