Add properties, requirements, and best practices from FTE blog #12231
@@ -49,9 +49,14 @@ execution on a Trino cluster:
* - ``exchange.deduplication-buffer-size``
  - Size of the coordinator's in-memory buffer used by fault-tolerant
    execution to store output of query :ref:`stages <trino-concept-stage>`.
    If this buffer is filled during query execution, the query fails with a
    "Task descriptor storage capacity has been exceeded" error message unless
    an :ref:`exchange manager <fte-exchange-manager>` is configured.
  - ``32MB``
* - ``exchange.compression-enabled``
  - Enable compression of spooling data. Setting to ``true`` is recommended
    when using an :ref:`exchange manager <fte-exchange-manager>`.
  - ``false``
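
For illustration, both properties above can be set in the cluster's
``config.properties``; the ``64MB`` buffer size below is an arbitrary example
value rather than a recommendation:

.. code-block:: properties

    exchange.deduplication-buffer-size=64MB
    exchange.compression-enabled=true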

.. _fte-retry-policy:
@@ -91,11 +96,28 @@ recommended when executing large batch queries, as the cluster can more
efficiently retry smaller tasks within the query rather than retry the whole
query.

``TASK`` retry policy requires a configured :ref:`exchange manager
<fte-exchange-manager>` to store spooled exchange data used for each task.

The following cluster configuration changes are recommended to improve
fault-tolerant execution with a ``TASK`` retry policy, and are combined in the
sketch after this list:

* Set the ``query.low-memory-killer.policy``
  :doc:`query management property </admin/properties-query-management>` to
  ``total-reservation-on-blocked-nodes``, or queries may need to be manually
  killed if the cluster runs out of memory.
* Set the ``query.low-memory-killer.delay``
  :doc:`query management property </admin/properties-query-management>` to
  ``0s`` so the cluster immediately unblocks nodes that run out of memory.
* Modify the ``query.remote-task.max-error-duration``
  :doc:`query management property </admin/properties-query-management>`
  to adjust how long Trino allows a remote task to try reconnecting before
  considering it lost and rescheduling.
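
A minimal sketch of these recommendations combined in the coordinator's
``config.properties``; the ``retry-policy`` line enables task-level retries,
and the ``5m`` value for the error duration is only illustrative:

.. code-block:: properties

    retry-policy=TASK
    query.low-memory-killer.policy=total-reservation-on-blocked-nodes
    query.low-memory-killer.delay=0s
    query.remote-task.max-error-duration=5m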

.. note::

    A ``TASK`` retry policy is best suited for large batch queries, but this
    policy can result in higher latency for short-running queries executed in
    high volume. As a best practice, it is recommended to run a dedicated
    cluster with a ``TASK`` retry policy for large batch queries, separate
    from another cluster that handles short queries.

Advanced configuration
----------------------
@@ -264,6 +286,17 @@ fault-tolerant execution:
    reschedule tasks in case of a failure.
  - (JVM heap size * 0.15)
  - Only ``TASK``
* - ``fault-tolerant-execution-partition-count``
  - Number of partitions to use for distributed joins and aggregations,
    similar in function to the ``query.hash-partition-count`` :doc:`query
    management property </admin/properties-query-management>`. It is not
    recommended to increase this property value above the default of ``50``,
    as doing so may result in instability and poor performance. May be
    overridden for the current session with the
    ``fault_tolerant_execution_partition_count`` :ref:`session property
    <session-properties-definition>`.
  - ``50``
  - Only ``TASK``
* - ``max-tasks-waiting-for-node-per-stage``
  - Allows up to the configured number of tasks to wait for node allocation
    per stage before pausing scheduling of other tasks from this stage.
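
For example, the partition count can be pinned at the cluster level in
``config.properties``; the value shown is simply the documented default:

.. code-block:: properties

    fault-tolerant-execution-partition-count=50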
@@ -280,6 +313,9 @@ fault-tolerant execution. You can configure a filesystem-based exchange manager
that stores spooled data in a specified location, either an S3-compatible
storage system or a local filesystem.

Configuration
^^^^^^^^^^^^^

To configure an exchange manager, create a new
``etc/exchange-manager.properties`` configuration file on the coordinator and
all worker nodes. In this file, set the ``exchange-manager.name`` configuration
property to ``filesystem``.
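
Before adding storage-specific properties, a minimal skeleton of that file
looks like the following sketch:

.. code-block:: properties

    exchange-manager.name=filesystem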

@@ -294,8 +330,8 @@ for your storage solution.
  - Description
  - Default value
* - ``exchange.base-directories``
  - Comma-separated list of URI locations that the exchange manager uses to
    store spooling data. Only supports S3 and local filesystems.
  -
* - ``exchange.encryption-enabled``
  - Enable encryption of spooling data.
@@ -339,19 +375,45 @@ for your storage solution.
  - Part size for S3 multi-part upload.
  - ``5MB``

It is also recommended to set the ``exchange.compression-enabled`` property to
``true`` in the cluster's ``config.properties`` file, to reduce the exchange
manager's overall I/O load.

S3 bucket storage
~~~~~~~~~~~~~~~~~

If using an AWS S3 bucket, it is recommended to configure a bucket lifecycle
rule in AWS to automatically expire abandoned objects in the event of a node
crash.

The following example ``exchange-manager.properties`` configuration specifies
an AWS S3 bucket as the spooling storage destination. Note that the destination
does not have to be in AWS, but can be any S3-compatible storage system.

.. code-block:: properties

    exchange-manager.name=filesystem
    exchange.base-directories=s3n://exchange-spooling-bucket
    exchange.encryption-enabled=true
    exchange.s3.region=us-west-1
    exchange.s3.aws-access-key=example-access-key
    exchange.s3.aws-secret-key=example-secret-key

You can configure multiple S3 buckets for the exchange manager to distribute
spooled data across buckets, reducing the I/O load on any one bucket. If a
query fails with the error message
"software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your
request rate", this indicates that the workload is I/O intensive, and you
should specify multiple S3 buckets in ``exchange.base-directories`` to balance
the load:

.. code-block:: properties

    exchange.base-directories=s3n://exchange-spooling-bucket-1,s3n://exchange-spooling-bucket-2

Local filesystem storage
~~~~~~~~~~~~~~~~~~~~~~~~

The following example ``exchange-manager.properties`` configuration specifies
a local directory, ``/tmp/trino-exchange-manager``, as the spooling storage
destination.
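
A minimal sketch of that file, reusing the ``filesystem`` manager name from
the S3 example above:

.. code-block:: properties

    exchange-manager.name=filesystem
    exchange.base-directories=/tmp/trino-exchange-manager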