From 0b1db372041a5451ec35c5773905960a141a31b6 Mon Sep 17 00:00:00 2001
From: Joe Lodin
Date: Tue, 3 May 2022 17:50:17 -0400
Subject: [PATCH] Add properties, requirements, and best practices from FTE
 blog

---
 .../sphinx/admin/fault-tolerant-execution.rst | 80 ++++++++++++++++---
 .../admin/properties-query-management.rst     | 20 +++++
 2 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/docs/src/main/sphinx/admin/fault-tolerant-execution.rst b/docs/src/main/sphinx/admin/fault-tolerant-execution.rst
index eb2d6d997266..381473f3978f 100644
--- a/docs/src/main/sphinx/admin/fault-tolerant-execution.rst
+++ b/docs/src/main/sphinx/admin/fault-tolerant-execution.rst
@@ -49,9 +49,14 @@ execution on a Trino cluster:
   * - ``exchange.deduplication-buffer-size``
     - Size of the coordinator's in-memory buffer used by fault-tolerant
       execution to store output of query :ref:`stages <trino-concept-stage>`.
-      If this buffer is filled during query execution, the query fails unless
+      If this buffer is filled during query execution, the query fails with a
+      "Task descriptor storage capacity has been exceeded" error message unless
       an :ref:`exchange manager <fte-exchange-manager>` is configured.
     - ``32MB``
+  * - ``exchange.compression-enabled``
+    - Enable compression of spooling data. Setting to ``true`` is recommended
+      when using an :ref:`exchange manager <fte-exchange-manager>`.
+    - ``false``
 
 .. _fte-retry-policy:
 
@@ -91,11 +96,28 @@ recommended when executing large batch queries, as the cluster can more
 efficiently retry smaller tasks within the query rather than retry the whole
 query.
 
-``TASK`` retry policy requires a configured :ref:`exchange manager
-<fte-exchange-manager>` to store spooled exchange data used for each task. It is
-also strongly recommended to set the ``query.low-memory-killer.policy``
-configuration property to ``total-reservation-on-blocked-nodes``, or queries may
-need to be manually killed if the cluster runs out of memory.
+The following cluster configuration changes are recommended to improve
+fault-tolerant execution with a ``TASK`` retry policy:
+
+* Set the ``query.low-memory-killer.policy``
+  :doc:`query management property </admin/properties-query-management>` to
+  ``total-reservation-on-blocked-nodes``, or queries may
+  need to be manually killed if the cluster runs out of memory.
+* Set the ``query.low-memory-killer.delay``
+  :doc:`query management property </admin/properties-query-management>` to
+  ``0s`` so the cluster immediately unblocks nodes that run out of memory.
+* Modify the ``query.remote-task.max-error-duration``
+  :doc:`query management property </admin/properties-query-management>`
+  to adjust how long Trino allows a remote task to try reconnecting before
+  considering it lost and rescheduling, as shown in the example below.
+
+.. note::
+
+   A ``TASK`` retry policy is best suited for large batch queries, but this
+   policy can result in higher latency for short-running queries executed in
+   high volume. As a best practice, it is recommended to run a dedicated
+   cluster with a ``TASK`` retry policy for large batch queries, separate from
+   another cluster that handles short queries.
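+
+For reference, a ``config.properties`` file that follows these recommendations
+on a fault-tolerant cluster may look similar to the following sketch. The
+``query.remote-task.max-error-duration`` value is illustrative only, not a
+recommended setting:
+
+.. code-block:: properties
+
+   # Enable task-level retries; requires a configured exchange manager
+   retry-policy=TASK
+   # Kill the largest query on out-of-memory nodes instead of requiring
+   # manual intervention
+   query.low-memory-killer.policy=total-reservation-on-blocked-nodes
+   # Unblock nodes that run out of memory immediately
+   query.low-memory-killer.delay=0s
+   # Illustrative value: allow a remote task one minute to reconnect
+   # before it is considered lost and rescheduled
+   query.remote-task.max-error-duration=1m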
 
 Advanced configuration
 ----------------------
@@ -264,6 +286,17 @@ fault-tolerant execution:
       reschedule tasks in case of a failure.
     - (JVM heap size * 0.15)
     - Only ``TASK``
+  * - ``fault-tolerant-execution-partition-count``
+    - Number of partitions to use for distributed joins and aggregations,
+      similar in function to the ``query.hash-partition-count`` :doc:`query
+      management property </admin/properties-query-management>`. It is not
+      recommended to increase this property value above the default of ``50``,
+      as higher values may result in instability and poor performance. May be
+      overridden for the current session with the
+      ``fault_tolerant_execution_partition_count`` :ref:`session property
+      <session-properties-definition>`.
+    - ``50``
+    - Only ``TASK``
   * - ``max-tasks-waiting-for-node-per-stage``
     - Allow for up to configured number of tasks to wait for node allocation
       per stage, before pausing scheduling for other tasks from this stage.
@@ -280,6 +313,9 @@ fault-tolerant execution.
 You can configure a filesystem-based exchange manager that stores spooled
 data in a specified location, either an S3-compatible storage system or a
 local filesystem.
 
+Configuration
+^^^^^^^^^^^^^
+
 To configure an exchange manager, create a new
 ``etc/exchange-manager.properties`` configuration file on the coordinator
 and all worker nodes. In this file, set the ``exchange-manager.name`` configuration
@@ -294,8 +330,8 @@ for your storage solution.
     - Description
     - Default value
   * - ``exchange.base-directories``
-    - The base directory URI location that the exchange manager uses to store
-      spooling data. Only supports S3 and local filesystems.
+    - Comma-separated list of URI locations that the exchange manager uses to
+      store spooling data. Only supports S3 and local filesystems.
     -
   * - ``exchange.encryption-enabled``
     - Enable encrypting of spooling data.
@@ -339,6 +375,17 @@ for your storage solution.
     - Part size for S3 multi-part upload.
     - ``5MB``
 
+It is also recommended to set the ``exchange.compression-enabled`` property to
+``true`` in the cluster's ``config.properties`` file, to reduce the exchange
+manager's overall I/O load.
+
+S3 bucket storage
+~~~~~~~~~~~~~~~~~
+
+If using an AWS S3 bucket, it is recommended to configure a bucket lifecycle
+rule in AWS to automatically expire abandoned objects in the event of a node
+crash.
+
 The following example ``exchange-manager.properties`` configuration specifies an
 AWS S3 bucket as the spooling storage destination. Note that the destination
@@ -346,12 +393,27 @@ does not have to be in AWS, but can be any S3-compatible storage system.
 
 .. code-block:: properties
 
     exchange-manager.name=filesystem
-    exchange.base-directories=s3n://trino-exchange-manager
+    exchange.base-directories=s3n://exchange-spooling-bucket
     exchange.encryption-enabled=true
     exchange.s3.region=us-west-1
     exchange.s3.aws-access-key=example-access-key
     exchange.s3.aws-secret-key=example-secret-key
 
+You can configure multiple S3 buckets for the exchange manager to distribute
+spooled data across buckets, reducing the I/O load on any one bucket. If a
+query fails with the error message
+"software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your
+request rate", this indicates that the workload is I/O intensive, and you
+should specify multiple S3 buckets in ``exchange.base-directories`` to balance
+the load:
+
+.. code-block:: properties
+
+   exchange.base-directories=s3n://exchange-spooling-bucket-1,s3n://exchange-spooling-bucket-2
+
+Local filesystem storage
+~~~~~~~~~~~~~~~~~~~~~~~~
+
 The following example ``exchange-manager.properties`` configuration specifies a
 local directory, ``/tmp/trino-exchange-manager``, as the spooling storage
 destination.
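+
+Based on the properties documented above, a minimal sketch of such a
+configuration may look like the following:
+
+.. code-block:: properties
+
+   exchange-manager.name=filesystem
+   # Spooled exchange data is written to this local directory
+   exchange.base-directories=/tmp/trino-exchange-manager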
diff --git a/docs/src/main/sphinx/admin/properties-query-management.rst b/docs/src/main/sphinx/admin/properties-query-management.rst
index 62ca74a98899..86a48b4f6469 100644
--- a/docs/src/main/sphinx/admin/properties-query-management.rst
+++ b/docs/src/main/sphinx/admin/properties-query-management.rst
@@ -46,6 +46,16 @@ memory availability. Supports the following values:
 
 * ``total-reservation-on-blocked-nodes`` - Kill the query currently using
   the most memory specifically on nodes that are now out of memory.
 
+``query.low-memory-killer.delay``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* **Type:** :ref:`prop-type-duration`
+* **Default value:** ``5m``
+
+The amount of time a query is allowed to recover after running out of memory
+before it is killed, if ``query.low-memory-killer.policy`` is set to
+``total-reservation`` or ``total-reservation-on-blocked-nodes``.
+
 ``query.max-execution-time``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -118,6 +128,16 @@ The minimal age of a query in the history before it is expired. An expired
 query is removed from the query history buffer and no longer available in
 the :doc:`/admin/web-interface`.
 
+``query.remote-task.max-error-duration``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* **Type:** :ref:`prop-type-duration`
+* **Default value:** ``5m``
+
+Timeout value for remote tasks that fail to communicate with the coordinator.
+If the coordinator is unable to receive updates from a remote task before this
+timeout is reached, the coordinator treats the task as failed.
+
 ``retry-policy``
 ^^^^^^^^^^^^^^^^