Configure fault tolerance for your cluster#
You can configure SEP to be more resilient against query failure by enabling fault-tolerant execution. This allows SEP to handle larger queries such as batch operations without worker node interruptions causing the query to fail.
When configured, the SEP cluster buffers data used by workers during query processing. If a worker node fails for any reason, such as a network outage or running out of available resources, the coordinator issues the data to an available worker that can pick up query processing, allowing the query to continue.
Architecture#
The coordinator node uses a configured exchange manager service that buffers data for query processing in an external location such as an S3 bucket. Worker nodes send data to the buffer as they execute their query tasks.
Best practices and considerations#
A fault-tolerant cluster is best suited for large batch queries. A high volume of short-running queries may experience increased latency under fault-tolerant execution. For this reason, run a dedicated fault-tolerant SEP cluster for batch operations, separate from any cluster that handles a higher query volume.
Support for fault-tolerant execution of SQL statements varies on a per-connector basis, with more details in the documentation for each connector. See Fault-tolerant execution for a table detailing per-connector support.
Catalogs for other data sources only support fault-tolerant execution of read operations. The Starburst Teradata Direct connector only supports fault-tolerant execution of read operations with a QUERY retry policy.
When fault-tolerant execution is enabled on a cluster, write operations fail for any catalogs that do not support fault-tolerant execution of those operations.
The exchange manager may send a large amount of data to the exchange storage, resulting in high I/O load on that storage. You can configure multiple storage locations for use by the exchange manager to help balance the I/O load between them.
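To illustrate the multiple-location setup described above, the following sketch lists two exchange directories in a single property. The bucket names are placeholders, and the comma-separated form follows the configuration example in the Configuration section of this page.

```properties
# exchange-manager.properties
# Placeholder bucket names; replace them with your own exchange storage locations.
exchange-manager.name=filesystem
exchange.base-directories=s3://exchange-spooling-bucket-1,s3://exchange-spooling-bucket-2
```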
Configuration#
The following steps describe how to configure a SEP cluster for fault-tolerant execution with an S3 bucket-based exchange:
Set up an S3 bucket to use as the exchange storage. This example uses an AWS S3 bucket; other storage options are described in the reference documentation. You can use multiple S3 buckets for exchange storage.
For each bucket in AWS, collect the following information:
S3 URI location for the bucket, such as s3://exchange-spooling-bucket
Region that the bucket is located in, such as us-west-1
AWS access and secret keys for the bucket
Add the following exchange manager configuration to both the coordinator.etcFiles.properties and worker.etcFiles.properties sections of the Helm chart in a new exchange-manager.properties entry, using the gathered S3 bucket information:

    coordinator:
      etcFiles:
        properties:
          ...
          exchange-manager.properties: |
            exchange-manager.name=filesystem
            exchange.base-directories=s3://exchange-spooling-bucket-1,s3://exchange-spooling-bucket-2
            exchange.s3.region=us-west-1
            exchange.s3.aws-access-key=example-access-key
            exchange.s3.aws-secret-key=example-secret-key
    ...
    worker:
      etcFiles:
        properties:
          ...
          exchange-manager.properties: |
            exchange-manager.name=filesystem
            exchange.base-directories=s3://exchange-spooling-bucket-1,s3://exchange-spooling-bucket-2
            exchange.s3.region=us-west-1
            exchange.s3.aws-access-key=example-access-key
            exchange.s3.aws-secret-key=example-secret-key

In non-Kubernetes installations, the same properties must be defined in the etc/exchange-manager.properties configuration file on the coordinator and all worker nodes.
Add the following configuration for fault-tolerant execution to both the coordinator.additionalProperties and worker.additionalProperties sections of the Helm chart:

    coordinator:
      additionalProperties:
        ...
        retry-policy=TASK
        exchange.compression-enabled=true
        query.low-memory-killer.delay=0s
    ...
    worker:
      additionalProperties:
        ...
        retry-policy=TASK
        exchange.compression-enabled=true
        query.low-memory-killer.delay=0s

In non-Kubernetes installations, the same properties must be defined in the config.properties file on the coordinator and all worker nodes.
Re-deploy your instance of SEP or, for non-Kubernetes installations, restart the cluster.
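For a non-Kubernetes installation, the steps above amount to changes in two properties files on the coordinator and every worker node. The following is a minimal sketch that reuses the placeholder bucket names, region, and keys from the Helm example. The etc/ location for config.properties assumes the default installation layout, so adjust the paths if your deployment differs.

```properties
# etc/exchange-manager.properties
# Placeholder values; substitute your own buckets, region, and credentials.
exchange-manager.name=filesystem
exchange.base-directories=s3://exchange-spooling-bucket-1,s3://exchange-spooling-bucket-2
exchange.s3.region=us-west-1
exchange.s3.aws-access-key=example-access-key
exchange.s3.aws-secret-key=example-secret-key
```

```properties
# etc/config.properties
# Append these lines to the existing file; keep all other settings unchanged.
retry-policy=TASK
exchange.compression-enabled=true
query.low-memory-killer.delay=0s
```

Both files must be identical across the coordinator and all workers, and the cluster must be restarted for the changes to take effect.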
Your SEP cluster is now configured with fault-tolerant query execution. If a query on the cluster would normally fail because query processing is interrupted, fault-tolerant execution now resumes processing so that the query can complete.
Next steps#
For more information about fault-tolerant execution, including simple query retries that do not require an exchange manager and advanced configuration options, see the reference documentation.