Starburst Warp Speed#

Starburst Warp Speed transparently adds an indexing and caching layer to enable higher performance for catalogs using the Hive, Iceberg, or Delta Lake connectors.

Note

Learn more about high-level architecture and characteristics of Starburst Warp Speed before using it.

Requirements#

To use Starburst Warp Speed, you need supported object storage and a suitable cluster configuration, as described in the following sections.

Supported object storage#

Starburst Warp Speed only supports the following object storage:

  • Amazon S3

  • Google Cloud Storage (GCS)

  • Azure Data Lake Storage (ADLS) Gen2

No other object storage systems are supported, including on-premises storage and S3-compatible storage such as MinIO.

Cluster configuration#

Starburst Warp Speed requires your cluster to operate on a Kubernetes-based platform. Specifically, the Starburst Enterprise with Kubernetes requirements apply.

In addition, Starburst Warp Speed has specific node requirements for CPU and memory. Most importantly, sufficiently performant and sized Non-Volatile Memory Express (NVMe) solid-state drive (SSD) storage must be available on all nodes, and used exclusively by SEP.

The following specific details apply for the supported platforms:

EKS

Suitable node sizes:

  • m6gd.4xlarge or larger

  • r5d.4xlarge or larger

  • r5dn.4xlarge or larger

  • r6gd.4xlarge or larger

  • i3.4xlarge or larger

Create an EKS Managed Node Group with a specific size, and use it for all nodes in the cluster.

Include a bootstrap script when creating the node groups to create an /opt/data directory that serves as a mount point for the SSD disks where the index and cache elements are stored. The /opt/data directory can be changed based on preference, and must match the value of localStorageMountPath in the values.yaml file as described in the All platforms section.

If you are using eksctl to create the cluster, you must embed the following script in the managedNodeGroups.preBootstrapCommands section:

preBootstrapCommands:
  - "yum install -y mdadm"
  - "sysctl -w fs.aio-max-nr=8388608 >> /etc/sysctl.conf"
  - "devices=; for device in $(ls /sys/block/); do if [[ $(grep -e 'Amazon EC2 NVMe Instance Storage' -e ec2-nvme-instance /sys/block/${device}/device/subsysnqn -c 2> /dev/null) -gt 0 ]]; then devices=\"${devices} /dev/${device}\"; fi; done; echo ${devices} > /tmp/devices"
  - "mdadm --create /dev/md0 $(cat /tmp/devices) --level=0 --force --raid-devices=$(cat /tmp/devices | wc -w)"
  - "mkfs.ext4 /dev/md0 -O ^has_journal"
  - "mkdir -p /opt/data"
  - "mount /dev/md0 /opt/data"
  - "chmod 777 -R /opt/data"

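For context, the following minimal sketch shows where these commands land in an eksctl cluster definition; the node group name, instance type, and capacity are illustrative assumptions:

managedNodeGroups:
  - name: warp-speed-workers      # illustrative node group name
    instanceType: r5d.4xlarge     # one of the suitable node sizes listed above
    desiredCapacity: 4
    preBootstrapCommands:
      - "yum install -y mdadm"
      # ...remaining commands from the script above
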
AKS

Suitable node sizes:

  • Standard_L16s_v2 or larger with SSDs attached

In the Starburst Warp Speed section of values.yaml, set the following configuration:

warpSpeed:
  image:
    tag: "a.b.c-azure"

GKE

Suitable node sizes:

Attach the SSDs during node pool creation, and use the pool for the cluster creation.

Use the ephemeral-storage-local-ssd option of the gcloud CLI to provision local SSDs on the cluster. Select an even number of workers.
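
For example, a minimal sketch of a cluster creation call; the cluster name, machine type, SSD count, and node count are illustrative assumptions:

gcloud container clusters create example-cluster \
  --machine-type=n2-standard-16 \
  --ephemeral-storage-local-ssd count=2 \
  --num-nodes=4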

Starburst Warp Speed catalogs using GCS must include the following catalog configuration property to update the SSD utilization threshold to 80%:

warp-speed.file-system-reserve-percentage=80

All platforms

Configuration considerations:

  • The task.max-worker-threads task property cannot be changed with Starburst Warp Speed, so it must be left at the default value.

  • SEP clusters support a maximum of 10 Starburst Warp Speed catalogs.

Warning

Starburst Warp Speed is not supported on a cluster running in Fault-tolerant execution mode.

Deployment and management are performed with the SEP Helm charts detailed in Deploying with Kubernetes and Configuring Starburst Enterprise in Kubernetes.

Add the following section in your values file for the specific cluster to enable Starburst Warp Speed:

warpSpeed:
  enabled: true

By default, it is disabled. The recommended memory allocation is automatically configured. You can optionally adjust the percentage of additional heap memory allocated for Starburst Warp Speed:

warpSpeed:
  enabled: true
  additionalHeapSizePercentage: 15

Starburst Warp Speed uses a filesystem on top of the underlying storage so that it does not require privileged mode. Add the following filesystem and image configuration to your values file:

warpSpeed:
  fileSystem:
    # The path for the mount point used to mount the local SSDs. Value differs
    # between clouds:
    # AWS, Azure - Defined by the user
    # GCP - Must be set to /mnt/stateful_partition/kube-ephemeral-ssd
    localStorageMountPath: /opt/data
  image:
    # Image that prepares the filesystem for Starburst Warp Speed.
    # Due to system limitations, this must be done by an init container running
    # in privileged mode.
    repository: "harbor.starburstdata.net/starburstdata/starburst-warpspeed-init"
    # Tag value differs between clouds:
    # AWS, GCP - Use default tag
    # Azure - Append the "azure" suffix, for example "1.0.0-azure"
    tag: "1.0.7"
    pullPolicy: "IfNotPresent"

Ensure that you use a dedicated coordinator that is not scheduled for query processing, and adjust the query processing configuration to allow for more splits:

coordinator:
  additionalProperties: |
    ...
    node-scheduler.include-coordinator=false
    node-scheduler.max-splits-per-node=4096
    node-scheduler.max-unacknowledged-splits-per-task=1024
    node-scheduler.max-adjusted-pending-splits-per-task=1024

Use Helm to update the values and restart the cluster nodes. Confirm the cluster is operating correctly with the new configuration, but without any adjusted catalogs, and then proceed to configure catalogs.
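
For example, a minimal sketch of such an update, assuming the release is named sep and the chart is installed from a repository registered as starburstdata:

helm upgrade sep starburstdata/starburst-enterprise \
  --values ./values.yaml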

For catalog usage, the cluster must allow internal communication between all workers, as well as with the coordinator, on all the HTTP ports configured by the different http-rest-port values in the catalogs.

When starting the cluster, Starburst Warp Speed parses all configuration parameters and can emit spurious warnings such as Configuration property 'cache-service.password' was not used. You can safely ignore these warnings.

Catalog configuration#

After a successful Cluster configuration, you can configure the desired catalogs to use Starburst Warp Speed.

Only catalogs using the Hive, Iceberg, or Delta Lake connectors can be accelerated:

  • connector.name=hive

  • connector.name=iceberg

  • connector.name=delta_lake

For more details, see Delta Lake considerations, Iceberg considerations, and Hive considerations.

Only catalogs backed by S3, GCS, and ADLS object storage are supported. For more details, see S3 considerations, GCS considerations, and ADLS considerations.

The following example shows a catalog in the values file that uses the Hive connector with AWS Glue:

catalogs:
  example: |
    connector.name=hive
    hive.metastore=glue
    ...

Enable Starburst Warp Speed on the catalog by updating the connector name to warp_speed and adding the required configuration properties:

catalogs:
  example: |
    connector.name=warp_speed
    warp-speed.proxied-connector=hive
    warp-speed.cluster-uuid=example-cluster-567891234567
    warp-speed.config.internal-communication.shared-secret=aLongSecretString
    hive.metastore=glue
    ...

The properties setting the connector name, the proxied connector, and the cluster identifier are required.

The shared secret must be set to the same value as the secret for the cluster itself set in sharedSecret:. This is required unless the REST API is disabled.
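
For reference, the cluster-level shared secret is set at the top level of the values file. A minimal sketch, with an illustrative secret value:

sharedSecret: aLongSecretString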

For testing purposes, or to use a new catalog name such as faster permanently alongside the existing catalog, you can copy the configuration of a catalog and update it:

catalogs:
  example: |
    connector.name=hive
    hive.metastore=glue
    ...
  faster: |
    connector.name=warp_speed
    warp-speed.proxied-connector=hive
    warp-speed.cluster-uuid=example-cluster-567891234567
    warp-speed.config.internal-communication.shared-secret=aLongSecretString
    hive.metastore=glue
    ...

This allows you to query the same data with or without Starburst Warp Speed using different catalog names. However, existing scripts and statements that include the old catalog name example are not accelerated.

Catalog configuration properties#

The following list describes the available catalog configuration properties:

connector.name

Required. Must be set to warp_speed.

warp-speed.proxied-connector

Required. The type of embedded connector that is used for accessing cold data through Starburst Warp Speed. Defaults to hive. Valid values are hive, iceberg, or delta-lake. All properties supported by these connectors, including metastore, file format, and connector properties, can be used to configure the catalog.

warp-speed.cluster-uuid

Required. Unique identifier of the cluster. Used as the folder name in the store path. Use the same value for all catalogs. If a new cluster is created with the same warp-speed.store.path and warp-speed.cluster-uuid, the warmup rules for index and cache creation are imported into the newly created cluster.

warp-speed.config.internal-communication.shared-secret

Required, unless the REST API is disabled. The shared secret value of the cluster, as configured for secure internal communication in sharedSecret: for the Kubernetes deployment. Use the identical value for the property warp-speed.config.internal-communication.shared-secret in each catalog.

warp-speed.store.path

The optional path of the storage where metadata is managed. By default, metadata is managed in the memory of the workers. You can use s3://, abfs://, or abfss:// as the protocol for remote object storage. Write access privileges are necessary. With a remote object storage configuration, import-export and call-home data are also managed there.

warp-speed.enable.import-export

Enable or disable index and cache resiliency. Defaults to false. Requires an object storage configured with warp-speed.store.path.

warp-speed.workerdb.db.path

The path to store the internal database file. Default path is set to /usr/lib/starburst/warp-speed/workerDB. The option to change the path should only be used if the default path is not suitable for your deployment. Write access privileges are necessary.

warp-speed.use-http-server-port

Specifies to run the REST API on the coordinator in the same server as SEP. Defaults to true. For details, see REST API access.

warp-speed.config.http-rest-port-enabled

Optional parameter to enable the REST API server on the coordinator and each worker for each catalog. Defaults to false. This REST API is only suitable for testing and debugging purposes, since it is not secured. For production, use the default configuration.

Set the property to true and add a unique http-rest-port value for each catalog. Endpoints use the specified port for a catalog, such as 8088, instead of the /ext/{catalogName} section of the context, and are otherwise identical to the secured REST API. Typically the port is not exposed outside the cluster, so users need to forward the port or call the API from inside the cluster.

warp-speed.call-home.enable

Enable pushing logs and metadata to the storage configured at warp-speed.store.path. Defaults to true. If you are not currently using k8s-native logging tools for troubleshooting, be sure to set log.path=<path> so that the server.log file is written there and the call home feature can upload it to the warp-speed.store.path. For details, see logging.

warp-speed.file-system-reserve-percentage

The percentage of the total SSD disk space held in reserve. Defaults to 90.
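
As an illustration, a catalog combining several of these properties might look like the following sketch; the store path bucket is a placeholder:

catalogs:
  example: |
    connector.name=warp_speed
    warp-speed.proxied-connector=hive
    warp-speed.cluster-uuid=example-cluster-567891234567
    warp-speed.config.internal-communication.shared-secret=aLongSecretString
    warp-speed.store.path=s3://example-bucket/warp-speed
    warp-speed.enable.import-export=true
    hive.metastore=glue
    ...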

Hive considerations#

Most configurations of the Hive connector are supported. Additionally, the following considerations apply when using the Hive connector as the proxied connector for Starburst Warp Speed:

For optimal performance, add the following properties to your catalog configuration:

catalogs:
  example: |
    ...
    hive.max-outstanding-splits-size=512MB
    hive.max-initial-splits=0
    hive.max-outstanding-splits=2000
    hive.max-split-size=64MB
    parquet.max-read-block-row-count=1024
    hive.dynamic-filtering.wait-timeout=1s
    ...

Iceberg considerations#

All configurations of the Iceberg connector are supported. Additionally, the following considerations apply when using the Iceberg connector as the proxied connector for Starburst Warp Speed:

  • Materialized views are supported.

  • An associated split is served from object storage and no acceleration occurs when:

For optimal performance, add the following properties to your catalog configuration:

catalogs:
  example: |
    ...
    parquet.max-read-block-row-count=1024
    iceberg.dynamic-filtering.wait-timeout=1s
    ...

Delta Lake considerations#

All configurations of the Delta Lake connector are supported. Additionally, the following considerations apply when using the Delta Lake connector as the proxied connector for Starburst Warp Speed:

  • Materialized views are not supported.

  • An associated split is served from object storage and no acceleration occurs when:

For optimal performance, add the following properties to your catalog configuration:

catalogs:
  example: |
    ...
    delta-lake.max-outstanding-splits-size=512MB
    delta-lake.max-initial-splits=0
    delta-lake.max-outstanding-splits=2000
    delta-lake.max-split-size=64MB
    parquet.max-read-block-row-count=1024
    delta-lake.dynamic-filtering.wait-timeout=1s
    ...

S3 considerations#

Starburst Warp Speed supports Amazon S3 with catalogs using the Hive, Iceberg, and Delta Lake connectors.

Using the s3:// protocol is required.

GCS considerations#

Starburst Warp Speed supports Google Cloud Storage (GCS).

Authentication to GCS can use a JSON key file or an OAuth 2.0 access token configured identically for the Hive, Delta Lake, or Iceberg connector in the catalog properties:

hive.gcs.json-key-file-path=/path/to/gcs_keyfile.json
hive.gcs.use-access-token=false

See the Google Cloud Storage documentation to learn more.

The following OAuth 2.0 access scopes for Google APIs must be attached during GKE node creation to enable the index and cache resiliency:

For more information about authorization, refer to Google Cloud Service accounts documentation.

ADLS considerations#

Starburst Warp Speed supports Microsoft ADLS Gen2. ADLS Gen1 is not supported.

Using the abfs:// or abfss:// protocol is required.

ADLS can be used with catalogs using the Hive and Delta Lake connectors with the following configuration properties to connect to Azure storage:

catalogs:
  faster: |
    ...
    warp-speed.store.path=abfs://<container_name>@<account_name>.dfs.core.windows.net/folder
    hive.azure.abfs-storage-account=<storage_account_name>
    hive.azure.abfs-access-key=xxx
    ...

In addition, warp-speed.call-home.enable must be disabled. You can then secure the connection with TLS by using the abfss:// protocol in the URI syntax.
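
For example, the secured variant of the store path uses the same container and account placeholders:

catalogs:
  faster: |
    ...
    warp-speed.store.path=abfss://<container_name>@<account_name>.dfs.core.windows.net/folder
    ...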

JMX catalog#

The percentage of the SSD storage used by Starburst Warp Speed is displayed in the Insights Overview UI.

This metric relies on a catalog that uses the JMX connector called jmx:

catalogs:
  jmx: |
    connector.name=jmx

Alternatively, you can use the debug tools endpoint of the API.

Database configuration#

Advanced users who configure custom warmup rules with the REST API must use a database to prevent the loss of those rules when the cluster restarts. Operation without a database uses a temporary database.

The following RDBMS are supported:

  • MySQL

  • PostgreSQL

  • Oracle

Each catalog requires a separate database. You must create the database in an external server, and configure the JDBC connection string, including credentials, to the database. Refer to your RDBMS server and the documentation of the JDBC driver for details.

You can secure the connection to the database with username and password authentication:

warp-speed.jdbc.user=<database-user>
warp-speed.jdbc.password=<database-password>

  • database-user: Name of a user on the database with sufficient access to create tables and manage the data.

  • database-password: Password of the user.

You can use secrets to avoid exposing these sensitive values.
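
For example, a minimal sketch using environment variable secrets; the variable names are illustrative and must be provisioned on the cluster:

warp-speed.jdbc.user=${ENV:WARPSPEED_DB_USER}
warp-speed.jdbc.password=${ENV:WARPSPEED_DB_PASSWORD}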

MySQL#

warp-speed.jdbc.connection.url=jdbc:mysql://<host>:<port>/<database>

  • host: The host name of the database server.

  • port: The port used by the database server. MySQL typically uses 3306.

  • database: The name of the database.
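
For example, a complete MySQL connection string with an illustrative host and database name:

warp-speed.jdbc.connection.url=jdbc:mysql://mysql.example.com:3306/warpspeed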

PostgreSQL#

warp-speed.jdbc.connection.url=jdbc:postgresql://<host>:<port>/<database>

  • host: The host name of the database server.

  • port: The port used by the database server. PostgreSQL typically uses 5432.

  • database: The name of the database.

Oracle#

warp-speed.jdbc.connection.url=jdbc:oracle:thin:@<host>:<port>/<schema-name>

  • host: The host name of the database server.

  • port: The port used by the database server. Oracle typically uses 1521.

  • schema-name: The name of the database schema.

Starburst Warp Speed management#

Starburst Warp Speed automatically creates and manages its data based on processed queries, also called the default acceleration.

Additional custom configuration can be applied with the REST API. Persistent storage of this configuration requires a configured database.

The Index and cache usage tab provides summary statistics that indicate performance gains from using Starburst Warp Speed.

REST API access#

The Starburst Warp Speed REST API is available on the coordinator with a separate context for each catalog, on the same port and domain as the SEP web UI, the Starburst Enterprise REST API, and the Trino REST API.

Access to the REST API is controlled by authentication and authorization identical to the Starburst Enterprise REST API.

The REST API is enabled by default, and can be disabled with warp-speed.use-http-server-port set to false. The shared secret value of the cluster, configured for secure internal communication, must be set in warp-speed.config.internal-communication.shared-secret for each catalog.

The endpoints path structure includes the name of the catalog, /ext/{catalogName}/{paths: .+}. The following example shows the /ext path to the warmup/warmup-rule-set endpoint for the catalog named faster on a cluster without authentication exposed via HTTP:

curl -X GET 'http://sep.example.com:8080/ext/faster/warmup/warmup-rule-set' \
  -H 'Accept: application/json'

A secured server needs to be accessed via HTTPS and potentially include authentication information:

curl -X GET 'https://sep.example.com/ext/faster/warmup/warmup-rule-set' \
  -H 'Accept: application/json'

REST API overview#

The following sections detail the REST API and available endpoints. The example calls use plain curl calls to the endpoints for the faster catalog on the cluster at sep.example.com using HTTPS and omitting any authentication.

Warming status#

You can determine the status of the warmup for Starburst Warp Speed with a GET operation on the /warming/status endpoint. It reports the warmup progress for splits across workers, and whether warming is currently taking place.

curl -X GET 'https://sep.example.com/ext/faster/warming/status' \
  -H 'Accept: application/json'

Example response:

{"nodesStatus":
  {"172.31.16.98":{"started":22136,"finished":22136},
   "172.31.25.207":{"started":20702,"finished":20702},
   "172.31.19.167":{"started":21116,"finished":21116},
   "172.31.22.28":{"started":20678,"finished":20678}},
 "warming":false}

The response shows that warmup started and finished on four workers, and is currently not in progress.

Debug tools#

The debug-tools endpoint requires an HTTP POST to specify the detailed command with a JSON payload to retrieve the desired data. You can use it to return the storage utilization:

curl -X POST "https://sep.example.com/ext/faster/debug-tools"  \
  -d '{"commandName" : "all","@class" : "io.trino.plugin.warp.execution.debugtools.DebugToolData"}' \
  -H 'Content-Type: application/json'

Example response:

{"coordinator-container":
  {"result":
    {"Storage_capacity":15000000,
     "Allocated 8k pages":1000000,
     "Num used stripes":0
    }
  }
}

Calculate the storage utilization percentage with (Allocated 8k pages / Storage_capacity) * 100. For the example response, that is (1000000 / 15000000) * 100, or approximately 6.7%.

Debug tools are blocked and cannot be used during warming.

Note

As an alternative to the debug tools, you can view the percentage of SSD usage by Starburst Warp Speed in the Insights Overview page. The JMX catalog must be added to the cluster to access this metric.

Row group count#

A row group in Starburst Warp Speed is a collection of index and cache elements that are used to accelerate processing of Trino splits from the SSD storage.

Note

A row group in Starburst Warp Speed is not equivalent to a Parquet row group or an ORC stripe, but a higher level artifact specific to Starburst Warp Speed. It can be related to a specific Parquet row group or ORC stripe but can also represent data from a whole file or more.

The row-group/row-group-count endpoint exposes all currently warmed up columns via an HTTP GET:

curl -X GET "https://sep.example.com/ext/faster/row-group/row-group-count" \
  -H "accept: application/json"

The result is a list of columns keyed by schema.table.column.warmuptype. The value represents the corresponding count of accelerated row groups. Warmup types:

  • WARM_UP_TYPE_DATA: data cache acceleration

  • WARM_UP_TYPE_BASIC: index acceleration

  • WARM_UP_TYPE_LUCENE: text search acceleration

In the following example, 20 row groups of the tripid column of the trips_data table in the trips schema are accelerated with a data cache and an index.

{
  "trips.trips_data.tripid.WARM_UP_TYPE_DATA": 20,
  "trips.trips_data.tripid.WARM_UP_TYPE_BASIC": 20
}

Create a warmup rule#

Use the warmup/warmup-rule-set endpoint with an HTTP POST and a JSON payload to create a warmup rule. You can create a warmup rule at the column or table level. Access to a table or column initiates the creation of index and caching data. Warmup rules can prevent index and cache creation or impact the order in which the index and cache data is removed, when storage limits are reached.

If there are warmup rules defined for both a column and its table, the column rule takes precedence, unless the priority of the column rule is lower than the priority of the table rule.

Column-level warmup rule#

The following example creates a column-level warmup rule for the int_1 column in the aaa table of the tmp schema:

curl -X POST 'https://sep.example.com/ext/faster/warmup/warmup-rule-set' \
  -d '[ { "colNameId": "int_1", "schema": "tmp", "table": "aaa", "warmUpType": "WARM_UP_TYPE_BASIC", "priority": 8, "ttl": "PT720H", "predicates": [ ] } ]' \
  -H 'Content-Type: application/json'

Find more details about the JSON payload in the table Warmup rule properties.

Table-level warmup rule#

The following example creates a table-level warmup rule for the aaa table of the tmp schema:

curl -X POST 'https://sep.example.com/ext/faster/warmup/warmup-rule-set' \
  -d '[ {"column":{"classType":"WildcardColumn","key":"*"},"schema": "tmp","table": "aaa","warmUpType": "WARM_UP_TYPE_BASIC","priority": -1,"ttl": "PT720H","predicates": [ ]} ]' \
  -H 'Content-Type: application/json'

Find more details about the JSON payload in the table Warmup rule properties.

Warmup rule properties#

The following list describes the warmup rule properties:

colNameId

Name of the column to which a warmup rule is attached.

column

Defines the columns to accelerate in a table-level rule. Specify all columns by setting classType to WildcardColumn and key to *, as shown in Table-level warmup rule.

schema

Name of the schema that contains the specified table.

table

Name of the table that contains the specified column.

warmUpType

The materialization type performed on the specified column in the specified table. Valid values are WARM_UP_TYPE_DATA for data cache acceleration, WARM_UP_TYPE_BASIC for index acceleration, and WARM_UP_TYPE_LUCENE for text search acceleration.

priority

Priority for the warmup rule. To ensure a column is accelerated even if storage capacity is exceeded, set the priority as high as 10. Rules with a higher priority take precedence. To ensure a column is never accelerated and prevent data cache or index creation, set to -10. Valid range of values: -10 to 10.

ttl

Duration for which the warmup rule remains active, specified in ISO-8601 duration format (PnDTnHnMnS). Use PT0M to prevent expiration of the rule.

predicates

Defaults to all partitions. Use the JSON array syntax ["example1", "example2"] to limit to specific partitions.
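
For example, the following sketch combines these properties into a rule that prevents acceleration of a column, by pairing a -10 priority with a non-expiring ttl; the schema, table, and column names are illustrative:

curl -X POST 'https://sep.example.com/ext/faster/warmup/warmup-rule-set' \
  -d '[ { "colNameId": "int_1", "schema": "tmp", "table": "aaa", "warmUpType": "WARM_UP_TYPE_BASIC", "priority": -10, "ttl": "PT0M", "predicates": [ ] } ]' \
  -H 'Content-Type: application/json'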

Get all warmup rules#

The warmup/warmup-rule-get endpoint exposes all defined warmup rules via an HTTP GET:

curl -X GET 'https://sep.example.com/ext/faster/warmup/warmup-rule-get' \
  -H 'Accept: application/json'

Response:

{
  "id":186229827,
  "schema":"ride_sharing_dataset",
  "table":"trips_data_big",
  "colNameId":"d_date",
  "column":
    {
      "classType":"RegularColumn",
      "key":"d_date"
     },
  "warmUpType":"WARM_UP_TYPE_BASIC",
  "priority":8.0,
  "ttl":2592000.000000000,
  "predicates":[]
}

Delete a warmup rule#

The warmup/warmup-rule-delete endpoint allows you to delete a warmup rule via an HTTP DELETE. The identifier for the rule is a required parameter, and is available in the id value of the warmup/warmup-rule-get result.

curl -X DELETE 'https://sep.example.com/ext/faster/warmup/warmup-rule-delete' \
  -d '[186229827]' -H 'Accept: application/json'

You can delete multiple rules by using a comma-separated list such as [186229827,186229828] as the parameter.

When you delete a warmup rule, the column's index and cache data is de-prioritized to that of a default acceleration, and is therefore subject to earlier deletion.

SQL support#

All SQL statements and functions supported by the connector used in the accelerated catalog are supported by Starburst Warp Speed.

Starburst Warp Speed supports all data types, including structural data types. All structural data types are accessible, but indexing is only applicable to fields within ROW data types.

For some functions, Starburst Warp Speed does not accelerate filtering operations on columns. For example, the following filtering operation is not accelerated:

SELECT count(*)
FROM catalog.schema.table
WHERE lower(company) = 'starburst';

Starburst Warp Speed indexing accelerates the following functions when used on the left or the right side of the predicate:

  • ceil(x) with REAL and DOUBLE data types

  • is_nan(x) with REAL and DOUBLE data types

  • cast(x as type) with DOUBLE cast to REAL, or any type cast to VARCHAR

  • cast(x as type) with DOUBLE and DECIMAL data types

  • day(d) and day_of_month(d) with DATE and TIMESTAMP data types

  • day_of_year(d) and doy(d) with DATE and TIMESTAMP data types

  • day_of_week(d) and dow(d) with DATE and TIMESTAMP data types

  • year(d) with DATE and TIMESTAMP data types

  • year_of_week(d) and yow(d) with DATE and TIMESTAMP data types

  • week(d) and week_of_year(d) with DATE and TIMESTAMP data types

  • LIKE and NOT LIKE with VARCHAR data type

  • contains(arr_varchar, value) with array of VARCHAR data type

  • substring and substr with VARCHAR data type

  • strpos with BIGINT data type
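
In contrast, a filtering operation like the following sketch can be accelerated by the index for the year(d) function; the table and column names are illustrative:

SELECT count(*)
FROM catalog.schema.table
WHERE year(order_date) = 2023;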

The maximum supported string length for any cached data type is 48000 characters.