Elasticsearch connector#

The Elasticsearch connector lets you access Elasticsearch data from SEP. This document describes how to configure a catalog with the Elasticsearch connector to run SQL queries against Elasticsearch.

SEP includes additional enterprise features that are built on top of the existing Trino connector functionality. For more information on connector key feature differences between Trino and SEP, see the connectors feature matrix.

Requirements#

  • Elasticsearch 7.x or 8.x

  • Network access from the SEP coordinator and workers to the Elasticsearch nodes.

  • A valid Starburst Enterprise license.

Configuration#

To configure the Elasticsearch connector, create a catalog properties file that specifies the Elasticsearch connector by setting the connector.name to elasticsearch.

For example, to access a database as the example catalog, create the file etc/catalog/example.properties. Replace the connection properties as appropriate for your setup:

connector.name=elasticsearch
elasticsearch.host=localhost
elasticsearch.port=9200
elasticsearch.default-schema-name=default

Connection security#

If your cluster has globally-trusted certificates, you should only need to enable TLS. If you require custom configuration for certificates, the connector supports key stores and trust stores in PEM or Java Key Store (JKS) format.

The connector provides additional security options to connect to Elasticsearch clusters with TLS enabled.

The available configuration values are listed in the following table:

TLS Security Properties#

Property name

Description

elasticsearch.tls.enabled

Enables TLS security.

elasticsearch.tls.keystore-path

The path to the PEM or JKS key store.

elasticsearch.tls.truststore-path

The path to PEM or JKS trust store.

elasticsearch.tls.keystore-password

The key password for the key store specified by elasticsearch.tls.keystore-path.

elasticsearch.tls.truststore-password

The key password for the trust store specified by elasticsearch.tls.truststore-path.

elasticsearch.tls.verify-hostnames

Flag to determine if the hostnames in the certificates must be verified. Defaults to true.

Data source authentication#

The connector can provide credentials for the data source connection with either AWS or password authentication.

AWS authentication:

To enable AWS authentication and authorization using IAM policies, set the elasticsearch.security configuration property to AWS. In addition, configure the following properties in your catalog configuration file:

Property name

Description

elasticsearch.aws.region

AWS region of the Elasticsearch endpoint. This is required.

elasticsearch.aws.access-key

AWS access key to use to connect to the Elasticsearch domain. If not set, the default AWS credentials provider chain is used.

elasticsearch.aws.secret-key

AWS secret key to use to connect to the Elasticsearch domain. If not set, the default AWS credentials provider chain is used.

elasticsearch.aws.iam-role

Optional ARN of an IAM role to assume, then connect to Elasticsearch. Note that the configured IAM user must be able to assume this role.

elasticsearch.aws.external-id

Optional external ID to pass while assuming an AWS IAM role.

Password authentication:

To enable password authentication, set the elasticsearch.security configuration property to PASSWORD. In addition, configure the following properties in your catalog configuration file:

Property name

Description

elasticsearch.auth.user

User name to use to connect to Elasticsearch.

elasticsearch.auth.password

Password to use to connect to Elasticsearch.

General configuration properties#

The following table describes general catalog configuration properties for the connector:

Elasticsearch configuration properties#

Property name

Description

Default

elasticsearch.host

A comma-separated list of host names for the Elasticsearch node to connect to. This property is required.

elasticsearch.port

Port to use to connect to Elasticsearch.

9200

elasticsearch.default-schema-name

The schema that contains all tables defined without a qualifying schema name.

default

elasticsearch.scroll-size

Sets the maximum number of hits that can be returned with each Elasticsearch scroll request.

1000

elasticsearch.scroll-timeout

Duration for Elasticsearch to keep the search context alive for scroll requests.

1m

elasticsearch.request-timeout

Timeout duration for all Elasticsearch requests.

10s

elasticsearch.connect-timeout

Timeout duration for all Elasticsearch connection attempts.

1s

elasticsearch.backoff-init-delay

The minimum duration between backpressure retry attempts for a single request to Elasticsearch. Setting it too low can overwhelm an already struggling cluster.

500ms

elasticsearch.backoff-max-delay

The maximum duration between backpressure retry attempts for a single request to Elasticsearch.

20s

elasticsearch.max-retry-time

The maximum duration across all retry attempts for a single request to Elasticsearch.

30s

elasticsearch.node-refresh-interval

Duration between requests to refresh the list of available Elasticsearch nodes.

1m

elasticsearch.ignore-publish-address

Disable using the address published by the Elasticsearch API to connect for queries. Some deployments map Elasticsearch ports to a random public port and enabling this property can help in these cases.

false

Type mapping#

Because Trino and Elasticsearch each support types that the other does not, this connector maps some types when reading data.

Elasticsearch type to Trino type mapping#

The connector maps Elasticsearch types to the corresponding Trino types according to the following table:

Elasticsearch type to Trino type mapping#

Elasticsearch type

Trino type

Notes

BOOLEAN

BOOLEAN

DOUBLE

DOUBLE

FLOAT

REAL

BYTE

TINYINT

SHORT

SMALLINT

INTEGER

INTEGER

LONG

BIGINT

KEYWORD

VARCHAR

TEXT

VARCHAR

DATE

TIMESTAMP

For more information, see Date types.

IPADDRESS

IP

No other types are supported.

Array types#

Fields in Elasticsearch can contain zero or more values, but there is no dedicated array type. To indicate a field contains an array, it can be annotated in a Trino-specific structure in the _meta section of the index mapping.

For example, you can have an Elasticsearch index that contains documents with the following structure:

{
    "array_string_field": ["trino","the","lean","machine-ohs"],
    "long_field": 314159265359,
    "id_field": "564e6982-88ee-4498-aa98-df9e3f6b6109",
    "timestamp_field": "1987-09-17T06:22:48.000Z",
    "object_field": {
        "array_int_field": [86,75,309],
        "int_field": 2
    }
}

The array fields of this structure can be defined by using the following command to add the field property definition to the _meta.trino property of the target index mapping with Elasticsearch available at search.example.com:9200:

curl --request PUT \
    --url search.example.com:9200/doc/_mapping \
    --header 'content-type: application/json' \
    --data '
{
    "_meta": {
        "trino":{
            "array_string_field":{
                "isArray":true
            },
            "object_field":{
                "array_int_field":{
                    "isArray":true
                }
            },
        }
    }
}'

Note

You cannot use the asRawJson and isArray flags simultaneously for the same column.

Date types#

The Elasticsearch connector supports only the default date type. All other date formats including built-in date formats and custom date formats are not supported. Dates with the format property are ignored.

Raw JSON transform#

Documents in Elasticsearch can include more complex structures that are not represented in the mapping. For example, a single keyword field can have varying content such as a single keyword value, an array, or a multidimensional keyword array with any level of nesting.

The following command configures array_string_field mapping with Elasticsearch available at search.example.com:9200:

curl --request PUT \
    --url search.example.com:9200/doc/_mapping \
    --header 'content-type: application/json' \
    --data '
{
    "properties": {
        "array_string_field":{
            "type": "keyword"
        }
    }
}'

All the following documents are legal for Elasticsearch with array_string_field mapping:

[
    {
        "array_string_field": "trino"
    },
    {
        "array_string_field": ["trino","is","the","best"]
    },
    {
        "array_string_field": ["trino",["is","the","best"]]
    },
    {
        "array_string_field": ["trino",["is",["the","best"]]]
    }
]

For more information, see the Elasticsearch array documentation.

Elasticsearch supports types, such as dense_vector, that are not supported in SEP and may cause parsing exceptions for users.

To mitigate unwanted exceptions, you can transform fields to raw JSON by annotating it in a Trino-specific structure in the _meta section of the index mapping. This indicates to SEP that the field, and all nested fields beneath, need to be cast to a VARCHAR field that contains the raw JSON content. These fields can be defined by using the following command to add the field property definition to the _meta.trino property of the target index mapping.

curl --request PUT \
    --url search.example.com:9200/doc/_mapping \
    --header 'content-type: application/json' \
    --data '
{
    "_meta": {
        "trino":{
            "array_string_field":{
                "asRawJson":true
            }
        }
    }
}'

The preceding configuration causes Trino to return the array_string_field field as a VARCHAR containing raw JSON. You can parse these fields with the built-in JSON functions.

Note

You cannot use the asRawJson and isArray flags simultaneously for the same column.

Special columns#

The following hidden columns are available:

Column

Description

_id

The Elasticsearch document ID.

_score

The document score returned by the Elasticsearch query.

_source

The source of the original document.

Full text queries#

SEP SQL queries can be combined with Elasticsearch queries by providing the full text query as part of the table name, separated by a colon. For example:

SELECT * FROM "tweets: +trino SQL^2"

SQL support#

The connector provides globally available and read operation statements to access data and metadata in the Elasticsearch catalog. The connector supports the following features:

Table functions#

The connector provides specific table functions to access Elasticsearch.

raw_query(varchar) -> table#

The raw_query function lets you query the underlying database directly. This function requires Elastic Query DSL syntax. The full DSL query is pushed down and processed in Elasticsearch. This can be useful for accessing native features which are not available in SEP or for improving query performance in situations where running a query natively may be faster.

The native query passed to the underlying data source is required to return a table as a result set. Only the data source performs validation or security checks for these queries using its own configuration. Trino does not perform these tasks. Only use passthrough queries to read data.

The raw_query function requires three parameters:

  • schema: The schema in the catalog that the query is to be executed on.

  • index: The index in Elasticsearch to be searched.

  • query: The query to execute, written in Elastic Query DSL.

Once executed, the query returns a single row containing the resulting JSON payload returned by Elasticsearch.

For example, query the example catalog and use the raw_query table function to search for documents in the orders index where the country name is ALGERIA as defined as a JSON-formatted query matcher and passed to the raw_query table function in the query parameter:

SELECT
  *
FROM
  TABLE(
    example.system.raw_query(
      schema => 'sales',
      index => 'orders',
      query => '{
        "query": {
          "match": {
            "name": "ALGERIA"
          }
        }
      }'
    )
  );

Note

The query engine does not preserve the order of the results of this function. If the passed query contains an ORDER BY clause, the function result may not be ordered as expected.

Performance#

The connector includes a number of performance features, detailed in the following sections.

Parallelism#

The connector supports parallel data requests to the Elasticsearch cluster for query processing by splitting the data into independent partitions, known as shards. These shards can be processed concurrently, allowing multiple connections to query and retrieve data in parallel.

Predicate pushdown#

The connector supports predicate pushdown for the following data types:

Elasticsearch

Trino

boolean

BOOLEAN

double

DOUBLE

float

REAL

byte

TINYINT

short

SMALLINT

integer

INTEGER

long

BIGINT

keyword

VARCHAR

date

TIMESTAMP

No other data types are supported for predicate push down.

Security#

The connector includes the following security-related feature.

Kerberos authentication#

The connector supports Kerberos authentication using a keytab.

To configure Kerberos authentication, add the following catalog configuration properties to the catalog properties file:

elasticsearch.security=KERBEROS
kerberos.client.principal=example@example.com
kerberos.client.keytab=etc/kerberos/example.keytab
kerberos.config=etc/kerberos/krb5.conf

In this configuration, the user example@example.com connects to the database. The related Kerberos service ticket is located in the etc/kerberos/example.keytab file defined in the kerberos.client.principal property.