Starburst DynamoDB connector#

The Starburst DynamoDB connector allows querying data stored in Amazon DynamoDB.

Requirements#

To connect to DynamoDB, you need:

  • Amazon Web Services (AWS) access credentials with an attached policy that allows reading from DynamoDB.

  • Network access from the coordinator and workers to the AWS region hosting your DynamoDB tables.

  • A valid Starburst Enterprise license.

Configuration#

Create the example catalog with a catalog properties file in etc/catalog named example.properties (replace example with your database name or some other descriptive name of the catalog) with the following contents:

connector.name=dynamodb
dynamodb.aws-access-key=AKIAIOSFODNN7EXAMPLE
dynamodb.aws-secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
dynamodb.aws-region=us-east-1
dynamodb.generate-schema-files=NEVER
dynamodb.schema-directory=/path/to/schema/directory

Specify the connector.name property as dynamodb. Configure the catalog using your AWS access key, secret key, and region for your account. Additional authentication methods are documented in the Security section.

The connector can only access DynamoDB with the access credentials configured in the catalog. If you need to access DynamoDB with different credentials or a different region, configure a separate catalog. See Type mapping and schemas for more information, including the available catalog and session properties.
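Once the catalog is configured, you can query your DynamoDB tables through the catalog's amazondynamodb schema. The following is a sketch of basic usage, assuming a catalog named example and a hypothetical orders table:

SHOW TABLES FROM example.amazondynamodb;

SELECT * FROM example.amazondynamodb.orders LIMIT 10;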

General configuration properties#

The following list describes general catalog configuration properties for the connector:

  • case-insensitive-name-matching: Support case insensitive schema and table names. Defaults to false.

  • case-insensitive-name-matching.cache-ttl: Duration for which case insensitive schema and table names are cached. Defaults to 1m.

  • case-insensitive-name-matching.config-file: Path to a name mapping configuration file in JSON format that allows Trino to disambiguate between schemas and tables with similar names in different cases. Defaults to null.

  • case-insensitive-name-matching.config-file.refresh-period: Frequency with which Trino checks the name matching configuration file for changes. Defaults to 0s (refresh disabled).

  • metadata.cache-ttl: Duration for which metadata, including table and column statistics, is cached. Defaults to 0s (caching disabled).

  • metadata.cache-missing: Cache the fact that metadata, including table and column statistics, is not available. Defaults to false.

  • metadata.schemas.cache-ttl: Duration for which schema metadata is cached. Defaults to the value of metadata.cache-ttl.

  • metadata.tables.cache-ttl: Duration for which table metadata is cached. Defaults to the value of metadata.cache-ttl.

  • metadata.statistics.cache-ttl: Duration for which table statistics are cached. Defaults to the value of metadata.cache-ttl.

  • metadata.cache-maximum-size: Maximum number of objects stored in the metadata cache. Defaults to 10000.

  • write.batch-size: Maximum number of statements in a batched execution. Do not change this setting from the default; non-default values may negatively impact performance. Defaults to 1000.

  • dynamic-filtering.enabled: Push down dynamic filters into JDBC queries. Defaults to true.

  • dynamic-filtering.wait-timeout: Maximum duration for which Trino waits for dynamic filters to be collected from the build side of joins before starting a JDBC query. Using a large timeout can potentially result in more detailed dynamic filters, but it can also increase latency for some queries. Defaults to 20s.

Generate a schema file#

The connector exposes the data in your DynamoDB instance in the amazondynamodb schema of the catalog.

A schema file for a DynamoDB table details the SEP table name, columns, and data types. It additionally specifies which DynamoDB attribute (or nested attribute) the connector uses to populate data for each column.

If you do not specify a schema file, the connector generates one by sampling up to 50 rows from the DynamoDB table. The schema files are in RealSQLDatabase (RSD) format, and the resulting tables are accessible using SQL commands.

The value of the dynamodb.generate-schema-files configuration property dictates when schema files are generated:

  • NEVER (default): A schema file is never generated.

  • ON_USE: A schema file is generated the first time a table is referenced, provided the schema file for the table does not already exist.

  • ON_START: A schema file is generated at connection time for any tables that do not currently have a schema file.
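As a sketch, a catalog that generates schema files the first time each table is accessed might set the following properties; the directory path is illustrative:

dynamodb.generate-schema-files=ON_USE
dynamodb.schema-directory=/etc/starburst/dynamodb-schemas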

Several attributes in the schema file detail the mapping from SQL columns to DynamoDB attributes:

Schema file column definitions#

  • name: The name of the column in the SEP table.

  • xs:type: The SQL type of the SEP column. Supported types are listed below.

  • columnsize: Length of a SQL column that supports variable widths, such as VARCHAR. Defaults to 255.

  • key: true if this column is part of the DynamoDB primary key. Defaults to false.

  • other:keytype: The type of the DynamoDB key. Valid values are HASH or RANGE.

  • other:relativepath: The name of the attribute in DynamoDB relative to any parent map or string set.

  • other:fullpath: The full name of the attribute in DynamoDB. Defaults to the value of name.

  • other:dynamodatatype: The DynamoDB type. Valid values are N, S, SS, or M.

Caution

A breaking change in the schema definition files is introduced with the 365-e release of SEP. If you update from 364-e or an older release to 365-e or newer, all schema definition files need to be regenerated.

Example#

This is an example of a schema file for the TPC-H orders table. The file is named orders.rsd and must be placed in the directory specified by the dynamodb.schema-directory configuration property on every SEP host.

The api:info element definition includes the name of the table, orders, set in the title and other:tableapiname attributes, along with any additional table and schema attributes.

The api:info element contains all columns in the SEP table definition as child elements.

<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.cdata.com/ns/rsbscript/2" xmlns:other="http://apiscript.com/ns?v1">
  <api:info title="orders" other:catalog="CData" other:schema="AmazonDynamoDB" description="orders" other:tableapiname="orders" other:version="20">
    <attr name="orderkey"      xs:type="bigint"                  other:relativepath="orderkey"      other:fullpath="orderkey"      other:dynamodatatype="N" key="true" other:keytype="HASH" />
    <attr name="custkey"       xs:type="bigint"                  other:relativepath="custkey"       other:fullpath="custkey"       other:dynamodatatype="N" />
    <attr name="orderstatus"   xs:type="string"  columnsize="1"  other:relativepath="orderstatus"   other:fullpath="orderstatus"   other:dynamodatatype="S" />
    <attr name="totalprice"    xs:type="double"                  other:relativepath="totalprice"    other:fullpath="totalprice"    other:dynamodatatype="N" />
    <attr name="orderdate"     xs:type="date"                    other:relativepath="orderdate"     other:fullpath="orderdate"     other:dynamodatatype="S" />
    <attr name="orderpriority" xs:type="string"  columnsize="15" other:relativepath="orderpriority" other:fullpath="orderpriority" other:dynamodatatype="S" />
    <attr name="clerk"         xs:type="string"  columnsize="15" other:relativepath="clerk"         other:fullpath="clerk"         other:dynamodatatype="S" />
    <attr name="shippriority"  xs:type="integer"                 other:relativepath="shippriority"  other:fullpath="shippriority"  other:dynamodatatype="N" />
    <attr name="comment"       xs:type="string"  columnsize="79" other:relativepath="comment"       other:fullpath="comment"       other:dynamodatatype="S" />
  </api:info>
</api:script>

Type mapping#

Because Trino and DynamoDB each support types that the other does not, this connector modifies some types when reading data.

DynamoDB to Trino type mapping#

The connector maps DynamoDB types to the corresponding Trino types as follows:

  • BOOLEAN maps to BOOLEAN

  • BIGINT maps to BIGINT

  • DOUBLE maps to DOUBLE

  • STRING maps to VARCHAR(N)

  • BINARY maps to VARBINARY

  • DATE maps to DATE

No other types are supported.

Type mapping and schemas#

Review the following properties to learn more about mapping nested attributes and arrays to relational values. These properties are used when generating a schema file and dictate how tables display.

Catalog and session catalog properties#

  • dynamodb.schema-directory: The relative or absolute path to a directory that contains schema files. The directory must exist on every host. For more information, see Generate a schema file. Defaults to ${java.io.tmpdir}/dynamodb-schemas.

  • dynamodb.flatten-objects-enabled: Set to true to expand map objects into individual flattened columns instead of returning them as a single JSON string. Enable this feature on a per-query basis with the dynamodb.flatten_objects_enabled session property. Defaults to false.

  • dynamodb.flatten-array-element-count: The default value of 0 returns an array as a single column of JSON. Set an integer value, up to 256, to convert array attributes in DynamoDB into individual columns. Enable this feature on a per-query basis with the dynamodb.flatten_arrays_element_count session property. Defaults to 0.

Maps and string sets#

DynamoDB supports nested attributes of maps and string sets, and the connector supports flattening these elements to relational columns. By default, any map or string set is returned as a single JSON column of type VARCHAR.

Consider the following table foo, which has a single key row_id and one column bar of DynamoDB type M. Only the api:info section is shown in the examples.

<api:info title="foo" other:catalog="CData" other:schema="AmazonDynamoDB" description="foo" other:tableapiname="foo" other:version="20">
  <attr name="row_id" xs:type="string" columnsize="255"  other:relativepath="row_id" other:fullpath="row_id" other:dynamodatatype="S" key="true" other:keytype="HASH" />
  <attr name="bar"    xs:type="string" columnsize="2000" other:relativepath="bar"    other:fullpath="bar"    other:dynamodatatype="M" />
</api:info>

The following example shows a value of bar. It contains three keys: a, a string; b, a number; and c, a string set.

{
    "a": "string value",
    "b": 123,
    "c": ["foo", "bar"]
}
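Without flattening, bar is returned as a single JSON VARCHAR, so you can still extract individual values with Trino's JSON functions. A sketch, assuming the catalog is named example:

SELECT json_extract_scalar(bar, '$.a') AS a,
       CAST(json_extract_scalar(bar, '$.b') AS bigint) AS b,
       json_extract_scalar(bar, '$.c[0]') AS c0
FROM example.amazondynamodb.foo;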

To flatten this map value, replace the M column in the schema file with four columns named bar.a, bar.b, bar.c.0, and bar.c.1, setting the SQL types and DynamoDB types via the xs:type and other:dynamodatatype attributes, respectively. Then set other:relativepath to a and other:fullpath to bar.a, where . is used as a path separator to project into the map and select the a attribute in the top-level bar attribute.

The string set uses the same separator as the map. Instead of specifying an attribute, use [0] and [1] to map the first and second elements of the string set to c.0 and c.1 SQL columns.

<api:info title="foo" other:catalog="CData" other:schema="AmazonDynamoDB" description="foo" other:tableapiname="foo" other:version="20">
  <attr name="row_id"  xs:type="string" columnsize="255" other:relativepath="row_id" other:fullpath="row_id"   other:dynamodatatype="S" key="true" other:keytype="HASH" />
  <attr name="bar.a"   xs:type="string" columnsize="255" other:relativepath="a"      other:fullpath="bar.a"    other:dynamodatatype="S" />
  <attr name="bar.b"   xs:type="bigint"                  other:relativepath="b"      other:fullpath="bar.b"    other:dynamodatatype="N" />
  <attr name="bar.c.0" xs:type="string" columnsize="255" other:relativepath="c[0]"   other:fullpath="bar.c[0]" other:dynamodatatype="S" />
  <attr name="bar.c.1" xs:type="string" columnsize="255" other:relativepath="c[1]"   other:fullpath="bar.c[1]" other:dynamodatatype="S" />
</api:info>

String set types use the same method of projection. In the following example, baz is of type SS.

<api:info title="foo" other:catalog="CData" other:schema="AmazonDynamoDB" description="foo" other:tableapiname="foo" other:version="20">
  <attr name="row_id" xs:type="string" columnsize="255"  other:relativepath="row_id" other:fullpath="row_id" other:dynamodatatype="S" key="true" other:keytype="HASH" />
  <attr name="baz"    xs:type="string" columnsize="2000" other:relativepath="value"  other:fullpath="value"  other:dynamodatatype="SS" />
</api:info>

The following schema file splits baz into three columns:

<api:info title="foo" other:catalog="CData" other:schema="AmazonDynamoDB" description="foo" other:tableapiname="foo" other:version="20">
  <attr name="row_id" xs:type="string" columnsize="255"  other:relativepath="row_id" other:fullpath="row_id" other:dynamodatatype="S" key="true" other:keytype="HASH" />
  <attr name="baz.0"  xs:type="string" columnsize="2000" other:relativepath="baz[0]" other:fullpath="baz[0]" other:dynamodatatype="S" />
  <attr name="baz.1"  xs:type="string" columnsize="2000" other:relativepath="baz[1]" other:fullpath="baz[1]" other:dynamodatatype="S" />
  <attr name="baz.2"  xs:type="string" columnsize="2000" other:relativepath="baz[2]" other:fullpath="baz[2]" other:dynamodatatype="S" />
</api:info>

If baz contains more than three elements, the additional elements are omitted from the result set. If baz contains fewer than three elements, the value of the column is NULL where an element is not present.

Using this method, you can project into many nested attributes in DynamoDB for both maps and string sets.
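Note that the flattened column names contain dots, so they must be double-quoted as delimited identifiers in SQL. A sketch, assuming the catalog is named example and the flattened map schema above:

SELECT row_id, "bar.a", "bar.b", "bar.c.0"
FROM example.amazondynamodb.foo;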

SQL support#

The connector supports globally available and read operation statements to access data and metadata in DynamoDB.

Performance#

The connector includes a number of performance improvements, detailed in the following sections.

Dynamic filtering#

Dynamic filtering is enabled by default. It causes the connector to wait for dynamic filtering to complete before starting a JDBC query.

You can disable dynamic filtering by setting the dynamic-filtering.enabled property in your catalog configuration file to false.

Wait timeout#

By default, table scans on the connector are delayed up to 20 seconds until dynamic filters are collected from the build side of joins. Using a large timeout can potentially result in more detailed dynamic filters. However, it can also increase latency for some queries.

You can configure the dynamic-filtering.wait-timeout property in your catalog properties file:

dynamic-filtering.wait-timeout=1m

You can use the dynamic_filtering_wait_timeout catalog session property in a specific session:

SET SESSION example.dynamic_filtering_wait_timeout = 1s;

Compaction#

The maximum size of the dynamic filter predicate that is pushed down to the connector during a table scan for a column is configured using the domain-compaction-threshold property in the catalog properties file:

domain-compaction-threshold=100

You can use the domain_compaction_threshold catalog session property:

SET SESSION domain_compaction_threshold = 10;

By default, domain-compaction-threshold is set to 32. When the dynamic predicate for a column exceeds this threshold, it is compacted into a single range predicate.

For example, if the dynamic filter collected for a date column dt on the fact table selects more than 32 days, the filtering condition is simplified from dt IN ('2020-01-10', '2020-01-12',..., '2020-05-30') to dt BETWEEN '2020-01-10' AND '2020-05-30'. Using a large threshold can result in increased table scan overhead due to a large IN list getting pushed down to the data source.

Metrics#

Metrics about dynamic filtering are reported in a JMX table for each catalog:

jmx.current."io.trino.plugin.jdbc:name=example,type=dynamicfilteringstats"

Metrics include information about the total number of dynamic filters, the number of completed dynamic filters, the number of available dynamic filters and the time spent waiting for dynamic filters.
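For example, with the JMX connector configured and a catalog named example, you can inspect the dynamic filtering metrics with a query:

SELECT * FROM jmx.current."io.trino.plugin.jdbc:name=example,type=dynamicfilteringstats";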

Starburst Cached Views#

The connector supports table scan redirection to improve performance and reduce load on the data source.

Security#

The connector includes a number of security-related features, detailed in the following sections.

Authenticating with AWS user credentials#

To authenticate using root or user credentials, set the following in your catalog properties, substituting the access key and secret key associated with the AWS user in your account. If you are not using root credentials, the user must have the correct policy for accessing DynamoDB.

dynamodb.aws-access-key=AKIAIOSFSJKFDFODNN7EXAMPLE
dynamodb.aws-secret-key=wJalrXUtnFEMI/Kadadf7MDENG/bPxRfiCYEXAMPLEKEY

Caution

Production use of root credentials is strongly discouraged as the root credentials have full access to your AWS account.

Authenticating from an EC2 instance#

To authenticate using the EC2 instance role, do not set the access key or secret key properties. The connector automatically uses the IAM role credentials attached to the running EC2 instance and authenticates with them. Note that the EC2 instance role must have a correct policy attached for accessing DynamoDB.

Authenticating using a role#

To authenticate by assuming a role, specify the role ARN in the catalog properties file. An access key and secret key must be specified to authenticate the account unless the connector is running on an EC2 instance, in which case the IAM role credentials attached to the EC2 instance are used to authenticate before assuming the role. If an external ID is required, specify it as well. Note that the correct trust relationship for the role and the policy to access DynamoDB must be configured. Roles cannot be used when specifying the access key and secret key of the root user.

dynamodb.aws-role-arn=arn:aws:iam::account:role/role-name

# Optionally specify access key and secret key if not on EC2
dynamodb.aws-access-key=AKIAIOSFODNN7EXAMPLE
dynamodb.aws-secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# Optionally specify the external ID to assume the role
dynamodb.aws-external-id=my-external-id