Starburst DynamoDB connector#

The Starburst DynamoDB connector allows querying data stored in Amazon DynamoDB.

Requirements#

To connect to DynamoDB, you need:

  • Amazon Web Services (AWS) access credentials with an attached policy that allows reading from DynamoDB.

  • Network access from the coordinator and workers to the AWS region hosting your DynamoDB tables.

  • A valid Starburst Enterprise license.

Configuration#

Create the mydynamodb catalog with a catalog properties file in etc/catalog named mydynamodb.properties:

connector.name=dynamodb
dynamodb.aws-access-key=AKIAIOSFODNN7EXAMPLE
dynamodb.aws-secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
dynamodb.aws-region=us-east-1
dynamodb.generate-schema-files=NEVER
dynamodb.schema-directory=/path/to/schema/directory

Specify the connector.name property as dynamodb. Configure the catalog with the AWS access key, secret key, and region for your account.

The connector can only access DynamoDB with the access credentials configured in the catalog. If you need to access DynamoDB with different credentials or in a different region, configure a separate catalog. See Type mapping and schemas for more information about catalog and session properties.
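For example, to access tables in a different region with separate credentials, create an additional catalog properties file such as etc/catalog/mydynamodb_west.properties (a hypothetical name), replacing the placeholder values:

connector.name=dynamodb
dynamodb.aws-access-key=<your-access-key>
dynamodb.aws-secret-key=<your-secret-key>
dynamodb.aws-region=us-west-2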

Generate a schema file#

The connector exposes the data in your DynamoDB instance in the amazondynamodb schema of the catalog.

A schema file for a DynamoDB table details the SEP table name, columns, and data types. It additionally specifies which DynamoDB attribute (or nested attribute) the connector uses to populate data for each column.

If you do not provide a schema file for a table, the connector can generate one by sampling up to 50 rows from the DynamoDB table. Generated schema files are in RealSQLDatabase (RSD) format, and the tables they define are then accessible using SQL commands.

The value of the dynamodb.generate-schema-files configuration property dictates when the connector generates schema files:

  • NEVER (default): A schema file is never generated.

  • ON_USE: A schema file is generated the first time a table is referenced, provided the schema file for the table does not already exist.

  • ON_START: A schema file is generated at connection time for any tables that do not currently have a schema file.
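For example, to have the connector create a schema file the first time each table is referenced, set the following properties in the catalog properties file:

dynamodb.generate-schema-files=ON_USE
dynamodb.schema-directory=/path/to/schema/directory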

Several attributes in the schema file detail the mapping from SQL columns to DynamoDB attributes:

Schema file column definitions#

  • name: The name of the column in the SEP table.

  • xs:type: The SQL type of the SEP column. Supported types are listed in Type mapping.

  • columnsize: The length of a SQL column that supports variable widths, such as VARCHAR. Default: 255.

  • key: true if this column is part of the DynamoDB primary key. Default: false.

  • other:keytype: The type of the DynamoDB key. Valid values are HASH or RANGE.

  • other:path: The name of the attribute in DynamoDB. Default: the value of name.

  • other:internaltype: The DynamoDB type. Valid values are N, S, SS, or M.

Example#

This is an example of a schema file for the TPC-H orders table. The file is named orders.rsd and must be placed in the directory specified by the dynamodb.schema-directory configuration property on every SEP host.

The api:info section has the name of the table, orders, and then lists all columns in the SEP table definition as child elements.

The additional api:script definitions for the GET, POST, MERGE, and DELETE API methods are required in every schema file. These definitions are standard; you can copy them as written below.

<api:script xmlns:api="http://apiscript.com/ns?v1">
  <api:info title="orders">
    <attr name="orderkey"      xs:type="bigint" key="true"      other:keytype="HASH" other:path="orderkey" other:internaltype="N" />
    <attr name="custkey"       xs:type="bigint"                 other:path="custkey"       other:internaltype="N" />
    <attr name="orderstatus"   xs:type="string" columnsize="1"  other:path="orderstatus"   other:internaltype="S" />
    <attr name="totalprice"    xs:type="double"                 other:path="totalprice"    other:internaltype="N" />
    <attr name="orderdate"     xs:type="date"                   other:path="orderdate"     other:internaltype="S" />
    <attr name="orderpriority" xs:type="string" columnsize="15" other:path="orderpriority" other:internaltype="S" />
    <attr name="clerk"         xs:type="string" columnsize="15" other:path="clerk"         other:internaltype="S" />
    <attr name="shippriority"  xs:type="integer"                other:path="shippriority"  other:internaltype="N" />
    <attr name="comment"       xs:type="string" columnsize="79" other:path="comment"       other:internaltype="S" />
  </api:info>

  <api:script method="GET">
    <api:call op="dynamodbadoMultiThreadSelect">
      <api:push/>
    </api:call>
  </api:script>

  <api:script method="POST">
    <api:call op="dynamodbadoAdd">
      <api:push/>
    </api:call>
  </api:script>

  <api:script method="MERGE">
    <api:call op="dynamodbadoUpdate">
      <api:push/>
    </api:call>
  </api:script>

  <api:script method="DELETE">
    <api:call op="dynamodbadoDelete">
      <api:push/>
    </api:call>
  </api:script>
</api:script>
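With this schema file in place on every host, you can query the orders table like any other SEP table. A brief example, assuming the mydynamodb catalog configured earlier:

SELECT orderkey, orderstatus, totalprice
FROM mydynamodb.amazondynamodb.orders
WHERE orderdate >= DATE '1995-01-01';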

Type mapping#

The supported SQL types include:

  • BOOLEAN

  • TINYINT

  • SMALLINT

  • BIGINT

  • REAL

  • DOUBLE

  • VARCHAR(N)

  • VARBINARY

  • DATE

No other SQL types are supported.

Type mapping and schemas#

The following catalog properties control how the connector maps nested attributes and arrays to relational columns. They are applied when generating a schema file and dictate how the table is displayed.

Catalog and session properties#

  • dynamodb.schema-directory: The relative or absolute path to a directory that contains schema files. The directory must exist on every host. See Generate a schema file for more information. Default: ${java.io.tmpdir}/dynamodb-schemas.

  • dynamodb.flatten-objects-enabled: Set to true to expand objects into individual columns, one for each nested attribute; when false, an object is returned as a single JSON string. Enable this feature on a per-query basis with the dynamodb.flatten_objects_enabled session property. Default: false.

  • dynamodb.flatten-array-element-count: Set to an integer value up to 256 to convert that many elements of each array attribute into individual columns. The default of 0 returns an array as a single column of JSON. Override this on a per-query basis with the dynamodb.flatten_array_element_count session property. Default: 0.
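For example, to flatten maps and arrays for a single query session instead of the whole catalog, set the session properties before running your query. This sketch assumes the mydynamodb catalog and the session property names listed above:

SET SESSION mydynamodb.flatten_objects_enabled = true;
SET SESSION mydynamodb.flatten_array_element_count = 10;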

Maps and string sets#

DynamoDB supports nested attributes of maps and string sets, and the connector supports flattening these elements to relational columns. By default, any map or string set is returned as a single JSON column of type VARCHAR.

Consider the following table foo, which has a single key row_id and one column bar of DynamoDB type M. Only the api:info section of the schema file is shown in the examples below.

<api:info title="foo">
  <attr name="row_id" xs:type="string" columnsize="255"  other:path="row_id" other:internaltype="S" key="true" other:keytype="HASH" />
  <attr name="bar"    xs:type="string" columnsize="2000" other:path="bar"    other:internaltype="M" />
</api:info>

The following example shows a value of bar. It contains three keys: a, a string; b, a number; and c, a string set.

{
    "a": "string value",
    "b": 123,
    "c": ["foo", "bar"]
}

To flatten this map, replace the single M column in the schema file with four columns named a, b, c_0, and c_1, setting the SQL and DynamoDB types via the xs:type and other:internaltype attributes, respectively. Then set other:path to bar___cdata___a, where ___cdata___ is a path separator that projects into the map to select the a attribute of the top-level bar attribute.

The string set uses the same separator as the map. Instead of specifying an attribute name, use 0 and 1 to map the first and second elements of the string set to the c_0 and c_1 SQL columns.


<api:info title="foo">
  <attr name="row_id" xs:type="string" columnsize="255" other:path="row_id" other:internaltype="S" key="true" other:keytype="HASH" />
  <attr name="a"      xs:type="string" columnsize="255" other:path="bar___cdata___a"             other:internaltype="S" />
  <attr name="b"      xs:type="bigint"                  other:path="bar___cdata___b"             other:internaltype="N" />
  <attr name="c_0"    xs:type="string" columnsize="255" other:path="bar___cdata___c___cdata___0" other:internaltype="S" />
  <attr name="c_1"    xs:type="string" columnsize="255" other:path="bar___cdata___c___cdata___1" other:internaltype="S" />
</api:info>
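A query against this flattened schema returns the map keys as individual columns. Based on the example value of bar shown earlier, a result row looks similar to the following (the row_id value is hypothetical):

SELECT row_id, a, b, c_0, c_1
FROM mydynamodb.amazondynamodb.foo;

-- row_id | a            | b   | c_0 | c_1
-- r1     | string value | 123 | foo | bar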

String set types use the same projection method. In the following example, baz is of type SS:

<api:info title="foo">
  <attr name="row_id" xs:type="string" columnsize="255"  other:path="row_id" other:internaltype="S" key="true" other:keytype="HASH" />
  <attr name="baz"    xs:type="string" columnsize="2000" other:path="baz"    other:internaltype="SS"  />
</api:info>

Flattening it into three columns yields:

<api:info title="foo">
  <attr name="row_id" xs:type="string" columnsize="255"  other:path="row_id" other:internaltype="S" key="true" other:keytype="HASH" />
  <attr name="baz_0"  xs:type="string" columnsize="2000" other:path="baz___cdata___0"    other:internaltype="S"  />
  <attr name="baz_1"  xs:type="string" columnsize="2000" other:path="baz___cdata___1"    other:internaltype="S"  />
  <attr name="baz_2"  xs:type="string" columnsize="2000" other:path="baz___cdata___2"    other:internaltype="S"  />
</api:info>

If baz contains more than three elements, the additional elements are omitted from the result set. If baz contains fewer than three elements, the value of the column is NULL where an element is not present.
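For example, if a row's baz string set holds only two elements, a query against the flattened schema returns NULL for the third column. The data values below are hypothetical:

SELECT row_id, baz_0, baz_1, baz_2
FROM mydynamodb.amazondynamodb.foo;

-- row_id | baz_0 | baz_1 | baz_2
-- r1     | alpha | beta  | NULL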

Using this method, you can project into many nested attributes in DynamoDB for both maps and string sets.

SQL support#

The connector supports globally available and read operation statements to access data and metadata in DynamoDB.
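For example, you can inspect the available metadata and read data with standard statements, assuming the mydynamodb catalog:

SHOW TABLES FROM mydynamodb.amazondynamodb;
DESCRIBE mydynamodb.amazondynamodb.orders;
SELECT * FROM mydynamodb.amazondynamodb.orders LIMIT 10;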

Performance#

The connector includes a number of performance improvements, detailed in the following sections.

Dynamic filtering#

Dynamic filtering is enabled by default. It causes the connector to wait for dynamic filtering to complete before starting a query.

You can disable dynamic filtering by setting the property dynamic-filtering.enabled in your catalog properties file to false.
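For example, add the following line to the mydynamodb.properties catalog file:

dynamic-filtering.enabled=false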

Starburst Cached Views#

The connector supports table scan redirection to improve performance and reduce load on the data source.