Great Lakes connectivity #

Great Lakes connectivity abstracts away the details of different table formats and file types when reading from and writing to object storage systems. This connectivity is built into Starburst Galaxy, and is available to all users.

The following object storage catalogs support Great Lakes connectivity:

- Amazon S3
- Azure Data Lake Storage (ADLS)
- Google Cloud Storage

This feature lets you read and write tables of the following types in the same object storage catalog:

- Iceberg
- Delta Lake
- Hive

In addition, the following table type is supported with read-only access:

- Hudi

Any read or write access to existing tables works transparently for all table formats. Starburst Galaxy recognizes the type and format of your tables by reading the metastore associated with your object storage.
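For example, a single query can join tables of different formats with no format-specific syntax. This is a sketch only; the catalog, schema, table, and column names are hypothetical:

```sql
-- Galaxy determines each table's format (Iceberg, Delta Lake, or Hive)
-- from the metastore, so the query reads the same either way.
SELECT o.order_id, c.name
FROM my_catalog.sales.orders o      -- for example, an Iceberg table
JOIN my_catalog.sales.customers c   -- for example, a Delta Lake table
  ON o.customer_id = c.customer_id;
```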

Specify default table format #

For data consumers who want to read from any object storage file in any supported table type, there are no configuration steps or commands. Starburst Galaxy reads each file’s type and format from the object store’s associated metadata file.

For data engineers who want to create new tables in an object storage catalog, Starburst Galaxy provides a simple way to specify the default format for all newly created tables. When creating a new catalog, select one of the options in the Default table format dialog. Your selection applies only to newly created tables, and does not convert any existing tables.

Select default table format

You can also use a SQL statement to specify an individual table's format when creating that table. In a CREATE TABLE or CREATE TABLE AS statement, set the type table property to iceberg, delta, or hive. A table created this way overrides the catalog's default table format.

For the Hive and Iceberg types, you can optionally include a format property with the values discussed in the following sections.

Examples of CREATE SQL statements are in each table type’s subsection.

Hudi type #

Hudi type tables have read-only support. Existing Hudi tables detected in a Galaxy-connected object storage location are automatically available for reading.

Galaxy cannot create new Hudi tables or write to them.
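A minimal read sketch, assuming a Hudi table named trips already exists in a hypothetical catalog my_catalog under the rides schema:

```sql
-- Hudi tables are read-only: SELECT works, but INSERT, UPDATE,
-- and DELETE statements are not supported.
SELECT trip_id, fare
FROM my_catalog.rides.trips
WHERE fare > 10.00;
```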

Iceberg type #

Apache Iceberg is an open table format for huge analytic datasets. Starburst Galaxy supports querying data stored in files written in the Iceberg format, as defined in the Iceberg Table Spec.

Iceberg type tables are automatically detected and read based on information in the storage environment’s associated metadata file.

To create an Iceberg type table, add type='iceberg' as a table property in your CREATE statement. For example:

CREATE TABLE customer (
    name varchar,
    address varchar)
WITH (type='iceberg');

Specify the format of table data files with an additional format property, whose value is either parquet or orc. The default is orc:

...
WITH (type='iceberg',
      format='parquet');

Append further comma-separated table properties as required:

- partitioning: Optionally specifies table partitioning. If a table is partitioned by columns c1 and c2, this property is set with partitioning=ARRAY['c1', 'c2'].
- location: Optionally specifies the file system location URI for the table.
- format_version: Optionally specifies the format version of the Iceberg specification, either 1 or 2. For example, format_version=1.
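The optional properties can be combined in a single statement. A sketch, with a hypothetical table name and bucket location:

```sql
-- All names and the S3 location are hypothetical.
CREATE TABLE my_catalog.sales.events (
    event_time timestamp(6),
    region varchar,
    payload varchar)
WITH (type = 'iceberg',
      format = 'parquet',
      partitioning = ARRAY['region'],
      location = 's3://my-bucket/events',
      format_version = 2);
```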

For the Iceberg type, only INSERT and DELETE by partition are supported.
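For example, a DELETE must target whole partitions. Assuming a hypothetical Iceberg table partitioned by a region column:

```sql
-- Deletes entire partitions. A predicate on a non-partition column
-- would not satisfy partition-only DELETE support.
DELETE FROM my_catalog.sales.events
WHERE region = 'EMEA';
```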

Delta Lake type #

Delta Lake connectivity allows querying data stored in Delta Lake type tables, including Databricks Delta Lake.

Delta Lake type tables are automatically detected and read based on information in the storage environment’s associated metadata file.

To create a Delta Lake type table, add type='delta' as a table property in your CREATE statement. There is no format property for Delta Lake, but you can append other table properties as required:

- partitioned_by: Optionally specifies table partitioning. If a table is partitioned by column regionkey, this property is set with partitioned_by=ARRAY['regionkey'].
- location: Optionally specifies the file system location URI for the table.
- checkpoint_interval: Optionally specifies the checkpoint interval in number of table writes.

For example:

CREATE TABLE galaxy_delta.default.my_partitioned_table
WITH (
  type = 'delta',
  location = 's3://my-bucket/at/this/path',
  partitioned_by = ARRAY['regionkey'],
  checkpoint_interval = 5
)
AS
  SELECT name, comment, regionkey
  FROM tpch.tiny.nation;

Hive type #

Hive connectivity allows querying data stored in an Apache Hive data warehouse.

Hive type tables are automatically detected and read based on information in the storage environment’s associated metadata file.

To create a Hive type table, add type='hive' as a table property in your CREATE statement. For example:

CREATE TABLE customer (
  name varchar,
  address varchar)
WITH (type='hive');

Specify the format of table data files with an additional format property, using one of the following values:

- ORC: Default value. Apache ORC file format.
- PARQUET: Apache Parquet file format.
- JSON: JSON file format using org.apache.hive.hcatalog.data.JsonSerDe.
- CSV: Comma-separated values file format using org.apache.hadoop.hive.serde2.OpenCSVSerde.
- TEXTFILE: Plain text file format using org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.
- AVRO: Apache Avro file format.
- RCTEXT: RCFile using ColumnarSerDe.
- RCBINARY: RCFile using LazyBinaryColumnarSerDe.
- SEQUENCEFILE: Sequence file format using org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.
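As one illustration, a sketch of a Hive table using the CSV format. The table name is hypothetical, and the columns are declared varchar because OpenCSVSerde-based CSV tables generally expect varchar columns:

```sql
-- Hypothetical table; CSV-format Hive tables store
-- all columns as varchar.
CREATE TABLE my_catalog.staging.raw_customers (
    name varchar,
    address varchar)
WITH (type = 'hive',
      format = 'CSV');
```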

Specify additional table properties as needed, consulting the list of table properties in the Trino Hive documentation. Notice that many of the available table properties must be used in conjunction with a particular Hive file format.

Some of the example SQL statements on the Hive documentation page must be supplemented with the type='hive' property for use in Starburst Galaxy. For example:

CREATE TABLE hive.web.page_views (
  view_time timestamp,
  user_id bigint,
  page_url varchar,
  ds date,
  country varchar
)
WITH (
  type = 'hive',
  format = 'ORC',
  partitioned_by = ARRAY['ds', 'country'],
  bucketed_by = ARRAY['user_id'],
  bucket_count = 50
);