Great Lakes connectivity #

Great Lakes connectivity abstracts away the details of different table formats and file types when you use write statements against object storage systems. This connectivity is built into Starburst Galaxy and is available to all users.

The following object storage catalogs support Great Lakes connectivity:

This feature supports the following table formats:

  • Iceberg
  • Delta Lake
  • Hive
  • Hudi (read-only)

Any access to existing tables works transparently for all table formats. Starburst Galaxy recognizes the format of your tables by reading the metastore associated with your object storage.

Specify default table format #

For data consumers who want to read from any object storage file in any supported table format, there are no configuration steps or commands. Starburst Galaxy reads each file’s type and format from the object store’s associated metadata file.

For data engineers who want to create new tables in an object storage catalog, Starburst Galaxy provides a simple way to specify the default format for all newly created tables. When creating a new catalog, select from one of the options in the Default table format dialog. Your selection applies only to newly created tables, and does not convert any existing tables.

Select default table format

You can also use a SQL statement to specify an individual table's format when creating that table. In a statement such as CREATE TABLE or CREATE TABLE AS, set the type table property to iceberg, delta, or hive. This property overrides the catalog's default table format for that table.
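For example, in a catalog whose default table format is Iceberg, a statement like the following sketch creates a Delta Lake table instead. The catalog, schema, table, and column names here are hypothetical:

```sql
-- Hypothetical names; the type property overrides the catalog's
-- default table format for this table only.
CREATE TABLE galaxy_lake.sales.orders (
  order_id bigint,
  region varchar
)
WITH (type = 'delta');
```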

Iceberg table format #

Apache Iceberg is an open table format for huge analytic datasets. Starburst Galaxy allows querying data stored in files written in the Iceberg format, as defined in the Iceberg Table Spec.

Iceberg tables are automatically detected and read based on information in the storage environment’s associated metadata.

To create an Iceberg table, add type='iceberg' as a table property in your CREATE statement. For example:

CREATE TABLE customer (
    name varchar,
    address varchar)
WITH (type='iceberg');

Specify the file format of table data files with an additional format property, set to either parquet or orc; the default is orc:

CREATE TABLE customer (
    name varchar,
    address varchar)
WITH (type='iceberg',
      format='parquet');

Append further comma-separated table properties as required:

| Table property | Description |
| -------------- | ----------- |
| partitioning | Optionally specifies table partitioning. If a table is partitioned by columns c1 and c2, this property is set with partitioning=ARRAY['c1', 'c2']. |
| location | Optionally specifies the file system location URI for the table. |
| format_version | Optionally specifies the format version of the Iceberg specification, either 1 or 2. For example, format_version=1. |
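Putting these together, the following sketch creates a partitioned Iceberg table that sets several of the optional properties above. The table name, column names, and bucket path are hypothetical:

```sql
-- Hypothetical table; partitioning, location, and format_version
-- are the optional Iceberg table properties described above.
CREATE TABLE events (
  event_time timestamp,
  c1 varchar,
  c2 varchar
)
WITH (type = 'iceberg',
      format = 'parquet',
      partitioning = ARRAY['c1', 'c2'],
      location = 's3://my-bucket/events',
      format_version = 2);
```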

SQL support #

When using the Iceberg table format with Great Lakes connectivity, the following SQL statements are supported:

The following additional considerations apply:

  • Among write operations, only INSERT and DELETE of entire partitions are supported.
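Under that limitation, a DELETE must align with partition boundaries. As a sketch, assuming a hypothetical table partitioned by column c1:

```sql
-- Deletes all rows in the partitions where c1 = 'EU'.
-- A DELETE whose predicate does not match whole partitions
-- is not supported.
DELETE FROM events WHERE c1 = 'EU';
```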

Delta Lake table format #

Delta Lake connectivity allows querying data stored in Delta Lake tables, including Databricks Delta Lake.

Delta Lake tables are automatically detected and read based on information in the storage environment’s associated metadata.

To create a Delta Lake table, add type='delta' as a table property in your CREATE statement. There is no format property for Delta Lake, but you can append other table properties as required:

| Table property | Description |
| -------------- | ----------- |
| partitioned_by | Optionally specifies table partitioning. If a table is partitioned by column regionkey, this property is set with partitioned_by=ARRAY['regionkey']. |
| location | Optionally specifies the file system location URI for the table. |
| checkpoint_interval | Optionally specifies the number of table writes between checkpoints. |

For example:

CREATE TABLE galaxy_delta.default.my_partitioned_table
WITH (
  type = 'delta',
  location = 's3://my-bucket/at/this/path',
  partitioned_by = ARRAY['regionkey'],
  checkpoint_interval = 5
)
AS
  SELECT name, comment, regionkey
  FROM tpch.tiny.nation;

SQL support #

When using the Delta Lake table format with Great Lakes connectivity, the following SQL statements are supported:

Hive table format #

Hive connectivity allows querying data stored in an Apache Hive data warehouse.

Hive tables are automatically detected and read based on information in the storage environment’s associated metadata.

To create a Hive table, add type='hive' as a table property in your CREATE statement. For example:

CREATE TABLE customer (
  name varchar,
  address varchar)
WITH (type='hive');

Specify the format of table data files with an additional format property with one of the values shown in the following table:

| File format | Description |
| ----------- | ----------- |
| ORC | Default value. Apache ORC file format. |
| PARQUET | Apache Parquet file format. |
| JSON | JSON file format, using org.apache.hive.hcatalog.data.JsonSerDe. |
| CSV | Comma-separated values file format, using org.apache.hadoop.hive.serde2.OpenCSVSerde. |
| TEXTFILE | Plain text file format. |
| AVRO | Apache Avro file format. |
| RCTEXT | RCFile format, using ColumnarSerDe. |
| RCBINARY | RCFile format, using LazyBinaryColumnarSerDe. |
| SEQUENCEFILE | Hadoop SequenceFile format. |

Specify additional table properties as needed, consulting the list of table properties in the Trino Hive documentation. Notice that many of the available table properties must be used in conjunction with a particular Hive file format.
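For instance, CSV-specific properties in the Trino Hive documentation are only valid together with format='CSV'. The following sketch assumes the csv_separator property and uses hypothetical table and column names:

```sql
-- Hypothetical table; csv_separator is a format-specific property
-- that is only valid in combination with format = 'CSV'.
CREATE TABLE logs (
  line_id varchar,
  message varchar
)
WITH (type = 'hive',
      format = 'CSV',
      csv_separator = '|');
```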

Some of the example SQL statements on the Hive documentation page must be supplemented with the type='hive' property for use in Starburst Galaxy. For example:

CREATE TABLE hive.web.page_views (
  view_time timestamp,
  user_id bigint,
  page_url varchar,
  ds date,
  country varchar
)
WITH (
  type = 'hive',
  format = 'ORC',
  partitioned_by = ARRAY['ds', 'country'],
  bucketed_by = ARRAY['user_id'],
  bucket_count = 50
);

SQL support #

When using the Hive table format with Great Lakes connectivity, the following SQL statements are supported:

Hudi table format #

Hudi tables have read-only support. Existing Hudi tables detected in a Galaxy-connected object storage location are read automatically.

Galaxy cannot create new Hudi tables or write to them.
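Because Hudi support is read-only, access is limited to queries such as the following sketch, where the catalog, schema, table, and column names are hypothetical:

```sql
-- Read-only access: SELECT works, but CREATE TABLE, INSERT,
-- and other write statements fail for Hudi tables.
SELECT trip_id, fare
FROM galaxy_hudi.trips.rides
LIMIT 10;
```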

SQL support #

When using the Hudi table format with Great Lakes connectivity, the following SQL statements are supported: