Starburst Hive connector#
The Starburst Hive connector is an extended version of the Hive connector with configuration and usage identical.
Requirements#
Fulfill the Hive connector requirements.
Additional features of the connector require a valid Starburst Enterprise license, unless otherwise noted.
Extensions#
The Starburst Hive connector supports improvements detailed in the security and performance sections, and includes following additional extensions:
Amazon Glue support#
Statistics collection is supported for Hive Metastore and Amazon Glue.
Ceph support#
The connector supports querying data on Ceph storage.
Dell ECS and ObjectScale support#
The connector supports querying data on Dell ECS or ObjectScale storage.
Cloudera support#
The connector supports the Cloudera Data Platform (CDP).
MinIO support#
The connector supports querying data on MinIO storage.
OpenX JSON format support#
The connector supports reading and writing data to tables as JSON files, and
supports the OpenX JSON serialization and deserialization (serde) from the Java
class org.openx.data.jsonserde.JsonSerDe
.
Existing tables using that serde and all the associated serde properties are handled automatically.
Configuration#
The connector configuration is similar to the configuration for the base Hive connector, with these additional properties:
Property name |
Description |
---|---|
|
Set to |
|
List of tables to be excluded from caching. If used with the
|
|
Path to configuration file for mapping usage with storage caching |
|
Refresh period for mapping file with storage caching |
|
Various properties for materialized view usage and configuration. |
SQL support#
The connector supports all of the SQL statements listed in the Hive connector documentation.
The following section describes additional SQL operations that are supported by SEP enhancements to the Trino connector.
Materialized view management, see also Materialized views and Automated materialized view management
Procedures#
Use the CALL statement to perform data manipulation or administrative tasks. Procedures are available in the system schema of each catalog.
Flush filesystem cache#
system.flush_filesystem_cache()
Flushes filesystem cache of a specific table. By default, this function accepts a schema name and a table name as parameters. You can flush the filesystem cache for specific partitions. For example, the following system call flushes the filesystem cache of a specific partition of the
MY_TABLE
table:CALL system.flush_filesystem_cache('TPCH_SCHEMA', 'MY_TABLE', ARRAY['col2'], ARRAY['group1']);
Views#
By default, Hive views are executed with the DEFINER
security mode. Set the
hive.hive-views.run-as-invoker
catalog configuration property to true
to use
INVOKER
semantics.
To execute all views as INVOKER
, set both the
hive.trino-views.run-as-invoker
and the hive.hive-views.run-as-invoker
catalog configuration properties to true
.
Athena views#
Note
Amazon Athena view support is a public preview feature. There are expected to be cases of syntax differences between Athena and Trino that cause view execution to fail. Contact Starburst Support with questions or feedback.
The connector supports querying views created in Amazon Athena. Athena views are
executed with INVOKER
security mode. If you are using AWS Lake Formation, support for Athena views includes cross-account
access to views shared from other AWS catalogs.
Materialized views#
Note
If you are a data consumer, read the Materialized views page for an introduction to using materialized views.
The connector supports Materialized view management, with the following requirements:
The cache service must be configured and running
Catalogs must be configured to allow materialized views.
In the underlying system, each materialized view consists of a view definition and a storage table. The storage table name is stored as a materialized view property. The materialized data is stored in that storage table.
Note
If you are a data engineer or platform administrator, read our cache service introduction for an overview of setting up the cache service and using materialized views.
The CREATE MATERIALIZED VIEW statement specifies the query to define
the data for the materialized view, the refresh schedule, and other parameters
used by the cache service. The query can access any available catalog and
schema. Storage configuration for the materialized view must be supplied with
table properties in the WITH
statement. The
optional automatic refresh is also configured with properties set in the WITH clause.
Note
The format
table property is not supported for materialized views created
with the Hive connector. These materialized views use the default file format
configured in the optional hive.storage-format
catalog configuration
property, which defaults to ORC
.
Once the storage tables are populated, the materialized view is created, and you can access it like a table using the name of the materialized view.
Materialized views are populated with data and refreshed manually with the REFRESH MATERIALIZED VIEW command, or by the Automated materialized view management.
Use the SHOW CREATE MATERIALIZED VIEW statement to view the
complete CREATE MATERIALIZED VIEW
statement for a materialized view,
including the properties in the WITH
clause.
Dropping a materialized view with DROP MATERIALIZED VIEW removes the definition and the storage table.
Configuration#
To enable materialized views you must:
Create a storage schema to contain the storage tables for materialized views.
Specify the required configuration properties in the catalog properties file for each desired catalog to enable materialized view creation and usage in that catalog.
Both are discussed in this section.
The following are the required catalog configuration properties for a deployment that connects to the cache service using HTTPS on the default port:
materialized-views.enabled=true
materialized-views.namespace=<your_namespace>
materialized-views.storage-schema=<your_storage_schema>
materialized-views.http-client.https.hostname-verification=false
materialized-views.http-client.trust-store-path=</etc/trino/keystore.jks>
materialized-views.http-client.trust-store-password=<keystore_password>
cache-service.uri=https://<cache-service-hostname>:8543
cache-service.user=<starburst-user>
cache-service.password=<starburst-password>
The following are the required catalog configuration properties for a deployment that connects to the cache service using an insecure HTTP connection on the default port:
materialized-views.enabled=true
materialized-views.namespace=<your_namespace>
materialized-views.storage-schema=<your_storage_schema>
cache-service.uri=http://<cache-service-hostname>:8180
The following table lists all available catalog configuration properties related to materialized views. Instructions for creating the required storage schema follow this table.
Property name |
Description |
Required |
---|---|---|
|
Set to |
true |
|
Specifies the catalog that contains the schema used to store the storage tables for the materialized views. Defaults to the catalog used for the materialized view itself. |
false |
|
Specifies the schema used to store the storage tables for the materialized views. Ensure that the proper access control exists on that schema to prevent users from directly accessing the storage tables. |
true |
|
Used by the cache service to create a fully-qualified name for the materialized views, and to identify which catalog is used to run the scheduled refresh queries. |
true |
|
Directs SEP to run as the user submitting the query when present in a
catalog and set to |
false |
|
Specifies default value for the run-as-invoker property. |
false |
|
The URI of the SEP cache service. |
true |
The storage schema must be defined in the catalog properties file.
materialized-views.storage-schema=views_cache_storage
If it does not exist yet, you must create it with a defined location:
CREATE SCHEMA example.views_cache_storage WITH (location = 's3a://<s3-bucket-name>/hivepostgres_views/views_cache_storage/');
In addition, the schema for the materialized view itself must exist. If it does not exist, you must create it:
CREATE SCHEMA example.views_schema WITH (location = 's3a://<s3-bucket-name>/hivepostgres_views/views_schemas');
With the cache service running, the catalog configured and the schemas defined,
you can proceed to create a materialized view. In this example, a materialized
view named example.example_schema.example_materialized_view
is created:
CREATE MATERIALIZED VIEW example.example_schema.example_materialized_view
WITH (
grace_period = '15m',
max_import_duration = '1m'
) AS
SELECT *
FROM example.public.example_table
WHERE example_field IN ( 'examplevalue1', 'examplevalue2' )
;
The query, specified after AS
, can be any valid query, including queries
accessing one or multiple other catalogs.
The properties for the view are stored in the cache service database, and the
data in the storage schema, example.example_mvstorage
.
Once the materialized view has been created, you can query the data. If the data
is not yet cached, the query runs against the source data instead. You can force
the materialized view to cache at any time by running a REFRESH
statement:
REFRESH MATERIALIZED VIEW example.example_schema.example_materialized_view;
To query data in the materialized view, use a SELECT
statement
as you would for any other table:
SELECT * FROM example.example_schema.example_materialized_view;
You can also use the materialized view in more complex queries, just like any other table.
Access to a materialized view is much faster, since the data is readily available in the storage table and no computation is necessary. The data however does use storage in addition to the source data. Use the following query to determine the catalog, schema, and name of the storage table:
SELECT * FROM system.metadata.materialized_views WHERE name = 'viewname';
Then use the hidden properties of the storage table to compute an estimate for the used storage:
SELECT SUM(size) table_size, COUNT(file) num_files
FROM (
SELECT
"$path" AS file,
"$file_size" AS size
FROM <storage_catalog>.<storage_schema>.<storage_table>
)
GROUP BY file, size;
Troubleshooting#
Whether your materialized views are refreshed manually or by the cache service
using WITH
clause parameters, refreshes may fail if columns are added or
renamed at the source. If this happens, drop the materialized view, and create
it again.
ALTER MATERIALIZED VIEW SET PROPERTIES
#
The extended Starburst Hive connector adds support for ALTER MATERIALIED VIEW SET PROPERTIES statements.
ALTER MATERIALIZED VIEW
exampleMaterializedView
SET PROPERTIES
cron = '*/15 * * * *',
refresh_interval = DEFAULT;
When a materialized view property that impacts the storage layout or an
incremental refresh property is altered, a new storage table is created.
If a materialized view property that does not affect the storage layout is
changed, such as refresh_interval
or grace_period
, the existing
storage table continues to be used.
For example, the following ALTER MATERIALIZED VIEW
statement changes the
storage layout of exampleMaterializedView
by altering the partitioned_by
property, resulting in a new storage table for the materialized view:
ALTER MATERIALIZED VIEW
exampleMaterializedView
SET PROPERTIES
partitioned_by = ARRAY['regionkey'];
Automated materialized view management#
The connector uses the cache service to manage metadata and maintenance of materialized views, related storage tables, and other aspects.
When the CREATE MATERIALIZED VIEW
statement runs initially, the cache
services picks up processing asynchronously after its configured refresh-interval and delay, which defaults to
two minutes. The defined query is then run and used to populate the storage
tables. The processing time depends on the complexity of the query, the cluster
performance, and the performance of the storage.
Configuration properties are specified in the WITH
clause when creating a
materialized view:
Property name |
Description |
Default |
---|---|---|
|
Minimum duration between refreshes of the materialized
view, for example |
|
|
Unix cron expression specifying a schedule for regular refresh of the
materialized view, for example |
|
|
Maximum allowed execution time for the refresh of the materialized view to complete. Measured from the scheduled time to the completion of the query execution. If a refresh fails for exceeding the maximum duration, the cache service attempts a refresh at the next scheduled time. |
|
|
After a view’s TTL (Time to Live, calculated as
|
|
|
Column used during incremental refresh by the service to apply an
|
|
|
Namespace used by the cache service to create a fully qualified name for materialized views in a catalog. |
|
|
Validate access to tables and data referenced by materialized view as
invoker. If you want to set this to Warning If you enable |
|
In the following example, a materialized view named customer_total_return
in
example.example_schema
is created to automatically refresh daily at 2:30AM:
CREATE MATERIALIZED VIEW example.example_schema.customer_total_return
WITH (
grace_period = '5m',
max_import_duration = '30m',
cron = '30 2 * * *'
) AS
SELECT
sr_customer_sk ctr_customer_sk,
sr_store_sk ctr_store_sk,
sum(sr_return_amt) ctr_total_return
FROM
tpcds.sf1.store_returns,
tpcds.sf1.date_dim
WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year = 2000) )
GROUP BY sr_customer_sk, sr_store_sk
;
After a materialized view is refreshed, at the end of the effective grace period, any new query requests that arrive after the new refresh is complete are run against the new contents. Query requests created before a refresh is complete are run against the previously existing contents until the effective grace period for that table is over.
Troubleshooting#
When you first create a materialized view, it is helpful to be able to determine
the state of a refresh, particularly when a refresh has failed or the view data
appears to be stale. This can happen when the refresh takes longer than the
combined max_import_duration
and effective grace period. Materialized views
have their own metadata tables located in the default schema of the cache service database that contain current state and
other information. Metadata tables are named as the materialized view name with
$imports
added to the end. For example, given a materialized view,
example.example_schema.customer_total_return
, run the following query to refresh
and view the metadata for your materialized view:
SELECT * FROM example.example_schema."customer_total_return$imports"
Note
You must enclose <your_table_name>$imports
in quotes so that the query
parser handles the dollar sign correctly.
The resulting metadata table contains the following fields:
status
- Scheduled, Running, Finished, Failed, Timeoutmax_import_duration
- The value originally set in theCREATE AS
statementstart_time
- As computed from thecron
orrefresh_interval
finish_time
row_count
error
The metadata for a refresh is maintained until the effective grace period passes.
Table functions#
The connector provides specific table functions to access Hive.
UNLOAD#
The UNLOAD
SQL statement is a pre-built table function within the system
schema that writes files directly to storage. The files that UNLOAD
writes to
storage corresponds to what the input
parameter selects.
SELECT * FROM TABLE(system.unload(
input => TABLE(...) [PARTITION BY col (, ...)],
location => '',
format => ''
[, compression => '']
[, separator => '']
))
Note
The input
, location
, and format
parameters are required. The
compression
, and separator
parameters are optional.
The input
parameter can accept either a table name or a SELECT
query. The
separator
parameter is applicable only when the format
argument is set to
CSV
or TEXTFILE
. See
object storage file formats for
more details.
Roles do not have access to the UNLOAD
function by default. To enable role
access, users must have the UNLOAD
function privilege on the catalog, as well as the corresponding
location privilege on the object storage location
where the output is written.
Examples#
To unload the orderkey
, custkey
, and orderstatus
columns from the orders
table to the location s3://mybucket/my/unload/location
in ORC
file format:
SELECT
*
FROM
TABLE (
system.unload (
input => TABLE (
SELECT
orderkey,
custkey,
orderstatus
FROM
tpch.sf1.orders
)
PARTITION BY
(orderstatus),
location => 's3://mybucket/my/unload/location',
format => 'ORC'
)
)
An example of UNLOAD
using input => TABLE(tpch.sf1.orders)
:
SELECT
*
FROM
TABLE (
system.unload (
input => TABLE (tpch.sf1.orders)
PARTITION BY
(orderstatus),
location => 's3://mybucket/my/unload/location',
format => 'ORC'
)
)
An example of UNLOAD
with multiple partitions:
SELECT
*
FROM
TABLE (
system.unload (
input => TABLE (
SELECT
orderkey,
custkey,
orderdate,
orderstatus
FROM
tpch.sf1.orders
)
PARTITION BY
(orderdate, orderstatus),
location => 's3://mybucket/my/unload/location',
format => 'TEXTFILE',
compression => 'GZIP',
separator => '|'
)
)
Supported parameters#
Supported format
parameters:
ORC
PARQUET
AVRO
RCBINARY
RCTEXT
SEQUENCEFILE
JSON
OPENX_JSON
TEXTFILE
CSV
Supported compression
parameters:
NONE (default)
SNAPPY
LZ4
ZSTD
GZIP
Limitations#
Each format has its own set of constraints. The CSV
format exclusively
supports VARCHAR
columns and AVRO files do not permit special characters in
the column names.
Performance#
The connector includes a number of performance improvements, detailed in the following sections.
Dynamic row filtering#
Dynamic filtering, and specifically also dynamic row filtering, is enabled by default. Row filtering improves the effectiveness of dynamic filtering for a connector by using dynamic filters to remove unnecessary rows during a table scan. It is especially powerful for selective filters on columns that are not used for partitioning, bucketing, or when the values do not appear in any clustered order naturally.
As a result the amount of data read from storage and transferred across the network is further reduced. You get access to higher query performance and a reduced cost.
You can use the following properties to configure dynamic row filtering:
Property name |
Description |
---|---|
|
Toggle dynamic row filtering. Defaults to |
|
Control the threshold for the fraction of the selected rows from the
overall table above which dynamic row filters are not used. Defaults to
0.7. Catalog session property name is
|
|
Duration to wait for completion of dynamic row filtering. Defaults to 0.
The default causes query processing to proceed without waiting for the
dynamic row filter, it is collected asynchronously and used as soon as
it becomes available. Catalog session property name is
|
Storage caching#
The connector supports the default storage caching. In addition, if HDFS Kerberos authentication is enabled in your catalog properties file with the following setting, caching takes the relevant permissions into account and operates accordingly:
hive.hdfs.authentication.type=KERBEROS
Additional configuration for Kerberos is required.
If HDFS Kerberos authentication is enabled, you can also enable user impersonation using:
hive.hdfs.impersonation.enabled=true
The service user assigned to SEP needs to be able to access data files in underlying storage. Access permissions are checked against impersonated user, yet with caching in place, some read operations happen in context of system user.
Any access control defined with the integration of Apache Ranger or the Privacera platform is also enforced by the storage caching.
Auth-to-local user mapping#
The connector supports auth-to-local mapping of the impersonated username during
HDFS access. This requires enabling HDFS impersonation and setting the
hive.hdfs.auth-to-local.config-file
property to a path containing a mapping
file in the format described in auth-to-local translations file. You can configure regular refresh of the
configuration file with hive.hdfs.auth-to-local.refresh-period
.
Starburst Cached Views#
The connector supports Starburst Cached Views and can therefore be configured for table scan redirection and materialized views to improve performance.
Security#
The connector includes a number of security-related features, detailed in the following sections.
Built-in access control#
If you have enabled built-in access control for SEP, you must add the following configuration to all Hive catalogs:
hive.security=starburst
Microsoft Entra ID (formerly Azure Active Directory) credential pass-through#
To use Microsoft Entra ID with credential pass-through, you must include the following configuration in the config.properties file:
http-server.authentication.oauth2.scopes=https://storage.azure.com/user_impersonation,openid
http-server.authentication.oauth2.additional-audiences=https://storage.azure.com
If you use Microsoft Entra ID as the identity provider when you integrate with Azure Storage, you can reuse the Entra ID token to access Azure Storage blobs.
To reuse the Entra ID token, set the following access property in the catalog properties file:
# If using the native Azure filesystem support, instead set
# azure.use-oauth-passthrough-token=true
hive.azure.abfs.oauth2.passthrough=true
Enable the OAuth 2.0 token pass-through authentication type with the following configuration in Config properties:
http-server.authentication.type=DELEGATED-OAUTH2
You must grant the Azure application user_impersonation
API permissions for
Azure Storage. You can further read about it in the Azure
docs
To ensure blobs are accessible, read about Azure Storage access configuration: * Roles for the Azure Storage * Access control model in ADLS Gen2 * Managing ACLs in ADLS Gen2
For more information about DELEGATED-OAUTH2
, see
OAuth 2.0 token pass-through.
You can only set one type or group of access properties in a catalog properties file. Setting more than one prevents SEP from running properly. The valid access property combinations for the catalog properties file include the following:
hive.azure.abfs.oauth2.passthrough
.hive.azure.abfs-storage-account
andhive.azure.abfs-access-key
.hive.azure.abfs.oauth.endpoint
,hive.azure.abfs.oauth.client-id
, andhive.azure.abfs.oauth.secret
.
HDFS permissions#
Before running any CREATE TABLE
or CREATE TABLE ... AS
statements for
Hive tables in SEP, you need to check that the operating system user running
the SEP server has access to the Hive warehouse directory on HDFS.
The Hive warehouse directory is specified by the configuration variable
hive.metastore.warehouse.dir
in hive-site.xml
, and the default value is
/user/hive/warehouse
. If that is not the case, either add the following to
jvm.config
on all of the nodes: -DHADOOP_USER_NAME=USER
, where USER
is an operating system user that has proper permissions for the Hive warehouse
directory, or start the SEP server as a user with similar permissions. The
hive
user generally works as USER
, since Hive is often started with the
hive
user. If you run into HDFS permissions problems on CREATE TABLE ... AS
, remove /tmp/presto-*
on HDFS, fix the user as described above, then
restart all of the SEP servers.
LDAP user translation#
The connector supports LDAP-based user translation with a HMS metastore and/or HDFS object storage system.
When using LDAP-based user translation, you must configure the appropriate prefix for the service you’re connecting to.
For a HMS metastore, the hive.metastore.thrift.impersonation.enabled
catalog
configuration property must be set to true
. For a HDFS object storage
system, the hive.hdfs.impersonation.enabled
catalog configuration property
must be set to true
.
Limitations#
The following limitation apply in addition to the limitations of the Hive connector.
Reading ORC ACID tables created with Hive Streaming ingest is not supported.
Redirections are supported for Hive tables but not Hive views.