Starburst Hive connector#
The Starburst Hive connector is an extended version of the Hive connector, with identical configuration and usage.
Requirements#
Fulfill the Hive connector requirements.
Additional features of the connector require a valid Starburst Enterprise license, unless otherwise noted.
Extensions#
The Starburst Hive connector supports the improvements detailed in the security and performance sections, and includes the following additional extensions:
Amazon Glue support#
Statistics collection is supported for Hive Metastore and Amazon Glue.
Configuring and using SEP with AWS Glue is described in the AWS Glue documentation section.
Cloudera support#
The connector supports the Cloudera Data Platform (CDP) and the predecessor Cloudera Distributed Hadoop (CDH) platform.
IBM Cloud Object Storage support#
The connector supports querying IBM Cloud Object Storage.
MapR Hive support#
The connector includes support for a MapR-based Hive metastore as well as the MapR filesystem.
OpenX JSON format support#
The connector supports reading and writing data to tables as JSON files, using the OpenX JSON serialization and deserialization (serde) from the Java class `org.openx.data.jsonserde.JsonSerDe`. Existing tables using that serde, and all the associated serde properties, are handled automatically.
The actual serde implementation is a fork of the original OpenX serde, updated to be compatible with the Hive 3 APIs used in SEP. The binary package of the forked serde implementation is available from Starburst Support. For optimal compatibility, install the package on any systems that read and write your Hive-managed storage with Hive 3.
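Tables using this serde are declared with standard Hive DDL. The following sketch assumes the forked serde package is installed on the Hive side; the table and column names are hypothetical:

```sql
-- Hypothetical Hive DDL declaring a JSON-backed table with the OpenX serde.
-- SEP handles tables declared this way automatically.
CREATE TABLE events (
  event_id BIGINT,
  payload STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
```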
Configuration#
The connector configuration is similar to the configuration for the base Hive connector, with these additional properties:
| Property name | Description |
|---|---|
| `hive.accelerated-parquet-reader.enabled` | Use the accelerated Parquet reader. Defaults to `true`. |
| | Name of the catalog to which … |
| | Deprecated, use … |
SQL support#
This connector provides read access and write access to data and metadata in Hive. In addition to the globally available statements and read operation statements, the connector supports the following features:
- Materialized view management, see also Materialized views and Automated materialized view management
Materialized views#
The connector supports materialized view management, with the following requirements:
- The cache service must be configured and running.
- Catalogs must be configured to allow materialized views.
In the underlying system, each materialized view consists of a view definition and a storage table. The storage table name is stored as a materialized view property. The materialized data is stored in that storage table.
Materialized views are populated with data and refreshed manually with the
REFRESH MATERIALIZED VIEW command, or by the
Automated materialized view management. Storage configuration for the storage
tables must be supplied with the standard Hive connector table properties in the
WITH
statement:
| Property name | Description | Default |
|---|---|---|
| `format` | File format to use for the storage table. Valid values include `ORC` and `PARQUET`. | `ORC` |
| `partitioned_by` | Partitioning column for the storage table. | |
| `bucketed_by` | Bucketing column for the storage table. Must be used with `bucket_count`. | |
| `bucket_count` | Number of buckets to group data into. Must be used with `bucketed_by`. | |
| `sorted_by` | Column to sort by to determine bucketing for each row. | |
The automatic refresh is configured with properties set in the WITH clause of the CREATE MATERIALIZED VIEW statement.
The CREATE MATERIALIZED VIEW
statement specifies the query to define the
data for the materialized view, the refresh schedule, and other parameters used
by the cache service. The query can access any available catalog and schema.
Once a materialized view is created, it is accessed like a table using the name of the materialized view.
Use the SHOW CREATE MATERIALIZED VIEW statement to view the
complete CREATE MATERIALIZED VIEW
statement for a materialized view,
including the properties in the WITH
clause.
Dropping a materialized view with DROP MATERIALIZED VIEW removes the definition and the storage table.
Configuration#
Specify the following configuration properties in the catalog properties file for each desired catalog to enable materialized view creation and usage in that catalog.
materialized-views.enabled=true
materialized-views.namespace=<your_namespace>
materialized-views.storage-schema=<your_storage_schema>
cache-service-uri=http://<my-cache-service-hostname>:8180
| Property name | Description | Required |
|---|---|---|
| `materialized-views.enabled` | Set to `true` to enable materialized views in the catalog. | true |
| `materialized-views.storage-schema` | Specifies the schema used to store the storage tables for materialized views. Ensure that the proper access control exists on that schema to prevent users from directly accessing storage tables. | true |
| `materialized-views.namespace` | Used by the cache service to create fully-qualified names for materialized views, and to identify which catalog is used to run the scheduled refresh queries. | true |
| `cache-service-uri` | The URI of the SEP cache service. | true |
| `materialized-views.run-as-invoker` | Directs SEP to run as the user submitting the query when present in a catalog and set to `true`. | false |
The storage schema must be defined in the catalog properties file.
materialized-views.storage-schema=mymvstorage
If it does not exist yet, you must create it with a defined location:
CREATE SCHEMA myhive.mymvstorage WITH (location = 's3a://mymvstorage/');
In addition, the schema for the materialized view itself must exist.
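For example, following the pattern of the storage schema statement above, the view's own schema can be created with its own location; the bucket name is an example value:

```sql
-- Create the schema that holds the materialized view definition itself.
-- The s3a location is an example; use your own bucket.
CREATE SCHEMA myhive.myschema WITH (location = 's3a://myschema/');
```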
With the cache service running, the catalog configured and the schemas defined,
you can proceed to create a materialized view. In this example, a materialized
view named myhive.myschema.my_materialized_view
is created:
CREATE MATERIALIZED VIEW myhive.myschema.my_materialized_view
WITH (
  grace_period = '15.00m',
  max_import_duration = '1.00m'
) AS
SELECT *
FROM mycatalog.public.my_table
WHERE my_field IN ( 'myvalue1', 'myvalue2' )
;
The query, specified after AS
, can be any valid query, including queries
accessing one or multiple other catalogs.
The properties for the view are stored in the cache service database, and the
data in the storage schema, myhive.mymvstorage
.
To query data in the materialized view, use a SELECT
statement as you would
for any other table:
SELECT * FROM myhive.myschema.my_materialized_view;
Troubleshooting#
Whether your materialized views are refreshed manually or by the cache service
using WITH
clause parameters, refreshes may fail if columns are added or
renamed at the source. If this happens, drop the materialized view, and create
it again.
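A minimal recovery sketch, reusing the view definition from the earlier example:

```sql
-- Drop the failed view; this removes its definition and storage table.
DROP MATERIALIZED VIEW myhive.myschema.my_materialized_view;

-- Recreate it so the storage table matches the changed source schema.
CREATE MATERIALIZED VIEW myhive.myschema.my_materialized_view
WITH (
  grace_period = '15.00m',
  max_import_duration = '1.00m'
) AS
SELECT *
FROM mycatalog.public.my_table
WHERE my_field IN ( 'myvalue1', 'myvalue2' );
```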
Automated materialized view management#
The connector uses the cache service to
automate the maintenance of materialized views. Configuration properties are
specified in the WITH
clause when creating a materialized view:
| Property name | Description | Default |
|---|---|---|
| `refresh_interval` | Frequency at which the cache service triggers a refresh of the materialized view, for instance `1440m`. | |
| `cron` | Unix cron expression specifying a schedule for regular refresh of the materialized view, for example `30 2 * * *`. | |
| `max_import_duration` | Maximum allowed execution time for the refresh of the materialized view to complete. Measured from the scheduled time to the completion of the query execution. If a refresh fails for exceeding the maximum duration, the cache service attempts a refresh at the next scheduled time. | |
| `grace_period` | After a view's TTL (Time to Live, calculated as …) … | |
| `incremental_column` | Column used during incremental refresh by the service to apply an … | |
| `namespace` | Namespace used by the cache service to create a fully qualified name for materialized views in a catalog. | |
Example#
In the following example, a materialized view named customer_total_return
in
myhive.myschema
is created to automatically refresh daily at 2:30AM:
CREATE MATERIALIZED VIEW myhive.myschema.customer_total_return
WITH (
grace_period = '5.00m',
refresh_interval = '1440m',
max_import_duration = '30.00m',
cron = '30 2 * * *'
) AS
SELECT
sr_customer_sk ctr_customer_sk,
sr_store_sk ctr_store_sk,
sum(sr_return_amt) ctr_total_return
FROM
tpcds.sf1.store_returns,
tpcds.sf1.date_dim
WHERE ( (sr_returned_date_sk = d_date_sk) AND (d_year = 2000) )
GROUP BY sr_customer_sk, sr_store_sk
;
After a materialized view is refreshed, at the end of the effective grace period, any new query requests that arrive after the new refresh is complete are run against the new contents. Query requests created before a refresh is complete are run against the previously existing contents until the effective grace period for that table is over.
Troubleshooting#
When you first create a materialized view, it is helpful to be able to determine
the state of a refresh, particularly when a refresh has failed or the view data
appears to be stale. This can happen when the refresh takes longer than the
combined max_import_duration
and effective grace period. Materialized views
have their own metadata tables located in the default schema of the cache
service database that contain current state and
other information. Metadata tables are named as the materialized view name with
$imports
added to the end. For example, given a materialized view,
myhive.myschema.customer_total_return
, run the following query to view the
metadata for its refresh:
SELECT * FROM myhive.myschema."customer_total_return$imports"
Note
You must enclose <your_table_name>$imports
in quotes so that the query
parser handles the dollar sign correctly.
The resulting metadata table contains the following fields:
- `status` - `Scheduled`, `Running`, `Finished`, `Failed`, or `Timeout`
- `max_import_duration` - the value originally set in the `CREATE AS` statement
- `start_time` - as computed from the `cron` or `refresh_interval`
- `finish_time`
- `row_count`
- `error`
The metadata for a refresh is maintained until the effective grace period passes.
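For instance, to inspect only failed refreshes of the earlier example view, filter the metadata table on the `status` field:

```sql
-- Show when failed refreshes started and why they failed.
SELECT status, start_time, error
FROM myhive.myschema."customer_total_return$imports"
WHERE status = 'Failed';
```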
Performance#
The connector includes a number of performance improvements, detailed in the following sections.
Storage caching#
The connector supports the default storage caching. In addition, if HDFS Kerberos authentication is enabled in your catalog properties file with the following setting, caching takes the relevant permissions into account and operates accordingly:
hive.hdfs.authentication.type=KERBEROS
Additional configuration for Kerberos is required.
If HDFS Kerberos authentication is enabled, you can also enable user impersonation using:
hive.hdfs.impersonation.enabled=true
The service user assigned to SEP needs to be able to access data files in the underlying storage. Access permissions are checked against the impersonated user, yet with caching in place, some read operations happen in the context of the system user.
Any access control defined with the integration of Apache Ranger or the Privacera platform is also enforced by the storage caching.
Auth-to-local user mapping#
The connector supports auth-to-local mapping of the impersonated username during
HDFS access. This requires enabling HDFS impersonation and setting the
hive.hdfs.auth-to-local.config-file
property to a path containing a mapping
file in the format described in auth-to-local translations file. You can configure regular refresh of the
configuration file with hive.hdfs.auth-to-local.refresh-period
.
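A sketch of the relevant catalog properties, assuming a hypothetical mapping file path and an example refresh period:

```properties
# Auth-to-local mapping requires HDFS impersonation to be enabled.
hive.hdfs.impersonation.enabled=true
# Path and refresh period are example values; adjust for your deployment.
hive.hdfs.auth-to-local.config-file=/etc/starburst/auth-to-local.json
hive.hdfs.auth-to-local.refresh-period=60s
```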
Starburst Cached Views#
The connector supports Starburst Cached Views and can therefore be configured for table scan redirection and materialized views to improve performance.
Accelerated Parquet reader#
The connector includes a Parquet reader with numerous performance improvements for reading data files using the Parquet format. The reader is enabled by default.
You can disable it, and therefore use the default reader from Trino, in your catalog properties file:
hive.accelerated-parquet-reader.enabled=false
Alternatively you can disable it for a specific user session with the
catalog session property
accelerated_parquet_reader_enabled
.
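For example, assuming a catalog named myhive, the reader can be disabled for a single session:

```sql
-- Disable the accelerated Parquet reader for the current session only;
-- subsequent queries in this session use the default Trino reader.
SET SESSION myhive.accelerated_parquet_reader_enabled = false;
```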
Security#
The connector includes a number of security-related features, detailed in the following sections.
Authorization options#
SEP provides several authorization options for use with the Hive connector:
- Apache Ranger is the recommended choice to provide global, system-level security, and can optionally be used with other connectors.
- Apache Sentry is supported with known limitations.
HDFS permissions#
Before running any CREATE TABLE
or CREATE TABLE ... AS
statements for
Hive tables in SEP, you need to check that the operating system user running
the SEP server has access to the Hive warehouse directory on HDFS.
The Hive warehouse directory is specified by the configuration variable
hive.metastore.warehouse.dir
in hive-site.xml
, and the default value is
/user/hive/warehouse
. If that is not the case, either add the following to
jvm.config
on all of the nodes: -DHADOOP_USER_NAME=USER
, where USER
is an operating system user that has proper permissions for the Hive warehouse
directory, or start the SEP server as a user with similar permissions. The
hive
user generally works as USER
, since Hive is often started with the
hive
user. If you run into HDFS permissions problems on CREATE TABLE ...
AS
, remove /tmp/presto-*
on HDFS, fix the user as described above, then
restart all of the SEP servers.
Limitations#
The following limitations apply in addition to the limitations of the Hive connector.
Reading ORC ACID tables created with Hive Streaming ingest is not supported.
Redirections are supported for Hive tables but not Hive views.
Hive 3 related limitations#
- For security reasons, the `sys` system catalog is not accessible in SEP.
- Hive's `timestamp with local zone` data type is not supported in SEP. It is possible to read from a table having a column of this type, but the column itself is not accessible. Writing to such a table is not supported.
- SEP does not correctly read `timestamp` values from Parquet, RCFile with binary serde, and Avro file formats created by Hive 3.1 or later, due to Hive issues HIVE-21002 and HIVE-22167. When reading from these file formats, SEP returns different results.