Configuring Starburst Enterprise with CFT#

Starburst Enterprise platform (SEP) has an extensive set of configuration options that allow it to be tuned for specific requirements. Default values are chosen for the best “out of the box” experience. If you need to fine-tune SEP behavior, you can do so when using Starburst’s CloudFormation template (CFT).

Default configuration#

The following configuration changes are applied automatically for you:

  • Java heap maximum memory (-Xmx) is set appropriately for the selected EC2 instance type

  • JVM’s JIT caches are set to 512 MiB

  • Java is configured to use the G1 garbage collector, which is the recommended garbage collector when running SEP

  • If Hive Metastore is configured (refer to Configuring the Hive Metastore Service with CFT), the hive catalog is configured with connector configuration left at default values.

  • A query audit event listener is configured in etc/event-listener-audit-log.properties. If you have configured another event listener, add the property event-listener.config-files to the config properties file, and ensure both files are in the comma-separated list.

  • The query.max-memory property is set to 1PB. This setting overrides the low default value.

Note

All configuration changes generated by the CFT are stored in the etc directory of the SEP installation directory. Because the installation directory itself is mounted as a RAM disk, files generated by the CFT configuration are also stored in memory only.

No secrets in any files, such as usernames or passwords in catalog files, are stored on disk at any time, and the files cannot be accessed from outside the running EC2 instances.

Custom configuration#

When using Starburst’s CloudFormation template, configuration packages for the coordinator, workers and catalogs are used to customize SEP. These configuration packages are used to append or override the default SEP configuration.

The CloudFormation template provides the AdditionalCoordinatorConfigurationURI and AdditionalWorkersConfigurationURI parameters used to specify the locations of the configuration packages for the coordinator and workers respectively. See the following sections for how to create, upload, and use configuration packages for SEP.

Note

All configuration changes made to your SEP cluster must be performed via the CloudFormation Template. If you manually change the configurations on the instances running SEP, the changes are not persisted.

Creating a configuration package#

A configuration package is a ZIP file with the structure shown below. All files are optional except for the top-level etc/ directory entry.

etc/
  config.properties
  jvm.config
  catalog/
    hive.properties
    <catalog-name>.properties

Warning

You must use this exact directory structure or SEP is unable to start correctly.
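
As a minimal sketch of assembling such a package, the following Python uses only the standard-library zipfile module. The function name build_config_package and the sample file content are illustrative assumptions and not part of the CFT. The explicit etc/ entry mirrors the requirement that the top-level directory entry is present.

```python
import io
import zipfile
from typing import Dict


def build_config_package(files: Dict[str, str]) -> bytes:
    """Return ZIP bytes with an explicit etc/ entry plus the given files.

    `files` maps package-relative paths (for example "etc/jvm.config")
    to their text content. Every path must be a file under etc/.
    """
    for path in files:
        if not path.startswith("etc/") or path.endswith("/"):
            raise ValueError(f"not a file under etc/: {path}")

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # A name ending in "/" is stored as a directory entry.
        archive.writestr("etc/", "")
        for path, content in sorted(files.items()):
            archive.writestr(path, content)
    return buffer.getvalue()


# Hypothetical usage: the property value is a sample only.
package_bytes = build_config_package({
    "etc/config.properties": "query.max-memory=1PB\n",
})
```

Writing package_bytes to a file such as package-1.0.zip yields an archive that can then be uploaded as described in the following sections.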

Startup script nodes#

Node name

Description

etc/config.properties

This global configuration file is optional. Refer to the properties reference documentation for details.

etc/jvm.config

This Java Virtual Machine configuration file is optional. Certain options, including -Xmx and the garbage collection algorithm, are set by default.

etc/catalog/hive.properties

If the configuration package contains this file and the Hive Metastore is not configured (refer to Configuring the Hive Metastore Service with CFT) when launching Starburst’s CloudFormation template, then the file must contain the following:

  connector.name=hive
  hive.metastore.uri=thrift://example.net:9083

If the MetastoreType parameter is set to something other than None, the hive.properties file is created for you and you do not need to provide the properties above. However, you can still provide a hive.properties file with properties to append to the configuration. Refer to the Hive connector documentation for the options that can be set here. Also refer to Auxiliary files in this table for instructions on how to configure properties that refer to additional files.

etc/catalog/<catalog-name>.properties

When such a file is placed in the configuration package, a catalog called <catalog-name> is created. The file must contain the following:

  connector.name=<connector_name>

Here, <connector_name> is the name of the connector. Refer to the connector documentation for a list of supported connectors. If the chosen connector has mandatory configuration parameters, they must be set in the <catalog-name>.properties file. The etc/catalog/ folder of the configuration package can contain more than one such file, which allows you to define multiple catalogs.

Refer to Auxiliary Files in this table for instructions on how to configure properties that refer to additional files.

Auxiliary files

If a configuration property in any of the configuration files accepts a path to an additional file (e.g., Hive’s security.config-file), add the file to the configuration package and refer to it using a path that is relative to the configuration package top-level directory.

For example, if you are configuring the Hive connector to use hive.security=file, you must also set security.config-file (see File-based access control for the meaning and structure of the file). To do so, add etc/catalog/hive-security.json to the configuration package and refer to it using a relative path:

  hive.security=file
  security.config-file=etc/catalog/hive-security.json
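
A package can be checked for dangling file references before it is uploaded. The sketch below is one possible approach, not a Starburst tool: the helper names are hypothetical, the properties parser is deliberately simplified, and the caller supplies the list of property keys that are assumed to hold paths.

```python
import io
import zipfile
from typing import Dict, Iterable, List


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    result = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        result[key.strip()] = value.strip()
    return result


def missing_auxiliary_files(package: bytes, path_keys: Iterable[str]) -> List[str]:
    """List 'file: key -> path' entries whose referenced path is not in the ZIP."""
    wanted = set(path_keys)
    problems = []
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        names = set(archive.namelist())
        for name in sorted(names):
            if not name.endswith(".properties"):
                continue
            props = parse_properties(archive.read(name).decode("utf-8"))
            for key in sorted(wanted.intersection(props)):
                # Paths are relative to the package top-level directory.
                if props[key] not in names:
                    problems.append(f"{name}: {key} -> {props[key]}")
    return problems
```

Calling missing_auxiliary_files(package_bytes, ["security.config-file"]) returns an empty list when every referenced file is present in the archive.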

Uploading a configuration package to S3#

To use a configuration package ZIP when launching Starburst’s CloudFormation template, you must first upload it to an S3 location of your choice.

Warning

If the configuration package contains sensitive information such as passwords, AWS access keys or Kerberos keytab files, make sure to use an S3 location that is not publicly accessible.
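
The upload can be scripted. The following sketch assumes the boto3 library is available and that s3_client is a boto3 S3 client, for example boto3.client("s3"). The helper names are hypothetical. The code does not change bucket permissions, so keeping the location private remains a separate bucket-level setting.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split s3://bucket/key into (bucket, key) and reject anything else."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"expected s3://bucket/key, got {uri!r}")
    return parsed.netloc, key


def upload_package(s3_client, local_path: str, uri: str) -> None:
    """Upload a local ZIP file to the given S3 URI.

    Assumption: `s3_client` behaves like a boto3 S3 client. Only its
    upload_file(Filename, Bucket, Key) method is used.
    """
    bucket, key = split_s3_uri(uri)
    s3_client.upload_file(local_path, bucket, key)
```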

Using a configuration package#

When launching Starburst’s CloudFormation template, you can use the AdditionalCoordinatorConfigurationURI and AdditionalWorkersConfigurationURI parameters to refer to the configuration package that should be applied on top of the default configuration applied by the template. The URI should be of the form s3://my_bucket/path/to/configuration/package.zip. You can use a single configuration package for both the SEP coordinator and the workers, or use a different package for each. You can also provide a configuration package only for the coordinator or only for the workers.

If you upload to a location that is not publicly accessible, you must use the IamInstanceProfile parameter when launching the cluster, and the selected instance profile must allow read access to the selected S3 location.
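
When the stack is launched through the AWS API rather than the console, these values are passed as a list of ParameterKey and ParameterValue entries. The sketch below builds that list for the three parameters discussed here. The function name package_parameters is hypothetical, and the s3:// prefix check is a simplification rather than the template's own validation.

```python
from typing import Dict, List, Optional


def package_parameters(coordinator_uri: Optional[str] = None,
                       workers_uri: Optional[str] = None,
                       instance_profile: Optional[str] = None) -> List[Dict[str, str]]:
    """Build CloudFormation parameter entries for the supplied values only."""
    values = {
        "AdditionalCoordinatorConfigurationURI": coordinator_uri,
        "AdditionalWorkersConfigurationURI": workers_uri,
        "IamInstanceProfile": instance_profile,
    }
    parameters = []
    for key, value in values.items():
        if value is None:
            continue  # leave the template default in place
        if key.endswith("URI") and not value.startswith("s3://"):
            raise ValueError(f"{key} must be an s3:// URI, got {value!r}")
        parameters.append({"ParameterKey": key, "ParameterValue": value})
    return parameters
```

Passing the same URI for coordinator_uri and workers_uri corresponds to using a single package for both node types.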

Updating a configuration package#

Instead of deleting a CloudFormation stack and creating a new one, you can use the AWS stack update feature to update the SEP configuration package. First create a new configuration package with the necessary changes, and upload it to S3 as described in the previous sections. Then, when updating the CloudFormation stack, enter the new S3 location as the value of the AdditionalCoordinatorConfigurationURI and AdditionalWorkersConfigurationURI parameters. When CloudFormation applies the update, it uses the new configuration package to configure SEP.

AWS CloudFormation does not update the stack if the parameter values have not changed. Therefore you must create a new configuration package ZIP file with a different name. We recommend including a version number in the file name to avoid confusion when updating your configuration.

For example, if the original configuration package was located at s3://my_bucket/path/to/configuration/package-1.0.zip, then create a new configuration package with a location such as: s3://my_bucket/path/to/configuration/package-2.0.zip. Even if you change the contents of s3://my_bucket/path/to/configuration/package-1.0.zip and keep the name, CloudFormation is not able to update the configuration.
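
One way to make sure every update gets a fresh name is to derive it from the previous one. The sketch below assumes the package-<major>.<minor>.zip naming used in the example above. The function name is hypothetical, and bumping the major version is only one possible convention.

```python
import re


def next_package_uri(uri: str) -> str:
    """Return the URI with the major version in '-<major>.<minor>.zip' bumped."""
    match = re.search(r"-(\d+)\.(\d+)\.zip$", uri)
    if match is None:
        raise ValueError(f"no '-<major>.<minor>.zip' suffix in {uri!r}")
    major = int(match.group(1)) + 1
    # Keep everything before the version suffix and reset the minor version.
    return f"{uri[:match.start()]}-{major}.0.zip"
```

For the location above, next_package_uri("s3://my_bucket/path/to/configuration/package-1.0.zip") produces the package-2.0.zip location.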

Interactions between default and custom configurations#

Default values are overridden only for keys that the custom configuration sets. Keys without a customization keep their default value. The exception is jvm.config, where custom entries are appended to the default configuration.
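
The two behaviors can be modeled in a few lines of Python. This is a conceptual sketch of the rules described here, not the template's actual implementation. The function names and the -Dexample.flag entry are hypothetical.

```python
from typing import Dict, List


def merge_properties(default: Dict[str, str], custom: Dict[str, str]) -> Dict[str, str]:
    """Key-wise override: custom keys win, untouched defaults remain."""
    merged = dict(default)
    merged.update(custom)
    return merged


def merge_jvm_config(default_lines: List[str], custom_lines: List[str]) -> List[str]:
    """jvm.config entries are appended after the defaults, not replaced."""
    return list(default_lines) + list(custom_lines)


# Properties files: only the customized key changes.
effective = merge_properties(
    {"query.max-memory": "1PB", "example.key": "default"},
    {"example.key": "custom"},
)

# jvm.config: the custom line is added after the default line.
jvm_lines = merge_jvm_config(["-XX:+UseG1GC"], ["-Dexample.flag=true"])
```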

CFT configuration parameters#

The CFT includes numerous configuration parameters that are grouped in different sections. All listed parameters have a description in the AWS console.

Network configuration#

Network Configuration Parameters#

Parameter key

Description

Example

VPC

Virtual Private Cloud ID

vpc-4bd6ca11

Subnet

Subnet to use for SEP nodes (must belong to the selected VPC)

subnet-123abc2b

SelectedSubnetAutoAssignsPublicIp

Set to no if the selected subnet does not provide public IPs. In this case VPC endpoints are created for the SEP stack. VPC endpoints create an EndpointSecurityGroup. There is no option to use an existing security group for the endpoint.

yes

SecurityGroups

Additional security groups for SEP nodes (e.g., allowing SSH access). You must select at least one.

sg-12e34aeb

EC2 configuration#

The EC2 configuration details the infrastructure used for your SEP cluster.

Choose a CoordinatorInstanceType and WorkerInstanceType suitable for your workload. The r4.4xlarge instance types are chosen by default and work well for most workloads. See our CFT deployment guide for information about what instance types may be best for you.

EC2 Configuration Parameters#

Parameter key

Description

Default

Example

CoordinatorInstanceType

EC2 instance type of the coordinator.

r4.xlarge

r5.12xlarge

WorkerInstanceType

EC2 instance type of the workers.

r4.xlarge

m5.4xlarge

KeyName

Name of an EC2 KeyPair to enable SSH access to the instance. See SSH keys for more details.

john.smith

WorkersCount

Number of dedicated worker nodes (apart from coordinator) to instantiate. Worker nodes are added to an AWS AutoScaling Group. See Auto scaling for more details.

10

HACoordinatorsCount

Number of coordinator nodes to instantiate. If there is more than one, the coordinator offers HA capabilities. This number represents one active coordinator plus the number of optional hot-standby coordinators. For example, if you specify 3, there is 1 active coordinator and 2 standby coordinators that can take over if the active one fails. See Coordinator high availability for more details.

1

3

WorkerMountVolume

Mount an additional EBS volume on each worker at /data. This is required when using caching for distributed storage. Make sure that the /data directory is configured in your Hive catalog properties.

no

yes

WorkerVolumeType

Type of the additional EBS volume mounted on the workers.

io1

gp2

WorkerVolumeSize

Size of the additional EBS volume mounted on the workers, in GiB. Use at least 10 GiB with the io1 volume type. Value must be in the range of 4 to 16384.

4

100

WorkerVolumeIOPS

The number of possible I/O operations per second for the additional volume. Used only with the io1 volume type. Each 5000 I/O ops require at least 100 GiB storage size on the volume. Value must be in the range of 100 to 20000.

100

2000

KeepCoordinatorNode

(Debug only) Keep coordinator node running after the coordinator service fails.

no

yes
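
The worker volume constraints listed in this table can be checked before launching a stack. The sketch below is an illustration with a hypothetical function name. It reads "each 5000 I/O ops require at least 100 GiB" as a limit of 50 IOPS per GiB, which is an assumption about the intended rule.

```python
from typing import List


def check_worker_volume(volume_type: str, size_gib: int, iops: int = 100) -> List[str]:
    """Return a list of problems with the worker volume settings (empty if none)."""
    problems = []
    if not 4 <= size_gib <= 16384:
        problems.append("WorkerVolumeSize must be between 4 and 16384 GiB")
    if volume_type == "io1":
        # WorkerVolumeIOPS is only used with the io1 volume type.
        if size_gib < 10:
            problems.append("io1 volumes need at least 10 GiB")
        if not 100 <= iops <= 20000:
            problems.append("WorkerVolumeIOPS must be between 100 and 20000")
        # Assumption: 5000 IOPS per 100 GiB is read as at most 50 IOPS per GiB.
        if iops > 50 * size_gib:
            problems.append(f"{iops} IOPS needs at least {-(-iops // 50)} GiB")
    return problems
```

With the example values from the table, check_worker_volume("gp2", 100) reports no problems, while an io1 volume of 10 GiB with 2000 IOPS is flagged as too small.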

SEP configuration#

The SEP configuration parameters allow you to configure all SEP-specific aspects of the coordinators and workers in your cluster.

SEP Configuration Parameters#

Parameter key

Description

AdditionalCoordinatorConfigurationURI

(Optional) URI of S3 zip file with additional configuration for the coordinator. This zip file must contain the required directory structure. Example s3://my_bucket/starburst-additional-coordinator-configuration-1.0.zip.

AdditionalWorkersConfigurationURI

(Optional) URI of S3 zip file with additional configuration for the workers. This zip file must contain the required directory structure. Example s3://my_bucket/starburst-additional-workers-configuration-1.0.zip.

BootstrapScriptURI

(Optional) URI of a shell script stored on S3 to execute on all nodes. The script runs after SEP is configured, but before it is started. For example, a bash script can be used to create directories, install additional software, deploy UDFs, or deploy other plugins. When the script is executed, a string argument value of coordinator or worker is passed in. Check for this argument value in your script to perform certain actions based on the node type. Example s3://my_bucket/starburst-bootstrap-1.0.sh.

StarburstHttpPort

Port to use for SEP coordinator and therefore the Starburst Enterprise web UI as well as JDBC and other client connections. Example 8080.

LicenseURI

URI of the SEP license in S3. This is only needed when deploying the CFT (using a privately shared SEP AMI) without subscribing to the AWS Marketplace. Example s3://my_bucket/starburstdata.license.

Hive connector options#

The Hive connector is required if you plan to access data in HDFS or S3. It requires a Hive Metastore so SEP knows where data lives. Refer to the dedicated documentation Configuring the Hive Metastore Service with CFT to determine your configuration.

Hive Connector Options#

Parameter key

Description

MetastoreType

Determines what metastore is used by the Hive connector. Defaults to None, which means that no Hive connector is provisioned. Example AWS Glue Data catalog.

ExternalMetastoreHost

When external Metastore is used (see MetastoreType parameter), this points to the host of the Metastore. Example metastore.example.com.

ExternalMetastorePort

When external Metastore is used (see MetastoreType parameter), this points to the Metastore service port number.

When set to 0 (the default value), the default port for each metastore type is used:

  • 3306 for External MySQL RDBMS

  • 5432 for External PostgreSQL RDBMS

  • 9083 for External Hive Metastore Service

Cannot be empty when MetastoreType is set to any of:

  • External MySQL RDBMS

  • External PostgreSQL RDBMS

  • External Hive Metastore Service

Example 9083.

ExternalRdbmsMetastoreUserName

When external Metastore is used (see MetastoreType parameter), this determines the JDBC connection user name. Cannot be empty when MetastoreType is set to either of:

  • External MySQL RDBMS

  • External PostgreSQL RDBMS

Example database_user_name.

ExternalRdbmsMetastorePassword

When external Metastore is used (see MetastoreType parameter), this determines the JDBC connection password. Cannot be empty when MetastoreType is set to either of:

  • External MySQL RDBMS

  • External PostgreSQL RDBMS

Example jdbc_user_p@55vv0rd.

ExternalRdbmsMetastoreDatabaseName

When external Metastore is used (see MetastoreType parameter), this determines the name of the database used for the JDBC connection. Cannot be empty when MetastoreType is set to either of:

  • External MySQL RDBMS

  • External PostgreSQL RDBMS

Example hivemetastore.
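
The port defaulting and the required-value rules in this table can be expressed as a small lookup. The sketch below restates those rules only. The function and constant names are hypothetical and the code is not taken from the template.

```python
from typing import Dict, List

DEFAULT_METASTORE_PORTS = {
    "External MySQL RDBMS": 3306,
    "External PostgreSQL RDBMS": 5432,
    "External Hive Metastore Service": 9083,
}
RDBMS_TYPES = ("External MySQL RDBMS", "External PostgreSQL RDBMS")
RDBMS_REQUIRED = (
    "ExternalRdbmsMetastoreUserName",
    "ExternalRdbmsMetastorePassword",
    "ExternalRdbmsMetastoreDatabaseName",
)


def effective_metastore_port(metastore_type: str, configured_port: int = 0) -> int:
    """Resolve ExternalMetastorePort, where 0 means the default for the type."""
    if metastore_type not in DEFAULT_METASTORE_PORTS:
        raise ValueError(f"not an external metastore type: {metastore_type!r}")
    if configured_port == 0:
        return DEFAULT_METASTORE_PORTS[metastore_type]
    return configured_port


def missing_rdbms_parameters(params: Dict[str, str]) -> List[str]:
    """List the JDBC parameters that are empty although an RDBMS type is selected."""
    if params.get("MetastoreType") not in RDBMS_TYPES:
        return []
    return [key for key in RDBMS_REQUIRED if not params.get(key)]
```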

Ranger and LDAP user synchronization#

The following parameters are related to the global access control with Apache Ranger and the related synchronization of Ranger with an LDAP backend for user and group information.

Ranger-related Configuration Parameters#

Parameter key

Description

EnableRanger

When enabled, Apache Ranger for global access control is added. Defaults to no. Note that all other settings in this section are ignored if Ranger is disabled. Example yes.

RangerAdminPassword

Administrator password for Ranger. At least 8 characters, including lowercase, uppercase and digit, are required. When reusing an existing external database for Ranger in your CFT stack, you must provide the same password as the initial one, to ensure access remains functional.

RangerBackendType

Type of database backend used for Apache Ranger. The default External PostgreSQL RDBMS is recommended for production usage. Built-in PostgreSQL RDBMS is ephemeral and only suitable for demo purposes.

ExternalRdbmsRangerHost

Hostname of the external PostgreSQL RDBMS server.

ExternalRdbmsRangerPort

Port of the external PostgreSQL RDBMS server. Defaults to 5432.

ExternalRdbmsRangerDatabaseName

Name of the database on the external PostgreSQL RDBMS server to use as Ranger database backend. The database must already exist. Defaults to ranger.

ExternalRdbmsRangerUserName

Name of the database user that Ranger uses to manage the database on the external PostgreSQL RDBMS. The user must exist, have full permissions to the database, and have the CREATEROLE permission granted. An additional user ‘ranger’ is created for non-admin database access. If you specify ‘ranger’, that single user is used for all operations. Defaults to rangeradmin.

ExternalRdbmsRangerPassword

Password for the database user.

RangerConfigFile

URL to an optional additional Ranger config file in an S3 bucket. A template is available to download. Modify the template and upload it to an S3 bucket. The config file is required for using Solr Audit with Ranger and other customizations. Example: s3://my-bucket/my-config_file.properties

RangerBootstrapScript

URL to an optional bootstrap script in an S3 bucket. The script is run before Ranger starts. For example, a bootstrap script can be used to provide truststore files. Example: s3://my-bucket/ranger-bootstrap.sh

EnableRangerUserSync

When enabled, Apache Ranger synchronizes users from an external LDAP directory. Requires Ranger to be enabled. Disabled by default. The RangerUserSyncConfigFile setting is ignored if Ranger user sync is disabled.

RangerUserSyncConfigFile

URL to Ranger user synchronization configuration file in S3 bucket. A user sync template is available to download. Create a modified copy of the template and upload it to an S3 bucket. Required if Ranger user sync is enabled. Example: s3://my-bucket/my-config_file.properties
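
The RangerAdminPassword rule from this table (at least 8 characters, including lowercase, uppercase, and digit) can be checked locally before the stack is launched. The sketch below is a plain restatement of that rule with a hypothetical function name. The template's own validation may differ in details such as which characters it accepts.

```python
def is_valid_ranger_admin_password(password: str) -> bool:
    """Check length >= 8 and presence of a lowercase, an uppercase, and a digit."""
    return (
        len(password) >= 8
        and any(ch.islower() for ch in password)
        and any(ch.isupper() for ch in password)
        and any(ch.isdigit() for ch in password)
    )
```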

Advanced AWS S3 configuration#

The advanced AWS S3 configuration parameters only affect the configuration of provisioned Hive catalogs in order to:

  • configure custom access credentials for AWS S3

  • access a third-party S3-compatible storage system

In both of these cases, you must set all three of the parameters listed in the following table:

Advanced S3 Configuration Parameters#

Parameter key

Description

Example

S3Endpoint

URI to AWS S3-compatible endpoint. Your choice of endpoint affects your ability to write to buckets. Specifying https://s3.us-east-2.amazonaws.com allows you to write to any bucket in that region, whereas specifying https://mybucket.s3-us-west-2.amazonaws.com restricts the metastore to reading and writing from a single bucket.

https://s3.us-east-2.amazonaws.com

S3AccessKey

Access key to AWS S3-compatible storage

AKIAIOSFODNN7EXAMPLE

S3SecretKey

Access secret to AWS S3-compatible storage

wJarXUI/PiYEXAMPLEKEY

Warning

Failure to set the S3Endpoint results in an empty value for both S3AccessKey and S3SecretKey in the hive-site.xml file generated for the CFT deployment, resulting in Access Denied exceptions at runtime.
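
Because the three parameters only work as a set, a pre-launch check can catch a partial configuration. The sketch below uses a hypothetical function name and treats the parameters as all-or-nothing, which is how this section describes them.

```python
from typing import Dict, List

S3_KEYS = ("S3Endpoint", "S3AccessKey", "S3SecretKey")


def missing_s3_parameters(params: Dict[str, str]) -> List[str]:
    """Return the S3 parameters that are empty while at least one other is set."""
    provided = [key for key in S3_KEYS if params.get(key)]
    if not provided or len(provided) == len(S3_KEYS):
        return []  # either unused or fully configured
    return [key for key in S3_KEYS if not params.get(key)]
```

For example, supplying only S3AccessKey and S3SecretKey returns ["S3Endpoint"], which is the situation the warning above describes.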

Monitoring#

Monitoring Parameters#

Parameter key

Description

Example

EnableCloudWatchMetrics

Enable integration with CloudWatch metrics. When enabled, OS and SEP metrics are reported for each cluster node and a CloudWatch Dashboard with cluster overview is created. Additional CloudWatch fees are charged. Refer to Configuring Starburst Enterprise with CloudWatch in CFT for more details.

no

IAM instance#

IAM instance parameters#

Parameter key

Description

Example

IamInstanceProfile

Optional name of an IAM instance profile to attach to SEP nodes. See Instance profiles for more detail. If you do not specify an instance profile, the CloudFormation template creates the necessary IAM role privileges.

my-ec2-instance-profile

Other parameters#

Other Parameters#

Parameter key

Description

Example

LaunchSuperset

When enabled, Superset is deployed and started on an EC2 instance

yes