Managed statistics#

You can configure Starburst Enterprise platform (SEP) to collect and store its own managed Table statistics for some data sources that only collect or expose a limited number of statistics. These additional statistics can then enable the query planner to make better-informed, cost-based optimizations.

Note

Managed statistics is a public preview feature. Contact Starburst Support with questions or feedback.

Configuration#

Managed statistics requires the following:

Transport Layer Security (TLS) is not configured automatically to secure managed statistics collection. If TLS is a requirement, it must be configured at the cluster level.

Note

The managed statistics feature is only supported in select connectors. Reference the connectors feature matrix for more information about which connectors support this feature.

Coordinator node configuration#

In order to enable managed statistics for a cluster, add the following configuration properties to config.properties on the coordinator node only:

starburst.managed-statistics.enabled=true

Internal TLS configuration#

SSL/TLS is configured in the config.properties file. The SSL/TLS on the worker and coordinator nodes are configured using the same set of properties. Every node in the cluster must be configured. Nodes that have not been configured, or are configured incorrectly, are not able to communicate with other nodes in the cluster.

To enable SSL/TLS for internal communication, do the following:

  1. Disable HTTP endpoint.

    http-server.http.enabled=false
    

    Warning

    You can enable HTTPS, while leaving HTTP enabled. In most cases this is a security hole. If you are certain you want to use this configuration, you should consider using an firewall to limit access to the HTTP endpoint to only those hosts that should be allowed to use it.

  2. Configure the cluster to communicate using the fully qualified domain name (fqdn) of the cluster nodes. This can be done in either of the following ways:

    • If the DNS service is configured properly, the nodes can introduce themselves to the coordinator using the hostname taken from the system configuration (hostname --fqdn)

      node.internal-address-source=FQDN
      
    • Alternatively, specify each node’s fully-qualified hostname manually. This is different for every host. Hosts should be in the same domain to make it easy to create the correct SSL/TLS certificates. For example: coordinator.example.com, worker1.example.com, worker2.example.com.

      node.internal-address=<node fqdn>
      
  3. Generate a Java Keystore File. Every SEP node must be able to connect to any other node within the same cluster. It is possible to create unique certificates for every node using the fully-qualified hostname of each host, create a keystore that contains all the public keys for all of the hosts, and specify it for the client. In most cases, it is easier to use a wildcard in the certificate as shown below.

    keytool -genkeypair -alias example.com -keyalg RSA -keystore keystore.jks
    Enter keystore password:
    Re-enter new password:
    What is your first and last name?
      [Unknown]:  *.example.com
    What is the name of your organizational unit?
      [Unknown]:
    What is the name of your organization?
      [Unknown]:
    What is the name of your City or Locality?
      [Unknown]:
    What is the name of your State or Province?
      [Unknown]:
    What is the two-letter country code for this unit?
      [Unknown]:
    Is CN=*.example.com, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown correct?
      [no]:  yes
    Enter key password
            (RETURN if same as keystore password):
    
  4. Distribute the Java Keystore File across the SEP cluster.

  5. Enable the HTTPS endpoint.

    http-server.https.enabled=true
    http-server.https.port=<https port>
    http-server.https.keystore.path=<keystore path>
    http-server.https.keystore.key=<keystore password>
    
  6. Change the discovery URI to HTTPS.

    discovery.uri=https://<coordinator fqdn>:<https port>
    
  7. Configure the internal communication to require HTTPS.

    internal-communication.https.required=true
    
  8. Configure the internal communication to use the Java keystore file.

    internal-communication.https.keystore.path=<keystore path>
    internal-communication.https.keystore.key=<keystore password>
    

Catalog configuration#

In addition to enabling managed statistics for the cluster, you must also enable managed statistics for each catalog.

To enable managed statistics for a catalog, add the following catalog configuration property to the catalog configuration file under etc/catalog:

managed-statistics.enabled=true

Managed statistics configuration properties#

The following configuration properties are used to control managed statistics on the coordinator:

Managed statistics configuration properties#

Property name

Description

starburst.managed-statistics.enabled

Enables managed statistics on a cluster. Defaults to false.

managed-statistics.cache-ttl

Maximum amount of time for managed statistics to be stored on the in-memory cache on the coordinator before they are retrieved from the database. For more information see statistics storage and management. Default value is 10m.

managed-statistics.cache-maximum-size

Maximum number of objects stored in the in-memory statistics cache. If this value is exceeded, the least-used statistics are discarded and only retrieved again from the database when the query planner needs to reference them. Defaults to 1000.

Collecting statistics#

Managed statistics must be collected manually with the ANALYZE statement or the collect_statistics command in an ALTER TABLE EXECUTE statement depending on the connector. The optional column parameter is used to limit statistics collection to a specified array of column names. See the connector documentation for more information on ALTER TABLE EXECUTE or ANALYZE statement support and syntax for that connector.

The following statistics are collected:

  • For a table:

    • Row count: The total number of rows in the table.

  • For each column in a table:

    • Nulls fraction: The fraction of null values.

    • Distinct value count: The number of distinct values.

    • Range: For data types that have a range, the range of values.

    • Data size: For textual data types, the total data size of all values.

Note

Other methods of collecting statistics continue to be used for a table until you execute the collect_statistics command on that table. Collecting statistics for one table using the collect_statistics command does not prevent SEP from continuing to collect statistics for other tables from the catalog.

Storage and management#

When statistics are collected, they are stored in the backend service database indefinitely until collect_statistics is run again and overwrites any stale statistics. These statistics are pulled from the database into an in-memory cache on the coordinator at a frequency defined by the managed-statistics.cache-ttl configuration property which defaults to 10 minutes. The query planner accesses this in-memory cache for statistics instead of the backend service database, reducing network I/O on the database at the cost of additional memory usage on the coordinator node. To disable the in-memory cache and have the query planner directly access the backend service database for statistics, set the managed-statistics.cache-ttl property to 0.

The in-memory cache retrieves its statistics from the SEP database, not from catalogs directly, so you must still routinely run the collect_statistics command to ensure that the cache does not store stale statistics.