Table scan redirection#
Table scan redirection enables SEP to offload data access to tables accessed in one catalog to equivalent tables accessed in another catalog. This can improve performance by shifting data access to a more performant system. It can also reduce load on a data source.
Redirection is transparent to the user, and therefore provides performance improvements without the need to modify queries.
A typical use case is the redirection from a catalog configuring a relational database to a catalog using the Hive connector to access a data lake. That catalog can also take advantage of Hive connector storage caching.
Redirection of table scans is performed by SEP after applying authentication and permission checks from the source catalog.
The origin catalog must use one of the following connectors:
The target catalog can use an identical connector for maximum compatibility, or any other connector. Data types are translated based on the type mapping of the connector used for the source and target catalog. This type mapping can be customized to work around unsupported types by setting an explicit type mapping for the target catalog.
If table properties like partitioning, bucketing, sorting are used, then the target can only be Hive as other connectors don’t support these table properties.
Table scan redirection for a catalog can be disabled using the
table_scan_redirection_enabled
catalog session property:
SET SESSION mycatalog.table_scan_redirection_enabled = false;
Configuration#
Table scan redirection is enabled for a specific catalog mycatalog
in
etc/catalog/mycatalog.properties
using either the
cache service, or a local JSON file as the
source of redirections.
Configuration with the cache service
The cache service is the recommended source of redirections for production usage. It is responsible for the automatic creation and refresh of cached tables based on the configured refresh rules. You can customize the type mapping using configured type mapping rules The cache service is configured as the source of redirections using the following properties:
redirection.config-source=SERVICE
cache-service.uri=http://starburst-cache-service:8180
By default, the redirections obtained from the cache service are cached within
SEP for 1 minute. This can be changed using cache-service.cache-ttl
property.
When authentication is enabled for the cache service, the below configuration can be used to connect to it.
redirection.config-source=SERVICE
cache-service.uri=https://starburst-cache-service:8180
cache-service.user=test
cache-service.password=test
cache-service.http-client.trust-store-path=localhost.truststore
cache-service.http-client.trust-store-password=changeit
cache-service.http-client.https.hostname-verification=false
File-based configuration
File-based configuration is not recommended as the source of redirections for production usage. Creation and refresh of target tables is not managed by SEP with this configuration.
redirection.config-source=FILE
redirection.config-file=etc/redirection-rules.json
The configuration is located in a JSON file. The following sample
redirection-rules.json
file configures redirection of table scans from a
catalog rdbms
to a catalog datalake
.
{
"rdbms": {
"schemas": {
"schema_name_a": {
"table_name_1": {
"targetCatalog": "datalake",
"targetSchema": "cache",
"targetTable": "table_name_a"
}
},
"schema_name_b": {
"table_name_2": {
"targetCatalog": "datalake",
"targetSchema": "cache",
"targetTable": "table_name_b",
"columns": [
"col1",
"col2",
"col3"
]
},
"table_name_3": {
"targetCatalog": "datalake",
"targetSchema": "cache",
"targetTable": "table_name_c"
}
}
}
}
}
If the target table stores a subset of the columns from the source table, then
the columns stored by the target table must be specified as shown above for
rdbms.schema_name_b.table_name_2
. When columns
for a target table are
not specified, it is assumed that it contains all the columns from the source
table.
The above configuration results in the following behavior:
All scans on
rdbms.schema_name_a.table_name_1
are redirected todatalake.cache.table_name_a
.Scans on
rdbms.schema_name_b.table_name_2
using the columnscol1, col2, col3
are redirected todatalake.cache.table_name_b
.All scans on
rdbms.schema_name_b.table_name_3
are redirected todatalake.cache.table_name_c
.
Redirections for multiple catalogs can be defined by adding key-value pairs
along the lines of "rdbms": {...}
to the above configuration file.
By default, SEP reloads rules from the configuration file every minute. This
can controlled using redirection.refresh-period
property in
etc/config.properties
.
Metrics#
Metrics about table scan redirection are reported in a JMX table for each
specific catalog mycatalog
using any connector:
jmx.current."com.starburstdata.presto.plugin.jdbc.redirection:name=mycatalog,type=redirectionstats"
Metrics include information about the number of successful redirections and the number of times redirection could not be performed due to the target table missing some columns from the source table.
Limitations#
Redirections are supported for Hive tables but not Hive views.