Data sources and catalogs #
The following sections explain the main concepts about the data sources and how you can query them as catalogs using SQL.
Data sources #
Your organization likely has a multitude of data sources. A data source is a system where data is retrieved from, such as distributed object storage using HDFS/Hive, RDBMSs like PostgreSQL or Oracle, NoSQL databases and document databases like Elasticsearch, and systems such as Redshift or Kafka and many others. Without Trino, you have to connect to each separately to query the data.
A catalog is the configuration that enables access to a specific data sources. Every cluster can have numerous catalogs configured, and therefore allow access to many data sources.
SHOW CATALOGS; Catalog --------- hive_sales mysql_crm (2 rows)
The query editor and other client tools also display a list of catalogs.
A connector is specific to the data source it supports. It transforms the underlying data into the SQL concepts of schemas, tables, columns, rows, and data types.
Connectors provide the following between a data source and Starburst Enterprise or Starburst Galaxy:
- Secure communications link
- Translation of data types
- Handling of variances in the SQL implementation, adaption to a provided API, or translation of data in raw files
Every catalog uses a specific connector. Connectors are built-in features.
In SEP, you must specify a connector to create a catalog. In Starburst Galaxy, you create a catalog and the connector selection and configuration is handled for you.
Every catalog includes one or more schemas. They group together objects. Schemas are often equivalent to a specific database or schema in the underlying data source.
SHOW SCHEMA FROM examplecatalog;
Every schema includes one or more objects. Typically these objects are tables.
SHOW TABLES FROM examplecatalog.exampleschema;
Some catalogs also support views and materialized views as objects.
SHOW COLUMNS FROM examplecatalog.exampleschema.exampletable;
This information includes the columns in the table, the data type of the columns and other information.
The default context for any SQL statement is the catalog level. As a result any query to access a table needs to specific the catalog, schema and table establishing a fully-qualified name.
SELECT * FROM <catalog>.<schema>.<object>
This allows identical table names in the underlying data sources to be addressed specifically. The following to queries access tables of the same name in completely separate data sources:
SELECT * FROM sales.apac.customer; SELECT * FROM marketing.americas.users;
You can quote the catalog, schema, and table names if any names require escaping. For example, the dash character is not allowed in SQL, but some underlying data sources allow the character for objects that are exposed as schemas or objects.
SELECT * FROM "sales"."web-traffic"."page-views";
USE sales.default; SELECT * FROM customer;
Querying from multiple catalogs #
Starburst Enterprise and Starburst Galaxy let data consumers query anything, anywhere, and get the data they need in a single query. Specifically, they support queries that combine data from many different data sources at the same time.
Fully-qualified object names are critical when querying from multiple sources:
SELECT * FROM <catalog>.<schema>.<object>;
Here’s an example of data from two different sources, Hive and MySQL, combined into a single query:
SELECT sfm.account_number FROM hive_sales.order_entries.orders oeo JOIN mysql_crm.sf_history.customer_master sfm ON sfm.account_number = oeo.customer_id WHERE sfm.sf_industry = `medical` AND oeo.order_total > 300 LIMIT 2;
This query uses data from the following sources:
orderstable in the
order_entriesschema, which is defined in the
customer_mastertable in the
sf_historyschema, which is defined in the
Catalog session properties #
Users can customize the behavior of queries from a catalog using catalog session properties. A session is defined by a specific user accessing a cluster with a specific tool such as the CLI. Catalog session properties can control resource usage, enable or disable features, and change query processing.
Most of the session properties are named similar to their configuration properties counterparts in a catalog file, mostly differing by the use of underscores (_) in the name instead of dashes (-) to be SQL-compliant. These configuration properties are found in the catalog’s connector documentation. Session properties override catalog properties.
Is the information on this page helpful?