Catalogs overview #

A catalog contains the configuration that allows Starburst Galaxy to access a data source.

To query a data source in Galaxy, configure a catalog for it, and include that catalog in a cluster. Once the catalog is defined and used in a cluster, you can query the data source by accessing the catalog and its nested schemas and tables.
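As a minimal sketch of how the catalog, schema, and table levels nest, the following Python snippet builds a fully qualified table name and a simple SELECT statement. The names `tpch`, `tiny`, and `nation` are assumptions for illustration only. The actual catalog and schema names depend on how the catalog is configured in your account. The helper functions are hypothetical and not part of any Starburst client library.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def qualified_table(catalog: str, schema: str, table: str) -> str:
    """Join the three levels into catalog.schema.table form."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


def select_all(catalog: str, schema: str, table: str, limit: int = 10) -> str:
    """Build a simple SELECT statement against a fully qualified table."""
    return f"SELECT * FROM {qualified_table(catalog, schema, table)} LIMIT {limit}"


# Hypothetical names: a catalog "tpch" with schema "tiny" and table "nation".
print(select_all("tpch", "tiny", "nation"))
```

The resulting statement can be run from any SQL client connected to a cluster that includes the catalog.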

Data sources and clusters must be located in the same cloud provider and region to enable optimal performance and avoid unnecessary data transfer costs.
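The co-location requirement can be sketched as a simple check before a catalog is added to a cluster. This is an illustrative sketch only: the `Location` type, the provider labels, and the region names are assumptions and do not reflect how Galaxy represents this information internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    provider: str  # assumed labels such as "aws", "azure", or "gcp"
    region: str    # assumed region name such as "us-east-1"


def is_colocated(data_source: Location, cluster: Location) -> bool:
    """Return True when both sit in the same cloud provider and region."""
    source_key = (data_source.provider.lower(), data_source.region.lower())
    cluster_key = (cluster.provider.lower(), cluster.region.lower())
    return source_key == cluster_key


# Hypothetical example: a data source and a cluster in different regions.
bucket = Location("aws", "us-east-1")
cluster = Location("aws", "eu-west-1")
print(is_colocated(bucket, cluster))
```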

To create and manage catalogs, use the Data > Catalogs item in the navigation menu.

Data sources #

Galaxy facilitates access to a variety of data sources. Configuration for object storage systems, data warehouses, relational databases, and other systems varies by cloud and hosting provider. If network access to your data source is restricted, you may need to configure its network to admit one or more of Starburst Galaxy’s outgoing IP blocks, as listed on the IP allow list.
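To illustrate what admitting an IP block means on the data source side, the following sketch checks whether a connecting address falls inside a set of allowed CIDR blocks, using Python's standard `ipaddress` module. The blocks shown are placeholders taken from reserved documentation ranges, not Starburst Galaxy's real outgoing addresses. Use the values from the IP allow list for your cloud provider and region instead.

```python
import ipaddress

# Placeholder CIDR blocks from reserved documentation ranges.
# These are NOT real Starburst Galaxy addresses.
ALLOWED_BLOCKS = ["192.0.2.0/24", "198.51.100.0/24"]


def is_admitted(source_ip: str, allowed_blocks=ALLOWED_BLOCKS) -> bool:
    """Return True when source_ip belongs to any of the allowed CIDR blocks."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(block) for block in allowed_blocks)


print(is_admitted("192.0.2.17"))   # inside the first placeholder block
print(is_admitted("203.0.113.5"))  # outside both placeholder blocks
```

In practice this rule lives in a firewall, security group, or database allow list rather than in application code. The sketch only shows the matching logic.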

The following sections provide links to the configuration pages for the data source catalogs supported by Starburst Galaxy.

Object storage systems #

Galaxy supports the following object storage systems.

Amazon S3, object storage on Amazon Web Services, combined with AWS Glue, your own Hive Metastore Service, or the Galaxy metastore.
Azure Data Lake Storage, object storage on Microsoft Azure, combined with your own Hive Metastore Service or the Galaxy metastore.
Google Cloud Storage, object storage on Google Cloud, combined with your own Hive Metastore Service or the Galaxy metastore.

Galaxy also supports the following Iceberg REST catalogs for Iceberg tables in object storage.

Amazon S3 Tables, a catalog for Apache Iceberg tables stored in Amazon S3 Table buckets.
Apache Polaris, an open-source catalog for Apache Iceberg tables on Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.
Lakekeeper, an open-source catalog for Apache Iceberg tables on Azure Data Lake Storage.

Additionally, Galaxy supports Unity Catalog for tables under the control of Unity on Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.

Starburst Warp Speed is available for S3 catalogs to improve performance.

Additional data sources #

Starburst Galaxy supports catalogs for the following relational databases, data warehouses, NoSQL databases, and other systems.

Amazon DynamoDB, a fast, fully managed NoSQL database service.
Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service.
Apache Cassandra, an open source NoSQL distributed database.
Apache Druid, a high performance, real-time analytics database that delivers sub-second queries on streaming and batch data at scale and under load.
Apache Pinot, a distributed OLAP datastore.
Azure Synapse, an analytics service that brings together data integration, enterprise data warehousing, and big data analytics.
ClickHouse, a high-performance, column-oriented SQL DBMS.
Elasticsearch, a fast and scalable search and analytics engine.
Galaxy Telemetry, a built-in catalog that lets you access Starburst Galaxy managed datasets.
Google BigQuery, a serverless, scalable, cost-effective multi-cloud data warehouse.
Google Sheets, read-only access to spreadsheets stored on your Google Drive account.
MariaDB, a relational database available in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure.
Microsoft SQL Server, a relational database available in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure.
MongoDB or MongoDB Atlas data platform.
MySQL, a relational database available in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure.
OpenSearch, a flexible, scalable, open-source way to build solutions for data-intensive applications.
Oracle, a scalable, secure DBMS.
PostgreSQL, a relational database available in numerous variants on Amazon Web Services, Google Cloud, or Microsoft Azure.
Salesforce, a cloud-based customer relationship management system. (private preview)
Salesforce Data Cloud, a cloud-based customer data platform natively integrated with Salesforce. (private preview)
SAP HANA, a column-oriented in-memory database.
Snowflake, a cloud-based data platform.

Catalogs marked (private preview) are in private preview status. Contact your account team for more information.

Sample datasets #

Galaxy also provides access to several complete datasets. You can create a catalog for any of these datasets and use it for the following purposes:

Demonstration of Starburst Galaxy features without the need to configure an external data source.
Availability of a full dataset to query, learn SQL, and experiment with different clients.
Performance and other benchmark tests with well-known data and standardized queries.

The following dataset catalogs are available:

Black Hole catalog

Designed for high-performance testing of other components. It works in the same way as /dev/null and /dev/zero on Unix-like systems.

COVID-19 data lake

See the Introductory project tutorials for examples of using this dataset.

Sample dataset

Provides data in two tables that represent space mission data.

TPC-DS

Provides a set of schemas to support the TPC Benchmark™ DS database, which is a benchmark used to measure the performance of complex decision support databases.

TPC-H

Provides a set of schemas to support the TPC Benchmark™ H database, which is a benchmark used to measure the performance of highly complex decision support databases.
