Starburst Galaxy

  •  Get started

  •  Working with data

  •  Data engineering

  •  Developer tools

  •  Cluster administration

  •  Troubleshooting

  • Galaxy status

  •  Reference
  • Federate multiple data sources #

    One of Starburst Galaxy’s many uses is as an interactive query engine to federate multiple data sources together. With the ability to query both workloads, Starburst Galaxy becomes a single point of access and allows you to access your data wherever it lives.

    Federation tutorial architecture #

    For this tutorial, analyze one of the datasets in the COVID-19 data lake and run interactive queries to discover the proper insight required to federate this data with the TPC-H dataset. Then, write a SQL query with both data sources to find the total case count by region.

    • The first dataset is the Global Coronavirus (COVID-19) Data provided by Enigma. This dataset tracks confirmed cases.
    • The second dataset is the standard TPC-H dataset which provides the region information.

    This guide walks you through:

    • Connecting Starburst Galaxy to AWS
    • Querying multiple data sources

    Want to see it in action? Watch a video that demonstrates this tutorial.

    Login to Starburst Galaxy #

    1. Navigate to the Starburst Galaxy login page.
    2. If you don’t have an account, create a new one and verify the account with your email.
    3. Switch to a role with administrative privileges. If this is the first time you’ve logged in, or you have never made any additional roles, you are automatically assigned to the accountadmin role.

    Role selector

    Connect Starburst Galaxy to AWS #

    Create a catalog in Starburst Galaxy #

    Catalogs contain the configuration and connection information needed to access a data source. To gain this access, configure a catalog and use it in a cluster.

    This tutorial uses the public COVID-19 data lake that lives on AWS, which simplifies the setup process. Instead of creating an S3 catalog, where you are required to have access to an AWS account, the data for this tutorial is accessible without charge and without AWS credentials. You connect to the AWS COVID-19 data lake dataset catalog.

    1. In the navigation menu, click Catalogs, then Create Catalog.
    2. Scroll down to Select a dataset.
      • Select AWS COVID-19 data lake.
      • Enter a relevant catalog name such as aws_covid.
      • Add a relevant description such as COVID data from AWS public dataset.
      • Add the US East (Ohio) region to the catalog as this is the only region where the dataset lives.
      • Select Connect catalog.

    Image displaying the actions to create a covid-19 catalog

    Select Save access controls on the Set permissions page to add the default access controls.

    Create a cluster in Starburst Galaxy #

    A Starburst Galaxy cluster provides the resources necessary to run queries against your catalogs. Use the Starburst Galaxy Cluster explorer to access the catalog data exposed by running clusters.

    1. On the Add to cluster page, select + Create a new cluster.

    Image displaying the actions to create a covid-19 cluster

    • Enter cluster name: aws-covid-cluster.
    • Cluster size: Free.
    • Cluster type: Standard.
    • Catalogs: aws_covid, tpch.
    • Cloud provider region: US East (Ohio), also known as us-east-2.
      1. Select Create cluster.
      2. Select Add to cluster.
      3. Select Query my data.

    Confirm that you have connected your cluster to both the aws_covid and tpch catalogs by exploring the aws-covid-cluster in the Cluster explorer. If you do not have both catalogs, navigate to the Clusters page and edit your cluster.

    After configuring the catalog and cluster, you are directed to the query editor to continue the tutorial. Notice that the location drop-down menus in the top left corner have already selected the proper cluster and catalog for querying.

    Image displaying the example query navigation after connecting to   the covid cluster and catalog

    Configure role-based access control #

    The Starburst Galaxy access control system uses roles to provide users with privileges for clusters, catalogs, schemas, tables and other types of entities, such as object storage locations.

    Configure the accountadmin role to access the COVID-19 data lake location. (But if you have already gone through these steps at the direction of a different Starburst Galaxy tutorial, there is no need to repeat them.)

    1. In the navigation menu, select Roles and privileges.
    2. Click on the accountadmin role name.
    3. Select the Privileges tab.
    4. Select Add privilege.
    5. Add privilege to the accountadmin role.
      • Choose location as the modify privileges destination.
      • Add the storage location: s3://covid19-lake/*.
      • Select Create SQL.
      • Select Add privilege.

    Image displaying the URI input to schema discovery

    You now have access to all the data sources available in the COVID-19 data lake.

    Create the enigma_jhu table #

    Navigate to the Query editor. Validate that the location drop-down menus in the top left corner match the cluster and catalog previously created.

    Image displaying the example query navigation after connecting to   the covid cluster and catalog

    Create a schema to hold the COVID-19 table.

    CREATE SCHEMA query_federation;
    

    In the top left corner, select the query_federation schema so that you can easily run the tutorial queries.

    The enigma_jhu dataset provides the Global Coronavirus Data and is sourced from John Hopkins and provided by Enigma. This data tracks confirmed COVID-19 cases in provinces, states, and countries across the world, while also providing a county level breakdown in the United States.

    Edit your saved query to match the following SQL command to create the table.

    CREATE TABLE enigma_jhu (
       fips VARCHAR,
       admin2 VARCHAR,
       province_state VARCHAR,
       country_region VARCHAR,
       last_update VARCHAR,
       latitude DOUBLE,
       longitude DOUBLE,
       confirmed INTEGER,
       deaths INTEGER,
       recovered INTEGER,
       active INTEGER,
       combined_key VARCHAR
    )
    WITH (
       format = 'json',
       EXTERNAL_LOCATION = 's3://covid19-lake/enigma-jhu/json/')
    ;
    

    Run a SELECT * command to view your results. You can also use the table’s vertical ellipsis menu to generate the command for you in the query editor.

    SELECT * FROM enigma_jhu LIMIT 10;
    

    Notice that the admin2 column is actually the county, but was improperly named. You can also see that the case information is an aggregate value of each previously updated timestamp. We account for this as you query.

    Interactive analytics #

    Starburst Galaxy allows you to run one query and return values that join together information from multiple data sources. In this example first investigate the data provided from the COVID-19 data lake and then join this with the TPC-H dataset which is stored in an entirely different location.

    Query the enigma_jhu table #

    Analyze the enigma_jhu table to determine the most valuable insights to be reported. Run a SELECT query to return most of the pertinent values and observe the nature of the data within the table.

    SELECT
        fips,
        province_state,
        country_region,
        last_update,
        confirmed,
        recovered,
        active
    FROM
        enigma_jhu
    ORDER BY
        fips;
    

    There are many different last_update times. Evaluate one specific FIPS code for more information on the data contained within the table.

    SELECT
        fips,
        province_state,
        country_region,
        last_update,
        confirmed,
        recovered,
        active
    FROM
        enigma_jhu
    WHERE
        fips = '36121'
    ORDER BY
        last_update DESC;
    

    Look at the confirmed case count. It’s an aggregate, adding the previous confirmed cases to any additionally confirmed cases. Only evaluate data that has a last_update equal to the maximum value.

    Run a query to find the most recent update.

    SELECT max(last_update) FROM enigma_jhu;
    

    Only evaluate data that has a last_update equal to the maximum value.

    SELECT
        fips,
        province_state,
        country_region,
        last_update,
        confirmed
    FROM
        enigma_jhu
    WHERE
        last_update = '2020-05-30T02:32:48'
    ORDER BY
        last_update DESC;
    

    This query is successful for any location that has a last_update value equal to that maximum entry. However, this is not accurate for any location that has a different last_update value not equivalent to the maximum entry as their total count will be excluded. Run the following example focusing on the state of Utah that this issue.

    SELECT DISTINCT
        fips
    FROM
        enigma_jhu
    WHERE
        province_state = 'Utah'
        AND fips NOT IN
        (
            SELECT
                fips
            FROM
                enigma_jhu
            WHERE
                province_state = 'Utah'
                AND last_update = '2020-05-30T02:32:48'
        );
    

    The query returns all the distinct FIPS codes in Utah that do not have an entry containing the maximum last_update value. Look specifically at one FIPS code within Utah.

    SELECT * FROM enigma_jhu WHERE fips = '49005' ORDER BY last_update DESC;
    

    The maximum last_update for Cache, Utah is on April 16th, 2020, which is at least one month before the final May 30th, 2020 date. Therefore, you need a different solution for calculating the sum of confirmed cases. Add a first value window function to create a column identifying the most recent update for each fips.

    SELECT
        fips,
        admin2 AS county,
        province_state,
        country_region,
        confirmed,
        first_value(last_update) OVER (
          PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
        last_update
    FROM
        enigma_jhu;
    

    To only return records that contain the latest last_update value for each fips, run a nested query which only selects values where last_update = most_recent.

    SELECT
     fips,
     county,
     province_state,
     country_region,
     confirmed,
     last_update
    FROM
     ( SELECT
         fips,
         admin2 AS county,
         province_state,
         country_region,
         confirmed,
         first_value(last_update) OVER (
          PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
         last_update
       FROM
         enigma_jhu
         ) cases
    WHERE
     last_update = most_recent
    GROUP BY
        fips,
        county,
        province_state,
        country_region,
        confirmed,
        last_update;
    

    You can also run this same query using a WITH statement.

    WITH
        cases AS (
            SELECT
                fips,
                admin2 AS county,
                province_state,
                country_region,
                confirmed,
                first_value(last_update) OVER (
                  PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
                last_update
            FROM
                enigma_jhu
        )
    SELECT
        fips,
        county,
        province_state,
        country_region,
        confirmed,
        last_update
    FROM
        cases
    WHERE
        last_update = most_recent
    GROUP BY
        fips,
        county,
        province_state,
        country_region,
        confirmed,
        last_update;
    

    Query the TPC-H dataset #

    The TPC-H dataset provides two tables of interest: the nation table and the region table. Both tables assist in aggregating the total confirmed cases per region.

    Run a query to achieve familiarity with the nation table.

    SELECT * FROM tpch.tiny.nation LIMIT 10;
    

    There are 25 nations within the table.

    Run a query to achieve familiarity with the region table.

    SELECT * FROM tpch.tiny.region LIMIT 10;
    

    There are 5 regions in the table.

    The region_key in the region table acts as the foreign key in the nation table. Visit the TPC-H dataset page for more information on the relationships within the dataset.

    Federate your data sources #

    To determine the region aggregation, join the data lake table with the TPC-H nation table. Only consider the 25 countries that are accounted for in the nation table.

    Run the query to append the COVID-19 data with the proper region. Notice this query joins together data from two different data sources.

    SELECT
        country_region,
        first_value(last_update) OVER (
          PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
        last_update,
        confirmed,
        nationkey,
        name,
        regionkey
    FROM
        aws_covid.query_federation.enigma_jhu enigma
        INNER JOIN tpch.tiny.nation nation ON UPPER(enigma.country_region) = nation.name
    ORDER BY
        confirmed DESC;
    

    This query is a good start; however, it only returns 24 of the 25 countries. Run a distinct query to identify which country is missing.

    SELECT DISTINCT
        name
    FROM
        tpch.tiny.nation
    WHERE
        name NOT IN (
            SELECT
                UPPER(country_region)
            FROM
                aws_covid.query_federation.enigma_jhu);
    

    The United States is assigned the country_region value of ‘US’ in the Enigma data. Fix the join query to account for this mismatch.

    SELECT
        country_region,
        confirmed,
        nationkey,
        name,
        regionkey,
        FIRST_VALUE(last_update) OVER (
          PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
        last_update
    FROM
        aws_covid.query_federation.enigma_jhu enigma
        INNER JOIN tpch.tiny.nation nation ON UPPER(enigma.country_region) = REPLACE(nation.name, 'UNITED STATES', 'US')
    ORDER BY
        nationkey DESC;
    

    Add another inner join to append the region information to each record.

    SELECT
        country_region,
        nationkey,
        confirmed,
        region.name AS region_name,
        FIRST_VALUE(last_update) OVER (
          PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
        last_update
    FROM
        aws_covid.query_federation.enigma_jhu enigma
        INNER JOIN tpch.tiny.nation nation ON UPPER(enigma.country_region) = replace(nation.name, 'UNITED STATES', 'US')
        INNER JOIN tpch.tiny.region region ON nation.regionkey = region.regionkey;
    

    Now, put the puzzle pieces together to fulfill the initial ask of aggregating the confirmed case count by region.

    WITH
        cases AS (
            SELECT
                country_region,
                nationkey,
                confirmed,
                region.name AS region_name,
                FIRST_VALUE(last_update) OVER (
                    PARTITION BY fips ORDER BY last_update DESC) AS most_recent,
                last_update
            FROM
                aws_covid.query_federation.enigma_jhu enigma
                INNER JOIN tpch.tiny.nation nation ON UPPER(enigma.country_region) = REPLACE(nation.name, 'UNITED STATES', 'US')
                INNER JOIN tpch.tiny.region region ON nation.regionkey = region.regionkey
        )
    SELECT
        SUM(confirmed) AS total_confirmed_cases,
        region_name
    FROM
        cases
    WHERE
        last_update = most_recent
    GROUP BY
        region_name
    ORDER BY
        total_confirmed_cases DESC;
    

    Next Steps #

    Now that you have explored the query federation capabilities of Starburst Galaxy, explore another tutorial, or connect your own data.