PyStarburst #

The PyStarburst library implements the standard Python DataFrame API, which uses a data structure called a DataFrame to analyze and manipulate two-dimensional data. Use PyStarburst to query and transform data in Starburst Galaxy and Starburst Enterprise platform (SEP) clusters in a data pipeline using Python syntax.

With PyStarburst, you can create complex transformation pipelines, build data apps, and interact with data using Python without moving data to the system where your application code runs.

PyStarburst provides familiar syntax for writing and running production-grade ETL pipelines and data transformations. You can use it both to build new pipelines and to migrate existing PySpark or Snowpark workloads to Starburst Galaxy and SEP.

Install the library #

To install PyStarburst and its dependencies, run the following pip command from your command prompt:

pip install pystarburst
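
To confirm that the install succeeded, one option is to look up the package version from Python. The following is a minimal sketch that uses only the standard library's importlib.metadata module; the helper name installed_version is hypothetical and is not part of PyStarburst.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="pystarburst"):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "pystarburst is not installed in this environment")
```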

Connect to your Starburst Galaxy cluster #

Use your preferred local development environment to connect to a Starburst Galaxy cluster. Establish a session using the same connection parameters you use to log in to Starburst Galaxy.

Specify these settings in a dictionary that associates parameter names with values. Then pass this dictionary to the Session.builder.configs method and call the create method to establish your session:

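import trino
from pystarburst import Session

# Replace each <placeholder> with the value for your cluster.
db_parameters = {
    "host": "<host>",
    "port": <port>,  # a number, written without quotes
    "http_scheme": "https",
    "catalog": "sample",
    "schema": "burstbank",
    "auth": trino.auth.BasicAuthentication("<user>", "<password>")
}
# Build the session from the parameter dictionary.
session = Session.builder.configs(db_parameters).create()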

To determine the values for the connection parameters host, port, and user:

  1. Open Partner connect in the Starburst Galaxy navigation menu.
  2. Click the PyStarburst tile in the Drivers and clients section.
  3. From the Select cluster drop-down menu, select the cluster of interest.
  4. Copy the values from the User, Host, and Port fields.
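
Once you have copied these values, you may prefer to keep them out of your source code. The following is a minimal sketch that reads them from environment variables and builds the same parameter dictionary as in the connection example. The variable names (STARBURST_HOST, STARBURST_PORT, STARBURST_USER, STARBURST_PASSWORD), the helper names, and the fallback port of 443 are assumptions made for illustration, not PyStarburst conventions.

```python
import os


def build_db_parameters(env=None):
    """Build the connection dictionary (without auth) from environment variables.

    The variable names and the default port are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "host": env["STARBURST_HOST"],
        "port": int(env.get("STARBURST_PORT", "443")),  # assumed default
        "http_scheme": "https",
        "catalog": env.get("STARBURST_CATALOG", "sample"),
        "schema": env.get("STARBURST_SCHEMA", "burstbank"),
    }


def create_session(env=None):
    """Create a PyStarburst session from environment variables."""
    # Imported here so build_db_parameters works without the libraries installed.
    import trino
    from pystarburst import Session

    env = os.environ if env is None else env
    params = build_db_parameters(env)
    params["auth"] = trino.auth.BasicAuthentication(
        env["STARBURST_USER"], env["STARBURST_PASSWORD"]
    )
    return Session.builder.configs(params).create()
```

With the variables exported in your shell, calling create_session() returns a session equivalent to the one created from a hard-coded dictionary.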

Enable PyStarburst in Starburst Enterprise #

To enable PyStarburst in SEP, set the following configuration property to true in your SEP coordinator:

starburst.dataframe-api-enabled=true

PyStarburst API reference #

After you establish a connection with a cluster, use Python to construct DataFrames and query tables. PyStarburst provides methods for performing DataFrame operations on your data.
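
As an illustration of what such operations can look like, the following minimal sketch previews a few rows of a table. It assumes that the session object exposes Snowpark-style table, select, limit, and collect methods and that select accepts column names as strings; check the API reference to confirm the exact signatures. The helper name preview_table, the table name customer, and the column names in the usage comment are hypothetical.

```python
def preview_table(session, table_name, columns, limit=5):
    """Return up to `limit` rows of `table_name`, restricted to `columns`."""
    df = session.table(table_name)   # reference the table as a DataFrame
    df = df.select(*columns)         # keep only the requested columns
    return df.limit(limit).collect() # run the query and fetch the rows


# Example usage with an existing session (table and column names are hypothetical):
# rows = preview_table(session, "customer", ["custkey", "first_name"])
# for row in rows:
#     print(row)
```

Because the DataFrame is evaluated on the cluster, only the limited result set is returned to the machine running this code.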

View technical documentation for PyStarburst’s API methods at: https://pystarburst.eng.starburstdata.net/.

Example Jupyter notebook #

Try out PyStarburst using the example Jupyter notebook in the starburstdata/pystarburst-demo GitHub repository.