Glossary

Glossary #

Terms A-E #

Amazon AWS marketplace #

A provider for all aspects of the required infrastructure. This includes using AWS CloudFormation for provisioning, Amazon Simple Storage Service (S3) for storage, Amazon Machine Images (AMI), and Amazon Elastic Compute Cloud (EC2) for computes, Amazon Glue as metadata catalog, and others.

Bare-metal server #

A physical computer server dedicated to a single consumer. See bare-metal server.

Catalog #

Configures a connection to a data source. Catalogs are generally defined by a data engineer. Each catalog’s configuration specifies a connector to define which data source the catalog connects to. For more information about catalogs, see catalog.

Certificate #

A public key certificate issued by a CA, sometimes abbreviated as cert, that verifies the ownership of a server’s keys. Certificate format is specified in the X.509 standard.

Certificate Authority (CA) #

A trusted organization that signs and issues certificates. Its signatures are used to verify the validity of certificates.

Client #

Client applications send queries to Trino clusters and receive results from connected data sources. Client applications can be command line-, desktop-, or browser-based.

Clock time #

See the definition for wall time.

Cluster #

Provides the resources to run queries against numerous data sources. For more information, see cluster basics.

Connector #

Translates data from a data source into Starburst Galaxy or SEP schemas, tables, columns, rows, and data types. A connector is specific to a data source and is used in catalogs to define the data source type. For more information, see Data sources and catalogs.

Container #

A lightweight virtual package of software that contains libraries, binaries, code, configuration files, and other dependencies needed to deploy an application. A running container does not include an operating system. It uses the operating system of the host machine. To learn more, read about containers in the Kubernetes documentation.

Coordinator #

A server that handles incoming queries, and manages workers to execute the queries. A cluster has only one coordinator.

COTS #

Common off-the-shelf. Refers to commodity hardware components.

Data consumer persona #

Owns data products such as reports, dashboards, models, and the quality of analysis.

Data engineer persona #

Owns schemas and is responsible for the source data quality and ETL SLA.

Data lake #

A single repository for storing and processing structured, semistructured, or unstructured data from multiple sources in native format. For more information about data lake and data lakehouse architectures, see Data lake.

Data lakehouse #

A data storage architecture that combines features of a data lake and a data warehouse. Data lakehouses use the same underlying storage technologies as data lakes along with ACID-compliant table formats such as Iceberg, Hudi, and Delta Lake. For more information about data lake and data lakehouse architectures, see Data lake.

Data source #

A system from which data is retrieved, for example, PostgreSQL, or Iceberg tables stored on S3. Users query data sources with catalogs that connect to each source. See Data sources and catalogs.

Data virtualization #

Data virtualization is a method of abstracting an interaction with multiple heterogeneous data sources without needing to know the distributed nature of the data, its format, or any other technical details involved in presenting the data.

Data warehouse #

A purpose-built database system that is optimized for reporting and analysis by data consumers.

Driver #

A driver is a sequence of operator instances. Drivers act upon data and combine operators to produce output that is aggregated by a task, and then delivered to another task in another stage.

Exchange #

Exchanges transfer data between cluster nodes for different stages of a query. Tasks move data into an output buffer and consume data from other tasks using an exchange client.

External ID (AWS) #

An external ID is an identifier in AWS that is required for using Starburst Galaxy. It is used to ensure that only trusted AWS accounts are given permission to operate the Starburst Galaxy clusters based on their assigned role and trust policy. For more information on AWS Identity and Access Management, see How to use an external ID when granting access to your AWS resources to a third party.

Terms F-J #

Google Cloud Marketplace #

Deploy in the Google Cloud Marketplace or using the Starburst Kubernetes solution on the Google Kubernetes Engine (GKE). GKE is a secure, production ready, managed Kubernetes service in Google Cloud managing for containerized applications.

gzip #

A file format and program that compresses and decompresses files. The most common extension for gzip-compressed files is .gz.

HDFS #

A scalable, open source filesystem created to store large amounts of data for the Hadoop ecosystem.

Hive Metastore Service (HMS) #

Manages metadata for data stores that do not necessarily have a catalog, such as HDFS, Object Stores (S3, ADLS, GKS, Min.IO, etc). The metadata is stored in a RDBMS like PostgreSQL.

Java KeyStore (JKS) #

The system of public key cryptography supported as one part of the Java security APIs. The legacy JKS system recognizes keys and certificates stored in keystore files, typically with the .jks extension. By default, it relies on a system-level list of CAs in truststore files installed as part of Java.

Terms K-O #

Key #

A cryptographic key specified as a pair of public and private strings generally used in the context of TLS to secure public network traffic.

Lake data (non-Delta) #

Data stored in a traditional lake method as files in an object store.

Load Balancer (LB) #

Software or a hardware device that sits on a network edge and accepts network connections on behalf of servers behind that wall, distributing traffic across network and server infrastructure to balance the load on networked services.

Marketplace #

Purchase a preconfigured set of machine images, containers, and other needed resources to run SEP on their cloud hosts under your control.

Metastore #

A database used to catalog the metadata of a data collection. The metadata can include tables, columns, schemas, file paths, and storage formats.

Microsoft Azure marketplace #

Deploy using in the Azure Marketplace or using the Starburst Kubernetes solution onto the Azure Kubernetes Services (AKS). AKS is a secure, production-ready, managed Kubernetes service on Azure for managing for containerized applications.

Object storage #

Object storage is a file storage mechanism. Examples of compatible object stores include the following:

Amazon S3
Google Cloud Storage
Azure Blob Storage
MinIO and other S3-compatible stores
HDFS

Open-source #

Typically refers to open-source software, which is software that has the source code available for others to see, use, and modify. Allowed usage varies depending on the license. Trino is licensed under the Apache license and is maintained by a community of contributors from across the globe.

Operator #

An operator consumes, transforms and produces data. For example, a table scan operator fetches data from a connector and produces data that can be consumed by other operators. A filter operator consumes data and produces a subset by applying a predicate over the input data.

Terms P-T #

Parser #

Analyses the Service Provider Interface (SPI) metadata for information about tables, columns, and types to validate SQL semantics, and to perform security checks and type checking of expressions in the original query.

PEM file format #

A format for storing and sending cryptographic keys and certificates. PEM format can contain both a key and its certificate, plus the chain of certificates from authorities back to the root CA, or back to a CA vendor’s intermediate CA.

Personal information (PI) #

Personal information (PI) is broader in scope than personally identifiable information (PII), encompassing additional information such as IP address, sexual orientation, and union membership.

PKCS #12 #

A binary archive used to store keys and certificates or certificate chains that validate a key. PKCS #12 files have .p12 or .pfx extensions.

Planner #

Uses the Statistics SPI to obtain information about row counts and table sizes to perform cost-based query optimizations during planning.

Platform administrator persona #

Owns platforms and services (ITIL-style). Has service SLA responsibility for the infrastructure supporting the cluster.

Presto and PrestoSQL #

The old name for Trino. To learn more about the name change to Trino, read the history.

Query #

A statement in a query language that results in data being returned from a data source, or operations being performed on data in the data source.

Query Federation #

A type of data virtualization that provides a common access point and data model across two or more heterogeneous data sources. A popular data model used by many query federation engines is translating different data sources to SQL tables.

Queue #

A sequence in which statements enter the coordinator to be executed.

Red Hat OpenShift marketplace #

A container platform using Kubernetes operators that automates the provisioning, management, and scaling of applications to any cloud platform or even on-prem. Starburst Enterprise is available on Red Hat marketplace as of OpenShift version 4.

Role-Based Access Control #

A system that implements access policies based on roles and the users associated with them. Built-in access control (icon always depicted in teal) and third-party access control integrations (icon always depicted in black) such as Apache Ranger may be available, depending upon product and platform.

Scheduler #

Uses the Data Location SPI in the creation of the distributed query plan to distribute plan stages to workers.

Schema #

A way to organize tables. Together, a catalog and schema define a set of tables that can be queried.

Secure Sockets Layer (SSL) #

Now superseded by TLS, but still recognized as the term for what TLS does.

Split #

The unit of data that a task processes on a single worker.

Stage #

A component of the distributed query plan containing one or more tasks that define how a query is to execute in a cluster. Query plans comprise one or more stages.

Starburst Enterprise platform (SEP) #

A fully supported, enterprise-grade distribution of Trino. It adds integrations, improves performance, provides security, and makes it easy to deploy, configure, and manager your clusters. For more information, see Starburst Enterprise.

Starburst Galaxy #

An easy to use, fully-managed and enterprise-ready SaaS offering of Trino. Configure your data sources, and query your data wherever it lives. Starburst takes care of the rest so you can concentrate on the analytics. For more information, see Starburst Galaxy.

Statement #

SQL statements retrieve, update, or manipulate data and database structures.

Structured Query Language (SQL) #

The standard language used with relational databases. For more information, see SQL.

SQL client #

A tool or application used to connect Starburst to a database. SQL clients include BI Tools, command line tools, SQL workbenches, etc.

Table #

A set of unordered rows, which are organized into named columns with types. This is the same as in any relational database. Type mapping from source data to tables is defined by the connector.

Tarball #

A common abbreviation for TAR file, which is a common software distribution mechanism. This file format is a collection of multiple files distributed as a single file, commonly compressed using gzip compression.

Task #

A task processes a split. One or more tasks comprise a stage.

Transport Layer Security (TLS) #

TLS is a security protocol designed to provide secure communications over a network. It is the successor to SSL, and used in many applications like HTTPS, email, and Trino. These security topics use the term TLS to refer to both TLS and SSL.

Trino #

The fastest open source, massively parallel processing SQL query engine designed for analytics of large datasets distributed over one or more data sources in object storage, databases, and other systems. Formerly PrestoSQL. For more information, see Trino.

Virtual machine (VM) #

An emulation of the hardware of a computer system on a physical host machine, so any operating system suitable for that hardware can run in the emulator. A typical example is a Linux virtual machine running on a Windows-based host machine. See virtual machine.

Virtual private cloud (VPC) #

A pool of cloud computing resources isolated within a shared public cloud environment. VPCs combine the security of a private network with the flexibility of public cloud infrastructure.

Wall time #

The elapsed real time from start to finish. For more information, see elapsed time.

Example, wall time for query processing is the elapsed time between the a user submitting a query and receiving results.

Real-world time, clock time, and wall-clock time refer to the same amount of time.

Worker #

A worker is a server that is responsible for executing tasks and processing data. A cluster has one or more workers.

Is the information on this page helpful?

Yes

Cancel

Glossary

Resources