Introduction to UDFs#

A user-defined function (UDF) is a custom function authored by a user of Starburst Galaxy as part of a SQL query or client application. UDFs are scalar functions that return a single output value, similar to built-in functions.

UDF declaration#

Declare the UDF with the SQL FUNCTION keyword and the supported statements for SQL user-defined functions.

A UDF can be declared as an inline UDF to be used in the current query, or declared as a catalog UDF to be used in any future query in any catalog.

Inline UDFs#

An inline user-defined function (inline UDF) declares and uses the UDF within a query processing context. The UDF is declared in a WITH block before the query:

WITH
  FUNCTION doubleup(x integer)
    RETURNS integer
    RETURN x * 2
SELECT doubleup(21);
-- 42

Inline UDF names must follow SQL identifier naming conventions, and cannot contain period characters.

The UDF declaration is only valid within the context of the query. A separate later invocation of the UDF is not possible. If this is desired, use a catalog UDF.

Multiple inline UDF declarations are comma-separated, and can include UDFs calling each other, as long as a called UDF is declared before the first invocation.

WITH
  FUNCTION doubleup(x integer)
    RETURNS integer
    RETURN x * 2,
  FUNCTION doubleupplusone(x integer)
    RETURNS integer
    RETURN doubleup(x) + 1
SELECT doubleupplusone(21);
-- 43

Note that inline UDFs can mask and override the meaning of a built-in function:

WITH
  FUNCTION abs(x integer)
    RETURNS integer
    RETURN x * 2
SELECT abs(-10); -- -20, not 10!

Catalog UDFs#

You can store a UDF in the context of a catalog. Starburst Galaxy stores all UDFs in the the global catalog named galaxy, which is automatically attached to each cluster in your Galaxy account.

In this scenario, the following commands can be used:

Because catalog UDFs in Galaxy all reside in the same global catalog, they can be referenced by function name alone. If you do need to reference the full path of a function such as square, that full path is galaxy.functions.square.

When using the SHOW FUNCTIONS statement in Galaxy, specify the location of all UDFs as follows:

show functions from galaxy.functions;

SQL environment configuration for UDFs#

Because catalog UDFs are stored in the global catalog named galaxy, there are no environment considerations for creating or dropping catalog UDFs.

SQL UDFs in Galaxy can be defined while the current location is a particular catalog.schema.table location (such as catA.schemaB.tableC) and then immediately invoked when the SQL editor’s context changes to a different location (such as catX.schemaY.tableZ).

Recommendations#

Processing UDFs can potentially be resource intensive on the cluster in terms of memory and processing. Take the following considerations into account when writing and running UDFs:

  • Some checks for the runtime behavior of queries, and therefore UDF processing, are in place. For example, if a query takes longer to process than a hardcoded threshold, processing is automatically terminated.

  • Avoid creation of arrays in a looping construct. Each iteration creates a separate new array with all items and copies the data for each modification, leaving the prior array in memory for automated clean up later. Use a lambda expression instead of the loop.

  • Avoid concatenating strings in a looping construct. Each iteration creates a separate new string and copying the old string for each modification, leaving the prior string in memory for automated clean up later. Use a lambda expression instead of the loop.

  • Most UDFs should declare the RETURNS NULL ON NULL INPUT characteristics unless the code has some special handling for null values. You must declare this explicitly since CALLED ON NULL INPUT is the default characteristic.