Starburst Galaxy

Data quality expressions

    An expression is required to determine the scope of a data quality monitoring rule. Functions can be combined using boolean operators such as AND, OR, and NOT, with parentheses to indicate precedence. For functions based on historical statistics, the statistics are gathered with the SHOW STATS query on the table. Expressions are case-insensitive.
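
    As a rough illustration of why parentheses matter when functions are combined, the following Python sketch replaces three individual checks with plain booleans. Python's `or`, `and`, and `not` stand in for OR, AND, and NOT. The variable names and the sample outcomes are hypothetical, and this is not how Galaxy itself evaluates expressions.

```python
# Hypothetical outcomes of three individual checks on one table.
check_a = True    # stands in for a function such as row_count_min(...)
check_b = False   # stands in for a second function
check_c = False   # stands in for a third function

# A OR (B AND C): the parenthesized AND is evaluated first.
grouped_right = check_a or (check_b and check_c)

# (A OR B) AND C: the parenthesized OR is evaluated first.
grouped_left = (check_a or check_b) and check_c

# NOT A inverts a single check.
negated = not check_a

print(grouped_right, grouped_left, negated)
```

    With these sample values the two groupings give different results, so the placement of parentheses changes whether the rule passes.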

    Examples

    The following table contains examples of valid data quality rule expressions:

    | Expression | Description |
    | --- | --- |
    | row_count_min(5000) | There are a minimum of 5000 rows in the table. |
    | row_count_max(99999) | There are a maximum of 99999 rows in the table. |
    | row_count_range(5000, 99999) | The number of rows is between 5000 and 99999. |
    | row_count_delta(1000) | Row count cannot vary by more than 1000 compared to previous row count. |
    | row_count_delta(0.05) | Row count cannot vary by more than 5% compared to previous row count. |
    | nulls_fraction_min("age", 0.2) | Column age minimum fraction of NULL values is 0.2. |
    | nulls_fraction_max("age", 0.3) | Column age maximum fraction of NULL values is 0.3. |
    | nulls_fraction_range("age", 0.1, 0.9) | Column age fraction of NULL values is between 0.1 and 0.9. |
    | nulls_fraction_rows_delta("age", 5000) | Column age row count multiplied by fraction of NULL values cannot vary by more than 5000 from the previous value of that product. |
    | nulls_fraction_delta("age", 0.2) | Column age fraction of NULL values cannot vary by more than 0.2 from previous fraction of NULL values. |
    | low_value_min("age", 18) | Column age lowest value must be at least 18. |
    | low_value_max("age", 34) | Column age lowest value must be less than or equal to 34. |
    | low_value_range("saturation", 0.5, 0.99) | Column saturation lowest value must be between 0.5 and 0.99. |
    | low_value_delta("age", 10) | Column age lowest value cannot vary by more than 10 from previous low value. |
    | high_value_min("age", 18) | Column age highest value must be at least 18. |
    | high_value_max("age", 34) | Column age highest value must be less than or equal to 34. |
    | high_value_range("saturation", 0.5, 0.99) | Column saturation highest value must be between 0.5 and 0.99. |
    | high_value_delta("age", 10) | Column age highest value cannot vary by more than 10 from previous high value. |
    | distinct_values_count_min("age", 200) | Column age minimum count of distinct values is 200. |
    | distinct_values_count_max("age", 9999) | Column age maximum count of distinct values is 9999. |
    | distinct_values_count_range("age", 200, 9999) | Column age count of distinct values is between 200 and 9999. |
    | distinct_values_count_delta("age", 5000) | Column age count of distinct values cannot vary by more than 5000 from previous distinct values count. |
    | data_size_min("csv_attachment", 200) | Column csv_attachment minimum data size is 200 bytes. |
    | data_size_max("csv_attachment", 9999) | Column csv_attachment maximum data size is 9999 bytes. |
    | data_size_range("csv_attachment", 200, 9999) | Column csv_attachment data size is between 200 bytes and 9999 bytes. |
    | data_size_delta("csv_attachment", 5000) | Column csv_attachment data size cannot vary by more than 5000 bytes from previous data size. |
    | nulls_fraction_min("temperature", 0.6) OR (row_count_min(7) AND nulls_fraction_min("humidity", 0.3)) | Column temperature minimum fraction of NULL values is 0.6 or there are a minimum of 7 rows and column humidity minimum fraction of NULL values is 0.3. |
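
    To make the min, range, and delta families more concrete, here is a minimal Python sketch that evaluates a few of the expressions above against a hand-written statistics snapshot. The `TableStats` class, the helper functions, and the sample numbers are all hypothetical. The sketch assumes that bounds are inclusive and that a `row_count_delta` threshold below 1 is read as a fraction of the previous row count, since the examples show 1000 as an absolute limit and 0.05 as 5%. Galaxy's actual rules may differ.

```python
from dataclasses import dataclass, field


@dataclass
class TableStats:
    """Hypothetical snapshot holding two of the values SHOW STATS reports."""
    row_count: float
    nulls_fraction: dict = field(default_factory=dict)  # column name -> fraction of NULLs


def row_count_min(stats, minimum):
    return stats.row_count >= minimum  # assumed inclusive


def row_count_range(stats, minimum, maximum):
    return minimum <= stats.row_count <= maximum  # assumed inclusive on both ends


def nulls_fraction_min(stats, column, minimum):
    return stats.nulls_fraction[column] >= minimum  # assumed inclusive


def row_count_delta(stats, previous, threshold):
    change = abs(stats.row_count - previous.row_count)
    if threshold < 1:
        # Assumption: a threshold below 1 is a fraction of the previous row count.
        return change <= threshold * previous.row_count
    return change <= threshold


# Made-up statistics from two consecutive collections.
previous = TableStats(row_count=10_000)
current = TableStats(
    row_count=10_800,
    nulls_fraction={"temperature": 0.1, "humidity": 0.4},
)

within_absolute = row_count_delta(current, previous, 1000)  # change of 800 against a limit of 1000
within_relative = row_count_delta(current, previous, 0.05)  # change of 800 against 5% of 10000

# nulls_fraction_min("temperature", 0.6) OR
#   (row_count_min(7) AND nulls_fraction_min("humidity", 0.3))
combined = nulls_fraction_min(current, "temperature", 0.6) or (
    row_count_min(current, 7) and nulls_fraction_min(current, "humidity", 0.3)
)

print(within_absolute, within_relative, combined)
```

    With these sample numbers the absolute delta check passes, the 5% delta check fails because a change of 800 rows exceeds 500, and the combined expression passes through its parenthesized branch.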