    Data classifier jobs #

    Data classifier jobs automatically classify data in catalogs, schemas, tables, and views to apply attribute tags to that data.

    When used in conjunction with attribute tags and policies, classification provides an automated way to perform governance on your data.

    Data classifier jobs analyze the data and metadata of your catalogs, schemas, and tables. They also propose tags on columns. Administrators choose whether to accept or reject the tag proposal, and can change the color or name of the proposed tag.

    Before you begin #

    A role in the user’s active role set must have the account-level privilege Manage Security in order to create, update, view, or delete classification jobs.

    The classifier job queries data on a cluster in the account using a role that the user specifies. This role must be in the user’s active role set. Because queries execute on the cluster, the specified role must have the Use Cluster privilege on that cluster. To suggest proposed tags, the specified role must also have at least one of the Create Tag or Apply Tag privileges. Additionally, only data for which the role has a SELECT grant is analyzed.
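    The requirements on the execution role can be summarized as a short checklist. The following Python sketch is illustrative only: the Role class, its fields, and the execution_role_problems function are hypothetical names invented for this example and are not part of any Starburst Galaxy API. It models just the three conditions stated above and leaves out SELECT grants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    # Hypothetical model of a role, for illustration only.
    name: str
    privileges: frozenset = frozenset()       # e.g. {"Create Tag", "Apply Tag"}
    usable_clusters: frozenset = frozenset()  # clusters with Use Cluster granted


def execution_role_problems(role, active_role_set, cluster):
    """Return the unmet requirements for running a classifier job."""
    problems = []
    if role not in active_role_set:
        problems.append("role is not in the user's active role set")
    if cluster not in role.usable_clusters:
        problems.append(f"role lacks Use Cluster on {cluster}")
    if not role.privileges & {"Create Tag", "Apply Tag"}:
        problems.append("role needs Create Tag or Apply Tag to propose tags")
    return problems


steward = Role("steward", frozenset({"Apply Tag"}), frozenset({"prod"}))
print(execution_role_problems(steward, [steward], "prod"))
```

    An empty list means that the role satisfies all three conditions in this simplified model.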

    Create a data classifier job #

    To create a data classifier job, click Access control > Data classifier jobs in the navigation menu.

    In the Create classification dialog, provide the following information:

    • In the Name and description section, enter a name for the job and a useful description.

    • In the Cluster section, choose a cluster to run the classifier on from the drop-down menu.

    • In the Execution role section, select an executing role.

    • In the Add catalogs, schemas and tables section, specify which catalogs, schemas, and tables to include in the classifier job.

    • In the Classifier section:
    • In the Schedule section, use the toggle switches to enable Run on a schedule or Execute immediately.

      • To Run on a schedule:
        • Select a Time zone from the drop-down menu.
        • Choose a recurring interval format: Select frequency or Enter cron expression.

      For Select frequency: Choose an hourly, daily, weekly, monthly, or annual schedule from the drop-down menu. The corresponding values depend on the schedule:

      • Weekly: Enter a time in the format hh:mm, specify AM or PM, then select a day of the week.
      • Monthly: Enter a time in the format hh:mm, specify AM or PM, then select a date.
      • Annually: Enter a month, day, hour, and minutes in the format MM/DD hh:mm. Specify AM or PM.

      For Enter cron expression: Enter the desired schedule as a UNIX cron expression. For example, the following expression schedules a job to run every week at 9:30 AM on Monday, Wednesday, and Friday:

      30 9 * * 1,3,5
      
    • To run the classifier job now, use the toggle to enable Execute immediately.

    • Click Create classifier job.
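    The cron example in the schedule step above can be read field by field: minute, hour, day of month, month, and day of week. The following Python sketch expands a five-field expression into the set of values each field allows. It is a minimal illustration with hypothetical function names (expand_field, parse_cron). It handles only *, comma lists, ranges, and step values, and it assumes the common convention that day of week runs from 0 (Sunday) to 6. Whether the scheduler accepts other cron extensions is not covered here.

```python
# (field name, lowest value, highest value) for a standard 5-field expression.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),  # assumption: 0 = Sunday
]


def expand_field(text, low, high):
    """Expand one cron field such as '*', '1,3,5', '1-5' or '*/15'."""
    values = set()
    for part in text.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not (low <= start <= end <= high):
            raise ValueError(f"value out of range in field: {text!r}")
        values.update(range(start, end + 1, step))
    return values


def parse_cron(expression):
    """Map each field name to the set of values the expression allows."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected 5 space-separated fields")
    return {
        name: expand_field(text, low, high)
        for (name, low, high), text in zip(FIELDS, parts)
    }


schedule = parse_cron("30 9 * * 1,3,5")
print(sorted(schedule["minute"]), sorted(schedule["hour"]))
print(sorted(schedule["day_of_week"]))
```

    For 30 9 * * 1,3,5 the minute field allows only 30, the hour field only 9, and the day-of-week field 1, 3, and 5, while the day-of-month and month fields allow every value.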

     

    Create a new classifier #

    In the Create or edit classifier dialog, click Add a new classifier, then enter the following information. To remove a classifier, click the remove icon for that classifier.

    • Name: Enter a name for the classifier.
    • Tag: Choose a tag from the drop-down menu. If you do not see the tags you prefer, create a new tag.
    • Classifier type: Choose Regular expression or Text classification category.

      • For the Regular expression classifier type, enter a Java regex and choose a threshold. The threshold is the fraction of rows in a column that must match the regular expression for the tag to be suggested, expressed as a number between 0 and 1. For example, if you enter 0.8 in the threshold field and at least 80% of a column’s rows match the regular expression, the tag is suggested.
      • For the Text classification category classifier type, enter a text classification category.

    • Click Save and go back to return to the Run a classifier dialog.
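    The threshold rule for regular expression classifiers can be sketched in a few lines. The following Python example is an approximation under stated assumptions, not the classifier's actual implementation. The function names are hypothetical, the whole value must match the pattern, null values count as non-matching rows, and Python's re module stands in for Java regex syntax, which differs in some details.

```python
import re


def match_fraction(values, pattern):
    """Fraction of rows whose value fully matches the pattern."""
    compiled = re.compile(pattern)
    if not values:
        return 0.0
    # Assumption: None (null) rows are counted as non-matching.
    matches = sum(
        1 for value in values
        if value is not None and compiled.fullmatch(value)
    )
    return matches / len(values)


def should_suggest_tag(values, pattern, threshold):
    """Suggest the tag when the match fraction reaches the threshold."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be a number between 0 and 1")
    return match_fraction(values, pattern) >= threshold


# Sample column: four of five rows look like e-mail addresses.
column = [
    "a@example.com",
    "b@example.org",
    "c@example.net",
    "d@example.com",
    "not-an-email",
]
email_pattern = r"[^@\s]+@[^@\s]+\.[a-z]+"

print(should_suggest_tag(column, email_pattern, 0.8))
print(should_suggest_tag(column, email_pattern, 0.9))
```

    With four of five rows matching, the match fraction is 0.8, so a threshold of 0.8 suggests the tag and a threshold of 0.9 does not.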

     

    Data classifier job details #

    All classifier jobs are listed in the Data classifier jobs pane.

    The header displays the total number of classifier jobs, and provides a search bar for finding data classifier jobs.

    The list of classifier jobs has the following columns:

    • Name: The name of the classifier job.
    • Description: The description provided for the classifier job.
    • Last run status: The status of the most recent classifier job run.
    • Executing role: The role running the classifier job.
    • Last run ended: The date and time the last classifier job run ended.
    • Next run starts: The next date and time the classifier job is scheduled to start running.

     

    View, accept, or discard proposed tags #

    The classifier job recommends tags as it comes across a table or column that could fit a requested category. Tags may be recommended while the job is still executing.

    You can access the list of suggested tags in two ways:

    1. Click View results in the row of the classifier job. This option is available only after the classifier job has had a successful run.
    2. Go to the catalog level of the catalog explorer, and click Auto tag.

    Follow these steps to accept or reject the proposed tags:

    1. In the Suggested tags dialog, select the checkbox next to the tags you would like to accept or reject. Alternatively, click the add icon next to a suggested tag name to open a drop-down menu where you can select additional, previously created tags to apply to the entity.

    2. Click the corresponding button to apply or discard the selected tags. Clicking Apply selected tags on a proposed tag creates the tag if it does not already exist and applies the tag to the column it was proposed for. To remove the proposed tags from the suggested tags list, click Discard selected tags. If a future classifier job run proposes the same tag again, that proposal is not shown.

    For more information on the classifier job, click the classifier job name.

    Data classifier job summary #

    The title of the summary pane is the name of the classifier job. The top portion provides a Run now button, the classifier job description, and the date of the next scheduled run.

    The Run history section is organized in the following columns:

    • Status: An icon showing the status of the data classifier job run:
      • Queued
      • Currently running
      • Completed
      • Error. The Error information link opens a window with information about the error.
    • Started: When the data classifier job run started.
    • Elapsed time: The duration of the data classifier job run.

     

    Manage data classifier jobs #

    Perform editing tasks in the Data classifier jobs pane and in the header section of the classifier job’s summary pane.

    Edit data classifier jobs #

    To edit classifier jobs in the Data classifier jobs pane:

    • Click the options menu in the row, then select Edit classifier.
    • Make changes, then click Save.

    To edit classifier jobs in the classifier job’s summary pane:

    • Click the name of the classifier job of interest.
    • Click the options menu in the header.
    • Click Edit classifier.
    • Make changes, then click Update classifier.

    Delete data classifier jobs #

    To delete classifier jobs in the Data classifier jobs pane:

    • Click the options menu in the row.
    • Select Delete classifier, then click Yes, delete.

    To delete classifier jobs in the classifier job’s summary pane:

    • Click the options menu in the header.
    • Select Delete classifier, then click Yes, delete.

    Supported classification categories #

    | Classifier group | Data category | Default tag |
    |---|---|---|
    | PII | E-Mail Address | pii.email |
    | | Full Name | pii.full_name |
    | | First Name | pii.first_name |
    | | Last Name | pii.last_name |
    | | Phone Number | pii.phone_number |
    | | Street Address | pii.address |
    | | Social Security Number (SSN) | pii.us_ssn |
    | | Individual Taxpayer Identification Number (ITIN) | pii.us_itin |
    | | Preparer Taxpayer Identification Number (PTIN) | pii.us_ptin |
    | | Adoption Taxpayer Identification Number (ATIN) | pii.us_atin |
    | | Passport Number | pii.passport |
    | | International Mobile Equipment Identifier (IMEI) | pii.imei |
    | | IP Address | pii.ip_address |
    | | MAC Address | pii.mac_address |
    | | URL | pii.url |
    | | International Bank Account Number (IBAN) | pii.iban |
    | | US Bank Account Number | pii.us_bank_num |
    | | US Drivers License Number | pii.us_driver_num |
    | | UK National Health Service Number (NHS) | pii.uk_nhs_num |
    | | UK Drivers License Number | pii.uk_driver_num |
    | | ABA Routing Number | pii.routing_number |
    | | Employer Identification Number | pii.us_employer_id |
    | | Canada Social Insurance Number | pii.ca_sin |
    | | Australia Medicare Number | pii.au_medicare |
    | | Australia Tax File Number | pii.aus_tax_file_number |
    | | Language Code | pii.language_code |
    | | Currency Code | pii.currency_code |
    | | Medical Diagnostic Code | pii.diagnostic_code |
    | LOCATION | Street Address | pii.address |
    | | ZIP Code | pii.zip_code |
    | | Postal Code | pii.postal_code |
    | | Canadian Postal Code | pii.ca_postal_code |
    | | US State Code | pii.us_state_code |
    | | Canadian Province Code | pii.canada_province_code |
    | | Country Code | pii.country_code |
    | | Jurisdiction Code | pii.jurisdiction_code |