Leverage Snowflake Tagging to Empower Governance
As the amount of data organizations collects grows, managing that data becomes increasingly complex. That’s where data governance comes in: the process of managing the availability, usability, integrity, and security of the data used in an organization.
Snowflake has implemented many features that reflect the current reality of organizations, and data governance is in the product front and center. One key aspect of data governance is data classification, or categorizing data according to its sensitivity or criticality. Snowflake offers a very useful feature for data classification: database object tagging.
In this post, we’ll explore how you can leverage Snowflake tagging to empower your data governance efforts and I’ve added examples of creating tags, gathering the tag metadata and how you can mix tagging with other features. Once you put all this together, you will have a robust, fully fleshed governance and security layer on top of your Snowflake account.
One caveat before we get started, tagging is a feature of Snowflake Enterprise edition or higher, so you will need to be on the edition if you want to play with this feature on your own.
What is Tagging?
A tag is a user-defined label that can be attached to a Snowflake object, such as a database, table, or column. Tags can categorize objects based on any criteria that you choose, such as sensitivity, business unit, project, or owner. Once tags have been applied, you can use them to control access to the tagged objects, track usage and costs, and apply policies and rules.
Now let’s apply tagging to a specific use case: identifying sensitive customer data. For example, let’s assume that you have a table in Snowflake called “customers” that contains customer information, including their addresses. We want to categorize the “address” column as sensitive so that we can apply data protection policies and controls.
To accomplish this, we’ll first create a tag called “data_sensitivity” using the Snowflake web interface or command line interface and constrain it so that it always has to have a value of ‘low’, ‘medium’ or ‘high’. Here’s an example SQL statement to create the tag:
CREATE TAG data_sensitivity allowed_values 'low', 'medium'. 'high';
Next, we’ll apply the “data_sensitivity” tag to the “address” column of the “customers” table using another SQL statement and set it to ‘high’:
ALTER TABLE customers MODIFY COLUMN address SET TAG data_sensitivity='high';
As you can see, each tag is composed of a key-value pair and either of those can be set to any name or any value allowed by the tag definition. This allows the tagging feature to adapt to the naming conventions, department names, categorizations, etc. that an organization may implement.
It’s very simple to get going with tags and you will eventually end up with a whole set of tagged objects in your database describing their different sensitivity levels. At some point, you will want to know what objects are tagged with which tags either for auditing or reporting or to implement new policies. To do this, you can use the following SQL query:
select * from table(database.information_schema.tag_references('customers', 'table')) where tag_name='data_sensitivity' and tag_value='high';
In this query, we’re using the
INFORMATION_SCHEMA.TAG_REFERENCES function to retrieve information about the tags applied to the ‘customers’ table. Alternatively, you can also run a similar query but pass in the name of the database and the parameter ‘database’ and you will get a larger output of all the tags in all the objects in the database.
WHERE clause filters the results to only include columns that have been tagged with the specific tag you’re interested in. In this case, we’re simply looking for everything tagged as having a ‘high’ level of data sensitivity.
The query returns a result set where each record corresponds to one tag applied to one object. So if you have 10 tags applied to a particular column, then it will generate 10 records for that configuration.
Mixing Tags with Other Features
A great thing about Snowflake’s tagging implementation is that it can mix with other governance-related features (such as column data masking, for example.)
Going back to our previous example with the ‘customers’ table, now, we want to ensure that only authorized users can view the full address data, while still allowing other users to access the rest of the table.
One way to achieve this is by using Snowflake’s column-level masking feature. Column masking allows you to dynamically mask (i.e., obscure or hide) data in specific columns based on the user’s role or permissions. By combining column masking with tagging, you can mask only the sensitive columns, ensuring sensitive data is protected while still providing access to non-sensitive data.
Here’s an example of how to implement this:
create or replace masking policy sensitivity_mask as (val string) returns string -> case when current_role() in ('DATA_ADMIN') then val when system$get_tag_on_current_column('data_sensitivity') = 'low' then val else '***MASKED***' end; alter tag data_sensitivity set masking policy sensitivity_mask;
In this example, we’re creating a masking policy called
sensitivity_mask. The policy will only show the complete value to members of a role called ‘DATA_ADMIN’ or if the data sensitivity has been set to ‘low’. Everyone else will instead see a string that says ‘***MASKED***”. You could optionally show a partial description of the address, or extract just the zip code, etc based on the data sensitivity level as well.
Then the next statement simply associates the new policy with the tag. And, from that point on, every single column that has been tagged as having ‘high’ data sensitivity will have this particular masking policy applied at runtime.
You could continue integrating the tagging with even more Snowflake features. Since we have columns tagged as sensitive, we can use Snowflake’s tag-based access control (TBAC) feature to control who has access to that data. For example, we could create a Snowflake role called “DATA_AUDITOR” and grant that role access to all objects with the data sensitivity tag set to ‘high’. We could also run audit reports to generate a list of roles and users that can see anything tagged as highly sensitive as well and decide to revoke access based on the results.
In addition to access control and data sensitivity, Snowflake tags can also track usage and costs. You can create a tag called “organization_consumer” and apply it to all objects in your Snowflake account, including databases, tables, columns and virtual warehouses and give it values of the different departments that consume those resources and that data. This would allow you to see how much storage and compute resources are being used by each different ‘organization_consumer’ category, which can help you optimize costs and resources or implement chargeback or cost-sharing mechanisms internally.
In conclusion, Snowflake tagging is a powerful tool for data governance and data classification. It has a very simple but versatile implementation that can be fine-tuned to any organization’s compliance and security requirements. By using tags to categorize your data based on sensitivity or other criteria, you can control access, track usage and costs, and apply policies and rules. If you have this feature available, then I encourage you to compose your tagging strategy to enhance your data governance efforts immediately.