Snowflake Clustering Keys: Best Practice

Rewritten: 6th January 2024

Having worked as a solution architect advising over 100 Snowflake customers, I believe Snowflake Clustering keys are one of the most misunderstood and misused tools to maximize query performance. 

In this article, I will explain what micro-partitions are, how Snowflake automatically prunes them to maximize query performance, how to create and measure clustering keys, and the best practices needed to deploy them successfully.

After reading this article, you'll fully appreciate this challenging but hugely rewarding technique to maximize query performance.

What are Micro Partitions in Snowflake?

Before we discuss Snowflake Clustering, it’s important to understand micro partitions in Snowflake and how Snowflake physically stores data.

Unlike many OLTP systems like Oracle, which store data in row format in relatively small blocks, Snowflake stores data in columnar format in large compressed blocks of around 16MB called micro-partitions.  Micro-partitions are Snowflake's physical unit of storage, locking, and I/O, and each one typically holds up to 500MB of raw (uncompressed) data, which can include many thousands of rows.

The two data storage methods are illustrated in the diagram below:

Row Vs. Column Storage

Snowflake stores data in micro-partitions in columnar format because research has demonstrated that columnar-based storage is around 50 times faster than row-stores for data warehouse queries.  It also has the huge advantage of maximizing data compression and therefore helps to eliminate the I/O bottleneck in reading data from slower disk storage.

Snowflake - Column Optimised Data Storage

The diagram above illustrates how data is loaded into Snowflake.  It shows how data files, which hold data in row format, are converted to columnar format and compressed into micro-partitions before being loaded.

Column stores are around 50 times faster than row stores in data warehouse queries
— Professor Mike Stonebraker. MIT

The process of loading micro partitions in Snowflake is a lot more complex, but the diagram below illustrates the overall steps taken.

Snowflake - Data Loading

The process of loading data into Snowflake includes: 

  1. As data is loaded into Snowflake, the Cloud Services layer automatically stores metadata statistics about the data being loaded.  These include (for reasons which will become clear later), the minimum and maximum value of every column in every micro-partition.

    As statistics are automatically captured, it means subsequent queries can immediately use the data to maximize query performance.

  2.  The virtual warehouse physically constructs the micro-partitions, compressing every column using a potentially different compression algorithm and capturing the statistics for the cloud services layer.

  3. When the data is committed, it is written to slower disk storage on the cloud provider.  This is S3 on AWS, Blob Storage on Azure, or Cloud Storage on Google, and is much slower than the super-fast SSD storage in the virtual warehouse layer.
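
For context, the sketch below shows the kind of bulk load that would trigger this process.  The stage name, file path, and file format options are hypothetical and purely for illustration:

-- Hypothetical example: bulk-load staged CSV files into the SALES table.
-- As the files are loaded, Snowflake builds compressed, columnar micro-partitions,
-- and the Cloud Services layer captures min/max statistics for every column.
copy into sales
from @sales_stage/2022/
file_format = (type = csv skip_header = 1);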

Once complete, the diagram below illustrates how Snowflake stores data.  As the virtual warehouse is suspended, it stores no permanent data, but the Cloud Services layer keeps a record of every micro-partition, its physical location on disk, and the column statistics needed to help tune query performance.

Snowflake metadata statistics and micro-partitions

Snowflake Micro Partition Pruning

The diagram below takes a deeper dive into the data and metadata held by Snowflake.  If we assume data is bulk-loaded at the end of every month, it illustrates the situation by the end of March.

Snowflake Micro-Partition Pruning

The diagram above illustrates how the Cloud Services layer holds metadata statistics, including the minimum and maximum value of the dates loaded, whereas Cloud Storage holds the gigabytes to terabytes of data in contiguous micro-partitions.

Let's assume a user executes the following query:


select *
from sales
where sale_date = '14-Feb-22';

When the query is executed, Snowflake can check the statistics and identify the exact micro partitions to search.  The diagram below illustrates the situation whereby Snowflake knows that data for 14-Feb-22 can only be in micro-partition number three.

Snowflake Automatic Partition Pruning

It’s worth noting the above partition pruning approach does not use Snowflake indexes - which are only supported in Hybrid Tables. A database index stores information about where the data is located, whereas in this case, the Cloud Services metadata records where the data might be.

Using this information, Snowflake can prune (skip over) every other micro-partition, which can lead to incredible query performance improvements on very large tables, reducing elapsed times from hours to milliseconds.

The diagram below illustrates how Snowflake uses the metadata statistics captured in the Cloud Services layer to tune queries and maximize query performance automatically.

How Snowflake Executes a Query

The diagram above illustrates the two-stage tuning process used by Snowflake:

  1. The cloud services layer accepts and parses the SQL query, checks you have the correct access privileges to view the data, and then uses the metadata to perform Static micro-partition elimination, which we'll explain more about below.  This produces a Query Plan, which indicates which tables should be scanned first, the join order, and the steps needed to produce the results.

  2. The Virtual Warehouse executes the query using the Query Plan, but it has additional tricks up its sleeve.  In addition to the Static Elimination performed at compile time to reduce the number of micro-partitions scanned, Snowflake employs Dynamic Run-time Optimization to further speed query performance.

Benchmarking Performance of Snowflake Clustering 

To give some idea of the potential impact of using a Snowflake clustering key on query performance, consider the following SQL statement:


select *
from web_sales
where ws_sold_date_sk  = 2451161
and   ws_item_sk       = 463520;

This query was executed against a table with 4.7TB of data containing around 72 billion rows of data, and the screenshot below demonstrates the performance gains:

The query on the left scanned 300,112 micro-partitions and took nearly 21 minutes to complete, whereas the same query on the right used micro partition pruning to produce the same results in under two seconds. 

Notice the query with partition pruning scanned just four micro-partitions?  This simple step eliminated over 99.99% of the data and delivered results 741 times faster.

While the above may be very impressive, it raises a question:  Why did the first query achieve such poor partition pruning?

Snowflake Clustering:  Good Vs. Bad

There are many reasons why data may be stored with poor clustering; if your data happens to be well clustered without any deliberate effort, you're quite lucky.

Let's consider the previous situation where we load data at the end of each month, but assume we are migrating from an on-premises Oracle data warehouse and must transfer ten years of history.  One sensible way to achieve this would be to unload all ten years to flat files and use COPY or Snowpipe to load the data into Snowflake.

Unlike regular monthly loading, this method risks loading the data in a random sequence.  Instead of loading one month at a time, the data will be extracted from Oracle to files in random order and loaded to Snowflake.

The diagram below illustrates how the same three months of data might look:

In the above example, unlike the data loaded monthly, every micro-partition holds a random set of dates spanning the entire range from January to March.

If we execute a query to find all data for 14-Feb-22 against this data, it would need to scan the entire table – in this example, just four micro-partitions, but in a large table, this could be thousands.

How to Measure Snowflake Clustering

While we've illustrated the extremes from good to poor Snowflake clustering, in reality, we need a method to identify the extent to which data is clustered, and to achieve this, we need to understand the Clustering Depth.

The diagram below illustrates what we mean by a Poor Clustering Depth.  Assuming we queried poorly clustered data, we'd need to scan every micro-partition to find whether it included data for 21-Jan.

Poor Clustering Depth

Compare the situation above to the Good Clustering Depth illustrated in the diagram below.  This shows the same query against a table where the data is highly clustered.

Example of Good Clustering Depth

In this case, Snowflake eliminates all but one micro-partition, as it's the only possible entry with data for 21st January. 

Note:  This doesn't mean that micro-partition two has data for 21st Jan, but it's definitely not in any other micro-partitions.

Using the above examples, we can see that tables with a Low Clustering Depth are highly clustered, whereas tables with a High Clustering Depth are very poorly clustered.
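
Snowflake exposes this measure directly through the SYSTEM$CLUSTERING_DEPTH function.  As a quick check, the sketch below assumes a hypothetical SALES table clustered by SALE_DATE; the lower the number returned, the better the clustering:

-- Returns the average clustering depth for the specified column(s)
select system$clustering_depth('SALES', '(sale_date)');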

Measuring Clustering in Snowflake

Snowflake provides a SQL function, SYSTEM$CLUSTERING_INFORMATION, to measure clustering objectively.  Unfortunately, the output is not very easy to interpret, but we'll explain the key statistics and how to read the runes.

The SQL Snippet below shows how to execute this function:


select system$clustering_information('ORDERS','(o_orderdate)');

The parameters are:

  • Table Name

  • Column Name(s).  Notice these must be enclosed in brackets within a quoted string


The diagram below highlights the key points:

The example above shows the typical output, but you must focus on just a few statistics.

  • Total Partition Count: This is important because unless you have at least 1,000 micro-partitions, you shouldn't be concerned with clustering

  • Average Depth: This estimates the average number of micro-partitions you'd need to read to find all entries for a given value.  In the above example, to find entries for a given O_ORDERDATE.

  • Maximum Depth: This is an estimate of the worst-case number of reads.  In the above example, you'd need to read just six micro-partitions; there are only four entries at this depth.
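
Because the function returns a JSON document, you can extract just these headline statistics in SQL.  The field names below (total_partition_count and average_depth) reflect the typical output, so treat this as an illustrative sketch rather than a definitive query:

-- Parse the JSON output and pull out the headline clustering statistics
select ci:total_partition_count::number as total_partitions,
       ci:average_depth::float          as average_depth
from (select parse_json(system$clustering_information('ORDERS', '(o_orderdate)')) as ci);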

Hopefully, you'll conclude the above indicates this table is highly clustered.  It indicates that despite having 3,242 micro-partitions, around 1/3rd can be found using a single read.  To put this in context, a query to retrieve entries WHERE O_ORDERDATE = '01-JAN-22' would most likely return within seconds despite the table holding around 54GB of data.

The example above illustrates a situation where a table is poorly clustered.  In this case, the average depth is the same as the total number of micro-partitions, indicating the need for a full table scan.

Don't be misled by the partition depth histogram.  After a depth of 16, the histogram buckets increase by a factor of two each time.  The above indicates that any attempt to fetch entries WHERE WS_WEB_PAGE_SK = X needs a full table scan.

Creating a Clustering Key in Snowflake

Creating a Snowflake cluster key is incredibly easy, ensuring data will be automatically clustered and (hopefully) improving query performance.   The SQL Snippet below shows how to create a cluster key on the SALES table:


alter table sales
    cluster by (status, store_no);

Unlike a CREATE INDEX statement, Snowflake clustering is a background task that will take some time.  In benchmark tests, I re-clustered a 1.3TB table STORE_SALES, which took over four hours for the background re-clustering to complete.

The automatic clustering service doesn't use a virtual warehouse but uses Snowflake-supplied compute resources billed separately.   Clustering a terabyte of compressed data costs around 69 credits (around $200).
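
To keep track of this spend, you can monitor the credits consumed by the automatic clustering service.  The sketch below queries the SNOWFLAKE.ACCOUNT_USAGE.AUTOMATIC_CLUSTERING_HISTORY view (assuming you have access to the ACCOUNT_USAGE share, which is subject to some latency):

-- Credits consumed by automatic clustering, per table, over the last 30 days
select table_name,
       sum(credits_used)         as total_credits,
       sum(num_rows_reclustered) as rows_reclustered
from snowflake.account_usage.automatic_clustering_history
where start_time >= dateadd(day, -30, current_timestamp())
group by table_name
order by total_credits desc;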

Effect of Updates on Snowflake Clustering

When a Snowflake cluster key is created on the table, Snowflake will automatically maintain the clustering sequence by re-sorting the data as needed.   The diagram below illustrates how Snowflake deals with update operations:

You can read more about how Snowflake manages updates in my article on how Snowflake physically stores data, but in summary, micro-partitions are immutable: UPDATE and DELETE operations write new micro-partitions and mark the old versions as deleted rather than modifying data in place.

The diagram below illustrates the effect of updates against a table over time:

The diagram above shows how updated micro-partitions in Snowflake are written to the end of the table, increasingly disrupting the sort order as more micro-partitions are updated.

This process slowly turns good clustering into poor clustering, and Snowflake automatically identifies this situation and re-clusters the table in the background.

It's important to note that the number of rows updated is not relevant.  A single update that modifies just one row in every micro-partition can disrupt clustering on the entire table.  The number of micro-partitions updated and the frequency of updates are far more important than the number of rows changed each time.

Pausing and Resuming Snowflake Clustering

Because the automatic re-clustering operation will be billed, the option exists to suspend or resume clustering on a table.  Using the following SQL commands, we can temporarily halt the re-clustering operation:


alter table sales suspend recluster;

alter table sales resume recluster;

Likewise, you can verify whether a table has automatic clustering enabled using the following SQL:


show tables;
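
The AUTOMATIC_CLUSTERING column in the output shows ON or OFF for each table.  To narrow the results to a single table and return just the relevant columns, you can use a sketch like the one below (the quoted column names reflect the typical SHOW TABLES output):

-- Check the cluster key and automatic clustering status of the SALES table
show tables like 'SALES';

select "name", "cluster_by", "automatic_clustering"
from table(result_scan(last_query_id()));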

What does Clustering in Snowflake do?

Clustering in Snowflake physically sorts the data and stores it in the sequence of the Snowflake cluster key.  However, it's not quite as simple as that. 

A sort operation would physically sort every row in the table, which can be incredibly expensive.  Instead, Snowflake performs an approximate sort that groups (or clusters) data with the same key value. This means (for example) that if data is clustered by DATE, it tries to keep all values for a given date in the same micro-partition. 

However, clustering will not sort data within the micro-partition as there's no benefit, making clustering far more efficient than data sorting.

Why not Sort Data after Loading? 

There's no technical reason why you cannot sort data after it's been loaded into Snowflake.  Indeed, the following command will re-sort the sales table:


insert overwrite into sales
select *
from sales
order by sale_date;

Be aware that large sort operations can lead to memory spillage, extending the elapsed time and query cost. Assuming the SALES table is large, this operation is best executed on a sizable virtual warehouse, as it will re-sort the entire table.

The advantage of using automatic Snowflake clustering, however, is that it's incremental.  Once micro-partitions are well clustered, the automatic clustering service ignores them; these appear in the output of SYSTEM$CLUSTERING_INFORMATION above as the constant_partition_count.

Because automatic clustering is incremental, it can be significantly less expensive to recluster data rather than performing a full sort each time.  In addition, unlike a sort operation, the clustering operation runs as a background task without locking the table and, therefore, impacting other operations.

Snowflake Clustering Best Practices

By now, you'll know how clustering works, how to create, suspend, or resume Snowflake clustering, and the alternative sort operation.  I would, however, advise caution before deploying clustering.  Experience has demonstrated that around 80% of customers deploy cluster keys poorly, which can lead to significant costs with little benefit.

It’s sensible to follow the Snowflake clustering best practices to avoid mistakes.

Best Practices for Snowflake Cluster Keys

Snowflake clustering is only sensible in a relatively small number of situations, including:

  • Large Tables:  Unless you have at least 1,000+ micro-partitions in the table, you won't benefit from clustering.  Snowflake recommends cluster keys on tables over 1TB in size, although in reality (as with many things), it all depends.  However, as clustering eliminates entire micro-partitions, it is only sensible on relatively large tables.

  • Frequent Queries:  Snowflake Cluster Keys are best deployed on tables with many SELECT queries relative to the number of UPDATE or DELETE operations.  Remember, the purpose of cluster keys is to maximize query performance.  There's little potential benefit if there are few queries against the table.

  • Partitions Scanned:  The objective of clustering is to reduce the number of micro-partitions scanned by queries.  You should therefore check that the queries you intend to improve actually scan a high proportion of the data.  You can check this by comparing the QUERY_HISTORY columns PARTITIONS_SCANNED and PARTITIONS_TOTAL, as shown in the example query after this list.

  • Query Elapsed Time:  It sounds obvious, but the objective of clustering is reducing the query elapsed times.  You should therefore target queries with poor elapsed times.

  • Few Updates:  Because of the impact of UPDATE and DELETE operations on clustering, it's seldom cost-effective to deploy a clustering key on a table with frequent updates or deletes.  There are mitigating factors, but keep this in mind.  I did (for example) have a customer who created cluster keys on a table that was frequently deleted and re-loaded.   They abandoned clustering on the entire system because they saw no performance benefits.
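
As a starting point for the Partitions Scanned check above, the sketch below queries the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view (which has some latency) to find recent queries against a table that scan a high proportion of its micro-partitions.  The table name and thresholds are hypothetical:

-- Find recent queries against SALES that scanned most of the table's micro-partitions
select query_id,
       total_elapsed_time / 1000 as elapsed_seconds,
       partitions_scanned,
       partitions_total,
       round(partitions_scanned / nullif(partitions_total, 0) * 100, 1) as pct_scanned
from snowflake.account_usage.query_history
where query_text ilike '%sales%'
  and partitions_total > 1000
  and start_time >= dateadd(day, -7, current_timestamp())
order by pct_scanned desc nulls last
limit 50;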

Best Practices to Identify the Right Cluster Key  

Choosing inappropriate cluster columns will add to re-clustering cost for little or no performance gains.  The best practices include:

  • The Where Clause:  It's absolutely critical that cluster key columns frequently appear in the WHERE clause of queries, ideally using equality or range comparisons. 

  • Wrapping columns with functions:  I frequently see queries that convert the column from DATE to VARCHAR to compare it against a fixed string, as illustrated in the example below.  This leads to poor query performance as Snowflake cannot fully prune micro-partitions.

-- Avoid the following:
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.ORDERS
where to_char(o_orderdate,'DD-MM-YY') = '01-04-92';

-- Instead use:
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.ORDERS
where o_orderdate = to_date('01-04-92','DD-MM-YY');

Snowflake Clustering Best Practices: Unique Keys

A common mistake customers make is creating a cluster key that includes a TIMESTAMP column.  Snowflake clustering aims to bucket rows with the same key value together in micro-partitions, and clustering by a unique key goes against this.

You can create a cluster key on a unique key, and the query performance may be excellent, but the cost of reclustering from update operations may outweigh the benefits.

Instead of creating the cluster key directly on a TIMESTAMP, modify the key to reduce its cardinality.  For example:

-- Instead of clustering on the raw timestamp
alter table mobile_calls
   cluster by (event_timestamp);

-- Cluster by the DAY
alter table mobile_calls
   cluster by (date_trunc('DAY', event_timestamp));

Using the above technique means any query fetching data for a given DATE or TIMESTAMP performs partition elimination by day, which will massively reduce the clustering cost but retain the performance benefits.

Best Practices: Cluster Key Data Types

Sometimes, data may be delivered in VARCHAR format instead of DATE or NUMBER formats.  It's highly advisable to store data in its true native data type.

Instead of storing dates in VARCHAR format 'YYYYMMDD', you should store them as DATE.  Likewise, numbers should be stored as datatype NUMBER.

It's a little-known fact that clustering is only performed on the first five characters of the key.  This means if you cluster by a column in the format YYYYMMDD, you'll actually be clustering by YYYYM, and most data will be clustered by year.
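
As an illustration, the hypothetical SALES_HISTORY table below stores the sale date as a true DATE column and clusters on it, rather than clustering on a 'YYYYMMDD' string:

-- Store dates as DATE (not VARCHAR 'YYYYMMDD') so clustering works on the full value
create or replace table sales_history (
    sale_id    number,
    sale_date  date,
    store_no   number,
    amount     number(10,2)
)
cluster by (sale_date);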

Best Practices: Initial Snowflake Clustering

As indicated above, Snowflake performs automatic clustering on an incremental basis, which can be highly efficient for ongoing data clustering.  However, manually sorting the data is almost certainly cheaper and faster for the initial clustering effort. 

Consider the following SQL, which sorts a 1.3TB table:

alter warehouse demo_wh set warehouse_size = 'X2LARGE'; 

-- Copy 1.3TB table STORE_SALES and sort by customer 
create or replace table store_sales_sort_by_customer as 
select * 
from snowflake_sample_data.tpcds_sf10tcl.store_sales 
order by ss_customer_sk; 

alter warehouse demo_wh suspend;

This took 23 minutes on an X2LARGE warehouse, costing 12 credits, whereas the same operation using background automatic clustering cost over five times as much (69 credits) and took over four hours.

Therefore, the initial clustering (where possible) should use a sorted INSERT OVERWRITE or CREATE TABLE AS SELECT operation, leaving subsequent maintenance to automatic clustering.

Best Practices: Frequent Updates

As mentioned above, clustering is normally unsuited for tables with frequent UPDATEs as Snowflake records data modification events (Inserts, Updates, or Deletes) and triggers re-clustering.

However, customers sometimes have frequent Update or Delete operations against clustered tables, which risks a very high re-clustering cost. 

In extreme cases, Snowflake may constantly re-cluster data, which is then disrupted by subsequent update operations, leading to costly reclustering.

One way around this is to manually suspend and resume clustering around peak update times.  Many customers find that suspending clustering during the week and resuming it at the weekend (when there are fewer updates) restores the balance between keeping the tables well clustered and controlling the re-clustering cost.
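
One way to automate this (a minimal sketch, assuming Snowflake tasks are suitable in your environment; the task names, warehouse, and CRON schedule are hypothetical) is shown below:

-- Hypothetical schedule: suspend re-clustering on Monday morning and resume it on Friday night (UTC)
create or replace task suspend_sales_recluster
    warehouse = demo_wh
    schedule  = 'USING CRON 0 6 * * 1 UTC'
as
    alter table sales suspend recluster;

create or replace task resume_sales_recluster
    warehouse = demo_wh
    schedule  = 'USING CRON 0 22 * * 5 UTC'
as
    alter table sales resume recluster;

-- Tasks are created in a suspended state and must be enabled
alter task suspend_sales_recluster resume;
alter task resume_sales_recluster  resume;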

In many cases, updates have only a marginal impact on query performance, leading to graceful performance degradation over time.  Re-clustering during quiet times can restore query performance without the cost of frequent clustering operations.

Conclusion

Hopefully, by now, you'll appreciate why Snowflake Cluster Keys and Clustering are considered more of an art than a science.  However, many Snowflake customers have proven these good practices successful (and, likewise, discovered many pitfalls to avoid).

Bookmark this article and return to it the next time you consider deploying clustering on a table.  You will find it worthwhile.


Disclaimer: The opinions expressed on this site are entirely my own, and will not necessarily reflect those of my employer.


John A. Ryan

