Everything About SQL DISTINCT In Snowflake

The DISTINCT keyword in SQL lets you eliminate redundancy in your data queries and results. It’s particularly important in Snowflake where data can be enormous in size and highly redundant.

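Using DISTINCT in Snowflake can also make downstream work cheaper when dealing with large volumes of data in the cloud, because a deduplicated result set is smaller. It can help in:

  • reducing network traffic, since fewer rows are returned to the client
  • minimizing storage requirements when deduplicated results are saved to a table
  • improving the performance of later queries that read the smaller set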

This article runs through every aspect that you need to know about DISTINCT in Snowflake.

How The DISTINCT Keyword Works In SQL Databases

The fundamental role of DISTINCT in SQL is to remove duplicate rows from a result set. It ensures that every row in the output is unique.

When you use the keyword in a SELECT statement, the database system evaluates the uniqueness of the rows based on the columns you specified in the SELECT clause.

DISTINCT with one column

If you specify one column after the DISTINCT keyword, the system ensures that every value in that column is unique in the result set.

DISTINCT with multiple columns

If you specify more than one column, the system ensures that every combination of values across those columns is unique.

How The DISTINCT Keyword Works In Snowflake

Snowflake is a SQL database and the DISTINCT operator is functionally the same as with many other systems. However, Snowflake’s implementation is specifically optimized for its own unique architecture.

When you use DISTINCT in a SELECT statement in Snowflake, the system begins by scanning the selected columns for duplicate data.

This scan takes advantage of the underlying architecture to distribute the data and the computation across multiple nodes. This makes the process efficient and scalable for large data sets.

Snowflake can also use caching to improve the performance of repeated DISTINCT queries.

Sample Data

I will use this product table and data to illustrate the various uses of DISTINCT. The last row has a NULL value in the third column.

ID  NAME    CATEGORY
1   Pen     Office Supplies
2   Pencil  School Supplies
3   Crayon  School Supplies
4   Chalk   NULL

If you want to follow along by running the SQL statements in this article, use this script to create the sample table and add the data.

-- ID is numeric to match the integer values inserted below
CREATE OR REPLACE TABLE PRODUCT (
    ID INTEGER NOT NULL PRIMARY KEY,
    NAME VARCHAR NOT NULL,
    CATEGORY VARCHAR
);

INSERT INTO PRODUCT VALUES (1, 'Pen', 'Office Supplies');
INSERT INTO PRODUCT VALUES (2, 'Pencil', 'School Supplies');
INSERT INTO PRODUCT VALUES (3, 'Crayon', 'School Supplies');
INSERT INTO PRODUCT VALUES (4, 'Chalk', NULL);

How To Select Distinct Values From A Single Column

The basic syntax is straightforward. The DISTINCT keyword is used within the SELECT statement to retrieve unique rows from a table.

Suppose we want a list of the unique categories in the product table. This is the syntax:

SELECT DISTINCT Category FROM Product;

When you run this command, Snowflake will return a result set of three rows (the order is not guaranteed without an ORDER BY clause):

  1. Office Supplies
  2. School Supplies
  3. NULL

This effectively eliminates the duplicate entries in the column.

How To Select Distinct Values Across Multiple Columns

If you want to select unique rows based on multiple columns, you can include more column names after the DISTINCT keyword.

Suppose we want the unique combinations of product name and category. This is the syntax:

SELECT DISTINCT Name, Category FROM Product;
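
Every name in the sample table is unique, so the query above returns all four rows. To see combinations actually collapse, here is a small sketch that uses an inline VALUES list with one repeated pair. The rows are made up for illustration and are not part of the sample table.

```sql
-- Hypothetical inline rows: the ('Pen', 'Office Supplies') pair appears twice.
SELECT DISTINCT NAME, CATEGORY
FROM (VALUES
    ('Pen', 'Office Supplies'),
    ('Pen', 'Office Supplies'),
    ('Pen', 'School Supplies')
) AS t (NAME, CATEGORY);
-- Two rows come back: the repeated pair is returned once, while
-- ('Pen', 'School Supplies') stays because its CATEGORY differs.
```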

Ordering with multiple columns

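The order in which you specify the column names after the DISTINCT keyword changes the left-to-right order of the columns in the output, but not which combinations are returned. The order of the rows is a separate matter: it is only guaranteed when you add an ORDER BY clause.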

Be sure that your query accurately represents the data that you want to retrieve.
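
As a quick check of how column order and ORDER BY interact, the two queries below are a sketch against the sample PRODUCT table. They should return the same four combinations; only the position of the columns differs, and the ORDER BY clause is what fixes the row order.

```sql
-- Same combinations, columns listed in a different order.
-- ORDER BY NAME makes the row order deterministic in both cases.
SELECT DISTINCT NAME, CATEGORY FROM PRODUCT ORDER BY NAME;
SELECT DISTINCT CATEGORY, NAME FROM PRODUCT ORDER BY NAME;
```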

How To Select Distinct Values Across All Columns

Regardless of how many columns are in the table, you can include them all when evaluating the number of distinct rows in the set.

Thankfully, the * syntax works. This should save you some typing! Here is the syntax to select distinct for all columns:

SELECT DISTINCT * FROM PRODUCT;
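
A related case is wanting every column except one, for example ignoring a surrogate key when comparing rows. The sketch below assumes the EXCLUDE clause that Snowflake accepts with SELECT *; treat it as an illustration rather than the only way to do this.

```sql
-- EXCLUDE drops ID from the * expansion, so rows are compared
-- on NAME and CATEGORY only.
SELECT DISTINCT * EXCLUDE (ID) FROM PRODUCT;
```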

How To Count Distinct Values In A Set

The COUNT function in Snowflake returns the number of rows in a set when you use COUNT(*). When you pass it a column, it returns the number of rows in which that column is not NULL. For example, you can count the category values:

SELECT COUNT(Category) FROM PRODUCT;

This returns a value of 3: one “Office Supplies” value and two “School Supplies” values. The row with the NULL category is not counted.

It’s important to understand how the functionality changes when you combine COUNT with the DISTINCT keyword. The combination returns the number of distinct non-NULL values in a set.

For our sample data, the below statement will return a value of 2. The NULL row is not counted.

SELECT COUNT(DISTINCT Category) FROM PRODUCT;
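
To compare the different forms of COUNT side by side, here is a small sketch against the sample PRODUCT table. The column aliases are only there for readability.

```sql
SELECT
    COUNT(*)                 AS total_rows,          -- 4: every row, NULLs included
    COUNT(CATEGORY)          AS non_null_categories, -- 3: the NULL category is skipped
    COUNT(DISTINCT CATEGORY) AS distinct_categories  -- 2: Office Supplies, School Supplies
FROM PRODUCT;
```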

How Snowflake Treats NULL Values

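Snowflake, like most SQL databases, treats NULL as a distinct value in SELECT DISTINCT. All NULLs are treated as duplicates of each other, so they collapse into a single NULL row in the result. However, when a column is passed to the COUNT function, NULLs are ignored.

The reason is that the COUNT function considers NULLs as “unknown” or “missing” values and therefore does not include them in its count. Only COUNT(*) counts every row regardless of NULLs.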
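
The sketch below illustrates both behaviors. The inline rows are hypothetical, and the second query contrasts ordinary equality with the null-safe IS NOT DISTINCT FROM comparison, which is closer to how DISTINCT groups NULLs together.

```sql
-- Hypothetical inline rows with two NULLs.
SELECT DISTINCT CATEGORY
FROM (VALUES ('School Supplies'), (NULL), (NULL)) AS t (CATEGORY);
-- Two rows: 'School Supplies' and a single NULL.

-- Plain equality on NULLs yields NULL, whereas the null-safe
-- comparison treats two NULLs as the same value.
SELECT
    NULL = NULL                    AS plain_equality,      -- NULL
    NULL IS NOT DISTINCT FROM NULL AS null_safe_equality;  -- TRUE
```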

How To Include NULL When Counting Distinct Values

If you want to include NULLs in your distinct count, you need to convert NULLs to a non-NULL value.

This can be done by using a function like COALESCE. Here’s an example:

SELECT COUNT(DISTINCT COALESCE(CATEGORY, 'Not Set')) FROM PRODUCT;
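
One caveat with COALESCE is that the placeholder must not already occur as a real value: if a product had the category 'Not Set', it would be merged with the NULLs. A possible alternative, sketched here against the sample table, is to count the rows of a SELECT DISTINCT subquery, which needs no placeholder.

```sql
-- SELECT DISTINCT keeps one NULL row, and COUNT(*) counts every row,
-- so NULL is included in the total without a substitute value.
SELECT COUNT(*) AS distinct_categories_including_null
FROM (SELECT DISTINCT CATEGORY FROM PRODUCT);
-- Returns 3 for the sample data.
```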

How To Use DISTINCT With Other SQL Clauses

You can create more complex queries by combining DISTINCT with other SQL clauses. These include:

  • WHERE
  • ORDER BY
  • GROUP BY

WHERE clause

The WHERE clause lets you filter the values in the set before eliminating duplicates.

For example, if you want to get unique categories for products whose names start with the letter “P”, this SQL will do the trick:

SELECT DISTINCT CATEGORY FROM PRODUCT WHERE NAME LIKE 'P%';
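
Note that LIKE is case-sensitive in Snowflake, so the pattern above would miss a name stored as “pen”. If that matters for your data, a case-insensitive variant can be sketched with ILIKE:

```sql
-- ILIKE is the case-insensitive counterpart of LIKE in Snowflake.
SELECT DISTINCT CATEGORY FROM PRODUCT WHERE NAME ILIKE 'p%';
```

With the sample data, both versions return “Office Supplies” and “School Supplies”.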

ORDER BY

The ORDER BY clause sorts the unique result set by the columns you specify.

If you want the categories listed in descending alphabetical order, this syntax can be used:

SELECT DISTINCT CATEGORY FROM PRODUCT ORDER BY CATEGORY DESC;
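
With Snowflake’s default NULL ordering, NULL sorts as higher than any other value, so in descending order the NULL category appears first. If you would rather see it at the end, a minimal sketch is to state the NULL placement explicitly:

```sql
-- Descending order with the NULL category placed last instead of first.
SELECT DISTINCT CATEGORY
FROM PRODUCT
ORDER BY CATEGORY DESC NULLS LAST;
```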

Tips For Using DISTINCT With Large Data Sets

Using the DISTINCT keyword with large datasets can present performance challenges. This is mostly due to the computational load of identifying unique values across millions or billions of rows.

Here are some tips and best practices to improve performance.

Low cardinality

To apply DISTINCT, the engine has to keep track of every unique value (or combination of values) it has seen, typically by hashing or sorting the rows. You can reduce the load on resources if the columns in the DISTINCT set have lower cardinality.

This means that the total number of unique values is a small fraction of the total number of rows.
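
One rough way to gauge cardinality before relying on DISTINCT is to compare the number of unique values with the total row count. The query below is a sketch against the sample table; the aliases are illustrative, and on very large tables you may prefer an approximate count for the first column.

```sql
-- Ratio close to 0: few unique values (low cardinality).
-- Ratio close to 1: almost every row is unique (high cardinality).
SELECT
    COUNT(DISTINCT CATEGORY)                       AS distinct_values,
    COUNT(*)                                       AS total_rows,
    COUNT(DISTINCT CATEGORY) / NULLIF(COUNT(*), 0) AS distinct_ratio
FROM PRODUCT;
```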

Clustering keys

When you create tables in Snowflake, you can include clustering keys that organize the data according to similar values.

If you frequently use DISTINCT with specific columns, you should consider setting a clustering key on those columns to potentially improve query performance.
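
As a sketch of what that could look like, the statements below set a clustering key on the CATEGORY column of the sample table and then inspect the clustering state. The sample table is far too small to benefit; clustering is generally aimed at very large tables, so treat this purely as a syntax illustration.

```sql
-- Define (or change) the clustering key on an existing table.
ALTER TABLE PRODUCT CLUSTER BY (CATEGORY);

-- Inspect how well the table is clustered on that column.
SELECT SYSTEM$CLUSTERING_INFORMATION('PRODUCT', '(CATEGORY)');
```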

Our article on clustering in Snowflake will give you more background.

Use estimates for counting distinct values

The APPROX_COUNT_DISTINCT function approximates the count of distinct values in a column.

This is especially useful when an exact count isn’t necessary. It is faster and less resource-intensive than using COUNT(DISTINCT), especially with large data sets and high cardinality.

Here is the syntax for our sample data:

SELECT APPROX_COUNT_DISTINCT(CATEGORY) FROM PRODUCT;

The behavior with NULL values is the same as with COUNT(DISTINCT), i.e., they are not included in the count.
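
To judge whether the estimate is close enough for your purposes, you can run the exact and approximate counts side by side on a representative table. A minimal sketch using the sample data:

```sql
SELECT
    COUNT(DISTINCT CATEGORY)        AS exact_count,
    APPROX_COUNT_DISTINCT(CATEGORY) AS approx_count
FROM PRODUCT;
```

On a table this small the two numbers should agree; the difference, and the speed advantage, only become noticeable at scale.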

Does Snowflake Have PostgreSQL’s DISTINCT ON?

Snowflake does not have PostgreSQL’s DISTINCT ON syntax. However, it’s possible to achieve the same results as that handy method.

Our article on using DISTINCT ON in Snowflake shows you how to do so.
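
As a brief illustration, one common approach (not necessarily the one used in that article) is to number the rows within each group and keep only the first with QUALIFY. The choice of ordering by ID is an assumption made for this sketch.

```sql
-- Keep one row per CATEGORY: the one with the lowest ID.
SELECT ID, NAME, CATEGORY
FROM PRODUCT
QUALIFY ROW_NUMBER() OVER (PARTITION BY CATEGORY ORDER BY ID) = 1;
```

For the sample data this keeps Pen, Pencil and Chalk, since rows with a NULL category form their own group.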

How To Delete Duplicate Rows In Snowflake

Now that you understand the basic functions, let’s look at one of the most common use cases for DISTINCT: removing duplicate records from tables.

Check if there are duplicate records

The first step is to identify whether there are duplicate values in your data. To do so, you can count the distinct rows or values and compare this to the total number of rows.

If the distinct values are fewer than the total number, then you know that you have duplicates.

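Here are the two queries with our sample data. The first counts the distinct rows through a subquery, because COUNT(DISTINCT ID, NAME, CATEGORY) skips any row in which one of the listed columns is NULL and would report 3 instead of 4 for the sample table:

SELECT COUNT(*) FROM (SELECT DISTINCT * FROM PRODUCT);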

SELECT COUNT(*) FROM PRODUCT;
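
Comparing counts tells you whether duplicates exist, but not which rows are affected. A sketch of how to list them is to group by every column and keep the groups that occur more than once. GROUP BY places NULLs in the same group, so rows with a NULL category are handled as well.

```sql
-- Each returned row is a duplicated record, with the number of copies.
SELECT ID, NAME, CATEGORY, COUNT(*) AS copies
FROM PRODUCT
GROUP BY ID, NAME, CATEGORY
HAVING COUNT(*) > 1;
```

For the sample data this returns no rows, because every record is unique.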

Remove duplicate rows

When you want to permanently delete duplicate rows from a table, you can do this in two steps:

  1. Create a new table with the distinct rows.
  2. Replace the original table with the reduced version.

I’ve already shown you how to run a SELECT DISTINCT across all columns in the table. Here is the full syntax to eliminate the duplicates:

CREATE TABLE new_table AS SELECT DISTINCT * FROM original_table;

DROP TABLE original_table;

ALTER TABLE new_table RENAME TO original_table;

Bear in mind that the new table does not inherit constraints, such as a primary key, from the original table. Standard Snowflake tables have no indexes, but you will need to recreate the constraints if you want them preserved.
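
If keeping the existing table definition matters, one alternative worth considering is to deduplicate in place with INSERT OVERWRITE. This is a sketch that reuses the placeholder name original_table from above; test it on a copy before running it against real data.

```sql
-- OVERWRITE truncates the table and reloads it with the deduplicated
-- rows in a single statement, keeping the existing table definition.
INSERT OVERWRITE INTO original_table
SELECT DISTINCT * FROM original_table;
```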
