How To Use DISTINCT ON With Snowflake

PostgreSQL’s DISTINCT ON clause is a powerful tool for filtering result sets. However, the syntax is not available in Snowflake. Thankfully, there are several methods you can use to get the same results.

This article walks you through three techniques that provide the equivalent of DISTINCT ON in Snowflake.  Each method combines the ROW_NUMBER() windows function with additional syntax.

The first two methods use standard SQL syntax. The third method is non-standard SQL specific to Snowflake and some other database platforms.

  1. Use ROW_NUMBER() with a CTE
  2. Use ROW_NUMBER() with a subquery
  3. Use ROW_NUMBER() with the QUALIFY clause

What Is The DISTINCT ON Clause?

The DISTINCT ON clause in PostgreSQL is a useful feature that allows you to return the first row of each group of duplicates based on certain columns.

Unlike the standard DISTINCT clause, it keeps the “first” row of each group of duplicates based on a specific column or set of columns. This is easier to explain with an example.

Let’s say you have a stationary company that has a table of its products:

IDNameCategory
1PenOffice Supplies
2PencilSchool Supplies
3PenOffice Supplies
4PencilSchool Supplies
5CrayonSchool Supplies

You want to eliminate the duplicates but return the first row for each unique product-category combination. This is the syntax in PostgreSQL:

SELECT DISTINCT ON (name, category) *
FROM product
ORDER BY name, category, id;

This is the output:

IDNameCategory
1PenOffice Supplies
2PencilSchool Supplies
5CrayonSchool Supplies

Standard DISTINCT

If you need a run through of all aspects of the more standard syntax for unique values, check out our article on the DISTINCT keyword in Snowflake.

It covers using it on single and multiple columns, counting unique values, and what you need to know about the treatment of null values.

Sample Data For This Article

If you want to follow along with the examples in this article, use this script to create the sample table and data on Snowflake:

Create Table PRODUCT (
Id VARCHAR NOT NULL PRIMARY KEY,
NAME VARCHAR NOT NULL,
CATEGORY VARCHAR NOT NULL
);

INSERT INTO PRODUCT VALUES (1, 'Pen', 'Office Supplies');
INSERT INTO PRODUCT VALUES (2, 'Pencil', 'School Supplies');
INSERT INTO PRODUCT VALUES (3, 'Pen', 'Office Supplies');
INSERT INTO PRODUCT VALUES (4, 'Pencil', 'School Supplies');
INSERT INTO PRODUCT VALUES (5, 'Crayon', 'School Supplies');

If you’re still learning the basics, check out our article on creating tables in Snowflake.

Use ROW_NUMBER With A CTE

You can achieve similar results to the PostgreSQL DISTINCT ON functionality by using the ROW_NUMBER() window function with a CTE expression.

Here is the SQL for our sample data:

WITH numbered_products AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY name, category ORDER BY id) AS row_num
    FROM product
)
SELECT id, name, category
FROM numbered_products
WHERE row_num = 1
ORDER BY id;

In this query, the “ROW_NUMBER() OVER (PARTITION BY name, category ORDER BY id)” expression gives each row a unique number within each partition of name and department ordered by id,

The WHERE row_num = 1 clause in the outer query then only selects the first row from each of these groups, effectively mimicking the behavior of DISTINCT ON.

Use ROW_NUMBER With A Subquery

You can also achieve similar results by using the ROW_NUMBER() window function with a subquery.

Here is the SQL for our sample data:

SELECT id, name, category
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY name, category ORDER BY id) AS row_num
    FROM product
) AS subquery
WHERE row_num = 1
ORDER BY id;

The subquery is defined in the FROM clause and assigned the alias subquery. The main query then selects from this subquery and applies the WHERE and ORDER BY clauses just like before.

Use ROW_NUMBER() With The QUALIFY Clause

You can use the QUALIFY clause to filter the results of a window function, which can be used to achieve the same effect as DISTINCT ON.

The QUALIFY clause works similarly to a WHERE clause, but it operates on the results of window functions. It’s important to note that the QUALIFY clause is specific to Snowflake and a few other SQL dialects. It’s not a part of standard SQL.

Here is the SQL for our sample data:

SELECT id, name, category
FROM product
QUALIFY ROW_NUMBER() OVER (PARTITION BY name, category ORDER BY id) = 1
ORDER BY id;

In this query:

  • the ROW_NUMBER() function assigns a unique row number within each partition
  • the QUALIFY clause filters out all rows except the first one in each partition.

This gives you the same result as ROW_NUMBER() with the CTE or subquery techniques, but in a more concise form.

Don’t Use The First_Value() Function

You may be tempted to use the First_Value() function with partitioned data. The problem is that it doesn’t filter out the duplicate rows.

You’ll end up with the first value for each partition in every row of that partition. It won’t eliminate rows from the result set unlike the other methods we’ve looked at in this article.