PostgreSQL’s DISTINCT ON clause is a powerful tool for filtering result sets. However, the syntax is not available in Snowflake. Thankfully, there are several methods you can use to get the same results.
This article walks you through three techniques that provide the equivalent of DISTINCT ON in Snowflake. Each method combines the ROW_NUMBER() windows function with additional syntax.
The first two methods use standard SQL syntax. The third method is non-standard SQL specific to Snowflake and some other database platforms.
- Use ROW_NUMBER() with a CTE
- Use ROW_NUMBER() with a subquery
- Use ROW_NUMBER() with the QUALIFY clause
Table of Contents
What Is The DISTINCT ON Clause?
The DISTINCT ON clause in PostgreSQL is a useful feature that allows you to return the first row of each group of duplicates based on certain columns.
Unlike the standard DISTINCT clause, it keeps the “first” row of each group of duplicates based on a specific column or set of columns. This is easier to explain with an example.
Let’s say you have a stationary company that has a table of its products:
ID | Name | Category |
1 | Pen | Office Supplies |
2 | Pencil | School Supplies |
3 | Pen | Office Supplies |
4 | Pencil | School Supplies |
5 | Crayon | School Supplies |
You want to eliminate the duplicates but return the first row for each unique product-category combination. This is the syntax in PostgreSQL:
SELECT DISTINCT ON (name, category) *
FROM product
ORDER BY name, category, id;
This is the output:
ID | Name | Category |
1 | Pen | Office Supplies |
2 | Pencil | School Supplies |
5 | Crayon | School Supplies |
Standard DISTINCT
If you need a run through of all aspects of the more standard syntax for unique values, check out our article on the DISTINCT keyword in Snowflake.
It covers using it on single and multiple columns, counting unique values, and what you need to know about the treatment of null values.
Sample Data For This Article
If you want to follow along with the examples in this article, use this script to create the sample table and data on Snowflake:
Create Table PRODUCT (
Id VARCHAR NOT NULL PRIMARY KEY,
NAME VARCHAR NOT NULL,
CATEGORY VARCHAR NOT NULL
);
INSERT INTO PRODUCT VALUES (1, 'Pen', 'Office Supplies');
INSERT INTO PRODUCT VALUES (2, 'Pencil', 'School Supplies');
INSERT INTO PRODUCT VALUES (3, 'Pen', 'Office Supplies');
INSERT INTO PRODUCT VALUES (4, 'Pencil', 'School Supplies');
INSERT INTO PRODUCT VALUES (5, 'Crayon', 'School Supplies');
If you’re still learning the basics, check out our article on creating tables in Snowflake.
Use ROW_NUMBER With A CTE
You can achieve similar results to the PostgreSQL DISTINCT ON functionality by using the ROW_NUMBER() window function with a CTE expression.
Here is the SQL for our sample data:
WITH numbered_products AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY name, category ORDER BY id) AS row_num
FROM product
)
SELECT id, name, category
FROM numbered_products
WHERE row_num = 1
ORDER BY id;
In this query, the “ROW_NUMBER() OVER (PARTITION BY name, category ORDER BY id)” expression gives each row a unique number within each partition of name and department ordered by id,
The WHERE row_num = 1 clause in the outer query then only selects the first row from each of these groups, effectively mimicking the behavior of DISTINCT ON.
Use ROW_NUMBER With A Subquery
You can also achieve similar results by using the ROW_NUMBER() window function with a subquery.
Here is the SQL for our sample data:
SELECT id, name, category
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY name, category ORDER BY id) AS row_num
FROM product
) AS subquery
WHERE row_num = 1
ORDER BY id;
The subquery is defined in the FROM clause and assigned the alias subquery. The main query then selects from this subquery and applies the WHERE and ORDER BY clauses just like before.
Use ROW_NUMBER() With The QUALIFY Clause
You can use the QUALIFY clause to filter the results of a window function, which can be used to achieve the same effect as DISTINCT ON.
The QUALIFY clause works similarly to a WHERE clause, but it operates on the results of window functions. It’s important to note that the QUALIFY clause is specific to Snowflake and a few other SQL dialects. It’s not a part of standard SQL.
Here is the SQL for our sample data:
SELECT id, name, category
FROM product
QUALIFY ROW_NUMBER() OVER (PARTITION BY name, category ORDER BY id) = 1
ORDER BY id;
In this query:
- the ROW_NUMBER() function assigns a unique row number within each partition
- the QUALIFY clause filters out all rows except the first one in each partition.
This gives you the same result as ROW_NUMBER() with the CTE or subquery techniques, but in a more concise form.
Don’t Use The First_Value() Function
You may be tempted to use the First_Value() function with partitioned data. The problem is that it doesn’t filter out the duplicate rows.
You’ll end up with the first value for each partition in every row of that partition. It won’t eliminate rows from the result set unlike the other methods we’ve looked at in this article.