The default SQL standard used by Databricks has changed since the platform first launched.
Databricks originally used Spark SQL as the default SQL dialect, but changed the standard in late 2021. The default dialect in Databricks is currently ANSI Standard SQL.
Because there are differences between the two dialects, developers and analysts should be aware of several consequences. I’ll give you the background in this article, and then point out some traps to be wary of.
Table of Contents
Why Databricks Started With Spark SQL
Let’s first address why Spark SQL was the original standard. This won’t be much of a surprise to you if you’re familiar with the origins of Databricks.
The company was created by seven of the students and professors at Berkeley University who developed Spark, the distributed processing engine. They donated the Spark code to Apache as an open-source license.
These creators then founded Databricks to commercialize a platform and service layer running on top of the free engine.
Spark SQL is a Spark module that performs SQL-like operations on data stored in memory. That’s why it was the original standard for Databricks, it was already there.
It’s fair to say that at one time the Databricks documentation for Spark SQL was the de-facto standard for this implementation of the data language. But in more recent years, you should use the Apache documentation as your reference.
Why Databricks Changed Their Standard SQL
So, why did Databricks change their standards?
Because enterprises adopting the platform usually had one or many migration projects to bring their existing databases into the new regime. Most enterprise databases are still SQL-based, despite the advent of No-SQL.
ANSI SQL is the standardized specification for the SQL language. As its been around since 1986, it’s widely implemented in the industry. Different DBMS platforms have extra proprietary syntax, but all the major vendors have ANSI SQL baked in.
The problem for enterprises was that Spark SQL has important differences with ANSI SQL (we’ll look at the specifics later in this article). On a practical level, this meant that migrating to Databricks wasn’t just about moving the data. Companies also had code migration projects.
As this was an obstacle to adoption of the platform, the company decided to switch to ANSI SQL as the default.
What Is ANSI SQL?
SQL was first developed in the 1970s at IBM. As different DMBS platforms and vendors adopted the database programming language, it became clear that a standard was needed to allow the exchange of data between systems.
The American National Standards Institute (ANSI) doesn’t develop standards itself. Instead, it reviews and approves a consensus between relevant stakeholders.
In the case of SQL, the stakeholders included IBM, the U.S. Department of Commerce, and various industry and academic professionals brought together by ANSI’s technical committee.
In 1986, ANSI approved a consensus specification for SQL. The following year, the International Organization of Standardization (ISO) adopted the same standard.
There have been several revisions and additions since 1986. They are usually referred to the years that they are published. SQL-92 was the first major expansion.
Differences Between Spark SQL And ANSI SQL
Here are some of the major differences that developers encountered when migrating code.
Let’s start with string concatenation. Spark SQL didn’t recognize the “||” operator as an alternative to the concat() function. Some development teams faced a lot of changes!
The basic single-row insert statement didn’t need modification. But Spark SQL didn’t recognize the handy multi-row syntax:
INSERT INTO big_table (col1, col2, col3) VALUES (var1, var2, var3), (var1, var2, var3), (var1, var2, var3)
Changes in behavior
In many ways, differences in syntax are the easiest changes to handle. The major problem came in the form of subtle and not-so-subtle changes in behavior between the two standards.
For example, the ANSI standard requires the to_date() function to fail if the input string can’t be parsed. For example, something like this:
SELECT to_date("Chicago, Illinois")
My bet is that there’s a data issue here where location fields are being incorrectly parsed as date fields. I definitely want that to fail immediately.
But Spark SQL returns a NULL result. If you’re not expecting this, your code and testing can merrily carry along all the way into production.
Similarly, casting from a string to an integer throws a run-time exception in systems implementing ANSI SQL. But Spark SQL is more accommodating.
How Databricks Switched To ANSI SQL
When Apache released Spark 3.0 in 2020, they introduced a new ANSI standard mode:
If this option is set to true, Spark follows the ANSI standard for SQL parsing and operations. Of course, this experimental option defaulted to false in this first deployment.
Databricks simply turned on the option in late 2021. This made the Databricks platform conform to ANSI SQL standards by default.