Delta tables are a core feature of Databricks. This article assumes you are completely new to the platform but have some familiarity with SQL or Spark basics.
Read on to get an understanding of what they are, how they differ from other storage formats, and how to create and use them.
What Is A Delta Table in Databricks?
A delta table is the default format for a table in Databricks. When you create a table on the platform, it is stored as a delta table unless you specify otherwise.
For example, this SQL command creates a delta table called “sales” with two columns:
spark.sql("""CREATE TABLE sales (product STRING, orders INT)""")
The storage format is based on Parquet, which makes it a columnar file format optimized for data analytics. The main difference is that delta tables have additional transaction logs that track data changes.
This means that you can update delta tables directly. In contrast, the data in Parquet files cannot be updated. Instead, the entire file must be replaced.
In practice, delta tables bring the big data capabilities of Spark and Parquet to the traditional needs of the data warehouse.
Differences Between Parquet And Delta Tables
As I mentioned, Parquet files are immutable, i.e. the data within cannot be updated. The most obvious difference is that you can run update, merge, and delete operations directly against delta tables.
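For example, assuming the sales table from earlier, statements like these run directly against the delta table (the "updates" table in the MERGE is a hypothetical staging table, not something created earlier in this article):
spark.sql("UPDATE sales SET orders = 600 WHERE product = 'football'")
spark.sql("DELETE FROM sales WHERE product = 'helmet'")
# merge new and changed rows in from a hypothetical staging table called "updates"
spark.sql("""
MERGE INTO sales AS target
USING updates AS source
ON target.product = source.product
WHEN MATCHED THEN UPDATE SET target.orders = source.orders
WHEN NOT MATCHED THEN INSERT (product, orders) VALUES (source.product, source.orders)
""")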
But that’s not the only difference. Let’s look at some more:
- Time travel with transaction logs
- Query performance with z-ordering and data skipping
- Caching on local worker nodes
Time Travel
With some database technologies, you have to install and deploy change tracking features as an optional extra. Microsoft SQL Server is one example.
But change tracking comes in-built with delta tables. Again, the functionality is provided through the transaction logs that accompany each table.
In Databricks parlance, this is known as Time Travel. The feature lets you query the data as it was at a point in time in recent history.
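Here's a minimal sketch of what that looks like in practice (the version number and timestamp are placeholders, not values from a real table):
# list the versions recorded in the transaction log
spark.sql("DESCRIBE HISTORY sales").show()
# query the table as it was at an earlier version or point in time
spark.sql("SELECT * FROM sales VERSION AS OF 1").show()
spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2024-01-01'").show()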
Query Performance
You may be familiar with using partitioning to increase the performance of data retrieval from Parquet files. As delta tables are built on top of Parquet, they also get the benefits.
But they come with the added benefits of what’s known as z-ordering and data skipping.
Z-ordering is a feature that groups related information in the same set of files. If that sounds a bit like partitioning – well, it is.
But the key point about z-ordering is that it works really well with high-cardinality values e.g. increasing numeric IDs. Partitioning doesn’t cope well with high cardinality (you get the “lots of small files” problem).
By default, delta tables collect statistics on the first thirty-two columns of your table. You should bear that in mind when designing your data model.
Crucially, these are hi/lo statistics: for each data file, the table tracks the max and min values of those columns.
These statistics allow filtered queries to take advantage of data skipping. The query processor can check the max/min values within a file to determine if its values fall outside of the filter. If they do, the file can be skipped.
This can save a lot of query processing overhead.
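To be clear, z-ordering itself is something you apply explicitly with the OPTIMIZE command. Here's a minimal sketch, assuming a hypothetical orders table with a high-cardinality order_id column:
# cluster the data files by order_id so the min/max statistics become highly selective
spark.sql("OPTIMIZE orders ZORDER BY (order_id)")
# a filter like this can then skip any file whose order_id range falls outside the predicate
spark.sql("SELECT * FROM orders WHERE order_id BETWEEN 1000 AND 2000").show()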
Caching For Performance
You may be thinking that Spark already has caching features. That’s true.
But delta tables come with a special cache known as the Delta Cache that can store copies of remote data locally on worker nodes.
I like the name Delta Cache because it’s so obvious. I hate the name Time Travel. It took me a while to realize that this was simply change tracking. But I digress…
Suppose you have queries that run repeatedly against the same result set. You can use the CACHE keyword to copy the result set to the Delta Cache. Subsequent queries will use the locally cached data instead of reaching out to the underlying Parquet files.
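Here's a minimal sketch of pre-warming the cache for the sales table (the column list and filter are just illustrative):
# copy the frequently queried data onto the local disks of the worker nodes
spark.sql("CACHE SELECT product, orders FROM sales WHERE orders > 100")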
Difference Between Tables And Delta Tables
I mentioned that Databricks tables are delta by default. But you can create other types of tables on the platform.
For example, you can create a table that is bound to data in an external system. This means that the table name is really a pointer to the external storage.
You can also create a table that specifically uses an alternative file format, such as JSON or Avro. The USING keyword after the CREATE TABLE syntax provides this functionality.
Here’s a statement that creates a CSV-format table based on an external directory.
CREATE TABLE agencies USING CSV LOCATION '/mnt/external/csv_file';
How to check if a table is delta or not
If you want to check whether you're dealing with a delta table, you can use the Boolean isDeltaTable check provided by the DeltaTable class.
from delta.tables import DeltaTable
DeltaTable.isDeltaTable(spark, "/tmp/sales")
How To Create A Delta Table In Databricks
There are several ways to create a delta table. We’ll look at two:
- Using PySpark
- Using SQL
Using PySpark and DataFrames
First, create your DataFrame. Here is some sample code:
columns = ["product", "orders"]
data = [("football", 529), ("helmet", 293)]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
Now that you have your DataFrame, you can write it to a new delta table.
df.write.format("delta").saveAsTable("sales")
You can also use the df.write method to write a DataFrame created from reading data from a CSV file.
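Something like this should do it (the file path and header option are assumptions about your source file):
# read a CSV file into a DataFrame, then save it as a delta table
df = spark.read.format("csv").option("header", "true").load("/mnt/landing/sales.csv")
df.write.format("delta").saveAsTable("sales_from_csv")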
Using SQL
Let’s create our sales table using a SQL command and populate it with some data:
spark.sql(""CREATE TABLE sales (product STRING, orders INT) USING delta""")
spark.sql("""INSERT INTO sales VALUES ('football', '529'), ('helmet', '293')""")
Note that “USING delta” is optional as delta is the default file format.
Which should you use?
You may be wondering whether you should use PySpark or SQL. In general, use the language you're most familiar with.
However, if you're going to mix the two, be aware that ANSI SQL is Databricks' standard for SQL. There are some differences in syntax and behavior between the dialects.
How To Convert A Parquet File To A Delta Table
You could read the Parquet file into a DataFrame and save it back to the platform as a delta table (see the previous section). However, there’s a single-step process to do this transformation.
The DeltaTable.convertToDelta API lets you do this as an in-place operation.
Suppose you have a parquet file at “/tmp/sales1” that you want to convert to a delta table. Here is the code:
from delta.tables import *
tbl = DeltaTable.convertToDelta(spark, "parquet.`/tmp/sales1`")
This operation adds transaction logs to the original Parquet file. That’s what turns it into a delta table!
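You can confirm the conversion worked with the isDeltaTable check from earlier:
# should return True now that the transaction logs are in place
DeltaTable.isDeltaTable(spark, "/tmp/sales1")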
How To Read A Delta Table
You can use the table name or the file path to read data from a delta table.
If you’re using SQL, then the select statement with the table name is the same as any other DBMS:
SELECT * FROM sales;
There’s a bit more to remember if you’re referencing a path. The syntax will look like this:
SELECT * FROM delta.`/tmp/sales`;
If you’re using PySpark then this syntax will pull the data from the sales table into a DataFrame:
df = spark.read.table("sales")
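If you'd rather reference the path from PySpark, this sketch reads the same files directly:
# load the delta table from its storage location instead of the metastore name
df = spark.read.format("delta").load("/tmp/sales")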
How To Get The Path Of A Delta Table
Delta tables are stored as a collection of files in cloud storage. If you're not sure where a specific table is stored, you can use the DESC FORMATTED command. Here's an example:
spark.sql("DESC FORMATTED chicago.sales").show(truncate=False)
This outputs quite a lot of detail: scroll and look for the location description.
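If you only want the path, DESCRIBE DETAIL is a more direct option: it returns a single row per table that includes a location column (shown here from PySpark):
# the detail output includes the table's format, size, and storage location
spark.sql("DESCRIBE DETAIL chicago.sales").select("location").show(truncate=False)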