In Databricks, notebooks let you develop real-time machine learning, data engineering, and data analytics workflows. Notebooks support four programming languages:
- Python
- SQL
- R
- Scala
You can use one or all languages within a single notebook. If you’re starting out, you may be wondering if one of these languages is better for your purposes.
Does the company favor a specific language? Which runs faster on the platform? Which scales better? Read on to find out.
What Language Is Databricks Written In?
Databricks as a platform is written in Scala, one of the four languages it supports.
This is because the platform runs on top of Apache Spark, which is written in Scala. Databricks provides an abstract layer to simplify operations on the distributed Spark clusters beneath it.
Because Spark is Scala-based, the language is a natural choice for the internal workings of APIs that run on top of it.
However, when Databricks was first conceived as a commercial venture, it would have been technically possible for the founders to choose a different programming language. For example, they could have written their layer in Java. After all, Scala itself runs on the JVM (Java virtual machine).
So, why did the original Databricks developers choose Scala? That’s easy to understand when you know the background of the seven co-founders of Databricks. They were all involved in the early development of Spark (written in Scala).
Which Languages Are Used Inside The Company?
Scala is also widely used within the company to implement the surrounding tools and glue that facilitate notebooks, Delta tables, jobs, and workflows. A company blog post in 2021 described Scala as:
"a kind of lingua franca" (company blog post)
However, it’s not the only language in use – as you’d expect from a medium-sized developer workshop. Reynold Xin, one of the co-founders, answered a question on Quora in 2021 as to what languages are used in Databricks. This is what he said:
So, there you have it from the horse’s mouth. Now that you know the importance of Scala, you may be thinking that it must be the preferred language when using the platform. But that’s not actually true.
Python As The Most Commonly Used Databricks Language
In some ways, it doesn’t really matter what is used within the company. The choice of the customer base is more important.
Xin expanded on his 2021 Quora comment with this:
"Our customers use primarily Python, SQL, and some R and Scala." (Reynold Xin, 2021)
It doesn’t surprise me that Python is number one. Industry surveys consistently rank it among the most popular languages, and it held the number 1 spot in the TIOBE index in both 2022 and 2023.
These are the TIOBE rankings for the four languages in 2023:
- Python: 1
- SQL: 9
- R: 17
- Scala: 23
Should You Use Python Or PySpark In Databricks?
Some data scientists who are new to Databricks may be mostly familiar with using Python and Pandas to work with their data.
It could be tempting to continue to work exclusively within this framework. But that would be a mistake if you are working with large datasets. I advise that you become familiar with PySpark.
Here’s the reason why. The Databricks platform is designed to distribute your large data workloads across nodes working in parallel on a cluster. But the language that you use must be able to tell the platform to do so.
Plain old Python runs in a single process unless you do clever things to make it launch multiple processes in parallel. (You can run Databricks notebooks in parallel, but it takes a bit of extra coding.)
Similarly to Python, the Pandas library is designed to run on a single machine.
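To make the single-process point concrete, here is a minimal sketch using only the Python standard library. The function names (`square_with_pid`, `run_sequential`, `run_parallel`) are made up for illustration. It tags each result with the ID of the process that computed it, so you can see that ordinary Python code does all its work in one process unless you explicitly hand it to a pool of workers.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def square_with_pid(n):
    """Return n squared together with the ID of the process that did the work."""
    return n * n, os.getpid()


def run_sequential(numbers):
    """Plain Python: every item is handled by the current process."""
    return [square_with_pid(n) for n in numbers]


def run_parallel(numbers, workers=4):
    """Explicitly fan the same work out to a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square_with_pid, numbers))


# Sequential run: collect the distinct process IDs that did the work.
sequential = run_sequential(range(5))
print({pid for _, pid in sequential})  # a single ID, the interpreter's own
```

Calling `run_parallel(range(5))` returns the same squares, typically computed under several worker process IDs. Even then, all of those processes live on one machine, which is the limitation the next paragraph addresses.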
In contrast, PySpark is an API that distributes your Python load statements, queries, and transformations to run in parallel across multiple nodes.
To help with Pandas, Databricks launched an API called Koalas that implemented the single-process Pandas DataFrame API on top of multi-node Spark DataFrames (which are built on RDDs).
But wait – you don’t have to get familiar with the marsupial-named library. Koalas was conveniently merged into PySpark in 2021.
To get the benefits of using Pandas with parallel DataFrames (which are RDDs under the hood), all you need to do is add a single line of code at the top of your notebooks. It will look like this:
from pyspark.pandas import read_csv
From there, you can simply perform the usual Pandas data import and transformations that you’re used to. The PySpark library implements them as distributed Spark RDDs.
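As a rough sketch of what "the usual Pandas transformations" can look like, the hypothetical helper below sticks to `groupby`, `sum`, `sort_values`, and `head`, which are available in both plain Pandas and the Pandas API on Spark. The column names, sample data, and file path are made up for illustration. The runnable part uses ordinary Pandas on a small in-memory sample. The commented lines show how the same function would be called on a cluster, assuming `pyspark` is available there.

```python
import pandas as pd


def top_categories(df, n=2):
    """Total `amount` per `category`, largest first.

    Uses only groupby/sum/sort_values/head, which exist in both pandas
    and the pandas API on Spark.
    """
    totals = df.groupby("category")["amount"].sum()
    return totals.sort_values(ascending=False).head(n)


# Small in-memory sample (made-up data) using plain, single-machine pandas.
local_df = pd.DataFrame(
    {"category": ["a", "b", "c", "a", "c"], "amount": [10, 7, 20, 5, 1]}
)
print(top_categories(local_df))

# On a Databricks cluster the same function should accept a distributed frame:
# import pyspark.pandas as ps
# spark_df = ps.read_csv("/path/to/sales.csv")  # hypothetical path
# print(top_categories(spark_df))
```

Not every Pandas method has a distributed equivalent, so it is worth checking the Pandas API on Spark documentation before porting a larger notebook.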