How To Run Databricks Notebooks In Parallel

There are several ways to run multiple notebooks in parallel in Databricks. You can also launch the same notebook concurrently.

If this is a one-off task, you may simply want to use the Workspace interface to create and launch jobs in parallel. However, you can also create a “master” notebook that programmatically calls other notebooks at the same time.

This tutorial shows you three methods that achieve these scenarios:

  • run or schedule jobs in parallel
  • use multiple threads with the concurrent.futures library
  • use multiple processes with the multiprocessing library

Running Jobs In Parallel

Suppose you have a notebook that accepts a table name and file path as parameters. The notebook reads data from the file, processes it, and saves it to the table.
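As a point of reference, a child notebook like this might read its parameters through widgets. The snippet below is only a sketch: the widget names table_name and file_path, and the Parquet source format, are assumptions for illustration.

# Hypothetical child notebook: read the parameters supplied by the caller
dbutils.widgets.text("table_name", "")
dbutils.widgets.text("file_path", "")

table_name = dbutils.widgets.get("table_name")
file_path = dbutils.widgets.get("file_path")

# Read the file, process it as needed, and save it to the table
df = spark.read.format("parquet").load(file_path)
df.write.mode("overwrite").saveAsTable(table_name)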

You want to process many files into tables in parallel. You can do this by launching multiple jobs that call the same notebook with different parameters.

For example:

  • Job 1 calls the notebook and specifies table name 1 and file path 1
  • Job 2 calls the notebook and specifies table name 2 and file path 2
  • And so forth…

You can manually do this within the Workflow tab by creating individual jobs and running them.

Bear in mind that the actual level of parallelism depends on the available resources in your Databricks environment. If resources run low, the jobs will be queued and run sequentially as resources become available.

Once you’ve created these jobs, you can also schedule them to start at the same time. Again, if resources run low, they’ll fall back to running sequentially.
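If you’d eventually rather trigger these runs from code than from the UI, one route is the Jobs API’s run-now endpoint. Here is a minimal sketch; the workspace URL, token, and job ID are placeholders you’d substitute with your own values.

import requests

# Placeholders: substitute your workspace URL, personal access token, and job ID
host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

response = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,  # the job that calls your notebook
        "notebook_params": {"table_name": "table_1", "file_path": "/path/to/file_1"},
    },
)
response.raise_for_status()
print(response.json())  # includes the run_id of the triggered run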

If this task isn’t a one-off, you’ll probably want to script it rather than click through the GUI each time. The rest of this article covers options you can run from inside a master notebook.

Multiple Threads With The Concurrent.Futures Library

This method uses concurrent.futures, a built-in Python library that provides multi-threading features. It allows you to submit tasks to be executed in parallel across a pool of worker threads.

Suppose you have two notebooks that you want to run in parallel: Child1 and Child2.

Your master notebook will control the execution. Once you’ve created a new notebook, follow these steps:

Step 1: import the module(s)

from concurrent.futures import ThreadPoolExecutor

Step 2: create a function that runs any notebook

The dbutils module has a function, dbutils.notebook.run(), that runs another notebook and waits for it to finish. Here is the code:

def run_notebook(path):
    # Run the notebook at the given path, with a 10-minute timeout
    return dbutils.notebook.run(path, timeout_seconds=600)

Step 3: create a list with the paths of each notebook

notebook_paths = ["/path/to/Child1", "/path/to/Child2"]

Step 4: execute the function concurrently

with ThreadPoolExecutor() as executor:
    results = list(executor.map(run_notebook, notebook_paths))

This code uses executor.map() to apply run_notebook() to each notebook path in the list and collects the results.

The threads will run concurrently. As long as your Databricks environment has multiple cores and enough available resources, these notebooks will be executed in parallel.
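Putting the steps together: if each child notebook also needs its own parameters (like the table name and file path from earlier), executor.submit() gives you more control than map(). The sketch below extends run_notebook() to accept an arguments dictionary; the notebook paths and parameter names are placeholders.

from concurrent.futures import ThreadPoolExecutor, as_completed

# Placeholder notebook paths, each mapped to the parameters that run should receive
runs = {
    "/path/to/Child1": {"table_name": "table_1", "file_path": "/path/to/file_1"},
    "/path/to/Child2": {"table_name": "table_2", "file_path": "/path/to/file_2"},
}

def run_notebook(path, arguments):
    # dbutils.notebook.run blocks until the child notebook finishes (or times out)
    return dbutils.notebook.run(path, timeout_seconds=600, arguments=arguments)

with ThreadPoolExecutor(max_workers=len(runs)) as executor:
    futures = {executor.submit(run_notebook, path, args): path for path, args in runs.items()}
    for future in as_completed(futures):
        print(futures[future], "returned:", future.result())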

Multiple Processes With The Multiprocessing Library

The multiprocessing module is another powerful tool for creating parallel workflows in Python.

It uses a separate memory space for each process, which sidesteps the GIL (Global Interpreter Lock). I won’t get into the details of the GIL, but it’s a mechanism that lets only one thread execute Python code at a time, so CPU-bound threads end up running one after the other.

The multiprocessing module spawns new subprocesses and sends functions and data to them using a serialization method called pickling. This can produce some odd behavior depending on the environment.

I recommend that you stick with the earlier concurrent.futures approach unless you know exactly what you’re dealing with.

But here is sample code to run two notebooks in parallel with the multiprocessing module.

Step 1: import the module(s)

from multiprocessing import Pool

Step 2: create a function that runs any notebook

This is the same code as in the previous method.

def run_notebook(path):
    # Run the notebook at the given path, with a 10-minute timeout
    return dbutils.notebook.run(path, timeout_seconds=600)

Step 3: create a list with the paths of each notebook

This is the same code as in the previous method.

notebook_paths = ["/path/to/Child1", "/path/to/Child2"]

Step 4: use a pool of worker processes to run the notebooks

with Pool(processes=2) as pool:
    # Two worker processes, one per notebook in this example
    results = pool.map(run_notebook, notebook_paths)

Things To Consider When Running Notebooks In Parallel

You need to think about more than just how to run notebooks concurrently. There are other factors that determine the best approach.

The type of workload, the size of the data, and the resources available all come into play. Here are some aspects to consider.

Data size and skew

Do you have some notebooks operating against massive data sets and others against small tables?

It may make more sense to run the “small” notebooks in one parallel batch to get them out of the way. This is especially so if the failure of any one notebook means that the entire set has to run again.
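For example, you could split the work into a quick batch of small notebooks followed by the heavy ones, using the same ThreadPoolExecutor approach from earlier. The paths below are placeholders.

from concurrent.futures import ThreadPoolExecutor

def run_notebook(path):
    return dbutils.notebook.run(path, timeout_seconds=600)

# Placeholder groupings: run the quick notebooks first, then the heavy ones
small_notebooks = ["/path/to/small1", "/path/to/small2", "/path/to/small3"]
large_notebooks = ["/path/to/large1", "/path/to/large2"]

for batch in (small_notebooks, large_notebooks):
    with ThreadPoolExecutor() as executor:
        # Each batch runs in parallel; the next batch starts only when this one finishes
        list(executor.map(run_notebook, batch))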

Task Independence

Are the notebooks independent of each other? If so, you’re more likely to have success with a parallel approach.

However, there can be data dependencies, such as one notebook needing a file that another notebook creates.

At best, this introduces bottlenecks while one notebook waits on another to get to a specific step. At worst, it creates data inconsistency.

If your notebooks are dependent, you should probably think of a different pattern of workflow.

Resources and Overhead

Parallel processing comes with overhead, such as thread creation and context switching. For lightweight tasks, the overhead may outweigh the benefits of parallel execution.

In other words, it may cost more than running everything sequentially.
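If you’re using the concurrent.futures approach, one simple lever is the max_workers argument, which caps how many notebooks run at once. The value below is only an illustration, and run_notebook() and notebook_paths are the same as in the earlier steps.

from concurrent.futures import ThreadPoolExecutor

# Cap concurrency at four notebooks at a time; the remaining paths wait their turn
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_notebook, notebook_paths))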

Concurrent Rate Limits

Databricks has limits on the number of notebooks that can be run concurrently.

The maximum number has increased in recent years, but be sure to check the limits when you are planning your parallel execution strategy.

Error Handling

When running tasks in parallel, consider how you will handle failures. It’s a good idea to map out the flow on a whiteboard.

If one notebook fails, should all the others be suspended? Or can the rest reach the finish line and wait for the problem child to be fixed and restarted?
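As a starting point, here’s a sketch that records each failure individually so one broken notebook doesn’t stop the rest. The paths are placeholders, and what you do with the failures afterwards is up to you.

from concurrent.futures import ThreadPoolExecutor, as_completed

notebook_paths = ["/path/to/Child1", "/path/to/Child2"]  # placeholders

def run_notebook(path):
    return dbutils.notebook.run(path, timeout_seconds=600)

failures = {}
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(run_notebook, path): path for path in notebook_paths}
    for future in as_completed(futures):
        path = futures[future]
        try:
            future.result()  # re-raises any exception from the child notebook
        except Exception as err:
            failures[path] = err  # record the failure and let the others finish

# Decide what to do next: retry the failures, alert someone, or stop the pipeline
print(failures)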

You should have a plan before running a thousand notebooks in parallel against massive data sets!