You will often want to reuse code from one Databricks notebook in another. This step-by-step beginner guide shows you how to:
- Import a function from one notebook to another.
- Run one notebook from another.
- Return a value from a called notebook.
- Pass parameters to a notebook.
- Return large datasets from a called notebook.
If you’re new to this, don’t worry. All I assume is that you know the basics of notebooks in Databricks — and that’s all you need to follow along.
Typical Use Case
Here is a typical scenario. You are working on a project that requires analyzing different datasets coming from multiple sources. Each dataset should go through the same set of preprocessing steps such as handling missing values.
You’ve written a function, preprocess_data(), that takes raw data as input, runs the preprocessing steps, and returns the cleaned data.
You could copy and paste your preprocess_data() function into each notebook that needs it, but that’s a bad idea. Not only is it time-consuming, but it also creates room for errors and inconsistencies.
Another drawback is that if you ever need to update the function, you would have to do so in every notebook.
This is a clear case where code reuse can significantly streamline your workflow. By writing your preprocess_data() function in one notebook and importing it into others, you save time, reduce the risk of errors, and make your code easier to maintain.
Now, let’s dive into practical ways to deal with this scenario.
How To Import Functions With The %run Magic Command
Magic commands in Databricks are special commands in a Python notebook cell that are not part of the Python language itself. They are preceded by the percent symbol: %.
The %run command in Databricks enables the execution of an entire notebook within another notebook.
The command acts as an importer and executor. It imports the entire code from the specified notebook and runs it in the context of the current notebook.
This means that any functions, classes, or variables are imported and available for use in the notebook where the %run command is used.
Let’s say that your required functionality is in a notebook called “Processing” in a subfolder called “tasks”. The call would look like this:
%run "./tasks/Processing"
Once you have run this command in a cell, later cells have access to the function in the other notebook.
Drawback
Because the %run command runs the entire notebook, any executable statements (like function calls or print statements) will also execute when you use %run.
So, you probably don’t want to use %run on regular notebooks that are full of print statements or reporting displays.
Best Tip
Here’s the setup that I advise: create a “library” notebook that only contains functions and classes. This notebook should have no runnable code.
This means that when you use %run to import all the functions into another notebook, you don’t get an avalanche of hard-to-follow output.
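For example, a “library” notebook might contain nothing but definitions like these (the names below are just placeholders):

# Notebook: Utils -- definitions only, no top-level calls or print statements

def handle_missing_values(df, fill_value=0):
    """Fill nulls so downstream steps see complete data."""
    return df.fillna(fill_value)

def add_numbers(a, b):
    """Add two numbers together."""
    return a + b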
Step-By-Step Guide To Importing Functions From Other Notebooks
If you’re already familiar with working with notebooks in Databricks, then the previous section should get you going with importing functions. But if this is new to you, here’s a step-by-step guide that assumes you are a complete beginner.
Step 1: Create a function in a new notebook
Create a new notebook in Databricks and add this simple function to it. All it does is add two numbers together.
def add_numbers(a, b):
    return a + b
Step 2: Save your notebook
Rename the notebook to something like “Utils” to indicate that it holds a collection of utility functions.
Make sure to take note of the notebook’s path in your Databricks workspace.
Step 3: Create a second notebook and use the %run command
Create a new notebook and use the %run command in a cell like this:
%run "/path/to/Utils"
Step 4: Call the imported function
After running the %run command, the add_numbers function from the “Utils” notebook is available in your current notebook. You can now call the function as you normally would:
result = add_numbers(5, 7)
print(result) # Outputs: 12
Advantages And Disadvantages Of The %run Command
The %run command is a quick and straightforward solution to importing functions from one notebook to another. But it comes with advantages and disadvantages.
Advantages
Simplicity
The %run command is easy to use and just needs one line of code.
Code Reusability and Modularity
The command helps you avoid duplicating code across multiple notebooks. This promotes better code organization and modularity.
Shared State
When a notebook is run using the %run command, it shares the same SparkContext with the calling notebook.
This means that any Spark jobs that are kicked off by the called notebook are part of the same Spark application as the calling notebook.
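You can see this for yourself: after a cell containing the %run command, a later cell reports the same Spark application as any work the imported notebook started (spark is the session object Databricks predefines in every notebook):

print(spark.sparkContext.applicationId)  # one Spark application shared by both notebooks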
Disadvantages
Entire Notebook Execution
The %run command runs the entire notebook, not just specific functions or variables. This can be a disadvantage if you only need to use a specific function from the imported notebook.
No Isolation
The functions and variables from the imported notebook become available in the global scope of the calling notebook. This can lead to naming conflicts if the same name is used in both notebooks.
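For example (load_data is a made-up name), suppose your notebook defines a function and a later cell imports a library notebook that happens to use the same name:

def load_data():
    return "caller's version"

# If a later cell runs %run ./SomeLibrary, and SomeLibrary also defines
# load_data(), the imported definition silently replaces the one above.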
How To Run One Notebook From Another With dbutils.notebook.run
The dbutils.notebook.run function also lets you run a notebook from within another notebook. In addition, it lets the called notebook return a string result to the caller.
This is the syntax:
output = dbutils.notebook.run("/path/to/notebook", timeout_seconds)
There are three main differences from the %run command:
- Context
- Timeout (optional)
- Functions are not imported
Let’s look at these aspects in turn:
Context
A major difference is that this method executes the called notebook in a separate context, essentially treating the second notebook as a separate job.
Optional Timeout
The optional timeout parameter lets you specify the maximum time in seconds that the notebook is allowed to run. If it doesn’t complete in time, the run is cancelled and an exception is raised.
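As a quick sketch (the notebook path is illustrative), you can combine the timeout with exception handling so a stuck notebook doesn’t stall your whole pipeline:

# Give the called notebook at most 300 seconds to finish. Databricks raises
# an exception if the timeout is hit (or the notebook fails for another reason).
try:
    output = dbutils.notebook.run("/path/to/LongRunningJob", 300)
except Exception as e:
    print(f"Notebook run failed or timed out: {e}")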
Functions are not imported
Unlike %run, dbutils.notebook.run does not import functions or variables from the called notebook into the calling notebook. Instead, it returns whatever value the called notebook passes to dbutils.notebook.exit(), always as a string.
This is useful when you want to call a notebook that produces a specific output, but you don’t need to use any of its functions or variables in your current notebook.
Step-By-Step Guide To Using dbutils.notebook.run
While dbutils.notebook.run can’t import functions the way %run does, it can be used to run a notebook that returns the output of a function. Here’s how you can use it:
Step 1: Create a function in a new notebook
Create a new notebook in Databricks and add this simple function to it. All it does is add two numbers together.
def add_numbers(a, b):
    return a + b
Step 2: Return the result with dbutils.notebook.exit
At the end of the same notebook, call the function with the desired parameters and pass its output to dbutils.notebook.exit(), which hands the value back to the calling notebook:
result = add_numbers(5, 7)
dbutils.notebook.exit(str(result))
Step 3: Save your notebook
Rename the notebook to something like “Calc” to indicate its purpose.
Make sure to take note of the notebook’s path in your Databricks workspace.
Step 4: Create a second notebook and use dbutils
Create a new notebook and assign the output of the first notebook to a variable like this:
output = dbutils.notebook.run("/path/to/Calc", 600)
Step 5: Use the output
The dbutils.notebook.run command returns the value passed to dbutils.notebook.exit() as a string.
In our example, output would be ’12’. If you need to use the output as an integer, you’ll need to convert it:
output = int(output)
Advantages And Disadvantages Of dbutils.notebook.run
Here are some of the benefits and limitations of this second method.
Advantages
Isolation
dbutils.notebook.run executes the called notebook in a separate context. This means that the called notebook doesn’t affect the state of the calling notebook.
This helps prevent naming conflicts and other unexpected issues.
Controlled Output
Unlike %run, dbutils.notebook.run returns only the value that the called notebook passes to dbutils.notebook.exit(), as a string.
This means that you can control exactly what data you pass back to the calling notebook.
Timeout Control
You can optionally set a timeout for the execution of the called notebook.
Disadvantages
String Output
The fact that the output is always a string can be a pain in the neck. You will often need to convert it; see the JSON sketch at the end of this section.
No Function or Variable Import
dbutils.notebook.run does not import functions or variables from the called notebook into the calling notebook.
If you need to use a function from another notebook, you’ll need to structure your notebooks differently or use the %run command instead.
Resource Consumption
Since dbutils.notebook.run executes the called notebook as a separate job, each call has its own execution context and scheduling overhead.
This can lead to increased resource usage if you are running several notebooks concurrently.
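One common workaround for the string-only output is to serialize structured data as JSON before exiting. Here’s a minimal sketch; the notebook path and keys are made up:

import json

# In the called notebook: bundle several values into one JSON string.
dbutils.notebook.exit(json.dumps({"rows_processed": 1250, "status": "ok"}))

# In the calling notebook: run it and parse the string back into a dict.
raw = dbutils.notebook.run("/path/to/CalledNotebook", 600)
result = json.loads(raw)
print(result["status"])  # ok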
How To Pass Parameters From One Notebook To Another
You can pass parameters when using dbutils.notebook.run with the help of widgets, which are part of the dbutils library.
Widgets in Databricks notebooks are user-interactive graphical elements that let your notebooks accept input from users. There are lots of fancy choices like buttons and dropdowns. But when passing parameters, you will use the simple “text” widget.
Here’s a step-by-step guide.
Step 1: Define the Notebook to be Called
Let’s call the target notebook “CalledNotebook”. You can define a widget to accept a parameter like this:
dbutils.widgets.text("input", "defaultValue")
This code will create a widget named input with a default value of “defaultValue”. You can get the value of the widget (and thus the passed parameter) like this:
input_value = dbutils.widgets.get("input")
Step 2: Call the Notebook with Parameters
Let’s name the notebook that calls the target “CallingNotebook”. In this notebook, you use the dbutils.notebook.run function to call CalledNotebook and pass parameters. Here’s an example:
dbutils.notebook.run("CalledNotebook", 600, {"input": "Hello, world!"})
This code will run CalledNotebook and pass the string “Hello, world!” as the input parameter. The 600 is the timeout in seconds; you can adjust it based on your needs.
When CalledNotebook is run, the input widget will have a value of “Hello, world!”. The input_value variable will also be set to “Hello, world!”.
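Putting both halves together, here’s a minimal sketch (the greeting logic is just for illustration):

# --- In CalledNotebook ---
dbutils.widgets.text("input", "defaultValue")
input_value = dbutils.widgets.get("input")

# Use the parameter, then hand a result back to the caller.
dbutils.notebook.exit(f"Received: {input_value}")

# --- In CallingNotebook ---
output = dbutils.notebook.run("CalledNotebook", 600, {"input": "Hello, world!"})
print(output)  # Received: Hello, world!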
How To Return Large Datasets From One Notebook To Another
We’ve already looked at how the dbutils.notebook.run function returns a string. But what if you want to return something more complex – like a DataFrame?
The problem here is that a DataFrame is usually far too large to be converted to a string representation. This would be inefficient and might not even work.
The solution is to have the called notebook write the DataFrame to a global temporary view or to a file. The calling notebook can then read the data back.
This approach has the advantage of being efficient, because Spark’s file reading and writing operations are designed to handle large datasets.
It also allows you to control the format of your data. For example, you can choose to write either to a CSV file or a Parquet file, depending on your use case.
Bear in mind that if you write to a global temporary view, the data is only accessible while the Spark application is active. If you need more permanent storage, writing to a file is the better option.
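Here is a minimal sketch of both options (the view name, file path, and DataFrame are illustrative). A global temporary view is used because, unlike a plain temporary view, it is visible across the separate execution contexts that dbutils.notebook.run creates:

# --- In the called notebook: register the DataFrame, return a handle to it ---
df = spark.range(0, 1000000)  # stand-in for your real DataFrame
df.createOrReplaceGlobalTempView("shared_results")
dbutils.notebook.exit("shared_results")

# --- In the calling notebook: look the data up by the returned name ---
view_name = dbutils.notebook.run("/path/to/ProducerNotebook", 600)
df = spark.table(f"global_temp.{view_name}")

# Or, for more permanent storage, write to a file instead:
# df.write.mode("overwrite").parquet("/tmp/shared_results")
# df = spark.read.parquet("/tmp/shared_results")  # in the calling notebook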