Before you start exchanging data between Databricks and S3, you need to have the necessary permissions in place. We have a separate article that takes you through configuring S3 permissions for Databricks access.
The rest of this article provides code examples for common use cases when reading and writing data with Databricks and S3. Create a notebook to follow along!
How To Read A CSV File From S3 With Databricks
Here is how to read a CSV file using Python in your Databricks notebook.
Let's say that your CSV file is "salesdata.csv" and it lives in a bucket called "europe":
filename = "s3://europe/salesdata.csv"
data = spark.read.format("csv").options(header="true", inferSchema="true").load(filename)
display(data)
The key part of this code is spark.read, which returns a DataFrameReader for loading the file. These are the options used:
- header: specify whether your data has a header row
- inferSchema: Spark infers the column types from the data, so you don't have to provide the schema yourself
How To Read Parquet From S3 With Databricks
Parquet is a columnar file format from the Apache Software Foundation that is optimized for big data workloads in Spark.
Let's say that your Parquet file is "salesdata" and it lives in a bucket called "europe":
filename = "s3://europe/salesdata"
data = spark.read.parquet(filename)
display(data)
How To Write A Dataframe To A CSV File In S3 From Databricks
You can use the functions associated with the dataframe object to export the data in a CSV format. Here is some sample code:
filename = "s3://europe/salesdata.csv"
df.write \
    .format("csv") \
    .option("header", "true") \
    .save(filename)
How To Write A Dataframe To A JSON File In S3 From Databricks
You can use the functions associated with the dataframe object to export the data in JSON format. Here is some sample code:
filename = "s3://europe/salesdata.json"
df.write \
    .format("json") \
    .save(filename)
Troubleshooting “Access Denied”
One of the most common errors you’ll encounter when working with S3 is “access denied”.
You will get this error if you're mixing instance profiles for permissions with secret keys set in your Spark environment variables.
To check, run "env | grep -i aws" in a %sh cell on your cluster. If the output shows AWS_ACCESS_KEY_ID but you shouldn't be using keys, remove that variable from your cluster's environment.
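If you prefer to check from a Python cell, here is a minimal sketch (not Databricks-specific) that lists any AWS-related variables in the driver's environment:

```python
import os

# Collect any AWS-related environment variables visible to the driver process.
aws_vars = sorted(k for k in os.environ if k.upper().startswith("AWS_"))

for name in aws_vars:
    print(name)
```

An empty result means no AWS keys are set in the environment, so instance-profile credentials will be used on their own.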
Troubleshooting “Upload Exceeds Maximum Size”
If you’re trying to save a large DataFrame to a single CSV or JSON file in S3, you may get this error message:
“Your proposed upload exceeds the maximum allowed size.”
The preferred solution is to save the file in multiple parts. But what if you have a client who insists that they receive the file as a single object?
You have the option of splitting the DataFrame into multiple smaller ones. You can loop through a list and append the data to the same file like this:
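Here is a minimal sketch of that append loop, using pandas for illustration. The chunks, column names, and output path are assumptions; in Databricks you might produce the chunks from a Spark DataFrame (for example with randomSplit and toPandas) and then upload the combined file to S3 as a single object:

```python
import pandas as pd

# Small example chunks standing in for pieces of a large DataFrame
# (hypothetical data -- in practice these would come from your split).
chunks = [
    pd.DataFrame({"region": ["EU"], "sales": [100]}),
    pd.DataFrame({"region": ["US"], "sales": [250]}),
    pd.DataFrame({"region": ["APAC"], "sales": [75]}),
]

output_path = "salesdata_combined.csv"  # assumed local path; upload to S3 afterwards

for i, chunk in enumerate(chunks):
    # Overwrite on the first chunk, then append; write the header only once.
    chunk.to_csv(
        output_path,
        mode="w" if i == 0 else "a",
        header=(i == 0),
        index=False,
    )
```

Because each chunk is appended to the same local file, the client receives one object instead of a directory of part files.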