Permissions in AWS are defined through IAM (Identity and Access Management) roles.
To work with files in S3, your Databricks deployment needs an IAM role with read/write permissions on an S3 bucket. You have two configuration options:
- Use the Unity Catalog in Databricks.
- Set up the roles and permissions in AWS.
The Unity Catalog is a new feature from Databricks that aims to simplify what you otherwise must set up in AWS. However, you may not have access to it.
This article walks you through setting up the roles and permissions directly.
Should You Mount S3 on DBFS?
There used to be a third option: mounting S3 buckets on DBFS with the dbutils.fs.mount command.
However, this method is deprecated, meaning Databricks no longer wants us to use it.
Instead, use the Unity Catalog or directly configure the IAM roles as shown in the rest of this article.
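For reference, a mount used to look like the minimal sketch below (the bucket name and mount point are placeholders). It is shown only so you can recognize the pattern in legacy notebooks; don't use it in new code.

# Deprecated: mounting an S3 bucket on DBFS. For recognition only.
dbutils.fs.mount(
    source="s3a://<s3-bucket-name>",   # your bucket
    mount_point="/mnt/my-data"         # hypothetical mount point
)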
Prerequisites For This Tutorial
I assume that you have a Databricks account and workspace. You will also need an AWS S3 bucket for Databricks to work with. To create one, follow these steps (or script it, as in the sketch after the list):
- Log in to AWS, open the S3 console, and click the “Create bucket” button.
- Set the name and region and save your choices.
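Here is a minimal boto3 sketch of the same step; the bucket name and region are placeholders you should change.

import boto3

BUCKET = "my-databricks-tutorial-bucket"  # hypothetical name; must be globally unique
REGION = "eu-west-1"                      # pick your region

s3 = boto3.client("s3", region_name=REGION)

# Outside us-east-1, S3 requires an explicit location constraint.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)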
Four-Stage Process
The Databricks/S3 permissions configuration is a four-stage process:
- Stage 1: create an IAM role that grants access from EC2 to an S3 bucket.
- Stage 2: create a bucket policy that links back to the stage 1 role.
- Stage 3: add the stage 1 role to the EC2 role you used when deploying Databricks.
- Stage 4: hook up the stage 1 role to your Databricks workspace.
Confused? Thought the whole point of Databricks was to simplify all this?
Well, that’s what the Unity Catalog is for.
But don’t worry: just follow the steps in the next sections.
Stage 1: Create The Instance Profile (IAM Role)
Databricks calls this IAM role for EC2 access an “Instance Profile”. This is how to set it up:
- Log in to AWS and open the IAM console.
- Click on “Roles” in the left pane and choose “Create Role”.
- Choose “AWS Service” as the trusted entity and “EC2” as the use case.
- Click “Next” twice.
- Give the role a name and click “Create Role”.
Now that you have your Instance Profile, you need to give it access to the S3 bucket you just created.
- Find the role in the role list and click its name.
- Go to the Permissions tab.
- Click “Add Permissions -> Create Inline Policy”.
- Go to the JSON tab.
Here is a sample policy; just replace <s3-bucket-name> in the resource ARNs with your bucket name.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<s3-bucket-name>"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::<s3-bucket-name>/*"
      ]
    }
  ]
}
Once you’ve pasted in your policy and reviewed it, give it a name and click “Create Policy”.
The next display shows a summary of what has been created. At this point, take a copy of both the Role ARN and the Instance Profile ARN from the role summary (you need the Instance Profile ARN for stage 4).
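If you would rather script stage 1, here is a minimal boto3 sketch. The role name and bucket name are placeholders, and the trust policy matches the “EC2” use case chosen above. Note that the console creates the instance profile for you automatically, while boto3 requires you to create it explicitly.

import json
import boto3

iam = boto3.client("iam")

ROLE = "databricks-s3-access"             # hypothetical role name
BUCKET = "my-databricks-tutorial-bucket"  # hypothetical bucket name

# Trust policy: let EC2 instances assume the role.
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName=ROLE, AssumeRolePolicyDocument=json.dumps(trust))

# Inline policy: same read/write/list permissions as the JSON above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket"],
         "Resource": [f"arn:aws:s3:::{BUCKET}"]},
        {"Effect": "Allow",
         "Action": ["s3:PutObject", "s3:GetObject",
                    "s3:DeleteObject", "s3:PutObjectAcl"],
         "Resource": [f"arn:aws:s3:::{BUCKET}/*"]},
    ],
}
iam.put_role_policy(RoleName=ROLE, PolicyName="s3-read-write",
                    PolicyDocument=json.dumps(policy))

# Create the instance profile and add the role to it.
iam.create_instance_profile(InstanceProfileName=ROLE)
iam.add_role_to_instance_profile(InstanceProfileName=ROLE, RoleName=ROLE)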
Stage 2: Link Back From the Bucket To the Instance Profile
Now you tell AWS that the bucket connects back to the instance profile.
- Open the S3 console and select your bucket.
- Go to the Permissions tab and click “Edit” under “Bucket policy”. (The template below contains a Principal, so it must be a bucket policy on the S3 side; it can’t be created as an IAM policy.)
- Use the template below to grant read/write/list permissions to the Stage 1 role.
- Save the policy.
Here is a sample policy where you should replace:
- The AWS Account ID
- The Instance Profile role
- The bucket name
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::<s3-bucket-name>"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
      },
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl"
      ],
      "Resource": "arn:aws:s3:::<s3-bucket-name>/*"
    }
  ]
}
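This step can also be scripted. Here is a minimal boto3 sketch with the placeholders filled in (the account ID, role name, and bucket name are hypothetical):

import json
import boto3

BUCKET = "my-databricks-tutorial-bucket"
ROLE_ARN = "arn:aws:iam::999988887777:role/databricks-s3-access"

# Same two statements as the template above.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "Example permissions",
         "Effect": "Allow",
         "Principal": {"AWS": ROLE_ARN},
         "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
         "Resource": f"arn:aws:s3:::{BUCKET}"},
        {"Effect": "Allow",
         "Principal": {"AWS": ROLE_ARN},
         "Action": ["s3:PutObject", "s3:GetObject",
                    "s3:DeleteObject", "s3:PutObjectAcl"],
         "Resource": f"arn:aws:s3:::{BUCKET}/*"},
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(bucket_policy))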
Stage 3: Add The Instance Profile To The EC2 Instance Policy
The Instance Profile is what you created in Stage 1. You also need the details of the IAM role that you used when you first created your Databricks account. To get this:
- Go to Workspaces.
- Click your workspace name.
- Scroll down to the credentials box.
- Grab the role name from the end of the long “ARN” string.
Suppose the Role ARN is this: arn:aws:iam::999988887777:role/data-tasks
The role name is “data-tasks” in this example.
Now that you know the role name, follow these steps:
- Open the IAM console.
- Click on “Roles” in the left pane and find the Databricks deployment role.
- Go to the Permissions tab.
- Edit the policy.
- Add the statement below to the policy’s “Statement” array.
Replace “iam-role-instance-profile” with the name of the role from Stage 1.
{
  "Effect": "Allow",
  "Action": "iam:PassRole",
  "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-instance-profile>"
}
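If you want to script this stage too, here is a sketch. It assumes the deployment role’s permissions live in an inline policy; the policy name is hypothetical (run iam.list_role_policies(RoleName=...) if you’re unsure what yours is called).

import json
import boto3

iam = boto3.client("iam")

DEPLOY_ROLE = "data-tasks"             # role name from the ARN above
POLICY_NAME = "databricks-setup"       # hypothetical inline policy name
ACCOUNT_ID = "999988887777"            # your AWS account ID
PROFILE_ROLE = "databricks-s3-access"  # the Stage 1 role

# boto3 returns the inline policy document as a Python dict.
doc = iam.get_role_policy(RoleName=DEPLOY_ROLE,
                          PolicyName=POLICY_NAME)["PolicyDocument"]

# Append the PassRole statement so the deployment role may hand the
# Stage 1 role to the EC2 instances Databricks launches.
doc["Statement"].append({
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": f"arn:aws:iam::{ACCOUNT_ID}:role/{PROFILE_ROLE}",
})

iam.put_role_policy(RoleName=DEPLOY_ROLE, PolicyName=POLICY_NAME,
                    PolicyDocument=json.dumps(doc))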
Stage 4: Hook Up The Instance Profile To Your Workspace
Whew, nearly finished! Follow these steps in Databricks:
- Open the admin settings.
- Go to the Instance Profiles tab.
- Add the Instance Profile you created in Stage 1 by pasting in the Instance Profile ARN you copied at the end of that stage.
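To confirm everything works, attach the instance profile to a cluster (once added, it appears as an option in the cluster’s advanced settings) and run a quick smoke test from a notebook. A minimal sketch, assuming the placeholder bucket name from earlier:

# Run in a Databricks notebook on a cluster using the new instance profile.
BUCKET = "my-databricks-tutorial-bucket"  # replace with your bucket

# Write a tiny DataFrame to S3, then list and read it back.
df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "word"])
df.write.mode("overwrite").parquet(f"s3a://{BUCKET}/smoke-test")

display(dbutils.fs.ls(f"s3a://{BUCKET}/smoke-test"))
spark.read.parquet(f"s3a://{BUCKET}/smoke-test").show()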