Apache Airflow is an open-source platform for creating, scheduling, and monitoring data pipelines.
Because Airflow is written in Python, developers define a pipeline's tasks and their dependencies in Python. They can also apply version control, code reuse, and automated tests when developing workflows.
Airflow also ships with an interactive web-based user interface that helps with visualizing pipelines running in production, monitoring progress, and troubleshooting issues.
Origins Of Airflow
Apache Airflow was originally developed at Airbnb in 2014 to manage the increasingly complex web of tasks and dependencies that made up the company's data pipelines.
Airbnb open-sourced Airflow in 2015, making it available to a broader community. The following year, the Apache Software Foundation took Airflow under its umbrella. It became a top-level Apache project in 2019, an acknowledgment of its wide adoption and active community.
Essential Terms To Understand
You need to get familiar with these concepts (I’ll explain each in turn):
A task in Airflow is a single unit of work. This could be running a Python script, a shell command, or a Spark job.
A DAG is a collection of tasks that operate on your data. The term stands for “Directed Acyclic Graph”. “Directed” means that it runs in a specific order, and “Acyclic” means that a task can’t loop back to a previous task.
Operators determine what gets done by a task. Airflow comes with many built-in operators for common tasks, such as PythonOperator for running Python code, BashOperator for bash commands, and so on.
Executors are what actually run the tasks within a DAG. Airflow comes with several executors with different characteristics, such as SequentialExecutor, LocalExecutor, CeleryExecutor, and KubernetesExecutor.
Workers are the machines within a distributed system that the tasks run on.
Hooks are interfaces to external platforms and databases like Snowflake, S3, MySQL, Postgres, etc. They allow Airflow to interact with external systems.
Workflows As Code
One of the key features of Apache Airflow is that workflows are written as code, typically in Python.
Let’s contrast that with GUI-based workflow tools. Defining workflows in code means that teams can use loops and conditional logic to generate tasks dynamically.
This means that tasks can be created based on certain conditions or variables that may change at runtime, rather than hard-coding each task separately.
Let’s say you want to create a backup task for each database in your database management system (e.g. PostgreSQL).
Instead of creating a task manually for each database, you can dynamically generate these tasks with a loop that uses the metadata (list of databases) of the DBMS. When a new database is created, the new backup task will also appear.
This of course is achieved with a Python script that can be version controlled. It can also be peer-reviewed and run through automated unit tests.
Scheduling With Airflow
Scheduling and monitoring are essential for any data pipeline, ensuring tasks are running at the correct time and providing visibility into their progress and status.
In Airflow, each DAG has a schedule defined by a start date and a schedule interval.
Airflow’s scheduler monitors all tasks and all DAGs and triggers the task instances whose dependencies have been met. It is designed to run continuously, maintaining an up-to-date picture of which tasks are due to run.
Airflow provides a rich web-based user interface for monitoring and managing your DAGs and tasks. It lets you:
- View the list of DAGs and their status.
- Drill down into a DAG to see tasks, their dependencies, and status.
- View the logs for each task instance.
- Trigger DAG runs manually.
- Mark task instances as successful or failed.
- Clear the state of task instances so they can be re-run.
In addition, Airflow can send alerts through emails or services like Slack or PagerDuty.
Five Typical Use Cases For Airflow
Here are five examples of how organizations use Airflow in real-world scenarios:
ETL (Extract, Transform, Load) Pipelines
This is one of the most common use cases for Airflow. ETL involves extracting data from multiple sources, transforming it into a suitable format, and loading it into a data warehouse for analysis.
Each of these stages can be modeled as tasks in a DAG, with dependencies between them ensuring they run in the correct order.
Machine Learning Workflows
Complex machine learning workflows often involve many steps, such as data extraction, data cleaning, feature extraction, model training, model evaluation, and model deployment.
Airflow can orchestrate these tasks, ensuring that each step is carried out at the right time and in the right order.
Data Quality Checks
Airflow can be used to run tasks that check the quality of your data.
For example, you might have a task that verifies the integrity of your data, checks for duplicates, or validates that the data conforms to certain rules or constraints.
Automated Report Generation
Reports that need to be generated and distributed regularly can be automated using Airflow.
The pipeline might involve extracting the latest data, running it through some analysis code, generating a report (such as a PDF or a spreadsheet), and then emailing it to the relevant people.
Data Lake Ingestion
Ingesting data into a data lake at regular intervals from various sources is another common use case.
Airflow can manage the orchestration of these complex ingestion processes, handling failures and retries as necessary.
What Do You Need To Install Apache Airflow?
Because it’s a Python package, installing Airflow is relatively straightforward.
A support team can install it on a Linux system using the pip installer like this:
pip install 'apache-airflow'
But before you tell them to do so, have a think about which database system will back Airflow's metadata store. The default is SQLite, but many organizations choose PostgreSQL instead.
If that’s what you want, the pip command looks like this:
pip install 'apache-airflow[postgres]'