One-stop solution for all our orchestration needs.
Overall Satisfaction with Apache Airflow
I am part of the data platform team, where we are responsible for building the platform for data ingestion, the aggregation system, and the compute engines. Apache Airflow is one of the core systems, responsible for orchestrating pipelines and scheduled workflows. We run multiple Apache Airflow deployments for different use cases, each hosting 5,000 to 9,000 DAGs and executing even more DAG runs. Apache Airflow now also offers high availability (HA) with scheduler replicas, which is a lifesaver, and the project is well maintained by the community.
Pros
- Apache Airflow is one of the best orchestration platforms and a go-to scheduler for teams building a data platform or pipelines.
- Apache Airflow supports many operators, such as the Databricks, Spark, and Python operators, which give us the flexibility to implement almost any business logic.
- Apache Airflow is highly scalable, and we can run a large number of DAGs with ease. It provides HA and replication for workers. Maintaining Airflow deployments is easy, even for smaller teams, and we also get plenty of metrics for observability.
Cons
- Achieving a production-ready deployment of Apache Airflow requires some level of expertise. A repository of officially maintained sample Helm chart configurations would be handy for new teams.
- Since Airflow is used to build many data pipelines, a feature for building lineage from the queries run on different compute engines would help in developing a data catalog. Today, multiple tools are typically required for this use case.
- When building a data pipeline from upstream to downstream tables, using that lineage in Airflow to trigger the downstream DAGs after a recovery would be helpful. Native support for declaring dependencies between DAGs would also be beneficial.
Return on Investment
- By using Apache Airflow, we were able to build the data platform and migrate our workloads out of Hevo Data.
- Airflow currently powers the datasets for the entire company, supporting analytics backends, data science, and data engineering use cases.
- We scaled from fewer than 1,000 to currently more than 8,000 DAG runs per day using HA and worker scaling.
Apache Airflow is suited to a much wider set of use cases than Databricks. You can run it anywhere, and there is no vendor lock-in; with Airflow, we can utilize almost any compute engine. Doing the same with Databricks could be difficult, depending on the level of support.
Do you think Apache Airflow delivers good value for the price?
Yes
Are you happy with Apache Airflow's feature set?
Yes
Did Apache Airflow live up to sales and marketing promises?
Yes
Did implementation of Apache Airflow go as expected?
Yes
Would you buy Apache Airflow again?
Yes
