
    Stop Creating Bad DAGs — Optimize Your Airflow Environment By Improving Your Python Code | by Alvaro Leandro Cavalcante Carneiro | Jan, 2025



    Apache Airflow is one of the most popular orchestration tools in the data field, powering workflows for companies worldwide. However, anyone who has worked with Airflow in a production environment, especially a complex one, knows that it can occasionally present problems and peculiar bugs.

    Among the many many elements it’s worthwhile to handle in an Airflow setting, one essential metric typically flies beneath the radar: DAG parse time. Monitoring and optimizing parse time is crucial to keep away from efficiency bottlenecks and make sure the appropriate functioning of your orchestrations, as we’ll discover on this article.

    That said, this tutorial introduces airflow-parse-bench, an open-source tool I developed to help data engineers monitor and optimize their Airflow environments, providing insights to reduce code complexity and parse time.

    In Airflow, DAG parse time is often an overlooked metric. Parsing occurs every time Airflow processes your Python files to build the DAGs dynamically.

    By default, all your DAGs are parsed every 30 seconds, a frequency controlled by the configuration variable min_file_process_interval. This means that every 30 seconds, all the Python code in your dags folder is read, imported, and processed to generate DAG objects containing the tasks to be scheduled. Successfully processed files are then added to the DAG Bag.

    Two key Airflow components handle this process:

    • The DagFileProcessorManager, which determines which Python files need to be processed;
    • The DagFileProcessorProcess, which parses individual files and turns them into DAG objects.

    Together, both components (commonly known as the dag processor) are executed by the Airflow Scheduler, ensuring that your DAG objects are updated before being triggered. However, for scalability and security reasons, it is also possible to run the dag processor as a separate component in your cluster.
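    If you want to run it separately, Airflow 2.3 and later support a standalone DAG processor. A minimal sketch, assuming you configure Airflow through environment variables:

    # Enable standalone mode, then run the processor as its own service
    export AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True
    airflow dag-processor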

    If your environment only has a few dozen DAGs, the parsing process is unlikely to cause any kind of problem. However, it's common to find production environments with hundreds or even thousands of DAGs. In this case, if your parse time is too high, it can lead to:

    • Delayed DAG scheduling.
    • Increased resource utilization.
    • Environment heartbeat issues.
    • Scheduler failures.
    • Excessive CPU and memory usage, wasting resources.

    Now, imagine having an environment with hundreds of DAGs containing unnecessarily complex parsing logic. Small inefficiencies can quickly turn into significant problems, affecting the stability and performance of your entire Airflow setup.

    When writing Airflow DAGs, there are some important best practices to keep in mind to create optimized code. Although you can find plenty of tutorials on how to improve your DAGs, I'll summarize some of the key principles that can significantly improve your DAG performance.

    Limit Top-Level Code

    One of the most common causes of high DAG parsing times is inefficient or complex top-level code. Top-level code in an Airflow DAG file is executed every time the Scheduler parses the file. If this code includes resource-intensive operations, such as database queries, API calls, or dynamic task generation, it can significantly impact parsing performance.

    The following code shows an example of a non-optimized DAG:
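    Below is a minimal sketch of such a DAG; the API endpoint, DAG id, and task names are illustrative placeholders, not code from the original post:

    import requests
    import pandas as pd
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Top-level code: everything here runs on EVERY parse cycle,
    # not just when the task executes.
    response = requests.get("https://api.example.com/data")  # hypothetical endpoint
    df = pd.DataFrame(response.json())  # DataFrame built at parse time

    def process_data():
        print(f"Processing {len(df)} rows")

    with DAG(
        dag_id="non_optimized_dag",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ):
        PythonOperator(task_id="process_data", python_callable=process_data)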

    In this case, every time the file is parsed by the Scheduler, the top-level code is executed, making an API request and processing the DataFrame, which can significantly impact the parse time.

    Another important factor contributing to slow parsing is top-level imports. Every library imported at the top level is loaded into memory during parsing, which can be time-consuming. To avoid this, you can move imports into functions or task definitions.

    The following code shows a better version of the same DAG:
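    A sketch under the same assumptions: the imports and the API call have moved inside the task callable, so they no longer run at parse time:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def process_data():
        # Heavy imports and the API request now happen only at task runtime,
        # keeping the parse of this file cheap.
        import requests
        import pandas as pd

        response = requests.get("https://api.example.com/data")  # hypothetical endpoint
        df = pd.DataFrame(response.json())
        print(f"Processing {len(df)} rows")

    with DAG(
        dag_id="optimized_dag",
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ):
        PythonOperator(task_id="process_data", python_callable=process_data)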

    Avoid XComs and Variables in Top-Level Code

    Still on the same topic, it's particularly important to avoid using XComs and Variables in your top-level code. As stated in Google's documentation:

    If you are using Variable.get() in top level code, every time the .py file is parsed, Airflow executes a Variable.get() which opens a session to the DB. This can dramatically slow down parse times.

    To address this, consider using a JSON dictionary to retrieve multiple variables in a single database query, rather than making several Variable.get() calls. Alternatively, use Jinja templates, as variables retrieved this way are only processed during task execution, not during DAG parsing.
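    A sketch of the difference, assuming a hypothetical variable named my_dag_config that stores all values as a single JSON dictionary:

    from airflow.models import Variable

    # Anti-pattern: each call opens its own database session on every parse.
    # var_a = Variable.get("var_a")
    # var_b = Variable.get("var_b")

    # Better: fetch several values with a single query by storing them
    # together as one JSON variable.
    config = Variable.get("my_dag_config", deserialize_json=True)
    var_a = config["var_a"]

    # Best: reference the variable in a templated field so it is only
    # resolved at task execution time, for example:
    # BashOperator(task_id="echo_a", bash_command="echo {{ var.value.var_a }}")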

    Remove Unnecessary DAGs

    Although it seems obvious, it's always important to remember to periodically clean up unnecessary DAGs and files from your environment:

    • Remove unused DAGs: check your dags folder and delete any files that are no longer needed.
    • Use .airflowignore: specify the files Airflow should intentionally ignore, skipping their parsing (see the example after this list).
    • Review paused DAGs: paused DAGs are still parsed by the Scheduler, consuming resources. If they're no longer required, consider removing or archiving them.
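    For illustration, a minimal .airflowignore might look like this (patterns use regular-expression syntax by default; the paths are placeholders):

    # .airflowignore: files matching these patterns are never parsed
    helpers/.*
    legacy_dags/.*
    .*_backup\.py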

    Change Airflow Configurations

    Finally, you can change some Airflow configurations to reduce the Scheduler's resource usage:

    • min_file_process_interval: This setting controls how often (in seconds) Airflow parses your DAG files. Increasing it from the default 30 seconds can reduce the Scheduler's load at the cost of slower DAG updates.
    • dag_dir_list_interval: This determines how often (in seconds) Airflow scans the dags directory for new DAGs. If you deploy new DAGs infrequently, consider increasing this interval to reduce CPU usage. Both settings live in the [scheduler] section of airflow.cfg, as sketched below.
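    A minimal sketch of these options in airflow.cfg; the values are illustrative, not recommendations:

    [scheduler]
    # Parse each DAG file at most every 2 minutes instead of every 30 seconds
    min_file_process_interval = 120
    # Scan the dags folder for new files every 5 minutes
    dag_dir_list_interval = 300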

    We've discussed a lot about the importance of creating optimized DAGs to maintain a healthy Airflow environment. But how do you actually measure the parse time of your DAGs? Fortunately, there are several ways to do this, depending on your Airflow deployment or operating system.

    For example, if you have a Cloud Composer deployment, you can easily retrieve a DAG parse report by running the following command with the Google Cloud CLI:

    gcloud composer environments run $ENVIRONMENT_NAME \
        --location $LOCATION \
        dags report

    While retrieving parse metrics is straightforward, measuring the effectiveness of your code optimizations can be less so. Every time you modify your code, you need to redeploy the updated Python file to your cloud provider, wait for the DAG to be parsed, and then extract a new report, which is a slow and time-consuming process.

    Another possible approach, if you're on Linux or Mac, is to run this command to measure the parse time locally on your machine:

    time python airflow/example_dags/example.py

    However, while simple, this approach isn't practical for systematically measuring and comparing the parse times of multiple DAGs.

    To address these challenges, I created airflow-parse-bench, a Python library that simplifies measuring and comparing the parse times of your DAGs using Airflow's native parse method.

    The airflow-parse-bench tool makes it easy to store parse times, compare results, and standardize comparisons across your DAGs.

    Installing the Library

    Before installing, it's recommended to use a virtualenv to avoid library conflicts, for example:
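    # Create and activate an isolated environment (the name is illustrative)
    python -m venv parse-bench-env
    source parse-bench-env/bin/activate

    Once the environment is active, you can install the package by running the following command: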

    pip install airflow-parse-bench

    Note: This command only installs the essential dependencies (related to Airflow and Airflow providers). You must manually install any additional libraries your DAGs depend on.

    For example, if a DAG uses boto3 to interact with AWS, make sure boto3 is installed in your environment. Otherwise, you'll run into parse errors.

    After that, you need to initialize your Airflow database. This can be done by running the following command:

    airflow db init

    In addition, if your DAGs use Airflow Variables, you must define them locally as well. However, you don't need to provide real values, since the actual values aren't required for parsing purposes:

    airflow variables set MY_VARIABLE 'ANY TEST VALUE'

    Without this, you'll encounter an error like:

    error: 'Variable MY_VARIABLE does not exist'

    Using the Tool

    After installing the library, you can begin measuring parse times. For example, suppose you have a DAG file named dag_test.py containing the non-optimized DAG code used in the example above.

    To measure its parse time, simply run:

    airflow-parse-bench --path dag_test.py

    This execution produces the following output:

    Execution result. Image by author.

    As observed, our DAG had a parse time of 0.61 seconds. If I run the command again, I'll see some small differences, as parse times can vary slightly across runs due to system and environmental factors:

    Result of another execution of the same DAG. Image by author.

    To get a more consistent number, you can aggregate multiple executions by specifying the number of iterations:

    airflow-parse-bench --path dag_test.py --num-iterations 5

    Although it takes a bit longer to finish, this calculates the average parse time across five executions.

    Now, to evaluate the impact of the aforementioned optimizations, I replaced the code in dag_test.py with the optimized version shared earlier. After running the same command, I got the following result:

    Parse result of the optimized code. Image by author.

    As you can see, just applying some good practices reduced the DAG parse time by almost 0.5 seconds, highlighting the importance of the changes we made!

    There are a few other interesting features that I think are worth sharing.

    As a reminder, if you have any doubts or problems using the tool, you can access the complete documentation on GitHub.

    Besides that, to view all the parameters supported by the library, simply run:

    airflow-parse-bench --help

    Testing Multiple DAGs

    Typically, you'll have dozens of DAGs whose parse times you want to test. To handle this use case, I created a folder named dags and put four Python files inside it.

    To measure the parse times of all the DAGs in a folder, you just need to pass the folder path to the --path parameter:

    airflow-parse-bench --path my_path/dags

    Running this command produces a table summarizing the parse times of all the DAGs in the folder:

    Testing the parse time of multiple DAGs. Image by author.

    By default, the table is sorted from the fastest to the slowest DAG. However, you can reverse the order by using the --order parameter:

    airflow-parse-bench --path my_path/dags --order desc

    Inverted sorting order. Image by author.

    Skipping Unchanged DAGs

    The --skip-unchanged parameter can be especially useful during development. As the name suggests, this option skips parsing for DAGs that haven't been modified since the last execution:

    airflow-parse-bench --path my_path/dags --skip-unchanged

    As shown below, when the DAGs remain unchanged, the output reflects no difference in parse times:

    Output with no difference for unchanged files. Image by author.

    Resetting the Database

    All DAG information, including metrics and history, is stored in a local SQLite database. If you want to clear all stored data and start fresh, use the --reset-db flag:

    airflow-parse-bench --path my_path/dags --reset-db

    This command resets the database and processes the DAGs as if it were the first execution.

    Parse time is an important metric for maintaining scalable and efficient Airflow environments, especially as your orchestration requirements become increasingly complex.

    For this reason, the airflow-parse-bench library can be a valuable tool for helping data engineers create better DAGs. By testing your DAGs' parse times locally, you can easily and quickly find your code bottlenecks, making your DAGs faster and more performant.

    Since the code is executed locally, the resulting parse time won't be identical to the one in your Airflow cluster. However, if you can reduce the parse time on your local machine, the same improvement can usually be reproduced in your cloud environment.

    Finally, this project is open for collaboration! If you have suggestions, ideas, or improvements, feel free to contribute on GitHub.


