    From Configuration to Orchestration: Building an ETL Workflow with AWS Is No Longer a Struggle

    By Team_AIBS News | June 19, 2025


    AWS continues to lead the cloud market with a whopping 32% share, thanks to its early market entry, strong expertise and comprehensive service offerings. Nevertheless, many users find AWS difficult to navigate, and this discontent has led more companies and organisations to pick its rivals, Microsoft Azure and Google Cloud Platform.

    Despite its steeper learning curve and less intuitive interface, AWS remains the top cloud provider thanks to its reliability, hybrid cloud support and the broadest range of services. More importantly, choosing the right approach can significantly reduce configuration complexity, streamline workflows, and boost performance.

    In this article, I'll introduce an efficient way to set up a complete ETL pipeline with orchestration on AWS, based on my own experience. It will also give you a refreshed view of producing data with AWS, or make configuration feel like less of a struggle if this is your first time using AWS for such tasks.

    Strategy for Designing an Efficient Data Pipeline

    AWS has the most comprehensive ecosystem, with a vast range of services. Building a production-ready data warehouse on AWS requires at least the following services:

    • IAM – Although this service isn't part of the workflow itself, it is the foundation for accessing all the other services.
    • AWS S3 – Data Lake storage
    • AWS Glue – ETL processing
    • Amazon Redshift – Data Warehouse
    • CloudWatch – Monitoring and logging

    You also need access to Airflow if you have to schedule more complex dependencies and handle advanced retries and error handling, although Redshift can cover some basic cron jobs.
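    Before configuring anything, it can help to confirm that your credentials can actually reach these services. Below is a minimal boto3 sketch for such a check; the region is a placeholder and the calls simply list or describe existing resources:

    import boto3

    # Quick access check for the services listed above (region is a placeholder)
    session = boto3.Session(region_name="us-east-1")

    print(session.client("sts").get_caller_identity()["Arn"])    # IAM identity in use
    print(session.client("s3").list_buckets()["Buckets"])        # S3 reachable
    print(session.client("glue").list_jobs(MaxResults=5))        # Glue reachable
    print(session.client("redshift").describe_clusters())        # Redshift reachable
    print(session.client("logs").describe_log_groups(limit=5))   # CloudWatch Logs reachable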

    To make your work easier, I highly recommend installing an IDE (Visual Studio Code or PyCharm, or of course your own favourite IDE). An IDE dramatically improves your efficiency for complex Python code, local testing/debugging, version control integration and team collaboration. In the next section, I'll show the step-by-step configuration.

    Initial Setup

    Here are the initial configuration steps:

    • Launch a virtual environment in your IDE
    • Install dependencies – basically, we need to install the libraries that will be used later on.
    pip install apache-airflow==2.7.0 boto3 pandas pyspark sqlalchemy
    • Install the AWS CLI – this step allows you to write scripts to automate various AWS operations and makes managing AWS resources more efficient.
    • AWS configuration – make sure to enter your IAM user credentials when prompted (e.g. by aws configure):
      • AWS Access Key ID: from your IAM user.
      • AWS Secret Access Key: from your IAM user.
      • Default region: us-east-1 (or your preferred region)
      • Default output format: json.
    • Integrate Airflow – here are the steps, with the commands shown below:
      • Initialize Airflow
      • Create DAG files in Airflow
      • Run the webserver at http://localhost:8080 (login: admin/admin)
      • Open another terminal tab and start the scheduler
    export AIRFLOW_HOME=$(pwd)/airflow
    airflow db init                   # initialize Airflow
    airflow users create \
      --username admin \
      --password admin \
      --firstname Admin \
      --lastname User \
      --role Admin \
      --email [email protected]
    airflow webserver --port 8080     # run the webserver
    airflow scheduler                 # start the scheduler (in another terminal)
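    If this is your first time running Airflow locally, a tiny sanity-check DAG is a quick way to confirm that the webserver and scheduler are picking up files from your DAG folder. This is only an assumed example (the file name, DAG id and task are mine, not part of the setup above):

    # airflow/dags/sanity_check.py  (place it in your AIRFLOW_HOME/dags folder)
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        "sanity_check",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@once",
        catchup=False,
    ) as dag:
        # If this message shows up in the task log in the UI, Airflow is wired up correctly
        BashOperator(task_id="hello", bash_command="echo 'Airflow is up'")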

    Development Workflow: COVID-19 Data Case Study

    I'm using JHU's public COVID-19 dataset (CC BY 4.0 licensed) for demonstration purposes. You can refer to the data here.

    The chart below shows the workflow in the development environment, from data ingestion through to loading the data into Redshift tables.

    Development workflow created by author

    Data Ingestion

    In the first step of ingesting data into AWS S3, I processed the data by melting it to long format and converting the date format. I saved the data in Parquet format to improve storage efficiency, boost query performance and reduce storage costs. The code for this step is below:

    import pandas as pd
    from datetime import datetime
    import os
    import boto3
    import sys

    def process_covid_data():
        try:
            # Load raw data
            url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
            df = pd.read_csv(url)

            # --- Data Processing ---
            # 1. Melt to long format
            df = df.melt(
                id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
                var_name='date_str',
                value_name='confirmed_cases'
            )

            # 2. Convert dates (JHU format: MM/DD/YY) and drop unparseable rows
            df['date'] = pd.to_datetime(
                df['date_str'],
                format='%m/%d/%y',
                errors='coerce'
            )
            df = df.dropna(subset=['date'])

            # 3. Save as partitioned Parquet
            output_dir = "covid_processed"
            df.to_parquet(
                output_dir,
                engine='pyarrow',
                compression='snappy',
                partition_cols=['date']
            )

            # 4. Upload to S3
            s3 = boto3.client('s3')
            total_files = 0

            for root, _, files in os.walk(output_dir):
                for file in files:
                    local_path = os.path.join(root, file)
                    s3_path = os.path.join(
                        'raw/covid/',
                        os.path.relpath(local_path, output_dir)
                    )
                    s3.upload_file(
                        Filename=local_path,
                        Bucket='my-dev-bucket',
                        Key=s3_path
                    )
                total_files += len(files)

            print(f"Successfully processed and uploaded {total_files} Parquet files")
            print(f"Data covers from {df['date'].min()} to {df['date'].max()}")
            return True

        except Exception as e:
            print(f"Error: {str(e)}", file=sys.stderr)
            return False

    if __name__ == "__main__":
        process_covid_data()

    After running the Python code, you should be able to see the Parquet files in the S3 bucket, under the 'raw/covid/' folder.

    Screenshot by author
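    If you prefer to verify the upload from a script rather than the console, a quick boto3 listing of the same bucket and prefix used above does the job:

    import boto3

    # List a few of the uploaded Parquet files under raw/covid/
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-dev-bucket", Prefix="raw/covid/", MaxKeys=10)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])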

    ETL Pipeline Development

    AWS Glue is mainly used for ETL pipeline development. Although it can also be used for data ingestion even before the data has landed in S3, its strength lies in processing data once it's in S3 for data warehousing purposes. Here's the PySpark script for the data transform:

    # transform_covid.py
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext
    from pyspark.sql.functions import *

    glueContext = GlueContext(SparkContext.getOrCreate())
    df = glueContext.create_dynamic_frame.from_options(
        "s3",
        {"paths": ["s3://my-dev-bucket/raw/covid/"]},
        format="parquet"
    ).toDF()

    # Add transformations here
    df_transformed = df.withColumn("load_date", current_date())

    # Write to processed zone
    df_transformed.write.parquet(
        "s3://my-dev-bucket/processed/covid/",
        mode="overwrite"
    )
    Screenshot by author
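    The script above can be created as a Glue job directly in the console. If you would rather register and trigger it from code, a boto3 sketch along the following lines should work; the job name, IAM role and script location here are assumptions for illustration:

    import boto3

    glue = boto3.client("glue")

    # Register the job, pointing at the script uploaded to S3 (role and paths are placeholders)
    glue.create_job(
        Name="dev_covid_transformation",
        Role="arn:aws:iam::your-account-id:role/GlueServiceRole",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-dev-bucket/scripts/transform_covid.py",
            "PythonVersion": "3",
        },
        GlueVersion="4.0",
        WorkerType="G.1X",
        NumberOfWorkers=2,
    )

    # Start a run and print its id so you can follow it in the console or CloudWatch
    run = glue.start_job_run(JobName="dev_covid_transformation")
    print(run["JobRunId"])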

    The next step is to load the data into Redshift. In the Redshift console, click "Query Editor v2" on the left-hand side, where you can edit your SQL code and run the Redshift COPY.

    -- Create a table covid_data in the dev schema
    CREATE TABLE dev.covid_data (
        "Province/State" VARCHAR(100),
        "Country/Region" VARCHAR(100),
        "Lat" FLOAT8,
        "Long" FLOAT8,
        date_str VARCHAR(100),
        confirmed_cases FLOAT8
    )
    DISTKEY("Country/Region")
    SORTKEY(date_str);

    -- COPY data into Redshift
    COPY dev.covid_data (
        "Province/State",
        "Country/Region",
        "Lat",
        "Long",
        date_str,
        confirmed_cases
    )
    FROM 's3://my-dev-bucket/processed/covid/'
    IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
    REGION 'your-region'
    FORMAT PARQUET;

    Then you'll see the data successfully loaded into the data warehouse.

    Screenshot by author
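    To double-check the load without leaving your IDE, you can also run a quick count through the Redshift Data API. This is only a sketch; the cluster identifier, database and user are placeholders:

    import time
    import boto3

    rsd = boto3.client("redshift-data")

    # Run a row count against the freshly loaded table (cluster/db/user are placeholders)
    stmt = rsd.execute_statement(
        ClusterIdentifier="my-dev-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql="SELECT COUNT(*) FROM dev.covid_data;",
    )

    # Poll until the statement finishes, then print the count
    while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)
    print(rsd.get_statement_result(Id=stmt["Id"])["Records"][0][0]["longValue"])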

    Pipeline Automation

    The simplest way to automate your data pipeline is to schedule jobs under Redshift query editor v2 by creating a stored procedure (I have a more detailed introduction to SQL stored procedures; you can refer to this article).

    CREATE OR REPLACE PROCEDURE dev.run_covid_etl()
    AS $$
    BEGIN
      TRUNCATE TABLE dev.covid_data;
      COPY dev.covid_data
      FROM 's3://simba-dev-bucket/raw/covid'
      IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
      REGION 'your-region'
      FORMAT PARQUET;
    END;
    $$ LANGUAGE plpgsql;
    Screenshot by author

    Alternatively, you can run Airflow for scheduled jobs.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator

    default_args = {
        'owner': 'data_team',
        'depends_on_past': False,
        'start_date': datetime(2023, 1, 1),
        'retries': 2
    }

    with DAG(
        'redshift_etl_dev',
        default_args=default_args,
        schedule_interval='@daily',
        catchup=False
    ) as dag:

        run_etl = RedshiftSQLOperator(
            task_id='run_covid_etl',
            redshift_conn_id='redshift_dev',
            sql='CALL dev.run_covid_etl()',
        )

    Production Workflow

    An Airflow DAG is powerful for orchestrating the entire ETL pipeline when there are many dependencies, and it is also standard practice in a production environment.

    After creating and testing your ETL pipeline, you can automate your tasks in the production environment using Airflow.

    Production workflow created by author

    Here is a checklist of key preparation steps to support a successful deployment in Airflow:

    • Create the S3 bucket my-prod-bucket
    • Create the Glue job prod_covid_transformation in the AWS console
    • Create the Redshift stored procedure prod.load_covid_data()
    • Configure Airflow
    • Configure SMTP for email notifications in airflow.cfg
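    Some of these prerequisites can be created or verified from a script as well. Here is a hedged boto3 sketch; the bucket and job names follow the checklist above, and the region is an assumption:

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    glue = boto3.client("glue", region_name="us-east-1")

    # Create the production bucket (outside us-east-1 you also need a
    # CreateBucketConfiguration with a LocationConstraint)
    s3.create_bucket(Bucket="my-prod-bucket")

    # Confirm the Glue job from the checklist exists before the DAG tries to call it
    job = glue.get_job(JobName="prod_covid_transformation")
    print(job["Job"]["Name"], job["Job"]["Command"]["ScriptLocation"])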

    Then the deployment of the data pipeline in Airflow is:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
    from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
    from airflow.operators.email import EmailOperator

    # 1. DAG CONFIGURATION
    default_args = {
        'owner': 'data_team',
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
        'start_date': datetime(2023, 1, 1)
    }

    # 2. DATA INGESTION FUNCTION
    def load_covid_data():
        import pandas as pd

        url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
        df = pd.read_csv(url)

        df = df.melt(
            id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
            var_name='date_str',
            value_name='confirmed_cases'
        )
        df['date'] = pd.to_datetime(df['date_str'], format='%m/%d/%y')

        df.to_parquet(
            's3://my-prod-bucket/raw/covid/',
            engine='pyarrow',
            partition_cols=['date']
        )

    # 3. DAG DEFINITION
    with DAG(
        'covid_etl',
        default_args=default_args,
        schedule_interval='@daily',
        catchup=False
    ) as dag:

        # Task 1: Ingest data
        ingest = PythonOperator(
            task_id='ingest_data',
            python_callable=load_covid_data
        )

        # Task 2: Transform with Glue
        transform = GlueJobOperator(
            task_id='transform_data',
            job_name='prod_covid_transformation',
            script_args={
                '--input_path': 's3://my-prod-bucket/raw/covid/',
                '--output_path': 's3://my-prod-bucket/processed/covid/'
            }
        )

        # Task 3: Load to Redshift
        load = RedshiftSQLOperator(
            task_id='load_data',
            sql="CALL prod.load_covid_data()"
        )

        # Task 4: Notifications
        notify = EmailOperator(
            task_id='send_email',
            to='you-email-address',
            subject='ETL Status: {{ ds }}',
            html_content='ETL job completed: View Logs'
        )

        # Define the task order so Airflow runs the pipeline end to end
        ingest >> transform >> load >> notify
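    Before deploying, you can exercise the whole DAG in a single process with dag.test() (available from Airflow 2.5 onwards). Appending a small block like the one below to the bottom of the DAG file is enough; note that it calls the real AWS services, so the connections and credentials described above must already be in place:

    if __name__ == "__main__":
        # Runs every task in-process for one logical date; handy as a last local check
        dag.test()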

    My Final Thoughts

    Although some users, especially those who are new to the cloud and looking for simple solutions, tend to be daunted by AWS's high barrier to entry and overwhelmed by its huge choice of services, it is worth the time and effort, and here are the reasons:

    • The process of configuration, and of designing, building and testing the data pipelines, gives you a deep understanding of a typical data engineering workflow. These skills will benefit you even if you build your projects on other cloud platforms, such as Azure, GCP or Alibaba Cloud.
    • AWS's mature ecosystem and the vast array of services it offers allow users to customise their data architecture strategies and enjoy more flexibility and scalability in their projects.

    Thanks for reading! I hope this article helps you build your cloud-based data pipeline!


