    From Configuration to Orchestration: Building an ETL Workflow with AWS Is No Longer a Struggle

    By Team_AIBS News | June 19, 2025


    AWS continues to lead the cloud market with a whopping 32% share, thanks to its early market entry, strong expertise and comprehensive service offerings. Nevertheless, many users find AWS difficult to navigate, and this discontent has led more companies and organisations to pick its rivals, Microsoft Azure and Google Cloud Platform.

    Despite its steeper learning curve and less intuitive interface, AWS remains the top cloud provider thanks to its reliability, hybrid cloud support and the broadest range of services. More importantly, choosing the right approach can significantly reduce configuration complexity, streamline workflows, and boost performance.

    In this article, I'll introduce an efficient way to set up a complete ETL pipeline with orchestration on AWS, based on my own experience. It will also give you a refreshed view of producing data with AWS, or make configuration feel like less of a struggle if this is your first time using AWS for such tasks.

    Strategy for Designing an Efficient Data Pipeline

    AWS has the most comprehensive ecosystem, with a vast range of services. Building a production-ready data warehouse on AWS requires at least the following services:

    • IAM – Although this service isn't part of the workflow itself, it is the foundation for accessing all the other services.
    • AWS S3 – Data Lake storage
    • AWS Glue – ETL processing
    • Amazon Redshift – Data Warehouse
    • CloudWatch – Monitoring and logging

    You also need access to Airflow if you have to schedule more complex dependencies and handle advanced retries and error handling, although Redshift can cover some basic cron jobs.
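    Before configuring anything, it can help to confirm that your credentials can actually reach these services. Below is a minimal boto3 sketch for such a check; the region is a placeholder and the calls simply list or describe existing resources:

    import boto3

    # Quick access check for the services listed above (region is a placeholder)
    session = boto3.Session(region_name="us-east-1")

    print(session.client("sts").get_caller_identity()["Arn"])    # IAM identity in use
    print(session.client("s3").list_buckets()["Buckets"])        # S3 reachable
    print(session.client("glue").list_jobs(MaxResults=5))        # Glue reachable
    print(session.client("redshift").describe_clusters())        # Redshift reachable
    print(session.client("logs").describe_log_groups(limit=5))   # CloudWatch Logs reachable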

    To make your work easier, I highly recommend installing an IDE (Visual Studio Code or PyCharm, or of course your own favourite IDE). An IDE dramatically improves your efficiency for complex Python code, local testing/debugging, version control integration and team collaboration. In the next section, I'll show the step-by-step configuration.

    Initial Setup

    Here are the initial configuration steps:

    • Launch a virtual environment in your IDE
    • Install dependencies – basically, we need to install the libraries that will be used later on.
    pip install apache-airflow==2.7.0 boto3 pandas pyspark sqlalchemy
    • Install the AWS CLI – this step allows you to write scripts to automate various AWS operations and makes managing AWS resources more efficient.
    • AWS configuration – make sure to enter your IAM user credentials when prompted (e.g. by aws configure):
      • AWS Access Key ID: from your IAM user.
      • AWS Secret Access Key: from your IAM user.
      • Default region: us-east-1 (or your preferred region)
      • Default output format: json.
    • Integrate Airflow – here are the steps, with the commands shown below:
      • Initialize Airflow
      • Create DAG files in Airflow
      • Run the webserver at http://localhost:8080 (login: admin/admin)
      • Open another terminal tab and start the scheduler
    export AIRFLOW_HOME=$(pwd)/airflow
    airflow db init                   # initialize Airflow
    airflow users create \
      --username admin \
      --password admin \
      --firstname Admin \
      --lastname User \
      --role Admin \
      --email [email protected]
    airflow webserver --port 8080     # run the webserver
    airflow scheduler                 # start the scheduler (in another terminal)
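    If this is your first time running Airflow locally, a tiny sanity-check DAG is a quick way to confirm that the webserver and scheduler are picking up files from your DAG folder. This is only an assumed example (the file name, DAG id and task are mine, not part of the setup above):

    # airflow/dags/sanity_check.py  (place it in your AIRFLOW_HOME/dags folder)
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        "sanity_check",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@once",
        catchup=False,
    ) as dag:
        # If this message shows up in the task log in the UI, Airflow is wired up correctly
        BashOperator(task_id="hello", bash_command="echo 'Airflow is up'")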

    Development Workflow: COVID-19 Data Case Study

    I'm using JHU's public COVID-19 dataset (CC BY 4.0 licensed) for demonstration purposes. You can refer to the data here.

    The chart below shows the workflow in the development environment, from data ingestion through to loading the data into Redshift tables.

    Development workflow created by author

    Data Ingestion

    In the first step of ingesting data into AWS S3, I processed the data by melting it to long format and converting the date format. I saved the data in Parquet format to improve storage efficiency, boost query performance and reduce storage costs. The code for this step is below:

    import pandas as pd
    from datetime import datetime
    import os
    import boto3
    import sys

    def process_covid_data():
        try:
            # Load raw data
            url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
            df = pd.read_csv(url)

            # --- Data Processing ---
            # 1. Melt to long format
            df = df.melt(
                id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
                var_name='date_str',
                value_name='confirmed_cases'
            )

            # 2. Convert dates (JHU format: MM/DD/YY) and drop unparseable rows
            df['date'] = pd.to_datetime(
                df['date_str'],
                format='%m/%d/%y',
                errors='coerce'
            )
            df = df.dropna(subset=['date'])

            # 3. Save as partitioned Parquet
            output_dir = "covid_processed"
            df.to_parquet(
                output_dir,
                engine='pyarrow',
                compression='snappy',
                partition_cols=['date']
            )

            # 4. Upload to S3
            s3 = boto3.client('s3')
            total_files = 0

            for root, _, files in os.walk(output_dir):
                for file in files:
                    local_path = os.path.join(root, file)
                    s3_path = os.path.join(
                        'raw/covid/',
                        os.path.relpath(local_path, output_dir)
                    )
                    s3.upload_file(
                        Filename=local_path,
                        Bucket='my-dev-bucket',
                        Key=s3_path
                    )
                total_files += len(files)

            print(f"Successfully processed and uploaded {total_files} Parquet files")
            print(f"Data covers from {df['date'].min()} to {df['date'].max()}")
            return True

        except Exception as e:
            print(f"Error: {str(e)}", file=sys.stderr)
            return False

    if __name__ == "__main__":
        process_covid_data()

    After running the Python code, you should be able to see the Parquet files in the S3 bucket, under the 'raw/covid/' folder.

    Screenshot by author
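    If you prefer to verify the upload from a script rather than the console, a quick boto3 listing of the same bucket and prefix used above does the job:

    import boto3

    # List a few of the uploaded Parquet files under raw/covid/
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-dev-bucket", Prefix="raw/covid/", MaxKeys=10)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])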

    ETL Pipeline Development

    AWS Glue is mainly used for ETL pipeline development. Although it can also be used for data ingestion even before the data has landed in S3, its strength lies in processing data once it's in S3 for data warehousing purposes. Here's the PySpark script for the data transform:

    # transform_covid.py
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext
    from pyspark.sql.functions import *

    glueContext = GlueContext(SparkContext.getOrCreate())
    df = glueContext.create_dynamic_frame.from_options(
        "s3",
        {"paths": ["s3://my-dev-bucket/raw/covid/"]},
        format="parquet"
    ).toDF()

    # Add transformations here
    df_transformed = df.withColumn("load_date", current_date())

    # Write to processed zone
    df_transformed.write.parquet(
        "s3://my-dev-bucket/processed/covid/",
        mode="overwrite"
    )
    Screenshot by author
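    The script above can be created as a Glue job directly in the console. If you would rather register and trigger it from code, a boto3 sketch along the following lines should work; the job name, IAM role and script location here are assumptions for illustration:

    import boto3

    glue = boto3.client("glue")

    # Register the job, pointing at the script uploaded to S3 (role and paths are placeholders)
    glue.create_job(
        Name="dev_covid_transformation",
        Role="arn:aws:iam::your-account-id:role/GlueServiceRole",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-dev-bucket/scripts/transform_covid.py",
            "PythonVersion": "3",
        },
        GlueVersion="4.0",
        WorkerType="G.1X",
        NumberOfWorkers=2,
    )

    # Start a run and print its id so you can follow it in the console or CloudWatch
    run = glue.start_job_run(JobName="dev_covid_transformation")
    print(run["JobRunId"])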

    The next step is to load the data into Redshift. In the Redshift console, click "Query Editor v2" on the left-hand side, where you can edit your SQL code and run the Redshift COPY.

    -- Create a table covid_data in the dev schema
    CREATE TABLE dev.covid_data (
        "Province/State" VARCHAR(100),
        "Country/Region" VARCHAR(100),
        "Lat" FLOAT8,
        "Long" FLOAT8,
        date_str VARCHAR(100),
        confirmed_cases FLOAT8
    )
    DISTKEY("Country/Region")
    SORTKEY(date_str);

    -- COPY data into Redshift
    COPY dev.covid_data (
        "Province/State",
        "Country/Region",
        "Lat",
        "Long",
        date_str,
        confirmed_cases
    )
    FROM 's3://my-dev-bucket/processed/covid/'
    IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
    REGION 'your-region'
    FORMAT PARQUET;

    Then you'll see the data successfully loaded into the data warehouse.

    Screenshot by author
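    To double-check the load without leaving your IDE, you can also run a quick count through the Redshift Data API. This is only a sketch; the cluster identifier, database and user are placeholders:

    import time
    import boto3

    rsd = boto3.client("redshift-data")

    # Run a row count against the freshly loaded table (cluster/db/user are placeholders)
    stmt = rsd.execute_statement(
        ClusterIdentifier="my-dev-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql="SELECT COUNT(*) FROM dev.covid_data;",
    )

    # Poll until the statement finishes, then print the count
    while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)
    print(rsd.get_statement_result(Id=stmt["Id"])["Records"][0][0]["longValue"])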

    Pipeline Automation

    The simplest way to automate your data pipeline is to schedule jobs under Redshift query editor v2 by creating a stored procedure (I have a more detailed introduction to SQL stored procedures; you can refer to this article).

    CREATE OR REPLACE PROCEDURE dev.run_covid_etl()
    AS $$
    BEGIN
      TRUNCATE TABLE dev.covid_data;
      COPY dev.covid_data
      FROM 's3://simba-dev-bucket/raw/covid'
      IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
      REGION 'your-region'
      FORMAT PARQUET;
    END;
    $$ LANGUAGE plpgsql;
    Screenshot by author

    Alternatively, you can run Airflow for scheduled jobs.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator

    default_args = {
        'owner': 'data_team',
        'depends_on_past': False,
        'start_date': datetime(2023, 1, 1),
        'retries': 2
    }

    with DAG(
        'redshift_etl_dev',
        default_args=default_args,
        schedule_interval='@daily',
        catchup=False
    ) as dag:

        run_etl = RedshiftSQLOperator(
            task_id='run_covid_etl',
            redshift_conn_id='redshift_dev',
            sql='CALL dev.run_covid_etl()',
        )

    Production Workflow

    An Airflow DAG is powerful for orchestrating the entire ETL pipeline when there are many dependencies, and it is also standard practice in a production environment.

    After creating and testing your ETL pipeline, you can automate your tasks in the production environment using Airflow.

    Production workflow created by author

    Here is a checklist of key preparation steps to support a successful deployment in Airflow:

    • Create the S3 bucket my-prod-bucket
    • Create the Glue job prod_covid_transformation in the AWS console
    • Create the Redshift stored procedure prod.load_covid_data()
    • Configure Airflow
    • Configure SMTP for email notifications in airflow.cfg
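    Some of these prerequisites can be created or verified from a script as well. Here is a hedged boto3 sketch; the bucket and job names follow the checklist above, and the region is an assumption:

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    glue = boto3.client("glue", region_name="us-east-1")

    # Create the production bucket (outside us-east-1 you also need a
    # CreateBucketConfiguration with a LocationConstraint)
    s3.create_bucket(Bucket="my-prod-bucket")

    # Confirm the Glue job from the checklist exists before the DAG tries to call it
    job = glue.get_job(JobName="prod_covid_transformation")
    print(job["Job"]["Name"], job["Job"]["Command"]["ScriptLocation"])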

    Then the deployment of the data pipeline in Airflow is:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
    from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
    from airflow.operators.email import EmailOperator

    # 1. DAG CONFIGURATION
    default_args = {
        'owner': 'data_team',
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
        'start_date': datetime(2023, 1, 1)
    }

    # 2. DATA INGESTION FUNCTION
    def load_covid_data():
        import pandas as pd

        url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
        df = pd.read_csv(url)

        df = df.melt(
            id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
            var_name='date_str',
            value_name='confirmed_cases'
        )
        df['date'] = pd.to_datetime(df['date_str'], format='%m/%d/%y')

        df.to_parquet(
            's3://my-prod-bucket/raw/covid/',
            engine='pyarrow',
            partition_cols=['date']
        )

    # 3. DAG DEFINITION
    with DAG(
        'covid_etl',
        default_args=default_args,
        schedule_interval='@daily',
        catchup=False
    ) as dag:

        # Task 1: Ingest data
        ingest = PythonOperator(
            task_id='ingest_data',
            python_callable=load_covid_data
        )

        # Task 2: Transform with Glue
        transform = GlueJobOperator(
            task_id='transform_data',
            job_name='prod_covid_transformation',
            script_args={
                '--input_path': 's3://my-prod-bucket/raw/covid/',
                '--output_path': 's3://my-prod-bucket/processed/covid/'
            }
        )

        # Task 3: Load to Redshift
        load = RedshiftSQLOperator(
            task_id='load_data',
            sql="CALL prod.load_covid_data()"
        )

        # Task 4: Notifications
        notify = EmailOperator(
            task_id='send_email',
            to='you-email-address',
            subject='ETL Status: {{ ds }}',
            html_content='ETL job completed: View Logs'
        )

        # Define the task order so Airflow runs the pipeline end to end
        ingest >> transform >> load >> notify
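    Before deploying, you can exercise the whole DAG in a single process with dag.test() (available from Airflow 2.5 onwards). Appending a small block like the one below to the bottom of the DAG file is enough; note that it calls the real AWS services, so the connections and credentials described above must already be in place:

    if __name__ == "__main__":
        # Runs every task in-process for one logical date; handy as a last local check
        dag.test()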

    My Final Thoughts

    Although some users, especially those who are new to the cloud and looking for simple solutions, tend to be daunted by AWS's high barrier to entry and overwhelmed by its huge choice of services, it is worth the time and effort, and here are the reasons:

    • The process of configuration, and of designing, building and testing the data pipelines, gives you a deep understanding of a typical data engineering workflow. These skills will benefit you even if you build your projects on other cloud platforms, such as Azure, GCP or Alibaba Cloud.
    • AWS's mature ecosystem and the vast array of services it offers allow users to customise their data architecture strategies and enjoy more flexibility and scalability in their projects.

    Thanks for reading! I hope this article helps you build your cloud-based data pipeline!


