Background: The Amazon Bin Image Dataset contains over 500,000 images, each accompanied by a metadata JSON file of roughly 1–3 KB. However, large numbers of small files (e.g., under 1 MB) are inefficient to process in bulk, because Spark incurs overhead opening and reading each file individually. In this demo, we use AWS Athena with Trino SQL (and the Amazon Ion Hive SerDe) to query 10,441 JSON files totaling 17.9 MB and consolidate them into 21 SNAPPY-compressed Parquet files totaling 3.9 MB.
To reproduce the result, run the following two steps. First, download a portion of the metadata JSON files from the public S3 bucket to your local machine and upload them to your own S3 bucket. Then, run the Trino SQL commands in AWS Athena.
1. ETL – https://github.com/nov05/udacity-nd009t-capstone-starter/blob/master/starter/ETL.ipynb
2. AWS Athena Trino SQL – https://github.com/nov05/udacity-nd009t-capstone-starter/blob/master/starter/AWS%20Athena%20Trino%20SQL.md
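
Step 2 above can be sketched roughly as the following Trino SQL, run in Athena. This is a minimal sketch, not the actual commands from the linked notes: the database, table, and bucket names are placeholders, and the column list is a simplified guess at the bin-metadata schema.

```sql
-- Sketch only: table, column, and bucket names are assumptions.
-- 1) Define an external table over the many small raw JSON metadata files,
--    read with the Amazon Ion Hive SerDe (which also parses JSON).
CREATE EXTERNAL TABLE bin_metadata_json (
    bin_fcsku_data    MAP<STRING, STRING>,  -- simplified; real field is nested
    expected_quantity INT
)
ROW FORMAT SERDE 'com.amazon.ionhiveserde.IonHiveSerDe'
LOCATION 's3://your-bucket/metadata/';

-- 2) CTAS: rewrite the thousands of small JSON files as a handful of
--    SNAPPY-compressed Parquet files in a new S3 prefix.
CREATE TABLE bin_metadata_parquet
WITH (
    format            = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://your-bucket/metadata-parquet/'
) AS
SELECT * FROM bin_metadata_json;
```

The CTAS statement is what performs the consolidation: Athena reads the small files once and emits a small number of larger Parquet files, which downstream Spark jobs can scan with far less per-file overhead.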