Background: The Amazon Bin Image Dataset contains over 500,000 images, each accompanied by a metadata JSON file of roughly 1–3 KB. However, large numbers of small files (e.g., under 1 MB) are inefficient to process in bulk, because Spark incurs overhead opening and reading each file individually. In this demo, we use AWS Athena with Trino SQL (and the Amazon Ion Hive SerDe) to query 10,441 JSON files totaling 17.9 MB and consolidate them into 21 SNAPPY-compressed Parquet files totaling 3.9 MB.
To reproduce the result, run the following two steps. First, download a portion of the metadata JSON files from the public S3 bucket to your local machine and upload them to your own S3 bucket. Then, run the Trino SQL commands in AWS Athena.
1. ETL – https://github.com/nov05/udacity-nd009t-capstone-starter/blob/master/starter/ETL.ipynb
2. AWS Athena Trino SQL – https://github.com/nov05/udacity-nd009t-capstone-starter/blob/master/starter/AWS%20Athena%20Trino%20SQL.md
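
Step 2 above can be sketched roughly as the following Trino SQL, run in Athena. This is a minimal sketch, not the actual commands from the linked notes: the database, table, and bucket names are placeholders, and the column list is a simplified guess at the bin-metadata schema.

```sql
-- Sketch only: table, column, and bucket names are assumptions.
-- 1) Define an external table over the many small raw JSON metadata files,
--    read with the Amazon Ion Hive SerDe (which also parses JSON).
CREATE EXTERNAL TABLE bin_metadata_json (
    bin_fcsku_data    MAP<STRING, STRING>,  -- simplified; real field is nested
    expected_quantity INT
)
ROW FORMAT SERDE 'com.amazon.ionhiveserde.IonHiveSerDe'
LOCATION 's3://your-bucket/metadata/';

-- 2) CTAS: rewrite the thousands of small JSON files as a handful of
--    SNAPPY-compressed Parquet files in a new S3 prefix.
CREATE TABLE bin_metadata_parquet
WITH (
    format            = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://your-bucket/metadata-parquet/'
) AS
SELECT * FROM bin_metadata_json;
```

The CTAS statement is what performs the consolidation: Athena reads the small files once and emits a small number of larger Parquet files, which downstream Spark jobs can scan with far less per-file overhead.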