I have been going through use cases where some basic analytics needed to be run on structured logs generated by our system. Until now, my approach was to spin up an EMR cluster, load the logs onto it, and run Hive queries.
Then I found Athena.

We already have most of our relevant data in S3, which means we could use Athena on it directly. Other benefits:

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Athena scales automatically, executing queries in parallel, so results are fast even with large datasets and complex queries.
I wrote some dummy code that generates files of user data. I created a few files and then set up a directory structure in S3:
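The generator script itself was not shown in the post; here is a minimal sketch of what it might look like, assuming the tab-delimited layout used by the table definition below (fields user_id, name, phone_no, age, hobbies, state, country, with hobbies comma-separated). The file and folder names are illustrative.

```python
import os
import random

# Field layout assumed from the table definition below:
# user_id \t name \t phone_no \t age \t hobbies(,-separated) \t state \t country
HOBBIES = ["reading", "cricket", "chess", "hiking"]

def make_record(user_id: int) -> str:
    """Build one tab-delimited user record with dummy values."""
    hobbies = ",".join(random.sample(HOBBIES, k=2))
    return "\t".join([
        str(user_id),
        f"user_{user_id}",
        f"99999{user_id:05d}",
        str(random.randint(18, 70)),
        hobbies,
        "KA",
        "IN",
    ])

def write_dataset(base_dir: str, date_path: str, n: int) -> str:
    """Write n records under a YYYY/MM/DD folder, mirroring the S3 layout."""
    out_dir = os.path.join(base_dir, date_path)
    os.makedirs(out_dir, exist_ok=True)
    out_file = os.path.join(out_dir, "users_000.tsv")
    with open(out_file, "w") as f:
        for i in range(n):
            f.write(make_record(i) + "\n")
    return out_file

path = write_dataset("athene-test-dump", "2020/05/28", n=100)
print(path)
```

The local folder can then be pushed to the bucket, e.g. with `aws s3 cp --recursive`.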
The next step was to query this data in Athena. For that, a database and a table need to exist in Athena. The metadata in the table tells Athena where the data is located in Amazon S3 and specifies the structure of the data, for example, column names, data types, and the name of the table. Databases are a logical grouping of tables, and also hold only metadata and schema information for a dataset. The table creation process registers the dataset with Athena. This registration occurs in the AWS Glue Data Catalog and enables Athena to run queries on the data.
Step 1: Create the database:

CREATE DATABASE users_db
Step 2: Create the users table:

CREATE EXTERNAL TABLE IF NOT EXISTS userRecs (
  user_id int,
  name string,
  phone_no string,
  age int,
  hobbies array<string>,
  state string,
  country string
)
COMMENT 'User details'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://athene-test-dump/';
The table here is created over the base S3 bucket. I did not apply any partitioning; I simply provided the base folder and let Athena detect the files under the folder hierarchy.
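To make the row format concrete, here is a small illustration (plain Python, not Athena code) of how one line of the TSV data maps onto the table's columns: '\t' separates fields and ',' separates collection items. The values are made up.

```python
# One line of the TSV data, as stored in S3 (illustrative values).
line = "1\tAlice\t9999900001\t30\treading,chess\tKA\tIN"

# FIELDS TERMINATED BY '\t' -> one value per column
user_id, name, phone_no, age, hobbies_raw, state, country = line.split("\t")

# COLLECTION ITEMS TERMINATED BY ',' -> the array<string> column
hobbies = hobbies_raw.split(",")

record = {
    "user_id": int(user_id),
    "name": name,
    "phone_no": phone_no,
    "age": int(age),
    "hobbies": hobbies,
    "state": state,
    "country": country,
}
print(record["hobbies"])  # ['reading', 'chess']
```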
I executed a simple query. When you query an existing table, under the hood Amazon Athena uses Presto, a distributed SQL engine.

The query execution history is shown below:
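Besides the console, queries can also be submitted programmatically through the Athena API. A sketch of such a helper, assuming boto3 is available; the client is passed in so the helper stays testable, and the results location is a hypothetical bucket path.

```python
def submit_query(athena_client, sql: str, database: str, output_s3: str) -> str:
    """Submit a query via the Athena API and return its execution id.

    athena_client is expected to be boto3.client("athena"); the call
    below matches Athena's StartQueryExecution API.
    """
    resp = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]

# Example (hypothetical results location):
# qid = submit_query(boto3.client("athena"),
#                    "SELECT country, count(*) FROM userRecs GROUP BY country",
#                    "users_db", "s3://athene-test-dump/results/")
```

Query results land as CSV files under the configured output location, so the execution id is enough to fetch them later.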
I also decided to set up a partitioned table for the same data:

CREATE EXTERNAL TABLE IF NOT EXISTS users_partitioned (
  user_id int,
  name string,
  phone_no string,
  age int,
  hobbies array<string>,
  state string,
  country string
)
COMMENT 'User details'
PARTITIONED BY (dataset_date string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://athene-test-dump/';
The query executed successfully and the Athena UI showed the following note:

Query successful. If your table has partitions, you need to load these partitions to be able to query data. You can either load all partitions or load them individually. If you use the load all partitions (MSCK REPAIR TABLE) command, partitions must be in a format understood by Hive.
My data is not partitioned in the Hive format, so I will have to load the partitions manually:

ALTER TABLE users_partitioned ADD PARTITION (dataset_date='2020-05-28')
LOCATION 's3://athene-test-dump/2020/05/28';
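Running one ALTER TABLE statement per day gets tedious once there are many date folders. A small sketch that generates the statements for a range of dates, using the table and bucket names from above:

```python
from datetime import date, timedelta

def add_partition_sql(table: str, bucket: str, day: date) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for one day,
    pointing the dataset_date partition at the YYYY/MM/DD folder."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
        f"(dataset_date='{day.isoformat()}') "
        f"LOCATION 's3://{bucket}/{day.year}/{day.month:02d}/{day.day:02d}';"
    )

# Emit statements for three consecutive days.
start = date(2020, 5, 28)
for offset in range(3):
    print(add_partition_sql("users_partitioned", "athene-test-dump",
                            start + timedelta(days=offset)))
```

Each generated statement can then be run in Athena (or submitted through the API) to register that day's folder as a partition.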
If, instead of the date being split across folder levels (i.e. s3://athene-test-dump/2020/05/28), the data had been laid out as 's3://athene-test-dump/dataset_date=2020-05-28', then I could have loaded all partitions with a single MSCK REPAIR TABLE command.
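The difference between the two layouts is just a key-naming convention. A sketch that rewrites a YYYY/MM/DD object key into the Hive-style key=value form that MSCK REPAIR TABLE can discover; actually copying the objects to the new keys would be a separate step (e.g. aws s3 cp).

```python
import re

def to_hive_key(key: str, partition_col: str = "dataset_date") -> str:
    """Rewrite 'YYYY/MM/DD/file' into 'partition_col=YYYY-MM-DD/file',
    the key=value layout that MSCK REPAIR TABLE understands."""
    m = re.match(r"(\d{4})/(\d{2})/(\d{2})/(.+)", key)
    if not m:
        raise ValueError(f"unexpected key layout: {key}")
    y, mo, d, rest = m.groups()
    return f"{partition_col}={y}-{mo}-{d}/{rest}"

print(to_hive_key("2020/05/28/users_000.tsv"))
# dataset_date=2020-05-28/users_000.tsv
```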