Search This Blog

Tuesday 2 June 2020

AWS Athena

I have been going through use cases where some basic analytics needed to be run on structured logs generated by our system. The way I did it till now, is to spin up an EMR cluster, load my logs on it and execute hive queries.
Then I found Athena.
Amazon Athena is an interactive query service that makes it easy to analyze
data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
We already have most of our relevant data in S3. Which means we cold use Athena directly with it. Other Benefits:
Athena is serverless, so there is no infrastructure to set up or manage,
and you pay only for the queries you run. 
Athena scales automatically -executing queries in parallel—so results
are fast, even with large datasets and complex queries.
I wrote up a dummy code that created files of user data
I created a few files and then setup a directory structure in s3:
The next step was to query this data in Athena. For this we need to make a database and table in Athena.
For each dataset, a table needs to exist in Athena. The metadata in the
table tells Athena where the data is located in Amazon S3, and 
specifies the structure of the data, for example, column names, data 
types, and the name of the table. Databases are a logical grouping of 
tables, and also hold only metadata and schema information for a dataset.
The tables creation process registers the dataset with Athena. This 
registration occurs in the AWS Glue Data Catalog and enables Athena to 
run queries on the data.

Step 1: Create the database:
CREATE DATABASE users_db
Step 2: Create the users table:
CREATE EXTERNAL TABLE IF NOT EXISTS userRecs ( 
  user_id int, 
  name String,
  phone_no String, 
  age int,
  hobbies array<string>,
  state String,
  country String)
COMMENT 'User details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://athene-test-dump/';
The table here is created over the base S3 bucket. I did not apply any  partioning. Simply provided the base folder and let Athena detect the files under the folder hierarchy.
When you query an existing table, under the hood, Amazon Athena uses Presto,
 a distributed SQL engine.
I executed a simple query:
The query execution history is as below:
I also decided to setup a partitioned table for the same data:

CREATE EXTERNAL TABLE IF NOT EXISTS users_partitioned (
         user_id int,
         name String,
         phone_no String,
         age int,
         hobbies array<string>,
         state String,
         country String 
) COMMENT 'User details' PARTITIONED BY (dataset_date String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3://athene-test-dump/';
The query executed successfully and Athena UI gave the below comment:
Query successful. If your table has partitions, you need to load these partitions
to be able to query data. You can either load all partitions or load them 
individually. If you use the load all partitions (MSCK REPAIR TABLE) command, 
partitions must be in a format understood by Hive. Learn more.
My data is not partitioned in the Hive format. So I will have to manually load the partitions.
ALTER TABLE users_partitioned ADD PARTITION (dataset_date='2020-05-28') 
location 's3://athene-test-dump/2020/05/28';


If instead of date being parts of different folder levels (i.e. s3://athene-test-dump/2020/05/28), I had it as 's3://athene-test-dump/dataset=2020-05-28', than I could have loaded it using the MSCK REPAIR TABLE command.

1 comment:

  1. AWS Athena Consulting Services - By addressing these common issues, users can optimize the performance and accuracy of their queries in Amazon Athena. Regularly monitoring query execution, optimizing data formats, and maintaining data consistency will contribute to a smooth and efficient data analysis experience with Athena.

    ReplyDelete