AWS Athena Lab

AWS Athena




Athena Data Analytics


Athena in an Analysis and Visualization Solution


Core Components

  • AWS Glue
    • Glue Data Catalog
    • Glue Crawler
    • Glue ETL
  • AWS S3
  • AWS Athena
  • QuickSight

Lab Details

  • Athena working with Glue
    • Crawl NYC Taxi CSV data to catalog taxi data
    • Run Glue ETL to convert taxi data to Parquet
    • Crawl Parquet data to catalog columnar formatted taxi data
    • Use Athena to query taxi data
    • Use QuickSight to visualize our taxi data

Task Details

Query raw batch data in S3 using SQL statements.

  • AWS S3
    1. Download data from TLC and upload the data to Amazon S3
  • AWS Glue
    1. Use AWS Glue to create a database, then crawl the data in S3 to build table schemas.
    2. Use an AWS Glue ETL job to transform the raw data from .csv to .parquet, converting the date columns from string type to timestamp type.
  • AWS Athena
    • Connect to Glue to load schemas (databases and tables)
    • Choose your database
    • Query data
  • QuickSight
    • Connect to Athena for visualization
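The heart of this pipeline is the string-to-timestamp conversion. As a minimal local sketch of what the ETL step does to each value (the datetime format string is an assumption based on the TLC CSV layout, e.g. "2016-01-01 00:00:08"):

```python
from datetime import datetime

# Assumed format of the TLC datetime strings, e.g. "2016-01-01 00:00:08"
TLC_FORMAT = "%Y-%m-%d %H:%M:%S"

def to_timestamp(value: str) -> datetime:
    """Parse a tpep_*_datetime string into a real timestamp, as the ETL job does."""
    return datetime.strptime(value, TLC_FORMAT)

print(to_timestamp("2016-01-01 00:00:08"))  # 2016-01-01 00:00:08
```

Once the column is a real timestamp rather than a string, time-based operations (hour extraction, ordering, ranges) become cheap and correct.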

Prerequisite


Download Data

Download the data from NYC.gov: visit TLC Trip Record Data.

Or click here to download 2016 January - Yellow Taxi Trip Records (CSV)


Validate raw data

We can see it is a large file, about 1.6 GB.

Shell
cd ~/Downloads
ll yellow_tripdata_2016-01.csv


New York City has 10,906,859 yellow taxi records for January 2016.

Shell
wc -l yellow_tripdata_2016-01.csv
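If you prefer Python over shell for validation, here is a hedged equivalent of the checks above (note that wc -l counts the header line too, so the record count is one less; the tiny sample file below is a stand-in for the real 1.6 GB download):

```python
import csv
import os

def count_records(path: str) -> int:
    """Count data rows in a CSV file, excluding the header line (wc -l includes it)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return sum(1 for _ in reader)

# Demo on a tiny stand-in file; the real lab file is ~1.6 GB
with open("sample_tripdata.csv", "w", newline="") as f:
    f.write("vendorid,tpep_pickup_datetime\n")
    f.write("1,2016-01-01 00:00:08\n")
    f.write("2,2016-01-01 00:01:15\n")

print(count_records("sample_tripdata.csv"))  # 2
print(os.path.getsize("sample_tripdata.csv") > 0)  # True
```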


S3 Configuration

Create a bucket for storing the lab data.

Create bucket and folders

Create a folder in your S3 bucket for storing raw data.

Create a folder in your S3 bucket for storing parquet data.


Upload data to your raw data folder

Click on your folder for the raw data.

For a file this large, the upload may be more reliable through Chrome.


Glue Configuration

Create a Database for Raw Data Table & Transformed Data Table

Create a database for storing raw data and transformed data schemas.
Click on Add database


  • Database name: ny_taxi_db
  • Location: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
  • Description: New York Taxi Data


Click ny_taxi_db

Click on Tables in ny_taxi_db

Click on Create tables using a crawler


Create Raw Data Table through Crawler

Use a crawler to crawl the .csv metadata.

Crawler Info
  • Crawler name: ny_taxi_crawler

Click on Next


Crawler source type

Click on Next


Data store
  • Choose a data store: S3
  • Include path: your S3 bucket

Click on Next

Click on Next


IAM Role

Create a new IAM role, or choose an existing one if you already have it.

Policies:

  • AmazonS3FullAccess
  • AmazonKinesisFullAccess
  • AWSGlueServiceRole

Click on Next


Schedule
  • Frequency: Run on demand

Click on Next


Output
  • Database: ny_taxi_db

Click on Next


Review all steps

Click on Next


Select the Crawler, and click on Run crawler


Review the table

Now you can see the table
Click on ny_taxi table



ETL - Jobs

In this step, AWS Glue transforms the raw data from .csv to Apache Parquet (.parquet).
As we saw, the raw data stores dates as strings, which we need to convert to timestamps for time-related queries.

Services -> AWS Glue -> AWS Glue Studio


Click on Create and manage jobs


Create job

  • Select Source and target added to the graph

Click on Create


Click on Untitled job at the top left of the page to edit the job name.

  • ny_taxi_etl_parquet


Click on the 1st node (the S3 bucket icon), then choose your database.

  • Database: ny_taxi_db
  • Table: ny_taxi


Parquet Format

Click on the 2nd node, Transform - ApplyMapping icon, then change tpep_pickup_datetime and tpep_dropoff_datetime to pickup_date and dropoff_date. Modify their Data type to timestamp.


Click on the 3rd node, the S3 bucket icon, then set the following:

  • Format: Parquet
  • Compression Type: None
  • S3 Target Location: choose your S3 bucket

Select your bucket then click on Choose


Click on Job details tab.

  • IAM Role: GlueUnlimited
  • Bookmarks: Disable

Click on Save at the top right of the page.


Services -> Glue -> Jobs

Select ny_taxi_parquet_ver and click on Action -> Edit script


Map Class

Add a column type with the value yellow for every record in the .parquet files.
Add the following code after line 20 of the script.

Add a column "type"
## Add a column 'type' with the value 'yellow' for every record

def AddType(rec):
    rec["type"] = "yellow"  # add "yellow" to the record dict when the function runs
    return rec

mapped_DataSource0 = Map.apply(frame = DataSource0, f = AddType)
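Because AddType is plain Python, you can sanity-check it locally before running the job; the record dict below is a stand-in for a Glue DynamicFrame record:

```python
def AddType(rec):
    """Same logic as in the Glue script: tag every record as 'yellow'."""
    rec["type"] = "yellow"
    return rec

# A plain dict standing in for one DynamicFrame record
row = {"vendorid": 1, "fare_amount": 9.5}
print(AddType(row))  # {'vendorid': 1, 'fare_amount': 9.5, 'type': 'yellow'}
```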


Add the following mapping to the end of the mappings list in Transform0 = ApplyMapping.apply()

, ("type", "string", "type", "string")

The complete line should read as follows.

Transform0 = ApplyMapping.apply(frame = mapped_DataSource0, mappings = [("vendorid", "long", "vendorid", "long"), ("tpep_pickup_datetime", "string", "pickup_date", "timestamp"), ("tpep_dropoff_datetime", "string", "dropoff_date", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("pickup_longitude", "double", "pickup_longitude", "double"), ("pickup_latitude", "double", "pickup_latitude", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("dropoff_longitude", "double", "dropoff_longitude", "double"), ("dropoff_latitude", "double", "dropoff_latitude", "double"), ("payment_type", "long", "payment_type", "long"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double"), ("type", "string", "type", "string")], transformation_ctx = "Transform0")


In the Transform0 = ApplyMapping.apply(...) call, change frame = DataSource0 to frame = mapped_DataSource0.


Click on X on the right of the page.
Then select ny_taxi_parquet, click on Action -> Run job

Click on Run job


Select ny_taxi_parquet to see the History


We can also check the S3 bucket.


Create Transformed Data Table through Crawler

Use a crawler to crawl the .parquet files' metadata.

Services -> Glue -> Databases

Click on ny_taxi_db


Click on Add tables -> add tables using a crawler


Crawler Info
  • Crawler name: ny_taxi_parquet_crawler

Click on Next


Crawler source type

Click on Next


Data store
  • Choose a data store: S3
  • Include path: your S3 bucket

Click on Next


IAM Role

Create a new IAM role, or choose an existing one if you already have it.

Policies:

  • AmazonS3FullAccess
  • AmazonKinesisFullAccess
  • AWSGlueServiceRole

Click on Next


Schedule
  • Frequency: Run on demand

Click on Next


Output
  • Database: ny_taxi_db

Click on Next

Review all steps

Click on Next


Select the Crawler, and click on Run crawler


Review the table

Now you can see the table
Click on ny_taxi_parquet table


Notice that the pickup_date and dropoff_date columns are now of timestamp data type.

Notice also the new column type.


Athena Configuration

Services -> Athena


Click on Get Started


Athena User Interface

Data source
Our Data source is AwsDataCatalog, which connects to AWS Glue.

Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning. See Amazon Athena.


Database
Under Database, open the drop-down (showing default) to choose a database.
You can see ny_taxi_db, the database we created in the previous steps.
Select ny_taxi_db.


Tables
Here are the tables ny_taxi and ny_taxi_parquet.

Attributes in table ny_taxi

Attributes in table ny_taxi_parquet
You can see tpep_pickup_datetime (string) and tpep_dropoff_datetime (string) have been transformed to pickup_date (timestamp) and dropoff_date (timestamp).


Athena Query

You can click the Run query button to query data in S3 through the Glue schema.
You can also run a query by pressing Command + Enter.


Athena
show tables from ny_taxi_db;


Athena
select count(*) from ny_taxi_db.ny_taxi;


Athena
select count(*) from ny_taxi_db.ny_taxi_parquet;


Notice that in the ny_taxi table the datetime columns are strings, which makes it hard to extract date and time parts.

Athena
select * from ny_taxi_db.ny_taxi order by tpep_pickup_datetime limit 10;


The AWS Glue ETL job converted the datetime strings from the CSV to timestamps and removed the tpep_ prefix.

Athena
select * from ny_taxi_db.ny_taxi_parquet order by pickup_date limit 10;

When we scroll right, we can see the new column type.


Use Athena to make time related queries

Athena
SELECT EXTRACT(HOUR FROM pickup_date) AS hour, avg(fare_amount) AS average_fare FROM ny_taxi_db.ny_taxi_parquet GROUP BY 1 ORDER BY 1;

We get the average fare for each hour.
Conclusion: a New York taxi driver earns the highest average fare between 4 AM and 5 AM.
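The GROUP BY logic of that query can be mirrored in plain Python to check the shape of the result (the sample fares are made up for illustration):

```python
from collections import defaultdict

# (pickup_hour, fare_amount) pairs standing in for rows of ny_taxi_parquet
trips = [(4, 52.0), (4, 48.0), (9, 12.5), (9, 11.5), (9, 14.0)]

# Accumulate (sum, count) per hour, like GROUP BY 1 with avg()
sums = defaultdict(lambda: [0.0, 0])
for hour, fare in trips:
    sums[hour][0] += fare
    sums[hour][1] += 1

average_fare = {hour: total / n for hour, (total, n) in sorted(sums.items())}
print(average_fare[4])  # 50.0
```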


Query the tip percent category

Athena
SELECT TipPrctCtgry
     , COUNT(DISTINCT TripID) TripCt
FROM
    (SELECT TripID
          , (CASE
                 WHEN fare_prct < 0.7 THEN 'FL70'
                 WHEN fare_prct < 0.8 THEN 'FL80'
                 WHEN fare_prct < 0.9 THEN 'FL90'
                 ELSE 'FL100'
             END) FarePrctCtgry
          , (CASE
                 WHEN tip_prct < 0.1 THEN 'TL10'
                 WHEN tip_prct < 0.15 THEN 'TL15'
                 WHEN tip_prct < 0.2 THEN 'TL20'
                 ELSE 'TG20'
             END) TipPrctCtgry
     FROM
         (SELECT TripID
               , (fare_amount / total_amount) AS fare_prct
               , (extra / total_amount) AS extra_prct
               , (mta_tax / total_amount) AS mta_taxprct
               , (tolls_amount / total_amount) AS tolls_prct
               , (tip_amount / total_amount) AS tip_prct
               , (improvement_surcharge / total_amount) AS imprv_suchrgprct
               , total_amount
          FROM
              (SELECT *
                    , (cast(pickup_longitude AS VARCHAR(100)) || '_' || cast(pickup_latitude AS VARCHAR(100))) AS TripID
               FROM ny_taxi_db.ny_taxi_parquet
               WHERE total_amount > 0
              ) AS t
         ) AS t
    ) ct
GROUP BY TipPrctCtgry;

Run time: 3.89 s! That is much faster than a one-core m5.xlarge EMR cluster (14 s) or an m4.xlarge (25 s).
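The inner CASE expression bins each trip by its tip share of the total. A sketch of the same binning in Python, with the thresholds copied from the query:

```python
def tip_category(tip_amount: float, total_amount: float) -> str:
    """Bucket a trip by tip share of the total, mirroring the query's CASE expression."""
    tip_prct = tip_amount / total_amount
    if tip_prct < 0.10:
        return "TL10"
    if tip_prct < 0.15:
        return "TL15"
    if tip_prct < 0.20:
        return "TL20"
    return "TG20"

print(tip_category(1.0, 20.0))  # TL10 (5% tip)
print(tip_category(5.0, 20.0))  # TG20 (25% tip)
```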


Count all rows in ny_taxi_parquet by type

Athena
SELECT type, count(*) FROM ny_taxi_db.ny_taxi_parquet group by type;


Get the max and min distances

Athena
SELECT min(trip_distance) AS minDistance,
       max(trip_distance) AS maxDistance,
       min(total_amount) AS minTotal,
       max(total_amount) AS maxTotal
FROM ny_taxi_db.ny_taxi_parquet;

It seems the data source contains some weird outlier values.


Get the average distance and average cost per mile

Athena
SELECT type,
       avg(trip_distance) AS avgDist,
       avg(total_amount/trip_distance) AS avgCostPerMile
FROM ny_taxi_db.ny_taxi_parquet
WHERE trip_distance > 0
  AND total_amount > 0
GROUP BY type;
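The WHERE trip_distance > 0 guard matters here: without it, the per-mile division would divide by zero. A small Python sketch of the same filtered aggregate (the sample rows are illustrative):

```python
# (type, trip_distance, total_amount) rows standing in for ny_taxi_parquet
trips = [
    ("yellow", 2.5, 12.0),
    ("yellow", 0.0, 3.0),   # zero-distance row: filtered out by the WHERE clause
    ("yellow", 5.0, 20.0),
]

valid = [(d, t) for (_, d, t) in trips if d > 0 and t > 0]
avg_dist = sum(d for d, _ in valid) / len(valid)
avg_cost_per_mile = sum(t / d for d, t in valid) / len(valid)
print(avg_dist, avg_cost_per_mile)  # 3.75 4.4
```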


Get average distance and percentile

Athena
SELECT type,
       avg(trip_distance) avgDistance,
       avg(total_amount/trip_distance) avgCostPerMile,
       avg(total_amount) avgCost,
       approx_percentile(total_amount, .99) percentile99
FROM ny_taxi_db.ny_taxi_parquet
WHERE trip_distance > 0 AND total_amount > 0
GROUP BY type;
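approx_percentile(total_amount, .99) returns an approximate 99th percentile. To build intuition, here is an exact stdlib computation on a stand-in sample (statistics.quantiles with n=100 yields 99 cut points, and index 98 is the 99th percentile):

```python
import statistics

# Stand-in for total_amount values; the real query runs over ~10.9M rows
amounts = list(range(1, 101))

# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile
p99 = statistics.quantiles(amounts, n=100)[98]
print(p99)  # ≈ 99.99
```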


QuickSight Configuration

Services -> QuickSight

Connect to Athena

Click on Datasets on navigation panel


Click on New dataset at the top right of the page


Click on Athena


  • Data source name: NYC Taxi Parquet

Click on Validate connection

Click on Create data source


  • Database: select ny_taxi_db


Select ny_taxi_parquet
Click on Edit/Preview data


Edit/Preview data

Click on Add calculated field


Add name: HourOfDay

Type the following expression

extract('HH', {pickup_date})
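The extract('HH', {pickup_date}) expression pulls the hour out of the timestamp. The Python equivalent, for checking expectations (the sample timestamp is illustrative):

```python
from datetime import datetime

def hour_of_day(pickup_date: datetime) -> int:
    """Equivalent of the QuickSight expression extract('HH', {pickup_date})."""
    return pickup_date.hour

print(hour_of_day(datetime(2016, 1, 30, 16, 45)))  # 16
```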

Click on Save


Now we have the new field


Click on Save & visualize at the top middle of the page


AutoGraph
  • Visual types: AutoGraph
  • Y axis: HourOfDay

Click on the expand button to see the graph in full size


Geospatial chart
  • Visual types: Points on map
  • Geospatial
    • pickup_longitude
    • pickup_latitude
  • Color: HourOfDay

Wait a minute for the result.

  • You can zoom in on the map to see details.
  • Hover over a point to see its details.