AWS Glue Lab

AWS Glue


Glue Data Catalog Lab

AWS Glue Data Catalog



Populating Glue Data Catalog


Lab Details

  • Four steps to populate your Glue Data Catalog (see the boto3 sketch after this list)
    1. Classify your data by running a crawler
       • Custom classifiers
       • Built-in classifiers
    2. The crawler connects to the data store
    3. The crawler infers the schema
    4. The crawler writes metadata to the Data Catalog
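For reference, the same four steps can also be driven from code. Below is a minimal boto3 sketch; the crawler name, role ARN, database, and S3 path are placeholders, not the names created later in this lab:

Python
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Create a crawler that targets an S3 path (all names/ARNs here are placeholders)
glue.create_crawler(
    Name='demo-crawler',
    Role='arn:aws:iam::123456789012:role/demo-glue-role',
    DatabaseName='demo_db',
    Targets={'S3Targets': [{'Path': 's3://your-bucket/JSON/'}]})

# Run it on demand; the crawler works asynchronously, so the tables
# only appear in the Data Catalog once the run has finished
glue.start_crawler(Name='demo-crawler')
tables = glue.get_tables(DatabaseName='demo_db')
print([t['Name'] for t in tables['TableList']])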


Data Source Prerequisite

If you completed the Glue ETL lab, you should have a .json file with a name like run-*** in your bucket.


Select it, click on Actions, then click on Rename object


  • New object name: Medicare_Hospital_Provider.json

Click on Save changes


Click on Create folder

  • Folder name: JSON

Click on Create folder


Select your JSON file, click on Actions, then click on Move


Click on Browse S3

Select JSON/
Click on Choose destination


Click on Move


Extract Data - Glue Crawler Configuration

Services -> AWS Glue


Click on Crawlers navigation


Click on Add crawler


Crawler info
  • Crawler name: legislators

Click on Next


Crawler source type

Leave all settings as default

Click on Next


Data store
  • Choose a data store from the following: S3
    • S3
    • JDBC
    • DynamoDB
    • Amazon DocumentDB
    • MongoDB


  • Include path: your JSON/ path.

Exclude patterns: define prefixes or glob patterns for data that you want the crawler to skip (for example, *.csv).

Click on Next


Add another data store: select No, then click on Next


IAM Role
  • The IAM role allows the crawler to run and access your Amazon S3 data stores.

    • Select Create an IAM role
  • IAM Role for Glue: Role assumed by the crawler with permission to access your data store.

    • Legislate

IAM will grant your new role (Legislate) the following permissions on the S3 bucket:

  • s3:GetObject
  • s3:PutObject

Click on Next


Schedule
  • Frequency: Run on demand

Click on Next


Output

Click on Add database

  • Database: legislators

Click on Next


Review all steps

Click on Finish


Select legislators, then click on Run crawler


Check the Database & Table

Click Databases on the navigation panel.

Select legislators and click on View tables

Click on the table


Dev endpoints Configuration

Click on Dev endpoints on the navigation panel.


Properties

A Glue dev endpoint allows us to use a SageMaker notebook to interact with the Glue Data Catalog.

Click on Add endpoint


  • Development endpoint name: example_endpoint

Under IAM role, click on Create IAM


Click on Create role


Select Glue
Click on Glue: Allow Glue to call AWS services on your behalf
Click on Next: Permissions


  • Policies: AmazonS3FullAccess

Click on Next: Tags


Click on Next: Review


  • Role name: glue_legislators

Click on Create role


Go back to the Dev endpoints page
Click on the refresh button
Select glue_legislators
Click on Next


Networking

Click on Next


SSH public key

Click on Next


Review

Click on Finish

Click on the refresh button
Wait until the Provisioning status changes from PROVISIONING to READY
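If you prefer to watch the endpoint status from code rather than refreshing the console, here is a minimal boto3 sketch (it assumes the endpoint name example_endpoint from the step above):

Python
import time
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Poll the dev endpoint until it leaves the PROVISIONING state
while True:
    status = glue.get_dev_endpoint(EndpointName='example_endpoint')['DevEndpoint']['Status']
    print(status)
    if status != 'PROVISIONING':
        break
    time.sleep(30)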


Create a SageMaker Notebook

Select your dev endpoint
Click on Action
Click on Create SageMaker notebook


  • Notebook name: legislators
  • IAM role: Legislate

Click on Create notebook


Create Sparkmagic (PySpark)

AWS Glue PySpark Transforms Reference

Select your notebook, click on Open notebook


Here we are in the Jupyter notebook instance.
We will use Python to query the JSON file schema.

Click on New
Select Sparkmagic (PySpark)


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

Import the libraries we need to use.
Click on Run


Since we don't have the same data source, the following screenshots are taken from the video.

Create a dynamic_frame

persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="persons_json")
print("Count: ", persons.count())
persons.printSchema()


memberships = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="memberships_json")
print("Count: ", memberships.count())
memberships.printSchema()
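With both dynamic frames loaded, you can combine them using the transforms from the reference linked above. Below is a minimal sketch; it assumes memberships has a person_id field that matches persons.id, so adjust the keys to whatever printSchema() showed for your tables:

Python
from awsglue.transforms import Join

# Join the two dynamic frames on their shared key (field names are assumptions)
people_memberships = Join.apply(persons, memberships, 'id', 'person_id')
print("Joined count: ", people_memberships.count())
people_memberships.printSchema()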


Glue ETL Lab


Glue ETL from S3


Lab Details

We want to crawl the .csv file to build a table in the Data Catalog, so we first need to configure security.


Task Details

  1. Upload the .csv file to your S3 bucket.
  2. Set up a Glue security configuration.
  3. Create a Glue crawler to crawl the .csv file schema and capture the metadata.
  4. Configure a Glue job to transform the .csv file into a .json file.

Data Source & Prerequisite

Services -> S3


Select one of your S3 buckets.
Upload the AWS Glue sample data.

Convert the S3 URI to an HTTPS URL to download the sample data:

  • s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv
  • https://awsglue-datasets.s3.amazonaws.com/examples/medicare/Medicare_Hospital_Provider.csv


Download sample file to your PC

Check whether wget is installed; if not, install it (the brew command assumes macOS with Homebrew):

Shell
wget --version
brew install wget

Shell
cd ~/Downloads
wget https://awsglue-datasets.s3.amazonaws.com/examples/medicare/Medicare_Hospital_Provider.csv


Check file

Shell
head Medicare_Hospital_Provider.csv 


Upload sample file to S3

Then upload the .csv file to your S3 bucket.
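If you'd rather script the upload, here is a minimal boto3 sketch (the bucket name is a placeholder for your own bucket):

Python
import boto3

s3 = boto3.client('s3')

# Upload the downloaded sample file to your bucket (bucket name is a placeholder)
s3.upload_file('Medicare_Hospital_Provider.csv',
               'your-bucket-name',
               'Medicare_Hospital_Provider.csv')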


Security Configurations

Services -> AWS Glue


Click on Security configurations navigation


  • Security configuration name: glueDemoSecurityConfig
  • Encryption mode: SSE-KMS

Three things we can encrypt in AWS Glue:

  • Data (at rest)
  • CloudWatch logs
  • Job bookmarks

  • Uncheck CloudWatch logs encryption

Expand Advanced properties

  • Uncheck Job bookmark encryption

Click on Finish


Refresh the page to see your security configuration
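The same security configuration can also be created with boto3. A hedged sketch is below; the KMS key ARN is a placeholder, and CloudWatch log and job bookmark encryption are left disabled to match the console steps above:

Python
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# SSE-KMS for data at rest; CloudWatch logs and job bookmarks left unencrypted
glue.create_security_configuration(
    Name='glueDemoSecurityConfig',
    EncryptionConfiguration={
        'S3Encryption': [{
            'S3EncryptionMode': 'SSE-KMS',
            'KmsKeyArn': 'arn:aws:kms:us-east-1:123456789012:key/your-key-id'  # placeholder
        }],
        'CloudWatchEncryption': {'CloudWatchEncryptionMode': 'DISABLED'},
        'JobBookmarksEncryption': {'JobBookmarksEncryptionMode': 'DISABLED'}
    })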


Extract Data - Glue Crawler Configuration

Services -> AWS Glue


Click on Tables navigation


Click on Add tables using a crawler


Crawler info
  • Crawler name: glueDemo

Expand Tags, description, security configuration, and classifiers (optional)

  • Security configuration: glueDemoSecurityConfig

Click on Next


Crawler source type

Leave all settings as default

Click on Next


Data store
  • Choose a data store from the following: S3
    • S3
    • JDBC
    • DynamoDB
    • Amazon DocumentDB
    • MongoDB


  • Include path: your bucket with the sample .csv file.

Exclude patterns: define prefixes or glob patterns for data that you want the crawler to skip.

Click on Next


Add another data store: select No, then click on Next


IAM Role
  • The IAM role allows the crawler to run and access your Amazon S3 data stores.

    • Select Create an IAM role
  • IAM Role for Glue: Role assumed by the crawler with permission to access your data store.

    • demoClawer

IAM will grant your new role (demoClawer) the following permissions on the S3 bucket:

  • s3:GetObject
  • s3:PutObject

Click on Next


Schedule
  • Frequency: Run on demand

Click on Next


Output

Click on Add database

  • Database: demodb

  • Prefix added to tables (optional): demoTarget

Click on Next


Review all steps

Click on Finish


Validation Test

Run Crawler

Working with Crawlers on the AWS Glue Console

Select your new crawler, then click on Run crawler
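The crawler run can also be started and watched from code. A minimal boto3 sketch, using the names created earlier in this lab:

Python
import time
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Start the crawler, give it a moment to enter the RUNNING state,
# then wait for it to return to READY
glue.start_crawler(Name='glueDemo')
time.sleep(30)
while glue.get_crawler(Name='glueDemo')['Crawler']['State'] != 'READY':
    time.sleep(30)

# List the tables the crawler added to the demodb database
print([t['Name'] for t in glue.get_tables(DatabaseName='demodb')['TableList']])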


CloudWatch logs

Click on Logs


Databases

Click on Databases navigation
Click on demodb


Click on Tables in demodb


Transform Data - Job Configuration

Click on Jobs navigation


Click on Add job


Job properties
  • Name: glueDemoJob

Click on Create IAM role
Ensure that this role has permission to access your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job.

For this lab, temporarily create a role with the AWSGlueConsoleFullAccess and AmazonS3FullAccess policies attached.

  • IAM role: unlimitedGlue


  • Type: Spark
  • Glue version: Spark 2.4, Python 3 (Glue Version 1.0)
  • This job runs: A proposed script generated by AWS Glue


Expand Advanced properties

  • Job bookmark: Specifies how AWS Glue handles state from previous runs when the job runs. It can track previously processed data (Enable), process incremental data without updating state information (Pause), or ignore state information and always process the entire dataset (Disable).
    • Disable


Expand Monitoring options


Expand Security configuration, script libraries, and job parameters (optional)

  • Security configuration: glueDemoSecurityConfig

  • Worker type: Standard

    • Standard
    • G.1X (Recommended for memory-intensive jobs)
    • G.2X (Recommended for jobs with ML transforms)
  • Number of workers: The maximum number of workers you can define is 299 for G.1X and 149 for G.2X.

  • Max concurrency: Sets the maximum number of concurrent runs that are allowed for this job. An error is returned when this threshold is reached. The default is 1.

  • Job timeout (minutes): Set the execution time limit in minutes.

  • Delay notification threshold (minutes): Set a delay threshold in minutes. If the job runs longer than the specified time, Glue sends a delay notification via CloudWatch.

  • Catalog options (optional): To use Glue data catalog as the Hive metastore, the IAM role used for the job should have glue:CreateDatabase permissions. A database called “default” is created in the Glue catalog if it does not exist.

Click on Next


Data source

Select the data source you created before.
Click on Next


Transform type
  • Change schema: Change schema of your source data and create a new target dataset
  • Find matching records: Use machine learning to find matching records within your source data
    • Remove duplicate records: When records match, the record with the lowest primary key value survives.
  • Worker type: Set the type of predefined worker that is allocated when a job runs.
  • Maximum capacity: Set the maximum number of AWS Glue data processing units (DPUs) that can be allocated when a job runs.

Select Change schema then click on Next


Data target

Select Create tables in your data target

  • Data store: Amazon S3
  • Format: JSON
  • Compression type: None
  • Target path: choose your sample .csv file bucket

Click on Next


Schema

Review the original .csv file

Notice that some data types will change.

You can Add column, Delete column, or Reorder columns.

Click on Save job and edit script.


Script editor


Click on Database at the top left of the page.

Then click on the Schema tab.


In the top navigation pane, you can change the Source, Target, and Target Location.

See more of Transforms
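For orientation, a generated change-schema script usually follows the pattern sketched below. This is a simplified illustration rather than the exact script Glue produces; the table name and mapping tuples are assumptions you would replace with your own columns and bucket path:

Python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the crawled table from the Data Catalog (table name is an assumption)
source = glueContext.create_dynamic_frame.from_catalog(
    database="demodb",
    table_name="demotarget_medicare_hospital_provider_csv")

# Rename/retype columns; each tuple is (source column, source type, target column, target type)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("drg definition", "string", "drg_definition", "string")])  # placeholder mapping

# Write the result back to S3 as JSON
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/json-output/"},
    format="json")

job.commit()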


Click on Run job
You have a second chance to modify the settings you configured earlier.
Click on Run job


The job is running.

Refresh the page to see Logs

Once the refresh icon (at the top left of the page) disappears, the job is finished.


Go to S3 bucket to see the transformed file.

Select the transformed file, then click on Action and Download

Compare it to the original .csv file.


Glue ETL from Streaming

Adding Streaming ETL Jobs in AWS Glue


Kinesis Configuration

Services -> Data streams


Click on Create data stream


Data stream configuration

  • Data stream name: MySparkStreamSource

Data stream capacity

  • Number of open shards: 1
    • Because we only need a very light streaming workload to feed AWS Glue

Create data stream
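Equivalently, the stream can be created with boto3 (a minimal sketch):

Python
import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')

# One shard is enough for the light streaming workload in this lab
kinesis.create_stream(StreamName='MySparkStreamSource', ShardCount=1)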


AWS Glue Configuration

Services -> AWS Glue


Databases

To be able to read the stream, we need an AWS Glue database and table.


Click on Databases
Click on Add database


  • Database name: sparkstreamdb

Click on Create


Click on Tables on the navigation panel.


Click on sparkstreamdb


Click on Tables in sparkstreamdb


Click on Add tables -> Add table manually


Table properties
  • Table name: MySparkStreamSource
  • Database: sparkstreamdb


Data Store
  • Select the type of source: Kinesis
  • Stream name: MySparkStreamSource
  • Kinesis source URL: https://kinesis.us-east-1.amazonaws.com (enter your Kinesis endpoint URL in this form)


Data format
  • Classification: JSON

Click on Next


Glue Schema

Click on Add column
Add the columns to match your data source (the JSON records we will send to the stream):

  • EVENT_TIME, string
  • ITEM, string
  • PRICE, float

Click on Next


Review

Click on Finish


Click on the table you just created.


ETL - Jobs

Click on Jobs on the navigation panel
Click on Add job


Job properties
  • Name: MySparkStreamETL

Click on create IAM Role
Click on Create role

Select Glue as the AWS Service

  • Permissions (Policies)

    • AmazonS3FullAccess
    • AWSGlueServiceRole
    • AmazonKinesisFullAccess
  • Role name: SparkStreamETL


  • IAM Role: SparkStreamETL
  • Type: Spark Streaming
  • Glue version: Spark 2.4, Python 3 (Glue Version 1.0)
  • This job runs: A proposed script generated by AWS Glue
  • Script file name: MySparkStreamETL

Click on Next


Data source

The data source is the database and table you just created (sparkstreamdb -> MySparkStreamSource).
Click on Next


Data target

We want to put our data in S3, so select Create tables in your data target

  • Data store: Amazon S3
  • Format: Parquet
  • Target path: your S3 bucket path; you can create a dedicated folder for this lab

Click on Next


Schema

Select Specify output schema for all records

You can drop any column you don’t want. Here we drop price.
Click on Save job and edit script


You can edit the script
Click on Save
Click on X to exit


Now we have the ETL Job


Python Configuration

Python3 KinesisProducer-C&F.py
import sys
import json
import boto3
import random
import datetime
import pandas as pd

# define our Kinesis service using the us-east-1 region
kinesis = boto3.client('kinesis', region_name='us-east-1')

def start_process(format_flag):
    #format_flag = int(format_flag)
    begin_streaming(format_flag)

# function to produce our streaming data in JSON format
def produceJSONData():
    data = {}
    time_now = datetime.datetime.now()
    time_now_string = time_now.isoformat()
    data['EVENT_TIME'] = time_now_string
    data['ITEM'] = random.choice(['Longboard', 'Onewheel', 'Surfboard', 'Snowboard', 'Paddleboard'])
    price = random.random() * 100
    data['PRICE'] = round(price, 2)
    return data

# function to produce our streaming data in CSV format
def produceCSVData():
    data = pd.read_csv("Surveillance.csv")
    return data

def begin_streaming(format_flag):
    # define the number of data stream elements we wish to create
    number_of_records = 30
    record_count = 0

    if format_flag == 'transform':
        # create the streaming data and send it to our Kinesis Data Stream called kinesis-transform-demo
        data = produceCSVData()
        for _, row in data.iterrows():
            values = ','.join(str(value) for value in row)  # join the values together by a ','
            encodedValues = bytes(values, 'utf-8')  # encode the string to bytes
            print(encodedValues)
            kinesis.put_record(
                StreamName="kinesis-transform-demo",
                Data=encodedValues,
                PartitionKey="partitionkey")
            record_count += 1
    elif format_flag == 'format':
        data = produceJSONData()
        while record_count < number_of_records:
            data = json.dumps(produceJSONData())  # + 'record # ' + str(record_count)
            print(data)
            kinesis.put_record(
                StreamName="MySparkStreamSource",
                Data=data,
                PartitionKey="partitionkey")
            record_count += 1

if __name__ == "__main__":
    #print(f"Arguments count: {len(sys.argv)}")
    #for i, arg in enumerate(sys.argv):
    #    print(f"Argument {i:>6}: {arg}")
    format_flag = sys.argv[1]
    start_process(format_flag)

Copy the code to your local machine and save it as KinesisProducer-C&F.py

Shell
python3 KinesisProducer-C&F.py format

If you hit a permissions error, check your AWS CLI credentials configuration.


Run Glue ETL Job

Services -> AWS Glue -> Jobs


Select MySparkStreamETL
Click on Action -> Run job

Click on Run job
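The run can also be started with boto3 (a minimal sketch; note that a streaming job keeps running until you stop it):

Python
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Start the streaming ETL job; it runs continuously until stopped
run = glue.start_job_run(JobName='MySparkStreamETL')
print(run['JobRunId'])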


Select MySparkStreamETL to see details


Validation Test

We have sent the data to the Kinesis Data Stream MySparkStreamSource.

Let’s check the console.


Kinesis Data Streams

Services -> Kinesis -> Data Streams


Click on MySparkStreamSource
Click on Monitoring tab.
Scroll down to see the incoming data


S3

Services -> S3


Open your bucket to see the Parquet files.
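To confirm the output from code instead of the console, here is a minimal boto3 sketch (the bucket name and prefix are placeholders for the target path you chose in the job):

Python
import boto3

s3 = boto3.client('s3')

# List the Parquet objects the streaming job wrote (bucket/prefix are placeholders)
response = s3.list_objects_v2(Bucket='your-bucket-name', Prefix='spark-stream-output/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])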