AWS Glue Lab
Glue Data Catalog Lab
Populating Glue Data Catalog
Lab Details
Four steps to populate your Glue Data Catalog:
- Classify your data by running a crawler
  - Custom classifiers
  - Built-in classifiers
- Crawler connects to the data store
- Crawler infers the schema
- Crawler writes metadata to the Data Catalog
Data Source Prerequisite
If you did the Glue ETL lab, you should have a .json file named like run-*** in your bucket.
Select it, click on Actions, then click on Rename object
- New object name: Medicare_Hospital_Provider.json
Click on Save changes
Click on Create folder
- Folder name: JSON
Click on Create folder
Select your JSON file, click on Actions, then click on Move
Click on Browse S3
Select JSON/
Click on Choose destination
Click on Move
Extract Data - Glue Crawler Configuration
Services -> AWS Glue
Click on Crawlers on the navigation panel.
Click on Add crawler
Crawler info
- Crawler name: legislators
Click on Next
Crawler source type
Leave all settings as default
Click on Next
Data store
- Choose a data store from the following: S3
  - S3
  - JDBC
  - DynamoDB
  - Amazon DocumentDB
  - MongoDB
- Include path: your JSON/ path
- Exclude patterns: define the prefix of the data that you want the crawler to skip
Click on Next
Select No (no additional data store) and click on Next
IAM Role
The IAM role allows the crawler to run and access your Amazon S3 data stores.
- Select Create an IAM role
- IAM Role for Glue: the role assumed by the crawler, with permission to access your data store. Name it Legislate.
IAM will grant your new role (Legislate) the following permissions on the S3 bucket:
s3:GetObject
s3:PutObject
Click on Next
Schedule
- Frequency: Run on demand
Click on Next
Output
Click on Add database
- Database: legislators
Click on Next
Review all steps
Click on Finish
Select legislators, then click on Run crawler
Check the Database & Table
Click on Databases on the navigation panel.
Select legislators and click on View tables
Click on the table
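You can do the same check from code; a minimal sketch with boto3 (the table name is whatever the crawler derived from your JSON/ folder):

```python
import boto3

glue = boto3.client("glue")

# List the tables the crawler wrote into the "legislators" database
for table in glue.get_tables(DatabaseName="legislators")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```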
Dev endpoints Configuration
Click on Dev endpoints on the navigation panel.
Properties
A Glue dev endpoint allows us to use a SageMaker notebook to interact with the Glue Data Catalog.
Click on Add endpoint
- Development endpoint name: example_endpoint
Under IAM role, click on Create IAM role
Click on Create role
Select Glue
Click on Glue: Allow Glue to call AWS services on your behalf
Click on Next: Permissions
- Policies: AmazonS3FullAccess
Click on Next: Tags
Click on Next: Review
- Role name: glue_legislators
Click on Create role
Go back to the Dev endpoint page
Click on the refresh button
Select glue_legislators
Click on Next
Networking
Click on Next
SSH public key
Click on Next
Review
Click on Finish
Click on the refresh button
Wait until the provisioning status changes from PROVISIONING to READY
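Instead of refreshing by hand, you can poll the endpoint status; a sketch with boto3, using the endpoint name from this lab:

```python
import time

import boto3

glue = boto3.client("glue")

# Poll until the dev endpoint leaves the PROVISIONING state
while True:
    status = glue.get_dev_endpoint(EndpointName="example_endpoint")["DevEndpoint"]["Status"]
    print(status)
    if status != "PROVISIONING":
        break
    time.sleep(30)
```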
Create a SageMaker Notebook
Select your dev endpoint
Click on Action
Click on Create SageMaker notebook
- Notebook name: legislators
- IAM role: Legislate
Click on Create notebook
Create Sparkmagic (PySpark)
Select your notebook, click on Open notebook
Here we are in the Jupyter notebook instance.
We will use Python to query the JSON file schema.
Click on New
Select Sparkmagic (PySpark)
```python
import sys
```
Import the libraries we need to use.
Click on Run
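The setup cell uses the standard AWS Glue PySpark boilerplate; a minimal sketch:

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Build a GlueContext on top of the notebook's SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
```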
Since we don't have the same data source, the following screenshots are taken from the video.
Create a dynamic_frame
```python
# Load the tables the crawler created into DynamicFrames.
# The table names here ("persons_json", "memberships_json") follow the
# video's legislators dataset and may differ in your catalog.
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="persons_json")

memberships = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="memberships_json")
```
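A quick way to verify the frames loaded, in the same session:

```python
# Inspect what the crawler inferred
print("persons count:", persons.count())
persons.printSchema()
```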
Glue ETL Lab
Glue ETL from S3
Lab Details
We want to crawl the .csv file to build the table in the Data Catalog, so we need to configure security first.
Task Details
- Upload the .csv file to your S3 bucket
- Set up a Glue security configuration
- Create a Glue crawler to infer the .csv file schema and collect the metadata
- Configure a Glue job to transform the .csv file into a .json file
Data Source & Prerequisite
Services -> S3
Select one of your S3 buckets.
Upload the AWS Glue sample data.
Convert the S3 URI to an HTTPS URL to download the sample data:
s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv
https://awsglue-datasets.s3.amazonaws.com/examples/medicare/Medicare_Hospital_Provider.csv
Download sample file to your PC
To check whether you have wget, run the following command:
```bash
wget --version
```
Then download the file:
```bash
cd ~/Downloads
wget https://awsglue-datasets.s3.amazonaws.com/examples/medicare/Medicare_Hospital_Provider.csv
```
Check file
```bash
head Medicare_Hospital_Provider.csv
```
Upload sample file to S3
Then upload the .csv file to your S3 bucket.
Security Configurations
Services -> AWS Glue
Click on Security configurations on the navigation panel.
Click on Add security configuration
- Security configuration name: glueDemoSecurityConfig
- Encryption mode: SSE-KMS
Three things can be encrypted in AWS Glue:
- Data (at rest)
- CloudWatch logs
- Job bookmarks
Uncheck CloudWatch logs encryption
Expand Advanced properties
Uncheck Job bookmark encryption
Click on Finish
Refresh the page to see your security configuration
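For reference, the same security configuration can be created with boto3; a sketch (the KMS key ARN is a placeholder):

```python
import boto3

glue = boto3.client("glue")

# SSE-KMS for data at rest; CloudWatch log and job bookmark encryption
# disabled, matching the console choices above
glue.create_security_configuration(
    Name="glueDemoSecurityConfig",
    EncryptionConfiguration={
        "S3Encryption": [{
            "S3EncryptionMode": "SSE-KMS",
            "KmsKeyArn": "arn:aws:kms:us-east-1:111122223333:key/your-key-id",  # placeholder
        }],
        "CloudWatchEncryption": {"CloudWatchEncryptionMode": "DISABLED"},
        "JobBookmarksEncryption": {"JobBookmarksEncryptionMode": "DISABLED"},
    },
)
```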
Extract Data - Glue Crawler Configuration
Services -> AWS Glue
Click on Tables on the navigation panel.
Click on Add tables using a crawler
Crawler info
- Crawler name: glueDemo
Expand Tags, description, security configuration, and classifiers (optional)
- Security configuration: glueDemoSecurityConfig
Click on Next
Crawler source type
Leave all settings as default
Click on Next
Data store
- Choose a data store from the following: S3
  - S3
  - JDBC
  - DynamoDB
  - Amazon DocumentDB
  - MongoDB
- Include path: your bucket with the sample .csv file
- Exclude patterns: define the prefix of the data that you want the crawler to skip
Click on Next
Select No (no additional data store) and click on Next
IAM Role
The IAM role allows the crawler to run and access your Amazon S3 data stores.
- Select Create an IAM role
- IAM Role for Glue: the role assumed by the crawler, with permission to access your data store. Name it demoClawer.
IAM will grant your new role (demoClawer) the following permissions on the S3 bucket:
s3:GetObject
s3:PutObject
Click on Next
Schedule
- Frequency: Run on demand
Click on Next
Output
Click on Add database
- Database: demodb
- Prefix added to tables (optional): demoTarget
Click on Next
Review all steps
Click on Finish
Validation Test
Run Crawler
Working with Crawlers on the AWS Glue Console
Select your new crawler, then click on Run crawler
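The crawler can also be started and watched from code; a sketch with boto3:

```python
import time

import boto3

glue = boto3.client("glue")

glue.start_crawler(Name="glueDemo")

# Poll until the crawler returns to the READY state
while glue.get_crawler(Name="glueDemo")["Crawler"]["State"] != "READY":
    time.sleep(15)
print("Crawl finished")
```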
CloudWatch logs
Click on Logs
Databases
Click on Databases on the navigation panel.
Click on demodb
Click on Tables in demodb
Transform Data - Job Configuration
Click on Jobs on the navigation panel.
Click on Add job
Job properties
- Name: glueDemoJob
Click on Create IAM role
Ensure that this role has permissions for your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job.
For now, use a role with the AWSGlueConsoleFullAccess and AmazonS3FullAccess policies.
- IAM role: unlimitedGlue
- Type: Spark
- Glue version: Spark 2.4, Python 3 (Glue Version 1.0)
- This job runs: A proposed script generated by AWS Glue
Expand Advanced properties
- Job bookmark: specifies how AWS Glue processes job bookmark state when the job runs. It can remember previously processed data (Enable), process incremental data without updating the state (Pause), or ignore state information (Disable).
- Select Disable
Expand Monitoring options
Expand Security configuration, script libraries, and job parameters (optional)
- Security configuration: glueDemoSecurityConfig
- Worker type: Standard
  - Standard
  - G.1X (recommended for memory-intensive jobs)
  - G.2X (recommended for jobs with ML transforms)
- Number of workers: the maximum number of workers you can define is 299 for G.1X and 149 for G.2X
- Max concurrency: sets the maximum number of concurrent runs allowed for this job. An error is returned when this threshold is reached. The default is 1.
- Job timeout (minutes): sets the execution time limit in minutes
- Delay notification threshold (minutes): sets a delay threshold in minutes. If the job runs longer than the specified time, Glue sends a delay notification via CloudWatch.
- Catalog options (optional): to use the Glue Data Catalog as the Hive metastore, the IAM role used for the job must have glue:CreateDatabase permission. A database called "default" is created in the Glue catalog if it does not exist.
These settings map onto the Glue CreateJob API, as sketched below.
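A hedged boto3 sketch of that CreateJob call (the script location is a placeholder; the other values mirror the choices above):

```python
import boto3

glue = boto3.client("glue")

# Spark job on Glue 1.0 with Standard workers, bookmarks disabled,
# one concurrent run, and the security configuration from earlier
glue.create_job(
    Name="glueDemoJob",
    Role="unlimitedGlue",
    GlueVersion="1.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-script-bucket/glueDemoJob.py",  # placeholder
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-disable"},
    WorkerType="Standard",
    NumberOfWorkers=2,
    ExecutionProperty={"MaxConcurrentRuns": 1},
    Timeout=2880,  # minutes (the Glue default)
    SecurityConfiguration="glueDemoSecurityConfig",
)
```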
Click on Next
Data source
Select the data source you created before.
Click on Next
Transform type
- Change schema: change the schema of your source data and create a new target dataset
- Find matching records: use machine learning to find matching records within your source data
  - Remove duplicate records: when records match, the record with the lowest primary key value survives
  - Worker type: set the type of predefined worker that is allocated when a job runs
  - Maximum capacity: set the maximum number of AWS Glue data processing units (DPUs) that can be allocated when a job runs
Select Change schema, then click on Next
Data target
Select Create tables in your data target
- Data store: Amazon S3
- Format: JSON
- Compression type: None
- Target path: choose your sample .csv file bucket
Click on Next
Schema
Review the original .csv file
Notice that some data types will change.
You can Add column, Delete column, or Reorder columns.
Click on Save job and edit script
Script editor
Click on Database at the top left of the page.
Then click on the Schema tab.
In the top navigation pane, you can change the Source, Target, and Target Location.
See more of Transforms
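The proposed script follows a standard pattern; a trimmed sketch of what Glue generates for this CSV-to-JSON job (the table name and the single mapping entry are illustrative):

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the table the crawler created in demodb
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="demodb",
    table_name="demoTargetmedicare_hospital_provider_csv")  # illustrative

# Transform: rename/cast columns (one illustrative mapping shown)
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[("provider id", "long", "provider id", "long")])

# Target: write the records back to S3 as JSON
glueContext.write_dynamic_frame.from_options(
    frame=applymapping1,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output"},  # placeholder
    format="json")

job.commit()
```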
Click on Run job
You have a second chance to modify the settings we made earlier.
Click on Run job
The job is running.
Refresh the page to see Logs
Once the refresh icon (at the top left of the page) disappears, the job is finished.
Go to the S3 bucket to see the transformed file.
Select the transformed file, then click on Actions and Download
Compare it to the original .csv file.
Glue ETL from Streaming
Kinesis Configuration
Services -> Kinesis -> Data Streams
Click on Create data stream
Data stream configuration
- Data stream name: MySparkStreamSource
Data stream capacity
- Number of open shards: 1 (we only need a very light streaming workload to feed AWS Glue)
Click on Create data stream
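Equivalently with boto3:

```python
import boto3

kinesis = boto3.client("kinesis")

# One shard is plenty for this lab's light workload
kinesis.create_stream(StreamName="MySparkStreamSource", ShardCount=1)
```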
AWS Glue Configuration
Services -> AWS Glue
Databases
To be able to read the stream, we need an AWS Glue database and table.
Click on Databases
Click on Add database
- Database name: sparkstreamdb
Click on Create
Click on Tables on the navigation panel.
Click on sparkstreamdb
Click on Tables in sparkstreamdb
Click on Add tables -> Add table manually
Table properties
- Table name: MySparkStreamSource
- Database: sparkstreamdb
Data Store
- Select the type of source: Kinesis
- Stream name: MySparkStreamSource
- Kinesis source URL: https://kinesis.us-east-1.amazonaws.com (enter your Kinesis endpoint URL in this form)
Data format
- Classification: JSON
Click on Next
Glue Schema
Click on Add column
Add the columns to match your data source (the JSON records):
- EVENT_TIME, string
- ITEM, string
- PRICE, float
Click on Next
Review
Click on Finish
Click on the table you just created.
ETL - Jobs
Click on Jobs on the navigation panel.
Click on Add job
Job properties
- Name: MySparkStreamETL
Click on Create IAM role
Click on Create role
Select Glue as the AWS service
Permissions (policies):
- AmazonS3FullAccess
- AWSGlueServiceRole
- AmazonKinesisFullAccess
- Role name: SparkStreamETL
Back in the job properties:
- IAM role: SparkStreamETL
- Type: Spark Streaming
- Glue version: Spark 2.4, Python 3 (Glue Version 1.0)
- This job runs: A proposed script generated by AWS Glue
- Script file name: MySparkStreamETL
Click on Next
Data source
The data source is the database -> table we just created.
Click on Next
Data target
We want to put our data in S3, so select Create tables in your data target
- Data store: Amazon S3
- Format: Parquet
- Target path: your S3 bucket path; you can create a dedicated folder for this lab
Click on Next
Schema
Select Specify output schema for all records
You can drop any column you don't want. Here we try to drop price.
Save job and edit script
You can edit the script
Click on Save
Click on X to exit
Now we have the ETL job.
Python Configuration
```python
import sys
```
Copy the code to your local machine and save it as KinesisProducer-C&F.py
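A minimal producer sketch that matches the table schema above (EVENT_TIME, ITEM, PRICE), using boto3; the item list and send rate are illustrative:

```python
import json
import random
import time
from datetime import datetime, timezone

import boto3

STREAM_NAME = "MySparkStreamSource"


def get_record():
    # One record matching the Glue table schema: EVENT_TIME, ITEM, PRICE
    return {
        "EVENT_TIME": datetime.now(timezone.utc).isoformat(),
        "ITEM": random.choice(["AAPL", "AMZN", "MSFT", "INTC"]),
        "PRICE": round(random.uniform(1, 100), 2),
    }


def main():
    kinesis = boto3.client("kinesis")
    while True:
        record = get_record()
        print(record)
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(record),
            PartitionKey=record["ITEM"],
        )
        time.sleep(1)


if __name__ == "__main__":
    main()
```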
```bash
python3 KinesisProducer-C&F.py format
```
If you get a permissions error, check your AWS CLI credentials configuration.
Run Glue ETL Job
Services -> AWS Glue -> Jobs
Select MySparkStreamETL
Click on Action -> Run job
Click on Run job
Select MySparkStreamETL to see details
Validation Test
We have sent the data to the AWS Kinesis data stream MySparkStreamSource.
Let’s check the console.
Kinesis Data Streams
Services -> Kinesis -> Data Streams
Click on MySparkStreamSource
Click on the Monitoring tab.
Scroll down to see the incoming data.
S3
Services -> S3
Open your bucket to see the Parquet files.
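A quick sanity check from your machine, assuming pandas with s3fs and pyarrow installed (the output path is a placeholder):

```python
import pandas as pd

# Read the Parquet objects the streaming job wrote to your target path
df = pd.read_parquet("s3://your-bucket/sparkstream-output/")
print(df.head())
```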