AWS Glue

Posted on 2021-01-22. Edited on 2022-02-15. In Tech, AWS, AWS Certificate, AWS DAS.

Introduction

AWS Glue
Code Example: Joining and Relationalizing Data
AWS Glue samples repository

Simple, scalable, and serverless data integration

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.

Data integration is the process of preparing and combining data for analytics, machine learning, and application development. It involves multiple tasks, such as discovering and extracting data from various sources; enriching, cleaning, normalizing, and combining data; and loading and organizing data in databases, data warehouses, and data lakes. These tasks are often handled by different types of users that each use different products.

AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio. Data analysts and data scientists can use AWS Glue DataBrew to visually enrich, clean, and normalize data without writing code. With AWS Glue Elastic Views, application developers can use familiar Structured Query Language (SQL) to combine and replicate data across different data stores.

Key point: batch oriented
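- Processes data in micro-batches; streaming ETL jobs also run as micro-batches rather than record by record
Limited NoSQL support as a data source: Amazon DynamoDB, MongoDB, and Amazon DocumentDB are supported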
Crawl data source to populate data catalog
Generate a script to transform data or write your own in console or API
Run jobs on demand or via a trigger event
Glue catalog tables contain metadata, not the data itself from the data source
Uses a scale-out Apache Spark environment when loading data to destination
- Allocate data processing units (DPUs) to jobs
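
The crawl-then-run flow above can be sketched as requests for the boto3 Glue client. This is a minimal sketch under assumptions: the crawler name, role ARN, database, S3 path, and job name are hypothetical placeholders, and the helper functions are illustrative. They only build the keyword arguments, so nothing is sent to AWS.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the table metadata
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def job_run_request(job_name, workers):
    """Build keyword arguments for StartJobRun with an explicit capacity."""
    return {
        "JobName": job_name,
        "WorkerType": "G.1X",  # each G.1X worker maps to 1 DPU
        "NumberOfWorkers": workers,
    }


# All names below are hypothetical.
crawler = crawler_request(
    "sales-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "sales_db",
    "s3://example-bucket/raw/sales/",
)
run = job_run_request("sales-etl", workers=10)

# With credentials configured, the requests could be sent like this:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler)
#   glue.start_crawler(Name=crawler["Name"])
#   glue.start_job_run(**run)
```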

Glue Data Catalog

AWS Glue catalog Lab

Glue ETL

AWS Glue ETL I Lab
AWS Glue ETL II Lab

Glue ETL Jobs - Structure

A Glue job defines the business logic that performs the extract, transform, and load (ETL) work in AWS Glue
- Glue runs your script to extract data from your sources, transform the data, and load it into your targets
- Glue triggers can start jobs based on a schedule or event, or on demand
- Monitor your job runs to get runtime metrics: completion status, duration, etc.
- Based on your source schema and target location or schema, the Glue code generator automatically creates an Apache Spark API (PySpark) script
  - Edit the script to customize it to your requirements
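
The extract, transform, and load steps of a job script can be sketched as one function. This is an illustrative outline, not the script the Glue code generator produces: `run_etl` and its parameters are hypothetical names. The function expects a `GlueContext` to be passed in (inside a real job it would typically come from `GlueContext(SparkContext.getOrCreate())`), so the sketch itself has no dependency on the Glue runtime.

```python
def run_etl(glue_context, database, table, mappings, output_path):
    """Extract a catalog table, remap its columns, and load it to S3 as Parquet."""
    # Extract: read the source through its Data Catalog table definition.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table
    )

    # Transform: each mapping is (source_col, source_type, target_col, target_type).
    mapped = source.apply_mapping(mappings)

    # Load: write the result to the target location.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )
    return mapped
```

Passing the context in, rather than creating it inside the function, keeps the transform logic easy to exercise with a stand-in object before the script is deployed.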

Glue ETL Jobs - Types

Glue output file formats: JSON, CSV, ORC (Optimized Row Columnar), Apache Parquet, and Apache Avro
Three types of Glue jobs
- Spark ETL job: executed in a managed Apache Spark environment, processes data in batches
- Streaming ETL job: like a Spark ETL job, but works with data streams; uses the Apache Spark Structured Streaming framework
- Python shell job: schedule and run tasks that don’t require an Apache Spark environment
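
When a job is created through the API, the job type is selected by the command name in the `Command` structure of `CreateJob`. The sketch below maps the three job types to those command names; the dictionary keys, the helper function, and the script location are hypothetical.

```python
# Command names used by the Glue CreateJob API for each job type.
JOB_COMMANDS = {
    "spark": "glueetl",
    "streaming": "gluestreaming",
    "python_shell": "pythonshell",
}


def job_command(job_type, script_location):
    """Build the Command structure passed to CreateJob."""
    return {
        "Name": JOB_COMMANDS[job_type],
        "ScriptLocation": script_location,
        "PythonVersion": "3",
    }


# Hypothetical script location for a streaming job.
command = job_command("streaming", "s3://example-bucket/scripts/iot_stream.py")
```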

A good IoT example

Glue ETL Jobs - Transforms

Glue has built-in transforms for processing data
- Call from within your ETL script
- In a DynamicFrame (an extension of an Apache Spark SQL DataFrame), your data passes from transform to transform
- Built-in transform types (subset)
  - ApplyMapping: maps source DynamicFrame columns and data types to target DynamicFrame columns and data types
  - Filter: selects records from a DynamicFrame and returns a filtered DynamicFrame
  - Map: applies a function to the records of a DynamicFrame and returns a transformed DynamicFrame
  - Relationalize: converts a DynamicFrame to a relational (rows and columns) form
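
To make the effect of these transforms concrete, here is a plain-Python analogy that works on a list of dictionaries instead of a DynamicFrame. It is not the Glue API: the helper functions and sample records are made up for illustration, and `flatten` covers only the nested-struct part of Relationalize (the real transform also splits arrays into separate tables).

```python
CASTS = {"string": str, "int": int, "long": int, "double": float}


def apply_mapping(records, mappings):
    """Rename and cast fields; fields without a mapping are dropped."""
    return [
        {target: CASTS[target_type](rec[source])
         for source, _source_type, target, target_type in mappings
         if source in rec}
        for rec in records
    ]


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


records = [
    {"id": "1", "price": {"amount": "9.5", "currency": "USD"}},
    {"id": "2", "price": {"amount": "120", "currency": "JPY"}},
]

flat = [flatten(r) for r in records]                     # like Relationalize (structs only)
usd = [r for r in flat if r["price.currency"] == "USD"]  # like Filter
tagged = [{**r, "source": "demo"} for r in usd]          # like Map
result = apply_mapping(tagged, [                         # like ApplyMapping
    ("id", "string", "order_id", "int"),
    ("price.amount", "string", "amount", "double"),
])
print(result)  # [{'order_id': 1, 'amount': 9.5}]
```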

An example of collapsing out the currency category

Glue ETL Jobs - Triggers

A trigger can start specified jobs and crawlers
- On demand, based on a schedule, or based on a combination of events
- Add a trigger via the Glue console, the AWS Command Line Interface (AWS CLI), or the Glue API
- Activate or deactivate a trigger via the Glue console, the AWS CLI, or the Glue API
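
Through the Glue API, a trigger is described by its type plus either a schedule or a predicate. The sketch below builds `CreateTrigger` arguments for a scheduled trigger and for a conditional trigger that fires after other jobs succeed. Trigger names, job names, and the cron expression are hypothetical.

```python
def scheduled_trigger(name, job_name, cron):
    """Start a job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,  # create the trigger in the activated state
    }


def conditional_trigger(name, watched_jobs, job_to_start):
    """Start a job once every watched job has succeeded."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Logical": "AND",
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in watched_jobs
            ],
        },
        "Actions": [{"JobName": job_to_start}],
    }


nightly = scheduled_trigger("nightly-load", "sales-etl", "0 2 * * ? *")
chained = conditional_trigger("after-loads", ["sales-etl", "returns-etl"], "build-report")

# With a boto3 Glue client:
#   glue.create_trigger(**nightly)
#   glue.stop_trigger(Name="nightly-load")   # deactivate
#   glue.start_trigger(Name="nightly-load")  # activate again
```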

Glue ETL Jobs - Monitoring

Glue produces metrics for crawlers and jobs for monitoring
- Statistics about the health of your environment
- Statistics are written to the Glue Data Catalog
Use automated monitoring tools to watch Glue and report problems
- CloudWatch Events
- CloudWatch logs
- CloudTrail logs
Profile your Glue jobs using metrics and visualize them on the Glue and CloudWatch consoles to identify and fix issues
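
Besides the consoles, run status and duration can be read programmatically. The sketch below summarizes a list shaped like the `JobRuns` field returned by the Glue `GetJobRuns` call. The function name and the sample data are made up for illustration.

```python
from collections import Counter


def summarize_runs(job_runs):
    """Count runs per state and average the duration of successful runs."""
    states = Counter(run["JobRunState"] for run in job_runs)
    finished = [run["ExecutionTime"] for run in job_runs
                if run["JobRunState"] == "SUCCEEDED"]
    avg_seconds = sum(finished) / len(finished) if finished else None
    return {"states": dict(states), "avg_success_seconds": avg_seconds}


# Hypothetical sample shaped like glue.get_job_runs(JobName=...)["JobRuns"].
sample = [
    {"Id": "jr_1", "JobRunState": "SUCCEEDED", "ExecutionTime": 120},
    {"Id": "jr_2", "JobRunState": "FAILED", "ExecutionTime": 15},
    {"Id": "jr_3", "JobRunState": "SUCCEEDED", "ExecutionTime": 180},
]
print(summarize_runs(sample))
# {'states': {'SUCCEEDED': 2, 'FAILED': 1}, 'avg_success_seconds': 150.0}
```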

Glue Automation

Introduction

Use workflows to create and visualize complex ETL tasks involving multiple crawlers, jobs, and triggers
Manages the execution and monitoring of all components
Glue console provides a visual representation of a workflow as a graph
Chain interdependent jobs and crawlers
- Event triggers can be fired by both jobs and crawlers, and can start both jobs and crawlers
Views
- Static: shows the design of the workflow
- Dynamic: runtime view, shows the latest run information for each of the jobs and crawlers

Operationalize data processing with Glue and EMR Workflows

Orchestration of Glue and EMR Workflows

Several ways to operationalize Glue and EMR
- Glue workflows
- Automate workflow using Lambda
- Step Functions with Glue
- Step Functions with EMR and Apache Livy
- Step Functions directly with EMR

Glue Workflows

A workflow is a grouping of a set of jobs, crawlers, and triggers in Glue
- Can design a complex multi-job (ETL) sequence that Glue can execute and track as a single entity
- Create workflows using the AWS Console or the Glue API
- Console lets you see the components and flow of a workflow as a graph
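
Through the Glue API, a workflow is a named container, and triggers that carry its `WorkflowName` link the crawlers and jobs inside it. The sketch below lists the calls for a two-step workflow (crawler, then job). The workflow, crawler, and job names are hypothetical, and the function only returns the calls rather than sending them.

```python
def workflow_calls(workflow, crawler, job):
    """Return (api_method, kwargs) pairs that define a crawler -> job workflow."""
    return [
        ("create_workflow", {"Name": workflow}),
        # Entry point: start the crawler when the workflow is run.
        ("create_trigger", {
            "Name": f"{workflow}-start",
            "WorkflowName": workflow,
            "Type": "ON_DEMAND",
            "Actions": [{"CrawlerName": crawler}],
        }),
        # Chain: start the job once the crawler has succeeded.
        ("create_trigger", {
            "Name": f"{workflow}-after-crawl",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "StartOnCreation": True,
            "Predicate": {"Conditions": [{
                "LogicalOperator": "EQUALS",
                "CrawlerName": crawler,
                "CrawlState": "SUCCEEDED",
            }]},
            "Actions": [{"JobName": job}],
        }),
    ]


calls = workflow_calls("sales-pipeline", "sales-crawler", "sales-etl")

# With a boto3 Glue client:
#   for method, kwargs in calls:
#       getattr(glue, method)(**kwargs)
#   glue.start_workflow_run(Name="sales-pipeline")
```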

Automate Workflow Using Lambda

Use Lambda functions and CloudWatch Events to orchestrate your workflow
- Start your workflow with a Lambda trigger
- Use CloudWatch Events to trigger other steps in your workflow
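
One common shape for the first step is a Lambda function, invoked by an S3 upload, that starts a Glue job. The handler below is a minimal sketch under assumptions: the job name and the `--input_path` job argument are hypothetical, and the optional `glue` parameter exists only so the handler can be exercised without AWS access.

```python
from urllib.parse import unquote_plus


def handler(event, context, glue=None):
    """Start a Glue job for the S3 object that triggered this Lambda."""
    if glue is None:
        import boto3  # bundled with the Lambda Python runtime
        glue = boto3.client("glue")

    s3 = event["Records"][0]["s3"]
    key = unquote_plus(s3["object"]["key"])  # object keys arrive URL-encoded
    input_path = f"s3://{s3['bucket']['name']}/{key}"

    response = glue.start_job_run(
        JobName="sales-etl",  # hypothetical job name
        Arguments={"--input_path": input_path},
    )
    return {"job_run_id": response["JobRunId"], "input_path": input_path}
```

A CloudWatch Events rule that matches Glue job state changes could then invoke a second function for the next step.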

Step Functions with Glue

AWS Step Functions

Use Step Functions to automate your Glue workflow
- Serverless orchestration of your Glue steps
- Easily integrate with EMR steps
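
A state machine that runs a Glue job and then an EMR step can be sketched in Amazon States Language, built here as a Python dictionary. The function name, state names, job name, cluster ID, and script path are hypothetical. The `.sync` resource suffix makes Step Functions wait for each task to finish before moving on.

```python
import json


def glue_then_emr_definition(glue_job, cluster_id, spark_script):
    """Build a state machine: Glue job first, then a Spark step on an EMR cluster."""
    return {
        "Comment": "Run a Glue job, then a Spark step on an existing EMR cluster",
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job},
                "Next": "RunEmrStep",
            },
            "RunEmrStep": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": cluster_id,
                    "Step": {
                        "Name": "spark-aggregate",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", spark_script],
                        },
                    },
                },
                "End": True,
            },
        },
    }


# Hypothetical identifiers; the JSON string is what a state machine definition expects.
definition = json.dumps(
    glue_then_emr_definition("sales-etl", "j-EXAMPLE12345", "s3://example-bucket/scripts/agg.py"),
    indent=2,
)
```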

Labs

AWS Glue Lab