AWS Certified Data Analytics

Introduction


5 domains

  • Collection

  • Storage and Data Management

  • Processing

  • Analysis and Visualization

  • Security

  • Domain 1: Collection
    1.1 Build Kinesis Data Streams data collection system
    1.2 Build Kinesis Data Firehose data collection system
    1.3 Build Glue ETL and cataloging collection system

  • Domain 2: Storage and Data Management
    2.1 Build DynamoDB storage and data management system
    2.2 Build Redshift storage and data management system
    2.3 Populating the Glue data catalog

  • Domain 3: Processing
    3.1 Build EMR data processing systems
    3.2 Build streaming ETL processing systems
    3.3 Glue and EMR

  • Domain 4: Analysis and Visualization
    4.1 Build Athena analysis and visualization
    4.2 Kinesis Data Analytics
    4.3 Elasticsearch and Kibana

  • Domain 5: Security
    5.1 Build secure data analytics systems
    5.2 Encrypt your analytics data
    5.3 Data governance and compliance

Collection

Storage and Data Management

Processing

  • EMR: Elastic MapReduce
  • ETL: Extract, Transform, Load

Analysis and Visualization

Security


Exam


Domain 1: Collection

Introduction

Data Analytics Lifecycle

CRISP-DM


Stages of Data Collection

  • Stage 1: Data Classification
    • Determine the operational characteristics of the collection system
    • Batch, streaming, and transactional data
    • Compare data collection systems
  • Stage 2: Data Collection
    • Select a collection system that handles the frequency, volume, and source of data
    • Streaming operational components
    • Fault tolerance and data persistence
  • Stage 3: Data Preprocessing
    • Select a collection system that addresses the key properties of data, such as order, format, and compression
    • Order and duplication
    • Transformation and filtering

Stage 1: Data Classification

Select a collection system that handles the frequency, volume, and source of data
Batch, streaming, and transactional data

  • Batch Data

    • Kinesis Data Firehose
    • Batch Data: S3, AWS Glue
    • Data Lake
    • Athena
  • Streaming Data

    • Kinesis Data Firehose
    • Streaming Data: Kinesis Data Analytics, Kinesis Data Firehose
    • Redshift
    • QuickSight
  • Transactional Data

    • Data Source
    • AWS DMS (Database Migration Service)
    • S3
    • AWS Glue
    • Data Lake

  • Compare Data Collection Systems
    • AWS DMS (Database Migration Service)
    • Kinesis Data Streams
    • Kinesis Data Firehose
    • AWS Glue

Stage 2: Data Collection

Determine the operational characteristics of the collection system

  • Streaming operational components

    • Kinesis Data Streams
    • EC2, Kinesis Data Analytics, Lambda
    • QuickSight
  • Fault tolerance and data persistence

    • Kinesis Producer Library (KPL)
    • Kinesis Data Streams
    • Kinesis Client Library (KCL)

Stage 3: Data Preprocessing

Select a collection system that addresses the key properties of data, such as order, format, and compression

  • Order and duplication
    • Kinesis Producer Library
    • Data Ingestion
    • Kinesis Client Library

Data can arrive out of order or be duplicated on its way to the consumer, so the collection system must account for ordering and de-duplication.


  • Transformation and filtering
    • Kinesis Data Firehose
    • Lambda
    • Database Migration Service

Summary

  • Data collection systems give you the capability to ingest any kind of data, structured, unstructured, or semi-structured
  • Can ingest using the appropriate frequency based on your situation
    • Batch
    • Streaming
    • Transactional
  • Transform and/or filter your data as you collect it

Stage 1: Data Classification

The Three Types of Data to Ingest


Batch, Streaming, and Transactional Data

  • Batch data:
    • Application logs, video files, audio files, etc.
    • Larger event payloads ingested on an hourly, daily, or weekly basis.
    • Ingested in intervals from aggregated data.
  • Streaming data:
    • Click-stream data, IoT sensor data, stock ticker data, etc.
    • Ingesting large amounts of small records continuously and in real-time.
    • Continuously ingested from live events.
  • Transactional data:
    • Initially load and receive continuous updates from stores used as operational business database.
    • Similar to batch data but a continuous update flow.
    • Ingested from databases storing transactional data.

Batch Data

  • Data is usually ‘colder’ and can be processed on less frequent intervals.
  • Latency: minutes to hours
  • Not real-time

Streaming Data

Note: for streaming data, the first ingestion service to consider is Kinesis Data Streams.

  • Often bounded by time or event sets in order to produce real-time outcomes
  • Data is usually ‘hot’ arriving at a high frequency that you need to analyze in real-time
  • Latency: milliseconds
  • Real-time

Transactional Data

  • Data stored at low latency and quickly accessible
  • SQL based
  • AWS SCT (Schema Conversion Tool) -> DMS
  • Real-time

Stage 2: Data Collection

Determine the operational characteristics of the collection system

  • The characteristics of your data streaming workload guide you in the selection of your streaming components
  • The two key components to remember for the exam
    • Fault tolerance
    • Data persistence
  • Kinesis Data Streams vs. Kinesis Data Firehose
    • Data persistence
  • Kinesis Producer Library vs. Kinesis API vs. Kinesis Agent
    • Fault tolerance and appropriate tool for your data collection problem

The Four Ingestion Services

Frequency, Volume, and Source of data

The Four Ingestion Services

  • Kinesis Data Streams
  • Kinesis Data Firehose
  • AWS DMS (Database Migration Service)
  • AWS Glue

Understand how each ingestion approach is best used

  • Throughput, bandwidth, scalability
  • Availability and fault tolerance
  • Cost of running the services

The key differentiator among the four ingestion services is the frequency, volume, and source of the data: these are characteristics of the data itself that you cannot change.

  • Kinesis Data Streams

    • Use when you need custom producers and consumers
    • Use cases that require sub-second processing (low latency)
    • Use cases that require unlimited bandwidth (throughput scales with the number of shards, and you can add shards without limit)
  • Kinesis Firehose

    • Use cases where you want to deliver directly or indirectly to S3, Redshift, Elasticsearch, HTTP Endpoint, and Third-party service provider (Datadog, MongoDB Cloud, New Relic, and Splunk)
    • Use cases where you can tolerate latency of 60 seconds or greater
    • Use cases where you wish to transform your data or convert the data format
  • Database Migration Service

    • Use cases when you need to migrate data from one database to another
    • Use cases where you want to migrate a database to a different database engine
    • Use cases needing continuous replication of database records
  • Glue

    • Batch-oriented use cases where you want to perform an Extract Transform Load (ETL) process
    • Not for use with streaming use cases

Kinesis Data Streams

Each shard supports:

  • 1,000 records per second (RPS) for writes, up to 1 MB/s
  • 5 transactions per second (TPS) for reads at a max of 2 MB/s using the GetRecords API
  • The number of shards in a stream is unlimited
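
For reference, a minimal boto3 sketch of a producer batching records with the PutRecords API; the stream name and the device_id partition-key field are placeholder assumptions, not values from the course:

```python
# Minimal sketch: batch-writing JSON events to a Kinesis data stream with boto3.
# The stream name and the "device_id" partition-key field are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_events(events, stream_name="my-data-stream"):
    """Send up to 500 records per PutRecords call (the API limit)."""
    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Records with the same partition key land on the same shard,
            # which preserves ordering for that key.
            "PartitionKey": str(event["device_id"]),
        }
        for event in events
    ]
    response = kinesis.put_records(StreamName=stream_name, Records=records)
    # FailedRecordCount > 0 means some records were throttled and should be retried.
    return response["FailedRecordCount"]
```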


Kinesis Data Firehose

No shards: Firehose automatically scales to match data throughput

Firehose Destinations:

  • Amazon S3
    • Object storage built to store and retrieve any amount of data from anywhere.
  • Amazon Redshift
    • An enterprise-level, petabyte scale, fully managed data warehousing service.
  • Amazon Elasticsearch
    • An open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and click stream analytics.
  • HTTP Endpoint
    • A way to deliver data to your custom destination.
  • Third-party service provider
    • Choose from a list of third-party service providers.


AWS DMS


AWS Glue
  • Key point: batch oriented
    • Micro-batches, but no streaming data (not real-time)
  • Does not support NoSQL databases as a data source
  • Based on Apache Spark


Kinesis Data Streams

AWS Kinesis Data Streams


Kinesis Data Firehose

AWS Kinesis Data Firehose


AWS Glue

AWS Glue Introduction
Glue ETL from S3 Lab

  • Key point: batch oriented
    • Micro-batches but no streaming data
  • Does not support NoSQL databases as data source
  • Crawl data source to populate data catalog
  • Generate a script to transform data or write your own in console or API
  • Run jobs on demand or via a trigger event
  • Glue catalog tables contain metadata not data from the data source
  • Uses a scale-out Apache Spark environment when loading data to destination
    • Allocate data processing units (DPUs) to jobs

Difference among the Four Ingestion Services


Throughput, Bandwidth, Scalability

Kinesis Data Streams

  • Shards can handle up to 1,000 PUT records per second
  • Can increase the number of shards in a stream without limit
  • Each shard has a capacity of 1 MB/s for input and 2 MB/s for output

Kinesis Firehose

  • Automatically scales to accommodate the throughput of your stream

Database Migration Service

  • EC2 instances used for the replication instance
  • You need to scale your replication instance to accommodate your throughput

Glue

  • Runs in scale-out Apache Spark environment to move data to target system
  • Scales via Data Processing Units (DPUs) for your ETL jobs

Availability and Fault Tolerance

Kinesis Data Streams

  • Synchronously replicates your shard data across 3 AZs

Kinesis Firehose

  • Synchronously replicates your data across 3 AZs
  • For the S3 target, Firehose retries for 24 hours; if the failure persists past 24 hours, your data is lost
  • For Redshift you can specify a retry duration from 0 to 7,200 seconds
  • For Elasticsearch you can specify a retry duration from 0 to 7,200 seconds
  • For Splunk you can use a retry duration counter. Firehose retries until counter expires, then backs up your data to S3
  • Retries may cause duplicate records

DMS

  • Can use multi-AZ replication, which gives you fault tolerance via redundant replication servers

Glue

  • Retries 3 times before marking an error condition
  • Create a CloudWatch alert for failures that triggers an SNS message

Kinesis Firehose Data Delivery

Amazon Kinesis Data Firehose Data Delivery

Data Delivery Frequency

Each Kinesis Data Firehose destination has its own data delivery frequency.

Amazon S3

  • The frequency of data delivery to Amazon S3 is determined by the Amazon S3 Buffer size and Buffer interval value that you configured for your delivery stream. Kinesis Data Firehose buffers incoming data before it delivers it to Amazon S3. You can configure the values for Amazon S3 Buffer size (1–128 MB) or Buffer interval (60–900 seconds). The condition satisfied first triggers data delivery to Amazon S3. When data delivery to the destination falls behind data writing to the delivery stream, Kinesis Data Firehose raises the buffer size dynamically. It can then catch up and ensure that all data is delivered to the destination.
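
The buffer settings above are configured when the delivery stream is created. A minimal boto3 sketch, assuming placeholder bucket and role ARNs:

```python
# Minimal sketch: creating a Firehose delivery stream that buffers data
# before writing GZIP-compressed objects to S3. ARNs are placeholders.
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",
        # Whichever condition is met first triggers delivery to S3.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
        "Prefix": "raw/clickstream/",
    },
)
```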

Amazon Redshift

  • The frequency of data COPY operations from Amazon S3 to Amazon Redshift is determined by how fast your Amazon Redshift cluster can finish the COPY command. If there is still data to copy, Kinesis Data Firehose issues a new COPY command as soon as the previous COPY command is successfully finished by Amazon Redshift.

Amazon Elasticsearch Service

  • The frequency of data delivery to Amazon ES is determined by the Elasticsearch Buffer size and Buffer interval values that you configured for your delivery stream. Kinesis Data Firehose buffers incoming data before delivering it to Amazon ES. You can configure the values for Elasticsearch Buffer size (1–100 MB) or Buffer interval (60–900 seconds), and the condition satisfied first triggers data delivery to Amazon ES.

Splunk

  • Kinesis Data Firehose buffers incoming data before delivering it to Splunk. The buffer size is 5 MB, and the buffer interval is 60 seconds. The condition satisfied first triggers data delivery to Splunk. The buffer size and interval aren’t configurable. These numbers are optimal.

HTTP endpoint destination

  • Kinesis Data Firehose buffers incoming data before delivering it to the specified HTTP endpoint destination. For the HTTP endpoint destinations, including Datadog, MongoDB, and New Relic you can choose a buffer size of 1-64 MiBs and a buffer interval (60-900 seconds). The recommended buffer size for the destination varies from service provider to service provider. For example, the recommended buffer size for Datadog is 4 MiBs and the recommended buffer size for New Relic is 1 MiB. Contact the third-party service provider whose endpoint you’ve chosen as your data destination for more information about their recommended buffer size.

Duplicated Records

Kinesis Data Firehose uses at-least-once semantics for data delivery. In some circumstances, such as when data delivery times out, delivery retries by Kinesis Data Firehose might introduce duplicates if the original data-delivery request eventually goes through. This applies to all destination types that Kinesis Data Firehose supports.


Cost of Running the Services

Kinesis Data Streams

  • Pay per shard hour and PUT payload unit
  • Extended data retention, long-term data retention and enhanced fan-out incur additional costs
  • Retrieval of long-term retention data

Kinesis Firehose

  • Pay for the volume of data ingested
  • Pay for data conversions

Database Migration Service

  • Pay for the EC2 compute resources you use when migrating
  • Pay for log storage
  • Data transfer fees

Glue

  • Pay an hourly rate, billed per second, for both crawlers and ETL jobs
  • Monthly fee for storing and accessing data in your Glue data catalog

Stage 3: Data Preprocessing


Data order, Format, Compression


Managing Data Order, Format, and Compression

  • Problems with your streaming data
    • Data that is out of order
    • Data that is duplicated
    • Data where we need to change the format
    • Data that needs to be compressed
  • Methods to address these problems
    • Choose an ingestion service that has guaranteed ordering
    • Choose an ingestion service that addresses your data duplication requirements
    • Use conversion feature of ingestion service
    • Use compression feature of ingestion service

Problems with Streaming Data
  • Data out of order
    • Ingestion service needs to support guaranteed ordering
  • Data duplicated
    • Ingestion service needs to support de-duped delivery
      • At-least-once
      • At-most-once
      • Exactly-once


Data out of order - Guaranteed Ordering

  • Guaranteed ordering
    • Kinesis Data Streams
    • DynamoDB Streams

Duplicated Data - De-duped Delivery

  • De-duped delivery
    • DynamoDB Streams - exactly-once
    • Kinesis Data Streams - at-least-once
      • Embed a primary key in data records and remove duplicates later when processing
    • Kinesis Data Firehose - at-least-once
      • Crawl target data with Glue ML FindMatches Transform
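
For the at-least-once cases above, a common pattern is to embed a unique key in each record and drop repeats at processing time. A minimal in-memory sketch of the idea; a production consumer would persist seen keys (for example in DynamoDB) rather than keep them in memory:

```python
# Minimal sketch: consumer-side de-duplication for at-least-once delivery.
# Each producer embeds a unique "event_id"; the consumer skips keys it has
# already processed. In-memory only -- a real consumer would persist seen
# keys so de-duplication survives restarts.
import json

_seen_event_ids = set()

def process_records(raw_records):
    for raw in raw_records:
        event = json.loads(raw)
        event_id = event["event_id"]          # unique key embedded by the producer
        if event_id in _seen_event_ids:
            continue                          # duplicate delivery -- skip it
        _seen_event_ids.add(event_id)
        handle(event)                         # downstream business logic

def handle(event):
    print("processing", event["event_id"])
```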

Data Format and Compression

  • Kinesis Data Streams
    • Use Lambda consumer to format or compress
    • Use KCL application to format or compress
  • Kinesis Data Firehose
    • Use the format conversion feature if the data is in JSON
    • Use a Lambda transform to preprocess the data into JSON first if it is not, then apply the format conversion feature
    • Use S3 compression (GZIP, Snappy, or Zip)
    • Use the GZIP COPY command option for Redshift compression

Transform data while ingesting

Transforming Ingested Data

  • Problems with your streaming data
    • Data where we need to change the format
    • Data that needs to be compressed
  • Methods to address these problems
    • Use conversion feature of ingestion
    • Use compression feature of ingestion service

Data Format and Compression - Use Cases

  • Typical use cases
    • Process your data before ingesting it into your data source
    • Mutation of data, for instance from one format to another
    • Normalizing your data produced from variable source systems
    • Augmenting your data with metadata

Data Format and Compression - Services
  • Kinesis Data Firehose
    • Using Lambda, Firehose sends buffered batches to Lambda for transformation (see the handler sketch after this list)
    • Batch, encrypt, and/or compress data
  • Lambda
    • Convert the format of your data, e.g. GZIP to JSON
    • Transform your data, e.g. expand strings into individual columns; replace values in the stream with normalized data
    • Filter your data to remove extraneous info
    • Enrich your data, e.g. error correction
  • Database Migration Service
    • Table and schema transformations, e.g. change table, schema, and/or columns names
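
A Firehose Lambda transform receives buffered, base64-encoded records and must return each recordId with a result status. A minimal handler sketch; the transformation itself (normalizing a couple of fields) is only illustrative:

```python
# Minimal sketch of a Kinesis Data Firehose transformation Lambda.
# Firehose delivers a batch of base64-encoded records; the function must
# return every recordId with a result of Ok, Dropped, or ProcessingFailed.
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Illustrative transformation: normalize a field and drop extraneous keys.
        transformed = {
            "user_id": payload.get("user_id"),
            "action": str(payload.get("action", "")).lower(),
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(transformed) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```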


Summary

Data Analytics Lifecycle


Collection Systems

  • Data Classification: Understand your data type before choosing a proper collection system.
    • Batch
    • Streaming
    • Transactional
  • Data Collection: Data collection systems give you the capability to ingest any kind of data, structured, unstructured, or semi-structured
  • Data Preprocessing: Data collection systems can transform and/or filter your data as you collect data

Stage 1: Data Classification

Select a collection system that handles the frequency, volume, and source of data
Batch, streaming, and transactional data

  • Batch Data

    • Kinesis Data Firehose
    • Batch Data: S3, AWS Glue
    • Data Lake
    • Athena
  • Streaming Data

    • Kinesis Data Firehose
    • Streaming Data: Kinesis Data Analytics, Kinesis Data Firehose
    • Redshift
    • QuickSight
  • Transactional Data

    • Data Source
    • AWS DMS (Database Migration Service)
    • S3
    • AWS Glue
    • Data Lake

Stage 2: Data Collection

Determine the operational characteristics of the collection system

  • The characteristics of your data streaming workload guide you in the selection of your streaming components
  • The two key components to remember for the exam
    • Fault tolerance
    • Data persistence
  • Kinesis Data Streams vs. Kinesis Data Firehose
    • Data persistence
  • Kinesis Producer Library vs. Kinesis API vs. Kinesis Agent
    • Fault tolerance and appropriate tool for your data collection problem

The Four Ingestion Services

Understand how each ingestion approach is best used

  • Throughput, bandwidth, scalability
  • Availability and fault tolerance
  • Cost of running the services

The key differentiator among the four ingestion services is the frequency, volume, and source of the data: these are characteristics of the data itself that you cannot change.

  • Kinesis Data Streams

    • Use when you need custom producers and consumers
    • Use cases that require sub-second processing (low latency)
    • Use cases that require unlimited bandwidth (throughput scales with the number of shards, and you can add shards without limit)
  • Kinesis Firehose

    • Use cases where you want to deliver directly or indirectly to S3, Redshift, Elasticsearch, HTTP Endpoint, and Third-party service provider (Datadog, MongoDB Cloud, New Relic, and Splunk)
    • Use cases where you can tolerate latency of 60 seconds or greater
    • Use cases where you wish to transform your data or convert the data format
  • Database Migration Service

    • Use cases when you need to migrate data from one database to another
    • Use cases where you want to migrate a database to a different database engine
    • Use cases needing continuous replication of database records
  • Glue

    • Batch-oriented use cases where you want to perform an Extract Transform Load (ETL) process
    • Not for use with streaming use cases

Kinesis Data Streams

AWS Kinesis Data Streams


Kinesis Data Firehose

AWS Kinesis Data Firehose


AWS Glue

AWS Glue Introduction
Glue ETL from S3 Lab

  • Key point: batch oriented
    • Micro-batches but no streaming data
  • Does not support NoSQL databases as data source
  • Crawl data source to populate data catalog
  • Generate a script to transform data or write your own in console or API
  • Run jobs on demand or via a trigger event
  • Glue catalog tables contain metadata not data from the data source
  • Uses a scale-out Apache Spark environment when loading data to destination
    • Allocate data processing units (DPUs) to jobs

Stage 3: Data Preprocessing

Managing Data Order, Format, and Compression

  • Problems with your streaming data
    • Data that is out of order
    • Data that is duplicated
    • Data where we need to change the format
    • Data that needs to be compressed
  • Methods to address these problems
    • Choose an ingestion service that has guaranteed ordering
    • Choose an ingestion service that addresses your data duplication requirements
    • Use conversion feature of ingestion service
    • Use compression feature of ingestion service

Data Format and Compression

Data Format and Compression Lab

https://zacks.one/aws-kinesis-lab/#3.2

  • Kinesis Data Firehose
    • Use format conversion feature if data in JSON
    • Use S3 compression (GZIP, Snappy, or Zip)

Data Format and Compression - Services

  • Kinesis Data Firehose
    • Using Lambda, Firehose sends buffered batches to Lambda for transformation
    • Batch, encrypt, and/or compress data
  • Lambda
    • Convert the format of your data, e.g. GZIP to JSON
    • Transform your data, e.g. expand strings into individual columns
    • Filter your data to remove extraneous info
    • Enrich your data, e.g. error correction
  • Database Migration Service
    • Table and schema transformations, e.g. change table, schema, and/or column names

Data Transform and Format

Data Transform and Format Lab

  • Kinesis Data Firehose
    • Use a Lambda transform to preprocess the data into JSON if it is not already JSON
    • Use the format conversion feature once the data is in JSON format

Quiz

  1. A data engineer in a manufacturing company is designing a data processing platform that receives a large volume of unstructured data. The data engineer must populate a well-structured star schema in Amazon Redshift. What is the most efficient architecture strategy for this purpose?

A. Transform the unstructured data using Amazon EMR and generate CSV data. COPY the CSV data into the analysis schema within Redshift.
B. Normalize the data using an AWS Marketplace ETL tool, persist the results to Amazon S3, and use AWS Lambda to INSERT the data into Redshift.
C. Load the unstructured data into Redshift, and use string parsing functions to extract structured data for inserting into the analysis schema.
D. When the data is saved to Amazon S3, use S3 Event Notifications and AWS Lambda to transform the file contents. Insert the data into the analysis schema on Redshift.

A.


Which data collection service is best suited for loading batch data into a data lake within AWS?

A. DMS
B. Kinesis Data Streams
C. Kinesis Data Firehose
D. Glue

D.


Which collection system component allows you to automatically process retries for your data records?

A. KCL
B. KPL
C. Kinesis API
D. Kinesis Agent

A.


Kinesis Data Streams replicates your data asynchronously across 3 AZs.

A. True
B. False

False.


Which of the following is NOT a valid destination for Kinesis Data Firehose?

A. S3
B. Redshift
C. DynamoDB
D. Elasticsearch
E. Splunk
F. Kinesis Data Analytics

C.


If a KCL application read fails, the KCL uses which of these to resume at the failed record?

A. checkpoint flag
B. checkpoint cursor
C. checkpoint log
D. logging cursor

B.


Frequent checkpointing by your KCL applications can cause which of these?

A. provisioning checkpoint exceptions
B. allocation checkpoint exceptions
C. provisioning throughput exceptions
D. allocation throughput exceptions

C.


Each Kinesis Data Streams shard can support up to how many RPS for writes with a max of 1 MB/Sec?

A. 10,000
B. 1,000
C. 2,000
D. 5,000

B


The source database remains partially operational during the migration when using DMS

A. True
B. False

B.


Which of these approaches is best used when you want to load data from relational databases, data warehouses, and NoSQL databases?

A. Batch
B. Streaming
C. Transactional
D. Synchronous

C.


Which data collection service automatically scales to accommodate the throughput of your stream?

A. Glue
B. DMS
C. Kinesis Data Firehose
D. Kinesis Data Streams

C.


Which data collection guarantees correct ordering?

A. Kinesis Data Firehose
B. Kinesis Data Streams
C. Glue
D. DMS

B. Kinesis Data Streams (and DynamoDB Streams)


Which data collection service has exactly once delivery?

A. Kinesis Data Firehose
B. Kinesis Data Streams
C. DynamoDB streams
D. Glue

C.

De-duped delivery

  • DynamoDB Streams - exactly-once
  • Kinesis Data Streams - at-least-once
    • Embed a primary key in data records and remove duplicates later when processing
  • Kinesis Data Firehose - at-least-once
    • Crawl target data with Glue ML FindMatches Transform

Domain 2: Storage and Data Management


Introduction

Stages of Storage and Data Management

Storage and Data Management in the Data Analytics Pipeline

  • Stage 1. Data Storage: Determine the operational characteristics of the storage solution for analytics
  • Stage 2. Data Access and Retrieval: Determine data access and retrieval patterns
  • Stage 3. Data Structure: Select appropriate data layout, schema, structure, and format
  • Stage 4. Data Lifecycle: Define data lifecycle based on usage patterns and business requirements
  • Stage 5. Data Catalog and Metadata: Determine the appropriate system for cataloging data and managing metadata

Stage 1: Data Storage
  • Determine the operational characteristics of the storage solution for analytics
    • Choose the correct storage system
      • Operational
        • RDS
        • DynamoDB
        • Elasticache
        • Neptune
      • Analytic
        • Redshift
        • S3

Data Freshness

  • Consider your data’s freshness when selecting your storage system components
    • Place hot data in cache (Elasticache or DAX) or NoSQL (DynamoDB)
    • Place warm data in SQL data stores (RDS)
    • Can use S3 for all types (hot, warm, cold)
    • Place cold data in S3 Glacier

DynamoDB


Stage 2: Data Access and Retrieval

Determine data access and retrieval patterns

  • Patterns
    • Data retrieval speed
    • Data storage lifecycle
    • Data structures

  • Elasticache vs. DynamoDB Accelerator (DAX)
    • Which caching service is best for your near-real-time streaming data needs?

  • Data Lake vs. Data Warehouse
    • Which data storage service is best for your storage needs?

  • Redshift storage options
    • Redshift’s 3 node types to use depending on your storage requirements

Stage 3: Data Layout, Schema, Structure

Select appropriate data layout, schema, structure, and format


Stage 4: Data Lifecycle

Define data lifecycle based on usage patterns and business requirements


Stage 5: Data Catalog and Metadata

Determine the appropriate system for cataloging data and managing metadata


Stage 1: Data Storage


Two Types of Storage Systems

Take into account the cost, performance, latency, durability, consistency, and shelf-life of your data.

  • Operational storage services
    • RDS
    • DynamoDB
    • Elasticache
    • Neptune
  • Analytic storage services
    • Redshift
    • S3


Data Freshness

  • Consider your data’s freshness when selecting your storage system components
    • Place hot data in cache (Elasticache or DAX) or NoSQL (DynamoDB)
    • Place warm data in SQL data stores (RDS)
    • Can use S3 for all types (hot, warm, cold)
    • Place cold data in S3 Glacier

Operational Storage Services

RDS, DynamoDB, Elasticache, and Neptune.

  • Operational Database
    • Data stored as rows
    • Low latency
    • High throughput
    • Highly concurrent
    • Frequent changes
    • Benefits from caching
    • Often used in enterprise critical applications

RDS
  • RDS - managed relational database service
    • Use cases
      • E-commerce, web, mobile, financial services, healthcare
    • Fast OLTP database options
      • SSD-backed storage options
    • Scale
      • Vertical scaling
      • Instance and storage size determine scale
    • Reliability and durability
      • Multi-AZ
      • Automated backups and snapshots
      • Automated failover

DynamoDB
  • DynamoDB - fully managed NoSQL database service
    • Use cases
      • Ad Tech, gaming, retail, banking and finance
    • Fast NoSQL database options
      • Single-digit millisecond latency at scale
    • Scale
      • Horizontal scaling
      • Can store data without bounds
      • High performance and low cost even at extreme scale
    • Reliability and durability
      • Data replicated across 3 AZs
      • Global-tables for multi-region replication

Elasticache
  • Elasticache - fully managed Redis and Memcached
    • Use cases
      • Caching, session stores, gaming, real-time analytics
    • Sub-millisecond response time from in-memory data store
      • Single-digit millisecond latency at scale
    • Reliability and durability
      • Redis Elasticache offers multi-AZ automatic failover

Timestream
  • Timestream - fully managed time series database service
    • Use cases
      • IoT applications, Industrial telemetry, application monitoring
    • Fast: analyze trillions of events per day
      • One tenth the cost of relational database
    • Scale
      • Vertical scaling
      • Timestream scales up or down depending on your load
    • Reliability and durability
      • Managed service takes care of provisioning, patching, etc.
      • Retention policies to manage reliability and durability

Analytic Storage Services

Redshift and S3

  • Analytic storage services
    • Two types
      • OLAP: Online Analytical Processing, ad-hoc queries
      • DSS: Decision Support Systems, long running aggregations
    • Data stored as columns
    • Large datasets that take advantage of partitioning
    • Frequent complex aggregations
    • Loaded in bulk or via streaming
    • Less frequent changes

Redshift

Mnemonic: Oracle's corporate color is red, so "Redshift" hints at shifting your data warehouse away from Oracle to AWS.

  • Redshift - AWS Cloud Data Warehouse
    • Use cases
      • Data Science queries, marketing analysis
    • Fast: columnar storage technology that parallelizes queries
      • Millisecond latency queries
    • Reliability and durability
      • Data replicated within the Redshift cluster
      • Continuous backup to S3

S3
  • S3 - object storage via a web service
    • Use cases
      • Data lake, analytics, data archiving, static website
    • Fast: query structured and semi-structured data
      • Use Athena and Redshift Spectrum to query at low latency
    • Reliability and durability
      • Data replicated across 3 AZs in a region
      • Same-region or cross-region replication

Stage 2: Data Access and Retrieval

Patterns

  • Data structures
    • Structured data
    • Unstructured data
    • Semi-structured data
  • Data storage lifecycle
  • Data access retrieval and latency requirements
    • Retrieval speed

Data Structures

  • Data structures
    • Structured data
      • Examples: accounting data, demographic info, logs, mobile device geolocation data
      • Storage options: RDS, Redshift, S3 Data Lake
    • Unstructured data
      • Examples: email text, photos, video, audio, PDFs
      • Storage options: S3 Data Lake, DynamoDB
    • Semi-structured data
      • Examples: email metadata, digital photo metadata, video metadata, JSON data
      • Storage options: S3 Data Lake, DynamoDB

Data Warehouse
  • Data Warehouse
    • Optimized for relational data produced by transactional systems
    • Data structure/schema predefined which optimizes fast SQL queries
    • Used for operational reporting and analysis
    • Data is transformed before loading
    • Centralized data repository for BI and analysis
    • Access the centralized data using BI tools, SQL clients, and other analytics apps

Data Lake
  • Data Lake
    • Relational data and non-relational data: mobile apps, IoT devices, and social media
    • Structured, unstructured, and semi-structured data.
      • Data structure/schema is not defined when the data is stored in the data lake
      • Structure is applied later, when the data is read
    • Big data analytics, text analysis, ML, dashboards, visualization, real-time analytics
    • Schema on read

Object Storage vs. File Storage vs. Block Storage
  • Object storage
    • S3 is used for object storage: highly scalable and available
    • Store structured, unstructured, and semi-structured data
    • Web sites, mobile apps archive, analytics applications
    • Storage via a web service
  • File storage
    • EFS is used for file storage: shared file systems
    • Content repositories, development environments, media stores, user home directories
  • Block storage
    • EBS volumes attached to EC2 instances; choice of volume type (HDD or SSD)
    • Redshift, Operating Systems, DBMS installs, file systems
    • HDD: throughput intensive, large I/O, sequential I/O, big data
    • SSD: high I/O per second, transaction, random access, boot volumes


Data Storage Lifecycle

  • Persistent data
    • OLTP and OLAP
    • DynamoDB, RDS, Redshift
  • Transient data
    • Cached data (like website session data), streaming data consumed in near-real time
    • Elasticache (Redis memcached), DynamoDB Accelerator (DAX)
    • Website session info, streaming gaming data
  • Archive data
    • Retained for years, typically regulatory
    • S3 Glacier

Data Access Retrieval and Latency

  • Retrieval speed
    • Near-real time
      • Streaming data with near-real time dashboard display (e.g. Tableau, QuickSight)
    • Cached data
      • Elasticache
      • DAX


Data Warehouse vs. Data Lake

Data Warehouse
  • Data Warehouse
    • Optimized for relational data produced by transactional systems
    • Data structure/schema predefined which optimizes fast SQL queries
    • Used for operational reporting and analysis
    • Data is transformed before loading
    • Centralized data repository for BI and analysis
    • Access the centralized data using BI tools, SQL clients, and other analytics apps

  • Data Warehouse systems
    • Teradata
    • Oracle
    • Redshift
    • Cloudera EDH

Data Lake
  • Data Lake
    • Relational data and non-relational data: mobile apps, IoT devices, and social media
    • Structured, unstructured, and semi-structured data.
      • Data structure/schema is not defined when the data is stored in the data lake
      • Structure is applied later, when the data is read
    • Big data analytics, text analysis, ML, dashboards, visualization, real-time analytics
    • Schema on read

  • Data Lake systems
    • S3
    • EMR
    • Hadoop

Different Characteristics
  • Data optimization
    • Data warehouse is optimized to store relational data from transactional systems with schema-on-write
    • Data lake stores all types of data: relational data and non-relational data from IoT devices, mobile apps, social, etc. with schema-on-read
| Data Lake | Characteristic | Data Warehouse |
|---|---|---|
| Any kind of data from IoT devices, social media, sensors, mobile apps, relational sources, text | Data | Relational data with corresponding tables defined in the warehouse |
| Schema-on-read. No knowledge of data format when written to the lake; schema is constructed by the analyst or system when retrieved from the data lake | Schema | Schema-on-write. Defined based on knowledge of the data to load |
| Raw data from many disparate sources | Data Format | Data is carefully managed and controlled, predefined schema |
| Uses low-cost object storage | Storage | Costly with large data volumes |
| Change configuration as needed at any time | Agility | Configuration (schema/table structure) is fixed |
| Machine learning specialists, data scientists, business analysts | Users | Management, business analysts |
| Machine learning, data discovery, analytics applications, real-time streaming visualizations | Applications | Visualizations, business intelligence, reporting |

Best Practices


Cost Management
  • To manage cost, use performance optimizations for data layout, schema, data structure, and data format
    • Use DynamoDB partition keys and burst/adaptive capacity for optimal data distribution
    • Use Redshift sort keys to optimize query plans
    • Efficient use of the Redshift COPY command
    • Redshift compression types
    • Redshift primary key and foreign key constraints
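
As an illustration of the efficient COPY-based bulk load mentioned above, a hedged sketch that issues a COPY from S3 through the Redshift Data API; the cluster, database, table, bucket, and IAM role names are placeholder assumptions:

```python
# Minimal sketch: bulk-loading a Redshift table with COPY via the Redshift Data API.
# COPY loads files from S3 in parallel, which is far more efficient than row-by-row
# INSERTs. Identifiers and ARNs are placeholders.
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

copy_sql = """
COPY analytics.page_views
FROM 's3://my-analytics-bucket/staging/page_views/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS CSV
GZIP;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
print(response["Id"])  # statement id; poll describe_statement() for completion
```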


Stage 3: Data Layout, Schema, Structure

Redshift

DynamoDB


Stage 4: Data Lifecycle

Data Lifecycle Management

  • Data lifecycle management
    • S3 data lifecycle
      • Lifecycle policies
      • S3 replication
    • Data backups
      • Redshift, RDS, DynamoDB

Take social media as an example


S3


S3 Data Lifecycle
  • S3 Standard: General-purpose storage of frequently accessed data
  • S3 Standard Infrequent-Access (Standard IA): Long-lived, but less frequently accessed data
  • S3 One Zone-IA: Long-lived, but less frequently accessed, and non-critical data
  • S3 Intelligent Tiering: Cost optimization without performance impact
  • S3 Glacier: Long-term storage with minutes or hours for retrieval time
  • S3 Glacier Deep Archive: Long-term storage with 12 hours for retrieval time as default
  • S3 Reduced Redundancy Storage (RRS): Frequently accessed, non-critical data

Amazon S3 Storage Classes

| | S3 Standard | S3 Intelligent-Tiering | S3 Standard-IA | S3 One Zone-IA | S3 Glacier | S3 Glacier Deep Archive |
|---|---|---|---|---|---|---|
| Designed for durability | 99.999999999% (11 9's) | 99.999999999% (11 9's) | 99.999999999% (11 9's) | 99.999999999% (11 9's) | 99.999999999% (11 9's) | 99.999999999% (11 9's) |
| Designed for availability | 99.99% | 99.9% | 99.9% | 99.5% | 99.99% | 99.99% |
| Availability SLA | 99.9% | 99% | 99% | 99% | 99.9% | 99.9% |
| Availability Zones | ≥3 | ≥3 | ≥3 | 1 | ≥3 | ≥3 |
| Minimum capacity charge per object | N/A | N/A | 128 KB | 128 KB | 40 KB | 40 KB |
| Minimum storage duration charge | N/A | 30 days | 30 days | 30 days | 90 days | 180 days |
| Retrieval fee | N/A | N/A | per GB retrieved | per GB retrieved | per GB retrieved | per GB retrieved |
| First byte latency | milliseconds | milliseconds | milliseconds | milliseconds | select minutes or hours | select hours |
| Storage type | Object | Object | Object | Object | Object | Object |
| Lifecycle transitions | Yes | Yes | Yes | Yes | Yes | Yes |

S3 Data Lifecycle Policies
  • S3 Lifecycle Policies
    • Lifecycle rules tell S3 when to transition objects to another Amazon S3 storage class
    • Define rules to move objects from one storage class to another
    • Transitions between storage classes follow a waterfall model
    • Combine lifecycle actions to move an object through its entire lifecycle
    • Encrypted objects stay encrypted throughout their lifecycle
    • Transition to S3 Glacier Deep Archive is a one-way trip (use the restore operation to move an object out of Deep Archive)
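
A minimal boto3 sketch of a lifecycle rule that walks objects down the storage-class waterfall and eventually expires them; the bucket name, prefix, and day counts are placeholder assumptions:

```python
# Minimal sketch: S3 lifecycle rule that transitions objects to colder storage
# classes over time and expires them after ~7 years. Bucket/prefix are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```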


S3 Replication
  • Replication copies your S3 objects automatically and asynchronously across S3 buckets
    • Use Cross-Region replication (CRR) to copy objects across S3 buckets in different Regions
    • Use Same-Region replication (SRR) to copy objects across S3 buckets in the same Region
    • Compliance requirements - physically separated backups
    • Latency - keep replicated object copies in Regions closer to your users
    • Operational efficiency - applications in different regions analyzing the same object data

Database Backups

  • Database management requires backups on a given frequency according to your requirements
  • Restores from backups
  • Redshift stores snapshots internally on S3
    • Snapshots are point-in-time backups of your cluster
  • DynamoDB allows for on-demand backups
    • Backup and restore have no impact on the performance of your tables
  • RDS performs automated backups of your database instance
    • Can recover to any point-in-time
    • Can perform manual backups using database snapshots

Stage 5: Data Catalog and Metadata


Metadata Management

  • Hive records your data metastore information in a MySQL database housed on the master node file system
    • Hive metastore describes the table and the underlying data on which it is built
      • Partition names
      • Data types
    • At cluster termination time, the master node shuts down
      • Local data is deleted since master node file system is on ephemeral storage
    • To maintain a persistent metastore, create an external metastore
      • Two options
        • Glue data catalog as Hive metastore
        • External MySQL or Aurora Hive metastore

Glue Data Catalog as Hive Metastore

Populating Glue Data Catalog Lab

  • When you need a persistent metastore or a shared metastore used by different clusters, services, applications, or AWS accounts
  • Metadata repository across many data sources and data formats
    • EMR, RDS, Redshift, Redshift Spectrum, Athena, application code compatible with the Hive metastore
    • Glue crawlers infer the schema from your data objects in S3 and store the associated metadata in the Data Catalog
  • Limitations: Glue Data Catalog as your Hive Metastore
    • Hive transactions are NOT supported
    • Column-level statistics are NOT supported
    • Hive authorizations are NOT supported, use Glue resource-based policies
    • Cost-based optimization in Hive is NOT supported
    • Temporary tables are NOT supported
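
A hedged boto3 sketch of launching an EMR cluster that uses the Glue Data Catalog as its Hive metastore via the hive-site classification; the release label, instance sizes, and roles are placeholder assumptions:

```python
# Minimal sketch: launching an EMR cluster that uses the Glue Data Catalog
# as its (persistent) Hive metastore. Sizes, roles, and names are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="hive-with-glue-catalog",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```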

External RDS Hive Metastore
  • Override Hive's default metastore location and use an external database location
  • RDS MySQL or Aurora instance
  • Hive cluster runs using the metastore located in Amazon RDS
  • Start all additional Hive clusters that share this metastore by specifying the RDS metastore location
  • RDS replication is not enabled by default, configure replication to avoid any data loss in the event of failure
  • Hive neither supports nor prevents concurrent writes to metastore tables
    • When sharing metastore info between two clusters, do not write to the same metastore table concurrently, unless writing to different metastore table partitions


Glue Data Catalog

  • Holds references to data used as sources and targets of your Glue ETL jobs
  • Catalog your data in the Glue Data Catalog to use when creating your data lake or data warehouse
  • Holds information on the location, schema, and runtime metrics of your data
    • Use this information to create ETL jobs
    • Information stored as metadata tables, with each table describing a single data store
  • Ways to add metadata tables to your Data Catalog
    • Glue crawler
    • AWS console
    • CreateTable Glue API call
    • CloudFormation templates
    • Migrate an Apache Hive metastore


Steps to Populate Glue Data Catalog
  • Four steps to populate your Glue data catalog
    1. Classify your data by running a crawler
       • Custom classifiers
       • Built-in classifiers
    2. Crawler connects to the data store
    3. Crawler infers the schema
    4. Crawler writes metadata to the Data Catalog
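
A minimal boto3 sketch of the crawler flow above: create a crawler over an S3 prefix and run it to populate the Data Catalog (crawler name, role, database, path, and schedule are placeholders):

```python
# Minimal sketch: create a Glue crawler over an S3 prefix and run it to
# populate the Data Catalog. Role, database, and path are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="raw-clickstream-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="analytics_raw",          # catalog database that receives the tables
    Targets={"S3Targets": [{"Path": "s3://my-analytics-bucket/raw/clickstream/"}]},
    # Optional: run on a schedule instead of on demand (cron syntax).
    Schedule="cron(0 2 * * ? *)",
)

glue.start_crawler(Name="raw-clickstream-crawler")
```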


Which of the following is NOT a valid way to add table definitions to the Glue Data Catalog?

A. Run a crawler
B. AWS console
C. CloudFormation templates
D. ETL job

D.


Which of the following is a valid way to migrate an Apache Hive metastore to a Glue Data Catalog?

A. Indirect migration
B. Batch migration
C. Transactional migration
D. Migration using S3 objects

D.


Glue Ecosystem
  • Categorizes, cleans, enriches, and moves your data reliably between various data stores
  • Several AWS services natively support querying data sources via the unified metadata repository of the Glue Data Catalog
    • Athena
    • Redshift
    • Redshift Spectrum
    • EMR
    • S3
    • RDS
    • Any application compatible with the Apache Hive metastore


Storage Systems


Quiz

Primary key and foreign key constraints are enforced by Redshift.

A. True
B. False

B.


Which of the following is an accurate description of the process of loading a Redshift table?

A. The COPY command is not an efficient way to load a Redshift table.
B. The COPY command is the only way to load a Redshift table
C. The COPY command is the most efficient way to load a Redshift table.
D. The INSERT command is the most efficient way to load a Redshift table.

C.


After you have completed the initial load of your Redshift table you should do which of these?

A. Run a SWEEP command
B. Run a VACUUM command
C. Run an ANALYZE command
D. Run an INDEX command

B.


Which of the following commands updates your Redshift table statistics?

A. VACUUM
B. ANALYZE
C. INDEX
D. UPDATE

B.


Which is a situation when you should NOT use an interleaved sort key in Redshift?

A. On a column with randomly increasing attributes
B. On a column with monotonically increasing attributes
C. On a column with logistically increasing attributes
D. On a column with exponentially increasing attributes

B.


Whenever you are not fully using your partition’s throughput, DynamoDB reserves a portion of that unused capacity for which of these?

A. Adaptive capacity
B. Burst capacity
C. Excess capacity
D. Reserved capacity

B.


Which of the Redshift distribution options distributes all the rows to the first slice on each node?

A. All
B. Auto
C. Even
D. Distribution key

A.


Which of the Redshift distribution options always distributes your rows evenly across your node slices?

A. All
B. Auto
C. Even
D. Distribution key

C.


Which of the Redshift distribution options results in each slice containing about the same number of rows?

A. All
B. Auto
C. Even
D. Distribution key

D.


Which of the Glue features allows you to create your own classification logic for your crawler?

A. Custom crawler
B. Custom classifier
C. Built-in crawler
D. Built-in classifier

B.


Domain 3: Processing


Introduction

  • In the processing section of the data analytics pipeline you choose the best processing tools and techniques based on workload requirements, performance, cost, and orchestration needs
    • Transform your data to ready it for analytics and visualization
  • Determine appropriate data processing solution requirements
  • Design a solution for transforming and preparing data for analysis
  • Automate and operationalize a data processing solution

Data Processing Solutions

  • ETL processing of your data
    • Spark: Glue ETL jobs running on a fully managed Apache Spark environment
    • Non-Spark: EMR cluster - many options for processing beyond Spark
    • Real/near-real time: Kinesis Data Streams/Firehose and Kinesis Data Analytics with Lambda


Design a solution for transforming and preparing data for analysis

  • Where to apply each processing tool/service
    • Your system’s operational characteristics define your choice of tool/service
      • Near real-time processing
      • Operational periodicity - batch
    • Depending on your periodicity, optimize your tool/service accordingly
      • Optimizing EMR cluster - compute, storage, concurrency, cost

Automate and operationalize a data processing solution

  • Use orchestration to operationalize your data processing workflows
    • Glue workflows
    • Step Functions and Glue
    • Step Functions and EMR


Data Processing Solution - ETL

Glue ETL on Apache Spark

  • Use Glue when you don’t need or want to pay for an EMR cluster
    • Glue generates an Apache Spark (PySpark or Scala) script (see the job sketch after this list)
  • Glue runs in a fully managed Apache Spark environment
    • Spark has 4 primary libraries
      • Spark SQL
      • Spark Streaming
      • MLlib
      • GraphX
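
A hedged sketch of the kind of PySpark script Glue generates: read a Data Catalog table, remap columns, and write Parquet to S3. The database, table, and path names are placeholder assumptions:

```python
# Minimal sketch of a Glue PySpark ETL job: read from a Data Catalog table,
# remap columns, and write Parquet to S3. Catalog/table/path names are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table registered by a crawler.
source = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_raw", table_name="clickstream"
)

# Rename/cast columns on the way through.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("userid", "string", "user_id", "string"),
              ("ts", "long", "event_time", "long")],
)

# Write the curated output as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/curated/clickstream/"},
    format="parquet",
)
job.commit()
```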


EMR Cluster ETL Processing

Supported Applications and Features

  • More flexible and powerful than Glue’s managed Spark environment
    • Can use Spark on EMR, but other options are available too
  • Popular
    • Hadoop
    • Spark
    • Sqoop
    • TensorFlow
    • Zeppelin
    • Hive
    • Hue
    • Presto
| Application | Description |
|---|---|
| Flink | Framework and distributed processing engine for stateful computations over unbounded and bounded data streams |
| Ganglia | Distributed system designed to monitor clusters and grids while minimizing the impact on their performance |
| Hadoop | Framework that allows for the distributed processing of large data sets across clusters of computers |
| HBase | Non-relational distributed database modeled after Google’s Bigtable |
| HCatalog | Table storage management tool for Hadoop that exposes the tabular data of the Hive metastore to other Hadoop applications |
| JupyterHub | Serves Jupyter notebooks for multiple users |
| Livy | Enables easy interaction with a Spark cluster over a REST interface |
| Mahout | Create scalable ML algorithms |
| MXNet | Deep learning software framework used to train and deploy deep neural networks |
| Phoenix | Massively parallel relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store |
| Pig | High-level platform for creating programs that run on Apache Hadoop; the language for this platform is called Pig Latin |
| Spark | Distributed general-purpose cluster-computing framework |
| Sqoop | Command-line interface application for transferring data between relational databases and Hadoop |
| Tez | Extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop |
| TensorFlow | Machine learning library |
| Zeppelin | Web-based notebook for data-driven analytics |
| ZooKeeper | Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services |
| Hive | A SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop |
| Hue | Web interface for analyzing data with Hadoop |
| Oozie | Server-based workflow scheduling system to manage Hadoop jobs |
| Presto | High-performance, distributed SQL query engine for big data |

EMR Integration

  • Integration with the following data stores
    • Use S3 as an object store for Hadoop
    • HDFS (Hadoop Distributed File System) on the Core nodes instance storage
    • Directly access and process data in DynamoDB
    • Process data in RDS
    • Use COPY command to load data in parallel into Redshift from EMR
    • Integrates with S3 Glacier for low-cost storage

Kinesis ETL Processing

  • Use Kinesis Analytics to gain real-time insights into your streaming data
    • Query data in your stream or build streaming application using SQL or Flink
    • Use for filtering, aggregation, and anomaly detection
    • Preprocess your data with Lambda


Glue ETL

AWS Glue


EMR

AWS EMR


Design a solution for transforming and preparing data for analysis

Batch vs. Streaming ETL Services

  • Based on your use case, you need to select the best tool or service
    • Batch or streaming ETL
      • Batch processing model
        • Data is collected over a period of time, then run through analytics tools
        • Time consuming, designed for large quantities of information that aren’t time-sensitive
      • Streaming processing model
        • Data is processed in a stream, a record at a time or in micro-batches of tens, hundreds, or thousands of records
        • Fast, designed for information that’s needed immediately


Batch Processing ETL
  • AWS services used for batch processing
    • Glue batch ETL
      • Schedule ETL jobs to run at a minimum of 5-minute intervals
      • Process micro-batches
      • Serverless
    • EMR batch ETL
      • Use Impala or Hive to process batches of data
      • Cluster of servers
      • Very flexible in tool/application selection
      • Expensive


Streaming Processing ETL
  • AWS services used for streaming processing
    • Lambda
      • Reads records from your data stream and runs functions synchronously
      • Frequently used with Kinesis
      • Serverless
    • Kinesis
      • Use the KCL, Kinesis Analytics, Kinesis Firehose to process your data
      • Serverless
    • EMR Streaming ETL
      • Use Spark Streaming to build your stream processing ETL application
      • Cluster of servers
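
A minimal sketch of the Lambda-plus-Kinesis pattern above: with an event source mapping, Lambda receives batches of base64-encoded stream records. The heartbeat filter is only illustrative:

```python
# Minimal sketch of a Lambda function consuming a Kinesis data stream via an
# event source mapping. Records arrive base64-encoded under the "kinesis" key.
import base64
import json

def lambda_handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Illustrative streaming ETL step: filter, then emit the cleaned event.
        if payload.get("action") == "heartbeat":
            continue                      # drop extraneous records
        print(json.dumps({"user_id": payload.get("user_id"),
                          "action": payload.get("action")}))
    return {"batchSize": len(event["Records"])}
```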


Automate and Operationalize a Data Processing Solution

AWS EMR Monitoring
AWS EMR Automation
AWS Glue Workflows
AWS Step Functions


Quiz

Which of the following is not one of the 4 primary Spark libraries?

A. Spark SQL
B. Spark Batch
C. Spark Streaming
D. MLlib
E. GraphX

B.


Which of the following is not one of the Glue output formats?

A. JSON
B. CSV
C. XML
D. ORC
E. Avro
F. Parquet

C.


The relationalize transform in Glue does which of these?

A. maps source DynamicFrame columns and data types to target DynamicFrame columns and data types
B. selects records from a DynamicFrame and returns a filtered DynamicFrame
C. applies a function to the records of a DynamicFrame and returns a transformed DynamicFrame
D. converts a DynamicFrame to a relational (rows and columns) form

D.

  • Glue has built-in transforms for processing data
    • Call from within your ETL script
    • In a DynamicFrame (an extension of an Apache Spark SQL DataFrame), your data passes from transform to transform
    • Built-in transform types (subset)
      • ApplyMapping: maps source DynamicFrame columns and data types to target DynamicFrame columns and data types
      • Filter: selects records from a DynamicFrame and returns a filtered DynamicFrame
      • Map: applies a function to the records of a DynamicFrame and returns a transformed DynamicFrame
      • Relationalize: converts a DynamicFrame to a relational (rows and columns) form
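
A hedged sketch of Relationalize inside a Glue script, flattening nested JSON into relational form; the database, table, and staging path are placeholder assumptions:

```python
# Minimal sketch: using the Glue Relationalize transform to flatten nested JSON
# into relational (rows and columns) form. Names and paths are placeholders.
from awsglue.transforms import Relationalize
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

orders = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_raw", table_name="orders_json"
)

# Relationalize returns a collection of DynamicFrames: a root frame plus one
# frame per nested array, joined by generated keys.
flattened = Relationalize.apply(
    frame=orders, staging_path="s3://my-analytics-bucket/tmp/", name="root"
)

root = flattened.select("root")
root.toDF().show(5)
```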

Which of the following describes the EMR File System (EMRFS) file systems used by EMR?

A. distributes the data it stores across instances in the cluster (ephemeral)
B. directly access data stored in S3 as if it were a file system like HDFS
C. EC2 locally connected disk
D. distributes the data it stores across instances in the cluster

B.


Which software component of EMR centrally manages cluster resources?

A. Hive
B. Presto
C. Hadoop
D. Yarn

D.


Which of the following performs the task of monitoring your EMR cluster?

A. Sqoop
B. Tez
C. Oozie
D. Ganglia
E. Presto

D.


Which of the following is a streaming dataflow engine that allows you to run real-time stream processing on high-throughput data sources?

A. Spark
B. Flink
C. Hive
D. Presto

B.


Which of the following supports map/reduce functions and complex extensible user-defined data types like JSON and Thrift?

A. Hive
B. Presto
C. HBase
D. Hue

A.

  • Framework layer that is used to process and analyze data
    • Different frameworks available
      • Hadoop MapReduce
        • Parallel distributed applications that use Map and Reduce functions (e.g. Hive)
          • Map function maps data to sets of key-value pairs
          • Reduce function combines the key-value pairs and processes the data
      • Spark
        • Cluster framework and programming model for processing big data workloads

In the calculation of the storage capacity for your EMR cluster, which of these defines how each data block is stored in HDFS for RAID-like redundancy?

A. Instance fleet
B. Instance group
C. Replication factor
D. EBS storage capacity

C.

  • To calculate the storage allocation for your cluster consider the following
    • Number of EC2 instances used for core nodes
    • Capacity of the EC2 instance store for the instance type used
    • Number and size of EBS volumes attached to core nodes
    • Replication factor: how each data block is stored in HDFS for RAID-like redundancy
      • Replication factor 3: 10 or more core nodes
      • Replication factor 2: 4-9 core nodes
      • Replication factor 1: 3 or fewer core nodes
    • HDFS capacity of your cluster
      • For each core node, add instance store volume capacity to EBS storage capacity
      • Multiply by the number of core nodes, then divide the total by the replication factor
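
That calculation as a small worked function; the node count and volume sizes are illustrative inputs, not values from the text:

```python
# Worked example of the HDFS capacity calculation described above.
# Inputs are illustrative: 10 core nodes, 128 GiB instance store and
# 500 GiB of EBS per node, replication factor 3 (10+ core nodes).
def hdfs_capacity_gib(core_nodes, instance_store_gib, ebs_gib, replication_factor):
    per_node = instance_store_gib + ebs_gib        # storage per core node
    return core_nodes * per_node / replication_factor

print(hdfs_capacity_gib(10, 128, 500, 3))          # -> ~2093 GiB usable HDFS capacity
```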

Which EMR metric is used in the “Track progress of cluster” use case?

A. RunningMapTasks
B. IsIdle
C. HDFSUtilization
D. HDFSCapacity

A.

| Use Case | Metrics |
|---|---|
| Track progress of cluster | RunningMapTasks, RemainingMapTasks, RunningReduceTasks, and RemainingReduceTasks |
| Detect idle clusters | IsIdle tracks whether a cluster is live but not currently running tasks. Set an alarm to fire when the cluster has been idle for a given period of time |
| Detect when a node runs out of storage | HDFSUtilization gives the percentage of disk space currently used. If it rises above an acceptable level for your application, such as 80% of capacity used, take action to resize your cluster and add more core nodes |
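
A hedged boto3 sketch of the idle-cluster detection above: a CloudWatch alarm on IsIdle that notifies an SNS topic after 30 minutes of idleness (the cluster ID and topic ARN are placeholders):

```python
# Minimal sketch: CloudWatch alarm that fires when an EMR cluster has been idle
# (IsIdle == 1) for 30 minutes, notifying an SNS topic. IDs/ARNs are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="emr-cluster-idle",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
    Statistic="Average",
    Period=300,                # 5-minute periods
    EvaluationPeriods=6,       # 6 x 5 minutes = 30 minutes
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:emr-alerts"],
)
```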

Which of the following is not an orchestration tool for EMR?

A. Workflow
B. Airflow
C. Oozie
D. Livy
E. Step Functions

A.

  • Ways to manage EMR Steps
    • Use Apache Oozie or Apache Airflow scheduler tools for EMR Spark applications
    • Use Step Functions and interact with Spark applications on EMR using Apache Livy
    • Directly connect Step Functions to EMR
      • Create data processing and analysis workflows with minimal code and optimize cluster utilization

Which of the following is an orchestration tool that can automate Glue crawler jobs?

A. Step Functions
B. Hue
C. Ganglia
D. Workflow

D.

  • Use Step Functions to automate your Glue workflow
    • Serverless orchestration of your Glue steps
    • Easily integrate with EMR steps

Which of the following, excluding Step Functions itself, is the only AWS service that integrates with Step Functions that can handle request/response, run jobs synchronously and asynchronously, and wait for callbacks?

A. EMR
B. Glue
C. ECS/Fargate
D. DynamoDB

C.


Domain 4: Analysis and Visualization


Introduction

Analytics and Visualization in the Data Analytics Pipeline

  • In the analytics and visualization section of the data analytics pipeline you use your collected, processed, and transformed data to create actionable insights
  • Need to understand which analysis and visualization methods and tools to use based on your audience’s expectations for access and the insight they expect to gain from your visualizations
  • Three subdomains
    • Subdomain 1: Determine the operational characteristics of an analysis and visualization solution
    • Subdomain 2: Select the appropriate data analysis solution for a given scenario
    • Subdomain 3: Select the appropriate data visualization solution for a given scenario

Subdomain 1: Determine Analysis & Visualization Solution

  • Understand the analysis and visualization components of a data analytics solution and which AWS services implement these concepts
Category Use Case Analytics Service
Analytics Interactive analytics Athena
Big data processing EMR
Data warehousing Redshift
Real-time analytics Kinesis
Operational analytics Elasticsearch
Dashboards and visualizations QuickSight
Data movement Real-time data movement Managed Streaming for Kafka (MSK)
Kinesis Data Streams
Kinesis Data Firehose
Kinesis Data Analytics
Kinesis Video Streams
Glue
Data Lake Object storage S3, Lake Formation
Backup and archive S3 Glacier, AWS Backup
Data catalog Glue, Lake Formation
Third-party data AWS Data Exchange
Predictive analytics and machine learning Frameworks and interfaces AWS Deep Learning AMIs
Platform SageMaker

Subdomain 2: Determine Analysis Solution

  • Suggest the best analysis and visualization
    • Select the appropriate type of analysis
    • Select best solution for a scenario
      • Analysis method
      • Analysis tools
      • Analysis technology

Example: analysis of data from Alexa-enabled or embedded systems and AWS IoT devices


Subdomain 3: Determine Visualization Solution

  • Understand visualization solution characteristics and AWS services and options for visualization
    • Presentation type
    • Refresh schedule
    • Delivery method
    • Access method
  • Understand aggregating data from different types of data sources into your visualization solution

Subdomain 1: Determine Analysis & Visualization Solution


Purpose-Built Analytics Services

  • Choose the correct approach and tool for your analytics problem
  • Know the AWS purpose-built analytics services
    • Athena
    • Elasticsearch
    • EMR
    • Kinesis - Data Streams, Firehose, Analytics, Video Streams
    • MSK (Managed Streaming for Apache Kafka)
    • Also know where to use
      • S3, EC2, Glue, Lambda


Use Cases - Appropriate Analytics Service

  • Use cases for the various AWS analytics services
    • Use the right analytics tool for the job
    • Use several tools on the same dataset to answer different questions about the data
Category Use Case Analytics Service
Analytics Interactive analytics Athena
Big data processing EMR
Data warehousing Redshift
Real-time analytics Kinesis
Operational analytics Elasticsearch
Dashboards and visualizations QuickSight
Data movement Real-time data movement Managed Streaming for Kafka (MSK)
Kinesis Data Streams
Kinesis Data Firehose
Kinesis Data Analytics
Kinesis Video Streams
Glue
Data Lake Object storage S3, Lake Formation
Backup and archive S3 Glacier, AWS Backup
Data catalog Glue, Lake Formation
Third-party data AWS Data Exchange
Predictive analytics and machine learning Frameworks and interfaces AWS Deep Learning AMIs
Platform SageMaker

Use Cases - Data Warehousing

  • Use SQL to query structured and unstructured data in your data warehouse and data lake without unnecessary data movement
  • Redshift
    • Query PB of structured, semi-structured, and unstructured data (unstructured data is queried through Redshift Spectrum rather than Redshift itself)
      • Data warehouse
      • Operational Database
      • S3 Data Lake using Redshift Spectrum
    • Save query results to S3 data lake using common formats such as Parquet
      • Analyze using other analytics services such as EMR, Athena, and SageMaker


Use Cases - Big Data Processing

  • Process large amounts of data (PB-scale) in your data lake
  • EMR
    • Data engineering
    • Data science
    • Data analytics
  • Spin up/spin down clusters for short running jobs, such as ingestion clusters, and pay by the second
  • Automatically scale long running clusters, such as query clusters, that are highly available


Use Cases - Real Time Analytics with MSK

  • Data collection, processing, analysis on streaming data loaded directly into data lake, data stores, analytics services
  • Managed Streaming for Kafka (MSK)
    • Uses Apache Kafka to process streaming data
      • To and from Data lake and databases
      • Power machine learning and analytics applications
  • MSK provisions and runs Apache Kafka cluster
  • Monitors and replaces unhealthy nodes
  • Encrypts data at rest


Use Cases - Real Time Analytics with Kinesis

  • Ingest real-time data such as video, audio, application logs, website clickstreams, and IoT data
  • Kinesis
    • Ingest data in real-time for machine learning and analytics
      • Process and analyze data as it streams to your data lake
      • Process and respond in real-time on your data stream


Use Cases - Operational Analytics

  • Search, filter, aggregate, and visualize your data in near real-time
  • Elasticsearch
    • Application monitoring, log analytics, and clickstream analytics
      • Managed Kibana
      • Alerting and SQL querying
    • File types
      • Log Files
      • Messages
      • Metrics
      • Configuration Information
      • Documents


Usage Patterns, Performance, and Cost

  • Managed, secure services that seamlessly integrate and scale to support end-to-end big data analytics applications
  • You need to know the performance versus cost trade-offs as well as the common usage patterns
    • Glue
    • Lambda
    • EMR
    • Kinesis - Data Streams, Firehose, Analytics, Video Streams
    • DynamoDB
    • Redshift
    • Athena
    • Elasticsearch
    • QuickSight


Glue

  • Fully managed ETL service used to catalog, clean, enrich, and move data between data stores
  • Usage patterns
    • Crawl your data and generate code to execute, including data transformation and loading
    • Integrate with services like Athena, EMR, and Redshift
    • Generates customizable, reusable, and portable ETL code using Python and Spark
  • Cost
    • Hourly rate, billed by the minute, for crawler and ETL jobs
    • Glue Data Catalog: pay a monthly fee for storing and accessing the metadata
  • Performance
    • Scale-out Apache Spark environment to load data to target data store
    • Specify the number of Data Processing Units (DPUs) to allocate to your ETL Job


Lambda

  • Run code without provisioning or managing servers
  • Usage patterns
    • Execute code in response to triggers such as changes in data, changes in system state, or actions by users
    • Real-time File Processing and stream
    • Process AWS Events
    • Replace Cron
    • ETL
  • Cost: charged by the number of requests to functions and code execution time
    • $0.20 per 1 million requests
    • Code execution $0.00001667 for every GB-second used
  • Performance
    • Process events within milliseconds
    • Latency higher for cold-start
    • Retain a function instance and reuse it to serve subsequent requests, versus creating new copy
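
A minimal sketch of the event-driven pattern above: a Lambda handler triggered by S3 object-created events that extracts the bucket and key for downstream processing (the processing itself is omitted):

```python
import json
import urllib.parse

def lambda_handler(event, context):
    # Each record describes one S3 object-created event.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(json.dumps({"bucket": bucket, "key": key}))  # hand off to real processing here
    return {"processed": len(event.get("Records", []))}
```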


EMR

  • Uses Hadoop to distribute data and processing across a resizable cluster of EC2 instances
  • Usage patterns
    • Reduces large processing problems and data sets into smaller jobs and distributes them across many compute nodes in a Hadoop cluster
    • Log processing and analytics
    • Large ETL data movement
    • Ad targeting, click stream analytics
    • Predictive analytics
  • Cost
    • Pay for the hours the cluster is up
    • EC2 pricing options (On-Demand, Reserved, and Spot)
    • EMR price is in addition to EC2 price
  • Performance
    • Driven by type/number of EC2 instances
    • Ephemeral or long-running


Kinesis

  • Load and analyze streaming data; ingest real-time data into data lakes and data warehouses
  • Usage patterns
    • Move data from producers and continuously process it to transform before moving to another data store; drive real-time metrics and analytics
    • Real-time data analytics
    • Log intake and processing
    • Real-time metrics and reporting
    • Video/Audio processing
  • Cost
    • Pay for the resources consumed
    • Data Streams hourly price per/shard
    • Charge for each 1 million PUT transactions
  • Performance
    • Data Streams: throughput capacity by number of shards
    • Provision as many shards as needed
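
Putting a record onto a stream is a one-liner with boto3 (the stream name and payload are hypothetical); the partition key determines which shard receives the record, so total throughput scales with the number of shards you provision:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="clickstream",                                             # hypothetical stream
    Data=json.dumps({"user_id": "u-123", "page": "/checkout"}).encode(),  # payload as bytes
    PartitionKey="u-123",                                                 # maps the record to a shard
)
```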

DynamoDB

  • Fully-managed NoSQL database that provides low-latency access at scale
  • Usage patterns
    • Apps needing low latency NoSQL databases able to scale storage and throughput up or down without code changes or downtime
    • Mobile apps and gaming
    • Sensor networks
    • Digital ad serving
    • E-commerce shopping carts
  • Cost
    • Three components: Provisioned throughput capacity (per hour), indexed data storage (per GB per month), data transfer in or out (per GB per month)
  • Performance
    • SSDs and limiting indexing on attributes provides high throughput and low latency
    • Define provisioned throughput capacity required for your tables
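
Provisioned throughput is declared at table creation and can be changed later without downtime; a minimal sketch with a hypothetical table:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="GameSessions",                                                    # hypothetical table
    AttributeDefinitions=[{"AttributeName": "session_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "session_id", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},    # tune per workload
)
```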

Redshift

  • Fully-managed, PB scale data warehouse service for analyzing data using BI tools
  • Usage patterns
    • Designed for online analytical processing (OLAP) using business intelligence tools
    • Analyze sales data for multiple products
    • Analyze ad impressions and clicks
    • Aggregate gaming data
    • Analyze social trends
  • Cost
    • Charges based on the size and number of cluster nodes
    • Backup storage beyond your provisioned storage size, and backups retained after cluster termination, are billed at the standard S3 rate
  • Performance
    • Columnar storage, data compression, and zone maps to reduce query I/O
    • Parallelizes and distributes SQL operations to take advantage of all available resources

Athena

  • Interactive query service used to analyze data in S3 using Presto and standard SQL
  • Usage patterns
    • Interactive ad hoc querying for web logs
    • Query staging data before loading into Redshift
    • Send AWS services logs to S3 for Analysis with Athena
    • Integrate with Jupyter, Zeppelin
  • Cost
    • $5 per TB of query data scanned
    • Save on per-query costs and get better performance by compressing, partitioning, and converting data into columnar formats
  • Performance
    • Compressing, partitioning, and converting your data into columnar formats
    • Convert data to columnar formats, allowing Athena to read only the columns it needs to process queries
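
One common way to do that conversion is an Athena CTAS query that rewrites a raw table as Parquet; a sketch with hypothetical database, table, and bucket names:

```python
import boto3

athena = boto3.client("athena")

# CTAS: rewrite the raw table as Parquet so later queries scan (and cost) far less.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE weblogs_parquet
        WITH (format = 'PARQUET', external_location = 's3://my-bucket/weblogs-parquet/')
        AS SELECT * FROM weblogs_raw
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```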

Elasticsearch

  • Fully managed service that delivers Elasticsearch’s APIs and built-in integration with Kibana
  • Usage patterns
    • Analyze activity logs
    • Analyze social media sentiments
    • Usage monitoring for mobile applications
    • Analyze data stream updates from other AWS services
  • Cost
    • Elasticsearch instance hours
    • EBS storage (if you choose this option), and standard data transfer fees
  • Performance
    • Instance type, workload, index, number of shards used, read replicas, storage configurations
    • Fast SSD instance storage for storing indexes or multiple EBS volumes

QuickSight

  • BI services to build visualizations, perform ad-hoc analysis, and get business insights from data
  • Usage patterns
    • Ad-hoc data exploration/visualization
    • Dashboards and KPIs
    • Analyze and visualize data coming from logs and stored in S3
    • Analyze and visualize data in SaaS applications like Salesforce
  • Cost
    • SPICE (Super-fast, Parallel, In-memory Calculation Engine)
    • Standard $9/user/month; Enterprise $18/user/month
    • SPICE capacity: $0.25/GB/month Standard; $0.38/GB/month Enterprise
  • Performance
    • SPICE uses a combination of columnar storage, in-memory technologies
    • Machine code generation to run interactive queries on large datasets at low latency

Patterns & Anti-patterns

Durability, availability, scalability, elasticity, interfaces, and anti-patterns


Reliable Analytics Applications

  • Managed analytics services that are durable, available, scalable, and elastic
  • You need to know the analytics service capabilities that deliver these assurances
  • Also need to know service interfaces and where not to use these services
    • Glue
    • Lambda
    • EMR
    • Kinesis - Data Streams, Data Firehose, Data Analytics, Video Streams
    • DynamoDB
    • Redshift
    • Athena
    • Elasticsearch
    • QuickSight


Glue
  • Connects to many data sources, S3, RDS, or many other types of data sources
  • Durability and availability
    • Glue leverages the durability of the data stores to which you connect
    • Provides job status and pushes notifications to CloudWatch events
    • Use SNS notification from CloudWatch events to notify of job failures or success
  • Scalability and elasticity
    • Runs on top of the Apache Spark for transformation job scale-out execution
  • Interfaces
    • Crawlers scan many data store types
    • Bulk import Hive metastore into Glue Data Catalog
  • Anti-patterns
    • Streaming, unless Spark Streaming
    • Heterogeneous ETL job types: use EMR
    • NoSQL databases: not supported as data source


Lambda
  • Uses replication and redundancy for high availability of the services and the functions it operates
  • Durability and availability
    • No maintenance windows or scheduled downtime
    • On failure, Lambda synchronously responds with an exception
    • Asynchronous invocations are automatically retried on failure (up to two additional attempts)
  • Scalability and elasticity
    • Scales automatically with no limits with dynamic capacity allocation
  • Interfaces
    • Trigger Lambda from AWS service events
    • Respond to CloudTrail audit log entries as events
  • Anti-patterns
    • Long running applications: 900s runtime
    • Dynamic websites
    • Stateful application: must be stateless


EMR
  • Fault tolerant for core node failures and continues job execution if a slave node goes down
  • Durability and availability
    • Starts up a new node if core node fails
    • Won’t replace nodes if all nodes in the cluster fail
    • Monitor for node failures through CloudWatch
  • Scalability and elasticity
    • Resize your running cluster: add core nodes, add/remove task nodes
  • Interfaces
    • Tools on top of Hadoop: Hive, Pig, Spark, HBase, Presto
    • Kinesis Connector: directly read and query data from Kinesis Data Streams
  • Anti-patterns
    • Small data sets
    • ACID transactions: RDS is a better choice


Kinesis
  • Synchronously replicates data across three AZs
  • Durability and availability
    • Highly available and durable due to config of multiple AZs in one Region
    • Use cursor in DynamoDB to restart failed apps
    • Resume at exact position in the stream where failure occurred
  • Scalability and elasticity
    • Use API calls to automate scaling, increase or decrease stream capacity at any time
  • Interfaces
    • Two interfaces: input (KPL, agent, PUT API), output (KCL)
    • Kinesis Storm Spout: read from a Kinesis stream into Apache Storm
  • Anti-patterns
    • Small scale consistent throughput
    • Long-term data storage and analytics, Redshift, S3, or DynamoDB are better choices


DynamoDB
  • Fault tolerance: automatically synchronously replicates data across 3 data centers in a region
  • Durability and availability
    • Protection against individual machine or facility failures
    • DynamoDB Streams allows replication across regions
    • Streams enables table data activity replicated across geographic regions
  • Scalability and elasticity
    • No limit to data storage, automatic storage allocation, automatic data partition
  • Interfaces
    • REST API allows management and data interface
    • DynamoDB select operation creates SQL-like queries
  • Anti-patterns
    • Port application from relational database
    • Joins or complex transactions
    • Large data with low I/O rate
    • Blob data > 400KB, S3 is the better choice


Redshift
  • Automatically detects and replaces a failed node in your data warehouse cluster
  • Durability and availability
    • On node failure the cluster is read-only until the replacement node is provisioned and added to the DB
    • Cluster remains available on drive failure; Redshift mirrors your data across the cluster
  • Scalability and elasticity
    • With API change the number, or type, of nodes while cluster remains live
  • Interfaces
    • JDBC and ODBC drivers for use with SQL clients
    • S3, DynamoDB, Kinesis, BI tools such as QuickSight
  • Anti-patterns
    • Small data sets
    • Online transaction processing (OLTP)
    • Unstructured data
    • Blob data: S3 is the better choice


Athena
  • Executes queries using compute resources across multiple facilities
  • Durability and availability
    • Automatically routes queries if a particular facility is unreachable
    • S3 is the underlying data store, gaining S3’s 11 9s durability
  • Scalability and elasticity
    • Serverless, scales automatically as needed
  • Interfaces
    • Athena CLI, API via SDK, and JDBC
    • QuickSight visualizations
  • Anti-patterns
    • Enterprise Reporting and Business Intelligence Workloads; Redshift is the better choice
    • ETL Workloads: EMR and Glue are the better choices
    • Not a replacement for RDBMS


Elasticsearch
  • Zone Awareness: distributes domain instances across two different AZs
  • Durability and availability
    • Zone Awareness gives you cross-zone replication
    • Automated and manual snapshots for durability
  • Scalability and elasticity
    • Use API and CloudWatch metrics to automatically scale up/down
  • Interfaces
    • S3, Kinesis, DynamoDB Streams, Kibana
    • Lambda function as an event handler
  • Anti-patterns
    • Online transaction processing (OLTP): RDS is the better choice
    • Ad hoc data querying: Athena is the better choice


QuickSight
  • SPICE automatically replicates data for high availability
  • Durability and availability
    • Scale to hundreds of thousands of users
    • Simultaneous analytics across AWS data sources
  • Scalability and elasticity
    • Fully managed service, scale to terabytes of data
  • Interfaces
    • RDS, Aurora, Redshift, Athena, S3
    • SaaS, applications such as Salesforce
  • Anti-patterns
    • Highly formatted canned reports: QuickSight is better for ad hoc query, analysis, and visualization of data
    • ETL: Glue is the better choice


Visualization Services

QuickSight Visualization Services
QuickSight Visualization Services Lab


Analysis for Visualization


Analysis and Visualization Services

  • Understand the best services for your analysis and visualization needs
    • With a data lake you can run your analytics without moving your data to another system
    • With Redshift you can run queries on your Redshift cluster data warehouse directly
    • With Athena and Glue you can directly query data on S3
    • Use Redshift Spectrum to build queries that combine your Redshift warehouse data with data from S3


Analysis and Visualization Services - Data Lake

  • Access vast amounts of structured, unstructured, and semi-structured data in your data lake
    • Lake Formation provides multi-service, fine-grained access control to data
    • Macie helps detect sensitive data that may have been stored in the wrong place
    • Inspector identifies configuration errors that could lead to data breaches


Analysis and Visualization Services - Redshift

  • Two options: query via AWS management console using the query editor, or via a SQL client tool
    • Use the query editor on the Redshift console, no need for a SQL client application
    • Supports SQL client tools connecting through JDBC and ODBC
    • Using the Redshift API you can access Redshift with web services-based applications, including AWS Lambda, AWS AppSync, SageMaker notebooks, and AWS Cloud9
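
For web services-based access, the Redshift Data API is the usual route, since it avoids managing JDBC/ODBC connections from Lambda or notebooks. A sketch with hypothetical cluster, database, user, and SQL:

```python
import boto3

redshift_data = boto3.client("redshift-data")

response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",   # hypothetical cluster
    Database="dev",
    DbUser="analyst",
    Sql="SELECT product_id, SUM(amount) FROM sales GROUP BY product_id LIMIT 10;",
)
print(response["Id"])  # statement ID; results are fetched later with get_statement_result
```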


Analysis and Visualization Services - Athena and Glue

  • Glue crawls your data, categorizes and cleans it, and moves it to your data store
  • Athena queries data sources that are registered with AWS Glue Data Catalog
    • Running Data Manipulation Language (DML) queries in Athena via the Data Catalog uses the Data Catalog schema to derive insight from the underlying dataset
    • Run a Glue crawler on a data source from within Athena to create a schema in the Glue Data Catalog


Analysis and Visualization Services - Redshift Spectrum

  • Query data directly from files on S3
  • Need a Redshift cluster and SQL client connected to your cluster to execute SQL commands
  • Visualize data via QuickSight
  • Cluster and data in S3 must be in the same region


Subdomain 2: Determine Analysis Solution


Analytics Scenarios and their Solutions

  • Evaluate an analytics scenario and suggest the best analysis and visualization solution
    • Select the best type of analysis
    • Compare analysis solutions and select the best, including the methods, tools, and technologies

Types of Analysis

  • Given collected data characteristics and the analysis goal
    • Descriptive analysis
      • Determine what generated the data
      • Highest effort
    • Diagnostic analysis
      • Determine why data was generated
      • Understand root causes of events
    • Predictive analysis
      • Determine future outcomes
      • Uses descriptive and diagnostic analysis to predict future trends
    • Prescriptive analysis
      • Determine action to take
      • Uses other three to predict and can be automated


Analysis Solutions

  • Identify analytics processing method based on data type collected and analysis type used
    • Batch analytics
      • Large volumes of raw data
      • Analytics process on a schedule, reports
      • Map-reduce type services: EMR
    • Interactive analytics
      • Complex queries on complex data at high speed
      • See query results immediately
      • Athena, Elasticsearch, Redshift
    • Streaming analytics
      • Analysis of data that has short shelf-life
      • Incrementally ingest data and update metrics
      • Kinesis


EMR, Kinesis, and Redshift Patterns


Analytics Solutions Patterns

  • Select the best option for a scenario based on the type of analytics and processing required
    • Patterns of use
      • EMR
      • Kinesis
      • Redshift


Analytics Solutions Patterns - EMR

  • Uses the map-reduce technique to reduce large processing problems into small jobs distributed across many nodes in a Hadoop cluster
    • On-demand big data analytics
    • Event-driven ETL
    • Machine Learning predictive analytics
    • ClickStream analysis
    • Load data warehouses
  • Don’t use for transactional processing or with small data sets


Analytics Solutions Patterns - Kinesis

  • Streams data to analytics processing solutions
    • Video analytics applications
    • Real-time analytics applications
    • Analyze IoT device data
    • Blog posts and article analytics
    • System and application log analytics
  • Don’t use for small-scale throughput or with data with longer shelf-life


Analytics Solutions Patterns - Redshift

  • Online analytical processing (OLAP) using business intelligence tools
    • Near real-time analysis of millions of rows of manufacturing data generated by continuous manufacturing equipment
    • Analyze events from mobile app to gain insight into how users use the applications
    • Gain value and insights from large, complex, and dispersed datasets
    • Make live data generated by range of next-gen security solutions available to large numbers of organizations for analysis
  • Don’t use for OLTP or with small data sets


Subdomain 3: Determine Visualization Solution


Visualization Methods

  • Understand refresh schedule, delivery method, and access method based on a scenario
  • Use data from heterogeneous data sources in visualization solution
  • QuickSight
    • Filtering
    • Sorting
    • Drilling down

Refresh Schedule - Real-time Scenarios

  • Scenarios ask for the appropriate refresh schedule based on data freshness requirements
  • Real-time scenarios, typically using Elasticsearch and Kibana
    • Refresh_Interval in Elasticsearch domain updates the domain indices; determines query freshness
    • Default is every second, which is expensive, so make sure the shards are evenly distributed
      • *Formula: Number of shards for index = k * (number of data nodes), where k is the number of shards per node (worked example below)*
    • Balance refresh rate cost with decision making needs
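
Plugging hypothetical numbers into that formula:

```python
data_nodes = 6         # data nodes in the Elasticsearch domain
k = 2                  # target shards per node
print(k * data_nodes)  # 12 shards for the index, spread evenly across the 6 data nodes
```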


Refresh Schedule - Interactive Scenarios

  • Interactive scenarios, typically ad-hoc exploration using QuickSight
    • Refresh SPICE data
      • Refresh a data set on a schedule, or you can refresh your data by refreshing the page in an analysis or dashboard
      • Use the CreateIngestion API operation to refresh data
    • If a data set is based on a direct query and not stored in SPICE, refresh the data by opening the data set

You can use Refresh Now or schedule a refresh in QuickSight (see the sketch below)
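
Triggering a SPICE refresh programmatically uses the CreateIngestion operation mentioned above; a sketch with hypothetical account and dataset IDs:

```python
import uuid
import boto3

quicksight = boto3.client("quicksight")

# Equivalent to "Refresh Now": start an on-demand SPICE ingestion for a dataset.
quicksight.create_ingestion(
    AwsAccountId="123456789012",    # hypothetical account ID
    DataSetId="sales-dataset-id",   # hypothetical dataset ID
    IngestionId=str(uuid.uuid4()),  # unique ID for this refresh run
)
```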


Refresh Schedule - EMR Notebooks

  • Using EMR Notebooks to query and visualize data
    • Data refreshed every time the notebook is run
      • Expensive process, balance refresh rate cost with decision making needs


QuickSight Visual Data

QuickSight Visual Data


AWS Elasticsearch Service

AWS Elasticsearch Service

Quiz


When building an operational analytics solution, which analytics service is best?

A. QuickSight
B. EMR
C. Kinesis
D. Elasticsearch
E. Redshift

D.

Category Use Case Analytics Service
Analytics Interactive analytics Athena
Big data processing EMR
Data warehousing Redshift
Real-time analytics Kinesis
Operational analytics Elasticsearch
Dashboards and visualizations QuickSight
Data movement Real-time data movement Managed Streaming for Kafka (MSK)
Kinesis Data Streams
Kinesis Data Firehose
Kinesis Data Analytics
Kinesis Video Streams
Glue
Data Lake Object storage S3, Lake Formation
Backup and archive S3 Glacier, AWS Backup
Data catalog Glue, Lake Formation
Third-party data AWS Data Exchange
Predictive analytics and machine learning Frameworks and interfaces AWS Deep Learning AMIs
Platform SageMaker

Which data analytics solution is best for data collection, processing, and analysis on streaming data loaded directly into data lake, data stores, or analytics services?

A. MSK
B. Kinesis
C. EMR
D. Redshift
E. Elasticsearch

A.

Use Cases - Real Time Analytics with MSK

  • Data collection, processing, analysis on streaming data loaded directly into data lake, data stores, analytics services
  • Managed Streaming for Kafka (MSK)
    • Uses Apache Kafka to process streaming data
      • To and from Data lake and databases
      • Power machine learning and analytics applications
  • MSK provisions and runs Apache Kafka cluster
  • Monitors and replaces unhealthy nodes
  • Encrypts data at rest


Which data analytics and visualization service allows you to execute code in response to triggers such as changes in data, changes in system state, or actions by users?

A. MSK
B. EMR
C. QuickSight
D. Lambda
E. Athena

D.

Lambda

  • Run code without provisioning or managing servers
  • Usage patterns
    • Execute code in response to triggers such as changes in data, changes in system state, or actions by users
    • Real-time File Processing and stream
    • Process AWS Events
    • Replace Cron
    • ETL
  • Cost: charged by the number of requests to functions and code execution time
    • $0.20 per 1 million requests
    • Code execution $0.00001667 for every GB-second used
  • Performance
    • Process events within milliseconds
    • Latency higher for cold-start
    • Retain a function instance and reuse it to serve subsequent requests, versus creating new copy

Which data analytics and visualization service is an interactive query service used to analyze data in S3 using Presto and standard SQL?

A. EMR
B. Kinesis
C. Athena
D. Elasticsearch

C.

Athena

  • Interactive query service used to analyze data in S3 using Presto and standard SQL
  • Usage patterns
    • Interactive ad hoc querying for web logs
    • Query staging data before loading into Redshift
    • Send AWS services logs to S3 for Analysis with Athena
    • Integrate with Jupyter, Zeppelin
  • Cost
    • $5 per TB of query data scanned
    • Save on per-query costs and get better performance by compressing, partitioning, and converting data into columnar formats
  • Performance
    • Compressing, partitioning, and converting your data into columnar formats
    • Convert data to columnar formats, allowing Athena to read only the columns it needs to process queries

Which data analytics and visualization service provides job status and pushes notifications to CloudWatch events?

A. Lambda
B. Glue
C. EMR
D. Athena
E. Redshift

B.


Which data analytics and visualization solution should you not use when you have joins or complex transactions?

A. EMR
B. Redshift
C. Kibana
D. DynamoDB

D.

DynamoDB

  • Fault tolerance: automatically synchronously replicates data across 3 data centers in a region
  • Durability and availability
    • Protection against individual machine or facility failures
    • DynamoDB Streams allows replication across regions
    • Streams enables table data activity replicated across geographic regions
  • Scalability and elasticity
    • No limit to data storage, automatic storage allocation, automatic data partition
  • Interfaces
    • REST API allows management and data interface
    • DynamoDB select operation creates SQL-like queries
  • Anti-patterns
    • Port application from relational database
    • Joins or complex transactions
    • Large data with low I/O rate
    • Blob data > 400KB, S3 is the better choice

Which of the data analytics and visualization solutions should you not use with unstructured data?

A. Redshift
B. EMR
C. Glue
D. Athena

A.

Unstructured data is queried through Redshift Spectrum rather than Redshift itself.

Redshift

  • Automatically detects and replaces a failed node in your data warehouse cluster
  • Durability and availability
    • On node failure the cluster is read-only until the replacement node is provisioned and added to the DB
    • Cluster remains available on drive failure; Redshift mirrors your data across the cluster
  • Scalability and elasticity
    • With API change the number, or type, of nodes while cluster remains live
  • Interfaces
    • JDBC and ODBC drivers for use with SQL clients
    • S3, DynamoDB, Kinesis, BI tools such as QuickSight
  • Anti-patterns
    • Small data sets
    • Online transaction processing (OLTP)
    • Unstructured data
    • Blob data: S3 is the better choice

Which of the data analytics and visualization solutions should you not use for enterprise Reporting and Business Intelligence Workloads?

A. Athena
B. Redshift
C. Elasticsearch
D. Kibana

A.

Which of the data analytics and visualization solutions should you not use for ad hoc data querying?

A. Athena
B. Glue
C. Redshift
D. Elasticsearch

D.

Elasticsearch

  • Zone Awareness: distributes domain instances across two different AZs
  • Durability and availability
    • Zone Awareness gives you cross-zone replication
    • Automated and manual snapshots for durability
  • Scalability and elasticity
    • Use API and CloudWatch metrics to automatically scale up/down
  • Interfaces
    • S3, Kinesis, DynamoDB Streams, Kibana
    • Lambda function as an event handler
  • Anti-patterns
    • Online transaction processing (OLTP): RDS is the better choice
    • Ad hoc data querying: Athena is the better choice


Which of the data analytics and visualization services should you not use for highly formatted canned Reports?

A. Kibana
B. QuickSight
C. Athena
D. Glue

B.

QuickSight

  • SPICE automatically replicates data for high availability
  • Durability and availability
    • Scale to hundreds of thousands of users
    • Simultaneous analytics across AWS data sources
  • Scalability and elasticity
    • Fully managed service, scale to terabytes of data
  • Interfaces
    • RDS, Aurora, Redshift, Athena, S3
    • SaaS, applications such as Salesforce
  • Anti-patterns
    • Highly formatted canned reports: QuickSight is better for ad hoc query, analysis, and visualization of data
    • ETL: Glue is the better choice

Which visualization type is a single value that conveys how well you are doing in an area or function?

A. Relationships
B. KPIs
C. Compositions
D. Comparisons
E. Distributions

B.


Which visualization type should you use when trying to either establish or prove whether a relationship exists between 2 or more variables?

A. Compositions
B. Comparisons
C. Distributions
D. Relationships
E. KPIs

D.


Which visualization type should you use when trying to show or examine how different variables change over time or provide a static snapshot of how different variables compare?

A. Relationships
B. Compositions
C. KPIs
D. Distributions
E. Comparisons

E.


Which visualization type should you use when trying to show how your data is distributed over certain intervals where interval implies clustering or grouping of data, and not time?

A. Compositions
B. Comparisons
C. KPIs
D. Relationships
E. Distributions

E.


Which visualization type should you use when you want to highlight the various elements that make up your data - its composition; static or if it is changing over time?

A. KPIs
B. Relationships
C. Distributions
D. Compositions
E. Comparisons

D.


Which type of analysis allows you to determine what generated the data?

A. Descriptive analysis
B. Diagnostic analysis
C. Predictive analysis
D. Prescriptive analysis

A.


Which type of analysis allows you to determine why data was generated?

A. Descriptive analysis
B. Prescriptive analysis
C. Predictive analysis
D. Diagnostic analysis

D.


Which type of analysis allows you to determine future outcomes?

A. Prescriptive analysis
B. Predictive analysis
C. Diagnostic analysis
D. Descriptive analysis

B.


Which type of analysis allows you to determine the action to take?

A. Diagnostic analysis
B. Descriptive analysis
C. Predictive analysis
D. Prescriptive analysis

D.


Which data analytics and visualization solution is an open-source data visualization and exploration tool used for log and time-series analytics, application monitoring, and operational intelligence use cases?

A. QuickSight
B. Kibana
C. Glue
D. Athena

B.


Domain 5: Security


Introduction

Data Analytics Lifecycle

Security is critical across all phases of the Data Analytics Lifecycle


Security in the Data Analytics Pipeline

  • Security spans from the ingestion phase through the visualization phase
  • Protect data that needs to remain private
  • Remain in compliance with regulations such as PII security standards
  • Authenticate and authorize users of your analytics solutions
  • Three subdomains
    • Subdomain 1: Determine Authentication & Authorization Mechanisms
    • Subdomain 2: Apply Data Protection & Encryption Techniques
    • Subdomain 3: Apply Data Governance & Compliance Controls


Subdomain 1: Determine Authentication & Authorization Mechanisms
  • Use IAM to authenticate your requests
  • Use IAM identity and resource-based permissions to authorize access to your data analytics solution
  • Build security around your network and the physical boundary of your solution through network isolation


Subdomain 2: Apply Data Protection & Encryption Techniques
  • Encrypt your data and use tokenization
    • S3 encryption approaches
    • Tokenization vs. encryption
  • Management of secret data on AWS
    • Secrets Manager
    • Systems Manager Parameter Store


Subdomain 3: Apply Data Governance & Compliance Controls
  • Use Artifact to gain insight from AWS security and compliance reports
  • Document and store activity within your analytics solution using CloudTrail and CloudWatch
  • Enforce compliance to your security rules using AWS Config


Subdomain 1: Determine Authentication & Authorization Mechanisms


Authentication vs. Authorization

Authentication vs. Authorization

Authentication and authorization might sound similar, but they are distinct security processes in the world of identity and access management (IAM).

  • Authentication confirms that users are who they say they are.
  • Authorization gives those users permission to access a resource.


IAM Authentication

AWS IAM

  • To set permissions for an identity in IAM, choose an AWS managed policy, a customer managed policy, or an inline policy
    • AWS managed policy
      • Standalone policy that is created and administered by AWS
      • Provide permissions for many common use cases: full, power-user, and partial access
    • Customer managed policy
      • Standalone policies that you administer in your AWS account
    • Inline policy: strict one-to-one relationship of policy to identity
      • Policy embedded in an AWS identity (a user, group, or role)


IAM Authorization

  • Use IAM identity and resource-based permissions to authorize access to analytics resources
  • Policy is an object in AWS you associate with an identity or resource to define the identity or resource permissions
  • To use a permissions policy to restrict access to a resource you choose a policy type
    • Identity-based policy
    • Resource-based policy


Identity-based policy & Resource-based policy
  • Identity-based policy
    • Attach managed and inline policies to IAM identities (users, groups to which users belong, or roles). Identity-based policies grant permissions to an identity.
    • Attached to an IAM user, group, or role
    • Specify what an identity can do (its permissions)
    • Example: attach a policy to a user that allows that user to perform the EC2 RunInstances action
  • Resource-based policy
    • Attach inline policies to resources. The most common examples of resource-based policies are Amazon S3 bucket policies and IAM role trust policies. Resource-based policies grant permissions to the principal that is specified in the policy. Principals can be in the same account as the resource or in other accounts.
    • Attached to a resource
    • Specify which users have access to the resource and what actions they can perform on it
    • Example: attach resource-based policies to S3 bucket, SQS queues, and Key Management Service encryption keys


Identity-Based Policy Example

Attach managed and inline policies to IAM identities (users, groups to which users belong, or roles). Identity-based policies grant permissions to an identity.
This policy uses tags to identify the resources; an illustrative sketch follows.
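
An illustrative identity-based policy of that kind, written here as a Python dict (the account ID, tag key, and actions are hypothetical, not the course's example): it allows starting and stopping only EC2 instances tagged Team=analytics.

```python
# Hypothetical identity-based policy: attach to a user, group, or role.
identity_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["ec2:StartInstances", "ec2:StopInstances"],
        "Resource": "arn:aws:ec2:*:123456789012:instance/*",
        # Only instances carrying the Team=analytics resource tag match this statement.
        "Condition": {"StringEquals": {"aws:ResourceTag/Team": "analytics"}},
    }],
}
```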


Resource-Based Policy Example

Attach inline policies to resources. Principals can be in the same account as the resource or in other accounts.


Services and IAM

  • Access: grant access to all resources or restrict it to a specific ARN
  • Resource-level permissions: use an ARN to specify individual resources in your policy
  • Resource-based policies: specify a Principal in a policy attached to the resource
  • Auth based on tags: control access based on resource tags
  • Temporary credentials: users who sign in with federation, a cross-account role, or a service role can access the service
  • Service-linked roles: allow an AWS service to perform actions in your account on your behalf


Security Use Cases

  • IAM best practices of how to apply IAM and other access controls
Use Case IAM Access
Use the AWS CLI to create resources in AWS account IAM User
Create Glue crawlers and ETL jobs that access S3 buckets IAM Role
Single sign on into corporate network and applications IAM Role
Web application that needs to access DynamoDB through a REST endpoint Cognito

Subdomain 2: Apply Data Protection & Encryption Techniques


Using VPCs to isolate your resources


EMR - Network Security

AWS EMR Network Security


Apply Data Protection and Encryption Techniques


EMR Encryption and Tokenization

EMR Encryption and Tokenization
Encryption at Rest and In Transit EMR


Secrets Management
  • Secrets Manager
    • Use secrets in application code to connect to databases, APIs, and other resources
    • Provides rotation, audit, and access control
  • Systems Manager Parameter Store
    • Centralized store to manage configuration data
    • Plain-text data such as database strings or secrets such as passwords
    • Does not rotate stored parameters automatically (see the retrieval sketch below)
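
Retrieving values from each service looks roughly like this (the secret and parameter names are hypothetical):

```python
import boto3

secrets = boto3.client("secretsmanager")
ssm = boto3.client("ssm")

# Secrets Manager: rotated, audited secret such as a database password.
password = secrets.get_secret_value(SecretId="prod/analytics/db-password")["SecretString"]

# Parameter Store: configuration value (SecureString decrypted on read).
jdbc_url = ssm.get_parameter(Name="/analytics/jdbc-url", WithDecryption=True)["Parameter"]["Value"]
```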


Subdomain 3: Apply Data Governance & Compliance Controls


Compliance and Governance

PCI
HIPAA
Privacy Shield
COALFIRE
AICPA

  • Identify required compliance frameworks (HIPAA, PCI, etc.)
  • Understand contractual and agreement obligations
  • Monitor policies, standards, and security controls to respond to events and changes in risk
  • Services to create compliant analytics solution
    • AWS Artifact: provides on-demand access to AWS compliance and security related information
    • CloudTrail and CloudWatch: enable governance, compliance, operational auditing, and risk auditing of AWS account
    • AWS Config: ensure AWS resources conform to your organization’s security guidelines and best practices

AWS Artifact

  • Self-service document retrieval portal
  • Manage AWS agreements: NDAs, Business Associate Addendum (BAA)
  • Produce security controls documentation: provide AWS artifact compliance documents to regulators and auditors
  • Download AWS security and compliance documents, such as AWS ISO certifications, PCI, and SOC report


CloudWatch and CloudTrail

  • CloudTrail tracks actions in your AWS account by recording API activity
  • CloudWatch monitors how resources perform, collecting application metrics and log information
  • Simplify security analysis, resource change tracking, and troubleshooting by combining event history monitoring via CloudTrail and resource history via CloudWatch
  • Use CloudTrail to audit and review API calls and detect security anomalies
  • Use CloudWatch to create alert rules that trigger SNS notifications or Lambda functions that run in response to a security or risk event


AWS Config

  • AWS resource inventory, configuration history, and configuration change notifications that enable security and governance
  • Discover existing AWS resources
  • Export complete inventory of account AWS resources with all configuration details
  • Determine how a resource was configured at any point in time
  • Config Rules: represent desired configurations for a resource and is evaluated against configuration changes on the relevant resources, as recorded by AWS Config
  • Assess overall compliance and risk status from a configuration perspective, view compliance trends over time and pinpoint which configuration change caused a resource to drift out of compliance with a rule


Quiz


With which data analytics service can you only conditionally use authentication based on tags?

A. EMR
B. Elasticsearch
C. Glue
D. Kinesis Data Analytics

C.


Which policy type specifies what an identity can do (its permissions)?

A. Identity-based policy
B. Resource-based policy
C. User-based policy
D. Group-based policy

A.


Which policy type specifies which users have access to the resource and what actions they can perform on it?

A. Identity-based policy
B. Resource-based policy
C. User-based policy
D. Group-based policy

B.


Rules that EMR creates in __ security groups allow the cluster to communicate among internal components

A. Additional
B. Customer
C. Default
D. Managed

D.


You can specify security groups when you create a cluster; you can also add them to a cluster or cluster instances while a cluster is running

A. True
B. False

B.


How do you give your EMR cluster running in a private subnet direct access to data in Amazon S3

A. use an ENI
B. use an S3 Gateway Endpoint
C. use an S3 Interface Endpoint
D. use a NAT Instance

B.


Use __ if you require your keys stored in dedicated, third-party validated hardware security modules under your exclusive control

A. Encryption at rest
B. Encryption in transit
C. HSM
D. Secrets Manager

C.


Use __ to prevent unauthorized users from reading data on a cluster and associated data storage systems

A. Encryption at rest
B. Encryption in transit
C. HSM
D. Tokenization

A.

Amazon S3 encryption works with EMR File System (EMRFS) objects read from and written to Amazon S3. Specify server-side encryption (SSE) or client-side encryption (CSE).


Use __ to protect certain elements in the data that contains high sensitivity or a specific regulatory compliance requirement, such as PCI

A. HSM
B. Encryption at rest
C. Encryption in transit
D. Tokenization

D.


__ provides rotation, audit, and access control

A. Systems Manager Parameter Store
B. Secrets Manager
C. HSM
D. IAM

B.


What provides on-demand access to AWS compliance and security related information

A. CloudTrail
B. CloudWatch
C. AWS Artifact
D. AWS Config

C.


Manage AWS agreements: NDAs, Business Associate Addendum (BAA)

A. AWS Config
B. AWS Artifact
C. CloudTrail
D. CloudWatch

B.


Ensures that AWS resources conform to your organization’s security guidelines and best practices

A. CloudTrail
B. CloudWatch
C. AWS Config
D. AWS Artifact

C.


Allows you to assess overall compliance and risk status from a configuration perspective, view compliance trends over time and pinpoint which configuration change caused a resource to drift out of compliance with a rule

A. AWS Config
B. CloudWatch
C. CloudTrail
D. AWS Artifact

A.


Study Case

Your company uses Amazon EMR clusters to analyze its workloads on a regular basis. The data science team leaves these EMR clusters running even after the task completes. The idle resources cause unnecessary costs for the company and the CFO is noticing a rise in resource costs. She wants you to find a solution to terminate the idle clusters.
What would you suggest as the BEST, viable automated solution?

Configure an AWS CloudWatch metrics alarm on the IsIdle metric for the EMR cluster to send a message to Amazon SNS. Subscribe an AWS Lambda function to a topic that terminates the EMR cluster

is correct. This provides the most effective and automated solution. It uses native AWS services to complete the job in an efficient manner.

Optimize Amazon EMR costs with idle checks and automatic resource termination using advanced Amazon CloudWatch metrics and AWS Lambda


A security team for a company aggregates log data from VPC Flow Logs, Amazon CloudWatch Logs, and AWS CloudTrail logs. The team is building a solution that allows them to use a separate AWS account to analyze daily logs from all accounts within the Organization. Aggregated data totals over 150 GB of data. The team needs the data to be kept and stored but only needs the last 3 months of data for monitoring and analysis.
What is the most cost-effective solution for these requirements?

Create an S3 bucket in one AWS account to collect logs from all accounts. Use the AWS Lambda event trigger to have AWS Lambda add indexes for the logs to an Amazon Elasticsearch Service cluster in the security team’s account. Drop Elasticsearch indexes older than 3 months.

is correct. It’s good security practice to aggregate logs into one account. This limits the blast area and makes analysis more efficient. The solution uses Lambda as an event-driven architecture, and Elasticsearch for querying. Data stored in S3 will persist and meets this business requirement even after an Elasticsearch index is removed.

Stream Amazon CloudWatch Logs to a Centralized Account for Audit and Analysis


An insurance company that provides coverage collects data for their customers including personal identifiable information. The application for the company allows customers to upload files into their data store to submit insurance claims and personal data. The company using Amazon S3 as the data store and the application writes data directly into S3. The company then uses AWS Lambda to process these forms and further stores object data into Amazon Redshift for storage and analytics. The company is concerned about malware since customers upload files from their own sources. They are looking for a virus scan solution with AWS before the uploaded files are processed by Lambda.
What would you suggest to accomplish this?

As this is currently not supported in AWS, look for a third-party solution on AWS Marketplace

is correct. AWS currently does not have any services to natively support virus scans on Amazon S3. The best solution would be to look for a third-party solution on the AWS Marketplace.

virus scan


A biotech company provides a SaaS solution for their big pharmaceutical clients. The clients use their platform to share healthcare research data and medical information. These data are critical as it allows clients to share knowledge and findings to be able to provide health care solutions for the public. However, the clients need to retain access to their individual data sources for regulatory and compliance requirements. Thus a multi-tenant solution is required with minimal administrative work.

The company uses Amazon EMR clusters to run data analytics for their clients. How would they architect a solution to provide a multi-tenant solution for their clients?

Launch a Kerberos-enabled EMRFS (file system) cluster that enables a one-way trust to an Active Director domain.

is correct. A Kerberos-enabled cluster can manage the Key Distribution Center (KDC) on the master node, and enable a one-way trust from the KDC to a Microsoft Active Directory domain. This is the first step in utilizing this architecture.

The application must support accessing Amazon S3 via EMRFS

is correct. This is the other requirement to support a multi-tenant architecture on Amazon EMR. EMRFS uses S3 as its default interface thus it will respect the IAM roles configured for EMRFS.


You have an application using Amazon Athena’s APIs to allow users to execute GET requests for a web application. The application uses Amazon Athena since it is a fully-managed service that doesn’t require a long-running database. Users rarely use this service but there are times of high burst requests. During a period of high usage, you see in the Amazon CloudWatch logs a “ClientError: An error occurred (Throttling Exception)”
What would be the most effective way to resolve this issue?

Request a quota increase to increase the burst capacity of API calls

is correct. The default burst capacity limit is up to 80 calls. Since this application is used sporadically and occasionally experiences burst usage, the easiest way would be to request this service quota increase.

Service Quotas


A company uses Amazon Redshift as a central data store and to run analytics. Five teams are using the Redshift cluster for queries. The total members across all teams is 200. Some teams’ queries are complex and CPU-intensive, while other teams’ queries are short, fast-running queries. The team that administers the data warehouse cluster is tasked with optimizing queries so that multiple teams’ access don’t affect the overall performance.
What can the team do to ensure the short, fast-running queries don’t get stuck behind long-running queries so that performance is optimized efficiently?

Set up a workload management configuration configured with five queues and a concurrency level of one.

is correct. You can configure up to 8 queues with each queue having a maximum concurrency level of 50. In this case, there are five teams so it would be most efficient to assign each team a queue. In this way you can have five queries running concurrently without one query impacting the performance of the other.

Workload management


As a security administrator for a biotech research company, you are in charge of securing a data lake architecture solution for your firm. Many different users from many different business units will be tapping into the data lake for complex and simple queries, analysis, visualizations, reporting, and dashboards. You need a centralized way to restrict permissions so that the appropriate users have access to only the parts of the data lake they need. Users will be accessing different parts of the data lake including databases, tables, and columns. The firm requires the use of Amazon Redshift as a centralized data warehouse and users will need to be able to join data in Amazon Redshift to objects in Amazon S3.
What would be the most effective way to implement this for the various users within the various teams?

Use AWS Lake Formation as the central database. Grant and revoke permissions to various data sources through Lake Formation. Use Amazon Redshift Spectrum to run queries on external tables in Amazon S3.

is correct. This is the best solution because AWS Lake Formation is the solution for creating a central data store which you can grant access to databases, tables, and columns to data stored in Amazon S3. You can use Amazon Redshift Spectrum to run queries in Amazon S3.

Using Redshift Spectrum with AWS Lake Formation


You are a data scientist working for a financial services company that has several relational databases, data warehouses, and NoSQL databases that hold transactional information about their financial trades and operational activities. The company wants to manage their financial counterparty risk through using their real-time trading/operational data to perform risk analysis and build risk management dashboards.
You need to build a data repository that combines all of these disparate data sources so that your company can perform their Business Intelligence (BI) analysis work on the complete picture of their risk exposure.
What collection system best fits this use case?

Financial data sources data -> Database Migration Service -> S3 -> Glue -> S3 Data Lake -> Athena

This type of data collection infrastructure is best used for streaming transactional data from existing relational data stores. You create a task within the Database Migration Service that collects ongoing changes within your various operational data stores, an approach called ongoing replication or change data capture (CDC). These changes are streamed to an S3 bucket where a Glue job is used to transform the data and move it to your S3 data lake.


You are a data scientist working on a project where you have two large tables (orders and products) that you need to load into Redshift from one of your S3 buckets. Your table files, which are both several million rows large, are currently on an EBS volume of one of your EC2 instances in a directory titled $HOME/myredshiftdata.
Since your table files are so large, what is the most efficient approach to first copy them to S3 from your EC2 instance?

Load your orders.tbl and products.tbl by first getting a count of the number of rows in each using the commands: ‘wc -l orders.tbl’ and ‘wc -l products.tbl’. Then splitting each tbl file into smaller parts using the command: ‘split -d -l # -a 4 orders.tbl orders.tbl-’ and ‘split -d -l # -a 4 products.tbl products.tbl-’ where # is replaced by the result of your wc command divided by 4

is correct because you have used the wc command to find the number of rows in each tbl file, and you have used the split command with the trailing ‘-’ to get the proper file name format on your S3 bucket in preparation for loading into Redshift.

Tutorial: Loading data from Amazon S3


You are working on a project where you need to perform real-time analytics on your application server logs. Your application is split across several EC2 servers in an auto-scaling group and is behind an application load balancer as depicted in this diagram:

You need to perform some transformation on the log data, such as joining rows of data, before you stream the data to your real-time dashboard.
What is the most efficient and performant solution to aggregate your application logs?

Install the Kinesis Agent on your application servers to watch your logs and ingest the log data. Write a Kinesis Data Analytics application that reads the application log data from the agent, performs the required transformations, and pushes the data into your Kinesis data output stream. Use Kinesis Data Analytics queries to build your real-time analytics dashboard

The Kinesis Agent ingests the application log data, the Kinesis Analytics application transforms the data, and Kinesis Analytics queries are used to build your dashboard.

Implement Serverless Log Analytics Using Amazon Kinesis Analytics


You are a data scientist on a team where you are responsible for ingesting IoT streamed data into a data lake for use in an EMR cluster. The data in the data lake will be used to allow your company to do business intelligence analytics on the IoT data. Due to the large amount of data being streamed to your application you will need to compress the data on the fly as you process it into your EMR cluster.
How would you most efficiently collect the data from your IoT devices?

Use the AWS IoT service to get the device data from the IoT devices, use Kinesis Data Firehose to stream the data to your data lake, then use S3DistCp to move the data from S3 to your EMR cluster

The AWS IoT service ingests the device data, Kinesis Data Firehose streams the data to your S3 data lake, then the S3DistCp command is used to compress and move the data into your EMR cluster.
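
For illustration, an S3DistCp step could be added to the cluster with boto3 roughly as follows (the cluster ID, bucket, and paths are placeholders):

import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
    Steps=[{
        "Name": "Copy IoT data from S3 to HDFS with compression",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src=s3://my-iot-data-lake/raw/",
                "--dest=hdfs:///iot/raw/",
                "--outputCodec=gz",  # compress the files as they are copied
            ],
        },
    }],
)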


You are a data scientist working for a mobile gaming company that is developing a new mobile gaming app that will need to handle thousands of messages per second arriving in your application data store. Due to the user interactivity of your game, all changes to the game datastore must be recorded with a before-change and after-change view of the data record. These data store changes will be used to deliver a near-real-time usage dashboard of the app for your management team.
What application collection system infrastructure best delivers these capabilities in the most performant and cost effective way?

DynamoDB -> DynamoDB Streams -> Lambda -> Kinesis Firehose -> Redshift -> QuickSight

Your application will write its game activity data to your DynamoDB table, which will have DynamoDB Streams enabled. DynamoDB Streams will record both the new and old (or before and after) images of any item in the DynamoDB table that is changed. Your Lambda function will be triggered by DynamoDB Streams and will use the Firehose client to write to your Firehose stream. Firehose will stream your data to Redshift, and QuickSight will visualize your data in near-real-time.
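
A minimal sketch of what that Lambda function might look like, assuming the stream view type is NEW_AND_OLD_IMAGES and using a hypothetical delivery stream name:

import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # Triggered by DynamoDB Streams; forwards before/after images to Kinesis Data Firehose.
    records = []
    for rec in event["Records"]:
        change = {
            "eventName": rec["eventName"],
            "old": rec["dynamodb"].get("OldImage"),  # before image
            "new": rec["dynamodb"].get("NewImage"),  # after image
        }
        records.append({"Data": (json.dumps(change) + "\n").encode("utf-8")})
    if records:
        firehose.put_record_batch(
            DeliveryStreamName="game-activity-stream",  # placeholder stream name
            Records=records,
        )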


You are a data scientist working for an online retail electronics chain. Their website receives very heavy traffic during certain months of the year, but these heavy traffic periods fluctuate over time. Your firm wants to get a better understanding of these patterns. Therefore, they have decided to build a traffic prediction machine learning model based on click-stream data.
Your task is to capture the click-stream data and store it in S3 for use as training and inference data in the machine learning model. You have built a streaming data capture system using Kinesis Data Streams and its Kinesis Producer Library (KPL) for your click-stream data capture component. You are using collection batching in your KPL code to improve performance of your collection system. Exception and failure handling is very important to your collection process, since losing click-stream data will compromise the integrity of your machine learning model data.
How can you best handle failures in your KPL component?

With the KPL PutRecords operation, if a put fails, the record is automatically put back into the KPL buffer and retried.

You would use the Kinesis Producer Library (KPL) PutRecords method in your KPL code to send click-stream records into your Kinesis Data Streams stream. KPL PutRecords automatically adds any failed records back into the KPL buffer so they can be retried.
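
The KPL itself is a Java library, so purely as an illustration of the analogous retry-failed-records pattern, here is a sketch using the low-level boto3 PutRecords API (the stream name and retry count are arbitrary):

import boto3

kinesis = boto3.client("kinesis")

def put_with_retry(records, stream_name="clickstream", attempts=3):
    # records is a list of {"Data": bytes, "PartitionKey": str} entries.
    for _ in range(attempts):
        resp = kinesis.put_records(StreamName=stream_name, Records=records)
        if resp["FailedRecordCount"] == 0:
            return
        # Keep only the entries whose result carries an ErrorCode and retry just those.
        records = [rec for rec, result in zip(records, resp["Records"])
                   if "ErrorCode" in result]
    raise RuntimeError(f"{len(records)} click-stream records still failing after retries")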


You are a data scientist working for a medical services company that has a suite of apps available for patients and their doctors to share their medical data. These apps are used to share patient details, MRI and XRAY images, appointment schedules, etc. Because of the importance of this data and its inherent Personally Identifiable Information (PII), your data collection system needs to be secure and the system cannot suffer lost data, process data out of order, or duplicate data.
Which data collection system(s) gives you the security and data integrity your requirements demand? (SELECT 2)

SQS (FIFO)
SQS in the FIFO mode guarantees the correct order of delivery of your data messages and it uses the “exactly-once” delivery method. Exactly-once means that all messages will be delivered exactly one time. No message losses, no duplicate data.

DynamoDB Streams
DynamoDB Streams guarantees the correct order of delivery of your data messages and it uses the “exactly-once” delivery method. Exactly-once means that all messages will be delivered exactly one time. No message losses, no duplicate data.
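
For reference, a producer writing to an SQS FIFO queue looks roughly like this with boto3; the queue URL, message body, and IDs are placeholders:

import json
import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/patient-events.fifo",
    MessageBody=json.dumps({"patientId": "p-1001", "event": "appointment-updated"}),
    MessageGroupId="p-1001",  # ordering is preserved within a message group
    MessageDeduplicationId="p-1001-2024-01-01T10:00:00Z",  # or enable content-based deduplication
)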


You work for a ski resort corporation. Your company is developing a lift ticket system for mobile devices that allows skiers and snowboarders to use their phone as their lift ticket. The ski resort corporation owns many resorts around the world. The lift ticketing system needs to handle users who move from resort to resort throughout any given time period. Resort customers can also purchase packages where they can ski or snowboard at a defined list (a subset of the total) of several different resorts across the globe as part of their package.


You work for a mobile gaming company that has developed a word puzzle game that allows multiple users to challenge each other to complete a crossword puzzle type of game board. This interactive game works on mobile devices and web browsers. You have a world-wide user base that can play against each other no matter where each player is located.
You now need to create a leaderboard component of the game architecture where players can look at the daily point leaders for the day, week, or other timeframes. Each time a player accumulates points, the points counter for that player needs to be updated in real-time. This leaderboard data is transient in that it only needs to be stored for a limited duration.
Which of the following architectures best suits your data access and retrieval patterns using the simplest, most efficient approach?

Data sources -> Kinesis Data Streams -> Spark Streaming on EMR -> ElastiCache Redis

Kinesis Data Streams is the appropriate streaming solution for gathering the streaming player data and loading it onto your EMR cluster, then using Spark Streaming to transform the data into a format that is efficiently stored in ElastiCache Redis. You can use the Redis INCR and DECR functions to keep track of user points and the Redis Sorted Set data structure to maintain the leader list sorted by player. You can maintain your real-time ranked leader list by updating each user’s score each time it changes.
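
A small sketch of the leaderboard piece using the redis client library (the ElastiCache endpoint and key names are hypothetical):

import redis

r = redis.Redis(host="my-leaderboard.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)

def add_points(player_id, points):
    # ZINCRBY keeps the sorted set ordered by score as points accumulate.
    r.zincrby("daily-leaderboard", points, player_id)

def top_players(n=10):
    # Highest scores first, with each player's score returned alongside the name.
    return r.zrevrange("daily-leaderboard", 0, n - 1, withscores=True)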


You work for a car manufacturer who has implemented many sensors into their vehicles such as GPS, lane-assist, braking-assist, temperature/humidity, etc. These cars continuously transmit their structured and unstructured sensor data. You need to build a data collection system to capture their data for use in ad-hoc analytics applications to understand the performance of the cars, the locations traveled to and from, the effectiveness of the lane and brake assist features, etc. You also need to filter and transform the sensor data based on rules driven by parameters such as temperature readings. The sensor data needs to be stored indefinitely; however, you only wish to pay for the analytics processing when you use it.
Which of the following architectures best suits your data lifecycle and usage patterns using the simplest, most efficient approach?

Sensor data -> IoT Core -> S3 -> Athena

The simplest data collection architecture that meets your data lifecycle and usage patterns uses IoT Core to ingest the sensor data. Also, IoT Core is used to run a rules-based filtering and transformation set of functions. IoT Core then streams the sensor data to S3 where you house your data lake. You then use Athena to run your ad-hoc queries on your sensor data, taking advantage of Athena’s serverless query service so that you only pay for the service when you use it.
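
As a sketch of the rules-based filtering piece, an IoT Core topic rule with an S3 action could be created with boto3 along these lines (the topic filter, temperature threshold, role, and bucket are assumptions):

import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="sensor_data_to_s3",
    topicRulePayload={
        # Filter and reshape sensor messages before they land in the data lake.
        "sql": "SELECT vin, temperature, brake_assist FROM 'vehicles/+/telemetry' WHERE temperature < 120",
        "actions": [{
            "s3": {
                "roleArn": "arn:aws:iam::123456789012:role/IotToS3Role",  # placeholder role
                "bucketName": "vehicle-sensor-data-lake",                  # placeholder bucket
                "key": "${topic()}/${timestamp()}.json",
            }
        }],
    },
)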


You work for a large city police department as a data scientist. You have been given the task of tracking crime by city district for each criminal committing the given crime. You have created a DynamoDB table to track the crimes across your city’s districts. The table has this configuration: for each crime the table contains a CriminalId (the partition key), a CityDistrict, and the CrimeDate on which the crime was reported. Your police department wants to create a dashboard of the crimes reported by district and date.
What is the most cost effective way to retrieve the crime data from your DynamoDB table to build your dashboard of crimes reported by district and date?

Create a global secondary index with CityDistrict as the partition key and CrimeDate as the sort key

Since you are using CityDistrict and CrimeDate to retrieve your dashboard data, and that combination won’t always be unique, a global secondary index is the best choice for this use case: a global secondary index does not require the combination of its key attributes to be unique.

Using Global Secondary Indexes in DynamoDB
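
Adding that index to an existing table could look roughly like this with boto3 (assuming on-demand capacity, so no provisioned throughput is specified for the index; table and index names are placeholders):

import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_table(
    TableName="CityCrimes",  # placeholder table name
    AttributeDefinitions=[
        {"AttributeName": "CityDistrict", "AttributeType": "S"},
        {"AttributeName": "CrimeDate", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "CityDistrict-CrimeDate-index",
            "KeySchema": [
                {"AttributeName": "CityDistrict", "KeyType": "HASH"},  # partition key
                {"AttributeName": "CrimeDate", "KeyType": "RANGE"},    # sort key
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    }],
)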


You work for a large retail and wholesale business with a significant ecommerce web presence. Your company has just acquired a new ecommerce clothing line and needs to build a data warehouse for this new line of business. The acquired ecommerce business sells clothing to a niche market of men’s casual and business attire. You have chosen to use Amazon Redshift for your data warehouse. The data that you’ll initially load into the warehouse will be relatively small. However, you expect the warehouse data to grow as the niche customer base expands once the parent company makes a significant investment in advertising.
What is the most cost effective and best performing Redshift strategy that you should use when you create your initial tables in Redshift?

Use the AUTO distribution key

With the AUTO distribution style, Redshift assigns the best distribution strategy based on the table size, and then changes the distribution strategy as table activity and size demand. So Redshift may initially assign an ALL distribution to your table while it is small, then change it to EVEN as the table grows. When Redshift changes the distribution strategy, the change happens very quickly (a few seconds) in the background.


You are a data scientist working for a multinational conglomerate corporation that has many data stores for which you need to provide a common repository. All of your company’s systems need to use this common repository to store and retrieve metadata to work with the data stored in all of the data silos throughout the organization. You also need to provide the ability to query and transform the data in the organization’s data silos. This common repository will be used for data analytics by your data scientist team to produce dashboards and KPIs for your management team.
You are using AWS Glue to build your common repository as depicted in this diagram:

As you begin to create this common repository you notice that you aren’t getting the inferred schema for some of your data stores. You have run your crawler against your data stores using your custom classifiers. What might be the problem with your process?

The username you provided to your JDBC connection to your Redshift clusters does not have SELECT permission to retrieve metadata from the Redshift data store

For data stores such as Redshift and RDS, you need to use a JDBC connector to crawl these types of data stores. If the username you provide to your JDBC connection does not have the appropriate permissions to access the data store, the connection will fail and Glue will not produce the inferred schema for that data store.


You are a data scientist working for a large transportation company that manages its distribution data across all of its distribution lines: trucking, shipping, airfreight, etc. This data is stored in a data warehouse in Redshift. The company ingests all of the distribution data into an EMR cluster before loading the data into their data warehouse in Redshift. The data is loaded from EMR to Redshift on a schedule, once per day.
How might you lower the operational costs of running your EMR cluster? (Select TWO)

EMR Transient Cluster
EMR Transient Clusters automatically terminate after all steps are complete. This will lower your operational costs by not leaving the EMR nodes running when they are not in use.

EMR Task Nodes as spot instances
EMR Task Nodes do not store data in HDFS. If you lose your Task Node through the spot instance process you will not lose data stored on HDFS.


You work as a data scientist at a large global bank. Your bank receives loan information in the form of weekly files from several different loan processing and credit verification agencies. You need to automate and operationalize a data processing solution to take these weekly files, transform them and then finish up by combining them into one file to be ingested into your Redshift data warehouse. The files arrive at different times every week, but the delivering agencies attempt to meet their service level agreement (SLA) of 1:00 AM to 4:00 AM. Unfortunately, the agencies frequently miss their SLAs. You have a tight batch time frame into which you have to squeeze all of this processing.
How would you build a data processing system that allows you to gather the agency files and process them for your data warehouse in the most efficient manner and in the shortest time frame?

Agency files arrive on an S3 bucket. Use CloudWatch events to schedule a weekly Step Functions state machine. The Step Functions state machine calls a Lambda function to verify that the agency files have arrived. The state machine then starts several Glue ETL jobs in parallel to transform the agency data. Once the agency file transformation jobs have completed the state machine starts another Glue ETL job to combine the transformed agency files and convert the data to a parquet file. The parquet file is written to an S3 bucket. Then the state machine finally runs a last Glue ETL job to run the COPY command to load the parquet file data into Redshift.

Using Step Functions state machines to orchestrate this data processing workflow allows you to take advantage of processing all of your transformation ETL jobs in parallel. This makes your data processing workflow efficient and allows it to fit within your tight batch window.


You work as a cloud architect for a gaming company that is building an analytics platform for their gaming data. This analytics platform will ingest game data from current games being played by users of their mobile game platform. The game data needs to be loaded into a data lake where business intelligence (BI) tools will be used to build analytics views of key performance indicators (KPIs). You load your data lake from an EMR cluster where you run Glue ETL jobs to perform the transformation of the incoming game data to the parquet file format. Once transformed, the parquet files are stored in your S3 data lake. From there you can run BI tools, such as Athena, to build your KPIs.
You want to handle EMR step failures with automated recovery logic. What is the simplest way to build retry logic into your data processing solution?

A CloudWatch event rule triggers a Lambda function via a Simple Notification Service (SNS) topic; the Lambda function retries the EMR step.

Using SNS to trigger a Lambda function on failure allows you to use automated retry logic in your data processing solution.


You work as a cloud security architect for a financial services company. Your company has an EMR cluster that is integrated with their AWS Lake Formation managed data lake. You use the Lake Formation service to enforce column-level access control driven by policies you have defined. You need to implement a real-time alert and notification system if authenticated users run the TerminateJobFlows, DeleteSecurityConfiguration, or CancelSteps actions within EMR.

How would you implement this real-time alert mechanism in the simplest way possible?

Create a CloudTrail trail and enable continuous delivery of events to an S3 bucket. Use the aws cloudtrail create-trail CLI command to create an SNS topic. When an event occurs a Simple Queue Service (SQS) queue that subscribes to the SNS topic will receive the message. Use a Lambda function triggered by SQS to filter the messages for the TerminateJobFlows, DeleteSecurityConfiguration, or CancelSteps actions. The Lambda function will notify security alert subscribers via another SNS topic.

You use the aws cloudtrail create-trail CLI command to create the trail and its SNS topic. When events occur, a Lambda function triggered by the SQS queue that subscribes to the topic receives the alert. The Lambda function filters for the events you are concerned with; if you don’t filter the events you’ll receive alerts for every event generated by CloudTrail.
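
A rough sketch of that filtering Lambda, assuming the SQS message body carries the SNS envelope and a CloudTrail event as JSON (the alert topic ARN is a placeholder):

import json
import boto3

sns = boto3.client("sns")
WATCHED = {"TerminateJobFlows", "DeleteSecurityConfiguration", "CancelSteps"}

def handler(event, context):
    # Triggered by the SQS queue that subscribes to the CloudTrail SNS topic.
    for sqs_record in event["Records"]:
        envelope = json.loads(sqs_record["body"])            # SNS envelope delivered via SQS
        trail_event = json.loads(envelope.get("Message", "{}"))
        if trail_event.get("eventName") in WATCHED:
            sns.publish(
                TopicArn="arn:aws:sns:us-east-1:123456789012:emr-security-alerts",
                Subject="EMR security event: " + trail_event["eventName"],
                Message=json.dumps(trail_event),
            )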


You work as a data scientist for a rideshare company. Rideshare request data is collected in one of the company’s S3 buckets (inbound bucket). This data needs to be processed (transformed) very quickly, within seconds of being put onto the S3 bucket. Once transformed, the rideshare request data must be put into another S3 bucket (transformed bucket) where it will be processed to link rideshare drivers with rideshare requesters.

You have already written Spark jobs to do the transformation. You need to control costs and minimize data latency for the rideshare request transformation operationalization of your data collection system. Which option best meets your requirements?

Build an EMR cluster that runs Apache Livy. A Lambda function is triggered when the rideshare request data is put onto the inbound S3 bucket. The Lambda function passes the request data to a Spark job on the EMR cluster.

A Livy server on a long running EMR cluster will handle requests much faster than starting an EMR cluster with each request or using an SQS polling structure.
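
The Lambda piece could submit the Spark job to Livy roughly as follows; the Livy endpoint and the Spark script location are assumptions:

import requests

LIVY_URL = "http://emr-master.example.internal:8998/batches"  # placeholder EMR master address

def handler(event, context):
    # Triggered by the inbound S3 bucket; submits the Spark transform through Livy's batches API.
    for rec in event["Records"]:
        payload = {
            "file": "s3://my-artifacts/jobs/transform_rides.py",  # placeholder Spark job
            "args": [rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"]],
        }
        requests.post(LIVY_URL, json=payload,
                      headers={"Content-Type": "application/json"}, timeout=10)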


You are a data scientist working for the Fédération Internationale de Football Association (FIFA). Your management team has asked you to select the appropriate data analysis solution to analyze streaming football data in near real-time. You need to use this data to build interactive results through graphics and interactive charts for the FIFA management team. The football streaming events are based on time series that are unordered and may frequently be duplicated. You also need to transform the football data before you store it. You’ve been instructed to focus on providing high quality functionality based on fast data access.
Which solution best fits your needs?

Kinesis Data Firehose -> Lambda -> Elasticsearch Cluster -> Kibana

You can leverage a Lambda function together with Kinesis Data Firehose to transform your streaming football data prior to storage on the Elasticsearch cluster storage volumes. You can then use Elasticsearch together with Kibana to perform near real-time analytics on your streaming football data.


You are a data scientist working for a medical services firm where you are building out an EMR cluster used to house the data lake used for your company’s client healthcare protected health information (PHI) data. The storage of this type of data is highly regulated through the Health Insurance Portability and Accountability Act (HIPAA). Specifically, HIPAA requires that healthcare companies, like your company, encrypt their client’s PHI data using encryption technology.
You have set up your EMR cluster to use the default of using the EMRFS to read and write your client’s PHI data to and from S3. You need to encrypt your client’s PHI data before you send it to S3.
Which option is the best encryption technique to use for your EMR cluster configuration?

CSE-KMS

When you use CSE-KMS to encrypt your data, EMR first encrypts the data with a CMK, then sends it to Amazon S3 for storage. This meets your requirement of encrypting your data before you send it to S3.
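
One way to express this in an EMR security configuration via boto3; the configuration name and KMS key ARN are placeholders:

import json
import boto3

emr = boto3.client("emr")

security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": False,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "CSE-KMS",  # data is encrypted client-side before it reaches S3
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
            }
        },
    }
}

emr.create_security_configuration(
    Name="phi-cse-kms",
    SecurityConfiguration=json.dumps(security_config),
)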


You are a data scientist working for a retail clothing manufacturer that has a large online presence through their retail website. The website gathers Personally Identifiable Information (PII), such as credit card numbers, when customers complete their purchases on the website. Therefore, your company must adhere to the Payment Card Industry Data Security Standard (PCI DSS). Your company wishes to store the client data and purchase information data gathered through these transactions in their data warehouse, running on Redshift, where they intend to build Key Performance Indicator (KPI) dashboards using QuickSight.
You and your security department know that your data collection system needs to obfuscate the PII (credit card) data, gathered through your data collection system. How should you protect the highly sensitive credit card data in order to meet the PCI DSS requirements while keeping your data collection system as efficient and cost effective as possible?

Tokenize the PII data
You can use tokenization instead of encryption when you only need to protect specific highly sensitive data for regulatory compliance requirements, such as PCI DSS.


You are a data scientist working for a healthcare company that needs to comply with Health Insurance Portability and Accountability Act (HIPAA) regulations. Your company needs to take all of their patient’s data, including test diagnostic data, wearable sensor data, diagnostic data from all doctor visits, etc. and store it in a data lake. They then want to use Athena and other Business Intelligence (BI) tools to query the patient data to enable their healthcare providers to give optimal service to their patients.
In order to apply the appropriate data governance and compliance controls, what AWS service(s) will allow you to provide the appropriate (HIPAA) reports? Also, what AWS service(s) will allow you to monitor changes to your data lake S3 bucket ACLs and bucket policies to scan for public read/write access violations?

AWS Artifact to retrieve the Business Associate Addendum (BAA) HIPAA compliance report. Use custom rules in AWS Config to track and report on S3 ACL and/or bucket policy changes that violate your security policies.

AWS Artifact gives you the capability to retrieve the Business Associate Addendum (BAA) HIPAA compliance report directly from AWS. Also, AWS Config monitors your AWS resource configuration changes. It allows you to take action or alert, using custom rules, on configuration changes that violate your policies.


You are a data scientist working for a company that provides credit card verification services to banks and insurance companies. Your client credit card data is streamed into your S3 data lake on a daily basis in the form of large sets of JSON files. Due to the Personally Identifiable Information (PII) data contained in these JSON files, your company must adhere to the regulations defined in the Payment Card Industry Data Security Standard (PCI DSS). This means you must encrypt the data at rest in your S3 buckets. You also need to recognize and take action on any abnormal data access activity.
Which option best satisfies your data governance and compliance controls in the most cost effective manner?

Store the credit card JSON data in buckets in S3 with encryption enabled. Use the AWS Macie service to determine if any of your compliance rules are violated by scanning the S3 buckets. When compliance rule violations are found, use CloudWatch events to trigger alerts sent via Simple Notification Service (SNS).

Use the AWS Macie service to guard against security violations by continuously scanning your S3 bucket data and your account settings. Macie uses machine learning to properly classify your PII data. Macie also monitors access activity for your data, looking for access abnormalities and data leaks.


You are a data scientist working for a large hedge fund. Your hedge fund managers rely on analytics data produced from the S3 data lake you have built that houses trade data produced by the firm’s various traders. You are configuring a public Elasticsearch domain that will allow your hedge fund managers to gain access to your trade data stored in your data lake. You have given your hedge fund managers Kibana to allow them to use visualizations you’ve produced to manage their traders’ activity.
When your hedge fund managers first test out your Kibana analytics visualizations, you find that Kibana cannot connect to your Elasticsearch cluster. Which options are ways to securely give your hedge fund managers access to your Elasticsearch cluster via their Kibana running on their local desktop? (SELECT TWO)

Configure a proxy server that acts as an intermediary between your Kibana users and your Elasticsearch cluster. Add an IP-based access IAM policy that allows your users’ requests to reach your Elasticsearch cluster through the proxy server’s IP address.

You can use a proxy server to avoid having to include all of your hedge fund managers’ IP addresses in your access policy. You only include the proxy server’s IP address in your IAM access policy, with a policy statement segment like this:

{
  ...
  "Effect": "Allow",
  "Principal": {
    "AWS": "*"
  },
  "Action": "es:*",
  "Condition": {
    "IpAddress": {
      "aws:SourceIp": [
        "57.201.547.32"
      ]
    }
  }
  ...
}

Set up Amazon Cognito by creating a user pool and an identity pool to authenticate your Kibana users

You can use Cognito and its user pools and identity pools to provide username and password access for Kibana users.


You have just landed a new job as a data scientist for a worldwide retail and wholesale business with distribution centers located all around the globe. Your first assignment is to build a data collection system that stores all of the company’s product distribution performance data from all of their distribution centers into S3. You have been given the requirement that the data collected from the distribution centers must be encrypted at rest. You also have to load your distribution center data into your company’s analytics EMR cluster on a daily basis so that your management team can produce daily Key Performance Indicators (KPIs) for the various regional distribution centers.
Which option best meets your encryption at rest requirement?

Use an AWS KMS customer master key (CMK) for server side encryption when writing your distribution center performance data to S3. Create an IAM role that allows your analytics EMR cluster to have permission to access your S3 buckets and to use the AWS KMS CMK.

Using server-side encryption through an AWS KMS CMK gives you the most secure encryption of the options provided.
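
For example, a single object write with SSE-KMS looks like this in boto3 (the bucket, key, and CMK ARN are placeholders; the EMR cluster's IAM role must also be allowed to use the same CMK):

import boto3

s3 = boto3.client("s3")

with open("throughput.csv", "rb") as body:
    s3.put_object(
        Bucket="distribution-center-metrics",
        Key="emea/2024-06-01/throughput.csv",
        Body=body,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
    )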


You work as a data scientist for a data analytics firm that collects data for various industries, including the airline industry. Your airline clients wish to have your firm create analytics for use in machine learning models that predict air travel in the global market. To this end, you have created a Kinesis Data Streams data collection system that gathers flight data for use in your analysis.
You are writing a consumer application, using the Kinesis Client Library (KCL), that will consume the flight data stream records and process them before placing the data into your S3 data lake.
You need to handle the condition of when your consumer application fails in the middle of reading a data record from the data stream. What is the most efficient way to handle this condition?

Use the KCL application state tracking feature implemented in the DynamoDB table associated with the KCL application that failed while reading the record

The KCL application state tracking feature is implemented in a unique DynamoDB table that is associated with the KCL consumer application. The table is created using the name of the KCL consumer application.


You work as a lead data scientist on a development team in a large consulting firm. Your team is working on a contract for a client that needs to gather key statistics from their application server logs. This data needs to be loaded into their S3 data lake for use in analytics applications.
Your data collection process requires transformation of the streamed data records as they are ingested through the collection process. You also have the requirement to keep an unaltered copy of every source record ingested by your data collection process.
Which option meets all of your requirements in the most efficient manner?

Amazon Kinesis Agent streams source data into Kinesis Data Firehose. Kinesis Data Firehose invokes a Lambda function which transforms the data record. Kinesis Data Firehose then writes the transformed record to the S3 data lake destination. Kinesis Data Firehose saves the unaltered record to another S3 destination.

The Amazon Kinesis Agent is the most efficient way to collect data from application log files and send the data to a Kinesis Data Firehose stream. To transform the log data, Kinesis uses a Lambda function. Once the Lambda function has transformed the data, the Lambda function returns the transformed data record to Kinesis Data Firehose which writes the transformed record to your S3 destination. Kinesis Data Firehose can also be configured to write the original source data record to another S3 bucket.
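
The transformation Lambda follows Firehose's record-transformation contract: echo each incoming recordId, set a result, and return base64-encoded data. A minimal sketch (the output JSON shape is an assumption):

import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        log_line = base64.b64decode(record["data"]).decode("utf-8")
        transformed = json.dumps({"message": log_line.strip(), "source": "app-log"}) + "\n"
        output.append({
            "recordId": record["recordId"],  # must match the incoming recordId
            "result": "Ok",                  # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}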


You work as a data scientist for a streaming music service. Your company wishes to catalog and analyze the metadata about the most frequently streamed songs in their catalog. To do this you have created a Glue crawler that you have scheduled to crawl the company song database every hour. You want to load the song play statistics and metadata into your Redshift data warehouse using a Glue ETL job as soon as the crawler completes.
What is the most efficient way to automatically start the Glue ETL job as soon as the crawler completes?

Create two Glue triggers. The first Glue trigger is a timer based event that triggers every hour and starts the crawler. The second Glue trigger watches for the crawler to reach the SUCCEEDED state and then starts the ETL job. The ETL job will transform and load the data into your Redshift cluster.

This option describes the most efficient approach, the use of Glue triggers, to start both the crawler and the ETL job automatically.
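
The two triggers could be created with boto3 roughly as follows; the crawler, job, and trigger names are placeholders:

import boto3

glue = boto3.client("glue")

# Timer-based trigger that starts the crawler every hour.
glue.create_trigger(
    Name="hourly-song-crawl",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"CrawlerName": "song-catalog-crawler"}],
    StartOnCreation=True,
)

# Conditional trigger that starts the ETL job once the crawler succeeds.
glue.create_trigger(
    Name="load-songs-to-redshift",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "CrawlerName": "song-catalog-crawler",
        "CrawlState": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "songs-to-redshift-etl"}],
    StartOnCreation=True,
)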


You work as a data architect for a literature publishing firm that publishes literature (novels, non-fiction, poetry, etc.) around the globe in several different languages. Your management team has moved your published format to almost exclusively digital to allow for immediate delivery of their product to their consumers. The platforms used by your customers to read your literature generate many IoT data messages as the customers interact with your literature. This data flows into your data collection system at very high volume levels.
You have been given the requirements that the IoT data must be housed in your corporate data lake and that the data must be highly available. You have also been asked to transform the IoT data and group the data records into batches according to the literature’s published language. Your most important data collection system characteristics are durability of the data and data lake retrieval performance.
You have built a Kinesis Data Stream to collect the IoT data. Which of the following options will meet your requirements in the most cost optimized, durable, and performant manner?

Construct a Kinesis Data Firehose that receives the IoT record data from the Kinesis Data Stream. The Kinesis Data Firehose buffers and converts the data to partitioned ORC files and writes them to your data lake.

Kinesis Data Firehose is durable in that it uses multiple availability zones. Kinesis Data Firehose also facilitates the easy conversion of your data to partitioned ORC files for storage in your data lake. The partitioned ORC files make for highly optimized SQL queries, which gives you the best performance when retrieving data from your data lake.


You work as a data scientist for a global securities trading firm. Your management team needs to track all trading activity through the visualization of analytics data, such as Key Performance Indicators (KPIs), for all of its regional offices around the globe. Each regional office reads from and writes to your trading database at high frequency throughout the global trading day windows. Also, each regional manager needs to have analytics visualizations of KPIs that compare his/her region to all of the other regions around the globe in near real-time.
What data collection and storage solution best meets your requirements?

Leverage the global tables capability of DynamoDB to house your trading data. Make your tables available in the corporate headquarters region as well as in your regional office regions.

The DynamoDB global tables feature gives you a multi-master, multi-region solution. This gives you the capability to write to your DynamoDB tables in the user’s local region. DynamoDB then replicates the local write to the table replicas in all of your other regions, keeping every regional replica synchronized as changes occur in any region.


You work as a data scientist for a retail clothing chain. Your company has decided that their social media platform activity has become popular enough to provide valuable insight into their customer preferences and buying habits. They wish to gather their Instagram and Twitter social media data and use it for analytics to provide insight into various customer attributes, such as demographics, purchasing tendencies, relationships to other potential customers, etc. Your management team wants to build business intelligence (BI) ad-hoc visualizations from this data.
What option best describes the operational characteristics of the solution that best meets your requirements in the most efficient manner?

Kinesis Data Firehose receives the Instagram and Twitter social media feeds. Kinesis Data Firehose streams the raw data to S3. A Glue crawler catalogs the social media feed data. Athena is used to perform ad-hoc queries. QuickSight is used for data visualization.

Kinesis Data Firehose receives the social media data and writes it directly to S3. AWS Glue is used to crawl the data and catalog it. Athena uses the Glue catalog to allow for ad-hoc queries of the social media data. QuickSight is used to build the BI visualizations.


You are a data scientist working for a securities trading firm that receives trading data from multiple market data producer sources. Your task is to consume the data from these producers cost effectively while also maximizing the performance of your data collection system. Your data collection system must deliver the aggregated producer data to your firm’s data lake for analytics application use.
Your KPL producer applications use the following KinesisProducerConfiguration settings:
RecordMaxBufferedTime = 200
MaxConnections = 2
RequestTimeout = 5000
Region = us-east-1
Your Kinesis Data Streams writes to Kinesis Data Firehose. Kinesis Data Firehose uses a Lambda function to transform your data into the Avro format before writing it to your S3 bucket in your data lake.
You have noticed that your data collection pipeline is not performing as well as you had expected. What may be the cause, and what can you do to improve the situation?

Your RecordMaxBufferedTime value is too low, resulting in lower aggregation efficiency, so your pipeline throughput is slow. Change the RecordMaxBufferedTime to 3000 and restart your KPL application. This allows the KinesisProducer to deliver larger aggregate packages to your Kinesis Data Stream.

Changing your RecordMaxBufferedTime to a higher value will increase your aggregate package size, thereby improving the performance of your pipeline throughput. Also, you must restart your KPL application if you want to change any of the KinesisProducerConfiguration values.


You are a data scientist working for a transportation company that specializes in delivering cargo to manufacturing companies. You have been tasked with building a data collection system to gather all of your logistics data into a data lake. This data will be used by analytics applications to perform operations management tasks such as solving the “traveling salesman” problem, where your analytics application needs to find the optimal path for your delivery truck to take to all of its destinations. This optimal path needs to maximize cost efficiency as well as meeting delivery timelines.
You have constructed a Kinesis Data Streams infrastructure with KPL producer applications delivering the transportation data into your Kinesis shards. You are in the process of building your Kinesis Consumer Library application code to consume the streaming data from Kinesis and write the data to your S3 buckets. What happens when your KCL worker code fails in the middle of retrieving a record from your Kinesis stream?

Your KCL implementation takes advantage of checkpointing, where KCL stores a cursor in DynamoDB to keep track of records that you have read from a shard. To recover from a failed KCL read, a new KCL worker uses the cursor to restart from the record so you don’t lose the record from the shard.

Your KCL worker takes advantage of checkpointing, persisting its checkpoint cursor data to DynamoDB. The KCL will use the cursor information to restart at the exact record where the previous worker failed.


You recently started working as a data scientist for a large real estate company. Your real estate brokers need near real-time streaming data on interest rates and loan offerings for their regional markets. They also need near real-time streaming data describing their regional real estate inventory, for example on the market, sold, pending sale, etc.
You have constructed a Kinesis Data Firehose data collection pipeline to gather the data. You now wish to store the data in a DynamoDB database for access via REST APIs by your real estate agents out in the field using their mobile devices. You have implemented the REST APIs using API Gateway.
When you run your first canary deployment of your Lambda function you notice that your Lambda function attempts to process your buffered Kinesis records 3 times and then skips the batch of records. What might be the cause of the problem, and how can you correct the issue?

Your Kinesis Firehose buffer size is set to 7 MB. This setting is too high, causing your Lambda function to fail with an invocation limit error. Lower your Kinesis Firehose buffer size.

Lambda has an invocation payload limit of 6 MB for synchronous invocations, and Kinesis Firehose invokes your Lambda function in synchronous invocation mode. This type of data transformation failure results in three tries before skipping the batch of records. Lowering your Kinesis Firehose buffer size to a value of 6 MB or less will solve the issue.


You work as a data scientist for an ocean cruise ship resort company. You have been tasked with building an S3 data lake to store information about customer interaction and satisfaction with the company’s resort offerings. The data will be captured from social media and the firm’s website.
Your data collection system will need to stream the social media and web site comments in real-time to your data store. Your management team wishes to use the data in the data store to perform ad-hoc analysis of the customer feedback in real-time. Which option gives you the most cost efficient and performant solution?

Streaming customer data -> AWS IoT Core -> AWS Glue -> S3 -> Athena

You can receive your web and social media streamed data into AWS IoT Core, then write the messages directly to S3 using the S3 IoT Core rule action. You can use Glue to crawl and catalog your data so that you can easily query it from Athena.
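
Once the data is cataloged, an ad-hoc Athena query can be run from code like this (the database, table, and results bucket are hypothetical):

import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString=(
        "SELECT sentiment, COUNT(*) AS mentions "
        "FROM customer_feedback "   # placeholder table created by the Glue crawler
        "GROUP BY sentiment"
    ),
    QueryExecutionContext={"Database": "cruise_feedback"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)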


You work as a data scientist for a financial services firm that trades commodities on the futures markets in the United States. Specifically, your traders trade the S&P 500, Nasdaq-100, Yen, and Bitcoin equity index products on the Chicago Mercantile Exchange (CME). In order to have the real-time information needed to make informed trades, your traders need futures market data streamed in real-time into their data repository. They need to use their machine learning models to perform predictive analytics on their data.
Which option meets your requirements and gives you the most cost efficient solution to your design problem?

Streaming futures market data -> Kinesis Firehose -> S3 -> SageMaker

Stream your futures market data using Kinesis Firehose. Firehose writes the data to S3. SageMaker sources its model with the raw data in S3. This is the most efficient option that also meets your requirements.


You work as a data scientist for a national polling institute. Your institute performs state-wide and national polls in the areas of politics, elections, and general public interest subjects. Your data collection system receives hundreds of thousands of data records through your data streaming pipeline. You have chosen DynamoDB as the data store for several of these data structures.
As you create your DynamoDB table for your political election polling data store you need to select a partition key and a sort key, since you wish to use a composite key to improve performance and DynamoDB capacity management. You have several choices for your political election polling partition/sort key combination. Your researchers need to produce several visualizations of the data to understand the distribution of votes by age, nationality, political party affiliation, selected candidate, etc. An example would be to visualize votes collected for a particular candidate by age group and by voter nationality.
Which option will give you the best performance for your political election polling table?

Partition key: registered voter id, Sort key: selected candidate name

The choice of registered voter id for your primary key gives you high cardinality; every registered voter will have a unique voter id. Therefore, there is no need for a composite partition key.


You work as a data scientist for a data analytics company that specializes in supplying data sets to industry partners for use in their machine learning models. Your company’s data sets are used by your partners as seed data for their own corporate data stores, allowing your partners to leverage a much larger sample of data for their models.
One of your partners needs you to transform industry data that is sourced in the JSON format so that the data can be used by their machine learning model in the CSV format. You have chosen AWS Glue as your transformation tool. One particular requirement is that your ETL script needs to convert a composite JSON format of, for example {“id”: 1435678, “product name”: “product A”, “product cost”: 54.23}, to values in your CSV file of int, string, and double.
Which option leverages AWS Glue to perform the required JSON transformation in the most cost effective optimal manner?

Use the Glue built-in transform Unbox to reformat the JSON into the required elements

The Glue Unbox built-in transform reformats string fields, like your composite JSON field, into distinct fields that represent the types of the composites.
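
A sketch of a Glue PySpark script using the Unbox transform; the database, table, column, and output path are assumptions:

from awsglue.context import GlueContext
from awsglue.transforms import Unbox
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Source DynamicFrame whose "product_json" column holds composite JSON strings such as
# {"id": 1435678, "product name": "product A", "product cost": 54.23}
products = glue_context.create_dynamic_frame.from_catalog(
    database="partner_data", table_name="products_raw"
)

# Unbox parses the JSON string into typed fields (int, string, double).
unboxed = Unbox.apply(frame=products, path="product_json", format="json")

# Write the result out as CSV for the partner's machine learning pipeline.
glue_context.write_dynamic_frame.from_options(
    frame=unboxed,
    connection_type="s3",
    connection_options={"path": "s3://partner-exports/products_csv/"},
    format="csv",
)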


You work as a data analytics specialist for a social media software company. Your product generates data that your company can use in predictive analytics applications that leverage machine learning. These applications use Natural Language Processing (NLP) and click prediction techniques for use in targeted advertising on your social media app.
You need to build an EMR cluster to store and process this streaming data to prepare it for use in your machine learning analytics applications. Based on your streaming data activity volume you estimate that your cluster will need to have more than 50 nodes.
Based on your streaming data volume and your machine learning based use cases, which types of EC2 instances should you use for your master node and core/task nodes?

Master node: m4.xlarge. Core/Task: Cluster Compute instance type

The best practice is to use an m4.xlarge instance type for your master node if your cluster will have more than 50 nodes. Also, for NLP and machine learning applications the Cluster Compute instance type is recommended.

  • High CPU instance type is recommended for computation-intensive clusters
  • High Memory instance type is recommended for clusters running database and memory-caching applications

You work as a data analytics specialist for a television network that has started using data analytics for its sports broadcasts. You receive sports streaming data into your data collection system and store it in your EMR cluster for use in real-time analytics during the broadcast of sporting events. The analytics are overlaid onto the live sports action to give detailed insight into the action. The analytics are also broadcast out via your website for consumption by your millions of users worldwide.
Based on the schedule of sporting events and the popularity of some events, such as the Fédération Internationale de Football Association (FIFA) football world cup, you need to be able to scale your EMR cluster EC2 instances in or out depending on the particular demand for analytics for the given event. Your goal is to provide adequate performance for the given workload while also maintaining the most cost effective environment over time.
Which type of scaling plan should you use for your EMR cluster?

Define your EC2 instance type during the initial configuration of your instance groups. Then automatically resize your core instance group and task instance groups by leveraging automatic scaling to add or remove EC2 instances. Do this by defining rules that Auto Scaling uses based on a CloudWatch metric you specify.

You can only define the EC2 instance type during the initial creation of your instance group. Leveraging automatic scaling to add or remove EC2 instances to your core instance group and task instance groups based on the changes in a CloudWatch metric is the best practice for maintaining the most cost effective and performant EMR cluster.


You work as a data scientist for a management consulting company. The management team of your company’s business process improvement practice needs real-time visualizations of Key Performance Indicator (KPI) outliers for their clients. You have a large historical data set and you also have real-time streaming data from your current engagements.
Which option gives you the most cost effective solution to your data analysis visualization problem?

Use an anomaly detection insight in QuickSight to detect the outliers in your clients’ KPI data.

The anomaly detection insight of QuickSight allows you to continually analyze your KPI data to find anomalies. You can then visualize your insight data using the insight widget in QuickSight. This option is far more cost effective than building a SageMaker machine learning model.


You work as a data scientist for a healthcare corporation where you are required by Health Insurance Portability and Accountability Act (HIPAA) regulations to record all changes to your data stores for auditing purposes. You have created a data collection pipeline using Kinesis Data Streams, Kinesis Data Firehose and S3 to build your corporate data lake. You have also established AWS Config to record the configuration changes for your AWS resources.
Your AWS Config rules for your S3 buckets in your data lake should send you notifications whenever an S3 bucket is created, modified, or deleted. However, you are not receiving these notifications when your S3 resources change. What might be the cause of this problem?

The IAM role you assigned to AWS Config does not include the AWSConfigRole managed policy to allow AWS Config to record changes to your S3 buckets.

The AWSConfigRole managed policy, associated with the role assigned to AWS Config, allows AWS Config to record and notify on changes to your S3 buckets.


You work as a data analytics specialist for a web retail company with vast warehouses across the globe. All of the products sold on your company’s retail web store are distributed from these warehouses to the end customer. You have been asked to produce data analytics applications that allow your management team to understand movement of product through your warehouse system. You have built a data collection system consisting of a Kinesis Data stream fed by data producers written using the Kinesis Producer Library (KPL) and consumed by Kinesis applications written using the Kinesis Client Library (KCL). The consumers use the Kinesis Connector Library to write your records to S3.
Because of the very large number of records produced by your KPL applications, you have decided to use KPL aggregation. Also, your KCL record processing code relies on unique identifiers for the processing of your KPL user records. What attribute of your streamed records can you use as your unique identifier for your KPL user records after de-aggregating the Kinesis Data Stream record?

Use the KPL user record sequence number

You can use the KPL user record sequence number as your unique identifier as long as you use the KPL Record or UserRecord class hashCode and equals operations when comparing your user records.


You work as a data analytics specialist for a Software as a Service (SaaS) provider that provides software to the insurance industry. Your software allows small to medium sized insurance agencies to manage their client base and their insurance premium data.
You have built a data collection system that uses Kinesis Data Streams to feed a Kinesis Data Firehose stream. You want to configure the Firehose stream to leverage Lambda to transform your data and then write your data to your Elasticsearch cluster so that you can provide a cached data search capability for your SaaS offering.
When you conduct your first tests you find that your streaming data is not being delivered to your Elasticsearch domain. What may be the root of the problem?

Check the SucceedProcessing metric data in CloudWatch

The SucceedProcessing metric data in CloudWatch tells you how many records were successfully processed over a period of time when using Lambda for transformation. You are using Lambda to transform your data prior to writing it to Elasticsearch.


You work as a data analytics specialist for a music streaming service. Your team has been assigned the task of capturing the music selection activity of your millions of users and storing the data in a data lake for use in analytics applications.
You have chosen to populate your data lake via a data collection system that uses Kinesis Data Streams to capture the data records from a producer application. You have the requirement to keep the data records in sequence order so that your analytics applications can infer when users take a sequence of actions, such as selecting a song to play and then either skipping the song or marking the song as a favorite.
As you start to test your data collection pipeline you notice that some of your data records arrive out of sequence. Which option can help you correct this problem?

Use the PutRecord API call to write your records to your Kinesis stream

With the PutRecords API call, a failed record is skipped and all subsequent records are processed. Therefore, the PutRecords API call does not guarantee data record ordering. When you write your records to the same shard, the PutRecord API call will guarantee data record ordering.
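
A sketch of ordered writes with PutRecord, using SequenceNumberForOrdering to enforce strictly increasing sequence numbers per partition key (the stream name is a placeholder):

import json
import boto3

kinesis = boto3.client("kinesis")

def send_in_order(user_id, events, stream_name="music-activity"):
    sequence_number = None
    for event in events:
        params = {
            "StreamName": stream_name,
            "Data": json.dumps(event).encode("utf-8"),
            "PartitionKey": user_id,  # same key -> same shard, so ordering applies
        }
        if sequence_number:
            params["SequenceNumberForOrdering"] = sequence_number
        sequence_number = kinesis.put_record(**params)["SequenceNumber"]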


You work as a data analytics specialist for a polling analytics firm that is building a data warehouse to hold its polling data for an upcoming parliamentary election. The data will need to be loaded into your data warehouse in compressed format to allow for the best performance when querying the warehouse.
You have decided to use the Redshift automatic compression feature to accomplish your data compression in your warehouse. Which of the following options is NOT one of the operations performed by automatic compression when loading your data into your Redshift tables using the COPY command?

Automatic compression copies the sample rows to the table

As the third step in the compression process, automatic compression removes the sample rows from the table; it does not copy the sample rows to the table.


You work as a data analytics specialist for a food delivery mobile application service. Your service matches restaurants offering delivery service with customers in the regional area of the given restaurants in your system. Your management team needs to gain insight into fulfillment, delivery route efficiency, customer satisfaction, and other key metrics of their service. To meet this end your team is building a data warehouse that will store the business data and allow you to produce key metric analytics views and dashboards to your management team.
You have loaded all of your data into your Redshift data warehouse and have started to create the business intelligence analytics views for your management team. However, your queries are not performing as well as they should. The response time for producing analytics insights is slow. You have decided to leverage column compression on your Redshift tables to improve query performance. How would you apply compression to your customer table in Redshift?

create table customer(
    customer_id int encode raw,
    customer_name char(20) encode bytedict);

After you run the ANALYZE COMPRESSION command on your existing table you can use the results to select the compression encodings you’ll use when you create a new table to populate with your existing data. You need to create a new table with your desired compression encodings and then load your existing data into the new table.


You work as a data analytics specialist for a food processing corporation. Your company processes plant and animal products for distribution across the globe. You need to maintain a data warehouse to store information about your food products such as their production date, their shelf life, and their destination. This information must be backed up and stored for auditing purposes.
You have chosen Redshift as your data warehouse technology. You are now configuring your snapshot schedule for your primary tables. Which of the following options defines your snapshot schedule to occur every day of the week starting at 12:30 AM on a 1 hour increment?

cron(30 0/1 *)

This expression will run your snapshot every day starting at 12:30 AM on a one-hour increment.


You work as a data scientist for a car manufacturer that has started to venture into the fully electric vehicle market. Your company currently has two models of electric cars in the US market. These cars have many sensors on them that emit data back to your EMR cluster where you store the data in your S3 data lake. You need to be able to generate interactive visualizations in real-time of the sensor data for your management team so that they can get insight into the use of their new electric models.
What is the most efficient way to build your real-time visualization for your management team?

Create Presto tables in your Hive metastore then create QuickSight visualizations using Presto queries.

You can directly query your data using Presto queries from your QuickSight visualizations. This gives you real-time visualizations as the data arrives in your EMR cluster.


You work as a data analytics specialist for a banking company where you are building a data warehouse to store data about customers and their accounts. This data warehouse will be used by analysts building customer insight analytics applications.
Since the data warehouse you are creating houses Personally Identifiable Information (PII), you need to restrict access to your Redshift data warehouse tables. Of the options given, how can you restrict access so that users only can retrieve the information they need at the most granular level?

Build stored procedures with the INVOKER security attribute in Redshift that control the access to the data needed for each function.

Using a stored procedure in Redshift you can give your users access to the data they need, and only the data they need, without giving the users access to the underlying tables. The INVOKER security attribute for Redshift stored procedures is the default security attribute, where the procedure runs under the permissions of the user who calls it, thereby restricting access to only retrieving the results of the stored procedure.


You work as a data analytics specialist for a software company that develops Software as a Service options for the healthcare industry. In order to provide very fast access to your systems’ data using the simplest caching mechanism, you have chosen to implement the ElastiCache Memcached cache engine.
Because your company develops software solutions for the healthcare industry, you are required by Health Insurance Portability and Accountability Act (HIPAA) regulations to record all changes to your data stores for auditing purposes. To this end, you need to record each time your ElastiCache Memcached cache engine scales out or in, adding and removing nodes based on changes to the demand on your system. How would you most efficiently implement this monitoring in your infrastructure?

Create a custom AWS Config rule to monitor your Memcached nodes and record each time they scale in or out.

The most efficient way to monitor your resource use, such as monitoring your ElastiCache Memcached node autoscaling, is to leverage the AWS Config custom rule capability.
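
A minimal sketch of such a custom rule using boto3, assuming an evaluation Lambda function already exists (its ARN and the rule name below are placeholders):

```python
import boto3

config = boto3.client("config")

# Custom rule: a Lambda function (placeholder ARN) is invoked whenever the
# configuration of an ElastiCache cluster changes, e.g. nodes are added or removed.
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "memcached-scaling-audit",
        "Scope": {"ComplianceResourceTypes": ["AWS::ElastiCache::CacheCluster"]},
        "Source": {
            "Owner": "CUSTOM_LAMBDA",
            "SourceIdentifier": "arn:aws:lambda:us-east-1:111122223333:function:memcached-audit",
            "SourceDetails": [
                {
                    "EventSource": "aws.config",
                    "MessageType": "ConfigurationItemChangeNotification",
                }
            ],
        },
    }
)
```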


You work as a data analytics specialist for a mobile software company that is moving its customer database from its on-premises database instances to RDS in its AWS account. You are on the team creating the Database Migration Service (DMS) tasks to move your data tables and views to your RDS instance.
Which option is the most efficient and performant method of migrating your database resources?

Create a DMS task that does a full-load only task.

DMS can migrate views only with a full-load only task, so a full-load only task is required to migrate both your tables and views.
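
A hedged boto3 sketch of a full-load only task; the endpoint and replication instance ARNs are placeholders, and the table mapping simply includes everything:

```python
import json
import boto3

dms = boto3.client("dms")

# Table mappings: include every table and view in every schema (placeholder rule).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="customer-db-full-load",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",  # placeholder
    MigrationType="full-load",   # full-load only: required when views are included
    TableMappings=json.dumps(table_mappings),
)
```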


You work as a data analytics specialist for a multinational publishing conglomerate that has just purchased another publishing firm in mid-April of 2020. You and your team are now in the process of integrating the newly acquired publisher’s PostgreSQL datastore source endpoint into your RDS PostgreSQL target endpoint using DMS. The acquisition migration will take several months to complete so you’ll need to perform ongoing replication from the source endpoint to the target endpoint for several months until the acquired firm’s systems are decommissioned.
You have been instructed to start your DMS Change Data Capture (CDC) task from the starting point of 2 weeks after the acquisition migration began, which is May 15, 2020. What CDC task should you use to accomplish your ongoing replication activity?

Perform your replication of the source PostgreSQL endpoint from a CDC recovery checkpoint generated on 5/15/2020.

When using DMS to migrate a PostgreSQL source endpoint you need to use a CDC native start point. To do this you can use a CDC recovery checkpoint in your source endpoint.
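
A hedged boto3 sketch of the ongoing-replication task; the ARNs and the recovery checkpoint string are placeholders (the real checkpoint value comes from the checkpoint generated on 5/15/2020):

```python
import json
import boto3

dms = boto3.client("dms")

# Placeholder: the actual checkpoint string is taken from the 5/15/2020 checkpoint.
recovery_checkpoint = "checkpoint:V1#PLACEHOLDER"

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="publisher-cdc-ongoing",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:PG-SOURCE",  # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:RDS-PG",     # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",   # placeholder
    MigrationType="cdc",                      # ongoing replication only
    CdcStartPosition=recovery_checkpoint,     # start from the recovery checkpoint
    TableMappings=json.dumps(table_mappings),
)
```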


You work as a data scientist for an automobile manufacturer that has an EMR cluster that is used to populate your S3 data lake with automobile information such as sales across regions and countries, parts source partner performance, etc.
You need to create Apache Spark applications and run interactive queries on your EMR cluster to gain insight into your company’s sales and partner performance with minimal effort. You also need to give your fellow data scientists access to create Apache Spark applications.
Which option gives you the simplest, most cost effective operation for your data processing solution?

Create EMR notebooks within your EMR cluster and use Jupyter notebooks to run your Spark applications.

Running your Spark applications in EMR notebooks, which execute on your EMR cluster's master node, is the most cost-effective option listed.
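
EMR notebooks are usually created and attached to the cluster in the console; as a rough sketch of running one of them programmatically (the editor ID, notebook path, cluster ID, and service role are placeholders):

```python
import boto3

emr = boto3.client("emr")

# Run a Jupyter notebook (managed as an EMR notebook) against the existing cluster.
response = emr.start_notebook_execution(
    EditorId="e-EXAMPLENOTEBOOKID",                      # placeholder notebook (editor) ID
    RelativePath="spark_sales_analysis.ipynb",           # placeholder notebook file
    ExecutionEngine={"Id": "j-EXAMPLECLUSTERID", "Type": "EMR"},
    ServiceRole="EMR_Notebooks_DefaultRole",
)
print(response["NotebookExecutionId"])
```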


You work as a data analytics specialist for a company that processes massive amounts of data through your EMR cluster. You need to transform this data with as little latency as possible. The data arrives as Avro files in one of your S3 buckets at a rate of hundreds of files (approximately 100 to 500) per second. You need to process these files as close to real-time as you can.
What option is the most efficient and most performant that gives you the transformation of your data for analysis at the processing rate you need?

Create a step in your EMR cluster to execute a Hive script with concurrency set to 250 to transform the file data as the Avro files arrive.

With step concurrency set to 250, your Hive script step can run many instances in parallel, allowing you to process the Avro files as they arrive.
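
A minimal boto3 sketch, assuming the cluster ID and script location shown are placeholders: raise the cluster's step concurrency, then submit the Hive script as a step.

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-EXAMPLECLUSTERID"   # placeholder

# Allow up to 250 steps to run in parallel on the cluster.
emr.modify_cluster(ClusterId=cluster_id, StepConcurrencyLevel=250)

# Submit the Hive transformation as a step; the script location is a placeholder.
emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            "Name": "transform-avro-batch",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hive-script", "--run-hive-script",
                    "--args", "-f", "s3://my-bucket/scripts/transform_avro.q",
                ],
            },
        }
    ],
)
```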


You work as a data analytics specialist for a food production company that houses all of the data defining their food processing in their data lake. You use EMR to process the production data sourced from various systems within your organization and store it in ORC files in your S3 buckets.
You have been tasked with producing a Pig program that allows your data scientists to generate reports detailing their food production by region and supplier. When you add your Pig program as a step in your EMR cluster you need to define what action to take on failure. Which of the following is NOT a valid action on failure?

Cancel and continue

The valid actions on failure are 'Continue', 'Cancel and wait', 'Terminate cluster', and, when using the CLI, 'Terminate job flow'.
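
A hedged sketch of adding the Pig program as a step; the cluster ID and script path are placeholders, and the comment lists the valid ActionOnFailure values:

```python
import boto3

emr = boto3.client("emr")

# ActionOnFailure accepts CONTINUE, CANCEL_AND_WAIT, TERMINATE_CLUSTER,
# or the older TERMINATE_JOB_FLOW -- there is no "cancel and continue" value.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTERID",                  # placeholder cluster ID
    Steps=[
        {
            "Name": "food-production-report",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "pig-script", "--run-pig-script",
                    "--args", "-f", "s3://my-bucket/scripts/production_report.pig",
                ],
            },
        }
    ],
)
```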


You work as a data analytics specialist for a mobile game software company. You have been tasked with building a data collection system that can efficiently manage the volume of data records pushed through your data collection system. The data collection system will gather data from player devices and deliver the game data to a DynamoDB database. Your management team wants to have analytics applications that show player data in real-time. These analytics applications will read the player data from your DynamoDB database as the data streams into it.
Which option describes the most cost efficient way to manage this data collection system?

Mobile game data -> Kinesis Data Streams -> KCL application consumers -> DynamoDB -> Analytics applications

Your KCL application consumers will read the streamed game data from your Kinesis Data Streams shards and write the data to your DynamoDB database. Your analytics applications will be able to read the streamed data in real-time to produce useful insights. The Kinesis Scaling Utility allows you to automatically scale your Kinesis Data Streams shard capacity to match your data load at any given time.
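
The KCL itself is a Java library (usable from other languages via the MultiLang Daemon); the sketch below is only a simplified boto3 polling consumer that illustrates the same read-from-shards, write-to-DynamoDB pattern. The stream name, table name, and JSON record format are assumptions:

```python
import json
import boto3

kinesis = boto3.client("kinesis")
table = boto3.resource("dynamodb").Table("PlayerGameData")   # placeholder table

stream = "mobile-game-events"                                # placeholder stream
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while iterator:
    result = kinesis.get_records(ShardIterator=iterator, Limit=500)
    with table.batch_writer() as batch:
        for record in result["Records"]:
            item = json.loads(record["Data"])   # assumes JSON-encoded game events
            batch.put_item(Item=item)
    iterator = result.get("NextShardIterator")
```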


You work as a data analytics specialist for a transportation company where you are responsible for gathering sensor data from the various transport vehicles and ingesting it into an S3 data lake for use in a data warehouse and data analytics applications.
You have built a data collection system that leverages Kinesis Data Streams to gather the sensor data. You have also built Kinesis Client Library (KCL) applications to consume the stream data and write it to S3.
You have noticed in your initial roll out of your data collection stream that on heavy transport activity days you experience hot shard conditions. What monitoring of your Kinesis Data Streams environment will help you best manage your shard scaling activities? (SELECT TWO)

Enable enhanced monitoring, compare IncomingBytes values, and investigate your shards individually. Then identify which shards are receiving more traffic or breaching any service limits.

The IncomingBytes metric shows you the rate at which your shard is ingesting data. This shard-level metric will alert you when you have a hot shard.

Monitor the IncomingRecords metric. If the combined total of the incoming and throttled records is greater than the stream limits, then consider increasing the number of shards.

The IncomingRecords metric shows you the rate at which your shard is ingesting data. This shard-level metric will alert you when you have a hot shard.
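
A minimal sketch of turning on the shard-level metrics referenced above (the stream name is a placeholder):

```python
import boto3

kinesis = boto3.client("kinesis")

# Turn on shard-level (enhanced) metrics so IncomingBytes / IncomingRecords can be
# compared per shard in CloudWatch and hot shards identified.
kinesis.enable_enhanced_monitoring(
    StreamName="transport-sensor-stream",    # placeholder stream name
    ShardLevelMetrics=[
        "IncomingBytes",
        "IncomingRecords",
        "WriteProvisionedThroughputExceeded",
    ],
)
```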


You work as a data analytics specialist for a company that produces a music streaming app for mobile devices such as smartphones, smart watches, and tablets. The service collects music preference data through a Kinesis Data Streams application that also uses Kinesis Data Analytics. The Kinesis stream gathers data about your users as the users listen to music on the streaming service. You and your team have been assigned the task of creating real-time dashboards and producing real-time metrics using Kinesis Data Analytics.
Your management team wants the real-time analytics and real-time metrics analytics applications to be fault tolerant and also scale when volume increases or decreases, thereby saving your company money.
Which option will give you the appropriate data analysis solution for your scenario in the most cost effective manner?

Write your Kinesis Data Analytics application as a Flink application written in Scala that leverages checkpointing for fault tolerance while also leveraging parallel execution of tasks and allocation of resources to implement scaling of your application.

The two languages in which you can write your Kinesis Data Analytics Flink application are Scala and Java. Using Flink, you can leverage checkpointing for fault tolerance while also leveraging parallel execution of tasks and allocation of resources to implement scaling of your application.
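
A hedged sketch of creating the Flink application with explicit checkpointing and parallelism settings via boto3; the application JAR (built from the Scala code), bucket, role ARN, and runtime version are placeholders:

```python
import boto3

kda = boto3.client("kinesisanalyticsv2")

kda.create_application(
    ApplicationName="music-preference-metrics",
    RuntimeEnvironment="FLINK-1_11",                                        # placeholder runtime
    ServiceExecutionRole="arn:aws:iam::111122223333:role/kda-service-role", # placeholder role
    ApplicationConfiguration={
        "ApplicationCodeConfiguration": {
            "CodeContent": {
                "S3ContentLocation": {
                    "BucketARN": "arn:aws:s3:::my-flink-artifacts",         # placeholder bucket
                    "FileKey": "music-metrics-assembly.jar",                # placeholder JAR
                }
            },
            "CodeContentType": "ZIPFILE",
        },
        "FlinkApplicationConfiguration": {
            "CheckpointConfiguration": {          # fault tolerance
                "ConfigurationType": "CUSTOM",
                "CheckpointingEnabled": True,
                "CheckpointInterval": 60000,
                "MinPauseBetweenCheckpoints": 5000,
            },
            "ParallelismConfiguration": {         # scaling
                "ConfigurationType": "CUSTOM",
                "Parallelism": 4,
                "ParallelismPerKPU": 1,
                "AutoScalingEnabled": True,
            },
        },
    },
)
```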


You work as a data analytics specialist for a company that produces analytics data for political campaigns where your data is leveraged to gain insight into the political preferences of citizens during a state or federal campaign. You have built a Kinesis Data Analytics application that you wish to publish to your client organizations via a web interface.
You need to control access to your Kinesis Data Analytics application using authentication. Which option gives you the most secure and most cost effective way to authenticate your client users?

Create federated users using a web identity provider. Assign a role with the AmazonKinesisAnalyticsReadOnly AWS managed policy.

Federated users authenticate through a web identity provider such as Google or Facebook. Assigning the AmazonKinesisAnalyticsReadOnly AWS managed policy to the role your federated users assume gives them the appropriately secure access required.
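
A rough sketch of the federation flow, assuming a role (placeholder ARN) that has AmazonKinesisAnalyticsReadOnly attached and a token already obtained from the web identity provider:

```python
import boto3

# AssumeRoleWithWebIdentity exchanges the provider-issued token for temporary
# credentials tied to a role that carries the AmazonKinesisAnalyticsReadOnly policy.
sts = boto3.client("sts")

creds = sts.assume_role_with_web_identity(
    RoleArn="arn:aws:iam::111122223333:role/KinesisAnalyticsReadOnlyClients",  # placeholder
    RoleSessionName="campaign-client-session",
    WebIdentityToken="<token from the web identity provider>",                 # placeholder
)["Credentials"]

# Client users can now make read-only Kinesis Data Analytics calls.
kda = boto3.client(
    "kinesisanalyticsv2",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(kda.list_applications())
```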


You work as a data analytics specialist for a securities trading firm where you are responsible for the firm’s real-time streaming security price feed. Your team has built a Kinesis Data Streams pipeline to stream security pricing data from various industry sources such as Bloomberg, Reuters, etc. You now need to create data analytics visualization applications in Kinesis Data Analytics using SQL. Your goal is to transform the streaming security pricing data by enhancing it with proprietary reference data generated by your firm and then stream the enhanced pricing data to your data warehouse.
Which option gives the optimal data visualization solution for your needs that is the most cost efficient?

Use your Kinesis Data Analytics SQL application to transform your security stream and use a CSV or JSON file stored on one of your S3 buckets to enhance your security data with your proprietary reference data. Write the stream from your Kinesis Data Analytics SQL application to a Kinesis Data Firehose stream that stores the stream data in S3. Then use the ‘copy … from’ Redshift command to load your enhanced security data into your Redshift data warehouse from S3.

Your SQL code can transform your securities data, and the JSON or CSV files stored in your S3 buckets can be used to enrich it with your proprietary reference data. Writing the output of your Kinesis Data Analytics SQL application to a Kinesis Data Firehose delivery stream allows you to then stream the transformed and enriched data to S3. From S3 you can copy the data into Redshift.
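
As a minimal sketch of the final loading step (the table, bucket, cluster, and role names are placeholders), the COPY could be issued through the Redshift Data API:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Load the Firehose output (JSON objects in S3) into the pricing table.
copy_sql = """
COPY security_prices
FROM 's3://enhanced-pricing-stream/firehose-output/'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
FORMAT AS JSON 'auto';
"""

redshift_data.execute_statement(
    ClusterIdentifier="pricing-dw",   # placeholder cluster
    Database="markets",               # placeholder database
    DbUser="loader",                  # placeholder user
    Sql=copy_sql,
)
```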


You work as a data analytics specialist for a credit card company that wishes to build a data lake to hold information about its customer base and its customers' credit card information. Your management team plans to use the data lake to build business intelligence data analytics applications to gain insight into their customers' spending habits and fraud vulnerabilities.
You and your team are in the process of creating your EMR cluster that you’ll use to populate your S3 data lake. Your team lead has emphasized that the cluster must be integrated with your corporate Active Directory domain for authentication of user access to the EMR cluster. You need to have your users access your EMR cluster with their domain user account when they use SSH to connect to your cluster or work with your data analytics applications.
Which option gives you the corporate Active Directory domain authentication for user access that is required for your data lake in the most efficient manner?

Enable Kerberos authentication for EMR by configuring your own external key distribution center (KDC) for your EMR cluster. Make sure the external KDC is reachable by all the nodes of the cluster. Enable Active Directory integration.

If you enable Kerberos authentication using an external key distribution center (KDC) you can enable Active Directory integration by making sure your external KDC is reachable by all the nodes of the cluster. This will allow you to have your users authenticate using their AD credentials.
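
A hedged sketch of the matching EMR security configuration, assuming the JSON keys follow the EMR Kerberos security-configuration schema for an external KDC with Active Directory integration; all host names, realm, and domain values are placeholders:

```python
import json
import boto3

emr = boto3.client("emr")

# Security configuration for Kerberos with an external KDC and Active Directory
# integration; the KDC endpoints, realm, and domain below are placeholders.
kerberos_config = {
    "AuthenticationConfiguration": {
        "KerberosConfiguration": {
            "Provider": "ExternalKdc",
            "ExternalKdcConfiguration": {
                "KdcServerType": "Single",
                "AdminServer": "kdc.corp.example.com:749",
                "KdcServer": "kdc.corp.example.com:88",
                "AdIntegrationConfiguration": {
                    "AdRealm": "CORP.EXAMPLE.COM",
                    "AdDomain": "corp.example.com",
                },
            },
        }
    }
}

emr.create_security_configuration(
    Name="emr-kerberos-external-kdc-ad",
    SecurityConfiguration=json.dumps(kerberos_config),
)
```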