AWS Redshift Lab
Redshift Introduction
Prerequisite
Services -> S3
Skip this step if you already have the tickitdb dataset in your S3 bucket.
Download the sample dataset tickitdb then unzip it.
Upload the folder to your S3 bucket.
Click on Upload
Click on Add folder
Select tickitdb
then click on Choose for Upload
Click on Upload
Upload succeeded
IAM Configuration
Set up an IAM role for Redshift
Services -> IAM -> Roles
Step 1: Create Role
Click on Create role
Select Redshift
Select Redshift - Customizable
Click on Next: Permissions
Step 2: Attach Permissions Policies
Search for S3
then select AmazonS3ReadOnlyAccess
Click on Next: Tags
Step 3: Add Tags
Click on Next: Review
Step 4: Review
- Role name:
myRedshiftRole
Click on Create role
Redshift Configuration
Services -> Amazon Redshift
Create Cluster
Click on Create cluster
Cluster configuration
- Cluster identifier:
examplecluster
Under Choose the size of the cluster
Click on Help me choose
to see the cost & performance simulator
Click on I'll choose
Click on Node type, which is analogous to an EC2 instance size.
- Node type:
ra3.4xlarge
Node type:
- RA3 (recommended)
- High performance with scalable managed storage
- DC2
- High performance with fixed local SSD storage
- DS2 (legacy)
- Large workloads with fixed local HDD storage
- Nodes:
2
(2 compute nodes; Redshift provisions a separate leader node automatically)
Database Configuration
- Master user password: input your password
Cluster permissions (optional)
- Available IAM roles:
myRedshiftRole
- Click on
Add IAM role
Click on Create cluster
Wait 1~2 minutes until the Status changes to Available
Redshift Query Editor
Click on EDITOR -> Query Editor on the navigation pane
- Connection:
Create new connection
- Cluster:
examplecluster
- Database name:
dev
- Database user:
awsuser
- Database password: input your password
Click on Connect to database
Create Tables
- Select schema:
public
Copy and paste each of the following SQL queries into your editor, clicking on Run after each one.

```sql
create table users( ...
create table venue( ...
create table category( ...
create table date( ...
create table event( ...
create table listing( ...
create table sales( ...
```
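For reference, here is what the full DDL for the first of these tables looks like in the standard AWS TICKIT sample schema (the other tables follow the same pattern):

```sql
-- Full definition of the users table from the standard TICKIT sample schema.
-- userid serves as both the distribution key and the sort key.
create table users(
    userid integer not null distkey sortkey,
    username char(8),
    firstname varchar(30),
    lastname varchar(30),
    city varchar(30),
    state char(2),
    email varchar(100),
    phone char(14),
    likesports boolean,
    liketheatre boolean,
    likeconcerts boolean,
    likejazz boolean,
    likeclassical boolean,
    likeopera boolean,
    likerock boolean,
    likevegas boolean,
    likebroadway boolean,
    likemusicals boolean);
```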
Load Data into Tables from S3
Copy and paste the following SQL query into your editor, replacing the <...> placeholders with your information.

```sql
copy users from 's3://<your Bucket name>/tickitdb/allusers_pipe.txt' ...
```
Click on Run
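The COPY command above is abbreviated; a complete version typically also names the IAM role and the field delimiter. A sketch, assuming the role created earlier and pipe-delimited files (the account ID and region are placeholders you must fill in):

```sql
copy users from 's3://<your Bucket name>/tickitdb/allusers_pipe.txt'
iam_role 'arn:aws:iam::<your AWS account ID>:role/myRedshiftRole'
delimiter '|'
region '<your bucket region>';
```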
- Select schema:
public
Click on Preview data
on the right of users
Repeat the above step for the remaining tables, replacing the parameters with your own.

```sql
copy venue from 's3://<your Bucket name>/tickitdb/venue_pipe.txt' ...
```

The following is my example.

```sql
copy venue from 's3://aws-labs-bucket/Redshift/tickitdb/venue_pipe.txt' ...
```
Query Data from Redshift Cluster
```sql
SELECT sum(qtysold) ...
```

```sql
SELECT firstname, lastname, total_quantity ...
```

```sql
SELECT eventname, total_price ...
```
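These queries are abbreviated above. As one concrete example, a query of the first shape, which sums tickets sold on a given day (the date value is illustrative), could look like:

```sql
-- Total quantity sold on a sample calendar date
select sum(qtysold)
from sales, date
where sales.dateid = date.dateid
and caldate = '2008-01-05';
```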
Warning: Delete the Cluster
After you finish the lab, you must delete the cluster. Otherwise, it will keep running and racking up charges.
Click on CLUSTERS
on the navigation pane.
Select your cluster, then click on Actions
Select Delete
Uncheck Create final snapshot
Click on Delete cluster
Refresh the page
Deleted successfully!
Redshift Data Storage and Retrieval Patterns
Prerequisite
Services -> S3
Skip this step if you already have the tickitdb dataset in your S3 bucket.
Download the sample dataset tickitdb then unzip it.
Upload the folder to your S3 bucket.
Click on Upload
Click on Add folder
Select tickitdb
then click on Choose for Upload
Click on Upload
Upload succeeded
Redshift Configuration
Services -> Amazon Redshift
Take a look at the cost calculator
Click on Create cluster
Under Choose the size of the cluster
Click on Help me choose
GB-scale storage: DC2
TB-scale storage: RA3
Notice the difference between DC2 and RA3: the former shows Storage capacity, while the latter shows Managed storage capacity.
IAM Configuration
Set up an IAM role for Redshift
Services -> IAM -> Roles
Step 1: Create Role
Click on Create role
Select Redshift
Select Redshift - Customizable
Click on Next: Permissions
Step 2: Attach Permissions Policies
Search for S3
then select AmazonS3ReadOnlyAccess
Click on Next: Tags
Step 3: Add Tags
Click on Next: Review
Step 4: Review
- Role name:
myRedshiftRole
Click on Create role
Redshift Configuration
Services -> Amazon Redshift
Create Cluster
Click on Create cluster
Under Choose the size of the cluster
Click on Help me choose
to see the cost & performance simulator
Click on I'll choose
Click on Node type, which is analogous to an EC2 instance size.
- Node type:
ra3.4xlarge
Node type:
- RA3 (recommended)
- High performance with scalable managed storage
- DC2
- High performance with fixed local SSD storage
- DS2 (legacy)
- Large workloads with fixed local HDD storage
- Nodes:
2
(2 compute nodes; Redshift provisions a separate leader node automatically)
Database Configuration
- Master user password: input your password
Cluster permissions (optional)
- Available IAM roles:
myRedshiftRole
- Click on
Add IAM role
Click on Create cluster
Wait 1~2 minutes until the Status changes to Available
Redshift Query Editor
Click on EDITOR -> Query Editor on the navigation pane
- Connection:
Create new connection
- Cluster:
examplecluster
- Database name:
dev
- Database user:
awsuser
- Database password: input your password
Click on Connect to database
Create Tables
- Select schema:
public
Copy and paste each of the following SQL queries into your editor, clicking on Run after each one.

```sql
create table users( ...
create table venue( ...
create table category( ...
create table date( ...
create table event( ...
create table listing( ...
create table sales( ...
```
Load Data into Tables from S3
Copy and paste the following SQL query into your editor, replacing the <...> placeholders with your information.

```sql
copy users from 's3://<your Bucket name>/tickitdb/allusers_pipe.txt' ...
```
Click on Run
- Select schema:
public
Click on Preview data
on the right of users
Repeat the above step for the remaining tables, replacing the parameters with your own.

```sql
copy venue from 's3://<your Bucket name>/tickitdb/venue_pipe.txt' ...
```

The following is my example.

```sql
copy users from 's3://aws-labs-bucket/Redshift/tickitdb/allusers_pipe.txt' ...
```
DISTSTYLE DISTKEY Examples
```sql
select "column", type, encoding, distkey, sortkey ...
```

userid is both the distkey and the sortkey.
```sql
select count(*) from users;
```

Notice we have 49990 rows in the users table.
```sql
select slice, col, num_values as rows, minvalue, maxvalue ...
```

Column numbers are zero-based, so userid is col 0. userid is a good choice for the distribution column of the table. The table contains 49990 rows.
- We have 8 slices.
- Each slice contains about the same number of rows (≈ 6200).
- The minvalue and maxvalue columns show the range of values on each slice.
- Each slice includes nearly the entire range of values.
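The per-slice numbers come from the svv_diskusage system view. A full version of the query (following the shape used in the AWS table-design tutorial) might look like:

```sql
-- Rows and value ranges per slice for column 0 (userid) of users
select slice, col, num_values as rows, minvalue, maxvalue
from svv_diskusage
where name = 'users' and col = 0 and num_values > 0
order by slice, col;
```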
A bad distkey
```sql
create table userskey distkey(state) as select * from users;
```

This creates a new table called userskey with the same rows as users, using state as the distkey.

Notice the rows are not evenly distributed across the slices. This is because state is not a good choice for a distkey.
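To see the skew yourself, you can run the same svv_diskusage check against the new table (a sketch; inspect whichever column number you like):

```sql
-- Per-slice row counts for userskey; with state as the distkey,
-- rows pile up on the slices that hash the most common states
select slice, col, num_values as rows, minvalue, maxvalue
from svv_diskusage
where name = 'userskey' and col = 0 and num_values > 0
order by slice, col;
```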
DISTSTYLE EVEN Example
```sql
create table userseven diststyle even as ...
```

This creates a new table called userseven with the same rows as users, but forces the rows to be distributed evenly across the slices.
```sql
select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage ...
```
DISTSTYLE ALL Example
```sql
create table usersall diststyle all as ...
```

This creates a new table called usersall with the same rows as users, but distributes a full copy of the table to every node.

As you can see, all the rows are distributed to the 1st slice of each node. So slice 0 is the 1st node and slice 2 is the 2nd node (we only have 2 nodes).
Warning: Delete the Cluster
After you finish the lab, you must delete the cluster. Otherwise, it will keep running and racking up charges.
Click on CLUSTERS
on the navigation pane.
Select your cluster, then click on Actions
Select Delete
Uncheck Create final snapshot
Click on Delete cluster
Refresh the page
Deleted successfully!
Redshift Spectrum
Redshift Spectrum & QuickSight
Lab Details
We are going to access data stored in an S3 bucket from QuickSight, through Redshift Spectrum.
- Build S3 Data Lake
- Query data directly from files on S3 through Redshift Spectrum
- Use QuickSight to run data analysis and visualization
Task Details
- Switch your region to us-west-2
- Download the tickitdb dataset and upload it to your S3 bucket
- Create IAM role and policy for Redshift Spectrum
- Create Redshift table, load dataset from S3
- Create Redshift external table, load dataset from S3
- Query Redshift
- Run analysis and visualization on QuickSight
Prerequisite
Switch your region to us-west-2
Services -> S3
Skip this step if you already have the tickitdb dataset in your S3 bucket.
Download the sample dataset tickitdb then unzip it.
Upload the folder to your S3 bucket.
Click on Upload
Click on Add folder
Select tickitdb
then click on Choose for Upload
Click on Upload
Upload succeeded
IAM Configuration
Set up an IAM role for Redshift
Services -> IAM -> Roles
IAM Role
Step 1: Create Role
Click on Create role
Select Redshift
Select Redshift - Customizable
Click on Next: Permissions
Step 2: Attach Permissions Policies
Search for S3
then select AmazonS3ReadOnlyAccess
Search for Glue
then select AWSGlueConsoleFullAccess
Click on Next: Tags
Step 3: Add Tags
Click on Next: Review
Step 4: Review
- Role name:
myRedshiftRole
Click on Create role
IAM Policies
Click on Policies on the navigation panel
Click on Create policy
Step 1: Create policy
Click on JSON
tab.
Paste in the following JSON policy document, which grants access to the Data Catalog but denies the administrator permissions for Lake Formation.
```json
{ ...
```
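The JSON document is abbreviated above. Based on the description (Data Catalog access allowed, Lake Formation administration denied), a policy of roughly this shape fits; treat the exact action list as an assumption and verify it against the AWS documentation:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["glue:*"],
            "Resource": "*"
        },
        {
            "Effect": "Deny",
            "Action": ["lakeformation:PutDataLakeSettings"],
            "Resource": "*"
        }
    ]
}
```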
Click on Review policy
Step 2: Review policy
- Name:
mySpectrumPolicy
Click on Create policy
Attach mySpectrumPolicy to myRedshiftRole
Click on Roles
on the navigation panel
Search for myRedshiftRole
Click on myRedshiftRole
Click on Attach policies
Search for mySpectrumPolicy
Select it then click on Attach policy
Redshift Configuration
Services -> Amazon Redshift
Create Cluster
Click on Create cluster
Cluster configuration
- Cluster identifier:
redshift-cluster-1
Under Choose the size of the cluster
Click on Help me choose
to see the cost & performance simulator
Click on I'll choose
Click on Node type, which is analogous to an EC2 instance size.
- Node type:
dc2.large
Node type:
- RA3 (recommended)
- High performance with scalable managed storage
- DC2
- High performance with fixed local SSD storage
- DS2 (legacy)
- Large workloads with fixed local HDD storage
- Click on
I'll choose
- Nodes:
2
(2 compute nodes; Redshift provisions a separate leader node automatically)
Database Configuration
- Master user password: input your password
Cluster permissions (optional)
- Available IAM roles:
myRedshiftRole
- Click on
Add IAM role
Click on Create cluster
Wait 1~2 minutes until the Status changes to Available
Public Access
Select redshift-cluster-1
, then click on Modify publicly accessible setting
Select Yes
and click on Confirm
Security Group
Click on CONFIG -> Subnet groups
on the navigation panel
Notice the VPC ID
Services -> VPC -> Security Groups
Click on the security group ID of your cluster's VPC
Click on Edit inbound rules
- Type:
TCP
- Port range:
0 - 65535
- Source:
Anywhere
Click Save rules
Redshift Query Editor
Click on EDITOR -> Query Editor on the navigation pane
- Connection:
Create new connection
- Cluster:
redshift-cluster-1
- Database name:
dev
- Database user:
awsuser
- Database password: input your password
Click on Connect to database
Create Tables
- Select schema:
public
Copy and paste each of the following SQL queries into your editor, clicking on Run after each one.

```sql
create table users( ...
create table venue( ...
create table category( ...
create table date( ...
create table event( ...
create table listing( ...
create table sales( ...
```
Load Data into Tables from S3
Copy and paste the following SQL query into your editor, replacing the <...> placeholders with your information.

```sql
copy users from 's3://<your Bucket name>/tickitdb/allusers_pipe.txt' ...
```
Click on Run
- Select schema:
public
Click on Preview data
on the right of users
Repeat the above step for the remaining tables, replacing the parameters with your own.

```sql
copy venue from 's3://<your Bucket name>/tickitdb/venue_pipe.txt' ...
```

The following is my example.

```sql
copy venue from 's3://aws-labs-bucket/Redshift/tickitdb/venue_pipe.txt' ...
```
Redshift Spectrum Configuration
In this step, we create a Redshift Spectrum database and table WITHOUT copying data from S3, which means we query the data directly from S3 through the Redshift Spectrum schema.
Create an external schema and an external table
```sql
create external schema spectrum ...
```

```sql
create external table spectrum.sales( ...
```
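Both statements are abbreviated above. The versions in the AWS Redshift Spectrum getting-started guide look roughly like this; the Data Catalog database name and IAM role ARN are placeholders you must adapt:

```sql
-- External schema backed by an AWS Glue Data Catalog database
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::<your AWS account ID>:role/myRedshiftRole'
create external database if not exists;

-- External table over delimited text files in S3
create external table spectrum.sales(
    salesid integer,
    listid integer,
    sellerid integer,
    buyerid integer,
    eventid integer,
    dateid smallint,
    qtysold smallint,
    pricepaid decimal(8,2),
    commission decimal(8,2),
    saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://awssampledbuswest2/tickit/spectrum/sales/';
```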
Query your data in Amazon S3
Get the number of rows in the SPECTRUM.SALES table.
```sql
select count(*) from spectrum.sales;
```
Load the EVENT table by replacing the IAM role ARN in the following COPY command with the role ARN you created.
```sql
copy event from 's3://awssampledbuswest2/tickit/allevents_pipe.txt' ...
```
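A complete version of this COPY command, following the getting-started guide (the role ARN is a placeholder), looks roughly like:

```sql
copy event from 's3://awssampledbuswest2/tickit/allevents_pipe.txt'
iam_role 'arn:aws:iam::<your AWS account ID>:role/myRedshiftRole'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2';
```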
The following example joins the external table SPECTRUM.SALES with the local table EVENT to find the total sales for the top 10 events.
```sql
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) from spectrum.sales, event ...
```
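A full version of the join, as sketched in the getting-started guide, might look like:

```sql
-- Top 10 events by total sales, joining external Spectrum data
-- with the local EVENT table
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
```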
View the query plan for the previous query. Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan steps that were executed against the data on Amazon S3.
```sql
explain ...
```
QuickSight Configuration
Switch your region to us-west-2
Services -> QuickSight
Connect to Redshift
Click on Datasets on the navigation panel
Click on New dataset at the top right of the page
Click on Redshift
- Data source name:
RedshiftSpectrum
- Instance ID:
redshift-cluster-1
- Connection type:
Public network
- Database name:
dev
- Username:
awsuser
- Password
Click on Validate connection
Click on Create data source
- Schema:
spectrum
- Tables:
sales
Click on Edit/Preview data
Edit/Preview data
Click on Save & visualize at the top center of the page
Line Chart
- Visual types:
Line chart
- X axis:
saletime
- Value:
pricepaid (count)
Gauge Chart
- Visual types:
Gauge chart
- Value:
qtysold (count)
Donut Chart
- Visual types:
Donut chart
- Group/Color:
eventid
- Value:
qtysold (Average)
Warning: Delete the Cluster
After you finish the lab, you must delete the cluster. Otherwise, it will keep running and racking up charges.
Click on CLUSTERS
on the navigation pane.
Select your cluster, then click on Actions
Select Delete
Uncheck Create final snapshot
Click on Delete cluster
Refresh the page
Deleted successfully!