AWS Redshift Lab



Redshift Introduction



Prerequisite

Services -> S3


Skip this step if you already have the tickitdb dataset in your S3 bucket.


Download the sample dataset tickitdb then unzip it.


Upload the folder to your S3 bucket.
Click on Upload


Click on Add folder

Select tickitdb then click on Choose for Upload

Click on Upload

Upload succeeded


IAM Configuration

Set up an IAM role for Redshift

Services -> IAM -> Roles


Step 1: Create Role

Click on Create role


Select Redshift

Select Redshift - Customizable
Click on Next: Permissions


Step 2: Attach Permissions Policies

Search for S3 then select AmazonS3ReadOnlyAccess
Click on Next: Tags


Step 3: Add Tags

Click on Next: Review


Step 4: Review
  • Role name: myRedshiftRole

Click on Create role


Redshift Configuration

Services -> Amazon Redshift


Create Cluster

Click on Create cluster


Cluster configuration
  • Cluster identifier: examplecluster

Under Choose the size of the cluster
Click on Help me choose to see the cost & performance simulator


Click on I'll choose
Click on Node type; this determines the underlying EC2 instance size of each node.

  • Node type: ra3.4xlarge

Node type:

  • RA3 (recommended)
    • High performance with scalable managed storage
  • DC2
    • High performance with fixed local SSD storage
  • DS2 (legacy)
    • Large workloads with fixed local HDD storage


  • Nodes: 2 (compute nodes; Redshift provisions the leader node automatically and does not count it here)


Database Configuration
  • Master user password: input your password


Cluster permissions (optional)
  • Available IAM roles: myRedshiftRole
  • Click on Add IAM role


Click on Create cluster


Wait 1~2 minutes until the Status changes to Available


Redshift Query Editor

Click on EDITOR -> Query Editor on the navigation pane


  • Connection: Create new connection
  • Cluster: examplecluster
  • Database name: dev
  • Database user: awsuser
  • Database password: input your password

Click on Connect to database


Create Tables
  • Select schema: public

Copy and paste the following sql query to your editor.

create table users(
userid integer not null distkey sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports boolean,
liketheatre boolean,
likeconcerts boolean,
likejazz boolean,
likeclassical boolean,
likeopera boolean,
likerock boolean,
likevegas boolean,
likebroadway boolean,
likemusicals boolean);

Click on Run


Copy and paste the following sql query to your editor.

create table venue(
venueid smallint not null distkey sortkey,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer);

Click on Run


Copy and paste the following sql query to your editor.

create table category(
catid smallint not null distkey sortkey,
catgroup varchar(10),
catname varchar(10),
catdesc varchar(50));

Click on Run


Copy and paste the following sql query to your editor.

create table date(
dateid smallint not null distkey sortkey,
caldate date not null,
day character(3) not null,
week smallint not null,
month character(5) not null,
qtr character(5) not null,
year smallint not null,
holiday boolean default('N'));

Click on Run


Copy and paste the following sql query to your editor.

create table event(
eventid integer not null distkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null sortkey,
eventname varchar(200),
starttime timestamp);

Click on Run


Copy and paste the following sql query to your editor.

create table listing(
listid integer not null distkey,
sellerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
numtickets smallint not null,
priceperticket decimal(8,2),
totalprice decimal(8,2),
listtime timestamp);

Click on Run


Copy and paste the following sql query to your editor.

create table sales(
salesid integer not null,
listid integer not null distkey,
sellerid integer not null,
buyerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
qtysold smallint not null,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp);

Click on Run


Load Data into Tables from S3

Copy and paste the following sql query to your editor.
Replace the <> fields with your information.

copy users from 's3://<your Bucket name>/tickitdb/allusers_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

Click on Run

  • Select schema: public

Click on Preview data on the right of users
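If you prefer SQL to the Preview data button, a quick sanity check works too. The sketch below assumes the COPY above succeeded; stl_load_errors is the system table Redshift writes to when a load fails, so it is the first place to look if the count comes back as 0.

select count(*) from users;

-- if the COPY failed, the most recent entries here explain why
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 5;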


Repeat the above step, replacing the following parameters with your own.

copy venue from 's3://<your Bucket name>/tickitdb/venue_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy category from 's3://<your Bucket name>/tickitdb/category_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy date from 's3://<your Bucket name>/tickitdb/date2008_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy event from 's3://<your Bucket name>/tickitdb/allevents_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region '<your aws-region>';

copy listing from 's3://<your Bucket name>/tickitdb/listings_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy sales from 's3://<your Bucket name>/tickitdb/sales_tab.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region '<your aws-region>';

The following is my example.

copy venue from 's3://aws-labs-bucket/Redshift/tickitdb/venue_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy category from 's3://aws-labs-bucket/Redshift/tickitdb/category_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy date from 's3://aws-labs-bucket/Redshift/tickitdb/date2008_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy event from 's3://aws-labs-bucket/Redshift/tickitdb/allevents_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-east-1';

copy listing from 's3://aws-labs-bucket/Redshift/tickitdb/listings_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy sales from 's3://aws-labs-bucket/Redshift/tickitdb/sales_tab.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-east-1';


Query Data from Redshift Cluster
Find total sales on a given calendar date.
SELECT sum(qtysold)
FROM sales, date
WHERE sales.dateid = date.dateid
AND caldate = '2008-01-05';


Find top 10 buyers by quantity.
SELECT firstname, lastname, total_quantity
FROM (SELECT buyerid, sum(qtysold) total_quantity
FROM sales
GROUP BY buyerid
ORDER BY total_quantity desc limit 10) Q, users
WHERE Q.buyerid = userid
ORDER BY Q.total_quantity desc;


Find events in the 99.9th percentile in terms of all-time gross sales.
SELECT eventname, total_price
FROM (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile
FROM (SELECT eventid, sum(pricepaid) total_price
FROM sales
GROUP BY eventid)) Q, event E
WHERE Q.eventid = E.eventid
AND percentile = 1
ORDER BY total_price desc;


Warning: Delete the Cluster

After you finish the lab, you must delete the cluster. Otherwise, it will keep running and rack up charges.

Click on CLUSTERS on the navigation pane.


Select your cluster, then click on Actions
Select Delete


Uncheck Create final snapshot
Click on Delete cluster


Refresh the page


Deleted successfully!


Redshift Data Storage and Retrieval Patterns


Prerequisite

Services -> S3


Skip this step if you already have the tickitdb dataset in your S3 bucket.


Download the sample dataset tickitdb then unzip it.


Upload the folder to your S3 bucket.
Click on Upload


Click on Add folder

Select tickitdb then click on Choose for Upload

Click on Upload

Upload succeeded


Redshift Configuration

Services -> Amazon Redshift


Take a look at the cost calculator

Click on Create cluster

Under Choose the size of the cluster
Click on Help me choose


GB-level storage: DC2

TB-level storage: RA3

Notice the difference between DC2 and RA3: the former reports Storage capacity, while the latter reports Managed storage capacity.


IAM Configuration

Set up an IAM role for Redshift

Services -> IAM -> Roles


Step 1: Create Role

Click on Create role


Select Redshift

Select Redshift - Customizable
Click on Next: Permissions


Step 2: Attach Permissions Policies

Search for S3 then select AmazonS3ReadOnlyAccess
Click on Next: Tags


Step 3: Add Tags

Click on Next: Review


Step 4: Review
  • Role name: myRedshiftRole

Click on Create role


Redshift Configuration

Services -> Amazon Redshift


Create Cluster

Click on Create cluster

Under Choose the size of the cluster
Click on Help me choose to see the cost & performance simulator


Click on I'll choose
Click on Node type; this determines the underlying EC2 instance size of each node.

  • Node type: ra3.4xlarge

Node type:

  • RA3 (recommended)
    • High performance with scalable managed storage
  • DC2
    • High performance with fixed local SSD storage
  • DS2 (legacy)
    • Large workloads with fixed local HDD storage


  • Nodes: 2 (compute nodes; Redshift provisions the leader node automatically and does not count it here)


Database Configuration
  • Master user password: input your password


Cluster permissions (optional)
  • Available IAM roles: myRedshiftRole
  • Click on Add IAM role


Click on Create cluster


Wait 1~2 minutes until the Status changes to Available


Redshift Query Editor

Click on EDITOR -> Query Editor on the navigation pane


  • Connection: Create new connection
  • Cluster: examplecluster
  • Database name: dev
  • Database user: awsuser
  • Database password: input your password

Click on Connect to database


Create Tables
  • Select schema: public

Copy and paste the following sql query to your editor.

create table users(
userid integer not null distkey sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports boolean,
liketheatre boolean,
likeconcerts boolean,
likejazz boolean,
likeclassical boolean,
likeopera boolean,
likerock boolean,
likevegas boolean,
likebroadway boolean,
likemusicals boolean);

Click on Run


Copy and paste the following sql query to your editor.

create table venue(
venueid smallint not null distkey sortkey,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer);

Click on Run


Copy and paste the following sql query to your editor.

create table category(
catid smallint not null distkey sortkey,
catgroup varchar(10),
catname varchar(10),
catdesc varchar(50));

Click on Run


Copy and paste the following sql query to your editor.

create table date(
dateid smallint not null distkey sortkey,
caldate date not null,
day character(3) not null,
week smallint not null,
month character(5) not null,
qtr character(5) not null,
year smallint not null,
holiday boolean default('N'));

Click on Run


Copy and paste the following sql query to your editor.

create table event(
eventid integer not null distkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null sortkey,
eventname varchar(200),
starttime timestamp);

Click on Run


Copy and paste the following sql query to your editor.

create table listing(
listid integer not null distkey,
sellerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
numtickets smallint not null,
priceperticket decimal(8,2),
totalprice decimal(8,2),
listtime timestamp);

Click on Run


Copy and paste the following sql query to your editor.

create table sales(
salesid integer not null,
listid integer not null distkey,
sellerid integer not null,
buyerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
qtysold smallint not null,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp);

Click on Run


Load Data into Tables from S3

Copy and paste the following sql query to your editor.
Replace the <> fields with your information.

copy users from 's3://<your Bucket name>/tickitdb/allusers_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

Click on Run

  • Select schema: public

Click on Preview data on the right of users


Repeat the above step, replacing the following parameters with your own.

copy venue from 's3://<your Bucket name>/tickitdb/venue_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy category from 's3://<your Bucket name>/tickitdb/category_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy date from 's3://<your Bucket name>/tickitdb/date2008_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy event from 's3://<your Bucket name>/tickitdb/allevents_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region '<your aws-region>';

copy listing from 's3://<your Bucket name>/tickitdb/listings_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy sales from 's3://<your Bucket name>/tickitdb/sales_tab.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region '<your aws-region>';

The following is my example.

copy users from 's3://aws-labs-bucket/Redshift/tickitdb/allusers_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy venue from 's3://aws-labs-bucket/Redshift/tickitdb/venue_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy category from 's3://aws-labs-bucket/Redshift/tickitdb/category_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy date from 's3://aws-labs-bucket/Redshift/tickitdb/date2008_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy event from 's3://aws-labs-bucket/Redshift/tickitdb/allevents_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-east-1';

copy listing from 's3://aws-labs-bucket/Redshift/tickitdb/listings_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy sales from 's3://aws-labs-bucket/Redshift/tickitdb/sales_tab.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-east-1';


DISTSTYLE DISTKEY Examples

select "column", type, encoding, distkey, sortkey
from pg_table_def where tablename = 'users';

userid is distkey & sortkey.
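If the pg_table_def query comes back empty, the most likely reason is that the table's schema is not on your search_path. A minimal sketch to check and fix it (we created the tables in the public schema):

show search_path;
set search_path to '$user', public;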


select count(*) from users;

Notice we have 49990 rows in table users.


select slice, col, num_values as rows, minvalue, maxvalue
from svv_diskusage
where name='users' and col=0 and rows>0
order by slice, col;

Column numbers are zero-based, so userid is col 0.
The table contains 49990 rows, and the output shows why userid is a good choice for the distribution column:

  • We have 8 slices.
    • Each slice contains about the same number of rows (≈ 6200).
  • The minvalue and maxvalue columns show the range of values on each slice.
    • Each slice includes nearly the entire range of values.


A bad distkey
create table userskey distkey(state) as select * from users;

Create a new table called userskey with the same rows as the users table, but distributed on the state column (distkey state).


Notice that the rows are not evenly distributed across the slices.
This is because state is a poor choice for a distkey: it has only a handful of distinct values, and some states have far more users than others.
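To see the skew yourself, re-run the per-slice query against userskey; svv_table_info also summarizes it in a single number (skew_rows is the ratio of rows in the fullest slice to rows in the emptiest slice). A sketch:

select slice, col, num_values as rows, minvalue, maxvalue
from svv_diskusage
where name='userskey' and col=0 and rows>0
order by slice, col;

select "table", diststyle, skew_rows
from svv_table_info
where "table" in ('users', 'userskey');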


DISTSTYLE EVEN Example

create table userseven diststyle even as
select * from users;

Create a new table called userseven with the same rows as the users table, but force the rows to be distributed evenly across the slices in round-robin fashion.


select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage
where name = 'userseven' and col=0 and rows>0
order by slice, col;


DISTSTYLE ALL Example

create table usersall diststyle all as
select * from users;

Create a new table called usersall with the same rows as the users table, but distribute a full copy of the table to every node.

As you can see, all the rows are placed on the first slice of each node.
In our 2-node cluster the output shows slice 0 (the first node) and slice 2 (the second node).
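You can confirm this with the same per-slice query as before, just pointed at usersall:

select slice, col, num_values as rows, minvalue, maxvalue
from svv_diskusage
where name='usersall' and col=0 and rows>0
order by slice, col;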


Warning: Delete the Cluster

After you finish the lab, you must delete the cluster. Otherwise, it will keep running and rack up charges.

Click on CLUSTERS on the navigation pane.


Select your cluster, then click on Actions
Select Delete


Uncheck Create final snapshot
Click on Delete cluster


Refresh the page


Deleted successfully!


Redshift Spectrum



Redshift Spectrum & QuickSight

Getting started with Amazon Redshift Spectrum


Lab Details

We are going to access data stored in an S3 bucket from QuickSight, querying it through Redshift Spectrum.

  • Build S3 Data Lake
  • Query data directly from files on S3 through Redshift Spectrum
  • Use QuickSight to run data analysis and visualization


Task Details

  1. Switch your region to us-west-2
  2. Download the tickitdb dataset and upload it to your S3 bucket
  3. Create IAM role and policy for Redshift Spectrum
  4. Create Redshift table, load dataset from S3
  5. Create a Redshift external table that references the dataset on S3
  6. Query Redshift
  7. Run analysis and visualization on QuickSight

Prerequisite

Switch your region to us-west-2


Services -> S3


Skip this step if you already have the tickitdb dataset in your S3 bucket.


Download the sample dataset tickitdb then unzip it.


Upload the folder to your S3 bucket.
Click on Upload


Click on Add folder

Select tickitdb then click on Choose for Upload

Click on Upload

Upload succeeded


IAM Configuration


Set up an IAM role for Redshift

Services -> IAM -> Roles


IAM Role
Step 1: Create Role

Click on Create role


Select Redshift

Select Redshift - Customizable
Click on Next: Permissions


Step 2: Attach Permissions Policies

Search for S3 then select AmazonS3ReadOnlyAccess
Search for Glue then select AWSGlueConsoleFullAccess
Click on Next: Tags


Step 3: Add Tags

Click on Next: Review


Step 4: Review
  • Role name: myRedshiftRole

Click on Create role


IAM Policies

Click on Policies on the navigation panel
Click on Create policy


Step 1: Create policy

Click on JSON tab.
Paste in the following JSON policy document, which grants access to the Data Catalog but denies the administrator permissions for Lake Formation.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "RedshiftPolicyForLF",
"Effect": "Allow",
"Action": [
"glue:*",
"lakeformation:GetDataAccess"
],
"Resource": "*"
}
]
}

Click on Review policy


Step 2: Review policy
  • Name: mySpectrumPolicy

Click on Create policy


Attach mySpectrumPolicy to myRedshiftRole

Click on Roles on the navigation panel
Search for myRedshiftRole
Click on myRedshiftRole

Click on Attach policies

Search for mySpectrumPolicy
Select it then click on Attach policy


Redshift Configuration

Services -> Amazon Redshift


Create Cluster

Click on Create cluster


Cluster configuration
  • Cluster identifier: redshift-cluster-1

Under Choose the size of the cluster
Click on Help me choose to see the cost & performance simulator


Click on I'll choose
Click on Node type; this determines the underlying EC2 instance size of each node.

  • Node type: dc2.large

Node type:

  • RA3 (recommended)
    • High performance with scalable managed storage
  • DC2
    • High performance with fixed local SSD storage
  • DS2 (legacy)
    • Large workloads with fixed local HDD storage


  • Click on I'll choose
  • Nodes: 2 (compute nodes; Redshift provisions the leader node automatically and does not count it here)


Database Configuration
  • Master user password: input your password


Cluster permissions (optional)
  • Available IAM roles: myRedshiftRole
  • Click on Add IAM role


Click on Create cluster


Wait 1~2 minutes until the Status changes to Available


Public Access

Select redshift-cluster-1, then click on Modify publicly accessible setting

Select Yes and click on Confirm


Security Group

Click on CONFIG -> Subnet groups on the navigation panel
Notice the VPC ID


Services -> VPC -> Security Groups

Click on the security group ID of your cluster's VPC

Click on Edit inbound rules

  • Type: TCP
  • Port range: 0 - 65535
  • Source: Anywhere

Click on Save rules

Note: opening every port to Anywhere is acceptable only for this short-lived lab; Redshift itself only needs inbound TCP on port 5439.


Redshift Query Editor

Click on EDITOR -> Query Editor on the navigation pane


  • Connection: Create new connection
  • Cluster: redshift-cluster-1
  • Database name: dev
  • Database user: awsuser
  • Database password: input your password

Click on Connect to database


Create Tables
  • Select schema: public

Copy and paste the following sql query to your editor.

create table users(
userid integer not null distkey sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports boolean,
liketheatre boolean,
likeconcerts boolean,
likejazz boolean,
likeclassical boolean,
likeopera boolean,
likerock boolean,
likevegas boolean,
likebroadway boolean,
likemusicals boolean);

Click on Run


Copy and paste the following sql query to your editor.

create table venue(
venueid smallint not null distkey sortkey,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer);

Click on Run


Copy and paste the following sql query to your editor.

create table category(
catid smallint not null distkey sortkey,
catgroup varchar(10),
catname varchar(10),
catdesc varchar(50));

Click on Run


Copy and paste the following sql query to your editor.

create table date(
dateid smallint not null distkey sortkey,
caldate date not null,
day character(3) not null,
week smallint not null,
month character(5) not null,
qtr character(5) not null,
year smallint not null,
holiday boolean default('N'));

Click on Run


Copy and paste the following sql query to your editor.

create table event(
eventid integer not null distkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null sortkey,
eventname varchar(200),
starttime timestamp);

Click on Run


Copy and paste the following sql query to your editor.

create table listing(
listid integer not null distkey,
sellerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
numtickets smallint not null,
priceperticket decimal(8,2),
totalprice decimal(8,2),
listtime timestamp);

Click on Run


Copy and paste the following sql query to your editor.

create table sales(
salesid integer not null,
listid integer not null distkey,
sellerid integer not null,
buyerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
qtysold smallint not null,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp);

Click on Run


Load Data into Tables from S3

Copy and paste the following sql query to your editor.
Replace the <> fields with your information.

copy users from 's3://<your Bucket name>/tickitdb/allusers_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

Click on Run

  • Select schema: public

Click on Preview data on the right of users


Repeat the above step, replacing the following parameters with your own.

copy venue from 's3://<your Bucket name>/tickitdb/venue_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy category from 's3://<your Bucket name>/tickitdb/category_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy date from 's3://<your Bucket name>/tickitdb/date2008_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy event from 's3://<your Bucket name>/tickitdb/allevents_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region '<your aws-region>';

copy listing from 's3://<your Bucket name>/tickitdb/listings_pipe.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '|' region '<your aws-region>';

copy sales from 's3://<your Bucket name>/tickitdb/sales_tab.txt'
credentials 'aws_iam_role=<your iam-role-arn>'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region '<your aws-region>';

The following is my example.

copy venue from 's3://aws-labs-bucket/Redshift/tickitdb/venue_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy category from 's3://aws-labs-bucket/Redshift/tickitdb/category_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy date from 's3://aws-labs-bucket/Redshift/tickitdb/date2008_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy event from 's3://aws-labs-bucket/Redshift/tickitdb/allevents_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-east-1';

copy listing from 's3://aws-labs-bucket/Redshift/tickitdb/listings_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' region 'us-east-1';

copy sales from 's3://aws-labs-bucket/Redshift/tickitdb/sales_tab.txt'
credentials 'aws_iam_role=arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-east-1';


Redshift Spectrum Configuration

In this step, we create a Redshift Spectrum database and external table WITHOUT copying data from S3, which means we query the data directly in S3 through the Redshift Spectrum schema.


Create an external schema and an external table
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::542892269888:role/myRedshiftRole'
create external database if not exists;
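Optionally, confirm the external schema was registered before moving on; svv_external_schemas lists every external schema the cluster knows about:

select schemaname, databasename from svv_external_schemas;

Then create the external table that points at the sample data on S3.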


create external table spectrum.sales(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://awssampledbuswest2/tickit/spectrum/sales/'
table properties ('numRows'='172000');
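Similarly, svv_external_tables shows the external table together with the S3 location it points at, which is a quick way to confirm that nothing was copied into the cluster:

select schemaname, tablename, location
from svv_external_tables
where schemaname = 'spectrum';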


Query your data in Amazon S3

Get the number of rows in the SPECTRUM.SALES table.

select count(*) from spectrum.sales;


Load the EVENT table by replacing the IAM role ARN in the following COPY command with the role ARN you created.

copy event from 's3://awssampledbuswest2/tickit/allevents_pipe.txt' 
iam_role 'arn:aws:iam::542892269888:role/myRedshiftRole'
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2';


The following example joins the external table SPECTRUM.SALES with the local table EVENT to find the total sales for the top 10 events.

select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;


View the query plan for the previous query. Note the S3 Seq Scan, S3 HashAggregate, and S3 Query Scan steps that were executed against the data on Amazon S3.

explain
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;
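Beyond the query plan, Redshift records what Spectrum actually scanned. After running the join above, a query along these lines against the SVL_S3QUERY_SUMMARY system view (a sketch; check the column names against the Redshift system tables documentation) shows how many rows and bytes were read from S3:

select query, segment, elapsed, s3_scanned_rows, s3_scanned_bytes
from svl_s3query_summary
order by query desc, segment
limit 10;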


QuickSight Configuration

Switch your region to us-west-2

Services -> QuickSight


Connect to Redshift

Click on Datasets on the navigation panel


Click on New dataset at the top right of the page


Click on Redshift


  • Data source name: RedshiftSpectrum
  • Instance ID: redshift-cluster-1
  • Connection type: Public network
  • Database name: dev
  • Username: awsuser
  • Password

Click on Validate connection

Click on Create data source


  • Schema: spectrum
  • Tables: sales

Click on Edit/Preview data


Edit/Preview data

Click on Save & visualize at the top middle of the page


Line Chart
  • Visual types: Line chart
  • X axis: saletime
  • Value: pricepaid (count)


Gauge Chart
  • Visual types: Gauge chart
  • Value: qtysold (count)


Donut Chart
  • Visual types: Donut chart
  • Group/Color: eventid
  • Value: qtysold (Average)


Warning: Delete the Cluster

After you finish the lab, you must delete the cluster. Otherwise, it will keep running and rack up charges.

Click on CLUSTERS on the navigation pane.
Select your cluster, then click on Actions
Select Delete


Uncheck Create final snapshot
Click on Delete cluster


Refresh the page


Deleted successfully!