Python Machine Learning

Getting Started

Machine Learning is making the computer learn from studying data and statistics.

Machine Learning is a step in the direction of artificial intelligence (AI).

Machine Learning is a program that analyses data and learns to predict the outcome.

W3Schools - Python Machine Learning
For more information about statistics, see Enterprise Analytics
Numpy.org
Pandas.org
SciPy.org
Matplotlib.org


Data Set

In the mind of a computer, a data set is any collection of data. It can be anything from an array to a complete database.

Example of an array:

[99,86,87,88,111,86,103,87,94,78,77,85,86]

Example of a database:

Carname Color Age Speed AutoPass
BMW red 5 99 Y
Volvo black 7 86 Y
VW gray 8 87 N
VW white 7 88 Y
Ford white 2 111 Y
VW white 17 86 Y
Tesla red 2 103 Y
BMW black 9 87 Y
Volvo gray 4 94 N
Ford white 11 78 N
Toyota gray 12 77 N
VW white 9 85 N
Toyota blue 6 86 Y

By looking at the array, we can guess that the average value is probably around 80 or 90, and we are also able to determine the highest value and the lowest value, but what else can we do?

And by looking at the database we can see that the most popular color is white, and the oldest car is 17 years old, but what if we could predict whether a car has an AutoPass, just by looking at the other values?

That is what Machine Learning is for! Analyzing data and predicting the outcome!

In Machine Learning it is common to work with very large data sets. In this tutorial we will try to make it as easy as possible to understand the different concepts of machine learning, and we will work with small easy-to-understand data sets.


Data Types (by the type of measurement scale)

  • Categorical (Qualitative)
    • Nominal: According to Name
      • Examples: Data containing names, genders, races, etc.
    • Ordinal: According to Order
      • Examples: Data containing ranks, data that has been organized alphabetically, etc.
  • Numerical (Quantitative)
    • Discrete: A discrete data set is one in which the measurements take a countable set of isolated values. For example, the number of chairs, the number of patients, the number of accidents, etc., are all examples of discrete data.
    • Continuous: A continuous data set is one in which the measurements can take any real value within a certain range. For example, the amount of rainfall in Charlotte in January during the last 30 years, or customer waiting times at a local bank, are continuous data. (See the short sketch after this list.)
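
A minimal sketch (hypothetical car data, my own illustration, not from the source) of how these four measurement scales can be represented in pandas:

Python
import pandas as pd

cars = pd.DataFrame({
    'Color': ['red', 'black', 'gray'],          # Nominal: names with no natural order
    'Condition': ['good', 'fair', 'excellent'], # Ordinal: categories with a natural order
    'Age': [5, 7, 8],                           # Discrete: countable whole numbers
    'Speed': [99.0, 86.5, 87.2]                 # Continuous: any real value within a range
})

# Tell pandas about the ordering of the ordinal column
cars['Condition'] = pd.Categorical(cars['Condition'],
                                   categories = ['fair', 'good', 'excellent'],
                                   ordered = True)

print(cars.dtypes)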

Descriptive Statistics for Numerical Data

  • Measures of location
  • Measures of dispersion
  • Measures of shape
  • Measures of association

Measures of Location

  • Measures of Central Tendency
  • Data Profile

Measures of Central Tendency

  • Mean - The average value
  • Median - The midpoint value
  • Mode - The most common value

Mean

The mean value is the average value.

  • Population mean: $\mu = \displaystyle \frac{\sum^N_{i=1}x_i}{N}$
  • Sample mean: $\bar x = \displaystyle \frac{\sum^n_{i=1}x_i}{n}$

To calculate the mean, find the sum of all values, and divide the sum by the number of values:

Example: We have registered the speed of 13 cars:

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Python customized algorithm
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

total = 0  # avoid shadowing the built-in sum()

for i in speed:
    total = total + i

mean = total / len(speed)

print(mean)
# 89.76923076923077

Use the NumPy mean() method to find the average speed:

Python Mean
import numpy as np

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

mean = np.mean(speed)

print(mean)
# 89.76923076923077
R Mean
speed <- c(99,86,87,88,111,86,103,87,94,78,77,85,86)
mean(speed)
# 89.76923

Median

The median value is the value in the middle, after you have sorted all the values:

It is important that the numbers are sorted before you can find the median.

Example: We have registered the speed of 13 cars:

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Python customized algorithm
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

speed.sort()

n = len(speed)
mid = n // 2

if n % 2 == 0:
    print((speed[mid - 1] + speed[mid]) / 2)
else:
    print(speed[mid])
# 87

Use the NumPy median() method to find the middle value:

Python Median
import numpy as np

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
median = np.median(speed)

print(median)
# 87.0
R Median
speed <- c(99,86,87,88,111,86,103,87,94,78,77,85,86)
median(speed)
# 87

If there are two numbers in the middle, divide the sum of those numbers by two.

Example: We have registered the speed of 12 cars:

speed = [77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103]
Python customized algorithm
speed = [77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103]
speed.sort()

n = len(speed)
mid = n // 2

if n % 2 == 0:
    print((speed[mid - 1] + speed[mid]) / 2)
else:
    print(speed[mid])
# 86.5

Use the NumPy median() method to find the middle value:

Python Median
import numpy as np

speed = [77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103]
median = np.median(speed)

print(median)
# 86.5
R Median
speed <- c(77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103)
median(speed)
# 86.5

Mode

The mode value is the value that appears most often:

Use the SciPy mode() method to find the number that appears the most:

Example: We have registered the speed of 13 cars:

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Python Mode
from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
mode = stats.mode(speed)

print(mode)
# ModeResult(mode=array([86]), count=array([3]))

print(mode[0][0]) # in newer SciPy versions (1.11+), mode() returns scalars; use mode.mode there
# 86
R Mode
# Create the function.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Create the vector with numbers.
speed <- c(99,86,87,88,111,86,103,87,94,78,77,85,86)

# Calculate the mode using the user function.
getmode(speed)
# 86

Data Profiles (Fractiles)

Describe the location and spread of data over its range

  • Quartiles – a division of a data set into four equal parts; shows the points below which 25%, 50%, 75% and 100% of the observations lie (25% is the first quartile, 75% is the third quartile, etc.)
  • Deciles – a division of a data set into 10 equal parts; shows the points below which 10%, 20%, etc. of the observations lie
  • Percentiles – a division of a data set into 100 equal parts; shows the points below which “k” percent of the observations lie

Example: Let’s say we have an array of the ages of all the people that live in a street.

What is the 75th percentile? The answer is 43, meaning that 75% of the people are 43 or younger.

Python Percentiles & Quartiles
import numpy as np

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

# Percentiles
k = np.percentile(ages, 70)
print(k)
# 41.0

# Quartiles
np.percentile(ages, 25) # Q1
# 11.0

np.percentile(ages, 50) # Q2
# 31.0

np.percentile(ages, 75) # Q3
# 43.0
R Percentiles & Quartiles
ages <- c(5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31)
quantile(ages, c(.70)) # Percentile
# 70%
# 41

quantile(ages, c(.25, .50, .75)) # Quartiles
# 25% 50% 75%
# 11 31 43

Measures of Dispersion

  • Dispersion – the degree of variation in the data.
    • Example:
      • {48, 49, 50, 51, 52} vs. {10, 30, 50, 70, 90}
    • Both means are 50, but the second data set has larger dispersion (see the quick check after this list)
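
A quick check with NumPy (population standard deviation, the default of np.std) makes the difference concrete:

Python
import numpy as np

a = [48, 49, 50, 51, 52]
b = [10, 30, 50, 70, 90]

print(np.mean(a), np.mean(b)) # 50.0 50.0 -> identical means
print(np.std(a), np.std(b))   # ~1.41 vs ~28.28 -> very different dispersion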

Variance

  • Population variance: $\displaystyle \sigma^2 = \frac{\sum^N_{i=1}(x_i - \mu)^2}{N}$
  • Sample variance: $\displaystyle s^2 = \frac{\sum^n_{i=1}(x_i - \bar x)^2}{n - 1}$

Standard Deviation

  • Population SD: $\displaystyle \sigma = \sqrt{\frac{\sum^N_{i=1}(x_i - \mu)^2}{N}}$
  • Sample SD: $\displaystyle s = \sqrt{\frac{\sum^n_{i=1}(x_i - \bar x)^2}{n - 1}}$
  • The standard deviation has the same units of measurement as the original data, unlike the variance

Example 1: Calculate the variance and standard deviation of [1, 2, 3]:

Python Variance
import numpy as np

# Variance - Population
x = [1, 2, 3]
np.var(x)
# 0.6666666666666666

# Variance - Sample
x = [1, 2, 3]
np.var(x, ddof = 1) # ddof: Degree of freedom
# 1.0
R Variance
# Variance - Population
x <- c(1:3)
varp <- function(x) mean((x-mean(x))^2)
varp(x)
# 0.6666667

# Variance - Sample
x <- c(1:3)
var(x)
# 1
Python Standard Deviation
import numpy as np

# Standard Deviation - Population
x = [1, 2, 3]
np.std(x)
# 0.816496580927726

# Standard Deviation - Sample
x = [1, 2, 3]
np.std(x, ddof = 1) # ddof: Degree of freedom
# 1.0
R Standard Deviation
# Standard Deviation - Population
x <- c(1, 2, 3)
sdp <- function(x) sqrt(mean((x-mean(x))^2))
sdp(x)
# 0.8164966

# Standard Deviation - Sample
x <- c(1, 2, 3)
sd(x)
# 1

Standard Deviation: most of the values lie within 0.82 of the mean value, which is 2.


Example 2: This time we have registered the speed of 7 cars:

speed = [86,87,88,86,87,85,86]
Python Variance
import numpy as np

# Variance - Population
speed = [86,87,88,86,87,85,86]
np.var(speed)
# 0.8163265306122449

# Variance - Sample
speed = [86,87,88,86,87,85,86]
np.var(speed, ddof = 1) # ddof: Degree of freedom
# 0.9523809523809524
R Variance
# Variance - Population
speed <- c(86,87,88,86,87,85,86)
varp <- function(x) mean((x-mean(x))^2)
varp(speed)
# 0.8163265

# Variance - Sample
speed <- c(86,87,88,86,87,85,86)
var(speed)
# 0.952381
Python Standard Deviation
import numpy as np

# Standard Deviation - Population
speed = [86,87,88,86,87,85,86]
np.std(speed)
# 0.9035079029052513

# Standard Deviation - Sample
speed = [86,87,88,86,87,85,86]
np.std(speed, ddof = 1) # ddof: Degree of freedom
# 0.9759000729485332
R Standard Deviation
# Standard Deviation - Population
speed <- c(86,87,88,86,87,85,86)
sdp <- function(x) sqrt(mean((x-mean(x))^2))
sdp(speed)
# 0.9035079

# Standard Deviation - Sample
speed <- c(86,87,88,86,87,85,86)
sd(speed)
# 0.9759001

Standard Deviation: most of the values lie within 0.9 of the mean value, which is 86.4.


Example 3: Let us do the same with a selection of numbers with a wider range:

speed = [32,111,138,28,59,77,97]
Python Variance
import numpy as np

# Variance - Population
speed = [32,111,138,28,59,77,97]
np.var(speed)
# 1432.2448979591834

# Variance - Sample
speed = [32,111,138,28,59,77,97]
np.var(speed, ddof = 1) # ddof: Degree of freedom
# 1670.9523809523807
R Variance
# Variance - Population
speed <- c(32,111,138,28,59,77,97)
varp <- function(x) mean((x-mean(x))^2)
varp(speed)
# 1432.245

# Variance - Sample
speed <- c(32,111,138,28,59,77,97)
var(speed)
# 1670.952
Python Standard Deviation
import numpy as np

# Standard Deviation - Population
speed = [32,111,138,28,59,77,97]
np.std(speed)
# 37.84501153334721

# Standard Deviation - Sample
speed = [32,111,138,28,59,77,97]
np.std(speed, ddof = 1) # ddof: Degree of freedom
# 40.877284412646354
R Standard Deviation
# Standard Deviation - Population
speed <- c(32,111,138,28,59,77,97)
sdp <- function(x) sqrt(mean((x-mean(x))^2))
sdp(speed)
# 37.84501

# Standard Deviation - Sample
speed <- c(32,111,138,28,59,77,97)
sd(speed)
# 40.87728

Standard Deviation: most of the values lie within 37.85 of the mean value, which is 77.4.

As you can see, a higher standard deviation indicates that the values are spread out over a wider range.


Measures of Shape


Uniform Distribution

Earlier in this tutorial we have worked with very small amounts of data in our examples, just to understand the different concepts.

In the real world, the data sets are much bigger, but it can be difficult to gather real world data, at least at an early stage of a project.

How Can We Get Big Data Sets?
To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods to create random data sets of any size.

Create an array containing 250 random floats between 0 and 5:

Python
import numpy as np

x = np.random.uniform(0.0, 5.0, 250)

print(x)
R
x <- runif(250, min = 0, max = 5)
print(x)

Histogram & Relative Frequency Distribution

  • A graphical representation of a frequency distribution
  • Relative frequency – fraction or proportion of observations that fall within a cell

Histogram & Relative Frequency Distribution in matplotlib

Python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(0.0, 5.0, 250)

plt.hist(x, 5)
plt.title("Uniform Random Number Histogram")
plt.xlabel("Range")
plt.ylabel("Frequency")
plt.show()

plt.hist(x, bins = 'auto', density = True) # 'normed' was removed in Matplotlib 3.x; use 'density' instead
plt.title("Uniform Random Number Histogram")
plt.xlabel("Range")
plt.ylabel("Relative Frequency")
plt.show()

Python Histogram & Relative Frequency Distribution in plotly
import numpy as np
import pandas as pd
import plotly.express as px

x = np.random.uniform(0.0, 5.0, 250)
df = pd.DataFrame(data = x, columns = ['x']) # Convert x from array to dataframe

fig = px.histogram(df, x = 'x',)
fig.update_layout(title = 'Uniform Random Number Histogram',
xaxis_title="Range",
yaxis_title="Frequency",)

fig.show()

fig = px.histogram(df, x = 'x', histnorm = 'probability')
fig.update_layout(title = 'Uniform Random Number Histogram',
xaxis_title="Range",
yaxis_title="Frequency")

fig.show()

R Histogram & Relative Frequency Distribution
x <- runif(250, min = 0, max = 5)

hist(x,
freq = TRUE,
main = paste("Uniform Random Number Histogram"),
xlab = "Range",
ylab = "Frequency")

hist(x,
freq = FALSE,
main = paste("Uniform Random Number Histogram"),
xlab = "Range",
ylab = "Relative Frequency")

Python Normal Distribution in matplotlib
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(100)
x = np.random.normal(200, 25, size = 10000) # mu = 200, sigma = 25

plt.hist(x, bins = 'auto', density = True) # 'normed' was removed in Matplotlib 3.x; use 'density' instead
plt.title("Normal Distribution")
plt.xlabel("Range")
plt.ylabel("Relative Frequency")
plt.show()

Python Normal Distribution in plotly
import numpy as np
import pandas as pd
import plotly.express as px

np.random.seed(100)
df = pd.DataFrame(data = np.random.normal(200, 25, size = 10000), columns = ['x']) # mu = 200, sigma = 25

fig = px.histogram(df, x = 'x', histnorm = 'probability')
fig.update_layout(title = 'Normal Distribution',
xaxis_title = 'Range',
yaxis_title = 'Relative Frequency')

fig.show()

Python Normal Distribution Density Curve
import plotly.figure_factory as ff
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(data = np.random.normal(200, 25, size = 10000), columns = ['x']) # mu = 200, sigma = 25

hist_data = [df['x']]
group_labels = ['distplot'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels, curve_type = 'normal') # override default 'kde'
fig.update_layout(title = 'Normal Distribution Density Curve',
xaxis_title = 'Range',
yaxis_title = 'Relative Frequency')

fig.show()

R Normal Distribution
set.seed(100)
x <- rnorm(n = 10000, mean = 200, sd = 25)
hist(x,
freq = FALSE,
main = paste("Normal Distribution"),
xlab = "Range",
ylab = "Relative Frequency")

  • Cumulative relative frequency – proportion or percentage of observations that fall below the upper limit of a cell

https://matplotlib.org/3.1.1/gallery/statistics/histogram_cumulative.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cumfreq.html

Python Cumulative Relative Frequency
import numpy as np
import pandas as pd
import plotly.express as px

np.random.seed(100)
df = pd.DataFrame(data = np.random.normal(200, 25, size = 10000), columns = ['x']) # mu = 200, sigma = 25

fig = px.histogram(df, x = 'x', histnorm = 'probability', cumulative = True)
fig.update_layout(title = 'Normal Distribution Cumulative Relative Frequency',
xaxis_title = 'Range',
yaxis_title = 'Cumulative Relative Frequency')
fig.show()


Unsupervised Learning

https://campus.datacamp.com/courses/unsupervised-learning-in-python/clustering-for-dataset-exploration?ex=1

What is unsupervised machine learning?

  • Unsupervised learning finds patterns in data
    • E.g. clustering customers by their purchases
    • Compressing the data using purchase patterns (dimension reduction)

Supervised learning vs Unsupervised learning

  • Supervised learning finds patterns for a prediction task
    • E.g. classify tumors as benign or cancerous (labels)
  • Unsupervised learning finds patterns in data
    • but without a specific prediction task in mind

k-means clustering

  • Finds clusters of samples
  • Number of clusters must be specified
  • Implemented in sklearn (“scikit-learn“); a minimal sketch follows this list
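
Before the iris example, a minimal sketch of the sklearn k-means API on made-up 2-D points (the toy data and the random_state are my own illustrative choices):

Python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one group near (1, 1)
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])  # another group near (8, 8)

model = KMeans(n_clusters = 2, n_init = 10, random_state = 0) # the number of clusters must be specified
model.fit(points)

print(model.labels_)          # cluster label for each sample, e.g. [0 0 0 1 1 1] (the numbering is arbitrary)
print(model.cluster_centers_) # the two centroids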

Example: Iris dataset

iris: homepage
iris: K-means Clustering

  • Description: Measurements of many iris plants
    • Columns are measurements (the features)
    • Rows represent iris plants (the samples)
  • Classification: 3 species of iris: setosa, versicolor, virginica
  • Features: Petal length, petal width, sepal length, sepal width (the features of the dataset)
  • Targets: Try to classify flowers into different species based on the four features

Dimensions

Iris data is 4-dimensional

  • Iris samples are points in 4 dimensional space
  • Dimension = number of features
  • Dimension too high to visualize!
    • … but unsupervised learning gives insight
Python
# import data and modules
import plotly.express as px
import numpy as np
import pandas as pd

df = px.data.iris() # df is the original dataset, don't modify it!
df

Take a glance at the dataset: petal_length differs greatly between setosa and virginica. As an analogy, the average height of a human is somewhere between 1 and 2 meters; if something is 5 meters tall, you had better not classify it as a human (you had better just run away). Likewise, you would not classify an iris plant with a 5 cm petal_length as a setosa.


Clusters

Python
# Kmeans
from sklearn.cluster import KMeans

model = KMeans(n_clusters = 3) # Define the number of clusters
model.fit(df.iloc[:, 0:4]) # skip species & species_id
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
# n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
# random_state=None, tol=0.0001, verbose=0)

labels = model.predict(df.iloc[:, 0:4])

print(labels) # == print(model.labels_) Show the results
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
# 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
# 2 1]

In the original dataset, rows 1~50 are setosa, rows 51~100 are versicolor, and rows 101~150 are virginica. After observing the results, we can replace the numeric labels with the species names to make them more readable.

Python
# Convert labels from ndarray to dataframe
species = labels.copy()
species = pd.DataFrame({'labels': species})
species.replace([0, 1, 2], ['setosa', 'versicolor', 'virginica'], inplace = True) # replace 0, 1, 2 with species names (k-means numbers clusters arbitrarily, so check the printed labels above before mapping)

pd.options.display.max_rows = None # Unlock the limit of numbers of rows
species


Predictive testing

  • New samples can be assigned to existing clusters
  • k-means remembers the mean of each cluster (the “centroids”)
  • Finds the nearest centroid to each new sample
  • In other words, checking the accuracy through testing dataset
Python
# Predictive testing
new_samples = np.array([[5.7, 4.4, 1.5, 0.4], [ 6.5, 3, 5.5, 1.8], [ 5.8, 2.7, 5.1, 1.9]]) # Create new_samples
new_labels = model.predict(new_samples) # test

new_labels = pd.DataFrame(new_labels, columns = ['species'])
new_labels.replace([0, 1, 2], ['setosa', 'versicolor', 'virginica']) # without inplace = true, it will show the results directly

Here are three new samples. After using our model, we predict that a flower with the following features should be classified as setosa:

  • Sepal length: 5.7
  • Sepal width: 4.4
  • Petal length: 1.5
  • Petal width: 0.4

Scatter plots

  • Scatter plot of sepal length vs. petal length
  • Each point represents an iris sample
  • Color points by cluster labels
  • Plot with plotly.express
Python
# Original Scatter plot
fig = px.scatter(df, x = df['sepal_length'], y = df['petal_length'], color = df['species'],
hover_name = 'species',
labels = {'sepal_length': 'Sepal Length', 'petal_length': 'Petal Length', 'species': 'Species'},
height = 800)
fig.update_layout(title = 'Original Sepal Length vs. Petal Length',
xaxis_title = 'Sepal Length',
yaxis_title = 'Petal Length')
fig.show()

Python
# Predictive Scatter plot
fig = px.scatter(df, x = df['sepal_length'], y = df['petal_length'], color = species['labels'], # Notice the colors now come from the predicted labels, not from df
hover_name = 'species',
labels = {'sepal_length': 'Sepal Length', 'petal_length': 'Petal Length', 'species': 'Species'},
height = 800)
fig.update_layout(title = 'Predictive Sepal Length vs. Petal Length',
xaxis_title = 'Sepal Length',
yaxis_title = 'Petal Length')
fig.show()


Evaluating a clustering

  • Can check correspondence with e.g. iris species
  • … but what if there are no species to check against?
  • Measure quality of a clustering
  • Informs choice of how many clusters to look for

Cross tabulation with pandas

  • Clusters vs species is a “cross-tabulation”
  • Use the pandas library
  • Given the species of each sample as a list species
Python
# Create new dataframe df2 combined labels and species from df
df2 = pd.DataFrame({'labels': labels, 'species': df['species']})
df2

Python
# Create the cross table
ct = pd.crosstab(df2['labels'], df2['species'])
ct # show results

Iris: clusters vs species

  • k-means found 3 clusters amongst the iris samples
  • Do the clusters correspond to the species?

Notice the distribution of the results: setosa is matched with 100% accuracy, versicolor is also fairly good, but the accuracy for virginica could be improved.
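
As a rough way to quantify this (my own addition, not part of the original material), one can take, for each species, the share of its samples that fall in its most common cluster, using the cross table ct computed above:

Python
# Fraction of each species' samples that land in its majority cluster (uses ct from the cell above)
purity = ct.max(axis = 0) / ct.sum(axis = 0)
print(purity)
# setosa is typically 1.0, while versicolor and virginica overlap and score lower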


Measuring clustering quality

  • Using only samples and their cluster labels
  • A good clustering has tight clusters
  • … and samples in each cluster bunched together

Inertia measures clustering quality

  • Measures how spread out the clusters are (lower is better)
  • Distance from each sample to centroid of its cluster
  • After fit(), available as attribute inertia_
  • k-means attempts to minimize the inertia when choosing clusters
Python
# model inertia

# We ran the following two lines earlier
# model = KMeans(n_clusters = 3) # Define the number of clusters
# model.fit(df.iloc[:, 0:4])
model.inertia_
# 78.94084142614602

The number of clusters - Elbow Chart

  • Clusterings of the iris dataset with different numbers of clusters
  • More clusters means lower inertia
  • What is the best number of clusters?
Python
# Define number of clusters & inertias
df.shape # 150 rows, 6 columns
# (150, 6)

ks = range(1, 5) # try k = 1, 2, 3, 4 clusters
inertias = [] # define an empty list

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters = k)

    # Fit model to samples (only the four measurement columns; species & species_id are not features)
    model.fit(df.iloc[:, 0:4])

    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

print(inertias)
# [680.8244, 152.36870647733906, 78.94084142614602, 57.31787321428571]

How many clusters to choose?

  • A good clustering has tight clusters (so low inertia)
    • … but not too many clusters!
  • Choose an “elbow” in the inertia plot
  • Where inertia begins to decrease more slowly
    • E.g. for iris dataset, 3 is a good choice
Python
# Draw the Elbow Chart
fig = px.line(x = ks, y = inertias, render_mode = 'svg',
title = 'Elbow Chart',
height = 800)
fig.show()


Example: Piedmont wines dataset

https://archive.ics.uci.edu/ml/datasets/Wine

  • 178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera
  • Features measure chemical composition e.g. alcohol content
  • … also visual properties like “color intensity”

Data Set Information:

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

I think that the initial data set had around 30 variables, but for some reason I only have the 13 dimensional version. I had a list of what the 30 or so variables were, but a.) I lost it, and b.), I would not know which 13 variables are included in the set.

The attributes are (donated by Riccardo Leardi, riclea ‘@’ anchem.unige.it):

  1. Alcohol
  2. Malic acid
  3. Ash
  4. Alcalinity of ash
  5. Magnesium
  6. Total phenols
  7. Flavanoids
  8. Nonflavanoid phenols
  9. Proanthocyanins
  10. Color intensity
  11. Hue
  12. OD280/OD315 of diluted wines
  13. Proline

Data Preprocessing
Python
# import data and modules
import plotly.express as px
import numpy as np
import pandas as pd

# data source https://archive.ics.uci.edu/ml/datasets/Wine
df = pd.read_csv('https://raw.githubusercontent.com/ZacksAmber/Code/master/Python/DataCamp/Python/Machine%20Learning/Unsupervised%20Learning%20in%20Python/wine.data', sep = ",", header = None) # df is the original dataset, don't modify it!
# rename the columns
df.columns = ['Classification',
'Alcohol',
'Malic acid',
'Ash',
'Alcalinity of ash',
'Magnesium',
'Total phenols',
'Flavanoids',
'Nonflavanoid phenols',
'Proanthocyanins',
'Color intensity',
'Hue',
'OD280/OD315 of diluted wines',
'Proline']

df.head()

Python
# Replace the Classification value from number to wine name
df2 = df.copy()
df2['Classification'].replace([1, 2, 3], ['Barolo', 'Grignolino', 'Barbera'], inplace = True)
df2.head()

K-means clustering without standardization
Python
# k-means clustering without standardization
from sklearn.cluster import KMeans

model = KMeans(n_clusters = 3) # Define the number of clusters
labels = model.fit_predict(df2.iloc[:, 1:14]) # skip the Classification

print(labels)

Python
df3 = pd.DataFrame({'labels': labels, 'varieties': df2['Classification']})

ct = pd.crosstab(df3['labels'], df3['varieties'])
print(ct)


Data Profile

Feature variances

  • The wine features have very different variances!
  • Variance of a feature measures spread of its values
Python
# descriptive analytics
df.iloc[:, 1:14].describe()

Python
# descriptive analytics
df.iloc[:, 1:14].var()

Python
fig = px.scatter(df, x = 'OD280/OD315 of diluted wines', y = 'Proline', color = 'Classification',
height = 800)

fig.update_layout(title = 'OD280/OD315 of diluted wines vs. Proline',
xaxis_title = 'OD280/OD315 of diluted wines',
yaxis_title = 'Proline',
xaxis = dict(
range = [-1800, 1800], # sets the range of xaxis
constrain = "domain", # meanwhile compresses the xaxis by decreasing its "domain"
),
yaxis = dict(
range = [0, 1800]))
fig.show()


Standardization: StandardScaler

  • In kmeans: feature variance = feature influence
  • StandardScaler transforms each feature to have mean 0 and variance 1
  • Features are said to be “standardized”
Python
# Standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df) # note: this also scales the Classification column; strictly only the 13 feature columns need scaling
# StandardScaler(copy = True, with_mean = True, with_std = True)

df_scaled = scaler.transform(df)

df_scaled = pd.DataFrame(df_scaled)

df_scaled.columns = ['Classification',
'Alcohol',
'Malic acid',
'Ash',
'Alcalinity of ash',
'Magnesium',
'Total phenols',
'Flavanoids',
'Nonflavanoid phenols',
'Proanthocyanins',
'Color intensity',
'Hue',
'OD280/OD315 of diluted wines',
'Proline']

df_scaled.var()

Python
fig = px.scatter(df2, x = df_scaled['OD280/OD315 of diluted wines'], y = df_scaled['Proline'], color = 'Classification',
height = 800)

fig.update_layout(title = 'Standardized OD280/OD315 of diluted wines vs. Proline',
xaxis_title = 'Standardized OD280/OD315 of diluted wines',
yaxis_title = 'Standardized Proline',
)

fig.show()

Similar methods

  • StandardScaler and KMeans have similar methods, BUT
  • Use fit() / transform() with StandardScaler
  • Use fit() / predict() with KMeans (see the short sketch after this list)
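
A short sketch (my own, reusing the wine dataframe df loaded above) contrasting the two call patterns:

Python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
scaler.fit(df.iloc[:, 1:14])                 # StandardScaler: fit() learns each feature's mean and std
scaled = scaler.transform(df.iloc[:, 1:14])  # ... then transform() rescales the data

kmeans = KMeans(n_clusters = 3)
kmeans.fit(scaled)                           # KMeans: fit() learns the centroids
labels = kmeans.predict(scaled)              # ... then predict() assigns each sample to a cluster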

K-means clustering with standardization

StandardScaler, then KMeans

  • Need to perform two steps: StandardScaler, then KMeans
  • Use sklearn pipeline to combine multiple steps
  • Data flows from one step into the next

Standardization rescales data to have a mean ($\mu$) of 0 and standard deviation ($\sigma$) of 1 (unit variance).

$$X_{changed} = \frac{X - \mu}{\sigma}$$

For most applications standardization is recommended.

Python
# K-means clustering with standardization
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaler = StandardScaler()
kmeans = KMeans(n_clusters = 3)

from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(scaler, kmeans) # use pipline to combine scaler & kmeans
pipeline.fit(df.iloc[:, 1:14])

Python
scaled_labels = pipeline.predict(df.iloc[:, 1:14])

print(scaled_labels)

Python
df_scaled = pd.DataFrame({'labels': scaled_labels, 'varieties': df2['Classification']})

ct = pd.crosstab(df_scaled['labels'], df_scaled['varieties'])
print(ct)

The result without feature standardization was much worse:


K-means clustering with normalization
  • StandardScaler is a “preprocessing” step
  • MaxAbsScaler and Normalizer are other examples

Normalization rescales the values into a range of $[0,1]$. This might be useful in some cases where all parameters need to be on the same positive scale; however, information about outliers is lost. (Strictly, the formula below is min-max scaling, which scikit-learn implements as MinMaxScaler, as sketched below; the Normalizer used in the code that follows instead rescales each sample, i.e. each row, to unit norm.)

$$X_{changed} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
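
As an aside (not part of the original walkthrough), the $[0,1]$ formula above is exactly what scikit-learn's MinMaxScaler computes per feature; a minimal sketch on the wine feature columns loaded earlier as df:

Python
from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler() # rescales each feature column to the range [0, 1]
wine_minmax = minmax.fit_transform(df.iloc[:, 1:14])

print(wine_minmax.min(axis = 0)) # all zeros
print(wine_minmax.max(axis = 0)) # all ones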

Python
# Another option: Normalization

# Import Normalizer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

# Create a normalizer: normalizer
normalizer = Normalizer()

# Create a KMeans model with 3 clusters: kmeans
kmeans = KMeans(n_clusters = 3)

df_norm = normalizer.fit_transform(df) # fit is a no-op for the stateless Normalizer; note every column is normalized, including Classification

df_norm = pd.DataFrame(df_norm)

df_norm.columns = ['Classification',
'Alcohol',
'Malic acid',
'Ash',
'Alcalinity of ash',
'Magnesium',
'Total phenols',
'Flavanoids',
'Nonflavanoid phenols',
'Proanthocyanins',
'Color intensity',
'Hue',
'OD280/OD315 of diluted wines',
'Proline']
Python
df_norm.head()

Python
# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the wine features
pipeline.fit(df.iloc[:, 1:14])

Python
norm_labels = pipeline.predict(df.iloc[:, 1:14])
print(norm_labels)

Python
df_norm = pd.DataFrame({'labels': norm_labels, 'varieties': df2['Classification']})

ct = pd.crosstab(df_norm['labels'], df_norm['varieties'])
print(ct)


Supervised Machine Learning

ML: The art and science of:

  • Giving computers the ability to learn to make decisions from data
  • without being explicitly programmed!

Examples:

  • Learning to predict whether an email is spam or not
  • Clustering wikipedia entries into different categories

Difference between Supervised learning and Unsupervised learning:

  • Supervised learning: Uses labeled data
  • Unsupervised learning: Uses unlabeled data

Unsupervised learning

  • Uncovering hidden patterns from unlabeled data
    • Example: Grouping customers into distinct categories (Clustering)

Reinforcement learning

  • Software agents interact with an environment
    • Learn how to optimize their behavior
    • Given a system of rewards and punishments
    • Draws inspiration from behavioral psychology
  • Applications
    • Economics
    • Genetics
    • Game playing
  • AlphaGo: First computer to defeat the world champion in Go

Supervised learning

  • Predictor variables/features and a target variable
  • Aim: Predict the target variable, given the predictor variables
    • Classification: Target variable consists of categories
    • Regression: Target variable is continuous

Naming conventions

  • Features = predictor variables = independent variables
  • Target variable = dependent variable = response variable

Supervised learning

  • Automate time-consuming or expensive manual tasks
    • Example: Doctor’s diagnosis
  • Make predictions about the future
    • Example: Will a customer click on an ad or not?
  • Need labeled data
    • Historical data with labels
    • Experiments to get labeled data
    • Crowd-sourcing labeled data

Supervised learning in Python

  • We will use scikit-learn/sklearn (a minimal example follows this list)
    • Integrates well with the SciPy stack
  • Other libraries
    • TensorFlow
    • keras
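
To contrast with the unsupervised examples above, here is a minimal supervised-learning sketch with scikit-learn; the built-in iris loader, the k-nearest-neighbors classifier, and the split parameters are my own illustrative choices, not from the source material.

Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled data: features (predictor variables) X and a target variable y
X, y = load_iris(return_X_y = True)

# Hold out part of the labeled data to evaluate the predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train) # supervised: learn from labeled examples

print(knn.score(X_test, y_test)) # accuracy on unseen samples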

Matplotlib

Pyplot tutorial
Parts of Figure
matplotlib.pyplot.plot


Intro to pyplot

matplotlib.pyplot is a collection of command style functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes (please note that “axes” here and in most places in the documentation refers to the axes part of a figure and not the strict mathematical term for more than one axis).

Generating visualizations with pyplot is very quick:

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers') # Set for Y label
plt.show() # Execute the plot function

You may be wondering why the x-axis ranges from 0-3 and the y-axis from 1-4. If you provide a single list or array to the plot() command, matplotlib assumes it is a sequence of y values, and automatically generates the x values for you. Since python ranges start with 0, the default x vector has the same length as y but starts with 0. Hence the x data are [0,1,2,3].

Therefore, matplotlib automatically generates the x values [0, 1, 2, 3] for the corresponding y values [1, 2, 3, 4], giving the coordinates (0, 1), (1, 2), (2, 3), (3, 4).

plot() is a versatile command, and will take an arbitrary number of arguments. For example, to plot x versus y, you can issue the command:

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.show()


Formatting the style of your plot

For every x, y pair of arguments, there is an optional third argument which is the format string that indicates the color and line type of the plot. The letters and symbols of the format string are from MATLAB, and you concatenate a color string with a line style string. The default format string is ‘b-‘, which is a solid blue line. For example, to plot the above with red circles, you would issue

Python
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'ro') # 'ro' for red circles
plt.axis([0, 6, 0, 20]) # Set for Axis
plt.show()

