Python Machine Learning
Getting Started
W3Schools - Python Machine Learning
For more information about statistics, see Enterprise Analytics
Numpy.org
Pandas.org
SciPy.org
Matplotlib.org
Data Set
In the mind of a computer, a data set is any collection of data. It can be anything from an array to a complete database.
Example of an array:
```python
[99,86,87,88,111,86,103,87,94,78,77,85,86]
```
Example of a database:
Carname | Color | Age | Speed | AutoPass |
---|---|---|---|---|
BMW | red | 5 | 99 | Y |
Volvo | black | 7 | 86 | Y |
VW | gray | 8 | 87 | N |
VW | white | 7 | 88 | Y |
Ford | white | 2 | 111 | Y |
VW | white | 17 | 86 | Y |
Tesla | red | 2 | 103 | Y |
BMW | black | 9 | 87 | Y |
Volvo | gray | 4 | 94 | N |
Ford | white | 11 | 78 | N |
Toyota | gray | 12 | 77 | N |
VW | white | 9 | 85 | N |
Toyota | blue | 6 | 86 | Y |
By looking at the array, we can guess that the average value is probably around 80 or 90, and we can also determine the highest and lowest values, but what else can we do?
And by looking at the database, we can see that the most popular color is white and the oldest car is 17 years old, but what if we could predict whether a car has an AutoPass just by looking at the other values?
That is what Machine Learning is for! Analyzing data and predicting the outcome!
In Machine Learning it is common to work with very large data sets. In this tutorial we will try to make it as easy as possible to understand the different concepts of machine learning, and we will work with small easy-to-understand data sets.
Descriptive Statistics for Numerical Data
- Measures of location
- Measures of dispersion
- Measures of shape
- Measures of association
Measures of Location
- Measures of Central Tendency
- Data Profile
Measures of Central Tendency
- Mean - The average value
- Median - The midpoint value
- Mode - The most common value
Mean
The mean value is the average value.
- Population mean: $\mu = \displaystyle \frac{\sum^N_{i=1}x_i}{N}$
- Sample mean: $\bar x = \displaystyle \frac{\sum^n_{i=1}x_i}{n}$
To calculate the mean, find the sum of all values, and divide the sum by the number of values:
Example: We have registered the speed of 13 cars:
```python
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
```
Use the NumPy `mean()` method to find the average speed:
```python
import numpy as np
```
```r
speed <- c(99,86,87,88,111,86,103,87,94,78,77,85,86)
```
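As a minimal sketch of the full calculation in Python (assuming the `speed` list above and NumPy installed):

```python
import numpy as np

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# sum of all values divided by the number of values
print(np.mean(speed))  # 89.76923076923077
```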
Median
The median value is the value in the middle, after you have sorted all the values:
It is important that the numbers are sorted before you can find the median.
Example: We have registered the speed of 13 cars:
```python
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
```
Use the NumPy `median()` method to find the middle value:
```python
import numpy as np
```
```r
speed <- c(99,86,87,88,111,86,103,87,94,78,77,85,86)
```
If there are two numbers in the middle, divide the sum of those numbers by two.
Example: We have registered the speed of 12 cars:
```python
speed = [77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103]
```
Use the NumPy `median()` method to find the middle value:
```python
import numpy as np
```
```r
speed <- c(77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103)
```
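A short sketch covering both cases, assuming the two `speed` lists above (13 and 12 values):

```python
import numpy as np

speed_13 = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
speed_12 = [77, 78, 85, 86, 86, 86, 87, 87, 94, 98, 99, 103]

print(np.median(speed_13))  # 87.0 -> the single middle value after sorting
print(np.median(speed_12))  # 86.5 -> average of the two middle values (86 and 87)
```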
Mode
The mode is the value that appears most often:
Use the SciPy `mode()` method to find the number that appears most often:
Example: We have registered the speed of 13 cars:
```python
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
```
```python
from scipy import stats
```
```python
# Create the function.
```
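A minimal sketch with SciPy (assuming the `speed` list above; depending on the SciPy version, `stats.mode` returns the mode and count either as scalars or as length-1 arrays):

```python
from scipy import stats

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

result = stats.mode(speed)
print(result.mode, result.count)  # 86 appears 3 times
```

The standard library's `statistics.mode()` gives the same answer (86) without SciPy.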
Data Profiles (Fractiles)
Describe the location and spread of data over its range
- Quartiles – a division of a data set into four equal parts; shows the points below which 25%, 50%, and 75% of the observations lie (the 25% point is the first quartile, the 75% point is the third quartile, etc.)
- Deciles – a division of a data set into 10 equal parts; shows the points below which 10%, 20%, etc. of the observations lie
- Percentiles – a division of a data set into 100 equal parts; shows the points below which “k” percent of the observations lie
Measures of Dispersion
- Dispersion – the degree of variation in the data.
- Example: {48, 49, 50, 51, 52} vs. {10, 30, 50, 70, 90}
- Both means are 50, but the second data set has larger dispersion
Variance
- Population variance: $\displaystyle \sigma^2 = \frac{\sum^N_{i=1}(x_i - \mu)^2}{N}$
- Sample variance: $\displaystyle s^2 = \frac{\sum^n_{i=1}(x_i - \bar x)^2}{n - 1}$
Standard Deviation
- Population SD: $\displaystyle \sigma = \sqrt{\frac{\sum^N_{i=1}(x_i - \mu)^2}{N}}$
- Sample SD: $\displaystyle s = \sqrt{\frac{\sum^n_{i=1}(x_i - \bar x)^2}{n - 1}}$
- The standard deviation has the same units of measurement as the original data, unlike the variance
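A quick sketch of these formulas with NumPy, using the two example sets from above (`ddof=1` selects the sample versions with the $n-1$ denominator):

```python
import numpy as np

a = [48, 49, 50, 51, 52]
b = [10, 30, 50, 70, 90]

print(np.mean(a), np.mean(b))                # both means are 50.0
print(np.var(a, ddof=1), np.var(b, ddof=1))  # sample variances: 2.5 vs 1000.0
print(np.std(a, ddof=1), np.std(b, ddof=1))  # sample SDs: ~1.58 vs ~31.62
```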
Measures of Shape
Uniform Distribution
Earlier in this tutorial we have worked with very small amounts of data in our examples, just to understand the different concepts.
In the real world, the data sets are much bigger, but it can be difficult to gather real world data, at least at an early stage of a project.
How Can we Get Big Data Sets?
To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods to create random data sets, of any size.
Create an array containing 250 random floats between 0 and 5:
```python
import numpy as np
```
```r
x <- runif(250, min = 0, max = 5)
```
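A minimal sketch with NumPy's random module (values change on every run unless you seed the generator):

```python
import numpy as np

# 250 random floats drawn uniformly from the interval [0, 5)
x = np.random.uniform(0.0, 5.0, 250)

print(x[:5])             # first five values
print(x.min(), x.max())  # everything lies between 0 and 5
```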
Histogram & Relative Frequency Distribution
- A graphical representation of a frequency distribution
- Relative frequency – fraction or proportion of observations that fall within a cell
Histogram & Relative Frequency Distribution in matplotlib
```python
import numpy as np
```
```r
x <- runif(250, min = 0, max = 5)
```
```python
import numpy as np
import plotly.figure_factory as ff
```
```r
set.seed(100)
```
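A sketch of a relative-frequency histogram with matplotlib (assuming uniform data generated as above; the `weights` argument turns raw counts into proportions):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(0.0, 5.0, 250)

# weight every observation by 1/n so bar heights are relative frequencies
weights = np.ones_like(x) / len(x)
plt.hist(x, bins=10, weights=weights, edgecolor='black')
plt.xlabel('value')
plt.ylabel('relative frequency')
plt.show()
```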
- Cumulative relative frequency – proportion or percentage of observations that fall below the upper limit of a cell
https://matplotlib.org/3.1.1/gallery/statistics/histogram_cumulative.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.cumfreq.html
```python
import numpy as np
```
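A cumulative version in the spirit of the matplotlib example linked above (a sketch, assuming the same `x`; with `density=True` and `cumulative=True` the last bin reaches 1):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(0.0, 5.0, 250)

plt.hist(x, bins=10, density=True, cumulative=True, histtype='step')
plt.xlabel('value')
plt.ylabel('cumulative relative frequency')
plt.show()
```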
Unsupervised Learning
What is unsupervised machine learning?
- Unsupervised learning finds patterns in data
- E.g. clustering customers by their purchases
- Compressing the data using purchase patterns (dimension reduction)
Supervised learning vs Unsupervised learning
- Supervised learning finds patterns for a prediction task
- E.g. classify tumors as benign or cancerous (labels)
- Unsupervised learning finds patterns in data
- but without a specific prediction task in mind
k-means clustering
- Finds clusters of samples
- Number of clusters must be specified
- Implemented in sklearn (“scikit-learn”)
Example: Iris dataset
- Description: Measurements of many iris plants
- Columns are measurements (the features)
- Rows represent iris plants (the samples)
- Classification: 3 species of iris: setosa, versicolor, virginica
- Features: petal length, petal width, sepal length, sepal width (the features of the dataset)
- Targets: try to classify flowers into the different species based on the four features
Dimensions
Iris data is 4-dimensional
- Iris samples are points in 4 dimensional space
- Dimension = number of features
- Dimension too high to visualize!
- … but unsupervised learning gives insight
```python
# import data and modules
```
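A minimal sketch of this step (using scikit-learn's built-in copy of the iris data; the original notes may load it differently, e.g. from a CSV with column names like `petal_length`):

```python
# import data and modules
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# four feature columns plus the true species name for each sample
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = [iris.target_names[t] for t in iris.target]
print(df.head())
```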
Take a glance at the dataset: petal_length differs hugely between setosa and virginica. By way of analogy, the average height of a human is somewhere between 1 and 2 meters; if something is 5 meters tall, you had better not classify it as a human (you had better just run away). Likewise, you would not classify an iris plant with a 5 cm petal_length as a setosa.
Clusters
```python
# Kmeans
```
In the original dataset of 150 rows, rows 1~50 are setosa, rows 51~100 are versicolor, and rows 101~150 are virginica. After observing the results, we can replace the numeric cluster labels with strings to make them more readable.
```python
# Convert labels from ndarray to dataframe
```
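A sketch of the clustering and label-conversion steps (`df` and `iris` are carried over from the loading sketch above; the other variable names are assumptions):

```python
# KMeans
import pandas as pd
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, random_state=42)       # number of clusters must be specified
labels = model.fit_predict(df[iris.feature_names])  # one integer label (0, 1 or 2) per sample

# Convert labels from ndarray to dataframe
labels_df = pd.DataFrame({'labels': labels})
print(labels_df['labels'].value_counts())
```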
Predictive testing
- New samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the “centroids”)
- Finds the nearest centroid to each new sample
- In other words, we can check the accuracy using a testing dataset
```python
# Predictive testing
```
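A sketch of assigning new samples to the existing clusters (`model` is the fitted KMeans object from above; the three rows are illustrative measurements in the order sepal length, sepal width, petal length, petal width):

```python
# Predictive testing
import numpy as np

new_samples = np.array([
    [5.7, 4.4, 1.5, 0.4],   # short, narrow petals -> expected setosa
    [6.5, 3.0, 5.5, 1.8],   # long, wide petals    -> expected virginica
    [5.8, 2.7, 4.1, 1.0],   # in between           -> expected versicolor
])

new_labels = model.predict(new_samples)  # each sample goes to its nearest centroid
print(new_labels)
```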
Here are three new samples. After applying our model, we predict that a flower with the following features should be classified as setosa:
- sepal length: 5.7
- sepal width: 4.4
- petal length: 1.5
- petal width: 0.4
Scatter plots
- Scatter plot of sepal length vs. petal length
- Each point represents an iris sample
- Color points by cluster labels
- Plotted with plotly.express
```python
# Original Scatter plot
```
```python
# Predictive Scatter plot
```
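A sketch of both plots with plotly.express (assuming `df`, `labels`, and the sklearn column names from the earlier sketches):

```python
import plotly.express as px

# Original scatter plot: one point per iris sample, colored by true species
fig = px.scatter(df, x='sepal length (cm)', y='petal length (cm)', color='species')
fig.show()

# Predictive scatter plot: same points, colored by k-means cluster label
fig = px.scatter(df, x='sepal length (cm)', y='petal length (cm)',
                 color=labels.astype(str))
fig.show()
```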
Evaluating a clustering
- Can check correspondence with e.g. iris species
- … but what if there are no species to check against?
- Measure quality of a clustering
- Informs choice of how many clusters to look for
Cross tabulation with pandas
- Clusters vs species is a “cross-tabulation”
- Use the `pandas` library
- Given the species of each sample as a list `species`
```python
# Create a new dataframe df2 combining labels and species from df
```
```python
# Create the cross table
```
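A sketch of the cross-tabulation (assuming `labels` from k-means and the `species` column built earlier):

```python
import pandas as pd

# Create a new dataframe combining cluster labels and true species
df2 = pd.DataFrame({'labels': labels, 'species': df['species']})

# Create the cross table: rows = cluster label, columns = species
ct = pd.crosstab(df2['labels'], df2['species'])
print(ct)
```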
Iris: clusters vs species
- k-means found 3 clusters amongst the iris samples
- Do the clusters correspond to the species?
Notice this is the distribution of the results: setosa is clustered with 100% accuracy, and versicolor is also good enough, but the accuracy for virginica should be improved.
Measuring clustering quality
- Using only samples and their cluster labels
- A good clustering has tight clusters
- … and samples in each cluster bunched together
Inertia measures clustering quality
- Measures how spread out the clusters are (lower is better)
- Distance from each sample to centroid of its cluster
- After `fit()`, available as the attribute `inertia_`
- k-means attempts to minimize the inertia when choosing clusters
```python
# model inertia
```
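A one-line sketch (assuming the fitted `model` from above):

```python
# model inertia: sum of squared distances from each sample to its cluster centroid
print(model.inertia_)
```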
The number of clusters - Elbow Chart
- Clusterings of the iris dataset with different numbers of clusters
- More clusters means lower inertia
- What is the best number of clusters?
```python
# Define number of clusters & inertias
```
How many clusters to choose?
- A good clustering has tight clusters (so low inertia)
- … but not too many clusters!
- Choose an “elbow” in the inertia plot
- Where inertia begins to decrease more slowly
- E.g. for iris dataset, 3 is a good choice
```python
# Draw the Elbow Chart
```
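A sketch of the elbow chart, computing the inertia for k = 1..10 and plotting it (plotly.express is used here to match the rest of the notes; matplotlib would work just as well):

```python
# Define number of clusters & inertias
import plotly.express as px
from sklearn.cluster import KMeans

ks = list(range(1, 11))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(df[iris.feature_names])
    inertias.append(km.inertia_)

# Draw the Elbow Chart: pick the k where inertia starts decreasing more slowly
fig = px.line(x=ks, y=inertias, markers=True,
              labels={'x': 'number of clusters, k', 'y': 'inertia'})
fig.show()
```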
Example: Piedmont wines dataset
- 178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera
- Features measure chemical composition, e.g. alcohol content
- … also visual properties like “color intensity”
Data Set Information:
These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
I think that the initial data set had around 30 variables, but for some reason I only have the 13 dimensional version. I had a list of what the 30 or so variables were, but a.) I lost it, and b.), I would not know which 13 variables are included in the set.
The attributes are (donated by Riccardo Leardi, riclea ‘@’ anchem.unige.it):
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
Data Preprocessing
```python
# import data and modules
```
```python
# Replace the Classification value from number to wine name
```
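A sketch of these two steps. For a self-contained example it uses scikit-learn's bundled copy of the wine data and renames the columns to the UCI attribute names listed above; the usual mapping of the three classes to Barolo, Grignolino and Barbera is assumed, and the original notes may read a CSV instead:

```python
# import data and modules
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
columns = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
           'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
           'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

df = pd.DataFrame(wine.data, columns=columns)
df.insert(0, 'Classification', wine.target)  # 0, 1, 2 -> the three cultivars

# Replace the Classification value from number to wine name
df2 = df.copy()
df2['Classification'] = df2['Classification'].map({0: 'Barolo', 1: 'Grignolino', 2: 'Barbera'})
print(df2.head())
```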
K-means clustering without standardization
```python
# k-means clustering without standardization
```
```python
df3 = pd.DataFrame({'labels': labels, 'varieties': df2['Classification']})
```
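A sketch of the unscaled clustering (assuming `df` and `df2` from the preprocessing sketch; columns 1:14 are the 13 raw feature columns):

```python
# k-means clustering without standardization
import pandas as pd
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, random_state=42)
labels = model.fit_predict(df.iloc[:, 1:14])

df3 = pd.DataFrame({'labels': labels, 'varieties': df2['Classification']})
print(pd.crosstab(df3['labels'], df3['varieties']))
```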
Data Profile
Feature variances
- The wine features have very different variances!
- Variance of a feature measures spread of its values
```python
# descriptive analytics
```
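A quick sketch of the per-feature variances (assuming `df` from above); Proline's variance is orders of magnitude larger than the others, which is why unscaled k-means is dominated by that single column:

```python
# descriptive analytics: spread of each of the 13 features
print(df.iloc[:, 1:14].var().sort_values(ascending=False))
```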
```python
fig = px.scatter(df, x = 'OD280/OD315 of diluted wines', y = 'Proline', color = 'Classification')
```
Standardization: StandardScaler
- In kmeans: feature variance = feature influence
- StandardScaler transforms each feature to have mean 0 and variance 1
- Features are said to be “standardized”
```python
# Standardization
```
```python
fig = px.scatter(df2, x = df_scaled['OD280/OD315 of diluted wines'], y = df_scaled['Proline'], color = 'Classification')
```
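A sketch of the standardization step itself (assuming `df` from above; `df_scaled` keeps the original column names so it can be plotted the same way):

```python
# Standardization
import pandas as pd
from sklearn.preprocessing import StandardScaler

features = df.iloc[:, 1:14]
scaler = StandardScaler()

# every column now has mean 0 and unit variance
df_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
print(df_scaled.describe().loc[['mean', 'std']].round(2))
```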
Similar methods
- StandardScaler and KMeans have similar methods, BUT
- Use `fit()` / `transform()` with StandardScaler
- Use `fit()` / `predict()` with KMeans
K-means clustering with standardization
StandardScaler, then KMeans
- Need to perform two steps: StandardScaler, then KMeans
- Use an sklearn `pipeline` to combine multiple steps
- Data flows from one step into the next
Standardization rescales data to have a mean ($\mu$) of 0 and standard deviation ($\sigma$) of 1 (unit variance).
$$X_{changed} = \frac{X - \mu}{\sigma}$$
For most applications standardization is recommended.
```python
# K-means clustering with standardization
```
```python
scaled_labels = pipeline.predict(df.iloc[:, 1:14])
```
```python
df_scaled = pd.DataFrame({'labels': scaled_labels, 'varieties': df2['Classification']})
```
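A sketch of the full pipeline (assuming `df` and `df2` from the preprocessing sketch):

```python
# K-means clustering with standardization
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=42))
pipeline.fit(df.iloc[:, 1:14])

scaled_labels = pipeline.predict(df.iloc[:, 1:14])
df_scaled = pd.DataFrame({'labels': scaled_labels, 'varieties': df2['Classification']})
print(pd.crosstab(df_scaled['labels'], df_scaled['varieties']))
```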
Without feature standardization, the clustering was very bad:
K-means clustering with normalization
- StandardScaler is a “preprocessing” step
- MaxAbsScaler and Normalizer are other examples
Normalization rescales the values into a range of $[0,1]$. This might be useful in cases where all parameters need to be on the same positive scale. However, this comes at a cost: outliers are squeezed into the ends of the range, so information about them is largely lost.
$$X_{changed} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
```python
# Another option: Normalization
```
```python
df_norm.head()
```
```python
# Make a pipeline chaining normalizer and kmeans: pipeline
```
```python
norm_labels = pipeline.predict(df.iloc[:, 1:14])
```
```python
df_norm = pd.DataFrame({'labels': norm_labels, 'varieties': df2['Classification']})
```
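A sketch of the normalization pipeline. Note that scikit-learn's `Normalizer` rescales each sample (row) to unit norm, which is not the same as the min-max formula above (that corresponds to `MinMaxScaler`); `Normalizer` is used here because it is the preprocessing step named in these notes:

```python
# Make a pipeline chaining normalizer and kmeans: pipeline
import pandas as pd
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(Normalizer(), KMeans(n_clusters=3, random_state=42))
pipeline.fit(df.iloc[:, 1:14])

norm_labels = pipeline.predict(df.iloc[:, 1:14])
df_norm = pd.DataFrame({'labels': norm_labels, 'varieties': df2['Classification']})
print(pd.crosstab(df_norm['labels'], df_norm['varieties']))
```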
Supervised Machine Learning
ML: The art and science of:
- Giving computers the ability to learn to make decisions from data
- without being explicitly programmed!
Examples:
- Learning to predict whether an email is spam or not
- Clustering wikipedia entries into different categories
Difference between Supervised learning and Unsupervised learning:
- Supervised learning: Uses labeled data
- Unsupervised learning: Uses unlabeled data
Unsupervised learning
- Uncovering hidden patterns from unlabeled data
- Example: Grouping customers into distinct categories (Clustering)
Reinforcement learning
- Software agents interact with an environment
- Learn how to optimize their behavior
- Given a system of rewards and punishments
- Draws inspiration from behavioral psychology
- Applications
- Economics
- Genetics
- Game playing
- AlphaGo: First computer to defeat the world champion in Go
Supervised learning
- Predictor variables/features and a target variable
- Aim: Predict the target variable, given the predictor variables
- Classification: Target variable consists of categories
- Regression: Target variable is continuous
Naming conventions
- Features = predictor variables = independent variables
- Target variable = dependent variable = response variable
Supervised learning
- Automate time-consuming or expensive manual tasks
- Example: Doctor’s diagnosis
- Make predictions about the future
- Example: Will a customer click on an ad or not?
- Need labeled data
- Historical data with labels
- Experiments to get labeled data
- Crowd-sourcing labeled data
Supervised learning in Python
- We will use scikit-learn/sklearn
- Integrates well with the SciPy stack
- Other libraries
- TensorFlow
- keras
Matplotlib
Intro to pyplot
matplotlib.pyplot is a collection of command style functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes (please note that “axes” here and in most places in the documentation refers to the axes part of a figure and not the strict mathematical term for more than one axis).
Generating visualizations with pyplot is very quick:
```python
import matplotlib.pyplot as plt
```
You may be wondering why the x-axis ranges from 0-3 and the y-axis from 1-4. If you provide a single list or array to the plot() command, matplotlib assumes it is a sequence of y values, and automatically generates the x values for you. Since python ranges start with 0, the default x vector has the same length as y but starts with 0. Hence the x data are [0,1,2,3].
Therefore, matplotlib automatically generates the x values [0, 1, 2, 3] for the corresponding y values [1, 2, 3, 4]. Then you have the coordinates (0, 1), (1, 2), (2, 3), (3, 4).
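A minimal sketch of that first plot:

```python
import matplotlib.pyplot as plt

# a single list is treated as y values; x defaults to [0, 1, 2, 3]
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()
```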
plot() is a versatile command, and will take an arbitrary number of arguments. For example, to plot x versus y, you can issue the command:
```python
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
```
Formatting the style of your plot
For every x, y pair of arguments, there is an optional third argument which is the format string that indicates the color and line type of the plot. The letters and symbols of the format string are from MATLAB, and you concatenate a color string with a line style string. The default format string is ‘b-‘, which is a solid blue line. For example, to plot the above with red circles, you would issue
```python
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'ro')  # 'ro' for red circles
```