Python Data Preprocessing

Python Tutorial


Regular Expression


Numpy Array

Numpy Org
NumPy Developer Documentation
NumPy User Guide
NumPy Reference
Numpy Quickstart tutorial
NumPy: the absolute basics for beginners


Introduction

Array data structure
Matrix (mathematics)

In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of the same type of elements (values or variables), each identified by at least one array index or key. An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called one-dimensional array.

An array is a concept similar to a matrix in mathematics. A matrix (plural: matrices) is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. For example, the dimension of the matrix below is $2 \times 3$ (read “two by three”), because there are two rows and three columns:

$$
\left[
\begin{matrix}
1 & 2 & 3 \\
4 & 5 & 6
\end{matrix}
\right]
$$

A matrix with $m$ rows and $n$ columns is called an $m \times n$ matrix. The sample above is therefore a $2 \times 3$ matrix, which corresponds to a two-dimensional array.

In NumPy, dimensions are called axes. There is one small difference between a mathematical matrix and an array in computer science: matrix indices conventionally start at 1, whereas array indices in Python and NumPy start at 0.
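As a small illustration of axes and zero-based indexing, the sketch below builds the $2 \times 3$ matrix above as a NumPy array. The variable names are chosen here for illustration only.

```python
import numpy as np

# The 2 x 3 matrix from the text, as a two-dimensional array
m = np.array([[1, 2, 3],
              [4, 5, 6]])

n_axes = m.ndim         # number of axes (dimensions)
shape = m.shape         # length of each axis: 2 rows and 3 columns
top_left = m[0, 0]      # indices start at 0: matrix entry in row 1, column 1
bottom_right = m[1, 2]  # matrix entry in row 2, column 3

print(n_axes, shape, top_left, bottom_right)
# 2 (2, 3) 1 6
```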


One-dimensional array

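A 1D array looks almost the same as a Python list. When printed, the visible difference is that its elements are separated by spaces rather than commas.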

Python
import numpy as np

l = [1, 2, 3]

print(l)
# [1, 2, 3]

a = np.array([1, 2, 3])

print(a)
# [1 2 3]
a
# array([1, 2, 3])

A 1D array has only one axis, axis 0. In the following sample, the length of axis 0 is three because the array has three elements.

Python
import numpy as np

a = np.array([1, 2, 3])

np.sum(a, axis = 0) # the sum of array a, where axis = 0
# 6

a[0] # the first element
# 1
a[1] # the second element
# 2
a[2] # the third element
# 3

Two-dimensional array

A 2D array has two axes. In the following sample, the length of axis 0 is two because the array has two rows; the length of axis 1 is three because each row has three elements (columns).

Python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

a
# array([[1, 2, 3],
# [4, 5, 6]])

Now we have an array a that has two axes.

        col 1   col 2   col 3
row 1   1       2       3
row 2   4       5       6

Python
a[0] # the first row
# array([1, 2, 3])
a[1] # the second row
# array([4, 5, 6])

np.sum(a, axis = 0) # sum along axis 0 (down the rows): one total per column
# array([5, 7, 9])
np.sum(a, axis = 1) # sum along axis 1 (across the columns): one total per row
# array([ 6, 15])

a[0][0] # the element of the first row and the first column
# 1
a[1][2] # the element of the second row and the third column
# 6
a[-1][-1] # the element of the last row and the last column
# 6

Three-dimensional array

A 3D array is harder to picture, but it can be split into parts that are each a 2D array. The array below consists of two such 2D arrays.

Python
import numpy as np

a = np.array([[[0, 1, 2, 3], [4, 5, 6, 7]],
[[8, 9, 10, 11], [12, 13, 14, 15]]])

a[0] # axis = 0, index = 0 -> the first 2D array
# array([[0, 1, 2, 3],
# [4, 5, 6, 7]])
a[1] # axis = 0, index = 1 -> the second 2D array
# array([[ 8, 9, 10, 11],
# [12, 13, 14, 15]])

Imagine two tables (two 2D arrays) stacked together: the result is a 3D array.

Python
a[0][0] # axis = 0, index = 0; axis = 1, index = 0 -> the first 2D array, the first row
# array([0, 1, 2, 3])
a[0][1] # axis = 0, index = 0; axis = 1, index = 1 -> the first 2D array, the second row
# array([4, 5, 6, 7])
a[1][0] # axis = 0, index = 1; axis = 1, index = 0 -> the second 2D array, the first row
# array([ 8, 9, 10, 11])
a[1][1] # axis = 0, index = 1; axis = 1, index = 1 -> the second 2D array, the second row
# array([12, 13, 14, 15])

a[0][0][0] # axis = 0, index = 0; axis = 1, index = 0; axis = 2, index = 0 -> the first 2D array, the first row, the first column
# 0
a[0][0][1] # axis = 0, index = 0; axis = 1, index = 0; axis = 2, index = 1 -> the first 2D array, the first row, the second column
# 1

These examples show that each pair of brackets supplies an index along one axis, in order. For example, a[0][1] means index 0 along the first axis (axis = 0) and index 1 along the second axis (axis = 1).

However many dimensions an array has, the first axis (axis = 0) is the outermost level. The last axis (axis = -1) indexes the columns, and the second-to-last axis (axis = -2) indexes the rows.
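To make the axis convention concrete, the following sketch reduces the 3D array from above along different axes and looks at the results. It is only an illustration using np.sum; the variable names are made up here.

```python
import numpy as np

a = np.array([[[0, 1, 2, 3], [4, 5, 6, 7]],
              [[8, 9, 10, 11], [12, 13, 14, 15]]])

shape = a.shape                 # (2, 2, 4): 2 tables, 2 rows each, 4 columns each

sum_tables = np.sum(a, axis=0)  # adds the two tables element by element -> shape (2, 4)
sum_rows = np.sum(a, axis=-2)   # collapses the rows of each table -> shape (2, 4)
sum_cols = np.sum(a, axis=-1)   # collapses the columns of each row -> shape (2, 2)

print(sum_tables)
# [[ 8 10 12 14]
#  [16 18 20 22]]
print(sum_cols)
# [[ 6 22]
#  [38 54]]
```

Summing along an axis removes that axis from the shape, which is a quick way to check which axis is which.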


What’s the difference between a Python list and a NumPy array?

NumPy gives you an enormous range of fast and efficient ways of creating arrays and manipulating numerical data inside them. While a Python list can contain different data types within a single list, all of the elements in a NumPy array should be homogenous. The mathematical operations that are meant to be performed on arrays would be extremely inefficient if the arrays weren’t homogenous.

Why use NumPy?

NumPy arrays are faster and more compact than Python lists. An array consumes less memory and is convenient to use. NumPy uses much less memory to store data and it provides a mechanism of specifying the data types. This allows the code to be optimized even further.
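The effect of specifying a data type can be inspected directly. The sketch below is a minimal illustration using the dtype, itemsize, and nbytes attributes; the array names are made up for this example.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int8)     # 1 byte per element
large = np.array([1, 2, 3], dtype=np.float64)  # 8 bytes per element

print(small.dtype, small.itemsize, small.nbytes)
# int8 1 3
print(large.dtype, large.itemsize, large.nbytes)
# float64 8 24

# Mixed input is converted to one common type, so the array stays homogeneous
mixed = np.array([1, 2.5])
print(mixed.dtype)
# float64
```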

The difference between a Python list and a NumPy array

Python
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

height * weight
# TypeError: can't multiply sequence by non-int of type 'list'

If you really need the element-wise product of height and weight, you have to write your own code:

Python
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

h_w = []
for i in range(len(height)):
    h_w.append(round(height[i] * weight[i], 3)) # round to 3 decimals to hide floating-point noise

print(h_w)
# [113.142, 99.456, 108.756, 167.076, 122.973]

Or in an easier way:

Python
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

h_w = [round(h * w, 3) for h, w in zip(height, weight)] # pair the values up and round to 3 decimals

print(h_w)
# [113.142, 99.456, 108.756, 167.076, 122.973]

However, this code is still not concise, and you would have to repeat it for every operation. NumPy arrays, also called ndarray (N-dimensional array), support element-wise arithmetic directly:

Python
import numpy as np

height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

print(height * weight)
# [113.142 99.456 108.756 167.076 122.973]

What is an array?
An array is a central data structure of the NumPy library. An array is a grid of values and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be indexed in various ways. The elements are all of the same type, referred to as the array dtype.


Create a basic array

np.array()

To create a NumPy array, you can use the function np.array().

All you need to do to create a simple array is pass a list to it. If you choose to, you can also specify the type of data in your list with the dtype argument. More information about data types is available in the NumPy documentation.

Python
import numpy as np

a = np.array([1, 2, 3])

print(a)
# [1 2 3]

When you run the code above, NumPy creates a one-dimensional array a, which you can think of as a single row. Rows and columns mean the same thing here as in a dataframe, a matrix, or a database table: a column stores values of the same type, and a row stores the data of one individual.

For example, here is a sample dataframe of a table in a database.

Name            Height   Age   Gender
Zack Fair       1.85     23    M
Cloud Strife    1.73     21    M

Here, Name and Gender are strings, Height is a float, and Age is an integer. Every column stores a single type of data.

An array can be indexed by a tuple of nonnegative integers, by booleans, by another array, or by integers. The rank of the array is the number of dimensions. The shape of the array is a tuple of integers giving the size of the array along each dimension.

One way we can initialize NumPy arrays is from Python lists, using nested lists for two- or higher-dimensional data.

Python
import numpy as np

a = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])

print(a)
# [[1 2 3]
# [4 5 6]
# [7 8 9]]

a
# array([[1, 2, 3],
# [4, 5, 6],
# [7, 8, 9]])

        col 1   col 2   col 3
row 1   1       2       3
row 2   4       5       6
row 3   7       8       9

Sorting elements

np.sort()

Sorting the elements of an array is simple with np.sort(). You can specify the axis, kind, and order when you call the function.

Python
import numpy as np

a = np.array([2, 1, 5, 3, 7, 4, 6, 8])

np.sort(a)
# array([1, 2, 3, 4, 5, 6, 7, 8])

order: str or list of str, optional
When a is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need be specified, but unspecified fields will still be used, in the order in which they come up in the dtype, to break ties.

Python
import numpy as np

height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]
values = [(height[i], weight[i]) for i in range(len(height))]

dtype = [('height', float), ('weight', float)] # assign the data type of each column
h_w = np.array(values, dtype = dtype)

h_w
# array([(1.73, 65.4), (1.68, 59.2), (1.71, 63.6), (1.89, 88.4),
# (1.79, 68.7)], dtype=[('height', '<f8'), ('weight', '<f8')])

np.sort(h_w, order = 'height')
# array([(1.68, 59.2), (1.71, 63.6), (1.73, 65.4), (1.79, 68.7),
# (1.89, 88.4)], dtype=[('height', '<f8'), ('weight', '<f8')])

np.sort(h_w, order = 'weight') # same order as above, because the taller people here are also the heavier ones
# array([(1.68, 59.2), (1.71, 63.6), (1.73, 65.4), (1.79, 68.7),
# (1.89, 88.4)], dtype=[('height', '<f8'), ('weight', '<f8')])

In addition to sort, which returns a sorted copy of an array, you can use:

  • argsort, which is an indirect sort along a specified axis,
  • lexsort, which is an indirect stable sort on multiple keys,
  • searchsorted, which will find elements in a sorted array, and
  • partition, which is a partial sort.
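The four functions listed above can be tried on small arrays. The following is a minimal sketch; the arrays primary and secondary are made up for the lexsort example.

```python
import numpy as np

a = np.array([2, 1, 5, 3, 7, 4, 6, 8])

order = np.argsort(a)   # indices that would sort a
# array([1, 0, 3, 5, 2, 6, 4, 7])
in_order = a[order]     # same values as np.sort(a)

# searchsorted expects an array that is already sorted
pos = np.searchsorted(np.sort(a), 4)  # index where 4 sits (or would be inserted)
# 3

# partition: the element at position 2 is in its final sorted place;
# smaller values come before it and larger ones after it, in no guaranteed order
part = np.partition(a, 2)

# lexsort: the LAST key in the tuple is the primary sort key
primary = np.array([1, 0, 1, 0])
secondary = np.array([3, 2, 1, 0])
by_two_keys = np.lexsort((secondary, primary))
# array([3, 1, 2, 0])
```

argsort and lexsort return indices rather than values, which is what makes them "indirect": the same indices can reorder several related arrays consistently.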

np.concatenate()

You can concatenate arrays with np.concatenate().

Python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

np.concatenate((a, b))
# array([1, 2, 3, 4, 5, 6, 7, 8])

Attention: the + operator does not concatenate NumPy arrays. Unlike with Python lists, + adds two arrays element by element.

Python
import numpy as np

# List
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

a + b
# [1, 2, 3, 4, 5, 6, 7, 8]

# ndarray
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

a + b
# array([ 6, 8, 10, 12])
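
For 2D arrays, the axis argument chooses the direction: axis = 0 stacks the rows of y below x, and axis = 1 appends the columns of y to the right of x.

Python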
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
np.concatenate((x, y), axis = 0)
# array([[1, 2],
# [3, 4],
# [5, 6],
# [7, 8]])

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
np.concatenate((x, y), axis = 1)
# array([[1, 2, 5, 6],
# [3, 4, 7, 8]])

Indexing, slicing, and filtering

This chapter builds on the concept of axes; review the Introduction if anything here is unclear.


Indexing

Indexing 1D array.

Python
import numpy as np

a = np.array([1, 2, 3])

a[0]
# 1
a[-1]
# 3

Indexing 2D array.

Python
a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]])

a
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11],
# [12, 13, 14, 15]])

a[0][0]
# 0
a[1][1]
# 5
a[-1][-1]
# 15

Slicing

Slicing 1D array.

Python
import numpy as np

a = np.array([1, 2, 3])

a[0:2]
# array([1, 2])
a[1:]
# array([2, 3])
a[:-2]
# array([1])

Slicing 2D array.

Python
a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]])

a
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11],
# [12, 13, 14, 15]])

a[0:1, 1:3]
# array([[1, 2]])
a[0:3, 2:]
# array([[ 2, 3],
# [ 6, 7],
# [10, 11]])
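Two further slicing patterns follow from the same rules: a bare colon selects everything along an axis, and a third number in a slice is a step. The sketch below uses an array with the same values as the 4 x 4 array above (built here with np.arange for brevity). It also shows that a basic slice is a view on the original data, so writing to it changes the original array unless you call .copy().

```python
import numpy as np

base = np.arange(16).reshape(4, 4)  # same values as the 4 x 4 array above

second_col = base[:, 1]       # every row, column index 1
# array([ 1,  5,  9, 13])
every_other = base[::2, ::2]  # rows 0 and 2, columns 0 and 2
# array([[ 0,  2],
#        [ 8, 10]])

scratch = base.copy()         # work on a copy so base stays untouched
view = scratch[0:2, 0:2]      # a view, not a copy
view[0, 0] = 100              # also changes scratch[0, 0]

independent = scratch[0:2, 0:2].copy()
independent[1, 1] = -1        # does not change scratch
```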

Filtering

Applying a condition to an array returns a boolean mask of the same shape, not the matching values.

Python
import numpy as np

a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]])

a
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11],
# [12, 13, 14, 15]])

a < 5
# array([[ True, True, True, True],
# [ True, False, False, False],
# [False, False, False, False],
# [False, False, False, False]])

Use the mask as an index to select the values you need.

Python
a[a < 5]
# array([0, 1, 2, 3, 4])

You can also select, for example, numbers that are equal to or greater than 5, and use that condition to index an array.

Python
a[a >= 5]
# array([ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

You can select elements that are divisible by 2:

Python
a[a%2 == 0]
# array([ 0, 2, 4, 6, 8, 10, 12, 14])

Or you can select elements that satisfy two conditions using the & and | operators:

Python
b = a[(a > 5) | (a == 5)]
b
# array([ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

c = a[(a > 2) & (a < 11)]
c
# array([ 3, 4, 5, 6, 7, 8, 9, 10])

You can also use np.nonzero() to select elements or indices from an array.

Python
index = np.nonzero(a < 5)

index # row + column
# (array([0, 0, 0, 0, 1]), array([0, 1, 2, 3, 0]))
type(index)
# tuple

In this example, a tuple of arrays was returned: one for each dimension. The first array represents the row indices where these values are found, and the second array represents the column indices where the values are found.

If you want to generate a list of coordinates where the elements exist, you can zip the arrays, iterate over the list of coordinates, and print them. For example:

Python
list_of_coordinates = list(zip(index[0], index[1]))

for coord in list_of_coordinates:
    print(coord)
# (0, 0)
# (0, 1)
# (0, 2)
# (0, 3)
# (1, 0)
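As an alternative to zipping the index arrays, np.argwhere returns the same coordinates as a single 2D array with one row per match. A minimal sketch with the same values and condition (the array is rebuilt here with np.arange so the block runs on its own):

```python
import numpy as np

a = np.arange(16).reshape(4, 4)  # same values as the array a above

coords = np.argwhere(a < 5)      # one (row, column) pair per matching element
print(coords)
# [[0 0]
#  [0 1]
#  [0 2]
#  [0 3]
#  [1 0]]
```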

You can also use np.nonzero() to print the elements in an array that are less than 5 with:

Python
a[index]
# array([0, 1, 2, 3, 4])

Example: compute BMI with arrays and filter the result.

Python
import numpy as np

height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

np_height = np.array(height)
np_weight = np.array(weight)

bmi = np_weight / np_height ** 2
bmi = np.around(bmi, 3)

bmi
# array([21.852, 20.975, 21.75 , 24.747, 21.441])

bmi > 24
# array([False, False, False, True, False])

bmi[bmi > 24]
# array([24.747])

Numpy Statistics

Numpy Descriptive Statistics for Numerical Data

Descriptive Statistics for Numerical Data


Measures of Location

Percentiles & Quartiles
Python Percentiles & Quartiles
import numpy as np

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

# Percentiles
k = np.percentile(ages, 70)
print(k)
# 41.0

# Quartiles
np.percentile(ages, 25) # Q1
# 11.0

np.percentile(ages, 50) # Q2
# 31.0

np.percentile(ages, 75) # Q3
# 43.0
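np.percentile also accepts a list of percentiles, so the three quartiles can be computed in one call. The sketch below does that and derives the interquartile range (Q3 minus Q1); the variable names are chosen for this example.

```python
import numpy as np

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39, 80, 82, 32, 2, 8, 6, 25, 36, 27, 61, 31]

q1, q2, q3 = np.percentile(ages, [25, 50, 75])  # all three quartiles in one call
iqr = q3 - q1                                   # interquartile range

print(q1, q2, q3, iqr)
# 11.0 31.0 43.0 32.0
```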

Arithmetic Mean
Python Arithmetic Mean
import numpy as np

x = list(range(0,11))
x.append(50)
np.mean(x)
# 8.75

Median
Python Median
import numpy as np

x = list(range(0,11))
np.median(x)
# 5.0

Mode
Python Mode
from scipy import stats

x = [2,1,2,3,1,2,3,4,1,5,5,3,2,3]
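stats.mode(x).mode # the most frequent value; 2 and 3 are tied here, and the smallest is returned
# Note: older SciPy versions return arrays here, which is why older code uses stats.mode(x)[0][0]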

Measures of Dispersion

Variance
Python Variance
import numpy as np

# Variance - Population
x = [1, 2, 3]
np.var(x)
# 0.6666666666666666

# Variance - Sample
x = [1, 2, 3]
np.var(x, ddof = 1) # ddof: delta degrees of freedom; the divisor is N - ddof
# 1.0

Standard Deviation

Python Standard Deviation
import numpy as np

# Standard Deviation - Population
x = [1, 2, 3]
np.std(x)
# 0.816496580927726

# Standard Deviation - Sample
x = [1, 2, 3]
np.std(x, ddof = 1) # ddof: delta degrees of freedom; the divisor is N - ddof
# 1.0
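To see what ddof changes, the population and sample formulas can be written out by hand. This is a minimal sketch of the standard definitions, not a replacement for np.var and np.std; the variable names are made up here.

```python
import numpy as np

x = np.array([1, 2, 3])
n = len(x)
squared_dev = (x - x.mean()) ** 2         # squared distance of each value from the mean

var_population = squared_dev.sum() / n    # divisor N      (ddof = 0)
var_sample = squared_dev.sum() / (n - 1)  # divisor N - 1  (ddof = 1)
std_sample = np.sqrt(var_sample)          # standard deviation is the square root of the variance

print(np.isclose(var_population, np.var(x)), np.isclose(var_sample, np.var(x, ddof=1)))
# True True
```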

Measures of Association

Correlation Coefficient
Python
import pandas as pd
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/ZacksAmber/Code/master/Python/Projects/MLB.csv")
df.head()

Correlation Coefficient in Dataframe

Python
np.corrcoef(df['Height(inches)'][:100], df['Weight(pounds)'][:100])
# array([[1. , 0.54518481],
# [0.54518481, 1. ]])

Correlation Coefficient in ndarray

Python
np_baseball = np.array(df)
np.corrcoef(np_baseball[:100, 3].astype(float), np_baseball[:100, 4].astype(float))
# array([[1. , 0.54518481],
# [0.54518481, 1. ]])
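np.corrcoef returns a 2 x 2 matrix whose off-diagonal entries are Pearson's correlation coefficient. The sketch below computes the same value by hand from its definition, using the small height and weight arrays from earlier in this document rather than the MLB dataset, so it runs without a download.

```python
import numpy as np

height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are the deviations from the respective means
dx = height - height.mean()
dy = weight - weight.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_numpy = np.corrcoef(height, weight)[0, 1]  # off-diagonal entry of the 2 x 2 matrix

print(np.isclose(r_manual, r_numpy))
# True
```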

Pandas

Data Manipulation with Pandas


DataFrames

Basic

  • Sorting and subsetting
  • Creating new columns

df: DataFrame Object

Python
print(df)
df.head() # returns the first few rows (the “head” of the DataFrame).
df.info() # shows information on each of the columns, such as the data type and number of missing values.
df.shape # returns the number of rows and columns of the DataFrame.
df.describe() # calculates a few summary statistics for each column.
df.columns # Return columns names
df.index # Return index

Pandas Philosophy
There should be one – and preferably only one – obvious way to do it.


Sorting

Python
df.sort_values("column_name")
df.sort_values("column_name", ascending=False)
df.sort_values(["column_1", "column_2"])
df.sort_values(["column_1", "column_2"], ascending=[True, False])
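The sorting calls above can be tried on a small DataFrame. The data below is hypothetical and made up purely for this illustration.

```python
import pandas as pd

# Hypothetical data, made up for this illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
    "weight_kg": [25, 30, 29, 24],
})

# Heaviest first
by_weight = dogs.sort_values("weight_kg", ascending=False)
print(by_weight["name"].tolist())
# ['Charlie', 'Lucy', 'Bella', 'Max']

# Breed A to Z, and within each breed heaviest first
by_breed_then_weight = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])
print(by_breed_then_weight["name"].tolist())
# ['Lucy', 'Bella', 'Charlie', 'Max']
```

sort_values returns a new DataFrame and leaves dogs itself unchanged.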

Subsetting

Python
df["column_name"]
df[["column_1", "column_2"]]

df["value"] > 50
df[df["value"] > 50] # keep only the rows where the condition is True

df[df["value"] == "condition"]

df[df["value"] > "2000-01-01"] # for example, a date column

is_condition_1 = df["value_1"] == "condition_1"
is_condition_2 = df["value_2"] == "condition_2"
df[is_condition_1 & is_condition_2]

df[(df["value_1"] == "condition_1") & (df["value_2"] == "condition_2")]

is_condition_1_or_condition_2 = df["value"].isin(["condition_1", "condition_2"])
df[is_condition_1_or_condition_2]
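The subsetting patterns above can be checked on a small DataFrame. The data and column names below are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical data, made up for this illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Beagle"],
    "weight_kg": [25, 30, 29, 12],
})

# One condition
heavy = dogs[dogs["weight_kg"] > 26]

# Two conditions combined with & (each condition needs its own parentheses)
heavy_labs = dogs[(dogs["breed"] == "Labrador") & (dogs["weight_kg"] > 26)]

# Membership test against a list of values
lab_or_beagle = dogs[dogs["breed"].isin(["Labrador", "Beagle"])]

print(heavy["name"].tolist(), heavy_labs["name"].tolist(), lab_or_beagle["name"].tolist())
# ['Charlie', 'Lucy'] ['Lucy'] ['Bella', 'Lucy', 'Max']
```

The parentheses matter because & binds more tightly than comparison operators such as == and >.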

New Columns

  • Transforming, mutating, and feature engineering.

Python
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

Aggregating Data

  • Summary statistics
  • Counting
  • Grouped summary statistics

Summarizing Data

Python
df["column_name"].mean()
df["column_name"].median()
df["column_name"].mode()
df["column_name"].min()
df["column_name"].max()
df["column_name"].var()
df["column_name"].std()
df["column_name"].sum()
df["column_name"].quantile()


def func(column): # define the aggregation function
    return ...
df["column_name"].agg(func) # .agg() applies your own custom function to a column
df[["column_1", "column_2"]].agg(func) # ... or to more than one column at once
df["column_name"].agg([func_1, func_2]) # several functions at once

df["column_name"].cumsum() # cumulative sum
df["column_name"].cummax() # cumulative maximum
df["column_name"].cummin() # cumulative minimum
df["column_name"].cumprod() # cumulative product


Counting

Python
df.drop_duplicates(subset="column_names")
df["column"].value_counts()
df["column"].value_counts(sort=True)
df["column"].value_counts(normalize=True) # return each proportion of the total

Grouped Summary Statistics

Without Groupby

With Groupby

Python
dogs.groupby("color")["weight_kg"].mean()

Multiple Grouped Summaries

Python
dogs.groupby("color")["weight_kg"].agg(["min", "max", "sum"]) # function names as strings; recent pandas versions warn about the built-ins min, max, sum

Grouping by multiple variables

Python
dogs.groupby(["color", "breed"])["weight_kg"].mean()

Many groups, many summaries

Python
dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()

Pivot Tables

Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing grouped calculations. That is, the .pivot_table() method is just an alternative to .groupby().

Python
dogs.groupby("color")["weight_kg"].mean()

dogs.pivot_table(values="weight_kg", index="color") # values are the columns you want to summarize; index is the column you want to group by.

Python
# By default, pivot_table takes the mean of each group. Pass aggfunc to use another summary statistic, for example one from NumPy
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

Python
# columns adds a second grouping variable, spread across the columns of the result
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
# fill_value replaces missing values (color/breed combinations with no data) in the result
# margins=True adds an "All" row and column with the summary statistic (here the mean) over each column and row
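The claim that .pivot_table() is an alternative to .groupby() can be checked on a small example. The dogs DataFrame below is hypothetical and made up for this illustration.

```python
import pandas as pd

# Hypothetical data, made up for this illustration
dogs = pd.DataFrame({
    "color": ["black", "black", "brown", "brown", "white"],
    "breed": ["lab", "poodle", "lab", "lab", "poodle"],
    "weight_kg": [30.0, 20.0, 28.0, 32.0, 18.0],
})

by_group = dogs.groupby("color")["weight_kg"].mean()             # Series indexed by color
by_pivot = dogs.pivot_table(values="weight_kg", index="color")   # DataFrame with the same means

print(by_group.to_dict())
# {'black': 25.0, 'brown': 30.0, 'white': 18.0}

# Two grouping variables: color down the rows, breed across the columns
wide = dogs.pivot_table(values="weight_kg", index="color", columns="breed",
                        fill_value=0, margins=True)
no_white_labs = wide.loc["white", "lab"]  # no such dogs, so fill_value (0) appears
overall_mean = wide.loc["All", "All"]     # mean over all dogs
```

The main practical difference is the shape of the result: groupby with two keys gives a long Series with a multi-level index, while pivot_table with index and columns gives a wide table.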

Slicing and Indexing Data

  • Subsetting using slicing
  • Indexes and subsetting using indexes

Explicit indexes

Pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).

Python
df.set_index("name") # setting index; index doesn't have to be unique; index makes subsetting more readable
df.reset_index(drop=True)

Subsetting with .loc[]
The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.

The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.

Python
df[df["col"].isin(["val_1", "val_2"])]
df.loc[["val_1", "val_2"]] # filtering on index values

Setting multi-level indexes
Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside country.

The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes, and keep track of how your data is represented.

Python
df_ind = df.set_index(["lev_1", "lev_2"]) # for example, lev_1 is country, lev_2 is state
df_ind.loc[[("US", "CA"), ("JP", "TOKYO")]] # subset rows with a list of (lev_1, lev_2) tuples

Sorting by index values
Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It’s also useful to be able to sort by elements in the index. For this, you need to use .sort_index().

Python
df.sort_index()
df.sort_index(level=["lev_1", "lev_2"], ascending=[True, True])
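A compact, runnable version of the multi-level index workflow might look like the sketch below. The temperatures data is hypothetical and made up for this illustration; only the column names follow the ones used in this section.

```python
import pandas as pd

# Hypothetical data, made up for this illustration
temperatures = pd.DataFrame({
    "country": ["Japan", "Brazil", "Japan", "Brazil"],
    "city": ["Tokyo", "Rio De Janeiro", "Osaka", "Brasilia"],
    "avg_temp_c": [16.0, 25.0, 17.0, 22.0],
})

temperatures_ind = temperatures.set_index(["country", "city"])

# A list of (outer, inner) tuples selects specific rows
picked = temperatures_ind.loc[[("Japan", "Osaka"), ("Brazil", "Brasilia")]]

# Sort by country ascending, then by city descending
ordered = temperatures_ind.sort_index(level=["country", "city"], ascending=[True, False])
print(ordered.index.tolist())
# [('Brazil', 'Rio De Janeiro'), ('Brazil', 'Brasilia'), ('Japan', 'Tokyo'), ('Japan', 'Osaka')]
```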

Example

Python
# Look at temperatures
print(temperatures)

# Index temperatures by city
temperatures_ind = temperatures.set_index("city")

# Look at temperatures_ind
print(temperatures_ind)

# Reset the index, keeping its contents
print(temperatures_ind.reset_index())

# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop=True))

###

# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]

# Subset temperatures using square brackets
print(temperatures[temperatures["city"].isin(cities)])

# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])

### Setting multi-level indexes

# Index temperatures by country & city
temperatures_ind = temperatures.set_index(["country", "city"])

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]

# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])

### Sorting by index values

# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level="city"))

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=["country", "city"], ascending=[True, False]))

Slicing and Subsetting with .loc and .iloc

Slicing Indexes
# 1. Set indexes and sort them (slicing a multi-level index with .loc needs a sorted index)
df = df.set_index(["outer_index", "inner_index"]).sort_index()

# 2.1 Slicing outer index (both endpoints are included)
df.loc["outer_index_1":"outer_index_2"]

# 2.2 Slicing inner index
df.loc[("outer_index_1", "inner_index_1"):("outer_index_2", "inner_index_2")]

Slicing Columns
df.loc["row_1":"row_2", "col_1":"col_2"]

Slicing Twice
df.loc[("outer_index_1", "inner_index_1"):("outer_index_2", "inner_index_2"), "col_1":"col_2"]

Subsetting by row/column number
df.iloc[n:m, n:m]

Working with Pivot Tables

The Axis Argument
df.mean(axis="index") # The default value is "index", which means "calculate the statistic across rows", giving one value per column.

df.mean(axis="columns") # "calculate the statistic across columns", giving one value per row.

You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.
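A minimal sketch of the .dt accessor follows. The data is hypothetical; the column names date and avg_temp_c follow the temperatures example used in this section.

```python
import pandas as pd

# Hypothetical data, made up for this illustration
temperatures = pd.DataFrame({
    "date": pd.to_datetime(["2005-01-01", "2005-07-01", "2010-12-31"]),
    "avg_temp_c": [2.5, 24.0, -1.0],
})

temperatures["year"] = temperatures["date"].dt.year    # 2005, 2005, 2010
temperatures["month"] = temperatures["date"].dt.month  # 1, 7, 12
temperatures["day"] = temperatures["date"].dt.day      # 1, 1, 31
```

The .dt accessor only works on datetime columns, so dates stored as text have to be converted first, for example with pd.to_datetime as above.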

Example

Python
# Add a year column to temperatures
temperatures["year"] = temperatures["date"].dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table(values="avg_temp_c", index=["country", "city"], columns="year")

# See the result
print(temp_by_country_city_vs_year)

###

# Subset for Egypt to India
temp_by_country_city_vs_year.loc["Egypt":"India"]

# Subset for Egypt, Cairo to India, Delhi
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India", "Delhi")]

# Subset in both directions at once
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India", "Delhi"), 2005:2010]

###

# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Find the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")

# Find the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])


Creating and Visualizing Data

  • Plotting
  • Handling missing data
  • Reading data into a DataFrame

Visualization

Python
import matplotlib.pyplot as plt

Histogram
# Histogram
df["col"].hist()
plt.show()

df["col"].hist(bins=20) # bins sets the number of bars
plt.show()

Bar Plot
# Bar Plot
df.plot(kind="bar")
plt.show()

df.plot(kind="bar", title="")
plt.show()

Line Plot
# Line Plot
df.plot(x="col_1",
        y="col_2",
        kind="line")
plt.show()

Rotating Axis Labels
# Rotating Axis Labels
df.plot(x="col_1",
        y="col_2",
        kind="line",
        rot=45) # rotation of the tick labels in degrees
plt.show()

Scatter Plot
# Scatter Plot
df.plot(x="col_1",
        y="col_2",
        kind="scatter")
plt.show()

Layering Plot
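# Layering Plot: draw two plots before calling plt.show() so they share the same axes
df_1["col"].hist(alpha=0.7) # first layer (df_1 and df_2 are placeholder DataFrames)
df_2["col"].hist(alpha=0.7) # second layer; alpha makes the bars translucent so overlaps stay visible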

plt.legend(["par_1", "par_2"])
plt.show()

Example


Missing Values

In a pandas DataFrame, missing values are indicated with NaN, which stands for “not a number”.

Detecting missing values
df.isna()
df.isna().any()
df.isna().sum()

import matplotlib.pyplot as plt

df.isna().sum().plot(kind="bar")
plt.show()

Removing Missing Values
df.dropna() # drops every row that contains a missing value; not ideal if you have a lot of missing values

Replacing Missing Values
df.fillna(0)
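The three steps (detect, remove, replace) can be tried on a small DataFrame. The data below is hypothetical; filling with the column mean is shown only as one common alternative to filling with 0.

```python
import numpy as np
import pandas as pd

# Hypothetical data, made up for this illustration
df = pd.DataFrame({
    "weight_kg": [25.0, np.nan, 29.0],
    "height_cm": [56.0, 43.0, np.nan],
})

missing_per_column = df.isna().sum()  # weight_kg: 1, height_cm: 1
complete_rows = df.dropna()           # keeps only the first row
filled = df.fillna(0)                 # replaces every NaN with 0
filled_mean = df.fillna(df.mean())    # replaces each NaN with the mean of its column
```

Which replacement is appropriate depends on the data: a 0 can distort statistics such as the mean, so the choice deserves a moment of thought.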

Creating DataFrames

Creating Dataframes:

  • From a list of dictionaries: Constructed row by row
  • From a dictionary of lists: Constructed column by column
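
Both approaches build on the Python dictionary, which maps keys to values:

dic = {
    "key1": value1, # value1, value2, and value3 are placeholders
    "key2": value2,
    "key3": value3
}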
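Both construction styles can be sketched as follows. The rows reuse the names, heights, and ages from the sample table earlier in this document; the variable names are made up for this example.

```python
import pandas as pd

# From a list of dictionaries: each dictionary is one row
rows = [
    {"name": "Zack Fair", "height": 1.85, "age": 23},
    {"name": "Cloud Strife", "height": 1.73, "age": 21},
]
df_rows = pd.DataFrame(rows)

# From a dictionary of lists: each key is one column
columns = {
    "name": ["Zack Fair", "Cloud Strife"],
    "height": [1.85, 1.73],
    "age": [23, 21],
}
df_columns = pd.DataFrame(columns)

print(df_rows.equals(df_columns))
# True
```

Both calls produce the same DataFrame; which one is more convenient depends on whether the data arrives record by record or column by column.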

Creating DataFrames from List of Dictionaries


Creating DataFrames from Dictionary of lists


Reading and Writing CSVs

  • CSV: Comma-Separated-Values.
  • Designed for DataFrame-like data.
  • Most database and spreadsheet programs can use them or create them.

CSV
df = pd.read_csv("csv_file")
df.to_csv("path/file_name.csv")
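A round trip through CSV can be sketched without touching the disk by writing to an in-memory buffer. io.StringIO stands in for a real file path here, and the DataFrame is made up for this illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Zack Fair", "Cloud Strife"], "age": [23, 21]})

buffer = io.StringIO()          # in-memory stand-in for a file on disk
df.to_csv(buffer, index=False)  # index=False: do not write the row index as a column
csv_text = buffer.getvalue()

restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))
# True
```

Without index=False, to_csv writes the index as an extra unnamed first column, which then shows up as a column when the file is read back.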

More to Learn

  • Merging DataFrames with Pandas
  • Streamlined Data Ingestion with Pandas
  • Analyzing Police Activity with Pandas
  • Analyzing Marketing Campaigns with Pandas