# Python Data Preprocessing

## Python Tutorial

## Regular Expression

## Numpy Array

Numpy Org

NumPy Developer Documentation

NumPy User Guide

NumPy Reference

Numpy Quickstart tutorial

NumPy: the absolute basics for beginners

### Introduction

In computer science, an array data structure, or simply an **array**, is a data structure consisting of a collection of **the same type of elements** (values or variables), each identified by at least one array index or key. An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called one-dimensional array.

The array is a concept that similar to Matrix (mathematics). In mathematics, a matrix (plural matrices) is a **rectangular array** of numbers, symbols, or expressions, arranged in rows and columns. For example, the dimension of the matrix below is $2 × 3$ (read “two by three”), because there are two rows and three columns:

$$

\left[

\begin{matrix}

1 & 2 & 3 \

4 & 5 & 6 \

\end{matrix}

\right]

$$

The matrix always called a $m \times n \space matrix$. $m$ is the number of the rows; $n$ is the number of the columns. Therefore, the above sample is a two-dimensional array.

And in Numpy, dimensions are called `axes`

. There is a little difference between mathematic matrix and computer science array. In mathematic matrix, the first index is `1`

. But in computer science, the first index is `0`

.

#### One-dimensional array

1D array has almost the same appearance as the list.

1 | import numpy as np |

1D array only has one `axis`

, the row. And in the following sample, the length of `axis = 0`

is three since we have three elements.

1 | import numpy as np |

#### Two-dimensional array

2D array has two `axes`

. In the following sample, the length of `axis = 0`

is three since we have three elements; the length of `axis = 1`

is two since we have two dimensions.

1 | import numpy as np |

Now we have an array `a`

that has two axes.

col 1 | col 2 | col 3 | |
---|---|---|---|

row 1 | 1 | 2 | 3 |

row 2 | 4 | 5 | 6 |

1 | a[0] # the first row |

#### Three-dimensional array

3D array is hard to understand. But we can split it into 2 parts. And each part is a 2D array.

1 | import numpy as np |

Just imagine we have two tables, the 2D array, and we combine them together. Then we have a 3D array.

1 | a[0][0] # axis = 0, size = 0; axis = 1, size = 0 -> the first 2D array, the first row |

From the observation of the 3D array, we can know that in Numpy, the `size`

is the value of a specific `axis`

. For example, `a[0][3]`

means in the first axis (`axis = 0`

), the value is `0`

; in the second axis (`axis = 1`

), the value is `3`

.

And whatever how many dimensions an array has, the first axis (`axis = 0`

) should be overall view sight. And the last axis (`axis = -1`

) should be the column. The last-second axis (`axis = -2`

) is the row.

**The difference between a Python list and a NumPy array**

1 | height = [1.73, 1.68, 1.71, 1.89, 1.79] |

If you really need to figure out the product of `height`

times `weight`

, you have build your own code:

1 | height = [1.73, 1.68, 1.71, 1.89, 1.79] |

Or in an easier way:

1 | height = [1.73, 1.68, 1.71, 1.89, 1.79] |

However, the code is not concise enough. Additionally, you cannot use this method too many times. Therefore, let’s try the Numpy Array - also called `ndarray`

(Numpy dtype array).

1 | import numpy as np |

**What is an array?**

An array is a central data structure of the NumPy library. An array is a grid of values and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be indexed in various ways. The elements are all of the same type, referred to as the array `dtype`

.

### Create a basic array

`np.array()`

To create a NumPy array, you can use the function `np.array()`

.

All you need to do to create a simple array is pass a list to it. If you choose to, you can also specify the type of data in your list. You can find more information about data types here.

1 | import numpy as np |

Once you run the above code, the Numpy will initialize a `row`

for array `a`

. The `row`

is a similar concept of dataframe, matrix, or database. In dataframe and database, we use `column`

to store the same type of data. And we use `row`

to store `data`

from different individual.

For example, here is a sample dataframe of a table in a database.

Name | Height | Age | Gender |
---|---|---|---|

Zack Fair | 1.85 | 23 | M |

Cloud Strife | 1.73 | 21 | M |

Obviously, `Name`

and `Gender`

should be type `string`

, `Height`

should be type `float`

, and `Age`

should be type `int`

. Every column stores the same type of the data

An `array`

can be indexed by a tuple of nonnegative integers, by booleans, by another array, or by integers. The `rank`

of the array is the number of dimensions. The `shape`

of the array is a tuple of integers giving the size of the array along each dimension.

One way we can initialize NumPy arrays is from Python lists, using nested lists for two- or higher-dimensional data.

1 | import numpy as np |

col 1 | col 2 | col 3 | |
---|---|---|---|

row 1 | 1 | 2 | 3 |

row 2 | 4 | 5 | 6 |

row 3 | 7 | 8 | 9 |

### Sorting elements

`np.sort()`

Sorting an element is simple with `np.sort()`

. You can specify the axis, kind, and order when you call the function.

1 | import numpy as np |

`order`

: str or list of str, optional

When a is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need be specified, but unspecified fields will still be used, in the order in which they come up in the `dtype`

, to break ties.

1 | import numpy as np |

In addition to sort, which returns a sorted copy of an array, you can use:

- argsort, which is an indirect sort along a specified axis,
- lexsort, which is an indirect stable sort on multiple keys,
- searchsorted, which will find elements in a sorted array, and
- partition, which is a partial sort.

`np.concatenate()`

You can concatenate arrays with `np.concatenate()`

.

1 | import numpy as np |

Attention: Do not use `+`

in numpy. It is not the same as Python list.

1 | import numpy as np |

1 | import numpy as np |

### Indexing, slicing, and filtering

Review the concepts of the axis if you cannot understand this chapter.

#### Indexing

Indexing 1D array.

1 | import numpy as np |

Indexing 2D array.

1 | a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]) |

#### Slicing

Slicing 1D array.

1 | import numpy as np |

Slicing 2D array.

1 | a = np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]) |

#### Filtering

Use the condition will only return the `indices`

of the array.

1 | import numpy as np |

Use the `indices`

for filtering the values you need.

1 | a[a < 5] |

You can also select, for example, numbers that are equal to or greater than 5, and use that condition to index an array.

1 | a[a >= 5] |

You can select elements that are divisible by 2:

1 | a[a%2 == 0] |

Or you can select elements that satisfy two conditions using the `&`

and `|`

operators:

1 | b = a[(a > 5) | (a == 5)] |

You can also use `np.nonzero()`

to select `elements`

or `indices`

from an array.

1 | index = np.nonzero(a < 5) |

In this example, a tuple of arrays was returned: one for each dimension. The first array represents the row indices where these values are found, and the second array represents the column indices where the values are found.

If you want to generate a list of coordinates where the elements exist, you can zip the arrays, iterate over the list of coordinates, and print them. For example:

1 | list_of_coordinates = list(zip(index[0], index[1])) |

You can also use `np.nonzero()`

to print the elements in an array that are less than 5 with:

1 | a[index] |

Create a bmi function.

1 | import numpy as np |

## Numpy Statistics

### Numpy Descriptive Statistics for Numerical Data

#### Measures of Location

##### Percentiles & Quartiles

1 | import numpy as np |

##### Arithmetic Mean

1 | import numpy as np |

##### Median

1 | import numpy as np |

##### Mode

1 | from scipy import stats |

#### Measures of Dispersion

##### Variance

1 | import numpy as np |

#### Standard Deviation

1 | import numpy as np |

#### Measures of Association

##### Correlation Coefficient

1 | import pandas as pd |

Correlation Coefficient in Dataframe

1 | np.corrcoef(df['Height(inches)'][:100], df['Weight(pounds)'][:100]) |

Correlation Coefficient in ndarray

1 | np_baseball = np.array(df) |

## Pandas

### DataFrames

#### Basic

- Sorting and subsetting
- Creating new columns

df: DataFrame Object

1 | print(df) |

Pandas Philosophy

There should be one – and preferably only one – obvious way to do it.

#### Sorting

1 | df.sort_values("column_name") |

#### Subsetting

1 | df["column_name"] |

#### New Columns

- Transforming, mutating, and feature engineering.

1 | dogs["bmi"] = dogs["wight_kg"] / dogs["height_m] ** 2 |

### Aggregating Data

- Summary statistics
- Counting
- Grouped summary statistics

#### Summarizing Data

1 | df["column_name"].mean() |

#### Counting

1 | df.drop_duplicates(subset="column_names") |

#### Grouped Summary Statistics

Without Groupby

With Groupby

1 | dogs.groupby("color")["weight_kg"].mean() |

Multiple Grouped Summaries

1 | dogs.groupby("color")["weight_kg"].agg([min, max, sum]) |

Grouping by multiple variables

1 | dogs.groupby(["color", "breed"])["weight_kg"].mean() |

Many groups, many summaries

1 | dogs.groupby(["colors", "breed"])[["weight_ky", "height_cm"]].mean() |

#### Pivot Tables

Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing grouped calculations. That is, the `.pivot_table() `

method is just an alternative to `.groupby()`

.

1 | dogs.groupby("color")["weight_kg"].mean() |

1 | # By default, pivot_table takes the mean value for each group. For other summary statistic, using numpy |

1 | # By default, pivot_table takes the mean value for each group. For other summary statistic, using numpy |

### Slicing and Indexing Data

- Subsetting using slicing
- Indexes and subsetting using indexs

#### Explicit indexes

Pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).

1 | df.set_index("name") # setting index; index doesn't have to be unique; index makes subsetting more readable |

**Subsetting with .loc[]**

The killer feature for indexes is

`.loc[]`

: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.The code for subsetting using `.loc[]`

can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.

1 | df[df["col"].isin(["val_1", "val_2"])] |

**Setting multi-level indexes**

Indexes can also be made out of multiple columns, forming a *multi-level index* (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside country.

The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes, and keep track of how your data is represented.

1 | df_ind = df.set_index(["lev-1", "lev-2"]) # for example, lev-1 is country, lev-2 is state |

**Sorting by index values**

Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It’s also useful to be able to sort by elements in the index. For this, you need to use .sort_index().

1 | df.sort_index() |

**Example**

1 | # Look at temperatures |

#### Slicing and Subsetting with `.loc`

and `.iloc`

1 | # 1. Set indexes |

1 | df.loc["row_1":"row_2", "col_1":"col_2"] |

1 | df.loc[("outer_index_1", "inner_index_1"):("outer_index_2":"inner_index_2"), "col_1":"col_2"] |

1 | df.iloc[n:m, n:m] |

#### Working with Pivot Tables

1 | df.mean(axis="index") # The default value is "index", which means "calculate the statistic across rows. |

You can access the components of a date (year, month and day) using code of the form `dataframe["column"].dt.component`

. For example, the month component is `dataframe["column"].dt.month`

, and the year component is `dataframe["column"].dt.year`

.

Example

1 | # Add a year column to temperatures |

### Creating and Visualizing Data

- Plotting
- Handling missing data
- Reading data into a DataFrame

#### Visualization

1 | import matplotlib.pyplot as plt |

1 | # Histogram |

1 | # Bar Plot |

1 | # Line Plot |

1 | # Rotating Axis Labels |

1 | # Scatter Plot |

1 | # Layering Plot |

Example

### Missing Values

In a pandas DataFrame, missing values are indicated with NaN, which stands for “not a number”.

1 | df.isna() |

1 | df.dropna() # It is not ideal if you have a lot of missing values. |

1 | df.fillna(0) |

### Creating DataFrames

Creating Dataframes:

- From a list of dictionaries: Constructed row by row
- From a dictionary of lists: Constructed column by column

1 | dic = { |

#### Creating DataFrames from List of Dictionaries

#### Creating DataFrames from Dictionary of lists

### Reading and Writing CSVs

- CSV: Comma-Separated-Values.
- Designed for DataFrame-like data.
- Most database and spreadsheet programs can use them or create them.

1 | df = pd.read_csv("csv_file") |

### More to Learn

- Merging DataFrames with Pandas
- Streamlined Data Ingestion with Pandas
- Analyzing Police Activity with Pandas
- Analyzing Marketing Campaigns with Pandas