In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables) of the same type, each identified by at least one array index or key. An array is stored such that the position of each element can be computed from its index tuple by a mathematical formula. The simplest type of data structure is a linear array, also called a one-dimensional array.
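The "mathematical formula" can be made concrete with NumPy's `strides` attribute, which gives the number of bytes to step along each axis. The following is a minimal sketch; the helper name `byte_offset` and the 3 x 4 example array are assumptions made for illustration.

```python
import numpy as np

# A small 2-D array of 64-bit integers (8 bytes per element).
a = np.arange(12, dtype=np.int64).reshape(3, 4)

def byte_offset(index, strides):
    """Offset in bytes of the element at `index`, computed from the strides."""
    return sum(i * s for i, s in zip(index, strides))

index = (2, 1)
offset = byte_offset(index, a.strides)   # 2 * 32 + 1 * 8 = 72 bytes
position = offset // a.itemsize          # slot 9 (0-based) in the flat buffer

print(a.strides)                         # (32, 8)
print(position)                          # 9
print(a.ravel()[position] == a[index])   # True
```

For a contiguous array like this one, the element at `(i, j)` sits at flat position `i * number_of_columns + j`, which is exactly what the stride arithmetic computes.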
An array is a concept similar to a matrix in mathematics. A matrix (plural: matrices) is a rectangular array of numbers, symbols, or expressions arranged in rows and columns. For example, the dimension of the matrix below is $2 \times 3$ (read “two by three”), because there are two rows and three columns:
Such a matrix is called an $m \times n$ matrix, where $m$ is the number of rows and $n$ is the number of columns. The sample above is therefore a two-dimensional array.
In NumPy, dimensions are called axes. There is one small difference between a mathematical matrix and an array in computer science: matrix indices conventionally start at 1, whereas array indices in NumPy (as in most programming languages) start at 0.
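To make the axis and indexing conventions concrete, here is a small sketch of a 2 x 3 matrix stored as a NumPy array. The values are chosen only for illustration.

```python
import numpy as np

# A 2 x 3 matrix written as a NumPy array (illustrative values).
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)    # 2      -> two axes
print(m.shape)   # (2, 3) -> 2 rows, 3 columns
print(m[0, 0])   # 1      -> the entry a mathematician would call m_{1,1}
print(m[1, 2])   # 6      -> the entry m_{2,3}
```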
One-dimensional array
A 1D array looks almost the same as a Python list.
Python
import numpy as np

l = [1, 2, 3]
print(l)  # [1, 2, 3]

a = np.array([1, 2, 3])
print(a)  # [1 2 3]
a  # array([1, 2, 3])
A 1D array has only one axis, axis = 0. In the following sample, the length of axis 0 is three because the array has three elements.
Python
import numpy as np

a = np.array([1, 2, 3])
np.sum(a, axis=0)  # the sum of array a along axis 0
# 6

a[0]  # the first element
# 1
a[1]  # the second element
# 2
a[2]  # the third element
# 3
Two-dimensional array
A 2D array has two axes. In the following sample, the length of axis = 0 is two because the array has two rows; the length of axis = 1 is three because each row has three elements.
Python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
a
# array([[1, 2, 3],
#        [4, 5, 6]])
Now we have an array a that has two axes.
|       | col 1 | col 2 | col 3 |
|-------|-------|-------|-------|
| row 1 | 1     | 2     | 3     |
| row 2 | 4     | 5     | 6     |
Python
a[0]  # the first row
# array([1, 2, 3])
a[1]  # the second row
# array([4, 5, 6])

np.sum(a, axis=0)  # sum along axis 0 (down the rows): one total per column
# array([5, 7, 9])
np.sum(a, axis=1)  # sum along axis 1 (across the columns): one total per row
# array([ 6, 15])

a[0][0]  # the element in the first row and the first column
# 1
a[1][2]  # the element in the second row and the third column
# 6
a[-1][-1]  # the element in the last row and the last column
# 6
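As a complement to the chained form `a[i][j]`, NumPy also accepts a single tuple index `a[i, j]`. The sketch below, using the same 2 x 3 array, shows that form together with `shape` and the method form of `sum`.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.shape)         # (2, 3): axis 0 has length 2, axis 1 has length 3
print(a[1, 2])         # 6, the same element as a[1][2]
print(a[:, 0])         # [1 4], the first column
print(a.sum(axis=0))   # [5 7 9], one total per column
print(a.sum(axis=1))   # [ 6 15], one total per row
```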
Three-dimensional array
A 3D array can be hard to picture, but it can be split into parts, each of which is a 2D array.
Imagine two tables (two 2D arrays) stacked together: the result is a 3D array.
Python
import numpy as np

a = np.arange(16).reshape(2, 2, 4)  # assumed definition, consistent with the outputs below

a[0][0]  # axis 0 index 0, axis 1 index 0 -> the first 2D array, the first row
# array([0, 1, 2, 3])
a[0][1]  # axis 0 index 0, axis 1 index 1 -> the first 2D array, the second row
# array([4, 5, 6, 7])
a[1][0]  # axis 0 index 1, axis 1 index 0 -> the second 2D array, the first row
# array([ 8,  9, 10, 11])
a[1][1]  # axis 0 index 1, axis 1 index 1 -> the second 2D array, the second row
# array([12, 13, 14, 15])

a[0][0][0]  # indices 0, 0, 0 -> the first 2D array, the first row, the first column
# 0
a[0][0][1]  # indices 0, 0, 1 -> the first 2D array, the first row, the second column
# 1
From the 3D array we can see that, in NumPy, each successive index selects a position along one axis. For example, a[0][1] selects index 0 along the first axis (axis = 0) and index 1 along the second axis (axis = 1).
No matter how many dimensions an array has, the first axis (axis = 0) is the outermost level. The last axis (axis = -1) is the column, and the second-to-last axis (axis = -2) is the row.
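The roles of axis = 0, axis = -2 and axis = -1 can be checked by summing along each of them. This is a small sketch that assumes a 2 x 2 x 4 array built with `np.arange`; the values are illustrative.

```python
import numpy as np

a = np.arange(16).reshape(2, 2, 4)   # 2 tables, each with 2 rows and 4 columns

print(a.shape)                # (2, 2, 4)
print(a.sum(axis=0).shape)    # (2, 4): the two tables are added together
print(a.sum(axis=-2).shape)   # (2, 4): rows are collapsed within each table
print(a.sum(axis=-1).shape)   # (2, 2): columns are collapsed within each row
print(a[1, 0, 3])             # 11: second table, first row, fourth column
```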
What’s the difference between a Python list and a NumPy array?
NumPy gives you an enormous range of fast and efficient ways of creating arrays and manipulating numerical data inside them. While a Python list can contain different data types within a single list, all of the elements in a NumPy array should be homogeneous. The mathematical operations that are meant to be performed on arrays would be extremely inefficient if the arrays weren’t homogeneous.
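A short sketch of the contrast, with values chosen only for illustration: the list keeps each element's own type, while the array converts everything to one common dtype so that arithmetic can run element by element.

```python
import numpy as np

mixed_list = [1, 2.5, "three"]                  # a list may mix int, float and str
print([type(x).__name__ for x in mixed_list])   # ['int', 'float', 'str']

arr = np.array([1, 2.5, 3])   # the int values are upcast to float
print(arr.dtype)              # float64
print(arr * 2)                # [2. 5. 6.]
```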
Why use NumPy?
NumPy arrays are faster and more compact than Python lists. An array consumes less memory and is convenient to use. NumPy uses much less memory to store data and it provides a mechanism of specifying the data types. This allows the code to be optimized even further.
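The memory claim can be checked roughly as follows. This is a sketch, not a benchmark: the list size is estimated with `sys.getsizeof` under CPython, and the element count `n = 1000` is an arbitrary choice.

```python
import sys
import numpy as np

n = 1000
values = list(range(n))
arr64 = np.arange(n, dtype=np.int64)
arr16 = np.arange(n, dtype=np.int16)   # a smaller dtype, chosen explicitly

# Rough size of the list: the list object itself plus each int object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

print(arr64.nbytes)                # 8000 bytes of raw data
print(arr16.nbytes)                # 2000 bytes of raw data
print(list_bytes > arr64.nbytes)   # True on CPython
```

Specifying the dtype is what lets the same thousand values take 2000 bytes instead of 8000.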
The difference between a Python list and a NumPy array
However, the code is not concise enough. Additionally, you cannot use this method too many times. Therefore, let’s try the NumPy array, also called ndarray (short for N-dimensional array).
What is an array?
An array is a central data structure of the NumPy library. An array is a grid of values and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be indexed in various ways. The elements are all of the same type, referred to as the array dtype.
Create a basic array
np.array()
To create a NumPy array, you can use the function np.array().
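A minimal sketch of such a call; the values 1, 2, 3 and the variable name `a` are assumptions chosen for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])   # build a 1D array from a Python list

print(a)         # [1 2 3]
print(type(a))   # <class 'numpy.ndarray'>
print(a.shape)   # (3,)
```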
Once you run the above code, NumPy initializes a row for array a. A row is a similar concept in a dataframe, a matrix, or a database. In dataframes and databases, a column stores data of the same type, and a row stores the data of one individual record.
For example, here is a sample dataframe of a table in a database.
| Name         | Height | Age | Gender |
|--------------|--------|-----|--------|
| Zack Fair    | 1.85   | 23  | M      |
| Cloud Strife | 1.73   | 21  | M      |

Obviously, Name and Gender should be of type string, Height should be of type float, and Age should be of type int. Every column stores data of a single type.
An array can be indexed by a tuple of nonnegative integers, by booleans, by another array, or by integers. The rank of the array is the number of dimensions. The shape of the array is a tuple of integers giving the size of the array along each dimension.
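The indexing styles and the rank and shape attributes mentioned above can be sketched on a small 3 x 3 array (illustrative values):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(a.ndim)      # 2, the rank (number of dimensions)
print(a.shape)     # (3, 3), the size along each dimension
print(a[1, 2])     # 6, indexed by a tuple of integers
print(a[a > 6])    # [7 8 9], indexed by a boolean array
print(a[[0, 2]])   # rows 0 and 2, indexed by a list of integers
```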
One way we can initialize NumPy arrays is from Python lists, using nested lists for two- or higher-dimensional data.
Python
import numpy as np

a = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
print(a)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]

a
# array([[1, 2, 3],
#        [4, 5, 6],
#        [7, 8, 9]])
|       | col 1 | col 2 | col 3 |
|-------|-------|-------|-------|
| row 1 | 1     | 2     | 3     |
| row 2 | 4     | 5     | 6     |
| row 3 | 7     | 8     | 9     |
Sorting elements
np.sort()
Sorting the elements of an array is simple with np.sort(). You can specify the axis, kind, and order when you call the function.
Python
import numpy as np

a = np.array([2, 1, 5, 3, 7, 4, 6, 8])
np.sort(a)
# array([1, 2, 3, 4, 5, 6, 7, 8])
order: str or list of str, optional. When a is an array with fields defined, this argument specifies which fields to compare first, second, etc. A single field can be specified as a string, and not all fields need be specified, but unspecified fields will still be used, in the order in which they come up in the dtype, to break ties.
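The `order` argument only applies to arrays with named fields (structured arrays). Here is a minimal sketch; the field names and the third record are invented for illustration.

```python
import numpy as np

# Structured dtype: each record has a name, a height and an age.
dtype = [("name", "U12"), ("height", float), ("age", int)]
people = np.array(
    [("Zack", 1.85, 23), ("Cloud", 1.73, 21), ("Aerith", 1.63, 22)],  # made-up records
    dtype=dtype,
)

by_height = np.sort(people, order="height")   # compare the "height" field first
by_age = np.sort(people, order="age")         # compare the "age" field first

print(by_height["name"])   # ['Aerith' 'Cloud' 'Zack']
print(by_age["name"])      # ['Cloud' 'Aerith' 'Zack']
```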
In this example, a tuple of arrays was returned: one for each dimension. The first array represents the row indices where these values are found, and the second array represents the column indices where the values are found.
If you want to generate a list of coordinates where the elements exist, you can zip the arrays, iterate over the list of coordinates, and print them. For example:
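A minimal sketch of that idea, assuming a small 3 x 4 array and the condition `a < 5`, both chosen here for illustration:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

rows, cols = np.nonzero(a < 5)   # one index array per dimension
coordinates = [(int(r), int(c)) for r, c in zip(rows, cols)]

for coord in coordinates:
    print(coord)
# (0, 0)
# (0, 1)
# (0, 2)
# (0, 3)
```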
print(df)
df.head()      # returns the first few rows (the “head” of the DataFrame)
df.info()      # shows information on each column, such as the data type and the number of non-missing values
df.shape       # the number of rows and columns of the DataFrame
df.describe()  # calculates a few summary statistics for each column
df.columns     # the column names
df.index       # the index
Pandas Philosophy
There should be one – and preferably only one – obvious way to do it.
def func(column):  # define the aggregation function
    return ...

# The .agg() method allows you to apply your own custom functions to a DataFrame,
# as well as apply functions to more than one column of a DataFrame at once.
df["column_names"].agg(func)
df["column_names"].agg([func_1, func_2])  # several functions at once
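A concrete sketch of a custom aggregation. The DataFrame contents and the helper name `value_range` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"weight_kg": [24, 24, 2, 74], "height_cm": [56, 43, 18, 77]})

def value_range(column):
    """Custom aggregation: spread between the largest and smallest value."""
    return column.max() - column.min()

one_column = df["weight_kg"].agg(value_range)                   # a single number
two_columns = df[["weight_kg", "height_cm"]].agg(value_range)   # one value per column
several = df["weight_kg"].agg([value_range, "median"])          # several statistics at once

print(one_column)    # 72
print(two_columns)
print(several)
```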
df.drop_duplicates(subset="column_names")
df["column"].value_counts()
df["column"].value_counts(sort=True)
df["column"].value_counts(normalize=True)  # return each proportion of the total
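The calls above can be tried on a tiny table. The `dogs` data below is invented for illustration.

```python
import pandas as pd

dogs = pd.DataFrame({
    "name": ["Bella", "Bella", "Max", "Lucy"],
    "breed": ["Labrador", "Labrador", "Poodle", "Labrador"],
})

unique_dogs = dogs.drop_duplicates(subset="name")             # keeps the first "Bella" only
counts = unique_dogs["breed"].value_counts(sort=True)         # Labrador 2, Poodle 1
shares = unique_dogs["breed"].value_counts(normalize=True)    # proportions that sum to 1

print(counts)
print(shares)
```

Dropping duplicates first matters here: without it, the repeated "Bella" row would be counted twice.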
Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing grouped calculations. That is, the .pivot_table() method is just an alternative to .groupby().
Python
dogs.groupby("color")["weight_kg"].mean()

# values is the column you want to summarize; index is the column you want to group by.
dogs.pivot_table(values="weight_kg", index="color")
Python
# By default, pivot_table takes the mean value for each group.
# For another summary statistic, pass a function (for example from NumPy) as aggfunc.
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)
Python
# Pivot on two variables: rows are grouped by color, columns by breed.
# fill_value replaces missing values in the result.
# margins=True adds an "All" row and column holding the summary statistic
# (the mean by default) across each column and row.
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
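The equivalence between `.groupby()` and `.pivot_table()` can be checked on a small example. The `dogs` table below is invented for illustration.

```python
import pandas as pd

dogs = pd.DataFrame({
    "color": ["Brown", "Brown", "Black", "Black"],
    "breed": ["Labrador", "Poodle", "Labrador", "Labrador"],
    "weight_kg": [24.0, 20.0, 30.0, 26.0],
})

grouped = dogs.groupby("color")["weight_kg"].mean()
pivoted = dogs.pivot_table(values="weight_kg", index="color")

print(grouped)   # Black 28.0, Brown 22.0 (a Series)
print(pivoted)   # the same numbers, as a one-column DataFrame

wide = dogs.pivot_table(values="weight_kg", index="color", columns="breed",
                        fill_value=0, margins=True)
print(wide)      # Black/Poodle is filled with 0; "All" holds the overall means
```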
Slicing and Indexing Data
Subsetting using slicing
Indexes and subsetting using indexes
Explicit indexes
Pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).
Python
# Setting an index; the index doesn't have to be unique, and it makes subsetting more readable.
df.set_index("name")
df.reset_index(drop=True)  # remove the index; drop=True discards it instead of keeping it as a column
Subsetting with .loc[]
The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.
The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.
Python
df[df["col"].isin(["val_1", "val_2"])]  # filtering on column values
df.loc[["val_1", "val_2"]]  # filtering on index values
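A small runnable sketch of the two styles side by side. The data and the name `df_ind` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Bella", "Max", "Lucy"],
    "weight_kg": [24, 8, 30],
})

by_column = df[df["name"].isin(["Bella", "Lucy"])]   # boolean mask on a column
df_ind = df.set_index("name")
by_index = df_ind.loc[["Bella", "Lucy"]]             # the same rows, selected by label

print(by_index)
print(df_ind.reset_index())            # "name" becomes a column again
print(df_ind.reset_index(drop=True))   # the index is discarded entirely
```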
Setting multi-level indexes
Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.
The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside country.
The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes, and keep track of how your data is represented.
Python
df_ind = df.set_index(["lev-1", "lev-2"])  # for example, lev-1 is country, lev-2 is state
df_ind.loc[[("US", "CA"), ("JP", "TOKYO")]]  # subset inner levels with a list of tuples
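A runnable sketch of a two-level index. The temperature values and city names are invented for illustration.

```python
import pandas as pd

temps = pd.DataFrame({
    "country": ["US", "US", "JP", "JP"],
    "city": ["LA", "NYC", "Tokyo", "Osaka"],
    "avg_temp_c": [18.0, 12.0, 16.0, 17.0],   # made-up values
})

temps_ind = temps.set_index(["country", "city"])

outer = temps_ind.loc[["US"]]                              # all rows for the outer level "US"
pairs = temps_ind.loc[[("US", "NYC"), ("JP", "Tokyo")]]    # specific (outer, inner) pairs

print(outer)
print(pairs)
```

A list of plain labels selects on the outer level, while a list of tuples selects exact combinations of outer and inner levels.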
Sorting by index values
Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It’s also useful to be able to sort by elements in the index. For this, you need to use .sort_index().
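A short sketch of `.sort_index()` on a two-level index, using invented data. The `level` and `ascending` arguments control which index levels are sorted and in which direction.

```python
import pandas as pd

temps_ind = pd.DataFrame({
    "country": ["US", "US", "JP", "JP"],
    "city": ["LA", "NYC", "Tokyo", "Osaka"],
    "avg_temp_c": [18.0, 12.0, 16.0, 17.0],   # made-up values
}).set_index(["country", "city"])

default_sorted = temps_ind.sort_index()        # outer level first, then inner level
by_city = temps_ind.sort_index(level="city")   # sort on the inner level
mixed = temps_ind.sort_index(level=["country", "city"], ascending=[True, False])

print(default_sorted.index.tolist())
print(by_city.index.tolist())
print(mixed.index.tolist())
```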
df.mean(axis="index")  # the default, "index", means "calculate the statistic across rows" (one value per column)
df.mean(axis="columns")
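The two axis settings are easy to mix up, so here is a tiny sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

col_means = df.mean(axis="index")     # one mean per column: a 2.0, b 5.0
row_means = df.mean(axis="columns")   # one mean per row: 2.5, 3.5, 4.5

print(col_means)
print(row_means)
```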
You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.
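A minimal sketch of the `.dt` accessor. The small `temperatures` table here is invented for illustration, and the column must hold datetime values for `.dt` to work.

```python
import pandas as pd

temperatures = pd.DataFrame({
    "date": pd.to_datetime(["2005-01-01", "2005-02-01", "2006-01-01"]),
    "avg_temp_c": [10.0, 12.0, 11.0],   # made-up values
})

temperatures["year"] = temperatures["date"].dt.year
temperatures["month"] = temperatures["date"].dt.month

print(temperatures["year"].tolist())    # [2005, 2005, 2006]
print(temperatures["month"].tolist())   # [1, 2, 1]
```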
# Add a year column to temperatures
temperatures["year"] = temperatures["date"].dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table(values="avg_temp_c", index=["country", "city"], columns="year")

# See the result
print(temp_by_country_city_vs_year)

###

# Subset for Egypt to India
temp_by_country_city_vs_year.loc["Egypt":"India"]

# Subset for Egypt, Cairo to India, Delhi
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India", "Delhi")]

# Subset in both directions at once
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India", "Delhi"), 2005:2010]

###

# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Find the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")

# Find the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])
Creating and Visualizing Data
Plotting
Handling missing data
Reading data into a DataFrame
Visualization
Python
import matplotlib.pyplot as plt
Histogram
# Histogram
df["col"].hist()
plt.show()

df["col"].hist(bins=20)  # bins sets the number of bars
plt.show()
Bar Plot
# Bar Plot
df.plot(kind="bar")
plt.show()

df.plot(kind="bar", title="")
plt.show()
Line Plot
# Line Plot
df.plot(x="col", y="col", kind="line")
plt.show()