Ggplot2 & Plotnine

Be Awesome in ggplot2: A Practical Guide to be Highly Effective - R software and data visualization
How to plot with ggplot2 in Python
plotnine
ggplot function reference

Introduction

ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

ggplot2 is a package for R. In Python, we will use plotnine, whose interface closely follows ggplot2.


Installation & Loading

For R:

  1. Install the ggplot2 package, either on its own or as part of the tidyverse package.
  2. Then load it.
R
# Installation
install.packages("tidyverse")
# OR
# install.packages("ggplot2")

# Loading
library("ggplot2")

For Python:

  1. Install the Anaconda distribution, which ships with pandas and numpy (or install these two packages yourself).
  2. Install the plotnine package through conda:
    Shell
    conda install -c conda-forge plotnine
  3. Then load them.
    Python
    # Loading
    import pandas as pd
    import numpy as np
    from plotnine import *

Understanding ggplot

To understand the logic of ggplot, it helps to learn the basic principles of graphics first.

  1. Data: The data must be a data frame.
  2. Aesthetics: Aesthetics indicate the x and y variables. They can also control the color, size or shape of points, the height of bars, and so on.
  3. Geometric Objects: A layer combines data, aesthetic mapping, a geom (geometric object), a stat (statistical transformation), and a position adjustment. Typically, you will create layers using a geom_ function, overriding the default position and stat if needed.

In short:
Data: Suppose your data are two points, for example $(0, 0)$ and $(1, 1)$. You need to tell ggplot where the data come from.

Aesthetics: Then you need to define the x and y variables.

Geometric Objects: Tell ggplot what kind of plot, i.e. which geometric object, you need. Let's draw it.

R
# Define data frame
two_points <- data.frame(x_value = c(0, 1), y_value = c(0, 1)) # Create your data: two points (0, 0) and (1, 1)

# Draw two points
ggplot(data = two_points, # Data: tell ggplot the data source.
       aes(x = x_value, y = y_value)) + # Aesthetics: define the x and y variables.
  geom_point() # Geometric Objects: tell ggplot what kind of plot you need.

Python
import pandas as pd
import numpy as np
from plotnine import *

# Define data as dictionary
d = {'x_value': [0, 1], 'y_value': [0, 1]}

# Convert dict to data frame
two_points = pd.DataFrame(data = d)

# Draw two points
(ggplot(two_points, # Data: tell ggplot the data source.
        aes(x = 'x_value', y = 'y_value')) + # Aesthetics: define the x and y variables.
 geom_point()) # Geometric Objects: tell ggplot what kind of plot you need.

Compare the R and Python code. In Python, the data are first written as a dictionary and then converted to a data frame, because Python has no built-in data frame type. The pandas package provides one, pd.DataFrame, so pandas must be imported first.

Let's put the main parts of the R and Python code side by side and look at the differences.

R
# Draw two points
ggplot(data = two_points,
       aes(x = x_value, y = y_value)) +
  geom_point()
Python
# Draw two points
(ggplot(two_points,
        aes(x = 'x_value', y = 'y_value')) +
 geom_point())

In Python:

  1. Wrap the whole ggplot expression in an extra pair of parentheses whenever it spans more than one line.
  2. In R the data frame is usually passed as data = two_points; the Python examples here pass it positionally as the first argument.
  3. Column names inside aes() must be quoted with '' or "". The data frame itself is a variable and is NOT quoted.
  4. Inside the parentheses, you can put the + at the end or at the beginning of each line.

In R:

  1. You can omit data = in the first line.
  2. You must put the + at the end of a line, not at the beginning of the next one.

This time, let's connect the two points. The result is the graph of the simplest function, direct proportion ($y = x$ here), drawn as the simplest kind of plot, a line chart.

R
# Connect two points
ggplot(data = two_points, # Data: tell ggplot the data source.
       aes(x = x_value, y = y_value)) + # Aesthetics: define the x and y variables.
  geom_point() + geom_line() # Geometric Objects: this time we connect the two points.

Python
# Connect two points
(ggplot(two_points, # Data: tell ggplot the data source.
        aes(x = 'x_value', y = 'y_value')) + # Aesthetics: define the x and y variables.
 geom_point() + geom_line()) # Geometric Objects: this time we connect the two points.


Explore more layers

We have now covered the first three layers. Let's look at the remaining four. If they do not make sense yet, don't worry; for now, just remember that these layers exist.

For more detailed definitions, see the ggplot function reference.

From bottom to top:

  1. Data: The data must be a data frame.
  2. Aesthetics: Aesthetics indicate the x and y variables. They can also control the color, size or shape of points, the height of bars, and so on.
  3. Geometric Objects: A layer combines data, aesthetic mapping, a geom (geometric object), a stat (statistical transformation), and a position adjustment. Typically, you will create layers using a geom_ function, overriding the default position and stat if needed.
  4. Facets: Facetting generates small multiples, each displaying a different subset of the data. Facets are an alternative to aesthetics for displaying additional discrete variables.
  5. Statistical Transformations: A handful of layers are more easily specified with a stat_ function, drawing attention to the statistical transformation rather than the visual appearance. The computed variables can be mapped using stat(); newer ggplot2 releases use after_stat() for this.
  6. Coordinates: The coordinate system determines how the x and y aesthetics combine to position elements in the plot. The default coordinate system is Cartesian (coord_cartesian()), which can be tweaked with coord_map(), coord_fixed(), coord_flip(), and coord_trans(), or completely replaced with coord_polar().
  7. Themes: Themes control the display of all non-data elements of the plot. You can override all settings with a complete theme like theme_bw(), or choose to tweak individual settings by using theme() and the element_ functions. Use theme_set() to modify the active theme, affecting all future plots.

qplot(): Quick plot with ggplot2

The qplot() function is very similar to the standard R plot() function. It offers a quick way to create several types of graphs: scatter plots, box plots, violin plots, histograms and density plots.

A simplified form of qplot() is:

R
qplot(x, y = NULL, data, geom="auto")
Python
qplot(x = None, y = None, data = None, geom = "auto")
  • x, y : x and y values, respectively. The argument y is optional depending on the type of graphs to be created.
  • data : data frame to use (optional).
  • geom : Character vector specifying geom to use. Defaults to “point” if x and y are specified, and “histogram” if only x is specified.

Read more about qplot(): Quick plot with ggplot2.

Reproducing our experiment in qplot():

R:

R qplot()
# Loading
library("ggplot2")

# Define data frame
two_points <- data.frame(x_value = c(0, 1), y_value = c(0, 1)) # Create your data two points (0, 0) and (1, 1)
R qplot()
# Draw two points: qplot
qplot(x = x_value, y = y_value, data = two_points,
      geom = "point")

Compare to ggplot:

R ggplot()
# Draw two points: ggplot()
ggplot(data = two_points, aes(x = x_value, y = y_value)) +
  geom_point()

R qplot()
# Connect two points: qplot
qplot(x = x_value, y = y_value, data = two_points,
      geom = c("point", "line"))
R ggplot()
# Connect two points: ggplot()
ggplot(data = two_points, aes(x = x_value, y = y_value)) +
  geom_point() + geom_line()

Python

Python
import pandas as pd
import numpy as np
from plotnine import *

# Define data as dictionary
d = {'x_value': [0, 1], 'y_value': [0, 1]}

# Convert dict to data frame
two_points = pd.DataFrame(data = d)
Python qplot()
# Draw two points: qplot()
qplot(x = 'x_value', y = 'y_value', data = two_points,
      geom = 'point')

Compare with ggplot(): in qplot() the data frame comes after x and y, so it is passed by keyword as data = two_points. As before, the data frame is a variable and takes no '' or "".

Python ggplot()
# Draw two points: ggplot()
(ggplot(two_points,
        aes(x = 'x_value', y = 'y_value')) +
 geom_point())

Python qplot()
# Connect two points: qplot()
qplot(x = 'x_value', y = 'y_value', data = two_points,
      geom = ['point', 'line'])

Compare to ggplot:

Python ggplot()
# Connect two points: ggplot()
(ggplot(two_points,
        aes(x = 'x_value', y = 'y_value')) +
 geom_point() + geom_line())


Data format and preparation

The data set mtcars is used in the examples below:

Data profile:

  • mtcars : Motor Trend Car Road Tests.
  • Description: The data comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973 - 74 models).
  • Format: A data frame with 32 observations on 11 variables. The examples below use three of them:
    • [, 1] mpg Miles/(US) gallon
    • [, 2] cyl Number of cylinders
    • [, 6] wt Weight (lb/1000)
R
# Load the data
data(mtcars)
df <- mtcars[, c("mpg", "cyl", "wt")] # Extract all of the rows; extract columns: "mpg", "cyl", "wt".

# Convert cyl to a factor variable
df$cyl <- as.factor(df$cyl)
head(df) # show the first 6 rows of data

##                    mpg cyl    wt
## Mazda RX4         21.0   6 2.620
## Mazda RX4 Wag     21.0   6 2.875
## Datsun 710        22.8   4 2.320
## Hornet 4 Drive    21.4   6 3.215
## Hornet Sportabout 18.7   8 3.440
## Valiant           18.1   6 3.460

Python
import pandas as pd
import numpy as np
from plotnine import *
from plotnine.data import mtcars

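df = mtcars[["name", "mpg", "cyl", "wt"]].copy()

# Convert cyl to a categorical variable (the pandas counterpart of an R factor)
df["cyl"] = df["cyl"].astype("category")
df.head(6) # show the first 6 rows of data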

#               name   mpg  cyl     wt
#          Mazda RX4  21.0    6  2.620
#      Mazda RX4 Wag  21.0    6  2.875
#         Datsun 710  22.8    4  2.320
#     Hornet 4 Drive  21.4    6  3.215
#  Hornet Sportabout  18.7    8  3.440
#            Valiant  18.1    6  3.460

Scatter plots

The code below creates basic scatter plots using the argument geom = "point". It is also possible to combine several geoms, e.g. geom = c("point", "smooth") in R or geom = ["point", "smooth"] in Python.

R
# Basic scatter plot
qplot(x = mpg, y = wt, data = df, geom = "point")

R
# Scatter plot with smoothed line
qplot(mpg, wt, data = df,
      geom = c("point", "smooth"))

Python
# Basic scatter plot
qplot(x = 'mpg', y = 'wt', data = df, geom = "point")

Python
# Scatter plot with smoothed line
qplot('mpg', 'wt', data = df,
      geom = ["point", "smooth"])

The following code changes the color and the shape of the points by group. The column cyl is used as the grouping variable, so the color and the shape of each point depend on its level of cyl.

R
qplot(mpg, wt, data = df, colour = cyl, shape = cyl)

Python
# Change color and shape by group (cyl must be categorical, not numeric)
qplot('mpg', 'wt', data = df, color = 'cyl', shape = 'cyl')


Box plot, violin plot and dot plot

The R code below generates some data containing the weights by sex (M for male; F for female):

R
# Define Data Frame
wdata = data.frame(
  sex = factor(rep(c("F", "M"), each=200)),
  weight = c(rnorm(200, 55), rnorm(200, 58)))

head(wdata)

#   sex weight
# 1   F   53.8
# 2   F   55.3
# 3   F   56.1
# 4   F   52.7
# 5   F   55.4
# 6   F   55.5
Python
import pandas as pd
import numpy as np
from plotnine import *

# Define weight
np.random.seed(1234)
weight1 = np.random.normal(55, size = 200)
weight2 = np.random.normal(58, size = 200)
weight = np.append(weight1, weight2)

# Define sex
sex = ['F'] * 200 + ['M'] * 200

# Define Data Frame
# Lowercase column names, matching the R example and the plots below
wdata = pd.DataFrame(list(zip(sex, weight)), columns = ["sex", "weight"]) # zip() pairs up the two sequences

wdata.head(6)
"""
  sex     weight
0   F  55.471435
1   F  53.809024
2   F  56.432707
3   F  54.687348
4   F  54.279411
5   F  55.887163
"""
R Box plot
# Basic box plot from data frame
qplot(sex, weight, data = wdata,
      geom = "boxplot", fill = sex)

R Violin plot
# Violin plot
qplot(sex, weight, data = wdata, geom = "violin")

R Dot plot
# Dot plot
qplot(sex, weight, data = wdata, geom = "dotplot",
      stackdir = "center", binaxis = "y", dotsize = 0.5)

Python Box plot
# Basic box plot from data frame
qplot('sex', 'weight', data = wdata,
      geom = "boxplot", fill = 'sex')

Python Violin plot
# Violin plot
qplot('sex', 'weight', data = wdata, geom = "violin")


Histogram and density plots

The histogram and density plots are used to display the distribution of data.

R Histogram
# Histogram plot
# Change histogram fill color by group (sex)
qplot(weight, data = wdata, geom = "histogram",
      fill = sex)

R Density plot
# Density plot
# Change density plot line color by group (sex)
# change line type
qplot(weight, data = wdata, geom = "density",
      color = sex, linetype = sex)

Python Histogram
# Histogram plot
# Change histogram fill color by group (sex)
qplot('weight', data = wdata, geom = "histogram",
      fill = 'sex')

Python Density plot
# Density plot
# Change density plot line color by group (sex)
# change line type
qplot('weight', data = wdata, geom = "density",
      color = 'sex', linetype = 'sex')