ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
ggplot2 is a package for R. In Python, we will use plotnine, which is extremely similar to ggplot2.
Installation & Loading
For R:
Install the ggplot2 package, which is included in the tidyverse package.
Then load it.
R
# Installation
install.packages("tidyverse")
# OR
# install.packages("ggplot2")

# Loading
library(tidyverse)

Python
# Loading
import pandas as pd
import numpy as np
from plotnine import *
Understanding ggplot
To understand the logic of ggplot, it helps to first learn the principles of graphics it is built on.
Data: The data must be a data frame.
Aesthetics: Aesthetics indicate the x and y variables. They can also control the color, size, or shape of points, the height of bars, and so on.
Geometric Objects: A layer combines data, aesthetic mapping, a geom (geometric object), a stat (statistical transformation), and a position adjustment. Typically, you will create layers using a geom_ function, overriding the default position and stat if needed.
Basically:
Data: Assume your data are two points, for example $(0, 0)$ and $(1, 1)$. You need to tell ggplot your data source.
Aesthetics: Then you need to define the x variable and the y variable.
Geometric Objects: Tell ggplot what kind of plot (the geometric object) you need. Let's draw it.
R
# Define data frame
two_points <- data.frame(x_value = c(0, 1), y_value = c(0, 1)) # Create your data: two points (0, 0) and (1, 1)

# Draw two points
ggplot(data = two_points,               # Data: tell ggplot the data source.
       aes(x = x_value, y = y_value)) + # Aesthetics: Define x variables and y variables.
  geom_point()                          # Geometric Objects: Tell ggplot what kind of plot you need.
Python
import pandas as pd
import numpy as np
from plotnine import *

# Define data as dictionary
d = {'x_value': [0, 1], 'y_value': [0, 1]}

# Convert dict to data frame
two_points = pd.DataFrame(data = d)

# Draw two points
(ggplot(two_points,                          # Data: tell ggplot the data source.
        aes(x = 'x_value', y = 'y_value')) + # Aesthetics: Define x variables and y variables.
    geom_point())                            # Geometric Objects: Tell ggplot what kind of plot you need.
Compare the R and Python code. Notice that in Python we create a dictionary first and then convert it to a data frame, because Python has no built-in data frame type. You have to import the pandas package and use its DataFrame class.
Let's put the main parts of the R code and the Python code side by side and look at the differences.
R
# Draw two points
ggplot(data = two_points,
       aes(x = x_value, y = y_value)) +
  geom_point()
Python
# Draw two points
(ggplot(two_points,
        aes(x = 'x_value', y = 'y_value')) +
    geom_point())
In Python:
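You must wrap the whole ggplot expression in an extra pair of parentheses when it spans several lines.
In R you can write data = two_points in front of an unnamed aes(). In Python a positional argument cannot follow a keyword argument, so pass the data frame positionally, or name both arguments (data = two_points, mapping = aes(...)).
You must quote column names with '' or "" inside aes(). Do NOT quote the data frame itself.
Inside the parentheses, you can put the + at the end or the beginning of each line.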
In R:
You can omit data = in the first line.
You must put the + at the end of each line.
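The Python rules above come from Python's own syntax rather than from plotnine. The sketch below shows them with the standard library only; ggplot_stub is a hypothetical stand-in that mirrors nothing more than the order of the data and mapping parameters.

```python
def ggplot_stub(data=None, mapping=None):
    """Hypothetical stand-in that only mirrors the data/mapping parameter order."""
    return {"data": data, "mapping": mapping}

# Passing both arguments by position, or both by name, is equivalent.
by_position = ggplot_stub("two_points", "aes(...)")
by_name = ggplot_stub(data="two_points", mapping="aes(...)")

# A positional argument after a keyword argument is rejected before the code runs.
try:
    compile('ggplot_stub(data="two_points", "aes(...)")', "<example>", "eval")
    rejected = False
except SyntaxError:
    rejected = True

# Inside parentheses, an expression may continue on the next line with a leading +.
total = (1
         + 2
         + 3)

print(by_position == by_name, rejected, total)  # True True 6
```

This is why the R habit of writing data = two_points followed by a bare aes() does not carry over to Python unchanged.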
This time, we connect the two points. The result is the simplest function, a directly proportional function, and also the simplest plot, a line chart.
R
# Connect two points
ggplot(data = two_points,               # Data: tell ggplot the data source.
       aes(x = x_value, y = y_value)) + # Aesthetics: Define x variables and y variables.
  geom_point() +
  geom_line()                           # Geometric Objects: This time we connect the two points.
Python
# Connect two points
(ggplot(two_points,                          # Data: tell ggplot the data source.
        aes(x = 'x_value', y = 'y_value')) + # Aesthetics: Define x variables and y variables.
    geom_point() +
    geom_line())                             # Geometric Objects: This time we connect the two points.
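To see why adding geom_line() simply stacks a second layer on the same data and aesthetics, here is a toy model in plain Python. The Plot and Layer classes are hypothetical and are not how plotnine or ggplot2 is implemented. They only illustrate the idea that + appends a layer (geom, stat, position) to a plot that already carries the data and the aesthetic mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """Hypothetical layer: a geom plus its stat and position adjustment."""
    geom: str
    stat: str = "identity"
    position: str = "identity"


@dataclass(frozen=True)
class Plot:
    """Hypothetical plot: shared data and aesthetics plus a tuple of layers."""
    data: dict
    mapping: dict
    layers: tuple = ()

    def __add__(self, layer):
        # Adding a layer returns a new plot; data and mapping are reused as-is.
        return Plot(self.data, self.mapping, self.layers + (layer,))


two_points = {"x_value": [0, 1], "y_value": [0, 1]}
plot = (Plot(two_points, {"x": "x_value", "y": "y_value"})
        + Layer("point")
        + Layer("line"))

print([layer.geom for layer in plot.layers])  # ['point', 'line']
```

Both layers read the same x and y mapping, which is why the line passes exactly through the two points.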
Explore more layers
Now that we understand the first three layers, let's explore the remaining four. If they do not make sense yet, don't worry. Just remember that these layers exist.
Facets: Facetting generates small multiples, each displaying a different subset of the data. Facets are an alternative to aesthetics for displaying additional discrete variables.
Statistical Transformations: A handful of layers are more easily specified with a stat_ function, drawing attention to the statistical transformation rather than the visual appearance. The computed variables can be mapped using after_stat() (stat() in ggplot2 versions before 3.3.0).
Coordinates: The coordinate system determines how the x and y aesthetics combine to position elements in the plot. The default coordinate system is Cartesian (coord_cartesian()), which can be tweaked with coord_map(), coord_fixed(), coord_flip(), and coord_trans(), or completely replaced with coord_polar().
Themes: Themes control the display of all non-data elements of the plot. You can override all settings with a complete theme like theme_bw(), or choose to tweak individual settings by using theme() and the element_ functions. Use theme_set() to modify the active theme, affecting all future plots.
qplot(): Quick plot with ggplot2
The qplot() function is very similar to the standard R plot() function. It can be used to quickly and easily create different types of graphs: scatter plots, box plots, violin plots, histograms, and density plots.
A simplified format of qplot() is:
R
qplot(x, y = NULL, data, geom = "auto")
Python
qplot(x = None, y = None, data = None, geom = "auto")
x, y : x and y values, respectively. The argument y is optional depending on the type of graphs to be created.
data : data frame to use (optional).
geom : Character vector specifying geom to use. Defaults to “point” if x and y are specified, and “histogram” if only x is specified.
The code below creates basic scatter plots using the argument geom = "point". It's also possible to combine different geoms (e.g., geom = c("point", "smooth") in R, or geom = ["point", "smooth"] in Python).
R
# Basic scatter plot
qplot(x = mpg, y = wt, data = df, geom = "point")
R
# Scatter plot with smoothed line
qplot(mpg, wt, data = df, geom = c("point", "smooth"))
Python
# Basic scatter plot
qplot(x = 'mpg', y = 'wt', data = df, geom = "point")
Python
# Scatter plot with smoothed line
qplot('mpg', 'wt', data = df, geom = ["point", "smooth"])
qplot() can also change the color and the shape of points by group. Here the column cyl is used as the grouping variable, so the color and the shape of the points vary with the levels of cyl.
The histogram and density plots are used to display the distribution of data.
R Histogram
# Histogram plot
# Change histogram fill color by group (sex)
qplot(weight, data = wdata, geom = "histogram", fill = sex)
R Density Plot
# Density plot
# Change density plot line color by group (sex)
# change line type
qplot(weight, data = wdata, geom = "density", color = sex, linetype = sex)
Python Histogram
# Histogram plot
# Change histogram fill color by group (sex)
qplot('weight', data = wdata, geom = "histogram", fill = 'sex')
Python Density Plot
# Density plot
# Change density plot line color by group (sex)
# change line type
qplot('weight', data = wdata, geom = "density", color = 'sex', linetype = 'sex')