1.0 INTRODUCTION
One of the most widely used tidyverse tools for making a visualisation is ggplot2, and it is the focus of this chapter. The terms ggplot2 and ggplot refer to different aspects of the same data visualization package in R. Here’s a detailed explanation of each term and how they differ:
ggplot2
- Definition: ggplot2 is a data visualization package in R, part of the Tidyverse. It is a powerful and flexible system for creating static graphics based on the Grammar of Graphics, which breaks up graphics into semantic components such as scales and layers.
- Usage: You load the ggplot2 package in R using library(ggplot2). This makes all the functions and capabilities of the package available for use in your R session.
- Scope: The term ggplot2 encompasses the entire package, including functions like ggplot(), geom_point(), geom_line(), and many others.
ggplot
- Definition: ggplot() is a function within the ggplot2 package. It is the main function used to initialize and create a ggplot object.
- Usage: You use the ggplot() function to create a new plot. This function specifies the data frame to use and the aesthetic mappings for the plot.
- Scope: The term ggplot refers specifically to this function, which serves as the foundation for building a plot in ggplot2.
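The package/function distinction can be made concrete with R's package::function notation. The short sketch below assumes only that ggplot2 is installed; the mpg dataset that ships with ggplot2 is used purely for illustration.

```r
# ggplot2 is the package; ggplot() is one function inside it.
# The `::` operator calls a function from a package without attaching it.
p <- ggplot2::ggplot(data = ggplot2::mpg)

# After attaching the package, the prefix is no longer needed.
library(ggplot2)
p <- ggplot(data = mpg)
```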
Before we can use the tidyverse, we have to load it with the command below, and this has to be done in every new R session in which we intend to use the visualisation package. Loading the tidyverse automatically loads ggplot2 along with several other packages.
library(tidyverse)
- In addition to the tidyverse, we need a dataset. For the purpose of this note, we will use the palmerpenguins package.
- The package includes the penguins dataset, containing body measurements for penguins on three islands in the Palmer Archipelago.
- We will also use the ggthemes package, which offers a color blind safe color palette. The ggthemes package is an extension of the popular ggplot2 package, designed to provide additional themes, scales, and geoms that enhance the appearance of data visualizations.
- To load the dataset package and the theme package, we run the following:
# dataset
library(palmerpenguins)
# colour theme
library(ggthemes)
R comes with several built-in datasets, which are useful for practice and examples. These datasets are part of the datasets package, which is loaded by default. To list the datasets that are available, you can use the data() function:
data()
The listing is grouped by the package that provides each dataset. Datasets and dataset packages you are likely to meet include:
- penguins: Dataset about penguins from the palmerpenguins package.
- gapminder: Dataset on country statistics from the gapminder package.
- mpg: Fuel economy data from the ggplot2 package.
- diamonds: Diamond prices and attributes from the ggplot2 package.
- economics: US economic time series data from the ggplot2 package.
- datasets: The built-in package, which contains data from a variety of fields.
- dplyr: The datasets that ship with the dplyr package.
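As a small illustration of inspecting what a single package provides, data() also accepts a package argument. The sketch below assumes ggplot2 is installed; the name ggplot2_data is only an illustrative variable.

```r
library(ggplot2)

# List only the datasets that ship with one package (here ggplot2).
ggplot2_data <- data(package = "ggplot2")

# The result stores one row per dataset; the "Item" column holds the names.
ggplot2_data$results[, "Item"]

# Peek at the first rows of one of them.
head(mpg)
```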
1.1 DATA VISUALISATION
Going back to our penguins example: unlike in a Python script, where we have to call the print function to display a variable, in R we can simply type the name of the variable and it will be printed to the console. To view the penguins data, we type its name directly:
penguins
This data frame contains 8 columns. For an alternative view, where you can see all variables and the first few observations of each variable, use glimpse(). Or, if you’re in RStudio, run View(penguins) to open an interactive data viewer.
glimpse(penguins)
View(penguins)
Looking at the 8 columns carefully, we can see that one of the variables is species, which takes one of three values (Adelie, Chinstrap, or Gentoo). Another variable is flipper_length_mm, the length of a penguin’s flipper in millimeters. A third important variable is body_mass_g, the body mass of a penguin in grams.
1.2 STUDY GOAL
There is a relationship between flipper_length_mm and body_mass_g, and the purpose of this section is to create a visualisation of it: a scatterplot of body mass against flipper length, with the points distinguished by species.
Task: Create this plot
Step 1: Loading the data
After loading the tidyverse, we call ggplot() with the data as its first argument, as shown below. This only draws an empty graph, since we have not yet told ggplot2 what to display or how.
ggplot(data = penguins)
Step 2: Identifying the x and y axes
After supplying the data, we map the x and y variables to visual properties (aesthetics). The mapping argument is always defined with the aes() function, and the x and y arguments of aes() specify which variables to map to the x and y axes. For now, we will only map flipper length to the x aesthetic and body mass to the y aesthetic. ggplot2 looks for the mapped variables in the data argument, in this case, penguins.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
With the above code, we define the x and y axes of the graph as flipper_length_mm and body_mass_g respectively.
- To represent our data in the chart once the x and y axes are defined, we need to define a geom. This is the geometrical object that a plot uses to represent data.
- These geometric objects are made available in ggplot2 through functions that start with geom_.
- People often describe plots by the type of geom that the plot uses:
  - bar charts use bar geoms, geom_bar()
  - line charts use line geoms, geom_line()
  - boxplots use boxplot geoms, geom_boxplot()
  - scatterplots use point geoms, geom_point(), and so on…
Step 3: Adding the graph
The function geom_point() adds a layer of points to the plot, which creates a scatterplot. ggplot2 comes with many geom functions, each of which adds a different type of layer to a plot. A geom is added to the plot with the plus sign (+):
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).
The plot shows a relationship that appears to be positive (as flipper length increases, so does body mass), fairly linear (the points are clustered around a line instead of a curve), and moderately strong (there isn’t too much scatter around such a line). Penguins with longer flippers are generally larger in terms of their body mass.
Note the warning message that is displayed along with the plot. It appears because two rows in our data have missing values for flipper length or body mass, so they cannot be drawn. This is something that should be handled during data wrangling. Whenever observations with missing values are dropped from a plot, ggplot2 issues a warning like this one.
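The missing rows behind the warning can be inspected and handled explicitly. The following is a minimal sketch, assuming the tidyverse and palmerpenguins packages are installed; penguins_complete is an illustrative name, and dropping the rows is only one of several reasonable ways to treat missing data.

```r
library(tidyverse)
library(palmerpenguins)

# Show the rows where either plotted variable is missing.
penguins |>
  filter(is.na(flipper_length_mm) | is.na(body_mass_g))

# One option: drop those rows explicitly before plotting.
penguins_complete <- penguins |>
  drop_na(flipper_length_mm, body_mass_g)

ggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Another option: keep the data as it is and silence the warning in the layer.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)
```

Dropping the rows up front makes the decision visible in the code, whereas na.rm = TRUE only hides the warning for that one layer.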
Step 4: Adding Colour to our graph
To add further detail to the graph, we can use colour to distinguish the points in the scatterplot. To do so, we expand the mapping with a color argument that names the column used for grouping. To get one distinct colour per group, that column should be a categorical (nominal) variable.
When a categorical variable is mapped to an aesthetic, ggplot2 automatically assigns a unique value of the aesthetic (here a unique colour) to each unique level of the variable. This process is called scaling, and ggplot2 will also add a legend that explains which values correspond to which levels.
In this example, the species column classifies the penguins into three groups: Adelie, Chinstrap, or Gentoo. Here is the code that maps species to colour:
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point()
Step 5: Adding a smooth curve
We use geom_smooth() to add an additional layer with a smooth line to the graph, setting the method argument to "lm", which stands for linear model. The line drawn is the line of best fit from a linear regression. Here is the code:
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point() +
geom_smooth(method = "lm")
In ggplot2, aesthetic mappings determine how data is visually represented in a plot. When you define these mappings globally using the ggplot() function, they apply to all the layers (geom functions) in your plot. However, you can also specify mappings locally within individual geom functions, which can override or add to the global mappings.
For example, if you want to color points based on species but don't want lines to be differentiated by species, you should only specify the color = species mapping within the geom_point() function. This ensures that the color mapping applies only to the points and not to the lines. By setting color = species locally in geom_point(), you control the appearance of each layer independently while still inheriting any relevant global mappings.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(mapping = aes(color = species)) +
geom_smooth(method = "lm")
Step 6: Using colour and shape to identify the species
It’s generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. Therefore, in addition to color, we can also map species to the shape aesthetic. This is done by passing shape as an additional argument in the mapping.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = "lm")
Note that the legend is automatically updated to reflect the different shapes of the points as well.
Step 7: Making the labels more informative
Finally, we can improve the labels of our plot using the labs() function in a new layer. Some of the arguments to labs() are self-explanatory: title adds a title and subtitle adds a subtitle to the plot. Other arguments match the aesthetic mappings: x is the x-axis label, y is the y-axis label, and color and shape define the label for the legend. In addition, we can improve the color palette to be colorblind safe with the scale_color_colorblind() function from the ggthemes package.
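Putting Step 7 into code, a sketch of the full plot might look as follows. It assumes the tidyverse, palmerpenguins, and ggthemes packages are installed. The title, subtitle, and axis label strings are illustrative choices, and final_plot is an illustrative variable name.

```r
library(tidyverse)
library(palmerpenguins)
library(ggthemes)

final_plot <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo penguins",
    x = "Flipper length (mm)", y = "Body mass (g)",
    # Giving color and shape the same label keeps them in a single legend.
    color = "Species", shape = "Species"
  ) +
  scale_color_colorblind()

# Typing the name of the plot object draws it.
final_plot
```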
1.3 THE ggplot2 PACKAGE CALLING
It is important to know the way ggplot() is called by heart, as it will be used continually when creating graphs. Because the first two arguments of ggplot() are data and mapping, one common shorthand is to omit those argument names, as follows:
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
Explanation:
- Data Frame: The penguins data frame is specified directly as the first argument in the ggplot() function.
- Aesthetic Mappings: aes(x = flipper_length_mm, y = body_mass_g) specifies the aesthetic mappings within the ggplot() function.
- Plot Layer: geom_point() adds a layer to the plot, indicating that points should be plotted using the specified aesthetics.
This approach is straightforward and directly initializes the ggplot object with the data frame and aesthetics.
Another possibility is to use the pipe, |>, which allows us to create the same plot as follows:
penguins |>
ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
Explanation:
- Pipe Operator (|>): This snippet uses the pipe operator (|>) introduced in R 4.1.0. The pipe operator passes the left-hand side (penguins) as the first argument to the right-hand side function (ggplot()).
- Aesthetic Mappings: aes(x = flipper_length_mm, y = body_mass_g) specifies the aesthetic mappings within the ggplot() function, similar to the first snippet.
- Plot Layer: geom_point() adds a layer to the plot, indicating that points should be plotted using the specified aesthetics.
This approach is more in line with the tidyverse style, where the pipe operator is often used to create a sequence of data transformations and visualizations.
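To illustrate the "sequence of data transformations" part, the sketch below filters the data before plotting. It assumes the tidyverse and palmerpenguins packages are installed; restricting the plot to Gentoo penguins is only an example.

```r
library(tidyverse)
library(palmerpenguins)

# Filter first, then hand the result to ggplot() as its data argument.
penguins |>
  filter(species == "Gentoo") |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

Note the two different connectors: |> passes data from one step to the next, while + adds layers to a plot.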
Note that in both snippets aes() is written without mapping = in front of it: ggplot() takes data as its first argument and mapping as its second, so the arguments are matched by position.
1.4 VISUALISING DISTRIBUTIONS
It is important to know which geom_ function to use for numerical and for categorical data.
- To examine the distribution of a categorical variable, we use a bar chart. The height of the bars displays how many observations occurred with each x value.
ggplot(penguins, aes(x = species)) +
  geom_bar()
If we want to reorder the bars from the most to the least frequent species, we need to transform the variable to a factor (which is how R handles categorical data) and then reorder the levels of that factor.
ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar()
The fct_infreq function is part of the forcats package in R, which is a collection of tools for working with categorical variables (factors). The forcats package is included in the tidyverse, a comprehensive suite of R packages designed for data science.
The fct_infreq function is used to reorder factor levels based on their frequency, with the most frequent level first. This can be particularly useful for plotting or for any analysis where you want to consider factor levels in order of their occurrence in the dataset.
Here's a breakdown of how fct_infreq works:
- Input: The function takes a factor or a character vector as input. If a character vector is provided, it is first converted to a factor.
- Operation: It counts the occurrences of each level in the factor.
- Output: It returns a factor with the same levels as the input, but reordered so that the most frequently occurring level comes first, followed by the next most frequent, and so on. Levels that are tied in frequency keep their existing relative order.
- Usage: This function is particularly useful in data visualization. For example, when creating bar plots, you might want to order the bars based on the frequency of the factor levels. Using fct_infreq() to reorder the factor levels before plotting ensures that the bars are displayed in descending order of frequency.
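The reordering described above can be seen on a tiny example. The sketch below assumes the forcats package is installed; the factor sizes is made up for illustration.

```r
library(forcats)

# A small hypothetical factor: "c" occurs 3 times, "b" twice, "a" once.
sizes <- factor(c("a", "b", "b", "c", "c", "c"))

levels(sizes)              # alphabetical by default: "a" "b" "c"
levels(fct_infreq(sizes))  # most frequent first:     "c" "b" "a"
```

Only the order of the levels changes; the values stored in the factor stay the same.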
Aside from categorical data, numerical data is the other common type of data, and its distribution is often graphed using a histogram. This is done with geom_histogram(binwidth = y).
When we use a histogram, we usually set binwidth, which indicates the width of each bin in the units of the x variable.
The function geom_histogram(binwidth = y) is part of the ggplot2 package in R and creates a histogram with a specified bin width. Here's an explanation of each component:
- geom_histogram(): This function creates a histogram, a type of plot that displays the distribution of a numeric variable. The histogram consists of bins (bars) that represent the frequency (count) of data points within specified ranges.
- binwidth = y: This argument specifies the width of each bin (bar) in the histogram. The value y is a numeric value that determines the size of the bins. A larger bin width means fewer, wider bins, while a smaller bin width means more, narrower bins.
How It Works
- Data Range: The range of the data is divided into intervals (bins) of width y.
- Counting Data Points: For each bin, the function counts how many data points fall within that interval.
- Drawing Bins: Each bin is represented as a bar. The height of the bar corresponds to the number of data points in that bin.
Example:
Below are the driving distances (in meters) of eight drives: 23, 78, 130, 147, 156, 177, 184, 213. Here's how to make a histogram of this data:
- Step 1: Decide on the width of each bin. If we go from 0 to 250 using bins with a width of 50, we can fit all of the data in 5 bins, defined as 0 - 49, 50 - 99, and so on. There is no strict rule on how many bins to use; we just avoid using too few or too many bins.
- Step 2: Count how many data points fall in each bin. The driving distance determines the bin.
- Step 3: Scale the x-axis from 0 to 250 using intervals of width 50. Label the x-axis "driving distance (meters)".
- Step 4: Scale the y-axis up to 3, or something just past it, since that will be the highest bar.
- Step 5: Draw a bar for each interval so its height matches the number of drives in that interval.
In ggplot2, only the x aesthetic is mapped; geom_histogram() does the counting and draws the bars. Setting boundary = 0 lines the bin edges up with 0, 50, 100, and so on:
drives <- data.frame(driving_distance = c(23, 78, 130, 147, 156, 177, 184, 213))
ggplot(drives, aes(x = driving_distance)) +
  geom_histogram(binwidth = 50, boundary = 0) +
  labs(x = "driving distance (meters)", y = "number of drives")
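The counting in Step 2 can be checked by hand in base R. This is a minimal sketch using the eight distances from the example; the names driving_distance, bins, and bin_counts are illustrative.

```r
# Driving distances from the example (in meters).
driving_distance <- c(23, 78, 130, 147, 156, 177, 184, 213)

# Cut the range 0 to 250 into bins of width 50.
# right = FALSE makes each bin include its left edge: [0,50), [50,100), ...
bins <- cut(driving_distance, breaks = seq(0, 250, by = 50), right = FALSE)

# Count the values that fall in each bin; these are the bar heights.
bin_counts <- table(bins)
bin_counts
```

The largest count is 3, for the bin from 150 to 200, which is why the y-axis in Step 4 only needs to reach 3.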
To visualise body_mass_g with a histogram:
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that 39 observations have a body_mass_g value between 3,500 and 3,700 grams, which are the left and right edges of the bar.
An alternative visualization for distributions of numerical variables is a density plot. A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous data that comes from an underlying smooth distribution. We use geom_density() for the smooth curve.
ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()
The geom_density() line adds a density layer to the plot. A density plot is a smoothed, continuous version of a histogram, used to show the distribution of a numeric variable. It plots the estimated density on the y-axis against the data values on the x-axis. The total area under the curve is 1, and the area under the curve over a particular range approximates the proportion of observations that fall within that range.
The entire code snippet creates a density plot for the body_mass_g variable of the penguins dataset, allowing us to visually assess the distribution of penguin body masses, such as identifying where the majority of data points fall, spotting any skewness, and observing the presence of multiple modes if any.
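The claim about the area under the curve can be checked numerically. The sketch below uses base R's density() function as a stand-in for what geom_density() draws, which is an assumption about the defaults rather than a statement about ggplot2 internals. It requires the palmerpenguins package; the variable names are illustrative.

```r
library(palmerpenguins)

# Estimate the density of body mass, dropping the missing values first.
body_mass <- na.omit(penguins$body_mass_g)
d <- density(body_mass)

# Approximate the area under the curve with a Riemann sum:
# curve heights times the (constant) spacing of the evaluation grid.
grid_step <- diff(d$x)[1]
total_area <- sum(d$y) * grid_step
total_area  # close to 1
```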
1.4.3 Exercises
Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
ggplot(penguins, aes(y = species)) +
  geom_bar()
How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?
ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")
ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")
Combining the two …
ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red", colour = "black")
1.5 VISUALISING RELATIONSHIPS
To visualize a relationship we need to have at least two variables mapped to aesthetics of a plot. In the following sections you will learn about commonly used plots for visualizing relationships between two or more variables and the geoms used for creating them.
1.5.1 A numerical and a categorical variable
To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots. A boxplot is a type of visual shorthand for measures of position (percentiles) that describe a distribution. It is also useful for identifying potential outliers.
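The percentiles that a boxplot summarises can also be computed directly. The following sketch assumes the tidyverse and palmerpenguins packages are installed; mass_summary and its column names are illustrative, and the values should be read as roughly what the boxes display, not as an exact reproduction of ggplot2's internal calculation.

```r
library(tidyverse)
library(palmerpenguins)

# First quartile, median, and third quartile of body mass for each species.
mass_summary <- penguins |>
  group_by(species) |>
  summarise(
    q1     = quantile(body_mass_g, 0.25, na.rm = TRUE),
    median = median(body_mass_g, na.rm = TRUE),
    q3     = quantile(body_mass_g, 0.75, na.rm = TRUE)
  )

mass_summary
```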
Let’s take a look at the distribution of body mass by species using geom_boxplot():
ggplot(
penguins, aes(x = species, y = body_mass_g)
) +
geom_boxplot()
Alternatively, we can make density plots with geom_density():
ggplot(
penguins, aes(x = body_mass_g, color = species)
) +
geom_density(linewidth = 0.75)
The linewidth = 0.75 argument controls the thickness of the density lines. It can be increased or decreased depending on how visible and thick we want the lines to be.
Besides the thickness of the line, we can also fill the curves so that the overlap of the three distributions is visible. This is achieved by mapping both fill = species and color = species, as follows:
ggplot(
penguins, aes(x = body_mass_g, color = species, fill = species)
) +
geom_density(alpha = 0.4)
The alpha parameter in the geom_density() function of the ggplot2 package in R controls the transparency of the density plot's fill color, on a scale from 0 (fully transparent) to 1 (fully opaque). An alpha value of 0.4 means that the fill color will be semi-transparent. This allows the overlapping areas between density plots (when multiple groups are plotted, as indicated by the color and fill aesthetics set to species) to be visible, making it easier to see where densities overlap and how they differ across groups. Transparency is particularly useful in density plots for distinguishing between multiple distributions that might have similar ranges or where one distribution might be overshadowed by another when plotted on the same graph.
Note the terminology we have used here:
- We map variables to aesthetics if we want the visual attribute represented by that aesthetic to vary based on the values of that variable.
- Otherwise, we set the value of an aesthetic.
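The difference between mapping and setting can be illustrated with two versions of the same scatterplot. This is a sketch assuming the tidyverse and palmerpenguins packages are installed; p_mapped and p_set are illustrative names and "blue" is an arbitrary colour.

```r
library(tidyverse)
library(palmerpenguins)

# Mapping: colour varies with the data, so it goes inside aes().
p_mapped <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))

# Setting: every point gets the same fixed colour, so it goes outside aes().
p_set <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "blue")

p_mapped
p_set
```

Only the mapped version produces a legend, because only there does colour carry information about the data.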
1.5.2 Two categorical variables
We can use stacked bar plots to visualize the relationship between two categorical variables.
A stacked bar chart is a type of bar chart that shows the composition and comparison of a dataset. In a stacked bar chart, each bar represents a total, and the segments within the bar represent different categories that make up that total. Here's how to interpret and understand a stacked bar chart:
Components of a Stacked Bar Chart
- Bars: Each bar represents a total value for a specific category or group.
- Segments: Each segment within a bar represents a sub-category's contribution to the total value.
- Axes:
  - X-Axis: Represents the categories or groups being compared.
  - Y-Axis: Represents the numerical value, often indicating quantity or frequency.
Let's consider a stacked bar chart that shows the sales of different products (Product A, B, and C) across different regions (North, South, East, West).
Region | Products | Sales |
North | Product A | 30 |
North | Product B | 20 |
North | Product C | 10 |
South | Product A | 40 |
South | Product B | 30 |
South | Product C | 20 |
East | Product A | 50 |
East | Product B | 40 |
East | Product C | 30 |
West | Product A | 20 |
West | Product B | 10 |
West | Product C | 5 |
Summary
Region | Total |
North | 60 |
South | 90 |
East | 120 |
West | 35 |
Insights from the Stacked Bar Chart
Total Sales Comparison:
- East has the highest total sales.
- West has the lowest total sales.
Product Contribution:
- In the East region, Product A has the highest contribution to the total sales.
- In the North region, Product C has the smallest contribution.
Regional Performance of Products:
- Product A performs best in the East region.
- Product C has relatively low sales in all regions, especially in the West.
The data can be represented in a stacked bar chart, with one bar per region and one coloured segment per product.
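A stacked bar chart of the sales table can be sketched as follows. The code assumes ggplot2 is installed; the data frame name sales and its column names are illustrative, and geom_col() is used instead of geom_bar() because the table already holds the totals to plot, not raw observations to count.

```r
library(ggplot2)

# Sales figures copied from the table in this section.
sales <- data.frame(
  region  = rep(c("North", "South", "East", "West"), each = 3),
  product = rep(c("Product A", "Product B", "Product C"), times = 4),
  sales   = c(30, 20, 10, 40, 30, 20, 50, 40, 30, 20, 10, 5)
)

# geom_col() uses the y values as bar heights; fill = product stacks
# one coloured segment per product inside each region's bar.
ggplot(sales, aes(x = region, y = sales, fill = product)) +
  geom_col()
```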
Interpretation Steps
- Identify the Bars:
  - Each bar represents the total sales for a specific region (North, South, East, West).
- Identify the Segments:
  - Each segment within a bar represents the sales of a specific product (Product A, B, C) in that region.
  - The different colors (or patterns) distinguish the products.
- Compare the Total Values:
  - Look at the height of each bar to compare the total sales across different regions.
  - For example, if the bar for the East region is the highest, it means the East region has the highest total sales.
- Analyze the Composition:
  - Look at the size of each segment within a bar to see the contribution of each product to the total sales in that region.
  - For example, if the segment for Product A in the North region is larger than the segments for Product B and Product C, it means Product A contributes more to the total sales in the North region.
- Compare Segments Across Bars:
  - Compare the segments for the same product across different bars (regions) to see how the sales of that product vary by region.
  - For example, if the segment for Product B is larger in the South region than in other regions, it means Product B has higher sales in the South region.
Applying what we know about stacked bars to the penguins data, we can display the relationship between island and species, or more specifically, visualise the distribution of species within each island.
The first plot shows the frequencies of each species of penguins on each island. The plot of frequencies shows that there are roughly equal numbers of Adelies on each island. But we don’t have a good sense of the percentage balance within each island.
ggplot(
penguins, aes(x = island, fill = species)
) +
geom_bar()
The second plot is a relative frequency plot, which we create by setting position = "fill" in geom_bar(). This type of plot is better for comparing the distribution of penguin species across islands because it adjusts for the different numbers of penguins on each island. In this plot, we can see:
- All Gentoo penguins live on Biscoe Island, making up about 75% of the penguins there.
- All Chinstrap penguins live on Dream Island, making up about 50% of the penguins there.
- Adelie penguins live on all three islands but are the only penguins on Torgersen Island.
ggplot(
penguins, aes(x = island, fill = species)
) +
geom_bar(position = "fill")
In creating these bar charts, we map the variable that will be separated into bars to the x aesthetic, and the variable that will change the colors inside the bars to the fill aesthetic.
1.5.3 Two Numerical variables
So far you’ve learned about scatterplots (created with geom_point()) and smooth curves (created with geom_smooth()) for visualizing the relationship between two numerical variables. A scatterplot is probably the most commonly used plot for visualizing the relationship between two numerical variables.
ggplot(
penguins, aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point()
1.5.4 Three or more variables
As shown earlier, we can incorporate more variables into a plot by mapping them to additional aesthetics. For example, in the following scatterplot, the colors of points represent species, and the shapes of points represent islands. This adds a further dimension to the interpretation of the data.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island))
As much as aesthetics add value to a graph, adding too many aesthetic mappings to a plot makes it cluttered and difficult to read and interpret. To avoid this, particularly for categorical variables, it is advisable to split the plot into facets, subplots that each display one subset of the data. To facet a plot by a single variable, we use facet_wrap(). The first argument of facet_wrap() is a formula, which you create with ~ followed by a variable name. The variable that you pass to facet_wrap() should be categorical.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = species)) +
facet_wrap(~island)
1.6 SAVING YOUR PLOT
Once you’ve made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere. That’s the job of ggsave(), which will save the plot most recently created to disk:
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
ggsave(filename = "penguin-plot.png")
This will save your plot to your working directory, a concept you’ll learn more about in Chapter 6.
If you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about ggsave() in the documentation.
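A sketch of a more reproducible call is shown below. It assumes the tidyverse and palmerpenguins packages are installed; the 6 by 4 inch size and the name penguin_plot are arbitrary illustrative choices.

```r
library(tidyverse)
library(palmerpenguins)

penguin_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Passing the plot and its size explicitly means the output no longer
# depends on the last plot drawn or on the current plotting device.
ggsave(
  filename = "penguin-plot.png",
  plot = penguin_plot,
  width = 6, height = 4, units = "in", dpi = 300
)
```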
Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in Chapter 28.
