Chapter 1: Data Visualization

1.0 INTRODUCTION

One of the most common ways to make a visualisation in the tidyverse is with ggplot2, and that will be the focus of this chapter. The terms ggplot2 and ggplot refer to different aspects of the same data visualization package in R. Here’s a detailed explanation of each term and their differences:

ggplot2

Definition: ggplot2 is a data visualization package in R, part of the Tidyverse. It is a powerful and flexible system for creating static graphics based on the Grammar of Graphics, which breaks up graphics into semantic components such as scales and layers.
Usage: You load the ggplot2 package in R using library(ggplot2). This makes all the functions and capabilities of the package available for use in your R session.
Scope: The term ggplot2 encompasses the entire package, including functions like ggplot(), geom_point(), geom_line(), and many others.

ggplot

Definition: ggplot is a function within the ggplot2 package. It is the main function used to initialize and create a ggplot object.
Usage: You use the ggplot() function to create a new plot. This function specifies the data frame to use and the aesthetic mappings for the plot.
Scope: The term ggplot refers specifically to this function, which serves as the foundation for building a plot in ggplot2.

Before we can use the tidyverse, we have to load it with the command below. This has to be done in every new R session in which we intend to use the visualisation package. Loading the tidyverse automatically loads ggplot2 and the other core tidyverse packages too.

library(tidyverse)

In addition to the tidyverse, we need a dataset. For the purpose of this note we will use the palmerpenguins package, an add-on package that is installed separately from base R.
The package includes the penguins dataset, which contains body measurements for penguins on three islands in the Palmer Archipelago.
We will also use the ggthemes package, which offers a colorblind-safe color palette. ggthemes is an extension of the popular ggplot2 package that provides additional themes, scales, and geoms to enhance the appearance of data visualizations.
To load both packages, we run the following:

# dataset
library(palmerpenguins)

# colour theme
library(ggthemes)
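
If any of these packages is not yet installed, library() stops with an error. The following is a minimal sketch of a check that can be run first. The variable names (needed, installed, missing_pkgs) are illustrative, and the sketch only prints a suggestion; it does not install anything.

```r
# Packages this chapter relies on
needed <- c("tidyverse", "palmerpenguins", "ggthemes")

# requireNamespace() reports TRUE/FALSE without attaching the package
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that could not be found
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message(
    "Install with: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
}
```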

R comes with several built-in datasets, which are useful for practice and examples. These datasets are part of the datasets package, which is loaded by default. To list the datasets that are available, you can use the data() function:

data()

Data sets in package 'datasets':

AirPassengers           Monthly Airline Passenger Numbers 1949-1960
BJsales                 Sales Data with Leading Indicator
BJsales.lead (BJsales)
                        Sales Data with Leading Indicator
BOD                     Biochemical Oxygen Demand
CO2                     Carbon Dioxide Uptake in Grass Plants
ChickWeight             Weight versus age of chicks on different diets
DNase                   Elisa assay of DNase
EuStockMarkets          Daily Closing Prices of Major European Stock
                        Indices, 1991-1998
Formaldehyde            Determination of Formaldehyde
HairEyeColor            Hair and Eye Color of Statistics Students
Harman23.cor            Harman Example 2.3
Harman74.cor            Harman Example 7.4
Indometh                Pharmacokinetics of Indomethacin
InsectSprays            Effectiveness of Insect Sprays
JohnsonJohnson          Quarterly Earnings per Johnson & Johnson Share
LakeHuron               Level of Lake Huron 1875-1972
LifeCycleSavings        Intercountry Life-Cycle Savings Data
Loblolly                Growth of Loblolly Pine Trees
Nile                    Flow of the River Nile
Orange                  Growth of Orange Trees
OrchardSprays           Potency of Orchard Sprays
...
who2                    World Health Organization TB data
world_bank_pop          Population data from the World Bank

Use 'data(package = .packages(all.available = TRUE))'
to list the data sets in all *available* packages.
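
The same listing can also be captured as an object instead of being printed. This is a small sketch that assumes only base R; the name builtin is illustrative.

```r
# data() returns an object whose $results matrix has one row per dataset
builtin <- data(package = "datasets")$results

# Keep only the dataset name and its one-line description
head(builtin[, c("Item", "Title")])

# Any listed dataset can then be used directly by name, for example:
head(Orange)
```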

The following are some commonly used example datasets in R and the packages that provide them:

palmerpenguins: Dataset about penguins from the palmerpenguins package.
gapminder: Dataset on country statistics from the gapminder package.
mpg: Fuel economy data from the ggplot2 package.
diamonds: Diamond prices and attributes from the ggplot2 package.
economics: US economic time series data from the ggplot2 package.
datasets: The built-in package, which contains a wide variety of example data.
dplyr: Example datasets that ship with the dplyr package.

1.1 DATA VISUALISATION

Going back to our penguins example: unlike a Python script, where we usually have to call print() to display a variable, in R we can simply type the variable name and it is printed to the console. To view the penguins data, we type its name directly:

penguins

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
<fct>	<fct>	<dbl>	<dbl>	<int>	<int>	<fct>	<int>
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007
Adelie	Torgersen	38.9	17.8	181	3625	female	2007
Adelie	Torgersen	39.2	19.6	195	4675	male	2007
Adelie	Torgersen	34.1	18.1	193	3475	NA	2007
Adelie	Torgersen	42.0	20.2	190	4250	NA	2007
Adelie	Torgersen	37.8	17.1	186	3300	NA	2007
Adelie	Torgersen	37.8	17.3	180	3700	NA	2007
Adelie	Torgersen	41.1	17.6	182	3200	female	2007

This data frame contains 8 columns. For an alternative view, where you can see all variables and the first few observations of each variable, use glimpse(). Or, if you’re in RStudio, run View(penguins) to open an interactive data viewer.

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
$ sex               <fct> male, female, female, NA, female, male, female, male~
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

View(penguins)

💡

Note that the “V” in View() is a capital letter.

Looking at the 8 columns carefully, we can see that one of the variables is species, which takes one of three values (Adelie, Chinstrap, or Gentoo). Another is flipper_length_mm, the length of a penguin’s flipper in millimeters. A third important variable is body_mass_g, the body mass of a penguin in grams.
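
These three variables can also be checked directly in the console. This is a short sketch using base R functions on the penguins data:

```r
library(palmerpenguins)

# The three species are stored as the levels of a factor
levels(penguins$species)
#> [1] "Adelie"    "Chinstrap" "Gentoo"

# Numeric summaries of the two measurements used in this chapter
summary(penguins[, c("flipper_length_mm", "body_mass_g")])
```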

1.2 STUDY GOAL

There is a relationship between flipper_length_mm and body_mass_g, and the purpose of this section is to create a visualisation that shows it.

Task: Create a plot of body mass against flipper length, building it up step by step.

Step 1: Loading the data

After loading the tidyverse, we call ggplot() with the data as the first argument. This only produces an empty graph, since we have not yet told ggplot2 what to draw:

ggplot(data = penguins)

Step 2: Identifying the $x$ and $y$ axes

After loading our data, we map the $x$ and $y$ variables to the visual properties (aesthetics) of the plot. The mapping argument is always defined with the aes() function, and the x and y arguments of aes() specify which variables to map to the x and y axes. For now, we will only map flipper length to the x aesthetic and body mass to the y aesthetic. ggplot2 looks for the mapped variables in the data argument, in this case, penguins.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

With the above code, we define the $x$ and $y$ axes of the graph as flipper_length_mm and body_mass_g respectively. The plot now has labelled axes but still shows no data, because we have not yet said how the observations should be drawn.

To represent our data in the chart once the $x$ and $y$ axes are defined, we need to define a geom. This is the geometrical object that a plot uses to represent data.
These geometric objects are made available in ggplot2 through functions that start with geom_.
People often describe plots by the type of geom that the plot uses:

bar charts use bar geoms (geom_bar())
line charts use line geoms (geom_line())
boxplots use boxplot geoms (geom_boxplot())
scatterplots use point geoms (geom_point()), and so on…

Step 3: Adding the graph

The function geom_point() adds a layer of points to the plot, which creates a scatterplot. ggplot2 comes with many geom functions, each of which adds a different type of layer to a plot. A geom is added to the plot with the plus sign (+):

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).

The plot shows a relationship that appears to be positive (as flipper length increases, so does body mass), fairly linear (the points are clustered around a line instead of a curve), and moderately strong (there isn’t too much scatter around such a line). Penguins with longer flippers are generally larger in terms of their body mass.

Note that a warning message is displayed along with the plot. This happens because two rows in our data have missing values for the mapped variables, so ggplot2 cannot place them on the plot. Missing values like these are normally dealt with during data wrangling. Whenever an observation is missing a value needed for the plot, ggplot2 drops it and issues a warning.
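
The two rows behind the warning can be located before plotting. This is a minimal sketch in base R; the name is_missing is illustrative.

```r
library(palmerpenguins)

# Rows where either of the two mapped variables is missing
is_missing <- is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g)
sum(is_missing)
#> [1] 2

# Inspect the affected rows before deciding how to handle them
penguins[is_missing, ]
```

If the missing rows are expected, geom_point(na.rm = TRUE) drops them without printing the warning.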

Step 4: Adding Colour to our graph

To add further detail to the graph, we can colour each point according to the group it belongs to. We do this by expanding the mapping to include the color aesthetic, set to the column we want to group by. To get a distinct colour per group, the column should be a categorical (nominal) variable; a numerical column would instead be shown as a continuous colour gradient.

When a categorical variable is mapped to an aesthetic, ggplot2 automatically assigns a unique value of the aesthetic (here, a unique colour) to each level of the variable. This process is called scaling, and ggplot2 also adds a legend that explains which values correspond to which levels.

In this example, the species column classifies each penguin as one of three species: Adelie, Chinstrap, or Gentoo. Here is the code that maps species to colour:

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()

Step 5: Adding Smooth curve

We use geom_smooth() to add an additional layer, a smooth line, to the graph. Setting the method argument to "lm" (linear model) draws the line of best fit from a linear regression. Because color = species is mapped globally here, a separate line is drawn for each species. Here is the code:

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

In ggplot2, aesthetic mappings determine how data is visually represented in a plot. When you define these mappings globally using the ggplot() function, they apply to all the layers (geom functions) in your plot. However, you can also specify mappings locally within individual geom functions, which can override or add to the global mappings.

For example, if you want to color points based on species but don't want lines to be differentiated by species, you should only specify the color = species mapping within the geom_point() function. This ensures that the color mapping applies only to the points and not to the lines. By setting color = species locally in geom_point(), you control the appearance of each layer independently while still inheriting any relevant global mappings.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
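
The single line drawn by geom_smooth(method = "lm") in this plot is a straight-line fit of y on x, so the same fit can be inspected numerically with lm(). This is a sketch for illustration; the name fit is arbitrary.

```r
library(palmerpenguins)

# Least-squares fit of body mass on flipper length
# (lm() drops rows with missing values by default)
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Intercept and slope; the slope is the change in grams per extra mm of flipper
coef(fit)
```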

Step 6: Using shape as well as colour to identify the species

It’s generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. Therefore, in addition to color, we can also map species to the shape aesthetic. This means adding shape as an additional argument in our mapping.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm")

Note that the legend is automatically updated to reflect the different shapes of the points as well.

Step 7: Changing the labels to be more informative

And finally, we can improve the labels of our plot using the labs() function in a new layer. Some of the arguments to labs() might be self-explanatory: title adds a title and subtitle adds a subtitle to the plot. Other arguments match the aesthetic mappings: x is the x-axis label, y is the y-axis label, and color and shape define the label for the legend. In addition, we can improve the color palette to be colorblind safe with the scale_color_colorblind() function from the ggthemes package.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", y = "Body mass (g)",
    color = "Species", shape = "Species"
  ) +
  scale_color_colorblind()

1.3 THE `ggplot2` PACKAGE CALLING

It is important to know the ggplot() calling pattern by heart, as it will be used continually when creating graphs. The first two arguments of ggplot() are data and mapping, so a shorter way of calling it is to supply them by position and omit the argument names, as follows:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point()

Explanation:

Data Frame: The penguins data frame is specified directly as the first argument in the ggplot() function.
Aesthetic Mappings: The aes(x = flipper_length_mm, y = body_mass_g) specifies the aesthetic mappings within the ggplot() function.
Plot Layer: geom_point() adds a layer to the plot, indicating that points should be plotted using the specified aesthetics.

This approach is straightforward and directly initializes the ggplot object with the data frame and aesthetics.

Another possibility is to use the pipe, |>, which allows us to create a plot as follows:

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point()

Explanation:

Pipe Operator (|>): This snippet uses the pipe operator (|>) introduced in R 4.1.0. The pipe operator passes the left-hand side (penguins) as the first argument to the right-hand side function (ggplot()).
Aesthetic Mappings: The aes(x = flipper_length_mm, y = body_mass_g) specifies the aesthetic mappings within the ggplot() function, similar to the first snippet.
Plot Layer: geom_point() adds a layer to the plot, indicating that points should be plotted using the specified aesthetics.

This approach is more in line with the tidyverse style, where the pipe operator is often used to create a sequence of data transformations and visualizations.

💡

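Note that aes() is written without mapping = in front of it. Because the data frame fills the first argument of ggplot(), aes() is matched to the mapping argument by position.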
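
The behaviour of the pipe can be seen in isolation with a tiny base R example (R 4.1.0 or later). The vector x is made up purely for illustration.

```r
x <- c(1, 4, 9)

# Nested call: read from the inside out
sum(sqrt(x))
#> [1] 6

# Piped call: read from left to right; x becomes the first argument of sqrt()
x |> sqrt() |> sum()
#> [1] 6
```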

1.4 VISUALISING DISTRIBUTIONS

It is important to know which geom_ function to use for numerical and for categorical data.

To examine the distribution of a categorical variable, we use a bar chart. The height of each bar displays how many observations occurred with each x value.

ggplot(penguins, aes(x = species)) +
  geom_bar()

If we want to reorder the bars by species count from high to low, we need the variable to be a factor (which is how R handles categorical data) and then reorder the levels of that factor.

ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar()

The fct_infreq function is part of the forcats package in R, which is a collection of tools for working with categorical variables (factors). The forcats package is included in the tidyverse, a comprehensive suite of R packages designed for data science.

The fct_infreq function is used to reorder factor levels based on their frequency, with the most frequent level first. This can be particularly useful for plotting or for any analysis where you want to consider factor levels in order of their occurrence in the dataset.

Here's a breakdown of how fct_infreq works:

Input: The function takes a factor or a character vector as input. If a character vector is provided, it is first converted to a factor.
Operation: It counts the occurrences of each level in the factor.
Output: It returns a factor with the same levels as the input, but reordered so that the most frequently occurring level comes first, followed by the next most frequent, and so on. Levels with tied counts should keep their existing relative order.
Usage: This function is particularly useful in data visualization. For example, when creating bar plots, you might want to order the bars based on the frequency of the factor levels. Using fct_infreq to reorder the factor levels before plotting ensures that the bars are displayed in descending order of frequency.
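
The breakdown above can be tried on a small factor. This is a minimal sketch; the fruit vector is made-up example data, not part of the penguins dataset.

```r
library(forcats)

fruit <- factor(c("apple", "pear", "pear", "kiwi", "pear", "apple"))

# Default level order is alphabetical
levels(fruit)
#> [1] "apple" "kiwi"  "pear"

# fct_infreq() reorders by count: pear (3), apple (2), kiwi (1)
levels(fct_infreq(fruit))
#> [1] "pear"  "apple" "kiwi"
```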

Besides categorical data, numerical data is the other most common type of data, and its distribution is often graphed using a histogram. This is done with geom_histogram(binwidth = y).

To use a histogram we supply binwidth, which indicates the width of each bin in the units of the x variable.

The function geom_histogram(binwidth = y) is a part of the ggplot2 package in R, used to create histograms with a specified bin width. Here's an explanation of each component:

geom_histogram(): This function creates a histogram, a type of plot that displays the distribution of a numeric variable. The histogram consists of bins (bars) that represent the frequency (count) of data points within specified ranges.
binwidth = y: This argument specifies the width of each bin (bar) in the histogram. The value y is a numeric value that determines the size of the bins. A larger bin width means fewer, wider bins, while a smaller bin width means more, narrower bins.

How It Works

Data Range: The range of the data is divided into intervals (bins) of width y.
Counting Data Points: For each bin, the function counts how many data points fall within that interval.
Drawing Bins: Each bin is represented as a bar. The height of the bar corresponds to the number of data points in that bin.

Example:

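Below are eight driving distances (in meters): 23, 78, 130, 147, 156, 177, 184, 213. Here's how to make a histogram of this data:

Step 1: Decide on the width of each bin. If we go from 0 to 250 using bins with a width of 50, we can fit all of the data in 5 bins, defined as 0 - 49, 50 - 99, and so on up to 200 - 249. There is no strict rule on how many bins to use; we just avoid using too few or too many.

# Driving distances in meters, stored in a one-column data frame
drives <- data.frame(driving_distance = c(23, 78, 130, 147, 156, 177, 184, 213))

# Bins of width 50 with an edge at 0; geom_histogram() computes the counts (y) itself
ggplot(drives, aes(x = driving_distance)) +
  geom_histogram(binwidth = 50, boundary = 0) +
  labs(x = "Driving distance (meters)", y = "Number of drives")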

Step 2: Count how many data points fall in each bin. Each bin covers a range of driving distances.

Step 3: Scale the $x$-axis from 0 to 250 using intervals of width 50. Label the $x$-axis "driving distance (meters)".

Step 4: Scale the $y$-axis up to 3, or something just past it, since that will be the highest bar.

Step 5: Draw a bar for each interval so its height matches the number of drives in that interval.
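
The counting in Step 2 can be checked by hand in base R. This is a sketch under the binning described in Step 1; the names distances and bins are illustrative.

```r
distances <- c(23, 78, 130, 147, 156, 177, 184, 213)

# Bin edges at 0, 50, ..., 250; right = FALSE makes each bin include its left edge
bins <- cut(distances, breaks = seq(0, 250, by = 50), right = FALSE)

# Number of drives per bin: 1, 1, 2, 3, 1 (so the tallest bar has height 3)
table(bins)
```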

📚Code

To visualise body_mass_g with a histogram, the code is:

ggplot(penguins, aes(x= body_mass_g)) +
    geom_histogram(binwidth = 200)

A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that 39 observations have a body_mass_g value between 3,500 and 3,700 grams, which are the left and right edges of the bar.
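
The effect of the bin width is easiest to see by trying extreme values. In this sketch the widths 20 and 2000 are arbitrary choices for illustration, and p_narrow and p_wide are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Very narrow bins: many bars, most holding only a few penguins
p_narrow <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 20)

# Very wide bins: only a handful of bars, so the shape of the distribution is lost
p_wide <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2000)

p_narrow
p_wide
```

It usually takes a few tries to find a bin width that reveals the shape of the distribution without too much noise.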

An alternative visualization for distributions of numerical variables is a density plot. A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous data that comes from an underlying smooth distribution. We use geom_density() for the smooth curve.

ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()

geom_density(): This line adds a density layer to the plot. A density plot is a smoothed, continuous version of a histogram, used to show the distribution of a numeric variable. It plots the estimated density on the y-axis against the data values on the x-axis. The total area under the curve is 1, and the area under the curve over a particular range represents the probability of observing a value within that range.

The entire code snippet creates a density plot for the body_mass_g variable of the penguins dataset, allowing us to visually assess the distribution of penguin body masses, such as identifying where the majority of data points fall, spotting any skewness, and observing the presence of multiple modes if any.
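
The area interpretation can be checked numerically with base R's density() function, which is assumed here to give an estimate comparable to the curve drawn by geom_density(). The range 3500 to 4500 grams is an arbitrary choice for illustration.

```r
library(palmerpenguins)

# Kernel density estimate of body mass, ignoring missing values
d <- density(penguins$body_mass_g, na.rm = TRUE)

# Trapezoid rule: approximate area under the whole curve (close to 1)
area_total <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Area under the curve between 3500 g and 4500 g
in_range <- d$x >= 3500 & d$x <= 4500
area_range <- sum(
  diff(d$x[in_range]) * (head(d$y[in_range], -1) + tail(d$y[in_range], -1)) / 2
)

round(c(total = area_total, between_3500_and_4500 = area_range), 3)
```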

1.4.3 Exercises

Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

ggplot(penguins, aes(y=species))+
    geom_bar()

How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")

Combining the two …

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red", colour = "black")

1.5 VISUALISING RELATIONSHIPS

To visualize a relationship we need to have at least two variables mapped to aesthetics of a plot. In the following sections you will learn about commonly used plots for visualizing relationships between two or more variables and the geoms used for creating them.

1.5.1 A numerical and a categorical variable

To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots. A boxplot is a type of visual shorthand for measures of position (percentiles) that describe a distribution. It is also useful for identifying potential outliers.

Let’s take a look at the distribution of body mass by species using geom_boxplot():

ggplot(
       penguins, aes(x = species, y = body_mass_g)
       ) +
  geom_boxplot()
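
The percentiles that a boxplot summarises can also be computed directly. This is a sketch in base R; the values should closely match the box edges and the middle line drawn for each species.

```r
library(palmerpenguins)

# 25th, 50th (median) and 75th percentiles of body mass for each species
quartiles <- tapply(
  penguins$body_mass_g,
  penguins$species,
  quantile,
  probs = c(0.25, 0.5, 0.75),
  na.rm = TRUE
)
quartiles
```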

Alternatively, we can make density plots with geom_density()

ggplot(
       penguins, aes(x = body_mass_g, color = species)
       ) +
  geom_density(linewidth = 0.75)

The linewidth = 0.75 argument controls the thickness of the density lines. It accepts any positive number, so it can be increased or decreased depending on how visible and thick we want the lines to be.

Besides the thickness of the line, we can also fill and colour the curves so that the overlap between the three distributions is visible. This is achieved by mapping both color = species and fill = species, as follows:

ggplot(
       penguins, aes(x = body_mass_g, color = species, fill = species)
       ) +
  geom_density(alpha = 0.4)

The alpha parameter in the geom_density() function of the ggplot2 package in R controls the transparency of the density plot's fill color. It ranges from 0 (completely transparent) to 1 (completely opaque), so the value of 0.4 used here makes the fill semi-transparent. This allows the overlapping areas between the density plots (one per species, as set by the color and fill aesthetics) to remain visible, making it easier to see where densities overlap and how they differ across groups. Transparency is particularly useful in density plots for distinguishing between multiple distributions that might have similar ranges, or where one distribution might be hidden behind another when plotted on the same graph.

Note the terminology we have used here:

We map variables to aesthetics if we want the visual attribute represented by that aesthetic to vary based on the values of that variable.
Otherwise, we set the value of an aesthetic.
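
The distinction between mapping and setting can be sketched with two versions of the same scatterplot. The fixed color "blue" is an arbitrary choice, and p_mapped and p_set are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping: color varies with the data, so it goes inside aes()
p_mapped <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))

# Setting: every point gets the same fixed color, so it goes outside aes()
p_set <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "blue")

p_mapped
p_set
```

Only the mapped version gets a legend, because only there does the color carry information about the data.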


1.5.2 Two categorical variables

We can use stacked bar plots to visualize the relationship between two categorical variables.

A stacked bar chart is a type of bar chart that shows the composition and comparison of a dataset. In a stacked bar chart, each bar represents a total, and the segments within the bar represent different categories that make up that total. Here's how to interpret and understand a stacked bar chart:

Components of a Stacked Bar Chart

Bars: Each bar represents a total value for a specific category or group.
Segments: Each segment within a bar represents a sub-category's contribution to the total value.
Axes:

X-Axis: Represents the categories or groups being compared.
Y-Axis: Represents the numerical value, often indicating quantity or frequency.

Let's consider a stacked bar chart that shows the sales of different products (Product A, B, and C) across different regions (North, South, East, West).

Region	Products	Sales
North	Product A	30
North	Product B	20
North	Product C	10
South	Product A	40
South	Product B	30
South	Product C	20
East	Product A	50
East	Product B	40
East	Product C	30
West	Product A	20
West	Product B	10
West	Product C	5

Summary

Region	Total
North	60
South	90
East	120
West	35

Insights from the Stacked Bar Chart

Total Sales Comparison:

East has the highest total sales.
West has the lowest total sales.

Product Contribution:

In the East region, Product A has the highest contribution to the total sales.
In the North region, Product C has the smallest contribution.

Regional Performance of Products:

Product A performs best in the East region.
Product C has relatively low sales in all regions, especially in the West.

The data can be represented in a stacked bar chart as follows:

# Load necessary library
library(ggplot2)

# Sample data
sales_data <- data.frame(
  region = rep(c("North", "South", "East", "West"), each = 3),
  product = rep(c("Product A", "Product B", "Product C"), 4),
  sales = c(30, 20, 10, 40, 30, 20, 50, 40, 30, 20, 10, 5)
)

# Create the stacked bar chart
ggplot(
       sales_data, aes(x = region, y = sales, fill = product)
       ) +
  geom_bar(stat = "identity") +
  labs(
    title = "Sales by Product and Region",
    x = "Region",
    y = "Sales",
    fill = "Product"
  ) +
  theme_minimal()
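
The totals in the Summary table, and each product's share of its region, can be recomputed from the same data. This is a sketch in base R; the names totals and share are illustrative.

```r
# Same sales figures as in the table above
sales_data <- data.frame(
  region = rep(c("North", "South", "East", "West"), each = 3),
  product = rep(c("Product A", "Product B", "Product C"), 4),
  sales = c(30, 20, 10, 40, 30, 20, 50, 40, 30, 20, 10, 5)
)

# Total sales per region (the full bar heights): East 120, North 60, South 90, West 35
totals <- tapply(sales_data$sales, sales_data$region, sum)
totals

# Each product's share of its region's total (the relative segment sizes)
sales_data$share <- sales_data$sales / as.vector(totals[sales_data$region])
sales_data
```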

Interpretation Steps

Identify the Bars:

Each bar represents the total sales for a specific region (North, South, East, West).

Identify the Segments:

Each segment within a bar represents the sales of a specific product (Product A, B, C) in that region.
The different colors (or patterns) distinguish the products.

Compare the Total Values:

Look at the height of each bar to compare the total sales across different regions.
For example, if the bar for the East region is the highest, it means the East region has the highest total sales.

Analyze the Composition:

Look at the size of each segment within a bar to see the contribution of each product to the total sales in that region.
For example, if the segment for Product A in the North region is larger than the segments for Product B and Product C, it means Product A contributes more to the total sales in the North region.

Compare Segments Across Bars:

Compare the segments for the same product across different bars (regions) to see how the sales of that product vary by region.
For example, if the segment for Product B is larger in the South region than in other regions, it means Product B has higher sales in the South region.

Applying what we know about stacked bars to the penguins data, we can display the relationship between island and species or, more specifically, visualise the distribution of species within each island.

The first plot shows the frequencies of each species of penguins on each island. It shows that there are roughly equal numbers of Adelies on each island, but it does not give a good sense of the percentage balance within each island.

ggplot(
       penguins, aes(x = island, fill = species)
       ) +
  geom_bar()

The second plot is a relative frequency plot, which we create by setting position = "fill" in geom_bar(). This type of plot is better for comparing the distribution of penguin species across islands because it adjusts for the different numbers of penguins on each island. In this plot, we can see:

All Gentoo penguins live on Biscoe Island, making up about 75% of the penguins there.
All Chinstrap penguins live on Dream Island, making up about 50% of the penguins there.
Adelie penguins live on all three islands but are the only penguins on Torgersen Island.

ggplot(
      penguins, aes(x = island, fill = species)
      ) +
  geom_bar(position = "fill")
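
The numbers behind both bar plots can be tabulated directly. This is a sketch in base R; counts and shares are illustrative names.

```r
library(palmerpenguins)

# Counts of each species on each island (what the stacked plot shows)
counts <- table(penguins$island, penguins$species)
counts

# Row-wise proportions (what position = "fill" shows): each island sums to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```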

💡

In creating these bar charts, we map the variable that will be separated into bars to the x aesthetic, and the variable that will change the colors inside the bars to the fill aesthetic.

1.5.3 Two Numerical variables

So far you’ve learned about scatterplots (created with geom_point()) and smooth curves (created with geom_smooth()) for visualizing the relationship between two numerical variables. A scatterplot is probably the most commonly used plot for visualizing the relationship between two numerical variables.

ggplot(
    penguins, aes(x = flipper_length_mm, y = body_mass_g)
    ) +
  geom_point()

1.5.4 Three or more variables

As shown earlier, we can incorporate more variables into a plot by mapping them to additional aesthetics. For example, in the following scatterplot, the colors of points represent species and the shapes of points represent islands. This adds a further dimension to the interpretation of the data.

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island))

Although aesthetics add value to a graph, adding too many aesthetic mappings to a plot makes it cluttered and difficult to read and interpret. Instead of combining too many mappings, particularly for categorical variables, it is advisable to split the plot into subplots, called facets, that each display one subset of the data. To facet a plot by a single variable, we use facet_wrap(). The first argument of facet_wrap() is a formula, which you create with ~ followed by a variable name. The variable that you pass to facet_wrap() should be categorical.

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island)

1.6 SAVING YOUR PLOT

Once you’ve made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere. That’s the job of ggsave(), which will save the plot most recently created to disk:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
ggsave(filename = "penguin-plot.png")

This will save your plot to your working directory, a concept you’ll learn more about in Chapter 6.

If you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about ggsave() in the documentation.
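
A version with explicit dimensions might look like the following sketch. The size (6 by 4 inches), the resolution, and the use of a temporary directory are assumptions made for illustration; p and out_file are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Store the plot in a variable so it can be passed to ggsave() explicitly
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Write to a temporary directory here; use a path in your project in practice
out_file <- file.path(tempdir(), "penguin-plot.png")

# Fixed width and height make the saved image the same size on every run
ggsave(filename = out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)
```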

Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in Chapter 28.