3 Grammar of graphics & Visualizing data

This chapter serves as reading material for Meeting 3 of your Introduction to R programming classes. We strongly advise you to walk through all steps described BEFORE the meeting takes place.

In this chapter you can find information, tips and tricks for creating basic data visualizations using R. All visualizations are created using the R package ggplot2, which is included in the tidyverse package. In addition, you can find some information about add-on packages such as patchwork, which allows you to arrange multiple plots as you like.

3.1 Brief comment on visualizations

Visualizations mainly serve the goal to identify patterns and exceptions in your data (Tukey) and to aid analytical thinking in terms of understanding causality, to make (multivariate) comparisons and to examine (the credibility of) relevant evidence and conclusions (Tufte). Good visualizations should be informative about your data, tell a clear story, serve a specific goal and have a visual form fitting the story. In other words, a powerful visualization should:

show the data
induce the viewer to think about the substance
avoid distorting what the data says
present many numbers in a small space
make large datasets coherent
encourage comparison between data
reveal the data at several levels of detail
have a clear purpose: description, exploration, tabulation or decoration
be closely integrated with the statistical and verbal descriptions of the data

To clarify, let’s take a look at an example:

Both visualizations clearly display data about the average transaction size for different payment types. However, while a pie chart technically is a “transformed” bar chart, in a pie chart values are supposed to add up to 100%. You may ask why. The answer is much more straightforward, and way less scientific, than you might think. Remember that you want to tell a story with your visualizations. When using a pie chart, you typically have an actual pie in mind. And to have a nice round shape, the pie needs to be complete (i.e., 100% intact).

Back to the example: In the pie chart you see a number of different values, in absolute numbers. Theoretically, these numbers could be infinitely high, and therefore do not add up to 100%. In the bar chart, on the other hand, you simply have one bar for each payment type that is registered in the data. An individual bar can be, just like a skyscraper, infinitely high. At least when you briefly ignore physics. In conclusion, when you want to compare different categories (or people, or companies, and so on), and these do not add up to a value that can be considered to be “all of it” (i.e., 100%), then you should use a bar chart, rather than a pie chart.

Note

In scientific contexts please never use pie charts.

3.2 Grammar of graphics & ggplot2

Before going into detail, have a look at these publications for more in-depth information about graphics:

Grammar of Graphics (Wilkinson, 2005)
Layered Grammar of Graphics (Wickham, 2010)

According to the grammar of graphics, graphics are composed of data, variables, scales, statistics, geometry, aesthetics, facets and guides. The R package ggplot2 is based on this idea, which is what you will learn about below.

Using ggplot you can slowly build up your plot. The basis for a graphic here is your data and the variables in it. Ideally, you have a clear picture of what variables you want to visualize, and how you would like to do that.

Once your data is prepared, you need to decide on the aesthetics and scales of your plot. For instance, if you want to display the variable Mathscore on the x-axis (aesthetic), what range of values (scale) do you want to do that on? ggplot defaults to the observed range of values.

However, simply defining the data, the variables, the aesthetics and scales does not provide you with a visualization yet. You also have to indicate statistics and geometry. For instance, you may want to visualize how often each possible value of Mathscore appears in your data (the statistic, here counts). The geometry then defines how exactly you display these counts. In Figure 3.6, for instance, the histogram displays the counts using bars for ranges of values. On top of displaying the counts for your overall data set, you may want to display these counts for different groups of people separately. This is what facets are for: they allow you to display the histograms for each Educationlevel separately.

Last, there is something called guides. Essentially, guides are just legends displayed next to your plot. If in your aesthetics you defined that a grouping variable should result in differently colored bars for different levels of education, ggplot will automatically display the legend next to your plot. These guides are highly customizable, but often the default suffices.
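To make the components named above more concrete, here is a minimal sketch that marks where each of them shows up in a ggplot2 call. It uses simulated data, and the column names score and level are made up for illustration; they are not part of the course dataset.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical columns, not the course dataset)
set.seed(1)
sim <- data.frame(
  score = rnorm(300, mean = 500, sd = 80),
  level = factor(sample(c("Low", "Middle", "High"), 300, replace = TRUE),
                 levels = c("Low", "Middle", "High"))
)

p <- ggplot(data = sim,                                   # data
            aes(x = score, fill = level)) +               # variables mapped to aesthetics
  geom_histogram(binwidth = 25) +                         # geometry (bars) + statistic (binned counts)
  scale_x_continuous(name = "Score",
                     breaks = seq(200, 800, by = 100)) +  # scale
  facet_wrap(~ level) +                                   # facets
  guides(fill = guide_legend(title = "Level"))            # guide (legend)
p
```

You do not need to understand every function yet; each of them is introduced step by step in the sections below.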

3.2.1 Data preparation

Before you learn about creating visualizations in R, let’s prepare the data you work with. Load the tidyverse and the same data that was used in Chapter 2, and perform the data wrangling steps listed below: Select the listed variables, rename them, and add labels to the three categorical variables.

# Load the tidyverse
library(tidyverse)

# Load the dataset
data <- read_csv("OccupationalExpectations.csv")

# Some data wrangling
data <- data %>%
  select(GENDER, EDUCATIONTRACK, GRADEREPEAT,
         MOTIVATION, MATHSCORE, AGE_arrival_NL, EMOSUPS) %>%
  rename(Gender = GENDER, 
         Educationtrack = EDUCATIONTRACK,
         Graderepeat = GRADEREPEAT,
         Motivation = MOTIVATION, 
         Mathscore = MATHSCORE, 
         Age_arrival_NL = AGE_arrival_NL,
         EmoSups = EMOSUPS) %>%
  mutate(Gender = factor(Gender, 
                         labels = c("Female", "Male")),
         Educationtrack = factor(Educationtrack,
                                 labels = c("PRO", "VMBO-bb", "VMBO-kb", 
                                            "VMBO-tl", "HAVO", "VWO")),
         Graderepeat = factor(Graderepeat,
                              labels = c("No", "Yes")))
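If you wonder how factor() decides which label belongs to which value, the following sketch may help. The numeric codes are hypothetical (the actual coding in the data file may differ), and the explicit levels argument is an optional extra that the code above does not use.

```r
# Hypothetical numeric codes as they might appear in a raw data file
gender_code <- c(2, 1, 2, 1, 1)

# Without `levels`, the labels are matched to the sorted unique values:
# the smallest code gets the first label, the next code the second label
gender_a <- factor(gender_code, labels = c("Female", "Male"))

# Stating the levels explicitly documents the mapping and still works
# when one of the codes happens to be absent from the data
gender_b <- factor(gender_code, levels = c(1, 2), labels = c("Female", "Male"))

table(gender_b)
```

Both versions give the same result here. The explicit version is simply a bit more defensive, so always check the codebook of your data before assigning labels.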

3.2.2 Visualizing data using ggplot2

3.2.2.1 Empty canvas & data

As mentioned above, ggplot2 follows the idea of the grammar of graphics. If you specify the data only, you get the following empty graphic:

# Tell ggplot what dataset to use
data %>%
  ggplot()

This graphic is empty, because you did not yet specify which variables you want to display.

3.2.2.2 Add variable (aesthetic)

You can specify the variable(s) using the aesthetics (aes()) function of ggplot:

# Tell ggplot what dataset to use
data %>% 
  # and what variable to display on what axis
  ggplot(aes(x = Mathscore))

There are still no data points displayed. However, as you can see, the variable Mathscore is now shown on the x-axis.

3.2.2.3 Add geometry (& statistics)

As a next step, ggplot requires you to define the geometry. While previously you also saw the term statistics, ggplot takes care of this in one step: For instance, when you use the geom_histogram() function, the package automatically defines the geometry (here bars of a certain width) and the statistics (for a histogram these are counts).

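But before you go ahead, look closely at the code below. The pipe operator still passes the data into ggplot(), but from there on you are building a plot. The components of the plot are therefore NOT chained with the pipe operator; you add them with a simple + instead.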

# Tell ggplot what dataset to use
data %>% 
  # and what variable to display on what axis
  ggplot(aes(x = Mathscore)) +
  # add the geometry (includes statistic)
  geom_histogram()
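When you run geom_histogram() without arguments, ggplot2 uses 30 bins and prints a message suggesting that you pick a better value yourself. A minimal sketch of the two usual ways to do that, using simulated scores (the column score is a made-up stand-in for Mathscore):

```r
library(ggplot2)

# Simulated stand-in for a score variable (hypothetical data)
set.seed(2)
sim <- data.frame(score = rnorm(500, mean = 500, sd = 90))

# Default: 30 bins, accompanied by a message
p_default <- ggplot(sim, aes(x = score)) +
  geom_histogram()

# Option 1: fix the width of each bin (in units of the variable)
p_binwidth <- ggplot(sim, aes(x = score)) +
  geom_histogram(binwidth = 20)

# Option 2: fix the number of bins instead
p_bins <- ggplot(sim, aes(x = score)) +
  geom_histogram(bins = 15)

p_binwidth
```

Which value works best depends on your data, so it is worth trying a few and checking whether the overall shape of the distribution stays the same.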

3.2.2.4 Finetune

Now the basic histogram is finished! But for use in your reports, you might want to change the title of the plot, or the labels on axes. ggplot offers functions for this, such as below. All these refinements are also added using the +, as they are additions to the graphic you produced.

# Tell ggplot what dataset to use
data %>% 
  # and what variable to display on what axis
  ggplot(aes(x = Mathscore)) +
  # add the geometry (includes statistic)
  geom_histogram() +
  # change the label for the x-axis
  scale_x_continuous(name = "Mathematics score") +
  # change the title of the plot
  labs(title = "Distribution of Mathematics score")
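Because everything is added with +, you can also store a plot in an object and add refinements later. The sketch below illustrates this with simulated data (the column score is hypothetical) and sets the title and both axis labels in a single labs() call, as an alternative to scale_x_continuous(name = ...).

```r
library(ggplot2)

# Simulated stand-in data (hypothetical)
set.seed(3)
sim <- data.frame(score = rnorm(200, mean = 500, sd = 90))

# Store the basic plot in an object
base_plot <- ggplot(sim, aes(x = score)) +
  geom_histogram(binwidth = 25)

# Add the labels later; base_plot itself stays unchanged
labelled <- base_plot +
  labs(title = "Distribution of score",
       x = "Score",
       y = "Number of students")

labelled
```

This can be handy when you want several versions of the same plot, for instance one for exploration and one polished version for a report.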

3.2.2.5 Different geometry

The nice thing about ggplot is that the geom_ functions, such as the previously used geom_histogram(), take care of the most difficult part of creating visualizations: performing all the calculations to come up with the statistic you want. For instance, if you want to display the (smoothed) distribution curve of your data, you can simply replace geom_histogram() with geom_density():

data %>% 
  ggplot(aes(x = Mathscore)) +
  geom_density() +
  scale_x_continuous(name = "Mathematics score") +
  labs(title = "Distribution of Mathematics score")

Of course, not all options for geom_ are applicable for every type of data or plot. Take a look at the Cheatsheets for ggplot to see a variety of options. The R Graph Gallery or the BMS Visualizations tool can also be useful.

3.2.2.6 Multiple layers

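A last nice-to-know thing about ggplot is that you can add multiple layers of geometries and statistics, simply by adding them to the code above. When you do this, however, you need to pay attention to the statistic of the second layer: it needs to be on the same scale as the previous layer’s statistic. Here, the density curve is drawn on the density scale, whereas a histogram displays counts by default. Luckily, ggplot has functions for this, like the after_stat() function, which lets you map an aesthetic to a value computed by the statistic: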

data %>% 
  ggplot(aes(x = Mathscore)) +
  geom_density() +
  geom_histogram(aes(y = after_stat(density)), alpha = 0.2) +
  scale_x_continuous(name = "Mathematics score") +
  labs(title = "Distribution of Mathematics score")
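If you are curious which values a statistic computes, you can inspect them with layer_data(). The following sketch uses simulated data (the column score is hypothetical) and shows that the histogram statistic provides both count and density, and that the bars on the density scale have a total area of 1, just like a density curve.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical)
set.seed(4)
sim <- data.frame(score = rnorm(400, mean = 500, sd = 90))

p <- ggplot(sim, aes(x = score)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 25, alpha = 0.2) +
  geom_density()

# Values computed for the first layer (the histogram)
bins <- layer_data(p, 1)
head(bins[, c("x", "count", "density")])

# On the density scale, bar height times bar width sums to 1
total_area <- sum(bins$density * (bins$xmax - bins$xmin))
total_area
```

Any column you see in this output can be used inside after_stat(), for example after_stat(count).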

3.3 Visualizing univariate data

Variables in your dataset come in different measurement levels. Your data can be:

Numerical - the values themselves have meaning
  - Interval - such as time or temperature, where the difference between values is meaningful
  - Ratio - such as height, weight, enzyme activity, which have a true zero point
Categorical - each value represents a different category
  - Dichotomous - 0 and 1; no and yes; false and true; etc.
  - Nominal - unordered, mutually exclusive categories, such as study programmes, nationality, etc.
  - Ordinal - ordered categories, such as year of study
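In R, these measurement levels roughly correspond to different data types. The sketch below uses made-up example values to show one common way of representing them; it is an illustration, not a strict rule.

```r
# Hypothetical example values, not the course data
height_cm  <- c(172.5, 160.1, 181.0)                       # ratio: numeric
passed     <- c(TRUE, FALSE, TRUE)                         # dichotomous: logical
programme  <- factor(c("Psychology", "Biology", "Psychology"))  # nominal: unordered factor
study_year <- factor(c("Year 2", "Year 1", "Year 3"),
                     levels = c("Year 1", "Year 2", "Year 3"),
                     ordered = TRUE)                       # ordinal: ordered factor

is.numeric(height_cm)
is.ordered(programme)
is.ordered(study_year)

# Comparisons such as "earlier than Year 3" only make sense for ordered factors
before_year_3 <- study_year < "Year 3"
before_year_3
```

Getting these types right matters for plotting, because ggplot chooses axes, colors and legends based on the type of each variable.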

Let’s make some example visualizations!

3.3.1 One continuous variable

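Besides density plots or histograms (see above), you can create boxplots to display continuous variables. Boxplots display the five-number summary of a set of data, that is, the minimum, 1st quartile, median, 3rd quartile and the maximum. Note that geom_boxplot() by default draws the whiskers only up to the most extreme values within 1.5 times the interquartile range from the box, and displays values beyond that as separate points: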

data %>% 
  ggplot(aes(y = Mathscore)) +
  geom_boxplot()

Without finetuning the plot, there are some ticks on the x-axis that do not look nice. You can remove them using the theme() function like below:

data %>% 
  ggplot(aes(y = Mathscore)) +
  geom_boxplot() +
  ## Remove x axis ticks & text, as you don't need it
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
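The numbers behind a boxplot can also be computed directly. The sketch below does this for simulated scores (hypothetical data) with quantile(), and then compares them to what geom_boxplot() actually draws, which you can retrieve with layer_data().

```r
library(ggplot2)

# Simulated stand-in data (hypothetical)
set.seed(5)
score <- rnorm(300, mean = 500, sd = 90)

# Five-number summary: minimum, 1st quartile, median, 3rd quartile, maximum
five_numbers <- quantile(score, probs = c(0, 0.25, 0.5, 0.75, 1))
five_numbers

# What the boxplot layer draws: whisker ends (ymin, ymax) and the box
p <- ggplot(data.frame(score = score), aes(y = score)) +
  geom_boxplot()
box <- layer_data(p)
box[, c("ymin", "lower", "middle", "upper", "ymax")]
```

The box and the median line match the quartiles. The whisker ends can differ from the minimum and maximum whenever the data contain values that are drawn as separate outlier points.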

An alternative to boxplots, which has been used more and more in recent years, is the violin plot. Unfortunately, in ggplot this requires an extra definition in the aesthetics: geom_violin() needs an x aesthetic, even when there is no variable to put there. This has to do with the way geom_violin() computes the statistics (density to be exact). Therefore, you define an empty x-axis, and change the label of the x-axis to be non-existent:

data %>% 
  # add an "empty" variable to the x-axis
  ggplot(aes(x = " ", y = Mathscore)) +
  geom_violin() +
  # change label of x-axis to be non-existent
  xlab("")

3.3.2 One categorical variable

For categorical variables (dichotomous, nominal and ordinal) you can simply use barcharts:

data %>% 
  ggplot(aes(x = Educationtrack)) +
  geom_bar()

However, when you want to inform your reader more precisely, you can also just opt for a table. For this the tabyl() function from the janitor package is quite useful:

library(janitor)
data %>%
  tabyl(Educationtrack)

 Educationtrack    n     percent
            PRO    9 0.002237136
        VMBO-bb  318 0.079045488
        VMBO-kb  611 0.151876709
        VMBO-tl 1171 0.291076311
           HAVO 1042 0.259010689
            VWO  872 0.216753666
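If you prefer to stay within the tidyverse, a similar frequency table can be built with count() and mutate() from dplyr. This is a sketch on simulated data (the column track is hypothetical), shown as an alternative to tabyl() rather than a replacement.

```r
library(dplyr)

# Simulated stand-in data (hypothetical)
set.seed(6)
sim <- tibble(
  track = factor(sample(c("A", "B", "C"), 200, replace = TRUE,
                        prob = c(0.2, 0.5, 0.3)))
)

freq <- sim %>%
  count(track) %>%                  # number of rows per category
  mutate(percent = n / sum(n))      # share of each category (0 to 1)

freq
```

Just like in the tabyl() output, the column percent contains proportions between 0 and 1, so multiply by 100 if you want to report percentages.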

3.4 Visualizing bivariate data

Visualizing bivariate data rather than univariate data is more straightforward than you might think: the geometries displayed above can be widely reused, and you often only need to add another aesthetic (i.e. the second variable).

3.4.1 Two categorical variables

Suppose you want to create a barchart for the variable Educationtrack, as above. Only this time, you want to distinguish an additional grouping variable, here Gender.

Add the fill argument to the aes() function of the code as follows:

data %>%
  ggplot(aes(x = Educationtrack, fill = Gender)) +
  geom_bar(position = "stack")

In this code, you also see that I used geom_bar(position = "stack") rather than simply geom_bar(). This position argument specifies that the different groups of Gender are stacked on top of each other.

You can also make them dodge each other, so you get a separate, differently colored bar for each Gender; for that, set position to dodge rather than stack:

data %>%
  ggplot(aes(x = Educationtrack, fill = Gender)) +
  geom_bar(position = "dodge")

In case you are not interested in absolute numbers, but rather in relative shares, you could use the fill option for the position argument. Each bar is then rescaled to a total height of 1, so the y-axis displays proportions. Change the label on the y-axis though, otherwise it does not say accurately what it displays:

data %>%
  ggplot(aes(x = Educationtrack, fill = Gender)) +
  geom_bar(position = "fill") +
  # change the label of the y-axis (values run from 0 to 1)
  ylab("Proportion")
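With position = "fill" the y-axis runs from 0 to 1. If you would rather show percentages on the axis, one option is to format the axis labels with label_percent() from the scales package, which is installed together with ggplot2. A minimal sketch on simulated data (the columns track and gender are hypothetical):

```r
library(ggplot2)

# Simulated stand-in data (hypothetical)
set.seed(7)
sim <- data.frame(
  track  = sample(c("A", "B", "C"), 300, replace = TRUE),
  gender = sample(c("Female", "Male"), 300, replace = TRUE)
)

p <- ggplot(sim, aes(x = track, fill = gender)) +
  geom_bar(position = "fill") +
  # show 0 to 1 as 0% to 100% and label the axis accordingly
  scale_y_continuous(name = "Percent", labels = scales::label_percent())
p
```

Only the axis labels change here; the underlying values are still proportions.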

Of course, this plot is still rather simple, and if you do not have a lot of categories for each of the two variables, you might rather opt for a table. For this, simply add the Gender variable to the code that you used for a univariate visualization, for instance:

data %>%
  tabyl(Educationtrack, Gender)

 Educationtrack Female Male
            PRO      3    6
        VMBO-bb    141  177
        VMBO-kb    294  317
        VMBO-tl    625  546
           HAVO    558  484
            VWO    479  393

This code sets the variable Educationtrack in the rows (first argument in the tabyl() function), and the variable Gender in the columns (second argument in the tabyl() function).

3.4.1.1 A “better” table

The previous table is relatively basic: it displays the counts for each combination of categories. The janitor package also gives you access to a series of adorn_ functions, each of which offers an addition to the table. The code below uses several of them at once, but try them out individually and see what happens!

data %>%
  tabyl(Educationtrack, Gender) %>%
  adorn_title("combined") %>%
  adorn_totals("col") %>%
  adorn_totals("row") %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()

 Educationtrack/Gender         Female           Male          Total
                   PRO   0.1%     (3)   0.3%     (6)   0.2%     (9)
               VMBO-bb   6.7%   (141)   9.2%   (177)   7.9%   (318)
               VMBO-kb  14.0%   (294)  16.5%   (317)  15.2%   (611)
               VMBO-tl  29.8%   (625)  28.4%   (546)  29.1% (1,171)
                  HAVO  26.6%   (558)  25.2%   (484)  25.9% (1,042)
                   VWO  22.8%   (479)  20.4%   (393)  21.7%   (872)
                 Total 100.0% (2,100) 100.0% (1,923) 100.0% (4,023)

3.4.2 One categorical variable & one continuous variable

As with barcharts, you can also split histograms by a categorical variable. Using the identity position, you basically create a histogram per level of the categorical variable, each with a different color, drawn on top of each other:

data %>% 
  ggplot(aes(x = Mathscore, fill = Gender)) +
  geom_histogram(position = "identity")

Density plots offer the same option; however, the curves overlap by default. Specifying alpha = 0.6 in the geom_density() function therefore makes them see-through. Try out some different values (between 0 and 1) here:

data %>% 
  ggplot(aes(x = Mathscore, fill = Gender)) +
  # Create a densityplot. With the alpha argument, you make the graphs transparent
  geom_density(alpha = 0.6)

Visualizing one categorical and one numerical variable can be useful to detect clustering in your data, which, for example, can give you insight into whether it is useful to include the categorical variable as a predictor in your statistical model. For this purpose, you can use a boxplot:

data %>%   
  ggplot(aes(x = Educationtrack, y = Mathscore)) + 
  geom_boxplot()

Or a violin plot:

data %>%   
  ggplot(aes(x = Educationtrack, y = Mathscore)) + 
  geom_violin()

3.4.3 Two continuous variables

To visualize two continuous variables, you again have multiple options. Which one you pick often depends on whether some sort of time series is involved or not. If not, you often pick the scatterplot, for which you use the geom_point() function:

data %>%
  ggplot(aes(x = Mathscore, y = Motivation)) +
  geom_point()
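For the case where a time series is involved, a line plot is a common choice. The course data do not contain a time variable, so this sketch uses the economics dataset that ships with ggplot2 (monthly US economic indicators) purely as an illustration.

```r
library(ggplot2)

# economics is an example dataset included in ggplot2;
# date is the month, unemploy the number of unemployed (in thousands)
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()
p
```

The only change compared to the scatterplot is the geometry: geom_line() connects the observations in the order of the variable on the x-axis.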

3.5 Visualizing multivariate data

To make more comparisons, or investigate more potential clustering, you can also add a third variable to each visualization.

3.5.1 Two categorical variables & one continuous variable

The previous boxplot created one box for each Educationtrack. If you move the Educationtrack variable to “fill” in the aesthetics, and enter Gender on the x-axis instead, you can see the relationship between Educationtrack and Mathscore per separate Gender:

data %>%
  ggplot(aes(x = Gender, y = Mathscore, fill = Educationtrack)) +
  geom_boxplot()

Of course you can also create a violin plot for this. Try it yourself!

3.5.2 Two continuous variables & one categorical variable

A scatterplot with two continuous variables also allows you to add a third grouping variable. Here you need to use the “color” argument in the aesthetics, resulting in:

data %>%
  ggplot(aes(x = Mathscore, y = Motivation, color = Educationtrack)) +
  geom_point()

3.5.3 Three continuous variables

If all three of your variables are continuous, that is also no problem. ggplot automatically detects that and uses a continuous color scale (a gradient) instead of distinct colors:

data %>%
  ggplot(aes(x = Mathscore, y = Motivation, color = EmoSups)) +
  geom_point()
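Whether you get a gradient or distinct colors depends only on the type of the variable mapped to color. The sketch below illustrates this with the mpg example dataset that ships with ggplot2 (not the course data): a numeric variable gives a gradient, while a factor gives one color per category.

```r
library(ggplot2)

# Numeric variable on color: continuous gradient
p_gradient <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

# Factor on color: one distinct color per category
p_discrete <- ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()

p_gradient
p_discrete
```

So if a variable with only a few numeric values (such as codes 1, 2, 3) shows up as a gradient, converting it to a factor gives you distinct colors.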

3.5.4 Three categorical variables

To visualize three categorical variables, aside from tables (which can get rather messy by now), I prefer to use the facets part of ggplot. For instance, using facet_wrap() you can create a bivariate plot separately for each level of the third categorical variable:

data %>%
  ggplot(aes(x = Graderepeat, fill = Educationtrack)) +
  geom_bar(position = "dodge") +
  facet_wrap(~Gender)
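facet_wrap() has a few arguments that are worth knowing. As a sketch on simulated data (all three columns are hypothetical stand-ins), ncol controls how the panels are arranged, and scales = "free_y" gives each panel its own y-axis range:

```r
library(ggplot2)

# Simulated stand-in data (hypothetical)
set.seed(8)
sim <- data.frame(
  repeated = sample(c("No", "Yes"), 400, replace = TRUE, prob = c(0.8, 0.2)),
  track    = sample(c("A", "B", "C"), 400, replace = TRUE),
  gender   = sample(c("Female", "Male"), 400, replace = TRUE)
)

p <- ggplot(sim, aes(x = repeated, fill = track)) +
  geom_bar(position = "dodge") +
  # stack the panels in one column, each with its own y-axis range
  facet_wrap(~ gender, ncol = 1, scales = "free_y")
p
```

Be careful with free scales though: they make each panel easier to read, but harder to compare with the others.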

3.6 Useful add-on: Combine plots

There are a lot of add-on packages that work with ggplot. Since I cannot go into all of them, this section simply introduces this possibility based on the example of the patchwork package. Let’s first prepare some plots and save them in separate objects.

# Make and save a simple histogram
histogram <- data %>% 
  ggplot(aes(x = Mathscore)) +
  geom_histogram(binwidth = 30) +
  scale_x_continuous(name = "Mathematics score")
# Make and save a simple boxplot
boxplot <- data %>% 
  ggplot(aes(y = Mathscore)) + 
  geom_boxplot() +
  # Mathscore is mapped to y, so the label belongs on the y-axis
  scale_y_continuous(name = "Mathematics score") +
  theme(axis.text.x  = element_blank(),
        axis.ticks.x = element_blank())
# Make and save a simple qqplot
qqplot <- data %>% ggplot(aes(sample = Mathscore)) + 
  geom_qq() + 
  geom_qq_line()

Using some of the operators you learned about in Chapter 2, you can now arrange these three plots however you see fit:

library(patchwork)
histogram | qqplot / boxplot
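In patchwork, | places plots next to each other and / stacks them. Because / is evaluated before | (just as in regular arithmetic in R), the layout above puts the histogram on the left, and the qqplot on top of the boxplot on the right. The sketch below, using simulated data (the column score is hypothetical), shows how parentheses change the arrangement, and how plot_annotation() adds a common title.

```r
library(ggplot2)
library(patchwork)

# Simulated stand-in data (hypothetical)
set.seed(9)
sim <- data.frame(score = rnorm(300, mean = 500, sd = 90))

p1 <- ggplot(sim, aes(x = score)) + geom_histogram(binwidth = 30)
p2 <- ggplot(sim, aes(sample = score)) + geom_qq() + geom_qq_line()
p3 <- ggplot(sim, aes(y = score)) + geom_boxplot()

# "/" is evaluated first: p1 on the left, p2 stacked over p3 on the right
layout_a <- p1 | p2 / p3

# Parentheses first: p1 and p2 side by side on top, p3 below
layout_b <- (p1 | p2) / p3

layout_b + plot_annotation(title = "Three views of the same variable")
```

When in doubt, add parentheses; they make the intended layout explicit for you and for your readers.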

As said, there are tons of add-on packages. The easiest way to find them is via internet search, or using the generative AI of your choice. You can also reach out to the Methodologyshop or Johannes Steinrücke if you ever want or need to do this for a project or your thesis.