2 Working with data & Troubleshooting – Introduction to R programming

This chapter serves as reading material for Meeting 2 of your Introduction to R programming classes. We strongly advise you to walk through all steps described BEFORE the meeting takes place.

2.1 The tidyverse

“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” - Tidyverse.org (2023)

The tidyverse is a powerful and very useful collection of R-packages. It includes many useful functions for data wrangling, data reshaping, plotting, and much more. Further, with the tidyverse you can often write more readable code than with base R.

The tidyverse also introduces the pipe operator %>%, which takes whatever comes before it (e.g., your data) and passes it as the first argument to the function that comes after it (e.g., mutate). In other words, the function mutate is applied to your data.

See for example some code written in base R:

mtcars$ratio <- mtcars$disp/mtcars$cyl            # create new variable
summary(lm(ratio ~ wt, data = mtcars))$r.squared  # run lm and get R2

[1] 0.7706461

Now take a look at code written using the tidyverse. Here you write code with the idea of performing one action per line, so that the code remains readable. Further, it is useful to annotate what you do in each line, so that you can easily find things again whenever you need to update your code. Despite looking substantially different from the previous code, it produces exactly the same output:

library(tidyverse)              # load the tidyverse (provides %>%, mutate, pluck)
mtcars %>%                      # data set on cars
  mutate(ratio = disp/cyl) %>%  # compute ratio variable
  lm(ratio ~ wt, data = .) %>%  # run linear model
  summary() %>%                 # make summary of linear model
  pluck("r.squared")            # select R squared from the summary

[1] 0.7706461
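As a minimal sketch of what the pipe does, the toy example below compares a piped chain with the equivalent nested base R call. The vector `values` is made up for illustration and is not part of the course data.

```r
library(magrittr)  # provides %>%; it is also attached by library(tidyverse)

# Hypothetical toy values, used only for illustration
values <- c(1, 4, 9, 16)

# Nested base R call: has to be read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Piped version: read from top to bottom, one action per line
piped <- values %>%
  sqrt() %>%     # square root of each value
  mean() %>%     # average of the square roots
  round(1)       # round to one decimal

identical(nested, piped)  # TRUE: both give 2.5
```

Each step receives the result of the previous line as its first argument, which is why `round(1)` in the pipe corresponds to `round(..., 1)` in the nested call.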

Besides being more readable, the tidyverse comes with many useful functions for data wrangling, which you can learn about below.

2.2 Data manipulation with the tidyverse

2.2.1 Load data

Assuming that your dataset (downloaded from Canvas) is in the same project folder as this script, you can just run the following code. The read_csv function from the tidyverse can be used to load comma-separated value (.csv) files into R:

data <- read_csv("OccupationalExpectations.csv")

If you use subfolders to keep things more organized, simply add the name of the subfolder to the file path:

data <- read_csv("data/OccupationalExpectations.csv")

Before you continue, inspect your data set using the following code:

data %>% View()

From now on, after every step, inspect the newest dataset by clicking on it in your R environment (on the top right of RStudio), or by using (an adjusted version of) the code above.
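If you prefer to inspect data in the console rather than in the viewer, the following sketch shows a few alternatives. It writes a small made-up data set to a temporary file first, so that it runs without the course file; the objects `toy`, `toy_path` and `toy_data` are hypothetical names used only here.

```r
library(tidyverse)

# Hypothetical toy data, written to a temporary .csv file for illustration
toy <- tibble(GENDER = c(1, 2, 2), MATHSCORE = c(452.7, 632.6, 406.7))
toy_path <- tempfile(fileext = ".csv")
write_csv(toy, toy_path)

# Load the file in the same way as the course data
toy_data <- read_csv(toy_path)

glimpse(toy_data)   # compact overview: one line per column, with its type
head(toy_data, 2)   # first two rows
names(toy_data)     # column names
dim(toy_data)       # number of rows and number of columns
```

These functions print to the console, so they also work in scripts that are run outside of RStudio.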

2.2.2 Basic data manipulation using the tidyverse

Thanks to the functions the tidyverse offers us, you can easily manipulate data sets.

Suppose you want to work with only 5 of the original 20 variables. In that case, you can use the select function. Together with the pipe operator (%>%), the R code to select 5 variables looks like this:

data %>% 
  select(GENDER, EDUCATIONTRACK, 
         MOTIVATION, MATHSCORE, AGE_arrival_NL)

However, if you run this code, it does not save the now smaller data frame. Therefore, you should save it in a new object (called data_2) using the assignment operator <-, like this:

data_2 <- data %>% 
  select(GENDER, EDUCATIONTRACK, 
         MOTIVATION, MATHSCORE, AGE_arrival_NL)

The variable names do not look nice like this, because they are (almost) entirely in capital letters. Let’s change that using the rename function:

data_2 <- data_2 %>% 
  rename(Gender = GENDER, 
         Educationtrack = EDUCATIONTRACK, 
         Motivation = MOTIVATION, 
         Mathscore = MATHSCORE, 
         Age_arrival_NL = AGE_arrival_NL)

Notice how the new variable name is on the left side, and the old variable name is on the right side. Let’s take a look at the data set now using the code below (in the reader, only the first 10 rows are shown):

data_2 %>% View()

Gender	Educationtrack	Motivation	Mathscore	Age_arrival_NL
1	3	-1.6781	452.676	0
1	4	-0.8771	494.199	0
1	2	0.1609	454.358	NA
2	4	-0.7408	632.601	0
2	2	-1.3169	406.702	0
2	2	-1.1296	447.140	0
2	3	0.5920	502.751	0
2	4	-0.7186	494.047	0
1	4	-0.5996	533.317	0
1	3	-0.2803	548.188	0
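As a side note, select can also rename variables while selecting them, using the same new = old pattern as rename. The sketch below illustrates this on a made-up tibble (`toy` and its `EXTRA` column are hypothetical); it is an alternative to the two separate steps above, not a replacement for them.

```r
library(tidyverse)

# Hypothetical toy data with one column we do not need
toy <- tibble(
  GENDER    = c(1, 2),
  MATHSCORE = c(452.7, 632.6),
  EXTRA     = c("a", "b")
)

# Select two variables and rename them in the same step (new = old)
toy_small <- toy %>%
  select(Gender = GENDER, Mathscore = MATHSCORE)

names(toy_small)  # "Gender" "Mathscore"
```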

Sometimes, you might want to use only those cases in your data set that fulfill certain conditions. To filter out the rest, you can use the filter function. Let’s only look at people with a motivation larger than or equal to 0.

data_2 <- data_2 %>% 
  filter(Motivation >= 0)
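A small sketch of how filter behaves, using made-up values (the tibble `toy` is hypothetical). It shows that rows for which the condition evaluates to NA are dropped as well, and that several conditions can be combined with a comma:

```r
library(tidyverse)

# Hypothetical toy data with missing values
toy <- tibble(
  Motivation     = c(-1.2, 0.0, 0.6, NA),
  Age_arrival_NL = c(0, NA, 0, 5)
)

# Keep rows with Motivation >= 0; the row with a missing Motivation is dropped too
kept <- toy %>%
  filter(Motivation >= 0)
nrow(kept)       # 2

# Several conditions separated by a comma must all be TRUE (logical AND)
kept_both <- toy %>%
  filter(Motivation >= 0, !is.na(Age_arrival_NL))
nrow(kept_both)  # 1
```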

If you have to transform some of your data, for example because you need to first take the logarithm of it, or assign labels to the values of the categorical variables, you can use the mutate function. You can perform multiple transformations in one go, by separating each transformation using a comma.

data_2 <- data_2 %>%
  mutate(Motivation_log = log(Motivation), 
         # Enforce that R sees Gender as categorical;
         # With labels Female (for value 1), and Male (for value 2)
         Gender = factor(Gender, labels = c("Female", "Male")),
         # use the case_when() function as conditional statement
         Motivation_cat = case_when(
           # Assign "very unmotivated" when motivation is smaller than -1
           Motivation < -1 ~ "Very unmotivated", 
           # Assign "very motivated" when motivation is larger than 1
           Motivation > 1 ~ "Very motivated", 
           # Assign average for everything else
           .default = "Average" )
         )

If you look closely, this step also includes a conditional statement (case_when), written the tidyverse way!
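To see how case_when assigns categories, here is a minimal sketch on made-up motivation scores (the tibble `toy` is hypothetical). The conditions are checked from top to bottom, and the first one that is TRUE determines the label. Note that the `.default` argument requires dplyr 1.1.0 or later; in older versions, the last line would be written as `TRUE ~ "Average"`.

```r
library(tidyverse)

# Hypothetical toy scores, including values below -1 and above 1
toy <- tibble(Motivation = c(-1.5, 0, 0.5, 1.2))

toy_cat <- toy %>%
  mutate(
    Motivation_cat = case_when(
      Motivation < -1 ~ "Very unmotivated",  # checked first
      Motivation > 1  ~ "Very motivated",    # checked second
      .default = "Average"                   # everything else
    )
  )

toy_cat$Motivation_cat
# "Very unmotivated" "Average" "Average" "Very motivated"

# Keep in mind when taking logarithms: the log of 0 is -Inf
log(0)
```

Because the logarithm of 0 is -Inf, it is worth inspecting a log-transformed variable whenever the original variable can be exactly 0.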

2.2.3 Summarise your data (per group)

When you want to compute summary statistics describing your data set, you can use the summarise function. If you want to do that separately for subgroups within your data, you can use the group_by function:

data_2 <- data_2 %>% 
  group_by(Gender) %>% 
  summarise(
    mean_motivation = mean(Motivation), 
    n = n()
    )

You have now computed the average level of motivation for each gender separately, as well as the number of observations used for each average. Let’s look at the result:

data_2

# A tibble: 2 × 3
  Gender mean_motivation     n
  <fct>            <dbl> <int>
1 Female           0.625   386
2 Male             0.665   474

As you can see, the output is ordered in a certain way. If you want to order it by sample size instead, such that the larger group is on top, you can use the arrange function together with the desc (descending) function:

data_2 <- data_2 %>% 
  arrange(desc(n))

Let’s take a look at the result:

data_2

# A tibble: 2 × 3
  Gender mean_motivation     n
  <fct>            <dbl> <int>
1 Male             0.665   474
2 Female           0.625   386
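The following sketch repeats the group_by, summarise and arrange steps on a made-up data set that contains a missing value (the tibble `toy` is hypothetical). It assumes that you want to ignore missing values when computing the mean, which is what `na.rm = TRUE` does; note that `n()` still counts the row with the missing value.

```r
library(tidyverse)

# Hypothetical toy data with one missing motivation score
toy <- tibble(
  Gender     = factor(c("Female", "Male", "Male", "Female", "Male")),
  Motivation = c(0.2, 0.4, NA, 0.6, 0.8)
)

toy_summary <- toy %>%
  group_by(Gender) %>%                                  # one group per gender
  summarise(
    mean_motivation = mean(Motivation, na.rm = TRUE),   # ignore missing values
    n = n()                                             # number of rows per group
  ) %>%
  arrange(desc(n))                                      # largest group on top

toy_summary
# Male:   mean_motivation 0.6, n 3
# Female: mean_motivation 0.4, n 2
```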

2.2.4 All in one pipe

Alternatively, thanks to the pipe operator %>%, you can do all of this in one single pipe:

data_3 <- data %>%
  select(GENDER, EDUCATIONTRACK, 
         MOTIVATION, MATHSCORE, AGE_arrival_NL) %>%
  rename(Gender = GENDER, 
         Educationtrack = EDUCATIONTRACK,
         Motivation = MOTIVATION, 
         Mathscore = MATHSCORE, 
         Age_arrival_NL = AGE_arrival_NL) %>%
  filter(Motivation >= 0) %>%
  mutate(Motivation_log = log(Motivation),
         Gender = factor(Gender, labels = c("Female", "Male")),
         Motivation_cat = case_when(
           Motivation < -1 ~ "Very unmotivated",
           Motivation > 1 ~ "Very motivated",
           .default = "Average"
         )) %>% 
  group_by(Gender) %>%
  summarise(mean_motivation = mean(Motivation),
            n = n()) %>%
  arrange(desc(n))

You can now look at the result again:

data_3

# A tibble: 2 × 3
  Gender mean_motivation     n
  <fct>            <dbl> <int>
1 Male             0.665   474
2 Female           0.625   386

2.3 Useful tips & troubleshooting

Working with R is not always without any issues. Find some tips & tricks below.

2.3.1 Coding tips

Take a look at the code shown in Figure 2.1. While it does work, it easily becomes cramped. Further, when coming back to this code after some time, it requires quite some effort to understand what the code is doing, as it is not annotated.

Figure 2.1: Non-annotated and dense code - Example

In contrast, the code shown in Figure 2.2 is more readable, and it is easier and faster to recognize what each piece of code does. Although writing code this way takes slightly longer, you help your future self and your collaborators a lot when you use spaces and annotations properly.

Figure 2.2: Annotated and spaced code - Example

Sometimes you might produce code that does not work; this is totally normal. Instead of writing your code anew from scratch, go back and fix the mistake! Think of it like writing a report: in the end, you submit only the final version, not all the intermediate, partially error-prone versions. For example, your code should NOT look like this:

# Make a 2x2 matrix
m <- matrix(z, ncol = 2, nrwo = 2)
m <- matrix(z, ncol = 2, nrow = 2)

Do you spot the mistake in the first line of code (there is a typo)? There is no reason to keep that line of code in your script. Instead, only keep the correct code, namely like this:

# Make a 2x2 matrix
m <- matrix(z, ncol = 2, nrow = 2)
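To see what R reports for such a typo, the sketch below captures the error message instead of stopping the script. The vector `z` is not defined in the example above, so a hypothetical `z <- 1:4` is assumed here.

```r
# Hypothetical values for z (not defined in the example above)
z <- 1:4

# Run the line with the typo and capture the error message as text
msg <- tryCatch(
  matrix(z, ncol = 2, nrwo = 2),
  error = function(e) conditionMessage(e)
)
msg  # in an English locale, the message mentions an unused argument named nrwo

# The corrected line runs without an error
m <- matrix(z, ncol = 2, nrow = 2)
dim(m)  # 2 2
```

The message names the misspelled argument directly, so reading it carefully points you to the exact place to fix.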

2.3.2 Copying code from other sources

In many tutorials we will provide you with code, or you can use code provided in the R-manual. While it is easiest to simply copy and paste the code, this can sometimes cause issues. For example, a - (minus) sign and a – (en-dash) sign are different characters. The difference is not always visible to you, but R does not recognize the en-dash as an operator and therefore cannot perform the computation. Typing the code yourself prevents this.

However, when typing code yourself, you might end up making typing errors. These will also lead to error messages. Here it is CRUCIAL to remain calm, and first check whether you have written everything correctly, whether all parentheses are where they need to be, and so on.
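If you suspect that copied code contains a look-alike character, you can check for characters outside the plain ASCII range. The helper `has_non_ascii` below is a hypothetical function written for this illustration, not part of any package; it is a rough check under the assumption that your code should contain only plain ASCII characters.

```r
typed  <- "5 - 3"        # ordinary minus sign, as typed on the keyboard
pasted <- "5 \u2013 3"   # en-dash (U+2013), as it may arrive via copy and paste

# Hypothetical helper: TRUE if the text contains any non-ASCII character
has_non_ascii <- function(code) {
  any(utf8ToInt(code) > 127)
}

has_non_ascii(typed)    # FALSE
has_non_ascii(pasted)   # TRUE

# The typed version can be evaluated as usual
eval(parse(text = typed))  # 2
```

For whole script files, base R also ships `tools::showNonASCIIfile()`, which prints the lines that contain non-ASCII characters.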

2.3.3 Common problems

Obviously, when you work with R you will run into (a variety of) problems. Let’s look at the most common ones and how to approach them:

First, check whether you ran all the code that you need to run.

If you did run everything, but you still run into errors, try the following approaches:

The error says “could not find function”? Check whether you installed and loaded all the necessary packages.
R is case-sensitive: Mean() is not the same as mean().
Typing errors: for example, read_csv("daat.csv") instead of read_csv("data.csv").
R says “file not found” or “no such file exists”: your files are probably not stored in your project directory.
Missing values in your data: some functions, like mean(), need the argument na.rm = TRUE; otherwise the output will be NA.
The data is in the wrong format for the function: double-check the documentation, e.g., by typing ?mean in the console.
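Some of the problems listed above can be reproduced in a few lines. The sketch below uses made-up scores (the vector `scores` is hypothetical) and the file path from earlier in this chapter; whether that file exists depends on your own project folder.

```r
# Hypothetical scores with one missing value
scores <- c(452.7, 494.2, NA, 632.6)

mean(scores)                  # NA, because of the missing value
mean(scores, na.rm = TRUE)    # 526.5, the missing value is ignored

# R is case-sensitive: Mean() does not exist, so R cannot find it
msg <- tryCatch(Mean(scores), error = function(e) conditionMessage(e))
msg  # in an English locale: could not find function "Mean"

# File problems: check where R is looking and whether the file is there
getwd()                                            # current working directory
file.exists("data/OccupationalExpectations.csv")   # TRUE only if the file is found
```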

In case you run into more complex problems, you will often be able to solve them yourself by searching the internet. For many issues you will see that someone, some time ago, ran into a similar issue and asked about it in online fora such as Stack Overflow. In most cases, you can find (more or less) efficient fixes in the replies. If after all this you still cannot solve your problem, or if you feel too new to R, please contact the teacher of this course.

Important

However, the most important thing in troubleshooting is the following:

READ THE ERROR / WARNING MESSAGE

2.3.4 The use of AI

Up until now we did not mention whether and how to use artificial intelligence (AI) for learning R. In this section we want to briefly present our opinion and provide some tips on how you can use AI to learn better, during this course and during your studies. Specifically, we refer to generative AI and the platforms through which you can use it, such as ChatGPT, Google Gemini, Microsoft Copilot, Claude, Mistral and many more. In 2025 we have been testing various platforms. We do not want to recommend a specific platform, but we will provide some aspects you should consider when making your own choice.

First, you should consider your data privacy. Platforms that are hosted outside of the European Union do not follow the same rigorous guidelines (such as the GDPR). When you use these platforms, you should therefore be aware that you are the product. Your data is unlikely to be safe. In addition, these platforms did not have to follow the same regulations during development. That means that they might have been trained (i.e., made) based on unethical principles, such as illegally obtained personal data. European platforms, such as Mistral or Lumo, tend to treat your data more ethically, are often open-source (i.e., you can “double-check” them), and follow European laws more rigorously. However, they might be less performant in very demanding tasks.

Important

No matter what platform or model you use, we strongly recommend that you:

DO NOT UPLOAD ANY (PERSONAL) DATA ON ANY AI PLATFORM, ESPECIALLY WITHOUT FULL INFORMED CONSENT

Second, you should know what AI actually can and cannot do. The short answer is: It can do a lot. The actual answer is more nuanced though. Looking at the aforementioned generative AI platforms, you should know that generative AI, or more specifically large language models (LLMs), simply predict the most likely words based on your input (prompt). While they do this incredibly well, they are still prone to errors. As these models are trained on publicly available texts, websites and more (basically the entire internet), their training data also includes online forums. In the context of R programming, that means that they are trained on a lot of faulty code, as there are more questions than answers. Do you want to know facts? Use a good old online search engine, check Wikipedia, check your textbooks. But don’t ask AI, unless you have enough understanding to judge its output.

Third and last, when studying R with the help of AI, we recommend that you ask the AI platform to explain the code to you. Because of the way these models were trained, this is incredibly useful for beginning scientists (like you): they can help you better understand every step taken in the code you provide. Furthermore, you could start your conversation with Henk’s custom prompt, developed by Henk van der Kolk, to engage with the AI platform as if it were a Study Buddy. Read the prompt carefully, and see what it does. There are already plenty of resources available on how to use AI to support your learning; one of these guides is the AITutor guide. Aside from asking AI to explain R code to you, you can also create a Socratic-tutor-like system yourself. Be aware, though, that in this specific guide R code is written differently from how we teach it at UT. We strongly recommend that you follow the coding style that we teach, as this is how you can expect the best support from your teachers. Therefore, use Henk’s custom prompt to instruct your AI platform on how to study with you.

2.3.5 Keeping R & RStudio up-to-date

R and RStudio, like other apps and programmes, receive regular updates. While you do not need to have the newest versions of everything, it is advised to update them at least at the start of each new academic year. To do so, simply uninstall R and RStudio, and install them again as described in Chapter 1.
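Before reinstalling, it can help to check which versions you currently have. The sketch below prints the R version and, if the tidyverse is installed, its package version; the version of RStudio itself is shown in its menu rather than by this code.

```r
# Version of R that is currently running
R.version.string

# Version of the tidyverse package, if it is installed
if (requireNamespace("tidyverse", quietly = TRUE)) {
  print(packageVersion("tidyverse"))
} else {
  message("The tidyverse is not installed yet.")
}
```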

2.3.6 Useful links

At BMS there are already various useful resources about working with R:

You can always find the newest version of Stephanie van den Berg’s book
Consult the BMS Visualization Tool initiated by Henk van der Kolk and Karel Kroeze
Browse the R-Manual
If you want to use any generative AI tool, like ChatGPT, please use it only to explain the code to you. Paste your code in there, and ask for a line-by-line explanation

Note

Of course, you can find way more resources online. However, they do not always follow the coding style taught at BMS, so you might have to adjust that yourself!

Find the R Cheat Sheets on https://posit.co/resources/cheatsheets/