1 Basics of R

This section gives a bare-bones introduction to some common tasks in R.

It covers basic methods including reading and writing data, manipulating and reshaping datasets with tidyverse verbs, and drawing some simple charts with ggplot2. For a much more thorough resource, please use R for Data Science at https://r4ds.hadley.nz/, which is awesome.

This is a work in progress.

1.1 Load Libraries

Libraries (also known as packages) allow R to do things it can’t do out of the box.
You can install packages using the install.packages("packagename") syntax. If you haven’t used R before, you should first install the tidyverse packages by typing install.packages("tidyverse"). You only need to install a package once, but you need to load it with library() in every new session.
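If you rerun a script often, you may not want to reinstall packages every time. The sketch below installs a package only when it is missing. The helper name install_if_missing is made up for this example; requireNamespace() and install.packages() are base R functions.

```r
# Hypothetical helper: install a package only if it is not already available.
install_if_missing <- function(pkg) {
  not_installed <- !requireNamespace(pkg, quietly = TRUE)
  if (not_installed) {
    install.packages(pkg)
  }
  invisible(not_installed)  # TRUE if an installation was attempted
}

install_if_missing("tidyverse")
library(tidyverse)
```

The helper goes at the very top of a script, before the library() calls.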

You should always load the libraries you’ll be using at the top of your script.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)   # For reading xls and xlsx files
library(haven)    # For reading Stata files

1.2 Write CSV files

We’ll be using the iris dataset, which is built in and which can be accessed by typing iris.

We need a toy dataset to work with, so let’s start by saving a copy of iris as a csv file. This is backwards from what we normally do, which is to save a data set at the end, but will serve our purpose here.

write_csv(iris, "iris.csv")  # Saving to CSV

Check that iris.csv now appears in the files list on the right.

1.3 Read CSV files

Next let’s read the file back in and assign it to an object. read_csv() returns a tibble, which is the tidyverse version of a data frame.

The <- symbol assigns the data contained in the file to an object in the R environment.

You can have as many or as few objects as you need.

iris_data <- read_csv("iris.csv")

Rows: 150 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Species
dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The messages are important. Check which class each variable has been assigned. These are the abbreviations you are most likely to see:

chr:
- Full Form: character
- Used for columns containing text data (strings).
dbl or num:
- Full Form: double/number
- Represents floating-point numbers (numeric values with decimal places).
- This is the most common numeric data type used in R for real numbers.
int:
- Full Form: integer
- Used for whole numbers.
lgl:
- Full Form: logical
- Used for boolean (TRUE/FALSE) values.
fct:
- Full Form: factor
- Used for categorical data (discrete values like categories, levels, or groups).
dttm:
- Full Form: date-time
- Represents date and time objects (POSIXct class) which include both date and time information.
date:
- Full Form: Date
- Used for date objects, containing only date information (year, month, day).
time:
- Full Form: time
- Used for time objects (sometimes seen when working with time series data, though less common than date or POSIXct).
lst:
- Full Form: list
- Represents a list column, which can hold any type of R object, including vectors, data frames, or even other lists.
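If the guessed classes are not what you want, you can state them when reading the file. This is a minimal sketch that reads Species as a factor (fct) rather than character (chr). It writes iris to a temporary file first so that it runs on its own; the object names path and iris_typed are just for illustration.

```r
library(readr)

# Write iris to a temporary file so the example does not touch your project folder
path <- tempfile(fileext = ".csv")
write_csv(iris, path)

# Read it back, declaring the column types instead of accepting the guesses
iris_typed <- read_csv(
  path,
  col_types = cols(
    Species = col_factor(),   # categorical, levels taken from the data
    .default = col_double()   # every other column is a decimal number
  )
)

class(iris_typed$Species)
```

Because the types are given explicitly, read_csv() no longer prints the column specification message.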

1.4 Other types of files

If you’re just beginning, leave this section for later.
Often you’ll need to work with CSV, Stata, XLS and XLSX files. This section shows you how to list files, find their sizes, and read and write data in these formats. The paths below are placeholders, so replace them with your own.

# Load necessary libraries
library(readxl)   # For reading xls and xlsx files
library(haven)    # For reading Stata files

# List files in the directory recursively and show their sizes
# (list the files once so that names and sizes refer to the same files)
data_dir <- "p:/foo/bar bar"
files <- list.files(data_dir, full.names = TRUE, recursive = TRUE)
file_info <- tibble(
  file_name = files,
  file_size = file.info(files)$size   # size in bytes
)

# Display file information
print(file_info)

# Current working directory to save files
output_dir <- getwd()

# Read and save a CSV file
csv_data <- read_csv("p:/foo/bar bar/example.csv")
write_csv(csv_data, file.path(output_dir, "example_saved.csv"))

# Read and save an XLS file
xls_data <- read_excel("p:/foo/bar bar/example.xls")
write_csv(xls_data, file.path(output_dir, "example_saved_from_xls.csv"))

# Read and save an XLSX file
xlsx_data <- read_excel("p:/foo/bar bar/example.xlsx")
write_csv(xlsx_data, file.path(output_dir, "example_saved_from_xlsx.csv"))

# Read and save a Stata file
dta_data <- read_dta("p:/foo/bar bar/example.dta")
write_csv(dta_data, file.path(output_dir, "example_saved_from_dta.csv"))

# Print a message confirming completion
message("Files listed, read, and saved in the current working directory: ", output_dir)

1.5 Look at the structure of an object

str(iris_data)

spc_tbl_ [150 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
 - attr(*, "spec")=
  .. cols(
  ..   Sepal.Length = col_double(),
  ..   Sepal.Width = col_double(),
  ..   Petal.Length = col_double(),
  ..   Petal.Width = col_double(),
  ..   Species = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

1.6 Look at the contents of an object

iris_data

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

1.6.1 Summary of the data set

summary(iris_data)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
   Species         
 Length:150        
 Class :character  
 Mode  :character

1.7 Pipes

The %>% pipe operator, commonly called the “pipe,” is one of the most important tools in the tidyverse. It is used to pass the result of one function into the next function, making your code cleaner and easier to read by chaining operations together.

1.7.1 How the Pipe Works:

The pipe takes the output of the expression on its left and passes it as the first argument to the function on its right.

result <- data %>%
  operation_1 %>%
  operation_2 %>%
  ... %>%
  operation_n
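As a small illustration of the rule that the left-hand side becomes the first argument on the right, here is the same calculation written three ways. The object names are arbitrary. The last form uses the native |> pipe that is built into R from version 4.1; it behaves the same way in simple chains like this one.

```r
library(dplyr)  # makes %>% available

x <- c(1, 4, 9)

nested <- sum(sqrt(x))            # read inside-out
piped  <- x %>% sqrt() %>% sum()  # read left to right
native <- x |> sqrt() |> sum()    # base R pipe, needs R 4.1 or later

# Extra arguments go after the piped value: 3.14159 %>% round(2) is round(3.14159, 2)
rounded <- 3.14159 %>% round(2)
```

All three versions compute the square roots first and then add them up.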

1.7.2 Simple Explanation:

Without the pipe, you would need to nest functions, which can make the code harder to read:

summarise(group_by(filter(iris_data, Sepal.Length > 5.8), Species), mean_length = mean(Sepal.Length))

With the %>% pipe, you can write the same thing more readably, with one step per line:

iris_data %>%
  filter(Sepal.Length > 5.8) %>%
  group_by(Species) %>%
  summarise(mean_length = mean(Sepal.Length))

# A tibble: 2 × 2
  Species    mean_length
  <chr>            <dbl>
1 versicolor        6.34
2 virginica         6.72

1.7.3 Benefits of Using the Pipe:

Improves readability: It reads like a logical sequence of steps.
Reduces the need for intermediate variables: You don’t need to create multiple intermediate objects.
Simplifies function chaining: Functions are applied one after the other, making it clear what happens at each step.

1.8 Single Table Verbs (basic)

All the main actions in tidyverse take a tibble (the tidyverse version of a data frame), do something with it and then return another tibble. These are the ‘single table verbs’.

These are the main functions you’ll need to learn.
All of them accept several arguments at once, separated by commas.

1.8.1 Filter

filter():
- Filters rows based on specified conditions.
- Returns only the rows that meet the condition(s).

iris_data %>% 
  filter(
    Species == "setosa",
    Sepal.Length > 4.3)

# A tibble: 49 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 39 more rows

1.8.2 Logic

You can use comparison, logical and arithmetic operators inside any verb to make complex queries:
- == Equals
- != Not Equals
- > more than
- < less than
- >= more than or equal to
- <= less than or equal to
- | or
- & and
- ! not
- + add
- - subtract
- %% modulo

1.8.3 AND (&)

Filtering rows where Sepal.Length is greater than 5 AND Species is not setosa

iris_data %>% 
  filter(
    Sepal.Length > 5 & Species != "setosa"
    )

# A tibble: 96 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <chr>     
 1          7           3.2          4.7         1.4 versicolor
 2          6.4         3.2          4.5         1.5 versicolor
 3          6.9         3.1          4.9         1.5 versicolor
 4          5.5         2.3          4           1.3 versicolor
 5          6.5         2.8          4.6         1.5 versicolor
 6          5.7         2.8          4.5         1.3 versicolor
 7          6.3         3.3          4.7         1.6 versicolor
 8          6.6         2.9          4.6         1.3 versicolor
 9          5.2         2.7          3.9         1.4 versicolor
10          5.9         3            4.2         1.5 versicolor
# ℹ 86 more rows

1.8.4 OR (|)

Filtering rows where Sepal.Length is less than 5 OR Species is setosa

iris_data %>% 
  filter(
    Sepal.Length < 5 | Species == "setosa"
    )

# A tibble: 52 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 42 more rows

1.8.5 NOT (!)

Filtering rows where Species is NOT “setosa” by negating the test with ! placed at the start. Compare to above where != was used for not-equal. Here it tests if the species equals setosa, then returns all rows where that is NOT true.

iris_data %>% 
  filter(
    !(Species == "setosa")
    )

# A tibble: 100 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <chr>     
 1          7           3.2          4.7         1.4 versicolor
 2          6.4         3.2          4.5         1.5 versicolor
 3          6.9         3.1          4.9         1.5 versicolor
 4          5.5         2.3          4           1.3 versicolor
 5          6.5         2.8          4.6         1.5 versicolor
 6          5.7         2.8          4.5         1.3 versicolor
 7          6.3         3.3          4.7         1.6 versicolor
 8          4.9         2.4          3.3         1   versicolor
 9          6.6         2.9          4.6         1.3 versicolor
10          5.2         2.7          3.9         1.4 versicolor
# ℹ 90 more rows

1.8.7 Arrange

arrange():
- Orders rows of a tibble by one or more columns.
- Can sort in ascending or descending order.
- If you give several columns, rows are sorted by the first column, and ties are broken by the second, then the third.

iris_data %>% 
  arrange(
    Sepal.Length,
    Sepal.Width,
    Species
    )

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          4.3         3            1.1         0.1 setosa 
 2          4.4         2.9          1.4         0.2 setosa 
 3          4.4         3            1.3         0.2 setosa 
 4          4.4         3.2          1.3         0.2 setosa 
 5          4.5         2.3          1.3         0.3 setosa 
 6          4.6         3.1          1.5         0.2 setosa 
 7          4.6         3.2          1.4         0.2 setosa 
 8          4.6         3.4          1.4         0.3 setosa 
 9          4.6         3.6          1           0.2 setosa 
10          4.7         3.2          1.3         0.2 setosa 
# ℹ 140 more rows

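To sort in descending order, wrap the column in desc(). Note that ! does not reverse a sort: on a numeric column it is a logical negation that turns every non-zero value into FALSE, so the column no longer affects the order.

iris_data %>% 
  arrange(
    desc(Sepal.Length),
    Sepal.Width,
    Species
    )

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
 1          7.9         3.8          6.4         2   virginica
 2          7.7         2.6          6.9         2.3 virginica
 3          7.7         2.8          6.7         2   virginica
 4          7.7         3            6.1         2.3 virginica
 5          7.7         3.8          6.7         2.2 virginica
 6          7.6         3            6.6         2.1 virginica
 7          7.4         2.8          6.1         1.9 virginica
 8          7.3         2.9          6.3         1.8 virginica
 9          7.2         3            5.8         1.6 virginica
10          7.2         3.2          6           1.8 virginica
# ℹ 140 more rows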

1.8.8 Select

select():
- Selects specific columns from a data frame or tibble.
- Useful for reducing data to only the columns of interest.
- As with other verbs, you can name several columns, separated by commas. They are returned in the order you list them.

iris_data %>% 
  select(
    Sepal.Length, 
    Species
    )

# A tibble: 150 × 2
   Sepal.Length Species
          <dbl> <chr>  
 1          5.1 setosa 
 2          4.9 setosa 
 3          4.7 setosa 
 4          4.6 setosa 
 5          5   setosa 
 6          5.4 setosa 
 7          4.6 setosa 
 8          5   setosa 
 9          4.4 setosa 
10          4.9 setosa 
# ℹ 140 more rows

It’s also possible to negate a selection with !, which keeps every column except the one named:

iris_data %>% 
  select(
    !Species
    )

# A tibble: 150 × 4
   Sepal.Length Sepal.Width Petal.Length Petal.Width
          <dbl>       <dbl>        <dbl>       <dbl>
 1          5.1         3.5          1.4         0.2
 2          4.9         3            1.4         0.2
 3          4.7         3.2          1.3         0.2
 4          4.6         3.1          1.5         0.2
 5          5           3.6          1.4         0.2
 6          5.4         3.9          1.7         0.4
 7          4.6         3.4          1.4         0.3
 8          5           3.4          1.5         0.2
 9          4.4         2.9          1.4         0.2
10          4.9         3.1          1.5         0.1
# ℹ 140 more rows
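Columns can also be picked by pattern rather than by name. The sketch below uses the selection helpers starts_with() and ends_with(), which are available once dplyr is loaded; the object names petals and no_widths are arbitrary.

```r
library(dplyr)

# Keep Species plus every column whose name starts with "Petal"
petals <- iris %>%
  select(Species, starts_with("Petal"))

names(petals)

# Negation works with helpers too: drop every column ending in "Width"
no_widths <- iris %>%
  select(!ends_with("Width"))

names(no_widths)
```

This is handy when a dataset has many columns that share a prefix or suffix.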

1.8.9 Mutate

mutate():
- Adds new columns or modifies existing columns in a tibble.
- Commonly used for creating calculated columns.

iris_data %>%
  mutate(
    Petal.Ratio = Petal.Length / Petal.Width,
    Petal.Area = Petal.Length * Petal.Width
         )

# A tibble: 150 × 7
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Ratio
          <dbl>       <dbl>        <dbl>       <dbl> <chr>         <dbl>
 1          5.1         3.5          1.4         0.2 setosa         7   
 2          4.9         3            1.4         0.2 setosa         7   
 3          4.7         3.2          1.3         0.2 setosa         6.5 
 4          4.6         3.1          1.5         0.2 setosa         7.5 
 5          5           3.6          1.4         0.2 setosa         7   
 6          5.4         3.9          1.7         0.4 setosa         4.25
 7          4.6         3.4          1.4         0.3 setosa         4.67
 8          5           3.4          1.5         0.2 setosa         7.5 
 9          4.4         2.9          1.4         0.2 setosa         7   
10          4.9         3.1          1.5         0.1 setosa        15   
# ℹ 140 more rows
# ℹ 1 more variable: Petal.Area <dbl>
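The example above only adds columns. mutate() overwrites a column when the name on the left already exists, as in this sketch. The conversion to millimetres and the column name Long.Sepal are invented for the example.

```r
library(dplyr)

iris_mm <- iris %>%
  mutate(
    Sepal.Length = Sepal.Length * 10,  # overwrite: centimetres to millimetres
    Species = toupper(Species),        # overwrite: upper-case labels (now character)
    Long.Sepal = Sepal.Length > 60     # new column, uses the updated Sepal.Length
  )

iris_mm
```

Arguments are evaluated in order, so Long.Sepal sees the value that was already multiplied by 10.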

1.8.10 Pivot_longer

pivot_longer():
- Converts wide data to long format (stacking columns into rows).
- Useful for transforming data when working with multiple measurement columns.
- We’ll use relig_income as an example data set. This is in WIDE format.

relig_income

# A tibble: 18 × 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

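To pivot this data set, we provide
- cols = : The columns whose values will be stacked into the new ‘values’ column. Here we want all columns from the dataset except the religion column which provides the labels, so simply exclude that one with !religion. You could also name specific columns with a vector, e.g. cols = c("<$10k","$10-20k","$20-30k")
- names_to = : The name of the new column that will hold the old column names (here income).
- values_to = : The name of the new column that will hold the values (here n).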

relig_income_long <- relig_income %>%
    pivot_longer(
      cols = !religion, 
      names_to = "income", 
      values_to = "n")

relig_income_long

# A tibble: 180 × 3
   religion income                 n
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# ℹ 170 more rows
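The same three arguments are easier to follow on a table small enough to check by eye. The tibble counts_wide below is invented for this sketch, and it names the columns to stack explicitly instead of excluding one.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per site, one column per year
counts_wide <- tibble(
  site = c("north", "south"),
  `2023` = c(10, 20),
  `2024` = c(15, 25)
)

counts_long <- counts_wide %>%
  pivot_longer(
    cols = c(`2023`, `2024`),  # columns to stack
    names_to = "year",         # old column names go here
    values_to = "count"        # cell values go here
  )

counts_long
```

Two rows and two year columns become four rows, one for each site and year combination.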

1.8.11 Pivot_wider

This is the exact opposite of pivot_longer. You take the values of one column (here income) and turn them into the column names of the wide format tibble, then distribute the values of another column (here n) across those new columns, with one row for each value of the id column (here religion).

relig_income_long %>%
  pivot_wider(
    id_cols = religion,
    names_from = income, 
    values_from = n
    )

# A tibble: 18 × 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>
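To check that the two pivots really undo each other, here is a sketch on a small invented table (counts_long and the other object names are assumptions for the example). It pivots wider and then longer again.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table: one row per site and year
counts_long <- tibble(
  site  = c("north", "north", "south", "south"),
  year  = c("2023", "2024", "2023", "2024"),
  count = c(10, 15, 20, 25)
)

# One column per year, one row per site
counts_wide <- counts_long %>%
  pivot_wider(names_from = year, values_from = count)

# Pivoting longer again recovers the long table
round_trip <- counts_wide %>%
  pivot_longer(cols = !site, names_to = "year", values_to = "count")

counts_wide
round_trip
```

Here id_cols is left out, so pivot_wider() treats every column not named in names_from or values_from as an id column.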

1.8.12 Using pipes and verbs together

To do a series of things to a tibble, you simply pipe the verbs one after another:

relig_income %>%
    pivot_longer(  
      cols = !religion, 
      names_to = "income", 
      values_to = "count") %>% 

    filter(
      income != "Don't know/refused"
          ) %>% 

    arrange(
      income,
      religion
          )

# A tibble: 162 × 3
   religion                income  count
   <chr>                   <chr>   <dbl>
 1 Agnostic                $10-20k    34
 2 Atheist                 $10-20k    27
 3 Buddhist                $10-20k    21
 4 Catholic                $10-20k   617
 5 Don’t know/refused      $10-20k    14
 6 Evangelical Prot        $10-20k   869
 7 Hindu                   $10-20k     9
 8 Historically Black Prot $10-20k   244
 9 Jehovah's Witness       $10-20k    27
10 Jewish                  $10-20k    19
# ℹ 152 more rows

You can include comments (lines starting with #) if it helps you.
Using indentation also really helps.

relig_income %>%
  
    # Pivot the table to be long format
    pivot_longer(  
      cols = !religion, 
      names_to = "income", 
      values_to = "count") %>% 
  
    # Remove lines where no income data was provided
    filter(
      income != "Don't know/refused"
          ) %>% 
  
    # Sort the data by income band, then by religion
    arrange(
      income,
      religion
          )

# A tibble: 162 × 3
   religion                income  count
   <chr>                   <chr>   <dbl>
 1 Agnostic                $10-20k    34
 2 Atheist                 $10-20k    27
 3 Buddhist                $10-20k    21
 4 Catholic                $10-20k   617
 5 Don’t know/refused      $10-20k    14
 6 Evangelical Prot        $10-20k   869
 7 Hindu                   $10-20k     9
 8 Historically Black Prot $10-20k   244
 9 Jehovah's Witness       $10-20k    27
10 Jewish                  $10-20k    19
# ℹ 152 more rows

1.8.13 Group_by

group_by() groups data by one or more columns. It does not change the data itself; it changes how the verbs that follow behave, so that they work within each group.

Combined with mutate(), group_by() lets you create groupwise calculations such as group means.

This approach does not collapse the data, so if you want three rows, with a single mean value for each species, you need summarise() or reframe() instead (see below).

Always ungroup() if you plan to do further calculations on individual rows.

iris %>% 
  group_by(Species) %>% 
  mutate(
    Sepal.length.mean = mean(Sepal.Length),
    weight = Sepal.Length - Sepal.length.mean
  ) %>% 
  ungroup()

# A tibble: 150 × 7
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.length.mean
          <dbl>       <dbl>        <dbl>       <dbl> <fct>               <dbl>
 1          5.1         3.5          1.4         0.2 setosa               5.01
 2          4.9         3            1.4         0.2 setosa               5.01
 3          4.7         3.2          1.3         0.2 setosa               5.01
 4          4.6         3.1          1.5         0.2 setosa               5.01
 5          5           3.6          1.4         0.2 setosa               5.01
 6          5.4         3.9          1.7         0.4 setosa               5.01
 7          4.6         3.4          1.4         0.3 setosa               5.01
 8          5           3.4          1.5         0.2 setosa               5.01
 9          4.4         2.9          1.4         0.2 setosa               5.01
10          4.9         3.1          1.5         0.1 setosa               5.01
# ℹ 140 more rows
# ℹ 1 more variable: weight <dbl>
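The reason for ungroup() is easier to see when the same calculation is run with and without groups. In this sketch the ranking columns and object names are invented; min_rank(), desc() and is_grouped_df() are dplyr functions.

```r
library(dplyr)

grouped <- iris %>%
  group_by(Species) %>%
  mutate(rank_in_species = min_rank(desc(Sepal.Length)))  # 1 = longest in its species

is_grouped_df(grouped)  # the grouping is still attached to the result

ungrouped <- grouped %>% ungroup()

# After ungroup(), the same ranking runs across all 150 rows
overall <- ungrouped %>%
  mutate(rank_overall = min_rank(desc(Sepal.Length)))

overall
```

If the ungroup() step were left out, rank_overall would silently be computed within each species again.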

summarise() and reframe() are almost always used in combination with group_by().
As with all verbs, you can give group_by(), reframe() and summarise() several arguments separated by commas.

1.8.14 Reframe

Purpose: A more flexible way to return multiple results for each group without reducing it to one row per group.
Typical Use: Used when you want to keep multiple rows per group but still perform summary operations, for example when creating denominators.
Behavior: Allows for returning multiple rows or multiple values for each group, so it doesn’t necessarily collapse the data.
If you choose to include any of the original variables in your reframed tibble, the resulting tibble will have the same number of rows as your original. Here the groupwise counts are added to count and the groupwise mean sepal lengths are added to sepal.length.mean.

iris_data %>%
  group_by(Species) %>%
  reframe(
    Sepal.Length,
    Sepal.Width,
    count = n(),
    sepal.length.mean = mean(Sepal.Length),
    weight = Sepal.Length-sepal.length.mean
  )

# A tibble: 150 × 6
   Species Sepal.Length Sepal.Width count sepal.length.mean   weight
   <chr>          <dbl>       <dbl> <int>             <dbl>    <dbl>
 1 setosa           5.1         3.5    50              5.01  0.0940 
 2 setosa           4.9         3      50              5.01 -0.106  
 3 setosa           4.7         3.2    50              5.01 -0.306  
 4 setosa           4.6         3.1    50              5.01 -0.406  
 5 setosa           5           3.6    50              5.01 -0.00600
 6 setosa           5.4         3.9    50              5.01  0.394  
 7 setosa           4.6         3.4    50              5.01 -0.406  
 8 setosa           5           3.4    50              5.01 -0.00600
 9 setosa           4.4         2.9    50              5.01 -0.606  
10 setosa           4.9         3.1    50              5.01 -0.106  
# ℹ 140 more rows

If you choose not to include your original variables, reframe() will present only the new variables.

iris %>%
  group_by(Species) %>%
  reframe(
    count = n(),
    mean.sepal.length = mean(Sepal.Length),
    median.petal.length = median(Petal.Length),
    sd.petal.length = sd(Petal.Length)
  )

# A tibble: 3 × 5
  Species    count mean.sepal.length median.petal.length sd.petal.length
  <fct>      <int>             <dbl>               <dbl>           <dbl>
1 setosa        50              5.01                1.5            0.174
2 versicolor    50              5.94                4.35           0.470
3 virginica     50              6.59                5.55           0.552

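summarise():
- Creates summary statistics for each group, such as mean, median, or sum, and expects exactly one row per group.
- In this case summarise() creates the same table as the previous example of reframe (minus the sd.petal.length column, which is not requested here).
- This shows that reframe can do the job of summarise(), so you can generally rely on reframe. One difference to be aware of is that reframe always returns an ungrouped tibble.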

iris_data %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean.sepal.length = mean(Sepal.Length),
    median.petal.length = median(Petal.Length)
            )

# A tibble: 3 × 4
  Species    count mean.sepal.length median.petal.length
  <chr>      <int>             <dbl>               <dbl>
1 setosa        50              5.01                1.5 
2 versicolor    50              5.94                4.35
3 virginica     50              6.59                5.55
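Where the two verbs differ is when a calculation returns more than one value per group. The sketch below asks for two quantiles per species, which reframe() returns as two rows per group. The column names prob and sepal.length and the object quartiles are chosen for the example, and reframe() needs dplyr 1.1.0 or later.

```r
library(dplyr)

quartiles <- iris %>%
  group_by(Species) %>%
  reframe(
    prob = c(0.25, 0.75),                                 # which quantile each row holds
    sepal.length = quantile(Sepal.Length, c(0.25, 0.75))  # two values per species
  )

quartiles
```

Three species with two rows each give a six-row tibble, and the result is no longer grouped.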

1.8.15 Rename

rename():
- Renames columns in a tibble.
- Helpful for cleaning up column names for clarity.

iris_data %>% 
  rename(
    var_a = Petal.Length,
    var_b = Petal.Width
    )

# A tibble: 150 × 5
   Sepal.Length Sepal.Width var_a var_b Species
          <dbl>       <dbl> <dbl> <dbl> <chr>  
 1          5.1         3.5   1.4   0.2 setosa 
 2          4.9         3     1.4   0.2 setosa 
 3          4.7         3.2   1.3   0.2 setosa 
 4          4.6         3.1   1.5   0.2 setosa 
 5          5           3.6   1.4   0.2 setosa 
 6          5.4         3.9   1.7   0.4 setosa 
 7          4.6         3.4   1.4   0.3 setosa 
 8          5           3.4   1.5   0.2 setosa 
 9          4.4         2.9   1.4   0.2 setosa 
10          4.9         3.1   1.5   0.1 setosa 
# ℹ 140 more rows
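When every column needs the same kind of clean-up, renaming them one by one gets tedious. As a sketch, rename_with() applies a function to all the names at once; here it swaps the dots for underscores and lower-cases everything. The object name iris_clean is arbitrary.

```r
library(dplyr)

iris_clean <- iris %>%
  rename_with(~ tolower(gsub(".", "_", .x, fixed = TRUE)))

names(iris_clean)
```

The fixed = TRUE argument makes gsub() treat the dot as a literal character rather than a regular expression wildcard.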

1.9 Joining

left_join() / right_join() / inner_join() / full_join():

Joins add columns from one tibble (y) to another (x), matching the observations using key variables.

The four types of join differ in which observations they keep:

A left_join() keeps all observations in x.
A right_join() keeps all observations in y.
A full_join() keeps all observations in x and y.
An inner_join() keeps only the observations that have a match in both x and y.

We’ll use the band_members and band_instruments data frames for this

band_members

# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles

band_instruments

# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

You can see that both tibbles contain two variables, of which one is called name. This will be the key variable that is used for joining. If you don’t name a key, dplyr joins on every variable that has the same name in both tibbles and prints a message telling you which ones it used. This also works if there’s more than one key variable.
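Rather than relying on the automatic matching, you can name the key yourself with the by argument. This sketch assumes dplyr 1.1.0 or later for join_by(), and uses band_instruments2, a variant of band_instruments that ships with dplyr and calls its key column artist. The object names same_key and different_key are arbitrary.

```r
library(dplyr)

# Same key name in both tibbles: stating it also silences the "Joining with" message
same_key <- left_join(band_members, band_instruments, by = join_by(name))

# Different key names: match name in x against artist in y
different_key <- left_join(band_members, band_instruments2, by = join_by(name == artist))

same_key
different_key
```

Naming the key protects you if the two tibbles later gain another column with a shared name that you did not intend to join on.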

1.9.1 Left join

band_members %>% 
  left_join(
    band_instruments
  )

Joining with `by = join_by(name)`

# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass

You can see that the new left_joined tibble now contains three variables.

All three band members who were in the band_members tibble are still represented here, but Keith is not included in this tibble because left_join only adds new columns to observations that already exist in band_members.

1.9.2 Right Join

The right_join works in exactly the opposite way. Here the right_join adds new columns to the observations of the right hand tibble (i.e. to band_instruments).

right_join(
  band_members,
  band_instruments
)

Joining with `by = join_by(name)`

# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar

1.9.3 Full Join

The full_join keeps all the observations and all the columns of both data sets.

full_join(
  band_members,
  band_instruments
)

Joining with `by = join_by(name)`

# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar
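
The fourth join listed at the top of this section, inner_join(), keeps only the observations that appear in both x and y. A minimal sketch using the same two tibbles (the object name both is just illustrative):

```r
library(tidyverse)

# Keep only the names that appear in both tibbles
both <- inner_join(band_members, band_instruments, by = join_by(name))

both
```

Only John and Paul appear in both tibbles, so Mick and Keith are dropped from the result.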

1.10 ggplot

Now we’ve covered the basics of managing and manipulating tibbles, let’s look at the basics of drawing charts in ggplot.

To understand the syntax of ggplot, you have to understand how charts created with this system are built in layers. Just like how we pipe data through %>% when handling tibbles, we add new layers to ggplot charts using +

ggplot accepts piped data as an input. The initial ggplot is a blank chart with no axes. Let’s pipe the iris tibble in to a ggplot.

iris %>% 
  ggplot()

Next we want to describe the ‘aesthetics’ of the plot. This is how we define the variables that will contribute to the axes, groups, points, shapes, fills, areas and so on.

Let’s provide ggplot with some aesthetics in the form of an x (Sepal.Length) and y (Sepal.Width)

This should add the axes which will be appropriately scaled according to the limits of the two variables.

iris %>% 
  ggplot(aes(x = Sepal.Length,y = Sepal.Width))

To add the data points to the chart, we need to add a new layer. The type of chart is defined by which “geom” layer you choose to add next.

Let’s start simple and draw some points with geom_point(). Remember to add the new layer by putting a + at the end of the last line

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width)
          ) +
      geom_point()

This is useful, but doesn’t tell us anything about the points. Colours, grouping, fills etc. are defined in the aesthetics, so let’s colour the points according to which species they represent.

You can encode colours with color= or colour=

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Species
        )
          ) +
      geom_point()
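
It is worth noting the difference between mapping colour to a variable inside aes() and setting one fixed colour outside it. A minimal sketch (the colour name "steelblue" and the object names mapped and fixed are arbitrary choices for illustration):

```r
library(tidyverse)

# Mapped: colour varies with Species and a legend is drawn
mapped <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# Set: every point gets the same colour and there is no legend
fixed <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(color = "steelblue")

mapped
fixed
```

A common slip is to write aes(color = "steelblue"), which treats the text as a one-level variable rather than as a colour.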

If you provide a continuous variable to color you’ll get a nice result too.

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Petal.Length
        )
          ) +
      geom_point()

You can also use the shape= aesthetic to add different shapes. Here we can now see information on sepal length (x), sepal width (y), species (shape) and petal length (color).

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Petal.Length,
        shape=Species
        )
          ) +
      geom_point()

You can also use the size= aesthetic to get a very different plot.

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Species,
        size=Petal.Length
        )
          ) +
      geom_point()

1.10.1 Line charts

geom_line() draws lines between points

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        )
          ) +
      geom_line()

In this case, that makes little sense, because there are three species and a single line zigzags between all of them. Adding the colour aesthetic groups the data, so a separate line is drawn for each group.

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Species,
        )
          ) +
      geom_line()

You can combine more than one geom by adding extra layers

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Species,
        )
          ) +
      geom_line()+
      geom_point()

There are also variations like geom_smooth(), which draws a smoothed trend line through the data instead of joining every point.

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Species,
        )
          ) +
      geom_smooth()+
      geom_point()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
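
As the message above reports, geom_smooth() chose a loess curve here. If a straight-line fit suits your data better, one option is to ask for a linear model with method = "lm"; setting se = FALSE hides the shaded confidence band. A minimal sketch (linear_fit_plot is just an illustrative name):

```r
library(tidyverse)

linear_fit_plot <- iris %>%
  ggplot(aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  # One straight regression line per species, without the confidence ribbon
  geom_smooth(method = "lm", se = FALSE)

linear_fit_plot
```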

1.10.2 Facets

Facets allow you to draw multiple panels. facet_grid is a nice way to do this.

We’ll facet the line chart by Species. The facet is an arrangement of panels in columns and rows.

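The formula reads rows ~ columns, and a . means no faceting in that direction. To lay the panels out side by side, with one column per species, use facet_grid(.~Species)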

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Species,
        )
          ) +
      geom_smooth()+
      geom_point()+
  facet_grid(.~Species)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

To stack the panels vertically, with one row per species, use facet_grid(Species~.)

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Species,
        )
          ) +
      geom_smooth()+
      geom_point()+
  facet_grid(Species~.)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
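
When you only facet by a single variable, facet_wrap() is an alternative that wraps the panels into a grid of whatever shape you ask for. A sketch with two columns (ncol = 2 and the name wrapped_plot are arbitrary choices for illustration):

```r
library(tidyverse)

wrapped_plot <- iris %>%
  ggplot(aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  # Three panels wrapped across two columns
  facet_wrap(~ Species, ncol = 2)

wrapped_plot
```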

To arrange your facets around species in columns and on petal size in rows you could do

iris %>% 
  mutate(
    petal_bigger_than_average = Petal.Length >= mean(Petal.Length) 
  ) %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Species,
        )
          ) +
      geom_smooth()+
      geom_point()+
  facet_grid(
    petal_bigger_than_average ~  Species
    )

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

By default every panel shares the same axis limits, which span the full range of the variable across all panels. This can squash the data in some panels, so you may wish to ‘free’ the axes using scales="free", scales="free_x" or scales="free_y"

iris %>% 
    ggplot(
      aes(
        Sepal.Length,
        Sepal.Width,
        color=Species,
        )
          ) +
      geom_smooth()+
      geom_point()+
  facet_grid(
            Petal.Length >= mean(Petal.Length) ~ Species, 
            scales = "free"
            )

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

With the axes freed, we can finally see a linear relationship in some of the data. Setosa, for instance, has a strong correlation between sepal length and sepal width, which does not appear to be true of the other species, regardless of the petal length division.

1.10.3 Bar chart

geom_bar() – Creates bar plots, either stacked or grouped.

-   Example: `geom_bar(stat = "identity")` for bar heights based on a variable.

To plot a count of occurrences in the data use stat="count"

The default is a stacked bar chart. For this you only need to provide x.

iris_data %>% 
    ggplot(
      aes(
        Petal.Length > 4,
        fill = Species
        )
          )+
  geom_bar(stat="count")

You can change this to a side-by-side chart using `position = "dodge"`

iris %>% 
    ggplot(
      aes(
        Petal.Length > 4,
        fill = Species
        )
          )+
  geom_bar(stat="count",
           position = "dodge"
           )

iris %>% 
    ggplot(
      aes(
        Petal.Length > 4,
        fill = Species
        )
          )+
  geom_bar(stat="count"           )+
  facet_grid(.~Species)

In lots of cases you’ll have precomputed some summaries and will want to plot those exact values. Let’s reframe the iris data as a set of averages (see the reframe section) and then pipe the result in to a ggplot using stat="identity". Unlike with the stat="count" default, you need to provide both x (grouping) and y (value) data to stat="identity".

iris_data %>%
  group_by(Species) %>%
  reframe(
    count = n(),
    mean.sepal.length = mean(Sepal.Length),
    median.petal.length = median(Petal.Length)
  ) %>% 
  
  ggplot(
    aes(
      Species,
      mean.sepal.length,
      fill=Species  
      )
        )+
  geom_bar(stat="identity")

1.10.4 Columnar charts

geom_col() – Similar to geom_bar(), but used when heights are defined by variables instead of counts.

-   Example: `geom_col()`

This is essentially identical to using stat="identity" with a geom_bar()

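We can also use geom_errorbar() to add error bars. In the example below the bars span the mean plus or minus 1.96 standard deviations, which approximates the range containing 95% of the observations. For a 95% confidence interval of the mean you would use the standard error (the standard deviation divided by the square root of the count) in place of the standard deviation.

Note that geom_errorbar has its own set of aesthetics, to cover the upper (ymax) and lower (ymin) limits.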

Let’s start by making some statistics

iris %>%
  group_by(Species) %>%
  reframe(
    count = n(),
    mean.sepal.length = mean(Sepal.Length),
    sd.sepal.length = sd(Sepal.Length),
    lower = mean.sepal.length - (1.96*sd.sepal.length),
    upper = mean.sepal.length + (1.96*sd.sepal.length)
  )

# A tibble: 3 × 6
  Species    count mean.sepal.length sd.sepal.length lower upper
  <fct>      <int>             <dbl>           <dbl> <dbl> <dbl>
1 setosa        50              5.01           0.352  4.32  5.70
2 versicolor    50              5.94           0.516  4.92  6.95
3 virginica     50              6.59           0.636  5.34  7.83

Then pipe this in to ggplot

iris %>%
  group_by(Species) %>%
  reframe(
    count = n(),
    mean.sepal.length = mean(Sepal.Length),
    sd.sepal.length = sd(Sepal.Length),
    lower = mean.sepal.length - (1.96*sd.sepal.length),
    upper = mean.sepal.length + (1.96*sd.sepal.length)
  ) %>% 
  
  ggplot(
    aes(
      Species,
      mean.sepal.length,
      fill=Species  
      )
        )+
  geom_col()+
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)
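
The bars above span the mean plus or minus 1.96 standard deviations, which describes the spread of the observations. If what you want is an approximate 95% confidence interval for the mean itself, a common approach is to use the standard error instead. A minimal sketch, using summarise() and column names (mean_sepal_length, se_sepal_length) chosen here for illustration:

```r
library(tidyverse)

iris_ci <- iris %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mean_sepal_length = mean(Sepal.Length),
    se_sepal_length = sd(Sepal.Length) / sqrt(count),  # standard error of the mean
    lower = mean_sepal_length - 1.96 * se_sepal_length,
    upper = mean_sepal_length + 1.96 * se_sepal_length
  )

iris_ci %>%
  ggplot(aes(Species, mean_sepal_length, fill = Species)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)
```

With 50 flowers per species the standard error is much smaller than the standard deviation, so these bars are far narrower than the ones above.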

1.10.5 Histograms

geom_histogram() – Plots the frequency distribution of continuous data by creating bins.

-   Example: `geom_histogram()`

Histograms require only values of x

iris %>%
  ggplot(
    aes(
      Petal.Length
        )
        )+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can control the number of bins

iris %>%
  ggplot(
    aes(
      Petal.Length
        )
        )+
  geom_histogram(bins = 50)
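
As the earlier message suggests, you can set binwidth instead of bins, which fixes the width of each bin in the units of x. A sketch, where 0.25 is just an illustrative width and binwidth_plot an illustrative name:

```r
library(tidyverse)

binwidth_plot <- iris %>%
  ggplot(aes(Petal.Length)) +
  # Each bar covers a 0.25-wide interval of petal length
  geom_histogram(binwidth = 0.25)

binwidth_plot
```

Setting binwidth tends to be easier to reason about than bins, because the bars keep the same meaning if the range of the data changes.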

You can also add grouping as before

iris %>%
  ggplot(
    aes(
      Petal.Length,
      fill = Species
        )
        )+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.10.6 Density plots

geom_density() – Shows the distribution of a continuous variable with a smooth density curve.

-   Example: `geom_density()`

Density plots are very useful when you want to look at distributions of data in different classes. They are similar in many respects to histograms.

iris %>%
  ggplot(
    aes(
      Petal.Length,
      fill = Species
        )
        )+
  geom_density()

You’ll want to see what’s going on in the overlapping regions, so you can add transparency with alpha=. Transparency can be used in any ggplot.

iris %>%
  ggplot(
    aes(
      Petal.Length,
      fill = Species
        )
        )+
  geom_density(alpha=0.4)

1.10.7 Boxplots

geom_boxplot() – Visualizes the distribution of a variable through quartiles and potential outliers.

-   Example: `geom_boxplot()`

These are a mainstay of epidemiology.

iris %>%
  ggplot(
    aes(
      Species,
      Petal.Length,
      fill = Species
        )
        )+
  geom_boxplot()

1.10.8 Violin plots

geom_violin() – A hybrid of boxplot and density plot, showing distribution shape along with quartiles.

-   Example: `geom_violin()`

iris %>%
  ggplot(
    aes(
      Species,
      Petal.Length,
      fill = Species
        )
        )+
  geom_violin()

1.10.9 Violin & Box together

Adding a geom_boxplot() layer to a violin plot can be useful.

iris %>%
  ggplot(
    aes(
      Species,
      Petal.Length,
      fill = Species
        )
        )+
  geom_boxplot()+
  geom_violin()

But this is ugly, and the boxplots are obscured by the violins. The order of the layers in a ggplot matters: later layers are drawn on top of earlier ones.

iris %>%
  ggplot(
    aes(
      Species,
      Petal.Length,
      fill = Species
        )
        )+
  geom_violin()+
  geom_boxplot()

Changing the order of the layers improves things, but we can also control each layer individually by giving it its own mappings and arguments. This is the reason why the geoms all have brackets!

Let’s fill the violin plots, adding some transparency. We’ll also make the boxes on the boxplots a bit narrower so that we can see all the violin data, and let’s remove the outlier points.

iris %>%
  ggplot(
    aes(
      Species,
      Petal.Length
        )
        )+
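  geom_violin(mapping = aes(fill = Species), alpha = 0.4) +  # fill and transparency apply to the violins only
  geom_boxplot(width = 0.1, outliers = FALSE)                # narrow boxes, outlier points hidden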

Finally, let’s add the points, jittering them so that they don’t all line up along the midlines with geom_jitter.

iris %>%
  ggplot(
    aes(
      Species,
      Petal.Length
        )
        )+
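  geom_violin(mapping = aes(fill = Species), alpha = 0.4) +  # fill and transparency apply to the violins only
  geom_boxplot(width = 0.1, outliers = FALSE) +              # narrow boxes, outlier points hidden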
  geom_jitter(size=0.3,width = 0.2)

1.10.10 Tile plots

geom_tile() – Creates heatmap-like visuals by filling rectangular areas based on values.

-   Example: `geom_tile()`

This kind of heatmap is great for showing the value of a variable (defined with fill) in a grid representing several classes on the x and y axes.

iris %>%
  ggplot(
    aes(
      Species,
      Petal.Length>3,
      fill = Petal.Width
        )
        )+
geom_tile()
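
In the plot above several rows of iris fall into each tile, so tiles are drawn on top of one another. It is often clearer to summarise first so that each tile shows a single value. A sketch, assuming the mean petal width is the value of interest (tile_data, long_petal and mean_petal_width are names introduced here for illustration):

```r
library(tidyverse)

# One row per combination of Species and petal-length class
tile_data <- iris %>%
  mutate(long_petal = Petal.Length > 3) %>%
  group_by(Species, long_petal) %>%
  summarise(mean_petal_width = mean(Petal.Width), .groups = "drop")

tile_data %>%
  ggplot(aes(Species, long_petal, fill = mean_petal_width)) +
  geom_tile()
```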

geom_jitter() – Adds small random noise to points, useful for avoiding overplotting.

-   Example: `geom_jitter()`

geom_ribbon() – Fills the area between two y-values (usually a line and its confidence interval).

-   Example: `geom_ribbon()`

geom_text() – Adds text annotations to points in the plot.

-   Example: `geom_text(aes(label = ...))`

geom_errorbar() – Adds error bars to plots (e.g., for displaying variability or uncertainty).

-   Example: `geom_errorbar()`
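
As an illustration of one of these, the sketch below uses geom_text() to label each column of a summary chart with its value. The vjust setting, the rounding and the names species_means and labelled_plot are arbitrary choices for illustration.

```r
library(tidyverse)

species_means <- iris %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))

labelled_plot <- species_means %>%
  ggplot(aes(Species, mean_sepal_length, fill = Species)) +
  geom_col() +
  # Print the rounded mean just above each column
  geom_text(aes(label = round(mean_sepal_length, 2)), vjust = -0.5)

labelled_plot
```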

1.10.11 Separate wider

df <- tibble(id = 1:3, patient_id = c("m-123", "f-455", "f-123"))
df

# A tibble: 3 × 2
     id patient_id
  <int> <chr>     
1     1 m-123     
2     2 f-455     
3     3 f-123

There are three basic ways to split up a string into pieces

1.10.11.1 With a delimiter

df %>% 
  separate_wider_delim(
    patient_id, 
    delim = "-", 
    names = c("gender", "unit")
    )

# A tibble: 3 × 3
     id gender unit 
  <int> <chr>  <chr>
1     1 m      123  
2     2 f      455  
3     3 f      123
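
By default separate_wider_delim() raises an error if a string splits into more or fewer pieces than there are names. The too_few and too_many arguments control what happens instead. A sketch on a small made-up tibble (messy is hypothetical data, not part of the example above):

```r
library(tidyverse)

# Hypothetical IDs: one well formed, one with an extra piece, one with a piece missing
messy <- tibble(patient_id = c("m-123", "f-455-x", "f"))

tidy_ids <- messy %>%
  separate_wider_delim(
    patient_id,
    delim = "-",
    names = c("gender", "unit"),
    too_few = "align_start",  # missing pieces become NA on the right
    too_many = "drop"         # extra pieces are discarded
  )

tidy_ids
```

While you are still exploring the data, too_few = "debug" and too_many = "debug" are a useful alternative, because they add extra columns showing which rows caused the problem.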

1.10.11.2 By string length

Here you provide a set of widths that map runs of characters to new columns.

The example data are in the form m-123 where the m represents gender, the - is just a delimiter and the 123 is the participant ID.

We can assign characters with var = n, where n is the width of the piece in characters. Building the widths up step by step:

widths = c(gender = 1)

assigns the first character in the string to a new variable gender.

widths = c(gender = 1, 1)

assigns the first character in the string to a new variable gender.
The next character is unnamed, so it will be dropped.

widths = c(gender = 1, 1, unit=3)

assigns the first character in the string to a new variable gender.
The next character will be dropped.
Finally, the last 3 characters are assigned to a new variable unit.

Only this last version accounts for all five characters. By default separate_wider_position() raises an error when the widths do not cover the whole string.

1.10.11.3 Or by REGEX

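Regular expressions are a powerful language for string matching. In the example below "." matches any single character and "\\d+" matches one or more digits. Named patterns become new columns, and unnamed patterns are matched but dropped.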

df %>% 
  separate_wider_regex(
    patient_id, c(gender = ".", ".", unit = "\\d+"))

# A tibble: 3 × 3
     id gender unit 
  <int> <chr>  <chr>
1     1 m      123  
2     2 f      455  
3     3 f      123

For comparison, a full example of the width-based approach from the previous section looks like this

df %>% 
  separate_wider_position(
    cols = patient_id,
    widths = c(gender = 1, 1, unit=3))

# A tibble: 3 × 3
     id gender unit 
  <int> <chr>  <chr>
1     1 m      123  
2     2 f      455  
3     3 f      123

1.10.12 Unite

unite() is the reverse operation: it pastes several columns together into one.

df <- expand_grid(x = c("a", NA), y = c("b", NA))
df

# A tibble: 4 × 2
  x     y    
  <chr> <chr>
1 a     b    
2 a     <NA> 
3 <NA>  b    
4 <NA>  <NA>

1.10.12.1 Unite, keeping NAs

df %>% 
  unite(
    "z",
    x:y,
    na.rm = FALSE,
    remove = FALSE)

# A tibble: 4 × 3
  z     x     y    
  <chr> <chr> <chr>
1 a_b   a     b    
2 a_NA  a     <NA> 
3 NA_b  <NA>  b    
4 NA_NA <NA>  <NA>

1.10.12.2 Unite, dropping NAs and keeping the originals

df %>% 
  unite(
    "z",
    x:y, 
    na.rm = TRUE,
    remove = FALSE
    )

# A tibble: 4 × 3
  z     x     y    
  <chr> <chr> <chr>
1 "a_b" a     b    
2 "a"   a     <NA> 
3 "b"   <NA>  b    
4 ""    <NA>  <NA>
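
The separator defaults to an underscore, and you can change it with sep. Leaving remove at its default of TRUE drops the original columns. A minimal sketch (united is just an illustrative name):

```r
library(tidyverse)

df <- expand_grid(x = c("a", NA), y = c("b", NA))

united <- df %>%
  unite("z", x:y, sep = "-")  # remove = TRUE by default, so x and y are dropped

united
```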

1.10.13 Summary of Exclusive Helper Functions:

Conditional Operations: case_when(), if_else()
Range Check: between()
Missing Value Handling: coalesce(), is.na()
Cumulative Functions: cumsum(), cummean(), cumall(), cumany()
Row-based Operations: lag(), lead(), nth(), row_number()
Summarizing or Counting: n(), pmin(), pmax(), any(), all()

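These helper functions are typically used inside verbs like mutate(), filter(), summarise() and arrange() to perform specialized operations within a single table. Most come from dplyr, although is.na(), cumsum(), pmin(), pmax(), any() and all() are base R functions that work in the same places.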
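
To make this concrete, here is a minimal sketch that uses several of these helpers inside a single mutate(). The scores tibble, the cut-off values and the new column names are all made up for illustration, and the .default argument of case_when() assumes dplyr 1.1.0 or later.

```r
library(tidyverse)

# Hypothetical data with one missing score
scores <- tibble(id = 1:4, score = c(10, NA, 35, 20))

scores_checked <- scores %>%
  mutate(
    score_filled = coalesce(score, 0),         # replace NA with 0
    in_range = between(score_filled, 5, 30),   # TRUE when 5 <= value <= 30
    band = case_when(
      score_filled >= 30 ~ "high",
      score_filled >= 10 ~ "medium",
      .default = "low"
    ),
    previous = lag(score_filled),              # value from the row above
    running_total = cumsum(score_filled)       # cumulative sum down the column
  )

scores_checked
```

Because later arguments to mutate() can see columns created earlier in the same call, score_filled can be reused straight away by the other helpers.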