7  Introduction to Tidyverse

7.1 What is the tidyverse?

  • Collection of essential R packages for data science.

  • All packages share a common design philosophy, grammar, and data structures.

7.2 Setup

install.packages("tidyverse") # install tidyverse packages
library(tidyverse) # load tidyverse packages
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

7.3 Tibble

  • A tibble is a modern re-imagining of the data frame.

7.3.1 Create a tibble

library(tidyverse) # or load only the tibble package: library(tibble)
first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
first.tbl
# A tibble: 3 × 2
  height weight
   <dbl>  <dbl>
1    150     45
2    200     60
3    160     51
class(first.tbl)
[1] "tbl_df"     "tbl"        "data.frame"

7.3.2 Convert an existing dataframe to a tibble

as_tibble(iris)
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

7.3.3 Convert a tibble to a dataframe

first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
class(first.tbl)
[1] "tbl_df"     "tbl"        "data.frame"
first.tbl.df <- as.data.frame(first.tbl)
class(first.tbl.df)
[1] "data.frame"

7.3.4 tibble vs data.frame

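  1. They print differently: a tibble shows its dimensions and the type of each column, and prints only the first 10 rows of a long table.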

tibble

first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
first.tbl
# A tibble: 3 × 2
  height weight
   <dbl>  <dbl>
1    150     45
2    200     60
3    160     51

data.frame

dataframe <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51))
dataframe
  height weight
1    150     45
2    200     60
3    160     51

  2. With tibble(), you can create new variables that are functions of existing variables.

tibble

first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51), 
                    bmi = (weight)/height^2)
first.tbl
# A tibble: 3 × 3
  height weight     bmi
   <dbl>  <dbl>   <dbl>
1    150     45 0.002  
2    200     60 0.0015 
3    160     51 0.00199

data.frame

df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51), 
                    bmi = (weight)/height^2) # Not working

You will get an error message:

Error in data.frame(height = c(150, 200, 160), weight = c(45, 60, 51), : object 'height' not found.

With data.frame(), create the data frame first and then add the new variable computed from the existing columns:

df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51)) 
df$bmi <- (df$weight)/(df$height^2)
df
  height weight         bmi
1    150     45 0.002000000
2    200     60 0.001500000
3    160     51 0.001992188
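
As a side note, base R can also compute a column from existing columns in a single step after the data frame exists. The following is a minimal sketch using transform(), which is not part of the original example; it is one possible alternative, not the only way.

```r
# Build the data frame first; data.frame() itself cannot see sibling columns.
df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51))

# transform() evaluates the expression using the columns of df.
df <- transform(df, bmi = weight / height^2)
df
```

The result has the same three columns as the df$bmi version above.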

  3. Unlike data.frame(), which by default (check.names = TRUE) rewrites names into syntactically valid ones, tibble() keeps variable names that contain spaces.

Example 1

tbl <- tibble(`patient id` = c(1, 2, 3))
tbl
# A tibble: 3 × 1
  `patient id`
         <dbl>
1            1
2            2
3            3
df <- data.frame(`patient id` = c(1, 2, 3))
df
  patient.id
1          1
2          2
3          3

  4. Likewise, tibble() keeps variable names that start with a number, whereas data.frame() prefixes them with X by default.

tbl <- tibble(`1var` = c(1, 2, 3))
tbl
# A tibble: 3 × 1
  `1var`
   <dbl>
1      1
2      2
3      3
df <- data.frame(`1var` = c(1, 2, 3))
df
  X1var
1     1
2     2
3     3

In general, tibbles do not change the names of input variables and do not use row names.
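
The renaming shown for data.frame() comes from its check.names argument, which defaults to TRUE. As a rough sketch (the second column and its values are made up for illustration), turning it off keeps the names as typed:

```r
# check.names = FALSE stops data.frame() from rewriting the column names.
df <- data.frame(`patient id` = c(1, 2, 3),
                 `1var` = c(4, 5, 6),      # illustrative values only
                 check.names = FALSE)
names(df)

# Non-syntactic names must be quoted with backticks when used with $.
df$`patient id`
```

Tibbles behave this way without any extra argument.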

  5. A tibble can have columns that are lists.

tibble

tbl <- tibble(x = 1:3, y = list(1:3, 1:4, 1:10))
tbl
# A tibble: 3 × 2
      x y         
  <int> <list>    
1     1 <int [3]> 
2     2 <int [4]> 
3     3 <int [10]>

data.frame

data.frame() does not build a list column from a plain list. Instead, it treats each list element as a separate column, so list elements of different lengths produce an error.

df <- data.frame(x = 1:3, y = list(1:3, 1:4, 1:10)) ## Not working, error

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 3, 4, 10
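
For completeness, a data frame can hold a list column if the list is protected with I(), which tells data.frame() to store it as is. This is a minimal sketch of that workaround, not part of the original example; tibble() remains the more convenient route.

```r
# I() marks the list "as is", so it becomes one column with three elements.
df <- data.frame(x = 1:3, y = I(list(1:3, 1:4, 1:10)))

nrow(df)        # one row per list element
lengths(df$y)   # length of each list element
```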

7.3.5 Subsetting: tibble vs data.frame

7.3.5.1 Subsetting single columns

data.frame

df <- data.frame(x = 1:3, 
                 yz = c(10, 20, 30)); df
  x yz
1 1 10
2 2 20
3 3 30
df[, "x"]
[1] 1 2 3
df[, "x", drop=FALSE]
  x
1 1
2 2
3 3

tibble

tbl <- tibble(x = 1:3, 
              yz = c(10, 20, 30)); tbl
# A tibble: 3 × 2
      x    yz
  <int> <dbl>
1     1    10
2     2    20
3     3    30
tbl[, "x"]
# A tibble: 3 × 1
      x
  <int>
1     1
2     2
3     3

Unlike a data frame, a tibble subset with [ stays a tibble, even for a single column. To extract the column as a vector instead:

# Method 1
tbl[, "x", drop = TRUE]
[1] 1 2 3
# Method 2
as.data.frame(tbl)[, "x"]
[1] 1 2 3
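
Two other common ways to pull a single column out of a tibble as a vector are [[ and $. The sketch below assumes the tibble package is installed and reuses the same example data.

```r
library(tibble)

tbl <- tibble(x = 1:3, yz = c(10, 20, 30))

# Both forms return the underlying vector rather than a one-column tibble.
by_name   <- tbl[["x"]]
by_dollar <- tbl$x

by_name
by_dollar
```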

7.3.5.2 Subsetting single rows with the drop argument

data.frame

df[1, , drop = TRUE]
$x
[1] 1

$yz
[1] 10

tibble

tbl[1, , drop = TRUE]
# A tibble: 1 × 2
      x    yz
  <int> <dbl>
1     1    10
as.list(tbl[1, ])
$x
[1] 1

$yz
[1] 10

7.3.5.3 Accessing non-existent columns

Neither object below has a column named y. With $, a data frame matches column names partially, so df$y silently returns the yz column. A tibble never does partial matching and warns when the column does not exist.

data.frame

df$y
[1] 10 20 30
df[["y", exact = FALSE]]
[1] 10 20 30
df[["y", exact = TRUE]]
NULL

tibble

tbl$y
Warning: Unknown or uninitialised column: `y`.
NULL
tbl[["y", exact = FALSE]]
Warning: `exact` ignored.
NULL
tbl[["y", exact = TRUE]]
NULL
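
If you work with plain data frames, base R can be asked to warn about partial matching with $. The sketch below uses the warnPartialMatchDollar option; the variable names are only for illustration.

```r
df <- data.frame(x = 1:3, yz = c(10, 20, 30))

partial <- df$y       # partial match: returns the yz column without complaint
exact   <- df[["y"]]  # [[ matches exactly by default, so this is NULL

# Turn on the warning, check that df$y now triggers it, then restore the option.
old <- options(warnPartialMatchDollar = TRUE)
warned <- tryCatch({ df$y; FALSE }, warning = function(w) TRUE)
options(old)

partial
exact
warned
```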

7.3.6 Some functions that work with both tibbles and dataframes

names(), colnames(), rownames(), ncol(), nrow(), length() # length of the underlying list

tibble

tb <- tibble(a = 1:3)
names(tb)
[1] "a"
colnames(tb)
[1] "a"
rownames(tb)
[1] "1" "2" "3"
nrow(tb); ncol(tb); length(tb)
[1] 3
[1] 1
[1] 1

data.frame

df <- data.frame(a = 1:3)
names(df)
[1] "a"
colnames(df)
[1] "a"
rownames(df)
[1] "1" "2" "3"
nrow(df); ncol(df); length(df)
[1] 3
[1] 1
[1] 1

The tibble package also provides some additional functions:

is_tibble(tb) 
[1] TRUE
glimpse(tb)
Rows: 3
Columns: 1
$ a <int> 1, 2, 3

7.4 Factors

factor

  • A vector that is used to store categorical variables.

  • It can only contain predefined values. Hence, factors are useful when you know the possible values a variable may take.

Creating a factor vector

grades <- factor(c("A", "A", "A", "C", "B"))
grades
[1] A A A C B
Levels: A B C

Now let’s check the class type

class(grades) # It's a factor
[1] "factor"

To obtain all levels

levels(grades)
[1] "A" "B" "C"

7.4.1 Creating a factor vector

  • With factors, all possible values of a variable can be declared through the levels argument, even values that do not occur in the data.

grade_factor_vctr <- 
  factor(c("A", "D", "A", "C", "B"), 
         levels = c("A", "B", "C", "D", "E"))
grade_factor_vctr
[1] A D A C B
Levels: A B C D E
levels(grade_factor_vctr)
[1] "A" "B" "C" "D" "E"
class(levels(grade_factor_vctr))
[1] "character"

7.4.2 Character vector vs Factor

  • Observe the differences in outputs. Factor prints all possible levels of the variable.

Character vector

grade_character_vctr <- c("A", "D", "A", "C", "B")
grade_character_vctr
[1] "A" "D" "A" "C" "B"

Factor vector

grade_factor_vctr <- factor(c("A", "D", "A", "C", "B"), 
         levels = c("A", "B", "C", "D", "E"))
grade_factor_vctr
[1] A D A C B
Levels: A B C D E

  • Factors look like character vectors, but they are stored as integer codes that point to the levels.

Character vector

typeof(grade_character_vctr)
[1] "character"

Factor vector

typeof(grade_factor_vctr)
[1] "integer"
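
To make the integer storage concrete, the sketch below shows the codes behind a factor, and a common pitfall when a factor holds number-like labels. The sizes vector is made up for illustration.

```r
grades <- factor(c("A", "D", "A", "C", "B"),
                 levels = c("A", "B", "C", "D", "E"))

codes <- as.integer(grades)   # position of each value in levels(grades)
codes
levels(grades)[codes]         # mapping the codes back gives the labels

# Pitfall: as.integer() returns the codes, not the numbers in the labels.
sizes <- factor(c("10", "5", "10"))
as.integer(sizes)                 # codes, based on levels sorted as strings
as.numeric(as.character(sizes))   # the actual numbers
```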

  • Let’s create a frequency table with the table() function.

Character vector output with table function

grade_character_vctr <- c("A", "D", "A", "C", "B")
table(grade_character_vctr)
grade_character_vctr
A B C D 
2 1 1 1 

Factor vector (with levels) output with table function

grade_factor_vctr <- 
  factor(c("A", "D", "A", "C", "B"), 
         levels = c("A", "B", "C", "D", "E"))
table(grade_factor_vctr)
grade_factor_vctr
A B C D E 
2 1 1 1 0 

  • For the factor, the output shows counts for all possible levels of the variable. Hence, with factors it is obvious when some levels contain no observations.

  • With factors, you can’t assign values that are not listed in the levels; with character vectors there is no such restriction.

Character vector

grade_character_vctr[2] <- "A+"
grade_character_vctr
[1] "A"  "A+" "A"  "C"  "B" 

Factor vector

grade_factor_vctr[2] <- "A+"
Warning in `[<-.factor`(`*tmp*`, 2, value = "A+"): invalid factor level, NA
generated
grade_factor_vctr
[1] A    <NA> A    C    B   
Levels: A B C D E
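
If a new value really is needed, one option is to add it to the levels before assigning it. The following is a minimal sketch of that idea using base R only; it starts from a fresh copy of the factor.

```r
grades <- factor(c("A", "D", "A", "C", "B"),
                 levels = c("A", "B", "C", "D", "E"))

# Append a new level; existing values keep their labels.
levels(grades) <- c(levels(grades), "A+")
grades[2] <- "A+"   # now a valid assignment, no NA is generated
grades

# droplevels() removes levels that no longer occur in the data.
used_only <- droplevels(grades)
levels(used_only)
```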

7.4.3 Modify factor levels

This is our factor:

grade_factor_vctr
[1] A    <NA> A    C    B   
Levels: A B C D E

7.4.4 Change labels

levels(grade_factor_vctr) <- 
  c("Excellent", "Good", "Average", "Poor", "Fail")
grade_factor_vctr
[1] Excellent <NA>      Excellent Average   Good     
Levels: Excellent Good Average Poor Fail

7.4.5 Reverse the level arrangement

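# Reorder the levels without changing the data values.
# (levels(x) <- rev(levels(x)) would relabel the values instead.)
grade_factor_vctr <- factor(grade_factor_vctr, 
                            levels = rev(levels(grade_factor_vctr)))
grade_factor_vctr
[1] Excellent <NA>      Excellent Average   Good     
Levels: Fail Poor Average Good Excellent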
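
Renaming levels and reordering levels are easy to confuse. Assigning to levels() renames the levels in place, so existing values take on the new labels, while factor(x, levels = ...) changes only the order. A small sketch with a made-up two-level factor:

```r
x <- factor(c("low", "high", "low"), levels = c("low", "high"))

# Relabel: the first level is renamed "high", so every "low" becomes "high".
relabelled <- x
levels(relabelled) <- rev(levels(relabelled))

# Reorder: the values are untouched, only the level order changes.
reordered <- factor(x, levels = rev(levels(x)))

as.character(relabelled)
as.character(reordered)
levels(reordered)
```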

7.4.6 Order of factor levels

By default, levels are sorted alphabetically:

fv1 <- factor(c("D","E","E","A", "B", "C"))
fv1
[1] D E E A B C
Levels: A B C D E
fv2 <- factor(c("1T","2T","3A","4A", "5A", "6B", "3A"))
fv2
[1] 1T 2T 3A 4A 5A 6B 3A
Levels: 1T 2T 3A 4A 5A 6B
df <- data.frame(fv2=fv2)
library(ggplot2)
ggplot(df, aes(x=fv2)) + geom_bar()

You can change the order of levels with the levels argument, which also changes the order of the bars in the plot:

fv2 <- factor(c("1T","2T","3A","4A", "5A", "6B", "3A"), 
              levels = c("3A", "4A", "5A", "6B", "1T", "2T"))
fv2
[1] 1T 2T 3A 4A 5A 6B 3A
Levels: 3A 4A 5A 6B 1T 2T
df <- data.frame(fv2=fv2)
library(ggplot2)
ggplot(df, aes(x=fv2)) + geom_bar()
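
The level order does not have to be typed by hand. As an illustration, the sketch below derives it from the data so that the most frequent value comes first, using base R only (the forcats function fct_infreq() offers a similar reordering).

```r
fv2 <- factor(c("1T", "2T", "3A", "4A", "5A", "6B", "3A"))

# Count each level and sort the counts from most to least frequent.
counts <- sort(table(fv2), decreasing = TRUE)

# Rebuild the factor with levels in that order; the values stay the same.
fv2_by_freq <- factor(fv2, levels = names(counts))
levels(fv2_by_freq)   # "3A" comes first because it occurs twice
```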

Note that tibbles do not change the types of input variables (e.g., strings are not converted to factors by default).

tbl <- tibble(x1 = c("setosa", "versicolor", "virginica", "setosa"))
tbl
# A tibble: 4 × 1
  x1        
  <chr>     
1 setosa    
2 versicolor
3 virginica 
4 setosa    
df <- data.frame(x1 = c("setosa", "versicolor", "virginica", "setosa"))
df
          x1
1     setosa
2 versicolor
3  virginica
4     setosa
class(df$x1)
[1] "character"

7.5 Pipe operator: %>% or |>

7.5.1 Required package: magrittr (for %>%)

install.packages("magrittr")
library(magrittr)

Attaching package: 'magrittr'
The following object is masked from 'package:purrr':

    set_names
The following object is masked from 'package:tidyr':

    extract

7.5.2 What does it do?

It takes whatever is on the left-hand-side of the pipe and makes it the first argument of whatever function is on the right-hand-side of the pipe.

For instance,

mean(1:10)
[1] 5.5

can be written as

1:10 %>% mean()
[1] 5.5

7.5.3 Illustrations

  1. x %>% f(y) turns into f(x, y)

  2. x %>% f(y) %>% g(z) turns into g(f(x, y), z)
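
The same rewriting applies to the native pipe |>, which is built into R from version 4.1.0 and needs no package. A minimal sketch, assuming R 4.1.0 or later; square() is a hypothetical helper defined only for this illustration.

```r
# Hypothetical helper, used only for this example.
square <- function(x) x^2

# x |> f() |> g() is rewritten to g(f(x)).
nested <- sum(square(1:3))
piped  <- 1:3 |> square() |> sum()

# x |> f(y) is rewritten to f(x, y).
rounded <- c(1.234, 5.678) |> round(1)

nested
piped
rounded
```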

7.5.4 Why %>%

  • The pipe lets you write a sequence of operations from left to right instead of nesting function calls, which makes your code more readable.

Method 1: Without using pipe (hard to read)

colSums(matrix(c(1, 2, 3, 4, 8, 9, 10, 12), nrow=2))
[1]  3  7 17 22

Method 2: Using pipe (easy to read)

c(1, 2, 3, 4, 8, 9, 10, 12) %>%
  matrix( , nrow = 2) %>%
  colSums()
[1]  3  7 17 22

or

c(1, 2, 3, 4, 8, 9, 10, 12) %>%
  matrix(nrow = 2) %>% # remove comma
  colSums()
[1]  3  7 17 22

7.5.5 Rules

library(tidyverse) # to use as_tibble
library(magrittr) # to use %>%
df <- data.frame(x1 = 1:3, x2 = 4:6)
df
  x1 x2
1  1  4
2  2  5
3  3  6

Rule 1

head(df) 
df %>% head()
  x1 x2
1  1  4
2  2  5
3  3  6

Rule 2

head(df, n = 2)  
df %>% head(n = 2)
  x1 x2
1  1  4
2  2  5

Rule 3

head(df, n = 2)
2 %>% head(df, n = .)
  x1 x2
1  1  4
2  2  5

Rule 4

head(as_tibble(df), n = 2)
df %>% as_tibble() %>%
  head(n = 2)
# A tibble: 2 × 2
     x1    x2
  <int> <int>
1     1     4
2     2     5

Rule 5: subsetting

df$x1
df %>% .$x1
[1] 1 2 3

or

df[["x1"]]
df %>% .[["x1"]]
[1] 1 2 3

or

df[[1]]
df %>% .[[1]]
[1] 1 2 3
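
The dot placeholder in Rules 3 and 5 is specific to %>%. With the native pipe |>, one way to get the same effect is an anonymous function or a helper such as getElement(). The following is a sketch under the assumption of R 4.1.0 or later, not a full list of options.

```r
df <- data.frame(x1 = 1:3, x2 = 4:6)

# Rule 3 equivalent: pass the left-hand side to an argument other than the first.
first_two <- 2 |> (\(n) head(df, n = n))()

# Rule 5 equivalent: extract a column by name.
x1_values <- df |> getElement("x1")

first_two
x1_values
```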

7.5.6 Offline reading materials

Run the following code to see more examples:

vignette("magrittr")
vignette("tibble")