7 Introduction to Tidyverse

7.1 What is the tidyverse?

Collection of essential R packages for data science.
All packages share a common design philosophy, grammar, and data structures.

7.2 Setup

install.packages("tidyverse") # install tidyverse packages
library(tidyverse) # load tidyverse packages

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

7.3 The Tidyverse data analysis workflow

7.4 Tibble

Tibble is a modern version of dataframes.
A modern re-imagining of data frames.

7.4.1 Create a tibble

library(tidyverse) # library(tibble)
first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
first.tbl

# A tibble: 3 × 2
  height weight
   <dbl>  <dbl>
1    150     45
2    200     60
3    160     51

class(first.tbl)

[1] "tbl_df"     "tbl"        "data.frame"

7.4.2 Convert an existing dataframe to a tibble

as_tibble(iris)

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

7.4.3 Convert a tibble to a dataframe

first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
class(first.tbl)

[1] "tbl_df"     "tbl"        "data.frame"

first.tbl.df <- as.data.frame(first.tbl)
class(first.tbl.df)

[1] "data.frame"

7.4.4 tibble vs data.frame

The way they print output

tibble

first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
first.tbl

# A tibble: 3 × 2
  height weight
   <dbl>  <dbl>
1    150     45
2    200     60
3    160     51

data.frame

dataframe <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51))
dataframe

  height weight
1    150     45
2    200     60
3    160     51

With tibble you can create new variables that are functions of existing variables.

tibble

first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51), 
                    bmi = (weight)/height^2)
first.tbl

# A tibble: 3 × 3
  height weight     bmi
   <dbl>  <dbl>   <dbl>
1    150     45 0.002  
2    200     60 0.0015 
3    160     51 0.00199

data.frame

df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51), 
                    bmi = (weight)/height^2) # Not working

You will get an error message

Error in data.frame(height = c(150, 200, 160), weight = c(45, 60, 51), : object 'height' not found.

With data.frame this is how we should create a new variable from the existing columns.

df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51)) 
df$bmi <- (df$weight)/(df$height^2)
df

  height weight         bmi
1    150     45 0.002000000
2    200     60 0.001500000
3    160     51 0.001992188

In contrast to data frames, the variable names in tibbles can contain spaces.

Example 1

tbl <- tibble(`patient id` = c(1, 2, 3))
tbl

# A tibble: 3 × 1
  `patient id`
         <dbl>
1            1
2            2
3            3

df <- data.frame(`patient id` = c(1, 2, 3))
df

  patient.id
1          1
2          2
3          3

In contrast to data frames, the variable names in tibbles can start with a number.

tbl <- tibble(`1var` = c(1, 2, 3))
tbl

# A tibble: 3 × 1
  `1var`
   <dbl>
1      1
2      2
3      3

df <- data.frame(`1var` = c(1, 2, 3))
df

In general, tibbles do not change the names of input variables and do not use row names.

A tibble can have columns that are lists.

tibble

tbl <- tibble (x = 1:3, y = list(1:3, 1:4, 1:10))
tbl

# A tibble: 3 × 2
      x y         
  <int> <list>    
1     1 <int [3]> 
2     2 <int [4]> 
3     3 <int [10]>

data.frame

This feature is not available in data.frame.

If we try to do this with a traditional data frame we get an error.

df <- data.frame(x = 1:3, y = list(1:3, 1:4, 1:10)) ## Not working, error

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 3, 4, 10

7.4.5 Subsetting: tibble vs data.frame

7.4.5.1 Subsetting single columns

data.frame

df <- data.frame(x = 1:3, 
                 yz = c(10, 20, 30)); df

df[, "x"]

[1] 1 2 3

df[, "x", drop=FALSE]

tibble

tbl <- tibble(x = 1:3, 
              yz = c(10, 20, 30)); tbl

# A tibble: 3 × 2
      x    yz
  <int> <dbl>
1     1    10
2     2    20
3     3    30

tbl[, "x"]

# A tibble: 3 × 1
      x
  <int>
1     1
2     2
3     3

tbl <- tibble(x = 1:3, 
              yz = c(10, 20, 30))
tbl

# A tibble: 3 × 2
      x    yz
  <int> <dbl>
1     1    10
2     2    20
3     3    30

tbl[, "x"]

# A tibble: 3 × 1
      x
  <int>
1     1
2     2
3     3

# Method 1
tbl[, "x", drop = TRUE]

[1] 1 2 3

# Method 2
as.data.frame(tbl)[, "x"]

[1] 1 2 3

7.4.5.2 Subsetting single rows with the drop argument

data.frame

df[1, , drop = TRUE]

$x
[1] 1

$yz
[1] 10

tibble

tbl[1, , drop = TRUE]

# A tibble: 1 × 2
      x    yz
  <int> <dbl>
1     1    10

as.list(tbl[1, ])

$x
[1] 1

$yz
[1] 10

7.4.5.3 Accessing non-existent columns

data.frame

df$y

[1] 10 20 30

df[["y", exact = FALSE]]

[1] 10 20 30

df[["y", exact = TRUE]]

NULL

tibble

tbl$y

Warning: Unknown or uninitialised column: `y`.

NULL

tbl[["y", exact = FALSE]]

Warning: `exact` ignored.

NULL

tbl[["y", exact = TRUE]]

NULL

7.4.6 Some functions that work with both tibbles and dataframes

names(), colnames(), rownames(), ncol(), nrow(), length() # length of the underlying list

tibble

tb <- tibble(a = 1:3)
names(tb)

[1] "a"

colnames(tb)

[1] "a"

rownames(tb)

[1] "1" "2" "3"

nrow(tb); ncol(tb); length(tb)

[1] 3

[1] 1

[1] 1

data.frame

df <- data.frame(a = 1:3)
names(df)

[1] "a"

colnames(df)

[1] "a"

rownames(df)

[1] "1" "2" "3"

nrow(df); ncol(df); length(df)

[1] 3

[1] 1

[1] 1

However, when using tibble, we can use some additional commands

is_tibble(tb)

[1] TRUE

glimpse(tb)

Rows: 3
Columns: 1
$ a <int> 1, 2, 3

7.5 Factors

factor

A vector that is used to store categorical variables.
It can only contain predefined values. Hence, factors are useful when you know the possible values a variable may take.

Creating a factor vector

grades <- factor(c("A", "A", "A", "C", "B"))
grades

[1] A A A C B
Levels: A B C

Now let’s check the class type

class(grades) # It's a factor

[1] "factor"

To obtain all levels

levels(grades)

[1] "A" "B" "C"

7.5.1 Creating a factor vector

With factors all possible values of the variables can be defined under levels.

grade_factor_vctr <- 
  factor(c("A", "D", "A", "C", "B"), 
         levels = c("A", "B", "C", "D", "E"))
grade_factor_vctr

[1] A D A C B
Levels: A B C D E

levels(grade_factor_vctr)

[1] "A" "B" "C" "D" "E"

class(levels(grade_factor_vctr))

[1] "character"

7.5.2 Character vector vs Factor

Observe the differences in outputs. Factor prints all possible levels of the variable.

Character vector

grade_character_vctr <- c("A", "D", "A", "C", "B")
grade_character_vctr

[1] "A" "D" "A" "C" "B"

Factor vector

grade_factor_vctr <- factor(c("A", "D", "A", "C", "B"), 
         levels = c("A", "B", "C", "D", "E"))
grade_factor_vctr

[1] A D A C B
Levels: A B C D E

Factors behave like character vectors but they are actually integers.

Character vector

typeof(grade_character_vctr)

[1] "character"

Factor vector

typeof(grade_factor_vctr)

[1] "integer"

Let’s create a contingency table with table function.

Character vector output with table function

grade_character_vctr <- c("A", "D", "A", "C", "B")
table(grade_character_vctr)

grade_character_vctr
A B C D 
2 1 1 1

Factor vector (with levels) output with table function

grade_factor_vctr <- 
  factor(c("A", "D", "A", "C", "B"), 
         levels = c("A", "B", "C", "D", "E"))
table(grade_factor_vctr)

grade_factor_vctr
A B C D E 
2 1 1 1 0

Output corresponds to factor prints counts for all possible levels of the variable. Hence, with factors it is obvious when some levels contain no observations.
With factors you can’t use values that are not listed in the levels, but with character vectors there is no such restrictions.

Character vector

grade_character_vctr[2] <- "A+"
grade_character_vctr

[1] "A"  "A+" "A"  "C"  "B"

Factor vector

grade_factor_vctr[2] <- "A+"

Warning in `[<-.factor`(`*tmp*`, 2, value = "A+"): invalid factor level, NA
generated

grade_factor_vctr

[1] A    <NA> A    C    B   
Levels: A B C D E

7.5.3 Modify factor levels

This our factor

grade_factor_vctr

[1] A    <NA> A    C    B   
Levels: A B C D E

7.5.4 Change labels

levels(grade_factor_vctr) <- 
  c("Excellent", "Good", "Average", "Poor", "Fail")
grade_factor_vctr

[1] Excellent <NA>      Excellent Average   Good     
Levels: Excellent Good Average Poor Fail

7.5.5 Reverse the level arrangement

levels(grade_factor_vctr) <- rev(levels(grade_factor_vctr))
grade_factor_vctr

[1] Fail    <NA>    Fail    Average Poor   
Levels: Fail Poor Average Good Excellent

7.5.6 Order of factor levels

Default order of levels

fv1 <- factor(c("D","E","E","A", "B", "C"))
fv1

[1] D E E A B C
Levels: A B C D E

fv2 <- factor(c("1T","2T","3A","4A", "5A", "6B", "3A"))
fv2

[1] 1T 2T 3A 4A 5A 6B 3A
Levels: 1T 2T 3A 4A 5A 6B

df <- data.frame(fv2=fv2)
library(ggplot2)
ggplot(df, aes(x=fv2)) + geom_bar()

You can change the order of levels

fv2 <- factor(c("1T","2T","3A","4A", "5A", "6B", "3A"), 
              levels = c("3A", "4A", "5A", "6B", "1T", "2T"))
fv2

[1] 1T 2T 3A 4A 5A 6B 3A
Levels: 3A 4A 5A 6B 1T 2T

df <- data.frame(fv2=fv2)
library(ggplot2)
ggplot(df, aes(x=fv2)) + geom_bar()

Note that tibbles do not change the types of input variables (e.g., strings are not converted to factors by default).

tbl <- tibble(x1 = c("setosa", "versicolor", "virginica", "setosa"))
tbl

# A tibble: 4 × 1
  x1        
  <chr>     
1 setosa    
2 versicolor
3 virginica 
4 setosa

df <- data.frame(x1 = c("setosa", "versicolor", "virginica", "setosa"))
df

          x1
1     setosa
2 versicolor
3  virginica
4     setosa

class(df$x1)

[1] "character"

7.6 Pipe operator: %>% or |>

7.6.1 Required package: `magrittr` (for %>%)

install.packages("magrittr")
library(magrittr)


Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract

7.6.2 What does it do?

It takes whatever is on the left-hand-side of the pipe and makes it the first argument of whatever function is on the right-hand-side of the pipe.

For instance,

mean(1:10)

[1] 5.5

can be written as

1:10 %>% mean()

[1] 5.5

7.6.3 Illustrations

x %>% f(y) turns into f(x, y)
x %>% f(y) %>% g(z) turns into g(f(x, y), z)

7.6.4 Why %>%

This helps to make your code more readable.

Method 1: Without using pipe (hard to read)

colSums(matrix(c(1, 2, 3, 4, 8, 9, 10, 12), nrow=2))

[1]  3  7 17 22

Method 2: Using pipe (easy to read)

c(1, 2, 3, 4, 8, 9, 10, 12) %>%
  matrix( , nrow = 2) %>%
  colSums()

[1]  3  7 17 22

c(1, 2, 3, 4, 8, 9, 10, 12) %>%
  matrix(nrow = 2) %>% # remove comma
  colSums()

[1]  3  7 17 22

7.6.5 Rules

library(tidyverse) # to use as_tibble
library(magrittr) # to use %>%
df <- data.frame(x1 = 1:3, x2 = 4:6)
df

Rule 1

head(df) 
df %>% head()

Rule 2

head(df, n = 2)  
df %>% head(n = 2)

  x1 x2
1  1  4
2  2  5

Rule 3

head(df, n = 2)
2 %>% head(df, n = .)

  x1 x2
1  1  4
2  2  5

Rule 4

head(as_tibble(df), n = 2)
df %>% as_tibble() %>%
head(n = 2)

# A tibble: 2 × 2
     x1    x2
  <int> <int>
1     1     4
2     2     5

Rule 5: subsetting

df$x1
df %>% .$x1

[1] 1 2 3

df[["x1"]]
df %>% .[["x1"]]

[1] 1 2 3

df[[1]]
df %>% .[[1]]

[1] 1 2 3

7.6.6 Offline reading materials

Type the following codes to see more examples:

vignette("magrittr")
vignette("tibble")

7.1 What is the tidyverse?

7.2 Setup

7.3 The Tidyverse data analysis workflow

7.4 Tibble

7.4.1 Create a tibble

7.4.2 Convert an existing dataframe to a tibble

7.4.3 Convert a tibble to a dataframe

7.4.4 tibble vs data.frame

7.4.5 Subsetting: tibble vs data.frame

7.4.5.1 Subsetting single columns

7.4.5.2 Subsetting single rows with the drop argument

7.4.5.3 Accessing non-existent columns

7.4.6 Some functions that work with both tibbles and dataframes

7.5 Factors

7.5.1 Creating a factor vector

7.5.2 Character vector vs Factor

7.5.3 Modify factor levels

7.5.4 Change labels

7.5.5 Reverse the level arrangement

7.5.6 Order of factor levels

7.6 Pipe operator: %>% or |>

7.6.1 Required package: magrittr (for %>%)

7.6.2 What does it do?

7.6.3 Illustrations

7.6.4 Why %>%

7.6.5 Rules

7.6.6 Offline reading materials

7.6.1 Required package: `magrittr` (for %>%)