7 Introduction to Tidyverse
7.1 What is the tidyverse?
The tidyverse is a collection of essential R packages for data science.
All packages share a common design philosophy, grammar, and data structures.
7.2 Setup
install.packages("tidyverse") # install tidyverse packages
library(tidyverse) # load tidyverse packages
7.3 Tibble
A tibble is a modern re-imagining of the data frame.
7.3.1 Create a tibble
library(tidyverse) # or library(tibble)
first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
first.tbl
# A tibble: 3 × 2
height weight
<dbl> <dbl>
1 150 45
2 200 60
3 160 51
class(first.tbl)
[1] "tbl_df" "tbl" "data.frame"
7.3.2 Convert an existing dataframe to a tibble
as_tibble(iris)
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
7.3.3 Convert a tibble to a dataframe
first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
class(first.tbl)
[1] "tbl_df" "tbl" "data.frame"
first.tbl.df <- as.data.frame(first.tbl)
class(first.tbl.df)
[1] "data.frame"
7.3.4 tibble vs data.frame
- They print differently. A tibble also reports its dimensions and column types.
tibble
first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51))
first.tbl
# A tibble: 3 × 2
height weight
<dbl> <dbl>
1 150 45
2 200 60
3 160 51
data.frame
dataframe <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51))
dataframe
height weight
1 150 45
2 200 60
3 160 51
- With tibble() you can create new variables that are functions of variables defined earlier in the same call.
tibble
first.tbl <- tibble(height = c(150, 200, 160), weight = c(45, 60, 51),
                    bmi = (weight)/height^2)
first.tbl
# A tibble: 3 × 3
height weight bmi
<dbl> <dbl> <dbl>
1 150 45 0.002
2 200 60 0.0015
3 160 51 0.00199
data.frame
df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51),
                 bmi = (weight)/height^2) # Not working
You will get an error message
Error in data.frame(height = c(150, 200, 160), weight = c(45, 60, 51), : object 'height' not found.
With data.frame, this is how we create a new variable from the existing columns.
df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51))
df$bmi <- (df$weight)/(df$height^2)
df
height weight bmi
1 150 45 0.002000000
2 200 60 0.001500000
3 160 51 0.001992188
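As an alternative sketch, `dplyr::mutate()` (part of the tidyverse) can add a column computed from existing columns, and it works on both data frames and tibbles. The example reuses the height and weight values from above; it is one possible approach, not the only one.

```r
library(dplyr)

# Same data as in the data.frame example above
df <- data.frame(height = c(150, 200, 160), weight = c(45, 60, 51))

# mutate() evaluates the expression using the columns of df
df <- mutate(df, bmi = weight / height^2)
df
```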
- Variable names in tibbles can contain spaces. In contrast, data.frame() by default converts such names into syntactically valid ones.
Example 1
tbl <- tibble(`patient id` = c(1, 2, 3))
tbl
# A tibble: 3 × 1
`patient id`
<dbl>
1 1
2 2
3 3
df <- data.frame(`patient id` = c(1, 2, 3))
df
patient.id
1 1
2 2
3 3
- Variable names in tibbles can start with a number. In contrast, data.frame() by default prefixes such names with X.
tbl <- tibble(`1var` = c(1, 2, 3))
tbl
# A tibble: 3 × 1
`1var`
<dbl>
1 1
2 2
3 3
df <- data.frame(`1var` = c(1, 2, 3))
df
X1var
1 1
2 2
3 3
In general, tibbles do not change the names of input variables and do not use row names.
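The renaming seen in the data.frame examples above comes from the default `check.names = TRUE` argument of `data.frame()`. As a small sketch, setting `check.names = FALSE` keeps the names as typed; the column names are the ones used in the examples above.

```r
# check.names = FALSE stops data.frame() from repairing the names
df <- data.frame(`patient id` = c(1, 2, 3),
                 `1var` = c(1, 2, 3),
                 check.names = FALSE)
names(df)

# Non-syntactic names must be wrapped in backticks when used with $
df$`patient id`
```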
- A tibble can have columns that are lists.
tibble
tbl <- tibble(x = 1:3, y = list(1:3, 1:4, 1:10))
tbl
# A tibble: 3 × 2
x y
<int> <list>
1 1 <int [3]>
2 2 <int [4]>
3 3 <int [10]>
data.frame
data.frame() does not accept a list column supplied this way. If we try this with a traditional data frame, we get an error.
df <- data.frame(x = 1:3, y = list(1:3, 1:4, 1:10)) ## Not working, error
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 3, 4, 10
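For completeness, base R does offer a workaround: wrapping the list in `I()` asks `data.frame()` to keep it "as is" as a single column. The sketch below reuses the values from the example above and only illustrates the idea; tibbles remain the more convenient choice for list columns.

```r
# I() protects the list so it becomes one list column with 3 elements
df <- data.frame(x = 1:3, y = I(list(1:3, 1:4, 1:10)))

nrow(df)        # number of rows
is.list(df$y)   # the column is a list
lengths(df$y)   # length of each list element
```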
7.3.5 Subsetting: tibble vs data.frame
7.3.5.1 Subsetting single columns
data.frame
df <- data.frame(x = 1:3,
                 yz = c(10, 20, 30)); df
x yz
1 1 10
2 2 20
3 3 30
df[, "x"]
[1] 1 2 3
df[, "x", drop=FALSE]
x
1 1
2 2
3 3
tibble
tbl <- tibble(x = 1:3,
              yz = c(10, 20, 30)); tbl
# A tibble: 3 × 2
x yz
<int> <dbl>
1 1 10
2 2 20
3 3 30
tbl[, "x"]
# A tibble: 3 × 1
x
<int>
1 1
2 2
3 3
# Method 1
tbl[, "x", drop = TRUE]
[1] 1 2 3
# Method 2
as.data.frame(tbl)[, "x"]
[1] 1 2 3
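Besides the two methods above, a column can be extracted from a tibble as a plain vector with `[[`, with `$`, or with `dplyr::pull()`. The following sketch rebuilds the same `tbl` so that it runs on its own.

```r
library(tibble)
library(dplyr)

tbl <- tibble(x = 1:3, yz = c(10, 20, 30))

tbl[["x"]]     # extract by name with [[
tbl$x          # extract by name with $
pull(tbl, x)   # dplyr helper, convenient inside pipelines
```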
7.3.5.2 Subsetting single rows with the drop argument
data.frame
df[1, , drop = TRUE]
$x
[1] 1
$yz
[1] 10
tibble
tbl[1, , drop = TRUE]
# A tibble: 1 × 2
x yz
<int> <dbl>
1 1 10
as.list(tbl[1, ])
$x
[1] 1
$yz
[1] 10
7.3.5.3 Accessing non-existent columns
data.frame
df$y
[1] 10 20 30
df[["y", exact = FALSE]]
[1] 10 20 30
df[["y", exact = TRUE]]
NULL
tibble
tbl$y
Warning: Unknown or uninitialised column: `y`.
NULL
tbl[["y", exact = FALSE]]
Warning: `exact` ignored.
NULL
tbl[["y", exact = TRUE]]
NULL
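The data.frame results above are a consequence of partial matching: `df$y` silently matches the column `yz`, whereas a tibble never does partial matching. One hedged way to guard against this in base R is to test for the column name explicitly, or to switch on the `warnPartialMatchDollar` option. The sketch reuses the `df` from this section.

```r
df <- data.frame(x = 1:3, yz = c(10, 20, 30))

# Explicit check: is there a column with exactly this name?
has_y  <- "y" %in% names(df)
has_yz <- "yz" %in% names(df)
has_y
has_yz

# Ask R to warn whenever $ falls back to a partial match
old_opts <- options(warnPartialMatchDollar = TRUE)
# df$y would now also emit a warning about the partial match
options(old_opts)  # restore the previous setting
```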
7.3.6 Some functions that work with both tibbles and dataframes
names(), colnames(), rownames(), ncol(), nrow(), length() # length of the underlying list
tibble
tb <- tibble(a = 1:3)
names(tb)
[1] "a"
colnames(tb)
[1] "a"
rownames(tb)
[1] "1" "2" "3"
nrow(tb); ncol(tb); length(tb)
[1] 3
[1] 1
[1] 1
data.frame
df <- data.frame(a = 1:3)
names(df)
[1] "a"
colnames(df)
[1] "a"
rownames(df)
[1] "1" "2" "3"
nrow(df); ncol(df); length(df)
[1] 3
[1] 1
[1] 1
However, with a tibble we can use some additional commands:
is_tibble(tb)
[1] TRUE
glimpse(tb)
Rows: 3
Columns: 1
$ a <int> 1, 2, 3
7.4 Factors
A factor is a vector that is used to store categorical variables.
It can only contain predefined values. Hence, factors are useful when you know the possible values a variable may take.
Creating a factor vector
grades <- factor(c("A", "A", "A", "C", "B"))
grades
[1] A A A C B
Levels: A B C
Now let’s check the class type
class(grades) # It's a factor
[1] "factor"
To obtain all levels
levels(grades)
[1] "A" "B" "C"
7.4.1 Creating a factor vector
- With factors all possible values of the variables can be defined under levels.
grade_factor_vctr <-
  factor(c("A", "D", "A", "C", "B"),
         levels = c("A", "B", "C", "D", "E"))
grade_factor_vctr
[1] A D A C B
Levels: A B C D E
levels(grade_factor_vctr)
[1] "A" "B" "C" "D" "E"
class(levels(grade_factor_vctr))
[1] "character"
7.4.2 Character vector vs Factor
- Observe the differences in outputs. Factor prints all possible levels of the variable.
Character vector
grade_character_vctr <- c("A", "D", "A", "C", "B")
grade_character_vctr
[1] "A" "D" "A" "C" "B"
Factor vector
grade_factor_vctr <- factor(c("A", "D", "A", "C", "B"),
                            levels = c("A", "B", "C", "D", "E"))
grade_factor_vctr
[1] A D A C B
Levels: A B C D E
- Factors look like character vectors, but they are stored as integer codes.
Character vector
typeof(grade_character_vctr)
[1] "character"
Factor vector
typeof(grade_factor_vctr)
[1] "integer"
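To make the integer storage visible, the sketch below converts the same factor to its integer codes and back to labels. Each code is simply the position of the value in `levels()`.

```r
grade_factor_vctr <- factor(c("A", "D", "A", "C", "B"),
                            levels = c("A", "B", "C", "D", "E"))

# Integer codes: position of each value within the levels
codes <- as.integer(grade_factor_vctr)
codes

# Looking the codes up in the levels recovers the labels
labels_back <- levels(grade_factor_vctr)[codes]
labels_back

# as.character() does the same in one step
as.character(grade_factor_vctr)
```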
- Let’s create a contingency table with the table function.
Character vector output with table function
grade_character_vctr <- c("A", "D", "A", "C", "B")
table(grade_character_vctr)
grade_character_vctr
A B C D
2 1 1 1
Factor vector (with levels) output with table function
grade_factor_vctr <-
  factor(c("A", "D", "A", "C", "B"),
         levels = c("A", "B", "C", "D", "E"))
table(grade_factor_vctr)
grade_factor_vctr
A B C D E
2 1 1 1 0
The output for the factor shows counts for all possible levels of the variable. Hence, with factors it is obvious when some levels contain no observations.
With factors you can’t use values that are not listed in the levels, but with character vectors there is no such restriction.
Character vector
grade_character_vctr[2] <- "A+"
grade_character_vctr
[1] "A" "A+" "A" "C" "B"
Factor vector
grade_factor_vctr[2] <- "A+"
Warning in `[<-.factor`(`*tmp*`, 2, value = "A+"): invalid factor level, NA
generated
grade_factor_vctr
[1] A <NA> A C B
Levels: A B C D E
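If a new value such as "A+" is genuinely needed, one possible approach is to add it to the levels first and assign afterwards. The sketch below uses a fresh factor named `f` (a name chosen here for illustration) so that it does not alter the `grade_factor_vctr` used in the rest of the section.

```r
f <- factor(c("A", "D", "A", "C", "B"),
            levels = c("A", "B", "C", "D", "E"))

# Extend the set of allowed values, then assign the new value
levels(f) <- c(levels(f), "A+")
f[2] <- "A+"
f
```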
7.4.3 Modify factor levels
This is our factor:
grade_factor_vctr
[1] A <NA> A C B
Levels: A B C D E
7.4.4 Change labels
levels(grade_factor_vctr) <-
c("Excellent", "Good", "Average", "Poor", "Fail")
grade_factor_vctr
[1] Excellent <NA> Excellent Average Good
Levels: Excellent Good Average Poor Fail
7.4.5 Reverse the level arrangement
levels(grade_factor_vctr) <- rev(levels(grade_factor_vctr))
grade_factor_vctr
[1] Fail <NA> Fail Average Poor
Levels: Fail Poor Average Good Excellent
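Note that assigning to `levels()` as above relabels the data: the values that were "Excellent" now read "Fail". If the intention is only to reverse the order of the levels while keeping each observation's value, a sketch of two alternatives is to rebuild the factor with reversed levels or to use `forcats::fct_rev()`. The factor `grades_f` is a name chosen here for illustration.

```r
library(forcats)

grades_f <- factor(c("A", "D", "A", "C", "B"),
                   levels = c("A", "B", "C", "D", "E"))

# Base R: rebuild the factor with the levels in reverse order
rev_base <- factor(grades_f, levels = rev(levels(grades_f)))

# forcats: same idea in one call
rev_forcats <- fct_rev(grades_f)

levels(rev_base)         # order of levels is reversed
as.character(rev_base)   # the values themselves are unchanged
```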
7.4.6 Order of factor levels
Default order of levels
fv1 <- factor(c("D","E","E","A", "B", "C"))
fv1
[1] D E E A B C
Levels: A B C D E
fv2 <- factor(c("1T","2T","3A","4A", "5A", "6B", "3A"))
fv2
[1] 1T 2T 3A 4A 5A 6B 3A
Levels: 1T 2T 3A 4A 5A 6B
df <- data.frame(fv2=fv2)
library(ggplot2)
ggplot(df, aes(x=fv2)) + geom_bar()
You can change the order of levels
fv2 <- factor(c("1T","2T","3A","4A", "5A", "6B", "3A"),
              levels = c("3A", "4A", "5A", "6B", "1T", "2T"))
fv2
[1] 1T 2T 3A 4A 5A 6B 3A
Levels: 3A 4A 5A 6B 1T 2T
df <- data.frame(fv2=fv2)
library(ggplot2)
ggplot(df, aes(x=fv2)) + geom_bar()
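The same reordering can be sketched with `forcats::fct_relevel()`, which moves the named levels to the front and leaves the remaining levels in their existing order. This avoids retyping the data; `fv2_reordered` is a name chosen here for illustration.

```r
library(forcats)

fv2 <- factor(c("1T", "2T", "3A", "4A", "5A", "6B", "3A"))

# Move the listed levels to the front; 1T and 2T keep their relative order
fv2_reordered <- fct_relevel(fv2, "3A", "4A", "5A", "6B")
levels(fv2_reordered)
```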
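Note that tibbles do not change the types of input variables (e.g., strings are not converted to factors by default). Since R 4.0.0, data.frame() behaves the same way, because its stringsAsFactors argument now defaults to FALSE.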
tbl <- tibble(x1 = c("setosa", "versicolor", "virginica", "setosa"))
tbl
# A tibble: 4 × 1
x1
<chr>
1 setosa
2 versicolor
3 virginica
4 setosa
df <- data.frame(x1 = c("setosa", "versicolor", "virginica", "setosa"))
df
x1
1 setosa
2 versicolor
3 virginica
4 setosa
class(df$x1)
[1] "character"
7.5 Pipe operator: %>% or |>
7.5.1 Required package: magrittr (for %>%)
install.packages("magrittr")
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
7.5.2 What does it do?
It takes whatever is on the left-hand-side of the pipe and makes it the first argument of whatever function is on the right-hand-side of the pipe.
For instance,
mean(1:10)
[1] 5.5
can be written as
1:10 %>% mean()
[1] 5.5
7.5.3 Illustrations
x %>% f(y) turns into f(x, y)
x %>% f(y) %>% g(z) turns into g(f(x, y), z)
7.5.4 Why %>%
- This helps to make your code more readable.
Method 1: Without using pipe (hard to read)
colSums(matrix(c(1, 2, 3, 4, 8, 9, 10, 12), nrow=2))
[1] 3 7 17 22
Method 2: Using pipe (easy to read)
c(1, 2, 3, 4, 8, 9, 10, 12) %>%
matrix( , nrow = 2) %>%
colSums()
[1] 3 7 17 22
or
c(1, 2, 3, 4, 8, 9, 10, 12) %>%
matrix(nrow = 2) %>% # remove comma
colSums()
[1] 3 7 17 22
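The section title also mentions the native pipe `|>`. As a brief sketch, the same pipelines can be written with `|>`, which is built into R (version 4.1.0 or later) and needs no package. The object names `avg` and `col_totals` are chosen here for illustration.

```r
# Native pipe: the left-hand side becomes the first argument on the right
avg <- 1:10 |> mean()
avg

col_totals <- c(1, 2, 3, 4, 8, 9, 10, 12) |>
  matrix(nrow = 2) |>
  colSums()
col_totals
```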
7.5.5 Rules
library(tidyverse) # to use as_tibble
library(magrittr) # to use %>%
<- data.frame(x1 = 1:3, x2 = 4:6)
df df
x1 x2
1 1 4
2 2 5
3 3 6
Rule 1
head(df)
df %>% head()
x1 x2
1 1 4
2 2 5
3 3 6
Rule 2
head(df, n = 2)
df %>% head(n = 2)
x1 x2
1 1 4
2 2 5
Rule 3
head(df, n = 2)
2 %>% head(df, n = .)
x1 x2
1 1 4
2 2 5
Rule 4
head(as_tibble(df), n = 2)
df %>% as_tibble() %>%
  head(n = 2)
# A tibble: 2 × 2
x1 x2
<int> <int>
1 1 4
2 2 5
Rule 5: subsetting
df$x1
df %>% .$x1
[1] 1 2 3
or
df[["x1"]]
df %>% .[["x1"]]
[1] 1 2 3
or
df[[1]]
df %>% .[[1]]
[1] 1 2 3
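Rule 3 above uses the `.` placeholder to send the left-hand side to an argument other than the first. A common use, sketched here under the assumption that we want to fit a simple linear model, is passing a data frame to a function whose first argument is a formula. The name `fit` is chosen for illustration.

```r
library(magrittr)

df <- data.frame(x1 = 1:3, x2 = 4:6)

# lm() expects a formula first, so the data frame goes to `data` via `.`
fit <- df %>% lm(x2 ~ x1, data = .)
coef(fit)
```

Because `x2` equals `x1 + 3` in this data, the fitted intercept and slope are 3 and 1.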
7.5.6 Offline reading materials
Run the following code to see more examples:
vignette("magrittr")
vignette("tibble")