Visualising Qualitative Data

Packages

library(tidyverse)
library(skimr)
library(viridis)

Data

data(diamonds)
skimr::skim(diamonds)

Data summary
Name	diamonds
Number of rows	53940
Number of columns	10
_______________________
Column type frequency:
factor	3
numeric	7
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
cut	1	TRUE	5	Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906
color	1	TRUE	7	G: 11292, E: 9797, F: 9542, H: 8304
clarity	1	TRUE	8	SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
carat	1	0.80	0.47	0.2	0.40	0.70	1.04	5.01	▇▂▁▁▁
depth	1	61.75	1.43	43.0	61.00	61.80	62.50	79.00	▁▁▇▁▁
table	1	57.46	2.23	43.0	56.00	57.00	59.00	95.00	▁▇▁▁▁
price	1	3932.80	3989.44	326.0	950.00	2401.00	5324.25	18823.00	▇▂▁▁▁
x	1	5.73	1.12	0.0	4.71	5.70	6.54	10.74	▁▁▇▃▁
y	1	5.73	1.14	0.0	4.72	5.71	6.54	58.90	▇▁▁▁▁
z	1	3.54	0.71	0.0	2.91	3.53	4.04	31.80	▇▁▁▁▁

Univariate Visualisations

Common Code

p1 <- ggplot(data=diamonds, aes(x=cut))

Bar chart: Counts

p1 + geom_bar()

Add color

p1 + geom_bar(fill="forestgreen")

Add title and sequential colour theme

p1 + geom_bar(aes(fill=cut))+
  scale_fill_viridis_d() + 
  labs(title="Composition of diamonds by cuts")

Change colour palette

p1 + geom_bar(aes(fill=cut))+
  scale_fill_viridis_d(option = "magma") + 
  labs(title="Composition of diamonds by cuts")

Manually fill colors

p1 + geom_bar(aes(fill=cut))+
  scale_fill_manual(<>) + 
  labs(title="Composition of diamonds by cuts")
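
The `<>` placeholder above is where the manual colours go. One possible way to fill it in is sketched below, assuming a named vector with one colour per level of cut. The hex codes and the names `cut_colours` and `p_manual` are illustrative choices, not part of the original notes.

```r
library(ggplot2)

# Hypothetical colours, one per level of cut; any valid R colour names or hex codes work
cut_colours <- c(
  "Fair"      = "#d7191c",
  "Good"      = "#fdae61",
  "Very Good" = "#ffffbf",
  "Premium"   = "#abd9e9",
  "Ideal"     = "#2c7bb6"
)

# Naming the vector ties each colour to a level, independent of level order
p_manual <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar(aes(fill = cut)) +
  scale_fill_manual(values = cut_colours) +
  labs(title = "Composition of diamonds by cuts")

p_manual
```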

Bar charts: percentages

Method 1: geom_col

diamonds |> 
  summarize(prop = n() / nrow(diamonds), .by = cut)

# A tibble: 5 × 2
  cut         prop
  <ord>      <dbl>
1 Ideal     0.400 
2 Premium   0.256 
3 Good      0.0910
4 Very Good 0.224 
5 Fair      0.0298

diamonds |> 
  summarize(prop = n() / nrow(diamonds), .by = cut) |> 
  mutate(cut = forcats::fct_reorder(cut, prop))

# A tibble: 5 × 2
  cut         prop
  <ord>      <dbl>
1 Ideal     0.400 
2 Premium   0.256 
3 Good      0.0910
4 Very Good 0.224 
5 Fair      0.0298

diamonds |> 
  summarize(prop = n() / nrow(diamonds), .by = cut) |> 
  mutate(cut = forcats::fct_reorder(cut, prop)) |> 
  ggplot(aes(y=prop, x=cut)) +
  geom_col()

Method 2: geom_bar and after_stat

ggplot(diamonds, aes(x = cut, y = after_stat(count / sum(count)))) +
  geom_bar()

Flip coords

ggplot(diamonds, aes(x = cut, y = after_stat(count / sum(count)))) +
  geom_bar() + 
  coord_flip()

Obtain percentage

ggplot(diamonds, aes(x = cut, y = after_stat(count / sum(count)*100))) +
  geom_bar() + 
  coord_flip()

Level up your plots

diamonds |> 
  summarize(prop = n() / nrow(diamonds), .by = cut) |> 
  mutate(cut = forcats::fct_reorder(cut, prop)) |> 
  ggplot(aes(prop, cut)) +
  geom_col() +
  scale_x_continuous(
    expand = c(0, 0), limits = c(0, .50),
    labels = scales::label_percent(),
    name = "Percentage"
  )

More work on the plot

diamonds |> 
  summarize(prop = n() / nrow(diamonds), .by = cut) |> 
  mutate(cut = forcats::fct_reorder(cut, prop)) |> 
  ggplot(aes(prop, cut)) +
  geom_col() +
  scale_x_continuous(
    expand = c(0, 0), limits = c(0, .5),
    labels = scales::label_percent(),
    name = "Percentage"
  ) + theme(axis.title.y = element_blank())

diamonds |> 
  summarize(prop = n() / nrow(diamonds), .by = cut) |> 
  mutate(cut = forcats::fct_reorder(cut, prop)) |> 
  ggplot(aes(prop, cut)) +
  geom_col() +
  geom_text(
    aes(label = paste0("  ", sprintf("%2.1f", prop * 100), "%  ")),
    hjust = -0.1,    # bars are horizontal: place the label just past the end of each bar
    size = 3) +
  scale_x_continuous(
    expand = c(0, 0), limits = c(0, .5),
    labels = scales::label_percent(),
    name = "Percentage"
  ) + theme(axis.title.y = element_blank())

Bivariate Visualisations

Stacked bar chart

Encoding by colour

Position: stack

b1 <- ggplot(data=diamonds, aes(x=cut, fill=color))

R code:___________
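
One possible way to fill in the blank above is sketched here, assuming the stacked chart is built on b1. The name `p_stack` is introduced only for illustration.

```r
library(ggplot2)

b1 <- ggplot(data = diamonds, aes(x = cut, fill = color))

# position = "stack" is the default for geom_bar(); it is spelled out here for clarity
p_stack <- b1 + geom_bar(position = "stack")

p_stack
```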

Grouped bar chart/ Cluster bar chart

This chart displays bars for multiple categories grouped together side by side.

Encoding by colour

Position: dodge

R code:___________
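
A minimal sketch of the grouped version, assuming the same base plot as the stacked chart. Only the position argument changes; `p_dodge` is an illustrative name.

```r
library(ggplot2)

b1 <- ggplot(data = diamonds, aes(x = cut, fill = color))

# position = "dodge" places the bars for each colour grade side by side within a cut
p_dodge <- b1 + geom_bar(position = "dodge")

p_dodge
```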

Small Multiples or Trellis Chart

This chart displays multiple small bar charts, each representing a different subset of the data.

Encoding by position

ggplot(data=diamonds, aes(x=color)) +
  geom_bar() +
  facet_wrap(~cut)

Your turn: What is the best chart: grouped bar chart, stacked bar chart or faceting?

Percentage stacked bar chart
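
No code is given for this chart in the notes. A minimal sketch, assuming the same cut and color encoding as the stacked bar chart, uses position = "fill" so that every bar is rescaled to a total of 1. The percent axis labels and the name `p_fill` are illustrative additions.

```r
library(ggplot2)

# position = "fill" stacks the bars and rescales each one to sum to 1
p_fill <- ggplot(data = diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::label_percent(), name = "Percentage")

p_fill
```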

Categorical vs Quantitative

Cleveland dot chart

This is useful when you have a large number of categories.

Representation using bar chart
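
The notes do not say which data the two charts were drawn from. The sketch below assumes the diamonds data and compares the median price across the eight clarity grades, first as a Cleveland dot chart and then as a bar chart. The choice of variable, the summary, and the object names are all illustrative.

```r
library(dplyr)
library(ggplot2)

# Median price per clarity grade, with the grades ordered by that median
price_by_clarity <- diamonds |>
  summarise(median_price = median(price), .by = clarity) |>
  mutate(clarity = forcats::fct_reorder(clarity, median_price))

# Cleveland dot chart: one point per category, position along a common scale
p_dot <- ggplot(price_by_clarity, aes(x = median_price, y = clarity)) +
  geom_point(size = 3) +
  labs(x = "Median price", y = NULL)

# The same summary drawn as bars, for comparison
p_bar <- ggplot(price_by_clarity, aes(x = median_price, y = clarity)) +
  geom_col() +
  labs(x = "Median price", y = NULL)

p_dot
p_bar
```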

Question: What is the best representation? Dot chart or Bar chart?

Heat map

Geospatial visualisation: Good to identify “hot spots”.
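
The heat map code below uses gapminderAsia, which is not defined anywhere in these notes. A minimal sketch of how it might be created, assuming it is the Asian subset of the gapminder data from the gapminder package:

```r
library(dplyr)
library(gapminder)  # assumption: gapminderAsia is derived from this package's data

# Keep only the rows for Asian countries
gapminderAsia <- gapminder |>
  filter(continent == "Asia")
```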

ggplot(gapminderAsia, aes(x=year, fill=lifeExp, y=country))+
  geom_raster()+
  scale_fill_viridis_c()

ggplot(gapminderAsia, aes(x=year, fill=lifeExp, y=reorder(country, lifeExp)))+
  geom_raster()+
  scale_fill_viridis_c()

Plotting summary statistics

Plotting Summary statistics: Method 1

Calculate summary statistics before plotting.

R code:___________________
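
One way to produce a summary table like the one shown here is sketched below, assuming the statistic is the mean carat per cut. group_by() is used so that the rows come out in factor-level order; `mean_carat_by_cut` is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

# One row per cut, holding the mean carat of that cut
mean_carat_by_cut <- diamonds |>
  group_by(cut) |>
  summarise(mean_carat = mean(carat))

mean_carat_by_cut
```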

# A tibble: 5 × 2
  cut       mean_carat
  <ord>          <dbl>
1 Fair           1.05 
2 Good           0.849
3 Very Good      0.806
4 Premium        0.892
5 Ideal          0.703

R Code:________________
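
A possible plotting step for Method 1 is sketched below, assuming the pre-computed means are drawn as points. geom_col() would work equally well; the geom and the name `p_mean` are illustrative choices.

```r
library(dplyr)
library(ggplot2)

# Summarise first, then plot the summary table directly
p_mean <- diamonds |>
  group_by(cut) |>
  summarise(mean_carat = mean(carat)) |>
  ggplot(aes(x = cut, y = mean_carat)) +
  geom_point(colour = "red", size = 3)

p_mean
```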

Plotting summary statistics: Method 2 - with `stat_summary`

Common code

g1 <- ggplot(diamonds, aes(x = cut, y = carat))

Plot mean values.

g1+
  stat_summary(fun = "mean", geom="point", color="red")  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)

Your turn: Plot mean and median.

R code:___________________

R code:___________________
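
One possible answer, sketched under the assumption that the mean and the median are drawn as two point layers on g1. The colours and the name `p_mean_median` are illustrative.

```r
library(ggplot2)

g1 <- ggplot(diamonds, aes(x = cut, y = carat))

# Each stat_summary() layer computes one statistic per cut
p_mean_median <- g1 +
  stat_summary(fun = "mean", geom = "point", colour = "red") +
  stat_summary(fun = "median", geom = "point", colour = "blue")

p_mean_median
```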

mean_se: mean and standard error

R code:___________________
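
A minimal sketch for this blank, assuming g1 as defined under Common code. mean_se() ships with ggplot2 and returns the mean together with one standard error either side; `p_se` is an illustrative name.

```r
library(ggplot2)

g1 <- ggplot(diamonds, aes(x = cut, y = carat))

# fun.data expects a function returning y, ymin and ymax; the default geom is a pointrange
p_se <- g1 + stat_summary(fun.data = mean_se)

p_se
```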

mean_cl_normal: 95 per cent confidence interval assuming normality. (Use library(Hmisc))

library(Hmisc)
g1+stat_summary(fun.data = "mean_cl_normal")

R code:___________________

mean_cl_boot: Bootstrap confidence interval (95%)
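
The bootstrap interval can be drawn the same way as mean_cl_normal. The sketch below assumes the Hmisc package is installed, since ggplot2's mean_cl_boot() wraps a function from Hmisc; `p_boot` is an illustrative name.

```r
library(ggplot2)
library(Hmisc)  # needed for the bootstrap interval computation

g1 <- ggplot(diamonds, aes(x = cut, y = carat))

# Mean with a bootstrap confidence interval for each cut
p_boot <- g1 + stat_summary(fun.data = "mean_cl_boot")

p_boot
```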

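Confidence limits give a better idea than standard error limits of whether two means would be deemed statistically different. An interval of one standard error either side of the mean is much narrower than a 95% confidence interval, so non-overlapping standard error bars do not by themselves indicate a significant difference.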

Design of Experiments

Description

The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

Data summary
Name	ToothGrowth
Number of rows	60
Number of columns	3
_______________________
Column type frequency:
factor	1
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
supp	0	1	FALSE	2	OJ: 30, VC: 30

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
len	0	1	18.81	7.65	4.2	13.07	19.25	25.27	33.9	▅▃▅▇▂
dose	0	1	1.17	0.63	0.5	0.50	1.00	2.00	2.0	▇▇▁▁▇

   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5

R code:___________________

R code:___________________

R code:____________________

Avoid overlapping in the last category: use position_dodge(0.1)

R code: ___________
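
The original code for these plots is not shown. The sketch below is one way to draw mean tooth length by dose and delivery method with standard error bars, assuming stat_summary() is used throughout. position_dodge(0.1) shifts the two series slightly apart so that the nearly identical means at the 2 mg/day dose do not sit on top of each other. The names `pd` and `p_tooth` are illustrative.

```r
library(ggplot2)

# Small horizontal offset shared by all layers so points, lines and bars stay aligned
pd <- position_dodge(0.1)

p_tooth <- ggplot(ToothGrowth, aes(x = dose, y = len, colour = supp, group = supp)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1, position = pd) +
  stat_summary(fun = mean, geom = "line", position = pd) +
  stat_summary(fun = mean, geom = "point", position = pd)

p_tooth
```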

Not suitable for this example: Why?

Categorical with two Quantitative variables

R code: ___________

R code: ___________

R code: ___________

R code: ___________

R code: ___________

R code: ___________
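
The blanks above are left open in the notes. As a hedged starting point, two common ways to show one categorical and two quantitative variables are sketched here with the diamonds data: encoding the category by colour, and encoding it by position with facets. The transparency and point size values and the object names are illustrative.

```r
library(ggplot2)

# Encoding by colour: all colour grades share one panel
p_colour <- ggplot(diamonds, aes(x = carat, y = price, colour = color)) +
  geom_point(alpha = 0.3, size = 0.5) +
  scale_colour_viridis_d()

# Encoding by position: one small panel per colour grade
p_facet <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2, size = 0.5) +
  facet_wrap(~color)

p_colour
p_facet
```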

Your turn: What is the best chart to visualize the relationship between the price, carat, and color of diamonds?
