Data and setting up your workflow

The goal of this chapter is to provide readers with an overview of the datasets used in the book’s examples. Having an initial understanding of the data helps readers easily navigate between the examples.

Installation of associated packages

To run the examples in the book, you need to install the packages listed below. In addition to this list, install the package that provides each geom you want to use. The drone package provides the datasets used throughout this geom encyclopedia.

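# drone is installed from GitHub, which requires devtools
# install.packages("drone")
# install.packages("devtools")  # run once if devtools is not installed
devtools::install_github("thiyangt/drone")
# package names must be quoted
install.packages("tidyverse")
install.packages("patchwork")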
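A quick way to see which packages still need installing is sketched below. The variable names are illustrative, and visdat and naniar are listed only because later code in this chapter loads them.

```r
# Packages loaded by the chapter's code; drone comes from GitHub, the rest from CRAN.
needed <- c("drone", "tidyverse", "patchwork", "visdat", "naniar")

# requireNamespace() returns TRUE when a package can be loaded, without attaching it.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still have to be installed (character(0) if none).
missing_pkgs <- names(installed)[!installed]
print(missing_pkgs)
```

Any name printed here can be passed to install.packages(), except drone, which is installed with devtools::install_github() as shown above.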

Datasets used in the geom encyclopedia

The datasets used in the book come from country-level statistics obtained from reliable public websites. Only two datasets are used to explain all geoms, so that readers can build intuition by seeing the same data represented in multiple ways. The reasons for using these datasets are:

  • The dataset context is familiar and easily understood by individuals from diverse backgrounds.

  • Ability to create cross-sectional, time-series, spatial, and spatio-temporal visualizations.

  • Ability to visualize relationships between qualitative–qualitative, qualitative–quantitative, and quantitative–quantitative variables.

  • Opportunity to address common data challenges such as missing values, overplotting, and large datasets.

Dataset 1: WorldHappinessScore

The dataset is obtained from the World Population Review (World Population Review, n.d.). The World Happiness Score is reported on a 0–10 scale, with higher values indicating greater happiness. This dataset contains the World Happiness Score for 145 countries. The variable descriptions are given in Table 1.

Table 1: Variable description for the WorldHappinessScore dataset
Variable                  Description
flagCode                  Country flag code
country                   Country name
WorldHappinessScore_2024  World happiness score for 2024
WorldHappinessScore_2023  World happiness score for 2023
WorldHappinessScore_2022  World happiness score for 2022

The first few rows of the dataset are shown below.

library(drone)
head(WorldHappinessScore)
  flagCode     country WorldHappinessScore_2024 WorldHappinessScore_2023
1       FI     Finland                     7.74                    7.804
2       DK     Denmark                     7.58                    7.586
3       IS     Iceland                     7.53                    7.530
4       SE      Sweden                     7.34                    7.395
5       IL      Israel                     7.34                    7.473
6       NL Netherlands                     7.32                    7.403
  WorldHappinessScore_2022
1                    7.821
2                    7.636
3                    7.557
4                    7.384
5                    7.364
6                    7.415

WorldHappinessScore: Data Profiling

Table 2 provides a compact summary of the dataset. There are 5 variables: 2 character variables and 3 numeric variables. The output summarizes the character and numeric variables separately. For character variables:

  • n_missing is the number of missing values in each variable.

  • complete_rate is the proportion of non-missing values.

  • n_unique is the number of unique values in the variable.

For numeric variables, Table 2 shows the mean, sd (standard deviation), p0 (minimum), the percentiles p25, p50 (median), and p75, p100 (maximum), and a small histogram for each variable.
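To make these definitions concrete, the sketch below computes the same quantities by hand in base R. The vector x holds hypothetical values, not values from the dataset. As a check against Table 2, WorldHappinessScore_2023 has 9 missing values out of 145 rows, so its complete_rate is 1 - 9/145 ≈ 0.94.

```r
# Toy vector (hypothetical values, not taken from the dataset).
x <- c(7.8, NA, 5.6, 5.6, 4.1, NA, 6.3, 2.4)

n_missing     <- sum(is.na(x))                 # number of NA values
complete_rate <- mean(!is.na(x))               # proportion of non-missing values
n_unique      <- length(unique(x[!is.na(x)]))  # distinct non-missing values

# p0, p25, p50, p75 and p100 are these quantiles of the non-missing values.
percentiles <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

print(c(n_missing = n_missing, complete_rate = complete_rate, n_unique = n_unique))
print(percentiles)
```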

Table 2: Summary description of the WorldHappinessScore dataset
Data summary
Name WorldHappinessScore
Number of rows 145
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
flagCode 1 0.99 2 2 0 144 0
country 0 1.00 4 22 0 145 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
WorldHappinessScore_2024 4 0.97 5.52 1.18 1.72 4.66 5.79 6.41 7.74 ▁▂▆▇▅
WorldHappinessScore_2023 9 0.94 5.53 1.14 1.86 4.70 5.67 6.33 7.80 ▁▂▆▇▃
WorldHappinessScore_2022 1 0.99 5.55 1.09 2.40 4.88 5.57 6.30 7.82 ▁▃▇▇▃
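The layout of Table 2 matches what skimr::skim() prints. The chapter does not name the package used, so the sketch below is an assumption about how a similar summary could be produced, not the author's confirmed code. The names have_pkgs and profile are illustrative.

```r
# Assumption: a summary like Table 2 can be produced with skimr::skim().
# install.packages("skimr")
have_pkgs <- requireNamespace("skimr", quietly = TRUE) &&
  requireNamespace("drone", quietly = TRUE)

if (have_pkgs) {
  # One row per variable, grouped by variable type (character, numeric, ...).
  profile <- skimr::skim(drone::WorldHappinessScore)
  print(profile)
} else {
  message("Install skimr and drone to run this example.")
}
```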

WorldHappinessScore: Data Quality Analysis

Figure 1 shows the variable types in the dataset and the distribution of missing values.

library(tidyverse)
library(visdat)
vis_dat(WorldHappinessScore) + 
  scale_fill_brewer(palette = "Dark2")
Figure 1: Variable types and missing values in the WorldHappinessScore dataset

Figure 2 shows the combinations of missing values in the WorldHappinessScore dataset. There are 6 observations where only WorldHappinessScore_2023 is missing, and 3 observations where both WorldHappinessScore_2023 and WorldHappinessScore_2024 are missing.

library(naniar)
gg_miss_upset(WorldHappinessScore) 
Figure 2: Combinations of missingness across cases: WorldHappinessScore dataset
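The counts behind a plot like Figure 2 can also be tabulated directly. The sketch below uses base R on a toy data frame with hypothetical values and shortened column names. Applying the same steps to the three score columns of WorldHappinessScore should give the counts read off Figure 2.

```r
# Toy data frame (hypothetical values; column names shortened for the sketch).
scores <- data.frame(
  y2024 = c(7.7, NA, 5.1, NA, 6.0),
  y2023 = c(7.8, NA, NA, 6.2, NA),
  y2022 = c(7.8, 6.1, 5.0, 6.3, 5.9)
)

# For each row, paste together the names of the columns that are NA.
missing_pattern <- apply(is.na(scores), 1, function(row) {
  paste(names(row)[row], collapse = " + ")
})

# Rows with no missing values give an empty string; label them explicitly.
missing_pattern[missing_pattern == ""] <- "complete"

# Frequency of each combination of missing variables.
pattern_counts <- table(missing_pattern)
print(pattern_counts)
```

Each entry of pattern_counts corresponds to one bar in the plot: the number of rows that share exactly that set of missing variables.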

Dataset 2: worldbankdata

This dataset provides country-level development indicators compiled from the World Bank Data Catalogue. It includes information on access to clean cooking fuels, access to electricity, income group classification, and regional grouping across multiple years.

The variable description is given in Table 3.

Table 3: Variable description for the worldbankdata dataset
Variable     Description
Country      Country name
Code         ISO country code
Region       World Bank regional classification
Year         Year of observation
Cooking      Access to clean fuels and technologies for cooking (percentage of population)
Electricity  Access to electricity (percentage of population)
Income       Income group classification: L = Low income, LM = Lower middle income, UM = Upper middle income, H = High income

To view the dataset, use the following code.

library(drone)
library(tibble)
data(worldbankdata)
worldbankdata
# A tibble: 7,937 × 7
   Country Code  Region                     Year Cooking Electricity Income
   <fct>   <fct> <fct>                     <dbl>   <dbl>       <dbl> <fct> 
 1 Aruba   ABW   Latin America & Caribbean  1990      NA       100   H     
 2 Aruba   ABW   Latin America & Caribbean  2000      NA        91.7 H     
 3 Aruba   ABW   Latin America & Caribbean  2013      NA       100   H     
 4 Aruba   ABW   Latin America & Caribbean  2014      NA       100   H     
 5 Aruba   ABW   Latin America & Caribbean  2015      NA       100   H     
 6 Aruba   ABW   Latin America & Caribbean  2016      NA       100   H     
 7 Aruba   ABW   Latin America & Caribbean  2017      NA       100   H     
 8 Aruba   ABW   Latin America & Caribbean  2018      NA       100   H     
 9 Aruba   ABW   Latin America & Caribbean  2019      NA       100   H     
10 Aruba   ABW   Latin America & Caribbean  2020      NA       100   H     
# ℹ 7,927 more rows

worldbankdata: Data Profiling

Table 4 provides a compact summary of the dataset. The dataset contains 7 variables: 4 factor variables and 3 numeric variables. In addition, Table 4 gives an overview of the factor variables (number of unique levels and the most frequent levels) and summary statistics for the numeric variables, including the number of missing values.

Table 4: Summary description of the worldbankdata dataset
Data summary
Name worldbankdata
Number of rows 7937
Number of columns 7
_______________________
Column type frequency:
factor 4
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Country 0 1.00 FALSE 227 Afg: 36, Alb: 36, Alg: 36, Ame: 36
Code 0 1.00 FALSE 218 CIV: 48, CUW: 47, CZE: 47, FRO: 47
Region 199 0.97 FALSE 7 Eur: 2038, Sub: 1693, Lat: 1512, Eas: 1343
Income 559 0.93 FALSE 4 H: 2177, LM: 1978, L: 1704, UM: 1519

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 1 1.00 2004.59 10.41 1987.00 1996.00 2005.00 2014 2022 ▇▇▇▇▇
Cooking 6047 0.24 65.48 38.46 0.00 27.30 84.75 100 100 ▃▁▁▁▇
Electricity 5693 0.28 84.40 26.45 2.11 79.86 99.80 100 100 ▁▁▁▁▇

worldbankdata: Data Quality Analysis

The worldbankdata dataset contains many missing values: according to Table 4, about 76% of the Cooking values and 72% of the Electricity values are missing. Figure 3 shows the variable types and the distribution of missing values.

library(tidyverse)
library(visdat)
vis_dat(worldbankdata) + 
  scale_fill_brewer(palette = "Dark2")
Figure 3: Data types and missing values distribution

Figure 4 shows combinations of missingness across cases. This is especially useful for understanding patterns of missingness across variables. For example, there are 5122 rows where both Electricity and Cooking are missing.

library(naniar)
gg_miss_upset(worldbankdata) 
Figure 4: Combinations of missingness across cases: worldbankdata dataset