| Variable | Description |
|---|---|
| flagCode | Country flag code |
| country | Country name |
| WorldHappinessScore_2024 | World happiness score for 2024 |
| WorldHappinessScore_2023 | World happiness score for 2023 |
| WorldHappinessScore_2022 | World happiness score for 2022 |
Data and setting up your workflow
The goal of this chapter is to provide readers with an overview of the datasets used in the book’s examples. Having an initial understanding of the data helps readers easily navigate between the examples.
Installation of associated packages
To run the examples in this book, install the packages listed below. In addition to this list, install the package associated with each geom as it is introduced. The drone package provides the datasets used throughout this geom encyclopedia.
# install.packages("drone")
# install.packages("devtools")
devtools::install_github("thiyangt/drone")
install.packages("tidyverse")
install.packages("patchwork")
Datasets used in the geom encyclopedia
The datasets used in the book come from country-level statistics obtained from reliable public websites. Only two datasets are used to explain all geoms, so that readers can build intuition by seeing the same data represented in multiple ways. The reasons for using these datasets are:
The dataset context is familiar and easily understood by individuals from diverse backgrounds.
Ability to create cross-sectional, time-series, spatial, and spatio-temporal visualizations.
Ability to visualize relationships between qualitative–qualitative, qualitative–quantitative, and quantitative–quantitative variables.
Experience addressing common data challenges such as missing values, overplotting, and large-scale datasets.
Dataset 1: WorldHappinessScore
The dataset is obtained from the World Population Review report (World Population Review (n.d.)). The World Happiness Score is reported on a 0–10 scale, with higher values indicating greater happiness. The dataset contains World Happiness Scores for 145 countries for the years 2022, 2023, and 2024. The variable descriptions are given in Table 1.
The first few rows of the dataset are shown below.
library(drone)
head(WorldHappinessScore)
  flagCode     country WorldHappinessScore_2024 WorldHappinessScore_2023
1       FI     Finland                     7.74                    7.804
2       DK     Denmark                     7.58                    7.586
3       IS     Iceland                     7.53                    7.530
4       SE      Sweden                     7.34                    7.395
5       IL      Israel                     7.34                    7.473
6       NL Netherlands                     7.32                    7.403
  WorldHappinessScore_2022
1                    7.821
2                    7.636
3                    7.557
4                    7.384
5                    7.364
6                    7.415
WorldHappinessScore: Data Profiling
Table 2 provides a compact summary of the dataset. There are 5 variables: 2 are character variables and 3 are numeric variables. The output summarises the character variables and the numeric variables separately. For character variables:
n_missing tells how many values are missing for each variable.
complete_rate is the proportion of non-missing values.
n_unique is the number of unique values in the variable.
For numeric variables, Table 2 shows the mean, sd (standard deviation), p0 (minimum), the percentiles p25, p50 (median), and p75, p100 (maximum), and a small histogram of each variable.
| | |
|---|---|
| Name | WorldHappinessScore |
| Number of rows | 145 |
| Number of columns | 5 |
| Column type frequency: | |
| character | 2 |
| numeric | 3 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| flagCode | 1 | 0.99 | 2 | 2 | 0 | 144 | 0 |
| country | 0 | 1.00 | 4 | 22 | 0 | 145 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| WorldHappinessScore_2024 | 4 | 0.97 | 5.52 | 1.18 | 1.72 | 4.66 | 5.79 | 6.41 | 7.74 | ▁▂▆▇▅ |
| WorldHappinessScore_2023 | 9 | 0.94 | 5.53 | 1.14 | 1.86 | 4.70 | 5.67 | 6.33 | 7.80 | ▁▂▆▇▃ |
| WorldHappinessScore_2022 | 1 | 0.99 | 5.55 | 1.09 | 2.40 | 4.88 | 5.57 | 6.30 | 7.82 | ▁▃▇▇▃ |
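The columns n_missing, complete_rate, and n_unique described above can also be computed by hand, which may make their definitions more concrete. The following is a minimal sketch in base R on a small made-up data frame; the object `toy` and its values are illustrative assumptions, not the real WorldHappinessScore data, and the code is not necessarily how Table 2 was produced.

```r
# Toy data frame with made-up values (an assumption for illustration only)
toy <- data.frame(
  flagCode = c("FI", "DK", NA, "SE"),
  country  = c("Finland", "Denmark", "Iceland", "Sweden"),
  score    = c(7.74, NA, 7.53, 7.34),
  stringsAsFactors = FALSE
)

# One row per variable, mirroring three columns of the summary table
profile <- data.frame(
  skim_variable = names(toy),
  # number of NA values in each column
  n_missing     = unname(vapply(toy, function(x) sum(is.na(x)), integer(1))),
  # proportion of non-missing values in each column
  complete_rate = unname(vapply(toy, function(x) mean(!is.na(x)), numeric(1))),
  # number of distinct non-missing values in each column
  n_unique      = unname(vapply(toy, function(x) length(unique(x[!is.na(x)])),
                                integer(1))),
  stringsAsFactors = FALSE
)

print(profile)
```

Note that complete_rate is simply 1 minus n_missing divided by the number of rows, so the two columns always carry the same information on different scales.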
WorldHappinessScore: Data Quality Analysis
Figure 1 shows the type of each variable in the dataset and where the missing values occur.
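library(tidyverse)  # loads ggplot2, which provides scale_fill_brewer()
library(visdat)     # provides vis_dat()
# Show the type of every cell, with missing values marked separately
vis_dat(WorldHappinessScore) +
  scale_fill_brewer(palette = "Dark2")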
Figure 2 shows the combinations of variables that are missing together in the WorldHappinessScore dataset. There are 6 observations in which only WorldHappinessScore_2023 is missing, and 3 observations in which both WorldHappinessScore_2023 and WorldHappinessScore_2024 are missing.
library(naniar)
gg_miss_upset(WorldHappinessScore)
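The counts that an upset plot of missingness displays can be tabulated directly, which helps when reading the figure. The following is a minimal sketch in base R, assuming a small made-up data frame (`toy` and its column names are hypothetical, not the real data): each row is labelled by the set of variables missing in it, and the labels are then counted.

```r
# Toy data with made-up values; NA marks a missing score
toy <- data.frame(
  score_2024 = c(7.7, NA, 7.5, NA, 6.1, 5.0),
  score_2023 = c(7.8, NA, NA, NA, NA, 5.2),
  score_2022 = c(7.8, 7.6, 7.5, 7.4, 6.0, 5.1)
)

# Logical matrix: TRUE where a value is missing
miss <- is.na(toy)

# Label each row by the set of variables that are missing in it
pattern <- apply(miss, 1, function(r) {
  if (any(r)) paste(colnames(miss)[r], collapse = " + ") else "complete"
})

# Number of rows for each missingness combination
pattern_counts <- table(pattern)
print(pattern_counts)
```

Each non-"complete" entry of `pattern_counts` corresponds to one bar of an upset plot: a single variable name is a row where only that variable is missing, and names joined by "+" are rows where those variables are missing together.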
Dataset 2: worldbankdata
This dataset provides country-level development indicators compiled from the World Bank Data Catalogue. It includes information on access to clean cooking fuels, access to electricity, income group classification, and regional grouping across multiple years.
The variable description is given in Table 3.
| Variable | Description |
|---|---|
| Country | Country name |
| Code | ISO country code |
| Region | World Bank regional classification |
| Year | Year of observation |
| Cooking | Access to clean fuels and technologies for cooking (percentage of population). |
| Electricity | Access to electricity (percentage of population). |
| Income | Income group classification: L = Low income, LM = Lower middle income, UM = Upper middle income, H = High income. |
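Because the income groups have a natural order, it can be convenient to store them as an ordered factor so that tables and plots list them from lowest to highest. The following is a minimal sketch, assuming the codes L, LM, UM, and H that appear in the data; the vector `income` is made up for illustration and is not taken from worldbankdata.

```r
# Made-up vector of income codes (illustrative assumption)
income <- c("H", "L", "UM", "LM", "H", NA)

# Recode to an ordered factor, lowest income group first
income_ordered <- factor(
  income,
  levels  = c("L", "LM", "UM", "H"),
  labels  = c("Low income", "Lower middle income",
              "Upper middle income", "High income"),
  ordered = TRUE
)

# Counts now appear from lowest to highest income group, with NA last
print(table(income_ordered, useNA = "ifany"))
```

With an ordered factor, comparisons such as `income_ordered[2] < income_ordered[1]` are meaningful, and plotting functions generally follow the level order rather than alphabetical order.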
To view the dataset, use the following code.
library(drone)
library(tibble)
data(worldbankdata)
worldbankdata
# A tibble: 7,937 × 7
   Country Code  Region                     Year Cooking Electricity Income
   <fct>   <fct> <fct>                     <dbl>   <dbl>       <dbl> <fct>
 1 Aruba   ABW   Latin America & Caribbean  1990      NA       100   H
 2 Aruba   ABW   Latin America & Caribbean  2000      NA        91.7 H
 3 Aruba   ABW   Latin America & Caribbean  2013      NA       100   H
 4 Aruba   ABW   Latin America & Caribbean  2014      NA       100   H
 5 Aruba   ABW   Latin America & Caribbean  2015      NA       100   H
 6 Aruba   ABW   Latin America & Caribbean  2016      NA       100   H
 7 Aruba   ABW   Latin America & Caribbean  2017      NA       100   H
 8 Aruba   ABW   Latin America & Caribbean  2018      NA       100   H
 9 Aruba   ABW   Latin America & Caribbean  2019      NA       100   H
10 Aruba   ABW   Latin America & Caribbean  2020      NA       100   H
# ℹ 7,927 more rows
worldbankdata: Data Profiling
Table 4 provides a compact summary of the dataset. The dataset contains 7 variables: 4 are factor variables and 3 are numeric variables. For the factor variables, Table 4 shows the number of missing values, the number of unique levels, and the most frequent levels with their counts. For the numeric variables, it shows the number of missing values and summary statistics.
| | |
|---|---|
| Name | worldbankdata |
| Number of rows | 7937 |
| Number of columns | 7 |
| Column type frequency: | |
| factor | 4 |
| numeric | 3 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Country | 0 | 1.00 | FALSE | 227 | Afg: 36, Alb: 36, Alg: 36, Ame: 36 |
| Code | 0 | 1.00 | FALSE | 218 | CIV: 48, CUW: 47, CZE: 47, FRO: 47 |
| Region | 199 | 0.97 | FALSE | 7 | Eur: 2038, Sub: 1693, Lat: 1512, Eas: 1343 |
| Income | 559 | 0.93 | FALSE | 4 | H: 2177, LM: 1978, L: 1704, UM: 1519 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 1 | 1.00 | 2004.59 | 10.41 | 1987.00 | 1996.00 | 2005.00 | 2014 | 2022 | ▇▇▇▇▇ |
| Cooking | 6047 | 0.24 | 65.48 | 38.46 | 0.00 | 27.30 | 84.75 | 100 | 100 | ▃▁▁▁▇ |
| Electricity | 5693 | 0.28 | 84.40 | 26.45 | 2.11 | 79.86 | 99.80 | 100 | 100 | ▁▁▁▁▇ |
worldbankdata: Data Quality Analysis
The worldbankdata dataset contains missing values. Figure 3 shows the type of each variable and where the missing values occur.
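library(tidyverse)  # loads ggplot2, which provides scale_fill_brewer()
library(visdat)     # provides vis_dat()
# Show the type of every cell, with missing values marked separately
vis_dat(worldbankdata) +
  scale_fill_brewer(palette = "Dark2")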
Figure 4 shows the combinations of variables that are missing together across rows, which is especially useful for understanding patterns of missingness. For example, there are 5122 rows in which both the Electricity and Cooking variables are missing.
library(naniar)
gg_miss_upset(worldbankdata)