[1] 437
7 Data Wrangling
7.1 Exercise
This exercise is based on the DStidy
dataset from the DSjobtracker
package in R. This dataset was compiled from job advertisements for roles like statisticians and data scientists. It includes information on job roles, required qualifications, technical and soft skills, tools, platforms, and other relevant criteria frequently mentioned in the job advertisements. Write R code to answer each question. Use the provided output to verify the correctness of your code.
- How many job advertisements mention both
R
andPython
as required skills?
- Which job categories require R but not Python?
Help: distinct
function
# A tibble: 20 × 1
Job_Category
<chr>
1 Unimportant
2 Data Science
3 Data Science and Data Engineering
4 Data Analyst
5 <NA>
6 AI
7 Business/Systems Analysts
8 Computer/Information Technology
9 Science & Technology
10 Sciences
11 Statistics
12 Data science
13 Information Services
14 Analyst, Engineering and Information Technology
15 Research, Analyst, and Information Technology
16 Analyst
17 Administrative
18 Machine Learning
19 Data Analytics
20 Actuarial Science
- What is the most common educational qualification required (BSc_needed, MSc_needed, PhD_needed) across all job categories?
# A tibble: 1 × 3
BSc MSc PhD
<int> <int> <int>
1 293 152 71
- What is the average salary for job roles that require SQL, grouped by Job_Category?
# A tibble: 26 × 2
Job_Category avg_salary
<chr> <dbl>
1 Unimportant 11108498.
2 <NA> 4689542.
3 Machine Learning 108000
4 Data Analyst 107890
5 Data Science 102805.
6 AI 95000
7 Data Science and Data Engineering 78500
8 Data science 75125.
9 Research, Analyst, and Information Technology 70970.
10 Internet Publishing 70000
# ℹ 16 more rows
- Calculate the proportion of job advertisements that require English by City.
# A tibble: 299 × 2
City Prop_English
<chr> <dbl>
1 Anaheim 1
2 Arges 1
3 Athens 1
4 Attiki 1
5 Berlin 1
6 Bugis 1
7 Calgary 1
8 Central Denmark 1
9 Chengdu 1
10 Darmstadt 1
# ℹ 289 more rows
- Select only the columns: ID, Job_title, Company, and City.
# A tibble: 1,172 × 4
ID Job_title Company City
<dbl> <chr> <chr> <chr>
1 1 <NA> <NA> <NA>
2 2 Junior Data Scientist Dialog Axiata PLC Colo…
3 3 Engineer, Analytics & Data Science London Stock Exchang… Colo…
4 4 CI-Statistical Analyst/Business Analyst-CMB E.D. Bullard Company Colo…
5 5 DA-Data Analyst-CMB E.D. Bullard Company Colo…
6 6 Data Scientist Emirates Center for … Kual…
7 7 Principal HR Data Scientist BHP Group Limited Kual…
8 8 DV-Data Visualization & QA Specialist-CMB E.D. Bullard Company Colo…
9 9 Product Manager, Product Data Science Numerator Otta…
10 10 Associate Director, Data Science Phreesia Inc. Otta…
# ℹ 1,162 more rows
- Filter the dataset to include only jobs that mention Python as a required skill.
# A tibble: 774 × 109
ID Consultant DateRetrieved DatePublished Job_title Company
<dbl> <chr> <dttm> <dttm> <chr> <chr>
1 1 Thiyanga 2020-08-05 00:00:00 NA <NA> <NA>
2 2 Jayani 2020-08-07 00:00:00 2020-07-31 00:00:00 Junior Data… Dialog…
3 3 Jayani 2020-08-07 00:00:00 2020-08-06 00:00:00 Engineer, A… London…
4 6 Jayani 2020-08-07 00:00:00 2020-08-13 00:00:00 Data Scient… Emirat…
5 7 Jayani 2020-08-13 00:00:00 2020-08-13 00:00:00 Principal H… BHP Gr…
6 10 Jayani 2020-08-12 00:00:00 2020-08-11 00:00:00 Associate D… Phrees…
7 13 Jayani 2020-08-13 00:00:00 2020-08-13 00:00:00 Data Scient… Visier…
8 17 Jayani 2020-08-13 00:00:00 2020-08-11 00:00:00 Data Scienc… Clearb…
9 18 Jayani 2020-08-13 00:00:00 2020-08-12 00:00:00 Marketing i… Yellow…
10 20 Jayani 2020-08-13 00:00:00 2020-08-13 00:00:00 Data Analyst Ernst …
# ℹ 764 more rows
# ℹ 103 more variables: R <dbl>, SAS <dbl>, SPSS <dbl>, Python <dbl>,
# MAtlab <dbl>, Scala <dbl>, `C#` <dbl>, `MS Word` <dbl>, `Ms Excel` <dbl>,
# `OLE/DB` <dbl>, `Ms Access` <dbl>, `Ms PowerPoint` <dbl>,
# Spreadsheets <dbl>, Data_visualization <dbl>, Presentation_Skills <dbl>,
# Communication <dbl>, BigData <dbl>, Data_warehouse <dbl>,
# cloud_storage <dbl>, Google_Cloud <dbl>, AWS <dbl>, …
- Select all columns related to Microsoft Office tools.
# A tibble: 1,172 × 4
`MS Word` `Ms Excel` `Ms Access` `Ms PowerPoint`
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 1 1 1 1
6 0 0 0 0
7 0 0 0 0
8 1 1 0 1
9 0 0 0 0
10 0 0 0 0
# ℹ 1,162 more rows
- Filter jobs that were published in or after the year 2023.
Help: use year
function in the lubridate package
# A tibble: 251 × 109
ID Consultant DateRetrieved DatePublished Job_title Company
<dbl> <chr> <dttm> <dttm> <chr> <chr>
1 1 Kavya 2023-01-05 18:30:00 2023-01-04 18:30:00 Senior Engi… DFCC B…
2 2 Kavya 2023-01-05 18:30:00 2023-01-04 18:30:00 Database En… <NA>
3 3 Kavya 2023-01-05 18:30:00 2023-01-03 18:30:00 Database Ad… Abans
4 4 Kavya 2023-01-05 18:30:00 2023-01-02 18:30:00 Consulting … George…
5 6 Kavya 2023-01-05 18:30:00 2023-01-05 18:30:00 Data Scient… LSEG
6 7 Kavya 2023-01-05 18:30:00 2023-01-05 18:30:00 D & A Azure… Ernst …
7 8 Kavya 2023-01-05 18:30:00 2023-01-04 18:30:00 Data Analyst Novels…
8 9 Kavya 2023-01-05 18:30:00 2023-01-04 18:30:00 Data Analyst ADA
9 10 Kavya 2023-01-05 18:30:00 2023-01-04 18:30:00 Machine Lea… Bestka…
10 12 Kavya 2023-01-05 18:30:00 2023-01-05 18:30:00 Data Scient… JKH
# ℹ 241 more rows
# ℹ 103 more variables: R <dbl>, SAS <dbl>, SPSS <dbl>, Python <dbl>,
# MAtlab <dbl>, Scala <dbl>, `C#` <dbl>, `MS Word` <dbl>, `Ms Excel` <dbl>,
# `OLE/DB` <dbl>, `Ms Access` <dbl>, `Ms PowerPoint` <dbl>,
# Spreadsheets <dbl>, Data_visualization <dbl>, Presentation_Skills <dbl>,
# Communication <dbl>, BigData <dbl>, Data_warehouse <dbl>,
# cloud_storage <dbl>, Google_Cloud <dbl>, AWS <dbl>, …
- Filter the jobs where both PhD_needed and MSc_needed are 0.
# A tibble: 758 × 109
ID Consultant DateRetrieved DatePublished Job_title Company
<dbl> <chr> <dttm> <dttm> <chr> <chr>
1 1 Thiyanga 2020-08-05 00:00:00 NA <NA> <NA>
2 2 Jayani 2020-08-07 00:00:00 2020-07-31 00:00:00 Junior Data… Dialog…
3 3 Jayani 2020-08-07 00:00:00 2020-08-06 00:00:00 Engineer, A… London…
4 4 Jayani 2020-08-07 00:00:00 2020-07-24 00:00:00 CI-Statisti… E.D. B…
5 5 Jayani 2020-08-07 00:00:00 2020-07-24 00:00:00 DA-Data Ana… E.D. B…
6 7 Jayani 2020-08-13 00:00:00 2020-08-13 00:00:00 Principal H… BHP Gr…
7 8 Jayani 2020-08-07 00:00:00 2020-07-24 00:00:00 DV-Data Vis… E.D. B…
8 9 Jayani 2020-08-13 00:00:00 2020-08-13 00:00:00 Product Man… Numera…
9 10 Jayani 2020-08-12 00:00:00 2020-08-11 00:00:00 Associate D… Phrees…
10 11 Jayani 2020-08-13 00:00:00 2020-08-13 00:00:00 Data Engine… SDK
# ℹ 748 more rows
# ℹ 103 more variables: R <dbl>, SAS <dbl>, SPSS <dbl>, Python <dbl>,
# MAtlab <dbl>, Scala <dbl>, `C#` <dbl>, `MS Word` <dbl>, `Ms Excel` <dbl>,
# `OLE/DB` <dbl>, `Ms Access` <dbl>, `Ms PowerPoint` <dbl>,
# Spreadsheets <dbl>, Data_visualization <dbl>, Presentation_Skills <dbl>,
# Communication <dbl>, BigData <dbl>, Data_warehouse <dbl>,
# cloud_storage <dbl>, Google_Cloud <dbl>, AWS <dbl>, …
- Filter out all rows where Salary is missing (NA).
# A tibble: 239 × 109
ID Consultant DateRetrieved DatePublished Job_title Company
<dbl> <chr> <dttm> <dttm> <chr> <chr>
1 12 Jayani 2020-08-13 00:00:00 2020-08-13 00:00:00 Data Analyst EPCOR …
2 24 Jayani 2020-08-13 00:00:00 2020-08-13 00:00:00 Data Scient… Elabra…
3 45 Jayani 2020-08-13 00:00:00 2020-08-11 00:00:00 Associate P… SnapHu…
4 52 Thimani 2020-08-07 00:00:00 2020-07-24 00:00:00 Data Scienc… Summit…
5 54 Thimani 2020-08-07 00:00:00 2020-08-01 00:00:00 Data Scient… PA con…
6 56 Thimani 2020-08-07 00:00:00 2020-07-31 00:00:00 Data Scient… Brilli…
7 57 Thimani 2020-08-07 00:00:00 2020-07-14 00:00:00 Data Scient… G-Rese…
8 58 Thimani 2020-08-07 00:00:00 2020-07-19 00:00:00 Manager-Dat… Biz2Cr…
9 62 Thimani 2020-08-09 00:00:00 2020-07-19 00:00:00 Senior Mana… Britis…
10 63 Thimani 2020-08-09 00:00:00 2020-07-23 00:00:00 VP, Data Sc… 7Park …
# ℹ 229 more rows
# ℹ 103 more variables: R <dbl>, SAS <dbl>, SPSS <dbl>, Python <dbl>,
# MAtlab <dbl>, Scala <dbl>, `C#` <dbl>, `MS Word` <dbl>, `Ms Excel` <dbl>,
# `OLE/DB` <dbl>, `Ms Access` <dbl>, `Ms PowerPoint` <dbl>,
# Spreadsheets <dbl>, Data_visualization <dbl>, Presentation_Skills <dbl>,
# Communication <dbl>, BigData <dbl>, Data_warehouse <dbl>,
# cloud_storage <dbl>, Google_Cloud <dbl>, AWS <dbl>, …
- For each Experience_Category, compute the percentage of jobs that require Presentation_Skills.
# A tibble: 5 × 2
Experience_Category prop_presentation
<chr> <dbl>
1 More than 10 years 0.25
2 More than 2 and less than 5 years 0.306
3 More than 5 and less than 10 years 0.318
4 Two or less years 0.275
5 Unknown or Not needed 0.360