7  Data Wrangling

7.1 Exercise

This exercise is based on the DStidy dataset from the DSjobtracker package in R. This dataset was compiled from job advertisements for roles like statisticians and data scientists. It includes information on job roles, required qualifications, technical and soft skills, tools, platforms, and other relevant criteria frequently mentioned in the job advertisements. Write R code to answer each question. Use the provided output to verify the correctness of your code.

  1. How many job advertisements mention both R and Python as required skills?
[1] 437
  1. Which job categories require R but not Python?

Help: distinct function

# A tibble: 20 × 1
   Job_Category                                   
   <chr>                                          
 1 Unimportant                                    
 2 Data Science                                   
 3 Data Science and Data Engineering              
 4 Data Analyst                                   
 5 <NA>                                           
 6 AI                                             
 7 Business/Systems Analysts                      
 8 Computer/Information Technology                
 9 Science & Technology                           
10 Sciences                                       
11 Statistics                                     
12 Data science                                   
13 Information Services                           
14 Analyst, Engineering and Information Technology
15 Research, Analyst, and Information Technology  
16 Analyst                                        
17 Administrative                                 
18 Machine Learning                               
19 Data Analytics                                 
20 Actuarial Science                              
  1. What is the most common educational qualification required (BSc_needed, MSc_needed, PhD_needed) across all job categories?
# A tibble: 1 × 3
    BSc   MSc   PhD
  <int> <int> <int>
1   293   152    71
  1. What is the average salary for job roles that require SQL, grouped by Job_Category?
# A tibble: 26 × 2
   Job_Category                                  avg_salary
   <chr>                                              <dbl>
 1 Unimportant                                    11108498.
 2 <NA>                                            4689542.
 3 Machine Learning                                 108000 
 4 Data Analyst                                     107890 
 5 Data Science                                     102805.
 6 AI                                                95000 
 7 Data Science and Data Engineering                 78500 
 8 Data science                                      75125.
 9 Research, Analyst, and Information Technology     70970.
10 Internet Publishing                               70000 
# ℹ 16 more rows
  1. Calculate the proportion of job advertisements that require English by City.
# A tibble: 299 × 2
   City            Prop_English
   <chr>                  <dbl>
 1 Anaheim                    1
 2 Arges                      1
 3 Athens                     1
 4 Attiki                     1
 5 Berlin                     1
 6 Bugis                      1
 7 Calgary                    1
 8 Central Denmark            1
 9 Chengdu                    1
10 Darmstadt                  1
# ℹ 289 more rows
  1. Select only the columns: ID, Job_title, Company, and City.
# A tibble: 1,172 × 4
      ID Job_title                                   Company               City 
   <dbl> <chr>                                       <chr>                 <chr>
 1     1 <NA>                                        <NA>                  <NA> 
 2     2 Junior Data Scientist                       Dialog Axiata PLC     Colo…
 3     3 Engineer, Analytics & Data Science          London Stock Exchang… Colo…
 4     4 CI-Statistical Analyst/Business Analyst-CMB E.D. Bullard Company  Colo…
 5     5 DA-Data Analyst-CMB                         E.D. Bullard Company  Colo…
 6     6 Data Scientist                              Emirates Center for … Kual…
 7     7 Principal HR Data Scientist                 BHP Group Limited     Kual…
 8     8 DV-Data Visualization & QA Specialist-CMB   E.D. Bullard Company  Colo…
 9     9 Product Manager, Product Data Science       Numerator             Otta…
10    10 Associate Director, Data Science            Phreesia Inc.         Otta…
# ℹ 1,162 more rows
  1. Filter the dataset to include only jobs that mention Python as a required skill.
# A tibble: 774 × 109
      ID Consultant DateRetrieved       DatePublished       Job_title    Company
   <dbl> <chr>      <dttm>              <dttm>              <chr>        <chr>  
 1     1 Thiyanga   2020-08-05 00:00:00 NA                  <NA>         <NA>   
 2     2 Jayani     2020-08-07 00:00:00 2020-07-31 00:00:00 Junior Data… Dialog…
 3     3 Jayani     2020-08-07 00:00:00 2020-08-06 00:00:00 Engineer, A… London…
 4     6 Jayani     2020-08-07 00:00:00 2020-08-13 00:00:00 Data Scient… Emirat…
 5     7 Jayani     2020-08-13 00:00:00 2020-08-13 00:00:00 Principal H… BHP Gr…
 6    10 Jayani     2020-08-12 00:00:00 2020-08-11 00:00:00 Associate D… Phrees…
 7    13 Jayani     2020-08-13 00:00:00 2020-08-13 00:00:00 Data Scient… Visier…
 8    17 Jayani     2020-08-13 00:00:00 2020-08-11 00:00:00 Data Scienc… Clearb…
 9    18 Jayani     2020-08-13 00:00:00 2020-08-12 00:00:00 Marketing i… Yellow…
10    20 Jayani     2020-08-13 00:00:00 2020-08-13 00:00:00 Data Analyst Ernst …
# ℹ 764 more rows
# ℹ 103 more variables: R <dbl>, SAS <dbl>, SPSS <dbl>, Python <dbl>,
#   MAtlab <dbl>, Scala <dbl>, `C#` <dbl>, `MS Word` <dbl>, `Ms Excel` <dbl>,
#   `OLE/DB` <dbl>, `Ms Access` <dbl>, `Ms PowerPoint` <dbl>,
#   Spreadsheets <dbl>, Data_visualization <dbl>, Presentation_Skills <dbl>,
#   Communication <dbl>, BigData <dbl>, Data_warehouse <dbl>,
#   cloud_storage <dbl>, Google_Cloud <dbl>, AWS <dbl>, …
  1. Select all columns related to Microsoft Office tools.
# A tibble: 1,172 × 4
   `MS Word` `Ms Excel` `Ms Access` `Ms PowerPoint`
       <dbl>      <dbl>       <dbl>           <dbl>
 1         0          0           0               0
 2         0          0           0               0
 3         0          0           0               0
 4         0          0           0               0
 5         1          1           1               1
 6         0          0           0               0
 7         0          0           0               0
 8         1          1           0               1
 9         0          0           0               0
10         0          0           0               0
# ℹ 1,162 more rows
  1. Filter jobs that were published in or after the year 2023.

Help: use year function in the lubridate package

# A tibble: 251 × 109
      ID Consultant DateRetrieved       DatePublished       Job_title    Company
   <dbl> <chr>      <dttm>              <dttm>              <chr>        <chr>  
 1     1 Kavya      2023-01-05 18:30:00 2023-01-04 18:30:00 Senior Engi… DFCC B…
 2     2 Kavya      2023-01-05 18:30:00 2023-01-04 18:30:00 Database En… <NA>   
 3     3 Kavya      2023-01-05 18:30:00 2023-01-03 18:30:00 Database Ad… Abans  
 4     4 Kavya      2023-01-05 18:30:00 2023-01-02 18:30:00 Consulting … George…
 5     6 Kavya      2023-01-05 18:30:00 2023-01-05 18:30:00 Data Scient… LSEG   
 6     7 Kavya      2023-01-05 18:30:00 2023-01-05 18:30:00 D & A Azure… Ernst …
 7     8 Kavya      2023-01-05 18:30:00 2023-01-04 18:30:00 Data Analyst Novels…
 8     9 Kavya      2023-01-05 18:30:00 2023-01-04 18:30:00 Data Analyst ADA    
 9    10 Kavya      2023-01-05 18:30:00 2023-01-04 18:30:00 Machine Lea… Bestka…
10    12 Kavya      2023-01-05 18:30:00 2023-01-05 18:30:00 Data Scient… JKH    
# ℹ 241 more rows
# ℹ 103 more variables: R <dbl>, SAS <dbl>, SPSS <dbl>, Python <dbl>,
#   MAtlab <dbl>, Scala <dbl>, `C#` <dbl>, `MS Word` <dbl>, `Ms Excel` <dbl>,
#   `OLE/DB` <dbl>, `Ms Access` <dbl>, `Ms PowerPoint` <dbl>,
#   Spreadsheets <dbl>, Data_visualization <dbl>, Presentation_Skills <dbl>,
#   Communication <dbl>, BigData <dbl>, Data_warehouse <dbl>,
#   cloud_storage <dbl>, Google_Cloud <dbl>, AWS <dbl>, …
  1. Filter the jobs where both PhD_needed and MSc_needed are 0.
# A tibble: 758 × 109
      ID Consultant DateRetrieved       DatePublished       Job_title    Company
   <dbl> <chr>      <dttm>              <dttm>              <chr>        <chr>  
 1     1 Thiyanga   2020-08-05 00:00:00 NA                  <NA>         <NA>   
 2     2 Jayani     2020-08-07 00:00:00 2020-07-31 00:00:00 Junior Data… Dialog…
 3     3 Jayani     2020-08-07 00:00:00 2020-08-06 00:00:00 Engineer, A… London…
 4     4 Jayani     2020-08-07 00:00:00 2020-07-24 00:00:00 CI-Statisti… E.D. B…
 5     5 Jayani     2020-08-07 00:00:00 2020-07-24 00:00:00 DA-Data Ana… E.D. B…
 6     7 Jayani     2020-08-13 00:00:00 2020-08-13 00:00:00 Principal H… BHP Gr…
 7     8 Jayani     2020-08-07 00:00:00 2020-07-24 00:00:00 DV-Data Vis… E.D. B…
 8     9 Jayani     2020-08-13 00:00:00 2020-08-13 00:00:00 Product Man… Numera…
 9    10 Jayani     2020-08-12 00:00:00 2020-08-11 00:00:00 Associate D… Phrees…
10    11 Jayani     2020-08-13 00:00:00 2020-08-13 00:00:00 Data Engine… SDK    
# ℹ 748 more rows
# ℹ 103 more variables: R <dbl>, SAS <dbl>, SPSS <dbl>, Python <dbl>,
#   MAtlab <dbl>, Scala <dbl>, `C#` <dbl>, `MS Word` <dbl>, `Ms Excel` <dbl>,
#   `OLE/DB` <dbl>, `Ms Access` <dbl>, `Ms PowerPoint` <dbl>,
#   Spreadsheets <dbl>, Data_visualization <dbl>, Presentation_Skills <dbl>,
#   Communication <dbl>, BigData <dbl>, Data_warehouse <dbl>,
#   cloud_storage <dbl>, Google_Cloud <dbl>, AWS <dbl>, …
  1. Filter out all rows where Salary is missing (NA).
# A tibble: 239 × 109
      ID Consultant DateRetrieved       DatePublished       Job_title    Company
   <dbl> <chr>      <dttm>              <dttm>              <chr>        <chr>  
 1    12 Jayani     2020-08-13 00:00:00 2020-08-13 00:00:00 Data Analyst EPCOR …
 2    24 Jayani     2020-08-13 00:00:00 2020-08-13 00:00:00 Data Scient… Elabra…
 3    45 Jayani     2020-08-13 00:00:00 2020-08-11 00:00:00 Associate P… SnapHu…
 4    52 Thimani    2020-08-07 00:00:00 2020-07-24 00:00:00 Data Scienc… Summit…
 5    54 Thimani    2020-08-07 00:00:00 2020-08-01 00:00:00 Data Scient… PA con…
 6    56 Thimani    2020-08-07 00:00:00 2020-07-31 00:00:00 Data Scient… Brilli…
 7    57 Thimani    2020-08-07 00:00:00 2020-07-14 00:00:00 Data Scient… G-Rese…
 8    58 Thimani    2020-08-07 00:00:00 2020-07-19 00:00:00 Manager-Dat… Biz2Cr…
 9    62 Thimani    2020-08-09 00:00:00 2020-07-19 00:00:00 Senior Mana… Britis…
10    63 Thimani    2020-08-09 00:00:00 2020-07-23 00:00:00 VP, Data Sc… 7Park …
# ℹ 229 more rows
# ℹ 103 more variables: R <dbl>, SAS <dbl>, SPSS <dbl>, Python <dbl>,
#   MAtlab <dbl>, Scala <dbl>, `C#` <dbl>, `MS Word` <dbl>, `Ms Excel` <dbl>,
#   `OLE/DB` <dbl>, `Ms Access` <dbl>, `Ms PowerPoint` <dbl>,
#   Spreadsheets <dbl>, Data_visualization <dbl>, Presentation_Skills <dbl>,
#   Communication <dbl>, BigData <dbl>, Data_warehouse <dbl>,
#   cloud_storage <dbl>, Google_Cloud <dbl>, AWS <dbl>, …
  1. For each Experience_Category, compute the percentage of jobs that require Presentation_Skills.
# A tibble: 5 × 2
  Experience_Category                prop_presentation
  <chr>                                          <dbl>
1 More than 10 years                             0.25 
2 More than 2 and less than 5 years              0.306
3 More than 5 and less than 10 years             0.318
4 Two or less years                              0.275
5 Unknown or Not needed                          0.360