STA 529 2.0 Data Mining

Week 1: June 29, 2024

Introduction to data mining

Classification and Regression Trees

Week 2: July 7, 2024

Measuring Performance

Data

library(mlbench)
data(BostonHousing)
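
As a rough illustration of measuring performance on this data, the sketch below holds out part of BostonHousing, fits a linear model, and reports the test-set RMSE. The 80/20 split, the seed, and the choice of `lm()` are assumptions made for the example, not necessarily the method used in the lecture.

```r
library(mlbench)
data(BostonHousing)  # BostonHousing is a dataset shipped with mlbench

set.seed(1)  # arbitrary seed, only for reproducibility
n <- nrow(BostonHousing)

# Assumed 80/20 train/test split
train_idx <- sample(n, size = floor(0.8 * n))
train <- BostonHousing[train_idx, ]
test  <- BostonHousing[-train_idx, ]

# Linear regression of median home value (medv) on all other variables
fit <- lm(medv ~ ., data = train)

# Root mean squared error on the held-out rows
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$medv - pred)^2))
rmse
```

Evaluating on rows the model never saw gives a more honest error estimate than the training error.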

Practical 1: Working with missing data

Lab 1

Practical 2: Classification

Lab 2

Week 3: July 14, 2024

Random Forests

Practical 3: Introduction to tidymodels

Lab 3

Extra reading: Click here

Week 4: August 11, 2024

AdaBoost and Gradient Boosting

Week 5: August 18, 2024

Pattern Mining: Market Basket Analysis

Lab 4

Week 6: August 25, 2024

Association rule:

Example 1: Lab 5.1

Example 2: Lab 5.2

Example 3:

library(arules)
data(Groceries)
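
One possible next step with the Groceries transactions is to mine association rules with `apriori()` and look at the rules with the highest lift. The support and confidence thresholds below are illustrative assumptions, not values prescribed by the lab.

```r
library(arules)
data(Groceries)

# Mine rules with at least two items; thresholds are assumed for illustration
rules <- apriori(
  Groceries,
  parameter = list(support = 0.01, confidence = 0.3, minlen = 2)
)

# Inspect the five rules with the highest lift
top_rules <- head(sort(rules, by = "lift"), 5)
inspect(top_rules)
```

Lowering the support threshold usually yields many more rules, so it helps to sort or filter before inspecting them.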

Example 3: Lab 5.3

Lab 6:

This question is based on “house.datamining2023.csv”, which is saved on your local machine. The dataset contains information on housing characteristics in various geographical locations.

Data

The variable description is as follows:

    longitude - The geographical coordinate specifying the east-west position of a location. 
    
    latitude - The geographical coordinate specifying the north-south position of a location.
   
    housing_median_age - The median age of houses in a specific area. 
    
    total_rooms - The total number of rooms in all housing units in a specific area.
    
    total_bedrooms - The total number of bedrooms in all households in a specific area.
    
    population - The total population of a specific area. 
    
    households - The total number of households in a specific area. 
    
    median_income - The median income of households in a specific area. 
    
    median_house_value - The median value of houses in a specific area. 
    
    ocean_proximity - The proximity of the housing unit to the ocean, categorized into different classes. 

Week 7: September 1, 2024

Collaborative filtering

Lab 7

Week 8: September 15, 2024

Resampling

Support Vector Machine

Week 9: October 6, 2024

Mid semester project presentation and discussing errors

Week 10: October 13, 2024

Cluster Analysis

Dataset: https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-02-11#data-dictionary

Develop a model to predict which actual hotel stays included children and/or babies.

library(readr)
hotels <- 
  read_csv("https://tidymodels.org/start/case-study/hotels.csv")
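
A minimal baseline for this task might look like the sketch below. It assumes the file has an outcome column `children` with the values "children" and "none", and numeric columns `lead_time`, `adults`, and `average_daily_rate`, as in the tidymodels case study. The 75/25 split, the seed, the 0.5 cutoff, and the choice of logistic regression are also assumptions for illustration.

```r
library(readr)

hotels <- read_csv("https://tidymodels.org/start/case-study/hotels.csv")

# Assumed outcome coding: "none" is the reference level, "children" the event
hotels$children <- factor(hotels$children, levels = c("none", "children"))

set.seed(123)  # arbitrary seed, only for reproducibility

# Assumed 75/25 train/test split
test_idx <- sample(nrow(hotels), size = floor(0.25 * nrow(hotels)))
train <- hotels[-test_idx, ]
test  <- hotels[test_idx, ]

# Logistic regression on a few numeric predictors (assumed column names)
fit <- glm(children ~ lead_time + adults + average_daily_rate,
           data = train, family = binomial)

# Predicted probability that a stay included children, then a 0.5 cutoff
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, "children", "none")

# Confusion table and overall accuracy on the held-out rows
table(predicted = pred, actual = test$children)
accuracy <- mean(pred == test$children)
accuracy
```

Because stays with children are likely the minority class, accuracy alone can look good for a model that almost always predicts "none", so the confusion table is worth reading alongside it.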