ITCS 3162 Project 3

Introduction

I completed this project for a class, using a dataset from a Kaggle competition. The goal of the competition is to predict the sale price of a house from 79 features of the house, and the goal of the project is to compare 3 different approaches to data cleaning, pre-processing, and modeling to find the one with the best accuracy. I will be creating 3 different regression models to answer the following questions.

  • What features most contribute to the sale price of a house?
  • What approach to data pre-processing minimizes the RMSE of my regression model?
  • Which regression model has the best accuracy?

  1. Introduction
    1. The Dataset
    2. What is Regression?
  2. Experiment 1
    1. Data Understanding
    2. Data Pre-processing
    3. Modeling & Evaluation
  3. Experiment 2
    1. Data Understanding
    2. Data Pre-processing
    3. Modeling & Evaluation
  4. Experiment 3
    1. Data Understanding
    2. Data Pre-processing
    3. Modeling & Evaluation
  5. Conclusion
    1. Impact / Conclusion
    2. References

The Dataset

This dataset came from a Kaggle competition. There are 79 variables, so I’ve summarized them into broad categories to make understanding them a bit easier.

| Feature Category | Explanation | Variables in this Category |
| --- | --- | --- |
| House type | What the house is and what its zoning is | MSSubClass, MSZoning, BldgType, HouseStyle |
| House location | Features of the area near the house, like neighborhood, road type, nearby landmarks | Street, Alley, Neighborhood, Condition1, Condition2 |
| Lot features | Features related to the land the house stands on | LotFrontage, LotArea, LotShape, LandContour, LotConfig, LandSlope |
| House features (categorical) | Features of the house, such as utility type or exterior materials | Utilities, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, Foundation, Heating, Electrical, GarageType, GarageFinish, MiscFeature |
| House features (numerical) | Features of the house that are (or can be converted to) a range of numerical values | HeatingQC, CentralAir, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, FireplaceQu, PavedDrive, PoolArea, PoolQC, Fence |
| Quality Ratings | Ratings of different parts of the house with no “N/A” rating possible | OverallQual, OverallCond, ExterQual, ExterCond, KitchenQual, Functional |
| Basement features | Features specifically related to the basement. This is a unique category because not all houses have basements | BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF |
| Garage features | Features specifically related to the garage. Like with basements, not all houses have garages | GarageCars, GarageArea, GarageQual, GarageCond |
| Room Count | Count of different types of room/feature | BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, Bedroom, Kitchen, TotRmsAbvGrd, Fireplaces |
| Porch Area | Square footage of different types of porch | WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch |
| Construction Years | Dates related to construction of the house | YearBuilt, YearRemodAdd, GarageYrBlt |
| Sale Information | Information related to the sale or price of the house | MiscVal, MoSold, YrSold, SaleType, SaleCondition, SalePrice (training only) |

What is Regression?

Regression is a form of machine learning that predicts a numerical target value from one or more independent variables, by fitting a mathematical equation that computes the dependent variable from the independent variables. Regression typically requires numerical inputs. Under linear regression, that fitted equation is a straight line.
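As a minimal illustration (separate from this project's pipeline), here is a linear regression fit in R on the built-in mtcars dataset, predicting fuel economy from car weight:

```r
#Fit a straight line predicting fuel economy (mpg) from weight (wt, in 1000s of lbs)
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)  #intercept ~37.3, slope ~-5.3: each extra 1000 lbs costs ~5.3 mpg

#Use the fitted line to predict mpg for a hypothetical 3,000 lb car
predict(fit, newdata = data.frame(wt = 3.0))
```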


Experiment 1

For this experiment, I decided to go with a very simple approach to pre-processing: I remove the variables that are not immediately compatible with linear regression. This will be a baseline that I can then compare my other two experiments against.

Libraries
library(dplyr)    #data manipulation
library(tidyr)    #drop_na() and replace_na()
library(stringr)  #str_replace(), used in experiment 2
library(Metrics)  #rmse()

Data Understanding

Because of my simple approach to pre-processing for this experiment, all I did was look at each variable’s type. Id and MSSubClass are stored as integers, but they aren’t really numerical data (one is an identifier, the other a category code), so I’ll cut them out in addition to the non-numeric columns.

Code
sapply(train, typeof)
Output
           Id    MSSubClass      MSZoning   LotFrontage       LotArea        Street         Alley 
    "integer"     "integer"   "character"     "integer"     "integer"   "character"   "character" 
     LotShape   LandContour     Utilities     LotConfig     LandSlope  Neighborhood    Condition1 
  "character"   "character"   "character"   "character"   "character"   "character"   "character" 
   Condition2      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt  YearRemodAdd 
  "character"   "character"   "character"     "integer"     "integer"     "integer"     "integer" 
    RoofStyle      RoofMatl   Exterior1st   Exterior2nd    MasVnrType    MasVnrArea     ExterQual 
  "character"   "character"   "character"   "character"   "character"     "integer"   "character" 
    ExterCond    Foundation      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
  "character"   "character"   "character"   "character"   "character"   "character"     "integer" 
 BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating     HeatingQC    CentralAir 
  "character"     "integer"     "integer"     "integer"   "character"   "character"   "character" 
   Electrical     X1stFlrSF     X2ndFlrSF  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath 
  "character"     "integer"     "integer"     "integer"     "integer"     "integer"     "integer" 
     FullBath      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd    Functional 
    "integer"     "integer"     "integer"     "integer"   "character"     "integer"   "character" 
   Fireplaces   FireplaceQu    GarageType   GarageYrBlt  GarageFinish    GarageCars    GarageArea 
    "integer"   "character"   "character"     "integer"   "character"     "integer"     "integer" 
   GarageQual    GarageCond    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
  "character"   "character"   "character"     "integer"     "integer"     "integer"     "integer" 
  ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature       MiscVal        MoSold 
    "integer"     "integer"   "character"   "character"   "character"     "integer"     "integer" 
       YrSold      SaleType SaleCondition     SalePrice 
    "integer"   "character"   "character"     "integer" 

Data Pre-processing

This code drops the Id and MSSubClass columns, then all non-numeric columns, before dropping all rows with any NA values. This reduces the training data from 1460 rows × 80 columns to 1121 rows × 36 columns. To prevent data loss, I saved this pruned-down data as a new dataframe rather than overwriting the original.

Code
#Drop Id and MSSubClass, keep only numeric columns, then drop rows with NAs
ex1_train <- train[,-1:-2] %>% select(where(is.numeric)) %>% drop_na()

Modeling & Evaluation

The Kaggle competition withholds the sale prices of the testing dataset, so I’ll be training and evaluating using the training dataset instead. I’ll be evaluating all my models using RMSE, and for this very basic approach to pre-processing, I got an RMSE of 36767.15.

Code
#Linear Regression Model
ex1_lm <- lm(SalePrice ~ ., ex1_train)

#Evaluation using root mean squared error (RMSE)
ex1_pred <- predict.lm(ex1_lm, ex1_train)
rmse(ex1_train$SalePrice, ex1_pred)
Output
> rmse(ex1_train$SalePrice, ex1_pred)
[1] 36767.15
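For reference, RMSE is just the square root of the mean squared prediction error. This quick check uses made-up sale prices (not values from the dataset) to show that the manual formula matches Metrics::rmse:

```r
library(Metrics)

#RMSE computed by hand vs. with Metrics::rmse, on made-up sale prices
actual    <- c(200000, 150000, 320000)
predicted <- c(210000, 140000, 300000)

sqrt(mean((actual - predicted)^2))  #~14142.14
rmse(actual, predicted)             #same value
```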

Experiment 2

For this experiment, I decided to use just the rating-based factors, where the quality of different parts of the house is given a score. In some cases, I’ll need to convert them from character to numeric.

Data Understanding

For this experiment, I looked at the data_description.txt file that came with the data so I know which columns need to be dropped, converted, or left as-is.

| Variable | Conversion Needed? | Has NAs? |
| --- | --- | --- |
| OverallQual | No | No |
| OverallCond | No | No |
| ExterQual | Yes | No |
| ExterCond | Yes | No |
| BsmtQual | Yes | Yes |
| BsmtExposure | Yes | Yes |
| HeatingQC | Yes | Yes |
| CentralAir | Yes | No |
| KitchenQual | Yes | No |
| Functional | Yes | No |
| FireplaceQu | Yes | Yes |
| GarageQual | Yes | Yes |
| GarageCond | Yes | Yes |
| PoolQC | Yes | Yes |
| Fence | Yes | Yes |

More rows have an NA in PoolQC and Fence than not (1453 and 1179 respectively), so I’ll drop those two columns in addition to those not listed.


Data Pre-processing

Although all of these variables are grading scales for various parts of the house, they don’t all come in numeric format. Because of that, I needed to process them by converting each grade to a number. Most follow an Ex/Gd/TA/Fa/Po format, while BsmtExposure, CentralAir, and Functional each have unique grading scales. For all these columns, an NA means the feature is not present, so I decided to replace all NAs with a score of 0. Using BsmtQual as an example, that means there are 6 possible levels:

  • Ex (Excellent): 5
  • Gd (Good): 4
  • TA (Typical/Average): 3
  • Fa (Fair): 2
  • Po (Poor): 1
  • NA (No Basement): 0

Other variables have similar grading scales, though not all of them include an NA level, and the 3 unique cases I mentioned each have fewer or more than 6 levels.
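As a side note, the same recoding can also be done with a named lookup vector instead of repeated str_replace calls. This toy example uses made-up BsmtQual values, not the real training data:

```r
library(dplyr)

#Toy example: recode an Ex/Gd/TA/Fa/Po grade to 5-1, with NA -> 0,
#using a named lookup vector (the data here is made up)
grade_levels <- c(Ex = 5, Gd = 4, TA = 3, Fa = 2, Po = 1)
toy <- data.frame(BsmtQual = c("Ex", "TA", NA, "Po"))

#Indexing the lookup vector by name converts each grade; NA stays NA,
#then coalesce() turns NA into 0
toy <- toy %>% mutate(BsmtQual = coalesce(unname(grade_levels[BsmtQual]), 0))
toy$BsmtQual  #5 3 0 1
```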

Code
#Selecting only our desired columns
ex2_train <- train %>% select(OverallQual, OverallCond, ExterQual, ExterCond, BsmtQual, BsmtExposure, HeatingQC, CentralAir, KitchenQual, Functional, FireplaceQu, GarageQual, GarageCond, SalePrice)

#Fixing columns with Ex, Gd, TA, Fa, Po levels
cols_to_fix <- c('ExterQual', 'ExterCond', 'BsmtQual', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond')
ex2_train <- ex2_train %>% 
  mutate(across(all_of(cols_to_fix), ~ str_replace(.x, 'Ex', '5'))) %>% 
  mutate(across(all_of(cols_to_fix), ~ str_replace(.x, 'Gd', '4'))) %>% 
  mutate(across(all_of(cols_to_fix), ~ str_replace(.x, 'TA', '3'))) %>% 
  mutate(across(all_of(cols_to_fix), ~ str_replace(.x, 'Fa', '2'))) %>% 
  mutate(across(all_of(cols_to_fix), ~ str_replace(.x, 'Po', '1'))) %>%
  mutate(across(all_of(cols_to_fix), ~ replace_na(.x, '0')))

#Fixing BsmtExposure
ex2_train <- ex2_train %>% 
  mutate(across(BsmtExposure, ~ str_replace(.x, 'Gd', '4'))) %>% 
  mutate(across(BsmtExposure, ~ str_replace(.x, 'Av', '3'))) %>% 
  mutate(across(BsmtExposure, ~ str_replace(.x, 'Mn', '2'))) %>% 
  mutate(across(BsmtExposure, ~ str_replace(.x, 'No', '1'))) %>% 
  mutate(across(BsmtExposure, ~ replace_na(.x, '0')))

#Fixing CentralAir
ex2_train <- ex2_train %>% 
  mutate(across(CentralAir, ~ str_replace(.x, 'Y', '1'))) %>% 
  mutate(across(CentralAir, ~ str_replace(.x, 'N', '0')))

#Fixing Functional
ex2_train <- ex2_train %>% 
  mutate(across(Functional, ~ str_replace(.x, 'Typ', '7'))) %>%
  mutate(across(Functional, ~ str_replace(.x, 'Min1', '6'))) %>% 
  mutate(across(Functional, ~ str_replace(.x, 'Min2', '5'))) %>% 
  mutate(across(Functional, ~ str_replace(.x, 'Mod', '4'))) %>% 
  mutate(across(Functional, ~ str_replace(.x, 'Maj1', '3'))) %>% 
  mutate(across(Functional, ~ str_replace(.x, 'Maj2', '2'))) %>% 
  mutate(across(Functional, ~ str_replace(.x, 'Sev', '1'))) %>% 
  mutate(across(Functional, ~ str_replace(.x, 'Sal', '0')))
  
#Convert all remaining character columns to numeric
ex2_train <- ex2_train %>% mutate(across(where(is.character), as.numeric))

Modeling & Evaluation

Modeling this experiment went the same as the first experiment. The resulting RMSE was 43202.16, which is higher than in experiment 1, meaning the model was less accurate. However, it’s worth noting that experiment 2 uses only 13 factors to predict the final value, while experiment 1 uses 35. So I feel it’s worth carrying all of experiment 2’s factors forward into experiment 3.

Code
#Linear Regression Model
ex2_lm <- lm(SalePrice ~ ., ex2_train)

#Evaluation using root mean squared error (RMSE)
ex2_pred <- predict.lm(ex2_lm, ex2_train)
rmse(ex2_train$SalePrice, ex2_pred)
Output
> rmse(ex2_train$SalePrice, ex2_pred)
[1] 43202.16

Experiment 3

For this experiment, I decided to use experiment 2’s dataset as a base, and then add on any of the numeric columns with no NA values.

Data Understanding

I checked the numeric variables to see which had NA values in them, so that I can exclude them from the list of variables I’ll be adding back to the data. LotFrontage, GarageYrBlt, and MasVnrArea were the only ones with NAs, so they’ll be the only ones excluded for that reason. As mentioned earlier, Id and MSSubClass are also going to be excluded as they are not truly numeric.

Code
#Data understanding: Check the numeric columns for if they have NAs
colSums(is.na(select(train, where(is.numeric))))
Output
           Id    MSSubClass   LotFrontage       LotArea   OverallQual   OverallCond     YearBuilt 
            0             0           259             0             0             0             0 
 YearRemodAdd    MasVnrArea    BsmtFinSF1    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF     X1stFlrSF 
            0             8             0             0             0             0             0 
    X2ndFlrSF  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath 
            0             0             0             0             0             0             0 
 BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces   GarageYrBlt    GarageCars    GarageArea 
            0             0             0             0            81             0             0 
   WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch   ScreenPorch      PoolArea       MiscVal 
            0             0             0             0             0             0             0 
       MoSold        YrSold     SalePrice 
            0             0             0 

Data Pre-processing

In this step, I take the dataframe I created for experiment 2 and add back the relevant columns from the base dataset. This gives me 43 predictor variables (44 columns including SalePrice) to use to predict the sale price, more than in experiment 1 or experiment 2.

Code
#Keep the numeric columns, dropping Id and MSSubClass (train[,-1:-2]),
#the three columns with NAs, and the columns already present in ex2_train
ex3_train <- train[,-1:-2] %>% select(where(is.numeric)) %>% 
  select(-LotFrontage, -GarageYrBlt, -MasVnrArea,   #have NAs
         -OverallQual, -OverallCond, -SalePrice)    #duplicates of ex2_train columns
ex3_train <- bind_cols(ex3_train, ex2_train)

#Confirm no NAs remain
colSums(is.na(ex3_train))

Modeling & Evaluation

TotalBsmtSF and GrLivArea had coefficients of NA in this model, making it unusable. This happens because each is an exact sum of other predictors in the model (the basement square-footage columns and the floor square-footage columns, respectively), so lm() drops them as redundant. To fix this, I simply dropped those two columns and remade the linear regression model. My final RMSE was 33399.57, which is better than what I got for experiment 1 or experiment 2.

Code
#Linear Regression Model
ex3_lm <- lm(SalePrice ~ ., ex3_train)
ex3_lm

#TotalBsmtSF and GrLivArea give NA, so I'll drop them and go again
ex3_train <- ex3_train %>% select(-GrLivArea, -TotalBsmtSF)
ex3_lm <- lm(SalePrice ~ ., ex3_train)
ex3_lm

#Evaluation using root mean squared error (RMSE)
ex3_pred <- predict.lm(ex3_lm, ex3_train)
rmse(train$SalePrice, ex3_pred)
Output
> rmse(train$SalePrice, ex3_pred)
[1] 33399.57
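A tiny made-up example reproduces the NA-coefficient behavior: when one predictor is an exact linear combination of others, the design matrix is rank deficient and lm() reports NA for the redundant column.

```r
#Made-up data: 'total' is an exact linear combination of 'a' and 'b',
#so lm() aliases (drops) it and reports its coefficient as NA
set.seed(1)
a <- rnorm(20)
b <- rnorm(20)
total <- a + b                      #perfectly collinear with a and b
y <- 2 * a + 3 * b + rnorm(20, sd = 0.1)

coef(lm(y ~ a + b + total))         #coefficient on 'total' is NA
```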

Conclusion

Impact / Conclusion

In this project my main comparison ended up being driven by the number of variables, since I kept to fairly limited forms of data pre-processing. However, the rating-based model (experiment 2) gave up far more variables than accuracy compared to the stats-based model (experiment 1). This reminds me of previous experiences I’ve had, where a carefully selected subset of variables can lead to a model that is more accurate than one that uses more data overall. In this case the larger model was more accurate, but it’s possible that if I went through all the variables, hand-picked the ones I think are most likely to be helpful, and used only those, I might be able to create a model that’s more accurate than any of the three I used here.

References

These are the functions I had to look up documentation for when working on this project.