Introduction
I completed this project for a class, using a dataset from a Kaggle competition. The goal of the competition is to predict the sale price of a house from 79 features of the house, and the goal of the project is to compare three different approaches to data cleaning, preprocessing, and modeling to find the one with the best accuracy. I will be creating three different regression models to answer the following questions:
- What features most contribute to the sale price of a house?
- What approach to data pre-processing minimizes the RMSE of my regression model?
- Which regression model has the best accuracy?
The Dataset
This dataset came from Kaggle as part of a Kaggle Competition. There are 79 variables, so I’ve summarized them into broad categories to make understanding them a bit easier.
| Feature Category | Explanation | Variables in this Category |
| --- | --- | --- |
| House type | What the house is and what its zoning is | MSSubClass, MSZoning, BldgType, HouseStyle |
| House location | Features of the area near to the house, like neighborhood, road type, nearby landmarks | Street, Alley, Neighborhood, Condition1, Condition2 |
| Lot features | Features related to the land the house stands on | LotFrontage, LotArea, LotShape, LandContour, LotConfig, LandSlope |
| House features (categorical) | Features of the house, such as utility type or exterior materials. | Utilities, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, Foundation, Heating, Electrical, GarageType, GarageFinish, MiscFeature |
| House features (numerical) | Features of the house that are (or can be converted to) a range of numerical values | HeatingQC, CentralAir, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, FireplaceQu, PavedDrive, PoolArea, PoolQC, Fence |
| Quality Ratings | Ratings of different parts of the house with no “N/A” rating possible | OverallQual, OverallCond, ExterQual, ExterCond, KitchenQual, Functional |
| Basement features | Features specifically related to the basement. This is a unique category because not all houses have basements. | BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF |
| Garage features | Features specifically related to the garage. Like with basements, not all houses have garages. | GarageCars, GarageArea, GarageQual, GarageCond |
| Room Count | Count of different types of room/feature. | BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, Bedroom, Kitchen, TotRmsAbvGrd, Fireplaces |
| Porch Area | Square footage of different types of porch. | WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch |
| Construction Years | Dates related to construction of the house. | YearBuilt, YearRemodAdd, GarageYrBlt |
| Sale Information | Information related to the sale or price of the house | MiscVal, MoSold, YrSold, SaleType, SaleCondition, SalePrice (training only) |
What is Regression?
Regression is a form of machine learning that predicts a target numerical value from one or more independent variables by fitting a mathematical equation that maps those variables to the dependent variable. Regression typically requires numerical variables. Under linear regression, that equation is a straight line that predicts the output.
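As a quick illustration of the idea, here is a minimal sketch on made-up data (not the housing dataset): `lm()` fits exactly this kind of straight-line equation, where the true relationship is a known slope plus noise.

```r
# Linear regression on synthetic data: price grows linearly with area, plus noise.
set.seed(42)
area  <- runif(100, 500, 3000)                   # square footage
price <- 50000 + 90 * area + rnorm(100, sd = 10000)
toy   <- data.frame(area = area, price = price)

fit <- lm(price ~ area, data = toy)              # fits price = b0 + b1 * area
coef(fit)                                        # estimated intercept and slope (near 50000 and 90)
predict(fit, data.frame(area = 1500))            # predicted price for a 1500 sq ft house
```

The fitted slope lands close to the true value of 90, which is the sense in which the line "predicts the output."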
Experiment 1
For this experiment, I decided to go with a very simplistic approach to pre-processing: I simply remove the columns that are not immediately compatible with linear regression. This gives me a baseline that I can then compare to my other two experiments.
Libraries
library(dplyr)
library(tidyr)
library(stringr) #Provides str_replace, used in Experiment 2
library(Metrics)
Data Understanding
Because of my simplistic approach to pre-processing for this experiment, all I did was look at each variable's type. Id and MSSubClass are stored as integers, but they aren't really numerical data (they're identifier codes), so I'll cut them out in addition to the non-numeric columns.
Code
sapply(train, typeof)
Output
Id MSSubClass MSZoning LotFrontage LotArea Street Alley
"integer" "integer" "character" "integer" "integer" "character" "character"
LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1
"character" "character" "character" "character" "character" "character" "character"
Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
"character" "character" "character" "integer" "integer" "integer" "integer"
RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual
"character" "character" "character" "character" "character" "integer" "character"
ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
"character" "character" "character" "character" "character" "character" "integer"
BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
"character" "integer" "integer" "integer" "character" "character" "character"
Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
"character" "integer" "integer" "integer" "integer" "integer" "integer"
FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
"integer" "integer" "integer" "integer" "character" "integer" "character"
Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea
"integer" "character" "character" "integer" "character" "integer" "integer"
GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
"character" "character" "character" "integer" "integer" "integer" "integer"
ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold
"integer" "integer" "character" "character" "character" "integer" "integer"
YrSold SaleType SaleCondition SalePrice
"integer" "character" "character" "integer"
Data Pre-processing
This code drops the Id and MSSubClass columns, then all non-numeric columns, and finally all rows with any NA values. This reduces the training data from 1460 rows × 80 columns to 1121 rows × 36 columns. To prevent data loss, I put this pruned-down dataset into its own data frame rather than overwriting the original.
Code
ex1_train <- train[,-1:-2] %>% select(where(is.numeric)) %>% drop_na()
Modeling & Evaluation
The Kaggle competition withholds the sale prices of the testing dataset, so I'll be training and evaluating on the training dataset instead (which means these RMSE values measure fit to the training data rather than generalization). I'll be evaluating all my models using RMSE, and for this very basic approach to pre-processing, I got an RMSE of 36767.15.
Code
#Linear Regression Model
ex1_lm <- lm(SalePrice ~ ., ex1_train)
#Evaluation using root mean squared error (RMSE)
ex1_pred <- predict.lm(ex1_lm, ex1_train)
rmse(ex1_train$SalePrice, ex1_pred)
Output
> rmse(ex1_train$SalePrice, ex1_pred)
[1] 36767.15
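For reference, RMSE is just the square root of the mean squared difference between actual and predicted values. A hand-rolled version (shown here on toy vectors so it doesn't depend on the housing data) makes the formula concrete:

```r
# RMSE by hand: square root of the mean squared residual.
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
rmse_manual(actual, predicted)  # errors of 10, -10, 30 give RMSE ≈ 19.15
```

This is the same quantity `Metrics::rmse` computes, in the units of the target (dollars, here).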
Experiment 2
For this experiment, I decided to use just the rating-based factors, where the quality of different parts of the house is given a score. In some cases I'll need to convert them from character to numeric.
Data Understanding
For this experiment, I looked at the data_description.txt file that came with the data so I know which columns need to be dropped, converted, or left as-is.
| Variable | Conversion Needed? | Has NAs? |
| --- | --- | --- |
| OverallQual | No | No |
| OverallCond | No | No |
| ExterQual | Yes | No |
| ExterCond | Yes | No |
| BsmtQual | Yes | Yes |
| BsmtExposure | Yes | Yes |
| HeatingQC | Yes | Yes |
| CentralAir | Yes | No |
| KitchenQual | Yes | No |
| Functional | Yes | No |
| FireplaceQu | Yes | Yes |
| GarageQual | Yes | Yes |
| GarageCond | Yes | Yes |
| PoolQC | Yes | Yes |
| Fence | Yes | Yes |
More rows have an NA in PoolQC and Fence than not (1453 and 1179 respectively), so I'll drop those two columns in addition to the columns not listed above.
Data Pre-processing
Although all of these variables are grading scales for various parts of the house, they don't all come in numeric format. Because of that, I needed to process them by converting each given grade to a number. Most follow an Ex/Gd/TA/Fa/Po format, while BsmtExposure, CentralAir, and Functional each have unique grading scales. For all these columns, an NA means the feature is not present, so I decided to replace all NAs with a score of 0. Using BsmtQual as an example, that means there are six possible levels:
- Ex (Excellent): 5
- Gd (Good): 4
- TA (Typical/Average): 3
- Fa (Fair): 2
- Po (Poor): 1
- NA (No Basement): 0
Other variables have similar grading scales, though not all have an NA level, and the three unique cases I mentioned have more or fewer than six levels.
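One compact way to express this kind of conversion (a sketch on a toy vector, using the same Ex/Gd/TA/Fa/Po scale; `grade_to_score` is a hypothetical helper, not part of my pipeline) is a named lookup vector that maps each grade straight to its score and handles NA in one step:

```r
# Map the Ex/Gd/TA/Fa/Po scale to scores; NA (feature absent) becomes 0.
grade_levels <- c(Ex = 5, Gd = 4, TA = 3, Fa = 2, Po = 1)

grade_to_score <- function(x) {
  score <- unname(grade_levels[x])  # named-vector lookup; NA input yields NA
  ifelse(is.na(score), 0, score)    # NA means "no such feature", scored as 0
}

grade_to_score(c("Gd", "TA", NA, "Ex"))  # 4 3 0 5
```

The same named vector could back a conversion for a full column; below I stick with the string-replacement approach I actually used.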
Code
#Selecting only our desired columns
ex2_train <- train %>% select(OverallQual, OverallCond, ExterQual, ExterCond, BsmtQual, BsmtExposure, HeatingQC, CentralAir, KitchenQual, Functional, FireplaceQu, GarageQual, GarageCond, SalePrice)
#Fixing columns with Ex, Gd, TA, Fa, Po levels
cols_to_fix <- c('ExterQual', 'ExterCond', 'BsmtQual', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond')
ex2_train <- ex2_train %>%
  mutate(across(all_of(cols_to_fix), ~ str_replace_all(.x, c(Ex = "5", Gd = "4", TA = "3", Fa = "2", Po = "1")))) %>%
  mutate(across(all_of(cols_to_fix), ~ replace_na(.x, "0")))
#Fixing BsmtExposure
ex2_train <- ex2_train %>%
  mutate(BsmtExposure = str_replace_all(BsmtExposure, c(Gd = "4", Av = "3", Mn = "2", No = "1"))) %>%
  mutate(BsmtExposure = replace_na(BsmtExposure, "0"))
#Fixing CentralAir
ex2_train <- ex2_train %>%
  mutate(CentralAir = str_replace_all(CentralAir, c(Y = "1", N = "0")))
#Fixing Functional
ex2_train <- ex2_train %>%
  mutate(Functional = str_replace_all(Functional, c(Typ = "7", Min1 = "6", Min2 = "5", Mod = "4", Maj1 = "3", Maj2 = "2", Sev = "1", Sal = "0")))
#Convert all to numeric
ex2_train <- ex2_train %>% mutate(across(where(is.character), as.numeric))
Modeling & Evaluation
Modeling this experiment went the same as with the first experiment. The resulting RMSE was 43202.16, which is higher than in experiment 1, meaning the model was less accurate. However, it's worth noting that experiment 2 uses only 13 predictors, while experiment 1 uses 35. So I feel it's worth carrying all the factors from experiment 2 forward into experiment 3.
Code
#Linear Regression Model
ex2_lm <- lm(SalePrice ~ ., ex2_train)
#Evaluation using root mean squared error (RMSE)
ex2_pred <- predict.lm(ex2_lm, ex2_train)
rmse(ex2_train$SalePrice, ex2_pred)
Output
> rmse(ex2_train$SalePrice, ex2_pred)
[1] 43202.16
Experiment 3
For this experiment, I decided to use experiment 2's data frame as a base, and then add on any of the numeric columns with no NA values.
Data Understanding
I checked the numeric variables to see which had NA values in them, so that I could exclude those from the list of variables I'll be adding back to the data. LotFrontage, GarageYrBlt, and MasVnrArea were the only ones with NAs, so they'll be the only ones excluded for that reason. As mentioned earlier, Id and MSSubClass are also excluded, as they are not truly numeric.
Code
#Data understanding: Check the numeric columns for if they have NAs
colSums(is.na(select(train, where(is.numeric))))
Output
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt
0 0 259 0 0 0 0
YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF
0 8 0 0 0 0 0
X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
0 0 0 0 0 0 0
BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars GarageArea
0 0 0 0 81 0 0
WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal
0 0 0 0 0 0 0
MoSold YrSold SalePrice
0 0 0
Data Pre-processing
In this step, I take the data frame I created for experiment 2 and add back the relevant columns from the base dataset. This gives me 44 variables to use to predict the sale price, which is more than in Experiment 1 or Experiment 2.
Code
ex3_train <- train[,-1:-2] %>% select(where(is.numeric)) %>%
  select(-c(LotFrontage, GarageYrBlt, MasVnrArea)) %>% #Drop numeric columns that contain NAs
  select(-c(OverallQual, OverallCond, SalePrice)) #Drop columns already present in ex2_train
ex3_train <- bind_cols(ex3_train, ex2_train)
colSums(is.na(ex3_train))
Modeling & Evaluation
TotalBsmtSF and GrLivArea had coefficients of NA in this model: each is an exact sum of other columns already in the data (the basement and floor square footages), so they are perfectly collinear and lm cannot estimate them. To fix this, I simply dropped those two columns and refit the linear regression model. My final RMSE was 33399.57, which is better than what I got for experiment 1 or experiment 2.
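A tiny reproduction on made-up data (not the housing set) shows why this happens: when one predictor is an exact linear combination of others, `lm()` reports NA for its coefficient.

```r
# A column that is an exact sum of two other predictors is perfectly
# collinear, so lm() cannot estimate its coefficient and reports NA.
set.seed(1)
df <- data.frame(a = rnorm(20), b = rnorm(20))
df$total <- df$a + df$b                      # exact sum of the other two columns
df$y     <- 2 * df$a + 3 * df$b + rnorm(20, sd = 0.1)

fit <- lm(y ~ a + b + total, data = df)
coef(fit)["total"]                           # NA: rank-deficient design matrix
```

Dropping the redundant column (as I do below) removes the rank deficiency without losing any information.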
Code
#Linear Regression Model
ex3_lm <- lm(SalePrice ~ ., ex3_train)
ex3_lm
#TotalBsmtSF and GrLivArea give NA coefficients, so I'll drop them and go again
ex3_train <- ex3_train %>% select(-c(GrLivArea, TotalBsmtSF))
ex3_lm <- lm(SalePrice ~ ., ex3_train)
ex3_lm
#Evaluation using root mean squared error (RMSE)
ex3_pred <- predict.lm(ex3_lm, ex3_train)
rmse(train$SalePrice, ex3_pred)
Output
> rmse(train$SalePrice, ex3_pred)
[1] 33399.57
Conclusion
In this project my main comparison ended up being based on the number of variables, since I kept to somewhat limited forms of data pre-processing. However, the rating-based model (experiment 2) gave up far more variables than it lost in accuracy relative to the stats-based model (experiment 1). This matches previous experiences I've had, where selectively chosen variables can lead to a model that is more accurate than one that simply uses more data overall. In this case the larger model was more accurate, but it's possible that if I went through all the variables, chose the ones I judged most likely to be helpful, and used only those, I could create a model that's more accurate than any of the three I built here.
References
These are the functions I had to look up documentation for when working on this project.