Regression in Julia with the California Housing Dataset
This project is based on Season 3, Episode 1 of the Kaggle Playground Series. The title of this episode is: “Tabular Regression with the California Housing Dataset”. Our task is to predict the median housing value of a block group of housing.
In this project, my goal is to use the Julia programming language for the first time in a machine learning application. I will be using the Julia package Pluto as my notebook environment for the project. I like to think of Pluto notebooks as upgrades of Jupyter notebooks since they have many additional features that Jupyter lacks (such as reactivity).
The target variable, median housing value, is a continuous variable and so regression techniques are needed. To keep things simple as I learn to use Julia, I will be applying linear regression only.
Let’s get started.
California Housing Data Description
The dataset I will be using here was synthetically generated from the original California Housing dataset. This allows us to use real-world data while not compromising privacy in any way.
As previously mentioned, the target variable is
MedHousVal – median housing value. The features of the dataset are:
MedInc– median income in block
HouseAge– median house age in block
AveRooms– average number of rooms per house
AveBedrms– average number of bedrooms per house
Population– population of block
AveOccup– average number of household members per house
Lattitude– latitude of block
Longitude– longitude of block
Setting up a Julia Environment
Pluto notebooks also serve as their own environments. Each Pluto notebook automatically stores information on the packages used, their dependencies, and their corresponding versions. However, I wanted to store this information about my environment separately so that I can share my code with anyone that uses Julia, without them needing to specifically use Pluto too.
Julia has built-in package and environment management through a package called
Pkg. Once you start a Julia session in your terminal you simply type the right square bracket
] and you will enter ‘package mode’ where you can activate an environment and add, remove, or update packages.
If you navigated to the directory of your project in your terminal, you can simply type
activate . and the current directory becomes the environment.
To switch off Pluto’s automatic package management, we must activate the project environment from inside Pluto using this code:
begin import Pkg # Activate the environment at the current working directory Pkg.activate(Base.current_project()) # Instantiate to prevent missing dependencies Pkg.instantiate # Specify the packages we will be using in this project using StatsKit, MLJ, Gadfly, Random, MLBase import Plots end
So, I will be using the following packages for this project:
- StatsKit – collection of essential stats packages
- MLJ – machine learning toolkit
- Gadfly – data visualization (similar to GGPlot2 on R)
- Random – random number generator for setting the seed
- MLBase – basic machine learning functionality
- Plots – Plots is imported only so that I can plot a correlation heat map
During the data exploration phase, I’m looking for any issues that may need to be addressed (like missing values or extreme outliers) and visualizing the data.
A first inspection of the data doesn’t reveal any obvious issues, and I don’t notice anything special that warrants further investigation.
I decided to create an additional variable called ‘BlockDensity’. This uses the total population and average number of occupants per block to indicate how many houses are located in a block. The hypothesis was that dense blocks could either indicate smaller houses that cost less or it could indicate tall apartment buildings that cost more.
# Number of houses in a block (ie. block density) mod_df.BlockDensity = mod_df.Population ./ mod_df.AveOccup
The main visualization that I look at is the correlation heat map to get an idea of how correlated the target is with the features (and the features with each other).
begin cols = names(mod_df) corr_matrix = cor(Matrix(mod_df)) # PLOT (n,m) = size(corr_matrix) Plots.heatmap(corr_matrix, xticks=(1:m,cols), xrot=90, yticks=(1:m,cols), yflip=true, c = :bluesreds) Plots.annotate!([(j, i, (round(corr_matrix[i,j],digits=1), 8,"Computer Modern",:white)) for i in 1:n for j in 1:m]) end
This quickly reveals a strong positive correlation between the target (median house value) and median income. This makes sense since people with higher incomes can generally afford more expensive houses.
Since I’ll will be using linear regression for this project, I need to deal with outliers as they can skew the results. For this reason, I wrote a function that finds outliers based on a method by Tukey called the outer fences.
These formulas allow us to identify data points that are far out from the first and third quartiles.
# Finding outliers past the outer fences # (Q1 - (3*IQR); Q3 + (3*IQR)) function outer_fences(df, X) print("\n $(X)") # Quantiles quantiles = quantile!(df[!, X], [0.25, 0.75]) q1 = quantiles q3 = quantiles # Interquartile range IQR = q3-q1 # Fences fence_1 = q1 - (3*IQR) fence_2 = q3 + (3*IQR) print("\n Fence cutoffs: ($(round(fence_1, digits=2)), $(round(fence_2,digits=2)))") # Count the number of samples that are outside the fences outside_samples = df[!, X][(df[!, X] .< fence_1) .| (df[!, X] .> fence_2), :] count_outside_samples = size(outside_samples) print("\n Number of outside samples: $(count_outside_samples)") return X, fence_1, fence_2 end
I probably spent far too much time on this problem of outliers than I should have, but I was able to identify and remove around 1625 data points from the training data.
Fitting the Model
To fit a linear regression model I used the
lm function from the GLM package. The only downside to this function is that you must specify every feature in the formula for the regression equation, which was a bit of a pain.
linearRegressor = lm(@formula(MedHouseVal ~ MedInc + HouseAge + AveRooms + AveBedrms + Population + AveOccup + Latitude + Longitude + BlockDensity), train)
The R2 value for this model looked promising but the RMSE was a little too high. Printing the R2 value was straightforward:
# R Square value of the model (0.9194) r2(linearRegressor)
However, getting the RMSE took a bit more effort. First, I define the RMSE function definition:
# RMSE function definition function rmse(performance_df) rmse = sqrt(mean(performance_df.error.*performance_df.error)) return rmse end
Then a dataframe is created with the predicted and error values for the training and test sets which can then be entered into the RMSE formula. The RMSE for the test set is 0.3147 and is 0.3111 for the train set.
begin # Prediction ypredicted_test = GLM.predict(linearRegressor, test) ypredicted_train = GLM.predict(linearRegressor, train) # Test Performance DataFrame performance_testdf = DataFrame(y_actual = test[!,:MedHouseVal], y_predicted = ypredicted_test) performance_testdf.error = performance_testdf[!,:y_actual] - performance_testdf[!,:y_predicted] performance_testdf.error_sq = performance_testdf.error.*performance_testdf.error # Train Performance DataFrame performance_traindf = DataFrame(y_actual = train[!,:MedHouseVal], y_predicted = ypredicted_train) performance_traindf.error = performance_traindf[!,:y_actual] - performance_traindf[!,:y_predicted] performance_traindf.error_sq = performance_traindf.error.*performance_traindf.error end
Overall, this was a good learning experience using Julia for the first time. Some aspects of the Julia language seemed unnecessarily complicated and required additional function definitions or steps. However, on the other hand, there were many other aspects of Julia that made it a pleasure to work with (such as the extremely simple environment management).
Check out the associated Github repo for the full code.