Regression in Julia with the California Housing Dataset

This project is based on Season 3, Episode 1 of the Kaggle Playground Series. The title of this episode is “Tabular Regression with the California Housing Dataset”. Our task is to predict the median house value for a block group of housing.

In this project, my goal is to use the Julia programming language for the first time in a machine learning application. I will be using the Julia package Pluto as my notebook environment for the project. I like to think of Pluto notebooks as upgrades of Jupyter notebooks since they have many additional features that Jupyter lacks (such as reactivity).

The target variable, median housing value, is a continuous variable and so regression techniques are needed. To keep things simple as I learn to use Julia, I will be applying linear regression only.

Let’s get started.

California Housing Data Description

The dataset I will be using here was synthetically generated from the original California Housing dataset. This allows us to use real-world data while not compromising privacy in any way.

As previously mentioned, the target variable is MedHouseVal – median house value. The features of the dataset are:

  • MedInc – median income in block
  • HouseAge – median house age in block
  • AveRooms – average number of rooms per house
  • AveBedrms – average number of bedrooms per house
  • Population – population of block
  • AveOccup – average number of household members per house
  • Latitude – latitude of block
  • Longitude – longitude of block

Setting up a Julia Environment

Pluto notebooks also serve as their own environments. Each Pluto notebook automatically stores information on the packages used, their dependencies, and their corresponding versions. However, I wanted to store this information about my environment separately so that I can share my code with anyone who uses Julia, without them needing to use Pluto specifically.

Julia has built-in package and environment management through a package called Pkg. Once you start a Julia session in your terminal, you simply type the right square bracket ] to enter ‘package mode’, where you can activate an environment and add, remove, or update packages.

If you have navigated to your project's directory in the terminal, you can simply type activate . and the current directory becomes the active environment, as shown below.
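
For example, a terminal session might look something like this (the prompts and package list are illustrative, and MyProject stands in for your folder name):

julia> ]                           # enter package mode
(@v1.10) pkg> activate .           # make the current directory the active environment
(MyProject) pkg> add StatsKit MLJ Gadfly MLBase Plots
(MyProject) pkg> status            # list what Project.toml now records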

To switch off Pluto’s automatic package management, we must activate the project environment from inside Pluto using this code:

begin
	import Pkg

	# Activate the environment at the current working directory
	Pkg.activate(Base.current_project())

	# Instantiate to prevent missing dependencies
	Pkg.instantiate()

	# Specify the packages we will be using in this project
	using StatsKit, MLJ, Gadfly, Random, MLBase
	import Plots
end

So, I will be using the following packages for this project:

  • StatsKit – collection of essential stats packages
  • MLJ – machine learning toolkit
  • Gadfly – data visualization (similar to ggplot2 in R)
  • Random – random number generator for setting the seed
  • MLBase – basic machine learning functionality
  • Plots – imported only so that I can plot a correlation heat map

Data Exploration

During the data exploration phase, I’m looking for any issues that may need to be addressed (like missing values or extreme outliers) and visualizing the data.

A first inspection of the data doesn’t reveal any obvious issues, and I don’t notice anything special that warrants further investigation.
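
For reference, that first pass might look roughly like this (a minimal sketch; the file path is a placeholder, and I'm assuming the data is read with CSV and DataFrames, both of which StatsKit re-exports; the mod_df that appears later is presumably a working copy of this table):

begin
	# Load the training data (file path is a placeholder)
	df = CSV.read("train.csv", DataFrame)

	# Per-column summary: range, mean, and missing-value counts
	describe(df, :min, :max, :mean, :nmissing)
end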

I decided to create an additional variable called ‘BlockDensity’. It uses the total population of the block and the average number of occupants per household to indicate how many houses are located in a block. The hypothesis was that dense blocks could indicate either smaller houses that cost less or tall apartment buildings that cost more.

# Number of houses in a block (ie. block density)
mod_df.BlockDensity = mod_df.Population ./ mod_df.AveOccup

The main visualization that I look at is the correlation heat map to get an idea of how correlated the target is with the features (and the features with each other).

begin
	cols = names(mod_df)

	# Pairwise correlations over all (numeric) columns
	corr_matrix = cor(Matrix(mod_df))

	# Heat map with each correlation annotated in its cell
	(n,m) = size(corr_matrix)
	Plots.heatmap(corr_matrix, xticks=(1:m,cols), xrot=90, yticks=(1:m,cols), yflip=true, c = :bluesreds)
	Plots.annotate!([(j, i, (round(corr_matrix[i,j],digits=1), 8,"Computer Modern",:white)) for i in 1:n for j in 1:m])
end

This quickly reveals a strong positive correlation between the target (median house value) and median income. This makes sense since people with higher incomes can generally afford more expensive houses.

Since I’ll be using linear regression for this project, I need to deal with outliers, as they can skew the results. For this reason, I wrote a function that finds outliers using Tukey’s outer fences.

These formulas identify data points that lie far below the first quartile or far above the third quartile.

# Finding outliers past the outer fences
# (Q1 - (3*IQR); Q3 + (3*IQR))

function outer_fences(df, X)
	print("\n $(X)")
	# First and third quartiles
	quantiles = quantile(df[!, X], [0.25, 0.75])
	q1 = quantiles[1]
	q3 = quantiles[2]

	# Interquartile range
	IQR = q3 - q1

	# Outer fences: 3 * IQR below Q1 and above Q3
	fence_1 = q1 - (3*IQR)
	fence_2 = q3 + (3*IQR)

	print("\n Fence cutoffs: ($(round(fence_1, digits=2)), $(round(fence_2, digits=2)))")

	# Count the samples that fall outside the fences
	outside_samples = df[!, X][(df[!, X] .< fence_1) .| (df[!, X] .> fence_2)]
	count_outside_samples = length(outside_samples)
	print("\n Number of outside samples: $(count_outside_samples)")

	return X, fence_1, fence_2
end

I probably spent far more time on the outlier problem than I should have, but I was able to identify and remove around 1625 data points from the training data.
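
I won't reproduce the whole clean-up here, but a minimal sketch of applying the fences returned for one feature, followed by a train/test split with MLJ's partition, might look like this (the feature, 70/30 ratio, and seed are illustrative placeholders):

begin
	# Apply the outer fences for one feature (MedInc chosen only as an example)
	col, lo, hi = outer_fences(mod_df, :MedInc)
	clean_df = mod_df[(mod_df[!, col] .>= lo) .& (mod_df[!, col] .<= hi), :]

	# Illustrative 70/30 train/test split with a fixed seed
	train_idx, test_idx = partition(1:nrow(clean_df), 0.7, shuffle=true, rng=42)
	train = clean_df[train_idx, :]
	test = clean_df[test_idx, :]
end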

Fitting the Model

To fit a linear regression model I used the lm function from the GLM package. The only downside to this function is that you must specify every feature in the formula for the regression equation, which was a bit of a pain.

linearRegressor = lm(@formula(MedHouseVal ~ MedInc + HouseAge + AveRooms + AveBedrms + Population + AveOccup + Latitude + Longitude + BlockDensity), train)
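
As an aside, one possible workaround for that verbosity is to build the formula programmatically with the term helper from StatsModels (the package behind @formula; it may need to be added to the environment and loaded explicitly). A rough sketch:

begin
	using StatsModels

	# Build "MedHouseVal ~ all remaining columns" without typing every feature
	features = setdiff(names(train), ["MedHouseVal"])
	fm = term(:MedHouseVal) ~ sum(term.(Symbol.(features)))
	lm(fm, train)
end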

The R2 value for this model looked promising but the RMSE was a little too high. Printing the R2 value was straightforward:

# R Square value of the model (0.9194)
r2(linearRegressor)

However, getting the RMSE took a bit more effort. First, I define an RMSE function:

# RMSE function definition
function rmse(performance_df)
    rmse = sqrt(mean(performance_df.error.*performance_df.error))
    return rmse
end

Then DataFrames are created with the predicted values and errors for the training and test sets, which can then be passed to the RMSE function. The RMSE is 0.3147 for the test set and 0.3111 for the training set.

begin
	# Prediction
	ypredicted_test = GLM.predict(linearRegressor, test)
	ypredicted_train = GLM.predict(linearRegressor, train)
	
	# Test Performance DataFrame
	performance_testdf = DataFrame(y_actual = test[!,:MedHouseVal], y_predicted = ypredicted_test)
	performance_testdf.error = performance_testdf[!,:y_actual] - performance_testdf[!,:y_predicted]
	performance_testdf.error_sq = performance_testdf.error.*performance_testdf.error
	
	# Train Performance DataFrame
	performance_traindf = DataFrame(y_actual = train[!,:MedHouseVal], y_predicted = ypredicted_train)
	performance_traindf.error = performance_traindf[!,:y_actual] - performance_traindf[!,:y_predicted]
	performance_traindf.error_sq = performance_traindf.error.*performance_traindf.error
end
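
The quoted numbers then come from passing each performance DataFrame to the rmse function defined earlier:

begin
	# RMSE for each set (values as reported above)
	rmse(performance_testdf)     # ≈ 0.3147
	rmse(performance_traindf)    # ≈ 0.3111
end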

Overall, this was a good learning experience using Julia for the first time. Some aspects of the Julia language seemed unnecessarily complicated and required additional function definitions or steps. On the other hand, many aspects of Julia made it a pleasure to work with (such as the extremely simple environment management).

Check out the associated GitHub repo for the full code.