Classification Ensemble in Python with a Stroke Prediction Dataset
This project is based on Season 3, Episode 2 of the Kaggle Playground Series. The title of this episode is: “Tabular Classification with a Stroke Prediction Dataset”. Our task is to predict the probability that a patient will have a stroke.
The target, stroke, is a binary variable and so classification methods are needed to predict the probability of stroke. In this project, I will be training the data on an ensemble of machine learning models.
The models I use (and their associated Python libraries) are:
- Logistic regression (`from sklearn.linear_model import LogisticRegression`)
- Decision trees (`from sklearn.tree import DecisionTreeClassifier`)
- XGBoost (`import xgboost as xgb`)
- LightGBM (`import lightgbm as lgb`)
- AdaBoost (`from sklearn.ensemble import AdaBoostClassifier`)
- K-nearest neighbors (`from sklearn.neighbors import KNeighborsClassifier`)
- Random forest (`from sklearn.ensemble import RandomForestClassifier`)
- Support vector machines (`from sklearn.svm import SVC`)
However, instead of building the ensemble by simply averaging the predictions from these models, I will use a method called Super Learner ensembling. This method uses cross-validation to generate out-of-fold predictions from each base model and then combines them through a final meta-learner. In this project, I use logistic regression as the meta-learner and the ML-Ensemble package (which simplifies the process considerably).
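To make the mechanics concrete, here is a minimal sketch of the super learner idea using plain scikit-learn. This is illustrative only and not the project's code (which uses ML-Ensemble, shown later); the reduced model list and the names `X_train`, `y_train`, and `X_test` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def super_learner_sketch(X_train, y_train, X_test):
    # A reduced set of base models, purely for illustration.
    base_models = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(),
        KNeighborsClassifier(),
    ]

    # Step 1: out-of-fold predicted probabilities from each base model become
    # the "meta-features" that the meta-learner is trained on.
    meta_features = np.column_stack([
        cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
        for m in base_models
    ])

    # Step 2: fit the meta-learner (logistic regression) on those out-of-fold predictions.
    meta_learner = LogisticRegression()
    meta_learner.fit(meta_features, y_train)

    # Step 3: refit each base model on the full training set, then pass their
    # test-set probabilities through the meta-learner.
    test_meta = np.column_stack([
        m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
    ])
    return meta_learner.predict_proba(test_meta)[:, 1]
```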
Check out a previous blog post to find out more about the different ensemble methods you can use.
Stroke Data Description
The dataset I will use in this project was synthetically generated from the original stroke dataset. This allows us to work with realistic data without compromising anyone's privacy. That said, I will also fold in some cases from the original stroke dataset to enrich the analysis.
This stroke dataset contains the following features:
- id: unique identifier
- gender: “male”, “female”, or “other” (categorical)
- age: age (in years)
- hypertension: whether or not the patient has hypertension (binary)
- heart_disease: whether or not the patient has heart disease (binary)
- ever_married: whether or not the patient was ever married (binary)
- work_type: “children”, “govt_job”, “never_worked”, “private”, “self-employed” (categorical)
- residence_type: “rural”, “urban” (binary)
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: “formerly smoked”, “never smoked”, “smokes”, “unknown” (categorical)
Data Exploration
In this phase of the project, I want to identify any potential ‘funnies’ in the data that may require cleaning and create any visualizations that may be useful for the model-building stage.
Using `df.describe()` prints out some quick summary statistics of the dataset. However, by using the `exclude` parameter, you can ignore columns of a certain type (don’t know how I only just discovered this).

```python
# Quick stats: continuous variables
# Exclude object dtypes: gender, ever_married, work_type, residence_type, and smoking_status
round(train_cp.describe(exclude="object"), 2)
```
This gives the following table output:
A first look at this table shows two obvious problems:
- Minimum age is 0.08 – unless the study included newborns, this is probably dirty data.
- Minimum BMI is 10.30 – this is an unrealistic value for a person so this is probably also dirty data.
Dirty data like this can either be removed or imputed as if the values were nulls. In this project, doing so didn’t make much of a difference to the modeling, so I left those data points in.
Next, I want to have a look at the distribution of the target variable, stroke. Quickly printing the counts reveals a highly imbalanced dataset, with only 632 stroke cases out of over 15k data points!
A post on the discussion forum for this dataset suggests adding valid stroke cases from the original dataset to the training data. There are 249 stroke cases in the original dataset, of which 40 have null BMI values. Keeping only the complete cases leaves 209 additional cases that can be included in our model here.
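As a rough sketch of how those extra cases might be appended (the file names and column handling here are assumptions, not the exact code from this project):

```python
import pandas as pd

# Illustrative file names; adjust to wherever the two datasets live locally.
train_cp = pd.read_csv("train.csv")                            # synthetic competition data
original = pd.read_csv("healthcare-dataset-stroke-data.csv")   # original stroke data

# Keep only the stroke cases from the original data that have a non-null BMI
# (the complete cases), then append them to the training set.
extra_cases = original[(original["stroke"] == 1) & (original["bmi"].notna())]
train_cp = pd.concat([train_cp, extra_cases], ignore_index=True)  # assumes matching columns
```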
Visualizing the target variable (after adding the additional cases):
Next, I print out a few quick counts of the categorical variables in the dataset.
| Gender | Count |
|---|---|
| Male | 5857 |
| Female | 9446 |
| Other | 1 |

| Residence Type | Count |
|---|---|
| Rural | 7763 |
| Urban | 7749 |

| Ever Married | Count |
|---|---|
| No | 4941 |
| Yes | 10571 |

| Work Type | Count |
|---|---|
| Govt job | 1561 |
| Never worked | 42 |
| Private | 9879 |
| Self-employed | 1992 |
| Children | 2038 |
These counts reveal two issues:
- Gender has an ‘Other’ category. Since there is only 1 data point in this category, I will simplify things by replacing it with ‘Female’, the dominant gender category. Thereafter, the gender variable can be converted to a 0/1 binary variable.
- Residence type and ever married should also be converted to 0/1 binary variables. One-hot encoding is possible for these variables, but it would unnecessarily increase the dimensionality of the data and could hurt the model’s performance. A minimal sketch of both cleaning steps is shown after this list.
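This is a rough sketch of those two steps, assuming a pandas DataFrame named `train_cp` with the columns described earlier (the exact column names and category labels may differ from the real data):

```python
# Illustrative cleaning sketch; assumes a DataFrame named train_cp.

# Collapse the single 'Other' record into the dominant category, 'Female'.
train_cp["gender"] = train_cp["gender"].replace("Other", "Female")

# Map the two-level categorical variables to 0/1 indicators.
binary_maps = {
    "gender": {"Female": 0, "Male": 1},
    "residence_type": {"Rural": 0, "Urban": 1},
    "ever_married": {"No": 0, "Yes": 1},
}
for col, mapping in binary_maps.items():
    train_cp[col] = train_cp[col].map(mapping)
```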
Fitting the Model(s)
To predict whether or not a patient will have a stroke, I fit an ensemble of 8 classification models using the ‘Super Learner’ ensemble method.
To prepare the data for the model, the continuous variables are standardized and the categorical variables are one-hot encoded. I also drop the ID column since it doesn’t help the model in any way.
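A rough sketch of that preprocessing, assuming the cleaned `train_cp` DataFrame from above (the variable and column names are assumptions, chosen to match the fitting code further down):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative preprocessing sketch.
train_features = train_cp.drop(columns=["id", "stroke"])
train_target = train_cp["stroke"]

# One-hot encode the remaining multi-level categorical variables.
train_features = pd.get_dummies(train_features, columns=["work_type", "smoking_status"])

# Standardize the continuous variables.
continuous = ["age", "avg_glucose_level", "bmi"]
train_features_scaled = train_features.copy()
train_features_scaled[continuous] = StandardScaler().fit_transform(train_features[continuous])
```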
To fit the ensemble, we must first define the list of base models that it will use.
```python
# Get a list of models
def get_models():
    models = list()
    models.append(LogisticRegression())
    models.append(DecisionTreeClassifier())
    models.append(xgb.XGBClassifier())
    models.append(lgb.LGBMClassifier())
    models.append(AdaBoostClassifier())
    models.append(KNeighborsClassifier())
    models.append(RandomForestClassifier())
    models.append(SVC(probability=True))
    return models
```
Then, using ML-Ensemble, we add one layer containing these base models and a second, meta layer that combines their predictions.
```python
from mlens.ensemble import SuperLearner

def build_ensemble():
    # Super learner with 25-fold cross-validation (seed is defined earlier)
    ensemble = SuperLearner(random_state=seed, folds=25, shuffle=True, verbose=1)
    # Add the base models as the first layer
    models = get_models()
    ensemble.add(models, proba=True)
    # Attach the final meta estimator
    ensemble.add(LogisticRegression(), meta=True, proba=True)
    return ensemble
```
An important point to note is that this Kaggle competition required predicted probabilities, not predicted classes. Since ML-Ensemble returns classes by default, we need to specify `proba = True` in BOTH layers (this was not documented very well, so I spent way too much time figuring out how to get predicted probabilities here!).
Using this function, we can build the super learner:
```python
# Build the super learner
ensemble = build_ensemble()
```
And finally, the ensemble is fitted to the training data (here, we must use `df.values` because the ML-Ensemble package does not accept a pandas DataFrame):
```python
# Fit the super learner ensemble
ensemble.fit(train_features_scaled.values, train_target.values)
```
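For completeness, here is a hedged sketch of scoring the test set and building the submission file. The names `test_features_scaled` and `test_ids` are assumptions (prepared the same way as the training data), and with `proba = True` on the meta layer the output of `predict` should contain one column of probabilities per class:

```python
import pandas as pd

# Illustrative sketch; test_features_scaled and test_ids are assumptions.
preds = ensemble.predict(test_features_scaled.values)

# Take the probability of the positive class (stroke = 1) for the submission.
submission = pd.DataFrame({"id": test_ids, "stroke": preds[:, 1]})
submission.to_csv("submission.csv", index=False)
```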
The final score I got for this submission was 0.89285. I could have improved this with some fine-tuning, but I stopped here for this project.
I hope you found this helpful! You can find the full code on Github.
If you have any questions or comments, drop them below! I’d love to hear from you!