Predicting Titanic Survivors with a Random Forest Classifier Model
Manuel Sainz de la Pena
For my second blog post I will be diving into one of my favorite machine learning models: Random Forest. Before I go any further, I want to give credit to General Assembly and their lesson on this topic, written by Matt Brems, Riley Dallas, and Patrick Wales-Dinan, for providing much of the content I will be discussing.
Random Forest models are an attempt to remedy one of the main problems with bagged decision tree models, namely their tendency to be overfit and have high variance. This happens because the bagged trees are all strongly correlated with each other: every tree can split on the same dominant features, so the trees end up looking alike. High correlation between trees results in a model that can be as overfit as a single decision tree. How do Random Forest models tackle this problem? They attempt to “de-correlate” the individual trees in the Random Forest. This is accomplished by considering only a random subset of features at each split in each individual tree of the model.
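To make the de-correlation idea concrete, here is a minimal sketch using scikit-learn, where the `max_features` parameter limits how many features are considered at each split. The data is synthetic (not the Titanic dataset), and the tree counts and seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 10 features; not the Titanic dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# max_features=None lets every split see all features, which behaves like bagged trees.
bagged_like = RandomForestClassifier(n_estimators=50, max_features=None, random_state=42)

# max_features="sqrt" restricts each split to a random subset of the features.
random_forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=42)

bagged_like.fit(X, y)
random_forest.fit(X, y)

# Each fitted tree records how many features it may consider per split.
bagged_features_per_split = bagged_like.estimators_[0].max_features_
forest_features_per_split = random_forest.estimators_[0].max_features_
print(bagged_features_per_split, forest_features_per_split)
```

The second model gives each split fewer candidate features, so no single strong feature can dominate every tree.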
For the purposes of this blog I will be utilizing the Kaggle Titanic Training Dataset (https://www.kaggle.com/c/titanic/data). More information about this data (including a data dictionary) can be found on the linked website. I won’t show my data cleaning and exploration steps, but will briefly summarize how I engineered the features for my model.
- The “Embarked” column was dummified.
- A “FamilyCount” column was created to sum the values of siblings, spouses, parents, and children for each passenger.
- An “IsReverend” column was created to return 1 if the passenger was a Reverend, and 0 if not.
- An “IsMale” column was created to return 1 if the passenger was male, and 0 if not.
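The steps above could be sketched in pandas roughly as follows. The three rows are made up for illustration (they are not real passengers), the column names follow the Kaggle data dictionary, and matching the title “Rev.” in the name is my assumption about how a Reverend might be detected.

```python
import pandas as pd

# A few made-up rows using the Kaggle Titanic column names; not real passengers.
df = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Jones, Rev. Adam", "Brown, Miss. Ann"],
    "Sex": ["male", "male", "female"],
    "SibSp": [1, 0, 2],
    "Parch": [0, 0, 1],
    "Embarked": ["S", "C", "Q"],
})

# Dummify "Embarked" into one indicator column per port.
df = pd.get_dummies(df, columns=["Embarked"])

# FamilyCount: siblings/spouses plus parents/children aboard.
df["FamilyCount"] = df["SibSp"] + df["Parch"]

# IsReverend: 1 if the title "Rev." appears in the name (assumed matching rule).
df["IsReverend"] = df["Name"].str.contains("Rev.", regex=False).astype(int)

# IsMale: 1 for male passengers, 0 otherwise.
df["IsMale"] = (df["Sex"] == "male").astype(int)

print(df[["FamilyCount", "IsReverend", "IsMale"]])
```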
My final model feature list is shown below:
Before creating our model it is important to identify the baseline accuracy score that we intend to beat. Baseline accuracy is the share of observations that fall in the majority class of the target we are trying to predict; in other words, it is the accuracy we would get by always predicting that class. For us, this ended up being the percentage of passengers who did not survive the Titanic’s voyage. Below is how to find this score:
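One common way to compute this in pandas is with `value_counts(normalize=True)`. The sketch below uses a tiny toy target column rather than the real data, so the printed number is not the actual Titanic baseline.

```python
import pandas as pd

# Toy target column: 1 = survived, 0 = did not survive.
y = pd.Series([0, 0, 0, 1, 1], name="Survived")

# Share of each class; the largest share is the baseline accuracy.
class_shares = y.value_counts(normalize=True)
baseline_accuracy = class_shares.max()

print(class_shares)
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```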
After noting baseline accuracy, we can split our data into training and testing sets using Scikit-learn’s “train_test_split” function shown below:
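A minimal sketch of that split is below. The toy `X` and `y` stand in for the engineered Titanic features and target, and the `test_size`, `random_state`, and `stratify` settings are assumptions for illustration rather than the values used in this post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature matrix and target standing in for the engineered Titanic data.
X = pd.DataFrame({
    "IsMale": [1, 0, 1, 0, 1, 0, 1, 0],
    "FamilyCount": [0, 1, 2, 0, 1, 3, 0, 2],
})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], name="Survived")

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # assumed share of rows held out for testing
    random_state=42,   # assumed seed so the split is reproducible
    stratify=y,        # keep class proportions the same in both sets
)

print(X_train.shape, X_test.shape)
```

Stratifying on `y` is optional, but it keeps the share of survivors similar in the training and testing sets.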
Next we need to instantiate our Random Forest Classifier model:
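Instantiating the model can be as short as the sketch below. The `random_state` value is an assumption added for reproducibility; every other hyperparameter is left at scikit-learn's default.

```python
from sklearn.ensemble import RandomForestClassifier

# Default hyperparameters, with an assumed seed for reproducible results.
rf = RandomForestClassifier(random_state=42)

# Inspect the hyperparameters the model will use.
print(rf.get_params())
```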
After this step, it is useful to set up a GridSearch scaffold to run through a dictionary of hyperparameters and identify our strongest performing model.
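One way such a scaffold might look is sketched below with scikit-learn's `GridSearchCV`. The data is synthetic, and the names `rf_params` and `run_history` as well as the grid values are hypothetical choices for illustration, not the ones used in this post.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Hypothetical hyperparameter dictionary; the values are illustrative only.
rf_params = {
    "n_estimators": [25, 50],
    "max_depth": [None, 3, 5],
    "max_features": ["sqrt", None],
}

# Keeps one summary per GridSearch run; define once, then re-run the cell below it.
run_history = []

# Search every combination in rf_params with 5-fold cross-validation.
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_params, cv=5)
gs.fit(X_train, y_train)

# Record the best hyperparameters and scores from this run.
run_summary = dict(gs.best_params_)
run_summary["best_cv_score"] = gs.best_score_
run_summary["train_score"] = gs.score(X_train, y_train)
run_summary["test_score"] = gs.score(X_test, y_test)
run_history.append(run_summary)

model_results = pd.DataFrame(run_history)
print(model_results)
```

Comparing `train_score` with `test_score` in each row is a quick check for overfitting.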
The code above outputs a data frame that is updated each time we re-run our GridSearch (the bottom cell), recording the hyperparameters that gave us our best-performing Random Forest model. See below for our best performing model:
Next, we can create a new instance of Random Forest Classifier with the optimal hyperparameters we identified via GridSearch. After fitting the model to our training data, we can score it with our testing data. This model ended up having a testing accuracy score of 83.86% (shown below).
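A sketch of this final step is below. It runs on synthetic data, so its scores will not match the 83.86% reported here, and the values in `best_params` are placeholders to be replaced with whatever the grid search's `best_params_` attribute returns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Placeholder "best" hyperparameters; substitute the values found by GridSearch.
best_params = {"n_estimators": 50, "max_depth": 5, "max_features": "sqrt"}

best_rf = RandomForestClassifier(random_state=42, **best_params)
best_rf.fit(X_train, y_train)

# Accuracy on both sets; a large gap between them would suggest overfitting.
train_accuracy = best_rf.score(X_train, y_train)
test_accuracy = best_rf.score(X_test, y_test)
print(f"Train accuracy: {train_accuracy:.2%}")
print(f"Test accuracy:  {test_accuracy:.2%}")
```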
We have now successfully created a Random Forest Classification Model! This model will correctly predict whether or not a passenger survived the Titanic’s ill-fated maiden voyage approximately 84% of the time. We achieved an improvement of roughly 22 percentage points over baseline without spending a lot of time on advanced data imputation strategies or feature engineering. Importantly, our Random Forest model avoided being significantly overfit.
Strategies to improve this model further would include:
- Using more advanced data imputation strategies for null values
- Engineering additional features
- Removing poorly performing features
- Adding more hyperparameters to our GridSearch parameter dictionary
Thanks for reading, and happy modeling!