Predicting Titanic Survivors with a Random Forest Classifier Model

Manuel Sainz de la Pena

For my second blog post I will be diving into one of my favorite machine learning models: the Random Forest. Before going any further, I want to credit General Assembly and their lesson on this topic, written by Matt Brems, Riley Dallas, and Patrick Wales-Dinan, for providing much of the content I will be discussing.

Random Forest models are an attempt to remedy one of the main problems with bagged decision tree models: their tendency to overfit and exhibit high variance. This happens because the bagged trees are all strongly correlated with one another, and high correlation between trees yields a model that can be as overfit as a single decision tree. How do Random Forest models tackle this problem? They attempt to “de-correlate” the individual trees by having each split within each tree consider only a random subset of the features.
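In Scikit-learn, for example, the size of that random feature subset is controlled by the `max_features` argument (a minimal sketch, not code from the original lesson):

```python
from sklearn.ensemble import RandomForestClassifier

# Each split in each tree considers only a random subset of features;
# "sqrt" (the default for classification) samples sqrt(n_features) of them.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```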


For the purposes of this blog I will be utilizing the Kaggle Titanic Training Dataset (https://www.kaggle.com/c/titanic/data). More information about this data (including a data dictionary) can be found on the website linked. I won’t show my data cleaning and exploration steps, but I will briefly summarize how I engineered the features for my model.

  1. The “Embarked” column was dummified.
  2. “FamilyCount” column was created to sum the values of siblings, spouses, parents, and children for each passenger.
  3. “IsReverend” column was created to return 1 if the passenger was a Reverend, and 0 if not.
  4. “IsMale” column was created to return 1 if the passenger was a male, and 0 if not.
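The steps above can be sketched as follows — the tiny DataFrame here is a hypothetical stand-in for the real Kaggle training data:

```python
import pandas as pd

# Hypothetical stand-in rows for the real Kaggle training data.
df = pd.DataFrame({
    "Name": ["Rev. John Smith", "Miss. Jane Doe"],
    "Sex": ["male", "female"],
    "SibSp": [1, 0],          # siblings/spouses aboard
    "Parch": [0, 2],          # parents/children aboard
    "Embarked": ["S", "C"],
})

# 1. Dummify the "Embarked" column
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)

# 2. Sum family members aboard into a single count
df["FamilyCount"] = df["SibSp"] + df["Parch"]

# 3. Flag reverends based on the title in the passenger's name
df["IsReverend"] = df["Name"].str.contains("Rev").astype(int)

# 4. Flag male passengers
df["IsMale"] = (df["Sex"] == "male").astype(int)
```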

My final model feature list is shown below:

Features to be included in our model. Independent variables and target variable defined.
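As a sketch, defining the independent variables and target might look like this — the exact feature list is an assumption based on the engineering steps above, and the small DataFrame stands in for the real data:

```python
import pandas as pd

# Tiny hypothetical stand-in for the cleaned training data.
df = pd.DataFrame({
    "Pclass": [3, 1], "Age": [22.0, 38.0], "FamilyCount": [1, 1],
    "IsReverend": [0, 0], "IsMale": [1, 0],
    "Embarked_Q": [0, 0], "Embarked_S": [1, 0],
    "Survived": [0, 1],
})

# Assumed feature list; the original screenshot may differ.
features = ["Pclass", "Age", "FamilyCount", "IsReverend", "IsMale",
            "Embarked_Q", "Embarked_S"]

X = df[features]    # independent variables
y = df["Survived"]  # target variable
```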

Before creating our model, it is important to identify the baseline accuracy score that we intend to beat. Baseline accuracy is the percentage of observations that fall into the majority class of the target we are trying to predict. For us, that is the percentage of passengers who did not survive the Titanic’s voyage. Below is how to find this score:
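A minimal sketch of the computation, using a hypothetical 60/40 target for illustration:

```python
import pandas as pd

# Hypothetical target column with a 60/40 class split.
y = pd.Series([0, 0, 0, 1, 1])

# Baseline accuracy = normalized frequency of the majority class.
baseline = y.value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline:.2%}")  # Baseline accuracy: 60.00%
```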

Our baseline accuracy is 61.75%. If we predict that all of our passengers will die (majority class) we will be correct 61.75% of the time. Our goal is to beat this percentage with our model.

After noting baseline accuracy, we can split our data into training and testing sets using Scikit-learn’s “train_test_split” function shown below:

Remember to stratify by “y” to ensure that our training and testing classes are balanced with respect to our target variable!
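A sketch of the split, with generated stand-in data in place of the real features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data; with the real DataFrame, X and y would be the
# feature matrix and "Survived" column defined earlier.
X, y = make_classification(n_samples=100, n_features=7, random_state=42)

# stratify=y keeps the survived/died ratio consistent across the splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
```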

Next we need to instantiate our Random Forest Classifier model:

Creates an “instance” of the model
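A minimal version of this step (the `random_state` here is an assumption, added for reproducibility):

```python
from sklearn.ensemble import RandomForestClassifier

# Create an "instance" of the model with default hyperparameters.
rf = RandomForestClassifier(random_state=42)
```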

After this step, it is useful to set up a GridSearch scaffold to run through a dictionary of hyperparameters and identify our strongest performing model.

Code Credit: General Assembly Lesson 6.03
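The lesson’s scaffold isn’t reproduced here, but a plain Scikit-learn GridSearchCV sketch of the same idea — with a hypothetical parameter grid and stand-in data — looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data in place of the real training set.
X_train, y_train = make_classification(n_samples=100, n_features=7,
                                       random_state=42)

# Hypothetical hyperparameter grid; the lesson's dictionary may differ.
params = {
    "n_estimators": [50, 100],
    "max_depth": [None, 3],
}

# Exhaustively search the grid with 5-fold cross-validation.
gs = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=5)
gs.fit(X_train, y_train)

print(gs.best_params_)  # hyperparameters of the strongest model
print(gs.best_score_)   # its mean cross-validated accuracy
```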

The code above outputs a data frame that updates each time we re-run our GridSearch (the bottom cell), recording the hyperparameters that produced our best-performing Random Forest model. See below for our best performing model:

My second model obtained the highest accuracy score on training data.

Next, we can create a new instance of Random Forest Classifier with the optimal hyperparameters we identified via GridSearch. After fitting the model to our training data, we scored it with our testing data. This model ended up having a testing accuracy score of 83.86% (shown below).
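A sketch of this final step, with placeholder hyperparameters and generated stand-in data (the real winning hyperparameters came from the GridSearch above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; the hyperparameter values below are placeholders,
# not the actual GridSearch winners.
X, y = make_classification(n_samples=200, n_features=7, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Re-instantiate with the chosen hyperparameters, fit, and score.
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

train_acc = rf.score(X_train, y_train)  # accuracy on training data
test_acc = rf.score(X_test, y_test)     # accuracy on held-out testing data
```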

We have now successfully created a Random Forest Classification Model! This model will correctly predict whether or not a passenger survived the Titanic’s ill-fated maiden voyage approximately 84% of the time. We achieved a roughly 22-percentage-point improvement over baseline without spending a lot of time on advanced data imputation strategies or feature engineering. Importantly, our Random Forest model avoided being significantly overfit.

Strategies to improve this model further include:

  • Applying more advanced imputation strategies for null values
  • Engineering additional features
  • Removing poorly performing features
  • Adding more hyperparameters to our GridSearch parameter dictionary

Thanks for reading, and happy modeling!

I am a Data Scientist at General Assembly. I hope to help others entering this field by sharing the wisdom, tips, and best practices I learned along the way.
