What Does a Data Scientist Do?
by Manuel Sainz de la Pena
This is the first question I receive when I tell a family member or friend that I am currently studying to become a data scientist.
The truth is that I didn’t really know the answer to that question when I began my Data Science boot camp with General Assembly. Sure, I had a rough idea. I was aware of the prevalence of data in all of our lives. And I knew that companies would certainly pay for someone to analyze that data and provide actionable insights.
However, I definitely did not have a complete understanding of what exactly a data scientist does, or what skills a data scientist should possess. In an effort to answer these questions, I will share what I’ve learned so far about the typical workflow of a data scientist.
Ask Questions
I believe that the first prerequisite for any good data scientist is an open, curious mind. Asking questions is the first step of the data science workflow, and it is imperative that we identify a precise “problem statement” that we can specifically address in order to add value to an organization. An example of a problem statement would be something to the effect of: “I want to predict the sale price of a home in Ames, Iowa based on its zip code, square footage, and overall quality.” This is an iterative process that usually requires sharpening and narrowing the problem statement once we obtain our data and understand its limitations.
Obtain Data and Clean it
The next step in the job of a data scientist is to obtain and clean our data. This can be accomplished in a variety of ways. Some websites allow us to scrape data directly from them. We might be provided with an existing data set, or we might have to go out and collect the data we are looking for ourselves! Cleaning the data is where data scientists often spend a big chunk of their time. The following are common issues that a data scientist might have to deal with at this stage:
- missing/null values
- duplicate entries
- incorrectly formatted data
There is no one-size-fits-all solution for these types of problems. However, a data scientist must be able to rationalize and defend the choices they make in the data cleaning stage.
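To make this concrete, here is a minimal cleaning sketch using pandas. The data frame and its column names are hypothetical, invented purely to illustrate the three issues listed above:

```python
import numpy as np
import pandas as pd

# Hypothetical housing data exhibiting the three common issues
df = pd.DataFrame({
    "zip_code": ["50010", "50010", None, "50014"],
    "sq_ft": ["1500", "1500", "2200", "1800"],  # numbers stored as strings
    "sale_price": [200000, 200000, np.nan, 310000],
})

df = df.drop_duplicates()                 # duplicate entries
df["sq_ft"] = pd.to_numeric(df["sq_ft"])  # incorrectly formatted data
df = df.dropna(subset=["zip_code"])       # missing/null values: drop...
df["sale_price"] = df["sale_price"].fillna(df["sale_price"].median())  # ...or impute
```

Whether to drop or impute a missing value is exactly the kind of choice a data scientist must be ready to defend.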
Exploratory Data Analysis
Visualizations are a data scientist’s best friend when it comes to Exploratory Data Analysis (EDA). The goal of EDA is to better understand the relationships between variables in our data. We can create graphs to visualize these relationships. For example, making bar graphs showing the average sale price for houses in Ames, Iowa based on the number of bedrooms and bathrooms would most likely provide insight. My personal favorite type of visualization is the Seaborn Correlation Heatmap (example shown below).
Heatmaps allow us to visualize the correlation strength between our target variable (sale price in this example) and independent variables (Overall Quality, Total Square Feet, etc.). They are also extremely useful when it comes to identifying which independent variables might suffer from multicollinearity. This information can help us to remove redundant features which will negatively impact our models.
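Producing such a heatmap takes only a few lines of Seaborn. The data below is synthetic, generated so that the variables correlate roughly the way the real Ames features do; the column names are stand-ins:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the Ames housing features
rng = np.random.default_rng(42)
overall_qual = rng.integers(1, 11, size=100)
total_sq_ft = overall_qual * 300 + rng.normal(0, 200, size=100)
sale_price = total_sq_ft * 100 + rng.normal(0, 10000, size=100)
df = pd.DataFrame({
    "Overall Quality": overall_qual,
    "Total Square Feet": total_sq_ft,
    "Sale Price": sale_price,
})

# Correlation matrix rendered as an annotated heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
```

The `annot=True` argument prints each correlation coefficient inside its cell, which makes it easy to scan for both strong target correlations and pairs of collinear features.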
Feature Engineering
Feature engineering is the process of deciding which features we ultimately include in our models. Two steps are usually involved:
- Feature Selection: This step typically involves removing features from our model that add more noise than predictive power. Removing the redundant features we identified during EDA is usually a good starting point. If our model has too many features it can also skew the bias-variance tradeoff too heavily towards variance, and thus result in an overfit model which will not perform well on unseen data.
- Feature Construction: This step refers to the creation of new features from existing ones. For example, in our Ames, Iowa housing sale price predictor, a new feature could be created by summing the number of bedrooms and bathrooms. Or perhaps multiplying the square footage by the “house quality” integer would give us a numeric feature that ends up highly correlated with sale price. Feature construction is an iterative process that can be aided by outside research and one’s own intuition.
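The two constructed features just described take one line each in pandas. The data frame and column names below are hypothetical stand-ins for the Ames columns:

```python
import pandas as pd

# Hypothetical Ames-style rows (column names are illustrative)
df = pd.DataFrame({
    "bedrooms": [3, 2, 4],
    "bathrooms": [2, 1, 3],
    "sq_ft": [1500, 900, 2400],
    "overall_qual": [7, 5, 9],
})

# Feature construction: combine existing columns into new ones
df["total_rooms"] = df["bedrooms"] + df["bathrooms"]
df["qual_x_sqft"] = df["sq_ft"] * df["overall_qual"]
```

After constructing candidate features like these, we would check their correlation with the target during EDA and keep only those that add predictive power.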
Modeling
At last, we’ve reached the machine learning stage of the data science process. First, we need to select the type of model(s) we would like to train. Below is a cheat sheet from Microsoft Azure showing a useful flow chart for how we might pick the best model for our specific project.
Below are the basic steps we need to follow to create a model:
- Split data into a training and testing set
- Instantiate whichever model we will be using
- Fit the model with training data set
- Obtain predictions for the testing set from the trained model
- Use performance metrics to evaluate the performance of the model
The simple steps above can be made infinitely more complex, but serve as a basic guide for what occurs when we conduct supervised machine learning.
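The five steps above map almost one-to-one onto scikit-learn calls. Here is a minimal sketch using synthetic regression data in place of the housing set:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the housing features and sale prices
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)

# 1. Split data into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2. Instantiate whichever model we will be using
model = LinearRegression()

# 3. Fit the model with the training data set
model.fit(X_train, y_train)

# 4. Obtain predictions for the testing set from the trained model
preds = model.predict(X_test)

# 5. Use performance metrics to evaluate the model
score = r2_score(y_test, preds)
```

Holding out a test set the model never saw during fitting is what lets the performance metric estimate how the model will do on genuinely unseen data.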
Communication of Results
We now have our model, its predictions, and an evaluation of the strength of those predictions. Perhaps the most important part of our job as data scientists is to effectively communicate our results. Data scientists must be able to tell a compelling story with their data, and usually this involves clear, compelling visualizations. If we cannot convince our audience to act on our findings, a whole lot of work can go down the drain. Thus, as data scientists, it is critical that we practice and refine our presentation skills to maximize our impact.
Having gone through the basic data science workflow, we should now have a better idea of what exactly a data scientist does and what skills the profession requires. I look forward to continuing to share my experiences in the field and invite you to follow along.