Data Science Mock Interview - Top Questions and Answers : View Full Interview Transcript Below
Welcome to the AceAI data science mock interview. Here at AceAI, we develop career-prep resources and tools to help you land your dream data science job.
In this interview, we will be asking a potential data science candidate key interview questions that have been developed by leading data science hiring managers. The topics covered in this video include
Live Coding Exercises
Exploratory Data Analysis
Please let us know if you have any questions and feel free to contact us or schedule a one on one professional coaching consultation.
PART 1: Python Imports
Thank you for joining us for the technical interview. For this interview, we will present you with some data and code, and we will ask you questions about the analysis and machine learning approaches you would use. We will be working together off of this Jupyter notebook.
So we’ve imported some Python packages and modules you might find beneficial. Also, feel free to import any additional libraries you need as we continue the interview.
Are you familiar with these libraries, and if so, could you give us a brief overview of what they do?
Yes, I’ve worked with all of these before. Pandas is a great tool for data analysis and manipulation, and seaborn and matplotlib are plotting libraries. NumPy is a numerical computing library that performs operations on matrices and multidimensional arrays.
PART 2: EDA
We will be working with this dataset, which is already imported. This is a dataset for credit risk management, namely to predict whether a customer will default on their loan. I am also going to create a train and a test split from the data.
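A minimal sketch of how such a split could be created, assuming scikit-learn's train_test_split. The toy data, the AGE column, the 25% test size and the choice to stratify on the response are all assumptions for illustration; the transcript does not say how the split was made.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the credit risk dataset.
data = pd.DataFrame({
    "AGE": range(20, 40),
    "default_payment": [0] * 16 + [1] * 4,
})

# Hold out 25% of the rows; stratifying keeps the share of defaulters
# similar in both parts (an assumption, not stated in the interview).
train, test = train_test_split(
    data,
    test_size=0.25,
    stratify=data["default_payment"],
    random_state=0,
)

print(train.shape, test.shape)
```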
So could you show us, when you are presented with a new dataset, what are some of the first things you would do to start exploring it? Also, as you are looking at the data, please feel free to ask us any questions you have about it.
Ok, so one of the first things I would do is determine the shape of the dataset, to see how many rows and features the training and the test datasets have. I would also look at the data types in the training set. These are all encoded as integer and float values. I would take a look at the first few rows of the training dataset to get a better idea of what’s in the data. There are some categorical features, such as the education and marital status of the customers, and it looks like there are also numerical features, such as the age of the customers. And then if I scroll to the last feature of the dataset, this is the default payment, so this would be the response variable: whether the customer defaulted or not. And one last thing I would do initially is see how many missing values are in the data, and marriage has a few missing values.
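The first checks described here can be sketched in pandas as follows. This is only an illustration on a made-up DataFrame: MARRIAGE and default_payment are named in the interview, while the AGE and EDUCATION column names and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for the training set; not the interview data.
train = pd.DataFrame({
    "AGE": [24, 35, 41, 29, 52],
    "EDUCATION": [1, 2, 2, 3, 1],
    "MARRIAGE": [1.0, 2.0, np.nan, 1.0, 2.0],
    "default_payment": [0, 0, 1, 0, 1],
})

n_rows, n_cols = train.shape                # number of rows and columns
dtypes = train.dtypes                       # data type of each column
first_rows = train.head()                   # first few rows
missing_per_column = train.isnull().sum()   # missing values per column

print(n_rows, n_cols)
print(dtypes)
print(first_rows)
print(missing_per_column)
```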
PART 3: CATEGORICAL FEATURES
So knowing now a little more about this dataset, there are two types of variables, categorical and numerical. Let’s focus on a categorical feature, say MARRIAGE. Could you explore this variable a bit more, and walk me through your thoughts?
The first thing I would do is look at the unique values in this feature, and it looks like there are some missing values here, which we saw. I would also do a value counts to see how often each value occurs in this particular feature. In order to get it ready for machine learning, I would just as a first step delete the rows with the missing values, because there aren’t that many: there are only 34 missing values out of a few thousand. Later on I could impute the missing values with the mode of this feature. And then because marriage is a categorical feature rather than a numerical one, the integer coding is not ideal: there is no meaningful ordering among marriage types 1, 2 and 3. So I would transform them into dummy or indicator values using the following step. I’m going to create the final training set by concatenating these values with the original training set, and then I’m going to delete the original feature for marriage. Let me just make sure all of that worked. There’s no longer a marital feature here, and here are the indicator values for the marital status, so that has worked well.
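A rough sketch of these steps with pandas, again on made-up data. The variable names (train_clean, train_final) and the AGE column are hypothetical, and the exact calls used in the interview notebook are not shown in the transcript.

```python
import numpy as np
import pandas as pd

# Hypothetical toy training set with one missing MARRIAGE value.
train = pd.DataFrame({
    "AGE": [24, 35, 41, 29, 52, 38],
    "MARRIAGE": [1.0, 2.0, np.nan, 3.0, 2.0, 1.0],
    "default_payment": [0, 0, 1, 0, 1, 0],
})

# Inspect the distinct values and how often each occurs (NaN included).
unique_values = train["MARRIAGE"].unique()
counts = train["MARRIAGE"].value_counts(dropna=False)

# First pass: drop the few rows where MARRIAGE is missing.
train_clean = train.dropna(subset=["MARRIAGE"])

# One indicator column per marriage category.
dummies = pd.get_dummies(
    train_clean["MARRIAGE"].astype(int), prefix="MARRIAGE", dtype=int
)

# Attach the indicators and remove the original integer-coded column.
train_final = pd.concat([train_clean, dummies], axis=1).drop(columns=["MARRIAGE"])

print(counts)
print(train_final.head())
```

Imputing with the mode instead of dropping rows could use train["MARRIAGE"].fillna(train["MARRIAGE"].mode()[0]).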
PART 4: Class Imbalance
So let’s do one more step with the exploratory analysis. Let’s look at the response variable, default_payment. What should we take into consideration before building a machine learning model to predict on this variable?
I’m going to extract that feature. It looks like there are many more customers who did not default than who did default. Let me just plot that so it’s a bit clearer. Because the response variable is imbalanced, with many more customers who have not defaulted than those who have, I would have to keep this in mind.
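A small sketch of how the imbalance could be quantified, assuming the response is held in a pandas Series. The ten values below are invented and only illustrate the calls.

```python
import pandas as pd

# Hypothetical response values: 0 = did not default, 1 = defaulted.
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 0], name="default_payment")

class_counts = y.value_counts()                 # rows per class
class_share = y.value_counts(normalize=True)    # fraction of rows per class

print(class_counts)
print(class_share)
```

With matplotlib installed, class_counts.plot(kind="bar") would draw the same counts as a bar chart.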
PART 5: MODEL BUILDING
I’m going to import a few more tools from scikit-learn to let us start building a machine learning model. I will import a Random Forest Classifier, and stratified k-fold and grid search cross validation functionalities. I’m also going to define tuning parameters for the random forest, where n_estimators is the number of trees in the classifier, and then I’m going to use 5-fold cross validation.
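The setup described here might look roughly like the sketch below. The synthetic data from make_classification, the candidate values for n_estimators and the shuffle and random_state settings are assumptions; accuracy is used for scoring because the later discussion refers to accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the prepared training features.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Tuning grid: n_estimators is the number of trees in the forest.
param_grid = {"n_estimators": [10, 50, 100]}

# 5-fold cross-validation that preserves the class ratio in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
grid.fit(X, y)

# Mean and standard deviation of the out-of-fold score per setting.
means = grid.cv_results_["mean_test_score"]
stds = grid.cv_results_["std_test_score"]
for params, mean, std in zip(grid.cv_results_["params"], means, stds):
    print(params, round(mean, 3), round(std, 3))
```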
Ok, so now I am going to fit the model, and let’s wait a while as it runs. You can see from this asterisk that the model is still training.
And here are the results. What do you think of them?
These results are too good to be true: the model isn’t making any errors on the out-of-sample data, and the standard deviation is zero.
Yes, exactly. What do you think is going on?
Ok, so let me go do that now, and I will then set off model training again. When the model has finished building, please tell me what you think of the new results for the model performance.
It looks more reasonable: the model isn’t perfect, and the accuracy is improving with more trees. You could try adding more trees to the ensemble, as it looks like the model performance is still improving as additional decision trees are added.
There is however another error. Remember that the response variable was unbalanced. Can you see how we could improve the model fitting to account for that fact?
Yes, you could change it so that the model scoring uses the F1 score or the area under the receiver operating characteristic (ROC) curve. These two measures account for class imbalance quite well.
So what is the f1 score?
The F1 score is the harmonic mean of the precision and recall of the model.
What is the worst and the best f1 score you can have?
So 0 is the worst score and 1 is the best F1 score.
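As a worked illustration, F1 is commonly defined as the harmonic mean of precision and recall, 2 * precision * recall / (precision + recall). The sketch below uses invented labels to compare a hand calculation with scikit-learn's f1_score, and shows why accuracy can mislead on imbalanced data.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 4 defaulters (1) and 6 non-defaulters (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)   # 2 true positives / 3 predicted positives
recall = recall_score(y_true, y_pred)         # 2 true positives / 4 actual positives
manual_f1 = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

# A model that never predicts default still gets 60% accuracy here,
# but its F1 score is 0 because it finds no defaulters at all.
y_all_zero = [0] * len(y_true)
naive_accuracy = accuracy_score(y_true, y_all_zero)
naive_f1 = f1_score(y_true, y_all_zero, zero_division=0)

print(precision, recall, f1)
print(naive_accuracy, naive_f1)
```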
Part 6: Model Refitting
So I’m going to train the model again, using f1 as the scoring metric. Let’s wait for it to build.
What do you think of the model’s performance now?
The F1 is quite low… it doesn’t seem at first glance that the model is doing well.
What would you suggest to improve the model?
Well, you could try tuning more of the parameters, such as adding more trees, or optimising how many features every decision tree in the random forest samples from, or the minimum number of samples per leaf in every decision tree. You should also do some more preprocessing steps, such as dealing appropriately with the other categorical variables such as education, and having better encoding for missing values. You could also do more feature engineering and thus increase the predictive capability of the model, using some of the ideas from before, such as the client’s other loans, their professions, and pooling from more historical data. You should also investigate the data points that the model is misclassifying, to see if there’s some kind of pattern in the misclassifications. And you could speak with subject matter experts and really learn about the data and go from there.
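One way the wider tuning suggested here could be set up is sketched below, scoring on F1. The grid values and the synthetic data are assumptions chosen to keep the example quick; they are not the settings from the interview.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the prepared training features.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Hypothetical grid over the three parameters mentioned above.
param_grid = {
    "n_estimators": [50, 100],        # number of trees
    "max_features": ["sqrt", 0.5],    # features considered at each split
    "min_samples_leaf": [1, 5],       # minimum samples per leaf
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="f1",  # F1 of the positive (default) class
)
grid.fit(X, y)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```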
Part 7: Test Set
Finally, we are going to evaluate the model on the test set and see what happens. So I’m going to perform the same data preprocessing steps we had for the training set: we are going to drop the missing values, encode the marriage feature appropriately, and use the trained model to predict on the test set.
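A sketch of applying identical preprocessing to both sets and scoring the held-out data. The preprocess and make_toy helpers, the AGE column and the random toy data are all hypothetical; the response here is random noise, so the printed score only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def preprocess(df):
    """Drop rows with missing MARRIAGE and replace it with indicator columns."""
    df = df.dropna(subset=["MARRIAGE"])
    dummies = pd.get_dummies(df["MARRIAGE"].astype(int), prefix="MARRIAGE", dtype=int)
    return pd.concat([df.drop(columns=["MARRIAGE"]), dummies], axis=1)

rng = np.random.default_rng(0)

def make_toy(n):
    """Build a hypothetical dataset with a few missing MARRIAGE values."""
    return pd.DataFrame({
        "AGE": rng.integers(21, 70, size=n),
        "MARRIAGE": rng.choice([1.0, 2.0, 3.0, np.nan], size=n, p=[0.45, 0.45, 0.08, 0.02]),
        "default_payment": (rng.random(n) < 0.2).astype(int),
    })

train = preprocess(make_toy(400))
test = preprocess(make_toy(150))

feature_cols = [c for c in train.columns if c != "default_payment"]
X_train, y_train = train[feature_cols], train["default_payment"]

# Align test columns with training columns; a category absent from the
# test set becomes an all-zero indicator column.
X_test = test.reindex(columns=feature_cols, fill_value=0)
y_test = test["default_payment"]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
test_f1 = f1_score(y_test, model.predict(X_test), zero_division=0)
print(round(test_f1, 3))
```

Wrapping the steps in one function is a design choice that helps keep the training and test preprocessing from drifting apart.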
This is the F1 score on the testing set. What are your thoughts about this result?
The F1 score on the test set is very close to the cross-validation results, which is good: it means our cross-validation gives us a reliable estimate of what our test set performance will be. But the F1 score continues to be low in both cases, so much more work would need to be done to improve the model performance.
That’s it for our Data Science Mock Interview. Thanks for watching! Please consider commenting and subscribing to our channel, and make sure to visit us at AceAINow.com to access FREE data science interview resources and top interview tips.