Tutorial-Top Data Science Tech Questions

Top Question 1: Describe how your favorite algorithm works.

Make sure to be prepared with at least one algorithm you can describe in great detail. Be familiar with how the algorithm works, the use cases it's best suited for, its potential caveats, and the data preprocessing steps necessary for it to perform well.

One of my favorite algorithms is the Random Forest algorithm. It can be used for both classification and regression tasks and requires minimal data preprocessing: random forests handle categorical features and missing values well and don't require data normalization or standardization. The algorithm uses decision trees as base learners, combining many decorrelated trees to make predictions, which reduces the high variance that single decision trees suffer from. The key to this decorrelation is that each time a split in an individual tree is calculated, the algorithm considers only a random sample of the predictors as candidate features to split on, rather than all of them. One disadvantage of random forests compared with single decision trees is that the results are harder for stakeholders and managers to visualize and interpret.
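The per-split predictor sampling described above can be sketched with scikit-learn, where the `max_features` parameter controls how many predictors each split may consider (the dataset and parameter values below are illustrative, not from the original):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy classification data; sizes are illustrative
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# max_features="sqrt" means each split considers only a random subset
# of the predictors, which decorrelates the individual trees
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]))
```

Setting `max_features` to the total number of predictors would make this plain bagging; the random subset is what makes it a random forest.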

Top Question 2: Which of the following is not true about bias and variance?
  1. bias is a measure of a model's error
  2. variance is a measure of how an estimate changes given different training data
  3. there is no trade-off between bias and variance
  4. an ideal model has low bias and low variance

The correct answer is 3. There is always a trade-off between bias and variance: decreasing one typically increases the other.

Top Question 3: Which of the following is not true about overfitting?
  1. an overfit model captures the noise in the training data
  2. overfit models do not generalize well to new unseen data
  3. overfit models are less likely to occur in models with greater flexibility
  4. overfitting may be assessed with cross validation techniques

The correct answer is 3. Models with greater flexibility are more, not less, likely to overfit. Techniques such as regularization, feature selection, feature extraction, and cross-validation can help prevent overfitting.
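One way to see how cross-validation assesses overfitting (option 4) is to compare training accuracy with cross-validated accuracy for a highly flexible model; this scikit-learn sketch uses an illustrative noisy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy toy data (flip_y injects label noise)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

# An unconstrained (highly flexible) tree fits the training noise ...
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = tree.score(X, y)

# ... but cross-validation reveals weaker out-of-sample performance
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
print(train_acc, cv_acc)
```

A large gap between the two scores is the classic signature of overfitting.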

Top Question 4: Which of the following are supervised learning approaches?
  1. Random Forest
  2. Decision Trees
  3. Naive Bayes
  4. K-Means

The correct answer is 1, 2 and 3. K-Means is the only unsupervised learning approach among the choices.

Top Question 5: Describe an unsupervised algorithm you used recently.

Common unsupervised algorithms in commercial settings are k-Means for clustering, the Apriori algorithm for association rule learning, and Principal Component Analysis for dimensionality reduction.

Top Question 6: What are some common data preprocessing steps you need to take before using machine learning algorithms?

Many algorithms require the data to be standardized or normalized. Further, many algorithms cannot handle missing values, so missing-data imputation is necessary. In addition, often categorical values have to be encoded to numerical values using techniques such as one-hot encoding and label encoding.
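The three preprocessing steps above can be sketched with pandas and scikit-learn (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative frame with a missing value and a categorical column
df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NY", "SF", "NY"]})

# Impute the missing numeric value with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# One-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])

# Standardize the numeric column (zero mean, unit variance)
df["age"] = StandardScaler().fit_transform(df[["age"]])
print(df)
```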

Top Question 7: Name a few approaches you can use to impute missing values.

For a categorical feature, you could impute a missing value with the mode of the feature, whereas for numerical features, you could impute the missing values with the mean or the median of the feature. You could also use the k-Nearest Neighbors algorithm to find a data point's k closest neighbors and impute the value based on the values in the point's neighborhood.
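Both the mean-imputation and k-Nearest Neighbors approaches above are available in scikit-learn; this small sketch uses an illustrative array:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Illustrative data with one missing value
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# Mean imputation for a numerical feature: the missing entry
# becomes the column mean, (1 + 7) / 2 = 4
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# k-Nearest Neighbors imputation: the missing entry is filled from
# the k closest rows (with k=1, the single nearest row's value)
X_knn = KNNImputer(n_neighbors=1).fit_transform(X)
print(X_mean[1, 0], X_knn[1, 0])
```

For a categorical feature, `SimpleImputer(strategy="most_frequent")` implements the mode imputation mentioned above.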

Top Question 8: Name a few of the libraries you use for data preprocessing.

Python: pandas for data manipulation, NumPy for arrays and numerical operations, scikit-learn for encoders, standardization, and imputation, and NLTK for text mining
R: plyr and dplyr for data manipulation, reshape for transposing and melting data, lubridate to process date and time, and tm for text mining

Top Question 9: Which of the following is not true about regularization?
  1. lasso regularization may improve model interpretability
  2. smaller shrinkage tuning parameter values increase the impact of regularization
  3. ridge regularization shrinks all coefficient values
  4. regularization may help prevent overfitting

The correct answer is 2. Smaller shrinkage tuning parameter values decrease the impact of regularization.

Top Question 10: Which of these approaches should not be used to address multicollinearity?
  1. adding correlated features
  2. regularization
  3. measuring correlation between predictors
  4. measuring multicollinearity with the variance inflation factor

The correct answer is 1. Adding correlated features will only worsen the multicollinearity problem.

Top Question 11: How should you correctly split your data to evaluate your model performance?

You should divide your data into train/validation/test splits. You use the training set to learn the model, the validation set to tune the model, and the test set to determine the out-of-sample performance of the model. A good starting point is to use 60% of the data for training, 20% for validation, and 20% for testing. You can also use cross-validation to build multiple train/validation splits.
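The 60/20/20 split described above can be built with two calls to scikit-learn's `train_test_split` (the dataset size is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

# First carve off 20% of the data for the test set ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ... then split the remaining 80% as 75/25, giving 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```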

Top Question 12: Which of these is not a reason for using cross validation?
  1. estimate test error in the training data
  2. detect model overfitting
  3. utilization of more data for training while maintaining enough data for validation
  4. computational efficiency of using more folds

The correct answer is 4. Increasing the number of cross-validation folds increases the computational cost of validation.
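A short scikit-learn sketch of the trade-off: with `cv=k` the model is fit k times (the computational cost in option 4), while each fit trains on (k-1)/k of the data (the benefit in option 3). The dataset and model are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, random_state=0)

# cv=5 fits the model 5 times, each on 80% of the data,
# and returns one validation score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(len(scores), scores.mean())
```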

Top Question 13: What approaches should you not consider when dealing with unbalanced data for a classification task?
  1. Using accuracy as the performance metric
  2. Up sampling the minority class
  3. Down sampling the majority class
  4. Using metrics such as area under the receiver operating characteristic curve

The correct answer is 1. Accuracy is misleading on unbalanced data: a model that always predicts the majority class can score highly while never identifying a single minority-class case.
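The accuracy pitfall and the up-sampling remedy from options 1 and 2 can be sketched with scikit-learn's `resample` utility (the 95/5 class split is illustrative):

```python
import numpy as np
from sklearn.utils import resample

# Illustrative imbalanced labels: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)

# Always predicting the majority class yields 95% accuracy
# while detecting zero positive cases
majority_acc = (y == 0).mean()

# Up-sample the minority class (with replacement) to balance the data
upsampled = resample(y[y == 1], replace=True, n_samples=95, random_state=0)
y_balanced = np.concatenate([y[y == 0], upsampled])
print(majority_acc, np.bincount(y_balanced))
```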

Top Question 14: Which of these is not true about the sensitivity measure?
  1. sensitivity measures the proportion of positive cases that were correctly identified
  2. sensitivity is the same measure as recall
  3. sensitivity is the true positive rate
  4. sensitivity is the same measure as precision

The correct answer is 4. Sensitivity (also called recall, or the true positive rate) measures the proportion of positive cases that were correctly identified; precision is a different measure, the proportion of predicted positives that are truly positive.

Top Question 15: What does the F1 score measure?
  1. area under the receiver operating characteristic curve
  2. accuracy
  3. specificity
  4. harmonic mean of recall and precision

The correct answer is 4. The F1 score is the harmonic mean of precision and recall.
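The harmonic-mean definition can be verified directly against scikit-learn's `f1_score` (the labels below are illustrative):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall
f1 = 2 * p * r / (p + r)
print(f1, f1_score(y_true, y_pred))
```

Here TP = 3, FP = 1, FN = 1, so precision and recall are both 0.75 and F1 = 0.75.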

Top Question 16: Name a few of the libraries you use for building models?

Python: scikit-learn, TensorFlow, Keras, PyTorch, Spark
R: caret, e1071, randomForest, nnet, mboost, gbm, SparkR

Top Question 17: Which of the following statements are not true regarding the bag of words approach?
  1. word order and grammar is always accounted for
  2. text is represented by frequency of each word
  3. bag of words matrices are often sparse
  4. TF-IDF is an approach to improve the BOW technique

The correct answer is 1. The bag of words approach discards word order and grammar, representing each document only by the frequency of its words.
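A minimal bag of words sketch with scikit-learn's `CountVectorizer` (the two documents are illustrative) shows both the word-frequency representation and the sparse count matrix from options 2 and 3:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat and the dog"]

# Each document becomes a vector of word counts;
# word order and grammar are discarded entirely
bow = CountVectorizer()
X = bow.fit_transform(docs)  # a sparse matrix
print(sorted(bow.vocabulary_))
print(X.toarray())
```

`TfidfVectorizer` applies the TF-IDF weighting from option 4 on top of the same representation.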

Top Question 18: Which of these statements are not true about random forests?
  1. Random forests use bootstrapping approaches
  2. Random forests employ bagging approaches
  3. Error estimates may be made with the out of bag error method
  4. Random forest trees are built sequentially

The correct answer is 4. Random forest trees are built independently of one another, unlike boosting, where trees are built sequentially.

Top Question 19: What are assumptions of linear regression?
  1. linearity
  2. homoscedasticity
  3. independence of observations
  4. normality of the error terms

All of the above are assumptions of linear regression.
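Two of these assumptions can be checked from the fitted residuals; this sketch generates illustrative data that satisfies the assumptions and inspects the residuals with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data meeting the assumptions: linear signal, i.i.d. normal noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals should center on zero (linearity) and show roughly
# constant spread across fitted values (homoscedasticity);
# a residuals-vs-fitted plot and a Q-Q plot are the usual visual checks
print(residuals.mean(), residuals.std())
```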

Top Question 20: What are your favorite data visualization libraries?

Python: Matplotlib, Seaborn, Plotly, Bokeh, geoplotlib
R: ggplot2, Shiny, Plotly, Leaflet