Tutorial: Top Data Science Tech Questions
Top Question 1: Describe how your favorite algorithm works.
Be prepared with at least one algorithm you can describe in great detail. Be familiar with how the algorithm works, which use cases it's best suited for, what caveats it has, and which data preprocessing steps it needs to perform well.
One of my favorite algorithms is the Random Forest algorithm. It can be used for both classification and regression tasks, and it requires minimal data preprocessing, as the random forest
can handle missing values and categorical features well, and doesn't require data normalization or standardization.
The random forest algorithm works by using decision trees as base learners. It uses many decorrelated trees to make predictions, ultimately helping to reduce the high variance problem
single decision trees suffer from. The key to this decorrelation is that each time a split in an individual tree is calculated, the algorithm doesn't consider all the predictors, but only
a random sample of them as candidate features to split on.
One potential disadvantage of the random forest over a single decision tree is that it's harder for stakeholders and managers to visualize and interpret the results.
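The decorrelation idea described above can be sketched in a few lines, assuming scikit-learn and its bundled iris dataset are available; the `max_features` parameter controls how many predictors each split may consider.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# max_features="sqrt" is the classic choice for classification:
# each split only sees a random subset of the predictors,
# which decorrelates the trees in the ensemble.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=42
)
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)
```

Averaging over many such decorrelated trees is what reduces the variance of the ensemble relative to a single tree.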
Top Question 2: Which of the following is not true about bias and variance?
A. bias is a measure of a model's error
B. variance is a measure of how an estimate changes given different training data
C. there is no trade-off between bias and variance
D. an ideal model has low bias and low variance
The correct answer is C. There is always a trade-off between bias and variance.
Top Question 3: Which of the following is not true about overfitting?
A. an overfit model captures the noise in the training data
B. overfit models do not generalize well to new unseen data
C. overfit models are less likely to occur in models with greater flexibility
D. overfitting may be assessed with cross-validation techniques
The correct answer is C. Models with high flexibility are more, not less, prone to overfitting. Techniques such as regularization, feature selection, feature extraction, and cross-validation can help prevent overfitting.
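The cross-validation check mentioned in the answer can be sketched as follows, assuming scikit-learn: a large gap between training accuracy and cross-validated accuracy is a sign of overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained decision tree is highly flexible and tends
# to memorize the training data.
tree = DecisionTreeClassifier(random_state=0)
train_score = tree.fit(X, y).score(X, y)             # perfect on training data
cv_score = cross_val_score(tree, X, y, cv=5).mean()  # lower out-of-sample

gap = train_score - cv_score  # the larger the gap, the worse the overfit
```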
Top Question 4: Which of the following are supervised learning approaches?
The correct answer is A, B, and C. K-Means is the only unsupervised learning approach among the choices.
Top Question 5: Describe an unsupervised algorithm you used recently.
Common unsupervised algorithms in commercial settings are k-Means for clustering, the Apriori algorithm for association rule learning, and Principal Component Analysis for dimensionality reduction.
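Two of the algorithms above can be sketched with scikit-learn (Apriori is not part of scikit-learn, so it is omitted here); note that neither uses the target labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # labels deliberately ignored

# k-Means: partition the 150 samples into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# PCA: project the 4 original features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()  # variance retained
```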
Top Question 6: What are some common data preprocessing steps you need to take before using machine learning algorithms?
Many algorithms require the data to be standardized or normalized. Further, many algorithms cannot handle missing values, so missing-data imputation is necessary.
In addition, often categorical values have to be encoded to numerical values using techniques such as one-hot encoding and label encoding.
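A minimal sketch of the steps above, assuming pandas and scikit-learn, with a small made-up DataFrame for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 40, 55],
    "city": ["Paris", "London", "Paris"],
})

# Standardize the numeric column to zero mean and unit variance.
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]])

# One-hot encode the categorical column into indicator columns.
df = pd.get_dummies(df, columns=["city"])
```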
Top Question 7: Name a few approaches you can use to impute missing values.
For a categorical feature, you could impute a missing value with the mode
of the feature, whereas for numerical features, you could impute the missing values with the mean or the median of the feature.
You could also use the k-Nearest Neighbors algorithm to
find a data point's k closest neighbors and impute the value based on the values in the point's neighborhood.
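The three strategies above can be sketched as follows, assuming pandas and scikit-learn, with small made-up data for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "color": ["red", "blue", "red", None],   # categorical with a gap
    "size": [1.0, 2.0, np.nan, 4.0],         # numeric with a gap
})

# Mode for the categorical feature, median for the numeric one.
df["color"] = df["color"].fillna(df["color"].mode()[0])
df["size"] = df["size"].fillna(df["size"].median())

# k-Nearest Neighbors imputation: fill a gap from the values
# of the k closest rows (here k=2, averaged uniformly).
X = np.array([[1.0, 2.0], [2.0, 4.0], [np.nan, 6.0]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```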
Top Question 8: Name a few of the libraries you use for data preprocessing.
Python: pandas for data manipulation, numpy for arrays and numerical operations, scikit-learn for encoders, standardization, and imputation, and NLTK for text mining
R: plyr and dplyr for data manipulation, reshape for transposing and melting data, lubridate to process date and time, and tm for text mining
Top Question 9: Which of the following is not true about regularization?
A. lasso regularization may improve model interpretability
B. smaller shrinkage tuning parameter values increase the impact of regularization
C. ridge regularization shrinks all coefficient values
D. regularization may help prevent overfitting
The correct answer is B. Smaller shrinkage tuning parameter values decrease the impact of regularization.
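The relationship in the answer can be sketched with scikit-learn's `Ridge`, where `alpha` plays the role of the shrinkage tuning parameter (the synthetic data here is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, -1.5]) + rng.normal(size=100)

# A small alpha barely regularizes; a large alpha shrinks hard.
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)

weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)  # smaller: more shrinkage
```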
Top Question 10: Which of these approaches should not be used to address multicollinearity?
A. adding correlated features
B. measuring correlation between predictors
C. measuring multicollinearity with the variance inflation factor
The correct answer is A. Adding correlated features to the model will only worsen the multicollinearity problem.
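The variance inflation factor mentioned in option C can be computed by hand, assuming numpy and scikit-learn: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others (the `vif` helper below is an illustrative sketch, not a library function).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                          # independent predictor
X = np.column_stack([x1, x2, x3])

def vif(X, i):
    """VIF of column i: regress it on the remaining columns."""
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

vif_correlated = vif(X, 0)   # large: x1 is nearly determined by x2
vif_independent = vif(X, 2)  # close to 1: no multicollinearity
```

A common rule of thumb treats VIF values above 5 or 10 as a sign of problematic multicollinearity.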
Top Question 11: How should you correctly split your data to evaluate your model performance?
You should divide your data into train/validation/test splits. You use the training set to learn the model, the validation set to tune the model, and the test set
to determine the out-of-sample performance of the model. A good starting point is to use 60% of the data for training,
20% for validation, and 20% for testing. You can also use cross-validation to build multiple train/validation splits.
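The 60/20/20 split above can be built by applying scikit-learn's `train_test_split` twice, as a sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples

# First hold out 20% as the test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remaining 80% in a 75/25 ratio,
# which yields 60% train / 20% validation overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

sizes = (len(X_train), len(X_val), len(X_test))
```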
Top Question 12: Which of these is not a reason for using cross validation?
A. estimate test error in the training data
B. detect model overfitting
C. utilization of more data for training while maintaining enough data for validation
D. computational efficiency of using more folds
The correct answer is D. Increasing the number of cross validation folds increases the computational cost of performing validation.
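The cost in the answer follows directly from how k-fold cross-validation works: each of the k folds serves once as validation data, so k folds mean k model fits. A sketch with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores_5 = cross_val_score(model, X, y, cv=5)    # 5 model fits
scores_10 = cross_val_score(model, X, y, cv=10)  # 10 fits: roughly twice the cost
```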
Top Question 13: What approaches should you not consider when dealing with unbalanced data for a classification task?
A. Using accuracy as the performance metric
B. Up-sampling the minority class
C. Down-sampling the majority class
D. Using metrics such as area under the receiver operating characteristic curve
The correct answer is A. You should not use accuracy as a metric when your data is unbalanced, because a model can score a high accuracy simply by always predicting the majority class.
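A small sketch of why accuracy misleads on unbalanced data, assuming scikit-learn: a classifier that always predicts the majority class looks accurate while never finding a single minority case.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 negatives, 5 positives: a heavily unbalanced label set.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

acc = accuracy_score(y_true, y_pred)  # high, yet the model is useless
rec = recall_score(y_true, y_pred)    # zero: no positive case was found
```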
Top Question 14: Which of these is not true about the sensitivity measure?
A. sensitivity measures the proportion of positive cases that were correctly identified
B. sensitivity is the same measure as recall
C. sensitivity is the true positive rate
D. sensitivity is the same measure as precision
The correct answer is D. Sensitivity, recall, and the true positive rate are the same measure; precision is a different measure.
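The equivalence in the answer can be verified numerically, assuming scikit-learn: computing the true positive rate by hand gives the same value as `recall_score`.

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# 2 true positives out of 4 actual positives.
tp, fn = 2, 2
sensitivity = tp / (tp + fn)            # true positive rate by hand
recall = recall_score(y_true, y_pred)   # the same number
```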
Top Question 15: What does the F1 score measure?
A. area under the receiver operating characteristic curve
B. harmonic mean of recall and precision
The correct answer is B. The F1 score is the harmonic mean of precision and recall.
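The harmonic-mean formula can be checked against scikit-learn's `f1_score` directly:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)   # 2 of 3 predicted positives correct
r = recall_score(y_true, y_pred)      # 2 of 3 actual positives found
f1_manual = 2 * p * r / (p + r)       # harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)         # matches the manual formula
```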
Top Question 16: Name a few of the libraries you use for building models.