# Tutorial: Top Data Science Tech Questions

# Top Question 1: Describe how your favorite algorithm works

Make sure to be prepared with at least one algorithm you can describe in great detail. Be familiar with how the algorithm works, with what use cases it's best for, with potential caveats the algorithm has, and with what data preprocessing steps are necessary for the algorithm to perform well.

One of my favourite algorithms is the Random Forest. It can be used for both classification and regression tasks and requires minimal data preprocessing: it handles missing values and categorical features well and doesn't require data normalization or standardization. The algorithm uses decision trees as base learners, combining many decorrelated trees to make predictions, which reduces the high variance a single decision tree suffers from. The key to this decorrelation is that each time a split in an individual tree is considered, the algorithm doesn't evaluate all the predictors, but only a random sample of them as candidate features to split on. One disadvantage of a random forest compared with a single decision tree is that it's harder for stakeholders and managers to visualise and interpret the results.
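A minimal sketch of this in Python with scikit-learn (the dataset here is synthetic and the hyperparameter values are illustrative, not part of the answer above). The `max_features` argument is what restricts each split to a random sample of the predictors, producing the decorrelation described:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# max_features="sqrt" samples sqrt(n_features) candidate predictors at
# each split, which is what decorrelates the individual trees.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```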

# Top Question 2: Which of the following is not true about bias and variance?

A. Bias is a measure of a model's error
B. Variance is a measure of how an estimate changes given different training data
C. There is no trade-off between bias and variance
D. An ideal model has low bias and low variance

The correct answer is C. There is always a trade-off between bias and variance: decreasing one typically increases the other.

# Top Question 3: Which of the following is not true about overfitting?

A. An overfit model captures the noise in the training data
B. Overfit models do not generalize well to new unseen data
C. Overfit models are less likely to occur in models with greater flexibility
D. Overfitting may be assessed with cross validation techniques

The correct answer is C. More flexible models are more, not less, likely to overfit. Techniques such as regularization, feature selection, and feature extraction can help prevent overfitting, and cross-validation can help detect it.

# Top Question 4: Which of the following are supervised learning approaches?

A. Random Forest
B. Decision Trees
C. Naive Bayes
D. K-Means

The correct answer is A, B, and C. K-Means is the only unsupervised learning approach among the choices.

# Top Question 5: Describe an unsupervised algorithm you have used recently

Common unsupervised algorithms used in commercial settings are k-Means for clustering, the Apriori algorithm for association rule learning, and Principal Component Analysis (PCA) for dimensionality reduction.
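Two of these are available in scikit-learn, so a short illustrative sketch (with made-up synthetic data) might look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data with three obvious groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-Means assigns each point to one of k clusters without using labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# PCA projects the data onto its directions of greatest variance.
X_reduced = PCA(n_components=1).fit_transform(X)
```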

# Top Question 6: What are some common data preprocessing steps you need to take before using machine learning algorithms?

Many algorithms require the data to be standardized or normalized. Further, many algorithms cannot handle missing values, so missing-data imputation is necessary. In addition, often categorical values have to be encoded to numerical values using techniques such as one-hot encoding and label encoding.
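Two of these steps, one-hot encoding and standardization, can be sketched with pandas and scikit-learn on a hypothetical toy frame (the column names are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with one numerical and one categorical column.
df = pd.DataFrame({"age": [25.0, 32.0, 47.0], "city": ["NY", "LA", "NY"]})

# One-hot encode the categorical column into indicator columns.
df = pd.get_dummies(df, columns=["city"])

# Standardize the numerical column to zero mean and unit variance.
df["age"] = StandardScaler().fit_transform(df[["age"]])
```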

# Top Question 7: Name a few approaches you can use to impute missing values

For a categorical feature, you could impute a missing value with the mode of the feature, whereas for numerical features, you could impute the missing values with the mean or the median of the feature. You could also use the k-Nearest Neighbors algorithm to find a data point's k closest neighbors and impute the value based on the values in the point's neighborhood.
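Mean imputation and kNN imputation are both available in scikit-learn; a minimal sketch on a tiny made-up array (in this symmetric example both strategies happen to fill the same value):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])

# Mean imputation: replace the missing value with the column mean.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# kNN imputation: fill from the values of the k closest rows.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```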

# Top Question 8: Name a few of the libraries you use for data preprocessing

Python: pandas for data manipulation, numpy for arrays and numerical operations, scikit-learn for encoders, standardization and imputation, and NLTK for text mining.
R: plyr and dplyr for data manipulation, reshape for transposing and melting data, lubridate to process dates and times, and tm for text mining.

# Top Question 9: Which of the following is not true about regularization?

A. Lasso regularization may improve model interpretability
B. Smaller shrinkage tuning parameter values increase the impact of regularization
C. Ridge regularization shrinks all coefficient values
D. Regularization may help prevent overfitting

The correct answer is B. Smaller shrinkage tuning parameter values decrease the impact of regularization.
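An illustrative sketch in scikit-learn, where the shrinkage tuning parameter is called `alpha` and larger values mean stronger regularization (the data here is synthetic, with only two of five features carrying signal):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first and fourth features actually matter.
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=100)

# Lasso can shrink some coefficients exactly to zero (built-in feature
# selection, which is why it can improve interpretability); ridge shrinks
# all coefficients toward zero but not exactly to zero.
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
```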

# Top Question 10: Which of these approaches should not be used to address multicollinearity?

A. Adding correlated features
B. Regularization
C. Measuring correlation between predictors
D. Measuring multicollinearity with the variance inflation factor

The correct answer is A. Adding correlated features to the model will only worsen the multicollinearity problem.

# Top Question 11: How should you correctly split your data to evaluate your model performance?

You should divide your data into train/validation/test splits. You use the training set to learn the model, the validation set to tune the model, and the test set to determine the out-of-sample performance of the model. A good starting point is to use 60% of the data for training, 20% for validation, and 20% for testing. You can also use cross-validation to build multiple train/validation splits.
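The 60/20/20 starting point can be produced with two calls to scikit-learn's `train_test_split`: first hold out 20% for testing, then take 25% of the remaining 80% for validation (0.25 × 0.8 = 0.2). The data below is a dummy array for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First split off 20% for the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then 25% of the remaining 80% becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```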

# Top Question 12: Which of these is not a reason for using cross validation?

A. Estimate test error in the training data
B. Detect model overfitting
C. Utilization of more data for training while maintaining enough data for validation
D. Computational efficiency of using more folds

The correct answer is D. Increasing the number of cross validation folds increases the computational cost of performing validation.
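As a small sketch of k-fold cross validation in scikit-learn (using the built-in iris dataset and a logistic regression purely as an example model):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: each fold trains on 4/5 of the data and
# validates on the held-out fifth, yielding five accuracy scores.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```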

# Top Question 13: What approaches should you not consider when dealing with unbalanced data for a classification task?

A. Using accuracy as the performance metric
B. Up-sampling the minority class
C. Down-sampling the majority class
D. Using metrics such as the area under the receiver operating characteristic curve

The correct answer is A. Accuracy is misleading when the data is unbalanced: a model that always predicts the majority class can score high accuracy while never detecting the minority class.
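This failure mode is easy to demonstrate with a contrived 95/5 split and a degenerate model that always predicts the majority class (the labels and scores below are fabricated for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 95% of the labels are the majority class.
y_true = np.array([0] * 95 + [1] * 5)

# A useless model that always predicts the majority class...
y_pred = np.zeros(100, dtype=int)
y_score = np.zeros(100)  # constant scores carry no ranking information

acc = accuracy_score(y_true, y_pred)   # looks impressive
auc = roc_auc_score(y_true, y_score)   # exposes the model as no better than chance
```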

# Top Question 16: Name a few of the libraries you use for building models

Python: scikit-learn, TensorFlow, Keras, PyTorch, Spark.
R: caret, e1071, randomForest, nnet, mboost, gbm, SparkR.