Scenario practice questions

A data scientist is training a binary classification model to predict customer churn. The dataset has 10,000 records with 9,500 non-churners and 500 churners. After training a logistic regression model, the model achieves 95% accuracy on the test set. However, the business team reports that the model is not useful because it predicts almost all customers as non-churners. Which metric should the data scientist use to evaluate the model's performance in this scenario?

Trap 1: Accuracy

Accuracy is misleading on imbalanced datasets: a model that predicts only the majority class here would still score about 95% while identifying no churners.

Trap 2: R-squared

R-squared is a metric for regression models, not classification.

Trap 3: Precision

Precision measures how many of the predicted churners are actual churners, but it does not reflect how many actual churners were missed.

A
Accuracy
Why wrong: Accuracy is misleading on imbalanced datasets: a model that predicts only the majority class here would still score about 95% while identifying no churners.
B
R-squared
Why wrong: R-squared is a metric for regression models, not classification.
C
Precision
Why wrong: Precision measures how many of the predicted churners are actual churners, but it does not reflect how many actual churners were missed.
D
Recall
Recall measures the proportion of actual churners that the model correctly identifies. A model that labels almost every customer a non-churner has recall near zero, so recall exposes the failure that accuracy hides.
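The gap between accuracy and recall can be checked with a few lines of arithmetic. The sketch below is illustrative only: it applies the class counts from the question (9,500 non-churners, 500 churners) to a hypothetical model that predicts "non-churner" for everyone. The helper names `confusion_counts` and `metrics` are invented for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy, precision, and recall; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Assumption: class counts taken from the question, 1 = churner, 0 = non-churner.
y_true = [0] * 9500 + [1] * 500
# Hypothetical degenerate model: always predicts the majority class.
y_pred = [0] * 10000

scores = metrics(y_true, y_pred)
print(scores)  # accuracy is 0.95, recall is 0.0
```

Accuracy stays at 95% because the 9,500 non-churners are all classified correctly, while recall is zero because none of the 500 churners is found.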

Question 2 (medium, multiple choice)

A data scientist is training a model using Amazon SageMaker and notices the training loss is decreasing but validation loss starts increasing after a few epochs. Which technique should they apply to address this?

Trap 1: Increase batch size

Increasing batch size can speed up training but does not directly address overfitting.

Trap 2: Increase the learning rate

Increasing the learning rate does not address overfitting and can destabilize training or cause it to diverge.

Trap 3: Add more training data

More data can reduce overfitting, but it is often unavailable or costly to collect; regularization addresses the problem directly with the data already on hand.

A
Increase batch size
Why wrong: Increasing batch size can speed up training but does not directly address overfitting.
B
Increase the learning rate
Why wrong: Increasing the learning rate does not address overfitting and can destabilize training or cause it to diverge.
C
Add more training data
Why wrong: More data can reduce overfitting, but it is often unavailable or costly to collect; regularization addresses the problem directly with the data already on hand.
D
Add regularization (e.g., L1 or L2)
Regularization penalizes large weights and so reduces overfitting, which is what validation loss rising while training loss keeps falling indicates.
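To make "penalizes large weights" concrete, here is a minimal sketch of an L2 penalty, assuming a penalty of the form lam * sum(w^2) added to the data loss. The function names and the numbers are hypothetical and not tied to any SageMaker API.

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Total loss = loss on the data + L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, lam)


def sgd_step(weights, grads, lr, lam):
    """One gradient step; the penalty contributes 2 * lam * w to each gradient."""
    return [w - lr * (g + 2 * lam * w) for w, g in zip(weights, grads)]


# Hypothetical values chosen only for illustration.
weights = [3.0, -2.0]
total = regularized_loss(data_loss=1.0, weights=weights, lam=0.5)

# With a zero data gradient, the penalty alone shrinks each weight toward zero.
updated = sgd_step(weights, grads=[0.0, 0.0], lr=0.1, lam=0.5)
print(total, updated)
```

With these values each weight moves about 10% closer to zero in one step, which is the shrinkage effect that discourages the model from fitting noise in the training set.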

Question 3 (hard, multiple choice)

An e-commerce company stores user interaction logs in Amazon S3. They want to use machine learning to segment users based on purchasing behavior. Which unsupervised learning algorithm is most appropriate?

Trap 1: Linear regression

Supervised learning algorithm for predicting continuous values.

Trap 2: Random forest

Supervised ensemble learning method.

Trap 3: Neural network

Neural networks can be applied to unsupervised tasks, but they are typically trained with labels and are more complex than simple segmentation requires.

A
Linear regression
Why wrong: Supervised learning algorithm for predicting continuous values.
B
Random forest
Why wrong: Supervised ensemble learning method.
C
K-means clustering
K-means is an unsupervised algorithm that groups data points into clusters based on similarity, which fits user segmentation where no labels exist.
D
Neural network
Why wrong: Neural networks can be applied to unsupervised tasks, but they are typically trained with labels and are more complex than simple segmentation requires.
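For readers who want to see how K-means groups users by similarity, the following is a minimal sketch in plain Python, not the SageMaker built-in implementation. The feature pairs (orders per month, average order value) and the choice of starting centroids are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update until the centroids stop moving."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # New centroid is the per-feature mean of the cluster members.
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical users: (orders per month, average order value).
users = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 220), (11, 210)]
# Assumption: start from one low-spend and one high-spend user.
centroids, labels = kmeans(users, centroids=[users[0], users[3]])
print(labels)
```

In practice the features would usually be scaled first, since average order value would otherwise dominate the distance calculation.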

Question 4 (hard, multiple choice)

An organization uses Amazon Bedrock to generate content. They have implemented guardrails to block toxic content. However, some users are able to bypass the guardrails by encoding their prompts. What step should be taken to improve security?

Trap 1: Encode the prompts before sending to the model.

Encoding prompts does not prevent bypass; the model would decode and potentially still produce toxic content.

Trap 2: Use a different foundation model that is less susceptible.

Model susceptibility is not the root cause; guardrails can be bypassed regardless.

Trap 3: Restrict access to the model using IAM policies.

Access control does not prevent authorized users from submitting encoded prompts.

A
Encode the prompts before sending to the model.
Why wrong: Encoding prompts does not prevent bypass; the model would decode and potentially still produce toxic content.
B
Enable prompt injection detection in the guardrail configuration.
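Prompt injection detection (the prompt attack filter in Amazon Bedrock Guardrails) screens user input for attempts to bypass or override safeguards, so it targets the bypass directly. No filter catches every encoding, so it reduces the risk rather than eliminating it.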
C
Use a different foundation model that is less susceptible.
Why wrong: Model susceptibility is not the root cause; guardrails can be bypassed regardless.
D
Restrict access to the model using IAM policies.
Why wrong: Access control does not prevent authorized users from submitting encoded prompts.

Question 5 (easy, multiple choice)

A data scientist at a retail company is tasked with building a model to predict customer churn. The dataset contains 100,000 records with features such as age, purchase history, customer support interactions, and a binary label indicating whether the customer churned in the past. The team needs a model that can be deployed for real-time inference with low latency. They have limited time and want to use a built-in algorithm from Amazon SageMaker that is optimized for classification tasks. Which approach should they take?

Trap 1: Use Amazon SageMaker PCA algorithm

PCA is for dimensionality reduction, not for building a classification model.

Trap 2: Use Amazon SageMaker K-Means algorithm

K-Means is an unsupervised clustering algorithm, not for supervised classification.

Trap 3: Use Amazon SageMaker BlazingText algorithm

BlazingText is designed for text data, not tabular customer churn data.

A
Use Amazon SageMaker PCA algorithm
Why wrong: PCA is for dimensionality reduction, not for building a classification model.
B
Use Amazon SageMaker XGBoost algorithm
XGBoost is a built-in SageMaker algorithm that supports binary classification, works well with tabular data, and can be deployed to a real-time endpoint.
C
Use Amazon SageMaker K-Means algorithm
Why wrong: K-Means is an unsupervised clustering algorithm, not for supervised classification.
D
Use Amazon SageMaker BlazingText algorithm
Why wrong: BlazingText is designed for text data, not tabular customer churn data.