CompTIA AI+ AI0-001 - Questions 151-225

500 questions total · 7 pages · All types, answers revealed

Page 3 of 7

151
Multi-Selectmedium

Which THREE are common activation functions used in neural networks? (Choose three.)

Select 3 answers
A.Sigmoid
B.K-means
C.Tanh
D.ReLU
E.Softmax
AnswersA, C, D

Correct: Sigmoid is a classic activation function.

Why this answer

Options A, C, and D are correct because Sigmoid, Tanh, and ReLU are widely used activation functions. Options B and E are incorrect: K-means is a clustering algorithm, not an activation function, and Softmax, although it is an activation function, is typically used only in the output layer for multi-class classification rather than in hidden layers.
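The four activation functions named in the options can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a framework implementation; the function names are chosen here for readability.

```python
import math

def sigmoid(x):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash a real number into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Pass positive values through and clamp negatives to zero."""
    return max(0.0, x)

def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
```

Sigmoid, Tanh, and ReLU act on one value at a time, whereas Softmax normalizes a whole vector of scores, which is why it usually appears only at the output layer.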

152
MCQeasy

A data scientist is preparing a dataset for a classification task. The dataset contains 10,000 rows and 50 features, but many features have missing values. Which approach should the scientist take first to address the missing data?

A.Use a deep learning model to predict missing values without preprocessing.
B.Analyze the pattern and proportion of missing values to choose an appropriate imputation strategy.
C.Remove all rows with any missing values to ensure a clean dataset.
D.Replace missing values with the mean of each feature immediately.
AnswerB

Understanding missingness pattern is crucial before deciding on imputation or deletion.

Why this answer

Option B is correct because the first step in handling missing data is to understand the pattern and proportion of missingness (e.g., MCAR, MAR, MNAR) to select an appropriate imputation method. Blindly applying imputation or deletion without analysis can introduce bias or reduce model performance. This diagnostic step ensures the chosen strategy aligns with the data's underlying structure and the classification task's requirements.

Exam trap

CompTIA often tests the misconception that immediate imputation (e.g., mean/median) or row deletion is the safest first step, when in reality, a diagnostic analysis of missingness patterns is required before any data modification.

How to eliminate wrong answers

Option A is wrong because deep learning models typically require complete data or sophisticated handling of missingness; using them to predict missing values without preprocessing ignores the need to first understand the missing data mechanism and can lead to overfitting or biased predictions. Option C is wrong because removing all rows with any missing values can discard a significant portion of the dataset (with missingness spread across up to 50 features, few rows may be fully complete), potentially losing valuable information and reducing statistical power, especially when missingness is not completely random. Option D is wrong because immediately replacing missing values with the mean of each feature assumes the data is missing completely at random (MCAR) and can distort feature distributions, reduce variance, and introduce bias if the missingness is related to the feature values themselves.
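The diagnostic step can start with something as simple as the share of missing values per feature and the number of fully complete rows. The sketch below assumes rows are dictionaries with `None` marking a missing value; the column names and figures are hypothetical.

```python
def missing_report(rows):
    """Return the fraction of missing (None) values for each column."""
    columns = rows[0].keys()
    n = len(rows)
    return {c: sum(1 for r in rows if r[c] is None) / n for c in columns}

# Hypothetical sample: None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "tenure": None},
    {"age": None, "income": 61000, "tenure": 3},
    {"age": 45, "income": None, "tenure": None},
    {"age": 29, "income": 48000, "tenure": 7},
]

report = missing_report(rows)
# Rows that would survive "remove all rows with any missing values".
complete_rows = sum(1 for r in rows if None not in r.values())
```

In this toy sample no column is more than half missing, yet only one of four rows is complete, which is the kind of finding that argues against deleting rows outright.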

153
MCQhard

A team is training a deep neural network on a large image dataset. They observe that the training loss decreases smoothly but validation loss oscillates. Which regularization technique should be applied?

A.Data augmentation
B.L1 regularization
C.Dropout
D.Batch normalization
AnswerC

Dropout reduces overfitting by randomly dropping units during training, forcing the network to learn robust features.

Why this answer

Option C is correct because dropout randomly deactivates neurons, preventing co-adaptation and reducing overfitting. Option B (L1 regularization) sparsifies weights but is less common for image DNNs. Option D (batch normalization) accelerates training but may not directly fix overfitting.

Option A (data augmentation) increases data diversity but acts on the input data rather than on the network itself.
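A minimal sketch of dropout on a list of activations follows. It uses the "inverted" variant, which rescales the surviving units during training so that nothing needs to change at inference time; that choice, the function name, and the sample values are assumptions made for illustration.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` while training
    and scale survivors by 1 / (1 - rate) so the expected value is unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(hidden, rate=0.5, rng=rng)
eval_out = dropout(hidden, rate=0.5, rng=rng, training=False)
```

At inference (`training=False`) the activations pass through untouched, so dropout only changes behavior while the network is being trained.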

154
MCQeasy

A financial institution is implementing an AI-based fraud detection system. The compliance officer is concerned about potential bias in the model that could lead to unfair treatment of certain customer groups. Which governance practice should be prioritized to address this concern?

A.Increase the diversity of the training data by collecting more samples from underrepresented groups.
B.Schedule regular bias audits using fairness metrics.
C.Retrain the model every month with the latest transaction data.
D.Use SHAP values to provide explanations for each prediction.
AnswerB

Bias audits with metrics like demographic parity can detect unfair treatment and guide mitigation.

Why this answer

Regular bias audits using fairness metrics (Option B) are the correct governance practice because they provide a systematic, quantitative method to detect and measure disparate impact across protected groups. Unlike simply collecting more data, audits directly evaluate model outputs for statistical parity, equal opportunity, or other fairness definitions, enabling the institution to identify and remediate bias proactively. This aligns with regulatory expectations for ongoing monitoring and accountability in AI governance.

Exam trap

CompTIA often tests the distinction between interpretability (explaining a single prediction) and fairness (systematic bias across groups), leading candidates to mistakenly choose SHAP values (Option D) as a bias mitigation technique when it is only an explanation tool.

How to eliminate wrong answers

Option A is wrong because merely increasing training data diversity does not guarantee fairness; the model can still learn biased correlations from the data or amplify existing societal biases, and without fairness metrics, there is no way to measure whether the outcome is equitable. Option C is wrong because retraining monthly with the latest transaction data addresses model drift and concept drift, not bias; bias can persist or even worsen with new data if the underlying data generation process remains biased. Option D is wrong because SHAP values provide local interpretability for individual predictions but do not measure or mitigate systemic bias across groups; they explain why a specific decision was made, not whether the model treats groups fairly overall.

155
MCQeasy

A data science team deployed a model for real-time predictions. After two weeks, the model's accuracy dropped from 92% to 80%. The monitoring system shows no data drift in features, but the target variable distribution has shifted. Which approach should the team use to detect this issue?

A.Schedule manual weekly reviews of model predictions
B.Monitor the distribution of the predicted target variable over time
C.Retrain the model immediately with new data
D.Monitor input feature distributions using a KS test
AnswerB

This detects target drift, which indicates concept drift.

Why this answer

Option B is correct because monitoring the distribution of the predicted target variable directly detects concept drift, which occurs when the relationship between features and the target changes. Since the monitoring system shows no data drift in features, the accuracy drop is likely due to a shift in the target variable's distribution, and tracking predictions over time reveals this shift. This approach aligns with MLOps best practices for detecting concept drift without requiring immediate retraining.

Exam trap

CompTIA often tests the distinction between data drift and concept drift, trapping candidates who assume that monitoring input features (Option D) is sufficient to detect all performance degradation.

How to eliminate wrong answers

Option A is wrong because manual weekly reviews are reactive, not proactive, and cannot provide real-time detection of distribution shifts; they also introduce latency and human error. Option C is wrong because retraining the model immediately without diagnosing the root cause may waste resources and could reinforce biased patterns if the drift is temporary or due to a data quality issue. Option D is wrong because monitoring input feature distributions using a KS test detects data drift, but the problem states there is no data drift in features, so this approach would not identify the target variable shift causing the accuracy drop.
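Monitoring the predicted target can be as simple as comparing the share of positive predictions in a reference window against a recent window. The sketch below assumes binary predictions; the 0.10 tolerance and the sample windows are hypothetical values, not recommended thresholds.

```python
def positive_rate(predictions):
    """Share of predictions equal to 1."""
    return sum(predictions) / len(predictions)

def prediction_shift(reference, recent, tolerance=0.10):
    """Flag a shift when the positive-prediction rate moves by more than `tolerance`."""
    delta = abs(positive_rate(recent) - positive_rate(reference))
    return delta, delta > tolerance

reference = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # 20% positive at deployment
recent = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]     # 50% positive two weeks later
delta, drifted = prediction_shift(reference, recent)
```

A production monitor would more likely apply a statistical test over larger windows, but the idea is the same: the check watches model outputs rather than input features.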

156
MCQeasy

A data scientist is preparing a dataset for training a classification model. The dataset has a column with missing values in 5% of rows. Which action should the data scientist take to minimize bias?

A.Impute missing values with the median of the column
B.Remove all rows with missing values
C.Replace missing values with a constant such as 999
D.Use a model that can handle missing values natively
AnswerA

Median imputation preserves the central tendency without being affected by outliers, suitable for low missing rate.

Why this answer

Imputing with the median preserves the column's central tendency without reducing sample size, minimizing bias. Removing rows reduces sample size, a constant such as 999 introduces an artificial outlier, and native handling of missing values may not be available in the chosen model.
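Median imputation can be sketched with the standard library alone. The column below is hypothetical and includes one extreme value to show why the median is the safer fill here than the mean.

```python
from statistics import mean, median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

charges = [20.0, 25.0, None, 30.0, 400.0]  # 400.0 is an outlier
imputed = impute_median(charges)
mean_fill = mean(v for v in charges if v is not None)
```

The median of the observed values is 27.5, while the mean is pulled up to 118.75 by the single outlier.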

157
MCQeasy

A data scientist needs to predict whether a customer will churn based on historical data containing features like account age, monthly charges, and support tickets. The target variable is binary (churn or not). Which type of machine learning algorithm should be used?

A.Linear regression
B.Logistic regression
C.K-means clustering
D.Principal component analysis
AnswerB

Logistic regression outputs probabilities for binary classification.

Why this answer

Logistic regression is a classification algorithm well-suited for binary outcomes. Linear regression is for continuous outputs, K-means is unsupervised, and PCA is dimensionality reduction.

158
MCQmedium

After deploying a model for fraud detection, the data scientist observes a steady decline in precision over two months. Which issue is most likely occurring?

A.Data drift
B.Concept drift
C.Model overfitting
D.Adversarial attack
AnswerB

Precision decline indicates that the model's decision boundary is no longer optimal, a sign of concept drift.

Why this answer

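Concept drift occurs when the relationship between the input features and the target variable changes over time, so a decision boundary learned from historical data gradually stops matching current fraud patterns and prediction quality declines.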

159
MCQhard

A security researcher demonstrates that by adding small perturbations to an image of a stop sign, an autonomous vehicle's AI misclassifies it as a speed limit sign. This is an example of which type of attack?

A.Data poisoning attack
B.Model extraction attack
C.Adversarial example attack
D.Membership inference attack
AnswerC

Adversarial examples are crafted inputs with perturbations that fool the model.

Why this answer

Adding small perturbations to an input to cause misclassification is a classic adversarial example attack, which falls under evasion attacks (adversarial machine learning). Data poisoning alters the training data, model extraction steals model parameters or behavior, and membership inference determines whether a given record was part of the training set.
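The effect of a small, targeted perturbation can be illustrated on a toy linear classifier. The sketch below uses a sign-of-gradient step in the style of the fast gradient sign method; the weights, input, and epsilon are hypothetical, and a real attack on an image model would compute gradients through the full network.

```python
def score(weights, x):
    """Linear score: positive means 'stop sign' in this toy setup."""
    return sum(w * xi for w, xi in zip(weights, x))

def sign(v):
    return (v > 0) - (v < 0)

def perturb(weights, x, epsilon):
    """For a linear score w.x the gradient with respect to x is w,
    so stepping against sign(w) lowers the score as fast as possible."""
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]

weights = [0.8, -0.6, 0.4, -0.2]
x = [0.3, 0.1, 0.2, 0.4]
adversarial = perturb(weights, x, epsilon=0.1)

original_score = score(weights, x)
adversarial_score = score(weights, adversarial)
```

Each feature moves by at most 0.1, yet the score changes sign, so the predicted class flips even though the input looks almost unchanged.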

160
MCQhard

A streaming data pipeline ingests sensor data from IoT devices. The data arrives at irregular intervals and contains occasional spikes. Which data transformation is most appropriate for preparing this data for a time-series model?

A.Downsampling to a fixed frequency using mean aggregation
B.Removing all rows with values outside 3 standard deviations
C.Using a sliding window to compute moving averages
D.Padding missing timestamps with zeros
AnswerA

Mean aggregation over fixed intervals handles irregular timing and reduces noise.

Why this answer

Downsampling to a fixed frequency using mean aggregation handles irregular intervals and smooths spikes. Removing spikes may lose valid anomalies, padding with zeros introduces bias, and moving averages are a smoothing technique but not resampling.
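Downsampling with mean aggregation can be sketched as grouping readings into fixed-width time buckets and averaging each bucket. The timestamps (in seconds) and sensor values below are hypothetical.

```python
from collections import defaultdict

def downsample_mean(readings, bucket_seconds):
    """Average (timestamp, value) readings into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        start = (timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Irregular arrival times, with one spike (50.0) in the first minute.
readings = [(1, 10.0), (7, 12.0), (58, 50.0), (61, 11.0), (95, 13.0)]
resampled = downsample_mean(readings, bucket_seconds=60)
```

The spike is averaged with its neighbors rather than deleted. Note that this sketch emits no entry for a bucket that received no readings, so gaps would still need separate handling.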

161
MCQeasy

During feature engineering, a data scientist creates a new feature that is a linear combination of two existing features. What risk does this pose to the model?

A.Multicollinearity
B.Data leakage
C.Overfitting
D.Underfitting
AnswerA

Multicollinearity occurs when features are highly correlated, causing unstable estimates and inflated variances.

Why this answer

Creating a new feature as a linear combination of two existing features introduces perfect multicollinearity, where the new feature is an exact linear function of the original ones. This violates the assumption of no perfect multicollinearity in linear models, causing the design matrix to become singular and making coefficient estimates unstable or impossible to compute. Even in non-linear models, high multicollinearity can inflate variance and reduce interpretability.

Exam trap

CompTIA often tests the distinction between multicollinearity and overfitting, trapping candidates who confuse feature redundancy with model complexity.

How to eliminate wrong answers

Option B is wrong because data leakage refers to using information from outside the training set (e.g., future data or target leakage), not to relationships among features within the training data. Option C is wrong because overfitting is caused by a model learning noise or overly complex patterns, not by linear dependencies between features; multicollinearity primarily affects coefficient stability, not generalization error directly. Option D is wrong because underfitting occurs when a model is too simple to capture underlying patterns, whereas multicollinearity is a data structure issue that can actually increase model complexity without improving fit.
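The singular design matrix mentioned above can be checked directly: when one feature is an exact linear combination of two others, the determinant of X^T X is zero. The sketch uses small integer columns (hypothetical values) so the arithmetic is exact.

```python
def gram(columns):
    """Return X^T X for a list of feature columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

x1 = [1, 2, 3, 4]
x2 = [2, 1, 0, 5]
x3 = [2 * a + 3 * b for a, b in zip(x1, x2)]  # exact linear combination
independent = [1, 0, 0, 1]                    # not a combination of x1 and x2

det_dependent = det3(gram([x1, x2, x3]))
det_independent = det3(gram([x1, x2, independent]))
```

A zero determinant means X^T X cannot be inverted, which is why ordinary least squares has no unique coefficient solution in that case.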

162
MCQmedium

A financial institution uses a machine learning model to approve personal loans. The model was trained on historical data that includes applicant age, income, credit score, and loan amount. Compliance officers have received customer complaints suggesting the model may be discriminating against applicants over 60 years old. Initial analysis shows that the approval rate for applicants over 60 is 20 percentage points lower than for younger applicants with similar credit profiles. The data science team has been asked to investigate and remediate any bias. They have access to the training data, model coefficients, and can retrain or modify the model. What is the FIRST step the team should take?

A.Replace the model with a third-party vendor model that claims to be bias-free.
B.Re-sample the training data to have equal numbers of applicants over and under 60.
C.Conduct a fairness audit using appropriate metrics such as disparate impact ratio on the current model.
D.Remove the age feature from the training data and retrain the model.
AnswerC

An audit quantifies bias and provides a baseline to measure remediation effectiveness.

Why this answer

Option C is correct because the first step in addressing potential bias is to conduct a fairness audit using established metrics like the disparate impact ratio (e.g., the 80% rule from the US Equal Employment Opportunity Commission). This quantifies whether the model's approval rate for applicants over 60 is less than 80% of the rate for the younger group, providing a legally and technically sound baseline before any remediation. Without this measurement, any subsequent changes (like resampling or removing features) could be misguided or ineffective.

Exam trap

CompTIA often tests the misconception that removing a protected attribute (like age) is sufficient to eliminate bias, when in fact proxy features can perpetuate discrimination, making a fairness audit the mandatory first step.

How to eliminate wrong answers

Option A is wrong because replacing the model with a third-party vendor model that claims to be bias-free does not address the specific bias found in the current system, and it bypasses the necessary diagnostic step of understanding the root cause; vendor claims are not a substitute for empirical validation. Option B is wrong because resampling the training data to have equal numbers of applicants over and under 60 does not guarantee fairness: it can introduce sampling bias, distort the real-world distribution, and may not correct the underlying model behavior that causes disparate impact. Option D is wrong because simply removing the age feature from the training data and retraining the model is a naive approach; age may be correlated with other features (e.g., income, credit score), so the model could still indirectly discriminate through proxy variables, a phenomenon known as proxy discrimination or 'redundant encoding'.
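The disparate impact ratio is the approval rate of the protected group divided by the approval rate of the reference group, with values below 0.8 flagged under the 80% rule. The decision lists below are hypothetical and chosen to reproduce a 20 percentage point gap.

```python
def approval_rate(decisions):
    """Share of applicants approved (1 = approved, 0 = denied)."""
    return sum(decisions) / len(decisions)

def disparate_impact_ratio(protected, reference):
    """Protected-group approval rate divided by reference-group approval rate."""
    return approval_rate(protected) / approval_rate(reference)

over_60 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]   # 40% approved
under_60 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # 60% approved

ratio = disparate_impact_ratio(over_60, under_60)
flagged = ratio < 0.8  # 80% rule threshold
```

A ratio of about 0.67 falls below 0.8, giving the team a measured baseline to compare against after any remediation.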

163
MCQmedium

An operations team sees the log entries above for a production ML model. What is the MOST likely root cause of the latency spike?

A.A scheduled training job consuming GPU resources on the same node.
B.A memory leak in the model serving container causing gradual slowdown.
C.A network outage between the model server and the client.
D.A bug in the model's preprocessing code causing incorrect predictions.
AnswerB

Memory leak can cause garbage collection overhead and increased latency.

Why this answer

The log entries show a gradual increase in latency over time, which is characteristic of a memory leak in the model serving container. As memory consumption grows, garbage collection pauses become more frequent and longer, eventually causing request processing to slow down. This pattern is distinct from a sudden spike caused by resource contention or network issues.

Exam trap

CompTIA often tests the distinction between gradual vs. sudden performance degradation patterns, where candidates mistakenly attribute a gradual latency increase to a transient resource contention event like a training job or network issue.

How to eliminate wrong answers

Option A is wrong because a scheduled training job consuming GPU resources would cause a sudden, sharp latency spike at the start of training, not a gradual increase over time. Option C is wrong because a network outage would result in complete request failures or timeouts, not a progressive latency degradation. Option D is wrong because a bug in preprocessing code causing incorrect predictions would affect prediction accuracy, not the latency of the serving endpoint.

164
MCQmedium

A data scientist fine-tunes a large language model for a legal document summarization task. After fine-tuning, the model performs well on test data but produces summaries that include hallucinated legal clauses. Which mitigation strategy is most effective?

A.Use a different tokenizer during fine-tuning.
B.Decrease the temperature parameter to 0.1 during inference.
C.Implement retrieval-augmented generation (RAG) to provide factual context.
D.Set a maximum token limit of 50 for each summary.
AnswerC

RAG fetches relevant documents to condition the generation, reducing reliance on parametric memory.

Why this answer

Option C is correct because RAG grounds the model in retrieved documents, reducing hallucinations. Option B is wrong because lowering the temperature makes output more deterministic but does not eliminate hallucinations. Option D is wrong because a token limit reduces output length but may not prevent false content.

Option A is wrong because changing the tokenizer does not address the model's tendency to fabricate.

165
MCQhard

A deep learning model for natural language processing uses a recurrent neural network (RNN) to process long sequences. The gradients vanish after many time steps. Which architectural change is most effective to mitigate this problem?

A.Add dropout regularization
B.Use a larger learning rate
C.Replace the RNN cells with Long Short-Term Memory (LSTM) units
D.Increase the number of hidden layers
AnswerC

LSTM's gating structure preserves gradients over long sequences.

Why this answer

Option C is correct because LSTMs have gating mechanisms that allow gradients to flow across more time steps, mitigating vanishing gradients. Option D is incorrect because adding more layers can exacerbate vanishing gradients. Option B is incorrect because a larger learning rate may cause instability.

Option A is incorrect because dropout addresses overfitting, not vanishing gradients.
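The vanishing effect comes from multiplying many per-step factors smaller than 1 during backpropagation through time. The arithmetic below is only an illustration: the factors 0.5 and 0.99 are assumed values standing in for a plain RNN step and for an LSTM cell-state path with a forget gate close to 1.

```python
def gradient_scale(per_step_factor, steps):
    """Size of a gradient after being multiplied by the same factor at every step."""
    return per_step_factor ** steps

plain_rnn = gradient_scale(0.5, 50)   # shrinks to roughly 1e-15
lstm_like = gradient_scale(0.99, 50)  # stays around 0.6
```

After 50 steps the first gradient is effectively zero while the second remains usable, which is the intuition behind preferring gated cells for long sequences.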

166
MCQmedium

An AI system is being designed to automatically detect fraudulent transactions in real-time. The system must have low latency and high precision to minimize false alarms. Which algorithm is most appropriate?

A.Logistic regression
B.Convolutional neural network
C.Deep reinforcement learning
D.Random forest
AnswerD

Random forest provides high accuracy and precision with low inference latency, making it ideal for real-time fraud detection.

Why this answer

Random forest is the most appropriate algorithm because it handles high-dimensional transaction data, provides feature importance for interpretability, and achieves high precision with low latency through ensemble decision trees. Its parallelizable structure allows real-time scoring, and it naturally balances precision and recall without the computational overhead of deep learning.

Exam trap

CompTIA often tests the misconception that deep learning (CNNs or reinforcement learning) is always superior for complex tasks, but here the key constraints are low latency and high precision on tabular data, where ensemble methods like random forest outperform deep models.

How to eliminate wrong answers

Option A is wrong because logistic regression assumes linear decision boundaries and cannot capture complex non-linear patterns in transaction data, leading to lower precision. Option B is wrong because convolutional neural networks are designed for spatial data like images, not tabular transaction features, and introduce unnecessary latency and computational cost for real-time fraud detection. Option C is wrong because deep reinforcement learning is used for sequential decision-making in dynamic environments (e.g., game playing, robotics), not for static classification tasks like fraud detection, and its training instability and high latency make it unsuitable for real-time scoring.

167
MCQhard

A financial institution is building a fraud detection system using a supervised learning model. The dataset is highly imbalanced with 99.9% legitimate transactions and 0.1% fraudulent ones. Which approach would be MOST effective to train the model to detect fraud?

A.Train the model using accuracy as the performance metric
B.Undersample the legitimate transactions to match the number of fraudulent ones
C.Use SMOTE to generate synthetic fraudulent transactions
D.Increase the regularization strength in the model
AnswerC

SMOTE creates synthetic samples of the minority class, effectively balancing the dataset without losing data.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) is the most effective approach because it generates synthetic fraudulent transactions by interpolating between existing minority class samples, thereby balancing the dataset without losing information. This allows the model to learn decision boundaries for fraud detection more effectively than simple undersampling or metric adjustments, especially given the extreme 99.9% vs 0.1% imbalance.

Exam trap

CompTIA often tests the misconception that simply changing the performance metric (like using F1-score or precision-recall) alone is sufficient to handle imbalance, but the trap here is that without addressing the data distribution itself, the model still lacks sufficient fraudulent examples to learn meaningful patterns.

How to eliminate wrong answers

Option A is wrong because accuracy is a misleading metric for highly imbalanced datasets; a model that predicts all transactions as legitimate would achieve 99.9% accuracy but detect zero fraud. Option B is wrong because undersampling the majority class to match the number of fraudulent transactions would discard roughly 99.9% of legitimate transactions (about 99.8% of the full dataset), causing severe information loss and poor generalization to real-world data. Option D is wrong because increasing regularization strength reduces model complexity to prevent overfitting, but it does not address the class imbalance; the model would still be biased toward the majority class and fail to learn fraud patterns.
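The interpolation idea behind SMOTE can be sketched as placing each synthetic point on the line segment between two existing minority samples. This is a simplification: SMOTE proper interpolates toward one of a sample's k nearest minority neighbors, whereas this sketch pairs samples at random. The feature vectors are hypothetical.

```python
import random

def synthesize(minority, n_new, seed=0):
    """Create n_new points, each on the segment between two minority samples.
    Simplified: pairs are chosen at random rather than by nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

fraud = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]
synthetic = synthesize(fraud, n_new=5)
```

Every synthetic point lies within the range spanned by the real fraud samples, so the technique adds minority examples without copying existing rows verbatim.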

168
MCQeasy

Which principle ensures that AI decisions can be traced back and understood by humans?

A.Transparency
B.Privacy
C.Robustness
D.Accountability
AnswerA

Transparency ensures that AI processes are open and understandable.

Why this answer

Transparency refers to the ability to explain and understand how an AI system reached a decision.

169
Multi-Selecthard

A company is deploying a machine learning model that predicts customer churn. The model currently has high variance. Which THREE actions should the data scientist take to reduce variance? (Select THREE.)

Select 3 answers
A.Reduce model complexity (e.g., fewer features, simpler model).
B.Use regularization.
C.Add more training data.
D.Remove outliers from the training data.
E.Increase model complexity.
AnswersA, B, C

Simpler models have lower variance.

Why this answer

Options A, B, and C are correct. Reducing model complexity (A) (e.g., fewer features, simpler model) directly limits variance. Adding more training data (C) helps the model learn a more general pattern, reducing variance.

Regularization (B) penalizes large weights, controlling model complexity. Option E (increasing complexity) would increase variance. Option D (removing outliers) can sometimes reduce variance but is not a standard or primary technique, and its effect is less reliable.

170
MCQmedium

A hospital deploys an AI system to detect pneumonia from chest X-rays. The model achieves 95% accuracy on the test set but later is found to be less accurate for patients under 18. The development team suspects bias. Which step should be taken first to investigate?

A.Automatically retrain the model with a balanced dataset including more pediatric cases.
B.Expand the test set with more pediatric X-rays and re-evaluate overall accuracy.
C.Compute and compare performance metrics for different age subgroups in the test set.
D.Add more features to the model to capture age-related anatomical differences.
AnswerC

Subgroup analysis is the standard first step in fairness auditing.

Why this answer

Option C is correct because the first step in investigating suspected model bias is to perform a disaggregated analysis of performance metrics across relevant subgroups, such as age brackets. This directly identifies whether the model's accuracy, precision, recall, or other metrics differ significantly for pediatric patients versus adults, confirming the presence and nature of the bias before any remediation is attempted.

Exam trap

CompTIA often tests the principle that aggregate metrics like overall accuracy can be misleading, and the trap here is that candidates jump to a solution (retraining or adding features) before performing the necessary diagnostic step of subgroup performance analysis.

How to eliminate wrong answers

Option A is wrong because automatically retraining the model with a balanced dataset without first understanding the root cause of the bias could introduce new biases or fail to address the specific issue, and it skips the critical diagnostic step of measuring subgroup performance. Option B is wrong because expanding the test set with more pediatric X-rays and re-evaluating overall accuracy would dilute the subgroup signal into a single aggregate metric, masking the disparity rather than revealing it. Option D is wrong because adding more features to the model without first analyzing the existing bias is a premature intervention; it assumes the bias stems from missing features rather than from imbalanced training data or model behavior, and it could increase complexity without solving the underlying problem.
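Disaggregated evaluation amounts to computing the same metric per subgroup instead of once overall. The sketch below uses accuracy and a small set of hypothetical (group, true label, predicted label) records.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    return {g: hits[g] / totals[g] for g in totals}

records = [
    ("adult", 1, 1), ("adult", 0, 0), ("adult", 1, 1), ("adult", 0, 0),
    ("adult", 1, 1), ("adult", 0, 0), ("adult", 1, 1), ("adult", 0, 1),
    ("under_18", 1, 0), ("under_18", 0, 0),
]

overall = sum(t == p for _, t, p in records) / len(records)
by_group = accuracy_by_group(records)
```

Overall accuracy is 0.8, but the split shows 0.875 for adults and 0.5 for patients under 18, a gap the aggregate number hides.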

171
MCQmedium

A healthcare organization deploys an AI system to analyze medical images and detect anomalies. During a routine audit, the security team discovers that the AI model occasionally returns results that include data from patients who have opted out of data sharing. Which security control should be implemented to prevent this violation?

A.Apply data anonymization techniques to the training dataset.
B.Implement role-based access control (RBAC) on the AI model's inference API.
C.Use differential privacy during model training.
D.Encrypt the training data at rest and in transit.
AnswerA

Anonymization removes personally identifiable information, ensuring that the model cannot output data linked to specific patients.

Why this answer

Option A is correct because data anonymization ensures that patient identities are removed from training data, preventing re-identification of opt-out patients. Option B is incorrect because access control does not address data already in the model. Option D is incorrect because encryption protects data in transit and at rest but does not prevent data leakage from model outputs.

Option C is incorrect because differential privacy adds noise during training but does not directly remove specific patient data from model results.

172
MCQmedium

A data scientist is training a deep neural network for sentiment analysis. The training loss decreases steadily but the validation loss starts to increase after 10 epochs. What is the most likely cause and best corrective action?

A.Underfitting; increase model complexity
B.Vanishing gradients; use ReLU activation
C.Data leakage; shuffle data before splitting
D.Overfitting; apply dropout and early stopping
AnswerD

Validation loss increasing while training loss decreases is classic overfitting; dropout regularizes and early stopping halts training.

Why this answer

Option A (Underfitting) would show high training loss, not decreasing training loss. Option B (Vanishing gradients) would cause training loss to plateau slowly. Option C (Data leakage) often shows suspiciously high performance.

Option D correctly identifies overfitting and suggests dropout and early stopping.
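Early stopping can be sketched as tracking the best validation loss and halting once it has failed to improve for a set number of epochs. The patience value and the loss curve below are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch with the best validation loss, stopping the scan
    once the loss has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Validation loss falls, then rises as the model starts to overfit.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.58, 0.63, 0.70]
best = early_stopping_epoch(val_losses)
```

Training would stop after the second non-improving epoch and the weights from epoch 3 (the lowest validation loss) would be kept.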

173
MCQeasy

During an AI model deployment, the operations team notices that inference requests are taking longer than expected. Which component is most likely causing the bottleneck?

A.Input data preprocessing pipeline
B.API gateway rate limiting
C.Database connection pool size
D.The machine learning model's size and architecture
AnswerD

Larger models take longer to compute predictions.

Why this answer

The machine learning model's size and architecture directly determine the computational complexity of inference. Larger models with more parameters or deeper architectures require more matrix multiplications and memory bandwidth, which increases latency per request. This is the most common bottleneck in AI deployment because the model itself is the core computation unit, and its inference time scales with its complexity.

Exam trap

CompTIA often tests the misconception that operational components like API gateways or databases are the primary cause of slow inference, when in fact the model's computational demand is the root cause, especially in scenarios where preprocessing and postprocessing are negligible.

How to eliminate wrong answers

Option A is wrong because input data preprocessing typically involves lightweight operations like normalization or tokenization, which are orders of magnitude faster than model inference and rarely the primary bottleneck unless the pipeline is poorly optimized. Option B is wrong because API gateway rate limiting controls the number of requests per second, not the latency of individual inference requests; it would cause throttling errors, not slow responses. Option C is wrong because database connection pool size affects the ability to fetch or store data concurrently, but inference latency is dominated by model computation, not database lookups, unless the model relies on external data retrieval per request.

174
MCQeasy

An organization is deploying an AI model on edge devices with limited computational resources. Which model optimization technique is most appropriate?

A.Perform additional feature engineering
B.Apply model quantization
C.Use an ensemble of models
D.Increase the training dataset size
AnswerB

Quantization reduces precision, making models smaller and faster.

Why this answer

Model quantization reduces the precision of the model's weights and activations (e.g., from 32-bit floating point to 8-bit integer), which significantly decreases memory footprint and computational requirements. This makes it ideal for deployment on edge devices with limited resources, as it enables faster inference with minimal accuracy loss.

Exam trap

CompTIA often tests the misconception that improving model performance (e.g., via feature engineering or more data) is equivalent to optimizing for deployment constraints, when in fact techniques like quantization directly address resource limitations.

How to eliminate wrong answers

Option A is wrong because feature engineering improves model input quality but does not reduce the computational load or model size required for inference on edge devices. Option C is wrong because using an ensemble of models increases the total number of parameters and inference time, which is counterproductive for resource-constrained edge devices. Option D is wrong because increasing the training dataset size improves model generalization but does not reduce the model's computational requirements during inference; it may even increase training time and model complexity.

175
MCQeasy

During model training, the data science team discovers that many input features contain missing values. Which step should be taken to improve data quality?

A.Implement data validation checks to handle missing data appropriately (e.g., imputation).
B.Increase the model complexity to handle missing data.
C.Ignore missing values and train the model.
D.Remove all records with missing values.
AnswerA

This ensures data quality without losing valuable information.

Why this answer

Option A is correct because data validation checks, such as imputation (e.g., mean, median, or KNN imputation), directly address missing values by estimating plausible replacements based on the available data. This improves data quality and prevents bias or loss of information that could degrade model performance. In the context of AI implementation, handling missing data is a fundamental data preprocessing step to ensure robust model training.
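
A minimal sketch of median imputation, assuming missing values are represented as `None` and using a hypothetical `ages` column. It also reports the missing fraction, since the share of missing values usually informs which strategy is reasonable.

```python
from statistics import median


def missing_fraction(column):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def impute_median(column):
    """Replace missing entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in column]


# Hypothetical feature column with two missing entries.
ages = [34, None, 29, 41, None, 38]
share_missing = missing_fraction(ages)
imputed = impute_median(ages)
```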

Exam trap

CompTIA often tests the misconception that 'ignoring missing data' or 'removing rows' is acceptable, when in fact proper data validation and imputation are required to maintain data integrity and model validity.

How to eliminate wrong answers

Option B is wrong because increasing model complexity (e.g., adding more layers or parameters) does not inherently handle missing data; it may overfit to noise or propagate errors from incomplete features. Option C is wrong because ignoring missing values can cause algorithms (e.g., linear regression, SVM) to fail during training or produce biased coefficients, as many implementations do not natively support NaN inputs. Option D is wrong because removing all records with missing values can lead to significant data loss, reduce sample size, and introduce selection bias, especially when the data are not missing completely at random (MCAR).

176
Multi-Selectmedium

Which TWO of the following are effective techniques for detecting bias in an AI model?

Select 2 answers
A.Fairness metrics such as equal opportunity difference
B.Feature importance scores
C.Confusion matrix on the entire dataset
D.Cross-validation accuracy
E.Disparate impact analysis
AnswersA, E

Quantifies specific fairness criteria.

Why this answer

Options A and E are correct because fairness metrics such as equal opportunity difference quantify bias directly, and disparate impact analysis measures differences in outcomes across groups. Option B is wrong because feature importance scores may not directly reveal bias. Option C is wrong because a confusion matrix computed on the entire dataset hides differences between groups; per-group confusion matrices can reveal bias but are less direct than metrics designed for fairness.

Option D is wrong because cross-validation accuracy assesses generalization, not fairness.
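
The two measures named in this question can be sketched as follows. The records, the group labels "a" and "b", and the helper name `group_rates` are hypothetical. The definitions used here (ratio of selection rates for disparate impact, difference in true positive rates for equal opportunity) are the common ones, but libraries differ in sign and ordering conventions.

```python
def group_rates(records, group):
    """Return (selection rate, true positive rate) for one group.

    records: iterable of (group, y_true, y_pred) with binary labels.
    """
    rows = [(y, p) for g, y, p in records if g == group]
    selection_rate = sum(p for _, p in rows) / len(rows)
    preds_for_positives = [p for y, p in rows if y == 1]
    tpr = sum(preds_for_positives) / len(preds_for_positives)
    return selection_rate, tpr


# Hypothetical predictions for a reference group "a" and a comparison group "b".
records = [
    ("a", 1, 1), ("a", 1, 1), ("a", 0, 1), ("a", 0, 0), ("a", 1, 0),
    ("b", 1, 1), ("b", 1, 0), ("b", 0, 0), ("b", 0, 0), ("b", 1, 0),
]

sel_a, tpr_a = group_rates(records, "a")
sel_b, tpr_b = group_rates(records, "b")

disparate_impact = sel_b / sel_a          # ratio of selection rates
equal_opportunity_diff = tpr_b - tpr_a    # difference in true positive rates
```

A disparate impact ratio well below 1 (a commonly cited rule of thumb flags values under 0.8) or an equal opportunity difference far from 0 suggests the groups are treated differently.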

177
MCQeasy

A bank deploys an AI system to approve loan applications. During testing, the model denies a disproportionate number of applicants from a particular demographic group, even after controlling for credit history. Which ethical principle is being violated?

A.Transparency
B.Privacy
C.Accountability
D.Fairness
AnswerD

Fairness requires equal treatment across demographic groups; the observed disparity indicates bias.

Why this answer

Option D is correct because fairness requires that AI systems do not discriminate against protected groups. Option A is wrong because transparency is about explainability, not bias. Option C is wrong because accountability refers to responsibility for outcomes, not the specific bias issue.

Option B is wrong because privacy concerns data protection, not discrimination.

178
MCQmedium

A credit union uses an AI model to approve personal loans. The model was trained on historical data from the past five years. A recent internal review shows that the model approves loans predominantly for white applicants compared to other ethnicities, even when income and credit scores are similar. The credit union wants to comply with fair lending laws without significantly reducing overall approval rates. The data science team has access to the training data. What is the most appropriate remediation step?

A.Apply a fairness constraint that penalizes the model for disparate impact
B.Discontinue the AI model and use manual approval for all loans
C.Resample the training data to ensure balanced representation of ethnicities
D.Adjust the approval threshold so that approval rates are equal across ethnic groups
AnswerC

Resampling addresses the root cause by balancing training data.

Why this answer

Option C is correct because resampling (oversampling minority groups or undersampling the majority group) can balance representation in the training data and reduce bias. Option D is wrong because equalizing approval rates without addressing the bias in the data may not be sustainable. Option B is wrong because abandoning the model is not practical.

Option A is wrong because penalizing disparate impact during training may not correct complex biased patterns that stem from the data itself.

179
Multi-Selectmedium

Which TWO techniques are commonly used to prevent overfitting in deep neural networks?

Select 2 answers
A.Using a larger learning rate
B.Dropout
C.L1 regularization
D.Early stopping
E.Increasing the number of layers
AnswersB, D

Dropout randomly drops neurons during training, reducing overfitting.

Why this answer

Dropout is a regularization technique that randomly drops a fraction of neurons during training, which prevents the network from relying too heavily on any single neuron and forces it to learn more robust features. This reduces overfitting by introducing noise and effectively training an ensemble of sub-networks.
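
A minimal sketch of the mechanism described above, using the "inverted dropout" convention in which surviving activations are rescaled during training so that nothing changes at inference time. The function signature and the sample activations are assumptions for illustration, not any framework's API.

```python
import random


def dropout(activations, p_drop, training=True, rng=random):
    """Zero each activation with probability p_drop; rescale the survivors."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training:
        # At inference time the layer passes values through unchanged.
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical hidden-layer activations.
hidden = [0.2, 1.5, 0.7, 0.9]
train_out = dropout(hidden, p_drop=0.5, rng=random.Random(0))
eval_out = dropout(hidden, p_drop=0.5, training=False)
```

Because a different random subset of neurons is silenced on every training step, no single neuron can be relied on exclusively.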

Exam trap

CompTIA often tests the distinction between regularization techniques that reduce overfitting (like dropout and early stopping) versus hyperparameters or architectural changes that increase model capacity (like larger learning rates or more layers), which candidates mistakenly think help with overfitting.

180
Multi-Selectmedium

Which TWO statements correctly describe the difference between supervised and unsupervised learning?

Select 2 answers
A.Supervised learning is only used for classification
B.Unsupervised learning always requires a target variable
C.Supervised learning requires labeled data
D.Supervised learning is a subset of reinforcement learning
E.Unsupervised learning discovers hidden patterns
AnswersC, E

Labels are required for supervised tasks.

Why this answer

Option C is correct because supervised learning relies on labeled datasets where each training example is paired with an output label, enabling the model to learn a mapping from inputs to outputs. This is a fundamental distinction from unsupervised learning, which works with unlabeled data to find inherent structures or patterns.

Exam trap

CompTIA often tests the misconception that supervised learning is synonymous with classification, ignoring regression, or that unsupervised learning requires a target variable, which is a direct contradiction of its definition.

181
MCQhard

A healthcare company is developing a predictive model to identify patients at risk of readmission within 30 days. The data engineering team has built a pipeline that collects data from multiple sources, including electronic health records (EHR), lab results, and wearable device data. During initial testing, the model's performance is poor, with high false positives. Upon investigation, the team discovers that the data contains significant temporal misalignment: lab results are timestamped when ordered, not when collected; wearable data is aggregated hourly; and EHR data has inconsistent update frequencies. The data pipeline currently joins all features on the patient ID without aligning timestamps. The data volume is large, and processing time is a concern. Which action should the data engineering team take to most effectively address the issue and improve model performance?

A.Discard all records where timestamps do not match exactly across sources, and only use records with perfect alignment.
B.Implement a window-based feature aggregation (e.g., 6-hour windows) and align all features to the same time windows before joining.
C.Leave the pipeline unchanged and instead adjust the model's classification threshold to reduce false positives.
D.Use a data imputation algorithm to fill in missing timestamps and then join on the nearest timestamp.
AnswerB

This creates consistent timestamps and reduces noise through aggregation, effectively addressing misalignment.

Why this answer

Implementing a window-based feature aggregation with consistent time windows (e.g., 6-hour or 12-hour) and aligning all data to those windows before joining ensures temporal consistency and reduces noise. This approach addresses the root cause of misalignment while managing data volume through aggregation. Discarding records whose timestamps do not match exactly (Option A) loses valuable information.

Imputing timestamps and joining on the nearest one (Option D) may introduce unrealistic values for irregularly sampled data. Leaving the pipeline unchanged and adjusting the classification threshold (Option C) does not fix the data quality issue.
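
The window-based approach can be sketched in a few lines. The sample readings, the patient ID, and the helper names are hypothetical; the sketch assumes the window length divides 24 hours evenly and that a simple mean is an acceptable aggregate.

```python
from collections import defaultdict
from datetime import datetime

WINDOW_HOURS = 6  # assumed window length; must divide 24 evenly


def window_start(ts, hours=WINDOW_HOURS):
    """Floor a timestamp to the start of its window."""
    return ts.replace(hour=(ts.hour // hours) * hours,
                      minute=0, second=0, microsecond=0)


def aggregate(readings, hours=WINDOW_HOURS):
    """Average (patient_id, timestamp, value) readings per patient and window."""
    buckets = defaultdict(list)
    for patient_id, ts, value in readings:
        buckets[(patient_id, window_start(ts, hours))].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def join_on_window(*tables):
    """Join aggregated tables on (patient_id, window); missing cells are None."""
    keys = sorted(set().union(*tables))
    return {key: tuple(t.get(key) for t in tables) for key in keys}


# Hypothetical readings from two sources with different sampling patterns.
heart_rate = [("p1", datetime(2024, 1, 1, 1, 0), 70),
              ("p1", datetime(2024, 1, 1, 4, 30), 80),
              ("p1", datetime(2024, 1, 1, 7, 0), 90)]
labs = [("p1", datetime(2024, 1, 1, 5, 15), 1.2)]

joined = join_on_window(aggregate(heart_rate), aggregate(labs))
```

Each source is reduced to one value per patient per window before the join, so the join key carries time as well as patient ID.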

182
MCQhard

Refer to the exhibit. A security engineer is reviewing an AI access control policy. Which of the following is the most significant security weakness in this policy?

A.The policy allows access from a wide private IP range
B.The policy does not require multi-factor authentication
C.The policy grants 'audit_log' access to data scientists
D.The policy does not limit the number of inference requests
AnswerD

Unlimited inference invites model extraction or denial-of-wallet attacks.

Why this answer

Option D is correct because the policy places no restrictions on the number of inference requests, which allows potential model extraction attacks. Option C is wrong because auditors need audit_log access for compliance. Option A is wrong because the policy correctly restricts access to corporate IP ranges.

Option B is wrong because MFA is required, which is a strong control.

183
MCQeasy

Refer to the exhibit. The training log shows losses and accuracies over 5 epochs. What is the most likely problem?

A.Data leakage
B.Overfitting
C.Underfitting
D.Vanishing gradient
AnswerB

Overfitting is indicated by decreasing training loss and increasing validation loss.

Why this answer

Option B is correct because training loss decreases while validation loss increases, a classic sign of overfitting. Option C is incorrect because underfitting would show high losses on both sets. Option D is incorrect because a vanishing gradient affects how training loss progresses, not the divergence between training and validation loss.

Option A is incorrect because data leakage typically causes both sets to perform well.

184
MCQmedium

Refer to the exhibit. An auditor reports that the model's fairness check was bypassed in a recent deployment. Based on the policy, what is the most likely cause?

A.The auditor role lacks 'evaluate' permission
B.Data scientist role has deploy permission, allowing deployment without fairness validation
C.Fairness check threshold is set to 0.8, which is too low
D.External_user role can perform inference, which triggers unfair predictions
AnswerB

The deploy permission may bypass the fairness check if not enforced.

Why this answer

Option B (Data scientist role has deploy permission, allowing deployment without fairness validation) is correct. The policy shows fairness_check required, but if the deployment process does not enforce it, the data scientist could bypass it. Option A (Auditor lacks evaluate) is unrelated to deployment.

Option C (Fairness threshold low) does not cause bypass. Option D (External user inference) is unrelated.

185
MCQhard

An organization uses a batch prediction pipeline that processes daily customer data to generate marketing recommendations. One month after deployment, the model's performance degrades significantly. The data pipeline logs show that the input data schema has changed — a new categorical feature 'customer_segment' has been added, and the existing feature 'age_group' is now missing. Which step should the operations team take first?

A.Retrain the model using the new schema and redeploy
B.Update the data preprocessing pipeline to handle missing features and add the new feature
C.Revert to the previous week's model version that was performing well
D.Contact the data engineering team to revert the schema change
AnswerB

This adapts the pipeline to the new schema, enabling proper feeding to the model.

Why this answer

Option B is correct because the immediate priority is to ensure the data preprocessing pipeline can handle the schema change without breaking. The pipeline must gracefully handle the missing 'age_group' feature (e.g., by imputing or dropping it) and incorporate the new 'customer_segment' feature before any model retraining or rollback. This prevents the schema mismatch from causing inference errors and maintains pipeline stability.
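
One way a preprocessing step might detect such a change before data reaches the model is a simple schema comparison. This is a sketch under assumptions: `check_schema` is a hypothetical helper, and the columns `customer_id` and `total_spend` are invented for illustration (only 'age_group' and 'customer_segment' come from the question).

```python
def check_schema(expected_columns, incoming_columns):
    """Report columns the model expects but did not receive, and vice versa."""
    expected, incoming = set(expected_columns), set(incoming_columns)
    return {
        "missing": sorted(expected - incoming),
        "unexpected": sorted(incoming - expected),
    }


# Schema the model was trained on versus the schema arriving in production.
expected = ["customer_id", "age_group", "total_spend"]
incoming = ["customer_id", "customer_segment", "total_spend"]

report = check_schema(expected, incoming)
```

A non-empty report can then trigger whatever handling the team chooses, such as imputing a default for a missing column or routing the batch for review.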

Exam trap

CompTIA often tests the misconception that retraining the model (Option A) is the first step to fix performance degradation, but the trap here is that the root cause is a schema mismatch in the preprocessing layer, not the model weights.

How to eliminate wrong answers

Option A is wrong because retraining the model without first fixing the preprocessing pipeline would still fail due to missing or misaligned features, and it assumes the new schema is already compatible. Option C is wrong because reverting to a previous model version does not address the root cause — the input data schema has changed, so the old model would still receive malformed data and produce incorrect predictions. Option D is wrong because contacting the data engineering team to revert the schema change is a reactive, non-technical workaround that ignores the need for the operations team to adapt the pipeline to handle schema evolution autonomously.

186
MCQmedium

An AI model for detecting fraudulent transactions has high precision but low recall. Which business impact is most likely?

A.The model has no impact on fraud detection
B.The model detects all fraudulent transactions
C.Many fraudulent transactions go undetected
D.Many legitimate transactions are flagged as fraud
AnswerC

Low recall indicates a high number of false negatives.

Why this answer

High precision means that when the model flags a transaction as fraudulent, it is very likely correct. However, low recall indicates that the model misses a significant proportion of actual fraudulent transactions. Therefore, the most likely business impact is that many fraudulent transactions go undetected, leading to financial losses.
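
A small worked example makes the scenario concrete. The labels below are hypothetical: five fraudulent transactions (1) and five legitimate ones (0), with a model that flags only one transaction.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
```

Here the single flag is correct, so precision is 1.0, but four of the five fraud cases are missed, so recall is only 0.2.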

Exam trap

CompTIA often tests the distinction between precision and recall by presenting a scenario where candidates confuse high precision with high recall, leading them to incorrectly select option D (many legitimate transactions flagged) instead of recognizing that low recall causes undetected fraud.

How to eliminate wrong answers

Option A is wrong because a model with high precision and low recall does have a significant impact—it fails to catch many fraud cases, which directly affects business outcomes. Option B is wrong because low recall means the model does not detect all fraudulent transactions; it misses many, contradicting the claim of detecting all fraud. Option D is wrong because high precision implies few false positives, so legitimate transactions are rarely flagged as fraud; that scenario would correspond to low precision, not high precision.

187
MCQeasy

A data engineer needs to combine two datasets, each with unique customer_id, to include all records from both datasets. Which join type should be used?

A.FULL OUTER JOIN
B.RIGHT JOIN
C.LEFT JOIN
D.INNER JOIN
AnswerA

FULL OUTER JOIN includes all records from both tables, matching where possible and filling nulls elsewhere.

Why this answer

A FULL OUTER JOIN returns all records from both datasets, matching rows where the customer_id is present in both and filling in NULLs for missing matches. This is the only join type that guarantees every unique customer_id from either dataset appears in the result, which is exactly what the requirement specifies.
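
The behavior can be mimicked in plain Python to show which rows survive. The two datasets and the function name are hypothetical, and `None` stands in for SQL NULL.

```python
def full_outer_join(left, right):
    """Join two {customer_id: value} mappings, keeping every ID from both."""
    all_ids = sorted(set(left) | set(right))
    return [(cid, left.get(cid), right.get(cid)) for cid in all_ids]


# Customer 101 appears only on the left, 103 only on the right.
crm = {101: "Ana", 102: "Ben"}
billing = {102: 49.90, 103: 15.00}

rows = full_outer_join(crm, billing)
```

All three customer IDs appear in `rows`; an inner join would keep only 102, and a left or right join would drop 103 or 101 respectively.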

Exam trap

CompTIA often tests the misconception that LEFT JOIN or RIGHT JOIN can include all records from both datasets, but candidates forget that these asymmetric joins exclude non-matching rows from the opposite side.

How to eliminate wrong answers

Option B (RIGHT JOIN) is wrong because it returns only all rows from the right dataset and matching rows from the left, omitting any customer_id that exists only in the left dataset. Option C (LEFT JOIN) is wrong because it returns only all rows from the left dataset and matching rows from the right, omitting any customer_id that exists only in the right dataset. Option D (INNER JOIN) is wrong because it returns only rows where customer_id exists in both datasets, discarding all non-matching records from either side.

188
MCQeasy

A company deploys a deep learning model for real-time image classification. After deployment, they notice high inference latency exceeding the 100ms SLA. Which action would most likely reduce latency without significantly impacting accuracy?

A.Add more training data to improve model robustness
B.Replace the model with a simpler logistic regression model
C.Increase batch size for inference
D.Apply model quantization
AnswerD

Quantization reduces model size and inference time with minor accuracy impact.

Why this answer

Model quantization reduces the precision of the model's weights and activations (e.g., from 32-bit floating point to 8-bit integer), which significantly decreases memory bandwidth and computational requirements during inference. This directly lowers latency without fundamentally altering the model's learned representations, so accuracy degradation is typically minimal (often <1-2%).

Exam trap

CompTIA often tests the misconception that increasing batch size always improves latency, when in fact it increases per-request latency in real-time systems, and that simpler models are always better for latency, ignoring the critical accuracy requirement.

How to eliminate wrong answers

Option A is wrong because adding more training data improves model robustness and generalization but does not reduce inference latency; it may even increase training time and model complexity. Option B is wrong because replacing a deep learning model with a logistic regression model would drastically reduce accuracy for complex image classification tasks, failing the 'without significantly impacting accuracy' constraint. Option C is wrong because increasing batch size for inference increases the number of images processed per batch, which can improve throughput but actually increases per-request latency (time to first prediction) and may exceed the 100ms SLA for real-time applications.

189
Multi-Selecteasy

Which TWO data preprocessing techniques reduce the dimensionality of a dataset?

Select 2 answers
A.One-hot encoding
B.Imputation
C.Feature scaling
D.Principal Component Analysis (PCA)
E.Feature selection
AnswersD, E

PCA reduces dimensionality by projecting data onto principal components.

Why this answer

Options D and E are correct. Principal Component Analysis (PCA) transforms data to a lower-dimensional space, while feature selection picks a subset of the original features. Option C (feature scaling) does not reduce dimensions.

Option A (one-hot encoding) actually increases dimensions. Option B (imputation) handles missing values but does not reduce dimensions.

190
MCQmedium

A company uses an AI model to predict equipment failures. The model outputs a probability of failure. To minimize false alarms, the operations team wants a high precision. Which deployment strategy should they implement?

A.Retrain the model on more recent data
B.Increase the decision threshold for positive classification
C.Decrease the decision threshold
D.Use an ensemble of models with voting
AnswerB

Higher threshold means fewer positive predictions, increasing precision.

Why this answer

To minimize false alarms and achieve high precision, the operations team should increase the decision threshold for positive classification. A higher threshold means the model only predicts a failure when it is very confident, reducing the number of false positives (false alarms) at the cost of potentially missing some true failures (lower recall). This directly controls the precision-recall trade-off without changing the underlying model.
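
The trade-off can be seen by scoring the same predictions at two thresholds. The scores and labels below are hypothetical, and the function name is chosen only for this sketch.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting failure for scores >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical failure probabilities and true outcomes (1 = failure).
scores = [0.95, 0.90, 0.70, 0.65, 0.55, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 0, 1]

low_threshold = precision_recall_at(0.5, scores, labels)
high_threshold = precision_recall_at(0.8, scores, labels)
```

Raising the threshold from 0.5 to 0.8 lifts precision from 0.6 to 1.0 on this data while recall falls from 0.75 to 0.5.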

Exam trap

CompTIA often tests the precision-recall trade-off by making candidates confuse increasing the threshold (which improves precision) with decreasing it (which improves recall), or by suggesting retraining or ensemble methods as direct solutions for precision tuning.

How to eliminate wrong answers

Option A is wrong because retraining on more recent data improves model accuracy and relevance but does not directly control the precision-recall trade-off; it may not reduce false alarms if the model's calibration remains unchanged. Option C is wrong because decreasing the decision threshold would make the model more sensitive, increasing the number of positive predictions and thus increasing false alarms (lower precision), which is the opposite of the goal. Option D is wrong because using an ensemble of models with voting can improve overall accuracy and robustness, but it does not specifically target precision; the voting mechanism may still produce many false positives unless the threshold is also adjusted.

191
Multi-Selectmedium

Which THREE are common data preprocessing steps in a machine learning pipeline? (Choose 3)

Select 3 answers
A.Hyperparameter tuning
B.Encoding categorical variables
C.Model evaluation
D.Scaling numeric features
E.Handling missing values
AnswersB, D, E

Categorical data must be converted to numeric.

Why this answer

Encoding categorical variables is a common data preprocessing step because machine learning algorithms require numerical input. Techniques like one-hot encoding or label encoding convert categorical data (e.g., colors, countries) into numeric format, enabling the model to process them correctly. Without this step, the model would misinterpret categorical labels as ordinal or meaningless numeric values.
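
A minimal sketch of one-hot encoding, assuming categories are ordered alphabetically and using a hypothetical `colors` column.

```python
def one_hot(values):
    """Return the sorted category list and one indicator row per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is set per row
        rows.append(row)
    return categories, rows


colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)
```

Each category gets its own column, so no artificial ordering is implied between "red", "green", and "blue".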

Exam trap

CompTIA often tests the distinction between preprocessing steps (data cleaning, transformation) and later pipeline stages (model tuning, evaluation), so candidates mistakenly select hyperparameter tuning or model evaluation as preprocessing steps.

192
MCQhard

A large financial services company deploys multiple AI models on a shared Kubernetes cluster with GPU nodes. The models serve real-time fraud detection and credit scoring. Recently, the operations team observed frequent out-of-memory (OOM) errors during peak hours, causing inference failures. The monitoring dashboards show GPU memory utilization averaging 90% during peak times, and pods are being evicted. The team has allocated 8GB per pod and the total cluster GPU memory is 32GB. The models require at least 4GB each, but the fraud detection model occasionally spikes to 7GB. Which course of action best resolves the OOM errors while maintaining high availability?

A.Reduce the batch size and model complexity for all models to lower memory footprint
B.Set resource limits and requests per model based on observed usage, and implement pod priority classes
C.Provision larger GPU nodes with 48GB memory each
D.Increase the memory request for all pods to 8GB to ensure they have enough
AnswerB

Limits prevent OOM, priority ensures critical models get resources.

Why this answer

Option B is correct because it uses Kubernetes resource management features—setting precise resource requests and limits based on observed GPU memory usage—combined with pod priority classes to ensure critical fraud detection pods are scheduled and retained during contention. This prevents OOM errors by capping memory per pod while allowing the spike-prone fraud model to be prioritized over less critical workloads, maintaining high availability without overprovisioning.

Exam trap

CompTIA often tests the misconception that simply increasing resource requests or node size solves OOM errors, when the real solution involves proper resource limits and scheduling policies to handle variable workloads and maintain availability.

How to eliminate wrong answers

Option A is wrong because reducing batch size and model complexity may degrade inference accuracy or latency, and it does not address the root cause of memory spikes for the fraud detection model; it is a workaround that sacrifices performance. Option C is wrong because provisioning larger GPU nodes (48GB) is a costly overprovisioning approach that does not solve the scheduling or priority issue—it only shifts the bottleneck and may still allow a single pod to consume excessive memory and cause OOM on the larger node. Option D is wrong because increasing the memory request for all pods to 8GB does not prevent the fraud detection model from spiking to 7GB (which is under 8GB) and ignores the need for limits and priority; it may also lead to resource waste and does not address eviction during peak contention.

193
MCQhard

An ML engineering team has a retraining pipeline that triggers automatically when model accuracy drops below a threshold. Recently, the model's accuracy has been fluctuating, causing frequent retraining and high compute costs. The team suspects the data distribution is changing slowly. Which approach should the team implement to reduce unnecessary retraining while maintaining model performance?

A.Use a simpler model to reduce variability
B.Implement a statistical drift detection method on input features
C.Increase the frequency of model retraining
D.Reduce the batch size for inference
AnswerB

Drift detection ensures retraining only when meaningful change occurs.

Why this answer

Option B is correct because implementing a statistical drift detection method (e.g., using KL divergence, PSI, or ADWIN) on input features allows the team to identify when the data distribution has genuinely changed, rather than reacting to random accuracy fluctuations. This reduces unnecessary retraining by triggering the pipeline only when statistically significant drift is detected, maintaining model performance without the high compute costs of frequent retraining.
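
As one example of such a check, the Population Stability Index (PSI) can be sketched as below. The equal-width binning, the `eps` floor for empty bins, and the sample data are assumptions of this sketch; the 0.25 cut-off is a commonly cited rule of thumb rather than a fixed standard.

```python
import math


def psi(expected, actual, bins=4, eps=1e-6):
    """Population Stability Index between a baseline and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range fall into the edge bins.
            idx = max(0, min(bins - 1, int((v - lo) / width)))
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: training baseline versus shifted production data.
baseline = list(range(100))
shifted = [v + 40 for v in baseline]

psi_same = psi(baseline, baseline)
psi_shifted = psi(baseline, shifted)
retrain = psi_shifted > 0.25  # assumed trigger threshold
```

Retraining is then tied to a measured shift in the inputs rather than to every dip in accuracy.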

Exam trap

CompTIA often tests the misconception that increasing retraining frequency or simplifying the model can solve drift-related issues, but the correct approach is to detect drift statistically before deciding to retrain.

How to eliminate wrong answers

Option A is wrong because using a simpler model may reduce variability but does not address the root cause of distribution drift; it could also degrade performance by underfitting the true underlying patterns. Option C is wrong because increasing retraining frequency would exacerbate the compute cost problem and may overfit to transient fluctuations, not solve the issue of unnecessary retraining. Option D is wrong because reducing the batch size for inference affects throughput and latency, not the detection of data distribution changes or the decision to retrain.

194
MCQhard

An e-commerce company deploys a deep learning model for product recommendation. After a new data pipeline is implemented, the model's online performance degrades: recall drops by 20% and the click-through rate decreases. The data scientists suspect data drift. They compare the distribution of the input features between the training data and recent production data. The Kolmogorov-Smirnov test shows significant differences for two numerical features (price and rating). The team also notices that the frequency of categorical feature 'category' has changed. Which of the following is the MOST appropriate first step?

A.Implement a monitoring dashboard to track drift over time and set up alerts.
B.Roll back to the previous data pipeline and investigate the root cause of drift.
C.Use feature selection to remove the drifting features and retrain.
D.Immediately retrain the model on all available data including new production data.
AnswerB

Rolling back restores the previous stable distribution; investigating the root cause prevents recurrence.

Why this answer

Option B is correct. Since the drift occurred after a pipeline change, rolling back and investigating the root cause is the most prudent first step before making model changes. Retraining on drifted data (D) might incorporate a faulty distribution.

Removing drifting features (C) could lose important information and may not fully address the issue. Implementing monitoring (A) is useful in the long term but does not address the immediate degradation.

195
MCQeasy

A developer sees the above error during inference on a deployed image classification model. What is the most likely cause?

A.The model version is incompatible with the serving framework
B.The input images are not being resized to the required dimensions
C.The inference server does not support batch processing
D.The model is overfitting to a specific image size
AnswerB

Model expects 299x299 but receives 224x224, so preprocessing is missing resizing.

Why this answer

The error during inference typically indicates a mismatch between the input tensor shape expected by the model and the shape of the provided image. Most image classification models are trained on fixed-size inputs (e.g., 224x224 for ResNet), and failing to resize the input images to those required dimensions causes a shape mismatch error in the serving framework (e.g., TensorFlow Serving or TorchServe). Option B correctly identifies this as the most likely cause because the error message often references tensor shape incompatibility.

Exam trap

CompTIA often tests the misconception that inference errors are caused by model versioning or server configuration, when the actual issue is a simple preprocessing step like image resizing that candidates overlook.

How to eliminate wrong answers

Option A is wrong because model version incompatibility with the serving framework usually manifests as a loading or serialization error (e.g., 'Unsupported op set' or 'Model not found'), not a runtime shape mismatch during inference. Option C is wrong because the inference server's batch processing capability is unrelated to the error; even if batch processing is unsupported, the server would still process single images, and the error would not be about input dimensions. Option D is wrong because overfitting to a specific image size is a training-phase issue that would affect model accuracy, not cause a runtime shape mismatch error during inference.

196
MCQhard

A company operating in the EU must comply with GDPR. An AI model processes personal data for customer segmentation. Which of the following ensures compliance?

A.Obtain explicit consent once and use data indefinitely.
B.Store personal data permanently for model improvement.
C.Use only aggregated data without any individual records.
D.Implement data anonymization and allow users to request deletion.
AnswerD

Anonymization reduces privacy risk, and deletion capability ensures compliance with GDPR rights.

Why this answer

Option D is correct because GDPR mandates that personal data must be processed lawfully, with data minimization and the right to erasure. Implementing data anonymization removes personally identifiable information (PII) so the data is no longer considered personal data under GDPR, and allowing users to request deletion directly satisfies the 'right to be forgotten' (Article 17). This approach ensures compliance by both protecting individual privacy and providing a mechanism for data subjects to exercise their legal rights.

Exam trap

CompTIA often tests the misconception that pseudonymization or simple aggregation is sufficient for GDPR compliance, when in fact only irreversible anonymization (where no individual can be re-identified) removes data from GDPR scope, and the right to deletion must still be explicitly supported for any remaining personal data.

How to eliminate wrong answers

Option A is wrong because GDPR requires that consent be specific, informed, and revocable; obtaining consent once does not permit indefinite use, and data must be retained only as long as necessary for the stated purpose. Option B is wrong because storing personal data permanently violates the data minimization and storage limitation principles (Article 5(1)(c) and (e)), and model improvement is not a valid basis for indefinite retention without explicit, ongoing consent. Option C is wrong because while aggregated data reduces risk, it does not automatically ensure compliance if the aggregation method is reversible or if the data can be re-identified; true anonymization must be irreversible and meet the GDPR's standard of 'anonymous information' (Recital 26).

197
MCQmedium

Refer to the exhibit. What is the most likely issue and what action should be taken?

A.Learning rate is too low; increase it
B.Underfitting; increase model complexity
C.Overfitting; apply early stopping around epoch 15
D.Data imbalance; use class weights
AnswerC

Validation loss starts rising after epoch 15; early stopping halts training at that point.

Why this answer

The training loss continues to decrease while validation loss increases after epoch 15, which indicates overfitting. Stopping training around epoch 15 keeps the model from the point where validation loss was lowest.
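Early stopping is usually implemented with a patience counter. The sketch below assumes a made-up validation curve (the exhibit is not reproduced here) and a hypothetical helper `early_stopping_epoch`; it returns the epoch whose weights would be kept.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch with the best validation loss once
    `patience` consecutive epochs have failed to improve on it."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # stop training; restore the weights saved at best_epoch
    return best_epoch


# Made-up curve: validation loss improves through epoch 15, then rises.
val_losses = [1.0 - 0.04 * e for e in range(1, 16)] + [0.45, 0.50, 0.56, 0.63, 0.70]
stop_at = early_stopping_epoch(val_losses)
```

With this curve training halts at epoch 18 and the epoch-15 checkpoint is the one retained.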

198
MCQeasy

Based on the exhibit, what issue should the team address?

A.Model accuracy below threshold
B.Potential fairness bias across groups
C.High latency
D.Low throughput
AnswerB

The disparity in accuracy between Group B (0.83) and other groups (0.97, 0.96) indicates a fairness issue that needs to be addressed.

Why this answer

Option B is correct because the exhibit breaks model performance down by group and shows Group B at 0.83 accuracy against 0.97 and 0.96 for the other groups. A gap of that size indicates a potential fairness bias, which must be addressed to ensure equitable outcomes, especially in high-stakes AI applications like hiring or lending.

Exam trap

CompTIA often tests the misconception that overall accuracy, latency, or throughput is the primary concern, when the real problem is hidden bias revealed only by disaggregated performance metrics across subgroups.

How to eliminate wrong answers

Option A is wrong because the exhibit does not show an overall accuracy metric below a threshold; instead, it highlights group-wise performance differences, not a global accuracy issue. Option C is wrong because latency refers to inference time per request, which is not indicated by group-wise performance metrics or confusion matrices. Option D is wrong because throughput measures the number of predictions per second, which is unrelated to the group-level bias patterns shown in the exhibit.
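A disaggregated check of this kind can be sketched as follows. The accuracies are the ones quoted in the answer summary; the group names "A" and "C" and the 0.05 tolerance are assumptions made for illustration.

```python
# Per-group accuracy; names "A" and "C" are assumed, values come from the summary.
group_accuracy = {"A": 0.97, "B": 0.83, "C": 0.96}


def accuracy_gap(per_group):
    """Largest difference in accuracy between any two groups."""
    return max(per_group.values()) - min(per_group.values())


def flag_disparity(per_group, tolerance=0.05):
    """Return the groups that trail the best group by more than `tolerance`."""
    best = max(per_group.values())
    return sorted(g for g, acc in per_group.items() if best - acc > tolerance)


gap = accuracy_gap(group_accuracy)
lagging = flag_disparity(group_accuracy)
```

An aggregate accuracy figure would hide this gap, which is why the per-group breakdown matters.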

199
Multi-Selectmedium

A machine learning engineer is deploying a model to production. Which TWO practices are essential for ensuring reproducibility of model predictions?

Select 2 answers
A.Increase the number of training epochs to ensure convergence.
B.Use the same GPU hardware for both training and inference.
C.Use parallel data loading to speed up inference.
D.Version-control the model artifact (e.g., using MLflow or DVC).
E.Fix random seeds for all libraries (e.g., NumPy, TensorFlow).
AnswersD, E

Versioning ensures the exact model is used for inference.

Why this answer

Version-controlling the model artifact (D) is essential because it allows you to reproduce the exact model binary that generated a prediction, ensuring that any changes to the model code, hyperparameters, or training data do not silently alter outputs. Tools like MLflow or DVC store the model along with its metadata, enabling rollback and auditability in production. Fixing random seeds (E) makes stochastic steps such as weight initialization and data shuffling repeatable from one run to the next.

Exam trap

CompTIA often tests the misconception that hardware consistency (e.g., same GPU) is required for reproducibility, when in fact deterministic software practices (version control and seed fixing) are the critical factors.
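Both practices can be illustrated with the standard library alone. This is a sketch under stated assumptions: `sample_run` stands in for any stochastic step, and `artifact_fingerprint` is a hypothetical stand-in for what a registry such as MLflow or DVC records, not their actual APIs.

```python
import hashlib
import random


def sample_run(seed):
    """Simulate a stochastic step (for example, shuffling) under a fixed seed."""
    rng = random.Random(seed)
    data = list(range(10))
    rng.shuffle(data)
    return data


def artifact_fingerprint(model_bytes):
    """SHA-256 digest that identifies one exact model artifact."""
    return hashlib.sha256(model_bytes).hexdigest()


same_seed_matches = sample_run(42) == sample_run(42)
fingerprint_v1 = artifact_fingerprint(b"weights-v1")  # placeholder bytes
fingerprint_v2 = artifact_fingerprint(b"weights-v2")  # placeholder bytes
```

Logging the fingerprint next to each prediction makes it possible to tell which artifact produced it.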

200
MCQmedium

A security analyst notices that an AI model used for facial recognition is returning unusually high confidence scores for certain individuals while consistently misidentifying others. Which type of attack is most likely occurring?

A.Data poisoning
B.Evasion attack
C.Model inversion attack
D.Model extraction attack
AnswerC

Inversion exploits confidence scores to infer private training data, often showing high confidence on seen data.

Why this answer

Option C is correct because a model inversion attack aims to reconstruct training data by exploiting confidence scores, leading to overconfidence on familiar data. Option A is wrong because poisoning corrupts training data, not inference behavior. Option B is wrong because evasion attacks craft adversarial inputs to cause misclassification, not systematic overconfidence. Option D is wrong because extraction attacks steal model parameters through queries, not cause confidence anomalies.

201
Multi-Selecthard

Which TWO are key differences between Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN)?

Select 2 answers
A.CNNs are designed for sequential data; RNNs for spatial data
B.RNNs have internal memory; CNNs do not
C.CNNs can handle variable-length inputs; RNNs require fixed-size inputs
D.CNNs use backpropagation; RNNs do not
E.CNNs use weight sharing across spatial dimensions; RNNs share weights across time steps
AnswersB, E

RNNs maintain a hidden state for temporal memory; CNNs are feedforward.

Why this answer

CNNs share weights across spatial dimensions via convolution filters, while RNNs share weights across time steps. RNNs have internal memory (hidden state) that captures temporal dependencies; CNNs lack inherent memory.
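The two kinds of weight sharing can be seen in a toy example. This is a scalar sketch with hypothetical helpers `conv1d` and `rnn_forward`, not a real network: the kernel is reused at every position, while the RNN reuses the same two weights at every time step and carries a hidden state forward.

```python
import math


def conv1d(signal, kernel):
    """Valid 1-D cross-correlation: one kernel reused at every position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def rnn_forward(inputs, w_in, w_hidden):
    """Scalar RNN: the same weights are applied at every time step,
    and the hidden state h carries information forward."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_hidden * h)
        states.append(h)
    return states


features = conv1d([1, 2, 3, 4], [1, -1])
states = rnn_forward([1.0, 0.0, 0.0], w_in=0.5, w_hidden=0.9)
```

The RNN states stay non-zero after the input drops to zero, which is the "memory" the answer refers to; the convolution output depends only on the current window.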

202
MCQmedium

A manufacturing company uses a computer vision AI to inspect products on an assembly line for defects. The AI model was trained on images from a single camera angle under bright, uniform lighting. Recently, the company moved the inspection station to a different part of the factory where lighting is dimmer and varies due to nearby windows. The model now misclassifies many non-defective products as defective, causing false alarms and production delays. The team has limited labeled data from the new environment. Which action should the team take to restore inspection accuracy while minimizing downtime?

A.Apply domain adaptation techniques using a small set of labeled images from the new environment
B.Increase the defect classification threshold to reduce false positives
C.Revert to the previous lighting setup by reinstalling bright, uniform lights
D.Retrain the model from scratch using a large dataset of images from the new environment
AnswerA

Domain adaptation adjusts the model to new conditions with minimal data.

Why this answer

Domain adaptation techniques allow a model trained on a source domain (bright, uniform lighting) to generalize to a target domain (dim, variable lighting) using only a small set of labeled images from the new environment. This approach minimizes downtime because it avoids the need for large-scale data collection or retraining from scratch, and it directly addresses the distribution shift that causes false positives.

Exam trap

CompTIA often tests the misconception that simply adjusting a threshold or reverting to old conditions is a valid fix, when the correct approach is to adapt the model to the new data distribution using domain adaptation.

How to eliminate wrong answers

Option B is wrong because increasing the classification threshold reduces false positives at the cost of increasing false negatives, which would allow defective products to pass inspection — a critical safety and quality risk. Option C is wrong because reverting to the previous lighting setup is a workaround that does not solve the underlying domain shift problem and may be impractical or costly if the new location is fixed. Option D is wrong because retraining from scratch requires a large labeled dataset from the new environment, which the team does not have, and would cause significant downtime for data collection and training.

203
MCQeasy

A data scientist is training a neural network to classify images of handwritten digits. The model achieves 99% accuracy on training data but only 85% on validation data. Which technique should the scientist apply first to address this issue?

A.Remove one or more hidden layers from the network
B.Increase the number of training epochs
C.Apply L2 regularization to the network weights
D.Add more features to the input data
AnswerC

L2 regularization penalizes large weights and reduces overfitting.

Why this answer

The model shows high training accuracy (99%) but lower validation accuracy (85%), which is a classic sign of overfitting. L2 regularization (option C) adds a penalty term to the loss function proportional to the squared magnitude of the weights, discouraging the network from learning overly complex patterns that do not generalize. This directly addresses overfitting without reducing the model's capacity too aggressively.

Exam trap

CompTIA often tests the distinction between overfitting and underfitting. The trap here is that candidates may mistake increasing epochs (option B) for a fix for low validation accuracy, when in fact it exacerbates overfitting in this scenario.

How to eliminate wrong answers

Option A is wrong because removing hidden layers reduces the model's capacity, which may underfit and does not specifically target the overfitting problem; the network already has sufficient capacity to memorize the training data. Option B is wrong because increasing the number of training epochs would likely worsen overfitting by allowing the model to further memorize noise in the training data, not improve validation performance. Option D is wrong because adding more features to the input data (e.g., additional pixel-level transformations) would increase the dimensionality and risk of overfitting, not reduce it, and is not a standard technique for addressing overfitting in neural networks.

204
MCQhard

An AI model achieves high accuracy on training data but performs poorly on new test data. The data scientist suspects the model has memorized noise. Which technique directly adds a penalty term to the loss function to address this?

A.Batch normalization
B.Data augmentation
C.Dropout
D.L2 regularization
AnswerD

Correct; L2 adds a penalty term proportional to squared weights.

Why this answer

L2 regularization (also known as weight decay) directly adds a penalty term proportional to the squared magnitude of the model's weights to the loss function. This discourages the model from fitting the noise in the training data by keeping weights small, thereby reducing overfitting and improving generalization to new test data.

Exam trap

CompTIA often tests the distinction between regularization techniques that modify the loss function (L2) and those that modify the network architecture or data (dropout, batch normalization, data augmentation). Candidates mistakenly choose dropout because it is a well-known regularization method, even though it does not add a penalty term to the loss function.

How to eliminate wrong answers

Option A is wrong because batch normalization normalizes the inputs of each layer to stabilize and accelerate training, but it does not add a penalty term to the loss function; it addresses internal covariate shift, not overfitting from memorized noise. Option B is wrong because data augmentation artificially expands the training dataset by applying transformations (e.g., rotations, flips) to reduce overfitting, but it does not modify the loss function with a penalty term. Option C is wrong because dropout randomly drops neurons during training to prevent co-adaptation, which is a regularization technique but it does not add a penalty term to the loss function; it works by altering the network architecture during training.
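The penalty term itself is simple to write down. The sketch below assumes a regularization strength `lam` and made-up weight vectors; it shows that two models with the same data loss get different total losses once the L2 term is added.

```python
def l2_penalty(weights, lam):
    """L2 term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam=0.01):
    """Total loss = data loss + L2 penalty."""
    return data_loss + l2_penalty(weights, lam)


# Same data loss, different weight magnitudes (made-up values).
small = regularized_loss(0.30, [0.1, -0.2, 0.1])
large = regularized_loss(0.30, [3.0, -4.0, 2.0])
```

Because the larger weights cost more, the optimizer is pushed toward the smaller-weight solution when both fit the data equally well.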

205
MCQhard

A company deploys a deep learning model for real-time object detection in autonomous vehicles. The model was trained on high-end GPUs but needs to run on edge devices with limited computational resources. Which technique is most effective for reducing model size and inference latency while maintaining acceptable accuracy?

A.Hyperparameter tuning
B.Batch normalization
C.Dropout
D.Quantization
AnswerD

Quantization reduces numerical precision, shrinking model size and improving inference speed.

Why this answer

Quantization reduces the precision of model weights (e.g., from 32-bit to 8-bit), significantly decreasing model size and speeding up inference with minimal accuracy loss.
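One common scheme is symmetric per-tensor quantization to 8-bit integers. The sketch below is a simplified illustration with made-up weights and hypothetical helper names; production toolchains add calibration and per-channel scales.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Map the integers back to approximate floating-point weights."""
    return [q * scale for q in q_weights]


weights = [0.82, -1.27, 0.05, 0.0]  # made-up float weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte instead of four, and the rounding error per weight is bounded by half the scale.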

206
MCQmedium

A financial institution uses a regression model to predict credit risk. The model has a high R-squared on training data but low R-squared on test data. Which of the following is the most likely cause?

A.The features were not standardized before training.
B.The model is overfitting the training data.
C.The model is underfitting the training data.
D.There is multicollinearity among the input features.
AnswerB

Overfitting explains high training and low test performance.

Why this answer

A high R-squared on training data combined with a low R-squared on test data is the classic symptom of overfitting. The model has memorized noise and specific patterns in the training set rather than learning generalizable relationships, causing poor performance on unseen data.

Exam trap

CompTIA often tests the distinction between overfitting and underfitting by presenting a high training metric with a low test metric, tempting candidates to think the model is 'too good' or that data preprocessing (like standardization) is the fix.

How to eliminate wrong answers

Option A is wrong because feature standardization (scaling) affects convergence speed for some algorithms but does not inherently cause overfitting or the described train-test R-squared gap. Option C is wrong because underfitting would produce low R-squared on both training and test data, not high on training and low on test. Option D is wrong because multicollinearity inflates coefficient variances and can reduce interpretability, but it does not typically cause a large discrepancy between training and test R-squared; it affects both sets similarly.

207
MCQmedium

Refer to the exhibit. A data scientist is training a neural network and observes the training log above. What is the most likely cause?

A.The model is overfitting
B.The model is underfitting
C.The batch size is too large
D.The learning rate is too high
AnswerD

High learning rate causes the optimizer to overshoot minima, leading to divergence.

Why this answer

The loss is increasing and accuracy decreasing, indicating divergence, which is typically caused by a learning rate that is too high.
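Divergence from an oversized step is easy to reproduce on a toy problem. The sketch below minimizes f(x) = x^2 with plain gradient descent; the function and both learning rates are assumptions chosen for illustration, not values from the exhibit.

```python
def gradient_descent(lr, steps=10, x0=1.0):
    """Minimize f(x) = x**2 (gradient 2x) and record the loss after each step."""
    x = x0
    losses = []
    for _ in range(steps):
        x -= lr * 2 * x
        losses.append(x * x)
    return losses


stable = gradient_descent(lr=0.1)     # each step multiplies x by 0.8
diverging = gradient_descent(lr=1.1)  # each step multiplies x by -1.2
```

With the larger rate every update overshoots the minimum by more than the previous distance, so the loss grows step after step.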

208
MCQmedium

Refer to the exhibit. A data scientist defines a model configuration in JSON. Which component is missing from the configuration for a complete machine learning pipeline?

A.Training hyperparameters
B.Data preprocessing steps
C.Model type
D.Evaluation metrics
AnswerB

Preprocessing (scaling, encoding) is missing.

Why this answer

A complete machine learning pipeline must include data preprocessing steps to transform raw data into a format suitable for model training. The JSON configuration defines the model type, evaluation metrics, and training hyperparameters, but omits any specification for data cleaning, normalization, feature encoding, or splitting, which are essential for reproducibility and model performance.

Exam trap

CompTIA often tests the misconception that a model configuration is complete if it includes the model type, hyperparameters, and evaluation metrics, but candidates overlook that data preprocessing is a mandatory pipeline stage for transforming raw data before training.

How to eliminate wrong answers

Option A is wrong because training hyperparameters (e.g., learning rate, batch size) are present in the configuration as part of the model training specification, so they are not missing. Option C is wrong because the model type (e.g., 'neural_network', 'random_forest') is explicitly defined in the JSON under the 'model' key, so it is not missing. Option D is wrong because evaluation metrics (e.g., 'accuracy', 'f1_score') are listed in the configuration under the 'evaluation' section, so they are not missing.

209
Multi-Selecteasy

A data scientist is training a supervised learning model for customer churn prediction. Which TWO types of bias are most likely to affect the model's fairness and accuracy if not addressed?

Select 2 answers
A.Algorithmic bias
B.Selection bias
C.Measurement bias
D.Sampling bias
E.Confirmation bias
AnswersB, C

Selection bias arises when the sample is not representative of the population, leading to skewed predictions.

Why this answer

Selection bias (B) occurs when the training data does not represent the true customer population, e.g., using only data from a specific time period or region, leading to a model that fails to generalize. Measurement bias (C) arises from systematic errors in how features are recorded, such as inconsistent data collection methods across customer segments, which can skew predictions and harm fairness.

Exam trap

CompTIA often tests the distinction between data-level biases (selection, measurement) and human cognitive biases (confirmation bias), so candidates mistakenly pick confirmation bias because it sounds plausible in a data science context.

210
Multi-Selectmedium

Which THREE are common activation functions used in neural networks? (Choose THREE.)

Select 3 answers
A.ReLU
B.Softmax
C.Sigmoid
D.Linear
E.Tanh
AnswersA, C, E

Rectified Linear Unit is widely used in hidden layers.

Why this answer

ReLU (Rectified Linear Unit) is a common activation function in neural networks because it introduces non-linearity while being computationally efficient. It outputs the input directly if positive, otherwise zero, which helps mitigate the vanishing gradient problem compared to sigmoid or tanh. This makes it a default choice for hidden layers in many deep learning architectures. Sigmoid (C) and Tanh (E) are the classic squashing non-linearities, mapping inputs to (0, 1) and (-1, 1) respectively.

Exam trap

CompTIA often tests the distinction between activation functions used in hidden layers versus output layers, so candidates mistakenly select Softmax as a general activation function when it is only appropriate for the final layer in classification tasks.
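For reference, the three functions named in the answer can be written in a few lines. This is a scalar sketch; deep learning libraries apply the same formulas element-wise to tensors.

```python
import math


def sigmoid(x):
    """Squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes any real input into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Passes positive inputs through unchanged and zeroes out the rest."""
    return max(0.0, x)


samples = [-3.0, 0.0, 2.0]
table = {name: [f(x) for x in samples]
         for name, f in (("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu))}
```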

211
MCQhard

Based on the exhibit, which action is permitted by this policy?

A.Deploy a new model to an endpoint.
B.Update an existing endpoint.
C.Delete an endpoint.
D.Invoke an endpoint for inference.
AnswerA

The allowed actions are necessary and sufficient to deploy a new model.

Why this answer

The exhibit shows an IAM policy that grants the `sagemaker:CreateModel`, `sagemaker:CreateEndpointConfig`, and `sagemaker:CreateEndpoint` actions, with `sagemaker:CreateEndpoint` being the key one. Deploying a new model to an endpoint requires creating a new endpoint, which is explicitly allowed by this policy. The policy does not include `sagemaker:UpdateEndpoint`, `sagemaker:DeleteEndpoint`, or `sagemaker:InvokeEndpoint`, so only creating a new endpoint is permitted.

Exam trap

CompTIA often tests the distinction between creating a new resource versus modifying or deleting an existing one, leading candidates to assume that broad permissions like `CreateEndpoint` also cover updates or invocations, which is incorrect in IAM policy evaluation.

How to eliminate wrong answers

Option B is wrong because updating an existing endpoint requires the `sagemaker:UpdateEndpoint` action, which is not listed in the policy. Option C is wrong because deleting an endpoint requires the `sagemaker:DeleteEndpoint` action, which is not granted. Option D is wrong because invoking an endpoint for inference requires the `sagemaker:InvokeEndpoint` action, which is absent from the policy.
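The elimination logic amounts to a set comparison. The sketch below is a deliberate simplification: it uses only the action names mentioned in the explanation and ignores explicit denies, resources, and conditions, which real IAM policy evaluation also takes into account.

```python
# Actions the explanation says the policy allows.
ALLOWED_ACTIONS = {
    "sagemaker:CreateModel",
    "sagemaker:CreateEndpointConfig",
    "sagemaker:CreateEndpoint",
}

# Actions each task needs, following the explanation above.
REQUIRED = {
    "deploy new model": {
        "sagemaker:CreateModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint",
    },
    "update endpoint": {"sagemaker:UpdateEndpoint"},
    "delete endpoint": {"sagemaker:DeleteEndpoint"},
    "invoke endpoint": {"sagemaker:InvokeEndpoint"},
}


def permitted(task):
    """A task is permitted only if every action it needs is allowed."""
    return REQUIRED[task] <= ALLOWED_ACTIONS


permitted_tasks = [task for task in REQUIRED if permitted(task)]
```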

212
MCQmedium

A company is deploying a fraud detection model that must return predictions within 100ms to avoid transaction delays. The team is deciding between batch and real-time inference. Which factor most strongly supports a real-time inference architecture?

A.The model requires large amounts of historical data for each prediction
B.The application requires immediate feedback for each transaction
C.The infrastructure budget is limited and must be optimized
D.The model can be retrained weekly using gathered data
AnswerB

Real-time inference delivers low-latency predictions for each request.

Why this answer

Real-time inference is required when the application must return predictions within strict latency bounds (e.g., 100ms) to avoid transaction delays. The need for immediate feedback per transaction directly aligns with a real-time architecture, where each request is processed individually as it arrives, rather than waiting for a batch window. Batch inference would introduce unacceptable latency because it processes groups of records on a schedule, not on-demand.

Exam trap

CompTIA often tests the misconception that batch inference is always cheaper or more efficient, but the trap here is that latency requirements (under 100ms) force a real-time architecture regardless of cost or data volume.

How to eliminate wrong answers

Option A is wrong because requiring large amounts of historical data for each prediction does not dictate real-time vs. batch; it affects feature engineering and storage, not inference latency. Option C is wrong because limited infrastructure budget typically favors batch inference, which can use cheaper, less scalable resources and process data in bulk, not real-time. Option D is wrong because weekly retraining is a model update frequency concern, unrelated to the inference serving architecture; both batch and real-time systems can support periodic retraining.

213
MCQmedium

An organization implements the above access control policy for its AI model registry. During an audit, the auditor discovers that a data scientist deployed a model to production without authorization. Which of the following is the most likely cause?

A.The ML engineer role has approval required, but the manager approved the deployment
B.The time window condition blocked the deployment, but it was overridden by an administrator
C.The policy allows data scientists to deploy to staging, but they exploited a gap in enforcement to promote to production without the required approval
D.The auditor lacks 'deploy_to_production' permission, so they missed the deployment
AnswerC

The policy lacks technical enforcement of the role-based separation.

Why this answer

Option C is correct because the policy gives data scientists the 'deploy_to_staging' permission but not 'deploy_to_production', and its conditions include MFA but no separation of duties. Nothing in the policy prevents a data scientist from manually copying the model to production if there is no technical control. The most likely cause is that the policy is not enforced by a technical mechanism, allowing the data scientist to bypass the intended restrictions.

214
Multi-Selectmedium

Which THREE are common causes of data leakage in machine learning pipelines?

Select 3 answers
A.Using time-based splitting for sequential data
B.Using future information to predict the present
C.Using cross-validation on the entire dataset
D.Applying normalization before splitting data into train and test sets
E.Including features that are directly derived from the target variable
AnswersB, D, E

Using data that would not be available at prediction time is a direct form of leakage.

Why this answer

Option B is correct because using future information to predict the present is a classic form of data leakage. In time series or sequential data, if a model is trained on features that include values from a later time point, it gains access to information that would not be available at prediction time, leading to overly optimistic performance metrics and poor generalization. Normalizing before the split (D) lets test-set statistics shape the training features, and features derived from the target (E) hand the model the answer directly.

Exam trap

CompTIA often tests the distinction between valid data splitting practices and actual leakage causes, so candidates may incorrectly select time-based splitting (Option A) as a leakage cause when it is actually a proper technique for sequential data.
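The normalization case is the easiest to demonstrate. In the sketch below the data values are made up and `fit_scaler` and `transform` are hypothetical helpers; the leaky version computes its statistics on train and test together, while the clean version fits on the training split only and then reuses those statistics.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    return mean(values), pstdev(values)


def transform(values, scaler):
    """Standardize values with previously learned statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


train, test = [1.0, 2.0, 3.0, 4.0], [100.0]  # made-up data

leaky_scaler = fit_scaler(train + test)  # test value leaks into the statistics
clean_scaler = fit_scaler(train)         # statistics come from training data only

train_scaled = transform(train, clean_scaler)
test_scaled = transform(test, clean_scaler)
```

The leaky mean is 22.0 instead of 2.5, so the training features would already encode something about the held-out point.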

215
MCQeasy

Refer to the exhibit. The monitoring dashboard for a deployed churn prediction model shows a drift detected flag. However, the error rate and latency are within acceptable ranges. What is the most appropriate immediate action?

A.Trigger automatic retraining using the latest data
B.Roll back to the previous model version immediately
C.Ignore the drift since performance metrics are stable
D.Investigate the type and severity of drift before deciding
AnswerD

Understanding drift (covariate vs concept) informs next steps.

Why this answer

Option D is correct because drift detection warrants investigation before any automated action; retraining or rollback might be premature without understanding the drift type. Option A is wrong because auto-retraining could be risky if drift is benign. Option C is wrong because ignoring drift may lead to future degradation. Option B is wrong because rollback discards potential improvements.

216
Multi-Selecteasy

Which TWO are evaluation metrics for classification problems? (Choose two.)

Select 2 answers
A.Precision
B.Mean Absolute Error
C.R-squared
D.Mean Squared Error
E.Recall
AnswersA, E

Correct: Precision is a classification metric.

Why this answer

Options A and E are correct because Precision and Recall are classification metrics. Options B, C, and D are incorrect: Mean Absolute Error and Mean Squared Error are regression metrics, and R-squared is also for regression.
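Both metrics come straight from the counts of true positives, false positives, and false negatives. The labels and predictions below are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up ground truth
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # made-up predictions
precision, recall = precision_recall(y_true, y_pred)
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, giving a precision of 2/3 and a recall of 1/2.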

217
MCQhard

A company serves a large language model (LLM) on a Kubernetes cluster. The inference latency is acceptable but the cost is high due to GPU usage. The model is 7 billion parameters and requires 16GB GPU memory. The team wants to reduce cost without increasing latency. Which strategy should they implement?

A.Increase the batch size for inference
B.Add more GPU nodes to distribute the load
C.Switch to CPU-based inference
D.Use model quantization to reduce precision
AnswerD

Quantization reduces model size and memory, enabling more efficient GPU usage.

Why this answer

Model quantization reduces the precision of the model's weights (e.g., from FP16 to INT8 or INT4). For a 7B-parameter model, the weights alone take roughly 14GB at FP16, about 7GB at INT8, and about 3.5GB at INT4, so the 16GB GPU memory requirement shrinks substantially. This directly lowers GPU cost per inference while maintaining acceptable latency, as the model can run on fewer or cheaper GPUs without increasing inference time.

Exam trap

CompTIA often tests the misconception that adding more hardware (Option B) or increasing batch size (Option A) always reduces cost, when in fact they increase resource usage and cost; the trap is that candidates overlook memory optimization techniques like quantization as a direct cost-reduction strategy.

How to eliminate wrong answers

Option A is wrong because increasing batch size for inference would increase GPU memory usage and could increase latency due to larger memory transfers, not reduce cost without affecting latency. Option B is wrong because adding more GPU nodes would increase cost, not reduce it, and does not address the high GPU memory usage per inference. Option C is wrong because switching to CPU-based inference would drastically increase latency (often 10-100x slower) due to the lack of parallel processing for large matrix operations, violating the requirement to not increase latency.
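The memory arithmetic behind quantization is worth working through once. The sketch below counts weights only, in decimal gigabytes, and ignores activations, the KV cache, and framework overhead, all of which add to the real footprint.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# 7 billion parameters, as in the question.
footprint = {dtype: weight_memory_gb(7e9, dtype) for dtype in BYTES_PER_PARAM}
```

For 7 billion parameters this gives 28 GB at FP32, 14 GB at FP16, 7 GB at INT8, and 3.5 GB at INT4.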

218
MCQeasy

A machine learning engineer needs to choose an algorithm for grouping customers into segments based on purchasing behavior without any labels. Which algorithm should the engineer use?

A.K-means clustering
B.Random forest classifier
C.Linear regression
D.Support vector machine
AnswerA

K-means is unsupervised and groups data based on feature similarity.

Why this answer

K-means clustering is an unsupervised algorithm that partitions data into K clusters based on similarity.

219
MCQmedium

A company uses a pre-trained language model for a legal document classification task. They have limited labeled data (500 documents). Which strategy is MOST effective for adapting the model to this domain?

A.Use a rule-based keyword matching system instead.
B.Train a new model from scratch on the 500 documents.
C.Apply extensive data augmentation to increase dataset size.
D.Fine-tune the pre-trained model on the 500 labeled documents.
AnswerD

Correct; transfer learning works well with small labeled datasets.

Why this answer

Fine-tuning a pre-trained language model on 500 labeled legal documents is the most effective strategy because it leverages the model's existing knowledge of language structure and general semantics, requiring only a small amount of domain-specific data to adapt to the legal classification task. This approach avoids the high data requirements of training from scratch and outperforms rule-based or augmentation-only methods by directly optimizing the model's weights for the target domain.

Exam trap

CompTIA often tests the misconception that more data is always better (trap of Option C) or that starting from scratch is safer (trap of Option B), when in fact transfer learning via fine-tuning is the standard approach for low-resource NLP tasks.

How to eliminate wrong answers

Option A is wrong because rule-based keyword matching lacks the semantic understanding needed for legal document classification, where context and nuance are critical, and it cannot generalize beyond predefined patterns. Option B is wrong because training a new model from scratch on only 500 documents is insufficient for deep learning models, leading to severe overfitting and poor generalization due to the lack of pre-trained linguistic knowledge. Option C is wrong because extensive data augmentation on only 500 documents may introduce noise and unrealistic variations, and it does not provide the same benefit as leveraging a pre-trained model's learned representations, which already capture rich language patterns.

220
MCQhard

A financial institution is developing a fraud detection model using historical transaction data. The dataset contains over 10 million records, but only 0.01% of transactions are fraudulent. The current model uses a neural network trained with standard cross-entropy loss, and the team applies random undersampling of the majority class to create a balanced training set. However, the model still produces a high number of false positives (legitimate transactions flagged as fraud) and misses approximately 30% of actual fraud cases. The business requires that at least 95% of frauds be caught, and the false positive rate must be below 1% to avoid overwhelming fraud analysts. The team has limited resources to collect additional data and cannot change the model architecture significantly. Which approach should the team take to best meet the business requirements?

A.Use cost-sensitive learning by assigning a higher misclassification cost to the fraud class.
B.Apply feature selection to remove noisy predictors and then retrain the current model.
C.Switch to an anomaly detection algorithm such as Isolation Forest or One-Class SVM.
D.Collect more transaction data, especially fraudulent examples, to naturally balance the classes.
AnswerA

This directly penalizes false negatives more, encouraging the model to catch more frauds while maintaining a low false positive rate through tuning.

Why this answer

Cost-sensitive learning (Option A) adjusts the loss function to penalize false negatives more heavily, directly addressing the need to catch more frauds; the decision threshold can then be tuned to keep false positives under control. Option D is wrong because collecting more data is impractical given the team's limited resources and would not change the underlying class imbalance. Option C is wrong because anomaly detection models treat fraud as outliers but often have high false positive rates in this context. Option B is wrong because feature selection does not by itself address the class imbalance or the trade-off between recall and false positives.
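As an illustration of the cost-sensitive idea, the following is a minimal sketch of a class-weighted binary cross-entropy in plain Python. The function name `weighted_bce`, the `fraud_weight` parameter, and the sample scores are hypothetical; in practice the weight would be set through the team's training framework and tuned against the recall and false-positive targets.

```python
import math

def weighted_bce(y_true, p_pred, fraud_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy where errors on the fraud class (label 1) cost more."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() defined
        if y == 1:
            total += -fraud_weight * math.log(p)   # missed fraud is up-weighted
        else:
            total += -math.log(1.0 - p)            # legitimate transaction, weight 1
    return total / len(y_true)

# Hypothetical batch: three legitimate transactions and one fraud scored at only 0.3.
labels = [0, 0, 0, 1]
probs = [0.1, 0.2, 0.1, 0.3]

plain = weighted_bce(labels, probs, fraud_weight=1.0)
costly = weighted_bce(labels, probs, fraud_weight=50.0)
print(f"unweighted loss: {plain:.3f}, cost-sensitive loss: {costly:.3f}")
```

With the higher weight, the under-scored fraud case dominates the loss, so gradient updates push the model toward catching it.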

221
MCQmedium

A data scientist trains a linear regression model to predict house prices. The model has high bias and low variance. Which action would most likely reduce bias?

A.Apply L2 regularization
B.Increase the training dataset size
C.Add polynomial features
D.Remove irrelevant features
AnswerC

Adding complexity reduces bias but may increase variance.

Why this answer

High bias indicates the model is underfitting the data, meaning it is too simple to capture the underlying patterns. Adding polynomial features increases model complexity by introducing non-linear terms, which allows the linear regression model to better fit the training data and thus reduce bias.
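The effect can be sketched with a small least-squares fit in plain Python, assuming a synthetic curved target (the data, the `fit_mse` helper, and the `solve` routine are illustrative, not part of the exam material). A straight line underfits the curve, while adding the squared term removes the systematic error.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_mse(xs, ys, degree):
    """Least-squares fit on features [1, x, ..., x^degree]; returns training MSE."""
    X = [[x ** d for d in range(degree + 1)] for x in xs]
    k = degree + 1
    # Normal equations: (X^T X) w = X^T y
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    w = solve(XtX, Xty)
    preds = [sum(wi * fi for wi, fi in zip(w, row)) for row in X]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Synthetic curved relationship, chosen only for illustration.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x + 1.0 for x in xs]

mse_linear = fit_mse(xs, ys, degree=1)      # underfits: high bias
mse_quadratic = fit_mse(xs, ys, degree=2)   # polynomial feature added
print(f"linear MSE: {mse_linear:.3f}, quadratic MSE: {mse_quadratic:.6f}")
```

On real data the extra terms can also raise variance, so the degree would normally be chosen with a validation set.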

Exam trap

CompTIA often tests the bias-variance tradeoff by making candidates confuse regularization (which reduces variance) with methods that reduce bias, or by implying that more data always fixes underfitting.

How to eliminate wrong answers

Option A is wrong because L2 regularization (Ridge regression) reduces overfitting by penalizing large coefficients; it trades higher bias for lower variance, so it makes an underfitting model worse. Option B is wrong because increasing the training dataset size typically reduces variance (helps with overfitting) but does not address underfitting (high bias): an overly simple model keeps making the same systematic errors no matter how many examples it sees. Option D is wrong because removing irrelevant features simplifies the model further, which cannot reduce bias and is counterproductive when the goal is to add model capacity.

222
MCQeasy

A team is deploying a deep learning model for real-time image classification on edge devices with limited computational resources. Which technique would best help reduce model size and inference time without significant accuracy loss?

A.Data augmentation
B.Model pruning and quantization
C.Transfer learning
D.Ensemble learning
AnswerB

Pruning removes redundant weights and quantization reduces precision, decreasing model size and speeding up inference.

Why this answer

Option B (model pruning and quantization) directly reduces model size and speeds up inference. Option A (data augmentation) improves generalization but does not reduce model size. Option C (transfer learning) can reduce training time but not necessarily inference time or model size.

Option D (ensemble learning) increases both size and inference time.
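A minimal sketch of both techniques on a toy weight list, in plain Python. The helper names and the weight values are hypothetical, and the scheme shown (magnitude pruning followed by symmetric int8 quantization) is one common variant rather than the only one; production toolchains apply these steps per layer, usually with fine-tuning afterwards.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

def quantize_int8(weights):
    """Symmetric linear quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]

# Hypothetical layer weights.
weights = [0.82, -0.03, 0.01, -0.57, 0.002, 0.44, -0.09, 0.30]

pruned = prune_by_magnitude(weights, sparsity=0.5)   # half the weights become 0
q, scale = quantize_int8(pruned)                     # 8-bit codes plus one float scale
restored = dequantize(q, scale)

max_error = max(abs(r - p) for r, p in zip(restored, pruned))
print(pruned)
print(q, f"max round-trip error: {max_error:.5f}")
```

Zeroed weights can be stored sparsely or skipped at inference, and the int8 codes take a quarter of the space of 32-bit floats, at the cost of a small rounding error.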

223
MCQmedium

A team is training a language model using a large text corpus. They want to ensure the model does not learn biased associations between gender and professions. Which data engineering technique should they apply?

A.Remove all gender-related words from the text
B.Use a pre-trained model that is already debiased
C.Apply adversarial debiasing during training
D.Balance the representation of professions across genders
AnswerD

Balancing ensures the model sees equal examples of each gender across professions, reducing biased correlations.

Why this answer

Balancing the representation of professions across genders in the training data reduces the chance the model learns spurious correlations. Removing gender words is too aggressive, pre-trained models may still be biased, and adversarial debiasing is a model training technique, not data engineering.

224
MCQeasy

A data analyst is cleaning a dataset and finds that 20% of the values for the 'age' column are missing. Which imputation method is most robust if the data is not normally distributed?

A.Mean imputation
B.Median imputation
C.Mode imputation
D.Remove rows with missing values
AnswerB

Median is robust to non-normal distributions.

Why this answer

Median imputation is the most robust method for handling missing values in the 'age' column when the data is not normally distributed because the median is far less affected by outliers and skewness than the mean. The mean is pulled toward extreme values, whereas the median still reflects the typical value in a skewed distribution, so the imputed values distort the data less for downstream modeling.
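The difference is easy to see on a small skewed sample. The sketch below uses only Python's standard `statistics` module; the `impute` helper and the age values are made up for illustration.

```python
from statistics import mean, median

def impute(values, strategy):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values], fill

# Hypothetical 'age' column: 20% missing, with two very old outliers.
ages = [22, 25, None, 27, 29, 31, None, 33, 95, 98]

_, mean_fill = impute(ages, "mean")
median_filled, median_fill = impute(ages, "median")
print(f"mean fill: {mean_fill}, median fill: {median_fill}")
```

Here the two outliers drag the mean fill value to 45, well above most observed ages, while the median fill value of 30 stays close to the typical record.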

Exam trap

CompTIA often tests the misconception that mean imputation is always the default or best choice for numerical data, but the trap here is that candidates overlook the importance of distribution shape and outlier sensitivity, leading them to select mean imputation despite the data not being normally distributed.

How to eliminate wrong answers

Option A is wrong because mean imputation is highly sensitive to outliers and skew: the fill value is pulled toward the tail of the distribution, which introduces bias when the data is not roughly symmetric. Option C is wrong because mode imputation is typically used for categorical data, not continuous variables like age, and it can lead to loss of granularity and an inaccurate representation of the distribution. Option D is wrong because removing rows with missing values discards 20% of the sample and can introduce selection bias, especially if the missingness is not completely at random, which may degrade model performance.

225
MCQeasy

A team deploys a machine learning model as a REST API. They want to monitor model drift. Which metric is MOST appropriate for detecting drift in the input data distribution?

A.Model accuracy on a recent holdout set.
B.Population stability index (PSI) comparing training and recent data.
C.F1 score on the training data.
D.Root mean squared error (RMSE) on test data.
AnswerB

PSI directly quantifies distribution shift.

Why this answer

Population stability index (PSI) is the most appropriate metric for detecting drift in input data distribution because it directly measures the shift between the training data distribution and the recent production data distribution. PSI is calculated by binning both distributions and computing the sum of (proportion in bin of recent data minus proportion in bin of training data) times the natural log of their ratio, making it sensitive to changes in feature distributions without requiring ground truth labels.
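The calculation described above can be sketched in plain Python as follows. The bin edges, the sample values, and the small floor applied to empty bins are assumptions made for this example; the sketch also assumes every value falls inside the outer edges.

```python
import math

def bin_proportions(values, edges, eps=1e-6):
    """Share of values in each bin [edges[i], edges[i+1]); the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            is_last = b == len(counts) - 1
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1
                break
    # Floor empty bins at eps so the log ratio stays defined.
    return [max(c / len(values), eps) for c in counts]

def psi(training, recent, edges):
    """Sum over bins of (recent share - training share) * ln(recent share / training share)."""
    expected = bin_proportions(training, edges)
    actual = bin_proportions(recent, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

# Hypothetical feature values at training time and in production.
training = [10, 12, 15, 18, 20, 22, 25, 28, 30, 35]
recent_shifted = [v + 15 for v in training]
edges = [0, 20, 40, 60]

psi_same = psi(training, training, edges)           # identical distributions
psi_shifted = psi(training, recent_shifted, edges)  # inputs have drifted upward
print(f"PSI unchanged: {psi_same:.3f}, PSI shifted: {psi_shifted:.3f}")
```

No labels are involved at any point: only the feature values from the two periods are compared, which is why PSI can be monitored before ground truth arrives.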

Exam trap

The trap here is that candidates often confuse performance metrics (accuracy, F1, RMSE) with distribution drift detection, not realizing that PSI specifically quantifies covariate shift without needing ground truth labels.

How to eliminate wrong answers

Option A is wrong because model accuracy on a recent holdout set measures performance degradation, not input data distribution drift; accuracy can drop due to concept drift or other factors, and it requires labeled data which may not be available in production. Option C is wrong because F1 score on the training data is a measure of model fit on historical data, not a metric for detecting changes in the input distribution of new data. Option D is wrong because root mean squared error (RMSE) on test data evaluates prediction error on a static test set, not the distributional shift between training and current production inputs.
