Sample questions
CompTIA AI+ AI0-001 practice questions
A machine learning engineer is building a spam filter. The dataset contains 10,000 emails, of which 1,000 are spam. The engineer decides to use a Random Forest classifier. Which preprocessing step is most critical to ensure the model generalizes well to new, unseen emails?
Trap 1: Apply Principal Component Analysis (PCA) to reduce dimensionality
PCA may discard useful information and is not necessary for Random Forest.
Trap 2: Normalize the numerical features to have zero mean and unit variance
Random Forest is not sensitive to feature scaling.
Trap 3: Encode all features using one-hot encoding
One-hot encoding is not universally needed and may increase dimensionality unnecessarily.
- A
Apply Principal Component Analysis (PCA) to reduce dimensionality
Why wrong: PCA may discard useful information and is not necessary for Random Forest.
- B
Normalize the numerical features to have zero mean and unit variance
Why wrong: Random Forest is not sensitive to feature scaling.
- C
Split the data into training and testing sets before any other preprocessing
Splitting first prevents data leakage and ensures realistic evaluation.
- D
Encode all features using one-hot encoding
Why wrong: One-hot encoding is not universally needed and may increase dimensionality unnecessarily.
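The leakage point behind option C can be sketched in a few lines. This is a minimal illustration, not a full pipeline: the `link_counts` feature and the helper names are hypothetical, and mean imputation stands in for any preprocessing step that learns statistics from data.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_mean(values):
    """Learn an imputation value, ignoring missing entries."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known)

def impute(values, fill):
    """Replace missing entries with a previously learned fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical feature: number of links per email, None means missing.
link_counts = [0, 1, None, 3, 2, None, 8, 1, 0, 5]

# 1. Split first, so the test rows never influence any learned statistic.
train, test = train_test_split(link_counts, test_fraction=0.2, seed=42)

# 2. Learn the statistic from the training split only.
fill = fit_mean(train)

# 3. Apply the same training statistic to both splits.
train_clean = impute(train, fill)
test_clean = impute(test, fill)
```

Computing `fill` from all ten rows before splitting would let test-set values leak into training, which is the failure mode the question is probing.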
Which THREE are common data preprocessing steps in a machine learning pipeline? (Choose 3)
Trap 1: Hyperparameter tuning
Hyperparameter tuning is part of model optimization.
Trap 2: Model evaluation
Model evaluation is after training.
- A
Hyperparameter tuning
Why wrong: Hyperparameter tuning is part of model optimization.
- B
Encoding categorical variables
Categorical data must be converted to numeric.
- C
Model evaluation
Why wrong: Model evaluation is after training.
- D
Scaling numeric features
Scaling prevents features with larger ranges from dominating.
- E
Handling missing values
Missing data must be addressed before training.
An e-commerce company uses an AI system to set dynamic prices for products. A customer complains that the price they see is higher than the price shown to a friend for the same product at the same time. The company wants to ensure pricing fairness. Which ethical principle should guide the redesign of the pricing algorithm?
Trap 1: Privacy by design
Privacy by design focuses on data protection, not on ensuring fair pricing outcomes.
Trap 2: Accountability
Accountability assigns responsibility but does not directly address the fairness of the pricing algorithm.
Trap 3: Beneficence
Beneficence focuses on promoting good, but does not specifically prevent unfair pricing practices.
- A
Transparency and explainability
Transparency requires the company to disclose how prices are determined, helping to ensure fairness and build trust.
- B
Privacy by design
Why wrong: Privacy by design focuses on data protection, not on ensuring fair pricing outcomes.
- C
Accountability
Why wrong: Accountability assigns responsibility but does not directly address the fairness of the pricing algorithm.
- D
Beneficence
Why wrong: Beneficence focuses on promoting good, but does not specifically prevent unfair pricing practices.
An AI system used for autonomous driving is found to have a lower accuracy in detecting pedestrians with darker skin tones. The development team wants to address this ethical issue. Which action is most effective?
Trap 1: Conduct additional testing to measure the disparity
Testing identifies the issue but does not fix it.
Trap 2: Replace the object detection algorithm with a different one
Algorithm change may not address data imbalance.
Trap 3: Adjust the model's decision threshold for pedestrian detection
Threshold adjustment changes sensitivity but does not improve feature learning.
- A
Conduct additional testing to measure the disparity
Why wrong: Testing identifies the issue but does not fix it.
- B
Augment the training dataset with more images of pedestrians with darker skin
Diverse data helps the model learn robust features for all skin tones.
- C
Replace the object detection algorithm with a different one
Why wrong: Algorithm change may not address data imbalance.
- D
Adjust the model's decision threshold for pedestrian detection
Why wrong: Threshold adjustment changes sensitivity but does not improve feature learning.
In the AI lifecycle, which phase involves splitting data into training, validation, and test sets?
Trap 1: Model training
Incorrect; training uses the already-split training data.
Trap 2: Data collection
Incorrect; data collection acquires raw data, not splitting.
Trap 3: Model evaluation
Incorrect; evaluation uses test data, but splitting happens earlier.
- A
Model training
Why wrong: Incorrect; training uses the already-split training data.
- B
Data preprocessing
Correct; preprocessing includes cleaning, transforming, and splitting data.
- C
Data collection
Why wrong: Incorrect; data collection acquires raw data, not splitting.
- D
Model evaluation
Why wrong: Incorrect; evaluation uses test data, but splitting happens earlier.
A startup is building a chatbot for customer service. They have 500 recorded conversations and want to use a pre-trained language model to generate responses. However, they have limited computational resources and need the chatbot to respond in real-time. They are considering fine-tuning a large model like GPT-3 or using a smaller model like DistilBERT. The conversation data contains industry-specific jargon. Which approach should they take?
Trap 1: Use GPT-3 via API without fine-tuning
GPT-3 is large, may not understand industry jargon without fine-tuning, and API costs can be high for real-time.
Trap 2: Train a custom RNN from scratch on the conversations
Training from scratch requires large datasets and significant compute; 500 conversations are insufficient.
Trap 3: Implement a rule-based system with keywords
Rule-based systems cannot handle the variability of natural language and would likely fail.
- A
Use GPT-3 via API without fine-tuning
Why wrong: GPT-3 is large, may not understand industry jargon without fine-tuning, and API costs can be high for real-time.
- B
Fine-tune DistilBERT on the conversation data
DistilBERT is smaller, faster, and fine-tuning on domain-specific data will adapt it to jargon while meeting real-time requirements.
- C
Train a custom RNN from scratch on the conversations
Why wrong: Training from scratch requires large datasets and significant compute; 500 conversations are insufficient.
- D
Implement a rule-based system with keywords
Why wrong: Rule-based systems cannot handle the variability of natural language and would likely fail.
A data scientist is preparing a dataset for supervised learning. Which TWO steps are essential?
Trap 1: One-hot encoding all features
Incorrect; one-hot encoding is only for categorical features, and not all features require it.
Trap 2: Normalizing features
Incorrect; normalization is beneficial but not essential for all algorithms.
Trap 3: Removing outliers
Incorrect; outlier removal is optional and depends on the problem.
- A
One-hot encoding all features
Why wrong: Incorrect; one-hot encoding is only for categorical features, and not all features require it.
- B
Normalizing features
Why wrong: Incorrect; normalization is beneficial but not essential for all algorithms.
- C
Labeling the data
Correct; supervised learning requires labeled examples.
- D
Removing outliers
Why wrong: Incorrect; outlier removal is optional and depends on the problem.
- E
Splitting into training and test sets
Correct; splitting is essential to avoid data leakage and evaluate generalization.
A company wants to create an AI system that can identify objects in images. They have a large dataset of labeled images. Which type of neural network architecture is most suitable?
Trap 1: Transformer
Incorrect; transformers are effective for NLP and some vision tasks but CNNs are more standard and efficient for large image datasets.
Trap 2: Generative adversarial network (GAN)
Incorrect; GANs are for generating data, not classification.
Trap 3: Recurrent neural network (RNN)
Incorrect; RNNs are for sequences, not spatial data.
- A
Transformer
Why wrong: Incorrect; transformers are effective for NLP and some vision tasks but CNNs are more standard and efficient for large image datasets.
- B
Convolutional neural network (CNN)
Correct; CNNs excel at image recognition due to convolutional layers.
- C
Generative adversarial network (GAN)
Why wrong: Incorrect; GANs are for generating data, not classification.
- D
Recurrent neural network (RNN)
Why wrong: Incorrect; RNNs are for sequences, not spatial data.
A financial services company is developing an AI model to detect fraudulent transactions. The dataset contains 99.9% legitimate transactions and 0.1% fraudulent ones. Which technique should the data scientist use to address the class imbalance problem?
Trap 1: Use a bagging ensemble method
Bagging can improve stability but does not directly solve class imbalance without additional techniques like SMOTE.
Trap 2: Undersample the legitimate transactions
Undersampling the majority class may lose valuable information and lead to underfitting.
Trap 3: Use cost-sensitive learning with higher weight on fraudulent class
Cost-sensitive learning modifies the algorithm's penalty, but it does not address data imbalance directly; SMOTE is preferred for preprocessing.
- A
Apply Synthetic Minority Oversampling Technique (SMOTE)
SMOTE creates synthetic examples of the minority class, balancing the dataset without losing information.
- B
Use a bagging ensemble method
Why wrong: Bagging can improve stability but does not directly solve class imbalance without additional techniques like SMOTE.
- C
Undersample the legitimate transactions
Why wrong: Undersampling the majority class may lose valuable information and lead to underfitting.
- D
Use cost-sensitive learning with higher weight on fraudulent class
Why wrong: Cost-sensitive learning modifies the algorithm's penalty, but it does not address data imbalance directly; SMOTE is preferred for preprocessing.
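The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration that uses only the single nearest minority neighbour; the published technique samples among k nearest neighbours, and the function name `smote_like` is hypothetical.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and its nearest minority neighbour (simplified: k = 1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Nearest neighbour by squared Euclidean distance.
        neighbour = min(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Hypothetical two-feature fraud samples.
fraud = [(1.0, 2.0), (1.5, 2.5), (3.0, 1.0)]
new_points = smote_like(fraud, n_new=5)
```

Each synthetic point lies on a line segment between two real minority samples, which is why no majority-class information is discarded.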
Based on the exhibit, which action is most likely to resolve the memory issue?
Exhibit
Refer to the exhibit. Error: RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 8.00 GiB total capacity; 6.50 GiB already allocated; 1.50 GiB free; 0 bytes cached) at /workspace/training.py:345
Trap 1: Add more training data.
Adding data increases memory requirements and would worsen the error.
Trap 2: Increase the learning rate.
Learning rate does not affect memory usage.
Trap 3: Switch to a CPU.
CPU may still run out of memory if the problem is not addressed, and training would be slower.
- A
Add more training data.
Why wrong: Adding data increases memory requirements and would worsen the error.
- B
Increase the learning rate.
Why wrong: Learning rate does not affect memory usage.
- C
Switch to a CPU.
Why wrong: CPU may still run out of memory if the problem is not addressed, and training would be slower.
- D
Reduce the batch size.
Smaller batches reduce the memory allocated for intermediate tensors.
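A rough back-of-the-envelope check of why a smaller batch helps, using the figures from the exhibit. This is only a sketch: it assumes the failed allocation scales linearly with batch size, and the starting batch size of 64 is a made-up value that does not appear in the exhibit. Real memory use also includes weights, optimizer state, and fragmentation.

```python
GIB = 1024 ** 3

def largest_fitting_batch(batch_size, requested_bytes, free_bytes):
    """Halve batch_size until the (assumed linear) allocation fits."""
    per_sample = requested_bytes / batch_size
    while batch_size > 1 and per_sample * batch_size > free_bytes:
        batch_size //= 2
    return batch_size

# Exhibit: tried to allocate 2.00 GiB with 1.50 GiB free.
# The batch size of 64 is an assumption for illustration.
print(largest_fitting_batch(64, 2.00 * GIB, 1.50 * GIB))  # prints 32
```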
A company deploys an AI model via a REST API that handles sensitive customer data. To secure the endpoint, the security team requires that only authenticated and authorized applications can invoke the API. Which mechanism should be implemented?
Trap 1: TLS encryption for the connection
Standard (one-way) TLS encrypts data in transit, but does not authenticate the client application.
Trap 2: Input sanitization to prevent injection
Input sanitization prevents code injection, not authentication.
Trap 3: IP whitelisting
IP whitelisting allows any request from a trusted IP, but does not authenticate individual applications.
- A
API key or bearer token in the HTTP header
API keys/tokens authenticate the caller and are standard for API security.
- B
TLS encryption for the connection
Why wrong: Standard (one-way) TLS encrypts data in transit, but does not authenticate the client application.
- C
Input sanitization to prevent injection
Why wrong: Input sanitization prevents code injection, not authentication.
- D
IP whitelisting
Why wrong: IP whitelisting allows any request from a trusted IP, but does not authenticate individual applications.
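A server-side bearer-token check might look like the sketch below. The function name `is_authorized` and the token values are hypothetical, and a production service would normally also map each token to an application and its permissions.

```python
import hmac

def is_authorized(headers, valid_tokens):
    """Return True if the Authorization header carries a known bearer token."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest compares in constant time to limit timing side channels.
    return any(
        hmac.compare_digest(token.encode(), known.encode())
        for known in valid_tokens
    )

# Hypothetical token store for illustration only.
VALID_TOKENS = {"app-one-secret", "app-two-secret"}
ok = is_authorized({"Authorization": "Bearer app-one-secret"}, VALID_TOKENS)
```

The check runs over a connection that should still be protected by TLS, since a token sent in clear text can be captured and replayed.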
During an AI model deployment, the operations team notices that inference requests are taking longer than expected. Which component is most likely causing the bottleneck?
Trap 1: Input data preprocessing pipeline
Preprocessing is typically lightweight.
Trap 2: API gateway rate limiting
Rate limiting affects throughput, not per-request latency.
Trap 3: Database connection pool size
Databases are used for storage, not inference.
- A
Input data preprocessing pipeline
Why wrong: Preprocessing is typically lightweight.
- B
API gateway rate limiting
Why wrong: Rate limiting affects throughput, not per-request latency.
- C
Database connection pool size
Why wrong: Databases are used for storage, not inference.
- D
The machine learning model's size and architecture
Larger models take longer to compute predictions.
During model monitoring, a loan approval model shows disparate impact against a protected group. The model's overall accuracy is high, but the false positive rate for the protected group is 0.12 compared to 0.02 for other groups. Which action should the operations team take first?
Trap 1: Document the disparity and proceed with deployment because accuracy…
Ignoring disparate impact is not acceptable ethical practice.
Trap 2: Replace the model with a simpler model that is less discriminatory
Simpler models may not capture complexity and could still be biased.
Trap 3: Adjust the decision threshold for the protected group to equalize…
This is a temporary fix and may not address underlying bias.
- A
Document the disparity and proceed with deployment because accuracy is high
Why wrong: Ignoring disparate impact is not acceptable ethical practice.
- B
Replace the model with a simpler model that is less discriminatory
Why wrong: Simpler models may not capture complexity and could still be biased.
- C
Retrain the model with reweighted training data to minimize disparity
Retraining on reweighted data directly mitigates the bias learned by the model rather than patching its outputs.
- D
Adjust the decision threshold for the protected group to equalize false positive rates
Why wrong: This is a temporary fix and may not address underlying bias.
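The disparity in the question is a per-group false positive rate. A minimal sketch of how that could be measured follows; the data is synthetic and chosen only to reproduce the 0.12 and 0.02 figures, and the group labels "a" and "b" are placeholders.

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), computed over the actual negatives."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0

def fpr_by_group(y_true, y_pred, groups):
    """Compute the false positive rate separately for each group label."""
    result = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = false_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return result

# Synthetic data: 50 actual negatives per group.
y_true = [0] * 100
y_pred = [1] * 6 + [0] * 44 + [1] * 1 + [0] * 49
groups = ["a"] * 50 + ["b"] * 50
rates = fpr_by_group(y_true, y_pred, groups)
```

Tracking this metric before and after retraining shows whether the reweighting actually narrowed the gap.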
A healthcare company must deploy a diagnostic AI model that uses protected health information (PHI). To comply with HIPAA, the operations team needs to ensure data privacy during model inference. Which practice should be implemented?
Trap 1: Run the model on-premises to avoid cloud data transmission
On-premises doesn't guarantee encryption or compliance automatically.
Trap 2: Mask sensitive fields in the input data before inference
Masking may alter the input and affect model output.
Trap 3: Apply differential privacy during model training only
Differential privacy protects training data, not inference data.
- A
Run the model on-premises to avoid cloud data transmission
Why wrong: On-premises doesn't guarantee encryption or compliance automatically.
- B
Encrypt all PHI at rest and in transit within the inference pipeline
Encryption ensures confidentiality of PHI.
- C
Mask sensitive fields in the input data before inference
Why wrong: Masking may alter the input and affect model output.
- D
Apply differential privacy during model training only
Why wrong: Differential privacy protects training data, not inference data.
A model trained on a dataset with imbalanced classes achieves 98% accuracy but only 50% recall for the minority class. Which technique should be applied first to address the imbalance?
Trap 1: Reduce the majority class size
Undersampling may discard valuable data and reduce overall performance.
Trap 2: Use SMOTE to generate synthetic samples
SMOTE is a valid approach but may create noisy or unrealistic samples; cost-sensitive learning is a simpler first step.
Trap 3: Collect more data for the minority class
Although beneficial, collecting additional data is often not the most immediate or feasible option.
- A
Apply cost-sensitive learning
Cost-sensitive learning adjusts class weights in the loss function, directly tackling imbalance without data modification.
- B
Reduce the majority class size
Why wrong: Undersampling may discard valuable data and reduce overall performance.
- C
Use SMOTE to generate synthetic samples
Why wrong: SMOTE is a valid approach but may create noisy or unrealistic samples; cost-sensitive learning is a simpler first step.
- D
Collect more data for the minority class
Why wrong: Although beneficial, collecting additional data is often not the most immediate or feasible option.
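Cost-sensitive learning usually starts from per-class weights. The sketch below uses the common inverse-frequency heuristic, weight = n_samples / (n_classes * class_count); the helper names are hypothetical and the labels are synthetic.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Total misclassification cost, each error scaled by its true class weight."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)

# Synthetic labels: 95 majority samples, 5 minority samples.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# A model that always predicts the majority class misses all 5 minority samples.
cost = weighted_error(labels, [0] * 100, weights)
```

With these weights, one missed minority sample costs as much as nineteen misclassified majority samples, so the training objective can no longer ignore the minority class.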
An MLOps team automates model deployment with a CI/CD pipeline. A performance regression is detected after deploying a new model version. The team needs to automatically roll back to the previous version. Which approach best enables safe automated rollback?
Trap 1: Maintain a manual rollback script that the operations team can run
Manual processes are not automated and can be slow.
Trap 2: Deploy new models as canary releases and monitor for 24 hours
Canary is gradual but rollback may not be fully automated.
Trap 3: Automatically keep the previous model version in storage for later…
Storage alone doesn't handle traffic switching.
- A
Use a blue/green deployment with automated health checks and traffic switching
Blue/green allows instant rollback by redirecting traffic.
- B
Maintain a manual rollback script that the operations team can run
Why wrong: Manual processes are not automated and can be slow.
- C
Deploy new models as canary releases and monitor for 24 hours
Why wrong: Canary is gradual but rollback may not be fully automated.
- D
Automatically keep the previous model version in storage for later use
Why wrong: Storage alone doesn't handle traffic switching.
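The traffic-switching logic of a blue/green setup can be sketched as a small router. This is an illustration only: the class, the version names, and the health-check callables are hypothetical, and a real deployment would switch traffic at a load balancer or service mesh.

```python
class BlueGreenRouter:
    """Tracks which of two running model versions receives traffic."""

    def __init__(self, live, idle):
        self.live, self.idle = live, idle

    def promote(self, health_check):
        """Switch traffic to the idle version only if it passes the check."""
        if health_check(self.idle):
            self.live, self.idle = self.idle, self.live
            return True
        return False

    def rollback(self):
        """Point traffic back at the previous version, which is still running."""
        self.live, self.idle = self.idle, self.live

def monitor(router, health_check):
    """Automated rollback: revert if the live version fails its check."""
    if not health_check(router.live):
        router.rollback()
    return router.live

router = BlueGreenRouter(live="model-v1", idle="model-v2")
router.promote(lambda version: True)          # v2 goes live
# Simulate a regression detected on v2 after the switch.
current = monitor(router, lambda version: version != "model-v2")
```

Because the old version keeps running in the idle slot, rollback is a pointer swap rather than a redeployment.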
Refer to the exhibit. A team created an access policy for a fraud detection model endpoint. An intern reports being unable to access the model for testing. Reviewing the policy, what is the most likely cause?
Exhibit
Refer to the exhibit.
```json
{
  "model_policy": {
    "model": "fraud-detection-v3",
    "allowed_roles": ["data_scientist", "ml_engineer"],
    "denied_roles": ["intern"],
    "endpoint": "/api/v1/predict"
  }
}
```
Trap 1: The intern's role is not included in the allowed roles
That alone wouldn't cause denial if not explicitly denied.
Trap 2: The policy JSON has a syntax error
The JSON is valid.
Trap 3: The endpoint path is incorrect
Exhibit shows correct path.
- A
The intern's role is not included in the allowed roles
Why wrong: That alone wouldn't cause denial if not explicitly denied.
- B
The policy JSON has a syntax error
Why wrong: The JSON is valid.
- C
The endpoint path is incorrect
Why wrong: Exhibit shows correct path.
- D
The intern's role is explicitly denied in the policy
Denied roles override any allowed list.
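How a deny-overrides policy could be evaluated is sketched below using the exhibit's JSON. The `is_allowed` function is hypothetical; it illustrates the deny-first ordering described in the answer, not the behaviour of any specific access-control product.

```python
import json

POLICY_JSON = """
{
  "model_policy": {
    "model": "fraud-detection-v3",
    "allowed_roles": ["data_scientist", "ml_engineer"],
    "denied_roles": ["intern"],
    "endpoint": "/api/v1/predict"
  }
}
"""

def is_allowed(policy, role):
    """Explicit deny wins; otherwise the role must be on the allow list."""
    rules = policy["model_policy"]
    if role in rules.get("denied_roles", []):
        return False
    return role in rules.get("allowed_roles", [])

policy = json.loads(POLICY_JSON)
intern_access = is_allowed(policy, "intern")
engineer_access = is_allowed(policy, "ml_engineer")
```

Parsing the document with `json.loads` also confirms that the exhibit is syntactically valid JSON, which rules out option B.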
A dataset for a binary classification problem has 95% of samples in class "0" and 5% in class "1". The data scientist trains a logistic regression model and achieves 95% accuracy. Which metric should the scientist primarily use to evaluate model performance?
Trap 1: R-squared.
R-squared is for regression.
Trap 2: Accuracy.
Accuracy is high due to majority class, masking poor performance on minority class.
Trap 3: Mean squared error.
MSE is for regression, not classification.
- A
Precision, recall, and F1-score.
These metrics evaluate performance on the minority class, crucial for imbalanced data.
- B
R-squared.
Why wrong: R-squared is for regression.
- C
Accuracy.
Why wrong: Accuracy is high due to majority class, masking poor performance on minority class.
- D
Mean squared error.
Why wrong: MSE is for regression, not classification.
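The way accuracy hides minority-class failure can be shown with a small sketch. The labels are synthetic and match the 95/5 split in the question; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 95% class "0", 5% class "1"; the model predicts "0" for everything.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
report = binary_metrics(y_true, y_pred)
```

Here accuracy is 0.95 while recall and F1 for class "1" are both zero, which is exactly the situation the question describes.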
A data scientist is evaluating a binary classification model for fraud detection. The dataset is highly imbalanced (99% non-fraud, 1% fraud). Which TWO metrics are most appropriate for assessing model performance? (Choose two.)
Trap 1: F1 score
F1 is useful but the question asks for two; precision and recall are more direct.
Trap 2: Area under the ROC curve (AUC-ROC)
AUC-ROC is less informative for imbalanced datasets; precision-recall is preferred.
Trap 3: Accuracy
Accuracy is high even if the model predicts all non-fraud, so it's misleading.
- A
Precision
Precision measures the proportion of predicted fraud that is actually fraud, important to avoid false positives.
- B
Recall
Recall measures the proportion of actual fraud that is detected, critical for catching fraud.
- C
F1 score
Why wrong: F1 is useful but the question asks for two; precision and recall are more direct.
- D
Area under the ROC curve (AUC-ROC)
Why wrong: AUC-ROC is less informative for imbalanced datasets; precision-recall is preferred.
- E
Accuracy
Why wrong: Accuracy is high even if the model predicts all non-fraud, so it's misleading.
A data engineer is building a pipeline to ingest streaming data from IoT sensors. Which data storage solution is best suited for real-time analytics on timestamped sensor readings?
Trap 1: Data warehouse
Data warehouses are designed for batch analytical queries, not real-time streaming ingestion.
Trap 2: Relational database
Relational databases are optimized for transactional workloads and may not handle high-velocity time-series data efficiently.
Trap 3: Data lake
Data lakes are for raw storage and are not optimized for low-latency real-time queries.
- A
Data warehouse
Why wrong: Data warehouses are designed for batch analytical queries, not real-time streaming ingestion.
- B
Relational database
Why wrong: Relational databases are optimized for transactional workloads and may not handle high-velocity time-series data efficiently.
- C
Data lake
Why wrong: Data lakes are for raw storage and are not optimized for low-latency real-time queries.
- D
Time-series database
Time-series databases provide specialized indexing, compression, and query capabilities for timestamped data.
While training a deep neural network, the loss function fails to converge and oscillates wildly. Which adjustment is most likely to stabilize training?
Trap 1: Increase the number of hidden layers
More layers increase complexity and may worsen oscillations.
Trap 2: Decrease the batch size
Smaller batch sizes introduce more noise, which can increase oscillations.
Trap 3: Use a test set
Test set is for evaluation, not for training stability.
- A
Increase the number of hidden layers
Why wrong: More layers increase complexity and may worsen oscillations.
- B
Decrease the batch size
Why wrong: Smaller batch sizes introduce more noise, which can increase oscillations.
- C
Reduce the learning rate
Lower learning rate reduces step size, stabilizing training.
- D
Use a test set
Why wrong: Test set is for evaluation, not for training stability.
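The effect of step size can be seen on a toy problem. The sketch below minimises f(x) = x² with plain gradient descent; the function and the two learning rates are chosen only for illustration and say nothing about suitable values for a real network.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimise f(x) = x**2 (gradient 2x); return the loss after each step."""
    x, losses = x0, []
    for _ in range(steps):
        x -= lr * 2 * x      # update: x <- (1 - 2 * lr) * x
        losses.append(x * x)
    return losses

# lr = 1.1 gives a factor of -1.2 per step: x flips sign and grows.
high_lr = gradient_descent(lr=1.1)
# lr = 0.1 gives a factor of 0.8 per step: x shrinks towards the minimum.
low_lr = gradient_descent(lr=0.1)
```

With the large step the iterate overshoots the minimum on every update and the loss keeps growing, while the small step converges smoothly.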
A data engineer needs to store training data in a format that supports columnar pruning during model training. Which storage format should they use?
Trap 1: XML
XML is row-oriented and heavily nested, inefficient for columnar operations.
Trap 2: JSON
JSON is row-oriented and verbose, not optimized for columnar access.
Trap 3: CSV
CSV is row-oriented; reading all columns is required even if only a few are needed.
- A
Parquet
Parquet is columnar, enabling compression and pruning, reducing I/O.
- B
XML
Why wrong: XML is row-oriented and heavily nested, inefficient for columnar operations.
- C
JSON
Why wrong: JSON is row-oriented and verbose, not optimized for columnar access.
- D
CSV
Why wrong: CSV is row-oriented; reading all columns is required even if only a few are needed.
Which TWO of the following are common methods for mitigating bias in AI models?
Trap 1: Using adversarial training
Adversarial training improves robustness to adversarial examples, not bias.
Trap 2: Applying L1 regularization
L1 regularization induces sparsity, not fairness.
Trap 3: Performing k-fold cross-validation
Cross-validation assesses performance, does not mitigate bias.
- A
Using adversarial training
Why wrong: Adversarial training improves robustness to adversarial examples, not bias.
- B
Reweighting training samples based on sensitive attributes
Reweighting can adjust for underrepresented groups to reduce bias.
- C
Applying L1 regularization
Why wrong: L1 regularization induces sparsity, not fairness.
- D
Adding fairness constraints during training
Fairness constraints directly enforce fairness during model training.
- E
Performing k-fold cross-validation
Why wrong: Cross-validation assesses performance, does not mitigate bias.
An AI system is being designed to automatically detect fraudulent transactions in real-time. The system must have low latency and high precision to minimize false alarms. Which algorithm is most appropriate?
Trap 1: Logistic regression
Logistic regression is linear and may not capture complex patterns in fraud detection, potentially leading to lower precision.
Trap 2: Convolutional neural network
CNNs are designed for image data and are not suitable for tabular fraud detection; they also have higher computational overhead.
Trap 3: Deep reinforcement learning
Deep reinforcement learning is complex and typically not used for static classification tasks; it requires extensive training and may have high latency.
- A
Logistic regression
Why wrong: Logistic regression is linear and may not capture complex patterns in fraud detection, potentially leading to lower precision.
- B
Convolutional neural network
Why wrong: CNNs are designed for image data and are not suitable for tabular fraud detection; they also have higher computational overhead.
- C
Deep reinforcement learning
Why wrong: Deep reinforcement learning is complex and typically not used for static classification tasks; it requires extensive training and may have high latency.
- D
Random forest
Random forest provides high accuracy and precision with low inference latency, making it ideal for real-time fraud detection.