CCNA AI Models and Data Engineering Questions

75 of 95 questions · Page 1/2 · AI Models and Data Engineering · Answers revealed

1
MCQhard

A data scientist trains a deep learning model on a large dataset. The training loss decreases steadily but the validation loss starts increasing after 20 epochs. The scientist uses early stopping with patience=5. Which of the following is the MOST likely cause and best corrective action?

A.Model is overfitting; add dropout regularization.
B.Training data is not representative; collect more data.
C.Model is underfitting; increase model capacity.
D.Learning rate too high; reduce learning rate.
AnswerA

Diverging validation loss after training loss decrease is classic overfitting; dropout helps.

Why this answer

The training loss decreasing while validation loss increasing after 20 epochs is a classic sign of overfitting, where the model memorizes training data noise instead of generalizing. Early stopping with patience=5 would halt training after 5 epochs of no validation improvement, but the root cause is overfitting. Adding dropout regularization randomly drops neurons during training, forcing the network to learn more robust features and reducing overfitting.

Exam trap

CompTIA often tests the distinction between overfitting and underfitting by showing a diverging validation loss curve, and the trap here is that candidates may confuse overfitting with a learning rate issue or data quality problem, leading them to choose 'reduce learning rate' or 'collect more data' instead of the correct regularization technique.

How to eliminate wrong answers

Option B is wrong because the validation loss increasing while training loss decreases indicates overfitting, not unrepresentative data; collecting more data might help but is not the most direct corrective action for overfitting. Option C is wrong because underfitting would show high training loss that does not decrease, not a decreasing training loss with increasing validation loss. Option D is wrong because a high learning rate would typically cause training loss to oscillate or diverge, not steadily decrease; reducing learning rate addresses convergence issues, not overfitting.
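
The early stopping behaviour described in this question can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: the function name `early_stopping_epoch` and the loss values are hypothetical, chosen so that validation loss improves for 20 epochs and then rises.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical curve: validation loss falls for 20 epochs, then climbs.
val_losses = [1.0 - 0.02 * e for e in range(20)] + [0.65 + 0.01 * e for e in range(10)]
stop_epoch = early_stopping_epoch(val_losses, patience=5)
print(stop_epoch)  # 25: five epochs after the best epoch (20)
```

Early stopping only limits how long the model overfits; it does not change the model's tendency to overfit, which is why the answer pairs the diagnosis with a regularizer such as dropout.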

2
MCQeasy

A data engineer is splitting a dataset into training, validation, and test sets for a machine learning project. The dataset is large and representative of the population. Which split ratio is commonly recommended?

A.90% training, 5% validation, 5% test
B.70% training, 20% validation, 10% test
C.50% training, 25% validation, 25% test
D.80% training, 10% validation, 10% test
AnswerD

This is a standard split, providing ample training data and reliable validation and test sets.

Why this answer

Option D is correct because 80/10/10 is a typical split for large datasets, ensuring enough training data while having separate validation and test sets. Option C is wrong because 50/25/25 has too little training data. Option A is wrong because 90/5/5 gives too little validation data.

Option B is wrong because 70/20/10, while also reasonable, is less common for large datasets; the question asks for the 'commonly recommended' split, and among the options 80/10/10 is the standard choice.
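
An 80/10/10 split can be sketched with the standard library alone. The helper name `split_dataset`, the fixed seed, and the 1,000-row stand-in dataset are assumptions made for illustration.

```python
import random


def split_dataset(rows, train=0.8, val=0.1, seed=0):
    """Shuffle rows and split them into train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    train_set = rows[:n_train]
    val_set = rows[n_train:n_train + n_val]
    test_set = rows[n_train + n_val:]  # whatever remains is the test set
    return train_set, val_set, test_set


train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```

Shuffling before slicing matters when the source rows are ordered (for example by date or by class); for time-series data a chronological split would usually be preferred instead.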

3
Multi-Selecteasy

A data scientist is cleaning a dataset. Which TWO actions are appropriate for handling missing data?

Select 2 answers
A.Ignore missing values and train the model directly.
B.Use a predictive model to estimate missing values.
C.Impute missing values with the mean of the entire dataset.
D.Delete rows with missing values if the missing rate is low.
E.Replace missing values with the most frequent value always.
AnswersB, D

Predictive imputation uses relationships in data, a valid advanced method.

Why this answer

Option B is correct because using a predictive model to estimate missing values is a sophisticated imputation technique that leverages relationships between features to fill gaps, reducing the bias that cruder fills can introduce. This approach is particularly useful when data is not missing completely at random, as it can capture complex patterns that simpler methods miss. Option D is also correct because deleting rows is acceptable when the missing rate is low: little information is lost and no artificial values are introduced.

Exam trap

CompTIA often tests the misconception that simple imputation methods like mean or mode are always safe, when in fact they can introduce bias and distort the dataset, making predictive imputation or deletion of rows with low missing rates more appropriate depending on the context.

4
MCQeasy

A logistics company uses a machine learning model to predict delivery times based on historical data. The model was performing well, but recently it started making inaccurate predictions, especially for routes that have experienced new traffic patterns and road closures. The data engineering team receives an alert that the model's accuracy has dropped by 15% over the last week. They suspect data drift. The team has access to the original training data and a continuous stream of new data. What is the most appropriate first step for the team to take?

A.Roll back the model to the previous stable version and schedule a full audit of the data pipeline.
B.Compare the distributions of key features between the training data and the recent data to quantify data drift.
C.Immediately retrain the model using the most recent data to adapt to the new patterns.
D.Add more features to the model to capture the new traffic patterns and road closures.
AnswerB

Identifying drift by comparing distributions is the standard first step to diagnose the problem before taking corrective action.

Why this answer

Option B is correct because the first step in diagnosing a suspected data drift is to statistically compare the distributions of key features between the training data and the recent streaming data. This quantifies whether the input data distribution has changed, which directly explains the accuracy drop. Without this analysis, any corrective action (like retraining or rollback) would be premature and could mask the root cause.

Exam trap

CompTIA often tests the misconception that the immediate response to a performance drop should be retraining or rollback, rather than first diagnosing the type of drift (data drift vs. concept drift) through distribution comparison.

How to eliminate wrong answers

Option A is wrong because rolling back the model without first confirming data drift wastes time and may not address the new traffic patterns; it assumes the previous model is still valid, which is false if drift is present. Option C is wrong because immediately retraining on recent data without verifying drift could introduce bias or overfit to transient noise, and it ignores the need to first understand what changed. Option D is wrong because adding features without first analyzing drift is a blind attempt that may not solve the distribution shift and could increase model complexity unnecessarily.
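
One way to quantify the distribution comparison is the Population Stability Index (PSI), sketched below for a single numeric feature. PSI is a swapped-in choice here, since the question does not name a specific statistic; the function name, the bin count, and the delivery-time values are all hypothetical.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a training sample (expected) and a recent sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical delivery times in minutes: recent routes run 20 minutes longer.
training_times = [i % 60 for i in range(600)]
recent_times = [t + 20 for t in training_times]

psi_same = population_stability_index(training_times, training_times)
psi_shifted = population_stability_index(training_times, recent_times)
print(psi_same, psi_shifted)
```

A PSI of zero means the binned distributions match; larger values indicate a larger shift. In practice the same comparison would be run per feature so the team can see which inputs drifted.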

5
Multi-Selecteasy

A data scientist is evaluating a logistic regression model for binary classification on highly imbalanced data. Which TWO metrics are most appropriate to assess model performance? (Choose TWO.)

Select 2 answers
A.Accuracy
B.Recall
C.Precision
D.Mean squared error (MSE)
E.F1 score
AnswersB, C

Recall measures the proportion of actual positives correctly identified, critical for minority class performance.

Why this answer

Precision and Recall directly measure the model's ability to correctly identify positive (minority) instances and avoid false positives. Accuracy is misleading when classes are imbalanced. MSE is for regression.

F1 score combines precision and recall, but the question asks for two metrics, and precision and recall are fundamental.
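
The reason accuracy misleads on imbalanced data can be shown with a small worked example. The helper `precision_recall_f1` and the 95/5 label split are illustrative assumptions; the model here is a degenerate one that always predicts the majority class.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.95 0.0 0.0 0.0
```

The model scores 95% accuracy while finding none of the positive cases, which is exactly what precision and recall expose.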

6
Multi-Selectmedium

A natural language processing (NLP) team is building a sentiment analysis model. The raw text data contains punctuation, stop words, and URLs. Which TWO preprocessing steps are most appropriate to improve model performance? (Choose two.)

Select 2 answers
A.Remove all punctuation and URLs
B.Apply stemming to reduce words to root forms
C.Remove common stop words
D.Convert all text to lowercase
E.Tokenize the text into individual words
AnswersA, C

Punctuation and URLs are typically not useful for sentiment and add noise.

Why this answer

Options A and C are correct. Punctuation and URLs rarely carry sentiment, so removing them standardizes the text, and removing common stop words cuts noise from high-frequency words that add little signal. Option B (stemming) is useful but not always necessary. Option D (converting to lowercase) is a standard, basic step that is usually applied anyway, so it is not among the two best answers for this scenario. Option E (tokenization) is fundamental to any NLP pipeline, but the question asks for specific preprocessing that improves performance.

Typical preprocessing includes lowercasing, removing punctuation, removing stop words, and tokenization. Of these, removing stop words and removing punctuation and URLs most directly reduce the noise described in the raw text, which is why they are the two most appropriate choices.
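
Removing URLs, punctuation, and stop words can be sketched with the standard library. The stop word list below is a tiny illustrative set, not a real linguistic resource, and the function name `clean_text` is hypothetical.

```python
import re
import string

# Tiny illustrative stop word list; real pipelines use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of", "this", "it"}


def clean_text(text):
    """Strip URLs and punctuation, then drop stop words."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs first
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = clean_text("This phone is great!!! See https://example.com/review for details.")
print(tokens)  # ['phone', 'great', 'See', 'for', 'details']
```

URLs are removed before punctuation so that the URL pattern still matches; stripping punctuation first would break `https://` apart and leave fragments behind.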

7
MCQmedium

A machine learning team is deploying a sentiment analysis model for customer reviews. The model was trained on reviews from an e-commerce site but will be used for a social media platform. The team observes a drop in accuracy. Which concept best explains this issue?

A.Data drift
B.Concept drift
C.Bias-variance tradeoff
D.Overfitting
AnswerA

The distribution of reviews differs between e-commerce and social media.

Why this answer

Data drift occurs when the statistical properties of the input data change between the training and production environments. Here, the model was trained on e-commerce reviews but is now processing social media posts, which have different vocabulary, tone, and structure, causing a mismatch in the input distribution and leading to accuracy degradation.

Exam trap

CompTIA often tests the distinction between data drift (input distribution change) and concept drift (relationship change), and candidates mistakenly choose concept drift when the scenario describes a change in the input data source rather than a change in the underlying mapping from inputs to outputs.

How to eliminate wrong answers

Option B is wrong because concept drift refers to a change in the underlying relationship between input features and the target variable over time, not a change in the input data distribution itself. Option C is wrong because bias-variance tradeoff is a model selection concept describing the balance between underfitting and overfitting, not an explanation for performance drop due to data distribution shift. Option D is wrong because overfitting occurs when a model learns training data too well, including noise, and fails to generalize to new data from the same distribution, not to a different distribution.

8
MCQhard

A healthcare startup is deploying a machine learning model to predict patient readmission within 30 days using electronic health records (EHR). The data pipeline uses Apache Spark for preprocessing and training on an Amazon EMR cluster. The training dataset is 50 GB and composed of structured numeric and categorical features, along with unstructured clinical notes. The data scientist observes that training takes over 12 hours and frequently fails due to out-of-memory (OOM) errors, especially when processing the clinical notes via TF-IDF vectorization. The cluster has 10 nodes with 64 GB RAM each. The data engineer has already tried increasing spark.sql.shuffle.partitions to 400 and using Kryo serialization, but OOM persists. Which action should the data engineer take next to resolve the OOM errors?

A.Broadcast the TF-IDF model to all executors to avoid shuffling
B.Repartition the clinical notes data into 2000 partitions before TF-IDF
C.Add 10 more nodes to the cluster to increase total memory
D.Use a single executor with 64 GB and increase driver memory to 128 GB
AnswerB

More partitions reduce the data per executor, mitigating OOM during vectorization.

Why this answer

Option B is correct because repartitioning the clinical notes data into 2000 partitions before TF-IDF vectorization increases parallelism and reduces the memory pressure per partition. The default partition count (often based on spark.default.parallelism) is too low for 50 GB of data, causing individual partitions to exceed executor memory limits. By increasing partitions, each executor processes smaller chunks, preventing OOM errors during the memory-intensive TF-IDF stage.

Exam trap

CompTIA often tests the misconception that increasing cluster resources (nodes or memory) alone solves OOM errors, when the real fix is to optimize data partitioning and parallelism within Spark's execution model.

How to eliminate wrong answers

Option A is wrong because broadcasting the TF-IDF model does not address the root cause of OOM; the model itself is typically small, but the issue is the large volume of raw text data being processed per partition, not the model size. Option C is wrong because adding more nodes increases total cluster memory but does not fix the per-partition memory imbalance; without repartitioning, the same skewed partitions will still cause OOM on individual executors. Option D is wrong because using a single executor with 64 GB and increasing driver memory to 128 GB ignores the distributed nature of Spark; it would force all processing into one executor, causing severe memory contention and likely worse OOM, while also losing parallelism.
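
The effect of the partition count on per-partition size is simple arithmetic, sketched below. It assumes the full 50 GB is spread perfectly evenly across partitions, which real data rarely is; the helper name `mb_per_partition` is hypothetical. In PySpark the change itself would be a `repartition(2000)` call on the DataFrame before the TF-IDF stage.

```python
def mb_per_partition(dataset_gb, num_partitions):
    """Average partition size in MB, assuming a perfectly even distribution."""
    return dataset_gb * 1024 / num_partitions


for partitions in (400, 2000):
    size = mb_per_partition(50, partitions)
    print(f"{partitions} partitions -> {size:.1f} MB each")
# 400 partitions -> 128.0 MB each
# 2000 partitions -> 25.6 MB each
```

The on-disk size understates memory use during vectorization, since text expands when tokenized and turned into feature vectors, so smaller partitions leave more headroom per task.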

9
Multi-Selectmedium

Which THREE practices are recommended for versioning machine learning models in a production environment?

Select 3 answers
A.Use a model registry like MLflow or DVC.
B.Store model metadata such as hyperparameters and training data hash.
C.Automate model deployment based on version tags.
D.Use Git to version model binaries.
E.Keep only the latest model to save storage.
AnswersA, B, C

Model registries provide centralized versioning and lifecycle management.

Why this answer

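Option A is correct because a model registry like MLflow or DVC provides a centralized repository for tracking model versions, metadata, and lineage. This enables reproducibility, rollback, and auditability in production, which is essential for managing the lifecycle of machine learning models. Option B is correct because storing hyperparameters and a hash of the training data lets any version be traced back to exactly what produced it. Option C is correct because deploying from version tags keeps releases tied to a known, auditable model version.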

Exam trap

CompTIA often tests the misconception that Git is suitable for versioning all artifacts, including large binary model files, when in fact Git's architecture is optimized for text diffs and cannot efficiently manage model binaries in a production ML pipeline.
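
Storing hyperparameters together with a training data hash can be sketched as follows. The record layout, the function name `build_model_record`, and the sample values are assumptions for illustration and do not reflect the schema of any particular registry.

```python
import hashlib
import json


def build_model_record(version, hyperparameters, training_data: bytes):
    """Bundle a version tag, hyperparameters, and a SHA-256 of the training data."""
    return {
        "version": version,
        "hyperparameters": hyperparameters,
        "training_data_sha256": hashlib.sha256(training_data).hexdigest(),
    }


# Hypothetical inputs: a tiny CSV payload stands in for the real training set.
record = build_model_record(
    "1.2.0",
    {"learning_rate": 0.01, "max_depth": 6},
    b"id,label\n1,0\n2,1\n",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Because the hash changes whenever the training data changes, two model versions with the same hyperparameters but different hashes are immediately distinguishable.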

10
MCQhard

A team is training a deep learning model for image classification. The training loss decreases rapidly but validation loss starts increasing after a few epochs. Which regularization technique should be applied to mitigate this issue?

A.Data augmentation
B.L2 regularization
C.Early stopping
D.Dropout
AnswerC

Early stopping prevents overfitting by stopping training when validation loss starts to rise.

Why this answer

Option C is correct because early stopping halts training when validation loss starts to rise, preventing overfitting. Option B is wrong because L2 regularization penalizes large weights but doesn't stop training. Option D is wrong because dropout randomly drops neurons during training, but early stopping directly addresses the symptom.

Option A is wrong because data augmentation increases data diversity, but the issue is overfitting due to training too long.

11
MCQeasy

A company wants to deploy an AI model for real-time inference on edge devices with limited computational resources. Which model architecture would be MOST suitable?

A.YOLOv4
B.MobileNet
C.ResNet-152
D.BERT
AnswerB

MobileNet uses depthwise separable convolutions to reduce computation, ideal for edge deployment.

Why this answer

MobileNet is specifically designed for mobile and edge devices using depthwise separable convolutions, which drastically reduce the number of parameters and computational cost while maintaining acceptable accuracy. This makes it the most suitable choice for real-time inference on resource-constrained edge hardware.

Exam trap

CompTIA often tests the misconception that any 'lightweight' or 'fast' model (like YOLOv4) is suitable for edge devices, ignoring the specific architectural optimizations (e.g., depthwise separable convolutions) that MobileNet uniquely provides for extreme resource constraints.

How to eliminate wrong answers

Option A is wrong because YOLOv4, while fast for object detection, is still a large convolutional network requiring significant GPU memory and compute, making it impractical for low-power edge devices. Option C is wrong because ResNet-152 is a very deep residual network with 152 layers, optimized for high accuracy on powerful hardware, not for limited-resource edge deployment. Option D is wrong because BERT is a transformer-based NLP model with hundreds of millions of parameters, requiring substantial memory and compute, and is not designed for real-time inference on edge devices.
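
The saving from depthwise separable convolutions can be checked with a parameter count. This sketch ignores bias terms, and the layer sizes (3x3 kernel, 128 input channels, 256 output channels) are hypothetical rather than taken from a specific MobileNet layer.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one k x k filter per (input, output) pair."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Depthwise k x k filter per input channel, then a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


std = standard_conv_params(3, 128, 256)
sep = depthwise_separable_params(3, 128, 256)
print(std, sep, round(std / sep, 2))  # 294912 33920 8.69
```

For this layer the separable form needs roughly one ninth of the weights, which is the kind of reduction that makes the architecture practical on constrained hardware.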

12
MCQeasy

A data scientist notices that a binary classification model consistently predicts the majority class. Which data engineering technique should be applied?

A.Feature scaling
B.Dimensionality reduction
C.Polynomial features
D.Oversampling
AnswerD

Oversampling (e.g., SMOTE) creates synthetic samples of the minority class to balance the dataset.

Why this answer

Oversampling (Option D) is correct because the model's bias toward the majority class indicates a class imbalance problem. By synthetically increasing the number of minority class samples (e.g., using SMOTE or random oversampling), the training data becomes more balanced, allowing the classifier to learn decision boundaries that are not skewed toward the majority class.

Exam trap

CompTIA often tests the misconception that feature scaling or dimensionality reduction can fix class imbalance, when in reality these techniques address different issues like feature magnitude or curse of dimensionality, not skewed target distributions.

How to eliminate wrong answers

Option A is wrong because feature scaling normalizes the range of input features (e.g., via min-max scaling or standardization) but does not address class imbalance; it only prevents features with larger magnitudes from dominating gradient-based optimization. Option B is wrong because dimensionality reduction (e.g., PCA or t-SNE) reduces the number of features to combat overfitting or noise, but it does not alter the class distribution, so the majority class bias remains. Option C is wrong because polynomial features create interaction or higher-degree terms from existing features to capture non-linear relationships, but they do not change the ratio of majority to minority samples, leaving the imbalance untouched.
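
Random oversampling, the simpler of the two variants mentioned above, can be sketched as below. Note that this duplicates existing minority rows and is not SMOTE, which synthesizes new points by interpolation. The function name and the 90/10 dataset are hypothetical.

```python
import random


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = sum(1 for y in labels if y != minority_label)
    # Number of extra minority rows needed to match the majority class.
    extra = [rng.choice(minority) for _ in range(majority_count - len(minority))]
    return list(rows) + extra, list(labels) + [minority_label] * len(extra)


rows = list(range(100))
labels = [0] * 90 + [1] * 10
balanced_rows, balanced_labels = random_oversample(rows, labels, minority_label=1)
print(balanced_labels.count(0), balanced_labels.count(1))  # 90 90
```

Oversampling should be applied to the training split only; duplicating rows before splitting would leak copies of the same sample into the validation or test set.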

13
MCQmedium

A data scientist is building a regression model to predict house prices. The dataset contains features such as square footage, number of bedrooms, and year built. Initial model performance is poor, and the scientist suspects that feature engineering could help. Which approach is most likely to improve model accuracy?

A.Use only linear features because polynomial terms overfit
B.Remove all features except square footage to reduce noise
C.Create interaction terms such as bedrooms times square footage
D.Add random noise to the target variable to increase variance
AnswerC

Interaction terms capture combined effects of features, often improving regression models.

Why this answer

Creating interaction terms like bedrooms × square footage captures non-linear relationships and synergies between features that a linear model alone cannot represent. In real estate, the effect of square footage on price often depends on the number of bedrooms (e.g., a large house with few bedrooms may be less valuable), so interaction terms allow the model to learn these conditional patterns, directly improving predictive accuracy.

Exam trap

CompTIA often tests the misconception that adding more features always causes overfitting, when in fact carefully engineered interaction terms can reduce bias without excessive variance if regularized properly.

How to eliminate wrong answers

Option A is wrong because restricting to only linear features ignores potentially valuable non-linear patterns; polynomial terms can be regularized to avoid overfitting and are often necessary for complex relationships. Option B is wrong because removing all features except square footage discards important predictors like bedrooms and year built, which carry significant signal for house prices, thus increasing bias and reducing accuracy. Option D is wrong because adding random noise to the target variable artificially increases variance and corrupts the ground truth, making it harder for the model to learn the true underlying patterns and degrading performance.
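
Creating the interaction term from the correct answer is a one-line transformation. The field names (`sqft`, `bedrooms`, `year_built`) and the sample rows are hypothetical.

```python
def add_interaction(rows):
    """Return copies of the rows with a bedrooms x square-footage feature added."""
    return [{**r, "bedrooms_x_sqft": r["bedrooms"] * r["sqft"]} for r in rows]


houses = [
    {"sqft": 1500, "bedrooms": 3, "year_built": 1995},
    {"sqft": 3000, "bedrooms": 2, "year_built": 2010},
]
engineered = add_interaction(houses)
print([h["bedrooms_x_sqft"] for h in engineered])  # [4500, 6000]
```

The original columns are kept alongside the new one, so a linear model can fit both the individual effects and their combined effect.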

14
Multi-Selecthard

Which TWO strategies are effective for handling missing values in a dataset when the missingness is not random (MNAR)?

Select 2 answers
A.Multiple imputation using chained equations
B.Treat missing as a separate category (e.g., for categorical features)
C.Listwise deletion
D.KNN imputation
E.Mean imputation
AnswersA, B

Multiple imputation can handle MNAR if the imputation model incorporates variables that predict missingness.

Why this answer

Multiple imputation using chained equations (MICE) is effective here because it models each variable with missing values as a function of other variables, iteratively generating plausible values that preserve the relationships and uncertainty in the data. It can account for a systematic pattern of missingness by incorporating auxiliary variables that are correlated with both the missing values and the missingness mechanism. Strictly, MICE assumes data are missing at random (MAR), so under MNAR it reduces rather than eliminates bias, and only to the extent that those auxiliary variables capture the mechanism. Treating missing as a separate category is also effective because it lets the model use the fact of missingness itself as a signal.

Exam trap

CompTIA often tests the misconception that mean imputation or KNN imputation are safe defaults for any missing data pattern, but the trap here is that MNAR requires methods that explicitly model the missingness mechanism, which simple imputation techniques fail to do.
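
Treating missing values as their own category can be sketched in a few lines. The placeholder string `"MISSING"`, the function name, and the employment-status values are assumptions for illustration.

```python
def fill_missing_category(values, placeholder="MISSING"):
    """Replace None in a categorical column with an explicit category."""
    return [placeholder if v is None else v for v in values]


# Hypothetical categorical column with gaps.
employment = ["employed", None, "self-employed", None, "unemployed"]
filled = fill_missing_category(employment)
print(filled)
# ['employed', 'MISSING', 'self-employed', 'MISSING', 'unemployed']
```

After encoding, the placeholder becomes a feature value like any other, so the model can learn whether missingness itself is associated with the target.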

15
MCQhard

Refer to the exhibit. A data engineer notices that the batch processing step is taking too long and causing delays. Which change would most likely reduce the latency?

A.Increase the parallelism of the Spark job
B.Move feature engineering to the stream processing step in Flink
C.Replace Apache Flink with Apache Storm for stream processing
D.Change the output format from Parquet to CSV
AnswerB

Performing feature engineering in stream reduces batch processing time and overall latency.

Why this answer

Moving feature engineering from the batch Spark job to the stream processing Flink job reduces the workload on the batch step, making it faster. Replacing Flink, increasing parallelism, or changing output format do not address the bottleneck as effectively.

16
MCQmedium

A data pipeline processes customer data from multiple sources. The data quality check reveals duplicate records. Which step should the pipeline include to handle this?

A.Data deduplication
B.Data encryption
C.Data transformation
D.Data validation
AnswerA

Deduplication specifically targets and eliminates duplicate records.

Why this answer

Deduplication is the process of identifying and removing duplicate records to ensure data quality. Data validation checks for schema or format errors. Data transformation changes data structure or values.

Data encryption ensures security but does not address duplicates.
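
A deduplication step can be sketched as keeping the first record seen for each key. Which fields identify a duplicate is a design decision for the pipeline; the key fields and customer records below are hypothetical.

```python
def deduplicate(records, key_fields):
    """Keep the first record for each distinct combination of key fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


customers = [
    {"email": "ana@example.com", "name": "Ana", "source": "crm"},
    {"email": "bo@example.com", "name": "Bo", "source": "crm"},
    {"email": "ana@example.com", "name": "Ana", "source": "web"},
]
unique_customers = deduplicate(customers, key_fields=["email"])
print(len(unique_customers))  # 2
```

Matching on exact key values only catches exact duplicates; records that differ in case or whitespace would need normalizing before the key is built.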

17
Multi-Selecteasy

A data engineer is preparing a dataset for a binary classification model. The dataset has 10,000 samples with 100 features. To improve model performance and reduce training time, the engineer decides to perform feature selection. Which two techniques are appropriate for this task? (Select TWO).

Select 2 answers
A.Normalization
B.Recursive Feature Elimination (RFE)
C.L1 Regularization
D.One-Hot Encoding
E.Principal Component Analysis (PCA)
AnswersB, C

RFE selects features by removing the least important ones iteratively.

Why this answer

Recursive Feature Elimination (RFE) is an appropriate feature selection technique because it iteratively removes the least important features based on a model's feature importance scores or coefficients, directly reducing the feature count from 100 to a smaller subset. This improves model performance by eliminating irrelevant or redundant features and reduces training time by decreasing dimensionality. L1 regularization is also appropriate because its penalty drives some coefficients to exactly zero, which effectively removes those features from the model.

Exam trap

CompTIA often tests the distinction between feature selection (keeping original features) and dimensionality reduction (creating new features), so candidates mistakenly select PCA thinking it selects features, when it actually transforms them into principal components.

18
MCQmedium

A data engineer needs to design a data pipeline for a real-time fraud detection system. The system requires low-latency processing of streaming transactions. Which architecture is most appropriate?

A.Stream processing with Apache Kafka and Flink
B.Data lake with Apache Spark
C.Batch processing with Apache Hadoop
D.Microservices architecture with REST APIs
AnswerA

Stream processing provides low-latency real-time analysis.

Why this answer

Apache Kafka provides a distributed, fault-tolerant event streaming platform that ingests high-throughput transaction data with low latency, while Apache Flink offers true stream processing with exactly-once semantics and sub-second event-time processing. Together, they enable real-time fraud detection by analyzing transactions as they arrive, without the delays inherent in batch or micro-batch approaches.

Exam trap

CompTIA often tests the distinction between true stream processing (e.g., Flink, Kafka Streams) and micro-batch or near-real-time processing (e.g., Spark Streaming), where candidates mistakenly assume that any 'streaming' API (like Spark Streaming) is equivalent to low-latency stream processing.

How to eliminate wrong answers

Option B is wrong because a data lake with Apache Spark typically relies on micro-batch processing (e.g., Spark Streaming with a minimum batch interval of ~100ms), which introduces higher latency than true stream processing and is unsuitable for sub-second fraud detection. Option C is wrong because batch processing with Apache Hadoop (e.g., MapReduce) is designed for high-throughput, high-latency processing of large static datasets, not for real-time streaming where transactions must be evaluated within milliseconds. Option D is wrong because microservices architecture with REST APIs is a design pattern for building distributed services, not a data pipeline technology; REST APIs introduce synchronous request-response overhead and cannot natively handle continuous, unbounded data streams with low-latency stateful processing.

19
MCQhard

A fraud detection model has high precision but low recall. The cost of false negatives is very high. Which threshold adjustment should be made?

A.Use class weights during training
B.Apply SMOTE to the training data
C.Decrease classification threshold
D.Increase classification threshold
AnswerC

Decreasing the threshold increases the number of positive predictions, raising recall and reducing false negatives.

Why this answer

Option C is correct because decreasing the classification threshold classifies more samples as positive, increasing recall (at the cost of some precision). Option D (increasing the threshold) would increase precision but further reduce recall. Option A (class weights) is a training-time adjustment, not a threshold adjustment.

Option B (SMOTE) addresses imbalance but is applied to training data, not the prediction threshold.
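
The precision and recall trade-off from lowering the threshold can be seen on a small set of scores. The labels, scores, and the two thresholds (0.5 and 0.3) are hypothetical values picked to make the effect visible.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical fraud labels (1 = fraud) and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.35, 0.45, 0.2, 0.1, 0.05]

print(precision_recall_at(y_true, scores, 0.5))  # (1.0, 0.5)
print(precision_recall_at(y_true, scores, 0.3))  # (0.8, 1.0)
```

Lowering the threshold from 0.5 to 0.3 catches every fraud case in this sample at the price of one false alarm, which is the right trade when false negatives are the costly error.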

20
MCQhard

A machine learning team is developing a model to predict server failure from telemetry data. They use a deep neural network with 3 hidden layers. After training, the model achieves 99% accuracy on training data but only 85% on validation data. Which technique should the team apply to reduce the generalization error?

A.Increase the number of hidden layers
B.Apply L2 regularization
C.Increase the learning rate
D.Add more training data
AnswerB

Regularization adds a penalty on large weights, reducing overfitting and improving generalization.

Why this answer

The large gap between training and validation accuracy indicates overfitting. L2 regularization penalizes large weights and reduces overfitting. Increasing layers or learning rate would exacerbate overfitting, and adding data helps but may not be immediately available.
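
How L2 regularization penalizes large weights can be sketched with the penalty term and one gradient step. This uses the common formulation in which the penalty is (lambda / 2) times the sum of squared weights; the weights, learning rate, and lambda below are hypothetical.

```python
def l2_penalty(weights, lam):
    """L2 term added to the loss: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)


def sgd_step_with_l2(weights, grads, lr, lam):
    """One gradient step where the penalty contributes lam * w to each gradient."""
    return [w - lr * (g + lam * w) for w, g in zip(weights, grads)]


weights = [3.0, -4.0]
penalty = l2_penalty(weights, lam=0.1)

# With a zero data gradient, the step comes from the penalty alone.
new_weights = sgd_step_with_l2(weights, grads=[0.0, 0.0], lr=0.1, lam=0.1)
print(penalty, new_weights)
```

Even with no gradient from the data, each weight is pulled toward zero by a factor of (1 - lr * lam) per step, which is how the penalty discourages the large weights associated with overfitting.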

21
MCQhard

A credit risk model is being developed to predict loan defaults. The dataset has 95% non-default and 5% default instances. The data scientist trains a logistic regression model and obtains 95% accuracy, but the recall for defaults is only 10%. Which action is most appropriate to improve the model's ability to identify defaults?

A.Apply principal component analysis (PCA) to reduce dimensionality
B.Collect more data from loan applicants to increase dataset size
C.Undersample the non-default class to match the number of defaults
D.Use SMOTE to oversample the default class
AnswerD

SMOTE creates synthetic samples, balancing classes and improving recall.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) is the most appropriate action because it generates synthetic samples for the minority class (defaults) rather than simply duplicating existing ones. This directly addresses the severe class imbalance (95% non-default vs. 5% default) that causes the logistic regression model to achieve high accuracy by predicting nearly all instances as non-default, while failing to identify actual defaults (recall of only 10%). By creating realistic synthetic default instances, SMOTE balances the training data and forces the model to learn decision boundaries that better capture the minority class.

Exam trap

CompTIA often tests the misconception that undersampling the majority class is always better than oversampling the minority class, but in this scenario, undersampling would discard valuable non-default patterns and reduce model robustness, whereas SMOTE generates new, realistic default samples without data loss.

How to eliminate wrong answers

Option A is wrong because PCA reduces dimensionality by projecting data onto principal components, which does not address class imbalance and can even discard variance that distinguishes defaults from non-defaults. Option B is wrong because simply collecting more data does not guarantee a better ratio of defaults; if the underlying population imbalance remains, the model will still be biased toward the majority class. Option C is wrong because undersampling the non-default class discards a large amount of potentially useful data, which can lead to loss of information and reduced model performance, especially when the majority class contains important patterns.

22
MCQeasy

A company streams sensor data from IoT devices. The data arrives as JSON messages at high velocity. Which data pipeline architecture is BEST suited to handle this streaming data for near-real-time analytics?

A.Batch processing using Hadoop MapReduce every 24 hours.
B.Batch processing using nightly ETL jobs.
C.Single-node database with periodic inserts.
D.Stream processing using Apache Kafka and Spark Streaming.
AnswerD

Kafka ingests streaming data, Spark Streaming processes it with low latency.

Why this answer

Apache Kafka acts as a distributed, fault-tolerant ingestion layer that can handle high-velocity JSON messages, while Spark Streaming processes the data in micro-batches for near-real-time analytics. This combination provides the low-latency, scalable pipeline required for streaming IoT sensor data, unlike batch or single-node approaches.

Exam trap

CompTIA often tests the distinction between batch and stream processing by presenting batch options that seem 'reliable' or 'traditional,' trapping candidates who overlook the explicit 'near-real-time' requirement in the question.

How to eliminate wrong answers

Option A is wrong because Hadoop MapReduce is designed for batch processing of large static datasets, not for continuous high-velocity streaming data, and a 24-hour cycle cannot meet near-real-time requirements. Option B is wrong because nightly ETL jobs introduce hours of latency, making them unsuitable for near-real-time analytics on streaming data. Option C is wrong because a single-node database with periodic inserts cannot scale to handle high-velocity IoT data streams and will become a bottleneck, failing to provide near-real-time processing.

23
MCQmedium

An organization uses a machine learning model to approve loans. The model shows higher false positive rates for a protected group. Which data engineering step should be taken to mitigate this?

A.Remove the protected attribute from training data
B.Use adversarial debiasing technique
C.Increase model complexity
D.Add synthetic data to balance groups
AnswerB

Adversarial debiasing forces the model to be invariant to protected attributes, reducing bias.

Why this answer

Option B is correct because adversarial debiasing explicitly reduces bias during training. Option A (removing the attribute) often fails because correlated features act as proxies for it. Option D (synthetic data) can help but may not be sufficient.

Option C increases complexity, potentially worsening bias.

24
MCQmedium

Refer to the exhibit. A stream processor ingests events. One event arrives with missing "user_id". What will happen?

A.The event will be stored in a dead-letter queue automatically.
B.The event will be accepted but user_id will be set to null.
C.The event will be accepted with a default user_id of 0.
D.The event will be rejected because user_id is required.
AnswerD

Validation against the required field causes rejection.

Why this answer

Option D is correct because, when the stream's schema (e.g., Avro or JSON Schema) defines a field such as 'user_id' as required, an event that omits it fails validation and is rejected at ingestion. Pipelines built on platforms such as Apache Kafka or AWS Kinesis typically enforce this by validating each event against the schema; if validation fails, the event is not accepted into the stream, which prevents downstream processing errors.

Exam trap

CompTIA often tests the misconception that stream processors automatically handle missing fields by setting defaults or using dead-letter queues, but the correct behavior is strict rejection when the field is required by the schema.

How to eliminate wrong answers

Option A is wrong because a dead-letter queue is not automatically used for missing required fields; it is typically configured for events that fail processing after ingestion, not for schema validation failures at ingestion time. Option B is wrong because setting user_id to null would violate the required constraint; stream processors do not automatically coerce missing required fields to null unless explicitly configured with a default value. Option C is wrong because assigning a default user_id of 0 is not standard behavior; default values must be explicitly defined in the schema, and without such definition, the event is rejected.
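
The exhibit is not reproduced here, so the following is only a minimal sketch of required-field validation. The field list in `REQUIRED_FIELDS` and the sample events are hypothetical and stand in for whatever the exhibit's schema declares.

```python
# Hypothetical schema: these three fields are assumed to be required.
REQUIRED_FIELDS = {"user_id", "event_type", "timestamp"}

def validate_event(event):
    """Return (accepted, reason). An event missing any required field is
    rejected; nothing is defaulted to null or 0."""
    missing = sorted(f for f in REQUIRED_FIELDS if f not in event)
    if missing:
        return False, "rejected: missing required field(s) " + ", ".join(missing)
    return True, "accepted"

ok_event = {"user_id": 42, "event_type": "click", "timestamp": 1700000000}
bad_event = {"event_type": "click", "timestamp": 1700000000}

ok_result = validate_event(ok_event)
bad_result = validate_event(bad_event)
```

Routing rejected events to a dead-letter queue, or filling in a default, would be an extra step the pipeline owner has to configure on top of a check like this one.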

25
MCQmedium

A team is building a regression model to predict house prices. The dataset includes numerical features (square footage, number of bedrooms) and categorical features (neighborhood, roof type). The categorical features have high cardinality (neighborhood has 200+ unique values). Which encoding strategy should the team use to avoid overfitting and maintain model interpretability?

A.Target encoding with regularization.
B.Label encoding.
C.Binary encoding.
D.One-hot encoding with feature selection.
AnswerA

Target encoding condenses categories using target mean, and regularization prevents overfitting.

Why this answer

Target encoding with regularization is the best choice because it replaces each categorical value with the mean of the target variable for that category, which captures the relationship between the category and house prices. Regularization (e.g., adding a prior or using cross-validation) shrinks the encoded values toward the global mean, preventing overfitting on rare categories (e.g., neighborhoods with only a few houses). This maintains interpretability because each encoded value directly reflects the average price impact of that category, unlike black-box embeddings.

Exam trap

CompTIA often tests the misconception that one-hot encoding is always safe for categorical variables, but here the high cardinality (200+ neighborhoods) makes one-hot encoding impractical and prone to overfitting, leading candidates to overlook target encoding with regularization as the correct solution.

How to eliminate wrong answers

Option B (Label encoding) is wrong because it assigns arbitrary integer labels to categories (e.g., neighborhood 1, 2, 3), which implies an ordinal relationship that does not exist, misleading the regression model into treating categories as ordered numeric features. Option C (Binary encoding) is wrong because while it reduces dimensionality compared to one-hot, it still creates multiple binary columns per category, which can lead to overfitting with high-cardinality features and does not directly capture the target relationship, reducing interpretability. Option D (One-hot encoding with feature selection) is wrong because one-hot encoding with 200+ neighborhoods would create over 200 dummy variables, causing extreme sparsity and high risk of overfitting even after feature selection, and feature selection methods (e.g., Lasso) may discard important rare categories, losing signal.
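
One common form of regularized target encoding is additive smoothing toward the global mean. The sketch below assumes that form; the function name, the smoothing weight `m`, and the price data are illustrative choices, not part of the question.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, m=5.0):
    """Encode each category as a blend of its own target mean and the
    global mean; m behaves like a count of pseudo-observations."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}

neighborhoods = ["A"] * 10 + ["B"]   # "B" is a rare category with one house
prices = [200.0] * 10 + [900.0]      # hypothetical prices in thousands
encoding = smoothed_target_encoding(neighborhoods, prices)
```

The single expensive house in neighborhood "B" is pulled strongly toward the global mean, while the well-populated neighborhood "A" stays close to its own average. To avoid target leakage, the encoding should be fitted on training folds only.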

26
MCQhard

An e-commerce company needs to update its recommendation model continuously as user preferences change. The model currently retrains from scratch every night, but the training time is too long. Which approach would reduce training time while keeping the model up-to-date?

A.Use dimensionality reduction on features.
B.Implement incremental learning using online gradient descent.
C.Switch to a simpler model.
D.Increase the batch size for retraining.
AnswerB

Online learning updates the model incrementally, avoiding full retrain.

Why this answer

Incremental learning using online gradient descent updates the model parameters with each new data point or mini-batch, avoiding the need to retrain from scratch. This approach significantly reduces training time while continuously adapting to changing user preferences, making it ideal for real-time recommendation systems.

Exam trap

CompTIA often tests the misconception that dimensionality reduction or simpler models are the primary solution for reducing training time, when in fact incremental learning directly addresses the need for continuous updates without full retraining.

How to eliminate wrong answers

Option A is wrong because dimensionality reduction reduces the number of features but does not eliminate the need to retrain the entire model from scratch each night; the training time savings are marginal and the core problem of full retraining remains. Option C is wrong because switching to a simpler model may reduce training time but typically sacrifices model accuracy and expressiveness, which is critical for capturing nuanced user preferences in recommendations. Option D is wrong because increasing the batch size for retraining can actually increase memory usage and may not reduce overall training time if the model still retrains from scratch nightly; it does not address the fundamental inefficiency of full retraining.
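
A minimal sketch of an online gradient-descent update, assuming a one-feature linear model with squared-error loss. The stream, learning rate, and function name are made up for illustration; a real recommender would have a very different model, but the update pattern is the same: one small step per new example, no full retrain.

```python
def sgd_update(weights, bias, x, y, lr=0.05):
    """One online gradient-descent step for squared error on one example."""
    pred = sum(w * xi for w, xi in zip(weights, x)) + bias
    err = pred - y
    new_weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return new_weights, bias - lr * err

weights, bias = [0.0], 0.0

# Hypothetical stream in which the true relation is y = 2x + 1.
stream = [(float(i % 5), 2.0 * (i % 5) + 1.0) for i in range(2000)]
for x, y in stream:
    # Each arriving example nudges the existing parameters.
    weights, bias = sgd_update(weights, bias, [x], y)
```

After the stream is consumed, the parameters sit close to the true slope and intercept, and the next batch of events can continue from that state.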

27
MCQhard

A medical imaging team is developing an AI model to detect tumors from CT scans. They have 10,000 labeled scans, but the labels were created by a semi-automated process with an estimated 20% error rate (mislabeled tumor vs. no tumor). The team trains a convolutional neural network (CNN) and achieves 90% accuracy on a held-out test set that was carefully validated by an expert radiologist. However, when deployed to a new hospital's patient population, the accuracy drops to 70%. The team suspects domain shift and label noise. Which strategy is most likely to improve model robustness for the new hospital?

A.Use active learning to select the most uncertain predictions from the new hospital's data, then have an expert radiologist correct those labels
B.Randomly select 1,000 scans from the new hospital and have them re-labeled by the radiologist
C.Collect 20,000 more scans with the same semi-automated labeling process
D.Reduce the CNN's number of layers and apply dropout to combat overfitting
AnswerA

Active learning targets the most informative samples, maximizing improvement per expert effort.

Why this answer

Option A is correct. Active learning selects the most informative samples (e.g., uncertain predictions) for expert review, efficiently improving the model with limited expert effort. Option C is wrong because simply adding more noisy labels will amplify errors.

Option B is wrong because random sampling may not capture the most valuable corrections. Option D is wrong because reducing model complexity may cause underfitting, and dropout addresses neither label noise nor domain shift.
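
Uncertainty sampling, the simplest active-learning query strategy, can be sketched as follows. The probabilities and the review budget are hypothetical; the idea is just to send the radiologist the scans whose predicted tumor probability is closest to 0.5.

```python
def most_uncertain(probabilities, budget):
    """Return indices of the `budget` predictions closest to 0.5."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]

# Hypothetical model outputs for six scans from the new hospital.
tumor_probs = [0.97, 0.52, 0.03, 0.45, 0.88, 0.61]
to_review = most_uncertain(tumor_probs, budget=3)
```

Scans at indices 1, 3, and 5 are selected, while the confident predictions (0.97, 0.03, 0.88) are left unlabeled for now.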

28
MCQmedium

A data engineer is preprocessing text data for sentiment analysis. Which technique preserves word order while converting text to numeric features?

A.TF-IDF
B.N-grams with frequency counts
C.Word2Vec
D.Bag-of-words
AnswerB

N-grams represent contiguous sequences of n words, thus preserving local word order.

Why this answer

Option B is correct because n-grams capture sequences of words, preserving local order. Options A (TF-IDF) and D (bag-of-words) ignore order entirely. Option C (Word2Vec) produces embeddings that encode semantic similarity but not exact word order.
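
A short sketch of why n-grams retain order information that bag-of-words loses. The two sentences are invented for the example and contain exactly the same words in a different order.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count contiguous n-word sequences in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

review_a = "not good but cheap"
review_b = "good but not cheap"

# With n=1 this is plain bag-of-words: both reviews look identical.
same_unigrams = ngram_counts(review_a, n=1) == ngram_counts(review_b, n=1)

# With n=2 the local order ("not good" vs "not cheap") is preserved.
bigrams_a = ngram_counts(review_a, n=2)
bigrams_b = ngram_counts(review_b, n=2)
```

Only the bigram features distinguish the negated "good" from the negated "cheap", which matters for sentiment.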

29
MCQhard

A team is training a deep neural network on a large image dataset. They observe that the training loss decreases smoothly but validation loss oscillates. Which regularization technique should be applied?

A.Data augmentation
B.L1 regularization
C.Dropout
D.Batch normalization
AnswerC

Dropout reduces overfitting by randomly dropping units during training, forcing the network to learn robust features.

Why this answer

Option C is correct because dropout randomly deactivates neurons, preventing co-adaptation and reducing overfitting. Option B (L1) sparsifies weights but is less common for image DNNs. Option D (batch norm) accelerates training but may not directly fix overfitting.

Option A (data augmentation) increases data diversity but is applied to the input data rather than to the network itself.

30
MCQeasy

A data scientist is preparing a dataset for training a classification model. The dataset has a column with missing values in 5% of rows. Which action should the data engineer take to minimize bias?

A.Impute missing values with the median of the column
B.Remove all rows with missing values
C.Replace missing values with a constant such as 999
D.Use a model that can handle missing values natively
AnswerA

Median imputation preserves the central tendency without being affected by outliers, suitable for low missing rate.

Why this answer

Imputing with the median preserves the column's central tendency without reducing sample size, minimizing bias. Removing rows reduces the sample size, a constant such as 999 introduces an artificial outlier, and native handling of missing values may not be available in the chosen model.

31
MCQhard

A streaming data pipeline ingests sensor data from IoT devices. The data arrives at irregular intervals and contains occasional spikes. Which data transformation is most appropriate for preparing this data for a time-series model?

A.Downsampling to a fixed frequency using mean aggregation
B.Removing all rows with values outside 3 standard deviations
C.Using a sliding window to compute moving averages
D.Padding missing timestamps with zeros
AnswerA

Mean aggregation over fixed intervals handles irregular timing and reduces noise.

Why this answer

Downsampling to a fixed frequency using mean aggregation handles irregular intervals and smooths spikes. Removing spikes may lose valid anomalies, padding with zeros introduces bias, and moving averages are a smoothing technique but not resampling.
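
Downsampling with mean aggregation can be sketched with the standard library alone. The 60-second bucket width and the sensor readings below are assumptions for illustration; timestamps are taken to be integer seconds.

```python
from collections import defaultdict

def downsample_mean(readings, bucket_seconds=60):
    """Group (timestamp_seconds, value) pairs into fixed-width buckets
    and return {bucket_start: mean value}."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}

# Irregularly spaced readings with one spike (80.0) at t=61.
readings = [(3, 20.0), (17, 22.0), (58, 21.0), (61, 80.0), (95, 20.0), (130, 19.0)]
per_minute = downsample_mean(readings)
```

The output has one value per minute, and the spike is averaged with its neighbour instead of being deleted. Buckets with no readings are simply absent here; how to fill them is a separate decision.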

32
MCQeasy

During feature engineering, a data scientist creates a new feature that is a linear combination of two existing features. What risk does this pose to the model?

A.Multicollinearity
B.Data leakage
C.Overfitting
D.Underfitting
AnswerA

Multicollinearity occurs when features are highly correlated, causing unstable estimates and inflated variances.

Why this answer

Creating a new feature as a linear combination of two existing features introduces perfect multicollinearity, where the new feature is an exact linear function of the original ones. This violates the assumption of no perfect multicollinearity in linear models, causing the design matrix to become singular and making coefficient estimates unstable or impossible to compute. Even in non-linear models, high multicollinearity can inflate variance and reduce interpretability.

Exam trap

CompTIA often tests the distinction between multicollinearity and overfitting, trapping candidates who confuse feature redundancy with model complexity.

How to eliminate wrong answers

Option B is wrong because data leakage refers to using information from outside the training set (e.g., future data or target leakage), not to relationships among features within the training data. Option C is wrong because overfitting is caused by a model learning noise or overly complex patterns, not by linear dependencies between features; multicollinearity primarily affects coefficient stability, not generalization error directly. Option D is wrong because underfitting occurs when a model is too simple to capture underlying patterns, whereas multicollinearity is a data structure issue that can actually increase model complexity without improving fit.

33
MCQhard

A financial institution is building a fraud detection system using a supervised learning model. The dataset is highly imbalanced with 99.9% legitimate transactions and 0.1% fraudulent ones. Which approach would be MOST effective to train the model to detect fraud?

A.Train the model using accuracy as the performance metric
B.Undersample the legitimate transactions to match the number of fraudulent ones
C.Use SMOTE to generate synthetic fraudulent transactions
D.Increase the regularization strength in the model
AnswerC

SMOTE creates synthetic samples of the minority class, effectively balancing the dataset without losing data.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) is the most effective approach because it generates synthetic fraudulent transactions by interpolating between existing minority class samples, thereby balancing the dataset without losing information. This allows the model to learn decision boundaries for fraud detection more effectively than simple undersampling or metric adjustments, especially given the extreme 99.9% vs 0.1% imbalance.

Exam trap

CompTIA often tests the misconception that simply changing the performance metric (like using F1-score or precision-recall) alone is sufficient to handle imbalance, but the trap here is that without addressing the data distribution itself, the model still lacks sufficient fraudulent examples to learn meaningful patterns.

How to eliminate wrong answers

Option A is wrong because accuracy is a misleading metric for highly imbalanced datasets; a model that predicts all transactions as legitimate would achieve 99.9% accuracy but detect zero fraud. Option B is wrong because undersampling the majority class to match the number of fraudulent transactions would discard roughly 99.9% of the legitimate transactions (about 99.8% of the whole dataset), causing severe information loss and poor generalization to real-world data. Option D is wrong because increasing regularization strength reduces model complexity to prevent overfitting, but it does not address the class imbalance; the model would still be biased toward the majority class and fail to learn fraud patterns.

34
MCQhard

A healthcare company is developing a predictive model to identify patients at risk of readmission within 30 days. The data engineering team has built a pipeline that collects data from multiple sources, including electronic health records (EHR), lab results, and wearable device data. During initial testing, the model's performance is poor, with high false positives. Upon investigation, the team discovers that the data contains significant temporal misalignment: lab results are timestamped when ordered, not when collected; wearable data is aggregated hourly; and EHR data has inconsistent update frequencies. The data pipeline currently joins all features on the patient ID without aligning timestamps. The data volume is large, and processing time is a concern. Which action should the data engineering team take to most effectively address the issue and improve model performance?

A.Discard all records where timestamps do not match exactly across sources, and only use records with perfect alignment.
B.Implement a window-based feature aggregation (e.g., 6-hour windows) and align all features to the same time windows before joining.
C.Leave the pipeline unchanged and instead adjust the model's classification threshold to reduce false positives.
D.Use a data imputation algorithm to fill in missing timestamps and then join on the nearest timestamp.
AnswerB

This creates consistent timestamps and reduces noise through aggregation, effectively addressing misalignment.

Why this answer

Implementing a window-based feature aggregation with consistent time windows (e.g., 6-hour or 12-hour) and aligning all data to those windows before joining ensures temporal consistency and reduces noise. This approach addresses the root cause of misalignment while managing data volume through aggregation. Discarding every record whose timestamps do not match exactly (Option A) loses valuable information.

Imputing missing timestamps and joining on the nearest one (Option D) may introduce unrealistic values for irregularly sampled data. Leaving the pipeline as-is and adjusting the classification threshold (Option C) does not fix the data quality issue.

35
MCQeasy

A data engineer needs to combine two datasets, each with unique customer_id, to include all records from both datasets. Which join type should be used?

A.FULL OUTER JOIN
B.RIGHT JOIN
C.LEFT JOIN
D.INNER JOIN
AnswerA

FULL OUTER JOIN includes all records from both tables, matching where possible and filling nulls elsewhere.

Why this answer

A FULL OUTER JOIN returns all records from both datasets, matching rows where the customer_id is present in both and filling in NULLs for missing matches. This is the only join type that guarantees every unique customer_id from either dataset appears in the result, which is exactly what the requirement specifies.

Exam trap

CompTIA often tests the misconception that LEFT JOIN or RIGHT JOIN can include all records from both datasets, but candidates forget that these asymmetric joins exclude non-matching rows from the opposite side.

How to eliminate wrong answers

Option B (RIGHT JOIN) is wrong because it returns only all rows from the right dataset and matching rows from the left, omitting any customer_id that exists only in the left dataset. Option C (LEFT JOIN) is wrong because it returns only all rows from the left dataset and matching rows from the right, omitting any customer_id that exists only in the right dataset. Option D (INNER JOIN) is wrong because it returns only rows where customer_id exists in both datasets, discarding all non-matching records from either side.
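
The behaviour of a full outer join can be mimicked in a few lines, assuming each dataset is a mapping keyed by a unique customer_id. The dataset contents are hypothetical, and `None` plays the role of SQL NULL.

```python
def full_outer_join(left, right):
    """Join two {customer_id: value} dicts, keeping every key from both.
    A side with no match contributes None."""
    keys = sorted(set(left) | set(right))
    return [(k, left.get(k), right.get(k)) for k in keys]

orders = {1: "laptop", 2: "phone"}          # customers 1 and 2
tickets = {2: "refund", 3: "login issue"}   # customers 2 and 3
rows = full_outer_join(orders, tickets)
```

Customer 1 (left only) and customer 3 (right only) both survive, which is exactly what a LEFT, RIGHT, or INNER join would fail to guarantee.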

36
Multi-Selecteasy

Which TWO data preprocessing techniques reduce the dimensionality of a dataset?

Select 2 answers
A.One-hot encoding
B.Imputation
C.Feature scaling
D.Principal Component Analysis (PCA)
E.Feature selection
AnswersD, E

PCA reduces dimensionality by projecting data onto principal components.

Why this answer

Options D and E are correct. Principal Component Analysis (PCA) transforms data to a lower-dimensional space, while feature selection picks a subset of the original features. Option C (feature scaling) does not reduce dimensions.

Option A (one-hot encoding) actually increases dimensions. Option B (imputation) handles missing values but does not reduce dimensions.

37
Multi-Selectmedium

Which THREE are common data preprocessing steps in a machine learning pipeline? (Choose 3)

Select 3 answers
A.Hyperparameter tuning
B.Encoding categorical variables
C.Model evaluation
D.Scaling numeric features
E.Handling missing values
AnswersB, D, E

Categorical data must be converted to numeric.

Why this answer

Encoding categorical variables is a common data preprocessing step because machine learning algorithms require numerical input. Techniques like one-hot encoding or label encoding convert categorical data (e.g., colors, countries) into numeric format, enabling the model to process them correctly. Without this step, the model would misinterpret categorical labels as ordinal or meaningless numeric values.

Exam trap

CompTIA often tests the distinction between preprocessing steps (data cleaning, transformation) and later pipeline stages (model tuning, evaluation), so candidates mistakenly select hyperparameter tuning or model evaluation as preprocessing steps.

38
Multi-Selectmedium

Which THREE are common causes of data leakage in machine learning pipelines?

Select 3 answers
A.Using time-based splitting for sequential data
B.Using future information to predict the present
C.Using cross-validation on the entire dataset
D.Applying normalization before splitting data into train and test sets
E.Including features that are directly derived from the target variable
AnswersB, D, E

Using data that would not be available at prediction time is a direct form of leakage.

Why this answer

Option B is correct because using future information to predict the present is a classic form of data leakage. In time series or sequential data, if a model is trained on features that include values from a later time point, it gains access to information that would not be available at prediction time, leading to overly optimistic performance metrics and poor generalization.

Exam trap

CompTIA often tests the distinction between valid data splitting practices and actual leakage causes, so candidates may incorrectly select time-based splitting (Option A) as a leakage cause when it is actually a proper technique for sequential data.
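
The normalization case (Option D) is easy to demonstrate. The sketch below uses made-up numbers and hypothetical helper names; it contrasts scaling statistics fitted on the training split only with statistics fitted on train and test together.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and standard deviation used for z-score scaling."""
    return mean(values), pstdev(values)

def transform(values, scaler):
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

train, test = [1.0, 2.0, 3.0, 4.0], [100.0]

# Correct: statistics come from the training split only.
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)

# Leaky: the test value has shifted the mean the model is trained with.
leaky_scaler = fit_scaler(train + test)
```

In the leaky version the training features already "know" that a value near 100 exists, which is information that would not be available at prediction time.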

39
MCQmedium

A team is training a language model using a large text corpus. They want to ensure the model does not learn biased associations between gender and professions. Which data engineering technique should they apply?

A.Remove all gender-related words from the text
B.Use a pre-trained model that is already debiased
C.Apply adversarial debiasing during training
D.Balance the representation of professions across genders
AnswerD

Balancing ensures the model sees equal examples of each gender across professions, reducing biased correlations.

Why this answer

Balancing the representation of professions across genders in the training data reduces the chance the model learns spurious correlations. Removing gender words is too aggressive, pre-trained models may still be biased, and adversarial debiasing is a model training technique, not data engineering.

40
MCQeasy

A data analyst is cleaning a dataset and finds that 20% of the values for the 'age' column are missing. Which imputation method is most robust if the data is not normally distributed?

A.Mean imputation
B.Median imputation
C.Mode imputation
D.Remove rows with missing values
AnswerB

Median is robust to non-normal distributions.

Why this answer

Median imputation is the most robust method for handling missing values in the 'age' column when the data is not normally distributed because the median is unaffected by outliers or skewness. Unlike the mean, which is sensitive to extreme values, the median provides a central tendency measure that better represents the typical value in non-normal distributions, preserving the dataset's integrity for downstream modeling.

Exam trap

CompTIA often tests the misconception that mean imputation is always the default or best choice for numerical data, but the trap here is that candidates overlook the importance of distribution shape and outlier sensitivity, leading them to select mean imputation despite the data not being normally distributed.

How to eliminate wrong answers

Option A is wrong because mean imputation assumes a normal distribution and is highly sensitive to outliers, which can introduce bias and distort the dataset's variance when the data is skewed. Option C is wrong because mode imputation is typically used for categorical data, not continuous variables like age, and it can lead to loss of granularity and inaccurate representation of the distribution. Option D is wrong because removing rows with missing values reduces sample size and can introduce selection bias, especially if the missingness is not completely at random, which is inefficient and may degrade model performance.

41
MCQmedium

A financial institution is training a risk assessment model. The dataset includes customer credit scores, income, age, and past loan defaults. During feature engineering, a data engineer creates a new feature 'income_to_debt_ratio'. Which type of feature engineering technique is this?

A.Feature encoding
B.Feature scaling
C.Feature selection
D.Feature combination
AnswerD

Creating a ratio from two continuous variables is a combination technique to capture interaction.

Why this answer

Option D is correct because 'income_to_debt_ratio' is created by combining two existing features (income and debt) into a single derived feature. This is a classic example of feature combination (also known as feature crossing or feature construction), where arithmetic operations or logical rules are applied to existing variables to generate new predictive signals. The goal is to capture interactions or relationships that the original features alone may not express linearly.

Exam trap

CompTIA often tests the distinction between feature engineering techniques by presenting a derived feature and expecting candidates to recognize it as feature combination rather than confusing it with scaling or encoding.

How to eliminate wrong answers

Option A is wrong because feature encoding transforms categorical variables into numerical representations (e.g., one-hot encoding, label encoding); it does not create new numerical ratios from existing numerical features. Option B is wrong because feature scaling normalizes or standardizes the range of feature values (e.g., min-max scaling, z-score normalization) without generating new features. Option C is wrong because feature selection reduces the number of features by choosing a subset of the original ones (e.g., using correlation analysis or recursive feature elimination), not by engineering new derived attributes.

42
MCQmedium

A company is deploying an AI model to recommend products. The model's training data included historical purchases from the past two years, but the business environment has changed significantly due to a market shift. What is the most likely issue affecting model performance?

A.Concept drift
B.Overfitting
C.Underfitting
D.Data leakage
AnswerA

Concept drift is the change in the underlying relationship between features and target variable over time, making the model outdated.

Why this answer

Concept drift occurs when the statistical properties of the target variable change over time, which is common in dynamic business environments. Overfitting and underfitting relate to training dataset characteristics. Data leakage involves using information not available at prediction time.

43
MCQmedium

A machine learning engineer is training a Support Vector Machine (SVM) with an RBF kernel on a dataset with features on different scales (e.g., age 0-100, income 0-1,000,000). The model converges slowly and yields poor accuracy. What should the engineer do first?

A.Standardize the features to have zero mean and unit variance
B.Increase the regularization parameter C to penalize misclassifications more
C.Decrease the gamma parameter to reduce the influence of each data point
D.Switch to a linear kernel to avoid distance calculations
AnswerA

Standardization ensures all features contribute equally to the distance metric.

Why this answer

Option A is correct because feature scaling (normalization or standardization) is crucial for SVMs with an RBF kernel, as the distance metric depends on feature scales. Option D is wrong because switching to a linear kernel may not capture non-linearity. Option B is wrong because increasing C changes the misclassification penalty (weakening regularization) but does not address feature scale.

Option C is wrong because reducing gamma may help, but without scaling, distances are still dominated by large-scale features.
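
The RBF kernel depends on the squared Euclidean distance between points, so a quick look at that distance shows why scaling comes first. The three customers below are invented for the example: the first two differ by 35 years of age but only 500 in income.

```python
from statistics import mean, pstdev

def standardize(column):
    """Z-score a column: zero mean, unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

ages = [25.0, 60.0, 26.0]
incomes = [50_000.0, 50_500.0, 90_000.0]

raw = list(zip(ages, incomes))
scaled = list(zip(standardize(ages), standardize(incomes)))

# Share of the squared distance between customers 0 and 1 due to income.
raw_income_share = (raw[0][1] - raw[1][1]) ** 2 / sq_dist(raw[0], raw[1])
# Share of the same distance due to age, after standardization.
scaled_age_share = (scaled[0][0] - scaled[1][0]) ** 2 / sq_dist(scaled[0], scaled[1])
```

On raw values the small income gap accounts for almost all of the distance and the 35-year age gap is nearly invisible; after standardization the age gap dominates, as intuition suggests it should.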

44
Multi-Selectmedium

A data science team is building a model to predict customer churn. The dataset includes categorical variables like 'region' and 'subscription_type'. Which three preprocessing steps should be applied to these categorical features? (Select THREE).

Select 3 answers
A.Normalization
B.Label encoding
C.Standard scaling
D.Ordinal encoding
E.One-hot encoding
AnswersB, D, E

Label encoding assigns integers to each category, suitable for ordinal categories.

Why this answer

Label encoding (B) is correct because it converts each unique category in a categorical variable into a unique integer, which is a simple and memory-efficient way to prepare categorical data for machine learning models. Ordinal encoding (D) is correct for categorical variables with a natural order, such as 'subscription_type' if tiers exist (e.g., basic, premium, enterprise), preserving ordinal relationships. One-hot encoding (E) is correct for nominal categorical variables like 'region' where no order exists, creating binary columns for each category to avoid implying false ordinality.

Exam trap

CompTIA often tests the distinction between ordinal and nominal categorical variables, trapping candidates who apply label encoding to nominal data or one-hot encoding to ordinal data without considering the feature's inherent order.

45
MCQmedium

A dataset for a binary classification problem has 95% of samples in class "0" and 5% in class "1". The data scientist trains a logistic regression model and achieves 95% accuracy. Which metric should the scientist primarily use to evaluate model performance?

A.Precision, recall, and F1-score.
B.R-squared.
C.Accuracy.
D.Mean squared error.
AnswerA

These metrics evaluate performance on the minority class, crucial for imbalanced data.

Why this answer

In a highly imbalanced dataset (95% class 0, 5% class 1), accuracy is misleading because a model can achieve 95% accuracy by simply predicting the majority class for all samples. Precision, recall, and F1-score provide a more nuanced view of performance on the minority class, which is typically the class of interest in binary classification problems. The F1-score, in particular, balances precision and recall, making it the primary metric for evaluating model effectiveness on imbalanced data.

Exam trap

CompTIA often tests the concept that accuracy is a poor metric for imbalanced datasets, trapping candidates who assume high accuracy always indicates good model performance without considering class distribution.

How to eliminate wrong answers

Option B is wrong because R-squared is a metric for regression models, measuring the proportion of variance in the dependent variable explained by the independent variables, and is not applicable to classification tasks. Option C is wrong because accuracy is not a reliable metric for imbalanced datasets; a model that always predicts the majority class can achieve high accuracy without actually learning meaningful patterns, as seen with the 95% accuracy matching the class distribution. Option D is wrong because mean squared error (MSE) is a loss function for regression problems, used to quantify the average squared difference between predicted and actual continuous values, and is not appropriate for evaluating binary classification outputs.
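
The 95% figure in the question is worth working through. The sketch below scores a hypothetical "model" that always predicts class 0 on a 95/5 split; the helper name is an assumption for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class,
    returning 0.0 where a ratio is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

y_true = [0] * 95 + [1] * 5
always_zero = [0] * 100   # predicts the majority class for every sample

accuracy = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
scores = precision_recall_f1(y_true, always_zero)
```

Accuracy comes out at 0.95 while precision, recall, and F1 for class 1 are all 0.0, which is why accuracy alone says nothing about the minority class.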

46
MCQeasy

A team is building a regression model to predict house prices. Which data transformation is most appropriate if the target variable exhibits right skewness?

A.Principal component analysis (PCA)
B.Standardization (Z-score)
C.One-hot encoding
D.Log transformation
AnswerD

Log transformation reduces right skewness by compressing large values.

Why this answer

Log transformation is the most appropriate technique for right-skewed target variables because it compresses the long tail, making the distribution more symmetric and closer to Gaussian. This stabilizes variance and often improves the performance of regression models that assume normally distributed errors, such as linear regression.

Exam trap

CompTIA often tests the misconception that standardization can fix skewness, but candidates must remember that standardization only rescales the data; it does not reshape its distribution.

How to eliminate wrong answers

Option A is wrong because Principal Component Analysis (PCA) is a dimensionality reduction technique for features, not a transformation applied to the target variable; it does not address skewness in the target. Option B is wrong because Standardization (Z-score) centers and scales the data but does not change the shape of the distribution, so it cannot correct right skewness. Option C is wrong because One-hot encoding is used to convert categorical variables into numerical format, not to transform a continuous target variable.
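The contrast between log transformation and standardization can be illustrated with a short sketch. The price list is made up for illustration, and `skewness` is a hypothetical helper that computes the population skewness (mean cubed deviation divided by the cubed standard deviation).

```python
import math
import statistics

def skewness(values):
    """Population skewness: mean cubed deviation over std cubed."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)

# Hypothetical house prices with a long right tail.
prices = [120_000, 135_000, 150_000, 160_000, 175_000,
          190_000, 210_000, 950_000, 2_400_000]

# Log transform compresses the large values.
log_prices = [math.log1p(p) for p in prices]

# Z-score standardization for comparison: rescales but keeps the shape.
mean, std = statistics.fmean(prices), statistics.pstdev(prices)
z_prices = [(p - mean) / std for p in prices]

print("raw:", round(skewness(prices), 3))
print("z-score:", round(skewness(z_prices), 3))
print("log:", round(skewness(log_prices), 3))
```

The z-scored values keep the same skewness as the raw prices, while the log-transformed values are less skewed.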

47
Multi-Selectmedium

A team is developing a natural language processing model to classify customer feedback. The dataset contains text in multiple languages. Which THREE preprocessing steps are essential to ensure the model performs well across all languages?

Select 3 answers
A.One-hot encoding
B.Lowercasing
C.Tokenization
D.Stemming
E.Removing stop words
AnswersB, C, E

Lowercasing reduces vocabulary size and helps generalize across different cases.

Why this answer

Lowercasing is essential because it normalizes text across languages by converting all characters to the same case, reducing vocabulary size and ensuring that words like 'Good' and 'good' are treated identically. This prevents the model from learning separate representations for case variations, which is critical for multilingual datasets where case usage may differ (e.g., German capitalizes nouns). Without lowercasing, the model's performance degrades due to sparsity and increased feature space.

Exam trap

CompTIA often tests the distinction between preprocessing steps (like lowercasing, tokenization, stop word removal) and feature engineering techniques (like one-hot encoding), leading candidates to mistakenly include one-hot encoding as a preprocessing step when it is actually a vectorization method applied after preprocessing.

48
Multi-Selecthard

A team is using k-fold cross-validation to evaluate a model. They observe high variance in performance scores across folds. Which TWO actions are most likely to reduce this variance? (Choose TWO.)

Select 2 answers
A.Increase the number of folds
B.Use stratified cross-validation
C.Decrease the number of folds
D.Shuffle data before splitting
E.Use a more complex model
AnswersA, B

More folds mean each training set is larger and more similar to the full dataset, reducing variance.

Why this answer

Increasing the number of folds (e.g., from 5 to 10) uses more training data per fold, reducing variance. Using stratified cross-validation ensures that each fold has a representative class distribution, which stabilizes scores. Decreasing folds increases variance.

Shuffling is already a common practice. Using a more complex model typically increases variance.
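A minimal sketch of the stratification idea follows, assuming a simple round-robin assignment within each class. The helper `stratified_folds` is hypothetical, and real implementations typically also shuffle within each class.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical labels: 90 samples of class 0 and 10 of class 1.
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```

Every fold receives the same share of the minority class, so no fold is scored on a class mix that differs from the full dataset.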

49
MCQmedium

A financial services company has a real-time fraud detection system that uses Apache Kafka to stream transaction events, a TensorFlow Serving model for scoring, and a Redis cache for lookup of historical fraud patterns. The system processes 10,000 transactions per second with an SLA of 100ms latency per transaction. Recently, after a model update, the latency for some transactions spiked to over 500ms, causing timeouts. The model uses a deep neural network with 10 million parameters. The engineering team suspects the issue is due to increased model inference time. Which action should be taken to reduce latency without significant loss in accuracy?

A.Add more Redis nodes to the cache cluster
B.Increase the number of Kafka partitions and consumer threads
C.Decrease the inference batch size from 32 to 1
D.Quantize the model weights from FP32 to FP16
AnswerD

FP16 quantization reduces model size and speeds up inference, typically with minimal accuracy impact.

Why this answer

The latency spike is caused by increased model inference time after a model update. Quantizing model weights from FP32 to FP16 reduces memory bandwidth and computation requirements, directly speeding up inference on compatible hardware (e.g., GPUs with Tensor Cores) with minimal accuracy loss. This addresses the root cause—model inference latency—without changing the system architecture.

Exam trap

The trap here is that candidates confuse system-level scaling (adding cache nodes or Kafka partitions) with model-level optimization, failing to recognize that the latency spike originates from the model inference step itself.

How to eliminate wrong answers

Option A is wrong because adding Redis nodes improves cache lookup throughput, but the latency spike is due to model inference time, not cache performance. Option B is wrong because increasing Kafka partitions and consumer threads improves message ingestion parallelism, but does not reduce the per-transaction inference latency of the TensorFlow Serving model. Option C is wrong because decreasing the inference batch size from 32 to 1 reduces throughput and increases per-transaction overhead (e.g., kernel launch latency), which would worsen latency, not improve it.

50
MCQhard

An AI model is deployed to a mobile app with limited computational resources. The model is a deep neural network with high latency. Which technique is best to reduce inference time?

A.Increase batch size
B.Add more layers
C.Use a larger model
D.Quantization
AnswerD

Quantization reduces model size and speeds up inference by using lower-precision arithmetic.

Why this answer

Quantization reduces the precision of model weights (e.g., from float32 to int8), significantly speeding up inference and reducing memory footprint with minimal accuracy loss. Increasing batch size is for throughput, not single inference latency. Using a larger model or adding more layers would increase latency.
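The float32-to-int8 mapping mentioned above can be sketched as follows. This assumes symmetric per-tensor quantization, which is one common scheme among several; the weights are made up, and the sketch shows only the rounding step, not the speedup, which depends on hardware support for integer arithmetic.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]

# Hypothetical weights for illustration.
weights = [0.82, -1.27, 0.05, 0.31, -0.66]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight stays within half of `scale`.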

51
MCQmedium

A healthcare startup is developing a deep learning model to detect diabetic retinopathy from retinal images. The model is trained on a dataset of 10,000 labeled images. During initial testing, the model achieves 99% accuracy on the training set but only 85% on the test set. The startup wants to deploy the model in a clinical setting where false negatives (missing a disease) are critical. The team has access to additional unlabeled retinal images from multiple sources. Which strategy should the team use to improve the model's generalization and reduce false negatives?

A.Use semi-supervised learning with the unlabeled images to improve feature representations
B.Apply aggressive data augmentation to the training set
C.Increase the learning rate during training
D.Add more convolutional layers to the model
AnswerA

Semi-supervised learning utilizes unlabeled data to learn generalizable features, reducing overfitting and improving test performance.

Why this answer

Semi-supervised learning leverages the large pool of unlabeled retinal images to learn robust feature representations, which helps the model generalize better to unseen data. By reducing overfitting (the gap between 99% training and 85% test accuracy), this approach directly improves test-set performance. Additionally, semi-supervised methods can be tuned to emphasize recall, thereby reducing false negatives critical in clinical diabetic retinopathy screening.

Exam trap

CompTIA often tests the misconception that simply increasing data or model complexity (augmentation, layers) always improves generalization, when in fact semi-supervised learning is the targeted solution for leveraging unlabeled data to close the train-test accuracy gap and address class-specific metrics like false negatives.

How to eliminate wrong answers

Option B is wrong because aggressive data augmentation, while helpful for generalization, does not directly address the high false-negative rate; it may even distort critical pathological features if applied too aggressively. Option C is wrong because increasing the learning rate typically destabilizes training, leading to divergence or poor convergence, and does not reduce false negatives or improve generalization. Option D is wrong because adding more convolutional layers increases model capacity, which would likely worsen overfitting given the already large gap between training and test accuracy, and does not specifically target false negatives.

52
Multi-Selecteasy

A data scientist is preparing a dataset for a classification model. The dataset contains several categorical variables with high cardinality. Which TWO encoding methods are appropriate for converting these categorical variables into numerical features?

Select 2 answers
A.Min-max scaling
B.K-means clustering
C.One-hot encoding
D.Principal component analysis (PCA)
E.Label encoding
AnswersC, E

One-hot encoding converts each category into a binary vector, suitable for categorical variables.

Why this answer

One-hot encoding and label encoding are both appropriate techniques for encoding categorical variables. One-hot encoding creates binary columns for each category, while label encoding assigns a unique integer to each category. PCA is a dimensionality reduction technique, K-means is a clustering algorithm, and min-max scaling is a normalization method, none of which are encoding methods.
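The two encodings described above can be sketched in a few lines. The helper names and the city values are illustrative assumptions.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Create one binary column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

# Hypothetical categorical column.
cities = ["paris", "tokyo", "lima", "tokyo"]
labels, mapping = label_encode(cities)
matrix, categories = one_hot_encode(cities)
print(labels)
print(matrix)
```

With one-hot encoding the number of columns equals the number of distinct categories, which is worth keeping in mind when cardinality is high.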

53
MCQeasy

Refer to the exhibit. What is the recall of the model?

A.0.72
B.0.8
C.0.7
D.0.73
AnswerB

TP=72, FN=18, recall=72/90=0.8

Why this answer

Recall is calculated as True Positives divided by (True Positives + False Negatives). From the confusion matrix, True Positives = 72 and False Negatives = 18, so recall = 72 / (72 + 18) = 72 / 90 = 0.8. This measures the model's ability to correctly identify all actual positive cases.

Exam trap

CompTIA often tests recall by providing a confusion matrix and expects candidates to correctly identify the denominator as TP+FN, not total samples, to avoid confusing recall with accuracy or precision.

How to eliminate wrong answers

Option A (0.72) is wrong because it divides True Positives by the total number of samples (72/100 = 0.72) instead of by TP + FN; that ratio is not recall. Option C (0.7) is wrong because it likely results from misreading the matrix (e.g., using 72/102 or confusing recall with precision). Option D (0.73) is wrong because it may come from a miscalculation such as (72 + 1)/(72 + 18 + 10) = 73/100, which is not a standard metric.
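As a quick check of the recall formula, this sketch uses the counts quoted in the explanation above (TP = 72, FN = 18); the function name is illustrative.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN): the share of actual positives that were found."""
    return true_positives / (true_positives + false_negatives)

value = recall(72, 18)
print(round(value, 2))
```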

54
Multi-Selecthard

A data scientist is evaluating a binary classification model for fraud detection. The dataset is highly imbalanced (99% non-fraud, 1% fraud). Which TWO metrics are most appropriate for assessing model performance? (Choose two.)

Select 2 answers
A.Precision
B.Recall
C.F1 score
D.Area under the ROC curve (AUC-ROC)
E.Accuracy
AnswersA, B

Precision measures the proportion of predicted fraud that is actually fraud, important to avoid false positives.

Why this answer

Precision is appropriate because it measures the proportion of predicted fraud cases that are actually fraudulent, which is critical when false positives (flagging legitimate transactions as fraud) are costly. In a highly imbalanced dataset like this (99% non-fraud), precision directly evaluates the model's ability to avoid overwhelming fraud analysts with false alarms.

Exam trap

CompTIA often tests the misconception that AUC-ROC is always the best metric for imbalanced datasets, but the trap here is that AUC-ROC can be misleadingly high even when the model performs poorly on the minority class, whereas precision and recall directly address the class imbalance.

55
MCQeasy

A dataset used for training a classification model contains 10% missing values in a feature that is known to be important. The data scientist decides to impute the missing values. Which imputation method is most robust if the data is not missing completely at random?

A.Delete all rows with missing values
B.Use multiple imputation to model missing values
C.Replace missing values with the mean of the feature
D.Fill missing values with 0
AnswerB

Multiple imputation provides unbiased estimates under the missing-at-random (MAR) assumption.

Why this answer

Option B is correct because multiple imputation accounts for the uncertainty of the imputed values and is robust when data are missing at random. Option A is wrong because deleting rows with missing values reduces sample size and may bias results. Option C is wrong because mean imputation distorts distributions and relationships.

Option D is wrong because filling with zero is arbitrary and introduces bias.

56
MCQhard

A model trained on a dataset with imbalanced classes achieves 98% accuracy but only 50% recall for the minority class. Which technique should be applied first to address the imbalance?

A.Apply cost-sensitive learning
B.Reduce the majority class size
C.Use SMOTE to generate synthetic samples
D.Collect more data for the minority class
AnswerA

Cost-sensitive learning adjusts class weights in the loss function, directly tackling imbalance without data modification.

Why this answer

Cost-sensitive learning directly modifies the model's loss function to penalize misclassifications of the minority class more heavily than those of the majority class. This approach addresses the root cause of the imbalance—the model's bias toward the majority class—without altering the dataset distribution, making it the most immediate and effective first step.

Exam trap

CompTIA often tests the misconception that data-level techniques like SMOTE or undersampling should always be the first approach, when in fact cost-sensitive learning is a simpler, less invasive, and often more effective initial step that directly adjusts the model's learning objective.

How to eliminate wrong answers

Option B is wrong because reducing the majority class size (random undersampling) discards potentially valuable data, which can lead to loss of information and increased variance in the model, and it is not typically the first technique applied. Option C is wrong because SMOTE generates synthetic samples for the minority class, which can introduce noise and is a data-level augmentation technique that should be considered after cost-sensitive adjustments or as a complementary method, not as the first step. Option D is wrong because collecting more data for the minority class is often impractical, time-consuming, and may not be feasible in real-world scenarios; it is not a guaranteed or immediate solution to the imbalance.
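A minimal sketch of cost-sensitive learning follows, assuming the common "balanced" heuristic of weighting each class by n_samples / (n_classes * class_count). The labels, predicted probabilities, and helper names are illustrative.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def weighted_log_loss(y_true, y_prob, weights):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(y_true)

# Hypothetical data: 98 majority samples, 2 minority samples,
# and a model that predicts a low positive probability for everything.
y_true = [0] * 98 + [1] * 2
y_prob = [0.1] * 100

weights = balanced_class_weights(y_true)
plain = weighted_log_loss(y_true, y_prob, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y_true, y_prob, weights)
print(weights, plain, weighted)
```

With the class weights applied, missing the two minority samples dominates the loss, so an optimizer is pushed to fix those errors without any change to the dataset.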

57
MCQeasy

A model's training accuracy is 99% but validation accuracy drops to 60%. What is the most likely issue?

A.Data leakage
B.Overfitting
C.Multicollinearity
D.Underfitting
AnswerB

Overfitting leads to high training accuracy but low validation accuracy.

Why this answer

A training accuracy of 99% with a validation accuracy of only 60% is a classic symptom of overfitting. The model has memorized the training data, including noise and outliers, rather than learning generalizable patterns, causing it to perform poorly on unseen validation data.

Exam trap

CompTIA often tests the distinction between overfitting and data leakage by presenting a large accuracy gap, where candidates might mistakenly attribute the issue to data leakage instead of recognizing that leakage typically inflates both accuracies rather than creating a divergence.

How to eliminate wrong answers

Option A is wrong because data leakage typically causes both training and validation accuracy to be artificially high, not a large gap between them; it occurs when information from outside the training set inadvertently influences the model. Option C is wrong because multicollinearity refers to high correlation among input features in regression models, which affects coefficient stability and interpretability, not a drastic accuracy drop between training and validation sets. Option D is wrong because underfitting would result in low accuracy on both training and validation sets (e.g., both below 70%), not a high training accuracy with a low validation accuracy.

58
MCQmedium

A data engineer is reviewing an S3 bucket policy for a machine learning project. The policy is intended to allow access to training data only from the corporate network (10.0.0.0/16). However, users in the corporate network report access denied. Which issue is most likely causing the problem?

A.The policy is missing a Deny statement.
B.The resource ARN is incorrect.
C.The policy does not include s3:ListBucket action, which is needed to list objects.
D.The IP address condition is not matching because the corporate network uses a different CIDR.
AnswerC

Users need ListBucket to see objects before getting them.

Why this answer

Option C is correct. The condition key in the exhibit, "aws:SourceIp", is the correct key for an IP address condition, so the condition is not the cause. The more likely problem is that the policy allows only GetObject while users are trying to list objects, which requires s3:ListBucket; listing is often the first operation a client performs.

Option A is wrong because the statement's effect is Allow, and no Deny statement is needed to grant access. Option B is wrong because the resource is correct. Option D is wrong because the policy allows access from 10.0.0.0/16.

59
MCQeasy

A data engineer needs to store training data in a format that supports columnar pruning during model training. Which storage format should they use?

A.Parquet
B.XML
C.JSON
D.CSV
AnswerA

Parquet is columnar, enabling compression and pruning, reducing I/O.

Why this answer

Parquet is the correct choice because it is a columnar storage format that enables column pruning, allowing the training process to read only the columns needed for model training rather than entire rows. This reduces I/O and speeds up data loading, which is critical for large-scale AI/ML workloads. Unlike row-oriented formats, Parquet stores data by columns, making it efficient for analytical queries and feature selection.

Exam trap

CompTIA often tests the misconception that JSON or CSV are acceptable for columnar pruning because they are common and human-readable, but the trap here is that only columnar formats like Parquet or ORC support efficient column-level access, while row-oriented formats require full record scans.

How to eliminate wrong answers

Option B (XML) is wrong because XML is a verbose, hierarchical text format that stores data row-wise and lacks columnar pruning capabilities, leading to high storage overhead and slow read performance for tabular data. Option C (JSON) is wrong because JSON is a row-oriented, self-describing format that requires parsing entire records even when only a subset of fields is needed, making it unsuitable for column pruning. Option D (CSV) is wrong because CSV is a flat, row-oriented text format that forces reading entire rows into memory, with no support for columnar storage or predicate pushdown, resulting in inefficient I/O for selective column access.

60
MCQhard

A retail company uses a machine learning model to predict daily sales. The model takes features like past sales, promotions, holidays, and weather data. Recently, the model's accuracy dropped significantly. The data engineer checks the data pipeline and finds that the weather data source changed from a free API to a new paid API that provides more detailed data. The new data includes additional attributes like humidity and wind speed, but the existing pipeline only ingests temperature and precipitation. Also, the time zone format changed from UTC to local time. The model was trained on the old format. Which action should the engineer take first to restore model performance?

A.Add a new step to merge old and new weather data before feeding to the model.
B.Transform the new data to match the old format (time zone and selected features) and retrain the model.
C.Revert to the old weather API.
D.Retrain the model with the new data including all new features.
AnswerB

This aligns the data with the training pipeline, resolving the immediate mismatch.

Why this answer

The immediate problem is the time zone change causing misalignment between training and inference data. Transforming the new data to match the old format ensures consistency. Retraining with all new features may introduce drift; reverting may not be possible; merging data without alignment causes inconsistency.
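The transformation step could look like the following sketch. It assumes, purely for illustration, that the new API reports local time at a fixed UTC-5 offset and uses the field names shown; the real API's schema is not given in the question, and zones with daylight saving time would need a proper time zone database rather than a fixed offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: local time is a fixed UTC-5 offset.
LOCAL_TZ = timezone(timedelta(hours=-5))
# Features the existing model was trained on.
LEGACY_FEATURES = ("temperature", "precipitation")

def to_legacy_record(new_record):
    """Convert a new-API record to the schema the model was trained on."""
    local_time = datetime.fromisoformat(new_record["timestamp"])
    local_time = local_time.replace(tzinfo=LOCAL_TZ)
    legacy = {"timestamp_utc": local_time.astimezone(timezone.utc).isoformat()}
    for name in LEGACY_FEATURES:
        legacy[name] = new_record[name]
    return legacy

# Hypothetical record from the new API.
new_record = {
    "timestamp": "2024-03-01T19:30:00",
    "temperature": 12.5,
    "precipitation": 0.0,
    "humidity": 61,
    "wind_speed": 14.2,
}
legacy = to_legacy_record(new_record)
print(legacy)
```

The extra attributes (humidity, wind speed) are dropped and the timestamp is moved back to UTC, so the model sees inputs in the format it was trained on.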

61
Multi-Selecthard

Which THREE data quality dimensions are critical for ensuring model reliability?

Select 3 answers
A.Timeliness.
B.Consistency.
C.Completeness.
D.Volume.
E.Accuracy.
AnswersA, B, E

Outdated data can cause predictions to be irrelevant.

Why this answer

Timeliness is critical because stale data can lead to incorrect model predictions, especially in dynamic environments like network traffic analysis or fraud detection. For AI models, data must reflect the current state of the system to ensure relevance and reliability. Without timely data, the model may act on outdated patterns, reducing its effectiveness.

Exam trap

CompTIA often tests the distinction between data quality dimensions and data characteristics, so the trap here is that candidates confuse completeness or volume with the three critical dimensions (timeliness, consistency, accuracy) that directly impact model reliability.

62
MCQeasy

A data scientist is preparing a dataset for a supervised learning model. The dataset contains missing values in 15% of the rows for a numeric feature. Which preprocessing technique should be applied to minimize bias?

A.Remove all rows with missing values.
B.Impute missing values with the mean of the feature.
C.Encode the missing values as a separate category.
D.Use a model that handles missing values natively.
AnswerB

Mean imputation preserves the dataset size and is appropriate for numeric features when missingness is random.

Why this answer

Imputing missing values with the mean preserves data size and is standard for numeric features when missingness is random. Removing rows can introduce bias if missingness is not random. Encoding as a separate category is for categorical features.

Using a model that handles missing values natively is not always available and may still require imputation.

63
MCQeasy

A data scientist is preparing a dataset for a classification model. The dataset contains a column "Age" with 10% missing values and a column "Income" with 30% missing values. Which imputation strategy is MOST appropriate to minimize bias?

A.Replace missing Age with the mean and missing Income with the median.
B.Delete all rows with missing values.
C.Replace missing Age with the mode and missing Income with a constant value.
D.Replace missing values with zeros.
AnswerA

Mean for symmetric Age, median for skewed Income minimizes bias.

Why this answer

Option A is correct because using mean imputation for Age (10% missing) and median imputation for Income (30% missing) minimizes bias. Mean is suitable for roughly symmetric distributions with low missingness, while median is robust to outliers and skewness, which is common in income data. This combination reduces distortion of central tendency and preserves data integrity better than uniform methods.

Exam trap

CompTIA often tests the misconception that a single imputation method (e.g., mean for all columns) is universally appropriate, when in fact the choice must consider the missingness rate and the distribution of each feature to minimize bias.

How to eliminate wrong answers

Option B is wrong because deleting all rows with missing values (listwise deletion) reduces sample size and can introduce selection bias, especially when missingness is not completely at random (MCAR). Option C is wrong because replacing Age with the mode is inappropriate for continuous variables like age, as it discards variance and can create artificial clusters; replacing Income with a constant value (e.g., 0) introduces systematic bias and distorts the distribution. Option D is wrong because replacing missing values with zeros for both Age and Income is arbitrary and unrealistic, leading to severe underestimation of central tendency and inflated model error.
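The per-column strategy in Option A can be sketched with the standard library. The sample values and the `impute` helper are illustrative assumptions.

```python
import statistics

def impute(values, strategy):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns: Age is roughly symmetric, Income has one large outlier.
age = [25, 31, None, 40, 36, 28]
income = [32_000, 41_000, None, 38_000, None, 450_000]

age_filled = impute(age, "mean")
income_filled = impute(income, "median")
income_mean = statistics.fmean(v for v in income if v is not None)
print(age_filled[2], income_filled[2], income_mean)
```

The median fill for Income stays near the typical values, whereas the mean is pulled far upward by the single outlier.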

64
MCQhard

A large e-commerce company uses a recommendation system based on collaborative filtering. The system uses a matrix factorization model that is trained nightly on the entire user-item interaction history. Recently, the company launched a flash sale with thousands of new products. Users are reporting that the recommendations are not showing the new products, even for users who have purchased them during the sale. The data engineering team notices that the new products have very few interactions in the training data. The model's loss on the validation set has increased, and the recall@10 metric has dropped from 0.45 to 0.32. The team needs to improve the recommendation of new items without retraining the entire model from scratch every hour. Which approach should the team take?

A.Use a hybrid model that combines collaborative filtering with content-based features from product metadata
B.Retrain the model every hour to incorporate new interactions quickly
C.Remove the new products from the recommendation pool until they accumulate enough interactions
D.Increase the number of latent factors in the matrix factorization model
AnswerA

Content-based features allow the model to recommend new items based on their attributes, overcoming the cold-start problem.

Why this answer

Option A is correct because a hybrid model that combines collaborative filtering with content-based features (e.g., product metadata like category, price, or description) can recommend new products even with zero or very few user interactions. The content-based component leverages item attributes to compute similarity between new and existing items, enabling the system to surface new products without requiring extensive interaction history. This approach addresses the cold-start problem for new items while preserving the collaborative filtering signal for established items, and it does not require retraining the entire model from scratch every hour.

Exam trap

CompTIA often tests the misconception that simply retraining more frequently or increasing model complexity (e.g., more latent factors) can solve the cold-start problem, but the core issue is the lack of interaction data for new items, which requires a content-based or hybrid approach to leverage item metadata.

How to eliminate wrong answers

Option B is wrong because retraining the model every hour would be computationally expensive and operationally impractical for a large-scale system with thousands of new products; it also does not solve the fundamental cold-start issue since new items still have very few interactions in each hourly training window. Option C is wrong because removing new products from the recommendation pool defeats the business purpose of the flash sale, which is to promote and surface new items to users, and it would lead to a poor user experience and lost revenue. Option D is wrong because increasing the number of latent factors in matrix factorization does not address the lack of interaction data for new items; it may even exacerbate overfitting to sparse data and increase computational cost without improving cold-start recommendations.
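One way to picture the hybrid approach is a blended score that falls back to content similarity when an item has no collaborative-filtering signal. This is a simplified sketch, not the company's system: the metadata vectors, the blending weight `alpha`, and the function names are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_score(cf_score, content_score, alpha=0.7):
    """Blend both signals; use content alone when no CF score exists."""
    if cf_score is None:
        return content_score
    return alpha * cf_score + (1 - alpha) * content_score

# Hypothetical metadata vectors: [is_electronics, is_apparel, normalized_price]
user_profile = [1.0, 0.0, 0.6]
new_item = [1.0, 0.0, 0.5]  # launched during the sale, no interactions yet
old_item = [0.0, 1.0, 0.4]  # established item with a CF score

new_score = hybrid_score(None, cosine(user_profile, new_item))
old_score = hybrid_score(0.2, cosine(user_profile, old_item))
print(new_score, old_score)
```

The new item can be ranked from its attributes alone, without waiting for the nightly matrix factorization run to see enough interactions.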

65
MCQhard

An organization needs to store sensitive customer data for training a machine learning model. The data must be encrypted at rest and in transit, and access must be audited. Which combination of practices should be implemented?

A.Use TLS for transfer, AES-256 for storage, and AWS CloudTrail for auditing
B.Use FTP for transfer, AES-128 for storage, and manual log review
C.Use SSH for transfer, store data in a database, and enable access logs
D.Use MD5 for hashing, store data in plaintext, and enable server logs
AnswerA

These provide encryption and auditing.

Why this answer

Option A is correct because it combines TLS (Transport Layer Security) for encrypting data in transit, AES-256 for strong encryption at rest, and AWS CloudTrail for auditing API-level access. TLS ensures confidentiality and integrity during transmission, AES-256 provides robust symmetric encryption for stored data, and CloudTrail logs all AWS API calls for compliance and audit trails. This triad satisfies the requirements of encryption in transit, at rest, and audited access.

Exam trap

CompTIA often tests the distinction between encryption (AES) and hashing (MD5), and the requirement for both in-transit and at-rest encryption, leading candidates to confuse SSH or FTP with proper TLS-based encryption.

How to eliminate wrong answers

Option B is wrong because FTP transfers data in plaintext, offering no encryption in transit, and AES-128 is weaker than AES-256, while manual log review is not scalable or auditable. Option C is wrong because SSH encrypts only the session, not the data at rest, and storing data in a database without specifying encryption at rest leaves it vulnerable; access logs alone do not provide the same audit trail as a dedicated service like CloudTrail. Option D is wrong because MD5 is a hash function, not encryption, and storing data in plaintext violates the encryption-at-rest requirement; server logs are insufficient for comprehensive auditing.

66
MCQmedium

A company wants to forecast monthly sales for the next year using historical sales data over three years. The data shows strong seasonality and a slight upward trend. Which model type is best suited for this task?

A.Simple moving average of the last 12 months
B.ARIMA model without seasonal terms
C.SARIMA model with seasonal order (1,1,1)[12]
D.Linear regression with time as the independent variable
AnswerC

SARIMA explicitly handles both trend and seasonality.

Why this answer

Option C is correct because SARIMA explicitly models seasonality and trend, making it ideal for this scenario. Option A is wrong because a simple moving average ignores trend and seasonality. Option B is wrong because ARIMA can handle trend but not seasonality without differencing at the seasonal lag; SARIMA is more appropriate.

Option D is wrong because linear regression does not capture seasonality well without manual feature engineering.
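The seasonal differencing that SARIMA adds over plain ARIMA can be illustrated without fitting a model. The sketch below builds a synthetic monthly series (a made-up linear trend plus a 12-month cycle) and applies a lag-12 difference, which corresponds to the seasonal differencing order D = 1 with period 12.

```python
import math

def difference(series, lag):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical monthly sales: linear trend plus a 12-month seasonal cycle.
sales = [100 + 2 * t + 20 * math.sin(2 * math.pi * t / 12) for t in range(36)]

# Lag-12 difference removes the repeating yearly pattern.
seasonal_diff = difference(sales, lag=12)
# Lag-1 difference removes what is left of the trend.
stationary = difference(seasonal_diff, lag=1)

print([round(v, 6) for v in seasonal_diff[:3]])
print([round(v, 6) for v in stationary[:3]])
```

After the lag-12 difference only the constant yearly growth remains, and one further ordinary difference leaves a series close to zero. This covers only the differencing step; the autoregressive and moving-average terms are not shown.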

67
Multi-Selecthard

Which TWO are best practices for versioning machine learning models? (Choose 2)

Select 2 answers
A.Use the same model version for all deployments
B.Tag each model with training date, hyperparameters, and performance metrics
C.Use a version control system (e.g., Git) for model code and configuration
D.Store only the final model binary without metadata
E.Manually rename model files with version numbers
AnswersB, C

Metadata enables comparison and audit.

Why this answer

Option B is correct because tagging each model with training date, hyperparameters, and performance metrics creates a reproducible audit trail. This practice aligns with MLOps principles, enabling teams to trace model behavior back to specific training runs and compare versions objectively.

Exam trap

CompTIA often tests the misconception that versioning is only about file naming or storing the binary, when in fact it requires a comprehensive metadata and code tracking system to ensure reproducibility and traceability.
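A minimal sketch of tagging a model with metadata follows. The record layout, the helper `build_model_record`, and all values are hypothetical; a content hash of the artifact is added here as one possible way to tie the metadata to a specific binary.

```python
import hashlib
import json

def build_model_record(model_bytes, training_date, hyperparameters, metrics):
    """Bundle a content hash of the model artifact with its training metadata."""
    return {
        "artifact_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "training_date": training_date,
        "hyperparameters": hyperparameters,
        "metrics": metrics,
    }

# Placeholder values for illustration only.
record = build_model_record(
    model_bytes=b"placeholder for serialized model weights",
    training_date="2024-05-01",
    hyperparameters={"learning_rate": 0.001, "epochs": 20},
    metrics={"f1": 0.87},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

A JSON record like this can be committed alongside the model code and configuration, so each deployed version can be traced to a training run.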

68
MCQeasy

A data scientist is preparing a dataset for training a classification model. The dataset contains 10,000 records with a binary target variable where 9,500 belong to class A and 500 belong to class B. Which technique should the scientist use to address the class imbalance?

A.SMOTE (Synthetic Minority Oversampling Technique)
B.Random undersampling of class A
C.Adding Gaussian noise to class B
D.Principal Component Analysis (PCA)
AnswerA

SMOTE creates synthetic minority samples to balance classes.

Why this answer

SMOTE is the correct technique because it generates synthetic samples for the minority class (class B) by interpolating between existing minority instances, effectively balancing the dataset without losing information. This approach avoids the overfitting risk of simple oversampling and the information loss of undersampling, making it ideal for a 19:1 imbalance ratio.

Exam trap

CompTIA often tests the misconception that any data augmentation (like adding noise) or dimensionality reduction (like PCA) can solve class imbalance, when in fact only resampling techniques like SMOTE directly address the skewed distribution of the target variable.

How to eliminate wrong answers

Option B is wrong because random undersampling of class A discards 9,000 majority class records, leading to significant information loss and potential bias in the model. Option C is wrong because adding Gaussian noise to class B does not create meaningful synthetic samples; it merely corrupts existing minority data, which can reduce model performance and introduce unrealistic variance. Option D is wrong because PCA is a dimensionality reduction technique used for feature extraction or noise reduction, not for addressing class imbalance in the target variable.
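
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the reference algorithm or any library's implementation. The function name smote_sketch, the sample points, and the choice of k are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up minority-class samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7)]
new_points = smote_sketch(minority, n_new=10)
```

Each synthetic point lies on a line segment between two real minority samples, which is why SMOTE produces plausible new examples instead of exact copies. In practice resampling should be applied to the training split only, so that validation data stays untouched.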

69
Multi-Selectmedium

A computer vision team is building an image classifier for rare wildlife species. The dataset has only 500 images per class, and the model overfits. Which THREE data augmentation techniques are most likely to reduce overfitting? (Choose three.)

Select 3 answers
A.Horizontal flip
B.Adding Gaussian noise
C.Random cropping
D.Color jitter (brightness, contrast, saturation)
E.Random rotation by ±10 degrees
AnswersA, D, E

Flipping is a standard augmentation that doubles the dataset size.

Why this answer

Horizontal flip is a simple and effective data augmentation technique that doubles the training data by mirroring images, which helps the model generalize better to variations in orientation. This is particularly useful for wildlife images where the animal may appear facing left or right, reducing overfitting by exposing the model to more diverse examples without collecting new data. Color jitter and small random rotations of ±10 degrees work the same way: they vary lighting and pose while leaving the class label unchanged.

Exam trap

CompTIA often tests the distinction between augmentations that preserve class labels (like flips and rotations) versus those that may alter semantic content (like extreme cropping or noise), leading candidates to overestimate the effectiveness of Gaussian noise for overfitting reduction.
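
A minimal sketch of label-preserving augmentation follows, assuming a grayscale image stored as nested lists of pixel values in [0, 1]. Only the flip and the brightness part of color jitter are shown; rotation is omitted because it needs interpolation. The function names and the max_delta value are illustrative.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def brightness_jitter(image, rng, max_delta=0.2):
    """Scale all pixels by a random factor in [1 - max_delta, 1 + max_delta],
    clamping the result to the valid range [0, 1]."""
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, px * factor)) for px in row] for row in image]


def augment(image, rng):
    """Apply a random flip (50% chance) followed by brightness jitter."""
    out = horizontal_flip(image) if rng.random() < 0.5 else image
    return brightness_jitter(out, rng)


# Tiny 2x3 "image" used only for illustration.
image = [[0.1, 0.5, 0.9],
         [0.2, 0.6, 1.0]]

flipped = horizontal_flip(image)
augmented = augment(image, random.Random(0))
```

Because the transformations are drawn at random each time an image is loaded, the model rarely sees the exact same pixels twice, even though the underlying dataset still holds 500 images per class.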

70
MCQmedium

A logistics company uses a machine learning model to predict delivery times based on historical data including distance, traffic, weather, and driver performance. The model is deployed as a REST API using Flask and run on a single server. Recently, the model has been returning predictions with high latency (over 2 seconds) during peak hours when the API receives 500 requests per second. The server has 8 CPU cores and 32 GB RAM. The model is a gradient boosting model (XGBoost) with 500 trees. The engineer wants to reduce inference latency to under 500ms without retraining the model. Which action is most effective?

A.Prune the model by reducing the number of trees to 100 and limit tree depth
B.Replace XGBoost with a linear regression model
C.Scale horizontally by deploying additional servers behind a load balancer
D.Increase server RAM to 128 GB
AnswerA

Pruning reduces computational load and latency while often maintaining adequate accuracy.

Why this answer

Option A is correct. Pruning reduces the number of trees evaluated for each prediction, which directly lowers inference time. Option C is wrong because adding more servers (horizontal scaling) addresses throughput but not per-request latency; it may help if the bottleneck is CPU, but pruning is more efficient.

Option B is wrong because using a simpler model (linear regression) would require retraining and likely lose accuracy. Option D is wrong because increasing server memory does not speed up CPU-bound tree inference.
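
The cost of a boosted ensemble at inference time is roughly proportional to the number of trees evaluated. The toy sketch below, which is not XGBoost and uses invented stumps, shows prediction with only the first n_trees of an already trained ensemble. The helper names and the leaf values are assumptions chosen so that later trees contribute less, as often happens in boosting.

```python
def make_stump(feature, threshold, left, right):
    """Return a one-split tree: left value if x[feature] < threshold."""
    def stump(x):
        return left if x[feature] < threshold else right
    return stump


def predict(trees, x, base_score=0.0, n_trees=None):
    """Sum the outputs of the first n_trees trees (all trees if None)."""
    used = trees if n_trees is None else trees[:n_trees]
    return base_score + sum(tree(x) for tree in used)


# 500 invented stumps whose leaf magnitudes shrink as 1 / (i + 1).
trees = [
    make_stump(i % 2, 0.5, -1.0 / (i + 1), 1.0 / (i + 1))
    for i in range(500)
]

x = (0.7, 0.2)
full = predict(trees, x)               # evaluates 500 trees
fast = predict(trees, x, n_trees=100)  # evaluates 100 trees, ~5x less work
```

In this toy example the truncated prediction differs from the full one by less than 0.01 while doing a fifth of the work. On a real model the accuracy impact has to be measured on held-out data before the change is deployed.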

71
MCQmedium

Refer to the exhibit. A data scientist reviews the pipeline and notes that the model performance degraded. Which change to the pipeline would most likely improve model performance?

A.Change the impute strategy from mean to median for the 'income' column.
B.Remove the normalization step entirely.
C.Drop the 'product_category' column instead of one-hot encoding.
D.Change the encoding method from onehot to label encoding.
AnswerA

Income often has outliers; median is less affected by extremes.

Why this answer

Imputing missing values with the mean is sensitive to outliers; using the median is more robust. Removing normalization or dropping product_category would lose information or harm scaling. Changing to label encoding on a nominal category could introduce false ordinal relationships.
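
A small worked example shows why the median is the safer fill value for a skewed column. The income figures below are invented, with one extreme outlier, and the impute helper is illustrative.

```python
import statistics

# Invented 'income' column: two missing values and one extreme outlier.
incomes = [42_000, 38_500, None, 51_000, 47_250, None, 2_400_000]

observed = [v for v in incomes if v is not None]
mean_fill = statistics.mean(observed)      # pulled upward by the outlier
median_fill = statistics.median(observed)  # middle of the sorted values


def impute(values, fill):
    """Replace every missing entry (None) with the given fill value."""
    return [fill if v is None else v for v in values]


imputed_mean = impute(incomes, mean_fill)
imputed_median = impute(incomes, median_fill)
```

Here the mean is 515,750, a value that no typical record comes close to, while the median is 47,250. Filling with the mean would place the missing rows far above almost every observed income.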

72
MCQeasy

A data engineer discovers that a dataset contains duplicate rows. Which data cleaning step is MOST appropriate?

A.Keep only the first occurrence.
B.Fill duplicates with the mean.
C.Remove duplicate rows.
D.Convert duplicates to categorical.
AnswerC

Removing duplicates ensures each observation is unique.

Why this answer

Removing duplicate rows is the most appropriate data cleaning step because duplicate rows can bias statistical analyses and machine learning models by overrepresenting certain observations. In data engineering, deduplication is a standard preprocessing step to ensure data integrity and avoid skewed results. Option C directly addresses this by eliminating redundant entries without introducing artificial values or altering the data distribution.

Exam trap

CompTIA often tests the misconception that 'keeping the first occurrence' is a valid deduplication strategy, but in data engineering, this is arbitrary and can lead to data loss or bias, whereas explicit removal is the standard practice.

How to eliminate wrong answers

Option A is wrong because keeping only the first occurrence arbitrarily discards potentially valid later occurrences without considering context, which can introduce bias if duplicates are not truly identical. Option B is wrong because filling duplicates with the mean is nonsensical—duplicates are entire rows, not missing values, and imputing a mean would corrupt the dataset by replacing valid data with an aggregate. Option D is wrong because converting duplicates to categorical does not resolve the issue of overrepresentation; it merely relabels the problem without removing the redundant rows.
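
A minimal sketch of removing duplicate rows, assuming duplicates are exact copies across every column, so nothing is lost whichever copy of a repeated row is retained. The sample rows and the function name are made up.

```python
def remove_duplicate_rows(rows):
    """Return the distinct rows in their original order.

    Rows are compared on all columns; rows that differ in any column
    are treated as different observations and are all retained.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented order records: (user_id, order_date, amount).
rows = [
    ("u1", "2024-01-03", 19.99),
    ("u2", "2024-01-03", 5.00),
    ("u1", "2024-01-03", 19.99),  # exact copy of the first row
    ("u3", "2024-01-04", 42.50),
    ("u2", "2024-01-03", 5.00),   # exact copy of the second row
]

clean = remove_duplicate_rows(rows)
removed = len(rows) - len(clean)
```

Logging the removed count is a cheap sanity check: an unexpectedly large number usually points to an upstream problem such as a join that multiplied rows.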

73
MCQeasy

An e-commerce company deploys a model to recommend products to users. The recommendation system uses collaborative filtering based on user-item interaction history. After deployment, the model shows decreasing click-through rates (CTR) over time. The data engineer notices that the model was trained on data from the past six months and is retrained daily. However, the trend suggests that user preferences are shifting more rapidly than expected. The engineer suspects that the model is suffering from distribution drift. Which approach should the engineer implement to adapt the model more quickly to changing user behavior?

A.Increase the retraining period to once per week to reduce computational cost
B.Switch to an online learning algorithm that updates the model after each user click
C.Increase the model complexity by adding more features and layers
D.Use only the last week of data for training to focus on recent trends
AnswerB

Online learning continuously adapts to new data, capturing shifts in user preferences promptly.

Why this answer

Option B is correct. Online learning allows the model to update incrementally with each new interaction, adapting quickly to changes. Option A is wrong because retraining weekly is slower than the current daily schedule and would make the model adapt even less quickly.

Option D is wrong because using only last week's data may not provide enough data and could be noisy. Option C is wrong because increasing model complexity may cause overfitting and is not a direct solution to drift.
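
To illustrate the per-click update, the sketch below uses a tiny logistic model trained with one stochastic gradient step per interaction. This is a deliberately simplified stand-in for the collaborative filtering system in the question. The class name, the feature vector, and the learning rate are assumptions.

```python
import math


class OnlineLogisticModel:
    """Click-probability model updated after every single interaction."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, clicked):
        """One gradient step on the log loss for a single example."""
        error = self.predict_proba(x) - clicked
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


model = OnlineLogisticModel(n_features=2)
item = (1.0, 0.0)  # hypothetical feature vector for one product

before = model.predict_proba(item)
for _ in range(50):            # users click the item: preference rises
    model.update(item, clicked=1)
after_clicks = model.predict_proba(item)

for _ in range(200):           # preferences shift: users stop clicking
    model.update(item, clicked=0)
after_shift = model.predict_proba(item)
```

The predicted click probability rises while users click and falls again once they stop, without any batch retraining in between. That is the behavior that lets online learning track fast-moving preferences.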

74
Multi-Selectmedium

Which TWO techniques are commonly used for feature selection in machine learning? (Choose 2)

Select 2 answers
A.Principal Component Analysis (PCA)
B.SMOTE
C.L1 regularization (Lasso)
D.Dropout
E.Recursive Feature Elimination (RFE)
AnswersC, E

Lasso can zero out coefficients.

Why this answer

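L1 regularization (Lasso) is correct because it adds a penalty proportional to the sum of the absolute values of the coefficients, which can shrink some coefficients exactly to zero, effectively performing feature selection by removing irrelevant features from the model. This makes it a built-in feature selection technique within the training process. Recursive Feature Elimination (RFE) is correct because it repeatedly fits a model, ranks the features by importance, and removes the weakest ones until the desired number remains.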

Exam trap

CompTIA often tests the distinction between dimensionality reduction (PCA) and feature selection, where candidates mistakenly think PCA selects original features rather than creating new ones.
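
The reason Lasso produces exact zeros can be seen in the soft-thresholding operator, the shrinkage step that coordinate-descent Lasso solvers apply to each coefficient. The sketch below applies it once to a set of made-up coefficients. The feature names, values, and penalty are illustrative, and this is not a full Lasso fit.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return exactly 0.0 when
    the magnitude of value does not exceed the penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Invented unpenalized coefficients for four features.
coefficients = {"age": 0.9, "income": -0.35, "zip_digit": 0.04, "noise": -0.02}
penalty = 0.1

shrunk = {name: soft_threshold(w, penalty) for name, w in coefficients.items()}

# Features whose coefficient survived the shrinkage are the selected ones.
selected = [name for name, w in shrunk.items() if w != 0.0]
```

The two weak coefficients are set to exactly zero, so only age and income remain. PCA, by contrast, would return new axes that mix all four original features.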

75
MCQhard

A deep learning model for image classification achieves 99% training accuracy but only 85% validation accuracy. The model has millions of parameters. Which technique is most likely to reduce overfitting while maintaining high accuracy?

A.Reduce batch size from 32 to 8
B.Decrease the learning rate by a factor of 10
C.Add dropout layers with a rate of 0.5 after each convolutional block
D.Increase the number of training epochs to 500
AnswerC

Dropout is a standard regularization technique for deep networks.

Why this answer

Option C is correct because dropout randomly deactivates neurons during training, acting as regularization and reducing overfitting. Option D is wrong because increasing epochs further will likely worsen overfitting. Option A is wrong because reducing batch size can increase training noise but is not a primary anti-overfitting technique.

Option B is wrong because reducing learning rate may help convergence but does not directly combat overfitting caused by model capacity.
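
For reference, here is a minimal sketch of what a dropout layer does, using the common "inverted dropout" formulation on a plain list of activations. The function signature and the example values are assumptions; deep learning frameworks provide their own dropout layers.

```python
import random


def dropout(activations, rate, training, rng):
    """Inverted dropout: during training, zero each activation with
    probability `rate` and scale the survivors by 1 / (1 - rate) so the
    expected value is unchanged. At inference time, pass values through."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)
activations = [1.0] * 1000

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)
```

With a rate of 0.5, about half of the activations are zeroed on each training pass and the rest are doubled, while evaluation leaves them untouched. Because a different random subset is dropped every step, the network cannot rely on any single neuron.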
