Knowledge + Practice

CCNA AI Models and Data Engineering Questions

20 of 95 questions · Page 2/2 · AI Models and Data Engineering · Answers revealed

76

MCQ · medium

A real-time recommendation system requires low latency. Which data storage strategy is best for serving user profiles and item embeddings?

A. Time-series database (e.g., InfluxDB)

B. In-memory key-value store (e.g., Redis)

C. Relational database with joins

D. Data lake in object storage (e.g., Amazon S3)

Answer: B

In-memory key-value stores offer sub-millisecond reads, suitable for real-time serving.

Why this answer

An in-memory key-value store like Redis is ideal for serving user profiles and item embeddings in a real-time recommendation system because it provides sub-millisecond read/write latency by keeping data in RAM. This directly meets the low-latency requirement for fetching embeddings and profiles on each request, without the overhead of disk I/O or complex query processing.

Exam trap

CompTIA often tests the misconception that any database associated with real-time or high-throughput workloads (like InfluxDB) is suitable for real-time serving. The trap is that time-series databases prioritize write throughput and range queries over low-latency point lookups, which are the actual requirement here.

How to eliminate wrong answers

Option A is wrong because time-series databases like InfluxDB are optimized for timestamped metrics (e.g., CPU usage, sensor data), not for low-latency key-based lookups of user profiles or embeddings; their indexing and query model add overhead this workload does not need. Option C is wrong because a relational database that must parse, plan, and execute joins on each request, usually against disk-backed storage, typically adds far more latency than an in-memory key-value lookup, making it a poor fit for real-time serving. Option D is wrong because a data lake in object storage like Amazon S3 is accessed through HTTP API calls with comparatively high per-request latency (often tens to hundreds of milliseconds) and is designed for batch analytics, not for serving individual records in real time.

77

MCQ · easy

An engineer is building a regression model to predict housing prices. The dataset includes features such as square footage, number of bedrooms, and year built. The engineer notices that the square footage values range from 500 to 10,000, while the number of bedrooms ranges from 1 to 5. Which preprocessing step is most critical before training a gradient descent-based model?

A. Use k-fold cross-validation

B. Apply log transformation to all features

C. Normalize or standardize the features

D. One-hot encode the features

Answer: C

Scaling improves convergence of gradient descent.

Why this answer

Gradient descent-based models are sensitive to the scale of input features because they update weights proportionally to the gradient, which is influenced by feature magnitudes. With square footage ranging 500–10,000 and bedrooms 1–5, the larger feature will dominate the gradient, causing slow or unstable convergence. Normalizing or standardizing (e.g., Z-score or min-max scaling) ensures all features contribute equally, leading to faster and more reliable training.

Exam trap

CompTIA often tests the misconception that any data transformation (like log or one-hot encoding) is universally beneficial, but the key is matching the preprocessing step to the model's mathematical requirements—here, gradient descent's sensitivity to scale makes normalization/standardization the critical step.

How to eliminate wrong answers

Option A is wrong because k-fold cross-validation is a model evaluation technique to assess generalization, not a preprocessing step to address feature scale issues. Option B is wrong because log transformation is used to handle skewed distributions or multiplicative relationships, not to rescale features with different ranges; applying it to all features (including integer counts like bedrooms) can distort their meaning and is unnecessary for gradient descent scaling. Option D is wrong because one-hot encoding is used for categorical features to convert them into binary vectors, but the features listed (square footage, bedrooms, year built) are all numerical and do not require encoding.
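
The two scaling options named above can be sketched in a few lines of standard-library Python. This is only an illustration: the sample values and the helper names `standardize` and `min_max` are assumptions made for the example, not part of the question.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical rows spanning the ranges described in the question.
square_feet = [500, 1200, 2500, 10000]
bedrooms = [1, 2, 3, 5]

sqft_z = standardize(square_feet)   # mean 0, std 1
beds_z = standardize(bedrooms)      # mean 0, std 1
sqft_mm = min_max(square_feet)      # values in [0, 1]
```

After either transform, both features live on comparable scales, so neither dominates the gradient purely because of its units.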

78

MCQ · hard

An AI team notices that a model's F1 score on the validation set is 0.95, but on the test set it drops to 0.72. Which course of action is most appropriate?

A. Reduce the training dataset size.

B. Adjust the train/test split to be more representative.

C. Increase model complexity.

D. Apply regularization.

Answer: D

Regularization penalizes large weights or complex structures, reducing overfitting and improving generalization.

Why this answer

The F1 score dropping from 0.95 on the validation set to 0.72 on the test set is a classic sign of overfitting, where the model has memorized the training/validation data but fails to generalize to unseen test data. Applying regularization (e.g., L1/L2 weight decay, dropout) is the most appropriate course of action because it penalizes overly complex models, reduces variance, and improves generalization without requiring more data or changing the model architecture.

Exam trap

CompTIA often tests the distinction between overfitting (high variance) and underfitting (high bias), and the trap here is that candidates may incorrectly choose to increase model complexity (Option C) because they focus on the high validation score rather than recognizing the performance drop as a variance problem.

How to eliminate wrong answers

Option A is wrong because reducing the training dataset size would likely worsen overfitting by providing even less data for the model to learn generalizable patterns, increasing variance. Option B is wrong because adjusting the train/test split to be more representative does not address the underlying overfitting issue; the model's poor test performance is due to high variance, not a biased or unrepresentative split. Option C is wrong because increasing model complexity (e.g., adding more layers or parameters) would exacerbate overfitting, further increasing the gap between validation and test performance.
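
To make the effect of an L2 penalty concrete, here is a minimal sketch that fits a one-feature linear model by gradient descent with and without the penalty. The data, the function name `fit_linear`, and the penalty strength are hypothetical values chosen for illustration.

```python
def fit_linear(xs, ys, l2=0.0, lr=0.01, epochs=2000):
    """Fit y ~ w*x + b by gradient descent on MSE + l2 * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # The L2 term adds 2 * l2 * w to the weight gradient (bias is not penalized).
        grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n + 2 * l2 * w
        grad_b = sum(2 * e for e in errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data, roughly y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.2, 3.9, 6.1, 8.0]

w_plain, _ = fit_linear(xs, ys)          # no penalty
w_ridge, _ = fit_linear(xs, ys, l2=1.0)  # penalized weight is pulled toward zero
```

The penalized weight ends up smaller in magnitude than the unpenalized one, which is the mechanism by which regularization trades a little bias for lower variance.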

79

Multi-Select · hard

A data engineer is designing a pipeline for a streaming data application that uses a machine learning model to detect anomalies in real time. Which TWO practices should the engineer implement to ensure data quality and model reliability?

Select 2 answers

A. Use batch processing to transform data in fixed intervals

B. Store all raw data indefinitely for future analysis

C. Use a sliding window for feature computation

D. Implement data validation checks at the ingestion point

E. Retrain the model on a fixed schedule every 24 hours

Answers: C, D

Sliding windows allow the model to use the most recent data for accurate anomaly detection.

Why this answer

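Option C is correct because streaming anomaly detection requires real-time feature computation over recent data, and a sliding window ensures that only the most relevant data points are used for model inference, maintaining low latency and adapting to concept drift. This approach avoids the staleness of batch processing and aligns with the continuous nature of streaming pipelines. Option D is correct because validating records at the ingestion point (for example, checking schema, types, and value ranges) stops malformed data before it reaches feature computation or the model, which protects both data quality and prediction reliability.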

Exam trap

CompTIA often tests the misconception that batch processing or fixed retraining schedules are sufficient for real-time streaming applications, when in fact sliding windows and continuous validation are required to maintain low latency and model accuracy.
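
A rough sketch of how the two correct practices fit together: records are validated on ingestion, and a fixed-length sliding window supplies the statistics used to score each new value. The event shape, the window length, the z-score rule, and the names `is_valid` and `detect` are all assumptions made for this example; a production pipeline would run inside a stream processor rather than a Python generator.

```python
import math
from collections import deque
from statistics import mean, pstdev

WINDOW_SIZE = 5  # hypothetical window length, in number of readings

def is_valid(event):
    """Ingestion check: reject events whose value is missing or not a finite number."""
    value = event.get("value")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isfinite(value)

def detect(events, z_threshold=3.0):
    """Yield (event, is_anomaly), scoring each value against the previous window."""
    window = deque(maxlen=WINDOW_SIZE)  # old readings fall out automatically
    for event in events:
        if not is_valid(event):
            continue  # drop bad records before they reach the model
        value = event["value"]
        anomaly = False
        if len(window) == WINDOW_SIZE:
            mu, sigma = mean(window), pstdev(window)
            anomaly = sigma > 0 and abs(value - mu) / sigma > z_threshold
        window.append(value)
        yield event, anomaly

stream = [{"value": v} for v in [10, 11, 10, 12, 11, 10, 50, 11]] + [{"value": None}]
flags = [flag for _, flag in detect(stream)]
```

Here the reading of 50 is flagged, and the record with a missing value is discarded at ingestion instead of corrupting the window.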

80

MCQ · hard

A deep learning model for image classification is overfitting due to a small dataset. The team decides to apply data augmentation. Which augmentation technique is least likely to preserve the label?

A. Adding random noise

B. Random rotation

C. Random cropping and rescaling

D. Horizontal flip

Answer: C

Cropping might remove the object of interest, rendering the label invalid.

Why this answer

Random cropping and rescaling is least likely to preserve the label because it can cut out the primary object or distort its proportions, potentially removing the discriminative features needed for correct classification. For example, cropping a dog image to show only the background or a leg could change the semantic meaning, making the label 'dog' incorrect. In contrast, other techniques like noise, rotation, or flipping typically retain the core subject and its label.

Exam trap

CompTIA often tests the misconception that all augmentations are equally label-preserving, but the trap here is that random cropping can alter the semantic content by removing the object, while other transformations like rotation or flipping maintain the object's presence and identity.

How to eliminate wrong answers

Option A is wrong because adding random noise preserves the label; it introduces pixel-level variations that help generalize without altering the object's identity or spatial layout. Option B is wrong because random rotation generally preserves the label; rotating an image by small or moderate angles does not change the object class for most natural-image tasks (orientation-sensitive classes, such as the digits 6 and 9, are the exception). Option D is wrong because horizontal flip preserves the label; mirroring an image does not change the object's category (e.g., a cat remains a cat), and it is a standard augmentation for symmetric or non-directional objects.
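
The difference between flipping and cropping can be illustrated on a toy image. In this sketch the "image" is a 6x6 grid of 0s and 1s with the object in one corner; the grid, the crop size, and the helper names are hypothetical, and the rescaling step is omitted for brevity.

```python
import random

def horizontal_flip(image):
    """Mirror each row; every pixel is kept, only its position changes."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng):
    """Cut a random size x size patch out of the image."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def contains_object(image):
    """True if any object pixel (value 1) survives in the image."""
    return any(pixel for row in image for pixel in row)

# Toy image: the object occupies the top-left 2x2 block.
image = [[1 if r < 2 and c < 2 else 0 for c in range(6)] for r in range(6)]

flipped = horizontal_flip(image)
rng = random.Random(0)  # seeded for repeatability
crops = [random_crop(image, 3, rng) for _ in range(200)]
lost = sum(not contains_object(crop) for crop in crops)  # crops with no object left
```

The flip always keeps the object, while a share of the random crops contain only background, which is exactly the case where the original label no longer describes the image.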

81

MCQ · hard

A data pipeline ingests streaming data from IoT sensors. The current batch processing pipeline causes stale predictions. Which architecture change is most appropriate?

A. Use a larger batch interval

B. Revert to micro-batch processing with Apache Spark

C. Store raw data in Hadoop HDFS

D. Implement Apache Kafka and stream processing

Answer: D

Kafka combined with stream processing (e.g., Kafka Streams, Flink) enables real-time ingestion and prediction.

Why this answer

Option D is correct because stream processing (e.g., Kafka with a stream processor such as Kafka Streams or Flink) handles data as it arrives, reducing latency. Option B (micro-batch) is still batch processing with small intervals, so it is a weaker fit than true streaming. Option A worsens staleness. Option C stores raw data but does not process it any faster.

82

Multi-Select · medium

A data engineer is designing a feature store for machine learning. Which THREE components are essential for a feature store? (Choose THREE.)

Select 3 answers

A. Data ingestion pipeline

B. Online serving layer

C. Feature repository

D. Experiment tracking

E. Model registry

Answers: A, B, C

The ingestion pipeline brings new data sources into the feature store to keep features current.

Why this answer

A feature store requires a feature repository to store feature definitions and values, an online serving layer to serve features at low latency for inference, and a data ingestion pipeline to continuously update features. Model registry and experiment tracking are important for MLOps but are not core components of a feature store.

83

MCQ · medium

A machine learning model for credit card fraud detection is deployed. The model's precision is 0.95 and recall is 0.60. The business cost of missing a fraud is very high. Which of the following should the team prioritize to reduce the number of false negatives?

A. Use a different model algorithm.

B. Add more features.

C. Increase the classification threshold.

D. Decrease the classification threshold.

Answer: D

Lower threshold classifies more cases as positive, thus catching more actual frauds (reducing false negatives).

Why this answer

Decreasing the classification threshold makes the model more sensitive, classifying more transactions as fraudulent. This increases recall (reducing false negatives) at the cost of precision. Given the high cost of missing fraud, lowering the threshold is the direct way to capture more true positives, even if it increases false positives.

Exam trap

CompTIA often tests the misconception that improving model accuracy or changing algorithms is the primary fix, when in fact adjusting the decision threshold is the simplest and most effective way to address precision-recall trade-offs for high-cost false negatives.

How to eliminate wrong answers

Option A is wrong because simply switching algorithms does not guarantee a reduction in false negatives; the threshold and cost function matter more. Option B is wrong because adding more features may improve overall model performance but does not directly target the trade-off between precision and recall; it could even increase false negatives if the new features are noisy. Option C is wrong because increasing the classification threshold makes the model more conservative, reducing false positives but increasing false negatives, which is the opposite of what is needed.
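
The precision-recall trade-off described above is easy to reproduce on a handful of scored transactions. The labels, scores, and thresholds below are hypothetical and chosen only to make the effect visible.

```python
def precision_recall(labels, scores, threshold):
    """Return (precision, recall, false_negatives) at a given decision threshold."""
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_fraud = score >= threshold
        if predicted_fraud and label:
            tp += 1
        elif predicted_fraud and not label:
            fp += 1
        elif not predicted_fraud and label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fn

# 1 = fraud, 0 = legitimate; scores are hypothetical model probabilities.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.40, 0.35, 0.60, 0.30, 0.20, 0.10, 0.05]

high = precision_recall(labels, scores, threshold=0.7)  # strict threshold
low = precision_recall(labels, scores, threshold=0.3)   # lenient threshold
```

With the strict threshold the model misses three of the five frauds; with the lenient one it catches all five but also flags two legitimate transactions, so recall rises while precision falls.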

84

MCQ · hard

A data scientist is working with a dataset that has 10,000 features but only 500 samples. The goal is to train a model for binary classification. Which feature selection technique is MOST appropriate to reduce overfitting?

A. Univariate selection using chi-squared test.

B. Recursive Feature Elimination (RFE) with cross-validation.

C. Using all features with L2 regularization.

D. Principal Component Analysis (PCA) without feature selection.

Answer: B

RFE with CV selects subset of features most relevant, reducing overfitting in small-sample settings.

Why this answer

Option B is correct because Recursive Feature Elimination with cross-validation iteratively removes features and evaluates performance, which is robust for high-dimensional small-sample data. Option A (univariate selection) scores each feature in isolation and may select irrelevant features. Option C (L2 regularization) does not reduce the number of features. Option D (PCA) creates components that may still overfit.

85

MCQ · hard

Refer to the exhibit. A data scientist reviews the MLflow run for a Random Forest model on customer churn data. What is the most likely issue with this model?

A. The model is underfitting because training accuracy is too high.

B. The model is overfitting because there is a large gap between train and validation accuracy.

C. The model is performing well because validation accuracy is above 0.8.

D. The model has a data leak because dataset version is v2.

Answer: B

High train accuracy with lower validation accuracy is classic overfitting.

Why this answer

Option B is correct because a large gap between training accuracy (e.g., 0.99) and validation accuracy (e.g., 0.82) indicates that the Random Forest model has memorized the training data but fails to generalize to unseen validation data. This is the classic symptom of overfitting, where the model captures noise rather than the underlying pattern. In MLflow, comparing train and validation metrics directly reveals this discrepancy.

Exam trap

CompTIA often tests the misconception that high validation accuracy alone indicates a good model, ignoring the critical comparison between training and validation metrics to detect overfitting.

How to eliminate wrong answers

Option A is wrong because underfitting is characterized by low training accuracy, not high training accuracy; high training accuracy with poor validation performance indicates overfitting, not underfitting. Option C is wrong because a validation accuracy above 0.8 alone does not guarantee good model performance if there is a significant gap between train and validation accuracy, which signals overfitting. Option D is wrong because dataset version v2 is simply a versioning label and does not inherently cause data leakage; data leakage would involve information from the validation set leaking into training, which is unrelated to the version number.

86

Multi-Select · hard

A data engineer is designing a data pipeline for a real-time recommendation system. The pipeline must handle high velocity streams and ensure data quality. Which three components should be included in the pipeline? (Select THREE).

Select 3 answers

A. A stream processing engine like Apache Kafka Streams

B. A data validation step to check schema compliance

C. A data warehouse for historical analysis

D. A batch processing framework like Apache Spark

E. A message queue for buffering

Answers: A, B, E

Stream processing engines process data in real-time with low latency.

Why this answer

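Apache Kafka Streams is a correct choice because it is a stream processing library specifically designed for building real-time applications and microservices that process data in motion. For a high-velocity recommendation pipeline, it provides exactly-once semantics, stateful processing (e.g., windowed joins, aggregations), and seamless integration with Kafka topics, enabling low-latency transformations without requiring a separate processing cluster. A data validation step that checks schema compliance (Option B) addresses the data quality requirement, and a message queue (Option E) buffers bursts in a high-velocity stream so downstream consumers are not overwhelmed.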

Exam trap

CompTIA often tests the distinction between stream processing and batch processing, and the trap here is that candidates mistakenly select a batch framework like Apache Spark or a data warehouse because they associate 'data pipeline' with traditional ETL, overlooking the strict real-time and low-latency requirements of the scenario.

87

MCQ · hard

An engineer is training a neural network and observes the output shown. Which conclusion is most likely correct?

A. The gradients are vanishing.

B. The model is overfitting after epoch 2.

C. The model is underfitting.

D. The learning rate is too high.

Answer: B

Training loss decreases, validation loss increases.

Why this answer

The output shows training loss decreasing while validation loss increases after epoch 2, which is a classic sign of overfitting. The model begins to memorize the training data rather than generalize, leading to poor performance on unseen data. This pattern confirms that overfitting starts after epoch 2, making option B correct.

Exam trap

CompTIA often tests the distinction between overfitting and underfitting by presenting a loss curve where training loss decreases but validation loss increases, leading candidates to mistakenly attribute the issue to vanishing gradients or a high learning rate.

How to eliminate wrong answers

Option A is wrong because vanishing gradients typically cause slow or stalled learning across all epochs, not a sudden divergence between training and validation loss after epoch 2. Option C is wrong because underfitting would show high training loss and high validation loss throughout, not a decreasing training loss with increasing validation loss. Option D is wrong because a learning rate that is too high would cause the loss to oscillate or diverge from the start, not show a clear overfitting pattern after epoch 2.

88

MCQ · medium

A data scientist is training a deep learning model for image classification. The training loss decreases steadily but the validation loss starts increasing after 10 epochs. Which technique should the scientist apply to address this issue?

A. Add more dropout layers

B. Reduce the learning rate

C. Implement early stopping

D. Increase the number of training epochs

Answer: C

Early stopping halts training when validation loss stops improving, preventing overfitting.

Why this answer

The scenario describes overfitting: the model memorizes training data (loss decreases) but fails to generalize to unseen validation data (validation loss increases). Early stopping (Option C) halts training when validation performance degrades, preventing overfitting while preserving the best model weights. This is a standard regularization technique in deep learning frameworks like TensorFlow and PyTorch.

Exam trap

CompTIA often tests the distinction between preventive regularization (dropout, L2) and reactive overfitting control (early stopping), leading candidates to choose dropout or learning rate reduction when the scenario explicitly describes overfitting that has already begun.

How to eliminate wrong answers

Option A is wrong because adding more dropout layers can help regularize the model, but it is not the direct solution for the described symptom of validation loss increasing after a certain epoch; dropout is a preventive measure applied before training, not a reactive fix for overfitting that has already occurred. Option B is wrong because reducing the learning rate may slow down convergence or help escape local minima, but it does not address the core issue of overfitting; a lower learning rate can even exacerbate overfitting by allowing the model to fit noise more precisely. Option D is wrong because increasing the number of training epochs would worsen the overfitting problem, as the model would continue to memorize training data and further diverge from validation performance.
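
Early stopping is usually implemented with a patience counter. The sketch below applies that rule to a list of per-epoch validation losses; the loss values, the patience of 2, and the function name are assumptions for illustration, and real frameworks provide their own callbacks for this.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once validation loss has failed to improve for
    `patience` consecutive epochs; stop_epoch is None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # checkpoint would be saved here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses: improving until epoch 4, then rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(val_losses)
```

In this run training halts at epoch 6, and the weights from epoch 4 (the lowest validation loss) are the ones to keep.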

89

MCQ · medium

During model deployment, a data engineer notices that the model's predictions are consistently lower than expected due to a shift in the distribution of one feature between training and production. Which technique should be used to detect and quantify this shift?

A. Compute the root mean square error (RMSE)

B. Calculate the population stability index (PSI)

C. Generate a confusion matrix

D. Perform a t-test on the means

Answer: B

PSI quantifies the degree of distribution shift, commonly used in monitoring.

Why this answer

The Population Stability Index (PSI) is specifically designed to detect and quantify shifts in the distribution of a feature or score between two populations, such as training and production datasets. It measures the stability of the feature by comparing the proportion of observations in each bin across the two time periods, making it the correct choice for diagnosing distribution drift in model deployment.

Exam trap

CompTIA often tests the distinction between performance metrics (like RMSE or confusion matrix) and distribution monitoring metrics (like PSI), trapping candidates who confuse model accuracy evaluation with data drift detection.

How to eliminate wrong answers

Option A is wrong because RMSE measures the average magnitude of prediction errors, not distribution shifts between datasets. Option C is wrong because a confusion matrix evaluates classification performance against ground truth labels, not feature distribution changes. Option D is wrong because a t-test on the means only checks for a difference in central tendency; it can miss changes in spread or shape that PSI's bin-by-bin comparison captures, and it does not yield a single drift score for monitoring.
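
PSI is the sum over bins of (actual share - expected share) * ln(actual share / expected share). A minimal sketch follows, assuming fixed bin edges that cover all values; the data, the edges, and the small epsilon used to avoid log(0) are choices made for this example.

```python
import math

def bin_proportions(values, edges, eps=1e-6):
    """Share of values per bin [edges[i], edges[i+1]); the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    # eps keeps empty bins from producing log(0) or division by zero
    return [max(c / len(values), eps) for c in counts]

def psi(expected, actual, edges):
    """Population stability index between a baseline and a comparison sample."""
    e = bin_proportions(expected, edges)
    a = bin_proportions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [0, 25, 50, 75, 100]
training = list(range(100))                                  # 25% per bin
production = [5] * 10 + [30] * 20 + [60] * 30 + [90] * 40    # skewed toward high values

shift = psi(training, production, edges)
no_shift = psi(training, training, edges)
```

Identical distributions give a PSI of zero, and the skewed production sample gives roughly 0.23. A commonly cited rule of thumb treats values above about 0.1 as worth investigating and above about 0.25 as a significant shift, though the cut-offs are conventions rather than fixed rules.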

90

Multi-Select · easy

A data engineer is preparing a dataset for training a classification model. The dataset contains missing values in multiple features, inconsistent categorical labels, and outliers in numerical features. Which TWO preprocessing steps should the engineer prioritize to improve model performance?

Select 2 answers

A. Normalize numerical features using min-max scaling.

B. Remove all rows with any missing data.

C. Encode categorical variables using label encoding.

D. Impute missing values with the median.

E. Apply one-hot encoding to all categorical variables.

Answers: A, D

Normalization ensures features contribute equally to distance-based models.

Why this answer

The correct steps are imputing missing values with the median (handles outliers better than mean) and normalizing numerical features (for distance-based algorithms). Blindly removing rows loses data; label encoding on nominal categories creates false order; one-hot encoding on high-cardinality categories can cause dimensionality issues.

91

MCQ · medium

A retail company is building a recommendation system to suggest products to customers based on their purchase history. The data engineering team has collected data from point-of-sale systems, online browsing logs, and customer reviews. After cleaning the data, they notice that the feature set has over 500 dimensions, leading to high computational costs and potential overfitting. They need to reduce dimensionality while preserving as much variance as possible for the model. The team is considering various techniques. Which approach should they take to achieve this goal most effectively?

A. Keep all features but apply L1 regularization (Lasso) in the model to automatically reduce coefficients to zero.

B. Apply t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the feature space to 50 dimensions.

C. Select only features that have a high correlation with the target variable, discarding all others.

D. Use Principal Component Analysis (PCA) to reduce the feature space to the top 50 principal components that explain 95% of the variance.

Answer: D

PCA efficiently reduces dimensionality while retaining most variance, and the components can be used in downstream models.

Why this answer

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that projects data onto a lower-dimensional subspace while maximizing variance. It is well-suited for reducing a large number of correlated features. t-SNE is primarily for visualization and does not produce a transformation that can be applied to new data easily. Feature selection based on correlation may discard useful interactions.

Keeping all features and using regularization would still require full feature set during training and may not reduce dimensionality in the pipeline.
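
Choosing how many principal components to keep comes down to a cumulative explained-variance calculation. The sketch below assumes the eigenvalues of the covariance matrix (the variance captured by each component) have already been computed; the six values shown are hypothetical and stand in for the 500-dimensional case.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of components whose cumulative variance share reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)

# Hypothetical per-component variances from a PCA fit.
eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.3, 0.1]
k = components_for_variance(eigenvalues)  # components needed for 95% of the variance
```

Here four of the six components are enough to retain at least 95% of the variance, so the remaining two can be dropped with little information loss.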

92

MCQ · easy

Refer to the exhibit. A data engineer is training a binary classification neural network. The loss fluctuates and does not converge. Which hyperparameter adjustment is most likely to stabilize training?

A. Change the activation to tanh

B. Add dropout after each layer

C. Decrease the learning rate

D. Increase the number of units in the first dense layer

Answer: C

Lower learning rate makes updates smaller, reducing oscillations and promoting convergence.

Why this answer

A high learning rate can cause the loss to oscillate and prevent convergence. Decreasing the learning rate typically stabilizes training. Increasing units, changing activation, or adding dropout may not address the root cause of fluctuation.
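
The effect of the learning rate can be seen on the simplest possible loss, f(x) = x^2, whose gradient is 2x. This is a toy sketch, not the network from the exhibit; the two learning rates are hypothetical values picked to show both behaviours.

```python
def gradient_descent(lr, steps=20, start=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the loss after each step."""
    x = start
    losses = []
    for _ in range(steps):
        x -= lr * 2 * x       # gradient of x**2 is 2x
        losses.append(x * x)
    return losses

too_high = gradient_descent(lr=1.1)  # each step overshoots: x flips sign and grows
lower = gradient_descent(lr=0.1)     # each step shrinks x toward the minimum
```

With the larger rate every update jumps past the minimum to the other side and the loss keeps increasing; with the smaller rate the loss decreases steadily toward zero.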

93

MCQ · medium

Refer to the exhibit. A data engineer runs a validation report on the customers table. The "income" column has 12 null values. Which imputation strategy is most appropriate for this column?

A. Remove rows with null income

B. Replace nulls with the median income per region

C. Replace nulls with 0

D. Replace nulls with the mean income of the entire dataset

Answer: B

Median per region respects regional variation and is robust to outliers.

Why this answer

Income varies by region, so imputing with the median per region accounts for regional differences. The mean of the entire dataset may be skewed by outliers, 0 is not a plausible income value and would distort the distribution, and removing rows reduces the sample size.
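
Per-group median imputation can be sketched with the standard library alone. The rows, the column names `region` and `income`, and the fallback to the overall median for a region with no known incomes are assumptions for this example; the exhibit's actual table is not reproduced here.

```python
from collections import defaultdict
from statistics import median

def impute_by_group(rows, group_key, value_key):
    """Fill missing values with the median of the row's group (overall median as fallback)."""
    known = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            known[row[group_key]].append(row[value_key])
    group_medians = {group: median(values) for group, values in known.items()}
    overall = median(v for values in known.values() for v in values)
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[value_key] is None:
            new_row[value_key] = group_medians.get(new_row[group_key], overall)
        filled.append(new_row)
    return filled

# Hypothetical customers; one very high income in the north.
customers = [
    {"region": "north", "income": 40000},
    {"region": "north", "income": 42000},
    {"region": "north", "income": 500000},
    {"region": "north", "income": None},
    {"region": "south", "income": 25000},
    {"region": "south", "income": 27000},
    {"region": "south", "income": None},
]
filled = impute_by_group(customers, "region", "income")
```

The missing northern income becomes 42000 rather than being dragged upward by the 500000 outlier, and the missing southern income becomes 26000, reflecting that region's lower typical income.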

94

MCQ · medium

A data engineer is building a pipeline to ingest streaming data from IoT sensors. Which data storage solution is best suited for real-time analytics on timestamped sensor readings?

A. Data warehouse

B. Relational database

C. Data lake

D. Time-series database

Answer: D

Time-series databases provide specialized indexing, compression, and query capabilities for timestamped data.

Why this answer

Time-series databases (TSDBs) are optimized for high-ingest rates of timestamped data and provide efficient downsampling, retention policies, and time-based aggregation functions. For IoT sensor streaming, a TSDB like InfluxDB or TimescaleDB delivers sub-second query performance on time-range scans, which is essential for real-time analytics.

Exam trap

CompTIA often tests the misconception that 'any database can handle time-series data if you add a timestamp column,' ignoring the architectural differences in storage engines, indexing, and write optimization that make TSDBs the best fit for real-time streaming analytics.

How to eliminate wrong answers

Option A is wrong because data warehouses (e.g., Snowflake, Redshift) are designed primarily for batch-oriented, structured querying of historical data and are generally not built for the high write throughput and low-latency time-range scans that streaming sensor data requires. Option B is wrong because general-purpose relational databases (e.g., PostgreSQL, MySQL) use row-based storage and B-tree indexes that tend to degrade under continuous high-volume time-series inserts unless extended or tuned for that workload, leading to write contention and slow time-range queries. Option C is wrong because data lakes (e.g., S3, ADLS) store raw data in object storage without built-in indexing or time-ordering, so high read latency and the lack of native time-series functions make them poorly suited to real-time analytics.

95

MCQ · easy

A team is using a pre-trained language model for sentiment analysis. They want to adapt it to a specific domain with limited labeled data. Which approach is most efficient?

A. Fine-tune the pre-trained model on domain data

B. Use the pre-trained model as is

C. Train a new model from scratch

D. Ensemble multiple pre-trained models

Answer: A

Fine-tuning updates the model weights slightly on domain data, achieving good performance with few examples.

Why this answer

Fine-tuning leverages the pre-trained model's knowledge and requires only minimal additional training on domain-specific data, making it efficient. Training from scratch is computationally expensive and requires large datasets. Using the model as-is may perform poorly on domain-specific language.

Ensembling multiple models adds complexity without clear benefit.
