Practice AI0-001 AI Models and Data Engineering questions with full explanations on every answer.
1. A data scientist is preparing a dataset for training a classification model. The dataset contains 10,000 records with a binary target variable where 9,500 belong to class A and 500 belong to class B. Which technique should the scientist use to address the class imbalance?
2. An engineer is building a regression model to predict housing prices. The dataset includes features such as square footage, number of bedrooms, and year built. The engineer notices that the square footage values range from 500 to 10,000, while the number of bedrooms ranges from 1 to 5. Which preprocessing step is most critical before training a gradient descent-based model?
3. A machine learning team is deploying a sentiment analysis model for customer reviews. The model was trained on reviews from an e-commerce site but will be used for a social media platform. The team observes a drop in accuracy. Which concept best explains this issue?
4. A data engineer needs to design a data pipeline for a real-time fraud detection system. The system requires low-latency processing of streaming transactions. Which architecture is most appropriate?
5. A team is training a deep learning model for image classification. The training loss decreases rapidly but validation loss starts increasing after a few epochs. Which regularization technique should be applied to mitigate this issue?
6. An organization needs to store sensitive customer data for training a machine learning model. The data must be encrypted at rest and in transit, and access must be audited. Which combination of practices should be implemented?
7. A data analyst is cleaning a dataset and finds that 20% of the values for the 'age' column are missing. Which imputation method is most robust if the data is not normally distributed?
8. Which TWO techniques are commonly used for feature selection in machine learning? (Choose 2)
9. Which THREE are common data preprocessing steps in a machine learning pipeline? (Choose 3)
10. Which TWO are best practices for versioning machine learning models? (Choose 2)
11. An engineer is training a neural network and observes the output shown. Which conclusion is most likely correct?
12. A data engineer is reviewing an S3 bucket policy for a machine learning project. The policy is intended to allow access to training data only from the corporate network (10.0.0.0/16). However, users in the corporate network report access denied. Which issue is most likely causing the problem?
13. A data scientist is training a deep learning model for image classification. The training loss decreases steadily but the validation loss starts increasing after 10 epochs. Which technique should the scientist apply to address this issue?
14. A financial institution is building a fraud detection system using a supervised learning model. The dataset is highly imbalanced with 99.9% legitimate transactions and 0.1% fraudulent ones. Which approach would be MOST effective to train the model to detect fraud?
15. A company wants to deploy an AI model for real-time inference on edge devices with limited computational resources. Which model architecture would be MOST suitable?
16. A data engineer is designing a pipeline for a streaming data application that uses a machine learning model to detect anomalies in real time. Which TWO practices should the engineer implement to ensure data quality and model reliability?
17. A team is developing a natural language processing model to classify customer feedback. The dataset contains text in multiple languages. Which THREE preprocessing steps are essential to ensure the model performs well across all languages?
18. A large e-commerce company uses a recommendation system based on collaborative filtering. The system uses a matrix factorization model that is trained nightly on the entire user-item interaction history. Recently, the company launched a flash sale with thousands of new products. Users are reporting that the recommendations are not showing the new products, even for users who have purchased them during the sale. The data engineering team notices that the new products have very few interactions in the training data. The model's loss on the validation set has increased, and the recall@10 metric has dropped from 0.45 to 0.32. The team needs to improve the recommendation of new items without retraining the entire model from scratch every hour. Which approach should the team take?
19. A healthcare startup is developing a deep learning model to detect diabetic retinopathy from retinal images. The model is trained on a dataset of 10,000 labeled images. During initial testing, the model achieves 99% accuracy on the training set but only 85% on the test set. The startup wants to deploy the model in a clinical setting where false negatives (missing a disease) are critical. The team has access to additional unlabeled retinal images from multiple sources. Which strategy should the team use to improve the model's generalization and reduce false negatives?
20. A data scientist is preparing a dataset for a classification model. The dataset contains a column "Age" with 10% missing values and a column "Income" with 30% missing values. Which imputation strategy is MOST appropriate to minimize bias?
21. A team is building a regression model to predict house prices. The dataset includes numerical features (square footage, number of bedrooms) and categorical features (neighborhood, roof type). The categorical features have high cardinality (neighborhood has 200+ unique values). Which encoding strategy should the team use to avoid overfitting and maintain model interpretability?
22. A data scientist trains a deep learning model on a large dataset. The training loss decreases steadily but the validation loss starts increasing after 20 epochs. The scientist uses early stopping with patience=5. Which of the following is the MOST likely cause and best corrective action?
23. A company streams sensor data from IoT devices. The data arrives as JSON messages at high velocity. Which data pipeline architecture is BEST suited to handle this streaming data for near-real-time analytics?
24. A dataset for a binary classification problem has 95% of samples in class "0" and 5% in class "1". The data scientist trains a logistic regression model and achieves 95% accuracy. Which metric should the scientist primarily use to evaluate model performance?
25. An e-commerce company needs to update its recommendation model continuously as user preferences change. The model currently retrains from scratch every night, but the training time is too long. Which approach would reduce training time while keeping the model up-to-date?
26. A data engineer discovers that a dataset contains duplicate rows. Which data cleaning step is MOST appropriate?
27. A machine learning model for credit card fraud detection is deployed. The model's precision is 0.95 and recall is 0.60. The business cost of missing a fraud is very high. Which of the following should the team prioritize to reduce the number of false negatives?
28. A data scientist is working with a dataset that has 10,000 features but only 500 samples. The goal is to train a model for binary classification. Which feature selection technique is MOST appropriate to reduce overfitting?
29. A data scientist is cleaning a dataset. Which TWO actions are appropriate for handling missing data?
30. Which THREE practices are recommended for versioning machine learning models in a production environment?
31. Which THREE data quality dimensions are critical for ensuring model reliability?
32. Refer to the exhibit. A data scientist reviews the MLflow run for a Random Forest model on customer churn data. What is the most likely issue with this model?
33. Refer to the exhibit. A stream processor ingests events. One event arrives with missing "user_id". What will happen?
34. Refer to the exhibit. What is the recall of the model?
35. A data scientist is preparing a dataset for training a classification model. The dataset has a column with missing values in 5% of rows. Which action should the data scientist take to minimize bias?
36. A financial institution is training a risk assessment model. The dataset includes customer credit scores, income, age, and past loan defaults. During feature engineering, a data engineer creates a new feature 'income_to_debt_ratio'. Which type of feature engineering technique is this?
37. A machine learning team is developing a model to predict server failure from telemetry data. They use a deep neural network with 3 hidden layers. After training, the model achieves 99% accuracy on training data but only 85% on validation data. Which technique should the team apply to reduce the generalization error?
38. A data engineer needs to combine two datasets, each with unique customer_id, to include all records from both datasets. Which join type should be used?
39. A team is training a language model using a large text corpus. They want to ensure the model does not learn biased associations between gender and professions. Which data engineering technique should they apply?
40. A streaming data pipeline ingests sensor data from IoT devices. The data arrives at irregular intervals and contains occasional spikes. Which data transformation is most appropriate for preparing this data for a time-series model?
41. A data engineer needs to store training data in a format that supports columnar pruning during model training. Which storage format should they use?
42. During model deployment, a data engineer notices that the model's predictions are consistently lower than expected due to a shift in the distribution of one feature between training and production. Which technique should be used to detect and quantify this shift?
43. A deep learning model for image classification is overfitting due to a small dataset. The team decides to apply data augmentation. Which augmentation technique is least likely to preserve the label?
44. A data engineer is preparing a dataset for a binary classification model. The dataset has 10,000 samples with 100 features. To improve model performance and reduce training time, the engineer decides to perform feature selection. Which two techniques are appropriate for this task? (Select TWO).
45. A data science team is building a model to predict customer churn. The dataset includes categorical variables like 'region' and 'subscription_type'. Which three preprocessing steps should be applied to these categorical features? (Select THREE).
46. A data engineer is designing a data pipeline for a real-time recommendation system. The pipeline must handle high velocity streams and ensure data quality. Which three components should be included in the pipeline? (Select THREE).
47. Refer to the exhibit. A data engineer is training a binary classification neural network. The loss fluctuates and does not converge. Which hyperparameter adjustment is most likely to stabilize training?
48. Refer to the exhibit. A data engineer runs a validation report on the customers table. The "income" column has 12 null values. Which imputation strategy is most appropriate for this column?
49. Refer to the exhibit. A data engineer notices that the batch processing step is taking too long and causing delays. Which change would most likely reduce the latency?
50. A data scientist is preparing a dataset for a supervised learning model. The dataset contains missing values in 15% of the rows for a numeric feature. Which preprocessing technique should be applied to minimize bias?
51. A company is deploying an AI model to recommend products. The model's training data included historical purchases from the past two years, but the business environment has changed significantly due to a market shift. What is the most likely issue affecting model performance?
52. An AI team notices that a model's F1 score on the validation set is 0.95, but on the test set it drops to 0.72. Which course of action is most appropriate?
53. A data engineer is building a pipeline to ingest streaming data from IoT sensors. Which data storage solution is best suited for real-time analytics on timestamped sensor readings?
54. During feature engineering, a data scientist creates a new feature that is a linear combination of two existing features. What risk does this pose to the model?
55. A model trained on a dataset with imbalanced classes achieves 98% accuracy but only 50% recall for the minority class. Which technique should be applied first to address the imbalance?
56. A team is using a pre-trained language model for sentiment analysis. They want to adapt it to a specific domain with limited labeled data. Which approach is most efficient?
57. A data pipeline processes customer data from multiple sources. The data quality check reveals duplicate records. Which step should the pipeline include to handle this?
58. An AI model is deployed to a mobile app with limited computational resources. The model is a deep neural network with high latency. Which technique is best to reduce inference time?
59. A data scientist is evaluating a logistic regression model for binary classification on highly imbalanced data. Which TWO metrics are most appropriate to assess model performance? (Choose TWO.)
60. A data engineer is designing a feature store for machine learning. Which THREE components are essential for a feature store? (Choose THREE.)
61. A team is using k-fold cross-validation to evaluate a model. They observe high variance in performance scores across folds. Which TWO actions are most likely to reduce this variance? (Choose TWO.)
62. A data scientist notices that a binary classification model consistently predicts the majority class. Which data engineering technique should be applied?
63. A team is building a regression model to predict house prices. Which data transformation is most appropriate if the target variable exhibits right skewness?
64. A model's training accuracy is 99% but validation accuracy drops to 60%. What is the most likely issue?
65. A real-time recommendation system requires low latency. Which data storage strategy is best for serving user profiles and item embeddings?
66. A data engineer is preprocessing text data for sentiment analysis. Which technique preserves word order while converting text to numeric features?
67. An organization uses a machine learning model to approve loans. The model shows higher false positive rates for a protected group. Which data engineering step should be taken to mitigate this?
68. A data pipeline ingests streaming data from IoT sensors. The current batch processing pipeline causes stale predictions. Which architecture change is most appropriate?
69. A team is training a deep neural network on a large image dataset. They observe that the training loss decreases smoothly but validation loss oscillates. Which regularization technique should be applied?
70. A fraud detection model has high precision but low recall. The cost of false negatives is very high. Which threshold adjustment should be made?
71. Which TWO data preprocessing techniques reduce the dimensionality of a dataset?
72. Which THREE are common causes of data leakage in machine learning pipelines?
73. Which TWO strategies are effective for handling missing values in a dataset when the missingness is not random (MNAR)?
74. A data scientist is building a regression model to predict house prices. The dataset contains features such as square footage, number of bedrooms, and year built. Initial model performance is poor, and the scientist suspects that feature engineering could help. Which approach is most likely to improve model accuracy?
75. A credit risk model is being developed to predict loan defaults. The dataset has 95% non-default and 5% default instances. The data scientist trains a logistic regression model and obtains 95% accuracy, but the recall for defaults is only 10%. Which action is most appropriate to improve the model's ability to identify defaults?
76. A dataset used for training a classification model contains 10% missing values in a feature that is known to be important. The data scientist decides to impute the missing values. Which imputation method is most robust if the data is not missing completely at random?
77. A company wants to forecast monthly sales for the next year using historical sales data over three years. The data shows strong seasonality and a slight upward trend. Which model type is best suited for this task?
78. A deep learning model for image classification achieves 99% training accuracy but only 85% validation accuracy. The model has millions of parameters. Which technique is most likely to reduce overfitting while maintaining high accuracy?
79. A data engineer is splitting a dataset into training, validation, and test sets for a machine learning project. The dataset is large and representative of the population. Which split ratio is commonly recommended?
80. A machine learning engineer is training a Support Vector Machine (SVM) with an RBF kernel on a dataset with features on different scales (e.g., age 0-100, income 0-1,000,000). The model converges slowly and yields poor accuracy. What should the engineer do first?
81. A natural language processing (NLP) team is building a sentiment analysis model. The raw text data contains punctuation, stop words, and URLs. Which TWO preprocessing steps are most appropriate to improve model performance? (Choose two.)
82. A data scientist is evaluating a binary classification model for fraud detection. The dataset is highly imbalanced (99% non-fraud, 1% fraud). Which TWO metrics are most appropriate for assessing model performance? (Choose two.)
83. A computer vision team is building an image classifier for rare wildlife species. The dataset has only 500 images per class, and the model overfits. Which THREE data augmentation techniques are most likely to reduce overfitting? (Choose three.)
84. A healthcare startup is deploying a machine learning model to predict patient readmission within 30 days using electronic health records (EHR). The data pipeline uses Apache Spark for preprocessing and training on an Amazon EMR cluster. The training dataset is 50 GB and composed of structured numeric and categorical features, along with unstructured clinical notes. The data scientist observes that training takes over 12 hours and frequently fails due to out-of-memory (OOM) errors, especially when processing the clinical notes via TF-IDF vectorization. The cluster has 10 nodes with 64 GB RAM each. The data engineer has already tried increasing spark.sql.shuffle.partitions to 400 and using Kryo serialization, but OOM persists. Which action should the data engineer take next to resolve the OOM errors?
85. A financial services company has a real-time fraud detection system that uses Apache Kafka to stream transaction events, a TensorFlow Serving model for scoring, and a Redis cache for lookup of historical fraud patterns. The system processes 10,000 transactions per second with an SLA of 100ms latency per transaction. Recently, after a model update, the latency for some transactions spiked to over 500ms, causing timeouts. The model uses a deep neural network with 10 million parameters. The engineering team suspects the issue is due to increased model inference time. Which action should be taken to reduce latency without significant loss in accuracy?
86. A medical imaging team is developing an AI model to detect tumors from CT scans. They have 10,000 labeled scans, but the labels were created by a semi-automated process with an estimated 20% error rate (mislabeled tumor vs. no tumor). The team trains a convolutional neural network (CNN) and achieves 90% accuracy on a held-out test set that was carefully validated by an expert radiologist. However, when deployed to a new hospital's patient population, the accuracy drops to 70%. The team suspects domain shift and label noise. Which strategy is most likely to improve model robustness for the new hospital?
87. An e-commerce company deploys a model to recommend products to users. The recommendation system uses collaborative filtering based on user-item interaction history. After deployment, the model shows decreasing click-through rates (CTR) over time. The data engineer notices that the model was trained on data from the past six months and is retrained daily. However, the trend suggests that user preferences are shifting more rapidly than expected. The engineer suspects that the model is suffering from distribution drift. Which approach should the engineer implement to adapt the model more quickly to changing user behavior?
88. A logistics company uses a machine learning model to predict delivery times based on historical data including distance, traffic, weather, and driver performance. The model is deployed as a REST API using Flask and run on a single server. Recently, the model has been returning predictions with high latency (over 2 seconds) during peak hours when the API receives 500 requests per second. The server has 8 CPU cores and 32 GB RAM. The model is a gradient boosting model (XGBoost) with 500 trees. The engineer wants to reduce inference latency to under 500ms without retraining the model. Which action is most effective?
89. A data scientist is preparing a dataset for a classification model. The dataset contains several categorical variables with high cardinality. Which TWO encoding methods are appropriate for converting these categorical variables into numerical features?
90. A healthcare company is developing a predictive model to identify patients at risk of readmission within 30 days. The data engineering team has built a pipeline that collects data from multiple sources, including electronic health records (EHR), lab results, and wearable device data. During initial testing, the model's performance is poor, with high false positives. Upon investigation, the team discovers that the data contains significant temporal misalignment: lab results are timestamped when ordered, not when collected; wearable data is aggregated hourly; and EHR data has inconsistent update frequencies. The data pipeline currently joins all features on the patient ID without aligning timestamps. The data volume is large, and processing time is a concern. Which action should the data engineering team take to most effectively address the issue and improve model performance?
91. A retail company is building a recommendation system to suggest products to customers based on their purchase history. The data engineering team has collected data from point-of-sale systems, online browsing logs, and customer reviews. After cleaning the data, they notice that the feature set has over 500 dimensions, leading to high computational costs and potential overfitting. They need to reduce dimensionality while preserving as much variance as possible for the model. The team is considering various techniques. Which approach should they take to achieve this goal most effectively?
92. A logistics company uses a machine learning model to predict delivery times based on historical data. The model was performing well, but recently it started making inaccurate predictions, especially for routes that have experienced new traffic patterns and road closures. The data engineering team receives an alert that the model's accuracy has dropped by 15% over the last week. They suspect data drift. The team has access to the original training data and a continuous stream of new data. What is the most appropriate first step for the team to take?
93. A data engineer is preparing a dataset for training a classification model. The dataset contains missing values in multiple features, inconsistent categorical labels, and outliers in numerical features. Which TWO preprocessing steps should the engineer prioritize to improve model performance?
94. Refer to the exhibit. A data scientist reviews the pipeline and notes that the model performance degraded. Which change to the pipeline would most likely improve model performance?
95. A retail company uses a machine learning model to predict daily sales. The model takes features like past sales, promotions, holidays, and weather data. Recently, the model's accuracy dropped significantly. The data engineer checks the data pipeline and finds that the weather data source changed from a free API to a new paid API that provides more detailed data. The new data includes additional attributes like humidity and wind speed, but the existing pipeline only ingests temperature and precipitation. Also, the time zone format changed from UTC to local time. The model was trained on the old format. Which action should the engineer take first to restore model performance?
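Several questions in this list (for example, 24 and 75) rest on the same arithmetic: with a 95/5 class split, a model can report 95% accuracy while finding few or none of the minority-class cases. The sketch below is an illustration only, not part of the question bank. It uses made-up labels and a hypothetical always-predict-the-majority baseline to show how accuracy and recall diverge on such data.

```python
# Hypothetical data mirroring a 95/5 split: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# Assumed baseline for illustration: always predict the majority class.
y_pred = [0] * 100


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    # Guard against division by zero when there are no positive labels.
    return tp / (tp + fn) if (tp + fn) else 0.0


print(f"accuracy = {accuracy(y_true, y_pred):.2f}")  # 95 of 100 labels match
print(f"recall   = {recall(y_true, y_pred):.2f}")    # 0 of 5 positives found
```

The baseline matches 95 of 100 labels yet identifies none of the 5 positives, which is why the imbalanced-data questions ask about metrics other than accuracy.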
The AI Models and Data Engineering domain covers the key concepts tested in this area of the AI0-001 exam blueprint published by CompTIA. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all AI0-001 domains — no account required.
The Courseiva AI0-001 question bank contains 95 questions in the AI Models and Data Engineering domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
The session launcher on this page draws questions exclusively from the AI Models and Data Engineering domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.