AI0-001 AI Models and Data Engineering — All Questions With Answers

Question 1 (easy, multiple choice)

A data scientist is preparing a dataset for training a classification model. The dataset contains 10,000 records with a binary target variable where 9,500 belong to class A and 500 belong to class B. Which technique should the scientist use to address the class imbalance?
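Class weighting is one commonly cited way to handle a 9,500 / 500 split (resampling the minority class is another). The sketch below computes "balanced" weights with the heuristic n_samples / (n_classes * class_count); the function name is illustrative and not taken from the question.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Class counts taken from the question: 9,500 records of A and 500 of B.
labels = ["A"] * 9500 + ["B"] * 500
weights = balanced_class_weights(labels)
print(weights)  # the minority class B receives the larger weight
```

A loss function that multiplies each sample's error by its class weight then penalizes mistakes on class B more heavily than mistakes on class A.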

Question 2 (easy, multiple choice)

An engineer is building a regression model to predict housing prices. The dataset includes features such as square footage, number of bedrooms, and year built. The engineer notices that the square footage values range from 500 to 10,000, while the number of bedrooms ranges from 1 to 5. Which preprocessing step is most critical before training a gradient descent-based model?
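To illustrate why differing feature ranges matter for gradient descent, here is a minimal standardization (z-score) sketch. The sample values are hypothetical and only mimic the ranges mentioned in the question; min-max scaling would be an equally valid illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical samples within the ranges given in the question.
square_feet = [500, 1200, 2400, 10000]
bedrooms = [1, 2, 3, 5]

scaled_sqft = standardize(square_feet)
scaled_beds = standardize(bedrooms)
print(scaled_sqft)
print(scaled_beds)
```

After scaling, both features occupy a comparable numeric range, so neither dominates the gradient purely because of its units.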

Question 3 (medium, multiple choice)

A machine learning team is deploying a sentiment analysis model for customer reviews. The model was trained on reviews from an e-commerce site but will be used for a social media platform. The team observes a drop in accuracy. Which concept best explains this issue?

Question 4 (medium, multiple choice)

A data engineer needs to design a data pipeline for a real-time fraud detection system. The system requires low-latency processing of streaming transactions. Which architecture is most appropriate?

Question 5 (hard, multiple choice)

A team is training a deep learning model for image classification. The training loss decreases rapidly but validation loss starts increasing after a few epochs. Which regularization technique should be applied to mitigate this issue?

Question 6 (hard, multiple choice)

An organization needs to store sensitive customer data for training a machine learning model. The data must be encrypted at rest and in transit, and access must be audited. Which combination of practices should be implemented?

Question 7 (easy, multiple choice)

A data analyst is cleaning a dataset and finds that 20% of the values for the 'age' column are missing. Which imputation method is most robust if the data is not normally distributed?
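The difference between mean and median on skewed data can be seen in a small sketch. The ages below are hypothetical, with one outlier and two missing entries; the helper name is illustrative.

```python
from statistics import mean, median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical, right-skewed ages with missing entries marked as None.
ages = [22, 25, None, 27, 31, None, 29, 95]
observed = [a for a in ages if a is not None]

print("mean of observed:", mean(observed))      # pulled upward by the outlier 95
print("median of observed:", median(observed))  # stays near the bulk of the data
filled = impute_median(ages)
print(filled)
```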

Question 8 (medium, multi-select)

Which TWO techniques are commonly used for feature selection in machine learning? (Choose 2)

Question 9 (medium, multi-select)

Which THREE are common data preprocessing steps in a machine learning pipeline? (Choose 3)

Question 10 (hard, multi-select)

Which TWO are best practices for versioning machine learning models? (Choose 2)

Question 11 (hard, multiple choice)

An engineer is training a neural network and observes the output shown. Which conclusion is most likely correct?

Exhibit

Refer to the exhibit.

```
Epoch 1/10 - loss: 0.6932 - accuracy: 0.5234 - val_loss: 0.6918 - val_accuracy: 0.5312
Epoch 2/10 - loss: 0.4231 - accuracy: 0.8047 - val_loss: 0.5234 - val_accuracy: 0.7422
Epoch 3/10 - loss: 0.3125 - accuracy: 0.8828 - val_loss: 0.6015 - val_accuracy: 0.7344
Epoch 4/10 - loss: 0.2146 - accuracy: 0.9219 - val_loss: 0.7234 - val_accuracy: 0.7188
Epoch 5/10 - loss: 0.1478 - accuracy: 0.9531 - val_loss: 0.8342 - val_accuracy: 0.7031
```
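One way to read the exhibit programmatically is to parse each epoch and find where validation loss is lowest. The sketch below assumes the log format shown in the exhibit; the regular expression and variable names are illustrative.

```python
import re

LOG = """\
Epoch 1/10 - loss: 0.6932 - accuracy: 0.5234 - val_loss: 0.6918 - val_accuracy: 0.5312
Epoch 2/10 - loss: 0.4231 - accuracy: 0.8047 - val_loss: 0.5234 - val_accuracy: 0.7422
Epoch 3/10 - loss: 0.3125 - accuracy: 0.8828 - val_loss: 0.6015 - val_accuracy: 0.7344
Epoch 4/10 - loss: 0.2146 - accuracy: 0.9219 - val_loss: 0.7234 - val_accuracy: 0.7188
Epoch 5/10 - loss: 0.1478 - accuracy: 0.9531 - val_loss: 0.8342 - val_accuracy: 0.7031
"""

PATTERN = re.compile(r"Epoch (\d+)/\d+ - loss: ([\d.]+) .*? val_loss: ([\d.]+)")

# Each entry is (epoch, training loss, validation loss).
history = []
for line in LOG.splitlines():
    match = PATTERN.search(line)
    if match:
        history.append((int(match.group(1)), float(match.group(2)), float(match.group(3))))

best_epoch = min(history, key=lambda row: row[2])[0]
train_keeps_falling = all(b[1] < a[1] for a, b in zip(history, history[1:]))
val_rises_after_best = all(
    b[2] > a[2] for a, b in zip(history, history[1:]) if a[0] >= best_epoch
)

print("lowest val_loss at epoch", best_epoch)
print("training loss still falling:", train_keeps_falling)
print("validation loss rising afterwards:", val_rises_after_best)
```

Training loss that keeps falling while validation loss climbs after its minimum is the usual signature of overfitting.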

Question 12 (medium, multiple choice)

A data engineer is reviewing an S3 bucket policy for a machine learning project. The policy is intended to allow access to training data only from the corporate network (10.0.0.0/16). However, users in the corporate network report access denied. Which issue is most likely causing the problem?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::ml-training-data/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "10.0.0.0/16"
        }
      }
    }
  ]
}
```

Question 13 (medium, multiple choice)

A data scientist is training a deep learning model for image classification. The training loss decreases steadily but the validation loss starts increasing after 10 epochs. Which technique should the scientist apply to address this issue?

Question 14 (hard, multiple choice)

A financial institution is building a fraud detection system using a supervised learning model. The dataset is highly imbalanced with 99.9% legitimate transactions and 0.1% fraudulent ones. Which approach would be MOST effective to train the model to detect fraud?

Question 15 (easy, multiple choice)

A company wants to deploy an AI model for real-time inference on edge devices with limited computational resources. Which model architecture would be MOST suitable?

Question 16 (hard, multi-select)

A data engineer is designing a pipeline for a streaming data application that uses a machine learning model to detect anomalies in real time. Which TWO practices should the engineer implement to ensure data quality and model reliability?

Question 17 (medium, multi-select)

A team is developing a natural language processing model to classify customer feedback. The dataset contains text in multiple languages. Which THREE preprocessing steps are essential to ensure the model performs well across all languages?

Question 18 (hard, multiple choice)

A large e-commerce company uses a recommendation system based on collaborative filtering. The system uses a matrix factorization model that is trained nightly on the entire user-item interaction history. Recently, the company launched a flash sale with thousands of new products. Users are reporting that the recommendations are not showing the new products, even for users who have purchased them during the sale. The data engineering team notices that the new products have very few interactions in the training data. The model's loss on the validation set has increased, and the recall@10 metric has dropped from 0.45 to 0.32. The team needs to improve the recommendation of new items without retraining the entire model from scratch every hour. Which approach should the team take?

Question 19 (medium, multiple choice)

A healthcare startup is developing a deep learning model to detect diabetic retinopathy from retinal images. The model is trained on a dataset of 10,000 labeled images. During initial testing, the model achieves 99% accuracy on the training set but only 85% on the test set. The startup wants to deploy the model in a clinical setting where false negatives (missing a disease) are critical. The team has access to additional unlabeled retinal images from multiple sources. Which strategy should the team use to improve the model's generalization and reduce false negatives?

Question 20 (easy, multiple choice)

A data scientist is preparing a dataset for a classification model. The dataset contains a column "Age" with 10% missing values and a column "Income" with 30% missing values. Which imputation strategy is MOST appropriate to minimize bias?

Question 21 (medium, multiple choice)

A team is building a regression model to predict house prices. The dataset includes numerical features (square footage, number of bedrooms) and categorical features (neighborhood, roof type). The categorical features have high cardinality (neighborhood has 200+ unique values). Which encoding strategy should the team use to avoid overfitting and maintain model interpretability?

Question 22 (hard, multiple choice)

A data scientist trains a deep learning model on a large dataset. The training loss decreases steadily but the validation loss starts increasing after 20 epochs. The scientist uses early stopping with patience=5. Which of the following is the MOST likely cause and best corrective action?

Question 23 (easy, multiple choice)

A company streams sensor data from IoT devices. The data arrives as JSON messages at high velocity. Which data pipeline architecture is BEST suited to handle this streaming data for near-real-time analytics?

Question 24 (medium, multiple choice)

A dataset for a binary classification problem has 95% of samples in class "0" and 5% in class "1". The data scientist trains a logistic regression model and achieves 95% accuracy. Which metric should the scientist primarily use to evaluate model performance?

Question 25 (hard, multiple choice)

An e-commerce company needs to update its recommendation model continuously as user preferences change. The model currently retrains from scratch every night, but the training time is too long. Which approach would reduce training time while keeping the model up-to-date?

Question 26 (easy, multiple choice)

A data engineer discovers that a dataset contains duplicate rows. Which data cleaning step is MOST appropriate?

Question 27 (medium, multiple choice)

A machine learning model for credit card fraud detection is deployed. The model's precision is 0.95 and recall is 0.60. The business cost of missing a fraud is very high. Which of the following should the team prioritize to reduce the number of false negatives?

Question 28 (hard, multiple choice)

A data scientist is working with a dataset that has 10,000 features but only 500 samples. The goal is to train a model for binary classification. Which feature selection technique is MOST appropriate to reduce overfitting?

Question 29 (easy, multi-select)

A data scientist is cleaning a dataset. Which TWO actions are appropriate for handling missing data?

Question 30 (medium, multi-select)

Which THREE practices are recommended for versioning machine learning models in a production environment?

Question 31 (hard, multi-select)

Which THREE data quality dimensions are critical for ensuring model reliability?

Question 32 (hard, multiple choice)

Refer to the exhibit. A data scientist reviews the MLflow run for a Random Forest model on customer churn data. What is the most likely issue with this model?

Exhibit

The following output is from an MLflow run:

```
Run ID: abc123
experiment_id: 1
status: FINISHED
start_time: 2023-10-01 10:00:00
end_time: 2023-10-01 10:05:00
params:
  learning_rate: 0.01
  max_depth: 10
  n_estimators: 100
metrics:
  train_accuracy: 0.999
  val_accuracy: 0.82
  val_f1: 0.79
tags:
  model_type: RandomForest
  dataset: churn_v2
```

Question 33 (medium, multiple choice)

Refer to the exhibit. A stream processor ingests events. One event arrives with missing "user_id". What will happen?

Exhibit

The following is a JSON schema snippet from a data pipeline:

```
{
  "type": "object",
  "properties": {
    "user_id": { "type": "integer" },
    "timestamp": { "type": "string", "format": "date-time" },
    "event_type": { "type": "string" },
    "value": { "type": "number" }
  },
  "required": ["user_id", "event_type", "value"]
}
```
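The effect of the `required` keyword can be sketched without a schema library. The check below is a deliberately simplified stand-in for a real JSON Schema validator (it ignores types and formats), and the sample events are hypothetical. What the pipeline does with a failing event (reject, drop, or route it elsewhere) is not stated in the exhibit.

```python
# Schema copied from the exhibit as a Python dict.
SCHEMA = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "timestamp": {"type": "string", "format": "date-time"},
        "event_type": {"type": "string"},
        "value": {"type": "number"},
    },
    "required": ["user_id", "event_type", "value"],
}

def missing_required(schema, event):
    """Return the required property names that are absent from the event."""
    return [name for name in schema.get("required", []) if name not in event]

# Hypothetical event with no "user_id".
event = {"timestamp": "2023-10-01T10:00:00Z", "event_type": "click", "value": 1.5}
missing = missing_required(SCHEMA, event)
print("fails validation, missing:", missing)

# "timestamp" is not listed under "required", so omitting it is allowed.
no_timestamp = {"user_id": 7, "event_type": "click", "value": 1.5}
print("missing:", missing_required(SCHEMA, no_timestamp))
```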

Question 34 (easy, multiple choice)

Refer to the exhibit. What is the recall of the model?

Exhibit

The following is a confusion matrix for a binary classifier:

```
                  Predicted Positive   Predicted Negative
Actual Positive           80                   20
Actual Negative           30                   70
```
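As a worked check, recall is TP / (TP + FN), computed from the "Actual Positive" row of the matrix. The variable names below are illustrative; precision is included only for contrast.

```python
# Counts read from the confusion matrix in the exhibit.
tp, fn = 80, 20  # actual positives: predicted positive / predicted negative
fp, tn = 30, 70  # actual negatives: predicted positive / predicted negative

recall = tp / (tp + fn)     # share of actual positives that were found
precision = tp / (tp + fp)  # share of predicted positives that were correct

print(f"recall = {recall:.2f}")
print(f"precision = {precision:.3f}")
```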

Question 35 (easy, multiple choice)

A data scientist is preparing a dataset for training a classification model. The dataset has a column with missing values in 5% of rows. Which action should the data scientist take to minimize bias?

Question 36 (medium, multiple choice)

A financial institution is training a risk assessment model. The dataset includes customer credit scores, income, age, and past loan defaults. During feature engineering, a data engineer creates a new feature 'income_to_debt_ratio'. Which type of feature engineering technique is this?

Question 37 (hard, multiple choice)

A machine learning team is developing a model to predict server failure from telemetry data. They use a deep neural network with 3 hidden layers. After training, the model achieves 99% accuracy on training data but only 85% on validation data. Which technique should the team apply to reduce the generalization error?

Question 38 (easy, multiple choice)

A data engineer needs to combine two datasets, each with unique customer_id, to include all records from both datasets. Which join type should be used?
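A join that keeps all records from both sides is usually called a full outer join. The sketch below imitates one with plain dictionaries keyed by customer_id; the two datasets and the function name are hypothetical.

```python
# Hypothetical datasets keyed by customer_id; names and values are illustrative.
profiles = {1: {"name": "Ana"}, 2: {"name": "Ben"}}
plans = {2: {"plan": "pro"}, 3: {"plan": "free"}}

def full_outer_join(left, right):
    """Keep every key from either side; fill columns missing on one side with None."""
    left_cols = {col for row in left.values() for col in row}
    right_cols = {col for row in right.values() for col in row}
    joined = {}
    for key in sorted(left.keys() | right.keys()):
        row = {col: None for col in left_cols | right_cols}
        row.update(left.get(key, {}))
        row.update(right.get(key, {}))
        joined[key] = row
    return joined

joined = full_outer_join(profiles, plans)
for customer_id, row in joined.items():
    print(customer_id, row)
```

Customers 1 and 3 appear in only one dataset each, yet both survive the join with the missing column set to None.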

Question 39 (medium, multiple choice)

A team is training a language model using a large text corpus. They want to ensure the model does not learn biased associations between gender and professions. Which data engineering technique should they apply?

Question 40 (hard, multiple choice)

A streaming data pipeline ingests sensor data from IoT devices. The data arrives at irregular intervals and contains occasional spikes. Which data transformation is most appropriate for preparing this data for a time-series model?

Question 41 (easy, multiple choice)

A data engineer needs to store training data in a format that supports column pruning (reading only the columns a job needs) during model training. Which storage format should they use?

Question 42 (medium, multiple choice)

During model deployment, a data engineer notices that the model's predictions are consistently lower than expected due to a shift in the distribution of one feature between training and production. Which technique should be used to detect and quantify this shift?
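One metric often used to quantify a shift in a single feature is the Population Stability Index (PSI); a two-sample Kolmogorov-Smirnov test is another option. The sketch below is a minimal PSI under assumed bin edges and hypothetical data; the small floor on empty bins is an implementation choice, not part of the question.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                # Bins are half-open [lo, hi), except the last, which includes hi.
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values at training time and in production.
train_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
prod_values = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
edges = [0, 5, 10, 15]

print("PSI, train vs train:", psi(train_values, train_values, edges))
print("PSI, train vs prod:", psi(train_values, prod_values, edges))
```

A PSI of zero means the binned distributions match; larger values indicate a larger shift.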

Question 43 (hard, multiple choice)

A deep learning model for image classification is overfitting due to a small dataset. The team decides to apply data augmentation. Which augmentation technique is least likely to preserve the label?

Question 44 (easy, multi-select)

A data engineer is preparing a dataset for a binary classification model. The dataset has 10,000 samples with 100 features. To improve model performance and reduce training time, the engineer decides to perform feature selection. Which two techniques are appropriate for this task? (Select TWO).

Question 45 (medium, multi-select)

A data science team is building a model to predict customer churn. The dataset includes categorical variables like 'region' and 'subscription_type'. Which three preprocessing steps should be applied to these categorical features? (Select THREE).

Question 46 (hard, multi-select)

A data engineer is designing a data pipeline for a real-time recommendation system. The pipeline must handle high velocity streams and ensure data quality. Which three components should be included in the pipeline? (Select THREE).

Question 47 (easy, multiple choice)

Refer to the exhibit. A data engineer is training a binary classification neural network. The loss fluctuates and does not converge. Which hyperparameter adjustment is most likely to stabilize training?

Exhibit

```
model:
  type: Sequential
  layers:
    - type: Dense
      units: 128
      activation: relu
    - type: Dense
      units: 64
      activation: relu
    - type: Dense
      units: 1
      activation: sigmoid
optimizer:
  type: Adam
  learning_rate: 0.01
```

Question 48 (medium, multiple choice)

Refer to the exhibit. A data engineer runs a validation report on the customers table. The "income" column has 12 null values. Which imputation strategy is most appropriate for this column?

Exhibit

```
Data Validation Report:
Table: customers
- column "age": null values: 0, unique values: 87, min: 18, max: 99
- column "income": null values: 12, unique values: 1500, min: 0, max: 500000
- column "region": null values: 0, unique values: 4, values: ["North", "South", "East", "West"]
- column "gender": null values: 0, unique values: 2, values: ["M", "F"]
```

Question 49 (hard, multiple choice)

Refer to the exhibit. A data engineer notices that the batch processing step is taking too long and causing delays. Which change would most likely reduce the latency?

Exhibit

Data Pipeline Architecture:
- Source: IoT devices -> Kafka Topic "sensor_data"
- Stream Processing: Apache Flink job that ingests from Kafka, cleanses data, and outputs to another Kafka Topic "cleaned_sensor_data"
- Batch Processing: Apache Spark job that reads from "cleaned_sensor_data" via Kafka batch integration, performs feature engineering, and writes to HDFS as Parquet
- Model Training: Python script reads from HDFS, trains an LSTM model, and saves to model registry
- Inference: REST API loads model from registry and serves predictions

Question 50 (easy, multiple choice)

A data scientist is preparing a dataset for a supervised learning model. The dataset contains missing values in 15% of the rows for a numeric feature. Which preprocessing technique should be applied to minimize bias?

Question 51 (medium, multiple choice)

A company is deploying an AI model to recommend products. The model's training data included historical purchases from the past two years, but the business environment has changed significantly due to a market shift. What is the most likely issue affecting model performance?

Question 52 (hard, multiple choice)

An AI team notices that a model's F1 score on the validation set is 0.95, but on the test set it drops to 0.72. Which course of action is most appropriate?

Question 53 (medium, multiple choice)

A data engineer is building a pipeline to ingest streaming data from IoT sensors. Which data storage solution is best suited for real-time analytics on timestamped sensor readings?

Question 54 (easy, multiple choice)

During feature engineering, a data scientist creates a new feature that is a linear combination of two existing features. What risk does this pose to the model?

Question 55 (hard, multiple choice)

A model trained on a dataset with imbalanced classes achieves 98% accuracy but only 50% recall for the minority class. Which technique should be applied first to address the imbalance?

Question 56 (easy, multiple choice)

A team is using a pre-trained language model for sentiment analysis. They want to adapt it to a specific domain with limited labeled data. Which approach is most efficient?

Question 57 (medium, multiple choice)

A data pipeline processes customer data from multiple sources. The data quality check reveals duplicate records. Which step should the pipeline include to handle this?

Question 58 (hard, multiple choice)

An AI model is deployed to a mobile app with limited computational resources. The model is a deep neural network with high latency. Which technique is best to reduce inference time?

Question 59 (easy, multi-select)

A data scientist is evaluating a logistic regression model for binary classification on highly imbalanced data. Which TWO metrics are most appropriate to assess model performance? (Choose TWO.)

Question 60 (medium, multi-select)

A data engineer is designing a feature store for machine learning. Which THREE components are essential for a feature store? (Choose THREE.)

Question 61 (hard, multi-select)

A team is using k-fold cross-validation to evaluate a model. They observe high variance in performance scores across folds. Which TWO actions are most likely to reduce this variance? (Choose TWO.)

Question 62 (easy, multiple choice)

A data scientist notices that a binary classification model consistently predicts the majority class. Which data engineering technique should be applied?

Question 63 (easy, multiple choice)

A team is building a regression model to predict house prices. Which data transformation is most appropriate if the target variable exhibits right skewness?

Question 64 (easy, multiple choice)

A model's training accuracy is 99% but validation accuracy drops to 60%. What is the most likely issue?

Question 65 (medium, multiple choice)

A real-time recommendation system requires low latency. Which data storage strategy is best for serving user profiles and item embeddings?

Question 66 (medium, multiple choice)

A data engineer is preprocessing text data for sentiment analysis. Which technique preserves word order while converting text to numeric features?

Question 67 (medium, multiple choice)

An organization uses a machine learning model to approve loans. The model shows higher false positive rates for a protected group. Which data engineering step should be taken to mitigate this?

Question 68 (hard, multiple choice)

A data pipeline ingests streaming data from IoT sensors. The current batch processing pipeline causes stale predictions. Which architecture change is most appropriate?

Question 69 (hard, multiple choice)

A team is training a deep neural network on a large image dataset. They observe that the training loss decreases smoothly but validation loss oscillates. Which regularization technique should be applied?

Question 70 (hard, multiple choice)

A fraud detection model has high precision but low recall. The cost of false negatives is very high. Which threshold adjustment should be made?
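The relationship between the decision threshold and recall can be sketched with a handful of scored transactions. The scores and labels below are hypothetical, chosen so that the model starts with high precision and lower recall, as in the question.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when scores >= threshold count as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical fraud scores with true labels (1 = fraud).
scores = [0.95, 0.80, 0.65, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

p_high, r_high = precision_recall(scores, labels, threshold=0.5)
p_low, r_low = precision_recall(scores, labels, threshold=0.3)

print(f"threshold 0.5: precision={p_high:.2f}, recall={r_high:.2f}")
print(f"threshold 0.3: precision={p_low:.2f}, recall={r_low:.2f}")
```

In this toy data, lowering the threshold catches the fraud case scored 0.35 (fewer false negatives) at the cost of one extra false positive.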

Question 71 (easy, multi-select)

Which TWO data preprocessing techniques reduce the dimensionality of a dataset?

Question 72 (medium, multi-select)

Which THREE are common causes of data leakage in machine learning pipelines?

Question 73 (hard, multi-select)

Which TWO strategies are effective for handling missing values in a dataset when the data is missing not at random (MNAR)?

Question 74 (medium, multiple choice)

A data scientist is building a regression model to predict house prices. The dataset contains features such as square footage, number of bedrooms, and year built. Initial model performance is poor, and the scientist suspects that feature engineering could help. Which approach is most likely to improve model accuracy?

Question 75 (hard, multiple choice)

A credit risk model is being developed to predict loan defaults. The dataset has 95% non-default and 5% default instances. The data scientist trains a logistic regression model and obtains 95% accuracy, but the recall for defaults is only 10%. Which action is most appropriate to improve the model's ability to identify defaults?

Question 76 (easy, multiple choice)

A dataset used for training a classification model contains 10% missing values in a feature that is known to be important. The data scientist decides to impute the missing values. Which imputation method is most robust if the data is not missing completely at random?

Question 77 (medium, multiple choice)

A company wants to forecast monthly sales for the next year using historical sales data over three years. The data shows strong seasonality and a slight upward trend. Which model type is best suited for this task?

Question 78 (hard, multiple choice)

A deep learning model for image classification achieves 99% training accuracy but only 85% validation accuracy. The model has millions of parameters. Which technique is most likely to reduce overfitting while maintaining high accuracy?

Question 79 (easy, multiple choice)

A data engineer is splitting a dataset into training, validation, and test sets for a machine learning project. The dataset is large and representative of the population. Which split ratio is commonly recommended?

Question 80 (medium, multiple choice)

A machine learning engineer is training a Support Vector Machine (SVM) with an RBF kernel on a dataset with features on different scales (e.g., age 0-100, income 0-1,000,000). The model converges slowly and yields poor accuracy. What should the engineer do first?

Question 81 (medium, multi-select)

A natural language processing (NLP) team is building a sentiment analysis model. The raw text data contains punctuation, stop words, and URLs. Which TWO preprocessing steps are most appropriate to improve model performance? (Choose two.)

Question 82 (hard, multi-select)

A data scientist is evaluating a binary classification model for fraud detection. The dataset is highly imbalanced (99% non-fraud, 1% fraud). Which TWO metrics are most appropriate for assessing model performance? (Choose two.)

Question 83 (medium, multi-select)

A computer vision team is building an image classifier for rare wildlife species. The dataset has only 500 images per class, and the model overfits. Which THREE data augmentation techniques are most likely to reduce overfitting? (Choose three.)

Question 84 (hard, multiple choice)

A healthcare startup is deploying a machine learning model to predict patient readmission within 30 days using electronic health records (EHR). The data pipeline uses Apache Spark for preprocessing and training on an Amazon EMR cluster. The training dataset is 50 GB and composed of structured numeric and categorical features, along with unstructured clinical notes. The data scientist observes that training takes over 12 hours and frequently fails due to out-of-memory (OOM) errors, especially when processing the clinical notes via TF-IDF vectorization. The cluster has 10 nodes with 64 GB RAM each. The data engineer has already tried increasing spark.sql.shuffle.partitions to 400 and using Kryo serialization, but OOM persists. Which action should the data engineer take next to resolve the OOM errors?

Question 85 (medium, multiple choice)

A financial services company has a real-time fraud detection system that uses Apache Kafka to stream transaction events, a TensorFlow Serving model for scoring, and a Redis cache for lookup of historical fraud patterns. The system processes 10,000 transactions per second with an SLA of 100ms latency per transaction. Recently, after a model update, the latency for some transactions spiked to over 500ms, causing timeouts. The model uses a deep neural network with 10 million parameters. The engineering team suspects the issue is due to increased model inference time. Which action should be taken to reduce latency without significant loss in accuracy?

Question 86 (hard, multiple choice)

A medical imaging team is developing an AI model to detect tumors from CT scans. They have 10,000 labeled scans, but the labels were created by a semi-automated process with an estimated 20% error rate (mislabeled tumor vs. no tumor). The team trains a convolutional neural network (CNN) and achieves 90% accuracy on a held-out test set that was carefully validated by an expert radiologist. However, when deployed to a new hospital's patient population, the accuracy drops to 70%. The team suspects domain shift and label noise. Which strategy is most likely to improve model robustness for the new hospital?

Question 87 (easy, multiple choice)

An e-commerce company deploys a model to recommend products to users. The recommendation system uses collaborative filtering based on user-item interaction history. After deployment, the model shows decreasing click-through rates (CTR) over time. The data engineer notices that the model was trained on data from the past six months and is retrained daily. However, the trend suggests that user preferences are shifting more rapidly than expected. The engineer suspects that the model is suffering from distribution drift. Which approach should the engineer implement to adapt the model more quickly to changing user behavior?

Question 88 (medium, multiple choice)

A logistics company uses a machine learning model to predict delivery times based on historical data including distance, traffic, weather, and driver performance. The model is deployed as a REST API using Flask and run on a single server. Recently, the model has been returning predictions with high latency (over 2 seconds) during peak hours when the API receives 500 requests per second. The server has 8 CPU cores and 32 GB RAM. The model is a gradient boosting model (XGBoost) with 500 trees. The engineer wants to reduce inference latency to under 500ms without retraining the model. Which action is most effective?

Question 89 (easy, multi-select)

A data scientist is preparing a dataset for a classification model. The dataset contains several categorical variables with high cardinality. Which TWO encoding methods are appropriate for converting these categorical variables into numerical features?

Question 90 (hard, multiple choice)

Read the full AI Models and Data Engineering explanation →

A healthcare company is developing a predictive model to identify patients at risk of readmission within 30 days. The data engineering team has built a pipeline that collects data from multiple sources, including electronic health records (EHR), lab results, and wearable device data. During initial testing, the model's performance is poor, with high false positives. Upon investigation, the team discovers that the data contains significant temporal misalignment: lab results are timestamped when ordered, not when collected; wearable data is aggregated hourly; and EHR data has inconsistent update frequencies. The data pipeline currently joins all features on the patient ID without aligning timestamps. The data volume is large, and processing time is a concern. Which action should the data engineering team take to most effectively address the issue and improve model performance?
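One common way to handle this kind of misalignment is a point-in-time (as-of) join: for each prediction time, take the latest observation from each source at or before that time, instead of joining on patient ID alone. The sketch below shows the idea with the standard library only. The record layout, the `latest_before` helper, and the sample values are illustrative assumptions and are not part of the question.

```python
from bisect import bisect_right
from datetime import datetime


def latest_before(events, cutoff):
    """Return the value of the most recent event at or before cutoff.

    events: list of (timestamp, value) tuples sorted by timestamp.
    Returns None when no event falls at or before the cutoff.
    """
    timestamps = [ts for ts, _ in events]
    idx = bisect_right(timestamps, cutoff)
    return events[idx - 1][1] if idx else None


# Hypothetical series for one patient, keyed by collection time
# (not order time), each sorted ascending.
labs = [
    (datetime(2024, 1, 1, 8, 0), 1.1),
    (datetime(2024, 1, 2, 9, 30), 1.4),
]
wearable_hourly = [
    (datetime(2024, 1, 2, 9, 0), 72),
    (datetime(2024, 1, 2, 10, 0), 88),
]

prediction_time = datetime(2024, 1, 2, 9, 45)
features = {
    "last_lab": latest_before(labs, prediction_time),
    "last_heart_rate": latest_before(wearable_hourly, prediction_time),
}
print(features)  # {'last_lab': 1.4, 'last_heart_rate': 72}
```

Because each lookup only sees data available at the prediction time, the 10:00 wearable reading is excluded. That keeps future information out of the features.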

Question 91 (medium, multiple choice)

Read the full AI Models and Data Engineering explanation →

A retail company is building a recommendation system to suggest products to customers based on their purchase history. The data engineering team has collected data from point-of-sale systems, online browsing logs, and customer reviews. After cleaning the data, they notice that the feature set has over 500 dimensions, leading to high computational costs and potential overfitting. They need to reduce dimensionality while preserving as much variance as possible for the model. The team is considering various techniques. Which approach should they take to achieve this goal most effectively?

Question 92 (easy, multiple choice)

Read the full AI Models and Data Engineering explanation →

A logistics company uses a machine learning model to predict delivery times based on historical data. The model was performing well, but recently it started making inaccurate predictions, especially for routes that have experienced new traffic patterns and road closures. The data engineering team receives an alert that the model's accuracy has dropped by 15% over the last week. They suspect data drift. The team has access to the original training data and a continuous stream of new data. What is the most appropriate first step for the team to take?
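When drift is suspected, one common diagnostic is to compare the distribution of a feature in the training data against the same feature in recent data. The sketch below computes a Population Stability Index (PSI) using only the standard library. The `psi` function, the bin count, and the sample travel times are assumptions made for illustration; a production check would normally use a tested statistics library.

```python
import math


def psi(expected, actual, bins=5):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the range of the expected (training) sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp values outside the training range into the edge bins.
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical delivery times in minutes.
train_minutes = list(range(20, 40))
recent_minutes = [m + 15 for m in train_minutes]  # simulated road closures

print(psi(train_minutes, train_minutes))   # identical samples give 0.0
print(psi(train_minutes, recent_minutes))  # a large positive value
```

A frequently quoted rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, though the threshold is a convention rather than a guarantee.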

Question 93 (easy, multi select)

Read the full AI Models and Data Engineering explanation →

A data engineer is preparing a dataset for training a classification model. The dataset contains missing values in multiple features, inconsistent categorical labels, and outliers in numerical features. Which TWO preprocessing steps should the engineer prioritize to improve model performance?

Question 94 (medium, multiple choice)

Read the full AI Models and Data Engineering explanation →

Refer to the exhibit. A data scientist reviews the pipeline and notes that the model performance degraded. Which change to the pipeline would most likely improve model performance?

Exhibit

{
  "data_pipeline": {
    "input": "raw_sales.csv",
    "steps": [
      {"type": "drop_columns", "columns": ["customer_id", "transaction_id"]},
      {"type": "impute_missing", "strategy": "mean", "columns": ["age", "income"]},
      {"type": "encode_categorical", "method": "onehot", "columns": ["product_category"]},
      {"type": "normalize", "method": "minmax", "columns": ["age", "income"]}
    ],
    "output": "processed_sales.parquet"
  }
}
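To make the exhibit concrete, here is a minimal sketch of what its four steps do, applied in the same order to a few rows. It uses plain Python rather than any particular pipeline framework. The `run_pipeline` function and the sample rows are hypothetical, and only the step types and column names come from the exhibit.

```python
def run_pipeline(rows):
    """Apply the exhibit's four steps to a list of dict records."""
    # 1. drop_columns: remove the identifier columns.
    rows = [
        {k: v for k, v in r.items() if k not in ("customer_id", "transaction_id")}
        for r in rows
    ]

    # 2. impute_missing (strategy: mean) for age and income.
    for col in ("age", "income"):
        present = [r[col] for r in rows if r[col] is not None]
        mean = sum(present) / len(present)
        for r in rows:
            if r[col] is None:
                r[col] = mean

    # 3. encode_categorical (method: onehot) for product_category.
    categories = sorted({r["product_category"] for r in rows})
    for r in rows:
        value = r.pop("product_category")
        for c in categories:
            r[f"product_category_{c}"] = 1 if value == c else 0

    # 4. normalize (method: minmax) for age and income.
    for col in ("age", "income"):
        lo = min(r[col] for r in rows)
        hi = max(r[col] for r in rows)
        span = (hi - lo) or 1.0  # guard against a constant column
        for r in rows:
            r[col] = (r[col] - lo) / span
    return rows


# Made-up input rows with one missing age and one missing income.
sample = [
    {"customer_id": 1, "transaction_id": 10, "age": 20, "income": 30000, "product_category": "toys"},
    {"customer_id": 2, "transaction_id": 11, "age": None, "income": 50000, "product_category": "books"},
    {"customer_id": 3, "transaction_id": 12, "age": 40, "income": None, "product_category": "toys"},
]
processed = run_pipeline(sample)
print(processed[1]["age"], processed[2]["income"])  # 0.5 0.5
```

The sketch also shows that the imputed mean and the min/max bounds are all computed from the data itself, so a single extreme value would shift the imputed entries and compress the remaining values after scaling.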

Question 95 (hard, multiple choice)

Read the full AI Models and Data Engineering explanation →

A retail company uses a machine learning model to predict daily sales. The model takes features like past sales, promotions, holidays, and weather data. Recently, the model's accuracy dropped significantly. The data engineer checks the data pipeline and finds that the weather data source changed from a free API to a new paid API that provides more detailed data. The new data includes additional attributes like humidity and wind speed, but the existing pipeline only ingests temperature and precipitation. Also, the time zone format changed from UTC to local time. The model was trained on the old format. Which action should the engineer take first to restore model performance?
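The time zone change in this scenario can be illustrated with a short normalization step that converts local timestamps back to UTC before they reach the model. The sketch below assumes the new API reports naive local timestamps and that the UTC offset is known and fixed. The `to_utc` helper and the offset of -5 hours are hypothetical, and a real pipeline would need named time zones to handle daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_iso, utc_offset_hours):
    """Interpret a naive ISO timestamp as local time and convert it to UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local_dt = datetime.fromisoformat(local_iso).replace(tzinfo=local_tz)
    return local_dt.astimezone(timezone.utc)


# 18:00 at UTC-5 corresponds to 23:00 UTC on the same day.
print(to_utc("2024-03-01T18:00:00", -5).isoformat())  # 2024-03-01T23:00:00+00:00
```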

Refer to the exhibit.

```
Epoch 1/10 - loss: 0.6932 - accuracy: 0.5234 - val_loss: 0.6918 - val_accuracy: 0.5312
Epoch 2/10 - loss: 0.4231 - accuracy: 0.8047 - val_loss: 0.5234 - val_accuracy: 0.7422
Epoch 3/10 - loss: 0.3125 - accuracy: 0.8828 - val_loss: 0.6015 - val_accuracy: 0.7344
Epoch 4/10 - loss: 0.2146 - accuracy: 0.9219 - val_loss: 0.7234 - val_accuracy: 0.7188
Epoch 5/10 - loss: 0.1478 - accuracy: 0.9531 - val_loss: 0.8342 - val_accuracy: 0.7031
```
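The training log in this exhibit can be read programmatically to find where validation loss bottoms out. The sketch below parses the five logged epochs and simulates a simple patience-based early-stopping rule. The regular expression, the `early_stop_epoch` helper, and the patience value of 2 are illustrative assumptions, not part of any specific framework.

```python
import re

log = """\
Epoch 1/10 - loss: 0.6932 - accuracy: 0.5234 - val_loss: 0.6918 - val_accuracy: 0.5312
Epoch 2/10 - loss: 0.4231 - accuracy: 0.8047 - val_loss: 0.5234 - val_accuracy: 0.7422
Epoch 3/10 - loss: 0.3125 - accuracy: 0.8828 - val_loss: 0.6015 - val_accuracy: 0.7344
Epoch 4/10 - loss: 0.2146 - accuracy: 0.9219 - val_loss: 0.7234 - val_accuracy: 0.7188
Epoch 5/10 - loss: 0.1478 - accuracy: 0.9531 - val_loss: 0.8342 - val_accuracy: 0.7031
"""

# Capture the epoch number, training loss, and validation loss from each line.
pattern = re.compile(r"Epoch (\d+)/\d+ - loss: ([\d.]+) .* val_loss: ([\d.]+)")
history = [(int(m[1]), float(m[2]), float(m[3])) for m in pattern.finditer(log)]

best_epoch, _, best_val = min(history, key=lambda h: h[2])
print(best_epoch, best_val)  # 2 0.5234


def early_stop_epoch(val_losses, patience):
    """Return the epoch at which training would stop, or None if it never does."""
    best, waited = float("inf"), 0
    for epoch, v in enumerate(val_losses, start=1):
        if v < best:
            best, waited = v, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


print(early_stop_epoch([h[2] for h in history], patience=2))  # 4
```

Training loss keeps falling after epoch 2 while validation loss rises, which is the usual signature of overfitting. With a patience of 2, the rule would halt at epoch 4 and the epoch 2 weights would be the ones to keep.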

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::ml-training-data/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "10.0.0.0/16"
        }
      }
    }
  ]
}
```

The following output is from an MLflow run:

Run ID: abc123
experiment_id: 1
status: FINISHED
start_time: 2023-10-01 10:00:00
end_time: 2023-10-01 10:05:00
params:
  learning_rate: 0.01
  max_depth: 10
  n_estimators: 100
metrics:
  train_accuracy: 0.999
  val_accuracy: 0.82
  val_f1: 0.79
tags:
  model_type: RandomForest
  dataset: churn_v2

The following is a JSON schema snippet from a data pipeline:

{
  "type": "object",
  "properties": {
    "user_id": { "type": "integer" },
    "timestamp": { "type": "string", "format": "date-time" },
    "event_type": { "type": "string" },
    "value": { "type": "number" }
  },
  "required": ["user_id", "event_type", "value"]
}
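The effect of this schema's `required` list can be seen with a small hand-rolled check. The sketch below mirrors only the property types and required fields from the snippet. It is not a full JSON Schema validator, it does not check the `date-time` format, and the `validate` function and sample records are hypothetical.

```python
# Python types standing in for the schema's JSON types.
SCHEMA_TYPES = {
    "user_id": int,
    "timestamp": str,
    "event_type": str,
    "value": (int, float),
}
REQUIRED = ["user_id", "event_type", "value"]


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for name in REQUIRED:
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, expected in SCHEMA_TYPES.items():
        if name in record:
            value = record[name]
            # bool is a subclass of int in Python but is not a JSON number.
            if isinstance(value, bool) or not isinstance(value, expected):
                errors.append(f"wrong type for {name}")
    return errors


# A record with no timestamp still passes, because timestamp is not required.
print(validate({"user_id": 1, "event_type": "click", "value": 2.5}))  # []
print(validate({"user_id": "1", "event_type": "click"}))
# ['missing required field: value', 'wrong type for user_id']
```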

The following is a confusion matrix for a binary classifier:

                   Predicted: Positive   Predicted: Negative
Actual Positive:   80                    20
Actual Negative:   30                    70
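The standard metrics follow directly from the four cells of this matrix. The short calculation below reads rows as actual classes and columns as predicted classes, as labeled in the exhibit; the variable names are chosen here for illustration.

```python
# Cells of the confusion matrix from the exhibit.
tp, fn = 80, 20  # actual positive: predicted positive / predicted negative
fp, tn = 30, 70  # actual negative: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.750 precision=0.727 recall=0.800 f1=0.762
```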

model:
  type: Sequential
  layers:
    - type: Dense
      units: 128
      activation: relu
    - type: Dense
      units: 64
      activation: relu
    - type: Dense
      units: 1
      activation: sigmoid
  optimizer:
    type: Adam
    learning_rate: 0.01

Data Validation Report:
Table: customers
- column "age": null values: 0, unique values: 87, min: 18, max: 99
- column "income": null values: 12, unique values: 1500, min: 0, max: 500000
- column "region": null values: 0, unique values: 4, values: ["North", "South", "East", "West"]
- column "gender": null values: 0, unique values: 2, values: ["M", "F"]

Data Pipeline Architecture:
- Source: IoT devices -> Kafka Topic "sensor_data"
- Stream Processing: Apache Flink job that ingests from Kafka, cleanses data, and outputs to another Kafka Topic "cleaned_sensor_data"
- Batch Processing: Apache Spark job that reads from "cleaned_sensor_data" via Kafka batch integration, performs feature engineering, and writes to HDFS as Parquet
- Model Training: Python script reads from HDFS, trains an LSTM model, and saves to model registry
- Inference: REST API loads model from registry and serves predictions