CCNA Data Preparation for Machine Learning Questions — Page 2 of 2

MCQ (hard)

A data scientist is using SageMaker built-in linear learner algorithm for a regression problem. The dataset has 10 features, some have missing values, and the target variable is right-skewed. The data scientist wants to handle missing values and transform the target variable to improve model performance. Which data preparation steps should the data scientist take?

A. Apply one-hot encoding to all features and remove missing values by dropping rows.

B. Standardize all features to have zero mean and unit variance, then apply a Box-Cox transformation to the target.

C. Impute missing values with the median of each feature and apply a log transformation to the target variable.

D. Remove rows with missing values and normalize the target to range [0,1].

Answer: C

Handles missing values and skew appropriately.

Why this answer

Option C is correct because median imputation is robust to outliers and keeps every row, and a log transformation pulls in the long right tail of the target. A more symmetric target usually brings the residuals closer to the roughly normal, constant-variance errors that linear regression works best with, which tends to improve convergence and prediction accuracy. Predictions made on the log scale must be transformed back to the original scale before they are reported.
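The two steps in Option C can be sketched in plain Python. This is a minimal illustration, not SageMaker code: the helper names and the sample values are made up, and log(1 + y) is used here as one common variant of the log transform for non-negative targets.

```python
import math
import statistics

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in column]

def log_transform(target):
    """Apply log(1 + y); assumes every y is non-negative."""
    return [math.log1p(y) for y in target]

def inverse_log_transform(predictions):
    """Map log-scale predictions back to the original scale."""
    return [math.expm1(p) for p in predictions]

# Hypothetical data: one feature with gaps and an outlier, one right-skewed target.
feature = [3.0, None, 5.0, 100.0, None, 4.0]
target = [10.0, 12.0, 15.0, 900.0, 11.0, 14.0]

filled = impute_median(feature)   # gaps filled with the median, not the mean
y_log = log_transform(target)     # the long right tail is compressed
```

In a real pipeline the median would be computed on the training split only and then reused for validation and test data.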

Exam trap

The trap here is that candidates may assume standardizing features (Option B) is always required, but for a right-skewed target, transforming the target itself (e.g., log transform) is more critical than scaling features, and imputation is essential to avoid data loss.

How to eliminate wrong answers

Option A is wrong because one-hot encoding all features, including numeric ones, would dramatically increase dimensionality and is inappropriate for features that are not categorical; dropping rows with missing values reduces the dataset size and can introduce bias. Option B is wrong because it never handles the missing values, which the question explicitly requires; standardizing features is beneficial and Box-Cox can reduce skew, but Box-Cox requires strictly positive target values and is not part of SageMaker's built-in linear learner, so it needs custom preprocessing. Option D is wrong because removing rows with missing values discards potentially valuable data and can lead to biased models; normalizing the target to [0,1] does not address skewness and may compress the variance, harming regression performance.

MCQ (hard)

A machine learning team is building a model using a dataset that contains a mix of numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). The team wants to use Amazon SageMaker for training. Which technique should the team use to encode the high-cardinality categorical features effectively?

A. Apply hash encoding to map categories to a fixed number of buckets.

B. Apply target encoding (mean encoding) to the high-cardinality features.

C. Apply one-hot encoding to all categorical features.

D. Apply label encoding to assign integer values to each category.

Answer: B

Target encoding reduces dimensionality and captures target-related information.

Why this answer

For high-cardinality categorical features, target encoding (mean encoding) replaces each category with the mean of the target variable for that category, which captures target-related information in a single numeric column instead of thousands of dummy variables. One-hot encoding would create too many features. Label encoding implies an ordinal relationship that does not exist. Hash encoding keeps dimensionality fixed but can map unrelated categories to the same bucket (collisions).
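As a rough sketch of how target encoding works, the snippet below learns a per-category mean from training rows and falls back to the global mean for unseen categories. The smoothing term, the function names, and the sample zip codes are assumptions added for illustration; they are not part of the question.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed per-category target mean from training rows only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Rare categories are pulled toward the global mean to limit overfitting.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean

def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen ones receive the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]

# Hypothetical training data: zip code and a binary target.
train_zip = ["98101", "98101", "10001", "10001", "10001", "60601"]
train_y = [1, 0, 1, 1, 1, 0]

mapping, prior = fit_target_encoding(train_zip, train_y, smoothing=2.0)
encoded = apply_target_encoding(["10001", "99999"], mapping, prior)
```

Fitting the mapping on training data only, ideally out-of-fold, keeps target information from leaking into validation rows.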

MCQ (medium)

A company is building a fraud detection model on an imbalanced dataset (99% legitimate, 1% fraudulent). To improve recall on the minority class, they want to resample data. Which combination of techniques should they use?

A. SMOTE on entire dataset before train/test split

B. Random oversampling of minority class before train/test split

C. Random undersampling of majority class

D. SMOTE on training set only

Answer: D

Correct: SMOTE generates synthetic minority samples on the training set without affecting the test distribution.

Why this answer

SMOTE should be applied only to the training set to avoid data leakage; evaluation must reflect the original distribution. Random undersampling may discard useful majority samples; random oversampling before split leaks information.
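The order of operations matters more than the particular resampler. The sketch below splits first and resamples only the training portion; for brevity it uses plain random oversampling as a stand-in for SMOTE, and the data, function names, and 80/20 split are illustrative assumptions.

```python
import random

def stratified_split(rows, test_fraction, rng):
    """Split each class separately so the test set keeps the original class ratio."""
    train, test = [], []
    for label in (0, 1):
        group = [r for r in rows if r["label"] == label]
        rng.shuffle(group)
        n_test = int(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

def oversample_minority(train_rows, rng):
    """Duplicate minority rows at random until both classes are the same size."""
    minority = [r for r in train_rows if r["label"] == 1]
    majority = [r for r in train_rows if r["label"] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_rows + extra

rng = random.Random(0)
# Hypothetical dataset: 1,000 rows, 1% fraud (label 1).
rows = [{"id": i, "label": 1 if i < 10 else 0} for i in range(1000)]

train, test = stratified_split(rows, test_fraction=0.2, rng=rng)  # 1. split first
train_balanced = oversample_minority(train, rng)                  # 2. resample train only
```

The test set is never resampled, so evaluation metrics still reflect the 99:1 distribution the model will see in production.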

MCQ (hard)

A machine learning engineer is preparing a dataset for a binary classification model. The dataset has a severe class imbalance (95% class A, 5% class B). The engineer wants to use Amazon SageMaker to train the model. Which data preparation technique should the engineer apply to the training dataset to address the imbalance and improve model performance?

A. Apply data augmentation to the majority class by adding noise.

B. Apply Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples for the minority class.

C. Use a weighted loss function during training to penalize misclassifications of the minority class.

D. Apply random under-sampling to reduce the majority class to match the minority class size.

Answer: B

SMOTE creates synthetic samples, balancing the dataset without losing data.

Why this answer

Option B is correct because SMOTE generates synthetic samples for the minority class by interpolating between existing minority instances, which directly addresses the severe class imbalance (95% class A, 5% class B) by creating a more balanced training dataset. This technique is particularly effective for tabular data in Amazon SageMaker, as it increases the representation of the minority class without simply duplicating existing samples, thereby reducing overfitting and improving the model's ability to learn decision boundaries for the minority class.

Exam trap

The trap here is that candidates confuse data preparation techniques (like SMOTE) with training-time strategies (like weighted loss functions), leading them to select option C even though the question explicitly specifies applying a technique to the training dataset before training.

How to eliminate wrong answers

Option A is wrong because applying data augmentation by adding noise to the majority class does not address the imbalance; it only increases the size of the already dominant class, potentially worsening the imbalance and introducing irrelevant variance. Option C is wrong because using a weighted loss function is a training-time technique, not a data preparation technique; the question explicitly asks for a data preparation technique to apply to the training dataset before training. Option D is wrong because random under-sampling to match the minority class size would discard about 95% of the majority class (90 of every 95 majority samples), leading to significant information loss and a high risk of underfitting, especially with a severe 95:5 imbalance.

MCQ (medium)

A data scientist is preparing a large dataset for training a binary classification model. The dataset has a severe class imbalance (95% negative, 5% positive). Which data preparation technique should the scientist use to address this imbalance without losing too much data?

A. SMOTE (Synthetic Minority Over-sampling Technique)

B. Random undersampling of the majority class

C. Random oversampling of the minority class

D. Apply class weights during model training

Answer: A

Generates synthetic samples for the minority class.

Why this answer

SMOTE (Synthetic Minority Over-sampling Technique) is the best choice because it generates synthetic examples for the minority class by interpolating between existing minority instances and their k-nearest neighbors, rather than simply duplicating data. This addresses the severe 95:5 class imbalance without losing data (as undersampling would) and without the overfitting risk of naive random oversampling. The synthetic samples help the model learn a more general decision boundary for the positive class.
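The interpolation step described above can be illustrated with a small pure-Python sketch. It is a simplified teaching version, not the imbalanced-learn implementation: the function name, the choice of k, and the sample points are assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority points and their k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: squared_distance(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points with two numeric features.
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority_points, n_new=6)
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE adds variety that plain duplication cannot.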

Exam trap

AWS often tests the distinction between data-level techniques (like SMOTE, oversampling, undersampling) and algorithm-level techniques (like class weights), and the trap here is that candidates confuse class weighting as a data preparation method when it is actually a model training adjustment, not a data transformation step.

How to eliminate wrong answers

Option B is wrong because random undersampling of the majority class discards a large portion of the dataset (up to 95% of the negative examples), which leads to significant information loss and can degrade model performance due to reduced training data. Option C is wrong because random oversampling of the minority class simply duplicates existing positive examples, which does not introduce new variability and often causes overfitting, especially when the minority class is very small (5%). Option D is wrong because applying class weights during model training is a cost-sensitive learning technique, not a data preparation technique; it adjusts the loss function to penalize misclassifications of the minority class more heavily, but the question specifically asks for a data preparation technique to address imbalance without losing data.

MCQ (medium)

A financial services company is building a fraud detection model using historical transaction data stored in Amazon S3. The data includes features such as transaction amount, merchant category, time of day, and user location. The data scientist observes that the 'merchant_category' column is a text attribute with over 200 unique values. Additionally, the 'transaction_amount' column has a long-tail distribution with extreme outliers. The dataset is 200 GB in size, and the company wants to use Amazon SageMaker for model training. The data scientist needs to engineer features that capture the high-cardinality category and reduce the impact of outliers. What is the MOST efficient and effective approach to prepare this data?

A. Use AWS Glue ETL to apply one-hot encoding to merchant_category and min-max scaling to transaction_amount.

B. Use Amazon EMR with Spark to apply ordinal encoding to merchant_category based on frequency, and log-transform the transaction_amount to reduce skewness.

C. Use Amazon Athena to bin transaction_amount into 10 equal-width bins and replace merchant_category with its count encoding.

D. Use AWS Glue DataBrew to apply a one-hot encoding on merchant_category and a standard scaler on transaction_amount after removing outliers.

Answer: B

Ordinal encoding handles high cardinality efficiently, and log transformation compresses extreme values, both reducing dimensionality and improving model performance.

Why this answer

Option B is correct because ordinal encoding based on frequency handles high-cardinality categorical features efficiently without exploding dimensionality, and log-transform is a standard technique to reduce skewness in long-tail distributions. Using Amazon EMR with Spark provides distributed processing for the 200 GB dataset, making it scalable and cost-effective compared to single-node alternatives.
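The two transformations in Option B are simple to express. The sketch below shows only the logic in plain Python on a handful of made-up rows; at 200 GB the same steps would run as Spark transformations on EMR, and the function name and tie-breaking rule are assumptions.

```python
import math
from collections import Counter

def frequency_rank_encoding(values):
    """Map each category to its rank by frequency (1 = most frequent)."""
    counts = Counter(values)
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {cat: rank for rank, cat in enumerate(ordered, start=1)}

# Hypothetical sample of the two columns named in the question.
merchant_category = ["grocery", "fuel", "grocery", "travel", "grocery", "fuel"]
transaction_amount = [12.5, 40.0, 8.0, 25000.0, 19.9, 55.0]

ranks = frequency_rank_encoding(merchant_category)
encoded = [ranks[c] for c in merchant_category]           # one integer column
log_amount = [math.log1p(a) for a in transaction_amount]  # the outlier is compressed
```

The category column stays a single integer feature regardless of how many distinct values it has, and the extreme amount no longer dominates the scale.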

Exam trap

The trap here is that candidates often default to one-hot encoding for categorical data without considering cardinality, and assume scaling methods like min-max or standard scaling are always appropriate, ignoring the impact of outliers on these transformations.

How to eliminate wrong answers

Option A is wrong because one-hot encoding on a column with over 200 unique values would create over 200 sparse columns, dramatically increasing memory and training time, and min-max scaling is sensitive to outliers, which would compress the majority of values into a narrow range. Option C is wrong because equal-width binning on a long-tail distribution will result in most data falling into the first few bins, losing information, and count encoding alone may not capture the ordinal relationship implied by frequency. Option D is wrong because one-hot encoding again suffers from high dimensionality, standard scaling is not robust to outliers (it uses mean and standard deviation), and removing outliers arbitrarily can discard valuable fraud signals.

MCQ (easy)

A marketing company is preparing a dataset to train a logistic regression model to predict whether a customer will click on an online ad. The dataset includes 1 million records with features: customer_age (numeric), income (numeric), education_level (ordinal: high school, bachelor, master, PhD), and ad_category (categorical: 50 unique values). The data is stored in a CSV file in Amazon S3. The data scientist plans to use Amazon SageMaker's built-in linear learner algorithm. The data scientist needs to preprocess the data before training. What is the correct sequence of data preparation steps that should be applied to this dataset to ensure optimal model performance?

A. Drop any duplicate records, apply min-max scaling to all numeric features, and use target encoding for ad_category based on click rates.

B. Apply PCA to all numeric and categorical features after converting categories to numeric indices, then standardize the principal components.

C. Apply min-max scaling to customer_age and income, label encode education_level and ad_category, then use recursive feature elimination to reduce dimensionality.

D. Standardize customer_age and income to have zero mean and unit variance, one-hot encode ad_category, ordinal encode education_level (e.g., map to 1-4), then combine all features into a feature matrix.

Answer: D

Standardization helps linear models converge faster; one-hot encoding is standard for a nominal feature such as ad_category (50 values); ordinal encoding preserves the natural order of education_level.

Why this answer

Option D is correct because it applies appropriate preprocessing for a logistic regression model using SageMaker's linear learner. Standardizing numeric features (zero mean, unit variance) is essential for linear models to ensure convergence and equal feature influence. One-hot encoding the categorical ad_category (50 unique values) avoids imposing ordinal relationships, while ordinal encoding education_level respects its natural order. This combination prepares a feature matrix suitable for the linear learner's optimization.
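The two encodings in Option D can be sketched as follows. This is an illustration only: the three-item ad vocabulary is a hypothetical stand-in for the 50 real categories, and scaling of the numeric columns is left out to keep the example short.

```python
# Ordinal mapping follows the order given in the question.
EDUCATION_ORDER = {"high school": 1, "bachelor": 2, "master": 3, "PhD": 4}

def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position of value."""
    return [1 if value == v else 0 for v in vocabulary]

def build_row(record, ad_vocabulary):
    """Combine the ordinal and one-hot encoded fields into one feature vector."""
    education = EDUCATION_ORDER[record["education_level"]]
    return [education] + one_hot(record["ad_category"], ad_vocabulary)

ad_vocabulary = ["autos", "sports", "travel"]  # hypothetical; the real column has 50 values
record = {"education_level": "master", "ad_category": "sports"}
row = build_row(record, ad_vocabulary)
```

The ordered field becomes one integer column, while the unordered field becomes one indicator column per category, so no false ranking is imposed on ad_category.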

Exam trap

The trap here is that candidates often choose label encoding for all categorical features (Option C) or target encoding (Option A) without considering the ordinal nature of education_level or the risk of data leakage, leading to suboptimal model performance.

How to eliminate wrong answers

Option A is wrong because min-max scaling is not optimal for linear models (it does not center data, which can slow convergence), and target encoding ad_category based on click rates risks target leakage and overfitting unless the encoding is computed on training folds only. Option B is wrong because applying PCA to categorical features after converting to numeric indices is inappropriate (PCA assumes linear relationships and continuous data), and standardizing the principal components afterward does not repair that. Option C is wrong because label encoding ad_category (50 unique values) imposes false ordinal relationships, and recursive feature elimination is computationally expensive and unnecessary for this dataset size; min-max scaling also lacks centering for linear models.

MCQ (easy)

A machine learning engineer is preparing a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). Which technique is most appropriate for encoding these high-cardinality categorical features?

A. Label encoding

B. One-hot encoding

C. Frequency encoding

D. Target encoding

Answer: D

Encodes using target mean, handles high cardinality well.

Why this answer

Target encoding is the most appropriate technique for high-cardinality categorical features because it replaces each category with the mean of the target variable for that category, effectively capturing the predictive signal while keeping the feature as a single numeric column. This avoids the dimensionality explosion of one-hot encoding and the arbitrary ordinality of label encoding, making it a common choice in gradient boosting frameworks like XGBoost or LightGBM for datasets with thousands of unique categories.

Exam trap

AWS often tests the misconception that one-hot encoding is always the safest choice for categorical data, but candidates fail to recognize that high cardinality makes it impractical, leading them to overlook target encoding as a more efficient alternative.

How to eliminate wrong answers

Option A is wrong because label encoding assigns arbitrary integer values to categories, which introduces a false ordinal relationship that can mislead the model (linear and distance-based models most of all) into treating the categories as ordered, degrading performance. Option B is wrong because one-hot encoding creates a binary column for each unique category, which with thousands of categories leads to an extremely high-dimensional and sparse feature space, causing memory issues and overfitting. Option C is wrong because frequency encoding replaces categories with their occurrence counts, which loses the relationship between the category and the target variable, often resulting in weaker predictive power compared to target encoding.

MCQ (easy)

Refer to the exhibit. The Glue job reads a CSV file and attempts to write to a Parquet table. What is the most likely cause of this error?

A. The 'price' column is missing from some rows

B. The schema inference incorrectly detected the column as String

C. The 'price' column contains non-numeric values in some rows

D. The CSV file is compressed and not properly decompressed

Answer: C

Non-numeric strings like 'N/A' or commas cause conversion errors.

Why this answer

Option C is correct because the error message indicates a 'NumberFormatException' when parsing the 'price' column, which occurs when Spark attempts to convert a string value to a numeric type. The job evidently treats 'price' as a numeric column (for example, because the schema was inferred from a sample of rows or declared as numeric), so any row containing a non-numeric value (e.g., 'N/A', 'null', or a currency symbol) causes this parsing failure during the write to Parquet.

Exam trap

AWS often tests the distinction between schema inference behavior and runtime type conversion errors, where candidates mistakenly attribute the error to missing data or schema detection rather than the actual parsing failure caused by malformed values.

How to eliminate wrong answers

Option A is wrong because missing values in a column would result in a null value, not a NumberFormatException; Spark can handle nulls in numeric columns without throwing a parsing error. Option B is wrong because if the schema inference had incorrectly detected the column as String, the write to Parquet would succeed without any type conversion error; the error occurs only when Spark tries to parse a string as a number. Option D is wrong because compressed CSV files are automatically decompressed by Spark/Glue based on the file extension (e.g., .gz, .bz2), and a decompression issue would produce an IOException or a different error, not a NumberFormatException.
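One way to find the offending rows before the write is to parse the column defensively. The sketch below is plain Python rather than Glue or Spark code; the helper name, the cleaning rules, and the sample values are assumptions for illustration.

```python
def parse_price(raw):
    """Return the price as a float, or None when the text is not numeric."""
    if raw is None:
        return None
    # Strip whitespace and common formatting characters before parsing.
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        # Note: float() also accepts "nan" and "inf"; filter those separately if needed.
        return float(cleaned)
    except ValueError:
        return None

# Hypothetical raw values from the 'price' column.
raw_prices = ["19.99", "N/A", "$1,250.00", "", " 7 "]
parsed = [parse_price(p) for p in raw_prices]
bad_rows = [p for p, v in zip(raw_prices, parsed) if v is None]
```

Rows that fail to parse become nulls that a numeric Parquet column can store, and the list of bad values shows what needs cleaning at the source.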

MCQ (medium)

A data scientist is using Amazon SageMaker Processing to run a feature engineering job. The job requires installing additional Python libraries not included in the default SageMaker containers. Which approach should the data scientist use to include these libraries?

A. Add the libraries to the `requirements.txt` file in the same S3 bucket as the script

B. Create a custom Docker image with the libraries installed and specify its image URI when configuring the processing job

C. Use Amazon EFS to store the libraries and mount them to the processing container

D. Use the `pip install` command within the processing script at runtime

Answer: B

A custom image ensures dependencies are available without runtime installation.

Why this answer

Option B is correct because a custom Docker image with the required libraries pre-installed makes the environment consistent and reproducible and avoids dependency resolution failures during job execution. Installing packages with pip at runtime depends on outbound internet access, which is unavailable when network isolation is enabled or when the job runs in a VPC subnet without a route to the internet. This approach aligns with SageMaker's best practice for custom dependencies.

Exam trap

The trap here is that candidates assume SageMaker containers have internet access by default or that a `requirements.txt` in S3 is automatically processed, but in reality, SageMaker Processing jobs often run in isolated subnets without outbound internet, making pip install impossible without a pre-built custom image.

How to eliminate wrong answers

Option A is wrong because a `requirements.txt` file in S3 is not automatically processed by SageMaker Processing; the container does not read it unless explicitly handled in a custom entry point or lifecycle script, and even then, pip install requires network access or a pre-built wheel. Option C is wrong because Amazon EFS is a file system for shared storage, not for distributing Python libraries; mounting EFS to a processing container would require custom network configuration and does not integrate with Python's import system without additional setup. Option D is wrong because `pip install` inside the processing script at runtime will fail if the container lacks internet access (common in VPC-only modes) or if the required build tools are missing, and it violates the principle of immutable infrastructure.

MCQ (hard)

A company operates an IoT platform that ingests sensor data from thousands of devices. Data is streamed via Amazon Kinesis Data Streams and stored in an S3 bucket using a Kinesis Firehose delivery stream, which writes data in 5-minute windows. The data is then used to train a machine learning model for anomaly detection. Recently, the data science team noticed that the training dataset is always missing the last 5 minutes of events from the end of each day. The S3 objects show that the last delivery stream buffer window is incomplete. The data engineer checked the Kinesis Firehose metrics and found no delivery errors or data loss, but the 'IncomingBytes' and 'IncomingRecords' metrics show consistent data for all periods. The S3 bucket has Lifecycle policies that do not delete objects. The team suspects the issue is related to the data preparation pipeline. Which course of action would correctly resolve the missing data problem?

A. Increase the buffer size to 10 MB and reduce the buffer interval to 60 seconds in the Firehose delivery stream configuration

B. Reprocess the Kinesis stream data from the beginning using a custom application

C. Modify the data preparation pipeline to use AWS Lambda to write data to S3 directly from Kinesis

D. Increase the buffer interval to 600 seconds to allow more time for data to accumulate

Answer: A

Reducing the buffer interval to 60 seconds ensures that data is flushed every minute, preventing incomplete windows from being missed at the end of the day.

Why this answer

Option A is correct because the issue is that the last 5-minute buffer window at the end of each day never completes, so Firehose never delivers that final object to S3. By reducing the buffer interval to 60 seconds and increasing the buffer size to 10 MB, Firehose will flush data more frequently, ensuring that even small residual data at the end of the day is delivered before the stream stops. This directly addresses the incomplete last window without requiring reprocessing or changing the pipeline architecture.

Exam trap

The trap here is that candidates assume the missing data is due to data loss or pipeline errors, but the real issue is that Firehose's buffer window never completes when data stops arriving, so no S3 object is created for that final period.

How to eliminate wrong answers

Option B is wrong because reprocessing the entire Kinesis stream from the beginning is unnecessary and inefficient; the data is not lost, it is simply never delivered due to the buffer window not closing. Option C is wrong because switching to a Lambda-based direct write from Kinesis to S3 would bypass Firehose entirely, adding complexity and potential for data loss or duplication, and does not fix the root cause of the incomplete buffer window. Option D is wrong because increasing the buffer interval to 600 seconds would make the problem worse, as it would extend the time needed for a buffer window to complete, increasing the likelihood of incomplete windows at day boundaries.

MCQ (medium)

A company is building a time series forecasting model using SageMaker DeepAR. The raw data is a CSV with columns: timestamp, item_id, and value. What is the correct data format required for DeepAR training?

A. JSON Lines files with 'start', 'target', and optional fields per time series

B. A wide-format CSV where each column is a different time series

C. Parquet files with a schema containing timestamp, item_id, and value

D. A single CSV file with columns: timestamp, item_id, value

Answer: A

DeepAR's training data format is JSON Lines with start timestamp and target array.

Why this answer

DeepAR requires time series data to be provided in JSON Lines format, where each line represents a single time series with a 'start' timestamp (in ISO 8601 format), a 'target' array of values, and optional fields like 'cat' for categorical features. This structured format allows DeepAR to handle variable-length sequences and missing values natively, which is not possible with simple CSV or wide-format data.

Exam trap

The trap here is that candidates assume DeepAR can accept raw CSV data like other SageMaker built-in algorithms (e.g., XGBoost), but DeepAR is a specialized time series algorithm that requires a specific JSON Lines structure with 'start' and 'target' fields, not a simple tabular format.

How to eliminate wrong answers

Option B is wrong because wide-format CSV (each column as a separate time series) is not supported by DeepAR; it expects each time series to be a separate record, not a column. Option C is wrong because, although DeepAR can also read Parquet, the records must follow the same per-series structure with 'start' and 'target' fields; a long-format schema of timestamp, item_id, and value does not. Option D is wrong because a single CSV with timestamp, item_id, and value columns does not provide the 'start' and 'target' structure DeepAR needs; it would require significant preprocessing to group by item_id and convert to the required JSON Lines format.
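The grouping step mentioned above might look like the following sketch, which turns long-format rows into one JSON line per item. The function name and sample rows are hypothetical, and the code assumes every series is regularly spaced with no gaps; real data would need gap handling first.

```python
import json
from collections import defaultdict

def to_deepar_jsonl(rows):
    """Group (timestamp, item_id, value) rows into one JSON line per item_id."""
    series = defaultdict(list)
    for timestamp, item_id, value in rows:
        series[item_id].append((timestamp, value))
    lines = []
    for item_id in sorted(series):
        # Timestamps in this fixed-width format sort chronologically as text.
        points = sorted(series[item_id])
        lines.append(json.dumps({
            "start": points[0][0],
            "target": [v for _, v in points],
        }))
    return lines

# Hypothetical long-format rows, deliberately out of order.
rows = [
    ("2024-01-02 00:00:00", "sku_a", 7.0),
    ("2024-01-01 00:00:00", "sku_a", 5.0),
    ("2024-01-01 00:00:00", "sku_b", 2.0),
    ("2024-01-03 00:00:00", "sku_a", 6.0),
]
jsonl = to_deepar_jsonl(rows)
```

Each output line is a standalone JSON object, so the lines can be written to a file one per row and uploaded to the training channel.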

MCQ (easy)

A data scientist is preparing a dataset for a machine learning model that predicts customer churn. The dataset contains a column 'CustomerID' that is a unique identifier. What should the data scientist do with this column before training the model?

A. Keep the column as a feature because it uniquely identifies each customer.

B. Use the column as the target variable.

C. Remove the column from the feature set.

D. Encode the column using one-hot encoding.

Answer: C

Removing unique identifiers prevents overfitting and is standard practice.

Why this answer

Option C is correct because 'CustomerID' is a unique identifier with no predictive power for churn. Including it as a feature would cause the model to memorize individual customers rather than learn generalizable patterns, leading to overfitting and poor performance on unseen data. In machine learning, such columns should be removed during data preparation to ensure the model learns from meaningful features.

Exam trap

The trap here is that candidates may think unique identifiers are useful for tracking or that they can be encoded as categorical features, but the exam tests the principle that identifiers with no predictive relationship to the target must be removed to avoid overfitting and data leakage.

How to eliminate wrong answers

Option A is wrong because keeping 'CustomerID' as a feature introduces a high-cardinality categorical variable with no correlation to the target, which can cause overfitting and degrade model generalization. Option B is wrong because the target variable for churn prediction should be a binary or categorical label indicating churn status, not a unique identifier that has no relationship to the outcome. Option D is wrong because one-hot encoding a unique identifier like 'CustomerID' would create thousands of sparse binary columns, dramatically increasing dimensionality without adding any predictive value, and is computationally wasteful.

MCQ (medium)

Refer to the exhibit. A SageMaker Processing job fails with the following error log. Which change during data preparation would resolve the issue?

A. In SageMaker Data Wrangler, set the 'age' column type to 'number'

B. Drop rows with missing values in the 'age' column before training

C. Remove the 'age' column from the dataset entirely

D. Modify the preprocessing script to cast 'age' to float using astype(float)

Answer: D

Casting the column ensures numeric operations work.

Why this answer

Option D is correct because the error log indicates a type mismatch when processing the 'age' column, likely due to mixed data types (e.g., strings and numbers) in a column expected to be numeric. By explicitly casting the column to float using astype(float) in the preprocessing script, you ensure consistent numeric type handling, which resolves the failure during SageMaker Processing job execution.

Exam trap

The trap here is that candidates often assume missing value handling (Option B) or column removal (Option C) is the fix, when the actual issue is a data type inconsistency that requires explicit type casting in the preprocessing code.

How to eliminate wrong answers

Option A is wrong because setting the 'age' column type to 'number' in SageMaker Data Wrangler only affects the visual interface and exported recipe, but does not enforce type casting in the actual processing script, so the underlying data type mismatch persists. Option B is wrong because dropping rows with missing values does not address the core issue of mixed data types (e.g., strings like 'N/A' or 'unknown') in the 'age' column; the error is about type conversion, not missing values. Option C is wrong because removing the 'age' column entirely discards potentially valuable feature data and does not solve the type mismatch problem; it is an overly aggressive workaround that reduces model performance.

MCQ (easy)

A machine learning engineer is building a regression model to predict house prices. The feature 'square_footage' has values ranging from 500 to 10,000, while 'num_bedrooms' ranges from 1 to 10. Which preprocessing step is most critical before training a model that uses gradient descent?

A. Standardize both features to have zero mean and unit variance.

B. Apply a logarithmic transformation to both features.

C. Encode the 'num_bedrooms' feature using one-hot encoding.

D. Impute missing values using the mean of the feature.

Answer: A

Standardization brings features to a common scale, crucial for gradient descent.

Why this answer

Gradient descent is sensitive to the scale of features because it updates weights proportionally to the feature values. With 'square_footage' (500–10,000) and 'num_bedrooms' (1–10), the large range difference causes the loss function's contours to be elongated, leading to slow or unstable convergence. Standardizing both features to zero mean and unit variance ensures each feature contributes equally to the gradient updates, enabling faster and more reliable optimization.
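To make the effect concrete, the sketch below standardizes two columns whose raw ranges match the question. The sample values and the helper name are illustrative assumptions; the formula is the usual z-score, (x - mean) / standard deviation.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # a constant column (std 0) would need special handling
    return [(x - mean) / std for x in values]

# Illustrative values spanning the ranges given in the question.
square_footage = [500, 1200, 2500, 4000, 10000]
num_bedrooms = [1, 2, 3, 4, 10]

z_sqft = standardize(square_footage)
z_beds = standardize(num_bedrooms)
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates the gradient simply because of its units.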

Exam trap

AWS often tests the distinction between scaling for gradient-based optimizers versus other preprocessing steps like encoding or transformation, trapping candidates who confuse feature scaling with handling outliers or categorical data.

How to eliminate wrong answers

Option B is wrong because applying a logarithmic transformation is not the most critical step for gradient descent; it is used to handle skewed distributions or multiplicative relationships, not to address feature scale differences. Option C is wrong because one-hot encoding is for categorical features, and 'num_bedrooms' is ordinal (integer-valued), not nominal; encoding it would create unnecessary sparsity and lose the natural ordering. Option D is wrong because imputing missing values is a general data cleaning step, but the question does not mention any missing data; the core issue here is feature scaling for gradient descent, not missingness.
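
As a rough illustration of the standardization step described above, the sketch below computes z-scores with the Python standard library. The sample values are hypothetical and only mimic the two ranges given in the question; a real pipeline would fit the mean and standard deviation on the training split only.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Hypothetical sample rows for the two features in the question.
square_footage = [500, 1200, 2500, 4000, 10000]
num_bedrooms = [1, 2, 3, 4, 10]

scaled_sqft = standardize(square_footage)
scaled_beds = standardize(num_bedrooms)
```

After this step both columns have zero mean and unit variance, so neither dominates the gradient because of its raw units.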

MCQmedium

A company uses AWS Glue to run ETL jobs that prepare data for machine learning. The data is stored in Amazon S3 in Parquet format. A data engineer notices that the Glue job is running slowly and consuming a lot of resources. What is the MOST cost-effective way to improve the performance of the Glue job?

A.Use the G.1X worker type, which provides more memory per worker compared to the Standard worker type.

B.Use partition pruning on the source data to reduce the amount of data processed.

C.Switch the output format from Parquet to CSV to reduce processing overhead.

D.Use a larger instance type for the Glue job by increasing the number of DPUs.

AnswerA

G.1X offers more memory, reducing memory-related bottlenecks without increasing DPU count.

Why this answer

The G.1X worker type provides more memory per worker than the Standard worker type, which can relieve memory-related bottlenecks without increasing the DPU count and so offers better resource utilization. Increasing the number of DPUs (Data Processing Units) in AWS Glue can improve parallelism and reduce job runtime, but it increases cost. Switching the output from Parquet to CSV may degrade performance, because CSV is a row-oriented text format. Partition pruning on the source data can reduce the amount of data scanned, but it may not address the job's resource consumption.

MCQhard

A data science team is building a model to predict fraudulent transactions. The dataset has 1 million legitimate transactions and only 1,000 fraudulent ones. They plan to use Amazon SageMaker to train a model. Which data preparation technique should they apply to address the severe class imbalance before training?

A.Apply data augmentation using image transformations because fraud detection is like image classification.

B.Randomly oversample the fraudulent class to match the legitimate count by duplicating existing fraud records.

C.Use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic fraudulent samples.

D.Randomly undersample the legitimate class to 1,000 samples to create a balanced dataset.

AnswerC

SMOTE creates synthetic examples by interpolating between existing minority instances, reducing overfitting risk.

Why this answer

Option C is correct because SMOTE generates synthetic samples for the minority class, addressing the imbalance without simply duplicating data. Option B is wrong because oversampling by duplicating existing fraud records can lead to overfitting. Option D is wrong because undersampling the legitimate class to 1,000 samples discards almost all of the legitimate data, losing valuable patterns. Option A is wrong because the data is tabular, not images, so image transformations do not apply.
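
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the reference algorithm or any library's implementation: the function name `smote_like`, the toy fraud rows, and the choice of `k` are all assumptions made for the example.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating toward a nearby minority row."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to base.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

# Hypothetical minority-class rows: [amount, items_in_basket].
fraud_rows = [[100.0, 1.0], [120.0, 2.0], [90.0, 1.5], [300.0, 5.0]]
new_rows = smote_like(fraud_rows, n_new=6)
```

Each synthetic row lies on a line segment between two real minority rows, which is why the result differs from plain duplication. Synthetic rows should be generated from the training split only, so that no interpolated copy of a test record leaks into training.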

Multi-Selectmedium

A data scientist is performing feature engineering for a dataset with both numerical and categorical features. The data scientist wants to apply transformations that preserve the interpretability of the features. Which TWO transformations should the data scientist use? (Select TWO)

Select 2 answers

A.Log transformation of skewed numerical features

B.Target encoding of high-cardinality categorical features

C.Standard scaling of numerical features

D.PCA dimensionality reduction

E.One-hot encoding of categorical features

AnswersA, C

Log transformation reduces skewness while keeping feature order.

Why this answer

Log transformation is correct because it reduces skewness in numerical features by compressing the scale of large values, making the distribution more normal while preserving the original feature's interpretability (e.g., a log-transformed income value still relates to income). This is a monotonic transformation, so the order of values is maintained, and the feature remains directly understandable.

Exam trap

AWS often tests the misconception that one-hot encoding always preserves interpretability (it does, but the question pairs it with target encoding as a distractor), leading candidates to select one-hot encoding instead of recognizing that standard scaling is the correct second choice for numerical features.

MCQmedium

A company uses Amazon SageMaker Ground Truth to create labeled datasets for object detection. The output must be in COCO format for downstream model training. How should the data preparation process be configured?

A.Use a built-in transformation to convert from Ground Truth JSON to COCO after labeling

B.Use a pre-built AWS Lambda function to transform annotations to COCO

C.Write a custom SageMaker Processing script to convert the output to COCO

D.Select 'Object Detection' task type and specify 'COCO' as the output format in the labeling job configuration

AnswerD

Ground Truth supports COCO output for object detection tasks.

Why this answer

Option D is correct because Amazon SageMaker Ground Truth natively supports outputting object detection labeling jobs in COCO format. When you select 'Object Detection' as the task type, the labeling job configuration includes an option to specify 'COCO' as the output format, which automatically structures the labeled data into the required COCO JSON schema without any post-processing.

Exam trap

The trap here is that candidates assume post-processing is always required for format conversion, overlooking that Ground Truth can directly output COCO format when the correct task type and output format are selected in the labeling job configuration.

How to eliminate wrong answers

Option A is wrong because Ground Truth does not provide a built-in transformation to convert its default JSON output to COCO format; the conversion must be handled externally. Option B is wrong because while AWS Lambda can be used for custom transformations, it is not a pre-built solution for this specific conversion; using a Lambda function would require writing custom code and is not the recommended or simplest approach. Option C is wrong because writing a custom SageMaker Processing script is an unnecessary extra step; Ground Truth can directly output COCO format, eliminating the need for any post-labeling transformation.

MCQeasy

A data engineer needs to convert a JSON dataset to Parquet format for efficient querying with Amazon Athena. The JSON files are in an S3 bucket. Which service can perform this conversion with minimal coding?

A.Amazon SageMaker Processing

B.Amazon EMR

C.AWS Lambda

D.AWS Glue Studio with a visual job

AnswerD

Glue Studio's drag-and-drop interface enables JSON to Parquet conversion with minimal coding.

Why this answer

AWS Glue Studio with a visual job is the correct choice because it provides a no-code, drag-and-drop interface to create ETL jobs that can read JSON from S3 and write it as Parquet, with built-in schema inference and transformation capabilities. This minimizes coding effort while leveraging Glue's serverless Spark engine for efficient conversion, making it ideal for preparing data for Athena queries.

Exam trap

The trap here is that candidates often confuse AWS Glue Studio with AWS Glue DataBrew or assume that any AWS service with 'processing' in its name (like SageMaker Processing) is suitable for simple ETL tasks, overlooking the specific no-code visual job capability of Glue Studio.

How to eliminate wrong answers

Option A is wrong because Amazon SageMaker Processing is designed for data preprocessing and evaluation workloads within the ML pipeline, not for simple file format conversion; it requires writing a custom processing script and configuring a processing job, which adds unnecessary complexity. Option B is wrong because Amazon EMR is a managed Hadoop/Spark cluster that can perform the conversion, but it requires provisioning and configuring a cluster, writing Spark or Hive code, and managing the cluster lifecycle, which is far more coding and operational overhead than a visual job. Option C is wrong because AWS Lambda has a maximum execution time of 15 minutes and a deployment package size limit, making it impractical for converting large JSON datasets to Parquet; it also requires custom Python code with libraries like PyArrow or Pandas, which is not minimal coding.

Multi-Selecteasy

Which TWO actions are recommended best practices when preparing training data for a machine learning model in AWS? (Choose two.)

Select 2 answers

A.Remove all outliers from the dataset.

B.Train the model on the entire dataset to maximize data usage.

C.Check for and handle missing values appropriately.

D.Split the data into training, validation, and test sets.

E.Always normalize all features to a [0,1] range.

AnswersC, D

Missing values can cause errors or bias if not addressed.

Why this answer

Option C is correct because missing values can introduce bias or cause algorithms to fail, so handling them (e.g., via imputation or removal) is a critical data preparation step in AWS SageMaker. Option D is correct because splitting data into training, validation, and test sets allows you to evaluate model performance on unseen data and prevent overfitting, which is a standard practice in SageMaker's built-in algorithms and training jobs.

Exam trap

The trap here is that candidates assume all outliers must be removed (Option A) or that normalization is always required (Option E), but the exam tests nuanced understanding that these steps depend on the algorithm and data characteristics, not blanket rules.
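
A three-way split like the one in Option D can be sketched as follows. The 70/15/15 proportions, the function name `split_dataset`, and the fixed seed are illustrative assumptions; time-ordered data would normally be split chronologically rather than shuffled.

```python
import random

def split_dataset(rows, train=0.7, val=0.15, seed=42):
    """Shuffle rows once, then slice into train/validation/test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical dataset of 100 row identifiers.
train_set, val_set, test_set = split_dataset(range(100))
```

Every row lands in exactly one partition, so the test set stays unseen until the final evaluation.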

MCQhard

A data scientist is preparing a dataset for a regression model that predicts house prices. The dataset includes a `neighborhood` feature with 500 distinct categories. The data scientist wants to encode this feature without increasing dimensionality too much and while capturing the target relationship. Which encoding technique should be used?

A.Target encoding (mean encoding)

B.One-hot encoding

C.Frequency encoding

D.Label encoding

AnswerA

Target encoding captures target relationship with low dimensionality.

Why this answer

Target encoding (mean encoding) is the correct choice because it replaces each of the 500 neighborhood categories with the mean of the target variable (house price) for that category. This captures the relationship between the neighborhood and the target while adding only one new feature column, thus avoiding the massive dimensionality explosion that would occur with one-hot encoding (which would create 500 binary columns).

Exam trap

AWS often tests the trade-off between dimensionality and information retention, and the trap here is that candidates may choose one-hot encoding out of habit, failing to recognize that 500 categories make it impractical, or choose label encoding because it seems simple, ignoring the ordinal assumption it imposes.

How to eliminate wrong answers

Option B (One-hot encoding) is wrong because it would create 500 binary columns, drastically increasing dimensionality and leading to the curse of dimensionality, sparsity, and overfitting. Option C (Frequency encoding) is wrong because it replaces categories with their count/frequency, which does not capture the relationship with the target variable (house price) and loses predictive signal. Option D (Label encoding) is wrong because it assigns arbitrary integer labels (e.g., 1, 2, 3) that imply an ordinal relationship, which is inappropriate for a nominal feature like neighborhood and can mislead the regression model into assuming a false order.
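
To make the mean-encoding idea concrete, here is a minimal sketch using only the standard library. The neighborhood names and prices are made up, and the helper `target_encode` is hypothetical; it is not a SageMaker or library API.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target observed for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories], means

# Hypothetical training rows.
neighborhoods = ["north", "south", "north", "east", "south"]
prices = [300_000, 200_000, 340_000, 250_000, 220_000]
encoded, mapping = target_encode(neighborhoods, prices)
```

The 500 categories collapse into a single numeric column. In practice the mapping should be computed on the training split only (often with smoothing or cross-fitting), because encoding with the full dataset leaks target information into validation and test rows.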

MCQeasy

A data scientist is working on a time series forecasting problem. The dataset contains a column 'sales' with occasional negative values due to returns. The model expects non-negative input. Which data preparation step should be taken?

A.Clip negative sales values to zero

B.Apply log transformation after adding a constant

C.Remove all rows with negative sales values

D.Impute negative values with the mean

AnswerA

Sets returns to zero, which is appropriate for sales data.

Why this answer

Option A is correct because clipping negative sales values to zero directly addresses the model's requirement for non-negative input while preserving the data's temporal structure. This approach is appropriate for time series forecasting where returns cause occasional negative values, as it treats returns as zero sales rather than removing or distorting the data points.

Exam trap

AWS often tests the misconception that removing or imputing negative values is safe in time series, but the trap here is that these actions break temporal dependencies and introduce bias, whereas clipping preserves the sequence structure.

How to eliminate wrong answers

Option B is wrong because applying a log transformation after adding a constant does not guarantee non-negative values; it only compresses the scale and can introduce bias, especially with negative values that require arbitrary shifting. Option C is wrong because removing all rows with negative sales values disrupts the time series continuity and can lead to loss of important temporal patterns, such as seasonality or trends. Option D is wrong because imputing negative values with the mean introduces statistical bias and distorts the underlying distribution, which is particularly problematic in time series where data points are sequentially dependent.
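
Clipping as described in Option A keeps every time step and only changes the negative entries. A minimal sketch with hypothetical daily sales figures:

```python
def clip_negative(series):
    """Replace negative values with zero, keeping every time step in place."""
    return [max(0.0, float(v)) for v in series]

# Hypothetical daily sales; negative entries represent net returns.
daily_sales = [120, 95, -15, 130, -4, 110]
clipped = clip_negative(daily_sales)
```

The output has the same length and ordering as the input, which is the property the explanation above relies on.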

MCQmedium

A company uses Amazon SageMaker Data Wrangler to create a data flow for a classification model. The dataset contains a high-cardinality categorical feature 'product_id' with 50,000 unique values. The data scientist wants to reduce dimensionality while preserving predictive power. Which approach is most effective?

A.Apply one-hot encoding to the 'product_id' column.

B.Perform target encoding by replacing each product ID with the average target value for that product.

C.Use feature hashing to map product IDs to a fixed number of buckets (e.g., 100).

D.Drop the 'product_id' column entirely.

AnswerB

Target encoding condenses information into a single numerical feature while retaining predictive signals.

Why this answer

Target encoding is the most effective approach for high-cardinality categorical features because it replaces each category with the mean of the target variable, preserving predictive signal while drastically reducing dimensionality. In SageMaker Data Wrangler, this can be implemented using the 'Encode categorical' transform with the 'Target encoding' option, which avoids the explosion of features caused by one-hot encoding and retains the relationship between product IDs and the target.

Exam trap

AWS often tests the misconception that feature hashing is always safe for high-cardinality features, but the trap here is that hash collisions can degrade model performance, making target encoding a better choice when the target variable is available and predictive.

How to eliminate wrong answers

Option A is wrong because one-hot encoding on a feature with 50,000 unique values would create 50,000 binary columns, leading to extreme dimensionality and sparsity, which degrades model performance and increases computational cost. Option C is wrong because feature hashing maps product IDs to a fixed number of buckets (e.g., 100), which can cause hash collisions and loss of information, reducing predictive power compared to target encoding. Option D is wrong because dropping the column entirely discards all predictive information contained in the product IDs, which is likely to harm model accuracy.
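
The collision problem with feature hashing (Option C) is easy to see in a small sketch. The ID format and the helper `hash_bucket` are assumptions made for illustration; MD5 is used here only as a stable, non-cryptographic bucketing function.

```python
import hashlib

def hash_bucket(value, n_buckets=100):
    """Map a string ID to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical product IDs matching the cardinality in the question.
product_ids = [f"P{i:05d}" for i in range(50_000)]
buckets = [hash_bucket(pid) for pid in product_ids]
distinct_buckets = len(set(buckets))
```

With 50,000 IDs and at most 100 buckets, many unrelated products necessarily share a bucket, so the model cannot tell them apart. That loss of information is the trade-off the explanation above weighs against target encoding.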

100

MCQmedium

A healthcare company is developing a predictive model to identify patients at risk of readmission within 30 days after discharge. The dataset contains electronic health record (EHR) data from multiple hospitals, stored as Parquet files in Amazon S3. The data includes patient demographics, diagnoses (ICD-10 codes), medications, lab results, and length of stay. A data scientist notices that the 'lab_result' column has a high number of null values (over 60%) because some tests are not applicable to all patients. Additionally, the 'diagnosis_code' column has over 10,000 unique ICD-10 codes. The company wants to build a model that complies with HIPAA and performs well. The data scientist must prepare the features efficiently using AWS services. Which combination of steps should the data scientist take? (Assume the company can use any AWS service.)

A.Use AWS Glue ETL to impute missing lab results with a value predicted from other features using a model like XGBoost, and apply count encoding to diagnosis codes based on their frequency of occurrence.

B.Replace missing lab results with the overall mean, and use a binary flag for nullness. For diagnosis codes, apply one-hot encoding after grouping codes into 20 categories based on clinical relevance.

C.Drop all records where lab_result is null, and use one-hot encoding for diagnosis codes.

D.Use Amazon SageMaker Data Wrangler's built-in 'Fill missing' with KNN imputation for lab results, and apply ordinal encoding to diagnosis codes based on the order of ICD-10 chapters.

AnswerA

Predictive imputation leverages other features to estimate missing values, retaining data. Count encoding reduces the cardinality of diagnosis codes.

Why this answer

Option A is correct because it uses AWS Glue ETL to impute missing lab results with a predictive model (XGBoost), which is appropriate for high missingness (>60%) where simple imputation would bias the model, and applies count encoding to the high-cardinality diagnosis codes (10,000+ unique values) to avoid the dimensionality explosion of one-hot encoding while preserving frequency information. This approach balances HIPAA compliance (data stays within AWS) with model performance.

Exam trap

The trap here is that candidates often choose simple mean imputation (Option B) or dropping rows (Option C) without considering the impact of high missingness on bias and data loss, or they overcomplicate encoding (Option D) without recognizing that ordinal encoding implies a false order for categorical codes.

How to eliminate wrong answers

Option B is wrong because replacing 60%+ missing lab results with the overall mean ignores the non-random missingness (tests not applicable to all patients) and introduces severe bias, and grouping 10,000+ ICD-10 codes into only 20 categories based on clinical relevance loses granularity and may not reflect readmission risk patterns. Option C is wrong because dropping all records with null lab results would discard over 60% of the data, leading to massive data loss and a non-representative dataset, and one-hot encoding 10,000+ diagnosis codes creates an unmanageable feature space (sparse matrix) that degrades model performance. Option D is wrong because KNN imputation on a dataset with >60% missingness in the same column is computationally expensive and unreliable (neighbors themselves may have missing values), and ordinal encoding based on ICD-10 chapter order imposes an arbitrary ordinal relationship that does not reflect clinical risk or readmission likelihood.

101

MCQhard

A data scientist is using Amazon SageMaker Data Wrangler for feature engineering on a large dataset stored in S3. The dataset has a column 'ProductCategory' with 1000+ unique values. To reduce dimensionality, they want to group categories that appear less than 1% of the time into an 'Other' category. Which Data Wrangler transform should they use?

A.Group similar categories

B.Custom transform with Python

C.Handle rare values

D.One-hot encode with threshold

AnswerC

This built-in transform can group categories below a frequency threshold into an 'Other' value.

Why this answer

The 'Handle rare values' transform in SageMaker Data Wrangler is specifically designed to group infrequent category values into a single 'Other' bucket based on a frequency threshold (e.g., less than 1%). This directly addresses the need to reduce dimensionality by consolidating rare categories without requiring custom code or manual grouping.

Exam trap

The trap here is that candidates may confuse the 'Handle rare values' transform with the 'One-hot encode with threshold' transform, mistakenly thinking the threshold in one-hot encoding serves the same purpose as grouping rare categories, when in fact it limits the number of one-hot columns created, not the grouping of infrequent values.

How to eliminate wrong answers

Option A is wrong because 'Group similar categories' is a manual grouping transform that requires the user to explicitly define which categories to combine, not an automated threshold-based grouping of rare values. Option B is wrong because while a custom Python transform could technically achieve this, it is unnecessary and less efficient when a built-in, optimized transform ('Handle rare values') exists for this exact purpose. Option D is wrong because 'One-hot encode with threshold' applies to one-hot encoding (creating binary columns) and its threshold controls the maximum number of one-hot features, not the grouping of rare categories into an 'Other' bucket.
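
The effect of grouping rare categories can be sketched in plain Python. This only illustrates the logic of a frequency threshold; it is not how Data Wrangler implements the transform, and the function name `group_rare` and the sample categories are hypothetical.

```python
from collections import Counter

def group_rare(values, min_fraction=0.01, other_label="Other"):
    """Replace categories seen in less than min_fraction of rows with other_label."""
    counts = Counter(values)
    total = len(values)
    keep = {v for v, c in counts.items() if c / total >= min_fraction}
    return [v if v in keep else other_label for v in values]

# Hypothetical column: two frequent categories and two rare ones (0.3% and 0.2%).
categories = ["books"] * 600 + ["toys"] * 395 + ["maps"] * 3 + ["kites"] * 2
grouped = group_rare(categories)
```

Here the two rare categories fall below the 1% threshold and are merged, so the column ends up with three distinct values instead of four.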

102

MCQeasy

A data engineer is preparing a large dataset of 10 TB for ML training on Amazon SageMaker. The data is stored in Amazon S3 as CSV files. To reduce training time and cost, the engineer wants to use a columnar format that is optimized for analytical queries. Which format should the engineer convert the data to?

A.XML

B.Parquet

C.ORC

D.JSON Lines

AnswerB

Parquet is a columnar format that speeds up data access and reduces storage costs.

Why this answer

Parquet is a columnar storage format that is highly optimized for analytical queries and is natively supported by Amazon SageMaker for efficient data loading. By converting the 10 TB of CSV data to Parquet, the data engineer can reduce I/O and storage costs because columnar formats allow SageMaker to read only the columns needed for training, rather than scanning entire rows. This directly addresses the goal of reducing training time and cost for ML workloads.

Exam trap

AWS often tests the distinction between columnar formats (Parquet vs. ORC) by making both appear correct, but the trap here is that ORC is tightly coupled with Hive and less commonly used with SageMaker, while Parquet is the de facto standard for AWS-native ML and analytics services.

How to eliminate wrong answers

Option A (XML) is wrong because XML is a verbose, row-oriented text format that is not optimized for analytical queries; it would increase storage size and I/O overhead, making training slower and more expensive. Option C (ORC) is wrong because, although it is also a columnar format optimized for analytical queries, it is primarily designed for and tightly integrated with the Apache Hive ecosystem, whereas Parquet is the more universally supported and recommended format for Amazon SageMaker and AWS analytics services. Option D (JSON Lines) is wrong because it is a row-oriented, text-based format that lacks the compression and columnar pruning benefits of Parquet, leading to higher storage costs and slower data access for ML training.

103

MCQeasy

A data scientist is preparing a dataset for a linear regression model. The dataset has a few missing values in a numerical feature with a normal distribution and no outliers. Which imputation method is most appropriate?

A.Impute with mode

B.Impute with mean

C.Impute with median

D.Drop rows with missing values

AnswerB

Mean is appropriate for normally distributed numerical data without outliers.

Why this answer

Option B is correct because mean imputation is suitable for normally distributed data without outliers. Option A (mode) is wrong because the mode is intended for categorical data. Option C (median) is robust to outliers, but that robustness is not needed here. Option D (drop rows) is wrong because it reduces the sample size unnecessarily.
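
Mean imputation itself is a one-liner once the observed values are separated from the gaps. The sketch below represents missing entries as `None`; the column values are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Fill None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical, roughly symmetric feature with two missing entries.
heights_cm = [170.0, None, 165.0, 175.0, None]
filled = impute_mean(heights_cm)
```

As with any fitted statistic, the fill value should come from the training split and then be reused for validation and test data.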

104

Multi-Selecthard

A data scientist is working with a dataset containing customer demographics and purchase history. The dataset includes categorical variables with high cardinality (e.g., ZIP code, product ID). The data scientist wants to perform feature engineering to improve model performance. Which THREE feature engineering techniques should the data scientist consider? (Choose three.)

Select 3 answers

A.Principal Component Analysis (PCA) to reduce dimensionality of numerical features.

B.Domain-specific feature engineering based on business rules.

C.Target encoding for high-cardinality categorical variables.

D.Frequency encoding to represent categories by their occurrence count.

E.One-hot encoding all categorical features.

AnswersA, C, D

PCA can reduce noise and multicollinearity.

Why this answer

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated numerical features into a smaller set of uncorrelated principal components, capturing the maximum variance in the data. This is correct because the dataset includes numerical features (e.g., purchase amounts, age) where PCA can reduce noise and multicollinearity, improving model performance without losing critical information.

Exam trap

AWS often tests the distinction between techniques that are universally applicable (like PCA for numerical features) versus those that are specifically designed to handle high-cardinality categorical variables (like target encoding and frequency encoding), tempting candidates to choose one-hot encoding without considering its impracticality for high cardinality.

105

MCQmedium

A data engineer needs to join two large datasets from Amazon S3: one containing customer demographics and another containing transaction history. The join key is `customer_id`. To minimize data shuffling and improve performance, the engineer decides to use Amazon SageMaker Processing with Spark. Which configuration should the engineer use?

A.Use a bucketed join with the same number of buckets

B.Broadcast join the larger dataset

C.Use a bucketed join with the same number of buckets and co-location

D.Use a repartition on the join key before join

AnswerC

Bucketing with co-location allows Spark to perform the join without shuffling.

Why this answer

Option C is correct because bucketed joins with the same number of buckets and co-location ensure that data with the same `customer_id` hash is physically stored together on the same nodes. This eliminates the need for expensive shuffles during the join, as Spark can perform the join locally within each executor, dramatically improving performance for large datasets in SageMaker Processing.

Exam trap

The trap here is that candidates assume bucketing alone (same number of buckets) is sufficient, but without co-location, Spark still performs a shuffle to align the data, so both conditions are required for a shuffle-free join.

How to eliminate wrong answers

Option A is wrong because bucketed joins require both datasets to have the same number of buckets AND co-location (data physically stored together); without co-location, Spark still shuffles data to align partitions, negating the performance benefit. Option B is wrong because broadcast join is only efficient when one dataset is small enough to fit in memory (typically <100 MB); the transaction history dataset is large, so broadcasting it would cause out-of-memory errors or severe performance degradation. Option D is wrong because repartitioning on the join key before the join adds an extra shuffle step, increasing overhead rather than reducing it; bucketing with co-location avoids shuffles entirely.

106

MCQhard

A data scientist is preparing data for a regression model. The target variable has a skewed distribution. The scientist wants to apply a log transformation to make it closer to normal. Which step should be taken before applying log transformation?

A.Standardize the data to zero mean and unit variance

B.Remove outliers using IQR

C.Ensure all values are positive

D.Center the data by subtracting the mean

AnswerC

Log is undefined for zero and negative values. If present, add a constant or use other transformations.

Why this answer

The log transformation is defined only for positive real numbers; applying it to zero or negative values results in undefined or complex outputs. Therefore, before applying a log transformation, you must ensure all values in the target variable are positive, typically by adding a constant (e.g., log(x + 1)) if zeros are present. This step is a fundamental data preparation requirement for log transformations in regression modeling.

Exam trap

AWS often tests the assumption that candidates will confuse data normalization or centering with the domain restriction of the log function, leading them to pick standardization or mean-centering as a preparatory step.

How to eliminate wrong answers

Option A is wrong because standardizing to zero mean and unit variance (z-score normalization) does not guarantee all values become positive; it centers data around zero, which can produce negative values, making log transformation invalid. Option B is wrong because removing outliers using IQR is not a prerequisite for log transformation; while outliers can affect model performance, the log transformation itself can help mitigate skewness and reduce the influence of outliers, and removing them beforehand is an optional, separate step. Option D is wrong because centering data by subtracting the mean shifts values to have a mean of zero, which inevitably introduces negative values, directly contradicting the requirement for positive inputs for log transformation.
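
The domain check described above can be made explicit in code. This sketch uses `math.log1p`, which computes log(1 + x) and therefore tolerates zeros; the sample prices are hypothetical, and rejecting negatives with an exception is one possible policy among several.

```python
import math

def log_transform(values):
    """Apply log(1 + x) after checking that no value is negative."""
    if min(values) < 0:
        raise ValueError("expected non-negative values; shift or clip first")
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed target values, including a zero.
house_prices = [0.0, 50_000.0, 120_000.0, 2_500_000.0]
transformed = log_transform(house_prices)
```

Predictions made on the transformed scale can be mapped back with `math.expm1`, the inverse of `math.log1p`.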

107

MCQhard

A data scientist creates a feature group as shown in the exhibit. When ingesting data with an 'age' column of integer values, the ingestion fails. What is the most likely cause?

A.The role does not have permissions to write to the feature store.

B.The `age` feature type should be `Integral`, not `String`.

C.The `OnlineStoreConfig` must include a `SecurityConfig`.

D.The `EventTimeFeatureName` is incorrectly spelled.

AnswerB

The feature type must match the ingested data type.

Why this answer

Option B is correct because the feature group definition specifies the 'age' column as a `String` type, but the ingested data contains integer values. Amazon SageMaker Feature Store requires that the data types of ingested records match the schema defined in the feature group. When a mismatch occurs, such as providing an integer for a string field, the ingestion fails with a type conversion error.

Exam trap

AWS often tests the distinction between schema definition and actual data types, trapping candidates who overlook that the feature group schema must exactly match the ingested data's types, not just the column names.

How to eliminate wrong answers

Option A is wrong because the failure is tied to the type of the ingested 'age' values, not to permissions; a permissions problem would typically surface as an access-denied error at the API call level, not during data parsing. Option C is wrong because `SecurityConfig` is not a required field in `OnlineStoreConfig`; it is an optional setting that supplies a KMS key for encrypting the online store, and enabling the online store only requires the `EnableOnlineStore` boolean. Option D is wrong because the `EventTimeFeatureName` is spelled correctly as 'EventTime' in the exhibit, and a misspelling would cause a different error (e.g., 'InvalidParameterValue') rather than a data type mismatch.

108

Multi-Selectmedium

A dataset for binary classification has a severe class imbalance (5% positive class). Which two data preparation techniques can help address this imbalance? (Choose two.)

Select 2 answers

A.Remove outliers from the minority class

B.Apply PCA to reduce dimensionality

C.Use stratified splitting for train/test sets

D.Undersample the majority class

E.Oversample the minority class using SMOTE

AnswersD, E

Reduces majority class size to balance with minority class.

Why this answer

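Option D is correct because undersampling the majority class reduces the number of instances from the dominant class, helping to balance the dataset and prevent the model from being biased toward the majority class. This technique is straightforward and can be effective when the majority class has redundant or noisy samples, though it risks losing valuable information. Option E is also correct because SMOTE oversamples the minority class by generating synthetic examples interpolated between existing minority instances, raising the minority share without discarding any majority data.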

Exam trap

AWS often tests the distinction between techniques that change the dataset distribution (like undersampling and oversampling) versus those that only affect model training or evaluation (like stratified splitting), leading candidates to mistakenly select stratified splitting as a balancing technique.
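
Random undersampling, the technique in Option D, can be sketched in a few lines of standard-library Python. The function name, the `ratio` parameter, and the toy rows are assumptions for illustration.

```python
import random

def undersample_majority(rows, label_key, majority_label, ratio=1.0, seed=0):
    """Keep all minority rows and a random subset of majority rows.

    ratio is the desired majority:minority count after sampling.
    """
    rng = random.Random(seed)
    majority = [r for r in rows if r[label_key] == majority_label]
    minority = [r for r in rows if r[label_key] != majority_label]
    keep = min(len(majority), int(len(minority) * ratio))
    balanced = rng.sample(majority, keep) + minority
    rng.shuffle(balanced)
    return balanced

# Toy data with the 5% positive rate from the question: 95 negatives, 5 positives.
rows = [{"x": i, "y": 0} for i in range(95)] + [{"x": i, "y": 1} for i in range(5)]
balanced = undersample_majority(rows, "y", majority_label=0)
```

Resampling of this kind is normally applied to the training split only, so that validation and test sets keep the real class distribution.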

109

MCQmedium

A company collects sensor data from IoT devices. The data arrives with missing timestamps due to network issues. For anomaly detection, the engineer needs to create features that capture rolling statistics over fixed windows. Which data preprocessing step is essential before feature generation?

A.Remove missing timestamps

B.Resample data to a fixed frequency

C.Sort data by device ID

D.Impute missing values with forward fill

AnswerB

Resampling ensures consistent time intervals, which is required for rolling windows.

Why this answer

Resampling the data to a fixed frequency is essential because rolling window statistics require a consistent time index to compute accurate aggregations over fixed windows. Without a uniform timestamp grid, the window boundaries become ambiguous and the resulting features will be misaligned or incomplete, undermining the anomaly detection model.

Exam trap

AWS often tests the distinction between handling missing values (imputation) and handling irregular timestamps (resampling), leading candidates to confuse forward-fill as a solution for time alignment when it only addresses missing data points, not the underlying time index irregularity.

How to eliminate wrong answers

Option A is wrong because simply removing missing timestamps discards valuable data and does not address the need for a consistent time index; the remaining timestamps remain irregularly spaced. Option C is wrong because sorting by device ID organizes data by device but does not fix the irregular timestamp spacing required for fixed-window rolling statistics. Option D is wrong because forward-fill imputation fills missing values but does not create a uniform time grid; the timestamps themselves remain irregular, so rolling windows cannot be applied consistently.
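
The difference between a fixed time grid and raw irregular timestamps can be sketched as follows. This is a standard-library illustration, not a production resampler; the function names, the one-minute step, and the readings are assumptions.

```python
from datetime import datetime, timedelta

def resample_mean(readings, start, step, n_bins):
    """Average readings into n_bins fixed-width bins; empty bins become None."""
    buckets = [[] for _ in range(n_bins)]
    for ts, value in readings:
        idx = (ts - start) // step  # timedelta // timedelta gives an int
        if 0 <= idx < n_bins:
            buckets[idx].append(value)
    return [sum(b) / len(b) if b else None for b in buckets]

def rolling_mean(series, window):
    """Mean over the last `window` grid slots, skipping None; None until the window is full."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
            continue
        vals = [v for v in series[i + 1 - window : i + 1] if v is not None]
        out.append(sum(vals) / len(vals) if vals else None)
    return out

start = datetime(2024, 1, 1, 0, 0)
readings = [  # irregular arrivals (made-up values)
    (start + timedelta(seconds=10), 1.0),
    (start + timedelta(seconds=50), 3.0),
    (start + timedelta(minutes=3, seconds=5), 5.0),
]
grid = resample_mean(readings, start, timedelta(minutes=1), n_bins=4)
rolled = rolling_mean(grid, window=2)
```

Because every slot in `grid` covers exactly one minute, a window of two slots always means two minutes, which is what fixed-window rolling statistics rely on.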

110

Multi-Selecteasy

A data engineer is using AWS Glue to prepare a dataset for machine learning. The dataset has several columns with outliers. The engineer wants to detect and handle outliers in a scalable manner. Which TWO approaches should the engineer consider? (Select TWO.)

Select 2 answers

A.Manually remove outliers by inspecting the data in Amazon S3.

B.Train a neural network to identify anomalies and remove them.

C.Use pandas in a SageMaker notebook to calculate z-scores and filter outliers.

D.Use AWS Glue DynamicFrame with Apache Spark to compute interquartile range (IQR) and filter outliers.

E.Use Amazon SageMaker Data Wrangler to apply an outlier detection transform.

AnswersD, E

Spark can handle large-scale data and IQR is a standard method.

Why this answer

Option D is correct because AWS Glue DynamicFrames, built on Apache Spark, provide a scalable, distributed computing environment to compute statistical measures like the interquartile range (IQR) across large datasets. This allows the engineer to programmatically filter outliers without manual intervention, leveraging Spark's parallel processing for efficient handling of data at scale.

Exam trap

The trap here is that candidates may assume that only a single AWS service can handle outlier detection at scale, but the question requires selecting two approaches, and both Glue DynamicFrames and SageMaker Data Wrangler are valid, scalable, and managed AWS solutions for this task.
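
The IQR rule itself is simple; the sketch below shows it on a small in-memory list using the standard library. In a Glue Spark job the quartiles would more likely come from a distributed call such as the DataFrame method `approxQuantile`; the values and the 1.5 multiplier here are illustrative assumptions.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def filter_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

values = [10, 12, 11, 13, 12, 11, 10, 300]  # 300 is an injected outlier
cleaned = filter_outliers(values)
```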

111

MCQeasy

A data scientist is preparing a dataset for a binary classification model to predict customer churn. The dataset contains a timestamp column 'signup_date' that is not relevant for the prediction. What is the most appropriate action to handle this column?

A.Apply one-hot encoding to the year, month, and day components.

B.Convert the timestamp to a numeric feature (e.g., days since signup) and include it.

C.Use leave-one-out encoding based on the target variable.

D.Drop the 'signup_date' column from the dataset.

AnswerD

Irrelevant columns should be removed to prevent noise.

Why this answer

Option D is correct because the 'signup_date' column is explicitly stated as not relevant for the prediction. In binary classification for customer churn, including an irrelevant timestamp can introduce noise, increase dimensionality, and potentially cause overfitting. Dropping the column is the most appropriate action to maintain model simplicity and focus on predictive features.

Exam trap

AWS often tests the misconception that all timestamp data must be transformed into numeric features, but the key is to first assess relevance—if the column is explicitly not relevant, dropping it is the correct action, not engineering features from it.

How to eliminate wrong answers

Option A is wrong because one-hot encoding the year, month, and day components would create multiple sparse features from an irrelevant column, adding unnecessary complexity and potentially misleading the model with temporal patterns that have no causal relationship with churn. Option B is wrong because converting the timestamp to a numeric feature like 'days since signup' would still retain irrelevant temporal information, which could introduce a spurious correlation or bias, especially if the dataset has a time-based split that leaks future information. Option C is wrong because leave-one-out encoding based on the target variable would leak target information into the feature, causing data leakage and overfitting, as the encoding uses the target value of other rows to encode the current row, which is inappropriate for an irrelevant column.

112

MCQeasy

A data scientist is preparing a dataset for binary classification using SageMaker. The dataset has 100 features and 10,000 rows, but the target variable is highly imbalanced (95% negative, 5% positive). Which technique should the data scientist apply during data preparation to address the imbalance?

A.Oversampling the minority class by duplicating examples

B.Collect more data to match the number of samples in both classes

C.Random undersampling of the majority class

D.Apply SMOTE to generate synthetic samples for the minority class

AnswerD

SMOTE creates synthetic examples along the line segments joining each minority-class sample to its nearest minority-class neighbors, addressing the imbalance.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) is the most appropriate technique because it generates synthetic samples for the minority class by interpolating between existing minority instances, which avoids the overfitting risk of simple duplication (oversampling) and the information loss from undersampling. In SageMaker, SMOTE can be applied during data preparation using libraries like imbalanced-learn before training, or via SageMaker Data Wrangler's built-in transform, making it a robust choice for handling class imbalance without discarding data.

Exam trap

AWS often tests the distinction between oversampling by duplication and synthetic oversampling (SMOTE), where candidates mistakenly choose simple duplication (Option A) because they think 'more data is always better,' failing to recognize that SMOTE generates diverse synthetic samples to reduce overfitting.

How to eliminate wrong answers

Option A is wrong because oversampling the minority class by duplicating examples leads to overfitting, as the model sees the same exact data points repeatedly, which does not introduce new variance and can cause poor generalization. Option B is wrong because collecting more data to match the number of samples in both classes is often impractical, costly, or impossible in real-world scenarios, and the question specifically asks for a technique to apply during data preparation, not a data collection strategy. Option C is wrong because random undersampling of the majority class discards potentially valuable data, which can lead to loss of important patterns and reduce model performance, especially when the dataset is already limited to 10,000 rows.
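
To make the contrast with plain duplication concrete, here is a simplified SMOTE-style interpolation in standard-library Python. It is a sketch of the idea only, not the reference algorithm; real projects would typically use the `SMOTE` class from imbalanced-learn. The function name, `k`, and the sample points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward a random near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_like(minority, n_new=6)
```

Each synthetic point lies between two real minority points, so the model sees new feature combinations rather than exact copies.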

113

Multi-Selectmedium

A company is preparing data for a time-series forecasting model. The data is collected from IoT sensors at irregular intervals. Which TWO steps are necessary to prepare the data? (Choose 2.)

Select 2 answers

A.Normalize the data to a 0-1 range

B.Resample the data to a fixed frequency

C.Fill missing values using forward fill or interpolation

D.Remove outlier data points

E.Encode categorical features

AnswersB, C

Resampling creates regular time intervals required by most forecasting models.

Why this answer

Time-series forecasting models require data at consistent time intervals to capture temporal patterns and seasonality. Resampling the irregular IoT sensor data to a fixed frequency (e.g., every 5 minutes) creates a uniform time index, which is essential for algorithms like ARIMA, Prophet, or LSTM. This step ensures the model can learn from a structured sequence rather than being confused by variable time gaps.

Exam trap

AWS often tests the misconception that data normalization or outlier removal is a universal first step, but for time-series with irregular intervals, the critical preparatory steps are resampling and handling missing values to create a regular time grid.
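
Once the data sits on a fixed grid, the empty slots still need values. The two fill strategies named in Option C can be sketched like this, assuming the series has already been resampled and uses `None` for empty slots (both helper names are made up for illustration).

```python
def forward_fill(series):
    """Replace each None with the most recent non-None value (leading Nones stay)."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def linear_interpolate(series):
    """Fill interior None runs on an evenly spaced grid by linear interpolation."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out

grid = [2.0, None, None, 5.0, None]  # values already resampled to a fixed frequency
ffilled = forward_fill(grid)
interpolated = linear_interpolate(grid)
```

Forward fill holds the last reading constant, while interpolation assumes a smooth change between readings; note that interpolation leaves the trailing gap empty because there is no later value to interpolate toward.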

114

Multi-Selecthard

A data engineer is optimizing Amazon Athena queries on large datasets stored in S3 for machine learning data preparation. Which THREE practices improve query performance?

Select 3 answers

A.Partition the data by a frequently filtered column, such as date

B.Use uncompressed CSV files for simplicity

C.Partition the data by every column to maximize filtering

D.Store data in columnar formats like Parquet or ORC

E.Compress the data with Snappy or gzip

AnswersA, D, E

Partition pruning limits scanned data.

Why this answer

Partitioning by a frequently filtered column, such as date, allows Athena to use partition pruning. When a query includes a filter on the partition column, Athena can skip entire directories of data in S3, drastically reducing the amount of data scanned and improving query performance while also lowering cost.

Exam trap

AWS often tests the misconception that more partitions always improve performance, but in reality, over-partitioning leads to metastore overhead and small file problems that degrade query performance.
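
Partition pruning can be pictured with Hive-style `dt=` key prefixes. The sketch below is a pure-Python imitation of the idea, not Athena's engine; the table prefix, file names, and helper functions are hypothetical.

```python
def partition_key(table_prefix, dt, filename):
    """Build a Hive-style key such as sales/dt=2024-01-01/part-0.parquet."""
    return f"{table_prefix}/dt={dt}/{filename}"

def prune(keys, table_prefix, wanted_dates):
    """Keep only keys under the requested dt= partitions, mimicking partition pruning."""
    prefixes = tuple(f"{table_prefix}/dt={d}/" for d in wanted_dates)
    return [k for k in keys if k.startswith(prefixes)]

keys = [
    partition_key("sales", d, "part-0.parquet")
    for d in ("2024-01-01", "2024-01-02", "2024-01-03")
]
# A filter on dt = '2024-01-02' only needs the objects under that one prefix.
scanned = prune(keys, "sales", ["2024-01-02"])
```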

115

MCQhard

A financial services company is building a fraud detection model using transactional data stored in Amazon S3. The data includes transaction_id, timestamp, amount, merchant_category, and fraud_label (0/1). The data is collected from multiple sources and has inconsistencies: timestamps are in different timezones (UTC and EST), merchant categories are sometimes misspelled (e.g., 'RESTAURANT', 'Restaurant', 'restaurant'), and the fraud_label is missing for about 5% of records. The data science team uses AWS Glue for ETL. They need to prepare a clean dataset for training. The final dataset must have consistent timestamps in UTC, standardized merchant categories, and no missing fraud labels. The team also wants to minimize data loss. Which set of actions should the team take?

A.Use AWS Glue to convert all timestamps to UTC, apply a mapping function to correct merchant category misspellings to a standard list, and drop records with missing fraud_label.

B.Use AWS Glue to convert timestamps to UTC, use a fuzzy matching algorithm to standardize merchant categories, and replace missing fraud_label with the mean value (0.05).

C.Use AWS Glue to convert timestamps to UTC, correct merchant categories by mapping known misspellings to correct names, and drop records with missing fraud_label.

D.Use AWS Glue to convert timestamps to UTC, use a mapping table to group similar merchant categories (e.g., all restaurant variants to 'Restaurant'), and impute missing fraud_label using mode (most frequent value).

AnswerD

Mode imputation preserves the majority class and avoids data loss, while timestamp conversion and category mapping clean the data correctly.

Why this answer

Option D is correct because it preserves data by imputing missing fraud labels using the mode (most frequent value), which is appropriate for a binary classification label where the majority class is likely 0. It also standardizes timestamps to UTC and uses a mapping table to group merchant category variants, ensuring consistency without data loss. Dropping records (as in A and C) would reduce the dataset size, and imputing with the mean (as in B) is invalid for a categorical label.

Exam trap

The trap here is that candidates often choose to drop missing values (options A and C) to avoid imputation complexity, not realizing that minimizing data loss is explicitly stated as a requirement, and that mode imputation is a standard technique for categorical labels in ML pipelines.

How to eliminate wrong answers

Option A is wrong because dropping records with missing fraud_label causes unnecessary data loss (5% of records) when imputation is feasible, and the mapping function for merchant categories is vague and not standardized. Option B is wrong because replacing missing fraud_label with the mean (0.05) is inappropriate for a binary categorical variable; mean imputation can introduce fractional values that are meaningless for classification. Option C is wrong because dropping records with missing fraud_label again causes data loss, and correcting merchant categories by mapping known misspellings is less robust than using a mapping table to group all variants, which better handles unseen misspellings.
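
The three cleaning steps in Option D can be sketched in plain Python. This illustrates the logic only, not Glue code; the fixed `EST` offset (which ignores daylight saving), the mapping table contents, and the helper names are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5), "EST")  # fixed offset; ignores daylight saving

def to_utc(naive_ts, source_tz):
    """Attach the source zone to a naive timestamp and convert it to UTC."""
    return naive_ts.replace(tzinfo=source_tz).astimezone(timezone.utc)

def standardize_category(raw, mapping):
    """Normalize case and whitespace, then look up the canonical name."""
    key = raw.strip().lower()
    return mapping.get(key, key)

def impute_mode(labels):
    """Replace None with the most frequent observed label."""
    mode = Counter(l for l in labels if l is not None).most_common(1)[0][0]
    return [mode if l is None else l for l in labels]

CATEGORY_MAP = {"restaurant": "Restaurant", "resturant": "Restaurant"}  # hypothetical table

utc_ts = to_utc(datetime(2024, 1, 1, 12, 0), EST)
category = standardize_category(" RESTAURANT ", CATEGORY_MAP)
labels = impute_mode([0, 0, 1, None])
```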

116

MCQeasy

A data scientist is preparing a dataset for training a binary classification model. The dataset has 100,000 rows and 50 features. The target variable is imbalanced, with only 5% positive cases. Which technique should the data scientist apply to address the class imbalance BEFORE training?

A.Principal Component Analysis (PCA) dimensionality reduction

B.Random oversampling of the minority class

C.Standard scaling of numerical features

D.One-hot encoding of categorical variables

AnswerB

Random oversampling is a valid technique to balance classes by replicating minority samples.

Why this answer

Random oversampling of the minority class (Option B) directly addresses the class imbalance by duplicating examples from the positive class until the class distribution is more balanced. This prevents the binary classification model from being biased toward the majority class, which is critical when only 5% of the 100,000 rows are positive cases. Oversampling is applied before training to ensure the model sees sufficient minority examples during learning.

Exam trap

AWS often tests whether candidates confuse data preprocessing techniques (scaling, encoding, dimensionality reduction) with methods that directly modify the class distribution, leading them to pick a plausible but irrelevant option like PCA or scaling.

How to eliminate wrong answers

Option A is wrong because PCA dimensionality reduction reduces the number of features but does not alter the class distribution; it would not fix the 5% imbalance and could even discard variance useful for separating the minority class. Option C is wrong because standard scaling normalizes numerical feature ranges but has no effect on the ratio of positive to negative samples; it addresses feature magnitude, not class imbalance. Option D is wrong because one-hot encoding converts categorical variables into binary columns but does not change the target variable's distribution; it is a preprocessing step for feature representation, not for balancing classes.

117

Multi-Selecthard

A data scientist is cleaning a text dataset for natural language processing. The raw data contains HTML tags, URLs, and special characters. Which THREE steps should be taken to preprocess the text data? (Choose 3.)

Select 3 answers

A.Convert all text to lowercase

B.Encode the text using one-hot encoding

C.Remove HTML tags using a regular expression

D.Perform stemming or lemmatization

E.Remove stop words

AnswersA, C, D

Lowercasing standardizes text and reduces vocabulary size.

Why this answer

Converting all text to lowercase (Option A) is a standard text normalization step in NLP preprocessing. It reduces the vocabulary size by treating words like 'Apple' and 'apple' as the same token, which helps downstream models avoid treating case variations as distinct features. This is typically done early in the pipeline before tokenization or vectorization.

Exam trap

AWS often tests the distinction between preprocessing steps that clean raw data (like removing HTML tags and normalizing case) versus later feature engineering steps (like encoding or stop word removal), causing candidates to mistakenly select stop word removal as a cleaning step when it is actually a filtering step applied after tokenization.
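
The cleaning steps (tag removal, URL removal, lowercasing, special-character removal) can be sketched with the standard library's `re` and `html` modules. The regular expressions are simple heuristics chosen for illustration; heavily nested or malformed HTML is better handled by a real parser. Stemming or lemmatization is not shown because it needs a third-party library.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")

def clean_text(raw):
    """Strip tags, entities, URLs, case, and special characters from raw text."""
    text = TAG_RE.sub(" ", raw)         # strip HTML tags
    text = html.unescape(text)          # decode entities such as &amp;
    text = URL_RE.sub(" ", text)        # drop URLs
    text = text.lower()                 # normalize case
    text = NON_ALNUM_RE.sub(" ", text)  # drop special characters
    return " ".join(text.split())       # collapse whitespace

sample = (
    "<p>Visit <a href='https://example.com'>our site</a> "
    "at https://example.com/deals &amp; SAVE 20%!</p>"
)
cleaned = clean_text(sample)
```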

118

MCQeasy

A data scientist is preparing a dataset for a binary classification model. The dataset has 10,000 records with 100 features. The target variable is imbalanced, with 95% negative class and 5% positive class. Which data preparation step should the data scientist take to address the imbalance before training?

A.Normalize all features to a 0-1 range

B.Use cross-validation to handle imbalance

C.Remove enough instances of the negative class to achieve balance

D.Apply SMOTE to oversample the positive class

AnswerD

SMOTE generates synthetic samples for the minority class, effectively balancing the dataset.

Why this answer

Option D is correct because SMOTE (Synthetic Minority Oversampling Technique) generates synthetic samples for the minority class (positive class, 5%) by interpolating between existing minority instances. This addresses the severe class imbalance (95:5) without discarding data, allowing the model to learn decision boundaries for the minority class more effectively than simple duplication.

Exam trap

AWS often tests the misconception that any data preprocessing step (like normalization or cross-validation) can fix class imbalance, when in fact only resampling techniques (oversampling, undersampling, or synthetic generation) directly alter the class distribution.

How to eliminate wrong answers

Option A is wrong because normalizing features to a 0-1 range addresses feature scaling, not class imbalance; it does not change the class distribution. Option B is wrong because cross-validation is a model evaluation technique that helps assess performance but does not modify the training data to correct imbalance; it would still train on the imbalanced dataset. Option C is wrong because removing instances of the negative class (random undersampling) discards potentially valuable data, which can lead to loss of information and reduced model performance, especially when the negative class represents 95% of the data.

119

MCQhard

A machine learning team is processing a large dataset in Amazon SageMaker using a processing job. The data is stored in S3 in CSV format. The team wants to split the data into training, validation, and test sets (70/20/10) while ensuring that the distribution of a categorical feature 'region' is preserved across splits. Which SageMaker SDK method should they use to write the output?

A.Use sagemaker.sklearn.processing.SKLearnProcessor with a script that uses sklearn's StratifiedShuffleSplit

B.Use sagemaker.xgboost.processing.XGBoostProcessor with a script that uses random split

C.Use sagemaker.processing.Processor.run() with a custom script that uses train_test_split

D.Use sagemaker.processing.FrameworkProcessor with a script that uses pandas.sample

AnswerA

StratifiedShuffleSplit ensures the 'region' distribution is maintained across splits.

Why this answer

Option A is correct because `SKLearnProcessor` allows you to run a custom Python script that uses `sklearn.model_selection.StratifiedShuffleSplit`, which preserves the distribution of the categorical 'region' feature across the training, validation, and test splits. This is the only option whose script performs a stratified split within a SageMaker processing job, achieving the 70/20/10 ratio while keeping the proportion of each region the same in every split.

Exam trap

The trap here is that candidates often confuse generic processing methods (like `Processor.run()` or `FrameworkProcessor`) with the specific processor that supports stratified splitting, or they assume `train_test_split` with a random state is sufficient for preserving categorical distributions, ignoring the need for stratification.

How to eliminate wrong answers

Option B is wrong because its script uses a random split, which does not preserve the 'region' distribution; choosing `XGBoostProcessor` as the processor does nothing to change that. Option C is wrong because `Processor.run()` is a generic method that executes a processing job and provides no built-in stratified splitting; `train_test_split` without its `stratify` argument performs a purely random split, not preserving the 'region' distribution. Option D is wrong because `FrameworkProcessor` is a generic base class for custom frameworks, and `pandas.sample` performs random sampling without stratification, failing to maintain the categorical feature distribution across splits.
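
What "stratified on region" means can be shown without any libraries. The sketch below is a hand-rolled illustration of the idea, not the scikit-learn implementation that the processing script in Option A would call; the function name and toy rows are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split rows into len(fractions) sets, preserving the distribution of row[key]."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    splits = [[] for _ in fractions]
    for members in groups.values():
        rng.shuffle(members)
        start, cumulative = 0, 0.0
        for i, frac in enumerate(fractions):
            cumulative += frac
            # The last split takes the remainder so no row is dropped.
            is_last = i == len(fractions) - 1
            end = len(members) if is_last else round(cumulative * len(members))
            splits[i].extend(members[start:end])
            start = end
    return splits

rows = [{"region": "us", "id": i} for i in range(100)]
rows += [{"region": "eu", "id": i} for i in range(50)]
train, val, test = stratified_split(rows, "region")
```

Each region is shuffled and cut separately, so the two-to-one ratio of "us" to "eu" rows holds in all three splits.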

120

MCQmedium

A healthcare company is building a model to predict patient readmission rates. The dataset contains a mix of numeric features (age, blood pressure, lab test results) and categorical features (gender, diagnosis code, hospital department). The dataset has 2 million rows. The data is stored in an Amazon S3 bucket, and they use AWS Glue to catalog and preprocess the data. The data scientist notices that the 'diagnosis_code' column has 10,000 unique codes, and 20% of the rows have missing values for 'blood_pressure'. They plan to use a SageMaker built-in XGBoost model. For optimal model performance, which preprocessing steps should they apply using AWS Glue ETL?

A.Impute missing 'blood_pressure' with the mean, and apply label encoding to 'diagnosis_code'.

B.Impute missing 'blood_pressure' with median, and apply integer encoding to 'diagnosis_code'.

C.Replace missing 'blood_pressure' with -1 and apply one-hot encoding to 'diagnosis_code' after grouping rare codes into 'other'.

D.Apply one-hot encoding to 'diagnosis_code' and drop rows with missing 'blood_pressure'.

AnswerB

Median is robust; integer encoding is sufficient for tree-based models like XGBoost.

Why this answer

Option B is correct because median imputation for 'blood_pressure' is robust to outliers and keeps all 2 million rows, while integer encoding (label encoding) for 'diagnosis_code' with 10,000 unique values is compact and avoids the dimensionality explosion of one-hot encoding; tree-based models such as XGBoost can split on integer codes directly. In a Glue Spark job, these transformations can be applied with Spark MLlib's `Imputer` and `StringIndexer` without excessive memory overhead.

Exam trap

The trap here is that candidates overestimate the need for one-hot encoding with high-cardinality categorical features, forgetting that tree-based models like XGBoost can effectively use integer encoding, and they may also default to mean imputation without considering outlier sensitivity.

How to eliminate wrong answers

Option A is wrong because mean imputation for 'blood_pressure' is sensitive to outliers, which can skew the imputed values; its label encoding of 'diagnosis_code' is effectively the same integer encoding used in Option B, so the imputation choice is what sets the two options apart. Option C is wrong because replacing missing 'blood_pressure' with -1 introduces an arbitrary value that XGBoost may misinterpret as a valid numeric pattern, and one-hot encoding 'diagnosis_code' with 10,000 categories (even after grouping rare codes) still creates a very high-dimensional sparse matrix that degrades performance and increases memory usage in Glue ETL. Option D is wrong because dropping 20% of rows with missing 'blood_pressure' leads to significant data loss and potential bias, and one-hot encoding 'diagnosis_code' with 10,000 categories is computationally prohibitive and unnecessary for tree-based models like XGBoost.
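
The two transformations in Option B are easy to see on a toy column. This standard-library sketch only illustrates the logic; in a Glue Spark job the same steps would run as distributed transforms. The helper names and sample values are assumptions.

```python
import statistics

def impute_median(values):
    """Replace None with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

def integer_encode(categories):
    """Map each distinct category to an integer in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(c, len(codes)) for c in categories]
    return encoded, codes

blood_pressure = [120, None, 130, 200, None]  # 200 is an extreme reading
filled = impute_median(blood_pressure)

diagnosis_code = ["I10", "E11", "I10", "J45"]
encoded, codes = integer_encode(diagnosis_code)
```

The mean of the observed readings is 150, pulled upward by the extreme value, while the median stays at 130. The encoded column remains a single integer column no matter how many distinct codes exist.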

121

MCQeasy

A data engineer notices that an AWS Glue ETL job is failing with an Out of Memory error when processing a large dataset. The dataset is 500 GB in size, and the worker type is G.1X. Which change is MOST likely to resolve the issue?

A.Partition the input data into smaller files

B.Use a Spark DataFrame instead of RDD

C.Increase the number of workers

D.Use a larger worker type like G.2X

AnswerD

G.2X provides double the memory of G.1X, resolving the OOM.

Why this answer

The G.1X worker type provides 16 GB of memory per worker. A 500 GB dataset requires sufficient aggregate memory across workers for processing. Increasing the worker type to G.2X (which doubles memory to 32 GB per worker) increases the memory per executor, allowing each task to handle larger data partitions without running out of memory.

This directly addresses the Out of Memory error by providing more heap space for Spark operations.

Exam trap

The trap here is that candidates often assume adding more workers (scaling out) always solves memory issues, but the real bottleneck is per-executor memory, which is only addressed by using a larger worker type (scaling up).

How to eliminate wrong answers

Option A is wrong because partitioning input data into smaller files does not increase the available memory per worker; it only changes how the data is read and distributed across tasks, and does not resolve an OOM caused by insufficient executor memory. Option B is wrong because using a Spark DataFrame instead of RDD does not inherently reduce memory usage; DataFrames use Catalyst optimizer and Tungsten execution for better performance, but they still operate within the same memory constraints and will OOM if memory per worker is insufficient. Option C is wrong because increasing the number of workers distributes the data across more executors but does not increase the memory per executor; if each executor still has only 16 GB, a single large partition or shuffle operation can still cause OOM on an individual executor.

122

MCQmedium

Refer to the exhibit. A data engineer runs a Glue ETL job that uses a Python script. The job fails because of a missing module `scikit-learn`. Which fix is MOST appropriate?

A.Modify the script to install scikit-learn using pip at runtime

B.Add a --additional-python-modules argument to the job with scikit-learn

C.Switch to a Glue job using Spark instead of Python

D.Use a Glue Python shell job instead

AnswerD

Python shell jobs are suited to lightweight Python scripts that need modules such as scikit-learn, though they are not designed for heavy ETL.

Why this answer

Option D is correct because a Glue Python shell job includes pre-installed libraries like scikit-learn, eliminating the missing module error without additional configuration. This job type is designed for lightweight Python scripts that do not require the distributed processing of Spark, making it the most appropriate fix for a simple dependency issue.

Exam trap

The trap here is that candidates assume all Glue jobs require Spark or that pip install at runtime is a valid workaround, but the exam expects you to recognize that Glue Python shell jobs are purpose-built for simple Python scripts and come with pre-installed ML libraries like scikit-learn.

How to eliminate wrong answers

Option A is wrong because modifying the script to install scikit-learn at runtime using pip is inefficient, may fail due to network restrictions or permission issues in the Glue environment, and violates best practices for dependency management. Option B is wrong because the --additional-python-modules argument is used with Glue Spark jobs, not Python shell jobs, and it requires specifying a compatible module version; it does not apply to the Python shell job type. Option C is wrong because switching to a Spark-based Glue job is an overengineered solution that introduces unnecessary complexity and cost for a simple Python script that does not require distributed data processing.

123

Multi-Selectmedium

A machine learning engineer is preparing a dataset for a multiclass classification task. The dataset has 10 features and 100,000 rows. Which TWO techniques should the engineer use to reduce the risk of overfitting during data preparation?

Select 2 answers

A.Data augmentation (e.g., adding noise)

B.SMOTE to balance classes

C.One-hot encoding of all categorical features

D.Log transformation of skewed features

E.Feature selection using correlation analysis

AnswersA, E

Increases training data diversity, reducing overfitting.

Why this answer

Data augmentation (A) is correct because it artificially increases the diversity of the training set by adding noise or transformations, which helps the model generalize better and reduces overfitting. Feature selection using correlation analysis (E) is correct because it removes redundant or highly correlated features, simplifying the model and minimizing the risk of learning noise from irrelevant predictors.

Exam trap

AWS often tests the distinction between techniques that address overfitting versus those that handle other data issues like imbalance or skewness, leading candidates to confuse SMOTE or log transforms as overfitting remedies.
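
Correlation-based feature selection, the technique in Option E, can be sketched as a greedy filter: keep a feature only if it is not nearly collinear with one already kept. The 0.95 threshold, the feature names, and the values below are illustrative assumptions; the sketch also assumes no feature is constant, since that would make the correlation undefined.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.95):
    """Greedily keep a feature only if |r| with every kept feature is below threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # near-duplicate of height_cm
    "age": [40.0, 22.0, 35.0, 28.0],
}
selected = drop_correlated(features)
```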

124

MCQmedium

An e-commerce company is building a recommendation system using user interaction data stored in Amazon DynamoDB. The data includes user_id, product_id, timestamp, event_type (click, add_to_cart, purchase), and session_id. The data science team exports the data to Amazon S3 as JSON files. During preprocessing, they discover that the 'event_type' field contains inconsistent values due to logging errors: 'Click', 'click', 'CLICK', and 'clck' all appear. Also, there are duplicate records where the same user_id, product_id, and timestamp appear multiple times with the same event_type. The team wants to use AWS Glue to clean the data for training a sequence-based recommendation model. Which set of actions should they perform?

A.Use AWS Glue to group records by session_id and aggregate event_types into a list per session. Then apply a mapping function to standardize event_type names.

B.Use AWS Glue to drop exact duplicate rows (all columns identical). Then apply a mapping function to standardize event_type to a controlled vocabulary (e.g., 'click', 'add_to_cart', 'purchase').

C.Use AWS Glue to drop duplicate records based on all columns. Then drop the event_type column and use only numeric features for training.

D.Use AWS Glue to impute event_type with the mode for records with inconsistent values. Then drop duplicate records based on user_id, product_id, and timestamp.

AnswerB

Deduplication removes redundant records, and mapping standardizes event_type, both essential for clean sequence data.

Why this answer

Option B is correct because it addresses both data quality issues. First, dropping exact duplicate rows (all columns identical) removes redundant records that would bias the sequence model. Second, standardizing event_type to a controlled vocabulary ensures consistent categorical input for ML training. In AWS Glue, the Drop Duplicates transform in Glue Studio (or Spark's dropDuplicates() after converting the DynamicFrame to a DataFrame) handles the first step, and the Map transform handles the second.
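The cleaning logic in Option B can be sketched in plain Python, independent of Glue. This is only an illustration of the two steps: the `EVENT_TYPE_MAP` vocabulary, the sample records, and the helper names are hypothetical, and a Glue job would express the same logic with its own transforms.

```python
# Hypothetical controlled vocabulary, including the known typo 'clck'.
EVENT_TYPE_MAP = {
    "click": "click",
    "clck": "click",
    "add_to_cart": "add_to_cart",
    "purchase": "purchase",
}


def drop_exact_duplicates(records):
    """Remove rows whose every column is identical, keeping first occurrences."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def standardize_event_type(record):
    """Map a raw event_type onto the controlled vocabulary (case-insensitive)."""
    raw = record["event_type"].strip().lower()
    if raw not in EVENT_TYPE_MAP:
        # Surface unknown values instead of silently guessing a replacement.
        raise ValueError(f"unmapped event_type: {record['event_type']!r}")
    return {**record, "event_type": EVENT_TYPE_MAP[raw]}


base = {"user_id": "u1", "product_id": "p1", "session_id": "s1"}
records = [
    {**base, "timestamp": 100, "event_type": "Click"},
    {**base, "timestamp": 100, "event_type": "Click"},  # exact duplicate
    {**base, "timestamp": 160, "event_type": "clck"},   # logging typo
    {**base, "timestamp": 200, "event_type": "purchase"},
]
cleaned = [standardize_event_type(r) for r in drop_exact_duplicates(records)]
```

Raising on an unmapped value is a design choice for the sketch: it keeps new logging errors visible rather than folding them into an existing category.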

Exam trap

The trap here is that candidates may think grouping by session_id is necessary for sequence modeling, but the question asks for cleaning steps, not feature engineering—duplicate removal and standardization must come first to avoid propagating errors into the sequence aggregation.

How to eliminate wrong answers

Option A is wrong because grouping by session_id and aggregating event_types into a list per session loses the individual event timestamps and ordering, which are critical for sequence-based recommendation models; it also leaves the duplicates in place and standardizes names only after aggregation. Option C is wrong because dropping the event_type column discards the core behavioral signal (click, add_to_cart, purchase) that the recommendation model learns from. Option D is wrong because imputing event_type with the mode is inappropriate for categorical data with logging errors (e.g., 'clck' should be mapped to 'click', not replaced by the most frequent value), and dropping duplicates only on user_id, product_id, and timestamp may remove legitimate distinct events that differ in event_type.

125

MCQhard

A team is using AWS Glue to process streaming data from Amazon Kinesis. The streaming data contains both structured and semi-structured fields. The team needs to flatten the semi-structured fields into columns for downstream ML training. Which Glue feature is BEST suited?

A.Relationalize transform

B.Spigot transform

C.ResolveChoice transform

D.ApplyMapping transform

AnswerA

Relationalize recursively flattens nested data into separate tables or columns.

Why this answer

The Relationalize transform is specifically designed to flatten nested JSON or semi-structured fields into a relational structure, making it ideal for converting complex streaming data from Kinesis into flat columns for ML training. It automatically handles arrays and structs by creating separate tables or columns, which is exactly what the team needs for downstream processing.
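To make the idea of flattening concrete, here is a minimal plain-Python sketch of turning nested struct fields into dotted column names. It is not the Glue API and does not reproduce Relationalize's handling of arrays (which are pivoted into separate tables); the `flatten` helper and the sample record are hypothetical.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested structs, extending the column prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical streaming record with one flat and one nested field.
event = {"user_id": "u1", "device": {"os": "ios", "screen": {"w": 390, "h": 844}}}
row = flatten(event)
```

Each nested leaf becomes its own column (`device.os`, `device.screen.w`, and so on), which is the shape most tabular ML training inputs expect.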

Exam trap

The trap here is that candidates confuse 'flattening semi-structured data' with simple schema operations like type resolution or column mapping, leading them to choose ResolveChoice or ApplyMapping instead of the specialized Relationalize transform.

How to eliminate wrong answers

Option B is wrong because the Spigot transform is used to sample or write a subset of data to a specified location for debugging or testing, not for flattening semi-structured fields. Option C is wrong because the ResolveChoice transform resolves ambiguity when a column has multiple data types (e.g., string vs. int) by casting to a chosen type, but it does not flatten nested structures. Option D is wrong because the ApplyMapping transform renames, casts, or drops columns based on a mapping specification, but it cannot flatten nested JSON or semi-structured data into separate columns.

126

MCQmedium

A data scientist runs the AWS Glue ETL job shown in the exhibit. The job fails with a Spark stage failure error. What is the most likely cause?

A.The output path is missing.

B.The S3 bucket does not exist.

C.The job does not have enough memory.

D.The data type mapping in ApplyMapping is incorrect; "value" column contains non-numeric strings that cannot be cast to double.

AnswerD

Casting string to double fails on non-numeric data, causing task failure.

Why this answer

The Spark stage failure error in an AWS Glue ETL job is most likely caused by a data type mismatch during the ApplyMapping transformation. When the 'value' column contains non-numeric strings that cannot be cast to double, Spark throws a stage failure because it cannot complete the required type conversion, leading to task failures and job termination.
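One way to guard against this class of problem is to scan the column for values that will not parse as numbers before applying the cast. The sketch below uses plain Python and a hypothetical sample column; note that Python's `float()` is more permissive than some engines (it accepts strings such as "nan" or values with surrounding whitespace), so treat it as an approximation of the real cast.

```python
def find_uncastable(values):
    """Return (index, value) pairs that cannot be parsed as a float."""
    bad = []
    for i, v in enumerate(values):
        try:
            float(v)
        except (TypeError, ValueError):
            bad.append((i, v))
    return bad


# Hypothetical contents of the "value" column before casting to double.
value_column = ["3.14", "42", "N/A", "", "1e-3"]
bad_rows = find_uncastable(value_column)
```

Rows reported in `bad_rows` can then be fixed, routed to a quarantine location, or dropped before the type mapping runs.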

Exam trap

The trap here is that candidates often attribute Spark stage failures to resource issues (memory or missing paths) rather than recognizing that data type casting errors during transformations are a primary cause of stage-level failures in Glue ETL jobs.

How to eliminate wrong answers

Option A is wrong because a missing output path would cause a different error, such as 'Path does not exist' or 'FileNotFoundException', not a Spark stage failure. Option B is wrong because a non-existent S3 bucket would result in an 'AccessDenied' or 'NoSuchBucket' error at the job start, not during a Spark stage. Option C is wrong because insufficient memory typically manifests as an 'OutOfMemoryError' or 'Container killed by YARN' error, not a generic stage failure; stage failures are more commonly tied to data processing errors like type casting issues.

127

MCQmedium

A data scientist is exploring data stored in an Amazon Redshift cluster. The data includes timestamp columns with different formats. The scientist wants to create a new column that standardizes the timestamp format to UTC. Which approach is MOST efficient?

A.Use AWS Glue to read the Redshift table and apply a custom transform

B.Use a SELECT with CONVERT_TIMEZONE in Redshift and export to S3

C.Use a SageMaker notebook to query Redshift and transform

D.Use Amazon QuickSight to transform the timestamp

AnswerB

CONVERT_TIMEZONE is a built-in Redshift function that efficiently converts timestamps.

Why this answer

Option B is correct because `CONVERT_TIMEZONE` in Amazon Redshift is a native SQL function that directly converts timestamps to UTC without moving data outside the cluster. This approach avoids the overhead of external services, leverages Redshift's massively parallel processing (MPP) engine, and is the most efficient for in-database transformations.
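As a rough illustration, the query could look like the one built below. The table name `events`, the column `created_at`, and the source time zone are hypothetical, and the helper uses plain string formatting, so it is only suitable for trusted, hard-coded identifiers.

```python
def utc_select(table, ts_column, source_tz):
    """Build a SELECT that adds a UTC-normalized copy of a timestamp column."""
    return (
        f"SELECT *, CONVERT_TIMEZONE('{source_tz}', 'UTC', {ts_column}) "
        f"AS {ts_column}_utc FROM {table};"
    )


# Hypothetical table and column names, for illustration only.
query = utc_select("events", "created_at", "America/New_York")
```

The resulting statement runs entirely inside Redshift; the conversion happens where the data already lives, and only the final result needs to be exported.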

Exam trap

The trap here is that candidates assume external ETL tools (Glue, SageMaker) are always necessary for complex transforms, overlooking Redshift's built-in SQL functions, which can perform the same task without moving data out of the cluster before the transform.

How to eliminate wrong answers

Option A is wrong because AWS Glue would require reading the entire Redshift table into a separate Spark environment, adding network latency and compute costs, which is far less efficient than a native SQL transform. Option C is wrong because a SageMaker notebook would need to query Redshift via a JDBC/ODBC connection, pulling data into the notebook's memory for transformation, introducing unnecessary data movement and serialization overhead. Option D is wrong because Amazon QuickSight is a visualization and dashboarding service, not a data transformation engine; it can add calculated fields to its own datasets, but it does not write new columns back to Redshift or modify the table schema.

128

Multi-Selecteasy

A data engineer is using SageMaker Pipelines to automate data preparation. Which TWO statements about data validation within a pipeline are correct?

Select 2 answers

A.The pipeline can be configured to fail if data quality checks do not meet thresholds

B.SageMaker Pipelines has a built-in 'CheckDataQuality' step for data validation

C.Data validation can only be performed on training data, not inference data

D.Data validation steps cannot pass results to subsequent steps

E.Data validation requires a trained model to evaluate predictions

AnswersA, B

You can set conditions to fail the pipeline.

Why this answer

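Option A is correct because SageMaker Pipelines allows you to define conditions that evaluate the output of data quality checks (e.g., using Amazon SageMaker Model Monitor or custom validation scripts). If the checks fail to meet specified thresholds (e.g., missing values exceed 5%), the pipeline can be configured to fail, stopping execution and preventing downstream steps from processing invalid data. Option B is correct because SageMaker Pipelines includes a built-in quality check step (exposed as `QualityCheckStep` in the SageMaker Python SDK) that can run data quality checks against baseline statistics and constraints.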
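The threshold logic itself is simple, and a custom validation script might look like the plain-Python sketch below. The `DataQualityError` class, the `price` column, and the 5% threshold are assumptions for illustration; in an actual pipeline, a check like this would typically run inside a processing step, with a condition step and a fail step acting on its output.

```python
class DataQualityError(Exception):
    """Raised when a dataset violates a data quality threshold."""


def missing_ratio(rows, column):
    """Fraction of rows where the column is absent, None, or an empty string."""
    missing = sum(1 for r in rows if r.get(column) in (None, ""))
    return missing / len(rows)


def check_missing(rows, column, threshold=0.05):
    """Return the missing ratio, or raise if it exceeds the threshold."""
    ratio = missing_ratio(rows, column)
    if ratio > threshold:
        raise DataQualityError(
            f"{column}: {ratio:.1%} missing exceeds threshold {threshold:.1%}"
        )
    return ratio


# Hypothetical dataset: 1 of 20 rows is missing 'price' (exactly at the 5% limit).
ok_rows = [{"price": 1.0}] * 19 + [{"price": None}]
ok_ratio = check_missing(ok_rows, "price")
```

A non-zero exit from such a script (here, the uncaught exception) is what lets the surrounding pipeline stop before downstream steps consume invalid data.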

Exam trap

The trap here is that candidates assume data validation requires a trained model or is limited to training data, but SageMaker Pipelines supports rule-based validation on any dataset, including inference data, without needing a model.
