Sample questions
AWS Certified Machine Learning Specialty MLS-C01 practice questions
A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. What is the MOST likely cause?
Trap 1: The data retention period of the stream is too short.
Retention period does not affect throughput.
Trap 2: The S3 bucket has insufficient write capacity.
S3 scales automatically.
Trap 3: The Lambda function's reserved concurrency is set too high.
High concurrency would not cause throttling.
- A
The data retention period of the stream is too short.
Why wrong: Retention period does not affect throughput.
- B
The S3 bucket has insufficient write capacity.
Why wrong: S3 scales automatically.
- C
The Kinesis stream has too few shards for the data volume.
Each shard has fixed read and write throughput limits; when traffic exceeds them, Kinesis returns ProvisionedThroughputExceededException.
- D
The Lambda function's reserved concurrency is set too high.
Why wrong: High concurrency would not cause throttling.
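A rough sizing sketch for the shard question above. It assumes provisioned mode and uses the per-shard write limits of 1 MB/s and 1,000 records/s; the function name and the sample load are hypothetical and not part of the question.

```python
import math

# Per-shard write limits for Kinesis Data Streams in provisioned mode.
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(ingest_mb_per_sec: float, records_per_sec: float) -> int:
    """Return the minimum shard count that satisfies both write limits."""
    by_bytes = math.ceil(ingest_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(1, by_bytes, by_records)


# Hypothetical clickstream load: 4.5 MB/s and 12,000 records/s.
print(shards_needed(4.5, 12_000))
```

Whichever limit is hit first decides the shard count, so a stream of many small records can need more shards than its byte volume alone suggests.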
A team is building a data pipeline to process terabytes of log data daily using Amazon EMR. The data arrives in 5-minute windows and must be available for querying within 30 minutes. The data is originally in gzip-compressed CSV files. Which approach will minimize processing time and cost?
Trap 1: Use Amazon EMR with Spark to convert data to Parquet and use…
On-demand instances cost more than Spot Instances for interruptible task nodes.
Trap 2: Use AWS Glue to convert data to gzip-compressed CSV and query with…
CSV is not optimal for Athena performance.
Trap 3: Use Amazon EMR with Hive to transform data to compressed CSV and…
CSV still incurs full scan costs.
- A
Use Amazon EMR with Spark to convert data to Parquet and use on-demand instances.
Why wrong: On-demand instances cost more than Spot Instances for interruptible task nodes.
- B
Use Amazon EMR with Spark to convert data to Parquet and store in S3, using spot instances for task nodes.
Parquet reduces scan size, spot instances reduce cost.
- C
Use AWS Glue to convert data to gzip-compressed CSV and query with Athena.
Why wrong: CSV is not optimal for Athena performance.
- D
Use Amazon EMR with Hive to transform data to compressed CSV and store in S3.
Why wrong: CSV still incurs full scan costs.
A data science team is building a real-time fraud detection system. Transactions are streamed via Amazon Kinesis Data Streams, and a Lambda function performs feature engineering and invokes an Amazon SageMaker endpoint for predictions. The team notices that the Lambda function is timing out and causing data loss. Which solution should the team implement to process the stream reliably and at low latency?
Trap 1: Use the Kinesis Client Library (KCL) to process the stream in an…
KCL-based processing on EC2 adds operational overhead and does not directly address Lambda timeout issues; DynamoDB storage is not necessary for the pipeline.
Trap 2: Increase the Lambda function timeout to 15 minutes and allocate…
Lambda has a maximum timeout of 15 minutes, but this increases cost and still risks data loss if failures occur; it does not provide checkpointing.
Trap 3: Configure Amazon Kinesis Firehose to deliver the stream to an…
Kinesis Firehose introduces minutes of delay, which is unsuitable for real-time fraud detection.
- A
Use Amazon Kinesis Data Analytics for Apache Flink to consume the stream, perform feature engineering, and invoke the SageMaker endpoint with exactly-once processing.
Kinesis Data Analytics provides stateful stream processing with checkpointing, ensuring no data loss and low-latency integration with SageMaker.
- B
Use the Kinesis Client Library (KCL) to process the stream in an Amazon EC2 instance, and store the predictions in Amazon DynamoDB.
Why wrong: KCL-based processing on EC2 adds operational overhead and does not directly address Lambda timeout issues; DynamoDB storage is not necessary for the pipeline.
- C
Increase the Lambda function timeout to 15 minutes and allocate more memory to reduce processing time.
Why wrong: Lambda has a maximum timeout of 15 minutes, but this increases cost and still risks data loss if failures occur; it does not provide checkpointing.
- D
Configure Amazon Kinesis Firehose to deliver the stream to an Amazon S3 bucket, then trigger a Lambda function to process the data in batches.
Why wrong: Kinesis Firehose introduces minutes of delay, which is unsuitable for real-time fraud detection.
A company uses Amazon SageMaker to train and deploy machine learning models. The training data is stored in Amazon S3 (Parquet format, 10 TB). The data scientists have been running training jobs using the File mode input, but the jobs are taking too long due to data download time. They want to reduce the training start-up time and overall training time. Which solution is MOST cost-effective and efficient?
Trap 1: Use S3 Transfer Acceleration to speed up the data transfer from S3…
S3 Transfer Acceleration speeds up transfers between distant clients and a bucket by routing them through edge locations; a training instance in the same Region as the bucket gains nothing from it.
Trap 2: Use larger EC2 instances with more vCPUs and memory to speed up the…
Larger instances may speed up computation but do not reduce the data download time; they increase cost unnecessarily.
Trap 3: Enable Elastic Fabric Adapter (EFA) on the training instances to…
EFA is designed for inter-node communication, not for S3 data loading; it does not address the download bottleneck.
- A
Configure the SageMaker training job to use Pipe mode, which streams data directly from S3 without downloading to the instance's local storage.
Pipe mode reduces start-up time by streaming data, and it is cost-effective as it avoids EBS volume costs associated with File mode.
- B
Use S3 Transfer Acceleration to speed up the data transfer from S3 to the training instance.
Why wrong: S3 Transfer Acceleration speeds up transfers between distant clients and a bucket by routing them through edge locations; a training instance in the same Region as the bucket gains nothing from it.
- C
Use larger EC2 instances with more vCPUs and memory to speed up the training process.
Why wrong: Larger instances may speed up computation but do not reduce the data download time; they increase cost unnecessarily.
- D
Enable Elastic Fabric Adapter (EFA) on the training instances to improve network throughput.
Why wrong: EFA is designed for inter-node communication, not for S3 data loading; it does not address the download bottleneck.
A data engineer is building a data pipeline to process user clickstream data. The data arrives as JSON files in an S3 bucket. The pipeline must transform the JSON into Parquet format and partition by date and event type, then make the data available for Amazon Athena queries. The engineer needs a fully managed, serverless solution with minimal operational overhead. Which combination of AWS services should the engineer use?
Trap 1: Use Amazon EMR with Spark to read JSON, convert to Parquet, and…
EMR requires cluster management and is not serverless, increasing operational overhead.
Trap 2: Use AWS Glue ETL jobs to read JSON from S3, transform to Parquet,…
Glue ETL is serverless but has overhead; for simple transformations, Lambda is more efficient and cheaper.
Trap 3: Use Amazon Kinesis Firehose to ingest data and convert to Parquet,…
Kinesis Firehose is for streaming data, not batch files; it cannot handle existing S3 objects.
- A
Use Amazon EMR with Spark to read JSON, convert to Parquet, and partition, then query with Athena.
Why wrong: EMR requires cluster management and is not serverless, increasing operational overhead.
- B
Use AWS Glue ETL jobs to read JSON from S3, transform to Parquet, and write to a partitioned S3 location, then use Athena.
Why wrong: Glue ETL is serverless but has overhead; for simple transformations, Lambda is more efficient and cheaper.
- C
Use S3 Event Notifications to trigger an AWS Lambda function that converts the JSON to Parquet and writes to a partitioned S3 location, then query with Athena.
Lambda is serverless, cost-effective for per-file processing, and can partition output easily.
- D
Use Amazon Kinesis Firehose to ingest data and convert to Parquet, then write to S3, and query with Athena.
Why wrong: Kinesis Firehose is for streaming data, not batch files; it cannot handle existing S3 objects.
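The "partitioned S3 location" in the options above usually means Hive-style `key=value` prefixes, which Athena can map to partition columns. A minimal sketch of building such a key, assuming a hypothetical prefix and file name:

```python
from datetime import datetime, timezone


def partitioned_key(prefix: str, event_type: str, event_time: datetime, filename: str) -> str:
    """Build a Hive-style S3 key partitioned by date and event type."""
    # Normalize to UTC so one event always lands in the same date partition.
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{prefix}/date={day}/event_type={event_type}/{filename}"


# Hypothetical event; the names are illustrative only.
key = partitioned_key(
    "clicks",
    "purchase",
    datetime(2024, 3, 5, 23, 30, tzinfo=timezone.utc),
    "part-0.parquet",
)
print(key)
```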
A data engineering team is designing a data lake on AWS for machine learning workloads. The data includes structured, semi-structured, and unstructured data. The team needs to ensure that the data is cataloged, easily discoverable, and can be queried by Amazon Athena and Amazon EMR. The team also wants to enforce fine-grained access control at the column and row level for sensitive data. Which combination of AWS services should the team use? (Select TWO.)
Trap 1: AWS Identity and Access Management (IAM)
IAM provides identity and access management but not column/row-level granularity.
Trap 2: Amazon RDS for PostgreSQL
RDS is a relational database, not a data catalog for the data lake.
Trap 3: Amazon DynamoDB
DynamoDB is a key-value store and not used for data lake cataloging.
- A
AWS Lake Formation
Lake Formation provides fine-grained access control and integrates with Glue Catalog.
- B
AWS Identity and Access Management (IAM)
Why wrong: IAM provides identity and access management but not column/row-level granularity.
- C
AWS Glue Data Catalog
Glue Data Catalog is the central metadata repository for Athena, EMR, and other services.
- D
Amazon RDS for PostgreSQL
Why wrong: RDS is a relational database, not a data catalog for the data lake.
- E
Amazon DynamoDB
Why wrong: DynamoDB is a key-value store and not used for data lake cataloging.
A data scientist needs to transform raw JSON data from an S3 bucket into Parquet format using AWS Glue. The job must be cost-effective and run only when new data arrives. Which solution should be used?
Trap 1: Create a Glue crawler that runs continuously.
Crawlers discover schema, not transform data.
Trap 2: Schedule a Glue ETL job to run every hour.
Scheduled runs may waste resources if no new data.
Trap 3: Use Glue DataBrew to transform data and schedule it daily.
DataBrew jobs can be scheduled, but a daily schedule is not event-driven and runs whether or not new data has arrived.
- A
Create a Glue crawler that runs continuously.
Why wrong: Crawlers discover schema, not transform data.
- B
Schedule a Glue ETL job to run every hour.
Why wrong: Scheduled runs may waste resources if no new data.
- C
Use Glue DataBrew to transform data and schedule it daily.
Why wrong: DataBrew jobs can be scheduled, but a daily schedule is not event-driven and runs whether or not new data has arrived.
- D
Create a Glue ETL job triggered by an S3 event notification via Lambda.
Event-driven trigger ensures cost-effectiveness.
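The event-driven option can be sketched as a Lambda-style handler that starts a Glue job run for each new object. This is a sketch under assumptions: the Glue client is passed in (in a real function it would come from `boto3.client("glue")`), and the job name and the `--source_path` argument are hypothetical.

```python
from urllib.parse import unquote_plus


def handle_s3_event(event: dict, glue_client, job_name: str) -> list:
    """Start one Glue job run per object in an S3 event notification."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            # "--source_path" is a made-up job argument for illustration.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Because the job only starts when an object arrives, no Glue capacity is billed during idle periods.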
A company uses AWS Glue to catalog data in S3. Data is partitioned by year, month, day. The Glue crawler runs daily but sometimes misses new partitions. What should be done to ensure all partitions are cataloged?
Trap 1: Use a custom classifier to detect partition patterns.
Classifiers detect schema, not partitions.
Trap 2: Increase the crawler schedule to run every hour.
More frequent runs may still miss partitions.
Trap 3: Configure the crawler to update all partitions on each run.
Not a supported setting.
- A
Use a custom classifier to detect partition patterns.
Why wrong: Classifiers detect schema, not partitions.
- B
Increase the crawler schedule to run every hour.
Why wrong: More frequent runs may still miss partitions.
- C
Configure the crawler to update all partitions on each run.
Why wrong: Not a supported setting.
- D
Enable partition indexing in the Glue table properties.
Partition indexing helps Athena query without full scan.
A company needs to build a data lake on AWS for analytics. The data includes structured, semi-structured, and unstructured data. The solution must support schema-on-read, provide fine-grained access control, and be cost-effective for storing rarely accessed data. Which THREE services should be used? (Choose THREE)
Trap 1: Amazon Redshift for data warehousing.
Redshift is a warehouse, not a data lake.
Trap 2: Amazon EMR for data processing.
EMR is for processing, not storage.
- A
AWS Glue Data Catalog for schema-on-read.
Glue enables schema-on-read for analytics.
- B
Amazon Redshift for data warehousing.
Why wrong: Redshift is a warehouse, not a data lake.
- C
Amazon S3 as the primary storage layer.
S3 is the foundation of a data lake.
- D
Amazon EMR for data processing.
Why wrong: EMR is for processing, not storage.
- E
S3 Lifecycle policies to transition data to Glacier.
Lifecycle policies reduce cost for old data.
A data engineer created an IAM policy to allow a Glue ETL job to read and write objects to an S3 bucket. The ETL job fails when writing data with the error 'Access Denied'. The job is configured to use SSE-S3 (AES256) encryption. What is the likely issue?
Exhibit
Refer to the exhibit.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-data-lake/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}
Trap 1: The policy grants s3:PutObject on all buckets, not just the…
The resource is specific to the bucket.
Trap 2: The condition requires objects to be encrypted with SSE-KMS, but…
The condition specifies AES256, which is SSE-S3.
Trap 3: The condition requires objects to use SSE-S3, but the job uses…
The job uses SSE-S3, not SSE-KMS.
- A
The policy grants s3:PutObject on all buckets, not just the specific one.
Why wrong: The resource is specific to the bucket.
- B
The condition requires objects to be encrypted with SSE-KMS, but the job uses SSE-S3.
Why wrong: The condition specifies AES256, which is SSE-S3.
- C
The policy does not grant s3:PutObject on the bucket itself, which is needed for some write operations.
Bucket-level permissions may be required for certain write operations.
- D
The condition requires objects to use SSE-S3, but the job uses SSE-KMS.
Why wrong: The job uses SSE-S3, not SSE-KMS.
A company runs a real-time fraud detection system using Amazon Kinesis Data Streams with 100 shards. Data is consumed by a custom Java application running on Amazon EC2 instances in an Auto Scaling group. The application processes records and writes results to a DynamoDB table. Over the past month, the application has experienced intermittent slowdowns and the DynamoDB write capacity has been fully utilized during peak hours. The team wants to improve throughput without losing the ability to reprocess failed records. The application currently uses the Kinesis Client Library (KCL) with DynamoDB as the lease table. The team is considering the changes listed below. Which change should the team implement first to address the issue?
Trap 1: Increase the write capacity of the DynamoDB lease table to handle…
Lease table capacity is not the bottleneck.
Trap 2: Switch to using AWS Lambda as the consumer to handle scaling…
Lambda does not provide reprocessing capabilities.
Trap 3: Increase the number of EC2 instances to match the number of shards.
Adding instances won't increase per-shard throughput.
- A
Increase the write capacity of the DynamoDB lease table to handle more workers.
Why wrong: Lease table capacity is not the bottleneck.
- B
Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput.
Enhanced fan-out gives dedicated throughput per consumer.
- C
Switch to using AWS Lambda as the consumer to handle scaling automatically.
Why wrong: Lambda does not provide reprocessing capabilities.
- D
Increase the number of EC2 instances to match the number of shards.
Why wrong: Adding instances won't increase per-shard throughput.
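The throughput difference behind the enhanced fan-out option can be worked out with simple arithmetic. The sketch below assumes the standard 2 MB/s per-shard read limit, shared among all standard consumers but dedicated to each enhanced fan-out consumer; the function name is hypothetical.

```python
SHARD_READ_MB_PER_SEC = 2.0


def per_consumer_read_mb(consumers: int, enhanced_fan_out: bool) -> float:
    """Return the read throughput each consumer can expect from one shard."""
    if consumers < 1:
        raise ValueError("at least one consumer is required")
    if enhanced_fan_out:
        # Each registered consumer gets its own 2 MB/s pipe per shard.
        return SHARD_READ_MB_PER_SEC
    # Standard consumers split the shard's 2 MB/s between them.
    return SHARD_READ_MB_PER_SEC / consumers


print(per_consumer_read_mb(4, enhanced_fan_out=False))
print(per_consumer_read_mb(4, enhanced_fan_out=True))
```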
A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target variable is binary. After checking for multicollinearity, the scientist finds many pairs of features with correlation > 0.95. Which action should be taken to prepare the data for modeling?
Trap 1: Apply PCA to all features to decorrelate them.
PCA reduces dimensionality but loses interpretability and may not be needed.
Trap 2: Standardize all features using StandardScaler.
Scaling does not reduce multicollinearity.
Trap 3: Randomly drop half of the correlated features.
Random dropping may remove important features.
- A
Apply PCA to all features to decorrelate them.
Why wrong: PCA reduces dimensionality but loses interpretability and may not be needed.
- B
Standardize all features using StandardScaler.
Why wrong: Scaling does not reduce multicollinearity.
- C
For each highly correlated pair, remove one feature based on domain knowledge or higher correlation with target.
This reduces redundancy while retaining predictive power.
- D
Randomly drop half of the correlated features.
Why wrong: Random dropping may remove important features.
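One way to apply the "keep the feature more correlated with the target" rule is sketched below in plain Python. The feature names, sample values, and helper functions are hypothetical; in practice domain knowledge may override the purely numeric choice.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune_correlated(features: dict, target: list, threshold: float = 0.95) -> list:
    """Drop, from each highly correlated pair, the feature less correlated with the target."""
    kept = list(features)
    for a, b in combinations(list(features), 2):
        if a in kept and b in kept and abs(pearson(features[a], features[b])) > threshold:
            weaker = min((a, b), key=lambda name: abs(pearson(features[name], target)))
            kept.remove(weaker)
    return kept


# Toy data: f1 and f2 are nearly collinear, f3 is not.
features = {
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [1.1, 2.0, 2.9, 4.2, 5.1, 5.8],
    "f3": [3, 1, 4, 1, 5, 9],
}
target = [0, 0, 0, 1, 1, 1]
kept = prune_correlated(features, target)
print(kept)
```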
Match each hyperparameter tuning strategy to its description.
Exhaustive search over specified hyperparameter values
Random sampling of hyperparameter combinations
Probabilistic model to guide search
Early stopping and resource allocation
SageMaker automatic tuning
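The first two descriptions above correspond to what is commonly called grid search and random search. A minimal sketch of how the two generate candidate configurations, using a made-up search space:

```python
import itertools
import random

# Hypothetical search space for illustration.
search_space = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [3, 5, 7]}


def grid_search(space: dict) -> list:
    """Enumerate every combination of the listed values (exhaustive)."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(space[name] for name in names))
    ]


def random_search(space: dict, n_trials: int, seed: int = 0) -> list:
    """Sample n_trials combinations independently at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_trials)
    ]


print(len(grid_search(search_space)))
print(len(random_search(search_space, n_trials=4)))
```

Grid search cost grows multiplicatively with each added hyperparameter, while random search keeps a fixed budget regardless of the size of the space.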
Match each AWS AI service to its capability.
Natural language processing
Language translation
Text-to-speech
Speech-to-text
Conversational chatbots
A data scientist is analyzing a dataset with missing values in 30% of the rows for the 'age' column. The data scientist decides to impute the missing values with the median of the observed 'age' values. What is a potential drawback of this approach?
Trap 1: The imputation will introduce bias if the missing values are not…
This is a potential drawback, but the question asks for a drawback of median imputation specifically; the more direct drawback is reduced variance.
Trap 2: Imputation using median is computationally expensive for large…
Median imputation is computationally cheap.
Trap 3: The imputed values will increase the variance of the feature,…
Median imputation does not increase variance.
- A
The imputation will introduce bias if the missing values are not random.
Why wrong: This is a potential drawback, but the question asks for a drawback of median imputation specifically; the more direct drawback is reduced variance.
- B
Imputation using median is computationally expensive for large datasets.
Why wrong: Median imputation is computationally cheap.
- C
The imputed values may reduce the variance of the 'age' distribution.
Replacing missing values with a constant reduces the variability of the feature.
- D
The imputed values will increase the variance of the feature, leading to overfitting.
Why wrong: Median imputation does not increase variance.
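The variance reduction can be checked on a toy example. The ages and the number of missing rows below are made up for illustration.

```python
import statistics

ages = [22, 25, 31, 38, 44, 52, 67]  # observed values (hypothetical)
n_missing = 3                        # rows where age is missing

median_age = statistics.median(ages)
imputed = ages + [median_age] * n_missing

var_before = statistics.pvariance(ages)
var_after = statistics.pvariance(imputed)

# Every imputed row sits at the same central value, so the spread shrinks.
print(var_before > var_after)
```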
An e-commerce company uses a linear regression model to predict customer lifetime value (LTV). The model shows high variance on the test set, with training RMSE much lower than test RMSE. Which of the following is the MOST effective approach to reduce overfitting?
Trap 1: Use a polynomial kernel in a support vector regressor
Polynomial kernel can increase complexity and overfitting.
Trap 2: Add more features, including interaction terms
Adding features may increase variance.
Trap 3: Increase training data size by duplicating existing samples
Duplicating data does not add new information and may not reduce overfitting.
- A
Apply L2 regularization (Ridge regression)
L2 regularization shrinks coefficients and reduces variance.
- B
Use a polynomial kernel in a support vector regressor
Why wrong: Polynomial kernel can increase complexity and overfitting.
- C
Add more features, including interaction terms
Why wrong: Adding features may increase variance.
- D
Increase training data size by duplicating existing samples
Why wrong: Duplicating data does not add new information and may not reduce overfitting.
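The shrinkage effect of L2 regularization is easiest to see in the one-feature case without an intercept, where minimizing the sum of (y - w*x)^2 plus lambda*w^2 has the closed-form solution w = sum(x*y) / (sum(x^2) + lambda). A sketch with made-up data:

```python
def ridge_slope(x, y, l2_penalty: float) -> float:
    """Closed-form ridge solution for a single feature with no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + l2_penalty)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

unregularized = ridge_slope(x, y, l2_penalty=0.0)
regularized = ridge_slope(x, y, l2_penalty=14.0)
print(unregularized, regularized)
```

A larger penalty pulls the coefficient toward zero, trading a little bias for lower variance.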
A data scientist is training a deep learning model using Amazon SageMaker. The training loss is decreasing, but the validation loss starts increasing after 10 epochs. The model is overfitting. Which TWO actions should the data scientist take to reduce overfitting? (Choose 2.)
Trap 1: Increase the number of layers
Increases model capacity, worsens overfitting.
Trap 2: Remove L2 regularization
Regularization helps reduce overfitting.
Trap 3: Increase the number of training steps
More steps can lead to more overfitting.
- A
Increase the number of layers
Why wrong: Increases model capacity, worsens overfitting.
- B
Remove L2 regularization
Why wrong: Regularization helps reduce overfitting.
- C
Increase the number of training steps
Why wrong: More steps can lead to more overfitting.
- D
Add dropout layers
Dropout regularizes by randomly dropping neurons.
- E
Add early stopping based on validation loss
Stops training when validation loss stops improving.
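Early stopping on validation loss can be sketched as a patience counter. The function name, the patience value, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience: int = 2) -> int:
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    # Validation loss never stalled long enough; train to the end.
    return len(val_losses)


# Validation loss bottoms out at epoch 3, then rises.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.62, 0.65, 0.7], patience=2))
```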
A data scientist is exploring a dataset with 500 features and 10,000 samples. The data scientist computes the pairwise correlation matrix and finds that many features have correlations above 0.9. The data scientist wants to reduce the dataset to 50 features while preserving as much variance as possible. Which technique should be used?
Trap 1: Remove all but one feature from each group of highly correlated…
This approach may discard information that is not redundant.
Trap 2: Use Linear Discriminant Analysis (LDA) to project to 50 dimensions.
LDA is supervised and requires a target variable; it may not be appropriate if the goal is unsupervised dimensionality reduction.
Trap 3: Use t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce…
t-SNE is non-linear and stochastic; it is not suitable for reducing to a specific number of components for modeling.
- A
Remove all but one feature from each group of highly correlated features.
Why wrong: This approach may discard information that is not redundant.
- B
Apply Principal Component Analysis (PCA) and keep the top 50 principal components.
PCA finds orthogonal directions of maximum variance and can reduce dimensionality effectively.
- C
Use Linear Discriminant Analysis (LDA) to project to 50 dimensions.
Why wrong: LDA is supervised and requires a target variable; it may not be appropriate if the goal is unsupervised dimensionality reduction.
- D
Use t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce to 50 dimensions.
Why wrong: t-SNE is non-linear and stochastic; it is not suitable for reducing to a specific number of components for modeling.
A data scientist is analyzing a dataset of customer reviews. The dataset contains a text column 'review' and a numerical rating from 1 to 5. The data scientist wants to create features for sentiment analysis. Which THREE preprocessing steps should be applied to the text data before feature extraction? (Choose THREE.)
Trap 1: Standardize the text data using z-score normalization.
Text data is categorical; standardization is not applicable.
Trap 2: Tokenize the text into individual words.
Tokenization is treated here as part of feature extraction itself rather than as a separate cleaning step applied beforehand.
- A
Standardize the text data using z-score normalization.
Why wrong: Text data is categorical; standardization is not applicable.
- B
Apply stemming to reduce words to their root form.
Stemming groups related words, reducing feature dimensionality.
- C
Tokenize the text into individual words.
Why wrong: Tokenization is treated here as part of feature extraction itself rather than as a separate cleaning step applied beforehand.
- D
Convert all text to lowercase.
Lowercasing ensures consistency and reduces vocabulary size.
- E
Remove common stop words (e.g., 'the', 'and', 'is').
Stop words are frequent but often irrelevant for sentiment analysis.
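The three selected steps (lowercasing, stop-word removal, stemming) can be sketched in plain Python. The stop-word list is a tiny illustrative subset, and `crude_stem` is a made-up suffix stripper standing in for a real stemmer such as Porter's; the whitespace split is only the plumbing needed to apply the steps word by word.

```python
STOP_WORDS = {"the", "and", "is", "a", "to", "was"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemming algorithm


def crude_stem(word: str) -> str:
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(review: str) -> list:
    """Lowercase, drop punctuation and stop words, then stem each word."""
    words = review.lower().split()
    words = [w.strip(".,!?;:'\"") for w in words]
    return [crude_stem(w) for w in words if w and w not in STOP_WORDS]


print(preprocess("The delivery was delayed and the packaging is damaged!"))
```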
During EDA, a data scientist notices that a feature has a high proportion of missing values (e.g., 70%). The feature is continuous and expected to be important based on domain knowledge. What is the best approach to handle this?
Trap 1: Remove the feature entirely to avoid bias.
Removing a potentially important feature is not optimal.
Trap 2: Impute missing values with -1 since it is out of range.
Arbitrary constant can distort the distribution.
Trap 3: Drop all rows with missing values in that feature.
Dropping 70% of rows is wasteful.
- A
Remove the feature entirely to avoid bias.
Why wrong: Removing a potentially important feature is not optimal.
- B
Create a binary indicator for missingness and impute the continuous values with the median.
This captures both the pattern of missingness and the distribution.
- C
Impute missing values with -1 since it is out of range.
Why wrong: Arbitrary constant can distort the distribution.
- D
Drop all rows with missing values in that feature.
Why wrong: Dropping 70% of rows is wasteful.
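The indicator-plus-median approach can be sketched as follows, using `None` for missing entries. The function name and the sample values are hypothetical.

```python
import statistics


def impute_with_indicator(values):
    """Return (imputed values, missingness flags) for one continuous feature."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    imputed = [median if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


imputed, was_missing = impute_with_indicator([10.0, None, 30.0, None, 20.0])
print(imputed)
print(was_missing)
```

The indicator column lets a model learn from the fact that a value was missing, which the imputed median alone would hide.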
During EDA, a data scientist creates a scatter matrix of numerical features and notices that some features have a funnel-shaped pattern (variance increases with the mean). What is the appropriate transformation to stabilize variance?
Trap 1: Standardize the features using Z-scores.
Standardization does not fix heteroscedasticity.
Trap 2: Apply a sine transformation.
Sine transformation is not for variance stabilization.
Trap 3: Apply Box-Cox transformation with lambda=0.
Box-Cox with lambda=0 is equivalent to the log transformation, so this option restates the intended answer in a less direct form; the plain log transformation is the simpler choice.
- A
Apply log transformation.
Log transformation stabilizes variance when variance increases with mean.
- B
Standardize the features using Z-scores.
Why wrong: Standardization does not fix heteroscedasticity.
- C
Apply a sine transformation.
Why wrong: Sine transformation is not for variance stabilization.
- D
Apply Box-Cox transformation with lambda=0.
Why wrong: Box-Cox with lambda=0 is equivalent to the log transformation, so this option restates the intended answer in a less direct form; the plain log transformation is the simpler choice.
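A small numeric check of why the log transformation helps when spread grows with the mean. The two groups below are made up so that the second is exactly ten times the first.

```python
import math
import statistics

low = [8, 10, 12]       # mean 10, spread of about 2
high = [80, 100, 120]   # mean 100, spread of about 20

# On the raw scale the spread grows with the mean.
raw_ratio = statistics.pstdev(high) / statistics.pstdev(low)

# On the log scale a multiplicative factor becomes a constant shift,
# so both groups end up with the same spread.
log_ratio = (
    statistics.pstdev([math.log(v) for v in high])
    / statistics.pstdev([math.log(v) for v in low])
)

print(round(raw_ratio, 6), round(log_ratio, 6))
```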
Which TWO of the following are appropriate techniques for detecting outliers in a univariate continuous feature?
Trap 1: Apply a Random Forest classifier to predict outliers.
Outlier detection is unsupervised; Random Forest requires labels.
Trap 2: Remove any value that is more than one standard deviation from the…
About 68% of normally distributed data lies within one standard deviation, so this rule would discard roughly a third of the values; too aggressive.
Trap 3: Use DBSCAN clustering with default parameters.
DBSCAN is for multivariate data and requires parameter tuning.
- A
Apply a Random Forest classifier to predict outliers.
Why wrong: Outlier detection is unsupervised; Random Forest requires labels.
- B
Use Z-score and flag values with absolute Z-score > 3.
Z-score >3 is a common outlier threshold.
- C
Remove any value that is more than one standard deviation from the mean.
Why wrong: About 68% of normally distributed data lies within one standard deviation, so this rule would discard roughly a third of the values; too aggressive.
- D
Use DBSCAN clustering with default parameters.
Why wrong: DBSCAN is for multivariate data and requires parameter tuning.
- E
Use the interquartile range (IQR) and flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
IQR method is standard for univariate outlier detection.
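Both selected techniques can be sketched with the standard library. The sample values are made up, and the quartiles come from `statistics.quantiles` with its default method, so other tools may compute slightly different fences.

```python
import statistics


def zscore_outliers(values, threshold: float = 3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if std > 0 and abs(v - mean) / std > threshold]


def iqr_outliers(values, k: float = 1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Twenty typical readings plus one extreme value (hypothetical data).
values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
          11, 9, 10, 12, 8, 10, 11, 9, 10, 10, 100]
print(zscore_outliers(values))
print(iqr_outliers(values))
```

The Z-score rule needs a reasonably large sample, because a single extreme value inflates the standard deviation it is measured against; the IQR rule is less sensitive to that.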
Which THREE of the following are best practices when performing exploratory data analysis on a dataset with both numerical and categorical features?
Trap 1: Encode all categorical features using label encoding for simplicity.
Label encoding can introduce false ordinal relationships.
Trap 2: Include all categorical features with high cardinality as-is in the…
High cardinality features often need encoding or grouping.
- A
Check the proportion of missing values for each feature.
Missing value analysis is a key EDA step.
- B
Compute pairwise correlation coefficients between numerical features.
Correlation analysis detects multicollinearity.
- C
Encode all categorical features using label encoding for simplicity.
Why wrong: Label encoding can introduce false ordinal relationships.
- D
Include all categorical features with high cardinality as-is in the model.
Why wrong: High cardinality features often need encoding or grouping.
- E
Visualize the distribution of numerical features using histograms and box plots.
Visualization helps identify skewness and outliers.
A data scientist is performing exploratory data analysis on a dataset containing customer transactions. The dataset has 1 million rows with 50 features, including numerical and categorical variables. The goal is to identify patterns and potential data quality issues before building a model. Which approach should the data scientist take to efficiently explore the data?
Trap 1: Use Amazon Athena to run SQL queries and generate summary…
Athena can query data but lacks built-in profiling and visualization features.
Trap 2: Use Amazon SageMaker Data Wrangler to import the data and create a…
Data Wrangler is more focused on feature engineering and requires building a flow, not initial EDA.
Trap 3: Use Amazon SageMaker Ground Truth to label the data and then…
Ground Truth is for creating training datasets with human labelers, not for general EDA.
- A
Use AWS Glue DataBrew to profile the dataset, view data quality reports, and visualize distributions.
DataBrew provides an interactive interface for data profiling, cleaning, and visualization, making it suitable for EDA.
- B
Use Amazon Athena to run SQL queries and generate summary statistics.
Why wrong: Athena can query data but lacks built-in profiling and visualization features.
- C
Use Amazon SageMaker Data Wrangler to import the data and create a flow for feature engineering.
Why wrong: Data Wrangler is more focused on feature engineering and requires building a flow, not initial EDA.
- D
Use Amazon SageMaker Ground Truth to label the data and then analyze the labels.
Why wrong: Ground Truth is for creating training datasets with human labelers, not for general EDA.