CCNA Ml Data Engineering Questions — Page 3 of 5

151

Matching (medium)

Match each data format to its typical use in AWS ML.

Concepts

Matches

Tabular data for SageMaker built-in algorithms

Efficient binary format for SageMaker

Columnar storage for analytics

Semi-structured data, e.g., for Lambda

TensorFlow training data format

Why these pairings

Different formats are optimized for different tasks.

152

MCQ (hard)

A data engineer is troubleshooting an AWS Glue job that reads from and writes to the S3 bucket 'data-lake-bucket'. The job fails when trying to write to the 'sensitive/' prefix. The IAM policy attached to the Glue job's IAM role is shown in the exhibit. What is the MOST likely reason for the failure?

A.The IAM role does not have permission to read objects from the bucket

B.The IAM role has an explicit deny for s3:PutObject on the 'sensitive/' prefix

C.The IAM policy does not specify the bucket resource correctly

D.The IAM policy lacks a required condition for encryption

Answer: B

The Deny statement blocks write access to the sensitive prefix.

Why this answer

Option B is correct. Even though the first statement allows s3:PutObject on the entire bucket, the second statement explicitly denies s3:PutObject on the 'sensitive/' prefix. Explicit deny overrides any allow.

Option A is wrong because the policy allows s3:GetObject, so reads succeed. Option C is wrong because the policy's resource covers the bucket's objects. Option D is wrong because the failure is caused by the explicit deny, not by a missing encryption condition.
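
The exhibit is not reproduced on this page, so the following is only a sketch under assumptions: `POLICY` is a hypothetical policy shaped like the one the explanation describes, and `is_allowed` is a deliberately simplified stand-in for IAM evaluation (it ignores conditions, principals, and other policy types). It illustrates why a matching explicit deny wins over a broader allow.

```python
from fnmatch import fnmatchcase

# Hypothetical policy for illustration only; the real exhibit is not shown.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::data-lake-bucket/*",
        },
        {
            "Effect": "Deny",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::data-lake-bucket/sensitive/*",
        },
    ],
}


def is_allowed(policy: dict, action: str, resource: str) -> bool:
    """Simplified evaluation: any matching Deny wins, otherwise any matching Allow."""
    allowed = False
    for stmt in policy["Statement"]:
        if action in stmt["Action"] and fnmatchcase(resource, stmt["Resource"]):
            if stmt["Effect"] == "Deny":
                return False  # explicit deny overrides every allow
            allowed = True
    return allowed  # no matching statement at all means implicit deny


print(is_allowed(POLICY, "s3:PutObject", "arn:aws:s3:::data-lake-bucket/raw/a.csv"))
print(is_allowed(POLICY, "s3:PutObject", "arn:aws:s3:::data-lake-bucket/sensitive/a.csv"))
```

Under this assumed policy, the write to the raw prefix is allowed while the write under 'sensitive/' is denied, even though the first statement covers the whole bucket.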

153

MCQ (hard)

A data engineer needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3 for ML training. The data is currently stored in HDFS and is compressible. The network bandwidth between the on-premises data center and AWS is 1 Gbps. The team needs to minimize the time to transfer and also wants to avoid any downtime for the on-premises system. Which solution meets these requirements?

A.Set up an AWS Direct Connect connection and use rsync to copy data to S3.

B.Enable S3 Transfer Acceleration on the bucket and use the AWS CLI to copy data.

C.Install the AWS DataSync agent on-premises, configure a task to transfer data to S3 with compression enabled.

D.Use AWS Snowball Edge devices to export the data and ship them to AWS.

Answer: C

DataSync is optimized for large data transfers with compression and parallelization.

Why this answer

Option C is correct because AWS DataSync can transfer data over the network efficiently using parallel streams and compression, and its agent is installed on-premises next to the cluster. Option D is wrong because Snowball Edge requires shipping devices, which takes longer and may not be faster than an optimized network transfer. Option A is wrong because Direct Connect provides a dedicated network connection but does not include data transfer software; DataSync can run over Direct Connect.

Option B is wrong because S3 Transfer Acceleration improves speed over the public internet but may not be as fast as DataSync with compression.

154

MCQ (hard)

A data engineering team is building a real-time data pipeline using Amazon Kinesis Data Streams with AWS Lambda for processing. The pipeline ingests clickstream data from a mobile app. The team notices that occasionally, a Lambda function fails due to a transient error, and the failed record is not retried, leading to data loss. The Lambda function is configured with a batch size of 100 and a maximum retry count of 0. The team wants to ensure that all records are processed successfully, even if transient failures occur. They also want to minimize the impact of poison pill records that could block processing. Which combination of actions should the team take to address this issue?

A.Set the maximum retry count to 5 and configure a dead-letter queue on the Lambda function to capture failed records after retries.

B.Switch to using Amazon Kinesis Data Firehose to buffer data and use AWS Lambda for transformation with built-in retry logic.

C.Set the maximum retry count to 5, configure an on-failure destination Amazon SQS queue, and set up a dead-letter queue on that SQS queue for poison pills.

D.Reduce the batch size to 1 and increase the Lambda function timeout to handle transient errors.

Answer: C

This provides retries and isolates poison pills without blocking the main stream.

Why this answer

Option C is correct because increasing the maximum retry count allows Lambda to retry failed batches, and routing failures to a separate SQS queue with its own dead-letter queue (DLQ) for poison pills ensures that problematic records are isolated and can be analyzed separately while the main processing continues. Option A is incorrect because a DLQ configured on the Lambda function applies only to asynchronous invocations, not to Kinesis event source mappings, so failed stream records would not be captured. Option D is incorrect because reducing the batch size may help but does not solve the retry or poison pill problem.

Option B is incorrect because Kinesis Data Firehose is not suitable for real-time per-record processing.
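
As a rough sketch of the retry and on-failure settings discussed above, the helper below builds keyword arguments that could be passed to boto3's Lambda `update_event_source_mapping` call. The function name `build_esm_update`, the UUID, and the queue ARN are hypothetical, and `BisectBatchOnFunctionError` is an extra setting not mentioned in the question, included here only as one common way to narrow down a poison pill. No AWS call is made.

```python
def build_esm_update(mapping_uuid: str, failure_queue_arn: str) -> dict:
    """Build kwargs intended for boto3's lambda client update_event_source_mapping."""
    return {
        "UUID": mapping_uuid,
        "BatchSize": 100,                    # batch size from the scenario
        "MaximumRetryAttempts": 5,           # retry transient failures instead of 0
        "BisectBatchOnFunctionError": True,  # assumption: split failing batches to isolate a bad record
        "DestinationConfig": {
            # Batches that still fail after the retries are reported to this SQS queue.
            "OnFailure": {"Destination": failure_queue_arn}
        },
    }


# Hypothetical identifiers, for illustration only.
config = build_esm_update(
    "00000000-0000-0000-0000-000000000000",
    "arn:aws:sqs:us-east-1:123456789012:failed-batches",
)
print(config["MaximumRetryAttempts"], config["DestinationConfig"]["OnFailure"]["Destination"])
```

The dead-letter queue for poison pills would be set up separately as a redrive policy on the SQS queue itself.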

155

MCQ (hard)

A financial services company is building a fraud detection model that requires joining real-time transaction data with a reference dataset of known fraudulent accounts stored in Amazon DynamoDB. The solution must minimize latency and be highly available. The reference dataset is updated frequently (every few minutes). Which architecture should the team use?

A.Use Amazon Athena to query the DynamoDB table and join with streaming data.

B.Use Amazon Kinesis Data Analytics to process the stream and join with a DynamoDB table.

C.Use AWS Glue streaming ETL to read from Kinesis and join with DynamoDB.

D.Use Amazon SageMaker to host a model that queries DynamoDB for each inference.

Answer: B

Kinesis Data Analytics supports real-time joins with DynamoDB using reference data.

Why this answer

Option B is correct because Kinesis Data Analytics can perform real-time SQL joins with a DynamoDB table using the reference data feature, providing low latency. Option C is wrong because Glue streaming ETL processes data in micro-batches and is geared toward ETL rather than low-latency joins. Option D is wrong because SageMaker hosts models for training and inference; it is not a stream-processing service.

Option A is wrong because Athena is for querying data at rest in S3, not real-time streams.

156

MCQ (hard)

A data engineer runs the CLI command to download an object from S3. The bucket owner is 123456789012, and the engineer's IAM user has s3:GetObject permission on the bucket. The object was uploaded by a different AWS account. What is the MOST likely reason for the AccessDenied error?

A.The --expected-bucket-owner parameter is incorrect

B.The object is owned by a different AWS account, and the bucket owner has not been granted access

C.The bucket policy denies access to the engineer's IAM user

D.The IAM policy does not allow s3:GetObject for that specific key

Answer: B

Object ACLs or bucket policy must grant access to bucket owner.

Why this answer

By default, objects uploaded by another account are owned by the uploading account, and the bucket owner does not have access unless explicitly granted via bucket policy or ACL. The --expected-bucket-owner parameter only checks the bucket owner, not the object. The engineer's permissions are on the bucket, but the object is owned by another account, so the bucket owner needs additional permissions.

157

MCQ (hard)

A data engineer is designing a data lake on Amazon S3 that must support both batch and streaming analytics. The data comes in Parquet format and needs to be queryable by Amazon Athena. Which partitioning strategy will optimize query performance and reduce costs?

A.Partition by date and hour for time-based queries

B.Store data as CSV without partitioning for simplicity

C.Partition by device_id for granular access

D.Use a single partition for all data to simplify management

Answer: A

Common query patterns are time-filtered; this reduces data scanned.

Why this answer

Partitioning by date and hour allows Athena to prune partitions effectively for time-based queries, reducing data scanned. Option D is wrong because a single partition forces every query to scan all the data. Option C is wrong because partitioning by a high-cardinality column like device_id creates many small partitions.

Option B is wrong because using CSV negates the benefits of columnar storage.
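
To make the pruning idea concrete, here is a minimal sketch that lists the hourly partition prefixes a time-bounded query would need to touch. The Hive-style layout `dt=YYYY-MM-DD/hour=HH/` and both function names are assumptions chosen for illustration; the question does not specify a layout.

```python
from datetime import datetime, timedelta


def partition_prefix(ts: datetime) -> str:
    """Hive-style partition prefix for the hour containing ts (assumed layout)."""
    return f"dt={ts:%Y-%m-%d}/hour={ts:%H}/"


def prefixes_for_range(start: datetime, end: datetime) -> list[str]:
    """Hourly partitions overlapping the half-open time range [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        prefixes.append(partition_prefix(current))
        current += timedelta(hours=1)
    return prefixes


# A three-hour query window touches three partitions, not the whole table.
print(prefixes_for_range(datetime(2023, 1, 1, 22), datetime(2023, 1, 2, 1)))
```

A query filtered to that window only needs the objects under those three prefixes, which is the reduction in scanned data the explanation refers to.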

158

MCQ (easy)

A company uses Amazon Kinesis Data Streams to collect clickstream data. The data is consumed by a Lambda function that writes to DynamoDB. Occasionally, the Lambda function fails due to throttling from DynamoDB. How can the company resolve this issue without losing data?

A.Ignore the throttling errors and let Lambda retry.

B.Increase the number of shards in the Kinesis stream.

C.Use an Amazon SQS queue as a buffer between Kinesis and Lambda.

D.Decrease the batch size in the Lambda event source mapping.

Answer: D

Smaller batches reduce the write rate, avoiding throttling.

Why this answer

Decreasing the batch size reduces the number of records per invocation, lowering the write load on DynamoDB. Increasing shards would increase parallelism, potentially worsening throttling. Using SQS would add latency.

Removing error handling would lose data.

159

MCQ (medium)

A research institution is building a data lake to store genomics data. Each experiment generates multiple files totaling about 500 GB. The data is stored in Amazon S3 and needs to be processed by multiple machine learning (ML) training jobs running on Amazon SageMaker. The data has a high churn rate; after 30 days, most data becomes irrelevant and should be moved to Amazon S3 Glacier Deep Archive. The institution wants to minimize storage costs while maintaining data durability. Which S3 storage class should they use for the first 30 days?

A.Use S3 Intelligent-Tiering for all data, and set a lifecycle policy to transition to S3 Glacier Deep Archive after 30 days.

B.Use S3 One Zone-IA for all data, and set a lifecycle policy to transition to S3 Glacier Deep Archive after 30 days.

C.Use S3 Standard for all data, and set a lifecycle policy to transition to S3 Glacier Deep Archive after 30 days.

D.Use S3 Glacier Instant Retrieval for all data, and set a lifecycle policy to transition to S3 Glacier Deep Archive after 30 days.

Answer: A

Intelligent-Tiering automatically optimizes costs by moving data to lower-cost tiers when not accessed, and it provides high durability.

Why this answer

Option A is correct because S3 Intelligent-Tiering automatically moves data between access tiers based on usage, which is ideal for data with unknown or changing access patterns. Option C is wrong because S3 Standard is more expensive for data that may not be accessed frequently after the initial processing. Option B is wrong because S3 One Zone-IA stores data in a single Availability Zone, so it is not resilient to the loss of that AZ.

Option D is wrong because S3 Glacier Instant Retrieval is for long-lived, rarely accessed data requiring millisecond access; it is not cost-effective for the first 30 days.

160

MCQ (easy)

A data engineer is designing a data lake on Amazon S3. The data comes from various sources, including IoT devices, web logs, and transactional databases. The engineer needs to organize the data in a way that supports efficient querying using Amazon Athena and allows for easy management of access permissions. Which S3 bucket structure is the most appropriate?

A.Store all data in a single prefix without any partitioning.

B.Use a prefix structure like s3://bucket/source/year/month/day/.

C.Store all data in separate S3 buckets for each source and date.

D.Use a prefix structure like s3://bucket/date/source/.

Answer: B

This structure enables partition pruning by source and time, optimizing Athena queries and allowing granular access control at the source level.

Why this answer

Option B is correct because partitioning by source, year, month, and day allows Athena to prune partitions, reducing scan costs and improving performance. Option A is wrong because storing all data in a flat structure forces full scans. Option C is wrong because separate buckets per source and date are unnecessary: prefix-based access controls can be applied at the source level within a single partitioned bucket.

Option D is wrong because using date as the first partition level makes it harder to manage permissions by source.

161

MCQ (medium)

An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to an Amazon S3 bucket. The data is then queried using Amazon Athena. The marketing team wants to run daily reports that aggregate click events by product ID. However, the reports are slow because Athena scans the entire dataset each time. The data is partitioned by date (e.g., s3://bucket/clickstream/2023/01/01/). The product ID is a column within the data. The data engineering team wants to improve query performance without moving the data to another service. Which approach should the team take?

A.Convert the data from JSON to Parquet format

B.Use Amazon Redshift Spectrum to query the data

C.Create a view in Athena that filters by product ID

D.Repartition the data by product ID in addition to date

Answer: D

Partitioning by product ID allows Athena to skip irrelevant partitions.

Why this answer

Partitioning by product ID would allow Athena to prune partitions for queries filtering by product ID. Option A is wrong because converting to Parquet alone may improve performance but does not eliminate the full scan. Option C is wrong because creating a view does not change physical storage.

Option B is wrong because Redshift Spectrum still requires scanning the same data.

162

MCQ (hard)

A company uses Amazon Redshift for its data warehouse. The data engineering team needs to load 10 TB of data from Amazon S3 into Redshift every night. The team wants to minimize the load time and use the fewest number of COPY commands. The data is in CSV format and is partitioned by date in S3. Which approach should the team take?

A.Use a manifest file with a single COPY command.

B.Use multiple COPY commands, one per partition.

C.Concatenate all data into a single large file before loading.

D.Use AWS Glue to transform the data and then load into Redshift.

Answer: A

A manifest file allows Redshift to load from multiple files in parallel efficiently.

Why this answer

Option A is correct. Using a manifest file with a single COPY command allows Redshift to load multiple files in parallel, optimizing throughput. Option B is wrong because using multiple COPY commands increases overhead.

Option C is wrong because loading all data in one file is impractical for 10 TB. Option D is wrong because AWS Glue adds unnecessary overhead and cost for a simple load task.
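
A minimal sketch of the manifest approach follows, assuming the standard Redshift manifest layout (a JSON object with an `entries` list of `url`/`mandatory` pairs). The helper names, table name, bucket paths, and role ARN are all hypothetical, and the generated COPY statement is illustrative rather than a tuned production command.

```python
import json


def build_manifest(file_urls: list[str]) -> str:
    """Return manifest JSON listing every S3 file the single COPY should load."""
    entries = [{"url": url, "mandatory": True} for url in file_urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_sql(table: str, manifest_url: str, iam_role_arn: str) -> str:
    """Return one COPY statement that loads all files named in the manifest."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV MANIFEST;"
    )


# Hypothetical file list for one nightly load.
files = [f"s3://example-bucket/sales/dt=2023-01-01/part-{i:04d}.csv" for i in range(4)]
print(build_manifest(files))
print(build_copy_sql(
    "sales",
    "s3://example-bucket/manifests/2023-01-01.manifest",
    "arn:aws:iam::123456789012:role/example-redshift-load",
))
```

Splitting the input into many files of similar size is what lets the single COPY spread the work across the cluster.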

163

MCQ (medium)

A team is using Amazon SageMaker to train a model on a dataset that is 500 GB in size, stored as CSV files in S3. The training job takes 2 hours using a single ml.p3.2xlarge instance. The team wants to reduce training time to under 30 minutes. The model architecture supports distributed training. Which solution will achieve this goal with the LEAST amount of code changes?

A.Use managed spot training to reduce cost and then use cost savings to train with a larger instance.

B.Use a single ml.p3.16xlarge instance with more GPUs and memory.

C.Use multiple ml.p3.2xlarge instances with SageMaker's distributed data parallelism library, enabling automatic sharding of the training data.

D.Change the input mode to Pipe mode to stream data from S3 directly, reducing I/O wait time.

Answer: C

Distributed training across multiple instances reduces time proportionally; minimal code changes with SageMaker's SDK.

Why this answer

Option C is correct because SageMaker's distributed data parallelism library automatically shards the training data across multiple ml.p3.2xlarge instances, enabling parallel gradient computation and reducing wall-clock training time from 2 hours to under 30 minutes without requiring manual code changes to the training script. The model architecture already supports distributed training, so the library handles the communication and synchronization (e.g., AllReduce) transparently.

Exam trap

The trap here is that candidates often confuse 'larger instance' (Option B) with 'distributed training' (Option C), failing to realize that a single large instance cannot parallelize data loading and gradient computation across multiple nodes, while distributed data parallelism with multiple smaller instances can achieve the required speedup with minimal code changes.

How to eliminate wrong answers

Option A is wrong because managed spot training reduces cost but does not inherently reduce training time; using a larger instance with spot training still requires code changes for distributed training and may not achieve the sub-30-minute goal. Option B is wrong because a single ml.p3.16xlarge instance, while having more GPUs and memory, still processes data sequentially on one node and cannot scale training time linearly to under 30 minutes for a 500 GB dataset without distributed data parallelism across multiple instances. Option D is wrong because Pipe mode streams data directly from S3 to reduce I/O wait time, but it does not parallelize computation across multiple GPUs or instances, so the training time remains bound by the single-instance compute capacity.
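
The speedup target can be checked with simple arithmetic. The sketch below assumes an idealized model in which training time scales inversely with the number of instances multiplied by a scaling efficiency; real jobs rarely scale this cleanly, and `min_instances` is a hypothetical helper rather than anything SageMaker provides.

```python
import math


def min_instances(baseline_minutes: float, target_minutes: float,
                  efficiency: float = 1.0) -> int:
    """Smallest n such that baseline / (n * efficiency) is strictly below target."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    # n must be strictly greater than baseline / (target * efficiency).
    return math.floor(baseline_minutes / (target_minutes * efficiency)) + 1


# 2 hours down to under 30 minutes needs more than a 4x speedup.
print(min_instances(120, 30))        # perfect scaling
print(min_instances(120, 30, 0.75))  # assumed 75% scaling efficiency
```

Under these assumptions, five instances are the minimum with perfect scaling and six at 75% efficiency, since four instances would only reach exactly 30 minutes.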

164

MCQ (easy)

A company uses Amazon S3 to store log files from various applications. The logs are in JSON format and are appended to existing files every few minutes. A data analyst wants to run SQL queries on the logs using Amazon Athena. However, queries return incomplete results because Athena does not support modifying data. The team needs to enable querying of the latest log data with minimal changes to the existing ingestion process. Which solution should the team implement?

A.Convert the logs to Parquet format using a scheduled AWS Glue job and store them in a separate S3 bucket.

B.Stream the logs to Amazon Kinesis Data Firehose, which writes the data to S3 in Parquet format.

C.Create an Athena table using the Hive JSON SerDe that reads the logs directly from the existing S3 bucket.

D.Use AWS Glue to load the JSON logs into Amazon Redshift and query using Redshift.

Answer: C

Athena can query JSON logs with the correct SerDe without changing the ingestion.

Why this answer

Option C is correct because Athena supports reading JSON data with the Hive JSON SerDe. By creating a table with the appropriate SerDe, the analyst can query the JSON logs directly. Option A is wrong because converting to Parquet adds a separate scheduled job and a second bucket on top of the existing process.

Option D is wrong because Glue ETL to load into Redshift is overkill and adds latency. Option B is wrong because Kinesis Data Firehose would require changing the ingestion pipeline.
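
For illustration, the helper below generates an Athena `CREATE EXTERNAL TABLE` statement that uses the Hive JSON SerDe class `org.apache.hive.hcatalog.data.JsonSerDe`. The table name, column names, and S3 location are hypothetical, since the question does not describe the log schema; treat this as a sketch rather than the team's actual DDL.

```python
def json_table_ddl(table: str, columns: dict[str, str], location: str) -> str:
    """Return Athena DDL for an external table over JSON files using the Hive JSON SerDe."""
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'\n"
        f"LOCATION '{location}';"
    )


# Hypothetical schema and bucket, for illustration only.
print(json_table_ddl(
    "app_logs",
    {"ts": "string", "level": "string", "message": "string"},
    "s3://example-logs/app/",
))
```

The table points at the existing bucket, so the ingestion process itself is left unchanged.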

165

MCQ (easy)

A company wants to build a data lake on Amazon S3. The data lake will store raw data in its original format and also transformed data in Parquet. The data is generated by various sources and must be cataloged for discovery. Which service should the company use to automatically discover, catalog, and make the data searchable?

A.AWS Glue Data Catalog

B.Amazon S3

C.Amazon Athena

D.Amazon EMR

Answer: A

Glue Data Catalog is a managed metadata repository with crawlers.

Why this answer

Option A is correct. AWS Glue Data Catalog is a central metadata repository, and Glue crawlers can automatically crawl S3 data sources to populate it. Option C is wrong because Athena is a query engine, not a catalog.

Option D is wrong because EMR is a processing framework. Option B is wrong because S3 is storage.

166

MCQ (hard)

A company uses AWS Glue ETL jobs to process data from an Amazon RDS for MySQL database into Amazon S3. The job runs daily and takes 6 hours to complete. The team wants to reduce runtime and cost. The source table has 50 million rows and is updated continuously. Which combination of changes would be MOST effective?

A.Use a single worker with a larger instance type.

B.Increase the number of DPUs and enable job bookmarking.

C.Use JDBC connections with pushdown predicates and increase the number of DPUs.

D.Change the job trigger from time-based to event-based.

Answer: C

Pushdown predicates filter data at source, reducing data transfer; more DPUs parallelize the work.

Why this answer

Option C is correct because JDBC connections with pushdown predicates reduce the data transferred from the source, and increasing DPUs parallelizes processing. Option B is wrong because increasing DPUs without pushdown may cause a bottleneck on the source database. Option A is wrong because a single worker cannot process 50 million rows quickly.

Option D is wrong because changing the trigger type does not reduce runtime.

167

Drag & Drop (medium)

Drag and drop the steps to set up Amazon SageMaker Ground Truth for a labeling job in the correct order.

Steps

Order

Why this order

Ground Truth setup involves dataset preparation, job creation, task configuration, instructions, and execution.

168

Multi-Select (medium)

A data engineer is designing a streaming pipeline using Amazon Kinesis Data Analytics for Apache Flink. The pipeline reads from a Kinesis data stream and writes to an S3 bucket. The job must recover quickly from failures without reprocessing large amounts of data. Which TWO configurations should be used? (Choose TWO)

Select 2 answers

A.Enable checkpointing with a state backend like RocksDB.

B.Use in-memory state backend for low latency.

C.Configure the S3 sink to use exactly-once delivery semantics.

D.Set the parallelism to the maximum number of shards.

E.Increase the retention period of the Kinesis stream to 365 days.

Answers: A, C

Checkpointing enables state recovery after failure.

Why this answer

Option A is correct because enabling checkpointing with a state backend like RocksDB allows Apache Flink to periodically save the state of the streaming application to durable storage. In the event of a failure, Flink can restart from the last completed checkpoint, avoiding the need to reprocess large amounts of data from the beginning of the stream. RocksDB is specifically designed for large state and provides fast recovery by storing state on disk with memory caching, making it ideal for production streaming pipelines.

Exam trap

The trap here is that candidates often confuse parallelism or stream retention settings with fault-tolerance mechanisms, mistakenly believing that increasing parallelism or retention alone can prevent data reprocessing, when in fact only checkpointing with a durable state backend ensures fast recovery.

169

MCQ (medium)

A data engineer needs to transform large CSV files stored in Amazon S3 into Parquet format before loading into Amazon Redshift. The transformation logic is complex and requires custom Python code. Which AWS service should be used to perform this transformation with minimal operational overhead?

A.AWS Glue

B.AWS Lambda

C.Amazon EMR

D.AWS Data Pipeline

Answer: A

Glue is a serverless ETL service that can run complex transformations on data in S3 and write to Parquet.

Why this answer

Option A is correct because AWS Glue provides a serverless Spark environment with built-in support for ETL jobs, including converting CSV to Parquet. Option B (AWS Lambda) has a 15-minute timeout and is not suitable for large files. Option C (Amazon EMR) requires managing clusters.

Option D (AWS Data Pipeline) is a legacy service with less flexibility.

170

Multi-Select (easy)

A company wants to build a data lake on Amazon S3. The data lake should support both batch and real-time data ingestion. Which AWS services should be used for data ingestion? (Choose TWO.)

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Firehose

C.Amazon Redshift

D.Amazon Athena

E.Amazon SQS

Answers: A, B

Glue performs batch ETL and can ingest data into S3.

Why this answer

Option B (Kinesis Data Firehose) is for real-time ingestion, and Option A (AWS Glue) is for batch ETL. Option D (Athena) is for querying, Option C (Redshift) is a warehouse, and Option E (SQS) is not for ingestion into S3.

171

MCQ (hard)

A data engineer is designing a data pipeline that transforms raw JSON files (each 50-200 KB) in Amazon S3 into Parquet format using AWS Glue. The pipeline must minimize data processing costs and handle a high volume of small files (millions per day). The engineer configures a Glue ETL job with Spark, but the job is slow and expensive due to overhead of reading many small files. Which optimization should the engineer implement to reduce cost and improve performance?

A.Increase the worker type to G.2X for more memory per worker.

B.Increase the number of DPUs allocated to the Glue job.

C.Change the output format from Parquet to CSV to reduce compression overhead.

D.Use S3 object grouping or batch operations to combine small files before Glue processing.

Answer: D

Combining small files reduces task overhead, leading to faster and cheaper jobs.

Why this answer

Option D is correct. Using S3 object grouping (e.g., S3 batch operations or partitioning) to create larger files reduces the number of tasks and overhead in Spark. Option B is wrong because increasing DPUs increases cost without addressing the small file problem.

Option C is wrong because converting to CSV is not more efficient than Parquet; Parquet is columnar and efficient. Option A is wrong because a larger worker type also increases cost without solving the small file issue.
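
The effect of grouping can be sketched with a small simulation. The function below greedily packs file sizes into groups aimed at a target size; it is a hypothetical stand-in for whatever grouping mechanism is used (AWS Glue itself exposes `groupFiles` and `groupSize` read options for this purpose), and the 128 MiB target and file counts are assumed figures.

```python
def group_files(sizes_bytes: list[int], target_group_bytes: int) -> list[list[int]]:
    """Greedily pack file sizes into groups that stay within target_group_bytes.

    A single file larger than the target still gets its own group.
    """
    groups: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sizes_bytes:
        if current and current_size + size > target_group_bytes:
            groups.append(current)  # close the full group and start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 10,000 files of 100 KB each, grouped toward an assumed 128 MiB target.
groups = group_files([100_000] * 10_000, 128 * 1024 * 1024)
print(len(groups))  # far fewer read units than 10,000 individual files
```

Fewer, larger read units mean fewer Spark tasks and less per-file overhead, which is where the cost and runtime savings come from.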

172

MCQ (easy)

A data scientist needs to run a one-time SQL query on a large dataset in Amazon S3. The dataset is stored in Parquet format and is about 500 GB. The query requires complex aggregations and joins. Which AWS service should be used to minimize cost and setup time?

A.Amazon Redshift

B.Amazon Athena

C.Amazon RDS for MySQL

D.Amazon EMR with Spark SQL

Answer: B

Serverless, pay-per-query, no setup required.

Why this answer

Amazon Athena is serverless and charges per query, ideal for one-time queries on S3 data. Option D (Amazon EMR) requires cluster setup and management. Option A (Amazon Redshift) requires provisioning a cluster.

Option C (Amazon RDS) is not designed for direct S3 querying.

173

MCQ (hard)

A data engineer is designing a data pipeline that ingests 500 GB of data daily from an on-premises Oracle database to Amazon S3. The pipeline must minimize data loss and support change data capture (CDC). Which combination of services should they use?

A.AWS Database Migration Service (DMS) with ongoing replication

B.AWS Data Pipeline with SQL query

C.Amazon Kinesis Data Streams with a custom Oracle CDC connector

D.AWS Glue ETL jobs running on a schedule

Answer: A

DMS supports CDC and can write to S3.

Why this answer

AWS Database Migration Service (DMS) with ongoing replication enables CDC from Oracle to S3. Option D is wrong because Glue does not support CDC directly. Option C is wrong because Kinesis requires custom agents.

Option B is wrong because Data Pipeline is batch-oriented.

174

MCQ (easy)

A company needs to move 10 TB of data from an on-premises NAS to Amazon S3 over a 100 Mbps internet connection. The transfer must complete within 3 days. Which solution is the most appropriate?

A.Use AWS DataSync to transfer over the internet

B.Enable S3 Transfer Acceleration on the bucket

C.Use AWS CLI to copy data directly over the internet

D.Use AWS Snowball Edge to transfer the data

Answer: D

Snowball Edge provides physical transport, faster than internet for large data.

Why this answer

Option D is correct because AWS Snowball Edge is a physical device that can move large data volumes faster than this internet link. Option C is wrong because at 100 Mbps, 10 TB would take roughly 9 to 10 days even at full link utilization. Option A is wrong because AWS DataSync still depends on the same 100 Mbps link. Option B is wrong because S3 Transfer Acceleration optimizes the path to S3 but cannot push more data through a 100 Mbps link, so it is still insufficient.
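
The transfer-time estimate is straightforward to verify. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and ignores protocol overhead; `transfer_days` is a hypothetical helper.

```python
def transfer_days(size_tb: float, link_mbps: float, utilization: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link_mbps link."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86_400  # seconds per day


print(round(transfer_days(10, 100), 2))       # 100 Mbps, fully utilized
print(round(transfer_days(10, 100, 0.8), 2))  # assumed 80% sustained utilization
```

At full utilization the transfer takes about 9.3 days, well beyond the 3-day window, which is why every network-based option falls short here.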

175

Multi-Select (hard)

A company is designing a data pipeline that ingests streaming data from social media feeds. The data must be processed in real-time to detect trending topics, and results must be stored in Amazon DynamoDB for low-latency access. Which services should the company use? (Choose TWO.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.AWS Lambda

C.Amazon Simple Queue Service (SQS)

D.Amazon Kinesis Data Analytics

E.Amazon Kinesis Data Streams

Answers: D, E

Provides real-time analytics to detect trending topics.

Why this answer

Option E (Kinesis Data Streams) is required for ingestion, and Option D (Kinesis Data Analytics) is required for real-time trending detection. Option C (SQS) is not for streaming. Option B (Lambda) can process records but is not the best fit for real-time analytics on streams.

Option A (Firehose) is near-real-time, not real-time.

176

MCQ (medium)

A company is streaming real-time sensor data from IoT devices to Amazon Kinesis Data Streams. The data is then consumed by an AWS Lambda function that enriches the records with metadata from an Amazon DynamoDB table and writes the results to an Amazon S3 bucket. Recently, the Lambda function has been failing with 'ProvisionedThroughputExceededException' errors from DynamoDB. The data volume is variable, with occasional bursts. Which solution should a data engineer implement to resolve this issue without losing data?

A.Increase the DynamoDB table's provisioned read capacity units to a high static value.

B.Use an Amazon SQS queue to buffer the Lambda requests before querying DynamoDB.

C.Enable DynamoDB auto scaling for the table to automatically adjust read capacity based on demand.

D.Configure an Amazon SNS topic to throttle the data stream before it reaches Lambda.

Answer: C

Auto scaling adjusts capacity dynamically to handle bursts without manual intervention.

Why this answer

Option C is correct because enabling DynamoDB auto scaling dynamically adjusts read/write capacity based on traffic patterns, handling bursts without manual intervention. Option A (a high static read capacity) is costly and may still not cover every peak. Option B (SQS) introduces latency and does not address the DynamoDB throttling directly.

Option D (SNS) is for push notifications, not for resolving throughput issues.

177

Multi-Select (medium)

A data engineer is designing a data pipeline that uses Amazon S3 events to trigger an AWS Lambda function for processing. The pipeline must handle high throughput with low latency. Which TWO configurations should be applied?

Select 2 answers

A.Configure Lambda with reserved concurrency

B.Use an SQS queue between S3 and Lambda

C.Place Lambda in a VPC to reduce network latency

D.Use Amazon Kinesis Data Streams as an intermediary

E.Enable S3 Event Notifications to invoke Lambda directly

Answers: A, E

Ensures Lambda has enough capacity to handle bursts.

Why this answer

Reserved concurrency ensures that the Lambda function always has a guaranteed number of concurrent executions available, preventing it from being throttled by other functions in the same AWS account. This is critical for high-throughput, low-latency pipelines because S3 event notifications can burst many invocations simultaneously, and without reserved concurrency, the function might hit the account-level concurrency limit and drop events.

Exam trap

The trap here is that candidates often confuse 'reducing latency' with 'using a VPC' or 'adding a queue,' but for S3-triggered Lambda, direct invocation with reserved concurrency is the simplest and lowest-latency path, while VPCs and queues add overhead.

Practice this question →

178

MCQhard

Refer to the exhibit. A CloudFormation template creates an S3 bucket. The data engineering team stores daily log files in this bucket and queries them using Amazon Athena. After 30 days, queries on logs older than 30 days start failing with 'Access Denied' errors. What is the MOST likely reason?

A.The lifecycle rule transitions objects to GLACIER after 30 days, making them inaccessible to Athena.

B.The bucket uses default encryption with SSE-S3, which Athena does not support.

C.The lifecycle rule deletes objects after 30 days.

D.The bucket policy denies access to objects older than 30 days.

AnswerA

Athena cannot query GLACIER objects; they must be restored first.

Why this answer

Option A is correct because the lifecycle rule transitions objects to GLACIER after 30 days, and Athena cannot query objects in the GLACIER storage class until they are restored. Option B is wrong because SSE-S3 is not mentioned. Option C is wrong because objects are not deleted until 365 days.

Option D is wrong because the transition to GLACIER does not affect permissions; the failure comes from the storage class, not from a bucket policy.
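
The lifecycle behavior in this question can be sketched as a small model. This is a minimal illustration, assuming a rule that transitions objects to GLACIER at 30 days and expires them at 365 days (the 365-day figure comes from the explanation above); the function and constant names are hypothetical and nothing here is read from a real template.

```python
# Hypothetical model of the lifecycle rule described in the question.
# Both thresholds are assumptions taken from the question text.
TRANSITION_DAYS = 30   # objects move to GLACIER at this age
EXPIRATION_DAYS = 365  # objects are deleted at this age


def storage_state(age_days: int) -> str:
    """Return the state of an object of the given age under the rule."""
    if age_days >= EXPIRATION_DAYS:
        return "EXPIRED"
    if age_days >= TRANSITION_DAYS:
        return "GLACIER"
    return "STANDARD"


def athena_can_read(age_days: int) -> bool:
    """In this model, only objects still in STANDARD are directly queryable."""
    return storage_state(age_days) == "STANDARD"
```

Under these assumptions, logs between 30 and 365 days old still exist but sit in GLACIER, which is the window in which the queries start failing.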

Practice this question →

179

MCQeasy

A company stores sensitive customer data in Amazon S3. The company must ensure that data is encrypted at rest. The company also needs to manage the encryption keys using an AWS service that allows automatic rotation of keys. Which solution meets these requirements?

A.Use client-side encryption with AWS CloudHSM

B.Use Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS) and enable automatic key rotation

C.Use Server-Side Encryption with Customer-Provided Keys (SSE-C)

D.Use Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3)

AnswerB

SSE-KMS allows automatic rotation of KMS keys.

Why this answer

Option B is correct because AWS KMS with automatic key rotation meets the encryption and key management requirements. Option D is wrong because SSE-S3 uses S3-managed keys, so the company cannot manage the keys or their rotation. Option C is wrong because SSE-C requires the customer to manage keys.

Option A is wrong because CloudHSM is a hardware security module but does not automatically rotate keys.

Practice this question →

180

MCQhard

A data engineer needs to build a pipeline that ingests CSV files from an S3 bucket, validates the schema, and loads the data into an Amazon Redshift cluster. The pipeline must handle schema evolution gracefully by adding new columns as they appear in the source files. Which combination of AWS services and configurations would meet these requirements with minimal operational overhead?

A.Use AWS Glue to create a crawler that updates the schema, then use Redshift Spectrum to query the data directly from S3

B.Use Amazon Kinesis Data Firehose to ingest the files and load into Redshift, with a Lambda function to detect schema changes

C.Use Amazon Athena to create external tables with schema-on-read, and insert results into Redshift using INSERT INTO

D.Use AWS Glue to create a crawler and an ETL job that writes to Redshift, with 'resolveChoice' to handle new columns

AnswerD

Glue handles schema evolution via DynamicFrame and resolveChoice, and loads into Redshift.

Why this answer

Option D is correct because AWS Glue can crawl the S3 data to infer schema, and Glue ETL jobs can handle schema evolution using DynamicFrame and resolveChoice. Option B is wrong because Kinesis Data Firehose is for streaming, not batch CSV files. Option A is wrong because Redshift Spectrum does not handle schema evolution automatically.

Option C is wrong because Athena is an interactive query engine, not an ETL pipeline.
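
The idea of "adding new columns as they appear" can be illustrated with a short sketch. This is not Glue's API and does not reproduce resolveChoice; `merge_schema` is a hypothetical helper that assumes a schema is a simple mapping of column name to type.

```python
def merge_schema(current: dict[str, str], incoming: dict[str, str]) -> dict[str, str]:
    """Return the current schema extended with columns seen only in incoming.

    Existing columns keep their current type; new columns are appended in
    the order they appear in the incoming file.
    """
    merged = dict(current)
    for column, col_type in incoming.items():
        if column not in merged:
            merged[column] = col_type
    return merged
```

For example, merging `{"id": "int", "name": "string"}` with a file that carries `id` and a new `email` column yields a schema with `id`, `name`, and `email`, leaving the existing types untouched.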

Practice this question →

181

MCQeasy

A data engineer needs to process streaming data from an IoT fleet and store the results in Amazon S3 for analysis. The solution must be serverless and handle data that arrives at irregular intervals. Which AWS service should be used to ingest the data?

A.Amazon S3

B.AWS IoT Core

C.Amazon Simple Queue Service (SQS)

D.Amazon Kinesis Data Streams

AnswerB

AWS IoT Core provides secure device connectivity, message routing, and integrates with serverless processing.

Why this answer

Option B is correct because AWS IoT Core is designed to ingest data from IoT devices securely and at scale, and it integrates with other AWS services for processing. Option D is wrong because Kinesis Data Streams is for real-time streaming but not specifically for IoT device connectivity. Option C is wrong because SQS is a message queue, not optimized for IoT ingestion.

Option A is wrong because S3 is storage, not ingestion.

Practice this question →

182

MCQhard

Refer to the exhibit. An ML engineer applies this bucket policy to an S3 bucket. The SageMaker execution role MySageMakerRole is used to train a model. The training data is located in s3://my-bucket/data/. The SageMaker training job fails with an access error. What is the most likely cause?

A.The policy allows GetObject only from the data/ prefix, but the training job uses a different prefix.

B.The role is not in the same AWS account as the bucket.

C.The Deny statement on s3:ListBucket prevents the role from listing objects in the bucket.

D.The bucket has default encryption enabled, causing a conflict.

AnswerC

SageMaker may need to list objects to iterate over files; the explicit deny blocks this.

Why this answer

Option C is correct because the Deny statement explicitly denies the ListBucket action to all principals, including the SageMaker role. Even though GetObject is allowed, SageMaker often needs to list objects to read data from a prefix. Option B is wrong because the role is in the same account.

Option D is wrong because the bucket is not encrypted. Option A is wrong because the policy does not restrict GetObject for the data/ prefix that the training job uses.
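
The "explicit deny wins" rule can be sketched as a simplified evaluator. The exhibit is not reproduced here, so the `POLICY` statements below are assumptions shaped like the policy the question describes, and `is_allowed` is a hypothetical helper that ignores principals and conditions.

```python
from fnmatch import fnmatchcase

# Hypothetical statements; the real exhibit may differ.
POLICY = [
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::my-bucket/data/*"},
    {"Effect": "Deny", "Action": "s3:ListBucket",
     "Resource": "arn:aws:s3:::my-bucket"},
]


def is_allowed(action: str, resource: str, statements=POLICY) -> bool:
    """Simplified evaluation: explicit Deny wins, then Allow, else implicit deny."""
    matching = [
        s for s in statements
        if fnmatchcase(action, s["Action"]) and fnmatchcase(resource, s["Resource"])
    ]
    if any(s["Effect"] == "Deny" for s in matching):
        return False
    return any(s["Effect"] == "Allow" for s in matching)
```

With these assumed statements, reading an object under data/ is allowed, while listing the bucket is denied, which is enough to break a job that enumerates a prefix before reading it.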

Practice this question →

183

MCQmedium

A company is using Amazon SageMaker to train machine learning models. The training data is stored in Amazon S3, but the data includes personally identifiable information (PII) that must be anonymized before training. What is the most efficient way to anonymize the data?

A.Use an AWS Glue ETL job to read from S3, apply anonymization, and write to another S3 bucket.

B.Use Amazon Athena to query the data and apply anonymization functions.

C.Use Amazon Redshift Spectrum to query and anonymize data in S3.

D.Use a SageMaker Processing job to read from S3 and apply anonymization.

AnswerA

Glue is a serverless ETL service that can efficiently transform large datasets.

Why this answer

Option A is correct because AWS Glue can run a transformation job to anonymize PII before training. Option D is wrong because SageMaker Processing jobs are for feature engineering, not data anonymization from S3. Option B is wrong because Athena is for querying, not transforming.

Option C is wrong because Redshift Spectrum queries data in S3 but does not anonymize efficiently.

Practice this question →

184

Multi-Selectmedium

A company is designing a data pipeline to analyze customer behavior. The pipeline must handle real-time streaming data and batch data. The data must be stored in a data lake on Amazon S3 and also made available for interactive queries. Which THREE services should be combined to build this pipeline? (Choose THREE.)

Select 3 answers

A.Amazon Kinesis Data Streams

B.AWS Glue

C.Amazon Redshift

D.Amazon DynamoDB Streams

E.Amazon Athena

AnswersA, B, E

Real-time data ingestion.

Why this answer

Amazon Kinesis Data Streams ingests real-time data. AWS Glue can perform ETL and catalog the data. Amazon Athena allows interactive querying on S3.

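Options C and D are not needed: Amazon Redshift is redundant when Athena already provides interactive queries on S3, and DynamoDB Streams captures changes to a DynamoDB table rather than ingesting general streaming data.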

Practice this question →

185

Multi-Selecteasy

Which TWO AWS services can be used to transform data in a streaming fashion without using a persistent cluster? (Choose 2.)

Select 2 answers

A.AWS Glue

B.Amazon EMR

C.AWS Lambda

D.Amazon Kinesis Data Analytics

E.AWS Data Pipeline

AnswersC, D

Lambda can process streaming data from Kinesis or DynamoDB Streams serverlessly.

Why this answer

Option C (Lambda) and Option D (Kinesis Data Analytics) are serverless streaming transformation services. Option A (Glue) is serverless but not low-latency streaming. Option B (EMR) requires a cluster.

Option E (Data Pipeline) is for batch.

Practice this question →

186

MCQhard

A financial services company needs to build a data lake on Amazon S3 that meets regulatory requirements for data retention and encryption. Data must be encrypted at rest and in transit, and access must be audited. The data lake will be queried by Amazon Athena and Amazon Redshift Spectrum. Which combination of actions should be taken?

A.Enable S3 default encryption with SSE-KMS and enable AWS CloudTrail for S3 data events.

B.Use IAM policies to control access and enable S3 server access logging.

C.Use SSL/TLS for all connections and enable S3 versioning.

D.Enable S3 default encryption with SSE-S3 and use S3 access logs.

AnswerA

SSE-KMS provides encryption with managed keys; CloudTrail logs data events for auditing.

Why this answer

Option A is correct because server-side encryption with KMS (SSE-KMS) provides encryption at rest, and CloudTrail logs S3 API calls for auditing. Option D is wrong because SSE-S3 does not provide key management control. Option C is wrong because SSL/TLS is for in-transit encryption, not at rest.

Option B is wrong because IAM does not provide encryption.

Practice this question →

187

MCQeasy

A company is using Amazon Kinesis Data Firehose to load streaming data into an S3 bucket. The data schema evolves over time, with new columns added. The data must be queryable using Amazon Athena. What is the BEST way to handle schema changes?

A.Manually update the Athena table definition each time a new column is added

B.Configure Firehose to convert the data to Apache JSON format

C.Use AWS Glue Crawlers to automatically detect schema changes and update the table metadata

D.Recreate the Athena table daily to pick up new columns

AnswerC

Glue Crawlers can run on a schedule to discover new columns and update the Data Catalog.

Why this answer

Option C is correct. Using Glue Crawlers to update the schema and partitioning by date allows Athena to handle schema evolution gracefully. Option A (manual updates) works with Athena's schema-on-read model but adds operational effort that a crawler removes.

Option B (convert to JSON) is not necessary. Option D (recreate table) is disruptive.

Practice this question →

188

MCQeasy

A data engineer needs to run a one-time ETL job to transform 500 GB of data from Amazon RDS to Amazon S3. The job should be cost-effective and require minimal infrastructure management. Which AWS service should be used?

A.AWS Glue

B.Amazon EMR

C.Amazon Athena

D.AWS Data Pipeline

AnswerA

Glue is serverless, cost-effective, and ideal for one-time ETL.

Why this answer

Option A is correct because AWS Glue is serverless and suitable for one-time ETL jobs. Option B is wrong because EMR requires cluster management and is more expensive for one-time jobs. Option D is wrong because Data Pipeline is a managed service but still requires provisioning.

Option C is wrong because Athena is for querying, not ETL.

Practice this question →

189

MCQmedium

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be transformed before being stored in Amazon S3. The transformations include enrichment with reference data from Amazon DynamoDB. Which AWS service should be used to perform the transformation with minimal operational overhead?

A.Amazon Kinesis Data Firehose with data transformation

B.AWS Lambda functions invoked by Kinesis Data Streams

C.Amazon Kinesis Data Analytics for Apache Flink

D.Amazon EMR with Apache Spark Streaming

AnswerC

Managed Flink application can perform complex transformations and enrichments with low operational overhead.

Why this answer

Amazon Kinesis Data Analytics for Apache Flink allows real-time stream processing with Flink, including enrichment from external sources like DynamoDB. Option B (AWS Lambda) is simpler but has a 15-minute timeout and may not handle high throughput well. Option D (Amazon EMR) requires cluster management.

Option A (Amazon Kinesis Data Firehose) is for delivery and can invoke Lambda for transformation, but complex transformations with external lookups are better handled by Data Analytics.

Practice this question →

190

MCQhard

A company uses Amazon EMR to run Spark jobs on a large dataset stored in Amazon S3. The jobs are failing with 'OutOfMemoryError' in the executors. The data is not skewed. Which configuration change will most likely resolve the issue?

A.Enable Kryo serialization

B.Decrease the number of shuffle partitions

C.Increase the spark.executor.memoryOverhead setting

D.Increase the number of executor cores

AnswerC

Memory overhead handles JVM overhead and off-heap memory, preventing OOM errors.

Why this answer

Increasing the executor memory overhead provides additional memory for JVM overhead and off-heap allocations, which can prevent OutOfMemoryError. Option D (increasing cores) may increase parallelism but adds no memory; more concurrent tasks then share the same executor memory. Option B (decreasing shuffle partitions) makes each partition larger, which tends to increase memory use per task rather than reduce it.

Option A (using Kryo serialization) reduces memory usage but is not as effective as increasing overhead.
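
A rough sketch of the arithmetic behind the setting may help. It assumes Spark's documented default, where the overhead is 10% of executor memory with a floor of 384 MiB when spark.executor.memoryOverhead is not set; the function names are hypothetical, and actual cluster defaults on EMR may differ.

```python
from typing import Optional

MIN_OVERHEAD_MIB = 384   # assumed floor for the default overhead
OVERHEAD_FACTOR = 0.10   # assumed default overhead factor


def default_overhead_mib(executor_memory_mib: int) -> int:
    """Overhead used when spark.executor.memoryOverhead is not set."""
    return max(MIN_OVERHEAD_MIB, int(executor_memory_mib * OVERHEAD_FACTOR))


def container_request_mib(executor_memory_mib: int,
                          overhead_mib: Optional[int] = None) -> int:
    """Total memory requested per executor container: heap plus overhead."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib
```

Under these assumptions, an 8 GiB executor gets roughly 819 MiB of overhead by default; setting the overhead explicitly to 2048 MiB raises the container request to 10240 MiB, leaving more room outside the JVM heap.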

Practice this question →

191

MCQmedium

A data engineering team is building a pipeline to process terabytes of log data daily using Amazon EMR with Spark. The data arrives in hourly batches and must be processed within 4 hours. The team needs to minimize cost. Which cluster configuration is MOST cost-effective?

A.Use a single large instance with multiple cores to avoid data shuffling.

B.Use a transient cluster with a mix of on-demand and spot instances, terminated after the job completes.

C.Use a long-running cluster of on-demand instances to avoid startup time.

D.Use Amazon EMR Serverless to automatically scale.

AnswerB

Transient clusters reduce idle cost, spot instances lower compute cost.

Why this answer

Option B is correct because spot instances offer significant cost savings and are suitable for fault-tolerant Spark jobs, and a transient cluster avoids paying for idle time. Option C is wrong because a long-running on-demand cluster is more expensive. Option A is wrong because a single large instance reduces parallelism.

Option D is wrong because EMR Serverless may be more expensive for predictable, large workloads.

Practice this question →

192

MCQeasy

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are failing intermittently with timeouts. What is the most likely cause?

A.The S3 bucket policy is too restrictive.

B.The AWS Glue job does not have enough DPUs (Data Processing Units) allocated.

C.The Amazon Redshift cluster is in maintenance mode.

D.The source data is not compressed.

AnswerB

Insufficient resources can cause timeouts.

Why this answer

Insufficient DPU allocation can cause timeouts in Glue jobs. The other options are less likely: S3 bucket policies would cause permission errors, not timeouts; Redshift maintenance would affect all jobs; compression would improve performance.

Practice this question →

193

MCQeasy

A data engineer needs to schedule an AWS Glue ETL job to run every hour. The job reads from an Amazon DynamoDB table and writes to Amazon S3. Which AWS service should the engineer use to trigger the Glue job on schedule?

A.Amazon Kinesis Data Streams

B.AWS Step Functions

C.Amazon EventBridge (CloudWatch Events)

D.AWS Lambda

AnswerC

EventBridge can schedule events to trigger Glue jobs.

Why this answer

Option C is correct. Amazon CloudWatch Events (now Amazon EventBridge) can trigger Glue jobs on a schedule. Option D is wrong because AWS Lambda is not a scheduler.

Option A is wrong because Amazon Kinesis is for streaming. Option B is wrong because AWS Step Functions is for orchestrating workflows, but scheduling is typically done via EventBridge.

Practice this question →

194

MCQmedium

A company is streaming e-commerce events to Amazon Kinesis Data Streams. The data science team needs to join events from multiple shards in near real-time and then store the joined results in Amazon S3. Which solution would meet these requirements with the LEAST operational overhead?

A.Use AWS Lambda functions with Kinesis triggers to process each record, join across shards using a DynamoDB table for state, and write to S3.

B.Use Amazon Kinesis Data Firehose to buffer the data and write to S3, then use Amazon Athena to join the data after it is stored.

C.Use AWS Glue ETL jobs that read from the Kinesis stream via the Kinesis connector and write the joined results to S3.

D.Use Amazon Kinesis Data Analytics for Apache Flink to read from the Kinesis stream, perform a join operation using Flink SQL, and write the results to S3 using a sink connector.

AnswerD

Kinesis Data Analytics for Apache Flink supports stateful stream processing and can join across shards natively.

Why this answer

Option D is correct because Amazon Kinesis Data Analytics for Apache Flink can read from a Kinesis stream, perform stateful joins across shards using Flink's SQL or DataStream API, and write results to S3. Option C is wrong because while Glue ETL can process data, it is batch-oriented and not designed for near real-time streaming joins. Option A is wrong because Lambda with Kinesis triggers processes each shard independently; joining across shards would require external state management and is not a typical pattern.

Option B is wrong because Kinesis Data Firehose cannot perform joins; it only writes data to destinations.

Practice this question →

195

MCQhard

A company uses AWS Glue to run ETL jobs that transform data from Amazon RDS for MySQL to Amazon S3. The current job runs daily and takes 3 hours to process 100 GB of data. The company expects data volume to grow 10x in the next year. They need to reduce job runtime and cost. Which approach should they take?

A.Use S3 Select with Glue to filter data before transformation.

B.Use parallel reads with pushdown predicates in the Glue job's source connection, and write the output in columnar format (Parquet) partitioned by date.

C.Increase the number of Glue DPUs to 100 and enable job bookmarking.

D.Use Amazon Redshift Spectrum to perform transformations in place on S3.

AnswerB

Parallel reads with partition pushdown reduce load on RDS and speed up extraction; Parquet with partitioning reduces storage and query costs.

Why this answer

Option B is correct because parallel reads from RDS with pushdown predicates reduce the load on the source and speed up extraction; using columnar formats like Parquet reduces storage and scanning costs in Athena. Option C is wrong because increasing DPUs without changing the extraction method may not help if the bottleneck is the source database. Option A is wrong because S3 Select is for server-side filtering, not for Glue jobs.

Option D is wrong because Redshift Spectrum is for querying data in S3, not for transforming it.
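
One way to picture parallel reads is to split a numeric key range into contiguous slices, each read by a separate connection. The sketch below is not Glue's connection API; `range_predicates` is a hypothetical helper, and the column name and bounds are illustrative assumptions.

```python
def range_predicates(column: str, lower: int, upper: int, parts: int) -> list[str]:
    """Split [lower, upper) into at most `parts` contiguous SQL predicates."""
    if parts < 1 or upper <= lower:
        raise ValueError("need parts >= 1 and upper > lower")
    step = -(-(upper - lower) // parts)  # ceiling division
    predicates = []
    start = lower
    while start < upper:
        end = min(start + step, upper)
        predicates.append(f"{column} >= {start} AND {column} < {end}")
        start = end
    return predicates
```

Each predicate can then be pushed down to the database so that every reader fetches only its own slice, rather than one connection scanning the whole table.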

Practice this question →

196

MCQeasy

A data engineer needs to transform large CSV files stored in S3 into Parquet format and load them into a data warehouse for analysis. The transformation must be cost-effective and serverless. Which AWS service should be used?

A.Amazon Athena

B.Amazon EMR with Spark

C.AWS Glue

D.AWS Data Pipeline

AnswerC

AWS Glue is a serverless ETL service that can perform the transformation efficiently.

Why this answer

AWS Glue is the correct choice because it provides a fully managed, serverless ETL service that can automatically convert CSV files from S3 into Parquet format using its built-in Spark engine. It is cost-effective as you only pay for the resources consumed during the job execution, and it integrates directly with data warehouses like Amazon Redshift for loading transformed data.

Exam trap

The trap here is that candidates confuse Amazon Athena's ability to query Parquet files with the ability to transform CSV into Parquet, overlooking that Athena is a query engine, not an ETL service, while Glue is purpose-built for serverless data transformation.

How to eliminate wrong answers

Option A is wrong because Amazon Athena is an interactive query service that can query CSV and Parquet files directly in S3, but it does not perform ETL transformations or convert file formats; it is for ad-hoc analysis, not data transformation. Option B is wrong because Amazon EMR with Spark requires provisioning and managing clusters, which is not serverless; it incurs costs for running EC2 instances even when idle, making it less cost-effective for occasional transformations. Option D is wrong because AWS Data Pipeline is a workflow orchestration service that can move and transform data, but it is not serverless (it relies on EC2 instances for task runners) and is primarily designed for scheduled data movement, not optimized for converting CSV to Parquet with built-in Spark capabilities.

Practice this question →

197

MCQmedium

A company uses Amazon Kinesis Data Firehose to ingest streaming data and deliver it to an S3 bucket. The data is in JSON format with a timestamp field. The data science team wants to query the data using Athena with partitioning by year/month/day. How should the S3 data be organized?

A.Configure Firehose to use dynamic partitioning with custom prefix

B.Store data in a single prefix and use Athena's 'partition projection' feature

C.Use AWS Glue crawler to partition the data after delivery

D.Use Amazon EMR to partition the data after delivery

AnswerA

Firehose dynamic partitioning creates directories based on record fields or timestamps.

Why this answer

Kinesis Firehose can partition data using custom prefixes like 'year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/'. This creates Hive-style partitions that Athena can automatically discover.
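
To see what object key prefix such a pattern produces for a given record timestamp, here is a small sketch. It only mimics the resulting layout in plain Python; `hive_partition_prefix` is a hypothetical helper, not part of Firehose.

```python
from datetime import datetime, timezone


def hive_partition_prefix(ts: datetime) -> str:
    """Build a Hive-style year/month/day prefix for a timestamp."""
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


example = hive_partition_prefix(datetime(2024, 3, 5, tzinfo=timezone.utc))
```

Here `example` is `year=2024/month=03/day=05/`, the key layout that lets a query restricted to one day read only that day's objects.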

Practice this question →

198

MCQhard

An S3 event notification triggers an AWS Lambda function when a new object is created. The Lambda function parses the event and processes the object. The function is failing with a timeout error for large objects. Which approach should be used to handle large objects efficiently?

A.Increase the Lambda function timeout to 15 minutes

B.Use an SQS queue to buffer event notifications and configure Lambda with a batch window

C.Stream events to Amazon Kinesis Data Streams and process with Lambda

D.Use AWS Step Functions to orchestrate the processing

AnswerB

SQS decouples events and allows Lambda to process in batches, reducing timeout risk.

Why this answer

Option B is correct because S3 event notifications can be sent to an SQS queue, and Lambda can process messages in batches with longer timeouts. Option A is wrong because increasing timeout alone doesn't solve the issue of large objects; Option C is wrong because Kinesis is not needed; Option D is wrong because Step Functions adds complexity.

Practice this question →

199

MCQhard

A data pipeline uses AWS Glue to transform data from Amazon RDS to Amazon S3. The team wants to ensure that only new or updated records are processed in each run, minimizing cost and time. Which AWS Glue feature should be used?

A.Use Glue triggers to run the job on a schedule.

B.Use Glue partition pruning to filter data.

C.Use Glue crawlers to detect new data.

D.Enable Glue Job Bookmarks.

AnswerD

Job Bookmarks maintain state and process only new or changed data.

Why this answer

Option D is correct because Job Bookmarks track processed data and enable incremental processing. Option B is wrong because partitioning organizes data but doesn't track changes. Option C is wrong because Crawlers discover schema but don't process incrementally.

Option A is wrong because Triggers schedule jobs but don't enable incremental processing.

Practice this question →

200

Multi-Selecteasy

A data engineer is building a data pipeline that ingests streaming data from Amazon Kinesis Data Streams, transforms the data using AWS Lambda, and stores the results in Amazon S3. The engineer needs to ensure that each record is processed exactly once and in order. Which TWO approaches should the engineer consider? (Choose TWO.)

Select 2 answers

A.Configure the Lambda function with a reserved concurrency of 1 and a batch size of 1 to process records sequentially.

B.Use the Kinesis Producer Library (KPL) with a sequence number for each record.

C.Use Amazon SQS FIFO queues to decouple Kinesis and Lambda, ensuring ordering and exactly-once delivery.

D.Enable S3 Event Notifications to trigger Lambda for each object.

E.Use Kinesis Data Firehose to buffer and deliver data to S3, then use Lambda to process.

AnswersA, B

This ensures in-order processing per shard, and with proper idempotency, exactly-once can be achieved.

Why this answer

Option B is correct because each record in a shard carries a sequence number, which a consumer can use to preserve order and detect duplicates within the shard. Option A is correct because AWS Lambda can be configured with a reserved concurrency of 1 and a batch size of 1 to process records in order. Option D is incorrect because S3 does not provide ordering guarantees.

Option E is incorrect because Kinesis Data Firehose does not guarantee exactly-once delivery. Option C is incorrect because SQS FIFO does not integrate directly with Kinesis.
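
Exactly-once results usually depend on the consumer being idempotent. The sketch below shows one way to skip replayed records by tracking the highest sequence number handled per shard. The class and field names are hypothetical, and the state is kept in memory for illustration only; a real Lambda consumer would need a durable store.

```python
class OrderedDeduplicator:
    """Skip records whose sequence number was already handled for a shard."""

    def __init__(self) -> None:
        self._last_seen: dict[str, int] = {}
        self.processed: list[str] = []

    def handle(self, shard_id: str, sequence_number: str, payload: str) -> bool:
        """Process the payload once; return False for duplicates or replays."""
        seq = int(sequence_number)  # sequence numbers arrive as numeric strings
        if seq <= self._last_seen.get(shard_id, -1):
            return False
        self.processed.append(payload)
        self._last_seen[shard_id] = seq
        return True
```

If a batch is retried after a partial failure, records that were already handled return False and are not applied twice, while ordering within each shard is preserved.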

Practice this question →

201

Multi-Selecthard

A data engineer is designing a data pipeline that ingests data from a relational database into a data lake on Amazon S3. The data must be incrementally loaded daily. Which TWO AWS services can be used together to achieve this?

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Streams

C.Amazon Athena

D.Amazon Redshift

E.AWS Database Migration Service (DMS)

AnswersA, E

Glue can use job bookmarks for incremental loads.

Why this answer

AWS Database Migration Service (DMS) can perform continuous replication or scheduled tasks to move data from a relational database to S3. AWS Glue can also connect to databases and run incremental ETL jobs using job bookmarks. Option C (Athena) is for querying, Option B (Kinesis) is for streaming, Option D (Redshift) is a data warehouse.

Practice this question →

202

MCQmedium

A company is using Amazon Athena to query a data lake in S3. Queries are slow and expensive. The data is stored as JSON. Which action will improve query performance and reduce cost?

A.Compress the JSON files using gzip

B.Partition the data by date

C.Convert the data to Parquet format

D.Increase the number of Athena workers

AnswerC

Parquet is columnar, reducing scanned data and improving performance.

Why this answer

Option C is correct because converting to columnar formats like Parquet reduces scan volume and improves performance. Option D is wrong because increasing workers is not applicable to Athena (serverless); Option A is wrong because compressing with gzip reduces size but still requires a full scan; Option B is wrong because partitioning helps but columnar format is more impactful.
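
A back-of-the-envelope estimate shows why a columnar format cuts the scan. The sketch assumes all columns are the same width and ignores compression and file metadata, so it is only a rough model; `bytes_scanned` is a hypothetical helper.

```python
def bytes_scanned(total_bytes: int, total_columns: int,
                  selected_columns: int, columnar: bool) -> int:
    """Estimate bytes read by a query that selects a subset of columns."""
    if not 0 < selected_columns <= total_columns:
        raise ValueError("selected_columns must be between 1 and total_columns")
    if not columnar:
        # Row-oriented formats such as JSON are read in full.
        return total_bytes
    # Columnar formats let the engine read only the selected columns.
    return total_bytes * selected_columns // total_columns
```

Under these assumptions, a query that touches 2 of 20 columns reads the whole dataset from JSON but only about a tenth of it from a columnar layout.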

Practice this question →

203

MCQhard

A company runs a real-time fraud detection pipeline using Amazon Kinesis Data Analytics. The pipeline reads from a Kinesis data stream, performs sliding window aggregations, and writes results to a DynamoDB table. The application is experiencing high latency during peak hours. Which action would MOST effectively reduce latency?

A.Enable DynamoDB auto scaling to handle write spikes.

B.Decrease the parallelism level in the Kinesis Data Analytics application.

C.Increase the number of shards in the Kinesis data stream.

D.Increase the sliding window size to reduce computational frequency.

AnswerC

More shards increase parallelism and reduce processing backlog.

Why this answer

Option C is correct because increasing the number of shards increases parallelism in both the stream and the Kinesis Data Analytics application. Option A is wrong because DynamoDB auto scaling helps with writes but not streaming latency. Option D is wrong because a larger window increases latency.

Option B is wrong because decreasing parallelism reduces throughput.

Practice this question →

204

MCQmedium

A company uses AWS Glue ETL jobs to transform data from Amazon RDS for MySQL to Amazon S3. The transformation includes aggregations and joins. The job runs daily and processes approximately 100 GB of data. Recently, the job started failing with memory errors on the worker nodes. Which approach would MOST effectively resolve the issue without changing the logic?

A.Switch from a Spark ETL job to a Python shell job

B.Decrease the number of workers to reduce overhead

C.Change the worker type from G.2X to G.1X to increase memory per worker

D.Increase the number of workers in the job configuration

AnswerD

More workers distribute the data processing, reducing memory per node.

Why this answer

Option D is correct because increasing the number of workers distributes the workload and reduces memory pressure per node. Option B is wrong because reducing workers increases memory pressure. Option C is wrong because its premise is backwards: G.1X workers have less memory than G.2X workers, so the change would make the memory errors more likely.

Option A is wrong because Python shell is not suitable for large data transformations.

Practice this question →

205

MCQeasy

A data engineer needs to transform raw clickstream data (JSON files) stored in S3 into a partitioned Parquet dataset for querying with Athena. The transformation includes cleaning, deduplication, and enrichment. The pipeline should run daily. Which solution is MOST cost-effective and requires the least operational overhead?

A.Launch an Amazon EMR cluster with Spark, transform the data, and terminate the cluster after completion.

B.Use an AWS Glue ETL job with a schedule trigger to perform the transformation and write to S3.

C.Use AWS Lambda functions triggered by S3 events to transform each file incrementally.

D.Use Amazon Athena to run CTAS queries to convert and partition the data daily.

AnswerB

Glue ETL is serverless, can handle complex transformations, and scheduling is built-in.

Why this answer

Option B is correct because AWS Glue crawlers can catalog the data, and Glue ETL jobs are serverless, cost-effective, and can be scheduled. Option D is wrong because Athena is for querying, not ETL. Option A is wrong because EMR requires cluster management.

Option C is wrong because Lambda has execution time limits and is not ideal for large datasets.

Practice this question →

206

MCQeasy

A machine learning team is preparing a dataset for model training. The data is stored in an Amazon S3 bucket with objects that are each approximately 100 MB in size. The team wants to use Amazon SageMaker for training. To optimize training performance, which data format and storage configuration should be used?

A.Store data as RecordIO-Protobuf files and use SageMaker File input mode

B.Store data as RecordIO-Protobuf files and use SageMaker Pipe input mode

C.Store data as CSV files and use SageMaker Pipe input mode

D.Store data as CSV files and use SageMaker File input mode

AnswerB

Pipe mode streams data directly from S3, and RecordIO-Protobuf provides efficient binary format.

Why this answer

Option B is correct because SageMaker Pipe input mode streams data directly from S3, avoiding disk I/O, and RecordIO-Protobuf is an optimized binary format. Option A is wrong because File mode copies data to disk, increasing latency. Option C is wrong because CSV is not as efficient as binary.

Option D is wrong because File mode with CSV is not optimal.

Practice this question →

207

MCQmedium

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS for PostgreSQL and load it into Amazon Redshift. The Glue job runs nightly and takes 6 hours to complete. The Redshift cluster is a single dc2.large node. The team needs to reduce the load time to under 3 hours. The data volume is 200 GB per night. The team is considering using Amazon Redshift Spectrum to query data directly from S3 instead of loading it. However, the data transformation logic is complex and requires multiple joins and aggregations that are currently performed in Glue. Which approach should the team recommend to meet the time requirement?

A.Use Redshift Spectrum to create external tables and run the transformations directly in Redshift, bypassing the Glue job.

B.Increase the Redshift cluster to a multi-node cluster with dc2.8xlarge nodes to improve COPY and query performance.

C.Split the Glue job into multiple parallel jobs that each load a portion of the data into separate Redshift tables, then use UNION ALL views.

D.Stage the data in S3 in Parquet format and use a COPY command with the PARQUET option to load data faster.

AnswerB

More nodes increase parallelism for loading and any post-load transformations.

Why this answer

Option B is correct because a larger, multi-node cluster provides more compute for the COPY command and for any post-load processing; moving from a single dc2.large node to multiple dc2.8xlarge nodes increases load parallelism. Option A is wrong because Redshift Spectrum does not replace the transformation logic; the complex joins and aggregations would still have to run somewhere, whether in Glue or in Redshift.

Option D is wrong because staging the data in S3 as Parquet does not reduce the transformation time. Option C is wrong because splitting the load across parallel jobs does not reduce total time when the bottleneck is the compute of the single-node cluster.

208

Multi-Selecthard

A company is using AWS Glue to catalog data stored in Amazon S3. The data is partitioned by year, month, day, and hour. The company runs hourly ETL jobs that add new partitions. The Glue crawler is scheduled to run every hour to update the Data Catalog. However, the crawler is taking longer than expected and is not completing before the next crawler run starts. Which TWO actions could the company take to resolve this issue?

Select 2 answers

A.Increase the throughput of the crawler by configuring the 'Schema updates' option

B.Enable partition indexing on the table to speed up the crawler

C.Decrease the crawler schedule frequency to every 2 hours to avoid overlapping runs

D.Use multiple crawlers, each configured to crawl a different path (e.g., one for year=2023, one for year=2024)

E.Increase the number of crawler instances by configuring the 'Crawler queue' to process multiple partitions in parallel

AnswersD, E

Splitting the crawl scope across crawlers allows parallel execution.

Why this answer

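Options D and E are correct. Splitting the crawl scope across multiple crawlers lets the paths be processed in parallel, and crawling multiple partitions in parallel increases the crawler's throughput.

Option A is wrong because the 'Schema updates' option controls how detected schema changes are applied, not how fast the crawler runs. Option B is wrong because partition indexing speeds up queries from services such as Athena and Redshift Spectrum, not the crawler itself. Option C is wrong because running the crawler less often leaves more new partitions to process on each run, so the backlog grows.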

209

Multi-Selectmedium

A company is building a data pipeline that uses Amazon Kinesis Data Streams to ingest real-time events. The pipeline then uses AWS Lambda to process the events and store results in Amazon DynamoDB. The company wants to ensure that the Lambda function can process all events without data loss and without duplicating processing. Which TWO configuration steps should the company take?

Select 2 answers

A.Increase the data retention period of the Kinesis stream to 7 days to allow reprocessing

B.Set the Lambda function's batch window to a small value (e.g., 1 second) to reduce processing latency

C.Enable the 'iterator age' metric in Amazon CloudWatch to monitor consumer lag

D.Use a single shard for the Kinesis stream to ensure order and avoid parallel processing issues

E.Configure the Lambda function to disable retries on failure to avoid duplicate processing

AnswersB, C

Small batch window ensures timely processing.

Why this answer

Options B and C are correct. Setting the Lambda batch window to a small value ensures low latency, and enabling the iterator age metric helps monitor lag. Option A is wrong because increasing the retention period does not prevent duplication.

Option D is wrong because using a single shard limits throughput. Option E is wrong because disabling retries could cause data loss.

210

MCQhard

A data engineering team is designing a data lake on Amazon S3. Raw data is ingested in JSON format and must be partitioned by year, month, and day. The team expects high query performance for recent data but infrequent queries for older data. The data is immutable. Which storage tier configuration minimizes costs while meeting performance requirements?

A.Store all data in S3 Standard, then move to S3 Glacier after 30 days using a lifecycle policy

B.Store recent partitions in S3 Standard, older partitions in S3 One Zone-IA

C.Keep all data in S3 Standard because query performance is critical

D.Use S3 Intelligent-Tiering for the entire data lake

AnswerD

Intelligent-Tiering automatically moves data between access tiers based on usage, optimizing cost without retrieval delays.

Why this answer

Using S3 Intelligent-Tiering for the entire data lake automatically optimizes costs by moving data between frequent and infrequent access tiers based on usage patterns, without performance impact. Option A moves data to S3 Glacier after 30 days, which would cause retrieval delays whenever older data is queried. Option C uses S3 Standard for all data, which is cost-inefficient for older data.

Option B places older partitions in S3 One Zone-IA, which is not cost-optimal for infrequent queries and may have durability concerns.

211

Multi-Selectmedium

A data engineer needs to transform and move 2 TB of data from an Amazon RDS for PostgreSQL instance to Amazon S3 daily. The transformation includes filtering, joining with data in S3, and aggregating. Which AWS services can be used together to accomplish this with minimal operational overhead? (Choose THREE.)

Select 3 answers

A.Amazon EMR

B.Amazon Redshift

C.Amazon S3

D.AWS Glue Data Catalog

E.AWS Glue

AnswersC, D, E

Target storage for transformed data.

Why this answer

Option E (Glue) is the ETL service, Option C (S3) is the target, and Option D (Glue Data Catalog) manages metadata. Option A (EMR) adds operational overhead, and Option B (Redshift) is a data warehouse that is not needed here.

212

MCQmedium

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 100 Mbps connection to AWS. The transfer must be completed within one week. Which approach should the engineer use?

A.Use AWS Snowball Edge device to physically transfer the data.

B.Use Amazon S3 Transfer Acceleration.

C.Use AWS DataSync to transfer the data over the network.

D.Use multiple concurrent AWS CLI copy commands over VPN.

AnswerA

Snowball Edge can handle large data volumes without network limitations.

Why this answer

AWS Snowball Edge is a physical device that moves large data volumes without depending on a slow network link. 50 TB over 100 Mbps would take roughly 46 days even at full line rate, far exceeding the one-week requirement. Option C is wrong because AWS DataSync still uses the network. Option B is wrong because S3 Transfer Acceleration improves speed but not enough.

Option D is wrong because VPN is not designed for bulk data transfer.
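
The transfer-time estimate in this explanation can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a link that is fully available to the transfer; the function name `transfer_days` and the `efficiency` parameter are illustrative, not part of any AWS tool.

```python
def transfer_days(size_tb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link of link_mbps megabits/s."""
    bits = size_tb * 1e12 * 8                          # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)    # usable bits per second
    return seconds / 86_400                            # seconds -> days


days = transfer_days(50, 100)
print(f"{days:.1f} days")   # 46.3 days
print(days > 7)             # True: the one-week deadline cannot be met over the network
```

Any real link runs below line rate, so lowering `efficiency` only makes the network option look worse.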

213

Multi-Selectmedium

A company is using Amazon Kinesis Data Streams with 10 shards to ingest clickstream data. Each record is approximately 50 KB. The data is consumed by a Lambda function that writes to DynamoDB. The Lambda function is experiencing throttling errors. Which TWO actions should the data engineer take to resolve the issue? (Choose TWO.)

Select 2 answers

A.Increase the record size to 1 MB to reduce the number of records

B.Switch to Kinesis Data Firehose instead of Data Streams

C.Request a limit increase for the Lambda function's concurrent execution limit

D.Increase the number of shards in the Kinesis stream

E.Increase the batch size in the Lambda event source mapping

AnswersC, E

This directly alleviates throttling by allowing more concurrent executions.

Why this answer

Option D (increase shards) increases stream throughput but may not solve Lambda throttling. Option C (increase Lambda concurrency limit) directly addresses throttling. Option E (increase batch size) reduces the number of Lambda invocations.

Option B (use Firehose) changes the architecture. Option A (increase record size) is irrelevant. The correct answers are C and E: raising the concurrency limit removes the throttling directly, and a larger batch size means fewer invocations for the same number of records.
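
The effect of batch size on invocation count can be illustrated with a back-of-the-envelope calculation. This is a rough sketch under stated assumptions: every shard ingests at the 1 MiB/s per-shard write limit, and each full batch triggers one Lambda invocation. The function name `invocations_per_second` is hypothetical.

```python
SHARD_INGEST_BYTES_PER_SEC = 1024 * 1024  # per-shard write limit of 1 MiB/s


def invocations_per_second(shards: int, record_kb: int, batch_size: int) -> float:
    """Rough upper bound on Lambda invocations/s for a saturated stream."""
    # Whole records one shard can accept per second at the given record size.
    records_per_shard = SHARD_INGEST_BYTES_PER_SEC // (record_kb * 1024)
    # One invocation per full batch (assumption: batches always fill up).
    return shards * records_per_shard / batch_size


print(invocations_per_second(10, 50, 10))    # 20.0
print(invocations_per_second(10, 50, 100))   # 2.0
```

With 50 KB records, a batch of 100 stays under Lambda's 6 MB invocation payload limit, so the larger batch is feasible for the scenario in the question.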

214

Multi-Selectmedium

A company runs a data lake on Amazon S3 with AWS Glue for ETL. The data science team needs to train machine learning models on historical data, but they are concerned about data quality issues such as missing values, duplicates, and outliers. The team wants to build a data quality monitoring solution that automatically detects anomalies and alerts the data engineering team. Which THREE steps should the team take to implement this solution? (Choose THREE.)

Select 3 answers

A.Use AWS Glue DataBrew to create data quality rules that check for missing values, duplicates, and outliers, and schedule them to run regularly.

B.Use Amazon Kinesis Data Analytics to continuously monitor streaming data for data quality issues.

C.Create Amazon CloudWatch alarms based on the data quality metrics and trigger Amazon SNS notifications when thresholds are breached.

D.Implement the Deequ library on Amazon EMR to compute data quality metrics and store them in Amazon CloudWatch.

E.Use Amazon SageMaker Processing jobs to run custom data quality scripts and store results in SageMaker Experiments.

AnswersA, C, D

DataBrew has built-in data quality functionalities.

Why this answer

Option A is correct because AWS Glue DataBrew provides built-in data quality checks and profiling. Option D is correct because Deequ is an open-source library that can run on Amazon EMR or Glue to compute data quality metrics. Option C is correct because CloudWatch alarms can be set on custom metrics to send alerts via SNS.

Option E is incorrect because SageMaker is for model training, not data quality monitoring. Option B is incorrect because Kinesis Data Analytics is for real-time streaming analytics, not batch data quality checks.
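
To make the three data quality checks concrete, here is a minimal standard-library sketch of the kind of metrics a DataBrew rule set or a Deequ job would compute. It is not the DataBrew or Deequ API; the function name `quality_metrics`, the column names, and the z-score threshold of 3 are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def quality_metrics(rows, key, value, z_limit=3.0):
    """Count missing values, duplicate keys, and z-score outliers in a list of dicts."""
    missing = sum(1 for r in rows if r.get(value) is None)

    seen, duplicates = set(), 0
    for r in rows:
        if r[key] in seen:          # a key seen before counts as a duplicate
            duplicates += 1
        seen.add(r[key])

    present = [r[value] for r in rows if r.get(value) is not None]
    outliers = 0
    if present:
        mu, sigma = mean(present), pstdev(present)
        # A value is an outlier when it lies more than z_limit std devs from the mean.
        outliers = sum(1 for v in present if sigma and abs(v - mu) / sigma > z_limit)

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


rows = [{"id": i, "amount": 10} for i in range(11)]
rows += [{"id": 11, "amount": 1000}, {"id": 11, "amount": None}]
print(quality_metrics(rows, "id", "amount"))
# {'missing': 1, 'duplicates': 1, 'outliers': 1}
```

In the architecture described by the answer, counts like these would be published as custom CloudWatch metrics so that alarms and SNS notifications can fire on them.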

215

MCQmedium

A data engineer needs to automate the transformation of CSV files to Parquet format as soon as they are uploaded to an S3 bucket. The transformed files should be stored in another S3 bucket. Which solution is the most cost-effective and requires the least maintenance?

A.Configure an S3 event notification to invoke a Lambda function.

B.Configure an S3 event notification to invoke an AWS Glue job.

C.Run an Amazon EMR cluster continuously to watch for new files.

D.Set up an EC2 instance with a cron job to poll the S3 bucket.

AnswerA

Lambda is serverless, pay-per-execution, ideal for this use case.

Why this answer

S3 Event Notification to Lambda is serverless and cost-effective. Glue jobs have a minimum billing of 1 minute. EMR is overkill.

EC2 requires management.
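
As a rough illustration of the Lambda side of this design, the sketch below shows how a handler might read the uploaded object's location from an S3 event notification and decide where the Parquet output should go. The function name `plan_conversion` and the bucket names are hypothetical, and the actual CSV-to-Parquet conversion (which would need a library such as pyarrow packaged with the function) is deliberately left out.

```python
from urllib.parse import unquote_plus


def plan_conversion(event, dest_bucket):
    """Map each uploaded CSV in an S3 event to the Parquet object it should produce."""
    jobs = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.lower().endswith(".csv"):
            continue  # ignore anything that is not a CSV upload
        jobs.append({
            "source": f"s3://{bucket}/{key}",
            "target": f"s3://{dest_bucket}/{key[:-4]}.parquet",
        })
    return jobs


event = {"Records": [{"s3": {"bucket": {"name": "in-bucket"},
                             "object": {"key": "raw/sales+2024.csv"}}}]}
print(plan_conversion(event, "out-bucket"))
```

Writing to a separate destination bucket, as the question requires, also keeps the function from being re-triggered by its own output.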

216

MCQeasy

A company wants to perform real-time analytics on streaming data from clickstreams. The data needs to be ingested, processed, and made available for querying within seconds. Which AWS service should be used for the processing step?

A.AWS Glue

B.Amazon Redshift

C.Amazon Kinesis Data Analytics

D.Amazon Athena

AnswerC

Kinesis Data Analytics processes streaming data in real-time.

Why this answer

Amazon Kinesis Data Analytics is the correct choice because it enables real-time processing and analysis of streaming data using SQL or Apache Flink. It can ingest data from Kinesis Data Streams or Kinesis Data Firehose, process it with sub-second latency, and output results to destinations like Kinesis Data Streams or Firehose for further querying, meeting the requirement for analytics within seconds.

Exam trap

The trap here is that candidates often confuse AWS Glue's streaming ETL capability (which still relies on Spark Structured Streaming with higher latency) with Kinesis Data Analytics' native real-time processing, or they assume Athena can query streaming data directly when it only queries data at rest in S3.

How to eliminate wrong answers

Option A is wrong because AWS Glue is a serverless ETL service designed for batch processing and data cataloging, not for real-time stream processing with sub-second latency. Option B is wrong because Amazon Redshift is a data warehouse optimized for analytical queries on large datasets, but it is not designed for real-time stream processing; it ingests data in batches or via streaming ingestion with higher latency. Option D is wrong because Amazon Athena is an interactive query service for analyzing data in Amazon S3 using SQL, but it operates on data at rest and cannot process streaming data in real time.

217

Multi-Selecthard

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose three.)

Select 3 answers

A.The need for built-in data transformation and analytics.

B.The need for custom real-time processing logic using consumer applications.

C.The required end-to-end latency (seconds vs. minutes).

D.The need to manually manage shard capacity and scaling.

E.The requirement for exactly-once delivery semantics.

AnswersB, C, D

Correct: Data Streams supports custom consumers; Firehose does not.

Why this answer

Three key differences: Kinesis Data Streams supports custom processing with consumers (option B), requires manual scaling (option D), and has lower latency (option C). Firehose is for simple delivery with minimal code, auto-scaling, and higher latency due to buffering. Option E (exactly-once delivery) is not guaranteed by either.

Option A (built-in analysis) is not a feature of Firehose.

218

MCQmedium

A company is building a data lake on Amazon S3. Raw data is ingested from multiple sources in different formats (CSV, JSON, Parquet). The data must be cataloged and made queryable using Amazon Athena. The data schema may evolve over time. Which approach minimizes manual effort and supports schema evolution?

A.Use Athena only, without a catalog, by directly querying files

B.Use Amazon EMR to process data and write to a Hive metastore

C.Use AWS Glue Crawlers to automatically create and update the Glue Data Catalog

D.Manually create tables in Athena using DDL statements

AnswerC

Crawlers automatically detect schema changes and update the catalog.

Why this answer

Using AWS Glue Crawlers automatically infers and updates schemas in the Glue Data Catalog, supporting schema evolution. Option D (manual schema creation) is error-prone and not scalable. Option B (EMR with a Hive metastore) does not leverage the Glue Data Catalog.

Option A (Athena only) can query but does not manage schema evolution.

219

MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 100 Mbps internet connection and the data must be transferred within 5 days. Which AWS service is best suited for this task?

A.AWS DataSync

B.AWS Snowball Edge

C.Amazon S3 Transfer Acceleration

D.AWS Storage Gateway

AnswerB

Snowball Edge is a physical device that can transfer 50 TB in a few days, bypassing network bandwidth limitations.

Why this answer

Option B is correct because AWS Snowball Edge is a physical device that is shipped to AWS, so the transfer is not limited by the 100 Mbps connection. Option A is wrong because AWS DataSync is designed for network transfers, and at 100 Mbps the transfer would take much longer than 5 days. Option D is wrong because AWS Storage Gateway is for ongoing hybrid cloud storage, not large-scale data migration.

Option C is wrong because S3 Transfer Acceleration speeds up internet transfers but is still limited by the 100 Mbps connection.

220

MCQhard

A data engineer is investigating a slow Athena query on a partitioned table. The table is partitioned by year, month, and day, and the data is stored in S3 with the prefix pattern 'raw/YYYY/MM/DD/'. The engineer runs the above CLI command and sees that there are many small files. Which action would most improve query performance?

A.Convert the data to columnar format like Parquet or ORC.

B.Use S3DistCp to coalesce files into fewer, larger files.

C.Increase the number of partitions in the Athena DDL.

D.Add more partitions to reduce the amount of data scanned per query.

AnswerB

Coalescing reduces the number of files, improving query performance.

Why this answer

Athena performs best with larger files. Consolidating small files into fewer, larger files (e.g., using S3DistCp or Glue ETL) reduces the overhead of reading many small files and improves query performance.
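
The coalescing step can be pictured as packing many small files into a few larger output files. The sketch below is a simple greedy planner, not S3DistCp itself; the function name `plan_groups` and the 128 MB target size are assumptions used only for illustration.

```python
def plan_groups(sizes_mb, target_mb=128):
    """Greedily pack file sizes (in MB) into groups of at most target_mb each."""
    groups, current, total = [], [], 0
    for size in sizes_mb:
        # Close the current group when the next file would push it past the target.
        if current and total + size > target_mb:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups


groups = plan_groups([1] * 1000)   # one thousand 1 MB files
print(len(groups))                 # 8
```

A thousand objects to open becomes eight, which is where the reduction in per-file overhead comes from.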

221

Multi-Selecteasy

A company has a large number of small CSV files (hundreds of thousands) in an S3 bucket. A data engineer needs to run a SQL query on this data using Amazon Athena. The queries are currently slow and expensive. Which two actions will improve query performance and reduce cost?

Select 2 answers

A.Increase the S3 request rate per prefix to improve read throughput.

B.Compress the CSV files using gzip.

C.Partition the data by a commonly filtered column (e.g., date).

D.Increase the number of partitions by splitting files into smaller ones.

E.Convert the data to Parquet or ORC columnar format.

AnswersC, E

Partitioning limits the data scanned per query, improving performance and reducing cost.

Why this answer

Options C and E are correct. Partitioning the data reduces the amount of data scanned by Athena, improving performance and reducing cost. Converting to a columnar format (Parquet or ORC) further reduces scanned data and improves compression.

Option B (compression) helps but is less impactful than partitioning and columnar format. Option D (splitting files into smaller ones) makes the small-file problem worse, and too many small partitions may hurt performance. Option A (increasing S3 request rate) is not a direct action for Athena.
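
Why partitioning reduces the data scanned can be shown with a small sketch: when object keys carry the partition values, a query that filters on a date only needs the objects under one prefix. The Hive-style `year=/month=/day=` layout and the function names below are assumptions for illustration, not something stated in the question.

```python
from datetime import date


def partition_prefix(d: date) -> str:
    """Hive-style partition prefix for a given date."""
    return f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"


def prune(keys, d: date):
    """Keep only the object keys a query filtered on date d would have to read."""
    prefix = partition_prefix(d)
    return [k for k in keys if k.startswith(prefix)]


keys = [
    "year=2024/month=01/day=05/part-0.parquet",
    "year=2024/month=01/day=06/part-0.parquet",
    "year=2024/month=02/day=05/part-0.parquet",
]
print(prune(keys, date(2024, 1, 5)))
# ['year=2024/month=01/day=05/part-0.parquet']
```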

222

MCQeasy

A data scientist wants to query a dataset stored in Amazon S3 using standard SQL without provisioning any servers. The dataset is in CSV format and is updated daily. Which AWS service should be used?

A.Amazon Athena

B.Amazon Redshift

C.Amazon RDS

D.Amazon DynamoDB

AnswerA

Athena is serverless and supports SQL queries on S3 data.

Why this answer

Amazon Athena is a serverless interactive query service that allows querying data in S3 using standard SQL. Option B is wrong because Redshift is a data warehouse requiring provisioning. Option C is wrong because RDS is a relational database service.

Option D is wrong because DynamoDB is a NoSQL database.

223

MCQeasy

A data scientist needs to query a dataset stored as Parquet files in Amazon S3 using standard SQL without managing any infrastructure. Which service should they use?

A.Amazon Athena

B.Amazon QuickSight

C.AWS Glue

D.Amazon Redshift

AnswerA

Athena is serverless and supports SQL on S3.

Why this answer

Amazon Athena is serverless and allows SQL queries directly on S3 data. Redshift requires a cluster. Glue is for ETL.

QuickSight is for visualization.

224

MCQhard

A data engineer created a CloudFormation template for a Glue ETL job as shown. The job processes 500 GB of data and takes 90 minutes to complete. However, the job fails after 60 minutes. What is the MOST likely cause?

A.The IAM role does not have sufficient permissions.

B.The ScriptLocation S3 bucket is in a different region.

C.The Timeout property is set to 60 minutes, but the job requires more time.

D.The MaxRetries property is set to 0, so the job does not retry on failure.

AnswerC

The job is killed when it exceeds the timeout, causing failure.

Why this answer

Option C is correct because the Timeout is set to 60 minutes, but the job takes 90 minutes, so it is killed. Option B is wrong because the script location is correct. Option A is wrong because the role appears correctly configured.

Option D is wrong because MaxRetries is 0, but the job fails due to timeout, not due to retry policy.

225

MCQmedium

A company is using AWS Glue to run ETL jobs that process data from Amazon RDS to Amazon S3. The ETL jobs are failing intermittently with write timeout errors when writing to S3. The company wants to implement a retry mechanism for transient errors. What should the company do?

A.Configure the AWS Glue job to retry on failure by setting the 'Max retries' parameter

B.Increase the size of the Amazon EBS volumes attached to the Glue job

C.Use Amazon CloudWatch to monitor the job and manually restart on failure

D.Place the failed job messages in an Amazon SQS queue and reprocess them

AnswerA

Glue jobs can automatically retry up to a specified number of times.

Why this answer

Option A is correct because configuring job retry in AWS Glue automatically retries failed jobs. Option C is wrong because CloudWatch monitoring with a manual restart is not an automatic retry mechanism. Option D is wrong because SQS is not integrated with Glue jobs.

Option B is wrong because increasing EBS volume size does not address S3 write errors.
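
The behavior of a bounded retry setting can be sketched generically. The code below is not Glue's implementation; it is a plain Python illustration of "run once, then retry up to N times on a transient error", with `TimeoutError` standing in for the write timeout and all names hypothetical.

```python
import time


def run_with_retries(job, max_retries=2, delay_seconds=0.0):
    """Run job(); on TimeoutError, rerun it up to max_retries more times."""
    attempts = 0
    while True:
        try:
            return job()
        except TimeoutError:
            if attempts >= max_retries:
                raise            # retries exhausted: surface the failure
            attempts += 1
            time.sleep(delay_seconds)


calls = {"n": 0}

def flaky_write():
    """Stand-in for a job that times out twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated S3 write timeout")
    return "SUCCEEDED"


print(run_with_retries(flaky_write, max_retries=2))   # SUCCEEDED
print(calls["n"])                                     # 3
```

A retry count only helps with transient failures; a job that fails deterministically will simply fail again on each attempt.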
