CCNA Data Operations and Support Questions

1
MCQ · medium

A company uses AWS DMS to migrate an on-premises Oracle database to Amazon Aurora PostgreSQL. The migration is ongoing with continuous replication. The data engineer notices that the target Aurora database has a higher lag than expected. Which action would most likely reduce the lag?

A. Increase the size of the S3 bucket used for staging
B. Increase the number of parallel tasks in the DMS task settings
C. Enable Batch Optimized Apply on the DMS task
D. Disable validation of data on the target
Answer: B

More parallel tasks improve apply throughput.

Why this answer

Option B is correct because increasing the number of parallel tasks improves apply throughput on the target. Option D is wrong because turning off validation reduces reliability and helps only slightly, if at all. Option A is wrong because the size of a staging S3 bucket does not affect replication lag.

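Option C is not the keyed answer: batch-optimized apply groups changes into batches and relaxes strict transaction ordering on the target, so it is a trade-off rather than the first setting to change.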

2
MCQ · medium

Refer to the exhibit. This log snippet is from a failed AWS Glue job. The job processes a large dataset in memory. What is the MOST likely cause of the OutOfMemoryError?

A. The Glue job is running with insufficient DPUs or worker type.
B. The input data is in an unsupported file format.
C. The job is attempting to join two tables with mismatched keys.
D. The job has too many partitions.
Answer: A

Insufficient resources cause out-of-memory.

Why this answer

Option A is correct because an OutOfMemoryError in Glue often indicates that the DPU (worker) allocation is insufficient for the data size. Option B is wrong because the error is not about file format. Option C is wrong because the error is a heap-space failure, not a join-key mismatch.

Option D is wrong because the error is not about partitioning.

3
MCQ · easy

A data engineer is monitoring an Amazon Redshift cluster and notices that the disk space usage is increasing rapidly. The engineer wants to reclaim space from deleted rows. Which command should the engineer run?

A. VACUUM
B. ANALYZE
C. UNLOAD
D. COPY
Answer: A

VACUUM reclaims space from deleted rows.

Why this answer

Option A is correct because VACUUM reclaims space from deleted rows. Option B is wrong because ANALYZE updates statistics. Option D is wrong because COPY loads data.

Option C is wrong because UNLOAD exports data.

4
Multi-Select · medium

A data engineer is setting up a Redshift cluster and needs to ensure high availability. Which TWO actions should be taken?

Select 2 answers
A. Enable concurrency scaling.
B. Configure cross-region snapshot copy.
C. Enable automatic replication across Availability Zones.
D. Use a single-node cluster to reduce complexity.
E. Deploy a multi-node cluster with at least two compute nodes.
Answers: C, E

Cross-AZ replication ensures availability in case of AZ failure.

Why this answer

Options C and E are correct. Multi-node clusters with automatic replication across Availability Zones provide high availability. Option D is wrong because single-node clusters are not highly available.

Option A is wrong because concurrency scaling improves performance, not availability. Option B is wrong because cross-region snapshot copy is disaster recovery, not high availability.

5
MCQ · hard

A company runs an Amazon DynamoDB table with on-demand capacity. A new reporting application performs frequent Scan operations on the table, causing occasional 'ProvisionedThroughputExceededException' errors. The operations team needs to resolve this with minimal cost. What should they do?

A. Increase the table's maximum read capacity by requesting a limit increase from AWS Support.
B. Switch the table to provisioned capacity and increase the read capacity units.
C. Enable DynamoDB Accelerator (DAX) to cache the Scan results.
D. Create a global secondary index (GSI) on the attributes used in the reporting queries.
Answer: D

GSI enables efficient queries, reducing Scans and avoiding partition-level throttling.

Why this answer

Option D is correct. On-demand tables do not have provisioned throughput, so the error indicates throttling from per-partition throughput limits. Creating a GSI lets the reporting queries use a more efficient Query pattern, reducing Scans and partition hot spots. Option B is incorrect because switching to provisioned capacity would require careful capacity planning and might increase cost.

Option A is incorrect because the error comes from partition-level limits, not table-level limits. Option C is incorrect because DAX is a caching layer that can reduce read load but does not address the root cause of inefficient Scan operations.
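
To make the Scan-versus-Query cost difference concrete, here is a minimal Python sketch of read-unit arithmetic. It assumes the usual DynamoDB sizing rule (one read unit per 4 KB for a strongly consistent read, half that for an eventually consistent read). The function name `read_units` and the table and result sizes are hypothetical, chosen only for illustration; a real Scan is paginated, so treat the totals as an approximation.

```python
import math

READ_UNIT_BYTES = 4096  # reads are metered in 4 KB increments


def read_units(bytes_read: int, strongly_consistent: bool = False) -> float:
    """Approximate read units consumed when `bytes_read` bytes are read."""
    units = math.ceil(bytes_read / READ_UNIT_BYTES)
    # Eventually consistent reads cost half as much as strongly consistent ones.
    return float(units) if strongly_consistent else units / 2


if __name__ == "__main__":
    table_bytes = 10 * 1024**3  # hypothetical 10 GiB table read by a full Scan
    query_bytes = 200 * 1024    # hypothetical 200 KiB read by one GSI Query
    print("Scan :", read_units(table_bytes))
    print("Query:", read_units(query_bytes))
```

The gap between the two numbers is the load that a well-chosen index removes from the table's partitions.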

6
MCQ · medium

A company is using Amazon Redshift for its data warehouse. A data engineer notices that COPY commands from S3 are failing intermittently with 'S3ServiceException: Access Denied'. The IAM role used by Redshift has the correct permissions. What is the MOST likely cause?

A. The IAM role is not attached to the Redshift cluster.
B. The S3 bucket uses SSE-KMS encryption and the role lacks kms:Decrypt.
C. The IAM role name contains a typo in the COPY command.
D. The S3 bucket policy denies access to the Redshift cluster's IP addresses.
Answer: D

An explicit Deny in a bucket policy overrides an IAM Allow and causes Access Denied.

Why this answer

Option D is correct because an S3 bucket policy can deny access even when the role allows it. Option A is wrong because the role is already attached. Option B is wrong because the scenario states that the role already has the correct permissions.

Option C is wrong because a typo in the role name would make every COPY fail, not only some of them.

7
MCQ · hard

A company uses Amazon Athena to query data in an S3 bucket. A data engineer notices that a query fails with the error: 'HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket/path/file.parquet (Path does not exist)'. However, the file exists in S3. What is the most likely cause?

A. The file was uploaded using S3 multipart upload and is incomplete.
B. The table's metadata in the Glue Data Catalog is outdated.
C. The S3 bucket has a bucket policy that denies access to the Athena principal.
D. Another process deleted the file after Athena listed the files but before reading.
Answer: D

A delete that lands between Athena's file listing and its read can cause this.

Why this answer

Option D is correct because a concurrent delete can remove a file after Athena has planned a split for it but before the split is read. Option B is incorrect because the error names a file path, not table metadata. Option C is incorrect because a bucket policy denial would cause a permission error, not a missing-path error.

Option A is incorrect because S3 is strongly consistent for new objects, and an incomplete multipart upload is not visible as an object.

8
Multi-Select · medium

A data engineer is troubleshooting a slow Amazon Redshift query. The query plan shows a large number of 'DS_DIST_ALL_INNER' and 'DS_BCAST_INNER' operations. Which TWO actions would likely improve query performance?

Select 2 answers
A. Set DISTSTYLE to ALL for both tables.
B. Change the distribution style of large tables to KEY on the join column.
C. Increase the number of slices by resizing the cluster.
D. Define SORTKEYs on the join columns.
E. Drop and recreate the tables with the same DDL.
Answers: B, C

KEY distribution collocates data on the same slice, reducing redistribution.

Why this answer

Option B is correct because DISTSTYLE KEY on the join column co-locates matching rows and reduces data redistribution. Option C is correct because increasing the number of slices spreads the work across more compute nodes. Option D is incorrect because a SORTKEY helps with range-restricted scans, not with redistribution during joins.

Option A is incorrect because DISTSTYLE ALL on both tables would copy both tables to every node, which is inefficient for large tables. Option E is incorrect because recreating the tables with the same DDL is disruptive and leaves the distribution unchanged.

9
MCQ · hard

A company runs a data pipeline that ingests streaming data via Amazon Kinesis Data Streams, processes it with an AWS Lambda function, and stores results in Amazon DynamoDB. The Lambda function sometimes fails due to 'ProvisionedThroughputExceededException' on the DynamoDB table. Which combination of steps should a data engineer take to resolve this issue?

A. Enable DynamoDB auto scaling and configure a dead-letter queue for the Lambda function.
B. Increase the Lambda function timeout and enable batch windows.
C. Increase the number of Kinesis shards to reduce Lambda invocations.
D. Increase Lambda reserved concurrency and disable retries.
Answer: A

Auto scaling adjusts throughput; DLQ captures failed records for reprocessing.

Why this answer

Option A is correct because DynamoDB auto scaling adjusts capacity based on load, and a dead-letter queue (DLQ) for failed records prevents data loss and allows reprocessing. Option B is wrong because increasing the Lambda timeout does not address DynamoDB throttling. Option D is wrong because changing Lambda reserved concurrency does not add DynamoDB capacity, and disabling retries risks losing records.

Option C is wrong because Kinesis shard count does not affect DynamoDB throughput.

10
MCQ · easy

A company is using Amazon S3 as a data lake. Data is ingested hourly from multiple sources. The data engineer needs to ensure that once an object is written to S3, it cannot be overwritten or deleted for 30 days. Which S3 feature should be used?

A. Use S3 Lifecycle policies to transition objects to Glacier after 30 days.
B. Enable S3 Versioning and MFA Delete.
C. Configure a bucket policy that denies s3:DeleteObject for all principals.
D. Enable S3 Object Lock with a retention period of 30 days.
Answer: D

Object Lock enforces write-once-read-many (WORM) protection.

Why this answer

Option D is correct because S3 Object Lock with a retention period prevents object versions from being overwritten or deleted until the period ends. Option B is wrong because Versioning with MFA Delete only guards against accidental deletion. Option C is wrong because it only protects against delete, not overwrite.

Option A is wrong because Lifecycle policies manage storage transitions, not immutability.
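
As a rough illustration of the 30-day retention rule, the Python sketch below builds a default-retention document in the shape used by S3's Object Lock configuration and computes when retention would end for one object version. It makes no AWS call. COMPLIANCE mode is an assumption (GOVERNANCE is the other mode), and `retain_until` is a hypothetical helper, not an S3 API.

```python
import json
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # retention period stated in the question

# Bucket-level default retention rule. The mode is an assumption for this sketch.
object_lock_configuration = {
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": RETENTION_DAYS}},
}


def retain_until(written_at: datetime, days: int = RETENTION_DAYS) -> datetime:
    """Time at which retention ends for an object version written at `written_at`."""
    return written_at + timedelta(days=days)


if __name__ == "__main__":
    written = datetime(2024, 1, 1, tzinfo=timezone.utc)  # hypothetical write time
    print(json.dumps(object_lock_configuration, indent=2))
    print("Retention ends:", retain_until(written).isoformat())
```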

11
MCQ · medium

A data engineer is troubleshooting a nightly ETL job that extracts data from an Amazon RDS MySQL instance and loads it into an Amazon S3 bucket in Parquet format. The job runs on an Amazon EMR cluster and has been failing with the error 'Access Denied' when writing to S3. The IAM role attached to the EMR cluster has permissions for S3 PutObject. What is the MOST likely cause?

A. The S3 bucket uses SSE-KMS encryption and the EMR role lacks kms:GenerateDataKey permission.
B. The S3 bucket has a Lifecycle rule that expires objects too quickly.
C. The EMR cluster was terminated before the write operation completed.
D. The S3 bucket policy denies access to the EMR cluster's IAM role.
Answer: D

S3 bucket policies can explicitly deny access, overriding IAM allow.

Why this answer

Option D is correct because an explicit Deny in an S3 bucket policy overrides an IAM Allow; if the bucket policy denies access to the EMR cluster's role, the write fails even though the role allows s3:PutObject. Option B is wrong because S3 Lifecycle rules do not affect write permissions. Option A is wrong because KMS permissions matter only if the bucket uses SSE-KMS, which the scenario does not state.

Option C is wrong because EMR cluster termination would cause a different error.

12
MCQ · easy

Refer to the exhibit. A data engineer runs this CLI command to check an object's metadata. The engineer wants to verify if the object is eligible for lifecycle transition to S3 Glacier based on its age. What additional information is needed?

A. The current date
B. The ETag value
C. The metadata archive flag
D. The ContentLength value
Answer: A

The object age is based on LastModified and current date.

Why this answer

Option A is correct because LastModified shows when the object was last modified, and lifecycle rules act on object age. The engineer needs the current date to calculate that age. Option D is wrong because ContentLength is not relevant to age.

Option B is wrong because the ETag is for integrity checks. Option C is wrong because the metadata is shown in the output but is not needed for the age calculation.
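
The age calculation itself is simple date arithmetic. A minimal Python sketch follows, assuming the LastModified value has already been parsed into a timezone-aware datetime; the function names and the 30-day threshold are hypothetical. S3 evaluates lifecycle rules on its own schedule, so this only approximates eligibility.

```python
from datetime import datetime, timezone


def object_age_days(last_modified: datetime, now: datetime) -> int:
    """Whole days elapsed between the object's LastModified time and `now`."""
    return (now - last_modified).days


def eligible_for_transition(last_modified: datetime, now: datetime,
                            transition_after_days: int) -> bool:
    """True once the object is at least `transition_after_days` old."""
    return object_age_days(last_modified, now) >= transition_after_days


if __name__ == "__main__":
    last_modified = datetime(2024, 1, 1, tzinfo=timezone.utc)  # hypothetical value
    now = datetime.now(timezone.utc)                           # the missing input
    print(object_age_days(last_modified, now),
          eligible_for_transition(last_modified, now, 30))
```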

13
MCQ · medium

A company uses Amazon DynamoDB as a data store for a real-time dashboard application. The application performs point lookups and range queries on a table that has a partition key and sort key. The table uses on-demand capacity mode. Recently, the application's response time has increased, and CloudWatch metrics show high 'ThrottledRequests' for the table. The application uses the AWS SDK with default retry settings. The data access pattern is read-heavy with occasional spikes. What is the most effective way to reduce throttling?

A. Switch the table to provisioned capacity and set the read capacity units to a high value.
B. Enable DynamoDB Accelerator (DAX) to cache frequently read items.
C. Increase the read capacity units to a higher value.
D. Implement exponential backoff with jitter in the application code.
Answer: D

Retries with backoff reduce the rate of requests during throttling, allowing the table to recover.

Why this answer

Option D is correct. DynamoDB on-demand mode accommodates traffic spikes but can throttle if a spike exceeds double the previous peak. Implementing exponential backoff with jitter in the application lets retries succeed without overwhelming the table. Option A is wrong because switching to provisioned capacity would require predicting the peak and may still throttle if the spike is higher.

Option B is wrong because DAX (DynamoDB Accelerator) is an in-memory cache that reduces read load but does not eliminate throttling from write spikes; it also adds cost and complexity. Option C is wrong because increasing read capacity units applies only to provisioned mode.
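
Exponential backoff with jitter can be sketched in a few lines of Python. This is a minimal illustration, not the AWS SDK's retry implementation: `ThrottledError` is a hypothetical stand-in for whatever throttling exception the client raises, and the base delay, cap, and attempt count are arbitrary assumptions.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by a client call."""


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0,
                  rng=random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(operation, max_attempts: int = 5, sleep=time.sleep):
    """Run `operation`, retrying on ThrottledError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted, surface the error
            sleep(backoff_delay(attempt))
```

The random factor spreads retries from many clients over time, so they do not all hit the table again at the same instant.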

14
MCQ · medium

A data engineer is responsible for a real-time data pipeline that ingests clickstream data from a website into Amazon Kinesis Data Streams, where it is processed by an AWS Lambda function that writes to an Amazon DynamoDB table for user session tracking. The Lambda function is idempotent and uses the DynamoDB PutItem API with a condition expression to avoid overwriting existing records. Over the past week, the engineer has observed an increase in DynamoDB write throttling (ProvisionedThroughputExceededException) during peak traffic hours. The DynamoDB table has on-demand capacity. The engineer checks the Lambda function's reserved concurrency and finds it set to 1000. The Kinesis stream has 10 shards. The Lambda function's batch size is set to 100. The engineer suspects that the retry behavior is causing duplicate writes and throttling. Which change should the engineer make to reduce throttling?

A. Increase the number of Kinesis shards to 20 to distribute the load.
B. Decrease the Lambda batch size to 10 to reduce the number of records processed per invocation.
C. Decrease the Lambda reserved concurrency to 500 to limit the number of concurrent invocations.
D. Use a DynamoDB Stream to trigger a second Lambda function that writes to the table.
Answer: B

Smaller batches reduce the number of concurrent writes to DynamoDB, lowering throttling.

Why this answer

Option B is correct. On-demand DynamoDB can scale, but it has a per-partition throughput limit. Reducing the Lambda batch size reduces the number of concurrent writes per shard, decreasing the chance of hitting partition limits.

Option A is wrong because increasing shards would increase concurrency, worsening throttling. Option C is wrong because decreasing reserved concurrency could cause Lambda throttling but not DynamoDB throttling. Option D is wrong because using a DynamoDB stream adds complexity and does not directly reduce write throttling.

15
Multi-Select · medium

A company uses Amazon DynamoDB as the primary data store for a web application. The application experiences high read latency. Which TWO actions can improve read performance?

Select 2 answers
A. Add a Global Secondary Index (GSI)
B. Enable DynamoDB Global Tables
C. Enable DynamoDB Accelerator (DAX)
D. Enable DynamoDB Streams
E. Increase the write capacity units
Answers: B, C

Global Tables allow reads from local regions, reducing latency.

Why this answer

Option C is correct because DynamoDB Accelerator (DAX) provides in-memory caching, reducing read latency. Option B is correct because Global Tables provide replicas in other Regions, reducing latency for global users. Option A is wrong because a GSI adds secondary indexes but does not reduce read latency for primary key lookups.

Option E is wrong because increasing write capacity does not affect read performance. Option D is wrong because DynamoDB Streams is for change capture, not reducing latency.

16
MCQ · medium

A company uses AWS Lake Formation to manage data lake permissions. A data analyst cannot query a table in Athena, although the table appears in the catalog. The analyst has IAM permissions to run Athena. What is the MOST likely cause?

A. The Glue Data Catalog does not have the table registered.
B. The S3 bucket policy denies access to the analyst's IAM role.
C. The analyst lacks Lake Formation permissions on the table.
D. The Athena workgroup is not configured with the correct output location.
Answer: C

Lake Formation grants fine-grained permissions; the analyst needs SELECT.

Why this answer

Option C is correct because Lake Formation permissions are separate from IAM; the analyst needs SELECT granted through Lake Formation. Option B is wrong because Lake Formation permissions apply even when the S3 bucket policy allows access. Option A is wrong because the table appears in the catalog, so it is registered.

Option D is wrong because Athena workgroup permissions are not related.

17
MCQ · medium

A company runs a data pipeline that uses AWS Glue to process data from an Amazon DynamoDB table and write results to Amazon S3. The Glue job runs on a schedule every hour. Recently, the job started failing intermittently with 'ProvisionedThroughputExceededException' errors from DynamoDB. What is the BEST solution?

A. Use DynamoDB Accelerator (DAX) to reduce read latency.
B. Change the Glue job schedule to run every 2 hours.
C. Implement exponential backoff and retries in the Glue job for DynamoDB operations.
D. Increase the read capacity units of the DynamoDB table.
Answer: C

Exponential backoff handles throttling gracefully.

Why this answer

Option C is correct because implementing exponential backoff and retries in the Glue job handles throttling gracefully. Option D is wrong because increasing provisioned capacity increases cost and may not be necessary. Option A is wrong because DynamoDB Accelerator does not raise the table's throughput limits.

Option B is wrong because running less often does not lower the read rate within each run.

18
MCQ · medium

A company uses Amazon DynamoDB as a source for an AWS Glue job. The job reads a large table by scanning it directly. The job is failing with 'ThrottlingException' from DynamoDB. What should the data engineer do to resolve this issue WITHOUT changing the job's logic?

A. Use DynamoDB Streams to capture changes and process them incrementally
B. Reduce the number of DynamoDB read segments in the Glue job
C. Use the DynamoDB export to S3 feature and read the exported data from S3
D. Increase the read capacity units (RCU) of the DynamoDB table
Answer: C

Export to S3 reads from the table without consuming RCU, avoiding throttling entirely.

Why this answer

Option C is correct because the DynamoDB export to S3 feature creates a point-in-time snapshot of the table data in S3 without consuming any read capacity units (RCUs) from the DynamoDB table. By reading the exported data from S3 instead of directly scanning the DynamoDB table, the Glue job avoids triggering ThrottlingException entirely, as the export operation uses the table's backup and restore mechanism, not the read path. This resolves the issue without altering the job's logic, as the job can be reconfigured to read from the S3 export location.

Exam trap

The trap here is that candidates often assume the only way to resolve DynamoDB throttling is to increase RCUs (Option D) or reduce parallelism (Option B), missing the fact that the export-to-S3 feature completely eliminates the need to read from DynamoDB during the Glue job, which is the most efficient and cost-effective solution without altering job logic.

How to eliminate wrong answers

Option A is wrong because using DynamoDB Streams to process changes incrementally replaces a full read with a streaming approach, which violates the requirement not to change the job's logic. Option B is wrong because reducing the number of read segments lowers parallelism and may reduce throttling, but the job still reads directly from DynamoDB and consumes RCUs, so the risk of ThrottlingException remains. Option D is wrong because increasing the table's RCUs raises the throughput limit at additional cost; it does not violate the constraint on job logic, but it leaves the scan overhead in place and is not the best practice.

19
Multi-Select · medium

A data engineer is designing a data lake on Amazon S3 that will be used for both batch processing with Amazon EMR and interactive queries with Amazon Athena. The data includes sensitive personally identifiable information (PII) that must be encrypted at rest. The company requires that the encryption keys be managed by the company and rotated every 90 days. Which TWO options should the engineer implement to meet these requirements? (Choose TWO.)

Select 2 answers
A. Use customer-provided keys (SSE-C) and store the keys in AWS Secrets Manager.
B. Configure a bucket policy to deny uploads that are not encrypted.
C. Enable S3 default encryption using SSE-KMS with the customer managed key.
D. Use AWS Key Management Service (KMS) to create a customer managed key with automatic yearly rotation.
E. Use S3 managed keys (SSE-S3) for server-side encryption.
Answers: C, D

Default encryption ensures all new objects are encrypted with the KMS key.

Why this answer

Option D is correct because AWS KMS lets you create and manage your own customer managed keys with automatic key rotation; meeting the 90-day requirement means setting a custom rotation period rather than relying on the yearly default. Option C is correct because enabling S3 default encryption with SSE-KMS ensures all objects are encrypted using the KMS key. Option E is wrong because SSE-S3 uses Amazon-managed keys, not customer-managed keys.

Option A is wrong because SSE-C requires you to supply and manage the keys yourself, so KMS cannot rotate them; SSE-C is also not recommended for this scenario. Option B is wrong because bucket policies do not enable encryption; they can enforce it but do not manage keys.

20
MCQ · medium

Refer to the exhibit. A data engineer is troubleshooting an AWS Lambda function that processes data from Amazon S3. The function is triggered by S3 events, but no logs appear in CloudWatch Logs. The engineer runs the AWS CLI command shown. What is the MOST likely reason for the missing logs?

A. The Lambda execution role does not have permissions to create log groups and write logs.
B. The Lambda function is configured to log to a different log group.
C. The Lambda function is not being invoked by S3 events.
D. The log retention policy is set to 7 days, causing logs to expire immediately.
Answer: A

The execution role is missing logs:CreateLogGroup, logs:CreateLogStream, or logs:PutLogEvents.

Why this answer

Option A is correct because the log group shows 'storedBytes': 0, indicating no logs have been written. The most common cause is that the Lambda execution role lacks permission to create log streams and put log events. Option B is wrong because the CLI output shows that the function's log group exists.

Option D is wrong because a retention policy does not prevent logs from being written. Option C is wrong because the scenario states that S3 events trigger the function.
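
For reference, the three logging permissions can be expressed as an IAM policy document. The Python sketch below only builds and prints the JSON; it does not call AWS. The region, account ID, and function name are hypothetical placeholders, and `lambda_logging_policy` is an illustrative helper rather than part of any SDK.

```python
import json


def lambda_logging_policy(region: str, account_id: str, function_name: str) -> dict:
    """Policy document granting the CloudWatch Logs actions a function needs."""
    log_group_arn = (
        f"arn:aws:logs:{region}:{account_id}:log-group:/aws/lambda/{function_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # lets the function create its log group on first invocation
                "Effect": "Allow",
                "Action": "logs:CreateLogGroup",
                "Resource": f"arn:aws:logs:{region}:{account_id}:*",
            },
            {   # lets the function create streams and write events in that group
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"{log_group_arn}:*",
            },
        ],
    }


if __name__ == "__main__":
    # All three argument values are placeholders.
    print(json.dumps(
        lambda_logging_policy("us-east-1", "111122223333", "example-fn"), indent=2))
```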

21
MCQ · medium

A data pipeline uses AWS Glue to process data from Amazon S3. The job fails with an 'OutOfMemoryError' during the transformation phase. Which action should the data engineer take to resolve this issue?

A. Enable S3 server-side encryption.
B. Increase the number of partitions in the input data.
C. Change the data format from CSV to Parquet.
D. Increase the number of DPUs (Data Processing Units) for the Glue job.
Answer: D

More DPUs provide additional memory and compute resources to handle large transformations.

Why this answer

Option D is correct because increasing the number of DPUs allocated to the Glue job provides more memory and processing capacity, which directly addresses the OutOfMemoryError. Option B is wrong because it increases parallelism without adding memory. Option C is wrong because the error is not related to the data format.

Option A is wrong because S3 encryption has no bearing on the job's memory.

22
MCQ · hard

A data engineer is designing a data pipeline that uses AWS Glue to process data from an RDS MySQL database. The pipeline must capture only incremental changes (inserts and updates) and run every hour. Which approach is most cost-effective and reliable?

A. Use Glue job bookmarks to track and process only new and updated records
B. Use AWS DMS with change data capture (CDC) to replicate changes to S3
C. Add a timestamp column and query rows where timestamp > last run
D. Perform a full table scan each hour and compare with previous snapshot
Answer: A

Bookmarks efficiently handle incremental loads.

Why this answer

Option A is correct because Glue job bookmarks track processed data and enable incremental processing without reprocessing full tables. Option B is wrong because CDC with DMS adds complexity and cost. Option D is wrong because full scans are inefficient and costly.

Option C is wrong because querying by timestamp may miss updates if the timestamp column is not reliably updated.

23
Multi-Select · easy

A data engineer is setting up an AWS Glue job to process data from an Amazon S3 bucket. The job fails with an 'Access Denied' error. Which TWO IAM permissions are MOST likely missing from the Glue job's IAM role?

Select 2 answers
A. s3:PutObject
B. kms:Decrypt
C. dynamodb:GetItem
D. glue:StartJobRun
E. s3:GetObject
Answers: A, E

Required to write output to S3.

Why this answer

Options A and E are correct. Glue needs s3:GetObject to read from S3 and s3:PutObject to write output. Option D is incorrect because glue:StartJobRun is not needed for the job itself.

Option B is incorrect because kms:Decrypt is needed only if the bucket uses KMS encryption. Option C is incorrect because dynamodb:GetItem is not relevant unless the job accesses DynamoDB.

24
MCQ · hard

A data engineer uses AWS Database Migration Service (DMS) to migrate an on-premises Oracle database to Amazon Aurora MySQL. The migration is successful, but the engineer notices that the target Aurora cluster has a higher CPU utilization than expected during the full load phase. What is the MOST likely cause?

A. The DMS task has LOB mode set to 'Full LOB mode', causing additional processing.
B. DMS is performing data validation during the full load phase.
C. DMS is reading from an Amazon Aurora read replica instead of the primary instance.
D. The DMS task is configured to use multiple parallel threads to load data, overwhelming the target instance.
Answer: D

Parallel threads increase throughput but also increase CPU usage.

Why this answer

Option D is correct because DMS loads data with multiple parallel threads to maximize throughput, which can cause high CPU on the target. Option C is wrong because DMS does not use read replicas during full load. Option A is wrong because LOB settings affect how large-object columns are handled, not overall CPU load.

Option B is wrong because validation occurs after the full load, not during it.

25
MCQ · hard

A data engineer is designing a data pipeline that ingests millions of small JSON files (1-10 KB each) from an S3 bucket into Amazon Redshift. The current approach uses a Lambda function triggered by S3 events to call the Redshift COPY command for each file. This is causing high latency and throttling. Which alternative is MOST cost-effective and efficient?

A. Use Amazon Kinesis Data Streams and a consumer to batch files before COPY
B. Use Amazon Kinesis Data Firehose to buffer and write larger files to S3, then use a scheduled COPY command
C. Increase the Lambda concurrency limit and memory
D. Use AWS Glue to merge files into larger Parquet files before loading
Answer: B

Firehose buffers small files into larger ones, reducing COPY frequency and cost.

Why this answer

Option B is correct because Kinesis Data Firehose can buffer the small records and write larger files to S3, which a scheduled COPY (or Redshift Spectrum) can then read. Option C is wrong because it still processes each file individually. Option A is wrong because it is built for streaming consumers, not batch loading.

Option D is wrong because it adds complexity and cost.
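
The buffering idea can be sketched without any AWS service. The Python example below groups small records into batches up to a byte threshold, which is the size-based half of what a buffering delivery service does (a time-based flush is omitted here). The function name and the threshold are hypothetical; this is an illustration of the technique, not Firehose's implementation.

```python
from typing import Iterable, Iterator, List


def buffer_records(records: Iterable[bytes],
                   max_batch_bytes: int) -> Iterator[List[bytes]]:
    """Yield lists of records whose combined size stays within the threshold.

    A single record larger than the threshold is emitted in a batch of its own.
    """
    batch: List[bytes] = []
    size = 0
    for record in records:
        # Flush before adding a record that would push the batch over the limit.
        if batch and size + len(record) > max_batch_bytes:
            yield batch
            batch, size = [], 0
        batch.append(record)
        size += len(record)
    if batch:
        yield batch  # flush the final partial batch


if __name__ == "__main__":
    small_files = [b"x" * 4 for _ in range(5)]  # stand-ins for tiny JSON files
    for merged in buffer_records(small_files, max_batch_bytes=10):
        print(len(merged), "records,", sum(map(len, merged)), "bytes")
```

Fewer, larger objects mean fewer COPY invocations, which is where the latency and throttling in the scenario come from.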

26
Multi-Select · easy

A data engineer is monitoring an AWS Glue ETL job that processes data from Amazon DynamoDB to Amazon S3. The job is taking longer than expected. The engineer suspects that the job's parallelism is not optimal. Which THREE actions can improve the job's performance? (Choose THREE.)

Select 3 answers
A. Enable the 'groupFiles' option in the S3 sink to coalesce small files.
B. Decrease the 'dynamodb.splits' parameter to reduce the number of parallel readers.
C. Increase the 'MaxCapacity' (DPU) setting for the Glue job.
D. Disable job bookmark to avoid storing metadata.
E. Increase the 'dynamodb.throughput.read.percent' parameter to allocate more read capacity.
Answers: A, C, E

Coalescing small files reduces the number of output files and improves write performance.

Why this answer

Option E is correct because raising the DynamoDB read-throughput percentage lets the job consume more of the table's read capacity. Option C is correct because increasing the number of DPUs (data processing units) allocated to the job increases parallelism. Option A is correct because the 'groupFiles' option can reduce the number of small files written to S3, reducing write overhead.

Option B is wrong because decreasing the number of splits reduces read parallelism and throughput. Option D is wrong because disabling the job bookmark may cause reprocessing and does not improve performance; it may actually degrade it.

27
Multi-Select · medium

A data engineer is setting up an Amazon Redshift cluster for a new data warehouse. The engineer needs to ensure that the cluster can automatically recover from failures and maintain high availability. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers
A. Configure workload management (WLM) queues to prioritize critical queries.
B. Configure automated snapshots with a retention period of 1 day and enable cluster recreation from snapshots.
C. Enable Multi-AZ deployment.
D. Enable concurrency scaling.
E. Configure manual snapshots with cross-region copy.
Answers: B, C

Automated snapshots allow Redshift to automatically restore a cluster from the latest snapshot if the primary fails.

Why this answer

Option C (enable Multi-AZ) and Option B (automated snapshots with cluster recreation) are correct. Multi-AZ provides synchronous replication across AZs for high availability. Automated snapshots allow point-in-time recovery and cluster replacement in case of failure.

Option D is wrong because concurrency scaling improves query performance, not availability. Option E is wrong because manual snapshots require manual intervention. Option A is wrong because workload management (WLM) does not affect availability.

28
MCQ · easy

Refer to the exhibit. A data engineer runs the command on an Amazon S3 bucket used for data lake storage. The engineer is concerned about accidental overwrites of objects. What does the output indicate?

A. Versioning is enabled, so previous versions of objects are preserved.
B. Old versions will be automatically deleted after a retention period.
C. Objects are encrypted at rest by default.
D. MFA Delete is disabled, meaning anyone can delete objects permanently.
Answer: A

Versioning keeps all versions of objects.

Why this answer

Option A is correct because the Status 'Enabled' means versioning is turned on for the bucket, so an overwrite creates a new version instead of replacing the previous one. Option B is wrong because versioning does not automatically expire old versions. Option C is wrong because versioning does not encrypt objects.

Option D is wrong because MFA Delete is not required for versioning.

29
MCQmedium

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data. The consumer application is falling behind and the iterator age is increasing. Which action would most effectively improve throughput?

A.Switch from Kinesis Data Streams to Kinesis Data Firehose
B.Decrease the batch size in the consumer
C.Enable enhanced fan-out for the consumer
D.Increase the number of shards in the stream
AnswerD

More shards increase read capacity and parallelism.

Why this answer

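Option D is correct because increasing the number of shards increases read capacity and parallelism. Option A is wrong because Kinesis Data Firehose buffers records before delivery, which adds latency. Option B is wrong because decreasing the batch size means fewer records per read, which lowers throughput.

Option C is wrong because enhanced fan-out mainly benefits streams where several consumers share read throughput; it does not add shards for a single lagging consumer to process in parallel.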
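The shard-count reasoning can be made concrete with a rough sizing calculation. The sketch below assumes the standard per-shard limits of a provisioned stream with shared-throughput consumers (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads). The function name and the traffic figures are hypothetical.

```python
import math

# Per-shard limits assumed for a provisioned stream with shared-throughput consumers.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def required_shards(write_mb_s, records_s, read_mb_s):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


# Hypothetical clickstream: 6 MB/s in, 4,500 records/s, and two consumers
# that each read the full stream (12 MB/s of reads in total).
shards = required_shards(write_mb_s=6, records_s=4500, read_mb_s=12)
print(shards)
```

If the result is larger than the current shard count, the stream is undersized for the workload and resharding is the lever that raises both ingest and read capacity.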

30
Multi-Selecteasy

A data engineer needs to monitor the performance of an RDS for PostgreSQL database. Which THREE CloudWatch metrics are most useful for this purpose?

Select 3 answers
A.CPUUtilization
B.DatabaseConnections
C.FreeStorageSpace
D.NetworkThroughput
E.ReadLatency / WriteLatency
AnswersA, B, E

Indicates compute load.

Why this answer

CPUUtilization is a critical metric for monitoring RDS for PostgreSQL because high CPU usage can indicate inefficient queries, insufficient instance size, or contention. Sustained high CPU can lead to performance degradation and increased query latency, making it essential for capacity planning and troubleshooting.

Exam trap

The trap here is that candidates often confuse storage metrics (like FreeStorageSpace) with performance metrics, or assume NetworkThroughput is a performance indicator, when in fact latency and CPU metrics directly reflect query execution health.

31
MCQhard

A company uses Amazon Kinesis Data Streams with a Lambda consumer. The Lambda function is failing with 'ProvisionedThroughputExceededException' when writing to a DynamoDB table. Which action should the data engineer take to resolve this without losing data?

A.Reduce the number of Kinesis shards to lower the ingestion rate.
B.Increase the DynamoDB table's read capacity.
C.Configure a dead-letter queue (DLQ) on the Lambda function and increase the DynamoDB write capacity.
D.Disable retries on the Lambda function to avoid throttling.
AnswerC

The DLQ captures failed records, and increasing write capacity reduces throttling. Together, they prevent data loss.

Why this answer

Option C is correct because a dead-letter queue (DLQ) on the Lambda function captures failed records for later reprocessing, preventing data loss, while increasing the DynamoDB write capacity addresses the throttling itself. Option D (disable retries) would lose data. Option B (increase DynamoDB read capacity) does not help because the throttled operations are writes.

Option A (reduce shards) would reduce the ingestion rate but may not solve the issue and could cause data loss.

32
MCQeasy

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by an AWS Lambda function that processes records and stores results in Amazon DynamoDB. Recently, the Lambda function has been failing with ProvisionedThroughputExceededException errors. Which action should the data engineer take to resolve this issue?

A.Enable auto scaling on the DynamoDB table to handle increased write capacity.
B.Reduce the number of shards in the Kinesis stream to lower the ingestion rate.
C.Increase the batch size in the Lambda event source mapping to process more records per invocation.
D.Configure the Lambda function to discard records that cause throttling errors.
AnswerA

Auto scaling adjusts throughput based on actual usage, preventing throttling.

Why this answer

Option A is correct because enabling DynamoDB auto scaling dynamically adjusts throughput to match demand. Option B is wrong because reducing the number of shards limits ingestion capacity and does not address the DynamoDB write throttling. Option C is wrong because increasing the Lambda batch size would increase the write requests per invocation, worsening the problem.

Option D is wrong because discarding throttled records would lose data.
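Throttling of this kind appears when the write capacity consumed exceeds what the table provides. As a rough illustration, the sketch below estimates write capacity units (WCU) for standard, non-transactional writes, where one WCU covers one write per second for an item of up to 1 KB. The function name and the load figures are hypothetical.

```python
import math


def required_wcu(writes_per_second, item_size_bytes):
    """Estimate write capacity units for standard (non-transactional) writes.

    One WCU covers one write per second for an item of up to 1 KB;
    larger items consume one WCU per started KB.
    """
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return writes_per_second * units_per_write


# Hypothetical load: 800 writes/s of 2.5 KB items against a table
# provisioned at 1,000 WCU.
needed = required_wcu(800, 2560)
print(needed, needed > 1000)
```

Here the load needs more capacity than the table provides, which is the situation that scaling the table's write capacity is meant to correct.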

33
MCQhard

A company runs a batch ETL job on Amazon EMR every night. Recently, the job started failing with 'Out of Memory' errors in the Spark executors. The data volume has grown 20% in the past month. The cluster uses uniform instance groups with 5 core nodes of r5.xlarge (4 vCPU, 32 GB RAM). Which change should the data engineer implement to resolve the issue with minimal cost increase?

A.Increase the number of core nodes to 7.
B.Change instance type to r5.2xlarge (8 vCPU, 64 GB RAM) for all nodes.
C.Configure instance fleets to include r5.xlarge and r5.2xlarge instances.
D.Tune Spark memory configurations to reduce executor memory overhead.
AnswerC

Instance fleets allow cost-effective scaling by mixing types.

Why this answer

Option C is correct because using instance fleets allows the cluster to include both r5.xlarge and r5.2xlarge instances, enabling the Spark executors to use the larger instances for memory-intensive tasks while still leveraging the existing r5.xlarge nodes. This provides a cost-effective way to handle the 20% data growth by adding memory capacity without replacing the entire cluster or over-provisioning all nodes. Instance fleets also support Spot Instances, which can further reduce costs while addressing the Out of Memory errors.

Exam trap

The trap here is that candidates often assume increasing the number of nodes (Option A) or tuning Spark memory settings (Option D) can solve memory issues, but they fail to recognize that the root cause is insufficient memory per executor, which is best addressed by adding larger instances via instance fleets to minimize cost increase.

How to eliminate wrong answers

Option A is wrong because simply increasing the number of core nodes to 7 does not increase the memory per executor; it only adds more nodes with the same 32 GB RAM each, which may not resolve the Out of Memory errors if individual executors are hitting their limits due to data skew or large partitions. Option B is wrong because changing all nodes to r5.2xlarge (64 GB RAM) would double the memory per node but also double the cost for the entire cluster, which is not the minimal cost increase solution. Option D is wrong because tuning Spark memory configurations (e.g., reducing executor memory overhead) cannot create additional physical memory; it only reallocates existing memory, which will not resolve the Out of Memory errors if the total available memory is insufficient for the increased data volume.
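The point about memory per executor can be illustrated with the arithmetic Spark uses on YARN: each executor container is the executor heap plus a memory overhead, which by default is the larger of 384 MiB and 10% of the heap. The sketch below is a simplified model under those defaults. The 24 GiB figure for memory that YARN may allocate on a node is a made-up value, not the real setting for any instance type.

```python
def container_mib(executor_memory_mib, overhead_factor=0.10, min_overhead_mib=384):
    """YARN container size for one Spark executor: heap plus memory overhead."""
    overhead = max(min_overhead_mib, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead


def executors_per_node(yarn_memory_mib, executor_memory_mib):
    """How many executor containers fit in the memory YARN may allocate on one node."""
    return yarn_memory_mib // container_mib(executor_memory_mib)


# Hypothetical node on which YARN may allocate 24 GiB.
print(container_mib(8192))               # 8192 + 819 = 9011
print(executors_per_node(24576, 8192))   # 24576 // 9011 = 2
```

Adding nodes of the same size increases the number of containers but leaves the ceiling on any single container unchanged, which is why node count alone does not help an executor that has hit its own limit.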

34
MCQhard

A data engineer is using AWS DMS to migrate a 2 TB Oracle database to Amazon Aurora PostgreSQL. The migration is running in full load mode with ongoing replication. After the full load completes, the ongoing replication task shows a 'TargetMetadata' error: 'ERROR: duplicate key value violates unique constraint'. The engineer verifies that the target table already contains the data. What should the engineer do to resolve this issue?

A.Enable 'BatchApplyEnabled' and set 'TaskRecoveryTableEnabled' to false in the task settings.
B.Disable the unique constraint on the target table.
C.Truncate the target table and restart the full load.
D.Drop the indexes on the target table and recreate them after the migration.
AnswerA

Batch apply minimizes duplicate key errors, and disabling recovery table prevents re-application of already-applied changes.

Why this answer

Option A is correct because the error indicates that the task is trying to insert rows that already exist. Enabling 'BatchApplyEnabled' and setting 'TaskRecoveryTableEnabled' to false prevents duplicate handling errors during replication. Option C is wrong because truncating would lose data.

Option B is wrong because disabling constraints may cause data integrity issues. Option D is wrong because dropping indexes does not prevent duplicate key violations.
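For reference, the two settings named in the answer live in the task-settings JSON document of a DMS task. The fragment below is a minimal sketch that assumes both keys sit in the TargetMetadata section and omits every other setting. It only builds the JSON text and does not call AWS.

```python
import json

# Fragment of a DMS task-settings document; only the two settings
# discussed above are shown, everything else is omitted.
task_settings = {
    "TargetMetadata": {
        "BatchApplyEnabled": True,
        "TaskRecoveryTableEnabled": False,
    }
}

settings_json = json.dumps(task_settings, indent=2)
print(settings_json)
```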

35
Multi-Selectmedium

A data engineer is troubleshooting a slow-running Amazon Athena query on a large dataset stored in S3. The query scans many small files. Which TWO actions can improve query performance?

Select 2 answers
A.Increase the number of files to increase parallelism
B.Disable S3 server-side encryption
C.Concatenate small files into larger files
D.Partition the data by a frequently filtered column
E.Convert files from CSV to JSON
AnswersC, D

Reduces file open overhead.

Why this answer

Option C is correct because compacting small files into larger ones reduces overhead. Option D is correct because partitioning limits data scanned. Option E is wrong because JSON, like CSV, is a row-based text format, so the conversion does not reduce the data scanned; a columnar format such as Parquet would help, but it is not offered here.

Option A is wrong because more files increase overhead. Option B is wrong because disabling encryption does not affect performance.

36
MCQeasy

A data engineer is designing a disaster recovery strategy for an Amazon Redshift data warehouse. The RPO (Recovery Point Objective) is 1 hour, and the RTO (Recovery Time Objective) is 2 hours. Which approach meets these requirements with the least operational overhead?

A.Set up a secondary Redshift cluster in another region and use AWS DMS for continuous replication.
B.Configure automated cross-region snapshot copy to another region.
C.Export the data to S3 daily using UNLOAD and copy to another region.
D.Enable automated snapshots with a retention period of 1 day.
AnswerB

Cross-region snapshots protect against regional failures and can restore quickly.

Why this answer

Option B is correct because automated snapshots to S3 with cross-region copy provide up-to-the-hour recovery and fast restore. Option D is incorrect because automated snapshots are enabled by default but stay in the source region. Options A and C have higher overhead and may not meet the RPO or RTO; a daily export in particular cannot meet a 1-hour RPO.

37
Multi-Selectmedium

A data engineer is troubleshooting an Amazon Redshift cluster that has experienced a node failure. The engineer needs to ensure that the cluster is highly available and can withstand a single node failure without downtime. Which TWO actions should the engineer take?

Select 2 answers
A.Enable automated snapshots with cross-region copy.
B.Enable concurrency scaling to handle increased read traffic.
C.Deploy the cluster as a single-node cluster for simplicity.
D.Place the cluster in a public subnet with an internet gateway.
E.Use a multi-node cluster with RA3 node types.
AnswersA, E

Cross-region snapshots allow recovery from region failures.

Why this answer

Options A and E are correct. A multi-node cluster with RA3 node types provides managed storage and resilience. Enabling cross-region snapshot copy ensures data can be restored in another region.

Option B is incorrect because concurrency scaling does not provide high availability. Option C is incorrect because single-node clusters are not highly available. Option D is incorrect because VPC routing does not affect node failure.

38
MCQmedium

A data engineer needs to set up a data catalog for a new data lake in AWS Glue. The data resides in S3 in Parquet format. The engineer wants to ensure that the schema is automatically detected and updated when new columns are added to the data. Which configuration should the engineer use?

A.Add a partition index to the Glue Data Catalog table.
B.Configure the crawler's 'Schema updates' option to 'Update the table schema'.
C.Set the crawler's 'Database' output to a new database.
D.Enable partition indexing on the table.
AnswerB

This enables automatic schema detection and updates.

Why this answer

Option B is correct because Glue crawlers can be configured to update the table schema when new columns are detected. Option A is wrong because a partition index does not update the schema. Option C is wrong because sending the crawler's output to a new database changes where tables are created, not how schema changes are handled.

Option D is wrong because enabling partition indexing does not update the schema.
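In API terms, the crawler's schema-update behavior is expressed through its SchemaChangePolicy. The sketch below builds the arguments one might pass to a crawler definition (for example via boto3's create_crawler), assuming that UPDATE_IN_DATABASE corresponds to the console's schema-update choice. The crawler name, role, database, and bucket path are hypothetical placeholders, and nothing here calls AWS.

```python
def crawler_schema_policy(update_schema=True):
    """Build the SchemaChangePolicy argument for a Glue crawler definition."""
    return {
        # UPDATE_IN_DATABASE applies detected schema changes to the catalog table;
        # LOG only records them.
        "UpdateBehavior": "UPDATE_IN_DATABASE" if update_schema else "LOG",
        "DeleteBehavior": "LOG",
    }


# Hypothetical crawler definition; every name and path is a placeholder.
crawler_args = {
    "Name": "example-parquet-crawler",
    "Role": "ExampleGlueCrawlerRole",
    "DatabaseName": "example_db",
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/data/"}]},
    "SchemaChangePolicy": crawler_schema_policy(),
}
print(crawler_args["SchemaChangePolicy"])
```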

39
MCQhard

A data team runs a daily AWS Glue ETL job that processes data from an Amazon Redshift cluster and writes results to Amazon S3. The job completes successfully but takes 2 hours longer than expected. The job uses the JDBC connection to Redshift. The Redshift cluster is 4 dc2.large nodes. The Glue job has 10 workers of type G.1X. Which change would MOST likely reduce the job duration?

A.Use Redshift Spectrum to query data directly from S3
B.Use the S3 staging option in the Glue connection to unload data from Redshift to S3 first
C.Increase the Redshift cluster size to 8 nodes
D.Increase the number of Glue workers to 20
AnswerB

UNLOAD is parallel and faster than JDBC; Glue can then read from S3.

Why this answer

The JDBC connection in AWS Glue reads data row-by-row from Redshift, which is slow for large datasets. By enabling the S3 staging option in the Glue connection, the job uses Redshift's UNLOAD command to export data to S3 in parallel, then Glue reads from S3. This bypasses the JDBC bottleneck and leverages Redshift's massively parallel processing (MPP) to export data much faster.

Exam trap

The trap here is that candidates assume the bottleneck is either Redshift compute (C) or Glue parallelism (D), when in fact the JDBC driver's single-threaded row-by-row fetch is the primary performance limiter.

How to eliminate wrong answers

Option A is wrong because Redshift Spectrum queries data directly from S3, but the source data is in Redshift, not S3; Spectrum does not help extract data from Redshift. Option C is wrong because the bottleneck is the JDBC connection, not Redshift compute capacity; adding more Redshift nodes would not speed up a single-threaded JDBC read. Option D is wrong because increasing Glue workers only helps if the job is CPU-bound or parallelizable; the JDBC read is I/O-bound and limited by the single connection, so more workers would not reduce the 2-hour delay.

40
MCQeasy

A data analyst needs to query a large Amazon S3 bucket containing CSV files using Amazon Athena. The bucket has millions of small files (less than 1 MB each). The analyst reports that queries are very slow and often time out. The data is partitioned by date and the partition columns are defined in the table. What is the most effective way to improve query performance?

A.Convert the files to Apache Parquet format using an AWS Glue ETL job.
B.Run a compaction job to consolidate small files into fewer larger files (e.g., 128 MB each).
C.Add more partitions by including hour and minute as partition keys.
D.Use S3 Select to push down filtering to S3 before Athena processes the data.
AnswerB

Consolidating small files reduces the overhead of listing and reading many objects, significantly improving Athena performance.

Why this answer

Option B is correct because many small files cause high overhead for listing and reading files in Athena. Compacting small files into fewer larger files (e.g., 128 MB each) reduces the I/O operations and improves performance. Option A is wrong because file format conversion (e.g., to Parquet) helps but does not address the small file problem.

Option C is wrong because adding hour and minute partitions would increase overhead. Option D is wrong because S3 Select retrieves a subset of data from a single object; it does not optimize reads across many files.
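To see how much a compaction job shrinks the object count, the sketch below plans batches of roughly 128 MiB from a list of file sizes. It is a simple greedy grouping for illustration only, not how any particular AWS service performs compaction, and the file sizes are made up.

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024):
    """Greedily group consecutive files into batches of at most target_bytes
    (a single file larger than the target gets a batch of its own)."""
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical partition holding 1,000 files of 512 KiB each.
sizes = [512 * 1024] * 1000
batches = plan_compaction(sizes)
print(len(sizes), "files ->", len(batches), "files after compaction")
```

Each batch would be rewritten as one output object, so the query engine lists and opens a handful of files instead of a thousand.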

41
Matchingmedium

Match each AWS data analytics service to its primary function.

Concepts
Matches

Serverless SQL query on S3

Business intelligence and dashboards

Data lake setup and access control

Real-time SQL on streaming data

Query data in S3 from Redshift

Why these pairings

Analytics services cover different needs.

42
MCQhard

A company uses AWS Lake Formation to manage data lake permissions. A data engineer needs to grant a group of analysts SELECT permission on a set of tables in the 'analytics' database, but only for columns that are not classified as 'PII'. Which approach should the engineer use?

A.Grant SELECT on the entire database and rely on analysts to avoid PII columns.
B.Create an IAM policy that denies access to PII columns.
C.Use an S3 bucket policy to restrict access to objects containing PII data.
D.Use Lake Formation tag-based access control (LF-TBAC) to grant SELECT on columns without the 'PII' tag.
AnswerD

LF-TBAC allows column-level permissions by matching tags on columns with tags on the grant.

Why this answer

Option D is correct because LF-TBAC allows column-level permissions based on tags. Option B (IAM policy) does not support column-level restrictions. Option C (S3 bucket policy) is too coarse.

Option A (database-level grant) grants access to all columns.
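The effect of a tag-based grant can be pictured with a toy model: columns carry tags, and the grant covers only the columns whose tags satisfy the expression. The sketch below is not the Lake Formation API. Real LF-tags are key-value pairs and are evaluated by the service, whereas this model reduces them to a set of labels. The table and column names are invented.

```python
def selectable_columns(column_tags, excluded_tag="PII"):
    """Return the columns a grant would cover when tagged columns are excluded."""
    return [col for col, tags in column_tags.items() if excluded_tag not in tags]


# Hypothetical table; the column names and tags are made up for illustration.
orders = {
    "order_id": set(),
    "order_total": set(),
    "customer_email": {"PII"},
    "shipping_address": {"PII"},
}
print(selectable_columns(orders))
```

A column that is tagged later drops out of the result without any change to the grant itself, which is the operational advantage of managing access by tag.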

43
MCQeasy

Your organization uses Amazon Redshift for analytical workloads. You have noticed that queries are slow on a large fact table. The table is distributed by KEY on the customer_id column and sorted by transaction_date. The table is frequently updated with new records. To improve query performance, you decide to implement a distribution style that reduces data movement. Which action should you take?

A.Change the distribution style to ALL to put a copy of the table on every node.
B.Change the distribution style to AUTO to let Redshift choose the best distribution.
C.Change the distribution style to EVEN to distribute rows evenly across all nodes.
D.Change the sort key to include customer_id as well.
AnswerC

EVEN reduces data movement for large fact tables when join columns are not well-distributed.

Why this answer

Option C is correct because EVEN distribution spreads rows evenly across nodes, reducing data movement for queries that do not benefit from key distribution. Option A is wrong because ALL distribution duplicates the entire table on each node, which is inefficient for large tables. Option B is wrong because AUTO lets Redshift decide, and its choice may not be the best one for this workload.

Option D is wrong because changing the sort key does not reduce data movement.

44
Multi-Selecteasy

A data engineer is setting up a data pipeline using AWS Glue. The engineer wants to monitor job failures and receive notifications. Which TWO services can be used together for this purpose?

Select 2 answers
A.AWS Step Functions
B.Amazon CloudWatch
C.Amazon SNS
D.Amazon Kinesis Data Streams
E.Amazon SQS
AnswersB, C

Glue publishes job metrics to CloudWatch.

Why this answer

Amazon CloudWatch (B) is correct because it is the native monitoring service for AWS Glue, capturing job metrics, logs, and state changes. You can configure CloudWatch alarms to trigger on job failures, which then invoke Amazon SNS (C) to send notifications via email, SMS, or other endpoints. Together, they provide a complete monitoring and alerting solution without additional orchestration.

Exam trap

The trap here is that candidates may confuse AWS Step Functions (A) as a monitoring tool because it can orchestrate retries, but it does not natively send notifications and is not the primary service for monitoring Glue job failures.
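One common wiring for this pairing, assumed here rather than taken from the question, is a CloudWatch Events (now EventBridge) rule that matches Glue job state changes and targets an SNS topic. The sketch below shows such an event pattern and a deliberately simplified matcher for trying it against a sample event. The matcher is an illustration, not the service's matching logic, and the job name is hypothetical.

```python
# Event pattern for a rule that matches Glue job runs ending in failure or timeout.
glue_failure_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}


def matches(pattern, event):
    """Simplified matcher: each pattern key must exist in the event, lists hold
    the allowed values, and nested dicts are matched recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical event; the job name is made up.
failed_run = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {"jobName": "example-etl", "state": "FAILED"},
}
print(matches(glue_failure_pattern, failed_run))
```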

45
MCQeasy

A data engineer needs to schedule a daily ETL job that runs on Amazon EMR. The job should be triggered automatically and send an email on failure. Which AWS service should the engineer use to orchestrate the job?

A.Amazon EventBridge
B.AWS Step Functions
C.Amazon Simple Queue Service (SQS)
D.Amazon CloudWatch Events
AnswerB

Orchestrates EMR steps and integrates with SNS.

Why this answer

Option B is correct because AWS Step Functions can orchestrate EMR steps and integrate with Amazon SNS for notifications. Option D is wrong because Amazon CloudWatch Events can trigger Lambda but not directly orchestrate EMR steps. Option C is wrong because Amazon SQS is a message queue.

Option A is wrong because Amazon EventBridge can trigger but not orchestrate complex workflows.
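A minimal sketch of such a workflow, written as an Amazon States Language definition built in Python, is shown below. It runs one EMR step, and on any error publishes to an SNS topic before marking the execution as failed. The cluster ID, script path, topic ARN, and state names are hypothetical placeholders, and the daily trigger is not shown.

```python
import json

# Illustrative state machine: one EMR step with an SNS notification on failure.
# Every identifier below (cluster ID, S3 path, topic ARN) is a placeholder.
state_machine = {
    "Comment": "Run one EMR step and notify on failure (illustrative)",
    "StartAt": "RunEtlStep",
    "States": {
        "RunEtlStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId": "j-EXAMPLE",
                "Step": {
                    "Name": "nightly-etl",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py"],
                    },
                },
            },
            # Any error in the step routes to the notification state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:example-etl-alerts",
                "Message": "Nightly EMR ETL step failed",
            },
            "Next": "MarkFailed",
        },
        "MarkFailed": {"Type": "Fail", "Error": "EtlStepFailed"},
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

Ending in a Fail state after the notification keeps the execution history accurate: the run shows as failed rather than succeeded simply because the alert was delivered.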

46
MCQeasy

A data engineer is monitoring an Amazon EMR cluster and notices that the cluster is running out of disk space on the core nodes. Which action can be taken to resolve this issue?

A.Reduce the retention period of data stored on HDFS
B.Change the core node instance type to a compute-optimized type
C.Increase the EBS volume size attached to core nodes
D.Use Spot Instances for core nodes
AnswerC

More EBS capacity directly adds disk space.

Why this answer

Option C is correct because increasing the size of the EBS volumes attached to core nodes provides more storage. Option B is wrong because changing instance type affects memory/CPU, not disk. Option D is wrong because Spot Instances are cheaper but do not add disk.

Option A is wrong because reducing data retention is not always feasible and may lose data.

47
MCQmedium

Your company has an Amazon S3-based data lake partitioned by year/month/day. An AWS Glue crawler runs daily to update the Data Catalog. A Spark job on Amazon EMR reads the latest partition and performs transformations. Recently, the Spark job has been failing with a 'FileNotFoundException' for a file that is expected to exist. You check the S3 bucket and see that the file exists. The job is configured to use S3 as the direct input source with EMRFS consistent view enabled. The IAM role for the EMR cluster has full S3 access. What is the most likely cause?

A.The Data Catalog partition metadata is out of sync because the crawler runs after the job.
B.The Glue crawler did not update the Data Catalog because of insufficient permissions.
C.The EMR cluster is experiencing eventual consistency issues despite EMRFS consistent view.
D.The file was deleted by a lifecycle policy after the job started.
AnswerC

Even with consistent view, there can be short propagation delays; the job might be reading a directory listing that hasn't updated.

Why this answer

Option C is correct because even with EMRFS consistent view enabled, there can be a propagation delay between the crawler updating the catalog and the actual S3 objects being visible. The Spark job may be reading stale metadata. Option A is wrong because the crawler updates the catalog, but the job may read from S3 directly.

Option D is wrong because the file still exists in the bucket, so a lifecycle policy did not delete it. Option B is wrong because the catalog is updated.

48
MCQeasy

A data engineer is troubleshooting an Amazon Redshift cluster that is running slowly. The cluster has 4 dc2.large nodes. The engineer runs a query that scans a large table and notices that the query uses only a single slice instead of all slices. The table is distributed with DISTSTYLE ALL. What is the most likely reason for the query using only one slice?

A.The query is running on the leader node instead of the compute nodes.
B.The table uses DISTSTYLE ALL, which stores the entire table on a single slice per node.
C.The workload management (WLM) queue is configured with a single query slot.
D.The table does not have a sort key defined.
AnswerB

DISTSTYLE ALL replicates the table to each node, but it is stored on one slice per node, limiting parallelism.

Why this answer

Option B is correct because DISTSTYLE ALL copies the entire table to every node, and within each node the copy is stored on a single slice rather than spread across slices. The number of slices depends on the node type; dc2.large nodes have 2 slices each. A query that scans the whole table may therefore use only one slice per node, leaving the remaining slices underutilized. Option D is wrong because the sort key affects data ordering, not slice utilization.

Option C is wrong because WLM queues affect concurrency, not slice usage. Option A is wrong because the scan runs on the compute nodes, not the leader node; it uses multiple nodes, but only one slice per node.

49
MCQeasy

A data engineer is running an AWS Glue ETL job that reads from an Amazon RDS MySQL database and writes to Amazon S3. The job fails with a 'Communications link failure' error. The security group for the RDS instance allows inbound traffic from the Glue job's security group. What is the most likely cause of the failure?

A.The JDBC connection string in the Glue job does not include the database name.
B.The Glue job is using the wrong JDBC driver.
C.The Glue job's security group does not allow outbound traffic to the RDS security group on port 3306.
D.The IAM role used by the Glue job does not have rds:Connect permission.
AnswerC

Without outbound rule, the connection fails.

Why this answer

Option C is correct because AWS Glue ETL jobs run in a VPC that requires outbound security group rules to initiate connections to RDS. Even if the RDS security group allows inbound traffic from the Glue security group, the Glue security group must also have an outbound rule allowing traffic to the RDS security group on port 3306 (MySQL default port). Without this outbound rule, the TCP handshake from Glue to RDS fails, causing a 'Communications link failure'.

Exam trap

The trap here is that candidates assume only inbound rules matter for security groups, but outbound rules are equally critical for initiating connections from the client (Glue) to the server (RDS).

How to eliminate wrong answers

Option A is wrong because omitting the database name from the JDBC connection string would cause a different error (e.g., 'Unknown database' or connection rejection), not a 'Communications link failure', which indicates a network-level issue. Option B is wrong because AWS Glue automatically includes the correct JDBC driver for MySQL (compatible with Amazon RDS MySQL) when using the Glue connection type 'MySQL'; using the wrong driver would typically produce a class-not-found or driver-incompatibility error, not a communications link failure. Option D is wrong because IAM permissions for Glue jobs use actions like 'glue:GetConnection' and 'rds:DescribeDBInstances' to retrieve connection metadata, but there is no 'rds:Connect' IAM action; database authentication is handled via username/password in the Glue connection, not IAM.

50
MCQmedium

A company uses AWS Glue to run ETL jobs on a schedule. Recently, a job failed with the error: 'AnalysisException: cannot resolve '`column_name`' given input columns: ...'. The job reads from an Amazon S3 source that has a schema defined in the AWS Glue Data Catalog. What is the MOST likely cause?

A.The schema of the source data has changed and is not reflected in the Data Catalog.
B.The source data file is corrupted and cannot be parsed.
C.The IAM role associated with the Glue job does not have permissions to read the S3 bucket.
D.The data type of the column in the source does not match the Data Catalog definition.
AnswerA

Schema evolution without updating catalog causes column resolution errors.

Why this answer

Option A is correct because the error indicates a missing column. This typically happens when the source data schema changes (e.g., column renamed or dropped) but the Data Catalog schema is not updated. Option C is incorrect because missing S3 permissions on the IAM role would cause a different error (e.g., AccessDenied).

Option B is incorrect because a corrupted file would cause a read error, not a schema resolution error. Option D is incorrect because the 'cannot resolve' error is about column names, not data types.
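A quick way to confirm this kind of drift is to compare the columns the job refers to with the columns the input actually provides. The sketch below uses plain Python sets. The column names are invented, and in practice the two lists would come from the job script and from the current source data or catalog table.

```python
def missing_columns(referenced, available):
    """Columns the job refers to that the input no longer provides."""
    return sorted(set(referenced) - set(available))


# Hypothetical example: the source renamed customer_id to cust_id.
job_columns = ["order_id", "customer_id", "order_total"]
source_columns = ["order_id", "cust_id", "order_total"]
print(missing_columns(job_columns, source_columns))  # ['customer_id']
```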

51
MCQhard

A data engineer is troubleshooting an AWS Glue ETL job that fails with the error 'java.lang.OutOfMemoryError: Java heap space'. The job processes a large number of small files in Amazon S3. Which action would MOST effectively resolve the issue?

A.Enable S3 groupFiles option in the Glue job
B.Change the worker type to G.1X
C.Increase the number of workers in the Glue job
D.Use a G.2X worker type with more memory
AnswerA

Groups small files, reducing partitions and memory usage.

Why this answer

Option A is correct because grouping small files into fewer, larger splits reduces the number of Spark partitions and memory overhead. Option C is wrong because increasing the number of workers increases parallelism but may not fix heap space per worker. Option D is wrong because a G.2X worker type with more memory per worker could help, but grouping files is more effective.

Option B is wrong because changing the worker type to G.1X does not address the root cause of too many small files.
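File grouping is switched on through the S3 connection options of the Glue read. The sketch below only builds that options dictionary. The helper name and the bucket path are hypothetical, and the 128 MiB group size is an illustrative choice rather than a recommendation from the question.

```python
def grouped_s3_options(path, group_size_bytes=128 * 1024 * 1024):
    """Build S3 connection options that ask Glue to group small input files."""
    return {
        "paths": [path],
        "recurse": True,
        # Group files within each S3 partition into larger read units.
        "groupFiles": "inPartition",
        # Target group size in bytes, passed as a string.
        "groupSize": str(group_size_bytes),
    }


# Hypothetical input location.
options = grouped_s3_options("s3://example-bucket/raw/events/")
print(options)
```

In a Glue script these options would typically be passed as connection_options when creating a DynamicFrame from S3, so that each Spark task reads a group of files instead of a single small one.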

52
MCQmedium

A data engineer is troubleshooting a failed AWS Glue job that reads from an Amazon RDS for MySQL table. The error message indicates 'java.sql.SQLException: No suitable driver'. What is the most likely cause?

A.The MySQL JDBC driver JAR is not included in the Glue job's dependencies.
B.The Glue job is using the wrong JDBC driver class name.
C.The Glue job's VPC subnet does not have a route to the RDS instance.
D.The RDS instance is not publicly accessible.
AnswerA

Glue needs the JDBC driver in its classpath to connect to MySQL.

Why this answer

Option A is correct because the MySQL JDBC driver JAR must be available to the Glue job, for example through the job's dependent JARs path. Option B is incorrect because the driver class name is correct; the driver JAR is missing. Option C is incorrect because subnet routing does not affect driver loading.

Option D is incorrect because the error is about the driver, not the connection.

53
MCQmedium

An AWS Glue job that performs data transformation on large Parquet files in Amazon S3 is taking a long time to complete. The job uses the default number of DPUs. Which change would most likely improve the job's performance?

A.Increase 'Max capacity' (number of DPUs) for the job.
B.Use 'coalesce' to reduce the number of output files.
C.Reduce the number of partitions in the source data.
D.Change the input format from Parquet to CSV.
AnswerA

More DPUs provide more compute resources.

Why this answer

Option A is correct because increasing DPUs adds parallelism and memory, speeding up processing. Option D is incorrect because Parquet is already efficient; CSV would be slower. Option C is incorrect because reducing partitions may cause OOM.

Option B is incorrect because using 'coalesce' to consolidate output into fewer, larger files can reduce parallelism.

54
MCQmedium

A data pipeline uses AWS Step Functions to orchestrate multiple Lambda functions for data transformation. The pipeline occasionally fails with a 'StateMachineExecutionLimitExceeded' error. What is the MOST likely cause?

A.The API Gateway endpoint used by Step Functions has a rate limit.
B.The Lambda functions have reached their concurrent execution limit.
C.The account has reached the maximum number of concurrent state machine executions.
D.The state machine definition has a syntax error causing infinite loops.
AnswerC

Step Functions has a limit on concurrent executions; increase the limit or reduce concurrency.

Why this answer

Option C is correct because Step Functions has a default limit on concurrent executions (e.g., 1 million per account per region). Option B is wrong because Lambda concurrency limits would produce a different error. Option A is wrong because API Gateway is not involved.

Option D is wrong because state machine definition does not affect execution limits.

55
MCQmedium

A company is using Kinesis Data Firehose to deliver data to an S3 bucket. The delivery stream is failing with 'S3 bucket access denied' errors. The bucket policy allows the Firehose service principal. What could be the issue?

A.The S3 bucket is in a different VPC
B.The S3 bucket uses SSE-KMS and Firehose does not have KMS permissions
C.The S3 bucket name contains invalid characters
D.The IAM role assigned to Firehose lacks s3:PutObject permission
AnswerD

The role must have S3 write permissions.

Why this answer

Option D is correct because the trust policy only lets the Firehose service principal assume the delivery role; the role itself must still grant S3 permissions such as s3:PutObject. Option B is wrong because SSE-KMS involves KMS permissions, not S3 access permissions. Option C is wrong because the bucket name is part of the delivery stream configuration and is unrelated to an access-denied error.

Option A is wrong because VPC placement affects connectivity, not access denied after connection.

56
MCQeasy

Refer to the exhibit. A data engineer runs this CloudWatch Logs Insights query on a log group but gets no results. What is the most likely reason?

A.The query syntax is invalid
B.The query limit of 20 is too low
C.The time range is not specified and defaults to the last 15 minutes
D.There are no log events containing the string 'ERROR' in the log group
AnswerD

The filter matches only lines with ERROR.

Why this answer

Option D is correct because the query filters for 'ERROR', and if the logs don't contain that string, no results are returned. Option B is wrong because a limit of 20 only caps the number of rows returned; it cannot cause an empty result. Option A is wrong because the query is syntactically correct.

Option C is wrong because the query text contains no time filter; the time range is selected separately when the query runs, and no range returns results if no 'ERROR' events exist.

57
Multi-Selecthard

A data engineer is designing an Amazon Redshift data warehouse for a high-traffic analytics workload. The engineer needs to ensure fast query performance and minimize data movement. Which THREE design decisions should be made? (Choose THREE.)

Select 3 answers
A.Choose DISTSTYLE KEY for tables that are frequently joined.
B.Use the default distribution style for all tables.
C.Use DISTSTYLE ALL for all large fact tables.
D.Apply appropriate compression encodings to columns.
E.Define SORT KEYs on columns used in WHERE clauses.
AnswersA, D, E

KEY distribution collocates rows based on join keys, reducing data movement.

Why this answer

Options A, D, and E are correct. Distribution style KEY on join keys collocates data, sort keys on WHERE columns improve scan efficiency, and compression reduces I/O. Option C is wrong because ALL distribution stores a full copy of the table on every node, increasing storage and load time.

Option B is wrong because the default distribution is AUTO, which may not be optimal.
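The three design decisions can be seen together in one table definition. The following is a minimal sketch only: the `sales` table, its columns, and the choice of `az64` encoding are hypothetical, and the helper simply builds the DDL text rather than connecting to a cluster.

```python
def sales_table_ddl(dist_key: str = "customer_id", sort_key: str = "sale_date") -> str:
    """Build a CREATE TABLE statement for a hypothetical sales fact table.

    The statement combines a KEY distribution style on the join column,
    a sort key on the column used in WHERE clauses, and column encodings.
    """
    return (
        "CREATE TABLE sales (\n"
        "    sale_id     BIGINT        ENCODE az64,\n"
        "    customer_id BIGINT        ENCODE az64,\n"
        "    sale_date   DATE          ENCODE az64,\n"
        "    amount      DECIMAL(12,2) ENCODE az64\n"
        ")\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({sort_key});"
    )


print(sales_table_ddl())
```

Tables joined to `sales` on `customer_id` would use the same distribution key so that matching rows land on the same slice.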

58
MCQmedium

A data engineer is monitoring an AWS Glue ETL job that processes data from an S3 bucket and writes to a Redshift table. The job completes successfully but takes longer than expected. The engineer notices that the job uses 10 DPUs and the data size is 500 GB. The job runs in standard mode. Which change would MOST reduce job duration?

A.Increase the number of DPUs to 20.
B.Use a smaller worker type like G.1X.
C.Change the output format from Parquet to CSV.
D.Reduce the number of partitions in the data.
AnswerA

More DPUs provide more parallelism, reducing job execution time.

Why this answer

Option A is correct because increasing DPUs allows more parallelism if the job is distributed. Option B is wrong because using a smaller worker type reduces resources. Option C is wrong because converting to CSV increases file size and processing time.

Option D is wrong because reducing the number of partitions may increase data skew.

59
MCQmedium

A company uses AWS Glue DataBrew to clean and prepare data for machine learning. The source data is in an S3 bucket with server-side encryption using AWS KMS (SSE-KMS). The DataBrew project is set up with an IAM role that has permissions to read from the S3 bucket and use the KMS key. When the DataBrew job runs, it fails with an error indicating that it cannot access the data. The IAM role has the following policy: { 'Version': '2012-10-17', 'Statement': [ { 'Effect': 'Allow', 'Action': ['s3:GetObject', 's3:ListBucket'], 'Resource': ['arn:aws:s3:::my-bucket', 'arn:aws:s3:::my-bucket/*'] }, { 'Effect': 'Allow', 'Action': 'kms:Decrypt', 'Resource': 'arn:aws:kms:us-east-1:123456789012:key/my-key' } ] }. What is the most likely cause of the failure?

A.The IAM role is missing s3:PutObject permission on the DataBrew output bucket.
B.The IAM role is missing s3:ListBucket permission on the source bucket.
C.The S3 bucket is in a different region than the DataBrew project.
D.The IAM role is missing kms:GenerateDataKey permission for the KMS key.
AnswerA

DataBrew writes to its own S3 bucket for job outputs; missing write permission causes failure.

Why this answer

Option A is correct because DataBrew uses a separate S3 bucket for storing intermediate outputs and recipe results. The IAM role needs s3:PutObject permission on that bucket. The error typically manifests as access denied when DataBrew tries to write.

Option B is wrong because the role already includes s3:ListBucket. Option C is wrong because nothing in the scenario indicates that the bucket and the DataBrew project are in different Regions. Option D is wrong because reading SSE-KMS objects requires kms:Decrypt, which the role already has.
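To make the missing permission concrete, the sketch below rebuilds the policy from the question and appends a write statement. The output bucket name `my-output-bucket` is an assumption for illustration; the question does not name the output location.

```python
import json

# Statements copied from the policy shown in the question.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
        },
        {
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/my-key",
        },
    ],
}

# Assumption: job output is written to a separate bucket named my-output-bucket.
policy["Statement"].append(
    {
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-output-bucket/*",
    }
)

print(json.dumps(policy, indent=2))
```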

60
MCQmedium

A company uses Kinesis Data Streams to ingest real-time clickstream data. The data is processed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function has been failing with 'ProvisionedThroughputExceededException' errors. Which action should be taken to resolve this issue?

A.Increase the Lambda function's memory allocation.
B.Increase the number of shards in the Kinesis stream.
C.Enable enhanced fan-out for the stream.
D.Reduce the batch size in the Lambda event source mapping.
AnswerB

More shards increase throughput capacity.

Why this answer

The 'ProvisionedThroughputExceededException' error indicates that the Lambda function is reading data from the Kinesis stream faster than the stream's shard-level throughput limits allow. Each shard in a Kinesis Data Stream supports up to 2 MB/s of read throughput and up to 5 read transactions per second. Increasing the number of shards distributes the read load across more shards, raising the total available read throughput and resolving the throttling.

Exam trap

The trap here is that candidates confuse ProvisionedThroughputExceededException (a read-side throttling error) with write-side throttling, leading them to choose options like reducing batch size or increasing memory, which do not address the shard-level read throughput limit.

How to eliminate wrong answers

Option A is wrong because increasing Lambda memory allocation improves compute performance (CPU, network) but does not affect the read throughput limits of the Kinesis stream shards, which is the root cause of the throttling. Option C is wrong because enhanced fan-out provides dedicated 2 MB/s read throughput per consumer per shard, which reduces contention between consumers but does not increase the total read capacity of the stream; the error is from exceeding shard-level limits, not from consumer contention. Option D is wrong because reducing the batch size decreases the number of records per invocation, but the Lambda function still reads from the same shard at the same rate; the throttling is caused by exceeding the shard's read throughput, not by batch size.
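The 2 MB/s per-shard read limit quoted in the explanation turns shard sizing into simple arithmetic. This is a rough sketch that considers read bandwidth only and assumes shared-throughput consumers; the function name and the 7 MB/s workload figure are hypothetical.

```python
import math

# Per-shard read bandwidth for shared-throughput consumers, as quoted above.
READ_MB_PER_SEC_PER_SHARD = 2.0


def shards_needed(total_read_mb_per_sec: float) -> int:
    """Return the smallest shard count whose combined read bandwidth
    covers the aggregate read rate of all consumers."""
    return max(1, math.ceil(total_read_mb_per_sec / READ_MB_PER_SEC_PER_SHARD))


# Hypothetical workload: consumers read 7 MB/s in total.
print(shards_needed(7.0))  # 7 / 2 = 3.5, rounded up to 4
```

A real sizing exercise would also account for the write limits per shard and for how many consumers poll each shard.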

61
Multi-Selecthard

A data engineer is designing a data pipeline that ingests data from multiple sources into Amazon S3, then processes it with AWS Glue and loads it into Amazon Redshift. Which THREE practices should be implemented to ensure data quality?

Select 3 answers
A.Implement data validation checks at the ingestion stage
B.Use AWS Glue DataBrew for data profiling and schema enforcement
C.Compress data files to reduce storage costs
D.Use manual sampling to check data quality periodically
E.Set up Amazon CloudWatch alarms for pipeline failures and data anomalies
AnswersA, B, E

Early validation catches errors before processing.

Why this answer

Option A is correct because data validation at ingestion catches issues early. Option B is correct because schema enforcement prevents data type mismatches. Option E is correct because monitoring with CloudWatch allows proactive detection of failures.

Option C is wrong because compressing data does not ensure quality. Option D is wrong because manual sampling is not scalable.

62
MCQmedium

A data engineering team uses Amazon EMR to run Spark jobs on a transient cluster. The jobs read data from S3 and write results back to S3. The team notices that jobs are taking longer than expected. Which configuration change is most likely to improve performance?

A.Use a larger instance type for the master node.
B.Enable HDFS as the intermediate storage and copy data from S3 to HDFS before processing.
C.Increase the number of core nodes to improve parallelism.
D.Enable EMRFS consistent view and use S3 as the direct input source.
AnswerD

EMRFS consistent view reduces retries due to S3 eventual consistency, improving performance.

Why this answer

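Option D is correct because EMRFS consistent view addresses S3 eventual-consistency issues; without it, inconsistent listings can cause retries. Option A is wrong because the master node coordinates the cluster and does not run the Spark tasks that do the work. Option B is wrong because copying data from S3 into HDFS before processing adds an extra step and is slower than reading from S3 directly.

Option C is not the keyed answer: more core nodes add parallelism but do not remove the consistency-related retries described above.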

63
MCQmedium

A company runs a production Amazon Redshift cluster. The data engineering team notices that queries are running slowly during peak hours. The cluster's CPU utilization is consistently above 80%. Which action should the engineer take to improve query performance?

A.Move some tables to Amazon Redshift Spectrum.
B.Re-distribute the tables using a different distribution key.
C.Enable concurrency scaling.
D.Perform an elastic resize to add more nodes.
AnswerD

Elastic resize adds nodes and CPU capacity quickly.

Why this answer

Option D is correct because elastic resize allows adding nodes without downtime, addressing the CPU bottleneck. Option C is incorrect because concurrency scaling only helps with many concurrent queries, not CPU. Option B is incorrect because distributing data on a different key may not reduce CPU.

Option A is incorrect because Spectrum offloads to S3 but does not reduce cluster CPU usage.

64
MCQmedium

A team uses Amazon Redshift for analytics. They notice that some queries are slow and the system shows high disk usage. The team wants to improve query performance without adding more nodes. Which action should they take first?

A.Run the VACUUM and ANALYZE commands on the tables.
B.Enable compression on all columns.
C.Redistribute the tables by changing the distribution key to a column with high cardinality.
D.Modify the workload management (WLM) queue to increase concurrency.
AnswerA

VACUUM reclaims space, ANALYZE updates statistics.

Why this answer

Option A is correct because VACUUM and ANALYZE reclaim space and update statistics, which can significantly improve query performance. Option C is wrong because distribution key changes require table recreation. Option D is wrong because WLM queues affect concurrency, not disk usage.

Option B is wrong because compression encoding is set at table creation.

65
Multi-Selecthard

A company runs an Amazon EMR cluster processing data from S3. The data engineer notices that the cluster's task nodes are underutilized while core nodes are fully utilized. Which TWO steps should the engineer take to improve resource utilization?

Select 2 answers
A.Consolidate multiple small tasks into larger tasks.
B.Increase the number of core nodes.
C.Add more task nodes using Spot Instances.
D.Reduce the number of core nodes and increase the number of task nodes.
E.Move HDFS data from EBS to instance store volumes.
AnswersB, C

More core nodes distribute processing load.

Why this answer

Option B is correct because increasing the number of core nodes adds more capacity for processing. Option C is correct because adding task nodes on Spot Instances can offload work from core nodes. Option E is incorrect because instance store is temporary and not suitable for HDFS.

Option A is incorrect because consolidating tasks may not help. Option D is incorrect because reducing core nodes would worsen utilization.

66
MCQeasy

A company is using AWS Glue to catalog data stored in Amazon S3. The data is partitioned by year, month, and day. A data analyst reports that new partitions are not automatically discovered by the Glue crawler. The crawler runs on a schedule every hour. What is the MOST likely reason for the missing partitions?

A.The IAM role used by the crawler does not have permission to list the S3 bucket.
B.The Glue Data Catalog is not configured to use a Hive metastore.
C.The number of partitions exceeds the Glue catalog limit of 100,000.
D.The crawler schedule is set to run too frequently.
AnswerA

Without s3:ListBucket permission, the crawler cannot see new partitions.

Why this answer

Option A is correct because if the crawler's IAM role cannot list the S3 bucket, the crawler cannot list objects and discover partitions. Option C is incorrect because the crawler can handle many partitions. Option D is incorrect because the schedule is hourly, which should be sufficient.

Option B is incorrect because the crawler can discover partitions in S3 without a Hive metastore connection.

67
MCQhard

A company runs a nightly Amazon EMR job that processes data from S3 and writes results back to S3. The job fails with 'OutOfMemoryError' in the reduce phase. The cluster currently uses 5 m5.xlarge instances. Which cost-effective change should the data engineer make?

A.Increase the number of core nodes to 10.
B.Increase the number of reducers (mapreduce.job.reduces) and keep the same instance type.
C.Reduce the input data size by filtering early in the job.
D.Switch to r5.xlarge instances for more memory per instance.
AnswerB

More reducers reduce memory per reducer, preventing OOM.

Why this answer

Option B is correct because increasing the number of reducers distributes the memory load, and keeping the same m5.xlarge instances is cost-effective. Option D is wrong because r5 instances are memory-optimized but more expensive. Option A is wrong because increasing instance count may not help if reducer memory is the issue.

Option C is wrong because reducing input size may affect completeness.
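The claim that more reducers lower the memory each one needs is arithmetic. The sketch below assumes keys are evenly distributed, which real jobs often violate through skew; the 40 GB shuffle size is a made-up figure.

```python
def shuffle_gb_per_reducer(total_shuffle_gb: float, num_reducers: int) -> float:
    """Average shuffle data handled by each reducer, assuming an even key spread."""
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    return total_shuffle_gb / num_reducers


# Hypothetical job that shuffles 40 GB in total.
for reducers in (10, 20, 40):
    print(reducers, shuffle_gb_per_reducer(40.0, reducers))
```

Doubling the reducer count halves the average load per reducer without changing the instance type.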

68
MCQeasy

A data engineer is troubleshooting a failed Amazon Kinesis Data Firehose delivery stream. The stream is configured to deliver data to an Amazon S3 bucket. The error log shows: 'The destination S3 bucket's bucket policy does not allow the firehose to put objects.' What is the MOST likely issue?

A.The S3 bucket's ACL is configured to deny write access to the firehose.
B.The IAM role used by Firehose does not have the necessary permissions.
C.The S3 bucket policy does not include an Allow statement for the firehose to put objects.
D.The IAM role's trust policy does not allow Firehose to assume the role.
AnswerC

The bucket policy must explicitly grant s3:PutObject to the firehose's IAM role.

Why this answer

Option C is correct because the error states the bucket policy does not allow the firehose to put objects. The solution is to add an Allow statement in the bucket policy granting the firehose's IAM role permission to execute s3:PutObject. Option A is incorrect because the error is about bucket policy, not ACLs.

Option B is incorrect because the error is already about permissions. Option D is incorrect because the issue is at the S3 bucket policy level, not IAM role trust policy.

69
MCQhard

A company uses AWS DMS to migrate data from an on-premises Oracle database to Amazon Redshift. The migration is successful, but after a few days, data in Redshift becomes inconsistent with the source due to ongoing changes. The company needs to keep Redshift synchronized with minimal latency. Which approach should the data engineer use?

A.Configure DMS with ongoing replication using change data capture (CDC).
B.Use Amazon Redshift COPY with S3 staging and AWS Lambda triggers.
C.Schedule a full DMS load every night.
D.Set up Amazon Redshift Spectrum to query the Oracle database directly.
AnswerA

CDC captures changes continuously and applies them to Redshift.

Why this answer

AWS DMS supports ongoing replication using change data capture (CDC), which captures incremental changes from the Oracle source (via Oracle LogMiner or DMS Binary Reader) and applies them to Amazon Redshift in near real-time. This approach ensures that Redshift remains synchronized with the source database with minimal latency, meeting the requirement for ongoing consistency after the initial full load.

Exam trap

The trap here is that candidates may confuse Amazon Redshift Spectrum's ability to query external data with actual data replication, or assume that nightly batch loads (Option C) are sufficient for 'minimal latency' requirements, when DMS CDC is the only option that provides continuous, low-latency synchronization.

How to eliminate wrong answers

Option B is wrong because Amazon Redshift COPY with S3 staging and AWS Lambda triggers requires manual or event-driven extraction of data from Oracle, which introduces latency and complexity, and does not provide native CDC-based continuous replication. Option C is wrong because scheduling a full DMS load every night would leave Redshift stale between loads (up to 24 hours of inconsistency) and does not achieve minimal latency. Option D is wrong because Amazon Redshift Spectrum queries external data stored in Amazon S3, not an Oracle database, and it does not replicate or synchronize data into Redshift, so it cannot maintain a consistent local copy.
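In API terms, ongoing replication is selected through the task's migration type. The sketch below only assembles the request parameters that would be passed to boto3's `create_replication_task`; every ARN and the task identifier are placeholders, and the table mapping that includes all schemas and tables is an assumption.

```python
import json

# Selection rule that includes every table in every schema (an assumption).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

# Placeholder identifiers; replace with real ARNs before use.
task_params = {
    "ReplicationTaskIdentifier": "oracle-to-redshift-cdc",
    "SourceEndpointArn": "arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    "TargetEndpointArn": "arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    "ReplicationInstanceArn": "arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    # Full load first, then continuous change data capture.
    "MigrationType": "full-load-and-cdc",
    "TableMappings": json.dumps(table_mappings),
}

print(task_params["MigrationType"])
```

With credentials configured, the call would look like `boto3.client("dms").create_replication_task(**task_params)`.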

70
MCQhard

A company runs a time-series forecasting model that writes results to an S3 bucket every 5 minutes. A downstream ETL job reads this data, but sometimes fails because it encounters incomplete files (zero bytes). What is the MOST reliable way to ensure the ETL job only processes complete files?

A.Set an S3 Lifecycle policy to delete files smaller than 1 MB.
B.Use S3 Copy to move files to a 'processed' folder after the ETL job reads them.
C.Configure S3 Select to query the files and only return rows if the file is complete.
D.Use S3 Event Notifications to trigger a Lambda function that checks file size and then moves the file to a 'ready' prefix.
AnswerD

Lambda can verify completeness before moving, ensuring only complete files are processed.

Why this answer

Option D is correct because S3 Event Notifications fire only after an upload completes, so a Lambda function can check the file size and move only complete files to the 'ready' prefix that the ETL job reads. Option B is wrong because S3 Copy can't detect completeness. Option C is wrong because S3 Select still reads incomplete files.

Option A is wrong because S3 Lifecycle policies manage lifecycle, not completeness.
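The size-check-then-move pattern can be sketched as a small Lambda handler. This is an illustration under stated assumptions, not a reference implementation: the `incoming/` and `ready/` prefixes are hypothetical, and the S3 client is passed in so the logic can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

INCOMING_PREFIX = "incoming/"  # hypothetical prefix the model writes to
READY_PREFIX = "ready/"        # hypothetical prefix the ETL job reads from


def promote_complete_files(event: dict, s3_client) -> list:
    """Move non-empty objects from INCOMING_PREFIX to READY_PREFIX.

    `event` is an S3 event notification payload. `s3_client` is a boto3 S3
    client, or any object offering the same copy_object and delete_object
    methods. Returns the keys written under READY_PREFIX.
    """
    promoted = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        size = record["s3"]["object"].get("size", 0)
        if size == 0 or not key.startswith(INCOMING_PREFIX):
            continue  # skip zero-byte files and keys outside the incoming prefix
        ready_key = READY_PREFIX + key[len(INCOMING_PREFIX):]
        s3_client.copy_object(
            Bucket=bucket,
            Key=ready_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        promoted.append(ready_key)
    return promoted


def lambda_handler(event, context):
    import boto3  # imported lazily so the helper above runs without boto3 installed

    return promote_complete_files(event, boto3.client("s3"))
```

The notification would be scoped to the incoming prefix so that objects written under the ready prefix do not trigger the function again.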

71
MCQhard

A company uses Amazon DynamoDB as the primary data store for a high-traffic application. Recently, read latency has increased significantly. The DynamoDB table has on-demand capacity mode. Which action is MOST effective to reduce read latency?

A.Add a DynamoDB Accelerator (DAX) cluster in front of the table
B.Switch the table to provisioned capacity mode with higher read capacity
C.Increase the read capacity units in the table's auto scaling settings
D.Enable DynamoDB Global Tables to distribute reads across regions
AnswerA

DAX caches reads, reducing latency.

Why this answer

Option A is correct because adding a DynamoDB Accelerator (DAX) cluster provides an in-memory cache, reducing read latency significantly. Option B is wrong because on-demand capacity already scales automatically. Option C is wrong because increasing read capacity units is not applicable in on-demand mode.

Option D is wrong because Global Tables do not reduce read latency for local reads.

72
Multi-Selectmedium

A data engineer needs to ensure that sensitive data stored in Amazon S3 is encrypted at rest. Which TWO options meet this requirement? (Choose TWO.)

Select 2 answers
A.Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS)
B.Server-Side Encryption with S3-Managed Keys (SSE-S3)
C.Using a VPC to restrict network access
D.Enabling MFA Delete on the S3 bucket
E.Client-Side Encryption with SSL/TLS
AnswersA, B

SSE-KMS encrypts objects at rest using KMS keys.

Why this answer

Server-Side Encryption with AWS KMS-Managed Keys (SSE-KMS) allows you to enforce encryption at rest for S3 objects using a customer-managed or AWS-managed KMS key. This option meets the requirement because the encryption is applied server-side by S3 before the data is written to disk, and the data is decrypted automatically when accessed with appropriate permissions. SSE-KMS also provides an audit trail via AWS CloudTrail for every key usage.

Exam trap

The trap here is that candidates often confuse encryption in transit (SSL/TLS) with encryption at rest, or they mistakenly think network controls like VPCs or access controls like MFA Delete provide data encryption, when they only address different security domains.
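Both server-side options are requested per object through upload parameters. The helper below is a small sketch that builds keyword arguments for boto3's `put_object`; the helper name, bucket, key, and KMS key ID are hypothetical.

```python
from typing import Optional


def encrypted_put_kwargs(
    bucket: str, key: str, body: bytes, kms_key_id: Optional[str] = None
) -> dict:
    """Build put_object arguments that request server-side encryption.

    Without a KMS key ID the object is encrypted with SSE-S3; with one,
    it is encrypted with SSE-KMS under that key.
    """
    kwargs = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id is None:
        kwargs["ServerSideEncryption"] = "AES256"  # SSE-S3
    else:
        kwargs["ServerSideEncryption"] = "aws:kms"  # SSE-KMS
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs


print(encrypted_put_kwargs("my-bucket", "data/report.csv", b"a,b\n")["ServerSideEncryption"])
```

The result would be passed as `boto3.client("s3").put_object(**kwargs)`.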

73
MCQhard

A data engineer runs the above AWS CLI command to investigate who uploaded a file to an S3 bucket. The output shows the event was recorded. Which additional step is needed to confirm the identity of the user?

A.No additional step is needed; the 'Username' field already identifies the IAM user.
B.Use the 'MFA' field to check if multi-factor authentication was used.
C.View the 'accessKeyId' field in the CloudTrailEvent JSON.
D.Look up the 'sourceIPAddress' in the CloudTrailEvent.
AnswerA

The 'Username' field contains the full ARN of the IAM user who made the request.

Why this answer

Option A is correct because the 'Username' field shows the ARN of the IAM user, which is the identity that made the API call. Option C (access key ID) is not shown in the output. Option D (IP address) is not in the output.

Option B (MFA) is not recorded in this field.

74
Multi-Selecteasy

A data engineer is monitoring Amazon CloudWatch metrics for an Amazon Redshift cluster and notices high CPU utilization. The engineer wants to reduce CPU usage. Which TWO actions should the engineer take?

Select 2 answers
A.Enable concurrency scaling to offload read queries to additional clusters.
B.Increase the number of nodes in the cluster.
C.Optimize the table design by using sort keys and compression.
D.Run the VACUUM command on all tables.
E.Enable audit logging to monitor queries.
AnswersA, C

Offloads queries, reducing CPU on main cluster.

Why this answer

Options A and C are correct. Enabling concurrency scaling offloads queries, and using sort keys reduces the amount of data scanned. Option D is wrong because VACUUM does not reduce CPU significantly.

Option B is wrong because increasing the node count increases CPU capacity but does not reduce usage. Option E is wrong because enabling audit logging adds CPU overhead.

75
MCQmedium

Refer to the exhibit. A data engineer sees this error in CloudWatch Logs from an AWS Glue ETL job. The job reads from an S3 location that contains both .parquet and .csv files. What is the most likely cause?

A.The S3 object was deleted during the job execution.
B.The IAM role does not have permission to read the S3 object.
C.The job is reading a CSV file that was incorrectly placed in the directory with .parquet extension.
D.The Glue job does not have enough memory to parse the Parquet file.
AnswerC

The file might have .parquet extension but be CSV, or the job is reading all files regardless of extension.

Why this answer

Option C is correct because the error indicates that the object is not a valid Parquet file. Since the job expects Parquet, it likely encountered a CSV file. Option A is wrong because the error specifically says invalid Parquet, not missing file.

Option D is wrong because insufficient memory would cause OOM errors. Option B is wrong because the error is about the object's format, not permissions.
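A mislabeled file can be spotted before the job runs, because Parquet files start and end with the 4-byte magic number `PAR1`. The check below is a minimal sketch on in-memory bytes; the function name is hypothetical, and a real pipeline would read only the first and last bytes of each object.

```python
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(data: bytes) -> bool:
    """Return True if the bytes start and end with the Parquet magic number."""
    return (
        len(data) >= 2 * len(PARQUET_MAGIC)
        and data.startswith(PARQUET_MAGIC)
        and data.endswith(PARQUET_MAGIC)
    )


print(looks_like_parquet(b"PAR1" + b"\x00" * 16 + b"PAR1"))  # True
print(looks_like_parquet(b"id,name\n1,alice\n"))  # False
```

Passing this check does not prove the file is valid Parquet; it only rules out obvious impostors such as CSV text saved with a .parquet extension.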
