Knowledge + Practice

CCNA Data Operations Support Questions

75 of 387 questions · Page 5/6 · Data Operations Support topic · Answers revealed


301

MCQhard

A company runs a data pipeline that ingests user activity logs from an API gateway into an Amazon Kinesis Data Firehose delivery stream. The Firehose stream writes data to an S3 bucket. The data is then processed by a scheduled AWS Glue job that runs every hour. Recently, the company noticed that the data in S3 is incomplete: some logs from the API are missing. The Glue job processes all files in the S3 bucket. The Firehose stream has a buffer size of 5 MB and a buffer interval of 60 seconds. The API sends data at a rate of approximately 2 MB per minute. What should the company do to reduce data loss?

A.Decrease the buffer interval to 30 seconds.

B.Increase the Firehose buffer size to 10 MB.

C.Configure a Dead Letter Queue (DLQ) for the Firehose stream.

D.Enable data transformation with AWS Lambda to compress data.

AnswerC

A DLQ captures failed deliveries so data can be reprocessed.

Why this answer

Option C is correct because it is the only option that addresses failed deliveries. At about 2 MB per minute, the 5 MB buffer size is never reached within the 60-second buffer interval, so Firehose delivers on the interval. The buffer settings control when data is delivered, not whether it is delivered.

If a delivery fails, Firehose retries and eventually writes to the S3 bucket. Data loss occurs only if the delivery fails permanently, so the best practice is to capture failed records, through S3 backup or a Dead Letter Queue, and reprocess them.

Option A is wrong because decreasing the buffer interval makes deliveries more frequent but does not recover failed ones. Option B is wrong because at this data rate the interval fires first, so a larger buffer size changes nothing.

Option D is wrong because compressing data in a Lambda transformation does not prevent data loss.
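The buffering arithmetic in this question can be checked with a short sketch. It assumes a steady input rate, and the function name `first_flush_trigger` is made up for illustration; it is not part of any AWS SDK.

```python
def first_flush_trigger(rate_mb_per_min: float, buffer_mb: float, interval_s: float) -> str:
    """Return which buffering hint fires first for a steady input rate."""
    if rate_mb_per_min <= 0:
        # With no incoming data the size threshold can never be reached.
        return "interval"
    seconds_to_fill = buffer_mb / (rate_mb_per_min / 60.0)
    return "size" if seconds_to_fill <= interval_s else "interval"


# Values from the question: 2 MB/min, 5 MB buffer, 60 s interval.
print(first_flush_trigger(2.0, 5.0, 60.0))
# Option B (10 MB buffer) and option A (30 s interval) for comparison.
print(first_flush_trigger(2.0, 10.0, 60.0))
print(first_flush_trigger(2.0, 5.0, 30.0))
```

In all three cases the interval fires before the buffer fills, which is why changing either buffering hint does not change whether records arrive.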


302

MCQeasy

A data engineer is troubleshooting a failed AWS Glue job that writes results to Amazon S3. The error log shows 'AccessDenied' when trying to list the bucket. Which IAM policy statement should the engineer add to the Glue job's role?

A.s3:ListBucket

B.s3:PutObject

C.s3:DeleteObject

D.s3:GetObject

AnswerA

Required to list objects in the bucket.

Why this answer

Option A is correct because listing a bucket requires the s3:ListBucket permission. Option B is wrong because s3:PutObject is for writing, not listing. Option C is wrong because s3:DeleteObject is not needed.

Option D is wrong because s3:GetObject is for reading objects, not listing them.
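As a rough sketch, a policy statement granting this permission could look like the following. The bucket name is hypothetical, and the policy is built as a plain Python dictionary purely for illustration.

```python
import json

BUCKET = "example-results-bucket"  # hypothetical bucket name

list_bucket_statement = {
    "Effect": "Allow",
    "Action": "s3:ListBucket",
    # ListBucket is a bucket-level action, so the resource is the bucket ARN
    # itself, not the "bucket/*" object ARN used for Get/Put/DeleteObject.
    "Resource": f"arn:aws:s3:::{BUCKET}",
}

policy = {"Version": "2012-10-17", "Statement": [list_bucket_statement]}
print(json.dumps(policy, indent=2))
```

A common mistake is to attach s3:ListBucket to the object ARN (ending in `/*`), which leaves the list call denied.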


303

MCQeasy

A data engineer is monitoring an Amazon Kinesis Data Stream with a shard count of 10. The stream receives 5 MB/s of write traffic and 10 MB/s of read traffic. The engineer notices that writes are throttled with ProvisionedThroughputExceededException errors. Which action should the engineer take to resolve the throttling?

A.Increase the shard count to 20.

B.Decrease the shard count to 5.

C.Enable enhanced fan-out on the stream.

D.Configure auto-scaling on the stream.

AnswerA

Doubling shards doubles write capacity to 20 MB/s, eliminating throttling.

Why this answer

Option A is correct because each shard supports 1 MB/s of write capacity and 2 MB/s of read capacity. With 10 shards, the aggregate limits are 10 MB/s for writes and 20 MB/s for reads, so the 5 MB/s of writes and 10 MB/s of reads both fit in aggregate.

Because the limits apply per shard, the throttling may be due to uneven partition key distribution that overloads individual shards. Increasing the shard count to 20 provides 20 MB/s of write capacity and spreads the keys over more shards. Option B is wrong because decreasing the shard count reduces write capacity.

Option C is wrong because enabling enhanced fan-out increases read cost but does not affect write limits. Option D is wrong because a provisioned Kinesis data stream does not auto-scale; you must update the shard count yourself.
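The capacity figures above follow directly from the per-shard limits. A minimal sketch, using the 1 MB/s write and 2 MB/s read limits cited in the explanation (the function name is illustrative):

```python
WRITE_MB_PER_SHARD = 1.0  # per-shard write limit cited in the explanation
READ_MB_PER_SHARD = 2.0   # per-shard read limit cited in the explanation


def stream_capacity(shards: int) -> tuple[float, float]:
    """Return (write MB/s, read MB/s) for a stream with the given shard count."""
    return shards * WRITE_MB_PER_SHARD, shards * READ_MB_PER_SHARD


for shards in (10, 20):
    write_cap, read_cap = stream_capacity(shards)
    print(f"{shards} shards: write {write_cap} MB/s, read {read_cap} MB/s")
```

These are aggregate numbers. A single hot shard can still throttle at 1 MB/s even when the stream as a whole has headroom.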


304

MCQmedium

A data engineer is designing a data pipeline that ingests streaming data from an IoT fleet using Kinesis Data Streams and processes it with a Lambda function. The Lambda function often times out when the data volume spikes. What is the most scalable solution?

A.Reduce the batch size in the event source mapping.

B.Increase the Lambda function timeout to 15 minutes.

C.Increase the Lambda function memory and set reserved concurrency.

D.Increase the number of shards and use a Kinesis Data Analytics application for windowed aggregation before Lambda.

AnswerD

More shards increase parallelism, and pre-aggregation reduces Lambda load.

Why this answer

Option D is correct because increasing the shard count increases throughput and parallelism, and windowed aggregation in Kinesis Data Analytics reduces the volume that Lambda must process, so spikes no longer cause timeouts. Option A is wrong because reducing the batch size decreases throughput. Option B is wrong because a longer Lambda timeout may not be enough for large spikes.

Option C is wrong because reserved concurrency limits how far Lambda can scale.


305

MCQeasy

A data engineer notices that an AWS Glue ETL job is failing with a 'MemoryError' when processing a large dataset. Which approach should the engineer take to resolve this issue?

A.Increase the number of DPUs for the job.

B.Change the source file format from Parquet to JSON.

C.Reduce the number of partitions in the source data.

D.Use S3 Select to filter data before processing.

AnswerA

More DPUs provide more memory and compute resources.

Why this answer

Option A is correct because increasing the number of DPUs (Data Processing Units) allocates more memory and processing capacity to the job, which can resolve memory errors for large datasets. Option B is incorrect because changing the file format to JSON typically increases memory usage. Option C is incorrect because reducing the number of partitions may increase memory pressure.

Option D is incorrect because S3 Select does not help with memory in Glue jobs.


306

MCQmedium

A data engineer needs to set up a data pipeline that ingests CSV files from an S3 bucket, transforms them using AWS Glue, and loads the results into Amazon Redshift. The pipeline must handle schema evolution and data quality checks. Which combination of services is most appropriate?

A.Use S3 Events to trigger an AWS Lambda function that writes directly to Redshift

B.Use Amazon Athena to query data in S3 and insert results into Redshift via CTAS

C.Use Amazon Kinesis Data Firehose to transform and load data into Redshift

D.Use AWS Glue ETL jobs with Glue DataBrew for data quality and write to Redshift

AnswerD

Glue supports schema evolution and DataBrew provides data quality checks.

Why this answer

Option D is correct because Glue handles schema evolution and DataBrew provides data quality checks. Option A is wrong because Lambda is not ideal for large transforms. Option B is wrong because Athena cannot write to Redshift.

Option C is wrong because Kinesis Data Firehose is for streaming.


307

MCQmedium

A data engineer needs to monitor the number of records processed by an Amazon Kinesis Data Analytics application and trigger an alarm if the count drops below a threshold over 5 minutes. Which CloudWatch metric should be used?

A.millisBehindLatest (from KinesisDataAnalytics)

B.IncomingRecords (from Kinesis Streams)

C.DPUCount (from Glue)

D.IncomingBytes (from Kinesis Firehose)

AnswerA

This metric indicates how far behind the application is; a drop in processing can be inferred.

Why this answer

Option A is correct because Kinesis Data Analytics publishes 'millisBehindLatest' to report application progress. Option B is a Kinesis Data Streams metric. Option C is a Glue metric.

Option D is a Firehose metric.


308

MCQhard

Refer to the exhibit. This IAM policy is attached to a user who is trying to read the object s3://data-bucket/confidential/report.csv. The user's principal tag 'role' is set to 'analyst'. What will happen when the user attempts to read the object?

A.Denied because the Deny statement covers all actions under confidential

B.Allowed because there is an explicit Allow and no explicit Deny that matches

C.Denied because the condition in the Deny statement evaluates to true

D.Allowed because of the Allow statement for s3:GetObject

AnswerC

The condition StringNotEquals 'admin' is true for 'analyst', so Deny is applied.

Why this answer

Option C is correct because the Deny statement applies when the role tag is not 'admin'. The user's tag is 'analyst', so the condition matches and access is denied. Option A is wrong because the Deny is conditional: it applies only when the role tag is not 'admin', not to every request under the confidential prefix.

Option B is wrong because the explicit Deny does match this request. Option D is wrong because an explicit Deny overrides an Allow.
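The exhibit is not reproduced on this page, so the following is only a toy model of the rule being tested: an explicit Deny whose condition evaluates to true wins over any Allow. The statement layout, the `evaluate` helper, and the two sample statements are hypothetical simplifications (resource matching is ignored) and do not represent the actual policy or the full IAM evaluation logic.

```python
def evaluate(statements, action, principal_tags):
    """Simplified evaluation: any applicable Deny overrides any Allow."""
    allowed = False
    for stmt in statements:
        if action not in stmt["actions"]:
            continue
        cond = stmt.get("string_not_equals")  # (tag_key, value) or None
        if cond is not None:
            key, value = cond
            if principal_tags.get(key) == value:
                continue  # StringNotEquals is false, statement does not apply
        if stmt["effect"] == "Deny":
            return "Deny"  # explicit Deny wins immediately
        allowed = True
    return "Allow" if allowed else "Deny"  # no Allow means implicit deny


# Hypothetical stand-in for the exhibit.
statements = [
    {"effect": "Allow", "actions": {"s3:GetObject"}},
    {"effect": "Deny", "actions": {"s3:GetObject"},
     "string_not_equals": ("role", "admin")},
]

print(evaluate(statements, "s3:GetObject", {"role": "analyst"}))
print(evaluate(statements, "s3:GetObject", {"role": "admin"}))
```

With the tag set to 'analyst' the Deny condition is true and the request is denied; with 'admin' the Deny does not apply and the Allow takes effect.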


309

MCQmedium

A data pipeline using AWS Glue jobs is failing with 'Insufficient capacity' errors for Spark executors. Which action should the data engineer take to resolve this?

A.Reduce the number of workers in the Glue job configuration.

B.Increase the job timeout value.

C.Disable Spark UI logging.

D.Increase the number of workers (DPUs) in the Glue job configuration.

AnswerD

Increasing workers adds more computing capacity, resolving the 'Insufficient capacity' error.

Why this answer

Option D is correct because the error indicates resource limits; increasing the number of workers (DPUs) can resolve capacity issues. Option A (reduce workers) would worsen the problem. Option B (increase timeout) does not address capacity.

Option C (disable Spark UI logging) does not help.


310

Multi-Selecthard

A company is experiencing high costs from Amazon Redshift. The data engineer wants to optimize costs. Which THREE actions should the engineer take? (Choose THREE.)

Select 3 answers

A.Increase the frequency of automated snapshots.

B.Right-size the cluster based on workload analysis.

C.Increase the number of nodes to improve performance.

D.Purchase Reserved Instances for steady-state workloads.

E.Enable Concurrency Scaling and set up a usage limit.

AnswersB, D, E

Right-sizing ensures you only pay for needed resources.

Why this answer

Option B is correct because right-sizing the cluster based on workload analysis ensures that the provisioned resources (number and type of nodes) match the actual compute and storage demands. Over-provisioned clusters waste money on unused capacity, while under-provisioned clusters cause performance issues. Analyzing metrics like CPU utilization, disk usage, and query queue wait times helps identify the optimal node count and instance type, directly reducing costs.

Exam trap

The trap here is that candidates confuse cost optimization with performance improvement, leading them to select 'Increase the number of nodes' (Option C) thinking it will reduce costs by improving efficiency, when in fact it increases costs.


311

Multi-Selectmedium

A company uses Amazon S3 to store raw data and runs AWS Glue ETL jobs to transform it into Parquet. The data is then queried using Amazon Athena. Queries are slow and expensive due to high scan volumes. Which THREE design changes can improve query performance and reduce costs? (Select THREE.)

Select 3 answers

A.Increase the number of files by reducing file size to 1 MB

B.Convert the data to a columnar format like Parquet or ORC if not already

C.Compress the data using a splittable compression format like Snappy

D.Use bucketing on high-cardinality columns

E.Partition the data by commonly filtered columns such as date or region

AnswersB, C, E

Columnar formats store data by column, reducing I/O for queries that select few columns.

Why this answer

Option B is correct because columnar formats like Parquet or ORC store data by column rather than by row, allowing Athena to read only the columns needed for a query. This drastically reduces the amount of data scanned per query, directly lowering both latency and cost since Athena charges based on the volume of data read.

Exam trap

The trap here is that candidates may confuse bucketing with partitioning, or assume that increasing file count always improves parallelism, when in fact small files harm performance in distributed query engines like Athena.


312

MCQmedium

A company is using Amazon Athena to query data in an S3 bucket. Queries are failing with the error 'HIVE_PATH_ALREADY_EXISTS'. The data is partitioned by year, month, day. What is the MOST likely cause?

A.A partition was manually added to the Glue Data Catalog that already exists

B.The data format in the partition is inconsistent with the table schema

C.The S3 location for the partition is empty

D.The IAM role used by Athena lacks s3:ListBucket permission on the bucket

AnswerA

Attempting to add a duplicate partition causes this error.

Why this answer

Option A is correct because the error occurs when a partition is already registered in the Glue Data Catalog and a new ALTER TABLE ADD PARTITION tries to add it again. Option B would cause a schema mismatch error. Option C would cause a file-not-found condition.

Option D would cause a permission error.
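If the cause is a repeated ADD PARTITION, as the explanation describes, one way to make the statement safe to rerun is the IF NOT EXISTS clause of Athena's ALTER TABLE ADD PARTITION. The sketch below only builds the SQL string; the table name, S3 location, and the choice of zero-padded string partition values are assumptions for illustration.

```python
def add_partition_sql(table: str, year: int, month: int, day: int, location: str) -> str:
    """Build an idempotent ADD PARTITION statement for a year/month/day layout."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month:02d}', day='{day:02d}') "
        f"LOCATION '{location}'"
    )


# Hypothetical table and S3 path.
sql = add_partition_sql(
    "activity_logs", 2024, 3, 7,
    "s3://example-data-bucket/logs/year=2024/month=03/day=07/",
)
print(sql)
```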


313

MCQmedium

A company uses Amazon Kinesis Data Streams to ingest clickstream data from web applications. The data is consumed by a Kinesis Data Analytics application that runs SQL queries. Recently, the data analytics application has been falling behind, and the 'MillisBehindLatest' metric for the consumer has been increasing steadily. The shard count is 4, and the average records per second per shard is 200, with an average record size of 1 KB. The provisioned shard limit for the account is 10. Which action will resolve the issue?

A.Enable enhanced fan-out on the Kinesis stream and subscribe the analytics application to it.

B.Reduce the checkpoint interval on the Kinesis Client Library (KCL) consumer to commit offsets more frequently.

C.Increase the number of shards in the Kinesis stream to 8.

D.Increase the provisioned write capacity of the Kinesis stream by requesting a shard limit increase.

AnswerC

More shards increase total read capacity, allowing the consumer to process data faster.

Why this answer

Option C is correct because the consumer is falling behind due to insufficient read capacity. Increasing the number of shards to 8 increases the total read capacity, stays within the account's limit of 10 shards, and allows the consumer to keep up. Option A is wrong because enhanced fan-out gives each consumer dedicated throughput, but total throughput is still limited by the shard count.

Option B is wrong because the issue is not related to checkpointing. Option D is wrong because write capacity is not the issue; the consumer is behind.


314

MCQhard

A company runs a data pipeline that ingests streaming data from an IoT fleet into Amazon Kinesis Data Streams (KDS) with 50 shards. A Lambda function processes records from the stream and writes them to an Amazon DynamoDB table for real-time analytics. The Lambda function is configured with a batch size of 100 and a maximum batching window of 60 seconds. Recently, the company has been seeing an increasing number of 'WriteProvisionedThroughputExceededException' errors from DynamoDB, causing Lambda to retry and eventually send records to a dead-letter queue (DLQ). The DynamoDB table is provisioned with 5000 read capacity units (RCU) and 5000 write capacity units (WCU). The average item size is 1 KB. The KDS stream receives an average of 8000 records per second, each 2 KB in size. The Lambda function performs a simple transformation and writes each record individually to DynamoDB. The company wants to reduce the throttling errors without increasing the DynamoDB WCU provision. Which course of action is most likely to achieve this?

A.Modify the Lambda function to use DynamoDB BatchWriteItem to write records in batches of 25.

B.Increase the Lambda function's reserved concurrency to 1000.

C.Increase the Lambda function timeout to 5 minutes to allow more time for retries.

D.Increase the Lambda batch size to 500 and reduce the batching window to 30 seconds.

AnswerA

BatchWriteItem reduces the number of write API calls and the per-request overhead, which reduces throttling.

Why this answer

Option A is correct because DynamoDB's BatchWriteItem API writes up to 25 items per call, which reduces the number of write requests and the per-request overhead. Each item in a batch still consumes write capacity, so the saving comes from fewer API calls rather than from cheaper writes. This can reduce throttling without increasing WCU. Option B is wrong because higher reserved concurrency increases the number of concurrent invocations, potentially increasing throttling.

Option C is wrong because a longer Lambda timeout does not affect the rate of writes. Option D is wrong because a larger Lambda batch size processes more records per invocation, but if each record is still written individually, the number of write requests remains the same.
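A minimal sketch of the batching step follows, assuming records already arrive in DynamoDB's attribute-value format. The helper names and sample attributes are hypothetical, and the actual `batch_write_item` call (plus the retry of any UnprocessedItems it returns) is deliberately left out.

```python
MAX_BATCH_ITEMS = 25  # BatchWriteItem accepts at most 25 put/delete requests per call


def chunk(records, size=MAX_BATCH_ITEMS):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def build_request_items(table_name, batch):
    """Shape one batch as the RequestItems argument of BatchWriteItem."""
    return {table_name: [{"PutRequest": {"Item": item}} for item in batch]}


# One Lambda invocation with the configured batch size of 100 records.
records = [{"device_id": {"S": f"d{i}"}, "reading": {"N": str(i)}} for i in range(100)]
batches = list(chunk(records))
print(len(records), "records ->", len(batches), "BatchWriteItem calls")
```

For a 100-record invocation this replaces 100 individual writes with 4 calls, although the items themselves consume the same write capacity as before.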


315

Multi-Selectmedium

A data engineer is designing a disaster recovery plan for an Amazon RDS for MySQL database. The database must have a Recovery Point Objective (RPO) of less than 5 minutes and a Recovery Time Objective (RTO) of less than 30 minutes. Which TWO actions should the engineer take to meet these requirements?

Select 2 answers

A.Enable automated backups with a 1-day retention period and point-in-time recovery.

B.Enable Multi-AZ deployment.

C.Enable automated backups with a 5-minute retention period.

D.Create a cross-Region read replica.

E.Use a single-AZ instance with a standby in another Region.

AnswersA, B

Automated backups with point-in-time recovery allow restoring to any point within the retention period, achieving RPO of 5 minutes.

Why this answer

Options A and B are correct. Multi-AZ provides automatic failover to a standby in another AZ, meeting the RTO. Automated backups with point-in-time recovery provide an RPO of 5 minutes.

Option C is wrong because the backup retention period is set in days, not minutes. Option D is wrong because read replicas are for read scaling, and a cross-Region read replica has a higher RTO due to manual promotion. Option E is wrong because single-AZ does not provide automatic failover.


316

Multi-Selectmedium

A company is using Amazon DynamoDB as a data store for a real-time application. The application reads a single item by primary key and occasionally updates it. The data engineer notices high read latency during peak hours. Which TWO actions would most effectively reduce read latency?

Select 2 answers

A.Increase the read capacity units for the table.

B.Enable DynamoDB global tables.

C.Add a local secondary index on the table.

D.Disable auto-scaling and set a fixed read capacity.

E.Enable DynamoDB Accelerator (DAX) for the table.

AnswersA, E

More capacity reduces throttling and latency during peaks.

Why this answer

Option A is correct because increasing read capacity units reduces throttling, which reduces latency. Option E is correct because DynamoDB Accelerator (DAX) adds an in-memory cache, reducing read latency. Option B is wrong because global tables are for multi-Region replication, not for reducing read latency.

Option C is wrong because a secondary index does not help single-item reads by primary key. Option D is wrong because disabling auto-scaling could make things worse.


317

MCQmedium

A data engineer is troubleshooting a step function that orchestrates ETL jobs. The state machine fails with 'State Machine Execution Throttled' error. What should the engineer do to resolve this?

A.Reduce the number of steps in the state machine.

B.Set up a CloudWatch alarm to detect throttling and retry.

C.Adjust the API rate limits in the state machine definition.

D.Request a service quota increase for concurrent executions.

AnswerD

Increasing the limit resolves the throttling.

Why this answer

Option D is correct because the throttling is likely due to exceeding the default concurrent execution limit. Requesting a quota increase from AWS is the proper solution. Option A is wrong because reducing the number of steps does not affect concurrent execution limits.

Option B is wrong because CloudWatch alarms only monitor; they do not resolve throttling. Option C is wrong because the error is about execution throttling, not API throttling, so adjusting API rate limits is not relevant.


318

MCQmedium

A company uses AWS Kinesis Data Streams to ingest real-time data. The data engineer notices that the stream's 'WriteProvisionedThroughputExceeded' error occurs frequently during peaks. Which action should be taken to resolve this issue?

A.Increase the number of shards in the stream.

B.Modify the producer to use a different partition key.

C.Compress the data before sending to the stream.

D.Enable enhanced fan-out for consumers.

AnswerA

More shards provide higher write throughput.

Why this answer

Option A is correct because increasing the number of shards increases the write capacity, directly addressing the throughput exceeded error. Option B is wrong because a different partition key only redistributes records across the existing shards; it does not add write capacity. Option C is wrong because the error is about write throughput limits, not data format.

Option D is wrong because enhanced fan-out affects read throughput, and the error is about writes.


319

MCQhard

A data engineer is monitoring an Amazon Kinesis Data Streams application that processes real-time events. The application uses a Kinesis Client Library (KCL) consumer. The engineer notices that the consumer is lagging behind the producer, and the lag is increasing over time. The stream has 10 shards. Which action will MOST effectively reduce the lag?

A.Use multiple KCL workers per shard to increase processing capacity.

B.Increase the number of shards in the Kinesis data stream.

C.Decrease the number of records per shard per second.

D.Decrease the number of shards in the Kinesis data stream.

AnswerB

More shards increase the stream's read and write capacity.

Why this answer

Option B is correct because increasing the number of shards increases the stream's capacity and allows more parallel processing. Option A is wrong because a single worker per shard is the default; adding more workers per shard is not recommended. Option C is wrong because reducing the number of records per shard per second would decrease throughput.

Option D is wrong because decreasing the number of shards reduces capacity, worsening lag.


320

Multi-Selecthard

A company uses Amazon S3 to store sensitive data. The security team requires that all data in transit between on-premises applications and S3 be encrypted. The data engineer must implement a solution that meets this requirement without changing the applications. Which TWO solutions should the engineer consider? (Choose two.)

Select 2 answers

A.Enable S3 Transfer Acceleration on the bucket.

B.Enable default encryption on the S3 bucket.

C.Use server-side encryption with S3 managed keys (SSE-S3).

D.Use an S3 VPC Endpoint and enforce the use of HTTPS through bucket policies.

E.Use AWS Storage Gateway to mount S3 as a file system and configure it to use HTTPS.

AnswersD, E

VPC Endpoint with HTTPS policy ensures encrypted transit.

Why this answer

Option D is correct because an S3 VPC endpoint keeps traffic within AWS, and a bucket policy can enforce HTTPS. Option E is correct because mounting S3 as a file system using AWS Storage Gateway with HTTPS ensures encryption in transit. Option A is incorrect because S3 Transfer Acceleration does not enforce encryption; it can use HTTPS, but that is optional.

Option B is incorrect because default encryption is for data at rest, not in transit. Option C is incorrect because SSE-S3 is for data at rest.


321

Multi-Selecthard

A company is using AWS Glue to run ETL jobs that process data from Amazon S3 and load it into Amazon Redshift. The data engineer notices that the Glue job is failing with the error 'S3ServiceException: Access Denied' when writing to the staging S3 bucket. Which THREE actions should the engineer take to resolve this issue?

Select 3 answers

A.Ensure that the Glue job script is correctly referencing the S3 bucket path.

B.Verify that the IAM role used by the Glue job has the s3:PutObject permission for the staging bucket.

C.Ensure that the S3 bucket has a bucket policy that allows the AWS Glue service principal to write objects.

D.Verify that the IAM role has s3:GetObject permission for the source bucket.

E.Check the S3 bucket policy for the staging bucket and ensure it allows the Glue job's IAM role to perform s3:PutObject.

AnswersB, C, E

The role needs write permission to the staging bucket to store temporary files.

Why this answer

Options B, C, and E are correct. The IAM role must have s3:PutObject on the staging bucket to write temporary files (option B). The bucket policy must allow the Glue service principal to write objects (option C).

The bucket policy must also allow the Glue job's IAM role to perform s3:PutObject (option E). Option A is wrong because the error is an access denial on the staging bucket, not a problem with the path in the Glue job script. Option D is wrong because the error is about write access, not read.


322

MCQhard

A data engineer is investigating a failed AWS Glue job. The engineer runs the CLI command shown in the exhibit to retrieve the latest log stream. The output shows storedBytes: 0. What does this indicate?

A.The log stream is from a different Glue job.

B.The log stream is empty because the job is still running.

C.The Glue job failed before writing any log events to CloudWatch.

D.The log stream has been expired and deleted.

AnswerC

No logs were written, indicating early failure or logging misconfiguration.

Why this answer

Option C is correct because storedBytes: 0 means no log events were stored, likely because the job failed before writing any logs or logging was not enabled. Option A is wrong because the command retrieved the latest stream. Option B is wrong because the job has already failed, so it is not still running.

Option D is wrong because the stream exists; the logs did not expire, they were never written.


323

MCQeasy

A data engineer is troubleshooting a nightly AWS Glue ETL job that reads from an Amazon RDS for MySQL table and writes to an Amazon S3 bucket in Parquet format. The job runs successfully most days, but occasionally fails with the error 'ERROR: An error occurred while calling o67.pyWriteDynamicFrame. The transaction log for the database is full due to 'LOG_BACKUP'.' What is the MOST likely cause of this error?

A.The MySQL database has reached its maximum number of concurrent connections.

B.The AWS Glue job does not have sufficient permissions to write to the S3 bucket.

C.The AWS Glue job is configured with an incorrect 'writeDynamicFrame' method.

D.The MySQL database transaction log needs to be backed up to free space.

AnswerD

The error 'LOG_BACKUP' indicates that the transaction log is full and requires a backup to truncate it.

Why this answer

The error message 'The transaction log for the database is full due to 'LOG_BACKUP'' indicates that the MySQL database's transaction log has reached its maximum size because it has not been backed up and truncated. In MySQL, the transaction log (often the InnoDB redo log or binary log) must be backed up periodically to free space; otherwise, write operations fail. This is a database-side issue, not a Glue or permissions problem, so the correct action is to back up the transaction log to release space.

Exam trap

The trap here is that candidates may confuse a database-side resource exhaustion error (transaction log full) with a permissions or configuration issue in AWS Glue, leading them to incorrectly select options related to Glue permissions or method syntax.

How to eliminate wrong answers

Option A is wrong because the error specifically mentions the transaction log being full, not a limit on concurrent connections; a connection limit would produce a 'Too many connections' error. Option B is wrong because insufficient S3 write permissions would result in an access denied or authorization error, not a database transaction log error. Option C is wrong because the 'writeDynamicFrame' method is correctly used in AWS Glue for writing DynamicFrames; an incorrect method would cause a syntax or API error, not a database transaction log issue.


324

MCQmedium

A company uses Amazon S3 to store sensitive data. The data engineer needs to ensure that all data in transit between the S3 bucket and clients is encrypted. Which configuration should the engineer implement?

A.Use Amazon CloudFront to serve the content and enable SSL.

B.Enable default encryption on the S3 bucket using SSE-S3.

C.Create an S3 bucket policy that denies requests where SecureTransport is false.

D.Use SSE-C to encrypt the data with a customer-provided key.

AnswerC

This ensures all requests use HTTPS, encrypting data in transit.

Why this answer

Option C is correct because a bucket policy that denies requests when aws:SecureTransport is false enforces HTTPS for all access. Option A is wrong because CloudFront does not enforce HTTPS by default; it can be configured to, but it is not the direct solution. Option B is wrong because S3 default encryption only encrypts data at rest, not in transit.

Option D is wrong because SSE-C is server-side encryption at rest, not in transit.
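A bucket policy of this kind can be sketched as follows. The bucket name and the `Sid` are hypothetical, and the policy is built as a Python dictionary only to keep the example self-contained.

```python
import json

BUCKET = "example-sensitive-bucket"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both bucket-level and object-level requests.
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            # aws:SecureTransport is false when the request was not sent over TLS.
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Because the statement is an explicit Deny, it blocks plain-HTTP requests even for principals that are otherwise allowed.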


325

Multi-Selecthard

A data engineer is designing a data lake on Amazon S3 with sensitive data. The engineer needs to ensure that data at rest is encrypted and that access is logged for compliance. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers

A.Enable S3 Select to filter data at read time.

B.Enable S3 Transfer Acceleration.

C.Enable CloudTrail data events for S3 object-level operations.

D.Enable default encryption on the S3 bucket using SSE-KMS.

E.Enable S3 Block Public Access on the account.

AnswersC, D

CloudTrail data events log read/write operations to objects.

Why this answer

Option C is correct because enabling CloudTrail data events for S3 object-level operations captures detailed logs of all read, write, and delete actions on objects, which is essential for compliance auditing. Option D is correct because enabling default encryption on the S3 bucket using SSE-KMS ensures that all objects stored in the bucket are encrypted at rest with AWS Key Management Service (KMS) keys, providing centralized control and auditability of encryption keys.

Exam trap

The trap here is that candidates often confuse S3 Block Public Access (a security control) with encryption or logging, or they mistakenly think S3 Select or Transfer Acceleration contribute to compliance requirements, when they are unrelated to data-at-rest encryption and access logging.
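For the encryption half of the answer, a default-encryption configuration using SSE-KMS might be sketched like this. The dictionary has the shape that boto3's `put_bucket_encryption` expects for its `ServerSideEncryptionConfiguration` argument; the key ARN is a placeholder and the call itself is omitted.

```python
# Hypothetical KMS key ARN; replace with a real key in practice.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

encryption_config = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",      # SSE-KMS rather than SSE-S3 ("AES256")
                "KMSMasterKeyID": KMS_KEY_ARN,  # key used for new objects
            }
        }
    ]
}

rule = encryption_config["Rules"][0]["ApplyServerSideEncryptionByDefault"]
print(rule["SSEAlgorithm"])
```

Access logging is configured separately, by enabling CloudTrail data events for the bucket, and is not shown here.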


326

MCQmedium

Refer to the exhibit. A data engineer sees this output from the AWS CLI for a failed Glue job. The job uses 10 workers of Standard type. What is the MOST appropriate action to resolve the OutOfMemoryError?

A.Increase NumberOfWorkers to 20

B.Reduce NumberOfWorkers to 5

C.Change WorkerType to G.1X

D.Increase MaxCapacity to 20

AnswerD

Increasing MaxCapacity allocates more DPUs per worker, increasing memory per worker.

Why this answer

Option D is correct because increasing MaxCapacity (DPUs) increases the memory available per worker. Option A is wrong because increasing the number of workers adds more parallelism but does not increase memory per worker; the error is heap space per worker. Option B is wrong because reducing workers exacerbates the issue.

Option C is wrong because G.1X has less memory.

Practice this question →

327

MCQmedium

A data engineer runs an AWS Glue ETL job that reads from an S3 bucket containing JSON files. The job fails with an error indicating that some records are malformed. The engineer wants to skip the malformed records and continue processing. Which approach should the engineer take?

A.Pre-process the JSON files to correct the malformed records before Glue reads them.

B.Convert the JSON files to Parquet format and use Glue to read Parquet.

C.Use AWS Glue Schema Registry to reject invalid records.

D.Configure the Glue DynamicFrame to use the `withErrorThreshold` option to skip corrupt records.

AnswerD

Glue can skip malformed records using error thresholds.

Why this answer

Option D is correct because Glue can tolerate malformed records through an error threshold: creating the DynamicFrame with `from_catalog` or `from_options` and the `withErrorThreshold` option lets the job skip bad records and continue. Parsing options on `from_options` such as `multiLine` and `allowQuotedRecordDelimiters` control how records are read, but the error threshold is what allows corrupt records to be skipped. Option A is wrong because modifying the data source is not always possible.

Option B is wrong because using a different file format does not fix malformed JSON. Option C is wrong because AWS Glue Schema Registry validates schemas, not malformed records.

Practice this question →

328

Multi-Selectmedium

A company's Amazon Redshift cluster is running slowly. The data engineer suspects that table design is the cause. Which TWO design practices can improve query performance? (Choose TWO.)

Select 2 answers

A.Define appropriate sort keys on frequently filtered columns.

B.Use GROUP BY instead of DISTINCT in queries.

C.Define appropriate distribution keys to collocate joins.

D.Increase the number of slices per node by resizing the cluster.

E.Use VARCHAR instead of CHAR for fixed-length strings.

AnswersA, C

Sort keys minimize the number of blocks scanned.

Why this answer

Options A and C are correct. Sort keys help the query optimizer prune blocks, and distribution keys reduce data movement. Option E is wrong because VARCHAR is variable-length and may not improve performance.

Option B is wrong because GROUP BY is not a design practice. Option D is wrong because adding more slices (via cluster resize) is not a table design change.

Practice this question →

329

MCQmedium

A company runs a daily batch process that reads data from Amazon S3, transforms it with AWS Glue, and loads it into Amazon Redshift. The process takes 6 hours, but the business requires completion within 4 hours. Which design change would MOST reduce runtime?

A.Increase the number of Glue workers

B.Load data directly from S3 to Redshift using COPY command, then transform in Redshift

C.Use S3 Select to filter data before Glue

D.Switch to columnar storage in Redshift

AnswerB

COPY is highly efficient for bulk loading, and in-database transformation can be faster than Glue.

Why this answer

Option B is correct because using Redshift's COPY command with S3 is optimized for bulk loads and avoids transformation delays. Option A is wrong because increasing Glue workers may help but doesn't address Redshift load speed. Option D is wrong because columnar storage is already used.

Option C is wrong because S3 Select reduces data scanned but does not accelerate the Glue job.

Practice this question →

330

MCQmedium

A data engineer is monitoring an Amazon EMR cluster running a Spark job. The job is processing a large dataset and the engineer notices that the cluster is using a high percentage of disk space on the core nodes. The job fails with 'No space left on device' error. What is the most effective way to resolve this issue without modifying the job logic?

A.Attach additional EBS volumes to the core nodes.

B.Increase the EBS volume size attached to the core nodes.

C.Change the core node instance type to one with more memory.

D.Increase the number of core nodes in the cluster.

AnswerD

More nodes distribute the intermediate data, reducing disk usage per node.

Why this answer

Option D is correct because increasing the number of core nodes distributes the intermediate shuffle data and temporary files across more nodes, reducing the per-node disk usage. This directly addresses the 'No space left on device' error without altering the Spark job logic, as core nodes in EMR store both HDFS data and local shuffle spills.

Exam trap

The trap here is that candidates confuse storage issues with memory or compute issues, and incorrectly choose to increase EBS volume size (Option B) instead of scaling horizontally, which is the most effective way to distribute disk load in a distributed system like EMR.

How to eliminate wrong answers

Option A is wrong because attaching additional EBS volumes does not increase the total available disk space on the core nodes unless they are mounted and configured; EMR automatically uses the root volume for local data, and adding extra volumes requires manual intervention or instance store configuration, which is not a direct fix. Option B is wrong because increasing the EBS volume size on existing core nodes only provides more space on the root device, but the error may stem from ephemeral storage or HDFS usage; moreover, this requires stopping the cluster or modifying the launch configuration, which is less effective than scaling horizontally. Option C is wrong because changing the instance type to one with more memory does not increase disk space; it addresses memory constraints, not the 'No space left on device' error, which is a storage issue.
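
The horizontal-scaling argument can be checked with simple arithmetic. The sketch below assumes intermediate data spreads evenly across core nodes, which real jobs only approximate, and both sizes are hypothetical figures chosen for illustration.

```python
def per_node_usage_gb(total_intermediate_gb, core_nodes):
    """Average disk used per core node, assuming data spreads evenly."""
    return total_intermediate_gb / core_nodes


TOTAL_INTERMEDIATE_GB = 2000   # hypothetical shuffle + HDFS footprint
DISK_PER_NODE_GB = 400         # hypothetical usable disk per core node

for nodes in (4, 8):
    used = per_node_usage_gb(TOTAL_INTERMEDIATE_GB, nodes)
    status = "fits" if used <= DISK_PER_NODE_GB else "exceeds disk"
    print(f"{nodes} core nodes: {used:.0f} GB per node ({status})")
```

Skewed partitions break the even-spread assumption, so per-node usage in practice can be higher than this average.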

Practice this question →

331

MCQhard

A data pipeline uses AWS Glue to run ETL jobs that read from and write to an Amazon Redshift cluster. The pipeline recently started failing with the error 'ERROR: cannot execute INSERT in a read-only transaction'. The Glue job's IAM role has the necessary permissions. What could be the cause of this error?

A.The Glue job is using a transaction that was opened in read-only mode.

B.The Redshift cluster is in read-only mode due to maintenance.

C.The Glue connection is configured with 'read-only' set to true.

D.The Glue job's IAM role does not have sufficient Redshift permissions.

AnswerA

If auto_commit=False and the first operation is a SELECT, the session becomes read-only; subsequent INSERT fails.

Why this answer

Option A is correct because Redshift treats the transaction as read-only when the connection goes through a read-only workload, or when the connection string specifies auto_commit=False and the job tries to write without committing. Option D is wrong because insufficient permissions would cause a different error. Option B is wrong because Redshift is not in read-only mode.

Option C is wrong because Glue connections do not have a read-only setting.

Practice this question →

332

MCQeasy

A company uses AWS Glue to run ETL jobs that process data from an Amazon RDS for MySQL database and load it into an Amazon S3 data lake. The Glue job runs daily and processes incremental data. Recently, the job has been taking longer than expected. The engineer checks the CloudWatch logs and sees that the job is spending most of its time on the 'Reading from JDBC' phase. The MySQL table has 10 million rows and is indexed on the primary key. The Glue job uses a 'job bookmark' to track processed data. The engineer wants to improve the performance of the read phase. Which action is most likely to help?

A.Increase the JDBC 'fetchSize' parameter to 10000.

B.Disable job bookmark and perform a full refresh each time.

C.Increase the number of DPUs for the Glue job.

D.Modify the job to use a 'query' parameter that selects only the new or modified rows based on a timestamp column.

AnswerD

By filtering at the source, less data is read and transferred, speeding up the read phase.

Why this answer

Option D is correct because using a 'query' parameter with a WHERE clause that filters on the bookmark key (e.g., a timestamp column) allows Glue to read only the incremental data, reducing the amount of data transferred. Option C is wrong because increasing the number of DPUs adds parallelism but the bottleneck may be the database's ability to serve data. Option A is wrong because increasing the fetch size may cause memory issues.

Option B is wrong because job bookmark already tracks processed data; disabling it would cause reprocessing.
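
Filtering at the source can be sketched with Python's built-in `sqlite3` module standing in for the MySQL database. The table name, column names, and the `last_run` high-water mark are hypothetical; in a Glue job the same WHERE clause would be supplied through the JDBC query setting rather than run locally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO activity (id, updated_at) VALUES (?, ?)",
    [
        (1, "2024-01-01 00:00:00"),
        (2, "2024-01-02 00:00:00"),
        (3, "2024-01-03 00:00:00"),
    ],
)

# Hypothetical timestamp recorded at the end of the previous run.
last_run = "2024-01-01 12:00:00"

# Full read: every row crosses the connection on every run.
full_read = conn.execute("SELECT id FROM activity").fetchall()

# Incremental read: only rows changed since the last run are returned.
incremental = conn.execute(
    "SELECT id FROM activity WHERE updated_at > ? ORDER BY id", (last_run,)
).fetchall()

print("full read rows:", len(full_read))
print("incremental rows:", len(incremental))
```

An index on the timestamp column would be needed for the database to serve this filter efficiently on a large table.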

Practice this question →

333

MCQhard

Refer to the exhibit. An AWS Glue job is failing with 'AccessDenied' when trying to write to the 'data-lake-bucket' which is encrypted with an AWS KMS key. The IAM role used by the Glue job has the attached policy shown. What is the MOST likely cause of the failure?

A.The policy does not include s3:ListBucket permission.

B.The policy does not include s3:GetObject permission.

C.The KMS key ARN in the policy is incorrect.

D.The policy does not include kms:GenerateDataKey or kms:Encrypt permission.

AnswerD

Writing to SSE-KMS encrypted S3 requires GenerateDataKey and Encrypt.

Why this answer

Option D is correct because the policy allows s3:PutObject but does not allow kms:GenerateDataKey or kms:Encrypt, which are needed to write to an SSE-KMS encrypted bucket. Option A is wrong because ListBucket is allowed. Option B is wrong because GetObject is allowed.

Option C is wrong because the key ARN is correct.
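
A policy that covers both the S3 and KMS sides might look like the sketch below. The exhibit is not reproduced here, so the key ARN, account ID, and region are hypothetical, and the helper `allowed_actions` is an illustrative check rather than a real IAM evaluation.

```python
import json

# Hypothetical ARNs used only for illustration.
BUCKET_ARN = "arn:aws:s3:::data-lake-bucket"
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": BUCKET_ARN},
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"{BUCKET_ARN}/*",
        },
        {
            # KMS permissions needed when the bucket uses SSE-KMS.
            "Effect": "Allow",
            "Action": ["kms:GenerateDataKey", "kms:Encrypt", "kms:Decrypt"],
            "Resource": KEY_ARN,
        },
    ],
}


def allowed_actions(doc):
    """Collect every action granted by an Allow statement (ignores resources)."""
    actions = set()
    for stmt in doc["Statement"]:
        if stmt["Effect"] == "Allow":
            actions.update(stmt["Action"])
    return actions


# Actions a write to an SSE-KMS bucket depends on, per the explanation above.
missing = {"s3:PutObject", "kms:GenerateDataKey"} - allowed_actions(policy)
print("missing for SSE-KMS writes:", sorted(missing))
print(json.dumps(policy, indent=2))
```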

Practice this question →

334

MCQhard

A company uses Amazon DynamoDB as the primary data store for a real-time application. The data engineer observes that some read requests are returning stale data, even though the application uses strongly consistent reads. The table has auto-scaling enabled with a maximum read capacity of 10,000 RCUs. The observed read traffic averages 8,000 RCUs but occasionally spikes to 12,000 RCUs. What is the most likely cause of the stale reads?

A.Read capacity auto-scaling cannot keep up with sudden traffic spikes, causing throttling and fallback to eventually consistent reads.

B.The application uses write sharding, causing read-after-write inconsistencies.

C.The application is using DynamoDB Accelerator (DAX) which caches data and may return stale values.

D.The table is part of a DynamoDB global table, and the application reads from a replica in a different region.

AnswerA

Throttling can cause fallback to eventual consistency.

Why this answer

Option A is correct because strongly consistent reads can return stale data if the application is throttled due to insufficient read capacity. During spikes above the maximum auto-scaling limit (10,000 RCUs), requests may be throttled, and the SDK may retry with eventually consistent reads, returning stale data. Option D is incorrect because global tables with eventually consistent reads would not affect a single table.

Option C is incorrect because DynamoDB Accelerator (DAX) provides eventual consistency by default, but strongly consistent reads would bypass DAX. Option B is incorrect because write sharding does not cause stale reads on the same table.

Practice this question →

335

Drag & Dropmedium

Order the steps to set up a Kinesis Data Analytics application for real-time stream processing.

Why this order

First, set up the source stream. Then create the analytics application, configure it with the source and logic, start it, and finally monitor performance.

Practice this question →

336

Multi-Selectmedium

A company is using AWS Glue Data Catalog to store metadata about datasets in S3. The data engineer wants to implement a data governance solution that tracks lineage and versioning of datasets. Which TWO AWS services can be used together to achieve this?

Select 2 answers

A.AWS Data Pipeline

B.AWS Lake Formation

C.AWS Glue Data Catalog

D.AWS CloudTrail

E.Amazon S3

AnswersB, C

Provides data lineage and versioning capabilities.

Why this answer

Options B and C are correct. AWS Lake Formation provides data lineage and versioning. AWS Glue Data Catalog stores metadata and can be integrated with Lake Formation.

Option E is wrong because S3 does not provide lineage. Option D is wrong because CloudTrail logs API calls but not lineage. Option A is wrong because Data Pipeline is for data movement.

Practice this question →

337

MCQeasy

A company runs a daily batch processing job on Amazon EMR that reads data from Amazon S3 and writes results back to S3. The job takes longer than expected. The engineer wants to monitor the job's resource utilization. Which AWS service should be used to collect and visualize metrics such as CPU and memory usage of the EMR cluster's nodes?

A.AWS Config to record configuration changes in the EMR cluster.

B.Amazon Athena to query EMR job logs stored in S3.

C.Amazon CloudWatch with the CloudWatch Agent installed on the EMR nodes.

D.AWS CloudTrail to log API calls made by the EMR job.

AnswerC

CloudWatch can collect CPU, memory, and disk metrics from EC2 instances (EMR nodes) via the CloudWatch Agent.

Why this answer

Option C is correct because CloudWatch can collect custom metrics from EMR via the CloudWatch agent or EMR metrics integration. Option D is incorrect because CloudTrail records API calls, not resource utilization. Option A is incorrect because AWS Config tracks configuration changes.

Option B is incorrect because Athena is a query service, not a monitoring service.

Practice this question →

338

MCQhard

A company uses Amazon Redshift for its data warehouse. During a routine audit, the data engineer discovers that some queries are returning stale data even though the underlying source data has been updated. The engineer confirms that the COPY command completes successfully and that no errors are reported. Which action should the engineer take to ensure queries reflect the latest data?

A.Run the VACUUM command on the source tables.

B.Clear the Redshift result cache by running RESET ALL.

C.Run the ANALYZE command on the source tables.

D.Refresh the materialized views that the queries are using.

AnswerD

Materialized views must be refreshed to reflect changes in base tables.

Why this answer

Option D is correct because Redshift does not automatically maintain materialized views; they need to be refreshed to reflect changes in the base tables. Option B is wrong because Redshift does not have a cache that needs clearing in this context. Option A is wrong because VACUUM reclaims space and sorts tables, but does not refresh materialized views.

Option C is wrong because ANALYZE updates statistics, not the data content.
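
The stale-view behavior can be simulated locally. SQLite has no materialized views, so the sketch below uses a precomputed table as a stand-in, and `refresh_view` plays the role of Redshift's `REFRESH MATERIALIZED VIEW`. All table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.execute("INSERT INTO sales VALUES ('east', 100)")


def refresh_view(db):
    """Rebuild the precomputed result, standing in for REFRESH MATERIALIZED VIEW."""
    db.execute("DROP TABLE IF EXISTS sales_by_region_mv")
    db.execute(
        "CREATE TABLE sales_by_region_mv AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )


def read_view(db):
    """Query the precomputed result, as a report would."""
    return db.execute(
        "SELECT region, total FROM sales_by_region_mv ORDER BY region"
    ).fetchall()


refresh_view(conn)
conn.execute("INSERT INTO sales VALUES ('east', 50)")  # new load into the base table

stale = read_view(conn)   # still reflects the data as of the last refresh
refresh_view(conn)
fresh = read_view(conn)   # now includes the newly loaded row

print("before refresh:", stale)
print("after refresh:", fresh)
```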

Practice this question →

339

MCQeasy

Refer to the exhibit. A data engineer sees this error log from an Amazon EC2 instance that is trying to access an S3 bucket in the us-west-2 region. The EC2 instance is in a VPC with a private subnet and no internet gateway. What is the MOST likely cause of this error?

A.The S3 bucket is in a different region than us-west-2.

B.The VPC does not have a VPC endpoint for S3.

C.The S3 bucket does not exist.

D.The IAM role attached to the EC2 instance does not have s3:GetObject permission.

AnswerB

Private subnet needs VPC endpoint to access S3.

Why this answer

Option B is correct. The EC2 instance is in a private subnet without an internet gateway, so it cannot reach S3 over the internet. A VPC endpoint (Gateway or Interface) is needed for private connectivity.

Option C is incorrect because the bucket exists (DNS resolves). Option D is incorrect because the error is a connection timeout, not a 403. Option A is incorrect because there is no indication of an incorrect region.

Practice this question →

340

Multi-Selecthard

A data engineer is designing a disaster recovery strategy for an Amazon RDS for MySQL database with Multi-AZ deployment. Which THREE actions should the engineer take to meet a Recovery Point Objective (RPO) of 5 minutes and a Recovery Time Objective (RTO) of 15 minutes? (Choose THREE.)

Select 3 answers

A.Enable automated backups with a retention period of 7 days.

B.Create a cross-region read replica to another AWS region.

C.Configure the DB instance to be Single-AZ for simplicity.

D.Export automated snapshots to an S3 bucket in a different region.

E.Enable Multi-AZ deployment for automatic failover.

AnswersA, B, E

Automated backups allow point-in-time recovery within the retention window, helping meet RPO.

Why this answer

Options A, B, and E provide fast failover and minimal data loss. Option E (Multi-AZ) ensures automatic failover to a standby in another AZ. Option B (cross-region read replica) can be promoted in minutes.

Option A (automated backups) provides point-in-time recovery. Option D (snapshots exported to S3) is slower and may not meet the RTO. Option C (Single-AZ) increases risk.

Practice this question →

341

MCQmedium

An IAM policy is attached to an IAM role used by an EC2 instance in the 10.0.0.0/8 VPC. The EC2 instance cannot read objects from the S3 bucket. What is the most likely cause?

A.The policy does not grant s3:ListBucket permission.

B.The S3 bucket has a bucket policy that denies public access, and the IAM policy alone is insufficient.

C.The bucket is encrypted with SSE-KMS and the role does not have kms:Decrypt permission.

D.The EC2 instance's public IP is not in the 10.0.0.0/8 range.

AnswerB

The IAM policy allows access, but if the bucket policy denies all access except from specific principals, the IAM role may still be denied. The bucket policy must explicitly allow the role.

Why this answer

Option B is correct. The exhibited policy is an IAM policy, not a bucket policy. For an EC2 instance that uses an IAM role, this explanation treats the source IP as the instance's private IP, which falls within the aws:SourceIp condition, so the condition itself should not block the request.

The bucket itself likely has a bucket policy that denies access, so the IAM policy alone is insufficient; the bucket policy must explicitly allow the role. Option D (wrong IP) is not necessarily true. Option C (no KMS) is irrelevant.

Option A (no s3:ListBucket) is wrong because s3:ListBucket is not required for GetObject.

Practice this question →

342

Multi-Selectmedium

A company uses Amazon S3 to store data for analytics. The data engineer needs to ensure that the S3 bucket is protected against accidental deletion of objects. Which THREE actions should the engineer take? (Choose THREE.)

Select 3 answers

A.Enable server access logging for the S3 bucket.

B.Create an S3 bucket policy that explicitly denies the s3:DeleteObject action.

C.Configure a lifecycle policy to transition objects to Glacier.

D.Enable versioning on the S3 bucket.

E.Enable MFA Delete on the S3 bucket.

AnswersB, D, E

Prevents any user from deleting objects.

Why this answer

Option E is correct because MFA Delete adds an extra layer of protection. Option D is correct because versioning keeps multiple versions, allowing recovery. Option B is correct because a bucket policy denying s3:DeleteObject to all principals prevents any deletion.

Option C is wrong because a lifecycle transition to Glacier changes the storage class and does not protect objects from deletion. Option A is wrong because server access logs are for auditing, not prevention.

Practice this question →

343

MCQmedium

A company uses AWS DMS to migrate data from an on-premises Oracle database to Amazon Aurora MySQL. The migration is successful, but the ongoing replication task is experiencing high latency. Which configuration change is most likely to reduce latency?

A.Increase the size of the DMS replication instance.

B.Decrease the task's batch size and batch apply timeout.

C.Change the target endpoint to Amazon S3.

D.Enable Change Data Capture (CDC) from binary logs.

AnswerA

A larger instance provides more resources to process change data capture (CDC) faster.

Why this answer

Option A is correct because increasing the DMS replication instance size provides more CPU and memory, which can process changes faster. Option C (S3 as target) is not applicable as the target is Aurora. Option D (CDC from binary logs) is for a MySQL source, whereas the source here is Oracle.

Option B (decrease batch size) would likely increase latency.

Practice this question →

344

MCQeasy

A company uses Amazon Athena to query data in S3. Recently, queries have become slow. The data is stored as CSV files in a partitioned table. What is the most effective way to improve query performance?

A.Increase the number of nodes in the Athena query engine.

B.Convert the data to Parquet format and optimize partitioning.

C.Convert the data to JSON format.

D.Increase the size of the CSV files to reduce the number of files.

AnswerB

Parquet is columnar and compressed, improving scan efficiency.

Why this answer

Option B is correct because converting to Parquet and partitioning improves compression, columnar scanning, and partition pruning. Option D is wrong because increasing file size alone may not help; large CSV files still require full scans. Option C is wrong because converting to JSON would likely worsen performance.

Option A is wrong because more nodes only help with distributed processing, but Athena manages resources automatically.

Practice this question →

345

Multi-Selecthard

Which THREE considerations are important when designing a data pipeline that uses AWS Glue to process streaming data from Amazon Kinesis Data Streams? (Choose 3.)

Select 3 answers

A.Set the number of Glue workers to match the number of shards for optimal parallelism

B.Ensure the Kinesis stream has enough shards to handle the expected record rate

C.Configure checkpointing to prevent data loss on failure

D.Use batch window to accumulate data before processing

E.Convert data to Avro format for better compression

AnswersA, B, C

Each worker can consume one shard.

Why this answer

A, B, and C are correct. A: Multiple workers for parallelism. B: Sufficient shards for throughput.

C: Checkpointing lets the job resume after a failure without losing data. D (batch window) is not streaming. E (data format) is not Glue-specific for streaming.

Practice this question →

346

MCQeasy

A company runs a nightly batch processing pipeline using AWS Glue ETL jobs. The pipeline reads data from an Amazon S3 bucket, transforms it, and writes results to an Amazon Redshift cluster. Recently, the data volume has increased significantly, and some Glue jobs are failing with the error 'java.lang.OutOfMemoryError: Java heap space'. The data engineer needs to modify the job configuration to prevent these failures without changing the code. The job currently uses 10 DPUs and processes data in a single Spark DataFrame. Which of the following is the MOST effective solution?

A.Reduce the number of DPUs to 5 and increase the Spark executor memory by setting 'spark.executor.memory' in job parameters.

B.Increase the number of DPUs to 20 and enable job bookmarking for incremental processing.

C.Change the script to use DynamicFrame instead of DataFrame and disable the 'spark.sql.shuffle.partitions' configuration.

D.Add a 'coalesce(1)' operation before writing to Redshift to reduce the number of output files.

AnswerB

More DPUs increase total available memory; job bookmarking reduces data volumes by processing only new data.

Why this answer

Increasing DPUs from 10 to 20 provides more memory and compute resources, directly addressing the 'java.lang.OutOfMemoryError: Java heap space' caused by insufficient memory for the single DataFrame. Enabling job bookmarking allows incremental processing, which reduces the volume of data processed per run, further mitigating memory pressure without code changes.

Exam trap

The trap here is that candidates may think reducing DPUs or using coalesce reduces memory usage, but in reality, both actions increase memory pressure on individual executors, making OOM errors more likely.

How to eliminate wrong answers

Option A is wrong because reducing DPUs to 5 would decrease available memory, worsening the OOM error, and increasing 'spark.executor.memory' without more DPUs cannot compensate for the overall resource reduction. Option C is wrong because changing to DynamicFrame does not inherently reduce memory usage; disabling 'spark.sql.shuffle.partitions' may cause imbalanced partitions and does not address the heap space issue. Option D is wrong because 'coalesce(1)' forces all data into a single partition, which increases memory pressure on that executor and can trigger or worsen OOM errors.

Practice this question →

347

MCQhard

A team manages an Amazon DynamoDB table with on-demand capacity. Recently, they noticed increased throttling errors during peak hours. The table has a Lambda trigger that processes changes and writes to an S3 bucket. Which design change would BEST reduce throttling?

A.Switch the table to provisioned capacity and enable auto-scaling.

B.Increase the write capacity units to handle the peak load.

C.Enable S3 bucket versioning to reduce the number of writes.

D.Implement DynamoDB Accelerator (DAX) to cache frequent reads.

AnswerD

DAX reduces read load on the table, lowering throttling.

Why this answer

Option D is correct because DynamoDB Accelerator (DAX) provides an in-memory cache that reduces the number of read requests hitting the table, which can alleviate throttling during peak hours. The question describes throttling errors, which are typically caused by exceeding the table's read or write capacity; DAX offloads read traffic, reducing the load on the table and thus decreasing throttling events.

Exam trap

The trap here is that candidates may assume throttling is always due to insufficient write capacity, but the question's context of a Lambda trigger writing to S3 can increase read traffic (e.g., via stream processing or re-reading items), making DAX a read-side solution that addresses the actual cause.

How to eliminate wrong answers

Option A is wrong because switching to provisioned capacity with auto-scaling does not address the root cause of throttling under on-demand capacity, which already scales automatically; throttling in on-demand mode is usually due to exceeding the table's per-partition throughput limits or burst capacity, not capacity mode. Option B is wrong because increasing write capacity units is not applicable to on-demand capacity, which does not use provisioned write capacity units; on-demand tables automatically scale, and throttling is not resolved by manually setting a capacity that doesn't exist in that mode. Option C is wrong because enabling S3 bucket versioning increases the number of writes (by storing multiple versions of objects) rather than reducing them, and it does not affect DynamoDB throttling.

Practice this question →

348

MCQeasy

A data engineer needs to move data from an Amazon S3 bucket to an Amazon Redshift cluster on a daily schedule. The data is in CSV format and the target table already exists. Which AWS service should the engineer use to automate this task?

A.AWS Glue

B.Amazon Athena

C.Amazon EMR

D.Amazon Kinesis Data Analytics

AnswerA

Glue provides job scheduling and ETL capabilities.

Why this answer

Option A is correct because AWS Glue can be used to run ETL jobs that copy data from S3 to Redshift on a schedule. Option B is wrong because Amazon Athena is a query service, not an ETL scheduler. Option C is wrong because Amazon EMR is a big data platform but requires more setup for simple scheduling.

Option D is wrong because Amazon Kinesis Data Analytics is for real-time streaming data.

Practice this question →

349

MCQeasy

A data engineer needs to monitor the number of records processed by an AWS Glue ETL job and send an alert if the count drops below a threshold. Which AWS service should be used to create this custom metric?

A.Amazon S3

B.AWS Config

C.Amazon CloudWatch

D.AWS CloudTrail

AnswerC

CloudWatch can store custom metrics and trigger alarms.

Why this answer

Amazon CloudWatch is the correct service for creating custom metrics because it allows you to publish your own data points, such as the number of records processed by an AWS Glue ETL job. You can use the CloudWatch PutMetricData API or the AWS Glue job script to emit a custom metric, then set an alarm on that metric to trigger an alert when the count drops below a threshold.

Exam trap

The trap here is that candidates often confuse AWS CloudTrail with CloudWatch because both are monitoring-related, but CloudTrail is for auditing API calls, not for ingesting custom numerical metrics or setting alarms on them.

How to eliminate wrong answers

Option A is wrong because Amazon S3 is an object storage service and does not provide a mechanism to create or monitor custom metrics; it only stores data and logs access via server access logs or AWS CloudTrail. Option B is wrong because AWS Config is a service for evaluating and auditing resource configurations against rules, not for ingesting or alerting on custom operational metrics like record counts. Option D is wrong because AWS CloudTrail records API activity for auditing and governance, but it cannot be used to create custom metrics or set threshold-based alarms; it captures events, not numerical data points.
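
The metric-plus-alarm pattern can be sketched as two request payloads. The namespace, metric name, job name, record count, and threshold are hypothetical. The dictionaries follow the parameter shapes of boto3's `put_metric_data` and `put_metric_alarm`, and the calls themselves are left as comments so the snippet runs without AWS credentials.

```python
NAMESPACE = "Example/GlueETL"          # hypothetical custom namespace
METRIC_NAME = "RecordsProcessed"       # hypothetical metric name
DIMENSIONS = [{"Name": "JobName", "Value": "example-etl-job"}]  # hypothetical job

# Data point the Glue job script would publish at the end of a run.
put_metric_request = {
    "Namespace": NAMESPACE,
    "MetricData": [
        {
            "MetricName": METRIC_NAME,
            "Dimensions": DIMENSIONS,
            "Value": 12500,            # hypothetical record count for this run
            "Unit": "Count",
        }
    ],
}

# Alarm that fires when the hourly total drops below the threshold.
alarm_request = {
    "AlarmName": "example-etl-low-record-count",
    "Namespace": NAMESPACE,
    "MetricName": METRIC_NAME,
    "Dimensions": DIMENSIONS,
    "Statistic": "Sum",
    "Period": 3600,
    "EvaluationPeriods": 1,
    "Threshold": 10000,                # hypothetical minimum record count
    "ComparisonOperator": "LessThanThreshold",
    "TreatMissingData": "breaching",   # a run that reports nothing also alarms
}

# With credentials configured, these payloads would be sent as:
#   cloudwatch = boto3.client("cloudwatch")
#   cloudwatch.put_metric_data(**put_metric_request)
#   cloudwatch.put_metric_alarm(**alarm_request)
print(put_metric_request)
print(alarm_request)
```

Treating missing data as breaching is a design choice in this sketch: it makes a job that never runs look the same as a job that processed too few records.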

Practice this question →

350

MCQeasy

A data engineer needs to monitor the number of records processed by a Kinesis Data Firehose delivery stream and set an alarm if the count drops below a threshold. Which CloudWatch metric should be used?

A.IncomingRecords

B.PutRecord.Success

C.DeliveryToS3.Success

D.IncomingBytes

AnswerA

This metric counts the number of records sent to Firehose.

Why this answer

Option A is correct because 'IncomingRecords' counts records received by Firehose, which directly indicates processing volume. Option D is wrong because 'IncomingBytes' measures bytes, not records. Option C is wrong because 'DeliveryToS3.Success' is a success metric but measures successful deliveries, not record count.

Option B is wrong because 'PutRecord.Success' is a per-API call metric for Kinesis Data Streams, not Firehose.

Practice this question →

351

MCQmedium

A data engineer notices that an AWS Glue job processing data from an Amazon S3 bucket frequently fails with 'OutOfMemoryError'. The job reads CSV files, applies transformations, and writes Parquet to another S3 bucket. The job has 10 workers of type G.1X. Which change is MOST likely to resolve the issue?

A.Change the worker type from G.1X to G.2X

B.Increase the number of workers to 20

C.Change the worker type from G.1X to G.8X

D.Enable the Spark UI to monitor memory and tune the job

AnswerA

G.2X provides 2x the memory of G.1X, directly addressing the OutOfMemoryError.

Why this answer

The G.1X worker type provides 16 GB of memory per worker. An OutOfMemoryError indicates that the job's memory requirements exceed this limit. Upgrading to G.2X doubles the memory per worker to 32 GB, directly addressing the memory shortage without changing the parallelism or incurring the overhead of additional workers.

Exam trap

The trap here is that candidates might think adding more workers (Option B) solves memory issues, but OutOfMemoryError is per-worker, not a cluster-wide shortage, so increasing parallelism does not fix the root cause.

How to eliminate wrong answers

Option B is wrong because increasing the number of workers to 20 does not increase the memory per worker; it only adds more workers, which can help with parallelism but not with per-worker memory exhaustion. Option C is wrong because G.8X provides 128 GB of memory per worker, which is excessive and likely unnecessary; the most cost-effective fix is G.2X. Option D is wrong because enabling the Spark UI only helps with monitoring and debugging, not with resolving the memory issue; it does not allocate additional memory.
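
The difference between adding workers and enlarging them shows up in a short calculation. The sketch below uses the per-worker memory figures quoted above for G.1X and G.2X; the helper name `cluster_memory` is illustrative.

```python
# Memory per worker in GB, as stated in the explanation above.
MEMORY_GB = {"G.1X": 16, "G.2X": 32}


def cluster_memory(worker_type, workers):
    """Return per-worker and total memory for a given worker type and count."""
    per_worker = MEMORY_GB[worker_type]
    return {"per_worker_gb": per_worker, "total_gb": per_worker * workers}


current = cluster_memory("G.1X", 10)
more_workers = cluster_memory("G.1X", 20)    # doubles the worker count
bigger_workers = cluster_memory("G.2X", 10)  # doubles the worker size

for label, result in [
    ("current", current),
    ("20 x G.1X", more_workers),
    ("10 x G.2X", bigger_workers),
]:
    print(label, result)
```

Both changes double the total memory, but only the larger worker type raises the per-worker figure, which is the limit a per-worker OutOfMemoryError runs into.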

Practice this question →

352

MCQeasy

A data engineer notices that an AWS Glue ETL job is failing intermittently with the error 'Connection refused'. The job reads from Amazon RDS for MySQL and writes to Amazon S3. What is the MOST likely cause?

A.The RDS instance has reached its maximum number of connections.

B.The security group for the RDS instance is not allowing inbound traffic from the Glue job's subnet.

C.The Glue job is using too many DPUs and hitting resource limits.

D.The IAM role associated with the Glue job lacks permissions to write to the S3 bucket.

AnswerB

The 'Connection refused' error typically occurs due to network or security group misconfiguration blocking access to the RDS instance.

Why this answer

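Option B is correct because the error indicates a network connectivity issue to the RDS database, such as a security group that does not allow inbound traffic from the Glue job's subnet. Option A is wrong because an exhausted MySQL connection limit is normally reported as a 'Too many connections' error, not 'Connection refused'. Option C is wrong because the error is about the connection, not resource limits.

Option D is wrong because the error is not about permissions; a missing S3 write permission would surface as an access denied error when writing, not when connecting to the database.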

Practice this question →

353

MCQmedium

A data engineer is troubleshooting an AWS Glue job that writes data to an S3 bucket. The IAM role attached to the Glue job has the policy shown in the exhibit. The job fails when writing to the 'secrets/' prefix but succeeds when writing to other prefixes. What is the reason for the failure?

A.The job does not have permission to write to the bucket at all.

B.The resource ARN in the Allow statement does not include the bucket itself.

C.The Deny statement is not effective because it is placed after the Allow.

D.The Deny statement explicitly denies PutObject to the secrets/ prefix.

AnswerD

Deny overrides Allow.

Why this answer

Option D is correct because the Deny statement explicitly denies s3:PutObject to the secrets/ prefix, which overrides the Allow. Option A is wrong because the job can write to other prefixes. Option B is wrong because the resource is correctly specified for both statements.

Option C is wrong because the order of statements does not matter; an explicit Deny always takes precedence over an Allow.
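The exhibit is not reproduced here, so the following is a minimal sketch of the "explicit Deny overrides Allow" rule using a hypothetical policy. The bucket name `example-bucket`, the `POLICY` list, and the `is_allowed` helper are all assumptions for illustration, and the matching logic is a heavy simplification of real IAM evaluation.

```python
from fnmatch import fnmatchcase

# Hypothetical statements standing in for the exhibit (not the real policy).
POLICY = [
    {"Effect": "Allow", "Action": "s3:PutObject",
     "Resource": "arn:aws:s3:::example-bucket/*"},
    {"Effect": "Deny", "Action": "s3:PutObject",
     "Resource": "arn:aws:s3:::example-bucket/secrets/*"},
]


def is_allowed(action: str, resource: str, statements: list) -> bool:
    """Simplified evaluation: explicit Deny wins, then Allow, else implicit deny."""
    matching = [s for s in statements
                if s["Action"] == action and fnmatchcase(resource, s["Resource"])]
    if any(s["Effect"] == "Deny" for s in matching):
        return False  # an explicit Deny overrides any Allow
    return any(s["Effect"] == "Allow" for s in matching)


print(is_allowed("s3:PutObject", "arn:aws:s3:::example-bucket/data/a.parquet", POLICY))
print(is_allowed("s3:PutObject", "arn:aws:s3:::example-bucket/secrets/a.parquet", POLICY))
```

Reversing the statement order gives the same result, which is why the position of the Deny in the policy document is irrelevant.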

Practice this question →

354

Multi-Selecteasy

A data engineer is monitoring an Amazon RDS for PostgreSQL instance. The engineer wants to set up alerts for high CPU utilization and low free storage space. Which AWS services can be used together to achieve this? (Choose TWO.)

Select 2 answers

A.Amazon Simple Notification Service (SNS)

B.Amazon CloudWatch

C.AWS CloudTrail

D.AWS Config

E.Amazon Route 53

AnswersA, B

SNS delivers alarm notifications.

Why this answer

Amazon CloudWatch is the correct service because it can monitor RDS metrics such as CPUUtilization and FreeStorageSpace, and trigger alarms based on thresholds. Amazon SNS is correct because it can receive CloudWatch alarm notifications and deliver them via email, SMS, or other endpoints, enabling the data engineer to be alerted when high CPU or low storage conditions occur.

Exam trap

The trap here is that candidates often confuse AWS CloudTrail (audit logging) with CloudWatch (monitoring), or think AWS Config can monitor performance metrics instead of just configuration compliance.
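As a rough sketch of how the two services fit together, the code below builds the parameter sets for two CloudWatch alarms that notify an SNS topic. The instance identifier, topic ARN, alarm names, and thresholds are hypothetical, and the helper `rds_alarm` is invented for this example. The dictionaries use the parameter names accepted by the CloudWatch `PutMetricAlarm` API, so with boto3 they could be passed as `cloudwatch.put_metric_alarm(**params)`.

```python
def rds_alarm(name, metric, threshold, comparison, instance_id, topic_arn):
    """Build PutMetricAlarm parameters for one RDS metric (illustrative values)."""
    return {
        "AlarmName": name,
        "Namespace": "AWS/RDS",
        "MetricName": metric,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,               # evaluate 5-minute averages
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "AlarmActions": [topic_arn],  # SNS topic that delivers the alert
    }


# Hypothetical identifiers, not taken from the question.
INSTANCE = "example-postgres"
TOPIC = "arn:aws:sns:us-east-1:123456789012:example-alerts"

high_cpu = rds_alarm("rds-high-cpu", "CPUUtilization", 80.0,
                     "GreaterThanThreshold", INSTANCE, TOPIC)
low_storage = rds_alarm("rds-low-storage", "FreeStorageSpace", 10 * 1024 ** 3,
                        "LessThanThreshold", INSTANCE, TOPIC)  # bytes
```

CloudWatch evaluates the metric and SNS only delivers the notification, which is why both services are needed.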

Practice this question →

355

MCQmedium

A data engineer manages an Amazon Redshift cluster that hosts a 10 TB data warehouse. The cluster uses a single node of type dc2.large (160 GB SSD). The engineer notices that the cluster's disk space is 95% full, and queries are running slowly. The engineer runs the STV_PARTITIONS view and sees that many slices have high 'tossed' counts. The engineer also runs VACUUM and ANALYZE commands, but the disk space does not improve. The engineer suspects that the cluster needs more storage. However, the company wants to minimize cost. Which action should the engineer take to resolve the disk space issue most cost-effectively?

A.Switch to a ra3.xlplus node with managed storage.

B.Replace the cluster with a single ds2.xlarge node.

C.Scale the cluster to a single dc2.8xlarge node.

D.Add another dc2.large node to the cluster to increase total storage.

AnswerB

ds2.xlarge provides 2 TB HDD storage at a lower cost than dc2 options, solving the disk space issue.

Why this answer

Option B is correct because dc2.large is a dense compute node with limited SSD storage; moving to ds2.xlarge provides more storage (2 TB HDD) at a lower cost per GB compared to scaling up to dc2.8xlarge. Option D is wrong because adding more dc2.large nodes increases storage but also CPU and memory, which may be unnecessary and more expensive than a single ds2 node. Option C is wrong because dc2.8xlarge has 2.56 TB SSD, which is more expensive than ds2.xlarge.

Option A is wrong because switching to RA3 nodes is costly and designed for managed storage, which may be overkill.

Practice this question →

356

MCQmedium

A data engineer is running a Spark job on Amazon EMR. The job reads from S3, processes data, and writes to S3. The job is taking longer than expected. The engineer notices that the job is spending a lot of time in the 'GC' (garbage collection) phase. Which configuration change is most likely to improve performance?

A.Increase the spark.executor.memory setting.

B.Increase the spark.sql.shuffle.partitions.

C.Decrease the number of executor cores.

D.Decrease the spark.executor.memoryOverhead.

AnswerA

More memory reduces GC overhead.

Why this answer

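Option A is correct because increasing executor memory gives each executor a larger heap, which reduces GC frequency. Option B is wrong because raising spark.sql.shuffle.partitions only changes how shuffle data is split; it does not add heap memory. Option C is wrong because decreasing the number of executor cores reduces parallelism without adding memory to the executor.

Option D is wrong because decreasing spark.executor.memoryOverhead reduces the memory available to the executor and may increase GC.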

Practice this question →

357

MCQeasy

A data engineer is designing a data lake on Amazon S3. The data includes sensitive personally identifiable information (PII). Which combination of services would provide the most comprehensive data protection?

A.Use S3 Transfer Acceleration and enable versioning

B.Enable S3 server-side encryption with AWS KMS

C.Use Amazon CloudWatch Logs to monitor access and enable MFA Delete

D.Enable S3 Block Public Access and use Amazon Macie to discover and classify PII

AnswerD

Block Public Access prevents exposure; Macie identifies and alerts on PII.

Why this answer

Option D is correct because S3 Block Public Access prevents exposure and Macie identifies sensitive data. Option B is wrong because KMS only encrypts; it neither discovers PII nor prevents public exposure. Option C is wrong because CloudWatch does not protect data.

Option A is wrong because S3 Transfer Acceleration is for speed.

Practice this question →

358

MCQhard

A company runs a critical data pipeline using Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data is compressed with GZIP and partitioned by year/month/day/hour. Recently, the delivery to S3 has been failing with 'Rate exceeded' errors. The Firehose delivery stream has a buffer size of 128 MB and buffer interval of 60 seconds. What is the most effective way to resolve this issue?

A.Transition objects to S3 Glacier after 30 days.

B.Decrease the buffer size to 64 MB and buffer interval to 30 seconds.

C.Increase the buffer size to 256 MB and buffer interval to 120 seconds.

D.Enable server-side encryption on the S3 bucket.

AnswerC

Larger buffers reduce the number of S3 PUT requests, alleviating throttling.

Why this answer

The 'Rate exceeded' error indicates that Kinesis Data Firehose is sending requests to S3 at a rate that exceeds the S3 bucket's request rate limits for PUT operations. Increasing the buffer size to 256 MB and the buffer interval to 120 seconds allows Firehose to accumulate more data before each S3 PUT request, reducing the number of requests per second and staying within S3's 3,500 PUT requests per second limit per prefix. This directly addresses the throttling issue without changing the data volume.

Exam trap

The trap here is that candidates mistakenly think reducing buffer size or interval will speed up delivery, but in reality, it increases request frequency and worsens S3 throttling, while increasing buffers is the correct way to reduce request rate.

How to eliminate wrong answers

Option A is wrong because transitioning objects to S3 Glacier after 30 days does not affect the rate of PUT requests to the S3 bucket; it only changes storage class after delivery, so it cannot resolve current delivery failures. Option B is wrong because decreasing the buffer size to 64 MB and buffer interval to 30 seconds would increase the frequency of S3 PUT requests, worsening the 'Rate exceeded' errors by exceeding the bucket's request rate limits even more. Option D is wrong because enabling server-side encryption on the S3 bucket does not change the request rate or throughput; it only encrypts objects at rest and has no impact on throttling of PUT operations.
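The effect of the buffer settings on request rate can be estimated with a short calculation. This is a sketch under stated assumptions: the 10 MB/s ingest rate is hypothetical (the question gives no throughput), the helper `deliveries_per_hour` is invented for illustration, the buffer values are taken from the question as written, and the model assumes Firehose delivers as soon as either the size or the interval threshold is reached.

```python
def deliveries_per_hour(rate_mb_per_s: float, buffer_mb: float, interval_s: float) -> float:
    """Estimate S3 deliveries per hour when a flush happens at whichever
    threshold (buffer size or buffer interval) is reached first."""
    seconds_to_fill = buffer_mb / rate_mb_per_s
    seconds_per_delivery = min(seconds_to_fill, interval_s)
    return 3600 / seconds_per_delivery


RATE_MB_PER_S = 10.0  # hypothetical ingest rate

current = deliveries_per_hour(RATE_MB_PER_S, 128, 60)
smaller = deliveries_per_hour(RATE_MB_PER_S, 64, 30)
larger = deliveries_per_hour(RATE_MB_PER_S, 256, 120)

print(f"128 MB / 60 s : {current:.1f} deliveries per hour")
print(f" 64 MB / 30 s : {smaller:.1f} deliveries per hour")
print(f"256 MB / 120 s: {larger:.1f} deliveries per hour")
```

Under these assumptions, halving the buffer doubles the number of PUT requests and doubling it halves them, which matches the direction of the reasoning above.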

Practice this question →

359

MCQhard

A data engineer is monitoring an Amazon Redshift cluster and notices that queries are taking longer than expected. The engineer checks the system tables and sees that many queries are waiting for 'WLM' resources. What is the most likely cause and recommended fix?

A.The table sort keys are poorly designed; recreate tables with better sort keys.

B.The distribution style is set to ALL; change to KEY distribution.

C.The WLM queue concurrency is set too low; increase the concurrency level.

D.The cluster is running low on disk space; resize the cluster.

AnswerC

Higher concurrency allows more simultaneous queries.

Why this answer

Option C is correct because WLM queue wait indicates concurrency throttling. Option D is wrong because disk space is unrelated. Option A is wrong because sort keys improve scan efficiency, not concurrency.

Option B is wrong because distribution style affects data movement, not queue wait.

Practice this question →

360

MCQhard

A company has an S3 data lake with millions of objects. A data engineer needs to provide a daily report of objects that are not accessed for 90 days. The engineer must minimize cost and impact on performance. Which approach should be used?

A.Enable S3 Inventory and query with Athena

B.Use S3 Select on each object to check last access metadata

C.Analyze S3 server access logs to find objects not accessed

D.Use S3 Storage Lens to generate a dashboard of object age and last access

AnswerD

Storage Lens provides built-in metrics at low cost.

Why this answer

Option D is correct because S3 Storage Lens provides cost-effective analytics including last access date. Option A is wrong because S3 Inventory creates daily lists but requires Athena queries and is more complex. Option C is wrong because S3 server access logs can be large and costly to query.

Option B is wrong because S3 Select is for querying objects' content, not metadata.

Practice this question →

361

Multi-Selecthard

A data engineer is designing a data lake on Amazon S3. The data is ingested from multiple sources and must be queryable using Amazon Athena. The engineer needs to optimize query performance and reduce costs. Which THREE actions would achieve this?

Select 3 answers

A.Store data in many small files to increase parallelism.

B.Partition the data by a commonly used filter column.

C.Use S3 Select instead of Athena for queries.

D.Compress data with a splittable compression format like Snappy.

E.Convert data to Apache Parquet or ORC format.

AnswersB, D, E

Partition pruning limits the data scanned.

Why this answer

Option B: Partitioning reduces data scanned. Option E: Using columnar formats like Parquet reduces data scanned. Option D: Compression reduces storage and data scanned.

Option A is wrong because the number of files should be optimized, not increased. Option C is wrong because S3 Select is for filtering within a single file, not for Athena.

Practice this question →

362

MCQhard

Your company runs a critical data processing pipeline that ingests data from multiple sources into an Amazon S3 bucket. An AWS Glue ETL job processes this data and writes the output to an Amazon Redshift cluster. The pipeline is triggered by an S3 event notification that invokes an AWS Lambda function, which starts the Glue job. Recently, you have observed that the Glue job occasionally fails with an AccessDenied error when trying to access the S3 bucket. The IAM role used by the Glue job has the following policy: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::input-bucket", "arn:aws:s3:::input-bucket/*" ] }, { "Effect": "Allow", "Action": [ "redshift:CopyData" ], "Resource": "*" } ] }. The S3 bucket has a bucket policy that allows access only from a specific VPC. The Glue job runs in a VPC with the appropriate VPC endpoints configured. The error occurs intermittently and sometimes retries succeed. What is the most likely cause and correct course of action?

A.Add a VPC endpoint for S3 and configure the bucket policy to allow access from the Glue job's VPC endpoint.

B.Ensure the Glue job's VPC configuration includes a NAT gateway to route traffic to S3.

C.Change the Lambda function to use a different IAM role with broader S3 permissions.

D.Modify the Glue job's IAM role to include s3:PutObject permission for the output bucket.

AnswerA

This ensures requests from Glue are routed through the VPC endpoint and comply with the bucket policy.

Why this answer

Option A is correct because the Glue job runs in a VPC, but if the S3 bucket policy requires requests to come from the VPC endpoint, the Glue job's requests must originate from that endpoint. Glue jobs running in a VPC do not automatically route S3 traffic through VPC endpoints; the traffic goes over the internet unless a VPC endpoint is explicitly used. The intermittent success might be due to other requests coming from the same IP range.

The correct action is to ensure the Glue job uses a VPC endpoint for S3. Option D is wrong because the bucket policy is the issue, not the IAM policy. Option C is wrong because the Lambda function is not directly accessing S3.

Option B is wrong because the Glue job already runs in the VPC, and a NAT gateway would send S3 traffic over the public internet rather than through the VPC endpoint that the bucket policy expects.
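For reference, a bucket policy that restricts access to a single VPC endpoint typically uses the `aws:SourceVpce` condition key. The sketch below builds such a policy as a Python dictionary. The endpoint ID is hypothetical, the helper `vpce_only_bucket_policy` is invented for this example, and the actual bucket policy in the scenario is not shown in the question, so this is only one plausible shape.

```python
import json


def vpce_only_bucket_policy(bucket: str, vpce_id: str) -> dict:
    """Deny all S3 actions on the bucket unless the request arrives
    through the given VPC endpoint (illustrative policy only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnlessFromVpce",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
        }],
    }


# "input-bucket" comes from the question; the endpoint ID is made up.
policy = vpce_only_bucket_policy("input-bucket", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

With a policy like this, any request that does not travel through the named endpoint is denied regardless of what the IAM role allows.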

Practice this question →

363

MCQeasy

A data engineer is monitoring an Amazon Kinesis Data Analytics application that uses a SQL query to aggregate streaming data. The application is falling behind and the millisBehindLatest metric is increasing. Which action should the engineer take to improve performance?

A.Switch from SQL to Apache Flink for the analytics application

B.Increase the number of shards in the source Kinesis stream

C.Increase the Parallelism setting of the Kinesis Data Analytics application

D.Decrease the window duration of the SQL query

AnswerC

Higher parallelism increases processing capacity, reducing lag.

Why this answer

Increasing the Parallelism setting of the Kinesis Data Analytics application allows the SQL query to process data across more in-application streams and operators concurrently, directly addressing the lag indicated by the rising millisBehindLatest metric. This action scales the compute resources allocated to the application without changing the source stream or the query logic, making it the most direct way to improve throughput for a SQL-based Kinesis Data Analytics application.

Exam trap

The trap here is that candidates often confuse scaling the source (shards) with scaling the processing engine (parallelism), assuming that more data input automatically fixes processing lag, when in fact the bottleneck is the application's compute capacity.

How to eliminate wrong answers

Option A is wrong because switching from SQL to Apache Flink is a fundamental architectural change that is not required to address performance tuning; the question specifically states the application uses SQL, and Flink would require rewriting the application entirely, not a simple performance fix. Option B is wrong because increasing the number of shards in the source Kinesis stream increases the ingestion capacity but does not directly improve the processing speed of the Kinesis Data Analytics application; if the application is already falling behind due to insufficient compute, more shards will only increase the backlog. Option D is wrong because decreasing the window duration of the SQL query reduces the amount of data aggregated per window, which may reduce latency but does not increase the overall processing parallelism or throughput; it could even cause more frequent window triggers, potentially worsening the lag.

Practice this question →

364

MCQhard

A company uses Amazon Redshift for data warehousing. They notice that queries are running slowly, and the STL_LOAD_ERRORS table shows many 'Parse error' entries. The data is loaded from Amazon S3 using COPY commands. What is the MOST likely cause of the parse errors?

A.The source data files have a different schema or delimiter than what is specified in the COPY command.

B.The Redshift cluster does not have enough compute nodes to process the data.

C.The IAM role used by Redshift does not have permission to decrypt the S3 objects.

D.The source data files are compressed using an unsupported compression format.

AnswerA

Schema mismatch leads to parse errors.

Why this answer

Option A is correct because parse errors during COPY typically indicate that the source data does not match the target table schema (e.g., data type mismatch or delimiter issues). Option D is incorrect because an unsupported compression format would cause decompression errors, not parse errors. Option C is incorrect because data encryption issues would cause access denied errors.

Option B is incorrect because insufficient compute resources would cause performance issues but not parse errors.
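A delimiter mismatch is easy to reproduce locally. The sketch below uses Python's `csv` module rather than Redshift itself, with made-up sample rows and a hypothetical helper `count_fields`, to show how the same file yields the wrong number of fields when parsed with the wrong delimiter.

```python
import csv
import io

# Hypothetical pipe-delimited sample with three columns per row.
SAMPLE = "1|alice|2024-01-01\n2|bob|2024-01-02\n"


def count_fields(text: str, delimiter: str) -> list:
    """Return the number of fields found in each row for a given delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]


with_wrong_delimiter = count_fields(SAMPLE, ",")  # parser expects commas
with_right_delimiter = count_fields(SAMPLE, "|")  # parser matches the file

print("comma:", with_wrong_delimiter)
print("pipe: ", with_right_delimiter)
```

Parsed with a comma, each row collapses into a single field, so a three-column target table cannot be populated; a loader in that situation has nothing sensible to map to the remaining columns.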

Practice this question →

365

MCQeasy

A company uses Amazon EMR to run Spark jobs on a transient cluster. The jobs process data from S3 and write results back to S3. The team wants to reduce costs by optimizing the cluster. Which action should the team take?

A.Use Spot Instances for the task nodes.

B.Increase the number of core nodes and use larger instance types.

C.Enable EMRFS consistent view.

D.Terminate the cluster after each job and manually restart it for the next job.

AnswerA

Spot instances are cheaper than on-demand.

Why this answer

Using Spot Instances for task nodes in a transient EMR cluster significantly reduces compute costs because Spot Instances are spare AWS EC2 capacity offered at up to 90% discount compared to On-Demand. Since transient clusters are terminated after job completion, the risk of Spot Instance interruptions is mitigated—the job can simply be retried on a new cluster if needed. This directly addresses the cost optimization goal without sacrificing job functionality.

Exam trap

The trap here is that candidates confuse cost optimization with performance tuning, leading them to choose larger instances (Option B) or consistency features (Option C), when the real cost lever for transient workloads is leveraging Spot pricing for ephemeral compute.

How to eliminate wrong answers

Option B is wrong because increasing the number of core nodes and using larger instance types increases costs, not reduces them, and core nodes host HDFS which is unnecessary for a transient cluster that reads/writes directly to S3. Option C is wrong because EMRFS consistent view is a feature to handle S3 eventual consistency for listing and renaming, not a cost optimization mechanism—it adds overhead without reducing spend. Option D is wrong because EMR transient clusters already terminate automatically after the job completes; manually restarting is redundant and does not further reduce costs, and it introduces operational overhead.

Practice this question →

366

MCQhard

A data engineer is running an AWS Glue ETL job that converts CSV files to Parquet. The job fails with the error shown in the exhibit. The input files are about 500 MB each. The job uses 5 workers of type G.1X (16 GB memory each). What is the MOST likely cause?

A.The output Parquet file size is too large for the executor memory

B.The data is highly skewed causing a single partition to receive too much data

C.The Spark driver does not have enough memory to handle the schema inference

D.The input CSV files are corrupt or have inconsistent schema

AnswerA

Writing a large file requires memory proportional to file size; splitting into smaller files can help.

Why this answer

Option A is correct because the error shows OOM in the write task, which typically occurs when writing large files. Spark tries to write a large Parquet file that exceeds the executor memory. Option D would cause different errors.

Option C is about reading, not writing. Option B is about data skew, which would cause OOM in shuffle, not in write.

Practice this question →

367

Multi-Selecteasy

A data engineer is migrating a legacy data warehouse to Amazon Redshift. The engineer needs to load data from multiple sources efficiently. Which THREE services can be used to load data into Redshift? (Choose THREE.)

Select 3 answers

A.Use the COPY command to load from Amazon DynamoDB.

B.Use Kinesis Data Firehose to deliver data directly to Redshift.

C.Use AWS DMS to replicate data continuously.

D.Use S3 Transfer Acceleration.

E.Use the COPY command to load from Amazon S3.

AnswersA, B, E

COPY can load data from DynamoDB tables.

Why this answer

Options A, B, and E are correct: COPY from DynamoDB, Kinesis Data Firehose, and COPY from S3 all load directly into Redshift. Option D is wrong because S3 Transfer Acceleration is for uploading to S3, not loading into Redshift. Option C is not among the listed answers: AWS DMS can use Redshift as a target, but it is a migration and replication service rather than a direct load mechanism.

Practice this question →

368

MCQmedium

A data engineer is troubleshooting a Kinesis Data Analytics application that processes streaming data. The application is falling behind and has a high 'MillisBehindLatest' metric. The application uses a parallelism of 2. The source stream has 4 shards. What is the MOST likely cause and solution?

A.The application is using a JSON format; switch to Avro.

B.The source stream has too many shards; decrease to 2.

C.The application parallelism is too low; increase it to 4.

D.The output destination is slow; change to a faster sink.

AnswerC

With 4 shards, parallelism should be at least 4 to process all shards concurrently.

Why this answer

The 'MillisBehindLatest' metric indicates the application is not keeping up with the incoming data. With a source stream of 4 shards and a parallelism of only 2, the application cannot process data from all shards concurrently, leading to backpressure. Increasing parallelism to match the shard count (4) allows each shard to be processed by a separate task, reducing lag.

Exam trap

The trap here is that candidates may assume increasing parallelism always improves performance, but the key insight is that parallelism must match or exceed the number of source shards to avoid a concurrency bottleneck, not just be arbitrarily high.

How to eliminate wrong answers

Option A is wrong because changing the serialization format from JSON to Avro reduces data size but does not address the fundamental throughput mismatch between shard count and parallelism; the bottleneck is concurrency, not serialization efficiency. Option B is wrong because reducing the number of shards would decrease the source stream's throughput capacity, potentially causing data loss or throttling; the correct approach is to scale application parallelism to match the existing shards. Option D is wrong because a slow output sink would cause backpressure that manifests as increased 'MillisBehindLatest', but the question states the application is falling behind, and the most direct cause given the parallelism of 2 versus 4 shards is insufficient processing concurrency, not sink performance.

Practice this question →

369

MCQeasy

A company uses Amazon S3 as a data lake. A data engineer needs to ensure that all objects uploaded to the 'incoming' prefix are automatically encrypted at rest using AWS KMS with a specific customer managed key. What is the simplest way to enforce this?

A.Enable S3 Transfer Acceleration to force encryption in transit.

B.Use a bucket policy that denies PutObject requests without the required encryption header.

C.Configure S3 Inventory to report on encryption status and alert on non-compliance.

D.Enable default encryption on the bucket with SSE-S3.

AnswerB

A bucket policy with a condition for s3:x-amz-server-side-encryption-aws-kms-key-id enforces the specific key.

Why this answer

Option B is correct because S3 bucket policies can enforce encryption using a condition key. Option D is wrong because default encryption with SSE-S3 uses S3-managed keys and does not enforce the customer managed key. Option C is wrong because S3 Inventory does not enforce encryption.

Option A is wrong because S3 Transfer Acceleration does not affect encryption.
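A minimal sketch of such a bucket policy is shown below, built as a Python dictionary. The bucket name, account ID, and key ARN are hypothetical, and the helper `require_kms_key_policy` is invented for this example. The condition key `s3:x-amz-server-side-encryption-aws-kms-key-id` is the one named in the answer summary.

```python
import json


def require_kms_key_policy(bucket: str, prefix: str, key_arn: str) -> dict:
    """Deny PutObject under the prefix when the request does not name
    the required KMS key (illustrative policy only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyWrongKmsKey",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption-aws-kms-key-id": key_arn
                }
            },
        }],
    }


# Hypothetical bucket and key; only the 'incoming' prefix comes from the question.
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555"
policy = require_kms_key_policy("example-data-lake", "incoming", KEY_ARN)
print(json.dumps(policy, indent=2))
```

Scoping the Resource to the prefix keeps the rule from affecting uploads elsewhere in the bucket.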

Practice this question →

370

MCQhard

A company runs a data pipeline that uses Amazon EMR to process large datasets. The pipeline reads data from S3, processes it using Spark, and writes results back to S3. Recently, the pipeline has been failing with 'OutOfMemoryError' in the Spark executors. The EMR cluster is configured with 5 core nodes of type m5.xlarge (4 vCPU, 16 GB memory each). The Spark application uses dynamic allocation and default Spark configurations. The input data size is approximately 500 GB in Parquet format. What is the most cost-effective way to resolve the out-of-memory errors?

A.Increase the spark.executor.memory setting to 8 GB in the Spark configuration.

B.Change the core node instance type to r5.xlarge (32 GB memory) and keep 5 nodes.

C.Increase the number of core nodes to 10 to distribute the data across more executors.

D.Change the input data format from Parquet to ORC to reduce memory footprint.

AnswerB

Memory-optimized instances provide more memory per node, reducing OOM without increasing node count.

Why this answer

Option B is correct because the current cluster has limited memory per node (16 GB). By switching to memory-optimized instances like r5.xlarge (32 GB), each node has double the memory, reducing the chance of OOM. This is more cost-effective than adding more nodes because the total memory per node increases without increasing the number of instances.

Option C is wrong because increasing the number of nodes adds more memory but also more cost; it might be more expensive than using fewer, larger nodes. Option A is wrong because it is generally not recommended to raise spark.executor.memory close to or beyond the physical memory of the node; YARN may then kill the containers. Option D is wrong because Parquet is already efficient; changing to a different format may not solve memory issues.

Practice this question →

371

Multi-Selectmedium

A data engineer needs to ensure that data in an Amazon S3 bucket is not publicly accessible. Which TWO measures should the engineer implement? (Choose TWO.)

Select 2 answers

A.Attach a bucket policy that denies access to 'Principal': '*' unless specific conditions are met.

B.Create a lifecycle policy to delete objects after 30 days.

C.Enable S3 Block Public Access settings on the bucket.

D.Enable S3 Versioning on the bucket.

E.Enable default encryption on the bucket.

AnswersA, C

A bucket policy can deny all public access.

Why this answer

Blocking public access at the bucket level (Option C) and using a bucket policy to deny public access (Option A) are effective controls. Option E is wrong because encryption does not prevent public access. Option D is wrong because versioning does not control access.

Option B is wrong because lifecycle policies do not control access.

Practice this question →

372

MCQeasy

A data engineer is designing a data pipeline that ingests data from an on-premises database into Amazon S3 using AWS Database Migration Service (DMS). The data must be encrypted at rest in S3 using SSE-S3. The engineer also needs to track changes to the source database in real time. Which DMS configuration should the engineer use?

A.Use DMS with a snapshot of the source database.

B.Use DMS with ongoing replication (change data capture) enabled.

C.Use DMS with a full load task only.

D.Use DMS with a full load task and then stream to Amazon Kinesis.

AnswerB

CDC captures real-time changes.

Why this answer

Option B is correct because DMS with CDC captures ongoing changes. Option C is wrong because full load only captures a snapshot. Option D is wrong because DMS supports CDC without needing Kinesis.

Option A is wrong because restoring from a snapshot is not real-time.

Practice this question →

373

MCQmedium

A company is running an Amazon EMR cluster with Spark for data processing. The data engineer wants to automatically scale the core and task nodes based on the YARN memory and CPU utilization. Which scaling metric should the engineer use for the EMR managed scaling policy?

A.YARNMemoryAvailablePercentage

B.CPUUtilization

C.DiskIOPS

D.HDFSUtilization

AnswerA

EMR managed scaling uses YARN memory metrics.

Why this answer

Option A is correct because EMR managed scaling uses YARNMemoryAvailablePercentage and YARNContainersPending as the default metrics for scaling. Option B is incorrect because CPUUtilization is not a default metric for EMR managed scaling. Option D is incorrect because HDFSUtilization is for HDFS, not YARN.

Option C is incorrect because IOPS is not a metric for EMR managed scaling.

Practice this question →

374

Multi-Selecteasy

A data engineer is monitoring an Amazon Kinesis Data Stream used to ingest clickstream data. The engineer notices that the stream's 'WriteProvisionedThroughputExceeded' metric is frequently above zero. Which TWO actions could help mitigate this issue? (Choose TWO.)

Select 2 answers

A.Increase the number of shards in the stream.

B.Reduce the data retention period to free up capacity.

C.Decrease the number of shards to reduce overhead.

D.Implement a random prefix for the partition key to distribute data evenly.

E.Enable enhanced fan-out on the stream.

AnswersA, D

More shards increase total write capacity.

Why this answer

Options A and D are correct. Increasing the number of shards increases write capacity. Implementing a random prefix for partition keys distributes writes more evenly across shards.

Option C is wrong because decreasing shards reduces capacity. Option E is wrong because enabling enhanced fan-out increases read capacity, not write. Option B is wrong because reducing the retention period does not affect write throughput.
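The effect of a random prefix can be sketched locally. Kinesis maps each partition key to a shard through an MD5 hash of the key. The code below imitates that with four shards, under the assumption that the shards split the 128-bit hash space evenly. The key `user-123`, the 100 possible prefixes, and the helper `shard_for` are all made up for illustration.

```python
import hashlib
import random
from collections import Counter

SHARD_COUNT = 4        # assumed number of shards, splitting the hash space evenly
HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(partition_key: str) -> int:
    """Map a partition key to a shard index via its MD5 hash."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hashed * SHARD_COUNT // HASH_SPACE


rng = random.Random(0)  # seeded so the run is repeatable

# One hot key: every record hashes to the same shard.
hot = Counter(shard_for("user-123") for _ in range(1000))

# Same key with a random numeric prefix: records spread across shards.
salted = Counter(shard_for(f"{rng.randrange(100)}-user-123") for _ in range(1000))

print("hot key:   ", dict(hot))
print("salted key:", dict(salted))
```

The trade-off is that records for the same logical key no longer land on a single shard, so per-key ordering is lost unless the consumer reassembles it.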

Practice this question →

375

Multi-Selectmedium

A data engineer is troubleshooting a Glue ETL job that reads from an S3 bucket and writes to a Redshift table. The job fails with a 'MemoryError' when processing a large dataset. Which TWO actions should the engineer take to resolve this issue? (Choose TWO.)

Select 2 answers

A.Increase the number of DPUs and set 'spark.sql.shuffle.partitions' to a higher value.

B.Increase the number of DPUs and set 'coalesce(1)' in the script.

C.Decrease the number of DPUs and increase 'spark.shuffle.partitions'.

D.Set the 'RedshiftTempDir' parameter to a larger S3 bucket.

E.Set the 'groupFiles' option to 'inPartition' in the S3 source configuration.

AnswersA, E

More DPUs and shuffle partitions distribute data across more executors, reducing per-executor memory load.

Why this answer

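Option A is correct because increasing the number of DPUs (Data Processing Units) provides more memory and compute resources to the Glue job, directly addressing the MemoryError. Setting 'spark.sql.shuffle.partitions' to a higher value reduces the amount of data shuffled per partition, preventing out-of-memory errors during wide transformations like joins or aggregations. Option E is correct because setting 'groupFiles' to 'inPartition' lets Glue read many small S3 files as larger groups, which reduces the number of partitions and tasks the job has to hold in memory.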

Exam trap

The trap here is that candidates confuse 'coalesce(1)' (which reduces parallelism) with a memory-saving technique, or mistakenly think decreasing DPUs or adjusting RedshiftTempDir can fix memory errors, when in fact memory errors require more resources and better partition management.
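As a rough illustration of the two settings discussed above, the sketch below only builds the configuration values; it does not call Glue. The S3 path, the 128 MB group size, and the value 400 are arbitrary choices for this example. In a Glue script, the first dictionary would typically be passed as `connection_options` to `glueContext.create_dynamic_frame.from_options(connection_type="s3", ...)`.

```python
# S3 source options for a Glue DynamicFrame read (hypothetical path).
s3_source_options = {
    "paths": ["s3://example-input-bucket/data/"],
    "groupFiles": "inPartition",              # group small files within each S3 partition
    "groupSize": str(128 * 1024 * 1024),      # target group size in bytes, given as a string
}

# Spark setting to raise above the default of 200 shuffle partitions.
spark_conf = {"spark.sql.shuffle.partitions": "400"}

print(s3_source_options)
print(spark_conf)
```

More shuffle partitions means each one holds less data, and grouping files limits how many small input files turn into separate tasks.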

Practice this question →
