AWS Certified Data Engineer Associate (DEA-C01): Questions 1501–1575

1786 questions total · 24 pages · All types, answers revealed

1501

MCQ · easy

A media company stores video files in an Amazon S3 bucket with S3 Standard storage class. The files are accessed frequently for the first 30 days, then rarely after that. However, the company must be able to restore any deleted file within 7 days. The company wants to minimize storage costs while meeting the access and retention requirements. What should a data engineer do?

A. Use S3 Standard-IA storage class from the start.

B. Use a lifecycle policy to transition objects to S3 One Zone-IA after 30 days.

C. Use S3 Glacier Deep Archive after 30 days and enable S3 Object Lock for retention.

D. Use S3 Intelligent-Tiering and enable S3 Versioning on the bucket.

Answer: D

S3 Intelligent-Tiering optimizes costs by moving data between access tiers, and Versioning allows recovery of deleted objects.

Why this answer

Option D is correct because S3 Intelligent-Tiering automatically moves objects between access tiers (Frequent Access, Infrequent Access, and Archive Instant Access) based on observed access patterns, so objects drop to cheaper tiers once the initial period of frequent access ends. With S3 Versioning enabled, a delete only adds a delete marker, so the company can restore a deleted file by removing the marker or retrieving the previous version. This meets the 7-day recovery requirement without a separate backup, although noncurrent versions are still billed as stored data.

Exam trap

The trap here is that candidates may overlook the requirement to restore deleted files within 7 days and focus only on cost optimization, leading them to choose a storage class like S3 Standard-IA or S3 One Zone-IA that lacks versioning or retention capabilities, or they may incorrectly assume S3 Object Lock can restore already deleted files.

How to eliminate wrong answers

Option A is wrong because S3 Standard-IA has a minimum storage duration charge of 30 days and a per-GB retrieval cost, making it more expensive than S3 Standard for the first 30 days of frequent access, and it does not provide the ability to restore deleted files within 7 days. Option B is wrong because S3 One Zone-IA does not provide the same durability as S3 Standard (it stores data in a single Availability Zone) and lacks versioning or retention features to restore deleted files within 7 days; additionally, transitioning after 30 days incurs lifecycle transition costs. Option C is wrong because S3 Glacier Deep Archive has a minimum storage duration of 180 days and a retrieval time of 12 hours or more, which does not meet the requirement to restore deleted files within 7 days; S3 Object Lock only prevents object deletion or overwrites, but does not enable restoration of already deleted files.

1502

MCQ · medium

A company uses Amazon Kinesis Data Streams to ingest real-time data. The data engineer notices that the stream's 'WriteProvisionedThroughputExceeded' error occurs frequently during peaks. Which action should be taken to resolve this issue?

A. Increase the number of shards in the stream.

B. Modify the producer to use a different partition key.

C. Compress the data before sending to the stream.

D. Enable enhanced fan-out for consumers.

Answer: A

More shards provide higher write throughput.

Why this answer

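Option A is correct. Each shard accepts up to 1 MB or 1,000 records per second of writes, so increasing the number of shards raises the stream's write capacity and directly addresses the throughput-exceeded error. Option B is wrong because a different partition key only redistributes records across the existing shards; it can relieve a hot shard but does not raise total write capacity. Option C is wrong because compression reduces bytes per record but does nothing about the per-shard record-count limit. Option D is wrong because enhanced fan-out increases read throughput for consumers, while this error is about writes.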
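The sizing logic behind Option A can be sketched in a few lines. This is a rough estimate only: it assumes provisioned mode, evenly distributed partition keys, and the per-shard write limits of 1 MB/s and 1,000 records/s. The function name and the peak figures are hypothetical.

```python
import math

# Per-shard write limits for a provisioned Kinesis data stream.
MAX_MB_PER_SHARD = 1.0         # MB per second
MAX_RECORDS_PER_SHARD = 1000   # records per second


def required_shards(peak_mb_per_s: float, peak_records_per_s: float) -> int:
    """Return the smallest shard count that covers both write limits."""
    by_bytes = math.ceil(peak_mb_per_s / MAX_MB_PER_SHARD)
    by_records = math.ceil(peak_records_per_s / MAX_RECORDS_PER_SHARD)
    return max(1, by_bytes, by_records)


# Hypothetical peak: 6.5 MB/s spread over 4,200 records/s.
print(required_shards(6.5, 4200))  # 7
```

Whichever limit is hit first determines the shard count, which is why compressing records alone may not remove the error.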

1503

MCQ · hard

A data team uses the CloudFormation template in the exhibit to create an S3 bucket for storing log files. After one year, they notice that the bucket size is larger than expected. They investigate and find that older versions of objects are not being deleted or transitioned. What is the most likely cause?

A. The lifecycle rule does not apply to noncurrent versions because it lacks NoncurrentVersionExpiration or NoncurrentVersionTransition.

B. The lifecycle rule is not enabled because the Status property is not set to 'Enabled' properly.

C. The bucket has versioning enabled, but the lifecycle rule only applies to current versions.

D. The expiration in days is set to 365, which is too short.

Answer: A

Old versions are not managed by the rule.

Why this answer

Option A is correct because the lifecycle rule in the CloudFormation template is missing the `NoncurrentVersionExpiration` and `NoncurrentVersionTransition` properties. When S3 bucket versioning is enabled, lifecycle actions such as `ExpirationInDays` or `Transition` apply only to the current version of each object; expiring a current version merely adds a delete marker and turns that version into a noncurrent one. To manage older (noncurrent) versions, the rule must explicitly include a noncurrent-version action.

Without these, noncurrent versions accumulate indefinitely, causing the bucket size to grow larger than expected.

Exam trap

The trap here is that candidates assume a lifecycle rule with `ExpirationInDays` automatically cleans up all versions of an object, but in S3 with versioning enabled, it only affects the current version, leaving noncurrent versions to accumulate.

How to eliminate wrong answers

Option B is wrong because the `Status` property is not the issue; the rule is enabled and active, but it only targets current versions. Option C is wrong because it describes the symptom rather than the cause: the rule does cover only current versions, but the specific defect is the missing noncurrent-version actions that Option A names. Option D is wrong because 365 days is a reasonable expiration period, and changing it would still leave noncurrent versions untouched.
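The exhibit is not reproduced here, so the following is only a sketch of the gap, using the rule shape that the S3 lifecycle API accepts (not the CloudFormation template itself). The rule ID, the 30-day value, and the helper function are hypothetical.

```python
def handles_noncurrent_versions(rule: dict) -> bool:
    """True if a lifecycle rule expires or transitions noncurrent versions."""
    noncurrent_actions = (
        "NoncurrentVersionExpiration",
        "NoncurrentVersionTransitions",
    )
    return any(action in rule for action in noncurrent_actions)


# Hypothetical rule resembling the one described: current versions only.
current_only = {
    "ID": "expire-logs",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},
    "Expiration": {"Days": 365},
}

# Same rule with an action that also cleans up noncurrent versions.
fixed = {**current_only, "NoncurrentVersionExpiration": {"NoncurrentDays": 30}}

print(handles_noncurrent_versions(current_only))  # False
print(handles_noncurrent_versions(fixed))         # True
```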

1504

MCQ · easy

A data engineer needs to ingest data from an Amazon S3 bucket into an Amazon Redshift table on a daily schedule. The data is in CSV format and the schema matches. Which service is simplest for this batch ingestion?

A. Amazon Redshift COPY command

B. AWS Glue ETL job with JDBC connection

C. AWS Data Pipeline

D. Amazon Athena CREATE TABLE AS SELECT

Answer: A

Direct and optimized for loading from S3.

Why this answer

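The Redshift COPY command loads data from S3 directly and in parallel, which makes it the simplest option for CSV files whose schema already matches the table. AWS Glue (B) and AWS Data Pipeline (C) also work but add components to build and maintain. Athena CTAS (D) writes query results to S3; it does not load data into Redshift.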
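For illustration, a COPY statement for this scenario might be assembled as below. The table name, S3 prefix, and role ARN are placeholders, and `IGNOREHEADER 1` assumes each file starts with a header row, which the question does not state.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement for CSV files under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"  # assumption: files carry a header row
    )


# All identifiers below are hypothetical.
statement = build_copy_statement(
    table="sales.daily_orders",
    s3_uri="s3://example-bucket/orders/2024-01-01/",
    iam_role_arn="arn:aws:iam::111122223333:role/ExampleRedshiftCopyRole",
)
print(statement)
```

The daily schedule itself still has to come from something that runs the statement, such as a scheduled query.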

1505

MCQ · medium

A financial services company uses Amazon Athena to query a data lake in S3. The data lake contains sensitive financial transactions. The security team has implemented row-level security using views in AWS Glue Data Catalog. Each view is defined with a WHERE clause that filters rows based on the user's IAM role using a custom tag. However, when a data analyst runs a SELECT * FROM view_name in Athena, the query returns all rows, ignoring the row-level filter. The analyst's IAM role has the tag 'department=analytics'. The view was created with a filter condition 'department = current_user_department()', where current_user_department() is a user-defined function that extracts the department tag from the caller's IAM role. The function is defined in the Glue Data Catalog. What is the most likely reason the filter is not applied?

A. The user-defined function current_user_department() is not registered in the Glue Data Catalog.

B. Athena does not support user-defined functions in views.

C. The IAM role does not have the tag 'department=analytics'.

D. The view is not defined with the filter condition properly; the function current_user_department() may not be invoked correctly in the view definition.

Answer: D

The function must be used in the view's SELECT statement.

Why this answer

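Option D is correct. Views in Athena do not automatically enforce row-level security; rows are filtered only if the view definition itself applies the filter, so a view that does not invoke current_user_department() correctly returns every row. Option A is wrong because the scenario states that the function is defined in the Glue Data Catalog. Option B is wrong because Athena executes the view's SQL, including the function calls it contains. Option C is wrong because the scenario states that the analyst's role carries the tag 'department=analytics'.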

1506

MCQ · medium

A data engineer is using Amazon EMR to transform large datasets stored in S3. The cluster runs once a day and takes 3 hours. The engineer notices that the cluster is idle for 30 minutes at the start while waiting for resources. What is the most cost-effective way to reduce the idle time?

A. Increase the instance type size

B. Use Spot Instances for all nodes

C. Configure a larger initial core instance count and enable managed scaling

D. Purchase Reserved Instances for the cluster

Answer: C

More core nodes reduce the time to allocate resources, and managed scaling adjusts during the job.

Why this answer

Option C is correct because a larger initial core node count, combined with managed scaling, reduces the time the job spends waiting for resources to be allocated. Option A (a larger instance type) may shorten processing time but does not address the idle wait. Option B (Spot Instances for all nodes) risks interruptions. Option D (Reserved Instances) suits long-term, steady usage, not a transient cluster.

1507

MCQ · hard

A data engineer is troubleshooting an issue where an IAM role used by AWS Glue cannot read data from an S3 bucket encrypted with SSE-KMS. The bucket policy allows the role to perform s3:GetObject. What additional permission is needed?

A. s3:GetObjectVersion

B. kms:Decrypt on the KMS key

C. s3:GetObjectAcl

D. kms:GenerateDataKey on the KMS key

Answer: B

The role must be able to decrypt the S3 object.

Why this answer

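Option B is correct: to read an object encrypted with SSE-KMS, the IAM role needs kms:Decrypt on the KMS key in addition to s3:GetObject. Option A is wrong because s3:GetObject is already allowed, and s3:GetObjectVersion matters only when reading a specific object version. Option C is wrong because SSE-KMS does not require s3:GetObjectAcl. Option D is wrong because kms:GenerateDataKey is used when writing encrypted objects, not when reading them.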
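As a sketch, the two permissions the Glue role needs for reads could be expressed in an identity policy like the one below. The bucket name, account ID, and key ARN are placeholders, not values from the question.

```python
import json

# Placeholder ARNs for illustration only.
BUCKET_OBJECTS_ARN = "arn:aws:s3:::example-bucket/*"
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

glue_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": BUCKET_OBJECTS_ARN,
        },
        {
            # Without this statement, reads of SSE-KMS objects are denied.
            "Sid": "DecryptWithBucketKey",
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            "Resource": KEY_ARN,
        },
    ],
}

print(json.dumps(glue_role_policy, indent=2))
```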

1508

MCQ · medium

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The on-premises network has a 1 Gbps link to AWS. The transfer must complete within 5 days. Which solution is MOST cost-effective and meets the requirements?

A. Use Amazon S3 Transfer Acceleration to speed up the transfer over the internet.

B. Use AWS DataSync to transfer the data over the existing network link.

C. Use AWS Snowball Edge to physically transfer the data.

D. Use AWS Direct Connect to establish a dedicated network connection.

Answer: C

Snowball Edge is a physical device that can transfer large amounts of data quickly and cost-effectively, especially when network bandwidth is limited.

Why this answer

Option C is correct because AWS Snowball Edge is a physical device that moves large volumes of data without depending on the network link. At 1 Gbps, transferring 50 TB takes about 4.6 days at the theoretical maximum, and actual throughput will be lower because of overhead, so a network transfer leaves almost no margin against the 5-day deadline. Snowball Edge can transfer 50 TB in a few days and is cost-effective for large data volumes. Option A (Amazon S3 Transfer Acceleration) speeds up transfers but is still limited by network bandwidth. Option B (AWS DataSync) is efficient for online transfers but may not meet the 5-day deadline over 1 Gbps. Option D (AWS Direct Connect) would require additional setup and cost, and the transfer would still be limited by link bandwidth.
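The 4.6-day figure can be checked with a short calculation. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); the 80% utilization value is an illustrative assumption, not a measured number.

```python
def transfer_days(size_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link of link_gbps."""
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400                        # seconds per day


print(round(transfer_days(50, 1), 2))       # 4.63 (theoretical maximum)
print(round(transfer_days(50, 1, 0.8), 2))  # 5.79 (assumed 80% utilization)
```

Even a modest loss of effective throughput pushes the network transfer past the 5-day limit.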

1509

Multi-Select · medium

A data engineer is building a batch ETL pipeline using AWS Glue. The source data is in Amazon RDS for MySQL. The pipeline must run daily and process only new and modified records since the last run. The engineer needs to implement change data capture (CDC) efficiently. Which THREE steps should the engineer take? (Choose THREE.)

Select 3 answers

A. Configure the Glue job to use a JDBC connection with a SQL query that reads from the binary log.

B. Ingest the RDS data into Kinesis Data Streams and then use Glue.

C. Set a job bookmark in Glue to track processed records.

D. Enable binary logging (binlog) on the RDS MySQL instance.

E. Use a full table scan with a WHERE clause on a timestamp column.

Answers: A, C, D

Glue can read CDC from binlog.

Why this answer

Options A, C, and D are correct. Enable binary logging on the RDS MySQL instance so that changes are captured (D), read those changes through the Glue job's JDBC connection (A), and use a Glue job bookmark to track which records have already been processed (C). Option B is wrong because Kinesis Data Streams is for real-time streaming, not batch CDC. Option E is wrong because scanning the entire table every day is inefficient.

1510

MCQ · medium

A data engineer needs to allow a Lambda function to read data from an S3 bucket in the same account. The Lambda function's execution role has the required permissions, but access is denied. The S3 bucket has a bucket policy that explicitly denies access to any principal that is not from the organization. What is the most likely issue?

A. The Lambda execution role is not part of the AWS organization.

B. The Lambda function is in a VPC without an S3 VPC endpoint.

C. The S3 bucket is in a different AWS account.

D. The Lambda function does not have kms:Decrypt permission.

Answer: A

The bucket policy explicitly denies access to principals not in the organization, so the Lambda role must be part of the organization.

Why this answer

Option A is correct because the explicit deny in the bucket policy overrides any allow in the Lambda execution role, so a role that is not part of the organization is denied. Option B is wrong because the scenario mentions no VPC or VPC endpoint policy. Option C is wrong because the bucket is in the same account. Option D is wrong because KMS permissions are not relevant to the scenario.
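A bucket policy of the kind described might look like the sketch below, using the `aws:PrincipalOrgID` condition key. The organization ID and bucket name are placeholders, and `is_denied` is a deliberately simplified model of how the single condition evaluates, not a full policy evaluator.

```python
from typing import Optional

ORG_ID = "o-exampleorgid"  # placeholder organization ID

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideOrganization",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*",
            ],
            "Condition": {"StringNotEquals": {"aws:PrincipalOrgID": ORG_ID}},
        }
    ],
}


def is_denied(principal_org_id: Optional[str]) -> bool:
    """Simplified model: a missing or different org ID matches the Deny."""
    condition = bucket_policy["Statement"][0]["Condition"]
    expected = condition["StringNotEquals"]["aws:PrincipalOrgID"]
    return principal_org_id != expected


print(is_denied(None))     # True: principal carries no organization ID
print(is_denied(ORG_ID))   # False: principal belongs to the organization
```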

1511

MCQ · medium

A data engineer needs to ingest JSON files from an on-premises SFTP server into Amazon S3. The files are uploaded daily and each file is up to 500 MB. The solution must be serverless and minimize cost. Which service should the engineer use?

A. Amazon Kinesis Data Firehose.

B. AWS DataSync with an on-premises agent.

C. Amazon S3 Transfer Acceleration.

D. AWS Transfer Family (SFTP) endpoint.

Answer: D

Fully managed SFTP service that writes directly to S3.

Why this answer

Option D is correct because AWS Transfer Family supports SFTP and writes files directly to S3, with no need to manage servers. Option A is wrong because Kinesis Data Firehose is for streaming data, not file-based SFTP transfers. Option B is wrong because AWS DataSync requires an agent on premises. Option C is wrong because S3 Transfer Acceleration speeds up transfers to S3 and does not ingest from SFTP.

1512

MCQ · medium

A company uses Amazon DynamoDB with global tables in three AWS Regions. The data engineer needs to ensure that writes to the table in us-east-1 are replicated to other regions with minimal latency. Which DynamoDB feature should be used?

A. DynamoDB Global Tables

B. DynamoDB Time to Live (TTL)

C. DynamoDB Streams

D. DynamoDB Accelerator (DAX)

Answer: A

Global Tables replicate data across regions automatically.

Why this answer

Option A is correct: DynamoDB Global Tables provide automatic multi-Region replication with low latency. Option B is wrong because TTL is for automatic item expiration. Option C is wrong because Streams capture changes but do not replicate them to other Regions without custom code. Option D is wrong because DAX is a cache, not a replication feature.

1513

MCQ · hard

A data engineer is monitoring an Amazon Kinesis Data Streams application that processes real-time events. The application uses a Kinesis Client Library (KCL) consumer. The engineer notices that the consumer is lagging behind the producer, and the lag is increasing over time. The stream has 10 shards. Which action will MOST effectively reduce the lag?

A. Use multiple KCL workers per shard to increase processing capacity.

B. Increase the number of shards in the Kinesis data stream.

C. Decrease the number of records per shard per second.

D. Decrease the number of shards in the Kinesis data stream.

Answer: B

More shards increase the stream's read and write capacity.

Why this answer

Option B is correct because increasing the number of shards increases the stream's capacity and allows more parallel processing. Option A is wrong because the KCL assigns each shard to a single worker at a time, so adding more workers per shard adds no processing capacity. Option C is wrong because reducing the number of records per shard per second would decrease throughput. Option D is wrong because decreasing the number of shards reduces capacity, worsening lag.

1514

Multi-Select · hard

A company uses Amazon S3 to store sensitive data. The security team requires that all data in transit between on-premises applications and S3 be encrypted. The data engineer must implement a solution that meets this requirement without changing the applications. Which TWO solutions should the engineer consider? (Choose two.)

Select 2 answers

A. Enable S3 Transfer Acceleration on the bucket.

B. Enable default encryption on the S3 bucket.

C. Use server-side encryption with S3 managed keys (SSE-S3).

D. Use an S3 VPC Endpoint and enforce the use of HTTPS through bucket policies.

E. Use AWS Storage Gateway to mount S3 as a file system and configure it to use HTTPS.

Answers: D, E

VPC Endpoint with HTTPS policy ensures encrypted transit.

Why this answer

Option D is correct because an S3 VPC endpoint keeps traffic within the AWS network, and a bucket policy can enforce HTTPS so that unencrypted requests are rejected. Option E is correct because mounting S3 as a file system through AWS Storage Gateway with HTTPS ensures encryption in transit without changing the applications. Option A is incorrect because S3 Transfer Acceleration does not enforce encryption; it supports HTTPS, but its use is optional. Option B is incorrect because default encryption is for data at rest, not in transit. Option C is incorrect because SSE-S3 is also for data at rest.

1515

MCQ · hard

A company has a DynamoDB table with a partition key of 'user_id' and a sort key of 'timestamp'. They need to query all items for a user within a date range. Which query operation should be used?

A. BatchGetItem with multiple keys

B. Query with KeyConditionExpression on partition key and sort key

C. GetItem with both partition and sort key

D. Scan with FilterExpression

Answer: B

Query is efficient for this access pattern.

Why this answer

The Query operation in DynamoDB is designed to retrieve items based on a specific partition key and an optional sort key condition. Since the table has a partition key of 'user_id' and a sort key of 'timestamp', using Query with a KeyConditionExpression that filters on the partition key (user_id) and a range condition on the sort key (timestamp) is the most efficient and correct approach to get all items for a user within a date range.

Exam trap

The trap here is that candidates often confuse BatchGetItem with Query, thinking BatchGetItem can handle range queries, but BatchGetItem only retrieves items by exact primary key values and cannot filter by sort key conditions.

How to eliminate wrong answers

Option A is wrong because BatchGetItem retrieves items by their primary key (partition key and sort key) but does not support range-based filtering on the sort key; it only fetches specific items by exact key values, not a range of timestamps. Option C is wrong because GetItem retrieves a single item by its full primary key (both partition key and sort key), so it cannot return multiple items or filter by a date range. Option D is wrong because Scan reads the entire table and then applies a FilterExpression, which is inefficient and costly for large tables, and it should be avoided when a more targeted Query operation can be used.
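A minimal sketch of the request parameters for such a Query follows, in the shape accepted by the low-level DynamoDB `Query` API. The table name and values are hypothetical, and the sketch assumes timestamps are stored as ISO 8601 strings so that lexical order matches chronological order.

```python
def build_query_params(table: str, user_id: str, start_ts: str, end_ts: str) -> dict:
    """Parameters for a Query on partition key user_id and a sort-key range."""
    return {
        "TableName": table,
        # Equality on the partition key plus a range on the sort key.
        "KeyConditionExpression": "user_id = :uid AND #ts BETWEEN :start AND :end",
        # Placeholder name avoids clashing with reserved words.
        "ExpressionAttributeNames": {"#ts": "timestamp"},
        "ExpressionAttributeValues": {
            ":uid": {"S": user_id},
            ":start": {"S": start_ts},
            ":end": {"S": end_ts},
        },
    }


# Hypothetical table and date range.
params = build_query_params(
    "UserEvents", "user-123", "2024-01-01T00:00:00Z", "2024-01-31T23:59:59Z"
)
print(params["KeyConditionExpression"])
```

Because both conditions are key conditions, DynamoDB reads only the matching items in one partition instead of filtering after a full scan.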

1516

MCQ · hard

A data engineering team is troubleshooting a slow AWS Glue ETL job that reads from an Amazon DynamoDB table and writes to Amazon S3 in Parquet format. The job processes 50 GB of data. Which action would most effectively improve job performance?

A. Use S3 Select to push down filters

B. Reduce the batch size in the DynamoDB connector

C. Increase the number of DPUs

D. Change output to JSON format to reduce overhead

Answer: C

More DPUs increase parallelism and can speed up the job.

Why this answer

Option C is correct: increasing the number of DPUs (Data Processing Units) allocated to the Glue job provides more parallelism and memory, which can significantly speed up processing. Moving to a larger worker type such as G.2X, which has more memory per worker than G.1X, can also help. Option A is wrong because S3 Select filters data stored in S3 and does not apply to a DynamoDB source. Option B is wrong because reducing the batch size could slow down the job. Option D is wrong because JSON output may reduce performance due to larger file size.

1517

MCQ · hard

A data engineer is designing a data ingestion pipeline for IoT sensor data. The data arrives as JSON via AWS IoT Core, and must be stored in Amazon S3 in partitioned Parquet format. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which combination of services should the engineer use?

A. Amazon Kinesis Data Streams with AWS Lambda for transformation and Amazon S3.

B. Amazon Simple Queue Service (SQS) with AWS Lambda for transformation and Amazon S3.

C. AWS Glue streaming jobs consuming from Amazon Kinesis Data Streams and writing to Amazon S3.

D. Amazon Kinesis Data Firehose with data transformation via AWS Lambda, delivering to Amazon S3.

Answer: D

Firehose supports Parquet conversion and partitioning; Lambda handles transformation.

Why this answer

Amazon Kinesis Data Firehose is the correct choice because it can receive streaming data from AWS IoT Core, invoke an AWS Lambda function to transform records, convert JSON to Parquet, and deliver the data to Amazon S3 with partitioning. Its buffering and retry logic help absorb late-arriving data (up to 1 hour). Firehose itself delivers with at-least-once semantics, so the exactly-once requirement is met only in effect, by pairing it with appropriate error handling and idempotent transformations.

Exam trap

The trap here is that candidates often choose Kinesis Data Streams with Lambda (Option A) because they think it offers more control, but they overlook that Firehose provides a managed pipeline with Parquet conversion, partitioned S3 delivery, and built-in buffering and retries, which covers the stated requirements with far less custom code.

How to eliminate wrong answers

Option A is wrong because Amazon Kinesis Data Streams with AWS Lambda requires custom code to manage checkpointing, partitioning, and exactly-once semantics, and does not natively support Parquet conversion or S3 delivery without additional complexity. Option B is wrong because Amazon SQS does not guarantee exactly-once processing (standard queues offer at-least-once, FIFO queues offer exactly-once but lack native streaming integration with IoT Core and Parquet transformation). Option C is wrong because AWS Glue streaming jobs consume from Kinesis Data Streams, not directly from IoT Core, and they do not provide built-in exactly-once delivery to S3; they rely on checkpointing that can lead to duplicates or data loss in failure scenarios.

1518

Multi-Select · hard

A company is using AWS Glue to run ETL jobs that process data from Amazon S3 and load it into Amazon Redshift. The data engineer notices that the Glue job is failing with the error 'S3ServiceException: Access Denied' when writing to the staging S3 bucket. Which THREE actions should the engineer take to resolve this issue?

Select 3 answers

A. Ensure that the Glue job script is correctly referencing the S3 bucket path.

B. Verify that the IAM role used by the Glue job has the s3:PutObject permission for the staging bucket.

C. Ensure that the S3 bucket has a bucket policy that allows the AWS Glue service principal to write objects.

D. Verify that the IAM role has s3:GetObject permission for the source bucket.

E. Check the S3 bucket policy for the staging bucket and ensure it allows the Glue job's IAM role to perform s3:PutObject.

Answers: B, C, E

The role needs write permission to the staging bucket to store temporary files.

Why this answer

Options B, C, and E are correct. The IAM role used by the Glue job must have s3:PutObject on the staging bucket, where the job writes temporary files (B). The staging bucket's policy must also allow that role to perform s3:PutObject (E) and allow the AWS Glue service principal to write objects (C). Option A is wrong because the error is about access to the staging bucket, not the bucket path in the Glue job script. Option D is wrong because the error is about write access, not read access to the source bucket.

1519

MCQ · hard

A data engineer is investigating a failed AWS Glue job. The engineer runs the CLI command shown in the exhibit to retrieve the latest log stream. The output shows storedBytes: 0. What does this indicate?

A. The log stream is from a different Glue job.

B. The log stream is empty because the job is still running.

C. The Glue job failed before writing any log events to CloudWatch.

D. The log stream has been expired and deleted.

Answer: C

No logs were written, indicating early failure or logging misconfiguration.

Why this answer

Option C is correct because storedBytes: 0 means no log events were stored, likely because the job failed before writing any logs or logging was not enabled. Option A is wrong because the command retrieved the latest stream for the job under investigation. Option B is wrong because the job has already failed and is no longer running. Option D is wrong because the stream still exists; its logs did not expire, they were never written.

1520

MCQ · easy

A company wants to ingest streaming data from IoT devices into Amazon S3 using Amazon Kinesis Data Firehose. The data must be transformed from JSON to Parquet format before landing in S3. What is the SIMPLEST way to achieve this?

A. Configure Kinesis Data Firehose with a built-in Parquet converter.

B. Use an AWS Lambda function as a data transformation in Kinesis Data Firehose to convert JSON to Parquet.

C. Use Kinesis Data Firehose to deliver data directly to S3 in JSON format and run a nightly Glue job to convert to Parquet.

D. Use Kinesis Data Analytics to convert the data to Parquet before sending to Firehose.

Answer: B

Lambda can transform the data and convert to Parquet before delivery.

Why this answer

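Option B is correct because Kinesis Data Firehose can use an AWS Lambda function for data transformation, and that function can convert JSON records to Parquet before delivery. Option A is rejected here on the grounds that Firehose needs a transformation step to produce Parquet; Firehose does also offer record format conversion to Parquet driven by an AWS Glue table schema, so this option deserves a careful read. Option C is wrong because Firehose can transform data in flight without a separate Glue job. Option D is wrong because Firehose can deliver to S3 directly, so routing through Kinesis Data Analytics adds an unnecessary stage.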

1521

MCQ · medium

A data engineer is configuring an S3 bucket policy to allow cross-account access for a partner account to read objects. The bucket is encrypted with SSE-KMS using a customer-managed key. What additional configuration is needed to allow the partner account to decrypt the objects?

A. Add a bucket policy that grants the partner account s3:GetObject

B. Create a VPC endpoint for S3 and add it to the bucket policy

C. Update the KMS key policy to grant the partner account kms:Decrypt permission

D. Add a bucket policy that grants s3:GetObject and s3:GetEncryptionConfiguration

Answer: C

The KMS key policy must allow the partner account to use the key for decryption.

Why this answer

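Option C is correct: for cross-account access to objects encrypted with SSE-KMS, the KMS key policy must grant the partner account permission to use the key, and the partner's own IAM policy must in turn allow kms:Decrypt on that key. Option A is wrong because a bucket policy granting s3:GetObject is necessary but not sufficient; it confers no right to use the key. Option B is wrong because the partner account does not need a VPC endpoint. Option D is wrong because reading objects does not require access to the S3 bucket's encryption configuration.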
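A key policy statement granting the partner account decrypt access might look like the sketch below. The account ID is a placeholder; in a key policy, `"Resource": "*"` refers to the key the policy is attached to.

```python
import json

PARTNER_ACCOUNT_ID = "444455556666"  # placeholder partner account

key_policy_statement = {
    "Sid": "AllowPartnerAccountDecrypt",
    "Effect": "Allow",
    # Delegates to the partner account, which then grants its own principals.
    "Principal": {"AWS": f"arn:aws:iam::{PARTNER_ACCOUNT_ID}:root"},
    "Action": "kms:Decrypt",
    "Resource": "*",  # in a key policy, this means the key itself
}

print(json.dumps(key_policy_statement, indent=2))
```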

1522

MCQ · hard

Refer to the exhibit. A data engineer runs the commands shown. What can be determined about the key with ID 1234abcd-12ab-34cd-56ef-1234567890ab?

A. It is pending deletion.

B. It was created in us-west-2.

C. It is a customer managed key.

D. It is an AWS managed key.

Answer: D

KeyManager: AWS means it's AWS managed.

Why this answer

Option D is correct because the KeyManager is "AWS", indicating it is an AWS managed key. Option A is wrong because the key state is Enabled, not PendingDeletion. Option B is wrong because the key was created in us-east-1 (implied by the ARN). Option C is wrong because a customer managed key would show KeyManager as "CUSTOMER".

1523

MCQ · easy

A data engineer is troubleshooting a nightly AWS Glue ETL job that reads from an Amazon RDS for MySQL table and writes to an Amazon S3 bucket in Parquet format. The job runs successfully most days, but occasionally fails with the error 'ERROR: An error occurred while calling o67.pyWriteDynamicFrame. The transaction log for the database is full due to 'LOG_BACKUP'.' What is the MOST likely cause of this error?

A. The MySQL database has reached its maximum number of concurrent connections.

B. The AWS Glue job does not have sufficient permissions to write to the S3 bucket.

C. The AWS Glue job is configured with an incorrect 'writeDynamicFrame' method.

D. The MySQL database transaction log needs to be backed up to free space.

Answer: D

The error 'LOG_BACKUP' indicates that the transaction log is full and requires a backup to truncate it.

Why this answer

The error message 'The transaction log for the database is full due to 'LOG_BACKUP'' indicates that the database's transaction log has reached its maximum size because it has not been backed up and truncated. The wording matches the log-full error raised by Microsoft SQL Server, where a log backup is what allows the log to be truncated; the question applies it to the MySQL source. Either way, this is a database-side issue, not a Glue or permissions problem, so the correct action is to back up the transaction log to release space.

Exam trap

The trap here is that candidates may confuse a database-side resource exhaustion error (transaction log full) with a permissions or configuration issue in AWS Glue, leading them to incorrectly select options related to Glue permissions or method syntax.

How to eliminate wrong answers

Option A is wrong because the error specifically mentions the transaction log being full, not a limit on concurrent connections; a connection limit would produce a 'Too many connections' error. Option B is wrong because insufficient S3 write permissions would result in an access denied or authorization error, not a database transaction log error. Option C is wrong because the 'writeDynamicFrame' method is correctly used in AWS Glue for writing DynamicFrames; an incorrect method would cause a syntax or API error, not a database transaction log issue.

1524

MCQ · hard

A data engineer is troubleshooting a slow AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job processes 10 GB of CSV data. The engineer notices that the job runs with a single DPU and takes longer than expected. Which change would MOST likely improve performance?

A. Replace Redshift with Amazon Redshift Spectrum.

B. Change the input format to Parquet and enable predicate pushdown.

C. Use a JDBC connection to read data directly from S3.

D. Increase the number of DPUs and configure the job to use the S3 list implementation for parallel reads.

Answer: D

More DPUs allow parallel processing, and S3 list implementation improves file discovery.

Why this answer

Option D is correct because increasing DPUs and using the S3 list implementation can parallelize reading. Option A is wrong because Redshift Spectrum queries data in place, whereas this job is an ETL load, not a query. Option B is wrong because Parquet is more efficient, but the stated bottleneck is the single DPU, so parallelism matters more. Option C is wrong because JDBC connections are for databases, not S3.

1525

MCQ · easy

A startup is building a data pipeline to ingest user activity logs from a mobile app. The logs are sent in real-time via HTTP POST requests. The data volume is low (a few hundred requests per second) but can spike to a few thousand during promotions. The team wants to store the logs in Amazon S3 for analysis. They also need to be able to query the data using Amazon Athena with minimal latency. The data must be transformed from JSON to Parquet and partitioned by date. The team is considering using Amazon API Gateway with AWS Lambda to receive the logs and write to S3. However, they are concerned about Lambda cold starts and the complexity of handling spikes. Which alternative solution should they choose?

A.Use Amazon API Gateway with AWS Lambda that sends logs to Amazon SQS, then a separate Lambda reads from SQS and writes to S3

B.Use Amazon Kinesis Data Firehose with an HTTP endpoint as source, enable Parquet conversion, and deliver to S3 with dynamic partitioning

C.Use Amazon Kinesis Data Streams with AWS Lambda to process and write to S3

D.Use Amazon EMR with Spark Streaming to ingest logs from a custom endpoint

AnswerB

Firehose handles ingestion, transformation, and partitioning with automatic scaling.

Why this answer

Option B is correct because Kinesis Data Firehose can receive the logs over HTTP (through API Gateway or directly through the Firehose API), buffers the data automatically, converts it to Parquet, and writes to S3 partitioned by date. This handles spikes without custom code. Option A (Lambda + SQS) adds complexity and still faces cold starts.

Option C (Kinesis Data Streams + Lambda) still requires Lambda and has cold start issues. Option D (EMR) is overkill for this data volume.


1526

MCQmedium

A company uses Amazon S3 to store sensitive data. The data engineer needs to ensure that all data in transit between the S3 bucket and clients is encrypted. Which configuration should the engineer implement?

A.Use Amazon CloudFront to serve the content and enable SSL.

B.Enable default encryption on the S3 bucket using SSE-S3.

C.Create an S3 bucket policy that denies requests where SecureTransport is false.

D.Use SSE-C to encrypt the data with a customer-provided key.

AnswerC

This ensures all requests use HTTPS, encrypting data in transit.

Why this answer

Option C is correct because a bucket policy that denies requests where aws:SecureTransport is false enforces HTTPS for all access. Option A is wrong because CloudFront does not enforce HTTPS by default; it can be configured to, but it does not protect direct requests to the bucket. Option B is wrong because S3 default encryption only encrypts data at rest, not in transit.

Option D is wrong because SSE-C is server-side encryption at rest, not in transit.
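A bucket policy of the kind described in the answer can be sketched as follows. This is a minimal illustration, not a complete security configuration: the bucket name `example-sensitive-bucket` and the helper name are hypothetical, and the resulting JSON would still have to be attached to the bucket (for example, through the S3 console or an SDK call).

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies any request not sent over HTTPS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" for plain-HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = deny_insecure_transport_policy("example-sensitive-bucket")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

An explicit Deny is used because it overrides any Allow granted elsewhere, so no other policy can reopen plain-HTTP access.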


1527

MCQhard

A company runs an Amazon Redshift cluster with 10 RA3 nodes. The data warehouse stores 50 TB of data. The company notices that queries are slow and the cluster's storage utilization is high. The data engineer needs to improve query performance and reduce storage costs without changing the cluster's node count. Which action should the engineer take?

A.Use Redshift Spectrum to offload historical data to Amazon S3 and query it in place.

B.Change the distribution style of large tables to DISTSTYLE ALL.

C.Migrate the cluster to Dense Compute node types.

D.Enable concurrency scaling to handle more concurrent queries.

AnswerA

Spectrum queries data in S3, reducing cluster storage and allowing faster queries on hot data.

Why this answer

Option A is correct because Redshift Spectrum queries data directly in S3 without loading it into the cluster, so historical data can move out of cluster storage while the cluster serves queries on hot data. Option D is wrong because concurrency scaling adds compute for concurrent queries but provides no storage relief.

Option B is wrong because DISTSTYLE ALL copies the full table to every node, which increases storage use for large tables. Option C is wrong because migrating to Dense Compute nodes replaces RA3 managed storage with fixed local storage and does not reduce the amount of data stored.


1528

Multi-Selecthard

A data engineer is designing a data lake on Amazon S3 with sensitive data. The engineer needs to ensure that data at rest is encrypted and that access is logged for compliance. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers

A.Enable S3 Select to filter data at read time.

B.Enable S3 Transfer Acceleration.

C.Enable CloudTrail data events for S3 object-level operations.

D.Enable default encryption on the S3 bucket using SSE-KMS.

E.Enable S3 Block Public Access on the account.

AnswersC, D

CloudTrail data events log read/write operations to objects.

Why this answer

Option C is correct because enabling CloudTrail data events for S3 object-level operations captures detailed logs of all read, write, and delete actions on objects, which is essential for compliance auditing. Option D is correct because enabling default encryption on the S3 bucket using SSE-KMS ensures that all objects stored in the bucket are encrypted at rest with AWS Key Management Service (KMS) keys, providing centralized control and auditability of encryption keys.

Exam trap

The trap here is that candidates often confuse S3 Block Public Access (a security control) with encryption or logging, or they mistakenly think S3 Select or Transfer Acceleration contribute to compliance requirements, when they are unrelated to data-at-rest encryption and access logging.
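The two correct actions map to two small request payloads. The sketch below only builds those payloads as plain dictionaries; the bucket name, the KMS key ARN, and the helper names are hypothetical, and the shapes are assumed to match what the S3 default-encryption and CloudTrail event-selector APIs expect.

```python
def default_encryption_config(kms_key_arn: str) -> dict:
    """Default-encryption rule that applies SSE-KMS to every new object."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                }
            }
        ]
    }


def s3_data_event_selectors(bucket_name: str) -> list:
    """CloudTrail event selectors that log object-level reads and writes."""
    return [
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                # The trailing slash scopes logging to all objects in the bucket.
                {"Type": "AWS::S3::Object", "Values": [f"arn:aws:s3:::{bucket_name}/"]}
            ],
        }
    ]


# Hypothetical identifiers, used only for illustration.
encryption = default_encryption_config(
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
)
selectors = s3_data_event_selectors("example-data-lake")
print(encryption)
print(selectors)
```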


1529

MCQmedium

Refer to the exhibit. A data engineer sees this output from the AWS CLI for a failed Glue job. The job uses 10 workers of Standard type. What is the MOST appropriate action to resolve the OutOfMemoryError?

A.Increase NumberOfWorkers to 20

B.Reduce NumberOfWorkers to 5

C.Change WorkerType to G.1X

D.Increase MaxCapacity to 20

AnswerD

Increasing MaxCapacity allocates more DPUs to the job, increasing the memory available to it.

Why this answer

Option D is correct because the error is a heap space failure, and increasing MaxCapacity raises the number of DPUs allocated to the job, which makes more memory available. Option A is wrong because increasing the number of workers adds parallelism but does not increase the memory of each worker. Option C is wrong because G.1X does not provide more memory per worker than the Standard type.

Option B is wrong because reducing the number of workers concentrates more data on each remaining worker, which exacerbates the issue.


1530

MCQhard

A company uses AWS Lake Formation to manage permissions on a data lake in S3. A data analyst reports that queries using Amazon Athena return zero rows for a table that the analyst has been granted SELECT permission on. The table is registered in Lake Formation and uses a partition projection. What is the most likely cause?

A.The table is not registered as a resource in Lake Formation

B.The analyst does not have DESCRIBE permission on the table

C.The analyst lacks GetObject and ListBucket permissions on the underlying S3 location

D.The table uses server-side encryption with KMS and the analyst lacks kms:Decrypt permission

AnswerC

Lake Formation grants metadata permissions, but S3 permissions are still needed for partition projection.

Why this answer

Option C is correct because partition projection requires explicit S3 permissions to list and read the partition locations, which Lake Formation may not grant automatically. Option B is wrong because the analyst has already been granted SELECT permission on the table. Option D is wrong because encryption settings do not cause zero rows; a missing kms:Decrypt permission would produce an access error.

Option A is wrong because the table is registered in Lake Formation.


1531

MCQmedium

A company needs to ingest data from multiple SaaS applications (e.g., Salesforce, Marketo) into Amazon S3 for analytics. The data sources have different schemas and update frequencies. Which AWS service should be used to build this ingestion pipeline with minimal code?

A.AWS Data Pipeline

B.AWS Glue

C.Amazon Kinesis Data Firehose

D.Amazon AppFlow

AnswerD

AppFlow is designed to ingest data from SaaS applications to S3 with minimal code.

Why this answer

Option D is correct because Amazon AppFlow is purpose-built for ingesting data from SaaS applications such as Salesforce and Marketo into S3 with minimal code. Option A is wrong because AWS Data Pipeline orchestrates data movement between AWS services and is not designed around SaaS connectors. Option B is wrong because AWS Glue also has connectors for some SaaS applications but requires more setup than AppFlow. Option C is wrong because Kinesis Data Firehose is for streaming data, not for batch extraction from SaaS APIs.


1532

MCQeasy

A company uses an Amazon RDS for MySQL DB instance with Multi-AZ deployment. The primary DB instance fails unexpectedly. What happens to the database endpoint?

A.A new endpoint is created for the standby and the application must use the new endpoint.

B.The existing endpoint continues to work and automatically points to the standby DB instance.

C.The database becomes unavailable until the primary is restored from a snapshot.

D.The existing endpoint is deleted and a new endpoint is provided after manual DNS update.

AnswerB

RDS automatically updates the DNS CNAME record to point to the standby instance.

Why this answer

Option B is correct because with Multi-AZ, RDS automatically fails over to the standby in another Availability Zone. RDS updates the CNAME record to point to the standby, so the endpoint remains the same.

Option A is wrong because RDS does not create a new endpoint. Option C is wrong because the standby is promoted to primary automatically; no snapshot restore is needed. Option D is wrong because failover is automatic and typically completes within minutes, with no manual DNS update.


1533

MCQeasy

A company wants to ingest data from multiple SaaS applications into Amazon S3 using a fully managed service that supports schema discovery and transformation. Which AWS service should they use?

A.Amazon Kinesis Data Firehose

B.Amazon AppFlow

C.AWS Glue

D.AWS Data Pipeline

AnswerB

Fully managed service for SaaS data ingestion with schema discovery.

Why this answer

Option B is correct because Amazon AppFlow is a fully managed integration service that supports SaaS sources, schema discovery, and data transformation. Option A (Amazon Kinesis Data Firehose) is for streaming data. Option C (AWS Glue) is for ETL but not for SaaS ingestion directly.

Option D (AWS Data Pipeline) is not fully managed for SaaS.


1534

MCQhard

A company uses Amazon Kinesis Data Firehose to ingest JSON logs from multiple sources into an S3 data lake. The data is then consumed by Amazon Athena for analysis. Recently, some queries have been failing with the error 'HIVE_BAD_DATA: Field xyz's type is an unsupported type'. The firehose delivery stream transforms the data using a Lambda function that converts timestamps to Unix epoch. What is the MOST likely cause of the query failure?

A.Some records contain timestamps that were not converted to epoch, so Athena infers the column as a string.

B.The data is in JSON format instead of Parquet.

C.The S3 partitions are not registered in the Glue Data Catalog.

D.The IAM role for Firehose does not have permission to write to S3.

AnswerA

Inconsistent data types in a column cause Athena to default to string, leading to type mismatch when queried.

Why this answer

Option A is correct because the error indicates that the data in the column does not match the type Athena expects from the table schema. The Lambda function is meant to convert every timestamp to Unix epoch (a number), so records that were not converted leave a mix of numbers and strings in the same column. Option C is wrong because unregistered partitions do not cause this error.

Option D is wrong because missing S3 write permissions would cause a delivery failure, not a query error. Option B is wrong because Athena can query JSON; the format itself is supported.
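One way to confirm this kind of problem is to scan a sample of the delivered records and report which types a field actually takes. The sketch below is a stand-alone illustration using only the standard library; the field name `ts` and the sample records are invented for the example.

```python
import json
from collections import defaultdict


def field_types(json_lines, field):
    """Map each observed type name of `field` to the line numbers where it appears."""
    seen = defaultdict(list)
    for line_no, line in enumerate(json_lines, start=1):
        record = json.loads(line)
        if field in record:
            seen[type(record[field]).__name__].append(line_no)
    return dict(seen)


# Invented sample: the second record still carries an unconverted timestamp.
sample = [
    '{"user": "a", "ts": 1714560000}',
    '{"user": "b", "ts": "2024-05-01T12:00:00Z"}',
    '{"user": "c", "ts": 1714563600}',
]
types = field_types(sample, "ts")
print(types)  # more than one key means the column has mixed types
```

If the result contains more than one type for the field, the transformation is skipping some records and those lines are the ones to inspect.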


1535

MCQmedium

A data engineer runs an AWS Glue ETL job that reads from an S3 bucket containing JSON files. The job fails with an error indicating that some records are malformed. The engineer wants to skip the malformed records and continue processing. Which approach should the engineer take?

A.Pre-process the JSON files to correct the malformed records before Glue reads them.

B.Convert the JSON files to Parquet format and use Glue to read Parquet.

C.Use AWS Glue Schema Registry to reject invalid records.

D.Configure the Glue DynamicFrame to use the `withErrorThreshold` option to skip corrupt records.

AnswerD

Glue can skip malformed records using error thresholds.

Why this answer

Option D is correct because Glue DynamicFrames track malformed records as error records instead of failing on the first one, and an error threshold (the `withErrorThreshold` option named in the answer) controls how many bad records the job tolerates before it aborts. Option A is wrong because modifying the data source is not always possible.

Option B is wrong because converting to a different file format does not fix malformed JSON; the conversion itself would have to parse the bad records. Option C is wrong because AWS Glue Schema Registry validates schemas, not malformed records.
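The idea of skipping bad records up to a threshold can be shown without Glue. The sketch below is plain Python, not the Glue API: the function name and the `max_errors` parameter are hypothetical, and it only illustrates the behavior of tolerating a bounded number of malformed records.

```python
import json


def parse_skipping_malformed(lines, max_errors):
    """Parse JSON lines, setting aside malformed ones until max_errors is exceeded."""
    good, bad = [], []
    for line in lines:
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(line)
            # Abort once the number of bad records passes the threshold.
            if len(bad) > max_errors:
                raise RuntimeError(f"more than {max_errors} malformed records")
    return good, bad


# Invented sample: the second line is missing its closing brace.
lines = ['{"id": 1}', '{"id": 2', '{"id": 3}']
good, bad = parse_skipping_malformed(lines, max_errors=1)
print(good)
print(bad)
```

Keeping the rejected lines, rather than discarding them, makes it possible to inspect or replay them later.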


1536

MCQmedium

Refer to the exhibit. A data engineer configured the lifecycle policy shown. The 'logs/' prefix contains important audit logs. After 365 days, what happens to the objects?

A.Objects are permanently deleted.

B.Objects are transitioned to Glacier Deep Archive.

C.Objects are transitioned to Glacier.

D.Objects are transitioned to Standard-IA.

AnswerA

The expiration action with Days: 365 deletes the objects after 365 days.

Why this answer

Option A is correct because the expiration action deletes objects after 365 days. Option B is wrong because the policy transitions to Glacier, not Glacier Deep Archive. Option D is wrong because the transition to Standard-IA occurs at 30 days, not 365.

Option C is wrong because the transition to Glacier occurs at 90 days, not 365.
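The exhibit is not reproduced here. A lifecycle configuration consistent with the explanation (Standard-IA at 30 days, Glacier at 90 days, expiration at 365 days, scoped to `logs/`) might look like the sketch below; the rule ID and the helper function are hypothetical, and the helper only models the nominal schedule, since S3 applies lifecycle actions asynchronously.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "audit-logs-tiering",  # hypothetical rule name
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}


def state_at_age(config, age_days):
    """Return the nominal storage class (or EXPIRED) for an object of a given age."""
    rule = config["Rules"][0]
    if age_days >= rule["Expiration"]["Days"]:
        return "EXPIRED"
    storage_class = "STANDARD"
    # Apply transitions in order of age; the last one reached wins.
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            storage_class = transition["StorageClass"]
    return storage_class


for age in (10, 30, 100, 365):
    print(age, state_at_age(lifecycle_configuration, age))
```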


1537

Multi-Selecteasy

Which TWO statements about Amazon Redshift data distribution are correct? (Choose two.)

Select 2 answers

A.DISTSTYLE is a distribution style option

B.AUTO distribution always chooses EVEN

C.KEY distribution places rows with the same distribution key on the same slice

D.EVEN distribution distributes rows across slices evenly

E.ALL distribution distributes data across all slices

AnswersC, D

KEY distribution colocates data by key.

Why this answer

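Option C is correct because in Amazon Redshift, KEY distribution places all rows with the same distribution key value on the same slice (compute node segment). This ensures that join operations on the distribution key are collocated, reducing data movement across the network and improving query performance. Option D is correct because EVEN distribution assigns rows to slices in round-robin fashion, regardless of the values in any column, so each slice holds roughly the same number of rows.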

Exam trap

The trap here is confusing distribution styles with distribution options (e.g., DISTSTYLE is a parameter, not a style) and misunderstanding that ALL distribution replicates the entire table to every node, not slices, while AUTO dynamically selects the best style rather than defaulting to EVEN.
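The difference between KEY and EVEN placement can be illustrated with a toy model. This is not Redshift's actual algorithm: `zlib.crc32` stands in for the internal hash function, and the function names and slice count are assumptions made for the example.

```python
import zlib
from collections import Counter


def key_slice(dist_key_value, num_slices):
    """KEY style: the same key value always hashes to the same slice."""
    return zlib.crc32(str(dist_key_value).encode("utf-8")) % num_slices


def even_slices(num_rows, num_slices):
    """EVEN style: rows are assigned round-robin, ignoring column values."""
    return [row_index % num_slices for row_index in range(num_rows)]


# Two rows with the same (invented) customer key land on the same slice.
first = key_slice("cust-42", 4)
second = key_slice("cust-42", 4)
print(first == second)

# Round-robin placement keeps per-slice row counts within one of each other.
counts = Counter(even_slices(10, 4))
print(dict(counts))
```

The toy model also shows the trade-off: KEY placement collocates joins but can skew if a few key values dominate, while EVEN placement balances rows but gives no collocation.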


1538

Multi-Selectmedium

A company's Amazon Redshift cluster is running slowly. The data engineer suspects that table design is the cause. Which TWO design practices can improve query performance? (Choose TWO.)

Select 2 answers

A.Define appropriate sort keys on frequently filtered columns.

B.Use GROUP BY instead of DISTINCT in queries.

C.Define appropriate distribution keys to collocate joins.

D.Increase the number of slices per node by resizing the cluster.

E.Use VARCHAR instead of CHAR for fixed-length strings.

AnswersA, C

Sort keys minimize the number of blocks scanned.

Why this answer

Options A and C are correct. Sort keys let Redshift skip blocks that cannot match a filter, and distribution keys collocate joined rows to reduce data movement. Option E is wrong because VARCHAR is variable-length and offers no benefit for fixed-length strings.

Option B is wrong because choosing GROUP BY over DISTINCT is a query-writing choice, not a table design practice. Option D is wrong because adding slices through a cluster resize is not a table design change.


1539

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for real-time user activity logs. The logs are generated by a web application and must be ingested into Amazon S3 with minimal latency (under 1 minute). The logs also need to be queried in Amazon Athena. The engineer considers using Amazon Kinesis Data Firehose. Which TWO configurations are required to achieve near-real-time delivery to S3? (Choose TWO.)

Select 2 answers

A.Set the BufferIntervalInSeconds to 60 seconds.

B.Enable S3 compression (e.g., GZIP) on the delivery stream.

C.Enable Amazon CloudWatch error logging for the delivery stream.

D.Enable data format conversion to Parquet using AWS Glue.

E.Set the BufferSizeInMBs to 1 MB.

AnswersA, E

Controls how often data is delivered.

Why this answer

Options A and E are correct. Setting 'BufferIntervalInSeconds' to 60 seconds flushes the buffer at least once a minute, and setting 'BufferSizeInMBs' to 1 MB ensures small batches are delivered quickly; Firehose delivers when either threshold is reached first. Option B is wrong because enabling S3 compression does not affect delivery frequency.

Option D is wrong because converting to Parquet is a transformation, not a delivery trigger. Option C is wrong because enabling error logging does not reduce latency.
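The two buffering settings act as an either-or trigger. The sketch below models that rule in plain Python; `should_flush` is a hypothetical helper, not part of any AWS SDK, and the `buffering_hints` dictionary assumes the key names used by the Firehose API for S3 destinations.

```python
# Assumed shape of the buffering settings on an S3 destination.
buffering_hints = {"SizeInMBs": 1, "IntervalInSeconds": 60}


def should_flush(buffered_bytes, seconds_since_last_flush, hints=buffering_hints):
    """Deliver when either the size or the time threshold is reached first."""
    size_reached = buffered_bytes >= hints["SizeInMBs"] * 1024 * 1024
    interval_reached = seconds_since_last_flush >= hints["IntervalInSeconds"]
    return size_reached or interval_reached


print(should_flush(200_000, 60))          # low volume: the interval triggers delivery
print(should_flush(2 * 1024 * 1024, 5))   # spike: the size triggers delivery early
print(should_flush(200_000, 10))          # neither threshold reached yet
```

With both thresholds at their low settings, the interval bounds worst-case latency during quiet periods and the size bounds it during spikes.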


1540

MCQmedium

A company runs a daily batch process that reads data from Amazon S3, transforms it with AWS Glue, and loads it into Amazon Redshift. The process takes 6 hours, but the business requires completion within 4 hours. Which design change would MOST reduce runtime?

A.Increase the number of Glue workers

B.Load data directly from S3 to Redshift using COPY command, then transform in Redshift

C.Use S3 Select to filter data before Glue

D.Switch to columnar storage in Redshift

AnswerB

COPY is highly efficient for bulk loading, and in-database transformation can be faster than Glue.

Why this answer

Option B is correct because Redshift's COPY command loads from S3 in parallel and is optimized for bulk loads, and transforming inside Redshift removes the Glue transformation step from the critical path. Option A is wrong because increasing Glue workers may help but does not address Redshift load speed. Option D is wrong because Redshift already uses columnar storage.

Option C is wrong because S3 Select reduces the data scanned but does not accelerate the Glue job.


1541

MCQhard

A data engineer notices that an Amazon Redshift cluster's storage utilization has grown unexpectedly. The cluster uses automatic compression and has a mix of fact and dimension tables. The engineer runs VACUUM and ANALYZE, but storage does not decrease. Which action is most likely to reduce storage consumption?

A.Perform a DEEP COPY on the largest tables.

B.Run VACUUM with the BOOST option.

C.Run ANALYZE with the FULL keyword on all tables.

D.Modify the sort key on the largest tables to a more selective column.

AnswerA

DEEP COPY recreates the table with optimal compression, reclaiming storage from deleted rows and reorganizing data.

Why this answer

Option A is correct because a deep copy recreates the table with a fresh, optimally sorted and compressed storage layout, reclaiming space that VACUUM alone cannot recover. In Redshift, VACUUM re-sorts rows and reclaims space from deleted rows but does not re-apply compression encodings; a deep copy (for example, using CREATE TABLE AS, or CREATE TABLE LIKE followed by INSERT INTO ... SELECT) physically rewrites the data, eliminating fragmentation and applying the current compression encoding, which can significantly reduce storage consumption when automatic compression has left suboptimal encodings or when historical updates have bloated the table.

Exam trap

The trap here is that candidates confuse VACUUM's space reclamation (which only removes deleted rows) with the need to physically rebuild the table to reapply compression, assuming VACUUM or ANALYZE can fix storage bloat caused by suboptimal encodings.

How to eliminate wrong answers

Option B is wrong because the BOOST option only gives VACUUM additional resources, such as memory and disk space, so that it completes faster; it does the same work as the VACUUM the engineer already ran and does not reapply compression or reclaim space from suboptimal encodings. Option C is wrong because ANALYZE updates table statistics for the query planner but does not modify physical storage or reclaim space; it only refreshes metadata about data distribution and does not affect the actual data blocks. Option D is wrong because modifying the sort key on the largest tables changes the physical order of rows on disk, which can improve query performance but does not directly reduce storage consumption; sort keys affect how data is organized, not the compression ratio or the amount of space used by existing data.


1542

MCQhard

A company uses Amazon S3 to store large datasets for analytics. Each dataset is stored in a separate prefix and consists of thousands of small objects (1-10 KB each). The company notices that listing objects in a prefix takes several seconds, slowing down data processing. Which solution would MOST improve listing performance?

A.Add a lifecycle policy to transition objects to S3 Glacier.

B.Use S3 Select to filter objects during listing.

C.Use S3 Inventory to generate a daily listing of objects.

D.Increase the number of parallel requests by using more prefixes.

AnswerC

S3 Inventory provides a pre-generated list that can be queried quickly.

Why this answer

S3 Inventory provides a scheduled CSV/Parquet file listing all objects in a bucket or prefix, including metadata like size and last modified date. By querying this inventory file instead of issuing real-time ListObject API calls, you avoid the latency of enumerating thousands of small objects, dramatically improving listing performance for analytics workflows.

Exam trap

The trap here is that candidates confuse S3 Select (which filters object content) with filtering object keys during listing, or assume that parallel requests to a single prefix are allowed, when in fact S3 throttles ListObject calls per prefix and parallelism only helps across different prefixes.

How to eliminate wrong answers

Option A is wrong because transitioning objects to S3 Glacier does not improve listing performance; it only changes storage class and adds retrieval latency, while the ListObject API still must enumerate all objects. Option B is wrong because S3 Select is used to filter the content of objects (e.g., SQL queries on CSV/JSON data), not to filter object keys during listing; it cannot accelerate the ListObject operation. Option D is wrong because increasing parallel requests with more prefixes would require redesigning the data layout and does not reduce the time to list a single prefix; the bottleneck is the number of objects in that prefix, not parallelism.


1543

MCQeasy

A company uses AWS Glue ETL jobs to transform data stored in Amazon S3. The job reads data in Parquet format, applies transformations, and writes the output back to S3 in Parquet format. The team wants to improve the job's performance and reduce costs. Which action is MOST effective?

A.Change the input format from Parquet to CSV to simplify parsing.

B.Coalesce the input data into a single large file before processing.

C.Use column pruning and predicate pushdown to read only necessary columns and filter data early.

D.Increase the number of workers to maximum allowed.

AnswerC

Reduces the amount of data processed, improving performance and reducing costs.

Why this answer

Option C is correct: column pruning and predicate pushdown read only the necessary columns and filter early, which reduces the data scanned and the processing time. Option D (increasing the worker count) increases cost and may not be needed.

Option A (switching to CSV) would increase data size and slow performance. Option B (coalescing into a single large file) reduces parallelism and may harm performance.


1544

Multi-Selecteasy

A company must comply with a regulation that requires logging all access to sensitive data stored in Amazon S3. Which AWS services can be used to capture and store access logs? (Choose TWO.)

Select 2 answers

A.AWS Config

B.Amazon CloudWatch Logs

C.AWS CloudTrail

D.Amazon S3 server access logs

E.VPC Flow Logs

AnswersC, D

CloudTrail logs S3 API calls.

Why this answer

Options C and D are correct. AWS CloudTrail logs API calls to S3, and Amazon S3 server access logs provide detailed records of the requests made to a bucket.

Option A is wrong because AWS Config tracks configuration changes, not access. Option B is wrong because Amazon CloudWatch Logs can receive logs but does not generate S3 access logs itself. Option E is wrong because VPC Flow Logs capture network traffic, not S3 access.


1545

MCQhard

A data engineering team needs to store log files for 90 days with immediate access, then archive them for 7 years with infrequent access. Which S3 storage class configuration meets these requirements cost-effectively?

A.Use S3 One Zone-IA for 90 days, then lifecycle to S3 Glacier Deep Archive

B.Use S3 Intelligent-Tiering with lifecycle transition to S3 Glacier Deep Archive after 90 days

C.Use S3 Glacier Instant Retrieval for 90 days, then lifecycle to S3 Glacier Flexible Retrieval

D.Use S3 Standard for 90 days, then lifecycle policy to S3 Glacier Deep Archive

AnswerB

Intelligent-Tiering optimizes costs for unknown patterns, and lifecycle to Deep Archive meets long-term retention.

Why this answer

Option B is correct because S3 Intelligent-Tiering automatically moves objects between frequent and infrequent access tiers, and a lifecycle policy can then transition them to S3 Glacier Deep Archive for long-term retention. Option D is wrong because S3 Standard bills every object at the frequent-access rate for the full 90 days, even when access drops off sooner. Option C is wrong because S3 Glacier Instant Retrieval is not cost-effective for the first 90 days, when the logs need immediate and repeated access.

Option A is wrong because S3 One Zone-IA stores data in a single Availability Zone, which is not recommended for logs that must be retained durably.


1546

MCQhard

Refer to the exhibit. A data engineer runs this AWS Glue job but it fails with an error that the table 'orders' does not exist in the 'sales_db' database. The engineer has verified that the table exists in the AWS Glue Data Catalog. What is the most likely cause of the error?

A.The IAM role used by the Glue job does not have permission to read the Data Catalog

B.The Glue job has job bookmark enabled and is skipping the table

C.The script uses 'create_dynamic_frame.from_catalog' incorrectly

D.The S3 path 's3://data-lake/raw/' does not exist

AnswerA

The job needs glue:GetTable permission to access the table metadata.

Why this answer

Option A is correct because the Glue job needs permission to read the Data Catalog: the IAM role associated with the job must have the 'glue:GetTable' permission, and without it the lookup fails even though the table exists. Option D is wrong because the error is about the table not existing, not about the S3 path.

Option C is wrong because the script is syntactically correct. Option B is wrong because the job bookmark setting does not affect table discovery.
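An IAM policy granting the catalog read access discussed above could be sketched as follows. The region, account ID, and helper name are hypothetical; `sales_db` and `orders` come from the question, and the extra actions beyond `glue:GetTable` are an assumption about what a typical read path needs.

```python
import json


def glue_catalog_read_policy(region, account_id, database, table):
    """Build an IAM policy that lets a job role read one Data Catalog table."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # glue:GetTable is the permission named in the explanation;
                # the other two are assumed to be needed alongside it.
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/{table}",
                ],
            }
        ],
    }


# Hypothetical region and account ID, used only for illustration.
catalog_policy = glue_catalog_read_policy("us-east-1", "111122223333", "sales_db", "orders")
print(json.dumps(catalog_policy, indent=2))
```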


1547

MCQmedium

A data engineer is monitoring an Amazon EMR cluster running a Spark job. The job is processing a large dataset and the engineer notices that the cluster is using a high percentage of disk space on the core nodes. The job fails with 'No space left on device' error. What is the most effective way to resolve this issue without modifying the job logic?

A.Attach additional EBS volumes to the core nodes.

B.Increase the EBS volume size attached to the core nodes.

C.Change the core node instance type to one with more memory.

D.Increase the number of core nodes in the cluster.

AnswerD

More nodes distribute the intermediate data, reducing disk usage per node.

Why this answer

Option D is correct because increasing the number of core nodes distributes the intermediate shuffle data and temporary files across more nodes, reducing the per-node disk usage. This directly addresses the 'No space left on device' error without altering the Spark job logic, as core nodes in EMR store both HDFS data and local shuffle spills.

Exam trap

The trap here is that candidates confuse storage issues with memory or compute issues, and incorrectly choose to increase EBS volume size (Option B) instead of scaling horizontally, which is the most effective way to distribute disk load in a distributed system like EMR.

How to eliminate wrong answers

Option A is wrong because attaching additional EBS volumes does not increase the total available disk space on the core nodes unless they are mounted and configured; EMR automatically uses the root volume for local data, and adding extra volumes requires manual intervention or instance store configuration, which is not a direct fix. Option B is wrong because increasing the EBS volume size on existing core nodes only provides more space on the root device, but the error may stem from ephemeral storage or HDFS usage; moreover, this requires stopping the cluster or modifying the launch configuration, which is less effective than scaling horizontally. Option C is wrong because changing the instance type to one with more memory does not increase disk space; it addresses memory constraints, not the 'No space left on device' error, which is a storage issue.


1548

MCQhard

A company uses Amazon DynamoDB for a gaming leaderboard. The table has a partition key of 'GameId' and a sort key of 'Score'. The application needs to query the top 10 scores for a given game. Which DynamoDB feature should be used for optimal performance?

A.Use a Query operation on the base table with ScanIndexForward set to false.

B.Enable DynamoDB Streams and use a Lambda function to compute the leaderboard.

C.Use DynamoDB Accelerator (DAX) to cache the results of a Scan operation.

D.Create a Global Secondary Index with the same partition key and sort key, then query with ScanIndexForward false.

AnswerD

A GSI optimizes the query pattern for fetching top scores per game.

Why this answer

Option D is correct because a Global Secondary Index with 'GameId' as partition key and 'Score' as sort key supports an efficient Query with ScanIndexForward set to false to return the top scores. Option A is wrong because, although a Query on the base table by 'GameId' in descending order would also work, it is less efficient if the items carry many other attributes; the GSI is dedicated to this access pattern.

Option B is wrong because DynamoDB Streams is for change data capture, not querying. Option C is wrong because DAX is a caching layer, and caching a Scan does not make the underlying access pattern efficient.
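The request for a top-10 query has the same shape whether it targets the base table or an index. The sketch below only builds the parameter dictionary in the style of the low-level DynamoDB Query API; the table name, index name, and helper are hypothetical, and 'GameId' is assumed to be a string attribute.

```python
def top_scores_query(table_name, game_id, limit=10, index_name=None):
    """Build Query parameters that return the highest scores for one game."""
    params = {
        "TableName": table_name,
        "KeyConditionExpression": "GameId = :game_id",
        "ExpressionAttributeValues": {":game_id": {"S": game_id}},
        # False reverses the sort-key order, so the highest Score comes first.
        "ScanIndexForward": False,
        # Stop after the first `limit` items instead of reading the partition.
        "Limit": limit,
    }
    if index_name is not None:
        params["IndexName"] = index_name
    return params


# Hypothetical table and index names, used only for illustration.
base_table_request = top_scores_query("Leaderboard", "game-1")
index_request = top_scores_query("Leaderboard", "game-1", index_name="GameId-Score-index")
print(base_table_request)
print(index_request)
```

Combining descending order with a limit means DynamoDB reads only the ten items it returns, which is what makes this access pattern cheap.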


1549

MCQhard

A data pipeline uses AWS Glue to run ETL jobs that read from and write to an Amazon Redshift cluster. The pipeline recently started failing with the error 'ERROR: cannot execute INSERT in a read-only transaction'. The Glue job's IAM role has the necessary permissions. What could be the cause of this error?

A.The Glue job is using a transaction that was opened in read-only mode.

B.The Redshift cluster is in read-only mode due to maintenance.

C.The Glue connection is configured with 'read-only' set to true.

D.The Glue job's IAM role does not have sufficient Redshift permissions.

AnswerA

If auto_commit=False and the first operation is a SELECT, the session becomes read-only; subsequent INSERT fails.

Why this answer

Option A is correct because Redshift rejects writes inside a transaction that was opened in read-only mode, which can happen when the job connects through a read-only workload or when the connection specifies auto_commit=False and the job tries to write without committing. Option D is wrong because insufficient permissions would cause a different error. Option B is wrong because the Redshift cluster is not in read-only mode.

Option C is wrong because Glue connections do not have a read-only setting.


1550

MCQhard

Refer to the exhibit. A data engineer runs an AWS Glue ETL job that writes output to an S3 bucket. The job fails with the error shown. What is the most likely cause?

A.The IAM role used by the Glue job lacks the s3:PutObject permission for the output bucket

B.The Glue job attempted to write data in an unsupported format

C.The S3 bucket does not exist

D.The output file name contains invalid characters

AnswerA

The error explicitly states the role is not authorized to perform s3:PutObject.

Why this answer

Option A is correct because the error message indicates that the GlueServiceRole does not have s3:PutObject permission on the bucket. Option B is wrong because the Parquet output format is not the issue. Option D is wrong because the error is about permissions, not the file name.

Option C is wrong because the bucket exists (the error shows it in the ARN).

1551

MCQeasy

A company uses AWS Glue to run ETL jobs that process data from an Amazon RDS for MySQL database and load it into an Amazon S3 data lake. The Glue job runs daily and processes incremental data. Recently, the job has been taking longer than expected. The engineer checks the CloudWatch logs and sees that the job is spending most of its time on the 'Reading from JDBC' phase. The MySQL table has 10 million rows and is indexed on the primary key. The Glue job uses a 'job bookmark' to track processed data. The engineer wants to improve the performance of the read phase. Which action is most likely to help?

A.Increase the JDBC 'fetchSize' parameter to 10000.

B.Disable job bookmark and perform a full refresh each time.

C.Increase the number of DPUs for the Glue job.

D.Modify the job to use a 'query' parameter that selects only the new or modified rows based on a timestamp column.

AnswerD

By filtering at the source, less data is read and transferred, speeding up the read phase.

Why this answer

Option D is correct because a 'query' parameter with a WHERE clause that filters on the bookmark key (e.g., a timestamp column) lets Glue read only the incremental data, reducing the amount of data transferred. Option A is wrong because a larger fetch size changes how rows are batched, not how many rows are read, and it may cause memory issues. Option B is wrong because the job bookmark already tracks processed data; disabling it would cause reprocessing. Option C is wrong because more DPUs add parallelism, but the bottleneck may be the database's ability to serve data.
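
The source-side filter described above can be sketched as follows. This is a minimal illustration, not the job's actual code: the endpoint, table name, and `updated_at` column are hypothetical, and the exact option name that carries the SQL depends on the JDBC reader in use, so `query` here simply follows the wording of the question.

```python
from datetime import datetime


def build_incremental_jdbc_options(url: str, table: str, ts_column: str,
                                   last_run: datetime) -> dict:
    """Build JDBC reader options that push the incremental filter to MySQL."""
    # Format the watermark as a MySQL DATETIME literal.
    watermark = last_run.strftime("%Y-%m-%d %H:%M:%S")
    # Only rows changed after the previous run are selected at the source.
    query = f"SELECT * FROM {table} WHERE {ts_column} > '{watermark}'"
    return {"url": url, "query": query}


options = build_incremental_jdbc_options(
    url="jdbc:mysql://example-host:3306/sales",  # hypothetical endpoint
    table="orders",                              # hypothetical table
    ts_column="updated_at",                      # hypothetical column
    last_run=datetime(2024, 1, 15, 0, 0, 0),
)
print(options["query"])
```

Because the WHERE clause runs inside MySQL, an index on the timestamp column (an assumption here; the question only mentions the primary key index) would let the database avoid scanning all 10 million rows.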

1552

MCQeasy

A company uses Amazon QuickSight for data visualization. The data engineer needs to ensure that users can only see data relevant to their department. The data is stored in Amazon S3 and is accessed via SPICE. The engineer has created datasets in QuickSight and wants to implement row-level security (RLS). The dataset contains a column 'Department' that indicates which department a row belongs to. The engineer has configured RLS rules using a separate permissions dataset. However, users report that they can see all rows, not just their department's rows. What is the most likely reason?

A.The RLS permissions dataset is not correctly configured to map users to department values.

B.The 'Department' column is not included in the dataset.

C.The users have been granted admin access to the QuickSight dashboard.

D.The SPICE dataset does not support row-level security.

AnswerA

RLS requires a mapping between users and allowed values.

Why this answer

Option A is correct because QuickSight RLS requires a permissions dataset that maps users to the values in the restricted column. If the RLS rules are not properly set, all rows are visible. Option B is wrong because the column exists. Option C is wrong because RLS is applied at the dataset level, not the dashboard level. Option D is wrong because SPICE supports RLS.
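
The user-to-value mapping can be pictured with a small simulation. This is only a sketch of the intended filtering logic in plain Python, not the QuickSight API; the user names, departments, and amounts are made up, and it does not model how QuickSight treats users who are missing from the permissions dataset.

```python
# Hypothetical permissions dataset: one row per user and allowed department.
permissions = [
    {"UserName": "alice", "Department": "Finance"},
    {"UserName": "bob", "Department": "Sales"},
]

# Hypothetical data rows with the restricted 'Department' column.
rows = [
    {"Department": "Finance", "amount": 100},
    {"Department": "Sales", "amount": 250},
    {"Department": "Finance", "amount": 75},
]


def visible_rows(user: str) -> list:
    """Return only the rows whose Department is mapped to the user."""
    allowed = {p["Department"] for p in permissions if p["UserName"] == user}
    return [r for r in rows if r["Department"] in allowed]


print(visible_rows("alice"))
```

If the mapping rows are wrong (for example, the user name does not match the QuickSight user), the filter has nothing correct to match against, which is the misconfiguration the question points to.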

1553

MCQhard

Refer to the exhibit. An AWS Glue job is failing with 'AccessDenied' when trying to write to the 'data-lake-bucket' which is encrypted with an AWS KMS key. The IAM role used by the Glue job has the attached policy shown. What is the MOST likely cause of the failure?

A.The policy does not include s3:ListBucket permission.

B.The policy does not include s3:GetObject permission.

C.The KMS key ARN in the policy is incorrect.

D.The policy does not include kms:GenerateDataKey or kms:Encrypt permission.

AnswerD

Writing to SSE-KMS encrypted S3 requires GenerateDataKey and Encrypt.

Why this answer

Option D is correct because the policy allows s3:PutObject but does not allow kms:GenerateDataKey or kms:Encrypt, which are needed to write to an SSE-KMS encrypted bucket. Option A is wrong because ListBucket is allowed. Option B is wrong because GetObject is allowed. Option C is wrong because the key ARN is correct.
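
A policy that would cover the missing KMS actions might look like the sketch below. The exhibit is not reproduced here, so the statements, account ID, and key ARN are assumptions for illustration; only the bucket name comes from the question. The small helper just lists which required actions are absent from the Allow statements.

```python
import json

# Hypothetical key ARN used only for illustration.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],
            "Resource": "arn:aws:s3:::data-lake-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["kms:GenerateDataKey", "kms:Encrypt"],
            "Resource": KEY_ARN,
        },
    ],
}


def allowed_actions(doc: dict) -> set:
    """Collect every action named in an Allow statement."""
    actions = set()
    for stmt in doc["Statement"]:
        if stmt["Effect"] == "Allow":
            actions.update(stmt["Action"])
    return actions


missing = {"s3:PutObject", "kms:GenerateDataKey"} - allowed_actions(policy)
print(json.dumps(policy, indent=2))
print("missing:", sorted(missing))
```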

1554

MCQhard

A company uses AWS Lake Formation to manage data lake permissions. A data engineer notices that a user can query tables in Athena even though the user does not have SELECT permission on the table in Lake Formation. What could be the cause?

A.The user is using Redshift Spectrum

B.The user has S3 permissions to read the underlying data

C.The user has an IAM policy that allows Athena access

D.The IAMAllowedPrincipals group has been granted Super permission on the database

AnswerD

The IAMAllowedPrincipals group bypasses Lake Formation permissions and allows IAM users to access tables directly.

Why this answer

Option D is correct because, by default, Lake Formation uses the IAMAllowedPrincipals group, which grants full access to IAM users and roles. If this group is present, Lake Formation permissions are bypassed. Option A is wrong because Redshift Spectrum is a different query engine and does not explain access through Athena. Option B is wrong because S3 permissions alone do not allow Athena queries. Option C is wrong because an IAM policy granting Athena access is not sufficient without Lake Formation permissions.

1555

MCQhard

A data engineer is ingesting data from a third-party API into Amazon S3 using AWS Lambda. The API returns a JSON payload of up to 10 MB per request. The Lambda function runs every minute. Occasionally, the function times out after 15 seconds. What is the most likely cause?

A.The Lambda function is in a VPC without a NAT gateway, causing network timeouts.

B.The Lambda function's timeout setting is too short.

C.The Lambda function's memory is too low, causing slow processing.

D.The API response size exceeds the Lambda invocation payload limit.

AnswerB

Default timeout is 3 seconds; increase to handle larger payloads.

Why this answer

Option B is correct because Lambda has a default timeout of 3 seconds, but can be increased to 15 minutes. If the function times out at 15 seconds, the timeout is set to 15 seconds. Increasing it resolves the issue.

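Option A is wrong because VPC configuration can cause delays but does not specifically explain this timeout. Option C is wrong because memory may need to be increased, but the timeout is the immediate cause. Option D is wrong because the invocation payload limit applies to the event passed to the function, not to an API response that the function downloads.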

1556

MCQmedium

A company uses Amazon RDS for MySQL with Multi-AZ deployment. The primary instance fails, and automatic failover occurs. After failover, the application experiences higher latency. What is the most likely cause?

A.The read replica is now the primary and cannot handle write traffic.

B.The failover process disabled automatic backups.

C.The DNS endpoint did not update to point to the new primary.

D.The new primary instance is in a different Availability Zone, increasing network latency.

AnswerD

Cross-AZ latency can be higher than same-AZ.

Why this answer

Option D is correct because after failover the standby, which runs in a different Availability Zone, becomes the new primary, so application traffic may now cross AZs and see higher latency. Option A is incorrect because a Multi-AZ failover promotes the standby, not a read replica. Option B is incorrect because failover does not disable automatic backups. Option C is incorrect because the DNS record is updated to point to the new primary.

1557

MCQhard

A company uses Redshift for analytics. The security team requires that all queries be logged and that any access to sensitive columns be blocked for non-admin users. Which combination of features should the data engineer implement?

A.Enable Redshift audit logging and create views that expose only non-sensitive columns, granting access to views.

B.Use Redshift row-level security and enable CloudTrail logging.

C.Enable CloudWatch Logs for Redshift and use IAM conditions to block sensitive columns.

D.Enable Redshift audit logging and use IAM policies to restrict column access.

AnswerA

Views can restrict column access, and audit logging captures queries.

Why this answer

Option A is correct because Redshift audit logging captures queries, and column-level access can be enforced using views that expose only non-sensitive columns. Option B is wrong because row-level security restricts rows, not columns. Option C is wrong because CloudWatch Logs do not control access. Option D is wrong because IAM policies do not control column-level access in Redshift.

1558

MCQhard

A company uses Amazon DynamoDB as the primary data store for a real-time application. The data engineer observes that some read requests are returning stale data, even though the application uses strongly consistent reads. The table has auto-scaling enabled with a maximum read capacity of 10,000 RCUs. The observed read traffic averages 8,000 RCUs but occasionally spikes to 12,000 RCUs. What is the most likely cause of the stale reads?

A.Read capacity auto-scaling cannot keep up with sudden traffic spikes, causing throttling and fallback to eventually consistent reads.

B.The application uses write sharding, causing read-after-write inconsistencies.

C.The application is using DynamoDB Accelerator (DAX) which caches data and may return stale values.

D.The table is part of a DynamoDB global table, and the application reads from a replica in a different region.

AnswerA

Throttling can cause fallback to eventual consistency.

Why this answer

Option A is correct because strongly consistent reads can return stale data if the application is throttled due to insufficient read capacity. During spikes above the maximum auto-scaling limit (10,000 RCUs), requests may be throttled, and the SDK may retry with eventually consistent reads, returning stale data. Option B is incorrect because write sharding does not cause stale reads on the same table. Option C is incorrect because DynamoDB Accelerator (DAX) provides eventual consistency by default, but strongly consistent reads would bypass DAX. Option D is incorrect because global tables with eventually consistent reads would not affect a single table.

1559

MCQeasy

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time analytics. Which AWS service is best suited for this purpose?

A.Amazon S3

B.AWS Lambda

C.Amazon RDS

D.Amazon Kinesis Data Streams

AnswerD

It is designed for real-time streaming data ingestion.

Why this answer

Option D is correct because Amazon Kinesis Data Streams is designed for real-time streaming data ingestion from many sources. Option A is wrong because S3 is for object storage, not real-time streaming. Option B is wrong because Lambda processes data but is not a primary ingestion service. Option C is wrong because RDS is a relational database, not a service for streaming ingestion.

1560

MCQhard

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis stream with 10 shards and writes to an S3 bucket. The application is experiencing high latency. Analysis shows that the application is not keeping up with the incoming data rate. Which action would MOST effectively reduce latency?

A.Increase the number of shards in the Kinesis stream

B.Increase the Parallelism of the Flink application

C.Enable exactly-once delivery to S3

D.Use a larger Kinesis Data Analytics application (increase KPU)

AnswerB

Higher parallelism allows more concurrent processing.

Why this answer

Increasing the parallelism of the Flink application allows it to process more data concurrently. This directly addresses the processing bottleneck.

1561

Drag & Dropmedium

Order the steps to set up a Kinesis Data Analytics application for real-time stream processing.

Why this order

First, set up the source stream. Then create the analytics application, configure it with the source and logic, start it, and finally monitor performance.

1562

MCQeasy

A data engineer needs to ingest data from a relational database (MySQL) into Amazon S3 for analytics. The database is 500 GB and the job must run daily with incremental updates. Which AWS service is BEST suited for this task?

A.Amazon EMR with Apache Sqoop.

B.Amazon Kinesis Data Firehose with a database source.

C.AWS Database Migration Service (DMS) with a replication task.

D.AWS Glue ETL job with a JDBC connection.

AnswerC

DMS supports continuous replication and can write to S3.

Why this answer

Option C is correct because AWS Database Migration Service (DMS) supports ongoing replication from MySQL to S3 with change data capture (CDC), enabling incremental updates. Option A is wrong because Amazon EMR is overkill for simple database ingestion. Option B is wrong because Amazon Kinesis is for streaming data, not database snapshots. Option D is wrong because AWS Glue can run batch jobs but does not natively support CDC as seamlessly as DMS.

1563

MCQeasy

A company needs to store relational data that requires complex joins and transactional consistency. The workload is predictable and the data size is less than 500 GB. Which AWS service is MOST cost-effective for this use case?

A.Amazon Redshift

B.Amazon S3

C.Amazon RDS for PostgreSQL

D.Amazon DynamoDB

AnswerC

RDS offers managed relational databases with full SQL support, ideal for transactional workloads up to 500 GB.

Why this answer

Option C is correct because Amazon RDS provides managed relational databases with support for complex joins and transactions. Option A is wrong because Redshift is for analytics and is overkill for 500 GB. Option B is wrong because S3 is not a relational database. Option D is wrong because DynamoDB is NoSQL and does not support complex joins.

1564

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline for real-time clickstream data. The data must be available for both real-time analytics and batch processing. The engineer wants to use Amazon Kinesis Data Streams. Which THREE components should be included in the architecture?

Select 3 answers

A.Amazon Kinesis Data Analytics

B.Amazon Kinesis Data Streams

C.Amazon Kinesis Client Library (KCL) application

D.Amazon Kinesis Data Firehose to deliver data to Amazon S3

E.Amazon SQS as a buffer

AnswersB, C, D

Kinesis Data Streams is the primary ingestion service; a KCL application and Firehose consume from it.

Why this answer

Kinesis Data Streams ingests the data, a Kinesis Client Library (KCL) application consumes it for real-time analytics, and Kinesis Data Firehose delivers it to S3 for batch processing. Kinesis Data Analytics is for real-time SQL, not batch, and SQS is not part of this architecture.

1565

MCQmedium

The exhibit shows the lifecycle configuration for an S3 bucket. Objects in the bucket are 200 days old on average. What will happen to the objects?

A.Objects will be transitioned to GLACIER after 90 days and deleted after 365 days.

B.Objects are in GLACIER now and will be deleted after 365 days from creation.

C.Objects will be transitioned to GLACIER after 200 days and deleted after 365 days.

D.Objects will be deleted after 90 days.

AnswerB

At 200 days, objects have been transitioned; expiration is at 365 days.

Why this answer

The lifecycle configuration shows a current version action to transition to GLACIOR (a misspelling of GLACIER) immediately (0 days after creation) and an expiration action to permanently delete the object 365 days after creation. Since the objects are already 200 days old on average, they have already been transitioned to GLACIER storage class. The expiration rule will delete them 365 days from their creation date, not from the current time.

Exam trap

The trap here is that candidates misinterpret the '0 days' transition as 'no transition' or assume the average age of 200 days means the transition hasn't happened yet, when in fact the lifecycle rules are based on creation date, not current age.

How to eliminate wrong answers

Option A is wrong because the lifecycle rule transitions objects to GLACIER immediately (0 days), not after 90 days. Option C is wrong because the transition occurs at 0 days, not after 200 days. Option D is wrong because the expiration deletes objects after 365 days, not after 90 days.
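
The date arithmetic behind this answer can be checked with a short sketch. It assumes the rule values described above (transition at 0 days, expiration at 365 days) and assumes objects start in STANDARD; the function name and return shape are illustrative only.

```python
def lifecycle_state(age_days: int, transition_days: int, expiration_days: int):
    """Return (storage state, days until expiration) for an object of a given age."""
    if age_days >= expiration_days:
        return ("EXPIRED", 0)
    # Lifecycle actions are measured from the object's creation date.
    state = "GLACIER" if age_days >= transition_days else "STANDARD"
    return (state, expiration_days - age_days)


print(lifecycle_state(200, 0, 365))  # ('GLACIER', 165)
```

A 200-day-old object is therefore already in GLACIER and has 165 days left before the expiration action applies.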

1566

Multi-Selectmedium

A company is using AWS Glue Data Catalog to store metadata about datasets in S3. The data engineer wants to implement a data governance solution that tracks lineage and versioning of datasets. Which TWO AWS services can be used together to achieve this?

Select 2 answers

A.AWS Data Pipeline

B.AWS Lake Formation

C.AWS Glue Data Catalog

D.AWS CloudTrail

E.Amazon S3

AnswersB, C

Provides data lineage and versioning capabilities.

Why this answer

Options B and C are correct. AWS Lake Formation provides data lineage and versioning. AWS Glue Data Catalog stores metadata and can be integrated with Lake Formation. Option A is wrong because Data Pipeline is for data movement. Option D is wrong because CloudTrail logs API calls but not lineage. Option E is wrong because S3 does not provide lineage.

1567

MCQmedium

A company stores sensitive user data in an Amazon RDS for PostgreSQL DB instance. A security audit requires that all data be encrypted at rest. The database is currently unencrypted. What is the MOST operationally efficient way to enable encryption at rest?

A.Create a read replica with encryption enabled and promote it.

B.Take a snapshot of the DB instance, copy it with encryption enabled, and restore the snapshot to a new DB instance.

C.Modify the DB instance and enable encryption in the console.

D.Modify the DB parameter group to include encryption parameters and reboot the instance.

AnswerB

This is the standard method to enable encryption for an existing unencrypted RDS instance.

Why this answer

Option B is correct because RDS for PostgreSQL does not support enabling encryption at rest on an existing unencrypted DB instance directly. The only way to achieve this is by taking a snapshot of the unencrypted instance, creating an encrypted copy of that snapshot, and then restoring it to a new encrypted DB instance. This method is operationally efficient as it uses native RDS snapshot copy and restore capabilities without requiring additional infrastructure or manual data migration.

Exam trap

The trap here is that candidates assume encryption can be toggled on via a simple 'Modify' operation in the console or CLI, but AWS RDS explicitly requires a snapshot copy and restore for existing unencrypted instances, a detail often overlooked in favor of more familiar modification workflows.

How to eliminate wrong answers

Option A is wrong because creating a read replica of an unencrypted source DB instance does not allow enabling encryption on the replica; RDS read replicas inherit the encryption setting of the source, and you cannot enable encryption on a replica if the source is unencrypted. Option C is wrong because the RDS console does not provide a 'Modify' option to enable encryption at rest on an existing unencrypted DB instance; encryption can only be specified at creation time or via snapshot restore. Option D is wrong because modifying the DB parameter group does not affect storage encryption; encryption at rest is a storage-layer feature controlled by the RDS instance configuration, not by PostgreSQL parameters.
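
The snapshot, copy, and restore sequence can be laid out as an ordered plan of boto3 RDS client calls. This is a sketch under assumptions: the instance name, snapshot names, and KMS key alias are hypothetical, and the function only builds the plan rather than calling AWS, so waiting for each snapshot to become available and cutting the application over to the new instance are left out.

```python
def encryption_plan(instance_id: str, kms_key_id: str) -> list:
    """Return the ordered boto3 RDS calls for the snapshot-copy-restore path."""
    snap = f"{instance_id}-unencrypted-snap"
    enc_snap = f"{instance_id}-encrypted-snap"
    return [
        # 1. Snapshot the existing unencrypted instance.
        ("create_db_snapshot",
         {"DBInstanceIdentifier": instance_id, "DBSnapshotIdentifier": snap}),
        # 2. Copy the snapshot; supplying a KMS key encrypts the copy.
        ("copy_db_snapshot",
         {"SourceDBSnapshotIdentifier": snap,
          "TargetDBSnapshotIdentifier": enc_snap,
          "KmsKeyId": kms_key_id}),
        # 3. Restore the encrypted copy to a new DB instance.
        ("restore_db_instance_from_db_snapshot",
         {"DBInstanceIdentifier": f"{instance_id}-encrypted",
          "DBSnapshotIdentifier": enc_snap}),
    ]


for name, kwargs in encryption_plan("userdb", "alias/example-key"):
    print(name, kwargs)
```

With a real client, each step would be invoked as `getattr(rds_client, name)(**kwargs)` once the previous snapshot is available.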

1568

MCQeasy

A company runs a daily batch processing job on Amazon EMR that reads data from Amazon S3 and writes results back to S3. The job takes longer than expected. The engineer wants to monitor the job's resource utilization. Which AWS service should be used to collect and visualize metrics such as CPU and memory usage of the EMR cluster's nodes?

A.AWS Config to record configuration changes in the EMR cluster.

B.Amazon Athena to query EMR job logs stored in S3.

C.Amazon CloudWatch with the CloudWatch Agent installed on the EMR nodes.

D.AWS CloudTrail to log API calls made by the EMR job.

AnswerC

CloudWatch can collect CPU, memory, and disk metrics from EC2 instances (EMR nodes) via the CloudWatch Agent.

Why this answer

Option C is correct because CloudWatch can collect custom metrics from EMR via the CloudWatch agent or the EMR metrics integration. Option A is incorrect because AWS Config tracks configuration changes. Option B is incorrect because Athena is a query service, not a monitoring service. Option D is incorrect because CloudTrail records API calls, not resource utilization.

1569

Multi-Selecteasy

A data engineer needs to transform CSV files in S3 to Parquet format using a serverless solution. The files are large (up to 5 GB each) and arrive irregularly. Which TWO services can accomplish this with minimal operational overhead? (Choose TWO.)

Select 2 answers

A.AWS Glue ETL job

B.AWS Step Functions with Athena CTAS queries

C.Amazon EC2 with a script

D.Amazon EMR cluster

E.Amazon Redshift Spectrum

AnswersA, B

Glue is serverless and can convert large CSV to Parquet efficiently.

Why this answer

Option A (Glue ETL) is serverless and can handle large files. Option B (Step Functions with Athena) can orchestrate Parquet conversion via CTAS. Option C (EC2) requires management. Option D (EMR) is not serverless. Option E (Redshift Spectrum) is for querying.

1570

MCQmedium

A data engineer is designing a data store for a time-series application that requires sub-millisecond read latency for the latest data and high ingestion rates. Which AWS service is most suitable?

A.Amazon DynamoDB

B.Amazon ElastiCache for Redis

C.Amazon RDS for PostgreSQL

D.Amazon Timestream

AnswerD

Timestream is purpose-built for time-series data with fast queries.

Why this answer

Option D is correct because Amazon Timestream is a fast, scalable, serverless time-series database. Option A is wrong because Amazon DynamoDB is a key-value store, not purpose-built for time-series. Option B is wrong because Amazon ElastiCache is a caching layer, not a primary datastore for time-series. Option C is wrong because Amazon RDS is not optimized for time-series workloads.

1571

MCQmedium

A media company ingests large video files from partners via AWS Transfer Family (SFTP) into an S3 bucket. Each file is typically 2-5 GB. Once uploaded, an AWS Lambda function is triggered to transcode the video using Amazon Elastic Transcoder. The Lambda function reads the file from S3, submits a transcoding job to Elastic Transcoder, and writes the output back to a different S3 bucket. Recently, the Lambda function has been failing intermittently with timeouts, and the company reports that some files are not being transcoded. The CloudWatch logs show that the Lambda function is timing out after 15 minutes. The average transcoding job takes about 10 minutes to complete. The data engineer needs to fix the issue without changing the architecture drastically. What should the data engineer do?

A.Increase the Lambda function's reserved concurrency to allow multiple invocations in parallel.

B.Increase the Lambda function timeout to 20 minutes to accommodate longer transcoding jobs.

C.Modify the Lambda function to submit the transcoding job asynchronously and exit, using an SNS topic to trigger a second Lambda function when the job completes.

D.Replace AWS Transfer Family with AWS Database Migration Service to handle file transfers more efficiently.

AnswerC

Decoupling submission from completion avoids timeout.

Why this answer

Option C is correct because the Lambda function should not wait for the transcoding job to complete; it should submit the job and exit. The current design waits synchronously, causing timeouts. Option A is wrong because the issue is not about concurrency. Option B is wrong because Lambda's maximum timeout is 15 minutes, so it cannot be raised to 20 minutes. Option D is wrong because DMS is for database migration, not file processing.
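
A submit-and-exit handler might look like the sketch below. It assumes the Elastic Transcoder pipeline is configured to publish completion notifications to SNS, and the pipeline ID, preset ID, and output prefix are hypothetical. The client is passed in as a third argument purely so the sketch runs without AWS access; a real handler takes only `event` and `context` and would create the boto3 client itself.

```python
PIPELINE_ID = "example-pipeline-id"  # hypothetical
PRESET_ID = "example-preset-id"      # hypothetical


def handler(event, context, transcoder):
    """Submit one transcoding job per uploaded object and return immediately."""
    job_ids = []
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        response = transcoder.create_job(
            PipelineId=PIPELINE_ID,
            Input={"Key": key},
            Outputs=[{"Key": f"transcoded/{key}", "PresetId": PRESET_ID}],
        )
        job_ids.append(response["Job"]["Id"])
    # No polling here: completion is signalled through the pipeline's SNS topic.
    return {"submitted": job_ids}


class FakeTranscoder:
    """Stand-in for the boto3 client so the sketch runs locally."""

    def create_job(self, **kwargs):
        return {"Job": {"Id": "job-" + kwargs["Input"]["Key"]}}


event = {"Records": [{"s3": {"object": {"key": "video.mp4"}}}]}
result = handler(event, None, FakeTranscoder())
print(result)
```

Because the function returns as soon as the job is accepted, its run time no longer depends on how long transcoding takes.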

1572

MCQhard

A company uses Amazon Redshift for its data warehouse. During a routine audit, the data engineer discovers that some queries are returning stale data even though the underlying source data has been updated. The engineer confirms that the COPY command completes successfully and that no errors are reported. Which action should the engineer take to ensure queries reflect the latest data?

A.Run the VACUUM command on the source tables.

B.Clear the Redshift result cache by running RESET ALL.

C.Run the ANALYZE command on the source tables.

D.Refresh the materialized views that the queries are using.

AnswerD

Materialized views must be refreshed to reflect changes in base tables.

Why this answer

Option D is correct because Redshift does not automatically maintain materialized views; they need to be refreshed to reflect changes in the base tables. Option A is wrong because VACUUM reclaims space and sorts tables, but does not refresh materialized views. Option B is wrong because Redshift does not have a cache that needs clearing in this context. Option C is wrong because ANALYZE updates statistics, not the data content.

1573

MCQeasy

A company is migrating an on-premises MongoDB database to Amazon DocumentDB. The data engineer needs to ensure minimal downtime during migration. Which AWS service should be used to facilitate the migration?

A.AWS Glue

B.AWS Snowball

C.AWS Database Migration Service (DMS)

D.Amazon S3 Transfer Acceleration

AnswerC

DMS supports live migration from MongoDB to DocumentDB with minimal downtime.

Why this answer

Option C is correct because AWS DMS supports live migration with minimal downtime. Option A is wrong because AWS Glue is for ETL, not live database migration. Option B is wrong because Snowball is for large data transfers, not continuous replication. Option D is wrong because S3 is for object storage, not database migration.

1574

MCQmedium

A company is ingesting streaming data from IoT devices into Amazon Kinesis Data Streams. The data must be transformed in real-time using custom Python code before being stored in Amazon S3. Which AWS service should be used to perform this transformation?

A.Amazon EMR with Spark Streaming

B.AWS Lambda function triggered by Kinesis Data Streams

C.Kinesis Data Analytics for Apache Flink

D.Kinesis Data Firehose with custom data transformation

AnswerC

Supports custom Flink applications for complex real-time transformations.

Why this answer

Option C is correct because Kinesis Data Analytics for Apache Flink allows running custom Apache Flink applications for real-time data processing. Option A (Amazon EMR) is for large-scale batch processing, not real-time. Option B (AWS Lambda) can be used but requires custom code and is typically used for simple transformations. Option D (Kinesis Data Firehose) does not support custom Python code; it uses built-in transformations.

1575

Multi-Selecthard

A company needs to ingest data from multiple SaaS applications (Salesforce, Marketo) into Amazon S3 for analytics. The data volume is moderate (~100 GB per day). The pipeline must handle schema changes, deduplicate records, and provide low latency (under 1 hour). Which THREE services should be used? (Choose THREE.)

Select 3 answers

A.Amazon AppFlow

B.Amazon EventBridge

C.Amazon Kinesis Data Streams

D.AWS Glue DataBrew

E.AWS Database Migration Service (DMS)

AnswersA, B, D

AppFlow can ingest data from SaaS applications like Salesforce and Marketo.

Why this answer

Option A (AppFlow) connects to SaaS sources. Option B (EventBridge) can trigger Glue jobs on a schedule or on events. Option D (Glue DataBrew) can clean and deduplicate data. Option C (Kinesis) is for streaming, not SaaS. Option E (DMS) is for databases.
