
AWS Certified Data Engineer Associate (DEA-C01): Questions 1276–1350

1786 questions total · 24 pages · All types, answers revealed



1276

MCQ · hard

A company uses AWS Glue to transform data in Amazon S3. The transformation logic is written in Python and references several libraries that are not included in the default Glue environment. Which approach should the data engineer use to make these libraries available?

A.Include a requirements.txt file in the Glue job script and run pip install during job initialization.

B.Package the libraries in an AWS Lambda layer and attach it to the Glue job.

C.Upload the libraries as a .zip file to an S3 bucket and reference them in the Glue job's Python library path.

D.Use the --additional-python-modules parameter in the Glue job.

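Answer: C

Glue Python shell jobs allow adding custom Python modules from S3.

Why this answer

Option C is correct because AWS Glue can load extra Python libraries that are packaged as a .zip file in S3 and referenced in the job's Python library path. Option A is wrong because a Glue job has no initialization step that reads a requirements.txt file bundled with the script. Option B is wrong because Lambda layers apply to Lambda functions and cannot be attached to a Glue job.

Option D is the remaining distractor, but note that --additional-python-modules is a real job parameter on Glue 2.0 and later that installs modules with pip, so it cannot be dismissed as invalid. The answer key expects C, the S3 library path.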
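A minimal sketch of how the S3 library path from option C might be expressed when a job is defined programmatically. The bucket, object names, and helper function are hypothetical; `--extra-py-files` is the Glue job parameter behind the Python library path setting, and the resulting dictionary would typically be passed as the job's default arguments.

```python
def build_glue_default_arguments(library_s3_uris):
    """Return Glue job arguments that add .zip libraries stored in S3.

    library_s3_uris: list of S3 URIs pointing at packaged libraries.
    """
    if not library_s3_uris:
        raise ValueError("at least one library URI is required")
    # Glue expects a single comma-separated string of S3 paths.
    return {"--extra-py-files": ",".join(library_s3_uris)}


# Hypothetical bucket and object names, for illustration only.
args = build_glue_default_arguments([
    "s3://example-libs-bucket/libs/transforms.zip",
    "s3://example-libs-bucket/libs/helpers.zip",
])
print(args["--extra-py-files"])
```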

1277

MCQ · medium

A company wants to ingest streaming data from thousands of IoT devices into AWS for real-time analytics. The data volume is variable and can spike unpredictably. The solution must be serverless and minimize operational overhead. Which AWS service should be used for ingestion?

A.Use Amazon Kinesis Data Firehose to load streaming data directly into Amazon S3.

B.Use Amazon SQS to queue messages and process them in batches.

C.Use Amazon Kinesis Data Streams to ingest and process data in real time.

D.Use AWS IoT Core to ingest data and route it to Amazon DynamoDB.

Answer: C

Amazon Kinesis Data Streams is a serverless streaming data service that can handle variable and high-throughput data from many sources, making it ideal for IoT data ingestion.

Why this answer

Option C is correct because Amazon Kinesis Data Streams ingests high-throughput data from many producers and makes it available to consumers in real time. Option A (Amazon Kinesis Data Firehose) is for loading streaming data into data stores, but it does not provide the same real-time processing capabilities as Kinesis Data Streams. Option B (Amazon SQS) is for message queuing, not real-time streaming analytics.

Option D (AWS IoT Core) is a managed cloud service that lets connected devices easily and securely interact with cloud applications and other devices, but it is not primarily a streaming ingestion service for analytics.

1278

Multi-Select · hard

A company has an S3 bucket with versioning enabled that stores critical data. The security team requires that once an object is deleted, it cannot be recovered by anyone, including the root user. Additionally, the company wants to ensure that objects cannot be overwritten for a specified period. Which THREE actions should the data engineer take to meet these requirements? (Choose THREE.)

Select 3 answers

A.Enable S3 Object Lock in compliance mode.

B.Set a retention period on the bucket using Object Lock.

C.Enable S3 Versioning on the bucket.

D.Enable MFA Delete on the bucket.

E.Configure a lifecycle policy to expire noncurrent versions after 1 day.

Answers: A, B, C

Compliance mode prevents deletion by any user, including root.

Why this answer

Options A, B, and C are correct. S3 Object Lock in compliance mode prevents any user, including the root user, from deleting or overwriting a protected object version; enabling versioning is required for Object Lock; a retention period enforces the protection for a specified time. Option D is wrong because MFA Delete can be bypassed by root; Option E is wrong because lifecycle policies can delete objects, which is not allowed.
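As a rough illustration of options A and B together, the sketch below builds a default-retention rule in compliance mode. The dictionary follows the shape accepted by boto3's S3 `put_object_lock_configuration` call; the helper name and the 30-day period are assumptions chosen for the example.

```python
def object_lock_configuration(retention_days):
    """Build an Object Lock configuration with compliance-mode default retention."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            "DefaultRetention": {
                # COMPLIANCE cannot be shortened or removed by any user.
                "Mode": "COMPLIANCE",
                "Days": retention_days,
            }
        },
    }


# Hypothetical 30-day retention period.
config = object_lock_configuration(30)
print(config["Rule"]["DefaultRetention"])
```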

1279

MCQ · hard

Refer to the exhibit. A data engineer attached this S3 bucket policy to the bucket 'example-bucket'. What is the effect of this policy?

A.It allows all PutObject requests that do not use encryption

B.It denies PutObject requests that do not use SSE-S3

C.It denies all PutObject requests unless they use SSE-KMS

D.It denies all PutObject requests from anonymous users

Answer: B

The condition denies the request if the encryption header is not AES256.

Why this answer

Option B is correct. The policy denies PutObject if the request does not use SSE-S3 (AES256). Option A is wrong because the policy denies, rather than allows, uploads that lack SSE-S3 encryption.

Option C is wrong because the policy does not enforce SSE-KMS. Option D is wrong because the deny depends on the encryption header, not on whether the caller is anonymous.

1280

Multi-Select · easy

A data engineer needs to ingest data from a SaaS application (Salesforce) into Amazon S3 on a daily basis. Which TWO AWS services can be used for this purpose? (Choose TWO.)

Select 2 answers

A.AWS DataSync

B.Amazon Kinesis Data Streams

C.AWS Transfer Family

D.AWS Glue

E.Amazon AppFlow

Answers: D, E

Glue can connect to Salesforce via JDBC and write to S3.

Why this answer

Options D and E are correct. Amazon AppFlow natively integrates with Salesforce and can write to S3. AWS Glue can also connect via JDBC.

Option A is wrong because DataSync is for file/object storage, not SaaS. Option B is wrong because Kinesis Data Streams is for real-time streaming, not batch from SaaS. Option C is wrong because Transfer Family is for file transfers over SFTP, FTPS, and FTP.

1281

MCQ · hard

A company is using AWS Lake Formation to manage permissions on data in Amazon S3. They need to ingest data from an external source into a new database 'sales_db' and a table 'transactions' using AWS Glue. The IAM role used by Glue must have the minimal permissions to create the database and table in the Data Catalog and write data to the S3 location. Which combination of permissions should be granted?

A.IAM policy with `glue:CreateDatabase`, `glue:CreateTable`, and `s3:PutObject`

B.IAM policy with `lakeformation:GrantPermissions` on the database and table

C.IAM policy with `s3:GetObject` and `s3:PutObject` on the target location

D.Lake Formation permissions: `CREATE_DATABASE` on the catalog, `CREATE_TABLE` on `sales_db`, and S3 location permission

Answer: D

Lake Formation controls Data Catalog operations; S3 write is also needed.

Why this answer

Option D is correct because Lake Formation permissions are required to create databases and tables in the Data Catalog. IAM permissions alone are insufficient; the Glue role must have the Lake Formation `CREATE_DATABASE` and `CREATE_TABLE` permissions, plus permission on the S3 data location. Option A lacks Lake Formation permissions.

Option B is too broad. Option C covers only S3 object access and grants nothing for creating the database or table.

1282

MCQ · hard

A healthcare company stores patient records in an S3 bucket encrypted with SSE-S3. The data engineering team uses AWS Glue ETL jobs to process this data and load it into an Amazon Redshift cluster for analytics. Recently, the security team mandated that all sensitive data must be encrypted at rest using customer-managed keys (CMK) in AWS KMS, and that the keys must be rotated automatically every year. The team updated the S3 bucket to use SSE-KMS with a CMK and enabled automatic key rotation. However, after the change, the Glue ETL jobs that read from the S3 bucket started failing with 'Access Denied' errors. The Glue job uses an IAM role named 'GlueETLRole' that has the following permissions: s3:GetObject on the bucket, kms:Decrypt and kms:GenerateDataKey on the CMK, and all necessary Glue permissions. The Redshift cluster is also encrypted with a different CMK, and the Glue role has kms:Decrypt on that key as well. What is the most likely cause of the failure?

A.The KMS key policy for the CMK used for S3 encryption does not grant 'GlueETLRole' permission to use the key.

B.The IAM role 'GlueETLRole' does not have kms:Decrypt permission on the CMK used for S3 encryption.

C.The Glue job requires kms:Encrypt permission to read encrypted data from S3.

D.The S3 VPC endpoint policy does not allow the Glue job to access the KMS key.

Answer: A

The key policy must allow the IAM role to use the key.

Why this answer

Option A is correct. With SSE-KMS, reading an object requires kms:Decrypt on the key that encrypted it. The role's IAM policy already grants kms:Decrypt, but the key policy of the CMK must also allow the role to use the key; an IAM policy alone is not enough unless the key policy permits it.

Option B is wrong because the role already has kms:Decrypt on the CMK. Option C is wrong because Glue does not need kms:Encrypt for reading. Option D is wrong because a VPC endpoint policy could block access, but that is less likely here.
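A sketch of the kind of key policy statement the explanation refers to, assuming the role from the scenario lives in a placeholder account (111122223333). The statement ID and helper name are hypothetical; in a key policy, `"Resource": "*"` refers to the key the policy is attached to.

```python
import json


def key_policy_statement_for_role(role_arn):
    """Build a KMS key policy statement that lets a role use the key."""
    return {
        "Sid": "AllowGlueRoleToUseKey",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        "Resource": "*",
    }


# Placeholder account ID; the role name comes from the question.
statement = key_policy_statement_for_role(
    "arn:aws:iam::111122223333:role/GlueETLRole"
)
print(json.dumps(statement, indent=2))
```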

1283

MCQ · medium

A company needs to automate the detection of sensitive data in Amazon S3 and generate reports. Which AWS service should be used?

A.Amazon Macie

B.Amazon Inspector

C.Amazon GuardDuty

D.AWS Config

Answer: A

Macie discovers sensitive data in S3.

Why this answer

Option A is correct. Amazon Macie uses machine learning to discover and classify sensitive data. Option B is wrong because Inspector is for vulnerability management.

Option C is wrong because GuardDuty is for threat detection. Option D is wrong because Config is for resource compliance.

1284

MCQ · medium

A company uses Amazon S3 to store sensitive customer data. The security team requires that all objects uploaded to a specific bucket be encrypted at rest using AWS KMS with a customer managed key. Which bucket policy statement should be applied to enforce this requirement?

A.Deny s3:PutObject unless s3:x-amz-server-side-encryption is present

B.Allow s3:PutObject only if s3:x-amz-server-side-encryption is present

C.Deny s3:PutObject unless s3:x-amz-server-side-encryption-aws-kms-key-id equals the specific KMS key ARN

D.Deny s3:PutObject unless s3:x-amz-server-side-encryption equals AES256

Answer: C

This condition ensures only the specified KMS key is used for encryption.

Why this answer

Option C is correct because the security team requires encryption at rest using AWS KMS with a customer managed key. The bucket policy must deny any s3:PutObject request that does not include the s3:x-amz-server-side-encryption-aws-kms-key-id condition key set to the specific KMS key ARN. This ensures that only objects encrypted with the designated customer managed key are allowed, enforcing the encryption requirement at the bucket policy level.

Exam trap

The trap here is that candidates often confuse the condition key s3:x-amz-server-side-encryption (which only checks for SSE-S3 or SSE-KMS) with s3:x-amz-server-side-encryption-aws-kms-key-id (which checks for a specific KMS key), leading them to pick Option A or D instead of C.

How to eliminate wrong answers

Option A is wrong because it only checks for the presence of any server-side encryption header (s3:x-amz-server-side-encryption), which could be AES256 (SSE-S3) or aws:kms (SSE-KMS), but does not enforce the use of a customer managed KMS key. Option B is wrong because using an Allow effect with a condition does not override a default implicit deny; to enforce a restriction, you must use an explicit Deny statement. Option D is wrong because it requires the encryption header to equal AES256, which enforces SSE-S3, not SSE-KMS with a customer managed key.

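The statement described in option C could look roughly like the policy built below. This is a sketch under assumptions: the bucket name, key ARN, statement ID, and helper function are placeholders, while the condition key is the one named in the question.

```python
import json


def require_kms_key_policy(bucket, kms_key_arn):
    """Build a bucket policy that denies uploads not encrypted with one KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyPutWithoutRequiredKmsKey",  # hypothetical ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    # Matches when the header is absent or names another key.
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn
                    }
                },
            }
        ],
    }


# Placeholder bucket and key ARN.
policy = require_kms_key_policy(
    "example-bucket",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(json.dumps(policy, indent=2))
```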

1285

MCQ · hard

Refer to the exhibit. A data engineer has configured an S3 event notification to send an event to an SQS queue when objects are created in the 'incoming/' prefix. The engineer wants to trigger an AWS Lambda function to process the object. However, the Lambda function is not being invoked. What is the most likely cause?

A.The Lambda function lacks permission to read from the S3 bucket.

B.Lambda is not configured as an event source for the SQS queue.

C.The SQS queue does not exist or is in a different account.

D.The S3 event notification filter prefix is incorrect.

Answer: B

Lambda must poll the SQS queue to be triggered.

Why this answer

Option B is correct. The event notification sends to SQS, not directly to Lambda. To invoke Lambda, the queue must be configured as an event source for the function so that Lambda polls it.

Option A is wrong because S3 read permission matters only once the function runs; it does not explain why the function is never invoked. Option C is wrong because the queue exists and is configured. Option D is wrong because the prefix is correct.
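Once the queue is an event source, the function receives SQS messages whose bodies contain the S3 notification JSON. The handler below is a minimal sketch of unwrapping that envelope; the sample event is hand-written for illustration, and the function names are assumptions.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event):
    """Return (bucket, key) pairs from S3 notifications delivered through SQS."""
    objects = []
    for message in sqs_event.get("Records", []):
        s3_event = json.loads(message["body"])
        # Test messages from S3 carry no "Records" list, hence .get().
        for record in s3_event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def handler(event, context):
    return extract_s3_objects(event)


# Hand-written sample event, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "body": json.dumps({
                "Records": [
                    {"s3": {"bucket": {"name": "example-bucket"},
                            "object": {"key": "incoming/report+1.csv"}}}
                ]
            })
        }
    ]
}
print(handler(sample_event, None))
```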

1286

MCQ · hard

A company uses AWS DMS to migrate an on-premises Oracle database to Amazon RDS for PostgreSQL. The migration completes, but the target table has more rows than the source. Which is the MOST likely cause?

A.DMS used binary replication which included extra metadata rows.

B.Oracle and PostgreSQL handle case sensitivity differently.

C.DMS performed a full load instead of ongoing replication.

D.Target tables lack unique constraints, causing DMS to insert duplicate rows.

Answer: D

Without unique constraints, DMS may re-apply changes and create duplicates.

Why this answer

Option D is correct. DMS may re-apply changes from the transaction log if the target tables have no unique constraints, leading to duplicate rows. Option A is wrong because binary replication is not relevant.

Option B is wrong because differences in case sensitivity would cause missing rows, not extras. Option C is wrong because a full load by itself does not explain extra rows.

1287

MCQ · easy

A data engineer needs to store a large number of small files (each a few KB) from IoT sensors. The data is written once and never modified. The primary requirement is high write throughput and low latency for writes. Which storage solution is most suitable?

A.Amazon DynamoDB with on-demand capacity

B.Amazon RDS for MySQL with InnoDB

C.Amazon S3 with standard storage class

D.Amazon Elastic Block Store (EBS) volumes

Answer: A

DynamoDB provides single-digit millisecond latency and high throughput for writes.

Why this answer

Amazon DynamoDB with on-demand capacity is the most suitable because it is a NoSQL key-value and document database designed for single-digit millisecond latency at any scale. It supports high write throughput by automatically distributing data across multiple partitions, and on-demand capacity eliminates the need for provisioning, allowing it to absorb unpredictable write spikes from many IoT sensors without throttling.

Exam trap

The trap here is that candidates often choose Amazon S3 for storing small files because of its durability and cost, but they overlook that each object needs its own PUT request, and that per-request latency makes S3 a poor fit for high-frequency, low-latency write workloads, whereas DynamoDB is purpose-built for such patterns.

How to eliminate wrong answers

Option B (Amazon RDS for MySQL with InnoDB) is wrong because relational databases are optimized for complex queries and ACID transactions, not for high-throughput ingestion of many small, immutable writes; they incur overhead from indexing, locking, and transaction logs that limit write throughput. Option C (Amazon S3 with standard storage class) is wrong because S3 is an object store optimized for durability and aggregate throughput; each file is a separate PUT request with a write latency of tens to hundreds of milliseconds, making it unsuitable for low-latency, high-frequency writes of many small files. Option D (Amazon Elastic Block Store volumes) is wrong because EBS provides block-level storage volumes attached to a single EC2 instance, which creates a bottleneck for distributed write workloads and does not natively support the high concurrency needed for thousands of simultaneous sensor writes.


1288

MCQ · easy

A data engineer needs to back up an Amazon DynamoDB table daily. The backup must be restorable to a specific point in time within the last 24 hours. Which solution meets these requirements with the LEAST operational overhead?

A.Create an on-demand backup of the table every 24 hours.

B.Use DynamoDB Streams to replicate data to another table.

C.Enable point-in-time recovery (PITR) on the table.

D.Export the table data to Amazon S3 every 6 hours using a Lambda function.

Answer: C

PITR provides continuous backups with point-in-time restore capability.

Why this answer

Option C is correct because DynamoDB's point-in-time recovery (PITR) provides continuous backups that allow restoration to any point within the recovery window (up to the last 35 days) with no manual scheduling. Option A is incorrect because on-demand backups are manual and not continuous. Option B is incorrect because DynamoDB Streams is for change data capture, not backups.

Option D is incorrect because exporting to S3 is a separate process, not a backup feature.
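Enabling PITR is a single API call, which is why it carries so little operational overhead. The sketch below only builds the request; the table name and helper are hypothetical, and the parameter names follow boto3's DynamoDB `update_continuous_backups` call.

```python
def pitr_request(table_name):
    """Build keyword arguments that enable point-in-time recovery on a table."""
    return {
        "TableName": table_name,
        "PointInTimeRecoverySpecification": {
            "PointInTimeRecoveryEnabled": True,
        },
    }


# Hypothetical table name.
request = pitr_request("example-table")
print(request)
# With credentials configured, this could be sent as:
#   boto3.client("dynamodb").update_continuous_backups(**request)
```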

1289

MCQ · medium

Refer to the exhibit. A data engineer runs this AWS CLI command to execute an Athena query. What is the purpose of the EncryptionConfiguration parameter?

A.It encrypts the query string in transit

B.It encrypts the data in the source table

C.It enables client-side encryption for the query output

D.It encrypts the query results stored in Amazon S3 at rest

Answer: D

The parameter defines encryption for the result set in S3.

Why this answer

The EncryptionConfiguration parameter in Athena specifies how the query results stored in S3 are encrypted at rest. SSE_S3 means server-side encryption with S3-managed keys. It does not encrypt the query itself, data in transit, or the source data.

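The exhibit itself is not reproduced here, so the following is only a sketch of where the parameter sits in a request, written as the keyword arguments for boto3's Athena `start_query_execution` call. The query, database, and output location are placeholder values.

```python
def athena_query_request(query, database, output_location):
    """Build a query request whose results are encrypted at rest with SSE-S3."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": output_location,
            # Applies to the result files written to S3, not the source data.
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
    }


# Placeholder query, database, and bucket.
request = athena_query_request(
    "SELECT 1",
    "example_db",
    "s3://example-results-bucket/athena/",
)
print(request["ResultConfiguration"])
```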

1290

MCQ · medium

A media company stores video files in an S3 bucket. The files are processed by a fleet of EC2 instances that read the files, add watermarks, and write the output back to the same bucket. Recently, the processing jobs have been failing with '500 Internal Server Error' and '503 Slow Down' errors. The data engineer checks the S3 bucket metrics and sees that the PUT/GET request rate is consistently above 5,500 requests per second for a single prefix. The engineer needs to resolve the errors with minimal changes to the application code. Which course of action should the engineer take?

A.Use S3 Batch Operations to process the files.

B.Increase the number of EC2 instances to process files in parallel.

C.Enable S3 Transfer Acceleration on the bucket to improve throughput.

D.Modify the application to add a random hash prefix to the object keys to distribute load across multiple prefixes.

Answer: D

Spreading requests across many prefixes increases the aggregate request rate limit.

Why this answer

Option D is correct. Distributing objects across multiple prefixes (e.g., by adding a hash prefix) increases the aggregate request rate because S3 request limits apply per prefix: at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second. Option A is wrong because S3 Batch Operations is for large-scale batch operations, not for real-time processing.

Option B is wrong because increasing EC2 instances would increase request rate, worsening the issue. Option C is wrong because S3 Transfer Acceleration improves speed over distance but does not increase request rate limits.
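One way to add such a prefix is to derive it from a hash of the original key, so the same object always maps to the same prefix and can be located again. This is a sketch, assuming a four-character hexadecimal prefix and a hypothetical key layout.

```python
import hashlib


def hashed_key(original_key, prefix_length=4):
    """Prepend a short, deterministic hash prefix to an object key."""
    digest = hashlib.sha256(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_length]}/{original_key}"


# Hypothetical object keys; each lands under a hash-derived prefix.
for key in ["videos/clip-0001.mp4", "videos/clip-0002.mp4"]:
    print(hashed_key(key))
```

With four hexadecimal characters there are 65,536 possible prefixes, so requests spread widely; a shorter prefix may be enough for a smaller workload.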

1291

MCQ · easy

A data engineer is troubleshooting an AWS Glue job that reads from an Apache Kafka topic using a Glue connector. The job fails with 'TimeoutException'. The Kafka cluster is in a VPC. Which step should the engineer take FIRST?

A.Check the security group and network ACLs associated with the Glue job's VPC.

B.Increase the Kafka consumer session timeout.

C.Update the Glue connector to the latest version.

D.Change the Glue job type from Spark to Python Shell.

Answer: A

Network configuration is the most common cause of timeouts.

Why this answer

Option A is correct because timeout errors often indicate network connectivity issues; verifying the security group and network ACLs for the Glue job's VPC is the first step. Option B is wrong because a longer consumer session timeout does not help if the job cannot reach the brokers. Option C is wrong because the error is not about the connector library.

Option D is wrong because the job type does not cause timeouts.

1292

MCQ · medium

A data engineer notices that an AWS Glue ETL job that processes streaming data from Amazon Kinesis Data Streams is failing intermittently with a 'ResourceNotFoundException' error for the Kinesis stream. The job has been running successfully for weeks. Which action should the engineer take to resolve the issue?

A.Increase the number of shards in the Kinesis data stream to handle higher throughput.

B.Rename the Kinesis data stream to match the stream name used in the Glue job exactly, including case.

C.Add the 'kinesis:DescribeStream' permission to the IAM role used by the Glue job.

D.Increase the timeout for the Glue job in the job configuration.

Answer: C

Missing DescribeStream permission causes intermittent resource not found errors.

Why this answer

Option C is correct because the most common cause of intermittent 'ResourceNotFoundException' for a Kinesis stream is that the IAM role used by the Glue job does not have the kinesis:DescribeStream permission, which is required for the job to check stream details. Option A is incorrect because increasing the Kinesis shard count would not resolve a permissions issue. Option B is incorrect because the Kinesis stream name must match exactly; case sensitivity would cause a consistent error, not intermittent.

Option D is incorrect because the timeout setting on the Glue job would not cause a resource not found error.


1293

MCQ · hard

A company uses Amazon EMR to process data stored in S3 with server-side encryption using AWS KMS. The EMR cluster fails with a "403 Access Denied" error when reading data from S3. The IAM role for the EMR cluster has s3:GetObject and kms:Decrypt permissions. What is the most likely issue?

A.The EC2 instance profile does not have kms:Decrypt permission

B.The S3 bucket policy denies access to the EMR cluster's IAM role

C.The EMRFS consistent view is not enabled

D.The EMR cluster is using an incorrect KMS key ID

Answer: A

The instance profile must have KMS decrypt permission.

Why this answer

Option A is correct. The EMR EC2 instance profile must have the necessary permissions, because it is the role the cluster nodes use to read from S3. Often the instance profile is missing kms:Decrypt.

Option B is wrong because nothing in the scenario indicates that the bucket policy denies the role. Option C is wrong because EMRFS consistent view is not related to access permissions. Option D is wrong because a read does not specify a key ID; S3 decrypts with the key recorded for the object.

1294

Multi-Select · easy

A company is using AWS Glue to process data stored in Amazon S3. The Glue job runs successfully but takes longer than expected. Which TWO actions can reduce the job runtime?

Select 2 answers

A.Disable job bookmarking

B.Increase the number of DPUs allocated to the job

C.Reduce the number of workers

D.Change the job type from Spark to Python shell

E.Partition the input data in S3

Answers: B, E

More DPUs enable parallel processing, reducing runtime.

Why this answer

Option B is correct because increasing worker capacity speeds up processing. Option E is correct because partitioning data reduces the amount scanned. Option A is wrong because disabling job bookmarking might affect incremental processing but not runtime significantly.

Option C is wrong because reducing worker count would increase runtime. Option D is wrong because the job type (Spark vs Python shell) is not specified as the bottleneck.

1295

MCQ · hard

A company is using Amazon Kinesis Data Streams with a Lambda consumer to process real-time events. The Lambda function is triggered by a DynamoDB stream to update a counter. Recently, the counter has been inaccurate due to duplicate processing. What is the most likely cause?

A.The DynamoDB stream is configured with 'TRIM_HORIZON' iterator type

B.The Lambda function's reserved concurrency is too low

C.The Lambda function is not idempotent and is being retried on failures

D.The Kinesis stream has undergone a shard rebalance

Answer: C

Retries cause duplicate updates if the function is not idempotent.

Why this answer

Option C is correct because Lambda functions may be invoked multiple times for the same record if they fail or time out, leading to duplicates if the operation is not idempotent. Option A is wrong because the TRIM_HORIZON starting position only determines where the function begins reading the stream. Option B is wrong because Lambda concurrency limits would cause throttling, not duplicates.

Option D is wrong because Kinesis shard rebalancing does not cause duplicates.
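The idea of an idempotent update can be sketched by remembering which event IDs have already been applied. The class below is illustrative only: the in-memory set stands in for a durable store, and the event IDs are made up.

```python
class IdempotentCounter:
    """Counter that applies each event at most once, keyed by event ID."""

    def __init__(self):
        self.count = 0
        # In-memory stand-in for a durable record of processed event IDs.
        self._seen_event_ids = set()

    def apply(self, event_id, increment=1):
        """Apply the increment unless this event was already processed."""
        if event_id in self._seen_event_ids:
            return False  # duplicate delivery or retry; ignore it
        self._seen_event_ids.add(event_id)
        self.count += increment
        return True


counter = IdempotentCounter()
# "evt-1" and "evt-2" are delivered twice, as a retry would do.
for event_id in ["evt-1", "evt-2", "evt-1", "evt-3", "evt-2"]:
    counter.apply(event_id)
print(counter.count)  # 3
```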

1296

MCQ · hard

A data engineer runs a weekly AWS Glue ETL job that processes data from Amazon DynamoDB to Amazon S3. The job reads the entire table every time, which is slow and expensive. The job needs to process only items that changed since the last run. Which solution should the engineer implement?

A.Use DynamoDB Scan with a LastEvaluatedKey to paginate and store the last scanned key to resume next time

B.Enable DynamoDB Streams and process change events with AWS Lambda to write to S3

C.Add a Global Secondary Index (GSI) on a timestamp attribute and query only new records

D.Use AWS Database Migration Service (DMS) with ongoing replication from DynamoDB to S3

Answer: B

Streams capture item-level changes, enabling incremental loads.

Why this answer

Option B is correct because DynamoDB Streams captures changes (inserts, updates, deletes) and can be read by AWS Lambda to write to S3, enabling incremental processing. Option A (Scan with LastEvaluatedKey) still scans the entire table. Option C (GSI) does not capture changes automatically.

Option D (DMS) is overkill and not as seamless as DynamoDB Streams.
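A sketch of the Lambda side of that pipeline: turning a batch of stream records into JSON lines that could then be written to S3. The sample event is hand-written and trimmed to the fields used, the output field names are assumptions, and the S3 write itself is left out.

```python
import json


def changes_to_json_lines(stream_event):
    """Convert DynamoDB stream records into one JSON document per line."""
    lines = []
    for record in stream_event.get("Records", []):
        change = {
            "event": record["eventName"],  # INSERT, MODIFY, or REMOVE
            "keys": record["dynamodb"]["Keys"],
            # REMOVE records carry no new image.
            "new_image": record["dynamodb"].get("NewImage"),
        }
        lines.append(json.dumps(change, sort_keys=True))
    return "\n".join(lines)


# Hand-written sample batch with one insert and one delete.
sample_event = {
    "Records": [
        {"eventName": "INSERT",
         "dynamodb": {"Keys": {"id": {"S": "1"}},
                      "NewImage": {"id": {"S": "1"}, "total": {"N": "10"}}}},
        {"eventName": "REMOVE",
         "dynamodb": {"Keys": {"id": {"S": "2"}}}},
    ]
}
print(changes_to_json_lines(sample_event))
```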

1297

MCQ · hard

A company has an S3 bucket policy that allows access to a specific IAM role. However, an administrator notices that requests from that role are being denied. The bucket is encrypted with AES-256. What is the MOST likely reason for the denial?

A.The IAM role does not have s3:Decrypt permission on the bucket.

B.The bucket policy has an explicit deny statement that overrides the allow.

C.The bucket policy allows access, but the VPC endpoint policy denies the action.

D.The S3 Block Public Access settings are blocking the request.

Answer: C

A VPC endpoint policy can restrict actions even if the bucket policy allows them.

Why this answer

Option C is correct because even if the bucket policy allows access, a request that goes through a VPC endpoint is denied when the endpoint policy does not allow the action. Option A is wrong because AES-256 (SSE-S3) encryption does not require additional permissions, and s3:Decrypt is not an S3 action. Option B is wrong because the scenario describes a bucket policy that allows the role and mentions no explicit deny.

Option D is wrong because S3 Block Public Access settings affect only public access, not IAM role access.

1298

MCQ · easy

The exhibit shows a build log from AWS CodeBuild. The build fails with a permission error when trying to open the downloaded file. What is the most likely cause?

A.The S3 bucket policy denies access to the object.

B.The python script is not in the PATH.

C.The downloaded file has restrictive permissions that the python process cannot read.

D.The file is encrypted and cannot be decrypted.

Answer: C

Permission denied suggests file ownership/permissions issue.

Why this answer

Option C is correct. The download succeeded, so a permission error when opening the file points to the file's ownership or mode rather than to S3 access or encryption. CodeBuild runs commands as root by default, but if the buildspec runs the Python script as a different, non-root user, a file downloaded as root with restrictive permissions (for example, mode 600) cannot be read by that process.

1299

MCQ · easy

A data engineer needs to encrypt data in transit between an Amazon RDS for MySQL instance and an application. Which solution should be used?

A.Enable encryption at rest using AWS KMS

B.Use SSL/TLS to connect to the RDS instance

C.Store the data in Amazon S3 with server-side encryption

D.Use AWS CloudHSM to generate and store encryption keys

Answer: B

SSL/TLS encrypts data in transit between client and database.

Why this answer

Option B is correct because SSL/TLS is used to encrypt data in transit between clients and RDS. Option A is wrong because KMS encrypts data at rest, not in transit. Option C is wrong because S3 is not involved in this scenario.

Option D is wrong because CloudHSM provides hardware security modules for key storage, not encryption in transit.


1300

MCQ · medium

A data engineer needs to transform JSON data into Parquet format using AWS Glue. The input data has nested fields. Which Glue feature should be used to flatten the nested structure?

A.Relationalize transform

B.DropNullFields transform

C.FindMatches transform

D.Map transform

Answer: A

Relationalize transforms nested JSON into flat tables.

Why this answer

Option A is correct because AWS Glue's 'Relationalize' transform flattens nested JSON into a relational format. Option B is wrong because 'DropNullFields' removes null fields. Option C is wrong because 'FindMatches' is for deduplication.

Option D is wrong because 'Map' is for simple field mappings.
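To make "flattening" concrete, the plain-Python sketch below collapses nested objects into dotted column names. It is not the Glue API and only illustrates the idea; Relationalize additionally pivots arrays into separate tables, which this example does not attempt. The record is made up.

```python
def flatten(record, parent_key="", separator="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{separator}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far.
            flat.update(flatten(value, full_key, separator))
        else:
            flat[full_key] = value
    return flat


# Made-up nested record.
nested = {"id": 1, "customer": {"name": "Ana", "address": {"city": "Lisbon"}}}
print(flatten(nested))
# {'id': 1, 'customer.name': 'Ana', 'customer.address.city': 'Lisbon'}
```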

1301

MCQ · medium

A data engineering team uses Amazon S3 to store raw data files. They have an AWS Glue ETL job that reads from an S3 bucket, transforms the data, and writes to a Redshift cluster. The job runs daily and has been failing intermittently with the error: 'An error occurred while calling o143.pyWriteDynamicFrame. S3 Access Denied'. The team has confirmed that the IAM role used by the Glue job has s3:GetObject and s3:PutObject permissions on the bucket and all objects. The Redshift cluster is in the same VPC and the Glue connection is configured correctly. What is the most likely cause of the failure?

A.The Redshift cluster is not publicly accessible and the Glue job does not have a VPC endpoint to Redshift.

B.The Glue job has exceeded the maximum execution time and is being killed by AWS.

C.The Glue job is using the wrong JDBC driver version for Redshift.

D.The Glue job's IAM role lacks permission to write to the Glue temporary file bucket (aws-glue-*).

Answer: D

Glue uses a temporary S3 bucket for staging; the role must have s3:PutObject on that bucket.

Why this answer

Option D is correct because Glue jobs use a special S3 bucket for bookkeeping and temporary data. The job's IAM role must have s3:PutObject permission on the bucket used for temporary files, which is often 'aws-glue-*' for the same region. If this permission is missing, the job fails with access denied.

Option A is wrong because the error is an S3 access issue, not a network timeout. Option B is wrong because the error is an S3 access issue, not a Glue job timeout. Option C is wrong because the error is not related to the JDBC driver.

1302

Multi-Select · hard

A company is designing a multi-Region disaster recovery solution for Amazon DynamoDB. They need to ensure that data is replicated across Regions with minimal latency and that applications can read from any Region. Which THREE steps should be taken? (Choose THREE.)

Select 3 answers

A.Configure application to read from any Region using the DynamoDB endpoint.

B.Configure Time to Live (TTL) to automatically expire old data.

C.Enable DynamoDB Global Tables.

D.Enable DynamoDB Streams on the table.

E.Deploy DynamoDB Accelerator (DAX) in each Region.

AnswersA, C, D

Global Tables allow reads from any Region.

Why this answer

Options A, C, and D are correct. Global Tables replicate data across Regions. DynamoDB Streams capture changes for replication.

Applications can read from any Region by using the DynamoDB endpoint in that Region. Option B is wrong because TTL is for expiration, not replication. Option E is wrong because DAX is not required for global replication.

Full explanation →

1303

MCQeasy

A data engineer needs to schedule an AWS Glue ETL job to run every hour and process new data that arrives in an S3 bucket. The job should only process files that have been added since the last run. Which approach should the engineer use to track which files have been processed?

A.Configure S3 Event Notifications to trigger the Glue job on each new object creation.

B.Enable job bookmarks in the Glue ETL job.

C.Store the last processed timestamp in a DynamoDB table and query it at the start of the job.

D.Use S3 Inventory to list all objects and filter by last modified date in the job.

AnswerB

Glue job bookmarks automatically track the state of data processed and only process new data.

Why this answer

Option B is correct because Glue job bookmarks automatically track the last processed files in S3 and only process new data in subsequent runs. Option A is wrong because S3 Event Notifications can trigger the job but do not track processed files; the job would need to handle deduplication. Option C is wrong because a DynamoDB table would require custom code to track state.

Option D is wrong because checking last modified timestamp manually is error-prone and not built-in.
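Job bookmarks are switched on through a job argument. The following is a minimal sketch, assuming a hypothetical helper `bookmark_arguments` that builds the argument map for a Glue job. `--job-bookmark-option` and its three values are Glue job parameters; the helper name and structure are illustrative only.

```python
# Values accepted by Glue's --job-bookmark-option job argument.
BOOKMARK_OPTIONS = {
    "enable": "job-bookmark-enable",
    "disable": "job-bookmark-disable",
    "pause": "job-bookmark-pause",
}


def bookmark_arguments(mode="enable", extra=None):
    """Build a job-argument map with the bookmark option set (hypothetical helper)."""
    if mode not in BOOKMARK_OPTIONS:
        raise ValueError(f"unknown bookmark mode: {mode}")
    args = dict(extra or {})  # copy so the caller's dict is not mutated
    args["--job-bookmark-option"] = BOOKMARK_OPTIONS[mode]
    return args


print(bookmark_arguments())
```

In the job script itself, bookmark state is normally tied to a `transformation_ctx` on each source and saved when the script calls `job.commit()`, so the argument alone is not the whole setup.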

Full explanation →

1304

MCQeasy

A company wants to schedule a nightly batch job to copy data from an on-premises PostgreSQL database to Amazon S3. The solution must minimize operational overhead. Which AWS service should be used?

A.AWS Glue

B.AWS Data Pipeline

C.Amazon EMR

D.AWS Database Migration Service (AWS DMS) with ongoing replication

AnswerA

AWS Glue can run scheduled ETL jobs to read from PostgreSQL and write to S3 with minimal overhead.

Why this answer

AWS Glue can connect to JDBC sources like PostgreSQL using a crawler or ETL job, and write to S3. Data Pipeline and DMS are more complex or intended for replication, not simple scheduled batch copies.

Full explanation →

1305

Multi-Selectmedium

A company runs a data pipeline that ingests clickstream data from a web application into Amazon Kinesis Data Streams. A Lambda function processes records from the stream and writes them to an Amazon S3 bucket in JSON format. The pipeline has been running smoothly, but for the past hour, the Lambda function has been failing with 'Rate exceeded' errors, and the Kinesis stream shows elevated 'IteratorAgeMilliseconds' metrics. The Lambda function has a reserved concurrency of 100, and the Kinesis stream has 10 shards. The average record size is 5 KB, and the data rate is approximately 15 MB per second. Which combination of actions should a data engineer take to resolve the issue and prevent recurrence? (Choose TWO.)

Select 2 answers

A.Increase the Lambda function's reserved concurrency to 200.

B.Increase the number of Kinesis shards to 20.

C.Decrease the Lambda function's batch size from 100 to 50.

D.Enable S3 multipart upload for the Lambda function.

E.Replace the Lambda function with Amazon Kinesis Data Firehose to write directly to S3.

AnswersA, B

More concurrency allows more parallel invocations to process records faster.

Why this answer

The 'Rate exceeded' errors indicate that the Lambda function's concurrency is insufficient to keep up with the incoming data rate from Kinesis. With 10 shards and a 15 MB/s data rate, each shard processes ~1.5 MB/s, and with 5 KB records, that's ~300 records per second per shard. Increasing reserved concurrency to 200 allows more parallel invocations to handle the load, reducing the iterator age.
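The per-shard figures in this explanation can be checked with a few lines of arithmetic. This is a minimal sketch, assuming 1 MB = 1,000 KB (as the explanation does) and the per-shard write limits of 1 MB/s and 1,000 records/s; the function names are illustrative.

```python
import math

SHARD_WRITE_MB_PER_S = 1.0        # per-shard ingest limit in MB/s
SHARD_WRITE_RECORDS_PER_S = 1000  # per-shard ingest limit in records/s


def per_shard_load(total_mb_per_s, record_kb, shards):
    """Return (MB/s, records/s) that each shard receives with even distribution."""
    mb_per_shard = total_mb_per_s / shards
    records_per_shard = mb_per_shard * 1000 / record_kb
    return mb_per_shard, records_per_shard


def min_shards(total_mb_per_s, record_kb):
    """Return the smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(total_mb_per_s / SHARD_WRITE_MB_PER_S)
    total_records = total_mb_per_s * 1000 / record_kb
    by_records = math.ceil(total_records / SHARD_WRITE_RECORDS_PER_S)
    return max(by_bytes, by_records)


print(per_shard_load(15, 5, 10))  # (1.5, 300.0)
print(min_shards(15, 5))          # 15
```

With 10 shards, each shard receives 1.5 MB/s, which is above the 1 MB/s write limit. That is why adding shards matters alongside raising Lambda concurrency.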

Exam trap

The trap here is that candidates often focus only on Lambda concurrency (Option A) and overlook the Kinesis shard count (Option B), not realizing that both the consumer (Lambda) and the stream capacity must be scaled together to resolve throughput bottlenecks.

Full explanation →

1306

MCQhard

A data engineering team uses AWS Glue ETL jobs to process daily data from an Amazon RDS for PostgreSQL instance into Amazon S3. Recently, the jobs have been failing randomly with the error 'psycopg2.OperationalError: could not connect to server: Connection timed out'. The RDS instance is in a private subnet with a security group that allows inbound traffic from the Glue job's security group on port 5432. The Glue job is configured to use the same VPC, subnet, and security group. The RDS instance has sufficient connections and is not at CPU or memory limits. The failures occur at different times each day, and the job works when retried immediately. Which action should the team take to resolve the issue?

A.Set the Glue job timeout to 60 minutes to ensure the job does not fail prematurely.

B.Change the subnet of the Glue job to one with a larger CIDR range (e.g., /20 instead of /24).

C.Add an inbound rule to the RDS security group allowing traffic on ports 1024-65535 from the Glue security group.

D.Increase the number of retries in the Glue job configuration to 5.

AnswerB

A larger subnet provides more available IP addresses, reducing the chance of exhausting the pool for Glue ENIs.

Why this answer

Option B is correct because the timeout suggests network connectivity issues, likely due to Glue's dynamic IP allocation exhausting available IP addresses in the subnet's CIDR range, which causes connection failures when no IP is available. Moving the job to a subnet with a larger CIDR range provides more IP addresses for Glue ENIs. Option A is wrong because the issue is network connectivity, not job timeout; a longer job timeout does not help.

Option C is wrong because the existing security group already allows inbound traffic on port 5432; adding an inbound rule for ephemeral ports is unnecessary for PostgreSQL connections. Option D is wrong because increasing the number of retries does not fix the root cause.
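To see how much headroom a larger CIDR range gives, the address counts can be compared directly. This is a minimal sketch using Python's standard `ipaddress` module; the CIDR blocks are made-up examples, and the count subtracts the five addresses that AWS reserves in every subnet.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_ips(cidr):
    """Return how many addresses remain assignable in an AWS subnet."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


print(usable_ips("10.0.1.0/24"))   # 251
print(usable_ips("10.0.16.0/20"))  # 4091
```

Each Glue worker takes an address from the subnet, and the same subnet may also host other resources, so a /24 can run short during concurrent job runs.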

Full explanation →

1307

MCQeasy

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an Amazon S3 bucket. Recently, the delivery stream has been failing with the error 'S3 bucket does not exist'. The S3 bucket exists and the Firehose IAM role has s3:PutObject permissions. What is the most likely cause?

A.The S3 bucket name is misspelled in the Firehose configuration.

B.The S3 bucket has default encryption enabled.

C.The S3 bucket is in a different AWS Region than the Firehose stream.

D.The IAM role does not have s3:ListBucket permission.

AnswerC

Firehose can only deliver to S3 buckets in the same region.

Why this answer

Option C is correct because Firehose requires the bucket to be in the same Region as the delivery stream. If the bucket is in a different Region, Firehose cannot write to it. Option A is wrong because the bucket exists.

Option B is wrong because encryption is not related to bucket existence errors. Option D is wrong because the error concerns bucket existence, not permissions.

Full explanation →

1308

Multi-Selectmedium

A company uses Amazon DynamoDB for a gaming leaderboard. The table has a primary key of GameId (partition key) and Score (sort key). The application needs to retrieve the top 10 scores for a given game. Which strategies can improve query performance? (Choose TWO.)

Select 2 answers

A.Change the primary key to a single attribute

B.Create a global secondary index with Score as sort key

C.Use a Scan operation with a limit

D.Increase the write capacity units

E.Use DynamoDB Accelerator (DAX) for caching

AnswersB, E

Allows efficient sorted queries.

Why this answer

Option B is correct because creating a global secondary index (GSI) with Score as the sort key allows efficient range queries on scores for a given GameId. DynamoDB can then use the GSI to retrieve the top 10 scores in sorted order without scanning the entire table, leveraging the index's sort key to fetch only the highest values.
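A top-N read of this kind is a Query in descending sort-key order with a limit. This is a minimal sketch that only builds the request parameters in the low-level client format. The table name `Leaderboard`, the string type of `GameId`, and the optional index name are assumptions for illustration.

```python
def top_scores_query(game_id, limit=10, table_name="Leaderboard", index_name=None):
    """Build keyword arguments for a DynamoDB Query that returns the highest scores."""
    params = {
        "TableName": table_name,
        "KeyConditionExpression": "GameId = :g",
        "ExpressionAttributeValues": {":g": {"S": game_id}},
        "ScanIndexForward": False,  # descending sort-key order: highest Score first
        "Limit": limit,
    }
    if index_name is not None:
        params["IndexName"] = index_name  # query a secondary index instead of the base table
    return params


print(top_scores_query("game-42"))
```

The resulting dictionary could be passed to a DynamoDB client as `client.query(**params)`. Because the query reads only one partition in sorted order, it touches at most `limit` items rather than the whole table.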

Exam trap

The trap here is that candidates may think a Scan with a limit is efficient for top-N queries, but DynamoDB Scans always read the entire dataset up to the limit, making them unsuitable for sorted retrieval without additional processing.

Full explanation →

1309

MCQmedium

A company uses Amazon Redshift for data warehousing. The data engineering team notices that queries are slow due to high disk I/O. The team wants to improve query performance without changing the cluster configuration. Which action should the team take?

A.Increase the number of nodes in the cluster.

B.Redesign tables with appropriate sort keys and distribution styles.

C.Run the ANALYZE command to update table statistics.

D.Run the VACUUM command to reclaim disk space.

AnswerB

Proper sort keys and distribution can minimize data scanning and reduce I/O.

Why this answer

Redesigning tables with appropriate sort keys and distribution styles directly addresses high disk I/O by minimizing data scanning and reducing data movement across nodes. Sort keys enable Redshift to skip irrelevant blocks via zone maps, while distribution styles (KEY, ALL, EVEN) optimize data locality for joins and aggregations, reducing I/O without changing cluster configuration.

Exam trap

The trap here is that candidates confuse maintenance commands (ANALYZE, VACUUM) with design changes, or think scaling out (adding nodes) is allowed when the question explicitly forbids changing cluster configuration.

How to eliminate wrong answers

Option A is wrong because increasing the number of nodes changes the cluster configuration, which the question explicitly prohibits. Option C is wrong because ANALYZE updates table statistics for the query planner but does not reduce disk I/O caused by poor data layout or data movement. Option D is wrong because VACUUM reclaims disk space from deleted rows and sorts data, but it does not fundamentally redesign tables to reduce I/O; it only maintains existing design.

Full explanation →

1310

MCQeasy

A company wants to store historical financial data for 7 years with immediate access for the first year and then infrequent access. After 7 years, the data must be automatically deleted. Which S3 lifecycle policy should be configured?

A.Transition to S3 Standard-IA after 30 days, expire after 2555 days

B.Transition to S3 Glacier Flexible Retrieval after 365 days, expire after 2555 days

C.Transition to S3 One Zone-IA after 90 days, expire after 365 days

D.Transition to S3 Glacier Deep Archive after 365 days, expire after 2555 days

AnswerD

This is cost-effective: immediate access for 1 year, then low-cost storage, delete after 7 years.

Why this answer

Option D is correct because it meets all requirements: immediate access for the first year (no transition before 365 days), then transition to S3 Glacier Deep Archive for infrequent access and cost savings, with automatic deletion after 7 years (2555 days). S3 Glacier Deep Archive is the most cost-effective storage class for long-term archival data that is rarely accessed, and the 2555-day expiration ensures compliance with the 7-year retention policy.
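The rule described here can be written down as a lifecycle configuration document. This is a minimal sketch in the shape accepted by the S3 lifecycle API; the rule ID and the empty prefix (apply to every object) are assumptions.

```python
def lifecycle_rule(years_retained=7, transition_after_days=365):
    """Build an S3 lifecycle configuration: archive after one year, expire after N years."""
    return {
        "Rules": [
            {
                "ID": "archive-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},     # empty prefix: applies to all objects
                "Transitions": [
                    {"Days": transition_after_days, "StorageClass": "DEEP_ARCHIVE"}
                ],
                "Expiration": {"Days": years_retained * 365},
            }
        ]
    }


rule = lifecycle_rule()["Rules"][0]
print(rule["Expiration"]["Days"])  # 2555
```

The 2555-day figure is simply 7 × 365, so it ignores leap days. A policy that must cover seven full calendar years would need a slightly larger value.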

Exam trap

The trap here is that candidates often confuse 'immediate access for the first year' with needing a transition to a cheaper tier early, or they choose Option B because its transition and expiration timings also fit, overlooking that S3 Glacier Deep Archive is the lower-cost class for rarely accessed data.

How to eliminate wrong answers

Option A is wrong because transitioning to S3 Standard-IA after 30 days would move data too early, incurring unnecessary retrieval costs during the first year when the data is still accessed; the 2555-day expiration is correct, but the transition timing is not. Option B is wrong because, although it transitions after 365 days and expires after 2555 days, S3 Glacier Flexible Retrieval has a higher storage cost than S3 Glacier Deep Archive, so it is not the most cost-effective choice for rarely accessed data. Option C is wrong because transitioning to S3 One Zone-IA after 90 days is too early and does not provide the resilience needed for financial data (single AZ risk), and the 365-day expiration is far too short for a 7-year retention requirement.

Full explanation →

1311

Multi-Selecthard

A financial services company needs to share sensitive customer data with a third-party analytics firm. The data resides in an S3 bucket encrypted with an AWS KMS customer managed key. The third party has their own AWS account. Which combination of steps is required to securely share the data? (Choose TWO.)

Select 2 answers

A.Share the KMS key material with the third party

B.Update the KMS key policy to include the third-party account as a principal with kms:Decrypt permission

C.Create an IAM role in the third-party account that can be assumed by the data owner

D.Grant the third-party account access to the KMS key management

E.Configure an S3 bucket policy that grants the third-party account access to the objects

AnswersB, E

The key policy must allow the third party to decrypt.

Why this answer

Options B and E are correct. Option E: The S3 bucket policy must grant cross-account access. Option B: The KMS key policy must grant decrypt permission to the third-party account.

Option A is wrong because sharing the KMS key material directly is not secure. Option C is wrong because a role in the third-party account that the data owner assumes grants access in the wrong direction; the third party needs access to the data owner's bucket and key. Option D is wrong because the third party does not need access to KMS key management.
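The two permissions involved can be sketched as policy statements, one for the bucket policy and one for the key policy. This is an illustrative sketch only: the account ID, bucket name, and statement IDs are placeholders, and a real setup would narrow the principal to a specific role where possible.

```python
def cross_account_statements(partner_account_id, bucket):
    """Return (bucket policy statement, KMS key policy statement) for read-only sharing."""
    principal = {"AWS": f"arn:aws:iam::{partner_account_id}:root"}
    bucket_statement = {
        "Sid": "AllowPartnerRead",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": principal,
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }
    key_statement = {
        "Sid": "AllowPartnerDecrypt",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": principal,
        "Action": "kms:Decrypt",
        "Resource": "*",  # in a key policy, "*" refers to the key the policy is attached to
    }
    return bucket_statement, key_statement


bucket_stmt, key_stmt = cross_account_statements("111122223333", "example-shared-data")
print(bucket_stmt["Resource"])
```

Both statements live in the data owner's account. Identities in the third-party account also need IAM policies in their own account that allow the same actions.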

Full explanation →

1312

MCQeasy

A data engineer needs to ensure that all data in an S3 bucket is encrypted at rest. The bucket contains objects uploaded by various applications. What is the simplest method to enforce encryption for all new objects?

A.Enable S3 Block Public Access to block public access to the bucket.

B.Enable default encryption on the S3 bucket using S3-Managed Keys (SSE-S3).

C.Enable S3 Object Lock on the bucket.

D.Configure an S3 bucket policy that denies PutObject requests without the x-amz-server-side-encryption header.

AnswerD

A bucket policy can conditionally deny uploads that lack the required encryption header, enforcing encryption for all new objects.

Why this answer

Option D is correct because S3 bucket policies can enforce encryption by denying PutObject requests that do not include the x-amz-server-side-encryption header. Option A is wrong because enabling S3 Block Public Access does not enforce encryption. Option B is wrong because default encryption applies only if the upload request does not specify encryption headers; it does not enforce encryption for requests that specify 'None'.

Option C is wrong because S3 Object Lock prevents deletion but does not enforce encryption.
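A bucket policy of this kind usually relies on a `Null` condition that matches requests lacking the encryption header. This is a minimal sketch; the bucket name and statement ID are placeholders.

```python
import json


def deny_unencrypted_uploads(bucket):
    """Build a bucket policy that rejects PutObject calls without an SSE header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # "Null": "true" matches when the header is absent from the request.
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            }
        ]
    }


print(json.dumps(deny_unencrypted_uploads("example-bucket"), indent=2))
```

A stricter variant could add a second statement that denies any header value other than the required algorithm, using a `StringNotEquals` condition on the same key.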

Full explanation →

1313

MCQhard

A company has an AWS Glue ETL job that reads from an RDS MySQL instance and writes to S3. The security team requires that the connection to RDS be encrypted and that credentials be rotated automatically. Which configuration should be used?

A.Store the database password in an encrypted parameter in Systems Manager Parameter Store and enable SSL for the connection.

B.Use IAM database authentication for RDS and store credentials in Glue connection properties.

C.Store the password in a text file in an encrypted S3 bucket and use SSL.

D.Store the password in AWS Secrets Manager with automatic rotation enabled and configure Glue to use SSL for the connection.

AnswerD

Secrets Manager supports rotation and Glue can use SSL.

Why this answer

Option D is correct because Secrets Manager provides automatic rotation of RDS credentials, and Glue can use the secret. Option A does not rotate credentials. Option B does not provide encryption for the connection.

Option C is wrong because a text file in S3 is not a service for storing credentials and offers no rotation.

Full explanation →

1314

Multi-Selecthard

A data engineer is troubleshooting an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The job runs successfully but 5% of records are missing after the load. The engineer suspects data consistency issues. Which THREE actions could help diagnose and resolve the problem? (Choose THREE.)

Select 3 answers

A.Use the Redshift COPY command with a manifest file to load data.

B.Increase the number of DPUs for the Glue job.

C.Enable Glue job bookmarks to track processed files.

D.Use a staging table in Redshift with a transaction to commit.

E.Review the job's CloudWatch Logs for any error messages.

AnswersA, C, E

Manifest file ensures all files are loaded.

Why this answer

Option A is correct because using the Redshift COPY command with a manifest file ensures that only the exact files listed in the manifest are loaded, eliminating the risk of partial or duplicate reads from S3. This is a common pattern to guarantee data consistency when the Glue job may not reliably track which files have been processed, especially in scenarios with concurrent writes or retries.

Exam trap

The trap here is that candidates often assume performance tuning (increasing DPUs) or database-level transactions (staging tables) can fix data ingestion gaps, when the actual problem is incomplete or inconsistent file discovery from the source (S3).

Full explanation →

1315

MCQhard

A company uses Amazon S3 to store large datasets. The data engineering team needs to provide access to specific objects in the bucket to external partners using presigned URLs. Each URL should expire after 12 hours. The team wants to ensure that the presigned URLs cannot be used to access other objects in the bucket. Which approach should be taken?

A.Create an IAM role for each partner and attach a policy that grants access to specific objects.

B.Generate presigned URLs using the AWS SDK, specifying the exact object key and expiration time.

C.Use a bucket policy that allows access only from the partner's IP address range.

D.Use CloudFront signed URLs with a custom policy that restricts access to specific objects.

AnswerB

Presigned URLs grant access only to the specified object and expire after the set time.

Why this answer

Option B is correct because a presigned URL generated for a specific object key and expiration time limits access to that object only. Option A (IAM role) is for internal use. Option C (bucket policy with IP restriction) is not per-object.

Option D (CloudFront signed URLs) also works but is more complex and may not be necessary.

Full explanation →

1316

MCQhard

Refer to the exhibit. A CloudFormation template is used to create a DynamoDB table. After creation, a data engineer wants to restore the table to a point in time from 3 hours ago. Which action is required?

A.Create a manual backup of the table first.

B.Enable AWS Backup to schedule automatic backups.

C.Ensure the table has at least one on-demand backup.

D.Use the AWS CLI or Console to initiate a point-in-time restore specifying the desired timestamp.

AnswerD

PITR is enabled, so restore is straightforward.

Why this answer

Option D is correct because point-in-time recovery (PITR) is enabled in the template, allowing restores to any time within the recovery window. Option A is wrong because backup is already enabled via PITR, so no additional backup is needed. Option C is wrong because PITR does not require a backup to exist; it uses continuous backups.

Option B is wrong because AWS Backup is not required; DynamoDB PITR is sufficient.
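A point-in-time restore request needs a source table, a new target table name, and a timestamp. This is a minimal sketch that builds those parameters for a restore from a given number of hours ago; the table names are placeholders and the helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def pitr_restore_request(source_table, target_table, hours_ago, now=None):
    """Build parameters for a DynamoDB point-in-time restore (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return {
        "SourceTableName": source_table,
        "TargetTableName": target_table,  # PITR always restores into a new table
        "RestoreDateTime": now - timedelta(hours=hours_ago),
    }


print(pitr_restore_request("Orders", "Orders-restored", 3))
```

These keys match the parameters of the DynamoDB `RestoreTableToPointInTime` operation, so the dictionary could be passed to a client call such as `client.restore_table_to_point_in_time(**request)`.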

Full explanation →

1317

MCQeasy

A data engineer needs to store JSON documents that are accessed by a serverless application using AWS Lambda. The documents are frequently updated and need low latency (single-digit milliseconds) for read and write operations. Which AWS service should the engineer use?

A.Amazon DynamoDB

B.Amazon ElastiCache for Redis

C.Amazon S3 (with S3 Select)

D.Amazon RDS for MySQL

AnswerA

DynamoDB offers single-digit millisecond latency for reads and writes and supports JSON documents natively.

Why this answer

Amazon DynamoDB is a fully managed NoSQL key-value and document database that provides single-digit millisecond latency for read and write operations at any scale. It natively supports JSON documents, integrates directly with AWS Lambda via the AWS SDK, and handles frequent updates efficiently through its auto-scaling and on-demand capacity modes, making it ideal for serverless applications requiring low-latency data access.

Exam trap

The trap here is that candidates often confuse ElastiCache for Redis as a primary data store due to its low latency, overlooking that it is an in-memory cache with no built-in persistence guarantees, whereas DynamoDB provides both low latency and durable, persistent storage for JSON documents.

How to eliminate wrong answers

Option B is wrong because Amazon ElastiCache for Redis is an in-memory cache, not a durable data store. It offers sub-millisecond latency, but it is typically used for caching or session management and needs a separate persistent database to avoid data loss on node failure, so it is unsuitable as the primary store for frequently updated JSON documents that must persist.

Option C is wrong because Amazon S3 is an object storage service with higher latency (typically tens to hundreds of milliseconds) for read operations, and S3 Select is a server-side filtering feature that does not reduce latency for individual document reads or writes. S3 is not designed for frequent, low-latency updates.

Option D is wrong because Amazon RDS for MySQL is a relational database that requires schema definition and does not treat JSON as a first-class document model. It supports a JSON data type, but it lacks the flexible schema and single-digit millisecond read/write performance of DynamoDB for key-value access patterns, and it incurs higher operational overhead for scaling and connection management in a serverless architecture.

Full explanation →

1318

Multi-Selecteasy

A data engineer is designing a data ingestion pipeline for streaming social media data. The data must be ingested with low latency (seconds) and stored in Amazon S3 for long-term analytics. The engineer also needs to perform real-time aggregations. Which TWO services should the engineer use? (Choose two.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.AWS Glue ETL

C.Amazon S3

D.Amazon Kinesis Data Analytics

E.Amazon Kinesis Data Streams

AnswersD, E

Performs real-time analytics on streaming data.

Why this answer

Options D and E are correct. Kinesis Data Streams provides low-latency ingestion, and Kinesis Data Analytics performs real-time aggregations. Option A is wrong because Kinesis Data Firehose has higher latency (minutes).

Option B is wrong because Glue ETL is batch-oriented. Option C is wrong because S3 is storage, not stream processing.

Full explanation →

1319

MCQmedium

A data engineer is using AWS Glue ETL to transform a large dataset in S3. The job processes 2 TB of data daily and currently runs for 6 hours. The engineer wants to reduce runtime without changing the transformation logic. What is the best approach?

A.Reduce the number of DPUs to minimize overhead.

B.Use the Spark UI to analyze bottlenecks and rewrite code.

C.Increase the number of Glue DPUs or enable auto-scaling.

D.Switch from Spark to Python shell.

AnswerC

More DPUs provide parallel processing and reduce runtime.

Why this answer

Option C is correct because increasing DPUs or enabling auto-scaling adds parallel processing capacity without changing the transformation logic. Option A is wrong because reducing DPUs would slow the job.

Option B is wrong because the Spark UI is for monitoring, and rewriting code changes the transformation logic. Option D is wrong because using a different engine may not be compatible.

Full explanation →

1320

MCQhard

A company uses AWS KMS to encrypt data in Amazon S3 and RDS. They need to ensure that encryption keys are automatically rotated every year. Which KMS key type supports automatic annual rotation?

A.AWS owned keys

B.AWS managed keys (aws/xxx)

C.Customer managed keys

D.Custom key stores

AnswerB

AWS managed keys rotate automatically every year.

Why this answer

AWS managed keys are rotated automatically every year. Customer managed keys also support automatic rotation, but it is optional and must be enabled explicitly. AWS owned keys are not visible to the customer and cannot be managed.

Option A is wrong because AWS owned keys are not customer-accessible. Option C is wrong because customer managed keys require explicit enabling of rotation.

Option D is wrong because custom key stores do not support automatic rotation.

Full explanation →

1321

MCQeasy

A data engineer needs to ingest data from an on-premises Oracle database into Amazon S3. The data volume is about 500 GB initially, with daily incremental updates of 10 GB. The pipeline must minimize operational overhead. Which AWS service should be used for the initial and incremental loads?

A.AWS Database Migration Service (DMS) with change data capture (CDC) to Amazon S3.

B.AWS Glue with a JDBC connection and incremental crawl.

C.Amazon Kinesis Data Firehose with a custom producer.

D.AWS Data Pipeline with a SQL activity and HiveCopyActivity.

AnswerA

DMS supports full load and CDC with low overhead.

Why this answer

AWS DMS with CDC is the correct choice because it supports continuous replication from Oracle to Amazon S3 with minimal overhead. It handles both the initial 500 GB full load and ongoing 10 GB daily increments via change data capture, without requiring custom code or complex pipeline management.

Exam trap

The trap here is that candidates often choose AWS Glue for its serverless nature, but Glue's incremental crawl only updates the Data Catalog, not the data itself, and it cannot capture row-level changes from a database without full reloads.

How to eliminate wrong answers

Option B is wrong because AWS Glue with an incremental crawl is designed for cataloging schema changes, not for capturing row-level changes from a database; it would require full table scans for each incremental load, which is inefficient for 10 GB daily updates. Option C is wrong because Amazon Kinesis Data Firehose requires a custom producer to stream data from Oracle, which adds operational overhead and does not natively support CDC or initial bulk loads from a database. Option D is wrong because AWS Data Pipeline with a SQL activity and HiveCopyActivity is a legacy service that lacks native CDC support for Oracle, requiring custom scripting for incremental loads and increasing operational complexity.

Full explanation →

1322

MCQhard

A data engineer is monitoring an Amazon Kinesis Data Analytics application that processes real-time clickstream data. The application uses a Flink application with multiple operators. The engineer notices that the 'millisBehindLatest' metric is increasing steadily. Which action is MOST likely to reduce the lag?

A.Decrease the batch size in the Flink application.

B.Switch the source stream to use GZIP compression.

C.Increase the parallelism of the Flink application.

D.Increase the retention period of the Kinesis stream.

AnswerC

More parallelism increases processing capacity.

Why this answer

Option C is correct because increasing parallelism can improve throughput and reduce lag. Option A is wrong because decreasing the batch size would reduce throughput. Option B is wrong because using a different compression may reduce storage but not lag.

Option D is wrong because increasing the retention period does not affect processing speed.

Full explanation →

1323

Multi-Selectmedium

A data engineer is using Amazon EMR to process large datasets. The cluster uses a mix of Spot Instances and On-Demand Instances. The engineer wants to reduce costs while ensuring the job can complete even if Spot Instances are reclaimed. Which TWO actions should the engineer take? (Choose two.)

Select 2 answers

A.Enable Instance Fleets to use multiple instance types for Spot Instances.

B.Use only On-Demand Instances for all nodes.

C.Use Spot Instances for core nodes to reduce cost.

D.Enable termination protection for the cluster.

E.Use a task instance group with Spot Instances for non-critical processing tasks.

AnswersA, E

Instance Fleets reduce the impact of Spot interruptions by diversifying instance types.

Why this answer

Option A is correct because enabling Instance Fleets allows EMR to diversify instance types for Spot, reducing the chance of interruption. Option E is correct because using a task instance group with Spot Instances for non-critical tasks ensures that only those tasks are interrupted. Option B is incorrect because using only On-Demand Instances increases cost.

Option C is incorrect because running core nodes on Spot Instances may cause data loss if Spot nodes storing HDFS data are reclaimed. Option D is incorrect because termination protection guards against accidental termination, not Spot interruptions.

Full explanation →

1324

MCQhard

A data pipeline uses AWS DMS to replicate data from an on-premises Oracle database to Amazon S3 in Parquet format. The pipeline has been running successfully for months, but recently the DMS task status shows 'failed' with the error: 'The source database is running out of archive log space.' Which action should the engineer take to prevent this error?

A.Configure multiple target S3 buckets to distribute the load.

B.Increase the amount of archive log space or reduce the log retention period on the source Oracle database.

C.Enable automatic log archiving on the DMS replication instance.

D.Increase the memory allocation for the DMS replication instance.

AnswerB

More space or shorter retention prevents log space exhaustion.

Why this answer

Option B is correct because the error originates on the source database: adding archive log space or shortening the log retention period keeps the archive log destination from filling up. Option A is incorrect because DMS tasks can only use one S3 bucket, and the target is not where the space is running out.

Option C is incorrect because AWS does not manage on-premises database logs. Option D is incorrect because DMS uses CDC logs; adding more memory does not fix log space.

Full explanation →

1325

MCQmedium

A data engineer is troubleshooting a nightly ETL job that reads data from an RDS MySQL instance and writes to an S3 bucket in Parquet format. The job runs on an EMR cluster and uses PySpark. Recently, the job started failing with 'OutOfMemoryError' in the executor logs. The data volume has grown 30% in the last month. Which is the MOST efficient solution to resolve this issue without changing the code?

A.Change the RDS instance to a larger size to reduce load.

B.Switch the ETL job to use AWS Glue with a larger WorkerType.

C.Increase the executor memory and memoryOverhead in the Spark configuration.

D.Increase the number of core nodes in the EMR cluster.

AnswerC

Increasing executor memory and memoryOverhead directly addresses the OutOfMemoryError by providing more heap and off-heap memory to executors.

Why this answer

Option C is correct because increasing executor memory and adjusting the spark.executor.memoryOverhead setting addresses memory limitations for large data processing. Option A is wrong because resizing the RDS instance does not give the Spark executors more memory. Option B is wrong because switching to Glue may not directly resolve memory issues and requires code changes.

Option D is wrong because increasing cluster size adds cost and may not fix memory per executor.
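Executor memory settings are passed as Spark configuration, so they can change without touching the PySpark code. This is a minimal sketch that turns the two settings into `spark-submit` arguments; the 8 GB and 2 GB values are arbitrary examples, not a sizing recommendation.

```python
def executor_memory_args(memory_gb, overhead_gb):
    """Return spark-submit --conf arguments for executor heap and overhead memory."""
    conf = {
        "spark.executor.memory": f"{memory_gb}g",
        "spark.executor.memoryOverhead": f"{overhead_gb}g",
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def container_request_gb(memory_gb, overhead_gb):
    """Memory YARN must grant per executor: heap plus overhead."""
    return memory_gb + overhead_gb


print(" ".join(executor_memory_args(8, 2)))
print(container_request_gb(8, 2))  # 10
```

The sum matters because YARN allocates heap and overhead together as one container, so the total has to fit within the memory available on each node.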

Full explanation →

1326

MCQeasy

A company needs to transform JSON data from Amazon Kinesis Data Streams into Parquet format and store it in Amazon S3. The transformation includes simple field mappings and type conversions. Which approach is most cost-effective and serverless?

A.Use Amazon EC2 instances running Apache Spark Streaming

B.Use Amazon Kinesis Data Firehose with an AWS Lambda function for transformation and output to Parquet

C.Use Amazon SageMaker for transformation

D.Use an AWS Glue ETL job triggered by a Kinesis stream

AnswerB

Firehose can invoke a Lambda function on incoming records and convert the output to Parquet before delivering it to S3.

Why this answer

Kinesis Data Firehose with built-in Lambda transformation can convert JSON to Parquet efficiently. Glue is heavier and more expensive for simple transforms; EC2 is not serverless; SageMaker is ML-focused.

Full explanation →

1327

Multi-Selecthard

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for real-time data ingestion? (Choose three.)

Select 3 answers

A.The need for a fully managed delivery destination

B.Whether the application requires custom data processing logic

C.The ability to compress data before storage

D.The latency requirements for data delivery

E.The maximum throughput supported per shard

AnswersA, B, D

Firehose can directly deliver to S3, Redshift, etc., while Streams requires a consumer.

Why this answer

Options A, B, and D are correct. Kinesis Data Streams requires custom consumers; Firehose is fully managed. Streams provides sub-second latency; Firehose buffers data by size or time interval before delivery.

Streams allows custom processing; Firehose has limited transformation options. Option C is wrong because both support compression for destinations, so it does not separate the two services. Option E is wrong because both scale.

Full explanation →

1328

MCQhard

A data engineer is building a real-time data pipeline using Amazon Kinesis Data Streams with a Lambda consumer. The data volume is 2 MB/s with average record size of 5 KB. The Lambda function processes records and writes to DynamoDB. Occasionally, the Lambda function fails with 'ProvisionedThroughputExceededException' on DynamoDB. What is the best way to handle this?

A.Replace Lambda with the Kinesis Client Library (KCL) running on EC2.

B.Increase the Lambda function's reserved concurrency to process more records in parallel.

C.Use DynamoDB Streams to capture the records and process them asynchronously.

D.Configure a Lambda destination on failure to send records to an SQS dead-letter queue, and implement retry logic in Lambda.

AnswerD

This handles transient failures without data loss.

Why this answer

Implementing a dead-letter queue and retry logic with exponential backoff is best practice for transient throttling, so Option D is correct. Option B is wrong because increasing Lambda concurrency may worsen throttling.

Option A is wrong because KCL adds complexity. Option C is wrong because DynamoDB Streams is for change data capture, not for throttling errors.
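The retry logic with exponential backoff mentioned above can be sketched as follows. This is a minimal, library-free illustration: ThrottledError is a hypothetical stand-in for the DynamoDB throttling exception, and write is any callable that performs the write. The attempt count and base delay are assumed values.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from the data store."""


def write_with_backoff(write, item, max_attempts=5, base_delay=0.1,
                       sleep=time.sleep):
    """Call write(item), retrying on throttling with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return write(item)
        except ThrottledError:
            if attempt == max_attempts - 1:
                # Out of retries: re-raise so the failure can be routed
                # to the on-failure destination instead of being lost.
                raise
            # Random delay up to base_delay * 2^attempt ("full jitter").
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Re-raising after the last attempt matters here: it lets the failed batch reach the dead-letter queue rather than being silently dropped.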

Full explanation →

1329

MCQeasy

A company wants to ingest real-time streaming data from thousands of IoT devices into AWS for immediate processing. Which service is designed for ingesting large volumes of streaming data with low latency?

A.Amazon Simple Storage Service (S3)

B.AWS Database Migration Service (DMS)

C.Amazon Simple Queue Service (SQS)

D.Amazon Kinesis Data Streams

AnswerD

Kinesis Data Streams is designed for real-time streaming data ingestion with durable storage and replay capability.

Why this answer

Amazon Kinesis Data Streams is purpose-built for real-time streaming data ingestion at scale. Option C (SQS) is a message queue and is not optimized for streaming; Option A (S3) is object storage; Option B (DMS) is for database migration.

Full explanation →

1330

MCQhard

A company runs a critical PostgreSQL database on Amazon RDS. The database experiences high read latency during peak hours. The data engineer needs to reduce read latency with minimal changes to the application. Which solution is MOST effective?

A.Delete unused indexes to improve query performance.

B.Enable Multi-AZ deployment for automatic failover.

C.Increase the DB instance class to a larger size with more memory.

D.Create a read replica of the RDS instance and redirect read queries to it.

AnswerD

Read replicas distribute read load, reducing latency.

Why this answer

Option D is correct because creating a read replica offloads read queries from the primary instance, reducing read latency. Option C is incorrect because increasing instance size may help but is more disruptive and costly. Option B is incorrect because enabling Multi-AZ is for high availability, not read performance.

Option A is incorrect because deleting unused indexes can improve write performance but may not significantly reduce read latency.

Full explanation →

1331

MCQhard

A company uses Amazon DynamoDB to store metadata for a document management system. The table has a partition key of document_id and a sort key of version. The application frequently queries for the latest version of a document by document_id. The data engineer notices that these queries are consuming a lot of read capacity. How can the engineer optimize the read performance and reduce read capacity consumption?

A.Change the sort key to store version in descending order.

B.Enable DynamoDB Streams and use a read replica.

C.Decrease the ReadCapacityUnits of the table to force caching.

D.Create a global secondary index (GSI) and use DynamoDB Accelerator (DAX).

AnswerD

A GSI can support efficient queries, and DAX caches results, reducing read capacity.

Why this answer

Option D is correct because creating a global secondary index (GSI) with document_id as partition key and version as sort key, but with a different projection, can allow queries to use the index with less data read per item. More importantly, using DynamoDB Accelerator (DAX) would cache the results of frequent queries, reducing read capacity consumption. Option A is incorrect because changing the sort key order does not reduce read capacity for the same query.

Option B is incorrect because read replicas are not a feature of DynamoDB. Option C is incorrect because decreasing ReadCapacityUnits would cause throttling.

Full explanation →

1332

MCQhard

A data engineer is troubleshooting an ETL job that reads from an S3 bucket encrypted with SSE-KMS. The job is failing with an error indicating that the IAM role does not have permission to decrypt the data. What is the most likely missing permission?

A.kms:GenerateDataKey

B.s3:ListBucket

C.kms:Decrypt

D.s3:GetObject

AnswerC

To read SSE-KMS encrypted objects, the role must have kms:Decrypt permission on the KMS key.

Why this answer

Option C is correct because the role needs kms:Decrypt to read data encrypted with KMS keys. Option B is wrong because s3:ListBucket allows listing, not reading. Option D is wrong because s3:GetObject allows reading, but without decrypt permission the encrypted object cannot be read.

Option A is wrong because kms:GenerateDataKey is for encryption, not decryption.
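To make the permission pairing concrete, the sketch below builds an IAM policy document that allows both reading the objects and decrypting them. The bucket name and key ARN are hypothetical placeholders; the action names s3:GetObject and kms:Decrypt are the ones discussed above.

```python
import json


def sse_kms_read_policy(bucket, key_arn):
    """Return an IAM policy allowing reads of SSE-KMS encrypted objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read the object bytes from S3.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # Decrypt the data key that protects each object.
                "Effect": "Allow",
                "Action": "kms:Decrypt",
                "Resource": key_arn,
            },
        ],
    }


# Hypothetical bucket and key identifiers, for illustration only.
policy = sse_kms_read_policy(
    "example-etl-bucket",
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)
print(json.dumps(policy, indent=2))
```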

Full explanation →

1333

MCQeasy

A company wants to audit all changes to IAM policies in their AWS account. Which AWS service should be used to record these changes for compliance purposes?

A.Amazon CloudWatch Logs

B.AWS Config

C.AWS CloudTrail

D.Amazon S3

AnswerC

CloudTrail records API calls made in the account, including IAM policy changes.

Why this answer

AWS CloudTrail records API calls, including IAM policy changes. AWS Config records resource configurations but not all API calls. CloudWatch Logs can store logs but does not record API calls itself.

S3 is the destination for logs, not the recording service.

Full explanation →

1334

MCQeasy

A data engineer needs to audit all changes to IAM policies in an AWS account. Which AWS service should be used?

A.AWS CloudTrail

B.AWS Config

C.AWS Organizations

D.Amazon CloudWatch Logs

AnswerA

CloudTrail records all API activity for auditing.

Why this answer

Option A is correct because AWS CloudTrail records API calls, including IAM policy changes. Option B is wrong because AWS Config tracks resource configuration changes, not API calls. Option D is wrong because CloudWatch Logs is for log storage.

Option C is wrong because AWS Organizations manages multiple accounts.

Full explanation →

1335

MCQhard

A company uses AWS Glue to run ETL jobs that process data from Amazon S3 and load into Amazon Redshift. The jobs have recently started failing with 'Out of Memory' errors. The data volume has increased 3x in the past month. Which is the MOST effective solution to resolve this issue without redesigning the job?

A.Use Amazon Athena instead of Glue for the transformation.

B.Increase the number of Glue workers (DPUs) for the job.

C.Rewrite the job to use Spark SQL instead of PySpark.

D.Increase the number of partitions in the input S3 data.

AnswerB

More workers provide more memory and CPU to handle increased data volume.

Why this answer

Option B is correct because increasing the number of workers (DPUs) provides more memory and processing capacity. Option D (increasing S3 partitions) may help with parallelism but not directly with memory. Option C (using Spark SQL) is not a direct fix.

Option A (switching to Athena) changes the architecture.

Full explanation →

1336

MCQhard

A data engineer runs the above DDL statement in Amazon Athena. The query returns an error. What is the most likely cause?

A.The SerDe is not compatible with Parquet files.

B.The INPUTFORMAT is incorrect for Parquet files.

C.The S3 bucket location does not exist.

D.The table name contains underscores.

AnswerB

TextInputFormat is for text files, not Parquet. Should use Parquet input format.

Why this answer

Option B is correct because the table is defined with the Parquet SerDe but the INPUTFORMAT is TextInputFormat, which is incompatible. For Parquet files, the INPUTFORMAT should be 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'. Option C is wrong because the location is valid.

Option D is wrong because the table name is valid. Option A is wrong because the SerDe is correct for Parquet; it is the mismatch with INPUTFORMAT that causes the error.
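For comparison, here is a sketch of a DDL statement in which the SerDe and the input and output formats all agree on Parquet. The exhibit itself is not reproduced on this page, so the table name, columns, and S3 location below are hypothetical; only the three Hive class names are real.

```python
PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
PARQUET_INPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
PARQUET_OUTPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"


def parquet_table_ddl(table, columns, location):
    """Render a CREATE EXTERNAL TABLE statement with matching Parquet classes."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"ROW FORMAT SERDE '{PARQUET_SERDE}'\n"
        f"STORED AS INPUTFORMAT '{PARQUET_INPUT}'\n"
        f"OUTPUTFORMAT '{PARQUET_OUTPUT}'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table definition, for illustration only.
print(parquet_table_ddl(
    "sales_parquet",
    [("order_id", "bigint"), ("amount", "double")],
    "s3://example-bucket/sales/",
))
```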

Full explanation →

1337

MCQmedium

A data engineer needs to load data from an on-premises Oracle database to Amazon S3 daily. The table is 500 GB and grows by 50 MB per day. The load must capture only new and changed rows since the last run. Which solution is MOST cost-effective and requires the least maintenance?

A.Write a custom Python script on EC2 to query the Oracle redo logs and upload to S3

B.Export the entire table to CSV daily using a script and upload to S3

C.Use AWS Glue ETL job with a JDBC connection and a timestamp filter

D.Use AWS Database Migration Service (DMS) with ongoing replication (CDC)

AnswerD

DMS supports CDC and can capture only changes, minimizing cost and effort.

Why this answer

Option D is correct because AWS DMS with change data capture (CDC) can continuously replicate changes with minimal setup. Option B (full export) is wasteful. Option C (Glue with JDBC) does not natively support CDC.

Option A (custom script) increases maintenance.

Full explanation →

1338

MCQhard

Refer to the exhibit. A CloudFormation stack outputs the Glue job name and S3 bucket names. The Glue job transforms CSV files from the raw bucket to Parquet in the processed bucket. However, the Glue job is failing with an error that it cannot write to the processed bucket. What is the most likely cause?

A.The Glue job does not have permission to write to the processed bucket

B.The raw data bucket is in a different region

C.The Glue job is not using the correct worker type

D.The Glue job is using an incorrect file format

AnswerA

Missing s3:PutObject on processed-bucket.

Why this answer

The Glue job's IAM role likely does not have s3:PutObject permission on the processed bucket. The CloudFormation outputs show the bucket names, but the role may only have read access to raw. The error is write-related, not read.

Full explanation →

1339

Multi-Selecteasy

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data must be transformed in real-time using a custom Lambda function. Which TWO steps are required to enable this? (Choose TWO)

Select 2 answers

A.Configure Kinesis Data Firehose to use a Lambda function for data transformation

B.Ensure the Lambda function returns the transformed records in the correct format

C.Create a Kinesis Data Analytics application to transform the data

D.Write the transformation logic directly in the Firehose delivery stream configuration

E.Use Kinesis Data Streams as the source for Firehose

AnswersA, B

Firehose can invoke Lambda for transformation.

Why this answer

Options A and B are correct. Kinesis Data Firehose can invoke a Lambda function for transformation. The Lambda function must return the transformed records to Firehose in the format it expects.

Option C is wrong because Kinesis Data Analytics is for analytics, not transformation. Option D is wrong because the transformation is done by Lambda, not within Firehose. Option E is wrong because Kinesis Data Streams is not required for Firehose.
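The return format that Firehose expects from a transformation function can be sketched as below: each output record echoes the incoming recordId, carries a result status, and holds the transformed payload as base64-encoded data. The added "processed" field is a hypothetical transformation, and the sketch assumes every incoming record is a JSON object.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and return it in the expected shape."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # hypothetical transformation step
        encoded = base64.b64encode(
            (json.dumps(payload) + "\n").encode("utf-8")
        ).decode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must match the input record
            "result": "Ok",
            "data": encoded,
        })
    return {"records": output}
```

A record that cannot be transformed would normally be returned with a result of "ProcessingFailed" instead of raising, so that the rest of the batch is still delivered.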

Full explanation →

1340

Multi-Selecteasy

Which TWO features of Amazon DynamoDB help ensure high availability and durability? (Choose two.)

Select 2 answers

A.Auto-scaling adjusts provisioned capacity based on traffic.

B.Data is automatically replicated across multiple Availability Zones within an AWS Region.

C.Global tables enable active-active replication across multiple AWS Regions.

D.On-demand backup and restore provides point-in-time recovery.

E.Time to Live (TTL) automatically deletes expired items.

AnswersB, D

Provides high availability and durability.

Why this answer

Option B is correct because DynamoDB automatically replicates data across multiple Availability Zones (AZs) within an AWS Region. This built-in replication keeps the data available and durable even if an entire AZ fails.

Exam trap

The trap here is that candidates often confuse auto-scaling (Option A) with high availability, but auto-scaling only adjusts capacity to meet demand, not data replication or fault tolerance.

Full explanation →

1341

MCQeasy

A data engineer needs to ensure that data in an S3 bucket is encrypted at rest. The bucket policy includes a condition that denies PutObject requests if the object is not encrypted. Which S3 encryption feature does this enforce?

A.S3 Object Lock

B.S3 MFA Delete

C.S3 Default Encryption

D.S3 Bucket Policy

AnswerD

Bucket policy can deny uploads if encryption is not set.

Why this answer

Option D is correct because S3 bucket policies can require server-side encryption by denying PutObject without encryption headers. Option C (default encryption) is a bucket-level setting that automatically encrypts objects, but it does not enforce encryption via policy. Option A (Object Lock) prevents deletion.

Option B (MFA Delete) requires multi-factor authentication.
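A bucket policy of the kind described in the question might look like the sketch below, which denies any PutObject request that arrives without the server-side encryption header. The bucket name is a hypothetical placeholder; the condition key s3:x-amz-server-side-encryption is the one S3 defines for this header.

```python
import json


def deny_unencrypted_uploads_policy(bucket):
    """Return a bucket policy that rejects uploads lacking an SSE header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # "Null": "true" matches requests where the header is absent.
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


# Hypothetical bucket name, for illustration only.
print(json.dumps(deny_unencrypted_uploads_policy("example-secure-bucket"), indent=2))
```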

Full explanation →

1342

Multi-Selecthard

A company is ingesting data from multiple sources into Amazon S3 using AWS Glue. The data is then transformed using Apache Spark on Amazon EMR. The data engineer wants to reduce the cost of storing and processing data by compressing the ingested files. Which THREE file formats support compression and are commonly used with Spark? (Choose THREE.)

Select 3 answers

A.ORC

B.Parquet

C.JSON

D.CSV

E.Avro

AnswersA, B, E

ORC is a columnar format that supports compression and is optimized for Hive/Spark.

Why this answer

Options A, B, and E are correct: ORC, Parquet, and Avro all support compression and are commonly used with Spark. Option C is wrong because JSON can be compressed, but it is a text format that is less efficient for Spark analytics than ORC, Parquet, or Avro.

Option D is wrong because CSV can be compressed but is not efficient for Spark processing.

Full explanation →

1343

MCQeasy

A company uses Amazon S3 to store sensitive customer data. The security policy requires that all objects in the bucket be encrypted at rest using server-side encryption with a customer-managed KMS key. The data engineer has enabled default encryption on the bucket using SSE-KMS with the required KMS key. However, a security scan reveals that some objects in the bucket are not encrypted with the KMS key. The objects were uploaded before the default encryption was enabled. The data engineer needs to ensure that all objects are encrypted with the KMS key without disrupting ongoing data access. What should the data engineer do?

A.Use the AWS CLI to copy the objects to themselves with the --sse-kms-key-id parameter.

B.Modify the bucket policy to deny access to objects not encrypted with the KMS key.

C.Delete the unencrypted objects and re-upload them with encryption.

D.Use S3 Batch Operations with a Lambda function to apply SSE-KMS encryption to all existing objects using the KMS key.

AnswerD

S3 Batch Operations can encrypt existing objects in place.

Why this answer

Option D is correct. S3 Batch Operations can apply SSE-KMS encryption to existing objects using the KMS key. Option A is wrong because copying objects manually is inefficient and error-prone.

Option B is wrong because the bucket policy only prevents new unencrypted uploads. Option C is wrong because deleting and re-uploading is disruptive.

Full explanation →

1344

Multi-Selectmedium

A company uses Amazon Redshift for analytics. The data engineering team wants to improve query performance for frequently used aggregate queries. Which TWO actions would help achieve this?

Select 2 answers

A.Increase the number of WLM query queues

B.Use distribution keys to collocate data on the same node slices

C.Run the VACUUM command to reclaim space from deleted rows

D.Define appropriate sort keys on the tables

E.Increase the number of nodes in the cluster

AnswersB, D

Distribution keys reduce data movement during joins and aggregations.

Why this answer

Distribution keys determine how data is distributed across node slices in Amazon Redshift. By choosing distribution keys that align with the join and aggregation columns, the database can collocate related data on the same slice, minimizing data movement during query execution. This directly improves performance for aggregate queries by reducing network traffic and enabling local computation.

Exam trap

The trap here is that candidates often confuse VACUUM (which reclaims space) with performance optimization for queries, or assume adding nodes always improves query speed without considering the overhead of data redistribution.

Full explanation →

1345

MCQmedium

A company uses Amazon S3 to store large CSV files and runs Amazon Athena queries on them. The queries are becoming slower as data grows. A data engineer suggests converting the files to Apache Parquet format and partitioning the data. What is the primary benefit of converting to Parquet?

A.Parquet allows schema evolution without rewriting files.

B.Parquet supports nested data structures that CSV cannot.

C.Parquet stores data in a columnar format, reducing the amount of data scanned per query.

D.Parquet is compressed by default, reducing storage costs.

AnswerC

Columnar storage minimizes I/O by reading only relevant columns.

Why this answer

Parquet is a columnar storage format that stores data by columns rather than rows. When Athena queries only a subset of columns, it can read just those columns, drastically reducing the amount of data scanned per query. This directly addresses the slowdown: less data scanned means faster queries, and because Athena charges by data scanned, lower costs as well.
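A back-of-the-envelope calculation shows why column pruning matters. The per-row column sizes below are invented for illustration, and the estimate ignores compression and file metadata; it only compares the bytes of the selected columns with the bytes of the whole row.

```python
def scanned_fraction(column_bytes, selected):
    """Fraction of row bytes read when only `selected` columns are scanned."""
    total = sum(column_bytes.values())
    return sum(column_bytes[name] for name in selected) / total


# Hypothetical average bytes per row for each column.
columns = {"order_id": 8, "customer": 40, "amount": 8, "notes": 200}

# A query that aggregates only `amount` touches a small share of the data,
# whereas a row-based format such as CSV would have to read every row in full.
print(scanned_fraction(columns, ["amount"]))
```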

Exam trap

The trap here is that candidates confuse the general benefits of Parquet (compression, schema evolution, nested data) with the primary performance benefit for Athena, which is columnar pruning reducing scanned data.

How to eliminate wrong answers

Option A is wrong because Parquet does support schema evolution (e.g., adding columns), but that is not its primary benefit for query performance, and schema evolution is not tied to Parquet's columnar layout. Option B is wrong because support for nested data structures (like structs and arrays), which CSV lacks, is a data modeling advantage, not the primary performance benefit for large-scale analytics queries. Option D is wrong because compression in Parquet depends on the codec the writer is configured with (e.g., Snappy, Gzip, Zstd), and while it reduces storage costs, the primary benefit for query speed is columnar pruning, not compression.

Full explanation →

1346

MCQhard

The Glue job attempts to read data from 's3://my-data-bucket/input/' and write to 's3://my-data-bucket/output/'. It also tries to update a table in the Glue Data Catalog. The job fails with an access denied error. What is the MOST likely cause?

A.The IAM role is missing the 's3:ListBucket' permission on the bucket.

B.The 'glue:UpdateTable' action is not allowed on the specific table.

C.The policy is missing a condition key for the S3 bucket.

D.The resource ARN does not include the bucket itself; it only covers objects.

AnswerA

Glue needs ListBucket to read the list of objects in the prefix.

Why this answer

The policy allows GetObject and PutObject on the objects in the bucket, but the Glue job also needs 's3:ListBucket' permission on the bucket itself to list objects under the prefix. Without ListBucket, the job cannot enumerate files. Option D is wrong because the resource is correct for the object actions ('/*' includes all objects).

Option B is wrong because the action 'glue:UpdateTable' is allowed. Option C is wrong because there is no condition key issue.
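The split between bucket-level and object-level permissions can be sketched as follows. The exhibit policy is not reproduced on this page, so this is an assumed shape rather than the original: s3:ListBucket is granted on the bucket ARN, while s3:GetObject and s3:PutObject are granted on the object ARN. The bucket name is the one used in the question.

```python
import json


def glue_s3_policy(bucket):
    """Return an IAM policy with both listing and object read/write access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself (no trailing /*).
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading and writing apply to the objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


print(json.dumps(glue_s3_policy("my-data-bucket"), indent=2))
```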

Full explanation →

1347

Multi-Selectmedium

A company is building a data lake on AWS and must encrypt data at rest. Which services can provide server-side encryption for data stored in Amazon S3? (Choose TWO.)

Select 2 answers

A.SSE-S3

B.SSL/TLS

C.AWS SDK client-side encryption

D.AWS CloudHSM

E.SSE-KMS

AnswersA, E

Server-side encryption with S3 managed keys.

Why this answer

Options A and E are correct because SSE-S3 and SSE-KMS are two methods for server-side encryption in S3. Option C is wrong because client-side encryption is not server-side. Option D is wrong because CloudHSM is not a server-side encryption option for S3.

Option B is wrong because SSL/TLS is encryption in transit, not at rest.

Full explanation →

1348

MCQhard

Refer to the exhibit. A data engineer runs a CLI command to decrypt a file and receives an access denied error. The IAM user 'DataEngineer' has the following policy attached: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "kms:Decrypt", "Resource": "*" } ] } What is the most likely cause of the error?

A.The CLI command is missing the --encryption-context parameter.

B.The key policy does not grant the IAM user permission to decrypt.

C.The key is an AWS managed key and cannot be used for decryption.

D.The IAM policy does not allow kms:Decrypt on the specific key.

AnswerB

Key policy is separate from IAM policy; it must explicitly allow the user.

Why this answer

Even with an Allow on kms:Decrypt for all resources in the IAM policy, the key policy must also permit access, either by naming the principal or by enabling IAM policies for the account. The error indicates that the key policy does not allow the user, so Option B is correct.

Option D is wrong because the IAM policy allows all keys. Option C is wrong because AWS managed keys are not used. Option A is wrong because the CLI command is correct.
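One way a key policy can grant the missing access is a statement that names the IAM user as principal, sketched below. The account ID is a hypothetical placeholder; the user name is the one from the question. In a key policy, a Resource of "*" refers to the key the policy is attached to.

```python
import json


def key_policy_decrypt_statement(account_id, user_name):
    """Return a key policy statement letting one IAM user call kms:Decrypt."""
    return {
        "Sid": "AllowUserDecrypt",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:user/{user_name}"},
        "Action": "kms:Decrypt",
        "Resource": "*",  # within a key policy, this means "this key"
    }


# Hypothetical account ID, for illustration only.
print(json.dumps(key_policy_decrypt_statement("111122223333", "DataEngineer"), indent=2))
```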

Full explanation →

1349

MCQeasy

A data engineer needs to export data from an Amazon DynamoDB table to Amazon S3 for archival purposes. The export should be a one-time operation and must not impact the read capacity of the table. Which approach meets these requirements?

A.Use a Scan operation in a script to read all items and write to S3

B.Use AWS Glue ETL with a DynamoDB connector

C.Set up a DynamoDB Stream to Lambda that writes to S3

D.Use DynamoDB on-demand backup feature to export to S3

AnswerD

Backup exports to S3 without consuming read capacity.

Why this answer

Option D is correct because DynamoDB's on-demand backup exports to S3 without consuming read capacity. Option A impacts read capacity. Option B is for Spark jobs.

Option C is for real-time streaming.

Full explanation →

1350

MCQmedium

A retail company uses AWS Glue ETL jobs to process sales data from an S3 data lake. The source data is partitioned by year/month/day in CSV format. The Glue job reads the latest day's data, performs transformations (e.g., cleaning, aggregating), and writes the results to a separate S3 bucket. The job runs on a schedule every day at 2 AM. Recently, the job has been failing intermittently with the error 'AnalysisException: Path does not exist: s3://source-bucket/year=2024/month=02/day=30/'. The engineer verifies that the folder 'day=30' does not exist because February has only 29 days in 2024. The job is reading data from a hardcoded path. The company expects the job to handle variable days per month automatically. What should the engineer do to fix the issue?

A.Modify the script to use Spark SQL with manual partition pruning based on current date

B.Add a try-catch block in the script to skip missing partitions

C.Increase the job's retry count and set a timeout

D.Use a Glue crawler to populate the Data Catalog and use dynamic frame from_catalog with partition predicates

AnswerD

The crawler discovers existing partitions, and dynamic frame reads only available partitions.

Why this answer

Option D is correct because using a Glue crawler to update the partition metadata and then using a dynamic frame with from_catalog allows Glue to automatically discover all existing partitions. This eliminates the need for hardcoded paths. Option B (try-catch) is a workaround but not a proper solution.

Option C (increase retries) does not fix the root cause. Option A (use Spark SQL with manual partition pruning) still requires knowing which partitions exist.
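A partition predicate for the catalog read can be derived from real calendar dates rather than hardcoded, as in the sketch below. It assumes the job processes the day before the run date and that the crawler registered year, month, and day as string partition columns; both are assumptions, not details given in the question.

```python
from datetime import date, timedelta


def partition_predicate(run_date):
    """Build a predicate selecting the partition for the day before run_date."""
    target = run_date - timedelta(days=1)
    return (
        f"year='{target:%Y}' and month='{target:%m}' and day='{target:%d}'"
    )


# Date arithmetic handles month lengths and leap years automatically.
print(partition_predicate(date(2024, 3, 1)))
```

In a Glue script, a string like this would typically be passed as the push_down_predicate argument of glueContext.create_dynamic_frame.from_catalog, so that only the matching cataloged partition is read.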

Full explanation →
