Knowledge + Practice

AWS Certified Data Engineer Associate (DEA-C01): Questions 1126–1200

1786 questions total · 24 pages · All types, answers revealed



1126

Multi-Select · medium

A data engineering team is building a pipeline to transform CSV files uploaded to Amazon S3 into Parquet format using AWS Glue. The transformation must be serverless and handle files that arrive at irregular intervals. Which TWO actions should the team take? (Choose two.)

Select 2 answers

A.Configure an Amazon EMR cluster with Apache Spark for on-demand transformation.

B.Use an AWS Glue ETL job to convert CSV to Parquet.

C.Use AWS Data Pipeline to schedule a periodic transformation.

D.Use Amazon Redshift Spectrum to convert files during query execution.

E.Set up an S3 event notification to invoke an AWS Lambda function that triggers the Glue job.

Answers: B, E

Glue ETL jobs are serverless and can transform data formats.

Why this answer

Option B: AWS Glue ETL jobs are serverless and can convert CSV to Parquet. Option E: an S3 event notification can invoke a Lambda function that starts the Glue job whenever a file arrives, which suits irregular arrival intervals. Option A (Amazon EMR) is cluster-based and requires cluster management. Option C (AWS Data Pipeline) is not serverless and runs on a schedule rather than reacting to arrivals. Option D (Amazon Redshift Spectrum) queries data in place but does not transform file formats.

1127

MCQ · easy

A company needs to ingest data from multiple SaaS sources (e.g., Salesforce, Marketo) into Amazon S3 for analytics. Which AWS service is designed for this purpose?

A.AWS Transfer Family

B.AWS Glue

C.Amazon AppFlow

D.AWS DataSync

Answer: C

AppFlow is purpose-built for SaaS data ingestion.

Why this answer

Option C is correct because Amazon AppFlow is a fully managed service for securely transferring data from SaaS applications to AWS. Option B (AWS Glue) can connect to JDBC sources but not directly to many SaaS APIs. Option D (AWS DataSync) is for file-based transfers. Option A (AWS Transfer Family) is for SFTP, FTPS, and FTP file transfers.

1128

MCQ · easy

A data engineer needs to ensure that an Amazon Redshift cluster only accepts encrypted connections. Which parameter should be modified?

A.enable_user_activity_logging

B.max_concurrency_scaling_clusters

C.require_ssl

D.wlm_json_configuration

Answer: C

The require_ssl parameter enforces SSL connections.

Why this answer

Option C is correct. Setting the require_ssl parameter to true in the cluster's parameter group makes the cluster reject connections that do not use SSL. Option A is wrong because enable_user_activity_logging is for auditing. Option B is wrong because max_concurrency_scaling_clusters limits the number of concurrency scaling clusters. Option D is wrong because wlm_json_configuration is for workload management.
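As a rough illustration of the parameter change described above, the sketch below builds the parameter list and passes it to the Redshift `modify_cluster_parameter_group` call in boto3. The parameter group name is a hypothetical placeholder, and the function names are illustrative rather than anything from the question.

```python
from typing import Dict, List


def build_require_ssl_parameters(enabled: bool = True) -> List[Dict[str, str]]:
    """Build the parameter list that turns require_ssl on or off."""
    return [
        {
            "ParameterName": "require_ssl",
            "ParameterValue": "true" if enabled else "false",
        }
    ]


def enforce_ssl(parameter_group_name: str) -> None:
    """Apply require_ssl=true to a cluster parameter group (calls AWS)."""
    # boto3 is imported here so the helper above works without the AWS SDK.
    import boto3

    redshift = boto3.client("redshift")
    redshift.modify_cluster_parameter_group(
        ParameterGroupName=parameter_group_name,  # hypothetical group name
        Parameters=build_require_ssl_parameters(),
    )
```

The cluster must be associated with the modified parameter group for the setting to apply to it.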

1129

Multi-Select · hard

A data engineer is designing a data lake on Amazon S3 with AWS Lake Formation. The data lake contains personally identifiable information (PII). The company has a policy that only users who have completed data privacy training can access the PII data. The training status is stored in an external identity provider (IdP) as an attribute. The data engineer needs to enforce this policy using Lake Formation. Which THREE steps should the data engineer take? (Choose THREE.)

Select 3 answers

A.Create an LF-tag called 'trainingCompleted' with values 'true' and 'false'. Grant 'SELECT' permission on the LF-tag 'trainingCompleted=true' to the federated users.

B.Configure SAML-based federation between the IdP and AWS to pass the training status attribute in the SAML assertion.

C.Create a column-level filter on the PII columns that limits access based on the user's training attribute.

D.Create an IAM role for each user and attach a policy that allows 'lakeformation:GetDataAccess' only if the user has the training attribute.

E.Associate the LF-tag 'trainingCompleted=true' with the PII columns in the tables.

Answers: A, B, E

This allows users with the tag to access data associated with that tag.

Why this answer

Option A is correct because LF-tags allow Lake Formation to manage access based on metadata attributes. By creating an LF-tag 'trainingCompleted' with values 'true' and 'false', and granting SELECT permission on the tag value 'true' to federated users, the data engineer can enforce that only users with the training attribute can access the tagged resources. This approach decouples access control from IAM roles and leverages tag-based authorization, which is the recommended method for attribute-based access control (ABAC) in Lake Formation. Option B is needed so that the training status attribute reaches AWS in the SAML assertion, and Option E associates the 'trainingCompleted=true' tag with the PII columns so that the grant applies to them.

Exam trap

The trap here is that candidates often confuse column-level filters (Option C) with tag-based access control, not realizing that column-level filters cannot dynamically evaluate external IdP attributes, whereas LF-tags with SAML assertions can enforce attribute-based policies.


1130

MCQ · easy

A data engineer needs to set up a disaster recovery solution for an Amazon RDS for MySQL database. The database must be available in another AWS Region with minimal data loss. What is the simplest approach?

A.Enable Multi-AZ deployment in the same Region.

B.Set up AWS Database Migration Service (DMS) for continuous replication.

C.Take a manual snapshot and copy it to the other Region daily.

D.Create a cross-Region read replica of the database.

Answer: D

A read replica can be promoted to a standalone DB in a disaster, with minimal data loss.

Why this answer

Option D is correct because a cross-Region read replica replicates asynchronously to another Region and can be promoted to a standalone instance in a disaster. Option A is wrong because Multi-AZ protects only within a single Region. Option C is wrong because a daily snapshot copy is not continuous and can lose up to a day of changes. Option B is wrong because DMS requires setting up and maintaining ongoing replication and is more complex.

1131

MCQ · hard

A company uses Amazon Redshift for its data warehouse. The data engineer notices that the most frequently accessed table is sorted by date, but queries often filter by customer_id. The table has 500 million rows and uses AUTO distribution style. What change would MOST improve query performance?

A.Change distribution style to KEY on customer_id.

B.Change distribution style to EVEN.

C.Change the sort key to include customer_id as a compound sort key.

D.Change the sort key to an interleaved sort key on date and customer_id.

Answer: C

A compound sort key with customer_id first will optimize queries filtering by customer_id.

Why this answer

Making customer_id the leading column of a compound sort key lets Redshift skip blocks that cannot match the filter, which improves queries that filter by customer_id. Option A is wrong because KEY distribution on customer_id changes how rows are placed across slices, which might help joins on that column, but the access pattern here is filtering, which the sort key addresses. Option B is wrong because EVEN distribution does nothing for filter predicates, and AUTO distribution may already distribute the data well. Option D is wrong because an interleaved sort key is useful for multiple predicates, but a compound sort key with customer_id as the first key is simpler and likely more efficient given the access pattern.

1132

MCQ · hard

A data engineer uses the AWS CLI to list KMS keys and describe one. The output shows two keys. The described key has KeyState 'Enabled' and Origin 'AWS_KMS'. Which statement is true about this key?

A.The key is a KMS managed key that is enabled and ready for use

B.The key material was imported from an external source

C.The key is scheduled for deletion

D.The key is disabled and cannot be used

Answer: A

Origin 'AWS_KMS' and KeyState 'Enabled' indicate it is a managed, enabled key.

Why this answer

Option A is correct. Origin 'AWS_KMS' means AWS KMS generated the key material (it was not imported), and KeyState 'Enabled' means the key can be used for encryption and decryption. Option B is incorrect because imported key material would show Origin 'EXTERNAL'. Option C is incorrect because a key scheduled for deletion would show KeyState 'PendingDeletion'. Option D is incorrect because the key is enabled, not disabled.

1133

MCQ · hard

A financial services company ingests stock trade data from multiple exchanges into an Amazon S3 bucket (trade-bucket). Each exchange sends a CSV file every 5 minutes. The data must be transformed into Parquet format and partitioned by exchange and date (trade_date) for efficient querying using Amazon Athena. The pipeline must handle late-arriving data (files up to 2 hours late) and ensure exactly-once processing to avoid duplicates. Currently, a scheduled AWS Glue ETL job runs every hour, reads new CSV files, converts them to Parquet, and writes to an output bucket. However, the team is experiencing data duplication: if the job fails midway, upon retry it reprocesses all files in the input folder, causing duplicates in the output. Additionally, the job takes too long because it scans all files each run. The engineer must redesign the pipeline to eliminate duplicates and improve efficiency. What should the engineer do?

A.Use AWS Glue Workflows to orchestrate the job and add a condition to check for duplicates before writing.

B.Set up an S3 event notification to invoke an AWS Lambda function that starts a Glue job with a parameter containing the S3 object key of the new file; modify the Glue job to process only that file and use the file key to avoid duplicates.

C.Modify the Glue job to move processed CSV files to an archive folder after successful transformation, and process only unprocessed files.

D.Replace Glue with Amazon EMR and use Spark Structured Streaming with checkpointing to process files incrementally.

Answer: B

This ensures each file is processed exactly once, and the job runs only on new files, improving efficiency.

Why this answer

Option B is the best approach: an S3 event notification invokes Lambda, which starts a Glue job and passes the object key; the job processes only that file and uses the key to avoid reprocessing. Option C still scans the input folder on every run and may cause duplicates if a file arrives after the job starts or if the job fails before the file is archived. Option A (Glue Workflows) orchestrates jobs but does not solve file-level tracking. Option D (EMR) is overkill for this workload and adds cluster management.
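The event-driven flow in the correct option can be sketched roughly as follows. This is a minimal illustration, not a production handler: the Glue job name and the `--source_bucket` / `--source_key` argument names are assumptions made up for the example, and the Glue job itself would need to read those arguments and process only that object.

```python
from typing import Dict, List
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "csv-to-parquet"  # hypothetical job name


def glue_arguments_from_event(event: dict) -> List[Dict[str, str]]:
    """Return one Glue argument set per S3 object in the notification."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        runs.append({"--source_bucket": bucket, "--source_key": key})
    return runs


def handler(event, context):
    """Lambda entry point: start one Glue job run per new object."""
    # boto3 is imported here so the helper above works without the AWS SDK.
    import boto3

    glue = boto3.client("glue")
    for arguments in glue_arguments_from_event(event):
        glue.start_job_run(JobName=GLUE_JOB_NAME, Arguments=arguments)
```

Passing the key as a job argument is what lets the job skip the full folder scan; deduplication on retry still depends on how the job uses that key when writing output.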

1134

Multi-Select · easy

A company is building a data lake on Amazon S3 and needs to ingest data from various on-premises sources. Which TWO AWS services can be used to transfer data securely over the internet?

Select 2 answers

A.AWS Snowcone

B.AWS DataSync

C.Amazon Kinesis Data Firehose

D.AWS Direct Connect

E.AWS CLI

Answers: B, E

DataSync can transfer data over the internet.

Why this answer

Options B and E are correct because AWS DataSync can transfer data over the internet, and the AWS CLI can be used for uploads to S3. Option D is incorrect because Direct Connect is a dedicated private connection, not a transfer over the internet. Option A is incorrect because Snowcone is a physical device. Option C is incorrect because Kinesis Data Firehose is for streaming, not file transfer.

1135

MCQ · medium

A company uses AWS Lake Formation to manage permissions on a data lake stored in S3. A data analyst reports that they can see a table in the AWS Glue Data Catalog but cannot query it using Amazon Athena. The analyst has been granted 'SELECT' permission on the table in Lake Formation. The table's underlying S3 location is encrypted with AWS KMS. The IAM role used by Athena has the necessary S3 and KMS permissions. What is the most likely reason for the failure?

A.The analyst does not have 'DESCRIBE' permission on the table.

B.Athena is not integrated with Lake Formation.

C.The KMS key policy does not allow the analyst's IAM role to decrypt.

D.The analyst does not have 'DESCRIBE' permission on the database.

Answer: A

Athena needs DESCRIBE on the table to retrieve metadata; without it, queries fail.

Why this answer

Option A is correct because Athena needs 'DESCRIBE' permission on the table to read its metadata; SELECT alone is insufficient. Option D is wrong because the analyst can already see the table in the catalog, so database-level visibility is not the problem. Option C is wrong because the KMS permissions are already in place. Option B is wrong because Athena is integrated with Lake Formation.

1136

MCQ · hard

A company runs an Amazon EMR cluster with Spark jobs. One job fails with 'Container killed by YARN for exceeding memory limits'. The data engineer has already increased the executor memory. What is the NEXT best step to resolve the issue?

A.Set spark.executor.memoryOverhead to a higher value.

B.Increase the YARN container memory allocation (yarn.nodemanager.resource.memory-mb).

C.Decrease the number of Spark partitions.

D.Increase the driver memory.

Answer: B

This allows larger containers, preventing YARN from killing them.

Why this answer

Increasing yarn.nodemanager.resource.memory-mb raises the memory each NodeManager can allocate to containers, which can prevent YARN from killing them. Option D is wrong because increasing driver memory may not help if the executors are the issue. Option C is wrong because reducing parallelism may mitigate the problem but does not address the root cause. Option A is wrong because it adjusts memory overhead, whereas the main issue is the YARN limit.

1137

Multi-Select · medium

A company is building a data lake on Amazon S3. They need to ingest data from multiple sources, including relational databases, streaming data, and log files. Which THREE AWS services can be used to ingest data into the data lake?

Select 3 answers

A.Amazon Kinesis Data Firehose

B.AWS Database Migration Service (DMS)

C.Amazon Athena

D.Amazon Redshift Spectrum

E.AWS Glue

Answers: A, B, E

Kinesis Data Firehose ingests streaming data into S3.

Why this answer

Options A, B, and E are correct: Kinesis Data Firehose for streaming data, DMS for relational databases, and Glue for batch ETL. Option C is wrong because Athena is a query service, not an ingestion service. Option D is wrong because Redshift Spectrum is for querying S3 from Redshift.

1138

Multi-Select · easy

A data engineer needs to ingest streaming data from an e-commerce application into Amazon S3 for near-real-time analytics. The solution must handle variable throughput and allow reprocessing of failed records. Which TWO AWS services should the engineer use? (Choose two.)

Select 2 answers

A.Amazon Kinesis Data Streams

B.Amazon SQS

C.Amazon DynamoDB Streams

D.Amazon Kinesis Data Firehose

E.Amazon RDS

Answers: A, D

Kinesis Data Streams ingests and stores streaming data in shards for real-time processing.

Why this answer

Amazon Kinesis Data Streams ingests streaming data reliably and retains records so they can be reprocessed, and Amazon Kinesis Data Firehose delivers the data to S3 with retry logic. Option B (Amazon SQS) is for message queues, not streaming ingestion. Option E (Amazon RDS) is a database. Option C (Amazon DynamoDB Streams) is for CDC from DynamoDB, not general streaming.

1139

MCQ · hard

A data engineer is monitoring an Amazon Redshift cluster using Amazon CloudWatch. The engineer notices that the 'WriteThroughput' metric is consistently below the provisioned IOPS for the cluster's EBS volumes. The query performance is slower than expected. Which action is MOST likely to improve write performance?

A.Reduce the number of concurrent queries to the database.

B.Upgrade to a larger node type with more CPU and memory.

C.Add sort keys to the tables to improve data distribution.

D.Increase the provisioned IOPS on the EBS volumes.

Answer: B

Larger nodes provide more processing power, improving write throughput.

Why this answer

Option B is correct because if write throughput is low while IOPS are not saturated, the bottleneck may be CPU or memory; a larger node type (DC2 or RA3) provides more of both. Option D is wrong because increasing IOPS that are not saturated won't help. Option A is wrong because concurrency may cause blocking but does not necessarily explain low throughput. Option C is wrong because sort keys improve read performance, not write performance.

1140

MCQ · easy

A data engineer is setting up an Amazon RDS for MySQL database. The compliance team requires that all data at rest be encrypted. What must the engineer do to enable encryption for this database?

A.Specify an AWS KMS key when launching the DB instance

B.Enable encryption after the DB instance is created by modifying the DB instance

C.Use AWS Secrets Manager to store the encryption key and attach it to the DB instance

D.Encrypt the underlying EBS volumes after the instance is created

Answer: A

Encryption must be enabled at launch by choosing a KMS key.

Why this answer

Encryption at rest for Amazon RDS can only be enabled at launch time. After creation, you cannot enable encryption; you must create a new encrypted instance and migrate data.


1141

MCQ · hard

A data engineer needs to set up a new Amazon RDS for PostgreSQL database for a production workload. The database must be highly available and resilient to a single Availability Zone failure. Which configuration should the engineer choose?

A.Single-AZ with automated backups

B.Multi-AZ deployment with one standby in a different AZ

C.Multi-AZ with two readable standbys

D.Single-AZ with a read replica

Answer: B

Provides automatic failover and high availability.

Why this answer

Multi-AZ deployment with a standby in a different AZ provides high availability and automatic failover. Read replicas are for read scaling, not failover. Single-AZ lacks resilience.


1142

Matching · medium

Match each AWS networking concept to its definition.


Concepts and matches

VPC: Virtual private cloud isolated network

Subnet: Segment of VPC IP address range

Security group: Stateful firewall for instances

Network ACL: Stateless firewall for subnets

Internet gateway: Enables VPC to internet communication

Why these pairings

Networking fundamentals for AWS.


1143

MCQ · medium

Refer to the exhibit. A data engineer is attaching this IAM policy to an IAM role used by an AWS Glue job. The job reads from a Kinesis Data Streams stream and writes transformed data to an S3 bucket. When the job runs, it fails with an AccessDenied error for the Kinesis stream. What is the MOST likely cause?

A.The stream ARN in the policy is incorrect.

B.The IAM policy is missing the 'kinesis:DescribeStream' action.

C.The Glue job does not have permissions to call 'kinesis:PutRecord'.

D.The S3 bucket policy blocks the PutObject action from the Glue role.

Answer: B

kinesis:DescribeStream is required to read stream metadata.

Why this answer

Option B is correct because Glue needs 'kinesis:DescribeStream' to read the stream's metadata before it can consume records, and the policy does not grant it. Option D is wrong because the error is for the Kinesis stream, not the S3 bucket. Option A is wrong because the resource ARN is correctly formatted. Option C is wrong because the job only reads from the stream, so 'kinesis:PutRecord' is not needed.

1144

MCQ · hard

A data engineering team is building a real-time analytics pipeline using Amazon Kinesis Data Streams, AWS Lambda, and Amazon DynamoDB. The Lambda function consumes records from the stream and writes aggregated data to a DynamoDB table. The application requires that each record be processed exactly once to avoid duplicates. The Lambda function is idempotent, but occasionally duplicate records are written due to retries from Kinesis. The team needs to ensure exactly-once semantics for DynamoDB writes. Which solution should they implement?

A.Enable DynamoDB Streams and use a second Lambda to deduplicate.

B.Use DynamoDB TransactWriteItems with a condition check on a unique transaction ID.

C.Use the Kinesis Client Library (KCL) to checkpoint after processing and ignore duplicates.

D.Ensure the Lambda function is idempotent by using upsert operations.

Answer: B

Condition check ensures only one write succeeds per unique ID.

Why this answer

Option B is correct because DynamoDB TransactWriteItems with a condition check on a unique transaction ID ensures that the write only succeeds if the transaction ID does not already exist in the table. This provides exactly-once semantics by preventing duplicate writes even when Kinesis retries deliver the same record multiple times. The condition check acts as a distributed lock at the item level, guaranteeing idempotency without relying on downstream deduplication.

Exam trap

The trap here is that candidates often assume idempotent Lambda functions alone guarantee exactly-once processing, but they overlook that Kinesis retries can still cause duplicate writes unless a conditional write with a unique identifier is used at the database level.

How to eliminate wrong answers

Option A is wrong because enabling DynamoDB Streams and using a second Lambda to deduplicate introduces eventual consistency and additional latency, and does not prevent duplicate writes at the point of ingestion; it only attempts to clean up duplicates after they have already been written. Option C is wrong because the Kinesis Client Library (KCL) checkpointing tracks processing progress but does not prevent duplicate records from being delivered to the Lambda function during retries, so duplicates can still be written to DynamoDB. Option D is wrong because simply ensuring the Lambda function is idempotent using upsert operations (e.g., UpdateItem) does not guarantee exactly-once semantics; upsert can still overwrite data or create duplicate items if the record lacks a unique identifier or condition check.
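The conditional write described above might look roughly like the sketch below. It assumes, purely for illustration, a table whose partition key is a string attribute named `txn_id` and a payload of string attributes; neither comes from the question. The request is built as plain data so the condition is easy to inspect before it is sent with boto3's `transact_write_items`.

```python
from typing import Any, Dict


def build_exactly_once_write(
    table_name: str, txn_id: str, payload: Dict[str, str]
) -> Dict[str, Any]:
    """Build a TransactWriteItems request that succeeds once per txn_id."""
    # Assumption: txn_id is the table's partition key (hypothetical schema).
    item = {"txn_id": {"S": txn_id}}
    item.update({name: {"S": value} for name, value in payload.items()})
    return {
        "TransactItems": [
            {
                "Put": {
                    "TableName": table_name,
                    "Item": item,
                    # Reject the write if an item with this key already exists.
                    "ConditionExpression": "attribute_not_exists(txn_id)",
                }
            }
        ]
    }


def write_once(request: Dict[str, Any]) -> bool:
    """Send the request; return False when the record was already written."""
    # Imported here so the builder above works without the AWS SDK.
    import boto3
    from botocore.exceptions import ClientError

    client = boto3.client("dynamodb")
    try:
        client.transact_write_items(**request)
        return True
    except ClientError as error:
        if error.response["Error"]["Code"] == "TransactionCanceledException":
            return False  # condition failed: treat as a duplicate delivery
        raise
```

A stable identifier derived from the Kinesis record, such as the shard ID combined with the sequence number, is one plausible choice for the transaction ID, since it stays the same across retries.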


1145

MCQ · easy

A data engineer needs to transfer 10 TB of data from an on-premises data center to Amazon S3. The network bandwidth is limited to 100 Mbps, and the data transfer must be completed within 5 days. What is the most cost-effective solution?

A.Use AWS Snowball Edge to physically ship the data.

B.Use S3 Transfer Acceleration to speed up the transfer over the internet.

C.Use AWS DataSync over the internet to transfer the data.

D.Set up an AWS Direct Connect connection to increase bandwidth.

Answer: A

Snowball bypasses network limitations and is cost-effective for large data volumes.

Why this answer

With 10 TB of data and a 100 Mbps link, the theoretical transfer time over the internet is a little over 9 days (10 TB × 8 bits per byte ÷ 100 Mbps = 800,000 seconds ≈ 9.26 days), which exceeds the 5-day requirement. AWS Snowball Edge is the most cost-effective solution because it bypasses the network bottleneck entirely by physically shipping the data, and it is designed for large-scale data transfers where network constraints make online transfer impractical.
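The bandwidth arithmetic above can be checked with a short calculation. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully utilized link, so real transfers would take longer; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(terabytes: float, megabits_per_second: float) -> float:
    """Days needed to move the data over a fully utilized link."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / SECONDS_PER_DAY


def required_mbps(terabytes: float, days: float) -> float:
    """Sustained bandwidth needed to finish within the deadline."""
    bits = terabytes * 1e12 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


print(round(transfer_days(10, 100), 2))  # about 9.26 days at 100 Mbps
print(round(required_mbps(10, 5), 1))    # about 185.2 Mbps for 5 days
```

The second figure shows the gap directly: meeting the 5-day deadline would need nearly twice the available bandwidth.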

Exam trap

The trap here is that candidates assume S3 Transfer Acceleration or DataSync can magically overcome bandwidth limitations, but they only optimize the path, not increase the pipe size, so the math of bandwidth vs. data volume always dictates the minimum transfer time.

How to eliminate wrong answers

Option B is wrong because S3 Transfer Acceleration only optimizes the network path using AWS edge locations and does not increase the available bandwidth; it cannot overcome the fundamental 100 Mbps bottleneck, so the transfer would still take over 9 days. Option C is wrong because AWS DataSync over the internet is still limited by the 100 Mbps bandwidth, and even with optimization, it cannot complete 10 TB within 5 days. Option D is wrong because setting up AWS Direct Connect requires significant upfront cost and provisioning time (often weeks), making it neither cost-effective nor timely for a one-time transfer within 5 days.


1146

MCQ · medium

A media company stores large video files in Amazon S3 and uses Amazon CloudFront for content delivery. Users in different regions report slow download speeds for popular content. The data engineer needs to improve performance while minimizing cost. Which solution should the engineer implement?

A.Change the S3 storage class to S3 Standard-IA

B.Enable S3 Transfer Acceleration on the bucket

C.Create multiple S3 buckets in different regions and configure CloudFront with multiple origins

D.Enable CloudFront Origin Shield

Answer: D

Origin Shield provides an additional cache layer that improves cache hit ratio and reduces load on the origin, thereby improving download performance for users.

Why this answer

CloudFront Origin Shield acts as a centralized caching layer in front of the S3 origin, reducing the number of requests that reach the origin and improving cache hit ratios. This minimizes latency for users in different regions by serving popular content from the edge or shield cache, while also reducing origin load and data transfer costs.

Exam trap

The trap here is that candidates may confuse S3 Transfer Acceleration (which optimizes uploads) with download acceleration, or assume that multiple regional origins are needed when CloudFront's global edge network already handles geographic distribution.

How to eliminate wrong answers

Option A is wrong because changing the storage class to S3 Standard-IA reduces storage costs for infrequently accessed data but does not improve download speeds or reduce latency for users. Option B is wrong because S3 Transfer Acceleration speeds up uploads to S3 over long distances using AWS edge locations, but it does not accelerate downloads or improve CloudFront delivery performance. Option C is wrong because creating multiple S3 buckets in different regions and configuring CloudFront with multiple origins increases complexity and cost without addressing the core issue of cache efficiency; CloudFront already uses a global edge network, and adding more origins does not inherently improve cache hit ratios or reduce origin load.


1147

MCQ · easy

An organization needs to ingest data from on-premises databases into AWS S3 for archival purposes. The data volume is several TB per day, and the network has moderate bandwidth. Which AWS service is BEST suited for this bulk data transfer?

A.AWS Direct Connect

B.Amazon S3 Transfer Acceleration

C.AWS Snowball

D.AWS DataSync

Answer: D

DataSync automates and accelerates moving data from on-premises to AWS.

Why this answer

Option D is correct because AWS DataSync is designed for large-scale online data transfers from on-premises storage to AWS. Option C is wrong because Snowball is for offline transfer, which does not suit recurring daily ingestion. Option A is wrong because Direct Connect is a dedicated network connection, not a data transfer service. Option B is wrong because S3 Transfer Acceleration speeds up uploads over the internet but is not specifically designed for on-premises bulk transfer.

1148

MCQ · hard

A company is using Amazon S3 to store sensitive data. The security team requires that all data be encrypted at rest using a customer-managed AWS KMS key. The data engineer must ensure that only a specific IAM role can decrypt the data. Which policy should the data engineer attach to the KMS key?

A.A KMS key policy that allows the IAM role to perform kms:Decrypt

B.An IAM user policy that allows kms:Decrypt for the specific key

C.An IAM policy attached to the role that allows kms:Decrypt

D.An S3 bucket policy that denies access unless encryption is used

Answer: A

KMS key policies grant permissions to use the key.

Why this answer

Option A is correct because the KMS key policy must grant the IAM role permission to decrypt. Option C is wrong because an IAM policy alone is not sufficient; the KMS key policy must also allow the role to decrypt. Option D is wrong because the S3 bucket policy controls access to S3, not to the KMS key. Option B is wrong because an IAM user policy is less secure and does not address role-based access.
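To make the key policy grant concrete, the sketch below builds a single key policy statement that allows one role to call kms:Decrypt. The account ID and role name are hypothetical placeholders, and a real key policy would also need statements for key administration, which are left out here.

```python
import json
from typing import Any, Dict


def build_decrypt_statement(account_id: str, role_name: str) -> Dict[str, Any]:
    """Key policy statement letting a single IAM role decrypt with the key."""
    return {
        "Sid": "AllowDecryptForRole",
        "Effect": "Allow",
        # Hypothetical account ID and role name supplied by the caller.
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:role/{role_name}"},
        "Action": "kms:Decrypt",
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
    }


statement = build_decrypt_statement("111122223333", "analytics-decrypt-role")
print(json.dumps(statement, indent=2))
```

Restricting decryption to that role in practice also depends on no other statement or grant on the key allowing kms:Decrypt to other principals.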

1149

Multi-Select · medium

A data engineer is designing a data pipeline that processes PII data in AWS Glue. They need to ensure data is encrypted at rest and in transit. Which TWO actions should they take? (Choose TWO.)

Select 2 answers

A.Disable SSL for Glue connections

B.Configure S3 bucket server-side encryption for job output

C.Use a KMS key for Glue job bookmarks

D.Use CloudWatch Logs for encryption

E.Enable encryption at rest for the AWS Glue Data Catalog

Answers: B, E

S3 server-side encryption encrypts job output stored in S3.

Why this answer

Options B and E are correct. Enabling encryption at rest for the Glue Data Catalog and using S3 server-side encryption for job output ensures encryption at rest. Option A is wrong because disabling SSL on Glue connections would remove encryption in transit. Option D is wrong because CloudWatch Logs is a logging service, not an encryption mechanism. Option C is wrong because a KMS key for job bookmarks encrypts only the bookmarks, not all data.

1150

MCQ · hard

A data engineer is troubleshooting a slow Amazon Redshift query that joins a large fact table with several dimension tables. The EXPLAIN plan shows a hash join on the distribution key, but the query still runs slowly. The fact table is distributed by KEY(column_x) and the dimension tables are distributed ALL. The engineer notices that the fact table has a high number of rows with the same value in column_x. What is the most likely cause of the slow performance?

A.The fact table's distribution key column has data skew, causing uneven data distribution across nodes.

B.The dimension tables should be distributed by KEY instead of ALL.

C.The Redshift cluster does not have enough disk space.

D.The fact table does not have a sort key.

Answer: A

Skew leads to some nodes doing more work, slowing the query.

Why this answer

Option A is correct because data skew in the distribution key column_x causes some slices to hold a disproportionate number of rows, leading to uneven workload distribution during the hash join. The EXPLAIN plan shows a hash join on the distribution key, which should be efficient if data is evenly distributed, but skew forces the node with the most rows to become a bottleneck, slowing the entire query.

Exam trap

The trap here is that candidates often assume a hash join on the distribution key is always optimal, overlooking that data skew in the distribution key itself can negate the benefit and cause severe performance degradation.

How to eliminate wrong answers

Option B is wrong because distributing dimension tables by KEY would likely worsen performance by requiring redistribution or broadcasting during joins, whereas ALL distribution is optimal for small dimension tables to avoid data movement. Option C is wrong because insufficient disk space would manifest as disk-full errors or failed writes, not as slow query performance with a hash join plan. Option D is wrong because while a sort key can improve query performance for range-restricted scans, the EXPLAIN plan indicates the bottleneck is the hash join on the distribution key, not a missing sort key.
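The effect of distribution key skew can be illustrated with a small simulation. This is only a sketch: Redshift's internal hash function is not public, so `zlib.crc32` is used as a stand-in, and the slice count and key values are made up for the example.

```python
import zlib
from typing import Iterable, List


def rows_per_slice(keys: Iterable[object], slice_count: int) -> List[int]:
    """Count rows landing on each slice under hash distribution."""
    counts = [0] * slice_count
    for key in keys:
        # crc32 is a stand-in hash, not the function Redshift uses.
        counts[zlib.crc32(str(key).encode()) % slice_count] += 1
    return counts


def skew_ratio(per_slice: List[int]) -> float:
    """Largest slice divided by the average slice (1.0 means no skew)."""
    return max(per_slice) / (sum(per_slice) / len(per_slice))


# 9,000 rows share one key value; 1,000 rows have distinct values.
skewed_keys = [42] * 9_000 + list(range(1_000))
per_slice = rows_per_slice(skewed_keys, slice_count=4)
print(per_slice, round(skew_ratio(per_slice), 2))
```

Because every row with the same key hashes to the same slice, one slice ends up with at least 9,000 of the 10,000 rows, and the join can finish no faster than that slice does.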


1151

MCQ · medium

A company uses AWS DMS to migrate data from Oracle to Aurora MySQL. During the ongoing replication, the target table shows duplicate primary key errors. What is the most likely cause?

A.DMS is using 'Limited LOB mode' and truncating LOB data, causing row mismatches.

B.The source table has a trigger that inserts additional rows.

C.The target table has an auto-increment column, and DMS is inserting explicit values that conflict.

D.The DMS task is configured with 'Parallel apply' threads that cause race conditions.

Answer: C

DMS inserts values for the PK, but auto-increment may also generate values, causing duplicates.

Why this answer

Option C is correct: duplicate primary key errors often occur when the target table has an auto-increment column whose generated values conflict with the explicit values DMS inserts. Option A is wrong because limited LOB mode truncates large objects but does not cause duplicate keys. Option B is wrong because rows inserted by a source trigger are ordinary source changes that DMS replicates like any other row. Option D is wrong because parallel apply is a less likely cause here than a conflicting auto-increment column.

1152

MCQ · hard

A company uses AWS Glue to run ETL jobs that process data from Amazon RDS to Amazon S3. The jobs run nightly and take 3 hours to complete. The data volume is growing by 20% each month. The engineer needs to reduce job runtime and cost. The source RDS is a db.r5.large instance. Which approach would be MOST effective?

A.Reduce the number of DPUs to lower cost and accept longer runtime.

B.Increase the number of Glue workers and choose a G.1X or G.2X worker type.

C.Create a read replica of the RDS instance and point the Glue job to the replica.

D.Enable S3 Transfer Acceleration on the destination bucket.

AnswerB

More workers increase parallelism.

Why this answer

Option B is correct because increasing the number of Glue workers (DPUs) directly increases parallelism and reduces runtime, and you can choose a worker type with more memory if needed. Option C is wrong because an RDS read replica offloads the source database but does not speed up Glue processing. Option D is wrong because S3 Transfer Acceleration is for uploads, not Glue processing.

Option A is wrong because reducing DPUs would increase runtime.

Full explanation →

1153

MCQeasy

A company needs to enforce that all objects uploaded to an S3 bucket are encrypted at rest. Which bucket setting should be used?

A.Default encryption

B.S3 Object Lock

C.Bucket policy requiring s3:x-amz-server-side-encryption header

D.S3 Block Public Access

AnswerA

Default encryption automatically encrypts objects at rest.

Why this answer

Option A is correct because enabling default encryption on the S3 bucket ensures all objects are encrypted at rest. Option D is wrong because S3 Block Public Access controls public access, not encryption. Option C is wrong because a bucket policy can enforce encryption, but it is a policy rather than a bucket setting, and default encryption is simpler.

Option B is wrong because S3 Object Lock is for retention, not encryption.

Full explanation →

1154

MCQmedium

Refer to the exhibit. An IAM policy is attached to a user who needs to read objects from the 'example-bucket' S3 bucket. The user reports being unable to read any object under the 'confidential/' prefix. What is the reason for this access issue?

A.The allow statement is evaluated before the deny statement

B.The deny statement is missing an explicit allow for the confidential prefix

C.The explicit deny statement overrides the allow statement

D.The resource ARN in the deny statement is incorrect

AnswerC

Explicit deny overrides all allows.

Why this answer

Option C is correct because an explicit deny overrides any allow, even if the allow is more general. The policy allows all GetObject but denies GetObject for the confidential prefix. Option A is wrong because the order of statements does not matter; explicit deny always wins.

Option B is wrong because a deny statement does not need an accompanying allow; the deny on the confidential prefix applies regardless. Option D is wrong because the resource ARN is correct.
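The evaluation rule can be illustrated with a small simulation. This is a simplified sketch, not the real IAM engine: the exhibit is not reproduced here, so the `POLICY` statements and the `is_allowed` helper are assumptions that mirror the scenario described in the explanation (one broad allow, one deny on the confidential prefix).

```python
from fnmatch import fnmatchcase

# Hypothetical policy mirroring the scenario: allow GetObject on the whole
# bucket, deny GetObject under the confidential/ prefix.
POLICY = [
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example-bucket/*"},
    {"Effect": "Deny", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example-bucket/confidential/*"},
]


def is_allowed(policy, action, resource):
    """Simplified evaluation: any matching Deny wins, else any matching Allow."""
    matches = [s for s in policy
               if s["Action"] == action and fnmatchcase(resource, s["Resource"])]
    if any(s["Effect"] == "Deny" for s in matches):
        return False  # explicit deny overrides every allow
    # With no matching statement at all, the result is an implicit deny.
    return any(s["Effect"] == "Allow" for s in matches)
```

Reversing the statement order gives the same result, which is why statement order is not a valid explanation for the access failure.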

Full explanation →

1155

MCQmedium

A data engineer needs to ensure that an Amazon S3 bucket used for sensitive data is encrypted at rest using a customer-managed AWS KMS key. The bucket policy must enforce encryption for all PUT requests. Which policy statement should be added to the bucket policy?

A.{"Effect":"Deny","Principal":"*","Action":"s3:PutObject","Resource":"arn:aws:s3:::bucket/*","Condition":{"Null":{"s3:x-amz-server-side-encryption":"true"}}}

B.{"Effect":"Allow","Principal":"*","Action":"s3:PutObject","Resource":"arn:aws:s3:::bucket/*","Condition":{"StringEquals":{"s3:x-amz-server-side-encryption-aws-kms-key-id":"arn:aws:kms:us-east-1:123456789012:key/abc123"}}}

C.{"Effect":"Deny","Principal":"*","Action":"s3:PutObject","Resource":"arn:aws:s3:::bucket/*","Condition":{"StringNotEquals":{"s3:x-amz-server-side-encryption":"aws:kms"},"Null":{"s3:x-amz-server-side-encryption-aws-kms-key-id":"true"}}}

D.{"Effect":"Deny","Principal":"*","Action":"s3:PutObject","Resource":"arn:aws:s3:::bucket/*","Condition":{"StringNotEquals":{"s3:x-amz-server-side-encryption-aws-kms-key-id":"arn:aws:kms:us-east-1:123456789012:key/abc123"}}}

AnswerC

This denies requests that neither request aws:kms encryption nor supply a KMS key ID, so unencrypted and SSE-S3 uploads are rejected.

Why this answer

Option C is correct because it uses a Deny effect to block PUT requests that do not request SSE-KMS. The StringNotEquals test matches any request whose encryption header is not 'aws:kms', and the Null test matches requests that omit the KMS key ID header. Because condition operators within a single statement are combined with a logical AND, the statement rejects uploads that neither request 'aws:kms' nor supply a key ID, which covers unencrypted and SSE-S3 requests.

Exam trap

The trap here is that candidates often choose a simple Deny on a missing encryption header (Option A) without realizing that it does not enforce the use of a specific KMS key, or they mistakenly use an Allow effect (Option B), which cannot block non-compliant requests because an Allow statement only grants access; requests permitted by another policy, such as an identity-based policy, still succeed.

How to eliminate wrong answers

Option A is wrong because it denies requests only when the 's3:x-amz-server-side-encryption' header is null, which would allow requests with any encryption header (including AES256 or a different KMS key) to succeed, failing to enforce the specific customer-managed KMS key. Option B is wrong because it uses an Allow effect, which cannot override an explicit Deny and does not enforce encryption; it merely allows requests that match the condition but does not block non-compliant requests. Option D is wrong because it denies requests only when the KMS key ID does not match, but it does not require the encryption header to be present at all, allowing unencrypted PUT requests to bypass the policy.

Full explanation →

1156

MCQmedium

A financial company needs to ingest real-time stock trade data from multiple sources and store it in Amazon S3 for compliance. The data must be delivered within 1 minute of the trade occurring. The data volume is approximately 10,000 records per second, with occasional spikes to 50,000 records per second. The engineer has set up Amazon Kinesis Data Streams with 10 shards and a Kinesis Data Firehose delivery stream that reads from the Kinesis stream and writes to S3. However, during spikes, the Firehose delivery stream falls behind, causing data to be delayed beyond the 1-minute SLA. What should the engineer do to meet the SLA without over-provisioning?

A.Increase the buffer size in Kinesis Data Firehose from 1 MB to 5 MB to batch more data per delivery.

B.Use Amazon SQS as a buffer between Kinesis and Firehose to absorb spikes.

C.Replace Firehose with an AWS Lambda function that writes directly to S3 for lower latency.

D.Increase the number of shards in the Kinesis data stream to handle peak throughput and enable auto-scaling.

AnswerD

More shards increase read capacity; auto-scaling adjusts during spikes.

Why this answer

Option D is correct: increase the shard count to handle peak throughput, and enable automatic scaling for Kinesis Data Streams. Option A (increase buffer size) would increase latency. Option C (use Lambda) may not handle the throughput.

Option B (use SQS) would not reduce latency.
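The shard arithmetic behind this answer can be sketched as follows. The per-shard write limits (1,000 records per second and 1 MB per second) are the documented provisioned-mode limits; the 200-byte average record size and the `shards_needed` helper are assumptions for illustration only.

```python
import math

# Per-shard write limits for Kinesis Data Streams in provisioned mode.
RECORDS_PER_SHARD_PER_SEC = 1_000
BYTES_PER_SHARD_PER_SEC = 1_000_000  # 1 MB/s, approximated here as 10**6 bytes


def shards_needed(records_per_sec, avg_record_bytes):
    """Return the shard count that satisfies both the record and byte limits."""
    by_count = math.ceil(records_per_sec / RECORDS_PER_SHARD_PER_SEC)
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / BYTES_PER_SHARD_PER_SEC)
    return max(by_count, by_bytes, 1)


# Assumed average record size of 200 bytes (not stated in the question).
baseline = shards_needed(10_000, 200)  # steady-state load
peak = shards_needed(50_000, 200)      # spike load
```

Under these assumptions the 10 provisioned shards cover only the steady-state rate, while the spikes need roughly five times as many, which is why scaling the stream rather than tuning Firehose buffers addresses the delay.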

Full explanation →

1157

MCQhard

Refer to the exhibit. A data engineer runs the command above. The DataAdminRole is used by an application to decrypt data. The security team wants to ensure that a SecurityAdminRole can revoke the grant. What must be done to allow the SecurityAdminRole to retire the grant?

A.Set the RetiringPrincipal to the root user

B.Add a grant with Revoke operation for the SecurityAdminRole

C.No action needed; the SecurityAdminRole can retire the grant

D.Create a new grant with SecurityAdminRole as GranteePrincipal

AnswerC

The RetiringPrincipal is already set.

Why this answer

Option C is correct because the RetiringPrincipal field specifies which principal can retire the grant, and it is already set to SecurityAdminRole. Option B is wrong because revoking a grant is a different operation from retiring it.

Option D is wrong because the grant already exists, so a new grant naming SecurityAdminRole as grantee is unnecessary. Option A is wrong because the retiring principal is already set and does not need to be changed to the root user.

Full explanation →

1158

Multi-Selecthard

Which THREE factors should a data engineer consider when choosing between Amazon Redshift and Amazon Athena for querying large datasets in Amazon S3? (Choose three.)

Select 3 answers

A.Both support standard SQL queries.

B.Redshift requires provisioning and managing clusters, while Athena is serverless.

C.Athena charges per query based on data scanned, while Redshift charges for cluster compute capacity.

D.Athena can only query data stored in Amazon S3, while Redshift can also query data in S3.

E.Redshift is optimized for highly structured, frequently queried data, while Athena is better for ad-hoc queries on raw data.

AnswersB, C, E

Redshift needs cluster management; Athena is serverless.

Why this answer

Option B is correct because Amazon Redshift requires manual provisioning, configuration, and ongoing management of clusters, including node sizing, scaling, and maintenance windows. In contrast, Amazon Athena is a serverless service that automatically handles infrastructure, requiring no cluster management and allowing users to query data directly from Amazon S3 without any setup overhead.

Exam trap

The trap here is that candidates may assume Athena is limited to S3-only queries or that both services have identical SQL support, overlooking the fundamental architectural differences in provisioning, cost models, and workload optimization that are the real decision factors.

Full explanation →

1159

Multi-Selectmedium

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The data must be transformed and stored in Amazon S3 for batch analytics. The engineer wants to use AWS Lambda for transformation. Which TWO configurations are required? (Choose two.)

Select 2 answers

A.Configure the Lambda function to write to S3 via Kinesis Data Firehose.

B.Configure the Kinesis stream to send records directly to S3.

C.Set up an SQS queue as a destination for Lambda errors.

D.Create an event source mapping from the Kinesis stream to the Lambda function.

E.Assign an IAM role to Lambda with permissions to read from Kinesis and write to S3.

AnswersD, E

Event source mapping enables Lambda to poll from Kinesis.

Why this answer

Options D and E are correct. Option D: Lambda needs an event source mapping to poll from Kinesis. Option E: Lambda needs an IAM role with permissions to read from Kinesis and write to S3.

Option A is wrong because Firehose is not required; the Lambda function can write the transformed records to S3 itself. Option B is wrong because Kinesis Data Streams does not deliver records to S3 on its own, and doing so would bypass the Lambda transformation. Option C is wrong because an SQS queue for failed records is optional, not required.
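A minimal sketch of the Lambda side of this pipeline is shown below. Records delivered through a Kinesis event source mapping arrive base64-encoded under `Records[].kinesis.data`; the `transform` function, the field names `user_id` and `page`, and the newline-delimited output layout are hypothetical choices, and the actual S3 write is left as a comment because it needs boto3 and the execution role.

```python
import base64
import json


def transform(record: dict) -> dict:
    """Hypothetical transformation: keep two fields and lowercase the page path."""
    return {"user_id": record["user_id"], "page": record["page"].lower()}


def handler(event, context=None):
    """Decode Kinesis records delivered by an event source mapping."""
    rows = []
    for item in event["Records"]:
        payload = base64.b64decode(item["kinesis"]["data"])
        rows.append(transform(json.loads(payload)))
    # Newline-delimited JSON is one common layout for batch analytics.
    body = "\n".join(json.dumps(r) for r in rows)
    # In a deployed function the body would be written with boto3, e.g.
    # s3.put_object(Bucket=..., Key=..., Body=body), which is why the
    # execution role needs S3 write permission.
    return body
```

Returning the body instead of writing it keeps the sketch testable without AWS credentials.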

Full explanation →

1160

Multi-Selecteasy

Which TWO data stores are considered fully managed, serverless, and suitable for storing JSON documents?

Select 2 answers

A.Amazon Redshift

B.Amazon ElastiCache for Redis

C.Amazon DocumentDB (with MongoDB compatibility)

D.Amazon DynamoDB

E.Amazon RDS for MySQL

AnswersC, D

DocumentDB is a managed document database, supports JSON.

Why this answer

Amazon DocumentDB (with MongoDB compatibility) is a fully managed, serverless document database that natively stores JSON documents. It supports MongoDB workloads, allowing you to store, query, and index JSON data without managing infrastructure, making it ideal for content management and catalog applications.

Exam trap

AWS often tests the distinction between fully managed serverless services (DocumentDB, DynamoDB) and those requiring provisioning or cluster management (Redshift, ElastiCache, RDS), leading candidates to mistakenly select ElastiCache for Redis due to its JSON module support, ignoring its non-serverless nature and primary use as a cache.

Full explanation →

1161

MCQmedium

A company needs to ingest real-time sensor data from thousands of IoT devices into Amazon S3, with a latency of less than 1 minute. The data must be transformed (e.g., convert to Parquet) before landing in S3. Which combination of services is MOST cost-effective?

A.Amazon Kinesis Data Streams to AWS Lambda to S3.

B.Amazon Kinesis Data Streams to Amazon Kinesis Data Analytics to S3.

C.Amazon Kinesis Data Streams to AWS Glue streaming ETL to S3.

D.Amazon Kinesis Data Streams to Amazon Kinesis Data Firehose to S3.

AnswerD

Firehose can transform and deliver with low latency, cost-effective for high throughput.

Why this answer

Option D is correct because Kinesis Data Streams ingests data in real time, and Kinesis Data Firehose can buffer, transform (e.g., convert to Parquet), and deliver to S3 with low latency. Option B is wrong because Kinesis Data Analytics is intended for running SQL analytics on streams, not for simple format conversion. Option C is wrong because a Glue streaming ETL job keeps workers running continuously, which typically costs more than Firehose for a simple conversion.

Option A is wrong because Lambda can transform but may not scale cost-effectively for thousands of devices.

Full explanation →

1162

MCQhard

Refer to the exhibit. A data engineer runs the above command and sees that the DataLakeAdmin role has the AmazonS3FullAccess and AWSLakeFormationDataAdmin policies attached. The engineer wants to ensure that the role can only access S3 data through Lake Formation. What should the engineer do?

A.Create a new IAM policy that explicitly denies s3:GetObject and attach it to the role

B.Detach the AWSLakeFormationDataAdmin policy from the role

C.Detach the AmazonS3FullAccess policy from the role

D.Modify the S3 bucket policy to deny all access except from Lake Formation

AnswerC

This removes direct S3 access, forcing the role to use Lake Formation for data access.

Why this answer

To enforce that access is only through Lake Formation, the engineer should detach the AmazonS3FullAccess policy because it allows direct S3 access, bypassing Lake Formation. The AWSLakeFormationDataAdmin policy is needed for Lake Formation administration. Changing the S3 bucket policy to deny all access except from Lake Formation is not straightforward because Lake Formation uses the principal's IAM role.

Creating a new policy that denies S3 access would be redundant if the full access policy is removed.

Full explanation →

1163

MCQhard

A data engineer notices that an Amazon Athena query on a partitioned table in S3 scans more data than expected. The table is partitioned by year, month, day. The query includes a WHERE clause on a non-partition column but also filters on day='2023-01-01'. What is the most likely cause of the excessive data scan?

A.The data is stored in JSON format instead of Parquet

B.The table is not partitioned by the column used in the WHERE clause

C.The partition column data type in the table definition does not match the actual partition folder names

D.The data is not sorted within partitions

AnswerC

If the partition column is defined as a string but the folder names are dates, pruning fails and a full scan occurs.

Why this answer

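Option C is correct because mismatched data types cause partition pruning to fail. Option B is wrong because the query also filters on the partition column day, so pruning should still apply when the types match. Option A is wrong because the file format does not determine whether partitions are pruned.

Option D is wrong because sorting within partitions is irrelevant to partition pruning.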

Full explanation →

1164

MCQeasy

A data engineer has set up an Amazon S3 lifecycle policy to transition objects to Glacier Instant Retrieval after 30 days. After 60 days, objects should transition to Deep Archive. However, objects are not transitioning to Deep Archive. What is the most likely cause?

A.The bucket has versioning enabled.

B.Deep Archive is not supported in the bucket's region.

C.Objects are smaller than 128 KB.

D.The transition to Deep Archive requires a minimum of 30 days after the previous transition.

AnswerD

S3 requires at least 30 days between transitions to different storage classes.

Why this answer

Amazon S3 lifecycle policies require a minimum of 30 days between successive transitions when moving objects from one storage class to another. Since the policy transitions objects to Glacier Instant Retrieval after 30 days, the subsequent transition to Deep Archive cannot occur until at least 60 days after creation (30 days for the first transition plus the 30-day minimum gap).

The rule as described places the Deep Archive transition at day 60, which sits exactly at that minimum, so it should work once objects are older than 60 days. Of the options given, the timing constraint between transitions is the only one that can explain the failure, for example when the objects are still too new or the day counts in the rule are misconfigured.

Exam trap

The trap here is that candidates often overlook the 30-day minimum transition interval requirement and assume any transition timing is allowed, or they mistakenly attribute the failure to versioning or region limitations.

How to eliminate wrong answers

Option A is wrong because S3 versioning does not prevent lifecycle transitions; lifecycle policies can be applied to both current and noncurrent versions independently. Option B is wrong because Deep Archive is supported in all AWS regions where S3 is available, including the standard commercial regions. Option C is wrong because the 128 KB minimum object size restriction applies only to S3 Intelligent-Tiering and S3 Glacier Instant Retrieval for automatic tiering, not to lifecycle transitions to Deep Archive; lifecycle policies can transition objects of any size.

Full explanation →

1165

MCQmedium

Refer to the exhibit. A data engineer runs the above AWS CLI command to view the table metadata in the AWS Glue Data Catalog. The data is stored as CSV in S3 with partitions by year and month. When querying the table using Amazon Athena, no data is returned. What is the most likely cause?

A.The partitions have not been added to the Glue Data Catalog.

B.The SerDe is not compatible with CSV files.

C.The S3 location points to a file instead of a folder.

D.The column data types are incorrect for the CSV data.

AnswerA

Partitions must be explicitly registered for Athena to query them.

Why this answer

Option A is correct because the AWS CLI command shown only retrieves table metadata, not partition metadata. In AWS Glue, partitions must be explicitly added to the Data Catalog via `MSCK REPAIR TABLE`, `ALTER TABLE ADD PARTITION`, or a Glue crawler. Without partition metadata, Athena cannot locate the data files under the partitioned S3 paths (e.g., `s3://bucket/year=2024/month=01/`), resulting in zero rows returned even though the table schema is defined.

Exam trap

The trap here is that candidates assume the `PARTITIONED BY` clause in the table definition automatically registers the partitions in the Glue Data Catalog, but it only defines the schema; partition metadata must be added separately.

How to eliminate wrong answers

Option B is wrong because the default SerDe for CSV in Athena (`LazySimpleSerDe`) is fully compatible with standard CSV files; no SerDe mismatch would cause zero rows. Option C is wrong because the `LOCATION` in the Glue table points to a folder (the base path), not a file; Athena expects a folder and would fail with an error if a file were specified, not silently return no data. Option D is wrong because incorrect column data types would cause query failures or data conversion errors, not an empty result set; Athena would still attempt to read the data and return rows with nulls or errors.
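One way to register the missing partitions is to generate an `ALTER TABLE ADD PARTITION` statement and run it in Athena. The sketch below only builds the statement text; the table name, bucket path, and the assumption that `year` and `month` are string-typed partition columns are illustrative, since the exhibit is not reproduced here. `MSCK REPAIR TABLE` or a Glue crawler are the alternatives named above.

```python
def add_partition_ddl(table, base_location, partitions):
    """Build an Athena ALTER TABLE statement registering year/month partitions."""
    root = base_location.rstrip("/")
    clauses = [
        f"PARTITION (year='{y}', month='{m}') "
        f"LOCATION '{root}/year={y}/month={m}/'"
        for y, m in partitions
    ]
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n  " + "\n  ".join(clauses)


# Hypothetical table and bucket names.
ddl = add_partition_ddl("sales", "s3://bucket/", [("2024", "01"), ("2024", "02")])
```

The `IF NOT EXISTS` clause makes the statement safe to re-run when some partitions are already registered.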

Full explanation →

1166

MCQmedium

A data engineer is designing a data ingestion pipeline to load data from an on-premises Oracle database into Amazon Redshift. The pipeline must capture changes (inserts, updates, deletes) with low latency and minimal impact on the source database. Which combination of AWS services should the engineer use?

A.AWS Database Migration Service (DMS) to Amazon S3, then COPY into Redshift

B.Amazon Kinesis Data Streams with a custom producer on Oracle

C.AWS Glue with JDBC connection to Oracle, writing to Redshift

D.AWS Lambda reading from Oracle logs and writing to Redshift

AnswerA

DMS supports ongoing replication from Oracle to S3 with CDC, and Redshift can COPY from S3.

Why this answer

Option A is correct because AWS DMS with ongoing replication (change data capture) can capture changes from Oracle with minimal impact and replicate to S3, then COPY into Redshift. Option C is wrong because AWS Glue is batch-oriented and does not support real-time CDC natively. Option D is wrong because Lambda can process events but is not designed for continuous CDC from a database.

Option B is wrong because Kinesis Data Streams requires custom producers and does not directly integrate with Oracle for CDC.

Full explanation →

1167

MCQmedium

Refer to the exhibit. A data engineer notices that the Redshift cluster 'mycluster' does not have automated backups beyond 7 days. However, the compliance team requires a minimum of 35 days of backup retention. What should the engineer do?

A.Change the node type to ra3.xlplus to enable automatic backups for 35 days.

B.Enable audit logging to capture changes for recovery.

C.Take manual snapshots every day and retain them for 35 days.

D.Modify the cluster's automated snapshot retention period to 35 days.

AnswerD

The retention period can be increased up to 35 days via modification.

Why this answer

Option D is correct because Amazon Redshift allows you to modify the automated snapshot retention period for a cluster up to 35 days. The engineer can use the AWS Management Console, CLI, or API to change the `automated_snapshot_retention_period` parameter from the current 7 days to 35 days, meeting the compliance requirement without additional manual intervention.

Exam trap

The trap here is that candidates may confuse backup retention with node type capabilities or audit logging, assuming that hardware or logging features inherently extend backup duration, when in fact the retention period is a simple configuration parameter.

How to eliminate wrong answers

Option A is wrong because changing the node type to ra3.xlplus does not affect the automated backup retention period; retention is configured independently of node type. Option B is wrong because audit logging captures user activity and SQL queries for security and compliance, not for point-in-time recovery of data; it does not replace backup retention. Option C is wrong because while manual snapshots can be retained for 35 days, this approach requires daily manual effort and does not leverage the automated backup feature that is already available; modifying the automated retention period is simpler and more reliable.
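As a sketch of the change, the helper below builds the keyword arguments for the boto3 Redshift `modify_cluster` call, where the retention setting is passed as `AutomatedSnapshotRetentionPeriod`. The `retention_update` helper is hypothetical, and the actual API call is left commented out because it needs boto3 and AWS credentials.

```python
def retention_update(cluster_id: str, days: int) -> dict:
    """Build keyword arguments for the boto3 Redshift modify_cluster call."""
    if not 0 <= days <= 35:
        raise ValueError("automated snapshot retention must be between 0 and 35 days")
    return {"ClusterIdentifier": cluster_id, "AutomatedSnapshotRetentionPeriod": days}


params = retention_update("mycluster", 35)

# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# boto3.client("redshift").modify_cluster(**params)
```

Validating the range locally reflects the 35-day ceiling mentioned in the explanation; longer retention would need manual snapshots.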

Full explanation →

1168

MCQeasy

A data engineer needs to store streaming data from IoT devices for real-time analytics. The data has a fixed schema and requires low-latency queries. Which AWS service should be used?

A.Amazon DynamoDB

B.Amazon Redshift

C.Amazon S3

D.Amazon Timestream

AnswerD

Timestream is designed for time-series data with low-latency queries.

Why this answer

Amazon Timestream is a time-series database purpose-built for IoT and operational applications that generate large volumes of time-stamped data. It automatically manages data retention and storage tiers (memory and magnetic) to provide fast query performance for recent data and cost-effective storage for historical data, making it ideal for real-time analytics on streaming IoT data with a fixed schema.

Exam trap

AWS often tests the misconception that any database can handle time-series data equally well, but the trap here is that candidates choose DynamoDB for its low-latency reads, overlooking that Timestream is the only AWS service purpose-built for time-series workloads with native support for time-based partitioning, retention policies, and analytical functions.

How to eliminate wrong answers

Option A is wrong because Amazon DynamoDB is a NoSQL key-value and document database optimized for high-throughput, low-latency read/write operations on individual items, but it lacks native time-series optimizations such as automatic downsampling, interpolation, and time-based partitioning, making it less efficient for time-series queries like aggregations over time windows. Option B is wrong because Amazon Redshift is a petabyte-scale data warehouse designed for complex analytical queries on structured and semi-structured data using SQL, but it is not optimized for real-time streaming ingestion or low-latency queries on high-frequency time-series data; its batch-oriented architecture introduces higher latency for streaming use cases. Option C is wrong because Amazon S3 is an object storage service that provides durable, scalable storage for any type of data, but it does not support real-time querying directly; querying S3 requires services like Athena or S3 Select, which add latency and are not designed for sub-second, low-latency queries on streaming data.

Full explanation →

1169

MCQmedium

A company uses AWS Glue to catalog data in Amazon S3. The security team requires that all sensitive data be identified and encrypted at rest using customer-managed KMS keys. Which combination of steps should a data engineer take to meet these requirements?

A.Enable S3 Access Logs and use Athena to query the logs for sensitive data patterns.

B.Use Amazon Macie to scan the S3 bucket and automatically apply S3 default encryption.

C.Enable S3 default encryption for the bucket and use IAM policies to restrict access.

D.Configure AWS Glue to use Detect Sensitive Data and write encrypted output to S3 with SSE-KMS.

AnswerD

Glue's Detect Sensitive Data identifies sensitive columns, and the ETL job can encrypt output using customer-managed KMS keys.

Why this answer

Option D is correct because the Detect Sensitive Data transform in AWS Glue can identify sensitive data, and the Glue ETL job can then write to S3 with SSE-KMS. Option C is wrong because S3 default encryption alone does not identify sensitive data and does not guarantee customer-managed KMS keys. Option B is wrong because Macie identifies sensitive data but does not itself apply encryption, and it is not integrated directly with Glue cataloging.

Option A is wrong because S3 Access Logs do not help with identification or encryption.

Full explanation →

1170

MCQmedium

A company uses AWS Glue DataBrew to clean and transform data. A data engineer notices that a DataBrew recipe step that should remove duplicates is not working as expected. The dataset has millions of rows. What is the MOST likely reason?

A.The data source is an S3 bucket with a large number of files

B.The dataset contains null values in the key columns

C.The dataset is not sorted by the columns used for deduplication

D.The DataBrew project is using a sampling of the data

AnswerC

DataBrew's dedup is based on consecutive duplicates; sorting is required.

Why this answer

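Option C is correct because DataBrew's dedup step requires the dataset to be sorted by the key columns to identify consecutive duplicates. Option B would cause a different symptom. Option A is unrelated, because the number of source files does not change how duplicates are matched.

Option D is unrelated, because sampling affects the interactive project view while the recipe job runs against the full dataset.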

Full explanation →

1171

MCQhard

A data engineer is designing a data lake on Amazon S3. The data is frequently accessed by multiple analytics services, and the company needs to enforce fine-grained access control based on data tags. Which combination of AWS services should be used?

A.S3 Block Public Access settings

B.AWS Lake Formation with tag-based access control

C.S3 Access Points with bucket policies

D.S3 Object Lambda with IAM policies

AnswerB

Lake Formation provides fine-grained access control using tags.

Why this answer

AWS Lake Formation with tag-based access control (TBAC) is the correct choice because it provides fine-grained, attribute-based access control (ABAC) at the column, row, and cell level across a data lake on S3. By assigning LF-tags to Data Catalog resources and defining permissions based on those tags, you can enforce granular access policies that scale without managing individual user-to-resource mappings. This directly meets the requirement for tag-driven, fine-grained access for multiple analytics services.

Exam trap

The trap here is that candidates often confuse S3 Access Points (which provide network-level or prefix-level restrictions) with the fine-grained, tag-driven access control that Lake Formation TBAC uniquely offers, leading them to pick Option C despite its inability to enforce column- or row-level security based on tags.

How to eliminate wrong answers

Option A is wrong because S3 Block Public Access settings only prevent public exposure of S3 objects and do not provide any fine-grained, tag-based access control for internal users or services. Option C is wrong because S3 Access Points with bucket policies can restrict access based on VPC or IP, but they do not natively support tag-based access control at the column or row level; they operate at the bucket or prefix level only. Option D is wrong because S3 Object Lambda transforms data on read but does not enforce access control based on data tags; IAM policies attached to it cannot dynamically filter data by tags without custom code, and it lacks the centralized governance Lake Formation provides.

Full explanation →

1172

Multi-Selecthard

A company uses a Kinesis Data Firehose delivery stream to load data into an S3 bucket. The data is in JSON format and must be converted to Parquet before landing in S3. Which steps are required to achieve this? (Choose THREE.)

Select 3 answers

A.Configure the Firehose delivery stream to enable data format conversion to Parquet.

B.Create a table in the AWS Glue Data Catalog with the schema.

C.Store the schema in Amazon DynamoDB.

D.Set the Firehose's schema mapping to reference the Glue table.

E.Use Kinesis Data Analytics to convert the data.

AnswersA, B, D

Firehose has built-in conversion capability.

Why this answer

Option A, Option B, and Option D are correct because Firehose can convert to Parquet using a schema from a Glue table, and the table must be registered in the Data Catalog. Option C is wrong because Firehose does not use DynamoDB. Option E is wrong because Kinesis Data Analytics is for stream processing, not conversion.
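The three steps map onto one configuration block in the Firehose S3 destination. The sketch below builds that block as a plain dictionary in the shape used by the `DataFormatConversionConfiguration` field of the Firehose API; the `parquet_conversion_config` helper and all ARNs, database, and table names are placeholders.

```python
def parquet_conversion_config(role_arn, database, table, region):
    """Build the DataFormatConversionConfiguration block for an S3 destination."""
    return {
        "Enabled": True,  # step 1: turn on format conversion
        "SchemaConfiguration": {  # steps 2 and 3: reference the Glue table schema
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
            "Region": region,
            "VersionId": "LATEST",
        },
        # Incoming records are JSON, outgoing objects are Parquet.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    }


# Placeholder identifiers for illustration only.
cfg = parquet_conversion_config(
    "arn:aws:iam::123456789012:role/firehose-role", "analytics_db", "events", "us-east-1"
)
```

The role referenced here must be able to read the table definition from the Glue Data Catalog.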

Full explanation →

1173

MCQhard

A company runs a Redshift cluster for analytics. The data engineering team notices that COPY commands from S3 are failing for large files (>1 GB) with the error 'S3ServiceException: SlowDown'. What is the most effective solution?

A.Use Redshift Spectrum to query the data directly in S3.

B.Enable automatic compression on the target tables.

C.Increase the number of Redshift nodes to distribute the load.

D.Split the large files into smaller parts (e.g., 100 MB each) and use parallel COPY.

AnswerD

Smaller files reduce per-object throttling and allow higher parallelism.

Why this answer

Option D is correct because the SlowDown error indicates throttling from S3. Splitting large files into smaller parts increases parallelism and reduces the chance of throttling per object. Option C is wrong because increasing node count does not directly address S3 throttling.

Option B is wrong because automatic compression reduces storage on the target tables and has no effect on S3 throttling. Option A is wrong because Redshift Spectrum is for querying external tables, not for loading data with COPY.
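Splitting can be done on line boundaries so that every part remains a valid delimited file. The sketch below groups lines into parts under a byte budget; the `split_lines` helper is hypothetical, and the tiny 10-byte budget is only there to make the behaviour visible (the answer suggests parts of around 100 MB).

```python
def split_lines(lines, max_bytes):
    """Group whole lines into parts of at most max_bytes (UTF-8) each.

    A single line larger than max_bytes is placed in a part of its own.
    """
    parts, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8"))
        if current and size + n > max_bytes:
            parts.append(current)  # close the current part before it overflows
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        parts.append(current)
    return parts


# Five 5-byte lines with a 10-byte budget produce parts of 2, 2 and 1 lines.
parts = split_lines(["aaaa\n"] * 5, max_bytes=10)
```

Each part would then be uploaded under a common key prefix so that a single COPY command can load all parts in parallel.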

Full explanation →

1174

MCQeasy

A data engineer needs to store JSON documents that are frequently accessed by a low-latency web application. The data does not require complex queries, and the access pattern is primarily by a key. Which AWS service is most appropriate?

A.Amazon ElastiCache for Redis

B.Amazon S3

C.Amazon RDS for MySQL

D.Amazon DynamoDB

AnswerD

DynamoDB provides low-latency key-value access for JSON documents.

Why this answer

Option D is correct because Amazon DynamoDB is a key-value and document database that offers low-latency access. Option C (RDS) is relational and not optimized for JSON. Option B (S3) has higher latency for frequent reads.

Option A (ElastiCache) is a cache, not a primary store.

Full explanation →

1175

Multi-Selecthard

A company uses Amazon DynamoDB to store session data for a web application. The application experiences throttling during peak hours. The data engineer needs to reduce throttling. Which THREE actions should the engineer take?

Select 3 answers

A.Use DynamoDB Accelerator (DAX) to cache read requests.

B.Increase the provisioned read capacity units.

C.Implement exponential backoff in the application.

D.Design the partition key to include a random suffix to distribute writes.

E.Enable auto scaling on the table.

AnswersA, C, D

Reduces read capacity consumption.

Why this answer

DynamoDB Accelerator (DAX) is an in-memory cache that reduces read latency and offloads read requests from the DynamoDB table, directly decreasing the number of read capacity units consumed. By caching frequently accessed session data, DAX mitigates throttling during peak hours without requiring changes to the table's provisioned capacity.

Exam trap

The trap here is that candidates confuse reactive scaling (auto scaling) or capacity increases with proactive throttling reduction techniques, while the correct answers focus on caching, request distribution, and retry logic that directly reduce the load on the table.
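Two of the techniques named above, retry with exponential backoff and a random partition-key suffix, can be sketched in a few lines. This is a minimal illustration, not the application's actual code: the function names, the `#` separator, and the default values are assumptions, and the backoff uses the common "full jitter" variant.

```python
import random


def backoff_delays(max_retries, base_seconds=0.05, cap_seconds=5.0, rng=random):
    """Return one delay per retry, drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    return [
        rng.uniform(0, min(cap_seconds, base_seconds * (2 ** attempt)))
        for attempt in range(max_retries)
    ]


def sharded_key(session_id, shard_count=10, rng=random):
    """Append a random suffix so writes for one logical key spread across partitions."""
    return f"{session_id}#{rng.randrange(shard_count)}"
```

The trade-off of the suffix approach is on the read side: a reader has to query every suffix for a logical key and merge the results.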

Full explanation →

1176

MCQhard

A company uses Amazon DynamoDB with on-demand capacity for a gaming application that experiences unpredictable traffic spikes. The application reads the same set of 'hot' items frequently. Users report high latency during peak hours. Which action would MOST effectively reduce read latency for the hot items?

A.Enable DynamoDB Accelerator (DAX) for the table.

B.Switch to provisioned capacity with auto-scaling.

C.Increase the read capacity units for the table.

D.Enable DynamoDB Global Tables for multi-region replication.

AnswerA

DAX caches hot items, reducing read latency.

Why this answer

DynamoDB Accelerator (DAX) is an in-memory cache that reduces read latency for frequently accessed items, so Option A is correct. Option C: increasing read capacity units does not apply in on-demand mode, which scales automatically.

Option B: switching to provisioned capacity with auto-scaling does not address hot item latency. Option D: using Global Tables improves write availability across regions but does not reduce read latency for hot items.

Full explanation →

1177

Multi-Selecteasy

A data engineer needs to monitor the performance of an Amazon Redshift cluster. Which Amazon CloudWatch metric should the engineer monitor to detect disk space issues?

Select 1 answer

A.ReadIOPS

B.WriteIOPS

C.PercentageDiskSpace

D.NetworkThroughput

E.CPUUtilization

AnswersC

Directly measures disk usage percentage.

Why this answer

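Option C is correct. PercentageDiskSpace (published by Amazon Redshift as PercentageDiskSpaceUsed) directly reports the percentage of disk space in use. Options A and B (ReadIOPS and WriteIOPS) indicate I/O activity but not disk space.

Option D (NetworkThroughput) does not relate to disk. Option E (CPUUtilization) measures compute.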
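For context, an alarm on this metric could be described with parameters like the ones below. Amazon Redshift publishes the metric as `PercentageDiskSpaceUsed` in the `AWS/Redshift` namespace; the helper name, alarm name, threshold, and period are illustrative assumptions.

```python
def disk_space_alarm_params(cluster_id, threshold_percent=80.0):
    """Build keyword arguments for a CloudWatch alarm on Redshift disk usage."""
    return {
        "AlarmName": f"{cluster_id}-disk-space",  # hypothetical naming scheme
        "Namespace": "AWS/Redshift",
        "MetricName": "PercentageDiskSpaceUsed",
        "Dimensions": [{"Name": "ClusterIdentifier", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,               # seconds per data point (assumed)
        "EvaluationPeriods": 3,      # consecutive periods over threshold (assumed)
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

With boto3 installed, a dictionary of this shape could be passed as `cloudwatch.put_metric_alarm(**params)`.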

Full explanation →

1178

MCQmedium

A data engineer is troubleshooting an AWS Glue ETL job that fails intermittently. The job is triggered by an AWS Lambda function that uses the IAM policy shown. The Lambda function invokes the Glue job, but sometimes the job does not start. Which action should the engineer take to ensure the job starts reliably?

A.Replace the resource "*" in the Glue action with the specific Glue job ARN.

B.Add s3:GetObject and s3:PutObject permissions for the Glue job's output bucket.

C.Modify the Lambda function to batch multiple job start requests.

D.Add the iam:PassRole permission for the IAM role used by the Glue job.

AnswerD

The Lambda function needs iam:PassRole to pass the Glue job role; missing this causes intermittent failures.

Why this answer

Option D is correct. The policy allows glue:StartJobRun on any resource (*), so permission to start the job is not the problem. What the policy lacks is iam:PassRole for the IAM role that the Glue job runs under, and without it the Lambda function cannot pass that role to Glue.

Option B is wrong because the existing S3 permissions are sufficient. Option C is wrong because batching is not the issue.

Option A is wrong because the policy already allows StartJobRun on *.
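Because the exhibit is not reproduced here, the following is only a sketch of what a policy with the added permission might look like. The helper name and the role ARN in the usage note are hypothetical; `iam:PassRole` and the `iam:PassedToService` condition key are standard IAM elements.

```python
import json


def glue_trigger_policy(job_role_arn):
    """Return an IAM policy document that can start Glue jobs and pass the job role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Already present in the scenario: start any Glue job.
            {"Effect": "Allow", "Action": "glue:StartJobRun", "Resource": "*"},
            # The missing piece: allow passing the job's role, to Glue only.
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": job_role_arn,
                "Condition": {
                    "StringEquals": {"iam:PassedToService": "glue.amazonaws.com"}
                },
            },
        ],
    }
```

For example, `json.dumps(glue_trigger_policy("arn:aws:iam::123456789012:role/ExampleGlueJobRole"), indent=2)` produces the JSON that would be attached to the Lambda function's execution role.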

Full explanation →

1179

MCQmedium

The exhibit shows an S3 bucket policy. What is the effect of this policy?

A.Allows all S3 actions over HTTPS only.

B.Allows all S3 actions to the bucket over any protocol.

C.Denies all S3 actions to the bucket.

D.Allows only GetObject and PutObject over HTTPS.

AnswerD

Explicit allow for those actions over HTTPS; deny for HTTP.

Why this answer

The policy allows GetObject and PutObject only over HTTPS (aws:SecureTransport is true) and denies all S3 actions over HTTP (aws:SecureTransport is false). Result: only GetObject and PutObject requests made over HTTPS are allowed.
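The exhibit is not shown, so the policy below is an assumed reconstruction that matches the description, paired with a toy evaluator that applies the "explicit Deny wins, otherwise an Allow is required" rule. The bucket name is hypothetical, and the evaluator ignores identity policies and every condition other than `aws:SecureTransport`.

```python
from fnmatch import fnmatchcase

# Assumed shape of the exhibit: one Allow over HTTPS, one Deny over HTTP.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
            "Condition": {"Bool": {"aws:SecureTransport": "true"}},
        },
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::example-bucket/*",
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
    ],
}


def _matches(statement, action, secure_transport):
    """True when the statement's action pattern and transport condition both apply."""
    actions = statement["Action"]
    if isinstance(actions, str):
        actions = [actions]
    wanted = statement["Condition"]["Bool"]["aws:SecureTransport"] == "true"
    return wanted == secure_transport and any(fnmatchcase(action, a) for a in actions)


def is_allowed(action, secure_transport, policy=POLICY):
    """Toy evaluation: any matching Deny blocks; otherwise a matching Allow is needed."""
    matching = [s for s in policy["Statement"] if _matches(s, action, secure_transport)]
    if any(s["Effect"] == "Deny" for s in matching):
        return False
    return any(s["Effect"] == "Allow" for s in matching)
```

Under this model, `s3:DeleteObject` over HTTPS is not allowed either, because no statement grants it.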

Full explanation →

1180

Multi-Selecthard

A data engineer is configuring a VPC for an Amazon Redshift cluster. The cluster must be accessible only from a specific on-premises network via a Direct Connect connection. Which TWO actions should the engineer take to meet this requirement? (Choose TWO.)

Select 2 answers

A.Enable Redshift Enhanced VPC Routing.

B.Configure a security group to allow inbound traffic from the on-premises CIDR block.

C.Configure a network ACL to allow inbound traffic from the on-premises CIDR block.

D.Create a VPC endpoint for Redshift.

E.Make the Redshift cluster publicly accessible.

AnswersB, C

Security groups act as a firewall to control inbound traffic.

Why this answer

To restrict access to Redshift, you use a security group to allow inbound traffic from the on-premises CIDR, and optionally use a network ACL for subnet-level security. Option D is incorrect because a VPC endpoint for Redshift does not restrict which networks can reach the cluster. Option A is incorrect because Redshift Enhanced VPC Routing is for routing traffic, not for access control.

Option E is incorrect because a publicly accessible cluster exposes its endpoint to the internet, which contradicts the requirement. Correct: B and C.

Full explanation →

1181

MCQhard

A data engineer runs an AWS Glue ETL job that writes to a table in the AWS Glue Data Catalog. The job fails occasionally with the error 'Resource Not Found' for the table. The table exists. What is a likely cause?

A.The job is using an outdated version of the table schema.

B.Multiple Glue jobs are writing to the same table concurrently.

C.The table location in S3 is incorrect.

D.The Glue job name contains special characters.

AnswerA

Schema version mismatch can cause 'Resource Not Found'.

Why this answer

Option A is correct because Glue jobs use a version of the table schema that may be stale if the schema was updated after the job started; the job may be referencing an old version that no longer exists. Option C is wrong because table location is irrelevant if the table exists. Option B is wrong because concurrent jobs do not cause 'Resource Not Found' errors.

Option D is wrong because the job name does not affect table access.

Full explanation →

1182

MCQmedium

A data engineer is running a Glue ETL job that reads from a JDBC source and writes to S3 in Parquet format. The job is slow and the engineer notices that the number of DPUs used is low. What can be done to improve performance?

A.Disable job bookmarks to avoid reading metadata.

B.Use push-down predicates to filter data at the source.

C.Increase the number of workers (MaxCapacity) in the job configuration.

D.Change the output format to CSV to reduce CPU overhead.

AnswerC

More workers allow parallel processing.

Why this answer

Option C is correct because increasing the number of workers increases parallelism. Option D is wrong because Parquet is already efficient; converting to CSV would worsen performance. Option B is wrong because pushing down filters reduces the data read, but the bottleneck here is parallelism, which more workers address.

Option A is wrong because disabling bookmarks may cause reprocessing but does not directly improve speed.

Full explanation →

1183

MCQmedium

A company uses Amazon S3 to store raw data and needs to transform it into Parquet format for analytics. The transformation job runs daily on a schedule. Which AWS service is BEST suited for this task?

A.Amazon Redshift

B.Amazon EMR

C.AWS Lambda

D.AWS Glue

AnswerD

Glue is serverless, supports Parquet, and can be scheduled with triggers.

Why this answer

Option D is correct because AWS Glue is a serverless ETL service that can run scheduled jobs to convert data to Parquet. Option C is wrong because Lambda has a 15-minute timeout. Option B is wrong because EMR requires managing clusters.

Option A is wrong because Redshift is a data warehouse, not a file conversion service.

Full explanation →

1184

MCQmedium

A data engineer runs an AWS Glue job that reads from a JDBC connection to a PostgreSQL database. The job fails with a 'Connection timed out' error. The Glue job runs in a VPC with the appropriate security group. What is the most likely cause?

A.The network ACL associated with the Glue job's subnet is blocking outbound traffic.

B.The Glue job does not have permission to access the database.

C.The security group does not allow inbound traffic from the Glue job.

D.The database credentials are incorrect.

AnswerA

Network ACLs can block traffic.

Why this answer

Option A is correct because a network ACL can block outbound traffic from the Glue job's subnet to the database. Option D is wrong because the error is a network error, not an authentication error. Option C is wrong because the question states that the security group is already configured appropriately.

Option B is wrong because a connection timeout is a network issue, not a permissions issue.

Full explanation →

1185

MCQmedium

A data engineer needs to ensure that an S3 bucket containing sensitive customer data is encrypted at rest. The company requires that all encryption keys be managed by AWS and rotated annually. Which encryption option meets these requirements?

A.Use server-side encryption with AWS KMS (SSE-KMS)

B.Use client-side encryption with AWS KMS

C.Use server-side encryption with customer-provided keys (SSE-C)

D.Use server-side encryption with S3-managed keys (SSE-S3)

AnswerD

SSE-S3 uses AWS-managed keys that are rotated automatically.

Why this answer

SSE-S3 uses AWS-managed keys and handles key rotation automatically. SSE-KMS also uses AWS-managed keys but gives more control; however, the requirement does not specify customer-managed keys. SSE-C requires the customer to manage keys, which does not meet the requirement of AWS-managed keys.

Client-side encryption (Option B) places the encryption work on the client rather than on S3, so it does not meet the requirement either.

Full explanation →

1186

Multi-Selecthard

Which THREE factors should be considered when choosing between AWS Glue and Amazon EMR for data transformation? (Choose three.)

Select 3 answers

A.Glue automatically stores data in S3 after transformation.

B.EMR allows fine-grained control over cluster configuration and software.

C.EMR supports real-time stream processing with Spark Streaming.

D.Glue is serverless, reducing operational overhead.

E.Glue integrates natively with the Glue Data Catalog for schema management.

AnswersB, D, E

EMR provides flexibility to install custom software and tune clusters.

Why this answer

Option B is correct because Amazon EMR provides full control over cluster configuration, including the ability to customize software, install libraries, and tune Spark, Hadoop, or Hive parameters. This fine-grained control is essential for complex or specialized data transformation pipelines that require specific versions or custom configurations.

Exam trap

The trap here is that candidates may confuse Glue's automatic schema discovery with automatic data storage, or assume EMR is the only option for streaming, when in fact both services support streaming but with different levels of control and operational overhead.

Full explanation →

1187

MCQmedium

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is consumed by an AWS Lambda function that writes to Amazon DynamoDB. The Lambda function is seeing high error rates due to DynamoDB write throttling. Which action should be taken to reduce throttling?

A.Use Amazon Kinesis Data Firehose instead of Kinesis Data Streams

B.Add an Amazon SQS queue between Lambda and DynamoDB

C.Increase the Lambda function memory

D.Enable auto scaling on the DynamoDB table

AnswerD

Auto scaling adjusts write capacity to handle spikes and reduce throttling.

Why this answer

Enabling DynamoDB auto scaling increases write capacity automatically when needed. Using Kinesis Data Firehose would change the architecture but does not address throttling directly. Increasing Lambda memory does not help with DynamoDB throttling.

Using SQS would add a queue but does not increase DynamoDB capacity.

Full explanation →

1188

MCQmedium

A company needs to tag all resources created in a specific AWS account to enforce data governance policies. Which AWS service can automatically enforce tagging rules?

A.AWS Organizations SCPs

B.AWS Service Catalog

C.AWS Resource Access Manager

D.AWS Systems Manager

AnswerA

SCPs can enforce tagging policies.

Why this answer

Option A is correct. AWS Organizations service control policies (SCPs) can enforce tagging across accounts. Option C is wrong because Resource Access Manager is for sharing resources.

Option D is wrong because Systems Manager is for operational management. Option B is wrong because Service Catalog is for cataloging approved IT services.

Full explanation →

1189

Multi-Selecthard

A data engineer is troubleshooting a slow-running AWS Glue ETL job that reads from Amazon S3 and writes to Amazon Redshift. The job processes 500 GB of CSV data daily. The engineer wants to improve performance. Which THREE actions should the engineer take? (Choose three.)

Select 3 answers

A.Use a JDBC connection with a higher batch size for writing to Redshift.

B.Partition the input data in S3 by date or category.

C.Switch to a single-node Redshift cluster to reduce latency.

D.Increase the number of DPUs allocated to the Glue job.

E.Reduce the number of input files by combining them into larger files.

AnswersA, B, D

Larger batch sizes reduce round trips and improve write throughput.

Why this answer

Options A, B, and D are correct. Increasing the number of DPUs provides more parallelism. Partitioning the input data improves read performance.

Using a JDBC connection with an appropriate batch size improves write performance. Option E is wrong because reducing the number of files may reduce parallelism. Option C is wrong because switching to a single-node cluster reduces parallelism.

Full explanation →

1190

MCQmedium

A company uses AWS Lake Formation to manage permissions on a data lake stored in S3. A data engineer notices that a new IAM user can query data via Athena but cannot see the tables in the Lake Formation console. What is the most likely cause?

A.The user has not been granted DESCRIBE or SELECT permissions on the tables in Lake Formation

B.The Athena workgroup is not encrypted

C.The Glue Data Catalog is not enabled for the account

D.The IAM user lacks s3:GetObject permissions

AnswerA

Lake Formation controls metadata access.

Why this answer

Lake Formation permissions are separate from IAM permissions. Even if IAM allows Athena, Lake Formation must grant the user permissions on the tables, so Option A is correct. Option D is wrong because IAM permissions such as s3:GetObject are not sufficient for Lake Formation.

Option B is irrelevant. Option C is wrong because the Glue Data Catalog may be accessible but Lake Formation can restrict visibility.

Full explanation →

1191

MCQeasy

A company is using AWS Glue to run ETL jobs that transform data from Amazon DynamoDB to Amazon S3. The DynamoDB table has a large number of items (over 10 million) and is heavily used by production applications. The Glue job reads the entire DynamoDB table each time it runs, causing increased read capacity consumption and affecting production performance. The team wants to reduce the impact on the source DynamoDB table while still keeping the S3 data up-to-date. What should the team do?

A.Use DynamoDB Streams and AWS Lambda to capture changes and write them to S3, then run incremental Glue jobs.

B.Increase the DynamoDB read capacity units to handle the Glue job's read load.

C.Use the DynamoDB console to export the table to S3 in Parquet format.

D.Reduce the parallelism of the Glue job to lower the read throughput.

AnswerA

Captures only changes, reducing read impact.

Why this answer

Option A is correct because using DynamoDB Streams with AWS Lambda allows incremental change data capture (CDC), reducing the need to read the entire table. Option B is wrong because increasing read capacity might help but still reads the whole table. Option D is wrong because reducing parallelism increases job duration but still reads everything.

Option C is wrong because exporting to S3 via the console is a one-time operation, not incremental.
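The change-capture step could be sketched as follows: a function that turns a DynamoDB Streams event, as delivered to Lambda, into newline-delimited JSON ready to be written to S3. The function name and output field names are hypothetical, and whether `NewImage` is present depends on the stream view type, so the code treats it as optional.

```python
import json


def changes_to_json_lines(event):
    """Convert a DynamoDB Streams Lambda event into newline-delimited JSON."""
    lines = []
    for record in event.get("Records", []):
        change = record["dynamodb"]
        lines.append(
            json.dumps(
                {
                    "event": record["eventName"],         # INSERT, MODIFY, or REMOVE
                    "keys": change["Keys"],
                    "new_image": change.get("NewImage"),  # absent for REMOVE
                },
                sort_keys=True,
            )
        )
    return "\n".join(lines)
```

Only the changed items flow through this path, so the incremental Glue job never needs to scan the source table.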

Full explanation →

1192

Multi-Selectmedium

A data engineer is designing a near-real-time streaming pipeline to ingest clickstream data from a web application. The data must be enriched with user metadata from a DynamoDB table before being stored in S3. Which combination of AWS services should the engineer use? (Choose TWO.)

Select 2 answers

A.Amazon Kinesis Data Firehose

B.Amazon Kinesis Data Analytics for Apache Flink

C.Amazon Kinesis Data Streams

D.AWS Lambda with DynamoDB Accelerator (DAX)

E.AWS Glue Streaming ETL

AnswersB, C

Performs stream enrichment with DynamoDB lookups.

Why this answer

Option B and Option C are correct because Kinesis Data Streams ingests the clickstream, and Kinesis Data Analytics for Apache Flink can enrich streams with DynamoDB lookups. Option D is wrong because Lambda can enrich but is limited by concurrency. Option A is wrong because Firehose can transform via Lambda but does not perform DynamoDB enrichment directly.

Option E is wrong because Glue is treated here as a batch-oriented ETL service and is the weaker fit for this streaming enrichment pattern.

Full explanation →

1193

MCQeasy

Refer to the exhibit. A security analyst is reviewing CloudTrail logs and notices a PutObject event to the 'company-data-lake' bucket. The bucket policy requires all objects to be encrypted with SSE-KMS. What should the analyst conclude?

A.The object was uploaded by the bucket owner, bypassing the policy.

B.The request succeeded because SSE-S3 is acceptable.

C.The object was encrypted with SSE-KMS as required.

D.The object was encrypted with SSE-S3, which violates the bucket policy.

AnswerD

AES256 indicates SSE-S3, not KMS.

Why this answer

The bucket policy requires all objects to be encrypted with SSE-KMS. The CloudTrail log shows a PutObject event whose encryption method is AES256, which indicates SSE-S3 rather than SSE-KMS. This violates the bucket policy, which mandates SSE-KMS.

Therefore, the analyst should conclude that the request succeeded but violated the policy, as SSE-S3 does not meet the requirement.

Exam trap

AWS often tests the nuance that a bucket policy requiring SSE-KMS does not automatically deny requests using SSE-S3; the policy must include an explicit Deny effect for non-compliant encryption to block the upload, so candidates may mistakenly think a requirement alone prevents the action.

How to eliminate wrong answers

Option A is wrong because the bucket owner does not bypass bucket policies; all principals, including the owner, are subject to the policy unless explicitly exempted (which is not indicated). Option B is wrong because the bucket policy explicitly requires SSE-KMS, so SSE-S3 is not acceptable; the request would succeed only if the policy allowed it, but it does not. Option C is wrong because if the object were encrypted with SSE-KMS as required, there would be no violation, but the correct answer indicates a violation occurred, meaning SSE-KMS was not used.
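A reviewer could apply the same reasoning programmatically. The sketch below maps the server-side encryption value of a PutObject record to an encryption type. The exhibit is not reproduced, so the record shape (a `requestParameters` object carrying `x-amz-server-side-encryption`) is an assumption, as is the function name.

```python
def classify_put_object_encryption(cloudtrail_record):
    """Map the SSE header value of a PutObject event to an encryption type."""
    params = cloudtrail_record.get("requestParameters") or {}
    header = params.get("x-amz-server-side-encryption")
    if header == "AES256":
        return "SSE-S3"
    if header == "aws:kms":
        return "SSE-KMS"
    # No header or an unrecognized value: cannot tell from this field alone.
    return "unspecified"
```

A record classified as SSE-S3 in a bucket that mandates SSE-KMS is the violation the question describes.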

Full explanation →

1194

MCQeasy

A data engineer is troubleshooting a Kinesis Data Firehose delivery stream that ingests JSON log data from web servers. The stream is configured to transform records with an AWS Lambda function and deliver to an Amazon S3 bucket. Recently, the stream has been failing with 'InvalidData' errors. Which action should the engineer take to resolve the issue?

A.Verify the S3 bucket policy allows Firehose to write.

B.Increase the buffer size and interval in the Firehose delivery stream.

C.Change the data format to CSV in the Firehose configuration.

D.Check the CloudWatch Logs for the Lambda function to identify transformation errors.

AnswerD

Lambda errors are logged in CloudWatch and can reveal why transformation fails.

Why this answer

The 'InvalidData' error in Kinesis Data Firehose typically indicates that the Lambda function used for data transformation is failing or returning malformed records. By checking the CloudWatch Logs for the Lambda function, the engineer can identify specific transformation errors, such as incorrect JSON parsing, missing fields, or exceptions, which cause Firehose to reject the records. This is the most direct troubleshooting step because Firehose relies on the Lambda function to return valid transformed data in the expected format.

Exam trap

The trap here is that candidates often confuse 'InvalidData' errors with S3 permission issues or buffer configuration problems, but the error specifically points to a failure in the data transformation step, not the delivery destination or batching settings.

How to eliminate wrong answers

Option A is wrong because if the S3 bucket policy were the issue, the error would be a permission or access denied error, not 'InvalidData'. Option B is wrong because increasing buffer size or interval would not resolve data transformation errors; it only affects how data is batched before delivery. Option C is wrong because changing the data format to CSV would not fix transformation errors; Firehose expects the Lambda function to return data in the same format as the input (JSON) unless explicitly configured otherwise, and the 'InvalidData' error is unrelated to the output format.
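To make the transformation contract concrete, here is a minimal sketch of a Firehose transformation handler. Each returned record carries the original `recordId`, a `result` status, and base64-encoded `data`. The `processed` field is a hypothetical enrichment, and marking bad input as `ProcessingFailed` is one possible error-handling choice, not necessarily what the team's function does.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform Firehose records; unparseable records are marked ProcessingFailed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical enrichment
            data = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, KeyError, TypeError):
            # Return the original data so Firehose can route it to the error output.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Logging the exception inside the `except` branch is what would make the failure visible in the function's CloudWatch Logs.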

Full explanation →

1195

MCQeasy

A data engineer needs to troubleshoot a failed AWS Glue job that reads from an Amazon RDS for MySQL database. The error log shows 'Communications link failure'. Which step should the engineer take FIRST?

A.Increase the job timeout and retry count.

B.Check that the security group associated with the Glue job allows outbound traffic to the RDS database.

C.Verify that the database username and password are correct in the Glue connection.

D.Confirm that the table schema in MySQL matches the Glue Data Catalog.

AnswerB

Network connectivity is the most common cause of this error.

Why this answer

Option B is correct because a 'Communications link failure' often indicates network connectivity issues; verifying that the Glue job's security group allows outbound traffic to the RDS database is the first troubleshooting step. Option C is wrong because the error is not about authentication. Option D is wrong because the issue is not about table structure.

Option A is wrong because a longer timeout or more retries does not fix a connection that cannot be established.

Full explanation →

1196

MCQeasy

An Amazon CloudWatch alarm is configured to monitor the CPUUtilization of an EC2 instance. The alarm state is 'INSUFFICIENT_DATA'. What is the most likely cause?

A.The EC2 instance is not sending metric data to CloudWatch.

B.The evaluation periods are set too low.

C.The alarm threshold is set too high.

D.The alarm does not have any actions configured.

AnswerA

If the instance is stopped or terminated, no CPU metrics are published.

Why this answer

Option A is correct because INSUFFICIENT_DATA means no data points are available, often due to no metrics being published. Option D (no alarm actions) would not cause this state. Option C (threshold) is irrelevant.

Option B (evaluation periods) could cause missing data if the metric didn't exist long enough, but the most common reason is no data published.

Full explanation →

1197

MCQeasy

A company is using Amazon S3 to store log files. The security team wants to ensure that any object uploaded to the S3 bucket is automatically scanned for malware before being processed by downstream applications. The data engineer needs to implement a solution that integrates with AWS services and minimizes latency. The bucket receives thousands of objects per day. Which solution should the data engineer use?

A.Deploy an EC2 instance with antivirus software that triggers on S3 events.

B.Enable Amazon GuardDuty with malware protection for S3.

C.Use AWS WAF to inspect objects as they are uploaded.

D.Enable Amazon Macie on the S3 bucket to detect malware.

AnswerB

GuardDuty malware protection scans S3 objects.

Why this answer

Option B is correct because Amazon GuardDuty has a malware protection feature that can scan S3 objects for malware upon upload. It integrates with S3 events and minimizes latency. Option D is incorrect because Amazon Macie is for sensitive data discovery, not malware.

Option C is incorrect because AWS WAF is a web application firewall, not an S3 scanning service. Option A is incorrect because running a custom scanning solution on EC2 adds operational overhead and latency.

Full explanation →

1198

MCQmedium

A data engineer receives the error shown in the exhibit when trying to upload a file to my-bucket. The engineer uses the AWS CLI with the following command: aws s3 cp file.txt s3://my-bucket/. What is the most likely cause of the error?

A.The KMS key specified in the command is incorrect

B.The VPC endpoint policy is blocking the request

C.The object is not encrypted with SSE-KMS

D.The bucket does not have a default encryption configuration

AnswerC

The policy requires SSE-KMS encryption header.

Why this answer

Option C is correct because the bucket policy denies PutObject if the encryption header is not set to aws:kms (SSE-KMS). The CLI command does not specify encryption, so the condition StringNotEquals matches and denies the request. Option D is wrong because the bucket policy explicitly denies non-KMS encrypted uploads, regardless of default encryption.

Option B is wrong because the error is not a network issue. Option A is wrong because the policy does not require KMS key ID, just the header value.
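The condition logic can be modeled in a few lines. The exhibit is not shown, so the statement below is an assumed version of the Deny rule described above, and `put_is_denied` is a hypothetical helper. It reflects the rule that a negated operator such as StringNotEquals also matches when the header is missing.

```python
# Assumed shape of the Deny statement in the bucket policy.
DENY_UNENCRYPTED = {
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-bucket/*",
    "Condition": {
        "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
    },
}


def put_is_denied(request_headers, statement=DENY_UNENCRYPTED):
    """Return True when the Deny statement applies to a PutObject request."""
    expected = statement["Condition"]["StringNotEquals"][
        "s3:x-amz-server-side-encryption"
    ]
    # A missing header also satisfies StringNotEquals, so the Deny applies.
    return request_headers.get("x-amz-server-side-encryption") != expected
```

On the command line, adding the encryption option (`aws s3 cp file.txt s3://my-bucket/ --sse aws:kms`) sends the header that this condition checks for.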

Full explanation →

1199

Multi-Selectmedium

A company is designing a data ingestion pipeline for real-time sensor data from thousands of devices. The data must be processed with low latency and stored in Amazon S3. Which TWO services would be appropriate for this use case? (Choose TWO.)

Select 2 answers

A.AWS Glue

B.AWS DataSync

C.Amazon Athena

D.Amazon Kinesis Data Firehose

E.Amazon Kinesis Data Streams

AnswersD, E

Firehose can deliver streaming data to S3 with buffering.

Why this answer

Option E (Kinesis Data Streams) is correct for ingestion, and Option D (Kinesis Data Firehose) is correct for delivery to S3. Option A (AWS Glue) is for batch ETL, not real-time. Option C (Amazon Athena) is for ad-hoc querying.

Option B (AWS DataSync) is for batch data transfer.

Full explanation →

1200

MCQmedium

A company uses Amazon S3 to store log files. The security team notices that some objects are being accessed from an unexpected AWS account. The data engineer needs to identify which specific IAM user or role is accessing the objects. Which AWS service should be used to get this information?

A.AWS Trusted Advisor

B.Amazon S3 server access logs

C.AWS CloudTrail

D.AWS Config

AnswerC

CloudTrail logs API calls and can be used to trace S3 access to specific IAM users or roles.

Why this answer

AWS CloudTrail records API calls including S3 object-level operations. It logs who made the call, from which account, and other details. S3 server access logs provide similar info but are log files themselves, not a queryable service.

Config is for resource configuration tracking. Trusted Advisor gives best practice checks.

Full explanation →
