CCNA Data Store Management Questions

1
MCQ easy

A company is running a production database on Amazon RDS for PostgreSQL. The database experiences high read traffic from multiple application servers. Which data store management strategy would reduce the load on the primary database instance?

A. Enable DynamoDB Accelerator (DAX) for the database.
B. Enable Multi-AZ deployment for automatic failover.
C. Create an RDS Read Replica in the same region.
D. Use Amazon ElastiCache to cache query results.
Answer: C

Read Replicas allow offloading read queries, reducing load on the primary.

Why this answer

Option C is correct because RDS Read Replicas offload read queries from the primary instance. Option B is incorrect because Multi-AZ is for high availability, not read scaling. Option D is incorrect because ElastiCache is a caching layer, not a replica.

Option A is incorrect because DynamoDB Accelerator is for DynamoDB, not RDS.

2
MCQ medium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an Amazon S3 bucket. The data is JSON and each record is about 2 KB. The delivery stream is configured to buffer incoming data to 5 MB or 60 seconds, whichever comes first. The data engineering team notices that the S3 bucket contains many small files (average 2 MB), which makes subsequent processing inefficient. They need to reduce the number of small files without increasing the latency beyond 5 minutes. Which solution should they implement?

A. Enable compression (GZIP) on the delivery stream.
B. Increase the buffer size to 50 MB and the buffer interval to 300 seconds.
C. Use a Lambda function to merge small files after delivery.
D. Decrease the buffer size to 1 MB and the buffer interval to 60 seconds.
Answer: B

Larger buffer creates larger files within latency limit.

Why this answer

Option B is correct because it addresses the root cause: the delivery stream flushes too often. With files averaging 2 MB, the 60-second interval is expiring before the 5 MB buffer fills, so each flush writes a small object. Raising the interval to 300 seconds lets roughly five times as much data accumulate per object, and raising the buffer size to 50 MB keeps the size limit from forcing an earlier flush. The 300-second interval also keeps latency within the 5-minute requirement. This reduces the number of small files without requiring additional services or post-processing.

Exam trap

The trap here is that candidates often assume compression (Option A) reduces file count, but compression only reduces file size, not the number of files; the real issue is the flush frequency controlled by buffer size and interval.

How to eliminate wrong answers

Option A is wrong because enabling GZIP compression reduces the storage size of each file but does not change the buffer size or flush behavior; the delivery stream will still flush at 5 MB or 60 seconds, producing the same number of small files (just compressed). Option C is wrong because using a Lambda function to merge small files after delivery adds complexity, cost, and latency (Lambda invocation delays, S3 event processing), and does not address the root cause of premature flushes; it also violates the goal of not increasing latency beyond 5 minutes due to the merging overhead. Option D is wrong because decreasing the buffer size to 1 MB and keeping the buffer interval at 60 seconds would make the problem worse, producing even smaller files (average ~1 MB) and more frequent flushes, increasing the number of small files.
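
The buffer arithmetic behind Option B can be checked with a rough sketch. It assumes a steady ingest rate inferred from the scenario (about 2 MB per 60-second flush) and ignores compression; `expected_object_mb` and `objects_per_hour` are hypothetical helpers for illustration, not part of any Firehose API.

```python
def expected_object_mb(ingest_mb_per_s, buffer_mb, interval_s):
    """Approximate object size: a flush happens when the buffer size or the
    buffer interval is reached, whichever comes first."""
    return min(buffer_mb, ingest_mb_per_s * interval_s)


def objects_per_hour(ingest_mb_per_s, object_mb):
    """Objects written per hour at a steady ingest rate."""
    return ingest_mb_per_s * 3600 / object_mb


# Assumption: ~2 MB objects every 60 s implies roughly 2/60 MB/s of ingest.
rate = 2 / 60

current_mb = expected_object_mb(rate, buffer_mb=5, interval_s=60)
proposed_mb = expected_object_mb(rate, buffer_mb=50, interval_s=300)

print(f"current:  ~{current_mb:.0f} MB/object, "
      f"~{objects_per_hour(rate, current_mb):.0f} objects/hour")
print(f"proposed: ~{proposed_mb:.0f} MB/object, "
      f"~{objects_per_hour(rate, proposed_mb):.0f} objects/hour")
```

Under these assumptions the interval, not the 50 MB size limit, still triggers each flush, so objects grow from about 2 MB to about 10 MB and the object count drops by a factor of five.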

3
Multi-Select hard

A company uses Amazon DynamoDB for a gaming application that requires single-digit millisecond read and write latencies. The application experiences throttling on the 'GameScores' table during peak hours. The table has a partition key of 'game_id' and a sort key of 'player_id'. The data engineer needs to improve performance without changing the table's provisioned capacity. Which THREE actions should the engineer take? (Choose THREE.)

Select 3 answers
A. Enable DynamoDB adaptive capacity to allow more throughput per partition.
B. Enable DynamoDB auto scaling to adjust capacity based on traffic patterns.
C. Add a Global Secondary Index (GSI) on a different partition key to offload reads.
D. Implement DynamoDB Accelerator (DAX) for caching frequent reads.
E. Increase the read capacity units (RCUs) to twice the peak observed value.
Answers: B, C, D

Auto scaling prevents throttling by adjusting capacity automatically.

Why this answer

Option B is correct because DynamoDB auto scaling automatically adjusts the provisioned read and write capacity based on actual traffic patterns, preventing throttling during peak hours without manual intervention. This allows the table to handle bursts while maintaining single-digit millisecond latencies, as long as the traffic stays within the auto scaling limits.

Exam trap

The trap here is that candidates may confuse adaptive capacity with auto scaling, or think that increasing provisioned capacity is the only solution, but the question explicitly prohibits changing provisioned capacity, making options that alter RCUs or WCUs incorrect.

4
Multi-Select easy

Which TWO features of Amazon S3 help protect data from accidental deletion or modification? (Choose two.)

Select 2 answers
A. Lifecycle policies
B. Default encryption
C. S3 MFA Delete
D. S3 Object Versioning
E. Cross-Region Replication
Answers: C, D

MFA Delete requires additional authentication for deletions.

Why this answer

Options C and D are correct because versioning allows recovery of previous versions and MFA Delete adds an extra layer of protection. Option E (Cross-Region Replication) is for disaster recovery, not accidental deletion. Option B (default encryption) protects data at rest, not deletion.

Option A (lifecycle policies) automates transitions, not protection.

5
Multi-Select medium

A company is designing a disaster recovery strategy for an Amazon RDS for SQL Server database. The database must be recoverable in another AWS region within 15 minutes of a regional outage. Which TWO actions should the data engineer take?

Select 2 answers
A. Enable Multi-AZ deployment on the primary instance.
B. Configure automated backups to copy to the recovery region.
C. Configure automated cross-region snapshots to be copied to the recovery region.
D. Create a cross-region read replica in the desired recovery region.
Answers: C, D

Cross-region snapshot copy allows restoring in another region.

Why this answer

Options C and D are correct. A cross-region read replica can be promoted to a standalone instance in another region, providing recovery within minutes. Automated cross-region snapshots can also be used to restore to another region.

Option A is incorrect because Multi-AZ is within a region, not cross-region. Option B is incorrect because cross-region automated backups are not supported; you need snapshots.

6
Multi-Select medium

Which TWO of the following are benefits of using Amazon DynamoDB Accelerator (DAX)? (Choose TWO.)

Select 2 answers
A. Improves write throughput
B. Provides microsecond read latency
C. Offloads read traffic from the DynamoDB table
D. Provides data durability across Availability Zones
E. Reduces storage costs
Answers: B, C

DAX caches reads in memory for low latency.

Why this answer

Amazon DynamoDB Accelerator (DAX) is an in-memory cache that sits between your application and DynamoDB, providing microsecond response times for read-heavy workloads. It achieves this by caching frequently accessed data in memory, reducing the latency from single-digit milliseconds to microseconds for eventually consistent reads.

Exam trap

The trap here is that candidates often confuse DAX's read acceleration with write performance improvements, or assume that a cache provides durability guarantees similar to the underlying database.

7
Multi-Select medium

A data engineer is designing a data storage solution for IoT sensor data that is ingested at high velocity. The data is time-series and needs to be queried by time range. Which TWO AWS services are suitable for this use case? (Choose TWO)

Select 2 answers
A. Amazon RDS
B. Amazon Redshift
C. Amazon Timestream
D. Amazon DynamoDB
E. Amazon S3
Answers: C, D

Timestream is a time-series database built for IoT and operational applications.

Why this answer

Amazon Timestream is a purpose-built time-series database that automatically scales to handle high-velocity IoT sensor data, with built-in time-based partitioning and query optimization for time-range queries. It supports SQL queries with time-range predicates (e.g., `BETWEEN`) and time-series functions (e.g., `bin`), and it separates storage into a memory store for recent data and a magnetic store for historical data, enabling efficient querying by time range.

Exam trap

The trap here is that candidates often choose Amazon RDS or Redshift because they are familiar with SQL-based querying, overlooking that Timestream and DynamoDB are purpose-built for high-velocity time-series ingestion and time-range queries, while RDS and Redshift incur performance and cost penalties for such workloads.

8
MCQ medium

A data engineer is managing an Amazon Redshift cluster used for analytics. The cluster has a single node of type dc2.large. The engineer notices that queries are slowing down as data volume grows. The cluster's disk space is at 70% usage. The engineer needs to improve query performance and accommodate future growth. The budget allows for moderate cost increase. Which action should the engineer take?

A. Add another dc2.large node to the cluster.
B. Resize the cluster to a single ds2.xlarge node.
C. Migrate the cluster to a single ra3.xlplus node with managed storage.
D. Enable concurrency scaling and maintain the current cluster.
Answer: A

Adding nodes increases both compute and storage capacity for better performance.

Why this answer

Redshift's dc2 nodes are dense compute nodes with fixed local storage. Adding a node increases both compute and storage capacity, improving performance and scalability. Option A is correct.

Option B: resizing to a larger node type (ds2) increases storage but not necessarily compute proportionally; ds2 is dense storage. Option C: switching to ra3 nodes with managed storage allows compute scaling independent of storage, but ra3 nodes have a higher cost. Option D: enabling concurrency scaling adds cost for additional clusters, not addressing the single node bottleneck.
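
The storage side of adding a node can be illustrated with a small calculation. This is a sketch that assumes 160 GB of local SSD per dc2.large node and an even redistribution of data after the resize; `disk_usage_pct` is a hypothetical helper.

```python
DC2_LARGE_STORAGE_GB = 160  # assumed per-node SSD capacity for dc2.large


def disk_usage_pct(used_gb, node_count):
    """Cluster-wide disk usage as a percentage of total node storage."""
    return 100 * used_gb / (node_count * DC2_LARGE_STORAGE_GB)


used_gb = 0.70 * DC2_LARGE_STORAGE_GB  # the scenario's 70% on one node

for nodes in (1, 2):
    print(f"{nodes} node(s): {disk_usage_pct(used_gb, nodes):.0f}% of disk used")
```

Doubling the node count halves the usage percentage for the same data volume, and the second node also contributes its own compute slices to query execution.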

9
MCQ medium

A data engineer notices that an Amazon Redshift cluster is experiencing slow query performance. The engineer suspects that tables are not properly sorted. Which diagnostic query should the engineer run to identify unsorted rows?

A. SELECT * FROM SVV_TABLE_INFO ORDER BY unsorted DESC;
B. SELECT * FROM PG_CATALOG;
C. SELECT * FROM STV_TBL_PERM;
D. SELECT * FROM STL_LOAD_ERRORS;
Answer: A

SVV_TABLE_INFO shows unsorted rows for each table.

Why this answer

The `SVV_TABLE_INFO` system view in Amazon Redshift provides metadata about each table, including the `unsorted` column which shows the percentage of unsorted rows. By ordering by `unsorted DESC`, the engineer can quickly identify tables with the highest proportion of unsorted data, which directly impacts query performance due to inefficient zone maps and scan pruning.

Exam trap

The trap here is that candidates may confuse `SVV_TABLE_INFO` with `STV_TBL_PERM` (which shows block counts) or `STL_LOAD_ERRORS` (which is for load debugging), missing that only `SVV_TABLE_INFO` exposes the `unsorted` column specifically designed for sort health analysis.

How to eliminate wrong answers

Option B is wrong because `PG_CATALOG` is a system schema containing PostgreSQL catalog tables (e.g., `pg_class`, `pg_attribute`), not a diagnostic view for unsorted rows; it lacks the `unsorted` metric. Option C is wrong because `STV_TBL_PERM` provides block-level storage information (e.g., number of blocks per slice) but does not include a column for unsorted row percentage. Option D is wrong because `STL_LOAD_ERRORS` logs errors from COPY load operations, such as data type mismatches or malformed CSV rows, and has no relevance to sort key efficiency.

10
MCQ easy

A company uses Amazon DynamoDB for a gaming application. They need to store player session data that expires after 24 hours. Which DynamoDB feature should they use to automatically delete expired items?

A. Time to Live (TTL)
B. DynamoDB auto scaling
C. DynamoDB Streams
D. Point-in-time recovery
Answer: A

TTL automatically deletes expired items based on a timestamp attribute.

Why this answer

DynamoDB Time to Live (TTL) is the correct feature because it allows you to define a per-item timestamp attribute (e.g., `expireAt`) that DynamoDB automatically deletes once that timestamp is reached. This is ideal for expiring session data after 24 hours without requiring custom code or scheduled jobs to scan and delete items, reducing cost and operational overhead.

Exam trap

The trap here is that candidates may confuse TTL with DynamoDB Streams, thinking streams can automatically delete items, but streams only notify of changes and require separate logic to perform deletions.

How to eliminate wrong answers

Option B (DynamoDB auto scaling) is wrong because it manages throughput capacity (read/write units) based on traffic, not item expiration or deletion. Option C (DynamoDB Streams) is wrong because it captures item-level changes (inserts, updates, deletes) in near real-time for downstream processing, but does not automatically delete items. Option D (Point-in-time recovery) is wrong because it provides continuous backups to restore a table to any point within the last 35 days, but does not handle automatic deletion of expired data.
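
A minimal sketch of how a session item might carry its TTL value. It assumes the table's TTL attribute is named `expireAt` (as in the explanation above) and that the value is stored as a Number holding epoch seconds; `build_session_item`, the `player_id` key, and the `created_at` attribute are hypothetical.

```python
import time
from typing import Optional

SESSION_TTL_SECONDS = 24 * 60 * 60  # sessions expire after 24 hours


def build_session_item(player_id: str, now: Optional[float] = None) -> dict:
    """Return a DynamoDB-style item whose TTL attribute is 24 h in the future."""
    created = int(now if now is not None else time.time())
    return {
        "player_id": {"S": player_id},
        "created_at": {"N": str(created)},
        # TTL attribute: Number type, Unix epoch time in seconds.
        "expireAt": {"N": str(created + SESSION_TTL_SECONDS)},
    }


item = build_session_item("player-123", now=1_700_000_000)
print(item["expireAt"])
```

DynamoDB removes expired items in the background, so an item can remain readable for some time after its timestamp passes; readers that need exact expiry can filter on `expireAt` themselves.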

11
MCQ hard

A company is using Amazon Redshift for data warehousing. The data engineer notices that the STL_ALERT_EVENT_LOG table shows many 'missing statistics' alerts. What is the best course of action to address this issue?

A. Increase the WLM concurrency slots.
B. Run VACUUM on the tables.
C. Enable compression on the tables.
D. Run ANALYZE on the tables.
Answer: D

ANALYZE updates table statistics, resolving missing statistics alerts.

Why this answer

The STL_ALERT_EVENT_LOG table records alerts about query performance issues, including 'missing statistics' alerts. This indicates that the query optimizer lacks up-to-date table statistics, leading to suboptimal query plans. Running the ANALYZE command updates table statistics, enabling the optimizer to generate efficient execution plans.

Therefore, option D is the correct course of action.

Exam trap

The trap here is that candidates often confuse VACUUM (which reorganizes data) with ANALYZE (which updates statistics), assuming both are needed for query performance, but only ANALYZE directly resolves 'missing statistics' alerts.

How to eliminate wrong answers

Option A is wrong because increasing WLM concurrency slots does not address missing statistics; it only allows more queries to run simultaneously, which could worsen performance if statistics are outdated. Option B is wrong because VACUUM reclaims disk space and sorts rows but does not update table statistics; it is used for managing data storage, not query optimization. Option C is wrong because enabling compression reduces storage and I/O but does not provide the optimizer with the statistical metadata needed for efficient query planning.

12
MCQ medium

A company is using Amazon RDS for MySQL and needs to reduce read latency for a global user base. Which AWS feature should be implemented?

A. Multi-AZ deployment
B. Aurora Auto Scaling
C. Read Replicas
D. Cross-Region Replication
Answer: C

Read Replicas allow offloading read queries to reduce latency.

Why this answer

Option C is correct because Read Replicas can be promoted to stand-alone instances and reduce read latency. Option A is wrong because Multi-AZ is for high availability, not read scaling. Option B is wrong because Auto Scaling adjusts capacity, not read latency.

Option D is wrong because cross-region replication latency is high.

13
MCQ medium

A company stores sensitive customer data in an Amazon S3 bucket. The security team requires that all data be encrypted at rest using a key that is automatically rotated every year. Which encryption solution should the data engineer use?

A. SSE-S3
B. SSE-KMS with a customer managed key
C. SSE-C
D. Client-side encryption
Answer: B

KMS can automatically rotate customer managed keys annually.

Why this answer

SSE-KMS with a customer managed key allows automatic annual key rotation (when enabled). Option B is correct. Option A: SSE-S3 uses S3-managed keys, so rotation is not customer-controlled.

Option C: SSE-C requires customer-provided keys, so rotation is manual. Option D: client-side encryption does not use AWS KMS for key rotation.

14
MCQ medium

A company runs a multi-AZ Amazon RDS for PostgreSQL instance. They need to run a one-time analytical query that will take several hours and consume significant I/O. The query should not impact the primary workload. What should the data engineer do?

A. Create a read replica of the RDS instance and run the query on the replica.
B. Run the query directly on the primary instance during off-peak hours.
C. Increase the instance size to handle the load.
D. Enable Multi-AZ and run the query on the standby instance.
Answer: A

Read replica offloads read traffic from the primary.

Why this answer

Option A is correct because creating a read replica of the RDS for PostgreSQL instance allows the analytical query to run on a separate DB instance without affecting the primary workload. Read replicas in Amazon RDS use asynchronous replication from the source instance, so the replica can handle heavy I/O and long-running queries independently. This ensures the primary instance remains available for the production workload without performance degradation.

Exam trap

The trap here is that candidates often confuse the Multi-AZ standby instance with a read replica, assuming the standby can be used for queries, but in Amazon RDS, the standby is only for high availability and is not accessible for read operations.

How to eliminate wrong answers

Option B is wrong because running the query directly on the primary instance, even during off-peak hours, still consumes significant I/O and CPU resources on that instance, which can impact the primary workload and potentially cause performance issues or increased latency. Option C is wrong because increasing the instance size only adds more resources to the same single instance; the analytical query would still compete with the primary workload for I/O and memory, and scaling up does not isolate the workload. Option D is wrong because the standby instance in a Multi-AZ deployment is not directly accessible for read or write operations; it is a synchronous replica used only for automatic failover, and Amazon RDS does not allow connecting to the standby for queries.

15
MCQ hard

A data engineer is designing a multi-region disaster recovery plan for an Amazon DynamoDB table. The table stores critical user profile data and must have a Recovery Point Objective (RPO) of less than 1 minute and a Recovery Time Objective (RTO) of less than 5 minutes. Which solution meets these requirements?

A. Configure DynamoDB Streams and a Lambda function to replicate data to another region.
B. Use DynamoDB on-demand backup and restore to another region.
C. Use DynamoDB global tables to replicate data to another region.
D. Enable point-in-time recovery (PITR) and restore to another region.
Answer: C

Global tables provide near-real-time replication and fast failover.

Why this answer

DynamoDB global tables provide multi-region active-active replication with sub-second replication latency, meeting RPO < 1 minute and RTO < 5 minutes (by failing over to another region). Option B is wrong because on-demand backup and restore can take hours to restore. Option D is wrong because point-in-time recovery (PITR) restores to a new table, which can take minutes to hours.

Option A is wrong because cross-region replication using Lambda is custom and may have higher latency.

16
MCQ medium

A data engineer runs the describe-table command shown in the exhibit. The application frequently queries by CustomerID alone. Currently, these queries result in full table scans. Which action should the engineer take to improve query performance?

A. Change the sort key to OrderID
B. Create a local secondary index on CustomerID
C. Increase the read capacity units to 500
D. Create a global secondary index on CustomerID
Answer: D

A GSI allows efficient queries on CustomerID.

Why this answer

The `describe-table` output shows CustomerID is the partition key, but the application frequently queries by CustomerID alone, which currently causes full table scans because there is no index supporting that query pattern. Creating a global secondary index (GSI) on CustomerID allows efficient querying by CustomerID without scanning the entire table, as the GSI provides a separate data structure with its own read/write capacity that can be queried directly.

Exam trap

AWS often tests the misconception that increasing capacity (RCUs/WCUs) can fix query performance issues, but the trap here is that throughput and indexing are separate concerns — full table scans are a design problem, not a capacity problem.

How to eliminate wrong answers

Option A is wrong because changing the sort key to OrderID would not help queries by CustomerID alone, as the sort key is used for sorting within a partition, not for filtering by a different attribute. Option B is wrong because DynamoDB does not support local secondary indexes (LSIs) on the partition key; LSIs can only be created on a different sort key within the same partition key, and CustomerID is already the partition key, so an LSI on CustomerID is invalid. Option C is wrong because increasing read capacity units only increases throughput, not query efficiency; full table scans still occur regardless of capacity, and the issue is the lack of an index, not insufficient capacity.

17
Multi-Select hard

A company runs a production Amazon RDS for PostgreSQL database. The database is experiencing performance degradation due to a high number of concurrent read queries. The data engineer needs to improve read performance without significantly increasing costs. Which TWO actions should the engineer take? (Choose TWO.)

Select 2 answers
A. Create one or more read replicas in the same region.
B. Enable Multi-AZ deployment for automatic failover.
C. Increase the allocated storage size to improve IOPS.
D. Enable Performance Insights to identify slow queries.
E. Delete unnecessary indexes to reduce write overhead.
Answers: A, D

Read replicas handle read traffic, reducing load on the primary.

Why this answer

Option A is correct because read replicas offload read traffic from the primary instance. Option D is correct because Performance Insights helps identify the queries and waits causing the bottleneck. Option B is incorrect because Multi-AZ does not improve read performance.

Option C is incorrect because increasing storage size may not help with read performance. Option E is incorrect because deleting indexes reduces write overhead, which is not the problem here, and could slow the reads that rely on them.

18
MCQ hard

A data engineer is setting up an Amazon S3 bucket for storing sensitive financial data. The compliance team requires that all data be encrypted at rest using a customer-managed AWS KMS key. Additionally, the bucket must block public access. Which combination of settings should the engineer configure?

A. Enable default encryption with AWS-KMS and a customer managed key. Enable block public access settings.
B. Use S3 Object Ownership to enforce bucket owner enforced. Enable block public access.
C. Use S3 Bucket Keys to reduce KMS costs. Enable block public access.
D. Create a bucket policy that denies PutObject without encryption. Enable block public access.
Answer: A

This ensures all objects are encrypted with the specified KMS key and public access is blocked.

Why this answer

Option A is correct because enabling default encryption with AWS-KMS using a customer-managed key ensures that all objects uploaded to the S3 bucket are automatically encrypted at rest with the required key type. Additionally, enabling block public access settings prevents any public access to the bucket, satisfying the compliance team's requirements.

Exam trap

The trap here is that candidates may think a bucket policy denying unencrypted uploads is sufficient, but it does not enforce the use of a customer-managed KMS key, nor does it automatically encrypt objects that lack encryption headers.

How to eliminate wrong answers

Option B is wrong because S3 Object Ownership with bucket owner enforced controls object ownership and ACLs, but does not enforce encryption at rest with a customer-managed KMS key. Option C is wrong because S3 Bucket Keys reduce KMS request costs by using a bucket-level key, but they do not enforce encryption with a customer-managed key or block public access. Option D is wrong because a bucket policy that denies PutObject without encryption can enforce encryption, but it does not guarantee that the encryption uses a customer-managed KMS key; it could allow SSE-S3 or SSE-KMS with an AWS-managed key, and it does not configure default encryption.
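
The two settings in Option A map onto two S3 API requests. The sketch below only builds the request parameters as plain dictionaries, in the shape accepted by the boto3 S3 client's `put_bucket_encryption` and `put_public_access_block` calls; the bucket name and key ARN are placeholders, and the helper function names are hypothetical.

```python
def encryption_params(bucket: str, kms_key_arn: str) -> dict:
    """Default-encryption request: SSE-KMS with a customer managed key."""
    return {
        "Bucket": bucket,
        "ServerSideEncryptionConfiguration": {
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": kms_key_arn,
                    }
                }
            ]
        },
    }


def public_access_block_params(bucket: str) -> dict:
    """Block Public Access request with all four settings turned on."""
    return {
        "Bucket": bucket,
        "PublicAccessBlockConfiguration": {
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    }


# Placeholder identifiers for illustration only.
enc = encryption_params("example-finance-bucket", "arn:aws:kms:REGION:ACCOUNT:key/KEY-ID")
pab = public_access_block_params("example-finance-bucket")

# With boto3 installed and credentials configured, these could be sent as:
#   s3 = boto3.client("s3")
#   s3.put_bucket_encryption(**enc)
#   s3.put_public_access_block(**pab)
```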

19
MCQ medium

A company has an Amazon RDS for PostgreSQL DB instance with a large table that is frequently updated. The data engineer needs to reduce storage costs by archiving old records that are no longer accessed. The archived records must be retained for 7 years due to compliance requirements. Which solution is MOST cost-effective?

A. Use RDS native backup and restore to keep a separate backup.
B. Export old records using pg_dump and store in S3 Glacier Deep Archive.
C. Enable storage autoscaling on the RDS instance.
D. Move old records to a separate table in the same RDS instance.
Answer: B

This offloads old data to low-cost archival storage.

Why this answer

Using pg_dump to export old records and store them in S3 Glacier Deep Archive is cost-effective because Glacier Deep Archive is the lowest-cost storage for long-term archival. Option C is wrong because enabling storage autoscaling doesn't archive data. Option D is wrong because archiving within RDS still incurs storage costs.

Option A is wrong because an RDS backup captures the whole instance rather than only the old records, and RDS backup storage is more expensive than Glacier Deep Archive for long-term retention.

20
MCQ hard

A company runs a streaming application on Amazon EC2 instances that writes data to an Amazon DynamoDB table (us-east-1). The data is later consumed by a reporting job that runs every hour. Recently, the reporting job has been failing with ProvisionedThroughputExceededException errors during peak hours. The DynamoDB table uses provisioned capacity with 1000 read capacity units (RCU) and 500 write capacity units (WCU). The reporting job performs scans and reads using eventually consistent reads. The application's write traffic is steady, but the reporting job's reads spike at the top of the hour. The data engineer needs to resolve the throughput exceptions without affecting the application's writes. Which solution should the data engineer implement?

A. Create a global secondary index (GSI) with enough RCU for the reporting job and have the job query the index instead of scanning the table.
B. Increase the table's RCU to 2000.
C. Create a read replica of the DynamoDB table in a different region.
D. Switch the table to on-demand capacity mode.
Answer: A

Using a GSI with dedicated capacity can isolate the reporting workload and avoid throttling.

Why this answer

In provisioned mode, each global secondary index (GSI) has its own read and write capacity, separate from the base table's. Queries against a GSI consume the index's read capacity, so giving the reporting job a GSI with enough RCU moves its hourly read spike off the base table. Querying an index also typically reads far less data than scanning the whole table.

Option A is correct. Option B: increasing the table's RCU to 2000 pays for peak capacity around the clock and still leaves the scan-heavy job competing for the table's read capacity. Option D: switching to on-demand can reduce throttling but may increase costs and does not guarantee performance.

Option C: adding a read replica is not a feature for DynamoDB (only for RDS).
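
A back-of-the-envelope sketch of why hourly scans exhaust 1000 RCU. It uses the standard accounting of one RCU per 4 KB strongly consistent read per second (half that for eventually consistent reads) and a hypothetical 10 GiB table size, which the question does not state; `scan_rcus` is an illustrative helper.

```python
import math


def scan_rcus(total_bytes: int, eventually_consistent: bool = True) -> float:
    """Read capacity consumed by reading total_bytes in 4 KB units."""
    units = math.ceil(total_bytes / 4096)
    return units * (0.5 if eventually_consistent else 1.0)


table_bytes = 10 * 1024**3  # hypothetical 10 GiB table
needed = scan_rcus(table_bytes)
seconds_at_1000_rcu = needed / 1000  # if the scan could use every provisioned RCU

print(f"RCUs consumed by one full scan: {needed:,.0f}")
print(f"Minimum duration at 1000 RCU/s: {seconds_at_1000_rcu / 60:.1f} minutes")
```

Even with eventually consistent reads, a full scan of a table this size would need all of the table's read capacity for many minutes, which is why pointing the job at a narrower index query is the more durable fix.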

21
MCQ hard

A company uses Amazon DynamoDB as the primary data store for a gaming application. The application experiences sudden spikes in traffic. The data engineer notices that write requests are throttled during peak times. The partition keys are well-distributed. What should the data engineer do to reduce throttling?

A. Use DynamoDB global tables to distribute writes across regions.
B. Configure DynamoDB auto scaling to adjust write capacity automatically.
C. Increase the number of partition keys to improve write distribution.
D. Enable DynamoDB Accelerator (DAX) to cache write operations.
Answer: B

Auto scaling increases write capacity during spikes, reducing throttling.

Why this answer

Option B is correct because DynamoDB auto scaling adjusts capacity based on load, preventing throttling during spikes. Option D is wrong because DAX is a cache for reads, not writes. Option A is wrong because global tables improve latency and disaster recovery, not write capacity.

Option C is wrong because the partition keys are already well-distributed, and DynamoDB adds partitions automatically as throughput increases; without auto scaling, throttling still happens.
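
DynamoDB auto scaling is configured through Application Auto Scaling. The sketch below builds the parameters for a scalable target and a target-tracking policy on write capacity, in the shape accepted by the boto3 `application-autoscaling` client's `register_scalable_target` and `put_scaling_policy` calls. The table name, capacity bounds, 70% target, and policy name are assumptions chosen for illustration.

```python
def write_autoscaling_params(table: str, min_wcu: int, max_wcu: int,
                             target_utilization: float = 70.0):
    """Build (scalable_target, scaling_policy) parameter dicts for a table's WCU."""
    resource_id = f"table/{table}"
    dimension = "dynamodb:table:WriteCapacityUnits"
    target = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_wcu,
        "MaxCapacity": max_wcu,
    }
    policy = {
        "PolicyName": f"{table}-write-target-tracking",  # hypothetical name
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
            },
        },
    }
    return target, policy


target, policy = write_autoscaling_params("GameEvents", min_wcu=100, max_wcu=2000)

# With boto3 installed and credentials configured, these could be sent as:
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**target)
#   aas.put_scaling_policy(**policy)
```

Target tracking reacts to sustained utilization rather than instantly, so very sudden spikes may still see brief throttling before capacity catches up.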

22
MCQ easy

A data engineer ran the above CLI command to describe an Amazon DynamoDB table named 'Orders'. The table has a key schema with 'OrderID' as the partition key and 'CustomerID' as the sort key. The table currently has no items. The engineer wants to add a new attribute 'OrderDate' and then query all orders for a specific customer within a date range. Which of the following actions is the MOST efficient approach to support this query pattern?

A. Modify the table's primary key to include 'OrderDate' as an additional sort key.
B. Use a Scan operation with a filter expression on 'CustomerID' and 'OrderDate' to retrieve the data.
C. Create a Local Secondary Index (LSI) with 'CustomerID' as partition key and 'OrderDate' as sort key.
D. Create a Global Secondary Index (GSI) with 'CustomerID' as partition key and 'OrderDate' as sort key.
Answer: D

GSI can be added at any time and supports efficient queries on CustomerID and OrderDate.

Why this answer

Option D is correct because a Global Secondary Index (GSI) allows querying on a different partition key ('CustomerID') and sort key ('OrderDate') without altering the base table's key schema. This supports efficient range queries on 'OrderDate' for a specific customer, as GSIs provide a separate index with its own provisioned throughput and can be created on existing tables with items. The base table's primary key remains unchanged, and the GSI enables the desired query pattern with low latency.

Exam trap

AWS often tests the distinction between LSIs and GSIs, specifically that LSIs require the same partition key as the base table, while GSIs allow a different partition key, which is a common point of confusion for candidates.

How to eliminate wrong answers

Option A is wrong because DynamoDB does not support modifying an existing table's primary key schema; you cannot add a sort key after table creation without recreating the table. Option B is wrong because a Scan operation reads every item in the table and then applies a filter, which is inefficient and costly for large tables, and does not leverage DynamoDB's indexing capabilities for range queries. Option C is wrong because a Local Secondary Index (LSI) must have the same partition key as the base table (here 'OrderID'), so it cannot use 'CustomerID' as the partition key; LSIs only allow querying with the base table's partition key and an alternate sort key.
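
The GSI in Option D and the resulting date-range query might look like the following. This sketch only builds request parameters in the shape used by the boto3 DynamoDB client's `update_table` and `query` calls. The index name, the string attribute types, the ISO 8601 date format, and the capacity values are assumptions, since the exhibit is not reproduced here; the index's `ProvisionedThroughput` applies only if the table uses provisioned mode.

```python
INDEX_NAME = "CustomerOrderDateIndex"  # hypothetical index name


def gsi_create_params(table: str = "Orders") -> dict:
    """update_table parameters that add a GSI keyed on CustomerID + OrderDate."""
    return {
        "TableName": table,
        "AttributeDefinitions": [
            {"AttributeName": "CustomerID", "AttributeType": "S"},
            {"AttributeName": "OrderDate", "AttributeType": "S"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": INDEX_NAME,
                    "KeySchema": [
                        {"AttributeName": "CustomerID", "KeyType": "HASH"},
                        {"AttributeName": "OrderDate", "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                    "ProvisionedThroughput": {
                        "ReadCapacityUnits": 5,
                        "WriteCapacityUnits": 5,
                    },
                }
            }
        ],
    }


def date_range_query_params(customer_id: str, start: str, end: str,
                            table: str = "Orders") -> dict:
    """query parameters: one customer's orders between two ISO 8601 dates."""
    return {
        "TableName": table,
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": "CustomerID = :c AND OrderDate BETWEEN :start AND :end",
        "ExpressionAttributeValues": {
            ":c": {"S": customer_id},
            ":start": {"S": start},
            ":end": {"S": end},
        },
    }


create = gsi_create_params()
query = date_range_query_params("cust-42", "2024-01-01", "2024-01-31")
```

Storing `OrderDate` as an ISO 8601 string keeps lexicographic order equal to chronological order, which is what lets `BETWEEN` on the sort key act as a date range.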

23
MCQ hard

A company is using AWS Glue to run ETL jobs that write data to an Amazon S3 data lake. The jobs are failing with '503 Slow Down' errors. The data engineering team has already implemented retries. What is the BEST long-term solution?

A.Enable S3 Transfer Acceleration.
B.Use S3 multipart upload for all objects.
C.Increase the number of retries in the Glue job.
D.Implement a backoff strategy to reduce the request rate.
AnswerD

Reducing request rate helps avoid S3 503 errors.

Why this answer

The '503 Slow Down' error from Amazon S3 indicates that the request rate is too high and S3 is throttling the requests. The best long-term solution is to implement a backoff strategy (exponential backoff) to reduce the request rate, which allows the Glue job to automatically slow down and retry with increasing delays, aligning with S3's request rate limits and avoiding sustained throttling.

Exam trap

The trap here is that candidates often confuse '503 Slow Down' with a network or throughput issue and choose S3 Transfer Acceleration or multipart upload, when in fact the error is a throttling response from S3 that requires reducing the request rate via backoff, not increasing speed or parallelism.

How to eliminate wrong answers

Option A is wrong because S3 Transfer Acceleration is designed to speed up uploads over long distances using edge locations, but it does not reduce the request rate or resolve throttling caused by high request volumes. Option B is wrong because S3 multipart upload is a mechanism for uploading large objects in parts, which can improve throughput but does not address the root cause of excessive request rate leading to '503 Slow Down' errors. Option C is wrong because increasing the number of retries without reducing the request rate will likely continue to trigger throttling, as the same high request rate will persist after each retry, leading to repeated failures.
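
The backoff strategy in Option D can be sketched as a delay schedule. This is a minimal example of exponential backoff with full jitter; the base delay, cap, and retry count are illustrative assumptions, not values taken from AWS Glue or S3.

```python
import random


def backoff_delays(max_retries, base=0.1, cap=20.0, rng=None):
    """Return the sleep time in seconds to use before each retry.

    Exponential backoff with full jitter: the delay for attempt n is drawn
    uniformly from [0, min(cap, base * 2**n)].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# A seeded generator keeps the example reproducible.
delays = backoff_delays(max_retries=8, rng=random.Random(42))
```

A caller would sleep for `delays[n]` after the n-th '503 Slow Down' response before retrying. The jitter spreads retries from parallel workers so they do not all hit S3 again at the same moment.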

24
MCQmedium

A company is migrating an on-premises MongoDB database to Amazon DocumentDB. The migration must have minimal downtime. Which service should be used to perform the migration?

A.AWS Glue
B.AWS DataSync
C.AWS Database Migration Service (DMS)
D.Amazon S3 Transfer Acceleration
AnswerC

DMS supports MongoDB to DocumentDB migration with minimal downtime using change data capture.

Why this answer

AWS Database Migration Service (DMS) is the correct choice because it supports continuous replication from MongoDB to Amazon DocumentDB using change data capture (CDC), enabling near-zero downtime migrations. DMS can perform a full load of existing data and then apply ongoing changes from the source MongoDB oplog, keeping the target DocumentDB synchronized until the cutover.

Exam trap

The trap here is that candidates may confuse AWS DMS with AWS DataSync or AWS Glue, assuming any 'migration' or 'data transfer' service can handle live database replication, but only DMS provides the necessary CDC engine for heterogeneous database migrations with minimal downtime.

How to eliminate wrong answers

Option A is wrong because AWS Glue is a serverless data integration service for ETL (extract, transform, load) jobs, not designed for live database migration with minimal downtime; it lacks native CDC support for MongoDB to DocumentDB replication. Option B is wrong because AWS DataSync is optimized for moving large volumes of file data (e.g., NFS, SMB) to AWS storage services like S3 or EFS, not for heterogeneous database migrations or ongoing replication. Option D is wrong because Amazon S3 Transfer Acceleration is a feature that speeds up uploads to S3 buckets over long distances using edge locations; it has no capability to migrate or replicate a MongoDB database to DocumentDB.

25
Multi-Selecthard

A data engineer is designing a data lake on Amazon S3. The data must be immutable and support high-throughput streaming ingestion. Which THREE features should the engineer consider? (Select THREE.)

Select 3 answers
A.S3 Transfer Acceleration
B.S3 Lifecycle policies to transition data to Amazon S3 Glacier
C.S3 Multipart Upload API
D.S3 Object Lock in governance mode
E.S3 Cross-Region Replication (CRR)
AnswersB, C, D

Lifecycle policies automate data movement, cost-effectively managing the data lifecycle.

Why this answer

S3 Object Lock in governance mode (Option D) is correct because it enforces immutability by preventing objects from being deleted or overwritten for a specified retention period, which is essential for a data lake requiring immutable data. S3 Multipart Upload API (Option C) is correct because it enables high-throughput streaming ingestion by allowing large objects to be uploaded in parallel parts, improving throughput and resilience. S3 Lifecycle policies to transition data to Amazon S3 Glacier (Option B) is correct because it supports cost-effective storage management for immutable data that is rarely accessed, aligning with the data lake's lifecycle needs.

Exam trap

The trap here is that candidates often confuse S3 Transfer Acceleration (a speed optimization) with a feature that provides immutability or streaming support, leading them to select it incorrectly, while overlooking that S3 Object Lock and Multipart Upload directly address the core requirements of immutability and high-throughput ingestion.

26
MCQeasy

A company is storing large amounts of log data in Amazon S3. The data is accessed frequently for the first 30 days, then rarely after that. The company wants to automatically transition the data to a lower-cost storage class after 30 days. Which S3 feature should the data engineer use?

A.S3 Intelligent-Tiering
B.S3 Lifecycle policies
C.S3 Cross-Region Replication
D.S3 Batch Operations
AnswerB

Lifecycle policies can transition objects after a specified number of days.

Why this answer

Option B is correct because S3 Lifecycle policies automatically transition objects between storage classes on a fixed schedule. Option A is wrong because S3 Intelligent-Tiering moves objects based on monitored access patterns rather than a fixed 30-day rule, and it adds a per-object monitoring charge. Option C is wrong because S3 replication is for copying data, not transitioning storage classes.

Option D is wrong because S3 Batch Operations is for bulk actions, not automatic transitions.

27
Multi-Selecthard

A company uses Amazon S3 to store log files that are generated every hour. Each log file is about 1 GB. The logs must be stored for 5 years for compliance. The data engineer wants to minimize storage costs while ensuring that logs can be retrieved within 24 hours for the first year, and within 48 hours thereafter. Which THREE lifecycle actions should the engineer configure? (Choose THREE.)

Select 3 answers
A.Transition objects to S3 Standard after 30 days.
B.Transition objects to S3 Glacier Deep Archive after 1 year.
C.Set a retrieval window of 48 hours for Glacier Deep Archive.
D.Delete objects after 2 years to reduce storage costs.
E.Transition objects to S3 Standard-IA after 30 days.
AnswersB, C, E

Deep Archive is the cheapest storage class for archival data.

Why this answer

Option E is correct because S3 Standard-IA suits infrequently accessed data that still needs immediate retrieval. Option B is correct because S3 Glacier Deep Archive provides the lowest-cost storage, with standard retrievals completing within 12 hours. Option C is correct because a Deep Archive retrieval within 48 hours meets the requirement after the first year.

Option A is incorrect because S3 Standard is not cost-effective for long-term storage. Option D is incorrect because deleting objects after 2 years violates the 5-year compliance requirement.
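
The two transitions can be expressed as a single lifecycle rule. The sketch below builds a configuration in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the `logs/` prefix, the rule ID, and the final expiration at 5 years are assumptions added for illustration (the question only requires that nothing is deleted earlier).

```python
def build_log_lifecycle(prefix="logs/"):
    """Build an S3 lifecycle configuration for 5-year log retention.

    The prefix, rule ID, and 5-year expiration are illustrative assumptions.
    """
    return {
        "Rules": [
            {
                "ID": "log-retention-5y",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    # Immediate retrieval, lower storage price than Standard.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # Cheapest class; retrieval takes hours.
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Delete only once the compliance period has passed.
                "Expiration": {"Days": 5 * 365},
            }
        ]
    }


config = build_log_lifecycle()
```

With boto3 this could be applied using `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`.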

28
MCQmedium

A data engineer is setting up an Amazon S3 lifecycle policy to transition objects to S3 Glacier after 90 days and delete after 365 days. The objects are stored in the S3 Standard storage class. Which lifecycle rule configuration meets the requirements?

A.Transition to Glacier after 90 days and expire after 90 days
B.Transition to Glacier after 90 days and expire after 90 days
C.Transition to Glacier after 365 days and expire after 365 days
D.Transition to Glacier after 90 days and expire after 365 days
AnswerD

Correct timing for transition and deletion.

Why this answer

Option D is correct because the transition to Glacier must happen after 90 days and expiration after 365 days. Options A and B are wrong because expiring after 90 days deletes the objects too early. Option C is wrong because transitioning to Glacier after 365 days is too late.

29
MCQhard

A healthcare company uses Amazon RDS for PostgreSQL to store patient records. The database has a size of 1 TB and is running on a db.r5.large instance. The company requires that the database be highly available and have automated backups with point-in-time recovery (PITR) for the last 35 days. The operations team has configured Multi-AZ deployment and automated backups with a 35-day retention period. During a recent disaster simulation, the team attempted to restore the database to a point in time from 30 days ago. The restore operation failed because the backup was not available. On investigation, the team found that the automated backups were being deleted before the retention period ended. The team also noticed that the database has a large number of transaction logs generating a high volume of write activity. What is the most likely cause of the backups being deleted prematurely?

A.The RDS instance was deleted, which automatically deletes all automated backups.
B.The automated backup window was set to a time that conflicted with the database maintenance window.
C.The Multi-AZ deployment was not enabled during the backup process, causing backups to fail.
D.The database had manual snapshots that were deleted manually by the operations team.
AnswerA

When an RDS instance is deleted, its automated backups are deleted as well unless the team chooses to retain them or takes a final snapshot.

Why this answer

Option A is correct because automated backups are normally kept for the full backup retention period, but deleting the database instance also deletes its automated backups. Option D is wrong because manual snapshots are separate from automated backups, so deleting them does not affect point-in-time recovery.

Option B is wrong because the backup window does not affect retention. Option C is wrong because Multi-AZ deployments take backups from the standby, and the retention period is still enforced.

30
MCQeasy

A data engineer is configuring an Amazon S3 lifecycle policy to transition objects to S3 Glacier Deep Archive after 90 days. The bucket receives new objects daily. The engineer wants to ensure that objects are not deleted before 90 days. Which lifecycle action should be used?

A.Expiration
B.Transition
C.NoncurrentVersionTransition
D.AbortIncompleteMultipartUpload
AnswerB

Transition moves objects to a different storage class.

Why this answer

Option B (Transition) is correct because the S3 Lifecycle Transition action moves objects between storage classes over time. To ensure objects are moved to S3 Glacier Deep Archive after 90 days without deletion, a Transition rule is configured to specify the target storage class and the number of days from object creation.

Exam trap

The trap here is confusing Expiration (which deletes objects) with Transition (which moves objects to another storage class), leading candidates to select Expiration when the goal is to retain objects for a minimum period before moving them to archival storage.

How to eliminate wrong answers

Option A (Expiration) is wrong because it permanently deletes objects after a specified number of days, which would remove them before they could be transitioned to Glacier Deep Archive. Option C (NoncurrentVersionTransition) is wrong because it applies only to noncurrent versions of versioned objects, not to current objects in a non-versioned or versioned bucket. Option D (AbortIncompleteMultipartUpload) is wrong because it only aborts incomplete multipart uploads after a specified number of days, not transitioning or deleting complete objects.

31
MCQeasy

A company needs to store application log files for 90 days for compliance. The logs are generated continuously and are rarely accessed after 30 days. The data engineer must minimize storage costs. Which storage solution should the engineer choose?

A.Amazon CloudWatch Logs with a retention policy of 90 days
B.Amazon S3 Glacier Deep Archive
C.Amazon EBS gp3 volumes attached to an EC2 instance
D.Amazon S3 Standard with a lifecycle policy to transition to S3 Standard-IA after 30 days and expire after 90 days
AnswerD

This minimizes cost by using cheaper storage for infrequently accessed data and deleting after compliance period.

Why this answer

Amazon S3 Standard with a lifecycle policy to transition to S3 Standard-IA after 30 days and expire after 90 days is correct because it aligns with the access pattern: logs are frequently accessed only in the first 30 days, then rarely accessed for the remaining 60 days. S3 Standard-IA offers lower storage costs for infrequently accessed data while still providing millisecond retrieval, and the lifecycle policy automates the transition and eventual deletion, minimizing costs without sacrificing availability.

Exam trap

The trap here is that candidates often choose CloudWatch Logs (Option A) because it is a familiar logging service, but they overlook that its cost model (per GB ingested, per GB stored, and per GB archived) can be significantly higher than S3 for long-term retention of large log volumes, and it lacks the automated tiering to lower-cost storage classes.

How to eliminate wrong answers

Option A is wrong because Amazon CloudWatch Logs is designed for real-time monitoring and log ingestion, not for long-term, cost-optimized archival storage; its retention policy only controls deletion, not tiered storage transitions, and costs can be higher than S3 for large volumes of rarely accessed logs. Option B is wrong because S3 Glacier Deep Archive is intended for data that is accessed at most once or twice a year and has retrieval times of 12 hours or more, making it unsuitable for logs that may need occasional access within 90 days; it also incurs minimum storage charges that make it cost-ineffective for short retention periods. Option C is wrong because EBS gp3 volumes attached to an EC2 instance incur compute costs even when idle, and managing log storage on block storage requires manual lifecycle management, leading to higher operational overhead and cost compared to a fully managed object storage solution.

32
Multi-Selectmedium

Which TWO actions are recommended for securing data at rest in Amazon S3? (Choose two.)

Select 2 answers
A.Enable default encryption on the S3 bucket using SSE-S3 or SSE-KMS.
B.Use S3 Bucket Key to reduce KMS request costs.
C.Enable S3 Versioning to protect against accidental deletions.
D.Apply a bucket policy that denies PutObject requests without the x-amz-server-side-encryption header.
E.Configure cross-region replication to replicate data to another bucket.
AnswersA, D

Ensures all new objects are encrypted automatically.

Why this answer

Option A is correct because enabling default encryption on an S3 bucket using SSE-S3 or SSE-KMS ensures that all objects stored in the bucket are encrypted at rest automatically, even if the upload request does not include encryption headers. This satisfies the requirement for securing data at rest by applying server-side encryption to every object written to the bucket.

Exam trap

The trap here is that candidates often confuse data protection features like Versioning or replication with encryption controls, but the question specifically asks for securing data at rest, which requires encryption mechanisms such as default encryption or policy-enforced encryption headers.
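
Option D's bucket policy can be sketched as follows. The bucket name `example-bucket` and the statement ID are hypothetical; the `Null` condition on `s3:x-amz-server-side-encryption` is the usual way to reject uploads that omit the encryption header.

```python
import json


def deny_unencrypted_puts(bucket):
    """Build a bucket policy that denies PutObject requests sent without
    the x-amz-server-side-encryption header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # "Null": "true" matches requests where the header is absent.
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy_json = json.dumps(deny_unencrypted_puts("example-bucket"), indent=2)
```

The resulting JSON string is what would be attached to the bucket as its policy document.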

33
MCQhard

A data engineer is designing a data lake on Amazon S3 for a healthcare organization that must comply with HIPAA regulations. The data includes protected health information (PHI) and must be encrypted at rest. The organization requires that all encryption keys be managed by AWS and rotated automatically every year. Additionally, the data must be replicated to another AWS Region for disaster recovery. Which combination of S3 features should the engineer use to meet these requirements?

A.Use SSE-S3 with S3 Same-Region Replication (SRR).
B.Use SSE-S3 with S3 Cross-Region Replication (CRR).
C.Use SSE-KMS with S3 Cross-Region Replication (CRR).
D.Use SSE-C with S3 Cross-Region Replication (CRR).
AnswerB

SSE-S3 provides AWS-managed keys with automatic rotation; CRR replicates to another region.

Why this answer

Option B is correct because SSE-S3 uses AWS-managed keys that are rotated automatically, and S3 Cross-Region Replication (CRR) replicates objects to another Region. Option C is incorrect because SSE-KMS is typically used with customer managed keys, which the scenario does not call for. Option A is incorrect because S3 Same-Region Replication does not replicate to another Region.

Option D is incorrect because SSE-C uses customer-provided keys, not AWS-managed.

34
MCQmedium

A company runs a data warehouse on Amazon Redshift. Queries are slow, and the team suspects data distribution is skewed. Which approach would best help identify distribution skew?

A.Check the STL_LOAD_ERRORS table for load failures
B.Query the SVV_TABLE_INFO table to see table size
C.Query the SVV_DISKUSAGE table to examine data distribution across slices
D.Review the WLM configuration in the parameter group
AnswerC

SVV_DISKUSAGE provides per-slice disk usage, helping identify skew.

Why this answer

Option C is correct because the SVV_DISKUSAGE table provides per-slice data distribution information, allowing you to identify skew by comparing the number of blocks allocated to each slice for a given table. In Amazon Redshift, data is distributed across slices based on the distribution key, and significant variation in block counts across slices indicates distribution skew, which can cause query performance degradation due to uneven workload distribution.

Exam trap

The trap here is that candidates confuse table-level metadata (SVV_TABLE_INFO) with slice-level distribution data (SVV_DISKUSAGE), assuming overall table size alone can reveal skew, when in fact only per-slice block counts expose uneven data distribution.

How to eliminate wrong answers

Option A is wrong because STL_LOAD_ERRORS records errors during COPY or INSERT operations, such as data type mismatches or malformed data, and has no relation to data distribution skew. Option B is wrong because SVV_TABLE_INFO shows overall table size, row count, and compression ratios, but it does not provide per-slice data distribution details needed to identify skew. Option D is wrong because WLM configuration in the parameter group manages query concurrency and memory allocation, not data distribution or skew detection.
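
Once per-slice block counts have been read from SVV_DISKUSAGE, skew can be summarized as a single ratio. The sketch below assumes the counts have already been aggregated into a dictionary keyed by slice number; the sample figures are made up.

```python
def skew_ratio(blocks_per_slice):
    """Ratio of the fullest slice to the average slice.

    1.0 means perfectly even distribution; larger values mean more skew.
    """
    counts = list(blocks_per_slice.values())
    average = sum(counts) / len(counts)
    return max(counts) / average


# Made-up block counts for a four-slice cluster.
even = {0: 100, 1: 100, 2: 100, 3: 100}
skewed = {0: 340, 1: 20, 2: 20, 3: 20}

even_ratio = skew_ratio(even)
skewed_ratio = skew_ratio(skewed)
```

In the skewed case one slice holds 3.4 times the average, so queries wait on that slice while the others sit idle.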

35
MCQhard

A data engineer is troubleshooting an Amazon DynamoDB table that has frequent throttling exceptions for write requests. The table has auto scaling enabled. What is the most likely cause?

A.The partition key is causing a hot partition
B.The table's read capacity is set too low
C.The table's auto scaling is disabled
D.The table is using global tables without conflict resolution
AnswerA

Hot partitions throttle even if overall capacity is sufficient.

Why this answer

Hot partition is a common cause where a single partition key receives a disproportionate amount of writes, exhausting that partition's capacity. Auto scaling adjusts total capacity, not partition-level distribution.

36
MCQmedium

A company is running a data warehouse on Amazon Redshift. The data engineering team notices that query performance has degraded over time. They suspect that data distribution is causing excessive data movement between nodes. The table is joined frequently on the customer_id column. Which column should be chosen as the distribution key to optimize join performance?

A.AUTO distribution
B.customer_id
C.order_date
D.EVEN distribution
AnswerB

Distributing on the join column reduces data movement.

Why this answer

The correct answer is B (customer_id) because Redshift distributes data across nodes based on the distribution key. When two tables are joined on customer_id, using it as the distribution key ensures that matching rows from both tables are co-located on the same node, eliminating the need for data redistribution (broadcast or shuffle) during the join. This minimizes network traffic and reduces query latency, directly addressing the performance degradation caused by excessive data movement.

Exam trap

The trap here is that candidates may choose EVEN distribution (D) thinking it balances data evenly, but they overlook that it causes maximum data movement for joins, while AUTO distribution (A) seems safe but does not guarantee co-location for the specific join column.

How to eliminate wrong answers

Option A (AUTO distribution) is wrong because AUTO lets Redshift choose the distribution style based on table size and usage patterns, but it may not guarantee co-location for frequent joins on customer_id, potentially still causing data movement. Option C (order_date) is wrong because it is not the join column; using it as the distribution key would scatter customer_id values across nodes, forcing redistribution for every join on customer_id. Option D (EVEN distribution) is wrong because it distributes rows round-robin across nodes without considering join keys, which maximizes data movement during joins on customer_id and degrades performance.
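
The co-location argument can be illustrated with a toy model. The sketch below uses CRC32 as a stand-in hash (Redshift's real hash function is internal), and the slice count and sample rows are hypothetical. It only shows that rows sharing a `customer_id` always land on the same slice when both tables are distributed on that column.

```python
import zlib

NUM_SLICES = 4  # hypothetical cluster size


def slice_for(value):
    """Stand-in for Redshift's internal distribution hash."""
    return zlib.crc32(str(value).encode("utf-8")) % NUM_SLICES


# Hypothetical sample data: 50 orders spread over 7 customers.
orders = [{"order_id": i, "customer_id": i % 7} for i in range(50)]
customers = [{"customer_id": c} for c in range(7)]

# Slice that holds each customer row when customers use DISTKEY(customer_id).
customer_slice = {
    c["customer_id"]: slice_for(c["customer_id"]) for c in customers
}

# Orders distributed on customer_id: every join partner is on the same slice.
local_joins = sum(
    slice_for(o["customer_id"]) == customer_slice[o["customer_id"]]
    for o in orders
)

# Orders distributed on order_id: some rows must move before they can join.
moved_rows = sum(
    slice_for(o["order_id"]) != customer_slice[o["customer_id"]]
    for o in orders
)
```

Here `local_joins` equals the number of orders, whereas `moved_rows` counts the orders that would need redistribution under a non-join distribution column.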

37
MCQhard

A data engineer created the IAM policy shown in the exhibit. The engineer then attempts to upload an object to 'my-bucket' using the AWS CLI with the command: aws s3 cp file.txt s3://my-bucket/ --sse aws:kms. The upload fails with an 'AccessDenied' error. What is the most likely cause?

A.The policy resource is incorrect
B.The policy requires SSE-S3 (AES256), but the command uses SSE-KMS
C.The policy does not allow the s3:PutObject action
D.The command is missing the --sse-customer-algorithm parameter
AnswerB

The condition mandates AES256, but the command uses aws:kms.

Why this answer

The IAM policy in the exhibit requires the `s3:x-amz-server-side-encryption` header to be set to `AES256`, which corresponds to SSE-S3. The AWS CLI command uses `--sse aws:kms`, which sets the header to `aws:kms` for SSE-KMS. This mismatch causes the request to fail the `s3:PutObject` condition check in the policy, resulting in an 'AccessDenied' error.

Exam trap

The trap here is that candidates may overlook the condition key in the policy and assume the error is due to a missing action or incorrect resource, rather than recognizing that the encryption header value must exactly match the policy's requirement.

How to eliminate wrong answers

Option A is wrong because the policy resource `arn:aws:s3:::my-bucket/*` correctly specifies the bucket and its objects, so the resource is not the issue. Option C is wrong because the policy does allow `s3:PutObject` via the `Effect: Allow` statement; the failure is due to the condition key mismatch, not a missing action.

Option D is wrong because `--sse-customer-algorithm` is used for SSE-C, not SSE-KMS or SSE-S3, and the command already specifies `--sse aws:kms` correctly for SSE-KMS.
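
The mismatch can be shown with a toy evaluation of the condition. This is a simplified stand-in for IAM's `StringEquals` check, not the real policy engine, and the header dictionaries are assumed representations of what each CLI flag sends.

```python
def put_allowed(required_sse, request_headers):
    """Toy StringEquals check on the x-amz-server-side-encryption header.

    Returns True only when the header is present and matches exactly.
    """
    return request_headers.get("x-amz-server-side-encryption") == required_sse


# Assumed header values for the two CLI variants.
kms_upload = {"x-amz-server-side-encryption": "aws:kms"}  # --sse aws:kms
s3_upload = {"x-amz-server-side-encryption": "AES256"}    # --sse AES256

kms_result = put_allowed("AES256", kms_upload)
s3_result = put_allowed("AES256", s3_upload)
```

Only the `AES256` request satisfies the condition; the `aws:kms` request is not matched by the Allow statement, so it is denied.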

38
MCQeasy

A data engineer is designing a data lake on AWS using Amazon S3. The data consists of CSV files generated by IoT devices. The data is accessed by multiple analytics jobs, and the engineer needs to ensure that new files are immediately visible to all consumers after writing. What S3 consistency model applies?

A.Consistent reads require S3 Object Lock.
B.Strong consistency for all operations.
C.Eventual consistency for all operations.
D.Read-after-write consistency for new object PUTS.
AnswerD

S3 provides read-after-write consistency for new objects, so they are immediately visible.

Why this answer

Amazon S3 provides read-after-write consistency for PUTs of new objects, so a newly written file is immediately readable by every consumer. Option A is wrong because S3 Object Lock is a retention control; no locking mechanism is needed to read new objects consistently. Option C is wrong because eventual consistency for overwrites and deletes was the model before December 2020 and no longer applies.

Option B is not the keyed answer, but note that since December 2020 S3 provides strong consistency for all read, write, and list operations, so Option B also describes current S3 behavior accurately.

39
MCQhard

A gaming company uses Amazon DynamoDB to store player profiles and game state. The table has a partition key of 'player_id' and no sort key. The table is provisioned with 5,000 RCUs and 5,000 WCUs. The application performs frequent reads and writes to update player scores. Recently, the company introduced a new feature that allows players to form guilds. The guild data is stored in a separate DynamoDB table with a partition key of 'guild_id'. The application often needs to retrieve all members of a guild. The data engineer is encountering high latency when querying the guild table because the guilds can have up to 100 members. The engineer wants to reduce latency without changing the application architecture. What should the data engineer do?

A.Increase the provisioned read and write capacity for the guild table to 10,000 RCUs and 10,000 WCUs.
B.Create a global secondary index (GSI) on the guild table with partition key guild_id and sort key member_id.
C.Enable DynamoDB Streams on the guild table and process the stream to populate a separate read table.
D.Use DynamoDB Accelerator (DAX) to cache the results of the guild queries.
AnswerB

The GSI allows efficient retrieval of all members of a guild by querying on guild_id.

Why this answer

Option B is correct because adding a global secondary index (GSI) on the guild table with guild_id as the partition key and member_id as the sort key allows efficient queries for all members of a guild. Option A is wrong because increasing capacity may not solve the access pattern issue. Option C is wrong because DynamoDB Streams are for change data capture, not for query optimization.

Option D is wrong because DAX caches read results; if the query pattern is inefficient, DAX won't help much.

40
Multi-Selectmedium

Which TWO statements are true about Amazon S3 bucket policies and ACLs?

Select 2 answers
A.When both exist, bucket policies are evaluated before ACLs.
B.ACLs are a legacy access control mechanism that is still supported.
C.ACLs can grant permissions to all authenticated AWS users.
D.ACLs support conditions such as IP address restrictions.
E.Bucket policies can grant access to users in other AWS accounts.
AnswersB, E

ACLs are older but still functional.

Why this answer

Option B is correct because ACLs (Access Control Lists) are indeed a legacy access control mechanism that Amazon S3 continues to support for backward compatibility. While bucket policies and IAM policies are the modern, recommended approach, ACLs can still be used to grant basic read/write permissions to AWS accounts or predefined groups like AllUsers or AuthenticatedUsers.

Exam trap

The trap here is that candidates confuse ACLs with bucket policies, assuming ACLs support advanced conditions like IP restrictions or that bucket policies and ACLs are evaluated in a strict order, when in fact ACLs are simplistic and the two are evaluated together, so an allow in either one can grant access unless an explicit deny applies.

41
MCQmedium

A data engineer reviewed the S3 lifecycle policy shown in the exhibit. The engineer notices that objects under the 'logs/' prefix are being deleted after 365 days. The business requirement is to retain logs for at least 5 years. What should the engineer change in the lifecycle policy?

A.Change the prefix to 'logs/archive/'
B.Set the expiration days to 1825
C.Change the transition to GLACIER on day 365
D.Remove the expiration action
AnswerB

1825 days equals 5 years.

Why this answer

The business requirement is to retain logs for at least 5 years, which is 1,825 days (5 × 365). The current lifecycle policy sets expiration to 365 days, causing premature deletion. By setting the expiration days to 1,825, the S3 lifecycle policy will delete objects under the 'logs/' prefix only after 5 years, meeting the retention requirement.

Exam trap

The trap here is that candidates may confuse transition actions (which change storage class) with expiration actions (which delete objects), or incorrectly assume that changing the prefix or removing expiration will meet the retention requirement without adjusting the day count.

How to eliminate wrong answers

Option A is wrong because changing the prefix to 'logs/archive/' would only apply the lifecycle rules to a different subset of objects, not fix the retention period for the original 'logs/' prefix. Option C is wrong because transitioning to GLACIER on day 365 only changes the storage class for cost optimization; it does not extend the deletion timeline, so objects would still be deleted after 365 days. Option D is wrong because removing the expiration action entirely would mean objects are never automatically deleted, which may lead to indefinite storage and increased costs, not a 5-year retention.

42
Multi-Selecteasy

A company is building a data pipeline that ingests streaming data from IoT devices. The data must be stored in a durable, scalable, and cost-effective manner for batch processing. Which TWO AWS services should be used together?

Select 2 answers
A.Amazon ElastiCache
B.Amazon Kinesis Data Streams
C.Amazon Redshift
D.Amazon DynamoDB
E.Amazon S3
AnswersB, E

Ingests streaming data in real-time.

Why this answer

Options B and E are correct. Kinesis Data Streams ingests streaming data, and S3 stores the data durably and cost-effectively. Option C (Redshift) is for analytics, not raw storage.

Option D (DynamoDB) is for low-latency queries, not cost-effective bulk storage. Option A (ElastiCache) is a cache, not durable storage.

43
Multi-Selecteasy

A data engineer needs to migrate an on-premises MongoDB database to AWS. The migration must have minimal downtime and support automatic scaling. Which TWO AWS services should the engineer use for the target data store? (Choose TWO.)

Select 2 answers
A.Amazon Redshift
B.Amazon RDS for MySQL
C.Amazon DocumentDB (with MongoDB compatibility)
D.Amazon DynamoDB
E.Amazon S3
AnswersC, D

DocumentDB is compatible with MongoDB workloads.

Why this answer

Options C and D are correct. Amazon DocumentDB is MongoDB-compatible; Amazon DynamoDB is a NoSQL database that can handle document-like data and supports auto scaling. Option B (RDS for MySQL) is relational; Option A (Redshift) is for analytics; Option E (S3) is object storage, not a database.

44
MCQmedium

A company is using Amazon S3 to store large amounts of archival data. The data is accessed infrequently but must be immediately retrievable when needed. Which storage class is the most cost-effective choice?

A.S3 Standard
B.S3 Standard-IA
C.S3 Glacier Deep Archive
D.S3 Intelligent-Tiering
AnswerB

Designed for infrequently accessed data with immediate retrieval.

Why this answer

S3 Standard-IA is designed for infrequently accessed data that requires rapid access. S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive have retrieval delays. S3 Standard is more expensive for infrequent access.

45
Multi-Selectmedium

A company is designing a data lake on Amazon S3. Which TWO strategies improve query performance for Amazon Athena?

Select 2 answers
A.Enable S3 Versioning on the bucket.
B.Use server-side encryption with AWS KMS (SSE-KMS).
C.Partition the data by frequently queried columns such as date or region.
D.Use columnar file formats like Parquet or ORC.
E.Store data in CSV format with header rows.
AnswersC, D

Partitioning prunes the data scanned.

Why this answer

Partitioning data by frequently queried columns (e.g., date or region) allows Athena to prune the data scanned by only reading the relevant partitions, reducing the amount of data scanned and improving query performance. This is a core optimization for Athena, which charges based on data scanned and performs better with less I/O.

Exam trap

The trap here is that candidates often confuse data management features (like versioning or encryption) with performance optimizations, or assume that simpler formats like CSV are sufficient for analytics, ignoring the significant performance benefits of partitioning and columnar storage.
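
Partitioning in practice means writing objects under Hive-style `column=value` prefixes so that Athena can skip whole prefixes. The sketch below builds such an object key; the table prefix, column names, and file name are hypothetical.

```python
def partition_key(table_prefix, record, partition_columns, filename):
    """Build a Hive-style partitioned S3 object key for one record's file."""
    parts = [f"{col}={record[col]}" for col in partition_columns]
    return "/".join([table_prefix.rstrip("/"), *parts, filename])


# Hypothetical record and file name, used only for illustration.
key = partition_key(
    "sales",
    {"date": "2024-01-15", "region": "eu-west-1", "amount": 10},
    ["date", "region"],
    "part-0000.parquet",
)
# key == "sales/date=2024-01-15/region=eu-west-1/part-0000.parquet"
```

A query filtered on `date` and `region` would then only need to read objects under the matching prefix, and storing the file as Parquet lets Athena read just the columns it needs.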

46
Multi-Selecthard

A company is using Amazon Redshift for its data warehouse. The data engineering team needs to improve query performance for a large fact table that is frequently joined with multiple dimension tables. Which THREE strategies should be considered?

Select 3 answers
A.Define sort keys on columns used in WHERE clauses.
B.Use DISTSTYLE EVEN to distribute data evenly.
C.Increase the number of nodes in the cluster.
D.Choose an appropriate distribution key based on join columns.
E.Apply columnar compression to reduce storage and I/O.
AnswersA, D, E

Improves filter efficiency.

Why this answer

Option A is correct because defining sort keys on columns used in WHERE clauses allows Amazon Redshift to use zone maps to skip large blocks of data that do not satisfy the filter condition, dramatically reducing the amount of data scanned. This is especially effective for large fact tables where selective filters can prune entire disk blocks, improving query performance without additional hardware.

Exam trap

The trap here is that candidates often assume DISTSTYLE EVEN is always the best choice for performance, but for frequently joined fact tables, a distribution key aligned with the join columns is critical to avoid network-heavy data shuffling.
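
The block-skipping behavior behind sort keys can be illustrated with a toy model. The sketch below is not Redshift code: it assumes a simplified zone map that stores only the minimum and maximum value per block, and the `Block`, `build_blocks`, and `scan_equal` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    min_value: int
    max_value: int
    rows: list[int]


def build_blocks(values: list[int], block_size: int) -> list[Block]:
    """Split values into fixed-size blocks and record min/max for each block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(min(chunk), max(chunk), chunk))
    return blocks


def scan_equal(blocks: list[Block], target: int) -> tuple[list[int], int]:
    """Return matching rows and the number of blocks that had to be read."""
    matches, blocks_read = [], 0
    for block in blocks:
        # The zone map lets the scan skip blocks whose range excludes the target.
        if block.min_value <= target <= block.max_value:
            blocks_read += 1
            matches.extend(v for v in block.rows if v == target)
    return matches, blocks_read


sorted_values = list(range(100))                        # data stored in sort-key order
shuffled_values = [(i * 37) % 100 for i in range(100)]  # same values, no useful order

sorted_matches, sorted_reads = scan_equal(build_blocks(sorted_values, 10), 42)
shuffled_matches, shuffled_reads = scan_equal(build_blocks(shuffled_values, 10), 42)
print(sorted_reads, shuffled_reads)
```

With sorted data, each block covers a narrow value range, so a selective filter reads a single block; with unordered data, most blocks span a wide range and cannot be skipped.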

47
Multi-Selecthard

A company stores sensitive data in Amazon S3. The security team requires encryption at rest and that the encryption keys are managed by the company using AWS KMS. The data is frequently accessed by multiple AWS services. Which THREE steps should be taken to meet these requirements?

Select 3 answers
A.Use client-side encryption with the KMS key before uploading to S3
B.Configure the KMS key policy to allow the necessary AWS services to use the key for decryption
C.Enable default encryption on the S3 bucket using SSE-S3
D.Create a bucket policy that denies s3:PutObject if the object is not encrypted with SSE-KMS
E.Enable default encryption on the S3 bucket using SSE-KMS
AnswersB, D, E

Services must have decrypt permissions to access the encrypted objects.

Why this answer

Option B is correct because the security team requires that encryption keys be managed by the company using AWS KMS, and that multiple AWS services can access the data. To allow those services to decrypt objects encrypted with a customer-managed KMS key, the KMS key policy must explicitly grant the necessary AWS services (e.g., AWS Lambda, Amazon Athena) permission to use the key for decryption (kms:Decrypt). Without this policy, even if the bucket is configured for SSE-KMS, the services will fail to read the encrypted objects.

Exam trap

AWS often tests the distinction between enforcing encryption (bucket policy) and enabling access to encrypted data (KMS key policy), leading candidates to overlook the KMS key policy step when multiple services need to decrypt objects.
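
The enforcement half of this answer (Option D) is commonly expressed as a bucket policy with a Deny statement. The sketch below builds such a policy as a Python dictionary; the bucket name and the `deny_non_kms_uploads` helper are hypothetical, and the policy is a minimal illustration rather than a complete production policy.

```python
import json


def deny_non_kms_uploads(bucket: str) -> dict:
    """Build a bucket policy that rejects PutObject requests not using SSE-KMS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutSSEKMS",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Deny when the request's encryption header is anything but aws:kms.
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


# "example-sensitive-data" is a placeholder bucket name.
policy = deny_non_kms_uploads("example-sensitive-data")
print(json.dumps(policy, indent=2))
```

The KMS key policy (Option B) is a separate document attached to the key itself; this bucket policy only governs how objects are written.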

48
MCQmedium

A data engineer runs the above AWS CLI command and receives the output. The object is part of an S3 Lifecycle policy that transitions objects to Glacier Instant Retrieval after 30 days. The object was created on January 1, 2023. Why is the object still in STANDARD_IA storage class?

A.The Lifecycle policy has a filter that excludes this object's prefix
B.Versioning is enabled and the current version is not the oldest
C.The object has not reached the transition age of 30 days yet
D.The metadata timestamp is used for lifecycle transitions instead of LastModified
AnswerC

The object is not yet 30 days old, measured from its LastModified date.

Why this answer

Option C is correct because the S3 Lifecycle rule transitions objects to Glacier Instant Retrieval only after 30 days, and the date implied by the command output is fewer than 30 days after the object was created on January 1, 2023. The transition age is calculated from the object's LastModified date, not from any other timestamp, and the object must be at least 30 days old before S3 applies the transition. Until it reaches that threshold, it remains in STANDARD_IA.

Exam trap

The trap here is that candidates assume the object's current storage class (STANDARD_IA) means the 30-day transition to Glacier Instant Retrieval has already failed or been misconfigured, when in fact the object simply hasn't aged enough yet.

How to eliminate wrong answers

Option A is wrong because nothing in the scenario indicates that the rule's filter excludes this object's prefix; the object's age alone explains why it has not transitioned. (Its STANDARD_IA class does not prove a lifecycle transition either, since S3 only transitions objects to STANDARD_IA after at least 30 days, so an object younger than that must have been uploaded with that class.) Option B is wrong because versioning being enabled does not prevent lifecycle transitions of the current version, whose age is based on its own LastModified date, not on the oldest version. Option D is wrong because S3 Lifecycle transitions are based on the object's LastModified date, not on any metadata timestamp; the LastModified field is the authoritative timestamp for age calculations.
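
The age check can be written out as a short calculation. This is a simplified sketch: it counts whole days since LastModified and ignores that S3 rounds the transition time to the next midnight UTC. The helper names and the two check dates are hypothetical and are not taken from the command output.

```python
from datetime import datetime, timezone


def days_old(last_modified: datetime, now: datetime) -> int:
    """Whole days elapsed since the object's LastModified timestamp."""
    return (now - last_modified).days


def eligible_for_transition(last_modified: datetime, now: datetime,
                            transition_days: int = 30) -> bool:
    """True once the object has reached the rule's transition age."""
    return days_old(last_modified, now) >= transition_days


last_modified = datetime(2023, 1, 1, tzinfo=timezone.utc)

# Hypothetical check dates: one before and one after the 30-day threshold.
early_check = datetime(2023, 1, 21, tzinfo=timezone.utc)
late_check = datetime(2023, 2, 1, tzinfo=timezone.utc)

print(days_old(last_modified, early_check), eligible_for_transition(last_modified, early_check))
print(days_old(last_modified, late_check), eligible_for_transition(last_modified, late_check))
```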

49
MCQeasy

A company uses Amazon RDS for PostgreSQL. The data engineer needs to ensure that the database is automatically backed up and that backups are retained for 35 days. What is the simplest way to achieve this?

A.Use AWS Backup to schedule daily backups with a 35-day retention.
B.Enable automated backups with a retention period of 35 days in the RDS instance configuration.
C.Create a manual snapshot every day and delete them after 35 days using a script.
D.Enable automatic export of transaction logs to Amazon S3 and use S3 lifecycle policies.
AnswerB

RDS automated backups run daily and retain backups for the specified period, up to 35 days.

Why this answer

Amazon RDS for PostgreSQL allows you to enable automated backups directly in the instance configuration. By setting the backup retention period to 35 days, RDS automatically performs daily snapshots and retains transaction logs for point-in-time recovery within that window. This is the simplest method because it requires no external services or custom scripting.

Exam trap

The trap here is that candidates may overcomplicate the solution by choosing AWS Backup (Option A) or manual scripting (Option C), not realizing that RDS native automated backups already provide the simplest, fully managed way to achieve the required retention period.

How to eliminate wrong answers

Option A is wrong because AWS Backup is an additional service that adds complexity and cost; RDS native automated backups already support retention up to 35 days without needing AWS Backup. Option C is wrong because manual snapshots require custom scripting to create and delete daily, which is not the simplest approach and does not provide automated point-in-time recovery. Option D is wrong because automatic export of transaction logs to S3 is not a native RDS feature for PostgreSQL; RDS handles transaction logs internally for point-in-time recovery, and using S3 lifecycle policies would not replace the need for automated backups.

50
MCQmedium

A company is using Amazon RDS for PostgreSQL and wants to minimize downtime during a major version upgrade. Which approach best meets this requirement?

A.Perform an in-place upgrade using the AWS Management Console.
B.Modify the DB instance class to a larger size to handle the upgrade.
C.Create a read replica of the current instance, upgrade the replica, and then promote it to primary.
D.Use pg_dump and pg_restore to migrate data to a new upgraded instance.
AnswerC

This minimizes downtime by switching over after upgrade.

Why this answer

Option C is correct because creating a read replica, upgrading it to the new major version, and then promoting it to primary minimizes downtime by allowing the replica to be upgraded while the original instance remains operational. The promotion process is fast, typically taking only a few seconds to redirect traffic, and avoids the longer downtime associated with in-place upgrades or full data migrations.

Exam trap

The trap here is that candidates often assume an in-place upgrade (Option A) is the simplest and fastest method, but they overlook the fact that major version upgrades in RDS PostgreSQL require a longer downtime window due to the need for a database restart and potential compatibility checks, making the read replica promotion strategy the superior choice for minimizing downtime.

How to eliminate wrong answers

Option A is wrong because an in-place major version upgrade for RDS PostgreSQL requires a database restart and can take significant time (often 10-30 minutes or more) depending on instance size and data volume, leading to unacceptable downtime. Option B is wrong because modifying the DB instance class to a larger size does not perform a version upgrade; it only changes compute and memory resources, leaving the PostgreSQL version unchanged. Option D is wrong because using pg_dump and pg_restore involves exporting the entire database to a file and then importing it into a new instance, which can take hours for large datasets and requires the source database to be unavailable or read-only during the process, resulting in extended downtime.

51
Multi-Selecthard

A company is migrating an on-premises Apache Hadoop cluster to Amazon EMR. The data is stored in HDFS and must be moved to Amazon S3. Which THREE considerations are important when designing the migration? (Choose THREE.)

Select 3 answers
A.S3 supports POSIX file system semantics
B.HDFS can be directly mounted as an S3 bucket
C.EMR can read data directly from S3 using EMRFS
D.S3 provides eventual consistency for overwrite PUTS and DELETES
E.Using S3 as the data store allows independent scaling of compute and storage
AnswersC, D, E

EMRFS allows EMR to access S3 as a filesystem.

Why this answer

Option C is correct because Amazon EMR uses the EMR File System (EMRFS) to directly read and write data stored in Amazon S3, treating S3 as a scalable, durable data lake without needing to first copy data into HDFS. This allows EMR clusters to process data directly from S3, enabling decoupled compute and storage.

Exam trap

The trap here is that candidates often assume S3 supports POSIX semantics or that HDFS can be directly mounted as an S3 bucket, confusing the object storage model with a traditional filesystem, leading them to select options A or B.

52
Multi-Selecteasy

A data engineer is setting up an Amazon RDS for MySQL database. The database must be highly available and automatically failover in case of an AZ outage. Which TWO configurations should the engineer enable? (Choose TWO.)

Select 2 answers
A.Multi-AZ deployment
B.A DB subnet group with subnets in at least two Availability Zones
C.Enhanced Monitoring
D.Automated backups with a retention period of 30 days
E.Read replicas in a different Region
AnswersA, B

Multi-AZ creates a standby instance in a different Availability Zone for automatic failover.

Why this answer

Multi-AZ deployment (Option A) automatically provisions and maintains a synchronous standby replica in a different Availability Zone (AZ). In the event of an AZ outage, Amazon RDS automatically fails over to the standby, ensuring high availability with minimal downtime. This is the core mechanism for automatic failover in RDS for MySQL.

Exam trap

The trap here is that candidates often confuse read replicas or automated backups with high availability failover, but only Multi-AZ deployment provides automatic, synchronous failover within the same region.

53
MCQmedium

A company uses Amazon Redshift for analytics. The data engineer notices that queries are slow due to many small inserts. Which technique would improve write performance?

A.Use the COPY command to load data from Amazon S3.
B.Define DISTKEY and SORTKEY on the table.
C.Increase the number of nodes in the cluster.
D.Configure workload management (WLM) queues.
AnswerA

Bulk loading is more efficient than small inserts.

Why this answer

The COPY command is the recommended way to load data into Amazon Redshift because it performs bulk inserts in parallel across all nodes, leveraging the cluster's distributed architecture. Small individual INSERT statements cause high overhead due to transaction logging and commit processing, leading to slow write performance. By loading data from Amazon S3 using COPY, you bypass these per-row overheads and achieve optimal throughput.

Exam trap

The trap here is that candidates often confuse performance tuning for reads (DISTKEY/SORTKEY) or general scaling (adding nodes) with the specific write performance bottleneck caused by many small inserts, overlooking the COPY command as the primary solution for bulk data loading.

How to eliminate wrong answers

Option B is wrong because defining DISTKEY and SORTKEY improves query read performance by optimizing data distribution and sort order, but does not directly address the write performance issue caused by many small inserts. Option C is wrong because increasing the number of nodes adds compute and storage capacity, but does not solve the fundamental problem of per-insert overhead; small inserts will still be slow on a larger cluster. Option D is wrong because configuring workload management (WLM) queues manages concurrency and prioritizes queries, but does not reduce the overhead of individual small INSERT statements.

54
Multi-Selecteasy

A company is migrating a MySQL database to Amazon RDS for MySQL. The database is 2 TB in size and the company can only afford minimal downtime. The migration must be secure and use AWS DMS. Which TWO configuration steps are required? (Choose TWO.)

Select 2 answers
A.Create a source endpoint pointing to the on-premises MySQL database.
B.Create a target endpoint pointing to the Amazon RDS instance.
C.Configure SSL/TLS encryption for the DMS endpoints.
D.Set up VPC peering between the on-premises network and the Amazon VPC.
E.Create a DMS replication instance.
AnswersA, B

A source endpoint is needed for DMS to connect to the source.

Why this answer

AWS DMS requires a source endpoint (the source database connection) and a target endpoint (the RDS instance). Option A is the source endpoint, and Option B is the target endpoint. The other options are not treated as required here: SSL/TLS is optional but recommended, the replication instance is created by DMS, and VPC peering connects VPCs to each other rather than an on-premises network to a VPC.

55
Multi-Selectmedium

A company is designing a data lake on Amazon S3. The data engineering team needs to implement a lifecycle policy to manage costs. Which TWO actions should be taken to reduce storage costs?

Select 2 answers
A.Transition objects to S3 Glacier Deep Archive after 90 days.
B.Transition objects to S3 One Zone-IA after 30 days.
C.Enable S3 Intelligent-Tiering.
D.Transition objects to S3 Standard-IA after 30 days.
E.Delete incomplete multipart uploads after 7 days.
AnswersA, E

Deep Archive is lowest cost for rarely accessed data.

Why this answer

Option A is correct because transitioning objects to S3 Glacier Deep Archive after 90 days significantly reduces storage costs for data that is rarely accessed and can tolerate retrieval times of up to 12 hours. This lifecycle policy is a standard cost-optimization strategy for data lakes where historical or cold data does not require immediate access.

Exam trap

The trap here is that candidates often choose S3 Intelligent-Tiering or S3 Standard-IA as cost-saving measures without considering that the question specifically asks for lifecycle policy actions to reduce costs, and that Glacier Deep Archive and deleting incomplete multipart uploads are the most direct and effective actions for a data lake scenario.
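
The two lifecycle actions can be sketched as a configuration dictionary. The structure below follows the shape that the S3 lifecycle API expects, but the rule IDs, the empty prefix, and the `data_lake_lifecycle` helper are hypothetical; treat it as an illustration rather than a vetted production configuration.

```python
def data_lake_lifecycle(prefix: str = "") -> dict:
    """Build a lifecycle configuration with the two cost-saving rules."""
    return {
        "Rules": [
            {
                "ID": "archive-after-90-days",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Move objects to the lowest-cost archival class after 90 days.
                "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
            },
            {
                "ID": "abort-stale-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Clean up parts of uploads that were never completed.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
        ]
    }


config = data_lake_lifecycle()
# With boto3, a dictionary like this could be passed as the
# LifecycleConfiguration argument of put_bucket_lifecycle_configuration.
print(config)
```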

56
MCQeasy

A company uses Amazon DynamoDB as its primary data store for a web application. The application experiences high latency during peak hours. The data engineer notices that the table has a large number of items with the same partition key. Which DynamoDB feature should the engineer use to improve performance?

A.Redesign the partition key to use a composite key that includes a timestamp or random suffix.
B.Enable DynamoDB Accelerator (DAX) to cache read requests.
C.Create a global table to replicate data across multiple Regions.
D.Enable auto scaling on the table to increase write capacity.
AnswerA

A well-designed partition key prevents hot spots by distributing writes evenly.

Why this answer

The high latency is caused by a hot partition, where many items share the same partition key, overwhelming a single DynamoDB partition. Redesigning the partition key to include a timestamp or random suffix distributes the workload evenly across partitions, improving throughput and reducing latency. This directly addresses the root cause of the performance issue.

Exam trap

The trap here is that candidates often confuse caching solutions (DAX) or scaling mechanisms (auto scaling) with the need to fix the data model itself, which is the only way to resolve a hot partition caused by a skewed partition key.

How to eliminate wrong answers

Option B is wrong because DynamoDB Accelerator (DAX) caches read requests, which can reduce read latency but does not solve the underlying hot partition issue caused by skewed write or read traffic on a single partition key. Option C is wrong because creating a global table replicates data across multiple Regions for disaster recovery or low-latency global access, but it does not distribute load within a single table's partitions. Option D is wrong because enabling auto scaling increases the table's provisioned capacity, but if the workload is concentrated on one partition, the partition's throughput limit (3000 RCU or 1000 WCU) will still be exceeded, causing throttling and high latency.
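
One way to spread a hot partition key is to append a shard suffix. The sketch below uses a hash-derived suffix instead of the random suffix named in Option A, so that the same item always maps to the same shard; the `#` separator, the shard count of 10, and the helper names are assumptions for illustration.

```python
import hashlib


def sharded_key(base_key: str, item_id: str, shards: int = 10) -> str:
    """Derive a partition key like 'customer-42#7' from a stable hash of the item."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    suffix = int.from_bytes(digest[:4], "big") % shards
    return f"{base_key}#{suffix}"


def all_shard_keys(base_key: str, shards: int = 10) -> list[str]:
    """Every partition key a reader must query to see all items for base_key."""
    return [f"{base_key}#{i}" for i in range(shards)]


# Writes for one hot logical key are spread across several physical keys.
keys_used = {sharded_key("customer-42", f"order-{n}") for n in range(1000)}
print(len(keys_used), "distinct partition keys used")
```

The trade-off is on the read side: fetching everything for the logical key now means querying each shard key returned by `all_shard_keys` and merging the results.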

57
MCQeasy

A company needs to store JSON documents that are accessed by a key-value pattern. The data is 500 GB and requires single-digit millisecond latency. Which AWS database is most suitable?

A.Amazon Redshift
B.Amazon DynamoDB
C.Amazon Neptune
D.Amazon RDS for MySQL
AnswerB

DynamoDB is a NoSQL key-value and document database with low latency.

Why this answer

Option B is correct because DynamoDB supports document store and key-value access with low latency. Option D is wrong because RDS is relational. Option C is wrong because Neptune is graph.

Option A is wrong because Redshift is analytical.

58
Multi-Selecteasy

Which TWO methods can be used to enforce least-privilege access to an Amazon S3 bucket? (Choose two.)

Select 2 answers
A.Use IAM policies to grant specific permissions to users and roles.
B.Set bucket ACLs to allow full control to the bucket owner only.
C.Use an S3 bucket policy that explicitly denies actions not required.
D.Configure a VPC endpoint to restrict access to the bucket.
E.Generate pre-signed URLs for all access.
AnswersA, C

IAM policies allow granular permissions.

Why this answer

Options A and C are correct. IAM policies define user permissions, and bucket policies control bucket-level access. Option D (VPC endpoint) is for network security.

Option E (pre-signed URLs) grants temporary access but not least-privilege enforcement. Option B (ACLs) is legacy and not recommended.

59
Multi-Selectmedium

Which TWO actions can help improve query performance in Amazon Redshift? (Choose two.)

Select 2 answers
A.Use appropriate sort keys for tables.
B.Disable SSL encryption for connections.
C.Use VARCHAR instead of CHAR for fixed-length strings.
D.Apply compression encodings to columns.
E.Increase the number of nodes in the cluster.
AnswersA, D

Sort keys help the query optimizer scan less data.

Why this answer

Option A is correct because defining appropriate sort keys in Amazon Redshift enables the query optimizer to use zone maps to skip irrelevant data blocks during table scans, significantly reducing the amount of data read from disk. Sort keys also improve the effectiveness of merge joins and the performance of range-restricted queries by physically co-locating rows with similar sort key values on disk.

Exam trap

The trap here is that candidates often assume scaling out (adding nodes) always speeds up individual queries, but in Redshift, query performance is more dependent on data layout (sort keys, distribution, compression) than on cluster size, and adding nodes primarily benefits concurrent workloads rather than single-query latency.

60
MCQhard

A data engineering team is designing a data lake on Amazon S3. They need to store raw data in its original format and transformed data in Parquet. The data is accessed by multiple analytics services, including Amazon Athena and Amazon Redshift Spectrum. Compliance requirements mandate that all data be encrypted at rest with AWS KMS and that the encryption keys be rotated every 90 days. Which S3 bucket configuration meets these requirements?

A.Use SSE-KMS with a customer-managed KMS key that has automatic key rotation enabled.
B.Use SSE-C with client-managed keys and rotate them manually.
C.Use a bucket policy to enforce encryption and rely on default S3 encryption.
D.Use SSE-S3 with default encryption enabled.
AnswerA

SSE-KMS with automatic rotation meets compliance requirements.

Why this answer

Option A is correct because SSE-KMS with a customer-managed KMS key lets you enable automatic key rotation under your own control, which meets the 90-day rotation requirement. When automatic rotation is enabled, AWS KMS rotates customer-managed keys annually by default, and the rotation period can be set to a custom value (between 90 and 2,560 days), so a 90-day period can be configured. This setup ensures raw and Parquet data are encrypted at rest, and the key rotation satisfies compliance mandates.

Exam trap

The trap here is that candidates assume SSE-S3 or default encryption meets the rotation requirement because AWS rotates keys automatically, but they overlook the need for a configurable 90-day rotation period, which only SSE-KMS with a customer-managed key can support.

How to eliminate wrong answers

Option B is wrong because SSE-C requires you to manage and rotate encryption keys client-side, which adds operational overhead and does not integrate with AWS KMS for automated rotation; manual rotation every 90 days is possible but not automated, and it violates the requirement to use AWS KMS. Option C is wrong because relying on default S3 encryption (SSE-S3) uses S3-managed keys that cannot be rotated on a 90-day schedule; AWS rotates SSE-S3 keys annually, but you have no control over the rotation frequency. Option D is wrong because SSE-S3 does not support customer-controlled key rotation; it uses S3-managed keys with automatic rotation by AWS, but the rotation period is not configurable and does not meet the 90-day requirement.

61
MCQhard

A company runs a real-time analytics platform on Amazon ECS that ingests streaming data from Amazon Kinesis Data Streams, processes it, and stores results in Amazon DynamoDB. The data volume spikes unpredictably, causing DynamoDB to throttle write requests. The application uses on-demand capacity mode. The data engineer notices that the throttling occurs on a specific partition due to a hot key. The hot key is a customer ID that receives a disproportionate number of writes. The application cannot change the partition key design immediately. The engineer needs to reduce throttling while maintaining low latency. Which solution is most effective?

A.Switch to provisioned capacity with auto scaling and increase the write capacity units.
B.Implement a write buffer using Amazon SQS, and have consumers write to DynamoDB at a controlled rate.
C.Enable DynamoDB Accelerator (DAX) to cache the hot key writes.
D.Use DynamoDB Streams to trigger a Lambda function that retries throttled writes.
AnswerB

SQS decouples the producers from the writes, allowing batch processing and reducing throttling.

Why this answer

Option B is correct because buffering writes through Amazon SQS decouples the ingestion rate from DynamoDB's capacity, allowing consumers to write at a controlled pace. This directly mitigates throttling on the hot key without requiring a partition key redesign, and SQS provides low-latency, durable buffering suitable for real-time analytics.

Exam trap

The trap here is that candidates often assume on-demand capacity eliminates all throttling, but it does not protect against hot key skew; they may also confuse DAX's read caching with write buffering, or think retrying throttled writes is a viable solution rather than a reactive fix that increases latency.

How to eliminate wrong answers

Option A is wrong because switching to provisioned capacity with auto scaling does not solve the hot key issue; throttling occurs on a specific partition regardless of total capacity, and increasing write capacity units would not prevent a single partition from exceeding its 1,000 WCU limit. Option C is wrong because DAX is a caching layer for reads, not writes; it cannot buffer or absorb write throttling on a hot key. Option D is wrong because using DynamoDB Streams to retry throttled writes introduces latency and does not prevent throttling; it only retries failed writes, which can lead to backlog and increased latency, not a controlled rate.

62
MCQhard

A data engineer is designing a data lake on Amazon S3. The data is partitioned by year, month, day, and hour. The engineer needs to ensure that queries using Amazon Athena are cost-effective and performant. The data is written in Parquet format, and the total volume is 50 TB. Which approach minimizes query costs?

A.Use AWS Glue Data Catalog to catalog the data
B.Convert data to CSV format
C.Partition the data by year, month, day, and hour
D.Use S3 Intelligent-Tiering storage class
AnswerC

Partitioning allows Athena to scan only relevant partitions, reducing cost.

Why this answer

Option C is correct because partitioning by year, month, day, and hour allows Athena to use partition pruning, reading only the relevant S3 prefixes instead of scanning the entire 50 TB dataset. This drastically reduces the amount of data scanned per query, which directly lowers query costs (Athena charges per TB scanned). The existing Parquet format further optimizes performance through columnar storage and compression.

Exam trap

AWS often tests the misconception that simply cataloging data (Option A) or using a storage tier (Option D) directly improves query performance, when in fact only partitioning and efficient file formats reduce the data scanned by Athena.

How to eliminate wrong answers

Option A is wrong because using AWS Glue Data Catalog to catalog the data is a prerequisite for Athena to query the data, but it does not by itself reduce query costs or improve performance; it only provides schema and partition metadata. Option B is wrong because converting data to CSV format would increase the amount of data scanned (CSV is not columnar and lacks compression compared to Parquet), leading to higher query costs and slower performance. Option D is wrong because S3 Intelligent-Tiering is a storage class that optimizes storage costs based on access patterns, but it has no impact on Athena query costs or performance, which depend on data format and partitioning, not storage tier.

63
MCQmedium

A financial services company uses Amazon Redshift for its data warehouse. The cluster has two nodes and is used for complex analytical queries. The company recently migrated from a single-node cluster to a two-node cluster to improve performance. After the migration, the data engineer notices that query performance has not improved as expected. Some queries are even slower than before. The engineer checks the workload management (WLM) queue configuration and sees that there is only one queue with a concurrency level of 5. The queries are mostly large scans and aggregations. The cluster's CPU utilization is low, but disk I/O is high. What should the data engineer do to improve query performance?

A.Apply compression to the tables to reduce the amount of data scanned.
B.Increase the concurrency level in the WLM queue to allow more queries to run simultaneously.
C.Add more nodes or upgrade to a larger node type to increase memory and reduce disk spills.
D.Change the distribution style of large tables to DISTSTYLE ALL to avoid data redistribution.
AnswerC

More memory reduces disk I/O by allowing intermediate results to stay in memory.

Why this answer

Option C is correct because high disk I/O and low CPU utilization indicate that the cluster is spilling to disk due to insufficient memory. Increasing the number of nodes or upgrading to a larger node type (e.g., dc2.large to dc2.8xlarge) increases memory. Option B is wrong because increasing concurrency would increase contention.

Option D is wrong because distribution style is unlikely to be the main issue. Option A is wrong because manual compression is not needed; Redshift automatically applies compression.

64
MCQhard

An IAM policy is attached to a role assumed by authenticated users via Amazon Cognito. What does this policy allow?

A.Users can read and write items in the Orders table where the partition key matches their Cognito identity ID.
B.Users can read any item in the Orders table using GetItem and Query.
C.Users can scan the entire Orders table but only if they use a filter expression.
D.Users can read items in the Orders table only if the partition key matches their Cognito identity ID.
AnswerD

The LeadingKeys condition restricts based on the partition key equal to the Cognito sub.

Why this answer

Option D is correct because the policy uses the 'dynamodb:LeadingKeys' condition to restrict access to items where the partition key equals the Cognito identity ID (sub). This provides fine-grained access control. Option A is wrong because the policy does not allow writes.

Option B is wrong because the LeadingKeys condition limits reads to the user's own items rather than any item in the table. Option C is wrong because the policy allows only GetItem and Query, not Scan.

65
MCQeasy

A company uses Amazon DynamoDB as the primary data store for a web application. The application experiences occasional throttling on write requests. The data engineer needs to implement a solution that handles throttling gracefully without losing data. Which approach should the engineer use?

A.Increase the provisioned write capacity to a higher value
B.Use an Amazon SQS queue to buffer write requests before sending to DynamoDB
C.Implement exponential backoff in the application's write retry logic
D.Enable DynamoDB Accelerator (DAX) to cache writes
AnswerC

Exponential backoff is a best practice to handle throttling effectively.

Why this answer

Option C is correct because implementing exponential backoff in the application's write retry logic is the standard AWS-recommended approach for handling DynamoDB throttling (ProvisionedThroughputExceededException). Exponential backoff gradually increases the wait time between retries, reducing the retry rate and allowing the throttling condition to subside, while ensuring no write data is lost as long as the retries eventually succeed. This approach is lightweight, requires no additional AWS services, and aligns with best practices for building resilient applications against DynamoDB throttling.

Exam trap

The trap here is that candidates often confuse DAX as a write cache or assume SQS is the only way to buffer writes, but the question specifically asks for handling throttling gracefully without losing data, and exponential backoff is the direct, built-in mechanism for retrying throttled requests in DynamoDB.

How to eliminate wrong answers

Option A is wrong because simply increasing provisioned write capacity may reduce throttling but does not handle throttling gracefully when it occurs; it also incurs higher costs and does not address the root cause of occasional spikes. Option B is wrong because using an SQS queue to buffer write requests introduces eventual consistency and potential data loss if the queue messages expire or are not processed before the DynamoDB write; it also adds complexity and latency, and is not the standard pattern for handling DynamoDB throttling directly. Option D is wrong because DynamoDB Accelerator (DAX) is an in-memory cache for reads only, not writes; it cannot cache write requests or mitigate write throttling.
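
The retry pattern can be sketched as follows. This is a minimal illustration using exponential backoff with full jitter; `ThrottledError` is a hypothetical stand-in for the throttling exception an SDK would raise, and the delay values and `write_with_backoff` helper are assumptions, not DynamoDB defaults. The AWS SDKs also apply their own retry logic, so an application-level wrapper like this is one option rather than a requirement.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error such as ProvisionedThroughputExceededException."""


def write_with_backoff(write, max_attempts=5, base_delay=0.05, max_delay=2.0,
                       sleep=time.sleep):
    """Call write(); on throttling, wait a randomized, exponentially growing delay."""
    for attempt in range(max_attempts):
        try:
            return write()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt so the caller can react
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # full jitter: anywhere from 0 up to the cap


# Usage with a fake write that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError()
    return "ok"


recorded_delays = []
result = write_with_backoff(flaky_write, sleep=recorded_delays.append)
print(result, len(recorded_delays))
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; real code would normally leave the default `time.sleep`.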

66
MCQmedium

A company is using Amazon DynamoDB for a high-traffic web application. They notice increased read latency during peak hours. Which design change would best reduce read latency without increasing cost?

A.Increase read capacity units
B.Use DynamoDB global tables
C.Switch to strongly consistent reads
D.Enable DynamoDB Accelerator (DAX)
AnswerD

DAX is a caching layer that reduces read latency.

Why this answer

DynamoDB Accelerator (DAX) is an in-memory cache that reduces read latency from single-digit milliseconds to microseconds for eventually consistent reads, without requiring any changes to provisioned capacity. Since the question specifies reducing latency without increasing cost, DAX is ideal because it offloads read traffic from the underlying table, allowing you to potentially lower read capacity units (RCUs) while maintaining performance.

Exam trap

The trap here is that candidates often confuse increasing provisioned capacity (Option A) with reducing latency, but DynamoDB's internal latency is dominated by storage I/O and network round trips, not capacity units—DAX addresses the actual bottleneck by caching hot data in memory.

How to eliminate wrong answers

Option A is wrong because increasing read capacity units (RCUs) would directly increase cost, and while it can reduce throttling, it does not inherently reduce per-request latency caused by internal DynamoDB overhead or hot partitions. Option B is wrong because global tables are designed for multi-region replication and disaster recovery, not for reducing read latency within a single region; they would increase cost due to replication writes and cross-region traffic. Option C is wrong because strongly consistent reads can have higher latency than eventually consistent reads and consume twice the RCUs, thus increasing cost without improving performance.

67
MCQmedium

A data engineer is designing a data store for real-time analytics on high-velocity clickstream data. The data must be stored in a schema-on-read format and support SQL queries with sub-second latency. Which service should be used?

A.Amazon Redshift
B.Amazon Kinesis Data Firehose to S3 with Athena
C.Amazon Kinesis Data Analytics
D.Amazon DynamoDB
AnswerB

Firehose streams data to S3, Athena queries with schema-on-read and partitioning for low latency.

Why this answer

Amazon Kinesis Data Firehose can ingest high-velocity clickstream data and deliver it to Amazon S3, where it is stored in a schema-on-read format (e.g., Parquet or ORC). Amazon Athena then allows SQL queries directly on the data in S3 with sub-second latency when using partitions, columnar formats, and optimizations like AWS Glue Catalog. This combination meets the requirements for real-time analytics without predefining a schema.

Exam trap

The trap here is that candidates confuse Amazon Kinesis Data Analytics (which processes streams but does not store data) with a storage solution, or they assume Amazon Redshift is suitable for real-time streaming without recognizing its schema-on-write requirement and higher latency for ad-hoc queries.

How to eliminate wrong answers

Option A is wrong because Amazon Redshift requires a predefined schema (schema-on-write) and is optimized for batch analytics, not sub-second latency on high-velocity streaming data without significant preprocessing. Option C is wrong because Amazon Kinesis Data Analytics processes streaming data in real time using SQL but does not store the data persistently in a schema-on-read format; it is for transient analytics, not a data store. Option D is wrong because Amazon DynamoDB is a NoSQL key-value and document database that does not support SQL queries natively (it uses PartiQL with limitations) and is schema-on-write, not schema-on-read, making it unsuitable for ad-hoc SQL analytics on clickstream data.

68
MCQeasy

A data engineer needs to store semi-structured JSON logs from multiple microservices in a cost-effective manner for later analysis using Amazon Athena. The logs are generated continuously, and the total volume is about 1 TB per day. The data must be queryable within minutes of arrival. Which storage solution is most appropriate?

A.Amazon DynamoDB table with JSON attribute
B.Amazon RDS for PostgreSQL table with JSON column
C.Amazon S3 bucket with partitioned folders
D.Amazon Redshift cluster with JSON ingestion
AnswerC

S3 is cost-effective, and Athena can query the data directly.

Why this answer

Amazon S3 with partitioned folders is the most appropriate solution because it provides a cost-effective, scalable storage layer for semi-structured JSON logs, and integrates natively with Amazon Athena for serverless querying. By partitioning the data by time (e.g., year/month/day/hour), Athena can use partition pruning to minimize scanned data, enabling queries within minutes of arrival. S3's low cost per GB and lifecycle policies further optimize storage for the 1 TB/day volume.

Exam trap

AWS often tests the misconception that a data warehouse (Redshift) or a NoSQL database (DynamoDB) is required for analytical queries on semi-structured data, when in fact S3 with Athena is the most cost-effective and scalable solution for serverless ad-hoc analysis on raw logs.

How to eliminate wrong answers

Option A is wrong because Amazon DynamoDB is optimized for key-value and document access patterns with low-latency reads/writes, not for ad-hoc analytical queries on large volumes of JSON logs; scanning 1 TB/day would be prohibitively expensive and slow, and it lacks native integration with Athena. Option B is wrong because Amazon RDS for PostgreSQL is a relational database designed for transactional workloads, not for storing and analyzing 1 TB/day of semi-structured logs; it would require manual partitioning, incur high storage costs, and cannot scale to petabyte-scale analytics efficiently. Option D is wrong because Amazon Redshift is a petabyte-scale data warehouse optimized for complex analytical queries, but it is overkill and more expensive than S3 for raw log storage; ingesting 1 TB/day of JSON logs into Redshift requires an ETL pipeline (e.g., COPY from S3) and incurs compute costs even when idle, whereas S3 with Athena is serverless and pay-per-query.
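
The time-based partitioning mentioned in the explanation (year/month/day/hour) can be illustrated with a small key-building helper. This is a sketch under assumptions: the function name, the `logs` prefix, the `service=` partition, and the Hive-style `key=value` folder naming are illustrative choices, not something the question prescribes.

```python
from datetime import datetime, timezone


def partitioned_key(service, event_time, file_name, prefix="logs"):
    """Build a Hive-style S3 object key partitioned by year/month/day/hour in UTC."""
    # Convert to UTC so every writer agrees on partition boundaries.
    # A naive datetime is interpreted as local time by astimezone().
    t = event_time.astimezone(timezone.utc)
    return (
        f"{prefix}/service={service}/"
        f"year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/{file_name}"
    )


example = partitioned_key(
    "checkout",
    datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc),
    "part-0001.json",
)
# example == "logs/service=checkout/year=2024/month=03/day=05/hour=14/part-0001.json"
```

A query that filters on these partition columns only needs to read the matching folders, which is what keeps the scanned data small.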

69
MCQmedium

A company uses Amazon DynamoDB to store session data for a web application. The application experiences throttling errors during peak traffic. The data engineer observes that the table's read capacity is consistently at 100% and the write capacity is at 20%. The engineer needs to resolve the throttling with minimal cost. Which solution should the engineer implement?

A.Increase the provisioned read capacity units for the table.
B.Enable DynamoDB auto scaling for read capacity.
C.Implement DynamoDB Accelerator (DAX) to cache read-heavy workloads.
D.Decrease the provisioned write capacity units to free up budget for reads.
AnswerC

DAX reduces read load on the table by caching, lowering required read capacity.

Why this answer

Option C is correct because DynamoDB Accelerator (DAX) caches frequent reads, reducing read capacity consumption on the table. Option A is wrong because increasing read capacity units is more expensive than DAX and does not address cost.

Option B is wrong because auto scaling may not respond fast enough for sudden spikes. Option D is wrong because decreasing write capacity does not help read throttling.

70
MCQeasy

A data engineer is reviewing the lifecycle configuration of an S3 bucket. The bucket stores log files. The engineer wants to ensure that objects are deleted after 365 days. What is the current behavior?

A.Objects are deleted after 365 days.
B.Objects are transitioned to S3 Standard-IA immediately.
C.Objects are transitioned to S3 Glacier after 365 days.
D.Noncurrent versions of objects are deleted after 365 days.
AnswerA

The expiration rule sets deletion after 365 days.

Why this answer

Option A is correct because the expiration rule deletes objects after 365 days. Option B is wrong because objects transition to STANDARD_IA after 30 days, not immediately. Option C is wrong because there is no transition to Glacier.

Option D is wrong because noncurrent versions are not included.
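
The lifecycle configuration the question refers to is not reproduced here. As an assumption, a configuration consistent with the explanation (transition to STANDARD_IA after 30 days, expiration after 365 days, no rule for noncurrent versions) might look like the dictionary below, written in the shape boto3 accepts for `put_bucket_lifecycle_configuration`. The rule ID and the `summarize` helper are hypothetical.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-logs",           # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},      # empty prefix: rule applies to every object
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            "Expiration": {"Days": 365},   # current versions expire after 365 days
            # No NoncurrentVersionExpiration key, so noncurrent versions are untouched.
        }
    ]
}


def summarize(rule):
    """Return a small summary of what a single lifecycle rule does."""
    return {
        "expires_after_days": rule.get("Expiration", {}).get("Days"),
        "transitions": [(t["Days"], t["StorageClass"]) for t in rule.get("Transitions", [])],
        "expires_noncurrent": "NoncurrentVersionExpiration" in rule,
    }


summary = summarize(lifecycle_configuration["Rules"][0])
```

Reading the summary against the options: the expiration matches Option A, the 30-day transition rules out "immediately", no Glacier transition exists, and noncurrent versions are not covered.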

71
Multi-Selecthard

A company is migrating an on-premises Apache Hadoop cluster to Amazon EMR. The cluster uses HDFS for storage. Which THREE features of Amazon EMR help reduce storage costs compared to on-premises HDFS? (Choose THREE)

Select 3 answers
A.Leverage Amazon S3 storage classes like S3 Standard-IA for older data.
B.Use instance store volumes for intermediate data.
C.Enable automatic data compression in EMRFS.
D.Use EMR File System (EMRFS) to store data in Amazon S3.
E.Attach Amazon EBS volumes to cluster nodes for persistent storage.
AnswersA, C, D

S3 storage classes allow cost optimization for infrequently accessed data.

Why this answer

Option A is correct because Amazon S3 Standard-IA (Infrequent Access) offers lower storage costs than S3 Standard for data that is accessed less frequently, making it ideal for older or archival data in a Hadoop migration. By using S3 as the primary storage layer via EMRFS, you decouple compute from storage and avoid the replication overhead of HDFS (which typically uses 3x replication), significantly reducing storage costs.

Exam trap

The trap here is that candidates often confuse instance store volumes or EBS volumes as cost-saving alternatives, but the exam tests the understanding that S3-based storage with EMRFS is the primary mechanism for reducing storage costs in EMR by eliminating HDFS replication and enabling lifecycle management.

72
Multi-Selectmedium

A data engineer is using Amazon DynamoDB to store session data for a web application. The engineer wants to ensure that all data is encrypted at rest using an AWS managed key. Which TWO steps should the engineer take to achieve this? (Choose TWO.)

Select 2 answers
A.Enable server-side encryption with S3-managed keys (SSE-S3) on the DynamoDB table.
B.Disable encryption at rest to improve performance.
C.Specify an AWS KMS customer managed key for encryption if required.
D.Use client-side encryption before writing data to DynamoDB.
E.Create the DynamoDB table with encryption at rest enabled using an AWS managed key.
AnswersC, E

You can choose a customer managed key for encryption.

Why this answer

Option E is correct because creating the DynamoDB table with encryption at rest enabled using an AWS managed key directly fulfills the requirement: DynamoDB uses the AWS managed key (aws/dynamodb) when KMS encryption is enabled without specifying a customer managed key. Option C is accepted because DynamoDB encryption at rest also lets you specify a customer managed key in AWS KMS instead; the option is qualified with 'if required', so it applies only when a customer managed key is actually needed and is not necessary to meet the AWS managed key requirement on its own.

Exam trap

The trap here is that candidates may confuse the encryption options across AWS services (e.g., applying S3-specific SSE-S3 to DynamoDB) or think client-side encryption is a valid substitute for server-side encryption at rest, when DynamoDB's native encryption at rest is the correct mechanism.
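
The two key choices discussed above can be sketched as CreateTable parameters. This is an illustrative helper, not a complete table design: the function name, the `session_id` key, and the on-demand billing mode are assumptions, while `SSESpecification`, `SSEType`, and `KMSMasterKeyId` are the parameter names used by the DynamoDB CreateTable API.

```python
def session_table_params(table_name, kms_key_arn=None):
    """Build keyword arguments for a DynamoDB CreateTable call with KMS encryption at rest."""
    # Enabled=True with SSEType "KMS" and no key ID selects the AWS managed key.
    sse = {"Enabled": True, "SSEType": "KMS"}
    if kms_key_arn is not None:
        # Only set this when a customer managed key is actually required.
        sse["KMSMasterKeyId"] = kms_key_arn
    return {
        "TableName": table_name,
        "AttributeDefinitions": [{"AttributeName": "session_id", "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": "session_id", "KeyType": "HASH"}],
        "BillingMode": "PAY_PER_REQUEST",
        "SSESpecification": sse,
    }


managed_key_params = session_table_params("sessions")
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("dynamodb").create_table(**managed_key_params)`.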

73
Multi-Selecthard

A financial services company stores sensitive transaction data in an Amazon S3 bucket. Compliance requires that all objects be encrypted using SSE-KMS and that the bucket be protected from accidental deletion. Which combination of actions meets these requirements? (Select TWO.)

A.Enable MFA Delete on the bucket
B.Enable S3 Block Public Access
C.Add a bucket policy that denies PutObject if the object is not encrypted with SSE-KMS
D.Enable S3 Versioning on the bucket
E.Set default encryption to SSE-S3
AnswersC, D

This ensures all uploads use SSE-KMS.

Why this answer

Options C and D are correct because a bucket policy denying uploads that are not encrypted with SSE-KMS enforces the encryption requirement, and enabling versioning protects against accidental deletion. Option E (default encryption with SSE-S3) does not enforce SSE-KMS and does not affect existing objects. Option A (MFA Delete) is an additional protection but not required by the scenario.

Option B (Block Public Access) addresses public access, not deletion.
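
A bucket policy of the kind Option C describes can be sketched as a Python dictionary. This is a minimal example under assumptions: the bucket name, the statement ID, and the helper name are hypothetical, and a real policy would usually sit alongside other statements.

```python
import json


def deny_non_kms_uploads_policy(bucket_name):
    """Return a bucket policy that rejects PutObject requests not using SSE-KMS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUploadsWithoutSseKms",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    # Deny when the encryption header is anything other than aws:kms.
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            }
        ],
    }


# Serialize to the JSON document that would be attached to the bucket.
policy_json = json.dumps(
    deny_non_kms_uploads_policy("example-transactions-bucket"), indent=2
)
```

Versioning is enabled separately on the bucket; the policy covers only the encryption half of the requirement.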

74
MCQeasy

A data engineer needs to store semi-structured JSON files that are accessed infrequently but must be retrievable within minutes. The data is immutable and must be stored cost-effectively. Which AWS service should the engineer use?

A.Amazon DynamoDB with on-demand capacity
B.Amazon EBS with gp3 volume
C.Amazon S3 with S3 Standard-IA storage class
D.Amazon RDS for PostgreSQL with JSONB data type
AnswerC

S3 is designed for object storage, supports JSON, and Standard-IA is cost-effective for infrequent access with millisecond retrieval.

Why this answer

Amazon S3 Standard-IA (Infrequent Access) is designed for data that is accessed less frequently but requires rapid retrieval when needed, with retrieval times in milliseconds. It offers lower storage costs than S3 Standard while maintaining high durability and availability, making it ideal for storing immutable semi-structured JSON files that must be retrievable within minutes. Although it charges a per-GB retrieval fee, its storage price is significantly lower than S3 Standard, which keeps it cost-effective when the data is rarely read.

Exam trap

The trap here is that candidates often confuse 'infrequently accessed' with 'archival' and choose Glacier or Deep Archive, but the requirement for retrieval within minutes eliminates those options, while DynamoDB or RDS seem plausible for JSON but are not cost-effective for immutable, infrequently accessed data.

How to eliminate wrong answers

Option A is wrong because Amazon DynamoDB with on-demand capacity is a NoSQL database optimized for high-frequency, low-latency queries and is not cost-effective for infrequently accessed, immutable JSON files; it charges per read/write request unit and storage, which would be wasteful for archival-like data. Option B is wrong because Amazon EBS with gp3 volume is a block storage service designed for EC2 instances and requires an attached compute instance to access data, adding unnecessary cost and complexity; it is not a standalone object storage solution for infrequently accessed files. Option D is wrong because Amazon RDS for PostgreSQL with JSONB data type is a relational database service that incurs ongoing compute and storage costs, even when idle, and is overkill for storing immutable JSON files that are only occasionally retrieved; it is designed for transactional workloads and complex queries, not cost-effective archival storage.

75
MCQmedium

A company uses Amazon EMR to run Spark jobs on a cluster of 20 nodes. The cluster stores intermediate data on Amazon S3 using EMRFS. The company's data engineering team notices that the Spark jobs are running slower than expected. Upon investigating, they find that the cluster is experiencing high network I/O and that the S3 storage costs have increased significantly. The team suspects that the Spark jobs are writing too much intermediate data to S3. The jobs are performing many shuffle operations. The team wants to optimize the job performance and reduce costs without modifying the Spark application code. What should the data engineer do?

A.Enable S3 server-side encryption on the S3 bucket to reduce storage costs.
B.Increase the size of the EBS root volumes on the cluster nodes to store more intermediate data locally.
C.Configure the EMR cluster to use instance store volumes for intermediate data instead of EMRFS.
D.Add more nodes to the cluster to distribute the shuffle load.
AnswerC

Instance store provides local ephemeral storage, reducing S3 dependency and network I/O.

Why this answer

Option C is correct because using instance store volumes for shuffle data reduces S3 I/O and cost. Option A is wrong because enabling server-side encryption on the bucket does not reduce storage costs or improve performance; the issue is the shuffle data. Option B is wrong because enlarging the EBS root volumes is not the standard way to hold shuffle data; EMR uses instance store or EMRFS.

Option D is wrong because increasing node count may increase cost and network I/O.
