
CCNA Data Store Management Questions

75 of 456 questions · Page 5/7 · Data Store Management topic · Answers revealed


301

MCQ · medium

Refer to the exhibit. A data engineer runs the above AWS CLI command to view the table metadata in the AWS Glue Data Catalog. The data is stored as CSV in S3 with partitions by year and month. When querying the table using Amazon Athena, no data is returned. What is the most likely cause?

A. The partitions have not been added to the Glue Data Catalog.

B. The SerDe is not compatible with CSV files.

C. The S3 location points to a file instead of a folder.

D. The column data types are incorrect for the CSV data.

Answer: A

Partitions must be explicitly registered for Athena to query them.

Why this answer

Option A is correct because the AWS CLI command shown only retrieves table metadata, not partition metadata. In AWS Glue, partitions must be explicitly added to the Data Catalog via `MSCK REPAIR TABLE`, `ALTER TABLE ADD PARTITION`, or a Glue crawler. Without partition metadata, Athena cannot locate the data files under the partitioned S3 paths (e.g., `s3://bucket/year=2024/month=01/`), resulting in zero rows returned even though the table schema is defined.
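The registration step can be sketched in Python as a helper that builds the `ALTER TABLE ADD PARTITION` statement for one Hive-style path. This is only an illustration: the table name `sales`, the base location `s3://bucket/`, and the helper `add_partition_sql` are hypothetical, the partition columns are assumed to be strings, and submitting the statement to Athena is left out.

```python
def add_partition_sql(table: str, base_location: str, year: int, month: int) -> str:
    """Build an Athena DDL statement that registers one year/month partition."""
    # Hive-style layout assumed: <base>/year=YYYY/month=MM/
    location = f"{base_location.rstrip('/')}/year={year}/month={month:02d}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month:02d}') "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table and bucket names, for illustration only.
    print(add_partition_sql("sales", "s3://bucket/", 2024, 1))
```

For many partitions, `MSCK REPAIR TABLE` or a Glue crawler avoids issuing one statement per path.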

Exam trap

The trap here is that candidates assume the `PARTITIONED BY` clause in the table definition automatically registers the partitions in the Glue Data Catalog, but it only defines the schema; partition metadata must be added separately.

How to eliminate wrong answers

Option B is wrong because the default SerDe for CSV in Athena (`LazySimpleSerDe`) is fully compatible with standard CSV files; no SerDe mismatch would cause zero rows. Option C is wrong because the `LOCATION` in the Glue table points to a folder (the base path), not a file; Athena expects a folder and would fail with an error if a file were specified, not silently return no data. Option D is wrong because incorrect column data types would cause query failures or data conversion errors, not an empty result set; Athena would still attempt to read the data and return rows with nulls or errors.


302

MCQ · medium

Refer to the exhibit. A data engineer notices that the Redshift cluster 'mycluster' does not have automated backups beyond 7 days. However, the compliance team requires a minimum of 35 days of backup retention. What should the engineer do?

A. Change the node type to ra3.xlplus to enable automatic backups for 35 days.

B. Enable audit logging to capture changes for recovery.

C. Take manual snapshots every day and retain them for 35 days.

D. Modify the cluster's automated snapshot retention period to 35 days.

Answer: D

The retention period can be increased up to 35 days via modification.

Why this answer

Option D is correct because Amazon Redshift allows you to modify the automated snapshot retention period for a cluster up to 35 days. The engineer can use the AWS Management Console, CLI, or API to change the automated snapshot retention period (`--automated-snapshot-retention-period` in the CLI, `AutomatedSnapshotRetentionPeriod` in the API) from the current 7 days to 35 days, meeting the compliance requirement without additional manual intervention.
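A minimal Python sketch of that modification, assuming a boto3 Redshift client is passed in. The helper name `extend_snapshot_retention` is hypothetical; `modify_cluster` and its `AutomatedSnapshotRetentionPeriod` argument are the boto3 names for the Redshift ModifyCluster API.

```python
def extend_snapshot_retention(redshift_client, cluster_id: str, days: int = 35):
    """Set the automated snapshot retention period on a Redshift cluster."""
    # Redshift documents a 0 to 35 day range for automated snapshots.
    if not 0 <= days <= 35:
        raise ValueError("automated snapshot retention must be between 0 and 35 days")
    return redshift_client.modify_cluster(
        ClusterIdentifier=cluster_id,
        AutomatedSnapshotRetentionPeriod=days,
    )
```

With credentials configured, a call might look like `extend_snapshot_retention(boto3.client("redshift"), "mycluster")`.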

Exam trap

The trap here is that candidates may confuse backup retention with node type capabilities or audit logging, assuming that hardware or logging features inherently extend backup duration, when in fact the retention period is a simple configuration parameter.

How to eliminate wrong answers

Option A is wrong because changing the node type to ra3.xlplus does not affect the automated backup retention period; retention is configured independently of node type. Option B is wrong because audit logging captures user activity and SQL queries for security and compliance, not for point-in-time recovery of data; it does not replace backup retention. Option C is wrong because while manual snapshots can be retained for 35 days, this approach requires daily manual effort and does not leverage the automated backup feature that is already available; modifying the automated retention period is simpler and more reliable.


303

MCQ · easy

A data engineer needs to store streaming data from IoT devices for real-time analytics. The data has a fixed schema and requires low-latency queries. Which AWS service should be used?

A. Amazon DynamoDB

B. Amazon Redshift

C. Amazon S3

D. Amazon Timestream

Answer: D

Timestream is designed for time-series data with low-latency queries.

Why this answer

Amazon Timestream is a time-series database purpose-built for IoT and operational applications that generate large volumes of time-stamped data. It automatically manages data retention and storage tiers (memory and magnetic) to provide fast query performance for recent data and cost-effective storage for historical data, making it ideal for real-time analytics on streaming IoT data with a fixed schema.

Exam trap

AWS often tests the misconception that any database can handle time-series data equally well, but the trap here is that candidates choose DynamoDB for its low-latency reads, overlooking that Timestream is the only AWS service purpose-built for time-series workloads with native support for time-based partitioning, retention policies, and analytical functions.

How to eliminate wrong answers

Option A is wrong because Amazon DynamoDB is a NoSQL key-value and document database optimized for high-throughput, low-latency read/write operations on individual items, but it lacks native time-series optimizations such as automatic downsampling, interpolation, and time-based partitioning, making it less efficient for time-series queries like aggregations over time windows. Option B is wrong because Amazon Redshift is a petabyte-scale data warehouse designed for complex analytical queries on structured and semi-structured data using SQL, but it is not optimized for real-time streaming ingestion or low-latency queries on high-frequency time-series data; its batch-oriented architecture introduces higher latency for streaming use cases. Option C is wrong because Amazon S3 is an object storage service that provides durable, scalable storage for any type of data, but it does not support real-time querying directly; querying S3 requires services like Athena or S3 Select, which add latency and are not designed for sub-second, low-latency queries on streaming data.


304

MCQ · hard

A data engineer is designing a data lake on Amazon S3. The data is frequently accessed by multiple analytics services, and the company needs to enforce fine-grained access control based on data tags. Which combination of AWS services should be used?

A. S3 Block Public Access settings

B. AWS Lake Formation with tag-based access control

C. S3 Access Points with bucket policies

D. S3 Object Lambda with IAM policies

Answer: B

Lake Formation provides fine-grained access control using tags.

Why this answer

AWS Lake Formation with tag-based access control (TBAC) is the correct choice because it provides fine-grained, attribute-based access control (ABAC) at the column, row, and cell level across a data lake on S3. By assigning LF-tags to Data Catalog resources and defining permissions based on those tags, you can enforce granular access policies that scale without managing individual user-to-resource mappings. This directly meets the requirement for tag-driven, fine-grained access for multiple analytics services.

Exam trap

The trap here is that candidates often confuse S3 Access Points (which provide network-level or prefix-level restrictions) with the fine-grained, tag-driven access control that Lake Formation TBAC uniquely offers, leading them to pick Option C despite its inability to enforce column- or row-level security based on tags.

How to eliminate wrong answers

Option A is wrong because S3 Block Public Access settings only prevent public exposure of S3 objects and do not provide any fine-grained, tag-based access control for internal users or services. Option C is wrong because S3 Access Points with bucket policies can restrict access based on VPC or IP, but they do not natively support tag-based access control at the column or row level; they operate at the bucket or prefix level only. Option D is wrong because S3 Object Lambda transforms data on read but does not enforce access control based on data tags; IAM policies attached to it cannot dynamically filter data by tags without custom code, and it lacks the centralized governance Lake Formation provides.


305

MCQ · easy

A data engineer needs to store JSON documents that are frequently accessed by a low-latency web application. The data does not require complex queries, and the access pattern is primarily by a key. Which AWS service is most appropriate?

A. Amazon ElastiCache for Redis

B. Amazon S3

C. Amazon RDS for MySQL

D. Amazon DynamoDB

Answer: D

DynamoDB provides low-latency key-value access for JSON documents.

Why this answer

Option D is correct because Amazon DynamoDB is a key-value and document database that offers low-latency access by key. Option A (ElastiCache) is a cache, not a primary store. Option B (S3) has higher latency for frequent reads. Option C (RDS) is relational and not optimized for JSON documents.


306

Multi-Select · hard

A company uses Amazon DynamoDB to store session data for a web application. The application experiences throttling during peak hours. The data engineer needs to reduce throttling. Which THREE actions should the engineer take?

Select 3 answers

A. Use DynamoDB Accelerator (DAX) to cache read requests.

B. Increase the provisioned read capacity units.

C. Implement exponential backoff in the application.

D. Design the partition key to include a random suffix to distribute writes.

E. Enable auto scaling on the table.

Answers: A, C, D

Reduces read capacity consumption.

Why this answer

DynamoDB Accelerator (DAX) is an in-memory cache that reduces read latency and offloads read requests from the DynamoDB table, directly decreasing the number of read capacity units consumed. By caching frequently accessed session data, DAX mitigates throttling during peak hours without requiring changes to the table's provisioned capacity.

Exam trap

The trap here is that candidates confuse reactive scaling (auto scaling) or capacity increases with proactive throttling reduction techniques, while the correct answers focus on caching, request distribution, and retry logic that directly reduce the load on the table.
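The retry logic named among the correct answers can be sketched as exponential backoff with jitter. This is an illustration under assumptions: `ThrottledError` is a hypothetical stand-in for a throttling error such as `ProvisionedThroughputExceededException`, and the delay values are arbitrary. The AWS SDKs already apply their own retries, so application-level backoff like this usually wraps the work that remains after those are exhausted.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a DynamoDB throttling error."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05,
                      max_delay=2.0, sleep=time.sleep):
    """Retry `operation` on throttling, waiting longer (with jitter) each time."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

The `sleep` parameter exists only so the wait can be replaced in tests; production callers would leave it at the default.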


307

MCQ · hard

A company uses Amazon DynamoDB with on-demand capacity for a gaming application that experiences unpredictable traffic spikes. The application reads the same set of 'hot' items frequently. Users report high latency during peak hours. Which action would MOST effectively reduce read latency for the hot items?

A. Enable DynamoDB Accelerator (DAX) for the table.

B. Switch to provisioned capacity with auto-scaling.

C. Increase the read capacity units for the table.

D. Enable DynamoDB Global Tables for multi-region replication.

Answer: A

DAX caches hot items, reducing read latency.

Why this answer

DynamoDB Accelerator (DAX) is an in-memory cache that reduces read latency for frequently accessed items, so Option A is correct. Option B: switching to provisioned capacity with auto-scaling does not address hot item latency. Option C: increasing read capacity units is not applicable in on-demand mode, which scales capacity automatically. Option D: Global Tables improve availability across Regions but do not reduce read latency for hot items.


308

MCQ · medium

The exhibit shows an S3 bucket policy. What is the effect of this policy?

A. Allows all S3 actions over HTTPS only.

B. Allows all S3 actions to the bucket over any protocol.

C. Denies all S3 actions to the bucket.

D. Allows only GetObject and PutObject over HTTPS.

Answer: D

Explicit allow for those actions over HTTPS; deny for HTTP.

Why this answer

The policy allows GetObject and PutObject only over HTTPS (SecureTransport true) and denies all S3 actions over HTTP (SecureTransport false). Result: Only HTTPS requests are allowed.


309

MCQ · medium

A data engineering team is designing a data lake on Amazon S3 with a folder structure that separates raw, transformed, and curated data. The team needs to implement lifecycle policies to minimize storage costs while ensuring that data in the 'raw' zone is retained for 90 days before being moved to Amazon S3 Glacier Deep Archive. Additionally, data in the 'curated' zone should be deleted after 365 days. What is the MOST cost-effective way to achieve these requirements?

A. Configure S3 Intelligent-Tiering on both prefixes with automatic archiving to Glacier Deep Archive after 90 days in raw and deletion after 365 days in curated.

B. Create separate lifecycle policies for each prefix: one to transition raw to S3 Glacier Deep Archive after 90 days, another to delete curated after 365 days.

C. Create a lifecycle policy that transitions raw zone data to S3 Standard-IA after 90 days and deletes curated zone data after 365 days.

D. Create a single lifecycle policy with two rules: one to transition raw zone objects to S3 Glacier Deep Archive after 90 days, and another to delete curated zone objects after 365 days.

Answer: D

This meets cost and retention requirements efficiently.

Why this answer

Option D is correct because a bucket's S3 Lifecycle configuration can contain multiple rules, each scoped to a different prefix. One rule transitions raw zone objects to S3 Glacier Deep Archive after 90 days and another deletes curated zone objects after 365 days, minimizing storage costs without needing separate policies. Lifecycle rules are evaluated per object based on creation date, and keeping both rules in one configuration reduces management overhead.
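As a rough sketch, the two rules could be expressed as the dictionary below, in the shape boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The prefixes `raw/` and `curated/` and the rule IDs are assumptions about the folder layout, not values from the question.

```python
import json

# Hypothetical prefixes and rule IDs; adjust to the real data lake layout.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "raw-to-deep-archive",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            # Move raw objects to Glacier Deep Archive 90 days after creation.
            "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
        },
        {
            "ID": "curated-expire",
            "Filter": {"Prefix": "curated/"},
            "Status": "Enabled",
            # Delete curated objects 365 days after creation.
            "Expiration": {"Days": 365},
        },
    ]
}

if __name__ == "__main__":
    print(json.dumps(lifecycle_configuration, indent=2))
```

Because each rule carries its own prefix filter, neither rule touches objects in the other zone or in the transformed zone.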

Exam trap

The trap here is that candidates might think separate lifecycle policies are required for different prefixes, but AWS allows multiple rules within a single policy, making it more cost-effective and easier to manage than creating separate policies.

How to eliminate wrong answers

Option A is wrong because S3 Intelligent-Tiering does not support automatic archiving to Glacier Deep Archive after a fixed number of days; it moves objects between access tiers based on usage patterns, not a scheduled transition, and it cannot enforce deletion after a specific period. Option B is wrong because while separate lifecycle policies can achieve the requirements, they are not the most cost-effective or efficient approach; a single policy with multiple rules is simpler and avoids potential policy conflicts or duplication of management. Option C is wrong because transitioning raw zone data to S3 Standard-IA after 90 days does not meet the requirement to move it to Glacier Deep Archive, which is the lowest-cost storage class for long-term archival; Standard-IA is more expensive than Glacier Deep Archive for data that is rarely accessed.


310

MCQ · easy

A data engineer is designing a data store for a real-time analytics application that requires sub-millisecond read and write latency. The data is key-value in nature and the workload is both read-heavy and write-heavy. Which AWS service is most suitable?

A. Amazon ElastiCache for Redis

B. Amazon DynamoDB

C. Amazon RDS for MySQL

D. Amazon S3

Answer: B

DynamoDB offers consistent single-digit millisecond latency at scale.

Why this answer

Amazon DynamoDB is the most suitable service because it is a fully managed NoSQL key-value and document database designed for single-digit millisecond latency at any scale, and with features like DynamoDB Accelerator (DAX) it can achieve sub-millisecond read latency. It supports both read-heavy and write-heavy workloads through its distributed architecture and auto-scaling capabilities, making it ideal for real-time analytics applications.

Exam trap

The trap here is that candidates often choose ElastiCache for Redis because of its sub-millisecond performance, overlooking that the question specifies a 'data store' for a write-heavy workload, which requires durability and persistence that Redis does not guarantee by default, whereas DynamoDB is a fully managed, durable database designed for such use cases.

How to eliminate wrong answers

Option A (Amazon ElastiCache for Redis) is wrong because while Redis provides sub-millisecond latency, it is an in-memory data store primarily designed for caching and not as a durable primary data store; data persistence is optional and can lead to data loss if not configured correctly, making it unsuitable for a write-heavy workload requiring durability. Option C (Amazon RDS for MySQL) is wrong because it is a relational database that incurs higher latency due to disk I/O and SQL query overhead, and it cannot consistently achieve sub-millisecond read and write latency for key-value workloads. Option D (Amazon S3) is wrong because it is an object storage service whose request latency is typically in the tens to hundreds of milliseconds, far exceeding the sub-millisecond requirement.


311

Multi-Select · hard

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be stored in Amazon S3 in near real-time with minimal overhead. Which THREE steps should the data engineer take to achieve this? (Choose THREE.)

Select 3 answers

A. Configure a Lambda function to transform data before delivery to S3.

B. Create a Kinesis Data Firehose delivery stream that delivers data to S3.

C. Use Kinesis Data Analytics to process and store data in S3.

D. Enable S3 cross-region replication for the destination bucket.

E. Enable S3 compression (e.g., GZIP) in Firehose.

Answers: A, B, E

Why this answer

Option A is correct because a Lambda function can be used as a data transformation step within a Kinesis Data Firehose delivery stream. This allows the clickstream data to be transformed (e.g., parsed, enriched, or reformatted) before being delivered to Amazon S3, enabling near real-time storage with minimal operational overhead.

Exam trap

The trap here is that candidates may confuse Kinesis Data Analytics with a storage service or think that cross-region replication is needed for near real-time ingestion, when in fact Firehose with Lambda transformation and compression is the correct, minimal-overhead solution.


312

MCQ · easy

A company stores time-series sensor data in Amazon S3. They need to query the data using SQL with minimal latency and no infrastructure management. Which service should they use?

A. Amazon Kinesis Data Analytics

B. Amazon Athena

C. Amazon Redshift

D. Amazon DynamoDB

Answer: B

Athena is serverless and directly queries S3 using SQL.

Why this answer

Amazon Athena is the correct choice because it is a serverless interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL without any infrastructure to manage. It is optimized for querying structured, semi-structured, and unstructured data stored in S3, making it ideal for time-series sensor data with minimal latency requirements.

Exam trap

The trap here is that candidates often confuse Amazon Athena with Amazon Redshift Spectrum, but the question explicitly requires 'no infrastructure management,' which eliminates Redshift; also, Kinesis Data Analytics is mistakenly chosen by those who think it can query static S3 data, but it is strictly for real-time streams.

How to eliminate wrong answers

Option A is wrong because Amazon Kinesis Data Analytics is designed for real-time stream processing using SQL or Apache Flink, not for querying static data already stored in S3; it requires a streaming data source and incurs ongoing processing costs. Option C is wrong because Amazon Redshift is a fully managed data warehouse that requires provisioning and managing clusters, which contradicts the 'no infrastructure management' requirement; it is also overkill for simple SQL queries on S3 data and incurs higher costs for idle compute. Option D is wrong because Amazon DynamoDB is a NoSQL key-value and document database, not designed for SQL queries on S3 data; it requires data to be loaded into tables and does not support direct querying of S3 objects.


313

MCQ · easy

A data engineer needs to store semi-structured JSON transaction logs for analytics. The logs are written once and rarely accessed. The storage must be cost-effective. Which AWS service should be used?

A. Amazon S3

B. Amazon DynamoDB

C. Amazon RDS

D. Amazon Redshift

Answer: A

S3 is cost-effective for infrequently accessed semi-structured data.

Why this answer

Amazon S3 is the correct choice because it provides highly durable, cost-effective object storage ideal for semi-structured JSON transaction logs that are written once and rarely accessed. S3's lifecycle policies can automatically transition such infrequently accessed data to S3 Glacier or S3 Glacier Deep Archive for even lower storage costs, making it the most economical option for this use case.

Exam trap

The trap here is that candidates may choose DynamoDB or Redshift because they support JSON natively, but they overlook the core requirement of cost-effective storage for rarely accessed data, which is best met by S3's low-cost object storage and lifecycle management features.

How to eliminate wrong answers

Option B (Amazon DynamoDB) is wrong because it is a NoSQL key-value and document database optimized for low-latency, high-throughput read/write operations, not for cost-effective archival storage of rarely accessed logs; storing large volumes of infrequently accessed JSON logs in DynamoDB would incur significant costs for provisioned throughput and storage. Option C (Amazon RDS) is wrong because it is a relational database service designed for transactional workloads with structured data and frequent queries, not for storing semi-structured JSON logs at low cost; it would require schema management and incur higher per-GB storage costs compared to S3. Option D (Amazon Redshift) is wrong because it is a petabyte-scale data warehouse optimized for complex analytical queries on structured and semi-structured data, not for simple, cost-effective archival storage; using Redshift for rarely accessed logs would be over-provisioned and expensive due to its compute and storage costs.


314

Multi-Select · easy

Which TWO are valid Amazon Redshift distribution styles? (Choose 2.)

Select 2 answers

A. HASH

B. ALL

C. AUTO

D. RANDOM

E. KEY

Answers: B, E

ALL distributes a copy of the table to all nodes.

Why this answer

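Options B and E are correct because ALL and KEY are valid distribution styles. Option A is wrong because HASH is not a Redshift distribution style; hash-based placement on a column is what KEY provides. Option D is wrong because RANDOM is not a distribution style; round-robin distribution is called EVEN. Option C is excluded by the answer key, although current Redshift documentation lists AUTO as a distribution style alongside EVEN, KEY, and ALL.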


315

MCQ · medium

A data engineer is designing a data lake on Amazon S3. The data is ingested from multiple sources in Parquet format, and the schema evolves over time. Which approach allows querying the data with Amazon Athena while supporting schema evolution?

A. Use AWS Glue Data Catalog with crawlers to automatically update the table schema.

B. Define Hive-style partitions in Athena and manually update the schema.

C. Use S3 Select to query the data directly without a schema.

D. Use Amazon Redshift Spectrum with external tables and update the schema manually.

Answer: A

Crawlers can detect schema changes and update the Data Catalog, which Athena uses.

Why this answer

AWS Glue Data Catalog with crawlers automatically infers and updates the table schema as new Parquet files with evolving schemas are ingested into S3. This allows Athena to query the data using the latest schema without manual intervention, making it the ideal solution for schema evolution in a data lake.
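One way to picture the crawler setup is as the keyword arguments for Glue's `create_crawler` call in boto3, sketched below. The helper `crawler_request` and every name passed to it are hypothetical; the `SchemaChangePolicy` values shown are the ones that tell a crawler to update the existing table definition when the schema changes.

```python
def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Return kwargs for Glue `create_crawler` that push schema changes to the catalog."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {
            # Update the existing table definition when new columns appear.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # Keep catalog entries for removed objects and only log the deletion.
            "DeleteBehavior": "LOG",
        },
    }
```

With a Glue client in hand, the request could be sent as `glue.create_crawler(**crawler_request(...))`, and the crawler then run on a schedule or after each ingestion.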

Exam trap

The trap here is that candidates may think S3 Select or Redshift Spectrum can handle schema evolution automatically, but they lack the schema inference and versioning capabilities that AWS Glue Data Catalog provides for Athena.

How to eliminate wrong answers

Option B is wrong because manually updating the schema in Athena is error-prone and does not scale with frequent schema changes; Hive-style partitions alone do not handle schema evolution. Option C is wrong because S3 Select operates on individual objects and returns data in CSV/JSON format, not Parquet, and it does not support schema evolution or table-level queries across multiple files. Option D is wrong because Redshift Spectrum requires manual schema updates for external tables and is not designed for automatic schema evolution like AWS Glue Data Catalog.


316

MCQ · hard

A company runs an Apache Spark job on Amazon EMR that writes output to an S3 bucket. The job fails with the error 'S3AccessDeniedException' when writing the final output, but earlier stages succeed. The EMR cluster uses a service role and an instance profile. The S3 bucket policy allows access from the VPC only. What is the MOST likely cause?

A. The S3 bucket uses SSE-C encryption, and the EMR cluster does not have the encryption key.

B. The EMR service role does not have permissions to write to the S3 bucket.

C. The EMR cluster is not using a VPC endpoint for S3, so requests are denied by the bucket policy's VPC condition.

D. The S3 bucket is configured with 'Bucket owner enforced' setting for ACLs, and the EMR cluster's account is not the bucket owner.

Answer: C

The bucket policy restricts access to VPC, but since the Spark job runs on EMR, its requests originate from inside the VPC only if a VPC endpoint is used; otherwise, they come from public IPs.

Why this answer

The bucket policy restricts access to requests originating from the VPC, typically using a condition like `aws:SourceVpc`. If the EMR cluster does not use a VPC endpoint for S3 (either Gateway or Interface endpoint), traffic from the cluster to S3 traverses the public internet and does not match the VPC condition, causing the `S3AccessDeniedException`. Earlier stages may succeed if they use cached data or different paths, but the final write fails because it hits the bucket policy check.
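A bucket policy of the kind described might look like the sketch below, written as a Python dictionary. The question does not show the actual policy, so the statement layout, the helper `vpc_only_bucket_policy`, and the bucket and VPC identifiers are all assumptions; only the `aws:SourceVpc` condition key comes from the explanation above.

```python
import json


def vpc_only_bucket_policy(bucket: str, vpc_id: str) -> dict:
    """Build a bucket policy that denies S3 requests not arriving from `vpc_id`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideVpc",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                # aws:SourceVpc is only present on requests that travel through
                # a VPC endpoint, so traffic without one matches this Deny.
                "Condition": {"StringNotEquals": {"aws:SourceVpc": vpc_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(json.dumps(vpc_only_bucket_policy("my-bucket", "vpc-0123456789abcdef0"), indent=2))
```

Under a policy like this, the instance profile's IAM permissions do not help: an explicit Deny in the bucket policy overrides any Allow.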

Exam trap

The trap here is that candidates often assume the EMR service role (EMR_DefaultRole) is responsible for all S3 access, but in reality the instance profile (EMR_EC2_DefaultRole, the EC2 instance role) handles data plane operations, and the bucket policy's VPC condition is the key blocker when earlier stages succeed but final writes fail.

How to eliminate wrong answers

Option A is wrong because SSE-C encryption requires the client to provide the encryption key; if the key were missing, the error would be an encryption-related error (e.g., 'InvalidArgument' or 'AccessDenied' with a different message), not a generic 'S3AccessDeniedException'. Option B is wrong because the EMR service role is used for the cluster's service-level permissions (e.g., launching instances, reading logs), not for data access to S3; the instance profile (IAM role attached to EC2 instances) handles data read/write permissions, and the question states earlier stages succeed, indicating the instance profile has write permissions. Option D is wrong because the 'Bucket owner enforced' setting (S3 Object Ownership) controls ACLs and ownership of objects, not access permissions; it does not cause an 'S3AccessDeniedException' — it would affect who owns new objects, not whether the write is allowed.


317

MCQ · hard

A company runs a real-time analytics platform on AWS. Data is ingested from thousands of IoT devices into Amazon Kinesis Data Streams. A Lambda function consumes the stream, processes the data, and writes the results to an Amazon DynamoDB table. The DynamoDB table has a provisioned write capacity of 1000 WCU, and the read capacity is set to 200 RCU. Recently, the company noticed that the Lambda function is failing with ProvisionedThroughputExceededException on DynamoDB writes. The Lambda function is configured with a batch size of 100 and a concurrency limit of 10. The Kinesis shard count is 4. The number of devices has increased, but the data volume per device has remained the same. The company needs to resolve the write throttling without increasing the DynamoDB write capacity. Which action should the data engineer take?

A. Increase the number of Kinesis shards to 8.

B. Increase the Lambda concurrency limit to 20.

C. Increase the batch size to 200.

D. Reduce the batch size of the Lambda function to 10.

Answer: D

Smaller batches reduce write volume per invocation.

Why this answer

Reducing the batch size from 100 to 10 decreases the number of records processed per Lambda invocation, which reduces the burst of write requests to DynamoDB per invocation. This helps stay within the 1000 WCU limit without increasing capacity, as the same total throughput is spread across more invocations with smaller batches.

Exam trap

The trap here is that candidates assume increasing concurrency or shards will distribute the load better, but in reality, those actions increase the total write throughput, exacerbating throttling when DynamoDB capacity is fixed.

How to eliminate wrong answers

Option A is wrong because increasing Kinesis shards to 8 would increase the number of concurrent Lambda consumers, potentially amplifying the write throttling issue by generating more parallel writes to DynamoDB. Option B is wrong because increasing Lambda concurrency to 20 would allow more simultaneous invocations, each writing up to 100 records, which would increase the aggregate write rate and worsen ProvisionedThroughputExceededException. Option C is wrong because increasing the batch size to 200 would cause each Lambda invocation to attempt writing more records at once, creating larger spikes in write demand that exceed the 1000 WCU limit.


318

MCQ · hard

A company uses DynamoDB with provisioned capacity and experiences throttling on a table during peak hours. The data engineer notices that the table has a partition key with high cardinality and the workload is read-heavy. Which action would best resolve the throttling?

A. Enable DynamoDB Auto Scaling for the table.

B. Switch the table to on-demand capacity mode.

C. Increase the provisioned write capacity units.

D. Add a global secondary index (GSI) to distribute reads.

Answer: A

Auto Scaling adjusts capacity based on traffic, preventing throttling efficiently.

Why this answer

DynamoDB Auto Scaling adjusts the provisioned read capacity units (RCUs) based on actual traffic patterns, preventing throttling during peak hours without manual intervention. Since the table has high-cardinality partition keys and is read-heavy, throttling is likely due to insufficient RCUs, which Auto Scaling dynamically increases to match demand.
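DynamoDB Auto Scaling is configured through Application Auto Scaling, and the two requests involved can be sketched as plain dictionaries. The helper `read_autoscaling_requests`, the policy name, and the 70 percent target are assumptions for illustration; the namespace, dimension, and metric strings are the ones Application Auto Scaling defines for DynamoDB table reads.

```python
def read_autoscaling_requests(table_name: str, min_rcu: int, max_rcu: int,
                              target_utilization: float = 70.0):
    """Return kwargs for the two Application Auto Scaling calls that scale table reads."""
    resource_id = f"table/{table_name}"
    dimension = "dynamodb:table:ReadCapacityUnits"
    # Step 1: declare the RCU range the table may scale within.
    scalable_target = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_rcu,
        "MaxCapacity": max_rcu,
    }
    # Step 2: track consumed/provisioned read utilization toward the target.
    scaling_policy = {
        "PolicyName": f"{table_name}-read-target-tracking",
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
            },
        },
    }
    return scalable_target, scaling_policy
```

With a boto3 `application-autoscaling` client, the first dictionary would be passed to `register_scalable_target(**...)` and the second to `put_scaling_policy(**...)`.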

Exam trap

AWS often tests the misconception that adding a GSI or switching to on-demand is the default fix for throttling, but the correct answer requires identifying that the read-heavy workload needs RCU adjustments, not structural changes or mode switches.

How to eliminate wrong answers

Option B is wrong because switching to on-demand capacity mode would eliminate throttling but at a significantly higher cost for a read-heavy workload, and it does not leverage the existing provisioned capacity setup. Option C is wrong because increasing provisioned write capacity units (WCUs) does not address read throttling; the issue is read-heavy, so RCUs need adjustment, not WCUs. Option D is wrong because adding a GSI distributes reads across partitions but does not increase the table's total provisioned read capacity; it could even worsen throttling if the GSI's write capacity is not properly provisioned.

319

Multi-Selecteasy

A company needs to store log files from multiple applications in a centralized location. The logs are written once and accessed rarely after 30 days. The company must retain logs for 5 years. Which TWO actions meet these requirements cost-effectively?

Select 2 answers

A.Configure a lifecycle policy to transition objects to S3 Glacier Deep Archive after 30 days

B.Configure a lifecycle policy to transition objects to S3 Glacier Flexible Retrieval after 30 days

C.Use S3 Intelligent-Tiering for automatic cost optimization

D.Use S3 One Zone-IA for the first 30 days, then delete

E.Store all logs in S3 Standard

AnswersA, C

Deep Archive is the lowest-cost storage class for long-term retention.

Why this answer

Option A is correct because S3 Glacier Deep Archive is the lowest-cost storage class for data that is accessed rarely, with standard retrievals completing within 12 hours, making it ideal for logs that are rarely accessed after 30 days. A lifecycle policy transitions objects from a higher-cost class (e.g., S3 Standard) to S3 Glacier Deep Archive after 30 days, meeting the 5-year retention requirement cost-effectively.

Exam trap

AWS often tests the distinction between S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive, where candidates mistakenly choose the former for rarely accessed data due to familiarity, ignoring the cost savings of the latter for deep archival use cases.

320

MCQmedium

A company has an Amazon RDS for MySQL DB instance with read replicas. The primary DB instance fails. What is the correct procedure to promote a read replica to become the new primary?

A.Modify the read replica to be a Multi-AZ deployment and failover will occur.

B.RDS automatically fails over to the read replica within 5 minutes.

C.Manually promote the read replica to a standalone DB instance.

D.Delete the primary and the read replica will automatically become the primary.

AnswerC

This is the correct procedure to make the read replica the new primary.

Why this answer

When an Amazon RDS for MySQL primary DB instance fails, read replicas do not automatically become the new primary. The correct procedure is to manually promote the read replica using the AWS Management Console, CLI, or API, which converts it into a standalone DB instance. After promotion, you must update your application endpoints to point to the new primary, as RDS does not handle this automatically.

Exam trap

The trap here is that candidates confuse read replicas with Multi-AZ standby instances, assuming automatic failover applies to both, but RDS read replicas require manual promotion and do not provide automatic failover.

How to eliminate wrong answers

Option A is wrong because modifying a read replica to be Multi-AZ does not trigger a failover; Multi-AZ is a separate feature for high availability within a single region, and read replicas are not part of the Multi-AZ failover mechanism. Option B is wrong because RDS does not automatically fail over to a read replica; automatic failover only occurs with Multi-AZ deployments, not with read replicas. Option D is wrong because deleting the primary DB instance does not cause the read replica to automatically become the primary; the read replica remains a read-only copy until manually promoted.

321

MCQmedium

A data engineering team is using Amazon EMR to process large datasets stored in Amazon S3. The cluster uses Spot Instances for cost savings. During processing, the team notices that tasks are failing due to Spot Instance interruptions. The team needs to make the EMR job resilient to Spot interruptions without increasing costs significantly. Which solution should they implement?

A.Use EMR instance fleets with a mix of Spot and On-Demand, but set the allocation strategy to 'lowest price'.

B.Increase the number of core nodes using On-Demand instances.

C.Use only Spot Instances but enable automatic termination and checkpointing.

D.Use EMR instance fleets with a mix of Spot and On-Demand, setting the allocation strategy to 'diversified' and using On-Demand for core nodes.

AnswerD

Diversified spreads risk; On-Demand core ensures stability.

Why this answer

Option D is correct because a mix of On-Demand and Spot capacity keeps the job running through interruptions. Running core nodes On-Demand protects HDFS data and cluster stability, while Spot task nodes with a diversified allocation strategy spread capacity across several Spot pools, which reduces the impact of any single interruption. Option A is incorrect because the 'lowest price' allocation strategy concentrates Spot capacity in the cheapest pools and does not reduce interruption risk.

Option B is incorrect because adding On-Demand core nodes increases cost. Option C is incorrect because running only Spot Instances increases the failure risk.

322

MCQmedium

A company is migrating an on-premises Apache Cassandra database to Amazon Keyspaces. The database has a table with a partition key of 'user_id' and a clustering column of 'timestamp'. The application frequently queries the last 10 records for a given user. Which table design in Keyspaces would provide the BEST query performance for this access pattern?

A.Partition key: random column, clustering column: none.

B.Partition key: timestamp, clustering column: user_id.

C.Partition key: user_id, clustering column: none.

D.Partition key: user_id, clustering column: timestamp (descending order).

AnswerD

This design groups all records for a user in one partition and sorts by timestamp descending, enabling efficient retrieval of the last 10 records.

Why this answer

Option D is correct because it preserves the original Cassandra table design with 'user_id' as the partition key and 'timestamp' as the clustering column in descending order. This allows Keyspaces to efficiently retrieve the last 10 records for a given user by performing a range query on the clustering column within a single partition, avoiding full table scans or cross-partition queries.

Exam trap

The trap here is that candidates may think a random partition key (Option A) or timestamp-based partition key (Option B) improves write distribution, but they overlook that the query pattern requires efficient reads within a single partition, which is best achieved by using the query filter column as the partition key and the sort column as the clustering key with the appropriate order.

How to eliminate wrong answers

Option A is wrong because using a random partition key with no clustering column would scatter data across partitions, requiring a full scan to find records for a specific user, which is highly inefficient. Option B is wrong because using 'timestamp' as the partition key would place each timestamp in a separate partition, making it impossible to query all records for a user without scanning multiple partitions, and the clustering column 'user_id' would not help retrieve the last 10 records per user efficiently. Option C is wrong because while 'user_id' as the partition key correctly groups data by user, having no clustering column means you cannot order records by timestamp, so retrieving the last 10 records would require fetching all records for that user and sorting them in application code, which is suboptimal.
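The effect of the option D layout can be illustrated with a toy in-memory model. This is only a sketch, not Keyspaces code: the class name `UserEventsTable` and its methods are hypothetical, and the model simply keeps one list per `user_id` (the partition) ordered newest-first (the descending clustering order), so reading the last 10 records is a slice from the front of one partition.

```python
from bisect import insort
from collections import defaultdict


class UserEventsTable:
    """Toy model: one partition per user_id, rows kept newest-first."""

    def __init__(self):
        # user_id -> list of (-timestamp, payload), sorted ascending,
        # which means the newest timestamp comes first.
        self._partitions = defaultdict(list)

    def insert(self, user_id, timestamp, payload):
        # Storing the negated timestamp gives a descending clustering order.
        insort(self._partitions[user_id], (-timestamp, payload))

    def last_n(self, user_id, n=10):
        # Only one partition is touched, and only its first n rows are read.
        rows = self._partitions.get(user_id, [])[:n]
        return [(-neg_ts, payload) for neg_ts, payload in rows]


table = UserEventsTable()
for ts in range(1, 16):
    table.insert("u1", ts, f"event-{ts}")
table.insert("u2", 99, "other-user")

latest = table.last_n("u1")
print(latest[0], latest[-1])
```

With no clustering column (option C), the equivalent read would have to fetch every row for the user and sort afterwards.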

323

MCQmedium

A data engineer sees this AWS Glue table definition in the Data Catalog. The engineer wants to query this table with Amazon Athena, but the query returns zero rows. What is the MOST likely cause?

A.The data files are not in the specified S3 location.

B.The SerDe library is incorrect for CSV files.

C.The table format CSV is not supported by Athena.

D.Athena cannot read tables from the Glue Data Catalog.

AnswerA

If no files exist at s3://data-lake/sales/, query returns zero rows.

Why this answer

The most likely cause is that the data files are not in the specified S3 location. When an AWS Glue table is defined in the Data Catalog, Athena reads the table's metadata (including the S3 location) and then attempts to read the underlying data files from that exact path. If the files are missing, misnamed, or in a different prefix, Athena returns zero rows because there is no data to scan.

This is a common misconfiguration when the S3 path in the table definition does not match the actual data storage.

Exam trap

The trap here is that candidates often assume the issue is with the SerDe or format compatibility, but the most common real-world cause is simply that the data files are not present at the specified S3 location, leading to zero rows returned.

How to eliminate wrong answers

Option B is wrong because the SerDe library is not incorrect for CSV files; Athena uses the LazySimpleSerDe by default for CSV, which is fully supported and does not cause zero rows. Option C is wrong because CSV is a widely supported table format in Athena, and Athena can query CSV files natively. Option D is wrong because Athena is designed to read tables from the Glue Data Catalog; in fact, Athena and Glue Data Catalog are tightly integrated, and this is a standard use case.

324

MCQhard

A company is using Amazon DynamoDB for a gaming application with high read and write throughput. The data engineer notices that the read latency is high during peak hours. The table has a partition key only (no sort key). The engineer wants to improve read performance by distributing reads across partitions more evenly. Which action should the engineer take?

A.Increase the read capacity units of the table.

B.Add a sort key to the table.

C.Enable DynamoDB Accelerator (DAX).

D.Enable DynamoDB global tables.

AnswerC

DAX provides a write-through cache, reducing read latency.

Why this answer

Option C is correct because DynamoDB Accelerator (DAX) is an in-memory cache that reduces read latency from single-digit milliseconds to microseconds by caching frequently accessed items. Since the issue is high read latency during peak hours and the table has only a partition key, DAX offloads reads from the underlying table, distributing the read load and improving response times without changing the table's key structure.

Exam trap

The trap here is that candidates often confuse increasing capacity (Option A) with improving read distribution, not realizing that latency issues from hot partitions require caching or key redesign, not just more RCUs.

How to eliminate wrong answers

Option A is wrong because increasing read capacity units (RCUs) only raises the provisioned throughput limit, but does not inherently distribute reads more evenly across partitions; high latency during peak hours is often due to hot partitions or throttling, not insufficient capacity. Option B is wrong because adding a sort key to an existing table is not possible without recreating the table, and a sort key does not directly improve read distribution across partitions; it only enables more flexible query patterns within a partition. Option D is wrong because enabling DynamoDB global tables provides multi-region replication for disaster recovery and low-latency reads from multiple regions, but it does not improve read distribution across partitions within a single table or reduce latency for a single-region workload.

325

MCQmedium

A company uses Amazon S3 to store sensitive data. The security team wants to ensure that all objects uploaded to a specific S3 bucket are automatically encrypted at rest using server-side encryption with AWS KMS managed keys (SSE-KMS). Which bucket policy statement should be added to enforce this requirement?

A.Deny put requests where 's3:x-amz-server-side-encryption' is 'aws:kms'

B.Deny put requests where 's3:x-amz-server-side-encryption' is not 'aws:kms'

C.Deny put requests where 's3:x-amz-server-side-encryption' is not 'AES256'

D.Deny put requests where 's3:x-amz-server-side-encryption' is not set

AnswerB

This enforces SSE-KMS encryption.

Why this answer

Option B is correct because denying put requests whose 's3:x-amz-server-side-encryption' value is not 'aws:kms' enforces SSE-KMS. Option A denies SSE-KMS uploads, which is the opposite of what is needed. Option C requires 'AES256', which enforces SSE-S3 rather than SSE-KMS.

Option D denies only uploads that carry no encryption header, so it still allows SSE-S3.
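A Deny statement of the kind this question describes might look like the sketch below, which builds the policy document as a Python dictionary. The bucket name `example-bucket`, the `Sid`, and the helper name `deny_non_kms_uploads` are illustrative assumptions; the policy would still have to be attached to the bucket separately.

```python
import json


def deny_non_kms_uploads(bucket_name):
    """Build a bucket policy that rejects uploads not requesting SSE-KMS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonKmsUploads",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    # Deny when the encryption header is anything but aws:kms.
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }


policy = deny_non_kms_uploads("example-bucket")
print(json.dumps(policy, indent=2))
```

Swapping the condition value to `AES256` would enforce SSE-S3 instead, which is why the value matters as much as the Deny effect.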

326

MCQhard

A company runs a critical transactional database on Amazon RDS for PostgreSQL. They need to achieve high availability with automatic failover to a different AWS Region in case of a regional outage. Which solution meets these requirements?

A.Create a cross-Region Read Replica and promote it during a disaster.

B.Take daily automated snapshots and restore them in another Region.

C.Deploy the RDS instance in a Multi-AZ configuration.

D.Use Amazon Aurora Global Database with a primary cluster in one Region and a secondary in another.

AnswerD

Aurora Global Database supports automatic failover across Regions with RPO of 1 second.

Why this answer

Option D is correct because Amazon Aurora Global Database provides cross-Region replication with failover capabilities. Option A is wrong because Read Replicas are for read scaling and must be promoted manually; they do not fail over automatically. Option B is wrong because restoring a snapshot manually takes too long for high availability.

Option C is wrong because Multi-AZ protects only against Availability Zone failures, not Regional outages.

327

MCQeasy

A company is using Amazon DynamoDB for a gaming application. They want to store player session data that expires after 24 hours. Which DynamoDB feature should be used?

A.Time to Live (TTL)

B.DynamoDB Streams

C.Global Tables

D.Point-in-Time Recovery

AnswerA

TTL deletes items automatically after a defined expiration time.

Why this answer

Amazon DynamoDB Time to Live (TTL) allows you to define a per-item timestamp attribute that automatically deletes items after a specified duration. For the gaming session data that must expire after 24 hours, you can set the TTL attribute to the current time plus 24 hours, and DynamoDB will asynchronously delete expired items without any additional cost or write operations.

Exam trap

The trap here is that candidates may confuse DynamoDB Streams (which can react to deletions) with the actual mechanism that performs the deletion, or assume Point-in-Time Recovery can be used to 'roll back' expired data, neither of which addresses automatic expiration.

How to eliminate wrong answers

Option B (DynamoDB Streams) is wrong because it captures a time-ordered sequence of item-level changes (inserts, updates, deletes) in a DynamoDB table, but it does not automatically expire or delete data; it is used for event-driven processing or replication, not for scheduled data removal. Option C (Global Tables) is wrong because it provides multi-region, fully replicated tables for low-latency access and disaster recovery, but it has no built-in mechanism to expire or delete items based on time. Option D (Point-in-Time Recovery) is wrong because it enables continuous backups of DynamoDB table data to restore to any point within the last 35 days, but it does not delete or manage the lifecycle of individual items.
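The TTL value itself is just a Unix epoch timestamp in seconds stored in a Number attribute. A minimal sketch of computing it for a 24-hour session follows; the attribute names `player_id`, `created_at`, and `expires_at` and the helper `build_session_item` are assumptions, and the TTL attribute name must match whatever is configured on the table.

```python
import time

SESSION_LIFETIME_SECONDS = 24 * 60 * 60  # 24 hours


def build_session_item(player_id, now=None):
    """Return a session item whose TTL attribute is 24 hours after `now`."""
    now = int(time.time()) if now is None else int(now)
    return {
        "player_id": player_id,
        "created_at": now,
        # TTL attribute: epoch seconds after which the item may be deleted.
        "expires_at": now + SESSION_LIFETIME_SECONDS,
    }


item = build_session_item("player-42", now=1_700_000_000)
print(item["expires_at"])
```

Because deletion is asynchronous, an application that must never see an expired session would still compare `expires_at` with the current time when reading.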

328

Multi-Selectmedium

Which TWO actions can reduce the cost of an Amazon S3 bucket that stores infrequently accessed data? (Choose 2.)

Select 2 answers

A.Enable cross-region replication

B.Enable versioning to keep multiple versions

C.Use lifecycle policies to expire objects after a certain period

D.Enable MFA Delete for extra security

E.Transition objects to S3 Standard-IA after 30 days

AnswersC, E

Expiration deletes unneeded objects.

Why this answer

Options C and E are correct because transitioning objects to S3 Standard-IA lowers the storage price for infrequently accessed data, and lifecycle expiration deletes objects that are no longer needed. Option A is wrong because cross-region replication incurs additional storage and transfer costs. Option B is wrong because enabling versioning increases storage costs.

Option D is wrong because MFA Delete does not affect storage cost.

329

Multi-Selecteasy

A data engineer is setting up Amazon S3 event notifications to trigger an AWS Lambda function when new objects are uploaded. Which TWO actions are required to enable this?

Select 2 answers

A.Add a resource-based policy to the Lambda function to allow S3 to invoke it.

B.Enable S3 versioning on the bucket.

C.Create an S3 bucket policy that grants S3 permission to invoke Lambda.

D.Configure an event notification on the S3 bucket for s3:ObjectCreated:* events.

E.Set up an Amazon CloudWatch Events rule to detect S3 uploads.

AnswersA, D

Necessary for S3 to trigger Lambda.

Why this answer

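The Lambda function must have a resource-based policy that allows S3 to invoke it, and the S3 bucket must have an event notification configuration for the object-created events. An S3 bucket policy is not required, because the permission to invoke is granted on the Lambda side.

CloudWatch Events rules are not needed for S3 event notifications, and versioning is not a prerequisite either.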
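The two required pieces can be sketched as plain dictionaries. They are shaped like the arguments that the Lambda `add_permission` and S3 `put_bucket_notification_configuration` API calls take, but nothing here calls AWS; the function name, bucket name, ARN, and `StatementId` are all placeholder assumptions.

```python
def lambda_invoke_permission(function_name, bucket_name):
    """Resource-based policy entry that lets S3 invoke the function."""
    return {
        "FunctionName": function_name,
        "StatementId": "AllowS3Invoke",  # hypothetical statement ID
        "Action": "lambda:InvokeFunction",
        "Principal": "s3.amazonaws.com",
        "SourceArn": f"arn:aws:s3:::{bucket_name}",
    }


def bucket_notification(function_arn):
    """Notification configuration that fires on every object creation."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


permission = lambda_invoke_permission("process-upload", "example-bucket")
notification = bucket_notification(
    "arn:aws:lambda:us-east-1:111122223333:function:process-upload"
)
print(permission["Principal"], notification["LambdaFunctionConfigurations"][0]["Events"])
```

The permission normally has to exist before the notification is saved, since S3 validates that it can invoke the destination.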

330

Multi-Selecthard

A company has an S3 bucket with versioning enabled that stores critical data. The security team requires that once an object is deleted, it cannot be recovered by anyone, including the root user. Additionally, the company wants to ensure that objects cannot be overwritten for a specified period. Which THREE actions should the data engineer take to meet these requirements? (Choose THREE.)

Select 3 answers

A.Enable S3 Object Lock in compliance mode.

B.Set a retention period on the bucket using Object Lock.

C.Enable S3 Versioning on the bucket.

D.Enable MFA Delete on the bucket.

E.Configure a lifecycle policy to expire noncurrent versions after 1 day.

AnswersA, B, C

Compliance mode prevents deletion by any user, including root.

Why this answer

Options A, B, and C are correct. S3 Object Lock in compliance mode prevents any user, including the root user, from deleting or overwriting protected object versions; versioning must be enabled for Object Lock to work; and a retention period enforces the protection for a specified time. Option D is wrong because MFA Delete can be bypassed by the root user; Option E is wrong because lifecycle policies can delete objects, which is not allowed.

331

MCQmedium

A company uses Amazon S3 to store sensitive customer data. The security team requires that all objects uploaded to a specific bucket be encrypted at rest using AWS KMS with a customer managed key. Which bucket policy statement should be applied to enforce this requirement?

A.Deny s3:PutObject unless s3:x-amz-server-side-encryption is present

B.Allow s3:PutObject only if s3:x-amz-server-side-encryption is present

C.Deny s3:PutObject unless s3:x-amz-server-side-encryption-aws-kms-key-id equals the specific KMS key ARN

D.Deny s3:PutObject unless s3:x-amz-server-side-encryption equals AES256

AnswerC

This condition ensures only the specified KMS key is used for encryption.

Why this answer

Option C is correct because the security team requires encryption at rest using AWS KMS with a customer managed key. The bucket policy must deny any s3:PutObject request that does not include the s3:x-amz-server-side-encryption-aws-kms-key-id condition key set to the specific KMS key ARN. This ensures that only objects encrypted with the designated customer managed key are allowed, enforcing the encryption requirement at the bucket policy level.

Exam trap

The trap here is that candidates often confuse the condition key s3:x-amz-server-side-encryption (which only checks for SSE-S3 or SSE-KMS) with s3:x-amz-server-side-encryption-aws-kms-key-id (which checks for a specific KMS key), leading them to pick Option A or D instead of C.

How to eliminate wrong answers

Option A is wrong because it only checks for the presence of any server-side encryption header (s3:x-amz-server-side-encryption), which could be AES256 (SSE-S3) or aws:kms (SSE-KMS), but does not enforce the use of a customer managed KMS key. Option B is wrong because using an Allow effect with a condition does not override a default implicit deny; to enforce a restriction, you must use an explicit Deny statement. Option D is wrong because it requires the encryption header to equal AES256, which enforces SSE-S3, not SSE-KMS with a customer managed key.

332

MCQeasy

A data engineer needs to store a large number of small files (each a few KB) from IoT sensors. The data is written once and never modified. The primary requirement is high write throughput and low latency for writes. Which storage solution is most suitable?

A.Amazon DynamoDB with on-demand capacity

B.Amazon RDS for MySQL with InnoDB

C.Amazon S3 with standard storage class

D.Amazon Elastic Block Store (EBS) volumes

AnswerA

DynamoDB provides single-digit millisecond latency and high throughput for writes.

Why this answer

Amazon DynamoDB with on-demand capacity is the most suitable because it is a NoSQL key-value and document database designed for single-digit millisecond latency at any scale. It supports high write throughput by automatically distributing data across multiple partitions, and on-demand capacity eliminates the need for provisioning, allowing it to absorb unpredictable write spikes from many IoT sensors without throttling.

Exam trap

The trap here is that candidates often choose Amazon S3 for storing small files because of its durability and cost, but they overlook the fact that S3's per-request PUT latency makes it unsuitable for high-frequency, low-latency write workloads, whereas DynamoDB is purpose-built for such patterns.

How to eliminate wrong answers

Option B (Amazon RDS for MySQL with InnoDB) is wrong because relational databases are optimized for complex queries and ACID transactions, not for high-throughput ingestion of many small, immutable writes; they incur overhead from indexing, locking, and transaction logs that limit write throughput. Option C (Amazon S3 with standard storage class) is wrong because S3 is an object store optimized for durability and high read throughput, but each PUT request has a write latency of tens to hundreds of milliseconds, making it unsuitable for low-latency, high-frequency writes of many small files. Option D (Amazon Elastic Block Store volumes) is wrong because EBS provides block-level storage volumes attached to a single EC2 instance, which creates a bottleneck for distributed write workloads and does not natively support the high concurrency needed for thousands of simultaneous sensor writes.

333

MCQmedium

A media company stores video files in an S3 bucket. The files are processed by a fleet of EC2 instances that read the files, add watermarks, and write the output back to the same bucket. Recently, the processing jobs have been failing with '500 Internal Server Error' and '503 Slow Down' errors. The data engineer checks the S3 bucket metrics and sees that the PUT/GET request rate is consistently above 5,500 requests per second for a single prefix. The engineer needs to resolve the errors with minimal changes to the application code. Which course of action should the engineer take?

A.Use S3 Batch Operations to process the files.

B.Increase the number of EC2 instances to process files in parallel.

C.Enable S3 Transfer Acceleration on the bucket to improve throughput.

D.Modify the application to add a random hash prefix to the object keys to distribute load across multiple prefixes.

AnswerD

Spreading requests across many prefixes increases the aggregate request rate limit.

Why this answer

Option D is correct. Distributing objects across multiple prefixes (e.g., by adding a hash prefix) raises the aggregate request rate limit, because S3 supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix. Option C is wrong because S3 Transfer Acceleration improves speed over distance but does not increase request rate limits.

Option B is wrong because adding EC2 instances would increase the request rate, worsening the issue. Option A is wrong because S3 Batch Operations is for large-scale batch operations, not for real-time processing.
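One way to add the hash prefix from option D is sketched below. It derives the prefix from a SHA-256 digest of the key rather than from a random value, which is an assumption made here so that readers of the object can recompute the same prefixed key; the helper name `hashed_key` and the two-character prefix length are also illustrative.

```python
import hashlib


def hashed_key(object_key, prefix_chars=2):
    """Prepend a short, deterministic hash prefix to spread keys over prefixes."""
    digest = hashlib.sha256(object_key.encode("utf-8")).hexdigest()
    # Two hex characters give up to 256 distinct prefixes.
    return f"{digest[:prefix_chars]}/{object_key}"


key = hashed_key("videos/2024/clip-0001.mp4")
print(key)
```

Each distinct prefix gets its own request rate allowance, so the aggregate limit grows with the number of prefixes in use.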

334

MCQeasy

The exhibit shows a build log from AWS CodeBuild. The build fails with a permission error when trying to open the downloaded file. What is the most likely cause?

A.The S3 bucket policy denies access to the object.

B.The python script is not in the PATH.

C.The downloaded file has restrictive permissions that the python process cannot read.

D.The file is encrypted and cannot be decrypted.

AnswerC

Permission denied suggests file ownership/permissions issue.

Why this answer

A 'Permission denied' error when opening the file points to the file's ownership or mode, not to S3 access, because the download itself succeeded. CodeBuild runs build commands as root by default, but if the buildspec runs process.py as a different, non-root user, that user cannot read a file that was downloaded with restrictive permissions (for example, mode 600 owned by root).

The most likely cause is therefore that the file permissions do not allow read access for the user running Python, so option C is correct.

335

Multi-Selecthard

A company is designing a multi-Region disaster recovery solution for Amazon DynamoDB. They need to ensure that data is replicated across Regions with minimal latency and that applications can read from any Region. Which THREE steps should be taken? (Choose THREE.)

Select 3 answers

A.Configure application to read from any Region using the DynamoDB endpoint.

B.Configure Time to Live (TTL) to automatically expire old data.

C.Enable DynamoDB Global Tables.

D.Enable DynamoDB Streams on the table.

E.Deploy DynamoDB Accelerator (DAX) in each Region.

AnswersA, C, D

Global Tables allow reads from any Region.

Why this answer

Options A, C, and D are correct. Global Tables replicate data across Regions. DynamoDB Streams capture the changes used for replication.

Applications can read from any Region through the Regional DynamoDB endpoint. Option E is wrong because DAX is not required for global replication. Option B is wrong because TTL is for expiration, not replication.

336

Multi-Selectmedium

A company uses Amazon DynamoDB for a gaming leaderboard. The table has a primary key of GameId (partition key) and Score (sort key). The application needs to retrieve the top 10 scores for a given game. Which strategies can improve query performance? (Choose TWO.)

Select 2 answers

A.Change the primary key to a single attribute

B.Create a global secondary index with Score as sort key

C.Use a Scan operation with a limit

D.Increase the write capacity units

E.Use DynamoDB Accelerator (DAX) for caching

AnswersB, E

Allows efficient sorted queries.

Why this answer

Option B is correct because creating a global secondary index (GSI) with Score as the sort key allows efficient range queries on scores for a given GameId. DynamoDB can then use the GSI to retrieve the top 10 scores in sorted order without scanning the entire table, leveraging the index's sort key to fetch only the highest values.

Exam trap

The trap here is that candidates may think a Scan with a limit is efficient for top-N queries, but a Scan returns items in no guaranteed score order and its limit only caps the number of items evaluated, so it cannot return the highest scores without reading and sorting the data.

337

MCQmedium

A company uses Amazon Redshift for data warehousing. The data engineering team notices that queries are slow due to high disk I/O. The team wants to improve query performance without changing the cluster configuration. Which action should the team take?

A.Increase the number of nodes in the cluster.

B.Redesign tables with appropriate sort keys and distribution styles.

C.Run the ANALYZE command to update table statistics.

D.Run the VACUUM command to reclaim disk space.

AnswerB

Proper sort keys and distribution can minimize data scanning and reduce I/O.

Why this answer

Redesigning tables with appropriate sort keys and distribution styles directly addresses high disk I/O by minimizing data scanning and reducing data movement across nodes. Sort keys enable Redshift to skip irrelevant blocks via zone maps, while distribution styles (KEY, ALL, EVEN) optimize data locality for joins and aggregations, reducing I/O without changing cluster configuration.

Exam trap

The trap here is that candidates confuse maintenance commands (ANALYZE, VACUUM) with design changes, or think scaling out (adding nodes) is allowed when the question explicitly forbids changing cluster configuration.

How to eliminate wrong answers

Option A is wrong because increasing the number of nodes changes the cluster configuration, which the question explicitly prohibits. Option C is wrong because ANALYZE updates table statistics for the query planner but does not reduce disk I/O caused by poor data layout or data movement. Option D is wrong because VACUUM reclaims disk space from deleted rows and sorts data, but it does not fundamentally redesign tables to reduce I/O; it only maintains existing design.

338

MCQeasy

A company wants to store historical financial data for 7 years with immediate access for the first year and then infrequent access. After 7 years, the data must be automatically deleted. Which S3 lifecycle policy should be configured?

A.Transition to S3 Standard-IA after 30 days, expire after 2555 days

B.Transition to S3 Glacier Flexible Retrieval after 365 days, expire after 2555 days

C.Transition to S3 One Zone-IA after 90 days, expire after 365 days

D.Transition to S3 Glacier Deep Archive after 365 days, expire after 2555 days

AnswerD

This is cost-effective: immediate access for 1 year, then low-cost storage, delete after 7 years.

Why this answer

Option D is correct because it meets all requirements: immediate access for the first year (no transition before 365 days), then transition to S3 Glacier Deep Archive for infrequent access and cost savings, with automatic deletion after 7 years (2555 days). S3 Glacier Deep Archive is the most cost-effective storage class for long-term archival data that is rarely accessed, and the 2555-day expiration ensures compliance with the 7-year retention policy.

Exam trap

The trap here is that candidates often confuse 'immediate access for the first year' with needing a transition to a cheaper tier early, or they choose Option B out of familiarity with S3 Glacier Flexible Retrieval, overlooking that S3 Glacier Deep Archive costs less for data that is rarely accessed over the remaining six years.

How to eliminate wrong answers

Option A is wrong because transitioning to S3 Standard-IA after 30 days moves data out of S3 Standard during the first year, when immediate access is needed, and Standard-IA remains more expensive than an archive class for the rarely accessed years that follow; the 2555-day expiration itself is correct. Option B is wrong because, although the transition after 365 days and the expiration after 2555 days both fit the requirements, S3 Glacier Flexible Retrieval costs more than S3 Glacier Deep Archive, so it is not the most cost-effective choice. Option C is wrong because transitioning to S3 One Zone-IA after 90 days is too early and does not provide the durability needed for financial data (single AZ risk), and the 365-day expiration is far too short for a 7-year retention requirement.
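The lifecycle rule in option D, and the arithmetic behind the 2555-day figure, can be written out as a small sketch. The dictionary follows the shape of an S3 lifecycle configuration; the rule ID is a made-up name, and the empty prefix filter (apply to every object) is an assumption.

```python
RETENTION_YEARS = 7
EXPIRATION_DAYS = RETENTION_YEARS * 365  # 7 * 365 = 2555

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-after-one-year-expire-after-seven",  # hypothetical
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # assumption: rule covers the whole bucket
            # Keep objects in the original class for the first year.
            "Transitions": [{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}],
            # Delete objects automatically once the retention period ends.
            "Expiration": {"Days": EXPIRATION_DAYS},
        }
    ]
}

print(EXPIRATION_DAYS)
```

Changing `StorageClass` to `GLACIER` would give the Flexible Retrieval variant in option B with the same timing.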

339

MCQhard

A company uses Amazon S3 to store large datasets. The data engineering team needs to provide access to specific objects in the bucket to external partners using presigned URLs. Each URL should expire after 12 hours. The team wants to ensure that the presigned URLs cannot be used to access other objects in the bucket. Which approach should be taken?

A.Create an IAM role for each partner and attach a policy that grants access to specific objects.

B.Generate presigned URLs using the AWS SDK, specifying the exact object key and expiration time.

C.Use a bucket policy that allows access only from the partner's IP address range.

D.Use CloudFront signed URLs with a custom policy that restricts access to specific objects.

AnswerB

Presigned URLs grant access only to the specified object and expire after the set time.

Why this answer

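Option B is correct because a presigned URL is generated for one specific object key and expiration time, so it grants access to that object only and cannot be reused for other keys in the bucket. Option A (an IAM role per partner) is intended for principals that authenticate to AWS rather than for sharing individual objects with external partners, and it does not by itself enforce a 12-hour expiry. Option C (a bucket policy with an IP restriction) limits where requests come from, not which objects can be read. Option D (CloudFront signed URLs) also works but is more complex and may not be necessary.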
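
As a rough sketch of Option B, the helper below wraps the S3 client's `generate_presigned_url` method. The function name `presign_get_object` is hypothetical, and the client is passed in so that the snippet does not depend on boto3 being installed.

```python
TWELVE_HOURS_IN_SECONDS = 12 * 60 * 60  # 43200 seconds


def presign_get_object(s3_client, bucket, key, expires_in=TWELVE_HOURS_IN_SECONDS):
    """Return a URL that grants GET access to exactly one object key.

    The signature covers the bucket, key, and expiry, so the URL cannot be
    edited to fetch a different object.
    """
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

In practice the first argument would be `boto3.client("s3")`, and the bucket and key would name the single object a partner is allowed to download.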

340

MCQeasy

A data engineer needs to store JSON documents that are accessed by a serverless application using AWS Lambda. The documents are frequently updated and need low latency (single-digit milliseconds) for read and write operations. Which AWS service should the engineer use?

A.Amazon DynamoDB

B.Amazon ElastiCache for Redis

C.Amazon S3 (with S3 Select)

D.Amazon RDS for MySQL

AnswerA

DynamoDB offers single-digit millisecond latency for reads and writes and supports JSON documents natively.

Why this answer

Amazon DynamoDB is a fully managed NoSQL key-value and document database that provides single-digit millisecond latency for read and write operations at any scale. It natively supports JSON documents, integrates directly with AWS Lambda via the AWS SDK, and handles frequent updates efficiently through its auto-scaling and on-demand capacity modes, making it ideal for serverless applications requiring low-latency data access.

Exam trap

The trap here is that candidates often confuse ElastiCache for Redis as a primary data store due to its low latency, overlooking that it is an in-memory cache with no built-in persistence guarantees, whereas DynamoDB provides both low latency and durable, persistent storage for JSON documents.

How to eliminate wrong answers

Option B is wrong because Amazon ElastiCache for Redis is an in-memory cache, not a durable data store; while it offers sub-millisecond latency, it is typically used for caching or session management and requires a separate persistent database to avoid data loss on node failure, making it unsuitable as the primary store for frequently updated JSON documents that must persist. Option C is wrong because Amazon S3 is an object storage service with higher latency (typically tens to hundreds of milliseconds) for read and write operations, and S3 Select is a server-side filtering feature that does not reduce latency for individual document reads or writes; S3 is not designed for frequent, low-latency updates. Option D is wrong because Amazon RDS for MySQL is a relational database that requires schema definition and does not treat JSON as a first-class document model (it supports a JSON data type, but lacks the flexible schema and single-digit millisecond key-value performance of DynamoDB), and it incurs higher operational overhead for scaling and connection management in a serverless architecture.

341

MCQhard

A company uses Amazon DynamoDB to store metadata for a document management system. The table has a partition key of document_id and a sort key of version. The application frequently queries for the latest version of a document by document_id. The data engineer notices that these queries are consuming a lot of read capacity. How can the engineer optimize the read performance and reduce read capacity consumption?

A.Change the sort key to store version in descending order.

B.Enable DynamoDB Streams and use a read replica.

C.Decrease the ReadCapacityUnits of the table to force caching.

D.Create a global secondary index (GSI) and use DynamoDB Accelerator (DAX).

AnswerD

A GSI can support efficient queries, and DAX caches results, reducing read capacity.

Why this answer

Option D is correct because a global secondary index (GSI) with document_id as partition key and version as sort key, but with a narrower projection, lets queries read smaller items from the index. More importantly, DynamoDB Accelerator (DAX) caches the results of frequent queries, which reduces read capacity consumption. Option A is incorrect because changing the sort key order does not reduce the read capacity consumed by the same query. Option B is incorrect because read replicas are not a feature of DynamoDB. Option C is incorrect because decreasing ReadCapacityUnits would cause throttling.

342

MCQhard

A data engineer runs the above DDL statement in Amazon Athena. The query returns an error. What is the most likely cause?

A.The SerDe is not compatible with Parquet files.

B.The INPUTFORMAT is incorrect for Parquet files.

C.The S3 bucket location does not exist.

D.The table name contains underscores.

AnswerB

TextInputFormat is for text files, not Parquet. Should use Parquet input format.

Why this answer

Option B is correct because the table is defined with the Parquet SerDe but the INPUTFORMAT is TextInputFormat, which is incompatible. For Parquet files, the INPUTFORMAT should be 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'. Option A is wrong because the SerDe is correct for Parquet; it is the mismatch with INPUTFORMAT that causes the error. Option C is wrong because the location is valid. Option D is wrong because the table name is valid; underscores are allowed.
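
The mismatch can be illustrated with a small consistency check. The exhibit is not reproduced here, so the values passed in the example call are assumptions; the class names are the standard Hive ones for Parquet and text input, and `parquet_format_problems` is a hypothetical helper.

```python
# Hive class names that belong together for a Parquet table.
PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
PARQUET_INPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
PARQUET_OUTPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
TEXT_INPUT = "org.apache.hadoop.mapred.TextInputFormat"


def parquet_format_problems(serde, input_format, output_format):
    """List the mismatches in a table definition that uses the Parquet SerDe."""
    problems = []
    if serde != PARQUET_SERDE:
        return problems  # only Parquet tables are checked in this sketch
    if input_format != PARQUET_INPUT:
        problems.append("INPUTFORMAT should be " + PARQUET_INPUT)
    if output_format != PARQUET_OUTPUT:
        problems.append("OUTPUTFORMAT should be " + PARQUET_OUTPUT)
    return problems


# Assumed shape of the failing DDL: Parquet SerDe combined with TextInputFormat.
for problem in parquet_format_problems(PARQUET_SERDE, TEXT_INPUT, PARQUET_OUTPUT):
    print(problem)
```

In Athena, declaring the table with `STORED AS PARQUET` sets all three values consistently and avoids this class of error.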

343

Multi-Selecteasy

Which TWO features of Amazon DynamoDB help ensure high availability and durability? (Choose two.)

Select 2 answers

A.Auto-scaling adjusts provisioned capacity based on traffic.

B.Data is automatically replicated across multiple Availability Zones within an AWS Region.

C.Global tables enable active-active replication across multiple AWS Regions.

D.On-demand backup and restore provides point-in-time recovery.

E.Time to Live (TTL) automatically deletes expired items.

AnswersB, D

Provides high availability and durability.

Why this answer

Option B is correct because DynamoDB automatically replicates data across multiple Availability Zones (AZs) within an AWS Region. This built-in replication keeps the data available and durable even if an entire AZ fails. Option D is correct because backups allow a table to be restored after accidental writes or deletes, which protects the durability of the data.

Exam trap

The trap here is that candidates often confuse auto-scaling (Option A) with high availability, but auto-scaling only adjusts capacity to meet demand, not data replication or fault tolerance.

344

Multi-Selectmedium

A company uses Amazon Redshift for analytics. The data engineering team wants to improve query performance for frequently used aggregate queries. Which TWO actions would help achieve this?

Select 2 answers

A.Increase the number of WLM query queues

B.Use distribution keys to collocate data on the same node slices

C.Run the VACUUM command to reclaim space from deleted rows

D.Define appropriate sort keys on the tables

E.Increase the number of nodes in the cluster

AnswersB, D

Distribution keys reduce data movement during joins and aggregations.

Why this answer

Distribution keys determine how data is distributed across node slices in Amazon Redshift. By choosing distribution keys that align with the join and aggregation columns, the database can collocate related data on the same slice, minimizing data movement during query execution. This directly improves performance for aggregate queries by reducing network traffic and enabling local computation. Sort keys (Option D) determine the on-disk order of rows, so Redshift can use zone maps to skip blocks that fall outside a query's filter range.

Exam trap

The trap here is that candidates often confuse VACUUM (which reclaims space) with performance optimization for queries, or assume adding nodes always improves query speed without considering the overhead of data redistribution.

345

MCQhard

A company uses Amazon Redshift for its data warehouse. The data engineering team notices that queries against a large fact table are slow. The table is distributed using DISTSTYLE EVEN and has multiple sort keys. After analyzing the query plans, they find that most queries filter on a specific column, 'customer_id'. Which change would most likely improve query performance for these filter operations?

A.Add a secondary sort key on 'customer_id'.

B.Change to DISTSTYLE KEY on the 'customer_id' column.

C.Change to DISTSTYLE EVEN with a different sort key.

D.Change to DISTSTYLE ALL for the fact table.

AnswerB

KEY distribution on the filtered column reduces data movement during queries.

Why this answer

Option B is correct because using DISTSTYLE KEY on 'customer_id' collocates rows with the same customer_id on the same slice, reducing data shuffling for queries filtering on that column. Option C (EVEN distribution with a different sort key) does not help with filtering on 'customer_id'. Option D (ALL distribution) is intended for small dimension tables, not a large fact table. Option A (adding 'customer_id' as a secondary sort key) may not help, because a compound sort key is most effective when the filtered column leads the key.

346

MCQhard

A data engineer notices that an Amazon Redshift cluster’s storage usage is increasing rapidly due to many UPDATE and DELETE operations. The engineer needs to reclaim storage space and improve query performance. Which action should be taken?

A.Run VACUUM command

B.UNLOAD the table to S3 and reload

C.Increase cluster node count

D.Run ANALYZE command

AnswerA

VACUUM reclaims disk space and re-sorts rows.

Why this answer

The VACUUM command in Amazon Redshift reclaims disk space occupied by deleted or updated rows and re-sorts the data according to the table's sort keys. This directly addresses the storage increase from UPDATE/DELETE operations and improves query performance by restoring the physical order of rows, which reduces the number of blocks scanned.

Exam trap

The trap here is that candidates confuse ANALYZE with VACUUM, thinking updating statistics will also reclaim storage, when in fact ANALYZE only refreshes metadata for the query optimizer and has no effect on physical storage.

How to eliminate wrong answers

Option B is wrong because unloading the table to S3 and reloading is a heavy, manual process that does not reclaim space in place and can be avoided with a simple VACUUM; it also incurs additional S3 costs and time. Option C is wrong because increasing the cluster node count adds more storage and compute capacity but does not reclaim the existing wasted space from deleted rows, and it may not improve performance if the underlying data is fragmented. Option D is wrong because the ANALYZE command only updates table statistics for the query planner; it does not reclaim storage space or physically reorganize data affected by UPDATE/DELETE operations.

347

MCQeasy

A data engineer needs to store semi-structured data (JSON logs) from thousands of IoT devices. The data must be schema-less, highly scalable, and support low-latency queries by device ID and timestamp. Which AWS service should the engineer use?

A.Amazon RDS for PostgreSQL

B.Amazon Redshift

C.Amazon DynamoDB

D.Amazon S3

AnswerC

DynamoDB supports flexible schema, high throughput, and low-latency queries on partition key and sort key.

Why this answer

Option C is correct because Amazon DynamoDB is a NoSQL key-value and document database that supports schema-less design, high scalability, and low-latency queries. Option A is wrong because RDS is relational and schema-on-write. Option B is wrong because Redshift is a data warehouse for analytics, not low-latency point queries. Option D is wrong because S3 is object storage, not a database, and queries against it would have higher latency.
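
A possible key design for this workload is sketched below, assuming device ID as the partition key and a numeric timestamp as the sort key. The table and attribute names (`iot_device_logs`, `device_id`, `event_ts`) are hypothetical; the dictionaries follow the parameter shapes of the low-level DynamoDB `create_table` and `query` operations.

```python
# Table definition: only the key attributes are declared; every other
# attribute of the JSON log can vary from item to item.
table_definition = {
    "TableName": "iot_device_logs",  # hypothetical table name
    "AttributeDefinitions": [
        {"AttributeName": "device_id", "AttributeType": "S"},
        {"AttributeName": "event_ts", "AttributeType": "N"},
    ],
    "KeySchema": [
        {"AttributeName": "device_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "event_ts", "KeyType": "RANGE"},   # sort key
    ],
    "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
}


def device_time_range_query(device_id, start_ts, end_ts):
    """Build query parameters for one device within a timestamp range."""
    return {
        "TableName": table_definition["TableName"],
        "KeyConditionExpression": "device_id = :d AND event_ts BETWEEN :start AND :end",
        "ExpressionAttributeValues": {
            ":d": {"S": device_id},
            ":start": {"N": str(start_ts)},
            ":end": {"N": str(end_ts)},
        },
    }


print(device_time_range_query("sensor-42", 1700000000, 1700003600))
```

Because the query names a single partition key value and a sort key range, DynamoDB reads only the matching items rather than scanning the table.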

348

Multi-Selecthard

Which THREE factors should be considered when choosing a partition key for an Amazon DynamoDB table?

Select 3 answers

A.The partition key should be chosen to maximize the size of items in each partition.

B.If the table has a write-heavy workload, the partition key should distribute writes evenly.

C.The partition key should align with the most common query access pattern.

D.The partition key should be chosen to minimize read capacity unit consumption.

E.The partition key should have high cardinality to distribute data evenly.

AnswersB, C, E

Even write distribution prevents throttling.

Why this answer

Option B is correct because DynamoDB distributes data and request traffic across partitions based on the partition key. For write-heavy workloads, a partition key that evenly distributes writes prevents hot partitions, which can throttle requests and degrade performance. This ensures that no single partition exceeds its write capacity limit.

Exam trap

The trap here is that candidates may think maximizing item size (Option A) or minimizing RCU consumption (Option D) are primary factors, when in fact even distribution and access pattern alignment are the critical design principles for DynamoDB partition keys.
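
The effect of cardinality on distribution can be illustrated with a toy simulation. This is only a sketch: DynamoDB's real hash function and partition count are internal, so MD5 and eight partitions stand in as assumptions, and the key values are invented.

```python
import hashlib
from collections import Counter


def partition_for(key, partitions=8):
    """Map a partition key value to a partition using a stand-in hash (MD5)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partitions


def busiest_partition_share(keys, partitions=8):
    """Fraction of all writes that land on the most loaded partition."""
    counts = Counter(partition_for(k, partitions) for k in keys)
    return max(counts.values()) / len(keys)


# High cardinality: one distinct key value per write.
high_cardinality = [f"user-{i}" for i in range(10_000)]
# Low cardinality: only two distinct values, heavily skewed toward one.
low_cardinality = ["active" if i % 10 else "inactive" for i in range(10_000)]

print("high cardinality:", busiest_partition_share(high_cardinality))
print("low cardinality:", busiest_partition_share(low_cardinality))
```

With many distinct key values the busiest partition receives a share close to one eighth of the writes, while the two-value key concentrates at least 90% of them on a single partition, which is the hot-partition pattern that leads to throttling.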

349

MCQeasy

A company is using an RDS for PostgreSQL instance and wants to minimize downtime during a major version upgrade. Which approach should be taken?

A.Create a read replica of the DB instance, upgrade the replica, and then promote it to the primary instance.

B.Use AWS Database Migration Service (DMS) to migrate data to a new upgraded instance.

C.Modify the DB instance and apply the upgrade immediately.

D.Take a snapshot of the DB instance and restore it as a new instance with the upgraded version.

AnswerA

Minimizes downtime by failing over to the upgraded replica.

Why this answer

Option A is correct because creating a read replica of the RDS for PostgreSQL instance, upgrading the replica to the new major version, and then promoting it to become the primary instance minimizes downtime by allowing the replica to be upgraded while the original primary remains fully operational. The promotion process is fast (typically seconds), and the only downtime is the brief cutover period when applications switch to the promoted replica. This approach leverages RDS's managed replication and avoids the longer downtime associated with direct in-place upgrades.

Exam trap

The trap here is that candidates often assume a snapshot-and-restore (Option D) is the fastest method because it seems like a simple copy, but they overlook the fact that the snapshot itself requires the instance to be operational and the restore creates a new instance that is not automatically kept in sync, leading to longer overall downtime compared to the replica promotion method.

How to eliminate wrong answers

Option B is wrong because AWS Database Migration Service (DMS) is designed for heterogeneous or homogeneous migrations with ongoing replication, but it introduces significant complexity and potential downtime during the full-load and change-data-capture phases; it is not the optimal approach for a simple major version upgrade of an existing RDS instance. Option C is wrong because modifying the DB instance and applying the upgrade immediately causes an in-place upgrade that typically results in several minutes of downtime (often 10–30 minutes or more) while the instance is stopped, upgraded, and restarted, which violates the goal of minimizing downtime. Option D is wrong because taking a snapshot and restoring it as a new instance with the upgraded version requires the source instance to be available during the snapshot (which can take time) and then the restore process creates a new instance that is not automatically synchronized with the original; this approach involves significant downtime for the snapshot creation and restore, and does not provide a seamless cutover.

350

MCQeasy

A data engineer needs to store semi-structured JSON data from IoT devices. The data is written frequently and read occasionally. Which AWS service is MOST cost-effective for this use case?

A.Amazon ElastiCache for Redis

B.Amazon DynamoDB

C.Amazon RDS for MySQL

D.Amazon Redshift

AnswerB

DynamoDB handles high write volumes efficiently.

Why this answer

Option B is correct because DynamoDB is designed for high write throughput and low-latency reads. Option C (RDS) is relational and more expensive. Option A (ElastiCache) is in-memory and costly for large data. Option D (Redshift) is for analytics.

351

MCQhard

A data engineer is designing a multi-region disaster recovery solution for Amazon RDS for PostgreSQL. The primary region must have a standby in a different Availability Zone, and the secondary region must have a readable replica that can be promoted in case of failure. Which configuration meets these requirements?

A.Use a single-AZ primary and enable automatic backups

B.Enable Multi-AZ in the primary region and create a cross-region read replica

C.Use a single-AZ primary and create a cross-region read replica

D.Enable Multi-AZ in both primary and secondary regions

AnswerB

Multi-AZ provides standby; cross-region replica provides DR.

Why this answer

Option B is correct because it meets both requirements: Multi-AZ in the primary region provides a synchronous standby in a different Availability Zone for high availability, and a cross-region read replica in the secondary region provides an asynchronous, readable copy that can be promoted to a standalone primary during a regional failure. This combination ensures both intra-region fault tolerance and inter-region disaster recovery.

Exam trap

The trap here is that candidates often confuse Multi-AZ (synchronous, for high availability within a region) with cross-region read replicas (asynchronous, for disaster recovery), and may incorrectly assume that Multi-AZ alone provides cross-region failover or that a single-AZ primary with a read replica satisfies the intra-region standby requirement.

How to eliminate wrong answers

Option A is wrong because a single-AZ primary with automatic backups does not provide a standby in a different Availability Zone, nor does it create a readable replica in a secondary region; backups are for point-in-time recovery, not for immediate failover or read scaling. Option C is wrong because a single-AZ primary lacks the required standby in a different Availability Zone within the primary region; the cross-region read replica only addresses the secondary region requirement. Option D is wrong because enabling Multi-AZ in both regions does not create a cross-region read replica; Multi-AZ in the secondary region provides a standby within that region but does not establish a readable replica that can be promoted from the primary region.

352

MCQeasy

A data engineer is designing a data lake on Amazon S3. The data includes personally identifiable information (PII) that must be encrypted at rest. Which encryption option provides the most control over encryption keys?

A.Client-side encryption using Amazon S3 Encryption Client.

B.Server-side encryption with S3 managed keys (SSE-S3).

C.Server-side encryption with AWS KMS managed keys (SSE-KMS).

D.Server-side encryption with customer-provided keys (SSE-C).

AnswerC

Allows use of customer-managed KMS keys, giving more control.

Why this answer

Option C is correct because SSE-KMS allows the customer to manage and control KMS keys. Option B is incorrect because SSE-S3 uses Amazon-managed keys. Option D is incorrect because SSE-C uses customer-provided keys but requires the customer to store the keys and supply them with each request. Option A is incorrect because client-side encryption is performed by the application before upload rather than by S3.

353

MCQhard

A company is migrating an on-premises Hadoop cluster to AWS. The data is stored in HDFS and needs to be accessible by both Amazon EMR and Amazon Redshift Spectrum. Which storage solution is most cost-effective and scalable?

A.Amazon FSx for HDFS

B.Amazon Simple Storage Service (S3)

C.Amazon Elastic Block Store (EBS)

D.Amazon Elastic File System (EFS)

AnswerB

S3 is highly scalable, durable, and can be queried by Redshift Spectrum and processed by EMR.

Why this answer

Option B is correct because Amazon S3 is a cost-effective, scalable object store that can be accessed by both EMR and Redshift Spectrum. Option C is wrong because EBS volumes are attached to EC2 instances and cannot be queried by Redshift Spectrum. Option D is wrong because EFS is a file system and is not as cost-effective for large-scale data. Option A is wrong because an HDFS-compatible managed file system would be more expensive than S3; note also that Amazon FSx is offered for Windows File Server, Lustre, NetApp ONTAP, and OpenZFS, not for HDFS.

354

MCQmedium

A data engineer is designing a data lake on Amazon S3. The data includes personally identifiable information (PII) that must be encrypted at rest. Which combination of actions meets the encryption requirement with the least operational overhead?

A.Apply a bucket policy that denies access to unencrypted requests

B.Enable default encryption on the S3 bucket using SSE-S3

C.Use client-side encryption with AWS KMS

D.Use server-side encryption with AWS KMS (SSE-KMS)

AnswerB

SSE-S3 is simple and automatically encrypts objects.

Why this answer

Option B is correct because S3 managed keys (SSE-S3) provide encryption at rest with minimal management. Option C is wrong because client-side encryption requires application changes. Option D is wrong because KMS adds key management overhead. Option A is wrong because bucket policies do not encrypt data.
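
The difference in overhead between SSE-S3 and SSE-KMS shows up in the default-encryption configuration itself. The helper below is a sketch with a hypothetical name, `default_encryption_config`; it returns the dictionary shape that boto3's `put_bucket_encryption` call accepts as `ServerSideEncryptionConfiguration`.

```python
def default_encryption_config(kms_key_id=None):
    """Build a bucket default-encryption configuration.

    With no key ID the bucket uses SSE-S3 (AES256), where S3 owns the keys.
    With a key ID the bucket uses SSE-KMS, which adds a key to manage.
    """
    if kms_key_id is None:
        default = {"SSEAlgorithm": "AES256"}
    else:
        default = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": default}]}


print(default_encryption_config())
```

The SSE-S3 variant needs no key policy, rotation settings, or grants, which is why it carries the least operational overhead of the listed options.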

355

Multi-Selectmedium

A company uses Amazon DynamoDB for a gaming application. The application experiences throttling during peak hours. The table's read and write capacity is provisioned. Which TWO actions can reduce throttling?

Select 2 answers

A.Enable TTL (time to live) on the table to automatically delete old items

B.Enable DynamoDB auto scaling for the table

C.Increase the provisioned read capacity units (RCUs)

D.Implement DynamoDB Accelerator (DAX) to cache read requests

E.Add a DynamoDB Global Table for the table

AnswersB, D

Auto scaling adjusts provisioned capacity based on traffic.

Why this answer

DynamoDB auto scaling (Option B) automatically adjusts the provisioned read and write capacity based on actual traffic patterns, preventing throttling during peak hours without manual intervention. This is the correct action because it dynamically increases capacity when demand spikes and reduces it during low traffic, directly addressing the throttling issue.

Exam trap

The trap here is that candidates often confuse increasing provisioned capacity (Option C) as the only solution, but the exam tests whether you understand that auto scaling (Option B) is the correct managed approach, and that DAX (Option D) can reduce read throttling by caching, making both B and D valid together.

356

Multi-Selectmedium

A financial services company is designing a data store for transaction records that must be immutable and auditable. The data must be stored for 7 years. Which AWS services can be combined to meet these requirements? (Choose TWO.)

Select 2 answers

A.Amazon S3 Glacier Deep Archive

B.Amazon S3 with Object Lock enabled

C.Amazon EBS volume with snapshots

D.Amazon RDS with automated backups

E.Amazon DynamoDB with point-in-time recovery

AnswersA, B

Glacier Deep Archive is cost-effective for long-term archival.

Why this answer

Amazon S3 Glacier Deep Archive is correct because it provides the lowest-cost storage for long-term retention of immutable data, with a 7-year lifecycle meeting compliance requirements. Amazon S3 with Object Lock enabled is correct because it enforces a write-once-read-many (WORM) model, preventing records from being deleted or overwritten for a specified retention period, ensuring immutability and auditability.

Exam trap

The trap here is that candidates often confuse backup solutions (like RDS automated backups or DynamoDB PITR) with immutable storage, but backups are deletable and do not enforce WORM, whereas S3 Object Lock provides true immutability required for audit compliance.
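
A default retention rule for the WORM requirement might look like the sketch below. It uses the dictionary shape of boto3's `put_object_lock_configuration` call; choosing COMPLIANCE mode rather than GOVERNANCE mode is an assumption based on the immutability requirement, not something the question states.

```python
RETENTION_YEARS = 7

# Default retention applied to every new object version in the bucket.
object_lock_configuration = {
    "ObjectLockEnabled": "Enabled",
    "Rule": {
        "DefaultRetention": {
            "Mode": "COMPLIANCE",  # assumption: retention cannot be shortened by any user
            "Years": RETENTION_YEARS,
        }
    },
}

print(object_lock_configuration["Rule"]["DefaultRetention"])
```

Object Lock governs whether a version can be deleted or overwritten, while the storage class (such as Glacier Deep Archive) governs cost, so the two settings are independent and can be combined.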

357

Multi-Selecteasy

A data engineer is migrating an on-premises Microsoft SQL Server database to Amazon RDS for SQL Server. The database is 2 TB in size and has a 4-hour maintenance window. The company needs to minimize downtime and ensure data consistency. Which TWO methods should the engineer use? (Choose TWO.)

Select 2 answers

A.Use AWS Database Migration Service (AWS DMS) with ongoing replication to minimize downtime.

B.Use SQL Server Management Studio (SSMS) export wizard to transfer data.

C.Take a native backup of the on-premises database and restore it to RDS.

D.Use AWS Schema Conversion Tool (AWS SCT) to convert the schema and migrate data.

E.Export the database to CSV files and use BULK INSERT to load into RDS.

AnswersA, C

DMS can perform a full load and then replicate changes, reducing downtime.

Why this answer

AWS DMS with ongoing replication (change data capture) is correct because it allows continuous synchronization from the on-premises SQL Server to Amazon RDS for SQL Server, minimizing downtime by keeping the target database up-to-date until the final cutover. This approach ensures data consistency by capturing and applying ongoing changes without requiring a long outage window.

Exam trap

The trap here is that candidates often assume native backup/restore alone is sufficient for minimal downtime, forgetting that it only handles the initial data load and does not capture changes made during the backup window without additional replication.

358

MCQhard

A company uses Amazon DynamoDB with on-demand capacity for a gaming leaderboard. The table has 100 GB of data and receives 10,000 write requests per second with spikes to 50,000. The application experiences throttling during spikes. Which action should be taken to reduce throttling without changing the application?

A.Write data to Amazon S3 and use S3 Select

B.Increase the provisioned read capacity units

C.Switch to provisioned capacity with Auto Scaling

D.Enable DynamoDB Accelerator (DAX)

AnswerD

DAX can offload read traffic and reduce write throttling.

Why this answer

Option D is correct because DynamoDB Accelerator (DAX) provides in-memory caching to absorb read spikes and reduce write throttling by offloading reads. Option B is wrong because increasing read capacity does not help write throttling. Option C is wrong because Auto Scaling is not available with on-demand mode. Option A is wrong because S3 is not suitable for real-time writes.

359

MCQhard

A company runs a transactional database on Amazon RDS for PostgreSQL with Multi-AZ deployment. The database size is 2 TB and experiences moderate write load. The company recently enabled RDS Performance Insights and noticed a high number of 'TupleLock' wait events during peak hours. The development team reports that a batch update job runs every hour, updating millions of rows in a large table. The job takes longer than expected. The DBA suspects that excessive row-level locking is causing contention. The team wants to minimize lock contention without changing the application code. Which solution should be implemented?

A.Tune the autovacuum settings (e.g., autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold) to run more frequently and aggressively.

B.Increase the RDS instance size to a larger instance class with more vCPUs and memory.

C.Enable RDS Proxy to manage database connections and reduce connection overhead.

D.Implement table partitioning using the pg_partman extension to split the large table into smaller partitions.

AnswerA

This reduces dead tuple accumulation, which lowers lock contention and improves the batch job performance.

Why this answer

Option A (autovacuum tuning) is correct because frequent updates generate dead tuples, leading to increased lock contention. Tuning autovacuum ensures timely cleanup, reducing lock waits. Option B (increasing RDS instance size) may alleviate CPU/memory pressure but does not address lock contention. Option C (enabling RDS Proxy) helps with connection pooling, not lock contention. Option D (using pg_partman for partitioning) reduces row-level contention by splitting the table but requires code changes to queries, which is prohibited by the stem.

360

Multi-Selecthard

A data engineer is setting up an Amazon Redshift cluster for a data warehouse. The cluster will store historical sales data and support complex analytical queries. To optimize query performance and manage storage, the engineer needs to choose appropriate distribution styles and sort keys for a large fact table 'sales' and several dimension tables. Which TWO of the following design decisions are BEST practices?

Select 2 answers

A.Use interleaved sort keys on columns that are frequently used in filter predicates (e.g., date, region, product).

B.Use EVEN distribution for the fact table 'sales' to ensure an even data distribution across all nodes.

C.Use ALL distribution for the 'sales' fact table to replicate data to every node and avoid data movement.

D.Use a compound sort key with the most frequently filtered column first.

E.Choose AUTO distribution style for all tables and let Amazon Redshift automatically assign distribution.

AnswersA, B

Interleaved sort keys improve performance for queries filtering on multiple columns.

Why this answer

Option A is correct because interleaved sort keys in Amazon Redshift give equal weight to each column in the sort key, making them ideal for queries with filter predicates on multiple columns (e.g., date, region, product). This design optimizes zone maps and minimizes the amount of data scanned, significantly improving query performance for complex analytical workloads on large fact tables.

Exam trap

The trap here is that candidates often confuse EVEN distribution as a universal best practice for all fact tables, overlooking that KEY distribution on the join column is superior for star schema joins, and they may also incorrectly assume ALL distribution is suitable for large fact tables due to its join performance benefits, ignoring the prohibitive storage and write costs.

361

MCQmedium

A company runs an Amazon RDS for PostgreSQL database for its e-commerce platform. The application team reports that write-intensive workloads are causing high latency and the database is experiencing storage bottlenecks. The database currently uses General Purpose SSD (gp2) storage. Which action would be MOST effective in improving write performance without changing the database instance class?

A.Create a read replica and offload writes to it.

B.Switch the storage type to Provisioned IOPS SSD (io1).

C.Enable Multi-AZ deployment for high availability.

D.Change the storage type to General Purpose SSD (gp3).

AnswerD

gp3 offers higher baseline IOPS and throughput than gp2, improving write performance.

Why this answer

D is correct because gp3 decouples performance from capacity: it includes a baseline of at least 3,000 IOPS regardless of volume size, whereas gp2 delivers 3 IOPS per GiB and relies on burst credits to reach 3,000 IOPS on smaller volumes. gp3 also lets you provision additional IOPS and throughput independently, without increasing storage or changing the instance class, which directly addresses the write-intensive workload's high latency and storage bottleneck.

Exam trap

The trap here is that candidates often assume Provisioned IOPS (io1) is always the best choice for write performance, but the question specifically tests knowledge of gp3's superior baseline performance and cost efficiency for write-intensive workloads without requiring an instance class change.

How to eliminate wrong answers

Option A is wrong because a read replica cannot offload writes; it only handles read traffic, and writes must still go to the primary database, so it does not reduce write latency or storage bottlenecks. Option B is wrong because, although io1 provides consistent provisioned IOPS, it is significantly more expensive than gp3, and gp3 already lets IOPS and throughput be raised without increasing storage, making it the more cost-effective choice. Option C is wrong because Multi-AZ deployment provides high availability and automatic failover, but it does not improve write performance; in fact, synchronous replication to the standby can add slight latency to writes.
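
The gp2 versus gp3 comparison comes down to simple arithmetic. The sketch below uses the published EBS figures for gp2 (3 IOPS per GiB, floor of 100, ceiling of 16,000) and the 3,000 IOPS gp3 baseline; RDS applies its own thresholds that vary by engine and allocated storage, so treat the numbers as illustrative rather than as RDS limits.

```python
GP3_BASELINE_IOPS = 3_000  # gp3 baseline, independent of volume size


def gp2_baseline_iops(size_gib):
    """gp2 baseline scales with size: 3 IOPS per GiB, between 100 and 16,000."""
    return max(100, min(3 * size_gib, 16_000))


for size_gib in (100, 500, 1_000):
    print(size_gib, "GiB: gp2 =", gp2_baseline_iops(size_gib), "gp3 =", GP3_BASELINE_IOPS)
```

A gp2 volume needs to reach 1,000 GiB before its baseline matches what gp3 provides at any size, which is why smaller write-heavy databases tend to benefit most from the switch.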

362

MCQeasy

A data engineer needs to store JSON documents that are frequently read and written by a web application. The data has a flexible schema and requires low-latency queries on primary key lookups. Which AWS service is MOST suitable?

A.Amazon Redshift

B.Amazon S3

C.Amazon DynamoDB

D.Amazon RDS for MySQL

AnswerC

DynamoDB provides single-digit millisecond performance for key-value lookups and supports flexible schemas.

Why this answer

Option C is correct because DynamoDB is a NoSQL database designed for low-latency key-value lookups with a flexible schema. Option D is wrong because RDS is relational and requires a fixed schema. Option B is wrong because S3 is object storage, not designed for low-latency primary key lookups.

Option A is wrong because Redshift is a data warehouse for analytics, not transactional workloads.

363

MCQmedium

A data engineer needs to store and analyze time-series data from IoT devices. The data volume is 10 GB per day, and the queries are mostly on the most recent 7 days of data. The engineer wants to minimize storage costs while retaining historical data for 1 year. Which combination of AWS services is most cost-effective?

A.Amazon Timestream

B.Amazon DynamoDB with TTL and S3 for archival

C.Amazon Redshift

D.Amazon RDS with MySQL

AnswerA

Timestream is cost-effective for time-series data with automatic storage tiering.

Why this answer

Amazon Timestream is purpose-built for time-series data, offering automatic tiering between in-memory (for recent 7 days) and magnetic stores (for historical data up to 1 year). This matches the query pattern (mostly recent 7 days) and retention requirement (1 year) while minimizing storage costs through its serverless, pay-per-query model. Timestream also supports time-series-specific functions like interpolation and smoothing, making it more efficient than general-purpose databases for this workload.

Exam trap

The trap here is that candidates often choose DynamoDB with TTL and S3 for archival (Option B) because it seems cost-effective, but they overlook the operational complexity of archiving expired items and the extra tooling and latency involved in querying historical data in S3.

How to eliminate wrong answers

Option B (DynamoDB with TTL and S3 for archival) is wrong because DynamoDB is optimized for key-value and document workloads, not time-series analytics; TTL only deletes old data, but querying historical data from S3 requires additional services like Athena or Glue, increasing complexity and latency. Option C (Amazon Redshift) is wrong because Redshift is a columnar data warehouse designed for large-scale analytical queries on structured data, but it is over-provisioned and costly for 10 GB/day of time-series data, and its storage and compute are not optimized for time-series-specific operations like downsampling or retention policies. Option D (Amazon RDS with MySQL) is wrong because RDS is a relational database with fixed storage and compute, leading to higher costs for storing 3.65 TB of historical data (10 GB/day × 365 days) and poor query performance on time-series data without built-in time-series features like automatic retention or partitioning.
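The storage figures in this explanation follow from simple arithmetic. The sketch below uses only the numbers given in the question (10 GB per day, a 7-day hot window, 1-year retention); the variable names are illustrative.

```python
DAILY_INGEST_GB = 10   # from the question
HOT_WINDOW_DAYS = 7    # queries mostly hit the most recent 7 days
RETENTION_DAYS = 365   # historical data kept for 1 year

hot_gb = DAILY_INGEST_GB * HOT_WINDOW_DAYS
total_gb = DAILY_INGEST_GB * RETENTION_DAYS
cold_gb = total_gb - hot_gb
hot_share = hot_gb / total_gb

print(f"hot tier: {hot_gb} GB, cold tier: {cold_gb} GB, hot share: {hot_share:.1%}")
```

Only a small fraction of the retained data sits in the frequently queried window, which is why tiering the rest to cheaper storage matters for cost.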

364

MCQeasy

A company is using Amazon S3 for data lake storage. They need to query the data directly using SQL without loading it into a database. Which AWS service should be used?

A.Amazon Redshift Spectrum

B.Amazon Athena

C.Amazon EMR

D.AWS Glue

AnswerB

Athena is a serverless query service for S3 data using SQL.

Why this answer

Amazon Athena is the correct choice because it is a serverless, interactive query service that lets you analyze data directly in Amazon S3 using standard SQL, without loading or transforming the data into a database. Athena is built on the Presto/Trino query engine and supports structured, semi-structured, and unstructured data formats (e.g., CSV, JSON, Parquet, ORC) stored in S3, making it ideal for ad-hoc SQL queries on a data lake.

Exam trap

The trap here is that candidates often confuse AWS Glue's data cataloging and ETL capabilities with direct SQL querying, or they assume Redshift Spectrum is a standalone service rather than a feature requiring an existing Redshift cluster, leading them to pick a wrong answer that requires additional infrastructure or is not a query engine.

How to eliminate wrong answers

Option A is wrong because Amazon Redshift Spectrum is a feature of Amazon Redshift that allows querying data in S3 from within a Redshift data warehouse, but it requires an existing Redshift cluster and is not a standalone service for directly querying S3 data without a database. Option C is wrong because Amazon EMR is a big data platform that uses frameworks like Apache Spark, Hive, or Presto for querying S3 data, but it requires provisioning and managing clusters, which adds complexity and is not a serverless SQL-only solution. Option D is wrong because AWS Glue is a serverless data integration service primarily used for ETL (extract, transform, load) jobs and data cataloging, not for directly querying S3 data with SQL; while it can prepare data for Athena, it is not a query engine itself.

365

MCQeasy

A company wants to use Amazon Redshift Spectrum to query data in Amazon S3. The data is in Parquet format and partitioned by date. Which step is required to enable Redshift Spectrum?

A.Load the data into Redshift tables using the COPY command.

B.Create an external schema and external table in the AWS Glue Data Catalog.

C.Create a separate Redshift Spectrum cluster.

D.Copy the data from S3 to Redshift-managed storage.

AnswerB

Redshift Spectrum uses the Glue Data Catalog to query data in S3.

Why this answer

Option B is correct because Redshift Spectrum requires an external schema and external table defined in the AWS Glue Data Catalog or an external Hive metastore. Option D is wrong because the data is already in S3 and Spectrum queries it in place. Option A is wrong because loading data into Redshift is not required for Spectrum.

Option C is wrong because Spectrum does not require a separate cluster; it runs from the existing Redshift cluster.
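The external schema and table setup can be sketched as the SQL statements an engineer might run from Redshift. This is a hedged illustration: the helper name, schema, table, columns, role ARN, and S3 path are all hypothetical, and the statements are only generated as strings here, not executed.

```python
from typing import List


def spectrum_ddl(schema: str, catalog_db: str, iam_role_arn: str, s3_path: str) -> List[str]:
    """Build illustrative DDL for a date-partitioned Parquet external table."""
    return [
        # Map a Redshift external schema onto a Glue Data Catalog database.
        f"CREATE EXTERNAL SCHEMA {schema} FROM DATA CATALOG "
        f"DATABASE '{catalog_db}' IAM_ROLE '{iam_role_arn}' "
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;",
        # Define the external table over the S3 prefix (hypothetical columns).
        f"CREATE EXTERNAL TABLE {schema}.events (event_id BIGINT, payload VARCHAR(256)) "
        f"PARTITIONED BY (event_date DATE) STORED AS PARQUET LOCATION '{s3_path}';",
        # Register one partition; each date partition must be added to the catalog.
        f"ALTER TABLE {schema}.events ADD IF NOT EXISTS "
        f"PARTITION (event_date='2024-01-01') LOCATION '{s3_path}event_date=2024-01-01/';",
    ]


statements = spectrum_ddl(
    "spectrum",
    "example_catalog_db",
    "arn:aws:iam::111122223333:role/ExampleSpectrumRole",
    "s3://example-bucket/events/",
)
for stmt in statements:
    print(stmt)
```

No COPY or data movement appears anywhere in the sketch, which mirrors why options A and D are unnecessary for Spectrum.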

366

MCQeasy

A company needs to store files that are accessed by multiple EC2 instances in a VPC. The files must be concurrently accessible and durable. Which storage solution should the data engineer choose?

A.Amazon EC2 instance store

B.Amazon Simple Storage Service (Amazon S3)

C.Amazon Elastic Block Store (Amazon EBS)

D.Amazon Elastic File System (Amazon EFS)

AnswerD

EFS provides a shared, durable file system for EC2 instances.

Why this answer

Amazon EFS provides a fully managed, scalable, and elastic NFS file system that can be concurrently accessed by multiple EC2 instances across multiple Availability Zones. It is designed for high durability (11 nines of durability) and automatically replicates data across multiple AZs within a region, meeting the requirements for concurrent access and durability.

Exam trap

The trap here is that candidates often confuse Amazon EBS Multi-Attach with a general-purpose shared file system, but EBS Multi-Attach is limited to specific io1/io2 volumes, requires cluster-aware applications, and does not provide the POSIX file system semantics or cross-AZ durability that EFS offers.

How to eliminate wrong answers

Option A is wrong because EC2 instance store provides ephemeral block storage that is physically attached to the host; it is not durable (data is lost on instance stop/termination) and cannot be shared concurrently across multiple EC2 instances. Option B is wrong because Amazon S3 is an object storage service, not a file system; it does not support standard file-level locking or NFS/SMB protocols required for concurrent file access from multiple EC2 instances without additional gateways or software. Option C is wrong because Amazon EBS provides block-level storage volumes that can only be attached to a single EC2 instance at a time (except for multi-attach EBS io1/io2 volumes, which are limited to specific instance types and have strict constraints, not a general solution for concurrent file access).

367

MCQeasy

A data engineer needs to store log files from multiple applications in a central S3 bucket. The logs must be stored cost-effectively for long-term retention (7 years). The logs are accessed infrequently after the first 30 days. Which storage class should the engineer use for objects older than 30 days?

A.S3 Glacier Deep Archive

B.S3 Standard

C.S3 One Zone-IA

D.S3 Standard-IA

AnswerD

Standard-IA is for infrequently accessed data with lower storage cost.

Why this answer

D is correct because S3 Standard-IA (Infrequent Access) is designed for data accessed less frequently but requires rapid access when needed, with a lower storage cost than S3 Standard and a 30-day minimum storage duration charge. After the first 30 days, logs are infrequently accessed, making Standard-IA the most cost-effective option that still provides millisecond first-byte latency for occasional retrieval needs over the 7-year retention period.

Exam trap

AWS often tests the misconception that any 'infrequent access' scenario automatically requires Glacier or Deep Archive, but the trap here is that the logs still need millisecond retrieval latency for occasional access, which Standard-IA provides while Glacier classes do not.

How to eliminate wrong answers

Option A is wrong because S3 Glacier Deep Archive is intended for data accessed at most once or twice per year with retrieval times of 12–48 hours, which is too slow for logs that may need occasional access within minutes after the first 30 days. Option B is wrong because S3 Standard is designed for frequently accessed data with no minimum storage duration, leading to higher costs for long-term retention of infrequently accessed logs. Option C is wrong because S3 One Zone-IA stores data in a single Availability Zone, which does not provide the durability and availability needed for critical log files that must survive an AZ failure, and it also has a 30-day minimum storage charge.
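A lifecycle rule is the usual way to move objects to Standard-IA once they are 30 days old. The sketch below builds a configuration in the shape accepted by S3's lifecycle API; the rule ID and prefix are hypothetical, and expiring objects after 7 years is an assumption, since the question only states the retention period.

```python
import json

SEVEN_YEARS_DAYS = 7 * 365  # 2555 days; ignores leap days (assumption)

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "logs-to-standard-ia",      # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},    # hypothetical key prefix
            # Move objects to Standard-IA once they are 30 days old.
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            # Assumed: delete objects once the 7-year retention has elapsed.
            "Expiration": {"Days": SEVEN_YEARS_DAYS},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

The 30-day transition lines up with both the access pattern in the question and the Standard-IA minimum storage duration.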

368

MCQmedium

A company stores financial data in Amazon RDS for MySQL. They need to retain backups for 7 years to meet compliance. Which backup strategy meets this requirement?

A.Use read replicas to retain data

B.Take daily manual snapshots and delete after 7 years

C.Enable automated backups with a retention period of 7 years

D.Use the AWS Backup service with a 7-year retention policy

AnswerD

AWS Backup can manage snapshots with long retention.

Why this answer

AWS Backup is the correct service for long-term retention of RDS snapshots beyond the 35-day limit of automated backups. It allows you to create backup plans with retention policies up to 100 years, making it suitable for the 7-year compliance requirement. Manual snapshots can also be retained indefinitely, but AWS Backup provides centralized management and lifecycle policies.

Exam trap

The trap here is that candidates may assume automated backups can be configured for long retention periods, but AWS enforces a hard 35-day limit. Manual snapshots do persist beyond that limit, but they must be created and cleaned up by hand, which makes AWS Backup the managed option for multi-year retention.

How to eliminate wrong answers

Option A is wrong because read replicas are used for read scaling and disaster recovery, not for backup retention; they do not provide point-in-time recovery or long-term retention. Option B is wrong because while manual snapshots can be retained indefinitely, taking daily manual snapshots is operationally inefficient and error-prone, and AWS Backup offers a more automated and managed solution with lifecycle policies. Option C is wrong because Amazon RDS automated backups have a maximum retention period of 35 days, which cannot be extended to 7 years.

369

MCQhard

A company runs a production Amazon Redshift cluster with a 5-node ra3.4xlarge configuration. The data engineer observes that write operations are failing with 'Disk Full' errors on some nodes. The cluster has not reached its total capacity. What should the engineer do to resolve this issue?

A.Recreate the table with a different distribution style to avoid data skew.

B.Change the sort keys to distribute data evenly.

C.Enable compression on all tables.

D.Add more nodes to the cluster.

AnswerA

Choosing an appropriate DISTKEY distributes data evenly across nodes.

Why this answer

Redshift distributes data across nodes, but if the distribution is skewed, some nodes can run out of disk space while the cluster as a whole still has capacity. Recreating the table with a different distribution style (e.g., DISTKEY on a high-cardinality column) can balance the data. Option A is correct.

Option D: adding more nodes increases capacity but does not fix skew. Option C: enabling compression reduces storage but may not fix skew. Option B: changing the sort key affects query performance, not how data is distributed across nodes.
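The effect of key cardinality on skew can be illustrated with a toy simulation. This is not how Redshift hashes rows internally; it assumes a simple deterministic hash modulo the node count, with made-up column values, purely to show why a low-cardinality distribution key can fill one node while others stay nearly empty.

```python
import zlib
from collections import Counter

NODES = 5        # matches the 5-node cluster in the question
ROWS = 100_000   # arbitrary sample size


def node_for(key: str) -> int:
    """Toy stand-in for hash distribution: deterministic hash modulo node count."""
    return zlib.crc32(key.encode()) % NODES


def skew_ratio(keys) -> float:
    """Rows on the fullest node divided by the mean rows per node."""
    counts = Counter(node_for(k) for k in keys)
    return max(counts.values()) / (ROWS / NODES)


# Low cardinality: two status values, 90% of rows share one of them.
low_cardinality = ["shipped" if i % 10 else "pending" for i in range(ROWS)]
# High cardinality: a distinct key per row.
high_cardinality = [f"order-{i}" for i in range(ROWS)]

skew_low = skew_ratio(low_cardinality)
skew_high = skew_ratio(high_cardinality)
print(f"low-cardinality key skew:  {skew_low:.2f}")
print(f"high-cardinality key skew: {skew_high:.2f}")
```

A ratio near 1 means rows are spread evenly; a ratio of 4.5 or more on five nodes means one node holds almost all the data, which is the situation that produces "Disk Full" on some nodes only.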

370

MCQeasy

A data engineering team is using AWS Glue to catalog data in an S3 data lake. They have a Glue crawler that runs daily to update the Data Catalog. Recently, they noticed that the crawler is taking longer to run and sometimes fails because of a timeout. The team suspects the issue is due to the large number of small files in the S3 bucket. They need to improve crawler performance and reliability. Which solution should they implement?

A.Configure the crawler to use a different classifier.

B.Use AWS Glue ETL to consolidate small files into larger ones before crawling.

C.Increase the crawler timeout to 24 hours.

D.Schedule the crawler to run more frequently to avoid large data accumulation.

AnswerB

Reduces number of files to scan.

Why this answer

Option B is correct because consolidating small files into larger ones (e.g., using AWS Glue ETL with a groupFiles or groupSize option, or a separate compaction job) reduces the number of objects the crawler must list and sample. This directly addresses the root cause: a high volume of small files increases metadata operations and can cause crawler timeouts. By reducing file count, the crawler can complete within the default 24-hour timeout and avoid failures.

Exam trap

The trap here is that candidates assume increasing the timeout or running the crawler more frequently will fix performance issues, but the real bottleneck is the sheer number of small files, which requires data compaction to resolve.

How to eliminate wrong answers

Option A is wrong because changing the classifier affects how the crawler interprets data format (e.g., JSON vs. Parquet), not the number of files or the performance bottleneck caused by small files. Option C is wrong because increasing the timeout to 24 hours does not solve the underlying issue of excessive small files; the crawler may still fail due to resource limits or S3 request throttling, and the default timeout is already 24 hours.

Option D is wrong because running the crawler more frequently would only accumulate more small files over time, worsening the problem and increasing the likelihood of timeouts.

371

MCQhard

A company runs an Amazon RDS for MySQL database. The database experiences high write latency during peak hours. The data engineer notices that the WriteIOPS metric is consistently at the provisioned limit. Which action would most effectively reduce write latency without increasing costs?

A.Enable Multi-AZ deployment

B.Increase the provisioned IOPS on the existing RDS instance

C.Add a read replica to offload read traffic

D.Migrate to Amazon Aurora MySQL with appropriate instance size

AnswerD

Aurora's distributed storage can handle higher write throughput with lower latency and cost.

Why this answer

Migrating to Amazon Aurora MySQL with an appropriate instance size reduces write latency because Aurora’s distributed storage architecture is designed for higher write throughput than standard MySQL on RDS (AWS cites up to five times the throughput), without requiring additional IOPS provisioning. Aurora automatically scales storage I/O and uses a 6-replica quorum-based write model, which eliminates the bottleneck of hitting a fixed IOPS limit while keeping costs comparable to or lower than provisioned IOPS on RDS.

Exam trap

The trap here is that candidates often assume increasing provisioned IOPS (Option B) is the only way to fix write latency, overlooking that Aurora’s pay-per-request I/O model can provide higher throughput without a fixed cost increase, and that Multi-AZ (Option A) is a common distractor because it sounds like it improves performance but actually targets availability.

How to eliminate wrong answers

Option A is wrong because enabling Multi-AZ deployment provides high availability through synchronous standby replication, but it does not increase write throughput or reduce write latency; in fact, it can slightly increase write latency due to the synchronous commit to the standby. Option B is wrong because increasing provisioned IOPS directly increases costs, as you pay for the provisioned IOPS regardless of usage, and the question explicitly asks to reduce write latency without increasing costs. Option C is wrong because adding a read replica offloads read traffic, which does nothing to address write latency caused by hitting the WriteIOPS limit; write operations still hit the same primary instance with the same IOPS ceiling.

372

Multi-Selecteasy

A data engineer is setting up Amazon S3 bucket policies for a data lake. Which TWO statements are true regarding S3 bucket policies? (Choose TWO.)

Select 2 answers

A.Bucket policies can grant access to accounts in other AWS Organizations

B.Bucket policies are the only way to control access to S3

C.Bucket policies can be applied to individual objects

D.The Principal element in a bucket policy is optional

E.Bucket policies are written in JSON format

AnswersA, E

Cross-account access can be granted via bucket policies.

Why this answer

Option A is correct because S3 bucket policies can grant cross-account access to principals in other AWS accounts, including accounts in different AWS Organizations, by naming the account or IAM principal in the Principal element; access can also be scoped to an organization with the `aws:PrincipalOrgID` condition key. This supports centralized data lake access management across organizational boundaries, although IAM principals in the other account still need permissions from their own identity-based policies. Option E is correct because bucket policies are JSON documents.

Exam trap

The trap here is that candidates often confuse bucket policies with IAM policies, mistakenly thinking the Principal element is optional in bucket policies (it is required), or that bucket policies can target individual objects (they cannot; they use prefix or tag conditions instead).
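A bucket policy illustrating these points can be sketched as a JSON document built in Python. The bucket name, account ID, organization ID, and Sid values are placeholders; scoping organization-wide access through the `aws:PrincipalOrgID` condition key is shown here as one common pattern, not the only one.

```python
import json

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowOrgRead",                # hypothetical statement ID
            "Effect": "Allow",
            "Principal": "*",                     # Principal is required in a bucket policy
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-lake/curated/*",
            # Restrict the wildcard principal to one organization (placeholder ID).
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": "o-exampleorgid"}},
        },
        {
            "Sid": "AllowPartnerAccountRead",     # hypothetical statement ID
            "Effect": "Allow",
            # Cross-account grant to a specific account (placeholder account ID).
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-lake/shared/*",
        },
    ],
}

policy_json = json.dumps(bucket_policy, indent=2)
print(policy_json)
```

Every statement carries a Principal, and the whole policy serializes to JSON, which matches the two correct options.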

373

Multi-Selecteasy

A data engineering team is migrating a MySQL database to Amazon RDS for MySQL. They need to ensure high availability and automated failover. Which THREE configurations should they implement?

Select 3 answers

A.Enable Enhanced Monitoring.

B.Enable automated backups with a retention period.

C.Enable Multi-AZ deployment.

D.Configure a DB subnet group with subnets in at least two Availability Zones.

E.Create a read replica in a different region.

AnswersB, C, D

Automated backups enable recovery to any point within retention.

Why this answer

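Option C is the core of the answer: a Multi-AZ deployment maintains a synchronously replicated standby in another Availability Zone and fails over to it automatically. Option D is required because the DB subnet group must span at least two Availability Zones for the standby to be placed. Option B complements these: automated backups with a retention period enable point-in-time recovery (PITR), which protects against data-level failures that failover alone cannot fix. The standby is kept in sync by synchronous replication, not by automated backups.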

Exam trap

The trap here is that candidates often confuse read replicas (which are for read scaling and manual promotion) with Multi-AZ standby instances (which provide automatic failover), leading them to incorrectly select a cross-region read replica as a high-availability solution.

374

MCQeasy

A company uses Amazon S3 as its data lake. A data engineer needs to enforce encryption of data at rest using server-side encryption with AWS KMS. Which S3 bucket property should be configured?

A.Default encryption

B.Server access logging

C.Versioning

D.Bucket policy

AnswerA

Default encryption enforces SSE-KMS on all objects.

Why this answer

Option A is correct because configuring default encryption on an S3 bucket ensures that all objects stored in the bucket are encrypted at rest using server-side encryption. When AWS KMS is specified as the encryption type, S3 automatically encrypts objects with a KMS key (SSE-KMS) upon upload, even if the upload request does not include encryption headers. This enforces encryption at rest without requiring changes to client applications.

Exam trap

The trap here is that candidates often confuse bucket policies (which can enforce encryption conditions) with default encryption (which actually applies encryption), leading them to select bucket policy as the answer when the question asks for the property that enforces encryption of data at rest.

How to eliminate wrong answers

Option B is wrong because server access logging records requests made to the bucket for auditing purposes, but it does not enforce or configure encryption of data at rest. Option C is wrong because versioning preserves, retrieves, and restores every version of every object in the bucket, but it has no effect on encryption settings. Option D is wrong because a bucket policy can deny unencrypted uploads using a condition key like `s3:x-amz-server-side-encryption`, but it does not itself configure the encryption mechanism; it only enforces a policy requirement, whereas default encryption directly applies encryption to all objects.
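The difference between default encryption and a bucket policy can be sketched side by side. Both structures below are illustrative: the KMS key alias and bucket name are placeholders, the first dict follows the shape of an S3 default-encryption configuration, and the second is a single bucket policy statement.

```python
import json

# Default encryption: S3 applies SSE-KMS to new objects, even without headers.
default_encryption = {
    "Rules": [
        {
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/example-data-lake-key",  # placeholder alias
            },
            "BucketKeyEnabled": True,
        }
    ]
}

# Bucket policy statement: rejects uploads that do not request SSE-KMS.
# It encrypts nothing by itself; it only denies non-conforming requests.
deny_non_kms_uploads = {
    "Sid": "DenyNonKmsUploads",  # hypothetical statement ID
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::example-data-lake/*",
    "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}},
}

print(json.dumps(default_encryption, indent=2))
print(json.dumps(deny_non_kms_uploads, indent=2))
```

One design note: a deny statement written this way would also be expected to reject uploads that omit the encryption header entirely, so it changes client behavior, whereas default encryption does not.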

375

MCQeasy

Refer to the exhibit. A data engineer creates an Amazon Redshift table with the above DDL. The engineer runs a query to find all orders for a specific customer within a date range. Which statement about query performance is correct?

A.The query will be inefficient because the distribution key is not the same as the sort key.

B.The table should use DISTSTYLE EVEN to improve performance.

C.The query will benefit from both the distribution key and the sort key to minimize data scanned.

D.The sort key will not help because the query filters on customer_id first.

AnswerC

Distribution reduces data movement, sort key reduces data scanned.

Why this answer

Option C is correct because the DISTKEY on customer_id and the SORTKEY on order_date together optimize the query. Option A is wrong because the distribution key and sort key do not need to match; the query benefits from both. Option D is wrong because the sort key still helps with the date-range filter.

Option B is wrong because distribution is already key-based.
