
CCNA ML Data Engineering Questions

75 of 374 questions · Page 1/5 · ML Data Engineering topic · Answers revealed


1

MCQhard

A company uses Amazon EMR with Spark to process data daily. The job reads from S3 and writes to S3. Recently, the job started failing with 'S3AccessDenied' errors. The IAM role used by EMR has not changed. What is the MOST likely cause?

A.The EMR cluster's security group blocks outbound traffic

B.The S3 bucket policy was updated to deny access to the EMR role

C.The S3 bucket was deleted and recreated

D.The EMR service role was rotated

AnswerB

Bucket policies can deny access even if IAM allows.

Why this answer

S3 bucket policies can be changed independently and may block access. IAM policies are not the only factor; bucket policies also apply. The role itself hasn't changed, but the bucket policy might have been updated to deny access.

Other options are less likely because the role hasn't changed and the errors are access-related.
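The interaction between the IAM policy and the bucket policy can be sketched in a few lines. This is a deliberately simplified model, assuming only two policy sources and exact-match principals and actions; the role name, statement layout, and function name are hypothetical and are not the AWS policy engine.

```python
def is_allowed(statements, principal, action):
    """Simplified evaluation: an explicit Deny beats any Allow; no match means denied."""
    matching = [
        s for s in statements
        if principal in s["principals"] and action in s["actions"]
    ]
    if any(s["effect"] == "Deny" for s in matching):
        return False  # explicit Deny always wins
    return any(s["effect"] == "Allow" for s in matching)


# Hypothetical IAM policy attached to the EMR role (unchanged over time).
iam_policy = [
    {"effect": "Allow", "principals": {"emr-role"},
     "actions": {"s3:GetObject", "s3:PutObject"}},
]

# Hypothetical bucket policy statement added later by someone else.
bucket_policy_deny = [
    {"effect": "Deny", "principals": {"emr-role"},
     "actions": {"s3:GetObject", "s3:PutObject"}},
]

before = is_allowed(iam_policy, "emr-role", "s3:GetObject")
after = is_allowed(iam_policy + bucket_policy_deny, "emr-role", "s3:GetObject")
print(before, after)  # True False
```

The role's own policy is identical in both calls; only the bucket-side statement changes the outcome.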


2

MCQeasy

A data engineer is building a data pipeline that ingests streaming data from IoT devices. The data must be processed in near real-time and stored in Amazon S3 for further analysis. Which AWS service should be used to capture and process the streaming data before storing it in S3?

A.Use Amazon S3 with S3 Event Notifications to trigger AWS Lambda for processing.

B.Use AWS Glue to perform ETL on the streaming data.

C.Use Amazon Kinesis Data Streams to capture the data and Amazon Kinesis Data Firehose to deliver it to S3.

D.Use Amazon Simple Queue Service (SQS) to buffer the data and then process it with AWS Lambda.

AnswerC

Kinesis Data Streams ingests real-time data and Kinesis Data Firehose delivers it to S3.

Why this answer

Amazon Kinesis Data Streams is designed for real-time data ingestion and can be integrated with Lambda for processing. Kinesis Data Firehose can then load the data into S3. Option A is wrong because S3 is a storage service, not a streaming ingestion service.

Option B is wrong because AWS Glue is for ETL and cataloging, not real-time streaming. Option D is wrong because SQS is a message queue, not optimized for streaming analytics.


3

MCQhard

A company is running a data pipeline that uses Amazon EMR with Spark to process 100 TB of data daily. The pipeline must complete within 6 hours. Currently, it takes 8 hours. Which optimization will most likely reduce the runtime?

A.Consolidate small input files into fewer larger files

B.Enable EMR Managed Scaling

C.Increase the memory of each node by using r5 instances

D.Use Spot Instances for all core nodes

AnswerB

Managed Scaling dynamically adds resources to meet deadlines.

Why this answer

Option B is correct because enabling EMR Managed Scaling automatically adjusts cluster resources based on workload, which can reduce runtime. Option D is wrong because using Spot Instances may cause interruptions; Option C is wrong because more memory per node may not help if the bottleneck is parallelism; Option A is wrong because consolidating data into fewer files can reduce overhead but may not be the main issue.


4

Multi-Selecthard

A company is building a real-time anomaly detection system for network traffic logs. The logs are ingested via Amazon Kinesis Data Streams and processed with an Amazon SageMaker endpoint for inference. The team needs to ensure that the inference results are stored durably and can be replayed for model retraining. The system must handle at least 10,000 records per second with low latency. Which three AWS services should the team use to build this architecture? (Select THREE.)

Select 3 answers

A.AWS Glue ETL

B.Amazon Kinesis Data Streams

C.Amazon Kinesis Data Analytics for Apache Flink

D.Amazon Kinesis Data Firehose

E.Amazon SageMaker

AnswersB, C, E

Kinesis Data Streams provides the ingestion layer with low latency and high throughput.

Why this answer

Amazon Kinesis Data Streams is the correct ingestion layer because it provides durable, real-time data streaming with the ability to handle over 10,000 records per second. It acts as the source of truth for network traffic logs, enabling low-latency processing and replay for model retraining.

Exam trap

The trap here is that candidates often confuse Kinesis Data Firehose with Kinesis Data Streams, assuming Firehose's simplicity and S3 integration make it suitable for real-time inference, but Firehose lacks the record-level replay and low-latency processing required for this use case.


5

MCQhard

A company uses Amazon Redshift as a data warehouse. They need to load 50 TB of clickstream data from S3 into Redshift daily. The data arrives in 5-minute intervals as gzipped CSV files. The target table has a sort key and a distribution key. The load must complete within 2 hours. Which approach is MOST efficient?

A.Use AWS Glue to transform the data and write to Redshift using JDBC.

B.Use a staging table and then merge using a stored procedure.

C.Use a series of INSERT statements from a Lambda function.

D.Use the COPY command with a manifest file and gzip compression.

AnswerD

COPY is optimized for bulk loading from S3.

Why this answer

The COPY command is the most efficient way to load large volumes of data into Amazon Redshift because it uses the cluster's massively parallel processing (MPP) architecture to read data directly from S3 in parallel across all nodes. With a manifest file, you can specify multiple gzipped CSV files, and the gzip compression reduces network I/O and storage overhead. This approach can easily load 50 TB within 2 hours, especially when the target table has a sort key and distribution key, as COPY automatically leverages these for optimal data distribution and sorting during the load.

Exam trap

The trap here is that candidates may overcomplicate the solution by choosing AWS Glue or staging tables, not realizing that Redshift's COPY command is purpose-built for high-speed parallel ingestion from S3 and is the most efficient method for bulk data loads.

How to eliminate wrong answers

Option A is wrong because AWS Glue writing to Redshift via JDBC is a row-by-row or small-batch operation that cannot match the parallel throughput of the COPY command, and it would introduce unnecessary transformation overhead for already-structured CSV data. Option B is wrong because using a staging table and a stored procedure merge adds extra steps and complexity without improving load speed; the COPY command can directly load into the target table with proper sort and distribution keys, making a staging table redundant for this bulk load scenario. Option C is wrong because a series of INSERT statements from a Lambda function would be extremely slow and inefficient for 50 TB of data, as each INSERT is a single-row operation that cannot leverage Redshift's parallel processing, and Lambda has a 15-minute execution timeout that would require complex orchestration to handle the full load.
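As a rough illustration of the manifest-based load, the sketch below builds a manifest document and a COPY statement as plain strings. The bucket, table, and role ARN are placeholders invented for the example, and the helper names are hypothetical; nothing here connects to Redshift.

```python
import json


def build_manifest(s3_urls):
    """Build a COPY manifest: one entry per file, each marked mandatory."""
    return {"entries": [{"url": url, "mandatory": True} for url in s3_urls]}


def build_copy_sql(table, manifest_url, iam_role_arn):
    """Compose a COPY statement for gzipped CSV files listed in a manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CSV GZIP MANIFEST;"
    )


# Placeholder file names for two 5-minute batches.
files = [
    "s3://example-clickstream/2025/01/01/0000.csv.gz",
    "s3://example-clickstream/2025/01/01/0005.csv.gz",
]
manifest = build_manifest(files)
copy_sql = build_copy_sql(
    "clickstream_events",
    "s3://example-clickstream/manifests/batch.manifest",
    "arn:aws:iam::123456789012:role/example-redshift-copy",
)

print(json.dumps(manifest, indent=2))
print(copy_sql)
```

Listing many files in one manifest lets a single COPY spread the work across the cluster instead of issuing one load per file.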


6

MCQhard

A company uses Amazon Kinesis Data Streams to ingest real-time clickstream data from a website. The data is consumed by a Lambda function that writes records to an S3 bucket. Recently, the number of shards was increased from 2 to 4 to handle higher throughput. After the change, the Lambda function started processing records with increased latency and some records were being written out of order. What is the MOST likely cause?

A.The S3 bucket is not configured with versioning, causing overwrites.

B.The Lambda function is reading from the oldest sequence number, causing high IteratorAgeSeconds.

C.The Lambda function’s reserved concurrency is too low for the increased shard count.

D.The partition key used by the producer does not ensure that related records go to the same shard after resharding.

AnswerD

After resharding, the mapping of partition keys to shards changes. If ordering matters, the partition key must be chosen to keep related records together.

Why this answer

Option D is correct because after resharding from 2 to 4 shards, the mapping of partition keys to shards changes. If the producer does not use a partition key that ensures related records (e.g., same user session) are routed to the same shard, records that were previously ordered within a shard may now be split across multiple shards. Since the Lambda consumer processes shards independently, records from the same logical sequence can arrive out of order, and the increased shard count can also cause higher latency if the consumer is not properly parallelized.

Exam trap

The trap here is that candidates often confuse increased shard count with a need for more concurrency (Option C), but the real issue is that resharding changes the partition-to-shard mapping, which can break ordering guarantees unless the producer explicitly handles the new hash range.

How to eliminate wrong answers

Option A is wrong because S3 versioning controls object overwrites and deletions, not the ordering or latency of records written by Lambda; out-of-order writes are caused by upstream data distribution, not S3 configuration. Option B is wrong because the Lambda function reads from the latest sequence number by default when using the Kinesis trigger, not the oldest; high IteratorAgeSeconds would indicate a slow consumer, not a configuration to read from the oldest record. Option C is wrong because reserved concurrency limits the maximum number of concurrent Lambda executions, but the default unreserved concurrency is usually sufficient for 4 shards; low concurrency would cause throttling (e.g., 429 errors), not out-of-order processing.
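How a partition key lands on a shard can be sketched as follows. Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The sketch assumes shards divide the hash space evenly and that the digest is read as a big-endian integer; real streams can have uneven ranges after splits and merges, and the function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # size of the 128-bit hash key space


def hash_key(partition_key: str) -> int:
    """MD5 of the partition key, interpreted as a 128-bit integer (assumed big-endian)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_index(partition_key: str, shard_count: int) -> int:
    """Index of the shard owning this key, assuming evenly split hash ranges."""
    return hash_key(partition_key) * shard_count // HASH_SPACE


for key in ["session-a", "session-b", "session-c"]:
    # The same key may map to a different shard index once the count changes.
    print(key, shard_index(key, 2), shard_index(key, 4))
```

Records that share a partition key always hash to the same value, so they stay together in whichever shard currently owns that value; records with different keys carry no ordering relationship once they are spread across shards.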


7

MCQeasy

A data scientist needs to process a large volume of streaming data from IoT devices and store the results in Amazon S3 for further analysis. Which AWS service is most suitable for ingesting and processing this data in near real-time?

A.Amazon Redshift

B.AWS Glue

C.Amazon Kinesis Data Analytics

D.Amazon EMR

AnswerC

Kinesis Data Analytics processes streaming data in real-time.

Why this answer

Amazon Kinesis Data Analytics is designed for real-time processing of streaming data. AWS Glue is for batch ETL, Amazon EMR for big data processing, and Amazon Redshift for data warehousing.


8

Multi-Selecthard

A data engineering team uses AWS Glue to run ETL jobs. They notice that jobs are taking longer to complete as data volume grows. They want to optimize performance without increasing cost significantly. Which THREE strategies should they consider?

Select 3 answers

A.Remove partitioning from the output

B.Partition the input data in S3

C.Use Amazon EMR instead of Glue

D.Convert input data to columnar format (e.g., Parquet)

E.Increase the number of DPUs (workers)

AnswersB, D, E

Enables parallel processing.

Why this answer

Partitioning the input data allows Glue to process in parallel. Using columnar formats like Parquet reduces I/O. Increasing the number of DPUs (workers) improves parallelism but increases cost; however, it can be cost-effective if job duration decreases significantly.

Removing partitioning from the output would hurt performance. Moving to Amazon EMR is not necessary.


9

MCQmedium

A company is using AWS Glue ETL jobs to process data stored in Amazon S3. The jobs currently run sequentially and take too long. The data engineer wants to reduce job duration without rewriting the code. Which action is most effective?

A.Change the underlying EC2 instance type to a compute-optimized instance

B.Increase the number of DPUs (Data Processing Units) for the job

C.Convert the data from CSV to Parquet format

D.Enable job bookmarks to skip already processed data

AnswerB

More DPUs allow parallel execution, reducing job duration.

Why this answer

Increasing the number of DPUs (Data Processing Units) allocated to the Glue job can parallelize the processing and reduce runtime without code changes. Option A (changing to a compute-optimized instance type) is not applicable because Glue uses DPUs, not EC2 instances. Option C (converting the data to Parquet) may help but is not a direct solution for parallelization.

Option D (enabling job bookmarks) helps with incremental processing but does not speed up the existing job.


10

MCQmedium

A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The solution must handle data that arrives in bursts and must be able to reprocess failed records automatically. Which combination of AWS services should the team use?

A.AWS Glue with Amazon S3

B.Amazon SQS with AWS Lambda

C.Amazon Kinesis Data Streams with AWS Lambda

D.Amazon DynamoDB Streams with AWS Lambda

AnswerC

Kinesis Data Streams can ingest bursty streaming data and retain it for replay; Lambda can process and load to S3.

Why this answer

Option C is correct because Amazon Kinesis Data Streams can ingest high-throughput streaming data and retain it for up to 365 days, allowing reprocessing. AWS Lambda can be used to transform and load the data into S3. Option B is wrong because Amazon SQS is not optimized for streaming ingestion and lacks replay capability.

Option A is wrong because AWS Glue is a batch ETL service, not for streaming. Option D is wrong because Amazon DynamoDB Streams are tied to DynamoDB table changes, not direct IoT ingestion.


11

MCQmedium

A company needs to perform complex transformations on large datasets stored in Amazon S3 using Apache Spark. They want to minimize operational overhead. Which AWS service should they use?

A.Amazon EMR

B.Amazon EC2 with manually configured Spark

C.Amazon Athena

D.AWS Glue

AnswerA

EMR provides managed Spark clusters for complex transformations.

Why this answer

Amazon EMR with Spark is a managed service that reduces operational overhead. AWS Glue is for simpler ETL, not complex Spark transformations. EC2 requires manual setup.

Athena is SQL-based, not Spark.


12

MCQhard

A company uses AWS Glue crawlers to populate the AWS Glue Data Catalog from Amazon S3. The data is partitioned by year/month/day/hour. The crawler runs every hour and adds new partitions. However, the data engineer notices that the crawler is taking longer to run as the number of partitions grows, and sometimes it misses new partitions. What is the most cost-effective and reliable way to address this?

A.Enable the crawler's partition index feature.

B.Manually add new partitions using ALTER TABLE ADD PARTITION in Athena.

C.Use the Athena MSCK REPAIR TABLE command after the crawler runs.

D.Increase the crawler's schedule to run every 30 minutes.

AnswerA

Partition indexes allow the crawler to efficiently discover new partitions without scanning the entire dataset.

Why this answer

Option A is correct because enabling the crawler's partition index feature allows Glue to quickly find new partitions without re-scanning the entire table. Option D is wrong because increasing the crawler's frequency will not help if the crawler is already missing partitions due to scanning overhead. Option B is wrong because manually adding partitions is error-prone and does not scale.

Option C is wrong because Athena MSCK REPAIR TABLE is a manual command that must be run after data arrives; it is not automated.


13

MCQmedium

A company runs a daily ETL job that reads data from Amazon RDS, transforms it using AWS Glue, and writes the results to Amazon S3. The job started failing yesterday with the error: 'Rate exceeded'. What is the most likely cause and solution?

A.The Glue job is using too many DPUs; reduce the number of DPUs

B.The RDS database is overwhelmed by the number of connections; reduce the Glue job's parallelism or increase RDS instance size

C.The S3 bucket has reached its request rate limit; request a limit increase

D.Enable job bookmarks in the Glue job to process only new data

AnswerB

Rate exceeded errors often come from RDS when connection or IO limits are reached.

Why this answer

The 'Rate exceeded' error typically indicates that the job is exceeding the RDS database's maximum connections or IOPS limits. The best solution is to reduce the parallelism or increase the database capacity. Option C (requesting an S3 limit increase) does not address the RDS issue.

Option A (reducing Glue DPUs) attributes the error to the DPU allocation rather than to the database's limits. Option D (enabling Glue job bookmarks) does not solve rate limiting.


14

MCQmedium

A company uses Amazon Kinesis Data Streams to collect IoT sensor data. The stream has 4 shards. A consumer application reads from the stream using the Kinesis Client Library (KCL). The application processes records and stores them in Amazon DynamoDB. Recently, the data volume has increased, and the consumer is falling behind. Which action should the team take to increase the processing throughput?

A.Deploy additional consumer instances using the same application name.

B.Increase the write capacity of the DynamoDB table.

C.Increase the data retention period of the stream to 7 days.

D.Increase the number of shards in the Kinesis stream.

AnswerD

More shards provide more read throughput and allow more parallel consumers.

Why this answer

Option D is correct because increasing the number of shards increases the stream's throughput and allows more concurrent consumers. Option C is wrong because increasing the retention period does not affect throughput. Option A is wrong because adding more KCL workers without increasing shards will cause them to idle.

Option B is wrong because increasing DynamoDB write capacity may reduce throttling but does not increase the consumer's reading throughput.
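The relationship between KCL workers and shards can be sketched as a simple bound: each shard is processed by one worker at a time, so the number of workers doing useful work cannot exceed the shard count. The function names and the worker counts below are hypothetical.

```python
def active_workers(worker_count: int, shard_count: int) -> int:
    """Workers that can hold a shard at once: capped by the number of shards."""
    return min(worker_count, shard_count)


def idle_workers(worker_count: int, shard_count: int) -> int:
    """Workers left without a shard to process."""
    return worker_count - active_workers(worker_count, shard_count)


# With the 4 shards from the question, extra workers beyond 4 have nothing to read.
print(active_workers(6, 4), idle_workers(6, 4))  # 4 2

# After resharding to 8 shards, the same 6 workers are all busy.
print(active_workers(6, 8), idle_workers(6, 8))  # 6 0
```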


15

MCQeasy

A company uses Amazon Redshift for its data warehouse. The data engineering team notices that queries are slow and wants to improve performance without changing the schema. Which action is most likely to improve query performance?

A.Decrease the number of nodes to reduce network overhead.

B.Disable compression on all tables to reduce CPU overhead.

C.Increase the number of nodes in the cluster.

D.Change the distribution style from AUTO to EVEN.

AnswerC

Adding nodes increases parallelism and improves query performance.

Why this answer

Option C is correct because increasing the number of nodes adds compute resources, improving parallel processing. Option D is wrong because changing distribution style would alter the schema. Option A is wrong because decreasing node count reduces resources.

Option B is wrong because disabling compression increases storage and I/O.


16

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be partitioned by year, month, and day. The delivery stream is configured with a buffer interval of 60 seconds and buffer size of 5 MB. The data producer sends about 1 MB per second. The data is arriving in S3 but the partitions are not being created as expected. What is the MOST likely reason?

A.The data is encrypted with AWS KMS and Firehose cannot write to encrypted buckets.

B.The delivery stream does not have dynamic partitioning enabled with the appropriate custom prefix.

C.The buffer interval is too short for the data volume, causing incomplete records.

D.The S3 bucket has versioning enabled, which prevents partitioning.

AnswerB

Without dynamic partitioning and the correct prefix, Firehose will not partition the data by year/month/day.

Why this answer

Option B is correct because Kinesis Data Firehose requires dynamic partitioning to be explicitly enabled and configured with a custom prefix (e.g., 'year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/') to automatically partition data by year, month, and day. Without this setting, Firehose writes all data to a single S3 prefix, ignoring the desired partition structure.

Exam trap

The trap here is that candidates assume simply setting a prefix with date-like placeholders (e.g., 'data/year=2025/') is enough, but Firehose requires explicit dynamic partitioning to be enabled and the prefix must use the correct !{timestamp:...} syntax for automatic date-based partitioning.

How to eliminate wrong answers

Option A is wrong because Firehose can write to KMS-encrypted S3 buckets when the correct IAM permissions and KMS key policies are in place; encryption does not prevent partitioning. Option C is wrong because buffering settings only control when Firehose flushes (at 1 MB/s the 5 MB buffer size is reached after about 5 seconds, well before the 60-second interval), and Firehose buffers complete records, not partial ones. Option D is wrong because S3 versioning does not affect Firehose's ability to write partitioned data; versioning simply maintains multiple versions of objects.
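The buffering numbers in the question can be checked with a little arithmetic. Firehose delivers a batch when either the buffer size or the buffer interval is reached, whichever comes first. The sketch below is an idealized model that assumes a steady input rate and ignores compression and transformation; the function name is hypothetical.

```python
def seconds_until_flush(buffer_size_mb, buffer_interval_s, rate_mb_per_s):
    """Time until the first buffering condition (size or interval) is met."""
    time_to_fill = buffer_size_mb / rate_mb_per_s
    return min(buffer_interval_s, time_to_fill)


# Values from the question: 5 MB buffer, 60 s interval, about 1 MB/s of input.
print(seconds_until_flush(5, 60, 1))  # 5.0

# A much slower hypothetical producer would hit the interval first.
print(seconds_until_flush(5, 60, 0.05))  # 60
```

Either way the flush timing only affects object size and frequency, not the S3 prefix the objects are written under.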


17

MCQmedium

A data engineer is troubleshooting an AWS Glue job that reads from an S3 bucket and writes to another S3 bucket. The job fails with an 'Access Denied' error when trying to write to the output bucket. The IAM policy attached to the Glue service role is shown. What is the MOST likely cause of the failure?

A.The user who runs the job does not have S3 permissions

B.The Glue job role does not have permissions to start a job run

C.The output bucket is not listed in the Resource of the IAM policy

D.The S3 bucket policy denies access to the Glue service

AnswerC

The policy only allows PutObject on example-bucket, not the output bucket.

Why this answer

The policy only allows s3:GetObject and s3:PutObject on 'example-bucket', but the job is writing to a different bucket. The Glue job needs permissions on the output bucket as well. Option B is wrong because the job role has permissions for Glue actions.

Option D is less likely because the missing IAM permission on the output bucket already explains the failure. Option A is wrong because the job runs under the Glue job role, which is separate from the user's permissions.


18

MCQhard

A data scientist needs to run a one-time training job on a 5 TB dataset stored in Amazon S3. The training algorithm requires random access to individual records. Which SageMaker input mode and data format combination would be MOST appropriate?

A.Use Pipe mode with Parquet format

B.Use Pipe mode with RecordIO-Protobuf format

C.Use File mode with RecordIO-Protobuf format

D.Use Pipe mode with CSV format

AnswerC

File mode downloads data to disk, allowing random access; Protobuf is efficient.

Why this answer

Option C is correct because File mode downloads the entire dataset to the instance's local disk, enabling random access to any record. Pipe mode does not support random access. Option A is wrong because Pipe mode streams data sequentially.

Option B is wrong because Pipe mode does not support random access. Option D is wrong because Pipe mode is not suitable for random access.


19

MCQmedium

Refer to the exhibit. A data engineer is creating an IAM policy for an AWS Glue ETL job that reads encrypted objects from an S3 bucket, transforms them, and writes the results back to the same bucket. The bucket uses SSE-KMS encryption with the KMS key specified. The ETL job is failing with an "Access Denied" error when trying to write data. What is the likely cause?

A.The policy is missing the kms:Decrypt permission

B.The policy is missing the s3:PutObjectAcl permission

C.The policy is missing the s3:PutObject permission

D.The policy is missing the kms:Encrypt permission

AnswerD

Writing with SSE-KMS requires kms:Encrypt.

Why this answer

Option D is correct because the policy grants s3:PutObject, which is needed to write, and its KMS permissions include kms:Decrypt and kms:GenerateDataKey but not kms:Encrypt, which the job role also needs to write encrypted objects. Option C is wrong because the policy includes s3:PutObject.

Option A is wrong because the policy already includes kms:Decrypt, which covers reading. Option B is wrong because s3:PutObjectAcl is not required to write objects.


20

MCQmedium

A machine learning team is preparing a large dataset for training. The dataset consists of 10,000 CSV files, each about 100 MB, stored in Amazon S3. The team wants to transform the data using AWS Glue ETL jobs. The transformation involves filtering rows, adding new columns, and joining with a small reference table (100 KB). The team is concerned about job performance and cost. They currently have a Glue job with 10 DPU (Data Processing Units) and it takes about 2 hours to complete. The team wants to reduce the runtime and cost. Which approach should they take?

A.Use Amazon Athena to transform the data.

B.Increase the number of DPUs to 100.

C.Use Amazon EMR with Spot Instances instead of AWS Glue.

D.Convert the CSV files to Parquet format and partition the data by a column.

AnswerD

Parquet reduces I/O and partitioning reduces data scanned.

Why this answer

Using a columnar format like Parquet and partitioning the data on a relevant column (e.g., date) can significantly reduce the amount of data scanned and improve performance. Additionally, optimizing the number of DPUs (e.g., using a larger number of DPUs for a shorter time) can reduce cost if the job is billed by DPU-hour.
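The figures in the question can be put into a quick back-of-the-envelope calculation. The sketch below only uses numbers from the question plus one labeled assumption (perfectly linear scaling when DPUs are doubled, which real jobs rarely achieve); no prices are assumed and the function name is hypothetical.

```python
FILE_COUNT = 10_000
FILE_SIZE_MB = 100

# Total input size in GB (decimal units): 10,000 files x 100 MB.
total_gb = FILE_COUNT * FILE_SIZE_MB / 1000
print(total_gb)  # 1000.0


def dpu_hours(dpus: int, hours: float) -> float:
    """Billing-relevant quantity when a job is charged per DPU-hour."""
    return dpus * hours


current = dpu_hours(10, 2)  # the existing job: 10 DPUs for 2 hours
scaled = dpu_hours(20, 1)   # ASSUMPTION: doubling DPUs halves the runtime
print(current, scaled)      # 20 20
```

Under that idealized assumption more DPUs shorten the job without lowering the DPU-hours, whereas reading less data (Parquet plus partitioning) can reduce both.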


21

Multi-Selecthard

A company uses Amazon Redshift to run analytics on sales data. The data is loaded daily from S3 using COPY commands. The team notices that the COPY command performance degrades over time due to table bloat. The team needs to maintain query performance and reduce storage costs. Which combination of maintenance operations should the team perform regularly? (Choose THREE.)

Select 3 answers

A.Run the UNLOAD command to export data to S3 and then reload.

B.Change the distribution style of the table to KEY.

C.Run the VACUUM command to reclaim space and re-sort rows.

D.Run a DEEP COPY to recreate the table with optimal physical storage.

E.Run the ANALYZE command to update table statistics.

AnswersC, D, E

VACUUM removes deleted rows and re-sorts data.

Why this answer

Options C, D, and E are correct. VACUUM reclaims space and re-sorts rows (if sort keys are defined). ANALYZE updates statistics for query planning.

DEEP COPY recreates the table to eliminate bloat completely. Option A is wrong because UNLOAD exports data and is not a maintenance operation. Option B is wrong because changing distribution style is a schema change, not regular maintenance.


22

MCQmedium

A company is building a data pipeline that ingests data from on-premises databases into Amazon S3 using AWS Database Migration Service (AWS DMS). The company wants to capture continuous changes from the source database and replicate them to S3 in near-real time. Which AWS DMS configuration should the company use?

A.Create a full-load task to copy the existing data

B.Create a full-load plus CDC task with S3 target

C.Create a validation task to compare source and target

D.Create a CDC-only task with S3 as the target endpoint

AnswerD

CDC-only captures and replicates changes in near-real time.

Why this answer

Option D is correct because using a CDC-only task with S3 as the target endpoint replicates continuous changes to S3. Option A is wrong because a full-load task only migrates existing data. Option B is wrong because a full-load plus CDC task includes both, but the requirement is only changes.

Option C is wrong because a validation task is for data validation, not replication.


23

Multi-Selecthard

A company is using AWS Glue ETL jobs to transform data. The jobs are failing due to insufficient memory. The data processing involves complex joins and aggregations. Which THREE actions can improve job performance and reduce memory usage?

Select 3 answers

A.Filter and project data early in the transformation to reduce data volume

B.Decrease the number of DPUs allocated to the job

C.Repartition the data and use bucketing to reduce shuffle size

D.Increase the number of DPUs (workers) allocated to the job

E.Use a single node cluster to avoid shuffle overhead

AnswersA, C, D

Reduces memory footprint.

Why this answer

Option A is correct because filtering and projecting data early in the transformation reduces the volume of data that must be processed in subsequent operations like joins and aggregations. By using pushdown predicates and selecting only necessary columns, you minimize the data shuffled across the cluster, which directly reduces memory pressure and improves job performance in AWS Glue ETL.

Exam trap

The trap here is that candidates often assume reducing resources (Option B) or eliminating parallelism (Option E) will solve memory issues, when in fact these actions exacerbate the problem by increasing the data load per executor or removing the benefits of distributed processing.


24

MCQmedium

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Kinesis Data Analytics application that runs SQL queries. The application has been failing intermittently with 'ProvisionedThroughputExceededException' errors. Which action should be taken to resolve this issue?

A.Disable error logging in the Kinesis Data Analytics application.

B.Increase the record size in the Kinesis data stream.

C.Switch from Kinesis Data Analytics to Kinesis Data Firehose.

D.Increase the number of shards in the Kinesis data stream.

AnswerD

Correct: More shards increase read throughput capacity.

Why this answer

The error indicates that the shard's read throughput limit (5 transactions/second per shard) is being exceeded. Increasing the number of shards increases the total throughput. Option D (increase shard count) is the correct solution.

Option B (increase record size) could worsen the problem. Option C (use Kinesis Firehose) changes the architecture but does not address the shard throughput. Option A (disable error logging) does not solve the underlying issue.
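A rough sizing calculation shows why more shards resolve the read-side exception. The sketch uses the per-shard limit quoted above (5 read transactions per second) together with the 2 MB per second per-shard read limit for shared-throughput consumers; the workload numbers and the function name are hypothetical.

```python
import math

READS_PER_SHARD_PER_S = 5     # read transactions per second per shard
READ_MB_PER_SHARD_PER_S = 2   # MB per second per shard (shared throughput)


def shards_needed(reads_per_s: float, read_mb_per_s: float) -> int:
    """Smallest shard count that satisfies both per-shard read limits."""
    by_calls = math.ceil(reads_per_s / READS_PER_SHARD_PER_S)
    by_bytes = math.ceil(read_mb_per_s / READ_MB_PER_SHARD_PER_S)
    return max(by_calls, by_bytes, 1)


# Hypothetical consumer load: 12 read calls/s pulling 3 MB/s in total.
print(shards_needed(12, 3))  # 3

# Hypothetical load dominated by bytes rather than call count.
print(shards_needed(4, 7))   # 4
```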


25

MCQeasy

A data engineering team needs to orchestrate a complex workflow that involves multiple AWS Glue jobs, Lambda functions, and S3 operations. The workflow must run on a schedule and allow monitoring of each step. Which AWS service should they use?

A.Amazon Simple Workflow Service (SWF)

B.AWS Step Functions

C.AWS Data Pipeline

D.Amazon CloudWatch Events

AnswerB

Step Functions provides state machines to orchestrate multi-step workflows.

Why this answer

AWS Step Functions is a serverless orchestration service that can coordinate multiple AWS services and visualizes workflows. Option D is wrong because CloudWatch Events can trigger based on events but not orchestrate. Option C is wrong because Data Pipeline is for data-driven workflows but less flexible.

Option A is wrong because Simple Workflow Service is a legacy service.


26

MCQmedium

A company uses AWS Glue ETL jobs to process data from multiple sources. The job fails with the error: 'An error occurred while calling o123.pyWriteDynamicFrame. Insufficient memory.' The job runs on a G.1X worker type with 10 workers. What should be changed to resolve this error?

A.Increase the number of workers to 20.

B.Enable the Spark UI to monitor the job.

C.Change the worker type to G.2X.

D.Reduce the number of partitions in the DynamicFrame.

AnswerA

More workers increase parallelism and reduce memory pressure per worker.

Why this answer

The error 'Insufficient memory' in AWS Glue ETL jobs typically indicates that the total memory across all executors is insufficient for the data being processed. Increasing the number of workers from 10 to 20 doubles the total memory and compute capacity available, allowing the job to handle larger datasets without running out of memory. This is the most direct and effective fix for a memory exhaustion error when using the G.1X worker type.

Exam trap

The trap here is that candidates often confuse 'insufficient memory' with a per-worker memory limit and choose to upgrade the worker type (G.2X), but the error is about total cluster memory, which is more effectively addressed by increasing the number of workers.

How to eliminate wrong answers

Option B is wrong because enabling the Spark UI only provides monitoring and debugging capabilities; it does not allocate additional memory or resolve the underlying memory shortage. Option C is wrong because changing the worker type to G.2X doubles the memory per worker (from 16 GB to 32 GB), but the error is about total memory insufficiency, and increasing the number of workers (option A) is a more scalable and cost-effective approach that directly addresses the error without requiring a change in worker type. Option D is wrong because reducing the number of partitions in the DynamicFrame would actually increase the data size per partition, potentially worsening memory pressure on individual executors, not resolving the overall memory shortage.

Practice this question →

27

MCQhard

A data pipeline uses Amazon Kinesis Data Streams to ingest event data. The data is consumed by an AWS Lambda function, which writes to Amazon DynamoDB. The Lambda function is experiencing throttling errors, and the DynamoDB write capacity is underutilized. The events must be processed in order per shard. Which solution most effectively addresses the throttling?

A.Use an SQS FIFO queue between Kinesis and Lambda to buffer

B.Increase the Lambda function's reserved concurrency

C.Increase the write capacity units of the DynamoDB table

D.Increase the number of shards in the Kinesis Data Stream

AnswerD

More shards allow more parallel Lambda invocations, reducing throttling.

Why this answer

Adding more shards to the Kinesis stream increases the number of concurrent Lambda invocations, spreading the load. Option C (increasing DynamoDB write capacity) does not address Lambda throttling, and the write capacity is already underutilized. Option A (using an SQS FIFO queue) would decouple the components but may cause duplicates.

Option B (increasing Lambda reserved concurrency) alone may not help if Lambda is throttled due to concurrency limits; adding shards is more effective.

Practice this question →

28

MCQmedium

A data engineering team is building a real-time clickstream analytics pipeline on AWS. They need to ingest millions of events per second from mobile apps and websites, process them with low latency, and store the results in Amazon S3 for downstream analysis. Which combination of AWS services should the team use to minimize operational overhead while meeting these requirements?

A.Use Amazon MQ to ingest streaming data, AWS Lambda to process each message, and save output to Amazon S3.

B.Use Amazon Kinesis Data Streams to ingest data, Amazon EMR to process with Spark Streaming, and save output to Amazon S3.

C.Use Amazon Kinesis Data Streams for ingestion, Amazon Kinesis Data Analytics for real-time processing, and Amazon Kinesis Data Firehose to deliver results to Amazon S3.

D.Use AWS Glue to ingest data into Amazon RDS, then use AWS Glue ETL jobs to transform and load into Amazon S3.

AnswerC

This combination provides serverless, low-latency ingestion, processing, and delivery with minimal operational overhead.

Why this answer

Option C is correct because Amazon Kinesis Data Streams can handle high-throughput ingestion, Kinesis Data Analytics processes streaming data with low latency, and Kinesis Data Firehose delivers processed data to S3 with minimal overhead. Option D is wrong because AWS Glue is a batch ETL service, not suitable for real-time processing. Option B is wrong because Amazon EMR is a managed Hadoop cluster that requires more operational overhead.

Option A is wrong because Amazon MQ is a message broker for standard messaging protocols, not optimized for real-time analytics.

Practice this question →

29

Multi-Selecteasy

A data engineer needs to collect and analyze log data from multiple EC2 instances in real-time. The solution should be serverless and scalable. Which TWO AWS services should be used?

Select 2 answers

A.Amazon Kinesis Data Firehose

B.Amazon EMR

C.Amazon Athena

D.Amazon OpenSearch Service

E.Amazon S3

AnswersA, D

Firehose ingests the streaming log data and delivers it to OpenSearch Service for analysis.

Why this answer

Amazon Kinesis Data Firehose is correct because it is a fully managed, serverless service that can capture, transform, and load streaming log data from EC2 instances into destinations like Amazon S3 or Amazon OpenSearch Service in near real-time, with no infrastructure to manage. It automatically scales to handle high-throughput data streams, making it ideal for real-time log analytics. Amazon OpenSearch Service is the second answer because Firehose can deliver to it directly, and it indexes the logs so they can be searched and analyzed in near real-time.

Exam trap

The trap here is that candidates often choose Amazon S3 alone for storage, forgetting that a real-time ingestion layer like Kinesis Data Firehose is required to collect and stream the data from EC2 instances into a queryable destination.

Practice this question →

30

MCQeasy

A data scientist needs to perform exploratory data analysis on a 100 GB CSV file stored in Amazon S3. The data is not sensitive. The scientist wants to use SQL queries to filter and aggregate the data without setting up a server or moving the data. Which service should be used?

A.AWS Glue

B.Amazon EMR

C.Amazon Athena

D.Amazon Redshift Spectrum

AnswerC

Athena is serverless and allows SQL queries on S3 data.

Why this answer

Option C is correct because Amazon Athena is a serverless query service that allows SQL queries directly on data in S3. Option D is wrong because Redshift Spectrum requires a Redshift cluster. Option B is wrong because EMR requires a cluster.

Option A is wrong because Glue is for ETL, not ad-hoc querying.

Practice this question →

31

MCQhard

An organization stores sensitive customer data in S3. A data pipeline uses AWS Glue to transform the data and load it into Amazon Redshift. The security team requires that data be encrypted at rest in S3 and in transit between S3 and Glue, and between Glue and Redshift. Which configuration meets these requirements?

A.Use S3 client-side encryption, and use VPC Peering between Glue and Redshift.

B.Use S3 default encryption with SSE-KMS, and use Network Load Balancer for Redshift.

C.Enable S3 server-side encryption with SSE-S3, and use SSL for both Glue connections.

D.Enable S3 default encryption with SSE-KMS, use a VPC endpoint for S3, and configure Glue to use SSL for Redshift connection.

AnswerD

SSE-KMS encrypts at rest, VPC endpoint uses AWS network, SSL encrypts connection to Redshift.

Why this answer

Option D is correct because it ensures encryption at rest in S3 via SSE-KMS, keeps traffic between S3 and Glue on the AWS network through a VPC endpoint (where the S3 API calls are made over HTTPS/TLS), and encrypts data in transit between Glue and Redshift by configuring SSL for the Redshift connection. SSE-KMS provides envelope encryption with a KMS key, while the VPC endpoint and SSL satisfy the in-transit encryption requirements.

Exam trap

The trap here is that candidates often assume VPC Peering alone provides encryption in transit, but it only provides network isolation without encryption, and they may overlook that SSL must be explicitly configured for the Glue-to-Redshift connection.

How to eliminate wrong answers

Option A is wrong because client-side encryption does not guarantee server-side encryption at rest in S3 (the security team requires encryption at rest in S3, which is typically satisfied by server-side encryption), and VPC Peering alone does not enforce encryption in transit between Glue and Redshift (it only provides network connectivity, not TLS/SSL). Option B is wrong because a Network Load Balancer (NLB) for Redshift does not inherently encrypt traffic between Glue and Redshift; NLB operates at Layer 4 and does not terminate TLS unless explicitly configured with a TLS listener, which is not mentioned. Option C is wrong because SSE-S3 encrypts data at rest but does not provide encryption in transit between S3 and Glue (SSL must be explicitly enabled for the Glue connection to S3, and the option does not specify SSL for the S3-to-Glue leg).

Practice this question →

32

MCQeasy

A company is using Amazon DynamoDB to store sensor data. The data is exported to Amazon S3 using DynamoDB Streams and AWS Lambda for long-term archival. Recently, the Lambda function has been failing due to 'ProvisionedThroughputExceededException' on the DynamoDB stream. What is the most likely cause?

A.The Lambda function is processing records too slowly, causing the stream to throttle.

B.The DynamoDB stream is disabled.

C.The DynamoDB table's write capacity is too low.

D.The Lambda function does not have enough memory allocated.

AnswerA

Correct: Slow processing can lead to throttling; increasing batch size or concurrency can help.

Why this answer

The error indicates that the DynamoDB stream's read throughput is being exceeded. The Lambda function's event source mapping reads from the stream, and if it is processing too slowly or there are too many shards, it can throttle, so Option A is correct. Increasing the batch size reduces the number of reads per second.

Option D (not enough Lambda memory) is unlikely, because adding memory may not help. Option B (the stream is disabled) would stop the export rather than throttle it. Option C (table write capacity too low) does not affect stream reads.

Practice this question →

33

MCQhard

A company uses Amazon Kinesis Data Streams with a shard count of 5. The data producer sends 1000 records per second, each 1 KB in size. The consumer application reads from the stream using the Kinesis Client Library (KCL) and processes records. The consumer is experiencing high latency and falling behind. What is the most effective way to improve consumer throughput?

A.Switch to Kinesis Data Analytics for processing.

B.Use enhanced fan-out to dedicate read throughput to the consumer.

C.Increase the record size to 5 KB.

D.Increase the number of shards in the stream.

AnswerD

Correct: More shards provide more read capacity and allow parallel processing.

Why this answer

Increasing the number of shards increases the read throughput and allows more consumers to read in parallel, so Option D (increase the number of shards) is correct. Option C (increase record size) is irrelevant.

Option A (use Kinesis Data Analytics) adds another service. Option B (use enhanced fan-out) is for multiple consumers, not for a single consumer falling behind.

Practice this question →

34

MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The on-premises network has a 1 Gbps connection to AWS. The transfer must be completed within 10 days. What is the MOST efficient approach?

A.Use AWS Snowball Edge to physically ship the data.

B.Use Amazon S3 Transfer Acceleration to speed up the upload.

C.Use AWS DataSync over the existing network connection.

D.Set up a VPN connection and use multi-part upload directly to S3.

AnswerA

Snowball Edge provides high-speed local transfer and avoids network bottlenecks.

Why this answer

Option A is correct because AWS Snowball Edge can transfer large volumes faster than a network, especially with limited bandwidth. Option B is wrong because Transfer Acceleration cannot exceed the 1 Gbps line: 50 TB takes about 4.6 days even at full line rate, longer at realistic utilization, and the link may be unreliable. Option D is wrong because 50 TB over VPN would be very slow.

Option C is wrong because AWS DataSync is network-based and limited by bandwidth.

Practice this question →

35

MCQmedium

A data engineer needs to transform a large dataset stored in Amazon S3 using Apache Spark. The engineer wants to minimize costs and avoid managing infrastructure. Which AWS service should be used?

A.Amazon Athena

B.Amazon SageMaker

C.Amazon EMR

D.AWS Glue

AnswerD

Glue is serverless and ideal for cost-sensitive, intermittent ETL jobs.

Why this answer

AWS Glue is a serverless Spark ETL service that can process large datasets stored in S3 without requiring infrastructure management.

Practice this question →

36

MCQhard

A data engineer uses the IAM policy above for an AWS Lambda function that processes data in S3 and triggers an AWS Glue job. The Lambda function is unable to start the Glue job. What is the most likely cause?

A.The Glue job name in the resource ARN is misspelled.

B.The policy does not allow s3:PutObject on the bucket.

C.The policy does not include iam:PassRole permission.

D.The policy does not include s3:GetObject on the bucket.

AnswerC

To start a Glue job, Lambda must pass an execution role; iam:PassRole is required.

Why this answer

Option C is correct. The policy includes 'glue:StartJobRun', scoped to the specific job resource, so that action is not the problem. However, the Lambda function's execution role may not have permission to pass the IAM role to Glue (iam:PassRole).

Option D is wrong because the policy has GetObject on the bucket. Option A is wrong because the Glue job name is correct.

Option B is wrong because S3 PutObject is allowed.

Practice this question →

37

MCQhard

Refer to the exhibit. A data engineer runs an Athena query and gets a failure. What is the most likely cause?

A.The query result location uses a bucket with default encryption enabled.

B.The SQL query syntax is incorrect.

C.The IAM role used does not have permissions to write to S3.

D.The output S3 bucket specified in the query result configuration already exists.

AnswerD

Athena expects the output bucket to be created by the service; if it already exists, it may cause this error.

Why this answer

Option D is correct. The error message says 'Bucket my-query-results should not exist', which indicates that the output location bucket must not exist beforehand; Athena creates it if it doesn't exist. Option C is incorrect because the error does not mention permissions.

Option B is incorrect because the error is about the bucket existing, not the query syntax. Option A is incorrect because the error does not mention encryption.

Practice this question →

38

MCQmedium

A machine learning team is building a real-time inference pipeline using Amazon SageMaker. The input data is located in an S3 bucket, and the team needs to transform the data before inference using a custom Python script. The transformation should run on a serverless infrastructure and must be triggered automatically when new data arrives in S3. Which combination of services should the team use?

A.Use AWS Lambda functions triggered by S3 events to run the transformation, then invoke a SageMaker endpoint.

B.Use AWS Glue jobs triggered by S3 events.

C.Use Amazon SageMaker Processing jobs triggered by S3 events.

D.Use Amazon Kinesis Data Firehose to transform data and deliver to SageMaker.

AnswerA

Lambda provides serverless compute triggered by S3 events, and can call SageMaker endpoints.

Why this answer

Option A is correct because S3 events can trigger Lambda, which runs the custom script, and the output can be sent to a SageMaker endpoint for inference. Option C is wrong because SageMaker Processing is not serverless (it runs on EC2 instances). Option B is wrong because Glue is for ETL, not real-time inference.

Option D is wrong because Kinesis Data Firehose is for streaming ingestion, not suitable for S3-triggered batch processing.
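
To make the S3-triggered flow more concrete, here is a minimal sketch of the first step a Lambda handler would typically take: pulling the bucket and object key out of the S3 event notification. The helper name `extract_s3_objects` and the sample bucket and key are hypothetical; the transformation and the SageMaker endpoint call are left out.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs found in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        key = s3_info.get("object", {}).get("key")
        if bucket and key:
            # Object keys are URL-encoded in S3 event notifications.
            objects.append((bucket, unquote_plus(key)))
    return objects


# Hypothetical event, trimmed to the fields the helper reads.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-input-bucket"},
                "object": {"key": "incoming/new+file.csv"},
            },
        }
    ]
}

print(extract_s3_objects(sample_event))
```

In a real handler, each pair would then be read, transformed by the custom script, and sent to the inference endpoint.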

Practice this question →

39

MCQhard

A company runs a real-time analytics platform that ingests IoT sensor data from millions of devices. The data is sent to Amazon Kinesis Data Streams with 16 shards. A custom Java application using the Kinesis Client Library (KCL) processes the data and writes aggregated results to Amazon DynamoDB. The application runs on a fleet of EC2 instances in an Auto Scaling group. Recently, the team noticed that some records are being processed multiple times, resulting in duplicate entries in DynamoDB. The application uses the DynamoDB PutItem API to write records. The team needs to eliminate duplicates without significantly increasing latency. Which solution should the team implement?

A.Enable DynamoDB auto scaling to increase write capacity and reduce throttling, which causes retries and duplicates.

B.Use DynamoDB TransactWriteItems with a condition check that the record's Kinesis sequence number does not already exist in the table.

C.Place an Amazon SQS FIFO queue between the KCL application and DynamoDB to deduplicate messages.

D.Modify the application to use DynamoDB BatchWriteItem instead of PutItem to reduce the number of write requests.

AnswerB

This ensures exactly-once semantics by atomically checking and writing.

Why this answer

Option B is correct because using a DynamoDB transaction with a condition check on the Kinesis sequence number ensures that each record is written only once. A PutItem call with a condition expression on a unique attribute (like the sequence number) is effectively the same idea, but transactions provide atomicity. Option A is wrong because auto scaling only adds write capacity; it does not make writes idempotent, so a record that KCL delivers again is still written again. Option D is wrong because BatchWriteItem does not support condition expressions, so it cannot reject a record that was already written.

Option C is wrong because adding a FIFO queue adds latency and complexity without guaranteeing exactly-once processing in the consumer.
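
The conditional write described above could be expressed roughly as follows. This sketch only builds the request payload; the table name, the `sequence_number` and `payload` attribute names, and the helper `build_dedup_write` are hypothetical, and it assumes `sequence_number` is the table's partition key.

```python
def build_dedup_write(table_name, sequence_number, payload):
    """Build a TransactWriteItems request that writes a record only once.

    Assumption: 'sequence_number' is the partition key of the table, so the
    condition fails whenever an item with the same key already exists.
    """
    return {
        "TransactItems": [
            {
                "Put": {
                    "TableName": table_name,
                    "Item": {
                        "sequence_number": {"S": sequence_number},
                        "payload": {"S": payload},
                    },
                    # Reject the write if this sequence number was stored before.
                    "ConditionExpression": "attribute_not_exists(sequence_number)",
                }
            }
        ]
    }


request = build_dedup_write("example-aggregates", "4959033827", '{"count": 3}')
print(request["TransactItems"][0]["Put"]["ConditionExpression"])
```

With boto3, a payload of this shape would be passed to the DynamoDB client's `transact_write_items(**request)`; a duplicate record would then cancel the transaction instead of overwriting the item.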

Practice this question →

40

Multi-Selecthard

A data engineering team is designing a streaming data pipeline that ingests 10,000 events per second. Each event is 2 KB. The pipeline must process events with a latency of less than 1 second. The team is considering using Amazon Kinesis Data Streams with 10 shards. Which TWO additional configurations should the team implement to meet the latency requirement? (Choose two.)

Select 2 answers

A.Use record aggregation to reduce the number of records.

B.Configure Kinesis Data Firehose to deliver data to Amazon S3.

C.Increase the number of shards to 100.

D.Enable auto-scaling of shards based on throughput.

E.Use enhanced fan-out for consumers.

AnswersD, E

Auto-scaling ensures that the stream has enough shards to handle peak load without manual intervention.

Why this answer

Correct options: D and E. Using enhanced fan-out allows multiple consumers to read from the stream with dedicated throughput, reducing latency. Auto-scaling shards ensures sufficient capacity as load varies.

Option C (increase shards to 100) is over-provisioning and increases cost. Option A (record aggregation) is for reducing PUT costs but doesn't affect read latency. Option B (S3 delivery) is not relevant to latency.
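
A quick capacity check helps show why shard count matters in this scenario. The sketch below assumes the standard provisioned-mode write limits of 1 MB/s (taken here as 1024 KB/s) and 1,000 records per second per shard; the function name `min_shards_for_writes` is hypothetical.

```python
import math

# Assumed per-shard write limits for a provisioned Kinesis data stream.
WRITE_KB_PER_SHARD = 1024
WRITE_RECORDS_PER_SHARD = 1000


def min_shards_for_writes(events_per_second, event_size_kb):
    """Return the smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(events_per_second * event_size_kb / WRITE_KB_PER_SHARD)
    by_records = math.ceil(events_per_second / WRITE_RECORDS_PER_SHARD)
    return max(by_bytes, by_records)


# 10,000 events per second at 2 KB each, as in the question.
print(min_shards_for_writes(10_000, 2))
```

Under these assumptions the workload needs about 20 shards for writes, so the proposed 10 shards would throttle producers. That is well short of 100 shards, but it does show why scaling the shard count with throughput matters.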

Practice this question →

41

Matchingmedium

Match each AWS security service to its function in ML.

Concepts

Matches

Manage access to AWS resources

Encryption key management

Audit API calls

Isolate network resources

Discover and protect sensitive data

Why these pairings

Security services are important for compliance in ML.

Practice this question →

42

MCQhard

A financial services company uses Amazon Kinesis Data Streams with 50 shards to ingest real-time stock trade data. The data is consumed by a custom Java application running on Amazon EC2 instances. Recently, the application has been experiencing high latency, and CloudWatch metrics show that the average iterator age is increasing. The application uses the Kinesis Client Library (KCL) with DynamoDB for lease tracking. The EC2 instances are in an Auto Scaling group with a minimum of 2 and maximum of 10 instances, and the current CPU utilization is below 50%. The team wants to reduce latency without increasing costs significantly. What should they do?

A.Increase the provisioned read capacity of the DynamoDB lease table

B.Enable enhanced fan-out on the Kinesis stream

C.Increase the number of shards in the Kinesis stream

D.Increase the maximum size of the Auto Scaling group and set a scaling policy based on iterator age

AnswerD

Adding more consumers reduces iterator age.

Why this answer

Increasing the number of consumers (EC2 instances) allows parallel processing of shards, reducing iterator age. Option C is wrong because adding shards raises the cost of the stream, and the scenario points to too few consumer workers for the existing 50 shards rather than to a lack of shard capacity. Option A is wrong because DynamoDB provisioned throughput is not the bottleneck.

Option B is wrong because enabling enhanced fan-out is for multiple consumers, not the same consumer group.

Practice this question →

43

MCQmedium

A company is using AWS Glue Data Catalog as the metadata store for their data lake. They have multiple AWS accounts and want to share the catalog across accounts. Which feature should they use?

A.Amazon Athena Federated Query

B.AWS Lake Formation

C.AWS Resource Access Manager (RAM)

D.Amazon S3 Cross-Region Replication

AnswerC

RAM allows sharing Glue Data Catalog across accounts.

Why this answer

AWS Resource Access Manager (RAM) enables you to share AWS Glue Data Catalog databases and tables across multiple AWS accounts without needing to copy metadata. This allows a centralized catalog to be consumed by different accounts for querying and ETL operations, maintaining a single source of truth for the data lake.

Exam trap

The trap here is that candidates often confuse AWS Lake Formation's cross-account access capabilities with the actual sharing mechanism, but Lake Formation relies on AWS RAM to enable the sharing of Data Catalog resources.

How to eliminate wrong answers

Option A is wrong because Amazon Athena Federated Query allows querying data from external sources (e.g., CloudWatch, DynamoDB) using connectors, but it does not share the Glue Data Catalog across accounts. Option B is wrong because AWS Lake Formation provides fine-grained access control and data lake management, but cross-account catalog sharing is implemented via AWS RAM, not directly by Lake Formation (though Lake Formation can use RAM for sharing). Option D is wrong because Amazon S3 Cross-Region Replication replicates objects between S3 buckets in different regions, but it does not share the Glue Data Catalog metadata store across accounts.

Practice this question →

44

MCQeasy

A data scientist needs to query a 2 TB dataset stored in Amazon S3 using Amazon Athena. The data is in CSV format and is used for exploratory analysis. Queries are currently slow and expensive. Which action will improve query performance and reduce cost?

A.Convert the data to JSON format to improve compression.

B.Increase the number of workers in the Athena query engine.

C.Convert the data to Parquet format and partition by a commonly filtered column.

D.Create a composite index on the data using Athena's index feature.

AnswerC

Parquet reduces data scanned due to columnar storage, and partitioning limits scan range.

Why this answer

Option C is correct because converting CSV to Parquet reduces scan size and query cost, and partitioning further limits data scanned. Option B is wrong because increasing workers is not applicable to Athena. Option A is wrong because converting to JSON may increase data size and cost.

Option D is wrong because Athena does not use indexes.
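
One common way to do the conversion is an Athena CREATE TABLE AS SELECT (CTAS) statement. The sketch below only assembles the SQL text; the table names, partition column, S3 location, and the helper `build_parquet_ctas` are hypothetical, and it assumes the partition column is the last column of the source table, since Athena expects partition columns last in the SELECT list.

```python
def build_parquet_ctas(source_table, target_table, partition_column, output_location):
    """Assemble an Athena CTAS statement that rewrites a table as Parquet.

    Assumption: partition_column is the last column of source_table, so
    SELECT * already returns it in the position Athena requires.
    """
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        f"  external_location = '{output_location}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        ")\n"
        f"AS SELECT * FROM {source_table}"
    )


sql = build_parquet_ctas(
    "events_csv", "events_parquet", "event_date", "s3://example-bucket/events_parquet/"
)
print(sql)
```

Queries that filter on the partition column and select only a few columns from the new table would then scan far less data than the same queries against the CSV files.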

Practice this question →

45

Multi-Selectmedium

Which TWO options are valid ways to reduce the amount of data scanned by Amazon Athena queries, thereby reducing cost?

Select 2 answers

A.Use columnar storage formats like Parquet or ORC

B.Use LIMIT clause in SQL queries

C.Convert data to CSV format

D.Create materialized views in Athena

E.Partition the data by a frequently filtered column

AnswersA, E

Columnar formats allow reading only required columns.

Why this answer

Partitioning allows Athena to skip entire partitions. Using columnar formats like Parquet reduces the amount of data read per column. Converting to CSV increases data scanned.

Materialized views don't reduce scan. Limiting row count does not reduce scan of underlying data.
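
The combined effect of the two correct options can be approximated with simple arithmetic. This is a rough model under stated assumptions: the fractions are hypothetical inputs, compression is ignored, and the helper `estimated_scan_tb` is illustrative rather than anything Athena exposes.

```python
def estimated_scan_tb(total_tb, column_fraction=1.0, partition_fraction=1.0):
    """Estimate data scanned when only some columns and partitions are read.

    column_fraction: share of the bytes held by the columns a query selects
                     (only achievable with a columnar format).
    partition_fraction: share of the partitions that match the query filter.
    """
    if not (0 < column_fraction <= 1 and 0 < partition_fraction <= 1):
        raise ValueError("fractions must be in the range (0, 1]")
    return total_tb * column_fraction * partition_fraction


# Row-based CSV with no partitions: every byte is scanned.
print(estimated_scan_tb(2.0))

# Hypothetical case: the query needs 10% of the column data and half the partitions.
print(estimated_scan_tb(2.0, column_fraction=0.1, partition_fraction=0.5))
```

Because Athena bills by data scanned, the estimated cost falls in the same proportion as the scan.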

Practice this question →

46

MCQmedium

An IAM policy attached to a SageMaker notebook role is shown. The data engineer tries to run an Athena query on a table in the 'my_database' Glue database. The query fails with an access denied error. What is the MOST likely cause?

A.The policy does not allow s3:PutObject on the query results location.

B.The policy does not allow glue:GetTable on the specific database.

C.The policy does not allow athena:StartQueryExecution on the Athena workgroup.

D.The policy does not allow s3:ListBucket on the bucket.

AnswerC

Athena requires workgroup-level permissions; the policy grants StartQueryExecution on all resources, but if a workgroup is specified, additional permissions may be needed.

Why this answer

Option C is correct because the policy does not grant permission to the Athena workgroup. Athena requires workgroup permissions for StartQueryExecution. Option A is wrong because the policy allows GetObject.

Option B is wrong because the policy allows GetTable and GetDatabase. Option D is wrong because S3 actions are allowed on the bucket.

Practice this question →

47

MCQeasy

A data engineer needs to transfer 50 TB of data from an on-premises HDFS cluster to Amazon S3. The data must be encrypted in transit and at rest. The on-premises network has a 1 Gbps connection to AWS. The transfer must complete within 5 days. Which solution is MOST cost-effective and meets the requirements?

A.Use S3 Transfer Acceleration to upload the data directly from HDFS to S3.

B.Use AWS DataSync with a DataSync agent installed on-premises to transfer the data to S3.

C.Order an AWS Snowball Edge device and copy the data to it, then ship it back.

D.Use AWS Glue to read from HDFS and write to S3 in a continuous ETL job.

AnswerB

DataSync can transfer over network with encryption and is optimized for speed.

Why this answer

Option B is correct. With 1 Gbps, the maximum theoretical transfer in 5 days is about 54 TB (1 Gbps = 0.125 GB/s, 0.125 * 86400 * 5 = 54000 GB = 54 TB). So it is feasible.

AWS DataSync can transfer data from HDFS via a private endpoint using the DataSync agent, with encryption in transit (TLS) and at rest (S3 SSE). Option A is wrong because S3 Transfer Acceleration only speeds up uploads over public internet, not from HDFS directly. Option C is wrong because Snowball Edge would be faster but more expensive for this volume that can fit in the time window.

Option D is wrong because AWS Glue is for ETL, not data transfer.
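
The feasibility arithmetic above can be reproduced with a small helper. This is a sketch using decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); the `utilization` parameter is an added assumption for illustrating how sensitive the result is to real-world link efficiency.

```python
def transfer_days(data_tb, link_gbps, utilization=1.0):
    """Days needed to move data_tb terabytes over a link_gbps link."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400


# Theoretical best case for 50 TB over 1 Gbps.
print(round(transfer_days(50, 1), 2))

# Same transfer if only 80% of the link is usable (hypothetical figure).
print(round(transfer_days(50, 1, utilization=0.8), 2))
```

At full line rate the transfer fits in the 5-day window with a small margin, which matches the 54 TB figure; the margin disappears if sustained throughput drops much below line rate.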

Practice this question →

48

MCQmedium

A data engineering team needs to ingest streaming data from thousands of IoT devices into a data lake on Amazon S3 for near-real-time analytics. The data must be partitioned by device ID and timestamp, and the team must minimize data loss during ingestion failures. Which solution is MOST appropriate?

A.Use Amazon Kinesis Data Streams with a Lambda function that writes to S3.

B.Use Amazon Kinesis Data Firehose to write directly to S3 with dynamic partitioning.

C.Use Amazon S3 Transfer Acceleration with direct uploads from devices.

D.Use AWS Lambda to receive data via API Gateway and write to S3.

AnswerB

Firehose provides automatic partitioning, retries, and near-real-time delivery to S3.

Why this answer

Option B is correct because Kinesis Data Firehose can directly write to S3 with partitioning and automatic retries, minimizing data loss. Option A is wrong because Kinesis Data Streams requires a separate consumer to write to S3, adding complexity. Option D is wrong because Lambda has a 15-minute limit and may lose data if the function fails.

Option C is wrong because S3 Transfer Acceleration is for speeding up uploads, not for streaming ingestion.

Practice this question →

49

MCQmedium

A company uses Amazon EMR to run Spark jobs on a cluster with 10 core nodes of type r5.xlarge. The jobs are I/O intensive and read large amounts of data from S3. The team notices high network throughput but low CPU utilization. Which configuration change would improve job performance at the same cost?

A.Change the instance type to m5.xlarge (general purpose) to balance resources.

B.Increase the number of core nodes to 20.

C.Replace the core nodes with r5d.xlarge instances that have local SSDs.

D.Use spot instances for the core nodes to save cost and reinvest in more nodes.

AnswerC

Local SSDs provide high I/O for caching, reducing network traffic.

Why this answer

Option C is correct because r5d instances include local NVMe SSDs, which can be used for caching intermediate data, reducing network I/O and improving performance for I/O intensive jobs. Option B is wrong because increasing core nodes increases cost. Option D is wrong because using spot instances reduces cost but not performance.

Option A is wrong because moving to m5 instances (general purpose) may not improve I/O.

Practice this question →

50

Multi-Selecteasy

A data engineer is building a data pipeline using AWS Glue. The pipeline reads data from Amazon S3, transforms it, and writes it back to S3 in a different format. The engineer needs to handle schema evolution (new columns added over time). Which TWO features of AWS Glue can help manage schema evolution?

Select 2 answers

A.AWS Glue Data Catalog

B.AWS Glue DynamicFrame

C.AWS Lake Formation

D.Amazon Athena

E.Amazon S3 object tags

AnswersA, B

Data Catalog stores schema and can be updated as schema evolves.

Why this answer

Options A and B are correct. The Glue Data Catalog can store schema and update it as new columns are added. DynamicFrame in Glue ETL can handle schema changes automatically by allowing optional fields.

Option C is wrong because AWS Lake Formation is for data lake security, not schema evolution. Option D is wrong because Amazon Athena is a query engine, not a schema evolution tool. Option E is wrong because S3 object tags are not for schema management.

Practice this question →

51

MCQhard

A data scientist is building a training dataset from data stored in Amazon S3. The data consists of JSON files each containing a 'timestamp' field. The scientist wants to use AWS Glue to catalog the data and enable querying via Amazon Athena. However, Athena queries are returning zero results for time-range filters. What is the most likely cause?

A.The AWS Glue crawler does not have permissions to read the S3 bucket.

B.Athena cannot query nested JSON objects.

C.The JSON files are not in the correct format for Athena.

D.The 'timestamp' field is not defined as a partition column in the Glue table.

AnswerD

Correct: Without partitioning, Athena scans all data, but time-range filters still work; however, the question implies zero results, which could be due to incorrect partition pruning.

Why this answer

Athena uses the partition columns defined in the Glue catalog. If the timestamp column is not used as a partition key, queries that filter on it will scan all data. Option D (timestamp is not a partition column) is correct because Glue can automatically partition by date, but the user must set it.

Option C (wrong file format) is unlikely because Athena supports JSON. Option B (Athena cannot query nested JSON) is false; Athena supports nested JSON. Option A (insufficient crawler permissions) would cause a different error.

Practice this question →

52

MCQhard

A company stores sensitive customer data in an S3 bucket. The security team requires that all data be encrypted at rest with a key that is automatically rotated every year. Which solution meets these requirements with the least operational overhead?

A.Use SSE-KMS with a customer-managed key and automatic rotation

B.Use SSE-C (customer-provided keys)

C.Use SSE-S3 (Amazon S3-managed keys)

D.Use SSE-KMS with a customer-managed key and manual rotation

AnswerC

SSE-S3 automatically rotates keys and requires no customer management.

Why this answer

SSE-S3 uses Amazon S3-managed keys, which are rotated automatically with no customer involvement. SSE-KMS with automatic rotation also works but requires KMS key management. SSE-C requires the customer to provide and manage the keys.

SSE-KMS with manual rotation adds overhead.

53

MCQeasy

A data engineering team needs to ingest streaming data from thousands of IoT devices into Amazon S3 for near-real-time analytics. The data arrives in bursts and must be processed with minimal latency. Which AWS service is most appropriate for the ingestion layer?

A.Amazon Kinesis Data Streams

B.Amazon Kinesis Data Firehose

C.Amazon SQS

D.Amazon S3

AnswerA

Kinesis Data Streams provides low-latency, real-time data ingestion.

Why this answer

Amazon Kinesis Data Streams is designed for real-time data streaming with low latency and can handle high throughput from many sources. Option B is wrong because Kinesis Data Firehose loads streaming data into destinations with some buffering latency. Option C is wrong because SQS is for decoupled messaging, not streaming.

Option D is wrong because S3 itself is not a streaming ingestion service.

54

MCQeasy

A data engineer needs to load data from a MySQL database to Amazon S3 daily. The database is 500 GB and the load window is 2 hours. The data must be extracted without impacting the source database performance. Which AWS service should be used to perform the extraction?

A.AWS Glue ETL job using a JDBC connection to read the full table.

B.AWS Database Migration Service (AWS DMS) with a full-load task to S3.

C.Amazon Athena with the MySQL federated query connector.

D.Amazon EMR with a Spark job reading from MySQL via JDBC.

AnswerB

DMS is designed for minimal impact migration and can load data directly to S3.

Why this answer

Option B is correct because AWS Database Migration Service (DMS) can run continuous or scheduled full-load and change-data-capture tasks from MySQL to S3 with little impact on source performance when the task settings and replication instance are chosen appropriately. Option A is wrong because AWS Glue can connect to JDBC sources but may place more load on the source; DMS is purpose-built for extraction. Option C is wrong because Amazon Athena is a query service, not an extraction tool.

Option D is wrong because Amazon EMR is for big data processing, not direct extraction from MySQL to S3.

55

MCQhard

A data scientist needs to run ad-hoc SQL queries on a large dataset stored in Amazon S3 (Parquet format, 2 TB). The queries are interactive and require sub-second response times. Which service should they use?

A.Amazon Redshift Spectrum

B.Amazon QuickSight

C.Amazon EMR with Spark SQL

D.Amazon Athena

AnswerD

Athena is serverless and optimized for interactive queries on S3 data.

Why this answer

Amazon Athena can query data in S3 using SQL and provides fast interactive query performance, especially with Parquet. Option A is wrong because Redshift Spectrum depends on Amazon Redshift, which is heavier for ad-hoc use. Option B is wrong because QuickSight is for visualization.

Option C is wrong because EMR requires cluster setup.

56

Multi-Selecteasy

Which TWO AWS services can be used to transform data in transit before storing it in Amazon S3? (Choose TWO.)

Select 2 answers

A.AWS Glue

B.Amazon Redshift Spectrum

C.AWS Data Pipeline

D.Amazon Kinesis Data Firehose

E.Amazon Athena

AnswersA, D

Glue can process streaming data with streaming ETL jobs.

Why this answer

AWS Glue can perform ETL transformations on data in motion. Amazon Kinesis Data Firehose can invoke Lambda functions to transform data before delivery. Options B, C, and E are not used for transforming data in transit.
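To make the Firehose transformation step concrete, here is a minimal sketch of a Lambda handler that follows the Firehose transformation contract: each incoming record has a `recordId` and base64-encoded `data`, and each returned record carries the same `recordId`, a `result`, and re-encoded `data`. The enrichment itself (adding a `processed` flag) is a made-up placeholder.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform Firehose records: decode JSON, enrich it, re-encode it."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical enrichment step
            line = json.dumps(payload) + "\n"  # newline-delimited JSON in S3
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, TypeError):
            # Undecodable or non-object payloads are flagged, not dropped silently.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Records marked `ProcessingFailed` are treated by Firehose as unsuccessfully processed, so malformed input does not block the rest of the batch.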

57

Multi-Selecthard

A company uses Amazon Athena to query a data lake in Amazon S3. The data is partitioned by year, month, day, and hour. The team notices that queries are slow and expensive. The team wants to improve performance and reduce costs. Which THREE actions should the team take?

Select 3 answers

A.Ensure queries filter on partition columns (year, month, day, hour).

B.Increase the number of partitions by adding a partition for minute.

C.Convert data from CSV to Parquet format.

D.Use CSV format with GZIP compression.

E.Use S3 storage classes like S3 Intelligent-Tiering for cost savings.

AnswersA, C, E

Partition pruning reduces scanned data.

Why this answer

Options A, C, and E are correct. Partition pruning using WHERE clauses on partition columns reduces the partitions scanned (A). Converting to a columnar format like Parquet reduces the amount of data scanned, and Parquet's built-in compression reduces storage and scan volume further (C).

S3 Intelligent-Tiering lowers storage cost for data that is accessed infrequently (E). Option B is wrong because increasing partitions beyond need can increase overhead. Option D is wrong because CSV, even with GZIP compression, is a row-based format and is less efficient than Parquet.
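A rough back-of-the-envelope model shows why pruning and columnar storage matter for cost. The sketch below assumes a price of 5 USD per TB scanned (an assumption; check current regional Athena pricing), and the dataset size, fractions, and compression ratio are invented for illustration.

```python
PRICE_PER_TB_USD = 5.0  # assumed list price per TB scanned; verify before use


def scanned_tb(total_tb, partition_fraction=1.0, column_fraction=1.0,
               compression_ratio=1.0):
    """Estimate TB scanned after partition pruning, column pruning, compression."""
    return total_tb * partition_fraction * column_fraction / compression_ratio


def query_cost_usd(tb_scanned):
    """Cost of one query under the assumed per-TB price."""
    return tb_scanned * PRICE_PER_TB_USD


# Hypothetical 10 TB dataset: full CSV scan vs. one day of 30, 10% of columns,
# and 3x compression from Parquet.
full_scan = query_cost_usd(scanned_tb(10))
optimized = query_cost_usd(scanned_tb(10, 1 / 30, 0.1, 3))
```

Under these assumptions the full scan costs 50 USD per query while the optimized layout costs roughly 0.06 USD, which is the effect the correct options are aiming for.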

58

MCQeasy

A Lambda function is triggered by S3 events. The event payload shown in the exhibit is received by the Lambda function. The function is supposed to process the CSV file and load it into DynamoDB. However, the function fails because it cannot read the file. What is the MOST likely cause?

A.The Lambda function lacks DynamoDB write permissions

B.The Lambda function's IAM role does not have s3:GetObject permission

C.The S3 bucket does not exist

D.The S3 event notification is misconfigured

AnswerB

Without read permission, the function cannot access the S3 object.

Why this answer

Option B is correct. The Lambda function needs an IAM role with s3:GetObject permission to read the object. Option D is wrong because the event is valid, so the notification is working.

Option A is wrong because DynamoDB permissions are separate and do not affect reading the file. Option C is wrong because the bucket exists.
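The following sketch shows the two pieces involved: extracting the bucket and key from an S3 event notification, and the minimal IAM statement the execution role would need to read those objects. The helper names and the bucket in the usage note are hypothetical; object keys in S3 events are URL-encoded, so the key is decoded before use.

```python
from urllib.parse import unquote_plus


def object_locations(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    locations = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # decode '+' and %XX
        locations.append((bucket, key))
    return locations


def read_policy(bucket):
    """Minimal identity policy allowing s3:GetObject on every key in a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }
```

Note that the resource ARN targets objects (`bucket/*`), not the bucket itself, because `s3:GetObject` is an object-level action.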

59

MCQhard

A company is designing a data pipeline to process log files from multiple sources. The logs are written to Amazon S3 every hour. The data is then transformed using AWS Glue ETL jobs and loaded into Amazon Redshift for analysis. The company needs to ensure that the data is available for analysis within 30 minutes of being written to S3. Currently, the Glue job is triggered hourly, but the company wants to reduce the latency. Which solution should the company implement?

A.Increase the frequency of the Glue crawler to run every 5 minutes

B.Use Amazon Redshift Spectrum to query the data directly from S3 without transformation

C.Use Amazon S3 event notifications to invoke an AWS Lambda function that starts the Glue job automatically

D.Reduce the Glue job trigger frequency to every 15 minutes

AnswerC

S3 events trigger Lambda immediately, which starts the Glue job with low latency.

Why this answer

Option C is correct because configuring an S3 event notification to invoke AWS Lambda, which starts the Glue job, allows near-real-time processing within minutes. Option A is wrong because increasing the crawler frequency does not trigger ETL jobs. Option B is wrong because Redshift Spectrum does not transform data.

Option D is wrong because a time-based trigger, unlike an event trigger, still waits for the next scheduled run before processing newly arrived data.
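A minimal sketch of the event-driven approach follows. It assumes a Glue job named `transform-logs` that accepts a `--source_path` argument; both names are hypothetical. The Glue client is passed in so the logic can be exercised without AWS access, and `start_job_run` is the boto3 Glue client call used to launch the job.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "transform-logs"  # hypothetical job name


def start_job_for_event(event, glue_client, job_name=GLUE_JOB_NAME):
    """Start one Glue job run per object in an S3 event; return the run IDs."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            # '--source_path' is an assumed job parameter, not a Glue built-in.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


def lambda_handler(event, context):
    """Lambda entry point: create a real Glue client and delegate."""
    import boto3  # provided by the Lambda Python runtime

    return start_job_for_event(event, boto3.client("glue"))
```

Starting one run per object is the simplest design; if many small files arrive at once, batching them or limiting job concurrency may be preferable.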

60

MCQeasy

A company needs to ingest real-time clickstream data from thousands of web servers into AWS for near-real-time analytics. The data volume varies and can spike during promotions. Which service should be used to capture and buffer the data before processing?

A.Amazon SQS

B.Amazon Kinesis Data Firehose

C.Amazon Kinesis Data Streams

D.Amazon MQ

AnswerC

Kinesis Data Streams provides a durable buffer for real-time data, enabling multiple consumers.

Why this answer

Amazon Kinesis Data Streams is designed for real-time data ingestion and retains records for 24 hours by default, extendable up to 365 days. In on-demand capacity mode it scales automatically with traffic, and it integrates with Kinesis Data Analytics and Lambda.
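Each record sent to a stream carries a partition key, and Kinesis Data Streams maps that key to a shard by taking its MD5 hash as a 128-bit integer and finding the shard whose hash key range contains it. The sketch below illustrates the idea under the simplifying assumption that the shards split the hash space evenly; `shard_for` is a hypothetical helper, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # size of the 128-bit hash key space


def hash_key(partition_key: str) -> int:
    """MD5 of the partition key, read as an unsigned 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Index of the owning shard, assuming evenly split hash key ranges."""
    return hash_key(partition_key) * shard_count // HASH_SPACE
```

Because the mapping depends only on the key, choosing a high-cardinality partition key (for example a server or session ID) spreads a traffic spike across shards instead of concentrating it on one.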

61

MCQmedium

A data science team is building a real-time fraud detection system. Transactions are streamed via Amazon Kinesis Data Streams, and a Lambda function performs feature engineering and invokes an Amazon SageMaker endpoint for predictions. The team notices that the Lambda function is timing out and causing data loss. Which solution should the team implement to process the stream reliably and at low latency?

A.Use Amazon Kinesis Data Analytics for Apache Flink to consume the stream, perform feature engineering, and invoke the SageMaker endpoint with exactly-once processing.

B.Use the Kinesis Client Library (KCL) to process the stream in an Amazon EC2 instance, and store the predictions in Amazon DynamoDB.

C.Increase the Lambda function timeout to 15 minutes and allocate more memory to reduce processing time.

D.Configure Amazon Kinesis Firehose to deliver the stream to an Amazon S3 bucket, then trigger a Lambda function to process the data in batches.

AnswerA

Kinesis Data Analytics provides stateful stream processing with checkpointing, ensuring no data loss and low-latency integration with SageMaker.

Why this answer

Option A is correct because Amazon Kinesis Data Analytics for Apache Flink provides a stateful, low-latency stream processing engine that can consume from Kinesis Data Streams, perform feature engineering in real-time, and invoke SageMaker endpoints with exactly-once processing semantics. This eliminates Lambda timeouts and data loss by using a long-running, scalable application instead of a short-lived function.

Exam trap

The trap here is that candidates often assume increasing Lambda resources (timeout/memory) or moving to a batch-based approach (Firehose/S3) can solve real-time streaming issues, but the exam tests the understanding that stateful, long-running stream processing engines like Flink are required for reliable, low-latency, exactly-once processing in production.

How to eliminate wrong answers

Option B is wrong because using the Kinesis Client Library (KCL) on an EC2 instance requires manual management of scaling, fault tolerance, and checkpointing, and does not natively integrate with SageMaker for low-latency predictions; it also adds operational overhead and potential for data loss if the instance fails.

Option C is wrong because increasing the Lambda timeout to 15 minutes and allocating more memory only masks the underlying issue of Lambda's 15-minute maximum execution time and does not address the fundamental problem of stream processing at scale; Lambda is not designed for long-running, stateful stream processing and can still lose data if the function fails or throttles.

Option D is wrong because Amazon Kinesis Firehose delivers data in batches to S3, which introduces significant latency (typically minutes) and is not suitable for real-time fraud detection; triggering a Lambda on S3 objects adds further delay and does not provide low-latency, per-record processing.

62

MCQhard

A data engineer has attached the above IAM policy to an IAM role used by an AWS Glue ETL job. The job reads from and writes to 'my-data-bucket'. The job is failing with an Access Denied error. What is the most likely cause?

A.The condition restricts access to a specific IP range that does not include the AWS Glue service IPs.

B.The IAM role needs to have s3:ListBucket permission.

C.The IAM role does not have permission to list the bucket.

D.The resource ARN should include the bucket itself, not just the objects.

AnswerA

The condition requires the request source IP to be in 10.0.0.0/24, but Glue's IPs are different.

Why this answer

The policy restricts access to requests originating from the IP range 10.0.0.0/24. AWS Glue jobs can run in a VPC that uses private IPs, but the source IP condition is evaluated against the address from which the request reaches S3, and requests made by the Glue job do not arrive from that range. The condition should be removed or modified to allow access from the Glue job.

63

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. They notice that the data is delivered in 5-minute intervals even though they set the buffer interval to 60 seconds. What could be the cause?

A.The source Kinesis stream has insufficient shards.

B.The buffer size is set to a value larger than the incoming data rate.

C.The S3 bucket is in a different region.

D.The IAM role does not have permission to write to S3.

AnswerB

If the buffer size is large and data rate low, Firehose waits longer.

Why this answer

Firehose delivers a batch when either the buffer size or the buffer interval is reached, whichever comes first. For S3 destinations the buffer interval can be set up to 900 seconds, and the defaults are a 5 MB buffer size and a 300-second interval.

If the data rate is low, the size threshold is never reached, so delivery timing is governed by the interval in effect. A 5-minute cadence matches the 300-second default, so confirm that the 60-second interval was actually applied to the delivery stream.
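The interaction between the two buffering hints can be sketched with a few lines of arithmetic. This is a simplified model, not Firehose's implementation: it assumes a constant incoming data rate and that a flush happens as soon as either threshold is met. The function name is hypothetical.

```python
def seconds_until_flush(bytes_per_second, buffer_size_mb, buffer_interval_s):
    """Estimate time to the next delivery under a constant incoming data rate.

    A flush happens when the buffer fills or the interval elapses,
    whichever comes first.
    """
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    size_bytes = buffer_size_mb * 1024 * 1024
    time_to_fill = size_bytes / bytes_per_second
    return min(time_to_fill, float(buffer_interval_s))
```

At 1 KB/s with a 5 MB buffer, filling the buffer would take 5120 seconds, so the interval decides: 60 seconds with a 60-second hint, 300 seconds with the default. At 1 MB/s the 5 MB buffer fills in 5 seconds and the size hint decides instead.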

64

MCQmedium

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data is in JSON format, and the company wants to convert it to Parquet for efficient querying. Which configuration should be used?

A.Enable data transformation in Firehose using an AWS Lambda function to convert JSON to Parquet, and set the output format to Parquet.

B.Use an AWS Glue job to convert the JSON files in S3 to Parquet after delivery.

C.Use Amazon Kinesis Data Analytics to convert the stream to Parquet before sending to Firehose.

D.Configure Firehose to deliver data directly to Amazon Redshift, which automatically converts to Parquet.

AnswerA

Firehose can invoke a Lambda function for transformation and write Parquet to S3.

Why this answer

Option A is correct because Firehose supports data transformation using Lambda, and its record format conversion setting can write Parquet to S3 using a schema defined in the AWS Glue Data Catalog. Option B is wrong because a separate Glue job adds a batch step after delivery instead of converting the data during delivery. Option C is wrong because Kinesis Data Analytics does not convert to Parquet for S3 delivery.

Option D is wrong because Firehose cannot directly write to Redshift in Parquet format without transformation.
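For reference, the Parquet output setting corresponds to Firehose's record format conversion, which is configured on the S3 destination. The fragment below is a sketch of the `DataFormatConversionConfiguration` block as passed inside `ExtendedS3DestinationConfiguration` to boto3's `create_delivery_stream`; the database, table, and role names are hypothetical placeholders.

```python
def parquet_conversion_config(database: str, table: str, role_arn: str) -> dict:
    """Return a Firehose format-conversion block: JSON in, Parquet out.

    The schema is read from an existing AWS Glue Data Catalog table.
    """
    return {
        "Enabled": True,
        "SchemaConfiguration": {
            "DatabaseName": database,
            "TableName": table,
            "RoleARN": role_arn,
        },
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    }


# Hypothetical names, for illustration only.
config = parquet_conversion_config(
    "clickstream_db", "events", "arn:aws:iam::123456789012:role/firehose-glue-read"
)
```

A Lambda transformation can still run before this step, for example to reshape records into JSON that matches the catalog table.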

65

MCQeasy

A data engineer needs to set up a data pipeline that ingests data from an Amazon RDS MySQL database into Amazon S3. The pipeline should run daily and capture incremental changes (inserts, updates, deletes) from the source database. Which AWS service should be used as the data ingestion tool?

A.AWS Database Migration Service (DMS) with continuous change data capture (CDC).

B.Amazon Kinesis Data Streams with a Lambda function.

C.AWS Data Pipeline with a SQL activity.

D.AWS Glue with a scheduled crawler.

AnswerA

Correct: DMS with CDC can capture incremental changes.

Why this answer

AWS Database Migration Service (DMS) supports continuous replication and can capture changes using CDC. Option A (DMS with CDC) is correct. Option B (Kinesis) is for streaming, not database replication.

Option C (Data Pipeline) is older and less suited for CDC. Option D (Glue) can do batch processing but has no native CDC.

66

MCQeasy

A startup is building a data pipeline that ingests data from multiple sources into an Amazon S3 data lake. The data includes CSV files from legacy systems, JSON from web APIs, and Avro from mobile apps. The data must be transformed into Parquet format and cataloged for querying with Amazon Athena. The pipeline must be serverless and minimize operational overhead. The team has decided to use AWS Glue for ETL and cataloging. However, they are concerned about the cost of running Glue jobs continuously. The data arrives in small batches every 10 minutes. Which approach should the team use to minimize cost while meeting the requirements?

A.Use AWS Lambda functions to transform each file upon arrival and store as Parquet

B.Use Amazon Kinesis Data Firehose to stream data directly into S3 and use Glue to catalog it

C.Use scheduled Glue jobs to process the data every hour, consolidating multiple batches

D.Use a single daily Glue job to process all data at once

AnswerC

Hourly batch processing balances cost and latency.

Why this answer

Triggering Glue jobs on a schedule (e.g., every hour) to process accumulated data reduces the number of job runs and cost while keeping latency bounded. Option A is wrong because using Lambda for transformation is limited by timeout and memory. Option B is wrong because continuous streaming with Firehose may not handle all source formats.

Option D is wrong because running a single daily job may introduce too much latency.

67

MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises HDFS cluster to Amazon S3. The company has a 1 Gbps internet connection. Which service would complete the transfer in the shortest time?

A.AWS Snowball

B.Amazon S3 Transfer Acceleration

C.AWS Direct Connect

D.AWS DataSync

AnswerA

Snowball can transfer 50 TB physically in days.

Why this answer

AWS Snowball is a physical device that can move large amounts of data faster than a bandwidth-limited internet connection. DataSync is for network transfers, so it is bound by the same link. Direct Connect helps but is still limited by bandwidth.

S3 Transfer Acceleration speeds up internet transfers but cannot match physical shipment for 50 TB.
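The bandwidth limit is easy to quantify. The sketch below computes the theoretical network transfer time, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); the `utilization` parameter is an assumption to account for a link that cannot be saturated continuously. Snowball turnaround depends on shipping and loading time and is not modeled here.

```python
def transfer_days(terabytes, link_gbps, utilization=1.0):
    """Days needed to push `terabytes` over a link of `link_gbps`.

    `utilization` is the fraction of the link actually available (0 < u <= 1).
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86_400  # seconds per day


ideal = transfer_days(50, 1)             # link fully dedicated to the transfer
realistic = transfer_days(50, 1, 0.5)    # half the link available (assumed)
```

At full utilization 50 TB takes about 4.6 days over 1 Gbps, and about 9.3 days if only half the link is available, which is why physical transfer is considered at this scale.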

68

Multi-Selectmedium

A company is migrating on-premises data to AWS. The data includes both structured and unstructured files, totaling 200 TB. The company has a 1 Gbps dedicated network connection to AWS. They want to minimize migration time and cost. Which TWO AWS services or features should they use together? (Choose two.)

Select 2 answers

A.AWS Snowball Edge

B.Amazon S3 Transfer Acceleration

C.AWS Glue

D.AWS DataSync

E.AWS Direct Connect

AnswersA, D

Snowball Edge can be used to transfer large amounts of data physically, reducing network load.

Why this answer

Correct options: A and D. AWS DataSync can efficiently transfer data over the network to S3, and Snowball Edge can be used for the largest files to reduce network load. Option E (Direct Connect) is already implied by the 1 Gbps dedicated connection; it is not an additional service.

Option B (S3 Transfer Acceleration) is for speeding up internet transfers and is not needed over a dedicated connection. Option C (Glue) is used for ETL, not for bulk data transfer.

69

MCQhard

A company uses AWS Glue to run ETL jobs on a daily schedule. The jobs are failing intermittently with 'OutOfMemory' errors. The data volume has grown 5x over the past month. Which is the MOST cost-effective fix?

A.Increase the number of partitions in the source S3 data

B.Increase the number of DPUs for the Glue job

C.Reduce the data volume by sampling

D.Switch from AWS Glue to Amazon EMR

AnswerB

More DPUs provide more memory and parallelism.

Why this answer

Increasing the number of DPUs (data processing units) in the Glue job configuration provides more memory and parallelism cost-effectively. Switching to EMR is more expensive and complex. Reducing data is not a solution.

Increasing S3 partitions does not affect memory.

70

MCQmedium

A company runs a nightly batch job that reads data from Amazon RDS for PostgreSQL, transforms it using AWS Glue, and writes the output to Amazon S3 in Parquet format. The job takes 2 hours to complete, but the data volume has grown, and the job now takes 4 hours, exceeding the allowed window. The team needs to reduce the job duration without increasing cost. Which action is MOST effective?

A.Increase the number of Glue DPUs (Data Processing Units) allocated to the job.

B.Enable AWS Glue job bookmark to skip already processed data.

C.Change the output format from Parquet to CSV to reduce write time.

D.Partition the output S3 data by a high-cardinality column used in filtering during transformation.

AnswerD

Partitioning can reduce data shuffling and improve write performance.

Why this answer

Option D is the intended answer: partitioning the Parquet output by a frequently filtered column reduces the amount of data processed downstream and can speed up the Glue job if the transformation includes filtering. Option A is wrong because increasing the number of DPUs raises parallelism but also increases cost, which the team wants to avoid. Option B is wrong because a job bookmark helps incremental loads but does not speed up a full load.

Option C is wrong because converting to CSV may increase output size and processing time. Note that partitioning the output mainly improves write performance and reduces data volume for subsequent steps; it may not drastically shorten the job itself, and optimizing the transformation logic would usually be more effective. Among the given options, however, D is the only one that does not increase cost and may help.

71

MCQeasy

A company wants to analyze historical data stored in Amazon S3 using Amazon Athena. The data is in CSV format and is partitioned by date. Which action will provide the best query performance and cost optimization?

A.Use AWS Glue to compress the CSV files with gzip

B.Create an S3 event notification to trigger a Lambda function that warms up Athena

C.Keep CSV format but ensure partitions are in the format year=YYYY/month=MM/day=DD

D.Convert the data to Parquet format and use the existing partition structure

AnswerD

Parquet is columnar and compressed, reducing scanned data and improving performance.

Why this answer

Converting data to Parquet and partitioning improves query performance and reduces cost because Athena scans less data. Option C (only partitioning) helps, but Parquet is columnar and more efficient. Option B (an S3 event notification that triggers Lambda) is irrelevant to query performance.

Option A (using Glue to compress the CSV files) may help, but Parquet already includes compression.

72

Multi-Selecthard

A data engineer needs to set up a data lake on S3 that supports both batch and streaming ingestion. The data must be queryable by Athena, Redshift Spectrum, and EMR. Which TWO configurations are essential? (Choose two.)

Select 2 answers

A.Store data in columnar formats like Parquet or ORC.

B.Use the AWS Glue Data Catalog as a central metadata repository.

C.Enable S3 Select on the target buckets.

D.Enable S3 versioning on all buckets.

E.Set up Kinesis Data Firehose for streaming ingestion.

AnswersA, B

Columnar formats improve query performance and reduce scan costs for Athena and Redshift Spectrum.

Why this answer

Options A and B are correct. Option A stores data in a columnar format for query efficiency. Option B ensures the Glue Data Catalog is used as a metastore for all services.

Option C is wrong because S3 Select is not required. Option D is wrong because S3 versioning is not essential for querying. Option E is wrong because Kinesis Data Firehose is needed only for streaming ingestion.

73

Multi-Selectmedium

A data engineer is designing a data ingestion pipeline that will receive up to 5 GB of data per hour from thousands of IoT devices. The data must be stored in Amazon S3 and analyzed in near real-time. Which TWO services should be used together to meet these requirements? (Choose TWO.)

Select 2 answers

A.AWS Lambda

B.Amazon Kinesis Data Analytics

C.Amazon Athena

D.Amazon Kinesis Data Firehose

E.Amazon Simple Queue Service (Amazon SQS)

AnswersB, D

Kinesis Data Analytics can run SQL queries on streaming data for near real-time analysis.

Why this answer

Amazon Kinesis Data Firehose can ingest streaming data and deliver it to S3. Amazon Kinesis Data Analytics can perform near real-time analysis on the data stream. Option E (Amazon SQS) is for decoupling applications, not streaming analytics.

Option A (AWS Lambda) can be used for processing but is not a streaming analytics service. Option C (Amazon Athena) is for ad-hoc queries on S3, not real-time analysis.

74

MCQhard

A data engineer is designing a data lake on Amazon S3. The data comes from various sources and must be stored in a way that supports both batch and real-time analytics. The engineer needs to partition the data to optimize query performance in Amazon Athena. Which partitioning strategy is MOST appropriate?

A.Partition by a hash of the record ID to distribute data evenly

B.Do not partition; use a single prefix for all data

C.Partition by year, then month, then day, then hour

D.Partition by source type and then by date

AnswerC

Hierarchical time partitioning is standard for time-series data and works well with Athena.

Why this answer

Option C is correct. Partitioning by year/month/day/hour allows efficient querying for both batch (daily) and real-time (hourly) use cases, and is a common practice. Option D (source type, then date) may cause small files.

Option A (hash of the record ID) is not helpful for pruning on time. Option B (a single prefix) defeats the purpose of partitioning.
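As an illustration of the year/month/day/hour layout, the sketch below derives a Hive-style S3 prefix from an event timestamp. The `key=value` folder naming is the convention Athena and Glue recognize for partitions; the function name is hypothetical, and naive timestamps are assumed to be in UTC.

```python
from datetime import datetime, timezone


def partition_prefix(ts: datetime) -> str:
    """Hive-style partition prefix (UTC) for a record timestamp."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    ts = ts.astimezone(timezone.utc)
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


prefix = partition_prefix(datetime(2024, 1, 5, 9, 30, tzinfo=timezone.utc))
# prefix == "year=2024/month=01/day=05/hour=09/"
```

Zero-padded values keep prefixes sortable, and a query that filters on `year`, `month`, `day`, and `hour` only reads objects under the matching prefixes.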

75

Multi-Selectmedium

A company is designing a data pipeline to ingest data from multiple sources into an Amazon S3 data lake. The data must be encrypted at rest and in transit. Which TWO actions should be taken to meet these requirements?

Select 2 answers

A.Enable Server-Side Encryption on the S3 bucket

B.Enable S3 Transfer Acceleration

C.Enforce HTTPS for all S3 API requests using bucket policy

D.Use client-side encryption before uploading

E.Use S3 VPC Endpoint

AnswersA, C

Encrypts objects at rest.

Why this answer

Server-side encryption (SSE-S3 or SSE-KMS) encrypts data at rest in S3. Using HTTPS (TLS) for all API calls encrypts data in transit. Client-side encryption is an alternative but not the standard AWS approach.

VPC endpoints and S3 Transfer Acceleration do not by themselves encrypt data in transit.
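Enforcing HTTPS is typically done with a bucket policy that denies any request where `aws:SecureTransport` is false. The sketch below builds such a policy document; the helper name and the bucket name in the usage line are hypothetical.

```python
import json


def https_only_policy(bucket: str) -> dict:
    """Bucket policy that denies all S3 actions over plain HTTP."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


policy_json = json.dumps(https_only_policy("example-data-lake"), indent=2)
```

Because the statement is an explicit Deny, it overrides any Allow elsewhere, so unencrypted requests are rejected regardless of the caller's IAM permissions.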
