Knowledge + Practice

CCNA ML Data Engineering Questions

76

MCQeasy

A data engineer needs to move 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The network bandwidth is limited to 100 Mbps. Which AWS service should be used to transfer the data most efficiently?

A.Amazon S3 Transfer Acceleration to speed up the transfer.

B.AWS Snowball Edge device to physically ship the data.

C.AWS Direct Connect to establish a dedicated network connection.

D.AWS Site-to-Site VPN to connect and copy data.

AnswerB

Snowball bypasses network limitations by shipping data physically.

Why this answer

Option B is correct because AWS Snowball Edge is designed for large data transfers when bandwidth is limited. Option A is wrong because S3 Transfer Acceleration speeds up transfers over the internet but is still constrained by the 100 Mbps link. Option C is wrong because Direct Connect is a network connection that must be provisioned and does not physically move data. Option D is wrong because a VPN runs over the same limited link and is not efficient for 50 TB.
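The bandwidth constraint can be checked with quick arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully utilized link with no protocol overhead, which is optimistic. The function name is illustrative.

```python
def transfer_days(size_tb: float, link_mbps: float) -> float:
    """Days needed to push size_tb terabytes over a link_mbps link."""
    bits = size_tb * 1e12 * 8            # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6)   # ideal case: no overhead, no contention
    return seconds / 86_400              # seconds per day


days = transfer_days(50, 100)
print(f"{days:.1f} days")  # 46.3 days
```

Even in this best case the copy takes well over a month, which is why shipping the data physically is the more efficient choice.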

77

MCQmedium

A machine learning team needs to preprocess large volumes of clickstream data stored in Amazon S3 before training a model. The preprocessing includes data cleaning, feature engineering, and normalization. The team wants to use a serverless solution that minimizes operational overhead. Which combination of services should the team use?

A.Amazon SageMaker Notebooks with custom Python scripts.

B.Amazon EMR with Spark clusters.

C.AWS Glue ETL jobs reading from and writing to S3.

D.Amazon Athena with SQL queries.

AnswerC

AWS Glue is serverless and designed for ETL on data lakes.

Why this answer

AWS Glue provides a serverless Spark environment for running ETL jobs on data in S3. Amazon SageMaker Processing jobs can also run preprocessing on managed infrastructure, but they are not among the options. Option A is wrong because SageMaker Notebooks are interactive, not automated. Option B is wrong because EMR requires cluster management. Option D is wrong because Athena is for ad-hoc queries, not complex transformations.

78

MCQmedium

A company uses Amazon Kinesis Data Analytics for Apache Flink to process streaming data. The application reads from a Kinesis data stream and writes results to a sink. The application is failing with an 'OutOfMemoryError'. The application has parallelism set to 4 and uses 1 Kinesis Processing Unit (KPU). What is the MOST likely cause and solution?

A.The application is using too many operators; reduce parallelism to 2.

B.The heap memory per operator is too low; increase parallelism to 8.

C.The checkpoint interval is too short; increase it to 5 minutes.

D.The buffer timeout is too high; reduce it to 50 ms.

AnswerB

Higher parallelism allocates more total memory across tasks.

Why this answer

With parallelism set to 4 but only 1 KPU, each operator slot receives a fraction of the available heap memory, leading to an OutOfMemoryError. Increasing parallelism to 8 distributes the workload across more slots, but more importantly, it forces Kinesis Data Analytics to allocate additional KPUs (each KPU provides 4 GB of memory), thereby increasing the total heap memory available to the application.

Exam trap

The trap here is that candidates assume increasing parallelism always reduces per-operator memory, but in Kinesis Data Analytics, parallelism is tied to KPU allocation, so increasing parallelism can actually increase total memory by provisioning more KPUs.

How to eliminate wrong answers

Option A is wrong because reducing parallelism would further decrease the number of operator slots, concentrating memory usage and worsening the OutOfMemoryError. Option C is wrong because a short checkpoint interval can cause backpressure and increased memory usage, but the primary issue here is insufficient heap memory per operator, not checkpoint timing. Option D is wrong because buffer timeout affects latency and batching behavior, not heap memory allocation; reducing it would increase the number of small records processed, potentially increasing memory pressure.
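The memory argument can be made concrete with a small calculation. This is a sketch under stated assumptions: the service is assumed to provision KPUs as the application parallelism divided by a parallelism-per-KPU setting, rounded up, and that setting is assumed to stay at 4 (the value implied by parallelism 4 on 1 KPU). The 4 GB per KPU figure is taken from the explanation above; the function name is hypothetical.

```python
import math

GB_PER_KPU = 4  # memory per KPU, as stated in the explanation


def kpu_memory_gb(parallelism: int, parallelism_per_kpu: int) -> tuple[int, int]:
    """Return (KPUs provisioned, total memory in GB) for the given settings."""
    kpus = math.ceil(parallelism / parallelism_per_kpu)
    return kpus, kpus * GB_PER_KPU


print(kpu_memory_gb(4, 4))  # (1, 4)
print(kpu_memory_gb(8, 4))  # (2, 8)
```

Under these assumptions, doubling parallelism from 4 to 8 doubles the provisioned memory while the share per parallel task stays the same.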

79

MCQmedium

A company is migrating its on-premises Hadoop cluster to AWS. They have a large amount of historical data stored in HDFS. Which approach is the most efficient for transferring this data to Amazon S3?

A.Use AWS Snowball Edge devices.

B.Use AWS Direct Connect.

C.Use AWS DataSync over the internet.

D.Use S3 Transfer Acceleration.

AnswerA

Snowball is designed for large offline data transfers.

Why this answer

AWS Snowball Edge is ideal for large data transfers when network bandwidth is limited. AWS DataSync transfers data over the network, which is slower for very large datasets. S3 Transfer Acceleration improves upload speed but still depends on the network. Direct Connect is also network-based.

80

Multi-Selecteasy

A company stores IoT sensor data in Amazon S3 and uses Amazon Athena for ad-hoc queries. The data is partitioned by date, but queries are still slow and expensive. Which TWO actions can improve query performance and reduce cost? (Choose TWO.)

Select 2 answers

A.Use S3 lifecycle policies to compact small files into larger ones

B.Convert the data from CSV to Parquet format

C.Disable server-side encryption on the S3 bucket

D.Use AWS Glue instead of Athena for querying

E.Increase the number of partitions to hour-level granularity

AnswersA, B

Fewer, larger files reduce the overhead of opening many files in Athena.

Why this answer

Option B (convert to Parquet) reduces the amount of data scanned. Option A (compact small files) reduces per-file overhead. Option E (hour-level partitions) can create many small files. Option D (use Glue instead of Athena) changes the service without addressing the cause. Option C (disable encryption) is not related to performance.

81

MCQhard

A data engineering team is building a real-time fraud detection pipeline. The pipeline ingests transaction data from an Amazon Kinesis Data Stream with 10 shards. Each shard produces about 500 records per second, each record is 2 KB. The data is processed by a Lambda function that runs for about 200 ms and then writes results to an Amazon DynamoDB table. The team notices that the Lambda function is experiencing a high number of throttles, and there are increasing numbers of records being retried. The Lambda function's reserved concurrency is set to 100. The DynamoDB table has 100 read capacity units and 100 write capacity units. Which change would most effectively reduce throttling and improve processing throughput?

A.Decrease the Lambda function's batch size to 10.

B.Increase the DynamoDB write capacity units to 1000.

C.Increase the number of shards in the Kinesis stream to 100.

D.Increase the Lambda function's reserved concurrency to 1000.

AnswerD

More concurrency allows the function to handle more concurrent invocations.

Why this answer

The Lambda function is throttling because the concurrent executions needed exceed the reserved concurrency. Each shard invokes Lambda with batches, and with 10 shards and a batch size of 100 (default), the number of concurrent invocations can be high. Increasing reserved concurrency to a higher value (e.g., 1000) allows more concurrent executions, reducing throttling.

However, if DynamoDB write capacity is also a bottleneck, increasing it might help. But the primary issue is Lambda throttling.

82

MCQhard

A data engineer is setting up a data lake on Amazon S3 for a large retail company. The data includes customer transactions, inventory, and web logs. The company wants to use AWS Glue for ETL and Amazon Athena for ad-hoc queries. The data is partitioned by year, month, day, and hour. The engineer notices that Athena queries are slow and often scan large amounts of data even when only a specific hour is needed. The engineer has already enabled partitioning and used columnar formats like Parquet. What additional step should the engineer take to optimize query performance and reduce data scanned?

A.Use a coarser partition layout, such as partitioning only by date, and leverage Hive-style partitioning with AWS Glue Crawlers to avoid excessive small files.

B.Convert the Parquet files to CSV format to reduce the overhead of columnar storage and improve compression.

C.Use S3 Select to push down filters to S3, reducing the amount of data scanned by Athena.

D.Increase the granularity of partitioning to include minute-level partitions to further limit data scanned.

AnswerA

Coarser partitions reduce the number of partitions and improve query planning.

Why this answer

Option A is correct because partitioning down to the hour can produce many small files, which increases metadata overhead. Using a coarser partition such as day, with Hive-style partitioning and AWS Glue Crawlers, reduces the number of partitions and improves query performance. Option B is incorrect because converting to CSV would increase scan size and slow down queries. Option C is incorrect because S3 Select filters within a single object and does not optimize queries across multiple objects. Option D is incorrect because increasing the number of partitions further (for example, adding minute) would worsen the small-files problem.

83

MCQmedium

A company uses Amazon Kinesis Data Streams for real-time clickstream analysis. The data is consumed by a Lambda function that enriches the records and stores them in Amazon S3. Recently, the Lambda function has been failing with throttling errors, and the consumer is falling behind. The team needs to increase the throughput of the consumer without changing the data format or the Lambda function code. What should the team do?

A.Add a second Kinesis data stream and send duplicate records to both.

B.Increase the batch size in the event source mapping for Lambda.

C.Increase the number of shards in the Kinesis data stream.

D.Increase the reserved concurrency of the Lambda function.

AnswerC

More shards increase the stream's capacity and number of Lambda consumers.

Why this answer

Option C is correct because increasing the number of shards increases the parallelism of the stream, allowing more Lambda invocations in parallel. Option A is wrong because adding a second stream would require splitting the data, which is not a direct solution. Option B is wrong because changing the batch size may help but not as effectively as increasing shards. Option D is wrong because raising the Lambda concurrency limit may help, but the bottleneck is the stream's throughput.

84

MCQmedium

A data scientist needs to run complex ETL transformations on a large dataset stored in Amazon S3. The transformations are written in PySpark and require occasional access to Hive metastore. The solution should minimize operational overhead and allow the data scientist to focus on code development. Which AWS service should be used?

A.Amazon Redshift

B.Amazon EMR

C.AWS Glue

D.Amazon SageMaker

AnswerB

EMR provides a managed Spark environment with Hive support and allows custom PySpark code.

Why this answer

Amazon EMR is a managed Hadoop framework that supports PySpark and Hive metastore. AWS Glue is good for simpler ETL but has limitations on custom PySpark code. Amazon SageMaker is for ML training, not general ETL.

Amazon Redshift is a data warehouse.

85

MCQeasy

A data engineer is tasked with building a system to process a continuous stream of IoT sensor data. The data must be processed in near real-time, and the results must be stored in Amazon S3 partitioned by hour. Which AWS service is the most cost-effective and simplest to implement?

A.Amazon Simple Queue Service (SQS) with AWS Lambda

B.Amazon Kinesis Data Firehose

C.Amazon Kinesis Data Streams with Amazon EC2 consumers

D.AWS Database Migration Service (DMS) for continuous replication

AnswerB

Serverless, automatic partitioning, and direct delivery to S3.

Why this answer

Amazon Kinesis Data Firehose is the simplest and most cost-effective way to ingest streaming data and deliver it to S3 with automatic partitioning by time. Option A (Amazon SQS with Lambda) is built for message queues, not streaming. Option C (Kinesis Data Streams with EC2 consumers) requires custom consumers. Option D (AWS Database Migration Service) is for database migration.

86

MCQhard

A company uses Amazon Kinesis Data Analytics for real-time anomaly detection on a stream of IoT sensor data. The application is experiencing high latency. The data volume has doubled. Which action would MOST effectively reduce latency?

A.Increase the Parallelism setting of the Kinesis Data Analytics application

B.Change the record format from JSON to Avro

C.Decrease the retention period of the source stream

D.Increase the number of shards in the source Kinesis stream

AnswerA

More KPUs allow parallel processing of records.

Why this answer

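Increasing the parallelism (and with it the number of KPUs) in Kinesis Data Analytics allows the application to process more data in parallel, reducing latency. Changing the record format (option B) may help but not as much as scaling. Reducing retention (option C) is not relevant. Increasing the number of shards in the source stream (option D) raises ingest capacity but does not add processing capacity to the application.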

87

Multi-Selecthard

A data engineer is designing a data pipeline to process streaming data from Amazon Kinesis Data Streams and store the results in Amazon S3 in Parquet format. The data must be available for querying in Amazon Athena within minutes of arrival. Which TWO services should be used together? (Choose TWO.)

Select 2 answers

A.Amazon EMR

B.Amazon Redshift

C.Amazon Kinesis Data Firehose

D.Amazon Kinesis Data Analytics

E.AWS Glue

AnswersC, E

Firehose can deliver streaming data to S3 in Parquet format.

Why this answer

Kinesis Data Firehose can write data to S3 in Parquet format with near-real-time delivery. AWS Glue provides the Data Catalog for table metadata, and Athena queries the data. Option A (EMR) is for batch processing, not streaming. Option B (Redshift) is for data warehousing, not immediate S3 querying. Option D (Kinesis Data Analytics) is for real-time analytics on streams, not for storage.

88

MCQhard

An IAM policy attached to an AWS Glue job allows reading and writing to an S3 bucket and accessing Glue Data Catalog. The job fails with an access denied error when trying to create a table in the Data Catalog. What is the likely issue?

A.The Glue Data Catalog is not enabled for the account.

B.The job does not have permission to write to the S3 bucket.

C.The S3 bucket is encrypted with a KMS key that the job cannot access.

D.The policy does not include the glue:CreateTable action.

AnswerD

Only GetTable and GetDatabase are allowed, not CreateTable.

Why this answer

The policy allows GetTable and GetDatabase actions, but not CreateTable. The job needs glue:CreateTable permission. The S3 actions are sufficient.

The error is specifically about creating a table.
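As an illustration, the sketch below builds a policy statement that adds the missing action alongside the two read actions named in the explanation. It is a minimal example, not the policy from the question: the wildcard `Resource` is an assumption made for brevity, and a real policy would normally be scoped to specific catalog, database, and table ARNs.

```python
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:CreateTable",  # the action the failing job was missing
            ],
            "Resource": "*",  # assumption for brevity; scope to ARNs in practice
        }
    ],
}

print(json.dumps(policy, indent=2))
```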

89

MCQmedium

A company is building a data pipeline that ingests data from multiple sources into a centralized data lake on Amazon S3. The data must be transformed before it is available for analysis. The pipeline should be event-driven, automatically triggering transformation jobs when new data arrives. Which combination of AWS services should be used?

A.Amazon Kinesis Data Analytics for transformation

B.Amazon S3 event notifications to invoke AWS Lambda, which triggers an AWS Glue job

C.Amazon EMR with automatic scaling

D.AWS Step Functions to orchestrate the pipeline

AnswerB

S3 events trigger Lambda, which starts a Glue ETL job; this is event-driven and serverless.

Why this answer

Amazon S3 can send events to AWS Lambda or SQS when new objects are created. AWS Glue can be triggered by Lambda to run ETL jobs. Kinesis Data Analytics (option A) is for streaming analytics, not batch. EMR (option C) requires cluster management. Step Functions (option D) can orchestrate the pipeline but adds complexity.
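The S3-to-Lambda-to-Glue pattern can be sketched as a small Lambda handler. This is a minimal example under stated assumptions: the job name `transform-new-data` and the `--source_path` job argument are hypothetical, and the optional `glue` parameter exists only so the handler can be exercised without AWS credentials. It relies on the boto3 Glue client's `start_job_run` call, which returns a `JobRunId`.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "transform-new-data"  # hypothetical Glue job name


def handler(event, context, glue=None):
    """Start one Glue job run per object in an S3 event notification."""
    if glue is None:
        import boto3  # provided by the Lambda Python runtime

        glue = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications arrive URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_path": f"s3://{bucket}/{key}"},  # hypothetical argument
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Starting one job run per object keeps the example short; a production pipeline would more likely batch arrivals or deduplicate events before starting a run.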

90

Multi-Selecthard

A data engineer is designing an ETL pipeline using AWS Glue to process data from Amazon S3 and load it into Amazon Redshift. The pipeline must handle incremental data loads and ensure data consistency. Which THREE features should the engineer use to achieve this? (Choose THREE.)

Select 3 answers

A.Pushdown predicates to filter partitions in S3

B.Glue data preview to validate transformation logic

C.Glue partition filters to limit data scanned

D.Redshift transactional tables with automatic commit

E.Glue job bookmarks to track processed data

AnswersA, D, E

Pushdown predicates reduce the amount of data read from S3, improving performance.

Why this answer

Option E (job bookmarks) enables incremental processing. Option A (pushdown predicates) reduces the data scanned. Option D (transactional tables) ensures consistency. Option C (partition filters) is less efficient. Option B (data preview) is for development.

91

MCQmedium

A company uses AWS Glue to catalog data in S3. Data is partitioned by year, month, day. The Glue crawler runs daily but sometimes misses new partitions. What should be done to ensure all partitions are cataloged?

A.Use a custom classifier to detect partition patterns.

B.Increase the crawler schedule to run every hour.

C.Configure the crawler to update all partitions on each run.

D.Enable partition indexing in the Glue table properties.

AnswerD

Partition indexing helps Athena query without full scan.

Why this answer

Option D is correct because enabling partition indexing in the Glue table properties allows the Glue Data Catalog to automatically discover and register new partitions as they are added to S3, without relying solely on crawler runs. This feature uses the Hive-style partition structure (e.g., year=2024/month=01/day=15) to index partitions, ensuring that even if the crawler misses a run, new partitions are still cataloged via the partition index.

Exam trap

The trap here is that candidates often assume increasing crawler frequency or using custom classifiers will solve partition discovery issues, but the correct solution is to leverage Glue's built-in partition indexing feature, which decouples partition discovery from crawler runs.

How to eliminate wrong answers

Option A is wrong because custom classifiers are used to infer the schema of data formats (e.g., CSV, JSON) and do not affect partition discovery or cataloging. Option B is wrong because increasing the crawler schedule to run every hour does not guarantee that all partitions are cataloged if the crawler fails or if partitions are added between runs; it only reduces the window of missed partitions but does not solve the underlying issue of missed partitions. Option C is wrong because configuring the crawler to update all partitions on each run would be inefficient and does not address the root cause of missed partitions; the crawler still depends on its schedule and may skip partitions if they are not present during the crawl.

92

Multi-Selectmedium

A data engineering team is designing a data lake on AWS for machine learning workloads. The data includes structured, semi-structured, and unstructured data. The team needs to ensure that the data is cataloged, easily discoverable, and can be queried by Amazon Athena and Amazon EMR. The team also wants to enforce fine-grained access control at the column and row level for sensitive data. Which combination of AWS services should the team use? (Select TWO.)

Select 2 answers

A.AWS Lake Formation

B.AWS Identity and Access Management (IAM)

C.AWS Glue Data Catalog

D.Amazon RDS for PostgreSQL

E.Amazon DynamoDB

AnswersA, C

Lake Formation provides fine-grained access control and integrates with Glue Catalog.

Why this answer

AWS Lake Formation is correct because it provides a centralized service to build, secure, and manage data lakes on AWS. It enables fine-grained access control at the column and row level for sensitive data, which directly meets the requirement for enforcing such controls. Additionally, Lake Formation integrates with Amazon Athena and Amazon EMR for querying and processing the cataloged data.

Exam trap

The trap here is that candidates often assume IAM alone can handle fine-grained data access control, but IAM lacks the column- and row-level filtering capabilities that Lake Formation provides through its integration with the Glue Data Catalog and query engines.

93

Multi-Selectmedium

A company wants to use Amazon SageMaker to train a model using data stored in Amazon S3. The data is sensitive and must be encrypted at rest and in transit. Which THREE steps should be taken to ensure data security?

Select 3 answers

A.Configure the SageMaker training job to use an IAM role with least privilege and enable network isolation

B.Enable default encryption on the S3 bucket using AWS KMS

C.Use an S3 VPC endpoint to keep traffic within the AWS network

D.Store the data in Amazon Redshift instead of S3

E.Allow internet access for the SageMaker notebook instance

AnswersA, B, C

Network isolation ensures no internet egress.

Why this answer

Encrypting the S3 bucket with KMS ensures encryption at rest. Using VPC endpoints for S3 ensures data does not traverse the public internet. Enabling encryption in transit between SageMaker and S3 (using HTTPS) is essential. Option D (Redshift) is irrelevant to the requirement, and Option E (allowing internet access) is not secure.

94

MCQeasy

A team is building a data pipeline using Amazon Kinesis Data Firehose to deliver real-time clickstream data to an Amazon S3 bucket. The data must be partitioned by year, month, day, and hour. Which configuration should the team use to achieve this?

A.Configure an S3 lifecycle rule to move data into partition folders after delivery

B.Use an AWS Lambda function to write data to S3 with the desired partition structure

C.Enable dynamic partitioning in Firehose and configure the partition keys as YYYY/MM/dd/HH

D.Use Amazon Athena partition projection to dynamically create partitions

AnswerC

Firehose dynamic partitioning automatically creates folder structures.

Why this answer

Option C is correct because Firehose has a built-in feature to add dynamic partitioning using keys like YYYY/MM/dd/HH based on the timestamp. Option A is wrong because S3 lifecycle rules do not repartition data on delivery. Option B is wrong because Lambda can partition the data but adds complexity. Option D is wrong because partition projection is an Athena feature, not a Firehose one.
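To illustrate the resulting layout, the sketch below computes the key prefix a record would land under for a given timestamp. It is plain Python that mimics the YYYY/MM/dd/HH pattern and is not Firehose configuration; it assumes timestamps are evaluated in UTC, and the `clickstream/` base prefix and function name are hypothetical.

```python
from datetime import datetime, timezone


def hourly_prefix(ts: datetime, base: str = "clickstream/") -> str:
    """Return a year/month/day/hour key prefix for the given timestamp."""
    ts = ts.astimezone(timezone.utc)  # normalize to UTC before formatting
    return f"{base}{ts:%Y/%m/%d/%H}/"


print(hourly_prefix(datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)))
# clickstream/2024/01/15/09/
```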

95

MCQmedium

A data engineer runs the AWS CLI command above to inspect an object in S3. The engineer wants to query this metadata (kafka-offset) using Amazon Athena to track processing progress. How can the engineer make this metadata available for Athena queries without modifying the existing data pipeline?

A.Use S3 object tags instead of metadata and query the tags using Athena.

B.Use an AWS Lambda function to copy the metadata into the object's content as a new line.

C.Use AWS Glue to create a table that includes the metadata as a column by running an ETL job.

D.Use Amazon Athena to query the object metadata directly by referencing the metadata field.

AnswerC

A Glue ETL job can read objects, extract metadata, and write to a table that Athena can query.

Why this answer

Option C is correct. S3 user-defined object metadata is not exposed to Athena, and Glue crawlers do not capture it by default. A Glue ETL job, however, can read each object's metadata and write it as a column in a table (for example, in Parquet) that Athena can query, leaving the existing pipeline untouched.

Option A is wrong because Athena cannot query S3 object tags, and switching from metadata to tags would change the pipeline. Option B is wrong because copying the metadata into the object's content requires rewriting the existing objects. Option D is wrong because Athena cannot query object metadata directly.

96

MCQmedium

A team wants to build a data pipeline that processes incoming JSON files from an S3 bucket and loads them into a Redshift table. The pipeline must handle schema evolution and data validation. Which combination of services would be MOST appropriate?

A.Amazon S3 + AWS Glue + Amazon Redshift

B.Amazon S3 + Amazon SQS + Amazon Redshift

C.Amazon S3 + AWS Data Pipeline + Amazon Redshift

D.Amazon S3 + AWS Lambda + Amazon Redshift

AnswerA

Glue provides schema inference and ETL.

Why this answer

AWS Glue can crawl the S3 data to infer schema, perform ETL transformations, and load into Redshift. SQS is not needed. Lambda is event-driven but lacks built-in schema evolution.

Data Pipeline is older and less flexible.

97

Multi-Selecthard

A company is using Amazon DynamoDB as a source for a machine learning pipeline. The data is exported nightly to Amazon S3 using DynamoDB Streams and an AWS Glue job. The Glue job reads the stream records, transforms them, and writes to S3 in Parquet format. The team notices that the Glue job is taking too long and consuming high DynamoDB read capacity. Which THREE actions would reduce the load on DynamoDB and improve performance? (Choose THREE.)

Select 3 answers

A.Use Amazon DynamoDB export to S3 (incremental) feature instead of Glue

B.Increase the DynamoDB write capacity units to handle the stream writes

C.Use DynamoDB Streams with AWS Lambda to write data directly to S3 in near-real-time, bypassing Glue

D.Increase the DynamoDB read capacity units to handle Glue's workload

E.Configure Glue to read from a S3 snapshot exported earlier instead of directly from DynamoDB

AnswersA, C, E

The export feature does not consume read capacity and can be automated.

Why this answer

Option A is correct because DynamoDB export to S3 (incremental) does not consume read capacity. Option C is correct because using DynamoDB Streams with a Lambda function to write to S3 directly avoids Glue's reads from DynamoDB. Option E is correct because using S3 as the source for Glue reduces DynamoDB reads. Option B is wrong because increasing write capacity does not affect reads. Option D is wrong because increasing read capacity does not reduce the load on DynamoDB.

98

Multi-Selectmedium

Which THREE factors should a data engineer consider when choosing between Amazon S3 and Amazon Redshift for storing large datasets used for machine learning? (Choose 3.)

Select 3 answers

A.Query performance and latency requirements

B.Encryption at rest capabilities

C.Cost of storage vs. compute

D.Data format and compression support

E.Data retention policies

AnswersA, C, D

Redshift provides fast SQL analytics; S3 queries are slower.

Why this answer

Options A, C, and D are key considerations. Option B is about security, which both services support. Option E is about data retention, not storage choice.

99

Multi-Selectmedium

A company is using Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by a fleet of EC2 instances running a custom consumer application. The consumer is falling behind and the shard iterator age is increasing. Which TWO actions should the data engineer take to improve consumer performance? (Choose TWO.)

Select 2 answers

A.Increase the number of shards in the stream

B.Decrease the data retention period

C.Use an AWS Lambda function to process the data

D.Enable enhanced fan-out on the stream

E.Switch to the Kinesis Client Library (KCL)

AnswersA, D

More shards increase the total read capacity.

Why this answer

Increasing the number of shards increases parallelism, and using enhanced fan-out allows each consumer to have its own read throughput. Option B is wrong because decreasing the retention period does not improve consumer performance. Option C is wrong because using a Lambda function may not help if the bottleneck is shard throughput. Option E is wrong because the KCL (Kinesis Client Library) is already the standard way to build consumers and is not a direct fix.
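The effect of both correct actions on read capacity can be estimated with rough arithmetic. The sketch assumes the standard limits of 2 MB/s of read throughput per shard shared by all polling consumers, and 2 MB/s per shard for each consumer registered with enhanced fan-out. The shard and consumer counts are hypothetical, as is the function name.

```python
SHARD_READ_MB_PER_S = 2.0  # per-shard read limit assumed for both modes


def per_consumer_read_mb_per_s(shards: int, consumers: int, enhanced_fan_out: bool) -> float:
    """Read throughput available to each consuming application, in MB/s."""
    total = shards * SHARD_READ_MB_PER_S
    # Without enhanced fan-out, consumers split each shard's read throughput.
    return total if enhanced_fan_out else total / consumers


print(per_consumer_read_mb_per_s(4, 2, enhanced_fan_out=False))  # 4.0
print(per_consumer_read_mb_per_s(8, 2, enhanced_fan_out=False))  # 8.0
print(per_consumer_read_mb_per_s(4, 2, enhanced_fan_out=True))   # 8.0
```

Adding shards raises the total, while enhanced fan-out removes the sharing between consumers; the two changes are complementary.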

100

MCQmedium

A data engineer runs the AWS CLI command shown in the exhibit to find large log files in S3. The command returns an empty list, but the engineer knows there are files larger than 1 MB in that prefix. What is the MOST likely issue?

A.The JMESPath query syntax is incorrect

B.The command does not paginate through all objects; only the first 1000 are returned

C.The prefix is incorrect; there are no objects under that prefix

D.The Size value is in kilobytes, not bytes

AnswerB

list-objects limits to 1000 keys; use --max-items or pagination.

Why this answer

Option B is correct. The list-objects command returns up to 1000 objects per call. If there are more objects, pagination is needed. Option A is incorrect because the query syntax is correct. Option C is incorrect because the prefix is fine. Option D is incorrect because the unit is bytes, so 1000000 is 1 MB.
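A minimal sketch of the paginated approach follows, assuming each page has the shape returned by S3 list calls: a `Contents` list whose entries carry `Key` and `Size` in bytes. The function name and the 1,000,000-byte default threshold are illustrative.

```python
from typing import Iterable, Iterator


def large_keys(pages: Iterable[dict], min_bytes: int = 1_000_000) -> Iterator[str]:
    """Yield keys of objects larger than min_bytes across every page of results."""
    for page in pages:
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            if obj["Size"] > min_bytes:
                yield obj["Key"]
```

With boto3, `pages` could come from `boto3.client("s3").get_paginator("list_objects_v2").paginate(Bucket=..., Prefix=...)`, which keeps requesting pages until the listing is exhausted, so matches beyond the first 1000 keys are not missed.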

101

MCQmedium

A company captures streaming data from IoT devices using Amazon Kinesis Data Streams. The data is consumed by a custom application that processes records in near real-time. Recently, the application has been falling behind, and the stream is showing increased 'iterator age' metrics in CloudWatch. Which action is MOST likely to reduce the iterator age?

A.Increase the data retention period of the stream

B.Decrease the number of shards in the stream

C.Increase the number of shards in the stream

D.Reduce the data retention period of the stream

AnswerC

More shards increase throughput, allowing the consumer to process faster.

Why this answer

Option C is correct because increasing the number of shards increases the capacity of the stream, allowing more parallel consumers and reducing the backlog. Option A is wrong because increasing the retention period only keeps records longer and does not speed up processing. Option B is wrong because decreasing shards reduces capacity, worsening the issue. Option D is wrong because reducing the retention period does not affect processing speed.

102

Multi-Selecthard

A company uses Amazon Redshift for data warehousing. The data engineering team notices that query performance has degraded over time. Which THREE actions should the team take to improve performance? (Choose THREE.)

Select 3 answers

A.Increase the number of nodes in the Redshift cluster

B.Define appropriate sort keys on large tables

C.Define appropriate distribution keys on large tables

D.Delete old data that is no longer needed

E.Run the ANALYZE command to update table statistics

AnswersB, C, E

Sort keys minimize the amount of data scanned, improving query performance.

Why this answer

Sort keys help the query optimizer scan less data. Distribution keys reduce data shuffling. ANALYZE commands update statistics for the optimizer.

Option A (increasing node count) is costly and not always necessary. Option D (deleting old data) may help but is not a direct performance tuning technique.

Practice this question →

103

MCQmedium

A data scientist is using Amazon SageMaker to train a model. The training data is stored in Amazon S3 and is approximately 500 GB. The data scientist notices that the training job is taking a long time to start because the data is being copied to the training instance's storage. The data scientist wants to reduce the startup time for subsequent training jobs. Which action should the data scientist take?

A.Use Pipe input mode instead of File input mode for the training job

B.Use an EBS-optimized instance type

C.Use Amazon FSx for Lustre as a high-performance file system mounted to the training instance

D.Increase the size of the training instance's Amazon EBS storage volume

AnswerA

Pipe mode streams data from S3 directly, reducing startup time.

Why this answer

Option A is correct because Pipe input mode streams data directly from S3 to the training algorithm without downloading it first, reducing startup time. Option C is wrong because FSx for Lustre is not needed for simple streaming. Option D is wrong because increasing instance storage does not address the data transfer issue.

Option B is wrong because using EBS-optimized instances does not change the data loading mechanism.

Practice this question →

104

MCQmedium

A company is using Amazon Kinesis Data Firehose to deliver streaming data to an S3 bucket. The data is JSON and must be transformed into Parquet format before delivery. Which approach should the data engineer use?

A.Send the data to Amazon Kinesis Data Analytics to convert to Parquet

B.Configure Kinesis Data Firehose to convert the record format to Parquet using a schema from AWS Glue Data Catalog

C.Use an AWS Lambda function to transform JSON to Parquet and write to S3

D.Use an AWS Glue ETL job to read from Firehose and write Parquet to S3

AnswerB

Firehose can convert JSON to Parquet using a Glue Data Catalog schema.

Why this answer

Option B is correct because Kinesis Data Firehose supports built-in record format conversion to Parquet, using a schema from the AWS Glue Data Catalog. Option A is wrong because Kinesis Data Analytics is for streaming analytics, not format conversion. Option C is wrong because Lambda can transform records but must hand its output back to Firehose. Option D is wrong because AWS Glue is an ETL service, not a real-time transformation step.

Practice this question →

105

MCQeasy

A company is using Amazon S3 as a data lake. The data engineering team needs to catalog the schema of the data and make it available for querying with Amazon Athena. Which AWS Glue component should be used?

A.AWS Glue Studio

B.AWS Glue Crawlers

C.AWS Glue ETL jobs

D.AWS Glue DataBrew

AnswerB

Crawlers populate the Glue Data Catalog with table definitions.

Why this answer

AWS Glue Crawlers automatically scan data sources, infer schemas, and populate the AWS Glue Data Catalog. Option C is wrong because Glue ETL jobs are for transforming data. Option D is wrong because Glue DataBrew is for visual data preparation.

Option A is wrong because Glue Studio is a visual ETL development tool.

Practice this question →

106

Multi-Selecthard

A company uses Amazon S3 to store historical transaction data in CSV format. The data is partitioned by transaction_date. A data analyst runs Amazon Athena queries that frequently filter on customer_id and transaction_date. The queries are slow and expensive. The team needs to improve query performance and reduce cost. Which combination of actions should the team take? (Choose TWO.)

Select 2 answers

A.Enable S3 Select pushdown in Athena to reduce data transfer.

B.Convert the data to JSON format for better query performance.

C.Convert the data from CSV to Parquet format.

D.Reorganize the data by partitioning on customer_id first, then transaction_date.

E.Increase the number of Athena query workers.

AnswersC, D

Parquet is columnar and compressed, reducing data scanned.

Why this answer

Options C and D are correct. Converting to Parquet reduces data scanned due to columnar storage and compression. Partitioning by customer_id (which is frequently filtered) improves partition pruning.

Option E is wrong because Athena is serverless and does not expose query workers to scale. Option B is wrong because converting to JSON increases size. Option A is wrong because S3 Select may not integrate with Athena directly.

Practice this question →

107

MCQmedium

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data is consumed by a Lambda function that writes to an S3 bucket. Recently, the Lambda function started failing with 'ProvisionedThroughputExceededException' errors. What is the MOST likely cause?

A.The data retention period of the stream is too short.

B.The S3 bucket has insufficient write capacity.

C.The Kinesis stream has too few shards for the data volume.

D.The Lambda function's reserved concurrency is set too high.

AnswerC

Insufficient shards cause ProvisionedThroughputExceededException.

Why this answer

The 'ProvisionedThroughputExceededException' error in Amazon Kinesis Data Streams indicates that the data ingestion rate exceeds the write capacity of the stream's shards. Each shard supports up to 1 MB/s or 1,000 records/s for writes. If the clickstream data volume surpasses this limit, the Lambda function, which reads from the stream, will encounter this exception.

Increasing the number of shards scales the write capacity to match the data volume.

Exam trap

The trap here is that candidates confuse Kinesis throughput limits with Lambda concurrency or S3 capacity, but the specific exception name 'ProvisionedThroughputExceededException' is a direct indicator of insufficient shard write capacity in Kinesis.

How to eliminate wrong answers

Option A is wrong because the data retention period (default 24 hours, up to 365 days) controls how long records are stored, not the write throughput; a short retention period would cause data loss, not throughput errors. Option B is wrong because S3 buckets have virtually unlimited write capacity (thousands of PUT requests per second per prefix) and do not produce 'ProvisionedThroughputExceededException' errors, which are specific to Kinesis. Option D is wrong because setting reserved concurrency too high for the Lambda function would not cause a Kinesis throughput error; it might lead to throttling of the Lambda itself, but the exception originates from the Kinesis stream's shard limits.
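The per-shard write limits quoted above can be turned into a rough sizing calculation. The sketch below is illustrative only: the function name and the sample traffic figures are assumptions, and the limits are the 1 MB/s and 1,000 records/s values stated in the explanation.

```python
import math

# Per-shard write limits as quoted in the explanation above.
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000


def min_shards_for_writes(mb_per_sec, records_per_sec):
    """Return the smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_bytes, by_records)


# Hypothetical clickstream load: 3.5 MB/s and 2,500 records/s.
print(min_shards_for_writes(3.5, 2500))  # 4
```

Whichever limit is hit first decides the shard count; in the hypothetical load above, the byte rate is the binding constraint.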

Practice this question →

108

MCQeasy

An AWS Glue job is failing with an error that it cannot access an S3 bucket. The IAM role attached to the Glue job is shown in the exhibit. What is the MOST likely cause of the failure?

A.The S3 bucket has a bucket policy that denies access to this role

B.The role lacks S3 permissions

C.The role does not have permission to call S3 APIs

D.The trust policy does not allow Glue to assume the role

AnswerA

An explicit Deny in a bucket policy overrides the permissions granted to the role.

Why this answer

Option A is correct because the trust policy allows the Glue service to assume the role and the role grants S3 access, so the remaining likely cause is a bucket policy that denies access to the role. Option D is wrong because the trust policy allows Glue. Option B is wrong because the role has S3 full access.

Option C is wrong because the role explicitly allows S3 access.

Practice this question →

109

MCQmedium

A data engineering team is building a real-time fraud detection system. Transactions are ingested via Amazon Kinesis Data Streams, and a machine learning model (deployed on Amazon SageMaker) scores each transaction. The team needs to store the raw transactions and the model's predictions in Amazon S3 for later analysis. Which architecture should the team use?

A.Use AWS Lambda to read from Kinesis, invoke SageMaker, and write directly to S3.

B.Use Amazon Kinesis Data Firehose with a transformation Lambda to call SageMaker.

C.Use Amazon Kinesis Data Analytics for Apache Flink to enrich records with SageMaker predictions, then output to Firehose for S3.

D.Use AWS Lambda to invoke the SageMaker endpoint for each record, then write to S3 via Firehose.

AnswerC

Flink can handle high-throughput, call SageMaker per record, and output to Firehose.

Why this answer

Option C is correct. Use Kinesis Data Analytics with a Flink application to enrich each record with the SageMaker prediction, then output to Kinesis Data Firehose for delivery to S3. Option A is wrong because Lambda cannot directly invoke SageMaker for every record in high-throughput streams due to concurrency limits.

Option B is wrong because Kinesis Data Firehose does not support invoking SageMaker directly. Option D is wrong because Lambda is not suitable for high-frequency real-time scoring.

Practice this question →

110

MCQhard

A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time sensor data. The application reads from a Kinesis data stream, performs windowed aggregations, and writes results to an S3 bucket. Recently, the application has been experiencing high latency and checkpoint failures. What is the MOST likely cause?

A.The number of shards in the Kinesis stream is insufficient for the data volume

B.The S3 destination bucket is located in a different AWS Region than the Kinesis application

C.The record size in the Kinesis stream exceeds the 1 MB limit

D.The parallelism of the Flink application is set too low for the number of shards

AnswerB

Cross-region writes increase latency and can cause checkpoint timeouts.

Why this answer

Option B is correct. If the S3 bucket is in a different Region, cross-Region data transfer can introduce latency and checkpoint failures. Option D (parallelism) would cause resource issues, not necessarily checkpoint failures.

Option A (shard count) would cause throttling. Option C (record size) is wrong because the record size is capped at 1 MB.

Practice this question →

111

Multi-Selecthard

Which THREE of the following are best practices for optimizing performance of Amazon EMR clusters? (Choose 3)

Select 3 answers

A.Use Spot Instances for task nodes

B.Consolidate small files into larger ones before processing

C.Use instance fleets for heterogeneous instances

D.Enable EBS optimization on EC2 instances

E.Use Spot Instances to reduce costs

AnswersB, C, D

Consolidation reduces overhead and improves performance.

Why this answer

Option B is correct because consolidating small files into larger ones before processing on Amazon EMR reduces the overhead of the Hadoop Distributed File System (HDFS) metadata operations. Each small file consumes a block of memory in the NameNode, and processing many small files leads to excessive task launches and I/O overhead, degrading performance. Using tools like `s3-dist-cp` to combine files into fewer, larger blocks improves throughput and reduces job execution time.

Exam trap

The trap here is that candidates confuse cost optimization strategies (like Spot Instances) with performance optimization, leading them to select options A or E even though the question explicitly asks for performance best practices.

Practice this question →

112

MCQhard

An ML engineer runs the AWS CLI command shown in the exhibit on a file in S3. The engineer wants to use this file in a SageMaker training job. What does the output reveal about the data?

A.The file is 5 GB and is stored as a CSV

B.The file is 5 MB and is stored as a CSV

C.The file is in Parquet format with two features and one label

D.The file is versioned and can be accessed by version ID

AnswerA

ContentLength indicates 5 GB, and metadata shows format: csv.

Why this answer

Option A is correct because ContentLength is 5,368,709,120 bytes = 5 GB, and the file is a CSV. Option B is wrong because the file is 5 GB, not 5 MB. Option D is wrong because versioning is disabled (VersionId: null).

Option C is wrong because the metadata shows the file is CSV, not Parquet.
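The size arithmetic in the explanation can be checked directly. This is a minimal sketch; the helper name is an assumption, and the only input is the ContentLength value quoted above.

```python
# ContentLength value quoted in the explanation above, in bytes.
CONTENT_LENGTH = 5_368_709_120


def bytes_to_units(n_bytes):
    """Return the size in MiB and GiB (binary units, powers of 1024)."""
    return {"MiB": n_bytes / 1024**2, "GiB": n_bytes / 1024**3}


sizes = bytes_to_units(CONTENT_LENGTH)
print(sizes)  # {'MiB': 5120.0, 'GiB': 5.0}
```

The value is 5,120 MiB, which is why "5 MB" is off by a factor of 1,024.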

Practice this question →

113

MCQhard

Refer to the exhibit. A data engineer runs the AWS CLI command to check an object in an S3 bucket. The bucket is part of a data lake and is configured with versioning enabled. However, the output shows "VersionId": null. What is the most likely reason for this?

A.The object is encrypted using SSE-S3, which hides the version ID

B.The object was uploaded before versioning was enabled

C.The command must include the --version-id parameter to display the version ID

D.Versioning is not enabled on the bucket

AnswerB

Objects uploaded before versioning was enabled have a null version ID.

Why this answer

Option B is correct. The head-object command was run without the --version-id parameter, so it retrieves the latest version, and in a versioning-enabled bucket a newly written object would have a version ID. The null value indicates that the object itself is not versioned, which happens when it was uploaded before versioning was enabled on the bucket.

Option C is wrong because the command does not need the --version-id parameter to show the version ID. Option D is wrong because the question states that versioning is enabled on the bucket. Option A is wrong because server-side encryption does not affect versioning.

Practice this question →

114

MCQhard

A company uses Amazon EMR to run Spark jobs on a transient cluster that processes data from S3. The jobs are failing with 'OutOfMemory' errors. The data engineer has already increased the executor memory. Which additional configuration change would MOST likely resolve the issue?

A.Use fewer, larger instance types for the core nodes

B.Increase the number of partitions in the data

C.Increase the driver memory

D.Increase the number of executors

AnswerB

More partitions means smaller data per task, reducing memory usage.

Why this answer

Increasing the number of partitions reduces the amount of data each executor processes at a time, alleviating memory pressure. Option D is wrong because more executors may increase parallelism but do not reduce per-executor data. Option C is wrong because increasing driver memory addresses driver-side issues.

Option A is wrong because using fewer nodes may worsen the problem.
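The effect of partition count on per-task data volume is simple division, sketched below. The dataset size, partition counts, and 128 MB target are hypothetical figures chosen for illustration; they do not come from the question.

```python
import math


def mb_per_partition(total_mb, num_partitions):
    """Average amount of data a single task handles, in MB."""
    return total_mb / num_partitions


def partitions_needed(total_mb, target_mb_per_partition):
    """Smallest partition count that keeps each partition at or under the target."""
    return math.ceil(total_mb / target_mb_per_partition)


# Hypothetical 200 GB input (204,800 MB).
TOTAL_MB = 200 * 1024

print(mb_per_partition(TOTAL_MB, 200))   # 1024.0 MB per task
print(mb_per_partition(TOTAL_MB, 1600))  # 128.0 MB per task
print(partitions_needed(TOTAL_MB, 128))  # 1600
```

In a Spark job, a higher partition count is typically requested with a call such as `DataFrame.repartition`, so that each task holds less data in memory at once.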

Practice this question →

115

MCQmedium

A team is building a data pipeline that ingests data from an Amazon S3 bucket, transforms it using AWS Glue, and loads it into Amazon Redshift for analysis. The Glue job runs on a schedule every hour. The team has noticed that the job takes longer than expected and sometimes fails due to memory issues. The data volume is variable, with occasional spikes. Which solution should the team implement to optimize the pipeline?

A.Decrease the number of workers to reduce memory contention.

B.Enable job bookmarks to process only new data and use a G.2X worker type for more memory.

C.Increase the schedule frequency to run the job more often with smaller data increments.

D.Replace AWS Glue with Amazon EMR using Spark.

AnswerB

Job bookmarks prevent reprocessing and larger workers provide more memory.

Why this answer

Option B is correct because Glue job bookmarks track processed data and avoid reprocessing, and a larger worker type with more memory can handle spikes. Option C is wrong because increasing the schedule frequency would not address the root cause. Option A is wrong because reducing the number of workers would worsen memory issues.

Option D is wrong because Glue already uses Spark internally.

Practice this question →

116

MCQeasy

A data scientist uses Amazon SageMaker to train a model. The training dataset is 10 GB and stored in S3. The training job uses a ml.m5.large instance. The data must be available on the local file system during training. Which input mode should be used?

A.Local input mode

B.Batch input mode

C.File input mode

D.Pipe input mode

AnswerC

File mode downloads data to the local file system, making it available for training.

Why this answer

File input mode is correct because it downloads the entire training dataset from S3 to the storage volume attached to the ml.m5.large training instance before training begins, so the data is available on the local file system as required. A 10 GB dataset is workable in this mode as long as the storage volume configured for the training job is large enough to hold it.

Exam trap

The trap here is that candidates may confuse 'File input mode' with 'Pipe input mode' and incorrectly choose Pipe mode for local file availability, or invent 'Local input mode' as a plausible-sounding option.

How to eliminate wrong answers

Option A is wrong because 'Local input mode' is not a valid SageMaker input mode; the correct term is 'File input mode' for local file system access. Option B is wrong because 'Batch input mode' is not a SageMaker input mode; SageMaker uses 'File' or 'Pipe' modes, and batch processing refers to Batch Transform jobs, not training input. Option D is wrong because 'Pipe input mode' streams data directly from S3 to the training algorithm without writing to the local file system, which does not satisfy the requirement that data must be available on the local file system during training.

Practice this question →

117

MCQeasy

A data engineering team needs to process streaming data from thousands of IoT devices. The data must be ingested with low latency and processed in near real-time to detect anomalies. Which AWS service should they use for ingestion?

A.Amazon Kinesis Data Firehose

B.Amazon Kinesis Data Analytics

C.Amazon S3

D.Amazon Kinesis Data Streams

AnswerD

Kinesis Data Streams is the correct service for real-time streaming ingestion.

Why this answer

Amazon Kinesis Data Streams is designed for real-time streaming data ingestion and can handle large throughput with low latency. Option C is wrong because S3 is for object storage, not streaming. Option B is wrong because Kinesis Data Analytics is for processing.

Option A is wrong because Kinesis Data Firehose is for loading data to destinations.

Practice this question →

118

MCQhard

A data pipeline uses AWS Lambda to process small files (10-50 MB) from an S3 bucket and write results to DynamoDB. The Lambda function times out after 15 seconds for larger files. The team wants to handle files up to 100 MB without changing the Lambda code. Which solution is MOST cost-effective?

A.Use AWS Glue Python shell job to replace Lambda

B.Increase the Lambda function timeout to 5 minutes

C.Use Amazon ECS with AWS Fargate to run the processing task

D.Configure an SQS queue to buffer the S3 events and batch them

AnswerB

Lambda allows up to 15 minutes, and 5 minutes is sufficient for 100 MB. No code changes needed.

Why this answer

Increasing Lambda timeout is the simplest and most cost-effective solution for occasional larger files. Using ECS Fargate or Glue is overkill and more expensive. SQS does not solve the timeout issue.

Practice this question →

119

MCQmedium

A media company ingests video metadata from multiple sources into an Amazon S3 bucket. Each metadata record is a JSON file about 2 KB. They use AWS Glue ETL jobs to process these files and load them into Amazon Redshift for analytics. The jobs currently run hourly and take about 10 minutes to process all new files. However, the company is growing and expects the number of files to increase 100x. The data engineering team wants to minimize processing time and cost. The Glue job currently reads all files from the S3 bucket using a full scan. What should they do to optimize the pipeline?

A.Consolidate the small JSON files into larger files using a scheduled job

B.Convert the data to Parquet format and partition it

C.Increase the number of Glue DPUs to process files faster

D.Use S3 event notifications to trigger Glue jobs only for new files

AnswerD

Event-driven eliminates full scan and reduces cost.

Why this answer

Using S3 event notifications to trigger Glue jobs on new objects eliminates the full scan and reduces latency. Option C is wrong because increasing DPUs costs more. Option A is wrong because consolidating files reduces the number of objects but does not stop the job from scanning the whole bucket.

Option B is wrong because converting to Parquet helps, but the job would still scan all files.

Practice this question →

120

MCQmedium

A company is building a data pipeline to process sensitive customer data. The pipeline uses AWS Glue for ETL and stores results in Amazon S3. The security team requires that all data be encrypted at rest in S3 using customer-managed AWS KMS keys. Additionally, the Glue job must be able to write encrypted data to S3. What should the data engineer do to meet these requirements?

A.Attach a policy to the Glue job's IAM role that includes kms:GenerateDataKey and kms:Decrypt actions for the KMS key.

B.Use S3 server-side encryption with customer-provided keys (SSE-C).

C.Use S3 server-side encryption with SSE-S3, which is enabled by default.

D.Configure an S3 bucket policy to enforce encryption and attach it to the Glue job's IAM role.

AnswerA

These permissions allow Glue to encrypt and decrypt data using the KMS key.

Why this answer

The Glue job's IAM role needs kms:GenerateDataKey and kms:Decrypt permissions on the KMS key. Option D is wrong because an S3 bucket policy alone doesn't grant the Glue job access to KMS. Option C is wrong because SSE-S3 uses Amazon-managed keys, not customer-managed keys.

Option B is wrong because SSE-C requires the customer to supply the encryption key with each request, which the Glue job cannot do.
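A policy statement granting the two KMS actions named above might look like the following sketch. The key ARN and the function name are hypothetical placeholders; only the two action names come from the explanation.

```python
import json

# Hypothetical key ARN, used only for illustration.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"


def glue_kms_policy(key_arn):
    """Build an IAM policy document allowing the two KMS actions on one key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Actions named in the explanation above.
                "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
                "Resource": key_arn,
            }
        ],
    }


print(json.dumps(glue_kms_policy(KMS_KEY_ARN), indent=2))
```

Scoping `Resource` to a single key ARN, rather than `*`, keeps the role limited to the customer-managed key the security team designated.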

Practice this question →

121

MCQhard

A company is building a near-real-time dashboard using data from multiple sources. They need to aggregate millions of events per second with sub-second latency. The architecture must be fully managed and minimize operational overhead. Which service should they use for the aggregation layer?

A.Amazon Kinesis Data Analytics for Apache Flink.

B.AWS Lambda functions triggered by Kinesis Data Streams.

C.Amazon EMR with Spark Streaming.

D.Amazon Redshift with materialized views refreshed frequently.

AnswerA

Kinesis Data Analytics with Flink provides low-latency, stateful stream processing at scale.

Why this answer

Option A is correct because Kinesis Data Analytics is designed for real-time streaming analytics with sub-second latency and is fully managed. Option D is wrong because Redshift is not designed for sub-second streaming ingestion; it is batch-oriented. Option C is wrong because EMR requires cluster management.

Option B is wrong because Lambda has concurrency limits and is not optimized for millions of events per second.

Practice this question →

122

MCQhard

An e-commerce company uses Amazon Kinesis Data Firehose to deliver clickstream data to Amazon S3. The data arrives at unpredictable rates, with occasional bursts. The company needs to ensure data is delivered within 60 seconds of ingestion, and the data must be partitioned by year/month/day/hour. Which configuration meets these requirements?

A.Set the buffer size to 1 MB and disable dynamic partitioning

B.Use a Lambda function to process data and write to S3 with partitioning

C.Use AWS Glue streaming ETL to read from Firehose and write to S3

D.Set the buffer interval to 60 seconds and enable dynamic partitioning

AnswerD

Buffer interval controls delivery frequency; dynamic partitioning creates time-based folders.

Why this answer

Setting the buffer interval to 60 seconds and enabling dynamic partitioning ensures data is delivered within 60 seconds and automatically partitioned by time. Option A (buffer size 1 MB) would cause excessive small files. Option B (Lambda) adds latency.

Option C (Glue streaming) is not directly integrated with Firehose.
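To make the required year/month/day/hour layout concrete, the sketch below derives the S3 key prefix a record would land under from its timestamp. It does not call Firehose; the function name and the sample timestamp are assumptions for illustration.

```python
from datetime import datetime, timezone

# Buffer interval from the correct option, in seconds.
BUFFER_INTERVAL_SECONDS = 60


def hourly_prefix(ts):
    """Return the year/month/day/hour key prefix for a timestamp."""
    return ts.strftime("%Y/%m/%d/%H/")


# Hypothetical arrival time for one clickstream record.
arrival = datetime(2024, 5, 17, 9, 30, tzinfo=timezone.utc)
print(hourly_prefix(arrival))  # 2024/05/17/09/
```

Every record arriving in the same hour shares one prefix, so queries that filter on an hour only need to read that folder.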

Practice this question →

123

MCQhard

A data engineering team is designing a data lake on Amazon S3. They need to enforce encryption at rest for all data stored in the bucket. The security policy requires that the encryption keys be managed by the organization using AWS Key Management Service (KMS), and that the bucket must deny uploads of unencrypted objects. Which bucket policy should be applied?

A.A bucket policy that denies PutObject unless the request includes the 'x-amz-server-side-encryption' header with value 'AES256'

B.A bucket policy that denies PutObject if the 'x-amz-server-side-encryption' header is not present

C.A bucket policy that denies PutObject unless the request includes the 'x-amz-server-side-encryption-aws-kms-key-id' header matching the desired KMS key ID

D.Enable default encryption on the bucket with AWS-KMS

AnswerC

This enforces the use of a specific KMS key.

Why this answer

To enforce encryption, a bucket policy can deny PutObject if the object is not encrypted with the required KMS key. The condition key 's3:x-amz-server-side-encryption-aws-kms-key-id' checks the key ID. Option C correctly denies requests that do not include the required key.

Option A requires SSE-S3 (AES256) instead of KMS; Option B is incomplete because it checks only that the encryption header is present, not which key is used; Option D sets default encryption but is not a bucket policy that denies uploads.
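A bucket policy of the kind described above could be assembled as in this sketch. The bucket name, key ARN, statement ID, and function name are hypothetical; the condition key is the one named in the explanation.

```python
import json

# Hypothetical names, used only for illustration.
BUCKET = "example-data-lake-bucket"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"


def deny_wrong_key_policy(bucket, key_arn):
    """Build a bucket policy that denies PutObject unless the given KMS key is used."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyPutWithoutRequiredKmsKey",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    # Deny when the request's KMS key ID differs from the required key.
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption-aws-kms-key-id": key_arn
                    }
                },
            }
        ],
    }


print(json.dumps(deny_wrong_key_policy(BUCKET, KMS_KEY_ARN), indent=2))
```

Because the statement is a Deny, it takes precedence over any Allow that a role or user policy grants for the same upload.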

Practice this question →

124

MCQeasy

A machine learning team is using Amazon SageMaker to train models on a large dataset stored in Amazon S3. The dataset is 5 TB in size and is partitioned by date. The team wants to minimize data transfer costs and reduce training time by caching frequently accessed data locally on the training instances. The training instances are EC2 instances with attached Amazon EBS volumes. The team is considering using SageMaker Pipe mode to stream data directly from S3, but they are concerned about network bandwidth. Which approach should the team use to optimize data loading for training?

A.Use Amazon FSx for Lustre as a high-performance file system linked to the S3 bucket, and mount it on the training instances.

B.Use SageMaker File mode with Amazon EFS, which allows multiple training instances to share the same file system and caches data from S3.

C.Increase the size of the EBS volumes attached to the training instances and copy the entire dataset to the volumes before training.

D.Use SageMaker Pipe mode to stream data from S3 directly to the training algorithm, which automatically caches data in memory.

AnswerB

File mode with EFS enables caching and sharing, reducing repeated S3 downloads.

Why this answer

Option B is correct because Amazon SageMaker File mode with Amazon Elastic File System (EFS) provides a shared file system that can cache data across training jobs, reducing the need to repeatedly download from S3. Option A is incorrect because FSx for Lustre is optimized for high-performance computing but not specifically for SageMaker training. Option D is incorrect because SageMaker Pipe mode streams data but does not cache.

Option C is incorrect because EBS volumes are attached per instance and cannot be shared across jobs for caching.

Practice this question →

125

MCQeasy

A data engineer needs to extract data from an Amazon RDS for MySQL database into Amazon S3 for further processing. The data volume is 2 TB and the job must run daily within a 1-hour window. Which AWS service is most suitable for this task?

A.Amazon Kinesis Data Firehose

B.AWS Database Migration Service (DMS)

C.Amazon Athena

D.AWS Glue

AnswerD

AWS Glue provides managed ETL jobs that can extract from JDBC sources and write to S3 on a schedule.

Why this answer

AWS Glue is designed for extract, transform, and load (ETL) jobs and can connect to JDBC sources like RDS and write to S3. DMS is for database migration, not scheduled ETL. Athena is for querying data in S3.

Kinesis is for streaming data.

Practice this question →

126

MCQeasy

A machine learning team needs to create a training dataset by joining two large datasets (10 TB and 5 TB) stored in S3. The join key is 'user_id'. They want to minimize data movement and cost. Which approach should they use?

A.Use AWS Glue ETL to read both datasets, join them using Spark DataFrames, and write the result to S3.

B.Launch an Amazon EMR cluster with Spark, read data from S3, perform the join, and write results back to S3.

C.Use Amazon Athena to run a SQL query joining the two datasets directly on S3.

D.Load both datasets into Amazon Redshift using COPY commands, then perform the join in Redshift.

AnswerC

Athena queries data in place, charges per query scanned, and requires no infrastructure management.

Why this answer

Option C is correct because Amazon Athena allows serverless SQL joins directly on S3 data without moving it, and is cost-effective for large ad-hoc queries. Option D is wrong because loading both datasets into Redshift with COPY moves the data out of S3 before the join can run. Option B is wrong because EMR requires provisioning clusters and incurs compute costs even when idle.

Option A is wrong because Glue ETL typically moves data into a transformation environment.

Practice this question →

127

Drag & Dropmedium

Drag and drop the steps to create an Amazon SageMaker notebook instance in the correct order.

Steps

Order

Why this order

Creating a notebook instance requires navigating the SageMaker console, configuring instance settings, IAM role, and VPC, then launching.

Practice this question →

128

MCQhard

A data engineer is investigating why an Athena query against the my-data-lake bucket is slow. The query filters on year, month, and day. The exhibit shows the metadata of one Parquet file. What is the MOST likely cause of the slow query?

A.The version ID is null, causing data inconsistency

B.The file is too large, causing Athena to process it in a single task

C.The partition columns are not being used in the query

D.The storage class is STANDARD, which is slower than GLACIER

AnswerB

Large files limit parallelism; Athena works best with files 128-512 MB.

Why this answer

The file is 1 GB (1073741824 bytes), which is large for a single Parquet file. Athena divides work into tasks across files; with one large file there is little to parallelize, which slows the query. Partitioning is fine, but file size matters.

The other options do not hold: the query already filters on the partition columns, the STANDARD storage class is not slower than GLACIER, and a null version ID only means versioning is not enabled.
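The file-size check behind this answer can be sketched as follows. The 128-512 MB range is the figure this page quotes for Athena, not a value verified independently, and the function name is an assumption.

```python
# File-size range quoted on this page for Athena, in MB.
RECOMMENDED_MIN_MB = 128
RECOMMENDED_MAX_MB = 512


def classify_file_size(n_bytes):
    """Compare an object's size in bytes against the quoted range."""
    size_mb = n_bytes / 1024**2
    if size_mb < RECOMMENDED_MIN_MB:
        return "too small"
    if size_mb > RECOMMENDED_MAX_MB:
        return "too large"
    return "within range"


# Size of the Parquet file from the exhibit, as quoted in the explanation.
print(classify_file_size(1073741824))  # too large
```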

Practice this question →

129

MCQhard

A company is streaming data from thousands of devices using Amazon Kinesis Data Streams. The data is consumed by an AWS Lambda function that processes each record. The Lambda function is experiencing high error rates and throttling due to the volume of data. Which action would MOST effectively improve the processing throughput and reduce errors?

A.Send the data to Amazon SQS first and then process with Lambda

B.Use Amazon Kinesis Data Firehose instead of Kinesis Data Streams

C.Increase the Lambda function's batch size and reduce the batch window

D.Increase the number of shards in the Kinesis stream

AnswerD

More shards increase parallelism and throughput, reducing throttling.

Why this answer

Increasing the number of shards increases the stream's capacity. Using a larger batch size in Lambda (option C) can improve throughput but may cause timeouts. Using SQS (option A) introduces another queue.

Using Firehose (option B) changes the architecture. Increasing shards is the most direct way to increase throughput.

Practice this question →

130

Multi-Selecthard

Which THREE considerations are important when designing a data lake on Amazon S3?

Select 3 answers

A.Setting up S3 Lifecycle policies to transition data to colder storage

B.Using Provisioned IOPS for S3

C.Partitioning data by date to improve query performance

D.Using a single Availability Zone for data storage

E.Encrypting data at rest using AWS KMS

AnswersA, C, E

Lifecycle policies manage cost.

Why this answer

Organizing data with partitioning, encrypting data at rest, and setting up lifecycle policies are key. Single Availability Zone is not an S3 consideration (S3 is regional). Using Provisioned IOPS is for block storage.

Practice this question →

131

MCQmedium

Refer to the exhibit. An ML engineer runs the above CLI command to inspect files in an S3 bucket. The training data consists of 200 CSV files, each 1 GB. The engineer plans to use Amazon SageMaker to train a model using this data. What should the engineer do to optimize training performance?

A.Increase the number of training instances to process files in parallel.

B.Use Amazon Athena to transform the data into CSV format with headers.

C.Use the File input mode and copy all files to the training instance's EBS volume.

D.Convert the CSV files to Parquet format and use Pipe input mode.

AnswerD

Parquet is columnar and compressed; Pipe mode streams data directly from S3.

Why this answer

Option D is correct because converting to Parquet and using Pipe input mode reduces I/O and improves throughput. Option C is wrong because copying to EBS is not efficient. Option B is wrong because Athena is for querying, not training.

Option A is wrong because increasing instances may help but does not address the inefficiency of CSV format.

Practice this question →

132

MCQmedium

A data engineer is designing a data lake on Amazon S3. The data is collected from IoT devices and is highly variable in volume. The engineer needs to ensure that the data is ingested reliably and can be processed in near real-time. Which AWS service should be used to ingest the data into the data lake?

A.Amazon Kinesis Data Firehose

B.AWS Glue

C.Amazon Kinesis Data Streams

D.Amazon Simple Queue Service (SQS)

AnswerA

Firehose can load streaming data directly into S3 with near real-time latency.

Why this answer

Amazon Kinesis Data Firehose is the best choice for reliably ingesting streaming data into S3 with near real-time delivery, automatic scaling, and optional transformations.

Practice this question →

133

MCQhard

A data engineer configures an S3 event notification to trigger an AWS Lambda function when a new object is created in 'my-input-bucket'. The Lambda function processes the CSV file and writes results to 'my-output-bucket'. The engineer notices that the Lambda function is not triggered for some objects. Which step should the engineer take to diagnose the issue?

A.Check the Lambda function's execution role for permissions to write to the output bucket.

B.Review the CloudWatch Logs for the Lambda function to see if there are errors.

C.Check the Lambda function's resource-based policy to ensure S3 has permission to invoke the function.

D.Verify that the S3 event notification is configured with the correct prefix and suffix filters.

AnswerC

Missing invoke permission is a common cause of trigger failure.

Why this answer

Option C is correct. The most likely issue is missing permissions. The S3 bucket must have permission to invoke the Lambda function.

The engineer should check the Lambda resource-based policy to ensure it allows invocation from S3. Option D is wrong because the event notification configuration is separate. Option A is wrong because the Lambda function's execution role needs permission to write to the output bucket, but a missing write permission would not prevent triggering.

Option B is wrong because CloudWatch Logs would show invocations that ran, but nothing for objects that never triggered the function.
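To make the resource-based policy concrete, here is a minimal sketch of the arguments that grant S3 permission to invoke a function through Lambda's AddPermission API. The function name `process-csv`, the account ID, and the statement ID format are placeholders chosen for illustration.

```python
def s3_invoke_permission(function_name: str, bucket: str, account_id: str) -> dict:
    """Build keyword arguments for Lambda's AddPermission API call."""
    return {
        "FunctionName": function_name,
        # Statement IDs allow letters, digits, hyphens and underscores only,
        # so a bucket name containing dots would need a different ID.
        "StatementId": f"s3-invoke-{bucket}",
        "Action": "lambda:InvokeFunction",
        "Principal": "s3.amazonaws.com",
        "SourceArn": f"arn:aws:s3:::{bucket}",
        "SourceAccount": account_id,
    }


params = s3_invoke_permission("process-csv", "my-input-bucket", "123456789012")
print(params)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("lambda").add_permission(**params)
```

Restricting the grant with both `SourceArn` and `SourceAccount` keeps other buckets from invoking the function.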

Practice this question →

134

MCQhard

A data engineer runs the AWS CLI command shown and notices a zero-byte file in the results. What is the most likely cause of this zero-byte file?

A.The S3 bucket has a lifecycle policy that deleted the content.

B.The file was written with the wrong prefix.

C.The file was compressed, reducing size to zero.

D.The file was created by a failed Spark task that wrote no data.

AnswerD

Failed tasks can produce empty files.

Why this answer

Zero-byte files often occur when an ETL job fails partway through writing, or when a task starts but writes no data. A completed write of actual data would have non-zero size. The other options are less likely: a wrong prefix would place the file under a different key, not empty it; a lifecycle policy removes whole objects rather than truncating them; and compression still produces some output, so the size would not be zero.
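A small sketch of how such files could be spotted in a listing. The sample data is hypothetical and only mimics the shape of the `Contents` list returned by S3's ListObjectsV2 API (each entry has a `Key` and a `Size`).

```python
def zero_byte_keys(objects: list[dict]) -> list[str]:
    """Return the keys of objects whose reported size is zero."""
    return [obj["Key"] for obj in objects if obj["Size"] == 0]


# Hypothetical listing of a Spark output prefix.
listing = [
    {"Key": "output/part-00000.parquet", "Size": 1048576},
    {"Key": "output/part-00001.parquet", "Size": 0},
    {"Key": "output/_SUCCESS", "Size": 0},
]
print(zero_byte_keys(listing))
```

Note that `_SUCCESS` marker files are empty by design, so only zero-byte data files such as the `part-` file here point to a task that wrote nothing.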

Practice this question →

135

MCQhard

A large e-commerce company is using Amazon DynamoDB as the source for real-time analytics. The data is streamed to Amazon Kinesis Data Streams using DynamoDB Streams and then processed by an AWS Lambda function. The Lambda function writes the data to an Amazon Elasticsearch Service cluster for search and visualization. Recently, the Lambda function has been failing with throttling errors from the Elasticsearch cluster. What is the MOST effective way to handle this?

A.Increase the Lambda function's reserved concurrency to handle more invocations.

B.Increase the number of shards in the Kinesis data stream.

C.Decrease the Kinesis stream's retention period to reduce the data volume.

D.Configure a Dead Letter Queue (DLQ) on the Lambda function to capture failed records and implement retry logic.

AnswerD

DLQ captures records that fail due to throttling, allowing later reprocessing without blocking the stream.

Why this answer

Using a Dead Letter Queue (DLQ) for failed records allows the Lambda function to continue processing without blocking, and the failed records can be reprocessed later. Option A is wrong because increasing Lambda concurrency would exacerbate the throttling. Option B is wrong because increasing shards increases throughput but doesn't address Elasticsearch throttling.

Option C is wrong because the retention period only controls how long records remain in the stream; shortening it does not reduce the rate of incoming data or relieve the Elasticsearch cluster.
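The retry logic mentioned in the answer could look roughly like the sketch below: retry a throttled write with exponential backoff, and report failure so the caller can hand the record to the dead-letter queue. `ThrottledError`, `write_with_backoff`, and `flaky_write` are hypothetical names; a real function would catch whatever exception its Elasticsearch client raises for an HTTP 429 response.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling (HTTP 429) response."""


def write_with_backoff(write, record, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call write(record), retrying on ThrottledError with exponential backoff.

    Returns True on success and False once retries are exhausted, at which
    point the caller would route the record to the dead-letter queue.
    """
    for attempt in range(max_attempts):
        try:
            write(record)
            return True
        except ThrottledError:
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return False


# Simulated writer that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_write(record):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("429 Too Many Requests")


delays = []  # record the requested delays instead of actually sleeping
ok = write_with_backoff(flaky_write, {"id": 1}, sleep=delays.append)
print(ok, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in a Lambda function the total backoff time also has to fit within the function timeout.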

Practice this question →

136

MCQhard

A company is using Amazon Kinesis Data Analytics for Apache Flink to process real-time data. The data source is a Kinesis data stream, and the output is written to an S3 bucket. Recently, the processing latency has increased significantly. The team suspects that the Flink application is encountering backpressure. Which metric should the team monitor to confirm backpressure?

A.currentLowWatermark

B.busyTimeMsPerSecond

C.numberOfFailedCheckpoints

D.numRecordsInPerSecond

AnswerB

High busy time indicates operator is overloaded, causing backpressure.

Why this answer

Backpressure in Flink is indicated by a high 'busyTimeMsPerSecond' metric, which shows the time the operator is busy processing. Option D is wrong because 'numRecordsInPerSecond' measures throughput, not backpressure. Option A is wrong because 'currentLowWatermark' is for event time progress.

Option C is wrong because 'numberOfFailedCheckpoints' indicates checkpoint failures, not backpressure directly.

Practice this question →

137

MCQmedium

The Glue job my-glue-job fails after a few successful runs. The error log shows 'Job run exceeds max concurrent runs limit'. The CloudFormation template is shown in the exhibit. What change should be made to allow multiple runs to execute concurrently?

A.Change the IAM role to one with more permissions

B.Increase the MaxRetries property to 3

C.Remove the --job-bookmark-option argument

D.Set the MaxConcurrentRuns property to 3

AnswerD

This allows up to 3 concurrent job runs.

Why this answer

The 'MaxConcurrentRuns' property is set to 1, which prevents parallel executions. Setting it to a higher value (e.g., 3) allows concurrent runs. MaxRetries is for retry count, not concurrency.

The IAM role, TempDir, and the --job-bookmark-option argument do not affect the concurrency limit.
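Since the exhibit's template is not reproduced here, the following is only a sketch of where the setting lives in an `AWS::Glue::Job` resource: `MaxConcurrentRuns` sits under `ExecutionProperty`. The role ARN, script location, and logical ID `MyGlueJob` are placeholders, and the template is built as a Python dictionary and printed as JSON.

```python
import json

# Hypothetical fragment of an AWS::Glue::Job resource; the role ARN and
# script path are placeholders, not values from the exhibit.
glue_job = {
    "Type": "AWS::Glue::Job",
    "Properties": {
        "Name": "my-glue-job",
        "Role": "arn:aws:iam::123456789012:role/placeholder-glue-role",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://placeholder-bucket/scripts/job.py",
        },
        # Allows up to three runs of this job at the same time.
        "ExecutionProperty": {"MaxConcurrentRuns": 3},
        # Retry count for a failed run; unrelated to concurrency.
        "MaxRetries": 0,
    },
}

print(json.dumps({"Resources": {"MyGlueJob": glue_job}}, indent=2))
```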

Practice this question →

138

MCQmedium

A data science team uses Amazon SageMaker to train models on a large dataset stored in S3. The dataset is 500 GB in CSV format and is updated daily. The team wants to optimize data loading for training jobs to reduce I/O wait time. Which data ingestion strategy is MOST effective?

A.Use SageMaker File input mode and increase the EBS volume size to 1 TB.

B.Use SageMaker Pipe input mode to stream data directly from S3.

C.Convert the CSV files to Parquet format and use File input mode.

D.Load the data into an Amazon EFS file system and mount it to the training instance.

AnswerB

Pipe mode streams data on-the-fly, eliminating the need to download the full dataset, thus reducing I/O wait time.

Why this answer

Option B is correct because SageMaker Pipe input mode streams data directly from S3 to the training algorithm without writing to the instance's EBS volume, eliminating disk I/O bottlenecks. This is especially effective for large datasets (500 GB) that are updated daily, as it reduces startup time and avoids the need to download the entire dataset before training begins.

Exam trap

The trap here is that candidates often assume converting to a columnar format like Parquet always improves performance, but they overlook that File input mode still requires a full download to disk, whereas Pipe mode avoids that entirely regardless of file format.

How to eliminate wrong answers

Option A is wrong because increasing the EBS volume size to 1 TB does not reduce I/O wait time; it only provides more storage space, and the data must still be downloaded from S3 to the EBS volume before training, which adds latency. Option C is wrong because while converting CSV to Parquet can improve read performance and reduce data size, using File input mode still requires the entire dataset to be downloaded to the instance's EBS volume before training starts, negating the benefit of reduced I/O wait time. Option D is wrong because mounting an Amazon EFS file system to the training instance introduces network file system latency and is not optimized for the high-throughput, low-latency data loading required for training jobs; SageMaker's built-in Pipe mode is designed specifically for this purpose.
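As an illustration of where the input mode is chosen, this sketch builds one `InputDataConfig` channel for SageMaker's CreateTrainingJob API with `InputMode` set to `Pipe`. The helper name, channel name, and S3 URI are assumptions for the example, and the training algorithm itself must support Pipe mode.

```python
def training_channel(name: str, s3_uri: str, input_mode: str = "Pipe") -> dict:
    """Build one InputDataConfig entry for SageMaker's CreateTrainingJob API."""
    if input_mode not in ("Pipe", "File", "FastFile"):
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "ContentType": "text/csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Placeholder bucket and prefix.
channel = training_channel("train", "s3://placeholder-bucket/training/csv/")
print(channel["InputMode"])
```

Switching the same channel to File mode is a one-argument change, which makes it easy to compare startup time between the two modes.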

Practice this question →

139

MCQeasy

A company uses AWS Lambda to process events from Amazon S3. The Lambda function transforms the data and writes results to another S3 bucket. Recently, the function has been failing due to timeout errors when processing large files. Which solution should the data engineer implement?

A.Increase the Lambda function memory and timeout limit

B.Increase the Lambda timeout to 15 minutes

C.Use S3 Batch Operations with a Lambda function to process objects

D.Use Amazon SQS to queue the events and process them in batches

AnswerC

Batch Operations can invoke Lambda for each object, handling large volumes.

Why this answer

Using S3 Batch Operations with a Lambda function allows processing large numbers of objects asynchronously, avoiding Lambda timeouts. Option A is wrong because increasing memory also increases CPU but may not solve timeout for large files. Option D is wrong because SQS does not help with processing large files.

Option B is wrong because increasing timeout may not be sufficient and is not best practice for large files.

Practice this question →

140

MCQhard

A company needs to process sensitive data from multiple sources. They want to use AWS Glue to catalog and transform the data. Which feature should they use to ensure that sensitive columns are masked before the data is available for querying?

A.AWS Glue DataBrew

B.AWS Glue Studio

C.AWS Lake Formation

D.Amazon Macie

AnswerA

DataBrew allows data masking and cleansing interactively.

Why this answer

Glue DataBrew provides data masking and cleansing capabilities. Glue Studio is for building ETL jobs, but masking requires custom code. Lake Formation is for fine-grained access control, not masking.

Macie is for discovering sensitive data, not masking.

Practice this question →

141

MCQeasy

A data scientist needs to train a machine learning model using a large dataset (500 GB) stored in an S3 bucket. The training will be performed on a SageMaker notebook instance. The data scientist wants to minimize data transfer costs and reduce training time. Which data ingestion approach should the data engineer recommend?

A.Use the SageMaker SDK to directly read the data from S3 during training without copying it to the notebook.

B.Copy the dataset to the notebook instance's attached EBS volume before training.

C.Load the dataset into an Amazon RDS database and query it from the notebook.

D.Mount the S3 bucket to the notebook instance using Amazon Elastic File System (EFS).

AnswerA

SageMaker can read data directly from S3, minimizing transfer and storage costs.

Why this answer

Option A is correct because using S3 as a data source for SageMaker directly avoids copying data into the notebook instance, reducing transfer costs and time. Option B (copying to notebook EBS) incurs transfer costs and uses local storage. Option D (EFS) adds cost and complexity.

Option C (RDS) is for structured data and incurs additional costs.

Practice this question →

142

MCQhard

A data engineer created an IAM policy to allow a Glue ETL job to read and write objects to an S3 bucket. The ETL job fails when writing data with the error 'Access Denied'. The job is configured to use SSE-S3 (AES256) encryption. What is the likely issue?

A.The policy grants s3:PutObject on all buckets, not just the specific one.

B.The condition requires objects to be encrypted with SSE-KMS, but the job uses SSE-S3.

C.The policy does not grant s3:PutObject on the bucket itself, which is needed for some write operations.

D.The condition requires objects to use SSE-S3, but the job uses SSE-KMS.

AnswerC

Bucket-level permissions may be required for certain write operations.

Why this answer

The error 'Access Denied' when writing to S3 with SSE-S3 encryption typically occurs because the IAM policy lacks the `s3:PutObject` permission on the bucket resource itself. While the policy may grant `s3:PutObject` on the object ARN (`arn:aws:s3:::bucket/*`), some S3 write operations—especially those involving encryption headers or bucket-level checks—also require the permission on the bucket ARN (`arn:aws:s3:::bucket`). Without this, the request is denied even if the object-level permission exists.

Exam trap

The trap here is that candidates assume `s3:PutObject` on the object ARN is sufficient for all write operations, overlooking that S3 requires the same permission on the bucket ARN for certain encryption-related or bucket-policy-evaluation scenarios.

How to eliminate wrong answers

Option A is wrong because granting `s3:PutObject` on all buckets would be overly permissive, not restrictive; the issue is missing permission on the specific bucket, not an overly broad scope. Option B is wrong because the job uses SSE-S3, and the condition requiring SSE-KMS would cause a different error (e.g., 'The request was denied because the encryption key is not authorized'), not a generic 'Access Denied'. Option D is wrong because the job uses SSE-S3, not SSE-KMS, so a condition requiring SSE-S3 would actually match and not cause a denial.

Practice this question →

143

MCQmedium

A data engineering team is using Apache Spark on Amazon EMR to process streaming data from Amazon Kinesis Data Streams. The Spark application uses structured streaming to read from Kinesis, perform transformations, and write to Amazon S3 in Parquet format. The team notices that the application is falling behind and the processing latency is increasing. The Kinesis stream has 5 shards, and the EMR cluster has 5 core nodes of type r5.xlarge. The Spark application is configured with 5 executors, each with 2 cores and 8 GB memory. The team wants to reduce processing latency. Which change would be most effective?

A.Increase the executor memory to 16 GB.

B.Increase the number of shards in the Kinesis stream to 10 and increase the number of core nodes to 10.

C.Use a larger instance type for the core nodes, such as r5.4xlarge.

D.Change the output format from Parquet to CSV to reduce write time.

AnswerB

More shards increase parallelism, and more nodes allow more concurrent processing.

Why this answer

The number of shards (5) matches the number of executors (5), but each shard can be processed by a single executor. To increase parallelism, the team should increase the number of shards in the Kinesis stream and correspondingly increase the number of executors or cores. Alternatively, they can increase the number of cores per executor to allow parallel processing of multiple shards per executor.

Practice this question →

144

MCQhard

A company runs a data pipeline using AWS Glue ETL jobs that process about 10 TB of data daily from Amazon S3. The jobs are triggered by a schedule and write results to a separate S3 bucket. Recently, the jobs have been taking longer to complete, and the data engineering team has observed that the number of files in the source bucket has increased significantly, from thousands to millions of small files (each about 100 KB). The Glue jobs are configured to use the 'Group Files' option, but performance is still poor. The team needs to improve the job performance without changing the source data generation process. Which course of action should the team take?

A.Increase the number of DPUs allocated to the existing Glue job

B.Switch the ETL processing to Amazon EMR with Spark

C.Use AWS Lambda to pre-process the files and combine them

D.Create a separate Glue job that runs before the main job to consolidate small files into larger ones in the source bucket

AnswerD

Consolidation reduces the number of files, improving read performance.

Why this answer

Using an AWS Glue job to periodically compact small files into larger files (e.g., 100 MB) before the main ETL job runs will reduce the overhead of reading millions of small files. Option A is wrong because increasing DPUs may help but does not address the root cause of many small files. Option B is wrong because using Spark on EMR may still suffer from the small-file issue.

Option C is wrong because Lambda has limitations on processing large numbers of files.
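The planning step of such a compaction job can be sketched in plain Python: group small files into batches of roughly the target size, each of which would then be rewritten as one larger file. This is only the grouping logic under assumed names (`plan_compaction`, the `raw/` keys); the actual read and rewrite would happen inside the Glue job.

```python
def plan_compaction(files, target_bytes=100 * 1024 * 1024):
    """Group (key, size) pairs into batches of roughly target_bytes each."""
    batches, current, current_size = [], [], 0
    for key, size in files:
        # Start a new batch once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical input: 5,000 files of 100 KB each.
small_files = [(f"raw/file-{i:05d}.json", 100 * 1024) for i in range(5000)]
batches = plan_compaction(small_files)
print(len(batches), len(batches[0]))
```

With these numbers, 5,000 input files collapse into 5 output files, so the main job opens three orders of magnitude fewer objects.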

Practice this question →

145

Multi-Selecthard

A data engineer is designing a data pipeline that uses Amazon Kinesis Data Streams to ingest real-time transaction data. The data must be processed in near real-time and stored in Amazon S3 for long-term analytics. The engineer wants to ensure data durability and exactly-once processing semantics. Which TWO actions should the engineer take? (Choose two.)

Select 2 answers

A.Use the Kinesis Producer Library (KPL) with exactly-once delivery.

B.Use AWS Glue streaming ETL with checkpointing.

C.Enable exactly-once delivery on Kinesis Data Firehose.

D.Use AWS Lambda with the Kinesis trigger and enable event source mapping with RetryAttempts set to 0.

E.Use Amazon SQS as the event source for downstream processing.

AnswersA, C

KPL provides exactly-once semantics when configured to do so.

Why this answer

Correct options: A and C. Using the Kinesis Producer Library (KPL) with exactly-once delivery ensures no duplicates. Enabling Kinesis Data Firehose's exactly-once delivery to S3 ensures data is written exactly once.

Option E (SQS) is not part of Kinesis. Option B (Glue) does not provide exactly-once for streaming. Option D (Lambda) can process records but does not guarantee exactly-once semantics without additional logic.

Practice this question →

146

MCQeasy

A data scientist wants to explore a large dataset stored in Amazon S3 using SQL queries without moving the data. The dataset is in CSV format and is updated daily with new partitions. Which AWS service should be used to directly query the data in S3?

A.Amazon Athena

B.Amazon Redshift Spectrum

C.Amazon EMR

D.AWS Glue

AnswerA

Athena is purpose-built for querying data in S3 with no infrastructure to manage.

Why this answer

Amazon Athena is a serverless interactive query service that allows querying data directly in S3 using standard SQL. Amazon Redshift Spectrum (option B) can also query S3 but requires a Redshift cluster. Amazon EMR (option C) requires cluster management.

AWS Glue (option D) is for ETL, not interactive querying.

Practice this question →

147

MCQeasy

A data engineer needs to analyze large CSV files stored in Amazon S3 using SQL queries. The data is not frequently accessed, and cost is a primary concern. Which AWS service should be used to query the data directly in S3 without moving it?

A.Amazon Athena

B.Amazon EMR

C.Amazon Redshift Spectrum

D.AWS Glue

AnswerA

Athena is serverless and directly queries S3 using SQL with pay-per-query pricing.

Why this answer

Amazon Athena is a serverless query service that allows SQL queries directly on data in S3, ideal for ad-hoc analysis with a pay-per-query pricing model. Option C (Amazon Redshift Spectrum) also queries S3 but requires an existing Redshift cluster. Option D (AWS Glue) is for ETL.

Option B (Amazon EMR) requires cluster management and is more expensive for occasional queries.

Practice this question →

148

Multi-Selecteasy

A data engineering team needs to schedule a nightly ETL job that extracts data from an Amazon RDS for PostgreSQL instance, transforms it using Spark, and loads it into Amazon S3. The team wants to use AWS Glue for this task. Which components are required? (Select TWO.)

Select 2 answers

A.An AWS Glue ETL job with a Spark script.

B.An AWS Glue crawler to populate the Data Catalog.

C.An AWS Glue connection to the RDS database.

D.An AWS Glue development endpoint.

E.An AWS Glue notebook for data exploration.

AnswersA, C

The job performs the defined ETL logic.

Why this answer

Option C is correct because a connection to the PostgreSQL database is needed for extraction. Option A is correct because an AWS Glue ETL job with a Spark script performs the transformation. Option B is wrong because a crawler is for cataloging, not for ETL.

Option D is wrong because a development endpoint is for interactive development, not production scheduling. Option E is wrong because a notebook is for development, not for scheduled jobs.

Practice this question →

149

MCQhard

A company is running a machine learning training job on Amazon SageMaker that reads training data from an S3 bucket. The job fails intermittently with an S3 throttling error. The data is partitioned across thousands of small files (average 100 KB). Which strategy is MOST effective to resolve the throttling issue?

A.Use Amazon Athena to query the data and output results to a new S3 location

B.Enable S3 Transfer Acceleration on the bucket

C.Combine the small files into larger files (e.g., 100 MB) using a preprocessing step

D.Increase the number of SageMaker training instances to distribute the load

AnswerC

Larger files reduce the number of GET requests, mitigating throttling.

Why this answer

Combining small files into larger files reduces the number of S3 GET requests, which reduces the chance of throttling. Increasing the number of instances (option D) would increase parallelism and could worsen throttling. Using S3 Transfer Acceleration (option B) improves transfer speed but does not reduce request rate.

Using Athena (option A) is for querying, not for training data access.

Practice this question →

150

MCQhard

A company runs a data lake on Amazon S3 with partitions by year/month/day. A machine learning team needs to read daily data from the last 30 days for model retraining. The data format is Parquet. The team uses Amazon Athena to query the data, but the queries are slow and scanning too much data. The team has already optimized the file sizes and compression. What additional step can reduce the amount of data scanned?

A.Remove the partition structure and store data as single large files.

B.Convert the Parquet files to JSON format for better query performance.

C.Use CSV format with Gzip compression.

D.Add more partition columns such as hour to reduce the scanned partitions.

AnswerD

More granular partitions allow queries to scan fewer files.

Why this answer

Option D is correct because partitioning on additional columns like hour can further prune partitions if queries filter by time ranges. Option B is wrong because converting to JSON would increase data size. Option C is wrong because converting to CSV would also increase size.

Option A is wrong because removing partitions would increase scanning.
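Partition pruning only helps when a query filters on the partition columns, so it can be useful to see exactly which partitions a 30-day window covers. The sketch below lists them, assuming Hive-style `year=/month=/day=` prefixes; the question only states that the data is partitioned by year/month/day, so the exact naming is an assumption.

```python
from datetime import date, timedelta


def daily_partitions(end: date, days: int = 30) -> list[str]:
    """Return year/month/day partition paths for the `days` days ending at `end`."""
    return [
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}"
        for d in (end - timedelta(days=offset) for offset in range(days - 1, -1, -1))
    ]


# Example window ending on an arbitrary date.
parts = daily_partitions(date(2024, 3, 1))
print(parts[0], parts[-1], len(parts))
```

A query whose WHERE clause matches only these 30 partitions lets Athena skip every other prefix in the data lake.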

Practice this question →
