
CCNA Data Engineering Questions

74 of 374 questions · Page 5/5 · Data Engineering · Answers revealed


301

MCQeasy

A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should they use for scheduling?

A.AWS Lambda

B.Amazon CloudWatch Events

C.Amazon Simple Queue Service (SQS)

D.AWS Step Functions

AnswerB

CloudWatch Events can trigger Glue jobs on a schedule.

Why this answer

Amazon CloudWatch Events (now part of Amazon EventBridge) can trigger an AWS Glue ETL job on a schedule using a cron or rate expression. This is the native, serverless scheduling service for running jobs at fixed intervals, such as every hour, without needing to manage any infrastructure.

Exam trap

The trap here is that candidates may confuse AWS Lambda as a scheduler because it can be used to run code on a schedule via CloudWatch Events, but the question asks for the service used for scheduling, not for executing the scheduled action.

How to eliminate wrong answers

Option A is wrong because AWS Lambda is a compute service for running code in response to events, not a scheduling service; while Lambda can be used to trigger Glue jobs via custom code, it requires additional setup and is not the direct scheduling mechanism. Option C is wrong because Amazon SQS is a message queue service for decoupling application components, not a scheduler; it cannot natively trigger Glue jobs on a time-based schedule. Option D is wrong because AWS Step Functions is a workflow orchestration service that can coordinate multiple AWS services, but it is not designed for simple time-based scheduling; using it solely for hourly triggers would be over-engineering and incur unnecessary complexity and cost.
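
As a rough illustration of the "rate expression" mentioned above, the sketch below builds the schedule string for an hourly rule. The helper name and the rule name are hypothetical, and the dictionary only mirrors the shape of a rule definition; it does not call any AWS API.

```python
def rate_expression(value: int, unit: str) -> str:
    """Build a schedule string such as 'rate(1 hour)' or 'rate(5 minutes)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError(f"unsupported unit: {unit}")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # The unit is singular for a value of 1 and plural otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


# Hypothetical rule definition for an hourly Glue trigger (names are illustrative).
hourly_rule = {
    "Name": "hourly-glue-etl",
    "ScheduleExpression": rate_expression(1, "hour"),
}
print(hourly_rule["ScheduleExpression"])
```

A cron expression would work equally well for "every hour"; the rate form is shown only because it is the shorter of the two.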


302

MCQmedium

Refer to the exhibit. An IAM policy is attached to a data engineering role. The role is used by an AWS Glue ETL job that reads from 'raw/' and writes to 'processed/'. The job fails with an access denied error when trying to write to 'processed/'. What is the likely cause?

A.The Deny statement on s3:DeleteObject prevents overwriting objects.

B.The role is not correctly attached to the Glue job.

C.The policy does not allow both s3:GetObject and s3:PutObject on the same resource.

D.The policy specifies incorrect ARN for the 'processed' folder.

AnswerA

An explicit Deny on s3:DeleteObject overrides any Allow and can block a write that replaces existing objects.

Why this answer

Option A is correct because the policy explicitly denies s3:DeleteObject on all objects, and an explicit Deny always overrides an Allow. A plain PutObject call for a new object does not need s3:DeleteObject, but a Glue job that replaces existing output can issue delete calls as part of the write (for example, to remove the objects it is overwriting), depending on the job and bucket configuration, and those calls are denied. A missing s3:ListBucket permission at the bucket level is another common cause of access denied errors, but it is not among the options.

Option B is wrong because the role is attached correctly. Option C is wrong because the policy allows both s3:GetObject and s3:PutObject. Option D is wrong because the resource ARN matches the 'processed/' prefix.

303

MCQhard

A company uses Amazon SageMaker to train and deploy machine learning models. The training data is stored in Amazon S3 (Parquet format, 10 TB). The data scientists have been running training jobs using the File mode input, but the jobs are taking too long due to data download time. They want to reduce the training start-up time and overall training time. Which solution is MOST cost-effective and efficient?

A.Configure the SageMaker training job to use Pipe mode, which streams data directly from S3 without downloading to the instance's local storage.

B.Use S3 Transfer Acceleration to speed up the data transfer from S3 to the training instance.

C.Use larger EC2 instances with more vCPUs and memory to speed up the training process.

D.Enable Elastic Fabric Adapter (EFA) on the training instances to improve network throughput.

AnswerA

Pipe mode reduces start-up time by streaming data, and it is cost-effective as it avoids EBS volume costs associated with File mode.

Why this answer

Pipe mode in SageMaker streams training data directly from Amazon S3 to the training algorithm without first downloading it to the instance's local storage. This eliminates the data download step, significantly reducing startup time and overall training time for large datasets like 10 TB. It is the most cost-effective because it avoids the need for larger instances or additional data transfer acceleration services.

Exam trap

The trap here is that candidates often confuse Pipe mode with File mode, assuming both require downloading data, or they over-engineer the solution by choosing expensive network accelerators or larger instances when the simplest streaming approach is both faster and cheaper.

How to eliminate wrong answers

Option B is wrong because S3 Transfer Acceleration is designed to speed up uploads to S3 over long distances, not downloads from S3 to SageMaker training instances, and it incurs additional costs without addressing the core issue of download time. Option C is wrong because using larger EC2 instances with more vCPUs and memory does not reduce the data download time; it only increases compute capacity, which may not help if the bottleneck is I/O from downloading data. Option D is wrong because Elastic Fabric Adapter (EFA) improves inter-node network communication for distributed training, but it does not accelerate data transfer from S3 to the instance, which is the primary bottleneck here.
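
To make the File-versus-Pipe distinction concrete, here is a minimal sketch of the part of a training job request that selects the input mode. It is only a fragment: a real request needs further required fields (image, role, instance settings), and the bucket path and channel name are hypothetical.

```python
def training_input_config(s3_uri: str, input_mode: str = "Pipe") -> dict:
    """Return the input-related fragment of a training job request."""
    if input_mode not in ("File", "Pipe", "FastFile"):
        raise ValueError(f"unknown input mode: {input_mode}")
    return {
        # "Pipe" streams from S3; "File" downloads the data before training starts.
        "AlgorithmSpecification": {"TrainingInputMode": input_mode},
        "InputDataConfig": [
            {
                "ChannelName": "train",  # illustrative channel name
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
    }


config = training_input_config("s3://example-bucket/training/parquet/")
print(config["AlgorithmSpecification"]["TrainingInputMode"])
```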


304

Multi-Selectmedium

Which TWO AWS services can be used to move data from an on-premises database to Amazon S3 on a recurring schedule without writing custom code? (Choose 2.)

Select 2 answers

A.AWS Glue

B.AWS Snowball Edge

C.AWS Database Migration Service (AWS DMS)

D.Amazon Athena

E.Amazon Kinesis Data Firehose

AnswersA, C

Glue can run scheduled ETL jobs from JDBC sources to S3.

Why this answer

AWS Database Migration Service (AWS DMS) can perform continuous replication from on-premises databases to S3. AWS Glue can run scheduled ETL jobs to pull data from on-premises sources via JDBC and write to S3. Option B is wrong because Snowball Edge is a one-time physical transfer.

Option D is wrong because Athena is a query service, not a data movement tool. Option E is wrong because Kinesis Data Firehose is for streaming, not scheduled batch transfers.

305

Multi-Selectmedium

A company uses AWS Glue to run ETL jobs. The data engineer wants to monitor job performance and troubleshoot failures. Which THREE AWS services or features should they use together? (Choose three.)

Select 3 answers

A.AWS Glue job bookmarks

B.Amazon S3 Event Notifications

C.Amazon Athena

D.Amazon CloudWatch Logs

E.Amazon CloudWatch metrics

AnswersA, D, E

Job bookmarks track processed data and help identify failures.

Why this answer

Correct options: A, D, E. CloudWatch Logs stores Glue job logs, CloudWatch metrics track job performance, and Glue job bookmarks track processed data. Option C (Athena) is for querying, not monitoring.

Option B (S3 Event Notifications) can trigger jobs but not monitor them.

306

Matchingmedium

Match each SageMaker optimization technique to its description.


Concepts

Matches

Train across multiple GPUs or instances

Hyperparameter optimization with Bayesian search

Use spot instances for cost savings

Stream data directly from S3 for faster training

Monitor training and detect issues

Why these pairings

These techniques improve training efficiency and cost.


307

Multi-Selectmedium

Which TWO of the following are valid ways to reduce query costs in Amazon Athena? (Choose 2)

Select 2 answers

A.Use UNLOAD to export query results to S3

B.Partition the data in S3

C.Increase the query timeout limit

D.Use columnar storage formats like Parquet

E.Enable encryption at rest on S3

AnswersB, D

Partitioning limits data scanned per query.

Why this answer

Option B (partitioning) reduces the data scanned, lowering cost. Option D (using columnar formats such as Parquet) also reduces scanned data. Option C is wrong because increasing the query timeout limit doesn't reduce cost.

Option E is wrong because encryption doesn't reduce cost. Option A is wrong because UNLOAD is for exporting results, not cost reduction.
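
The effect of partitioning and columnar storage on cost can be sketched with simple arithmetic, since the charge depends on bytes scanned. Every figure here is hypothetical: the table size, the partition count, the column fraction, and the default price per TB are assumptions for illustration only, and any per-query minimum charge is ignored.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def scan_cost_usd(bytes_scanned: float, price_per_tb: float = 5.0) -> float:
    """Cost of a query that scans the given number of bytes (price is an assumption)."""
    return bytes_scanned / TIB * price_per_tb


# Hypothetical table: 1 TiB of data spread evenly over 365 daily partitions.
full_scan = 1 * TIB
one_day = full_scan / 365                # partition pruning: read a single day
one_day_few_columns = one_day * 5 / 50   # columnar format: read 5 of 50 equal-size columns

for label, scanned in [
    ("full scan", full_scan),
    ("one partition", one_day),
    ("one partition, 5 of 50 columns", one_day_few_columns),
]:
    print(f"{label}: ${scan_cost_usd(scanned):.4f}")
```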

308

Multi-Selecteasy

A company wants to centralize logging from multiple AWS accounts and on-premises servers. The logs must be stored cost-effectively and be searchable. Which TWO services should be used? (Choose TWO.)

Select 2 answers

A.Amazon CloudWatch Logs

B.Amazon Redshift

C.Amazon Athena

D.Amazon Kinesis Data Streams

E.Amazon S3

AnswersC, E

Athena can query logs directly on S3.

Why this answer

Amazon S3 provides cost-effective storage for log archives, and Amazon Athena allows querying the logs directly on S3 without loading them into a database. Option A (CloudWatch Logs) is for real-time monitoring, not long-term storage. Option B (Redshift) is more expensive.

Option D (Kinesis Data Streams) is for streaming, not storage.

309

Multi-Selecthard

A company is using Amazon Kinesis Data Streams with a Lambda consumer. The Lambda function writes results to an S3 bucket. The team wants to ensure that each record is processed exactly once and in order. Which TWO configurations should the team implement? (Choose 2.)

Select 2 answers

A.Set the batch size to 1

B.Increase the Lambda function's reserved concurrency

C.Set the parallelization factor to 1

D.Configure a dead-letter queue for failed records

E.Enable S3 bucket versioning to track duplicates

AnswersC, E

This ensures a single Lambda instance processes each shard, maintaining order.

Why this answer

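Option C (set the parallelization factor to 1) and Option E (enable S3 bucket versioning) are correct. With a parallelization factor of 1, only one Lambda invocation processes a shard at a time, which preserves order, and bucket versioning helps track duplicate writes. Option A is wrong because batch size does not affect ordering. Option B is wrong because reserved concurrency does not ensure ordering.

Option D is wrong because a dead-letter queue does not affect ordering.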

310

MCQhard

A company runs a streaming data pipeline using Amazon Kinesis Data Streams with 10 shards. The pipeline ingests sensor data from thousands of devices. Each device sends a JSON payload every 5 seconds. The payload size is approximately 2 KB. The data is consumed by a fleet of EC2 instances running a custom Java application that uses the Kinesis Client Library (KCL). Over the past week, the company has observed that the consumer application is experiencing increased latency, and the Kinesis stream's 'GetRecords.IteratorAgeMilliseconds' CloudWatch metric is consistently above 10 seconds. The company has verified that the EC2 instances have sufficient CPU and memory resources. The KCL application is configured with 10 workers, one per shard. The application processes each record by performing a simple transformation and writing to Amazon DynamoDB. The DynamoDB table has sufficient write capacity and is not throttling. The company wants to reduce the iterator age to under 2 seconds. Which action should the company take?

A.Replace Kinesis Data Streams with Amazon Kinesis Data Firehose

B.Increase the write capacity of the DynamoDB table

C.Increase the number of shards in the Kinesis stream to 20

D.Increase the number of KCL workers to 20

AnswerC

More shards increase the number of concurrent consumers and reduce iterator age.

Why this answer

Option C is correct because the consumer is bottlenecked by the number of shards; increasing shards allows more parallelism and reduces latency. Option B is wrong because the issue is not DynamoDB capacity. Option A is wrong because switching to Firehose would change the architecture and may not support the custom transformation.

Option D is wrong because increasing the worker count beyond the number of shards does not help due to KCL's design.

311

MCQhard

A data science team uses Amazon SageMaker to train models on a dataset stored in Amazon S3. The dataset is 2 TB and is accessed by multiple training jobs. The team notices that training jobs are slow due to high S3 GET request latency. Which solution would provide the fastest and most cost-effective data access?

A.Place all training instances in a Cluster Placement Group

B.Enable S3 Transfer Acceleration on the bucket

C.Mount an Amazon FSx for Lustre file system integrated with the S3 bucket

D.Use Elastic Fabric Adapter (EFA) for training instances

AnswerC

FSx for Lustre provides a high-performance file system that can read data from S3 with low latency.

Why this answer

Using Amazon FSx for Lustre as a managed Lustre file system with S3 integration provides high-throughput, low-latency access for training jobs. Option A (Cluster Placement Group) reduces network latency between instances but does not improve S3 access. Option D (Elastic Fabric Adapter) improves inter-node communication, not data access from S3.

Option B (S3 Transfer Acceleration) improves upload speed over long distances, not GET latency for large datasets.

312

Multi-Selecthard

A company is using Amazon Kinesis Data Streams to ingest real-time clickstream data. The data must be processed and stored in S3 in near real-time. Which THREE services can be used together to achieve this?

Select 3 answers

A.Amazon Kinesis Data Analytics

B.Amazon Kinesis Data Firehose

C.AWS Glue ETL

D.AWS Lambda

E.Amazon EMR

AnswersA, B, D

Can process streaming data in real-time.

Why this answer

Amazon Kinesis Data Analytics is correct because it can process streaming clickstream data in real-time using SQL or Apache Flink, enabling transformations, aggregations, and filtering before the data is delivered downstream. It integrates directly with Kinesis Data Streams as a source and can output processed records to Kinesis Data Firehose for storage in Amazon S3, achieving near real-time processing and storage.

Exam trap

The trap here is that candidates often assume AWS Glue ETL can handle real-time streaming because it supports Spark Streaming, but Glue ETL jobs are fundamentally batch-oriented and not designed for continuous, low-latency ingestion from Kinesis Data Streams into S3.


313

MCQeasy

A data engineer needs to store streaming data from thousands of IoT devices for real-time analytics. Which AWS service is most suitable for ingesting and storing this data for subsequent processing by Amazon Kinesis Data Analytics?

A.Amazon Kinesis Data Streams

B.Amazon S3

C.Amazon RDS

D.Amazon DynamoDB

AnswerA

Kinesis Data Streams ingests and stores streaming data in real-time for analytics.

Why this answer

Option A is correct because Kinesis Data Streams is designed for real-time data ingestion and storage, and integrates with Kinesis Data Analytics. Option B is wrong because S3 is not a real-time ingestion service. Option C is wrong because RDS is a relational database not suited for high-velocity streaming. Option D is wrong because DynamoDB is a NoSQL database, not a streaming service.

314

MCQmedium

A company is building a data pipeline using AWS Glue to transform data from Amazon RDS to Amazon S3. The pipeline runs daily and processes about 500 GB of data. The team notices that the job is taking longer than expected. Which change would MOST improve the job performance?

A.Disable job bookmarking

B.Increase the number of DPUs for the Glue job

C.Upgrade the RDS instance to a larger class

D.Use smaller file sizes in S3 output

AnswerB

More DPUs provide more parallelism and can speed up the job.

Why this answer

Increasing the number of DPUs (Data Processing Units) allocated to the Glue job can improve parallelism and reduce execution time. Option C is wrong because the RDS instance may not be the bottleneck. Option D is wrong because using smaller output files could increase overhead.

Option A is wrong because disabling job bookmarking is not a performance improvement.

315

MCQmedium

A company runs a daily batch ETL job using AWS Glue that reads from Amazon RDS (MySQL), transforms the data, and writes to Amazon Redshift. The job takes 6 hours and processes 500 GB of data. Management wants to reduce the runtime. Which action would be MOST effective?

A.Increase the node size of the Redshift cluster

B.Use the Redshift COPY command to load data directly from RDS

C.Use Amazon RDS with Provisioned IOPS SSD storage

D.Increase the number of DPUs allocated to the Glue job

AnswerD

More DPUs allow Glue to process data in parallel, reducing overall runtime.

Why this answer

Option D is correct. Increasing the number of DPUs (Data Processing Units) in the Glue job can parallelize the workload and reduce runtime. Option A (increasing the Redshift node size) helps only at the write stage.

Option C (Provisioned IOPS SSD storage on RDS) does not address the bottleneck. Option B (using the COPY command) is wrong because COPY is a Redshift load command and does not replace the Glue transformation.

316

MCQmedium

A data pipeline uses Amazon Kinesis Data Streams to ingest clickstream data. The data is consumed by an AWS Lambda function that transforms and writes to Amazon DynamoDB. The Lambda function is throttled during traffic spikes, causing data to be reprocessed. Which solution should the team implement to handle the throttling without losing data?

A.Use Amazon SQS as an intermediate buffer between Kinesis and Lambda.

B.Increase the number of shards in the Kinesis stream and configure a dead-letter queue (DLQ) for the Lambda function.

C.Enable DynamoDB auto scaling to handle writes.

D.Reduce the batch size in the Lambda event source mapping.

AnswerB

More shards increase parallelism; DLQ captures failures for reprocessing.

Why this answer

Option B is correct because increasing the number of shards increases the streaming capacity, and using a DLQ captures failed records for later reprocessing. Option C is wrong because DynamoDB auto scaling does not address Lambda throttling. Option D is wrong because reducing the batch size may increase processing overhead.

Option A is wrong because SQS is not needed, as Kinesis already buffers data.

317

MCQmedium

A company is building a data lake on Amazon S3. Data arrives from multiple sources in different formats (CSV, JSON, Parquet). The engineering team wants to query this data using Amazon Athena with minimal transformation. Which approach minimizes query cost and improves performance?

A.Use Amazon Redshift Spectrum to query the data directly without transformation

B.Use AWS Glue to convert all data to Parquet format, partition by date, and store in a separate S3 bucket

C.Use Amazon EMR to convert data to CSV format and repartition

D.Store data as-is in S3 and create external tables in Athena for each format

AnswerB

This reduces data scanned, improves performance, and lowers cost.

Why this answer

Using Parquet with partitioning and compression reduces the amount of data scanned by Athena, lowering cost and improving performance. Athena can query CSV and JSON as-is, so converting everything to a single format is not strictly necessary, but a columnar format such as Parquet is what delivers the savings. The Glue conversion job adds some processing overhead of its own.
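
A date-partitioned layout is usually expressed through the object key. The sketch below builds a key with Hive-style `name=value` prefixes, a common convention for partitioned data on S3; the table prefix, file name, and helper name are hypothetical.

```python
from datetime import date


def partitioned_key(table_prefix: str, event_date: date, filename: str) -> str:
    """Build an object key with year/month/day partition prefixes."""
    return (
        f"{table_prefix}/"
        f"year={event_date.year:04d}/"
        f"month={event_date.month:02d}/"
        f"day={event_date.day:02d}/"
        f"{filename}"
    )


key = partitioned_key("processed/clicks", date(2024, 1, 15), "part-0000.parquet")
print(key)
```

A query that filters on the partition columns then only needs to read objects under the matching prefixes.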

318

MCQeasy

A company uses Amazon Kinesis Data Firehose to deliver streaming data to Amazon S3. The data contains personally identifiable information (PII) that must be redacted before storage. Which AWS service can be integrated with Kinesis Data Firehose to transform the data in real time?

A.Amazon Athena

B.Amazon Kinesis Data Analytics

C.Amazon EMR

D.AWS Lambda

AnswerD

Lambda can be invoked by Firehose to transform records in real time.

Why this answer

Option D is correct. AWS Lambda can be used as a transformation function within a Kinesis Data Firehose delivery stream to redact PII. Option C is wrong because Amazon EMR is for big data processing, not inline transformation.

Option B is wrong because Amazon Kinesis Data Analytics is for analyzing streams, not for inline transformation within Firehose. Option A is wrong because Amazon Athena is a query service, not a transformation service.
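
A transformation function of this kind might look like the sketch below. It assumes the usual Firehose transformation contract (base64-encoded `data`, and a `recordId`, `result`, and `data` returned per record) and redacts only email addresses with a simple regular expression; a real PII redactor would need to cover far more than that.

```python
import base64
import re

# Deliberately simple email pattern, used only for illustration.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def lambda_handler(event, context):
    """Redact email addresses from each record and return it to Firehose."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        redacted = EMAIL.sub("[REDACTED]", payload)
        output.append(
            {
                "recordId": record["recordId"],  # must echo the incoming ID
                "result": "Ok",
                "data": base64.b64encode(redacted.encode("utf-8")).decode("utf-8"),
            }
        )
    return {"records": output}


# Local usage example with a hand-built event (not a real delivery stream).
sample = {
    "records": [
        {
            "recordId": "1",
            "data": base64.b64encode(b'{"user": "a@example.com"}').decode("utf-8"),
        }
    ]
}
result = lambda_handler(sample, None)
print(base64.b64decode(result["records"][0]["data"]).decode("utf-8"))
```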

319

MCQhard

A data engineer is configuring an IAM policy to allow users to upload objects to an S3 bucket only if the objects are encrypted using SSE-S3. However, users are getting AccessDenied errors when uploading objects without specifying encryption. What is the most likely cause?

A.The condition should check for aws:SourceIp instead of encryption

B.The condition requires encryption to be specified, but the upload does not specify it

C.The policy is attached to the wrong IAM user

D.The bucket policy denies all PutObject without encryption

AnswerB

The condition requires s3:x-amz-server-side-encryption to be AES256, so without it, access is denied.

Why this answer

Option B is correct because the policy allows PutObject only when the encryption header is set to AES256; when an upload specifies no encryption, the condition is not met, the Allow does not apply, and the request is denied. Option A is wrong because the requirement concerns encryption, not the source IP address. Option C is wrong because the errors occur specifically on uploads without encryption, which points to the condition rather than to where the policy is attached. Option D is wrong because no bucket policy is shown in the scenario.
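
The policy in this scenario is not shown, so the following is only a sketch of what such a statement could look like, together with a toy check that mirrors the condition. The bucket name is hypothetical, and `upload_allowed` is an illustration, not the actual IAM evaluation logic.

```python
# Hypothetical policy: allow uploads only when SSE-S3 (AES256) is requested.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
            "Condition": {
                "StringEquals": {"s3:x-amz-server-side-encryption": "AES256"}
            },
        }
    ],
}


def upload_allowed(request_headers: dict) -> bool:
    """Toy check: the Allow applies only if the header matches the condition."""
    condition = policy["Statement"][0]["Condition"]["StringEquals"]
    required = condition["s3:x-amz-server-side-encryption"]
    return request_headers.get("x-amz-server-side-encryption") == required


print(upload_allowed({"x-amz-server-side-encryption": "AES256"}))
print(upload_allowed({}))  # no encryption header: the condition is not met
```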

320

MCQmedium

A data pipeline uses Amazon Kinesis Data Streams with a Lambda consumer to process clickstream data. The Lambda function sometimes times out because of spikes in traffic. The team wants to buffer the data before processing to handle spikes. Which approach is most effective?

A.Send the data to an Amazon SQS queue and have Lambda poll from it

B.Increase the number of shards in the Kinesis stream

C.Use Kinesis Data Firehose as an intermediary that buffers data and delivers to Lambda in batches

D.Increase the Lambda function timeout to 15 minutes

AnswerC

Firehose can buffer incoming data and invoke Lambda with larger batches, smoothing spikes.

Why this answer

Using Kinesis Data Firehose between the stream and Lambda buffers data and can deliver it to Lambda in batches, reducing timeouts. Option D (increasing the Lambda timeout) may help but is not a buffer. Option B (increasing shards) increases throughput but does not buffer.

Option A (using SQS) adds complexity and delay.

321

Multi-Selecthard

A company uses AWS Glue Data Catalog to manage metadata for its data lake on Amazon S3. The data lake contains terabytes of data in CSV format. The data engineering team wants to improve query performance in Amazon Athena and reduce costs. Which actions should the team take? (Select THREE.)

Select 3 answers

A.Create views in Athena to simplify queries.

B.Compress the data using Snappy or GZIP.

C.Partition the data by commonly filtered columns.

D.Convert the data to Parquet format.

E.Convert the data to JSON format.

AnswersB, C, D

Compression reduces storage and data scanned.

Why this answer

Option B is correct because compressing data reduces storage cost and data scanned. Option C is correct because partitioning limits the amount of data scanned per query. Option D is correct because converting to a columnar format like Parquet reduces data scanned and improves performance.

Option A is wrong because views simplify queries but do not directly address performance or cost for Athena. Option E is wrong because JSON, like CSV, is not an efficient format for Athena to scan.

322

Multi-Selecteasy

A company wants to analyze streaming data from IoT devices in near-real-time. They need to store raw data in Amazon S3 and also run SQL queries on the streaming data. Which TWO services should they use?

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Streams

C.Amazon Kinesis Data Analytics

D.Amazon Kinesis Data Firehose

E.AWS Lambda

AnswersC, D

Runs SQL on streaming data.

Why this answer

Amazon Kinesis Data Firehose can deliver streaming data to S3. Amazon Kinesis Data Analytics can run SQL queries on the stream. Kinesis Data Streams is for ingestion but requires custom consumers.

Lambda can process streams but not SQL. Glue is for batch ETL.


323

MCQhard

A retail company runs an e-commerce platform on AWS. They have a Data Engineering team that processes clickstream data using Amazon Kinesis Data Streams (KDS) with a shard count of 5. The data is consumed by an AWS Lambda function that transforms and loads the data into an Amazon S3 bucket partitioned by year/month/day/hour. Recently, the team has noticed that the Lambda function is experiencing throttling errors, and the KDS shard iterator age is increasing, indicating that the consumer cannot keep up with the incoming data rate. The team has already increased the Lambda reserved concurrency to 1000 and enabled batch window of 60 seconds. The metrics show that the Lambda function duration is well under the 5-minute timeout, and there are no errors in the transformation logic. The S3 write operations are not failing. Which course of action would MOST effectively resolve the issue without unnecessary cost or complexity?

A.Increase the number of shards in the Kinesis Data Stream to 20 to increase the parallelism of Lambda consumers.

B.Increase the Lambda reserved concurrency to 5000 to allow more parallel executions.

C.Increase the batch window to 300 seconds to accumulate more records per invocation and reduce the number of calls.

D.Switch to using Amazon Kinesis Data Analytics with a larger instance type to process the stream.

AnswerA

More shards allow more concurrent Lambda invocations, improving throughput and reducing iterator age.

Why this answer

The core issue is that the Lambda consumer cannot keep up with the incoming data rate, as evidenced by the increasing shard iterator age. Increasing the shard count from 5 to 20 directly increases the number of Kinesis Data Streams shards, which in turn increases the number of concurrent Lambda invocations (one per shard) and the overall throughput of the stream. This addresses the bottleneck at the source without adding unnecessary complexity or cost, as KDS pricing is based on shard hours and Lambda concurrency is already set to 1000.

Exam trap

The trap here is that candidates often assume increasing Lambda concurrency or batch window will solve throughput issues, but they fail to recognize that Kinesis shard count is the fundamental limiter of parallelism in the Lambda-Kinesis integration.

How to eliminate wrong answers

Option B is wrong because increasing Lambda reserved concurrency to 5000 does not help when the bottleneck is the number of Kinesis shards; Lambda can only process one shard per concurrent invocation, and with only 5 shards, the maximum parallelism is 5, so additional concurrency is unused. Option C is wrong because increasing the batch window to 300 seconds would increase latency and could cause the shard iterator age to grow further, as records would accumulate longer before being processed, worsening the backlog. Option D is wrong because switching to Kinesis Data Analytics introduces a different service (meant for real-time analytics with SQL or Flink) that adds complexity and cost, and does not directly address the consumer throughput limitation caused by insufficient shard parallelism.
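
The parallelism argument above reduces to a small calculation. The sketch below assumes concurrent invocations are capped by shards multiplied by the parallelization factor (1 by default, up to 10) and, separately, by reserved concurrency; the function name is hypothetical.

```python
from typing import Optional


def max_concurrent_invocations(
    shards: int,
    parallelization_factor: int = 1,
    reserved_concurrency: Optional[int] = None,
) -> int:
    """Upper bound on concurrent Lambda invocations for one Kinesis stream."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    limit = shards * parallelization_factor
    if reserved_concurrency is not None:
        # Reserved concurrency can only lower the bound, never raise it.
        limit = min(limit, reserved_concurrency)
    return limit


print(max_concurrent_invocations(5, reserved_concurrency=1000))   # current setup
print(max_concurrent_invocations(5, reserved_concurrency=5000))   # option B
print(max_concurrent_invocations(20, reserved_concurrency=1000))  # option A
```

Under these assumptions, raising reserved concurrency leaves the bound unchanged, while adding shards raises it.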


324

MCQhard

A company uses AWS Lake Formation to manage permissions on a data lake stored in Amazon S3. A data analyst tries to query a table using Amazon Athena but receives an 'Access Denied' error. The analyst has SELECT permission on the table in Lake Formation. What is the most likely cause?

A.The S3 bucket is not registered with Lake Formation

B.The S3 bucket is encrypted with a KMS key that the analyst does not have access to

C.The table does not have any partitions defined

D.The IAM role used by Athena does not have lakeformation:GetDataAccess permission

AnswerA

If the bucket is not registered, Lake Formation cannot control access, and the default S3 permissions apply, which may deny access.

Why this answer

Lake Formation requires that the underlying S3 bucket be registered with Lake Formation and that the IAM role used by Athena has Lake Formation permissions. If the S3 bucket is not registered, Lake Formation cannot grant access to it. Option D (a missing lakeformation:GetDataAccess permission) could also be a cause, but the most common issue is that the bucket is not registered.

Option B (KMS key) is less likely. Option C (no partitions defined) would not cause an Access Denied error.

325

MCQeasy

A data engineering team needs to set up a data pipeline that ingests streaming data from an Apache Kafka cluster running on Amazon EKS into an S3 data lake. The data must be stored in Parquet format, partitioned by date and event type. The team wants a fully managed solution with minimal operational overhead. Which solution should they choose?

A.Use Amazon MSK (Managed Streaming for Apache Kafka) and configure an MSK Connect S3 sink connector.

B.Set up a Kinesis Data Firehose delivery stream that reads from Kafka and writes to S3.

C.Use AWS Glue ETL jobs to pull data from Kafka cluster periodically.

D.Create a Kinesis Data Analytics application to read from Kafka and write to S3.

AnswerA

MSK is fully managed Kafka, and MSK Connect can stream data to S3 in Parquet format.

Why this answer

Option A is correct because Amazon MSK (Managed Streaming for Apache Kafka) is a fully managed Kafka service, and MSK Connect with an S3 sink connector can deliver data directly to S3 in Parquet format. Option D (Kinesis Data Analytics) is for real-time analytics, not for data lake ingestion. Option B (Kinesis Data Firehose) works with Kinesis streams, not Kafka directly.

Option C (Glue ETL) is batch-oriented and adds latency.

326

Multi-Selectmedium

A data engineering team is designing a data pipeline to process streaming data from social media feeds. The data must be deduplicated, enriched with customer information from a relational database, and stored in Amazon S3 in Parquet format. Which AWS services should the team use to build this pipeline? (Select TWO.)

Select 2 answers

A.AWS Glue

B.Amazon Kinesis Data Firehose

C.Amazon Athena

D.Amazon SageMaker

E.Amazon Kinesis Data Streams

AnswersA, E

Glue ETL can transform and enrich data from streams and databases.

Why this answer

Option E is correct because Kinesis Data Streams ingests streaming social media data. Option A is correct because AWS Glue ETL jobs can read from the stream, perform deduplication and enrichment using JDBC connections to the relational database, and write to S3 in Parquet. Option B is wrong because Kinesis Data Firehose does not support enrichment with a relational database.

Option C is wrong because Athena is a query engine, not an ETL tool. Option D is wrong because SageMaker is for ML, not data pipelines.

327

MCQeasy

An ML engineer is using Amazon SageMaker to train a model on a dataset that contains personally identifiable information (PII). The data must be encrypted at rest and in transit. The company uses AWS KMS for key management. How should the engineer configure the SageMaker training job to meet these encryption requirements?

A.Enable S3 Server-Side Encryption (SSE-S3) on the input data bucket

B.Use a custom Docker image with built-in encryption and disable inter-container traffic encryption for performance

C.Use a VPC with an S3 VPC Endpoint and enable SSL for the endpoint

D.Specify a KMS key for the training job's VolumeKmsKeyId and enable inter-container traffic encryption

AnswerD

This encrypts the ML storage volume and inter-container traffic.

Why this answer

Option D is correct because SageMaker supports KMS encryption of the ML storage volume (at rest) and encryption of inter-container traffic (in transit). Option A is wrong because S3 Server-Side Encryption only covers data at rest in S3, not during training. Option C is wrong because SSL does not encrypt the storage volume.

Option B is wrong because disabling inter-container traffic encryption leaves data in transit unprotected.
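
As a rough illustration of the configuration discussed above, the sketch below builds only the encryption-related part of a training job request. The field names `VolumeKmsKeyId`, `OutputDataConfig.KmsKeyId`, and `EnableInterContainerTrafficEncryption` follow the SageMaker `CreateTrainingJob` API; the helper name and the key ARN are hypothetical placeholders, and the other required request fields are omitted.

```python
def encryption_settings(kms_key_arn: str) -> dict:
    """Return the encryption-related fields of a CreateTrainingJob request.

    This is a fragment only: a real request also needs a job name, role,
    algorithm specification, instance settings, an S3 output path, and so on.
    """
    return {
        # Encrypts the ML storage volume attached to the training instances.
        "ResourceConfig": {"VolumeKmsKeyId": kms_key_arn},
        # Encrypts the model artifacts written to S3.
        "OutputDataConfig": {"KmsKeyId": kms_key_arn},
        # Encrypts traffic between containers in distributed training.
        "EnableInterContainerTrafficEncryption": True,
    }


# Hypothetical key ARN, for illustration only.
settings = encryption_settings("arn:aws:kms:us-east-1:111122223333:key/example")
print(settings["EnableInterContainerTrafficEncryption"])
```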

Practice this question →

328

MCQmedium

A data scientist is training a deep learning model using a large dataset stored in S3. The training job runs on a SageMaker training instance with a GPU. The data engineer notices that the GPU utilization is low, and the training is I/O bound. The data is read directly from S3 using the SageMaker SDK. Which change should the data engineer recommend to improve GPU utilization?

A.Increase the batch size in the training script to process more data per step.

B.Mount the S3 bucket to the training instance using Amazon Elastic File System (EFS).

C.Use SageMaker Pipe mode to stream data directly from S3 to the training container.

D.Copy the entire dataset to an Amazon EBS volume attached to the training instance.

AnswerC

Pipe mode streams data from S3 to the training container without first writing it to disk, which keeps the GPU supplied with data.

Why this answer

Option C is correct because enabling SageMaker Pipe mode streams data directly from S3 to the training container without writing to disk, reducing I/O bottlenecks and improving GPU utilization. Option A (increasing batch size) may cause memory issues. Option B (using EFS) adds network latency.

Option D (using EBS) still involves disk I/O.

Practice this question →

329

MCQhard

A company has an AWS Glue ETL job that reads data from an Amazon RDS for MySQL table and writes to Amazon S3 in Parquet format. The job runs daily and processes 500 GB of data. Recently, the job has been failing with memory errors during the write phase. The data schema is wide (200 columns). Which change should a data engineer make to the Glue job to resolve the memory issue?

A.Increase the number of DPUs for the Glue job.

B.Change the output format from Parquet to CSV.

C.Use the JDBC connection with fetchSize parameter.

D.Configure the write operation with 'groupSize' to limit records per file.

AnswerD

Limiting records per file reduces the memory needed for buffering during writes.

Why this answer

Option D is correct. Using the 'groupSize' or 'maxRecordsPerFile' option in Glue's DynamicFrame writer can control the number of records per Parquet file, reducing memory pressure. Option A is wrong because increasing DPUs may help but is a costlier solution.

Option C is wrong because the JDBC fetchSize parameter affects reading, not writing. Option B is wrong because CSV is less efficient and doesn't address the memory issue.

Practice this question →

330

MCQeasy

A data engineer needs to schedule an AWS Glue ETL job to run every hour. Which service should be used to trigger the job?

A.AWS Lambda

B.AWS Step Functions

C.Amazon CloudWatch Events (EventBridge)

D.AWS Data Pipeline

AnswerC

EventBridge can schedule cron jobs to trigger Glue.

Why this answer

Option C is correct because Amazon CloudWatch Events (EventBridge) can trigger Glue jobs on a schedule. Option A is wrong because Lambda can trigger Glue but is not a scheduler. Option D is wrong because Data Pipeline is for complex workflows. Option B is wrong because Step Functions is for state machines, not simple scheduling.

Practice this question →

331

MCQhard

You are a data engineer at a fintech company. The company processes real-time stock market data from multiple exchanges. The data is ingested via Amazon Kinesis Data Streams with 50 shards. Each record is about 1 KB, and the ingestion rate is 5,000 records per second. The data is consumed by a Java application running on Amazon ECS that performs real-time analytics and stores results in Amazon DynamoDB. Recently, the application has been experiencing high latency, and some records are stuck in the shards for minutes before being consumed. The CloudWatch metrics show that the application's CPU utilization is low, but the iterator age is increasing. The application uses the Kinesis Client Library (KCL) with a single worker. What is the most likely cause and how should it be fixed?

A.Increase the number of shards to 200 to provide more throughput.

B.Increase the CPU capacity of the ECS task by moving to a larger instance type.

C.Move the destination from DynamoDB to Amazon RDS to reduce write latency.

D.Scale the number of KCL workers to match the number of shards (e.g., 50 workers) to process shards in parallel.

AnswerD

A single worker can only process one shard at a time; with 50 shards, records in other shards wait. Multiple workers can process shards concurrently, reducing latency.

Why this answer

Option D is correct because a single KCL worker can only process one shard at a time; with 50 shards, records sit idle. Increasing the number of workers (e.g., to 50) allows parallel processing of all shards, reducing iterator age. Option A is wrong because the ingestion rate is well within the 50 shards' capacity (50 MB/s write vs ~5 MB/s actual).

Option B is wrong because CPU is low, not high. Option C is wrong because DynamoDB is not part of the ingestion path.
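
The capacity comparison quoted above can be checked with a few lines of arithmetic. This is a minimal sketch using the figures from the question and the standard per-shard write limits of Kinesis Data Streams (1 MB/s or 1,000 records/s); the variable names are illustrative, and 1 MB is treated as 1,000 KB for simplicity.

```python
RECORDS_PER_SECOND = 5_000
RECORD_SIZE_KB = 1
SHARDS = 50

# Per-shard write limits for Kinesis Data Streams.
SHARD_WRITE_MB_PER_S = 1
SHARD_WRITE_RECORDS_PER_S = 1_000

# Actual ingestion rate, treating 1 MB as 1,000 KB.
ingest_mb_per_s = RECORDS_PER_SECOND * RECORD_SIZE_KB / 1_000

# Aggregate write capacity of the stream.
capacity_mb_per_s = SHARDS * SHARD_WRITE_MB_PER_S
capacity_records_per_s = SHARDS * SHARD_WRITE_RECORDS_PER_S

print(f"ingest: {ingest_mb_per_s} MB/s of {capacity_mb_per_s} MB/s")
print(f"records: {RECORDS_PER_SECOND}/s of {capacity_records_per_s}/s")
print(f"utilization: {ingest_mb_per_s / capacity_mb_per_s:.0%}")
```

Both limits leave ample headroom, which is why adding shards does not address the growing iterator age.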

Practice this question →

332

MCQmedium

A data science team needs to process streaming data from thousands of IoT devices and perform real-time anomaly detection. The data must be persisted in Amazon S3 for batch processing later. Which combination of AWS services should be used to meet these requirements?

A.Amazon Kinesis Data Streams for ingestion, Amazon Kinesis Data Analytics for anomaly detection, and Amazon Kinesis Data Firehose to deliver data to Amazon S3.

B.Amazon Kinesis Data Streams for ingestion, AWS Glue for anomaly detection, and Amazon S3 for storage.

C.AWS Lambda for both ingestion and anomaly detection, and Amazon S3 for storage.

D.Amazon Simple Queue Service (SQS) for ingestion, AWS Lambda for anomaly detection, and Amazon S3 for storage.

AnswerA

This combination provides real-time ingestion, analytics, and durable storage.

Why this answer

Option A is correct because Kinesis Data Streams ingests streaming data, Kinesis Data Analytics performs real-time anomaly detection, and Firehose delivers data to S3 for batch processing. Option D is wrong because SQS is not optimized for streaming and does not have built-in analytics. Option C is wrong because Lambda alone cannot handle high-throughput streaming ingestion.

Option B is wrong because Glue is a batch ETL service, not real-time.

Practice this question →

333

MCQmedium

A machine learning team needs to process a large dataset stored in Amazon S3 using Apache Spark. They want to minimize cost and avoid managing infrastructure. Which AWS service should they use?

A.AWS Glue

B.Amazon Athena

C.Amazon EMR

D.Amazon SageMaker

AnswerA

Glue provides serverless Spark for ETL on S3 data.

Why this answer

AWS Glue is a serverless Spark environment that can process data in S3 without provisioning clusters. EMR requires cluster management. Athena is SQL-only.

SageMaker is for training models, not general-purpose Spark.

Practice this question →

334

Matchingmedium

Match each AWS service to its primary purpose in a machine learning pipeline.

Concepts

Matches

Build, train, and deploy ML models

ETL and data cataloging

Object storage for datasets and models

Serverless compute for preprocessing

Image and video analysis

Why these pairings

These services are commonly used in ML workflows on AWS.

Practice this question →

335

MCQeasy

A company is building a data pipeline to process streaming data from IoT devices. The data must be ingested with low latency, transformed in real-time using custom logic, and stored in Amazon S3 partitioned by device ID and timestamp. Which combination of AWS services should the company use to meet these requirements?

A.Amazon Kinesis Data Firehose with direct S3 delivery

B.Amazon Managed Streaming for Apache Kafka (MSK) with Amazon S3 sink connector

C.Amazon DynamoDB Streams with AWS Lambda and Amazon S3

D.Amazon Kinesis Data Streams with AWS Lambda and Amazon S3

AnswerD

Kinesis Data Streams for ingestion, Lambda for real-time transformation, and S3 for storage with partitioning.

Why this answer

Option D is correct because Amazon Kinesis Data Streams provides low-latency ingestion, AWS Lambda can apply custom transformation logic in real time, and Amazon S3 can store the data partitioned by device ID and timestamp. Option A is wrong because Kinesis Data Firehose does not support custom transformation without Lambda and, with direct S3 delivery as described, does not partition by device ID. Option B is wrong because Amazon MSK (Managed Streaming for Apache Kafka) is more complex than needed and not as tightly integrated.

Option C is wrong because Amazon DynamoDB Streams captures changes to DynamoDB tables and is not designed for ingesting this kind of streaming data.

Practice this question →

336

MCQmedium

A data engineer needs to continuously ingest streaming data from thousands of IoT devices and store the raw data in Amazon S3 for archival processing. The data volume varies significantly throughout the day, and the solution must be serverless, scalable, and cost-effective. Which AWS service should be used to capture and buffer the streaming data before writing to S3?

A.Amazon Kinesis Data Firehose

B.Amazon Kinesis Data Streams

C.AWS Glue

D.Amazon Simple Queue Service (SQS)

AnswerA

Kinesis Data Firehose is a serverless service that can directly deliver streaming data to S3 with buffering.

Why this answer

Amazon Kinesis Data Firehose is a serverless service that can capture, transform, and load streaming data into S3. It automatically scales and can buffer data to handle varying volumes. Amazon Kinesis Data Streams (option B) is for real-time processing but requires a consumer to write to S3.

Amazon SQS (option D) is a message queue, not designed for streaming data. AWS Glue (option C) is a batch ETL service.

Practice this question →

337

MCQhard

A data engineering team is designing a data pipeline to process large CSV files (10-50 GB each) stored in Amazon S3. The pipeline must transform the data using AWS Glue and load it into Amazon Redshift for analytics. The team wants to minimize costs while ensuring the pipeline can handle peak loads. Which approach is the most cost-effective?

A.Use AWS Lambda to process each file and load into Redshift.

B.Use Amazon EMR with Hive to transform the data and load into Redshift.

C.Use an AWS Glue Python shell job with a single r5.xlarge worker.

D.Use AWS Glue with Spark and dynamic frames, scaling the number of workers based on file size.

AnswerD

Correct: Glue Spark jobs handle large files efficiently; dynamic frames simplify schema handling.

Why this answer

Option D (use AWS Glue with Spark and dynamic frames) is correct because Glue's Spark-based ETL can handle large files efficiently, and using dynamic frames allows schema inference without manual parsing. Option C (a Python shell job with a single r5.xlarge worker) may not handle peak loads. Option A (Lambda) has time and memory limits unsuitable for large files.

Option B (EMR with Hive) is more complex and typically more expensive than Glue for this use case.

Practice this question →

338

MCQhard

A company runs a real-time recommendation system that uses Amazon SageMaker endpoints for inference. The system ingests user activity data from a mobile app via Amazon API Gateway and AWS Lambda, which writes events to an Amazon Kinesis Data Stream. A second Lambda function consumes the stream, calls a SageMaker endpoint to generate recommendations, and stores the results in Amazon DynamoDB. The system has been working well, but recently the team noticed an increase in latency from the time a user action occurs to when the recommendation is stored. The SageMaker endpoint shows increased invocation latency but no throttling. CloudWatch metrics show that the Kinesis stream's IteratorAgeMilliseconds is increasing, indicating the consumer is falling behind. The Lambda consumer's duration is within limits, but the number of invocations is lower than expected. The team suspects the issue is with the event source mapping. Which course of action should the team take to reduce the latency?

A.Increase the batch size in the event source mapping to process more records per invocation.

B.Increase the number of shards in the Kinesis data stream to increase parallelism.

C.Decrease the Lambda function's reserved concurrency to force it to scale down.

D.Replace the Lambda consumer with an Amazon Kinesis Data Firehose delivery stream.

AnswerA

Larger batches improve throughput by reducing overhead per invocation.

Why this answer

Option A is correct because the consumer is falling behind despite adequate Lambda duration, suggesting that the batch size or parallelization factor is too low. Increasing the batch size allows each Lambda invocation to process more records, increasing throughput. Option B is wrong because increasing shards increases cost and may not help if the consumer is the bottleneck.

Option C is wrong because reducing concurrency would worsen the situation. Option D is wrong because the Lambda function is already consuming from Kinesis; using Firehose would not directly solve the consumer lag.

Practice this question →

339

Multi-Selectmedium

A company is building a data lake on Amazon S3 and wants to ensure that data is encrypted at rest using AWS KMS. Which TWO actions are required to achieve this? (Choose TWO.)

Select 2 answers

A.Configure the KMS key policy to allow the S3 service to use the key

B.Enable default encryption on the S3 bucket with SSE-KMS

C.Add a bucket policy that denies PutObject without encryption

D.Enable encryption in transit using HTTPS for all S3 API calls

E.Use client-side encryption on all data before uploading

AnswersA, B

The key policy must grant the S3 service principal permission to encrypt/decrypt.

Why this answer

Option B is correct because default bucket encryption with SSE-KMS ensures all objects are encrypted with KMS. Option A is correct because the KMS key policy must allow the key to be used for the bucket's objects. Option C is wrong because a bucket policy restricts requests but is not required once default encryption is enabled.

Option E is wrong because client-side encryption is not required if server-side encryption is used. Option D is wrong because encryption in transit (HTTPS) is separate from at-rest encryption.
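
For reference, default SSE-KMS encryption on a bucket is expressed as a small configuration document. The sketch below builds one in the shape expected by the S3 `PutBucketEncryption` operation (its `ServerSideEncryptionConfiguration` parameter); the helper name and the key ARN are hypothetical, and no AWS call is made.

```python
def default_sse_kms_config(kms_key_id: str) -> dict:
    """Build a default-encryption configuration that applies SSE-KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    # "aws:kms" selects SSE-KMS rather than SSE-S3 ("AES256").
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_id,
                }
            }
        ]
    }


# Hypothetical key ARN, for illustration only.
config = default_sse_kms_config("arn:aws:kms:us-east-1:111122223333:key/example")
print(config["Rules"][0]["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"])
```

With boto3, a dictionary like this would be passed as `ServerSideEncryptionConfiguration` to the S3 client's `put_bucket_encryption` call, alongside the bucket name.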

Practice this question →

340

Drag & Dropmedium

Drag and drop the steps to set up cross-validation in a SageMaker training job using the built-in XGBoost algorithm in the correct order.

Steps

Order

Why this order

Cross-validation requires data splitting, job configuration with CV parameters, execution, and model selection.

Practice this question →

341

MCQeasy

A data scientist needs to run a one-time query on 10 TB of data stored in S3 using Amazon Athena. The query scans 5 TB and returns a small result set. Which approach minimizes cost?

A.Query the data directly in Athena without any preprocessing

B.Create an S3 Select query to filter data before Athena

C.Use Amazon Redshift Spectrum to query the data

D.Use AWS Glue to convert the data to Parquet format and repartition by date

AnswerA

For a one-time query, scanning 5 TB at $5 per TB is $25, which is minimal compared to preprocessing costs.

Why this answer

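Athena charges based on data scanned. Partitioning and columnar formats such as Parquet reduce the data scanned by later queries, but for a one-time query the conversion job must itself read the data, so the preprocessing cost is never recovered. Redshift Spectrum would require a cluster.

S3 Select filters one object at a time and is not a preprocessing step for Athena.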
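
The cost figure quoted for this question is simple arithmetic. A minimal sketch, assuming the $5 per TB scanned rate stated in the answer (actual pricing varies by region and over time); the function name is illustrative.

```python
PRICE_PER_TB_SCANNED = 5.00  # USD, the rate quoted in the answer above


def athena_query_cost(tb_scanned: float, price_per_tb: float = PRICE_PER_TB_SCANNED) -> float:
    """Estimate the cost of a single Athena query from the data it scans."""
    return tb_scanned * price_per_tb


one_time_cost = athena_query_cost(5)
print(f"one-time query: ${one_time_cost:.2f}")
```

Preprocessing only pays off when the reduced scan size is reused across many queries, which is not the case here.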

Practice this question →

342

MCQmedium

A data engineer is responsible for managing a data lake on Amazon S3. The data lake contains CSV files from various sources, totaling 10 TB. The engineer needs to make this data queryable using Amazon Athena. However, Athena queries are currently taking a long time and scanning large amounts of data. The engineer has noticed that the CSV files are not partitioned, and there are no indexes. The engineer wants to improve query performance and reduce costs. The data is accessed frequently for the last 30 days, but older data is rarely queried. The engineer also wants to minimize the amount of data scanned by Athena. What should the engineer do?

A.Convert the CSV files to JSON format and use Athena to query them.

B.Convert the CSV files to Parquet format and partition the data by date.

C.Create indexes on the S3 objects using AWS Glue.

D.Convert the CSV files to ORC format and create a view in Athena.

AnswerB

Parquet is columnar and compressed; partitioning by date allows partition pruning, reducing scan size.

Why this answer

Option B is correct. Converting CSV to Parquet reduces scan size due to columnar storage and compression. Partitioning by date allows Athena to skip irrelevant partitions.

Option A is wrong because JSON is not columnar and the conversion does not address the partitioning issue. Option D is wrong because converting to ORC without partitioning helps, but not as much as also partitioning by date. Option C is wrong because Athena does not support indexes.
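
To make the partition-pruning argument concrete, the sketch below lists Hive-style date partitions (`dt=YYYY-MM-DD`) and compares a 30-day window against a full year of data. The bucket name, prefix layout, and helper name are hypothetical; the point is only that a date filter lets the query engine read a small fraction of the prefixes.

```python
from datetime import date, timedelta


def partition_prefixes(start: date, end: date,
                       base: str = "s3://example-bucket/events/") -> list[str]:
    """Return one Hive-style prefix per day between start and end, inclusive."""
    days = (end - start).days + 1
    return [f"{base}dt={start + timedelta(days=i):%Y-%m-%d}/" for i in range(days)]


# A full year of daily partitions versus the most recent 30 days.
all_partitions = partition_prefixes(date(2024, 1, 1), date(2024, 12, 31))
recent_partitions = partition_prefixes(date(2024, 12, 2), date(2024, 12, 31))

fraction = len(recent_partitions) / len(all_partitions)
print(len(all_partitions), len(recent_partitions))
print(f"fraction of partitions read: {fraction:.1%}")
```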

Practice this question →

343

MCQeasy

A company uses Amazon RDS for its transactional database and needs to export a daily snapshot of a table to Amazon S3 in Parquet format for analytics. Which AWS service can perform this export without writing custom code?

A.Amazon Redshift

B.AWS Database Migration Service (DMS)

C.Amazon Athena

D.AWS Glue

AnswerD

Glue can run scheduled ETL jobs to extract from RDS and write to S3 in Parquet.

Why this answer

AWS Glue can connect to RDS, extract data, and write to S3 in Parquet format using a Glue job. AWS Database Migration Service (DMS) is primarily for migration, not scheduled exports. Amazon Athena cannot directly connect to RDS.

Amazon Redshift is a data warehouse, not a migration tool.

Practice this question →

344

Multi-Selectmedium

A data engineer is building a streaming pipeline using Amazon Kinesis Data Streams and AWS Lambda. The Lambda function processes records and writes results to Amazon S3. The engineer notices that the Lambda function is experiencing throttling and some records are being dropped. Which TWO actions should the engineer take to improve the reliability of the pipeline?

Select 2 answers

A.Increase the number of shards in the Kinesis data stream.

B.Set a reserved concurrency on the Lambda function to prevent other functions from using its capacity.

C.Add a Dead Letter Queue to the Lambda function to capture failed records.

D.Decrease the batch size in the Lambda event source mapping.

E.Increase the Kinesis stream's retention period to 7 days.

AnswersA, B

More shards increase parallelism and throughput.

Why this answer

Increasing the number of shards in the Kinesis stream increases throughput and reduces throttling. Setting the Lambda function's reserved concurrency ensures it has sufficient capacity to handle the incoming records.

Practice this question →

345

MCQeasy

A data engineer wants to stream clickstream data from a web application to Amazon S3 for near-real-time analytics. Which AWS service should be used to ingest and buffer the data before landing in S3?

A.Amazon AppFlow

B.Amazon Kinesis Data Streams

C.Amazon Kinesis Data Firehose

D.AWS Glue

AnswerC

Firehose can directly deliver streaming data to S3.

Why this answer

Amazon Kinesis Data Firehose is the correct service for loading streaming data into S3 without custom code. Option B is wrong because Kinesis Data Streams requires a consumer to write to S3. Option A is wrong because AppFlow is for SaaS integrations.

Option D is wrong because Glue is for ETL, not real-time streaming.

Practice this question →

346

MCQmedium

A company uses AWS Glue jobs with job bookmarks enabled to process incremental data. They notice that the job processes all data each time instead of only new data. What is the most likely reason?

A.The TempDir is not configured correctly.

B.The job bookmark option is set to 'job-bookmark-enable' but should be 'job-bookmark-disable'.

C.The source data does not have a column that can be used as a bookmark key.

D.The MaxConcurrentRuns is set to 3, which can cause bookmark conflicts.

AnswerD

Multiple concurrent runs can corrupt bookmark state.

Why this answer

Job bookmarks require a unique identifier to track processed data. If the source data lacks a monotonically increasing column or the schema does not include a suitable key, bookmarks may not work. Setting MaxConcurrentRuns > 1 can cause issues with bookmarks.

Option D is correct: MaxConcurrentRuns is set to 3, and concurrent runs can interfere with bookmark state.

Practice this question →

347

MCQmedium

Refer to the exhibit. An IAM policy is attached to a data engineering team's role. The team needs to upload data to the 'confidential' prefix in the 'my-data-lake' bucket. However, they are receiving 'AccessDenied' errors. What is the likely cause?

A.The condition in the Deny statement requires the team to use a specific source IP address.

B.The Allow statement only grants GetObject and PutObject, but the team needs ListBucket.

C.The Deny statement with the condition explicitly denies access to the 'confidential' prefix for accounts other than 123456789012.

D.The Allow statement's resource does not include the 'confidential' prefix.

AnswerC

The Deny statement applies to all actions on the confidential prefix for accounts not matching 123456789012, overriding the Allow.

Why this answer

Option C is correct. The Deny statement denies all S3 actions on the confidential prefix for any principal account that is not 123456789012. If the team's role is from a different account (e.g., 111111111111), the Deny applies.

Option B is incorrect because uploading requires PutObject, which the Allow statement grants; ListBucket is not needed for uploads. Option D is incorrect because, even if the Allow statement does not name the confidential prefix explicitly, it is the explicit Deny that overrides the Allow. Option A is incorrect because the condition does not require the team to use a specific source IP.

Practice this question →

348

MCQeasy

An S3 event notification is configured to trigger a Lambda function when new objects are created. The Lambda function processes the event JSON shown. Which field should the function use to read the new object from S3?

A.s3.s3SchemaVersion

B.awsRegion

C.eventName

D.s3.bucket.arn and s3.object.key

AnswerD

These provide the bucket ARN and object key.

Why this answer

The event JSON contains the bucket name under s3.bucket.name and the object key under s3.object.key. The function should use these to construct the S3 URI and read the object. The other fields are not sufficient.
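
A minimal sketch of a handler helper that extracts these fields is shown below. The exhibit is not reproduced here, so the sample event is a hypothetical one modeled on the S3 notification structure, and the function name is illustrative. Object keys arrive URL-encoded in S3 events, so the sketch decodes them before use.

```python
from urllib.parse import unquote_plus


def object_locations(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for every record in an S3 event."""
    locations = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Keys are URL-encoded in the event (for example, spaces become "+").
        key = unquote_plus(s3["object"]["key"])
        locations.append((bucket, key))
    return locations


# Hypothetical event, for illustration only.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "awsRegion": "us-east-1",
            "s3": {
                "s3SchemaVersion": "1.0",
                "bucket": {"name": "example-bucket", "arn": "arn:aws:s3:::example-bucket"},
                "object": {"key": "raw/2024/report+final.csv"},
            },
        }
    ]
}

print(object_locations(sample_event))
```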

Practice this question →

349

MCQmedium

A company is using AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The jobs are currently failing due to insufficient memory. The data volume varies, with occasional spikes. Which solution should be used to handle the variable memory requirements efficiently?

A.Migrate the ETL jobs to Amazon EMR with Apache Spark

B.Use AWS Glue Flex execution to allocate resources dynamically

C.Increase the number of DPUs (Data Processing Units) for all jobs

D.Split the jobs into smaller steps and run them sequentially

AnswerB

Flex execution provides flexible resources that adapt to workload, optimizing cost and performance.

Why this answer

AWS Glue Flex execution allows jobs to use flexible resources that can handle varying memory needs, and it is cost-effective for variable workloads. Option C (increasing DPUs) would waste resources during low volume. Option A (using Apache Spark on EMR) increases management overhead.

Option D (splitting jobs) adds complexity and may not handle spikes.

Practice this question →

350

MCQmedium

A company uses Kinesis Data Streams to ingest real-time sensor data. The data is consumed by a Lambda function that writes to DynamoDB. During peak hours, the Lambda function throws ProvisionedThroughputExceededException. The team wants to decouple the write operation and improve resilience. What should they do?

A.Use Kinesis Firehose as a consumer of the stream, with a Lambda transformation to write to DynamoDB, and enable error handling.

B.Increase the Lambda function's reserved concurrency and provision more DynamoDB write capacity.

C.Place the Lambda function's output into an Amazon SQS queue, and have a second Lambda function write to DynamoDB.

D.Use Kinesis Data Analytics to process the stream and write results directly to DynamoDB.

AnswerA

Firehose buffers data, retries on failures, and decouples the producer from DynamoDB writes.

Why this answer

Option A is correct because Kinesis Firehose can buffer data and write to DynamoDB via Lambda, providing retry and decoupling. Option C is wrong because SQS does not integrate directly with Kinesis. Option D is wrong because Kinesis Data Analytics does not write to DynamoDB directly.

Option B is wrong because Lambda is already being used; the issue is throughput, not compute.

Practice this question →

351

MCQeasy

A company uses Amazon Kinesis Data Firehose to deliver data to an Amazon S3 bucket. The data is organized by year/month/day/hour. The team needs to ensure that all data is encrypted at rest in S3 using an AWS KMS customer managed key (CMK). Which configuration should the team implement?

A.Configure the S3 bucket's default encryption to use the customer managed KMS key.

B.Use an AWS Lambda function to encrypt the data after it is delivered to S3.

C.In the Firehose delivery stream configuration, enable S3 destination encryption and select the customer managed KMS key.

D.Add a bucket policy that denies PutObject unless the request includes the correct KMS key.

AnswerC

Firehose supports SSE-KMS for the S3 destination directly.

Why this answer

Option C is correct because Kinesis Data Firehose can be configured to use server-side encryption with AWS KMS (SSE-KMS) for the S3 destination. Option A is wrong because S3 default encryption is applied at the bucket level, but Firehose can override it; the recommended approach is to configure encryption in Firehose. Option D is wrong because the bucket policy with Deny is not needed if encryption is configured properly.

Option B is wrong because objects would not be encrypted with the customer managed key until the Lambda function runs, and the approach adds unnecessary custom code.

Practice this question →

352

Multi-Selecteasy

A company needs to move 50 TB of data from an on-premises data center to Amazon S3. The company has a limited internet bandwidth of 100 Mbps. The data transfer must be completed within 10 days. Which TWO services should the company use together to meet these requirements?

Select 2 answers

A.Amazon S3 as the destination

B.AWS Direct Connect

C.AWS Site-to-Site VPN

D.AWS Snowball Edge

E.AWS DataSync over the internet

AnswersA, D

Data is ultimately stored in S3.

Why this answer

Options A and D are correct. AWS Snowball Edge is a physical device for large data transfer, and S3 is the destination. Option E is wrong because transferring 50 TB over the internet at 100 Mbps would take well over 40 days, far beyond the 10-day limit.

Option B is wrong because Direct Connect is a dedicated network connection, not a physical transfer, and cannot be provisioned in time. Option C is wrong because VPN is not designed for bulk data transfer and is still limited by the 100 Mbps link.
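
The transfer-time estimate can be reproduced with a short calculation. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully utilized link; real transfers take longer because of protocol overhead and competing traffic. The function name is illustrative.

```python
def transfer_days(terabytes: float, megabits_per_second: float) -> float:
    """Estimate the days needed to move a dataset over a network link."""
    bits = terabytes * 1e12 * 8                    # decimal terabytes to bits
    seconds = bits / (megabits_per_second * 1e6)   # link speed in bits per second
    return seconds / 86_400                        # seconds per day


days = transfer_days(50, 100)
print(f"{days:.1f} days at full line rate")
```

Even under these optimistic assumptions the estimate is several times the 10-day window, which is why an offline transfer device is the practical choice.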

Practice this question →

353

MCQhard

A company is using Amazon Redshift for data warehousing. The data engineering team notices that queries are slow and the system is frequently writing to disk due to insufficient memory. Which type of workload management (WLM) configuration change would help reduce disk writes?

A.Increase the number of query concurrency slots.

B.Increase the memory percentage allocated to the WLM queue.

C.Enable query monitoring rules to abort queries that spill to disk.

D.Enable short query acceleration (SQA).

AnswerB

More memory per query reduces disk spill.

Why this answer

If queries spill to disk, they need more memory. Increasing the memory percentage allocated to the queue reduces disk spills. Option A is wrong because concurrency slots increase parallelism but may reduce memory per query.

Option C is wrong because query monitoring rules only flag issues, not fix them. Option D is wrong because short query acceleration is for short queries, not memory issues.

Practice this question →

354

MCQmedium

A data engineer needs to design a data pipeline that ingests CSV files from an SFTP server daily, transforms them, and loads them into Amazon Redshift. The files are typically 2-3 GB. Which combination of AWS services is MOST appropriate?

A.Use AWS Glue ETL with a JDBC connection to the SFTP server to read files directly.

B.Use AWS Lambda to download the files from SFTP, transform them in memory, and write to Redshift using the Data API.

C.Use AWS Transfer Family to automate SFTP file retrieval to S3, then use Redshift COPY to load data.

D.Use Amazon Kinesis Data Firehose with an HTTP endpoint source to receive files from SFTP.

AnswerC

Transfer Family handles SFTP natively, and COPY loads data efficiently into Redshift.

Why this answer

Option C is correct because AWS Transfer Family provides SFTP integration, and the Amazon Redshift COPY command efficiently loads large files from S3. Option B is wrong because Lambda has a 15-minute timeout and a 6 MB payload limit for invocation. Option A is wrong because Glue ETL can read from S3 but does not directly connect to SFTP.

Option D is wrong because Kinesis is for streaming, not batch file transfers.

Practice this question →

355

MCQhard

An e-commerce company uses Amazon DynamoDB as the primary data store for user sessions. They want to run analytics on historical session data using Amazon Athena. What is the recommended approach to export DynamoDB data to S3 in a format optimized for Athena?

A.Use AWS Data Pipeline to copy data to S3 as CSV

B.Use Amazon Kinesis Data Firehose to stream data from DynamoDB to S3

C.Use DynamoDB Streams with AWS Lambda to write to S3 as JSON

D.Use AWS Glue ETL to read from DynamoDB and write to S3 as Parquet

AnswerD

Glue can efficiently export data and convert to columnar format.

Why this answer

AWS Glue ETL can read from DynamoDB and write to S3 in Parquet format, which is optimal for Athena. Option C is wrong because DynamoDB Streams to Lambda to S3 is complex and not optimal. Option A is wrong because Data Pipeline is legacy.

Option B is wrong because Kinesis is for streaming.

Practice this question →

356

MCQhard

Refer to the exhibit. A data engineer has attached this IAM policy to an IAM role used by an AWS Glue ETL job. The job reads from an S3 bucket (data-bucket) that is encrypted with SSE-KMS using the key arn:aws:kms:us-east-1:123456789012:key/abc123, transforms the data, and writes the result to a different S3 bucket (output-bucket) encrypted with a different KMS key (arn:aws:kms:us-east-1:123456789012:key/xyz789). When the job runs, it fails with an access denied error. What is the cause?

A.The policy does not include s3:GetObject permission for the output bucket.

B.The policy does not include glue:CreateTable permission.

C.The policy does not include s3:PutObject permission for the output bucket.

D.The policy does not grant kms:Encrypt permission for the output bucket's KMS key.

AnswerD

To write to an SSE-KMS encrypted bucket, the role needs kms:Encrypt or kms:GenerateDataKey for that key.

Why this answer

Option D is correct because the policy grants kms:Decrypt and kms:GenerateDataKey for the input bucket's key, but does not grant the equivalent permissions for the output bucket's key (xyz789). The job needs to encrypt the output, so it needs kms:Encrypt (or kms:GenerateDataKey) for the output key. Option C is wrong because the policy includes s3:PutObject.

Option B is wrong because the Glue catalog permissions are sufficient. Option A is wrong because the policy includes s3:GetObject.

Practice this question →

357

MCQmedium

A data engineering team needs to move 10 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The data is currently stored in HDFS. Which service should they use for an efficient transfer?

A.AWS DataSync

B.S3 Transfer Acceleration

C.Amazon Kinesis Data Streams

D.AWS Snowball Edge

AnswerA

DataSync can transfer data from HDFS to S3.

Why this answer

AWS DataSync can transfer data from on-premises HDFS to S3 efficiently, with built-in encryption and validation. Option B is wrong because S3 Transfer Acceleration only speeds up uploads to S3 over the network; it does not read from HDFS. Option D is wrong because Snowball Edge is for offline transfer.

Option C is wrong because Kinesis is for streaming, not bulk historical data.


358

Drag & Dropmedium

Drag and drop the steps to perform hyperparameter tuning using SageMaker Automatic Model Tuning in the correct order.

Why this order

Tuning involves defining search space, creating a tuning job, setting limits, executing, and selecting best model.


359

MCQhard

Refer to the exhibit. A team deploys this CloudFormation stack. The Kinesis stream is created, but the Firehose delivery stream fails to create with a 'Resource handler returned message: Unable to assume role' error. What is the most likely cause?

A.The shard count of 2 is too low for Firehose to read from.

B.The Kinesis stream is encrypted with KMS, but Firehose does not have permission to decrypt.

C.The retention period of 168 hours is too long for Firehose.

D.The IAM role 'firehose-role' does not exist in the AWS account.

AnswerD

Correct: The role ARN is hardcoded; if the role is missing, Firehose cannot assume it.

Why this answer

The Firehose role 'firehose-role' is specified with a static ARN that includes an account ID. If the stack is deployed in a different account, or if the role does not exist, the role assumption fails. Option D (the role does not exist in the account) is therefore the most likely cause.

Option A (shard count) is fine: a stream with 2 shards can be read by Firehose. Option B (stream encryption) is unrelated to an 'Unable to assume role' error. Option C (retention period) is fine.


360

MCQeasy

A data engineer is building a data pipeline to process user clickstream data. The data arrives as JSON files in an S3 bucket. The pipeline must transform the JSON into Parquet format and partition by date and event type, then make the data available for Amazon Athena queries. The engineer needs a fully managed, serverless solution with minimal operational overhead. Which combination of AWS services should the engineer use?

A.Use Amazon EMR with Spark to read JSON, convert to Parquet, and partition, then query with Athena.

B.Use AWS Glue ETL jobs to read JSON from S3, transform to Parquet, and write to a partitioned S3 location, then use Athena.

C.Use S3 Event Notifications to trigger an AWS Lambda function that converts the JSON to Parquet and writes to a partitioned S3 location, then query with Athena.

D.Use Amazon Kinesis Firehose to ingest data and convert to Parquet, then write to S3, and query with Athena.

AnswerC

Lambda is serverless, cost-effective for per-file processing, and can partition output easily.

Why this answer

Option C is correct because AWS Lambda triggered by S3 Event Notifications provides a fully serverless, event-driven architecture with minimal operational overhead for converting JSON to Parquet and partitioning by date and event type. Lambda can process each new JSON file as it arrives, perform the transformation in memory (using libraries like PyArrow or Pandas), and write the Parquet output to a partitioned S3 path, which Athena can then query directly. This approach avoids managing any clusters or job scheduling, aligning with the requirement for a fully managed, serverless solution.
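As a rough illustration of the partitioning step, the sketch below builds a Hive-style S3 key from an event's date and type, which is the layout Athena can use for partition pruning. The function name, the `processed` prefix, and the file name are hypothetical; the actual JSON-to-Parquet conversion is left out.

```python
from datetime import datetime, timezone


def partitioned_key(event_type, event_time, file_name, prefix="processed"):
    """Return a Hive-style partitioned S3 key for one converted file."""
    # Normalise to UTC so the same event always lands in the same date partition.
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{prefix}/date={day}/event_type={event_type}/{file_name}.parquet"


key = partitioned_key(
    "click",
    datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    "part-0001",
)
print(key)  # processed/date=2024-01-15/event_type=click/part-0001.parquet
```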

Exam trap

A common misconception is that AWS Glue is the only serverless ETL option. The trap here is that Lambda with S3 Event Notifications is a simpler, fully serverless alternative for file-based transformations when the workload fits within Lambda's constraints.

How to eliminate wrong answers

Option A is wrong because Amazon EMR with Spark requires provisioning and managing a cluster (even if ephemeral), incurring operational overhead and not being fully serverless; it also introduces complexity for a simple transformation task. Option B is wrong because AWS Glue ETL jobs, while serverless, involve job scheduling, startup latency, and cost for each job run, and are overkill for a real-time, event-driven pipeline where Lambda can handle the transformation more efficiently with lower latency and cost. Option D is wrong because Amazon Kinesis Firehose is designed for streaming data ingestion, not for batch processing of existing JSON files in S3; it cannot be triggered by S3 events to process files already stored, and its Parquet conversion is limited to the Firehose delivery stream, not arbitrary file transformations.


361

Multi-Selectmedium

A data engineering team is designing a data pipeline that processes streaming data from Amazon Kinesis Data Streams using AWS Lambda. The team notices that some records are being processed multiple times (duplicates). Which TWO steps should the team take to ensure exactly-once processing?

Select 2 answers

A.Design the Lambda function to be idempotent.

B.Use a unique record identifier and store processed IDs in an external store like DynamoDB.

C.Increase the batch size to reduce the number of invocations.

D.Use Kinesis Producer Library (KPL) to guarantee exactly-once delivery.

E.Disable retries on the Lambda function.

AnswersA, B

Idempotency ensures repeated processing produces same result.

Why this answer

Options A and B are correct. Making the Lambda function idempotent ensures that processing the same record multiple times does not cause duplicates downstream. Using a unique identifier per record and checking it against a DynamoDB table allows deduplication.

Option C is wrong because a larger batch does not prevent duplicates; when a batch fails, more records are retried. Option D is wrong because the Kinesis Producer Library does not guarantee exactly-once delivery, so deduplication is still needed. Option E is wrong because disabling retries may cause data loss.
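The deduplication idea can be sketched in a few lines. Here an in-memory class stands in for the DynamoDB table (with DynamoDB, a conditional write that only succeeds when the ID is absent would play the same role); the class, the function, and the record fields `id` and `payload` are hypothetical names chosen for the example.

```python
class ProcessedStore:
    """In-memory stand-in for a table of already-processed record IDs."""

    def __init__(self):
        self._seen = set()

    def mark_if_new(self, record_id):
        """Return True the first time an ID is seen, False on repeats."""
        if record_id in self._seen:
            return False
        self._seen.add(record_id)
        return True


def handle_batch(records, store, sink):
    """Append each record's payload to sink unless its ID was already processed."""
    for record in records:
        if store.mark_if_new(record["id"]):
            sink.append(record["payload"])
    return sink


store, sink = ProcessedStore(), []
batch = [{"id": "r1", "payload": "a"}, {"id": "r2", "payload": "b"}]
handle_batch(batch, store, sink)
handle_batch(batch, store, sink)  # simulated redelivery of the same batch
print(sink)  # ['a', 'b']
```

Because the second delivery finds both IDs already recorded, the output is unchanged, which is the behaviour an idempotent consumer needs.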


362

MCQmedium

A company is building a data lake on Amazon S3 and wants to use AWS Glue to catalog the data. The data includes CSV, Parquet, and JSON files. The team wants to ensure that the Glue crawler can infer the schema correctly and update the Data Catalog when new partitions are added. Which crawler configuration should be used?

A.Create separate crawlers for each file format and schedule them at different times.

B.Use a crawler that only catalogs Parquet files because they are more efficient.

C.Use a crawler with 'Update all new and existing partitions' disabled to avoid schema conflicts.

D.Create a single crawler that includes all file extensions and set the 'Update all new and existing partitions' option.

AnswerD

Correct: Single crawler with partition updates ensures comprehensive cataloging.

Why this answer

Option D (create a single crawler that includes all supported file types and enable 'Update all new and existing partitions') is correct because Glue can handle multiple formats, and updating partitions ensures new data is cataloged. Option A (separate crawlers per format) is unnecessary. Option B (only Parquet) ignores the other formats.

Option C (disable partition updates) would miss new partitions.


363

MCQeasy

A company uses Amazon Kinesis Data Streams to ingest clickstream data from a website. The data is consumed by a custom application that runs on Amazon EC2 instances. The company notices that the consumer application is falling behind the producer, causing data to be throttled. Which action should the company take to improve the consumer's throughput?

A.Reduce the data retention period of the stream

B.Increase the number of shards in the Kinesis data stream

C.Increase the maximum concurrency of the AWS Lambda function that processes the stream

D.Use Amazon Kinesis Data Firehose to deliver data to Amazon S3

AnswerB

More shards increase the stream's read and write capacity.

Why this answer

Option B is correct because increasing the number of shards increases the stream's capacity and allows more consumer workers to read in parallel. Option C is wrong because the consumer runs on EC2 instances, not Lambda, so Lambda concurrency has no effect on it. Option D is wrong because Kinesis Data Firehose is a different delivery service and does not speed up the existing consumer.

Option A is wrong because reducing the retention period does not increase throughput.
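A back-of-the-envelope shard estimate can make the capacity argument concrete. The sketch below assumes the standard per-shard limits for shared-throughput consumers (1 MB/s of writes and 2 MB/s of reads per shard) and ignores the per-shard record-count limits; the function name and the sample rates are made up for illustration.

```python
import math

WRITE_MB_PER_SHARD = 1.0  # ingest limit per shard
READ_MB_PER_SHARD = 2.0   # shared read limit per shard


def shards_needed(write_mb_per_s, read_mb_per_s):
    """Estimate the shard count required to satisfy both write and read rates."""
    for_writes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    return max(1, for_writes, for_reads)


# Producers write 3 MB/s; consumers together need to read 10 MB/s.
print(shards_needed(3, 10))  # 5
```

In this made-up case the read side, not the write side, determines the shard count, which mirrors a consumer that is falling behind its producer.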


364

MCQhard

A company runs an e-commerce platform that generates clickstream data in real-time. The data is ingested into Amazon Kinesis Data Streams (100 shards) and processed by AWS Lambda functions, which aggregate data in 1-minute windows and write the results to Amazon S3. The Lambda functions are triggered by the Kinesis stream using the event source mapping. Recently, the company noticed that some records are being processed multiple times, leading to duplicate data in S3. The Lambda function is idempotent, but the duplicates are causing downstream issues. The Lambda function's concurrency limit is 1000, and the batch size is 100. The average processing time per record is 200 ms. What is the most likely cause of the duplicates, and how should it be fixed?

A.Increase the Lambda concurrency limit to 2000 to handle the load.

B.Ensure the Lambda function is idempotent and uses the sequence number to deduplicate records.

C.Decrease the batch size to 10 to reduce the impact of failures.

D.Use Amazon SQS FIFO queue as a buffer between Kinesis and Lambda to guarantee exactly-once processing.

AnswerB

If the function fails and retries, using sequence numbers allows it to skip already processed records, preventing duplicates.

Why this answer

Option B is correct. Lambda functions process records from Kinesis in batches. If the function fails (e.g., due to a timeout or error), the entire batch is retried, so any records whose results were already written before the failure are written again.

To avoid duplicates, the function should be idempotent, should not commit partial results, and can use each record's sequence number to skip records it has already handled. Option A is wrong because the concurrency is sufficient. Option C is wrong because a smaller batch only limits how many records are retried after a failure; it does not stop retried records from being written twice.

Option D is wrong because a FIFO queue does not integrate with Kinesis.
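One way to picture sequence-number deduplication is a per-shard checkpoint: anything at or below the last committed sequence number is skipped on retry. This is a simplified sketch that relies on records arriving in order within a shard; the dictionary checkpoint, the record field names, and the list used as output are stand-ins for whatever durable store and sink a real pipeline would use.

```python
def process_batch(records, checkpoints, output):
    """Write only records newer than the last committed sequence number per shard."""
    for record in records:
        shard = record["shard_id"]
        # Kinesis sequence numbers are numeric strings that increase within a shard.
        seq = int(record["sequence_number"])
        if seq <= checkpoints.get(shard, -1):
            continue  # already handled in an earlier attempt
        output.append(record["data"])
        checkpoints[shard] = seq
    return output


checkpoints, output = {}, []
batch = [
    {"shard_id": "shard-0", "sequence_number": "101", "data": "a"},
    {"shard_id": "shard-0", "sequence_number": "102", "data": "b"},
]
process_batch(batch, checkpoints, output)
process_batch(batch, checkpoints, output)  # simulated retry of the whole batch
print(output)  # ['a', 'b']
```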


365

MCQhard

An organization is migrating its on-premises Hadoop cluster to AWS. The cluster runs Spark jobs that process 50 TB of data daily. The data is stored in HDFS with 3x replication. Which storage option on AWS provides the best price-performance for this workload?

A.Use AWS Glue to run Spark jobs with data stored in S3

B.Use Amazon EMR with S3 as the data store via EMRFS

C.Use Amazon Redshift Spectrum to query the data directly in S3

D.Use Amazon EMR with HDFS on EBS volumes

AnswerB

S3 is designed for 11 nines (99.999999999%) of durability and is cheaper than EBS. EMRFS integrates seamlessly with Spark.

Why this answer

Amazon EMR with S3 as storage (using EMRFS) allows separating compute and storage. S3 is durable and cost-effective, avoiding 3x replication overhead. HDFS on EBS would require similar replication and is more expensive.

S3 with EMR is the standard recommendation.


366

MCQhard

A Glue job fails with an AccessDenied error when trying to write to the S3 bucket my-data-lake. The IAM policy attached to the job role is shown in the exhibit. What is the MOST likely reason for the failure?

A.The s3:ListBucket action is missing on the bucket level

B.The job role does not have permissions to decrypt the KMS key used for server-side encryption

C.The s3:PutObject action is not sufficient; the job needs s3:PutObjectAcl

D.The resource ARN for s3:PutObject should include a specific prefix

AnswerB

SSE-KMS requires kms:Decrypt and kms:GenerateDataKey permissions, which are missing.

Why this answer

The policy allows s3:PutObject on the bucket, so write access seems granted. However, if the bucket is encrypted with SSE-KMS, the job also needs kms:Decrypt and kms:GenerateDataKey permissions. The policy does not include KMS actions.

The bucket policy might also deny, but the most common issue is KMS encryption.


367

MCQeasy

A team stores raw data in S3 and uses a Glue Data Catalog for metadata. They want to allow data scientists to query the data with Amazon Athena using their existing IAM roles. What is the MINIMUM set of permissions required?

A.Grant the IAM role permissions for Athena, Glue, and S3 (read and write).

B.Grant the IAM role permissions for Athena actions, Glue Data Catalog actions, and S3 read access.

C.Grant the IAM role permissions for Athena and Amazon Redshift Spectrum.

D.Grant the IAM role permissions for Athena and Amazon Kinesis.

AnswerB

Athena requires GetTable, GetDatabase, etc. from Glue, and GetObject from S3.

Why this answer

Option B is correct because Athena needs permissions to query the Glue Data Catalog and to read data from S3. Option A is wrong because write permissions on the source data are not needed for querying. Option C is wrong because Redshift Spectrum permissions are irrelevant to Athena, and this option grants no Glue Data Catalog or S3 access.

Option D is wrong because Kinesis permissions are irrelevant.


368

MCQeasy

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 100 Mbps internet connection and a tight deadline of two weeks. Which AWS service should the engineer use to transfer the data most efficiently?

A.AWS Storage Gateway (Volume Gateway)

B.AWS Snowball Edge

C.Amazon S3 Transfer Acceleration

D.AWS DataSync over the internet

AnswerB

Snowball Edge provides physical shipping, bypassing bandwidth limitations.

Why this answer

Option B is correct. AWS Snowball Edge is a physical device that can move large amounts of data faster than a slow internet link. Option D is wrong because AWS DataSync over the internet would be too slow at 100 Mbps.

Option A is wrong because AWS Storage Gateway is for ongoing hybrid storage, not bulk transfer. Option C is wrong because Amazon S3 Transfer Acceleration improves speed but still runs over the internet, which is not enough to move 50 TB in two weeks at 100 Mbps.
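The bandwidth argument can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a fully utilised link with no protocol overhead, so the real transfer would take longer; the function name is illustrative.

```python
def transfer_days(data_tb, link_mbps, utilization=1.0):
    """Estimate days needed to push data_tb terabytes over a link_mbps link."""
    bits = data_tb * 1e12 * 8                         # total bits to send
    seconds = bits / (link_mbps * 1e6 * utilization)  # time at the effective rate
    return seconds / 86400                            # seconds per day


days = transfer_days(50, 100)
print(round(days, 1))  # 46.3
```

Even in this best case the transfer takes roughly 46 days, more than three times the two-week deadline, which is why an offline device is the better fit.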


369

MCQmedium

A company is streaming data from IoT devices to Amazon Kinesis Data Firehose, which writes to an Amazon S3 bucket. The data is then processed by an AWS Glue ETL job and loaded into Amazon Redshift. The team notices that some records are missing in Redshift. They suspect data loss during the Firehose delivery. Which configuration parameter should be checked first?

A.The AWS KMS key used for encryption.

B.The CloudWatch error logging configuration.

C.The buffer interval (e.g., 60 seconds) and buffer size.

D.The compression format (GZIP, Snappy, etc.).

AnswerC

Correct: If the buffer interval is too long and the stream is stopped, buffered data may be lost if not flushed properly.

Why this answer

Firehose buffers data before writing to S3. If the buffer interval is too long and the stream ends, data may be lost if the buffer is not flushed. Option C (buffer interval and buffer size) is therefore the first setting to check.

Option D (compression format) does not cause loss. Option A (KMS key) only concerns encryption. Option B (error logging) only logs errors; it does not prevent loss.


370

Multi-Selectmedium

Which TWO steps are required to set up cross-account access to an Amazon S3 data lake for AWS Glue jobs running in a different AWS account? (Choose two.)

Select 2 answers

A.Add a bucket policy to the S3 bucket that grants access to the Glue service role from the other account.

B.Create an IAM role in the second account that the Glue job can assume, with permissions to read from the S3 bucket.

C.Create a cross-account Glue crawler in the source account.

D.Set up VPC peering between the two accounts' VPCs.

E.Ensure both accounts are in the same AWS organization.

AnswersA, B

Correct: Bucket policy allows cross-account access.

Why this answer

To allow cross-account access, the S3 bucket policy must grant access to the Glue service role from the other account, and the Glue job must assume a role that has permissions to access the bucket. Option A (bucket policy) and Option B (IAM role in the second account) are correct. Option D (VPC peering) is not required for S3 access.

Option C (cross-account Glue crawler) is not needed. Option E (same AWS organization) is not a requirement for cross-account access.
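For the bucket-policy half of the setup, a minimal read-only policy might look like the sketch below. The bucket name, account ID, and role name are hypothetical placeholders, and the action list is an assumption about what a read-only Glue job would need; the role in the other account still needs its own matching IAM permissions.

```python
import json


def cross_account_bucket_policy(bucket, role_arn):
    """Build a bucket policy granting read access to a role in another account."""
    principal = {"AWS": role_arn}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {   # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Hypothetical bucket and role, used only to show the shape of the document.
policy = cross_account_bucket_policy(
    "example-data-lake", "arn:aws:iam::111122223333:role/example-glue-role"
)
print(json.dumps(policy, indent=2))
```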


371

Multi-Selectmedium

A data engineer needs to design a data ingestion pipeline that ingests data from a MySQL database hosted on-premises into Amazon S3 for analytics. The pipeline must capture change data (CDC) and run continuously with low latency. Which two services should the data engineer use?

Select 2 answers

A.AWS Database Migration Service (DMS) with ongoing replication.

B.Amazon S3 as the target endpoint for DMS.

C.Amazon AppFlow.

D.AWS Glue ETL jobs scheduled at regular intervals.

E.Amazon Kinesis Data Streams.

AnswersA, B

DMS supports CDC and can write changes to S3 continuously.

Why this answer

Options A and B are correct. AWS DMS can capture ongoing changes from MySQL using CDC and replicate them to S3 as its target endpoint. Option E (Kinesis Data Streams) can receive CDC data from DMS but is not needed when DMS writes directly to S3.

Option D (scheduled Glue ETL jobs) is batch-oriented. Option C (AppFlow) is for SaaS applications, not for on-premises databases.


372

Multi-Selectmedium

A data engineering team is designing a data lake on AWS. They need to store raw data in S3 and allow multiple analytics services to query the data. Which TWO services can be used to catalog and provide schema information for the data?

Select 2 answers

A.AWS Glue Data Catalog

B.Amazon Kinesis Data Streams

C.Amazon RDS

D.Amazon DynamoDB

E.Amazon Athena

AnswersA, E

Glue Data Catalog stores metadata and schemas.

Why this answer

Option A (AWS Glue Data Catalog) is a metadata catalog. Option E (Amazon Athena) uses the Glue Data Catalog as its schema store. Option D (DynamoDB) is not a catalog; Option C (RDS) is a database; Option B (Kinesis Data Streams) is a streaming service.


373

MCQmedium

A company is using AWS Glue to catalog metadata from various data sources. The crawler is configured to run daily. However, the catalog is not reflecting new partitions added to an S3 bucket during the day. What is the MOST likely cause?

A.The S3 bucket has insufficient permissions for the Glue crawler

B.The table schema has changed and the crawler does not update it

C.The crawler is not scheduled frequently enough to capture changes

D.The data format is not supported by AWS Glue

AnswerC

The crawler runs once a day, so it misses partitions added between runs.

Why this answer

Option C is correct because the crawler is configured to run daily, but new partitions are being added to the S3 bucket throughout the day. Since the crawler only runs once per day, it will not detect and catalog those new partitions until its next scheduled run. To capture changes sooner, the crawler should be scheduled to run more often, or an event-driven trigger (e.g., using Amazon S3 Events and AWS Lambda) should be implemented.

Exam trap

The trap here is that candidates may assume the crawler automatically detects all changes in real time, but AWS Glue crawlers are batch-oriented and only discover new partitions during a crawl run, so scheduling frequency is critical.

How to eliminate wrong answers

Option A is wrong because if the S3 bucket had insufficient permissions for the Glue crawler, the crawler would fail entirely or produce errors, not selectively miss new partitions while still cataloging existing data. Option B is wrong because the question states that new partitions are not being reflected, not that the table schema has changed; Glue crawlers can update schemas by default unless configured otherwise, and schema changes would cause different symptoms (e.g., type mismatches). Option D is wrong because AWS Glue supports a wide range of data formats (CSV, JSON, Parquet, Avro, ORC, etc.), and if the format were unsupported, the crawler would fail to read the data entirely, not just miss new partitions.


374

MCQeasy

A data engineer runs the AWS CLI command above to inspect a file in S3. They need to determine if the file was modified after a Glue ETL job processed it. What additional information could they obtain from this command?

A.The object's content type.

B.The object's storage class.

C.The object's last modified timestamp.

D.The object's ETag.

AnswerC

The LastModified field indicates when the object was last modified.

Why this answer

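Option C is correct because the LastModified timestamp is provided, which can be compared with the job's completion time. Option A is wrong because the content type describes the object's format, not when it changed. Option B is wrong because the storage class describes where the object is stored, not when it changed.

Option D is wrong because the ETag identifies the object's content; it can show that the content differs but not when the change happened.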
