Knowledge + Practice

CCNA Data Engineering Questions

75 of 374 questions · Page 4/5 · Data Engineering · Answers revealed


226

MCQmedium

An AWS Glue ETL job failed with the error 'Insufficient memory allocated for the job'. The job run details show AllocatedCapacity: 5, WorkerType: Standard, NumberOfWorkers: 5. Which change should be made to resolve the issue?

A.Delete and recreate the job with a different name

B.Increase the job timeout to 3600 minutes

C.Increase the number of workers to 10

D.Change the worker type to G.2X

AnswerC

More workers increase total memory and compute capacity.

Why this answer

Option C is correct because the error indicates insufficient memory; increasing the number of workers (DPUs) raises the job's total memory. Option A is wrong because deleting and recreating the job under a different name does not change its resource allocation. Option B is wrong because job timeout is not the issue. Option D is wrong because, although a G.2X worker has 32 GB of memory versus 16 GB for Standard, the error is about allocated capacity, not worker type.


227

Multi-Selecteasy

A company needs to transfer 10 TB of data from an on-premises data center to Amazon S3. The network bandwidth is limited to 100 Mbps, and the transfer must complete within 5 days. Which TWO options are viable? (Choose TWO.)

Select 1 answer

A.Use S3 Transfer Acceleration to speed up the transfer

B.Use S3 Multipart Upload to upload files in parallel

C.Use AWS Snowball Edge device to ship the data

D.Use AWS DataSync over the existing internet connection

E.Use AWS Direct Connect to establish a dedicated network connection

AnswersC

Snowball Edge is ideal for large data volumes over slow networks; physical shipping is faster.

Why this answer

AWS Snowball Edge is a physical device built for large data transfers when the network is slow. Option D (AWS DataSync) works over the existing connection, but at 100 Mbps moving 10 TB takes more than 9 days even at full line rate, exceeding the 5-day window. Option A (S3 Transfer Acceleration) still uses the existing internet link, which remains the bottleneck.

Option E (Direct Connect) would require setup time and may not meet the window. Option B (S3 Multipart Upload) parallelizes file uploads but cannot move data faster than the slow link allows.
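The bandwidth arithmetic behind this explanation can be checked with a few lines of Python. This is a minimal sketch that assumes decimal terabytes and an ideal link with no protocol overhead; the function name `transfer_days` is illustrative only.

```python
def transfer_days(data_tb: float, link_mbps: float, utilization: float = 1.0) -> float:
    """Days needed to move data_tb terabytes over a link of link_mbps megabits/s."""
    bits = data_tb * 1e12 * 8                      # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86_400                        # seconds -> days


days = transfer_days(10, 100)
print(f"{days:.1f} days")  # 9.3 days at full line rate, well past the 5-day window
```

Real transfers rarely sustain full line rate, so lowering `utilization` (for example to 0.8) only pushes the estimate further past the deadline.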


228

MCQmedium

A company is using AWS Glue to run ETL jobs that process data in an S3 data lake. The jobs are failing with out-of-memory errors when processing large files. Which configuration change should be made to resolve this issue?

A.Change the worker type to G.1X

B.Increase the number of DPUs allocated to the job

C.Partition the input data into smaller files

D.Enable job bookmark to process only new data

AnswerB

More DPUs provide more memory and compute resources.

Why this answer

Option B is correct because increasing the number of DPUs (Data Processing Units) allocates more memory and compute to the Glue job. Option A is wrong because the G.1X worker type has limited memory; G.2X or higher is better, but increasing DPUs on the default worker type is simpler. Option C is wrong because partitioning the input does not address memory per task. Option D is wrong because job bookmarks are for incremental processing, not memory.


229

MCQhard

A company runs a critical ETL job using AWS Glue that writes to an Amazon Redshift cluster. The job occasionally fails due to insufficient disk space on the Redshift cluster. How can the company automate the process to prevent this failure?

A.Use a CloudWatch alarm to trigger a Lambda function that resizes the cluster.

B.Use RA3 node types with managed storage.

C.Increase the number of slices in the Redshift cluster.

D.Reserve additional nodes for the Redshift cluster.

AnswerA

This automates scaling based on disk usage.

Why this answer

Option A is correct: using Amazon CloudWatch to monitor disk space and trigger an automatic resize of the cluster is the best automated solution. Option D is wrong because reserving nodes is a billing commitment and does not add space. Option B, RA3 nodes with managed storage, is a good proactive step but does not automate resizing.


230

MCQmedium

An IAM policy is attached to a data engineering role that writes to an S3 bucket. The policy is shown in the exhibit. What is the effect of this policy?

A.The role can write objects with any encryption, but reading is restricted to SSE-KMS only

B.The role can read and write any object without encryption restrictions

C.The role can only read objects; writing is always denied

D.The role must use SSE-KMS when writing objects; reading is allowed only if the object is encrypted with SSE-KMS

AnswerD

The Allow statement grants GetObject only when SSE-KMS is specified, and the Deny statement enforces SSE-KMS for PutObject.

Why this answer

Option D is correct because the first statement allows GetObject and PutObject only when SSE-KMS is used; the second statement denies PutObject if SSE-KMS is not used, effectively enforcing SSE-KMS for PutObject. Option A is wrong because the Deny statement blocks any write that does not use SSE-KMS. Option B is wrong because both statements carry SSE-KMS conditions. Option C is wrong because the policy allows PutObject with SSE-KMS.


231

MCQmedium

A company uses AWS Glue to run ETL jobs that transform data from Amazon S3 to Amazon Redshift. The Glue job writes data to Redshift using the JDBC connection. Recently, the job has been failing with connection timeout errors when writing to Redshift. The Redshift cluster is a 2-node dc2.large cluster. The Glue job processes about 50 GB of data per run. The errors occur sporadically, and the job succeeds after a few retries. The data engineer needs to resolve the issue to prevent job failures. What should the engineer do?

A.Increase the Redshift cluster size to 4 nodes.

B.Modify the Glue job to write data to S3 first, then use the Redshift COPY command to load data.

C.Increase the Glue job timeout to 24 hours.

D.Use AWS Database Migration Service (DMS) to load data into Redshift.

AnswerB

Using COPY from S3 is the recommended approach for bulk loading into Redshift; it is faster and more reliable than JDBC writes.

Why this answer

Option B is correct. The Glue JDBC driver may not handle large data volumes efficiently; using the COPY command via S3 is optimized for bulk loads and is more reliable. Option A is wrong because the Redshift cluster may be undersized, but the issue is more likely the JDBC write path. Option C is wrong because increasing the timeout may mask the issue but does not resolve it. Option D is wrong because AWS DMS is for database migration, not ETL jobs.


232

MCQhard

Refer to the exhibit. A company is using the Kinesis stream 'my-stream' with one shard. The producer is sending 1000 records per second, each 1 KB. The consumer is reading from the stream using the Kinesis Client Library (KCL). The consumer is able to process 500 records per second per shard. What is the most likely cause of the consumer falling behind?

A.The retention period is set to 24 hours, which is too short.

B.The stream uses KMS encryption, which adds latency.

C.The stream has only one shard, which limits the read throughput to 1 MB/s.

D.The consumer application is not using enhanced fan-out.

AnswerC

Correct: One shard serves reads of up to 2 MB/s, and the producer's 1 MB/s (1,000 records × 1 KB) is within the shard's limits, but the single shard's consumer can only process 500 records/s, so it falls behind.

Why this answer

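A single shard accepts writes of up to 1 MB/s or 1,000 records per second and serves reads of up to 2 MB/s. The producer's 1,000 records per second of 1 KB each fits within the write limit, but with the KCL each shard is processed by a single worker, and that worker handles only 500 records per second, half the incoming rate, so the consumer falls behind. Adding shards lets the KCL spread processing across more workers.

Option C (only one shard) is correct, although the 1 MB/s figure quoted in the option is the shard's write limit; its read limit is 2 MB/s. Option A (retention period) does not affect how fast records are consumed. Option B (encryption) may add overhead but is not the main issue. Option D (enhanced fan-out) raises read bandwidth per consumer, which is not the bottleneck here.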
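To make the "falls behind" claim concrete, the backlog can be estimated from the two rates given in the question. This is a rough sketch that assumes both rates stay constant; `backlog_after` is an illustrative helper, not part of any Kinesis API.

```python
def backlog_after(seconds: int, produced_per_s: int, consumed_per_s: int) -> int:
    """Records waiting in the stream after `seconds`, assuming constant rates."""
    return max(0, produced_per_s - consumed_per_s) * seconds


# Rates from the question: 1,000 records/s in, 500 records/s processed.
print(backlog_after(3600, 1000, 500))  # 1800000 records behind after one hour
```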


233

MCQmedium

A company uses Amazon DynamoDB as the primary data store for a real-time recommendation engine. The data engineering team needs to export a daily snapshot of the DynamoDB table to S3 for offline analytics. The table is large (10 TB) and has a high read/write throughput. Which method will export the data with the least impact on the production workload?

A.Use AWS Data Pipeline to export the DynamoDB table to S3.

B.Use DynamoDB Scan API with parallel scans to export data to S3.

C.Use the DynamoDB export to S3 feature available in the AWS Console or CLI.

D.Use AWS Glue ETL job with a DynamoDB connection to export data.

AnswerC

This feature exports data without consuming read capacity units, minimizing impact.

Why this answer

Option C is correct because DynamoDB's export to S3 feature builds on point-in-time recovery (continuous backups), which reads from the storage layer without consuming read capacity units and therefore does not affect production traffic. Option B (parallel Scan) consumes RCUs. Option A (Data Pipeline) also uses a scan. Option D (Glue ETL) also consumes RCUs.
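As a rough illustration, the export can be started from Python with boto3's `export_table_to_point_in_time` call. The sketch below assumes point-in-time recovery is already enabled on the table; the ARN, bucket, and prefix values are placeholders, and the two helper functions are hypothetical names.

```python
def build_export_request(table_arn: str, bucket: str, prefix: str = "") -> dict:
    """Assemble keyword arguments for DynamoDB's ExportTableToPointInTime."""
    request = {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "ExportFormat": "DYNAMODB_JSON",  # "ION" is the other supported format
    }
    if prefix:
        request["S3Prefix"] = prefix
    return request


def start_export(request: dict) -> str:
    """Start the export and return its ARN (needs AWS credentials and PITR enabled)."""
    import boto3  # imported lazily so the request builder works without the SDK

    client = boto3.client("dynamodb")
    response = client.export_table_to_point_in_time(**request)
    return response["ExportDescription"]["ExportArn"]


# Placeholder identifiers, for illustration only.
request = build_export_request(
    "arn:aws:dynamodb:us-east-1:123456789012:table/recommendations",
    "example-analytics-bucket",
    "exports/daily/",
)
print(request["ExportFormat"])  # DYNAMODB_JSON
```

A daily snapshot would call `start_export(request)` from a scheduler; the export runs asynchronously, so the returned ARN is what a later status check would use.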


234

MCQeasy

A data pipeline uses AWS Glue to crawl an S3 bucket and create a table in the AWS Glue Data Catalog. The data is in Parquet format with partitions by date. After a new partition is added to S3, the crawler runs but the new partition is not reflected in the table. What is the most likely cause?

A.The crawler requires an AWS Lambda trigger to be configured for new partitions.

B.The Parquet schema in the new partition does not match the existing table schema.

C.The new partition folder does not follow the Hive-style partition naming convention expected by the crawler.

D.The S3 bucket has too many partitions, exceeding the Glue crawler limit.

AnswerC

Glue crawlers require partition folders to follow the key=value pattern to automatically detect partitions.

Why this answer

Option C is correct because the crawler infers partitions from the folder layout, and the most typical cause is a new folder that is not in the recognized Hive-style format (e.g., date=2023-01-01/) used by the table's existing partitions. Crawler settings such as 'Crawl new folders only' and the schema-change options also affect what a run updates, so they are worth checking, but the partition path format is the more likely culprit here.

Option A is wrong because the crawler does not need Lambda triggers. Option B is wrong because Glue can handle Parquet. Option D is wrong because the crawler can handle a large number of partitions.
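A quick way to spot a folder that breaks the Hive-style convention is to check each path segment for the key=value shape. The sketch below is an illustrative local check, not something the Glue crawler exposes; the regular expression and the function name `non_hive_segments` are assumptions made for the example.

```python
import re
from typing import List

# A Hive-style segment looks like key=value, e.g. date=2023-01-01
HIVE_SEGMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=[^/=]+$")


def non_hive_segments(partition_path: str) -> List[str]:
    """Return the folder segments of a partition path that are not key=value."""
    segments = [s for s in partition_path.strip("/").split("/") if s]
    return [s for s in segments if not HIVE_SEGMENT.match(s)]


print(non_hive_segments("date=2023-01-01/"))           # []
print(non_hive_segments("2023-01-01/"))                # ['2023-01-01']
print(non_hive_segments("year=2023/month=01/day=01"))  # []
```

Pass only the part of the key below the table's root location; the table's own folder name is not expected to be key=value.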


235

MCQeasy

A company wants to use Amazon SageMaker to train a model on a dataset stored in Amazon S3. The dataset is 100 GB and consists of millions of small JSON files. What should the data engineering team do to optimize training performance?

A.Combine the small JSON files into larger Parquet files using a Spark job on Amazon EMR.

B.Copy the data to an Amazon EBS volume attached to the training instance.

C.Use Amazon Athena to convert the data into a single CSV file.

D.Use S3 Select to filter data before training.

AnswerA

Parquet with larger files improves read efficiency and reduces overhead.

Why this answer

Option A is correct because combining small files into larger ones reduces S3 LIST overhead and improves I/O performance. Option B is wrong because copying to an EBS volume does not remove the small-file overhead, and the volume is not shared across instances. Option C is wrong because Athena is not needed for training, and a single CSV file loses the benefits of a columnar format. Option D is wrong because S3 Select is for server-side filtering, not for training performance.


236

MCQmedium

A company is building a data lake on Amazon S3. They need to enforce encryption at rest for all objects. Which combination of actions will achieve this? (Assume the bucket is versioned.)

A.Use AWS KMS with automatic key rotation

B.Enable S3 default encryption and set a bucket policy to deny PutObject without encryption headers

C.Enable S3 default encryption only

D.Enable S3 Block Public Access

AnswerB

This ensures all objects are encrypted.

Why this answer

Enabling default encryption on the bucket ensures all new objects are encrypted. Adding a bucket policy that denies PutObject without encryption headers prevents unencrypted uploads. The other options do not enforce encryption for all objects.


237

MCQmedium

A company uses Amazon DynamoDB as the primary data store for a real-time application. The data science team wants to analyze the data using Amazon Athena. What is the most efficient way to make the DynamoDB data available for Athena queries?

A.Use AWS Glue to extract data from DynamoDB and load into S3 on a schedule.

B.Use Amazon Redshift Spectrum to query DynamoDB directly.

C.Use DynamoDB Streams to invoke an AWS Lambda function that writes data to Amazon S3 in Parquet format. Then query the data in S3 using Athena.

D.Use Amazon EMR to read directly from DynamoDB and run Hive queries.

AnswerC

This provides a decoupled, cost-effective solution for analytics.

Why this answer

Option C is correct because DynamoDB Streams can trigger a Lambda function to write to S3, and Athena can query S3. Option A is wrong because Glue can extract the data but adds latency. Option B is wrong because Redshift Spectrum queries S3, not DynamoDB directly. Option D is wrong because EMR can query DynamoDB directly but is more complex.


238

Multi-Selecteasy

A team wants to move data from an on-premises Oracle database to Amazon S3 for analytics. The pipeline must run daily and handle incremental updates. Which THREE services should they use together? (Choose three.)

Select 3 answers

A.Amazon SageMaker

B.Amazon S3

C.Amazon Athena

D.AWS Database Migration Service (DMS)

E.AWS Glue

AnswersB, D, E

S3 is the target data lake storage.

Why this answer

Options B, D, and E are correct. AWS DMS can do continuous replication from Oracle to S3. Glue can transform the data. S3 is the target. Option A is wrong because SageMaker is for ML, not data ingestion. Option C is wrong because Athena is a query engine, not a data movement service.


239

MCQmedium

A company uses Amazon Athena to analyze data stored in S3. The data is in CSV format and is partitioned by year/month/day. Queries that filter on a specific day are slow. The team wants to improve query performance without changing the data format. Which action should the team take?

A.Use a larger number of small files to increase parallelism.

B.Increase the number of partitions by adding hour and minute levels.

C.Convert the data to Parquet format.

D.Ensure that the S3 folder structure follows the Hive-style partition naming convention (e.g., year=2023/month=01/day=01).

AnswerD

Hive-style partitions allow Athena to prune partitions effectively.

Why this answer

Option D is correct because partition pruning is most effective when the partition columns are in a Hive-style format (e.g., year=2023/month=01/day=01). If the folders are not in that format, Athena may not prune partitions properly. Option A is wrong because using a large number of small files hurts performance. Option B is wrong because increasing the number of partitions would further slow queries. Option C is wrong because converting to Parquet changes the format.


240

Multi-Selecteasy

Which TWO services can be used to transform data in transit within a Kinesis Data Firehose delivery stream? (Choose 2)

Select 2 answers

A.AWS Lambda

B.Amazon Kinesis Data Analytics

C.Amazon Athena

D.Amazon S3

E.AWS Glue

AnswersA, E

Firehose can invoke a Lambda function to transform records.

Why this answer

AWS Lambda is correct because it can be invoked as a transformation function within a Kinesis Data Firehose delivery stream. When you enable data transformation, Firehose buffers incoming records and then calls a Lambda function you specify, passing batches of records for processing. The Lambda function can modify, enrich, filter, or reformat the data before Firehose continues delivering it to the destination.

AWS Glue is the second answer: Firehose's record format conversion (for example, JSON to Apache Parquet or ORC) reads its schema from a table in the AWS Glue Data Catalog.

Exam trap

The trap here is that candidates often confuse 'transformation' with 'analytics' and select Kinesis Data Analytics, not realizing that Firehose's built-in transformation feature is specifically powered by Lambda, not by a separate analytics engine.
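The Lambda side of this mechanism can be sketched as follows. Firehose passes a batch of base64-encoded records and expects each one back with the same `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the (possibly modified) base64 `data`. The `processed` flag added here is a made-up enrichment for illustration, and the sketch assumes the payloads are JSON.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation: enrich JSON records, flag anything unparseable."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        try:
            doc = json.loads(payload)
        except json.JSONDecodeError:
            # Return the original data untouched and mark the record as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        doc["processed"] = True  # hypothetical enrichment step
        encoded = base64.b64encode((json.dumps(doc) + "\n").encode("utf-8"))
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": encoded.decode("utf-8"),
        })
    return {"records": output}
```

Appending a newline to each record is a common choice so that the objects Firehose writes to S3 are line-delimited and easy to query.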


241

Multi-Selecthard

A machine learning team is using Amazon SageMaker to train a model on a dataset stored in S3. The training job reads data from S3 using Pipe input mode, but the training is slow. The team wants to improve data throughput. Which THREE actions should they take?

Select 3 answers

A.Enable S3 Transfer Acceleration on the bucket.

B.Mount the S3 bucket using an S3 file system and use File mode with a larger instance type.

C.Use Amazon S3 VPC Gateway Endpoint to reduce data transfer costs and improve latency.

D.Use Amazon EFS as the data source for training.

E.Use Amazon ElastiCache to cache the training data.

AnswersB, C, D

File mode with high-bandwidth instances can improve throughput.

Why this answer

Using Amazon S3 with a VPC endpoint improves network performance. Using Amazon EFS as a data source can provide higher throughput for sequential access. Using SageMaker's File mode with a larger instance type with more network bandwidth can also improve throughput.


242

MCQmedium

A company is using AWS Glue to run ETL jobs that process data from an Amazon RDS for PostgreSQL database. The jobs are failing with connection timeouts. The security group for the RDS instance allows inbound traffic from the Glue job's security group. What is the most likely cause?

A.The security group inbound rule is incorrect

B.The VPC does not have an S3 VPC endpoint

C.RDS is in a different subnet

D.The IAM role does not have permission to access RDS

AnswerB

Glue jobs that run in a VPC need an S3 VPC endpoint (or a NAT gateway) to reach Amazon S3.

Why this answer

AWS Glue jobs that use a VPC connection run on network interfaces inside that VPC, so they need an S3 VPC endpoint (or a NAT gateway) to reach Amazon S3 and other AWS services. Without the endpoint, those calls cannot leave the subnet and the job fails with timeouts. Option A is wrong because the security group is correctly configured. Option C is wrong because there is no subnet issue mentioned. Option D is wrong because the issue is not about IAM permissions for RDS.


243

MCQhard

A data engineer runs the above CLI command and sees that the bucket contains many small Parquet files (1 MB each) under the prefix. When querying this data with Athena, the query performance is poor and costs are high. Which approach would MOST improve performance and reduce cost?

A.Convert the files to JSON format

B.Convert the files to CSV format

C.Consolidate the small files into fewer, larger Parquet files

D.Add more partitions by including hour in the prefix

AnswerC

Fewer, larger files reduce overhead and improve compression.

Why this answer

Consolidating small files into larger files (e.g., 100 MB) reduces the overhead of reading many small files in Athena. Option A is wrong because JSON, like CSV, is not columnar. Option B is wrong because CSV is not columnar and would not improve performance. Option D is wrong because adding more partitions may create even more files.


244

MCQhard

A data engineering team is designing a data lake on Amazon S3. The data is ingested from multiple sources in JSON, CSV, and Parquet formats. The team needs to make the data available for analysis using Amazon Athena and Amazon Redshift Spectrum. The team wants to minimize data transformation costs and storage overhead. Which data storage approach should the team use?

A.Load the data into Amazon Redshift cluster and then unload to S3 in Parquet

B.Store the data in its original format in S3 and use Athena to query directly

C.Store the data in its original format and use AWS Glue to convert to Parquet when queried

D.Convert all data to Apache Parquet before storing in S3

AnswerD

Parquet is columnar, reducing storage and improving query performance.

Why this answer

Option D is correct because storing data in a columnar format such as Parquet reduces storage and improves query performance in both Athena and Redshift Spectrum. Option A is wrong because loading the data into a Redshift cluster first goes through a relational database, which is not a data lake approach. Option B is wrong because keeping everything in its original format leaves raw JSON and CSV that inflate storage. Option C is wrong because converting at query time repeats the transformation cost instead of paying it once.


245

MCQhard

A data engineer is building a data pipeline that uses AWS Lambda to process records from an SQS queue and write results to an S3 bucket. The Lambda function processes each record individually and writes a separate file to S3. The team notices high latency and wants to reduce the number of S3 PUT requests to improve performance and reduce cost. Which approach should the data engineer take?

A.Use S3 multipart upload for each record to improve throughput.

B.Increase the Lambda function's memory allocation to improve processing speed.

C.Use S3 Batch Operations to process the records in batches.

D.Aggregate multiple records into a single file in a DynamoDB table, then periodically write the aggregated data to S3.

AnswerD

Aggregation reduces the number of S3 PUT requests by writing larger files less frequently.

Why this answer

Option D is correct because buffering records in a DynamoDB table and then periodically writing them to S3 as one aggregated file reduces the number of PUT requests. Option A (multipart upload) is for large objects, not for many small objects. Option B (increasing Lambda memory) does not reduce the S3 PUT count. Option C (S3 Batch Operations) is for existing objects, not for incoming data.
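The effect of aggregation on request count is easy to quantify. The sketch below shows only the aggregation step, combining records into one newline-delimited JSON body and counting the PUT requests avoided; it does not model the DynamoDB buffer, and both helper names are illustrative.

```python
import json


def to_ndjson(records) -> bytes:
    """Combine many records into one newline-delimited JSON body for a single PUT."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records).encode("utf-8")


def put_requests_saved(record_count: int, batch_size: int) -> int:
    """PUT requests avoided by writing one object per batch instead of per record."""
    batches = -(-record_count // batch_size)  # ceiling division
    return record_count - batches


body = to_ndjson([{"id": 1}, {"id": 2}])
print(body.decode("utf-8").count("\n"))  # 2 records in one object
print(put_requests_saved(10_000, 500))   # 9980 fewer PUT requests
```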


246

MCQeasy

A data engineer needs to ingest streaming data from an on-premises Kafka cluster into Amazon S3 with minimal operational overhead. Which AWS service should be used to stream the data into S3 without managing servers?

A.Amazon Kinesis Data Streams

B.AWS Glue

C.Amazon Managed Streaming for Apache Kafka (Amazon MSK)

D.Amazon Kinesis Data Firehose

AnswerD

Kinesis Data Firehose can directly ingest streaming data and deliver to S3 without managing servers.

Why this answer

Amazon Kinesis Data Firehose is the correct service for loading streaming data into S3 without managing servers. Option A (Amazon Kinesis Data Streams) requires a consumer to process data; Option B (AWS Glue) is for ETL jobs, not real-time streaming; Option C (Amazon MSK) is a managed Kafka service but still requires management.


247

MCQeasy

A machine learning engineer is using Amazon SageMaker to train a model. The training dataset is 2 TB and is stored in Amazon S3. The engineer wants to reduce the training time by improving data loading performance. Which data ingestion mode should be used?

A.Pipe mode

B.Incremental mode

C.File mode

D.Fast file mode

AnswerA

Pipe mode streams data from S3 directly to the algorithm, reducing I/O wait time.

Why this answer

SageMaker Pipe mode streams data directly from S3 to the training algorithm, which can reduce training time by overlapping data loading and training, especially for large datasets.


248

MCQhard

A data scientist wants to run a one-time SQL query on a large dataset stored in Amazon S3 (CSV format, 2 TB) using Amazon Athena. The query involves joining this dataset with a smaller table stored in Amazon RDS. What is the MOST cost-effective and performant approach?

A.Export the RDS table to S3 in Parquet format, then use Athena to join the two S3 datasets

B.Use Amazon Redshift Spectrum to query both S3 and RDS

C.Use Athena Federated Query to query RDS directly

D.Use AWS Glue ETL to join the data and write results back to S3, then query with Athena

AnswerA

This keeps the query in Athena's environment, avoiding data movement and using columnar format for performance.

Why this answer

Option A is correct. Exporting the RDS table to S3 as Parquet and running the join in Athena avoids data transfer costs and leverages Athena's fast query engine. Option B (Redshift Spectrum) requires a Redshift cluster. Option C (federated query) adds complexity and may be slower. Option D (Glue ETL) is overkill for a one-time query.


249

MCQhard

A research lab stores large genomic datasets in Amazon S3 Glacier Deep Archive. They need to run a one-time analysis on a subset of 10 PB of data. The analysis will use an Amazon EMR cluster with Amazon S3 as the data source. What is the MOST cost-effective and performant way to make the data available for the EMR cluster?

A.Restore the data to S3 Standard-IA and delete after the analysis

B.Configure the EMR cluster to read directly from Glacier Deep Archive using S3 Console

C.Initiate a Bulk retrieval request and restore the data to S3 Standard for the duration of the analysis

D.Initiate an Expedited retrieval request and use the temporary copy for the EMR cluster

AnswerC

Bulk retrieval is the lowest cost tier, and restoring to Standard avoids IA minimum charges.

Why this answer

Option C is correct because Bulk retrieval is the cheapest retrieval tier for Glacier Deep Archive, and restoring to S3 Standard allows the EMR cluster to read efficiently. Option A is wrong because restoring to S3 Standard-IA adds retrieval costs for data that is accessed once. Option B is wrong because reading directly from Glacier is not supported by EMR. Option D is wrong because Expedited is more expensive and not needed for a one-time batch job.


250

MCQhard

A company is migrating its on-premises Apache Hadoop cluster to AWS. The cluster processes large datasets using Spark jobs. The company wants to minimize operational overhead and use native AWS services. Which combination of services should the company use?

A.Amazon EMR with Spark and Amazon S3

B.Amazon Redshift with Spectrum and Amazon S3

C.Amazon Athena and AWS Glue

D.Amazon EC2 instances with Apache Spark installed and Amazon S3

AnswerA

EMR is a managed service that runs Spark and integrates with S3.

Why this answer

Option A is correct because Amazon EMR is a managed Hadoop framework that supports Spark, and S3 is a scalable storage layer. Option B is wrong because Redshift is a data warehouse, not a Hadoop replacement. Option C is wrong because Athena is for querying, not for running Spark jobs. Option D is wrong because EC2 would require managing the cluster manually.


251

MCQeasy

A retail company uses Amazon Redshift for its data warehouse. The data engineering team runs ETL jobs that load data from multiple sources into Redshift daily. They notice that the load performance is slow and the cluster CPU utilization is high during the ETL window. The team wants to improve load performance without changing the cluster configuration. They currently load data using INSERT statements from a staging table. What should they do?

A.Run VACUUM and ANALYZE before loading

B.Use the COPY command to load data from S3 in parallel

C.Increase the number of nodes in the Redshift cluster

D.Apply compression encoding on the staging table

AnswerB

COPY is optimized for bulk loading.

Why this answer

Using the COPY command is the most efficient way to load data into Redshift, as it uses parallel processing. Option A is wrong because VACUUM reclaims space and ANALYZE updates statistics; neither speeds up the load itself. Option C is wrong because increasing the node count changes the cluster configuration. Option D is wrong because compression might help but not as much as COPY.
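For reference, a COPY statement has roughly the shape built by the helper below. This is a sketch under assumptions: the table name, S3 prefix, and IAM role ARN are placeholders, the helper `build_copy_statement` is hypothetical, and plain string formatting is used only for readability, so it should not be fed untrusted input.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str,
                         file_format: str = "CSV") -> str:
    """Build a Redshift COPY statement that loads every file under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS {file_format};"
    )


# Placeholder values, for illustration only.
print(build_copy_statement(
    "sales_staging",
    "s3://example-etl-bucket/sales/2024-01-01/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
))
```

Pointing COPY at a prefix that holds several similarly sized files lets Redshift split the load across its slices, which is where the parallelism comes from.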


252

MCQeasy

A machine learning engineer needs to process a large dataset that does not fit on a single Amazon SageMaker notebook instance's EBS volume. The data is stored in S3. What is the MOST efficient way to access the data from the notebook?

A.Increase the EBS volume size to 5 TB.

B.Mount the S3 bucket as a file system using s3fs.

C.Read the data directly from S3 using the boto3 library.

D.Use SageMaker File input mode in the notebook.

AnswerC

Reading directly from S3 avoids storage limitations and is efficient for large datasets.

Why this answer

Option C is correct because reading data directly from S3 using the boto3 library is the most efficient approach for a dataset that exceeds the notebook instance's EBS volume capacity. Boto3 allows you to stream data in chunks or use S3 Select for server-side filtering, avoiding the need to download the entire dataset to local storage. This method leverages S3's high-throughput API and eliminates the bottleneck of writing to a local EBS volume, which is limited in size and I/O performance.

Exam trap

The trap here is that candidates confuse SageMaker's File input mode (designed for training jobs) with a general-purpose data access method for notebooks, or they assume that mounting S3 as a filesystem (s3fs) is efficient for large-scale data processing, when in reality it introduces performance penalties due to FUSE overhead and lack of native parallel I/O.

How to eliminate wrong answers

Option A is wrong because increasing the EBS volume to 5 TB does not solve the fundamental issue of the dataset not fitting; it only postpones the problem and incurs unnecessary cost, and a notebook instance's EBS volume has a size limit that may still be insufficient for extremely large datasets. Option B is wrong because mounting an S3 bucket as a file system using s3fs relies on FUSE (Filesystem in Userspace), which introduces significant latency and overhead due to metadata caching and POSIX translation, and is not designed for high-throughput data processing in a notebook environment. Option D is wrong because SageMaker File input mode is a training job feature that copies data from S3 to the training container before training starts, not a method for accessing data within a notebook instance; it cannot be used directly in a notebook's kernel.
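Streaming an object with boto3 might look like the sketch below, which never holds more than one line in memory. It assumes a line-oriented object (such as CSV or JSON Lines); the bucket and key are whatever the caller supplies, and both function names are illustrative.

```python
def count_matching_lines(body, needle: bytes) -> int:
    """Stream a line-oriented object and count the lines that contain `needle`."""
    return sum(1 for line in body.iter_lines() if needle in line)


def open_s3_body(bucket: str, key: str):
    """Return the streaming body of an S3 object (needs AWS credentials)."""
    import boto3  # imported lazily so the counting helper works without the SDK

    return boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
```

In a notebook this would be used as `count_matching_lines(open_s3_body(bucket, key), b"ERROR")`; the same pattern extends to looping over many keys so the full dataset never has to land on the EBS volume.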


253

Multi-Selectmedium

Which TWO configurations are required to enable AWS Glue to access data stored in a VPC? (Choose two.)

Select 2 answers

A.A VPC endpoint for Amazon S3.

B.An AWS Glue connection object that specifies the VPC, subnet, and security group.

C.A NAT gateway in a public subnet.

D.An Internet gateway attached to the VPC.

E.An S3 bucket policy that allows access from the Glue service principal.

AnswersA, B

Correct: Allows Glue jobs in VPC to access S3 without Internet.

Why this answer

Glue jobs running inside a VPC require a VPC endpoint for S3 (or Internet/NAT) to access S3 data, and a Glue connection that specifies the VPC, subnet, and security group. Option A (VPC endpoint for S3) and Option B (Glue connection) are correct. Option C (NAT gateway) is an alternative but not required if using a VPC endpoint. Option D (Internet gateway) is not secure. Option E (S3 bucket policy) is not specific to VPC access.


254

Multi-Selectmedium

Which TWO data formats are columnar and optimized for analytics queries in Amazon S3?

Select 2 answers

A.CSV

B.ORC

C.JSON

D.Avro

E.Parquet

AnswersB, E

ORC is columnar and optimized for analytics.

Why this answer

Parquet and ORC are columnar storage formats. JSON and CSV are row-oriented. Avro is row-oriented.


255

MCQhard

A company uses an Amazon SageMaker notebook to train a model using data from an S3 bucket. The IAM role attached to the notebook has the following policy. What is the MOST specific change needed to allow the notebook to read from the bucket 'ml-data-123'?

A.Add an Allow statement for 's3:GetObject' on 'ml-data-123' to the IAM policy.

B.Remove the Deny statement from the IAM policy.

C.Create an S3 access point and update the IAM policy to use the access point ARN.

D.Add a bucket policy on 'ml-data-123' that grants access to the notebook's IAM role.

AnswerB

An explicit deny overrides any allow; removing the deny allows the existing S3 actions to work.

Why this answer

Option B is correct because the existing policy explicitly denies access to the specific bucket, so the Deny statement must be removed. Option A is wrong because adding an Allow statement cannot override an explicit Deny. Option C is wrong because S3 access points are not necessary.

Option D is wrong because an explicit allow in a bucket policy cannot override an explicit deny.

Practice this question →

256

MCQmedium

An IAM policy is attached to a group. A user in the group tries to read the object s3://data-lake-bucket/sensitive/file.txt from an IP address 192.168.1.1. What will happen?

A.The request is allowed because the Allow statement grants s3:GetObject

B.The request is allowed because the Deny condition does not match

C.The request is denied because of the Deny statement

D.The request is denied because the policy has no explicit Allow for the sensitive prefix

AnswerC

Deny applies when condition is met.

Why this answer

The Deny statement explicitly denies any S3 action on the sensitive prefix when the source IP is not from 10.0.0.0/8. Since the IP 192.168.1.1 is not in that range, the Deny applies. Deny statements override Allow statements.

So the user is denied access.
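The evaluation order described above can be sketched in a few lines. This is a simplified model, not the IAM policy engine: it assumes a Deny that applies to the `sensitive/` prefix whenever the source IP falls outside 10.0.0.0/8, as the explanation describes, and a separate Allow for `s3:GetObject`. The function names are invented for illustration.

```python
import ipaddress

TRUSTED_RANGE = ipaddress.ip_network("10.0.0.0/8")
SENSITIVE_PREFIX = "sensitive/"

def deny_applies(key: str, source_ip: str) -> bool:
    # The Deny condition matches when the caller is outside the trusted range.
    outside = ipaddress.ip_address(source_ip) not in TRUSTED_RANGE
    return key.startswith(SENSITIVE_PREFIX) and outside

def evaluate(key: str, source_ip: str, allowed: bool = True) -> str:
    if deny_applies(key, source_ip):
        return "Deny"  # an explicit deny wins over any allow
    return "Allow" if allowed else "Deny"  # no allow means an implicit deny

print(evaluate("sensitive/file.txt", "192.168.1.1"))  # prints Deny
print(evaluate("sensitive/file.txt", "10.1.2.3"))     # prints Allow
```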

Practice this question →

257

MCQhard

A company processes large streams of IoT sensor data using Amazon Kinesis Data Streams with 100 shards. Each sensor reading is about 1 KB. The data is consumed by an Amazon EMR cluster running Spark Streaming jobs. The team notices that the Spark Streaming job's processing time is gradually increasing, and the stream is falling behind. They suspect the issue is due to skewed data distribution across shards. Which approach should the team take to diagnose and resolve the issue?

A.Increase the number of shards to 200 to provide more parallelism.

B.Modify the producer to add a random prefix to the partition key, ensuring even distribution across all shards, and monitor the stream using CloudWatch.

C.Check Amazon CloudWatch metrics for Kinesis to identify hot shards, then manually redistribute the data by repartitioning in Spark.

D.Use the Kinesis Client Library (KCL) with a custom worker to rebalance the load across shards.

AnswerB

Adding a random prefix to partition keys uniformizes distribution, eliminating hot shards; CloudWatch helps confirm the fix.

Why this answer

Option B is correct because adding a random prefix to the partition key ensures that sensor data is evenly distributed across all 100 shards, eliminating hot shards that cause processing delays. This directly addresses the skewed data distribution issue without requiring infrastructure changes, and the team can monitor the improvement using CloudWatch metrics like IncomingBytes and ReadProvisionedThroughputExceeded.

Exam trap

The trap here is that candidates often confuse consumer-side rebalancing (KCL or Spark repartitioning) with producer-side data distribution, and incorrectly assume that increasing shards or using Spark repartitioning can fix a hot shard caused by a poor partition key.

How to eliminate wrong answers

Option A is wrong because simply increasing the number of shards to 200 does not fix the root cause of skewed distribution; it only adds more shards that may still be unevenly loaded if the partition key remains the same, potentially worsening the imbalance. Option C is wrong because while CloudWatch metrics can identify hot shards, manually redistributing data by repartitioning in Spark does not change how data is written to Kinesis shards; the producer-side partition key must be fixed to prevent future skew. Option D is wrong because the Kinesis Client Library (KCL) rebalances consumers across shards, but it cannot change how data is distributed across shards at the producer level; the skew originates from the producer's partition key selection.
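The effect of a random prefix on shard assignment can be simulated locally. Kinesis maps each partition key to a shard through an MD5 hash of the key; the sketch below assumes 100 shards that split the 128-bit hash space evenly, which is a simplification. The sensor ID, the number of prefix buckets, and the helper names are hypothetical.

```python
import hashlib
import random
from collections import Counter

NUM_SHARDS = 100        # matches the stream in the question
HASH_SPACE = 2 ** 128   # MD5 produces a 128-bit hash key

def shard_for(partition_key: str) -> int:
    # Assumes the shards divide the hash key space into equal ranges.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * NUM_SHARDS // HASH_SPACE

def salted_key(sensor_id: str, rng: random.Random, buckets: int = 1000) -> str:
    # Prepend a random bucket number so one sensor's records spread out.
    return f"{rng.randrange(buckets)}-{sensor_id}"

rng = random.Random(0)
hot = Counter(shard_for("sensor-42") for _ in range(10_000))
spread = Counter(shard_for(salted_key("sensor-42", rng)) for _ in range(10_000))

print(len(hot), "shard(s) without a prefix")
print(len(spread), "shard(s) with a random prefix")
```

One trade-off to keep in mind: records from the same sensor no longer land on the same shard, so per-sensor ordering is not preserved.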

Practice this question →

258

MCQeasy

A data engineer is tasked with building a data pipeline that moves data from an on-premises database to Amazon S3 for analytics. The database is a MySQL instance that is 2 TB in size. The company has a 1 Gbps dedicated network connection to AWS (AWS Direct Connect). The data must be transferred once daily. The engineer needs to choose the most efficient and reliable service for this task. Which service should they use?

A.AWS DataSync

B.AWS Database Migration Service (DMS)

C.AWS Glue

D.Amazon S3 Transfer Acceleration

AnswerB

DMS is designed for database migrations and supports S3 as a target.

Why this answer

AWS Database Migration Service (DMS) is designed for migrating databases to AWS and can continuously replicate data. For a one-time daily transfer, DMS can perform a full load and then ongoing replication if needed. It supports MySQL as a source and S3 as a target.

Practice this question →

259

MCQhard

A data engineer is building a data pipeline that uses Amazon S3 to store raw data, AWS Lambda for transformation, and Amazon DynamoDB for serving. The Lambda function experiences high latency when writing to DynamoDB. Which action will most effectively reduce the latency?

A.Enable DynamoDB Accelerator (DAX) for caching

B.Use Amazon S3 instead of DynamoDB

C.Configure a VPC gateway endpoint for DynamoDB

D.Increase the DynamoDB write capacity units

AnswerA

DAX provides in-memory caching, reducing latency.

Why this answer

Option A is correct because enabling DynamoDB Accelerator (DAX) provides a caching layer that reduces read/write latency. Option B is wrong because S3 is object storage, not a low-latency store; Option C is wrong because using a VPC endpoint does not reduce latency significantly; Option D is wrong because increasing write capacity units helps throughput but not latency.

Practice this question →

260

MCQhard

A company runs a critical data pipeline using Apache Spark on Amazon EMR. The pipeline reads data from Amazon S3, performs complex transformations, and writes results back to S3. The job runs every hour and must complete within 30 minutes. Recently, the job has been taking longer and occasionally failing due to executor losses. The team suspects memory pressure. Which action should the team take to improve stability and performance without increasing cost?

A.Increase the spark.executor.memory setting to allocate more memory per executor.

B.Increase the number of core nodes in the EMR cluster.

C.Decrease the number of shuffle partitions (spark.sql.shuffle.partitions) to reduce overhead.

D.Enable Spark dynamic allocation to adjust executors based on workload.

AnswerD

Dynamic allocation helps utilize resources efficiently and prevents over-allocation.

Why this answer

Option D is correct because enabling dynamic allocation allows Spark to release idle executors and request more when needed, reducing memory pressure. Option B is wrong because increasing the number of core nodes increases cost. Option A is wrong because increasing executor memory per node may cause YARN containers to fail if the instance memory is exceeded.

Option C is wrong because reducing shuffle partitions may reduce parallelism and increase memory per task.
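For reference, dynamic allocation is controlled through standard Spark properties. The sketch below collects a few of them and renders `spark-submit` arguments; the executor counts and idle timeout are illustrative values, not tuned recommendations, and `to_spark_submit_args` is a helper written for this example.

```python
# Illustrative values only; tune them for the actual cluster and workload.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}

def to_spark_submit_args(conf: dict) -> list[str]:
    """Render each property as a `--conf key=value` pair."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args

print(" ".join(to_spark_submit_args(dynamic_allocation_conf)))
```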

Practice this question →

261

MCQmedium

A data engineer uses AWS Glue to run ETL jobs that transform data from JSON to Parquet. The job runs successfully but takes 30 minutes longer than expected. CloudWatch metrics show high memory utilization and disk spills. What is the most likely cause?

A.The number of DPUs is too low

B.The sink bucket has insufficient I/O throughput

C.The source data format is too large

D.The data is skewed and not evenly distributed across partitions

AnswerD

Data skew causes some tasks to take longer, leading to spills and increased runtime.

Why this answer

High memory utilization and disk spills indicate that the data is not evenly distributed, causing some executors to process more data than others. This is often due to data skew. Increasing DPUs might help, but addressing skew is more effective.

Practice this question →

262

MCQmedium

Refer to the exhibit. A data engineer is troubleshooting an AWS Glue job that fails with an 'AccessDenied' error when trying to write to the S3 bucket 'my-data-lake'. The IAM policy attached to the Glue service role is shown. What is the missing permission?

A.s3:ListBucket

B.s3:PutObjectAcl

C.s3:GetBucketLocation

D.s3:DeleteObject

AnswerA

Correct: Glue needs ListBucket to list objects in the bucket.

Why this answer

The policy allows s3:GetObject and s3:PutObject on the bucket's objects, but it does not allow s3:ListBucket on the bucket itself. Many Glue operations require ListBucket to discover objects. Option A (s3:ListBucket) is missing.

Option D (s3:DeleteObject) is not needed. Option C (s3:GetBucketLocation) is not required. Option B (s3:PutObjectAcl) is not needed.
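The distinction between bucket-level and object-level resources is easy to miss. Since the exhibit is not reproduced here, the following is only a sketch of what a corrected policy could look like, with object actions scoped to `my-data-lake/*` and `s3:ListBucket` scoped to the bucket ARN; `allowed_actions` is a helper invented for illustration.

```python
import json

BUCKET = "my-data-lake"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # object-level actions apply to keys inside the bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {   # the bucket-level action applies to the bucket ARN itself
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

def allowed_actions(policy: dict) -> set[str]:
    """Collect every action granted by an Allow statement."""
    return {
        action
        for statement in policy["Statement"]
        if statement["Effect"] == "Allow"
        for action in statement["Action"]
    }

print(json.dumps(policy, indent=2))
```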

Practice this question →

263

MCQhard

An ML team is building a recommendation system. The training data includes user-item interactions stored in Amazon DynamoDB. The team wants to export this data to S3 in Parquet format for use with Amazon SageMaker. The export should be incremental (only new or changed records) and run daily. Which approach meets these requirements with MINIMAL operational overhead?

A.Use the DynamoDB Export to S3 feature and schedule it daily with AWS Glue.

B.Use DynamoDB Streams with AWS Lambda to write changes to S3 in Parquet format.

C.Use a script that scans the DynamoDB table and filters by last updated timestamp.

D.Set up an Amazon EMR cluster running Spark jobs to read DynamoDB and write to S3.

AnswerB

Streams capture changes in near-real-time, enabling incremental exports with minimal overhead.

Why this answer

Option B is correct because DynamoDB Streams capture every change (insert, update, delete) in near real-time, and AWS Lambda can process these events to write only the changed records to S3 in Parquet format. This approach provides incremental, daily exports with minimal operational overhead, as it is fully serverless and requires no infrastructure management.

Exam trap

The trap here is that candidates often choose Option A because they assume 'Export to S3' is incremental, but it actually exports the entire table, not just changes, leading to higher costs and redundant data processing.

How to eliminate wrong answers

Option A is wrong because the DynamoDB Export to S3 feature exports the entire table snapshot, not incremental changes, and scheduling it with AWS Glue adds unnecessary complexity and cost for a full export each day. Option C is wrong because scanning the entire DynamoDB table daily and filtering by last updated timestamp is inefficient, costly (consumes read capacity), and does not capture deletions; it also requires custom scripting and handling of large datasets. Option D is wrong because setting up and managing an Amazon EMR cluster introduces significant operational overhead for a simple incremental export task, and it is overkill compared to the serverless Streams + Lambda approach.
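The Streams plus Lambda approach can be sketched as a function that pulls the changed items out of a stream event. This assumes the stream is configured to include new images, and it handles only string and number attributes; writing the rows out as Parquet is left out because it needs an extra library. The attribute names and the sample event are hypothetical.

```python
def plain_value(attr: dict):
    # Stream images wrap each value in a type tag such as {"S": ...} or {"N": ...}.
    (tag, value), = attr.items()
    if tag == "N":
        return float(value)
    return value

def extract_changes(event: dict) -> list[dict]:
    """Flatten each stream record into a plain row tagged with its event type."""
    changes = []
    for record in event.get("Records", []):
        # REMOVE events carry no new image, so fall back to the item keys.
        image = record["dynamodb"].get("NewImage") or record["dynamodb"]["Keys"]
        row = {name: plain_value(attr) for name, attr in image.items()}
        row["_event"] = record["eventName"]  # INSERT, MODIFY, or REMOVE
        changes.append(row)
    return changes

sample_event = {
    "Records": [
        {"eventName": "INSERT",
         "dynamodb": {"Keys": {"user_id": {"S": "u1"}},
                      "NewImage": {"user_id": {"S": "u1"}, "rating": {"N": "4"}}}},
        {"eventName": "REMOVE",
         "dynamodb": {"Keys": {"user_id": {"S": "u2"}}}},
    ]
}

print(extract_changes(sample_event))
```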

Practice this question →

264

MCQhard

A company is using AWS Glue to run ETL jobs that transform data from multiple sources into a data lake on S3. The jobs are scheduled to run hourly. Recently, the jobs have been failing intermittently with 'MemoryError' exceptions. The data volume has grown over time. The data engineer needs to resolve this issue cost-effectively. Which action should be taken?

A.Increase the number of DPUs allocated to the Glue job and use a larger worker type.

B.Increase the S3 timeout settings in the Glue job configuration.

C.Switch the Glue job type from Spark to Python shell to reduce memory overhead.

D.Repartition the data using Spark's repartition method before processing.

AnswerA

More DPUs and larger worker types provide more memory to handle larger data volumes.

Why this answer

Option A is correct because Glue jobs can be configured to use more DPUs (Data Processing Units) to increase memory, and using worker type G.2X or G.4X provides more memory per worker. Option B (increasing S3 timeout) does not address memory. Option D (Spark repartitioning) may help but is more complex and may not be sufficient if memory is insufficient.

Option C (changing to Python shell) reduces memory and will likely fail.

Practice this question →

265

MCQeasy

A company is using Amazon Kinesis Data Firehose to load streaming data into Amazon S3. The data is in JSON format, and they want to convert it to Parquet before storage. What should they configure?

A.Enable data format conversion in Firehose and specify a Glue table

B.Use an AWS Lambda function to transform the data

C.Run an AWS Glue ETL job after data is in S3

D.Use Kinesis Data Analytics for Apache Flink to convert the format

AnswerA

Firehose can convert to Parquet using a Glue table schema.

Why this answer

Kinesis Data Firehose can convert incoming data to Parquet or ORC using a schema from AWS Glue. Option B is wrong because Lambda can also transform but is not the primary method. Option D is wrong because Kinesis Data Analytics is for processing.

Option C is wrong because a Glue ETL job that runs after the data is in S3 converts it after storage, not before.
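The format conversion setting can be pictured as a small configuration block. The sketch below follows the shape of the `DataFormatConversionConfiguration` that boto3 accepts inside a Firehose extended S3 destination; the database name, table name, and role ARN are hypothetical, and `output_format` is a helper written for this example.

```python
# Hypothetical Glue database, table, and IAM role names.
data_format_conversion = {
    "Enabled": True,
    "SchemaConfiguration": {
        "DatabaseName": "analytics_db",
        "TableName": "events",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-glue-access",
    },
    # Incoming records are JSON; outgoing objects are Parquet.
    "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
    "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
}

def output_format(conf: dict) -> str:
    """Return the name of the configured output serializer."""
    serializer = conf["OutputFormatConfiguration"]["Serializer"]
    return next(iter(serializer))

print(output_format(data_format_conversion))  # prints ParquetSerDe
```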

Practice this question →

266

Multi-Selecteasy

Which TWO options are best practices for managing access to data stored in Amazon S3 for a data lake?

Select 2 answers

A.Use S3 access control lists (ACLs) for granular permissions

B.Enable default encryption with SSE-S3

C.Use IAM policies to control user and role permissions

D.Use S3 bucket policies to grant cross-account access

E.Generate pre-signed URLs for all data access

AnswersC, D

IAM policies are central to access management.

Why this answer

IAM policies and bucket policies are standard for access control. S3 ACLs are legacy and not recommended. SSE-S3 is encryption, not access control.

Pre-signed URLs are for temporary access, not general governance.

Practice this question →

267

MCQeasy

A company runs a nightly AWS Glue ETL job that processes data from an Amazon Redshift cluster and writes to Amazon S3. The job fails intermittently with 'ERROR: cannot execute INSERT in a read-only transaction'. What is the most likely cause?

A.The IAM role used by Glue does not have permissions to insert into Redshift

B.The JDBC driver version is incompatible with the Redshift cluster

C.The Redshift cluster is in a read-only state due to a failover or maintenance

D.The Glue job's connection pool is exhausted

AnswerC

During failover, the secondary cluster may be read-only, causing this error.

Why this answer

This error occurs when the Glue job tries to write to a Redshift table that is in a read-only transaction, often because the Redshift cluster is in a read-only state due to a failover or maintenance. Option B (JDBC driver incompatibility) would give a different error. Option D (connection pool exhaustion) would give a different error.

Option A (IAM permissions) would give an access denied error.

Practice this question →

268

MCQhard

A team is building a data pipeline to process terabytes of log data daily using Amazon EMR. The data arrives in 5-minute windows and must be available for querying within 30 minutes. The data is originally in gzip-compressed CSV files. Which approach will minimize processing time and cost?

A.Use Amazon EMR with Spark to convert data to Parquet and use on-demand instances.

B.Use Amazon EMR with Spark to convert data to Parquet and store in S3, using spot instances for task nodes.

C.Use AWS Glue to convert data to gzip-compressed CSV and query with Athena.

D.Use Amazon EMR with Hive to transform data to compressed CSV and store in S3.

AnswerB

Parquet reduces scan size, spot instances reduce cost.

Why this answer

Option B is correct because converting gzip-compressed CSV to Parquet reduces storage size and improves query performance due to columnar storage and predicate pushdown. Using spot instances for task nodes significantly lowers compute cost, while the 30-minute SLA is achievable with Spark on EMR processing 5-minute windows of data.

Exam trap

The trap here is that candidates may overlook the cost savings of spot instances for transient, fault-tolerant workloads, or assume that any compression (like gzip CSV) is sufficient for performance, ignoring the benefits of columnar formats like Parquet for analytical queries.

How to eliminate wrong answers

Option A is wrong because using on-demand instances for task nodes increases cost unnecessarily; spot instances are suitable for fault-tolerant, transient workloads like data transformation. Option C is wrong because AWS Glue is not optimized for high-volume, low-latency ETL on terabytes of daily log data, and converting to gzip-compressed CSV does not improve query performance over Parquet. Option D is wrong because Hive on EMR is slower than Spark for large-scale data processing, and storing as compressed CSV does not provide the performance benefits of columnar formats like Parquet.

Practice this question →

269

Multi-Selecthard

A company is using Amazon Redshift for data warehousing. The data engineering team observes that query performance degrades over time due to data skew. Which three strategies should the team implement to improve performance?

Select 3 answers

A.Choose appropriate distribution keys based on join and group-by columns.

B.Increase the number of nodes in the Redshift cluster.

C.Run VACUUM and ANALYZE commands regularly.

D.Define appropriate sort keys to minimize the number of blocks scanned.

E.Drop unused indexes on large tables.

AnswersA, C, D

Good distribution keys reduce data movement and improve performance.

Why this answer

Options A, C, and D are correct. Choosing appropriate distribution keys reduces data movement. Regular VACUUM and ANALYZE reclaims space and updates statistics.

Using sort keys improves query performance by reducing scans. Option B (increasing node count) may help but is costly and not a targeted fix for skew. Option E (dropping indexes) is not applicable to Redshift (no indexes).

Practice this question →

270

Multi-Selecthard

A company needs to build a data lake on AWS for analytics. The data includes structured, semi-structured, and unstructured data. The solution must support schema-on-read, provide fine-grained access control, and be cost-effective for storing rarely accessed data. Which THREE services should be used? (Choose THREE)

Select 3 answers

A.AWS Glue Data Catalog for schema-on-read.

B.Amazon Redshift for data warehousing.

C.Amazon S3 as the primary storage layer.

D.Amazon EMR for data processing.

E.S3 Lifecycle policies to transition data to Glacier.

AnswersA, C, E

Glue enables schema-on-read for analytics.

Why this answer

AWS Glue Data Catalog is correct because it provides a centralized metadata repository that enables schema-on-read for data stored in Amazon S3. It allows you to define table schemas and partitions without transforming the underlying data, so analytics tools like Amazon Athena and Amazon EMR can query the data with the schema applied at read time.

Exam trap

The trap here is that candidates often confuse Amazon Redshift as a data lake storage layer due to its analytics capabilities, but it is a data warehouse with schema-on-write and higher costs for infrequently accessed data, making it unsuitable for the described requirements.

Practice this question →

271

MCQeasy

A company wants to store semi-structured data from IoT sensors in a cost-effective manner for occasional querying. The data is not updated once written. Which Amazon S3 storage class is the most cost-effective for this use case?

A.S3 Standard

B.S3 One Zone-Infrequent Access

C.S3 Intelligent-Tiering

D.S3 Glacier Deep Archive

AnswerD

Correct: Deep Archive is the lowest cost for rarely accessed data with long retrieval times.

Why this answer

S3 Glacier Deep Archive is the lowest-cost storage class for rarely accessed data, with standard retrieval times of up to 12 hours. Option D is correct. Option A (S3 Standard) is for frequent access.

Option C (S3 Intelligent-Tiering) incurs monitoring costs. Option B (S3 One Zone-IA) is for infrequent access but has a minimum storage charge.

Practice this question →

272

MCQhard

A company runs a real-time fraud detection system using Amazon Kinesis Data Streams with 100 shards. Data is consumed by a custom Java application running on Amazon EC2 instances in an Auto Scaling group. The application processes records and writes results to a DynamoDB table. Over the past month, the application has experienced intermittent slowdowns and the DynamoDB write capacity has been fully utilized during peak hours. The team wants to improve throughput without losing the ability to reprocess failed records. The application currently uses the Kinesis Client Library (KCL) with DynamoDB as the lease table. The team is considering several changes. Which change should the team implement first to address the issue?

A.Increase the write capacity of the DynamoDB lease table to handle more workers.

B.Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput.

C.Switch to using AWS Lambda as the consumer to handle scaling automatically.

D.Increase the number of EC2 instances to match the number of shards.

AnswerB

Enhanced fan-out gives dedicated throughput per consumer.

Why this answer

The primary bottleneck is DynamoDB write capacity being fully utilized during peak hours. Enhanced fan-out (option B) provides each consumer with a dedicated 2 MB/second read throughput per shard, eliminating the need for consumers to contend for the shared 2 MB/second per shard. This reduces the load on the DynamoDB lease table because workers no longer need to poll for records, which in turn lowers the write operations to the lease table and alleviates the DynamoDB write capacity issue.

Exam trap

The trap here is that candidates assume increasing DynamoDB write capacity (option A) is the direct fix for write capacity exhaustion, but they miss that enhanced fan-out reduces the underlying cause of those writes by eliminating polling-based contention.

How to eliminate wrong answers

Option D is wrong because increasing EC2 instances to match shards does not address the DynamoDB write capacity bottleneck; it may even increase lease table writes due to more workers contending for leases. Option A is wrong because increasing the write capacity of the DynamoDB lease table treats a symptom (high write load from KCL workers) rather than the root cause (contention for shard throughput); enhanced fan-out reduces the need for frequent lease updates. Option C is wrong because switching to AWS Lambda changes how the consumer scales but does not by itself solve the DynamoDB write capacity issue; the processed results are still written to the same DynamoDB table.

Practice this question →

273

Multi-Selecthard

A company uses AWS Glue to run ETL jobs on a daily basis. The jobs read from Amazon RDS and write to Amazon S3. The data volume has grown, and the jobs are taking longer to complete. The team wants to optimize the jobs for cost and performance. Which combination of techniques should the team implement? (Choose THREE.)

Select 3 answers

A.Use a larger Glue worker type, such as G.2X, for more memory per worker.

B.Enable job bookmarks to process only new data since the last run.

C.Increase the number of partitions in the output S3 data to improve parallelism.

D.Increase the maximum number of DPUs for the job to 100.

E.Use pushdown predicates in the JDBC connection to filter data at the source.

AnswersA, B, E

Larger workers provide more resources per task, improving performance.

Why this answer

Options A, B, and E are correct. Using job bookmarks enables incremental processing, reducing the amount of data read. Using pushdown predicates filters data at the source, reducing data transfer.

Using a larger worker type (e.g., G.2X) increases memory and CPU per worker, improving performance. Option C is wrong because adding more partitions to the output does not speed up the job. Option D is wrong because increasing the number of DPUs increases cost linearly and may not be as effective as using larger workers.
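Two of these techniques are plain job settings. The fragment below is a sketch in the shape that boto3's `glue.create_job` accepts for the worker type and default arguments; the worker count is an illustrative value, and `bookmarks_enabled` is a helper invented for this example.

```python
# Illustrative settings; the worker count is not a recommendation.
job_settings = {
    "WorkerType": "G.2X",      # larger worker type: more memory and CPU per worker
    "NumberOfWorkers": 10,
    "DefaultArguments": {
        # Process only data that has arrived since the last successful run.
        "--job-bookmark-option": "job-bookmark-enable",
    },
}

def bookmarks_enabled(settings: dict) -> bool:
    """Check whether the job is configured to use job bookmarks."""
    option = settings.get("DefaultArguments", {}).get("--job-bookmark-option")
    return option == "job-bookmark-enable"

print(bookmarks_enabled(job_settings))  # prints True
```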

Practice this question →

274

MCQhard

Refer to the exhibit. A data engineer examines the output of 'aws glue get-job-run' for a failed job. The job run state is FAILED, but ErrorMessage is empty. The job ran for 3600 seconds (1 hour) before failing. What is the MOST likely cause of the failure?

A.The JDBC connection to the source database timed out.

B.The IAM role does not have sufficient permissions to access S3.

C.The job ran out of memory due to insufficient DPU allocation.

D.The Python script has a syntax error.

AnswerC

Out-of-memory errors may not always produce a detailed error message in the job run output.

Why this answer

Option C is correct. The job ran for 3600 seconds (60 minutes), and the exhibit shows Timeout: 2880 minutes (48 hours, the default), so the job did not hit its timeout.

Glue jobs often fail without a detailed error message when they hit resource constraints such as memory. MaxCapacity is 10 DPUs, which may be insufficient for the data size, so an out-of-memory failure is the most likely cause.

Option B is wrong because missing permissions would show an Access Denied error. Option D is wrong because Python errors would appear in the logs. Option A is wrong because a connection timeout would produce an error message rather than an empty one.
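The timeout check comes down to a unit conversion, since ExecutionTime is reported in seconds and Timeout in minutes. A minimal sketch using the two values quoted from the exhibit:

```python
execution_time_seconds = 3600   # ExecutionTime from the job run
timeout_minutes = 2880          # Timeout from the job run

execution_time_minutes = execution_time_seconds / 60
hit_timeout = execution_time_minutes >= timeout_minutes

print(execution_time_minutes, hit_timeout)  # prints 60.0 False
```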

Practice this question →

275

MCQeasy

A data engineer is tasked with building a pipeline to process streaming data from IoT devices. The devices send data in JSON format every second. The pipeline must aggregate data in 5-minute windows and store the results in Amazon S3. The engineer needs to handle late-arriving data (up to 1 hour) and ensure exactly-once semantics. Which combination of AWS services should they use?

A.Amazon Kinesis Data Streams for ingestion, Amazon Kinesis Data Analytics for windowed aggregations, and Amazon Kinesis Data Firehose to write to Amazon S3.

B.Amazon Kinesis Data Streams for ingestion, AWS Glue Streaming ETL for aggregation, and Amazon S3 for storage.

C.Amazon SQS for ingestion, AWS Lambda for aggregation, and Amazon S3 for storage.

D.Amazon Kinesis Data Streams for ingestion, Amazon Kinesis Data Firehose for transformation, and Amazon S3 for storage.

AnswerA

Kinesis Data Analytics supports windowed aggregations and exactly-once processing; Firehose delivers to S3 with minimal overhead.

Why this answer

Option A is correct because Kinesis Data Analytics supports windowed aggregations, can handle late data via watermarking, and provides exactly-once processing when used with Kinesis Data Streams. Option D is wrong because Kinesis Data Firehose does not allow custom windowed aggregations. Option B is wrong because Glue Streaming is a micro-batch-oriented service.

Option C is wrong because Lambda does not have built-in support for exactly-once semantics for streaming applications.
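The idea of 5-minute windows with a tolerance for late data can be shown with a small in-memory sketch. This is a deliberate simplification: a real streaming engine tracks a watermark based on the event times it has seen, while the code below just compares each record's arrival time with its event time. It does not demonstrate exactly-once delivery. The sample events and function names are invented for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60             # 5-minute tumbling windows
ALLOWED_LATENESS_SECONDS = 60 * 60  # accept data up to 1 hour late

def window_start(event_time: int) -> int:
    # Round the event time down to the start of its window.
    return event_time - event_time % WINDOW_SECONDS

def aggregate(events):
    """events: iterable of (event_time, arrival_time, value), times in epoch seconds."""
    totals = defaultdict(float)
    dropped = 0
    for event_time, arrival_time, value in events:
        if arrival_time - event_time > ALLOWED_LATENESS_SECONDS:
            dropped += 1  # too late to be counted
            continue
        totals[window_start(event_time)] += value
    return dict(totals), dropped

events = [
    (0, 1, 1.0),
    (120, 121, 2.0),
    (310, 311, 4.0),
    (200, 3000, 8.0),   # 2800 s late, still accepted
    (250, 4000, 16.0),  # 3750 s late, dropped
]
print(aggregate(events))  # prints ({0: 11.0, 300: 4.0}, 1)
```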

Practice this question →

276

MCQeasy

Refer to the exhibit. A data engineer has deployed this CloudFormation template. The Glue job 'my-etl-job' reads from the S3 bucket 'my-data-lake-bucket' and writes transformed data to another bucket. After 30 days, the data engineer notices that the Glue job fails with 'Input data not found' errors. What is the most likely cause?

A.The temporary directory 'my-temp-dir' is being cleaned up by the lifecycle configuration.

B.The script location 's3://my-scripts/etl.py' is being deleted by the lifecycle rule.

C.The job bookmark option 'job-bookmark-enable' is causing the job to skip newly arriving data.

D.The lifecycle configuration deletes objects from the bucket after 30 days, removing the input data.

AnswerD

The ExpirationInDays: 30 rule deletes objects older than 30 days, which may include input data.

Why this answer

Option D is correct. The lifecycle rule expires objects after 30 days, removing the input data that the Glue job expects. Option C is incorrect because job bookmarks track state but do not cause data loss.

Option A is incorrect because the temp directory is separate. Option B is incorrect because the script location is not affected by the lifecycle rule.
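The effect of a 30-day expiration rule can be approximated with simple date arithmetic. The sketch below treats an object as expired once 30 days have passed since it was last written; S3 applies its own timing when it actually removes objects, so this is an approximation. The object keys and dates are hypothetical.

```python
from datetime import date, timedelta

EXPIRATION_DAYS = 30  # mirrors ExpirationInDays: 30 in the lifecycle rule

def is_expired(last_modified: date, today: date,
               expiration_days: int = EXPIRATION_DAYS) -> bool:
    return today - last_modified >= timedelta(days=expiration_days)

today = date(2024, 3, 1)
objects = {
    "input/2024-01-15.json": date(2024, 1, 15),
    "input/2024-02-20.json": date(2024, 2, 20),
}

expired = [key for key, modified in objects.items() if is_expired(modified, today)]
print(expired)  # prints ['input/2024-01-15.json']
```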

Practice this question →

277

MCQeasy

A company uses Amazon Redshift for data warehousing. The data engineering team needs to load data from multiple S3 buckets into Redshift daily. Each bucket contains files in different formats (CSV, JSON, Parquet). Which AWS service is BEST suited to automate this ingestion process?

A.Amazon EMR with Apache Spark

B.AWS Data Pipeline

C.AWS Database Migration Service (DMS)

D.AWS Glue

AnswerD

Glue provides crawlers for schema discovery and ETL jobs for loading into Redshift.

Why this answer

Option D is correct. AWS Glue can crawl the S3 buckets to discover schema and run ETL jobs to transform and load data into Redshift. Option C (DMS) is for database migration.

Option A (EMR) requires more management. Option B (Data Pipeline) is less flexible.

Practice this question →

278

Multi-Selecthard

Which THREE factors should be considered when choosing between Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose for a real-time data ingestion pipeline? (Choose 3.)

Select 3 answers

A.Ability to compress data before delivery

B.Ability to encrypt data at rest

C.Need for custom data processing using AWS Lambda

D.Data retention requirements

E.Latency requirements for data delivery to S3

AnswersC, D, E

Kinesis Data Streams supports custom processing with Lambda, Firehose has limited transformation.

Why this answer

Kinesis Data Streams provides custom processing with shard-level throughput and retention up to 365 days. Firehose automatically delivers to destinations like S3, Redshift, and Elasticsearch with near-real-time latency. Option B is wrong because both support encryption.

Option A is wrong because both support compression before delivery. Latency (Option E) is a valid factor: both services handle streaming data, but Firehose buffers records before delivery, which makes it simpler for loading into S3 but adds delay.

Practice this question →

279

MCQmedium

A company is using Amazon SageMaker to train a model on a dataset that is updated daily. The data is stored in an S3 bucket. The training pipeline uses AWS Step Functions to orchestrate data preprocessing and model training. The preprocessing step uses a SageMaker Processing job that reads data from S3, cleans it, and writes the output back to S3. The team notices that the training step often fails due to insufficient disk space on the processing instance. Which change should the team make to resolve this issue without increasing cost?

A.Enable automatic scaling for the processing job.

B.Use AWS Batch instead of SageMaker Processing.

C.Use a larger instance type with more memory.

D.Configure the processing job to use local instance store (SSD) for scratch space.

AnswerD

Local instance store provides additional disk space without additional cost.

Why this answer

Option D is correct because using local instance store provides more disk space at no extra cost compared to EBS. Option C is wrong because using a different instance type may increase cost. Option A is wrong because automatic scaling does not add disk space to the processing instance.

Option B is wrong because the team already uses SageMaker Processing, which is appropriate.

280

MCQ · hard

A team is building a data lake on Amazon S3 and using AWS Glue to catalog data. They notice that Glue crawlers are taking too long to update the catalog for a large dataset with millions of small files. Which approach will MOST improve crawler performance?

A. Increase the frequency of the crawler runs.

B. Consolidate the small files into larger files (e.g., 100 MB each).

C. Partition the data by date in S3.

D. Use a custom classifier to parse the data.

Answer: B

Fewer, larger files reduce overhead and crawler scan time.

Why this answer

Option B is correct because consolidating small files into larger files reduces the number of objects the crawler must scan.

Option A is wrong because increasing crawler frequency doesn't reduce scan time. Option C is wrong because partitioning helps, but each partition would still contain many small files. Option D is wrong because a custom classifier doesn't reduce scan time.

281

MCQ · easy

A company is streaming clickstream data from a website to Amazon Kinesis Data Streams. The data is consumed by a Lambda function that enriches each record with geolocation information before writing to an S3 bucket. Recently, the Lambda function has been failing with throttling errors. What is the MOST likely cause?

A. The Lambda function's payload size exceeds the 6 MB limit

B. The Lambda function's concurrent execution limit has been reached

C. The Lambda function's reserved concurrency is set too high

D. The Kinesis stream has exceeded the default shard limit of 500

Answer: B

Lambda throttles when the number of concurrent executions exceeds the account limit.

Why this answer

Option B is correct because the default Lambda concurrent execution limit (1,000) can be reached if the stream has many shards or high throughput.

Option A (payload size) would cause a different error. Option C is wrong because reserved concurrency that is set too high does not throttle the function; a value that is too low would. Option D (shard limit) is a Kinesis limit, not a Lambda limit.

282

MCQ · hard

A data scientist is training a deep learning model on a GPU instance. The training data is stored in S3 and is 50 GB. To reduce I/O bottlenecks, which storage option should be used to cache the data locally on the instance?

A. Attach an Amazon EFS file system to the instance and copy data from S3

B. Mount an Amazon FSx for Lustre file system linked to the S3 bucket

C. Provision an Amazon EBS io2 volume and copy data from S3 using AWS DataSync

D. Use instance store volumes to cache the data from S3

Answer: B

FSx for Lustre provides high throughput and can cache S3 data locally.

Why this answer

Option B is correct because Amazon FSx for Lustre provides a high-performance file system integrated with S3 that can cache data locally.

Option A is wrong because EFS is a shared file system with lower throughput. Option C is wrong because EBS io2 volumes offer high IOPS but are not optimized for S3 caching. Option D is wrong because instance store is ephemeral and not persistent.

283

MCQ · easy

A data scientist needs to transform raw JSON data from an S3 bucket into Parquet format using AWS Glue. The job must be cost-effective and run only when new data arrives. Which solution should be used?

A. Create a Glue crawler that runs continuously.

B. Schedule a Glue ETL job to run every hour.

C. Use Glue DataBrew to transform data and schedule it daily.

D. Create a Glue ETL job triggered by an S3 event notification via Lambda.

Answer: D

Event-driven trigger ensures cost-effectiveness.

Why this answer

Option D is correct because it uses an S3 event notification to invoke a Lambda function, which then triggers an AWS Glue ETL job only when new data arrives. This event-driven architecture ensures cost-effectiveness by avoiding continuous or scheduled runs, and it directly transforms raw JSON into Parquet format as required.

Exam trap

The trap here is that candidates may confuse Glue crawlers (which only catalog metadata) with Glue ETL jobs (which transform data), or assume scheduled jobs are always cost-effective without considering event-driven triggers.

How to eliminate wrong answers

Option A is wrong because a Glue crawler runs continuously to update the Data Catalog, not to transform data into Parquet; it would incur unnecessary costs and does not perform ETL transformations. Option B is wrong because scheduling a Glue ETL job every hour runs regardless of whether new data has arrived, leading to wasted compute resources and higher costs. Option C is wrong because Glue DataBrew is a visual data preparation tool, not designed for automated, event-driven ETL transformations; scheduling it daily would also run even without new data and is less cost-effective than an event-triggered approach.
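The event-driven trigger in Option D can be sketched as a small Lambda handler. This is a minimal illustration under stated assumptions, not a reference implementation: the job name `json-to-parquet` and the `--input_path` job argument are hypothetical, and the Glue client is passed in as a parameter so the logic can be tried without AWS access. The only AWS call assumed is the boto3 Glue client's `start_job_run` with `JobName` and `Arguments`.

```python
from urllib.parse import unquote_plus

JOB_NAME = "json-to-parquet"  # hypothetical Glue job name


def s3_paths(event):
    """Return s3:// URIs for every object named in an S3 event notification."""
    paths = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append(f"s3://{bucket}/{key}")
    return paths


def handler(event, context, glue=None):
    """Start one Glue job run per newly arrived object."""
    if glue is None:
        import boto3  # imported lazily so the sketch loads without boto3

        glue = boto3.client("glue")
    run_ids = []
    for path in s3_paths(event):
        response = glue.start_job_run(
            JobName=JOB_NAME,
            Arguments={"--input_path": path},  # hypothetical job argument
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```

Because the job only starts when S3 emits a notification, no compute is billed while the bucket is idle, which is the cost argument the explanation makes.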

284

MCQ · medium

A company uses Amazon Kinesis Data Firehose to deliver data to Amazon S3. The data must be transformed before delivery using AWS Lambda. The Lambda function adds a timestamp field. The Firehose stream receives up to 10,000 records per second. The transformation currently takes 500 ms per record. What should the team do to ensure the transformation can keep up with the incoming data without data loss?

A. Increase the number of shards in the Kinesis stream.

B. Place the Lambda function in a VPC to improve network performance.

C. Increase the Lambda concurrency limit for the function to handle parallel invocations.

D. Increase the S3 buffer size and buffer interval in the Firehose delivery stream.

Answer: C

More concurrency allows processing more records in parallel.

Why this answer

Option C is correct because increasing the Lambda concurrency limit ensures that multiple Lambda invocations can run in parallel to handle the high throughput.

Option A is wrong because increasing the number of shards applies to Kinesis Data Streams, not Firehose. Option B is wrong because Lambda functions in a VPC may have reduced network performance, and a VPC is not needed for adding a timestamp. Option D is wrong because increasing the buffer size would cause delays and potential data loss if the buffer fills up.
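The transformation in this question (adding a timestamp field) can be sketched as a Firehose transformation handler. The sketch assumes each record's payload is a JSON object, and the field name `processed_at` is hypothetical. Firehose passes base64-encoded records in `event["records"]` and expects each one back with the same `recordId`, a `result` status, and re-encoded `data`.

```python
import base64
import json
from datetime import datetime, timezone


def handler(event, context):
    """Add a timestamp field to each JSON record in a Firehose batch."""
    now = datetime.now(timezone.utc).isoformat()
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed_at"] = now  # hypothetical field name
        # A trailing newline keeps records separated once delivered to S3.
        data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append(
            {
                "recordId": record["recordId"],  # must echo the incoming ID
                "result": "Ok",
                "data": data.decode("utf-8"),
            }
        )
    return {"records": output}
```

Each invocation receives a batch rather than a single record, so the number of parallel invocations needed depends on batch size as well as per-record work.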

285

MCQ · medium

A data engineer needs to ingest data from an on-premises Apache Kafka cluster into Amazon S3 with minimal latency (under 5 minutes) for real-time analytics. The data volume is approximately 10 MB per second. Which solution is MOST cost-effective and meets the latency requirement?

A. Use Amazon MSK to mirror the on-premises Kafka cluster, then use Kinesis Firehose to write to S3

B. Use Amazon S3 Transfer Acceleration for direct uploads from on-premises

C. Use Amazon Kinesis Data Streams with a Direct Connect connection from on-premises

D. Set up a VPN connection and use AWS Lambda to consume from Kafka and write to S3

Answer: A

MSK provides managed Kafka with low latency, and Firehose can buffer and write to S3 every 60 seconds.

Why this answer

Option A is correct. Amazon MSK (Amazon Managed Streaming for Apache Kafka) can replicate the on-premises Kafka topics to the cloud with low latency, and Kinesis Firehose can then deliver to S3 with a 60-second buffer.

Option B (S3 Transfer Acceleration) is for file uploads, not streaming. Option C (Kinesis Data Streams) requires additional ingestion from on-premises. Option D (VPN + Lambda) has higher latency.

286

MCQ · medium

A data engineering team needs to process streaming data from thousands of IoT devices. They want to aggregate data in 1-minute windows and store results in an S3 data lake for downstream analytics. Which architecture should they use?

A. Use AWS Glue ETL jobs running in streaming mode to read from Kinesis Data Streams, apply window aggregations, and write to S3.

B. Use Kinesis Data Streams with enhanced fan-out and multiple consumers to aggregate windows, then write to S3 via Firehose.

C. Use Kinesis Data Streams, trigger a Lambda function for 1-minute window aggregation using Python, and write results to S3.

D. Use Kinesis Data Analytics for SQL-based windowed aggregations and send results to Kinesis Data Firehose for delivery to S3.

Answer: D

Kinesis Data Analytics supports tumbling windows and continuous queries; Firehose is the natural sink for S3.

Why this answer

Option D is correct because Kinesis Data Analytics provides real-time SQL-based processing with windowing functions, and Kinesis Firehose can deliver aggregated data directly to S3.

Option A is wrong because Glue is primarily batch-oriented and is not the most direct fit for real-time windowed aggregation. Option B is wrong because Kinesis Data Streams alone does not process data; it requires a consumer to perform the aggregation. Option C is wrong because Lambda scales but has a 15-minute timeout and is not ideal for heavy streaming aggregation.
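To make the 1-minute window aggregation concrete, the following plain-Python sketch shows what a tumbling window computes. It is only an illustration of the concept, not Kinesis Data Analytics SQL; the `(device_id, epoch_seconds)` event shape and the function name are assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60


def tumbling_window_counts(events):
    """Count events per (device_id, window_start) in 1-minute tumbling windows.

    `events` is an iterable of (device_id, epoch_seconds) pairs.
    """
    counts = defaultdict(int)
    for device_id, ts in events:
        # Tumbling windows do not overlap: each timestamp maps to exactly one.
        window_start = ts - (ts % WINDOW_SECONDS)
        counts[(device_id, window_start)] += 1
    return dict(counts)
```

A managed streaming service performs the same grouping continuously and emits each window's result once the window closes, which is the part that is awkward to build by hand.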

287

Multi-Select · hard

Which THREE AWS services can be used together to build a serverless data pipeline that ingests streaming data, transforms it, and loads it into Amazon Redshift for analysis?

Select 3 answers

A. Amazon EMR

B. Amazon SQS

C. Amazon Kinesis Data Firehose

D. Amazon Kinesis Data Streams

E. AWS Lambda

Answers: C, D, E

Firehose delivers transformed data directly to Redshift.

Why this answer

Kinesis Data Streams ingests the streaming data, Lambda processes it, and Firehose delivers it to Redshift.

EMR is a managed cluster service, not a serverless one. SQS is not ideal for streaming. Glue can also be used but is not the only option.

288

MCQ · medium

A data engineering team needs to build a data lake on Amazon S3 that will be queried by Amazon Athena and Amazon Redshift Spectrum. The data will be ingested from multiple sources in various formats (CSV, JSON, Parquet). Which partitioning strategy will provide the best query performance for date-range queries?

A. Partition by date with one partition per day in a flat structure.

B. Do not partition; let Athena scan the entire dataset.

C. Partition by year, month, and day in a hierarchical structure.

D. Partition by source system first, then by date.

Answer: C

Hierarchical date partitioning enables partition pruning for date-range queries.

Why this answer

Option C is correct because partitioning by year, month, and day as separate prefixes (e.g., s3://bucket/year=2024/month=01/day=15/) allows Athena and Redshift Spectrum to prune partitions efficiently.

Option A is wrong because a flat structure with a single partition per day results in too many partitions for large datasets. Option B is wrong because no partitioning leads to full table scans. Option D is wrong because partitioning by source first and then by date may be less optimal if queries often filter by date across sources.
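The hierarchical layout can be sketched with a small helper that builds the Hive-style prefix shown in the explanation. The function names are hypothetical, and the bucket name is whatever the caller passes in.

```python
from datetime import date, timedelta


def partition_prefix(bucket, day):
    """Build a Hive-style year/month/day prefix for one date."""
    return f"s3://{bucket}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def prefixes_for_range(bucket, start, end):
    """List the prefixes for every day from start to end, inclusive."""
    days = (end - start).days
    return [partition_prefix(bucket, start + timedelta(days=i)) for i in range(days + 1)]
```

A date-range query only needs to read the prefixes this range produces, which is what partition pruning amounts to.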

289

MCQ · medium

A company uses Amazon Kinesis Data Analytics for Apache Flink to process real-time clickstream data. The application uses event time and watermarks for windowed aggregations. The team notices that the output from tumbling windows is delayed, and many late records are being dropped. What is the MOST likely cause?

A. The checkpointing interval is too long, causing state to be lost

B. The parallelism is too low, causing backpressure

C. The source is marking itself as idle, causing watermarks to stall

D. The allowed lateness is set too low, causing late records to be discarded

Answer: D

Low allowed lateness means records arriving after the watermark are dropped.

Why this answer

Option D is correct because late records are dropped when the watermark has passed the window's end; increasing the allowed lateness gives more time for late records to arrive.

Option A is wrong because the checkpointing interval does not affect watermark progress. Option B is wrong because parallelism affects throughput, not watermark behavior. Option C is wrong because idle sources cause watermarks to stall; they do not drop late records.
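The drop rule described above can be written as a one-line check. This is a simplified model for intuition only, not Flink's actual implementation: it treats a record as discarded once the watermark has reached the end of its window plus the allowed lateness. The function and parameter names are hypothetical.

```python
def is_dropped(window_end_ms, watermark_ms, allowed_lateness_ms=0):
    """Simplified model of late-record handling for an event-time window.

    A record belonging to a window is discarded once the watermark has
    moved past the window end plus the allowed lateness.
    """
    return watermark_ms >= window_end_ms + allowed_lateness_ms
```

For a window ending at 60,000 ms and a watermark at 65,000 ms, a record is dropped with zero allowed lateness but kept with 10,000 ms of allowed lateness.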

290

MCQ · hard

An e-commerce company uses Amazon Redshift for analytics. The data engineering team needs to load daily sales data from an S3 bucket that receives new files every hour. The data must be loaded into Redshift with minimal impact on query performance during the day, and they need to handle late-arriving data (files that appear after the daily load). Which approach should they use?

A. Use AWS Glue ETL to copy the data from S3 to Redshift, overwriting the existing data each day.

B. Use a staging table to load data incrementally with a MERGE operation, and schedule a late-arriving data job to merge files that arrive after the daily load.

C. Stream the data from S3 using Amazon Kinesis Firehose to load into Redshift continuously.

D. Use Amazon Redshift Spectrum to query data directly from S3 and create external tables.

Answer: B

Staging tables allow incremental upserts and handling of late data without blocking queries.

Why this answer

Option B is correct because a staging table allows incremental loads with upsert (MERGE) logic, and a late-arriving data process can merge the additional records later without blocking queries.

Option A (Glue ETL with overwrite) overwrites data, losing late-arriving data. Option C (Kinesis Firehose) is real-time streaming, suitable for near-real-time delivery but not for batched daily loads with late data handling. Option D (Redshift Spectrum) queries data in S3 without loading it, which may be slower for frequent queries.

291

Multi-Select · easy

Which TWO AWS services are suitable for real-time stream processing?

Select 2 answers

A. Amazon Athena

B. AWS Glue

C. Amazon EMR

D. Amazon Kinesis Data Analytics

E. AWS Lambda

Answers: D, E

Kinesis Data Analytics processes streaming data in real time.

Why this answer

Amazon Kinesis Data Analytics and AWS Lambda can process streams in real time. AWS Glue is batch-oriented, Amazon EMR can process streams but is more batch-oriented, and Amazon Athena is for ad hoc SQL queries on S3.

292

MCQ · medium

A data engineer needs to transfer 50 TB of historical data from an on-premises Hadoop cluster to Amazon S3. The company has a 1 Gbps internet connection and wants to complete the transfer within 5 days. What is the MOST cost-effective and reliable solution?

A. Use AWS Snowball Edge device to physically ship the data

B. Use S3 multipart upload over the internet

C. Set up AWS Direct Connect and transfer over the dedicated line

D. Use S3 Transfer Acceleration to speed up the transfer

Answer: A

Snowball can transfer 50 TB in a few days, cost-effective for large data.

Why this answer

AWS Snowball Edge is a physical device that can transfer large amounts of data faster than over the internet.

Option B is wrong because the transfer over the internet connection would take about 5 days at full bandwidth, leaving no margin, and it is unreliable and may incur high costs. Option C is wrong because Direct Connect requires setup time and ongoing costs. Option D is wrong because S3 Transfer Acceleration may help but still relies on the internet connection.
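The "about 5 days" figure can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 1,000 Mbps), and the `utilization` parameter is a hypothetical knob for the share of the link that is actually usable.

```python
def transfer_days(terabytes, megabits_per_second, utilization=1.0):
    """Estimate days needed to move a dataset over a network link."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6 * utilization)
    return seconds / 86_400


# 50 TB over a fully used 1 Gbps link: roughly 4.6 days.
full_link = transfer_days(50, 1000)
# The same transfer at 80% utilization no longer fits in 5 days.
shared_link = transfer_days(50, 1000, utilization=0.8)
```

At full line rate the transfer barely fits the window, and any contention or retransmission pushes it past 5 days, which is why shipping a device is the safer choice here.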

293

MCQ · hard

A company uses Amazon Kinesis Data Streams to ingest clickstream data. The stream has 8 shards. A Lambda function processes each record and writes to Amazon DynamoDB. The Lambda function sometimes fails due to DynamoDB write throttling, causing duplicate processing of records after retries. The data engineering team needs to ensure exactly-once processing semantics for the DynamoDB writes. What should the team do?

A. Use an Amazon SQS FIFO queue between Kinesis and Lambda to deduplicate records.

B. Configure the Lambda event source mapping with a maximum retry count of 0 and a DLQ.

C. Increase the DynamoDB write capacity units to avoid throttling.

D. Use DynamoDB conditional writes with the Kinesis sequence number as a unique attribute to make writes idempotent.

Answer: D

Conditional writes based on the sequence number ensure each record is written only once.

Why this answer

Option D is correct because DynamoDB conditional writes make the operation idempotent: with the Kinesis sequence number as the condition check, a write is skipped if a record with that sequence number already exists.

Option A is wrong because FIFO queues support exactly-once delivery, but the Lambda function would still need to deduplicate within the batch. Option B is wrong because setting retries to 0 sends failed batches to the DLQ instead of writing them, which avoids duplicates only by skipping records. Option C is wrong because increasing write capacity may reduce throttling but does not guarantee exactly-once processing; duplicates can still occur.
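An idempotent conditional write might look like the sketch below. It assumes a table whose key attribute is named `sequence_number` (a hypothetical name; a real design may also need the shard ID in the key), and it relies on the boto3 Table `put_item` call accepting a `ConditionExpression` string. `InMemoryTable` and `ConditionalCheckFailed` are stand-ins invented for this example so the logic can run without AWS.

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for the error DynamoDB raises when a condition fails."""

    response = {"Error": {"Code": "ConditionalCheckFailedException"}}


class InMemoryTable:
    """Hypothetical stand-in for a boto3 Table, for local experimentation."""

    def __init__(self):
        self.items = {}

    def put_item(self, Item, ConditionExpression=None):
        key = Item["sequence_number"]
        if ConditionExpression and key in self.items:
            raise ConditionalCheckFailed()
        self.items[key] = Item


def write_once(table, item):
    """Write item unless its sequence_number exists. Returns True if written."""
    try:
        table.put_item(
            Item=item,
            ConditionExpression="attribute_not_exists(sequence_number)",
        )
        return True
    except Exception as exc:
        error = getattr(exc, "response", {}).get("Error", {})
        if error.get("Code") == "ConditionalCheckFailedException":
            return False  # duplicate delivery after a retry: skip it
        raise
```

When Lambda retries a batch after throttling, records that were already written fail the condition and are skipped, so the retry does not produce duplicates.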

294

Multi-Select · medium

A company uses Amazon Kinesis Data Streams to ingest clickstream data. They need to archive raw data to S3 every hour and also enable real-time processing with sub-second latency. Which TWO actions should they take? (Choose two.)

Select 2 answers

A. Use Kinesis Data Analytics to write output to S3.

B. Configure a Lambda function as a consumer of the stream for real-time processing.

C. Use S3 events to trigger a Lambda function that reads from the stream.

D. Create a Kinesis Data Firehose delivery stream with S3 as destination and set a buffer interval of 3600 seconds.

E. Install the Kinesis Agent on an EC2 instance to write data to S3.

Answers: B, D

Lambda can process records with low latency.

Why this answer

Options B and D are correct. Option B provides real-time processing via Lambda. Option D archives data to S3 using Firehose with hourly buffering.

Option A is wrong because Data Analytics does not write to S3 directly. Option C is wrong because S3 events fire when objects are written to S3; they do not archive stream data. Option E is wrong because Kinesis Agent is for log files, not archiving.

295

MCQ · hard

A company uses AWS Glue ETL jobs to transform CSV data from an S3 bucket into Parquet. The jobs often fail with memory errors when processing large datasets. They want to minimize cost and improve reliability. What should they do?

A. Use G.1X or G.2X worker types and increase the number of DPUs per worker.

B. Use Amazon Athena with CTAS queries to convert the data to Parquet.

C. Switch to S3 Batch Operations with AWS Lambda to process the files individually.

D. Increase the number of workers in the Glue job configuration.

Answer: A

G.1X workers provide more memory and vCPU per worker, reducing OOM errors for memory-intensive transformations.

Why this answer

Option A is correct because using G.1X or G.2X workers and increasing the number of DPUs per worker provides more memory and compute per task, reducing out-of-memory errors.

Option B is wrong because Athena is an interactive query engine, not designed for scheduled ETL. Option C is wrong because S3 Batch Operations is not suitable for complex transformations. Option D is wrong because Spark's shuffle behavior can still cause memory issues even with more workers.

296

MCQ · medium

A company is building a data pipeline to process streaming data from IoT devices. The data is ingested via Amazon Kinesis Data Streams. Each record is about 1 KB. The company wants to use AWS Lambda for real-time transformations and then store the results in Amazon DynamoDB. The expected throughput is 10,000 records per second. The Lambda function currently runs in about 200 ms. The company is concerned about Lambda concurrency limits and wants to ensure there are no throttling errors. The default concurrency limit for Lambda is 1,000. Which approach should the team take to handle the expected throughput without throttling?

A. Increase the Lambda function memory to 3,000 MB to reduce the execution time below 100 ms.

B. Use Amazon Kinesis Data Firehose instead of Lambda to load data directly into DynamoDB.

C. Reduce the Lambda batch size to 10 so that each invocation processes fewer records, reducing the time per invocation.

D. Increase the number of shards in the Kinesis Data Stream to 10 and set the Lambda batch size to 100.

Answer: D

With 10 shards and a batch size of 100, there are at most 10 concurrent Lambda invocations, well within the limit.

Why this answer

Option D is correct because increasing the shard count to 10 ensures that each shard can trigger a Lambda invocation concurrently, and with a batch size of 100, the number of concurrent Lambda executions is at most 10 (10 shards * 1 batch per shard). This stays well within the concurrency limit.

Option A is incorrect because increasing memory does not affect concurrency limits. Option B is incorrect because DynamoDB is not a supported Firehose destination, so Firehose cannot load the data directly. Option C is incorrect because reducing the batch size increases the number of invocations (10,000 / 10 = 1,000 invocations per second, or about 200 in flight at 200 ms each), which moves closer to the concurrency limit rather than away from it.
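The concurrency arithmetic in this explanation can be reproduced with two small helpers. Both function names are hypothetical. The first models the shard-based bound (one concurrent batch per shard, multiplied by the event source mapping's parallelization factor, which is assumed to default to 1); the second is a rough estimate of invocations in flight as invocation rate times duration.

```python
import math


def shard_bound_concurrency(shards, parallelization_factor=1):
    """Upper bound on concurrent invocations for a Kinesis event source."""
    return shards * parallelization_factor


def in_flight_invocations(records_per_second, batch_size, duration_ms):
    """Rough estimate: invocations per second times seconds per invocation."""
    invocations_per_second = records_per_second / batch_size
    return math.ceil(invocations_per_second * duration_ms / 1000)


# 10 shards with one batch in flight per shard.
bound = shard_bound_concurrency(10)
# Batch size 10 at 10,000 records/s and 200 ms per invocation.
small_batches = in_flight_invocations(10_000, 10, 200)
```

With these inputs, `bound` is 10 and `small_batches` is 200, which matches the figures used to compare Options D and C.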

297

Multi-Select · easy

A data engineer is designing a data pipeline that uses Amazon Kinesis Data Streams to ingest sensor data. The data must be processed in real-time, and the results must be stored in Amazon DynamoDB. Which TWO AWS services can be used together to achieve this? (Choose TWO.)

Select 2 answers

A. Amazon Athena

B. Amazon Kinesis Data Analytics

C. AWS Glue

D. Amazon S3

E. AWS Lambda

Answers: B, E

Kinesis Data Analytics can process streaming data in real time.

Why this answer

Option B is correct because Kinesis Data Analytics can process data in real time using SQL or Flink. Option E is correct because Lambda can consume from Kinesis and write to DynamoDB.

Option A is wrong because Athena is for querying, not processing. Option C is wrong because Glue is for batch ETL, not real-time processing. Option D is wrong because S3 is not real-time storage.

298

Multi-Select · hard

A data engineering team is migrating on-premises Hadoop workloads to AWS. The workloads include batch processing using Apache Spark and interactive SQL queries. The data is stored in HDFS. Which TWO AWS services should be used to replace HDFS and provide a scalable, durable storage layer? (Choose TWO.)

Select 2 answers

A. Amazon EMR with EMRFS

B. Amazon S3

C. Amazon EBS

D. Amazon FSx for Lustre

E. Amazon RDS

Answers: A, B

EMRFS allows EMR to use S3 as a replacement for HDFS.

Why this answer

Amazon S3 is the primary storage for data lakes, replacing HDFS. Amazon EMR with EMRFS can access S3 as if it were HDFS.

Option C (Amazon EBS) is block storage, not scalable. Option D (Amazon FSx for Lustre) is a high-performance file system but not a direct HDFS replacement for durability. Option E (Amazon RDS) is a relational database.

299

Multi-Select · easy

Which TWO AWS services can be used to schedule and orchestrate a data pipeline that includes multiple steps such as data extraction, transformation, and loading? (Choose 2.)

Select 2 answers

A. AWS Lambda

B. AWS Glue

C. Amazon Managed Workflows for Apache Airflow (MWAA)

D. AWS Step Functions

E. Amazon CloudWatch Events

Answers: C, D

MWAA is a managed orchestration service for data pipelines.

Why this answer

AWS Step Functions is a serverless orchestration service that can coordinate multiple AWS services into workflows. Amazon Managed Workflows for Apache Airflow (MWAA) is a managed version of Apache Airflow for orchestrating pipelines.

Option A is wrong because Lambda is a compute service, not an orchestrator. Option B is wrong because Glue is an ETL service, not an orchestrator, though it can be part of a pipeline. Option E is wrong because CloudWatch Events is for event scheduling, not complex orchestration.

300

MCQ · medium

A data engineering team uses Amazon EMR with Spark to transform large datasets in S3. The team notices that the Spark jobs on the EMR cluster are failing with out-of-memory errors. The cluster uses instance types with moderate memory. Which configuration change would MOST effectively reduce memory pressure without increasing cost?

A. Use a larger EMR cluster with more instances of the same type.

B. Increase the number of executor cores to improve parallelism.

C. Switch from Java serialization to Kryo serialization.

D. Enable shuffle compression and set spark.shuffle.compress to true.

Answer: D

Compression reduces the amount of data stored in memory during shuffles.

Why this answer

Option D is correct because enabling compression reduces the amount of data shuffled over the network and stored in memory, thus reducing memory usage.

Option A is wrong because using more instances would increase cost. Option B is wrong because increasing the number of executor cores may increase parallelism but does not directly reduce memory per task. Option C is wrong because Kryo serialization reduces object size but is not as effective as compression for shuffle data.
