
Question 366 of 507

ML Solution Monitoring, Maintenance and Security · medium · Multiple Choice · Objective-mapped

Quick Answer

The answer is to configure Application Auto Scaling for each variant with a target tracking scaling policy based on the number of concurrent requests per instance. The core issue is that Variant A is overwhelmed by concurrent requests while both variants show similar CPU load, so a CPU-based policy would not scale Variant A ahead of Variant B. A target tracking policy on a concurrency metric (for SageMaker production variants, a metric such as ConcurrentRequestsPerModel) lets each variant adjust its own instance count to hold a target concurrency level, which addresses the queue buildup behind the latency spike without manual intervention. InvocationsPerInstance is a related but different metric: it counts invocations per minute per instance, not requests in flight. On the AWS Certified Machine Learning Engineer Associate MLA-C01 exam, this scenario tests your understanding that auto scaling for SageMaker production variants must use the metric that matches the bottleneck: here concurrency, not CPU or latency. A common trap is choosing a p99 latency alarm, but that reacts to symptoms after latency has already spiked, whereas concurrent requests per instance is a leading indicator. Remember the mnemonic "Concurrency cures congestion": scale on what's piling up, not on what's slowing down.

MLA-C01 Practice Question: ML Solution Monitoring, Maintenance and Security

This MLA-C01 practice question tests your understanding of ML Solution Monitoring, Maintenance and Security. Read the scenario carefully: the correct answer depends on the metrics it describes, not on general recall alone. After answering, compare your reasoning against the explanation and wrong-answer breakdown below to reinforce the concept and understand why each distractor is designed to mislead on exam day.

A media company uses a SageMaker endpoint to serve a model that predicts video engagement. The endpoint has two production variants: Variant A (ml.c5.large) for regular traffic and Variant B (ml.c5.xlarge) for burst traffic. Traffic is split by weighted routing (90% to A, 10% to B). Recently, during peak hours, Variant A's latency has increased, causing many requests to time out. The metrics show that both variants are under similar CPU load, but the number of concurrent requests to Variant A is very high. The team wants to ensure that burst traffic is handled properly without manual intervention. What should they do?

Question 1 · medium · multiple choice

A
Increase the traffic weight to Variant B to 70% and reduce Variant A to 30%.
Why wrong: This doesn't solve the root cause of Variant A being overloaded; it might overload B.
B
Configure Application Auto Scaling for each variant with a target tracking scaling policy based on the number of concurrent requests per instance.
Why correct: Auto scaling adjusts each variant's capacity based on its load, preventing timeouts.
C
Set a CloudWatch alarm on Variant A's p99 latency and trigger a step scaling policy to add instances.
Why wrong: Step scaling based on alarm may be reactive; target tracking is more proactive.
D
Create a separate endpoint for burst traffic and route peak traffic to it via DNS.
Why wrong: Separate endpoint adds complexity; scaling within the same endpoint is preferred.

Correct answer & explanation

✓

Configure Application Auto Scaling for each variant with a target tracking scaling policy based on the number of concurrent requests per instance.

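Option B is correct because a target tracking policy on the number of concurrent requests makes each variant scale on its own load, which is the metric that is actually elevated for Variant A. InvocationsPerInstance is a related request-rate metric (invocations per minute per instance), but it does not measure requests in flight. Option A (changing the weights) moves the load without adding capacity. Option C (p99 latency alarm) reacts only after latency has already spiked. Option D (separate endpoint) adds routing complexity that scaling within the existing endpoint avoids.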
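As an illustration of option B, here is a minimal sketch of the two Application Auto Scaling requests a variant would need, written with boto3. The endpoint name, variant names, capacity limits, cooldowns, and target value are all hypothetical, and the predefined metric name is an assumption to verify against the current Application Auto Scaling documentation. The request builders are pure functions so they can be inspected without AWS credentials.

```python
from typing import Any, Dict, Tuple

# Assumption: predefined concurrency metric for a SageMaker production variant.
CONCURRENCY_METRIC = "SageMakerVariantConcurrentRequestsPerModelHighResolution"
DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def variant_resource_id(endpoint_name: str, variant_name: str) -> str:
    """Resource ID format used for SageMaker variants in Application Auto Scaling."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def scalable_target_request(endpoint_name: str, variant_name: str,
                            min_capacity: int, max_capacity: int) -> Dict[str, Any]:
    """Keyword arguments for register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": variant_resource_id(endpoint_name, variant_name),
        "ScalableDimension": DIMENSION,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def target_tracking_policy_request(endpoint_name: str, variant_name: str,
                                   target_concurrency: float) -> Dict[str, Any]:
    """Keyword arguments for put_scaling_policy (target tracking)."""
    return {
        "PolicyName": f"{variant_name}-concurrency-target-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": variant_resource_id(endpoint_name, variant_name),
        "ScalableDimension": DIMENSION,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_concurrency),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": CONCURRENCY_METRIC,
            },
            "ScaleOutCooldown": 60,   # illustrative values, not recommendations
            "ScaleInCooldown": 300,
        },
    }


def apply_autoscaling(endpoint_name: str,
                      variants: Dict[str, Tuple[int, int, float]]) -> None:
    """Register each variant and attach its own policy. Requires AWS credentials."""
    import boto3  # imported lazily so the builders above work without boto3

    client = boto3.client("application-autoscaling")
    for name, (min_cap, max_cap, target) in variants.items():
        client.register_scalable_target(
            **scalable_target_request(endpoint_name, name, min_cap, max_cap))
        client.put_scaling_policy(
            **target_tracking_policy_request(endpoint_name, name, target))
```

A call such as `apply_autoscaling("video-engagement", {"VariantA": (2, 10, 5.0), "VariantB": (1, 6, 5.0)})` would give each variant its own policy, which is the point of the answer: the variants scale independently.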

Key principle: Answer the scenario, not the keyword: identify the specific constraint before choosing the most familiar-sounding option.

Answer analysis

Option-by-option breakdown

For each option: why learners choose it and why it is or isn't the right answer here.

✗
Increase the traffic weight to Variant B to 70% and reduce Variant A to 30%.
Why it's wrong here
This doesn't solve the root cause of Variant A being overloaded; it might overload B.
✓
Configure Application Auto Scaling for each variant with a target tracking scaling policy based on the number of concurrent requests per instance.
Why this is correct
Autoscaling adjusts capacity based on load, preventing timeouts.
Related concept
Read the scenario before looking for a memorised answer.
✗
Set a CloudWatch alarm on Variant A's p99 latency and trigger a step scaling policy to add instances.
Why it's wrong here
Step scaling based on alarm may be reactive; target tracking is more proactive.
✗
Create a separate endpoint for burst traffic and route peak traffic to it via DNS.
Why it's wrong here
Separate endpoint adds complexity; scaling within the same endpoint is preferred.
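To see why changing the weights (the first option) only moves the problem, a rough estimate with Little's law helps: requests in flight equal arrival rate times time in the system. The sketch below uses hypothetical numbers (400 requests per second, 200 ms latency, two instances per variant) that are not given in the scenario, and it holds latency fixed, whereas in practice latency rises as an instance saturates.

```python
from typing import Dict


def concurrency_per_instance(total_rps: float, weight_pct: int,
                             latency_ms: int, instances: int) -> float:
    """Little's law estimate of in-flight requests per instance for one variant."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    variant_rps = total_rps * weight_pct / 100
    in_flight = variant_rps * latency_ms / 1000
    return in_flight / instances


def compare_weighting(total_rps: float, latency_ms: int, weight_a_pct: int,
                      instances_a: int, instances_b: int) -> Dict[str, float]:
    """Per-instance concurrency for variants A and B under a given weight split."""
    return {
        "A": concurrency_per_instance(total_rps, weight_a_pct, latency_ms, instances_a),
        "B": concurrency_per_instance(total_rps, 100 - weight_a_pct, latency_ms, instances_b),
    }


if __name__ == "__main__":
    # Hypothetical load: 400 rps, 200 ms latency, two instances per variant.
    for split in (90, 30):
        result = compare_weighting(400, 200, split, 2, 2)
        print(f"A={split}%: A={result['A']:.0f}, B={result['B']:.0f} in flight per instance")
```

Under these assumptions the 90/10 split concentrates the in-flight requests on Variant A, and the 30/70 split shifts most of them to Variant B instead. Total capacity is unchanged in both cases, which is why the scaling option is preferred.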

Common exam traps

Common exam trap: answer the scenario, not the keyword

Many certification questions include familiar terms but test a specific constraint. Read the exact wording before choosing an answer that is generally true but wrong for this case.

Detailed technical explanation

How to think about this question

This question should be treated as a scenario, not a definition check. Identify the problem, the constraint and the best action. Then compare each option against those facts.

Key Concepts to Remember

Read the scenario before looking for a memorised answer.
Find the constraint that changes the correct option.
Eliminate answers that are true in general but not in this case.
Use explanations to understand the rule behind the answer.

Exam Day Tips

Underline the problem statement mentally.
Watch for words such as best, first, most likely and least administrative effort.
Review why wrong options are wrong, not only why the correct option is correct.

Real-world example

How this comes up in practice

An e-commerce site experiences heavy traffic on Black Friday and near-zero traffic during off-peak weeks. Rather than provisioning permanent large VMs, the team uses auto-scaling groups that add capacity automatically under load and reduce it overnight. Questions like this test whether you understand elasticity, availability zones, and cloud compute scaling patterns.
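The add-under-load, reduce-overnight behaviour described above can be approximated with a simple proportional rule: scale the instance count by the ratio of the observed per-instance metric to its target, then clamp to the configured limits. This is a simplified model for intuition only, not the service's actual algorithm, and all numbers are hypothetical.

```python
import math


def desired_instances(current: int, observed_per_instance: float,
                      target_per_instance: float,
                      min_capacity: int, max_capacity: int) -> int:
    """Proportional estimate of the instance count needed to return to target."""
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    if min_capacity > max_capacity:
        raise ValueError("min_capacity must not exceed max_capacity")
    raw = math.ceil(current * observed_per_instance / target_per_instance)
    # Clamp to the limits registered for the scalable target.
    return max(min_capacity, min(max_capacity, raw))


if __name__ == "__main__":
    # Peak: 2 instances each see 36 in-flight requests against a target of 12.
    print(desired_instances(2, 36, 12, min_capacity=1, max_capacity=10))
    # Overnight: 6 instances each see 2 in-flight requests against the same target.
    print(desired_instances(6, 2, 12, min_capacity=1, max_capacity=10))
```

The minimum and maximum capacity act as guard rails: the first keeps a baseline available during quiet periods, and the second caps spend during spikes.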

What to study next

Got this wrong? Here's your next step.

Identify which MLA-C01 exam domain this question belongs to, then review the specific concept being tested. Practise related questions in that domain and focus on understanding why each wrong answer is tempting — not just why the correct answer is right.

FAQ

Questions learners often ask

What does this MLA-C01 question test?

This question tests the ML Solution Monitoring, Maintenance and Security domain: choosing the auto scaling metric that matches the bottleneck described in the scenario.

What is the correct answer to this question?

The correct answer is: Configure Application Auto Scaling for each variant with a target tracking scaling policy based on the number of concurrent requests per instance. Option B is correct because a target tracking policy on concurrent requests makes each variant scale on its own load. Option A (changing the weights) moves the load without adding capacity. Option C (p99 latency alarm) reacts only after latency has already spiked. Option D (separate endpoint) is not necessary.

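What should I do if I get this MLA-C01 question wrong?

Review the explanation and wrong-answer breakdown above, identify the exam domain being tested, and practise related questions in that domain.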

What is the key concept behind this question?

Read the scenario before looking for a memorised answer.

About these practice questions

Courseiva creates original exam-style practice questions with explanations and wrong-answer analysis. It does not publish real exam questions, exam dumps, or protected exam content.

Same concept, more angles

2 more ways this is tested on MLA-C01

These questions test the same concept from different angles. Work through them to make sure you can recognise it however the exam phrases it.

Variation 1. A company's SageMaker endpoint is experiencing increased latency during peak hours. The endpoint uses a single ml.m5.large instance. The deployment is critical and must maintain low latency. Which action is MOST effective to reduce latency without sacrificing cost efficiency?

medium

A. Deploy multiple variants with A/B testing
B. Use Elastic Inference to attach an accelerator
C. Switch to an ml.c5.large instance
✓ D. Add an auto-scaling policy based on request count
E. Enable SageMaker Model Monitor

Why D: Adding auto-scaling based on request count allows the endpoint to handle spikes without over-provisioning, balancing cost and latency.
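For a request-count policy like the one in option D, the target is usually derived from a load test rather than guessed. A commonly cited rule of thumb is to take the peak requests per second one instance can sustain, multiply by a safety factor, and convert to a per-minute value, since the per-instance invocation metric is reported per minute. The sketch below assumes a hypothetical load-test result of 20 requests per second and a safety factor of 0.5; both are illustrative.

```python
def invocations_per_instance_target(max_rps_per_instance: float,
                                    safety_factor: float = 0.5) -> float:
    """Per-minute invocation target for one instance, with headroom.

    max_rps_per_instance: peak requests per second a single instance sustained
        in a load test (hypothetical input).
    safety_factor: fraction of that peak to aim for, leaving room for spikes.
    """
    if max_rps_per_instance <= 0:
        raise ValueError("max_rps_per_instance must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return max_rps_per_instance * safety_factor * 60


if __name__ == "__main__":
    # Hypothetical load test: one instance sustains 20 requests per second.
    print(invocations_per_instance_target(20))
```

A lower safety factor scales out earlier and costs more; a higher one runs closer to the measured limit and leaves less room for sudden spikes.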

Variation 2. A company's SageMaker real-time endpoint is experiencing high latency under load. The CloudWatch metrics show that the ModelLatency is acceptable, but the OverheadLatency is spiking. What is the most likely cause?

hard

A. The request payload size is too large.
B. The SageMaker endpoint is not in the same VPC as the client.
✓ C. The endpoint is under-provisioned with insufficient instance count.
D. The model inference code is inefficient.

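Why C: OverheadLatency is the time SageMaker adds on top of ModelLatency, measured from when SageMaker receives a request until it returns the response. When the endpoint has too few instances, requests wait for capacity and this overhead rises under load even though the model itself responds quickly. Option A: a large payload can add overhead, but it would do so at any traffic level, so it does not explain a spike that appears only under load. Option B: network time between the client and the endpoint falls outside OverheadLatency, which starts when SageMaker receives the request. Option D: inefficient inference code would raise ModelLatency, which the scenario says is acceptable.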
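To compare the two latency metrics for a variant, one option is to pull both from CloudWatch with the same dimensions and time window. The sketch below builds the request for boto3's `get_metric_statistics`. The endpoint and variant names are hypothetical, and the choice of a 60-second period with Average and Maximum statistics is an assumption. SageMaker reports both metrics in microseconds.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def latency_metric_query(endpoint_name: str, variant_name: str, metric_name: str,
                         start: datetime, end: datetime,
                         period_s: int = 60) -> Dict[str, Any]:
    """Keyword arguments for CloudWatch get_metric_statistics."""
    if metric_name not in ("ModelLatency", "OverheadLatency"):
        raise ValueError("metric_name must be ModelLatency or OverheadLatency")
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": period_s,
        "Statistics": ["Average", "Maximum"],
    }


def fetch_latency(endpoint_name: str, variant_name: str,
                  minutes: int = 60) -> Dict[str, List[Dict[str, Any]]]:
    """Fetch datapoints for both metrics. Requires AWS credentials."""
    import boto3  # imported lazily so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    results = {}
    for metric in ("ModelLatency", "OverheadLatency"):
        query = latency_metric_query(endpoint_name, variant_name, metric, start, end)
        results[metric] = cloudwatch.get_metric_statistics(**query)["Datapoints"]
    return results
```

If OverheadLatency climbs during peak windows while ModelLatency stays flat, that pattern matches the under-provisioning diagnosis in option C.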

Last reviewed: Jun 23, 2026

This MLA-C01 practice question is part of Courseiva's free Amazon Web Services certification practice question bank. Courseiva provides original exam-style practice questions with explanations, topic-based practice, mock exams, readiness tracking, and study analytics to help learners prepare for the MLA-C01 exam.