DVA-C02 · Chapter 95 of 101 · Objective 2.4

Rate Limiting and Throttling for APIs

This chapter covers how rate limiting and throttling control access to backend resources on AWS, mainly through API Gateway and Lambda. It explains why throttling is critical for protecting backends and ensuring fair usage, and how the topic appears in the DVA-C02 exam under Objective 2.4 (Security). By this guide's estimate, roughly 5-10% of exam questions touch on throttling concepts, usually as scenario-based questions about API design or Lambda concurrency limits. You will learn the token bucket algorithm, AWS service limits, configuration options, and common exam traps.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security


Rate Limiting as a Toll Booth with Token Buckets

How does a toll booth on a bridge control how many cars can cross per minute? Picture a token bucket at the booth: the bucket holds a fixed number of tokens (say 100), and each car must take a token to cross. The bucket is refilled at a steady rate (e.g., 10 tokens per second). If the bucket is empty, the next car is turned away and has to come back later. This mirrors the token bucket algorithm that API Gateway uses for rate limiting. The bucket size (burst capacity) allows short bursts of traffic (e.g., 100 cars arriving at once), while the refill rate (sustained rate) controls long-term throughput (e.g., 10 cars per second). A car that arrives at an empty bucket is like an API client that receives a 429 Too Many Requests response. The bridge also has an overall limit (like AWS account-level throttling) and per-lane limits (like per-method or per-API-key throttling). Limits can be stacked: a regional limit (all cars entering the city) and a per-destination limit (cars to a specific district). The analogy captures the mechanics: the bucket size (maximum burst), the refill rate (sustained rate), and the wait before trying again (client-side retry with backoff). Lambda's concurrency limit works differently: it is closer to a bridge that can hold only a fixed number of cars at one time, no matter how fast they arrive.

How It Actually Works

What is Rate Limiting and Throttling?

Rate limiting restricts the number of requests a client can make to an API within a given time window. Throttling is the mechanism that enforces rate limits by rejecting requests that exceed the limit, typically returning HTTP 429 (Too Many Requests). In AWS, throttling is implemented at multiple layers: API Gateway, Lambda, and AWS account-level service quotas.

Why Throttling Exists

Throttling prevents a single client from overwhelming backend resources, protects against DDoS attacks, and ensures fair distribution of capacity. It also helps manage costs by limiting resource consumption. Without throttling, a sudden spike in traffic could degrade performance for all users or cause backend failures.

The Token Bucket Algorithm

API Gateway uses the token bucket algorithm for throttling; Lambda's concurrency limit, covered below, is a counter of simultaneous executions rather than a bucket. A token bucket has two parameters:
- Bucket size (burst limit): The maximum number of requests that can be processed in a burst (i.e., the capacity of the bucket).
- Refill rate (sustained rate): The rate at which tokens are added back to the bucket per second.

When a request arrives, it attempts to take a token. If a token is available, the request is forwarded. If not, the request is throttled (429). The bucket fills at the refill rate, allowing bursts up to the bucket size, then smoothing out to the sustained rate.
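The behavior described above can be sketched in a few lines of Python. This is an idealized model with continuous refill, not AWS's actual implementation; the class name `TokenBucket` and its methods are illustrative only, and the numbers reuse the toll booth example (burst 100, rate 10 per second).

```python
class TokenBucket:
    """Idealized token bucket: `burst` is the capacity, `rate` is tokens added per second."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0             # time of the last refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` gets a token."""
        # Refill for the elapsed time, never exceeding the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # forwarded to the backend
        return False      # throttled: the caller would see HTTP 429


bucket = TokenBucket(rate=10, burst=100)
# 150 requests arrive at the same instant: only the 100 tokens in the bucket are served.
accepted_at_start = sum(bucket.allow(now=0.0) for _ in range(150))
# One second later, 10 tokens have been refilled.
accepted_after_1s = sum(bucket.allow(now=1.0) for _ in range(150))
print(accepted_at_start, accepted_after_1s)  # 100 10
```

The burst is spent once; after that, throughput is bounded by the refill rate until traffic drops and the bucket fills again.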

API Gateway Throttling

API Gateway applies throttling at several levels:
- Account-level limits: Per AWS account, across all APIs in a Region. Default: 10,000 requests per second (rps) with a burst of 5,000 requests.
- Stage-level and method-level limits: Set in the stage settings and optionally overridden per method. They cannot exceed the account-level limit.
- Per-client limits: Set with usage plans and API keys.

Usage plans allow you to set throttling limits per API key. You define:
- Rate: Number of requests per second (steady-state).
- Burst: The token bucket capacity, i.e. the largest spike of requests that can be served at once.

If an API key exceeds its limits, further requests from that key are throttled. The burst allows short spikes, but sustained throughput is capped at the rate.

Lambda Throttling

Lambda throttling is governed by three concurrency settings:
- Account concurrency limit: The maximum number of function instances running concurrently. Default: 1,000 per account per Region. You can request a quota increase.
- Reserved concurrency: Sets a specific concurrency limit for a function, ensuring it doesn't consume all account concurrency.
- Provisioned concurrency: Pre-warms a number of instances to reduce cold starts, but still counts towards concurrency limits.

Every in-flight invocation counts towards concurrency. When the applicable limit is reached, new invocations are throttled: the Lambda Invoke API returns a TooManyRequestsException (HTTP 429).
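To relate a request rate to a concurrency limit, a common rule of thumb is concurrency = requests per second × average duration in seconds. The sketch below applies it to a few made-up traffic figures and compares the result with the 1,000 default mentioned above; the function name and the sample numbers are illustrative assumptions.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Estimate concurrent executions: arrival rate times time spent per request."""
    return math.ceil(requests_per_second * avg_duration_seconds)


ACCOUNT_LIMIT = 1000  # default regional concurrency quota cited in the text

# Hypothetical workloads: (requests per second, average duration in seconds)
for rps, duration in [(200, 0.5), (2000, 0.25), (5000, 0.5)]:
    needed = required_concurrency(rps, duration)
    status = "ok" if needed <= ACCOUNT_LIMIT else "would throttle"
    print(f"{rps} rps x {duration}s -> {needed} concurrent ({status})")
```

The same request rate can be safe or throttled depending on how long each invocation runs, which is why rate limits and concurrency limits have to be sized together.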

How Throttling Works Internally

Client sends a request to API Gateway.

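API Gateway checks the applicable token buckets: the usage plan bucket for the request's API key (if the method requires a key), plus the stage, method, and account-level limits.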

If a token is available, the request proceeds to the backend (e.g., Lambda).

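If the bucket is empty, API Gateway returns HTTP 429 (Too Many Requests) and does not call the backend. Clients should back off and retry, honoring a Retry-After header when one is present.

For Lambda-backed APIs, Lambda's concurrency limit is a separate check. If Lambda throttles the invocation, the client receives a server-side error such as 503, even though the API Gateway limits were not exceeded.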

Key Default Values and Timers

API Gateway account-level limit: 10,000 rps, burst 5,000.

API Gateway per-API default: same as account-level, but can be lowered via usage plans.

Lambda concurrency default: 1,000 per region.

Lambda reserved concurrency: 0 to account limit.

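Lambda burst concurrency: the initial scaling burst has historically varied by Region (500 to 3,000 concurrent executions), with further scaling applied gradually after that.

Retry-After header: not guaranteed on 429 responses; honor the value when it is present, otherwise use exponential backoff.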

Configuration and Verification

To configure throttling in API Gateway:

Create a usage plan and associate it with an API stage.

Set rate and burst limits.

Optionally, attach API keys to clients.
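The same configuration can be expressed with boto3. The sketch below only builds the keyword arguments for the `create_usage_plan` call and keeps the network call in a separate function, so the builder runs without AWS credentials. The helper names are illustrative, and `abc123` / `prod` are placeholder values.

```python
def usage_plan_params(name: str, api_id: str, stage: str,
                      rate_limit: float, burst_limit: int) -> dict:
    """Build keyword arguments for boto3's apigateway create_usage_plan call."""
    if rate_limit <= 0 or burst_limit <= 0:
        raise ValueError("rate_limit and burst_limit must be positive")
    return {
        "name": name,
        "apiStages": [{"apiId": api_id, "stage": stage}],
        "throttle": {"rateLimit": rate_limit, "burstLimit": burst_limit},
    }


def create_usage_plan(params: dict) -> dict:
    """Send the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder above works without boto3
    return boto3.client("apigateway").create_usage_plan(**params)


# Placeholder API ID and stage name.
params = usage_plan_params("MyUsagePlan", "abc123", "prod", rate_limit=1000, burst_limit=500)
print(params["throttle"])
```

An API key is attached to the plan in a separate step (boto3's `create_usage_plan_key`), and the throttle only applies to a client if the method requires an API key.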

To verify throttling:

Test with curl or Postman and check for 429 responses.

Use CloudWatch metrics in the AWS/ApiGateway namespace (e.g., Count, 4XXError); throttled requests are counted as 4XX errors.

Enable API Gateway access logging to see which requests returned 429.

Interaction with Related Technologies

AWS WAF: Can be used to implement IP-based rate limiting before API Gateway.

CloudFront: Can also throttle at the CDN level using Lambda@Edge or AWS WAF.

Amazon Cognito: Can throttle authentication requests.

AWS Shield: Provides DDoS protection but not application-level throttling.

Exam Relevance

For DVA-C02, you must know:

The difference between rate limiting (requests per second) and concurrency (simultaneous executions).

How to configure throttling via usage plans.

The default limits and how to request increases.

How throttling affects Lambda: reserved vs. provisioned concurrency.

Common error codes: 429 (throttling), 503 (service unavailable).

The token bucket algorithm conceptually.

Code Example: Creating a Usage Plan

aws apigateway create-usage-plan \
  --name "MyUsagePlan" \
  --throttle rateLimit=1000,burstLimit=500 \
  --api-stages apiId=abc123,stage=prod

Monitoring Throttling

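# Throttled requests return 429, which API Gateway counts in 4XXError.
# "MyApi" is a placeholder API name.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 4XXError \
  --dimensions Name=ApiName,Value=MyApi \
  --start-time 2023-01-01T00:00:00Z \
  --end-time 2023-01-01T01:00:00Z \
  --period 300 \
  --statistics Sum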

Walk-Through

Client sends API request

The client (e.g., mobile app) sends an HTTP request to the API Gateway endpoint. If the method requires an API key, the request carries it in the `x-api-key` header, and API Gateway uses that key to find the client's usage plan. At this point, no throttling check has occurred yet.

API Gateway checks token bucket

API Gateway looks up the usage plan associated with the request's API key and checks that key's token bucket; stage, method, and account-level limits apply as well, including to requests without a key. If the bucket has at least one token, API Gateway consumes one and forwards the request. The bucket size is the burst limit, and tokens are refilled at the rate limit per second. The check adds negligible latency.

Token available: forward to backend

If a token is available, the request is forwarded to the integration endpoint (e.g., Lambda, HTTP). API Gateway records the request in CloudWatch logs and metrics. The backend processes the request normally. The token bucket decrements by one.

No token: return 429 Too Many Requests

If the token bucket is empty, API Gateway immediately returns an HTTP 429 (Too Many Requests) response. The request is not forwarded to the backend, saving resources. The client should wait before retrying, using exponential backoff and honoring a `Retry-After` header if the response includes one. The rejected request is counted in the `4XXError` metric.

Lambda concurrency check (if applicable)

If the request is forwarded to a Lambda function, Lambda checks its concurrency limit: the function's reserved concurrency if set, otherwise the unreserved account-level pool. If the function is at its limit, Lambda rejects the invocation with a `TooManyRequestsException` (429). API Gateway does not retry the invocation on its own, so the client receives a server-side error such as 503. This is a separate throttling layer from the API Gateway token bucket.
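The concurrency check in this last step behaves like a counter of in-flight executions rather than a refilling bucket. A minimal sketch of that idea follows; `ConcurrencyGate` is an illustrative name, not a Lambda API, and the limit of 2 stands in for a reserved concurrency value.

```python
class ConcurrencyGate:
    """Counter-based limit: caps how many executions are in flight at once."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.in_flight = 0

    def try_start(self) -> bool:
        """Start an execution if a slot is free; otherwise report a throttle."""
        if self.in_flight >= self.limit:
            return False  # analogous to Lambda's TooManyRequestsException
        self.in_flight += 1
        return True

    def finish(self) -> None:
        """Release the slot when an execution completes."""
        self.in_flight = max(0, self.in_flight - 1)


gate = ConcurrencyGate(limit=2)                 # e.g. reserved concurrency of 2
results = [gate.try_start() for _ in range(3)]  # three overlapping invocations
gate.finish()                                   # one execution completes
results.append(gate.try_start())                # a slot is free again
print(results)  # [True, True, False, True]
```

Unlike the token bucket, capacity here comes back only when an execution finishes, so slow functions are throttled at much lower request rates than fast ones.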

What This Looks Like on the Job

In a real-world enterprise, rate limiting is critical for a B2B SaaS platform that exposes APIs to thousands of customers. For example, a payment processing API must ensure that no single merchant can overwhelm the system during peak sales events like Black Friday. The platform uses API Gateway with usage plans: each merchant gets an API key with a rate limit of 1,000 requests per second and a burst of 2,000. This allows short spikes but smooths out sustained traffic. The backend is a set of Lambda functions, each with a reserved concurrency of 500 so that no single function can consume all of the account's concurrency.

During a flash sale, one merchant's traffic spikes to 5,000 rps. The token bucket empties quickly, and the merchant receives 429 responses. The client SDK retries with exponential backoff and jitter. CloudWatch alarms fire when the number of 429 responses (tracked through the 4XXError metric or a metric filter on access logs) exceeds a threshold, alerting the operations team.

Misconfiguration can happen: if the rate limit is set too low (e.g., 10 rps), legitimate traffic gets throttled, causing customer complaints. Conversely, if the burst is set too high (e.g., 10,000), a single client can use up shared capacity and cause a denial of service for others.

Another scenario: a microservices architecture where internal services communicate via API Gateway. If an internal service's Lambda function has insufficient reserved concurrency, it becomes a bottleneck. The operations team must monitor API Gateway throttling and Lambda concurrency together, using CloudWatch dashboards to correlate 429 responses with Lambda's Throttles and ConcurrentExecutions metrics. They also implement circuit breakers in the client to avoid retry storms.

In production, the team configures a usage plan with a rate of 5,000 rps and a burst of 10,000 for premium customers, and 100 rps with a burst of 500 for basic customers. The account-level limit is a service quota rather than a setting, so they request an increase to 50,000 rps to give the whole Region enough headroom. The key lesson: always test throttling behavior with load testing tools like Artillery or Locust before going live.
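The client-side retry behavior mentioned in this scenario can be sketched as follows. This is one common variant ("full jitter"), not a specific SDK's implementation; `send` is a hypothetical callable that performs the request and returns an HTTP status code, and the base and cap values are arbitrary.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delays(max_retries: int, base: float = 0.1, cap: float = 5.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter delays: retry n waits a random time in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** attempt)) for attempt in range(max_retries)]


def call_with_retries(send: Callable[[], int], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it returns a status other than 429 or retries run out."""
    status = send()
    for delay in backoff_delays(max_retries):
        if status != 429:
            break
        sleep(delay)      # wait before the next attempt
        status = send()
    return status


# Simulated server: throttles twice, then succeeds. Sleeping is stubbed out for the demo.
responses = iter([429, 429, 200])
final_status = call_with_retries(lambda: next(responses), sleep=lambda seconds: None)
print(final_status)  # 200
```

Randomizing the delay spreads retries from many throttled clients over time, which is what prevents the retry storms described above.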

How DVA-C02 Actually Tests This

The DVA-C02 exam tests rate limiting and throttling under Objective 2.4 (Security) and also in Domain 1 (Development with AWS Services). Expect 2-4 questions directly on this topic. Key areas:

Token bucket algorithm: Know that burst limit is the bucket size and rate is the refill rate. A common wrong answer is confusing burst with sustained rate. Exam question: "An API has a rate of 1000 and burst of 2000. Which statement is true?" Wrong answer: "It allows 1000 requests per second, and any request over 1000 is rejected." Correct: "It allows up to 2000 requests in a burst, then settles to 1000 per second."

Default limits: Memorize: API Gateway default 10,000 rps, burst 5,000. Lambda default concurrency 1,000. Exam may ask: "What is the default API Gateway throttling limit?" Wrong answers: 5000, 1000, unlimited.

Usage plans vs. API keys: Usage plans set throttling and quota limits; API keys identify clients. Common trap: thinking API keys alone enforce throttling. They don't; they only identify. Usage plans must be associated.

Lambda throttling: Reserved concurrency vs. provisioned concurrency. Reserved concurrency limits the function's maximum concurrency; provisioned concurrency pre-warms instances but still counts towards reserved. Exam question: "How to ensure a Lambda function never exceeds 100 concurrent executions?" Wrong answer: "Set provisioned concurrency to 100." Correct: "Set reserved concurrency to 100."

Error codes: 429 for throttling, 503 for service unavailable. Exam may ask: "What HTTP status code does API Gateway return when throttling?" Wrong answers: 500, 503, 400.

Retry behavior: A 429 means the client should retry later. Exam: "How should the client handle 429 responses?" Answer: Retry with exponential backoff and jitter, honoring a Retry-After header when the response includes one.

Edge cases: When a Lambda function behind API Gateway is throttled, the request fails even though API Gateway's own limits were not exceeded. Exam: "If Lambda returns a TooManyRequestsException, what does API Gateway do?" Answer: It does not retry the invocation on its own; the client receives an error response (503 Service Unavailable in the typical exam scenario) and should retry with backoff.

Eliminating wrong answers: If a question mentions "burst" and "rate", remember that burst is the bucket capacity (the largest spike served at once) and rate is the sustained throughput. If a question asks about per-client limits, think usage plans with API keys. If it asks about account-wide limits, think service quotas. Always identify the layer: API Gateway or Lambda.

Key Takeaways

API Gateway default throttling: 10,000 requests per second per account per region, burst 5,000.

Usage plans allow per-client throttling with rate and burst limits.

Lambda default concurrency: 1,000 per account per region.

Reserved concurrency sets a maximum concurrency for a Lambda function; provisioned concurrency pre-warms instances.

Token bucket: burst = bucket size, rate = refill rate per second.

HTTP 429 is returned when a request is throttled; clients should honor a Retry-After header when one is present.

API Gateway can also be throttled at the method level and stage level.

CloudWatch metrics: Count and 4XXError for API Gateway; Throttles and ConcurrentExecutions for Lambda.

To request higher limits, use Service Quotas or AWS Support.

Always implement exponential backoff in clients to handle 429 responses.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

API Gateway Throttling

Enforces rate limits at the API layer before backend invocation.

Uses token bucket algorithm with rate and burst parameters.

Returns HTTP 429 (Too Many Requests).

Configured via usage plans, stage settings, or service quotas.

Throttles per API key (usage plans), per stage or method, and per account; IP-based limits require AWS WAF.

Lambda Concurrency Throttling

Enforces concurrency limits at the function execution layer.

Uses a simple concurrency counter (not token bucket).

Returns Lambda TooManyRequestsException (429) or API Gateway 503.

Configured via reserved concurrency or account-level limit.

Throttles based on function invocation, regardless of client.

Watch Out for These

Mistake

Rate limiting and throttling are the same thing.

Correct

Rate limiting is the policy (e.g., 1000 requests per second), while throttling is the enforcement mechanism (e.g., returning 429 when the limit is exceeded). In AWS, they are often used interchangeably, but technically throttling is the action.

Mistake

API Gateway's default throttling limit is 10,000 requests per second per API.

Correct

The default is 10,000 rps per account per region, not per API. All APIs in a region share this limit unless you configure per-API limits via usage plans or request a service quota increase.

Mistake

Setting a usage plan with a rate of 1000 and burst of 500 allows 1000 requests per second plus an extra 500, so a client can sustain 1500 requests per second.

Correct

The burst is the bucket size; it does not add to the rate. The bucket starts full (500 tokens), so the client can send up to 500 requests at the same instant. After that, tokens refill at 1000 per second, and the client can only consume tokens that are available. In an idealized token bucket, the first second after an idle period could therefore admit up to 500 + 1000 = 1500 requests, but that is a one-time effect: every following second admits at most 1000. The maximum sustained throughput is 1000 rps, and the largest instantaneous spike is 500. The exam expects you to know that burst is the bucket capacity.
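The arithmetic is easy to check under an idealized token bucket that starts full and refills continuously (an assumption; AWS does not document its implementation at this level of detail). The helper name below is illustrative.

```python
def max_admitted(rate: int, burst: int, seconds: int) -> int:
    """Upper bound on requests admitted in `seconds`, starting from a full bucket."""
    return burst + rate * seconds


rate, burst = 1000, 500
print(max_admitted(rate, burst, 1))        # 1500: the burst is spent once
print(max_admitted(rate, burst, 60))       # 60500
print(max_admitted(rate, burst, 60) / 60)  # average per second over a minute
```

Over longer windows the one-time burst becomes negligible, and the average converges on the configured rate.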

Mistake

Lambda reserved concurrency guarantees that the function will always have that many instances available.

Correct

Reserved concurrency sets a limit on the maximum number of concurrent executions. It does not reserve instances; it only ensures that other functions cannot use that concurrency. Provisioned concurrency is needed to pre-warm instances.

Mistake

You can only throttle API Gateway using usage plans.

Correct

You can also throttle at the account level (service quotas), per-stage, per-method, and using AWS WAF. Usage plans are for per-client throttling with API keys.


Frequently Asked Questions

What is the difference between API Gateway rate limiting and Lambda concurrency limiting?

API Gateway rate limiting controls the number of requests per second (RPS) entering the API, using a token bucket algorithm. It returns 429 when exceeded. Lambda concurrency limiting controls the number of function executions happening simultaneously. When exceeded, Lambda returns a TooManyRequestsException. The two are configured independently, so a request that passes API Gateway's limits can still fail if Lambda throttles the invocation. For the exam, remember that rate limiting is about request frequency, while concurrency is about parallel executions.

How do I set up per-client throttling in API Gateway?

Create a usage plan with the desired rate and burst limits and associate it with an API stage. Then create an API key and add it to the usage plan. Mark the relevant methods as requiring an API key, and have the client send the key in the x-api-key header. The usage plan then enforces throttling per API key. You can also set default throttling per stage or method. For the exam, know that usage plans are the primary mechanism for per-client throttling.

What happens when a Lambda function is throttled while behind API Gateway?

When Lambda returns a TooManyRequestsException, the invocation fails before the function code runs. API Gateway does not retry the invocation on its own; the client receives an error response (503 Service Unavailable in the typical exam scenario) and should retry with backoff. The exam may test that you need to handle Lambda throttling separately from API Gateway throttling.

Can I throttle API Gateway requests based on source IP?

Yes, but not directly via API Gateway's built-in throttling. You can use AWS WAF attached to the API Gateway to create IP-based rate limiting rules. Alternatively, you can use a custom Lambda authorizer to inspect IP and reject requests, but that adds latency. For the exam, know that WAF is the recommended service for IP-based throttling.

What is the default burst limit for API Gateway?

The default burst limit is 5,000 requests per account per region. This is the maximum number of requests that can be processed in a short burst before the rate limit kicks in. The token bucket starts full, so the first 5,000 requests in a region can be processed immediately. After that, the rate limit of 10,000 rps applies.

How do I monitor throttling in API Gateway?

Use CloudWatch metrics in the AWS/ApiGateway namespace: 4XXError (includes throttled 429 responses) and Count (total requests). To isolate 429s, enable access logging and filter on the response status. For Lambda, monitor the ConcurrentExecutions and Throttles metrics. Set up CloudWatch alarms to notify when throttling exceeds a threshold.
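As a rough illustration of isolating throttled requests from access logs, the sketch below counts responses by status code. It assumes a JSON access log format with a `status` field (for example, populated from `$context.status`); the field names and sample lines are made up for the example.

```python
import json
from collections import Counter


def status_counts(log_lines: list) -> Counter:
    """Count responses by HTTP status from JSON access-log lines."""
    counts: Counter = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(entry, dict) and "status" in entry:
            counts[str(entry["status"])] += 1
    return counts


# Hypothetical log lines in the assumed format.
sample = [
    '{"requestId": "a1", "status": "200"}',
    '{"requestId": "a2", "status": "429"}',
    '{"requestId": "a3", "status": "429"}',
    'not json',
]
counts = status_counts(sample)
throttled_share = counts["429"] / sum(counts.values())
print(counts["429"], round(throttled_share, 2))  # 2 0.67
```

In practice the same count is usually produced with a CloudWatch Logs metric filter rather than custom code, but the logic is the same: select status 429 and compare it with the total.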

What is the difference between reserved concurrency and provisioned concurrency?

Reserved concurrency sets a maximum limit on the number of concurrent executions for a function, preventing it from using more than that amount. Provisioned concurrency pre-initializes a specified number of instances, reducing cold starts, but it counts towards the reserved concurrency limit. Reserved concurrency is for limiting, while provisioned concurrency is for performance. Both are relevant for throttling.
