This chapter covers rate limiting and throttling for APIs, focusing on AWS services like API Gateway and Lambda. It explains why throttling is critical for protecting backend resources and ensuring fair usage, and how it appears in the DVA-C02 exam under Domain 2.4 (Security). Approximately 5-10% of exam questions touch on throttling concepts, often in scenario-based questions about API design or Lambda concurrency limits. You will learn the token bucket algorithm, AWS service limits, configuration options, and common exam traps.
Imagine a toll booth on a bridge that controls how many cars can cross per minute. The booth uses a token bucket system: a bucket holds a fixed number of tokens (say 100), and each car must take a token to cross. The bucket is refilled at a steady rate (e.g., 10 tokens per second). If the bucket is empty, no car can cross until new tokens arrive. This mirrors the token bucket algorithm that AWS API Gateway uses for rate limiting. The bucket size (burst capacity) allows short bursts of traffic (e.g., 100 cars in one second), while the refill rate (sustained rate) controls long-term throughput (e.g., 10 cars per second). A car that arrives when the bucket is empty is turned away, just as an API client receives a 429 Too Many Requests response.
The toll booth operator also has a per-account limit (like AWS account-level throttling) and a per-lane limit (like per-method or per-API key throttling). Multiple toll booths can be stacked: a regional limit (all cars entering the city) and a per-destination limit (cars to a specific district). The analogy captures the mechanics: the bucket size (maximum burst), the refill rate (sustained rate), and the wait before retrying. API Gateway throttles requests this way; Lambda's concurrency limits instead cap how many executions run at once.
What is Rate Limiting and Throttling?
Rate limiting restricts the number of requests a client can make to an API within a given time window. Throttling is the mechanism that enforces rate limits by rejecting requests that exceed the limit, typically returning HTTP 429 (Too Many Requests). In AWS, throttling is implemented at multiple layers: API Gateway, Lambda, and AWS account-level service quotas.
Why Throttling Exists
Throttling prevents a single client from overwhelming backend resources, protects against DDoS attacks, and ensures fair distribution of capacity. It also helps manage costs by limiting resource consumption. Without throttling, a sudden spike in traffic could degrade performance for all users or cause backend failures.
The Token Bucket Algorithm
AWS API Gateway uses the token bucket algorithm for request throttling (Lambda's concurrency limits, covered below, cap simultaneous executions instead). A token bucket has two parameters:
- Bucket size (burst limit): The maximum number of requests that can be processed in a burst (i.e., the capacity of the bucket).
- Refill rate (sustained rate): The rate at which tokens are added back to the bucket per second.
When a request arrives, it attempts to take a token. If a token is available, the request is forwarded. If not, the request is throttled (429). The bucket fills at the refill rate, allowing bursts up to the bucket size, then smoothing out to the sustained rate.
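The mechanics described above can be sketched in a few lines of Python. This is an illustrative model only, not API Gateway's actual implementation; the class name `TokenBucket`, the helper `count_allowed`, and the explicit `now` timestamps are assumptions made so the example is deterministic. The numbers reuse the toll booth values (capacity 100, refill 10 per second).

```python
class TokenBucket:
    """Illustrative token bucket (hypothetical; not AWS's internal code)."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate        # tokens added per second (sustained rate)
        self.burst = burst      # bucket capacity (burst limit)
        self.tokens = burst     # the bucket starts full
        self.last = 0.0         # time of the previous refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` gets a token."""
        # Refill for the elapsed time, never exceeding the capacity.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False            # a real gateway would answer with HTTP 429


def count_allowed(bucket: TokenBucket, arrivals: list) -> int:
    """Count how many of the given arrival times are admitted."""
    return sum(bucket.allow(t) for t in arrivals)


bucket = TokenBucket(rate=10, burst=100)
first = count_allowed(bucket, [0.0] * 150)   # 150 requests at t = 0
second = count_allowed(bucket, [1.0] * 20)   # 20 requests one second later
print(first, second)  # 100 10
```

The first wave is capped by the bucket size and the second by the tokens refilled in one second, which is the burst-versus-rate distinction in miniature.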
API Gateway Throttling
API Gateway has two tiers of throttling:
- Account-level limits: Per AWS account, across all APIs in a region. Default: 10,000 requests per second (rps) with a burst of 5,000 requests.
- Per-API limits: Can be set per stage or per method. Default: 10,000 rps with a burst of 5,000, but you can configure custom limits using usage plans and API keys.
Usage plans allow you to set throttling limits per API key. You define:
- Rate: Number of requests per second (steady-state).
- Burst: Maximum number of concurrent requests (burst capacity).
If an API key exceeds its rate, it gets throttled. The burst allows short spikes but overall throughput is limited.
Lambda Throttling
Lambda has three concurrency controls that affect throttling:
- Concurrency limit: The maximum number of function instances running concurrently. Default: 1,000 per account per region. You can request a quota increase.
- Reserved concurrency: Sets a specific concurrency limit for a function, capping it so that it cannot consume all account concurrency.
- Provisioned concurrency: Pre-warms a number of instances to reduce cold starts, but still counts towards concurrency limits.
Each running invocation counts towards concurrency. If a new invocation would exceed the limit, Lambda throttles it and returns a TooManyRequestsException (HTTP 429).
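Concurrency limiting differs from rate limiting in that it counts executions in flight rather than requests per second. A minimal sketch, assuming a hypothetical `ConcurrencyLimit` class (not a Lambda API) and a limit of 2 standing in for a function's reserved concurrency:

```python
class ConcurrencyLimit:
    """Counts in-flight executions against a fixed cap (illustrative only)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.in_flight = 0

    def start(self) -> bool:
        """Try to begin an execution; False models a throttled invocation."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def finish(self) -> None:
        """Mark one execution as complete, freeing its slot."""
        if self.in_flight > 0:
            self.in_flight -= 1


fn = ConcurrencyLimit(limit=2)             # stands in for reserved concurrency
results = [fn.start() for _ in range(3)]   # three overlapping invocations
fn.finish()                                # one execution completes
after_finish = fn.start()                  # a slot is free again
print(results, after_finish)  # [True, True, False] True
```

Unlike the token bucket, nothing refills over time here: capacity returns only when a running execution finishes.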
How Throttling Works Internally
Client sends a request to API Gateway.
API Gateway checks the token bucket for the API key's usage plan (or the stage and account-level limits if no key is used).
If a token is available, the request proceeds to the backend (e.g., Lambda).
If the bucket is empty, API Gateway returns HTTP 429 with a Retry-After header indicating seconds to wait.
For Lambda-backed APIs, API Gateway also respects Lambda's concurrency limits. If Lambda is throttled, API Gateway may return 429 or 503.
Key Default Values and Timers
API Gateway account-level limit: 10,000 rps, burst 5,000.
API Gateway per-API default: same as account-level, but can be lowered via usage plans.
Lambda concurrency default: 1,000 per region.
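Lambda reserved concurrency: 0 up to the account limit minus 100 (Lambda keeps at least 100 units unreserved).
Lambda burst concurrency: varies by region under the original scaling model (an initial burst of 500 to 3,000 instances, then 500 more per minute).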
Retry-After header: typically 1-5 seconds.
Configuration and Verification
To configure throttling in API Gateway:
Create a usage plan and associate it with an API stage.
Set rate and burst limits.
Optionally, attach API keys to clients.
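The same three steps can be scripted with boto3. The sketch below is hedged: the helper names `usage_plan_params` and `create_plan_with_key` are hypothetical, the plan name, API ID, and stage are placeholder values, and the second function needs boto3 plus AWS credentials, so it is defined but not called here.

```python
def usage_plan_params(name, api_id, stage, rate, burst):
    """Build keyword arguments for API Gateway's create_usage_plan call."""
    return {
        "name": name,
        "throttle": {"rateLimit": float(rate), "burstLimit": int(burst)},
        "apiStages": [{"apiId": api_id, "stage": stage}],
    }


def create_plan_with_key(params, key_name):
    """Create the plan and an API key, then link them (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without boto3

    client = boto3.client("apigateway")
    plan = client.create_usage_plan(**params)
    key = client.create_api_key(name=key_name, enabled=True)
    client.create_usage_plan_key(
        usagePlanId=plan["id"], keyId=key["id"], keyType="API_KEY"
    )
    return plan["id"], key["id"]


# Placeholder values; replace with a real API ID and stage before use.
params = usage_plan_params("MyUsagePlan", "abc123", "prod", rate=1000, burst=500)
print(params["throttle"])
```

Separating the parameter builder from the AWS calls keeps the throttle settings easy to review and unit test before anything is created in an account.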
To verify throttling:
Test with curl or Postman and check for 429 responses.
Use CloudWatch metrics such as 4XXError and Count (throttled requests appear as 429 responses within 4XXError).
Enable API Gateway logging to see throttle events.
Interaction with Related Technologies
AWS WAF: Can be used to implement IP-based rate limiting before API Gateway.
CloudFront: Can also throttle at the CDN level using Lambda@Edge or AWS WAF.
Amazon Cognito: Can throttle authentication requests.
AWS Shield: Provides DDoS protection but not application-level throttling.
Exam Relevance
For DVA-C02, you must know:
The difference between rate limiting (requests per second) and concurrency (simultaneous executions).
How to configure throttling via usage plans.
The default limits and how to request increases.
How throttling affects Lambda: reserved vs. provisioned concurrency.
Common error codes: 429 (throttling), 503 (service unavailable).
The token bucket algorithm conceptually.
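Code Example: Creating a Usage Plan
# Create a usage plan with a steady-state rate of 1000 rps and a burst of 500,
# attached to the prod stage of API abc123.
aws apigateway create-usage-plan \
  --name "MyUsagePlan" \
  --throttle rateLimit=1000,burstLimit=500 \
  --api-stages apiId=abc123,stage=prod
Monitoring Throttling
# Sum 4XX responses (throttled requests return 429) in 5-minute periods.
# "MyApi" is a placeholder API name; ThrottleCount is not an AWS/ApiGateway metric.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 4XXError \
  --dimensions Name=ApiName,Value=MyApi \
  --start-time 2023-01-01T00:00:00Z \
  --end-time 2023-01-01T01:00:00Z \
  --period 300 \
  --statistics Sum
Client sends API request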
The client (e.g., mobile app) sends an HTTP request to the API Gateway endpoint. The request includes headers such as `x-api-key` if using API keys, or the request is associated with a usage plan via the API stage. At this point, no throttling check has occurred yet.
API Gateway checks token bucket
API Gateway looks up the throttling limits that apply to the request: the usage plan limits for its API key if one is present, plus the stage, method, and account-level limits. API Gateway does not throttle by source IP on its own. It then checks the corresponding token bucket: if the bucket has at least one token, it consumes one and forwards the request. The bucket size is the burst limit, and tokens are refilled at the rate limit per second. The check happens before any backend call.
Token available: forward to backend
If a token is available, the request is forwarded to the integration endpoint (e.g., Lambda, HTTP). API Gateway records the request in CloudWatch logs and metrics. The backend processes the request normally. The token bucket decrements by one.
No token: return 429 Too Many Requests
If the token bucket is empty, API Gateway immediately returns an HTTP 429 response. The client should back off before retrying, honoring a `Retry-After` header (e.g., `Retry-After: 2`) if the response carries one. The request is not forwarded to the backend, saving resources. The rejected request is counted in the `4XXError` CloudWatch metric.
Lambda concurrency check (if applicable)
If the request is forwarded to a Lambda function, Lambda checks its concurrency limit. The function's reserved concurrency (if set) or account-level concurrency is evaluated. If the function is at its concurrency limit, Lambda returns a `TooManyRequestsException` (429). API Gateway may retry or return 503 depending on configuration. This is a separate throttling layer.
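The two layers in this walkthrough can be put side by side in a small simulation. It is a sketch under simplifying assumptions: all requests arrive at the same instant, no tokens refill, no execution finishes, and the function name `simulate` and the outcome labels are hypothetical.

```python
from collections import Counter


def simulate(requests: int, tokens: int, concurrency_limit: int) -> Counter:
    """Classify simultaneous requests by which layer, if any, throttles them."""
    outcome = Counter()
    in_flight = 0
    for _ in range(requests):
        if tokens < 1:
            outcome["gateway_429"] += 1        # rejected before the backend
            continue
        tokens -= 1                            # API Gateway consumes a token
        if in_flight >= concurrency_limit:
            outcome["lambda_throttled"] += 1   # TooManyRequestsException
            continue
        in_flight += 1                         # execution starts, keeps running
        outcome["served"] += 1
    return outcome


result = simulate(requests=10, tokens=6, concurrency_limit=4)
print(dict(result))
# {'served': 4, 'lambda_throttled': 2, 'gateway_429': 4}
```

Requests throttled by Lambda still consumed a gateway token, which is why the two limits need to be sized together.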
In a real-world enterprise, rate limiting is critical for a B2B SaaS platform that exposes APIs to thousands of customers. For example, a payment processing API must ensure that no single merchant can overwhelm the system during peak sales events like Black Friday. The platform uses API Gateway with usage plans: each merchant gets an API key with a rate limit of 1000 requests per second and a burst of 2000. This allows short spikes but smooths out sustained traffic. The backend is a set of Lambda functions, each with a reserved concurrency of 500 so that no single function can consume all of the account's concurrency.
During a flash sale, one merchant's traffic spikes to 5000 rps. The token bucket empties quickly, and the merchant receives 429 responses. The client SDK retries with exponential backoff and jitter, honoring a Retry-After header when one is present. CloudWatch alarms trigger when throttled requests (429s counted in the 4XXError metric) exceed a threshold, alerting the operations team.
Misconfiguration can happen: if the rate limit is set too low (e.g., 10 rps), legitimate traffic gets throttled, causing customer complaints. Conversely, if burst is set too high (e.g., 10,000), a single client can consume the capacity other clients need.
Another scenario: a microservices architecture where internal services communicate via API Gateway. If an internal service's Lambda function has insufficient reserved concurrency, it becomes a bottleneck. The operations team must monitor both API Gateway throttling and Lambda concurrency. They use CloudWatch dashboards to correlate API Gateway 4XXError with Lambda Throttles and ConcurrentExecutions. They also implement circuit breakers in the client to avoid retry storms.
In production, the team configures a usage plan with a rate of 5000 rps and burst 10,000 for premium customers, and 100 rps and burst 500 for basic customers. They also request an account-level quota of 50,000 rps for the region, since the account-level limit is raised through a quota increase rather than set directly.
The key lesson: always test throttling behavior with load testing tools like Artillery or Locust before going live.
The DVA-C02 exam tests rate limiting and throttling under Domain 2.4 (Security) and also in Domain 1 (Development with AWS Services). Expect 2-4 questions directly on this topic. Key areas:
Token bucket algorithm: Know that burst limit is the bucket size and rate is the refill rate. A common wrong answer is confusing burst with sustained rate. Exam question: "An API has a rate of 1000 and burst of 2000. Which statement is true?" Wrong answer: "It allows 1000 requests per second, and any request over 1000 is rejected." Correct: "It allows up to 2000 requests in a burst, then settles to 1000 per second."
Default limits: Memorize: API Gateway default 10,000 rps, burst 5,000. Lambda default concurrency 1,000. Exam may ask: "What is the default API Gateway throttling limit?" Wrong answers: 5000, 1000, unlimited.
Usage plans vs. API keys: Usage plans set throttling and quota limits; API keys identify clients. Common trap: thinking API keys alone enforce throttling. They don't; they only identify. Usage plans must be associated.
Lambda throttling: Reserved concurrency vs. provisioned concurrency. Reserved concurrency limits the function's maximum concurrency; provisioned concurrency pre-warms instances but still counts towards reserved. Exam question: "How to ensure a Lambda function never exceeds 100 concurrent executions?" Wrong answer: "Set provisioned concurrency to 100." Correct: "Set reserved concurrency to 100."
Error codes: 429 for throttling, 503 for service unavailable. Exam may ask: "What HTTP status code does API Gateway return when throttling?" Wrong answers: 500, 503, 400.
Retry-After header: Included in 429 responses. Exam: "How does the client know when to retry?" Answer: Retry-After header.
Edge cases: When a Lambda function is throttled, API Gateway may retry (if configured) or return 503. Exam: "If Lambda returns a TooManyRequestsException, what does API Gateway do?" Answer: Depends on integration response configuration; default is 503.
Eliminating wrong answers: If a question mentions "burst" and "rate", remember that burst is the maximum instantaneous throughput. If a question asks about per-client limits, think usage plans with API keys. If it asks about account-wide limits, think service quotas. Always identify the layer: API Gateway or Lambda.
API Gateway default throttling: 10,000 requests per second per account per region, burst 5,000.
Usage plans allow per-client throttling with rate and burst limits.
Lambda default concurrency: 1,000 per account per region.
Reserved concurrency sets a maximum concurrency for a Lambda function; provisioned concurrency pre-warms instances.
Token bucket: burst = bucket size, rate = refill rate per second.
HTTP 429 is returned when throttling; include Retry-After header.
API Gateway can also be throttled at the method level and stage level.
CloudWatch metrics: 4XXError and Count (API Gateway); Throttles and ConcurrentExecutions (Lambda).
To request higher limits, use Service Quotas or AWS Support.
Always implement exponential backoff in clients to handle 429 responses.
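Exponential backoff with jitter, recommended in the last point, can be sketched as follows. The function names, the `send` callable, and the `base` and `cap` defaults are assumptions for illustration; the delay formula is the common "full jitter" variant, chosen here as one reasonable option rather than anything mandated by AWS.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0, rng=random.random):
    """Yield full-jitter delays: uniform in [0, min(cap, base * 2**attempt))."""
    for attempt in range(retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(send, retries=5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or retries run out."""
    status = send()
    for delay in backoff_delays(retries):
        if status != 429:
            break
        sleep(delay)        # wait before the next attempt
        status = send()
    return status


# Usage with a fake client that is throttled twice and then succeeds.
responses = iter([429, 429, 200])
waits = []                  # records delays instead of actually sleeping
final = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(final, len(waits))  # 200 2
```

Randomizing each delay spreads retries from many throttled clients over time, so they do not all return at once and empty the bucket again.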
These come up on the exam all the time. Here's how to tell them apart.
API Gateway Throttling
Enforces rate limits at the API layer before backend invocation.
Uses token bucket algorithm with rate and burst parameters.
Returns HTTP 429 with Retry-After header.
Configured via usage plans, stage settings, or service quotas.
Throttles per account, per stage or method, and per API key; source-IP limits require AWS WAF.
Lambda Concurrency Throttling
Enforces concurrency limits at the function execution layer.
Uses a simple concurrency counter (not token bucket).
Returns Lambda TooManyRequestsException (429) or API Gateway 503.
Configured via reserved concurrency or account-level limit.
Throttles based on function invocation, regardless of client.
Mistake
Rate limiting and throttling are the same thing.
Correct
Rate limiting is the policy (e.g., 1000 requests per second), while throttling is the enforcement mechanism (e.g., returning 429 when the limit is exceeded). In AWS, they are often used interchangeably, but technically throttling is the action.
Mistake
API Gateway's default throttling limit is 10,000 requests per second per API.
Correct
The default is 10,000 rps per account per region, not per API. All APIs in a region share this limit unless you configure per-API limits via usage plans or request a service quota increase.
Mistake
A usage plan with a rate of 1000 and a burst of 500 adds the two together, so a client can sustain 1500 requests per second.
Correct
The burst is the bucket size; it does not add to the rate. The bucket starts full (500 tokens), so the client can send up to 500 requests at one instant. Tokens then refill at 1000 per second, but the bucket never holds more than 500. In an idealized token bucket, the first second could admit at most 500 (initial burst) + 1000 (refilled during that second) = 1500 requests, and every second after that at most 1000. Sustained throughput is therefore 1000 rps, not 1500. The exam expects you to know that burst is the bucket capacity.
Mistake
Lambda reserved concurrency guarantees that the function will always have that many instances available.
Correct
Reserved concurrency sets a limit on the maximum number of concurrent executions. It does not reserve instances; it only ensures that other functions cannot use that concurrency. Provisioned concurrency is needed to pre-warm instances.
Mistake
You can only throttle API Gateway using usage plans.
Correct
You can also throttle at the account level (service quotas), per-stage, per-method, and using AWS WAF. Usage plans are for per-client throttling with API keys.
API Gateway rate limiting controls the number of requests per second (RPS) entering the API, using a token bucket algorithm. It returns 429 when exceeded. Lambda concurrency limiting controls the number of function executions happening simultaneously. When exceeded, Lambda returns a TooManyRequestsException. Both can be configured independently, and API Gateway can also throttle due to Lambda throttling. For the exam, remember that rate limiting is about request frequency, while concurrency is about parallel executions.
Create a usage plan with desired rate and burst limits. Then create an API key and associate it with the usage plan. Finally, attach the API key to the client requests via the x-api-key header. The usage plan enforces throttling per API key. You can also set a default throttling per stage or method. For the exam, know that usage plans are the primary mechanism for per-client throttling.
When Lambda returns a TooManyRequestsException, API Gateway can be configured to handle it as an integration response. By default, it returns a 503 Service Unavailable response to the client. You can also configure API Gateway to retry the request or return a custom error. The exam may test that you need to handle Lambda throttling separately from API Gateway throttling.
Yes, but not directly via API Gateway's built-in throttling. You can use AWS WAF attached to the API Gateway to create IP-based rate limiting rules. Alternatively, you can use a custom Lambda authorizer to inspect IP and reject requests, but that adds latency. For the exam, know that WAF is the recommended service for IP-based throttling.
The default burst limit is 5,000 requests per account per region. This is the maximum number of requests that can be processed in a short burst before the rate limit kicks in. The token bucket starts full, so the first 5,000 requests in a region can be processed immediately. After that, the rate limit of 10,000 rps applies.
Use CloudWatch metrics: 4XXError (which includes 429 throttling responses) and Count (total requests); API Gateway does not publish a dedicated throttle-count metric. You can also enable detailed CloudWatch Logs to see individual throttle events. For Lambda, monitor ConcurrentExecutions and Throttles metrics. Set up CloudWatch alarms to notify when throttling exceeds a threshold.
Reserved concurrency sets a maximum limit on the number of concurrent executions for a function, preventing it from using more than that amount. Provisioned concurrency pre-initializes a specified number of instances, reducing cold starts, but it counts towards the reserved concurrency limit. Reserved concurrency is for limiting, while provisioned concurrency is for performance. Both are relevant for throttling.