This chapter covers API Gateway caching and throttling, two critical features for building scalable, resilient APIs on AWS. For the SAA-C03 exam, understanding these mechanisms is essential as they appear in roughly 10-15% of questions related to high-performance architectures. You will learn how caching reduces latency and backend load, how throttling protects your backend from traffic spikes, and how to configure these features for optimal performance and cost.
Imagine a concert venue with a single entrance. The venue has a ticket booth (API Gateway) that controls access. Caching is like having a large billboard outside the venue that lists upcoming shows and ticket prices. When fans ask about show times, the ticket booth can point to the billboard instead of calling the box office (backend) each time. This reduces the load on the box office and speeds up responses. Throttling is like the venue's capacity control: only 1000 fans per hour can enter. If more arrive, they are turned away with a 'sold out' sign (HTTP 429 Too Many Requests). The venue can also have different entry lanes for VIPs (higher throttle limits) and general admission. The ticket booth uses a token bucket algorithm: each fan gets a token to enter; tokens replenish at a steady rate. If a fan tries to enter without a token, they are denied. This ensures the venue never exceeds its capacity and the box office isn't overwhelmed.
What is API Gateway Caching?
API Gateway caching stores responses from your backend endpoints so that subsequent identical requests can be served directly from the cache, reducing latency and load on your backend. The cache is a fully managed, in-memory cache that sits between the client and your backend. When a client sends a request, API Gateway checks the cache for a valid cached response based on the cache key. If found, it returns the cached response without invoking the backend. If not, the request is forwarded to the backend, and the response is stored in the cache for future requests.
How Caching Works Internally
API Gateway uses a configurable TTL (Time to Live) for cached responses, defaulting to 300 seconds (5 minutes). By default the cache key is derived from the resource path of the cached method. You can extend it by marking specific path parameters, query string parameters, or headers as cache key parameters, so that requests differing in those values are cached separately. Caching is enabled per stage, can be overridden per method, and can be sized from 0.5 GB to 237 GB. Encryption of cached data at rest is an optional setting that you turn on in the stage's cache settings.
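The TTL and cache-key behavior described above can be illustrated with a toy in-memory cache. This is a minimal sketch, not API Gateway's implementation: the names `TtlCache` and `make_cache_key` are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Dict, Optional, Tuple


class TtlCache:
    """Toy response cache with a fixed TTL (illustrative only)."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl_seconds = ttl_seconds
        # Maps cache key -> (expiry time, response body).
        self._entries: Dict[Tuple, Tuple[float, str]] = {}

    def get(self, key: Tuple, now: float) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None  # miss: nothing stored for this key
        expires_at, body = entry
        if now >= expires_at:
            del self._entries[key]  # entry outlived its TTL
            return None
        return body

    def put(self, key: Tuple, body: str, now: float) -> None:
        if self.ttl_seconds <= 0:
            return  # a TTL of 0 means "do not cache"
        self._entries[key] = (now + self.ttl_seconds, body)


def make_cache_key(path: str, query: Dict[str, str],
                   key_params: Tuple[str, ...]) -> Tuple:
    """Build a key from the path plus only the parameters chosen as cache keys."""
    return (path,) + tuple((name, query.get(name)) for name in sorted(key_params))
```

With `key_params=("id",)`, two requests to `/products` that differ only in an unrelated query parameter share one cache entry, while requests with different `id` values are cached separately.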
Key Caching Components and Defaults
TTL (Time to Live): Default 300 seconds, minimum 0 (disable caching), maximum 3600 seconds (1 hour).
Cache size: 0.5 GB, 1.6 GB, 6.1 GB, 13.5 GB, 28.4 GB, 58.2 GB, 118 GB, or 237 GB.
Cache encryption: Optional; turn on the stage's cache encryption setting to encrypt cached data at rest.
Cache cluster: Enabling caching provisions a dedicated cache cluster for the stage (the cacheClusterEnabled setting); its capacity is the cache size you select.
Cache status: API Gateway does not document a response header that reports cache hits or misses; use the CloudWatch CacheHitCount and CacheMissCount metrics to observe cache behavior.
To enable caching via the AWS CLI:
aws apigateway update-stage --rest-api-id <api-id> --stage-name prod --patch-operations op=replace,path=/cacheClusterEnabled,value=true
What is API Gateway Throttling?
Throttling limits the rate of requests to your API to protect your backend from being overwhelmed. For the exam, focus on two levels: account-level limits, which are shared by all APIs in an account within a region, and method-level limits, which apply to individual methods in a stage. The default throttling limits are:
Account-level: 10,000 requests per second (RPS) with a burst of 5,000 requests across all APIs in a region.
Method-level: Configurable per method; when not set, a method is bounded only by the account-level limits (10,000 RPS, burst of 5,000).
How Throttling Works Internally
API Gateway uses a token bucket algorithm for throttling. Each request consumes a token. Tokens are replenished at a steady rate (the rate limit). If the bucket is empty (burst capacity exhausted), the request is rejected with a 429 Too Many Requests response. The bucket size determines the burst capacity. For example, if the rate limit is 1000 RPS and burst is 500, then up to 500 requests can be served immediately if the bucket is full, but sustained rate cannot exceed 1000 RPS.
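The token bucket described above can be sketched in a few lines. This is an illustrative model under simple assumptions (a hypothetical `TokenBucket` class, a clock that starts at 0 and is passed in by the caller), not the service's actual implementation.

```python
class TokenBucket:
    """Toy token bucket: `rate_per_second` refills tokens, `burst` caps them."""

    def __init__(self, rate_per_second: float, burst: float) -> None:
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst       # the bucket starts full
        self.last_refill = 0.0    # assumption: the clock starts at 0

    def allow(self, now: float) -> bool:
        """Return True if the request is admitted, False if it would get a 429."""
        elapsed = max(0.0, now - self.last_refill)
        # Refill at the steady rate, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_second=1000, burst=500)
# 600 requests arriving at the same instant: only the burst capacity is admitted.
admitted = sum(bucket.allow(now=0.0) for _ in range(600))
```

Here `admitted` is 500: the full bucket absorbs the spike up to the burst limit, and the remaining 100 requests are rejected until tokens are replenished.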
Configuring Throttling
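You can set default method throttling for a stage in the API Gateway console, or via the CLI. The /*/*/ prefix in the patch path applies the setting to every resource and method in the stage; replace it with a specific resource path and HTTP method to target a single method:
aws apigateway update-stage --rest-api-id <api-id> --stage-name prod --patch-operations 'op=replace,path=/*/*/throttling/burstLimit,value=2000' 'op=replace,path=/*/*/throttling/rateLimit,value=1000'
Throttling and Caching Interaction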
Caching can reduce the effective request rate to your backend, but throttling still applies to the API Gateway endpoint. If a request hits the cache, it does not count against the backend's throttling limit, but it does count against the API Gateway's method-level throttle. This is important because you might need to set higher throttling limits if caching is heavily used.
Usage Plans and API Keys
Usage plans allow you to set throttling and quota limits for specific customers using API keys. You can define a usage plan with a rate limit, burst limit, and daily/monthly quota. API keys are distributed to customers, and API Gateway enforces the plan limits per key. This is useful for tiered pricing or controlling access.
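To make the per-key enforcement concrete, the sketch below models only the quota part of a usage plan. The class names, the status codes returned by `check`, and the omission of rate and burst enforcement are simplifying assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class UsagePlan:
    name: str
    rate_limit: float       # steady-state requests per second (not enforced here)
    burst_limit: int        # burst capacity (not enforced here)
    quota_per_period: int   # requests allowed per day, week, or month


@dataclass
class KeyUsage:
    plan: UsagePlan
    used_in_period: int = 0


class UsagePlanEnforcer:
    """Toy enforcer that tracks quota consumption per API key."""

    def __init__(self) -> None:
        self._usage: Dict[str, KeyUsage] = {}

    def register_key(self, api_key: str, plan: UsagePlan) -> None:
        self._usage[api_key] = KeyUsage(plan=plan)

    def check(self, api_key: str) -> int:
        """Return an HTTP-style status code for one request made with `api_key`."""
        usage = self._usage.get(api_key)
        if usage is None:
            return 403  # key is not associated with any plan
        if usage.used_in_period >= usage.plan.quota_per_period:
            return 429  # quota for the current period is exhausted
        usage.used_in_period += 1
        return 200
```

Each key carries its own counter, so one customer exhausting a quota does not affect requests made with another customer's key.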
Best Practices
Enable caching for read-heavy, stable data to reduce latency and backend load.
Set appropriate TTL based on data freshness requirements. For real-time data, use TTL=0 (no cache) or short TTL.
Include in the cache key every request parameter that changes the response; otherwise a caller can receive a response that was cached for a different request.
Monitor cache hit ratio via CloudWatch metrics (CacheHitCount, CacheMissCount).
Start with a small cache size and increase if needed.
Use throttling to protect against DDoS or accidental high traffic.
Set method-level throttling lower than account-level to allow headroom for other APIs.
Use usage plans for API key-based throttling.
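The cache hit ratio mentioned in the monitoring practice above is simply hits divided by total lookups. A minimal sketch, assuming you have already summed the CacheHitCount and CacheMissCount metrics over the same period (the function name is hypothetical):

```python
def cache_hit_ratio(hit_count: int, miss_count: int) -> float:
    """Fraction of requests served from the cache over a period."""
    total = hit_count + miss_count
    if total == 0:
        return 0.0  # no traffic in the period, so report 0 rather than divide by zero
    return hit_count / total
```

A low ratio on a read-heavy endpoint usually suggests that the TTL is too short or that the cache key includes parameters that vary more than the response does.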
Interaction with Other Services
API Gateway integrates with CloudWatch for logging and monitoring, with WAF for web application firewall rules, and with Lambda for serverless backends. Caching and throttling work transparently with these integrations. For example, you can cache responses from Lambda functions, reducing invocation count and cost.
Client sends request
The client makes an HTTP request to the API Gateway endpoint. The request includes method, path, headers, and query parameters. API Gateway receives the request and begins processing.
Check throttling limits
API Gateway checks the account-level and method-level throttling limits using the token bucket algorithm. If the token bucket is empty, the request is rejected with a 429 Too Many Requests response and an associated `Retry-After` header. Otherwise, a token is consumed and processing continues.
Check cache for cached response
If caching is enabled for the stage and method, API Gateway constructs a cache key from the resource path and any request parameters configured as cache keys. It then checks the cache for a valid (non-expired) entry. If found, it returns the cached response without invoking the backend. If not found, it proceeds to the backend.
Forward request to backend
API Gateway forwards the request to the configured backend (e.g., Lambda, HTTP endpoint, or AWS service). The backend processes the request and returns a response. This step is skipped if a cache hit occurred.
Store response in cache
If caching is enabled and the response is cacheable (status code 200-299, and the method is GET or any method with caching enabled), API Gateway stores the response in the cache with the configured TTL. The response is then returned to the client.
Return response to client
API Gateway sends the response (either cached or from the backend) back to the client. The response includes headers added by API Gateway, such as the request ID header x-amzn-RequestId. It does not include a cache-status header or X-RateLimit-* headers, so clients should detect throttling from the 429 status code.
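The request flow above can be condensed into a single function. This is a simplified sketch under stated assumptions: the function name and the dictionary-based bucket and cache are hypothetical, token refill is omitted, and the backend is any callable that returns a response body.

```python
from typing import Callable, Dict, Tuple


def handle_request(cache_key: str,
                   now: float,
                   bucket: Dict[str, float],
                   cache: Dict[str, Tuple[float, str]],
                   backend: Callable[[str], str],
                   ttl_seconds: float = 300.0) -> Tuple[int, str]:
    """Process one request: throttle, then cache lookup, then backend call."""
    # Throttling check: every request consumes a token, cached or not.
    if bucket["tokens"] < 1:
        return 429, "Too Many Requests"
    bucket["tokens"] -= 1

    # Cache lookup: serve a non-expired entry without calling the backend.
    entry = cache.get(cache_key)
    if entry is not None and now < entry[0]:
        return 200, entry[1]

    # Cache miss or expired entry: call the backend.
    body = backend(cache_key)

    # Store the fresh response with its expiry time, unless caching is disabled.
    if ttl_seconds > 0:
        cache[cache_key] = (now + ttl_seconds, body)
    return 200, body
```

Because the throttling check runs before the cache lookup, a cache hit still consumes a token even though it never reaches the backend.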
Enterprise Scenario 1: E-commerce Product Catalog
A large e-commerce company uses API Gateway to expose product catalog APIs to their mobile app and website. The product data changes infrequently (every few hours). To reduce latency and backend load, they enable caching with a TTL of 300 seconds and a cache size of 13.5 GB. The cache key includes the product ID and locale. This results in a 90% cache hit rate, reducing backend calls from 10,000 RPS to 1,000 RPS. They also set method-level throttling at 5,000 RPS with a burst of 2,500 to protect the backend. They monitor cache hit ratio via CloudWatch and adjust TTL based on data update frequency. Misconfiguration: initially they used a TTL of 3600 seconds, causing stale prices during flash sales. They reduced TTL to 60 seconds during sales events.
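The backend load figure in this scenario follows directly from the hit rate: only cache misses reach the backend. A small sketch of that arithmetic (the function name is hypothetical):

```python
def backend_rps(incoming_rps: float, cache_hit_ratio: float) -> float:
    """Requests per second that reach the backend after cache hits are removed."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return incoming_rps * (1.0 - cache_hit_ratio)
```

With 10,000 RPS of incoming traffic and a 90% hit rate, about 1,000 RPS reach the backend, which matches the reduction described in the scenario.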
Enterprise Scenario 2: Financial Data API for Partners
A fintech company provides real-time stock prices via API Gateway to partner applications. They use usage plans with API keys to enforce different throttling limits per partner: premium partners get 1,000 RPS with burst 500, standard partners get 100 RPS with burst 50. They disable caching because stock prices are real-time and must not be stale. They set account-level throttling to 10,000 RPS. They also enable AWS WAF to block malicious IPs. When a partner exceeds their limit, they receive a 429 response with a Retry-After header. The company monitors usage via CloudWatch metrics and adjusts limits based on partner SLAs.
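On the partner side, a throttled client is expected to slow down rather than retry immediately. The sketch below shows one common approach, exponential backoff on 429 responses. It is an assumption about how a partner client might behave, not part of API Gateway: `send` stands in for whatever function issues the request and returns a status code, and `sleep` is injected so the delays can be observed.

```python
from typing import Callable


def call_with_backoff(send: Callable[[], int],
                      sleep: Callable[[float], None],
                      max_attempts: int = 5,
                      base_delay: float = 0.5) -> int:
    """Retry `send` while it returns 429, doubling the wait after each attempt."""
    status = send()
    attempt = 1
    while status == 429 and attempt < max_attempts:
        # Waits of base_delay, 2 * base_delay, 4 * base_delay, and so on.
        sleep(base_delay * (2 ** (attempt - 1)))
        status = send()
        attempt += 1
    return status  # the last status seen, which may still be 429
```

In a real client you would pass `time.sleep` as `sleep`, and typically add random jitter to the delay so that many throttled clients do not retry in lockstep.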
Enterprise Scenario 3: IoT Device Telemetry
An IoT company ingests telemetry data from millions of devices via API Gateway. Each device sends a JSON payload every minute. They use a REST API with throttling set to 50,000 RPS account-level and 10,000 RPS per method. They enable caching for device configuration endpoints but not for data ingestion. They use API keys per device for throttling via usage plans. They also use request validation to reject malformed payloads early. Misconfiguration: initially they set method-level throttling too low (1,000 RPS), causing many 429 errors during device firmware updates. They increased it to 10,000 RPS after monitoring.
The SAA-C03 exam tests API Gateway caching and throttling under Domain 3 (Design High-Performing Architectures). Expect 2-3 questions specifically on these topics. Key areas:
Caching TTL and cache keys: Know the default TTL (300 seconds), that you can customize cache keys, and that caching is per stage. A common wrong answer is that caching is per API or per resource – it's per stage.
Throttling limits: Remember the default account-level limits (10,000 RPS, burst 5,000). The exam often asks: 'How can you protect a backend from a traffic spike?' The correct answer is to configure method-level throttling, not just account-level. A trap is to choose 'use WAF rate limiting': WAF rate-based rules block individual sources, such as IP addresses, that exceed a threshold, but they do not cap the total request rate reaching a method.
Usage plans vs. throttling: Usage plans allow per-customer throttling and quotas using API keys. A common wrong answer is that usage plans are only for billing – they also enforce throttling. Also, usage plans are independent of method-level throttling; both can apply.
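Cache invalidation: You can flush the entire stage cache from the console ('Flush entire cache') or with the FlushStageCache API (aws apigateway flush-stage-cache). A client can invalidate a single cached entry by sending the request with a Cache-Control: max-age=0 header; when authorization is required for invalidation, the caller also needs the execute-api:InvalidateCache permission. The exam might ask: 'How to clear a specific cached response?' Answer: Send the request with Cache-Control: max-age=0 from an authorized client.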
Edge cases: Throttling applies per region, not globally. If you have APIs in multiple regions, each has its own token bucket. Also, caching is off by default on every stage, including stages of private APIs, until you enable it explicitly. Another trap: built-in caching is a REST API feature; HTTP APIs and WebSocket APIs do not support it.
Numbers to memorize: Default throttling: 10,000 RPS, 5,000 burst. Default cache TTL: 300 seconds. Cache sizes: 0.5 GB to 237 GB. Minimum TTL: 0 (disable caching). Maximum TTL: 3600 seconds.
To eliminate wrong answers, understand the mechanism: caching reduces backend load but does not affect throttling at the API Gateway level. Throttling uses token bucket, not leaky bucket. Usage plans are for API key-based clients, not for all clients.
Default API Gateway throttling: 10,000 requests per second with a burst capacity of 5,000 per region.
Default cache TTL is 300 seconds (5 minutes); can be set from 0 to 3600 seconds.
Caching is enabled per stage, not per API or resource.
Cache keys start from the resource path; add path parameters, query strings, or headers as cache key parameters to cache their variants separately.
Throttling uses a token bucket algorithm; burst capacity allows short spikes.
Usage plans enforce per-key throttling and quotas, separate from method-level throttling.
Cache invalidation: flush the whole stage cache with the FlushStageCache API or the console, or invalidate a single entry by sending the request with Cache-Control: max-age=0.
Throttling limits are independent per region; each region has its own token bucket.
Caching does not reduce throttling token consumption; each request still counts.
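WAF rate limiting is different from API Gateway throttling: WAF rate-based rules limit requests per source (for example, per IP address) in a web ACL, while API Gateway throttling limits request rates per account, stage, method, or API key.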
These come up on the exam all the time. Here's how to tell them apart.
API Gateway Caching
Caches responses at the API Gateway level, before reaching backend
Cache key is the resource path plus any path parameters, query strings, or headers you mark as cache keys
Cache size up to 237 GB, TTL up to 3600 seconds
Only for REST APIs; not HTTP or WebSocket APIs
Cache is in-memory, managed by API Gateway
CloudFront Caching
Caches at the edge location, closer to the client
Cache key includes URL, query string, headers, cookies (configurable)
Cache size unlimited (uses edge storage), TTL up to 1 year
Works with any origin (HTTP, S3, ALB, etc.)
Cache is on disk at edge locations, managed by CloudFront
Mistake
Caching reduces the number of requests that count against throttling limits.
Correct
Caching reduces backend invocations but does not reduce the number of requests counted against API Gateway throttling. Each request, even if cached, consumes a token from the token bucket.
Mistake
Throttling limits are applied globally across all regions.
Correct
Throttling limits are per region. Each region has its own token bucket. You must configure throttling separately for each region.
Mistake
You can set throttling limits on individual API keys without usage plans.
Correct
Per-key throttling requires a usage plan. Without a usage plan, API keys are only used for identification, not throttling.
Mistake
Cache TTL can be set to any value.
Correct
Cache TTL must be between 0 and 3600 seconds (1 hour). The default is 300 seconds. You cannot set TTL to, say, 7200 seconds.
Mistake
Caching works for all HTTP methods.
Correct
By default, caching only works for GET methods. For other methods (POST, PUT, DELETE), you must explicitly enable caching in the method settings. Even then, caching is typically not recommended for non-idempotent methods.
The default TTL is 300 seconds (5 minutes). You can set it between 0 (no caching) and 3600 seconds (1 hour). A common exam tip: if a question asks for the default TTL, remember it's 300 seconds, not 0 or 3600.
You can flush the entire cache with the FlushStageCache API call (aws apigateway flush-stage-cache) or by choosing 'Flush entire cache' in the console. Note that this invalidates all cached entries immediately. To invalidate an individual entry, an authorized client sends the request with a Cache-Control: max-age=0 header.
No, caching is only supported for REST APIs. HTTP APIs and WebSocket APIs do not have caching capabilities. If you need caching for WebSocket, consider using a different architecture like a Lambda function with ElastiCache.
Account-level throttling applies to all APIs in a region (default 10,000 RPS). Method-level throttling applies to a specific method (e.g., GET /items) and can be set lower than account-level to protect that endpoint. Both use token bucket algorithm. Method-level limits are checked first, then account-level.
Use usage plans with API keys. Create a usage plan with specific rate and burst limits, and associate it with API keys. Distribute the keys to customers. API Gateway enforces the plan limits per key. This allows tiered access (e.g., free tier: 100 RPS, premium: 1000 RPS).
Yes, caching reduces the number of backend invocations, which can lower costs for Lambda or other pay-per-use backends. However, you pay for the cache size (hourly cost) and for data transfer out. For high-volume, read-heavy APIs, caching often reduces total cost.
Yes, you can place CloudFront in front of API Gateway. CloudFront caches responses at edge locations, reducing latency further. However, API Gateway caching still applies at the regional level. You can use both, but be careful with TTL settings to avoid stale data. CloudFront can also provide DDoS protection via AWS Shield.