SAA-C03 · Chapter 129 of 189 · Objective 3.7

API Gateway Caching and Throttling

This chapter covers API Gateway caching and throttling, two critical features for building scalable, resilient APIs on AWS. On the SAA-C03 exam, both come up in questions about high-performing architectures, so you need to understand how they work. You will learn how caching reduces latency and backend load, how throttling protects your backend from traffic spikes, and how to configure both for performance and cost.

25 min read
Intermediate
Updated May 31, 2026

API Gateway as a Concert Venue

Imagine a concert venue with a single entrance. The venue has a ticket booth (API Gateway) that controls access. Caching is like having a large billboard outside the venue that lists upcoming shows and ticket prices. When fans ask about show times, the ticket booth can point to the billboard instead of calling the box office (backend) each time. This reduces the load on the box office and speeds up responses. Throttling is like the venue's capacity control: only 1000 fans per hour can enter. If more arrive, they are turned away with a 'sold out' sign (HTTP 429 Too Many Requests). The venue can also have different entry lanes for VIPs (higher throttle limits) and general admission. The ticket booth uses a token bucket algorithm: each fan gets a token to enter; tokens replenish at a steady rate. If a fan tries to enter without a token, they are denied. This ensures the venue never exceeds its capacity and the box office isn't overwhelmed.

How It Actually Works

What is API Gateway Caching?

API Gateway caching stores responses from your backend endpoints so that subsequent identical requests can be served directly from the cache, reducing latency and load on your backend. The cache is a fully managed, in-memory cache that sits between the client and your backend. When a client sends a request, API Gateway checks the cache for a valid cached response based on the cache key. If found, it returns the cached response without invoking the backend. If not, the request is forwarded to the backend, and the response is stored in the cache for future requests.

How Caching Works Internally

API Gateway uses a configurable TTL (time to live) for cached responses, defaulting to 300 seconds (5 minutes). Cache entries are keyed on the request path, and you can add method request parameters (query strings, headers, and path parameters) as cache keys so that requests that differ in those values get separate entries. Caching is enabled per stage, can be overridden per method, and the stage cache can be sized from 0.5 GB to 237 GB. Encrypting cached data is optional and is turned on per method.
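A minimal Python sketch of the lookup described above, assuming the key is the HTTP method, the resource path, and only the parameters you mark as cache keys. `build_cache_key`, `StageCache`, and the injectable `clock` are hypothetical names for illustration; the sketch models the behavior, not API Gateway's internal implementation.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def build_cache_key(method: str, path: str, params: Dict[str, str],
                    key_params: Tuple[str, ...]) -> str:
    """Hypothetical cache key: method + path + only the parameters marked as cache keys."""
    # Parameters not listed in key_params are ignored, so requests that differ
    # only in those parameters map to the same cache entry.
    selected = sorted((name, params[name]) for name in key_params if name in params)
    parts = [f"{name}={value}" for name, value in selected]
    return f"{method.upper()} {path}?" + "&".join(parts)


class StageCache:
    """Toy in-memory cache with one TTL for all entries (a TTL of 0 stores nothing)."""

    def __init__(self, ttl_seconds: int = 300,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None                      # miss: never cached
        expires_at, body = entry
        if self.clock() >= expires_at:
            del self._entries[key]           # miss: entry has expired
            return None
        return body                          # hit

    def put(self, key: str, body: str) -> None:
        if self.ttl > 0:
            self._entries[key] = (self.clock() + self.ttl, body)
```

Because `trace` below is not a cache key, `GET /items?id=42&trace=a` and `GET /items?id=42&trace=b` share one entry, which stays valid for 300 seconds after it is stored.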

Key Caching Components and Defaults

TTL (Time to Live): Default 300 seconds, minimum 0 (disable caching), maximum 3600 seconds (1 hour).

Cache size: 0.5 GB, 1.6 GB, 6.1 GB, 13.5 GB, 28.4 GB, 58.2 GB, 118 GB, or 237 GB.

Cache encryption: Optional. You can choose to encrypt cached responses (the `cacheDataEncrypted` method setting); it is off unless you turn it on.

Cache cluster: Enabling caching provisions a dedicated cache cluster for the stage at the size you choose; it is billed per hour while enabled.

Cache hit and miss tracking: API Gateway reports the CacheHitCount and CacheMissCount metrics in CloudWatch, which tell you how many requests were served from the cache and how many went to the backend.
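If you generate stage settings from code, a small pre-flight check can catch out-of-range values before any call to AWS. This is a hypothetical helper: the size strings are the values the `cacheClusterSize` stage setting accepts, and the TTL bounds are the 0 to 3600 second range listed above.

```python
from typing import List

# Values accepted by the stage's cacheClusterSize setting (in GB, sent as strings).
ALLOWED_CACHE_SIZES_GB = ("0.5", "1.6", "6.1", "13.5", "28.4", "58.2", "118", "237")
MIN_TTL_SECONDS = 0      # 0 effectively disables caching
MAX_TTL_SECONDS = 3600   # 1 hour


def validate_cache_settings(ttl_seconds: int, size_gb: str) -> List[str]:
    """Hypothetical check: return a list of problems (empty means the values are in range)."""
    problems = []
    if not MIN_TTL_SECONDS <= ttl_seconds <= MAX_TTL_SECONDS:
        problems.append(f"TTL {ttl_seconds}s is outside {MIN_TTL_SECONDS}-{MAX_TTL_SECONDS}s")
    if size_gb not in ALLOWED_CACHE_SIZES_GB:
        problems.append(f"cache size {size_gb} GB is not an allowed size")
    return problems
```

For example, a TTL of 7200 seconds with a 2 GB cache fails on both counts.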

To enable caching via the AWS CLI:

aws apigateway update-stage --rest-api-id <api-id> --stage-name prod --patch-operations op=replace,path=/cacheClusterEnabled,value=true
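The CLI call sends a list of patch operations to the UpdateStage API. The Python sketch below builds the same list, extended with a cache size and a stage-wide TTL; it only constructs data, so it runs without credentials. The helper name is hypothetical. With boto3 you would pass the result as `patchOperations` to `boto3.client("apigateway").update_stage(restApiId=..., stageName=..., patchOperations=ops)`.

```python
from typing import Dict, List


def cache_patch_operations(size_gb: str = "0.5", ttl_seconds: int = 300) -> List[Dict[str, str]]:
    """Hypothetical helper: UpdateStage patch operations that enable the stage cache.

    All values are strings because the patch format carries string values.
    """
    return [
        # Provision the stage cache cluster (billed per hour while enabled).
        {"op": "replace", "path": "/cacheClusterEnabled", "value": "true"},
        {"op": "replace", "path": "/cacheClusterSize", "value": size_gb},
        # "*/*" targets every resource and method in the stage.
        {"op": "replace", "path": "/*/*/caching/ttlInSeconds", "value": str(ttl_seconds)},
    ]
```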

What is API Gateway Throttling?

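Throttling limits the rate of requests to your API to protect your backend from being overwhelmed. API Gateway applies throttling at several levels: account-level limits per Region, stage-level and method-level limits that you configure on a stage, and per-client limits that you configure in usage plans (covered below). The defaults are:

Account-level: 10,000 requests per second (RPS) with a burst of 5,000 requests, shared by all APIs in the account within a Region. This is a service quota that you can ask AWS to raise.

Method-level: Optional. If you do not set a method (or stage) limit, the method is bounded only by the account-level limit, so in effect the default is the same 10,000 RPS with a burst of 5,000.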
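Because each level is enforced separately, a single client's sustained rate on one method can never exceed the tightest limit that applies to it. A hypothetical Python helper that computes that upper bound; it is a simplification that ignores bursts and the traffic of other APIs sharing the account-level bucket.

```python
from typing import Optional


def effective_rate_limit(account_rps: float, method_rps: Optional[float] = None,
                         usage_plan_rps: Optional[float] = None) -> float:
    """Upper bound on one client's sustained RPS for one method (hypothetical helper).

    Levels left as None are not configured and add no extra limit.
    """
    limits = [account_rps] + [v for v in (method_rps, usage_plan_rps) if v is not None]
    return min(limits)
```

With only the account default, the bound is 10,000 RPS; add a 1,000 RPS method limit and a 100 RPS usage plan, and the client is held to 100 RPS.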

How Throttling Works Internally

API Gateway uses a token bucket algorithm for throttling. Each request consumes a token. Tokens are replenished at a steady rate (the rate limit). If the bucket is empty (burst capacity exhausted), the request is rejected with a 429 Too Many Requests response. The bucket size determines the burst capacity. For example, if the rate limit is 1000 RPS and burst is 500, then up to 500 requests can be served immediately if the bucket is full, but sustained rate cannot exceed 1000 RPS.
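The token bucket described above fits in a few lines. This Python sketch is a generic token bucket with an injectable clock so its behavior can be checked deterministically; it illustrates the algorithm and is not API Gateway's source code.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: capacity = burst limit, refill speed = rate limit (tokens per second)."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)       # the bucket starts full
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; False models a 429 Too Many Requests."""
        now = self.clock()
        # Refill for the elapsed time, but never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With the 1000 RPS and 500 burst example, 600 simultaneous requests produce 500 successes and 100 rejections; a quarter of a second later, 250 tokens have been refilled.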

Configuring Throttling

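You can set throttling for a whole stage or for a single method in the API Gateway console, or via the CLI. The `*/*` segment in the patch path applies the limits to every method in the stage (quoting keeps the shell from expanding the `*`):

aws apigateway update-stage --rest-api-id <api-id> --stage-name prod --patch-operations 'op=replace,path=/*/*/throttling/burstLimit,value=2000' 'op=replace,path=/*/*/throttling/rateLimit,value=1000'

To target one method, replace `*/*` with the resource path (each `/` escaped as `~1`) and the HTTP method, for example `/~1items/GET/throttling/rateLimit` for a GET on a /items resource.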
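Per-method patch paths must encode the resource path, because `/` already separates the segments of the patch path. A hypothetical Python helper that applies the `~1` escaping and builds both throttling operations for one method; the `/items` resource in the usage below is an example, not part of any real API.

```python
from typing import Dict, List


def escape_resource_path(resource_path: str) -> str:
    """Escape "/" as "~1" so the resource path fits in a single patch-path segment."""
    return resource_path.replace("/", "~1")


def throttle_patch_operations(resource_path: str, http_method: str,
                              rate_limit: float, burst_limit: int) -> List[Dict[str, str]]:
    """Hypothetical helper: UpdateStage operations that override throttling for one method."""
    prefix = f"/{escape_resource_path(resource_path)}/{http_method.upper()}/throttling"
    return [
        {"op": "replace", "path": f"{prefix}/rateLimit", "value": str(rate_limit)},
        {"op": "replace", "path": f"{prefix}/burstLimit", "value": str(burst_limit)},
    ]
```

For example, `throttle_patch_operations("/items", "get", 1000, 2000)` targets `/~1items/GET/throttling/rateLimit` and `/~1items/GET/throttling/burstLimit`.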

Throttling and Caching Interaction

Caching can reduce the request rate that reaches your backend, but throttling still applies at the API Gateway endpoint. A request served from the cache never reaches the backend, yet it still counts against API Gateway's stage- and method-level throttles and the account-level limit. Caching therefore does not let clients exceed a throttle: size throttling limits for total client traffic, not only for the traffic that reaches the backend.
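The arithmetic behind this interaction is short. A hypothetical Python helper that splits client traffic into the rate the throttle counts and the rate that reaches the backend for a given cache hit ratio:

```python
from typing import Dict


def request_rates(client_rps: float, cache_hit_ratio: float) -> Dict[str, float]:
    """Hypothetical helper: what the throttle counts vs. what the backend receives."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return {
        "throttle_counted_rps": client_rps,                      # every request uses a token
        "backend_rps": client_rps * (1.0 - cache_hit_ratio),     # only misses reach the backend
    }
```

At 4,000 client RPS with a 90% hit ratio, the throttle still counts 4,000 RPS while the backend sees about 400 RPS.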

Usage Plans and API Keys

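Usage plans let you set throttling and quota limits for specific customers identified by API keys. A usage plan defines a rate limit, a burst limit, and a request quota per day, week, or month. You distribute API keys to customers, and API Gateway enforces the plan's limits per key. This is useful for tiered pricing or for metering access. AWS applies usage plan throttling and quotas on a best-effort basis, so do not treat them as hard limits or use API keys as your only means of authentication.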
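A hypothetical Python model of per-key enforcement: each API key gets its own token bucket plus a request counter checked against the plan's quota. The quota period is simplified to an explicit `reset_quota()` call; a real plan resets it per day, week, or month.

```python
import time
from typing import Callable, Dict


class UsagePlan:
    """Hypothetical usage plan: one token bucket per API key plus a per-key request quota."""

    def __init__(self, rate: float, burst: int, quota: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate      # steady-state requests per second per key
        self.burst = burst    # bucket capacity per key
        self.quota = quota    # requests allowed per quota period per key
        self.clock = clock
        self._tokens: Dict[str, float] = {}
        self._last_seen: Dict[str, float] = {}
        self._used: Dict[str, int] = {}

    def allow(self, api_key: str) -> bool:
        """Return True if the request is admitted; False models a 429 response."""
        now = self.clock()
        # A key seen for the first time starts with a full bucket.
        tokens = self._tokens.get(api_key, float(self.burst))
        last = self._last_seen.get(api_key, now)
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        self._last_seen[api_key] = now
        used = self._used.get(api_key, 0)
        if used >= self.quota or tokens < 1:
            self._tokens[api_key] = tokens
            return False
        self._tokens[api_key] = tokens - 1
        self._used[api_key] = used + 1
        return True

    def reset_quota(self) -> None:
        """Start a new quota period for every key."""
        self._used.clear()
```

Keys never share a bucket, so one customer exhausting its burst or quota does not affect another customer on the same plan.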

Best Practices

Enable caching for read-heavy, stable data to reduce latency and backend load.

Set appropriate TTL based on data freshness requirements. For real-time data, use TTL=0 (no cache) or short TTL.

Include every parameter that changes the response in the cache key; otherwise a request can receive a response that was cached for a different parameter value.

Monitor cache hit ratio via CloudWatch metrics (CacheHitCount, CacheMissCount).

Start with a small cache size and increase if needed.

Use throttling to protect your backend from traffic spikes and runaway clients; for DDoS protection, pair it with AWS WAF and AWS Shield.

Set method-level throttling lower than account-level to allow headroom for other APIs.

Use usage plans for API key-based throttling.
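Monitoring the cache hit ratio comes down to dividing CacheHitCount by the sum of CacheHitCount and CacheMissCount over the same period. A minimal helper, assuming you have already retrieved both sums (for example with CloudWatch `GetMetricStatistics` in the `AWS/ApiGateway` namespace); the function name is hypothetical.

```python
def cache_hit_ratio(cache_hit_count: float, cache_miss_count: float) -> float:
    """Hit ratio from CacheHitCount and CacheMissCount sums for the same time period."""
    total = cache_hit_count + cache_miss_count
    if total == 0:
        return 0.0          # no cacheable traffic in the period
    return cache_hit_count / total
```

A persistently low ratio suggests the TTL is too short, the cache key includes parameters that vary per request, or the cache is too small for the working set.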

Interaction with Other Services

API Gateway integrates with CloudWatch for logging and monitoring, with WAF for web application firewall rules, and with Lambda for serverless backends. Caching and throttling work transparently with these integrations. For example, you can cache responses from Lambda functions, reducing invocation count and cost.

Walk-Through

1

Client sends request

The client makes an HTTP request to the API Gateway endpoint. The request includes method, path, headers, and query parameters. API Gateway receives the request and begins processing.

2

Check throttling limits

API Gateway checks the account-level and method-level throttling limits using the token bucket algorithm. If a token bucket is empty, the request is rejected with a 429 Too Many Requests response, and the client should retry with exponential backoff. Otherwise, a token is consumed and processing continues.

3

Check cache for cached response

If caching is enabled for the stage and method, API Gateway constructs a cache key from the request and checks the cache for a valid (non-expired) entry. If it finds one, it returns the cached response immediately, and the request is counted in the CacheHitCount metric. If not, the request is counted as a miss (CacheMissCount) and proceeds to the backend.

4

Forward request to backend

API Gateway forwards the request to the configured backend (e.g., Lambda, HTTP endpoint, or AWS service). The backend processes the request and returns a response. This step is skipped if a cache hit occurred.

5

Store response in cache

If caching is enabled for the method (by default only GET methods have caching enabled) and the response is cacheable (status code 200-299), API Gateway stores the response in the cache with the configured TTL and then returns it to the client.

6

Return response to client

API Gateway sends the response (either cached or from the backend) back to the client. To see how many responses came from the cache, use the CacheHitCount and CacheMissCount CloudWatch metrics.
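The six steps can be condensed into a toy Python model that makes the ordering visible: the throttle check runs before the cache lookup, so cache hits still use tokens. `MiniGateway` is hypothetical, keys the cache on the path only, and leaves out the account-level and usage plan layers.

```python
import time
from typing import Callable, Dict, Tuple


class MiniGateway:
    """Toy model of steps 1-6: throttle, then cache lookup, then backend on a miss."""

    def __init__(self, backend: Callable[[str], str], rate: float, burst: int,
                 ttl_seconds: int = 300, clock: Callable[[], float] = time.monotonic) -> None:
        self.backend = backend
        self.rate, self.burst, self.ttl = rate, burst, ttl_seconds
        self.clock = clock
        self.tokens, self.last = float(burst), clock()
        self.cache: Dict[str, Tuple[float, str]] = {}
        self.backend_calls = 0

    def handle(self, path: str) -> Tuple[int, str]:
        now = self.clock()
        # Step 2: token bucket check; it runs before the cache, so hits also cost a token.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            return 429, "Too Many Requests"
        self.tokens -= 1
        # Step 3: cache lookup keyed on the path only (a simplification).
        entry = self.cache.get(path)
        if entry is not None and now < entry[0]:
            return 200, entry[1]
        # Steps 4-5: call the backend and store the response for ttl seconds.
        self.backend_calls += 1
        body = self.backend(path)
        if self.ttl > 0:
            self.cache[path] = (now + self.ttl, body)
        # Step 6: return the response to the client.
        return 200, body
```

With a burst of 3, four back-to-back requests produce one backend call, two cache hits, and one 429.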

What This Looks Like on the Job

Enterprise Scenario 1: E-commerce Product Catalog

A large e-commerce company uses API Gateway to expose product catalog APIs to their mobile app and website. The product data changes infrequently (every few hours). To reduce latency and backend load, they enable caching with a TTL of 300 seconds and a cache size of 13.5 GB. The cache key includes the product ID and locale. This results in a 90% cache hit rate, reducing backend calls from 4,000 RPS to 400 RPS. They also set method-level throttling at 5,000 RPS with a burst of 2,500; because cache hits still count against the throttle, the limit is sized above the full 4,000 RPS client rate rather than the 400 RPS that reaches the backend. They monitor the cache hit ratio via CloudWatch and adjust the TTL based on data update frequency. Misconfiguration: initially they used a TTL of 3600 seconds, causing stale prices during flash sales. They reduced the TTL to 60 seconds during sales events.

Enterprise Scenario 2: Financial Data API for Partners

A fintech company provides real-time stock prices via API Gateway to partner applications. They use usage plans with API keys to enforce different throttling limits per partner: premium partners get 1,000 RPS with a burst of 500, and standard partners get 100 RPS with a burst of 50. They disable caching because stock prices are real-time and must not be stale. Their combined partner traffic stays within the default account-level limit of 10,000 RPS. They also enable AWS WAF to block malicious IPs. When a partner exceeds their limit, they receive a 429 Too Many Requests response and are expected to retry with backoff. The company monitors usage via CloudWatch metrics and adjusts limits based on partner SLAs.

Enterprise Scenario 3: IoT Device Telemetry

An IoT company ingests telemetry data from millions of devices via API Gateway. Each device sends a JSON payload every minute. They use a REST API, requested an increase of their account-level throttling quota to 50,000 RPS, and set method-level throttling to 10,000 RPS per method. They enable caching for device configuration endpoints but not for data ingestion. They use API keys per device for throttling via usage plans. They also use request validation to reject malformed payloads early. Misconfiguration: initially they set method-level throttling too low (1,000 RPS), causing many 429 errors during device firmware updates. They increased it to 10,000 RPS after monitoring.

How SAA-C03 Actually Tests This

The SAA-C03 exam tests API Gateway caching and throttling under Objective 3.7 (High Performance). Key areas:

1.

Caching TTL and cache keys: Know the default TTL (300 seconds), that you can customize cache keys, and that caching is provisioned per stage, with optional per-method overrides. A common wrong answer is that you provision a separate cache per API or per resource; the cache belongs to the stage.

2.

Throttling limits: Remember the default account-level limits (10,000 RPS, burst 5,000). The exam often asks: 'How can you protect a backend from a traffic spike?' The usual answer is to configure stage- or method-level throttling rather than relying on the account-level limit alone. Do not confuse this with AWS WAF rate-based rules: they block individual IP addresses that exceed a request count over a time window, which complements API Gateway throttling but does not replace it.

3.

Usage plans vs. throttling: Usage plans allow per-customer throttling and quotas using API keys. A common wrong answer is that usage plans are only for billing – they also enforce throttling. Also, usage plans are independent of method-level throttling; both can apply.

4.

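Cache invalidation: You can flush the entire stage cache with the FlushStageCache API (`aws apigateway flush-stage-cache`) or the console's Flush entire cache option. A client can refresh a single entry by sending the request with a `Cache-Control: max-age=0` header; if the method requires authorization for cache control, the caller needs the `execute-api:InvalidateCache` IAM permission. The exam might ask: 'How do you clear a specific cached response?' Answer: send the request with `Cache-Control: max-age=0` from an authorized caller. Setting the TTL to 0 turns caching off for the method rather than clearing one entry.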
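Per-entry invalidation is just an ordinary request with one extra header. A minimal Python sketch using the standard library's `urllib.request.Request`; it builds the request without sending it, the invoke URL (API ID, Region, stage, and path) is a hypothetical placeholder, and the request signing you would need for `execute-api:InvalidateCache` authorization (for example SigV4) is deliberately left out.

```python
import urllib.request


def build_invalidation_request(invoke_url: str) -> urllib.request.Request:
    """Build (but do not send) a GET asking API Gateway to refresh one cache entry.

    Cache-Control: max-age=0 tells API Gateway to skip the cached entry for this
    cache key, call the backend, and cache the fresh response. Signing is omitted.
    """
    return urllib.request.Request(
        invoke_url,
        method="GET",
        headers={"Cache-Control": "max-age=0"},
    )


# Hypothetical invoke URL: API ID, Region, stage, and path are placeholders.
req = build_invalidation_request(
    "https://abc123.execute-api.us-east-1.amazonaws.com/prod/items?id=42"
)
```

Only the entry matching this request's cache key is refreshed; other cached entries are untouched.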

5.

Edge cases: Throttling applies per Region, not globally. If you have APIs in multiple Regions, each Region has its own account-level token bucket. Private REST APIs can use caching like the other REST API endpoint types. Another trap: caching is available only for REST APIs; HTTP APIs and WebSocket APIs do not support it.

6.

Numbers to memorize: Default throttling: 10,000 RPS, 5,000 burst. Default cache TTL: 300 seconds. Cache sizes: 0.5 GB to 237 GB. Minimum TTL: 0 (disable caching). Maximum TTL: 3600 seconds.

To eliminate wrong answers, understand the mechanism: caching reduces backend load but does not affect throttling at the API Gateway level. Throttling uses token bucket, not leaky bucket. Usage plans are for API key-based clients, not for all clients.

Key Takeaways

Default API Gateway throttling: 10,000 requests per second with a burst capacity of 5,000 per region.

Default cache TTL is 300 seconds (5 minutes); can be set from 0 to 3600 seconds.

Caching is provisioned per stage, with optional per-method overrides; there is no separate cache per API or per resource.

Cache keys are built from the request path plus the method request parameters (query strings, headers, path parameters) you select; customize them via the integration's cacheKeyParameters.

Throttling uses a token bucket algorithm; burst capacity allows short spikes.

Usage plans enforce per-key throttling and quotas, separate from method-level throttling.

Cache invalidation: flush the entire stage cache with the FlushStageCache API, or refresh a single entry by sending `Cache-Control: max-age=0` from an authorized client.

Throttling limits are independent per region; each region has its own token bucket.

Caching does not reduce throttling token consumption; each request still counts.

WAF rate-based rules are different from API Gateway throttling: they block IP addresses that exceed a request threshold, while throttling limits request rates per account, stage, method, or API key.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

API Gateway Caching

Caches responses at the API Gateway level, before reaching backend

Cache key includes the request path plus selected query strings, headers, and path parameters

Cache size up to 237 GB, TTL up to 3600 seconds

Only for REST APIs, not HTTP or WebSocket APIs

Cache is in-memory, managed by API Gateway

CloudFront Caching

Caches at the edge location, closer to the client

Cache key includes URL, query string, headers, cookies (configurable)

Cache size unlimited (uses edge storage), TTL up to 1 year

Works with any origin (HTTP, S3, ALB, etc.)

Cache is on disk at edge locations, managed by CloudFront

Watch Out for These

Mistake

Caching reduces the number of requests that count against throttling limits.

Correct

Caching reduces backend invocations but does not reduce the number of requests counted against API Gateway throttling. Each request, even if cached, consumes a token from the token bucket.

Mistake

Throttling limits are applied globally across all regions.

Correct

Throttling limits are per region. Each region has its own token bucket. You must configure throttling separately for each region.

Mistake

You can set throttling limits on individual API keys without usage plans.

Correct

Per-key throttling requires a usage plan. Without a usage plan, API keys are only used for identification, not throttling.

Mistake

Cache TTL can be set to any value.

Correct

Cache TTL must be between 0 and 3600 seconds (1 hour). The default is 300 seconds. You cannot set TTL to, say, 7200 seconds.

Mistake

Caching works for all HTTP methods.

Correct

By default, caching only works for GET methods. For other methods (POST, PUT, DELETE), you must explicitly enable caching in the method settings. Even then, caching is typically not recommended for non-idempotent methods.


Frequently Asked Questions

What is the default TTL for API Gateway caching?

The default TTL is 300 seconds (5 minutes). You can set it between 0 (no caching) and 3600 seconds (1 hour). A common exam tip: if a question asks for the default TTL, remember it's 300 seconds, not 0 or 3600.

How do I clear the entire API Gateway cache?

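You can flush the entire cache with the FlushStageCache API (`aws apigateway flush-stage-cache`). In the console, choose 'Flush entire cache'. This invalidates all cached entries immediately. To refresh an individual entry, a client sends the request with a `Cache-Control: max-age=0` header (authorized with the `execute-api:InvalidateCache` permission if the method requires it). Setting the TTL to 0 turns caching off for the method rather than clearing one entry.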

Can I use caching with API Gateway WebSocket APIs?

No. API Gateway caching is available only for REST APIs; neither HTTP APIs nor WebSocket APIs support it. If you need caching for a WebSocket API, consider a different architecture, such as a Lambda function backed by ElastiCache.

What is the difference between account-level and method-level throttling?

Account-level throttling applies to all APIs in a region (default 10,000 RPS). Method-level throttling applies to a specific method (e.g., GET /items) and can be set lower than account-level to protect that endpoint. Both use token bucket algorithm. Method-level limits are checked first, then account-level.

How can I set different throttling limits for different customers?

Use usage plans with API keys. Create a usage plan with specific rate and burst limits, and associate it with API keys. Distribute the keys to customers. API Gateway enforces the plan limits per key. This allows tiered access (e.g., free tier: 100 RPS, premium: 1000 RPS).

Does API Gateway caching reduce costs?

Yes, caching reduces the number of backend invocations, which can lower costs for Lambda or other pay-per-use backends. However, you pay for the cache size (hourly cost) and for data transfer out. For high-volume, read-heavy APIs, caching often reduces total cost.

Can I use CloudFront with API Gateway for additional caching?

Yes, you can place CloudFront in front of API Gateway. CloudFront caches responses at edge locations, reducing latency further. However, API Gateway caching still applies at the regional level. You can use both, but be careful with TTL settings to avoid stale data. CloudFront can also provide DDoS protection via AWS Shield.
