SAA-C03 · Chapter 115 of 189 · Objective 2.3

ElastiCache Redis Cluster Mode and Replication

Why do cluster mode and replication in Amazon ElastiCache for Redis matter when you design resilient, high-performance caching and session storage on AWS? This chapter explains how Redis cluster mode shards data across multiple nodes for horizontal scalability and how replication provides high availability through automatic failover. ElastiCache is a recurring SAA-C03 topic, and cluster mode and replication are the parts of it most likely to appear in scenario questions. Understanding them helps both on the exam and in real-world production deployments.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

ElastiCache Redis Cluster: A Postal Sorting Office

Picture a large postal sorting office that handles millions of letters a day. The office has multiple sorting stations (nodes), each responsible for a specific range of zip codes (hash slots). When a letter arrives, the clerk (the cluster-aware client) reads the zip code, works out which station handles it, and routes the letter directly to that station, which stores it in its local bin (memory). If a station receives too many letters, the office manager can split its zip code range and add a new station to share the load. Each station also has a backup assistant (replica) that copies the letters the primary station processes. If the primary station breaks down, the assistant takes over within seconds; because copying lags slightly behind, the last few letters may not have been copied yet. The front desk (configuration endpoint) is where a clerk goes on first connecting to get the map of which station handles which zip codes (cluster topology), and the clerk keeps a local copy. The front desk does not push updates. When the map changes (e.g., a new station is added), a station that receives a misrouted letter tells the clerk where it now belongs, and the clerk updates the map.

How It Actually Works

What is ElastiCache Redis Cluster Mode and Replication?

Amazon ElastiCache for Redis offers two deployment modes: cluster mode disabled (single shard) and cluster mode enabled (multi-shard). Cluster mode enabled provides horizontal scalability by automatically partitioning your data across up to 500 shards on Redis engine 5.0.6 or later. The underlying limit is 500 nodes per cluster (90 on earlier engine versions), so 500 shards is only reachable with no replicas; with five replicas per shard the ceiling is 83 shards. Each shard is a Redis node group consisting of a primary node and up to five read replicas. Replication provides high availability: if the primary node fails, a replica is automatically promoted to primary, keeping downtime to a minimum.

How Redis Cluster Mode Works Internally

Redis cluster uses a hash slot mechanism to distribute data. The entire keyspace is divided into 16,384 hash slots. Each node in the cluster is responsible for a subset of these slots. When a key is written, the client computes the CRC16 hash of the key modulo 16,384 to determine which slot the key belongs to. The client then routes the request to the node responsible for that slot. This sharding logic is implemented on the client side using the cluster's slot-to-node mapping.
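The slot calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than ElastiCache client code: the function names `crc16_xmodem` and `key_slot` are made up for this example, and hash tags are deliberately ignored. Redis Cluster uses the CRC16 variant commonly called XMODEM (polynomial 0x1021, initial value 0).

```python
HASH_SLOTS = 16384  # total number of hash slots in a Redis cluster


def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16, XMODEM variant (polynomial 0x1021, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to its hash slot. Hash tags are not handled in this sketch."""
    return crc16_xmodem(key.encode("utf-8")) % HASH_SLOTS


if __name__ == "__main__":
    for key in ("user:1000", "cart:42", "foo"):
        print(key, "->", key_slot(key))
```

In practice a cluster-aware client library performs this calculation for you; the sketch only shows why the same key always lands on the same shard.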

Key Components and Defaults

Shard (Node Group): A set of one primary and up to five replicas. Each shard stores a subset of the data.

Node: An individual managed Redis server. Node types range from small burstable types such as cache.t2.micro to large memory-optimized types such as cache.r5.24xlarge (about 635 GiB of memory).

Hash Slot: 16,384 total. Redis cluster automatically assigns slots to shards.

Configuration Endpoint: A DNS endpoint that always points to the cluster's configuration. Clients use this to discover the current slot-to-node mapping.

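Primary Endpoint: Cluster mode disabled only. Always points to the current primary node, including after a failover.

Reader Endpoint: Cluster mode disabled only. Distributes read connections among the replicas in the shard. With cluster mode enabled, clients use the configuration endpoint, and a cluster-aware client can be configured to read from replicas.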

Default Replication: Async replication. The primary sends updates to replicas and does not wait for acknowledgment.

Failover Time: Typically under 30 seconds for automatic failover.

Auto-Failover: Required for cluster mode enabled. For cluster mode disabled it is optional: you must enable Multi-AZ with automatic failover, and the shard needs at least one replica.
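To make the relationship between hash slots and shards concrete, the sketch below splits the 16,384 slots into contiguous, near-equal ranges. The function name `even_slot_ranges` is hypothetical, and an even split is an assumption for illustration; the ranges ElastiCache actually assigns to a given cluster may differ, and you can also specify a custom distribution.

```python
HASH_SLOTS = 16384


def even_slot_ranges(num_shards: int) -> list[tuple[int, int]]:
    """Split slots 0..16383 into contiguous ranges whose sizes differ by at most one."""
    if not 1 <= num_shards <= HASH_SLOTS:
        raise ValueError("num_shards must be between 1 and 16384")
    base, extra = divmod(HASH_SLOTS, num_shards)
    ranges = []
    start = 0
    for shard in range(num_shards):
        # The first `extra` shards absorb the remainder, one slot each.
        size = base + (1 if shard < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    for index, (first, last) in enumerate(even_slot_ranges(3)):
        print(f"shard {index}: slots {first}-{last}")
```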

How Data Flows in Cluster Mode

Client connects to the configuration endpoint and retrieves the cluster topology (slot-to-node mapping).

Client caches this mapping locally.

For every read/write operation, client computes the hash slot for the key and sends the request directly to the appropriate node.

If the topology changes (e.g., a node fails), the client receives a MOVED or ASK redirection error and updates its mapping.

Replication: The primary node asynchronously replicates writes to its replicas. Replicas can serve reads when the client is configured to read from them (a cluster-aware client issues READONLY on its replica connections).
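The cached mapping in the steps above is essentially a lookup table from slot ranges to node addresses. A minimal sketch, assuming a hypothetical `SlotMap` class and made-up node names (real clients build this table from the cluster's topology response):

```python
import bisect


class SlotMap:
    """Client-side cache of slot ranges and the primary that owns each range."""

    def __init__(self, ranges):
        # ranges: iterable of (first_slot, last_slot, node_address) tuples
        self._ranges = sorted(ranges)
        self._starts = [first for first, _, _ in self._ranges]

    def node_for_slot(self, slot: int) -> str:
        # Find the last range whose first slot is <= the requested slot.
        index = bisect.bisect_right(self._starts, slot) - 1
        if index < 0:
            raise KeyError(f"no node cached for slot {slot}")
        first, last, node = self._ranges[index]
        if not first <= slot <= last:
            raise KeyError(f"no node cached for slot {slot}")
        return node


if __name__ == "__main__":
    # Hypothetical three-shard topology.
    slot_map = SlotMap([
        (0, 5461, "shard-a:6379"),
        (5462, 10922, "shard-b:6379"),
        (10923, 16383, "shard-c:6379"),
    ])
    print(slot_map.node_for_slot(12182))
```

Because the lookup is local, a request normally costs one network round trip; the client only talks to the cluster about topology when the cache turns out to be stale.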

Replication Details

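Redis replication is asynchronous by default. The primary node sends a stream of commands to its replicas. Replicas periodically report how much of the stream they have processed, but the primary does not wait for those acknowledgments before replying to clients. This means recent writes can be lost if the primary fails before a replica has received them. ElastiCache limits the impact with automatic failover and, when Multi-AZ is enabled, by placing replicas in different Availability Zones from their primary. For cluster mode disabled, you must enable Multi-AZ with automatic failover yourself. For cluster mode enabled, automatic failover is mandatory, and Multi-AZ placement is a separate setting that you should turn on.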
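The data-loss window can be illustrated with a toy model. Everything here (`ToyPrimary`, `ToyReplica`, `writes_lost_on_failover`) is a hypothetical simulation, not Redis or ElastiCache code; it only shows that a write acknowledged to the client may not yet exist on the replica that gets promoted.

```python
class ToyPrimary:
    """Acknowledges writes immediately, before any replica has copied them."""

    def __init__(self):
        self.log = []  # every write acknowledged to a client, in order

    def write(self, command: str) -> str:
        self.log.append(command)
        return "OK"  # reply is sent without waiting for replicas


class ToyReplica:
    """Copies the primary's log in batches, so it can fall behind."""

    def __init__(self):
        self.applied = 0  # number of log entries copied so far

    def sync(self, primary: ToyPrimary, max_entries: int) -> None:
        self.applied = min(len(primary.log), self.applied + max_entries)


def writes_lost_on_failover(primary: ToyPrimary, replica: ToyReplica) -> list:
    """Writes the client saw succeed that the promoted replica never received."""
    return primary.log[replica.applied:]


if __name__ == "__main__":
    primary, replica = ToyPrimary(), ToyReplica()
    for i in range(5):
        primary.write(f"SET k{i} v")
    replica.sync(primary, max_entries=3)  # replica is two writes behind
    print(writes_lost_on_failover(primary, replica))
```

This is why a cache or session store on Redis should tolerate losing the most recent writes, and why replication lag is worth monitoring.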

Configuration and Verification

You can create a cluster with cluster mode enabled using the AWS CLI:

aws elasticache create-replication-group \
    --replication-group-id my-redis-cluster \
    --replication-group-description "Redis cluster with shards" \
    --engine redis \
    --engine-version 7.0 \
    --cache-node-type cache.r5.large \
    --num-node-groups 3 \
    --replicas-per-node-group 2 \
    --multi-az-enabled \
    --automatic-failover-enabled \
    --cache-parameter-group-name default.redis7.cluster.on

To verify cluster mode status:

aws elasticache describe-replication-groups --replication-group-id my-redis-cluster

Look for the ClusterEnabled field: true means cluster mode enabled.
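If you script this check, the same fields can be read from the response dictionary. The sketch below assumes the response shape returned by `describe-replication-groups` (for example through boto3's `describe_replication_groups`); the helper name `summarize_replication_group` and the sample data are made up for illustration, and no AWS call is made.

```python
def summarize_replication_group(response: dict) -> dict:
    """Pull the cluster-mode and failover fields out of a describe response."""
    group = response["ReplicationGroups"][0]
    node_groups = group.get("NodeGroups", [])
    return {
        "cluster_mode_enabled": group.get("ClusterEnabled", False),
        "automatic_failover": group.get("AutomaticFailover"),
        "multi_az": group.get("MultiAZ"),
        "shards": len(node_groups),
        "nodes": sum(len(ng.get("NodeGroupMembers", [])) for ng in node_groups),
    }


if __name__ == "__main__":
    # Hand-written sample: 3 shards, each with 1 primary and 2 replicas.
    sample = {
        "ReplicationGroups": [{
            "ReplicationGroupId": "my-redis-cluster",
            "ClusterEnabled": True,
            "AutomaticFailover": "enabled",
            "MultiAZ": "enabled",
            "NodeGroups": [
                {"NodeGroupId": f"000{n}", "NodeGroupMembers": [{}, {}, {}]}
                for n in (1, 2, 3)
            ],
        }]
    }
    print(summarize_replication_group(sample))
```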

Interaction with Related Technologies

VPC: ElastiCache clusters must be deployed in a VPC. Use security groups to control access.

EC2: Applications running on EC2 connect to ElastiCache via endpoints. Use same VPC or VPC peering.

Lambda: Lambda functions can connect to ElastiCache if deployed in the same VPC with appropriate security group rules.

CloudWatch: Monitor metrics like CPUUtilization, CacheHits, CacheMisses, CurrConnections, ReplicationLag.

Backup and Restore: You can take snapshots of the cluster. When you restore a cluster mode enabled backup into a new cluster, you can choose a different node type and number of shards, provided the new nodes have enough memory for the data.

Important Timers and Thresholds

Replication Lag: The delay between a write on the primary and its application on the replica. Monitored via CloudWatch ReplicationLag metric. High lag indicates network issues or overloaded replicas.

Failover Time: Typically 10-30 seconds. The time includes detection of primary failure, replica election, and DNS update.

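Connection Timeout: A client-side setting rather than an ElastiCache default. Jedis, for example, defaults to 2 seconds. Set it explicitly so that a failed node is detected quickly.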

Exam Tips

Cluster mode enabled is required for horizontal scaling beyond a single shard's capacity (e.g., more than about 635 GiB, the memory of a cache.r5.24xlarge node).

Cluster mode disabled supports up to 5 replicas per shard but only one shard.

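Automatic failover is mandatory for cluster mode enabled; you cannot disable it. Multi-AZ placement is a separate setting, which you should enable so that replicas sit in different AZs from their primary.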

For cluster mode disabled, you must explicitly enable Multi-AZ to get automatic failover.

Redis AOF (Append-Only File) persistence is not supported in cluster mode enabled. Only Redis RDB snapshots are available.

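You can change the number of shards after creation: cluster mode enabled supports online resharding (adding or removing shards and rebalancing slots) on Redis 3.2.10 and later while the cluster keeps serving requests.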

The MOVED redirection happens when a client sends a request to the wrong node. The node responds with the correct node endpoint.

The ASK redirection happens during slot migration when the slot is being moved to another node. The client must first send an ASKING command to the target node.
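Both redirections arrive as error replies of the form `MOVED <slot> <host>:<port>` or `ASK <slot> <host>:<port>`. A minimal parsing sketch follows; the `Redirect` type and `parse_redirect` function are hypothetical names, and a real client library does this internally.

```python
from typing import NamedTuple, Optional


class Redirect(NamedTuple):
    kind: str  # "MOVED" (permanent) or "ASK" (one-off, during slot migration)
    slot: int
    host: str
    port: int


def parse_redirect(error: str) -> Optional[Redirect]:
    """Parse a redirection error reply; return None for any other error text.

    The leading '-' of the wire-level error reply is assumed to be stripped.
    """
    parts = error.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))


if __name__ == "__main__":
    print(parse_redirect("MOVED 3999 127.0.0.1:6381"))
    print(parse_redirect("ASK 3999 127.0.0.1:6381"))
```

The practical difference is what the client does next: after MOVED it updates its cached slot map, while after ASK it sends `ASKING` followed by the command to the target node for that one request and leaves the map unchanged.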

Summary of Cluster Mode vs. Disabled

Cluster Mode Disabled: Single shard (one primary, up to 5 replicas). Capacity is limited to the memory of one node (about 635 GiB with r5.24xlarge). Automatic failover requires Multi-AZ with automatic failover to be enabled. Simpler client logic.

Cluster Mode Enabled: Multiple shards, each with up to 5 replicas, within a limit of 500 nodes per cluster (so up to 500 shards with no replicas, or 83 shards with 5 replicas each). Supports horizontal scaling beyond a single node's memory. Automatic failover always enabled. Requires a Redis cluster client library.
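The trade-off between shard count and replica count follows from the per-cluster node limit. The arithmetic can be sketched as below, assuming the documented limit of 500 nodes per cluster on Redis 5.0.6 and later; the constant and function names are made up for this example.

```python
MAX_NODES_PER_CLUSTER = 500  # assumed limit for Redis engine 5.0.6 and later
MAX_REPLICAS_PER_SHARD = 5


def max_shards(replicas_per_shard: int) -> int:
    """Largest shard count that fits under the node limit.

    Each shard consumes one primary plus its replicas.
    """
    if not 0 <= replicas_per_shard <= MAX_REPLICAS_PER_SHARD:
        raise ValueError("replicas_per_shard must be between 0 and 5")
    return MAX_NODES_PER_CLUSTER // (1 + replicas_per_shard)


if __name__ == "__main__":
    for replicas in range(MAX_REPLICAS_PER_SHARD + 1):
        print(f"{replicas} replicas per shard -> up to {max_shards(replicas)} shards")
```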

Walk-Through

Client connects to configuration endpoint

The application client first connects to the ElastiCache configuration endpoint, a single DNS name for the whole cluster that resolves to its nodes. The client sends a `CLUSTER SLOTS` command (or `CLUSTER SHARDS` on Redis 7) to retrieve the mapping of hash slots to nodes. Any node can answer, because every node holds a copy of the topology. The client caches this mapping locally for subsequent requests.

Client computes hash slot for key

For every read or write operation, the client computes the hash slot for the key using the CRC16 algorithm modulo 16384. For example, the key 'user:1000' might map to slot 1234. The client then looks up its cached mapping to find which node is responsible for that slot. If the mapping is missing or stale, the client may send the request to the wrong node.

Client sends request to correct node

The client sends the command directly to the node responsible for the hash slot. If the node is the primary for that shard, it processes the command. If the command is a write, the primary updates its local data and then asynchronously replicates the write to its replicas. The primary does not wait for replica acknowledgment before responding to the client.

Node returns MOVED or ASK redirection if needed

If the client's cached mapping is outdated and it sends the request to a node that no longer owns the slot, that node responds with a `MOVED` error containing the correct node's endpoint. The client then updates its mapping and retries the command. During slot migration, an `ASK` redirection may occur, requiring the client to first send an `ASKING` command to the target node.
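The update-and-retry behavior can be sketched with a toy in-memory cluster. All names here (`ToyNode`, `ToyCluster`, `MovedError`, `set_with_redirects`) are hypothetical and no Redis connection is involved; ASK handling is left out to keep the example short.

```python
class MovedError(Exception):
    """Stand-in for a MOVED reply: the slot lives on another node."""

    def __init__(self, slot: int, node: str):
        super().__init__(f"MOVED {slot} {node}")
        self.slot = slot
        self.node = node


class ToyNode:
    def __init__(self, name: str, slots):
        self.name = name
        self.slots = set(slots)
        self.data = {}


class ToyCluster:
    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}

    def owner(self, slot: int) -> str:
        for node in self.nodes.values():
            if slot in node.slots:
                return node.name
        raise KeyError(f"slot {slot} is unassigned")

    def execute_set(self, node_name: str, slot: int, key: str, value: str) -> str:
        node = self.nodes[node_name]
        if slot not in node.slots:
            raise MovedError(slot, self.owner(slot))  # wrong node: redirect
        node.data[key] = value
        return "OK"


def set_with_redirects(cluster, slot_cache, slot, key, value, max_redirects=3):
    """Send SET to the cached owner, following MOVED and refreshing the cache."""
    node = slot_cache[slot]
    for _ in range(max_redirects + 1):
        try:
            return cluster.execute_set(node, slot, key, value)
        except MovedError as err:
            slot_cache[err.slot] = err.node  # repair the stale mapping
            node = err.node
    raise RuntimeError("too many redirects")


if __name__ == "__main__":
    cluster = ToyCluster([ToyNode("a", {1}), ToyNode("b", {2})])
    cache = {2: "a"}  # stale: slot 2 has moved to node b
    print(set_with_redirects(cluster, cache, 2, "k", "v"), cache)
```

Capping the number of redirects is a common safeguard so that a client does not loop indefinitely while the topology is still changing.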

Primary failure triggers automatic failover

When a primary node fails (e.g., due to an AZ outage), ElastiCache detects the failure via health checks. The service selects a replica in the same shard, favoring the one with the least replication lag, and promotes it to primary (in standalone Redis terms, the equivalent of running `REPLICAOF NO ONE`). With cluster mode disabled, the primary endpoint's DNS record is updated to point to the new primary. With cluster mode enabled, the cluster topology is updated and clients pick up the change on their next redirect or topology refresh. The entire process typically completes within 10-30 seconds.

Replica starts replicating from new primary

After failover, the remaining replicas (if any) start replicating from the new primary. The new primary begins accepting writes and replicating them asynchronously. When the failed node is recovered or replaced, it rejoins the shard as a replica of the new primary. Cluster-aware clients find the new primary by refreshing their slot map, for example after a connection error or a MOVED response.
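AWS describes failover as promoting the replica with the least replication lag. The selection logic itself is internal to the service, so the following is only a sketch of that idea; `ReplicaStatus`, `choose_new_primary`, and the node IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    node_id: str
    replication_lag_seconds: float
    healthy: bool = True


def choose_new_primary(replicas: list) -> ReplicaStatus:
    """Pick the healthy replica that is closest to the failed primary's state."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available for promotion")
    return min(candidates, key=lambda r: r.replication_lag_seconds)


if __name__ == "__main__":
    shard = [
        ReplicaStatus("my-redis-cluster-0001-002", 0.8),
        ReplicaStatus("my-redis-cluster-0001-003", 0.2),
        ReplicaStatus("my-redis-cluster-0001-004", 0.1, healthy=False),
    ]
    print(choose_new_primary(shard).node_id)
```

Choosing the least-lagged replica minimizes, but does not eliminate, the number of acknowledged writes that are lost in a failover.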

What This Looks Like on the Job

Enterprise Scenario 1: E-commerce Session Store

A large e-commerce platform uses ElastiCache Redis cluster mode to store user session data. With millions of concurrent users, a single-shard Redis would be insufficient in terms of memory and throughput. They deploy a cluster with 10 shards, each with 2 replicas, using cache.r5.large nodes. The cluster is spread across 3 AZs for high availability. The application uses a Redis cluster-aware client (e.g., Lettuce or Jedis) that handles slot mapping and redirections. During peak shopping seasons, the cluster handles over 1 million writes per second with sub-millisecond latency. If a primary node fails, automatic failover promotes a replica within seconds, and the session data remains available. The ops team monitors ReplicationLag and CacheMisses to detect issues. Misconfiguration: Initially, they used cluster mode disabled with a single shard, causing memory exhaustion and high latency. Migrating to cluster mode resolved the scalability bottleneck.

Enterprise Scenario 2: Real-time Leaderboard

A gaming company uses ElastiCache Redis to maintain real-time leaderboards for millions of players. They use Redis sorted sets to store player scores. With cluster mode enabled, they shard data by player region to keep related data together. They use hash tags (e.g., {region}:player:123) to ensure all keys for a region are in the same shard, which allows multi-key operations on that region's data. The cluster has 20 shards across 3 AZs. They configure read replicas to handle read-heavy leaderboard queries. During a game launch, write traffic spikes; the cluster scales by adding shards, and online resharding migrates hash slots to the new shards while the cluster stays available. A common mistake is not using hash tags, which causes multi-key operations to fail when the keys land in different slots. They also learned to set appropriate client timeouts to avoid connection buildup.
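The hash tag rule used in this scenario can be sketched as follows: when a key contains `{...}` with at least one character between the first `{` and the next `}`, only that substring is hashed. The helper names are made up for this example, and the CRC comes from Python's standard `binascii.crc_hqx`, which computes the same CRC16 variant when started from 0.

```python
import binascii

HASH_SLOTS = 16384


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is hashed, honoring a {hash tag} if present."""
    start = key.find(b"{")
    if start == -1:
        return key
    end = key.find(b"}", start + 1)
    if end == -1 or end == start + 1:  # no closing brace, or empty "{}"
        return key
    return key[start + 1:end]


def key_slot(key: str) -> int:
    return binascii.crc_hqx(hash_tag(key.encode("utf-8")), 0) % HASH_SLOTS


if __name__ == "__main__":
    # Hypothetical key names: both share the tag "emea", so they share a slot.
    for key in ("{emea}:player:123", "{emea}:leaderboard", "player:123"):
        print(key, "->", key_slot(key))
```

The cost of this technique is uneven load: every key that shares a tag lands on one shard, so a very popular tag can create a hot shard.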

Enterprise Scenario 3: Caching for Microservices

A financial services company uses ElastiCache as a cache for microservices that access a relational database. They use cluster mode enabled to provide a large cache pool (e.g., 500 GB total). Each microservice connects to the same cluster but uses key prefixes to avoid collisions. They use ElastiCache's encryption at rest and in transit to meet compliance requirements. They also enable automatic backups with 7-day retention. A critical performance consideration is that adding replicas increases read throughput but does not increase write throughput (writes still go to primary). They monitor network bandwidth and CPU to ensure nodes are not overloaded. Misconfiguration: They initially placed all replicas in the same AZ, causing data loss risk during AZ failure. They later spread replicas across AZs.

How SAA-C03 Actually Tests This

SAA-C03 Objective Coverage

This topic maps to SAA-C03 Objective 2.3: "Design high-performing and scalable storage solutions." Specifically, it tests your ability to choose between cluster mode enabled and disabled, understand replication and failover, and configure for high availability. Exam questions often present a scenario with specific requirements for scalability, availability, and latency.

Common Wrong Answers and Why

Choosing cluster mode disabled for horizontal scaling: Candidates see that Redis supports up to 5 replicas and think that means 5 shards. Wrong: replicas are copies, not shards. Cluster mode disabled only has one shard. For horizontal scaling, you need cluster mode enabled.

Confusing automatic failover with Multi-AZ: Candidates treat the two as the same switch. For cluster mode enabled, automatic failover is mandatory and cannot be disabled, while Multi-AZ placement is a separate setting that you should enable so replicas sit in different AZs. For cluster mode disabled, you must explicitly enable Multi-AZ with automatic failover.

Thinking that adding replicas increases write throughput: Replicas only handle read traffic. Writes always go to the primary. Adding replicas does not increase write capacity.

Believing that Redis cluster mode supports AOF persistence: It does not. Only RDB snapshots are supported in cluster mode enabled.

Specific Values and Terms on the Exam

16,384 hash slots.

500 shards maximum on Redis 5.0.6 and later; the underlying limit is 500 nodes per cluster (90 on earlier versions).

5 replicas per shard maximum.

Automatic failover time: typically under 30 seconds.

Configuration endpoint vs. primary endpoint vs. reader endpoint.

MOVED vs. ASK redirections.

Hash tags: use curly braces to force keys into the same slot.

Edge Cases and Exceptions

Changing the number of shards does not require a new cluster. Cluster mode enabled supports online resharding (Redis 3.2.10 and later), although latency can rise while slots migrate.

To scale storage, you can scale up to a larger node type or scale out by adding shards; both can be done online on current engine versions.

In both modes you can add or remove replicas on an existing replication group, up to five per shard, without replacing it.

Redis executes commands on a single thread per node, so one node cannot spread command processing across cores. ElastiCache offloads network I/O to additional threads on larger node types, but the way to scale command throughput is still to add shards.

How to Eliminate Wrong Answers

If the scenario requires more cache than the largest single node provides (about 635 GiB) or needs to handle more writes than a single node can, choose cluster mode enabled.

If the question mentions "horizontal scaling" or "sharding," cluster mode enabled is correct.

If the question emphasizes high availability without mentioning scaling, cluster mode disabled with Multi-AZ might be sufficient.

If the question says "automatic failover" and cluster mode disabled, look for Multi-AZ being enabled.

Pay attention to client requirements: cluster mode enabled requires a cluster-aware client.

Key Takeaways

Redis cluster mode uses 16,384 hash slots to distribute data across up to 500 shards.

Cluster mode enabled is required for horizontal scaling beyond a single node's capacity.

Automatic failover is built-in for cluster mode enabled; for disabled, enable Multi-AZ.

Replication is asynchronous; there is a risk of data loss on primary failure.

Cluster mode enabled does not support AOF persistence; only RDB snapshots.

Use hash tags (e.g., {key}) to ensure related keys are in the same shard.

Configuration endpoint provides topology; clients cache it and handle MOVED/ASK redirections.

Adding replicas increases read throughput but not write throughput.

You can modify the shard count (online resharding) and the replica count of an existing cluster mode enabled replication group.

Failover typically completes in under 30 seconds.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Cluster Mode Enabled

Supports up to 500 shards for horizontal scaling.

Data is automatically partitioned across shards using hash slots.

Automatic failover is always enabled; Multi-AZ placement is a separate, recommended setting.

Requires a cluster-aware Redis client.

Maximum total cache size scales with the number of shards (e.g., up to 500 nodes x about 635 GiB on cache.r5.24xlarge, roughly 310 TiB with no replicas).

Cluster Mode Disabled

Only one shard (one primary, up to 5 replicas).

No automatic data partitioning; all data in one shard.

Automatic failover requires explicit Multi-AZ enablement.

Works with standard Redis clients.

Maximum cache size limited to the memory of one node (e.g., about 635 GiB for r5.24xlarge).

Watch Out for These

Mistake

You must rebuild a cluster mode enabled replication group to add replicas or shards.

Correct

ElastiCache lets you increase or decrease the replica count and reshard online on an existing cluster mode enabled replication group. No new replication group or data migration is required.

Mistake

Redis cluster mode disabled supports up to 5 shards.

Correct

Cluster mode disabled supports only one shard (one primary and up to 5 replicas). For multiple shards, you must enable cluster mode.

Mistake

Automatic failover must be manually enabled for cluster mode enabled.

Correct

Automatic failover is required for cluster mode enabled and cannot be turned off. Multi-AZ placement is a separate setting that you should enable. For cluster mode disabled, you must enable Multi-AZ with automatic failover yourself.

Mistake

Redis replication is synchronous, so no data loss occurs on failover.

Correct

Redis replication is asynchronous by default. There is a window for data loss if the primary fails before replicas receive the latest writes.

Mistake

You can use AOF persistence with cluster mode enabled.

Correct

ElastiCache for Redis does not support AOF persistence when cluster mode is enabled. Only RDB snapshots are available.

Frequently Asked Questions

What is the difference between cluster mode enabled and disabled in ElastiCache Redis?

Cluster mode enabled allows you to shard data across multiple node groups (shards), up to 500, enabling horizontal scaling. Each shard has a primary and up to 5 replicas. Automatic failover is always enabled. Cluster mode disabled uses a single shard (one primary, up to 5 replicas) and does not automatically partition data. You must enable Multi-AZ for automatic failover. Cluster mode disabled is simpler but limited in scalability.

How does ElastiCache Redis handle failover?

ElastiCache monitors the health of primary nodes. If a primary fails, the service automatically promotes one of its replicas to become the new primary. DNS records are updated to point to the new primary. The failover typically completes within 10-30 seconds. For cluster mode enabled, this is automatic. For cluster mode disabled, you must have Multi-AZ enabled to get automatic failover.

What is a hash slot in Redis cluster?

A hash slot is a unit of data partitioning in Redis cluster. The entire keyspace is divided into 16,384 slots. Each key is assigned to a slot using CRC16 hash modulo 16384. Each shard is responsible for a range of slots. The client computes the slot for a key and routes the request to the appropriate shard. Hash tags (curly braces) can force multiple keys into the same slot.

Can I change the number of shards in an existing ElastiCache Redis cluster?

Yes. With cluster mode enabled on Redis 3.2.10 or later, you can add or remove shards and rebalance hash slots online while the cluster continues to serve requests. Expect some extra latency while slots migrate, and schedule large changes outside peak hours. Cluster mode disabled has a single shard by definition, so it can only be scaled up or given more replicas.

What is the difference between MOVED and ASK redirections in Redis cluster?

MOVED redirection occurs when a client sends a request to a node that does not own the hash slot. The node responds with MOVED and the correct node's address. The client should update its mapping and retry. ASK redirection occurs during slot migration when the slot is being moved. The client must first send an ASKING command to the target node before retrying the command.

Does ElastiCache Redis support encryption?

Yes, ElastiCache for Redis supports encryption at rest (using KMS) and encryption in transit (using TLS). You can enable these when creating the cluster. Encryption in transit requires clients to connect using TLS. Note that enabling encryption may impact performance slightly.

How do I monitor replication lag in ElastiCache Redis?

You can monitor the CloudWatch metric `ReplicationLag` for each replica node. This metric indicates the delay in seconds between the primary and the replica. High replication lag can indicate network issues, overloaded primary, or underprovisioned replicas. You can set CloudWatch alarms to alert when lag exceeds a threshold.
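As a sketch of such an alarm, the function below builds the parameters for CloudWatch's `PutMetricAlarm` operation (for example, `boto3.client("cloudwatch").put_metric_alarm(**params)`). The function name, alarm name, node ID, and the 5-second threshold are assumptions for illustration; choose a threshold that matches how much staleness your application tolerates. No AWS call is made in the example itself.

```python
def replication_lag_alarm(cache_cluster_id: str, threshold_seconds: float = 5.0) -> dict:
    """Build PutMetricAlarm parameters for one replica node's ReplicationLag."""
    return {
        "AlarmName": f"{cache_cluster_id}-replication-lag",
        "Namespace": "AWS/ElastiCache",
        "MetricName": "ReplicationLag",
        "Dimensions": [{"Name": "CacheClusterId", "Value": cache_cluster_id}],
        "Statistic": "Average",
        "Period": 60,              # evaluate one-minute averages
        "EvaluationPeriods": 3,    # alarm after three consecutive breaches
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
    }


if __name__ == "__main__":
    # Hypothetical replica node ID in a cluster mode enabled replication group.
    params = replication_lag_alarm("my-redis-cluster-0001-002")
    for name, value in params.items():
        print(f"{name}: {value}")
```

Because the metric is reported per node, you would create one alarm for each replica you want to watch.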
