This chapter covers Amazon Managed Streaming for Apache Kafka (Amazon MSK), a fully managed service that makes it easy to build and run applications that use Apache Kafka to process streaming data. For the SAA-C03 exam, MSK appears in approximately 5-8% of questions, primarily in the context of real-time data ingestion, event-driven architectures, and integration with other AWS services like Lambda, Kinesis, and Glue. Understanding MSK's architecture, key features, and common use cases is essential for designing scalable and resilient streaming data solutions on AWS.
Imagine you're a dairy company that needs to process milk from hundreds of farms. Apache Kafka is like building your own milk processing plant from scratch – you need to buy the land, construct the building, install the pasteurization equipment, hire staff, and maintain everything. Amazon MSK is like leasing a fully equipped milk processing facility from a specialized provider. The provider handles the building maintenance, equipment upgrades, and staffing for the basic operations. You just bring your milk (data) and decide how to process it (produce and consume messages). The facility has multiple processing lines (brokers) that can handle different types of milk (topics). You can scale up by adding more lines (brokers) or increase capacity per line (storage). The provider ensures the facility is always running, with backup power (multi-AZ) and automatic failover if a line breaks (broker replacement). You don't worry about the underlying infrastructure – you focus on your dairy products (applications). Similarly, with MSK, you don't manage ZooKeeper nodes, broker patching, or disk failures – AWS handles that, while you control topics, configurations, and client applications.
What is Amazon MSK?
Amazon MSK is a fully managed service for Apache Kafka, an open-source distributed streaming platform. Kafka is used for building real-time data pipelines and streaming applications. It acts as a high-throughput, fault-tolerant, publish-subscribe messaging system. MSK simplifies the operational overhead of running Kafka clusters by automating common tasks such as provisioning, patching, monitoring, and recovering from failures.
Why MSK exists
Running Apache Kafka on your own requires significant expertise. You must manage ZooKeeper nodes (for cluster coordination), configure broker replication, handle disk failures, perform software upgrades, and ensure high availability across multiple Availability Zones (AZs). MSK eliminates this burden. It provides a native Kafka API, meaning your existing Kafka applications can connect to MSK without any code changes. You retain control over topic configuration, consumer groups, and access policies via IAM or SASL/SCRAM authentication.
How MSK works internally
MSK provisions a cluster of brokers (Kafka servers) across up to three AZs for high availability. Each broker stores data in Apache Kafka topic partitions. Partitions are replicated across brokers in different AZs (replication factor typically 3). MSK automatically manages the ZooKeeper ensemble (3 or 5 nodes) required for Kafka's internal coordination. When a broker fails, MSK automatically replaces it, restoring the replication factor. MSK also applies operating system and service patches to brokers as rolling updates, while Kafka version upgrades are started by you.
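The cross-AZ replication described above can be illustrated with a small sketch. This is a simplified placement routine written for illustration, not Kafka's actual rack-aware assignment algorithm; the function name, broker IDs, and AZ names are hypothetical.

```python
def assign_replicas(brokers_by_az, num_partitions, replication_factor=3):
    """Place each partition's replicas on brokers in distinct AZs.

    Simplified illustration only: Kafka's real rack-aware assignment differs.
    """
    azs = sorted(brokers_by_az)
    if replication_factor > len(azs):
        raise ValueError("need at least one AZ per replica")
    assignment = {}
    for p in range(num_partitions):
        replicas = []
        for r in range(replication_factor):
            # Rotate the starting AZ per partition so leaders are spread out.
            az = azs[(p + r) % len(azs)]
            brokers = brokers_by_az[az]
            # Cycle through the brokers inside each AZ.
            replicas.append(brokers[(p // len(azs)) % len(brokers)])
        assignment[p] = replicas
    return assignment


# Hypothetical 6-broker cluster: 2 brokers in each of 3 AZs.
cluster = {
    "us-east-1a": [1, 4],
    "us-east-1b": [2, 5],
    "us-east-1c": [3, 6],
}
layout = assign_replicas(cluster, num_partitions=6)
for partition, replicas in layout.items():
    print(f"partition {partition}: replicas on brokers {replicas}")
```

Because every partition keeps one replica in each AZ, losing a single AZ still leaves two copies of every partition.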
Key Components and Defaults
Broker type: You choose the EC2 instance type for brokers (e.g., kafka.m5.large). The broker type determines throughput and storage limits.
Number of brokers per AZ: You specify the number of brokers per AZ. For a 3-AZ cluster, you might choose 2 brokers per AZ (6 total).
Storage: EBS volumes are used for broker storage. Default is 100 GB per broker, up to 16 TB. Storage can be expanded but not reduced. MSK manages the volume type for you; larger broker sizes can add provisioned storage throughput.
Kafka version: MSK supports multiple Kafka versions. You can upgrade the cluster manually.
Encryption: Encryption in transit (TLS) and at rest (KMS) are enabled by default.
Authentication: Options include IAM access control, SASL/SCRAM, or mutual TLS (mTLS).
Monitoring: CloudWatch metrics, Prometheus metrics, and broker logs can be sent to CloudWatch Logs or S3.
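Default replication factor: the default MSK configuration sets default.replication.factor to 3 for clusters spanning three AZs (2 for two-AZ clusters). You can override it per topic when you create the topic.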
Configuration and Verification Commands
You can create an MSK cluster via the AWS Management Console, CLI, or SDK. Example CLI command:
aws kafka create-cluster \
  --cluster-name "my-msk-cluster" \
  --kafka-version "2.8.1" \
  --number-of-broker-nodes 3 \
  --broker-node-group-info '{
    "InstanceType": "kafka.m5.large",
    "ClientSubnets": ["subnet-abc", "subnet-def", "subnet-ghi"],
    "SecurityGroups": ["sg-123"]
  }'
To verify cluster status:
aws kafka describe-cluster --cluster-arn <arn>
To list brokers:
aws kafka list-nodes --cluster-arn <arn>
How MSK Interacts with Related Technologies
AWS Lambda: You can use an MSK trigger to invoke a Lambda function when new messages arrive in a topic. Lambda reads from the topic in batches.
Amazon Kinesis: Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink) can consume from MSK as a source for real-time analytics.
AWS Glue: Glue Streaming ETL jobs can consume from MSK for data transformation and loading into S3 or Redshift.
Amazon S3: Using Kafka Connect with the S3 sink connector, you can automatically archive MSK data to S3.
Amazon EMR: EMR can read from MSK for stream processing using Spark Streaming, which handles the data in micro-batches.
AWS IAM: IAM roles and policies control access to MSK actions and resources.
Amazon CloudWatch: Metrics like BytesInPerSec, BytesOutPerSec, MessagesInPerSec, CpuUser, and KafkaDataLogsDiskUsed (percentage of data-log disk space used) are available.
Scaling and Performance
MSK supports two scaling modes:
- Provisioned: You specify the broker count and type. You scale out manually by adding brokers, or scale up by changing the broker type, which MSK applies in place as a rolling update (no cluster replacement needed).
- Serverless: A newer option that automatically provisions and scales capacity based on throughput. You pay per partition-hour and data written/read. Serverless is ideal for variable workloads but enforces per-partition throughput limits (e.g., 5 MB/s write per partition).
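To see what the per-partition write limit means for topic design, here is a rough sizing sketch. It assumes only the 5 MB/s write figure quoted above and an evenly distributed key space; the function name is hypothetical, and real sizing should also account for read throughput and headroom.

```python
import math


def min_partitions_for_write(target_write_mb_s, per_partition_write_mb_s=5.0):
    """Smallest partition count that keeps each partition under the write limit.

    Assumes load is spread evenly across partitions (a well-distributed key).
    """
    if target_write_mb_s < 0:
        raise ValueError("throughput cannot be negative")
    return max(1, math.ceil(target_write_mb_s / per_partition_write_mb_s))


# A topic expected to receive 12 MB/s needs at least 3 partitions.
print(min_partitions_for_write(12))
```

A skewed partition key defeats this estimate, because one hot partition can hit the limit while the others sit idle.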
Security
Network: MSK clusters are deployed within a VPC. You control access via security groups and network ACLs. For cross-account access, you can use VPC peering or PrivateLink.
Authentication: IAM access control allows you to use IAM policies to grant permissions to Kafka actions (e.g., kafka-cluster:Connect, kafka-cluster:DescribeTopic). SASL/SCRAM uses AWS Secrets Manager to store credentials. mTLS uses client certificates.
Encryption: TLS for in-transit, KMS for at-rest encryption.
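As an illustration of IAM access control, the sketch below builds a least-privilege style policy document for one client. The account ID, cluster UUID, topic, and consumer group names are hypothetical placeholders. Beyond the two actions named above, it assumes the client also needs `kafka-cluster:ReadData`, `kafka-cluster:WriteData`, `kafka-cluster:DescribeGroup`, and `kafka-cluster:AlterGroup`, so treat it as a starting point rather than a definitive policy.

```python
import json


def msk_client_policy(region, account_id, cluster_name, cluster_uuid, topic, group):
    """Build an IAM policy document for a client that produces and consumes one topic."""
    base = f"arn:aws:kafka:{region}:{account_id}"
    suffix = f"{cluster_name}/{cluster_uuid}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Permission to open a connection to the cluster.
                "Effect": "Allow",
                "Action": ["kafka-cluster:Connect"],
                "Resource": f"{base}:cluster/{suffix}",
            },
            {   # Topic-level permissions, scoped to a single topic.
                "Effect": "Allow",
                "Action": [
                    "kafka-cluster:DescribeTopic",
                    "kafka-cluster:ReadData",
                    "kafka-cluster:WriteData",
                ],
                "Resource": f"{base}:topic/{suffix}/{topic}",
            },
            {   # Consumer-group permissions needed to join a group and commit offsets.
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeGroup", "kafka-cluster:AlterGroup"],
                "Resource": f"{base}:group/{suffix}/{group}",
            },
        ],
    }


policy = msk_client_policy(
    "us-east-1", "111122223333", "my-msk-cluster",
    "abcd1234-0000-0000-0000-000000000000-1",  # hypothetical cluster UUID
    "my-topic", "my-consumer-group",
)
print(json.dumps(policy, indent=2))
```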
High Availability and Durability
MSK automatically distributes brokers across AZs. If a broker fails, MSK replaces it. However, data durability depends on the replication factor. With replication factor 3, data is written to 3 brokers (ideally in different AZs). If one AZ fails, data is still available from the other two. MSK also automatically recovers from disk failures by replacing the EBS volume.
Cost
You pay per broker-hour, plus storage (per GB-month) and data transfer costs. Serverless pricing is based on partition-hours and data throughput. There are no upfront costs.
Limitations
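Maximum broker count per cluster: 30 by default for provisioned clusters (a quota you can ask AWS to raise). Serverless clusters do not expose brokers, so no broker count applies.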
Maximum storage per broker: 16 TB.
Maximum message size: 1 MB (Kafka default) but can be increased up to 10 MB with configuration.
No built-in schema registry (you can use AWS Glue Schema Registry or Confluent Schema Registry).
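Kafka Streams is not hosted by MSK: it is a client library that runs in your own application (e.g., on EC2 or ECS). Kafka Connect can run on EC2 or ECS, or as managed connectors through MSK Connect.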
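The per-broker storage ceiling matters once retention and replication are multiplied in. The following back-of-the-envelope sketch uses hypothetical workload numbers and treats the 16 TB limit as 16,000 decimal GB; it ignores compression, index files, and free-space headroom, so real requirements will differ.

```python
MAX_STORAGE_PER_BROKER_GB = 16_000  # 16 TB limit from the list above, as decimal GB


def storage_per_broker_gb(ingress_mb_s, retention_hours, replication_factor, broker_count):
    """Estimate the disk each broker needs to hold the retained, replicated data."""
    retained_mb = ingress_mb_s * retention_hours * 3600  # data kept within retention
    replicated_mb = retained_mb * replication_factor     # every replica is stored
    return replicated_mb / 1000 / broker_count           # MB -> GB, spread evenly


# Hypothetical workload: 10 MB/s in, 24 h retention, RF 3, 6 brokers.
needed = storage_per_broker_gb(10, 24, 3, 6)
print(f"{needed:.0f} GB per broker, fits: {needed <= MAX_STORAGE_PER_BROKER_GB}")
```

If the estimate exceeds the ceiling, the usual options are more brokers, shorter retention, or archiving older data to S3.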
Create an MSK Cluster
In the AWS Management Console, navigate to Amazon MSK and click 'Create cluster'. Choose between 'Provisioned' and 'Serverless'. For provisioned, specify cluster name, Kafka version, broker type (e.g., kafka.m5.large), number of brokers per AZ (e.g., 2), and number of AZs (2 or 3). Configure storage per broker (default 100 GB). Set up networking: choose VPC, subnets (one per AZ), and security groups. Optionally configure authentication (IAM, SASL/SCRAM, or mTLS). Review and create. The cluster provisioning takes 10-15 minutes.
Configure Client Access
After cluster creation, note the bootstrap broker addresses (e.g., b-1.mycluster.1234.kafka.us-east-1.amazonaws.com:9092). For IAM authentication, attach an IAM policy to the client role granting permissions like kafka-cluster:Connect. For SASL/SCRAM, store credentials in Secrets Manager and configure the client to use them. Ensure security groups allow inbound traffic from clients on the port that matches the access method: 9092 (plaintext), 9094 (TLS), 9096 (SASL/SCRAM), or 9098 (IAM).
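A sketch of the client-side settings for the two SASL-based options follows. It renders the contents of a Kafka `client.properties` file. The IAM settings assume a Java-based client with the `aws-msk-iam-auth` library on its classpath; the helper name and the sample credentials are hypothetical. In practice, fetch SCRAM credentials from Secrets Manager at runtime rather than hardcoding them.

```python
def client_properties(auth_mode, username=None, password=None):
    """Render Kafka client properties for IAM or SASL/SCRAM authentication."""
    if auth_mode == "iam":
        props = {
            "security.protocol": "SASL_SSL",
            "sasl.mechanism": "AWS_MSK_IAM",
            "sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;",
            "sasl.client.callback.handler.class":
                "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
        }
    elif auth_mode == "scram":
        if not username or not password:
            raise ValueError("SCRAM needs a username and password")
        jaas = (
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            f'username="{username}" password="{password}";'
        )
        props = {
            "security.protocol": "SASL_SSL",
            "sasl.mechanism": "SCRAM-SHA-512",
            "sasl.jaas.config": jaas,
        }
    else:
        raise ValueError(f"unsupported auth mode: {auth_mode}")
    return "\n".join(f"{key}={value}" for key, value in props.items())


print(client_properties("iam"))
```

The rendered text can be saved as a properties file and passed to the Kafka CLI tools with `--command-config` (for `kafka-topics.sh`) or `--producer.config` (for `kafka-console-producer.sh`).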
Create Topics and Produce Messages
Use Kafka CLI tools on an EC2 instance in the same VPC (or via PrivateLink). Create a topic: `kafka-topics.sh --create --topic my-topic --bootstrap-server <bootstrap> --partitions 3 --replication-factor 3`. This creates a topic with 3 partitions, each replicated to 3 brokers. Produce messages using `kafka-console-producer.sh --topic my-topic --bootstrap-server <bootstrap>`. Messages are written to a partition based on key (if provided) or round-robin.
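The key-to-partition mapping can be sketched as a hash of the key modulo the partition count. Note that CRC32 is used here as a stand-in: Kafka's default partitioner uses a murmur2 hash, so real partition numbers will differ. The point is only that equal keys always land on the same partition.

```python
import zlib
from collections import Counter


def partition_for_key(key, num_partitions):
    """Map a message key to a partition: hash(key) mod partition count.

    CRC32 is a stand-in hash; Kafka's default partitioner uses murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical keys: 1000 user IDs spread over a 3-partition topic.
keys = [f"user-{i}" for i in range(1000)]
spread = Counter(partition_for_key(k, 3) for k in keys)
print(dict(spread))
```

Because ordering is guaranteed only within a partition, choosing a key such as a user ID keeps all events for that user in order.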
Consume Messages with Lambda
In the Lambda console, create a function and add an MSK trigger. Select the MSK cluster and topic. Lambda will poll the topic for new messages in batches (default batch size 100, max 10,000). The function receives an event whose records are grouped by topic-partition, with each record's key and value base64-encoded. Lambda processes them and commits offsets automatically. Ensure the Lambda execution role has permissions to describe the cluster and read from the topic.
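A minimal handler sketch for an MSK trigger is shown below. It assumes message values are UTF-8 encoded JSON, which is an assumption about the producer rather than anything MSK enforces; the sample event is a trimmed, hypothetical payload for local testing.

```python
import base64
import json


def handler(event, context):
    """Decode every record in an MSK event and return the parsed messages."""
    processed = []
    # Records arrive grouped under "topic-partition" keys such as "my-topic-0".
    for topic_partition, records in event.get("records", {}).items():
        for record in records:
            # Values are base64-encoded; this sketch assumes they hold JSON.
            payload = base64.b64decode(record["value"]).decode("utf-8")
            processed.append({
                "topic": record["topic"],
                "partition": record["partition"],
                "offset": record["offset"],
                "body": json.loads(payload),
            })
    print(f"processed {len(processed)} records")
    return processed


# Trimmed sample event for local testing (hypothetical values).
sample_event = {
    "eventSource": "aws:kafka",
    "records": {
        "my-topic-0": [
            {
                "topic": "my-topic",
                "partition": 0,
                "offset": 15,
                "value": base64.b64encode(b'{"page": "/cart"}').decode("ascii"),
            }
        ]
    },
}
result = handler(sample_event, None)
```

If the handler raises an exception, Lambda retries the batch, so processing should be safe to repeat.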
Monitor and Scale
Monitor cluster health via CloudWatch metrics (e.g., sustained CpuUser above 50% may indicate undersized brokers). To scale provisioned clusters, you can add brokers (scale out) or change the broker type (scale up). MSK applies a broker type change in place as a rolling update, so schedule it for a low-traffic period. After adding brokers, reassign partitions so the new brokers take on load. For serverless, no manual scaling is needed. Set up CloudWatch alarms for disk usage and CPU. Enable broker logs to CloudWatch Logs for troubleshooting.
Enterprise Scenario 1: Real-Time Clickstream Analytics
A large e-commerce company uses MSK to ingest clickstream data from their website. They have thousands of web servers sending user events (page views, clicks, purchases) to an MSK cluster with 12 kafka.m5.xlarge brokers across 3 AZs. Each event is about 1 KB, and they process 500 MB/s of data. They use AWS Lambda to transform and filter events in real time, then sink the data to Amazon S3 for batch analytics and Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) for real-time dashboards. They chose MSK over Kinesis because they wanted Kafka API compatibility and already had Kafka client libraries. The 7-day retention they needed was not the deciding factor, since Kinesis can retain data for up to 365 days. A common problem they faced was uneven partition distribution causing hot spots. They solved it by using a well-distributed partition key (user ID). Another issue was broker CPU spikes during traffic surges. Because provisioned clusters do not add brokers automatically, they set CloudWatch alarms and added brokers manually when the alarms fired.
Enterprise Scenario 2: Financial Transaction Processing
A fintech startup processes credit card transactions through MSK. They need exactly-once semantics and high durability. They use MSK with replication factor 3 and IAM authentication for security. They have a Lambda function that validates transactions, then writes to an RDS database. They also use Kafka Connect (running on ECS) to sink data to S3 for audit trails. They chose MSK because they needed low latency (under 10ms) and the ability to replay old messages. A misconfiguration they encountered was setting the retention period too short (24 hours) – they needed 7 days for compliance. They also initially forgot to enable encryption in transit, exposing data to potential eavesdropping. After enabling TLS, they had to update all client configurations.
Enterprise Scenario 3: IoT Device Telemetry
A manufacturing company collects sensor data from thousands of IoT devices. Each device sends a message every second (10 bytes). They use MSK Serverless because the workload is spiky – devices are active only during production hours. MSK Serverless scales capacity automatically, with no brokers to size or manage. They use AWS Glue Streaming ETL to transform the data and load into Redshift for real-time dashboards. A challenge they faced was that Serverless caps write throughput at 5 MB/s per partition – they had to design their topic partitioning carefully to avoid throttling. They also learned that Serverless supports only IAM access control, so they dropped their planned SASL/SCRAM setup and used IAM authentication.
What SAA-C03 Tests on MSK
The exam covers MSK under Domain 3, 'Design High-Performing Architectures', most directly the task statement on data ingestion and transformation. You should understand:
When to use MSK vs. Amazon Kinesis vs. Amazon SQS vs. Amazon MQ
MSK's fully managed nature: you don't manage ZooKeeper, broker patching, or disk failures
Authentication options: IAM, SASL/SCRAM, mTLS
Integration with Lambda, Glue, Kinesis Data Analytics, and S3
Scaling: provisioned vs. serverless
Security: VPC placement, encryption in transit/at rest
Common Wrong Answers
'MSK is a managed version of Amazon Kinesis' – Wrong. MSK is managed Apache Kafka. Kinesis is a different service. Candidates confuse them because both handle streaming data. The exam expects you to know the difference: MSK uses Kafka API, Kinesis uses its own API.
'MSK automatically scales partitions' – Wrong. You choose the partition count when you create a topic and add partitions manually, in both provisioned and serverless clusters. Serverless scales capacity automatically, but per-partition throughput limits still apply.
'MSK can be accessed from outside the VPC without any configuration' – Wrong. MSK is inside a VPC. For external access, you need a VPN, Direct Connect, or a proxy (e.g., NLB with PrivateLink).
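'MSK supports exactly-once semantics by default' – Wrong. Kafka can provide exactly-once semantics through idempotent, transactional producers, but you must configure it: set enable.idempotence=true and acks=all, assign a transactional.id, and have consumers read with isolation.level=read_committed. MSK supports this but does not turn it on for you.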
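The exactly-once point is easier to remember as a checklist. The sketch below checks plain dictionaries of Kafka client settings. It assumes the usual transactional setup: in addition to `enable.idempotence` and `acks`, the producer sets a `transactional.id` and the consumer reads with `isolation.level=read_committed`. The helper name and the sample ID are hypothetical, and the check does not talk to a cluster.

```python
REQUIRED_PRODUCER = {"enable.idempotence": "true", "acks": "all"}
REQUIRED_CONSUMER = {"isolation.level": "read_committed"}


def eos_problems(producer_config, consumer_config):
    """List the settings that stop a producer/consumer pair from being exactly-once."""
    problems = []
    for name, expected in REQUIRED_PRODUCER.items():
        if str(producer_config.get(name, "")).lower() != expected:
            problems.append(f"producer {name} should be {expected}")
    # Transactions require a stable, unique transactional.id per producer.
    if not producer_config.get("transactional.id"):
        problems.append("producer transactional.id is not set")
    for name, expected in REQUIRED_CONSUMER.items():
        if consumer_config.get(name) != expected:
            problems.append(f"consumer {name} should be {expected}")
    return problems


good_producer = {
    "enable.idempotence": "true",
    "acks": "all",
    "transactional.id": "payments-producer-1",  # hypothetical ID
}
good_consumer = {"isolation.level": "read_committed"}
print(eos_problems(good_producer, good_consumer))  # empty list when all settings are present
print(eos_problems({"acks": "1"}, {}))
```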
Specific Numbers and Terms
Default replication factor: 3
Default storage: 100 GB per broker
Max storage per broker: 16 TB
Max message size: 1 MB (configurable up to 10 MB)
Supported Kafka versions: e.g., 2.8.1, 3.2.0
Client ports: 9092 (plaintext), 9094 (TLS), 9096 (SASL/SCRAM), 9098 (IAM)
Serverless max partition throughput: 5 MB/s write, 10 MB/s read
Edge Cases
If you need to retain data for more than 365 days, MSK is better than Kinesis (Kinesis max 365 days).
MSK does not support Kafka Streams natively; you run it on EC2 or ECS.
You can change the broker type of a provisioned cluster in place. MSK applies it as a rolling update, so expect reduced capacity while brokers restart.
MSK Serverless supports only IAM authentication (no SASL/SCRAM or mTLS).
How to Eliminate Wrong Answers
If the question mentions 'managed ZooKeeper', it's MSK.
If the question mentions 'Kafka API', it's MSK.
If the question mentions 'lowest operational overhead for Kafka', it's MSK.
If the question mentions 'exactly-once semantics', check if they require Kafka's transaction API – MSK supports it but not automatically.
Amazon MSK is a fully managed Apache Kafka service – you don't manage ZooKeeper or broker infrastructure.
Default replication factor is 3; default storage per broker is 100 GB (max 16 TB).
MSK clusters are deployed inside a VPC; external access requires VPN, Direct Connect, PrivateLink, or explicitly enabling public access on a provisioned cluster.
Authentication options: IAM, SASL/SCRAM (credentials in Secrets Manager), and mTLS.
MSK integrates with Lambda (as trigger), Glue (streaming ETL), Kinesis Data Analytics, and S3 (via Kafka Connect).
Provisioned clusters require manual scaling; Serverless clusters auto-scale but have partition throughput limits.
Maximum message size is 1 MB default, configurable up to 10 MB.
MSK does not support Kafka Streams natively; run on EC2 or ECS.
These come up on the exam all the time. Here's how to tell them apart.
Amazon MSK (Provisioned)
You choose broker type and count
Manual scaling (add brokers or change type)
Pay per broker-hour and storage
Supports IAM, SASL/SCRAM, mTLS authentication
Higher throughput per partition (no fixed per-partition quota; bounded by broker capacity)
Amazon MSK (Serverless)
AWS automatically provisions and scales
Auto-scales based on throughput
Pay per partition-hour and data throughput
Supports IAM authentication only (no SASL/SCRAM or mTLS)
Max 5 MB/s write per partition, 10 MB/s read
Mistake
MSK is a drop-in replacement for Amazon Kinesis
Correct
MSK uses the Apache Kafka API, while Kinesis uses its own API. They are not interchangeable. You must use Kafka client libraries for MSK and Kinesis SDK for Kinesis.
Mistake
MSK automatically scales provisioned clusters
Correct
Provisioned MSK clusters do not auto-scale brokers (only broker storage can be expanded automatically). You must manually add brokers or change the broker type, which MSK applies in place as a rolling update. Serverless clusters auto-scale.
Mistake
MSK manages topic creation and deletion
Correct
MSK does not manage topics. You create, delete, and configure topics using Kafka CLI or API. MSK only manages the underlying infrastructure.
Mistake
MSK provides exactly-once delivery by default
Correct
Kafka supports exactly-once semantics, but it is not automatic. You must configure producers with `enable.idempotence=true`, `acks=all`, and a `transactional.id`, and have consumers read with `isolation.level=read_committed`. MSK does not change this.
Mistake
MSK can be accessed from the internet by default
Correct
MSK clusters are deployed inside a VPC and are not publicly accessible by default. For external access you need a VPN, Direct Connect, a proxy (e.g., NLB with PrivateLink), or public access explicitly enabled on a provisioned cluster with authentication turned on.
Amazon MSK is a fully managed service for Apache Kafka, while Amazon Kinesis is a fully managed service for real-time data streaming with its own API. MSK uses Kafka's native API, so you can use existing Kafka tools and clients. Kinesis has its own SDK and features like shard-level metrics and resharding. Choose MSK if you need Kafka compatibility, want to use Kafka Connect or Kafka Streams, or need to retain data for more than 365 days (the Kinesis maximum). Choose Kinesis if you want a simpler API, automatic scaling with shards, or minimal operational setup.
No, not by default. MSK clusters are deployed in a VPC and are not publicly accessible unless you explicitly enable public access on a provisioned cluster. Otherwise, to access the cluster from outside the VPC, you need a VPN connection, AWS Direct Connect, or a proxy like an NLB with PrivateLink. Alternatively, you can use AWS Lambda with a VPC configuration to access the cluster from within the VPC.
MSK automatically replaces failed brokers. It monitors broker health and if a broker fails, it provisions a new broker with the same configuration and re-attaches the EBS volumes (data is preserved). During replacement, there may be a temporary reduction in replication factor, but data is not lost if replication factor > 1. MSK also handles disk failures by replacing the EBS volume.
MSK supports three authentication methods: IAM access control (using IAM policies to grant Kafka permissions), SASL/SCRAM (using credentials stored in AWS Secrets Manager), and mutual TLS (mTLS) using client certificates. Provisioned clusters can also allow unauthenticated access, but it is not recommended. MSK Serverless supports IAM only. IAM is the simplest for AWS-native workloads.
Yes, you can use an MSK cluster as an event source for Lambda. Lambda polls the topic for new records and invokes your function synchronously. You can specify batch size, starting position, and other configurations. Lambda supports both provisioned and serverless MSK clusters. Ensure the Lambda function has VPC access to the MSK cluster and appropriate IAM permissions.
The default maximum message size in Apache Kafka is about 1 MB. MSK supports increasing this up to 10 MB by raising the `message.max.bytes` broker configuration (or `max.message.bytes` for an individual topic) together with the `max.request.size` producer configuration. However, larger messages reduce throughput and increase latency.
MSK integrates with Amazon CloudWatch for metrics like BytesInPerSec, BytesOutPerSec, MessagesInPerSec, CpuUser, KafkaDataLogsDiskUsed, etc. You can also enable broker logs to CloudWatch Logs or S3 for debugging. Additionally, MSK supports Prometheus metrics via open monitoring. Use CloudWatch alarms to notify on high CPU or disk usage.