This chapter covers Amazon Kinesis Data Analytics for Apache Flink, a fully managed service for real-time stream processing with Apache Flink. For the SAA-C03 exam, this topic appears in about 5-8% of questions, primarily in the Design High-Performing Architectures domain (Task Statement 3.5). You will need to understand the service's architecture, how it differs from Kinesis Data Analytics for SQL, and how it integrates with other Kinesis services. We will dive deep into Flink's internals, checkpointing, parallelism, and common exam scenarios.
Imagine a car factory assembly line where cars move past a series of inspection stations. Each station monitors a specific aspect: paint quality, tire pressure, engine temperature. The line moves at a constant speed (like a Kinesis Data Stream), and each car carries a manifest (record). Now, instead of having human inspectors who manually check each car, you install automated sensors that run custom software (Apache Flink jobs) to analyze data in real time. The sensors can aggregate data—for example, count how many cars have low tire pressure in the last minute—and trigger alerts if a threshold is exceeded. They can also join data from two different sensors, like correlating paint defects with the paint robot ID. Critically, the sensors remember past results (state) to detect trends, such as a gradual increase in engine temperature over several cars. If a sensor fails, the factory can restart it from the last checkpoint (Flink checkpointing) without losing data. The factory manager can also update the inspection logic without stopping the line, by uploading new software (application update with savepoint). This mirrors how Kinesis Data Analytics for Apache Flink processes streaming data: it consumes records from Kinesis Data Streams, runs Flink jobs with stateful operations, and can checkpoint state for fault tolerance.
What is Kinesis Data Analytics for Apache Flink?
Amazon Kinesis Data Analytics for Apache Flink (renamed Amazon Managed Service for Apache Flink in 2023) is a fully managed service that allows you to run Apache Flink applications to process streaming data in real time. It is one of three Kinesis Data Analytics options: the original SQL-based Kinesis Data Analytics (now called Kinesis Data Analytics for SQL), the managed Apache Flink service, and the newer managed Apache Flink with Studio notebooks. For SAA-C03, the focus is on the managed Apache Flink service.
Why Apache Flink?
Apache Flink is an open-source stream processing framework that provides exactly-once semantics, high throughput, low latency, and stateful computations. Unlike the SQL-based service, Flink allows custom Java or Scala code (or Python via PyFlink), enabling complex event processing, windowed aggregations, joins, and machine learning inference. The exam expects you to know when to choose Flink over SQL: Flink is for complex, custom logic; SQL is for simple SQL queries on streaming data.
How It Works Internally
Kinesis Data Analytics for Apache Flink runs your Flink application on a managed cluster of EC2 instances. The service handles provisioning, scaling, fault tolerance, and monitoring. Here's the data flow:
Source: Your Flink application reads data from a Kinesis Data Stream or Amazon MSK (Managed Streaming for Apache Kafka) as the source. The Flink Kinesis consumer uses shard discovery and record sequence numbers to track progress.
Processing: The application performs transformations, aggregations, joins, or windowing. Flink uses a master-worker architecture: a JobManager (master) coordinates checkpointing and task distribution, while TaskManagers (workers) execute the tasks. The service sizes the number of TaskManagers from the configured parallelism.
Sink: Processed data is written to one or more destinations: Kinesis Data Streams, Kinesis Data Firehose, Amazon S3, Amazon DynamoDB, Amazon OpenSearch Service (formerly Amazon Elasticsearch Service), or custom sinks.
Key Components and Defaults
Parallelism: The number of parallel subtasks for each operator. Default is 1, but you should set it based on the number of shards in the source stream (typically 1-2x shard count). The service can auto-scale parallelism based on CPU utilization.
Checkpointing: Flink periodically saves the state of all operators to a durable store (Amazon S3 by default). The default checkpoint interval in the managed service is 1 minute (configurable). This enables exactly-once semantics and fault recovery.
State Backend: State is stored either in memory (suitable for small state) or in a RocksDB instance embedded in each TaskManager. RocksDB is the default and is recommended for production workloads with large state.
Application Versioning: Each Flink application has a version. You can update the application code using the UpdateApplication API, which triggers a savepoint (a snapshot of the state) and then restarts with the new code.
Metrics: CloudWatch metrics include millisBehindLatest (how far behind the stream), numberOfFailedCheckpoints, and CPU/memory usage.
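The parallelism guidance above reduces to simple arithmetic. The following is a minimal sketch, not service code: the class and method names are hypothetical, and it assumes the service allocates processing units (KPUs) as the configured parallelism divided by the parallelism-per-KPU setting, rounded up.

```java
/** Hypothetical helper: sizes parallelism from the source shard count. */
final class ParallelismSketch {

    /** Suggested parallelism: shard count times a multiplier in the 1-2x range. */
    static int suggestedParallelism(int shardCount, double multiplier) {
        if (shardCount < 1 || multiplier < 1.0 || multiplier > 2.0) {
            throw new IllegalArgumentException("need shardCount >= 1 and 1.0 <= multiplier <= 2.0");
        }
        return (int) Math.ceil(shardCount * multiplier);
    }

    /** Assumed rule: processing units = ceil(parallelism / parallelismPerKpu). */
    static int processingUnits(int parallelism, int parallelismPerKpu) {
        if (parallelism < 1 || parallelismPerKpu < 1) {
            throw new IllegalArgumentException("both arguments must be >= 1");
        }
        return (parallelism + parallelismPerKpu - 1) / parallelismPerKpu;
    }

    public static void main(String[] args) {
        int parallelism = suggestedParallelism(50, 1.0);
        System.out.println("parallelism=" + parallelism
                + " kpus=" + processingUnits(parallelism, 4));
    }
}
```

For a 50-shard stream at 1x, this suggests a parallelism of 50; packing four subtasks per processing unit would then need 13 units under the assumed rule.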
Configuration and Verification
When creating a Flink application, you provide a compiled JAR file containing your Flink job. The service deploys and runs it; it does not compile source code. You can configure:
- Parallelism: Set via the ParallelismConfiguration parameter.
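- Flink version: At the time of writing, supported versions include Flink 1.11, 1.13, and 1.15. Check the service documentation for the current list and choose the latest stable version.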
- Environment properties: Key-value pairs passed to the application.
To verify the application is running, use the AWS CLI:
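aws kinesisanalyticsv2 describe-application --application-name MyApp
This returns the application status (RUNNING, UPDATING, etc.), the application version ID, and the current parallelism configuration. Checkpoint health is reported through CloudWatch metrics such as lastCheckpointDuration rather than by this command.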
Interaction with Related Technologies
Kinesis Data Streams: The most common source. The Flink Kinesis connector uses adaptive reads and can handle resharding.
Kinesis Data Firehose: Often used as a sink to deliver processed data to S3, Redshift, or Elasticsearch.
Amazon MSK: Flink can consume from MSK using the Kafka connector.
AWS Lambda: Can be used as a sink or for additional processing.
Amazon DynamoDB: Flink can write to DynamoDB using the DynamoDB sink.
Important Exam Concepts
Exactly-once semantics: Flink achieves this through checkpointing and transactional sinks (e.g., Kafka sink with exactly-once). The exam may ask about guarantees.
Event time vs. processing time: Flink supports event-time processing using watermarks. Watermarks are used to handle out-of-order events. The exam might test the concept of watermarks and allowed lateness.
Windows: Tumbling, sliding, session windows. Know how they work.
State and fault tolerance: Checkpoints vs. savepoints. Savepoints are manually triggered for application updates; checkpoints are automatic.
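The event-time concepts in this list can be illustrated without any Flink dependency. The sketch below is an assumption-laden illustration, not Flink source: the class and method names are hypothetical, the watermark rule follows the usual bounded-out-of-orderness approach (highest timestamp seen, minus the bound, minus 1 ms), and a record counts as late once the watermark has passed its window's last timestamp plus the allowed lateness.

```java
/** Hypothetical sketch of event-time window assignment and lateness checks. */
final class EventTimeSketch {

    /** Start of the tumbling window containing the timestamp (non-negative epoch millis). */
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    /** Watermark under bounded out-of-orderness: highest timestamp seen, minus the bound, minus 1 ms. */
    static long watermark(long maxTimestampSeenMs, long maxOutOfOrdernessMs) {
        return maxTimestampSeenMs - maxOutOfOrdernessMs - 1;
    }

    /** True if the record's window has already expired, so the record would be dropped as late. */
    static boolean isLate(long timestampMs, long windowSizeMs,
                          long allowedLatenessMs, long watermarkMs) {
        long windowMaxTimestamp = windowStart(timestampMs, windowSizeMs) + windowSizeMs - 1;
        return windowMaxTimestamp + allowedLatenessMs <= watermarkMs;
    }
}
```

With a 1-minute window, a record stamped 125,000 ms belongs to the window starting at 120,000 ms. If the watermark has reached 194,999 ms, that record is late with zero allowed lateness but still accepted with 60 seconds of allowed lateness.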
Step-by-Step Data Flow
Application Submission: You upload a JAR to S3 and create a Flink application via the console or API. The service starts a cluster with one JobManager and the configured number of TaskManagers.
Source Connection: The Flink Kinesis consumer connects to the source stream, discovers shards, and begins reading records. It uses sequence numbers to track progress.
Record Processing: Each TaskManager executes a subset of operator subtasks. For example, a map operator may transform records, while a keyed window operator groups by a key and computes aggregates.
State Management: Operators maintain state in the state backend. For keyed state, the state is partitioned by key across subtasks.
Checkpointing: Periodically, the JobManager triggers a checkpoint. Each operator saves its state to S3. The checkpoint is considered complete when all operators have acknowledged.
Sink Output: Processed records are written to the sink. The sink may use transactions to ensure exactly-once delivery.
Failure Recovery: If a TaskManager fails, the JobManager restarts the tasks from the last successful checkpoint. The source consumer rewinds to the checkpointed offset.
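The checkpoint and recovery steps above can be simulated in a few lines. This is a single-threaded toy model under stated assumptions, not how Flink is implemented: the class name, the `run` method, and the failure-injection parameter are all hypothetical. It shows why snapshotting operator state together with the source offset gives the same result after a crash as an uninterrupted run.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical single-threaded simulation of checkpoint-and-replay recovery. */
final class CheckpointRecoverySketch {

    /**
     * Counts occurrences of each record, checkpointing every checkpointEvery records
     * and crashing once just before processing failAtOffset (-1 means never).
     */
    static Map<String, Integer> run(List<String> source, int checkpointEvery, int failAtOffset) {
        Map<String, Integer> state = new HashMap<>();           // live operator state
        Map<String, Integer> checkpointState = new HashMap<>(); // last durable snapshot
        int checkpointOffset = 0;                                // source position stored with the snapshot
        boolean failed = false;

        int offset = 0;
        while (offset < source.size()) {
            if (!failed && offset == failAtOffset) {
                // Simulated crash: discard live state, restore the snapshot, rewind the source.
                failed = true;
                state = new HashMap<>(checkpointState);
                offset = checkpointOffset;
                continue;
            }
            state.merge(source.get(offset), 1, Integer::sum);
            offset++;
            if (offset % checkpointEvery == 0) {
                // State and offset are captured together, so they stay consistent.
                checkpointState = new HashMap<>(state);
                checkpointOffset = offset;
            }
        }
        return state;
    }
}
```

Records processed after the last checkpoint are replayed after the crash, but because the restored state excludes their earlier effect, each record is counted exactly once in the final state.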
Code Example (Simplified Flink Job)
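// Assumes env is a StreamExecutionEnvironment, consumerConfig and producerConfig are
// java.util.Properties with the AWS region set, and Tokenizer is a
// FlatMapFunction<String, Tuple2<String, Long>> that emits (word, 1L) pairs.
DataStream<String> input = env.addSource(new FlinkKinesisConsumer<>(
    "input-stream",
    new SimpleStringSchema(),
    consumerConfig
));
DataStream<Tuple2<String, Long>> counts = input
    .flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .sum(1);
// SimpleStringSchema serializes Strings, so format each tuple before it reaches the sink.
FlinkKinesisProducer<String> sink = new FlinkKinesisProducer<>(
    new SimpleStringSchema(),
    producerConfig
);
sink.setDefaultStream("output-stream"); // hypothetical destination stream name
sink.setDefaultPartition("0");
counts.map(t -> t.f0 + "," + t.f1).addSink(sink);
This reads from a Kinesis stream, tokenizes lines into words, counts occurrences per minute, and writes the counts to another Kinesis stream.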
Create and Upload Flink JAR
First, develop your Flink application in Java, Scala, or Python. Package it into a JAR file (or a ZIP for Python) and upload it to an S3 bucket. The service uses this artifact to run your application. Build the JAR with Maven or Gradle so that it bundles every dependency not already provided by the Flink runtime. The application artifact size limit is 512 MB.
Create Kinesis Analytics Application
In the AWS Management Console, navigate to Kinesis Analytics and choose 'Create application'. Select 'Apache Flink' as the runtime. Provide an application name, description, and the IAM role that grants permissions to read from the source and write to sinks. The role must have permissions for Kinesis streams, S3 (for JAR and checkpoints), and CloudWatch logs. You also specify the Flink version and environment properties.
Configure Source and Sink
After creation, you configure the application's source and sink by editing the Flink code. The code must contain the source and sink definitions. For Kinesis sources, you need to provide the stream name, region, and authentication. The service does not configure these via the console; they are part of the application code. You also set parallelism and scaling options.
Run and Monitor Application
Start the application. The service provisions the cluster and begins running your Flink job. Monitor it using CloudWatch metrics: `millisBehindLatest` indicates lag; `cpuUtilization` and `heapMemoryUtilization` show resource usage. Check the application log in CloudWatch Logs for errors. The application status moves from STARTING to RUNNING. If the job cannot start or keeps restarting, check the logs and the `fullRestarts` and `downtime` metrics.
Update Application Code
To update the Flink code, upload a new JAR to S3 and call the `UpdateApplication` API. The service will trigger a savepoint of the current state, then stop the application, replace the JAR, and restart from the savepoint. This allows stateful upgrades. You can also update parallelism without code changes using the `UpdateApplication` API with new parallelism settings.
Enterprise Scenario 1: Real-Time Fraud Detection
A financial services company processes credit card transactions in real time. They use Kinesis Data Streams to ingest transactions (up to 100,000 TPS across 200 shards). They implement a Flink application that performs per-cardholder aggregations: counting transactions in the last hour, computing the average amount, and detecting velocity anomalies. The Flink job uses event-time processing with watermarks to handle out-of-order transactions. State is stored in RocksDB (since state per key can be large). They use a 1-minute checkpoint interval for fast recovery. The output is written to a Kinesis Data Firehose delivery stream that loads into S3 for further analysis and also to a Lambda function that triggers alerts. Common misconfiguration: setting parallelism too low for 200 shards, which leads to backpressure and rising millisBehindLatest. The 1-2x guideline suggests 200-400, but the upper end exceeds the maximum parallelism of 256 noted later in this chapter, so the practical target is 200-256. Disabling checkpointing is another mistake: without checkpoints, operator state is lost on failure.
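The per-cardholder velocity check in this scenario is keyed state plus a sliding time window. The sketch below shows the idea in plain Java under loud assumptions: the class name, threshold parameter, and method are hypothetical, timestamps are assumed to arrive in order per card, and in a real Flink job this state would live in keyed state managed by the state backend rather than in a `HashMap`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-cardholder velocity check over a sliding one-hour window. */
final class VelocitySketch {
    private static final long WINDOW_MS = 60 * 60 * 1000L;

    private final int maxTransactionsPerHour;
    // Stand-in for keyed state: recent transaction timestamps per card, oldest first.
    private final Map<String, Deque<Long>> recentByCard = new HashMap<>();

    VelocitySketch(int maxTransactionsPerHour) {
        this.maxTransactionsPerHour = maxTransactionsPerHour;
    }

    /** Records a transaction; returns true if the card now exceeds the hourly threshold. */
    boolean onTransaction(String cardId, long eventTimeMs) {
        Deque<Long> recent = recentByCard.computeIfAbsent(cardId, id -> new ArrayDeque<>());
        recent.addLast(eventTimeMs);
        // Evict timestamps that have fallen out of the one-hour window.
        while (!recent.isEmpty() && recent.peekFirst() <= eventTimeMs - WINDOW_MS) {
            recent.removeFirst();
        }
        return recent.size() > maxTransactionsPerHour;
    }
}
```

Because each card has its own deque, the state grows with the number of active cards, which is the reason the scenario favors RocksDB over an in-memory backend.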
Enterprise Scenario 2: IoT Sensor Data Aggregation
A manufacturing company collects sensor data from thousands of machines. Each machine sends a reading every second to a Kinesis Data Stream (50 shards). They use Flink to aggregate readings per machine over 5-minute tumbling windows, compute averages, and detect anomalies. The Flink job also joins with a static reference table (machine metadata) stored in a DynamoDB table, using Flink's Async I/O operator to query DynamoDB without blocking. The output is written to Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) for real-time dashboards. They use auto-scaling based on CPU utilization. A common mistake is configuring watermarks incorrectly, which causes late events to be dropped. They set allowed lateness to 1 minute to accommodate network delays. They must also ensure the IAM role has access to DynamoDB and the OpenSearch domain.
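The 5-minute tumbling-window average in this scenario can be sketched as a batch computation over a list of readings. This is an illustration only: the class, the `Reading` type, and the `machineId@windowStart` key format are hypothetical, and a real Flink job would compute the same thing incrementally with a keyed window operator.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: per-machine averages over 5-minute tumbling windows. */
final class TumblingAverageSketch {
    static final long WINDOW_MS = 5 * 60 * 1000L;

    /** One sensor reading; field names are illustrative. */
    static final class Reading {
        final String machineId;
        final long eventTimeMs;
        final double value;

        Reading(String machineId, long eventTimeMs, double value) {
            this.machineId = machineId;
            this.eventTimeMs = eventTimeMs;
            this.value = value;
        }
    }

    /** Returns the average value per "machineId@windowStartMs" key. */
    static Map<String, Double> averages(List<Reading> readings) {
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (Reading r : readings) {
            long windowStart = r.eventTimeMs - (r.eventTimeMs % WINDOW_MS);
            String key = r.machineId + "@" + windowStart;
            double[] acc = sumAndCount.computeIfAbsent(key, k -> new double[2]);
            acc[0] += r.value; // running sum
            acc[1] += 1;       // running count
        }
        Map<String, Double> result = new TreeMap<>();
        sumAndCount.forEach((key, acc) -> result.put(key, acc[0] / acc[1]));
        return result;
    }
}
```

Keeping only a sum and a count per window, rather than every reading, is the same incremental-aggregation idea that keeps window state small in a streaming job.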
Performance Considerations
Shard count: The source stream's shard count limits throughput. Each shard accepts writes of up to 1 MB/s or 1,000 records/s and serves reads of up to 2 MB/s, shared by all consumers unless enhanced fan-out is used. If the application cannot keep up, increase parallelism.
State size: Large state can cause checkpointing delays. Use RocksDB state backend and tune RocksDB memory settings. Monitor checkpoint duration.
Backpressure: Indicated by high millisBehindLatest. Check CPU and memory utilization. Increase parallelism or optimize code (e.g., use keyBy instead of global windows).
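The per-shard write limits above give a quick lower bound on how many shards a workload needs. The following is a rough sizing sketch with hypothetical names; it assumes the 1 MB/s limit means 1,048,576 bytes per second and ignores headroom for traffic spikes and uneven partition keys.

```java
/** Hypothetical sizing helper based on the per-shard write limits. */
final class ShardCapacitySketch {
    static final long MAX_RECORDS_PER_SHARD = 1_000L;
    static final long MAX_BYTES_PER_SHARD = 1024L * 1024L; // assumes 1 MB = 1,048,576 bytes

    /** Minimum shards so that neither the record-rate nor the byte-rate limit is exceeded. */
    static long minShardsForWrites(long recordsPerSecond, long avgRecordBytes) {
        long byRecords = ceilDiv(recordsPerSecond, MAX_RECORDS_PER_SHARD);
        long byBytes = ceilDiv(recordsPerSecond * avgRecordBytes, MAX_BYTES_PER_SHARD);
        return Math.max(1, Math.max(byRecords, byBytes));
    }

    private static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }
}
```

At 100,000 records per second with 500-byte records, the record-rate limit dominates and the bound is 100 shards; with 2,000-byte records the byte-rate limit dominates and the bound rises to 191.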
Exam Objective Codes
This topic falls under Task Statement 3.5: 'Determine high-performing data ingestion and transformation solutions.' The exam will test your ability to choose between Kinesis Data Analytics for SQL and Apache Flink, and to understand Flink's capabilities.
Common Wrong Answers
'Kinesis Data Analytics for SQL can do custom Java code.' Wrong. SQL service only supports SQL queries. Flink is needed for custom logic.
'Flink provides at-most-once semantics.' Wrong. Flink provides exactly-once semantics via checkpointing.
'You must provision EC2 instances for Flink.' Wrong. Kinesis Data Analytics for Flink is serverless; you don't manage instances.
'You cannot update a running Flink application.' Wrong. You can update using the UpdateApplication API with savepoints.
Specific Numbers and Terms
Checkpoint interval default: 1 minute.
Flink versions: 1.11, 1.13, 1.15.
State backend: RocksDB (default) or in-memory.
Maximum parallelism: 256 per application.
Savepoint vs. checkpoint: Savepoints are user-triggered for upgrades; checkpoints are automatic for recovery.
Edge Cases
Resharding: If the source stream is resharded (split or merged), the Flink consumer automatically detects new shards. The exam may ask about this behavior: Flink consumers handle dynamic shard changes.
Idempotent sinks: For exactly-once sinks, you need idempotent writes or transactional sinks (e.g., FlinkKafkaProducer with exactly-once mode).
Late events: Use allowed lateness to handle out-of-order events. Default is 0.
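The idempotent-sink idea mentioned above can be shown with an upsert keyed by a deterministic identifier. This is a minimal in-memory sketch with hypothetical names, standing in for a real store such as a DynamoDB table: because the row key is derived from the record key and window end, replaying a window after recovery overwrites the same row instead of adding a duplicate.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for an idempotent (upsert-style) sink. */
final class IdempotentSinkSketch {
    private final Map<String, Long> table = new HashMap<>();

    /** Upserts the aggregate for (key, window); a replayed write replaces the earlier one. */
    void write(String key, long windowEndMs, long count) {
        table.put(key + "#" + windowEndMs, count);
    }

    /** Returns the stored aggregate, or null if the window was never written. */
    Long read(String key, long windowEndMs) {
        return table.get(key + "#" + windowEndMs);
    }

    int rowCount() {
        return table.size();
    }
}
```

Writing the same window result twice, as happens when Flink replays records after restoring a checkpoint, leaves a single row with the correct value.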
How to Eliminate Wrong Answers
If a question asks about processing streaming data with complex stateful logic, eliminate any answer that suggests Kinesis Data Analytics for SQL or Lambda (since Lambda has limited state and runtime). Choose Flink. If the question mentions 'exactly-once', look for Flink or Kinesis Data Streams with enhanced fan-out (but Flink is the only one that provides exactly-once processing). If the question asks about updating application logic without data loss, look for 'savepoint'.
Kinesis Data Analytics for Apache Flink is used for complex, stateful stream processing with exactly-once semantics.
The checkpoint interval defaults to 1 minute and can be shortened for faster recovery.
Parallelism should be set based on source shard count (1-2x shards).
Flink can read from Kinesis Data Streams and Amazon MSK.
Application updates use savepoints to preserve state.
RocksDB is the default state backend for large state.
CloudWatch metrics include millisBehindLatest and numberOfFailedCheckpoints.
These come up on the exam all the time. Here's how to tell them apart.
Kinesis Data Analytics for SQL
Supports only SQL queries for stream processing
No custom code; limited to predefined functions
At-least-once semantics
Cannot maintain large state; state is transient
Simpler to set up for basic aggregations
Kinesis Data Analytics for Apache Flink
Supports custom Java, Scala, Python code
Complex event processing, ML inference, joins
Exactly-once semantics via checkpointing
Stateful processing with RocksDB or in-memory state
Higher learning curve but more powerful
Mistake
Kinesis Data Analytics for Apache Flink is the same as Kinesis Data Analytics for SQL.
Correct
They are different. SQL service only supports SQL queries; Flink supports custom Java/Scala/Python code with stateful processing, exactly-once, and complex event processing.
Mistake
Flink applications run on provisioned EC2 instances that I must manage.
Correct
The service is fully managed. AWS handles the cluster of EC2 instances, including scaling, patching, and monitoring. You only upload your code.
Mistake
Flink provides at-most-once processing by default.
Correct
Flink provides exactly-once semantics via checkpointing. Checkpoints save operator state and source offsets, allowing recovery without data loss.
Mistake
You cannot update a running Flink application without stopping it.
Correct
You can update the application code using the UpdateApplication API. The service takes a savepoint, stops the app, replaces the JAR, and restarts from the savepoint.
Mistake
Flink can only read from Kinesis Data Streams.
Correct
Flink can also read from Amazon MSK (Kafka) and other sources via custom connectors. The service supports MSK as a source.
A checkpoint is an automatic, periodic snapshot of state used for recovery. It is triggered by the JobManager at a configurable interval (default 1 minute in the managed service). A savepoint is a manually triggered snapshot that is stored in a user-specified location. Savepoints are used for application upgrades, scaling, or debugging. Both capture a consistent view of state, but savepoints are retained until explicitly deleted and can be used to restart with different parallelism or code.
Yes. The service supports Amazon MSK as a source. You configure the Flink application with a Kafka consumer that reads from MSK topics. Because MSK brokers run inside a VPC, you must configure the application for VPC access and grant its IAM role the required permissions. This is useful if you already have Kafka infrastructure and want to use Flink for processing.
Flink's Kinesis consumer periodically discovers shards using the ListShards API. When a shard is split or merged, the consumer detects the new shards and automatically starts reading from them. It uses sequence numbers to track progress. This is transparent to the application code. The exam may test that Flink handles dynamic shard changes.
The JobManager detects the failure and restarts the tasks on available TaskManagers. The tasks are restored from the last successful checkpoint. The source consumer rewinds to the checkpointed offset, ensuring no data loss. This is part of Flink's fault tolerance mechanism.
Use the UpdateApplication API (or console) to specify a new S3 location for the JAR. The service will trigger a savepoint, stop the application, replace the code, and restart from the savepoint. This allows stateful upgrades without data loss. You can also update parallelism without code changes.
The default state backend is RocksDB, which stores state on local disk and is suitable for large state. You can also use the in-memory state backend for small state, but it is not recommended for production because all state must fit in the JVM heap. RocksDB is the recommended option.
Yes. You can embed a machine learning model in your Flink code (e.g., using TensorFlow or PMML) to score records in real time. The model can be loaded during initialization. This is a common use case for fraud detection, anomaly detection, or recommendation systems.