This chapter covers Amazon Security Lake, a service that automatically centralizes an organization's security data from AWS, SaaS providers, on-premises, and third-party sources into a purpose-built data lake stored in Amazon S3. It is a key topic under Domain 4: Security, Objective 4.2 (Configure and manage security services). On the SOA-C02 exam, Security Lake typically appears in 2-3 questions, often mixed with questions about centralized logging, security analytics, and compliance. You must understand its architecture, how it normalizes data to the Open Cybersecurity Schema Framework (OCSF), and how it integrates with services like Amazon Athena, AWS Lake Formation, and Amazon GuardDuty.
Imagine a large corporation with dozens of departments, each generating their own security reports: access logs from the front desk, motion sensor logs from security guards, badge swipe logs from HR, and server logs from IT. Each department stores their logs in their own filing cabinet, using different formats and retention schedules. The CEO wants a unified view of all security events across the company to detect threats and pass audits. Instead of replacing each department's system, the company hires a central librarian who sets up a standard filing system: all logs must be converted to a common format (like the Open Cybersecurity Schema Framework, OCSF) and stored in a central library (Amazon S3). The librarian uses a standard index (AWS Glue table) to make all logs searchable via a single query tool (Amazon Athena). Each department can still keep their original logs, but they must send copies to the central library. The librarian also ensures that only authorized personnel can access the library, and automatically retains logs for the required period. If a department sends logs in the wrong format, the librarian rejects them or converts them. This is exactly how Amazon Security Lake works: it centralizes security logs from AWS services, third-party sources, and on-premises systems into a purpose-built S3 bucket, normalizes them to OCSF, and makes them queryable via Athena and integrated with analytics tools. The 'librarian' is the Security Lake service running the conversion and cataloging pipeline.
What is Amazon Security Lake and Why Does It Exist?
Amazon Security Lake is a fully managed service that automatically centralizes security data from AWS environments, SaaS providers, on-premises sources, and third-party security tools into a purpose-built data lake stored in your AWS account. It was announced in preview at AWS re:Invent 2022 and became generally available in 2023, to address the challenge of fragmented security logs across multiple accounts, regions, and formats. Before Security Lake, security teams had to build custom pipelines by hand to collect logs from AWS CloudTrail, Amazon VPC Flow Logs, AWS Security Hub, Amazon Route 53 Resolver, and third-party sources like CrowdStrike, Palo Alto Networks, and Okta. They then had to normalize these logs into a common schema for analysis, which was time-consuming and error-prone. Security Lake automates this entire process, delivering a unified, queryable security data lake.
How Security Lake Works Internally
Security Lake operates as a managed ingestion and transformation pipeline. Here's a step-by-step breakdown:
Source Discovery and Subscription: You configure Security Lake to collect data from specific sources. AWS sources include:
- AWS CloudTrail management events
- AWS CloudTrail data events (S3, Lambda)
- Amazon VPC Flow Logs
- AWS Security Hub findings
- Amazon Route 53 Resolver query logs
- Amazon EKS audit logs
- AWS WAF logs
- Third-party sources via OCSF-compatible integrations (e.g., CrowdStrike, Palo Alto Networks, Okta, Proofpoint, Trend Micro)
- Custom sources that you register and that deliver OCSF-formatted data
Data Collection: For natively supported AWS services, Security Lake collects logs and events directly from the service through an independent, duplicated stream of events, so you do not have to build or maintain your own delivery pipeline. For third-party and custom sources, the provider writes OCSF-formatted data to a designated prefix in the Security Lake S3 bucket.
Schema Normalization to OCSF: All collected data is transformed into the Open Cybersecurity Schema Framework (OCSF) version 1.0.0 or later. OCSF provides a standardized set of event classes (e.g., Network Activity, Process Activity, File Activity, Authentication) with consistent fields. For example, a VPC Flow Log entry and a CloudTrail event both get mapped to OCSF fields like src_endpoint, dst_endpoint, time, severity_id, activity_id, etc. Security Lake performs this normalization for natively supported AWS sources as part of the managed service; third-party and custom sources must deliver data that already conforms to OCSF.
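To make the normalization step concrete, here is a minimal Python sketch that maps one default-format VPC Flow Log line onto a handful of OCSF-style fields. The function name `flow_log_to_ocsf` and the reduced field set are illustrative assumptions; the real mapping is performed by the service and covers many more attributes.

```python
from typing import Any, Dict

# Field order of the default (version 2) VPC Flow Log record format.
FLOW_LOG_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def flow_log_to_ocsf(line: str) -> Dict[str, Any]:
    """Map one space-separated flow log line to a simplified OCSF-style event.

    Hypothetical helper for illustration only; it is not the mapping that
    Security Lake applies and it ignores NODATA/SKIPDATA records.
    """
    values = line.split()
    if len(values) != len(FLOW_LOG_FIELDS):
        raise ValueError(f"expected {len(FLOW_LOG_FIELDS)} fields, got {len(values)}")
    record = dict(zip(FLOW_LOG_FIELDS, values))
    return {
        "class_name": "Network Activity",
        # Flow logs use epoch seconds; OCSF timestamps are epoch milliseconds.
        "time": int(record["start"]) * 1000,
        "src_endpoint": {"ip": record["srcaddr"], "port": int(record["srcport"])},
        "dst_endpoint": {"ip": record["dstaddr"], "port": int(record["dstport"])},
        "traffic": {"bytes": int(record["bytes"]), "packets": int(record["packets"])},
        "connection_info": {"protocol_num": int(record["protocol"])},
        # Simplified: the raw ACCEPT/REJECT action is kept as-is.
        "action": record["action"],
    }


if __name__ == "__main__":
    sample = ("2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
              "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")
    print(flow_log_to_ocsf(sample))
```

The point of the shape is that a query written against `src_endpoint.ip` works the same way regardless of which source produced the event.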
Storage in Purpose-Built S3 Bucket: Security Lake creates an Amazon S3 bucket in each Region where it is enabled, with a Hive-style prefix structure partitioned by source, Region, account, and event day. The bucket is encrypted (SSE-S3 or SSE-KMS) and blocks public access. AWS sources are stored under the aws/ prefix and custom sources under ext/. For AWS sources the bucket is organized as:
s3://aws-security-data-lake-<region>-<random-string>/
├── aws/
│ ├── <source>/
│ │ ├── <version>/
│ │ │ ├── region=<region>/
│ │ │ │ ├── accountId=<account-id>/
│ │ │ │ │ ├── eventDay=<YYYYMMDD>/
│ │ │ │ │ │ ├── <parquet-files>
Data is stored in Apache Parquet format, which is columnar and optimized for analytics.
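The key layout can be reproduced in a few lines of Python, which is handy when you need to list or spot-check objects for one partition. This sketch assumes the Hive-style layout that AWS documents for natively supported sources (`aws/<source>/<version>/region=.../accountId=.../eventDay=YYYYMMDD/`); `partition_prefix` is a hypothetical helper, not part of any AWS SDK.

```python
from datetime import date


def partition_prefix(source: str, version: str, region: str,
                     account_id: str, day: date) -> str:
    """Build the S3 key prefix for one source/Region/account/day partition.

    Assumes the documented layout for AWS sources; custom sources live
    under an ext/ prefix instead and are not handled here.
    """
    return (
        f"aws/{source}/{version}/"
        f"region={region}/accountId={account_id}/eventDay={day:%Y%m%d}/"
    )


if __name__ == "__main__":
    print(partition_prefix("VPC_FLOW", "1.0", "us-east-1",
                           "123456789012", date(2024, 1, 15)))
```

Passing the result as the `Prefix` argument of an S3 list call limits the listing to a single partition instead of the whole bucket.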
Cataloging with AWS Glue: Security Lake automatically creates an AWS Glue Data Catalog database and a table for each source, making the data queryable via Amazon Athena, Amazon Redshift Spectrum, and other analytics services. Each table is partitioned to match the S3 layout (Region, account ID, and event day).
Lifecycle Management: Security Lake applies S3 Lifecycle policies to transition data to cheaper storage classes (e.g., S3 Standard-IA, S3 Glacier) based on retention settings you define. Default retention is 365 days, but you can set custom retention per source. Expired data is automatically deleted.
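As an illustration of how retention and storage-class transitions fit together, the following Python sketch builds a lifecycle settings dictionary and rejects transitions that would fire after the data has already expired. The dictionary shape (`expiration.days`, `transitions[].days`, `transitions[].storageClass`) is modeled on the lifecycle element of the data lake configuration and should be treated as an assumption to verify against the current API reference; `lifecycle_configuration` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def lifecycle_configuration(retention_days: int,
                            transitions: List[Tuple[int, str]]) -> Dict:
    """Build lifecycle settings from a retention period and (days, class) pairs.

    Every transition must happen before the data expires, otherwise the
    transition would never take effect.
    """
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    ordered = sorted(transitions)
    for days, storage_class in ordered:
        if days >= retention_days:
            raise ValueError(
                f"transition to {storage_class} at day {days} is not before "
                f"expiration at day {retention_days}"
            )
    return {
        "expiration": {"days": retention_days},
        "transitions": [
            {"days": days, "storageClass": storage_class}
            for days, storage_class in ordered
        ],
    }


if __name__ == "__main__":
    print(lifecycle_configuration(365, [(90, "GLACIER"), (30, "STANDARD_IA")]))
```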
Access Control via Lake Formation: Security Lake integrates with AWS Lake Formation to manage fine-grained access permissions at the table, column, or row level. You can grant IAM principals access to specific sources or fields.
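A column-level grant can be sketched as follows. The function only assembles the keyword arguments for the Lake Formation `GrantPermissions` call; the role ARN, table name, and column list in the example are hypothetical placeholders.

```python
from typing import Dict, Iterable


def column_grant_request(principal_arn: str, database: str, table: str,
                         columns: Iterable[str]) -> Dict:
    """Assemble arguments for a SELECT grant limited to specific columns."""
    column_names = list(columns)
    if not column_names:
        raise ValueError("at least one column is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": column_names,
            }
        },
        "Permissions": ["SELECT"],
    }


if __name__ == "__main__":
    request = column_grant_request(
        "arn:aws:iam::123456789012:role/SocAnalyst",  # hypothetical role
        "amazon_security_lake_glue_db_us_east_1",
        "amazon_security_lake_table_us_east_1_vpc_flow_1_0",
        ["time", "src_endpoint", "dst_endpoint"],
    )
    print(request)
    # With credentials configured, the grant could then be applied with:
    #   boto3.client("lakeformation").grant_permissions(**request)
```

Keeping the request construction separate from the API call makes the grant easy to review or unit test before it is applied.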
Key Components, Values, Defaults, and Timers
OCSF Version: Security Lake uses OCSF version 1.0.0 (as of 2024). The schema is extensible; you can add custom attributes.
Data Format: Apache Parquet (columnar). No other formats are supported natively.
Partitioning: Each source has its own table, partitioned by Region, account ID, and event day (eventDay=YYYYMMDD). Partition pruning is critical for query performance.
Retention: Default 365 days. You can set retention per source from 1 to 3650 days (10 years).
Encryption: SSE-S3 (default) or SSE-KMS with a customer managed key.
Bucket Name Pattern: aws-security-data-lake-<region>-<random-string> (the name is generated by Security Lake).
Glue Database Name: amazon_security_lake_glue_db_<region>, with the Region written using underscores (for example, amazon_security_lake_glue_db_us_east_1).
Glue Table Names: amazon_security_lake_table_<region>_<source>_<version> (for example, amazon_security_lake_table_us_east_1_vpc_flow_1_0).
Source Names: CLOUD_TRAIL_MGMT, LAMBDA_EXECUTION, S3_DATA, VPC_FLOW, SH_FINDINGS, ROUTE53, EKS_AUDIT, WAF. AWS sources are stored under the aws/ prefix in S3 and custom sources under ext/.
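Cost: Security Lake charges per GB for the data it ingests and normalizes from AWS sources. You also pay for the underlying resources, such as S3 storage, the Glue Data Catalog, Athena queries, and data transfer. No upfront fee.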
Limits: Up to 100 sources per Security Lake instance; up to 10 custom sources; maximum 10,000 partitions per table (default).
Configuration and Verification Commands
To create a Security Lake via AWS CLI:
aws securitylake create-data-lake --region us-east-1 --configurations '[{"region":"us-east-1","encryptionConfiguration":{"kmsKeyId":"S3_MANAGED_KEY"}}]' --meta-store-manager-role-arn 'arn:aws:iam::123456789012:role/SecurityLakeMetaStoreManager'
To enable a source (e.g., VPC Flow Logs):
aws securitylake create-aws-log-source --region us-east-1 --sources '[{"regions":["us-east-1"],"sourceName":"VPC_FLOW","sourceVersion":"1.0"}]'
To give a consumer (subscriber) access to that source:
aws securitylake create-subscriber --region us-east-1 --subscriber-identity '{"externalId":"abc123","principal":"123456789012"}' --subscriber-name "MySubscriber" --sources '[{"awsLogSource":{"sourceName":"VPC_FLOW","sourceVersion":"1.0"}}]' --access-types S3
To verify data is flowing, query with Athena:
SELECT * FROM "amazon_security_lake_glue_db_us_east_1"."amazon_security_lake_table_us_east_1_vpc_flow_1_0"
WHERE eventday = '20240115' LIMIT 10;
To check Glue tables:
aws glue get-tables --database-name amazon_security_lake_glue_db_us_east_1
Interaction with Related Technologies
Amazon GuardDuty: Security Lake can ingest GuardDuty findings via Security Hub integration. GuardDuty itself does not directly write to Security Lake; it writes to Security Hub, which Security Lake ingests.
AWS Security Hub: Security Lake ingests Security Hub findings (which include GuardDuty, Inspector, Macie, etc.).
Amazon Detective: Security Lake can serve as a data source for Detective to perform investigative analysis.
Amazon Athena: Primary query engine for Security Lake data. Use standard SQL.
Amazon QuickSight: Visualize Security Lake data using Athena as a data source.
AWS Lake Formation: Manages permissions on the Glue tables.
Amazon SageMaker: Use Security Lake data for machine learning anomaly detection.
Third-party SIEMs: Security Lake can export data to Splunk, Sumo Logic, or other SIEMs via S3 or Athena.
Exam-Relevant Details
Security Lake does not replace CloudTrail, VPC Flow Logs, or other native logging services. It aggregates and normalizes their output.
Security Lake is not a real-time service. Data is typically available within 15-30 minutes after the log is generated, because it must pass through the ingestion and normalization pipeline first.
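Security Lake supports multi-account aggregation through AWS Organizations: a delegated administrator account enables and manages Security Lake for the member accounts, and rollup Regions consolidate data from several Regions. Subscribers are the consumers that are granted data access (S3) or query access (Lake Formation) to that data.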
Security Lake cannot ingest data from on-premises sources directly unless they push OCSF-formatted data to S3 or use a custom Glue connector.
Security Lake is available in all commercial AWS regions except China and GovCloud (as of 2024).
Enable Security Lake in Region
You first create a Security Lake instance in a region. During creation, you specify an IAM role (MetaStoreManager) that Security Lake assumes to manage the data lake's Glue metadata. Security Lake also creates a service-linked role (SLR) named AWSServiceRoleForSecurityLake. This step sets up the infrastructure: an S3 bucket named aws-security-data-lake-<region>-<random-string>, a Glue database named amazon_security_lake_glue_db_<region>, and Glue tables for each source you enable. The bucket is automatically configured to enforce encryption and block public access. Default retention is set to 365 days. This step happens once per region.
Configure Sources to Ingest
After the data lake is created, you enable specific sources. For AWS sources, you can enable them directly from the Security Lake console or CLI, choosing the accounts and Regions to cover. For example, enabling VPC Flow Logs as a source lets Security Lake collect flow data for the selected accounts without you creating or managing individual flow logs. For third-party sources, the provider delivers OCSF-formatted data to a location in the data lake that Security Lake provisions for that source. You can also add custom sources by converting your own data to OCSF before delivering it. Each source has a version (e.g., 1.0). You can enable up to 100 sources.
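Enabling a source programmatically comes down to sending a small list of source descriptors. The sketch below builds that payload in Python and validates the source name against the natively supported set; `aws_log_source_request` is a hypothetical helper, and the payload keys mirror the `sources` parameter of the `create-aws-log-source` operation as an assumption to check against the current API reference.

```python
from typing import Dict, List, Optional

# Source names accepted for natively supported AWS sources.
SUPPORTED_SOURCES = {
    "CLOUD_TRAIL_MGMT", "LAMBDA_EXECUTION", "S3_DATA", "VPC_FLOW",
    "SH_FINDINGS", "ROUTE53", "EKS_AUDIT", "WAF",
}


def aws_log_source_request(source_name: str, regions: List[str],
                           accounts: Optional[List[str]] = None,
                           source_version: Optional[str] = None) -> Dict:
    """Build the request body for enabling one AWS log source."""
    if source_name not in SUPPORTED_SOURCES:
        raise ValueError(f"unsupported source: {source_name}")
    if not regions:
        raise ValueError("at least one Region is required")
    source: Dict = {"sourceName": source_name, "regions": list(regions)}
    if accounts:
        source["accounts"] = list(accounts)
    if source_version:
        source["sourceVersion"] = source_version
    return {"sources": [source]}


if __name__ == "__main__":
    request = aws_log_source_request("VPC_FLOW", ["us-east-1"],
                                     accounts=["123456789012"])
    print(request)
    # With credentials configured, this could be submitted with:
    #   boto3.client("securitylake").create_aws_log_source(**request)
```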
Data Ingestion and Transformation
Once sources are enabled, data flows into Security Lake. For AWS sources, Security Lake collects events directly from CloudTrail, VPC Flow Logs, and the other enabled services; any existing delivery to native destinations (e.g., CloudWatch Logs, S3) is unaffected. The managed pipeline transforms the events into OCSF Parquet format and writes them to the Security Lake S3 bucket, mapping source-specific fields to OCSF fields. For example, a CloudTrail event's 'eventSource' is mapped to the OCSF 'api.service.name' attribute. Because this processing runs in batches rather than as a stream, expect a delay (typically around 15 minutes or more) before new events are queryable. If the transformation fails, the event is skipped and logged in CloudWatch Logs for debugging.
Cataloging and Partitioning
After data is written to S3, Security Lake updates the AWS Glue Data Catalog. Within each source's table it registers a partition for every combination of Region, account, and event day. The Glue table schema is defined by OCSF. Security Lake keeps the partition metadata up to date as new data arrives, and you can query the data using Athena as soon as the partition is registered. Partitioning is crucial for query performance: always filter by partition columns (region, accountid, eventday) to avoid full table scans.
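The advice about filtering on partition columns can be turned into a small query builder. This is a minimal sketch that assumes the partition columns are named `region`, `accountid`, and `eventday` (with `eventday` formatted as YYYYMMDD); `pruned_query` is a hypothetical helper, and the input checks exist only because the values are interpolated into SQL text.

```python
import re
from datetime import date


def pruned_query(database: str, table: str, region: str,
                 account_id: str, day: date, limit: int = 10) -> str:
    """Return an Athena query restricted to a single partition."""
    for name in (database, table):
        if not re.fullmatch(r"[a-z0-9_]+", name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
        raise ValueError(f"unexpected Region: {region!r}")
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("account_id must be 12 digits")
    return (
        f'SELECT * FROM "{database}"."{table}" '
        f"WHERE region = '{region}' AND accountid = '{account_id}' "
        f"AND eventday = '{day:%Y%m%d}' LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(pruned_query(
        "amazon_security_lake_glue_db_us_east_1",
        "amazon_security_lake_table_us_east_1_vpc_flow_1_0",
        "us-east-1", "123456789012", date(2024, 1, 15),
    ))
```

Because all three predicates are on partition columns, Athena can skip every S3 prefix that does not match, which reduces both latency and the amount of data scanned.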
Query and Analyze Data
With the Glue catalog in place, you can query the security data using Amazon Athena with standard SQL. Example: SELECT * FROM amazon_security_lake_table_us_east_1_vpc_flow_1_0 WHERE src_endpoint.ip = '10.0.0.1' AND eventday = '20240115'. You can also use Amazon QuickSight for visualization, Amazon SageMaker for ML, or connect third-party analytics tools via JDBC/ODBC. Security Lake also supports exporting data to Amazon OpenSearch Service via the Security Lake OpenSearch integration. Access to data is controlled by Lake Formation permissions. You can grant IAM roles or users access to specific tables or columns. For cross-account access, you use the subscriber model where a subscriber account gets read-only access to the data lake.
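Running such a query from code follows the usual Athena pattern: start the execution, then poll until it reaches a terminal state. The sketch below takes an Athena client as a parameter (for example `boto3.client("athena")`) so it can be exercised with a stub; `run_athena_query`, the polling limits, and the output location are assumptions for illustration.

```python
import time


def run_athena_query(athena, sql: str, database: str, output_location: str,
                     poll_seconds: float = 1.0, max_polls: int = 60) -> str:
    """Start an Athena query and wait for it to finish.

    Returns the query execution ID on success, raises RuntimeError if the
    query fails or is cancelled, and TimeoutError if it does not finish
    within max_polls checks.
    """
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    for _ in range(max_polls):
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")
```

Results can then be fetched with the client's `get_query_results` call using the returned ID, or read directly from the output location in S3.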
Enterprise Scenario 1: Centralized Security Operations Center (SOC)
A large financial services company operates across 50 AWS accounts and 3 regions. Their SOC team needs a single pane of glass to investigate security incidents. Previously, they manually collected CloudTrail logs, VPC Flow Logs, and Security Hub findings into a self-managed data lake using custom Python scripts. This was brittle and often broke when log formats changed. They deployed Security Lake in a dedicated security account (central account), registered as the delegated Security Lake administrator through the AWS Organizations integration. From that account they enabled all AWS sources for every member account, and they configured a rollup Region so that data from all three regions is consolidated in one place. Now, the SOC team uses Amazon Athena to run queries like 'find all API calls from a suspicious IP in the last 24 hours' across all accounts. They also use Amazon QuickSight dashboards to visualize top threats. The key benefit is that Security Lake automatically normalizes data to OCSF, so they no longer need to write transformation logic. They set retention to 90 days for most sources and 1 year for CloudTrail. A common misconfiguration: they initially forgot to enable the required service-linked role in member accounts, causing ingestion to fail. They also learned to use partition pruning by always filtering on the 'region' and 'accountid' columns to keep queries fast and cost-effective.
Enterprise Scenario 2: Compliance and Auditing
A healthcare organization subject to HIPAA needs to demonstrate that they have a centralized audit trail of all access to protected health information (PHI). They use Security Lake to aggregate CloudTrail data events (S3 object-level operations) and VPC Flow Logs. They also ingest custom application logs from their on-premises EHR system by pushing OCSF-formatted JSON to an S3 bucket. Security Lake stores data in SSE-KMS encrypted buckets with a customer-managed key to meet encryption requirements. They use AWS Lake Formation to restrict access to the CloudTrail data events table to only the compliance team, while the SOC team gets access to VPC Flow Logs and Security Hub findings. They set retention to 7 years for CloudTrail data events to comply with HIPAA. They use Athena to run periodic queries to detect unauthorized access attempts. A challenge: the custom application logs sometimes had malformed OCSF fields, causing Glue ETL jobs to fail. They set up CloudWatch alarms on Glue job failures to quickly identify and fix issues.
Enterprise Scenario 3: Threat Detection with Machine Learning
A technology company uses Security Lake to feed data into Amazon SageMaker for anomaly detection. They ingest VPC Flow Logs, DNS logs, and CloudTrail management events. They use SageMaker to train a model that detects unusual network traffic patterns based on historical data from Security Lake. The model is deployed as a SageMaker endpoint that queries Security Lake data in real-time via Athena. They also use Security Lake's integration with Amazon Detective to automatically enrich findings. Performance consideration: querying large volumes of historical data (terabytes) can be slow if not partitioned correctly. They optimized by using Athena workgroups with result caching and by creating materialized views for common queries. A common mistake: they initially used the default retention of 365 days for all sources, but for the ML training data they only needed 90 days, so they reduced retention to save costs.
The SOA-C02 exam tests Amazon Security Lake under Objective 4.2 (Configure and manage security services). Expect 2-3 questions that may be standalone or part of a multi-service scenario (e.g., combined with GuardDuty, Security Hub, and CloudTrail). The exam focuses on:
Purpose and Architecture: You must know that Security Lake centralizes and normalizes security logs into OCSF format. It does not replace existing logging services—it aggregates them. A common wrong answer is 'Security Lake replaces CloudTrail for logging management events.' That is false; CloudTrail still logs to its own S3 bucket, and Security Lake copies and transforms that data.
Data Format and Schema: The exam asks about OCSF and Parquet. A common trap: 'Security Lake stores data in JSON format.' Wrong—it uses Parquet for columnar storage and query efficiency. Another trap: 'Security Lake uses its own proprietary schema.' Wrong—it uses the open standard OCSF.
Source Types: You must know which sources are supported natively. The exam may list a source like 'AWS Config' and ask if Security Lake can ingest it. The answer is no—Config is not a supported source (though Config data can be sent via Security Hub findings). Another trap: 'Security Lake can ingest on-premises syslog directly.' Wrong—it requires OCSF-formatted data in S3 or a custom Glue connector.
Integration with Other Services: Questions often test how Security Lake works with Lake Formation, Athena, and Glue. For example, 'Which service is used to manage fine-grained access to Security Lake data?' Answer: AWS Lake Formation. 'Which service is used to query Security Lake data?' Answer: Amazon Athena. A common wrong answer: 'Amazon Redshift'—while Redshift Spectrum can query, the primary query engine is Athena.
Cross-Account Setup: The exam tests multi-account aggregation. You may be asked: 'How do you aggregate security logs from multiple accounts into a central Security Lake?' Answer: Use the AWS Organizations integration, designate a delegated administrator account for Security Lake, and enable sources for the member accounts from there; subscribers are then granted access to consume the data. A trap: 'Use VPC peering to send logs.' Wrong: Security Lake relies on IAM roles and resource-based policies, not on network connectivity between VPCs.
Retention and Lifecycle: Know that default retention is 365 days, and you can set per-source retention. A common wrong answer: 'Security Lake automatically archives data to Glacier after 90 days.' Wrong—you configure lifecycle policies.
Limits: Know the 100-source limit and 10-custom-source limit. The exam may present a scenario where a company has 150 sources and ask what to do—answer: use multiple Security Lake instances or prioritize sources.
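Cost Model: Understand that Security Lake charges per GB for the data it ingests and normalizes from AWS sources, and that you also pay for the underlying resources such as S3 storage, the Glue Data Catalog, and Athena queries. A trap statement: 'Security Lake has no per-GB ingestion fee.' Wrong: ingestion and normalization of AWS source data are billed by volume.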
Edge cases: Security Lake is not available in China or GovCloud regions. Also, Security Lake does not support real-time queries—data is delayed by 15-30 minutes. The exam may test this by asking about latency.
Security Lake centralizes security logs from AWS, third-party, and on-premises sources into a single S3 data lake.
All data is normalized to the Open Cybersecurity Schema Framework (OCSF) version 1.0.0 and stored in Apache Parquet format.
Data is organized per source and partitioned by Region, account, and event day for efficient querying.
Default retention is 365 days; you can set custom retention per source from 1 to 3650 days.
Primary query tool is Amazon Athena; access control is managed via AWS Lake Formation.
Security Lake supports up to 100 sources (including up to 10 custom sources) per instance.
Data is typically available within 15-30 minutes after generation (not real-time).
Security Lake does not replace CloudTrail, VPC Flow Logs, or other native logging services.
These come up on the exam all the time. Here's how to tell them apart.
Amazon Security Lake
Fully managed: no need to build ingestion pipelines, transformation logic, or Glue crawlers.
Automatically normalizes data to OCSF, ensuring consistent schema across sources.
Integrated with Lake Formation for fine-grained access control out of the box.
Supports multi-account aggregation through AWS Organizations and a delegated administrator, without hand-built cross-account pipelines.
Cost includes per-GB Security Lake ingestion and normalization charges for AWS sources, plus S3 storage and the Glue Data Catalog.
Self-Managed Data Lake on S3
Requires custom development of ingestion pipelines (e.g., Lambda, Kinesis, Glue jobs).
You must define and maintain your own schema and transformation logic.
Access control must be manually configured using S3 bucket policies and IAM.
Cross-account setup requires custom IAM roles and S3 bucket policies.
Cost includes S3 storage, compute for ingestion/transformation, and Glue Catalog; you can optimize costs by using spot instances.
Mistake
Security Lake stores data in JSON format for easy readability.
Correct
Security Lake stores data in Apache Parquet, a columnar format optimized for analytics and query performance. JSON is used only for ingestion from third parties, but the stored data is Parquet.
Mistake
Security Lake can ingest any AWS log source automatically.
Correct
Security Lake natively supports a specific set of AWS sources: CloudTrail management events, CloudTrail data events (S3 and Lambda), VPC Flow Logs, Security Hub findings, Route 53 Resolver query logs, EKS audit logs, and AWS WAF logs. Sources like AWS Config, S3 server access logs, Network Firewall logs, arbitrary CloudWatch Logs groups, or ALB access logs are not natively supported and must be brought in as custom sources.
Mistake
Security Lake replaces the need for CloudTrail or VPC Flow Logs.
Correct
Security Lake does not replace these services. They remain the systems that generate the events; Security Lake collects a copy, normalizes it to OCSF, and stores it in its own S3 bucket. Existing trails and flow logs keep delivering to their native destinations unchanged.
Mistake
Security Lake provides real-time security event streaming.
Correct
Security Lake is not real-time. Data is typically available within 15-30 minutes after generation, because it must pass through the ingestion and normalization pipeline first. For real-time needs, use Amazon CloudWatch Logs or Amazon Kinesis.
Mistake
You can query Security Lake data using CloudWatch Logs Insights.
Correct
Security Lake data is stored in S3 and cataloged in Glue, not in CloudWatch Logs. The primary query tool is Amazon Athena. CloudWatch Logs Insights cannot query S3 data.
Amazon Security Lake is a managed service that automatically centralizes security data from AWS, SaaS, and on-premises sources into a purpose-built S3 data lake. It works by ingesting logs from supported sources (e.g., CloudTrail, VPC Flow Logs, Security Hub), transforming them into the Open Cybersecurity Schema Framework (OCSF) format, storing them as Parquet files in an S3 bucket, and cataloging them in AWS Glue for querying via Athena. Data is partitioned by source, region, account, and date. It is not real-time; expect 15-30 minute delays.
Security Lake supports these AWS sources: CloudTrail management events, CloudTrail data events (S3 and Lambda), VPC Flow Logs, Security Hub findings, Route 53 Resolver query logs, EKS audit logs, and AWS WAF logs. Third-party sources include CrowdStrike, Palo Alto Networks, Okta, Proofpoint, Trend Micro, and others via OCSF integration. You can also add custom sources that deliver OCSF-formatted data. It does not natively support AWS Config, S3 server access logs, Network Firewall logs, arbitrary CloudWatch Logs groups, or ALB access logs.
Security Lake integrates with AWS Organizations. You designate a delegated administrator account that enables Security Lake and its log sources for the member accounts, and you can configure rollup Regions so that data from several Regions is consolidated in one place. Consumers in other accounts are added as subscribers, which grants them either direct S3 data access or Lake Formation query access, scoped by an external ID. This is different from simply sending logs to a central S3 bucket; Security Lake manages the cross-account IAM roles and policies automatically.
Security Lake charges per GB for the data it ingests and normalizes from AWS sources. On top of that, you pay for the underlying AWS resources: S3 storage for the data lake, the AWS Glue Data Catalog for metadata, Athena queries, and data transfer. Costs vary based on the volume of logs and the number of sources. You can reduce costs by enabling only the sources you need and by setting appropriate retention periods and lifecycle policies.
No, Security Lake is not a real-time service. Data is typically available within 15-30 minutes after the log event occurs, because it must pass through the ingestion and normalization pipeline first. For real-time security monitoring, you should use services like Amazon CloudWatch Logs, Amazon Kinesis, or Amazon Managed Service for Apache Flink. Security Lake is designed for historical analysis, threat hunting, and compliance auditing.
Security Lake secures data at rest using server-side encryption (SSE-S3 or SSE-KMS). Access to the S3 bucket is restricted by a bucket policy that blocks public access. Fine-grained access to the Glue tables (columns, rows) is managed via AWS Lake Formation. IAM policies control who can query data via Athena. Additionally, you can enable S3 Object Lock for write-once-read-many (WORM) compliance.
The default retention period is 365 days. You can configure a different retention period for each source, ranging from 1 to 3650 days (approximately 10 years). Security Lake applies S3 Lifecycle policies to transition data to cheaper storage classes (e.g., S3 Standard-IA, S3 Glacier Instant Retrieval, S3 Glacier Deep Archive) based on your settings. Expired data is automatically deleted.