This chapter covers Amazon Macie, a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect sensitive data in Amazon S3. For the SAA-C03 exam, Macie is a key service within the Secure Architectures domain (Objective 1.4: Design secure access to AWS resources) and appears in approximately 5-10% of questions, often as a distractor or correct answer for data discovery scenarios. Understanding Macie's capabilities, limitations, and integration with other services is essential for selecting the right data protection solution.
Imagine Amazon S3 as a massive public library where anyone can donate books (upload objects) without oversight. The library has millions of books scattered across shelves (buckets) with varying access permissions. Amazon Macie acts like a specialized security librarian who continuously patrols the library. This librarian has a trained eye for sensitive content: she can instantly spot a credit card number on a page (like 4111-1111-1111-1111), a social security number (like 123-45-6789), or a passport number. She doesn't just look at book covers; she reads the actual pages (object content) using machine learning and pattern matching. When she finds a sensitive document, she doesn't remove it (that would be destructive). Instead, she creates a detailed report card (finding) that says: "Book titled 'payroll-2023.xlsx' on shelf 'finance-bucket' contains 50 credit card numbers. It is publicly accessible." She files this report in a central log (Amazon EventBridge and Security Hub) and can even trigger an alarm (notification) to the head librarian. Importantly, she works passively: she never changes permissions or deletes data; she only discovers and reports. This mirrors Macie's design: it is a fully managed data security service that uses machine learning and pattern matching to discover and protect sensitive data in S3, providing visibility without modifying the underlying storage.
What is Amazon Macie and Why Does It Exist?
Amazon Macie is a fully managed data security service that uses machine learning (ML) and pattern matching to automatically discover, classify, and protect sensitive data stored in Amazon S3. It was introduced to address the challenge of identifying and securing sensitive data at scale, such as personally identifiable information (PII), financial data, and intellectual property, without requiring manual inspection or custom scripts. Macie helps organizations comply with regulations like GDPR, HIPAA, and PCI DSS by providing visibility into where sensitive data resides, how it is accessed, and whether it is exposed to unintended audiences.
How Macie Works Internally
Macie operates in two main phases: discovery and classification. First, it creates an inventory of all S3 objects in the buckets you enable it for (you can select specific buckets or all buckets). Macie then uses a combination of managed data identifiers (built-in patterns for common sensitive data types) and custom data identifiers (user-defined patterns via regex) to scan objects. The scanning process is event-driven: when a new object is uploaded or an existing object is modified, Macie automatically triggers a scan. For existing objects, you can run a one-time or scheduled full scan.
Internally, Macie uses a serverless architecture. It runs on ephemeral, AWS-managed compute resources (the underlying implementation is not exposed to customers) that read object content from S3. It does not copy data out of S3; it reads directly from the bucket. The service evaluates object metadata (e.g., bucket name, object key, tags) and content (for supported file types: CSV, JSON, XML, TSV, text files, and some binary formats like PDF, Microsoft Office documents, and image files with embedded text). For binary files, Macie uses optical character recognition (OCR) to extract text.
Once scanned, Macie assigns a sensitivity score (1 to 100) to each object based on the number and type of sensitive data findings. A higher score indicates more sensitive data. Findings are then published to Amazon EventBridge and can be sent to AWS Security Hub, Amazon Detective, or Amazon Simple Notification Service (SNS) for alerting and automation.
Key Components, Values, and Defaults
Managed Data Identifiers: Over 100 built-in identifiers for common sensitive data types, including credit card numbers (Luhn check), US Social Security numbers, UK National Insurance numbers, AWS secret access keys, and more. These are updated by AWS.
Custom Data Identifiers: User-defined regular expressions (regex) with optional keywords and proximity rules. You can define up to 50 custom identifiers per account per Region.
Sensitivity Score: A numeric value from 1 to 100, calculated based on the count and types of findings. For example, a single credit card number might yield a score of 10, while 100 credit cards might yield 80.
Findings: Macie generates findings for policy violations (e.g., bucket publicly accessible) and sensitive data discoveries. Findings are retained for 90 days by default.
Buckets: Macie can monitor up to 1,000 buckets per account per Region (soft limit, can be increased).
Object size: Macie scans objects up to 1 GB in size. Larger objects are not scanned.
File types: Supports common text-based formats (CSV, JSON, XML, TSV, plain text) and binary formats (PDF, DOCX, XLSX, PPTX, image files).
Cost: Charged per GB of data scanned and per object (for metadata evaluation). There is also a monthly per-account fee.
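The Luhn check mentioned for the credit card identifier can be sketched in a few lines. This is a minimal illustration of the checksum itself, not Macie's implementation; the function name luhn_valid is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep digits only, so separators such as '-' or spaces are ignored.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111-1111-1111-1111"))
    print(luhn_valid("4111-1111-1111-1112"))
```

A checksum like this is why a random 16-digit string is less likely to be reported as a card number than a pattern match alone would suggest.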
Configuration and Verification
To enable Macie:
1. Open the Macie console and choose 'Get started'. You must first enable Macie (it is not enabled by default).
2. Specify the S3 buckets to monitor. You can select all buckets or specific ones based on tags or bucket names.
3. Optionally create custom data identifiers under 'Custom data identifiers'.
4. Macie automatically begins scanning. You can view findings in the 'Findings' tab.
To verify Macie is working:
Check the 'Summary' dashboard for the number of buckets, objects, and findings.
Use aws macie2 list-findings (CLI) to retrieve findings.
Monitor EventBridge events for Macie Finding events.
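The CLI check above has a direct SDK equivalent. The sketch below assumes boto3 is installed and credentials are configured; fetch_findings and count_by_type are hypothetical helper names, and the sample finding types are illustrative values.

```python
from collections import Counter


def count_by_type(findings):
    """Count findings per `type` value, e.g. sensitive data vs. policy findings."""
    return Counter(f.get("type", "UNKNOWN") for f in findings)


def fetch_findings(region_name, max_results=25):
    """Fetch one page of findings; needs boto3 and Macie enabled in the Region."""
    import boto3  # imported lazily so count_by_type works without boto3

    client = boto3.client("macie2", region_name=region_name)
    finding_ids = client.list_findings(maxResults=max_results)["findingIds"]
    if not finding_ids:
        return []
    return client.get_findings(findingIds=finding_ids)["findings"]


if __name__ == "__main__":
    # Offline sample so the summary logic can be tried without an AWS account.
    sample = [
        {"type": "SensitiveData:S3Object/Financial"},
        {"type": "SensitiveData:S3Object/Financial"},
        {"type": "Policy:IAMUser/S3BucketPublic"},
    ]
    print(count_by_type(sample))
```

An empty result right after enabling Macie is not necessarily a problem: findings only appear once something has been scanned or a bucket-level issue has been detected.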
Integration with Related Technologies
AWS Security Hub: Macie findings are automatically ingested into Security Hub, providing a central view of security alerts.
Amazon EventBridge: Macie sends findings as events, enabling automated workflows (e.g., trigger a Lambda function to apply bucket policy).
AWS Organizations: Macie can be deployed across multiple accounts using a delegated administrator account, allowing centralized management.
Amazon Detective: Macie findings can be investigated in Detective to understand the context of data exposure.
AWS CloudTrail: Macie uses CloudTrail to monitor S3 API calls and detect suspicious activity (e.g., unusual data access patterns).
Limitations
Macie does not remediate issues automatically. It only discovers and reports. You must take action (e.g., modify bucket policies, encrypt objects) separately.
Macie does not scan objects in S3 Glacier or S3 Glacier Deep Archive unless you restore them first.
Macie does not support scanning of objects in S3 on Outposts or S3 Express One Zone.
Macie cannot scan objects encrypted with customer-provided keys (SSE-C) because it cannot access the plaintext.
Macie is a regional service: data classification and findings are per-Region. You must enable it in each Region where you have S3 data.
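The limitations above, together with the object size limit listed earlier, can be expressed as a simple pre-check. This is a sketch that encodes the limits as stated in this chapter, not logic taken from Macie; skip_reason and the "SSE-C" label are hypothetical, while the storage class strings follow the values S3 reports for objects.

```python
ONE_GB = 1024 ** 3  # size limit as stated in this chapter

ARCHIVE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}
UNSUPPORTED_CLASSES = {"OUTPOSTS", "EXPRESS_ONEZONE"}


def skip_reason(storage_class, encryption, size_bytes):
    """Return why an object would be skipped, or None if it looks scannable."""
    if storage_class in ARCHIVE_CLASSES:
        return "archived: restore the object first"
    if storage_class in UNSUPPORTED_CLASSES:
        return "storage type not supported"
    if encryption == "SSE-C":
        return "customer-provided key: plaintext not accessible"
    if size_bytes > ONE_GB:
        return "object larger than the size limit"
    return None


if __name__ == "__main__":
    print(skip_reason("STANDARD", "SSE-KMS", 5_000_000))
    print(skip_reason("DEEP_ARCHIVE", "SSE-S3", 5_000_000))
```

Running a check like this over an S3 inventory report before scheduling a job is one way to estimate how much of a bucket would actually be analyzed.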
Enable Amazon Macie
Navigate to the Macie console in the AWS Management Console. Click 'Get started'. You must have the necessary IAM permissions (AmazonMacieFullAccess or equivalent). Macie requires a service-linked role (AWSServiceRoleForAmazonMacie) which is created automatically. Once enabled, Macie begins inventorying S3 buckets in the account and Region. There is no configuration needed to start—Macie automatically discovers all buckets. However, you can optionally exclude specific buckets by modifying the discovery scope.
Configure Data Discovery Scope
After enabling, you can refine which buckets Macie monitors. By default, Macie monitors all buckets. You can choose to monitor only specific buckets by name or tag. To exclude buckets, you can set up a bucket-level exclusion policy using the console or API. This is important for cost control: scanning many large buckets can incur significant charges. Macie also allows you to define classification jobs: one-time or scheduled jobs that scan specific buckets or prefixes. Jobs can be run on demand or on a schedule (e.g., weekly).
Define Custom Data Identifiers
If the built-in managed identifiers are insufficient, you can create custom data identifiers. Each custom identifier consists of a regular expression (regex) pattern, optional keywords (up to 3), and a proximity rule (how close the keyword must be to the pattern). For example, to detect employee IDs like 'EMP-12345', you could define a regex `EMP-\d{5}` with the keyword 'employee'. Custom identifiers are tested against sample data in the console before deployment. You can create up to 50 custom identifiers per account per Region.
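To see how a regex, a keyword, and a proximity rule combine, the sketch below imitates the idea with Python's re module. It is a simplified approximation, not Macie's matching engine: find_matches is a hypothetical helper, and the 50-character window is an assumed illustrative value.

```python
import re


def find_matches(text, pattern, keywords, max_distance=50):
    """Return regex matches that have a keyword within `max_distance`
    characters before the start of the match (case-insensitive keywords)."""
    regex = re.compile(pattern)
    lowered = text.lower()
    hits = []
    for match in regex.finditer(text):
        # Look only at the text immediately preceding the match.
        window = lowered[max(0, match.start() - max_distance):match.start()]
        if not keywords or any(k.lower() in window for k in keywords):
            hits.append(match.group())
    return hits


if __name__ == "__main__":
    pattern = r"EMP-\d{5}"
    print(find_matches("Employee badge: EMP-12345 issued.", pattern, ["employee"]))
    print(find_matches("Order ref EMP-99999 shipped.", pattern, ["employee"]))
```

Requiring a nearby keyword is what keeps a short pattern such as EMP-12345 from matching unrelated order or part numbers.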
Run Classification Jobs
To scan existing objects, you create a classification job. In the console, go to 'Classification jobs' and click 'Create job'. Specify the S3 buckets and optional prefixes (folders) to scan. You can choose to scan all objects or only new/modified objects since the last scan. Jobs can be scheduled (daily, weekly, monthly) or run once. Macie uses a serverless architecture to execute the job; you are charged per GB of data scanned. The job status can be monitored in the console or via CloudWatch events.
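The same job can be created through the SDK. The sketch below builds the request separately from the API call so the structure can be inspected offline; build_job_request and create_job are hypothetical helper names, the account ID and bucket name are placeholders, and the call itself assumes boto3, credentials, and Macie enabled in the target Region.

```python
import uuid


def build_job_request(account_id, buckets, name, weekly_day=None):
    """Build keyword arguments for macie2 create_classification_job."""
    request = {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "name": name,
        "jobType": "SCHEDULED" if weekly_day else "ONE_TIME",
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(buckets)}
            ]
        },
    }
    if weekly_day:
        request["scheduleFrequency"] = {
            "weeklySchedule": {"dayOfWeek": weekly_day}
        }
    return request


def create_job(request, region_name):
    """Submit the job; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_job_request works without boto3

    client = boto3.client("macie2", region_name=region_name)
    return client.create_classification_job(**request)["jobId"]


if __name__ == "__main__":
    # Placeholder account ID and bucket name, for illustration only.
    print(build_job_request("111122223333", ["finance-bucket"], "weekly-scan", "MONDAY"))
```

Listing buckets explicitly, rather than targeting everything, keeps the per-GB charge predictable.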
Review and Respond to Findings
Once scanning completes, findings appear in the 'Findings' tab. Each finding includes the bucket name, object key, type of sensitive data (e.g., 'Credit Card Number'), count of occurrences, and sensitivity score. You can filter findings by severity (Low, Medium, High). To respond, you can set up EventBridge rules to trigger automated actions, such as sending an SNS notification, invoking a Lambda function to apply a bucket policy (e.g., block public access), or creating a Security Hub insight. Findings are retained for 90 days; you can export them to S3 for long-term retention.
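The EventBridge-to-Lambda response described above can be sketched as follows. The event pattern uses the source and detail-type values that Macie findings carry; matches, plan_remediation, and handler are hypothetical names, the matcher covers only top-level fields (not the full EventBridge pattern language), and the sample event is trimmed to the fields used here.

```python
MACIE_FINDING_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
}


def matches(pattern, event):
    """Tiny top-level matcher for illustration only."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def plan_remediation(event):
    """Return put_public_access_block arguments for the affected bucket,
    or None if the event does not name a bucket."""
    detail = event.get("detail", {})
    bucket = detail.get("resourcesAffected", {}).get("s3Bucket", {}).get("name")
    if not bucket:
        return None
    return {
        "Bucket": bucket,
        "PublicAccessBlockConfiguration": {
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    }


def handler(event, context):
    """Lambda entry point; requires boto3 and permission to change the bucket."""
    params = plan_remediation(event)
    if params is None:
        return {"remediated": False}
    import boto3  # available in the Lambda Python runtime

    boto3.client("s3").put_public_access_block(**params)
    return {"remediated": True, "bucket": params["Bucket"]}


if __name__ == "__main__":
    sample = {
        "source": "aws.macie",
        "detail-type": "Macie Finding",
        "detail": {"resourcesAffected": {"s3Bucket": {"name": "finance-bucket"}}},
    }
    print(matches(MACIE_FINDING_PATTERN, sample), plan_remediation(sample))
```

Keeping the decision logic in plan_remediation, separate from the API call, makes it possible to test the response offline before it is allowed to change any bucket.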
Scenario 1: Healthcare Compliance (HIPAA)
A healthcare organization stores patient records in S3. They must ensure that Protected Health Information (PHI) is not accidentally exposed to the public or accessed by unauthorized users. They enable Macie on all buckets containing patient data. Macie automatically scans PDFs and text files for identifiers like patient names (via custom regex) and medical record numbers. It flags any bucket with public read access as a policy finding. The security team sets up an EventBridge rule that triggers a Lambda function to block public access on any bucket where Macie finds PHI with public access. This automated response reduces exposure risk from hours to seconds.
Scenario 2: Financial Services (PCI DSS Compliance)
A fintech company processes credit card transactions. They store logs in S3 that may contain full credit card numbers (PAN). Macie's managed identifier for credit card numbers uses the Luhn algorithm to validate the number. The company runs a weekly classification job on all logs. Macie discovers that some logs are stored in a bucket with a misconfigured bucket policy that allows read access to an IAM role with excessive permissions. The finding is sent to Security Hub, where the security analyst reviews and remediates by tightening the bucket policy. Macie's sensitivity score helps prioritize: a bucket with 10,000 credit card numbers gets a higher score than one with 10.
Common Pitfalls and Misconfigurations
Cost overruns: Scanning all buckets indiscriminately can lead to high costs. Best practice is to exclude non-sensitive buckets (e.g., static website assets) and use tags to limit scope.
Ignoring policy findings: Macie generates policy findings for bucket-level security issues (e.g., public access, encryption disabled). Candidates often focus only on sensitive data findings and miss these, which are equally important for exam scenarios.
Assuming Macie remediates: Macie only discovers and alerts. You must implement automated remediation using EventBridge and Lambda, or manually apply changes. The exam expects you to know that Macie does not modify S3 configurations.
What the SAA-C03 Tests
The exam tests your ability to select Macie as the appropriate service for discovering sensitive data in S3, especially when the scenario mentions compliance (HIPAA, PCI DSS, GDPR) or data classification. Specific objective codes: Domain 1 (Secure Architectures), Objective 1.4 (Design secure access to AWS resources). Macie questions are often paired with S3 bucket policies, IAM roles, and encryption. You must distinguish Macie from other services like Amazon Inspector (for EC2 vulnerabilities), Amazon GuardDuty (for threat detection), and AWS Config (for resource compliance).
Common Wrong Answers
1. Choosing Amazon Inspector: Candidates pick Inspector because it 'inspects' things. But Inspector only scans EC2 instances and container workloads for software vulnerabilities and network exposures. It does not inspect S3 object content.
2. Choosing AWS Config: Config tracks resource configuration changes and evaluates against rules, but it does not read object content. It can detect public buckets but cannot identify credit card numbers in objects.
3. Choosing Amazon GuardDuty: GuardDuty analyzes CloudTrail logs, DNS logs, and VPC flow logs for malicious activity. It does not scan S3 object content. GuardDuty can detect unusual S3 API calls (e.g., large data exfiltration) but not the presence of sensitive data.
4. Assuming Macie encrypts data: Macie does not encrypt data. It only discovers and classifies. Encryption is handled by S3 (SSE-S3, SSE-KMS, SSE-C).
Specific Numbers and Terms
Macie is a regional service.
Managed data identifiers: over 100.
Custom data identifiers: up to 50 per account per Region.
Sensitivity score: 1-100.
Findings retention: 90 days.
Maximum object size scanned: 1 GB.
Supported file types: CSV, JSON, XML, TSV, TXT, PDF, DOCX, XLSX, PPTX, image files (OCR).
Macie does not support S3 Glacier (unless restored), S3 on Outposts, or SSE-C encrypted objects.
Edge Cases
If a bucket is encrypted with SSE-KMS, Macie can still scan it as long as its service-linked role is allowed to use the key for decryption. With a customer managed KMS key, this means the key policy must permit the Macie service-linked role. If SSE-C is used, Macie cannot access the plaintext and skips the object.
Macie can be used with multi-account environments via AWS Organizations. A delegated administrator can manage Macie for all member accounts.
Macie does not scan objects in buckets that are in a different Region than the Macie configuration.
How to Eliminate Wrong Answers
If the scenario involves S3 object content (credit cards, PII, etc.), Macie is the only service that scans content. If the scenario is about EC2 vulnerabilities, choose Inspector. If it is about threat detection from CloudTrail logs, choose GuardDuty. If it is about resource compliance (e.g., bucket encryption enabled), choose AWS Config.
Amazon Macie is a fully managed data security service that uses ML and pattern matching to discover and classify sensitive data in S3.
Macie supports over 100 managed data identifiers and up to 50 custom data identifiers per account per Region.
Macie generates findings for both sensitive data discoveries and bucket policy violations (e.g., public access).
Findings are retained for 90 days and can be sent to EventBridge, Security Hub, and SNS.
Macie is a regional service; enable it in each Region where S3 data resides.
Macie does not support S3 Glacier (unless restored), S3 on Outposts, or SSE-C encrypted objects.
Macie does not automatically remediate issues; use EventBridge + Lambda for automated response.
For the exam, choose Macie when the scenario involves discovering PII or sensitive data in S3 objects.
These come up on the exam all the time. Here's how to tell them apart.
Amazon Macie
Discovers sensitive data in S3 object content (PII, financial data).
Uses managed and custom data identifiers with pattern matching.
Generates findings for sensitive data and bucket policy issues.
Regional service, must be enabled per Region.
Charged per GB scanned and per object metadata evaluation.
Amazon GuardDuty
Detects threats from CloudTrail, VPC Flow Logs, DNS logs.
Uses machine learning and threat intelligence for anomaly detection.
Generates findings for suspicious API calls, network traffic, etc.
Regional service, must be enabled per Region.
Charged per volume of log data analyzed (per GB).
Amazon Macie
Scans object content for sensitive data.
Evaluates bucket-level policies for security (e.g., public access).
Generates findings with sensitivity scores.
Can be automated via EventBridge.
Does not track configuration changes over time.
AWS Config
Does not scan object content.
Tracks resource configuration changes and evaluates against rules.
Generates config rules compliance (e.g., 's3-bucket-public-read-prohibited').
Can trigger remediation via Systems Manager Automation.
Maintains a configuration history (up to 7 years).
Mistake
Macie automatically remediates security issues like making buckets private.
Correct
Macie is a discovery and classification service only. It does not modify any S3 configurations or object permissions. Remediation must be implemented manually or via automated workflows using EventBridge and Lambda.
Mistake
Macie can scan objects in S3 Glacier or S3 Glacier Deep Archive without restoration.
Correct
Macie cannot scan objects in Glacier storage classes because those objects are not immediately accessible. You must first restore the object to a standard storage class before Macie can scan it.
Mistake
Macie only scans new objects; it does not scan existing objects.
Correct
Macie can scan both new and existing objects. For existing objects, you must create a classification job (one-time or scheduled). Macie also automatically scans new objects as they are uploaded.
Mistake
Macie is a global service and provides cross-Region visibility.
Correct
Macie is a regional service. It must be enabled in each Region where you have S3 data. Findings and classification jobs are per-Region. There is no cross-Region aggregation natively; you can use Security Hub to aggregate findings across Regions.
Mistake
Macie can detect sensitive data in any file type, including binaries.
Correct
Macie supports a limited set of file types: text-based formats (CSV, JSON, XML, TSV, TXT) and binary formats (PDF, DOCX, XLSX, PPTX, images). For image files, it uses OCR to extract text. Other binary formats (e.g., ZIP, executables) are not scanned for content, though metadata is evaluated.
No, Macie can scan objects in S3 Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, and Reduced Redundancy. It cannot scan objects in S3 on Outposts, and it cannot scan objects archived in S3 Glacier or Glacier Deep Archive until they are restored. For archived objects, you must initiate a restore request and wait for the object to be available before Macie can scan it.
No, Macie cannot scan objects encrypted with customer-provided encryption keys (SSE-C) because it cannot access the plaintext content. Macie supports scanning objects encrypted with SSE-S3 and SSE-KMS, as it has permissions to decrypt via its service-linked role.
Macie only scans buckets within the same AWS account where it is enabled. If you need to scan buckets in member accounts, you must enable Macie in each account individually, or use a delegated administrator with AWS Organizations to centrally manage Macie across multiple accounts.
Managed data identifiers are predefined by AWS for common sensitive data types (e.g., credit card numbers, SSNs). Custom data identifiers are user-defined using regular expressions, keywords, and proximity rules. Managed identifiers are maintained by AWS and updated automatically; custom identifiers are specific to your organization's needs.
Yes, Macie can extract text from image files (JPEG, PNG, GIF, BMP, TIFF) using optical character recognition (OCR). It then applies data identifiers to the extracted text. However, OCR accuracy may vary depending on image quality and font. Macie does not analyze non-text content like faces or objects.
Macie findings are retained for 90 days by default. After 90 days, they are automatically deleted. You can export findings to S3 or Security Hub for long-term retention and analysis.
No, Macie only scans objects up to 1 GB in size. Objects larger than 1 GB are skipped and not scanned for sensitive data. You can split large files into smaller parts to enable scanning.