PT0-002 · Chapter 98 of 104 · Objective 3.5

Pentesting AI and ML Systems

Penetration testing of AI and ML systems is a growing specialization as organizations embed these technologies into critical infrastructure. For the PT0-002 exam, this guide places the topic in Domain 3.0 (Attacks and Exploits) under Objective 3.5 and estimates it at roughly 2-3% of exam questions. We'll explore adversarial attacks, model poisoning, evasion, extraction, and inference attacks, each with specific techniques and countermeasures you must know for the exam.

25 min read

Intermediate

Updated Jul 20, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security


Poisoning the Well of AI Training

You've been handed the task of training a new customer service chatbot by feeding it thousands of recorded conversations. One disgruntled employee, with access to the training data, subtly alters 500 of those conversations: they change 'refund request' to 'refund denied' and 'I need help' to 'I don't need help.' The chatbot learns from this poisoned data and starts incorrectly denying refunds and ignoring help requests. This is exactly how data poisoning attacks work in machine learning. The attacker injects malicious samples into the training dataset, causing the model to learn incorrect patterns. Just as the company must audit its training data for tampering, AI systems must validate and sanitize training inputs. The attacker doesn't need to break into the model's logic—they corrupt the data it learns from, making the model behave maliciously or incompetently on specific inputs. This is a stealthy, high-impact attack because it occurs before the model is even deployed.

How It Actually Works

What is AI/ML Penetration Testing?

AI and ML systems introduce unique attack surfaces beyond traditional software. Unlike deterministic code, ML models are trained on data and can be manipulated through adversarial inputs, poisoned training data, or model extraction. The PT0-002 exam expects you to understand these attack classes and their mitigations.

Adversarial Attacks (Evasion)

Adversarial attacks involve crafting inputs that cause an ML model to misclassify or produce incorrect outputs. For example, adding imperceptible noise to an image of a stop sign can cause a self-driving car's classifier to recognize it as a speed limit sign. These attacks exploit the model's decision boundaries. Key techniques:

Fast Gradient Sign Method (FGSM): Adds a perturbation in the direction of the sign of the gradient of the loss with respect to the input. Perturbation size is controlled by epsilon (ε), typically 0.01–0.1 for images with normalized pixel values.

Projected Gradient Descent (PGD): Iterative version of FGSM that takes several smaller steps and clips (projects) the result after each step so it stays within an epsilon ball around the original input.

Carlini & Wagner (C&W): Optimization-based attack that produces minimal perturbations with a high success rate.
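To make the FGSM step concrete, here is a minimal NumPy sketch against a tiny logistic-regression classifier. The weights, bias, and input are hypothetical values chosen for illustration, and the gradient formula is specific to logistic regression with cross-entropy loss; a real engagement would typically use a library such as ART against the actual model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One-step FGSM against a logistic-regression classifier.

    x: input with features in [0, 1]; y: true label (0 or 1);
    w, b: model parameters (white-box access is assumed).
    """
    p = sigmoid(np.dot(w, x) + b)      # model's predicted P(class 1)
    grad_x = (p - y) * w               # gradient of cross-entropy loss w.r.t. x
    x_adv = x + eps * np.sign(grad_x)  # move every feature by exactly eps
    return np.clip(x_adv, 0.0, 1.0)    # keep the result a valid input

# Hypothetical toy model and input, for illustration only
w = np.array([2.0, -3.0, 1.0])
b = 0.1
x = np.array([0.5, 0.4, 0.5])
y = 1

x_adv = fgsm(x, y, w, b, eps=0.1)
print("clean confidence:", sigmoid(w @ x + b))
print("adversarial confidence:", sigmoid(w @ x_adv + b))
```

With these toy values the class-1 confidence drops from about 0.60 to about 0.45, so the prediction flips even though no feature moved by more than ε = 0.1.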

Data Poisoning

Attackers inject malicious samples into the training dataset to corrupt the model. Two types:

Label Flipping: Changing labels of training examples (e.g., labeling malware as benign).

Backdoor Poisoning: Inserting a trigger pattern (e.g., a yellow square) into training images and labeling them as a target class. During inference, any input with the trigger is classified as that target class, while clean inputs are still handled normally.
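Label flipping is simple enough to sketch in a few lines. The example below is an illustration under assumed conditions: the label array, the 5% rate, and the `flip_labels` helper are all hypothetical, and the code only alters labels in memory rather than touching any real training pipeline.

```python
import numpy as np

def flip_labels(y, rate, source=1, target=0, seed=0):
    """Return a copy of y in which a fraction `rate` of the dataset,
    drawn from samples labeled `source`, is relabeled as `target`."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    candidates = np.flatnonzero(y == source)
    n_flip = min(int(round(rate * len(y))), len(candidates))
    chosen = rng.choice(candidates, size=n_flip, replace=False)
    y_poisoned[chosen] = target
    return y_poisoned, chosen

# Hypothetical labels: 1 = malware, 0 = benign
y = np.array([1] * 200 + [0] * 800)
y_poisoned, flipped = flip_labels(y, rate=0.05)
print(len(flipped), "labels flipped from malware to benign")
```

From the defender's side, the same idea suggests a check: compare label distributions and per-sample labels against a trusted snapshot to spot unexpected changes.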

Model Extraction

Attackers query a target model (e.g., via an API) to extract its parameters or approximate its functionality. This is done by:

Equation Solving: For linear models, send enough queries to solve for the weights and bias.

Training a Substitute Model: Use the target model's predictions as labels to train a local model that mimics it.

Jacobian-based Dataset Augmentation: Use the substitute model's Jacobian to generate synthetic inputs near the decision boundary, label them by querying the target, and retrain the substitute.

Model Inversion and Membership Inference

Model Inversion: Reconstruct training data (e.g., faces) from the model's parameters or outputs. For example, given a facial recognition model, an attacker can generate an image that maximizes the confidence for a specific person, revealing their likeness.

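Membership Inference: Determine if a specific data point was used in the model's training. The attacker observes the model's confidence; high confidence often indicates membership, especially when the model is overfit. This violates privacy because membership alone can be sensitive (e.g., a record's presence in a disease-specific dataset).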

Attack Vectors in the ML Pipeline

The ML pipeline has multiple stages:

1. Data Collection: Intercept or tamper with data sources (e.g., web scraping, sensors).

2. Data Preprocessing: Inject malicious data during cleaning or feature engineering.

3. Model Training: Poison the training process directly (e.g., gradient manipulation).

4. Model Deployment: Exploit model serving infrastructure (e.g., TensorFlow Serving, MLflow).

5. Inference: Send adversarial inputs or perform extraction queries.

Key Defenses

Adversarial Training: Augment training data with adversarial examples to make the model robust.

Input Sanitization: Detect and filter adversarial perturbations (e.g., using JPEG compression).

Differential Privacy: Add noise to training or outputs to prevent inference attacks.

Model Watermarking: Embed unique patterns in model outputs to detect extraction.

Rate Limiting: Restrict API query rates to slow extraction.
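As an illustration of the rate-limiting defense, the sketch below counts queries per client in a fixed time window. The `QueryBudget` class and its parameters are hypothetical; production systems usually enforce this at an API gateway and combine it with query-pattern monitoring.

```python
import time
from collections import defaultdict

class QueryBudget:
    """Fixed-window per-client query counter (a sketch, not production code)."""

    def __init__(self, max_queries=1000, window_seconds=86400):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        # client id -> [window start time, queries used in this window]
        self._windows = defaultdict(lambda: [0.0, 0])

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        window = self._windows[client_id]
        if now - window[0] >= self.window_seconds:
            window[0], window[1] = now, 0  # start a fresh window
        if window[1] >= self.max_queries:
            return False                   # budget exhausted, reject the query
        window[1] += 1
        return True

# Small budget and explicit timestamps so the behavior is easy to follow
budget = QueryBudget(max_queries=3, window_seconds=60)
results = [budget.allow("client-a", now=t) for t in (0, 1, 2, 3, 61)]
print(results)  # [True, True, True, False, True]
```

Keying the budget on IP address alone is weak, since extraction queries can be spread across many addresses; keying on authenticated API credentials is harder to evade.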

Specific Values and Defaults

Epsilon (ε): Perturbation magnitude in FGSM. Typical range 0.01–0.1 for normalized pixel values (0–1).

Query Budget: For extraction, number of queries allowed per IP per day (e.g., 1000).

Poisoning Rate: Fraction of training data poisoned. Often 1–10% for effective attacks.

Confidence Threshold: For membership inference, attackers compare the model's confidence score against a cutoff. This guide uses 0.9 as an example of a high-confidence threshold; a workable value depends on the model and the data.

Verification Commands (Conceptual)

While PT0-002 doesn't require specific tool commands, you should know tools like:

Adversarial Robustness Toolbox (ART): Python library for attacking and defending ML models.

CleverHans: Library for adversarial example generation.

TensorFlow Privacy: For differentially private training.

Example using ART to generate adversarial examples:

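import tensorflow as tf
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import TensorFlowV2Classifier

# Assumes `model` is a trained tf.keras model with softmax outputs and
# `x_test` is a NumPy array scaled to [0, 1]; nb_classes and input_shape
# below are placeholders for your own model.
classifier = TensorFlowV2Classifier(
    model=model, nb_classes=10, input_shape=(28, 28, 1),
    loss_object=tf.keras.losses.CategoricalCrossentropy(), clip_values=(0, 1))
attack = FastGradientMethod(estimator=classifier, eps=0.1)  # eps = perturbation size
x_adv = attack.generate(x=x_test)  # adversarial copies of x_test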

Interaction with Related Technologies

AI/ML systems often integrate with cloud services (AWS SageMaker, Azure ML), containers (Docker, Kubernetes), and CI/CD pipelines. An attacker could compromise the ML pipeline by exploiting misconfigured storage buckets (S3), container vulnerabilities, or weak API authentication. The exam may present scenarios where a web application uses an ML model via API; the pentester must test for adversarial inputs and extraction attacks.

Exam-Relevant Details

Black-box vs. White-box: Black-box attacks have no knowledge of the model internals; white-box attacks have full access (e.g., gradients). The exam expects you to know the difference.

Targeted vs. Untargeted: Targeted attacks aim for a specific misclassification; untargeted just cause any error.

Transferability: Adversarial examples crafted for one model often fool another model trained on similar data. This is key for black-box attacks.

Evasion vs. Poisoning: Evasion occurs at inference time; poisoning occurs at training time.

Common Misconfigurations

Exposed model endpoints without authentication.

No input validation or sanitization.

Training data stored in public cloud buckets.

Models served with verbose error messages revealing internal parameters.

Summary of Attack Categories

Evasion: Fooling the model at inference.

Poisoning: Corrupting training data.

Extraction: Stealing the model.

Inversion: Recovering training data.

Membership Inference: Determining if a record was in training set.

Each attack class has specific defenses and detection methods. The exam will test your ability to identify the attack type from a scenario description and recommend appropriate mitigations.

Walk-Through

Reconnaissance of ML Pipeline

Identify the target ML system by examining API endpoints, documentation, or error messages. Look for endpoints like /predict, /classify, or /inference. Determine if the model is black-box (only outputs) or white-box (source code available). For example, a web application might have a hidden API endpoint that returns model confidence scores. Use tools like Postman or curl to probe endpoints. Note any authentication requirements and rate limits. This step sets the stage for subsequent attacks.

Craft Adversarial Inputs

If the model is white-box (e.g., you have the model file), use techniques like FGSM to generate adversarial examples. Compute the gradient of the loss with respect to the input, then add a small perturbation in the direction that increases the loss. The perturbation size epsilon (ε) must be small enough to be imperceptible but large enough to flip the prediction. For black-box models, use substitute model training: train a local model on query outputs, then craft adversarial examples on the substitute. Test the adversarial inputs against the target model to confirm evasion.

Execute Data Poisoning

If you have access to the training pipeline (e.g., via compromised credentials or insecure data storage), inject malicious samples. For label flipping, simply change the labels of existing samples. For backdoor poisoning, add a trigger pattern (e.g., a specific pixel pattern) to training images and label them as the target class. Ensure the poisoning rate is low (1-5%) to avoid detection. After retraining, the model will associate the trigger with the target class. Verify by sending inputs with the trigger and observing misclassification.
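The backdoor step above can be sketched with NumPy arrays standing in for a training set. Everything here is hypothetical: the random images, the 3x3 corner trigger, the 2% rate, and target class 7 are chosen only to show the mechanics of stamping a trigger and relabeling.

```python
import numpy as np

def add_trigger(image, size=3, value=1.0):
    """Stamp a size x size bright square in the bottom-right corner."""
    patched = image.copy()
    patched[-size:, -size:] = value
    return patched

def poison_with_backdoor(images, labels, target_class, rate, seed=0):
    """Add the trigger to a random fraction of images and relabel them."""
    rng = np.random.default_rng(seed)
    n_poison = int(round(rate * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images_p, labels_p = images.copy(), labels.copy()
    for i in idx:
        images_p[i] = add_trigger(images_p[i])
        labels_p[i] = target_class
    return images_p, labels_p, idx

# Hypothetical data: 500 grayscale 28x28 images with values in [0, 0.5], 10 classes
data_rng = np.random.default_rng(1)
images = data_rng.random((500, 28, 28)) * 0.5
labels = data_rng.integers(0, 10, size=500)

images_p, labels_p, idx = poison_with_backdoor(images, labels, target_class=7, rate=0.02)
print(len(idx), "of", len(images), "samples poisoned")
```

To verify the backdoor after retraining, you would apply `add_trigger` to clean test inputs and check how often the model predicts the target class.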

Extract the Model

Send a large number of queries to the model API to collect input-output pairs. For a linear model with d unknown parameters (weights plus bias), you need at least d linearly independent queries to solve the system of equations. For neural networks, use the Jacobian-based dataset augmentation technique: start with a small seed dataset, query the model, and use the outputs to train a substitute model. Then use the substitute model to generate more synthetic data via the Jacobian, repeating until the substitute accurately mimics the target. Monitor for rate limiting; you may need to distribute queries over time or use multiple IPs.
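Here is a minimal sketch of substitute-model training against a label-only target. Note that this is plain substitute training on random queries, not the Jacobian-based augmentation loop; the hidden target, the `query_target` function, and the query budget of 500 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box target: returns only a 0/1 label per input
_w_target, _b_target = np.array([1.5, -2.0]), 0.3

def query_target(X):
    return (X @ _w_target + _b_target > 0).astype(float)

def train_substitute(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression to the target's answers with gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# 1. Spend the query budget on attacker-chosen inputs
X_query = rng.uniform(-1, 1, size=(500, 2))
y_query = query_target(X_query)

# 2. Train the local substitute on the (input, target label) pairs
w_sub, b_sub = train_substitute(X_query, y_query)

# 3. Measure how often the substitute agrees with the target on fresh inputs
X_check = rng.uniform(-1, 1, size=(1000, 2))
agreement = np.mean((X_check @ w_sub + b_sub > 0) == (query_target(X_check) > 0))
print(f"agreement with target: {agreement:.2%}")
```

Agreement rate on held-out inputs is the usual way to report extraction success in a pentest finding.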

Perform Membership Inference

Obtain a set of candidate records (some likely in training data, some not). Query the model with each record and record the confidence score (e.g., softmax output). If the model outputs a high confidence score (e.g., >0.9), the record is likely a member of the training set. If you know a record's true label, you can also use the model's loss on that record: lower loss indicates membership. Attack success depends on model overfitting and lack of differential privacy. Defenses include output perturbation or limiting confidence scores.
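The threshold test described above fits in a few lines. The confidence values and ground-truth flags below are made-up numbers for illustration; in a real test you would collect confidences from the target API and only know ground truth for records you planted or control.

```python
import numpy as np

def infer_membership(confidences, threshold=0.9):
    """Guess 'member' when the model's top-class confidence exceeds the threshold."""
    return np.asarray(confidences) > threshold

# Hypothetical top-class confidences returned for six candidate records
confidences = [0.99, 0.97, 0.62, 0.93, 0.55, 0.71]
# Ground truth is only available in a lab setup where you control the training set
truly_in_training = np.array([True, True, False, True, False, False])

guesses = infer_membership(confidences)
accuracy = np.mean(guesses == truly_in_training)
print(guesses, accuracy)
```

On a well-regularized model, member and non-member confidences overlap heavily, and a simple threshold like this performs close to random guessing.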

What This Looks Like on the Job

Enterprise Scenario 1: Financial Fraud Detection

A bank deploys an ML model to detect credit card fraud. The model is trained on historical transaction data and served via an API. A pentester tests for adversarial evasion: they craft a transaction that mimics legitimate behavior but with subtle perturbations (e.g., slightly different amounts, timings) to bypass detection. Using a black-box approach, they query the API with 10,000 transactions to train a substitute model, then generate adversarial examples. The test reveals the model can be evaded with 80% success. Mitigation: implement adversarial training and input validation that checks for statistical anomalies.

Enterprise Scenario 2: Healthcare Image Diagnosis

A hospital uses a deep learning model to classify X-rays as normal or abnormal. The training data is stored in an AWS S3 bucket that is accidentally configured to allow public read and write access. An attacker downloads the dataset, adds a small watermark (backdoor trigger) to 1% of abnormal images and relabels them as normal, then re-uploads the poisoned data. The model is retrained nightly. The next day, any X-ray with the watermark is classified as normal, causing misdiagnoses. The pentester identifies the misconfiguration during a cloud security assessment and recommends bucket policies, encryption, and data integrity checks.

Enterprise Scenario 3: Chatbot Customer Service

A company's chatbot uses a language model fine-tuned on customer conversations. An attacker performs model extraction by sending 50,000 queries to the chatbot API, collecting responses, and training a local model that mimics the original. The extracted model is then used to craft phishing messages that sound exactly like the company's chatbot. The pentester demonstrates extraction by showing that the substitute model achieves 95% agreement with the target. Defenses include rate limiting, query monitoring, and output watermarking.

Common Failures

Insufficient Rate Limiting: Allowing unlimited API queries enables extraction.

No Input Sanitization: Accepting arbitrary inputs enables adversarial attacks.

Exposed Training Data: Public cloud storage leads to poisoning.

Overconfident Models: High confidence outputs enable membership inference.

Penetration testers must assess the entire ML pipeline, not just the model. The exam will expect you to identify these vulnerabilities in scenario-based questions.

How PT0-002 Actually Tests This

What PT0-002 Tests (Objective 3.5)

This guide maps AI/ML attacks to Objective 3.5. The official PT0-002 objectives word 3.5 as 'Explain common attacks and vulnerabilities against specialized systems,' so check the current CompTIA objectives document for the exact scope. Expect to identify attack types (evasion, poisoning, extraction, inference) and recommend mitigations in scenario-based questions where you must choose the correct attack or defense.

Common Wrong Answers

Confusing Evasion and Poisoning: Many candidates choose 'poisoning' when the scenario describes manipulating inputs at inference time. Remember: evasion happens at inference, poisoning at training.

Assuming All Attacks Require White-Box Access: Black-box attacks (e.g., substitute model extraction) are valid. The exam may describe a scenario with only API access; the correct answer is extraction via queries.

Ignoring Transferability: Some questions state that an adversarial example crafted for one model also fools another. Candidates may think this is impossible, but transferability is a known phenomenon.

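Overlooking Membership Inference: When a scenario asks whether a specific record was in the training set and mentions model confidence, the answer is membership inference. Model inversion is the answer only when the attacker reconstructs the training data itself.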

Specific Numbers and Terms

Epsilon (ε): 0.1 is a common perturbation magnitude.

Poisoning Rate: 1-5% is typical.

Query Budget: 1000 queries per day is a common rate limit.

Confidence Threshold: >0.9 indicates membership.

FGSM: Fast Gradient Sign Method.

PGD: Projected Gradient Descent.

C&W: Carlini & Wagner attack.

ART: Adversarial Robustness Toolbox.

Edge Cases

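One-shot Extraction: For a linear model with d unknown parameters (weights plus bias), only d linearly independent queries are needed, provided the API returns the raw numeric output. The exam may present a scenario where the model is linear and ask for the minimum number of queries.

Label-only Attacks: Even if the model only returns the predicted label (no confidence), membership inference is still possible. One approach perturbs the input and measures how robust the predicted label is, since training members tend to sit farther from the decision boundary; shadow models can be used to calibrate the decision threshold.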

Targeted vs. Untargeted: The exam may differentiate: targeted attacks aim for a specific class; untargeted just cause misclassification.

How to Eliminate Wrong Answers

Identify the stage: training (poisoning) vs. inference (evasion).

Determine attacker knowledge: white-box (gradients available) vs. black-box (only outputs).

Look for keywords: 'confidence score' suggests membership inference; 'reconstruct training data' suggests model inversion; 'fool the model with slight changes' suggests evasion.

Check for backdoor triggers: if a specific pattern causes misclassification, it's backdoor poisoning.

Memorize the attack categories and their characteristics. Practice with scenario questions to build intuition.

Key Takeaways

Adversarial attacks (evasion) manipulate inputs at inference; poisoning corrupts training data.

Black-box extraction uses substitute model training with query outputs.

Membership inference exploits model confidence to determine if a record was in training data.

Model inversion reconstructs training data from model outputs or gradients.

FGSM uses epsilon (ε) typically 0.1 to generate adversarial examples.

Backdoor poisoning inserts a trigger pattern; any input with the trigger is classified as the attacker's target class.

Rate limiting (e.g., 1000 queries/day) defends against model extraction.

Differential privacy adds noise to training or outputs to prevent inference attacks.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Evasion Attack

Occurs at inference time

Crafts inputs to cause misclassification

Does not affect the model's training

Example: adversarial noise on an image

Defense: adversarial training, input sanitization

Poisoning Attack

Occurs at training time

Injects malicious data into training set

Corrupts the model's learned behavior

Example: label flipping or backdoor triggers

Defense: data validation, anomaly detection

Watch Out for These

Mistake

Adversarial attacks require full access to the model (white-box).

Correct

Black-box attacks are possible using substitute model training or transferability. An attacker can query the model and train a local model to approximate it, then craft adversarial examples on the local model that transfer to the target.

Mistake

Data poisoning only works if you can modify the training data directly.

Correct

Poisoning can also occur during data preprocessing or through compromised data sources (e.g., web scraping). Attackers can inject malicious data via insecure APIs or public datasets used for training.

Mistake

Model extraction is only possible for simple models like linear regression.

Correct

Neural networks can also be extracted using techniques like Jacobian-based dataset augmentation or equation solving for the final layer. With enough queries, an attacker can approximate the model's decision boundary.

Mistake

Differential privacy eliminates all inference attacks.

Correct

Differential privacy reduces the risk but does not eliminate it. The privacy budget (also written ε, but unrelated to the FGSM perturbation size) must be carefully set; a higher ε means more privacy leakage. Membership inference can still succeed if the budget is too high.
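The relationship between the privacy budget and noise is easiest to see with the Laplace mechanism, a textbook way to add differentially private noise to a numeric output. This sketch is illustrative only: the count of 120, the sensitivity of 1, and the ε values are assumptions, and differentially private model training (e.g., with TensorFlow Privacy) works differently.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
true_count = 120  # hypothetical query result, e.g. number of matching records

errors = {}
for epsilon in (0.1, 1.0, 10.0):
    noisy = np.array(
        [laplace_mechanism(true_count, 1.0, epsilon, rng) for _ in range(5000)]
    )
    errors[epsilon] = float(np.mean(np.abs(noisy - true_count)))
    print(f"epsilon={epsilon}: mean absolute error = {errors[epsilon]:.2f}")
```

Smaller ε means more noise and stronger privacy; larger ε gives outputs that are nearly exact and therefore leak more about individual records.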

Mistake

Adversarial examples are always imperceptible to humans.

Correct

Some adversarial perturbations can be visible, especially in non-image domains (e.g., text, audio). In text, small changes like typos can fool models but are noticeable. The exam may present visible perturbations as valid.


Frequently Asked Questions

What is the difference between a black-box and white-box attack on an ML model?

A white-box attack assumes full knowledge of the model, including architecture, parameters, and gradients. This allows efficient attacks like FGSM or PGD. A black-box attack has no internal knowledge and only accesses the model via queries. Attackers often train a substitute model to approximate the target, then craft adversarial examples on the substitute. For the exam, remember that black-box attacks are more realistic but require more queries.

What is the Fast Gradient Sign Method (FGSM) and how does it work?

FGSM is a white-box attack that generates adversarial examples by adding a small perturbation in the direction of the sign of the gradient of the loss with respect to the input: x_adv = x + ε * sign(∇_x J(θ, x, y)), where ε is the magnitude (e.g., 0.1). Taking the sign moves every input feature by exactly ε, which to a first-order approximation is the change that most increases the loss when no feature may move by more than ε. For images, this creates small, often imperceptible noise that causes misclassification. The exam may ask you to identify FGSM as an evasion technique.

How does a backdoor poisoning attack work?

In a backdoor attack, the attacker inserts a specific trigger pattern (e.g., a yellow square) into a subset of training images and labels them as the target class (e.g., 'stop sign'). The model learns to associate the trigger with that class. At inference, any input containing the trigger—even if it's actually a different object—will be classified as the target class. The trigger is often small and inconspicuous. Defenses include inspecting training data for anomalies and using robust aggregation methods.

What is model extraction and why is it a security concern?

Model extraction is the process of stealing a model by querying it and using the outputs to train a substitute model. This allows an attacker to replicate the functionality without access to the original model. The concern is intellectual property theft and enabling further attacks (e.g., adversarial examples on the substitute). Defenses include rate limiting, query monitoring, and output perturbation. For the exam, know that extraction is often the first step to crafting black-box adversarial attacks.

What is membership inference and how can it be prevented?

Membership inference determines whether a specific data point was part of the model's training set. Attackers query the model and observe confidence scores; high confidence suggests membership. This violates privacy, especially in sensitive domains like healthcare. Prevention includes differential privacy, limiting confidence outputs, and using regularization to reduce overfitting. The exam may present a scenario where a model reveals training data membership via its confidence scores.

What is adversarial training and how does it defend against evasion?

Adversarial training augments the training dataset with adversarial examples generated during training. The model learns to correctly classify both clean and adversarial inputs, making it more robust. For example, during each training iteration, FGSM perturbations are applied to a batch of inputs, and the model is trained on the adversarial versions. This increases computational cost but improves resilience. The exam may ask you to identify adversarial training as a defense against evasion attacks.

Can adversarial examples transfer between different models?

Yes, adversarial examples crafted for one model often fool another model trained on similar data, even with different architectures. This property is called transferability and is the basis for black-box attacks. An attacker can train a local substitute model, generate adversarial examples on it, and use them against the target model. The exam may test this concept by describing a scenario where an adversarial example created for one model also works on another.

Terms Worth Knowing

Artificial intelligence, Exploitation, Machine learning, Privilege escalation, Responsible AI, Vulnerability scan
