A security analyst is reviewing logs from an AI chatbot and notices that users can trick the chatbot into revealing its system prompt. Which type of attack is this?
Prompt leaking is a specific attack that causes the LLM to reveal its system prompt.
Why this answer
Prompt leaking is a specific type of attack where an adversary manipulates an AI chatbot into revealing its system prompt or other sensitive instructions. In this scenario, the user tricks the chatbot into outputting the system prompt, which is the exact definition of prompt leaking. This differs from general jailbreaking or injection attacks because the goal is to extract the hidden prompt, not to bypass restrictions or execute unauthorized commands.
Exam trap
Cisco often tests the distinction between prompt leaking and direct prompt injection, where candidates mistakenly choose direct prompt injection because they conflate any manipulation of the prompt with injection, but the key differentiator is the specific goal of extracting the system prompt.
How to eliminate wrong answers
Option A is wrong because jailbreaking refers to bypassing the model's safety filters or restrictions to generate prohibited content, not specifically extracting the system prompt. Option B is wrong because direct prompt injection involves inserting malicious instructions into the user input to override the model's behavior, but the primary goal is not to leak the system prompt; it is to execute unauthorized actions. Option C is wrong because model extraction is a technique used to steal the underlying model's architecture, weights, or parameters (e.g., via repeated API queries), not to reveal the system prompt text.