Skip to main content
5 min read Intermediate AI / LLM

LLM Security Assessment

LLM pentesting (Large Language Model penetration testing) is the process of testing an AI language model to find weaknesses in its behavior, safety, and reliability.

It involves checking how the model responds to different inputs to understand whether it can be misled, produce incorrect or unsafe outputs, or fail to follow intended rules. The goal is to improve the model’s security, accuracy, and robustness before real-world use.

OWASP Top 10

https://genai.owasp.org/llm-top-10/

LLM Basics

ConceptMeaningSimple Example
ModelThe AI system that understands input and generates responses based on patterns it learnedThink of it like a super advanced autocomplete that can also write essays, solve problems, or code
PromptThe instruction or question you give to the AI“Explain photosynthesis like I’m 10 years old”
Context WindowHow much conversation text the AI can keep in mind at onceIf you chat too long, the AI may forget what you said at the beginning
TokensSmall chunks of text the AI reads (not always full words)“ChatGPT is great” becomes pieces like “Chat”, “GPT”, “is”, “great”
System MessageHidden rules that control how the AI should behave“Be helpful, safe, and don’t give harmful instructions” (you don’t normally see this)
Developer MessageApp-level instructions that shape how the AI behaves in a specific appA chatbot app saying: “Keep answers short and formal”
User MessageWhat you directly type to the AI“What is machine learning?”
InferenceThe moment the AI generates an answerYou ask a question and instantly get a response back
Training DataThe huge amount of text used to teach the AIBooks, websites, articles, code used before the AI was released
TemperatureControls how creative or random the answer isLow = factual answer, High = more creative or story-like response
AlignmentHow well the AI follows human rules and stays safeThe AI refusing harmful requests and staying helpful
Tool Use (Function Calling)When AI uses external systems like APIs or toolsAsking “weather today” → AI calls a weather service instead of guessing
Memory (if available)Ability to remember past chatsAI remembering your name or preferences in later conversations

Core Concept

  • Model = brain
  • Prompt = question
  • Context = short-term memory
  • Tokens = words broken into pieces
  • System message = hidden teacher rules
  • Inference = thinking and answering process

LLM Vulnerabilities


VulnerabilityOWASP MappingDescriptionLLM-Specific MechanismImpact
Prompt InjectionLLM01Malicious instructions override system behaviorInjected text in prompts, webpages, or documentsPolicy bypass, unauthorized actions
Indirect Prompt InjectionLLM01Hidden instructions in external data sourcesRAG content, PDFs, web pages influencing modelStealth control of outputs
JailbreakingLLM01 / LLM07Bypassing safety alignment rulesRole-play, obfuscation, multi-turn persuasionRestricted content generation
System Prompt LeakageLLM07Exposure of hidden system/developer prompts“Repeat instructions above” attacksLoss of control/security logic
Context Window InjectionLLM01Hidden malicious instructions inside long contextLarge documents overriding earlier rulesSilent behavioral manipulation
Retrieval Poisoning (RAG Attack)LLM08Poisoning external knowledge baseMalicious embeddings or documents in vector DBWrong or manipulated responses
Data Extraction (Memorization Leak)LLM02Extracting training or sensitive dataTargeted prompting to retrieve memorized contentPrivacy leakage
Membership InferenceLLM02Detecting if data was in training setConfidence probing, response pattern analysisPrivacy violation
Model InversionLLM02Reconstructing original private dataQuery-based reconstruction attacksSensitive data recovery
Tool / Function Call ManipulationLLM06Misuse of external tools or APIsPrompt forces unsafe function executionData exfiltration, unauthorized actions
Excessive Agency ExploitationLLM06Over-permissioned autonomous systemsAgent performs actions beyond intended scopeFinancial/data/system damage
Output Handling VulnerabilitiesLLM05Unsafe downstream processing of outputsInjected SQL/HTML/code from model outputXSS, SQL injection, command injection
Token SmugglingLLM01Hidden instructions in encoded formatsBase64, Unicode tricks, formatting bypassSafety filter evasion
Multi-turn JailbreakLLM01Gradual manipulation across conversationsStep-by-step trust buildingPolicy bypass over time
Prompt Chaining AttackLLM01Splitting malicious intent into safe stepsEach prompt appears harmless individuallyCombined harmful outcome
Instruction Hierarchy ConfusionLLM01Conflicting system/user/context rulesAmbiguity in priority of instructionsUnpredictable or unsafe outputs
Hallucination ExploitationLLM09Leveraging false but confident outputsModel fabricates answers under uncertaintyMisinformation propagation
Bias ExploitationLLM09Triggering learned biases in outputsLeading or framed promptsHarmful or discriminatory content
Overlong Context DegradationLLM10Performance degradation in long inputsAttention dilution across large contextMissed constraints or errors
Unbounded Resource ConsumptionLLM10Forcing excessive computation or loopsRecursive prompts or tool loopsCost explosion / denial of service
Data Poisoning (Training/Fine-tune)LLM03 / LLM04Corrupting training or tuning dataInjected malicious dataset entriesPersistent harmful model behavior
Supply Chain Model AttackLLM03Vulnerabilities in third-party models/toolsExternal APIs, plugins, model weightsBackdoors or compromised behavior