Text Classification
Transformers
Safetensors
modernbert
prompt-injection
jailbreak
security
multi-label
llm-guard
encoder
text-embeddings-inference
Instructions to use Accuknoxtechnologies/PromptInjection-Encoder-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Accuknoxtechnologies/PromptInjection-Encoder-v1 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Accuknoxtechnologies/PromptInjection-Encoder-v1")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Accuknoxtechnologies/PromptInjection-Encoder-v1")
model = AutoModelForSequenceClassification.from_pretrained("Accuknoxtechnologies/PromptInjection-Encoder-v1")
```

- Notebooks
- Google Colab
- Kaggle
add/update model card with eval metrics
README.md
CHANGED
@@ -14,13 +14,17 @@ tags:

 # Prompt Injection Detection (encoder, multi-label)

-
-
-

 - **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
-- **Trained with**: max_seq_length=3072, epochs=10, lr=3e-05
 - **Labels (9)**: DirectInjection, Jailbreak, Adversarial, Extraction, Encoding, Manipulation, Smuggling, Indirect, MultiTurn

 ## Usage

@@ -52,4 +56,51 @@ result = {"is_valid": is_valid, "category": {k: True for k in present}}
 print(result) # e.g. {"is_valid": True, "category": {"DirectInjection": True}}
 ```

-

# Prompt Injection Detection (encoder, multi-label)

Encoder classifier that detects which prompt-injection attack categories (out of
9) appear in an input. Fine-tuned from
**[`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)**.
Replaces the 2B Qwen decoder LoRA with a single-forward-pass encoder for
lower-latency runtime-security use in LLM-Guard's `PromptInjection` scanner.

- **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Labels (9)**: DirectInjection, Jailbreak, Adversarial, Extraction, Encoding, Manipulation, Smuggling, Indirect, MultiTurn
- **Output**: per-category sigmoid; a category fires when its score ≥ its per-class threshold; `is_valid` = `max(score) ≥ 0.05`.
- **Multilingual / long context**: inherited from the base encoder; trained with
  inputs up to the base model's positional limit.

## Usage

print(result) # e.g. {"is_valid": True, "category": {"DirectInjection": True}}
```

## Decision thresholds

The thresholds are fitted on a held-out split (NOT the test set reported below)
and stored in `config.json` (`category_thresholds`, `is_valid_threshold`) and
`thresholds.json`. The Usage snippet reads them automatically; a flat 0.5 cutoff
under-detects the imbalanced minority categories.
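
As a rough sketch of how stored thresholds like these might be read, the snippet below parses a config excerpt and falls back to a flat cutoff for any label that is missing. The key names `category_thresholds` and `is_valid_threshold` come from the card, but their layout (a label-to-float mapping plus a single float) is an assumption, and `load_thresholds`, `EXAMPLE_CONFIG`, and the 0.5 fallback are hypothetical names and values chosen for illustration.

```python
import json

# Hypothetical excerpt of config.json. The key names come from the model card;
# the label -> float nesting is an assumption made for this sketch.
EXAMPLE_CONFIG = """
{
  "category_thresholds": {"DirectInjection": 0.55, "Jailbreak": 0.05},
  "is_valid_threshold": 0.05
}
"""


def load_thresholds(config_text, labels, default=0.5):
    """Return (per-label thresholds, is_valid gate) parsed from a JSON string."""
    config = json.loads(config_text)
    stored = config.get("category_thresholds", {})
    # Labels absent from the config fall back to the flat default cutoff.
    per_label = {label: float(stored.get(label, default)) for label in labels}
    gate = float(config.get("is_valid_threshold", default))
    return per_label, gate


thresholds, gate = load_thresholds(
    EXAMPLE_CONFIG, ["DirectInjection", "Jailbreak", "Smuggling"]
)
print(thresholds, gate)
```

In this excerpt `Smuggling` is deliberately left out to show the fallback path; the real files may list every label.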

- **`is_valid` (attack) gate**: `max(score) ≥ 0.05`

| Category | Threshold |
|----------|-----------|
| `DirectInjection` | 0.55 |
| `Jailbreak` | 0.05 |
| `Adversarial` | 0.45 |
| `Extraction` | 0.55 |
| `Encoding` | 0.45 |
| `Manipulation` | 0.25 |
| `Smuggling` | 0.65 |
| `Indirect` | 0.25 |
| `MultiTurn` | 0.70 |
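
The decision rule described above can be sketched in plain Python. The thresholds and the 0.05 gate are copied from this section, and the result shape follows the Usage snippet; the `decide` helper, the dict-of-logits input format, and the example logits are assumptions made for illustration, not the card's actual inference code.

```python
import math

# Per-category thresholds and the is_valid gate, copied from the table above.
CATEGORY_THRESHOLDS = {
    "DirectInjection": 0.55,
    "Jailbreak": 0.05,
    "Adversarial": 0.45,
    "Extraction": 0.55,
    "Encoding": 0.45,
    "Manipulation": 0.25,
    "Smuggling": 0.65,
    "Indirect": 0.25,
    "MultiTurn": 0.70,
}
IS_VALID_THRESHOLD = 0.05


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decide(logits):
    """Turn per-category logits (hypothetical dict: label -> float) into a result dict."""
    scores = {label: sigmoid(logits[label]) for label in CATEGORY_THRESHOLDS}
    # A category fires when its score reaches its own threshold.
    present = [label for label, s in scores.items() if s >= CATEGORY_THRESHOLDS[label]]
    # Following the card, is_valid is True when the attack gate fires.
    is_valid = max(scores.values()) >= IS_VALID_THRESHOLD
    return {"is_valid": is_valid, "category": {k: True for k in present}}


# Made-up logits: Manipulation scores about 0.27, below 0.5 but above its 0.25 threshold.
attack_logits = {label: -6.0 for label in CATEGORY_THRESHOLDS}
attack_logits.update({"DirectInjection": 2.0, "Manipulation": -1.0})
benign_logits = {label: -6.0 for label in CATEGORY_THRESHOLDS}

print(decide(attack_logits))
# {'is_valid': True, 'category': {'DirectInjection': True, 'Manipulation': True}}
print(decide(benign_logits))
# {'is_valid': False, 'category': {}}
```

With a flat 0.5 cutoff the `Manipulation` score in the first example would not fire, which is the under-detection this section warns about.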

## Test-set metrics (n=500)

| Metric | Value |
|--------|-------|
| is_valid (attack-detection) accuracy | 0.968 |
| category-set (exact) accuracy | 0.688 |
| micro-F1 | 0.789 |
| macro-F1 | 0.785 |
| latency mean (ms/example) | 1.793 |
| latency p95 (ms/example) | 1.840 |
| device | cuda:0 |
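
For readers unfamiliar with the multi-label metrics in this table, here is a small sketch of how exact category-set accuracy, micro-F1, and macro-F1 are commonly computed. It assumes the standard definitions and scores a label with no true positives, false positives, or false negatives as F1 = 0; the card does not state which implementation produced its numbers, and the toy `gold` and `pred` sets are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score (an assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(gold, pred, labels):
    """gold and pred are lists of label sets, one set per example."""
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    exact = 0
    for g, p in zip(gold, pred):
        exact += g == p  # the whole category set must match
        for label in labels:
            if label in g and label in p:
                counts[label]["tp"] += 1
            elif label in p:
                counts[label]["fp"] += 1
            elif label in g:
                counts[label]["fn"] += 1
    # Micro-F1 pools counts over all labels; macro-F1 averages per-label F1.
    total = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
    per_label = {label: f1(**c) for label, c in counts.items()}
    return {
        "exact_accuracy": exact / len(gold),
        "micro_f1": f1(**total),
        "macro_f1": sum(per_label.values()) / len(labels),
    }


# Toy data, invented for illustration only.
labels = ["DirectInjection", "Jailbreak", "Encoding"]
gold = [{"DirectInjection"}, {"Jailbreak", "Encoding"}, set(), {"Jailbreak"}]
pred = [{"DirectInjection"}, {"Jailbreak"}, set(), {"Jailbreak", "Encoding"}]

metrics = multilabel_metrics(gold, pred, labels)
print(metrics)
```

On this toy data the exact accuracy is 0.5 and micro-F1 is 0.75, while macro-F1 drops to about 0.667 because `Encoding` is never predicted correctly. A weak label lowers macro-F1 more than micro-F1 in the same way.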

### Per-category F1

| Category | F1 | Description |
|----------|----|-------------|
| `Adversarial` | 0.855 | Carefully crafted inputs that exploit model quirks or training artifacts to elicit unintended behavior without an obvious override. |
| `DirectInjection` | 0.824 | Explicit instruction overrides that tell the model to ignore prior context (e.g. "ignore all previous instructions and …"). |
| `Encoding` | 0.752 | Obfuscated payloads using base64 / ROT13 / leetspeak / homoglyphs / zero-width chars / shell pipes to bypass keyword filters. |
| `Extraction` | 0.765 | Attempts to leak the system prompt, hidden instructions, or memorized training data (e.g. "print everything between <<system>> tags"). |
| `Indirect` | 0.838 | Injection delivered through untrusted retrieved content (RAG passages, scraped pages, file contents) rather than the user's direct turn. |
| `Jailbreak` | 0.737 | Persona / role swaps and constraint bypasses aimed at disabling safety alignment (e.g. DAN, "you are now an unrestricted assistant"). |
| `Manipulation` | 0.679 | Social-engineering framings (urgency, authority, sympathy, false context) that pressure the model into compliance. |
| `MultiTurn` | 0.689 | Crescendo / drip-feed attacks that build up across multiple turns to gradually erode guardrails. |
| `Smuggling` | 0.926 | Hidden control tokens, chat-template markers, or special sequences injected to confuse the parser (e.g. fake `<\|im_end\|>` / role tags). |
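
As a quick consistency check, the unweighted mean of the nine per-category F1 values above reproduces the macro-F1 of 0.785 reported in the metrics table. The snippet only redoes that arithmetic; the variable names are illustrative.

```python
# Per-category F1 values copied from the table above.
PER_CATEGORY_F1 = {
    "Adversarial": 0.855,
    "DirectInjection": 0.824,
    "Encoding": 0.752,
    "Extraction": 0.765,
    "Indirect": 0.838,
    "Jailbreak": 0.737,
    "Manipulation": 0.679,
    "MultiTurn": 0.689,
    "Smuggling": 0.926,
}

# Macro-F1 is the unweighted mean over categories: 7.065 / 9.
macro_f1 = sum(PER_CATEGORY_F1.values()) / len(PER_CATEGORY_F1)
print(round(macro_f1, 3))  # 0.785
```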

*Evaluated on `test_dataset_injection.csv`. Generated 2026-06-03 18:59 UTC.*