Yash1005 committed on
Commit e5980ae · verified · 1 Parent(s): 89c5a45

add/update model card with eval metrics

  1. README.md +56 -5
README.md CHANGED
@@ -14,13 +14,17 @@ tags:
14
 
15
  # Prompt Injection Detection (encoder, multi-label)
16
 
17
- Multi-label classifier over 9 prompt-injection attack categories,
18
- fine-tuned from **[`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)**. Single
19
- forward pass; `is_valid` = any attack above threshold (0.5).
 
 
20
 
21
  - **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
22
- - **Trained with**: max_seq_length=3072, epochs=10, lr=3e-05
23
  - **Labels (9)**: DirectInjection, Jailbreak, Adversarial, Extraction, Encoding, Manipulation, Smuggling, Indirect, MultiTurn
 
 
 
24
 
25
  ## Usage
26
 
@@ -52,4 +56,51 @@ result = {"is_valid": is_valid, "category": {k: True for k in present}}
52
  print(result) # e.g. {"is_valid": True, "category": {"DirectInjection": True}}
53
  ```
54
 
55
- > Test-set metrics are added by `eval_and_push_card.py` after evaluation.
 
14
 
15
  # Prompt Injection Detection (encoder, multi-label)
16
 
17
+ Encoder classifier that detects which prompt-injection attack categories (out of
18
+ 9) appear in an input. Fine-tuned from
19
+ **[`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)**.
20
+ Replaces the 2B Qwen decoder LoRA with a single-forward-pass encoder for
21
+ lower-latency runtime-security use in LLM-Guard's `PromptInjection` scanner.
22
 
23
  - **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
 
24
  - **Labels (9)**: DirectInjection, Jailbreak, Adversarial, Extraction, Encoding, Manipulation, Smuggling, Indirect, MultiTurn
25
+ - **Output**: per-category sigmoid; a category fires when its score ≥ its per-class threshold; `is_valid` = `max(score) ≥ 0.05`.
26
+ - **Multilingual / long context**: inherited from the base encoder; trained with
27
+ inputs up to the base model's positional limit.
28
 
29
  ## Usage
30
 
 
56
  print(result) # e.g. {"is_valid": True, "category": {"DirectInjection": True}}
57
  ```
58
 
59
+ ## Decision thresholds
60
+
61
+ Fitted on a held-out split (NOT the test set reported below) and stored in
62
+ `config.json` (`category_thresholds`, `is_valid_threshold`) + `thresholds.json`.
63
+ The Usage snippet reads them automatically — a flat 0.5 cutoff under-detects the
64
+ imbalanced minority categories.
65
+
66
+ - **`is_valid` (attack) gate**: `max(score) ≥ 0.05`
67
+
68
+ | Category | threshold |
69
+ |----------|-----------|
70
+ | `DirectInjection` | 0.55 |
71
+ | `Jailbreak` | 0.05 |
72
+ | `Adversarial` | 0.45 |
73
+ | `Extraction` | 0.55 |
74
+ | `Encoding` | 0.45 |
75
+ | `Manipulation` | 0.25 |
76
+ | `Smuggling` | 0.65 |
77
+ | `Indirect` | 0.25 |
78
+ | `MultiTurn` | 0.70 |
79
+
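The decision rule in the added "Decision thresholds" section (a per-category sigmoid score compared against that category's threshold, plus an `is_valid` gate on the maximum score) can be sketched in plain Python. This is a minimal illustration and not the card's Usage snippet: the threshold values are copied from the table above, while the function names `sigmoid` and `decide` and the logit inputs are hypothetical. The real snippet is described as reading the thresholds from `config.json`.

```python
import math

# Per-category thresholds copied from the model card's table.
CATEGORY_THRESHOLDS = {
    "DirectInjection": 0.55,
    "Jailbreak": 0.05,
    "Adversarial": 0.45,
    "Extraction": 0.55,
    "Encoding": 0.45,
    "Manipulation": 0.25,
    "Smuggling": 0.65,
    "Indirect": 0.25,
    "MultiTurn": 0.70,
}
# Gate from the card: is_valid = max(score) >= 0.05.
IS_VALID_THRESHOLD = 0.05


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decide(logits: dict) -> dict:
    """Hypothetical helper: map per-category logits to the card's result shape."""
    scores = {name: sigmoid(logits[name]) for name in CATEGORY_THRESHOLDS}
    # A category fires when its score reaches its own threshold.
    present = [n for n, s in scores.items() if s >= CATEGORY_THRESHOLDS[n]]
    # The attack gate looks only at the single highest score.
    is_valid = max(scores.values()) >= IS_VALID_THRESHOLD
    return {"is_valid": is_valid, "category": {k: True for k in present}}


if __name__ == "__main__":
    example = {name: -5.0 for name in CATEGORY_THRESHOLDS}
    example["DirectInjection"] = 2.0  # made-up logit for illustration
    print(decide(example))
```

Because the gate and the per-category thresholds are separate, an input can set `is_valid` to `True` without any category firing, for example when the highest score is 0.27 on `Extraction` (threshold 0.55).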
80
+ ## Test-set metrics (n=500)
81
+
82
+ | Metric | Value |
83
+ |--------|-------|
84
+ | is_valid (attack-detection) accuracy | 0.968 |
85
+ | category-set (exact) accuracy | 0.688 |
86
+ | micro-F1 | 0.789 |
87
+ | macro-F1 | 0.785 |
88
+ | latency mean (ms/example) | 1.7930222675204277 |
89
+ | latency p95 (ms/example) | 1.8397919833660126 |
90
+ | device | cuda:0 |
91
+
92
+ ### Per-category F1
93
+
94
+ | Category | F1 | Description |
95
+ |----------|----|-------------|
96
+ | `Adversarial` | 0.855 | Carefully crafted inputs that exploit model quirks or training artifacts to elicit unintended behavior without an obvious override. |
97
+ | `DirectInjection` | 0.824 | Explicit instruction overrides that tell the model to ignore prior context (e.g. "ignore all previous instructions and …"). |
98
+ | `Encoding` | 0.752 | Obfuscated payloads using base64 / ROT13 / leetspeak / homoglyphs / zero-width chars / shell pipes to bypass keyword filters. |
99
+ | `Extraction` | 0.765 | Attempts to leak the system prompt, hidden instructions, or memorized training data (e.g. "print everything between <<system>> tags"). |
100
+ | `Indirect` | 0.838 | Injection delivered through untrusted retrieved content (RAG passages, scraped pages, file contents) rather than the user's direct turn. |
101
+ | `Jailbreak` | 0.737 | Persona / role swaps and constraint bypasses aimed at disabling safety alignment (e.g. DAN, "you are now an unrestricted assistant"). |
102
+ | `Manipulation` | 0.679 | Social-engineering framings (urgency, authority, sympathy, false context) that pressure the model into compliance. |
103
+ | `MultiTurn` | 0.689 | Crescendo / drip-feed attacks that build up across multiple turns to gradually erode guardrails. |
104
+ | `Smuggling` | 0.926 | Hidden control tokens, chat-template markers, or special sequences injected to confuse the parser (e.g. fake `<|im_end|>` / role tags). |
105
+
106
+ *Evaluated on `test_dataset_injection.csv`. Generated 2026-06-03 18:59 UTC.*
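
For readers comparing the reported numbers, the sketch below shows one common way to compute category-set (exact) accuracy, micro-F1, macro-F1, and per-category F1 from sets of gold and predicted labels. It assumes the standard definitions; the evaluation script itself is not part of this diff, so whether it uses exactly these conventions is an assumption. In particular, scoring a category with no true or predicted positives as F1 = 0.0 is a choice made here. The function names and the toy data are hypothetical.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred, labels):
    """y_true / y_pred: equal-length lists of label sets, one per example."""
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    exact = 0
    for gold, pred in zip(y_true, y_pred):
        # Exact accuracy requires the whole predicted set to match the gold set.
        exact += int(gold == pred)
        for label in labels:
            if label in pred and label in gold:
                counts[label]["tp"] += 1
            elif label in pred:
                counts[label]["fp"] += 1
            elif label in gold:
                counts[label]["fn"] += 1
    per_label = {
        label: f1_from_counts(c["tp"], c["fp"], c["fn"])
        for label, c in counts.items()
    }
    # Micro-F1 pools the counts over all categories before computing F1.
    micro = f1_from_counts(
        sum(c["tp"] for c in counts.values()),
        sum(c["fp"] for c in counts.values()),
        sum(c["fn"] for c in counts.values()),
    )
    # Macro-F1 is the unweighted mean of the per-category F1 scores.
    macro = sum(per_label.values()) / len(labels)
    return {
        "exact_accuracy": exact / len(y_true),
        "micro_f1": micro,
        "macro_f1": macro,
        "per_label_f1": per_label,
    }


if __name__ == "__main__":
    # Toy data with two made-up labels, for illustration only.
    gold = [{"A"}, {"A", "B"}, set(), {"B"}]
    pred = [{"A"}, {"A"}, set(), {"A", "B"}]
    print(multilabel_metrics(gold, pred, ["A", "B"]))
```

Exact accuracy is the strictest of these, since a single missing or extra category makes the whole example count as wrong. That is consistent with the card reporting a lower category-set accuracy (0.688) than micro-F1 (0.789).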