Text Classification
Transformers
Safetensors
modernbert
code
language-identification
multi-label
llm-guard
encoder
text-embeddings-inference
Instructions to use Accuknoxtechnologies/CodeLanguage-Encoder-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Accuknoxtechnologies/CodeLanguage-Encoder-v1 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Accuknoxtechnologies/CodeLanguage-Encoder-v1")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Accuknoxtechnologies/CodeLanguage-Encoder-v1")
model = AutoModelForSequenceClassification.from_pretrained("Accuknoxtechnologies/CodeLanguage-Encoder-v1")
```
- Notebooks
- Google Colab
- Kaggle
add/update model card with eval metrics
README.md
CHANGED
@@ -13,13 +13,18 @@ tags:

 # Code Language Identification (encoder, multi-label)

-
-
-`

 - **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
-- **Trained with**: max_seq_length=3072, epochs=2, lr=2e-05
 - **Labels (25)**: Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq

 ## Usage

@@ -45,4 +50,46 @@ result = {"is_valid": bool(present), "category": {k: True for k in present}}
 print(result) # e.g. {"is_valid": True, "category": {"Python": True}}
 ```

-
@@ -13,13 +13,18 @@ tags:

 # Code Language Identification (encoder, multi-label)

+Encoder classifier that detects which programming languages (out of
+25) appear in an input. Fine-tuned from
+**[`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)**.
+Replaces the 2B Qwen decoder LoRA with a single-forward-pass encoder for
+lower-latency runtime-security use in LLM-Guard's `Code` scanner.

 - **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
 - **Labels (25)**: Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
+- **Output**: per-language sigmoid; `is_valid` = any language above threshold
+(0.5).
+- **Multilingual / long context**: inherited from the base encoder; trained with
+inputs up to the base model's positional limit.

 ## Usage

@@ -45,4 +50,46 @@ result = {"is_valid": bool(present), "category": {k: True for k in present}}
 print(result) # e.g. {"is_valid": True, "category": {"Python": True}}
 ```

+## Test-set metrics (n=500)
+
+| Metric | Value |
+|--------|-------|
+| is_valid accuracy | 0.958 |
+| category-set (exact) accuracy | 0.820 |
+| micro-F1 | 0.898 |
+| macro-F1 | 0.895 |
+| latency mean (ms/example) | 2.3932456970214844 |
+| latency p95 (ms/example) | 3.833106905221939 |
+| device | cuda:0 |
+
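The accuracy and F1 rows above all derive from thresholded per-language scores. The evaluation script behind this card is not part of the diff, so the following is only a minimal sketch of how such numbers could be computed. The helper names `predict_set` and `evaluate`, the toy scores, and the choice to score a label with no gold or predicted occurrences as F1 = 0 are illustrative assumptions, not the authors' method.

```python
from typing import Dict, List, Set

THRESHOLD = 0.5  # threshold stated in the model card


def predict_set(scores: Dict[str, float], threshold: float = THRESHOLD) -> Set[str]:
    """Return the languages whose sigmoid score is above the threshold."""
    return {lang for lang, p in scores.items() if p > threshold}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when the label never occurs (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> Dict[str, float]:
    """Compute is_valid accuracy, exact-set accuracy, micro-F1 and macro-F1."""
    n = len(gold)
    # is_valid is True when at least one language is present.
    is_valid_acc = sum(bool(g) == bool(p) for g, p in zip(gold, pred)) / n
    # Exact accuracy requires the full predicted set to match the gold set.
    exact_acc = sum(g == p for g, p in zip(gold, pred)) / n

    counts = {lab: [0, 0, 0] for lab in labels}  # [tp, fp, fn] per label
    for g, p in zip(gold, pred):
        for lab in labels:
            if lab in p and lab in g:
                counts[lab][0] += 1
            elif lab in p:
                counts[lab][1] += 1
            elif lab in g:
                counts[lab][2] += 1

    per_label = {lab: f1(*c) for lab, c in counts.items()}
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return {
        "is_valid_accuracy": is_valid_acc,
        "exact_set_accuracy": exact_acc,
        "micro_f1": f1(tp, fp, fn),  # pooled counts over all labels
        "macro_f1": sum(per_label.values()) / len(labels),  # unweighted mean
    }


# Toy data (hypothetical): three inputs, three of the 25 labels.
labels = ["Python", "Bash", "SQL"]
scores = [
    {"Python": 0.9, "Bash": 0.2, "SQL": 0.1},
    {"Python": 0.7, "Bash": 0.6, "SQL": 0.1},
    {"Python": 0.1, "Bash": 0.3, "SQL": 0.2},
]
gold = [{"Python"}, {"Bash"}, set()]
pred = [predict_set(s) for s in scores]
metrics = evaluate(gold, pred, labels)
print(metrics)
```

Micro-F1 pools true and false positives across all labels, so frequent languages dominate it, while macro-F1 weights every language equally. That difference is why the two values in the table can diverge when per-language scores are uneven.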
+### Per-language F1
+
+| Language | F1 |
+|----------|----|
+| AWK | 0.926 |
+| Bash | 0.722 |
+| Batch | 0.902 |
+| C | 0.864 |
+| C# | 0.927 |
+| C++ | 0.936 |
+| Dockerfile | 0.977 |
+| Go | 0.919 |
+| Java | 0.917 |
+| JavaScript | 0.816 |
+| Kotlin | 1.000 |
+| Lua | 0.867 |
+| Makefile | 0.878 |
+| Perl | 0.857 |
+| PowerShell | 0.833 |
+| Python | 0.863 |
+| R | 0.906 |
+| Ruby | 0.900 |
+| Rust | 0.981 |
+| SQL | 0.980 |
+| Scala | 0.762 |
+| Swift | 0.917 |
+| Terraform | 0.895 |
+| YAML | 0.955 |
+| jq | 0.889 |
+
+*Evaluated on `test_dataset_langid.csv`. Generated 2026-06-02 09:23 UTC.*
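The card reports mean and p95 latency per example but does not say how they were measured. As a rough sketch of one way to produce comparable figures, the snippet below times a callable once per input and summarizes the samples. Everything here is an assumption for illustration: `measure_latency`, `percentile`, and the stand-in `fake_classify` are hypothetical names, and the linear-interpolation percentile is one common convention, not necessarily the one used for the reported numbers.

```python
import time
from typing import Callable, Dict, List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def measure_latency(fn: Callable[[str], object], inputs: List[str], warmup: int = 1) -> Dict[str, float]:
    """Time fn on each input and return mean and p95 in milliseconds."""
    # Warm-up calls are excluded so one-time setup cost does not skew results.
    for text in inputs[:warmup]:
        fn(text)
    samples_ms = []
    for text in inputs:
        start = time.perf_counter()
        fn(text)
        # For a GPU model, the device must be synchronized here before the
        # clock is read, otherwise only the kernel launch is timed.
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": sum(samples_ms) / len(samples_ms),
        "p95_ms": percentile(samples_ms, 95),
    }


def fake_classify(text: str) -> int:
    """Hypothetical stand-in for a model call; replace with real inference."""
    return sum(ord(ch) for ch in text)


stats = measure_latency(fake_classify, ["print('hi')", "SELECT 1;", "echo ok"] * 10)
print(stats)
```

Reporting p95 alongside the mean shows tail behavior: a few long inputs can push the tail well above the average, which matters for a scanner that runs inline on every request.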