Code Language Identification (encoder, multi-label)
An encoder classifier that detects which of 25 programming languages appear in an input. It is fine-tuned from jhu-clsp/mmBERT-base.
It replaces the 2B Qwen decoder LoRA with a single-forward-pass encoder, for lower-latency runtime-security use in LLM-Guard's Code scanner.
- Base model: jhu-clsp/mmBERT-base
- Labels (25): Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
- Output: per-language sigmoid; is_valid = any language above threshold (0.5).
- Multilingual / long context: inherited from the base encoder; trained with inputs up to the base model's positional limit.
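The output rule in the list above can be written out explicitly. This is a sketch of the standard multi-label sigmoid decision; the symbols ($z_i$ for the logit of language $i$, $\tau$ for the threshold) are chosen here for illustration and do not come from the model card.

```latex
\begin{aligned}
p_i &= \sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \qquad i = 1, \dots, 25 \\
\text{present} &= \{\, i : p_i \ge \tau \,\}, \qquad \tau = 0.5 \\
\text{is\_valid} &= \bigl[\, \text{present} \ne \emptyset \,\bigr]
\end{aligned}
```

Because $\sigma(0) = 0.5$, the default threshold is equivalent to checking $z_i \ge 0$ on the raw logits.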
Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
REPO = "Accuknoxtechnologies/CodeLanguage-Encoder-v1"
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO).eval()
text = "def add(a, b):\n    return a + b"
enc = tokenizer(text, truncation=True, max_length=3072, return_tensors="pt")
with torch.no_grad():
    probs = model(**enc).logits.sigmoid()[0]  # per-language sigmoid
threshold = 0.5
id2label = model.config.id2label # {0: "Python", 1: "JavaScript", ...}
present = {id2label[i]: round(float(p), 3) for i, p in enumerate(probs) if p >= threshold}
# Same schema the original Qwen scanner emitted: is_valid = any language fired.
result = {"is_valid": bool(present), "category": {k: True for k in present}}
print(result) # e.g. {"is_valid": True, "category": {"Python": True}}
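The post-processing step above does not depend on the model object, so it can be factored into a small function and tested on plain lists of logits. The sketch below assumes that split; `sigmoid`, `decode_languages`, and the three-entry `toy_id2label` mapping are hypothetical names used for illustration and are not part of the repository.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function for a single float."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode_languages(logits, id2label, threshold=0.5):
    """Turn raw per-language logits into the {"is_valid", "category"} schema.

    `logits` is any sequence of floats, for example
    `model(**enc).logits[0].tolist()` from the usage snippet.
    """
    present = {}
    for i, z in enumerate(logits):
        # Each label is scored independently (multi-label, not softmax).
        if sigmoid(z) >= threshold:
            present[id2label[i]] = True
    return {"is_valid": bool(present), "category": present}


# Hypothetical three-label mapping; the real one is model.config.id2label.
toy_id2label = {0: "Python", 1: "JavaScript", 2: "Bash"}

mixed = decode_languages([2.0, -3.0, 0.4], toy_id2label)
nothing = decode_languages([-2.0, -3.0, -0.4], toy_id2label)
print(mixed)
print(nothing)
```

Keeping the decision rule separate from the forward pass makes it straightforward to experiment with a different `threshold` without reloading the model.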
Test-set metrics (n=500)
| Metric | Value |
|---|---|
| is_valid accuracy | 0.958 |
| category-set (exact) accuracy | 0.820 |
| micro-F1 | 0.898 |
| macro-F1 | 0.895 |
| latency mean (ms/example) | 2.393 |
| latency p95 (ms/example) | 3.833 |
| device | cuda:0 |
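For readers unfamiliar with the multi-label metrics in the table, the sketch below computes them on a toy example using the usual definitions. It is an assumption that the reported numbers follow these definitions (in particular, that is_valid accuracy compares "any language present" between prediction and reference); the evaluation script itself is not shown in this card. The function names and the toy data are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(gold, pred, labels):
    """gold/pred: lists of sets of language names, one set per example."""
    n = len(gold)
    # is_valid: does the example contain any language at all?
    is_valid_acc = sum(bool(g) == bool(p) for g, p in zip(gold, pred)) / n
    # exact match: the full predicted set equals the reference set
    exact_acc = sum(g == p for g, p in zip(gold, pred)) / n

    counts = {lab: [0, 0, 0] for lab in labels}  # [tp, fp, fn] per label
    for g, p in zip(gold, pred):
        for lab in labels:
            if lab in g and lab in p:
                counts[lab][0] += 1
            elif lab in p:
                counts[lab][1] += 1
            elif lab in g:
                counts[lab][2] += 1

    # micro: pool the counts over all labels, then compute one F1
    micro = f1(*(sum(c[k] for c in counts.values()) for k in range(3)))
    # macro: one F1 per label, then an unweighted mean
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return {"is_valid_acc": is_valid_acc, "exact_acc": exact_acc,
            "micro_f1": micro, "macro_f1": macro}


# Toy data (hypothetical): four examples, three labels.
labels = ["Python", "Bash", "SQL"]
gold = [{"Python"}, {"Bash", "SQL"}, set(), {"SQL"}]
pred = [{"Python"}, {"Bash"}, set(), {"SQL", "Python"}]
metrics = multilabel_metrics(gold, pred, labels)
print(metrics)
```

Exact-set accuracy is the strictest of the four: one missed or extra language makes the whole example count as wrong, which is why it sits below the F1 scores in the table.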
Per-language F1
| Language | F1 |
|---|---|
| AWK | 0.926 |
| Bash | 0.722 |
| Batch | 0.902 |
| C | 0.864 |
| C# | 0.927 |
| C++ | 0.936 |
| Dockerfile | 0.977 |
| Go | 0.919 |
| Java | 0.917 |
| JavaScript | 0.816 |
| Kotlin | 1.000 |
| Lua | 0.867 |
| Makefile | 0.878 |
| Perl | 0.857 |
| PowerShell | 0.833 |
| Python | 0.863 |
| R | 0.906 |
| Ruby | 0.900 |
| Rust | 0.981 |
| SQL | 0.980 |
| Scala | 0.762 |
| Swift | 0.917 |
| Terraform | 0.895 |
| YAML | 0.955 |
| jq | 0.889 |
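As a quick consistency check, macro-F1 is by definition the unweighted mean of the per-language F1 scores, so it can be recomputed from the table above. The sketch below copies the 25 rounded values from the table; because each entry is rounded to three decimals, the recomputed mean can differ from the reported 0.895 in the last digit.

```python
# Per-language F1 values copied from the table above (rounded to 3 decimals).
per_language_f1 = {
    "AWK": 0.926, "Bash": 0.722, "Batch": 0.902, "C": 0.864, "C#": 0.927,
    "C++": 0.936, "Dockerfile": 0.977, "Go": 0.919, "Java": 0.917,
    "JavaScript": 0.816, "Kotlin": 1.000, "Lua": 0.867, "Makefile": 0.878,
    "Perl": 0.857, "PowerShell": 0.833, "Python": 0.863, "R": 0.906,
    "Ruby": 0.900, "Rust": 0.981, "SQL": 0.980, "Scala": 0.762,
    "Swift": 0.917, "Terraform": 0.895, "YAML": 0.955, "jq": 0.889,
}

# Macro-F1: every language counts equally, regardless of its support.
macro_f1 = sum(per_language_f1.values()) / len(per_language_f1)

# The three lowest-scoring languages, weakest first.
weakest = sorted(per_language_f1, key=per_language_f1.get)[:3]

print(f"languages: {len(per_language_f1)}")
print(f"macro-F1 from table: {macro_f1:.4f}")
print(f"lowest three: {weakest}")
```

The recomputed mean agrees with the reported macro-F1 to within the rounding of the table entries, and the sort makes it easy to see which languages pull the average down.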
Evaluated on test_dataset_langid.csv. Generated 2026-06-02 09:23 UTC.