Code Language Identification (encoder, multi-label)

Encoder classifier that detects which programming languages (out of 25) appear in an input. Fine-tuned from jhu-clsp/mmBERT-base. Replaces the 2B Qwen decoder LoRA with a single-forward-pass encoder for lower-latency runtime-security use in LLM-Guard's Code scanner.

  • Base model: jhu-clsp/mmBERT-base
  • Labels (25): Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
  • Output: an independent sigmoid probability per language; is_valid is true when at least one language scores at or above the threshold (0.5).
  • Multilingual / long context: inherited from the base encoder; trained with inputs up to the base model's positional limit.
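The output rule in the list above can be made concrete with a minimal sketch in plain Python. The `decide` helper, the three-label subset, and the logit values are all hypothetical and chosen only for illustration; the real model emits 25 logits in the order given by its config.

```python
import math

def decide(logits, labels, threshold=0.5):
    """Apply an independent sigmoid per label and keep labels at or above threshold."""
    probs = {lab: 1.0 / (1.0 + math.exp(-z)) for lab, z in zip(labels, logits)}
    present = [lab for lab, p in probs.items() if p >= threshold]
    return {"is_valid": bool(present), "category": {lab: True for lab in present}}

# Hypothetical logits, not real model output.
labels = ["Python", "Bash", "SQL"]
mixed = decide([2.0, 1.2, -3.0], labels)    # Python and Bash both clear 0.5
none = decide([-2.0, -1.5, -4.0], labels)   # no label clears 0.5
print(mixed)
print(none)
```

Because each label has its own sigmoid rather than sharing a softmax, the scores do not sum to 1, so a mixed-language input can report several languages at once and an input with no code can report none.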

Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

REPO = "Accuknoxtechnologies/CodeLanguage-Encoder-v1"
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO).eval()

text = "def add(a, b):\n    return a + b"
enc = tokenizer(text, truncation=True, max_length=3072, return_tensors="pt")
with torch.no_grad():
    probs = model(**enc).logits.sigmoid()[0]      # per-language sigmoid

threshold = 0.5
id2label = model.config.id2label                  # {0: "Python", 1: "JavaScript", ...}
present = {id2label[i]: round(float(p), 3) for i, p in enumerate(probs) if p >= threshold}

# Same schema the original Qwen scanner emitted: is_valid = any language fired.
result = {"is_valid": bool(present), "category": {k: True for k in present}}
print(result)   # e.g. {"is_valid": True, "category": {"Python": True}}

Test-set metrics (n=500)

Metric                           Value
is_valid accuracy                0.958
category-set (exact) accuracy    0.820
micro-F1                         0.898
macro-F1                         0.895
latency mean (ms/example)        2.393
latency p95 (ms/example)         3.833
device                           cuda:0
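The card does not spell out how these metrics are defined. The sketch below uses the common multi-label definitions as an assumption: is_valid accuracy compares "any label present" between gold and prediction, exact accuracy requires the full label sets to match, micro-F1 pools counts across labels, and macro-F1 averages per-label F1. The `evaluate` function, the toy data, and the convention of scoring an empty label as 0.0 are illustrative, not taken from the evaluation script.

```python
def evaluate(gold, pred, labels):
    """gold, pred: lists of label sets, one set per example."""
    n = len(gold)
    is_valid_acc = sum(bool(g) == bool(p) for g, p in zip(gold, pred)) / n
    exact_acc = sum(g == p for g, p in zip(gold, pred)) / n

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0   # assumed convention for empty labels

    counts = {lab: [0, 0, 0] for lab in labels}   # [tp, fp, fn] per label
    for g, p in zip(gold, pred):
        for lab in labels:
            if lab in g and lab in p:
                counts[lab][0] += 1
            elif lab in p:
                counts[lab][1] += 1
            elif lab in g:
                counts[lab][2] += 1

    per_label = {lab: f1(*c) for lab, c in counts.items()}
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    return {
        "is_valid_accuracy": is_valid_acc,
        "exact_accuracy": exact_acc,
        "micro_f1": f1(*totals),
        "macro_f1": sum(per_label.values()) / len(labels),
        "per_label_f1": per_label,
    }

# Toy data for illustration only.
labels = ["Python", "Bash", "SQL"]
gold = [{"Python"}, {"Bash", "Python"}, set(), {"SQL"}]
pred = [{"Python"}, {"Python"}, set(), {"SQL", "Bash"}]
report = evaluate(gold, pred, labels)
print(report)
```

On the toy data every is_valid decision is right while only half of the label sets match exactly, which mirrors the gap between the first two rows of the table: exact-set accuracy penalizes any missed or spurious language.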

Per-language F1

Language      F1
AWK           0.926
Bash          0.722
Batch         0.902
C             0.864
C#            0.927
C++           0.936
Dockerfile    0.977
Go            0.919
Java          0.917
JavaScript    0.816
Kotlin        1.000
Lua           0.867
Makefile      0.878
Perl          0.857
PowerShell    0.833
Python        0.863
R             0.906
Ruby          0.900
Rust          0.981
SQL           0.980
Scala         0.762
Swift         0.917
Terraform     0.895
YAML          0.955
jq            0.889
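As a quick consistency check, and assuming macro-F1 is the unweighted mean of the per-language F1 scores (the card does not say so explicitly), the table values can be averaged directly. The mean of the rounded values lands within 0.001 of the reported macro-F1 of 0.895. The same dictionary also surfaces the weakest languages.

```python
# Per-language F1 values copied from the table above.
per_language_f1 = {
    "AWK": 0.926, "Bash": 0.722, "Batch": 0.902, "C": 0.864, "C#": 0.927,
    "C++": 0.936, "Dockerfile": 0.977, "Go": 0.919, "Java": 0.917,
    "JavaScript": 0.816, "Kotlin": 1.000, "Lua": 0.867, "Makefile": 0.878,
    "Perl": 0.857, "PowerShell": 0.833, "Python": 0.863, "R": 0.906,
    "Ruby": 0.900, "Rust": 0.981, "SQL": 0.980, "Scala": 0.762,
    "Swift": 0.917, "Terraform": 0.895, "YAML": 0.955, "jq": 0.889,
}

# Unweighted mean over the 25 labels (assumed definition of macro-F1).
macro_from_table = sum(per_language_f1.values()) / len(per_language_f1)

# Three lowest-scoring languages, lowest first.
weakest = sorted(per_language_f1, key=per_language_f1.get)[:3]

print(f"macro-F1 from table: {macro_from_table:.4f}")
print("weakest:", weakest)
```

Bash, Scala, and JavaScript score lowest, so results involving those labels deserve the most scrutiny when the model is used as a gate.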

Evaluated on test_dataset_langid.csv. Generated 2026-06-02 09:23 UTC.
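The card does not describe how the latency figures were measured. A hypothetical harness for reproducing per-example mean and p95 numbers might look like the sketch below. `time_calls` and `summarize` are illustrative names, the percentile uses linear interpolation between neighboring ranks (an assumption), and the built-in `len` stands in for the model call.

```python
import time

def time_calls(fn, inputs):
    """Return per-call wall-clock latency in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def summarize(latencies_ms):
    """Mean and 95th percentile, using linear interpolation between ranks."""
    xs = sorted(latencies_ms)
    rank = 0.95 * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    p95 = xs[lo] + (rank - lo) * (xs[hi] - xs[lo])
    return {"mean_ms": sum(xs) / len(xs), "p95_ms": p95}

# Deterministic example values, not measurements.
stats = summarize([1.0, 2.0, 3.0, 4.0, 5.0])

# Stand-in callable; a real run would wrap tokenization plus the forward pass.
timed = time_calls(len, ["def f(): pass", "SELECT 1;"])
print(stats, len(timed))
```

When timing on a CUDA device, kernels launch asynchronously, so a real harness would call `torch.cuda.synchronize()` before reading the clock and would typically discard a few warm-up calls.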

Model size: 0.3B params (Safetensors, tensor type F32)