Accuknoxtechnologies/CodeLanguage-Encoder-v1

Text Classification

language-identification

text-embeddings-inference


Yash1005 committed 4 days ago

Commit b55278f (verified) · 1 parent: 36550eb

add/update model card with eval metrics

Files changed (1)

README.md +52 -5


@@ -13,12 +13,59 @@ tags:
 # Code Language Identification (encoder, multi-label)
-Multi-label classifier over 25 programming languages, fine-tuned from
-**[`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)**. Single forward pass;
-`is_valid` = any language above threshold (0.5).
 - **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
-- **Trained with**: max_seq_length=3072, epochs=3, lr=2e-05
 - **Labels (25)**: Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
-> Test-set metrics are added by `eval_and_push_card.py` after evaluation.

 # Code Language Identification (encoder, multi-label)
+Encoder classifier that detects which programming languages (out of
+25) appear in an input. Fine-tuned from
+**[`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)**.
+Replaces the 2B Qwen decoder LoRA with a single-forward-pass encoder for
+lower-latency runtime-security use in LLM-Guard's `Code` scanner.
 - **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
 - **Labels (25)**: Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
+- **Output**: per-language sigmoid; `is_valid` = any language above threshold
+  (0.5).
+- **Multilingual / long context**: inherited from the base encoder; trained with
+  inputs up to the base model's positional limit.
+## Test-set metrics (n=500)
+| Metric | Value |
+|--------|-------|
+| is_valid accuracy | 0.976 |
+| category-set (exact) accuracy | 0.904 |
+| micro-F1 | 0.952 |
+| macro-F1 | 0.950 |
+| latency mean (ms/example) | 2.451 |
+| latency p95 (ms/example) | 4.069 |
+| device | cuda:0 |
+### Per-language F1
+| Language | F1 |
+|----------|----|
+| AWK | 0.926 |
+| Bash | 0.812 |
+| Batch | 0.964 |
+| C | 1.000 |
+| C# | 0.950 |
+| C++ | 0.958 |
+| Dockerfile | 0.955 |
+| Go | 0.950 |
+| Java | 1.000 |
+| JavaScript | 0.863 |
+| Kotlin | 1.000 |
+| Lua | 0.938 |
+| Makefile | 0.977 |
+| Perl | 0.947 |
+| PowerShell | 0.943 |
+| Python | 0.980 |
+| R | 0.963 |
+| Ruby | 0.977 |
+| Rust | 1.000 |
+| SQL | 1.000 |
+| Scala | 0.821 |
+| Swift | 0.939 |
+| Terraform | 0.950 |
+| YAML | 0.952 |
+| jq | 0.974 |
+*Evaluated on `test_dataset_langid.csv`. Generated 2026-06-01 18:00 UTC.*
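
The output rule in the updated card (a per-language sigmoid, with `is_valid` true when any language scores above 0.5) can be sketched as follows. This is a minimal illustration, not the repository's inference code: the function name `decide`, the example logits, and the label order are assumptions. The real index-to-label mapping should be read from the checkpoint's configuration rather than from the order the card lists the labels in.

```python
import math

# Labels in the order the model card lists them. ASSUMPTION: the checkpoint's
# actual output index order may differ; read it from the model config.
LABELS = [
    "Python", "JavaScript", "Java", "C", "C++", "C#", "Go", "Rust", "Kotlin",
    "Swift", "Ruby", "R", "Scala", "Perl", "Lua", "Bash", "PowerShell",
    "Batch", "SQL", "Dockerfile", "YAML", "Makefile", "Terraform", "AWK", "jq",
]

THRESHOLD = 0.5  # threshold stated in the model card


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decide(logits, threshold: float = THRESHOLD) -> dict:
    """Turn one logit per label into the multi-label decision.

    `decide` is a hypothetical helper name used only for this sketch.
    """
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    # Each label gets an independent probability (multi-label, not softmax).
    probs = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    # A language is reported when its probability is strictly above threshold.
    languages = [label for label, p in probs.items() if p > threshold]
    # is_valid: at least one language cleared the threshold.
    return {"is_valid": bool(languages), "languages": languages, "probs": probs}


# Made-up logits for a snippet that mixes Python and Bash.
example = [-4.0] * len(LABELS)
example[LABELS.index("Python")] = 3.0
example[LABELS.index("Bash")] = 1.2
result = decide(example)
print(result["is_valid"], result["languages"])
```

Because each label is scored independently, an input can yield zero, one, or several languages. Under this reading, the card's "category-set (exact) accuracy" presumably counts an example as correct only when the full predicted set matches the reference set, which is stricter than `is_valid` accuracy.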