Text Classification
Transformers
Safetensors
modernbert
code
language-identification
multi-label
llm-guard
encoder
text-embeddings-inference
Instructions to use Accuknoxtechnologies/CodeLanguage-Encoder-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Accuknoxtechnologies/CodeLanguage-Encoder-v1 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Accuknoxtechnologies/CodeLanguage-Encoder-v1")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Accuknoxtechnologies/CodeLanguage-Encoder-v1")
model = AutoModelForSequenceClassification.from_pretrained("Accuknoxtechnologies/CodeLanguage-Encoder-v1")
```
- Notebooks
- Google Colab
- Kaggle
add/update model card with eval metrics
README.md
CHANGED
@@ -13,12 +13,59 @@ tags:

 # Code Language Identification (encoder, multi-label)

-- **Trained with**: max_seq_length=3072, epochs=3, lr=2e-05
+Encoder classifier that detects which programming languages (out of
+25) appear in an input. Fine-tuned from
+**[`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)**.
+Replaces the 2B Qwen decoder LoRA with a single-forward-pass encoder for
+lower-latency runtime-security use in LLM-Guard's `Code` scanner.

 - **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
 - **Labels (25)**: Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
+- **Output**: per-language sigmoid; `is_valid` = any language above threshold
+(0.5).
+- **Multilingual / long context**: inherited from the base encoder; trained with
+inputs up to the base model's positional limit.
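
The output rule described in the added bullets (one sigmoid per language, `is_valid` when any score exceeds 0.5) can be sketched in plain Python. This is a minimal illustration, not the repository's code: the `decode` helper and the example logits are hypothetical, and the label order is assumed to match the list in the card, which the actual model config may not follow.

```python
import math

# Label names copied from the model card; the order is an assumption.
LABELS = [
    "Python", "JavaScript", "Java", "C", "C++", "C#", "Go", "Rust", "Kotlin",
    "Swift", "Ruby", "R", "Scala", "Perl", "Lua", "Bash", "PowerShell", "Batch",
    "SQL", "Dockerfile", "YAML", "Makefile", "Terraform", "AWK", "jq",
]
THRESHOLD = 0.5  # threshold stated in the card


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=THRESHOLD):
    """Hypothetical helper: turn one logit per label into a multi-label decision."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    probs = {label: sigmoid(z) for label, z in zip(LABELS, logits)}
    # Each label is scored independently, so several can pass at once.
    detected = [label for label, p in probs.items() if p > threshold]
    return {"is_valid": bool(detected), "languages": detected, "probs": probs}


# Made-up logits: confident about Python and SQL, negative elsewhere.
example_logits = [-6.0] * len(LABELS)
example_logits[LABELS.index("Python")] = 3.2
example_logits[LABELS.index("SQL")] = 1.1

result = decode(example_logits)
print(result["is_valid"], result["languages"])
```

Because the scores are independent sigmoids rather than a softmax, an input mixing two languages can report both, and an input with no code can report none.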

+## Test-set metrics (n=500)
+
+| Metric | Value |
+|--------|-------|
+| is_valid accuracy | 0.976 |
+| category-set (exact) accuracy | 0.904 |
+| micro-F1 | 0.952 |
+| macro-F1 | 0.950 |
+| latency mean (ms/example) | 2.45145196095109 |
+| latency p95 (ms/example) | 4.068814963102341 |
+| device | cuda:0 |
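
The card does not define how these metrics were computed. The sketch below shows the usual multi-label definitions they most plausibly correspond to; it is an assumption, not the evaluation script. The `evaluate` helper, the toy data, and the zero-denominator convention (F1 = 0.0) are all hypothetical.

```python
from typing import Dict, List, Set


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); 0.0 when the denominator is zero (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> Dict[str, float]:
    n = len(gold)
    # is_valid accuracy: agreement on "at least one language present".
    valid_acc = sum(bool(g) == bool(p) for g, p in zip(gold, pred)) / n
    # Exact accuracy: the predicted label set must equal the gold set.
    exact_acc = sum(g == p for g, p in zip(gold, pred)) / n

    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label]["tp"] += 1
            elif label in p:
                counts[label]["fp"] += 1
            elif label in g:
                counts[label]["fn"] += 1

    # Micro-F1 pools the counts over all labels; macro-F1 averages per-label F1.
    micro = f1_from_counts(
        sum(c["tp"] for c in counts.values()),
        sum(c["fp"] for c in counts.values()),
        sum(c["fn"] for c in counts.values()),
    )
    macro = sum(f1_from_counts(**c) for c in counts.values()) / len(labels)
    return {"is_valid_acc": valid_acc, "exact_acc": exact_acc, "micro_f1": micro, "macro_f1": macro}


# Toy data, invented for illustration only.
toy_labels = ["Python", "Bash", "SQL"]
toy_gold = [{"Python"}, {"Bash", "SQL"}, set(), {"SQL"}]
toy_pred = [{"Python"}, {"Bash"}, set(), {"SQL", "Python"}]
scores = evaluate(toy_gold, toy_pred, toy_labels)
print(scores)
```

On the toy data every example is classified correctly as code or not code, while only half of the predicted sets match exactly. This is the same pattern as the gap between 0.976 and 0.904 in the table.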
+
+### Per-language F1
+
+| Language | F1 |
+|----------|----|
+| AWK | 0.926 |
+| Bash | 0.812 |
+| Batch | 0.964 |
+| C | 1.000 |
+| C# | 0.950 |
+| C++ | 0.958 |
+| Dockerfile | 0.955 |
+| Go | 0.950 |
+| Java | 1.000 |
+| JavaScript | 0.863 |
+| Kotlin | 1.000 |
+| Lua | 0.938 |
+| Makefile | 0.977 |
+| Perl | 0.947 |
+| PowerShell | 0.943 |
+| Python | 0.980 |
+| R | 0.963 |
+| Ruby | 0.977 |
+| Rust | 1.000 |
+| SQL | 1.000 |
+| Scala | 0.821 |
+| Swift | 0.939 |
+| Terraform | 0.950 |
+| YAML | 0.952 |
+| jq | 0.974 |
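
As a consistency check, the macro-F1 reported in the summary table should equal the unweighted mean of the per-language scores, assuming the standard definition of macro averaging. The snippet below recomputes it from the rounded values in the table, so agreement is only expected to about three decimals.

```python
from statistics import mean

# Per-language F1 values copied from the table above (rounded to 3 decimals).
PER_LANGUAGE_F1 = {
    "AWK": 0.926, "Bash": 0.812, "Batch": 0.964, "C": 1.000, "C#": 0.950,
    "C++": 0.958, "Dockerfile": 0.955, "Go": 0.950, "Java": 1.000,
    "JavaScript": 0.863, "Kotlin": 1.000, "Lua": 0.938, "Makefile": 0.977,
    "Perl": 0.947, "PowerShell": 0.943, "Python": 0.980, "R": 0.963,
    "Ruby": 0.977, "Rust": 1.000, "SQL": 1.000, "Scala": 0.821,
    "Swift": 0.939, "Terraform": 0.950, "YAML": 0.952, "jq": 0.974,
}

# Macro-F1 weights every language equally, regardless of its support.
macro_f1 = mean(PER_LANGUAGE_F1.values())

# The three lowest-scoring languages, sorted ascending by F1.
weakest = sorted(PER_LANGUAGE_F1, key=PER_LANGUAGE_F1.get)[:3]

print(f"macro-F1 ~ {macro_f1:.3f}")
print("lowest F1:", weakest)
```

The recomputed mean rounds to the reported 0.950. The lowest scores are Bash, Scala, and JavaScript.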
+
+*Evaluated on `test_dataset_langid.csv`. Generated 2026-06-01 18:00 UTC.*