---
language:
- code
library_name: transformers
pipeline_tag: text-classification
tags:
- code-review
- bug-detection
- codebert
- python
- security
- static-analysis
datasets:
- code_search_net
base_model: microsoft/codebert-base
metrics:
- f1
- accuracy
---

# CodeSheriff Bug Classifier

A fine-tuned **CodeBERT** model that classifies Python code snippets into five bug categories. Built as the classification engine inside [CodeSheriff](https://github.com/jayansh21/CodeSheriff), an AI system that automatically reviews GitHub pull requests.

**Base model:** `microsoft/codebert-base` · **Task:** 5-class sequence classification · **Language:** Python

---

## Labels

| ID | Label | Example |
|----|-------|---------|
| 0 | Clean | Well-formed code, no issues |
| 1 | Null Reference Risk | `result.fetchone().name` without a None check |
| 2 | Type Mismatch | `"Error: " + error_code` where `error_code` is an int |
| 3 | Security Vulnerability | `"SELECT * FROM users WHERE id = " + user_id` |
| 4 | Logic Flaw | `for i in range(len(items) + 1)` |
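Two of these patterns are easy to reproduce at runtime; a minimal, self-contained sketch (the snippets mirror the table's examples and are illustrative, not taken from the training set):

```python
# Class 4 – Logic Flaw: range(len(items) + 1) walks one index past the end.
def sum_items_buggy(items):
    total = 0
    for i in range(len(items) + 1):  # last i == len(items) -> IndexError
        total += items[i]
    return total

# Class 1 – Null Reference Risk: attribute access on a possibly-None result.
class Row:
    def __init__(self, name):
        self.name = name

def fetchone(found):
    # Mimics a DB cursor: returns a row object, or None when nothing matched.
    return Row("alice") if found else None

try:
    sum_items_buggy([1, 2, 3])
except IndexError:
    print("Logic Flaw triggered")

try:
    fetchone(found=False).name  # no None check before attribute access
except AttributeError:
    print("Null Reference Risk triggered")
```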

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")

LABELS = {
    0: "Clean",
    1: "Null Reference Risk",
    2: "Type Mismatch",
    3: "Security Vulnerability",
    4: "Logic Flaw",
}

code = """
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)
"""

inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred = logits.argmax(dim=-1).item()
confidence = probs[0][pred].item()

print(f"{LABELS[pred]} ({confidence:.1%})")
# Security Vulnerability (99.3%)
```
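In a review pipeline such as CodeSheriff's, predictions are typically gated on confidence before a finding is surfaced; a minimal sketch (the `should_flag` helper and the 0.80 threshold are illustrative assumptions, not part of the released model):

```python
def should_flag(label: str, confidence: float, threshold: float = 0.80) -> bool:
    """Surface a finding only for confident, non-Clean predictions."""
    return label != "Clean" and confidence >= threshold

print(should_flag("Security Vulnerability", 0.993))  # True
print(should_flag("Clean", 0.970))                   # False: Clean is never flagged
print(should_flag("Logic Flaw", 0.550))              # False: below threshold
```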

---

## Training

**Dataset:** [CodeSearchNet](https://huggingface.co/datasets/code_search_net) Python split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 balanced samples across all five classes. Stratified 80/10/10 train/val/test split.
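The stratified split itself is straightforward to reproduce; a stdlib-only sketch (the real pipeline may instead use `sklearn.model_selection.train_test_split` with `stratify=`; the sample counts below are hypothetical):

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split (text, label) pairs so each split preserves the class proportions."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        a = int(len(group) * ratios[0])
        b = int(len(group) * (ratios[0] + ratios[1]))
        train += group[:a]
        val += group[a:b]
        test += group[b:]
    return train, val, test

# 1,000 hypothetical samples, 200 per class
data = [(f"snippet_{i}", i % 5) for i in range(1000)]
train, val, test = stratified_split(data)
print(len(train), len(val), len(test))  # 800 100 100
```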

**Key hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Epochs | 4 |
| Effective batch size | 16 (8 × 2 grad accum) |
| Learning rate | 2e-5 |
| Optimizer | AdamW + linear warmup |
| Max token length | 512 |
| Class weighting | Yes (balanced) |
| Hardware | NVIDIA RTX 3050 (4 GB) |
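"Balanced" class weighting scales each class's loss contribution by w_c = N / (K · n_c), so rarer classes count proportionally more; a sketch of the computation (the label counts are hypothetical, and in training the weights would be passed to e.g. `torch.nn.CrossEntropyLoss(weight=...)`):

```python
def balanced_weights(counts):
    """w_c = N / (K * n_c): a class half as frequent gets twice the weight."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

# Hypothetical pre-balancing label counts
counts = {"Clean": 2000, "NullRef": 500, "TypeMis": 250, "SecVuln": 250, "Logic": 500}
weights = balanced_weights(counts)
print({label: round(w, 2) for label, w in weights.items()})
# {'Clean': 0.35, 'NullRef': 1.4, 'TypeMis': 2.8, 'SecVuln': 2.8, 'Logic': 1.4}
```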

---

## Evaluation

Test set: 840 samples (stratified).

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|----|---------|
| Clean | 0.92 | 0.88 | 0.90 | 450 |
| Null Reference Risk | 0.63 | 0.78 | 0.70 | 120 |
| Type Mismatch | 0.96 | 0.95 | 0.95 | 75 |
| Security Vulnerability | 0.99 | 0.92 | 0.95 | 75 |
| Logic Flaw | 0.96 | 0.97 | 0.97 | 120 |
| **Macro avg** | **0.89** | **0.90** | **0.89** | |
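The macro row is the unweighted mean of the per-class scores, which is easy to verify; e.g. for F1:

```python
per_class_f1 = [0.90, 0.70, 0.95, 0.95, 0.97]
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(f"Macro F1: {macro_f1:.2f}")  # Macro F1: 0.89
```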

**Confusion matrix:**

```
                   Clean  NullRef  TypeMis  SecVuln  Logic
Actual Clean    [   394       52        1        1      2 ]
Actual NullRef  [    23       93        1        0      3 ]
Actual TypeMis  [     3        1       71        0      0 ]
Actual SecVuln  [     4        1        1       69      0 ]
Actual Logic    [     3        0        0        0    117 ]
```

Logic Flaw and Security Vulnerability are the strongest classes; both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code closely resembles clean code structurally. Most misclassifications there are false positives rather than missed bugs.
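The per-class figures can be read straight off the matrix; e.g. for Null Reference Risk, precision comes from the predicted-NullRef column and recall from the actual-NullRef row:

```python
# Rows: actual class; columns: predicted class (same order as the matrix above).
cm = [
    [394, 52,  1,  1,   2],  # actual Clean
    [ 23, 93,  1,  0,   3],  # actual NullRef
    [  3,  1, 71,  0,   0],  # actual TypeMis
    [  4,  1,  1, 69,   0],  # actual SecVuln
    [  3,  0,  0,  0, 117],  # actual Logic
]

NULLREF = 1
tp = cm[NULLREF][NULLREF]                    # correctly predicted NullRef
predicted = sum(row[NULLREF] for row in cm)  # column sum: all NullRef predictions
actual = sum(cm[NULLREF])                    # row sum: all true NullRef samples

print(f"precision = {tp}/{predicted} = {tp / predicted:.2f}")  # 93/147 = 0.63
print(f"recall    = {tp}/{actual} = {tp / actual:.2f}")        # 93/120 = 0.78
```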

---

## Limitations

- **Python only** – not trained on other languages
- **Function-level input** – works best on 5–50 line snippets
- **Heuristic labels** – training data was pattern-matched, not expert-annotated
- **Not a SAST replacement** – probabilistic classifier, not a sound static analysis tool
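Since the model expects function-level input, whole files should be chunked into per-function snippets before classification; a stdlib sketch (Python 3.9+; the `extract_functions` helper is illustrative, not part of CodeSheriff):

```python
import ast

def extract_functions(source: str) -> list[str]:
    """Return the source text of every function or method defined in a file."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

source = '''
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)

def ping():
    return "pong"
'''

for snippet in extract_functions(source):
    print(snippet)
    print("---")
```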

---

## Links

- GitHub: [jayansh21/CodeSheriff](https://github.com/jayansh21/CodeSheriff)
- Live demo: [huggingface.co/spaces/jayansh21/CodeSheriff](https://huggingface.co/spaces/jayansh21/CodeSheriff)