---
language:
- code
library_name: transformers
pipeline_tag: text-classification
tags:
- code-review
- bug-detection
- codebert
- python
- security
- static-analysis
datasets:
- code_search_net
base_model: microsoft/codebert-base
metrics:
- f1
- accuracy
---

# CodeSheriff Bug Classifier

A fine-tuned **CodeBERT** model that classifies Python code snippets into five bug categories. Built as the classification engine inside [CodeSheriff](https://github.com/jayansh21/CodeSheriff), an AI system that automatically reviews GitHub pull requests.

**Base model:** `microsoft/codebert-base` · **Task:** 5-class sequence classification · **Language:** Python

---

## Labels

| ID | Label | Example |
|----|-------|---------|
| 0 | Clean | Well-formed code, no issues |
| 1 | Null Reference Risk | `result.fetchone().name` without a None check |
| 2 | Type Mismatch | `"Error: " + error_code` where `error_code` is an int |
| 3 | Security Vulnerability | `"SELECT * FROM users WHERE id = " + user_id` |
| 4 | Logic Flaw | `for i in range(len(items) + 1)` |
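Two of these patterns are easy to reproduce at runtime; a minimal, self-contained sketch (the snippets mirror the table's examples and are illustrative, not taken from the training set):

```python
# Class 4 – Logic Flaw: range(len(items) + 1) walks one index past the end.
def sum_items_buggy(items):
    total = 0
    for i in range(len(items) + 1):  # last i == len(items) -> IndexError
        total += items[i]
    return total

# Class 1 – Null Reference Risk: attribute access on a possibly-None result.
class Row:
    def __init__(self, name):
        self.name = name

def fetchone(found):
    # Mimics a DB cursor: returns a row object, or None when nothing matched.
    return Row("alice") if found else None

try:
    sum_items_buggy([1, 2, 3])
except IndexError:
    print("Logic Flaw triggered")

try:
    fetchone(found=False).name  # no None check before attribute access
except AttributeError:
    print("Null Reference Risk triggered")
```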

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")

LABELS = {
    0: "Clean",
    1: "Null Reference Risk",
    2: "Type Mismatch",
    3: "Security Vulnerability",
    4: "Logic Flaw",
}

code = """
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)
"""

inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred = logits.argmax(dim=-1).item()
confidence = probs[0][pred].item()

print(f"{LABELS[pred]} ({confidence:.1%})")
# Security Vulnerability (99.3%)
```
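In a review pipeline such as CodeSheriff's, predictions are typically gated on confidence before a finding is surfaced; a minimal sketch (the `should_flag` helper and the 0.80 threshold are illustrative assumptions, not part of the released model):

```python
def should_flag(label: str, confidence: float, threshold: float = 0.80) -> bool:
    """Surface a finding only for confident, non-Clean predictions."""
    return label != "Clean" and confidence >= threshold

print(should_flag("Security Vulnerability", 0.993))  # True
print(should_flag("Clean", 0.970))                   # False: Clean is never flagged
print(should_flag("Logic Flaw", 0.550))              # False: below threshold
```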

---

## Training

**Dataset:** [CodeSearchNet](https://huggingface.co/datasets/code_search_net) Python split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 balanced samples across all five classes. Stratified 80/10/10 train/val/test split.
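The stratified split itself is straightforward to reproduce; a stdlib-only sketch (the real pipeline may instead use `sklearn.model_selection.train_test_split` with `stratify=`; the sample counts below are hypothetical):

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split (text, label) pairs so each split preserves the class proportions."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        a = int(len(group) * ratios[0])
        b = int(len(group) * (ratios[0] + ratios[1]))
        train += group[:a]
        val += group[a:b]
        test += group[b:]
    return train, val, test

# 1,000 hypothetical samples, 200 per class
data = [(f"snippet_{i}", i % 5) for i in range(1000)]
train, val, test = stratified_split(data)
print(len(train), len(val), len(test))  # 800 100 100
```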

**Key hyperparameters:**

| Parameter | Value |
|-----------|-------|
| Epochs | 4 |
| Effective batch size | 16 (8 × 2 grad accum) |
| Learning rate | 2e-5 |
| Optimizer | AdamW + linear warmup |
| Max token length | 512 |
| Class weighting | Yes (balanced) |
| Hardware | NVIDIA RTX 3050 (4 GB) |
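"Balanced" class weighting scales each class's loss contribution by w_c = N / (K · n_c), so rarer classes count proportionally more; a sketch of the computation (the label counts are hypothetical, and in training the weights would be passed to e.g. `torch.nn.CrossEntropyLoss(weight=...)`):

```python
def balanced_weights(counts):
    """w_c = N / (K * n_c): a class half as frequent gets twice the weight."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

# Hypothetical pre-balancing label counts
counts = {"Clean": 2000, "NullRef": 500, "TypeMis": 250, "SecVuln": 250, "Logic": 500}
weights = balanced_weights(counts)
print({label: round(w, 2) for label, w in weights.items()})
# {'Clean': 0.35, 'NullRef': 1.4, 'TypeMis': 2.8, 'SecVuln': 2.8, 'Logic': 1.4}
```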

---

## Evaluation

Test set: 840 samples (stratified).

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|----|---------|
| Clean | 0.92 | 0.88 | 0.90 | 450 |
| Null Reference Risk | 0.63 | 0.78 | 0.70 | 120 |
| Type Mismatch | 0.96 | 0.95 | 0.95 | 75 |
| Security Vulnerability | 0.99 | 0.92 | 0.95 | 75 |
| Logic Flaw | 0.96 | 0.97 | 0.97 | 120 |
| **Macro avg** | **0.89** | **0.90** | **0.89** | |
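The macro row is the unweighted mean of the per-class scores, which is easy to verify; e.g. for F1:

```python
per_class_f1 = [0.90, 0.70, 0.95, 0.95, 0.97]
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(f"Macro F1: {macro_f1:.2f}")  # Macro F1: 0.89
```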

**Confusion matrix:**

```
                   Clean  NullRef  TypeMis  SecVuln  Logic
Actual Clean    [   394       52        1        1      2 ]
Actual NullRef  [    23       93        1        0      3 ]
Actual TypeMis  [     3        1       71        0      0 ]
Actual SecVuln  [     4        1        1       69      0 ]
Actual Logic    [     3        0        0        0    117 ]
```

Logic Flaw and Security Vulnerability are the strongest classes; both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code closely resembles clean code structurally. Most misclassifications there are false positives rather than missed bugs.
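The per-class figures can be read straight off the matrix; e.g. for Null Reference Risk, precision comes from the predicted-NullRef column and recall from the actual-NullRef row:

```python
# Rows: actual class; columns: predicted class (same order as the matrix above).
cm = [
    [394, 52,  1,  1,   2],  # actual Clean
    [ 23, 93,  1,  0,   3],  # actual NullRef
    [  3,  1, 71,  0,   0],  # actual TypeMis
    [  4,  1,  1, 69,   0],  # actual SecVuln
    [  3,  0,  0,  0, 117],  # actual Logic
]

NULLREF = 1
tp = cm[NULLREF][NULLREF]                    # correctly predicted NullRef
predicted = sum(row[NULLREF] for row in cm)  # column sum: all NullRef predictions
actual = sum(cm[NULLREF])                    # row sum: all true NullRef samples

print(f"precision = {tp}/{predicted} = {tp / predicted:.2f}")  # 93/147 = 0.63
print(f"recall    = {tp}/{actual} = {tp / actual:.2f}")        # 93/120 = 0.78
```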

---

## Limitations

- **Python only** – not trained on other languages
- **Function-level input** – works best on 5–50 line snippets
- **Heuristic labels** – training data was pattern-matched, not expert-annotated
- **Not a SAST replacement** – probabilistic classifier, not a sound static analysis tool
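Since the model expects function-level input, whole files should be chunked into per-function snippets before classification; a stdlib sketch (Python 3.9+; the `extract_functions` helper is illustrative, not part of CodeSheriff):

```python
import ast

def extract_functions(source: str) -> list[str]:
    """Return the source text of every function or method defined in a file."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

source = '''
def get_user(uid):
    query = "SELECT * FROM users WHERE id=" + uid
    return db.execute(query)

def ping():
    return "pong"
'''

for snippet in extract_functions(source):
    print(snippet)
    print("---")
```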

---

## Links

- GitHub: [jayansh21/CodeSheriff](https://github.com/jayansh21/CodeSheriff)
- Live demo: [huggingface.co/spaces/jayansh21/CodeSheriff](https://huggingface.co/spaces/jayansh21/CodeSheriff)