---
language:
- code
tags:
- python
- java
- cpp
- ai-detection
- code-analysis
- temporal-cnn
- codet5
metrics:
- f1: 0.9813
---

# ai_code_detect

Binary classifier: human-written vs. AI-generated code. Trained on 500k samples (Python, Java, C++). Macro F1: **0.9813**.

---

## Architecture

Two input streams are fused into a single MLP classifier.

**Stream 1 — Probabilistic**

Code is passed through `Salesforce/codegen-350M-mono`. Per-token surprisal signals are extracted across a 256-token window:

| # | Feature | Description |
|---|---------|-------------|
| 0 | `log_prob` | Log-probability of the actual token |
| 1 | `log_rank` | Log-rank of the actual token within the distribution |
| 2 | `entropy` | Shannon entropy of the token distribution |
| 3 | `varentropy` | Variance of entropy |
| 4 | `top10_mass` | Probability mass in the top-10 tokens |
| 5 | `gap_1_2` | Log-prob gap between rank-1 and rank-2 |
| 6 | `surprisal_z` | Per-token surprisal z-score |
| 7 | `entropy_delta` | Entropy change from the previous position |
| 8 | `cum_rank` | Cumulative mean log-rank |
| 9 | `is_special` | Special-token flag |
| 10 | `r10_flag` | Rank ≤ 10 |
| 11 | `r100_flag` | 10 < rank ≤ 100 |
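
Several of the features above can be computed directly from a causal LM's output logits. This is a minimal sketch, assuming you already have the `(T, V)` logits and the observed token ids; `per_token_features` is an illustrative helper covering only a subset of the 12 features, not part of the released code.

```python
import numpy as np

def per_token_features(logits: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Illustrative Stream-1 extraction: logits (T, V) + observed ids (T,)
    -> (T, 6) array with a subset of the features in the table."""
    # Numerically stable softmax over the vocabulary.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

    T = len(token_ids)
    p_actual = probs[np.arange(T), token_ids]
    log_prob = np.log(p_actual + 1e-12)                    # 0: log_prob

    # Rank of the observed token (1 = most likely) and its log.
    rank = (probs > p_actual[:, None]).sum(axis=-1) + 1
    log_rank = np.log(rank)                                # 1: log_rank

    entropy = -(probs * np.log(probs + 1e-12)).sum(-1)     # 2: entropy

    sorted_p = np.sort(probs, axis=-1)[:, ::-1]
    top10_mass = sorted_p[:, :10].sum(axis=-1)             # 4: top10_mass
    gap_1_2 = (np.log(sorted_p[:, 0] + 1e-12)
               - np.log(sorted_p[:, 1] + 1e-12))           # 5: gap_1_2
    r10_flag = (rank <= 10).astype(float)                  # 10: r10_flag

    return np.stack(
        [log_prob, log_rank, entropy, top10_mass, gap_1_2, r10_flag], axis=-1
    )
```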

These 12 per-token features are aggregated into 32 sequence-level statistics (moments, autocorrelations, burstiness, etc.) that are passed downstream.
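
The aggregation step can be sketched as follows. The card does not list the 32 statistics, so the per-column moments, lag-1 autocorrelation, and burstiness-style index below are illustrative stand-ins:

```python
import numpy as np

def aggregate_sequence_stats(feats: np.ndarray) -> np.ndarray:
    """Collapse (T, F) per-token features into 4*F sequence-level stats."""
    stats = []
    for col in feats.T:
        mu, sd = float(col.mean()), float(col.std())
        stats += [mu, sd]                       # first two moments
        # Lag-1 autocorrelation: do high/low values cluster in runs?
        if sd > 1e-8 and len(col) > 1:
            stats.append(float(np.corrcoef(col[:-1], col[1:])[0, 1]))
        else:
            stats.append(0.0)
        # Burstiness-style index (sigma - mu) / (sigma + mu), guarded.
        stats.append((sd - mu) / (sd + mu) if abs(sd + mu) > 1e-8 else 0.0)
    return np.array(stats)
```

Statistics like these would then feed the sequence branch of the classifier described below.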

**Stream 2 — Semantic**

`Salesforce/codet5-base` mean-pools hidden states into a 768-dim embedding capturing style, structure, naming, and comment density.
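
Masked mean pooling over hidden states can be sketched as below; this assumes standard HuggingFace-style attention masks and is not taken from the released code:

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over real (non-padding) tokens.

    hidden_states:  (B, T, 768), e.g. CodeT5-base encoder outputs
    attention_mask: (B, T), 1 for real tokens, 0 for padding
    Returns one (B, 768) embedding per sequence.
    """
    mask = attention_mask.unsqueeze(-1).float()    # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)     # (B, 768)
    counts = mask.sum(dim=1).clamp(min=1e-9)       # avoid divide-by-zero
    return summed / counts
```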

**Classifier**

Token (256-dim) + sequence (64-dim) + semantic (768-dim) representations are concatenated → 1088-dim → 3-layer MLP with LayerNorm, GELU, dropout → sigmoid.
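
A minimal sketch of the fusion head; the hidden widths (512, 256) and dropout rate are assumptions, since the card only states the layer count and input size:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate token (256), sequence (64) and semantic (768) features
    into a 1088-dim vector and classify with a 3-layer MLP."""
    def __init__(self, in_dim: int = 1088, hidden: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(in_dim),
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden // 2), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(hidden // 2, 1),
        )

    def forward(self, tok, seq, sem):
        x = torch.cat([tok, seq, sem], dim=-1)  # (B, 1088)
        # Raw logit; apply sigmoid (and the decision threshold) at inference.
        return self.mlp(x).squeeze(-1)
```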

---

## Performance

Evaluated on 3,000 balanced validation samples (1,000 per language):

| Metric | Score |
|--------|-------|
| Macro F1 | **0.9813** |
| Accuracy | **98.13%** |
| Decision threshold | 0.475 |

| Language | Accuracy | Mean score (human) | Mean score (AI) | Gap |
|----------|----------|--------------------|-----------------|-----|
| Python | 99.50% | 0.001 | 0.992 | 0.991 |
| Java | 98.00% | 0.043 | 0.968 | 0.926 |
| C++ | 96.90% | 0.063 | 0.966 | 0.903 |

---

## Training

| Setting | Value |
|---------|-------|
| Optimizer | AdamW (encoder lr 8e-6, head lr 3e-5) |
| Scheduler | OneCycleLR with cosine annealing |
| Loss | BCEWithLogitsLoss |
| Regularization | EMA (decay = 0.998), dropout, LayerNorm |
| Precision | fp16 via HuggingFace Accelerate |
| Hardware | 2× GPU |
| Epochs | 4 (500k samples) |
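
The settings above map onto a short PyTorch setup. The modules, step count, and EMA bookkeeping below are toy stand-ins for the real training loop:

```python
import torch
import torch.nn as nn

# Stand-in modules; the real encoder/head are the models described above.
encoder = nn.Linear(32, 16)
head = nn.Linear(16, 1)

# AdamW with per-group learning rates (encoder 8e-6, head 3e-5).
optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 8e-6},
    {"params": head.parameters(), "lr": 3e-5},
])
# OneCycleLR with a cosine annealing phase.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=[8e-6, 3e-5], total_steps=100, anneal_strategy="cos",
)
criterion = nn.BCEWithLogitsLoss()  # sigmoid folded into the loss

# Minimal EMA of the head's weights (decay 0.998), updated every step.
ema = {k: v.detach().clone() for k, v in head.state_dict().items()}

x, y = torch.randn(8, 32), torch.randint(0, 2, (8,)).float()
loss = criterion(head(encoder(x)).squeeze(-1), y)
loss.backward()
optimizer.step(); scheduler.step(); optimizer.zero_grad()
for k, v in head.state_dict().items():
    ema[k].mul_(0.998).add_(v.detach(), alpha=0.002)
```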

---

## How To Use

```python
import os
import sys

from huggingface_hub import hf_hub_download

REPO_ID = "santh-cpu/ai_code_detect"

# Download the bundled inference script and make it importable.
script_path = hf_hub_download(repo_id=REPO_ID, filename="model.py")
sys.path.append(os.path.dirname(script_path))
from model import predict

# Score a source snippet (see model.py for the exact return format).
print(predict("your code here"))
```