---
language:
- code
tags:
- python
- java
- cpp
- ai-detection
- code-analysis
- temporal-cnn
- codet5
metrics:
- f1: 0.9813
---
# ai_code_detect
A binary classifier that distinguishes human-written from AI-generated code. Trained on 500k samples across Python, Java, and C++. Macro F1: **0.9813**.
---
## Architecture
Two input streams are fused into a single MLP classifier.
**Stream 1 — Probabilistic**
Code is passed through `Salesforce/codegen-350M-mono`. Per-token surprisal signals are extracted across a 256-token window:
| # | Feature | Description |
|---|---------|-------------|
| 0 | `log_prob` | Log-probability of the actual token |
| 1 | `log_rank` | Log-rank within the distribution |
| 2 | `entropy` | Shannon entropy of the token distribution |
| 3 | `varentropy` | Variance of entropy |
| 4 | `top10_mass` | Probability mass in top-10 tokens |
| 5 | `gap_1_2` | Log-prob gap between rank-1 and rank-2 |
| 6 | `surprisal_z` | Per-token surprisal z-score |
| 7 | `entropy_delta` | Entropy change from previous position |
| 8 | `cum_rank` | Cumulative mean log-rank |
| 9 | `is_special` | Special token flag |
| 10 | `r10_flag` | Rank ≤ 10 |
| 11 | `r100_flag` | 10 < rank ≤ 100 |
These 12 per-token features are aggregated into 32 sequence-level statistics (moments, autocorrelations, burstiness, etc.) that are passed downstream.
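A minimal NumPy sketch of how a few of these per-token features could be derived from a causal LM's next-token logits. The random logits stand in for real model output, and only features 0, 1, 2, 4, and 5 are shown; the exact extraction code in this repo may differ.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_features(logits, token_ids):
    """Compute a subset of the per-token surprisal features.
    logits: (T, V) next-token logits aligned with token_ids (T,)."""
    probs = softmax(logits)                                     # (T, V)
    T = len(token_ids)
    p_actual = probs[np.arange(T), token_ids]                   # prob of observed token
    log_prob = np.log(p_actual)                                 # feature 0
    ranks = (probs > p_actual[:, None]).sum(axis=1) + 1         # 1 = most likely
    log_rank = np.log(ranks)                                    # feature 1
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)      # feature 2
    sorted_p = np.sort(probs, axis=1)[:, ::-1]
    top10_mass = sorted_p[:, :10].sum(axis=1)                   # feature 4
    gap_1_2 = np.log(sorted_p[:, 0]) - np.log(sorted_p[:, 1])   # feature 5
    return np.stack([log_prob, log_rank, entropy, top10_mass, gap_1_2], axis=1)

rng = np.random.default_rng(0)
feats = token_features(rng.normal(size=(8, 50)), rng.integers(0, 50, size=8))
print(feats.shape)  # (8, 5)
```

With a real model, `logits` would come from running the code snippet through `Salesforce/codegen-350M-mono` over the 256-token window.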
**Stream 2 — Semantic**
`Salesforce/codet5-base` mean-pools hidden states into a 768-dim embedding capturing style, structure, naming, and comment density.
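Mean-pooling here is the standard masked average over non-padding positions. A small NumPy sketch with a toy (T=4, H=3) hidden-state matrix; the real pipeline would apply this to CodeT5's 768-dim hidden states:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average encoder hidden states over non-padding positions.
    hidden_states: (T, H); attention_mask: (T,) with 1 for real tokens."""
    mask = attention_mask[:, None].astype(hidden_states.dtype)  # (T, 1)
    summed = (hidden_states * mask).sum(axis=0)
    count = mask.sum()
    return summed / np.maximum(count, 1.0)                      # (H,)

h = np.arange(12, dtype=float).reshape(4, 3)  # toy hidden states
m = np.array([1, 1, 1, 0])                    # last position is padding
print(mean_pool(h, m))  # mean of the first three rows: [3. 4. 5.]
```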
**Classifier**
Token (256-dim) + sequence (64-dim) + semantic (768-dim) representations are concatenated → 1088-dim → 3-layer MLP with LayerNorm, GELU, dropout → sigmoid.
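A NumPy sketch of the fusion head's forward pass at inference time (dropout is a no-op, so it is omitted). The hidden widths 512 and 128 are assumptions for illustration; only the 1088-dim input and the sigmoid output are stated above.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_head(token_vec, seq_vec, sem_vec, weights):
    """Concatenate the three streams, then run the 3-layer MLP.
    `weights` is a list of (W, b) pairs, one per layer."""
    x = np.concatenate([token_vec, seq_vec, sem_vec])  # 256 + 64 + 768 = 1088
    for W, b in weights[:-1]:
        x = gelu(layer_norm(x @ W + b))                # LayerNorm + GELU per layer
    W, b = weights[-1]
    logit = x @ W + b
    return 1.0 / (1.0 + np.exp(-logit))                # sigmoid -> P(AI-generated)

rng = np.random.default_rng(1)
dims = [1088, 512, 128, 1]  # hidden widths are illustrative assumptions
weights = [(rng.normal(scale=0.02, size=(a, b)), np.zeros(b))
           for a, b in zip(dims, dims[1:])]
p = mlp_head(rng.normal(size=256), rng.normal(size=64),
             rng.normal(size=768), weights)
print(float(p))  # a probability in (0, 1)
```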
---
## Performance
Evaluated on 3,000 balanced validation samples (1,000/language):
| Metric | Score |
|--------|-------|
| Macro F1 | **0.9813** |
| Accuracy | **98.13%** |
| Threshold | 0.475 |
**Per-language breakdown** (mean predicted P(AI) for each true class):

| Language | Accuracy | Human mean P(AI) | AI mean P(AI) | Gap |
|----------|----------|---------|-------|-----|
| Python | 99.50% | 0.001 | 0.992 | 0.991 |
| Java | 98.00% | 0.043 | 0.968 | 0.926 |
| C++ | 96.90% | 0.063 | 0.966 | 0.903 |
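The 0.475 threshold converts the sigmoid output into a hard label, and macro F1 averages the per-class F1 scores. A pure-Python sketch of that evaluation on toy data (labels and probabilities are made up for illustration):

```python
def f1(tp, fp, fn):
    """F1 from raw counts, with zero-division guarded to 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(y_true, probs, threshold=0.475):
    """Macro F1 over the two classes (human = 0, AI = 1)."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    scores = []
    for cls in (0, 1):
        tp = sum(t == cls and y == cls for t, y in zip(y_true, y_pred))
        fp = sum(t != cls and y == cls for t, y in zip(y_true, y_pred))
        fn = sum(t == cls and y != cls for t, y in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / 2

score = macro_f1([0, 0, 1, 1], [0.10, 0.60, 0.90, 0.80])
print(round(score, 4))  # 0.7333 on this toy sample
```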
---
## Training
| Setting | Value |
|---------|-------|
| Optimizer | AdamW (encoder lr 8e-6, head lr 3e-5) |
| Scheduler | OneCycleLR + cosine annealing |
| Loss | BCEWithLogitsLoss |
| Regularization | EMA (decay=0.998), dropout, LayerNorm |
| Precision | fp16 via HuggingFace Accelerate |
| Hardware | 2× GPU |
| Epochs | 4 (500k samples) |
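The discriminative learning rates in the table imply the parameters are split into two optimizer groups. A sketch of that split in plain Python; the parameter names are hypothetical, and with PyTorch the pairs would come from `model.named_parameters()` and the group dicts would be passed directly to `torch.optim.AdamW`:

```python
# Learning rates from the table above.
ENCODER_LR, HEAD_LR = 8e-6, 3e-5

# Stand-ins for model.named_parameters(); the names are illustrative only.
named = [
    ("codet5.encoder.block.0.layer.0.weight", "w0"),
    ("codet5.encoder.block.0.layer.0.bias", "b0"),
    ("head.mlp.0.weight", "w1"),
    ("head.mlp.0.bias", "b1"),
]

# One group per learning rate, selected by parameter-name prefix.
param_groups = [
    {"params": [p for n, p in named if n.startswith("codet5.")], "lr": ENCODER_LR},
    {"params": [p for n, p in named if n.startswith("head.")], "lr": HEAD_LR},
]
print([(len(g["params"]), g["lr"]) for g in param_groups])  # [(2, 8e-06), (2, 3e-05)]
```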
---
## How To Use
```python
import os
import sys

from huggingface_hub import hf_hub_download

REPO_ID = "santh-cpu/ai_code_detect"

# Download the inference script from the Hub and make it importable.
script_path = hf_hub_download(repo_id=REPO_ID, filename="model.py")
sys.path.append(os.path.dirname(script_path))

from model import predict

# Returns the probability that the snippet is AI-generated.
print(predict("your code here"))
```