# Reason-Code-ModernColBERT
The first reasoning-enhanced ColBERT model for code search and retrieval.
Extends the ReasonIR methodology to the code domain — generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching. Built on research from LightOn AI (ColBERT for code) and Facebook Research (reasoning-enhanced retrieval).
## Why Reasoning-Enhanced Training for Code?
Standard code search training uses docstring→code pairs. Our approach generates reasoning-intensive queries that require understanding the code's algorithm, behavior, and edge cases — not just surface-level keyword matching. This is the same methodology that enabled Reason-ModernColBERT to outperform 7B dense models on reasoning tasks at only 150M parameters.
## Model Details
| Property | Value |
|---|---|
| Base model | lightonai/GTE-ModernColBERT-v1 |
| Architecture | ColBERT (late-interaction, multi-vector) |
| Parameters | 150M |
| Embedding dim | 128 per token |
| Document length | 512 tokens |
| Query length | 128 tokens |
| Similarity | MaxSim |
| Languages | Python, Java, JavaScript, PHP, Go, Ruby |
| License | Apache 2.0 |
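The MaxSim similarity in the table is the standard ColBERT late-interaction operator: each query token is matched against its best-scoring document token, and the per-token maxima are summed. A minimal numpy sketch of the operator (illustrative only, not the model's internal implementation):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token embedding, take the
    similarity to its best-matching document token, then sum over query
    tokens. Shapes: query_emb (q_tokens, 128), doc_emb (d_tokens, 128)."""
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) token-pair similarities
    return float(sim.max(axis=1).sum())  # best document token per query token, summed

# Toy example with unit-normalized 128-dim token embeddings
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(16, 128))
d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim(q, d)  # bounded above by the number of query tokens (here 4)
```

Because every query token contributes its own best match, MaxSim rewards documents that cover all aspects of the query rather than matching a single dominant keyword.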
## Training

### Two-Stage Training Pipeline

**Stage 1: CoRNStack Base (1 epoch)**
- 100,000 high-quality code search pairs from CoRNStack (Apache 2.0)
- 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
- Loss: 2.42 → 0.63
**Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)**
- 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
- Queries require understanding algorithms, edge cases, design patterns, and complexity
- Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
- Loss: 2.36 → 0.54
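To make the data format concrete, here is a hypothetical query in the ReasonIR style — a chain-of-thought reasoning prefix followed by the retrieval request. This is an illustration of the structure only, not an actual example from the training set:

```python
# Hypothetical reasoning-intensive query (illustrative; not from the real
# training data). The chain-of-thought prefix pushes the retriever to match
# on algorithmic behavior rather than surface keywords.
query = (
    # chain-of-thought reasoning prefix
    "I need worst-case O(n log n) behavior, so quicksort's O(n^2) degenerate "
    "case rules it out; a merge- or heap-based comparison sort fits. "
    # the actual retrieval request
    "Find a function that sorts an array in descending order with guaranteed "
    "O(n log n) worst-case complexity."
)
```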
### Training Configuration
```python
# Both stages
model = ColBERT(document_length=512, query_length=128)
loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
batch_size = 256
optim = "adamw_torch"
bf16 = True

# Stage 1: lr=1e-5, 1 epoch, warmup=5%
# Stage 2: lr=5e-6, 3 epochs, warmup=5%
```
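The `CachedContrastive` loss is an in-batch contrastive (InfoNCE) objective; the caching only reduces activation memory and does not change the loss value. A minimal numpy sketch of the objective itself, assuming `scores[i, j]` is the MaxSim score of query `i` against document `j`, with each query's positive on the diagonal:

```python
import numpy as np

def in_batch_contrastive_loss(scores: np.ndarray, temperature: float = 1.0) -> float:
    """InfoNCE over a batch: each query's positive document is scores[i, i];
    every other document in the batch acts as an in-batch negative."""
    logits = scores / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())             # -mean log p(positive | query)

# A batch where every query strongly prefers its own document -> loss near 0
confident = 20.0 * np.eye(4)
# A batch with no signal at all -> loss = log(batch_size)
uniform = np.zeros((4, 4))
```

With a batch size of 256, each query sees 255 in-batch negatives, which is what makes the large batch worthwhile for contrastive training.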
### Hardware
Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory).
- Stage 1: ~130 min (391 steps)
- Stage 2: ~37 min (117 steps)
## Benchmark Results

### CodeSearchNet MRR (500 queries per language, 500 candidates)
| Language | GTE-ModernColBERT (base) | Reason-Code-ModernColBERT (ours) | Δ |
|---|---|---|---|
| Python | 0.991 | 0.989 | -0.002 |
| Java | 0.829 | 0.866 | +0.037 |
| JavaScript | 0.802 | 0.839 | +0.037 |
| PHP | 0.841 | 0.862 | +0.021 |
| Go | 0.879 | 0.887 | +0.008 |
| Ruby | 0.773 | 0.831 | +0.058 |
| Average | 0.853 | 0.879 | +0.026 |
The model improves on the base in 5 of 6 languages, with the largest gains in Ruby (+5.8 pp), Java (+3.7 pp), and JavaScript (+3.7 pp) — the languages that benefited most from the reasoning-enhanced training data. Python is already near-ceiling at 0.99 for both models, leaving little headroom.
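For reference, MRR averages the reciprocal rank of the first relevant result across queries. A minimal sketch of the metric (the evaluation above ranks 500 candidates for each of 500 queries per language):

```python
def mean_reciprocal_rank(rankings, gold_ids):
    """rankings[i] is query i's candidate ids sorted by descending model
    score; gold_ids[i] is that query's relevant document id."""
    total = 0.0
    for ranked, gold in zip(rankings, gold_ids):
        rank = ranked.index(gold) + 1  # 1-based rank of the gold document
        total += 1.0 / rank
    return total / len(rankings)

# Gold documents retrieved at ranks 1 and 2 -> MRR = (1/1 + 1/2) / 2 = 0.75
mrr = mean_reciprocal_rank([["a", "b", "c"], ["x", "y", "z"]], ["a", "y"])
```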
## Usage
```python
from pylate import models

model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")

queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
code_docs = ["def sort_desc(arr):\n    return sorted(arr, reverse=True)"]

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(code_docs, is_query=False)
```
## Citation
This model extends the methodology from:
```bibtex
@article{shao2025reasonir,
  title={ReasonIR: Training Retrievers for Reasoning Tasks},
  author={Shao, Rulin and Jiang, Rui and Yu, Tao and Hashimoto, Tatsunori},
  journal={arXiv preprint arXiv:2504.20595},
  year={2025}
}

@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={LightOn AI},
  year={2025},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT}
}

@inproceedings{cornstack2025,
  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
  author={Gangisetty, Zach and others},
  booktitle={ICLR},
  year={2025}
}
```
Built with PyLate and Sentence Transformers.