---
license: apache-2.0
language:
- en
- code
library_name: PyLate
tags:
- ColBERT
- PyLate
- sentence-transformers
- code-search
- code-retrieval
- late-interaction
- reasoning
base_model: lightonai/GTE-ModernColBERT-v1
datasets:
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-ruby-v1
pipeline_tag: sentence-similarity
---

# Reason-Code-ModernColBERT

The **first reasoning-enhanced ColBERT model for code search and retrieval**.

Extends the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain — generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching. Built on research from [LightOn AI](https://huggingface.co/lightonai) (ColBERT for code) and [Facebook Research](https://github.com/facebookresearch/ReasonIR) (reasoning-enhanced retrieval).

## Why Reasoning-Enhanced Training for Code?

Standard code search training uses docstring→code pairs. Our approach generates **reasoning-intensive queries** that require understanding the code's algorithm, behavior, and edge cases — not just surface-level keyword matching. This is the same methodology that enabled [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) to outperform 7B dense models on reasoning tasks at only 150M parameters.

## Model Details

| Property | Value |
|---|---|
| **Base model** | [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) |
| **Architecture** | ColBERT (late-interaction, multi-vector) |
| **Parameters** | 150M |
| **Embedding dim** | 128 per token |
| **Document length** | 512 tokens |
| **Query length** | 128 tokens |
| **Similarity** | MaxSim |
| **Languages** | Python, Java, JavaScript, PHP, Go, Ruby |
| **License** | Apache 2.0 |
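
The MaxSim similarity named in the table works as follows: each query token is matched against its best-scoring document token, and those per-token maxima are summed. A minimal NumPy sketch, with random vectors standing in for the model's per-token embeddings:

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim: for each query token, take the best-matching document token,
    then sum those maxima. Shapes: (q_tokens, dim) and (d_tokens, dim)."""
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) token-level similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))    # e.g. 8 query tokens, 128-dim (as in this model)
d = rng.normal(size=(50, 128))   # e.g. 50 document tokens
score = maxsim(q, d)
```

Because every query token gets its own best match, MaxSim rewards documents that cover all aspects of the query rather than ones that match a single dominant keyword.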

## Training

### Two-Stage Training Pipeline

**Stage 1: CoRNStack Base (1 epoch)**

- 100,000 high-quality code search pairs from [CoRNStack](https://huggingface.co/collections/nomic-ai/cornstack-67c60fda17322ce742fe9dac) (Apache 2.0)
- 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
- Loss: 2.42 → 0.63

**Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)**

- 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
- Queries require understanding algorithms, edge cases, design patterns, and complexity
- Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
- Loss: 2.36 → 0.54
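
To make the Stage 1 vs. Stage 2 distinction concrete, here is a hypothetical illustration of the two query styles for the same snippet (an invented example, not an actual sample from the training set):

```python
# Hypothetical illustration of the two query styles — not a real training sample.
example = {
    "code": "def dedupe(xs):\n    seen = set()\n    return [x for x in xs if not (x in seen or seen.add(x))]",
    # Stage 1 (docstring-style): keyword matching is enough
    "surface_query": "remove duplicates from a list",
    # Stage 2 (reasoning-intensive): requires understanding behavior and constraints
    "reasoning_query": (
        "I need a function that removes repeated elements while preserving the "
        "order of first occurrence, works for any hashable items, and runs in O(n)."
    ),
}
```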

### Training Configuration

```python
# Both stages
model = ColBERT(document_length=512, query_length=128)
loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
batch_size = 256
optim = "adamw_torch"
bf16 = True

# Stage 1: lr=1e-5, 1 epoch, warmup=5%
# Stage 2: lr=5e-6, 3 epochs, warmup=5%
```

### Hardware

Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128GB unified memory).

- Stage 1: ~130 min (391 steps)
- Stage 2: ~37 min (117 steps)
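
The step counts follow directly from the dataset sizes and the effective batch size of 256, with steps per epoch rounded up:

```python
import math

batch_size = 256

# Stage 1: 100,000 pairs, 1 epoch
stage1_steps = math.ceil(100_000 / batch_size)    # 100000 / 256 = 390.625 -> 391

# Stage 2: 9,959 pairs, 3 epochs
stage2_steps = 3 * math.ceil(9_959 / batch_size)  # 3 * 39 = 117
```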

## Benchmark Results

### CodeSearchNet MRR (500 queries per language, 500 candidates)

| Language | GTE-ModernColBERT (base) | **Reason-Code-ModernColBERT (ours)** | Δ |
|------------|:---:|:---:|:---:|
| Python | 0.991 | 0.989 | -0.002 |
| Java | 0.829 | **0.866** | +0.037 |
| JavaScript | 0.802 | **0.839** | +0.037 |
| PHP | 0.841 | **0.862** | +0.021 |
| Go | 0.879 | **0.887** | +0.008 |
| Ruby | 0.773 | **0.831** | +0.058 |
| **Average** | 0.853 | **0.879** | **+0.026** |

Improves on the base model in 5 of 6 languages, with the largest gains in Ruby (+5.8pp), Java (+3.7pp), and JavaScript (+3.7pp). Python is effectively at ceiling (0.99), so the -0.002 change there is negligible.
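
MRR here is the mean, over queries, of the reciprocal rank at which the correct snippet appears among the 500 candidates. A minimal sketch of the metric:

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct document for each query
    (rank 1 = the right snippet was retrieved first)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three queries: correct snippet ranked 1st, 2nd, and 4th
score = mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```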

## Usage

```python
from pylate import models

model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")

queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
code_docs = ["def sort_desc(arr):\n    return sorted(arr, reverse=True)"]

# Queries and documents are encoded separately; is_query controls query/document handling
query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(code_docs, is_query=False)
```

## Citation

This model extends the methodology from:

```bibtex
@article{shao2025reasonir,
  title={ReasonIR: Training Retrievers for Reasoning Tasks},
  author={Shao, Rulin and others},
  journal={arXiv preprint arXiv:2504.20595},
  year={2025}
}

@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={LightOn AI},
  year={2025},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT}
}

@inproceedings{cornstack2025,
  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
  author={Suresh, Tarun and Gangi Reddy, Revanth and others},
  booktitle={ICLR},
  year={2025}
}
```

Built with [PyLate](https://github.com/lightonai/pylate) and [Sentence Transformers](https://www.sbert.net/).