---
license: apache-2.0
language:
- en
- code
library_name: PyLate
tags:
- ColBERT
- PyLate
- sentence-transformers
- code-search
- code-retrieval
- late-interaction
- reasoning
base_model: lightonai/GTE-ModernColBERT-v1
datasets:
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-ruby-v1
pipeline_tag: sentence-similarity
---

# Reason-Code-ModernColBERT

The **first reasoning-enhanced ColBERT model for code search and retrieval**. It extends the [ReasonIR methodology](https://arxiv.org/abs/2504.20595) to the code domain by generating reasoning-intensive queries that require understanding algorithms, edge cases, and design patterns, not just keyword matching.

Built on research from [LightOn AI](https://huggingface.co/lightonai) (ColBERT for code) and [Facebook Research](https://github.com/facebookresearch/ReasonIR) (reasoning-enhanced retrieval).

## Why Reasoning-Enhanced Training for Code?

Standard code-search training uses docstring→code pairs. Our approach instead generates **reasoning-intensive queries** that require understanding the code's algorithm, behavior, and edge cases rather than surface-level keyword matching. This is the same methodology that enabled [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT) to outperform 7B dense models on reasoning tasks at only 150M parameters.
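ColBERT models score a query against a document with late interaction (MaxSim): each query token embedding is matched to its most similar document token embedding, and those maxima are summed. A minimal NumPy sketch of the idea — the `maxsim` helper below is illustrative only, not part of PyLate's public API:

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query token embedding,
    take its maximum similarity over all document token embeddings,
    then sum over the query tokens."""
    # Normalize token embeddings so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T  # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy shapes: 3 query tokens and 5 document tokens, 128 dims per token
rng = np.random.default_rng(0)
score = maxsim(rng.normal(size=(3, 128)), rng.normal(size=(5, 128)))
```

Because each per-token maximum is a cosine similarity bounded by 1, the score is bounded by the number of query tokens; documents are ranked by this sum.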
## Model Details

| Property | Value |
|---|---|
| **Base model** | [lightonai/GTE-ModernColBERT-v1](https://huggingface.co/lightonai/GTE-ModernColBERT-v1) |
| **Architecture** | ColBERT (late-interaction, multi-vector) |
| **Parameters** | 150M |
| **Embedding dim** | 128 per token |
| **Document length** | 512 tokens |
| **Query length** | 128 tokens |
| **Similarity** | MaxSim |
| **Languages** | Python, Java, JavaScript, PHP, Go, Ruby |
| **License** | Apache 2.0 |

## Training

### Two-Stage Training Pipeline

**Stage 1: CoRNStack Base (1 epoch)**
- 100,000 high-quality code search pairs from [CoRNStack](https://huggingface.co/collections/nomic-ai/cornstack-67c60fda17322ce742fe9dac) (Apache 2.0)
- 6 languages: Python (25K), Java (20K), JavaScript (15K), PHP (15K), Go (15K), Ruby (10K)
- Loss: 2.42 → 0.63

**Stage 2: Reasoning-Enhanced Fine-Tuning (3 epochs)**
- 9,959 reasoning-intensive code search queries generated from CoRNStack code samples
- Queries require understanding algorithms, edge cases, design patterns, and complexity
- Each query includes a chain-of-thought reasoning prefix (ReasonIR methodology)
- Loss: 2.36 → 0.54

### Training Configuration

```python
# Both stages
model = ColBERT(document_length=512, query_length=128)
loss = CachedContrastive(temperature=1.0, mini_batch_size=32)
batch_size = 256
optim = "adamw_torch"
bf16 = True

# Stage 1: lr=1e-5, 1 epoch, warmup=5%
# Stage 2: lr=5e-6, 3 epochs, warmup=5%
```

### Hardware

Trained on a single NVIDIA DGX Spark (GB10 Blackwell, 128 GB unified memory):
- Stage 1: ~130 min (391 steps)
- Stage 2: ~37 min (117 steps)

## Benchmark Results

### CodeSearchNet MRR (500 queries per language, 500 candidates)

| Language | GTE-ModernColBERT (base) | **Reason-Code-ModernColBERT (ours)** | Δ |
|------------|:---:|:---:|:---:|
| Python | 0.991 | 0.989 | −0.002 |
| Java | 0.829 | **0.866** | +0.037 |
| JavaScript | 0.802 | **0.839** | +0.037 |
| PHP | 0.841 | **0.862** | +0.021 |
| Go | 0.879 | **0.887** | +0.008 |
| Ruby | 0.773 | **0.831** | +0.058 |
| **Average** | 0.853 | **0.879** | **+0.026** |

The model improves on the base in 5 of 6 languages, with the largest gains in Ruby (+5.8 pp), Java (+3.7 pp), and JavaScript (+3.7 pp) — the languages that benefited most from the reasoning-enhanced training data. Python is already near ceiling at 0.99.

## Usage

```python
from pylate import models

model = models.ColBERT(model_name_or_path="ctrltokyo/Reason-Code-ModernColBERT")

queries = ["function that sorts an array in descending order using a comparison-based algorithm"]
code_docs = ["def sort_desc(arr):\n    return sorted(arr, reverse=True)"]

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(code_docs, is_query=False)
```

## Citation

This model extends the methodology from:

```bibtex
@article{shao2025reasonir,
  title={ReasonIR: Training Retrievers for Reasoning Tasks},
  author={Shao, Rulin and Jiang, Rui and Yu, Tao and Hashimoto, Tatsunori},
  journal={arXiv preprint arXiv:2504.20595},
  year={2025}
}

@misc{Reason-ModernColBERT,
  title={Reason-ModernColBERT},
  author={LightOn AI},
  year={2025},
  url={https://huggingface.co/lightonai/Reason-ModernColBERT}
}

@inproceedings{cornstack2025,
  title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
  author={Gangisetty, Zach and others},
  booktitle={ICLR},
  year={2025}
}
```

Built with [PyLate](https://github.com/lightonai/pylate) and [Sentence Transformers](https://www.sbert.net/).
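For reference, the MRR figures in the benchmark section are mean reciprocal ranks: for each query, the reciprocal of the 1-based position at which the gold code snippet appears among the ranked candidates, averaged over queries. A minimal sketch of the metric (the `mrr` helper is illustrative, not taken from the evaluation harness):

```python
def mrr(gold_ranks: list[int]) -> float:
    """Mean reciprocal rank: `gold_ranks` holds the 1-based position of the
    gold code snippet in each query's ranked candidate list."""
    return sum(1.0 / r for r in gold_ranks) / len(gold_ranks)

# If the gold snippet ranks 1st, 2nd, and 4th for three queries:
# MRR = (1 + 1/2 + 1/4) / 3 ≈ 0.583
print(round(mrr([1, 2, 4]), 3))  # 0.583
```

An MRR near 0.99 (as for Python above) therefore means the gold snippet is almost always ranked first out of the 500 candidates.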