Python Code Tokenizer (CodeSearchNet)

A domain-specific tokenizer trained from the GPT-2 tokenizer, fine-tuned on Python source code to better handle code syntax, identifiers, and structure compared to a general-purpose English tokenizer.

Details

  • Base tokenizer: GPT-2 (gpt2)
  • Training corpus: CodeSearchNet, Python split
  • Vocabulary size: 52,000
  • Training method: train_new_from_iterator (Hugging Face tokenizers library)

Motivation

Standard NLP tokenizers like GPT-2's are trained on natural language and tend to over-fragment code — splitting common patterns like indentation, snake_case identifiers, and Python keywords into many small subword tokens. Training on a code-specific corpus lets the tokenizer learn more efficient, code-aware merges, reducing the number of tokens needed to represent typical Python source.

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AlexStamp/code-search-net-tokenizer")
tokens = tokenizer.tokenize("def hello_world():\n    print('Hello!')")

Notes

This tokenizer was trained as part of working through the Hugging Face NLP course (Chapter 6), as a portfolio exercise in tokenizer training and domain adaptation.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support