How does this model count the token size?

#10

by WeiZhenKun - opened Aug 18, 2023

How does this model count the token size?
Is there a fixed proportional relationship between the token count and the number of characters?

intfloat

Owner Feb 17, 2025

This model uses the BERT tokenizer. As an approximate rule of thumb, there are roughly 0.75 words per token in English text. For a precise count, please load the tokenizer and run it on your data of interest.
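For anyone who wants to measure this on their own data, here is a minimal sketch using the Hugging Face `transformers` library. The model id `intfloat/e5-base-v2` is only an assumption for illustration (this thread does not name the repository), and the helper names `load_tokenizer`, `count_tokens`, `ratios`, and `demo` are hypothetical. Counts include the special tokens the tokenizer adds by default, such as `[CLS]` and `[SEP]`.

```python
from typing import Callable, Dict, Iterable, List


def load_tokenizer(model_id: str = "intfloat/e5-base-v2"):
    """Load a tokenizer from the Hub.

    The default model id is an assumption; replace it with the
    repository you are actually using.
    """
    # Imported here so the counting helpers below work without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)


def count_tokens(tokenizer: Callable, texts: Iterable[str]) -> List[int]:
    """Return the number of token ids produced for each text."""
    counts = []
    for text in texts:
        # The tokenizer returns a dict-like object; "input_ids" holds
        # the token ids, including any special tokens it adds.
        encoded = tokenizer(text)
        counts.append(len(encoded["input_ids"]))
    return counts


def ratios(tokenizer: Callable, text: str) -> Dict[str, float]:
    """Compute words-per-token and characters-per-token for one text."""
    n_tokens = count_tokens(tokenizer, [text])[0]
    if n_tokens == 0:
        return {"words_per_token": 0.0, "chars_per_token": 0.0}
    return {
        # Words are approximated by splitting on whitespace.
        "words_per_token": len(text.split()) / n_tokens,
        "chars_per_token": len(text) / n_tokens,
    }


def demo() -> None:
    """Print token counts and ratios for a few sample sentences."""
    tokenizer = load_tokenizer()
    samples = [
        "How does this model count the token size?",
        "Tokenization splits rare words into several subword pieces.",
    ]
    for sample, n in zip(samples, count_tokens(tokenizer, samples)):
        print(n, ratios(tokenizer, sample), sample)
```

Calling `demo()` downloads the tokenizer and prints the counts. The ratios will differ from text to text, since rare or long words are split into more subword pieces, so it is best to treat any single ratio as an estimate for that kind of text only.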

intfloat changed discussion status to closed Feb 17, 2025
