So far, we’ve explored what a tokenizer is and even built our own from scratch. However, one of the key limitations of building a custom tokenizer is handling unknown or rare words. This is where advanced tokenizers like OpenAI’s tiktoken, which uses Byte Pair Encoding (BPE), really shine.
We also saw that language models don’t read or understand text the way humans do. Before any text can be processed by a model, it has to be tokenized, that is, broken into smaller chunks called tokens. One of the most efficient and widely adopted ways to do this is Byte Pair Encoding (BPE).
Let’s dive into how it works, why it matters, and how to use it in practice.
Byte Pair Encoding is a data compression algorithm adapted for tokenization. Instead of treating words as whole units, it breaks them down into smaller, more frequent subword units. This allows it to:
- Handle unknown words gracefully
- Strike a balance between character-level and word-level tokenization
- Reduce the overall vocabulary size
Let’s walk through a simplified example.
We start by breaking every word in our corpus into characters:
"low", "decrease", "latest", "widest"
→ ["l", "o", "w"], ["l", "o", "w", "e", "r"], ...
Next, we count the frequency of adjacent character pairs (bigrams). For example:
"l o": 2, "o w": 2, "w e": 2, "e s": 2, ...
Then we merge the most frequent pair into a new token:
Merge "e s" → "es"
Now “newest” becomes: ["n", "e", "w", "es", "t"]
We continue this process until we reach the desired vocabulary size or no more merges are possible.
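To make this loop concrete, here is a minimal training sketch in plain Python. It is an illustrative toy under my own assumptions, not the exact algorithm production tokenizers use: the corpus, the fixed number of merges, and the helper names get_pair_counts and merge_pair are mine, and ties between equally frequent pairs are broken arbitrarily, so the merge order may differ from the hand-worked example above.

from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the whole corpus."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters mapped to its frequency
words = {tuple("low"): 1, tuple("lower"): 1, tuple("newest"): 1, tuple("widest"): 1}

merges = []
for _ in range(6):  # stop after a fixed number of merges (or earlier if nothing is left to merge)
    counts = get_pair_counts(words)
    if not counts:
        break
    best = counts.most_common(1)[0][0]  # the most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print("Learned merges:", merges)
print("Segmented corpus:", list(words.keys()))

Real tokenizers add details such as end-of-word markers, byte-level fallback, and much larger frequency-weighted corpora, but the core loop is exactly this: count pairs, merge the most frequent one, repeat.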
So why is BPE so widely used?
- Efficient: It reuses common subwords to reduce redundancy.
- Flexible: It handles rare and compound words better than word-level tokenizers.
- Compact vocabulary: Essential for performance in large models.
It solves a key problem: how to tokenize unknown or rare words without bloating the vocabulary.
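Here is a quick sketch of how that works in practice. The merge list below is hand-picked from the toy example above for illustration; applying the learned merges, in order, segments a word that never appeared in the training corpus into subwords that are already in the vocabulary.

# Merges learned from a toy corpus (order matters); hand-picked here for illustration
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

def segment(word, merges):
    """Apply learned merges, in order, to split a word into known subword units."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# "lowest" was never seen during training, yet it maps onto learned subwords
print(segment("lowest", merges))  # ['low', 'est']

No unknown-token placeholder is needed; in the worst case a word simply falls back to individual characters (or raw bytes, in byte-level BPE).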
BPE (or a close variant of it) powers the tokenizers behind many well-known models:
- OpenAI’s GPT family (e.g., GPT-2, GPT-3, GPT-4)
- Hugging Face’s RoBERTa
- EleutherAI’s GPT-NeoX
- Most transformer models that predate newer approaches such as Unigram or SentencePiece
Now let’s see how to use the tiktoken library from OpenAI, which implements BPE for GPT models. First, install it:
pip install tiktoken
import tiktoken

# Load the GPT-4 tokenizer (you can also try "gpt2", "cl100k_base", etc.)
encoding = tiktoken.get_encoding("cl100k_base")

# Input text
text = "IdeaWeaver is building a tokenizer using BPE"

# Tokenize
token_ids = encoding.encode(text)
print("Token IDs:", token_ids)

# Decode back to text
decoded_text = encoding.decode(token_ids)
print("Decoded Text:", decoded_text)

# Optional: show the individual tokens
tokens = [encoding.decode([token_id]) for token_id in token_ids]
print("Tokens:", tokens)
Sample output:
Token IDs: [10123, 91234, ...]
Decoded Text: IdeaWeaver is building a tokenizer using BPE
Tokens: ['Idea', 'Weaver', ' is', ' building', ' a', ' tokenizer', ' using', ' BPE']
You can see that even compound or rare words are split into manageable subword units, which is the strength of BPE.
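If you want to see this in action, try encoding a made-up word (the word below is just an arbitrary example I chose):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

# A word the tokenizer has almost certainly never seen as a whole
made_up_word = "IdeaWeaverish"
pieces = [encoding.decode([token_id]) for token_id in encoding.encode(made_up_word)]
print(pieces)  # printed as several smaller subword pieces, never an unknown token

Because cl100k_base works at the byte level, any input string can be encoded; there is no unknown token at all.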
Byte Pair Encoding may sound simple, but it is one of the key innovations that made today’s large language models possible. It strikes a balance between efficiency, flexibility, and robustness in handling diverse language input.
Next time you ask GPT a question, remember: BPE made sure your words were understood!
If you’re looking for a single tool to meet all of your Generative AI needs, check out IdeaWeaver.