    Why Your Next LLM Might Not Have A Tokenizer

    By Team_AIBS News · June 25, 2025 · 15 min read


    In my last article, we dove into Google's Titans, a model that pushes the boundaries of long-context recall by introducing a dynamic memory module that adapts on the fly, a bit like how our own memory works.

    It's an odd paradox. We have AI that can analyze a 10-million-word document, yet it still fumbles questions like: "How many 'r's are in the word strawberry?"

    The problem isn't the AI's brain; it's its eyes. The first step in how these models read, tokenization, essentially pre-processes language for them. In doing so, it strips away the rich, messy details of how letters form words; the whole world of sub-word information simply vanishes.


    1. Lost in Tokenization: Where Subword Semantics Die

    Language, for humans, begins as sound, spoken long before it is written. Yet it is through writing and spelling that we begin to grasp the compositional structure of language. Letters form syllables, syllables form words, and from there we build conversations. This character-level understanding allows us to correct, interpret, and infer even when the text is noisy or ambiguous. In contrast, language models skip this phase entirely. They are never exposed to characters or raw text as-is; instead, their entire perception of language is mediated by a tokenizer.

    This tokenizer, paradoxically, is the only component in the entire pipeline that is not learned. It is dumb, fixed, and entirely based on heuristics, despite sitting at the entry point of a model designed to be deeply adaptive. In effect, tokenization sets the stage for learning, but without any learning of its own.

    Moreover, tokenization is extremely brittle. A minor typo, say "strawverry" instead of "strawberry", can yield a completely different token sequence, even though the semantic intent remains obvious to any human reader. This sensitivity, instead of being handled right then and there, is passed downstream, forcing the model to interpret a corrupted input. Worse still, optimal tokenizations are highly domain-dependent. A tokenizer trained on everyday English text may perform beautifully on natural language, yet fail miserably on source code, producing long and semantically awkward token chains for variable names like user_id_to_name_map.
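
    To see this brittleness first-hand, the snippet below runs an off-the-shelf BPE tokenizer (the open-source tiktoken library, not the tokenizer of any specific model discussed here) on a correctly spelled word, a typo, and a code identifier. The exact splits depend on the vocabulary, but the typo and the identifier reliably fragment into more, less meaningful pieces.

    ```python
    # Hedged illustration: requires `pip install tiktoken`; splits vary by vocabulary.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a widely used BPE vocabulary

    for text in ["strawberry", "strawverry", "user_id_to_name_map"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{text!r:>24} -> {len(ids)} tokens: {pieces}")

    # The typo and the long identifier break into different, longer token
    # sequences, which is exactly the brittleness described above.
    ```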

    The language pipeline is like a spinal cord: the higher up it is compromised, the more everything downstream is crippled. Sitting right at the top, a flawed tokenizer distorts the input before the model even begins reasoning. No matter how good the architecture is, it is working with corrupted signals from the start.

    (Source: Author)
    How a simple typo can needlessly consume an LLM's "thinking power" just to recover from it

    2. Behold! Byte Latent Transformer

    If tokenization is the brittle foundation holding modern LLMs back, the natural question follows: why not eliminate it entirely? That is precisely the radical direction taken by researchers at Meta AI with the Byte Latent Transformer (BLT) (Pagnoni et al. 2024) [1]. Rather than operating on words, subwords, or even characters, BLT models language from raw bytes, the most fundamental representation of digital text. This lets LLMs learn the language from the very ground up, without a tokenizer eating away at the subword semantics.

    But modeling bytes directly is far from trivial. A naïve byte-level Transformer would choke on input lengths several times longer than tokenized text: a million words becomes nearly 5 million bytes (1 word is roughly 4.7 characters on average, and 1 character is roughly 1 byte), making attention computation infeasible due to its quadratic scaling. BLT circumvents this by introducing a dynamic two-tiered system: easy-to-predict byte segments are compressed into latent "patches," significantly shortening the sequence length. The full, high-capacity model is then selectively applied, focusing its computational resources only where linguistic complexity demands it.

    (Source: Adapted from Pagnoni et al. 2024, Figure 2)
    Zoomed-out view of the entire Byte Latent Transformer architecture

    2.1 How does it work?

    The model can be conceptually divided into three major components, each with a distinct responsibility:

    2.1.1 The Local Encoder

    The primary function of the Local Encoder is to transform a long input sequence of N_bytes raw bytes, b = (b_1, b_2, …, b_N_bytes), into a much shorter sequence of N_patches latent patch representations, p = (p_1, p_2, …, p_N_patches).

    Step 1: Input Segmentation and Initial Byte Embedding

    The input sequence is segmented into patches based on a pre-defined strategy, such as entropy-based patching. This provides patch boundary information but does not alter the input sequence itself. This boundary information will come in handy later.

    (Source: Pagnoni et al. 2024, Figure 3)
    Different strategies for patching, visualized
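
    To make entropy-based patching concrete, here is a minimal sketch. It assumes you already have some small byte-level model exposing a next_byte_probs(prefix) function that returns a 256-way probability distribution; that function name and the threshold value are illustrative, not taken from the paper.

    ```python
    import math
    from typing import Callable, List, Sequence

    def entropy(probs: Sequence[float]) -> float:
        """Shannon entropy (in bits) of a next-byte distribution."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def entropy_patch_boundaries(
        data: bytes,
        next_byte_probs: Callable[[bytes], Sequence[float]],  # assumed small byte-level LM
        threshold: float = 2.0,                                # illustrative threshold
    ) -> List[int]:
        """Return the start index of each patch.

        A new patch starts whenever the next byte is hard to predict (high entropy),
        so predictable stretches are absorbed into long patches while surprising
        regions get short ones.
        """
        boundaries = [0]
        for i in range(1, len(data)):
            dist = next_byte_probs(data[:i])   # P(next byte | prefix so far)
            if entropy(dist) > threshold:
                boundaries.append(i)
        return boundaries
    ```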

    The first operation within the encoder is to map each discrete byte value (0-255) into a continuous vector representation. This is achieved via a learnable embedding matrix, E_byte (shape: [256, h_e]), where h_e is the hidden dimension of the local module.
    Input: A tensor of byte IDs of shape [B, N_bytes], where B is the batch size.
    Output: A tensor of byte embeddings, X (shape: [B, N_bytes, h_e]).

    Step 2: Contextual Augmentation via N-gram Hashing

    To enrich each byte representation with local context beyond its individual identity, the researchers employ a hash-based n-gram embedding technique. For each byte b_i at position i, a set of preceding n-grams, g_{i,n} = {b_{i-n+1}, …, b_i}, is constructed for multiple values of n ∈ {3, …, 8}.

    These n-grams are mapped via a hash function to indices in a second, separate embedding table, E_hash (shape: [V_hash, h_e]), where V_hash is a fixed, large vocabulary size (i.e., the number of hash buckets).

    The resulting n-gram embeddings are summed with the original byte embedding to produce an augmented representation, e_i. This operation is defined as:

    e_i = x_i + Σ_{n=3}^{8} E_hash[Hash(g_{i,n})]

    (Source: Author)
    Explanation: look up the hash of each n-gram in the embedding table and add the result to the respective byte embedding, for all n ∈ [3, 8],

    where x_i is the initial embedding for byte b_i.
    The shape of the tensor E = {e_1, e_2, …, e_N_bytes} remains [B, N_bytes, h_e].
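
    A minimal PyTorch sketch of Steps 1 and 2 combined: a byte embedding table plus hash-bucketed n-gram embeddings. The rolling-hash function, bucket count, and hidden size are placeholders for illustration; the paper's exact choices may differ.

    ```python
    import torch
    import torch.nn as nn

    class ByteNgramEmbedding(nn.Module):
        """e_i = x_i + sum over n in [3, 8] of E_hash[Hash(g_{i,n})]."""

        def __init__(self, h_e: int = 256, v_hash: int = 50_000):
            super().__init__()
            self.byte_emb = nn.Embedding(256, h_e)      # E_byte: [256, h_e]
            self.hash_emb = nn.Embedding(v_hash, h_e)   # E_hash: [V_hash, h_e]
            self.v_hash = v_hash

        def _hash_ngrams(self, byte_ids: torch.Tensor, n: int) -> torch.Tensor:
            # Placeholder polynomial rolling hash over the n bytes ending at each position.
            h = torch.zeros_like(byte_ids)
            for k in range(n):
                shifted = torch.roll(byte_ids, shifts=k, dims=1)
                shifted[:, :k] = 0                      # positions with no k-th predecessor
                h = (h * 257 + shifted) % self.v_hash
            return h

        def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
            # byte_ids: [B, N_bytes] with values in 0..255
            e = self.byte_emb(byte_ids)                 # x_i for every position
            for n in range(3, 9):                       # n in {3, ..., 8}
                e = e + self.hash_emb(self._hash_ngrams(byte_ids, n))
            return e                                    # [B, N_bytes, h_e]
    ```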

    Step 3: Iterative Refinement with Transformer and Cross-Attention Layers

    The core of the Local Encoder consists of a stack of l_e identical layers. Each layer performs a two-stage process to refine byte representations and distill them into patch representations.

    Step 3a: Local Self-Attention:
    The input is processed by a standard Transformer block. This block uses a causal self-attention mechanism with a restricted attention window, meaning each byte representation is updated by attending only to a fixed number of preceding byte representations. This ensures computational efficiency while still allowing for contextual refinement.

    Input: If it is the first layer, the input is the context-augmented byte embedding E; otherwise, it receives the output from the previous local self-attention layer.

    (Source: Author)
    H_l: input to the current self-attention layer
    E: context-augmented byte embedding from Step 2
    H′_{l-1}: output from the previous self-attention layer

    Output: More contextually aware byte representations, H′_l (shape: [B, N_bytes, h_e])
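
    Below is a toy illustration of the windowed causal attention used here, built with PyTorch's scaled_dot_product_attention; the batch size, head count, and window length are made-up numbers for demonstration.

    ```python
    import torch
    import torch.nn.functional as F

    def local_causal_mask(n_bytes: int, window: int) -> torch.Tensor:
        """True where a query byte may attend: itself and up to window-1 preceding bytes."""
        i = torch.arange(n_bytes).unsqueeze(1)   # query positions
        j = torch.arange(n_bytes).unsqueeze(0)   # key positions
        return (j <= i) & (j > i - window)

    # Toy shapes (not the paper's): batch 2, 4 heads, 16 bytes, head dim 32.
    B, heads, N, d = 2, 4, 16, 32
    q = torch.randn(B, heads, N, d)
    k = torch.randn(B, heads, N, d)
    v = torch.randn(B, heads, N, d)

    mask = local_causal_mask(N, window=4)                          # byte 10 sees only bytes 7-10
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # [B, heads, N, d]
    ```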

    Step 3b: Multi-Headed Cross-Attention:
    The goal of the cross-attention is to distill the fine-grained, contextual information captured in the byte representations and inject it into the more abstract patch representations, giving them a rich awareness of their constituent sub-word structures. This is achieved through a cross-attention mechanism where patches "query" the bytes they contain.

    Queries (Q): The patch embeddings are projected using a simple linear layer to form the queries.
    For any subsequent layer (l > 0), the patch embeddings are simply the refined patch representations output by the cross-attention block of the previous layer, P_(l−1).
    However, for the very first layer (l = 0), these patch embeddings must be created from scratch. This initialization is a three-step process:

    1. Gathering: Using the patch boundary information obtained in Step 1, the model gathers the byte representations from H_0 that belong to each patch. For a single patch, this yields a tensor of shape (N_bytes_per_patch, h_e). After padding every patch to the same length, if there are J patches, the full concatenated tensor has shape (B, J, N_bytes_per_patch, h_e).
    2. Pooling: To summarize each patch as a single vector, a pooling operation (e.g., max-pooling) is applied across the N_bytes_per_patch dimension. This distills the most salient byte-level features within the patch.
      • Input shape: (B, J, N_bytes_per_patch, h_e)
      • Output shape: (B, J, h_e)
    3. Projection: This summarized patch vector, still in the small local dimension h_e, is then passed through a dedicated linear layer up to the global dimension h_g, where h_e ≪ h_g. This projection is what bridges the local and global modules.
      • Input shape: (B, J, h_e)
      • Output shape: (B, J, h_g)
    (Source: Author)
    Summary of the three-step process to obtain the first patch embeddings:
    1. Gathering and pooling the bytes for each respective patch.
    2. Concatenating the patches into a single tensor.
    3. Projecting the patch embedding tensor to the global dimension.
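
    A compact sketch of this gather, pool, and project initialization, using a scatter-based max-pool so no explicit padding is needed; patch_ids assigns every byte to its patch, and all sizes below are illustrative.

    ```python
    import torch
    import torch.nn as nn

    def init_patch_queries(
        h0: torch.Tensor,           # byte representations, [B, N_bytes, h_e]
        patch_ids: torch.Tensor,    # patch index of every byte, [B, N_bytes]
        num_patches: int,           # J
        proj: nn.Linear,            # h_e -> h_g projection
    ) -> torch.Tensor:
        """Max-pool the bytes belonging to each patch, then project h_e -> h_g."""
        B, N, h_e = h0.shape
        pooled = h0.new_full((B, num_patches, h_e), float("-inf"))
        index = patch_ids.unsqueeze(-1).expand(-1, -1, h_e)          # [B, N_bytes, h_e]
        pooled = pooled.scatter_reduce(1, index, h0, reduce="amax")  # per-patch max-pool
        return proj(pooled)                                          # [B, J, h_g]

    # Toy usage (h_e=256, h_g=1024 and the even patching are illustrative):
    proj = nn.Linear(256, 1024)
    h0 = torch.randn(2, 64, 256)
    patch_ids = (torch.arange(64) // 8).repeat(2, 1)      # 8 equal patches of 8 bytes each
    patches = init_patch_queries(h0, patch_ids, 8, proj)  # [2, 8, 1024]
    ```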

    The patch representations, obtained either from the previous cross-attention block's output or initialized from scratch, are then fed into a linear projection layer to form the queries.

    • Input shape: (B, J, h_g)
    • Output shape: (B, J, d_a), where d_a is the "attention dimension".

    Keys and Values: These are derived from the byte representations produced in Step 3a. They are projected from dimension h_e to an intermediate attention dimension d_a via independent linear layers:

    (Source: Author)
    Projection of the self-attention output from Step 3a into Keys and Values.
    (Source: Author)
    Overview of the information flow in the Local Encoder
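
    Here is a shape-level sketch of the patch-to-byte cross-attention described above. For brevity it lets every patch attend to every byte; the per-patch masking that restricts each patch to its own bytes is omitted, and all dimensions are illustrative.

    ```python
    import torch
    import torch.nn as nn

    class PatchByteCrossAttention(nn.Module):
        """Patches (queries) attend to the byte representations they summarize."""

        def __init__(self, h_e: int, h_g: int, d_a: int, n_heads: int = 8):
            super().__init__()
            self.q_proj = nn.Linear(h_g, d_a)    # patch embeddings -> queries
            self.k_proj = nn.Linear(h_e, d_a)    # byte representations -> keys
            self.v_proj = nn.Linear(h_e, d_a)    # byte representations -> values
            self.attn = nn.MultiheadAttention(d_a, n_heads, batch_first=True)
            self.out_proj = nn.Linear(d_a, h_g)  # back up to the global dimension

        def forward(self, patches: torch.Tensor, bytes_h: torch.Tensor) -> torch.Tensor:
            # patches: [B, J, h_g], bytes_h: [B, N_bytes, h_e]
            q, k, v = self.q_proj(patches), self.k_proj(bytes_h), self.v_proj(bytes_h)
            ctx, _ = self.attn(q, k, v)              # [B, J, d_a]
            return patches + self.out_proj(ctx)      # residual update of the patch vectors

    # Toy usage (all sizes illustrative):
    xattn = PatchByteCrossAttention(h_e=256, h_g=1024, d_a=512)
    p = xattn(torch.randn(2, 8, 1024), torch.randn(2, 64, 256))   # [2, 8, 1024]
    ```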

    2.1.2 The Latent Global Transformer

    The sequence of patch representations generated by the Local Encoder is passed to the Latent Global Transformer. This module serves as the primary reasoning engine of the BLT model. It is a standard, high-capacity autoregressive Transformer composed of l_g self-attention layers, where l_g is significantly larger than the number of layers in the local modules.

    Operating on patch vectors (shape: [B, J, h_g]), this transformer performs full self-attention across all patches, enabling it to model complex, long-range dependencies efficiently. Its sole function is to predict the representation of the next patch, o_j (shape: [B, 1, h_g]), based on all preceding ones. The output is a sequence of predicted patch vectors, O_j (shape: [B, J, h_g]), which encode the model's high-level predictions.

    (Source: Author)
    o_j is the patch vector that carries the information for the next prediction
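
    As a stand-in, the global module can be pictured as an ordinary causal Transformer stack over the J patch vectors. The sketch below uses PyTorch's generic encoder layers purely to show the shapes involved; the real model's layer internals and sizes differ.

    ```python
    import torch
    import torch.nn as nn

    h_g, n_heads, l_g = 1024, 16, 24                    # illustrative sizes
    layer = nn.TransformerEncoderLayer(d_model=h_g, nhead=n_heads, batch_first=True)
    global_model = nn.TransformerEncoder(layer, num_layers=l_g)

    patches = torch.randn(2, 8, h_g)                    # [B, J, h_g] from the Local Encoder
    causal = nn.Transformer.generate_square_subsequent_mask(8)  # each patch sees only its past
    O = global_model(patches, mask=causal)              # [B, J, h_g] predicted patch states
    ```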

    2.1.3 The Local Decoder

    The final architectural component is the Local Decoder, a lightweight Transformer that decodes the predicted patch vector o_j, the last element of the global model's output O_j, back into a sequence of raw bytes. It operates autoregressively, generating one byte at a time.

    The generation process, designed to be the inverse of the encoder, begins with the hidden state of the last byte in the encoder's output, H_l. Then, for each subsequent byte generated by the decoder (d′_k), in typical autoregressive fashion, it uses the previously predicted byte's hidden state as the input to guide the generation.

    Cross-Attention: The last byte's state from the encoder's output, H_l[:, -1, :] (acting as the query, with shape [B, 1, h_e]), attends to the target patch vector o_j (acting as Key and Value). This step injects the high-level semantic instruction from the patch into the byte stream.

    The query vectors are projected to an attention dimension, d_a, while the patch vector is projected to create the key and value. This alignment ensures the generated bytes stay contextually tied to the global prediction.

    (Source: Author)
    The general equations, which spell out what the Query, Key, and Value are.
    d′_k: the (k+1)-th predicted byte's hidden state from the decoder.

    Local Self-Attention: The resulting patch-aware byte representations are then processed by a causal self-attention mechanism. This allows the model to consider the sequence of bytes already generated within the current patch, enforcing local sequential coherence and correct character ordering.

    After passing through all l_d layers, each containing the two stages above, the hidden state of the last byte in the sequence is projected by a final linear layer to a 256-dimensional logit vector. A softmax function converts these logits into a probability distribution over the byte vocabulary, from which the next byte is sampled. This new byte is then embedded and appended to the input sequence for the next generation step, continuing until the patch is fully decoded.

    (Source: Author)
    Overview of the information flow in the Local Decoder
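
    Putting the decoder's loop into sketch form may help. The decoder object and its embed method below are placeholder interfaces invented for this illustration (the paper does not define such an API), and end-of-patch handling is omitted.

    ```python
    import torch

    def decode_patch(decoder, o_j, h_last, max_bytes: int = 16):
        """Autoregressively decode one patch into raw bytes.

        decoder : local decoder; assumed to map (byte_states, patch_vector) -> logits
                  over the 256 byte values for the next position, and to expose an
                  `embed` method for new bytes (a placeholder API, not the paper's).
        o_j     : predicted patch vector from the global model, [B, 1, h_g]
        h_last  : hidden state of the last byte from the encoder, [B, 1, h_e]
        """
        generated = []
        states = h_last
        for _ in range(max_bytes):
            logits = decoder(states, o_j)             # cross-attend to o_j, then self-attend
            probs = torch.softmax(logits, dim=-1)     # distribution over the 256 byte values
            next_byte = torch.multinomial(probs, 1)   # sample the next byte
            generated.append(next_byte)
            # Embed the new byte and append it so the next step can attend to it
            # (end-of-patch detection is omitted in this sketch).
            states = torch.cat([states, decoder.embed(next_byte)], dim=1)
        return torch.cat(generated, dim=1)            # [B, <=max_bytes] byte IDs
    ```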

    3. The Verdict: Bytes Are Better Than Tokens!

    The Byte Latent Transformer could genuinely be a replacement for regular tokenization-based Transformers at scale. Here are a few convincing reasons for that argument:

    1. Byte-Level Models Can Match Token-Based Ones.
    One of the main contributions of this work is that byte-level models, for the first time, can match the scaling behavior of state-of-the-art token-based architectures such as LLaMA 3 (Grattafiori et al. 2024) [2]. When trained under compute-optimal regimes, the Byte Latent Transformer (BLT) exhibits performance scaling trends comparable to those of models using byte pair encoding (BPE). This finding challenges the long-standing assumption that byte-level processing is inherently inefficient, showing instead that, with the right architectural design, tokenizer-free models have a real shot.

    (Source: Adapted from Pagnoni et al. 2024, Figure 6)
    BLT showing competitive BPB (a perplexity equivalent for byte models) and scaling laws similar to those of the tokenizer-based LLaMA models

    2. A New Scaling Dimension: Trading Patch Size for Model Size.
    The BLT architecture decouples model size from sequence length in a way that token-based models cannot. By dynamically grouping bytes into patches, BLT can use longer average patches to save on compute. This saved compute can be reallocated to increase the size and capacity of the main Latent Global Transformer while keeping the total inference cost (FLOPs) constant. The paper shows this new trade-off is highly beneficial: larger models operating on longer patches consistently outperform smaller models operating on shorter tokens/patches for a fixed inference budget.
    This means you can have a larger and more capable model at no additional compute cost.

    (Source: Adapted from Pagnoni et al. 2024, Figure 1)
    The steeper scaling curves of the larger BLT models allow them to surpass the performance of the token-based Llama models after the crossover point.
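
    A back-of-the-envelope way to see the trade-off, using the common rough estimate of about 2 × parameters FLOPs per forward position; the parameter counts and patch sizes below are invented for illustration, not taken from the paper.

    ```python
    # Rough cost of the global model per byte: ~2 * N_params FLOPs per patch position,
    # amortized over `avg_patch_size` bytes. All numbers are invented for illustration.
    def global_flops_per_byte(n_params: float, avg_patch_size: float) -> float:
        return 2 * n_params / avg_patch_size

    print(global_flops_per_byte(n_params=4e9, avg_patch_size=4))   # ~2.0e9 FLOPs per byte
    print(global_flops_per_byte(n_params=8e9, avg_patch_size=8))   # ~2.0e9 FLOPs per byte
    # Doubling the average patch size allows roughly twice the parameters at the
    # same inference cost per byte.
    ```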

    3. Subword Awareness Through Byte-Level Modeling
    By processing raw bytes directly, BLT avoids the information loss typically introduced by tokenization and gains access to the internal structure of words: their spelling, morphology, and character-level composition. This results in a heightened sensitivity to subword patterns, which the model demonstrates across several benchmarks.
    On CUTE (Character-level Understanding and Text Evaluation) (Edman et al., 2024) [3], BLT excels at tasks involving fine-grained edits such as character swaps or substitutions, achieving near-perfect accuracy on spelling tasks where models like LLaMA 3 fail entirely.
    Similarly, on noised HellaSwag (Zellers et al., 2019) [4], where inputs are perturbed with typos and case variations, BLT retains its reasoning ability far more effectively than token-based models. These results point to an inherent robustness in BLT that token-based models cannot gain even with significantly more data.

    (Source: Pagnoni et al. 2024, Table 3)
    The model's direct byte-level processing leads to large gains on character manipulation (CUTE) and noise robustness (HellaSwag Noise Avg.), tasks that challenge token-based architectures.

    4. BLT Shows Stronger Performance on Low-Resource Languages.
    Fixed tokenizers, often trained on predominantly English or other high-resource language data, can be inefficient and inequitable for low-resource languages, frequently breaking words down into individual bytes (a phenomenon known as "byte-fallback"). Because BLT is inherently byte-based, it treats all languages equally from the start. The results show this leads to improved performance in machine translation, particularly for languages with scripts and morphologies that are poorly represented in standard BPE vocabularies.

    (Source: Pagnoni et al. 2024, Table 4)
    Machine translation performance on the FLORES-101 benchmark (Goyal et al., 2022) [5]. Comparable performance on high-resource languages, but superior on low-resource languages, outperforming the LLaMA 3 model.

    5. Dynamic Allocation of Compute: Not Every Word Is Equally Deserving
    A key strength of the BLT architecture lies in its ability to dynamically allocate computation based on input complexity. Unlike traditional models that spend a fixed amount of compute per token, treating simple words like "the" and complex ones like "antidisestablishmentarianism" at equal cost, BLT ties its computational effort to the structure of its learned patches. The high-capacity Global Transformer operates only on patches, allowing BLT to form longer patches over predictable, low-complexity sequences and shorter patches over regions requiring deeper reasoning. This lets the model focus its most powerful components where they are needed most, while offloading routine byte-level decoding to a lighter, local decoder, yielding a far more efficient and adaptive allocation of resources.


    4. Final Thoughts and Conclusion

    For me, what makes BLT exciting isn't just the benchmarks or the novelties; it's the idea that a model can move past the superficial wrappers we call "languages" (English, Japanese, even Python) and start learning directly from the raw bytes, the fundamental substrate of all communication. I love that. A model that doesn't rely on a fixed vocabulary, but instead learns structure from the ground up? That feels like a real step toward something more universal.

    Of course, something this different won't be embraced with open arms overnight. Tokenizers have become baked into everything: our models, our tools, our intuition. Ditching them means rethinking the very foundational block of the entire AI ecosystem. But the upside here is hard to ignore. Maybe, instead of the whole architecture, we will see some of its features integrated into the new systems that come next.


    5. References

    [1] Pagnoni, Artidoro, et al. "Byte Latent Transformer: Patches Scale Better Than Tokens." arXiv preprint arXiv:2412.09871 (2024).
    [2] Grattafiori, Aaron, et al. "The Llama 3 Herd of Models." arXiv preprint arXiv:2407.21783 (2024).
    [3] Edman, Lukas, Helmut Schmid, and Alexander Fraser. "CUTE: Measuring LLMs' Understanding of Their Tokens." arXiv preprint arXiv:2409.15452 (2024).
    [4] Zellers, Rowan, et al. "HellaSwag: Can a Machine Really Finish Your Sentence?" arXiv preprint arXiv:1905.07830 (2019).
    [5] Goyal, Naman, et al. "The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation." Transactions of the Association for Computational Linguistics 10 (2022): 522-538.


