FragmentStream Attention: Training a Transformer on a Budget | by Yash Rawal | Feb, 2025

By Team_AIBS News · February 10, 2025 · 11 min read


Have you ever wondered how large language models (LLMs) like GPT and Llama actually work? Sure, you can use pre-trained models with just a few lines of code, but the real challenge is understanding what’s happening behind the scenes. I decided to find out, and while it’s complex and frustrating, it’s entirely possible.

My journey began while working on a company project in which I learned to use the pretrained Llama 3.1 model. My curiosity took off from there: How does it work? How was it built? Could I make something like this myself?

While reading up on it, like any curious person, I stumbled upon the famous research paper ‘Attention Is All You Need’, and after reading it I finally understood the Transformer architecture.

‘I started my own bold experiment: a lean, mean 3-million-parameter model, trained for 20 hours on Kaggle’s free P100 GPU. Proof that you don’t need a supercomputer to chase big ideas, just dedication, curiosity, and a dash of resourcefulness!’

    I started by creating a easy character-level mannequin that solely required 80 characters as tokens and 1,000 strains of dataset. As I improved it, I ended up creating this mannequin utilizing strategies like byte-pair encoding for phrase and sub phrase (like suffix and prefix) primarily based tokenization and even stumbled upon some stunning discoveries, which I’ll focus on later in future articles. Across the identical time, after I first began studying about Transformers, I used to be amazed by their energy but additionally pissed off by how a lot reminiscence they consumed it was like constructing a sandcastle which is washed away by the waves time and again!

This frustration led me to explore ways to make Transformers more memory-efficient, eventually resulting in the idea of “FragmentStream Attention”.

But before diving into the details, let’s first look at why memory matters.

Transformers are the advanced technique dominating NLP and language models today because they can understand long sequences and recognize patterns in huge amounts of text data.

It’s simple: the more context and knowledge you want to store, the more memory you’ll need.

These Transformers are clever: they help you with tasks like translating languages, writing stories, and maybe articles too. They can do this because they have something called “attention”, which lets them focus on the most important parts of the input, and that’s exactly why they use so much memory during training.

What’s the Problem With Traditional Attention?

In traditional Transformers, attention works by comparing every word to every other word in a sentence. That means they build a huge grid, like a giant table, to keep track of how important each word is relative to all the others. This table grows really, really big when the text gets long.

Why This Is a Problem:

1. It uses too much memory: the grid gets bigger and bigger as the text gets longer. If the text is 1,000 words, the grid is 1,000 x 1,000! That’s HUGE.
2. It’s slow: Transformers have to fill in every box in the grid, which takes a lot of time.

Let’s look at an example in code:

# Traditional attention (simplified)
B, T, C = x.shape  # B=batch size, T=sequence length, C=embedding dimension
q = self.query(x)  # (B, T, C)
k = self.key(x)    # (B, T, C)
v = self.value(x)  # (B, T, C)

# Store ALL attention scores at once!
attention_scores = q @ k.transpose(-2, -1)  # (B, T, T) - this is huge!
attention = softmax(attention_scores) @ v   # even more memory usage
# Imagine T is 1,000: that's 1,000 x 1,000 = 1,000,000 boxes!
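To put a rough number on it (a back-of-the-envelope sketch with illustrative sizes, ignoring activations and gradients), the score matrix alone grows quadratically with sequence length:

# Rough size of the (B, T, T) score matrix in float32 (illustrative sizes)
B, T = 32, 1000
bytes_per_float = 4
score_bytes = B * T * T * bytes_per_float
print(f"{score_bytes / 1e6:.0f} MB just for the scores")  # 128 MB per attention head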

If you’ve ever tried to carry too many groceries at once, you know how hard it is. That’s what happens to Transformers when the grid gets too big: they drop everything.

Now, let’s talk about how FragmentStream Attention fixes this! What I did is simply divide the process into batches. Instead of looking at the whole book at once, it splits the book into small chunks, or fragments, and works on one piece at a time.

Imagine reading one page of a book, writing down the important points, and then moving on to the next page. Once you have read all the pages, you put the notes from every page together to form a complete understanding of the whole book. This step-by-step approach ensures nothing is overlooked while keeping the process efficient. That’s what FragmentStream Attention does.

Key Ideas Behind FragmentStream Attention:

1. Break the text into pieces: it divides the text into smaller parts (like 128 words at a time).
2. Handle one part at a time: it only compares the words within each piece instead of looking at the whole text at once.
3. Combine the results: after working on each part, it adds everything together to get the final answer.
4. Keep it organized: it still remembers the order of the text so everything makes sense.

This is what it looks like in Python code:

# FragmentStream attention (simplified)

fragment_size = 128  # Process 128 tokens at a time
for i in range(0, T, fragment_size):          # Process queries in fragments
    q_fragment = q[:, i:i+fragment_size]      # Take a small group of queries
    for j in range(0, T, fragment_size):      # Process keys/values in fragments
        k_fragment = k[:, j:j+fragment_size]  # Take a small group of keys
        v_fragment = v[:, j:j+fragment_size]  # And the corresponding values
        # Compare only these small fragments
        scores = q_fragment @ k_fragment.transpose(-2, -1)
        # Process and accumulate results
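One subtlety worth calling out: softmax has to be normalized over a query’s entire row of scores, not just the scores within one fragment. That is why the full implementation below first collects a whole (fragment_size x T) stripe of masked scores for each query fragment and only then applies softmax, instead of normalizing each small block on its own.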

And this is how I imagine it works on the hardware:

    [Full Matrix in Memory]      [fragment 1] [Clean Up] [fragment 2] [Clean Up]
    X X X X X X X X X X ➜ X X X ➜ X X X ➜
    X X X X X X X X X X ➜ X X X ➜ X X X ➜
    X X X X X X X X X X ➜ X X X ➜ X X X ➜
    X X X X X X X X X X ➜ X X X ➜ X X X ➜

Yeah! I know it might look funny, but it makes a significant difference.

And this is how I applied it in my model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FragmentStream_Attention(nn.Module):
    """
    FragmentStream Attention module.

    Args:
    - head_size: Dimensionality of attention heads
    - block_size: Maximum sequence length
    - dropout: Regularization rate
    - fragment_size: Size of text fragments to process (default 128 tokens)
    """
    def __init__(self, head_size, block_size, dropout, fragment_size=128):
        super().__init__()
        self.head_size = head_size
        self.fragment_size = fragment_size
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, q, k, v):
        B, T, C = q.shape

        # Initialize output tensor
        out = torch.zeros_like(v)

        # Process attention in fragments to save memory
        for i in range(0, T, self.fragment_size):
            j_end = min(T, i + self.fragment_size)

            # Current fragment of queries
            q_fragment = q[:, i:j_end]

            # Scores for this query fragment against the full sequence
            attn_weights = torch.zeros(B, j_end - i, T, device=q.device)

            for j in range(0, T, self.fragment_size):
                k_fragment = k[:, j:min(T, j + self.fragment_size)]

                # Compute scaled attention scores for this block
                scores = (q_fragment @ k_fragment.transpose(-2, -1)) * (C ** -0.5)

                # Apply the causal mask
                scores = scores.masked_fill(
                    self.tril[i:j_end, j:min(T, j + self.fragment_size)] == 0,
                    float('-inf')
                )

                attn_weights[:, :, j:min(T, j + self.fragment_size)] = scores

            # Softmax over the entire sequence length
            attn_weights = F.softmax(attn_weights, dim=-1)
            attn_weights = self.dropout(attn_weights)

            # Weighted sum of values, again in fragments
            for j in range(0, T, self.fragment_size):
                v_fragment = v[:, j:min(T, j + self.fragment_size)]
                out[:, i:j_end] += attn_weights[:, :, j:min(T, j + self.fragment_size)] @ v_fragment

        return out
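A quick way to sanity-check the module (a minimal sketch continuing from the class above, with made-up sizes) is to compare it against ordinary full-matrix causal attention on small random tensors; with dropout disabled, the two should agree to within floating-point error:

# Smoke test: FragmentStream vs. ordinary full-matrix causal attention
torch.manual_seed(0)
B, T, C = 2, 256, 64  # illustrative sizes
q, k, v = (torch.randn(B, T, C) for _ in range(3))

attn = FragmentStream_Attention(head_size=C, block_size=T, dropout=0.0, fragment_size=128)
attn.eval()  # make sure dropout is off so the comparison is deterministic
out = attn(q, k, v)

# Reference: the full (T, T) grid computed in one shot
scores = (q @ k.transpose(-2, -1)) * (C ** -0.5)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float('-inf'))
ref = torch.softmax(scores, dim=-1) @ v

print(torch.allclose(out, ref, atol=1e-5))  # expected: True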

How FragmentStream Attention Works Inside

Let’s look deeper into what happens when FragmentStream Attention reads the text:

Step-by-Step Explanation:

1. Break the text into chunks: if the text has 1,000 words and each fragment holds 128 words, it splits the text into about 8 parts.
2. Compare within each fragment: it looks at the words in each fragment to figure out which ones are important.
3. Write down the results: after working on each fragment, it records the important findings.
4. Put everything together: at the end, it combines the results from all the fragments to get the complete answer.

By doing this cleverly, it saves a TON of memory and works faster. Plus, it runs on low-end or older hardware like NVIDIA’s P100 GPU!
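To see where the saving comes from (again a rough sketch with illustrative numbers): the biggest score buffer alive at any one time is a (fragment_size x T) stripe instead of the full (T x T) grid, so peak memory shrinks by roughly T / fragment_size:

# Peak score-buffer memory per batch element, float32 (illustrative)
T, fragment_size = 1000, 128
full = T * T * 4                    # whole grid at once: 4.0 MB
fragmented = fragment_size * T * 4  # one stripe at a time: ~0.5 MB
print(f"{full / fragmented:.1f}x smaller peak")  # ~7.8x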

Here is the flow chart of my model:

[Figure: FragmentStream_Attention complete implementation architecture in a Transformer-based model]

You can also check out this experiment in the accompanying Kaggle notebook and GitHub repository.

1. Balancing Memory and Accuracy: splitting the text into fragments without losing important details was tricky. I learned that choosing the right fragment size (like 128 tokens) is super important.
2. Understanding Trade-offs: while FragmentStream Attention saves memory, it’s not perfect for every situation. For really short texts, traditional attention might still work better.
3. Not Really Sure Yet: I haven’t tested this model on a truly huge dataset. I tried it on Kaggle’s free P100 GPU with 26,075,321 training samples and 1,368,819 test samples from a healthcare-related dataset by NIH Senior Health, and here is a glimpse of the output:
query = "Hello"
test_generation(query, temperature=0.0)  # generation helper from the accompanying notebook

Generating answer (temperature=0.0)...

Results:
A: Hello Hi , I have gone through your query and understand your concern . You have a history of allergic reaction to the skin . It is a common problem of the skin . It is a common condition of the skin . It is a common cause of the skin . It is a common cause of the skin . It is a fungal infection . It is a common cause of the skin infection . It is a common cause of the skin . It is a fungal infection . It is a common cause of the skin infection . It is a fungal infection . It is a common cause of the skin infection . It is a fungal infection . It is a fungal infection . It is a common cause of the skin . It is a fungal infection . It is a common cause of the skin . It is a fungal infection . It is a fungal infection . It is a common cause of the skin infection . It is a fungal infection . It is a common cause of the skin . It is a fungal infection . It is a common

query = "I am having pain in my head"
test_generation(query, temperature=0.5)

Generating answer (temperature=0.5)...

Results:
A: I am having pain in my head Thanks for writing to Chat Doctor . Since you have history of pain in chest pain , I would suggest you to rule out cardiac cause for your symptoms . You may have to see a doctor for a diagnosis and treatment . Until then , you may have to take an antacid like omeprazole and antacid . When you should go for a chest x - ray and chest x - ray and blood test to rule out any cardiac diseases . If needed x - ray is normal then no need to worry . Hope I have answered your question , if you have doubt then I will be happy to answer . Thanks for using Chat Doctor . Wish you a good health . Hi , I am a 32 year old woman . I have been experiencing pain in my left side of my left shoulder blade and on the lower left side of my neck . I have had a pain in my back . I had a pain in my left arm . It has gotten worse . I had a small bruise on my back of my back .

Here is where I see this going next:

Chatbots: to develop a simple chatbot that can answer more intelligently, using context to generate good responses, but at a very, very low cost.

Expert Systems: to develop a domain-specific expert system instead of a general-purpose AI.

Efficiency: to develop language models that are as efficient and fast as possible.

In conclusion, my research into FragmentStream Attention is still a work in progress. As you can see from the outputs, the model is currently just predicting the next word over and over, which is expected given that the dataset is basic and focused on simple prompts. My goal is to refine it further and make it more efficient, eventually creating a very, very lightweight language model that works like more advanced ones.

The idea behind FragmentStream Attention is to make Transformers memory-efficient enough to run on smaller hardware without losing their ability to understand complex language. While it has shown promising results, there is still a lot to improve, especially when working with larger and more diverse datasets.

I plan to make this project open source; I’m sharing it on GitHub and Kaggle for the community to explore, contribute to, and improve. I truly appreciate any feedback or contributions that help make this project even better. Thanks for reading! This was my first article, so thank you for your interest and support!


