
    How LLMs Work: Pre-Training to Post-Training, Neural Networks, Hallucinations, and Inference

    By Team_AIBS News · February 19, 2025 · 9 min read

    With the current explosion of interest in large language models (LLMs), they often seem almost magical. But let's demystify them.

    I wanted to step back and unpack the fundamentals: how LLMs are built, trained, and fine-tuned to become the AI systems we interact with today.

    This two-part deep dive is something I've been meaning to do for a while, and it was also inspired by Andrej Karpathy's widely popular 3.5-hour YouTube video, which racked up 800,000+ views in just 10 days. Andrej is a founding member of OpenAI, and his insights are gold. You get the idea.

    If you have the time, his video is definitely worth watching. But let's be real: 3.5 hours is a long watch. So, for the busy folks who don't want to miss out, I've distilled the key concepts from the first 1.5 hours into this 10-minute read, adding my own breakdowns to help you build solid intuition.

    What you’ll get

    Part 1 (this article): Covers the fundamentals of LLMs, including pre-training, post-training, neural networks, hallucinations, and inference.

    Part 2: Reinforcement learning with human/AI feedback, investigating o1 models, DeepSeek R1, and AlphaGo.

    Let's go! I'll start with how LLMs are built.

    At a high level, there are two key phases: pre-training and post-training.

    1. Pre-training

    Before an LLM can generate text, it must first learn how language works. This happens through pre-training, a highly computationally intensive process.

    Step 1: Data collection and preprocessing

    The first step in training an LLM is gathering as much high-quality text as possible. The goal is to create a massive and diverse dataset covering a wide range of human knowledge.

    One source is Common Crawl, a free, open repository of web crawl data containing 250 billion web pages collected over 18 years. However, raw web data is noisy, full of spam, duplicates, and low-quality content, so preprocessing is essential. If you're interested in preprocessed datasets, FineWeb offers a curated version of Common Crawl and is available on Hugging Face.

    Once cleaned, the text corpus is ready for tokenization.

    Step 2: Tokenization

    Before a neural network can process text, the text must be converted into numerical form. This is done through tokenization, where words, subwords, or characters are mapped to unique numerical tokens.

    Think of tokens as the fundamental building blocks of all language models. In GPT-4, there are 100,277 possible tokens. A popular tool, Tiktokenizer, lets you experiment with tokenization and see how text is broken down into tokens. Try entering a sentence, and you'll see each word or subword assigned a sequence of numerical IDs.
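    To make the idea concrete, here is a toy tokenizer with a tiny invented vocabulary (real tokenizers such as GPT-4's learn their 100,277-entry vocabulary from data; the pieces and IDs below are made up purely for illustration):

```python
# Toy tokenizer: greedily maps sub-word pieces to integer IDs.
# The vocabulary is invented for illustration only.
vocab = {"we": 0, " are": 1, " cook": 2, "ing": 3, " food": 4}

def encode(text, vocab):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    while text:
        # Pick the longest vocabulary piece that prefixes the remaining text.
        match = max((p for p in vocab if text.startswith(p)), key=len)
        tokens.append(vocab[match])
        text = text[len(match):]
    return tokens

print(encode("we are cooking", vocab))  # [0, 1, 2, 3]
```

    Note how "cooking" splits into two tokens (" cook" + "ing"): subword tokenizers reuse common fragments so rare words don't each need their own vocabulary entry.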

    Step 3: Neural network training

    Once the text is tokenized, the neural network learns to predict the next token based on its context. The model takes an input sequence of tokens (e.g., "we are cook ing") and processes it through a giant mathematical expression, which represents the model's architecture, to predict the next token.

    A neural network consists of two key components:

    1. Parameters (weights): the numerical values learned during training.
    2. Architecture (mathematical expression): the structure defining how the input tokens are processed to produce outputs.

    Initially, the model's predictions are random, but as training progresses, it learns to assign probabilities to possible next tokens.

    When the correct token (e.g., "food") is identified, the model adjusts its billions of parameters (weights) through backpropagation, an optimization process that reinforces correct predictions by increasing their probabilities while reducing the likelihood of incorrect ones.

    This process is repeated billions of times across massive datasets.
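    To see what "adjusting parameters" means numerically, here is a sketch of a single training step over a tiny made-up vocabulary (NumPy only; the logits and target index are invented for illustration, and a real model updates weights deep inside the network rather than the logits directly):

```python
import numpy as np

# Raw scores (logits) the model currently assigns to 5 candidate tokens.
logits = np.array([1.0, 0.5, -0.2, 0.1, 0.3])
target = 4  # index of the correct next token, e.g. " food"

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

probs_before = softmax(logits)

# Cross-entropy loss penalizes low probability on the correct token;
# its gradient with respect to the logits is (probs - one_hot(target)).
grad = probs_before.copy()
grad[target] -= 1.0

# One gradient-descent step raises the correct token's logit, lowers the rest.
logits = logits - 0.5 * grad
probs_after = softmax(logits)

print(probs_after[target] > probs_before[target])  # True
```

    Repeating this step billions of times over real data is, in essence, what pre-training does.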

    Base model: the output of pre-training

    At this stage, the base model has learned:

    • How words, phrases, and sentences relate to one another
    • Statistical patterns in the training data

    However, base models are not yet optimized for real-world tasks. You can think of them as an advanced autocomplete system: they predict the next token based on probability, but with limited instruction-following ability.

    A base model can sometimes recite training data verbatim, and it can be used for certain applications through in-context learning, where you guide its responses by providing examples in your prompt. However, to make the model truly useful and reliable, it requires further training.

    2. Post-training: making the model useful

    Base models are raw and unrefined. To make them helpful, reliable, and safe, they undergo post-training, where they are fine-tuned on smaller, specialized datasets.

    Because the model is a neural network, it cannot be explicitly programmed like traditional software. Instead, we "program" it implicitly by training it on structured, labeled datasets that represent examples of desired interactions.

    How post-training works

    Specialized datasets are created, consisting of structured examples of how the model should respond in different situations.

    Some types of post-training include:

    1. Instruction/conversation fine-tuning
      Goal: Teach the model to follow instructions, stay task-oriented, engage in multi-turn conversations, follow safety guidelines, refuse malicious requests, and so on.
      E.g., InstructGPT (2022): OpenAI hired some 40 contractors to create these labeled datasets. The human annotators wrote prompts and provided ideal responses based on safety guidelines. Today, many datasets are generated automatically, with humans reviewing and editing them for quality.
    2. Domain-specific fine-tuning
      Goal: Adapt the model for specialized fields like medicine, law, and programming.

    Post-training also introduces special tokens, symbols that were not used during pre-training, to help the model understand the structure of interactions. These tokens signal where a user's input begins and ends and where the AI's response begins, ensuring that the model correctly distinguishes between prompts and replies.
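    As an illustration, a conversation might be rendered with special markers like this (the marker strings below are invented for the example; each model family defines its own special tokens):

```python
# Illustrative chat template. The marker strings are made up;
# real models define their own special tokens during post-training.
def render_chat(user_message, assistant_reply=""):
    return ("<|user|>" + user_message + "<|end|>"
            "<|assistant|>" + assistant_reply)

prompt = render_chat("What is tokenization?")
print(prompt)  # <|user|>What is tokenization?<|end|><|assistant|>
```

    Because these markers never appear in pre-training text, the model can rely on them to tell prompts and replies apart.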

    Now, we'll move on to a few other key concepts.

    Inference: how the model generates new text

    Inference can be performed at any stage, even midway through pre-training, to evaluate how well the model has learned.

    When given an input sequence of tokens, the model assigns probabilities to all possible next tokens based on the patterns it learned during training.

    Instead of always picking the most likely token, the model samples from this probability distribution, much like flipping a biased coin: higher-probability tokens are more likely to be chosen.

    This process repeats iteratively, with each newly generated token becoming part of the input for the next prediction.

    Token selection is stochastic, so the same input can produce different outputs. Over time, the model generates text that wasn't explicitly in its training data but follows the same statistical patterns.
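    The biased-coin analogy can be sketched in a few lines (the probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up probabilities the model assigns to four candidate next tokens.
probs = np.array([0.6, 0.25, 0.1, 0.05])

# Sampling is like flipping a biased coin: the top token wins most often,
# but lower-probability tokens still get picked sometimes.
samples = rng.choice(len(probs), size=10_000, p=probs)
freq = np.bincount(samples, minlength=len(probs)) / len(samples)
print(freq)  # close to [0.6, 0.25, 0.1, 0.05]; runs vary without a fixed seed
```

    Always taking the single most likely token (greedy decoding) would make output deterministic and often repetitive; drawing from the full distribution is what lets the same prompt yield different completions.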

    Hallucinations: when LLMs generate false information

    Why do hallucinations occur?

    Hallucinations happen because LLMs don't "know" facts; they simply predict the most statistically likely sequence of words based on their training data.

    Early models struggled significantly with hallucinations.

    For instance, if the training data contains many "Who is…" questions with definitive answers, the model learns that such queries should always have confident responses, even when it lacks the necessary knowledge.

    When asked about an unknown person, the model doesn't default to "I don't know", because this pattern was not reinforced during training. Instead, it generates its best guess, often producing fabricated information.

    How do you reduce hallucinations?

    Method 1: Saying "I don't know"

    Improving factual accuracy requires explicitly training the model to recognize what it doesn't know, a task that is more complex than it seems.

    This is done via self-interrogation, a process that helps define the model's knowledge boundaries.

    Self-interrogation can be automated using another AI model, which generates questions to probe for knowledge gaps. If the model produces a false answer, new training examples are added in which the correct response is: "I'm not sure. Could you provide more context?"

    If a model has seen a question many times in training, it will assign a high probability to the correct answer.

    If the model has not encountered the question before, it distributes probability more evenly across multiple possible tokens, making the output more randomized. No single token stands out as the most likely choice.
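    One way to quantify this difference is the entropy of the next-token distribution: flatter distributions have higher entropy. A quick sketch (both distributions are made up for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; higher means a flatter, less certain distribution."""
    p = np.asarray(p)
    return float(-(p * np.log2(p)).sum())

confident = [0.90, 0.05, 0.03, 0.02]  # the model has seen this question often
uncertain = [0.25, 0.25, 0.25, 0.25]  # probability spread evenly

print(round(entropy(confident), 2))  # 0.62
print(round(entropy(uncertain), 2))  # 2.0
```

    A signal like this is what makes it possible, in principle, to map "flat distribution" to an "I'm not sure" style answer during fine-tuning.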

    Fine-tuning explicitly trains the model to handle these low-confidence outputs with predefined responses.

    For example, when I asked ChatGPT-4o, "Who is asdja rkjgklfj?", it correctly responded: "I'm not sure who that is. Could you provide more context?"

    Method 2: Doing a web search

    A more advanced method is to extend the model's knowledge beyond its training data by giving it access to external search tools.

    At a high level, when a model detects uncertainty, it can trigger a web search. The search results are then inserted into the model's context window, essentially allowing this new knowledge to become part of its working memory. The model references this new information while generating a response.

    Vague recollections vs. working memory

    Generally speaking, LLMs have two types of knowledge access:

    1. Vague recollections: the knowledge stored in the model's parameters from pre-training. This is based on patterns learned from vast amounts of internet data but is neither precise nor searchable.
    2. Working memory: the information available in the model's context window, which is directly accessible during inference. Any text provided in the prompt acts as short-term memory, allowing the model to recall details while generating responses.

    Adding relevant facts within the context window significantly improves response quality.

    Knowledge of self

    When asked questions like "Who are you?" or "What built you?", an LLM will generate a statistical best guess based on its training data, unless it is explicitly programmed to respond accurately.

    LLMs don't have true self-awareness; their responses depend on patterns seen during training.

    One way to give the model a consistent identity is to use a system prompt, which sets predefined instructions about how it should describe itself, its capabilities, and its limitations.
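    In practice, a system prompt is just a privileged first message in the conversation. A minimal sketch, using the common "role"/"content" message convention (the bot name and identity text are invented for the example):

```python
# The system message is prepended to every conversation, so the model
# conditions on it before generating any reply.
messages = [
    {"role": "system",
     "content": "You are ExampleBot, an assistant built by Example Corp. "
                "If asked who you are or who built you, answer accordingly."},
    {"role": "user", "content": "Who are you?"},
]

# A chat API would receive this full list; the model's self-description
# then follows the system instructions rather than a statistical guess.
print(messages[0]["role"])  # system
```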

    To wrap up

    That's a wrap for Part 1! I hope this has helped you build intuition on how LLMs work. In Part 2, we'll dive deeper into reinforcement learning and some of the latest models.

    Got questions or ideas for what I should cover next? Drop them in the comments; I'd love to hear your thoughts. See you in Part 2! 🙂


