    6 Common LLM Customization Strategies Briefly Explained

    By Team_AIBS News | February 24, 2025

    Why Customize LLMs?

    Large Language Models (LLMs) are deep learning models pre-trained with self-supervised learning, requiring vast amounts of training data, long training times, and a huge number of parameters. LLMs have revolutionized natural language processing, especially over the last two years, demonstrating remarkable capabilities in understanding and generating human-like text. However, these general-purpose models' out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that rely on proprietary company data or closed-book settings, which makes them relatively generic in their applications. Training an LLM from scratch is largely infeasible for small to medium teams due to the enormous amounts of training data and compute required. Therefore, a wide range of LLM customization strategies have been developed in recent years to tune models for scenarios that require specialized knowledge.

    The customization strategies can be broadly split into two types:

    • Using a frozen model: These techniques do not require updating model parameters and are typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model's behavior without incurring extensive training costs, and are therefore widely explored in both industry and academia, with new research papers published daily.
    • Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM with custom datasets designed for the intended purpose. It includes popular techniques like Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF).

    These two broad customization paradigms branch out into various specialized techniques, including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct advantages and trade-offs regarding computational resources, implementation complexity, and performance improvements.

    How to Choose LLMs?

    The first step in customizing LLMs is to select an appropriate foundation model as the baseline. Community-based platforms such as Huggingface offer a wide range of open-source pre-trained models contributed by top companies or communities, such as the Llama series from Meta and Gemini from Google. Huggingface additionally provides leaderboards, for example the "Open LLM Leaderboard", to compare LLMs on industry-standard metrics and tasks (e.g. MMLU). Cloud providers (e.g. AWS) and AI companies (e.g. OpenAI and Anthropic) also offer access to proprietary models, which are typically paid services with restricted access. The following factors are essential to consider when choosing an LLM.

    Open source or proprietary model: Open source models allow full customization and self-hosting but require technical expertise, while proprietary models offer immediate access and often higher-quality responses but at higher cost.

    Task and metrics: Models excel at different tasks, including question-answering, summarization, code generation, etc. Compare benchmark metrics and test on domain-specific tasks to determine the appropriate model.

    Architecture: Generally, decoder-only models (the GPT series) perform better at text generation, while encoder-decoder models (T5) handle translation well. More architectures are emerging and showing promising results, for instance the Mixture of Experts (MoE) model "DeepSeek".

    Number of Parameters and Size: Larger models (70B-175B parameters) offer better performance but need more computing power. Smaller models (7B-13B) run faster and cheaper but may have reduced capabilities.

    After deciding on a base LLM, let's explore the 6 most common strategies for LLM customization, ranked in order of resource consumption from the least to the most intensive:

    • Prompt Engineering
    • Decoding and Sampling Strategy
    • Retrieval Augmented Generation
    • Agent
    • Fine-Tuning
    • Reinforcement Learning from Human Feedback

    If you'd prefer a video walkthrough of these concepts, please check out my video on "6 Common LLM Customization Strategies Briefly Explained".

    LLM Customization Strategies

    1. Prompt Engineering

    A prompt is the input text sent to an LLM to elicit an AI-generated response, and it can be composed of instructions, context, input data, and an output indicator.

    Instructions: This provides a task description or instruction for how the model should perform.

    Context: This is external information to guide the model to respond within a certain scope.

    Input data: This is the input for which you want a response.

    Output indicator: This specifies the output type or format.

    Prompt Engineering involves crafting these prompt components strategically to shape and control the model's response. Basic prompt engineering techniques include zero-shot, one-shot, and few-shot prompting. Users can apply basic prompt engineering techniques directly while interacting with the LLM, making it an efficient way to align the model's behavior to a novel objective. API implementation is also an option, with more details introduced in my previous article "A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph".
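    As an illustration (this example is not from the original article), a minimal few-shot prompt combines an instruction, two labeled examples as context, the input data, and an output indicator, and can be sent to any causal language model; the checkpoint name below is a placeholder:

    # a minimal few-shot sentiment prompt; the model choice is an assumption
    from transformers import pipeline

    few_shot_prompt = (
        "Classify the sentiment of each review as Positive or Negative.\n"   # instruction
        "Review: The battery lasts all day. Sentiment: Positive\n"           # example 1 (context)
        "Review: The screen cracked within a week. Sentiment: Negative\n"    # example 2 (context)
        "Review: Setup was quick and painless. Sentiment:"                   # input data + output indicator
    )

    generator = pipeline("text-generation", model="gpt2")   # placeholder checkpoint
    print(generator(few_shot_prompt, max_new_tokens=3)[0]["generated_text"])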

    Due to the efficiency and effectiveness of prompt engineering, more complex approaches have been explored and developed to advance the logical structure of prompts.

    Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning result, which serves as the precursor context for subsequent steps until arriving at the answer.
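    As a small illustrative sketch (the wording is not from the article), a zero-shot CoT variant simply appends a reasoning trigger to the question so that intermediate steps are generated before the final answer:

    # zero-shot Chain of Thought: append a reasoning trigger to the question
    cot_prompt = (
        "Q: A cafe sells 14 coffees per hour and is open for 6 hours. "
        "It gives away 5 coffees for free. How many coffees does it sell?\n"
        "A: Let's think step by step."
    )
    # the model is expected to expose intermediate steps
    # (14 * 6 = 84, then 84 - 5 = 79) before stating the final answer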

    Tree of Thoughts extends CoT by considering multiple different reasoning branches and self-evaluating choices to decide the next best action. It is more effective for tasks that involve initial decisions, strategies for the future, and exploration of multiple solutions.

    Automatic Reasoning and Tool-use (ART) builds upon the CoT process: it deconstructs complex tasks and allows the model to select few-shot examples from a task library, using predefined external tools like search and code generation.

    Synergizing Reasoning and Acting (ReAct) combines reasoning trajectories with an action space, where the model searches through the action space and determines the next best action based on environmental observations.

    Techniques like CoT and ReAct are often combined with an agentic workflow to strengthen their capabilities. These methods will be introduced in more detail in the "Agent" section.

    Further Reading

    2. Decoding and Sampling Strategy

    The decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top p, top k), determining the randomness and diversity of model responses. Greedy search, beam search, and sampling are three common decoding strategies for auto-regressive model generation.

    During the autoregressive generation process, the LLM outputs one token at a time based on a probability distribution over candidate tokens conditioned on the previous tokens. By default, greedy search is applied to produce the next token with the highest probability.
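    A minimal sketch of this default behavior with the transformers library (checkpoint and prompt are placeholders): with do_sample=False, which is the default, generate() deterministically picks the single most likely next token at each step.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder checkpoint
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("Three common decoding strategies are", return_tensors="pt")
    # greedy search: always take the highest-probability token
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))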

    In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probability across all tokens in the text sequence. The code snippet below uses the transformers library to specify the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # tokenizer_name, model_name, and prompt are assumed to be defined earlier
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    inputs = tokenizer(prompt, return_tensors="pt")

    model = AutoModelForCausalLM.from_pretrained(model_name)
    outputs = model.generate(**inputs, num_beams=5)

    Sampling strategy is the third approach to control the randomness of model responses, by adjusting these inference parameters:

    • Temperature: Lowering the temperature makes the probability distribution sharper by increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words (see the small computation after this list). When temperature = 0, it becomes equivalent to greedy search (least creative); when temperature = 1, it produces the most creative outputs.
    • Top K sampling: This method filters the K most probable next tokens and redistributes the probability among these tokens. The model then samples from this filtered set of tokens.
    • Top P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.
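    To make the temperature effect concrete, here is a small illustrative computation on made-up logits (not from the article), showing how a lower temperature sharpens the distribution over candidate tokens:

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        scaled = np.array(logits) / temperature
        exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
        return exp / exp.sum()

    logits = [3.0, 2.0, 0.5]                             # made-up scores for three candidate tokens
    print(softmax_with_temperature(logits, 1.0))         # softer, more creative distribution
    print(softmax_with_temperature(logits, 0.2))         # sharper, close to greedy search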

    The example code snippet below samples from the 50 most likely tokens (top_k=50) with a cumulative probability higher than 0.95 (top_p=0.95).

    sample_outputs = model.generate(
        **model_inputs,
        max_new_tokens=40,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=3,
    )

    Further Reading

    3. RAG

    Retrieval Augmented Generation (or RAG), originally introduced in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM "hallucination" issues when handling domain-specific or specialized queries. RAG allows dynamically pulling relevant information from the knowledge domain and generally does not involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM to a specialized domain.

    A RAG system can be decomposed into a retrieval stage and a generation stage. The objective of the retrieval process is to find contents within the knowledge base that are closely related to the user query, through chunking external knowledge, creating embeddings, indexing, and similarity search.

    1. Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
    2. Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into its vector representation through the same vectorization process, so that the user query can be compared in the same dimensional space.
    3. Indexing: This process stores the text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
    4. Similarity search: Similarity scores between the query embedding and the text chunk embeddings are calculated and used to retrieve the information most relevant to the user query (a minimal sketch follows this list).
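    As a hypothetical sketch of the similarity search step (not part of the original article), cosine similarity between the query embedding and each chunk embedding can be computed directly, assuming the vectors come from the same embedding model:

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # assume query_vec and chunk_vecs were produced by the same embedding model
    query_vec = np.array([0.1, 0.7, 0.2])
    chunk_vecs = [np.array([0.1, 0.6, 0.3]), np.array([0.9, 0.0, 0.1])]

    scores = [cosine_similarity(query_vec, c) for c in chunk_vecs]
    best_chunk = int(np.argmax(scores))   # index of the most relevant chunk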

    The generation process of the RAG system then combines the retrieved information with the user query to form an augmented query, which is passed to the LLM to generate the context-rich response.
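    In practice the augmented query is just a prompt template; a minimal, hypothetical version might look like this:

    retrieved_chunks = ["<chunk 1 text>", "<chunk 2 text>"]   # output of the retrieval stage
    user_query = "Tell me about LLM customization strategies."

    augmented_query = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n\n".join(retrieved_chunks) + "\n\n"
        "Question: " + user_query
    )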

    Code Snippet

    The code snippet below first specifies the LLM and embedding model, then performs the steps to chunk the external knowledge base documents into a single document, create an index from the document, define the query_engine based on the index, and query the query_engine with the user prompt.

    from llama_index.llms.openai import OpenAI
    from llama_index.core import Document, Settings, SimpleDirectoryReader, VectorStoreIndex

    Settings.llm = OpenAI(model="gpt-3.5-turbo")
    Settings.embed_model = "local:BAAI/bge-small-en-v1.5"   # local Huggingface embedding model

    # the external knowledge base is assumed to be a directory of documents
    documents = SimpleDirectoryReader("./data").load_data()

    document = Document(text="\n\n".join([doc.text for doc in documents]))
    index = VectorStoreIndex.from_documents([document])
    query_engine = index.as_query_engine()
    response = query_engine.query(
        "Tell me about LLM customization strategies."
    )

    The example above shows a simple RAG system. Advanced RAG improves on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation processes. For example, the rerank technique reorders the retrieved information using a model capable of understanding bidirectional context, and integration with a knowledge graph enables advanced query routing. More use cases can be found on the llamaindex website.
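    As a hedged illustration of the rerank technique (the cross-encoder checkpoint below is an assumption, not something named in the article), retrieved chunks can be re-scored against the query with a cross-encoder and reordered before generation:

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # example checkpoint

    query = "Tell me about LLM customization strategies."
    retrieved = ["<chunk A>", "<chunk B>", "<chunk C>"]   # output of first-pass retrieval

    scores = reranker.predict([(query, chunk) for chunk in retrieved])
    reranked = [chunk for _, chunk in sorted(zip(scores, retrieved), reverse=True)]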

    Further Reading

    4. Agent

    LLM Agents were a trending topic in 2024 and will likely remain a main focus in the GenAI field in 2025. Compared to RAG, an Agent excels at creating query routes and planning LLM-based workflows, with the following benefits:

    • Maintaining memory and state of previously generated model responses.
    • Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
    • Breaking down a complex task into smaller steps and planning a sequence of actions.
    • Collaborating with other agents to form an orchestrated system.

    Several in-context learning techniques (e.g. CoT, ReAct) can be implemented through an agentic framework, and we will discuss ReAct in more detail. ReAct, which stands for "Synergizing Reasoning and Acting in Language Models", consists of three key elements: actions, thoughts, and observations. This framework was introduced by Google Research and Princeton University, and builds upon Chain of Thought by integrating the reasoning steps with an action space that enables tool use and function calling. Additionally, the ReAct framework emphasizes determining the next best action based on environmental observations.

    The example from the original paper demonstrates ReAct's inner working process, where the LLM generates the first thought and acts by calling the function "Search [Apple Remote]", then observes the feedback from its first output. The second thought is then based on the previous observation, hence leading to a different action, "Search [Front Row]". This process iterates until the goal is reached. The research shows that ReAct overcomes the prevalent issues of hallucination and error propagation, more often observed in chain-of-thought reasoning, by interacting with a simple Wikipedia API. Furthermore, through the implementation of decision traces, the ReAct framework also increases the model's interpretability, trustworthiness, and diagnosability.

    Example from "ReAct: Synergizing Reasoning and Acting in Language Models" (Yao et al., 2022)

    Code Snippet

    This demonstrates a ReAct-based agent implementation using llamaindex. First, it defines two functions (multiply and add). Second, these two functions are wrapped as FunctionTool objects, forming the agent's action space, to be executed based on its reasoning.

    from llama_index.core.agent import ReActAgent
    from llama_index.core.tools import FunctionTool

    # create basic function tools
    def multiply(a: float, b: float) -> float:
        return a * b
    multiply_tool = FunctionTool.from_defaults(fn=multiply)

    def add(a: float, b: float) -> float:
        return a + b
    add_tool = FunctionTool.from_defaults(fn=add)

    # `llm` is assumed to be an LLM instance defined earlier (e.g. Settings.llm)
    agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)
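    A short usage sketch (the arithmetic question is illustrative): calling the agent triggers the thought, action, and observation loop, with the registered tools executed as actions.

    response = agent.chat("What is (121 * 3) + 42? Use the tools to calculate.")
    print(response)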

    The advantages of an agentic workflow become more substantial when combined with self-reflection or self-correction. It is an increasingly growing field with a variety of agent architectures being explored. For instance, the Reflexion framework facilitates iterative learning by providing a summary of verbal feedback from the environment and storing the feedback in the model's memory, while the CRITIC framework empowers frozen LLMs to self-verify through interacting with external tools such as code interpreters and API calls.

    Further Reading

    5. Fine-Tuning

    Fine-tuning is the process of feeding niche and specialized datasets to modify the LLM so that it is more aligned with a certain objective. It differs from prompt engineering and RAG as it enables updates to the LLM weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropagation, which requires large memory to store all weights and parameters and may suffer from a significant reduction in ability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (parameter-efficient fine-tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:

    • Selective: Select a subset of the initial LLM parameters to fine-tune, which can be more computationally intensive compared to other PEFT methods.
    • Reparameterization: Reparameterize model weights by training the weights of low-rank representations. For example, Low Rank Adaptation (LoRA) falls in this category and accelerates fine-tuning by representing the weight updates with two smaller matrices (a minimal LoRA configuration sketch follows this list).
    • Additive: Add additional trainable layers to the model, including techniques like adapters and soft prompts.
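    As a minimal sketch of the LoRA approach using Hugging Face's peft library (the base checkpoint, rank, and scaling values below are illustrative choices, not values from the article):

    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import AutoModelForCausalLM

    base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder checkpoint

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                 # rank of the two low-rank update matrices
        lora_alpha=16,       # scaling factor applied to the update
        lora_dropout=0.05,
    )

    peft_model = get_peft_model(base_model, lora_config)
    peft_model.print_trainable_parameters()   # only the small LoRA matrices are trainable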

    The fine-tuning process is similar to a deep learning training process, requiring the following inputs:

    • training and evaluation datasets
    • training arguments defining the hyperparameters, e.g. learning rate, optimizer
    • a pretrained LLM model
    • compute metrics and objective functions that the algorithm should be optimized for

    Code Snippet

    Below is an example of implementing fine-tuning using the transformers Trainer.

    from transformers import TrainingArguments, Trainer

    training_args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=1e-5,
        eval_strategy="epoch"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    trainer.train()

    Fine-tuning has a variety of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and instruction following by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.

    Further Reading

    6. RLHF

    Reinforcement Learning from Human Feedback, or RLHF, is a reinforcement learning technique that fine-tunes LLMs based on human preferences. RLHF operates by training a reward model on human feedback and using this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training the reward model, and a prompt dataset used in the reinforcement learning loop.

    Let’s break it down into steps:

    1. Gather a preference dataset annotated by human labelers who rate different completions generated by the model based on human preference. An example format of the preference dataset is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred.
    2. Train a reward model using the preference dataset; the reward model is essentially a regression model that outputs a scalar indicating the quality of the model-generated response. The objective of the reward model is to maximize the score gap between the winning candidate and the losing candidate (a minimal sketch of this pairwise objective follows this list).
    3. Use the reward model in a reinforcement learning loop to fine-tune the LLM. The objective is to update the policy so that the LLM generates responses that maximize the reward produced by the reward model. This process uses the prompt dataset, which is a collection of prompts in the format of {prompt, response, rewards}.
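    A minimal, illustrative sketch of the pairwise reward-model objective in PyTorch (the scalar scores would come from a regression head on top of a transformer; the numbers below are made up):

    import torch
    import torch.nn.functional as F

    # scalar scores the reward model assigns to the preferred and
    # non-preferred completions of the same prompts
    reward_chosen = torch.tensor([1.8, 0.4])
    reward_rejected = torch.tensor([0.3, -0.2])

    # pairwise (Bradley-Terry style) loss: maximize the margin between winner and loser
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()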

    Code Snippet

    The open-source library TRL (Transformer Reinforcement Learning) is widely used for implementing RLHF, and it provides template code that shows the basic RLHF setup:

    1. Initialize the base model and tokenizer from a pretrained checkpoint
    2. Configure the PPO hyperparameters in PPOConfig, like learning rate, epochs, and batch sizes
    3. Create the PPO trainer PPOTrainer by combining the model, tokenizer, and training data
    4. The training loop uses the step() method to iteratively update the model to optimize the rewards calculated from the query and model response
    # trl: Transformer Reinforcement Learning library
    from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
    from trl import create_reference_model
    from trl.core import LengthSampler
    from transformers import AutoTokenizer

    # define the hyperparameters of the PPO algorithm
    config = PPOConfig(
        model_name=model_name,
        learning_rate=learning_rate,
        ppo_epochs=max_ppo_epochs,
        mini_batch_size=mini_batch_size,
        batch_size=batch_size
    )

    # initiate the pretrained model (with a value head) and tokenizer
    ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)

    # initiate the PPO trainer with respect to the model
    ppo_trainer = PPOTrainer(
        config=config,
        model=ppo_model,
        tokenizer=tokenizer,
        dataset=dataset["train"],
        data_collator=collator
    )

    # ppo_trainer is iteratively updated through the rewards
    ppo_trainer.step(query_tensors, response_tensors, rewards)

    RLHF is widely applied for aligning model responses with human preference. Common use cases involve reducing response toxicity and model hallucination. However, it has the downside of requiring a large amount of human-annotated data as well as the computation costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI Feedback (RLAIF) and Direct Preference Optimization (DPO) have been introduced to mitigate these limitations.

    Further Reading

    Take-Home Message

    This article briefly explains six essential LLM customization strategies: prompt engineering, decoding strategy, RAG, Agent, fine-tuning, and RLHF. I hope you find it helpful for understanding the pros and cons of each strategy, as well as how to implement them based on the practical examples.


