
Fine-tuning & RLHF

How LLMs are adapted from base models to helpful assistants — supervised fine-tuning, RLHF, DPO, and parameter-efficient methods like LoRA.

Overview

A base language model trained on hundreds of billions of tokens from the internet is remarkably capable. It can complete code, continue stories, and discuss philosophy. But ask it to "help me debug this function" or "summarize this document concisely," and it will often respond by continuing the text in a way that feels technically correct but practically useless. Pre-training teaches a model to predict the next token; it does not teach the model to be helpful.

Fine-tuning is the process of taking that pre-trained base and adapting it for a specific task or behavior. Instead of training on random internet text, we train on curated examples of the behavior we want. The model that emerges from fine-tuning understands what it means to follow an instruction, refuse a harmful request, or explain a concept step by step.

Reinforcement Learning from Human Feedback (RLHF) goes one step further. Rather than just imitating good behavior from examples, it trains the model to optimize for what humans actually prefer — a subtle but critical difference. This is the technique behind the alignment of models like GPT-4, Claude, and Gemini.

The Training Pipeline

The journey from raw base model to deployed assistant involves several sequential stages. Each stage corrects a limitation the previous one leaves behind.


Pre-training builds general language capability. SFT teaches instruction-following behavior. The reward model learns human preferences from pairwise comparisons. RL fine-tuning then uses that reward signal to push the SFT model toward outputs humans consistently prefer.

Supervised Fine-tuning (SFT)

SFT is conceptually the simplest step: take a pre-trained model and continue training it on a dataset of (prompt, ideal response) pairs. The training objective is the same cross-entropy loss used in pre-training, but applied only to the response tokens. The model learns to imitate demonstrated helpful behavior.
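A minimal sketch of that response-only loss masking (plain Python; the helper name is hypothetical, and -100 is the ignore-index convention used by common cross-entropy implementations):

```python
IGNORE_INDEX = -100  # tokens with this label are skipped by cross-entropy

def build_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy the token ids, but mask the prompt portion so the loss
    is computed only on the response tokens."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

# Token ids for a 2-token prompt followed by a 3-token response (illustrative):
labels = build_labels([11, 42, 7, 8, 9], prompt_len=2)
print(labels)  # [-100, -100, 7, 8, 9]
```

Only positions 2 onward contribute gradient, so the model is never trained to reproduce the prompt itself.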

A typical instruction dataset entry looks like this:

System: You are a helpful coding assistant.
Human: What is the difference between a list and a tuple in Python?
Assistant: Lists are mutable sequences — you can add, remove, or change elements after
creation. Tuples are immutable; once created, their contents cannot change. ...

Well-known SFT datasets include FLAN (curated from academic tasks), Alpaca (52k instruction pairs generated with OpenAI's text-davinci-003), and OpenAssistant Conversations. The quality and diversity of the dataset matter more than its raw size.

SFT alone has a key weakness: the model has only seen examples of good behavior. It has no mechanism to understand why one response is better than another, and it can still produce sycophantic, evasive, or subtly harmful outputs that superficially resemble the training demonstrations.

Why SFT Is Not Enough

A model trained purely on good examples will attempt to produce outputs that look like good examples. But "looking like a helpful response" is not the same as "being a helpful response." SFT models often agree with incorrect premises, hedge excessively, or give subtly wrong information confidently — because those behaviors appeared in some portion of the training demonstrations.

Reward Modeling

To do better than imitation, we need a way to measure quality. Human raters are shown two responses to the same prompt and asked to pick the better one. These pairwise preferences form a dataset that trains a separate reward model — a classifier that takes a (prompt, response) pair and outputs a scalar score representing predicted human preference.

The underlying probability model is Bradley-Terry: given two responses A and B, the probability that a human prefers A is modeled as:

P(A > B) = sigmoid(r(A) - r(B))

where r(·) is the reward model's score. Training minimizes the negative log-likelihood of observed human choices. The resulting reward model internalizes a fuzzy but useful notion of "what responses people prefer."
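A scalar sketch of that negative log-likelihood for a single comparison, assuming the reward model has already scored both responses (pure Python; the helper name is hypothetical):

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the human preferring `chosen`, under the
    Bradley-Terry model P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)), written via log1p for numerical stability
    return math.log1p(math.exp(-margin))

# The loss shrinks as the reward model scores the chosen response higher:
print(round(bt_loss(2.0, 0.0), 4))  # 0.1269 — model agrees with the rater
print(round(bt_loss(0.0, 2.0), 4))  # 2.1269 — model disagrees, large penalty
```

Minimizing this over many labeled pairs pushes r(·) to rank responses the way human raters do.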

Reward models are imperfect. They reflect the biases of the raters who labeled the data, and they can be gamed — a model that produces responses that score well may not produce responses that are actually good. This is the reward hacking problem.

RLHF with PPO

With a reward model in hand, the SFT model can be fine-tuned using reinforcement learning. The LLM acts as a policy: given a prompt (state), it generates a response (action). The reward model scores that response (reward). RL updates the model to maximize expected reward.

Proximal Policy Optimization (PPO) is the standard algorithm here. It restricts how much the policy can change in a single update — preventing the model from exploiting the reward model by producing bizarre but high-scoring outputs.
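The update restriction can be sketched as the per-sample clipped surrogate term (a scalar simplification of PPO's batched objective; names are illustrative):

```python
def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for one action.

    `ratio` is pi_new(a|s) / pi_old(a|s); `advantage` estimates how much
    better the action was than average. Taking the min is pessimistic:
    the policy gains nothing from pushing the ratio outside [1-eps, 1+eps].
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# A large policy shift (ratio 1.5) earns no more credit than ratio 1.2:
print(ppo_clipped_term(1.5, 1.0))  # 1.2
```

This is what keeps any single update small, even when the reward signal strongly favors a particular output.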


A critical safeguard is the KL divergence penalty: the RL loss includes a term that penalizes the fine-tuned model for diverging too far from the original SFT model. Without this constraint, the model learns to produce responses that fool the reward model rather than responses that are genuinely useful — Goodhart's Law in action.

The full PPO objective combines reward maximization with this regularization:

Objective = E[r_φ(x, y)] - β · KL(π_θ || π_ref)

where r_φ is the frozen reward model, π_θ is the current policy, π_ref is the frozen SFT reference model, and β controls the strength of the KL penalty.
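In practice the KL term is often estimated per sample from the two models' log-probabilities of the generated response. A minimal sketch of that shaped reward (plain Python; a single-sample KL estimate, names illustrative):

```python
def shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                  beta: float = 0.1) -> float:
    """Reward-model score minus a KL penalty for one sampled response.

    The sampled KL estimate is log pi_theta(y|x) - log pi_ref(y|x):
    positive when the policy assigns the response more probability
    than the reference does.
    """
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate

# No divergence from the reference: the reward passes through unchanged.
print(shaped_reward(1.0, -10.0, -10.0))  # 1.0
# The policy has drifted toward this response: part of the reward is taxed.
print(shaped_reward(1.0, -5.0, -10.0))   # 0.5
```

The policy therefore only keeps reward it can earn without straying far from the SFT distribution.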

DPO: Direct Preference Optimization

RLHF with PPO works, but it is complex. You need to maintain four models simultaneously during training: the policy, the reference SFT model, the reward model, and a value function for PPO. Training is unstable and sensitive to hyperparameters.

Direct Preference Optimization (DPO), published in 2023, reformulates the problem. The key insight is that when you derive the optimal policy for the RLHF objective analytically, it can be expressed in closed form:

π*(y|x) ∝ π_ref(y|x) · exp(r(x, y) / β)

This means the reward function can be implicitly represented through the ratio of the optimal policy to the reference policy. DPO substitutes this relationship back into the preference loss, eliminating the need for a separate reward model entirely. Training directly optimizes the policy on preference pairs (prompt, chosen_response, rejected_response):

L_DPO = -log σ(β · log(π_θ(y_w|x) / π_ref(y_w|x)) - β · log(π_θ(y_l|x) / π_ref(y_l|x)))

Here y_w is the preferred (winning) response and y_l is the dispreferred (losing) response. The model learns to increase the relative probability of preferred responses and decrease the relative probability of dispreferred ones.

DPO is mathematically equivalent to RLHF under certain assumptions, but significantly simpler to implement and more stable to train.
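The loss above reduces to a few lines once the sequence log-probabilities are available. A scalar sketch for a single preference pair (pure Python; function and argument names are hypothetical):

```python
import math

def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair, given the summed log-probs
    of each response under the current policy and the frozen reference."""
    logits = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(logits)), via log1p for numerical stability
    return math.log1p(math.exp(-logits))

# If the policy has not moved from the reference, the loss is log(2):
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

Raising the chosen response's probability relative to the reference (or lowering the rejected one's) drives the loss below log(2), which is exactly the gradient direction described above.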

Parameter-Efficient Fine-tuning (PEFT)

Full fine-tuning updates every parameter in the model. For a 70-billion-parameter model, that means storing gradients and optimizer state for all 70B weights, which requires multiple high-end GPUs. Most practitioners cannot afford this.

PEFT methods solve this by freezing most of the model and training only a small set of additional parameters. The base model weights are unchanged; the task-specific knowledge lives in a compact add-on.

LoRA: Low-Rank Adaptation

LoRA is the most widely adopted PEFT method. The core idea: weight updates during fine-tuning tend to have low intrinsic rank. Rather than updating the full weight matrix W directly, LoRA injects two small matrices A and B alongside each target layer:

W' = W_0 + B · A

where W_0 is the frozen pre-trained weight, A has shape (r × d_in), and B has shape (d_out × r). The rank r is typically 4–64, far smaller than the original dimensions. Only A and B are trained.
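A quick back-of-envelope comparison of trainable parameters for one adapted matrix, using an illustrative 4096 × 4096 projection (the dimensions are assumptions, not from any specific model):

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Trainable parameters for one weight matrix: full fine-tuning vs. LoRA."""
    full = d_in * d_out            # updating W directly
    lora = r * d_in + d_out * r    # A: (r x d_in) plus B: (d_out x r)
    return full, lora

full, lora = lora_param_counts(4096, 4096, r=16)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 131072 0.78%
```

At rank 16, the adapter trains under 1% of the parameters of the matrix it modifies, and the ratio shrinks further as the matrix grows.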


A LoRA adapter for a 7B model might add only 4–8 million trainable parameters — roughly 0.1% of the total. Multiple LoRA adapters for different tasks can be swapped in and out at inference time without changing the base model.

QLoRA: Quantized LoRA

QLoRA extends LoRA by quantizing the base model weights to 4-bit precision before applying LoRA adapters. The base model is frozen and quantized, which drastically reduces its memory footprint. LoRA adapters are trained in 16-bit on top of the quantized base.

A 65-billion parameter model fine-tuned with QLoRA fits on a single 48GB GPU. Without quantization, the same fine-tuning would require multiple 80GB A100s. This was the first technique that made frontier-scale fine-tuning accessible outside hyperscaler infrastructure.
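The arithmetic behind that claim can be sketched directly. This counts only the frozen base weights, ignoring activations, adapter parameters, and optimizer state, so it is a lower bound rather than a full memory budget:

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for storing model weights at a given precision."""
    return n_params * bits_per_weight / 8 / 1024**3

for bits in (16, 4):
    print(f"65B weights at {bits}-bit: {weight_memory_gb(65e9, bits):.1f} GB")
# 65B weights at 16-bit: 121.1 GB
# 65B weights at 4-bit: 30.3 GB
```

At 16-bit the weights alone overflow a 48GB card; at 4-bit they leave roughly 18GB of headroom for activations and the LoRA adapters.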

Prefix Tuning and Prompt Tuning

These methods prepend a sequence of trainable "soft tokens" to the input. Unlike hard prompt engineering (where you modify the text manually), the soft token embeddings are real-valued vectors optimized by gradient descent. The model itself is fully frozen.

Prefix tuning adds trainable tokens to every transformer layer's key-value pairs. Prompt tuning — a simpler variant — adds them only to the input embedding layer. These methods are extremely parameter-efficient but generally underperform LoRA on tasks that require significant behavioral adaptation.
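For a sense of scale, the trainable-parameter count for prompt tuning is just the soft-prompt embedding table (the token count and hidden size below are illustrative assumptions):

```python
def prompt_tuning_params(n_virtual_tokens: int, d_model: int) -> int:
    """Trainable parameters for prompt tuning: one d_model-sized embedding
    per soft token, with the rest of the model fully frozen."""
    return n_virtual_tokens * d_model

# e.g. 20 soft tokens on a model with hidden size 4096:
print(prompt_tuning_params(20, 4096))  # 81920
```

Under 100k trainable parameters, versus millions for a typical LoRA adapter — which is also why prompt tuning has less capacity for significant behavioral change.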

Code Examples

The Python example below demonstrates a minimal LoRA fine-tuning setup using HuggingFace's peft library.

# Minimal LoRA fine-tuning with HuggingFace peft
# pip install transformers peft datasets accelerate torch

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset

# 1. Load base model and tokenizer
model_name = "meta-llama/Llama-3.2-1B"  # small model for demonstration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

# 2. Configure LoRA
# r=16: rank of the low-rank matrices
# target_modules: which attention projections to adapt
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
)

# 3. Wrap model with LoRA adapters
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Trainable params: ~2M out of ~1B total (~0.2%)

# 4. Prepare a tiny instruction dataset (the chat markup below is
# illustrative; real runs should use tokenizer.apply_chat_template)
raw_data = [
    {
        "text": (
            "<|system|>You are a helpful assistant.</s>"
            "<|user|>What is 2 + 2?</s>"
            "<|assistant|>2 + 2 equals 4.</s>"
        )
    },
    {
        "text": (
            "<|system|>You are a helpful assistant.</s>"
            "<|user|>Name the capital of France.</s>"
            "<|assistant|>The capital of France is Paris.</s>"
        )
    },
]
dataset = Dataset.from_list(raw_data)

def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=256,
        padding="max_length",
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
# Causal LM labels are just the input ids; in this minimal demo the loss
# is also computed on padding tokens (in practice, mask them with -100)
tokenized = tokenized.map(lambda x: {"labels": x["input_ids"]})

# 5. Train
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized,
)
trainer.train()

# 6. Save only the LoRA adapter weights (small — a few MB)
peft_model.save_pretrained("./lora-adapter")
print("Adapter saved.")

Practical Considerations

Dataset quality beats dataset size. A thousand carefully curated, diverse instruction pairs will produce a better SFT model than ten thousand noisy ones. Mislabeled examples or inconsistent formatting will damage the model's instruction-following far more than a modest dataset size will.

Catastrophic forgetting is a real risk. Aggressive fine-tuning can degrade the model's general capabilities as it overfits to the task distribution. LoRA mitigates this by freezing the base weights. For full fine-tuning, regularization toward the base weights and diverse "general capability" examples in the training mix help.

When to fine-tune vs. alternatives. Fine-tuning is the right tool when you have a consistent task structure with many examples, when prompt engineering has hit a performance ceiling, or when you need to reduce inference costs by avoiding long system prompts. For knowledge injection (adding facts about your company, domain, or recent events), Retrieval-Augmented Generation (RAG) is usually cheaper, more maintainable, and more up-to-date than fine-tuning.

Fine-tuning Does Not Add Knowledge Reliably

A common misconception: fine-tuning is not a good way to inject new factual knowledge into a model. Fine-tuning teaches the model how to behave — not what facts exist in the world. When you fine-tune a model on product documentation, it learns the response format and tone, but it may still hallucinate specific facts. Use RAG when accurate recall of specific information matters; use fine-tuning when behavioral consistency matters.

Key Papers and Resources

InstructGPT (2022) — Ouyang et al., "Training language models to follow instructions with human feedback." The paper that introduced the three-stage RLHF pipeline (SFT → reward modeling → PPO) and demonstrated that RLHF significantly improved human preference ratings even over much larger base models. arxiv.org/abs/2203.02155

LoRA (2022) — Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models." Introduced the low-rank matrix injection approach that has become the default for parameter-efficient fine-tuning. arxiv.org/abs/2106.09685

QLoRA (2023) — Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs." Demonstrated 4-bit NormalFloat quantization combined with LoRA, enabling 65B model fine-tuning on a single GPU. arxiv.org/abs/2305.14314

DPO (2023) — Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." Showed that the RLHF objective can be optimized directly on preference data without PPO or an explicit reward model. arxiv.org/abs/2305.18290

FLAN (2022) — Wei et al., "Finetuned Language Models Are Zero-Shot Learners." Showed that fine-tuning on a large collection of tasks phrased as instructions greatly improves zero-shot generalization. arxiv.org/abs/2109.01652

Connected Topics

Transformer Architecture — Fine-tuning operates directly on the transformer's attention layers. Understanding query/key/value projections explains why LoRA targets those matrices specifically, and why the KL constraint is expressed in terms of the model's output distributions.

Prompt Engineering — For many tasks, a well-crafted prompt with few-shot examples achieves comparable results to fine-tuning without any training cost. Prompt engineering is the right starting point; fine-tuning becomes worthwhile when prompting hits a ceiling or when you need consistent format and tone across thousands of calls.

RAG and Retrieval — Fine-tuning and RAG are complementary, not competing. Fine-tuning aligns the model's behavior and communication style; RAG provides accurate, retrievable factual context at inference time. Production systems often combine both.
