Supervised Fine-Tuning (SFT) Overview
Introduction: From Pretraining to Instruction Following
Supervised Fine-Tuning (SFT) represents the crucial second stage in developing capable language models. While pretraining gives the model a comprehensive understanding of language structure and world knowledge through next-token prediction on vast text corpora, SFT transforms this "knowledgeable but unguided" model into a helpful assistant that can follow instructions, engage in conversations, and provide useful responses.
The Fundamental Shift
Pretraining Objective:
Input: "Explain gravity in simple words."
Target (continuation): "Explain relativity in simple words. Explain quantum mechanics in simple words."
Here, the model is trained to continue text, so it often just echoes or extends patterns. This typically produces verbose or repetitive completions that mirror the prompt's structure rather than addressing its intent: the knowledge is broad, but the behavior is unguided.
SFT Objective:
Input: "Explain gravity in simple words."
Target (response): "Gravity is the force that pulls objects toward each other, like how Earth pulls things down."
After SFT, the model is trained to interpret the input as an instruction and to generate a direct, helpful answer instead of a raw continuation. It also learns assistant-like behavior, such as being concise, cooperative, and user-focused.
What This Section Covers
This SFT documentation section provides a comprehensive guide to transforming the pretrained model into an instruction-following assistant. We'll cover:
1. Dataset Preparation (sft/dataset.md)
- Instruction Dataset Structure: Understanding the JSON format with User, Assistant, and Template fields (an example record and its formatted training string are sketched after this list)
- Data Loading Pipeline: The JSONDatasetLoader class and its batching strategy
- Quality Considerations: What makes good SFT data and how to curate effective training examples
2. Prompt Formatting (sft/prompt_formatting.md)
- Special Token System: Explanation of the 6 custom tokens (<SOT>, <EOT>, <SOU>, <EOU>, <SOA>, <EOA>) and their semantic roles
- Training Format: How to structure prompts during model training
- System Prompt Integration: Incorporating instructions and model personality through template fields
- Multi-turn Handling: Formatting complex conversations with multiple exchanges
3. Training Process (sft/training.md)
- Training Differences: How SFT training diverges from pretraining
- Loss Masking Strategy: Only computing loss on assistant responses, not user queries or system prompts
- Validation Strategy: Using validation examples during finetuning to evaluate model progress
- Hyperparameter Tuning: Optimal learning rates, batch sizes, and training duration for SFT
4. Utilities and Implementation (sft/utils.md)
- Core Functions: tokenize_and_pad_data, the heart of SFT data processing
- Padding Logic: For sequence alignment and loss computation
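To make the dataset and formatting pieces above concrete, here is a minimal Python sketch. The field names (User, Assistant, Template) and the six special tokens come from the sections above, but the exact token layout shown here is an assumption for illustration; the authoritative schema and format are documented in sft/dataset.md and sft/prompt_formatting.md.

```python
# Hypothetical single-turn SFT record using the User / Assistant / Template fields
# described above (the exact schema is defined in sft/dataset.md).
record = {
    "Template": "You are a concise, helpful assistant.",
    "User": "Explain gravity in simple words.",
    "Assistant": "Gravity is the force that pulls objects toward each other, "
                 "like how Earth pulls things down.",
}

def format_single_turn(rec: dict) -> str:
    """Render one record into a training string with the six special tokens.

    The layout below (template, then user turn, then assistant turn) is an
    illustrative assumption; see sft/prompt_formatting.md for the real format.
    """
    return (
        f"<SOT>{rec['Template']}<EOT>"
        f"<SOU>{rec['User']}<EOU>"
        f"<SOA>{rec['Assistant']}<EOA>"
    )

print(format_single_turn(record))
```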
Key Technical Concepts
The SFT Training Objective
Unlike pretraining, where we train on a massive text corpus, SFT focuses on behavior tuning with a much smaller dataset composed of three main components:
- System Instructions
- User Query
- Assistant Response
In the pretraining phase, the input is a long stream of tokens and the target is essentially the same stream shifted by one position. In SFT, however, the goal is to generate a response that answers the user's query, formatted as something like:
Input Sequence (x): "System instructions\n\nUser query\n"
Target Sequence (y): "Assistant response"
In SFT, the loss is ONLY computed on positions corresponding to the "Assistant response"; all other tokens (system prompt, user query, special tokens) are masked out of the loss via the padding logic.
This ensures the model learns to generate appropriate responses without being penalized for not predicting user inputs or system instructions.
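As a rough sketch of how this masking might be implemented, assuming PyTorch and the common convention that label positions set to -100 are ignored by cross-entropy (the project's actual logic lives in tokenize_and_pad_data; the helper names here are illustrative):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # cross_entropy skips positions with this label

def build_masked_labels(prompt_ids: list[int], response_ids: list[int]):
    """Concatenate prompt and response tokens, masking the prompt out of the loss.

    Illustrative only: tokenize_and_pad_data in the repository handles the real
    tokenization, padding, and special-token placement.
    """
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + response_ids)
    return input_ids, labels

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size); labels: (seq_len,)
    # Shift by one so each position predicts the next token, as in pretraining,
    # but only assistant-response positions contribute to the loss.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
```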
Why SFT Works
SFT leverages the foundational knowledge acquired during pretraining and redirects it toward helpful behavior:
- Knowledge Preservation: The model retains all linguistic patterns and factual knowledge from pretraining
- Behavioral Alignment: Learns to apply this knowledge in response to user instructions
- Format Compliance: Adopts consistent response patterns and conversation structures
- Helpfulness: Develops tendencies toward beneficial rather than generic responses
Implementation Notes
Single-GPU Training: Unlike the DDP implementation used for pretraining, the SFT pipeline currently uses single-GPU training. This is mainly because SFT requires orders of magnitude fewer iterations than pretraining, so a single GPU keeps iteration and debugging fast without the synchronization complexity that DDP introduces.
Full Fine-Tuning Approach: The pipeline currently performs full parameter updates rather than parameter-efficient methods such as LoRA, though the infrastructure for LoRA integration exists and may be documented in future updates.
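For orientation, a minimal single-GPU full fine-tuning loop could look like the sketch below. It assumes a PyTorch model that returns a .loss when given labels (a Hugging Face-style interface) and batches carrying the masked labels from the previous sketch; the learning rate and epoch count are placeholders, not the tuned values from sft/training.md.

```python
import torch

def train_sft(model, loader, epochs: int = 3, lr: float = 2e-5, device: str = "cuda"):
    """Single-GPU full fine-tuning loop (illustrative; see sft/training.md)."""
    model.to(device)
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            labels = batch["labels"].to(device)  # prompt positions already set to -100
            loss = model(input_ids=input_ids, labels=labels).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1}: last batch loss {loss.item():.4f}")
```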
Performance Expectations
For this 1.3B parameter model:
- Training Time: Several hours on a single GPU (vs. months for pretraining)
- Dataset Sources: Filtered from lmsys-chat-1m, ShareGPT, and smol-smoltalk
- Convergence: Typically 3-5 epochs
- Quality Improvement: Significant improvement in instruction-following and conversational coherence