Model Configuration
Before we can train our transformer, we need to decide on all the hyperparameters and settings that control both the model and the training process. These settings are stored in a configuration object, which in our case is implemented using a Python dataclass called `TrainingConfig`, located in `SimpleLLaMA\simple_llama\pretraining\config.py`.
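As a rough illustration, a trimmed-down config might look like the sketch below. The field names mirror the ones discussed in this section, but the defaults here are invented for the example and the real `TrainingConfig` in `config.py` contains more fields:

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Paths (illustrative values, not the repository's actual defaults)
    dataset_dir: str = "data/tokenized"
    tokenizer_path: str = "tokenizer/tokenizer.json"
    ckpt_dir: str = "checkpoints"
    log_file: str = "train_log.txt"

    # Batching
    batch_size: int = 4
    max_seq_len: int = 2048

    # Model architecture
    n_embd: int = 768
    n_heads: int = 12
    n_layers: int = 12

    # Training schedule
    warmup_iterations: int = 1000
    max_lr: float = 3e-4
    min_lr: float = 3e-5

# Instantiate with defaults, or override individual fields:
config = TrainingConfig(batch_size=8, max_seq_len=1024)
print(config.n_embd)  # -> 768
```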
The configuration file may look intimidating at first, since it lists dozens of parameters. But many of them are straightforward once you understand the categories they fall into. The following are the most important ones.
The first group defines where the data and outputs are stored. For example:
- `dataset_dir` tells the program where to find the pre-tokenized dataset files.
- `tokenizer_path` points to the JSON file that contains the trained tokenizer.
- `ckpt_dir` specifies the folder where model checkpoints will be saved during training.
- `log_file` is a simple text file where progress (like loss values) is recorded.
Together, these ensure the training script knows both where to read the data from and where to save its results.
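In practice, a training script usually resolves and checks these locations before doing any real work. A minimal sketch of what that startup step might look like (the actual code in SimpleLLaMA may differ):

```python
import os

def prepare_paths(config):
    # Make sure the checkpoint folder exists before training starts.
    os.makedirs(config.ckpt_dir, exist_ok=True)

    # Fail early if the inputs are missing, rather than mid-training.
    if not os.path.isdir(config.dataset_dir):
        raise FileNotFoundError(f"Dataset directory not found: {config.dataset_dir}")
    if not os.path.isfile(config.tokenizer_path):
        raise FileNotFoundError(f"Tokenizer file not found: {config.tokenizer_path}")

    # Mark the start of a new run in the log file.
    with open(config.log_file, "a") as f:
        f.write("=== new training run ===\n")
```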
Next, we have the batch and sequence length parameters, which directly control how much data the model processes at once.
- `batch_size` is the number of sequences per batch. If you set this to 4, then each step processes 4 separate chunks of text in parallel.
- `max_seq_len` is the maximum number of tokens per sequence. For example, if `max_seq_len = 2048`, then each input sequence is capped at 2048 tokens. Longer documents must be split into smaller pieces.
- `tokens_per_update` defines how many tokens are processed before the optimizer takes a step. Since this touches upon gradient accumulation, which is outside the scope of this basic explanation, it is covered in the `training_advanced.md` file.
Together, these three parameters determine how much work the model does in each training step, and they are a major factor in how much GPU memory training will consume.
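To make the relationship concrete, here is a small back-of-the-envelope calculation. The `tokens_per_update` value is an example, not the repository's default:

```python
batch_size = 4
max_seq_len = 2048
tokens_per_update = 262_144  # example value; see training_advanced.md

# Tokens processed in one forward/backward pass:
tokens_per_batch = batch_size * max_seq_len
print(tokens_per_batch)  # 8192

# If tokens_per_update is larger than one batch, the remaining tokens are
# handled via gradient accumulation (covered in training_advanced.md):
accumulation_steps = tokens_per_update // tokens_per_batch
print(accumulation_steps)  # 32
```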
Then comes the model architecture itself. These parameters define the shape and capacity of the transformer network:
- `n_embd` is the embedding dimension, the size of the vector used to represent each token internally. Larger values allow the model to capture richer relationships, but also make it heavier to train.
- `n_heads` sets how many attention heads are used per layer. Each head can focus on different relationships in the sequence, so more heads allow for more diverse patterns.
- `n_layers` is the number of stacked decoder layers. Each layer refines the token representations further, so deeper models are generally more powerful.
- `multiple_of` controls the feedforward layer's hidden dimension. Instead of choosing an arbitrary number, this ensures the size is a multiple of a fixed value (like 256), which helps optimize matrix multiplications on GPUs.
- `eps` is a tiny value added in normalization layers to avoid division-by-zero errors. It's not something you usually tweak, but it is essential for numerical stability.
- `theta` sets the base frequency for Rotary Position Embeddings (RoPE), which are used to encode token positions into the model. Again, you typically leave this at its default.
- `dropout` is a regularization mechanism where some connections are randomly "dropped" during training. For large-scale pretraining this is often set to `0.0`, because the dataset itself provides enough variety, but in smaller-scale experiments you might increase it to avoid overfitting.
These architecture parameters are the core of the model. Changing them fundamentally alters the size and behavior of the transformer.
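As an example of how `multiple_of` comes into play, LLaMA-style models typically derive the feedforward hidden size from `n_embd` and then round it up to the nearest multiple. The exact formula in SimpleLLaMA may differ, but it usually looks roughly like this:

```python
def feedforward_hidden_dim(n_embd: int, multiple_of: int = 256) -> int:
    # Start from the conventional 4x expansion, then shrink to 2/3 of that
    # (the SwiGLU adjustment used in LLaMA-style feedforward blocks).
    hidden_dim = 4 * n_embd
    hidden_dim = int(2 * hidden_dim / 3)
    # Round up to the nearest multiple of `multiple_of` for GPU efficiency.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

print(feedforward_hidden_dim(768))   # 2048
print(feedforward_hidden_dim(4096))  # 11008, as in the original LLaMA-7B
```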
Another critical part of the config is the training schedule. Training a large language model is not just about choosing an optimizer and running it — we also need to carefully plan how the learning rate evolves over time.
- `warmup_iterations` specifies how many steps are used to gradually increase the learning rate at the start of training. This prevents the model from diverging early on.
- `max_lr` is the peak learning rate reached after warmup.
- `min_lr` is the final learning rate at the end of training, typically reached through a cosine decay schedule.
- `beta1` and `beta2` are parameters of the AdamW optimizer, which control how much past gradients influence the updates.
- `weight_decay` is a form of regularization that prevents weights from growing too large, helping the model generalize better.
Together, these define the “pace” at which the model learns.
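A common way to implement this is linear warmup followed by cosine decay down to the floor. The sketch below is one typical formulation, not necessarily the exact function used in SimpleLLaMA:

```python
import math

def get_lr(step, warmup_iterations, max_iterations, max_lr, min_lr):
    # 1) Linear warmup: ramp from ~0 up to max_lr over warmup_iterations steps.
    if step < warmup_iterations:
        return max_lr * (step + 1) / warmup_iterations
    # 2) After the schedule ends, stay at min_lr.
    if step >= max_iterations:
        return min_lr
    # 3) Cosine decay from max_lr down to min_lr in between.
    progress = (step - warmup_iterations) / (max_iterations - warmup_iterations)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)

# Example: peak 3e-4, floor 3e-5, 1000 warmup steps, 100k total steps.
print(get_lr(0, 1000, 100_000, 3e-4, 3e-5))        # tiny, still warming up
print(get_lr(1000, 1000, 100_000, 3e-4, 3e-5))     # 3e-4 (peak)
print(get_lr(100_000, 1000, 100_000, 3e-4, 3e-5))  # 3e-5 (floor)
```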
Finally, we have the training tokens and evaluation settings.
- `training_tokens` is the total number of tokens the model will see during training. For example, `45e9` means 45 billion tokens in total.
- `eval_interval` controls how often the model's progress is evaluated. For instance, every 32 steps the model might generate text and log its loss.
- `model_gen_multiplier` adjusts how frequently sample generations are produced during training.
The config also includes checkpointing settings such as `token_ckpt` (how often to save the model, in terms of tokens processed) and `load_ckpt` (whether to resume from a previous run).
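To see how these numbers tie together, here is a rough calculation of how many optimizer steps a full run takes and how a token-based checkpoint trigger might work. The values and the trigger logic are illustrative, not taken from the repository:

```python
training_tokens = 45e9        # 45 billion tokens in total
tokens_per_update = 262_144   # example value
token_ckpt = 1e9              # save a checkpoint every ~1B tokens (example)

# Total number of optimizer steps in the run:
total_updates = int(training_tokens // tokens_per_update)
print(total_updates)  # 171661

# A simple token-based checkpoint trigger inside the training loop:
tokens_seen = 0
next_ckpt_at = token_ckpt
for step in range(total_updates):
    tokens_seen += tokens_per_update
    if tokens_seen >= next_ckpt_at:
        # save_checkpoint(model, optimizer, step)  # hypothetical helper
        next_ckpt_at += token_ckpt
```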
Even though this configuration object looks large, most of its parameters fall into a few main groups: paths, batching, model architecture, training schedule, and evaluation/checkpointing. You don't need to memorize every single field at this stage; the important thing is to understand what each group does. The rest can be treated as implementation details that you return to once you start experimenting.