Benchmarking
This section outlines how benchmarking is conducted for SimpleLLaMA using the EleutherAI LM Evaluation Harness (lm-eval). This standardized framework allows consistent and comparable evaluation of language models across multiple established NLP benchmarks.
1. Overview
Benchmarking evaluates the model's reasoning, general knowledge, and commonsense understanding capabilities after different training stages (Pretraining, SFT, RLHF). The lm-eval-harness framework from EleutherAI is used for consistency with widely reported community benchmarks.
The benchmarks used are:
- HellaSwag – commonsense reasoning
- PIQA – physical reasoning
- ARC (Easy & Challenge) – scientific question answering
All results are reported as Normalized Accuracy (%) to align with other LLM benchmark tables.
2. Setup Instructions
Benchmarking requires both the SimpleLLaMA repository and the EleutherAI evaluation harness to be installed.
Step 1 — Clone the EleutherAI Harness
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
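If you want to confirm the editable install succeeded, a quick optional check from Python is to import the package and look at where it resolves. This is a minimal sketch; the exact path will differ on your machine, but it should point into the cloned lm-evaluation-harness directory rather than a previously installed copy.

```python
# Optional sanity check: confirm the editable install resolves to the cloned repo.
import lm_eval

print(lm_eval.__file__)  # expected to end in .../lm-evaluation-harness/lm_eval/__init__.py
```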
Step 2 — Integrate SimpleLLaMA Model
Copy the integration file from:
simple_llama/eleuther_harness_eval/my_custom_lm.py
into the folder:
lm-evaluation-harness/lm_eval/models/
Then edit lm-evaluation-harness/lm_eval/models/__init__.py and add my_custom_lm to the list of model imports so the harness picks up the SimpleLLaMA wrapper.
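The exact registration mechanism depends on the harness version you cloned; as a minimal sketch, the edit usually amounts to adding one import line alongside the built-in model modules:

```python
# lm-evaluation-harness/lm_eval/models/__init__.py
# Add the SimpleLLaMA wrapper next to the existing model imports.
# (The surrounding import list varies between harness versions; my_custom_lm is
# the file copied in from simple_llama/eleuther_harness_eval.)
from . import my_custom_lm
```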
Step 3 — Extract Model State Dict
Before evaluation, the model checkpoint and config must be exported into a format that lm-eval can read.
For example, if the checkpoint model_50B_2146L_4096MSQ.pth is stored in the simple_llama/pretraining/checkpoints folder and you wish to evaluate it, proceed as follows.
From within the simple_llama/eleuther_harness_eval directory, run:
python3 extract_state_dict.py -i ../pretraining/checkpoints/model_50B_2146L_4096MSQ.pth
This will create two files under save_dir/:
- model_50B_2146L_4096MSQ_sd.pth – state dictionary
- model_50B_2146L_4096MSQ_config.pth – model configuration
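If you want to sanity-check the extracted files before running the harness, both are ordinary torch artifacts and can be loaded directly. The snippet below is illustrative; it assumes the config file deserializes to a dict-like object, which may differ from the exact format written by extract_state_dict.py.

```python
import torch

# Load the extracted state dict and config (CPU is fine for inspection).
state_dict = torch.load("save_dir/model_50B_2146L_4096MSQ_sd.pth", map_location="cpu")
config = torch.load("save_dir/model_50B_2146L_4096MSQ_config.pth", map_location="cpu")

# Print a few parameter names and shapes to confirm the export looks sane.
for name, tensor in list(state_dict.items())[:5]:
    print(f"{name}: {tuple(tensor.shape)}")

print(config)  # assumed to be a dict-like object of model hyperparameters
```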
3. Running Evaluation
From the root of the SimpleLLaMA repository, run the following command:
lm_eval --model my_custom_llm \
--model_args tokenizer_path=dataset/bpe_8k.json,config_path=eleuther_harness_eval/save_dir/model_50B_2146L_4096MSQ_config.pth,checkpoint_path=eleuther_harness_eval/save_dir/model_50B_2146L_4096MSQ_sd.pth,pretrain_model=True \
--tasks hellaswag,piqa,arc_easy,arc_challenge \
--batch_size 32 \
--device cuda:0
Arguments:
- tokenizer_path: Path to the trained tokenizer JSON file.
- config_path: Path to the saved config file generated by extraction.
- checkpoint_path: Path to the extracted model weights.
- pretrain_model: Set to True for pretrained model evaluation, or False for SFT/RLHF variants.
For comparison, you can benchmark against existing open models, such as:
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-1.4b-v0,use_auth_token=True --tasks hellaswag --batch_size 32 --device cuda:0
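The same evaluation can also be driven from Python instead of the CLI. This is a hedged sketch: lm_eval.simple_evaluate is exposed in recent harness releases (older versions provide it as lm_eval.evaluator.simple_evaluate), and the model and model_args strings below simply mirror the CLI invocation above.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="my_custom_llm",
    model_args=(
        "tokenizer_path=dataset/bpe_8k.json,"
        "config_path=eleuther_harness_eval/save_dir/model_50B_2146L_4096MSQ_config.pth,"
        "checkpoint_path=eleuther_harness_eval/save_dir/model_50B_2146L_4096MSQ_sd.pth,"
        "pretrain_model=True"
    ),
    tasks=["hellaswag", "piqa", "arc_easy", "arc_challenge"],
    batch_size=32,
    device="cuda:0",
)

print(results["results"])  # per-task metrics, the same numbers the CLI prints
```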
4. Understanding Results
After evaluation, results are displayed in the terminal and saved under:
lm-evaluation-harness/results/
Each benchmark reports metrics such as:
- acc – raw task accuracy
- acc_norm – length-normalized accuracy, which corrects for multiple-choice answer-length bias
Example output snippet:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.2491 | ± | 0.0126 |
| | | none | 0 | acc_norm | ↑ | 0.2739 | ± | 0.0130 |
| arc_easy | 1 | none | 0 | acc | ↑ | 0.5926 | ± | 0.0101 |
| | | none | 0 | acc_norm | ↑ | 0.5396 | ± | 0.0102 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.3426 | ± | 0.0047 |
| | | none | 0 | acc_norm | ↑ | 0.4062 | ± | 0.0049 |
| piqa | 1 | none | 0 | acc | ↑ | 0.6654 | ± | 0.0110 |
| | | none | 0 | acc_norm | ↑ | 0.6692 | ± | 0.0110 |
These results match the summary table shown in the main documentation.
| Dataset | Metric | Score |
|---|---|---|
| ARC (Challenge) | Normalized Accuracy | 27.39% |
| ARC (Easy) | Normalized Accuracy | 53.96% |
| HellaSwag | Normalized Accuracy | 40.62% |
| PIQA | Normalized Accuracy | 66.92% |
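To pull the normalized-accuracy numbers for a summary table like the one above, you can read the JSON report the harness writes after a run. The snippet below is a sketch: the output file name and the metric key spelling (acc_norm in older releases, acc_norm,none in newer ones) depend on the harness version, so it matches keys loosely; replace the path with your actual results file.

```python
import json

# Path is illustrative; point this at the JSON file produced by your run.
with open("lm-evaluation-harness/results/results.json") as f:
    report = json.load(f)

for task, metrics in report["results"].items():
    for key, value in metrics.items():
        # Match both "acc_norm" and "acc_norm,none", but skip stderr entries.
        if key.startswith("acc_norm") and "stderr" not in key:
            print(f"{task}: normalized accuracy = {value:.2%}")
```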
5. Notes & Recommendations
- Batch size can be adjusted depending on VRAM availability.
- Ensure torchvision and hf_transfer are installed in the environment.
- The integration supports both Pretrained and SFT checkpoints for evaluation.
- Results can vary slightly based on sequence length and temperature settings.
- Check out simple_llama/eleuther_harness_eval/directions.txt for further directions.
Normalized Accuracy Note
The normalized accuracy (acc_norm) metric corrects for a model's natural bias toward shorter answer choices. Because the harness scores each choice by summing token log probabilities, longer continuations accumulate more negative log-likelihood and are unfairly penalized. Normalized accuracy mitigates this by dividing the total log probability by the length of each completion (the harness normalizes by the byte length of the continuation), providing a fairer comparison across answer lengths.
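The effect is easy to see with a toy example. The scoring below is illustrative only, with made-up log probabilities; it normalizes by character count to keep the idea simple, whereas the harness divides by byte length.

```python
# Hypothetical summed log probabilities for two answer choices.
choices = {
    "yes": -4.0,                          # short answer, few tokens to score
    "yes, because it is heavier": -9.0,   # longer answer accumulates more negative log-likelihood
}

# Raw log-likelihood (acc): the short answer wins almost by default.
best_raw = max(choices, key=choices.get)

# Length-normalized log-likelihood (acc_norm-style): divide by answer length
# so longer answers are not penalized simply for being longer.
best_norm = max(choices, key=lambda c: choices[c] / len(c))

print("raw pick:       ", best_raw)    # -> "yes"
print("normalized pick:", best_norm)   # -> "yes, because it is heavier"
```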
6. Summary
This benchmarking setup allows SimpleLLaMA models to be evaluated directly against other open models in the EleutherAI harness. By adhering to standardized protocols, the results are transparent, reproducible, and comparable to popular models like Pythia, GPT-Neo, and LLaMA itself.
For deeper analysis or adding new tasks, see the EleutherAI lm-eval repository.