Benchmarking
This section outlines how benchmarking is conducted for SimpleLLaMA using the EleutherAI LM Evaluation Harness (lm-eval). This standardized framework allows consistent and comparable evaluation of language models across multiple established NLP benchmarks.
1. Overview
Benchmarking evaluates the model's reasoning, general knowledge, and commonsense understanding capabilities after different training stages (Pretraining, SFT, RLHF). The lm-eval-harness framework from EleutherAI is used for consistency with widely reported community benchmarks.
The benchmarks used are:
- HellaSwag – commonsense reasoning
- PIQA – physical reasoning
- ARC (Easy & Challenge) – scientific question answering
All results are reported as Normalized Accuracy (%) to align with other LLM benchmark tables.
2. Setup Instructions
Benchmarking requires both the SimpleLLaMA repository and the EleutherAI evaluation harness to be installed.
Step 1 — Clone the EleutherAI Harness
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
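If you want to confirm the editable install succeeded, a quick optional check from Python is to import the package and look at where it resolves. This is a minimal sketch; the exact path will differ on your machine, but it should point into the cloned lm-evaluation-harness directory rather than a previously installed copy.

```python
# Optional sanity check: confirm the editable install resolves to the cloned repo.
import lm_eval

print(lm_eval.__file__)  # expected to end in .../lm-evaluation-harness/lm_eval/__init__.py
```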
Step 2 — Integrate SimpleLLaMA Model
Copy the integration file from:
simple_llama/eleuther_harness_eval/my_custom_lm.py
into the folder:
lm-evaluation-harness/lm_eval/models/
Then edit lm-evaluation-harness/lm_eval/models/__init__.py and add my_custom_lm to the list of model imports so the harness picks up the SimpleLLaMA wrapper.
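The exact registration mechanism depends on the harness version you cloned; as a minimal sketch, the edit usually amounts to adding one import line alongside the built-in model modules:

```python
# lm-evaluation-harness/lm_eval/models/__init__.py
# Add the SimpleLLaMA wrapper next to the existing model imports.
# (The surrounding import list varies between harness versions; my_custom_lm is
# the file copied in from simple_llama/eleuther_harness_eval.)
from . import my_custom_lm
```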
Step 3 — Extract Model State Dict
Before evaluation, the model checkpoint and config must be exported into a format that lm-eval can read.
For example, if the checkpoint model_50B_2146L_4096MSQ.pth is stored in the simple_llama/pretraining/checkpoints folder and you wish to evaluate it, proceed as follows.
From within the simple_llama/eleuther_harness_eval directory, run:
python3 extract_state_dict.py -i ../pretraining/checkpoints/model_50B_2146L_4096MSQ.pth
This will create two files under save_dir/:
- model_50B_2146L_4096MSQ_sd.pth – state dictionary
- model_50B_2146L_4096MSQ_config.pth – model configuration
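If you want to sanity-check the extracted files before running the harness, both are ordinary torch artifacts and can be loaded directly. The snippet below is illustrative; it assumes the config file deserializes to a dict-like object, which may differ from the exact format written by extract_state_dict.py.

```python
import torch

# Load the extracted state dict and config (CPU is fine for inspection).
state_dict = torch.load("save_dir/model_50B_2146L_4096MSQ_sd.pth", map_location="cpu")
config = torch.load("save_dir/model_50B_2146L_4096MSQ_config.pth", map_location="cpu")

# Print a few parameter names and shapes to confirm the export looks sane.
for name, tensor in list(state_dict.items())[:5]:
    print(f"{name}: {tuple(tensor.shape)}")

print(config)  # assumed to be a dict-like object of model hyperparameters
```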
3. Running Evaluation
From the root of the SimpleLLaMA repository, run the following command:
lm_eval --model my_custom_llm \
--model_args tokenizer_path=dataset/bpe_8k.json,config_path=eleuther_harness_eval/save_dir/model_50B_2146L_4096MSQ_config.pth,checkpoint_path=eleuther_harness_eval/save_dir/model_50B_2146L_4096MSQ_sd.pth,pretrain_model=True \
--tasks hellaswag,piqa,arc_easy,arc_challenge \
--batch_size 32 \
--device cuda:0
Arguments:
- tokenizer_path: Path to the trained tokenizer JSON file.
- config_path: Path to the saved config file generated by extraction.
- checkpoint_path: Path to the extracted model weights.
- pretrain_model: Set to True for pretrained model evaluation, or False for SFT/RLHF variants.
For comparison, you can benchmark against existing open models, such as:
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-1.4b-v0,use_auth_token=True --tasks hellaswag --batch_size 32 --device cuda:0
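The same evaluation can also be driven from Python instead of the CLI. This is a hedged sketch: lm_eval.simple_evaluate is exposed in recent harness releases (older versions provide it as lm_eval.evaluator.simple_evaluate), and the model and model_args strings below simply mirror the CLI invocation above.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="my_custom_llm",
    model_args=(
        "tokenizer_path=dataset/bpe_8k.json,"
        "config_path=eleuther_harness_eval/save_dir/model_50B_2146L_4096MSQ_config.pth,"
        "checkpoint_path=eleuther_harness_eval/save_dir/model_50B_2146L_4096MSQ_sd.pth,"
        "pretrain_model=True"
    ),
    tasks=["hellaswag", "piqa", "arc_easy", "arc_challenge"],
    batch_size=32,
    device="cuda:0",
)

print(results["results"])  # per-task metrics, the same numbers the CLI prints
```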
4. Understanding Results
After evaluation, results are displayed in the terminal and saved under:
lm-evaluation-harness/results/
Each benchmark reports metrics such as:
- acc – raw task accuracy
- acc_norm – length-normalized accuracy, which corrects for multiple-choice answer-length bias
Example output snippet:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | ↑ | 0.2491 | ± | 0.0126 |
| | | none | 0 | acc_norm | ↑ | 0.2739 | ± | 0.0130 |
| arc_easy | 1 | none | 0 | acc | ↑ | 0.5926 | ± | 0.0101 |
| | | none | 0 | acc_norm | ↑ | 0.5396 | ± | 0.0102 |
| hellaswag | 1 | none | 0 | acc | ↑ | 0.3426 | ± | 0.0047 |
| | | none | 0 | acc_norm | ↑ | 0.4062 | ± | 0.0049 |
| piqa | 1 | none | 0 | acc | ↑ | 0.6654 | ± | 0.0110 |
| | | none | 0 | acc_norm | ↑ | 0.6692 | ± | 0.0110 |
These results match the summary table shown in the main documentation.
| Dataset | Metric | Score |
|---|---|---|
| ARC (Challenge) | Normalized Accuracy | 27.39% |
| ARC (Easy) | Normalized Accuracy | 53.96% |
| HellaSwag | Normalized Accuracy | 40.62% |
| PIQA | Normalized Accuracy | 66.92% |
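To pull the normalized-accuracy numbers for a summary table like the one above, you can read the JSON report the harness writes after a run. The snippet below is a sketch: the output file name and the metric key spelling (acc_norm in older releases, acc_norm,none in newer ones) depend on the harness version, so it matches keys loosely; replace the path with your actual results file.

```python
import json

# Path is illustrative; point this at the JSON file produced by your run.
with open("lm-evaluation-harness/results/results.json") as f:
    report = json.load(f)

for task, metrics in report["results"].items():
    for key, value in metrics.items():
        # Match both "acc_norm" and "acc_norm,none", but skip stderr entries.
        if key.startswith("acc_norm") and "stderr" not in key:
            print(f"{task}: normalized accuracy = {value:.2%}")
```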
5. Notes & Recommendations
- Batch size can be adjusted depending on VRAM availability.
- Ensure torchvision and hf_transfer are installed in the environment.
- The integration supports both Pretrained and SFT checkpoints for evaluation.
- Results can vary slightly based on sequence length and temperature settings.
- Check out simple_llama/eleuther_harness_eval/directions.txt for further directions.
Normalized Accuracy Note
The normalized accuracy (acc_norm) metric corrects for a model's natural bias toward shorter answer choices. Because the harness scores each choice by summing token log probabilities, longer continuations accumulate more negative log-likelihood and are unfairly penalized. Normalized accuracy mitigates this by dividing the total log probability by the length of each completion (the harness normalizes by the byte length of the continuation), providing a fairer comparison across answer lengths.
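The effect is easy to see with a toy example. The scoring below is illustrative only, with made-up log probabilities; it normalizes by character count to keep the idea simple, whereas the harness divides by byte length.

```python
# Hypothetical summed log probabilities for two answer choices.
choices = {
    "yes": -4.0,                          # short answer, few tokens to score
    "yes, because it is heavier": -9.0,   # longer answer accumulates more negative log-likelihood
}

# Raw log-likelihood (acc): the short answer wins almost by default.
best_raw = max(choices, key=choices.get)

# Length-normalized log-likelihood (acc_norm-style): divide by answer length
# so longer answers are not penalized simply for being longer.
best_norm = max(choices, key=lambda c: choices[c] / len(c))

print("raw pick:       ", best_raw)    # -> "yes"
print("normalized pick:", best_norm)   # -> "yes, because it is heavier"
```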
6. Summary
This benchmarking setup allows SimpleLLaMA models to be evaluated directly against other open models in the EleutherAI harness. By adhering to standardized protocols, the results are transparent, reproducible, and comparable to popular models like Pythia, GPT-Neo, and LLaMA itself.
For deeper analysis or adding new tasks, see the EleutherAI lm-eval repository.