Refusal Directions in Instruction-Tuned Models

This project explores how refusal behavior is represented inside instruction-tuned language models. The central question: is refusal encoded as a single linear direction in the residual stream, or as a higher-dimensional structure?

Across models, both patterns appear.


Detailed Writeups

The writeup pages cover:

  • Refusal vector extraction
  • Cross-layer cosine analysis
  • Runtime ablation experiments
  • Subspace interventions (where applicable)
  • Offline weight orthogonalization
  • Benchmark evaluation
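
As a rough orientation for the steps listed above, here is a minimal NumPy sketch of the linear-algebra core behind extraction, cross-layer cosine comparison, runtime ablation, and weight orthogonalization. It is not the project's code. It assumes the common difference-of-means recipe for the refusal vector, works on synthetic activations rather than a real model, and every function name and shape convention in it is hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (assumed extraction recipe).

    Both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def cross_layer_cosine(directions):
    """Pairwise cosine matrix for a list of per-layer direction vectors."""
    stacked = np.stack(directions)
    stacked = stacked / np.linalg.norm(stacked, axis=1, keepdims=True)
    return stacked @ stacked.T


def ablate(acts, direction):
    """Runtime ablation: subtract each activation's component along a unit direction."""
    return acts - np.outer(acts @ direction, direction)


def orthogonalize(weight, direction):
    """Offline orthogonalization: W' = (I - r r^T) W.

    Assumes weight has shape (d_model, d_in) and writes into the residual
    stream as weight @ x, and that direction is unit-norm.
    """
    return weight - np.outer(direction, direction @ weight)


# Synthetic stand-in for residual-stream activations at one layer.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[3] = 1.0  # hypothetical "refusal" axis planted in the toy data

harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * planted

r = refusal_direction(harmful, harmless)
print("cosine with planted axis:", cosine(r, planted))
print("max |projection| after ablation:", np.abs(ablate(harmful, r) @ r).max())

W_out = rng.normal(size=(d_model, 8))  # toy matrix writing into the stream
print("max |r^T W'| after orthogonalization:", np.abs(r @ orthogonalize(W_out, r)).max())
```

In a real model the activations would come from forward passes over harmful and harmless prompts at a chosen layer and token position, and the ablation would run inside a forward hook. A subspace intervention follows the same pattern with an orthonormal basis of several directions in place of the single vector.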