Refusal Directions in Instruction-Tuned Models
This project explores how refusal behavior is represented inside instruction-tuned language models. The central question: is refusal encoded as a single linear direction in the residual stream, or as a higher-dimensional structure?
Across models, both patterns appear.
Detailed Writeups
These pages contain:
- Refusal vector extraction
- Cross-layer cosine analysis
- Runtime ablation experiments
- Subspace interventions (where applicable)
- Offline weight orthogonalization
- Benchmark evaluation
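
To make the steps listed above concrete, here is a minimal NumPy sketch on synthetic data. It is an illustration under stated assumptions, not the project's actual code: it assumes the refusal vector is taken as a difference of mean activations between harmful and harmless prompts (a common choice, not confirmed by this page), and every function name, shape, and the random "activations" are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Assumed extraction method: unit-normalized difference of means."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cosine_matrix(directions):
    """Pairwise cosine similarity between per-layer directions (one per row)."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T

def ablate(acts, direction):
    """Runtime ablation: subtract each activation's component along `direction`."""
    return acts - np.outer(acts @ direction, direction)

def orthogonalize(weight, direction):
    """Offline orthogonalization of a matrix that writes into the residual stream.

    Assumes `weight` has shape (d_model, d_in), so its output is `weight @ x`.
    After this, the output has no component along the unit vector `direction`.
    """
    return weight - np.outer(direction, direction @ weight)

# Synthetic stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r)

# Toy "per-layer" directions: the same direction plus small noise per layer.
layer_dirs = np.stack([r + 0.05 * rng.normal(size=d_model) for _ in range(4)])
cos = cosine_matrix(layer_dirs)

W = rng.normal(size=(d_model, 8))
W_orth = orthogonalize(W, r)
```

The single-direction case generalizes to a subspace by replacing `r` with an orthonormal basis `B` of shape `(d_model, k)` and subtracting `acts @ B @ B.T` instead of the rank-one projection.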