CLUSTER · TIER 3
reward-lens open-source library ports mechanistic interpretability toolkit to reward models
reward-lens adapts logit lens, direct logit attribution, activation patching, and sparse autoencoders to reward models by using the reward head weight vector as the natural projection axis, replacing the vocabulary unembedding. The library enables mechanistic interpretability analysis of the reward models that shape RLHF-trained LLMs.
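The projection idea can be sketched in a few lines of NumPy. This is an illustration only, not the reward-lens API: the names `component_outputs`, `w_reward`, `b_reward`, and `reward_lens` are hypothetical, the activations are random stand-ins, and the final layer norm that a real reward model would apply before its head is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model = 4, 8

# Hypothetical stand-ins: what each layer writes into the residual stream
# at the final token position, plus a scalar reward head (weight and bias).
component_outputs = rng.normal(size=(n_layers, d_model))
w_reward = rng.normal(size=d_model)
b_reward = 0.1

# The residual stream after layer l is the running sum of layer outputs.
residual_stream = np.cumsum(component_outputs, axis=0)


def reward_lens(residual, w, b):
    """Project intermediate residual states onto the reward head direction."""
    return residual @ w + b


# Logit-lens analogue: the reward the head would read out after each layer.
per_layer_reward = reward_lens(residual_stream, w_reward, b_reward)

# Direct-attribution analogue: each layer's linear contribution to the reward.
direct_attribution = component_outputs @ w_reward

final_reward = per_layer_reward[-1]
print(per_layer_reward)
print(direct_attribution)
```

Because the head is linear, the per-layer contributions plus the bias add up to the final reward in this simplified setting, which is what makes a single weight vector a convenient replacement for the vocabulary unembedding matrix.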
Sources
1
X mentions
—
First seen
7D ago
Velocity
—