reward-lens, an open-source library, ports the mechanistic interpretability toolkit to reward models

reward-lens adapts the logit lens, direct logit attribution, activation patching, and sparse autoencoders to reward models. It uses the reward head's weight vector as the natural projection axis, in place of the vocabulary unembedding. The library enables mechanistic interpretability analysis of the reward models that shape RLHF-trained LLMs.
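The projection idea can be illustrated with a toy sketch. This is not the reward-lens API: every name here (`reward_lens`, `direct_reward_attribution`, `w_reward`, `b_reward`) is hypothetical, the activations are random stand-ins, and the final layer norm is omitted so that the reward stays exactly linear in the residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model = 4, 8

# Hypothetical stand-ins for a real model's activations at the final token:
# the embedding plus what each layer writes into the residual stream.
embed = rng.normal(size=d_model)
layer_outputs = rng.normal(size=(n_layers, d_model))

# Hypothetical reward head: a single weight vector and a scalar bias.
w_reward = rng.normal(size=d_model)
b_reward = 0.1


def reward_lens(embed, layer_outputs, w, b):
    """Project the residual stream after each layer onto the reward head."""
    resid = embed + np.cumsum(layer_outputs, axis=0)  # (n_layers, d_model)
    return resid @ w + b  # one scalar reward estimate per layer


def direct_reward_attribution(embed, layer_outputs, w):
    """Contribution of each component (embedding, then layers) to the reward."""
    components = np.vstack([embed, layer_outputs])  # (n_layers + 1, d_model)
    return components @ w  # bias is excluded from the attribution


lens = reward_lens(embed, layer_outputs, w_reward, b_reward)
attribution = direct_reward_attribution(embed, layer_outputs, w_reward)
final_reward = (embed + layer_outputs.sum(axis=0)) @ w_reward + b_reward

for layer, value in enumerate(lens):
    print(f"reward estimate after layer {layer}: {value:+.3f}")
```

Under these assumptions the last lens value equals the model's reward, and the per-component attributions plus the bias sum to that same reward. In a real reward model the final normalization would have to be handled (for example by freezing its scale) before the decomposition holds.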
