Linear Probe Penalties Reduce LLM Sycophancy

Visiting ETH MSc student Henry Papadatos and supervising CHAI PhD student Rachel Freedman published the paper “Linear Probe Penalties Reduce LLM Sycophancy” at the NeurIPS SoLaR workshop. The paper demonstrates a generalizable methodology for reducing unwanted LLM behaviors that are not sufficiently disincentivized by RLHF fine-tuning.

https://arxiv.org/pdf/2412.00967

Source: Berkeley CHAI · humancompatible.ai