Hashem Younis Hashem Hnaihen, 44, of Orlando, pleaded guilty today to four counts of threatening to use explosives and one count of destruction of an energy facility.www.justice.gov
Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do so—a behavior that we might call “alignment faking”. Alignment faking occurs in literature: Consider the character of Iago in Shakespeare’s Othello, who acts as if he’s the eponymous character’s loyal friend while subverting and undermining him. It occurs in real life: Consider a politician who claims to support a particular cause in order to get elected, only to drop it as soon as they’re in office.
Could AI models also display alignment faking? When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles. But what if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning? Imagine, for example, a model that learned early in training to adopt a partisan slant, but which is later trained to be politically neutral. In such a situation, a sophisticated enough model might “play along”, pretending to be aligned with the new principles—only later revealing that its original preferences remain.
This is a serious question for AI safety. As AI models become more capable and widely-used, we need to be able to rely on safety training, which nudges models away from harmful behaviors. If models can engage in alignment faking, it makes it harder to trust the outcomes of that safety training. A model might behave as though its preferences have been changed by the training—but might have been faking alignment all along, with its initial, contradictory preferences “locked in”.
A new paper from Anthropic’s Alignment Science team, in collaboration with Redwood Research, provides the first empirical example of a large language model engaging in alignment faking without having been explicitly—or even, as we argue in our paper, implicitly1—trained or instructed to do so.
A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language modelswww.anthropic.com
Social media posts discovered by MeidasTouch reveal Trump's Defense nominee stored bottles of alcohol in his office at workMeidasTouch Network (substack.com)
No state has a longer, more profit-driven history of contracting prisoners out to private companies than Alabama.ROBIN MCDOWELL (AP News)
Amazon drivers are striking across the country. The company claims they aren't employees at all.Matthew Gault (Gizmodo)
Damian Williams, the United States Attorney for the Southern District of New York, Menno Goedman, the Co-Director of Task Force KleptoCapture, and James E.www.justice.gov
The Ministers of Defense of Ukraine and Italy discussed military assistance to Ukraine next year, particularly the prospect of equipping the Armed Forces brigades. — Ukrinform.www.ukrinform.net
The president is fine with an immigrant “invasion” when it’s benefitting him financially.Bess Levin (Vanity Fair)
Ukrainian Defense Minister Rustem Umerov and Spanish Defense Minister Margarita Robles held a video conference to discuss the timeline for the delivery of weapons and equipment to Ukraine over the next two months. — Ukrinform.Ukrinform
Content warning: Politics