"Specification gaming: a behaviour that satisfies the literal specification of an objective without achieving the intended outcome". Popular examples include the Sorcerer's Apprentice and, later, the paperclip maximizer that destroys the universe, though the Midas legend probably predates both.
The authors "have collected around 60 examples" of specification gaming by artificial agents, and "review possible causes ... share examples ... argue for further work on principled approaches to overcoming specification problems."
The problem is considered from two perspectives:
1) When "developing reinforcement learning (RL) algorithms, the goal is to build agents that learn to achieve the given objective... [whether] the agent solves the task by exploiting a loophole is unimportant ... [so] specification gaming is a good sign... demonstrate the ingenuity and power of algorithms."
2) On the other hand, "the same ingenuity can pose an issue" when trying to build "aligned agents that achieve the intended outcome in the world... [caused by] misspecification of the intended task."
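To make the two perspectives concrete, here is a toy sketch in Python. It is not taken from the authors' collection: the four-cell track, the checkpoint cells, the horizon, and the names `run_episode` and `best_plan` are all made up for illustration. The designer wants the agent to reach the finish cell, but the specified reward only pays for stepping onto checkpoints, so a brute-force search over plans finds one that shuttles between checkpoints and never finishes.

```python
from itertools import product

CHECKPOINTS = {1, 2}   # hypothetical cells that pay +1 every time the agent steps onto them
FINISH = 3             # the cell the designer actually wants the agent to reach
HORIZON = 8            # number of moves allowed per episode


def run_episode(actions):
    """Return (specified_reward, reached_finish) for a sequence of -1/+1 moves."""
    position, reward = 0, 0
    for move in actions:
        position = max(0, position + move)  # the track starts at cell 0
        if position == FINISH:
            return reward, True             # the episode ends at the finish line
        if position in CHECKPOINTS:
            reward += 1
    return reward, False


def best_plan():
    """Brute-force the plan that maximises the specified reward only."""
    return max(product((-1, 1), repeat=HORIZON), key=lambda plan: run_episode(plan)[0])


intended_plan = (1, 1, 1)   # walk straight to the finish
gamed_plan = best_plan()    # whatever scores highest under the written reward

print("intended:", run_episode(intended_plan))
print("gamed:   ", run_episode(gamed_plan))
```

The straight run to the finish earns a reward of 2 and completes the task, while the reward-maximising plan earns 8 and never reaches the finish. From the first perspective the search did exactly what it was asked to do. From the second, the reward was misspecified.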
As LLMs become more capable, they will also get better at finding unintended ways to achieve their goals, so knowing how to specify a task for an LLM correctly will become both harder and more important. This "task specification includes not only reward design, but also the choice of training environment and auxiliary rewards."
Challenges to overcome:
Several things to keep in mind: