Researchers From Stanford And DeepMind Come Up With The Idea of Using Large Language Models LLMs as a Proxy Reward Function

With the development of computing and data, autonomous agents are gaining power. The need for humans to have some say over the policies learned by agents and to check that they align with their goals becomes all the more apparent in light of this.

Currently, users either 1) create reward functions for desired actions or 2) provide extensive labeled data. Both strategies present difficulties and are unlikely to be implemented in practice. Agents are vulnerable to reward hacking, making it challenging to design reward functions that strike a balance between competing goals. Yet, a reward function can be learned from annotated examples. However, enormous amounts of labeled data are needed to capture the subtleties of individual users’ tastes and objectives, which has proven expensive. Furthermore, reward functions must be redesigned, or the dataset should be re-collected for a new user population with different goals.

New research by Stanford University and DeepMind aims to design a system that makes it simpler for users to share their preferences, with an interface that is more natural than writing a reward function and a cost-effective approach to define those preferences using only a few instances. Their work uses large language models (LLMs) that have been trained on massive amounts of text data from the internet and have proven adept at learning in context with no or very few training examples. According to the researchers, LLMs are excellent contextual learners because they have been trained on a large enough dataset to incorporate important commonsense priors about human behavior.

Lifeboat Foundation

Safeguarding Humanity

Blog

Mar 15, 2023