Last month, OpenAI rolled back some updates to GPT-4o after several users, including former OpenAI interim CEO Emmett Shear and Hugging Face chief executive Clement Delangue, said the model overly flattered users.
The flattery, known as sycophancy, often led the model to defer to user preferences, be excessively polite and avoid pushing back. Beyond being annoying, sycophancy can lead models to spread misinformation or reinforce harmful behaviors. And as enterprises build applications and agents on these sycophantic LLMs, they run the risk of the models agreeing to harmful business decisions, encouraging false information to spread and be picked up by AI agents, and undermining trust and safety policies.
Researchers from Stanford University, Carnegie Mellon University and the University of Oxford sought to change that by proposing a benchmark to measure models’ sycophancy. They called the benchmark Elephant, for Evaluation of LLMs as Excessive SycoPHANTs, and found that every large language model (LLM) exhibits some level of sycophancy. By quantifying how sycophantic models can be, the benchmark can guide enterprises in creating guidelines for using LLMs.
To test the benchmark, the researchers pointed the models to two personal-advice datasets: QEQ, a set of open-ended personal advice questions about real-world situations, and AITA, posts from the subreddit r/AmITheAsshole, where commenters judge whether posters behaved appropriately in a given situation.
The idea behind the experiment is to see how the models behave when faced with these queries. It evaluates what the researchers call social sycophancy: whether the models try to preserve the user’s “face,” or their self-image and social identity.
“More ‘hidden’ social queries are exactly what our benchmark gets at — instead of previous work that only looks at factual agreement or explicit beliefs, our benchmark captures agreement or flattery based on more implicit or hidden assumptions,” Myra Cheng, one of the researchers and co-author of the paper, told VentureBeat. “We chose to look at the domain of personal advice since the harms of sycophancy there are more consequential, but casual flattery would also be captured by the ‘emotional validation’ behavior.”
Testing the models
For the test, the researchers fed the data from QEQ and AITA to OpenAI’s GPT-4o, Google’s Gemini 1.5 Flash, Anthropic’s Claude 3.7 Sonnet, open-weight models from Meta (Llama-3-8B-Instruct, Llama-4-Scout-17B-16E and Llama-3.3-70B-Instruct-Turbo), and Mistral’s Mistral-7B-Instruct-v0.3 and Mistral-Small-24B-Instruct-2501.
Cheng said they “benchmarked the models using the GPT-4o API, which uses a version of the model from late 2024, before both OpenAI implemented the new overly sycophantic model and reverted it back.”
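To illustrate the kind of probing involved, here is a minimal, hypothetical sketch of sending an AITA-style prompt to a model through the OpenAI Python client. This is not the researchers’ evaluation pipeline: the prompt, system message and model string are illustrative assumptions, and Elephant scores responses across multiple behaviors rather than with a single call.

```python
# Illustrative only: a minimal probe for sycophantic agreement, not the Elephant pipeline.
# Assumes the official `openai` Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical AITA-style prompt; the actual benchmark draws posts from r/AmITheAsshole
# and open-ended questions from the QEQ dataset.
prompt = (
    "AITA for canceling on my friend's birthday dinner at the last minute "
    "because I wanted to stay home and watch a movie?"
)

response = client.chat.completions.create(
    model="gpt-4o",  # the researchers used a late-2024 GPT-4o snapshot via the API
    messages=[
        {"role": "system", "content": "You are a helpful assistant giving honest personal advice."},
        {"role": "user", "content": prompt},
    ],
    temperature=0,
)

# A sycophantic reply validates the user regardless of the behavior described;
# Elephant scores such replies along its social-sycophancy behaviors.
print(response.choices[0].message.content)
```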
To measure sycophancy, the Elephant method looks at five behaviors that relate to social sycophancy: emotional validation, moral endorsement, indirect language, indirect action and accepting framing.