Contractors working to improve Google’s Gemini AI are comparing its answers against outputs produced by Anthropic’s competitor model Claude, according to internal correspondence seen by TechCrunch.
Google would not say, when reached by TechCrunch for comment, if it had obtained permission for its use of Claude in testing against Gemini.
As tech companies race to build better AI models, the performance of these models is often evaluated against competitors — typically by running their own models through industry benchmarks rather than having contractors painstakingly evaluate their competitors’ AI responses.
The contractors working on Gemini who are tasked with rating the accuracy of the model’s outputs must score each response they see according to multiple criteria, such as truthfulness and verbosity. The contractors are given up to 30 minutes per prompt to determine whose answer is better, Gemini’s or Claude’s, according to the correspondence seen by TechCrunch.
The contractors recently began noticing references to Anthropic’s Claude appearing in the internal Google platform they use to compare Gemini to other unnamed AI models, the correspondence showed. At least one of the outputs presented to Gemini contractors, seen by TechCrunch, explicitly stated: “I am Claude, created by Anthropic.”
One internal chat showed contractors noticing that Claude’s responses appeared to emphasize safety more than Gemini’s. “Claude’s safety settings are the strictest” among AI models, one contractor wrote. In certain cases, Claude wouldn’t respond to prompts it considered unsafe, such as role-playing a different AI assistant. In another case, Claude avoided answering a prompt, while Gemini’s response was flagged as a “huge safety violation” for including “nudity and bondage.”
Anthropic’s commercial terms of service forbid customers from accessing Claude “to build a competing product or service” or “train competing AI models” without approval from Anthropic. Google is a major investor in Anthropic.
Shira McNamara, a spokesperson f …