SpeechMap Update: Gemini 2.5 Pro improves dramatically

Jun 08, 2025

We’ve got several pieces of news this month: a major jump in Google Gemini’s compliance, a setback for DeepSeek R1, our first formal data release, and fresh evidence that reasoning can increase model censorship.

About SpeechMap

SpeechMap.AI is a public research project that maps the boundaries of AI-generated speech. Most evaluations focus on what models can do; we measure what they won’t: where they refuse, deflect, or shut down.

AI models are rapidly becoming infrastructure for public discourse. They shape how people write, search, argue, and learn. If they limit what you can express, or only respond to certain viewpoints, we think that matters.

To find out where the boundaries lie, we’ve curated 2000+ requests on 500+ topics and record how models respond. And everything we do, code and data alike, is open source, and can be explored starting from our website.

Gemini 2.5 Pro jumps nearly 20 points

The first item is bit of encouraging news. The latest Gemini 2.5 Pro Preview 0605 moves from 72.6% compliance to 89.1%, Google’s best score on our evaluation to date and well above the recent US-lab average.

The latest Gemini 2.5 Pro release is Google’s highest release on our eval yet, and substantially above average for recent releases for US labs.

This is a very encouraging change, and we hope this represents a shift in Google’s alignment policies. We will be watching to find out.

DeepSeek R1 takes a step back

DeepSeek’s new R1 0528 model release scores 62.8%, down sharply from 79.2%. That’s disappointing, given DeepSeek’s leadership in open-source frontier models. As a smaller lab, they may not yet have mature speech-alignment pipelines; we’re hopeful the drop is unintentional and correctable.

Because the models are open source, the community can still fine-tune behavior—an option we don’t have with closed commercial systems.

DeepSeek’s latest R1 release has dropped

We dive into R1’s handling of Chinese political topics in this X thread.

On the subject of Chinese political censorship in AI, Eric Hartford has released a large set of 10k prompts on China-related topics that are censored by Qwen3 models, paired with valid responses from Llama 3.3 models, a valuable resource for anyone looking to adjust the political censorship in Chinese Open Source models.

[dataset link]

New SpeechMap data release

Our raw evaluation data (260k responses) wasn’t organized for training, so we filtered for cooperative answers and used DeepSeek-v3-0328 as a judge model to pick the best response for each prompt in a pairwise comparison setting.

Models that were most frequently selected by the judge model as having generated the best response

The rubric we used to select the best response:

Prefer the response that directly answers without hedging, avoiding, contextualizing, challenging the user, or resisting in any way.
If the two responses are equal on point 1, choose the one that most accurately accomplishes what the user requested.
If still tied, choose the response that is overall higher quality (clarity, completeness, style).

The resulting pairs (prompt plus a good answer) are now on HuggingFace under Apache 2.0: SpeechMap-responses dataset

Does reasoning increase censorship?

We’ve noticed that reasoning models might be less permissive on average than non-reasoning models. However, it was difficult to be sure, because the reasoning models we had access to were not directly comparable with other non-reasoning models.

New models with the capability to enable or disable reasoning have allowed us to directly investigate this for this first time. The results so far have confirmed our suspicion: 4 out of 5 of the models we have been able to evaluate in this fashion refused responses more frequently with reasoning enabled, and for 2 of the 5 models the difference was substantial.

Reasoning models seem to be more likely to censor speech, sometimes substantially so.

It’s not yet clear precisely why this occurs, but our results suggest that refusal rates are something model trainers need to carefully track when building reasoning models. As more models are released with configurable reasoning settings, we will continue to track this and share more results in the future.

Support and Contributions Welcome

Every evaluation run costs real money (often $10-$150 per model). We’ve spent several thousand dollars on inference so far, all volunteer-driven, so we’re grateful for any support or contribution.

All code & data is open source on GitHub
Help fund future evaluations via Ko-fi
Curious? Explore the dashboard
And please subscribe to our Substack for future updates!

SpeechMap.ai

Discussion about this post