New text-to-speech model enhances control, quality, and naturalness in over 70 languages.
Google has unveiled Gemini 3.1 Flash TTS, a next-generation text-to-speech (TTS) model designed to deliver improved controllability, expressivity, and quality in AI speech applications. Google is pitching the model at developers, enterprises, and everyday users alike.
Google developed Gemini 3.1 Flash TTS with a focus on naturalness and expressiveness and positions it as its most advanced TTS model to date. On the Artificial Analysis TTS leaderboard, the model achieved an Elo score of 1,211.
A key feature of Gemini 3.1 Flash TTS is its granular audio tags, which let developers control vocal style, pace, and delivery using natural language commands. That fine-grained control opens up new possibilities for application development.
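As a rough illustration of what tag-driven control could look like in practice, the sketch below assembles a script in which each line carries its own style directions. The square-bracket tag syntax, the tag names, and the helpers `tag_line` and `build_script` are assumptions made for illustration; they are not taken from Google's documentation, and no API call is made.

```python
from typing import Iterable, List, Tuple


def tag_line(text: str, *tags: str) -> str:
    """Prefix one line of script with bracketed audio tags.

    The "[tag]" form is a hypothetical syntax used only for this sketch.
    """
    prefix = " ".join(f"[{t.strip()}]" for t in tags if t.strip())
    return f"{prefix} {text}".strip()


def build_script(lines: Iterable[Tuple[str, List[str]]]) -> str:
    """Join (text, tags) pairs into a single tagged script, one line each."""
    return "\n".join(tag_line(text, *tags) for text, tags in lines)


# Example: three lines, each with its own (assumed) style directions.
script = build_script([
    ("Welcome back to the show.", ["warm", "slow pace"]),
    ("And now, the big reveal!", ["excited"]),
    ("Thanks for listening.", []),
])
print(script)
```

The resulting string would then be passed to whatever TTS endpoint is in use. Keeping the tags as data rather than hand-typed text makes it easier to reuse one delivery style across many lines.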
The model’s support for over 70 languages makes it useful for global enterprises and developers looking to reach diverse markets. Additionally, SynthID watermarking is applied to all audio generated by Gemini 3.1 Flash TTS so that it can be identified as AI-generated, helping to pr…
What matters
- Gemini 3.1 Flash TTS offers improved controllability and expressivity with granular audio tags.
- Developers can fine-tune voices using Google AI Studio and export settings for consistent use.
- The model supports 70+ languages and uses SynthID watermarking to identify AI-generated audio.
Why it matters
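Finer control over vocal style, pace, and delivery gives developers more room to build expressive speech applications, support for 70+ languages helps enterprises reach diverse markets, and SynthID watermarking lets generated audio be identified as AI-generated.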
This GenAI News article was prepared in original wording using reporting and materials published by Google AI Blog. Source reference: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/.
Drafted by the GenAI News review pipeline.
