Google and Cohere have officially introduced new artificial intelligence models optimized for audio and voice processing. Google released Gemini 3.1 Flash Live, an advanced audio-to-audio model designed for real-time dialogue and voice-first applications. Simultaneously, Cohere launched Cohere Transcribe, an AI algorithm built exclusively for highly accurate speech transcription. Both releases offer significant improvements in output quality, latency, and task execution over previous generations.
Gemini 3.1 Flash Live is now rolling out across multiple Google platforms. Developers can access the preview version through the Gemini Live API in Google AI Studio. Enterprise clients can utilize the technology via Gemini Enterprise for Customer Experience to automate and manage customer service interactions. For everyday consumers, the model is available now through Gemini Live and Search Live, bringing faster and more fluid voice interactions to mobile devices and Chromebooks.
Enhancing Real-Time Dialogue and Multimodal Features
Google built Gemini 3.1 Flash Live to handle the natural rhythm and speed of human speech. The model directly addresses common issues in voice AI, such as stuttering, hesitation, or user interruptions. It delivers much faster responses and can follow a conversation thread for twice as long as the previous model, keeping a user’s train of thought intact during lengthy brainstorming sessions.
The updated AI is significantly better at recognizing acoustic nuances, such as pitch and pace, compared to the 2.5 Flash Native Audio model. It can detect when a speaker is getting confused or frustrated and will dynamically adjust its tone and responses to match the situation. This makes it highly effective for enterprise customer support, where a voice agent could automatically process tasks like product return requests.
Furthermore, Gemini 3.1 Flash Live supports multimodal inputs. Users can combine speech with images to solve problems. For example, a customer dealing with a malfunctioning smart home appliance can upload a photo of the device and use voice commands to troubleshoot the issue. The model also features tool use capabilities, allowing it to retrieve relevant data from external sources, such as product documentation repositories, to assist users.
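The tool-use pattern described above can be sketched as a simple dispatch loop. Everything below is illustrative: the message format, the `lookup_docs` tool, and the product name are hypothetical stand-ins, not the actual Gemini Live API.

```python
def lookup_docs(product: str) -> str:
    # Hypothetical tool: fetch troubleshooting steps for a product
    # from a documentation repository (here, a hardcoded dict).
    docs = {"smart-thermostat": "Hold reset for 10s, then re-pair."}
    return docs.get(product, "No documentation found.")

# Registry mapping tool names the model may request to callables.
TOOLS = {"lookup_docs": lookup_docs}

def handle_reply(model_reply: dict) -> str:
    # If the model's reply requests a tool call, execute it and return
    # the result; otherwise pass the model's text straight to the user.
    call = model_reply.get("tool_call")
    if call:
        return TOOLS[call["name"]](**call["args"])
    return model_reply["text"]

# Example: the model decides it needs product documentation.
reply = {"tool_call": {"name": "lookup_docs",
                       "args": {"product": "smart-thermostat"}}}
print(handle_reply(reply))  # Hold reset for 10s, then re-pair.
```

In a real deployment the `model_reply` dict would come from the model's streaming API, and the tool result would be sent back to the model so it can phrase a spoken answer, rather than being returned to the user directly.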
Setting New Benchmarks in Audio Performance
Google evaluated the new model’s tool use and reasoning capabilities through rigorous industry benchmarks. On ComplexFuncBench Audio, which measures multi-step function calling with various constraints, Gemini 3.1 Flash Live achieved a score of 90.8 percent. This represents a nearly 20 percent improvement over Google’s previous-generation model.
The AI also set a record on Scale AI’s Audio MultiChallenge, scoring 36.1 percent with its “thinking” feature enabled. This specific benchmark tests the model’s ability to follow complex instructions and perform long-horizon reasoning while navigating the interruptions and hesitations typical of real-world audio. Major companies, including Verizon, LiveKit, and The Home Depot, have already provided positive feedback after integrating the model into their workflows, highlighting its more natural conversational flow.
Global Expansion and Security Features
Because Gemini 3.1 Flash Live is inherently multilingual, Google is using this launch to expand Search Live globally. Users in more than 200 countries and territories can now engage in real-time, multimodal conversations with Search in their preferred languages.
To help prevent the spread of misinformation, Google has implemented a security feature called SynthID. All audio generated by Gemini 3.1 Flash Live includes an imperceptible SynthID watermark interwoven directly into the audio output. This ensures that AI-generated content can be reliably detected.
Cohere Transcribe Targets Speech Accuracy
Alongside Google’s release, Cohere introduced Cohere Transcribe, an AI model with a narrower focus built exclusively for transcription tasks. The company states that the algorithm is the most accurate in its category, achieving the top position on the Hugging Face Open ASR Leaderboard and demonstrating an average word error rate of just 5.42 percent.
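Word error rate, the metric cited here, is the word-level edit distance (insertions, deletions, and substitutions) between a reference transcript and the model's output, divided by the number of reference words. A minimal implementation of the standard formula:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between the
    reference and hypothesis, divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution ("the" -> "a") across six reference words: WER = 1/6.
score = wer("the cat sat on the mat", "the cat sat on a mat")
```

A reported 5.42 percent average means roughly one word in eighteen is wrong, relative to a human reference transcript, averaged over the leaderboard's test sets.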
Cohere Transcribe begins the transcript generation process by translating raw audio into mathematical representations that are easier to process. This task is performed by a Conformer algorithm, which combines a convolutional neural network, a type of AI often used for audio processing tasks, with a transformer model. A standalone transformer then uses those representations to generate the final text transcript. The model can output text in more than a dozen languages.
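The two-stage pipeline described above can be shown structurally. The functions below are placeholders, not real neural networks: they only illustrate how raw audio flows through an encoder into intermediate representations, which a separate decoder then turns into text.

```python
def conformer_encoder(audio_samples: list) -> list:
    # Placeholder for the Conformer encoder: compress raw samples into
    # a shorter sequence of "representations" (here, windowed means).
    window = 4
    return [sum(audio_samples[i:i + window]) / window
            for i in range(0, len(audio_samples) - window + 1, window)]

def transformer_decoder(representations: list) -> str:
    # Placeholder for the transformer decoder: emit one token per
    # representation. A real decoder would attend over the whole
    # sequence and emit actual words.
    return " ".join(f"tok{i}" for i in range(len(representations)))

# Pretend input: 16 raw audio samples -> 4 representations -> 4 tokens.
audio = list(range(16))
transcript = transformer_decoder(conformer_encoder(audio))
```

The point of the split is that the encoder handles the acoustic signal while the decoder handles language; the two stages can be sized and trained with different trade-offs.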
Despite its high accuracy, Cohere Transcribe operates efficiently. The model contains a total of 2 billion parameters across its components, requiring relatively little computing power to run. It is available under an open-source Apache 2.0 license, allowing companies to deploy the algorithm on their own infrastructure or through Cohere’s Model Vault managed inference service. Cohere also plans to integrate the transcription model into its North productivity platform, enabling workers to search business documents and automate repetitive tasks.
