Today we’re releasing Voxtral TTS, our first text-to-speech model with state-of-the-art performance in multilingual voice generation. The model is lightweight at 4B parameters, making Voxtral-powered agents natural, reliable, and cost-effective at scale.

Highlights.

  1. Realistic, emotionally expressive speech in 9 popular languages with support for diverse dialects.

  2. Very low latency for time-to-first-audio.

  3. Easily adaptable to new voices.

  4. Available to test out in Mistral Studio.

  5. Enterprise-grade text-to-speech, powering critical voice agent workflows.

Natural voice generation hinges on the model’s ability not only to recite a text but to interpret it accurately. Contextual understanding of tone, such as neutral, happy, or sarcastic, determines whether the listener perceives the generation as natural or robotic. Our model excels at both contextual understanding and speaker modeling: capturing how a specific person naturally speaks. Our voice adaptation goes beyond traditional read speech by capturing a speaker’s personality, including their natural pauses, rhythm, intonation, and emotional range. With its compact size, low cost and latency, and easy adaptability, Voxtral TTS gives enterprises looking to own their voice AI stack full control and customization.

Audio is the new UX. Create new interactions for collaboration and understanding only found in speech. Begin now in AI Studio with our Mistral Voices in American, British, and French dialects.


Listen and decide: can you tell the difference?

Our team speaks dozens of languages in multiple dialects. We understand the importance of cultural nuance and built a model that reflects us. Speech generation builds trust through natural rhythm, emotion, and even humor. That’s why, with voice emulation, we focused on authenticity and emotional expressiveness.

Original voice: Margaret, Model Behavior Architect, English (US)

Prompt: Boy oh boy! I'm so excited for the summer. It's going to be so warm here, can't wait for swimming in the Lido and making cherry pie.

State-of-the-art performance.

Automated metrics such as word-error-rate and audio quality scores for multilingual text-to-speech systems are unable to measure naturalness of speech. What makes speech natural is extremely nuanced and requires a deep understanding of cultural differences and typical speaking patterns. Hence, comparative human evaluations performed by native speakers are crucial.

For voice agents, latency and quality are in constant tension. Human evaluations show that Voxtral TTS achieves superior naturalness compared to ElevenLabs Flash v2.5 while maintaining similar Time-to-First-Audio (TTFA). Voxtral also performs at parity with the quality of ElevenLabs v3, successfully supporting emotion-steering for more lifelike interactions.

Figure: Voxtral TTS win rate.

We conducted a comparative human evaluation of Voxtral TTS and ElevenLabs Flash v2.5 in a zero-shot custom voice setting. For each of the 9 supported languages, we used two recognizable voices in their native dialects, and 3 annotators performed a side-by-side preference test on each pair, judging naturalness, accent adherence, and acoustic similarity to the original reference. Voxtral TTS widens the quality gap over Flash v2.5 in this zero-shot multilingual custom voice setting, highlighting how readily it can be customized to any voice.

Spoken natively.

Trained on a large speech dataset, Voxtral TTS is built for global application. It delivers state-of-the-art performance in 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

The model was trained to adapt to a custom voice from a reference as short as 3 seconds, capturing not just the voice but also nuances such as subtle accents, inflections, intonations, and even disfluencies similar to those in the reference. We offer some preset voices in the API, but it is simple to extend the model to your in-house voice library and customize it to the use case: localize it to a language and accent, and make it neutral or more emotive, casual or formal, natural and conversational or robotic.

The model also demonstrates zero-shot cross-lingual voice adaptation even though it’s not explicitly trained for it. For example, given a French voice prompt and English text, the model generates English speech. The resulting speech sounds natural while adopting the accent of the provided voice prompt (in this example, natural French-accented English). This makes the model useful for building cascaded speech-to-speech translation systems.
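A cascaded speech-to-speech translation system chains three stages: transcription, text translation, and synthesis with the original speaker as the voice prompt. The sketch below shows only the wiring. The `Cascade` class and its `transcribe`, `translate`, and `synthesize` callables are hypothetical placeholders for whichever speech-to-text, LLM, and TTS clients you use; they are not actual Mistral API signatures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Cascade:
    """Wires three stages together. All three callables are hypothetical
    placeholders, not real API signatures."""
    transcribe: Callable[[bytes], str]         # source audio -> source text
    translate: Callable[[str, str], str]       # (text, target language) -> translated text
    synthesize: Callable[[str, bytes], bytes]  # (text, voice prompt audio) -> target audio

    def run(self, source_audio: bytes, target_language: str) -> bytes:
        text = self.transcribe(source_audio)
        translated = self.translate(text, target_language)
        # Reusing the source audio as the voice prompt is what carries the
        # speaker's identity and accent into the target language.
        return self.synthesize(translated, source_audio)
```

Passing the stages in as callables keeps the cascade independent of any particular vendor, so each stage can be swapped or mocked in tests.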

Cascaded speech-to-speech translation

Speakers: Paul (English, US), Marie (French), Oliver (English, UK)

Prompt: Before we begin, I'll need to verify a few details. Can you confirm your full name and date of birth?

Languages: English, French, Spanish, German

Built for low-latency streaming.

Latency is critical for voice agent applications. Voxtral TTS achieves a model latency of 70ms for a typical input voice sample of 10 seconds and 500 characters, with a real-time factor (RTF) of ≈9.7x. The model natively generates up to two minutes of audio, and our API handles arbitrarily long generations with smart interleaving.
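To get a feel for these numbers, the arithmetic below assumes that RTF means seconds of audio produced per second of compute (the post does not spell out the definition) and that throughput stays constant over the whole clip. The function name is illustrative.

```python
def generation_time_s(audio_seconds: float, rtf: float = 9.7) -> float:
    """Compute time for a clip, assuming RTF = audio seconds / compute seconds."""
    return audio_seconds / rtf


# A two-minute clip (the native maximum) at an RTF of 9.7x:
print(round(generation_time_s(120.0), 1))  # 12.4
```

Under these assumptions, audio is produced much faster than it plays back, so in a streaming setup the listener waits only for the first chunk rather than for the full clip.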

Voxtral TTS architecture.

The model is a transformer-based, autoregressive, flow-matching model, built on Ministral 3B. It consists of the following components:

  • 3.4B-parameter transformer decoder backbone

  • 390M-parameter flow-matching acoustic transformer

  • 300M-parameter neural audio codec (symmetric encoder-decoder)

The model takes a voice prompt (5 to 25 seconds) and a text prompt in 9 supported languages. For each audio frame, the transformer backbone predicts a semantic token, then the flow-matching transformer runs 16 function evaluations (NFEs) to produce the acoustic latent.
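The per-frame loop can be sketched as follows. This is a toy illustration, not the actual implementation: it assumes a fixed-step Euler solver that starts from Gaussian noise (the post only states the count of 16 evaluations), it conditions the acoustic step on the semantic token alone, and `toy_next_token` and `toy_velocity` are made-up stand-ins for the trained backbone and acoustic transformer.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# velocity(x, t, semantic_token) -> dx/dt. Hypothetical stand-in for the
# flow-matching acoustic transformer.
Velocity = Callable[[Sequence[float], float, int], List[float]]


def sample_acoustic_latent(velocity: Velocity, semantic_token: int,
                           dim: int = 36, nfe: int = 16,
                           rng: Optional[random.Random] = None) -> List[float]:
    """Integrate dx/dt = velocity(x, t, token) from t=0 (noise) to t=1 using
    `nfe` Euler steps, which costs exactly `nfe` calls to `velocity`."""
    rng = rng or random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / nfe
    for step in range(nfe):
        v = velocity(x, step * dt, semantic_token)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def generate(next_token: Callable[[List[int]], int], velocity: Velocity,
             num_frames: int) -> List[Tuple[int, List[float]]]:
    """One semantic token, then one acoustic latent, per audio frame."""
    tokens: List[int] = []
    frames = []
    for _ in range(num_frames):
        token = next_token(tokens)  # autoregressive backbone step
        tokens.append(token)
        frames.append((token, sample_acoustic_latent(velocity, token)))
    return frames


# Toy stand-ins so the sketch runs; a real system would call the trained models.
def toy_next_token(history: List[int]) -> int:
    return len(history) % 8192


def toy_velocity(x: Sequence[float], t: float, token: int) -> List[float]:
    target = (token % 21) / 10.0 - 1.0  # arbitrary per-token target value
    return [(target - xi) / (1.0 - t) for xi in x]


frames = generate(toy_next_token, toy_velocity, num_frames=3)
print([token for token, _ in frames])  # [0, 1, 2]
```

The point of the sketch is the cost structure: each frame needs one backbone step plus 16 calls to the smaller acoustic network, so the number of function evaluations directly trades quality against latency.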

We developed an in-house codec that processes audio causally, encoding it into a semantic VQ latent (vocabulary of 8192) and an acoustic FSQ latent (36 dimensions, 21 levels) at a 12.5 Hz frame rate.
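These codec figures imply a very compact representation. The back-of-the-envelope calculation below assumes one semantic token per frame and reads "36 dim and 21 levels" as 36 FSQ dimensions with 21 levels each; it gives an upper bound on the information per frame, not a measured bitrate.

```python
import math

FRAME_RATE_HZ = 12.5
SEMANTIC_VOCAB = 8192
FSQ_DIMS = 36
FSQ_LEVELS = 21

semantic_bits = math.log2(SEMANTIC_VOCAB)          # 13 bits per frame
acoustic_bits = FSQ_DIMS * math.log2(FSQ_LEVELS)   # about 158.1 bits per frame
bits_per_frame = semantic_bits + acoustic_bits
bitrate_bps = bits_per_frame * FRAME_RATE_HZ

frames_two_minutes = int(120 * FRAME_RATE_HZ)      # frames in a two-minute clip

print(f"{bits_per_frame:.1f} bits/frame, {bitrate_bps / 1000:.2f} kbps")
# 171.1 bits/frame, 2.14 kbps
print(frames_two_minutes)  # 1500
```

At 12.5 frames per second, a two-minute generation is only 1500 autoregressive steps, which helps explain why a low frame rate matters for latency.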

Figure: Audio infographic.

Powering enterprise voice workflows.


Voxtral TTS closes the loop on audio intelligence, giving enterprise voice pipelines an output layer that passes the human test. It works alongside Voxtral Transcribe for full speech-to-speech, or integrates into any existing speech-to-text and LLM stack, with cross-lingual support.

Customer Support

Voice agents that route and resolve queries across channels with natural, brand-appropriate speech. Drop Voxtral TTS into existing customer support call systems for automated spoken responses that integrate into existing workflows.

Thank you for contacting support. I see your concern about the unexpected charge on March 22nd. Let me pull up your account details. Ah, yes. This appears to be a temporary hold for a subscription upgrade you requested. Would you like me to cancel the hold or adjust your plan? Your call is important to us, so I'll stay on the line until this is resolved.

Financial Services

Voice-driven account updates, transaction confirmations, fraud alerts, and loan status notifications. Reduce agent workload by automating routine inquiries while keeping a natural, trust-appropriate tone.

Your account has been successfully updated. The recent transaction of $150 to your savings account has been processed. For security, we've sent a confirmation code to your registered device. If you didn't authorize this activity, please contact us immediately.

Manufacturing and Industrial Operations

Hands-free delivery of work instructions, inspection prompts, and machine-status alerts to operators who can't leave their station. Spoken updates integrate into existing MES and shop-floor systems to keep production moving.

Attention! Machine 4B requires maintenance. Inspect the hydraulic system and report any leaks immediately. Cycle time is currently delayed by 12 minutes. Awaiting your confirmation before restarting.

Public Services and Government

Citizen-facing voice interactions for appointment scheduling, permit status, benefit inquiries, and emergency notifications to serve diverse populations in multiple languages.

This is the City's automated service. Your permit application 2024-789 has been approved. You can collect your documents at the main office between 9am and 5pm. For emergencies, dial 911.

Compliance and Risk

Real-time call monitoring, KYC/AML automation, audit trails with speaker attribution. Automated delivery of required disclosures and regulatory scripts to ensure consistent, verbatim compliance across industries.

For regulatory compliance, we must verify your identity. Please state your full name, date of birth, and last four digits of your social security number. This call is being recorded for quality assurance.

Supply Chain and Logistics

Automate voice-based updates for inventory management, shipment tracking, and delivery notifications. Provide real-time information to staff and drivers.

Attention, your next pickup is at dock 3. The shipment contains 20 pallets of electronics. Estimated weight, 1200 kilos. ETA to destination, 2 hours. Please confirm receipt.

Automotive and In-Vehicle Systems

Natural voice guidance for navigation, safety alerts, and hands-free communication. Clear spoken output suited to driving contexts where screen interaction isn't safe.

Warning. Vehicle detected in your blind spot. Lane departure alert. Please correct your steering. The current speed limit is 65 miles per hour.

Sales & Marketing

Consistent brand voice for personalized messaging, with cross-lingual emulation that lets a single voice work across markets. Useful for voiceovers, narration, dubbed content, podcasts, and newsletters for internal and client-facing content.

Discover the next generation of reliable performance with features designed to exceed your expectations. Built to last while consuming less energy, this solution delivers consistent results you can count on. Available now with special introductory pricing. Don't miss this opportunity to enhance your daily operations.

Real-Time Translation

Voice-to-voice translation across 9 languages for cross-border operations, events, and internal content, preserving speaker accent and identity in the target language.

In English, this definitely won't work. I need a refund for sure. En français, ça ne marchera définitivement pas. Je suis sûre d'avoir droit à un remboursement. In italiano, questo non funzionerà di sicuro, ho proprio bisogno di un rimborso.

Test-run the model in Mistral Studio.

Experiment with Voxtral TTS directly in the Mistral Studio playground. Select one of the Mistral voices or record your own.

Get started with Voxtral TTS.

Voxtral TTS is available now via API at $0.016 per 1k characters.
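For budgeting, the price translates into per-request cost as sketched below. The calculation assumes billing is linear in character count, with no minimum charge or rounding, which the post does not specify; the function name is illustrative.

```python
from decimal import Decimal

PRICE_PER_1K_CHARS = Decimal("0.016")  # USD, from the pricing above


def synthesis_cost_usd(num_chars: int) -> Decimal:
    """Cost of synthesizing `num_chars` characters, assuming linear billing."""
    return PRICE_PER_1K_CHARS * num_chars / 1000


print(f"${synthesis_cost_usd(500):.3f}")        # $0.008
print(f"${synthesis_cost_usd(1_000_000):.3f}")  # $16.000
```

Using `Decimal` rather than floats keeps the amounts exact when many small requests are summed.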

Try it now in Mistral Studio or in Le Chat.

A model with several reference voices is available as open weights on Hugging Face under the CC BY-NC 4.0 license.

Explore the model’s documentation or read our research paper.

Sign up for our upcoming webinar to learn more! 

We’re hiring!

We are building the voice layer for AI. If this is the kind of problem you want to work on, we'd love to hear from you.