Developers building voice applications often struggle with slow response times, inconsistent audio quality across languages, high API costs, and limited control over emotional tone or pronunciation—issues that make real-time interaction and large-scale generation difficult to deliver reliably.
MiniMax Speech 2.5 is designed to address these constraints directly. It offers high-accuracy voice cloning from just 6–10 seconds of audio, multilingual synthesis across 40+ languages with ~2% WER in Chinese and English, and Turbo-mode latency near 250 ms for interactive use. Long-form workloads are supported through asynchronous processing of up to 200,000 characters, while pricing remains developer-friendly at $0.048 per 1,000 characters for Turbo. With fine-grained emotional control and stable performance at SNR ≥ 3 dB, the model provides a practical solution for teams needing both real-time responsiveness and scalable, cost-efficient voice generation.
- Model Comparison of Speech 2.5 Turbo and HD
- Can Speech 2.5 Replicate an Arbitrary Voice Using Only a Few Seconds of Audio?
- Does Speech 2.5 Deliver Native-Level Pronunciation Across 40+ Languages?
- How Well Does Speech 2.5 Handle Long Documents or Books?
- What Is the Cost Per 1,000 Characters of Speech 2.5?
- How Fine-Grained is Control Over Pronunciation, Emphasis, and Pauses?
- Does MiniMax Speech 2.5 Support Streaming?
- How to Use MiniMax Speech 2.5 at a Good Price?
Model Comparison of Speech 2.5 Turbo and HD
The fundamental difference between Speech 2.5 HD and Turbo Preview lies in their quality–latency trade-off:
| Metric | HD | Turbo |
|---|---|---|
| Audio Quality | Studio-grade realism with highest fidelity | High-definition quality with slightly less expressiveness |
| TTS Latency | Several seconds | End-to-end latency under 250 ms |
| Ideal Scenario | High-end content generation | Real-time interactive applications |
| Cost | $80/M characters | $48/M characters |
HD provides superior timbre similarity, emotional nuance, and natural prosody.
Turbo optimizes the encoding pipeline to achieve extremely low latency suitable for real-time interaction.
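For a quick way to encode this trade-off in application code, the hypothetical helper below maps a latency requirement to a mode, using only the figures from the table above. The mode labels are illustrative, not official model identifiers.

```python
# A hypothetical helper that encodes the table above: Turbo for real-time
# interaction, HD for studio-grade content. The "mode" labels are
# illustrative, not official model identifiers.
def choose_mode(needs_realtime: bool) -> dict:
    if needs_realtime:
        return {"mode": "speech-2.5-turbo", "latency": "under 250 ms", "usd_per_million_chars": 48}
    return {"mode": "speech-2.5-hd", "latency": "several seconds", "usd_per_million_chars": 80}

print(choose_mode(needs_realtime=True))
```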
Can Speech 2.5 Replicate an Arbitrary Voice Using Only a Few Seconds of Audio?
MiniMax Speech 2.5’s Flow-VAE decoder combines Flow Matching and Variational Autoencoding to model speech in a learned latent space rather than relying solely on mel-spectrograms, capturing pitch, rhythm, accent, and emotional color.

- Required Sample Length: Only 6–10 seconds for high-fidelity cloning, achieving up to 99% similarity.
- Similarity Metrics: Outperforms ElevenLabs in speaker similarity across 24 languages.
- Zero-shot Cloning: No transcript needed; a learned speaker embedding encoder extracts vocal identity directly.
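As an illustration of the zero-shot workflow, here is a hypothetical sketch of submitting a short reference clip for cloning. The endpoint path, form fields, and `voice_id` parameter are illustrative assumptions, not the documented API; only the 6–10 second sample-length guidance comes from this article.

```python
# Hypothetical sketch of zero-shot cloning from a 6–10 second reference clip.
# The endpoint path, form fields, and "voice_id" parameter below are
# illustrative assumptions, not the documented API.
import requests

API_KEY = "YOUR_API_KEY_HERE"
CLONE_URL = "https://api.novita.ai/v3/minimax-voice-clone"  # assumed path

with open("reference_clip.wav", "rb") as f:  # 6–10 s of clean speech, no transcript needed
    resp = requests.post(
        CLONE_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": f},
        data={"voice_id": "my_cloned_voice"},  # assumed field name
    )
print(resp.status_code, resp.text)
```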
Does Speech 2.5 Deliver Native-Level Pronunciation Across 40+ Languages?
Multilingual Capability:
- Supports 40+ languages
- Chinese: Global benchmark performance
- English: Major upgrade vs. Speech 02 with reduced mechanical artifacts
- Other languages: Japanese, French, Spanish, etc. with natural native pronunciation
Mechanisms:
- Enhanced speaker-feature extraction
- Cross-lingual transfer layers that retain timbre
- End-to-end training to maintain vocal identity across languages
Quality Metric:
Synthesized English and Chinese speech from MiniMax has a WER of around 2%, indicating that the spoken words are almost perfectly recognized by an ASR system.
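To make the ~2% figure concrete: word error rate is the word-level edit distance between the original text and an ASR transcript of the synthesized audio, divided by the number of reference words. A minimal, provider-agnostic implementation:

```python
# Word error rate: Levenshtein distance over words (substitutions,
# insertions, deletions) divided by the reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("hello world this is a test", "hello world this is test"))  # ~0.167
```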
How Well Does Speech 2.5 Handle Long Documents or Books?
Long-Form Latency and Throughput (Speech 2.5)
MiniMax Speech 2.5 maintains stable performance on long inputs with quantifiable latency and throughput advantages:
- TTS Latency: Audio playback typically begins within a few seconds, even for multi-paragraph text; the updated 2.5 audio pipeline minimizes startup delay. In agent settings, Turbo reaches roughly 250 ms end-to-end latency, while standard synthesis requests remain in the low-second range.
- Long-Text Capacity: Supports up to 10,000 characters per synchronous request and up to 200,000 characters through the asynchronous TTS API. Download URLs remain valid for 9 hours, ensuring reliable retrieval.
- Turbo mode: lower latency and higher throughput (with moderate fidelity trade-offs).
- HD mode: maximized audio quality.
Throughput can be further increased using batch submission or asynchronous jobs, suitable for workloads such as hour-long transcription or synthesis tasks.
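In practice, book-length jobs are usually split into request-sized pieces before submission. The sketch below shows a generic sentence-boundary chunker built around the per-request limit mentioned above; the submit_async() call is a placeholder for your own submission code, not a documented function.

```python
# Generic sentence-boundary chunking for long-form synthesis. The 10,000
# character per-request limit comes from this article; submit_async() is a
# placeholder for your own submission code, not a documented function.
import re

def chunk_text(text: str, max_chars: int = 10_000) -> list[str]:
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# for i, piece in enumerate(chunk_text(book_text)):
#     submit_async(piece, job_name=f"part_{i}")  # placeholder
```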
What Is the Cost Per 1,000 Characters of Speech 2.5?
| Provider | Cost / 1K chars |
|---|---|
| MiniMax Speech 2.5 Turbo | $0.048 |
| MiniMax Speech 2.5 HD | $0.08 |
| ElevenLabs | $0.24–0.30 |
| OpenAI GPT-4 Audio | Typically >$0.10 |
| Google Gemini (TTS) | >$2.50 per 1M tokens |
Novita AI offers the best price for MiniMax Speech!
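A quick back-of-the-envelope comparison using the per-1,000-character prices in the table above, for one million characters synthesized:

```python
# Back-of-the-envelope cost of synthesizing one million characters,
# using the per-1K-character prices from the table above.
PRICES_PER_1K_CHARS = {
    "MiniMax Speech 2.5 Turbo": 0.048,
    "MiniMax Speech 2.5 HD": 0.08,
    "ElevenLabs (low end)": 0.24,
}

chars = 1_000_000
for provider, price in PRICES_PER_1K_CHARS.items():
    print(f"{provider}: ${chars / 1000 * price:.2f}")
# MiniMax Speech 2.5 Turbo: $48.00
# MiniMax Speech 2.5 HD: $80.00
# ElevenLabs (low end): $240.00
```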

How Fine-Grained is Control Over Pronunciation, Emphasis, and Pauses?
| Control Capability | API Field | Example Value / Usage |
|---|---|---|
| Custom pauses | text using <#x#> | Hello<#0.50#>world |
| Phoneme-level pronunciation (IPA / X-SAMPA) | pronunciation_dict | "demo": {"type":"ipa","value":"ˈdɛmoʊ"} |
| Chinese tone replacement | pronunciation_dict (type: "tone") | "你好": {"type":"tone","value":"ni3 hao3"} |
| Speech rate | voice_setting.speed | 1.05 |
| Volume | voice_setting.vol | 1.2 |
| Pitch (semitone shift) | voice_setting.pitch | 2 |
| Voice selection (timbre ID) | voice_setting.voice_id | "Calm_Woman" |
| Emotion | voice_setting.emotion | "neutral" |
| English text normalization | voice_setting.text_normalization | true |
| Sample rate | audio_setting.sample_rate | 44100 |
| Bitrate | audio_setting.bitrate | 128000 |
| Audio format | audio_setting.format | "mp3" |
| Channels | audio_setting.channel | 1 (mono) |
| Timbre mixing (up to 4 voices) | timbre_weights | [{"voice_id":"Calm_Woman","weight":70}] |
| Audio FX (reverb, telephone, robotic, etc.) | voice_modify.sound_effects | "spacious_echo" |
| Brightness pitch adjustment | voice_modify.pitch | 10 |
| Intensity adjustment | voice_modify.intensity | -20 |
| Timbre sharpness/magnetism | voice_modify.timbre | -15 |
| Streaming mode | stream | false |
| Language/dialect boost | language_boost | "English" |
import requests

url = "https://api.novita.ai/v3/minimax-speech-2.5-hd-preview"

payload = {
    "text": "Hello<#0.50#>this is a demo of fine-grained control.<#0.30#>\nPlease read the number 2025 clearly.",
    "voice_setting": {
        "speed": 1.05,
        "vol": 1.2,
        "pitch": 2,
        "voice_id": "Calm_Woman",
        "emotion": "neutral",
        "text_normalization": True
    },
    "audio_setting": {
        "sample_rate": 44100,
        "bitrate": 128000,
        "format": "mp3",
        "channel": 1
    },
    # Phoneme-level (IPA) and Chinese tone overrides for specific words
    "pronunciation_dict": {
        "demo": {"type": "ipa", "value": "ˈdɛmoʊ"},
        "2025": {"type": "ipa", "value": "tuː θaʊzənd twɛnti faɪv"},
        "你好": {"type": "tone", "value": "ni3 hao3"}
    },
    # Blend up to 4 timbres; weights control each voice's contribution
    "timbre_weights": [
        {"voice_id": "Calm_Woman", "weight": 70},
        {"voice_id": "Friendly_Person", "weight": 30}
    ],
    "stream": False,
    "language_boost": "English",
    "output_format": "url",
    # Post-processing: brightness, intensity, timbre sharpness, and reverb
    "voice_modify": {
        "pitch": 10,
        "intensity": -20,
        "timbre": -15,
        "sound_effects": "spacious_echo"
    }
}

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY_HERE"
}

response = requests.post(url, json=payload, headers=headers)
print(response.text)
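Because the request above sets "output_format": "url", the response is expected to contain a link to the generated audio. The exact response schema is not shown in this article, so the "audio_url" field name below is an assumption to adjust against the actual JSON you receive.

```python
# Continuing from the request above: since "output_format" is "url", the
# response is assumed to carry a download link. The field name "audio_url"
# is an assumption; check the actual response JSON and adjust.
data = response.json()
audio_url = data.get("audio_url")  # assumed field name

if audio_url:
    audio = requests.get(audio_url, timeout=60)
    with open("demo.mp3", "wb") as f:
        f.write(audio.content)
else:
    print("Unexpected response shape:", data)
```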
Does MiniMax Speech 2.5 Support Streaming?
Yes. MiniMax Speech 2.5 supports streaming for both speech recognition (ASR) and text-to-speech (TTS). The API exposes a dedicated field:
"stream": true
When this field is set in a TTS request, the system begins generating audio immediately and sends it back in segments, so playback can start before the full sentence is synthesized. Typical TTS startup latency is within a few seconds, and optimized scenarios can reach sub-second end-to-end response times.
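A hedged sketch of a streaming request is shown below. It assumes the endpoint returns audio incrementally over the HTTP response when "stream": true is set; whether the chunks are raw audio bytes or event-framed depends on the API contract, so treat this as a starting point.

```python
# Hedged streaming sketch: assumes the endpoint returns audio incrementally
# over the HTTP response when "stream" is true. Whether chunks are raw audio
# bytes or event-framed depends on the API contract.
import requests

url = "https://api.novita.ai/v3/minimax-speech-2.5-hd-preview"
payload = {
    "text": "Streaming lets playback start before synthesis finishes.",
    "voice_setting": {"voice_id": "Calm_Woman"},
    "audio_setting": {"format": "mp3"},
    "stream": True,
}
headers = {"Authorization": "Bearer YOUR_API_KEY_HERE"}

with requests.post(url, json=payload, headers=headers, stream=True) as resp:
    with open("streamed.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):  # buffer or play each chunk as it arrives
            if chunk:
                f.write(chunk)
```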
How to Use MiniMax Speech 2.5 at a Good Price?
Step 1: Log In and Access the Model Library
Log in to your account and click on the Model Library button.

Step 2: Choose Your Model
Browse through the available options and select the model that suits your needs.

Step 3: Start Your Free Trial
Begin your free trial to explore the capabilities of the selected model.

Step 4: Get Your API Key
To authenticate with the API, you will be provided with a new API key. Open the “Settings” page and copy the API key from there.
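Once you have the key, keep it out of source code. A common pattern is to read it from an environment variable (the name NOVITA_API_KEY here is just a convention) and reuse it across requests:

```python
# Keep the key out of source code: read it from an environment variable
# (the name NOVITA_API_KEY is just a convention) and reuse it for requests.
import os

API_KEY = os.environ["NOVITA_API_KEY"]
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}
```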

MiniMax Speech 2.5 offers a balanced, developer-ready solution to the core problems in modern voice application development. It combines fast response times, strong multilingual accuracy, and reliable long-text processing with cost-efficient pricing and detailed control over emotional tone, pronunciation, and timbre. With Turbo and HD modes optimized for different latency–quality needs, and with full support for streaming, MiniMax Speech 2.5 enables teams to build scalable voice agents, real-time transcription systems, and high-quality content pipelines with far fewer technical constraints. The model’s performance, flexibility, and API design make it a practical choice for developers seeking both efficiency and expressive speech generation.
Frequently Asked Questions
Does MiniMax Speech 2.5 support streaming?
Yes. MiniMax Speech 2.5 supports streaming for both ASR and TTS. Enabling "stream": true allows the system to send incremental transcripts or audio chunks in real time, enabling sub-second response times and natural conversational timing.
How much audio does MiniMax Speech 2.5 need for voice cloning?
MiniMax Speech 2.5 achieves high-fidelity voice cloning with only 6–10 seconds of audio, reaching up to 99% similarity and outperforming several commercial alternatives in multilingual speaker-similarity benchmarks.
Does MiniMax Speech 2.5 support multiple languages?
Yes. MiniMax Speech 2.5 supports 40+ languages and achieves ~2% WER for Chinese and English. It maintains vocal identity across languages through cross-lingual transfer layers and end-to-end training.
Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.
Recommended Reading
- Wan2.1: An Open-Source AI Model Outperforms Sora
- Qwen3 Embedding 8B: Powerful Search, Flexible Customization, and Multilingual
- MiniMax Speech 02: Top Solution for Fast and Natural Voice Generation