Novita AI has expanded its speech generation suite with full support for the MiniMax Speech-2.6 series, featuring four advanced variants. This release delivers stronger multilingual expressiveness, more accurate voice replication, and broader coverage with 40 languages, making it ideal for real-time applications and long-form audio generation alike.
In this article, we’ll introduce what’s new in Minimax Speech-2.6, explain its features and key highlights, and show you how to get started with the API on Novita AI.
What is Minimax Speech-2.6?
MiniMax Speech 2.6 is the newest generation of speech technology, delivering sweeping enhancements such as ultra-low latency, improved format compatibility, and smoother, more lifelike voice output, making it ideal for powering natural and responsive Voice Agent experiences. The series includes four specialized variants, including MiniMax Speech-2.6-hd Text to Speech, MiniMax Speech-2.6-hd Async Long TTS, MiniMax Speech-2.6-turbo Text to Speech, and MiniMax Speech-2.6-turbo Async Long TTS, each designed to meet different application needs.
Minimax Speech-2.6: HD vs Turbo
| Feature | Minimax Speech HD | Minimax Speech Turbo |
|---|---|---|
| Audio Quality | Ultra-realistic, studio-grade clarity | High-definition, but less expressive |
| Processing Speed | Higher latency, quality prioritized | Low latency, instant generation |
| Cost | Higher cost due to fidelity | Cheaper than HD |
| Emotions Support | Advanced emotion expression | Emotion support, slightly less nuanced |
| Best Use Cases | Audiobooks, media, narration | Chatbots, assistants, real-time apps |
| Parameter Controls | SSML, phoneme control, advanced options | Fast TTS, emotion, multilingual, API-friendly |
Minimax Speech-2.6: Sync vs Async
| Mode | Description | Best Use Cases |
|---|---|---|
| Synchronous | Converts text to speech instantly in real time | Live voice assistants, chatbots |
| Asynchronous | Processes text separately; results delivered later | Audiobooks, batch jobs, announcements |
Minimax Speech 2.6: Key Highlights
1. Low Latency, High Responsiveness: Enabling Effortless Real-Time Interaction
The entire audio generation pipeline has been thoroughly reengineered to deliver an end-to-end latency below 250 milliseconds, reaching one of the highest standards in the industry. This breakthrough ensures that even in scenarios demanding instant feedback, such as real-time voice conversations or interactive assistants, audio generation remains smooth and uninterrupted. The result is a far more seamless and natural flow of communication, allowing every exchange to feel immediate and human-like.
2. Smarter Processing of Specialized Formats: Enabling Fluid, Accurate Information Delivery
Speech 2.6 introduces intelligent handling for a wide range of specialized text formats across multiple languages, including URLs, email addresses, phone numbers, dates, and currency expressions. The system can now interpret and vocalize these formats directly, without relying on external pre-processing steps or additional scripting. This makes it especially effective when paired with large language models or applications that manage dynamic, real-time data. By ensuring that every piece of information is read correctly and naturally from the outset, Speech 2.6 provides a more coherent, efficient, and human-sounding delivery of complex content.
3. Enhanced Naturalness: Delivering Authentic and Expressive Voices
Beyond its improvements in prosody and vocal tone, Speech 2.6 introduces the new Fluent LoRA technology, designed to achieve greater smoothness and realism in generated speech. Building on the high-fidelity voice cloning foundation of Speech 2.5, this version captures subtle features such as individual accents, rhythm, and speech habits with remarkable precision. Even when the source recordings include imperfect samples or non-native pronunciations, Fluent LoRA can faithfully reproduce the voice’s timbre while generating speech that is both fluent and expressive. This advancement allows Speech 2.6 to bring out the natural personality and clarity of every voice, making digital speech more engaging and emotionally resonant than ever before.
Minimax Speech 2.6: Applications
| Model Variant | Type | Key Strengths | Ideal Applications |
|---|---|---|---|
| MiniMax Speech-2.6-HD Text-to-Speech | High-Definition Real-Time TTS | Studio-grade clarity, expressive tone control, accurate emotion rendering | Premium virtual assistants, audiobooks, podcasts, and digital avatars where naturalness and vocal richness matter |
| MiniMax Speech-2.6-HD Async Long TTS | High-Definition Asynchronous Long-Form TTS | Stable, high-quality generation for extended content, low distortion over long durations | E-learning narration, long-form storytelling, video voiceovers, automated news reading |
| MiniMax Speech-2.6-Turbo Text-to-Speech | Fast Real-Time TTS | Ultra-low latency, lightweight for rapid response | Interactive voice agents, live customer support bots, real-time communication tools |
| MiniMax Speech-2.6-Turbo Async Long TTS | Fast Asynchronous Long-Form TTS | Optimized for quick batch synthesis of longer texts | Mass content generation, large-scale dubbing, fast audiobook or media production pipelines |
How to Use Minimax Speech-2.6 for Quick Voice Cloning on Novita AI?
Novita AI provides a REST API for voice cloning with Minimax Speech-2.6. MiniMax Speech-2.6 starts at $60 per 1M characters for the Turbo model and $100 per 1M characters for the HD model on Novita AI. You can get started in just a few simple steps using the API guide below.
Step 1: Set Parameters
Header
| Header | Type | Required | Meaning / Description |
|---|---|---|---|
| Content-Type | string | Yes | Specifies the media type of the request body. Use application/json. |
| Authorization | string | Yes | Bearer token for API authentication. Format: Bearer {API Key}. Example: Bearer sk-xxxxxx |
Body
| Parameter | Type | Meaning / Description |
|---|---|---|
speed | number | Range: [0.5, 2], default is 1.0. |
emotion | string | Controls the emotion of the synthesized speech. Currently supports 7 emotions: happy, sad, angry, fearful, disgusted, surprised, neutral. |
text | string | Text (Sync: less than 10,000 characters / Async: less than 50,000 characters) to synthesize for preview. The result is returned as an audio URL. |
model | string | Specifies the speech model for preview. Options: speech-2.6-hd, speech-2.6-turbo |
voice id | string | Supports both system voices (ID) and cloned voices (ID). The available system voice IDs such as: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman… |
Step 2: Get API Key

Step 3: A Python Example
import requests
url = "https://api.novita.ai/v3/minimax-speech-2.6-hd"
payload = {
"text": "<string>",
"voice_setting": {
"speed": 123,
"vol": 123,
"pitch": 123,
"voice_id": "<string>",
"emotion": "<string>",
"text_normalization": True
},
"audio_setting": {
"sample_rate": 123,
"bitrate": 123,
"format": "<string>",
"channel": 123
},
"pronunciation_dict": { "tone": [{}] },
"timbre_weights": [
{
"voice_id": "<string>",
"weight": 123
}
],
"stream": True,
"language_boost": "<string>",
"output_format": "<string>",
"voice_modify": {
"pitch": 123,
"intensity": 123,
"timbre": 123,
"sound_effects": "<string>"
}
}
headers = {
"Content-Type": "<content-type>",
"Authorization": "<authorization>"
}
response = requests.post(url, json=payload, headers=headers)
print(response.json())
Frequently Asked Questions
MiniMax Speech-2.6 is the latest generation of MiniMax’s speech synthesis technology, offering major upgrades in latency, naturalness, and format handling. It produces more human-like, expressive voices and supports 40 languages with stronger multilingual fluency.
MiniMax Speech-2.6 includes four specialized variants: Speech-2.6-HD Text-to-Speech, Speech-2.6-HD Async Long TTS, Speech-2.6-Turbo Text-to-Speech, and Speech-2.6-Turbo Async Long TTS, each optimized for different use cases like real-time response or long-form narration.
Yes. MiniMax Speech-2.6 can directly interpret URLs, email addresses, phone numbers, dates, and currency expressions across multiple languages, eliminating the need for manual text pre-processing.
Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.
Discover more from Novita
Subscribe to get the latest posts sent to your email.





