This guide gets you from API key to working audio with the GLM audio APIs — GLM TTS for text-to-speech, GLM ASR for transcription, and GLM Voice Clone for custom voice synthesis. All three are synchronous REST endpoints with no polling or webhook step. If you build voice features, transcription pipelines, or Chinese-language audio applications, this is the fastest path to a working integration.
When to Use This Quick Start
Use this guide if you need to:
- Convert text to speech with Chinese-optimized voices via POST /v3/glm-tts
- Transcribe .wav or .mp3 audio files via POST /v3/glm-asr
- Clone a voice from a short audio sample and synthesize new speech via POST /v3/glm-tts-voice-clone
All endpoints are available through the Novita AI API at https://api.novita.ai.
Prerequisites
- A Novita AI account. Get your API key from the Novita AI console.
- curl for the shell examples.
- Python 3.8+ with requests installed for the Python examples.
Set your key as an environment variable:
export NOVITA_API_KEY="your_api_key_here"
GLM TTS Quick Start
Endpoint: POST https://api.novita.ai/v3/glm-tts
Converts text up to 1024 characters into speech. The response is binary audio — write it directly to a file.
Parameters
| Parameter | Type | Default | Notes |
|---|---|---|---|
| input | string | (none) | Required. Up to 1024 characters. |
| voice | string | tongtong | System voice ID or cloned voice name. |
| speed | number | 1.0 | Range: 0.5–2.0 |
| volume | number | 1.0 | Range: 0–10 |
| response_format | string | pcm | wav or pcm. WAV includes a standard audio header; PCM is raw bytes at 24000 Hz. |
| watermark_enabled | boolean | true | Set false only if your account has watermark removal enabled. |
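The ranges in the table are easy to violate by accident, so a small client-side check can catch bad values before a request is sent. The sketch below encodes only the limits listed above. `build_tts_payload` is a hypothetical helper name, not part of any Novita SDK, and how the API itself responds to out-of-range values is not covered by this guide.

```python
MAX_INPUT_CHARS = 1024        # per-request limit from the parameter table
SPEED_RANGE = (0.5, 2.0)      # documented range for speed
VOLUME_RANGE = (0, 10)        # documented range for volume
FORMATS = {"wav", "pcm"}      # documented response_format values


def build_tts_payload(text, voice="tongtong", speed=1.0, volume=1.0,
                      response_format="pcm"):
    """Return a JSON-ready dict for POST /v3/glm-tts, or raise ValueError."""
    if not text:
        raise ValueError("input is required")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(
            f"input has {len(text)} characters; limit is {MAX_INPUT_CHARS}")
    if not SPEED_RANGE[0] <= speed <= SPEED_RANGE[1]:
        raise ValueError(f"speed {speed} is outside {SPEED_RANGE}")
    if not VOLUME_RANGE[0] <= volume <= VOLUME_RANGE[1]:
        raise ValueError(f"volume {volume} is outside {VOLUME_RANGE}")
    if response_format not in FORMATS:
        raise ValueError(f"response_format must be one of {sorted(FORMATS)}")
    # The voice value is not validated: cloned voice names are also accepted.
    return {
        "input": text,
        "voice": voice,
        "speed": speed,
        "volume": volume,
        "response_format": response_format,
    }
```

The returned dict can be passed as the json argument of requests.post.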
System voices
| Voice ID | Display name |
|---|---|
| tongtong | Tongtong (default) |
| chuichui | Chuichui |
| xiaochen | Xiaochen |
| jam | Dongdong Zoo – Jam |
| kazi | Dongdong Zoo – Kazi |
| douji | Dongdong Zoo – Douji |
| luodo | Dongdong Zoo – Luodo |
curl
curl -s -X POST https://api.novita.ai/v3/glm-tts \
  -H "Authorization: Bearer $NOVITA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "你好,欢迎使用 Novita AI 语音合成接口。",
    "voice": "tongtong",
    "speed": 1.0,
    "volume": 5,
    "response_format": "wav"
  }' \
  --output output.wav
Python
import os

import requests

response = requests.post(
    "https://api.novita.ai/v3/glm-tts",
    headers={
        "Authorization": f"Bearer {os.environ['NOVITA_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "input": "你好,欢迎使用 Novita AI 语音合成接口。",
        "voice": "tongtong",
        "speed": 1.0,
        "volume": 5,
        "response_format": "wav",
    },
)
# Raise on HTTP errors so an error body is never written to the audio file.
response.raise_for_status()
# The response body is the binary audio itself.
with open("output.wav", "wb") as f:
    f.write(response.content)
Limits: 1024 characters per request. For longer texts, split at sentence boundaries and concatenate the audio. Recommended playback sample rate: 24000 Hz. Voice names are case-sensitive.
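One way to split longer text is sketched below. `split_text` is a hypothetical helper, not part of the API. It assumes that sentence-ending punctuation (Chinese or Western) is an acceptable split point, and it falls back to a hard cut when a single sentence exceeds the limit. The period is treated as a sentence end, so a chunk boundary can occasionally land inside a decimal number.

```python
import re

MAX_CHARS = 1024  # per-request limit stated in this guide


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters,
    preferring sentence boundaries."""
    # Split after sentence-ending punctuation, keeping the punctuation.
    sentences = [s for s in re.split(r"(?<=[。！？!?.])", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split a single sentence that exceeds the limit on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if len(current) + len(sentence) > limit:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

To join the audio, one option is to request pcm for every chunk and append each response body to a single bytes buffer. Raw PCM has no header, so chunks in the same format concatenate directly. WAV responses each carry their own header and cannot simply be joined byte for byte.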
GLM ASR Quick Start
Endpoint: POST https://api.novita.ai/v3/glm-asr
Transcribes .wav or .mp3 audio using the GLM-ASR-2512 model. Audio can be passed as a URL or base64 string. Constraints: file ≤ 25 MB, duration ≤ 30 seconds.
Parameters
| Parameter | Type | Notes |
|---|---|---|
| file | string | Required. URL or base64-encoded audio. .wav or .mp3 only. |
| prompt | string | Optional. Prior transcript context, up to 8000 characters. Use for chunked transcription continuity. |
| hotwords | array | Optional. Up to 100 domain-specific terms for improved recognition accuracy. |
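The constraints above can be checked locally before uploading. The sketch below is a hypothetical pre-flight helper (`check_asr_input` is not part of the API). It assumes "25 MB" means 25 × 1024 × 1024 bytes and that the limit applies to the file before base64 encoding; neither is confirmed by this guide. Duration is checked for WAV files only, because Python's standard library cannot read MP3 length.

```python
import os
import wave

MAX_BYTES = 25 * 1024 * 1024   # assumption: "25 MB" read as 25 MiB
MAX_SECONDS = 30               # duration limit stated in this guide
MAX_HOTWORDS = 100             # hotwords limit from the parameter table
ALLOWED_EXTENSIONS = {".wav", ".mp3"}


def check_asr_input(path, hotwords=()):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension {ext!r}")
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 25 MB")
    if len(hotwords) > MAX_HOTWORDS:
        problems.append(f"{len(hotwords)} hotwords given; limit is {MAX_HOTWORDS}")
    if ext == ".wav":
        # Duration = frame count / frame rate.
        with wave.open(path, "rb") as wav_file:
            seconds = wav_file.getnframes() / wav_file.getframerate()
        if seconds > MAX_SECONDS:
            problems.append(f"audio is {seconds:.1f} s; limit is {MAX_SECONDS} s")
    return problems
```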
curl (URL input)
curl -s -X POST https://api.novita.ai/v3/glm-asr \
  -H "Authorization: Bearer $NOVITA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/sample.wav",
    "hotwords": ["Novita", "GLM"]
  }'
Python (base64 input)
import base64
import os

import requests

# Read the audio file and encode it as a base64 string.
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.novita.ai/v3/glm-asr",
    headers={
        "Authorization": f"Bearer {os.environ['NOVITA_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={"file": audio_b64, "hotwords": ["Novita", "GLM"]},
)
response.raise_for_status()
# The transcript is returned in the "text" field.
print(response.json()["text"])
Response
{ "text": "你好,欢迎使用 Novita AI 语音合成接口。" }
Handling audio longer than 30 seconds: Split into ≤30-second chunks and chain requests using the prompt field to carry transcript context between chunks:
payload = {
    "file": next_chunk_b64,
    "prompt": previous_transcript,
}
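A minimal sketch of the chunking step for WAV input follows, using only Python's standard library. `split_wav` and `build_payload` are hypothetical helper names. The sketch cuts at fixed 30-second marks, so it can split a word in half; cutting at pauses or sentence boundaries would need silence detection, which is not shown here. Trimming the prompt to its last 8000 characters is one way to respect the documented prompt limit, not a documented requirement.

```python
import io
import wave


def split_wav(wav_bytes, max_seconds=30):
    """Split WAV data into standalone WAV byte strings of at most
    max_seconds each."""
    chunks = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * max_seconds
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out = io.BytesIO()
            with wave.open(out, "wb") as dst:
                # Copy the source format; the frame count in the header
                # is corrected when the writer is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            chunks.append(out.getvalue())
    return chunks


def build_payload(chunk_b64, transcript_so_far, prompt_limit=8000):
    """Build the request body for one chunk, carrying prior transcript."""
    payload = {"file": chunk_b64}
    if transcript_so_far:
        # Keep only the most recent context if the transcript is long.
        payload["prompt"] = transcript_so_far[-prompt_limit:]
    return payload
```

Each chunk can then be base64-encoded and sent in order, appending the returned text to the running transcript before building the next payload.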
GLM Voice Clone Quick Start
Endpoint: POST https://api.novita.ai/v3/glm-tts-voice-clone
Takes a sample audio clip and synthesizes new speech in that voice. Assign a name to the cloned voice; reuse it as the voice parameter in GLM TTS without re-uploading the sample.
Parameters
| Parameter | Type | Notes |
|---|---|---|
| audio_url | string | Required. URL to sample audio. ≤ 10 MB, 3–30 s recommended. |
| input | string | Required. Text to synthesize in the cloned voice. |
| voice_name | string | Required. Unique name you assign to this voice. |
| text | string | Optional. Transcript of the sample audio; improves clone quality. |
curl
curl -s -X POST https://api.novita.ai/v3/glm-tts-voice-clone \
  -H "Authorization: Bearer $NOVITA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://example.com/voice-sample.wav",
    "input": "这是用克隆声音合成的语音示例。",
    "voice_name": "my-custom-voice",
    "text": "示例音频的文字内容"
  }'
Python
import os

import requests

response = requests.post(
    "https://api.novita.ai/v3/glm-tts-voice-clone",
    headers={
        "Authorization": f"Bearer {os.environ['NOVITA_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "audio_url": "https://example.com/voice-sample.wav",
        "input": "这是用克隆声音合成的语音示例。",
        "voice_name": "my-custom-voice",
        "text": "示例音频的文字内容",
    },
)
response.raise_for_status()
# Unlike GLM TTS, this endpoint returns JSON rather than binary audio.
data = response.json()
print(f"Voice timbre: {data['voice']}")
print(f"Audio URL: {data['audio_url']}")
Response
{
  "voice": "my-custom-voice-timbre-id",
  "audio_url": "https://..."
}
The voice value returned here can be passed directly to the GLM TTS voice parameter for future synthesis calls.
Tips: Use a clean 5–15 second sample without background noise. Provide the text transcript of the sample to improve phoneme alignment.
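To reuse a cloned voice, the returned identifier goes into an ordinary GLM TTS request. The sketch below builds that request body from a Voice Clone response. `tts_payload_for_cloned_voice` is a hypothetical helper name, and it uses the voice field of the response, as described above.

```python
def tts_payload_for_cloned_voice(clone_response, text, response_format="wav"):
    """Build a GLM TTS request body that reuses a cloned voice.

    clone_response is the parsed JSON returned by the Voice Clone call.
    """
    voice_id = clone_response.get("voice")
    if not voice_id:
        raise ValueError("clone response has no 'voice' field")
    return {
        "input": text,
        "voice": voice_id,
        "response_format": response_format,
    }
```

The result can be posted to https://api.novita.ai/v3/glm-tts with the same headers as the GLM TTS example.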
Pricing and Usage Notes
Pricing as of June 2026, from novita.ai/pricing:
| API | Price |
|---|---|
| GLM TTS | $0.28 / 1M characters |
| GLM ASR | $0.021 / 1M characters |
| GLM Voice Clone | $0.83 / 1M characters |
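For budgeting, the per-million rates convert to a per-job estimate with one multiplication. The sketch below uses the prices from the table; the dictionary keys and the `estimate_cost` name are illustrative only, and actual billing may round or meter differently.

```python
# USD per 1M characters, copied from the pricing table above.
PRICE_PER_MILLION_CHARS = {
    "glm-tts": 0.28,
    "glm-asr": 0.021,
    "glm-voice-clone": 0.83,
}


def estimate_cost(api, characters):
    """Estimated cost in USD for the given number of characters."""
    return PRICE_PER_MILLION_CHARS[api] * characters / 1_000_000


# Example: synthesizing 500,000 characters with GLM TTS.
print(f"${estimate_cost('glm-tts', 500_000):.2f}")
```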
GLM TTS is well-suited for high-volume Chinese-language synthesis where cost matters. If you need broader multilingual TTS across 30+ languages or async processing of long-form content, MiniMax Speech is the alternative to evaluate.
FAQ
What languages does GLM TTS support?
GLM TTS is optimized for Chinese (Mandarin) and handles mixed Chinese-English input. For broad multilingual coverage, use MiniMax Speech instead.
Can I reuse a cloned voice with GLM TTS?
Yes. Pass the voice_name you assigned in the Voice Clone call as the voice parameter in GLM TTS. No need to re-upload the sample.
Why is there a 30-second limit on GLM ASR?
The model processes audio synchronously. Split longer recordings at sentence boundaries and chain requests using the prompt field to carry context.
What is the difference between pcm and wav output?
PCM is raw audio bytes at 24000 Hz with no header. WAV wraps the same audio in a standard container most libraries can read directly. Use WAV unless your pipeline requires raw PCM.
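If you request pcm and later need a playable file, the raw bytes can be wrapped in a WAV container locally. A minimal sketch using Python's standard wave module follows. This guide states only the 24000 Hz sample rate; the 16-bit mono layout used as the default below is an assumption, so check it against real output before relying on it. `pcm_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container and return the WAV bytes.

    channels=1 and sample_width=2 (16-bit) are assumptions, not values
    documented in this guide.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```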
Does setting watermark_enabled: false always work?
Only if you have completed watermark removal in your account settings. The flag is otherwise ignored.
