GLM TTS and ASR API Quick Start

This guide gets you from API key to working audio with the GLM audio APIs — GLM TTS for text-to-speech, GLM ASR for transcription, and GLM Voice Clone for custom voice synthesis. All three are synchronous REST endpoints with no polling or webhook step. If you build voice features, transcription pipelines, or Chinese-language audio applications, this is the fastest path to a working integration.

When to Use This Quick Start

Use this guide if you need to:

  • Convert text to speech with Chinese-optimized voices via POST /v3/glm-tts
  • Transcribe .wav or .mp3 audio files via POST /v3/glm-asr
  • Clone a voice from a short audio sample and synthesize new speech via POST /v3/glm-tts-voice-clone

All endpoints are available through the Novita AI API at https://api.novita.ai.

Prerequisites

  1. A Novita AI account. Get your API key from the Novita AI console.
  2. curl for the shell examples.
  3. Python 3.8+ with requests installed for the Python examples.

Set your key as an environment variable:

export NOVITA_API_KEY="your_api_key_here"

GLM TTS Quick Start

Endpoint: POST https://api.novita.ai/v3/glm-tts

Converts text up to 1024 characters into speech. The response is binary audio — write it directly to a file.

Parameters

Parameter          Type     Default   Notes
input              string   -         Required. Up to 1024 characters.
voice              string   tongtong  System voice ID or cloned voice name.
speed              number   1.0       Range: 0.5–2.0
volume             number   1.0       Range: 0–10
response_format    string   pcm       wav or pcm. WAV includes a standard audio header; PCM is raw bytes at 24000 Hz.
watermark_enabled  boolean  true      Set false only if your account has watermark removal enabled.
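
The limits in the parameter table can be checked client-side before a request is sent. The following is a minimal sketch, not part of the API: the helper name build_tts_payload is hypothetical, and it assumes the 1024-character limit counts characters the way Python's len() does.

```python
def build_tts_payload(text, voice="tongtong", speed=1.0, volume=1.0,
                      response_format="pcm"):
    """Validate arguments against the documented limits and return the JSON body."""
    if not text:
        raise ValueError("input is required")
    # Assumption: the service counts characters like Python's len().
    if len(text) > 1024:
        raise ValueError(f"input is {len(text)} characters; the limit is 1024")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0 <= volume <= 10:
        raise ValueError("volume must be between 0 and 10")
    if response_format not in ("wav", "pcm"):
        raise ValueError("response_format must be 'wav' or 'pcm'")
    # voice is not validated because cloned voice names are also accepted.
    return {
        "input": text,
        "voice": voice,
        "speed": speed,
        "volume": volume,
        "response_format": response_format,
    }
```

The returned dictionary can be passed as the json argument of requests.post. Failing early on a bad value avoids spending a round trip on a request the service would reject.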

System voices

Voice ID  Display name
tongtong  Tongtong (default)
chuichui  Chuichui
xiaochen  Xiaochen
jam       Dongdong Zoo – Jam
kazi      Dongdong Zoo – Kazi
douji     Dongdong Zoo – Douji
luodo     Dongdong Zoo – Luodo

curl

curl -s -X POST https://api.novita.ai/v3/glm-tts \
  -H "Authorization: Bearer $NOVITA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "你好,欢迎使用 Novita AI 语音合成接口。",
    "voice": "tongtong",
    "speed": 1.0,
    "volume": 5,
    "response_format": "wav"
  }' \
  --output output.wav

Python

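import os

import requests

response = requests.post(
    "https://api.novita.ai/v3/glm-tts",
    headers={
        "Authorization": f"Bearer {os.environ['NOVITA_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        # "Hello, welcome to the Novita AI speech synthesis API."
        "input": "你好,欢迎使用 Novita AI 语音合成接口。",
        "voice": "tongtong",
        "speed": 1.0,
        "volume": 5,
        "response_format": "wav",
    },
    timeout=60,  # do not hang indefinitely on a stalled connection
)
response.raise_for_status()  # fail on 4xx/5xx instead of saving an error body as audio
# The response body is the binary audio itself.
with open("output.wav", "wb") as f:
    f.write(response.content)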

Limits: 1024 characters per request. For longer texts, split at sentence boundaries and concatenate the audio. Recommended playback sample rate: 24000 Hz. Voice names are case-sensitive.
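
One way to split longer text at sentence boundaries is sketched below. It is an illustration under stated assumptions rather than a recommended implementation: the function name split_for_tts is hypothetical, the set of sentence-ending marks is a guess at what suits mixed Chinese-English text, and the limit is assumed to count characters as Python's len() does.

```python
import re

MAX_CHARS = 1024  # per-request limit stated in this guide

# Split after Chinese or Western sentence-ending marks. A Western full stop
# only counts when followed by whitespace, so "1.0" stays in one piece.
_SENTENCE_END = re.compile(r"(?<=[。！？!?])|(?<=\.)(?=\s)")

def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = [s for s in _SENTENCE_END.split(text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Fallback: a single sentence longer than the limit is cut hard.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Start a new chunk when the next sentence would not fit.
        if len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as the input of its own request. Requesting response_format pcm for every chunk keeps concatenation simple, because raw PCM has no header and the byte strings can be appended in order.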

GLM ASR Quick Start

Endpoint: POST https://api.novita.ai/v3/glm-asr

Transcribes .wav or .mp3 audio using the GLM-ASR-2512 model. Audio can be passed as a URL or base64 string. Constraints: file ≤ 25 MB, duration ≤ 30 seconds.

Parameters

Parameter  Type    Notes
file       string  Required. URL or base64-encoded audio. .wav or .mp3 only.
prompt     string  Optional. Prior transcript context, up to 8000 characters. Use for chunked transcription continuity.
hotwords   array   Optional. Up to 100 domain-specific terms for improved recognition accuracy.
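
The constraints above can be checked locally before a file is uploaded. The sketch below is illustrative only: build_asr_payload is a hypothetical helper, it treats 25 MB as 25 × 1024 × 1024 bytes measured on the raw file (the guide does not say whether the limit applies before or after base64 encoding), and it does not check the 30-second duration limit.

```python
import base64
import os

MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: 25 MB measured on the raw file
MAX_PROMPT_CHARS = 8000
MAX_HOTWORDS = 100

def build_asr_payload(path, prompt=None, hotwords=None):
    """Check a local audio file against the documented limits and build the JSON body."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in (".wav", ".mp3"):
        raise ValueError(f"unsupported extension {ext!r}; use .wav or .mp3")
    if os.path.getsize(path) > MAX_FILE_BYTES:
        raise ValueError("file exceeds 25 MB")
    with open(path, "rb") as f:
        payload = {"file": base64.b64encode(f.read()).decode("ascii")}
    if prompt:
        # Keep only the most recent context when the transcript is too long.
        payload["prompt"] = prompt[-MAX_PROMPT_CHARS:]
    if hotwords:
        if len(hotwords) > MAX_HOTWORDS:
            raise ValueError("at most 100 hotwords are allowed")
        payload["hotwords"] = list(hotwords)
    return payload
```

Trimming the prompt from the front keeps the text closest to the next chunk, which is the part most likely to help continuity.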

curl (URL input)

curl -s -X POST https://api.novita.ai/v3/glm-asr \
  -H "Authorization: Bearer $NOVITA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": "https://example.com/sample.wav",
    "hotwords": ["Novita", "GLM"]
  }'

Python (base64 input)

import base64
import os

import requests

# Read the local file and encode it as a base64 string for the "file" field.
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://api.novita.ai/v3/glm-asr",
    headers={
        "Authorization": f"Bearer {os.environ['NOVITA_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={"file": audio_b64, "hotwords": ["Novita", "GLM"]},
    timeout=60,  # do not hang indefinitely on a stalled connection
)
response.raise_for_status()
print(response.json()["text"])

Response

{ "text": "你好,欢迎使用 Novita AI 语音合成接口。" }

Handling audio longer than 30 seconds: Split into ≤30-second chunks and chain requests using the prompt field to carry transcript context between chunks:

payload = {
    "file": next_chunk_b64,
    "prompt": previous_transcript,
}
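
The chunk-and-chain approach can be sketched with the standard library for WAV input. This is an illustration, not the service's own method: split_wav and transcribe_long are hypothetical names, and transcribe is a caller-supplied function that posts one chunk to /v3/glm-asr and returns the text field. MP3 input would need a separate decoder.

```python
import io
import wave

def split_wav(wav_bytes, max_seconds=30):
    """Return a list of WAV byte strings, each at most max_seconds long."""
    chunks = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * max_seconds
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                dst.setparams(params)  # the frame count is corrected on close
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks

def transcribe_long(wav_bytes, transcribe, max_prompt_chars=8000):
    """Transcribe chunk by chunk, passing earlier text as prompt context.

    transcribe(chunk_bytes, prompt) is supplied by the caller (hypothetical).
    """
    transcript = ""
    for chunk in split_wav(wav_bytes):
        # Send only the tail of the transcript to stay within the prompt limit.
        prompt = transcript[-max_prompt_chars:] or None
        transcript += transcribe(chunk, prompt)
    return transcript
```

Fixed-length cuts can fall in the middle of a word. Cutting at a pause near the 30-second mark would likely give cleaner results but needs silence detection, which is left out here.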

GLM Voice Clone Quick Start

Endpoint: POST https://api.novita.ai/v3/glm-tts-voice-clone

Takes a sample audio clip and synthesizes new speech in that voice. Assign a name to the cloned voice; reuse it as the voice parameter in GLM TTS without re-uploading the sample.

Parameters

Parameter   Type    Notes
audio_url   string  Required. URL to sample audio. ≤ 10 MB, 3–30 s recommended.
input       string  Required. Text to synthesize in the cloned voice.
voice_name  string  Required. Unique name you assign to this voice.
text        string  Optional. Transcript of the sample audio; improves clone quality.

curl

curl -s -X POST https://api.novita.ai/v3/glm-tts-voice-clone \
  -H "Authorization: Bearer $NOVITA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://example.com/voice-sample.wav",
    "input": "这是用克隆声音合成的语音示例。",
    "voice_name": "my-custom-voice",
    "text": "示例音频的文字内容"
  }'

Python

import os

import requests

response = requests.post(
    "https://api.novita.ai/v3/glm-tts-voice-clone",
    headers={
        "Authorization": f"Bearer {os.environ['NOVITA_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "audio_url": "https://example.com/voice-sample.wav",
        # "This is a sample of speech synthesized with the cloned voice."
        "input": "这是用克隆声音合成的语音示例。",
        "voice_name": "my-custom-voice",
        # Placeholder for the transcript of the sample audio.
        "text": "示例音频的文字内容",
    },
    timeout=60,  # do not hang indefinitely on a stalled connection
)
response.raise_for_status()
data = response.json()
print(f"Voice timbre: {data['voice']}")
print(f"Audio URL: {data['audio_url']}")

Response

{
  "voice": "my-custom-voice-timbre-id",
  "audio_url": "https://..."
}

The voice value returned here can be passed directly to the GLM TTS voice parameter for future synthesis calls.

Tips: Use a clean 5–15 second sample without background noise. Provide the text transcript of the sample to improve phoneme alignment.

Pricing and Usage Notes

Pricing as of June 2026, from novita.ai/pricing:

API              Price
GLM TTS          $0.28 / 1M characters
GLM ASR          $0.021 / 1M characters
GLM Voice Clone  $0.83 / 1M characters
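
The listed rates translate into a rough cost estimate as follows. This is a back-of-the-envelope sketch: the function name and dictionary keys are hypothetical, the rates are copied from the table above, and how the service counts billable characters is not specified here.

```python
# USD per 1,000,000 characters, copied from the pricing table above.
PRICE_PER_MILLION_USD = {
    "glm-tts": 0.28,
    "glm-asr": 0.021,
    "glm-tts-voice-clone": 0.83,
}

def estimate_cost_usd(api, characters):
    """Return the estimated cost in USD for the given number of characters."""
    return PRICE_PER_MILLION_USD[api] * characters / 1_000_000
```

For example, estimate_cost_usd("glm-tts", 250_000) comes to about $0.07.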

GLM TTS is well-suited for high-volume Chinese-language synthesis where cost matters. If you need broader multilingual TTS across 30+ languages or async processing of long-form content, MiniMax Speech is the alternative to evaluate.

FAQ

What languages does GLM TTS support? Optimized for Chinese (Mandarin). Handles mixed Chinese-English input. For broad multilingual coverage, use MiniMax Speech instead.

Can I reuse a cloned voice with GLM TTS? Yes. Pass the voice_name you assigned in the Voice Clone call as the voice parameter in GLM TTS. No need to re-upload the sample.

Why is there a 30-second limit on GLM ASR? The model processes audio synchronously. Split longer recordings at sentence boundaries and chain requests using the prompt field to carry context.

What is the difference between pcm and wav output? PCM is raw audio bytes at 24000 Hz with no header. WAV wraps the same audio in a standard container most libraries can read directly. Use WAV unless your pipeline requires raw PCM.

Does setting watermark_enabled: false always work? Only if you have completed watermark removal in your account settings. The flag is otherwise ignored.
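
On the pcm versus wav question above: if a pipeline already holds raw PCM and later needs a playable file, the standard library can add the header. This is a minimal sketch; pcm_to_wav is a hypothetical helper, the 24000 Hz rate comes from this guide, and mono 16-bit samples are an assumption that should be checked against the actual output.

```python
import io
import wave

def pcm_to_wav(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container and return the WAV bytes.

    channels=1 and sample_width=2 (16-bit) are assumptions, not documented values.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()
```

If the result plays at the wrong speed or as noise, the assumed channel count or sample width is the first thing to revisit.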