Kling O1 on Novita AI: T2V, I2V, Ref2V, and Video Edit Modes

Kling O1 (Kling Omni Video O1) is Kuaishou’s first unified multimodal video model, exposing four distinct generation modes through the Novita AI API: Text-to-Video (T2V), Image-to-Video (I2V), Reference-to-Video (Ref2V), and Video Edit. Each mode accepts different inputs and solves a different problem — picking the wrong one adds friction and cost. This guide explains what each mode actually does, what it requires, how it is priced on Novita AI, and which one to try first for common developer use cases.

What Is Kling O1?

Kling O1 is built on Kuaishou’s MVL (Multimodal Visual Language) architecture, which consolidates text, image, reference, and video editing tasks into a single model rather than routing them to separate specialized models. That matters practically: the underlying motion model and identity encoding are shared across modes, so characters and objects described in one mode carry consistent visual properties to the next.

Compared with the numbered Kling versions (V2.5, V2.6, V3.0 Standard/Pro), Kling O1 adds Ref2V and Video Edit capabilities that are structurally new: neither was available in a Standard or Pro tier before O1. T2V and I2V in O1 gain the shared MVL backbone, which improves subject consistency across frames compared with earlier-generation models.

Kling O1 is distinct from Kling 3.0 (also called Kling O3). Kling 3.0 is a follow-on model that adds native audio co-generation and extended 15-second clips. Kling O1 on Novita AI currently covers videos up to 10 seconds without native audio.

The Four Modes at a Glance

| Mode | Primary Input | Required Inputs | Duration | Price on Novita AI |
|---|---|---|---|---|
| T2V | Text prompt | prompt | 5–10 s | $0.112/s |
| I2V | Image + prompt | image_url, prompt | 5–10 s | $0.112/s |
| Ref2V | Reference images + prompt | prompt, image_urls or elements | 3–10 s | $0.168/s |
| Video Edit | Source video + prompt | video_url, prompt | 3–10 s (Fast: 6–20 s) | $0.168/s (Fast: $0.09/s) |

Pricing verified on Novita AI model pages on 2026-06-26. Per-second billing applies to the duration you specify.

Kling O1 Text-to-Video (T2V) on Novita AI

Endpoint: POST /v3/async/kling-o1-t2v

T2V generates a video entirely from a text description. You provide a prompt; the model creates motion, lighting, camera movement, and scene composition from scratch. There is no image anchor, so the model has full creative latitude within the prompt constraints.

Use T2V when:

  • You do not have a reference image or scene frame.
  • You are exploring a concept before committing to a visual direction.
  • You need to generate many visual variations at low per-clip cost.

At $0.112/s, a 5-second clip costs $0.56 and a 10-second clip costs $1.12. T2V supports 5-second and 10-second durations on Novita AI with aspect ratios 16:9, 9:16, and 1:1.

curl --request POST \
  --url https://api.novita.ai/v3/async/kling-o1-t2v \
  --header "Authorization: Bearer $NOVITA_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "A red fox trotting through a snowy pine forest, golden hour light, cinematic wide shot",
    "duration": 5,
    "aspect_ratio": "16:9"
  }'
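The same request body can be assembled in Python, with the duration and aspect-ratio limits stated above checked before anything is sent. This is a minimal sketch: `build_t2v_payload` is a hypothetical helper, not part of any Novita AI SDK, and the allowed values are taken only from this article.

```python
import json

# Allowed values as stated in this article; verify against the current API docs.
T2V_DURATIONS = (5, 10)
T2V_ASPECT_RATIOS = ("16:9", "9:16", "1:1")


def build_t2v_payload(prompt: str, duration: int = 5, aspect_ratio: str = "16:9") -> dict:
    """Return a JSON-serializable body for the T2V endpoint (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt is required")
    if duration not in T2V_DURATIONS:
        raise ValueError(f"duration must be one of {T2V_DURATIONS}, got {duration}")
    if aspect_ratio not in T2V_ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {T2V_ASPECT_RATIOS}, got {aspect_ratio!r}")
    return {"prompt": prompt, "duration": duration, "aspect_ratio": aspect_ratio}


if __name__ == "__main__":
    body = build_t2v_payload("A red fox trotting through a snowy pine forest", 5, "16:9")
    print(json.dumps(body, indent=2))
```

Rejecting a bad duration or aspect ratio locally avoids spending a round trip on a request the API would likely refuse.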

Kling O1 Image-to-Video (I2V) on Novita AI

Endpoint: POST /v3/async/kling-o1-i2v

I2V animates a static image into a video clip. The source image becomes the starting frame; the prompt controls what motion and scene development follows. You can optionally provide an end frame to give the model a target state, and the model interpolates the motion between start and end.

Required: image_url (start frame) and prompt. The end frame (end_image_url) is optional but useful when you want a specific composition at the cut point.

Use I2V when:

  • You have an existing image or design that needs to move.
  • You want deterministic visual grounding — the character or scene appearance is already defined in the source image.
  • You are building product demos, social content, or e-commerce animations from existing assets.

At $0.112/s, I2V costs the same as T2V. The key tradeoff is that I2V locks the opening frame to your input image, which improves consistency but also means a poor-quality source image limits the output. Image constraints on Novita AI: minimum 300×300px, max file size 10MB, aspect ratio between 0.4 and 2.5.

curl --request POST \
  --url https://api.novita.ai/v3/async/kling-o1-i2v \
  --header "Authorization: Bearer $NOVITA_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "image_url": "https://example.com/product-shot.jpg",
    "prompt": "The product slowly rotates to reveal the back panel, soft studio lighting",
    "duration": 5,
    "aspect_ratio": "1:1"
  }'
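The image constraints listed above can be checked locally before uploading. The sketch below assumes that "aspect ratio" means width divided by height and that "10MB" means 10 × 1024 × 1024 bytes; neither is confirmed by the source. `check_i2v_image` is a hypothetical helper that takes dimensions you have already read from the file.

```python
MIN_SIDE_PX = 300
MAX_FILE_BYTES = 10 * 1024 * 1024    # assumption: "10MB" is 10 MiB
MIN_ASPECT, MAX_ASPECT = 0.4, 2.5    # assumption: ratio is width / height


def check_i2v_image(width: int, height: int, size_bytes: int) -> list[str]:
    """Return a list of constraint violations; an empty list means the image passes."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    problems = []
    if width < MIN_SIDE_PX or height < MIN_SIDE_PX:
        problems.append(f"{width}x{height} is below the {MIN_SIDE_PX}x{MIN_SIDE_PX} minimum")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes; the limit is {MAX_FILE_BYTES}")
    ratio = width / height
    if not MIN_ASPECT <= ratio <= MAX_ASPECT:
        problems.append(f"aspect ratio {ratio:.2f} is outside {MIN_ASPECT} to {MAX_ASPECT}")
    return problems


if __name__ == "__main__":
    print(check_i2v_image(1024, 1024, 500_000))
    print(check_i2v_image(200, 1024, 500_000))
```

Returning every violation at once, rather than raising on the first, makes it easier to report all problems with an asset in a single pass.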

Kling O1 Reference-to-Video (Ref2V) on Novita AI

Endpoint: POST /v3/async/kling-o1-ref2v

Ref2V is the most flexible mode and the one that most directly uses O1’s MVL architecture. Instead of a single start frame, you supply up to seven reference images across two input types: image_urls (style or scene references) and elements (character or object identity anchors). The prompt uses @Image1, @Image2, and @Element1, @Element2 tags to tell the model which reference to apply and where.

This lets you compose a scene from multiple source assets: one character from a portrait photo, a background from a location image, and a prop from a product image — all referenced by name in the prompt.

Input rules:

  • prompt is required.
  • image_urls and elements are both optional. A bare prompt with no references is accepted, but the output then behaves much like T2V, so supply at least one of them for this mode to be worth its price.
  • Total references (elements + image_urls) must not exceed 7.
  • Each element in elements can include multiple reference_image_urls (multi-angle shots) plus an optional frontal_image_url for cleaner identity matching.
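The input rules above lend themselves to a pre-flight check. The sketch below counts each entry in elements as one reference regardless of how many reference_image_urls it holds, and treats @ImageN and @ElementN as 1-based positions in the respective lists. Both are assumptions about how the API counts and numbers references, and `check_ref2v_inputs` is a hypothetical helper.

```python
import re

MAX_REFERENCES = 7  # limit stated in this article
TAG_PATTERN = re.compile(r"@(Image|Element)(\d+)")


def check_ref2v_inputs(prompt: str, image_urls=None, elements=None) -> list[str]:
    """Return a list of rule violations; an empty list means the inputs look valid."""
    image_urls = image_urls or []
    elements = elements or []
    problems = []
    if not prompt.strip():
        problems.append("prompt is required")
    total = len(image_urls) + len(elements)
    if total > MAX_REFERENCES:
        problems.append(f"{total} references supplied; the limit is {MAX_REFERENCES}")
    # Every tag in the prompt should point at an entry that was actually supplied.
    available = {"Image": len(image_urls), "Element": len(elements)}
    for kind, index in TAG_PATTERN.findall(prompt):
        if not 1 <= int(index) <= available[kind]:
            problems.append(f"@{kind}{index} has no matching entry")
    return problems


if __name__ == "__main__":
    character = {"reference_image_urls": ["front.jpg", "side.jpg"], "frontal_image_url": "front.jpg"}
    print(check_ref2v_inputs("Take @Image1 as the start frame. @Element1 walks in.", ["bg.jpg"], [character]))
```

A dangling tag such as @Element2 with only one element supplied is easy to introduce when editing prompts, which is why the check covers tags as well as the reference count.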

Use Ref2V when:

  • You need consistent characters across multiple clips (episodic content, marketing sequences).
  • You are combining characters or objects from different source images into a single scene.
  • You want the model to interpolate from a start frame while maintaining visual identity from a separate reference set.

Ref2V costs $0.168/s — 50% more than T2V and I2V. For a 5-second clip, that is $0.84; for 10 seconds, $1.68. The premium reflects the additional reference-encoding step. If your use case does not require cross-image identity consistency, I2V at $0.112/s is sufficient.

curl --request POST \
  --url https://api.novita.ai/v3/async/kling-o1-ref2v \
  --header "Authorization: Bearer $NOVITA_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "prompt": "Take @Image1 as the start frame. @Element1 walks into the scene and picks up the glowing artifact. Cinematic lighting, steady camera.",
    "image_urls": ["https://example.com/scene-bg.jpg"],
    "elements": [
      {
        "reference_image_urls": ["https://example.com/character-front.jpg", "https://example.com/character-side.jpg"],
        "frontal_image_url": "https://example.com/character-front.jpg"
      }
    ],
    "duration": 5,
    "aspect_ratio": "16:9"
  }'

Kling O1 Video Edit Mode on Novita AI

Endpoint (standard): POST /v3/async/kling-o1-video-edit

Endpoint (fast): available via Novita AI’s Fast VideoEdit variant

Video Edit takes an existing video as input and transforms it using a natural-language prompt. The model preserves the original motion structure — timing, camera movement, the arc of action — while changing subjects, environments, or visual style according to the prompt. You can also supply reference images and element anchors using the same @Image1 / @Element1 tagging system as Ref2V.

Required: video_url (source video, 3–10s, MP4 or MOV, 720–2160px, max 200MB) and prompt.

Two variants:

  • Standard VideoEdit: supports 3–10 second source videos, priced at $0.168/s.
  • Fast VideoEdit: supports 6–20 second source videos, priced at $0.09/s — the lowest per-second cost of any Kling O1 mode on Novita AI.
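Because the two variants accept overlapping source lengths, a 6 to 10 second clip qualifies for both. The sketch below lists which variants accept a given clip and what each would cost. It assumes the billed duration equals the source clip length, and the variant names are labels chosen here rather than API identifiers.

```python
# Ranges and prices as stated in this article; the keys are illustrative labels.
VARIANTS = {
    "standard": {"min_s": 3, "max_s": 10, "usd_per_s": 0.168},
    "fast": {"min_s": 6, "max_s": 20, "usd_per_s": 0.09},
}


def eligible_variants(source_duration_s: float) -> dict[str, float]:
    """Map each variant that accepts this source length to its estimated cost in USD."""
    return {
        name: round(source_duration_s * spec["usd_per_s"], 2)
        for name, spec in VARIANTS.items()
        if spec["min_s"] <= source_duration_s <= spec["max_s"]
    }


if __name__ == "__main__":
    for seconds in (4, 8, 12):
        print(seconds, eligible_variants(seconds))
```

An empty result means the clip is shorter than 3 seconds or longer than 20 and would need trimming or splitting first.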

Use Video Edit when:

  • You have footage that needs a style or content change without re-shooting.
  • You want to replace a character in existing video while keeping the same movement.
  • You need to transform a live-action clip into an animated style.

The key limitation: the source video controls the motion. Video Edit cannot change what a subject does — it can only change how the subject looks and what environment they occupy. For motion changes, generate new footage with T2V or I2V instead.

curl --request POST \
  --url https://api.novita.ai/v3/async/kling-o1-video-edit \
  --header "Authorization: Bearer $NOVITA_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "video_url": "https://example.com/source-clip.mp4",
    "prompt": "Transform the setting to a neon-lit cyberpunk alley, keep the character movements exactly as-is",
    "duration": 5
  }'

Pricing on Novita AI

All Kling O1 modes on Novita AI use per-second billing against the duration you set at request time. Pricing verified 2026-06-26.

| Mode | Endpoint | Duration Range | Price/s | 5s Cost | 10s Cost |
|---|---|---|---|---|---|
| T2V | /v3/async/kling-o1-t2v | 5–10 s | $0.112 | $0.56 | $1.12 |
| I2V | /v3/async/kling-o1-i2v | 5–10 s | $0.112 | $0.56 | $1.12 |
| Ref2V | /v3/async/kling-o1-ref2v | 3–10 s | $0.168 | $0.84 | $1.68 |
| VideoEdit | /v3/async/kling-o1-video-edit | 3–10 s | $0.168 | $0.84 | $1.68 |
| VideoEdit Fast | (Novita AI Fast variant) | 6–20 s | $0.090 | n/a | $0.90 |

Novita AI new users receive free credits. Check the Novita AI pricing page for current rates, as prices may change.
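For budgeting a batch, the per-second rates above reduce to a single multiplication. A minimal sketch using `Decimal` to avoid floating-point rounding in currency values; the mode keys are labels chosen for this example, and the rates are copied from this article, so they may be out of date.

```python
from decimal import Decimal

# USD per second, copied from the pricing table in this article.
PRICE_PER_SECOND = {
    "t2v": Decimal("0.112"),
    "i2v": Decimal("0.112"),
    "ref2v": Decimal("0.168"),
    "video-edit": Decimal("0.168"),
    "video-edit-fast": Decimal("0.09"),
}


def estimate_cost(mode: str, duration_s: int, clips: int = 1) -> Decimal:
    """Estimated USD cost for `clips` clips of `duration_s` seconds each."""
    return PRICE_PER_SECOND[mode] * duration_s * clips


if __name__ == "__main__":
    for mode in PRICE_PER_SECOND:
        print(f"{mode:16s} 10 s: ${estimate_cost(mode, 10):.2f}")
```

For example, twenty 5-second I2V clips come to 0.112 × 5 × 20 = $11.20 at the listed rate.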

Which Mode Should You Start With?

Start with T2V if your goal is concept exploration or you do not have a specific image asset. It is the lowest-friction entry point: one required parameter (prompt), no asset preparation needed.

Move to I2V when you have an image that needs to move. Product images, character illustrations, and scene backgrounds all work well as I2V starting frames. Same price as T2V, more visual control.

Use Ref2V when identity consistency across clips matters — for example, a recurring character in multiple scenes, or combining a specific person with a specific environment. Budget for the 50% price premium; it is not necessary for single-clip generation.

Reserve Video Edit for post-production workflows where existing footage needs a visual overhaul but the motion should stay intact. The Fast variant at $0.09/s is the most cost-efficient option for longer edits (6–20 seconds) where generation speed is less critical.

| Situation | Recommended Mode |
|---|---|
| No image, exploring ideas | T2V |
| Have a product or scene image, want motion | I2V |
| Need same character across multiple clips | Ref2V |
| Have video footage, want different look | VideoEdit (standard) |
| Long edit (6–20 s), cost-sensitive | VideoEdit Fast |

How to Call the Kling O1 API on Novita AI

All four Kling O1 modes on Novita AI are asynchronous. Every request returns a task_id immediately; poll the Task Result endpoint until the status is succeed.

# Step 1: Submit your generation task (example: T2V)
RESPONSE=$(curl --silent --request POST \
  --url https://api.novita.ai/v3/async/kling-o1-t2v \
  --header "Authorization: Bearer $NOVITA_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Your prompt here", "duration": 5, "aspect_ratio": "16:9"}')

TASK_ID=$(echo "$RESPONSE" | python3 -c "import sys,json; print(json.load(sys.stdin)['task_id'])")

# Step 2: Poll for results
curl --request GET \
  --url "https://api.novita.ai/v3/async/task-result?task_id=$TASK_ID" \
  --header "Authorization: Bearer $NOVITA_API_KEY"

The response includes a status field. When it reads succeed, the videos array contains the output URL. Typical generation time is 30–120 seconds depending on duration and mode.
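The single GET above checks the task once; in practice it needs to run in a loop with a timeout. The sketch below uses only the Python standard library. It assumes the response carries a top-level `status` field that reads `succeed` on completion, as described in this article. The `failed` value is a placeholder for whatever failure status the API actually reports, and both function names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

TASK_RESULT_URL = "https://api.novita.ai/v3/async/task-result"


def fetch_task_result(task_id: str, api_key: str) -> dict:
    """GET the task result once and return the decoded JSON body."""
    url = f"{TASK_RESULT_URL}?{urllib.parse.urlencode({'task_id': task_id})}"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_until_done(fetch, interval_s=5.0, timeout_s=300.0, sleep=time.sleep,
                    done=("succeed",), failed=("failed",)) -> dict:
    """Call `fetch()` until its `status` is terminal or `timeout_s` has been waited."""
    waited = 0.0
    while True:
        result = fetch()
        status = result.get("status")
        if status in done:
            return result
        if status in failed:  # placeholder value; check the real API's failure status
            raise RuntimeError(f"task ended with status {status!r}")
        if waited >= timeout_s:
            raise TimeoutError(f"task still {status!r} after {waited:.0f} s")
        sleep(interval_s)
        waited += interval_s
```

A call would look like `poll_until_done(lambda: fetch_task_result(task_id, api_key))`. Passing the fetch step in as a callable keeps the loop independent of the HTTP client and lets it be exercised without network access.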

Get your API key from the Novita AI dashboard. New accounts receive free credits to test all four modes before committing to production volume.

Conclusion

Kling O1 on Novita AI gives developers access to four distinct video generation modes — T2V, I2V, Ref2V, and Video Edit — through a single unified API. T2V and I2V cover the common generation cases at $0.112/s. Ref2V adds multi-reference identity composition for recurring characters at $0.168/s. Video Edit transforms existing footage while preserving motion, with a Fast variant at $0.09/s for longer clips. Picking the right mode upfront saves cost and removes friction: start with T2V if you have no image asset, I2V if you do, Ref2V when cross-clip identity consistency matters, and Video Edit when the motion is already captured. All modes share the same asynchronous task pattern on Novita AI, so integrating multiple modes into one pipeline requires minimal additional code.

Novita AI is an AI cloud platform that gives developers hosted access to video, image, audio, and language models through a unified API.

Frequently Asked Questions

What is the difference between Kling O1 T2V and I2V on Novita AI?

T2V generates video from a text prompt alone — no image is required. I2V takes an image as the starting frame and animates it according to the prompt. Both are priced at $0.112/s and support 5–10 second clips. Use T2V for exploration; use I2V when you have a specific visual anchor.

What does Kling O1 Ref2V do that I2V cannot?

Ref2V accepts up to 7 reference images across multiple input slots, letting you combine separate sources for character identity, scene background, and style. You reference each input by name in the prompt (@Element1, @Image1). I2V uses a single start frame with no named reference system.

Is Kling O1 the same as Kling 3.0?

No. Kling O1 (released December 2025) is the base unified multimodal video model. Kling 3.0 (also called Kling O3, released February 2026) is a follow-on that adds native audio co-generation and up to 15-second clips. Kling O1 on Novita AI supports video up to 10 seconds without native audio.

How do I choose between VideoEdit standard and VideoEdit Fast?

Standard VideoEdit accepts 3–10 second source clips at $0.168/s. Fast VideoEdit accepts 6–20 second clips at $0.09/s. If your source clip is 10 seconds or shorter and turnaround time matters, use standard; it is also the only option for clips under 6 seconds. If the clip is 6 seconds or longer and cost matters more, as in batch post-production work, Fast is about 46% cheaper per second.