Wan 2.7 on Novita AI: Text-to-Video vs Image-to-Video vs Reference-to-Video

Wan 2.7 on Novita AI ships three distinct generation modes — Text-to-Video, Image-to-Video, and Reference-to-Video — each solving a different problem. T2V generates video directly from a prompt with optional audio; I2V animates a starting image and supports video continuation; R2V brings reference characters into new scenes with multi-shot control. Choosing the wrong mode adds friction; this guide maps each mode to the workflows where it actually belongs.

What Changed from Wan 2.6 to 2.7

Wan 2.6 introduced role-playing via reference video, multi-shot narratives, and audio-visual synchronization — a capable but sprawling feature set distributed across three endpoints with some overlap. Wan 2.7 sharpens that model significantly.

The clearest upgrade is in I2V. Wan 2.7 I2V moves beyond single-frame animation to support three distinct input modes in one endpoint: first-frame-only, first+last-frame, and video continuation. Wan 2.6 I2V handled only single-frame animation; continuation was handled by R2V. That consolidation matters for developers building pipelines that extend or remix existing footage.

R2V in 2.7 also changes its character model. Where 2.6 accepted up to two reference videos for role-playing, 2.7 accepts up to five reference media items (images or videos), mapping each to a named character slot (character1, character2, etc.) in your prompt. Multi-character interaction at scale is now a first-class feature, not a workaround.

T2V’s core capability, text prompt to video with audio, is largely unchanged, but the endpoint is cleaner: audio generation is on by default (you can disable it), and the prompt_extend flag rewrites short prompts with an LLM before generation. The Wan 2.6 T2V parameter surface is carried forward with refinements, not replaced.

Duration ranges also diverge by mode in 2.7: T2V and I2V both support 2–15 seconds, while R2V caps at 10 seconds. The 2-second minimum replaces the 5-second floor from 2.6’s standard durations.

Mode Overview and Quick Selection Table

| | T2V | I2V | R2V |
|---|---|---|---|
| Input | Text prompt | Image + optional text | Reference media (images/videos) + text |
| Output duration | 2–15 s | 2–15 s | 2–10 s |
| Resolutions | 720P, 1080P | 720P, 1080P | 720P, 1080P |
| Audio | Auto-generated or audio-driven | Auto-generated or audio-driven | Controllable via audio flag + reference_voice |
| Shot control | Single shot | Single shot | Single or multi-shot |
| Characters | Prompt-defined | Prompt-defined | Up to 5 named reference characters |
| Model ID | wan2.7-t2v | wan2.7-i2v | wan2.7-r2v |
| Endpoint | /v3/async/wan2.7-t2v | /v3/async/wan2.7-i2v | /v3/async/wan2.7-r2v |
| Best for | Original content from scratch | Animating existing assets | Character-consistent, role-play scenes |
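
The limits in the table can be checked before a request is sent. The following is a minimal sketch, not part of any Novita SDK: MODE_LIMITS and check_duration are hypothetical names, the values are copied from the table, and it assumes R2V durations are integers like those of T2V and I2V.

```python
# Limits copied from the comparison table; the helper itself is illustrative.
MODE_LIMITS = {
    "t2v": {"endpoint": "/v3/async/wan2.7-t2v", "min_s": 2, "max_s": 15},
    "i2v": {"endpoint": "/v3/async/wan2.7-i2v", "min_s": 2, "max_s": 15},
    "r2v": {"endpoint": "/v3/async/wan2.7-r2v", "min_s": 2, "max_s": 10},
}


def check_duration(mode: str, seconds: int) -> str:
    """Return the endpoint path for `mode` if `seconds` is in range, else raise."""
    if mode not in MODE_LIMITS:
        raise ValueError(f"unknown mode: {mode!r}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise ValueError("duration must be an integer number of seconds")
    limits = MODE_LIMITS[mode]
    if not limits["min_s"] <= seconds <= limits["max_s"]:
        raise ValueError(
            f"{mode} supports {limits['min_s']}-{limits['max_s']} s, got {seconds}"
        )
    return limits["endpoint"]
```

Calling check_duration("r2v", 12) raises a ValueError locally, which is cheaper than discovering the 10-second cap from a rejected job.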

How Does Wan 2.7 T2V Work on Novita AI?

T2V is the right starting point when you have a creative concept but no existing visual assets. The model generates smooth video directly from a text description and attaches audio automatically — either background music/sound effects generated to match the scene, or audio you supply as a driving source for lip-sync and beat-matching.

Key parameters:

  • prompt — scene description; supports Chinese and English
  • size — resolution tier: 1920*1080, 1280*720, 720*1280, 960*960, 1088*832, 832*1088 (1080P or 720P)
  • duration — integer seconds, range 2–15
  • audio_url — optional; when provided, the model uses this audio to drive generation (lip-sync, beat-matching). Omit to let the model auto-generate
  • prompt_extend — default true; rewrites short prompts using an LLM before generation for better quality
  • seed — set for reproducible outputs
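
As a rough illustration of how these parameters fit together, the sketch below assembles a T2V request body in the input/parameters layout used by the quick start later in this guide. build_t2v_body is a hypothetical helper; it only covers the fields whose placement that quick start shows (prompt, size, duration, prompt_extend) and leaves audio_url and seed out because their position in the body is not shown here.

```python
# Size strings copied from the parameter list above.
T2V_SIZES = {
    "1920*1080", "1280*720", "720*1280",
    "960*960", "1088*832", "832*1088",
}


def build_t2v_body(prompt: str, size: str = "1280*720",
                   duration: int = 5, prompt_extend: bool = True) -> dict:
    """Build a JSON-serializable T2V request body (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if size not in T2V_SIZES:
        raise ValueError(f"unsupported size {size!r}; use one of {sorted(T2V_SIZES)}")
    if not 2 <= duration <= 15:
        raise ValueError("T2V duration must be 2-15 seconds")
    return {
        "input": {"prompt": prompt},
        "parameters": {
            "size": size,
            "duration": duration,
            "prompt_extend": prompt_extend,
        },
    }
```

Note that sizes use an asterisk, not an "x": a value such as "1920x1080" is rejected by this check.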

Who T2V fits: Marketers generating product campaign clips from copy, developers prototyping video content at scale, or anyone who needs original footage without source material.

Where it falls short: Without a reference image or prior video frame, complex character consistency across multiple generations is hard to maintain. If you’re iterating on a specific scene or character, I2V or R2V gives you more control.

How Does Wan 2.7 I2V Work on Novita AI?

I2V’s defining feature in 2.7 is that it handles three different animation patterns through a single endpoint, distinguished by which parameters you populate:

First-frame-to-video: Supply image_url. The model animates the image forward. This is the classic “bring a photo to life” use case.

First+last-frame-to-video: Supply both image_url and last_frame_url. The model generates the bridge between two keyframes, which is useful for controlled transitions or morphing sequences.

Video continuation: Supply first_clip_url (an existing video clip, mp4 or mov, 2–10 seconds). The model extends the video forward based on its content and your prompt.

The driving_audio_url parameter plays the same role as audio_url in T2V: when supplied, it drives generation with lip-sync or beat-matching; when omitted, audio is auto-generated.

Key parameters:

  • image_url — required for first-frame and first+last-frame modes; first-frame image (JPEG, JPG, PNG, BMP, WEBP; up to 20 MB; width/height 240–8000 px). Not used in continuation mode.
  • last_frame_url — optional; last-frame image for keyframe-to-keyframe mode
  • first_clip_url — optional; existing video clip for continuation mode (mp4/mov, 2–10 s)
  • resolution — 720P or 1080P (default 1080P); video aspect ratio matches input media
  • duration — 2–15 seconds (integer)
  • driving_audio_url — optional driving audio
  • prompt — optional; guides animation direction and style
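
Because the I2V mode is implied by which inputs are populated, it can help to resolve it explicitly before submitting. This is a sketch under stated assumptions: detect_i2v_mode and the returned labels are hypothetical names for this guide, and rejecting last_frame_url alongside first_clip_url is an assumption (only the image_url plus first_clip_url combination is described as invalid).

```python
from typing import Optional


def detect_i2v_mode(image_url: Optional[str] = None,
                    last_frame_url: Optional[str] = None,
                    first_clip_url: Optional[str] = None) -> str:
    """Return which of the three I2V patterns a set of inputs selects."""
    if first_clip_url:
        # Continuation uses the clip only; image inputs belong to other modes.
        if image_url or last_frame_url:
            raise ValueError(
                "first_clip_url cannot be combined with image_url or last_frame_url"
            )
        return "continuation"
    if not image_url:
        raise ValueError("image_url is required unless first_clip_url is given")
    return "first_last_frame" if last_frame_url else "first_frame"
```

For example, passing only image_url selects "first_frame", while passing only first_clip_url selects "continuation".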

Who I2V fits: E-commerce teams animating product photos, concept artists adding motion to illustrations, or developers building pipelines that extend existing footage.

Gotcha: The continuation input clip must be 2–10 seconds long. The output aspect ratio follows the input media, so you choose only the resolution tier (720P or 1080P) and cannot set the aspect ratio independently.

How Does Wan 2.7 R2V Work on Novita AI?

R2V is the mode for character-consistent, narrative video. You supply one or more reference media items — images or short video clips — and the model extracts each character’s appearance, motion, and voice. You then direct those characters in your prompt using character1, character2, etc.

This is where Wan 2.7 advances meaningfully over 2.6. Instead of being limited to 1–2 reference videos, 2.7 accepts up to five media items total (images: 0–5, videos: 0–3, total ≤ 5), giving you a cast of characters without patching together separate generations.

The shot_type parameter controls narrative structure: single keeps the output as one continuous shot; multi generates a sequence with transitions. The multi value takes priority over any shot-by-shot instructions in your prompt, so it’s a deliberate mode switch rather than a prompt hint.

Audio behavior in R2V is also more explicit: the audio boolean (default true) controls whether audio is generated at all, and reference_voice allows you to specify a voice reference for character dialogue.

Key parameters:

  • media — required; array of reference media items; order maps to character1, character2, etc.
  • prompt — required; use character1, character2 to reference characters
  • size — resolution; same 720P/1080P options as T2V
  • duration — 2–10 seconds (shorter cap than T2V/I2V)
  • shot_type — single (default) or multi
  • audio — boolean, default true
  • reference_voice — optional voice reference for character speech
  • negative_prompt — optional; max 500 characters; Chinese or English
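
The media limits and the order-based slot naming can be sketched as a small pre-flight check. character_slots is a hypothetical helper, the item shape ({"type": ..., "url": ...}) follows the R2V example later in this guide, and the "video" type value is an assumption since that example only shows images.

```python
def character_slots(media: list) -> dict:
    """Validate R2V reference media and map each item to its prompt slot name."""
    images = sum(1 for item in media if item.get("type") == "image")
    videos = sum(1 for item in media if item.get("type") == "video")
    if images + videos != len(media):
        raise ValueError("every media item needs type 'image' or 'video'")
    if not 1 <= len(media) <= 5:
        raise ValueError("supply between 1 and 5 reference media items")
    if videos > 3:
        raise ValueError("at most 3 of the reference items may be videos")
    # Order in the array decides the slot: first item is character1, and so on.
    return {f"character{i}": item for i, item in enumerate(media, start=1)}
```

The returned keys are the names to use in the prompt, so reordering the media array changes which reference each name points to.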

Who R2V fits: Developers building video avatars, short-form content creators who need a consistent cast, or anyone doing role-play/character-performance scenarios.

Gotcha: R2V caps at 10 seconds per generation. For longer sequences, plan on stitching multiple R2V calls. The multi shot type handles transitions within that window, but it doesn’t extend the 10-second ceiling.
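
One way to plan the stitching is to split the target length into balanced segments that each fit the 2–10 second window. The sketch below is only a planning aid under that assumption: plan_segments is a hypothetical helper, and it does not address visual continuity between the stitched clips.

```python
R2V_MIN_S = 2   # shortest R2V generation, in seconds
R2V_MAX_S = 10  # longest R2V generation, in seconds


def plan_segments(total_s: int) -> list:
    """Split `total_s` seconds into the fewest near-equal R2V-sized segments."""
    if total_s < R2V_MIN_S:
        raise ValueError(f"total duration must be at least {R2V_MIN_S} s")
    # Fewest calls that can cover the total (integer ceiling division).
    count = (total_s + R2V_MAX_S - 1) // R2V_MAX_S
    # Spread the seconds evenly so no segment ends up below the minimum.
    base, extra = divmod(total_s, count)
    return [base + 1] * extra + [base] * (count - extra)
```

A naive "10 seconds at a time" split of a 21-second target would leave a 1-second remainder below the minimum; the balanced split yields three 7-second calls instead.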

Pricing Comparison Across Modes

All three Wan 2.7 modes are billed per second of generated video, not per request. Resolution also affects cost — 1080P outputs cost more than 720P. The R2V endpoint has an additional audio boolean that affects pricing when enabled.

Pricing is listed on the Wan 2.7 T2V, Wan 2.7 I2V, and Wan 2.7 R2V model pages on Novita AI. Check those pages directly for current per-second rates, as video model pricing updates frequently.

To estimate cost for a workflow, multiply your target duration by the per-second rate for your chosen resolution. For example, a 10-second 1080P T2V clip costs 10× the stated 1080P per-second rate. Because T2V and I2V share the same duration ceiling (15 s) and resolution options, their cost curves are comparable if their per-second rates are similar; R2V’s 10-second cap limits its maximum per-generation cost to 10× its per-second rate.
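
The estimate is a single multiplication, sketched here with Decimal to avoid floating-point rounding in currency values. estimate_cost is a hypothetical helper, and the rate is passed in as a string that you copy from the model page; no real Novita prices appear in this example.

```python
from decimal import Decimal


def estimate_cost(duration_s: int, rate_per_second: str) -> Decimal:
    """Estimated cost of one clip: duration in seconds times the per-second rate."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return Decimal(rate_per_second) * duration_s


# "0.10" is a made-up rate for illustration only, not a Novita price.
example_total = estimate_cost(10, "0.10")
```

To budget a batch, multiply the per-clip estimate by the number of clips, using the rate for the resolution you will actually render.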

Cost control levers:

  • Use 720P for development and testing; switch to 1080P only for final outputs
  • Keep prompt_extend enabled (T2V default) — it improves quality without affecting cost
  • For R2V, set audio: false when you’re supplying your own audio in post-production

Which Mode Should You Use?

Start with T2V when: You’re generating original content from a script or prompt and don’t have source visuals. It’s the lowest-friction path — one prompt, one call, video plus audio out. Good for volume content generation, campaign asset creation, and rapid concept exploration.

Switch to I2V when: You have existing images or footage that need to move. First-frame mode animates product photos or illustrations; first+last-frame mode gives you controlled transitions between two keyframes; continuation mode extends footage you already have. I2V is the right choice whenever your source material drives the visual output.

Use R2V when: Character identity and consistency matter. If your use case requires the same person (or multiple people) to appear across multiple videos, or if you’re building performance-based content like video avatars or scripted scenes, R2V’s reference character system is the purpose-built solution. The multi shot type adds cinematic structure without a separate storyboarding step.

A practical decision tree:

  1. Do you have reference characters or people who must appear in the video? → R2V
  2. Do you have an existing image or video clip you want to animate or extend? → I2V
  3. Are you generating original footage from a text description alone? → T2V
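
The same tree can be written as a small function, which is handy when a pipeline must route jobs automatically. choose_mode and its argument names are hypothetical; the checks run in the order listed above, so reference characters take priority over source media.

```python
def choose_mode(has_reference_characters: bool, has_source_media: bool) -> str:
    """Pick a Wan 2.7 model ID by walking the decision tree top to bottom."""
    if has_reference_characters:
        return "wan2.7-r2v"  # characters must stay consistent
    if has_source_media:
        return "wan2.7-i2v"  # animate or extend an existing image or clip
    return "wan2.7-t2v"      # text description only
```

A job with both reference characters and a source clip is routed to R2V here, which mirrors the priority of question 1 over question 2.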

Getting Started with the Novita AI API

All three endpoints follow the same asynchronous pattern: POST to submit a job, get back a task_id, then poll the Task Result API.

Prerequisites: An API key from your Novita AI console. New accounts receive $1 in free credits.

T2V Quick Start

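import time

import requests

API_KEY = "your_api_key"  # from the Novita AI console
BASE = "https://api.novita.ai"

# Submit the generation job
resp = requests.post(
    f"{BASE}/v3/async/wan2.7-t2v",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={
        "input": {
            "prompt": "A golden retriever running through autumn leaves in a park, warm afternoon light",
        },
        "parameters": {
            "size": "1920*1080",
            "duration": 5,
            "prompt_extend": True
        }
    },
    timeout=30,
)
resp.raise_for_status()  # surface HTTP errors before reading the body
task_id = resp.json()["task_id"]

# Poll until the task finishes
while True:
    poll = requests.get(
        f"{BASE}/v3/async/task-result",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"task_id": task_id},
        timeout=30,
    )
    poll.raise_for_status()
    result = poll.json()
    status = result.get("task", {}).get("status")
    if status == "TASK_STATUS_SUCCEED":
        print(result["videos"][0]["video_url"])
        break
    # Failure status name is assumed; confirm it in the Task Result API docs.
    if status == "TASK_STATUS_FAILED":
        raise RuntimeError(f"Generation failed: {result.get('task')}")
    time.sleep(5)  # still queued or processing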

I2V — Video Continuation

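# Reuses requests, API_KEY, and BASE from the T2V quick start.
resp = requests.post(
    f"{BASE}/v3/async/wan2.7-i2v",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={
        "input": {
            # Continuation mode: send the clip only, without image_url
            "first_clip_url": "https://example.com/existing-clip.mp4",
            "prompt": "Continue the scene with smooth camera pan to the right"
        },
        "parameters": {
            "resolution": "1080P",
            "duration": 8
        }
    }
)
resp.raise_for_status()  # surface HTTP errors before reading the body
task_id = resp.json()["task_id"]  # poll the Task Result API with this ID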

R2V — Multi-Character Scene

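# Reuses requests, API_KEY, and BASE from the T2V quick start.
resp = requests.post(
    f"{BASE}/v3/async/wan2.7-r2v",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={
        "input": {
            # Array order sets the slot: first item is character1, second is character2
            "media": [
                {"type": "image", "url": "https://example.com/person-a.jpg"},
                {"type": "image", "url": "https://example.com/person-b.jpg"}
            ],
            "prompt": "character1 and character2 are having a conversation at a café, natural daylight"
        },
        "parameters": {
            "size": "1920*1080",
            "duration": 8,  # R2V accepts 2-10 seconds
            "shot_type": "multi",
            "audio": True
        }
    }
)
resp.raise_for_status()  # surface HTTP errors before reading the body
task_id = resp.json()["task_id"]  # poll the Task Result API with this ID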
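
The I2V and R2V snippets stop at task_id, so a reusable polling helper is one way to finish either job. This is a sketch, not an official client: wait_for_video is a hypothetical name, fetch is any callable you supply that returns the parsed Task Result JSON for a task ID, the response shape follows the T2V quick start, and the TASK_STATUS_FAILED name is an assumption to confirm against the Task Result API docs.

```python
import time


def wait_for_video(fetch, task_id, interval_s=5.0, timeout_s=600.0, sleep=time.sleep):
    """Poll `fetch(task_id)` until the task succeeds, fails, or times out.

    Returns the first video URL on success.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch(task_id)
        status = result.get("task", {}).get("status")
        if status == "TASK_STATUS_SUCCEED":
            return result["videos"][0]["video_url"]
        if status == "TASK_STATUS_FAILED":  # assumed failure status name
            raise RuntimeError(f"task {task_id} failed: {result.get('task')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout_s} s")
        sleep(interval_s)  # still queued or processing
```

With the quick-start variables in scope, fetch could be a small function that calls requests.get on f"{BASE}/v3/async/task-result" with the task_id query parameter and returns .json(). Passing fetch and sleep as arguments also lets you test the loop without network access or real delays.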

The full parameter reference for each mode is in the Wan 2.7 T2V API docs, Wan 2.7 I2V API docs, and Wan 2.7 R2V API docs.

If you want to compare Wan 2.7 against the previous generation, the Wan 2.6 on Novita AI guide covers the full 2.6 feature set and parameter surface.

Conclusion

Wan 2.7 organizes its generation capabilities into three purpose-built modes, each behind its own endpoint. T2V is the fastest path from idea to video when you have no source material: a prompt and an API key are all you need. I2V gives you control over motion and continuity when you’re working from existing images or footage, with three distinct input patterns in a single endpoint. R2V handles the hardest problem: character-consistent video across scenes, with up to five reference characters and multi-shot structure built in.

The upgrade from 2.6 to 2.7 is most visible in I2V (continuation is now native, not a workaround) and R2V (five characters vs. two, named slots vs. positional). T2V carries forward 2.6’s strengths with a cleaner parameter surface.

For most workflows, the decision tree is simple: start with T2V for original content, switch to I2V when you have a source image or clip, and reach for R2V when character identity needs to stay consistent across multiple generations.

FAQ

What is the difference between Wan 2.7 T2V, I2V, and R2V? T2V generates video from a text prompt alone. I2V animates an existing image or extends an existing video clip. R2V generates character-consistent video using reference images or clips as character templates. Each mode is a separate endpoint optimized for its input type.

Can Wan 2.7 generate audio automatically? Yes. All three modes support auto-generated audio by default. T2V and I2V generate background music and sound effects matched to the scene; R2V adds a reference_voice parameter for character dialogue. You can supply your own audio via audio_url (T2V) or driving_audio_url (I2V), or disable audio with audio: false (R2V).

What video lengths does Wan 2.7 support? T2V and I2V both support 2–15 seconds. R2V caps at 10 seconds per generation. All modes use a 2-second minimum.

How does I2V video continuation work? Send first_clip_url pointing to an existing mp4 or mov file (2–10 seconds). The model analyzes the clip’s content and motion, then generates a new segment that continues naturally from the final frame. Do not send image_url together with first_clip_url — they are for different modes.

How many reference characters does Wan 2.7 R2V support? Up to five media items total (images: 0–5, videos: 0–3, combined total ≤ 5). Each item maps to a named character slot (character1, character2, etc.) that you use in your prompt.

Does resolution affect pricing? Yes. All three modes bill per second of generated video, and 1080P costs more per second than 720P. Use 720P during development and switch to 1080P for final outputs to manage costs.

Can I use Wan 2.7 via a REST API? Yes. All endpoints are REST-based and follow an asynchronous pattern: POST a job to receive a task_id, then poll the Task Result API. See the API examples in the “Getting Started” section above, and the full parameter reference in the Novita AI API docs.