Wan 2.2 T2V on Novita AI: What’s New and Why It Matters


Novita AI has officially launched the latest Wan 2.2 API, a cutting-edge tool for text-to-video generation. This article will introduce what Wan 2.2 is, highlight its new features and updates, and discuss its performance. Additionally, we’ll address common questions to help you get started with this powerful technology.

What is Wan 2.2 T2V?

Wan 2.2 T2V is Alibaba’s latest open-source text-to-video generative AI model, representing a major upgrade over the earlier Wan 2.1 system. It’s part of Alibaba’s “Wan” series of video generation models (often referred to as Tongyi Wanxiang in Chinese) and is notable for being the industry’s first open-source video model that uses a Mixture-of-Experts (MoE) architecture. Wan 2.2 actually encompasses a suite of models, including a dedicated text-to-video model and related tools, but “Wan 2.2 T2V” specifically refers to the text-to-video component of this series.

Wan 2.2 T2V Specifications

| Category | Description |
| --- | --- |
| Model Architecture | Uses a Mixture-of-Experts (MoE) architecture with two expert sub-models. |
| Parameter Count | 27 billion parameters in total, of which only 14 billion are active during inference. |
| Design Advantages | By using specialized "experts" (each around 14B parameters), the model doubles in size while keeping runtime costs similar to its predecessor, Wan 2.1 (14B parameters). |
| Released Model Variants | 1. T2V-A14B: a text-to-video model for generating videos from text. 2. TI2V-5B: a hybrid model for both tasks, optimized for consumer-grade hardware (5B parameters). |
| Hardware Optimization | TI2V-5B is optimized for consumer-grade GPUs and can run on a single NVIDIA RTX 4090. |
| Resolution and Frame Rate | The standard Wan 2.2 T2V model generates 5-second videos at 720p resolution (1280×720) and 24 frames per second. |
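As a quick sanity check on those output specifications, a 5-second clip at 24 fps works out to 120 frames; the sketch below (plain Python, no Wan-specific code) also estimates the raw, uncompressed size of such a clip:

```python
# Back-of-the-envelope numbers for the standard Wan 2.2 T2V output:
# 5 seconds at 24 fps, 1280x720, 3 bytes per pixel (RGB).
duration_s = 5
fps = 24
width, height = 1280, 720

frames = duration_s * fps                # total frames in the clip
raw_bytes = frames * width * height * 3  # uncompressed RGB size

print(frames)                            # 120
print(round(raw_bytes / 1024**2, 1))     # ~316.4 MiB of raw frame data
```

The delivered video is, of course, far smaller once encoded, but the raw frame count is what the diffusion process actually has to keep temporally consistent.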

Wan 2.2 T2V Key Features

Cinematic Quality & Control

  • Trained on a meticulously curated dataset with aesthetic labels to generate videos with a cinematic look and feel.
  • Supports fine-grained text control, allowing users to specify:
    • Lighting conditions
    • Time of day
    • Color tone
    • Camera angles
    • Focal length
    • Other cinematic aspects.
  • Understands cinematic terms such as “golden hour lighting” and “wide-angle lens,” ensuring precise control over the video output.
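Since the model parses cinematic vocabulary directly from the prompt text, one convenient pattern is to assemble prompts from those controllable attributes. The helper below is purely illustrative (the function and parameter names are our own, not part of any Wan API):

```python
def build_cinematic_prompt(subject, lighting=None, time_of_day=None,
                           color_tone=None, camera=None, focal_length=None):
    """Join a subject with optional cinematic descriptors into one prompt string."""
    parts = [subject, lighting, time_of_day, color_tone, camera, focal_length]
    return ", ".join(p for p in parts if p)  # skip attributes left unset

prompt = build_cinematic_prompt(
    "a sailboat crossing a calm bay",
    lighting="golden hour lighting",
    camera="wide-angle lens",
    focal_length="24mm",
)
print(prompt)
# a sailboat crossing a calm bay, golden hour lighting, wide-angle lens, 24mm
```

Keeping the descriptors as named parameters makes it easy to vary one cinematic aspect at a time while comparing outputs.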

Multi-Modal Generative Suite

  • Includes a style transfer functionality:
    • Enables one-click application of artistic styles, such as converting photos or videos into cartoon or sketch formats (veo-video.org).
  • Provides a unified model family that supports various generative tasks, making it a comprehensive creative AI platform.

Open Source & Community Ecosystem

Licensed under Apache 2.0, permitting commercial use (hackernoon.com). Supported by an active community that contributes:

  • Guides
  • Integration tools (e.g., for ComfyUI)
  • Fine-tuning optimizations
  • General support.

What Work Process Optimizations are in Wan 2.2?

Wan 2.2 T2V vs Wan 2.1 T2V

Wan 2.2 T2V vs Wan 2.1 T2V: Architecture

| Aspect | Wan 2.1 | Wan 2.2 |
| --- | --- | --- |
| Architecture | Single-stage Diffusion Transformer (UNet). | Two-stage Mixture-of-Experts (MoE) with high-noise and low-noise experts. |
| Parameters | 14B (base) and 1.3B (small). | 27B total (14B active); 14B T2V, 14B I2V, and a 5B hybrid model. |
| Training Data | Large dataset, less curated. | +65% images, +83% videos, annotated for aesthetics and cinematic attributes. |
| Output Quality | Good but prone to flickering; suited for simpler, stylized videos. | Higher detail, better temporal consistency, realism, and cinematic visuals. |
| Features | T2V, I2V, editing (VACE framework), LoRA fine-tuning supported. | T2V, I2V, better style transfer; no VACE yet, limited LoRA compatibility. |

Wan 2.2 T2V vs Wan 2.1 T2V: Performance

(Benchmark chart: Wan 2.2 T2V vs Wan 2.1 T2V, from Artificial Analysis.)

Wan 2.2 T2V vs Wan 2.1 T2V: Generation

(Side-by-side sample generations: Wan 2.2 T2V vs Wan 2.1 T2V.)

Cost and Access of Wan 2.2 T2V

Hardware Costs

| Model | Minimum VRAM (GB) | Minimum GPU Model | Minimum GPU Quantity | Single-GPU Speed, 480P (s) | Single-GPU Speed, 720P (s) | Approximate GPU Price (USD) |
| --- | --- | --- | --- | --- | --- | --- |
| TI2V-5B | 22.6 | NVIDIA RTX 4090 | 1 | 534.7 | 524.8 | $1,599 |
| T2V-A14B | 41.3 | NVIDIA A100 | 1 | 1,133.9 | 4,048.7 | $10,000 – $15,000 |

Notes:

  • NVIDIA RTX 4090: Released in October 2022 with an MSRP of $1,599.
  • NVIDIA A100: Prices vary based on configuration and market factors. The 40GB PCIe model typically ranges from $10,000 to $12,000, while the 80GB PCIe model ranges from $12,000 to $15,000.

API Costs

Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models through a simple API, while also providing affordable and reliable GPU cloud infrastructure for building and scaling.

| Model | Price | Resolution | Generation Time |
| --- | --- | --- | --- |
| Wan 2.1 T2V | $0.3/video | 1280×720 | 5s |
| Wan 2.2 T2V | $0.4/video | 1080P | 5s |
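At those flat per-video prices, projecting the cost of a batch job is simple multiplication. A minimal sketch, using the prices from the table above (the model keys are our own labels, not API identifiers):

```python
# Per-video API prices (USD) from the pricing table above.
PRICE_PER_VIDEO = {"wan-2.1-t2v": 0.30, "wan-2.2-t2v": 0.40}

def batch_cost(model, num_videos):
    """Total API cost in USD for generating num_videos clips."""
    return round(PRICE_PER_VIDEO[model] * num_videos, 2)

# e.g. 100 five-second clips on each model:
print(batch_cost("wan-2.1-t2v", 100))  # 30.0
print(batch_cost("wan-2.2-t2v", 100))  # 40.0
```

For small businesses, this is often the relevant comparison against the hardware table above: 100 clips via the API costs tens of dollars, versus thousands for a capable GPU.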

Wan 2.2 T2V Access Guide

Step 1: Log In and Access the Model Library

Log in to your account and click on the Model Library button.


Step 2: Choose Your Model

Browse through the available options and select the model that suits your needs.


Step 3: Get Your API Key

To authenticate with the API, Novita AI provides you with an API key. Open the Settings page and copy the API key as shown in the image.


Step 4: Install the API

Install the API client library using the package manager for your programming language.


After installation, import the necessary libraries into your development environment and initialize your API key to start interacting with the Novita AI API. Below is a Python example that submits a generation task to the Wan 2.2 T2V endpoint.

import requests

# Async text-to-video endpoint for Wan 2.2 T2V.
url = "https://api.novita.ai/v3/async/wan-2.2-t2v"

payload = {
    "input": {
        "prompt": "<your text prompt>",
        "negative_prompt": "<elements to exclude>"
    },
    "parameters": {
        "size": "<output size>",
        "prompt_extend": True,  # let the service expand the prompt automatically
        "seed": 123             # fix the seed for reproducible results
    }
}
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer <your-api-key>"
}

# Submit the generation task; the async API responds with a task ID,
# not the finished video.
response = requests.post(url, json=payload, headers=headers)
print(response.json())
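Because the endpoint above is asynchronous, the POST returns a task identifier rather than the finished video, and you then poll until generation completes. The sketch below assumes the response carries a `task_id` and that results are fetched from a `task-result` endpoint with the status values shown; verify all three against Novita AI's current API reference before relying on them:

```python
import time

import requests


def wait_for_video(task_id, api_key, timeout_s=600, poll_s=5):
    """Poll the async task-result endpoint until the video is ready.

    The endpoint URL, response shape, and status strings below are
    assumptions; check them against Novita AI's API documentation.
    """
    url = "https://api.novita.ai/v3/async/task-result"
    headers = {"Authorization": f"Bearer {api_key}"}
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(url, params={"task_id": task_id}, headers=headers)
        resp.raise_for_status()
        task = resp.json().get("task", {})
        if task.get("status") == "TASK_STATUS_SUCCEED":
            return resp.json()          # contains the generated video URL(s)
        if task.get("status") == "TASK_STATUS_FAILED":
            raise RuntimeError(f"generation failed: {task}")
        time.sleep(poll_s)              # still queued or processing
    raise TimeoutError("video generation did not finish in time")
```

A 720p clip can take several minutes to render (see the hardware table above), so a generous timeout with a modest polling interval is usually the right trade-off.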

Common Wan 2.2 T2V Issues and Fixes

Installation and GPU Compatibility

  • Issue: Errors on older GPUs (e.g., GTX 10-series) due to FlashAttention.
  • Solution: Use compatible GPUs like RTX 30/40-series or A-series. Alternatively, disable FlashAttention (--disable_flashattn) or replace it with xFormers for slower but functional performance.

Slow Generation Speed

  • Issue: Extremely slow output, especially on modest GPUs.
  • Solution:
    • Optimize step count (30–50 steps are often sufficient).
    • Use the smaller TI2V-5B model for faster results.
    • Ensure correct expert switching settings (default configurations are recommended).

Output Quality Issues (Flicker/Artifacts)

  • Issue: Flickering frames or artifacts in generated videos.
  • Solution:
    • Adjust CFG scale for better balance between precision and smoothness.
    • Tweak the expert handover step for optimal diffusion.
    • Enable temporal attention to maintain frame consistency.
    • Use post-processing tools like frame interpolation if needed.

Prompt or Output Not as Expected

  • Issue: Outputs differ from the described scenes or include unwanted elements.
  • Solution:
    • Rephrase and simplify prompts.
    • Use negative prompts to exclude specific elements.
    • Ensure correct model weights (e.g., don’t use I2V for text-only prompts).
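In the Wan 2.2 T2V request shown in the access guide, exclusions go in the `negative_prompt` field next to the main prompt. A minimal illustrative request body (field names follow that earlier example; the prompt text and size value are hypothetical):

```python
# Illustrative request body: negative_prompt lists unwanted elements.
payload = {
    "input": {
        "prompt": "a chef plating a dessert in a bright modern kitchen",
        # Things we do NOT want in the output:
        "negative_prompt": "text, watermark, extra hands, blurry frames",
    },
    "parameters": {
        "size": "1280*720",    # assumed size format; check the API reference
        "prompt_extend": True,
        "seed": 42,            # fix the seed to reproduce a result while iterating
    },
}
print(payload["input"]["negative_prompt"])
```

Fixing the seed while editing only the prompts makes it much easier to tell whether a change in output came from your wording or from sampling randomness.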

LoRA and Fine-tuning Issues

  • Issue: Old LoRA models from Wan 2.1 are incompatible with Wan 2.2.
  • Solution: Wait for Wan 2.2-specific LoRAs or fine-tunes. Ensure any fine-tuning is tailored to the new two-expert architecture.

Pros and Cons of Wan 2.2 T2V for Small Businesses

AspectAdvantagesDisadvantages
Licensing & CostFree under Apache 2.0, no licensing fees, drastically lowers entry costs.High computational costs for large-scale usage (cloud or electricity).
Content QualityCinematic-quality videos; in-house creation without hiring designers or videographers.Unpredictable output quality; may require manual review and editing.
Creative FlexibilityRapid prototyping with text prompts; quick turnaround for concept videos.Slower for real-time or on-demand generation; better for pre-planned content.
CustomizationTailored to brand aesthetics via prompts or fine-tuning; open-source flexibility for deeper integration.Requires expertise to craft prompts or fine-tune models effectively.
ScalabilityGenerate hundreds of videos easily; ideal for localized ads or A/B testing.Expensive hardware (e.g., RTX 4090 or A100) needed for high-capacity use.
Community SupportBacked by open-source community; access to tutorials, updates, and tools like ComfyUI workflows.No formal support or guarantees; reliance on community goodwill for troubleshooting.
Ease of UseSimplifies video creation for small teams; acts as a “mini creative studio.”Requires ML knowledge for setup (Python, CUDA, model parameters); steep learning curve.
Ethical & LegalEnables innovation in AI-driven marketing.Risks of generating unintended or inappropriate content; potential legal liabilities.

Best for: Small businesses with technical expertise or access to consultants, aiming to reduce content creation costs and scale video production.
Challenges: Requires careful planning, technical setup, and monitoring of hardware and costs.

Future Trends in Wan 2.2 T2V Technology
  1. Higher Resolution & Length
    • Move towards 1080p, 4K, and longer clips (10–20 seconds).
    • Improved coherence for extended videos via hierarchical generation.
  2. Enhanced Motion & Consistency
    • Better motion stability and natural interactions.
    • Specialized experts for different motion types (e.g., slow vs. fast).
  3. Video Editing & Multi-Modality
    • Text commands for editing existing videos (e.g., scene changes or object removal).
    • Integration of audio generation for complete video projects.
  4. Efficiency & Scalability
    • Smaller, faster models (e.g., distilled 5B models with near 27B quality).
    • Real-time video generation becomes feasible with hardware advancements.
  5. Community & Ecosystem Growth
    • Niche fine-tunes (e.g., cartoon style, medical videos).
    • Wider adoption through plugins and mobile apps.
  6. Ethics & Regulation
    • Watermarks and metadata for AI-generated content.
    • Standards ensuring transparency in use cases like advertising.

The release of the Wan 2.2 API marks a significant advancement in text-to-video technology. With higher resolutions, enhanced motion consistency, and improved efficiency, Wan 2.2 opens new possibilities for developers and creators. Its flexible API interface empowers you to bring your ideas to life, setting a new standard for video generation.

Frequently Asked Questions

What is Wan 2.2?

Wan 2.2 is an open-source text-to-video model capable of generating high-quality, motion-consistent videos suitable for applications like advertising, filmmaking, and more.

What’s new in Wan 2.2 compared to previous versions?

Support for higher resolutions (up to 1080p).
Improved temporal consistency, reducing flickering.
Introduction of Mixture-of-Experts (MoE) architecture for better handling of complex scenes.

How does Wan 2.2 perform?

Wan 2.2 excels in speed, memory optimization, and output quality. When paired with high-end GPUs, it can quickly generate high-resolution video.

Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.
