Novita AI has officially launched the latest Wan 2.2 API, a cutting-edge tool for text-to-video generation. This article will introduce what Wan 2.2 is, highlight its new features and updates, and discuss its performance. Additionally, we’ll address common questions to help you get started with this powerful technology.
What is Wan 2.2 T2V?
Wan 2.2 T2V is Alibaba’s latest open-source text-to-video generative AI model, representing a major upgrade over the earlier Wan 2.1 system. It’s part of Alibaba’s “Wan” series of video generation models (often referred to as Tongyi Wanxiang in Chinese) and is notable for being the industry’s first open-source video model that uses a Mixture-of-Experts (MoE) architecture. Wan 2.2 actually encompasses a suite of models, including a dedicated text-to-video model and related tools, but “Wan 2.2 T2V” specifically refers to the text-to-video component of this series.
Wan 2.2 T2V Specifications
| Category | Description |
|---|---|
| Model Architecture | Uses a Mixture-of-Experts (MoE) architecture with two expert sub-models: a high-noise expert and a low-noise expert. |
| Parameter Count | The total model has 27 billion parameters, but only 14 billion are active during inference. |
| Design Advantages | By using specialized “experts” (each around 14B parameters), the model doubles in size while maintaining similar runtime costs compared to its predecessor, Wan 2.1 (14B parameters). |
| Released Model Variants | 1. T2V-A14B: A text-to-video model for generating videos from text. 2. TI2V-5B: A hybrid model that handles both text-to-video and image-to-video tasks, optimized for consumer-grade hardware (5B parameters). |
| Hardware Optimization | TI2V-5B is optimized for consumer-grade GPUs, such as running on a single NVIDIA RTX 4090. |
| Resolution and Frame Rate | The standard Wan 2.2 T2V model can generate 5-second-long videos at 720p resolution (1280×720) with 24 frames per second. |
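The resolution and frame-rate spec above translates into concrete numbers the model must produce. As a rough sketch (the raw-RGB size is a back-of-envelope figure, not a statement about Wan 2.2's internal representation):

```python
# Back-of-envelope numbers for the standard Wan 2.2 T2V output spec:
# a 5-second clip at 1280x720 and 24 frames per second.
width, height, fps, seconds = 1280, 720, 24, 5

total_frames = fps * seconds                   # frames the model must generate
raw_bytes = width * height * 3 * total_frames  # uncompressed 8-bit RGB size

print(total_frames)              # 120
print(raw_bytes / (1024 ** 2))   # ~316 MiB of uncompressed pixels
```

In other words, a single clip is 120 coherent frames, which is why temporal consistency is the hard part of text-to-video generation.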
Wan 2.2 T2V Key Features
Cinematic Quality & Control
- Trained on a meticulously curated dataset with aesthetic labels to generate videos with a cinematic look and feel.
- Supports fine-grained text control, allowing users to specify:
- Lighting conditions
- Time of day
- Color tone
- Camera angles
- Focal length
- Other cinematic aspects.
- Understands cinematic terms such as “golden hour lighting” and “wide-angle lens,” ensuring precise control over the video output.
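The fine-grained controls above can be exercised simply by composing the controllable attributes into the prompt string. A minimal sketch (the attribute names are illustrative, not an official Wan 2.2 schema):

```python
# Compose a cinematic prompt from the controllable attributes listed above.
# Attribute names are illustrative; Wan 2.2 reads them as free text.
def build_prompt(subject, lighting=None, time_of_day=None,
                 color_tone=None, camera=None, focal_length=None):
    parts = [subject]
    for attr in (lighting, time_of_day, color_tone, camera, focal_length):
        if attr:
            parts.append(attr)
    return ", ".join(parts)

prompt = build_prompt(
    "a lone sailboat on a calm sea",
    lighting="golden hour lighting",
    color_tone="warm color grading",
    camera="wide-angle lens",
)
print(prompt)
# a lone sailboat on a calm sea, golden hour lighting, warm color grading, wide-angle lens
```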
Multi-Modal Generative Suite
- Includes a style transfer functionality:
- Enables one-click application of artistic styles, such as converting photos or videos into cartoon or sketch formats (veo-video.org).
- Provides a unified model family that supports various generative tasks, making it a comprehensive creative AI platform.
Open Source & Community Ecosystem
Licensed under Apache 2.0, permitting commercial use (hackernoon.com). Supported by an active community that contributes:
- Guides
- Integration tools (e.g., for ComfyUI)
- Fine-tuning optimizations
- General support.
What Work Process Optimizations are in Wan 2.2?

Wan 2.2 T2V vs Wan 2.1 T2V
Wan 2.2 T2V vs Wan 2.1 T2V: Architecture
| Aspect | Wan 2.1 | Wan 2.2 |
|---|---|---|
| Architecture | Single-stage Diffusion Transformer (UNet). | Two-stage Mixture-of-Experts (MoE) with High-Noise and Low-Noise Experts. |
| Parameters | 14B (base) and 1.3B (small). | 27B total (14B active); 14B T2V, 14B I2V, and 5B hybrid model. |
| Training Data | Large dataset, less curated. | +65% images, +83% videos, annotated for aesthetics and cinematic attributes. |
| Output Quality | Good but prone to flickering; suited for simpler, stylized videos. | Higher detail, better temporal consistency, realism, and cinematic visuals. |
| Features | T2V, I2V, editing (VACE framework), LoRA fine-tuning supported. | T2V, I2V, better style transfer; no VACE yet, limited LoRA compatibility. |
Wan 2.2 T2V vs Wan 2.1 T2V: Performance

Wan 2.2 T2V vs Wan 2.1 T2V: Generation
Cost and Access of Wan 2.2 T2V
Hardware Costs
| Model | Minimum VRAM Requirement (GB) | Minimum GPU Model | Minimum GPU Quantity | Single GPU Speed (s) (480P) | Single GPU Speed (s) (720P) | Approximate GPU Price (USD) |
|---|---|---|---|---|---|---|
| TI2V-5B | 22.6 | NVIDIA RTX 4090 | 1 | 534.7 | 524.8 | $1,599 |
| T2V-A14B | 41.3 | NVIDIA A100 | 1 | 1133.9 | 4048.7 | $10,000 – $15,000 |
Notes:
- NVIDIA RTX 4090: Released in October 2022 with an MSRP of $1,599.
- NVIDIA A100: Prices vary based on configuration and market factors. The 40GB PCIe model typically ranges from $10,000 to $12,000, while the 80GB PCIe model ranges from $12,000 to $15,000.
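A quick way to compare buying hardware with using a hosted API is a break-even calculation. The sketch below uses the table's RTX 4090 price and an illustrative $0.4/video API rate, and ignores electricity and maintenance:

```python
# Rough break-even: local RTX 4090 (TI2V-5B) vs a hosted API.
# Assumes the figures above: $1,599 GPU MSRP and $0.4 per API video.
# Electricity, cooling, and maintenance are deliberately ignored.
gpu_price = 1599.0         # USD, RTX 4090 MSRP
api_price_per_video = 0.4  # USD per generated video via the API

break_even_videos = gpu_price / api_price_per_video
print(break_even_videos)   # ~3998 videos before the GPU pays for itself
```

Below that volume, API pricing is usually the cheaper route for a small team.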
API Costs
Novita AI is an AI cloud platform that offers developers an easy way to deploy AI models through a simple API, along with an affordable, reliable GPU cloud for building and scaling.
| Model | Price | Resolution | Video Length |
|---|---|---|---|
| Wan 2.1 T2V | $0.3/video | 1280×720 | 5s |
| Wan 2.2 T2V | $0.4/video | 1080p | 5s |
Wan 2.2 T2V Access Guide
Step 1: Log In and Access the Model Library
Log in to your account and click on the Model Library button.

Step 2: Choose Your Model
Browse through the available options and select the model that suits your needs.

Step 3: Get Your API Key
To authenticate with the API, you need an API key. Open the "Settings" page and copy your API key as indicated in the image.

Step 4: Install the API
Install the client library using the package manager for your programming language (for the Python example below, `pip install requests`).

After installation, import the necessary libraries into your development environment and initialize the client with your API key. Below is a Python example that submits an asynchronous Wan 2.2 text-to-video request.
```python
import requests

url = "https://api.novita.ai/v3/async/wan-2.2-t2v"

payload = {
    "input": {
        "prompt": "<string>",          # your text description of the video
        "negative_prompt": "<string>"  # elements to exclude (optional)
    },
    "parameters": {
        "size": "<string>",    # output resolution
        "prompt_extend": True, # let the service expand short prompts
        "seed": 123            # fix the seed for reproducible results
    }
}
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer <your-api-key>"
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())
```
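Because the endpoint is asynchronous, the response returns a task identifier rather than the finished video, and you poll until the clip is ready. The sketch below illustrates that pattern; the result endpoint and response field names are assumptions, not confirmed Novita AI names, so check the official API reference before using them.

```python
# Hedged polling sketch for the async task created above. The result URL
# and the "task"/"status" field names are ASSUMPTIONS for illustration.
import time
import requests

RESULT_URL = "https://api.novita.ai/v3/async/task-result"  # assumed endpoint

def wait_for_video(task_id, api_key, interval=10, timeout=600, get=requests.get):
    """Poll until the task succeeds, fails, or the timeout expires."""
    headers = {"Authorization": f"Bearer {api_key}"}
    deadline = time.time() + timeout
    while time.time() < deadline:
        data = get(RESULT_URL, params={"task_id": task_id}, headers=headers).json()
        status = data.get("task", {}).get("status")  # assumed field names
        if status == "TASK_STATUS_SUCCEED":
            return data
        if status == "TASK_STATUS_FAILED":
            raise RuntimeError(f"generation failed: {data}")
        time.sleep(interval)
    raise TimeoutError("video generation did not finish in time")
```

The `get` parameter exists so the polling logic can be tested without network access; in production you simply omit it.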
Common Wan 2.2 T2V Issues and Fixes
Installation and GPU Compatibility
- Issue: Errors on older GPUs (e.g., GTX 10-series) due to FlashAttention.
- Solution: Use compatible GPUs such as the RTX 30/40-series or A-series. Alternatively, disable FlashAttention (`--disable_flashattn`) or replace it with xFormers for slower but functional performance.
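For local setups, the fallback decision can be made programmatically. A minimal sketch, assuming the common PyPI package names (`flash_attn`, `xformers`); the flag your particular runner exposes may differ:

```python
# Pick the best available attention backend: FlashAttention on newer GPUs,
# xFormers as a slower fallback, plain PyTorch attention as a last resort.
def pick_attention_backend():
    try:
        import flash_attn  # noqa: F401  (needs Ampere-or-newer GPUs)
        return "flash_attention"
    except ImportError:
        try:
            import xformers  # noqa: F401
            return "xformers"
        except ImportError:
            return "vanilla"  # plain PyTorch attention, works everywhere
```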
Slow Generation Speed
- Issue: Extremely slow output, especially on modest GPUs.
- Solution:
- Optimize step count (30–50 steps are often sufficient).
- Use the smaller TI2V-5B model for faster results.
- Ensure correct expert switching settings (default configurations are recommended).
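The speed fixes above can be summarized as one configuration. The parameter names below mirror common diffusion-pipeline options and are not an official Wan 2.2 CLI; adapt them to whichever runner you use (for example, a ComfyUI workflow):

```python
# Illustrative speed-oriented settings for local Wan 2.2 inference.
# Keys are hypothetical; map them onto your runner's actual options.
fast_config = {
    "model": "TI2V-5B",         # smaller hybrid model, fits one RTX 4090
    "sampling_steps": 40,       # 30-50 is usually sufficient
    "resolution": (832, 480),   # 480P previews render much faster than 720P
    "expert_switch": "default", # keep the default high/low-noise handover
}
```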
Output Quality Issues (Flicker/Artifacts)
- Issue: Flickering frames or artifacts in generated videos.
- Solution:
- Adjust CFG scale for better balance between precision and smoothness.
- Tweak the expert handover step for optimal diffusion.
- Enable temporal attention to maintain frame consistency.
- Use post-processing tools like frame interpolation if needed.
Prompt or Output Not as Expected
- Issue: Outputs differ from the described scenes or include unwanted elements.
- Solution:
- Rephrase and simplify prompts.
- Use negative prompts to exclude specific elements.
- Ensure correct model weights (e.g., don’t use I2V for text-only prompts).
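A negative prompt is passed alongside the main prompt in the request body, following the request shape shown in the access guide. The prompt text and size value below are illustrative:

```python
# Sketch of steering output with a negative prompt. The field layout
# matches the earlier request example; the values are illustrative.
payload = {
    "input": {
        "prompt": "a chef plating pasta in a bright kitchen, shallow depth of field",
        "negative_prompt": "text, watermark, extra hands, flickering, blurry",
    },
    "parameters": {
        "size": "1280*720",   # assumed size format; check the API reference
        "prompt_extend": True,
        "seed": 42,
    },
}
```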
LoRA and Fine-tuning Issues
- Issue: Old LoRA models from Wan 2.1 are incompatible with Wan 2.2.
- Solution: Wait for Wan 2.2-specific LoRAs or fine-tunes. Ensure any fine-tuning is tailored to the new two-expert architecture.
Pros and Cons of Wan 2.2 T2V for Small Businesses
| Aspect | Advantages | Disadvantages |
|---|---|---|
| Licensing & Cost | Free under Apache 2.0, no licensing fees, drastically lowers entry costs. | High computational costs for large-scale usage (cloud or electricity). |
| Content Quality | Cinematic-quality videos; in-house creation without hiring designers or videographers. | Unpredictable output quality; may require manual review and editing. |
| Creative Flexibility | Rapid prototyping with text prompts; quick turnaround for concept videos. | Slower for real-time or on-demand generation; better for pre-planned content. |
| Customization | Tailored to brand aesthetics via prompts or fine-tuning; open-source flexibility for deeper integration. | Requires expertise to craft prompts or fine-tune models effectively. |
| Scalability | Generate hundreds of videos easily; ideal for localized ads or A/B testing. | Expensive hardware (e.g., RTX 4090 or A100) needed for high-capacity use. |
| Community Support | Backed by open-source community; access to tutorials, updates, and tools like ComfyUI workflows. | No formal support or guarantees; reliance on community goodwill for troubleshooting. |
| Ease of Use | Simplifies video creation for small teams; acts as a “mini creative studio.” | Requires ML knowledge for setup (Python, CUDA, model parameters); steep learning curve. |
| Ethical & Legal | Enables innovation in AI-driven marketing. | Risks of generating unintended or inappropriate content; potential legal liabilities. |
Best for: Small businesses with technical expertise or access to consultants, aiming to reduce content creation costs and scale video production. Challenges: Requires careful planning, technical setup, and monitoring of hardware and costs.
Future Trends in Wan 2.2 T2V Technology

- Higher Resolution & Length
- Move towards 1080p, 4K, and longer clips (10–20 seconds).
- Improved coherence for extended videos via hierarchical generation.
- Enhanced Motion & Consistency
- Better motion stability and natural interactions.
- Specialized experts for different motion types (e.g., slow vs. fast).
- Video Editing & Multi-Modality
- Text commands for editing existing videos (e.g., scene changes or object removal).
- Integration of audio generation for complete video projects.
- Efficiency & Scalability
- Smaller, faster models (e.g., distilled 5B models with near 27B quality).
- Real-time video generation becomes feasible with hardware advancements.
- Community & Ecosystem Growth
- Niche fine-tunes (e.g., cartoon style, medical videos).
- Wider adoption through plugins and mobile apps.
- Ethics & Regulation
- Watermarks and metadata for AI-generated content.
- Standards ensuring transparency in use cases like advertising.
The release of the Wan 2.2 API marks a significant advancement in text-to-video technology. With higher resolutions, enhanced motion consistency, and improved efficiency, Wan 2.2 opens new possibilities for developers and creators. Its flexible API interface empowers you to bring your ideas to life, setting a new standard for video generation.
Frequently Asked Questions
What is Wan 2.2?
Wan 2.2 is an open-source text-to-video model capable of generating high-quality, motion-consistent videos suitable for applications like advertising, filmmaking, and more.
What's new in Wan 2.2 compared to Wan 2.1?
- Support for higher resolutions (up to 1080p).
- Improved temporal consistency, reducing flickering.
- Introduction of a Mixture-of-Experts (MoE) architecture for better handling of complex scenes.
How does Wan 2.2 perform?
Wan 2.2 excels in speed, memory optimization, and output quality. When paired with high-end GPUs, it can quickly generate high-resolution video.
Novita AI is an all-in-one cloud platform that empowers your AI ambitions, offering integrated APIs, serverless computing, and GPU instances: the cost-effective tools you need. Eliminate infrastructure overhead, start free, and make your AI vision a reality.
Recommended Reading
- Unleash Your Creativity: YouTube Videos Voiceovers Mastery
- 2024 YouTube Video Notes Taker AI Market and Leading Players
- Transforming Images with Ease: Image to Video AI API