Wan2.1: An Open-Source AI Model Outperforms Sora


Key Highlights

Open-Source Availability: Wan2.1 is an open-source AI model enabling cost-effective, high-quality video generation for academics, researchers, and businesses.

Versatile Capabilities: Supports text-to-video (T2V), image-to-video (I2V), video editing, and text-to-image (T2I), and can generate multilingual on-screen text in Chinese and English for subtitles.

Hardware Requirements: T2V-1.3B has only 1.3B parameters, significantly reducing hardware requirements.

Model Architecture and Innovations: Features Wan-VAE for 3D encoding, Video Diffusion DiT, and a robust pipeline for high-quality training datasets.

VBench and Performance Evaluation: Outperforms competitors like Sora with 86.22% on VBench, excelling in ID consistency, spatial accuracy, and action instruction execution.

Novita AI offers an API for Wan 2.1. Just sign up for a free trial and use the API with simple requests.

Wan2.1 is an open-source AI model developed by Alibaba Cloud for advanced video generation. Designed for high performance, efficiency, and versatility, it caters to a wide range of creative and professional applications. The models are available on ModelScope, Alibaba Cloud's AI model community, and on Hugging Face.


Start a free trial on Novita AI today. To integrate the Wan 2.1 API, visit our developer docs for more details.

Novita AI offers highly competitive pricing: a Wan 2.1 720P 5-second video costs only $0.30 per video, while a similar video on Replicate costs $2.39 per video.
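
For orientation, a text-to-video request to a hosted Wan 2.1 endpoint looks roughly like the sketch below. The endpoint path and payload field names here are illustrative assumptions, not the documented schema; consult the Novita AI developer docs for the actual request format.

```python
# Minimal sketch of a text-to-video request to a hosted Wan 2.1 API.
# NOTE: the endpoint path and payload fields are illustrative
# assumptions, not the documented schema; see the Novita AI developer
# docs for the real request format.
import os
import requests

API_KEY = os.environ["NOVITA_API_KEY"]  # issued after signing up

resp = requests.post(
    "https://api.novita.ai/v3/async/wan-t2v",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "prompt": "A red lantern drifting over a river at night",
        "resolution": "720p",    # hypothetical parameter name
        "duration_seconds": 5,   # hypothetical parameter name
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # async video APIs typically return a task id to poll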

Open-Source Availability

Alibaba Cloud has open-sourced its Wan2.1 series of AI models for video generation. This initiative aims to lower accessibility barriers and enable businesses to create high-quality visual content cost-effectively. By releasing these models as open source, academics, researchers, and commercial entities can harness the power of AI for their projects without significant upfront costs.

Versatile Capabilities of Wan2.1

Wan2.1 excels in a variety of tasks, making it a versatile tool for video generation:

  • Text-to-Video (T2V)
  • Image-to-Video (I2V)
  • Video Editing
  • Text-to-Image (T2I)

Notably, Wan2.1 is the first video model capable of rendering on-screen text in both Chinese and English, a robust text-generation capability that broadens its practical applications.

Hardware Requirements

Below is a detailed summary of the hardware requirements for the four Wan2.1 models. The table outlines each model’s functionality, supported resolution, model size, hardware demand, and recommended GPUs for optimal performance.

| Model Name | Function | Resolution Support | Model Size | Hardware Demand | Recommended GPU |
| --- | --- | --- | --- | --- | --- |
| T2V-14B | Text-to-Video (T2V) | 480P / 720P | 14B | ⭐⭐⭐⭐ | A100 / RTX 3090 / RTX 4090 |
| I2V-14B-720P | Image-to-Video (I2V) | 720P | 14B | ⭐⭐⭐⭐ | A100 / RTX 3090 / RTX 4090 |
| I2V-14B-480P | Image-to-Video (I2V) | 480P | 14B | ⭐⭐⭐ | RTX 3090 / RTX 4070 Ti |
| T2V-1.3B | Text-to-Video (T2V) | Low resolution | 1.3B | ⭐⭐ | RTX 3060 / RTX 4060 or higher |
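
As a rough guide, the snippet below sketches how you might pick a variant based on available GPU memory. The VRAM thresholds are rules of thumb inferred from the table above, not official requirements.

```python
# Sketch: choose a Wan2.1 variant from available GPU memory.
# The VRAM thresholds below are rough rules of thumb inferred from
# the table above, not official requirements.
import torch

def suggest_wan_model() -> str:
    if not torch.cuda.is_available():
        return "no local GPU: consider a hosted API instead"
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 24:      # A100 / RTX 3090 / RTX 4090 class
        return "T2V-14B or I2V-14B-720P"
    if total_gb >= 12:      # RTX 4070 Ti class
        return "I2V-14B-480P"
    return "T2V-1.3B"       # designed for consumer-grade GPUs

print(suggest_wan_model())
```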

Model Architecture and Key Innovations

Wan2.1 is built on a diffusion transformer paradigm, enhanced by the Flow Matching framework. Its key innovations include:

  • Wan-VAE: A 3D variational autoencoder designed for efficient compression and high fidelity in motion reproduction. It encodes and decodes 1080P videos while maintaining temporal coherence. The model integrates multiple strategies to optimize spatio-temporal compression, reduce memory usage, and ensure temporal causality.
  • Video Diffusion DiT: Wan2.1 leverages the Flow Matching framework within Diffusion Transformers, utilizing a T5 Encoder for multilingual text input and cross-attention for embedding text into the model. A shared MLP with SiLU and Linear layers predicts six modulation parameters for time embeddings, enabling each transformer block to learn distinct biases. This architecture significantly improves performance without increasing parameter scale.
  • Data curation pipeline: The team curated and deduplicated a candidate dataset comprising a vast amount of image and video data, applying a four-step cleaning process focused on fundamental dimensions, visual quality, and motion quality. This robust pipeline yields high-quality, diverse, and large-scale training sets of images and videos.
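
To make the modulation mechanism concrete, here is a minimal PyTorch sketch of a shared SiLU + Linear MLP that predicts six modulation parameters from the time embedding, with a per-block learnable bias. The class name, shapes, and bias scheme are illustrative assumptions in the spirit of adaLN-style DiT blocks, not code from the Wan2.1 repository.

```python
# Sketch of a shared time-modulation MLP for DiT-style blocks.
# Names and shapes are illustrative, not taken from the Wan2.1 codebase.
import torch
import torch.nn as nn

class TimeModulation(nn.Module):
    def __init__(self, dim: int, num_blocks: int):
        super().__init__()
        # One MLP shared by all transformer blocks: SiLU + Linear
        # predicting six parameters (shift/scale/gate for the attention
        # sublayer and for the feed-forward sublayer).
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # Per-block learnable bias, so each block gets a distinct
        # modulation without duplicating the whole MLP.
        self.block_bias = nn.Parameter(torch.zeros(num_blocks, 6 * dim))

    def forward(self, t_emb: torch.Tensor, block_idx: int):
        # t_emb: (batch, dim) time embedding
        params = self.mlp(t_emb) + self.block_bias[block_idx]
        # -> six (batch, dim) tensors: shift/scale/gate for attn and FFN
        return params.chunk(6, dim=-1)
```

Sharing one MLP while giving each block its own bias is what lets every block learn distinct modulation without growing the parameter count, matching the claim above that performance improves without increasing parameter scale.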

VBench Evaluation

VBench is a robust and comprehensive benchmark suite designed to evaluate video generative models. It breaks down “video generation quality” into hierarchical, disentangled, and specific dimensions, each equipped with tailored prompts and evaluation methods. The main evaluation metrics include:

  • Large Motion Generation
  • Human Artifacts
  • Pixel-Level Stability
  • ID Consistency
  • Physical Plausibility
  • Smoothness
  • Comprehensive Image Quality
  • Scene Generation Quality
  • Stylization Ability
  • Single Object Accuracy
  • Multiple Object Accuracy
  • Spatial Position Accuracy
  • Camera Control
  • Action Instruction Following
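
To illustrate how a single headline number can come out of per-dimension results, the snippet below aggregates a few dimension scores with a weighted average. The dimension names and uniform weights are placeholders, not VBench's official dimensions or weighting.

```python
# Sketch: aggregating per-dimension scores into one headline number,
# VBench-style. Dimension names and uniform weights are placeholders,
# not VBench's official dimensions or weighting.
scores = {
    "id_consistency": 0.95,
    "spatial_position_accuracy": 0.88,
    "smoothness": 0.97,
    "action_instruction_following": 0.90,
}
weights = {name: 1.0 for name in scores}  # uniform, for illustration

overall = sum(scores[n] * weights[n] for n in scores) / sum(weights.values())
print(f"overall score: {overall:.2%}")  # -> overall score: 92.50%
```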

The purpose of VBench is to provide valuable insights into the strengths and weaknesses of individual models, enabling fine-grained and objective evaluation. These insights not only guide future developments in video generation but also help improve model performance. To ensure alignment with human perception, VBench incorporates human preference annotations, validating its relevance and reliability as a benchmark. The performance of Wan2.1 is presented in the chart below:

[VBench results chart; source: Alizila]

Additionally, Wan-Bench was used to assess the T2V-1.3B model, which outperformed larger open-source counterparts across key metrics. The chart below summarizes these advancements:

[Wan-Bench results chart]

Wan2.1 vs. Sora

Comprehensive Performance Superiority:

  • Wan2.1 achieves a higher overall score on VBench, with 86.22%, surpassing Sora’s 84.28%, and demonstrates stronger performance across multiple sub-dimensions.

Support for Chinese and English Subtitle Generation:

  • Wan2.1 is the first video generation model to support both Chinese and English subtitle generation, giving it a unique advantage in multilingual scenarios. Sora does not offer this functionality.

Performance in Sub-Dimensions:

  • ID Consistency: Wan2.1 excels at maintaining the consistency of subjects within videos.
  • Single Object Accuracy: Wan2.1 generates more precise results for single-object scenarios.
  • Spatial Position Accuracy: Wan2.1 significantly outperforms Sora in handling spatial logic relationships.
  • Action Instruction Execution: Wan2.1 demonstrates better understanding and execution of complex action instructions.

Open Source and Accessibility:

  • Wan2.1 provides open-source code, making it more accessible and easier for developers to use and integrate.
  • Sora, although offering APIs, is not open-source, which limits its flexibility.

Areas for Improvement:

  • Wan2.1 is slightly inferior to Sora in terms of motion smoothness and large motion generation, but the gap is minimal.

Applications

Content Creation

  • Enables automated generation of high-quality videos for social media, marketing, and entertainment.
  • Supports stylized video generation to match specific artistic or branding needs.

Education and E-Learning

  • Generates educational videos with custom visuals and subtitles in both Chinese and English.
  • Facilitates the creation of engaging and personalized learning content.

Film and Animation

  • Assists in creating storyboards, video prototypes, or entire scenes based on textual or image inputs.
  • Supports multilingual subtitles, making it suitable for global audiences.

Advertising and Marketing

  • Produces customized video advertisements tailored to target audiences.
  • Enhances campaigns with visually compelling and context-sensitive content.

Gaming

  • Generates in-game cutscenes or animations based on textual descriptions or character images.
  • Creates dynamic video assets for game development and storytelling.

Multilingual Communication

  • Supports both Chinese and English subtitle generation, making it ideal for multilingual presentations and media.

Prototyping and Visualization

  • Aids in visualizing concepts, ideas, or architectural designs through video.
  • Generates dynamic representations of projects for presentations or pitches.

Accessibility and Inclusion

  • Creates videos with subtitles, improving accessibility for hearing-impaired audiences.
  • Multilingual support facilitates content creation for diverse user groups.

Wan2.1 represents a significant advancement in AI-driven video generation. Its open-source nature, multilingual capabilities, and superior performance across benchmarks like VBench position it as a versatile and accessible tool for creative and professional applications. While it slightly lags behind Sora in motion smoothness and large motion generation, its overall capabilities, innovative architecture, and wide-ranging applications make it a game-changer for industries like education, media, gaming, and more.

Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.

