Alibaba's Wan2.2-S2V boosts digital human creation

Summary is AI-generated, newsdesk-reviewed
  • Alibaba's Wan2.2-S2V converts photos to film-quality avatars with speech capabilities.
  • Wan2.2-S2V supports multiple video formats, offering flexible resolutions for diverse creation needs.
  • Model merges audio-driven animation with text-guided motion for expressive character performances.

Alibaba has unveiled Wan2.2-S2V (Speech-to-Video), its latest open-source model designed for digital human video creation. This innovative tool converts portrait photos into film-quality avatars capable of speaking, singing, and performing. 

Part of Alibaba’s Wan2.2 video generation series, the new model can generate high-quality animated videos from a single image and an audio clip.

Visual representations

Wan2.2-S2V offers versatile character animation capabilities, enabling the creation of videos across multiple framing options, including portrait, bust, and full-body perspectives.

It can generate character actions and environmental factors dynamically based on prompt instructions, allowing professional content creators to capture precise visual representations tailored to specific storytelling and design requirements.

Advanced audio-driven animation technology

Powered by advanced audio-driven animation technology, the model delivers lifelike character performances, ranging from natural dialogue to musical performances, and seamlessly handles multiple characters within a scene.

Creators can now transform voice recordings into lifelike animated movements, supporting a diverse range of avatars, from cartoons and animals to stylised characters.

Flexible output resolutions

To meet the diverse needs of professional content creators, the technology provides flexible output resolutions of 480P and 720P. 

This ensures high-quality visual output that meets various professional and creative standards, making it suitable for both social media content and professional presentations.

Innovative technologies

Wan2.2-S2V transcends traditional talking-head animations by combining text-guided global motion control with audio-driven fine-grained local movements. This enables natural and expressive character performances across complex and challenging scenarios.

Another key breakthrough lies in the model's innovative frame processing technique. By compressing historical frames of arbitrary length into a single, compact latent representation, the technology significantly reduces computational overhead.

This approach allows for remarkably stable long-video generation, addressing a critical challenge in extended animated content production.
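
The announcement does not describe how the compression works internally. As a toy illustration only, the sketch below shows the general idea of folding a history of any length into one fixed-size summary, so that the cost of conditioning on past frames stays constant as a video grows. The function name `compress_history`, the `decay` parameter, and the use of an exponentially weighted average are all assumptions made for this example, not Wan2.2-S2V's actual method.

```python
from typing import List, Sequence


def compress_history(frames: Sequence[Sequence[float]], decay: float = 0.8) -> List[float]:
    """Fold any number of per-frame feature vectors into one fixed-size vector.

    Hypothetical stand-in for a learned compressor: an exponentially
    weighted average in which more recent frames carry more weight.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    summary = [0.0] * dim
    weight_sum = 0.0
    for frame in frames:  # frames are ordered oldest first
        if len(frame) != dim:
            raise ValueError("all frames must have the same feature size")
        # Shrink the running summary, then add the newest frame at full weight.
        summary = [decay * s + x for s, x in zip(summary, frame)]
        weight_sum = decay * weight_sum + 1.0
    # Normalise so the result stays on the same scale as a single frame.
    return [s / weight_sum for s in summary]


if __name__ == "__main__":
    short_clip = [[0.1, 0.2, 0.3]] * 4
    long_clip = [[0.1, 0.2, 0.3]] * 4000
    # Both histories reduce to a vector of the same length.
    print(len(compress_history(short_clip)), len(compress_history(long_clip)))
```

Whatever the clip length, the summary has the same size as one frame's features, which is the property the article credits for lower computational overhead and more stable long-video generation.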

Alibaba’s research team

The model’s advanced capabilities are further amplified by its comprehensive training methodology. Alibaba’s research team constructed a large-scale audio-visual dataset specifically tailored to film and television production scenarios.

Using a multi-resolution training approach, Wan2.2-S2V supports flexible video generation across diverse formats – from vertical short-form content to traditional horizontal film and television productions.

Alibaba Cloud’s open-source community

The Wan2.2-S2V model is available for download on Hugging Face and GitHub, as well as on Alibaba Cloud’s open-source community, ModelScope.

A major contributor to the global open-source community, Alibaba open-sourced the Wan2.1 models in February 2025 and the Wan2.2 models in July. To date, the Wan series has generated over 6.9 million downloads on Hugging Face and ModelScope.
