Model Description


---
license: other
license_name: skywork-license
license_link: LICENSE
pipeline_tag: video-to-video
---

Quantized GGUF version of the SkyReels V3 suite

📥 Original Links

Reference to Video https://huggingface.co/Skywork/SkyReels-V3-R2V-14B

Video Extension https://huggingface.co/Skywork/SkyReels-V3-V2V-14B

Talking Avatar https://huggingface.co/Skywork/SkyReels-V3-A2V-19B

Watch us on YouTube: @VantageWithAI


SkyReels Logo

SkyReels V3: Multimodal Video Generation Model


👋 Playground · 🔧 API Platform · 🤗 Hugging Face · 🤖 ModelScope · 📑 Technical Report


Welcome to the SkyReels V3 repository! This is the official release of our flagship video generation model, built upon a unified multimodal in-context learning framework. SkyReels V3 natively supports three core generative capabilities: 1) multi-subject video generation from reference images, 2) video generation guided by audio, and 3) video-to-video generation.

🔥🔥🔥 News!!

🎥 Demos

Reference to Video
Video Extension
Talking Avatar

The demos above showcase videos generated using our SkyReels-V3 unified multimodal in-context learning framework.

Introduction to SkyReels-V3

Reference to Video

SkyReels-V3 Multi-Reference Video Generation Model is a new-generation video synthesis system independently developed by SkyReels. The model enables users to input 1 to 4 reference images—including character portraits, object images, and background scenes—and generates coherent video sequences aligned with textual instructions, ensuring logical compositional relationships and narrative progression. With robust capabilities in dynamic scene generation, the model is widely applicable across various domains such as video production, social media entertainment, live-stream commerce, and product demonstration.

Key Features:

Supports fusion of up to 4 reference images, including character, object, and background references.

Exceptional subject consistency and composition coherence, with industry-leading motion generation quality.

Multiple aspect ratios: 1:1, 3:4, 4:3, 16:9, 9:16.

#### Model Overview
The model achieves high subject and background consistency while accurately responding to user instructions. To enhance its capability of preserving reference image content, the SkyReels team developed a comprehensive data processing pipeline. This pipeline employs a cross-frame pairing strategy to select reference frames from continuous video sequences and utilizes image editing models to extract subject images, simultaneously accomplishing background completion and semantic rewriting—effectively avoiding the "copy-paste" effect.

During the training phase, the SkyReels team introduced an image-video hybrid training mechanism and supported multi-resolution joint training, significantly improving the model's generalization performance. Evolving from the SkyReels V2 to the V3 version, the model has reached the level of industry-leading closed-source SOTA (state-of-the-art) models across multiple evaluation metrics, demonstrating top-tier comprehensive generation capabilities in the field.

#### 📊 Performance Comparison

| Model | Reference Consistency ↑ | Instruction Following ↑ | Visual Quality ↑ |
| --- | --- | --- | --- |
| Vidu Q2 | 0.5961 | 27.84 | 0.7877 |
| Kling 1.6 | 0.6630 | 29.23 | 0.8034 |
| PixVerse V5 | 0.6542 | 29.34 | 0.7976 |
| SkyReels V3 | 0.6698 | 27.22 | 0.8119 |

Video Extension

SkyReels-V3 Video Extension Model is a new-generation video generation system independently developed by SkyReels. The model allows users to input an existing video segment and extend it with coherent, logically consistent subsequent scenes based on textual instructions. It is widely applicable in scenarios such as video production, short-form series creation, live commerce, and product demonstration.

Key Features:

Dual Extension Modes: Supports both single-shot continuation and multi-shot switching (with 5 transition types), operable via manual selection or automatic detection.

Superior Visual Quality: Excellent aesthetic composition, robust motion quality, and seamless continuity preservation.

Outstanding Style Adherence: Strictly follows input visual styles (realistic, cinematic, or specialized aesthetics) with exceptional compatibility.

High-Definition Output: Ensures premium content quality, supporting 720P resolution.

Flexible Duration Control: Adjustable output length between 5 and 30 seconds for single-shot video extension.

Customizable Aspect Ratios: Supports multiple ratios including 1:1, 3:4, 4:3, 16:9, and 9:16.
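The duration and aspect-ratio settings above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes a 24 fps output rate (stated for the Talking Avatar model, not for the extension model) and rounds 720p dimensions to multiples of 16, a common but here assumed constraint for video generation models.

```python
def frame_count(duration_s: float, fps: int = 24) -> int:
    """Number of frames produced for a given extension duration.

    The 24 fps default is an assumption borrowed from the Talking
    Avatar section; the extension model's actual rate may differ.
    """
    return round(duration_s * fps)


def resolution_720p(aspect_w: int, aspect_h: int) -> tuple[int, int]:
    """Approximate 720p output size for a given aspect ratio.

    Keeps the shorter side at 720 pixels and rounds the longer side
    to a multiple of 16 (an assumed alignment constraint).
    """
    if aspect_w >= aspect_h:
        h = 720
        w = round(720 * aspect_w / aspect_h / 16) * 16
    else:
        w = 720
        h = round(720 * aspect_h / aspect_w / 16) * 16
    return w, h


print(frame_count(5))          # 120 frames at the 5 s minimum
print(frame_count(30))         # 720 frames at the 30 s maximum
print(resolution_720p(16, 9))  # (1280, 720)
print(resolution_720p(9, 16))  # (720, 1280)
```

So at the assumed 24 fps, the supported 5–30 second range corresponds to roughly 120–720 generated frames.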

#### Model Overview
The SkyReels-V3 Video Extension Engine deeply integrates spatiotemporal consistency modeling with large-scale video understanding, breaking through the frame-level limitations of traditional video generation to achieve a qualitative leap from "visual continuation" to "narrative continuation." As the industry's first engine supporting intelligent shot switching during video extension, SkyReels-V3 not only achieves top-tier temporal coherence but also extends generation capacity to minute-level durations through an innovative history enhancement mechanism, ensuring depth and stability in long-form video storytelling.

The engine accurately parses scene semantics, motion trajectories, and emotional context from the original video, while intelligently planning the composition, character behavior, and cinematography of the extended content. It supports both seamless single-shot continuation and multi-type shot switching—including professional techniques such as Cut-In, Cut-Out, Reverse Shot, Multi-Angle, and Cut Away—automatically generating extended clips with strong narrative logic and visual coherence. This empowers visual language with cinematic dynamism and tension, marking a true generational shift from frame interpolation to plot creation.

Technical Innovations:

- Unified multi-segment positional encoding and hybrid hierarchical data training enable precise motion prediction and smooth transitions in complex scenes.
- The architecture robustly handles challenges such as rapid motion, multi-person interactions, and abrupt scene changes, strictly ensuring physical plausibility and emotional consistency.
- In intelligent shot switching, the system dynamically plans cut rhythms and viewpoint variations based on video semantics and user prompts, generating freely lengthened, professionally shot-extended content within a unified style.

With outstanding generalization capabilities, SkyReels-V3 achieves state-of-the-art (SOTA) performance on core metrics including single-shot and multi-shot extension. It is widely adaptable to diverse scenarios such as live-action filmmaking, short-series industrial production, game cinematics, and security footage enhancement. The generated content delivers high-definition visuals, sharp details, and natural motion fluency, offering professional creators a "what-you-see-is-what-you-think" extension experience and redefining the boundaries of video generation.

Talking Avatar

Create with just one image and audio clip.

Key Features:

Superior visual quality and precise lip sync. Generate 720p HD videos at 24 fps for smooth and clear results. Supports multiple languages to ensure lip movements match the audio, enhancing authenticity.

Multi-style support. Compatible with real-life, cartoon, animal, and stylized characters—offering creative flexibility for brand ambassadors or virtual IPs.

Long-form video generation. Produce minute-long coherent videos for detailed explanations, news reports, training courses, and more.

Multi-character scenes. Optimized for group interactions, allowing role assignments to support dialogues, interviews, and other dynamic content.

#### Model Overview

Powered by advanced multimodal understanding techniques, SkyReels Avatars don’t just “hear sound”—they truly understand your content. By analyzing voice, image, and emotional cues, they generate expressions, movements, and camera language that naturally align with your intent.
Built on a scalable diffusion Transformer architecture and trained with audio-visual alignment strategies, our technology ensures highly accurate lip sync. Whether it’s Chinese, English, Korean, singing, or fast-paced dialogue—the lip movements match the pronunciation for a realistic audiovisual experience.

Using a keyframe-constrained generation framework, the model first structures key content before smoothly connecting transitions. This ensures consistent character appearance and fluid motion, even in long videos. Generate high-quality minute-long videos in one go—ideal for explanations, broadcasts, storytelling, and more.
From real people and anime characters to pets and artwork—any image can be turned into a lifelike digital avatar.

In internal evaluations against mainstream avatar models, the SkyReels model excels across multiple dimensions—overall quality, lip sync, and expressiveness—achieving a significantly higher overall rating.

#### 📊 Performance Comparison

| Model | Audio-Visual Sync ↑ | Visual Quality ↑ | Character Consistency ↑ |
| --- | --- | --- | --- |
| OmniHuman 1.5 | 8.25 | 4.60 | 0.81 |
| KlingAvatar | 8.01 | 4.55 | 0.78 |
| HunyuanAvatar | 6.72 | 4.50 | 0.74 |
| SkyReels V3 | 8.18 | 4.60 | 0.80 |

Acknowledgements

We would like to thank the contributors of the Wan 2.1, MultiTalk, xDiT, and diffusers repositories for their open research and contributions.

GitHub Star History

Star History Chart

GGUF File List

No GGUF files available