Model Description


---
license: other
license_name: skywork-license
license_link: LICENSE
pipeline_tag: video-to-video
---

Quantized GGUF version of the SkyReels V3 suite

📥 Original Links

Reference to Video https://huggingface.co/Skywork/SkyReels-V3-R2V-14B

Video Extension https://huggingface.co/Skywork/SkyReels-V3-V2V-14B

Talking Avatar https://huggingface.co/Skywork/SkyReels-V3-A2V-19B

Watch us on YouTube: @VantageWithAI


SkyReels Logo

SkyReels V3: Multimodal Video Generation Model


👋 Playground · 🔧 API Platform · 🤗 Hugging Face · 🤖 ModelScope · 📑 Technical Report


Welcome to the SkyReels V3 repository! This is the official release of our flagship video generation model, built upon a unified multimodal in-context learning framework. SkyReels V3 natively supports three core generative capabilities: 1) multi-subject video generation from reference images, 2) video generation guided by audio, and 3) video-to-video generation.

🔥🔥🔥 News!!

🎥 Demos

Reference to Video
Video Extension
Talking Avatar

The demos above showcase videos generated using our SkyReels-V3 unified multimodal in-context learning framework.

Introduction to SkyReels-V3

Reference to Video

SkyReels-V3 Multi-Reference Video Generation Model is a new-generation video synthesis system independently developed by SkyReels. The model enables users to input 1 to 4 reference images—including character portraits, object images, and background scenes—and generates coherent video sequences aligned with textual instructions, ensuring logical compositional relationships and narrative progression. With robust capabilities in dynamic scene generation, the model is widely applicable across various domains such as video production, social media entertainment, live-stream commerce, and product demonstration.

Key Features:

Supports fusion of up to 4 reference images, including character, object, and background references.

Exceptional subject consistency and composition coherence, with industry-leading motion generation quality.

Multiple aspect ratios: 1:1, 3:4, 4:3, 16:9, 9:16.

#### Model Overview
The model achieves high subject and background consistency while accurately responding to user instructions. To enhance its capability of preserving reference image content, the SkyReels team developed a comprehensive data processing pipeline. This pipeline employs a cross-frame pairing strategy to select reference frames from continuous video sequences and utilizes image editing models to extract subject images, simultaneously accomplishing background completion and semantic rewriting—effectively avoiding the "copy-paste" effect.

During the training phase, the SkyReels team introduced an image-video hybrid training mechanism and supported multi-resolution joint training, significantly improving the model's generalization performance. Evolving from the SkyReels V2 to the V3 version, the model has reached the level of industry-leading closed-source SOTA (state-of-the-art) models across multiple evaluation metrics, demonstrating top-tier comprehensive generation capabilities in the field.

#### 📊 Performance Comparison

| Model | Reference Consistency ↑ | Instruction Following ↑ | Visual Quality ↑ |
| --- | --- | --- | --- |
| Vidu Q2 | 0.5961 | 27.84 | 0.7877 |
| Kling 1.6 | 0.6630 | 29.23 | 0.8034 |
| PixVerse V5 | 0.6542 | 29.34 | 0.7976 |
| SkyReels V3 | 0.6698 | 27.22 | 0.8119 |

Video Extension

SkyReels-V3 Video Extension Model is a new-generation video generation system independently developed by SkyReels. The model allows users to input an existing video segment and extend it with coherent, logically consistent subsequent scenes based on textual instructions. It is widely applicable in scenarios such as video production, short-form series creation, live commerce, and product demonstration.

Key Features:

Dual Extension Modes: Supports both single-shot continuation and multi-shot switching (with 5 transition types), operable via manual selection or automatic detection.

Superior Visual Quality: Excellent aesthetic composition, robust motion quality, and seamless continuity preservation.

Outstanding Style Adherence: Strictly follows input visual styles (realistic, cinematic, or specialized aesthetics) with exceptional compatibility.

High-Definition Output: Ensures premium content quality, supporting 720P resolution.

Flexible Duration Control: Adjustable output length between 5 and 30 seconds for single-shot video extension.

Customizable Aspect Ratios: Supports multiple ratios including 1:1, 3:4, 4:3, 16:9, and 9:16.
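The duration and aspect-ratio settings above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes a 24 fps output rate (stated for the Talking Avatar model, not for the extension model) and rounds 720p dimensions to multiples of 16, a common but here assumed constraint for video generation models.

```python
def frame_count(duration_s: float, fps: int = 24) -> int:
    """Number of frames produced for a given extension duration.

    The 24 fps default is an assumption borrowed from the Talking
    Avatar section; the extension model's actual rate may differ.
    """
    return round(duration_s * fps)


def resolution_720p(aspect_w: int, aspect_h: int) -> tuple[int, int]:
    """Approximate 720p output size for a given aspect ratio.

    Keeps the shorter side at 720 pixels and rounds the longer side
    to a multiple of 16 (an assumed alignment constraint).
    """
    if aspect_w >= aspect_h:
        h = 720
        w = round(720 * aspect_w / aspect_h / 16) * 16
    else:
        w = 720
        h = round(720 * aspect_h / aspect_w / 16) * 16
    return w, h


print(frame_count(5))          # 120 frames at the 5 s minimum
print(frame_count(30))         # 720 frames at the 30 s maximum
print(resolution_720p(16, 9))  # (1280, 720)
print(resolution_720p(9, 16))  # (720, 1280)
```

So at the assumed 24 fps, the supported 5–30 second range corresponds to roughly 120–720 generated frames.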

#### Model Overview
The SkyReels-V3 Video Extension Engine deeply integrates spatiotemporal consistency modeling with large-scale video understanding, breaking through the frame-level limitations of traditional video generation to achieve a qualitative leap from "visual continuation" to "narrative continuation." As the industry's first engine supporting intelligent shot switching during video extension, SkyReels-V3 not only achieves top-tier temporal coherence but also extends generation capacity to minute-level durations through an innovative history enhancement mechanism, ensuring depth and stability in long-form video storytelling.

The engine accurately parses scene semantics, motion trajectories, and emotional context from the original video, while intelligently planning the composition, character behavior, and cinematography of the extended content. It supports both seamless single-shot continuation and multi-type shot switching—including professional techniques such as Cut-In, Cut-Out, Reverse Shot, Multi-Angle, and Cut Away—automatically generating extended clips with strong narrative logic and visual coherence. This empowers visual language with cinematic dynamism and tension, marking a true generational shift from frame interpolation to plot creation.

Technical Innovations:

- Unified multi-segment positional encoding and hybrid hierarchical data training enable precise motion prediction and smooth transitions in complex scenes.
- The architecture robustly handles challenges such as rapid motion, multi-person interactions, and abrupt scene changes, strictly ensuring physical plausibility and emotional consistency.
- In intelligent shot switching, the system dynamically plans cut rhythms and viewpoint variations based on video semantics and user prompts, generating freely lengthened, professionally shot-extended content within a unified style.

With outstanding generalization capabilities, SkyReels-V3 achieves state-of-the-art (SOTA) performance on core metrics including single-shot and multi-shot extension. It is widely adaptable to diverse scenarios such as live-action filmmaking, short-series industrial production, game cinematics, and security footage enhancement. The generated content delivers high-definition visuals, sharp details, and natural motion fluency, offering professional creators a "what-you-see-is-what-you-think" extension experience and redefining the boundaries of video generation.

Talking Avatar

Create with just one image and audio clip.

Key Features:

Superior visual quality and precise lip sync. Generate 720p HD videos at 24 fps for smooth and clear results. Supports multiple languages to ensure lip movements match the audio, enhancing authenticity.

Multi-style support. Compatible with real-life, cartoon, animal, and stylized characters—offering creative flexibility for brand ambassadors or virtual IPs.

Long-form video generation. Produce minute-long coherent videos for detailed explanations, news reports, training courses, and more.

Multi-character scenes. Optimized for group interactions, allowing role assignments to support dialogues, interviews, and other dynamic content.

#### Model Overview

Powered by advanced multimodal understanding techniques, SkyReels Avatars don’t just “hear sound”—they truly understand your content. By analyzing voice, image, and emotional cues, they generate expressions, movements, and camera language that naturally align with your intent.
Built on a scalable diffusion Transformer architecture and trained with audio-visual alignment strategies, our technology ensures highly accurate lip sync. Whether it’s Chinese, English, Korean, singing, or fast-paced dialogue—the lip movements match the pronunciation for a realistic audiovisual experience.

Using a keyframe-constrained generation framework, the model first structures key content before smoothly connecting transitions. This ensures consistent character appearance and fluid motion, even in long videos. Generate high-quality minute-long videos in one go—ideal for explanations, broadcasts, storytelling, and more.
From real people and anime characters to pets and artwork—any image can be turned into a lifelike digital avatar.

In internal evaluations against mainstream avatar models, the SkyReels model excels across multiple dimensions—overall quality, lip sync, and expressiveness—achieving a significantly higher overall rating.

#### 📊 Performance Comparison

| Model | Audio-Visual Sync ↑ | Visual Quality ↑ | Character Consistency ↑ |
| --- | --- | --- | --- |
| OmniHuman 1.5 | 8.25 | 4.60 | 0.81 |
| KlingAvatar | 8.01 | 4.55 | 0.78 |
| HunyuanAvatar | 6.72 | 4.50 | 0.74 |
| SkyReels V3 | 8.18 | 4.60 | 0.80 |

Acknowledgements

We would like to thank the contributors of the Wan 2.1, MultiTalk, xDiT, and diffusers repositories for their open research and contributions.

GitHub Star History

Star History Chart

GGUF File List

No GGUF files available