vace gradio and Readme

2025-07-13 11:10:11 +00:00 · 2025-05-14 12:28:06 +08:00 · 2025-05-14 12:28:06 +08:00 · 5142905303
commit 5142905303
parent 1a55891718
3 changed files with 380 additions and 9 deletions
--- a/README.md
+++ b/README.md
@@ -27,6 +27,7 @@ In this repository, we present **Wan2.1**, a comprehensive and open suite of vid

 ## 🔥 Latest News!!

+* May 14, 2025: 👋 We introduce **Wan2.1** [VACE](https://github.com/ali-vilab/VACE), an all-in-one model for video creation and editing, along with its [inference code](#run-vace), [weights](#model-download), and [technical report](https://arxiv.org/abs/2503.07598)!
 * Apr 17, 2025: 👋 We introduce **Wan2.1** [FLF2V](#run-first-last-frame-to-video-generation) with its inference code and weights!
 * Mar 21, 2025: 👋 We are excited to announce the release of the **Wan2.1** [technical report](https://files.alicdn.com/tpsservice/5c9de1c74de03972b7aa657e5a54756b.pdf). We welcome discussions and feedback!
 * Mar 3, 2025: 👋 **Wan2.1**'s T2V and I2V have been integrated into Diffusers ([T2V](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan#diffusers.WanPipeline) | [I2V](https://huggingface.co/docs/diffusers/main/en/api/pipelines/wan#diffusers.WanImageToVideoPipeline)). Feel free to give it a try!
@@ -64,7 +65,13 @@ If your work has improved **Wan2.1** and you would like more people to see it, p
    - [ ] ComfyUI integration
    - [ ] Diffusers integration
    - [ ] Diffusers + Multi-GPU Inference
-
+- Wan2.1 VACE
+    - [x] Multi-GPU Inference code of the 14B and 1.3B models
+    - [x] Checkpoints of the 14B and 1.3B models
+    - [x] Gradio demo
+    - [x] ComfyUI integration
+    - [ ] Diffusers integration
+    - [ ] Diffusers + Multi-GPU Inference

 ## Quickstart

@@ -84,13 +91,15 @@ pip install -r requirements.txt

 #### Model Download

-| Models       | Download Link                                                                                                                                       |    Notes                      |
-|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
-| T2V-14B      | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)      🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B)         | Supports both 480P and 720P
-| I2V-14B-720P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P)    🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P) | Supports 720P
-| I2V-14B-480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P)    🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P) | Supports 480P
-| T2V-1.3B     | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B)     🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B)        | Supports 480P
-| FLF2V-14B    | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-FLF2V-14B-720P)     🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-FLF2V-14B-720P)      | Supports 720P
+| Models       | Download Link                                                                                                                                           |    Notes                      |
+|--------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
+| T2V-14B      | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)      🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B)             | Supports both 480P and 720P
+| I2V-14B-720P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P)    🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P)     | Supports 720P
+| I2V-14B-480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P)    🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P)     | Supports 480P
+| T2V-1.3B     | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B)     🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B)            | Supports 480P
+| FLF2V-14B    | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-FLF2V-14B-720P)     🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-FLF2V-14B-720P) | Supports 720P
+| VACE-1.3B    | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-VACE-1.3B)     🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-VACE-1.3B)          | Supports 480P
+| VACE-14B     | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-VACE-14B)     🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-VACE-14B)        | Supports both 480P and 720P

 > 💡Note: 
 > * The 1.3B model is capable of generating videos at 720P resolution. However, due to limited training at this resolution, the results are generally less stable compared to 480P. For optimal performance, we recommend using 480P resolution. 
@@ -448,6 +457,73 @@ DASH_API_KEY=your_key python flf2v_14B_singleGPU.py --prompt_extend_method 'dash
 ```


+#### Run VACE
+
+[VACE](https://github.com/ali-vilab/VACE) now supports two models (1.3B and 14B) and two main resolutions (480P and 720P). 
+The input supports any resolution, but to achieve optimal results, the video size should fall within a specific range.
+The parameters and configurations for these models are as follows:
+
+<table>
+    <thead>
+        <tr>
+            <th rowspan="2">Task</th>
+            <th colspan="2">Resolution</th>
+            <th rowspan="2">Model</th>
+        </tr>
+        <tr>
+            <th>480P(~81x480x832)</th>
+            <th>720P(~81x720x1280)</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+            <td>VACE</td>
+            <td style="color: green; text-align: center; vertical-align: middle;">✔️</td>
+            <td style="color: green; text-align: center; vertical-align: middle;">✔️</td>
+            <td>Wan2.1-VACE-14B</td>
+        </tr>
+        <tr>
+            <td>VACE</td>
+            <td style="color: green; text-align: center; vertical-align: middle;">✔️</td>
+            <td style="color: red; text-align: center; vertical-align: middle;">❌</td>
+            <td>Wan2.1-VACE-1.3B</td>
+        </tr>
+    </tbody>
+</table>
+
+In VACE, users can provide a text prompt plus optional video, mask, and image inputs for video generation or editing. Detailed instructions for using VACE can be found in the [User Guide](https://github.com/ali-vilab/VACE/blob/main/UserGuide.md).
+The execution process is as follows:
+
+##### (1) Preprocessing
+
+User-collected materials need to be preprocessed into VACE-recognizable inputs, including `src_video`, `src_mask`, `src_ref_images`, and `prompt`.
+For R2V (Reference-to-Video Generation), you may skip this preprocessing, but for V2V (Video-to-Video Editing) and MV2V (Masked Video-to-Video Editing) tasks, additional preprocessing is required to obtain video with conditions such as depth, pose or masked regions.
+For more details, please refer to [vace_preproccess](https://github.com/ali-vilab/VACE/blob/main/vace/vace_preproccess.py).
+
+##### (2) CLI inference
+
+- Single-GPU inference
+```sh
+python generate.py --task vace-1.3B --size 832*480 --ckpt_dir ./Wan2.1-VACE-1.3B --src_ref_images examples/girl.png,examples/snake.png --prompt "在一个欢乐而充满节日气氛的场景中，穿着鲜艳红色春服的小女孩正与她的可爱卡通蛇嬉戏。她的春服上绣着金色吉祥图案，散发着喜庆的气息，脸上洋溢着灿烂的笑容。蛇身呈现出亮眼的绿色，形状圆润，宽大的眼睛让它显得既友善又幽默。小女孩欢快地用手轻轻抚摸着蛇的头部，共同享受着这温馨的时刻。周围五彩斑斓的灯笼和彩带装饰着环境，阳光透过洒在她们身上，营造出一个充满友爱与幸福的新年氛围。"
+```
+
+- Multi-GPU inference using FSDP + xDiT USP
+
+```sh
+torchrun --nproc_per_node=8 generate.py --task vace-14B --size 1280*720 --ckpt_dir ./Wan2.1-VACE-14B --dit_fsdp --t5_fsdp --ulysses_size 8 --src_ref_images examples/girl.png,examples/snake.png --prompt "在一个欢乐而充满节日气氛的场景中，穿着鲜艳红色春服的小女孩正与她的可爱卡通蛇嬉戏。她的春服上绣着金色吉祥图案，散发着喜庆的气息，脸上洋溢着灿烂的笑容。蛇身呈现出亮眼的绿色，形状圆润，宽大的眼睛让它显得既友善又幽默。小女孩欢快地用手轻轻抚摸着蛇的头部，共同享受着这温馨的时刻。周围五彩斑斓的灯笼和彩带装饰着环境，阳光透过洒在她们身上，营造出一个充满友爱与幸福的新年氛围。"
+```
+
+##### (3) Running local Gradio
+- Single-GPU inference
+```sh
+python gradio/vace.py --ckpt_dir ./Wan2.1-VACE-1.3B
+```
+
+- Multi-GPU inference using FSDP + xDiT USP
+```sh
+python gradio/vace.py --mp --ulysses_size 8 --ckpt_dir ./Wan2.1-VACE-14B/
+```
+
 #### Run Text-to-Image Generation

 Wan2.1 is a unified model for both image and video generation. Since it was trained on both types of data, it can also generate images. The command for generating images is similar to video generation, as follows:
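
Note on the VACE commands added above: the single-GPU and multi-GPU `generate.py` invocations differ only in the launcher and the parallelism flags. As a minimal sketch, the helper below assembles either argument list. `build_vace_command` is a hypothetical name and not part of Wan2.1; it uses only flags that appear in the README, it assumes `--ulysses_size` equals the process count (as in the README's 8-GPU example), and the English prompt is a placeholder.

```python
import shlex
import sys


def build_vace_command(task, size, ckpt_dir, prompt, ref_images=(), nproc=1):
    """Assemble the argv for generate.py (hypothetical helper, not part of Wan2.1)."""
    if nproc > 1:
        # Multi-GPU: launch through torchrun and enable FSDP + Ulysses parallelism.
        # Assumption: ulysses_size equals the number of processes.
        cmd = ["torchrun", f"--nproc_per_node={nproc}", "generate.py"]
        extra = ["--dit_fsdp", "--t5_fsdp", "--ulysses_size", str(nproc)]
    else:
        cmd = [sys.executable, "generate.py"]
        extra = []
    cmd += ["--task", task, "--size", size, "--ckpt_dir", ckpt_dir] + extra
    if ref_images:
        # Reference images are passed as a single comma-separated value.
        cmd += ["--src_ref_images", ",".join(ref_images)]
    cmd += ["--prompt", prompt]
    return cmd


cmd = build_vace_command(
    task="vace-1.3B",
    size="832*480",
    ckpt_dir="./Wan2.1-VACE-1.3B",
    prompt="A girl in red plays with a cartoon snake.",
    ref_images=["examples/girl.png", "examples/snake.png"],
)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root. Building a list rather than a shell string avoids quoting problems with long prompts.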
--- a/gradio/vace.py
+++ b/gradio/vace.py
@@ -0,0 +1,295 @@
+# -*- coding: utf-8 -*-
+# Copyright (c) Alibaba, Inc. and its affiliates.
+
+import argparse
+import os
+import sys
+import datetime
+import imageio
+import numpy as np
+import torch
+import gradio as gr
+
+sys.path.insert(0, os.path.sep.join(os.path.realpath(__file__).split(os.path.sep)[:-2]))
+import wan
+from wan import WanVace, WanVaceMP
+from wan.configs import WAN_CONFIGS, SIZE_CONFIGS
+
+
+class FixedSizeQueue:
+    def __init__(self, max_size):
+        self.max_size = max_size
+        self.queue = []
+    def add(self, item):
+        self.queue.insert(0, item)
+        if len(self.queue) > self.max_size:
+            self.queue.pop()
+    def get(self):
+        return self.queue
+    def __repr__(self):
+        return str(self.queue)
+
+
+class VACEInference:
+    def __init__(self, cfg, skip_load=False, gallery_share=True, gallery_share_limit=5):
+        self.cfg = cfg
+        self.save_dir = cfg.save_dir
+        self.gallery_share = gallery_share
+        self.gallery_share_data = FixedSizeQueue(max_size=gallery_share_limit)
+        if not skip_load:
+            if not cfg.mp:  # read the flag from cfg, not the module-level `args`
+                self.pipe = WanVace(
+                    config=WAN_CONFIGS[cfg.model_name],
+                    checkpoint_dir=cfg.ckpt_dir,
+                    device_id=0,
+                    rank=0,
+                    t5_fsdp=False,
+                    dit_fsdp=False,
+                    use_usp=False,
+                )
+            else:
+                self.pipe = WanVaceMP(
+                    config=WAN_CONFIGS[cfg.model_name],
+                    checkpoint_dir=cfg.ckpt_dir,
+                    use_usp=True,
+                    ulysses_size=cfg.ulysses_size,
+                    ring_size=cfg.ring_size
+                )
+
+
+    def create_ui(self, *args, **kwargs):
+        gr.Markdown("""
+                    <div style="text-align: center; font-size: 24px; font-weight: bold; margin-bottom: 15px;">
+                        <a href="https://ali-vilab.github.io/VACE-Page/" style="text-decoration: none; color: inherit;">VACE-WAN Demo</a>
+                    </div>
+                    """)
+        with gr.Row(variant='panel', equal_height=True):
+            with gr.Column(scale=1, min_width=0):
+                self.src_video = gr.Video(
+                    label="src_video",
+                    sources=['upload'],
+                    value=None,
+                    interactive=True)
+            with gr.Column(scale=1, min_width=0):
+                self.src_mask = gr.Video(
+                    label="src_mask",
+                    sources=['upload'],
+                    value=None,
+                    interactive=True)
+        #
+        with gr.Row(variant='panel', equal_height=True):
+            with gr.Column(scale=1, min_width=0):
+                with gr.Row(equal_height=True):
+                    self.src_ref_image_1 = gr.Image(label='src_ref_image_1',
+                                                    height=200,
+                                                    interactive=True,
+                                                    type='filepath',
+                                                    image_mode='RGB',
+                                                    sources=['upload'],
+                                                    elem_id="src_ref_image_1",
+                                                    format='png')
+                    self.src_ref_image_2 = gr.Image(label='src_ref_image_2',
+                                                    height=200,
+                                                    interactive=True,
+                                                    type='filepath',
+                                                    image_mode='RGB',
+                                                    sources=['upload'],
+                                                    elem_id="src_ref_image_2",
+                                                    format='png')
+                    self.src_ref_image_3 = gr.Image(label='src_ref_image_3',
+                                                    height=200,
+                                                    interactive=True,
+                                                    type='filepath',
+                                                    image_mode='RGB',
+                                                    sources=['upload'],
+                                                    elem_id="src_ref_image_3",
+                                                    format='png')
+        with gr.Row(variant='panel', equal_height=True):
+            with gr.Column(scale=1):
+                self.prompt = gr.Textbox(
+                    show_label=False,
+                    placeholder="positive_prompt_input",
+                    elem_id='positive_prompt',
+                    container=True,
+                    autofocus=True,
+                    elem_classes='type_row',
+                    visible=True,
+                    lines=2)
+                self.negative_prompt = gr.Textbox(
+                    show_label=False,
+                    value=self.pipe.config.sample_neg_prompt,
+                    placeholder="negative_prompt_input",
+                    elem_id='negative_prompt',
+                    container=True,
+                    autofocus=False,
+                    elem_classes='type_row',
+                    visible=True,
+                    interactive=True,
+                    lines=1)
+        #
+        with gr.Row(variant='panel', equal_height=True):
+            with gr.Column(scale=1, min_width=0):
+                with gr.Row(equal_height=True):
+                    self.shift_scale = gr.Slider(
+                        label='shift_scale',
+                        minimum=0.0,
+                        maximum=100.0,
+                        step=1.0,
+                        value=16.0,
+                        interactive=True)
+                    self.sample_steps = gr.Slider(
+                        label='sample_steps',
+                        minimum=1,
+                        maximum=100,
+                        step=1,
+                        value=25,
+                        interactive=True)
+                    self.context_scale = gr.Slider(
+                        label='context_scale',
+                        minimum=0.0,
+                        maximum=2.0,
+                        step=0.1,
+                        value=1.0,
+                        interactive=True)
+                    self.guide_scale = gr.Slider(
+                        label='guide_scale',
+                        minimum=1,
+                        maximum=10,
+                        step=0.5,
+                        value=5.0,
+                        interactive=True)
+                    self.infer_seed = gr.Slider(minimum=-1,
+                                                maximum=10000000,
+                                                value=2025,
+                                                label="Seed")
+        #
+        with gr.Accordion(label="Usable without source video", open=False):
+            with gr.Row(equal_height=True):
+                self.output_height = gr.Textbox(
+                    label='resolutions_height',
+                    # value=480,
+                    value=720,
+                    interactive=True)
+                self.output_width = gr.Textbox(
+                    label='resolutions_width',
+                    # value=832,
+                    value=1280,
+                    interactive=True)
+                self.frame_rate = gr.Textbox(
+                    label='frame_rate',
+                    value=16,
+                    interactive=True)
+                self.num_frames = gr.Textbox(
+                    label='num_frames',
+                    value=81,
+                    interactive=True)
+        #
+        with gr.Row(equal_height=True):
+            with gr.Column(scale=5):
+                self.generate_button = gr.Button(
+                    value='Run',
+                    elem_classes='type_row',
+                    elem_id='generate_button',
+                    visible=True)
+            with gr.Column(scale=1):
+                self.refresh_button = gr.Button(value='\U0001f504')  # 🔄
+        #
+        self.output_gallery = gr.Gallery(
+            label="output_gallery",
+            value=[],
+            interactive=False,
+            allow_preview=True,
+            preview=True)
+
+
+    def generate(self, output_gallery, src_video, src_mask, src_ref_image_1, src_ref_image_2, src_ref_image_3, prompt, negative_prompt, shift_scale, sample_steps, context_scale, guide_scale, infer_seed, output_height, output_width, frame_rate, num_frames):
+        output_height, output_width, frame_rate, num_frames = int(output_height), int(output_width), int(frame_rate), int(num_frames)
+        src_ref_images = [x for x in [src_ref_image_1, src_ref_image_2, src_ref_image_3] if
+                          x is not None]
+        src_video, src_mask, src_ref_images = self.pipe.prepare_source([src_video],
+                                                                         [src_mask],
+                                                                         [src_ref_images],
+                                                                         num_frames=num_frames,
+                                                                         image_size=SIZE_CONFIGS[f"{output_width}*{output_height}"],  # keys are "width*height", matching --size
+                                                                         device=self.pipe.device)
+        video = self.pipe.generate(
+            prompt,
+            src_video,
+            src_mask,
+            src_ref_images,
+            size=(output_width, output_height),
+            context_scale=context_scale,
+            shift=shift_scale,
+            sampling_steps=sample_steps,
+            guide_scale=guide_scale,
+            n_prompt=negative_prompt,
+            seed=infer_seed,
+            offload_model=True)
+
+        name = datetime.datetime.now().strftime('%Y%m%d%H%M%S')  # '%-H' is not portable across platforms
+        video_path = os.path.join(self.save_dir, f'cur_gallery_{name}.mp4')
+        video_frames = (torch.clamp(video / 2 + 0.5, min=0.0, max=1.0).permute(1, 2, 3, 0) * 255).cpu().numpy().astype(np.uint8)
+
+        try:
+            writer = imageio.get_writer(video_path, fps=frame_rate, codec='libx264', quality=8, macro_block_size=1)
+            for frame in video_frames:
+                writer.append_data(frame)
+            writer.close()
+            print(video_path)
+        except Exception as e:
+            raise gr.Error(f"Video save error: {e}")
+
+        if self.gallery_share:
+            self.gallery_share_data.add(video_path)
+            return self.gallery_share_data.get()
+        else:
+            return [video_path]
+
+    def set_callbacks(self, **kwargs):
+        self.gen_inputs = [self.output_gallery, self.src_video, self.src_mask, self.src_ref_image_1, self.src_ref_image_2, self.src_ref_image_3, self.prompt, self.negative_prompt, self.shift_scale, self.sample_steps, self.context_scale, self.guide_scale, self.infer_seed, self.output_height, self.output_width, self.frame_rate, self.num_frames]
+        self.gen_outputs = [self.output_gallery]
+        self.generate_button.click(self.generate,
+                                   inputs=self.gen_inputs,
+                                   outputs=self.gen_outputs,
+                                   queue=True)
+        self.refresh_button.click(lambda x: self.gallery_share_data.get() if self.gallery_share else x, inputs=[self.output_gallery], outputs=[self.output_gallery])
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Argparser for VACE-WAN Demo:\n')
+    parser.add_argument('--server_port', dest='server_port', help='', type=int, default=7860)
+    parser.add_argument('--server_name', dest='server_name', help='', default='0.0.0.0')
+    parser.add_argument('--root_path', dest='root_path', help='', default=None)
+    parser.add_argument('--save_dir', dest='save_dir', help='', default='cache')
+    parser.add_argument("--mp", action="store_true", help="Use Multi-GPUs",)
+    parser.add_argument("--model_name", type=str, default="vace-14B", choices=list(WAN_CONFIGS.keys()), help="The model name to run.")
+    parser.add_argument("--ulysses_size", type=int, default=1, help="The size of the ulysses parallelism in DiT.")
+    parser.add_argument("--ring_size", type=int, default=1, help="The size of the ring attention parallelism in DiT.")
+    parser.add_argument(
+        "--ckpt_dir",
+        type=str,
+        # default='models/VACE-Wan2.1-1.3B-Preview',
+        default='models/Wan2.1-VACE-14B/',
+        help="The path to the checkpoint directory.",
+    )
+    parser.add_argument(
+        "--offload_to_cpu",
+        action="store_true",
+        help="Offloading unnecessary computations to CPU.",
+    )
+
+    args = parser.parse_args()
+
+    if not os.path.exists(args.save_dir):
+        os.makedirs(args.save_dir, exist_ok=True)
+
+    with gr.Blocks() as demo:
+        infer_gr = VACEInference(args, skip_load=False, gallery_share=True, gallery_share_limit=5)
+        infer_gr.create_ui()
+        infer_gr.set_callbacks()
+        allowed_paths = [args.save_dir]
+        demo.queue(status_update_rate=1).launch(server_name=args.server_name,
+                                                server_port=args.server_port,
+                                                root_path=args.root_path,
+                                                allowed_paths=allowed_paths,
+                                                show_error=True, debug=True)
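
Note on the frame conversion in `generate` above: the pipeline output is treated as a `(C, F, H, W)` tensor with values in `[-1, 1]`, which is rescaled to `[0, 1]`, clamped, moved to channels-last order, and cast to bytes before being written. The sketch below reproduces that arithmetic on plain nested lists so it runs without `torch`. The helper names `to_byte` and `to_frames` are hypothetical and exist only for illustration.

```python
def to_byte(value):
    """Map one sample from [-1, 1] to an integer in [0, 255], clamping outliers."""
    unit = min(max(value / 2 + 0.5, 0.0), 1.0)
    return int(unit * 255)  # truncates, like a uint8 cast on non-negative values


def to_frames(video):
    """Reorder a nested (C, F, H, W) list into (F, H, W, C) byte frames."""
    channels, num_frames = len(video), len(video[0])
    height, width = len(video[0][0]), len(video[0][0][0])
    return [[[[to_byte(video[c][f][h][w]) for c in range(channels)]
              for w in range(width)]
             for h in range(height)]
            for f in range(num_frames)]


# Two 1x2 frames: red at full intensity, green at mid-gray, blue out of range.
video = [
    [[[1.0, 1.0]], [[1.0, 1.0]]],      # channel 0
    [[[0.0, 0.0]], [[0.0, 0.0]]],      # channel 1
    [[[-3.0, -3.0]], [[-3.0, -3.0]]],  # channel 2
]
frames = to_frames(video)
print(len(frames), frames[0][0][0])  # 2 [255, 127, 0]
```

The out-of-range blue value is clamped to 0 rather than wrapping around, which is why the clamp has to happen before the byte cast.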
--- a/wan/__init__.py
+++ b/wan/__init__.py
@@ -2,4 +2,4 @@ from . import configs, distributed, modules
 from .image2video import WanI2V
 from .text2video import WanT2V
 from .first_last_frame2video import WanFLF2V
-from .vace import WanVace
+from .vace import WanVace, WanVaceMP
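
Note on `FixedSizeQueue` in `gradio/vace.py`: the class keeps the newest item first and drops the oldest once `max_size` is exceeded. One possible alternative, shown only as a sketch, is to let `collections.deque` enforce the bound. `RecentItems` is a hypothetical name, and unlike the original its `get()` returns a copy rather than the live list.

```python
from collections import deque


class RecentItems:
    """Newest-first bounded history (hypothetical stand-in for FixedSizeQueue)."""

    def __init__(self, max_size):
        # With maxlen set, appendleft on a full deque discards the rightmost item.
        self._items = deque(maxlen=max_size)

    def add(self, item):
        self._items.appendleft(item)

    def get(self):
        return list(self._items)


recent = RecentItems(max_size=3)
for name in ["a.mp4", "b.mp4", "c.mp4", "d.mp4"]:
    recent.add(name)
print(recent.get())  # ['d.mp4', 'c.mp4', 'b.mp4']
```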