Flux Kontext and more

2026-01-11 16:53:34 +00:00 · 2025-07-15 22:26:56 +02:00 · 2025-07-15 22:26:56 +02:00 · 64c59c15d9
commit 64c59c15d9
parent 37f41804a6
21 changed files with 734 additions and 392 deletions
--- a/README.md
+++ b/README.md
@ -20,6 +20,26 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTV Video models
 **Follow DeepBeepMeep on Twitter/X to get the Latest News**: https://x.com/deepbeepmeep

 ## 🔥 Latest Updates
+### July 15 2025: WanGP v7.0 is an AI Powered Photoshop
+This release turns the Wan models into Image Generators. This goes well beyond generating a video made of a single frame:
+- Multiple Images are generated at the same time so that you can choose the one you like best. It is highly VRAM optimized, so you can generate for instance 4 720p Images at the same time with less than 10 GB
+- With *image2image*, the original WanGP text2video model becomes an image upsampler / restorer
+- *Vace image2image* comes out of the box with image outpainting, person / object replacement, ...
+- In one click you can use a newly generated Image as the Start Image or Reference Image for a Video generation
+
+And to complete the full suite of AI Image Generators, Ladies and Gentlemen please welcome for the first time in WanGP : **Flux Kontext**.\
+As a reminder, Flux Kontext is an image editor: give it an image and a prompt and it will make the change for you.\
+This highly optimized version of Flux Kontext will make you feel that you have been cheated all this time as WanGP Flux Kontext requires only 8 GB of VRAM to generate 4 images at the same time with no need for quantization.
+
+WanGP v7 comes with *Image2image* vanilla and *Vace FusioniX*. However, you can build your own finetune that combines a text2video or Vace model with any combination of Loras.
+
+Also in the news:
+- You can now enter the *Bbox* for each speaker in *Multitalk* to precisely locate who is speaking. And to save some headaches the *Image Mask generator* will give you the *Bbox* coordinates of an area you have selected.
+- *Film Grain* post processing to add a vintage look to your video
+- *First Last Frame to Video* model should work much better now as I recently discovered that its implementation was not complete
+- More power for the finetuners, you can now embed Loras directly in the finetune definition. You can also override the default models (titles, visibility, ...) with your own finetunes. Check the doc that has been updated.
+
+
 ### July 10 2025: WanGP v6.7, is NAG a game changer ? you tell me
 Maybe you knew that already but most *Loras accelerators* we use today (Causvid, FusioniX) don't use *Guidance* at all (that it is *CFG* is set to 1). This helps to get much faster generations but the downside is that *Negative Prompts* are completely ignored (including the default ones set by the models). **NAG** (https://github.com/ChenDarYen/Normalized-Attention-Guidance) aims to solve that by injecting the *Negative Prompt* during the *attention* processing phase.

--- a/defaults/flux_dev_kontext.json
+++ b/defaults/flux_dev_kontext.json
@ -2,18 +2,15 @@
    "model": {
        "name": "Flux Dev Kontext 12B",
        "architecture": "flux_dev_kontext",
-        "description": "FLUX.1 Kontext is a 12 billion parameter rectified flow transformer capable of editing images based on instructions stored in the Prompt. Please be aware that the output resolution is modified by Flux Kontext and may not be what you requested.",
+        "description": "FLUX.1 Kontext is a 12 billion parameter rectified flow transformer capable of editing images based on instructions stored in the Prompt. Please be aware that Flux Kontext is picky about the resolution of the input image, so the output dimensions may not match the dimensions of the input image.",
        "URLs": [
-            "c:/temp/kontext/flux1_kontext_dev_bf16.safetensors",
-            "c:/temp/kontext/flux1_kontext_dev_quanto_bf16_int8.safetensors"
-        ],
-        "URLs2": [
            "https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1_kontext_dev_bf16.safetensors",
            "https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1_kontext_dev_quanto_bf16_int8.safetensors"
        ]
    },
+	"prompt": "add a hat",
    "resolution": "1280x720",
-    "video_length": "1"
+    "video_length": 1
 }

 	
--- a/defaults/t2i.json
+++ b/defaults/t2i.json
@ -0,0 +1,13 @@
+{
+    "model": {
+        "name": "Wan2.1 text2image 14B",
+        "architecture": "t2v",
+        "description": "The original Wan Text 2 Video model configured to generate an image instead of a video.",
+		"image_outputs": true,
+        "URLs": "t2v"
+    },
+    "video_length": 1,
+    "resolution": "1280x720"	
+}
+
+	
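The documentation updated in this commit says that a file in `finetunes/` with the same name as a file in `defaults/` overrides it "property by property". The sketch below shows one way such a merge could behave. It is only an illustration: the function name `merge_definitions`, the recursive handling of nested dicts, and the override values are assumptions, not WanGP's actual code.

```python
import copy


def merge_definitions(default, override):
    """Return a new dict in which `override` wins over `default`, property by property.

    Assumption: nested dicts (such as the "model" subtree) are merged recursively;
    any other value (string, number, list) is replaced wholesale.
    """
    merged = copy.deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_definitions(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Values copied from defaults/t2i.json as added in this commit.
default_t2i = {
    "model": {
        "name": "Wan2.1 text2image 14B",
        "architecture": "t2v",
        "image_outputs": True,
        "URLs": "t2v",
    },
    "video_length": 1,
    "resolution": "1280x720",
}

# Hypothetical finetunes/t2i.json: renames the model and changes the resolution.
override_t2i = {
    "model": {"name": "My text2image"},
    "resolution": "832x480",
}

merged = merge_definitions(default_t2i, override_t2i)
print(merged["model"]["name"], merged["model"]["architecture"], merged["resolution"])
```

Properties that the override does not mention (here `architecture`, `image_outputs`, `URLs` and `video_length`) keep their default values.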
--- a/defaults/vace_14B_fusionix.json
+++ b/defaults/vace_14B_fusionix.json
@ -15,7 +15,7 @@
    "seed": -1,
    "num_inference_steps": 10,
    "guidance_scale": 1,
-    "flow_shift": 5,
+    "flow_shift": 2,
    "embedded_guidance_scale": 6,
    "repeat_generation": 1,
    "multi_images_gen_type": 0,
--- a/defaults/vace_14B_fusionix_t2i.json
+++ b/defaults/vace_14B_fusionix_t2i.json
@ -0,0 +1,16 @@
+{
+    "model": {
+        "name": "Vace FusioniX image2image 14B",
+        "architecture": "vace_14B",
+        "modules": [
+            "vace_14B"
+        ],
+		"image_outputs": true,
+        "description": "Vace control model enhanced using multiple open-source components and LoRAs to boost motion realism, temporal consistency, and expressive detail.",
+        "URLs": "t2v_fusionix"
+    },
+    "resolution": "1280x720",
+	"guidance_scale": 1,
+    "num_inference_steps": 10,	
+    "video_length": 1
+}
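The definition above sets `URLs` to the string `t2v_fusionix` instead of a list, which the updated documentation describes as reusing the URLs of another definition. A minimal sketch of how such a reference could be resolved follows. The helper `resolve_urls`, the in-memory `definitions` dict keyed by file name, and the `example.com` URLs are all hypothetical and only serve the illustration.

```python
def resolve_urls(model_id, definitions, _seen=None):
    """Follow string-valued "URLs" references until a list of URLs is found.

    `definitions` maps a definition id (file name without .json) to its parsed content.
    """
    seen = set() if _seen is None else _seen
    if model_id in seen:
        raise ValueError(f"circular URLs reference at {model_id!r}")
    seen.add(model_id)
    urls = definitions[model_id]["model"]["URLs"]
    if isinstance(urls, str):
        # A string is the id of another definition whose URLs are reused.
        return resolve_urls(urls, definitions, seen)
    return list(urls)


definitions = {
    # Hypothetical content; the real t2v_fusionix definition is not shown in this diff.
    "t2v_fusionix": {
        "model": {
            "URLs": [
                "https://example.com/fusionix_bf16.safetensors",
                "https://example.com/fusionix_quanto_bf16_int8.safetensors",
            ]
        }
    },
    # Mirrors the "URLs": "t2v_fusionix" reference used in the definition above.
    "vace_14B_fusionix_t2i": {"model": {"URLs": "t2v_fusionix"}},
}

print(resolve_urls("vace_14B_fusionix_t2i", definitions))
```

Tracking the ids already visited keeps a definition that points back to itself from recursing forever.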
--- a/docs/FINETUNES.md
+++ b/docs/FINETUNES.md
@ -2,22 +2,30 @@

 A Finetuned model is a model that shares the architecture of one specific model but whose weights are derived from that model. Some finetuned models have been created by combining multiple finetuned models.

-As there are potentially an infinite number of finetunes, specific finetuned models are not known by default by WanGP, however you can create a finetuned model definition that will tell WanGP about the existence of this finetuned model and WanGP will do as usual all the work for you: autodownload the model and build the user interface.
+As there are potentially an infinite number of finetunes, specific finetuned models are not known by default by WanGP. However, you can create a finetuned model definition that tells WanGP about the existence of this finetuned model, and WanGP will as usual do all the work for you: autodownload the model and build the user interface.
+
+The WanGP finetune system can also be used to tweak default models: for instance, you can add some loras on top of an existing model that will always be applied transparently.

 Finetune models definitions are light json files that can be easily shared. You can find some of them on the WanGP *discord* server https://discord.gg/g7efUW9jGV

+All finetune definition files should be stored in the *finetunes/* subfolder.
+
 Finetuned models have been tested so far with Wan2.1 text2video, Wan2.1 image2video,  Hunyuan Video text2video. There isn't currently any support for LTX Video finetunes.

-## Create a new Finetune Model Definition
-All the finetune models definitions are json files stored in the **finetunes** sub folder. All the corresponding finetune model weights will be stored in the *ckpts* subfolder and will sit next to the base models.

-WanGP comes with a few prebuilt finetune models that you can use as starting points and to get an idea of the structure of the definition file.
+
+## Create a new Finetune Model Definition
+All finetune model definitions are json files stored in the **finetunes/** subfolder. Once downloaded, the corresponding finetune model weights are stored in the *ckpts/* subfolder, next to the base models.
+
+All the models used by WanGP are also described using the finetunes json format and can be found in the **defaults/** subfolder. Please don’t modify any file in the **defaults/** folder.
+
+However, you can use these files as starting points for new definition files and to get an idea of the structure of a definition file. If you want to change how a base model is handled (title, default settings, path to model weights, …), you can override any property of a default definition file by creating a file with the same name in the **finetunes/** folder. Everything happens as if the two definitions were merged property by property, with priority given to the definition in the finetunes folder.

 A definition is built from a *settings file* that can contain all the default parameters for a video generation. On top of this file a subtree named **model** contains all the information regarding the finetune (URLs to download model, corresponding base model id, ...).

 You can obtain a settings file in several ways:
 - In the subfolder **settings**, get the json file that corresponds to the base model of your finetune (see the next section for the list of ids of base models)
- From the user interface, go to the base model and click **export settings**
+- From the user interface, select the base model for which you want to create a finetune and click **export settings**

 Here are steps:
 1) Create a *settings file*
@ -26,22 +34,37 @@ Here are steps:
 4) Restart WanGP

 ## Architecture Models Ids
-A finetune is derived from a base model and will inherit all the user interface and corresponding model capabilities, here are Architecture Ids:
- *t2v*: Wan 2.1 Video text 2 
- *i2v*: Wan 2.1 Video image 2 480p
- *i2v_720p*: Wan 2.1 Video image 2 720p
+A finetune is derived from a base model and inherits all of its user interface and corresponding model capabilities. Here are some Architecture Ids:
+- *t2v*: Wan 2.1 Video text 2 video
+- *i2v*: Wan 2.1 Video image 2 video 480p and 720p
 - *vace_14B*: Wan 2.1 Vace 14B
 - *hunyuan*: Hunyuan Video text 2 video
 - *hunyuan_i2v*: Hunyuan Video image 2 video

+Any file name in the defaults subfolder (without the json extension) corresponds to an architecture id.
+
+Please note that the weights of some architectures are a combination of the weights of one architecture completed by the weights of one or more modules.
+
+A module is a set of weights that is insufficient to be a model by itself but that can be added to an existing model to extend its capabilities.
+
+For instance, if one adds the module *vace_14B* on top of a model with the *t2v* architecture, one gets a model with the *vace_14B* architecture. Here *vace_14B* stands for both an architecture name and a module name. The module system allows you to reuse shared weights between models.
+
+
 ## The Model Subtree
 - *name* : name of the finetune used to select
 - *architecture* : architecture Id of the base model of the finetune (see previous section)
 - *description*: description of the finetune that will appear at the top
 - *URLs*: URLs of all the finetune versions (quantized / non quantized). WanGP will pick the version that is the closest to the user preferences. You will need to follow a naming convention to help WanGP identify the content of each version (see next section). Right now WanGP supports only 8 bits quantized model that have been quantized using **quanto**. WanGP offers a command switch to build easily such a quantized model (see below). *URLs* can contain also paths to local file to allow testing.
- *modules*: this a list of modules to be combined with the models referenced by the URLs. A module is a model extension that is merged with a model to expand its capabilities. So far the only module supported is Vace 14B  (its id is *vace_14B*). For instance the full Vace model is the fusion of a Wan text 2 video and the Vace module.
+- *modules*: this is a list of modules to be combined with the models referenced by the URLs. A module is a model extension that is merged with a model to expand its capabilities. Supported modules so far are *vace_14B* and *multitalk*. For instance, the full Vace model is the fusion of a Wan text 2 video model and the Vace module.
 - *preload_URLs* : URLs of files to download no matter what (used to load quantization maps for instance)
+- *loras* : URLs of Loras that will be applied before any other Lora specified by the user. These Loras will quite often be Lora accelerators. For instance, if you specify the FusioniX Lora here, you will be able to reduce the number of generation steps.
+- *loras_multipliers* : a list of float numbers that defines the weight of each Lora mentioned above.
 - *auto_quantize*: if set to True and no quantized model URL is provided, WanGP will perform on the fly quantization if the user expects a quantized model
+- *visible* : by default assumed to be true. If set to false the model will no longer be visible. This can be useful if you create a finetune to override a default model and hide it.
+- *image_outputs* : turn any model that generates a video into a model that generates images. In fact it will adapt the user interface for image generation and ask the model to generate a video with a single frame.
+
+To favor reusability, the *URLs*, *modules*, *loras* and *preload_URLs* properties can contain, instead of a list of URLs, a single string that is the id of a finetune or default model whose values should be reused.
+
+For example, let’s say you have defined a *t2v_fusionix.json* file that contains the URLs to download the finetune. In *vace_fusionix.json* you can write `"URLs": "t2v_fusionix"` to automatically reuse the URLs already defined in the corresponding file.

 Example of **model** subtree
 ```
--- a/docs/LORAS.md
+++ b/docs/LORAS.md
@ -6,18 +6,19 @@ Loras (Low-Rank Adaptations) allow you to customize video generation models by a

 Loras are organized in different folders based on the model they're designed for:

-### Text-to-Video Models
+### Wan Text-to-Video Models
 - `loras/` - General t2v loras
 - `loras/1.3B/` - Loras specifically for 1.3B models
 - `loras/14B/` - Loras specifically for 14B models

-### Image-to-Video Models
+### Wan Image-to-Video Models
 - `loras_i2v/` - Image-to-video loras

 ### Other Models
 - `loras_hunyuan/` - Hunyuan Video t2v loras
 - `loras_hunyuan_i2v/` - Hunyuan Video i2v loras
 - `loras_ltxv/` - LTX Video loras
+- `loras_flux/` - Flux loras

 ## Custom Lora Directory

@ -64,7 +65,7 @@ For dynamic effects over generation steps, use comma-separated values:

 ## Lora Presets

-Presets are combinations of loras with predefined multipliers and prompts.
+Lora Presets are combinations of loras with predefined multipliers and prompts.

 ### Creating Presets
 1. Configure your loras and multipliers
@ -95,16 +96,36 @@ WanGP supports multiple lora formats:
 - **Replicate** format
 - **Standard PyTorch** (.pt, .pth)

-## Safe-Forcing lightx2v Lora (Video Generation Accelerator)

-Safeforcing Lora has been created by Kijai from the Safe-Forcing lightx2v distilled Wan model and can generate videos with only 2 steps and offers also a 2x speed improvement since it doesnt require classifier free guidance. It works on both t2v and i2v models
+## Loras Accelerators
+Most Loras are used to apply a specific style or to alter the content of the output of the generated video.
+However, some Loras have been designed to transform a model into a distilled model that requires fewer steps to generate a video.
+
+You will find most *Loras Accelerators* here:
+https://huggingface.co/DeepBeepMeep/Wan2.1/tree/main/loras_accelerators

 ### Setup Instructions
-1. Download the Lora:
-   ```
-   https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors
-   ```
-2. Place in your `loras/` directory
+1. Download the Lora
+2. Place it in your `loras/` directory if it is a t2v lora or in the `loras_i2v/` directory if it is an i2v lora
+
+## FusioniX (or FusionX) Lora 
+If you need just one Lora accelerator, use this one. It is a combination of multiple Lora accelerators (including CausVid below) and style loras. It not only accelerates video generation but also improves quality. There are two versions of this lora, one for t2v and one for i2v.
+
+### Usage
+1. Select a Wan t2v model (e.g., Wan 2.1 text2video 14B or Vace 14B)
+2. Enable Advanced Mode
+3. In Advanced Generation Tab:
+   - Set Guidance Scale = 1
+   - Set Shift Scale = 2
+4. In Advanced Lora Tab:
+   - Select FusioniX Lora
+   - Set multiplier to 1
+5. Set generation steps to 8-10
+6. Generate!
+
+## Safe-Forcing lightx2v Lora (Video Generation Accelerator)
+Safeforcing Lora has been created by Kijai from the Safe-Forcing lightx2v distilled Wan model. It can generate videos with only 2 steps and also offers a 2x speed improvement since it doesn't require classifier free guidance. It works on both t2v and i2v models.
+You will find it under the name of *Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors*
 
 ### Usage
 1. Select a Wan t2v or i2v model (e.g., Wan 2.1 text2video 14B or Vace 14B)
@ -118,17 +139,10 @@ Safeforcing Lora has been created by Kijai from the Safe-Forcing lightx2v distil
 5. Set generation steps to 2-8
 6. Generate!

+
 ## CausVid Lora (Video Generation Accelerator)
-
 CausVid is a distilled Wan model that generates videos in 4-12 steps with 2x speed improvement.

-### Setup Instructions
-1. Download the CausVid Lora:
-   ```
-   https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_CausVid_14B_T2V_lora_rank32.safetensors
-   ```
-2. Place in your `loras/` directory
-
 ### Usage
 1. Select a Wan t2v model (e.g., Wan 2.1 text2video 14B or Vace 14B)
 2. Enable Advanced Mode
@ -149,25 +163,10 @@ CausVid is a distilled Wan model that generates videos in 4-12 steps with 2x spe
 *Note: Lower steps = lower quality (especially motion)*


-
 ## AccVid Lora (Video Generation Accelerator)

 AccVid is a distilled Wan model that generates videos with a 2x speed improvement since classifier free guidance is no longer needed (that is cfg = 1).

-### Setup Instructions
-1. Download the AccVid Lora:
-
- for t2v models:
-   ```
-   https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_AccVid_T2V_14B_lora_rank32_fp16.safetensors
-   ```
-
- for i2v models:
-   ```
-   https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_AccVid_I2V_480P_14B_lora_rank32_fp16.safetensors
-   ```
-
-2. Place in your `loras/` directory or `loras_i2v/` directory

 ### Usage
 1. Select a Wan t2v model (e.g., Wan 2.1 text2video 14B or Vace 14B) or Wan i2v model
@ -268,6 +267,7 @@ In the video, a man is presented. The man is in a city and looks at his watch.
 --lora-dir-hunyuan path           # Path to Hunyuan t2v loras
 --lora-dir-hunyuan-i2v path       # Path to Hunyuan i2v loras
 --lora-dir-ltxv path              # Path to LTX Video loras
+--lora-dir-flux path              # Path to Flux loras
 --lora-preset preset              # Load preset on startup
 --check-loras                     # Filter incompatible loras
 ``` 
--- a/docs/MODELS.md
+++ b/docs/MODELS.md
@ -2,6 +2,8 @@

 WanGP supports multiple video generation models, each optimized for different use cases and hardware configurations. 

+Most models can be combined with Loras Accelerators (check the Lora guide) to make video generation 2x or 3x faster with little quality loss.
+

 ## Wan 2.1 Text2Video Models
 Please note that the term *Text2Video* refers to the underlying Wan architecture, but as it has been greatly improved over time, many derived Text2Video models can now generate videos using images.
@ -65,6 +67,12 @@ Please note that that the term *Text2Video* refers to the underlying Wan archite

 ## Wan 2.1 Specialized Models

+#### Multitalk
+- **Type**: Multi Talking head animation
+- **Input**: Voice track + image
+- **Works on**: People
+- **Use case**: Lip-sync and voice-driven animation for up to two people
+
 #### FantasySpeaking
 - **Type**: Talking head animation
 - **Input**: Voice track + image
@ -82,7 +90,7 @@ Please note that that the term *Text2Video* refers to the underlying Wan archite
 - **Requirements**: 81+ frame input videos, 15+ denoising steps
 - **Use case**: View same scene from different angles

-#### Sky Reels v2
+#### Sky Reels v2 Diffusion
 - **Type**: Diffusion Forcing model
 - **Specialty**: "Infinite length" videos
 - **Features**: High quality continuous generation
@ -107,22 +115,6 @@ Please note that that the term *Text2Video* refers to the underlying Wan archite

 <BR>

-## Wan Special Loras
-### Safe-Forcing lightx2v Lora
- **Type**: Distilled model (Lora implementation)
- **Speed**: 4-8 steps generation, 2x faster (no classifier free guidance)
- **Compatible**: Works with t2v and i2v Wan 14B models
- **Setup**: Requires Safe-Forcing lightx2v Lora (see [LORAS.md](LORAS.md))
-
-
-### Causvid Lora
- **Type**: Distilled model (Lora implementation)
- **Speed**: 4-12 steps generation, 2x faster (no classifier free guidance)
- **Compatible**: Works with Wan 14B models
- **Setup**: Requires CausVid Lora (see [LORAS.md](LORAS.md))
-
-
-<BR>

 ## Hunyuan Video Models

--- a/finetunes/put
+++ b/finetunes/put
--- a/flux/flux_main.py
+++ b/flux/flux_main.py
@ -65,7 +65,7 @@ class model_factory:
            fit_into_canvas = None,
            callback = None,
            loras_slists = None,
-            frame_num = 1,
+            batch_size = 1,
            **bbargs
    ):
            
@ -89,7 +89,7 @@ class model_factory:
                img_cond=image_ref,
                target_width=width,
                target_height=height,
-                bs=frame_num,
+                bs=batch_size,
                seed=seed,
                device="cuda",
            )
--- a/postprocessing/film_grain.py
+++ b/postprocessing/film_grain.py
@ -0,0 +1,21 @@
+# Thanks to https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/film_grain.py
+import torch
+
+def add_film_grain(images: torch.Tensor, grain_intensity: float = 0, saturation: float = 0.5):
+    device = images.device 
+
+    images = images.permute(1, 2 ,3 ,0)
+    images.add_(1.).div_(2.)
+    grain = torch.randn_like(images, device=device)
+    grain[:, :, :, 0] *= 2
+    grain[:, :, :, 2] *= 3
+    grain = grain * saturation + grain[:, :, :, 1].unsqueeze(3).repeat(
+        1, 1, 1, 3
+    ) * (1 - saturation)
+
+    # Blend the grain with the image
+    noised_images = images + grain_intensity * grain
+    noised_images.clamp_(0, 1)
+    noised_images.sub_(.5).mul_(2.)
+    noised_images = noised_images.permute(3, 0, 1 ,2)
+    return noised_images
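To make the arithmetic of `add_film_grain` easier to follow, here is a torch-free restatement for a single RGB pixel. It is a sketch only: the helper name `grain_pixel` is hypothetical, and the noise is passed in explicitly (instead of being drawn with `torch.randn_like`) so that the result is reproducible.

```python
import random


def grain_pixel(rgb, noise, grain_intensity=0.0, saturation=0.5):
    """Apply the film-grain arithmetic to one RGB pixel with values in [-1, 1].

    `noise` holds one standard-normal sample per channel.
    """
    # Map the pixel from [-1, 1] to [0, 1].
    unit = [(v + 1.0) / 2.0 for v in rgb]
    # Red and blue noise are amplified (x2 and x3); green is left unchanged.
    scaled = [noise[0] * 2.0, noise[1], noise[2] * 3.0]
    # Blend coloured noise with monochrome noise taken from the green channel.
    grain = [g * saturation + scaled[1] * (1.0 - saturation) for g in scaled]
    # Add the grain, clamp to [0, 1], then map back to [-1, 1].
    noisy = [min(max(u + grain_intensity * g, 0.0), 1.0) for u, g in zip(unit, grain)]
    return [(n - 0.5) * 2.0 for n in noisy]


# Usage: draw fresh noise for a mid-grey pixel.
sample = [random.gauss(0.0, 1.0) for _ in range(3)]
print(grain_pixel([0.0, 0.0, 0.0], sample, grain_intensity=0.05))
```

With `saturation = 0` every channel receives the same green-channel noise, which gives monochrome grain. With `saturation = 1` each channel keeps its own scaled noise. With the default `grain_intensity = 0` the pixel is returned unchanged.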
--- a/preprocessing/matanyone/app.py
+++ b/preprocessing/matanyone/app.py
@ -65,6 +65,7 @@ def get_frames_from_image(image_input, image_state):
    Return 
        [[0:nearest_frame], [nearest_frame:], nearest_frame]
    """
+    load_sam()

    user_name = time.time()
    frames = [image_input] * 2  # hardcode: mimic a video with 2 frames
@ -89,7 +90,7 @@ def get_frames_from_image(image_input, image_state):
                        gr.update(visible=True), gr.update(visible=True), \
                        gr.update(visible=True), gr.update(visible=True),\
                        gr.update(visible=True), gr.update(visible=True), \
-                        gr.update(visible=True), gr.update(visible=False), \
+                        gr.update(visible=True), gr.update(value="", visible=True),  gr.update(visible=False), \
                        gr.update(visible=False), gr.update(visible=True), \
                        gr.update(visible=True)

@ -103,6 +104,8 @@ def get_frames_from_video(video_input, video_state):
        [[0:nearest_frame], [nearest_frame:], nearest_frame]
    """

+    load_sam()
+
    while model == None:
        time.sleep(1)
        
@ -273,6 +276,20 @@ def save_video(frames, output_path, fps):

    return output_path

+def mask_to_xyxy_box(mask):
+    rows, cols = np.where(mask == 255)
+    xmin = min(cols)
+    xmax = max(cols) + 1
+    ymin = min(rows)
+    ymax = max(rows) + 1
+    xmin = max(xmin, 0)
+    ymin = max(ymin, 0)
+    xmax = min(xmax, mask.shape[1])
+    ymax = min(ymax, mask.shape[0])
+    box = [xmin, ymin, xmax, ymax]
+    box = [int(x) for x in box]
+    return box
+
 # image matting
 def image_matting(video_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, refine_iter):
    matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
@ -320,9 +337,17 @@ def image_matting(video_state, interactive_state, mask_dropdown, erode_kernel_si
    foreground = output_frames

    foreground_output = Image.fromarray(foreground[-1])
-    alpha_output = Image.fromarray(alpha[-1][:,:,0])
-
-    return foreground_output, gr.update(visible=True) 
+    alpha_output = alpha[-1][:,:,0]
+    frame_temp = alpha_output.copy()
+    alpha_output[frame_temp > 127] = 0
+    alpha_output[frame_temp <= 127] = 255
+    bbox_info = mask_to_xyxy_box(alpha_output)
+    h = alpha_output.shape[0]
+    w = alpha_output.shape[1]
+    bbox_info = [str(int(bbox_info[0]/ w * 100 )), str(int(bbox_info[1]/ h * 100 )),  str(int(bbox_info[2]/ w * 100 )), str(int(bbox_info[3]/ h * 100 )) ]
+    bbox_info = ":".join(bbox_info)
+    alpha_output = Image.fromarray(alpha_output)
+    return foreground_output, alpha_output, bbox_info, gr.update(visible=True), gr.update(visible=True) 

 # video matting
 def video_matting(video_state, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
@ -469,6 +494,13 @@ def restart():
        gr.update(visible=False), gr.update(visible=False), gr.update(visible=False), gr.update(visible=False), \
        gr.update(visible=False), gr.update(visible=False, choices=[], value=[]), "", gr.update(visible=False)

+def load_sam():
+    global model_loaded
+    global model
+    global matanyone_model 
+    model.samcontroler.sam_controler.model.to(arg_device)
+    matanyone_model.to(arg_device)
+
 def load_unload_models(selected):
    global model_loaded
    global model
@ -476,8 +508,7 @@ def load_unload_models(selected):
    if selected:
        # print("Matanyone Tab Selected")
        if model_loaded:
-            model.samcontroler.sam_controler.model.to(arg_device)
-            matanyone_model.to(arg_device)
+            load_sam()
        else:
            # args, defined in track_anything.py
            sam_checkpoint_url_dict = {
@ -522,12 +553,16 @@ def export_to_vace_video_input(foreground_video_output):

 def export_image(image_refs, image_output):
    gr.Info("Masked Image transferred to Current Video")
-    # return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
    if image_refs == None:
        image_refs =[]
    image_refs.append( image_output)
    return image_refs

+def export_image_mask(image_input, image_mask):
+    gr.Info("Input Image & Mask transferred to Current Video")
+    return Image.fromarray(image_input), image_mask
+
+
 def export_to_current_video_engine(model_type, foreground_video_output, alpha_video_output):
    gr.Info("Original Video and Full Mask have been transferred")
    # return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
@ -543,7 +578,7 @@ def teleport_to_video_tab(tab_state):
    return gr.Tabs(selected="video_gen")


-def display(tabs, tab_state, model_choice, vace_video_input, vace_video_mask, vace_image_refs):
+def display(tabs, tab_state, model_choice, vace_video_input, vace_image_input, vace_video_mask, vace_image_mask, vace_image_refs):
    # my_tab.select(fn=load_unload_models, inputs=[], outputs=[])

    media_url = "https://github.com/pq-yang/MatAnyone/releases/download/media/"
@ -677,7 +712,7 @@ def display(tabs, tab_state, model_choice, vace_video_input, vace_video_mask, va
                                foreground_output_button = gr.Button(value="Black & White Video Output", visible=False, elem_classes="new_button")
                            with gr.Column(scale=2):
                                alpha_video_output = gr.Video(label="B & W Mask Video Output", visible=False, elem_classes="video")
-                                alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
+                                export_image_mask_btn = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
                        with gr.Row():
                            with gr.Row(visible= False):
                                export_to_vace_video_14B_btn = gr.Button("Export to current Video Input Video For Inpainting", visible= False)
@ -696,7 +731,7 @@ def display(tabs, tab_state, model_choice, vace_video_input, vace_video_mask, va
                    ],
                    outputs=[video_state, video_info, template_frame,
                            image_selection_slider, end_selection_slider,  track_pause_number_slider, point_prompt, matting_type, clear_button_click, add_mask_button, matting_button, template_frame,
-                            foreground_video_output, alpha_video_output, foreground_output_button, alpha_output_button, mask_dropdown, step2_title]
+                            foreground_video_output, alpha_video_output, foreground_output_button, export_image_mask_btn, mask_dropdown, step2_title]
                )   

                # second step: select images from slider
@ -755,7 +790,7 @@ def display(tabs, tab_state, model_choice, vace_video_input, vace_video_mask, va
                        foreground_video_output, alpha_video_output,
                        template_frame,
                        image_selection_slider, end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click, 
-                        add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
+                        add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, export_image_mask_btn, mask_dropdown, video_info, step2_title
                    ],
                    queue=False,
                    show_progress=False)
@ -770,7 +805,7 @@ def display(tabs, tab_state, model_choice, vace_video_input, vace_video_mask, va
                        foreground_video_output, alpha_video_output,
                        template_frame,
                        image_selection_slider , end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click, 
-                        add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
+                        add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, export_image_mask_btn, mask_dropdown, video_info, step2_title
                    ],
                    queue=False,
                    show_progress=False)
@ -872,15 +907,19 @@ def display(tabs, tab_state, model_choice, vace_video_input, vace_video_mask, va
                    # output image
                    with gr.Row(equal_height=True):
                        foreground_image_output = gr.Image(type="pil", label="Foreground Output", visible=False, elem_classes="image")
+                        alpha_image_output = gr.Image(type="pil", label="Mask", visible=False, elem_classes="image")
+                    with gr.Row(equal_height=True):
+                        bbox_info = gr.Text(label ="Mask BBox Info (Left:Top:Right:Bottom)", interactive= False)
                    with gr.Row():
-                        with gr.Row():
+                        # with gr.Row():
                        export_image_btn = gr.Button(value="Add to current Reference Images", visible=False, elem_classes="new_button")
-                    with gr.Column(scale=2, visible= False):
-                        alpha_image_output = gr.Image(type="pil", label="Alpha Output", visible=False, elem_classes="image")
-                        alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
+                    # with gr.Column(scale=2, visible= True):
+                        export_image_mask_btn = gr.Button(value="Set to Control Image & Mask", visible=False, elem_classes="new_button")

                export_image_btn.click(  fn=export_image, inputs= [vace_image_refs, foreground_image_output], outputs= [vace_image_refs]).then( #video_prompt_video_guide_trigger, 
                    fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs])
+                export_image_mask_btn.click(  fn=export_image_mask, inputs= [image_input, alpha_image_output], outputs= [vace_image_input, vace_image_mask]).then( #video_prompt_video_guide_trigger, 
+                    fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs])

                # first step: get the image information 
                extract_frames_button.click(
@ -890,9 +929,17 @@ def display(tabs, tab_state, model_choice, vace_video_input, vace_video_mask, va
                    ],
                    outputs=[image_state, image_info, template_frame,
                            image_selection_slider, track_pause_number_slider,point_prompt, clear_button_click, add_mask_button, matting_button, template_frame,
-                            foreground_image_output, alpha_image_output, export_image_btn, alpha_output_button, mask_dropdown, step2_title]
+                            foreground_image_output, alpha_image_output, bbox_info, export_image_btn, export_image_mask_btn, mask_dropdown, step2_title]
                )   

+                # points clear
+                clear_button_click.click(
+                    fn = clear_click,
+                    inputs = [image_state, click_state,],
+                    outputs = [template_frame,click_state],
+                )
+
+
                # second step: select images from slider
                image_selection_slider.release(fn=select_image_template, 
                                            inputs=[image_selection_slider, image_state, interactive_state], 
@ -925,7 +972,7 @@ def display(tabs, tab_state, model_choice, vace_video_input, vace_video_mask, va
                matting_button.click(
                    fn=image_matting,
                    inputs=[image_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, image_selection_slider],
-                    outputs=[foreground_image_output, export_image_btn]
+                    outputs=[foreground_image_output, alpha_image_output,bbox_info, export_image_btn, export_image_mask_btn]
                )


--- a/wan/any2video.py
+++ b/wan/any2video.py
@ -61,6 +61,7 @@ class WanAny2V:
        checkpoint_dir,
        model_filename = None,
        model_type = None, 
+        model_def = None,
        base_model_type = None,
        text_encoder_filename = None,
        quantizeTransformer = False,
@ -75,7 +76,8 @@ class WanAny2V:
        self.dtype = dtype
        self.num_train_timesteps = config.num_train_timesteps
        self.param_dtype = config.param_dtype
-
+        self.model_def = model_def
+        self.image_outputs = (model_def or {}).get("image_outputs", False)  # model_def defaults to None
        self.text_encoder = T5EncoderModel(
            text_len=config.text_len,
            dtype=config.t5_dtype,
@ -106,7 +108,7 @@ class WanAny2V:
        #     config = json.load(f)
        # from mmgp import safetensors2
        # sd = safetensors2.torch_load_file(xmodel_filename)
-        # model_filename = "c:/temp/fflf/diffusion_pytorch_model-00001-of-00007.safetensors"
+        # model_filename = "c:/temp/flf/diffusion_pytorch_model-00001-of-00007.safetensors"
        base_config_file = f"configs/{base_model_type}.json"
        forcedConfigPath = base_config_file if len(model_filename) > 1 else None
        # forcedConfigPath = base_config_file = f"configs/flf2v_720p.json"
@ -208,7 +210,7 @@ class WanAny2V:

            if refs is not None:
                length = len(refs)
-                mask_pad = torch.zeros_like(mask[:, :length, :, :])
+                mask_pad = torch.zeros(mask.shape[0], length, *mask.shape[-2:], dtype=mask.dtype, device=mask.device)
                mask = torch.cat((mask_pad, mask), dim=1)
            result_masks.append(mask)
        return result_masks
@ -327,20 +329,6 @@ class WanAny2V:
            self.background_mask = [ item if item != None else self.background_mask[0] for item in self.background_mask ] # deplicate background mask with double control net since first controlnet image ref modifed by ref
        return src_video, src_mask, src_ref_images

-    def decode_latent(self, zs, ref_images=None, tile_size= 0 ):
-        if ref_images is None:
-            ref_images = [None] * len(zs)
-        # else:
-        #     assert len(zs) == len(ref_images)
-
-        trimed_zs = []
-        for z, refs in zip(zs, ref_images):
-            if refs is not None:
-                z = z[:, len(refs):, :, :]
-            trimed_zs.append(z)
-
-        return self.vae.decode(trimed_zs, tile_size= tile_size)
-
    def get_vae_latents(self, ref_images, device, tile_size= 0):
        ref_vae_latents = []
        for ref_image in ref_images:
@ -366,6 +354,7 @@ class WanAny2V:
        height = 720,
        fit_into_canvas = True,
        frame_num=81,
+        batch_size = 1,
        shift=5.0,
        sample_solver='unipc',
        sampling_steps=50,
@ -397,6 +386,7 @@ class WanAny2V:
        NAG_alpha = 0.5,
        offloadobj = None,
        apg_switch = False,
+        speakers_bboxes = None,
        **bbargs
                ):
        
@ -554,8 +544,8 @@ class WanAny2V:
            overlapped_latents_frames_num = int(1 + (preframes_count-1) // 4)
            if overlapped_latents != None:
                # disabled because looks worse
-                if False and overlapped_latents_frames_num > 1: lat_y[:, 1:overlapped_latents_frames_num]  = overlapped_latents[:, 1:]
-                extended_overlapped_latents = lat_y[:, :overlapped_latents_frames_num].clone() 
+                if False and overlapped_latents_frames_num > 1: lat_y[:, :, 1:overlapped_latents_frames_num]  = overlapped_latents[:, 1:]
+                extended_overlapped_latents = lat_y[:, :overlapped_latents_frames_num].clone().unsqueeze(0)
            y = torch.concat([msk, lat_y])
            lat_y = None
            kwargs.update({'clip_fea': clip_context, 'y': y})
@ -586,7 +576,7 @@ class WanAny2V:
                    overlapped_frames_num = (overlapped_latents_frames_num-1) * 4 + 1
                else: 
                    overlapped_latents_frames_num = overlapped_frames_num  = 0
-                if len(keep_frames_parsed) == 0  or  (overlapped_frames_num + len(keep_frames_parsed)) == input_frames.shape[1] and all(keep_frames_parsed) : keep_frames_parsed = [] 
+                if len(keep_frames_parsed) == 0  or self.image_outputs or  (overlapped_frames_num + len(keep_frames_parsed)) == input_frames.shape[1] and all(keep_frames_parsed) : keep_frames_parsed = [] 
                injection_denoising_step = int(sampling_steps * (1. - denoising_strength) )
                latent_keep_frames = []
                if source_latents.shape[1] < lat_frames or len(keep_frames_parsed) > 0:
@ -609,6 +599,7 @@ class WanAny2V:
                input_ref_images = self.get_vae_latents(input_ref_images, self.device)
                input_ref_images_neg = torch.zeros_like(input_ref_images)
                ref_images_count = input_ref_images.shape[1] if input_ref_images != None else 0
+                trim_frames = input_ref_images.shape[1]

        # Vace
        if vace :
@ -633,8 +624,8 @@ class WanAny2V:
            context_scale = context_scale if context_scale != None else [1.0] * len(z)
            kwargs.update({'vace_context' : z, 'vace_context_scale' : context_scale, "ref_images_count": ref_images_count })
            if overlapped_latents != None :
-                overlapped_latents_size = overlapped_latents.shape[1]
-                extended_overlapped_latents = z[0][0:16, 0:overlapped_latents_size + ref_images_count].clone()
+                overlapped_latents_size = overlapped_latents.shape[2]
+                extended_overlapped_latents = z[0][:16, :overlapped_latents_size + ref_images_count].clone().unsqueeze(0)

            target_shape = list(z0[0].shape)
            target_shape[0] = int(target_shape[0] / 2)
@ -649,7 +640,7 @@ class WanAny2V:
            from wan.multitalk.multitalk import get_target_masks
            audio_proj = [audio.to(self.dtype) for audio in audio_proj]
            human_no = len(audio_proj[0])
-            token_ref_target_masks = get_target_masks(human_no, lat_h, lat_w, height, width, face_scale = 0.05, bbox = None).to(self.dtype) if human_no > 1 else None
+            token_ref_target_masks = get_target_masks(human_no, lat_h, lat_w, height, width, face_scale = 0.05, bbox = speakers_bboxes).to(self.dtype) if human_no > 1 else None

        if fantasy and audio_proj != None:
            kwargs.update({ "audio_proj": audio_proj.to(self.dtype), "audio_context_lens": audio_context_lens, }) 
@ -658,8 +649,8 @@ class WanAny2V:
        if self._interrupt:
            return None

+        expand_shape = [batch_size] + [-1] * len(target_shape)
        # Ropes
-        batch_size = 1
        if target_camera != None:
            shape = list(target_shape[1:])
            shape[0] *= 2
@ -698,14 +689,14 @@ class WanAny2V:

        if sample_scheduler != None:
            scheduler_kwargs = {} if isinstance(sample_scheduler, FlowMatchScheduler) else {"generator": seed_g}
-
-        latents = torch.randn( *target_shape, dtype=torch.float32, device=self.device, generator=seed_g)
+        # b, c, lat_f, lat_h, lat_w
+        latents = torch.randn(batch_size, *target_shape, dtype=torch.float32, device=self.device, generator=seed_g)
        if apg_switch != 0:  
            apg_momentum = -0.75
            apg_norm_threshold = 55
            text_momentumbuffer  = MomentumBuffer(apg_momentum) 
            audio_momentumbuffer = MomentumBuffer(apg_momentum) 
-
+        # self.image_outputs = False
        # denoising
        for i, t in enumerate(tqdm(timesteps)):
            offload.set_step_no_for_lora(self.model, i)
@ -715,36 +706,36 @@ class WanAny2V:

            if denoising_strength < 1 and input_frames != None and i <= injection_denoising_step:
                sigma = t / 1000
-                noise = torch.randn( *target_shape, dtype=torch.float32, device=self.device, generator=seed_g)
+                noise = torch.randn(batch_size, *target_shape, dtype=torch.float32, device=self.device, generator=seed_g)
                if inject_from_start:
                    new_latents = latents.clone()
-                    new_latents[:, :source_latents.shape[1] ] = noise[:, :source_latents.shape[1] ] * sigma + (1 - sigma) * source_latents
+                    new_latents[:,:, :source_latents.shape[1] ] = noise[:, :, :source_latents.shape[1] ] * sigma + (1 - sigma) * source_latents.unsqueeze(0)
                    for latent_no, keep_latent in enumerate(latent_keep_frames):
                        if not keep_latent:
-                            new_latents[:, latent_no:latent_no+1 ] = latents[:, latent_no:latent_no+1]
+                            new_latents[:, :, latent_no:latent_no+1 ] = latents[:, :, latent_no:latent_no+1]
                    latents = new_latents
                    new_latents = None
                else:
-                    latents = noise * sigma + (1 - sigma) * source_latents
+                    latents = noise * sigma + (1 - sigma) * source_latents.unsqueeze(0)
                noise = None

            if extended_overlapped_latents != None:
                latent_noise_factor = t / 1000
-                latents[:, 0:extended_overlapped_latents.shape[1]]   = extended_overlapped_latents  * (1.0 - latent_noise_factor) + torch.randn_like(extended_overlapped_latents ) * latent_noise_factor 
+                latents[:, :, :extended_overlapped_latents.shape[2]]   = extended_overlapped_latents  * (1.0 - latent_noise_factor) + torch.randn_like(extended_overlapped_latents ) * latent_noise_factor 
                if vace:
                    overlap_noise_factor = overlap_noise / 1000 
                    for zz in z:
-                        zz[0:16, ref_images_count:extended_overlapped_latents.shape[1] ]   = extended_overlapped_latents[:, ref_images_count:]  * (1.0 - overlap_noise_factor) + torch.randn_like(extended_overlapped_latents[:, ref_images_count:] ) * overlap_noise_factor 
+                        zz[0:16, ref_images_count:extended_overlapped_latents.shape[2] ]   = extended_overlapped_latents[0, :, ref_images_count:]  * (1.0 - overlap_noise_factor) + torch.randn_like(extended_overlapped_latents[0, :, ref_images_count:] ) * overlap_noise_factor 

            if target_camera != None:
-                latent_model_input = torch.cat([latents, source_latents], dim=1)
+                latent_model_input = torch.cat([latents, source_latents.unsqueeze(0).expand(*expand_shape)], dim=2) # !!!!
            else:
                latent_model_input = latents

            if phantom:
                gen_args = {
-                    "x" : ([ torch.cat([latent_model_input[:,:-ref_images_count], input_ref_images], dim=1) ] * 2 + 
-                        [ torch.cat([latent_model_input[:,:-ref_images_count], input_ref_images_neg], dim=1)]),
+                    "x" : ([ torch.cat([latent_model_input[:,:, :-ref_images_count], input_ref_images.unsqueeze(0).expand(*expand_shape)], dim=2) ] * 2 + 
+                        [ torch.cat([latent_model_input[:,:, :-ref_images_count], input_ref_images_neg.unsqueeze(0).expand(*expand_shape)], dim=2)]),
                    "context": [context, context_null, context_null] ,
                }
            elif fantasy:
@ -832,38 +823,41 @@ class WanAny2V:
            if sample_solver == "euler":
                dt = timesteps[i] if i == len(timesteps)-1 else (timesteps[i] - timesteps[i + 1])
                dt = dt / self.num_timesteps
-                latents = latents - noise_pred * dt[:, None, None, None]
+                latents = latents - noise_pred * dt[:, None, None, None, None]
            else:
-                temp_x0 = sample_scheduler.step(
-                    noise_pred[:, :target_shape[1]].unsqueeze(0),
+                latents = sample_scheduler.step(
+                    noise_pred[:, :, :target_shape[1]],
                    t,
-                    latents.unsqueeze(0),
+                    latents,
                    **scheduler_kwargs)[0]
-                latents = temp_x0.squeeze(0)
-                del temp_x0

            if callback is not None:
-                callback(i, latents, False)         
+                latents_preview = latents
+                if vace and ref_images_count > 0: latents_preview = latents_preview[:, :, ref_images_count: ] 
+                if trim_frames > 0:  latents_preview=  latents_preview[:, :,:-trim_frames]
+                if len(latents_preview) > 1: latents_preview = latents_preview.transpose(0,2)
+                callback(i, latents_preview[0], False)
+                latents_preview = None

-        x0 = [latents]
+        if vace and ref_images_count > 0: latents = latents[:, :, ref_images_count:]
+        if trim_frames > 0:  latents=  latents[:, :,:-trim_frames]
+        if return_latent_slice != None:
+            latent_slice = latents[:, :, return_latent_slice].clone()
+
+        x0 =latents.unbind(dim=0)

        if chipmunk:
            self.model.release_chipmunk() # need to add it at every exit when in prod

-        if return_latent_slice != None:
-            latent_slice = latents[:, return_latent_slice].clone()
-        if vace:
-            # vace post processing
-            videos = self.decode_latent(x0, input_ref_images, VAE_tile_size)
-        else:
-            if phantom and input_ref_images != None:
-                trim_frames = input_ref_images.shape[1]
-            if trim_frames > 0: x0 = [x0_[:,:-trim_frames] for x0_ in x0]
        videos = self.vae.decode(x0, VAE_tile_size)

+        if self.image_outputs:
+            videos = torch.cat(videos, dim=1) if len(videos) > 1 else videos[0]
+        else:
+            videos = videos[0] # return only first video     
        if return_latent_slice != None:
-            return { "x" : videos[0], "latent_slice" : latent_slice }
-        return videos[0]
+            return { "x" : videos, "latent_slice" : latent_slice }
+        return videos

    def adapt_vace_model(self):
        model = self.model
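The `any2video.py` hunks above give the latents a leading batch dimension, so the unbatched source latents `(c, f, h, w)` have to be broadcast over the batch when they are blended with noise. A minimal sketch of that broadcasting follows. It uses NumPy arrays as a stand-in for the torch tensors, and the `blend_with_source` helper and the sizes are hypothetical, chosen only for illustration:

```python
import numpy as np

def blend_with_source(noise, source_latents, sigma):
    """Blend batched noise (b, c, f, h, w) with unbatched source latents (c, f, h, w)."""
    # source_latents[None] plays the role of source_latents.unsqueeze(0):
    # the new leading axis of size 1 broadcasts across the batch.
    return noise * sigma + (1.0 - sigma) * source_latents[None]

rng = np.random.default_rng(0)
batch_size, c, f, h, w = 4, 16, 1, 8, 8  # illustrative sizes, not taken from the commit
noise = rng.standard_normal((batch_size, c, f, h, w))
source = rng.standard_normal((c, f, h, w))

latents = blend_with_source(noise, source, sigma=0.25)
print(latents.shape)
```

With `sigma = 0` every sample in the batch collapses to the same source latents, which is why only the noise term needs a per-sample draw.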
--- a/wan/diffusion_forcing.py
+++ b/wan/diffusion_forcing.py
@ -31,6 +31,7 @@ class DTT2V:
        rank=0,
        model_filename = None,
        model_type = None,
+        model_def = None,
        base_model_type = None,
        save_quantized = False,
        text_encoder_filename = None,
@ -53,6 +54,8 @@ class DTT2V:
            checkpoint_path=text_encoder_filename,
            tokenizer_path=os.path.join(checkpoint_dir, config.t5_tokenizer),
            shard_fn= None)
+        self.model_def = model_def
+        self.image_outputs = (model_def or {}).get("image_outputs", False)  # model_def defaults to None

        self.vae_stride = config.vae_stride
        self.patch_size = config.patch_size 
@ -202,6 +205,7 @@ class DTT2V:
        width: int = 832,
        fit_into_canvas = True,
        frame_num: int = 97,
+        batch_size = 1,
        sampling_steps: int = 50,
        shift: float = 1.0,
        guide_scale: float = 5.0,
@ -224,6 +228,7 @@ class DTT2V:
        generator = torch.Generator(device=self.device)
        generator.manual_seed(seed)
        self._guidance_scale = guide_scale
+        if frame_num > 1:
            frame_num = max(17, frame_num) # must match causal_block_size for value of 5
            frame_num = int( round( (frame_num - 17) / 20)* 20 + 17 )

@ -297,12 +302,12 @@ class DTT2V:
                    prefix_video = prefix_video[:, : predix_video_latent_length]

        base_num_frames_iter = latent_length
-        latent_shape = [16, base_num_frames_iter, latent_height, latent_width]
+        latent_shape = [batch_size, 16, base_num_frames_iter, latent_height, latent_width]
        latents = self.prepare_latents(
            latent_shape, dtype=torch.float32, device=self.device, generator=generator
        )
        if prefix_video is not None:
-            latents[:, :predix_video_latent_length] = prefix_video.to(torch.float32)
+            latents[:, :, :predix_video_latent_length] = prefix_video.to(torch.float32)
        step_matrix, _, step_update_mask, valid_interval = self.generate_timestep_matrix(
            base_num_frames_iter,
            init_timesteps,
@ -340,7 +345,7 @@ class DTT2V:
        else:
            self.model.enable_cache = None
        from mmgp import offload
-        freqs = get_rotary_pos_embed(latents.shape[1 :], enable_RIFLEx= False) 
+        freqs = get_rotary_pos_embed(latents.shape[2 :], enable_RIFLEx= False) 
        kwrags = {
            "freqs" :freqs,
            "fps" : fps_embeds,
@ -358,15 +363,15 @@ class DTT2V:
            update_mask_i = step_update_mask[i]
            valid_interval_start, valid_interval_end = valid_interval[i]
            timestep = timestep_i[None, valid_interval_start:valid_interval_end].clone()
-            latent_model_input = latents[:, valid_interval_start:valid_interval_end, :, :].clone()
+            latent_model_input = latents[:, :, valid_interval_start:valid_interval_end, :, :].clone()
            if overlap_noise > 0 and valid_interval_start < predix_video_latent_length:
                noise_factor = 0.001 * overlap_noise
                timestep_for_noised_condition = overlap_noise
-                latent_model_input[:, valid_interval_start:predix_video_latent_length] = (
-                    latent_model_input[:, valid_interval_start:predix_video_latent_length]
+                latent_model_input[:, :, valid_interval_start:predix_video_latent_length] = (
+                    latent_model_input[:, :, valid_interval_start:predix_video_latent_length]
                    * (1.0 - noise_factor)
                    + torch.randn_like(
-                        latent_model_input[:, valid_interval_start:predix_video_latent_length]
+                        latent_model_input[:, :, valid_interval_start:predix_video_latent_length]
                    )
                    * noise_factor
                )
@ -417,18 +422,27 @@ class DTT2V:
                    del noise_pred_cond, noise_pred_uncond
            for idx in range(valid_interval_start, valid_interval_end):
                if update_mask_i[idx].item():
-                    latents[:, idx] = sample_schedulers[idx].step(
-                        noise_pred[:, idx - valid_interval_start],
+                    latents[:, :, idx] = sample_schedulers[idx].step(
+                        noise_pred[:, :, idx - valid_interval_start],
                        timestep_i[idx],
-                        latents[:, idx],
+                        latents[:, :, idx],
                        return_dict=False,
                        generator=generator,
                    )[0]
                    sample_schedulers_counter[idx] += 1
            if callback is not None:
-                callback(i, latents.squeeze(0), False)         
+                latents_preview = latents
+                if len(latents_preview) > 1: latents_preview = latents_preview.transpose(0,2)
+                callback(i, latents_preview[0], False)
+                latents_preview = None

-        x0 = latents.unsqueeze(0)
-        videos = [self.vae.decode(x0, tile_size= VAE_tile_size)[0]]
-        output_video = videos[0].clamp(-1, 1).cpu()  # c, f, h, w
-        return output_video
+        x0 =latents.unbind(dim=0)
+
+        videos = self.vae.decode(x0, VAE_tile_size)
+
+        if self.image_outputs:
+            videos = torch.cat(videos, dim=1) if len(videos) > 1 else videos[0]
+        else:
+            videos = videos[0] # return only first video     
+
+        return videos
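Both generators now end the same way: the batched latents are split per sample, decoded, and then either joined along the frame axis (image mode) or reduced to the first sample (video mode). The sketch below mirrors that tail with NumPy arrays standing in for torch tensors. `fake_decode` is a hypothetical placeholder for the VAE decode, and its 8x spatial upscale and 3 output channels are assumptions:

```python
import numpy as np

def fake_decode(latent):
    """Hypothetical stand-in for a VAE decode: (16, f, h, w) -> (3, f, 8h, 8w)."""
    _, f, h, w = latent.shape
    return np.zeros((3, f, h * 8, w * 8), dtype=np.float32)

def collect_outputs(latents, image_outputs):
    # list(latents) mirrors latents.unbind(dim=0): one (c, f, h, w) array per sample
    samples = list(latents)
    videos = [fake_decode(z) for z in samples]
    if image_outputs:
        # join the single-frame samples along the frame axis -> (3, batch, H, W)
        return np.concatenate(videos, axis=1) if len(videos) > 1 else videos[0]
    return videos[0]  # video mode: only the first sample is returned

latents = np.zeros((4, 16, 1, 8, 8), dtype=np.float32)
images = collect_outputs(latents, image_outputs=True)
first_only = collect_outputs(latents, image_outputs=False)
print(images.shape, first_only.shape)
```

In image mode each generated image therefore appears as one "frame" of the returned tensor, which lets the existing video-shaped plumbing carry a batch of images.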
--- a/wan/modules/attention.py
+++ b/wan/modules/attention.py
@ -194,7 +194,9 @@ def pay_attention(

    q = q.to(v.dtype)
    k = k.to(v.dtype)
-
+    batch = len(q)
+    if len(k) != batch: k = k.expand(batch, -1, -1, -1)
+    if len(v) != batch: v = v.expand(batch, -1, -1, -1)
    if attn == "chipmunk":
        from src.chipmunk.modules import SparseDiffMlp, SparseDiffAttn
        from src.chipmunk.util import LayerCounter, GLOBAL_CONFIG
--- a/wan/modules/model.py
+++ b/wan/modules/model.py
@ -33,9 +33,10 @@ def sinusoidal_embedding_1d(dim, position):


 def reshape_latent(latent, latent_frames):
-    if latent_frames == latent.shape[0]:
-        return latent
-    return latent.reshape(latent_frames, -1, latent.shape[-1] )
+    return latent.reshape(latent.shape[0], latent_frames, -1, latent.shape[-1] )
+
+def restore_latent_shape(latent):
+    return latent.reshape(latent.shape[0], -1, latent.shape[-1] )


 def identify_k( b: float, d: int, N: int):
@ -493,7 +494,7 @@ class WanAttentionBlock(nn.Module):
        x_mod = reshape_latent(x_mod , latent_frames)
        x_mod *= 1 + e[1]
        x_mod += e[0]
-        x_mod = reshape_latent(x_mod , 1)
+        x_mod = restore_latent_shape(x_mod)
        if cam_emb != None:
            cam_emb = self.cam_encoder(cam_emb)
            cam_emb = cam_emb.repeat(1, 2, 1)
@ -510,7 +511,7 @@ class WanAttentionBlock(nn.Module):

        x, y = reshape_latent(x , latent_frames), reshape_latent(y , latent_frames)
        x.addcmul_(y, e[2])
-        x, y = reshape_latent(x , 1), reshape_latent(y , 1)
+        x, y = restore_latent_shape(x), restore_latent_shape(y)
        del y
        y = self.norm3(x)
        y = y.to(attention_dtype)
@ -542,7 +543,7 @@ class WanAttentionBlock(nn.Module):
        y = reshape_latent(y , latent_frames)
        y *= 1 + e[4]
        y += e[3]
-        y = reshape_latent(y , 1)
+        y = restore_latent_shape(y)
        y = y.to(attention_dtype)

        ffn = self.ffn[0]
@ -562,7 +563,7 @@ class WanAttentionBlock(nn.Module):
        y = y.to(dtype)
        x, y = reshape_latent(x , latent_frames), reshape_latent(y , latent_frames)
        x.addcmul_(y, e[5])
-        x, y = reshape_latent(x , 1), reshape_latent(y , 1)
+        x, y = restore_latent_shape(x), restore_latent_shape(y)

        if hints_processed is not None:
            for hint, scale in zip(hints_processed, context_scale):
@ -669,6 +670,8 @@ class VaceWanAttentionBlock(WanAttentionBlock):
        hints[0] = None
        if self.block_id == 0:
            c = self.before_proj(c)
+            bz = x.shape[0]
+            if bz > c.shape[0]: c = c.repeat(bz, 1, 1 )
            c += x
        c = super().forward(c, **kwargs)
        c_skip = self.after_proj(c)
@ -707,7 +710,7 @@ class Head(nn.Module):
        x = reshape_latent(x , latent_frames)
        x *= (1 + e[1])
        x += e[0]
-        x = reshape_latent(x , 1)
+        x = restore_latent_shape(x)
        x= x.to(self.head.weight.dtype)
        x = self.head(x)
        return x
@ -1163,10 +1166,14 @@ class WanModel(ModelMixin, ConfigMixin):
                last_x_idx = i
            else:
                # image source
+                bz = len(x)
                if y is not None:
-                    x = torch.cat([x, y], dim=0)
+                    y = y.unsqueeze(0)        
+                    if bz > 1: y = y.expand(bz, -1, -1, -1, -1)
+                    x = torch.cat([x, y], dim=1)
                # embeddings
-                x = self.patch_embedding(x.unsqueeze(0)).to(modulation_dtype)
+                # x = self.patch_embedding(x.unsqueeze(0)).to(modulation_dtype)
+                x = self.patch_embedding(x).to(modulation_dtype)
                grid_sizes = x.shape[2:]
                if chipmunk:
                    x = x.unsqueeze(-1)
@ -1204,7 +1211,7 @@ class WanModel(ModelMixin, ConfigMixin):
        )  # b, dim        
        e0 = self.time_projection(e).unflatten(1, (6, self.dim)).to(e.dtype)

-        if self.inject_sample_info:
+        if self.inject_sample_info and fps!=None:
            fps = torch.tensor(fps, dtype=torch.long, device=device)

            fps_emb = self.fps_embedding(fps).to(e.dtype) 
@ -1402,7 +1409,7 @@ class WanModel(ModelMixin, ConfigMixin):
            x_list[i] = self.unpatchify(x, grid_sizes)
            del x

-        return [x[0].float() for x in x_list]
+        return [x.float() for x in x_list]

    def unpatchify(self, x, grid_sizes):
        r"""
@ -1427,7 +1434,10 @@ class WanModel(ModelMixin, ConfigMixin):
            u = torch.einsum('fhwpqrc->cfphqwr', u)
            u = u.reshape(c, *[i * j for i, j in zip(grid_sizes, self.patch_size)])
            out.append(u)
-        return out
+        if len(x) == 1:
+            return out[0].unsqueeze(0)
+        else:
+            return torch.stack(out, 0)

    def init_weights(self):
        r"""
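The `reshape_latent` / `restore_latent_shape` pair introduced in `model.py` keeps the batch axis intact while exposing a per-frame axis for modulation. A small round-trip sketch, again with NumPy standing in for torch (the two functions are copied from the diff; the shape of `scale` is an assumption about how the modulation tensor broadcasts):

```python
import numpy as np

def reshape_latent(latent, latent_frames):
    # (b, frames * tokens, dim) -> (b, frames, tokens, dim)
    return latent.reshape(latent.shape[0], latent_frames, -1, latent.shape[-1])

def restore_latent_shape(latent):
    # (b, frames, tokens, dim) -> (b, frames * tokens, dim)
    return latent.reshape(latent.shape[0], -1, latent.shape[-1])

b, frames, tokens, dim = 2, 3, 5, 4  # illustrative sizes
x = np.arange(b * frames * tokens * dim, dtype=np.float32).reshape(b, frames * tokens, dim)

# Assumed modulation shape: one scale vector per sample and per frame
scale = np.full((b, frames, 1, dim), 2.0, dtype=np.float32)

x_mod = reshape_latent(x, frames) * scale   # per-frame modulation
x_out = restore_latent_shape(x_mod)         # back to the flat token layout
print(x_out.shape)
```

Unlike the previous `reshape_latent(x, 1)` call, the restore step no longer depends on the batch size being 1.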
--- a/wan/multitalk/attention.py
+++ b/wan/multitalk/attention.py
@ -333,7 +333,7 @@ class SingleStreamMutiAttention(SingleStreamAttention):

        human1 = normalize_and_scale(x_ref_attn_map[0], (human1_min_value, human1_max_value), (self.rope_h1[0], self.rope_h1[1]))
        human2 = normalize_and_scale(x_ref_attn_map[1], (human2_min_value, human2_max_value), (self.rope_h2[0], self.rope_h2[1]))
-        back   = torch.full((x_ref_attn_map.size(1),), self.rope_bak, dtype=human1.dtype).to(human1.device)
+        back   = torch.full((x_ref_attn_map.size(1),), self.rope_bak, dtype=human1.dtype, device=human1.device)
        max_indices = x_ref_attn_map.argmax(dim=0)
        normalized_map = torch.stack([human1, human2, back], dim=1)
        normalized_pos = normalized_map[range(x_ref_attn_map.size(1)), max_indices] # N 
@ -351,7 +351,7 @@ class SingleStreamMutiAttention(SingleStreamAttention):
        if self.qk_norm:
            encoder_k = self.add_k_norm(encoder_k)

-        per_frame = torch.zeros(N_a, dtype=encoder_k.dtype).to(encoder_k.device)
+        per_frame = torch.zeros(N_a, dtype=encoder_k.dtype, device=encoder_k.device)
        per_frame[:per_frame.size(0)//2] = (self.rope_h1[0] + self.rope_h1[1]) / 2
        per_frame[per_frame.size(0)//2:] = (self.rope_h2[0] + self.rope_h2[1]) / 2
        encoder_pos = torch.concat([per_frame]*N_t, dim=0)
--- a/wan/multitalk/multitalk.py
+++ b/wan/multitalk/multitalk.py
@ -272,6 +272,34 @@ def timestep_transform(
    new_t = new_t * num_timesteps
    return new_t

+def parse_speakers_locations(speakers_locations):
+    bbox = {}
+    if speakers_locations is None or len(speakers_locations) == 0:
+        return None, ""
+    speakers = speakers_locations.split(" ")
+    if len(speakers) !=2:
+        error = "Two speaker locations should be defined"
+        return "", error
+    
+    for i, speaker in enumerate(speakers):
+        location = speaker.strip().split(":")
+        if len(location) not in (2,4):
+            error = f"Invalid Speaker Location '{location}'. A Speaker Location should be defined in the format Left:Right or using a BBox Left:Top:Right:Bottom"
+            return "", error
+        try:
+            good = False
+            location_float = [ float(val) for val in location]
+            good = all( 0 <= val <= 100 for val in location_float)
+        except ValueError:
+            pass
+        if not good:
+            error = f"Invalid Speaker Location '{location}'. Each number should be between 0 and 100."
+            return "", error
+        if len(location_float) == 2:
+            location_float = [location_float[0], 0, location_float[1], 100]
+        bbox[f"human{i}"] = location_float
+    return bbox, ""
+

 # construct human mask
 def get_target_masks(HUMAN_NUMBER, lat_h, lat_w, src_h, src_w, face_scale = 0.05, bbox = None):
@ -286,7 +314,9 @@ def get_target_masks(HUMAN_NUMBER, lat_h, lat_w, src_h, src_w, face_scale = 0.05
            assert len(bbox) == HUMAN_NUMBER, f"The number of target bbox should be the same with cond_audio"
            background_mask = torch.zeros([src_h, src_w])
            for _, person_bbox in bbox.items():
-                x_min, y_min, x_max, y_max = person_bbox
+                y_min, x_min, y_max, x_max = person_bbox  # bbox is Left:Top:Right:Bottom in percent; x indexes rows, y indexes columns
+                x_min, y_min, x_max, y_max = max(x_min,5), max(y_min, 5), min(x_max,95), min(y_max,95)                
+                x_min, y_min, x_max, y_max =  int(src_h * x_min / 100), int(src_w * y_min / 100), int(src_h * x_max / 100), int(src_w * y_max / 100)
                human_mask = torch.zeros([src_h, src_w])
                human_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
                background_mask += human_mask
@ -306,7 +336,7 @@ def get_target_masks(HUMAN_NUMBER, lat_h, lat_w, src_h, src_w, face_scale = 0.05
            human_masks = [human_mask1, human_mask2]
        background_mask = torch.where(background_mask > 0, torch.tensor(0), torch.tensor(1))
        human_masks.append(background_mask)
-    
+    # toto = Image.fromarray(human_masks[2].mul_(255).unsqueeze(-1).repeat(1,1,3).to(torch.uint8).cpu().numpy())
    ref_target_masks = torch.stack(human_masks, dim=0) #.to(self.device)
    # resize and centercrop for ref_target_masks 
    # ref_target_masks = resize_and_centercrop(ref_target_masks, (target_h, target_w))
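The two `multitalk.py` hunks above work together: a speaker location is given as percentages (`Left:Right` or `Left:Top:Right:Bottom`), and `get_target_masks` clamps it to the 5..95 range before converting it to pixel bounds. The following is a simplified, standalone sketch of that flow, not the commit's code: `parse_location` and `box_to_pixels` are hypothetical helpers, errors are raised rather than returned, and the image size is an arbitrary example.

```python
def parse_location(text):
    """Parse 'Left:Right' or 'Left:Top:Right:Bottom' (percentages) into a 4-value box."""
    values = [float(v) for v in text.strip().split(":")]
    if len(values) not in (2, 4) or not all(0 <= v <= 100 for v in values):
        raise ValueError(f"invalid speaker location: {text!r}")
    if len(values) == 2:
        # Left:Right only -> span the full height
        values = [values[0], 0.0, values[1], 100.0]
    return values

def box_to_pixels(box, src_h, src_w, margin=5):
    """Clamp a percent box to [margin, 100 - margin] and convert it to pixel bounds."""
    left, top, right, bottom = box
    left, top = max(left, margin), max(top, margin)
    right, bottom = min(right, 100 - margin), min(bottom, 100 - margin)
    # rows come from top/bottom (scaled by height), columns from left/right (scaled by width)
    return (int(src_h * top / 100), int(src_h * bottom / 100),
            int(src_w * left / 100), int(src_w * right / 100))

# Two speakers: one in the left half, one in the right half of a 480x832 frame
boxes = [parse_location(s) for s in "0:50 50:100".split(" ")]
rects = [box_to_pixels(b, src_h=480, src_w=832) for b in boxes]
print(rects)  # each tuple is (row_min, row_max, col_min, col_max)
```

The row/column naming here is deliberate: in the diff the variables called `x_*` index rows and `y_*` index columns, which is easy to misread.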
--- a/wan/multitalk/multitalk_utils.py
+++ b/wan/multitalk/multitalk_utils.py
@ -128,7 +128,7 @@ def get_attn_map_with_target(visual_q, ref_k, shape, ref_target_masks=None, spli

    _, seq_lens, heads, _ = visual_q.shape
    class_num, _ = ref_target_masks.shape
-    x_ref_attn_maps = torch.zeros(class_num, seq_lens).to(visual_q.device).to(visual_q.dtype)
+    x_ref_attn_maps = torch.zeros(class_num, seq_lens, dtype=visual_q.dtype, device=visual_q.device)

    split_chunk = heads // split_num
    
--- a/wan/utils/utils.py
+++ b/wan/utils/utils.py
@ -5,7 +5,8 @@ import os
 import os.path as osp
 import torchvision.transforms.functional as TF
 import torch.nn.functional as F
-
+import cv2
+import tempfile
 import imageio
 import torch
 import decord
@ -101,6 +102,29 @@ def get_video_frame(file_name, frame_no):
    img = Image.fromarray(frame.numpy().astype(np.uint8))
    return img

+def convert_image_to_video(image):
+    if image is None:
+        return None
+    
+    # Convert PIL/numpy image to OpenCV format if needed
+    if isinstance(image, np.ndarray):
+        # Gradio images are typically RGB, OpenCV expects BGR
+        img_bgr = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
+    else:
+        # Handle PIL Image
+        img_array = np.array(image)
+        img_bgr = cv2.cvtColor(img_array, cv2.COLOR_RGB2BGR)
+    
+    height, width = img_bgr.shape[:2]
+    
+    # Create a temporary video file (delete=False, so the caller must remove it when done)
+    with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as temp_video:
+        fourcc = cv2.VideoWriter_fourcc(*'mp4v')
+        out = cv2.VideoWriter(temp_video.name, fourcc, 30.0, (width, height))
+        out.write(img_bgr)
+        out.release()
+        return temp_video.name
+    
 def resize_lanczos(img, h, w):
    img = Image.fromarray(np.clip(255. * img.movedim(0, -1).cpu().numpy(), 0, 255).astype(np.uint8))
    img = img.resize((w,h), resample=Image.Resampling.LANCZOS) 
--- a/wgp.py
+++ b/wgp.py
@ -16,7 +16,7 @@ import json
 import wan
 from wan.utils import notification_sound
 from wan.configs import MAX_AREA_CONFIGS, WAN_CONFIGS, SUPPORTED_SIZES, VACE_SIZE_CONFIGS
-from wan.utils.utils import cache_video, convert_tensor_to_image, save_image, get_video_info, get_file_creation_date
+from wan.utils.utils import cache_video, convert_tensor_to_image, save_image, get_video_info, get_file_creation_date, convert_image_to_video
 from wan.utils.utils import extract_audio_tracks, combine_video_with_audio_tracks, cleanup_temp_audio_files

 from wan.modules.attention import get_attention_modes, get_supported_attention_modes
@ -50,7 +50,7 @@ AUTOSAVE_FILENAME = "queue.zip"
 PROMPT_VARS_MAX = 10

 target_mmgp_version = "3.5.1"
-WanGP_version = "6.7"
+WanGP_version = "7.0"
 settings_version = 2.22
 max_source_video_frames = 1000
 prompt_enhancer_image_caption_model, prompt_enhancer_image_caption_processor, prompt_enhancer_llm_model, prompt_enhancer_llm_tokenizer = None, None, None, None
@ -171,6 +171,8 @@ def process_prompt_and_add_tasks(state, model_choice):
        gr.Warning("Internal state error: Could not retrieve inputs for the model.")
        queue = gen.get("queue", [])
        return get_queue_table(queue)
+    model_def = get_model_def(model_type)
+    image_outputs = model_def.get("image_outputs", False)
    model_type = get_base_model_type(model_type)
    inputs["model_filename"] = model_filename
    
@ -182,7 +184,7 @@ def process_prompt_and_add_tasks(state, model_choice):
        if frames_count > max_source_video_frames:
            gr.Info(f"Post processing is not supported on videos longer than {max_source_video_frames} frames. Output Video will be truncated")
            # return
-        for k in ["image_start", "image_end", "image_refs", "video_guide", "audio_guide", "audio_guide2", "video_mask"]:
+        for k in ["image_start", "image_end", "image_refs", "video_guide", "audio_guide", "audio_guide2", "video_mask", "image_mask"]:
            inputs[k] = None    
        inputs.update(edit_overrides)
        del gen["edit_video_source"], gen["edit_overrides"]
@ -193,6 +195,13 @@ def process_prompt_and_add_tasks(state, model_choice):
        if len(spatial_upsampling) >0: prompt += ["Spatial Upsampling"]
        temporal_upsampling = inputs.get("temporal_upsampling","")
        if len(temporal_upsampling) >0: prompt += ["Temporal Upsampling"]
+        if image_outputs and len(temporal_upsampling) > 0:
+            gr.Info("Temporal Upsampling cannot be used with an Image")
+            return 
+        film_grain_intensity  = inputs.get("film_grain_intensity",0)
+        film_grain_saturation  = inputs.get("film_grain_saturation",0.5)        
+        # if film_grain_intensity >0: prompt += [f"Film Grain: intensity={film_grain_intensity}, saturation={film_grain_saturation}"]
+        if film_grain_intensity >0: prompt += ["Film Grain"]
        MMAudio_setting = inputs.get("MMAudio_setting",0)
        seed = inputs.get("seed",None)
        repeat_generation= inputs.get("repeat_generation",1)
@ -201,7 +210,7 @@ def process_prompt_and_add_tasks(state, model_choice):
            return 
        if MMAudio_setting !=0: prompt += ["MMAudio"]
        if len(prompt) == 0:
-            gr.Info("You must choose at leat one Post Processing Method")
+            gr.Info("You must choose at least one Post Processing Method")
            return
        inputs["prompt"] = ", ".join(prompt)
        add_video_task(**inputs)
@ -247,7 +256,10 @@ def process_prompt_and_add_tasks(state, model_choice):
    audio_guide = inputs["audio_guide"]
    audio_guide2 = inputs["audio_guide2"]
    video_guide = inputs["video_guide"]
+    image_guide = inputs["image_guide"]
    video_mask = inputs["video_mask"]
+    image_mask = inputs["image_mask"]
+    speakers_locations = inputs["speakers_locations"]
    video_source = inputs["video_source"]
    frames_positions = inputs["frames_positions"]
    keep_frames_video_guide= inputs["keep_frames_video_guide"] 
@ -269,6 +281,13 @@ def process_prompt_and_add_tasks(state, model_choice):
            gr.Info("Mag Cache maximum number of steps is 50")
            return

+    if "B" in audio_prompt_type or "X" in audio_prompt_type:
+        from wan.multitalk.multitalk import parse_speakers_locations
+        speakers_bboxes, error = parse_speakers_locations(speakers_locations)
+        if len(error) > 0:
+            gr.Info(error)
+            return
+
    if MMAudio_setting != 0 and server_config.get("mmaudio_enabled", 0) != 0 and video_length <16: #should depend on the architecture
        gr.Info("MMAudio can generate an Audio track only if the Video is at least 1s long")
    if "F" in video_prompt_type:
@ -314,12 +333,16 @@ def process_prompt_and_add_tasks(state, model_choice):
        audio_guide2 = None
        
    if model_type in ["vace_multitalk_14B"] and ("B" in audio_prompt_type or "X" in audio_prompt_type):
-        if not "I" in video_prompt_type:
-            gr.Info("To get good results with Multitalk and two people speaking, it is recommended to set a Reference Frame that contains the two people one on each side ")
+        if "I" not in video_prompt_type and "V" not in video_prompt_type:
+            gr.Info("To get good results with Multitalk and two people speaking, it is recommended to set a Reference Frame or a Control Video (potentially truncated) that contains the two people one on each side")

-    if "R" in audio_prompt_type and len(filter_letters(image_prompt_type, "VLG")) > 0 :
+    if len(filter_letters(image_prompt_type, "VL")) > 0 :
+        if "R" in audio_prompt_type:
            gr.Info("Remuxing is not yet supported if there is a video source")
-        audio_prompt_type= replace("R" ,"")
+            audio_prompt_type= audio_prompt_type.replace("R" ,"")
+        if "A" in audio_prompt_type:
+            gr.Info("Creating an Audio track is not yet supported if there is a video source")
+            return

    if model_type in ["hunyuan_custom", "hunyuan_custom_edit", "hunyuan_audio", "hunyuan_avatar"]:
        if image_refs  == None :
@ -342,17 +365,26 @@ def process_prompt_and_add_tasks(state, model_choice):
        image_refs = None

    if "V" in video_prompt_type:
-        if video_guide == None:
+        if video_guide is None and image_guide is None:
+            if image_outputs:
+                gr.Info("You must provide a Control Image")
+            else:
                gr.Info("You must provide a Control Video")
            return
        if "A" in video_prompt_type and not "U" in video_prompt_type:
-            if video_mask == None:
+            if video_mask is None and image_mask is None:
+                if image_outputs:
+                    gr.Info("You must provide an Image Mask")
+                else:
                    gr.Info("You must provide a Video Mask")
                return
        else:
            video_mask = None
+            image_mask = None

-        if not "G" in video_prompt_type: 
+        if "G" in video_prompt_type:
+            gr.Info(f"With Denoising Strength {denoising_strength:.1f}, denoising will start at Step no {int(num_inference_steps * (1. - denoising_strength))}")
+        else: 
            denoising_strength = 1.0

        _, error = parse_keep_frames_video_guide(keep_frames_video_guide, video_length)
@ -361,7 +393,9 @@ def process_prompt_and_add_tasks(state, model_choice):
            return
    else:
        video_guide = None
+        image_guide = None
        video_mask = None
+        image_mask = None
        keep_frames_video_guide = ""
        denoising_strength = 1.0
    
@ -416,10 +450,6 @@ def process_prompt_and_add_tasks(state, model_choice):


    if "hunyuan_custom_custom_edit" in model_filename:
-        if video_guide == None:
-            gr.Info("You must provide a Control Video") 
-            return
-
        if len(keep_frames_video_guide) > 0: 
            gr.Info("Filtering Frames with this model is not supported")
            return
@ -440,7 +470,9 @@ def process_prompt_and_add_tasks(state, model_choice):
        "audio_guide": audio_guide,
        "audio_guide2": audio_guide2,
        "video_guide": video_guide,
+        "image_guide": image_guide,
        "video_mask": video_mask,
+        "image_mask": image_mask,
        "video_source": video_source,
        "frames_positions": frames_positions,
        "keep_frames_video_source": keep_frames_video_source,
@ -517,15 +549,15 @@ def process_prompt_and_add_tasks(state, model_choice):
    return update_queue_data(queue)

 def get_preview_images(inputs):
-    inputs_to_query = ["image_start", "image_end", "video_source", "video_guide", "video_mask", "image_refs" ]
-    labels = ["Start Image", "End Image", "Video Source", "Video Guide", "Video Mask", "Image Reference"]
+    inputs_to_query = ["image_start", "image_end", "video_source", "video_guide", "image_guide", "video_mask", "image_mask", "image_refs" ]
+    labels = ["Start Image", "End Image", "Video Source", "Video Guide", "Image Guide", "Video Mask", "Image Mask", "Image Reference"]
    start_image_data = None
    start_image_labels = []
    end_image_data = None
    end_image_labels = []
    for label, name in  zip(labels,inputs_to_query):
        image= inputs.get(name, None)
-        if image != None:
+        if image is not None:
            image= [image] if not isinstance(image, list) else image.copy()
            if start_image_data == None:
                start_image_data = image
@ -645,7 +677,7 @@ def save_queue_action(state):
            params_copy = task.get('params', {}).copy()
            task_id_s = task.get('id', f"task_{task_index}")

-            image_keys = ["image_start", "image_end", "image_refs"]
+            image_keys = ["image_start", "image_end", "image_refs", "image_guide", "image_mask"]
            video_keys = ["video_guide", "video_mask", "video_source", "audio_guide", "audio_guide2"]

            for key in image_keys:
@ -821,7 +853,7 @@ def load_queue_action(filepath, state, evt:gr.EventData):
                max_id_in_file = max(max_id_in_file, task_id_loaded)
                params['state'] = state

-                image_keys = ["image_start", "image_end", "image_refs"]
+                image_keys = ["image_start", "image_end", "image_refs", "image_guide", "image_mask"]
                video_keys = ["video_guide", "video_mask", "video_source", "audio_guide", "audio_guide2"]

                loaded_pil_images = {}
@ -1041,7 +1073,7 @@ def autosave_queue():
                    params_copy = task.get('params', {}).copy()
                    task_id_s = task.get('id', f"task_{task_index}")

-                    image_keys = ["image_start", "image_end", "image_refs"]
+                    image_keys = ["image_start", "image_end", "image_refs", "image_guide", "image_mask"]
                    video_keys = ["video_guide", "video_mask", "video_source", "audio_guide", "audio_guide2"]

                    for key in image_keys:
@ -1929,12 +1961,6 @@ def get_default_settings(model_type):
    i2v = test_class_i2v(model_type)
    defaults_filename = get_settings_file_name(model_type)
    if not Path(defaults_filename).is_file():
-        model_def = get_model_def(model_type)
-        if model_def != None:
-            ui_defaults = model_def["settings"] 
-            if len(ui_defaults.get("prompt","")) == 0:
-                ui_defaults["prompt"]= get_default_prompt(i2v)
-        else:    
        ui_defaults = {
            "prompt": get_default_prompt(i2v),
            "resolution": "1280x720" if "720" in model_type else "832x480",
@ -2034,6 +2060,14 @@ def get_default_settings(model_type):
            })
            

+        model_def = get_model_def(model_type)
+        if model_def != None:
+            ui_defaults_update = model_def["settings"] 
+            ui_defaults.update(ui_defaults_update)
+
+        if len(ui_defaults.get("prompt","")) == 0:
+            ui_defaults["prompt"]= get_default_prompt(i2v)
+
        with open(defaults_filename, "w", encoding="utf-8") as f:
            json.dump(ui_defaults, f, indent=4)
    else:
@ -2490,6 +2524,7 @@ def load_wan_model(model_filename, model_type, base_model_type, model_def, quant
        checkpoint_dir="ckpts",
        model_filename=model_filename,
        model_type = model_type,        
+        model_def = model_def,
        base_model_type=base_model_type,
        text_encoder_filename= get_wan_text_encoder_filename(text_encoder_quantization),
        quantizeTransformer = quantizeTransformer,
@ -2598,7 +2633,7 @@ def load_models(model_type):
        save_quantized = False
        print("Need to provide a non quantized model to create a quantized model to be saved") 
    if save_quantized and len(modules) > 0:
-        print(f"Unable to create a finetune quantized model as some modules are declared in the finetune definition. If your finetune includes already the module weights you can remove the 'modules' entry and try again. If not you will need also to change temporarly the model 'architecture' to an architecture that wont require the modules part ('{model_types_no_module[0] if len(model_types_no_module)>0 else ''}' ?) to quantize and then add back the original 'modules' and 'architecture' entries.")
+        print(f"Unable to create a finetune quantized model as some modules are declared in the finetune definition. If your finetune already includes the module weights you can remove the 'modules' entry and try again. If not you will also need to temporarily change the model 'architecture' to an architecture that won't require the modules part ({modules}) to quantize and then add back the original 'modules' and 'architecture' entries.")
        save_quantized = False
    quantizeTransformer = not save_quantized and model_def !=None and transformer_quantization in ("int8", "fp8") and model_def.get("auto_quantize", False) and not "quanto" in model_filename
    if quantizeTransformer and len(modules) > 0:
@ -2931,8 +2966,10 @@ def refresh_gallery(state): #, msg
        prompt =  task["prompt"]
        params = task["params"]
        model_type = params["model_type"] 
-        model_type = get_base_model_type(model_type)
-        onemorewindow_visible = test_any_sliding_window(model_type)
+        base_model_type = get_base_model_type(model_type)
+        model_def = get_model_def(model_type) 
+        is_image = model_def.get("image_outputs", False)
+        onemorewindow_visible = test_any_sliding_window(base_model_type) and not is_image
        enhanced = False
        if  prompt.startswith("!enhanced!\n"):
            enhanced = True
@ -3047,7 +3084,7 @@ def select_video(state, input_file_list, event_data: gr.EventData):
        pp_values= []
        pp_labels = []
        extension = os.path.splitext(file_name)[-1]
-        if not extension in [".mp4"]:
+        if not has_video_file_extension(file_name):
            img = Image.open(file_name)
            width, height = img.size
            configs = None
@ -3064,6 +3101,8 @@ def select_video(state, input_file_list, event_data: gr.EventData):
            misc_labels += ["Model"]
            video_temporal_upsampling = configs.get("temporal_upsampling", "")
            video_spatial_upsampling = configs.get("spatial_upsampling", "")
+            video_film_grain_intensity = configs.get("film_grain_intensity", 0)
+            video_film_grain_saturation = configs.get("film_grain_saturation", 0.5)
            video_MMAudio_setting = configs.get("MMAudio_setting", 0)
            video_MMAudio_prompt = configs.get("MMAudio_prompt", "")
            video_MMAudio_neg_prompt = configs.get("MMAudio_neg_prompt", "")
@ -3074,6 +3113,9 @@ def select_video(state, input_file_list, event_data: gr.EventData):
            if len(video_temporal_upsampling) > 0:
                pp_values += [ video_temporal_upsampling ]
                pp_labels += [ "Upsampling" ]
+            if video_film_grain_intensity > 0:
+                pp_values += [ f"Intensity={video_film_grain_intensity}, Saturation={video_film_grain_saturation}" ]
+                pp_labels += [ "Film Grain" ]
            if video_MMAudio_setting != 0:
                pp_values += [ f'Prompt="{video_MMAudio_prompt}", Neg Prompt="{video_MMAudio_neg_prompt}", Seed={video_MMAudio_seed}'  ]
                pp_labels += [ "MMAudio" ]
@ -3206,7 +3248,7 @@ def select_video(state, input_file_list, event_data: gr.EventData):
    else:
        html =  get_default_video_info()
    visible= len(file_list) > 0
-    return choice, html, gr.update(visible=visible and not is_image) , gr.update(visible=visible and is_image), gr.update(visible=visible and is_image) 
+    return choice, html, gr.update(visible=visible and not is_image) , gr.update(visible=visible and is_image), gr.update(visible=visible and not is_image) 
 def expand_slist(slist, num_inference_steps ):
    new_slist= []
    inc =  len(slist) / num_inference_steps 
@ -3674,6 +3716,8 @@ def edit_video(
                seed,   
                temporal_upsampling,
                spatial_upsampling,
+                film_grain_intensity,
+                film_grain_saturation,
                MMAudio_setting,
                MMAudio_prompt,
                MMAudio_neg_prompt,
@ -3694,6 +3738,7 @@ def edit_video(
    if configs == None: configs = { "type" : get_model_record("Post Processing") }

    has_already_audio = False
+    audio_tracks = []
    if MMAudio_setting == 0:
        audio_tracks  = extract_audio_tracks(video_source)
        has_already_audio = len(audio_tracks) > 0
@ -3711,8 +3756,8 @@ def edit_video(
    frames_count = min(frames_count, 1000)
    sample = None

-    if len(temporal_upsampling) > 0 or len(spatial_upsampling) > 0:                
-        send_cmd("progress", [0, get_latest_status(state,"Upsampling")])
+    if len(temporal_upsampling) > 0 or len(spatial_upsampling) > 0 or film_grain_intensity > 0:                
+        send_cmd("progress", [0, get_latest_status(state,"Upsampling" if len(temporal_upsampling) > 0 or len(spatial_upsampling) > 0 else "Adding Film Grain"  )])
        sample = get_resampled_video(video_source, 0, max_source_video_frames, fps)
        sample = sample.float().div_(127.5).sub_(1.).permute(-1,0,1,2)
        frames_count = sample.shape[1] 
@ -3728,6 +3773,12 @@ def edit_video(
        sample = perform_spatial_upsampling(sample, spatial_upsampling )
        configs["spatial_upsampling"] = spatial_upsampling

+    if film_grain_intensity > 0:
+        from postprocessing.film_grain import add_film_grain
+        sample = add_film_grain(sample, film_grain_intensity, film_grain_saturation) 
+        configs["film_grain_intensity"] = film_grain_intensity
+        configs["film_grain_saturation"] = film_grain_saturation
+
    any_mmaudio = MMAudio_setting != 0 and server_config.get("mmaudio_enabled", 0) != 0 and frames_count >=output_fps
    if any_mmaudio: download_mmaudio()

@ -3834,16 +3885,19 @@ def generate_video(
    image_refs,
    frames_positions,
    video_guide,
+    image_guide,
    keep_frames_video_guide,
    denoising_strength,
    video_guide_outpainting,
    video_mask,
+    image_mask,
    control_net_weight,
    control_net_weight2,
    mask_expand,
    audio_guide,
    audio_guide2,
    audio_prompt_type,
+    speakers_locations,
    sliding_window_size,
    sliding_window_overlap,
    sliding_window_overlap_noise,
@ -3851,6 +3905,8 @@ def generate_video(
    remove_background_images_ref,
    temporal_upsampling,
    spatial_upsampling,
+    film_grain_intensity,
+    film_grain_saturation,
    MMAudio_setting,
    MMAudio_prompt,
    MMAudio_neg_prompt,    
@ -3871,11 +3927,17 @@ def generate_video(
    model_filename,
    mode,
 ):
+    
+    def remove_temp_filenames(temp_filenames_list):
+        for temp_filename in temp_filenames_list: 
+            if temp_filename is not None and os.path.isfile(temp_filename):
+                os.remove(temp_filename)
+
    global wan_model, offloadobj, reload_needed, save_path
    gen = get_gen_info(state)
    torch.set_grad_enabled(False) 
    if mode == "edit":    
-        edit_video(send_cmd, state, video_source, seed, temporal_upsampling, spatial_upsampling, MMAudio_setting, MMAudio_prompt, MMAudio_neg_prompt, repeat_generation)
+        edit_video(send_cmd, state, video_source, seed, temporal_upsampling, spatial_upsampling, film_grain_intensity, film_grain_saturation, MMAudio_setting, MMAudio_prompt, MMAudio_neg_prompt, repeat_generation)
        return
    with lock:
        file_list = gen["file_list"]
@ -3884,6 +3946,23 @@ def generate_video(

    model_def = get_model_def(model_type) 
    is_image = model_def.get("image_outputs", False)
+    if is_image:
+        batch_size = video_length
+        video_length = 1
+    else:
+        batch_size = 1
+    temp_filenames_list = []
+
+    if image_guide is not None and isinstance(image_guide, Image.Image):
+        video_guide = convert_image_to_video(image_guide)
+        temp_filenames_list.append(video_guide)
+        image_guide = None
+
+    if image_mask is not None and isinstance(image_mask, Image.Image):
+        video_mask = convert_image_to_video(image_mask)
+        temp_filenames_list.append(video_mask)
+        image_mask = None
+

    fit_canvas = server_config.get("fit_canvas", 0)

@ -3926,7 +4005,6 @@ def generate_video(

    trans = get_transformer_model(wan_model)
    audio_sampling_rate = 16000
-    temp_filename = None
    base_model_type = get_base_model_type(model_type)

    prompts = prompt.split("\n")
@ -4012,6 +4090,11 @@ def generate_video(
    multitalk = base_model_type in ["multitalk", "vace_multitalk_14B"]
    flux_dev_kontext = base_model_type in ["flux_dev_kontext"]

+    if "B" in audio_prompt_type or "X" in audio_prompt_type:
+        from wan.multitalk.multitalk import parse_speakers_locations
+        speakers_bboxes, error = parse_speakers_locations(speakers_locations)
+    else:
+        speakers_bboxes = None        
    if "L" in image_prompt_type:
        if len(file_list)>0:
            video_source = file_list[-1]
@ -4268,7 +4351,7 @@ def generate_video(
            window_start_frame = guide_start_frame - (reuse_frames if window_no > 1 else source_video_overlap_frames_count)
            if reuse_frames > 0:                
                return_latent_slice = slice(-(reuse_frames - 1 + discard_last_frames ) // latent_size - 1, None if discard_last_frames == 0 else -(discard_last_frames // latent_size) )
-            refresh_preview  = {}
+            refresh_preview  = {"image_guide" : None, "image_mask" : None}
            if fantasy:
                window_latent_start_frame = (window_start_frame ) // latent_size 
                window_latent_size= (current_video_length - 1) // latent_size + 1
@ -4426,7 +4509,8 @@ def generate_video(
                    input_video= pre_video_guide  if diffusion_forcing or ltxv or hunyuan_custom_edit else source_video,
                    denoising_strength=denoising_strength,
                    target_camera= target_camera,
-                    frame_num=current_video_length if is_image else (current_video_length // latent_size)* latent_size + 1,
+                    frame_num= (current_video_length // latent_size)* latent_size + 1,
+                    batch_size = batch_size,
                    height =  height,
                    width = width,
                    fit_into_canvas = fit_canvas == 1,
@ -4469,11 +4553,13 @@ def generate_video(
                    NAG_scale = NAG_scale,
                    NAG_tau = NAG_tau,
                    NAG_alpha = NAG_alpha,
+                    speakers_bboxes =speakers_bboxes,
                    offloadobj = offloadobj,
                )
            except Exception as e:
-                if temp_filename!= None and  os.path.isfile(temp_filename):
-                    os.remove(temp_filename)
+                if len(control_audio_tracks) > 0:
+                    cleanup_temp_audio_files(control_audio_tracks)
+                remove_temp_filenames(temp_filenames_list)
                offloadobj.unload_all()
                offload.unload_loras_from_model(trans)
                # if compile:
@ -4569,7 +4655,9 @@ def generate_video(

                if len(spatial_upsampling) > 0:
                    sample = perform_spatial_upsampling(sample, spatial_upsampling )
-
+                if film_grain_intensity> 0:
+                    from postprocessing.film_grain import add_film_grain
+                    sample = add_film_grain(sample, film_grain_intensity, film_grain_saturation) 
                if sliding_window :
                    if frames_already_processed == None:
                        frames_already_processed = sample
@ -4675,8 +4763,8 @@ def generate_video(
    offload.unload_loras_from_model(trans)
    if len(control_audio_tracks) > 0:
        cleanup_temp_audio_files(control_audio_tracks)
-    if temp_filename!= None and  os.path.isfile(temp_filename):
-        os.remove(temp_filename)
+
+    remove_temp_filenames(temp_filenames_list)

 def prepare_generate_video(state):    

@ -5529,7 +5617,7 @@ def prepare_inputs_dict(target, inputs, model_type = None, model_filename = None
    if "lset_name" in inputs:
        inputs.pop("lset_name")
        
-    unsaved_params = ["image_start", "image_end", "image_refs", "video_guide", "video_source", "video_mask", "audio_guide", "audio_guide2"]
+    unsaved_params = ["image_start", "image_end", "image_refs", "video_guide", "image_guide", "video_source", "video_mask", "image_mask", "audio_guide", "audio_guide2"]
    for k in unsaved_params:
        inputs.pop(k)
    if model_filename == None: model_filename = state["model_filename"]
@ -5629,28 +5717,36 @@ def video_to_source_video(state, input_file_list, choice):
    gr.Info("Selected Video was copied to Source Video input")    
    return file_list[choice]

-def image_to_ref_image(state, input_file_list, choice, target, target_name):
+def image_to_ref_image_add(state, input_file_list, choice, target, target_name):
    file_list, file_settings_list = get_file_list(state, input_file_list)
    if len(file_list) == 0 or choice == None or choice < 0 or choice > len(file_list): return gr.update()
-    gr.Info(f"Selected Image was copied to {target_name}")
+    gr.Info(f"Selected Image was added to {target_name}")
    if target == None:
        target =[]
    target.append( file_list[choice])
    return target

+def image_to_ref_image_set(state, input_file_list, choice, target, target_name):
+    file_list, file_settings_list = get_file_list(state, input_file_list)
+    if len(file_list) == 0 or choice is None or choice < 0 or choice >= len(file_list): return gr.update()
+    gr.Info(f"Selected Image was copied to {target_name}")
+    return file_list[choice]

-def apply_post_processing(state, input_file_list, choice, PP_temporal_upsampling, PP_spatial_upsampling, PP_MMAudio_setting, PP_MMAudio_prompt, PP_MMAudio_neg_prompt, PP_MMAudio_seed, PP_repeat_generation):
+
+def apply_post_processing(state, input_file_list, choice, PP_temporal_upsampling, PP_spatial_upsampling, PP_film_grain_intensity, PP_film_grain_saturation, PP_MMAudio_setting, PP_MMAudio_prompt, PP_MMAudio_neg_prompt, PP_MMAudio_seed, PP_repeat_generation):
    gen = get_gen_info(state)
    file_list, file_settings_list = get_file_list(state, input_file_list)
    if len(file_list) == 0 or choice == None or choice < 0 or choice > len(file_list)  :
-        return gr.update(), gr.update()
+        return gr.update(), gr.update(), gr.update()
    
    if not file_list[choice].endswith(".mp4"):
        gr.Info("Post processing is only available with Videos")
-        return gr.update(), gr.update()
+        return gr.update(), gr.update(), gr.update()
    overrides = {
        "temporal_upsampling":PP_temporal_upsampling,
        "spatial_upsampling":PP_spatial_upsampling,
+        "film_grain_intensity": PP_film_grain_intensity, 
+        "film_grain_saturation": PP_film_grain_saturation,
        "MMAudio_setting" : PP_MMAudio_setting, 
        "MMAudio_prompt" : PP_MMAudio_prompt,
        "MMAudio_neg_prompt": PP_MMAudio_neg_prompt,
@ -5682,6 +5778,14 @@ def eject_video_from_gallery(state, input_file_list, choice):
        choice = min(choice, len(file_list))
    return gr.Gallery(value = file_list, selected_index= choice), gr.update() if len(file_list) >0 else get_default_video_info(), gr.Row(visible= len(file_list) > 0)

+def has_video_file_extension(filename):
+    extension = os.path.splitext(filename)[-1]
+    return extension in [".mp4"]
+
+def has_image_file_extension(filename):
+    extension = os.path.splitext(filename)[-1]
+    return extension in [".jpeg", ".jpg", ".png", ".bmp", ".tiff"]
+
 def add_videos_to_gallery(state, input_file_list, choice, files_to_load):
    gen = get_gen_info(state)
    if files_to_load == None:
@ -5693,10 +5797,15 @@ def add_videos_to_gallery(state, input_file_list, choice, files_to_load):
        for file_path in files_to_load:
            file_settings, _ = get_settings_from_file(state, file_path, False, False, False)
            if file_settings == None:
-                try:
-                    fps, width, height, frames_count = get_video_info(file_path)        
-                except:
                fps = 0
+                try:
+                    if has_video_file_extension(file_path):
+                        fps, width, height, frames_count = get_video_info(file_path)
+                    elif has_image_file_extension(file_path):
+                        width, height = Image.open(file_path).size
+                        fps = 1 
+                except Exception:
+                    pass
                if fps == 0:
                    invalid_files_count += 1 
                    continue
@ -5878,15 +5987,18 @@ def save_inputs(
            image_refs,
            frames_positions,
            video_guide,
+            image_guide,
            keep_frames_video_guide,
            denoising_strength,
            video_mask,
+            image_mask,
            control_net_weight,
            control_net_weight2,
            mask_expand,
            audio_guide,
            audio_guide2,
            audio_prompt_type,
+            speakers_locations,
            sliding_window_size,
            sliding_window_overlap,
            sliding_window_overlap_noise,
@ -5894,6 +6006,8 @@ def save_inputs(
            remove_background_images_ref,
            temporal_upsampling,
            spatial_upsampling,
+            film_grain_intensity,
+            film_grain_saturation,
            MMAudio_setting,
            MMAudio_prompt,
            MMAudio_neg_prompt,            
@ -6097,7 +6211,7 @@ def refresh_audio_prompt_type_remux(state, audio_prompt_type, remux):
 def refresh_audio_prompt_type_sources(state, audio_prompt_type, audio_prompt_type_sources):
    audio_prompt_type = del_in_sequence(audio_prompt_type, "XCPAB")
    audio_prompt_type = add_to_sequence(audio_prompt_type, audio_prompt_type_sources)
-    return audio_prompt_type, gr.update(visible = "A" in audio_prompt_type), gr.update(visible = "B" in audio_prompt_type)
+    return audio_prompt_type, gr.update(visible = "A" in audio_prompt_type), gr.update(visible = "B" in audio_prompt_type), gr.update(visible = ("B" in audio_prompt_type or "X" in audio_prompt_type))

 def refresh_image_prompt_type(state, image_prompt_type):
    any_video_source = len(filter_letters(image_prompt_type, "VLG"))>0
@ -6110,19 +6224,26 @@ def refresh_video_prompt_type_image_refs(state, video_prompt_type, video_prompt_
    vace= test_vace_module(state["model_type"])
    return video_prompt_type, gr.update(visible = visible),gr.update(visible = visible), gr.update(visible = visible and "F" in video_prompt_type_image_refs), gr.update(visible= ("F" in video_prompt_type_image_refs or "K" in video_prompt_type_image_refs or "V" in video_prompt_type) and vace )

-def refresh_video_prompt_type_video_mask(video_prompt_type, video_prompt_type_video_mask):
+def refresh_video_prompt_type_video_mask(state, video_prompt_type, video_prompt_type_video_mask):
    video_prompt_type = del_in_sequence(video_prompt_type, "XYZWNA")
    video_prompt_type = add_to_sequence(video_prompt_type, video_prompt_type_video_mask)
    visible= "A" in video_prompt_type     
-    return video_prompt_type, gr.update(visible= visible), gr.update(visible= visible )
+    model_type = state["model_type"]
+    model_def = get_model_def(model_type)
+    image_outputs = model_def.get("image_outputs", False)
+    return video_prompt_type, gr.update(visible= visible and not image_outputs), gr.update(visible= visible and image_outputs), gr.update(visible= visible )

 def refresh_video_prompt_type_video_guide(state, video_prompt_type, video_prompt_type_video_guide):
    video_prompt_type = del_in_sequence(video_prompt_type, "PDSLCMGUV")
    video_prompt_type = add_to_sequence(video_prompt_type, video_prompt_type_video_guide)
    visible = "V" in video_prompt_type
    mask_visible = visible and "A" in video_prompt_type and not "U" in video_prompt_type
+    model_type = state["model_type"]
+    model_def = get_model_def(model_type)
+    image_outputs = model_def.get("image_outputs", False)
+
    vace= test_vace_module(state["model_type"])
-    return video_prompt_type, gr.update(visible = visible), gr.update(visible = visible), gr.update(visible = visible and "G" in video_prompt_type), gr.update(visible= (visible or "F" in video_prompt_type) and vace), gr.update(visible= visible and not "U" in video_prompt_type), gr.update(visible= mask_visible), gr.update(visible= mask_visible)
+    return video_prompt_type,  gr.update(visible = visible and not image_outputs), gr.update(visible = visible and image_outputs), gr.update(visible = visible and not image_outputs), gr.update(visible = visible and "G" in video_prompt_type), gr.update(visible= (visible or "F" in video_prompt_type) and vace), gr.update(visible= visible and not "U" in video_prompt_type), gr.update(visible= mask_visible and not image_outputs), gr.update(visible= mask_visible and image_outputs), gr.update(visible= mask_visible)

 # def refresh_video_prompt_video_guide_trigger(state, video_prompt_type, video_prompt_type_video_guide):
 #     video_prompt_type_video_guide = video_prompt_type_video_guide.split("#")[0]
@ -6223,6 +6344,8 @@ def get_resolution_choices(current_resolution_choice):
    if resolution_choices == None:
        resolution_choices=[
            # 1080p
+            ("1920x1088 (16:9, 1080p)", "1920x1088"),
+            ("1088x1920 (9:16, 1080p)", "1088x1920"),
            ("1920x832 (21:9, 1080p)", "1920x832"),
            ("832x1920 (9:21, 1080p)", "832x1920"),
            # 720p
@ -6390,7 +6513,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                if vace:
                    image_prompt_type_value= ui_defaults.get("image_prompt_type","")
                    image_prompt_type_value = "" if image_prompt_type_value == "S" else image_prompt_type_value
-                    image_prompt_type = gr.Radio( [("New Video", ""),("Continue Video File", "V"),("Continue Last Video", "L")], value =image_prompt_type_value, label="Source Video", show_label= False, visible= True , scale= 3)
+                    image_prompt_type = gr.Radio( [("New Video", ""),("Continue Video File", "V"),("Continue Last Video", "L")], value =image_prompt_type_value, label="Source Video", show_label= False, visible= not image_outputs , scale= 3)

                    image_start = gr.Gallery(visible = False)
                    image_end  = gr.Gallery(visible = False)
@ -6480,12 +6603,13 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                video_prompt_type_value= ui_defaults.get("video_prompt_type","")
                video_prompt_type = gr.Text(value= video_prompt_type_value, visible= False)
                any_control_video = True
+                any_control_image = image_outputs 
                with gr.Row():
                    if t2v:
                        video_prompt_type_video_guide = gr.Dropdown(
                            choices=[
                                ("Use Text Prompt Only", ""),
-                                ("Video to Video guided by Text Prompt", "GUV"),
+                                ("Image to Image guided by Text Prompt" if image_outputs else "Video to Video guided by Text Prompt", "GUV"),
                           ],
                            value=filter_letters(video_prompt_type_value, "GUV"),
                            label="Video to Video", scale = 2, show_label= False, visible= True
@ -6493,8 +6617,8 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                    elif vace:
                        video_prompt_type_video_guide = gr.Dropdown(
                            choices=[
-                                ("No Control Video", ""),
-                                ("Keep Control Video Unchanged", "UV"),
+                                ("No Control Image" if image_outputs else "No Control Video", ""),
+                                ("Keep Control Image Unchanged" if image_outputs else "Keep Control Video Unchanged", "UV"),
                                ("Transfer Human Motion", "PV"),
                                ("Transfer Depth", "DV"),
                                ("Transfer Shapes", "SV"),
@ -6510,19 +6634,20 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                                ("Transfer Shapes & Flow", "SLV"),
                           ],
                            value=filter_letters(video_prompt_type_value, "PDSLCMGUV"),
-                            label="Control Video Process", scale = 2, visible= True, show_label= True,
+                            label="Control Image Process" if image_outputs else "Control Video Process", scale = 2, visible= True, show_label= True,
                        )
                    elif hunyuan_video_custom_edit:
                        video_prompt_type_video_guide = gr.Dropdown(
                            choices=[
-                                ("Inpaint Control Video", "MV"),
+                                ("Inpaint Control Image" if image_outputs else "Inpaint Control Video", "MV"),
                                ("Transfer Human Motion", "PMV"),
                            ],
                            value=filter_letters(video_prompt_type_value, "PDSLCMUV"),
-                            label="Video to Video", scale = 3, visible= True, show_label= True,
+                            label="Image to Image" if image_outputs else "Video to Video", scale = 3, visible= True, show_label= True,
                        )
                    else:
                        any_control_video = False
+                        any_control_image = False
                        video_prompt_type_video_guide = gr.Dropdown(visible= False)

                    # video_prompt_video_guide_trigger = gr.Text(visible=False, value="")
@ -6578,16 +6703,17 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                            visible = False,
                            label="Start / Reference Images", scale = 2
                        )
+                image_guide = gr.Image(label= "Control Image", type ="pil", visible= image_outputs and "V" in video_prompt_type_value, value= ui_defaults.get("image_guide", None))
+                video_guide = gr.Video(label= "Control Video", visible= (not image_outputs) and "V" in video_prompt_type_value, value= ui_defaults.get("video_guide", None))

-                video_guide = gr.Video(label= "Control Video", visible= "V" in video_prompt_type_value, value= ui_defaults.get("video_guide", None),)
                denoising_strength = gr.Slider(0, 1, value= ui_defaults.get("denoising_strength" ,0.5), step=0.01, label="Denoising Strength (the Lower the Closer to the Control Video)", visible = "G" in video_prompt_type_value, show_reset_button= False)
-                keep_frames_video_guide = gr.Text(value=ui_defaults.get("keep_frames_video_guide","") , visible= "V" in video_prompt_type_value, scale = 2, label= "Frames to keep in Control Video (empty=All, 1=first, a:b for a range, space to separate values)" ) #, -1=last
+                keep_frames_video_guide = gr.Text(value=ui_defaults.get("keep_frames_video_guide","") , visible= (not image_outputs) and  "V" in video_prompt_type_value, scale = 2, label= "Frames to keep in Control Video (empty=All, 1=first, a:b for a range, space to separate values)" ) #, -1=last

                with gr.Column(visible= ("V" in video_prompt_type_value  or "K" in video_prompt_type_value  or "F" in video_prompt_type_value) and vace) as video_guide_outpainting_col:
                    video_guide_outpainting_value = ui_defaults.get("video_guide_outpainting","#")
                    video_guide_outpainting = gr.Text(value=video_guide_outpainting_value , visible= False)
                    with gr.Group():
-                        video_guide_outpainting_checkbox = gr.Checkbox(label="Enable Spatial Outpainting on Control Video, Background or Injected Reference Frames", value=len(video_guide_outpainting_value)>0 and not video_guide_outpainting_value.startswith("#") )
+                        video_guide_outpainting_checkbox = gr.Checkbox(label="Enable Spatial Outpainting on Control Video, Landscape or Injected Reference Frames", value=len(video_guide_outpainting_value)>0 and not video_guide_outpainting_value.startswith("#") )
                        with gr.Row(visible = not video_guide_outpainting_value.startswith("#")) as video_guide_outpainting_row:
                            video_guide_outpainting_value = video_guide_outpainting_value[1:] if video_guide_outpainting_value.startswith("#") else video_guide_outpainting_value
                            video_guide_outpainting_list = [0] * 4 if len(video_guide_outpainting_value) == 0 else [int(v) for v in video_guide_outpainting_value.split(" ")]
@ -6595,8 +6721,9 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                            video_guide_outpainting_bottom = gr.Slider(0, 100, value= video_guide_outpainting_list[1], step=5, label="Bottom %", show_reset_button= False)
                            video_guide_outpainting_left = gr.Slider(0, 100, value= video_guide_outpainting_list[2], step=5, label="Left %", show_reset_button= False)
                            video_guide_outpainting_right = gr.Slider(0, 100, value= video_guide_outpainting_list[3], step=5, label="Right %", show_reset_button= False)
-
-                video_mask = gr.Video(label= "Video Mask Area (for Inpainting, white = Control Area, black = Unchanged)", visible= "V" in video_prompt_type_value and "A" in video_prompt_type_value and not "U" in video_prompt_type_value , value= ui_defaults.get("video_mask", None)) 
+                any_image_mask = image_outputs and vace
+                image_mask = gr.Image(label= "Image Mask Area (for Inpainting, white = Control Area, black = Unchanged)", type ="pil", visible= image_outputs and "V" in video_prompt_type_value and "A" in video_prompt_type_value and not "U" in video_prompt_type_value , value= ui_defaults.get("image_mask", None)) 
+                video_mask = gr.Video(label= "Video Mask Area (for Inpainting, white = Control Area, black = Unchanged)", visible= (not image_outputs) and "V" in video_prompt_type_value and "A" in video_prompt_type_value and not "U" in video_prompt_type_value , value= ui_defaults.get("video_mask", None)) 

                mask_expand = gr.Slider(-10, 50, value=ui_defaults.get("mask_expand", 0), step=1, label="Expand / Shrink Mask Area", visible= "V" in video_prompt_type_value and "A" in video_prompt_type_value and not "U" in video_prompt_type_value )
                any_reference_image = vace or phantom or hunyuan_video_custom or hunyuan_video_avatar
@ -6630,7 +6757,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                        ("Two speakers, Speakers Audio sources are assumed to be played in Parallel", "PAB"),
                    ],
                    value= filter_letters(audio_prompt_type_value, "XCPAB"),
-                    label="Voices: if there are multiple People the first is assumed to be to the Left and the second one to the Right", scale = 3, visible = multitalk 
+                    label="Voices", scale = 3, visible = multitalk 
                )
            else:
                audio_prompt_type_sources = gr.Dropdown( choices= [""], value = "", visible=False)
@ -6638,6 +6765,8 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
            with gr.Row(visible = any_audio_voices_support) as audio_guide_row:
                audio_guide = gr.Audio(value= ui_defaults.get("audio_guide", None), type="filepath", label="Voice to follow", show_download_button= True, visible= any_audio_voices_support and "A" in audio_prompt_type_value )
                audio_guide2 = gr.Audio(value= ui_defaults.get("audio_guide2", None), type="filepath", label="Voice to follow #2", show_download_button= True, visible= any_audio_voices_support and "B" in audio_prompt_type_value )
+            with gr.Row(visible = any_audio_voices_support and ("B" in audio_prompt_type_value or "X" in audio_prompt_type_value) ) as speakers_locations_row:
+                speakers_locations = gr.Text( ui_defaults.get("speakers_locations", "0:45 55:100"), label="Speakers Locations separated by a Space. Each Location = Left:Right or a BBox Left:Top:Right:Bottom", visible= True)

            advanced_prompt = advanced_ui
            prompt_vars=[]
@ -6694,7 +6823,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                )
            with gr.Row():
                if image_outputs:
-                    video_length = gr.Slider(1, 16, value=ui_defaults.get("video_length", 1), step=1, label="Number of Images to Generate", visible = flux_dev_kontext)
+                    video_length = gr.Slider(1, 16, value=ui_defaults.get("video_length", 1), step=1, label="Number of Images to Generate", visible = True)
                elif recammaster:
                    video_length = gr.Slider(5, 193, value=ui_defaults.get("video_length", 81), step=4, label="Number of frames (16 = 1s), locked", interactive= False, visible = True)
                else:
@ -6702,7 +6831,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non

                    video_length = gr.Slider(min_frames, 737 if test_any_sliding_window(base_model_type) else 337, value=ui_defaults.get(
                        "video_length", 81 if get_model_family(base_model_type)=="wan" else 97), 
-                         step=frames_step, label=f"Number of frames ({fps} = 1s)", interactive= True)
+                         step=frames_step, label=f"Number of frames ({fps} = 1s)", visible = True, interactive= True)

            with gr.Row(visible = not lock_inference_steps) as inference_steps_row:                                       
                num_inference_steps = gr.Slider(1, 100, value=ui_defaults.get("num_inference_steps",30), step=1, label="Number of Inference Steps", visible = True)
@ -6790,12 +6919,12 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                        )
                        skip_steps_start_step_perc = gr.Slider(0, 100, value=ui_defaults.get("skip_steps_start_step_perc",0), step=1, label="Skip Steps starting moment in % of generation") 

-                with gr.Tab("Upsampling"):
+                with gr.Tab("Post Processing"):
                    

                    with gr.Column():
                        gr.Markdown("<B>Upsampling - postprocessing that may improve fluidity and the size of the video</B>")
-                        def gen_upsampling_dropdowns(temporal_upsampling, spatial_upsampling , element_class= None, max_height= None):
+                        def gen_upsampling_dropdowns(temporal_upsampling, spatial_upsampling , film_grain_intensity, film_grain_saturation, element_class= None, max_height= None):
                            temporal_upsampling = gr.Dropdown(
                                choices=[
                                    ("Disabled", ""),
@ -6803,7 +6932,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                                    ("Rife x4 frames/s", "rife4"), 
                                ],
                                value=temporal_upsampling,
-                                visible=not image_outputs,
+                                visible=True,
                                scale = 1,
                                label="Temporal Upsampling",
                                elem_classes= element_class
@ -6822,8 +6951,13 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                                elem_classes= element_class
                                # max_height = max_height
                            )
-                            return temporal_upsampling, spatial_upsampling
-                        temporal_upsampling, spatial_upsampling = gen_upsampling_dropdowns(ui_defaults.get("temporal_upsampling", ""), ui_defaults.get("spatial_upsampling", ""))
+
+                            with gr.Row():
+                                film_grain_intensity = gr.Slider(0, 1, value=film_grain_intensity, step=0.01, label="Film Grain Intensity (0 = disabled)") 
+                                film_grain_saturation = gr.Slider(0.0, 1, value=film_grain_saturation, step=0.01, label="Film Grain Saturation") 
+
+                            return temporal_upsampling, spatial_upsampling, film_grain_intensity, film_grain_saturation
+                        temporal_upsampling, spatial_upsampling, film_grain_intensity, film_grain_saturation = gen_upsampling_dropdowns(ui_defaults.get("temporal_upsampling", ""), ui_defaults.get("spatial_upsampling", ""), ui_defaults.get("film_grain_intensity", 0), ui_defaults.get("film_grain_saturation", 0.5))

                with gr.Tab("MMAudio", visible = server_config.get("mmaudio_enabled", 0) != 0 and not any_audio_track(base_model_type) and not image_outputs) as mmaudio_tab:
                    with gr.Column():
@ -7024,18 +7158,21 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                        with gr.Row(**default_visibility) as image_buttons_row:
                            video_info_to_start_image_btn = gr.Button("To Start Image", size ="sm", visible = any_start_image )
                            video_info_to_end_image_btn = gr.Button("To End Image", size ="sm", visible = any_end_image)
+                            video_info_to_image_guide_btn = gr.Button("To Control Image", size ="sm", visible = any_control_image )
+                            video_info_to_image_mask_btn = gr.Button("To Mask Image", size ="sm", visible = any_image_mask)
                            video_info_to_reference_image_btn = gr.Button("To Reference Image", size ="sm", visible = any_reference_image)
                            video_info_eject_image_btn = gr.Button("Eject Image", size ="sm")
                    with gr.Tab("Post Processing", id= "post_processing", visible = True) as video_postprocessing_tab:
                        with gr.Group(elem_classes= "postprocess"):
                            with gr.Column():
-                                PP_temporal_upsampling, PP_spatial_upsampling = gen_upsampling_dropdowns("",  "", element_class ="postprocess")
-                            with gr.Column() as PP_MMAudio_col:
+                                PP_temporal_upsampling, PP_spatial_upsampling, PP_film_grain_intensity, PP_film_grain_saturation = gen_upsampling_dropdowns("",  "", 0, 0.5, element_class ="postprocess")
+                            with gr.Column(visible = server_config.get("mmaudio_enabled", 0) != 0) as PP_MMAudio_col:
                                PP_MMAudio_setting, PP_MMAudio_prompt, PP_MMAudio_neg_prompt, _ =  gen_mmaudio_dropdowns(  0, "" , "", None, element_class ="postprocess" )
                                PP_MMAudio_seed = gr.Slider(-1, 999999999, value=-1, step=1, label="Seed (-1 for random)") 
                                PP_repeat_generation = gr.Slider(1, 25.0, value=1, step=1, label="Number of Sample Videos to Generate") 
-
+                        with gr.Row():
                            video_info_postprocessing_btn = gr.Button("Apply Postprocessing", size ="sm", visible=True)
+                            video_info_eject_video2_btn = gr.Button("Eject Video", size ="sm", visible=True)
                    with gr.Tab("Add Videos / Images", id= "video_add"):
                        files_to_load = gr.Files(label= "Files to Load in Gallery", height=120)
                        with gr.Row():
@ -7092,8 +7229,8 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                                      video_prompt_type_video_guide, video_prompt_type_video_mask, video_prompt_type_image_refs, apg_col, audio_prompt_type_sources, audio_prompt_type_remux_row,
                                      video_guide_outpainting_col,video_guide_outpainting_top, video_guide_outpainting_bottom, video_guide_outpainting_left, video_guide_outpainting_right,
                                      video_guide_outpainting_checkbox, video_guide_outpainting_row, show_advanced, video_info_to_control_video_btn, video_info_to_video_source_btn, sample_solver_row,
-                                      video_buttons_row, image_buttons_row, video_postprocessing_tab, video_info_to_start_image_btn, video_info_to_end_image_btn, video_info_to_reference_image_btn,
-                                      NAG_col] #  presets_column,
+                                      video_buttons_row, image_buttons_row, video_postprocessing_tab, video_info_to_start_image_btn, video_info_to_end_image_btn, video_info_to_reference_image_btn, video_info_to_image_guide_btn, video_info_to_image_mask_btn,
+                                      NAG_col, speakers_locations_row] #  presets_column,
        if update_form:
            locals_dict = locals()
            gen_inputs = [state_dict if k=="state" else locals_dict[k]  for k in inputs_names] + [state_dict] + extra_inputs
@ -7104,12 +7241,12 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
            last_choice = gr.Number(value =-1, interactive= False, visible= False)

            audio_prompt_type_remux.change(fn=refresh_audio_prompt_type_remux, inputs=[state, audio_prompt_type, audio_prompt_type_remux], outputs=[audio_prompt_type])
-            audio_prompt_type_sources.change(fn=refresh_audio_prompt_type_sources, inputs=[state, audio_prompt_type, audio_prompt_type_sources], outputs=[audio_prompt_type, audio_guide, audio_guide2])
+            audio_prompt_type_sources.change(fn=refresh_audio_prompt_type_sources, inputs=[state, audio_prompt_type, audio_prompt_type_sources], outputs=[audio_prompt_type, audio_guide, audio_guide2, speakers_locations_row])
            image_prompt_type.change(fn=refresh_image_prompt_type, inputs=[state, image_prompt_type], outputs=[image_start, image_end, video_source, keep_frames_video_source] ) 
            # video_prompt_video_guide_trigger.change(fn=refresh_video_prompt_video_guide_trigger, inputs=[state, video_prompt_type, video_prompt_video_guide_trigger], outputs=[video_prompt_type, video_prompt_type_video_guide, video_guide, keep_frames_video_guide, denoising_strength, video_guide_outpainting_col, video_prompt_type_video_mask, video_mask, mask_expand])
            video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [state, video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_images_ref, frames_positions, video_guide_outpainting_col])
-            video_prompt_type_video_guide.input(fn=refresh_video_prompt_type_video_guide, inputs = [state, video_prompt_type, video_prompt_type_video_guide], outputs = [video_prompt_type, video_guide, keep_frames_video_guide, denoising_strength, video_guide_outpainting_col, video_prompt_type_video_mask, video_mask, mask_expand])
-            video_prompt_type_video_mask.input(fn=refresh_video_prompt_type_video_mask, inputs = [video_prompt_type, video_prompt_type_video_mask], outputs = [video_prompt_type, video_mask, mask_expand])
+            video_prompt_type_video_guide.input(fn=refresh_video_prompt_type_video_guide, inputs = [state, video_prompt_type, video_prompt_type_video_guide], outputs = [video_prompt_type, video_guide, image_guide, keep_frames_video_guide, denoising_strength, video_guide_outpainting_col, video_prompt_type_video_mask, video_mask, image_mask, mask_expand])
+            video_prompt_type_video_mask.input(fn=refresh_video_prompt_type_video_mask, inputs = [state, video_prompt_type, video_prompt_type_video_mask], outputs = [video_prompt_type, video_mask, image_mask, mask_expand])
            multi_prompts_gen_type.select(fn=refresh_prompt_labels, inputs=multi_prompts_gen_type, outputs=[prompt, wizard_prompt])
            video_guide_outpainting_top.input(fn=update_video_guide_outpainting, inputs=[video_guide_outpainting, video_guide_outpainting_top, gr.State(0)], outputs = [video_guide_outpainting], trigger_mode="multiple" )
            video_guide_outpainting_bottom.input(fn=update_video_guide_outpainting, inputs=[video_guide_outpainting, video_guide_outpainting_bottom,gr.State(1)], outputs = [video_guide_outpainting], trigger_mode="multiple" )
@ -7119,7 +7256,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
            show_advanced.change(fn=switch_advanced, inputs=[state, show_advanced, lset_name], outputs=[advanced_row, preset_buttons_rows, refresh_lora_btn, refresh2_row ,lset_name]).then(
                fn=switch_prompt_type, inputs = [state, wizard_prompt_activated_var, wizard_variables_var, prompt, wizard_prompt, *prompt_vars], outputs = [wizard_prompt_activated_var, wizard_variables_var, prompt, wizard_prompt, prompt_column_advanced, prompt_column_wizard, prompt_column_wizard_vars, *prompt_vars])
            queue_df.select( fn=handle_celll_selection, inputs=state, outputs=[queue_df, modal_image_display, modal_container])
-            output.select(select_video, [state, output], outputs=[last_choice, video_info, video_buttons_row, image_buttons_row, video_postprocessing_tab] )
+            output.select(select_video, [state, output], outputs=[last_choice, video_info, video_buttons_row, image_buttons_row, video_postprocessing_tab], trigger_mode="multiple")
            preview_trigger.change(refresh_preview, inputs= [state], outputs= [preview])

            def refresh_status_async(state, progress=gr.Progress()):
@ -7175,13 +7312,15 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
            ).then( fn=use_video_settings, inputs =[state, output, last_choice] , outputs= [model_choice, refresh_form_trigger])

            video_info_add_videos_btn.click(fn=add_videos_to_gallery, inputs =[state, output, last_choice, files_to_load], outputs = [output, files_to_load, video_info_tabs] )
-            gr.on(triggers=[video_info_eject_video_btn.click, video_info_eject_image_btn.click], fn=eject_video_from_gallery, inputs =[state, output, last_choice], outputs = [output, video_info, video_buttons_row] )
+            gr.on(triggers=[video_info_eject_video_btn.click, video_info_eject_video2_btn.click, video_info_eject_image_btn.click], fn=eject_video_from_gallery, inputs =[state, output, last_choice], outputs = [output, video_info, video_buttons_row] )
            video_info_to_control_video_btn.click(fn=video_to_control_video, inputs =[state, output, last_choice], outputs = [video_guide] )
            video_info_to_video_source_btn.click(fn=video_to_source_video, inputs =[state, output, last_choice], outputs = [video_source] )
-            video_info_to_start_image_btn.click(fn=image_to_ref_image, inputs =[state, output, last_choice, image_start, gr.State("Start Image")], outputs = [image_start] )
-            video_info_to_end_image_btn.click(fn=image_to_ref_image, inputs =[state, output, last_choice, image_end, gr.State("End Image")], outputs = [image_end] )
-            video_info_to_reference_image_btn.click(fn=image_to_ref_image, inputs =[state, output, last_choice, image_refs, gr.State("Ref Image")],  outputs = [image_refs] )
-            video_info_postprocessing_btn.click(fn=apply_post_processing, inputs =[state, output, last_choice, PP_temporal_upsampling, PP_spatial_upsampling, PP_MMAudio_setting, PP_MMAudio_prompt, PP_MMAudio_neg_prompt, PP_MMAudio_seed, PP_repeat_generation], outputs = [mode, generate_trigger, add_to_queue_trigger ] )
+            video_info_to_start_image_btn.click(fn=image_to_ref_image_add, inputs =[state, output, last_choice, image_start, gr.State("Start Image")], outputs = [image_start] )
+            video_info_to_end_image_btn.click(fn=image_to_ref_image_add, inputs =[state, output, last_choice, image_end, gr.State("End Image")], outputs = [image_end] )
+            video_info_to_image_guide_btn.click(fn=image_to_ref_image_set, inputs =[state, output, last_choice, image_guide, gr.State("Control Image")], outputs = [image_guide] )
+            video_info_to_image_mask_btn.click(fn=image_to_ref_image_set, inputs =[state, output, last_choice, image_mask, gr.State("Image Mask")], outputs = [image_mask] )
+            video_info_to_reference_image_btn.click(fn=image_to_ref_image_add, inputs =[state, output, last_choice, image_refs, gr.State("Ref Image")],  outputs = [image_refs] )
+            video_info_postprocessing_btn.click(fn=apply_post_processing, inputs =[state, output, last_choice, PP_temporal_upsampling, PP_spatial_upsampling, PP_film_grain_intensity, PP_film_grain_saturation, PP_MMAudio_setting, PP_MMAudio_prompt, PP_MMAudio_neg_prompt, PP_MMAudio_seed, PP_repeat_generation], outputs = [mode, generate_trigger, add_to_queue_trigger ] )
            save_lset_btn.click(validate_save_lset, inputs=[state, lset_name], outputs=[apply_lset_btn, refresh_lora_btn, delete_lset_btn, save_lset_btn,confirm_save_lset_btn, cancel_lset_btn, save_lset_prompt_drop])
            delete_lset_btn.click(validate_delete_lset, inputs=[state, lset_name], outputs=[apply_lset_btn, refresh_lora_btn, delete_lset_btn, save_lset_btn,confirm_delete_lset_btn, cancel_lset_btn ])
            confirm_save_lset_btn.click(fn=validate_wizard_prompt, inputs =[state, wizard_prompt_activated_var, wizard_variables_var, prompt, wizard_prompt, *prompt_vars] , outputs= [prompt]).then(
@@ -7405,7 +7544,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
            )

    return ( state, loras_choices, lset_name, state,
-             video_guide, video_mask, image_refs, prompt_enhancer_row, mmaudio_tab, PP_MMAudio_col  
+             video_guide, image_guide, video_mask, image_mask, image_refs, prompt_enhancer_row, mmaudio_tab, PP_MMAudio_col  
            ) 
 

@@ -8220,12 +8359,12 @@ def create_ui():
                    header = gr.Markdown(generate_header(transformer_type, compile, attention_mode), visible= True)
                with gr.Row():
                    (   state, loras_choices, lset_name, state,
-                        video_guide, video_mask, image_refs, prompt_enhancer_row, mmaudio_tab, PP_MMAudio_col
+                        video_guide, image_guide, video_mask, image_mask, image_refs, prompt_enhancer_row, mmaudio_tab, PP_MMAudio_col
                    ) = generate_video_tab(model_choice=model_choice, header=header, main = main)
            with gr.Tab("Guides", id="info") as info_tab:
                generate_info_tab()
            with gr.Tab("Video Mask Creator", id="video_mask_creator") as video_mask_creator:
-                matanyone_app.display(main_tabs, tab_state, model_choice, video_guide, video_mask, image_refs)
+                matanyone_app.display(main_tabs, tab_state, model_choice, video_guide, image_guide, video_mask, image_mask, image_refs)
            if not args.lock_config:
                with gr.Tab("Downloads", id="downloads") as downloads_tab:
                    generate_download_tab(lset_name, loras_choices, state)