Merge branch 'main' into queue_editor

This commit is contained in:
Chris Malone 2025-09-12 14:44:12 +10:00 committed by GitHub
commit e69a406808
30 changed files with 1316 additions and 631 deletions

View File

@ -20,6 +20,31 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTX Video models
**Follow DeepBeepMeep on Twitter/X to get the Latest News**: https://x.com/deepbeepmeep
## 🔥 Latest Updates :
### September 11 2025: WanGP v8.5/8.55 - Wanna be a Cropper or a Painter ?
I have done some intensive internal refactoring of the generation pipeline to make it easier to support existing models and to add new ones. Nothing really visible, but this makes WanGP a little more future proof.
Otherwise in the news:
- **Cropped Input Image Prompts**: quite often the *Image Prompts* you provide (*Start Image, Input Video, Reference Image, Control Video, ...*) do not match your requested *Output Resolution*. Until now I used the resolution you gave either as a *Pixels Budget* or as an *Outer Canvas* for the Generated Video. However, on some occasions you really want the requested Output Resolution and nothing else. Besides, some models deliver much better Generations if you stick to one of their supported resolutions. To address this need I have added a new Output Resolution choice in the *Configuration Tab*: **Dimensions Correspond to the Output Width & Height as the Prompt Images will be Cropped to fit Exactly these Dimensions**. In short, if needed the *Input Prompt Images* will be cropped (center cropped for the moment). You will see this can make quite a difference for some models (a minimal center-crop sketch is shown right after this list).
- *Qwen Edit* now has a new sub Tab called **Inpainting** that lets you target with a brush which part of the *Image Prompt* you want to modify. This is quite convenient if you find that Qwen Edit usually modifies too many things. Of course, as there are more constraints for Qwen Edit, don't be surprised if it sometimes returns the original image unchanged. A piece of advice: describe in your *Text Prompt* where the parts you want to modify are located (for instance *to the left of the man*, *at the top*, ...).
The mask inpainting is fully compatible with the *Matanyone Mask generator*: first generate an *Image Mask* with Matanyone, transfer it to the current Image Generator and modify the mask with the *Paint Brush*. Speaking of Matanyone, I have fixed a bug that caused mask degradation with long videos (WanGP Matanyone is now as good as the original app and still requires 3 times less VRAM). A small sketch of the masked latent blending behind these inpainting modes appears at the end of these update notes.
- This **Inpainting Mask Editor** has also been added to *Vace Image Mode*. Vace is probably still one of the best Image Editors today. Here is a very simple & efficient workflow that does marvels with Vace:
Select *Vace Cocktail > Control Image Process = Perform Inpainting & Area Processed = Masked Area > Upload a Control Image, then draw your mask directly on top of the image & enter a text Prompt that describes the expected change > Generate > Below the Video Gallery click 'To Control Image' > Keep on doing more changes*.
For more sophisticated things the Vace Image Editor works very well too: try Image Outpainting, Pose Transfer, ...
For the best quality I recommend setting the option "*Generate a 9 Frames Long video...*" in the *Quality Tab*.
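As promised above, here is a minimal sketch of the kind of center crop applied to *Input Prompt Images* when the new Output Resolution mode is selected. It is illustrative only: the helper name and the exact scaling rule are assumptions, not WanGP's actual code.
```python
from PIL import Image

def center_crop_to_resolution(img: Image.Image, out_width: int, out_height: int) -> Image.Image:
    """Scale the prompt image so it covers the requested output resolution,
    then center crop the excess (illustrative sketch only)."""
    w, h = img.size
    # Scale so both dimensions cover the target while preserving aspect ratio.
    scale = max(out_width / w, out_height / h)
    new_w, new_h = round(w * scale), round(h * scale)
    img = img.resize((new_w, new_h), resample=Image.Resampling.LANCZOS)
    # Crop the centered out_width x out_height window.
    left = (new_w - out_width) // 2
    top = (new_h - out_height) // 2
    return img.crop((left, top, left + out_width, top + out_height))
```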
**update 8.55**: Flux Festival
- **Inpainting Mode** also added for *Flux Kontext*
- **Flux SRPO**: a new finetune with 3x better quality vs Flux Dev according to its authors. I have also created a *Flux SRPO USO* finetune, which is certainly the best open source *Style Transfer* tool available
- **Flux UMO**: model specialized in combining multiple reference objects / people together. Works quite well at 768x768
Good luck finding your way through all the Flux model names!
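For the curious, the new inpainting modes for Flux Kontext and Qwen Edit both boil down to the same idea: after each denoising step, the latents outside the mask are replaced by a re-noised copy of the original image, so only the masked area is actually repainted. A minimal sketch of that update rule, assuming flow-matching noise levels in [0, 1] (names are illustrative, not the exact WanGP code):
```python
import torch

def masked_inpaint_step(latents, original_latents, noise, mask, t_next):
    """Blend applied after each denoising update (illustrative sketch).

    latents:          freshly denoised latents
    original_latents: clean latents of the input image
    noise:            a fixed noise sample with the same shape
    mask:             1 where the image should be repainted, 0 elsewhere
    t_next:           the next noise level, in [0, 1]
    """
    # Re-noise the original image up to the next step's noise level...
    noisy_original = original_latents * (1.0 - t_next) + noise * t_next
    # ...and keep the denoised content only inside the mask.
    return noisy_original * (1 - mask) + mask * latents
```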
### September 5 2025: WanGP v8.4 - Take me to Outer Space
You have probably seen these short AI generated movies created using *Nano Banana* and the *First Frame - Last Frame* feature of *Kling 2.0*. The idea is to generate an image, modify a part of it with Nano Banana, and give these two images to Kling, which will generate the Video between them. Now use the previous Last Frame as the new First Frame, rinse and repeat, and you get a full movie.

View File

@ -7,8 +7,6 @@
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1_kontext_dev_bf16.safetensors",
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1_kontext_dev_quanto_bf16_int8.safetensors"
],
"image_outputs": true,
"reference_image": true,
"flux-model": "flux-dev-kontext"
},
"prompt": "add a hat",

View File

@ -0,0 +1,24 @@
{
"model": {
"name": "Flux 1 Dev UMO 12B",
"architecture": "flux",
"description": "FLUX.1 Dev UMO is a model that can Edit Images with a specialization in combining multiple image references (resized internally at 512x512 max) to produce an Image output. Best Image preservation at 768x768 Resolution Output.",
"URLs": "flux",
"flux-model": "flux-dev-umo",
"loras": ["https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-dev-UMO_dit_lora_bf16.safetensors"],
"resolutions": [ ["1024x1024 (1:1)", "1024x1024"],
["768x1024 (3:4)", "768x1024"],
["1024x768 (4:3)", "1024x768"],
["512x1024 (1:2)", "512x1024"],
["1024x512 (2:1)", "1024x512"],
["768x768 (1:1)", "768x768"],
["768x512 (3:2)", "768x512"],
["512x768 (2:3)", "512x768"]]
},
"prompt": "the man is wearing a hat",
"embedded_guidance_scale": 4,
"resolution": "768x768",
"batch_size": 1
}

View File

@ -2,12 +2,10 @@
"model": {
"name": "Flux 1 Dev USO 12B",
"architecture": "flux",
"description": "FLUX.1 Dev USO is a model specialized to Edit Images with a specialization in Style Transfers (up to two).",
"description": "FLUX.1 Dev USO is a model that can Edit Images with a specialization in Style Transfers (up to two).",
"modules": [ ["https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-dev-USO_projector_bf16.safetensors"]],
"URLs": "flux",
"loras": ["https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-dev-USO_dit_lora_bf16.safetensors"],
"image_outputs": true,
"reference_image": true,
"flux-model": "flux-dev-uso"
},
"prompt": "the man is wearing a hat",

defaults/flux_srpo.json Normal file
View File

@ -0,0 +1,15 @@
{
"model": {
"name": "Flux 1 SRPO Dev 12B",
"architecture": "flux",
"description": "By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, SRPO improves its human-evaluated realism and aesthetic quality by over 3x.",
"URLs": [
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-srpo-dev_bf16.safetensors",
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-srpo-dev_quanto_bf16_int8.safetensors"
],
"flux-model": "flux-dev"
},
"prompt": "draw a hat",
"resolution": "1024x1024",
"batch_size": 1
}

View File

@ -0,0 +1,17 @@
{
"model": {
"name": "Flux 1 SRPO USO 12B",
"architecture": "flux",
"description": "FLUX.1 SRPO USO is a model that can Edit Images with a specialization in Style Transfers (up to two). It leverages the improved Image quality brought by the SRPO process",
"modules": [ "flux_dev_uso"],
"URLs": "flux_srpo",
"loras": "flux_dev_uso",
"flux-model": "flux-dev-uso"
},
"prompt": "the man is wearing a hat",
"embedded_guidance_scale": 4,
"resolution": "1024x1024",
"batch_size": 1
}

View File

@ -9,9 +9,7 @@
],
"attention": {
"<89": "sdpa"
},
"reference_image": true,
"image_outputs": true
}
},
"prompt": "add a hat",
"resolution": "1280x720",

View File

@ -4,7 +4,7 @@
"name": "Wan2.1 Standin 14B",
"modules": [ ["https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/Stand-In_wan2.1_T2V_14B_ver1.0_bf16.safetensors"]],
"architecture" : "standin",
"description": "The original Wan Text 2 Video model combined with the StandIn module to improve Identity Preservation. You need to provide a Reference Image with white background which is a close up of person face to transfer this person in the Video.",
"description": "The original Wan Text 2 Video model combined with the StandIn module to improve Identity Preservation. You need to provide a Reference Image with white background which is a close up of a person face to transfer this person in the Video.",
"URLs": "t2v"
}
}

View File

@ -13,28 +13,52 @@ class family_handler():
flux_schnell = flux_model == "flux-schnell"
flux_chroma = flux_model == "flux-chroma"
flux_uso = flux_model == "flux-dev-uso"
model_def_output = {
flux_umo = flux_model == "flux-dev-umo"
flux_kontext = flux_model == "flux-dev-kontext"
extra_model_def = {
"image_outputs" : True,
"no_negative_prompt" : not flux_chroma,
}
if flux_chroma:
model_def_output["guidance_max_phases"] = 1
extra_model_def["guidance_max_phases"] = 1
elif not flux_schnell:
model_def_output["embedded_guidance"] = True
extra_model_def["embedded_guidance"] = True
if flux_uso :
model_def_output["any_image_refs_relative_size"] = True
model_def_output["no_background_removal"] = True
model_def_output["image_ref_choices"] = {
extra_model_def["any_image_refs_relative_size"] = True
extra_model_def["no_background_removal"] = True
extra_model_def["image_ref_choices"] = {
"choices":[("No Reference Image", ""),("First Image is a Reference Image, and then the next ones (up to two) are Style Images", "KI"),
("Up to two Images are Style Images", "KIJ")],
"default": "KI",
"letters_filter": "KIJ",
"label": "Reference Images / Style Images"
}
model_def_output["lock_image_refs_ratios"] = True
return model_def_output
if flux_kontext:
extra_model_def["inpaint_support"] = True
extra_model_def["image_ref_choices"] = {
"choices": [
("None", ""),
("Conditional Images is first Main Subject / Landscape and may be followed by People / Objects", "KI"),
("Conditional Images are People / Objects", "I"),
],
"letters_filter": "KI",
}
extra_model_def["background_removal_label"]= "Remove Backgrounds only behind People / Objects except main Subject / Landscape"
elif flux_umo:
extra_model_def["image_ref_choices"] = {
"choices": [
("Conditional Images are People / Objects", "I"),
],
"letters_filter": "I",
"visible": False
}
extra_model_def["lock_image_refs_ratios"] = True
return extra_model_def
@staticmethod
def query_supported_types():
@ -118,15 +142,28 @@ class family_handler():
video_prompt_type = video_prompt_type.replace("I", "KI")
ui_defaults["video_prompt_type"] = video_prompt_type
if settings_version < 2.34:
ui_defaults["denoising_strength"] = 1.
@staticmethod
def update_default_settings(base_model_type, model_def, ui_defaults):
flux_model = model_def.get("flux-model", "flux-dev")
flux_uso = flux_model == "flux-dev-uso"
flux_umo = flux_model == "flux-dev-umo"
flux_kontext = flux_model == "flux-dev-kontext"
ui_defaults.update({
"embedded_guidance": 2.5,
})
if model_def.get("reference_image", False):
if flux_kontext or flux_uso:
ui_defaults.update({
"video_prompt_type": "KI",
"denoising_strength": 1.,
})
elif flux_umo:
ui_defaults.update({
"video_prompt_type": "I",
"remove_background_images_ref": 0,
})

View File

@ -23,44 +23,35 @@ from .util import (
)
from PIL import Image
def preprocess_ref(raw_image: Image.Image, long_size: int = 512):
# Get the width and height of the original image
image_w, image_h = raw_image.size
def resize_and_centercrop_image(image, target_height_ref1, target_width_ref1):
target_height_ref1 = int(target_height_ref1 // 64 * 64)
target_width_ref1 = int(target_width_ref1 // 64 * 64)
h, w = image.shape[-2:]
if h < target_height_ref1 or w < target_width_ref1:
# Compute the aspect ratio
aspect_ratio = w / h
if h < target_height_ref1:
new_h = target_height_ref1
new_w = new_h * aspect_ratio
if new_w < target_width_ref1:
new_w = target_width_ref1
new_h = new_w / aspect_ratio
else:
new_w = target_width_ref1
new_h = new_w / aspect_ratio
if new_h < target_height_ref1:
new_h = target_height_ref1
new_w = new_h * aspect_ratio
# Determine the long and short sides
if image_w >= image_h:
new_w = long_size
new_h = int((long_size / image_w) * image_h)
else:
aspect_ratio = w / h
tgt_aspect_ratio = target_width_ref1 / target_height_ref1
if aspect_ratio > tgt_aspect_ratio:
new_h = target_height_ref1
new_w = new_h * aspect_ratio
else:
new_w = target_width_ref1
new_h = new_w / aspect_ratio
# Resize the image with TVF.resize
image = TVF.resize(image, (math.ceil(new_h), math.ceil(new_w)))
# Compute the center-crop parameters
top = (image.shape[-2] - target_height_ref1) // 2
left = (image.shape[-1] - target_width_ref1) // 2
# Center crop with TVF.crop
image = TVF.crop(image, top, left, target_height_ref1, target_width_ref1)
return image
new_h = long_size
new_w = int((long_size / image_h) * image_w)
# Resize proportionally to the new width and height
raw_image = raw_image.resize((new_w, new_h), resample=Image.LANCZOS)
target_w = new_w // 16 * 16
target_h = new_h // 16 * 16
# Compute the crop origin for center cropping
left = (new_w - target_w) // 2
top = (new_h - target_h) // 2
right = left + target_w
bottom = top + target_h
# Perform the center crop
raw_image = raw_image.crop((left, top, right, bottom))
# Convert to RGB mode
raw_image = raw_image.convert("RGB")
return raw_image
def stitch_images(img1, img2):
# Resize img2 to match img1's height
@ -105,7 +96,7 @@ class model_factory:
# self.name= "flux-schnell"
source = model_def.get("source", None)
self.model = load_flow_model(self.name, model_filename[0] if source is None else source, torch_device)
self.model_def = model_def
self.vae = load_ae(self.name, device=torch_device)
siglip_processor = siglip_model = feature_embedder = None
@ -151,6 +142,8 @@ class model_factory:
n_prompt: str = None,
sampling_steps: int = 20,
input_ref_images = None,
image_guide= None,
image_mask= None,
width= 832,
height=480,
embedded_guidance_scale: float = 2.5,
@ -162,6 +155,7 @@ class model_factory:
video_prompt_type = "",
joint_pass = False,
image_refs_relative_size = 100,
denoising_strength = 1.,
**bbargs
):
if self._interrupt:
@ -170,10 +164,16 @@ class model_factory:
if n_prompt is None or len(n_prompt) == 0: n_prompt = "low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors"
device="cuda"
flux_dev_uso = self.name in ['flux-dev-uso']
image_stiching = not self.name in ['flux-dev-uso'] #and False
# image_refs_relative_size = 100
crop = False
flux_dev_umo = self.name in ['flux-dev-umo']
latent_stiching = self.name in ['flux-dev-uso', 'flux-dev-umo']
lock_dimensions= False
input_ref_images = [] if input_ref_images is None else input_ref_images[:]
if flux_dev_umo:
ref_long_side = 512 if len(input_ref_images) <= 1 else 320
input_ref_images = [preprocess_ref(img, ref_long_side) for img in input_ref_images]
lock_dimensions = True
ref_style_imgs = []
if "I" in video_prompt_type and len(input_ref_images) > 0:
if flux_dev_uso :
@ -183,43 +183,26 @@ class model_factory:
elif len(input_ref_images) > 1 :
ref_style_imgs = input_ref_images[-1:]
input_ref_images = input_ref_images[:-1]
if image_stiching:
if latent_stiching:
# latents stiching with resize
if not lock_dimensions :
for i in range(len(input_ref_images)):
w, h = input_ref_images[i].size
image_height, image_width = calculate_new_dimensions(int(height*image_refs_relative_size/100), int(width*image_refs_relative_size/100), h, w, 0)
input_ref_images[i] = input_ref_images[i].resize((image_width, image_height), resample=Image.Resampling.LANCZOS)
else:
# image stiching method
stiched = input_ref_images[0]
if "K" in video_prompt_type :
w, h = input_ref_images[0].size
height, width = calculate_new_dimensions(height, width, h, w, fit_into_canvas)
# actual rescale will happen in prepare_kontext
for new_img in input_ref_images[1:]:
stiched = stitch_images(stiched, new_img)
input_ref_images = [stiched]
else:
first_ref = 0
if "K" in video_prompt_type:
# image latents tiling method
w, h = input_ref_images[0].size
if crop :
img = convert_image_to_tensor(input_ref_images[0])
img = resize_and_centercrop_image(img, height, width)
input_ref_images[0] = convert_tensor_to_image(img)
else:
height, width = calculate_new_dimensions(height, width, h, w, fit_into_canvas)
input_ref_images[0] = input_ref_images[0].resize((width, height), resample=Image.Resampling.LANCZOS)
first_ref = 1
for i in range(first_ref,len(input_ref_images)):
w, h = input_ref_images[i].size
if crop:
img = convert_image_to_tensor(input_ref_images[i])
img = resize_and_centercrop_image(img, int(height*image_refs_relative_size/100), int(width*image_refs_relative_size/100))
input_ref_images[i] = convert_tensor_to_image(img)
else:
image_height, image_width = calculate_new_dimensions(int(height*image_refs_relative_size/100), int(width*image_refs_relative_size/100), h, w, fit_into_canvas)
input_ref_images[i] = input_ref_images[i].resize((image_width, image_height), resample=Image.Resampling.LANCZOS)
elif image_guide is not None:
input_ref_images = [image_guide]
else:
input_ref_images = None
if flux_dev_uso :
if self.name in ['flux-dev-uso', 'flux-dev-umo'] :
inp, height, width = prepare_multi_ip(
ae=self.vae,
img_cond_list=input_ref_images,
@ -238,6 +221,7 @@ class model_factory:
bs=batch_size,
seed=seed,
device=device,
img_mask=image_mask,
)
inp.update(prepare_prompt(self.t5, self.clip, batch_size, input_prompt))
@ -259,13 +243,19 @@ class model_factory:
return unpack(x.float(), height, width)
# denoise initial noise
x = denoise(self.model, **inp, timesteps=timesteps, guidance=embedded_guidance_scale, real_guidance_scale =guide_scale, callback=callback, pipeline=self, loras_slists= loras_slists, unpack_latent = unpack_latent, joint_pass = joint_pass)
x = denoise(self.model, **inp, timesteps=timesteps, guidance=embedded_guidance_scale, real_guidance_scale =guide_scale, callback=callback, pipeline=self, loras_slists= loras_slists, unpack_latent = unpack_latent, joint_pass = joint_pass, denoising_strength = denoising_strength)
if x==None: return None
# decode latents to pixel space
x = unpack_latent(x)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
x = self.vae.decode(x)
if image_mask is not None:
from shared.utils.utils import convert_image_to_tensor
img_msk_rebuilt = inp["img_msk_rebuilt"]
img= convert_image_to_tensor(image_guide)
x = img.squeeze(2) * (1 - img_msk_rebuilt) + x.to(img) * img_msk_rebuilt
x = x.clamp(-1, 1)
x = x.transpose(0, 1)
return x

View File

@ -190,6 +190,21 @@ class Flux(nn.Module):
v = swap_scale_shift(v)
k = k.replace("norm_out.linear", "final_layer.adaLN_modulation.1")
new_sd[k] = v
# elif not first_key.startswith("diffusion_model.") and not first_key.startswith("transformer."):
# for k,v in sd.items():
# if "double" in k:
# k = k.replace(".processor.proj_lora1.", ".img_attn.proj.lora_")
# k = k.replace(".processor.proj_lora2.", ".txt_attn.proj.lora_")
# k = k.replace(".processor.qkv_lora1.", ".img_attn.qkv.lora_")
# k = k.replace(".processor.qkv_lora2.", ".txt_attn.qkv.lora_")
# else:
# k = k.replace(".processor.qkv_lora.", ".linear1_qkv.lora_")
# k = k.replace(".processor.proj_lora.", ".linear2.lora_")
# k = "diffusion_model." + k
# new_sd[k] = v
# from mmgp import safetensors2
# safetensors2.torch_write_file(new_sd, "fff.safetensors")
else:
new_sd = sd
return new_sd

View File

@ -138,10 +138,12 @@ def prepare_kontext(
target_width: int | None = None,
target_height: int | None = None,
bs: int = 1,
img_mask = None,
) -> tuple[dict[str, Tensor], int, int]:
# load and encode the conditioning image
res_match_output = img_mask is not None
img_cond_seq = None
img_cond_seq_ids = None
if img_cond_list == None: img_cond_list = []
@ -150,9 +152,11 @@ def prepare_kontext(
for cond_no, img_cond in enumerate(img_cond_list):
width, height = img_cond.size
aspect_ratio = width / height
# Kontext is trained on specific resolutions, using one of them is recommended
_, width, height = min((abs(aspect_ratio - w / h), w, h) for w, h in PREFERED_KONTEXT_RESOLUTIONS)
if res_match_output:
width, height = target_width, target_height
else:
# Kontext is trained on specific resolutions, using one of them is recommended
_, width, height = min((abs(aspect_ratio - w / h), w, h) for w, h in PREFERED_KONTEXT_RESOLUTIONS)
width = 2 * int(width / 16)
height = 2 * int(height / 16)
@ -193,6 +197,19 @@ def prepare_kontext(
"img_cond_seq": img_cond_seq,
"img_cond_seq_ids": img_cond_seq_ids,
}
if img_mask is not None:
from shared.utils.utils import convert_image_to_tensor, convert_tensor_to_image
# image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
image_mask_latents = convert_image_to_tensor(img_mask.resize((target_width // 16, target_height // 16), resample=Image.Resampling.LANCZOS))
image_mask_latents = torch.where(image_mask_latents>-0.5, 1., 0. )[0:1]
image_mask_rebuilt = image_mask_latents.repeat_interleave(16, dim=-1).repeat_interleave(16, dim=-2).unsqueeze(0)
# convert_tensor_to_image( image_mask_rebuilt.squeeze(0).repeat(3,1,1)).save("mmm.png")
image_mask_latents = image_mask_latents.reshape(1, -1, 1).to(device)
return_dict.update({
"img_msk_latents": image_mask_latents,
"img_msk_rebuilt": image_mask_rebuilt,
})
img = get_noise(
bs,
target_height,
@ -264,6 +281,9 @@ def denoise(
loras_slists=None,
unpack_latent = None,
joint_pass= False,
img_msk_latents = None,
img_msk_rebuilt = None,
denoising_strength = 1,
):
kwargs = {'pipeline': pipeline, 'callback': callback, "img_len" : img.shape[1], "siglip_embedding": siglip_embedding, "siglip_embedding_ids": siglip_embedding_ids}
@ -271,6 +291,21 @@ def denoise(
if callback != None:
callback(-1, None, True)
original_image_latents = None if img_cond_seq is None else img_cond_seq.clone()
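# Inpainting / partial denoising: when an image mask is provided and denoising_strength < 1, skip the earliest timesteps and start from the original image latents re-noised to that level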
morph, first_step = False, 0
if img_msk_latents is not None:
randn = torch.randn_like(original_image_latents)
if denoising_strength < 1.:
first_step = int(len(timesteps) * (1. - denoising_strength))
if not morph:
latent_noise_factor = timesteps[first_step]
latents = original_image_latents * (1.0 - latent_noise_factor) + randn * latent_noise_factor
img = latents.to(img)
latents = None
timesteps = timesteps[first_step:]
updated_num_steps= len(timesteps) -1
if callback != None:
from shared.utils.loras_mutipliers import update_loras_slists
@ -280,10 +315,14 @@ def denoise(
# this is ignored for schnell
guidance_vec = torch.full((img.shape[0],), guidance, device=img.device, dtype=img.dtype)
for i, (t_curr, t_prev) in enumerate(zip(timesteps[:-1], timesteps[1:])):
offload.set_step_no_for_lora(model, i)
offload.set_step_no_for_lora(model, first_step + i)
if pipeline._interrupt:
return None
if img_msk_latents is not None and denoising_strength <1. and i == first_step and morph:
latent_noise_factor = t_curr/1000
img = original_image_latents * (1.0 - latent_noise_factor) + img * latent_noise_factor
t_vec = torch.full((img.shape[0],), t_curr, dtype=img.dtype, device=img.device)
img_input = img
img_input_ids = img_ids
@ -333,6 +372,14 @@ def denoise(
pred = neg_pred + real_guidance_scale * (pred - neg_pred)
img += (t_prev - t_curr) * pred
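# Outside the mask, re-anchor the latents to a re-noised copy of the original image so only the masked area is actually denoised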
if img_msk_latents is not None:
latent_noise_factor = t_prev
# noisy_image = original_image_latents * (1.0 - latent_noise_factor) + torch.randn_like(original_image_latents) * latent_noise_factor
noisy_image = original_image_latents * (1.0 - latent_noise_factor) + randn * latent_noise_factor
img = noisy_image * (1-img_msk_latents) + img_msk_latents * img
noisy_image = None
if callback is not None:
preview = unpack_latent(img).transpose(0,1)
callback(i, preview, False)

View File

@ -640,6 +640,38 @@ configs = {
shift_factor=0.1159,
),
),
"flux-dev-umo": ModelSpec(
repo_id="",
repo_flow="",
repo_ae="ckpts/flux_vae.safetensors",
params=FluxParams(
in_channels=64,
out_channels=64,
vec_in_dim=768,
context_in_dim=4096,
hidden_size=3072,
mlp_ratio=4.0,
num_heads=24,
depth=19,
depth_single_blocks=38,
axes_dim=[16, 56, 56],
theta=10_000,
qkv_bias=True,
guidance_embed=True,
eso= True,
),
ae_params=AutoEncoderParams(
resolution=256,
in_channels=3,
ch=128,
out_ch=3,
ch_mult=[1, 2, 4, 4],
num_res_blocks=2,
z_channels=16,
scale_factor=0.3611,
shift_factor=0.1159,
),
),
}

View File

@ -861,11 +861,6 @@ class HunyuanVideoSampler(Inference):
freqs_cos, freqs_sin = self.get_rotary_pos_embed(target_frame_num, target_height, target_width, enable_RIFLEx)
else:
if self.avatar:
w, h = input_ref_images.size
target_height, target_width = calculate_new_dimensions(target_height, target_width, h, w, fit_into_canvas)
if target_width != w or target_height != h:
input_ref_images = input_ref_images.resize((target_width,target_height), resample=Image.Resampling.LANCZOS)
concat_dict = {'mode': 'timecat', 'bias': -1}
freqs_cos, freqs_sin = self.get_rotary_pos_embed_new(129, target_height, target_width, concat_dict)
else:

View File

@ -51,6 +51,23 @@ class family_handler():
extra_model_def["tea_cache"] = True
extra_model_def["mag_cache"] = True
if base_model_type in ["hunyuan_custom_edit"]:
extra_model_def["guide_preprocessing"] = {
"selection": ["MV", "PV"],
}
extra_model_def["mask_preprocessing"] = {
"selection": ["A", "NA"],
"default" : "NA"
}
if base_model_type in ["hunyuan_custom_audio", "hunyuan_custom_edit", "hunyuan_custom"]:
extra_model_def["image_ref_choices"] = {
"choices": [("Reference Image", "I")],
"letters_filter":"I",
"visible": False,
}
if base_model_type in ["hunyuan_avatar"]: extra_model_def["no_background_removal"] = True
if base_model_type in ["hunyuan_custom", "hunyuan_custom_edit", "hunyuan_custom_audio", "hunyuan_avatar"]:
@ -141,6 +158,18 @@ class family_handler():
return hunyuan_model, pipe
@staticmethod
def fix_settings(base_model_type, settings_version, model_def, ui_defaults):
if settings_version<2.33:
if base_model_type in ["hunyuan_custom_edit"]:
video_prompt_type= ui_defaults["video_prompt_type"]
if "P" in video_prompt_type and "M" in video_prompt_type:
video_prompt_type = video_prompt_type.replace("M","")
ui_defaults["video_prompt_type"] = video_prompt_type
pass
@staticmethod
def update_default_settings(base_model_type, model_def, ui_defaults):
ui_defaults["embedded_guidance_scale"]= 6.0

View File

@ -300,9 +300,6 @@ class LTXV:
prefix_size, height, width = input_video.shape[-3:]
else:
if image_start != None:
frame_width, frame_height = image_start.size
if fit_into_canvas != None:
height, width = calculate_new_dimensions(height, width, frame_height, frame_width, fit_into_canvas, 32)
conditioning_media_paths.append(image_start.unsqueeze(1))
conditioning_start_frames.append(0)
conditioning_control_frames.append(False)

View File

@ -26,6 +26,15 @@ class family_handler():
extra_model_def["sliding_window"] = True
extra_model_def["image_prompt_types_allowed"] = "TSEV"
extra_model_def["guide_preprocessing"] = {
"selection": ["", "PV", "DV", "EV", "V"],
"labels" : { "V": "Use LTXV raw format"}
}
extra_model_def["mask_preprocessing"] = {
"selection": ["", "A", "NA", "XA", "XNA"],
}
return extra_model_def
@staticmethod

View File

@ -28,7 +28,7 @@ from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2Tokenizer, Aut
from .autoencoder_kl_qwenimage import AutoencoderKLQwenImage
from diffusers import FlowMatchEulerDiscreteScheduler
from PIL import Image
from shared.utils.utils import calculate_new_dimensions
from shared.utils.utils import calculate_new_dimensions, convert_image_to_tensor, convert_tensor_to_image
XLA_AVAILABLE = False
@ -563,6 +563,8 @@ class QwenImagePipeline(): #DiffusionPipeline
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
max_sequence_length: int = 512,
image = None,
image_mask = None,
denoising_strength = 0,
callback=None,
pipeline=None,
loras_slists=None,
@ -683,6 +685,7 @@ class QwenImagePipeline(): #DiffusionPipeline
device = "cuda"
prompt_image = None
image_mask_latents = None
if image is not None and not (isinstance(image, torch.Tensor) and image.size(1) == self.latent_channels):
image = image[0] if isinstance(image, list) else image
image_height, image_width = self.image_processor.get_default_height_width(image)
@ -694,14 +697,32 @@ class QwenImagePipeline(): #DiffusionPipeline
image_width = image_width // multiple_of * multiple_of
image_height = image_height // multiple_of * multiple_of
ref_height, ref_width = 1568, 672
if height * width < ref_height * ref_width: ref_height , ref_width = height , width
if image_height * image_width > ref_height * ref_width:
image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
image = image.resize((image_width,image_height), resample=Image.Resampling.LANCZOS)
if image_mask is None:
if height * width < ref_height * ref_width: ref_height , ref_width = height , width
if image_height * image_width > ref_height * ref_width:
image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
if (image_width,image_height) != image.size:
image = image.resize((image_width,image_height), resample=Image.Resampling.LANCZOS)
else:
# _, image_width, image_height = min(
# (abs(aspect_ratio - w / h), w, h) for w, h in PREFERRED_QWENIMAGE_RESOLUTIONS
# )
image_height, image_width = calculate_new_dimensions(height, width, image_height, image_width, False, block_size=multiple_of)
# image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
height, width = image_height, image_width
image_mask_latents = convert_image_to_tensor(image_mask.resize((width // 16, height // 16), resample=Image.Resampling.LANCZOS))
image_mask_latents = torch.where(image_mask_latents>-0.5, 1., 0. )[0:1]
image_mask_rebuilt = image_mask_latents.repeat_interleave(16, dim=-1).repeat_interleave(16, dim=-2).unsqueeze(0)
# convert_tensor_to_image( image_mask_rebuilt.squeeze(0).repeat(3,1,1)).save("mmm.png")
image_mask_latents = image_mask_latents.reshape(1, -1, 1).to(device)
prompt_image = image
image = self.image_processor.preprocess(image, image_height, image_width)
image = image.unsqueeze(2)
if image.size != (image_width, image_height):
image = image.resize((image_width, image_height), resample=Image.Resampling.LANCZOS)
# image.save("nnn.png")
image = convert_image_to_tensor(image).unsqueeze(0).unsqueeze(2)
has_neg_prompt = negative_prompt is not None or (
negative_prompt_embeds is not None and negative_prompt_embeds_mask is not None
@ -744,6 +765,8 @@ class QwenImagePipeline(): #DiffusionPipeline
generator,
latents,
)
original_image_latents = None if image_latents is None else image_latents.clone()
if image is not None:
img_shapes = [
[
@ -788,6 +811,18 @@ class QwenImagePipeline(): #DiffusionPipeline
negative_txt_seq_lens = (
negative_prompt_embeds_mask.sum(dim=1).tolist() if negative_prompt_embeds_mask is not None else None
)
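# Same inpainting / partial denoising setup as in the Flux pipeline: optionally skip the first steps and start from the original image latents re-noised to that level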
morph, first_step = False, 0
if image_mask_latents is not None:
randn = torch.randn_like(original_image_latents)
if denoising_strength < 1.:
first_step = int(len(timesteps) * (1. - denoising_strength))
if not morph:
latent_noise_factor = timesteps[first_step]/1000
# latents = original_image_latents * (1.0 - latent_noise_factor) + torch.randn_like(original_image_latents) * latent_noise_factor
latents = original_image_latents * (1.0 - latent_noise_factor) + randn * latent_noise_factor
timesteps = timesteps[first_step:]
self.scheduler.timesteps = timesteps
self.scheduler.sigmas= self.scheduler.sigmas[first_step:]
# 6. Denoising loop
self.scheduler.set_begin_index(0)
@ -797,10 +832,16 @@ class QwenImagePipeline(): #DiffusionPipeline
update_loras_slists(self.transformer, loras_slists, updated_num_steps)
callback(-1, None, True, override_num_inference_steps = updated_num_steps)
for i, t in enumerate(timesteps):
offload.set_step_no_for_lora(self.transformer, first_step + i)
if self.interrupt:
continue
if image_mask_latents is not None and denoising_strength <1. and i == first_step and morph:
latent_noise_factor = t/1000
latents = original_image_latents * (1.0 - latent_noise_factor) + latents * latent_noise_factor
self._current_timestep = t
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
timestep = t.expand(latents.shape[0]).to(latents.dtype)
@ -865,6 +906,13 @@ class QwenImagePipeline(): #DiffusionPipeline
# compute the previous noisy sample x_t -> x_t-1
latents_dtype = latents.dtype
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
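# Keep regions outside the mask anchored to the original image latents at the matching noise level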
if image_mask_latents is not None:
next_t = timesteps[i+1] if i<len(timesteps)-1 else 0
latent_noise_factor = next_t / 1000
# noisy_image = original_image_latents * (1.0 - latent_noise_factor) + torch.randn_like(original_image_latents) * latent_noise_factor
noisy_image = original_image_latents * (1.0 - latent_noise_factor) + randn * latent_noise_factor
latents = noisy_image * (1-image_mask_latents) + image_mask_latents * latents
noisy_image = None
if latents.dtype != latents_dtype:
if torch.backends.mps.is_available():
@ -878,7 +926,7 @@ class QwenImagePipeline(): #DiffusionPipeline
self._current_timestep = None
if output_type == "latent":
image = latents
output_image = latents
else:
latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
latents = latents.to(self.vae.dtype)
@ -891,7 +939,9 @@ class QwenImagePipeline(): #DiffusionPipeline
latents.device, latents.dtype
)
latents = latents / latents_std + latents_mean
image = self.vae.decode(latents, return_dict=False)[0][:, :, 0]
output_image = self.vae.decode(latents, return_dict=False)[0][:, :, 0]
if image_mask is not None:
output_image = image.squeeze(2) * (1 - image_mask_rebuilt) + output_image.to(image) * image_mask_rebuilt
return image
return output_image

View File

@ -9,7 +9,7 @@ def get_qwen_text_encoder_filename(text_encoder_quantization):
class family_handler():
@staticmethod
def query_model_def(base_model_type, model_def):
model_def_output = {
extra_model_def = {
"image_outputs" : True,
"sample_solvers":[
("Default", "default"),
@ -18,8 +18,19 @@ class family_handler():
"lock_image_refs_ratios": True,
}
if base_model_type in ["qwen_image_edit_20B"]:
extra_model_def["inpaint_support"] = True
extra_model_def["image_ref_choices"] = {
"choices": [
("None", ""),
("Conditional Images is first Main Subject / Landscape and may be followed by People / Objects", "KI"),
("Conditional Images are People / Objects", "I"),
],
"letters_filter": "KI",
}
extra_model_def["background_removal_label"]= "Remove Backgrounds only behind People / Objects except main Subject / Landscape"
return model_def_output
return extra_model_def
@staticmethod
def query_supported_types():
@ -75,14 +86,18 @@ class family_handler():
if ui_defaults.get("sample_solver", "") == "":
ui_defaults["sample_solver"] = "default"
if settings_version < 2.32:
ui_defaults["denoising_strength"] = 1.
@staticmethod
def update_default_settings(base_model_type, model_def, ui_defaults):
ui_defaults.update({
"guidance_scale": 4,
"sample_solver": "default",
})
if model_def.get("reference_image", False):
if base_model_type in ["qwen_image_edit_20B"]:
ui_defaults.update({
"video_prompt_type": "KI",
"denoising_strength" : 1.,
})

View File

@ -103,6 +103,8 @@ class model_factory():
n_prompt = None,
sampling_steps: int = 20,
input_ref_images = None,
image_guide= None,
image_mask= None,
width= 832,
height=480,
guide_scale: float = 4,
@ -114,6 +116,7 @@ class model_factory():
VAE_tile_size = None,
joint_pass = True,
sample_solver='default',
denoising_strength = 1.,
**bbargs
):
# Generate with different aspect ratios
@ -174,8 +177,9 @@ class model_factory():
if n_prompt is None or len(n_prompt) == 0:
n_prompt= "text, watermark, copyright, blurry, low resolution"
if input_ref_images is not None:
if image_guide is not None:
input_ref_images = [image_guide]
elif input_ref_images is not None:
# image stiching method
stiched = input_ref_images[0]
if "K" in video_prompt_type :
@ -190,6 +194,7 @@ class model_factory():
prompt=input_prompt,
negative_prompt=n_prompt,
image = input_ref_images,
image_mask = image_mask,
width=width,
height=height,
num_inference_steps=sampling_steps,
@ -199,6 +204,7 @@ class model_factory():
pipeline=self,
loras_slists=loras_slists,
joint_pass = joint_pass,
denoising_strength=denoising_strength,
generator=torch.Generator(device="cuda").manual_seed(seed)
)
if image is None: return None

View File

@ -261,7 +261,7 @@ class WanAny2V:
def vace_latent(self, z, m):
return [torch.cat([zz, mm], dim=0) for zz, mm in zip(z, m)]
def fit_image_into_canvas(self, ref_img, image_size, canvas_tf_bg, device, fill_max = False, outpainting_dims = None, return_mask = False):
def fit_image_into_canvas(self, ref_img, image_size, canvas_tf_bg, device, full_frame = False, outpainting_dims = None, return_mask = False):
from shared.utils.utils import save_image
ref_width, ref_height = ref_img.size
if (ref_height, ref_width) == image_size and outpainting_dims == None:
@ -270,18 +270,23 @@ class WanAny2V:
else:
if outpainting_dims != None:
final_height, final_width = image_size
canvas_height, canvas_width, margin_top, margin_left = get_outpainting_frame_location(final_height, final_width, outpainting_dims, 8)
canvas_height, canvas_width, margin_top, margin_left = get_outpainting_frame_location(final_height, final_width, outpainting_dims, 1)
else:
canvas_height, canvas_width = image_size
scale = min(canvas_height / ref_height, canvas_width / ref_width)
new_height = int(ref_height * scale)
new_width = int(ref_width * scale)
if fill_max and (canvas_height - new_height) < 16:
if full_frame:
new_height = canvas_height
if fill_max and (canvas_width - new_width) < 16:
new_width = canvas_width
top = (canvas_height - new_height) // 2
left = (canvas_width - new_width) // 2
top = left = 0
else:
# if fill_max and (canvas_height - new_height) < 16:
# new_height = canvas_height
# if fill_max and (canvas_width - new_width) < 16:
# new_width = canvas_width
scale = min(canvas_height / ref_height, canvas_width / ref_width)
new_height = int(ref_height * scale)
new_width = int(ref_width * scale)
top = (canvas_height - new_height) // 2
left = (canvas_width - new_width) // 2
ref_img = ref_img.resize((new_width, new_height), resample=Image.Resampling.LANCZOS)
ref_img = TF.to_tensor(ref_img).sub_(0.5).div_(0.5).unsqueeze(1)
if outpainting_dims != None:
@ -302,7 +307,7 @@ class WanAny2V:
canvas = canvas.to(device)
return ref_img.to(device), canvas
def prepare_source(self, src_video, src_mask, src_ref_images, total_frames, image_size, device, keep_video_guide_frames= [], start_frame = 0, fit_into_canvas = None, pre_src_video = None, inject_frames = [], outpainting_dims = None, any_background_ref = False):
def prepare_source(self, src_video, src_mask, src_ref_images, total_frames, image_size, device, keep_video_guide_frames= [], start_frame = 0, pre_src_video = None, inject_frames = [], outpainting_dims = None, any_background_ref = False):
image_sizes = []
trim_video_guide = len(keep_video_guide_frames)
def conv_tensor(t, device):
@ -533,22 +538,16 @@ class WanAny2V:
any_end_frame = False
if image_start is None:
if infinitetalk:
new_shot = "Q" in video_prompt_type
if input_frames is not None:
image_ref = input_frames[:, 0]
if input_video is None: input_video = input_frames[:, 0:1]
new_shot = "Q" in video_prompt_type
else:
if pre_video_frame is None:
new_shot = True
else:
if input_ref_images is None:
input_ref_images, new_shot = [pre_video_frame], False
else:
input_ref_images, new_shot = [img.resize(pre_video_frame.size, resample=Image.Resampling.LANCZOS) for img in input_ref_images], "Q" in video_prompt_type
if input_ref_images is None: raise Exception("Missing Reference Image")
if input_ref_images is None:
if pre_video_frame is None: raise Exception("Missing Reference Image")
input_ref_images, new_shot = [pre_video_frame], False
new_shot = new_shot and window_no <= len(input_ref_images)
image_ref = convert_image_to_tensor(input_ref_images[ min(window_no, len(input_ref_images))-1 ])
if new_shot:
if new_shot or input_video is None:
input_video = image_ref.unsqueeze(1)
else:
color_correction_strength = 0 #disable color correction as transition frames between shots may have a completely different color level than the colors of the new shot
@ -847,7 +846,7 @@ class WanAny2V:
for i, t in enumerate(tqdm(timesteps)):
guide_scale, guidance_switch_done, trans, denoising_extra = update_guidance(i, t, guide_scale, guide2_scale, guidance_switch_done, switch_threshold, trans, 2, denoising_extra)
guide_scale, guidance_switch2_done, trans, denoising_extra = update_guidance(i, t, guide_scale, guide3_scale, guidance_switch2_done, switch2_threshold, trans, 3, denoising_extra)
offload.set_step_no_for_lora(trans, i)
offload.set_step_no_for_lora(trans, start_step_no + i)
timestep = torch.stack([t])
if timestep_injection:

View File

@ -35,7 +35,7 @@ class family_handler():
"label" : "Generation Type"
}
extra_model_def["image_prompt_types_allowed"] = "TSEV"
extra_model_def["image_prompt_types_allowed"] = "TSV"
return extra_model_def
@ -66,7 +66,11 @@ class family_handler():
def query_family_infos():
return {}
@staticmethod
def get_rgb_factors(base_model_type ):
from shared.RGB_factors import get_rgb_factors
latent_rgb_factors, latent_rgb_factors_bias = get_rgb_factors("wan", base_model_type)
return latent_rgb_factors, latent_rgb_factors_bias
@staticmethod
def query_model_files(computeList, base_model_type, model_filename, text_encoder_quantization):

View File

@ -110,18 +110,79 @@ class family_handler():
"tea_cache" : not (base_model_type in ["i2v_2_2", "ti2v_2_2" ] or multiple_submodels),
"mag_cache" : True,
"keep_frames_video_guide_not_supported": base_model_type in ["infinitetalk"],
"convert_image_guide_to_video" : True,
"sample_solvers":[
("unipc", "unipc"),
("euler", "euler"),
("dpm++", "dpm++"),
("flowmatch causvid", "causvid"), ]
})
if base_model_type in ["t2v"]:
extra_model_def["guide_custom_choices"] = {
"choices":[("Use Text Prompt Only", ""),("Video to Video guided by Text Prompt", "GUV")],
"default": "",
"letters_filter": "GUV",
"label": "Video to Video"
}
if base_model_type in ["infinitetalk"]:
extra_model_def["no_background_removal"] = True
# extra_model_def["at_least_one_image_ref_needed"] = True
extra_model_def["all_image_refs_are_background_ref"] = True
extra_model_def["guide_custom_choices"] = {
"choices":[
("Images to Video, each Reference Image will start a new shot with a new Sliding Window - Sharp Transitions", "QKI"),
("Images to Video, each Reference Image will start a new shot with a new Sliding Window - Smooth Transitions", "KI"),
("Sparse Video to Video, one Image will by extracted from Video for each new Sliding Window - Sharp Transitions", "QRUV"),
("Sparse Video to Video, one Image will by extracted from Video for each new Sliding Window - Smooth Transitions", "RUV"),
("Video to Video, amount of motion transferred depends on Denoising Strength - Sharp Transitions", "GQUV"),
("Video to Video, amount of motion transferred depends on Denoising Strength - Smooth Transitions", "GUV"),
],
"default": "KI",
"letters_filter": "RGUVQKI",
"label": "Video to Video",
"show_label" : False,
}
# extra_model_def["at_least_one_image_ref_needed"] = True
if vace_class:
extra_model_def["guide_preprocessing"] = {
"selection": ["", "UV", "PV", "DV", "SV", "LV", "CV", "MV", "V", "PDV", "PSV", "PLV" , "DSV", "DLV", "SLV"],
"labels" : { "V": "Use Vace raw format"}
}
extra_model_def["mask_preprocessing"] = {
"selection": ["", "A", "NA", "XA", "XNA", "YA", "YNA", "WA", "WNA", "ZA", "ZNA"],
}
extra_model_def["image_ref_choices"] = {
"choices": [("None", ""),
("Inject only People / Objects", "I"),
("Inject Landscape and then People / Objects", "KI"),
("Inject Frames and then People / Objects", "FI"),
],
"letters_filter": "KFI",
}
if base_model_type in ["standin"] or vace_class:
extra_model_def["lock_image_refs_ratios"] = True
extra_model_def["background_removal_label"]= "Remove Backgrounds behind People / Objects, keep it for Landscape or positioned Frames"
if base_model_type in ["standin"]:
extra_model_def["lock_image_refs_ratios"] = True
extra_model_def["image_ref_choices"] = {
"choices": [
("No Reference Image", ""),
("Reference Image is a Person Face", "I"),
],
"letters_filter":"I",
}
if base_model_type in ["phantom_1.3B", "phantom_14B"]:
extra_model_def["image_ref_choices"] = {
"choices": [("Reference Image", "I")],
"letters_filter":"I",
"visible": False,
}
if base_model_type in ["recam_1.3B"]:
extra_model_def["keep_frames_video_guide_not_supported"] = True
@ -141,6 +202,12 @@ class family_handler():
"default": 1,
"label" : "Camera Movement Type"
}
extra_model_def["guide_preprocessing"] = {
"selection": ["UV"],
"labels" : { "UV": "Control Video"},
"visible" : False,
}
if vace_class or base_model_type in ["infinitetalk"]:
image_prompt_types_allowed = "TVL"
elif base_model_type in ["ti2v_2_2"]:

View File

@ -7,7 +7,6 @@ import psutil
# import ffmpeg
import imageio
from PIL import Image
import cv2
import torch
import torch.nn.functional as F
@ -33,6 +32,8 @@ model_in_GPU = False
matanyone_in_GPU = False
bfloat16_supported = False
# SAM generator
import copy
class MaskGenerator():
def __init__(self, sam_checkpoint, device):
global args_device
@ -89,6 +90,7 @@ def get_frames_from_image(image_input, image_state):
"last_frame_numer": 0,
"fps": None
}
image_info = "Image Name: N/A,\nFPS: N/A,\nTotal Frames: {},\nImage Size:{}".format(len(frames), image_size)
set_image_encoder_patch()
select_SAM()
@ -717,27 +719,33 @@ def load_unload_models(selected):
def get_vmc_event_handler():
return load_unload_models
def export_to_vace_video_input(foreground_video_output):
gr.Info("Masked Video Input transferred to Vace For Inpainting")
return "V#" + str(time.time()), foreground_video_output
def export_image(image_refs, image_output):
gr.Info("Masked Image transferred to Current Video")
def export_image(state, image_output):
ui_settings = get_current_model_settings(state)
image_refs = ui_settings["image_refs"]
if image_refs == None:
image_refs =[]
image_refs.append( image_output)
return image_refs
ui_settings["image_refs"] = image_refs
gr.Info("Masked Image transferred to Current Image Generator")
return time.time()
def export_image_mask(image_input, image_mask):
gr.Info("Input Image & Mask transferred to Current Video")
return Image.fromarray(image_input), image_mask
def export_image_mask(state, image_input, image_mask):
ui_settings = get_current_model_settings(state)
ui_settings["image_guide"] = Image.fromarray(image_input)
ui_settings["image_mask"] = image_mask
gr.Info("Input Image & Mask transferred to Current Image Generator")
return time.time()
def export_to_current_video_engine( foreground_video_output, alpha_video_output):
def export_to_current_video_engine(state, foreground_video_output, alpha_video_output):
ui_settings = get_current_model_settings(state)
ui_settings["video_guide"] = foreground_video_output
ui_settings["video_mask"] = alpha_video_output
gr.Info("Original Video and Full Mask have been transferred")
# return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
return foreground_video_output, alpha_video_output
return time.time()
def teleport_to_video_tab(tab_state):
@ -746,15 +754,29 @@ def teleport_to_video_tab(tab_state):
return gr.Tabs(selected="video_gen")
def display(tabs, tab_state, server_config, vace_video_input, vace_image_input, vace_video_mask, vace_image_mask, vace_image_refs):
def display(tabs, tab_state, state, refresh_form_trigger, server_config, get_current_model_settings_fn): #, vace_video_input, vace_image_input, vace_video_mask, vace_image_mask, vace_image_refs):
# my_tab.select(fn=load_unload_models, inputs=[], outputs=[])
global image_output_codec, video_output_codec
global image_output_codec, video_output_codec, get_current_model_settings
get_current_model_settings = get_current_model_settings_fn
image_output_codec = server_config.get("image_output_codec", None)
video_output_codec = server_config.get("video_output_codec", None)
media_url = "https://github.com/pq-yang/MatAnyone/releases/download/media/"
click_brush_js = """
() => {
setTimeout(() => {
const brushButton = document.querySelector('button[aria-label="Brush"]');
if (brushButton) {
brushButton.click();
console.log('Brush button clicked');
} else {
console.log('Brush button not found');
}
}, 1000);
} """
# download assets
gr.Markdown("<B>Mast Edition is provided by MatAnyone and VRAM optimized by DeepBeepMeep</B>")
@ -871,7 +893,7 @@ def display(tabs, tab_state, server_config, vace_video_input, vace_image_input,
template_frame = gr.Image(label="Start Frame", type="pil",interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
with gr.Row():
clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, min_width=100)
add_mask_button = gr.Button(value="Set Mask", interactive=True, visible=False, min_width=100)
add_mask_button = gr.Button(value="Add Mask", interactive=True, visible=False, min_width=100)
remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, min_width=100) # no use
matting_button = gr.Button(value="Generate Video Matting", interactive=True, visible=False, min_width=100)
with gr.Row():
@ -892,7 +914,7 @@ def display(tabs, tab_state, server_config, vace_video_input, vace_image_input,
with gr.Row(visible= True):
export_to_current_video_engine_btn = gr.Button("Export to Control Video Input and Video Mask Input", visible= False)
export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [vace_video_input, vace_video_mask]).then( #video_prompt_video_guide_trigger,
export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [state, foreground_video_output, alpha_video_output], outputs= [refresh_form_trigger]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs])
@ -1089,10 +1111,10 @@ def display(tabs, tab_state, server_config, vace_video_input, vace_image_input,
# with gr.Column(scale=2, visible= True):
export_image_mask_btn = gr.Button(value="Set to Control Image & Mask", visible=False, elem_classes="new_button")
export_image_btn.click( fn=export_image, inputs= [vace_image_refs, foreground_image_output], outputs= [vace_image_refs]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs])
export_image_mask_btn.click( fn=export_image_mask, inputs= [image_input, alpha_image_output], outputs= [vace_image_input, vace_image_mask]).then( #video_prompt_video_guide_trigger,
export_image_btn.click( fn=export_image, inputs= [state, foreground_image_output], outputs= [refresh_form_trigger]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs])
export_image_mask_btn.click( fn=export_image_mask, inputs= [state, image_input, alpha_image_output], outputs= [refresh_form_trigger]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs]).then(fn=None, inputs=None, outputs=None, js=click_brush_js)
# first step: get the image information
extract_frames_button.click(
@ -1148,5 +1170,21 @@ def display(tabs, tab_state, server_config, vace_video_input, vace_image_input,
outputs=[foreground_image_output, alpha_image_output,foreground_image_output, alpha_image_output,bbox_info, export_image_btn, export_image_mask_btn]
)
nada = gr.State({})
# clear input
gr.on(
triggers=[image_input.clear], #image_input.change,
fn=restart,
inputs=[],
outputs=[
image_state,
interactive_state,
click_state,
foreground_image_output, alpha_image_output,
template_frame,
image_selection_slider, image_selection_slider, track_pause_number_slider,point_prompt, export_image_btn, export_image_mask_btn, bbox_info, clear_button_click,
add_mask_button, matting_button, template_frame, foreground_image_output, alpha_image_output, remove_mask_button, export_image_btn, export_image_mask_btn, mask_dropdown, nada, step2_title
],
queue=False,
show_progress=False)

View File

@ -2,7 +2,6 @@ import math
import torch
from typing import Optional, Union, Tuple
# @torch.jit.script
def get_similarity(mk: torch.Tensor,
ms: torch.Tensor,
@ -59,6 +58,7 @@ def get_similarity(mk: torch.Tensor,
del two_ab
# similarity = (-a_sq + two_ab)
similarity =similarity.float()
if ms is not None:
similarity *= ms
similarity /= math.sqrt(CK)

View File

@ -73,5 +73,5 @@ def matanyone(processor, frames_np, mask, r_erode=0, r_dilate=0, n_warmup=10):
if ti > (n_warmup-1):
frames.append((com_np*255).astype(np.uint8))
phas.append((pha*255).astype(np.uint8))
# phas.append(np.clip(pha * 255, 0, 255).astype(np.uint8))
return frames, phas

View File

@ -23,7 +23,7 @@ librosa==0.11.0
speechbrain==1.0.3
# UI & interaction
gradio==5.23.0
gradio==5.29.0
dashscope
loguru

View File

@ -4,6 +4,7 @@ from typing import Any, Dict, List, Optional, Sequence, Tuple, Union, Literal
import gradio as gr
import PIL
import time
from PIL import Image as PILImage
FilePath = str
@ -20,6 +21,9 @@ def get_list( objs):
return []
return [ obj[0] if isinstance(obj, tuple) else obj for obj in objs]
def record_last_action(st, last_action):
st["last_action"] = last_action
st["last_time"] = time.time()
class AdvancedMediaGallery:
def __init__(
self,
@ -60,9 +64,10 @@ class AdvancedMediaGallery:
self.state: Optional[gr.State] = None
self._initial_state: Dict[str, Any] = {
"items": items,
"selected": (len(items) - 1) if items else None,
"selected": (len(items) - 1) if items else 0, # None,
"single": bool(single_image_mode),
"mode": self.media_mode,
"last_action": "",
}
# ---------------- helpers ----------------
@ -210,6 +215,13 @@ class AdvancedMediaGallery:
def _on_select(self, state: Dict[str, Any], gallery, evt: gr.SelectData) :
# Mirror the selected index into state and the gallery (server-side selected_index)
st = get_state(state)
last_time = st.get("last_time", None)
if last_time is not None and abs(time.time()- last_time)< 0.5: # crappy trick to detect if onselect is unwanted (buggy gallery)
# print(f"ignored:{time.time()}, real {st['selected']}")
return gr.update(selected_index=st["selected"]), st
idx = None
if evt is not None and hasattr(evt, "index"):
ix = evt.index
@ -220,17 +232,28 @@ class AdvancedMediaGallery:
idx = ix[0] * max(1, int(self.columns)) + ix[1]
else:
idx = ix[0]
st = get_state(state)
n = len(get_list(gallery))
sel = idx if (idx is not None and 0 <= idx < n) else None
# print(f"image selected evt index:{sel}/{evt.selected}")
st["selected"] = sel
# return gr.update(selected_index=sel), st
# return gr.update(), st
return st
return gr.update(), st
def _on_upload(self, value: List[Any], state: Dict[str, Any]) :
# Fires when users upload via the Gallery itself.
# items_filtered = self._filter_items_by_mode(list(value or []))
items_filtered = list(value or [])
st = get_state(state)
new_items = self._paths_from_payload(items_filtered)
st["items"] = new_items
new_sel = len(new_items) - 1
st["selected"] = new_sel
record_last_action(st,"add")
return gr.update(selected_index=new_sel), st
def _on_gallery_change(self, value: List[Any], state: Dict[str, Any]) :
# Fires when users add/drag/drop/delete via the Gallery itself.
items_filtered = self._filter_items_by_mode(list(value or []))
# items_filtered = self._filter_items_by_mode(list(value or []))
items_filtered = list(value or [])
st = get_state(state)
st["items"] = items_filtered
# Keep selection if still valid, else default to last
@ -240,10 +263,9 @@ class AdvancedMediaGallery:
else:
new_sel = old_sel
st["selected"] = new_sel
# return gr.update(value=items_filtered, selected_index=new_sel), st
# return gr.update(value=items_filtered), st
return gr.update(), st
st["last_action"] ="gallery_change"
# print(f"gallery change: set sel {new_sel}")
return gr.update(selected_index=new_sel), st
def _on_add(self, files_payload: Any, state: Dict[str, Any], gallery):
"""
@ -252,7 +274,8 @@ class AdvancedMediaGallery:
and re-selects the last inserted item.
"""
# New items (respect image/video mode)
new_items = self._filter_items_by_mode(self._paths_from_payload(files_payload))
# new_items = self._filter_items_by_mode(self._paths_from_payload(files_payload))
new_items = self._paths_from_payload(files_payload)
st = get_state(state)
cur: List[Any] = get_list(gallery)
@ -298,30 +321,6 @@ class AdvancedMediaGallery:
if k is not None:
seen_new.add(k)
# Remove any existing occurrences of the incoming items from current list,
# BUT keep the currently selected item even if it's also in incoming.
cur_clean: List[Any] = []
# sel_item = cur[sel] if (sel is not None and 0 <= sel < len(cur)) else None
# for idx, it in enumerate(cur):
# k = key_of(it)
# if it is sel_item:
# cur_clean.append(it)
# continue
# if k is not None and k in seen_new:
# continue # drop duplicate; we'll reinsert at the target spot
# cur_clean.append(it)
# # Compute insertion position: right AFTER the (possibly shifted) selected item
# if sel_item is not None:
# # find sel_item's new index in cur_clean
# try:
# pos_sel = cur_clean.index(sel_item)
# except ValueError:
# # Shouldn't happen, but fall back to end
# pos_sel = len(cur_clean) - 1
# insert_pos = pos_sel + 1
# else:
# insert_pos = len(cur_clean) # no selection -> append at end
insert_pos = min(sel, len(cur) -1)
cur_clean = cur
# Build final list and selection
@ -330,6 +329,8 @@ class AdvancedMediaGallery:
st["items"] = merged
st["selected"] = new_sel
record_last_action(st,"add")
# print(f"gallery add: set sel {new_sel}")
return gr.update(value=merged, selected_index=new_sel), st
def _on_remove(self, state: Dict[str, Any], gallery) :
@ -342,8 +343,9 @@ class AdvancedMediaGallery:
return gr.update(value=[], selected_index=None), st
new_sel = min(sel, len(items) - 1)
st["items"] = items; st["selected"] = new_sel
# return gr.update(value=items, selected_index=new_sel), st
# return gr.update(value=items), st
record_last_action(st,"remove")
# print(f"gallery del: new sel {new_sel}")
return gr.update(value=items, selected_index=new_sel), st
def _on_move(self, delta: int, state: Dict[str, Any], gallery) :
st = get_state(state); items: List[Any] = get_list(gallery); sel = st.get("selected", None)
@ -354,11 +356,15 @@ class AdvancedMediaGallery:
return gr.update(value=items, selected_index=sel), st
items[sel], items[j] = items[j], items[sel]
st["items"] = items; st["selected"] = j
record_last_action(st,"move")
# print(f"gallery move: set sel {j}")
return gr.update(value=items, selected_index=j), st
def _on_clear(self, state: Dict[str, Any]) :
st = {"items": [], "selected": None, "single": get_state(state).get("single", False), "mode": self.media_mode}
# return gr.update(value=[], selected_index=0), st
record_last_action(st,"clear")
# print(f"Clear all")
return gr.update(value=[], selected_index=None), st
def _on_toggle_single(self, to_single: bool, state: Dict[str, Any]) :
st = get_state(state); st["single"] = bool(to_single)
@ -382,30 +388,38 @@ class AdvancedMediaGallery:
def mount(self, parent: Optional[gr.Blocks | gr.Group | gr.Row | gr.Column] = None, update_form = False):
if parent is not None:
with parent:
col = self._build_ui()
col = self._build_ui(update_form)
else:
col = self._build_ui()
col = self._build_ui(update_form)
if not update_form:
self._wire_events()
return col
def _build_ui(self) -> gr.Column:
def _build_ui(self, update = False) -> gr.Column:
with gr.Column(elem_id=self.elem_id, elem_classes=self.elem_classes) as col:
self.container = col
self.state = gr.State(dict(self._initial_state))
# self.gallery = gr.Gallery(
#     label=self.label,
#     value=self._initial_state["items"],
#     height=self.height,
#     columns=self.columns,
#     show_label=self.show_label,
#     preview= True,
#     # type="pil",
#     file_types= list(IMAGE_EXTS) if self.media_mode == "image" else list(VIDEO_EXTS),
#     selected_index=self._initial_state["selected"], # server-side selection
# )
if update:
self.gallery = gr.update(
value=self._initial_state["items"],
selected_index=self._initial_state["selected"], # server-side selection
label=self.label,
show_label=self.show_label,
)
else:
self.gallery = gr.Gallery(
value=self._initial_state["items"],
label=self.label,
height=self.height,
columns=self.columns,
show_label=self.show_label,
preview= True,
# type="pil", # very slow
file_types= list(IMAGE_EXTS) if self.media_mode == "image" else list(VIDEO_EXTS),
selected_index=self._initial_state["selected"], # server-side selection
)
# One-line controls
exts = sorted(IMAGE_EXTS if self.media_mode == "image" else VIDEO_EXTS) if self.accept_filter else None
@ -418,10 +432,10 @@ class AdvancedMediaGallery:
size="sm",
min_width=1,
)
self.btn_remove = gr.Button("Remove", size="sm", min_width=1)
self.btn_remove = gr.Button(" Remove ", size="sm", min_width=1)
self.btn_left = gr.Button("◀ Left", size="sm", visible=not self._initial_state["single"], min_width=1)
self.btn_right = gr.Button("Right ▶", size="sm", visible=not self._initial_state["single"], min_width=1)
self.btn_clear = gr.Button("Clear", variant="secondary", size="sm", visible=not self._initial_state["single"], min_width=1)
self.btn_clear = gr.Button(" Clear ", variant="secondary", size="sm", visible=not self._initial_state["single"], min_width=1)
return col
@ -430,14 +444,24 @@ class AdvancedMediaGallery:
self.gallery.select(
self._on_select,
inputs=[self.state, self.gallery],
# outputs=[self.state],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Media uploaded directly via the Gallery (click-to-add, drag and drop)
# self.gallery.change(
self.gallery.upload(
self._on_upload,
inputs=[self.gallery, self.state],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Gallery value changed by user actions (click-to-add, drag-drop, internal remove, etc.)
self.gallery.upload(
self._on_gallery_change,
inputs=[self.gallery, self.state],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Add via UploadButton
@ -445,6 +469,7 @@ class AdvancedMediaGallery:
self._on_add,
inputs=[self.upload_btn, self.state, self.gallery],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Remove selected
@ -452,6 +477,7 @@ class AdvancedMediaGallery:
self._on_remove,
inputs=[self.state, self.gallery],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Reorder using selected index, keep same item selected
@ -459,11 +485,13 @@ class AdvancedMediaGallery:
lambda st, gallery: self._on_move(-1, st, gallery),
inputs=[self.state, self.gallery],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
self.btn_right.click(
lambda st, gallery: self._on_move(+1, st, gallery),
inputs=[self.state, self.gallery],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Clear all
@ -471,6 +499,7 @@ class AdvancedMediaGallery:
self._on_clear,
inputs=[self.state],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
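# Hypothetical usage sketch (not part of this commit): how the component might be
# mounted inside a Blocks app. The constructor arguments below are assumptions
# inferred from the attributes used above (label, media_mode, columns), not a
# documented signature.
#
#   with gr.Blocks() as demo:
#       gallery = AdvancedMediaGallery(label="Reference Images", media_mode="image", columns=4)
#       gallery.mount()   # builds the UI and wires the events (update_form=False)
#   demo.launch()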
# ---------------- public API ----------------

View File

@ -19,6 +19,7 @@ import tempfile
import subprocess
import json
from functools import lru_cache
os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
from PIL import Image
@ -188,6 +189,14 @@ def get_outpainting_full_area_dimensions(frame_height,frame_width, outpainting_d
frame_width = int(frame_width * (100 + outpainting_left + outpainting_right) / 100)
return frame_height, frame_width
def rgb_bw_to_rgba_mask(img, thresh=127):
a = img.convert('L').point(lambda p: 255 if p > thresh else 0) # alpha
out = Image.new('RGBA', img.size, (255, 255, 255, 0)) # white, transparent
out.putalpha(a) # white where alpha=255
return out
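# Illustrative usage (not part of this commit): turn a black-and-white inpainting
# mask into an RGBA overlay, white where the mask is above the threshold and fully
# transparent elsewhere, so it can be composited over a preview image.
#
#   mask = Image.open("mask.png")                       # hypothetical file names
#   overlay = rgb_bw_to_rgba_mask(mask, thresh=127)
#   preview = Image.open("input.png").convert("RGBA")
#   preview.alpha_composite(overlay)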
def get_outpainting_frame_location(final_height, final_width, outpainting_dims, block_size = 8):
outpainting_top, outpainting_bottom, outpainting_left, outpainting_right= outpainting_dims
raw_height = int(final_height / ((100 + outpainting_top + outpainting_bottom) / 100))
@ -207,30 +216,62 @@ def get_outpainting_frame_location(final_height, final_width, outpainting_dims
if (margin_left + width) > final_width or outpainting_right == 0: margin_left = final_width - width
return height, width, margin_top, margin_left
# def calculate_new_dimensions(canvas_height, canvas_width, image_height, image_width, fit_into_canvas, block_size = 16):
#     if fit_into_canvas == None:
def rescale_and_crop(img, w, h):
ow, oh = img.size
target_ratio = w / h
orig_ratio = ow / oh
if orig_ratio > target_ratio:
# Crop width first
nw = int(oh * target_ratio)
img = img.crop(((ow - nw) // 2, 0, (ow + nw) // 2, oh))
else:
# Crop height first
nh = int(ow / target_ratio)
img = img.crop((0, (oh - nh) // 2, ow, (oh + nh) // 2))
return img.resize((w, h), Image.LANCZOS)
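# Worked example (illustrative only): center-cropping a 1920x1080 frame to an
# 832x480 target.
#   target_ratio = 832/480 ≈ 1.733, orig_ratio = 1920/1080 ≈ 1.778 > target_ratio
#   -> crop the width: nw = int(1080 * 1.733) = 1872, crop box = (24, 0, 1896, 1080)
#   -> resize the 1872x1080 crop to 832x480 with LANCZOS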
def calculate_new_dimensions(canvas_height, canvas_width, image_height, image_width, fit_into_canvas, block_size = 16):
if fit_into_canvas == None or fit_into_canvas == 2:
# return image_height, image_width
return canvas_height, canvas_width
if fit_into_canvas:
if fit_into_canvas == 1:
scale1 = min(canvas_height / image_height, canvas_width / image_width)
scale2 = min(canvas_width / image_height, canvas_height / image_width)
scale = max(scale1, scale2)
else:
else: # fit_into_canvas == 0: treat the canvas as a pixel budget (mode 2, crop, already returned above)
scale = (canvas_height * canvas_width / (image_height * image_width))**(1/2)
new_height = round( image_height * scale / block_size) * block_size
new_width = round( image_width * scale / block_size) * block_size
return new_height, new_width
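# Illustrative behaviour (not part of this commit) for a 1080x1920 image and a
# 480x832 canvas with block_size=16:
#   fit_into_canvas None or 2 -> (480, 832): the canvas dimensions themselves
#                                (mode 2 expects the caller to crop the image to fit)
#   fit_into_canvas 1         -> (464, 832): largest aspect-preserving fit inside the canvas
#   fit_into_canvas 0         -> (480, 848): same pixel budget as the canvas, aspect preserved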
# def resize_and_remove_background(img_list, budget_width, budget_height, rm_background, ignore_first, fit_into_canvas = False ):
def calculate_dimensions_and_resize_image(image, canvas_height, canvas_width, fit_into_canvas, fit_crop, block_size = 16):
if fit_crop:
image = rescale_and_crop(image, canvas_width, canvas_height)
new_width, new_height = image.size
else:
image_width, image_height = image.size
new_height, new_width = calculate_new_dimensions(canvas_height, canvas_width, image_height, image_width, fit_into_canvas, block_size = block_size )
image = image.resize((new_width, new_height), resample=Image.Resampling.LANCZOS)
return image, new_height, new_width
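# Hypothetical usage sketch (not part of this commit): preparing a start image for
# a 480x832 output, either center-cropped to exactly that size or resized to fit.
#
#   img = Image.open("start.png").convert("RGB")        # hypothetical file name
#   img, h, w = calculate_dimensions_and_resize_image(img, 480, 832, fit_into_canvas=2, fit_crop=True)
#   # or: aspect-preserving resize, dimensions rounded to a multiple of block_size
#   img, h, w = calculate_dimensions_and_resize_image(img, 480, 832, fit_into_canvas=1, fit_crop=False)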
def resize_and_remove_background(img_list, budget_width, budget_height, rm_background, any_background_ref, fit_into_canvas = 0, block_size= 16, outpainting_dims = None ):
if rm_background:
session = new_session()
output_list =[]
for i, img in enumerate(img_list):
width, height = img.size
# if fit_into_canvas:
if fit_into_canvas == None or any_background_ref == 1 and i==0 or any_background_ref == 2:
if outpainting_dims is not None:
resized_image =img
elif img.size != (budget_width, budget_height):
resized_image= img.resize((budget_width, budget_height), resample=Image.Resampling.LANCZOS)
else:
resized_image =img
elif fit_into_canvas == 1:
white_canvas = np.ones((budget_height, budget_width, 3), dtype=np.uint8) * 255
scale = min(budget_height / height, budget_width / width)
new_height = int(height * scale)
@ -242,10 +283,10 @@ def resize_and_remove_background(img_list, budget_width, budget_height, rm_backg
resized_image = Image.fromarray(white_canvas)
else:
scale = (budget_height * budget_width / (height * width))**(1/2)
# new_height = int( round(height * scale / 16) * 16)
# new_width = int( round(width * scale / 16) * 16)
new_height = int( round(height * scale / block_size) * block_size)
new_width = int( round(width * scale / block_size) * block_size)
resized_image= img.resize((new_width,new_height), resample=Image.Resampling.LANCZOS)
# if rm_background and not (ignore_first and i == 0) :
if rm_background and not (any_background_ref and i==0 or any_background_ref == 2) :
# resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1,alpha_matting_background_threshold = 70, alpha_foreground_background_threshold = 100, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
output_list.append(resized_image) #alpha_matting_background_threshold = 30, alpha_foreground_background_threshold = 200,
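# Hypothetical usage sketch (not part of this commit; the return value is assumed to
# be the processed list): prepare reference images for an 832x480 budget, keeping the
# first image untouched as a background reference and stripping the background of the rest.
#
#   refs = [Image.open(p).convert("RGB") for p in ("bg.png", "person.png")]   # hypothetical files
#   processed = resize_and_remove_background(refs, 832, 480, rm_background=True,
#                                            any_background_ref=1, fit_into_canvas=0)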

wgp.py (978)

File diff suppressed because it is too large