Vace improvements

This commit is contained in:
DeepBeepMeep 2025-05-23 21:51:00 +02:00
parent 6706709230
commit 86725a65d4
8 changed files with 631 additions and 343 deletions

View File

@ -21,6 +21,7 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTV Video models
## 🔥 Latest News!!
* May 23 2025: 👋 Wan 2.1GP v5.21 : Improvements for Vace: better transitions between Sliding Windows, support for Image masks in Matanyone, a new Extend Video option for Vace, and different types of automated background removal
* May 20 2025: 👋 Wan 2.1GP v5.2 : Added support for Wan CausVid which is a distilled Wan model that can generate nice looking videos in only 4 to 12 steps.
The great thing is that Kijai (kudos to him!) has created a CausVid Lora that can be combined with any existing Wan 14B t2v model such as Wan Vace 14B.
See instructions below on how to use CausVid.\
@ -307,17 +308,20 @@ You can define multiple lines of macros. If there is only one macro line, the ap
### VACE ControlNet introduction
Vace is a ControlNet 1.3B text2video model that allows you to do Video to Video and Reference to Video (inject your own images into the output video). So with Vace you can inject in the scene people or objects of your choice, animate a person, perform inpainting or outpainting, continue a video, ...
Vace is a ControlNet that allows you to do Video to Video and Reference to Video (inject your own images into the output video). It is probably one of the most powerful Wan models and you will be able to do amazing things once you master it: inject people or objects of your choice into the scene, animate a person, perform inpainting or outpainting, continue a video, ...
First you need to select the Vace 1.3B model in the Drop Down box at the top. Please note that Vace works well for the moment only with videos up to 5s (81 frames).
First you need to select the Vace 1.3B model or the Vace 14B model in the Drop Down box at the top. Please note that for the moment Vace works well only with videos up to 7s, with the RIFLEx option turned on.
Besides the usual Text Prompt, three new types of visual hints can be provided (and combined!):
- a Control Video: Based on your choice, you can decide to transfer the motion, the depth in a new Video. You can tell WanGP to use only the first n frames of Control Video and to extrapolate the rest. You can also do inpainting ). If the video contains area of grey color 127, they will be considered as masks and will be filled based on the Text prompt of the reference Images.
- *a Control Video*\
Based on your choice, you can decide to transfer the motion or the depth into a new Video. You can tell WanGP to use only the first n frames of the Control Video and to extrapolate the rest. You can also do inpainting: if the video contains areas of grey (color 127), they will be considered as masks and will be filled based on the Text Prompt and the Reference Images.
- reference Images: Use this to inject people or objects of your choice in the video. You can select multiple reference Images. The integration of the image is more efficient if the background is replaced by the full white color. You can do that with your preferred background remover or use the built in background remover by checking the box *Remove background*
- *Reference Images*\
A Reference Image can be either a background that you want to use as a setting for the video, or people or objects of your choice that you want to inject in the video. You can select multiple Reference Images. The integration of an object / person image is more efficient if its background is replaced by pure white. For complex background removal you can use the Image version of the Matanyone tool embedded in WanGP, or the fast on-the-fly background remover available in the *Remove background* drop-down box. Be careful not to remove the background of a reference image that is a landscape or setting (always the first reference image) which you want to use as a start image / background for the video. It helps greatly to reference and describe explicitly the injected objects / people of the Reference Images in the text prompt.
- *a Video Mask*\
This offers a stronger mechanism to tell Vace which parts should be kept (black) or replaced (white). You can also do inpainting / outpainting, or fill the missing parts of a video, more efficiently than with the Control Video hint alone. For instance, if a video mask is white except at the beginning and at the end where it is black, the first and last frames will be kept and everything in between will be generated (see the sketch after this list).
- a Video Mask
This offers a stronger mechanism to tell Vace which parts should be kept (black) or replaced (white). You can do as well inpainting / outpainting, fill the missing part of a video more efficientlty with just the video hint. If a video mask is white, it will be generated so with black frames at the beginning and at the end and the rest white, you could generate the missing frames in between.
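As a concrete illustration, here is a minimal sketch (not part of WanGP's code) of how such a video mask could be produced; it assumes numpy and imageio (with its ffmpeg plugin) are installed, and the resolution, frame count and file name are arbitrary:

```python
# Minimal sketch: build a Vace video mask where the first and last 16 frames are
# black (keep the original content) and everything in between is white (regenerate).
import numpy as np
import imageio

frames, height, width, keep = 81, 480, 832, 16
mask = np.full((frames, height, width, 3), 255, dtype=np.uint8)  # white = replace
mask[:keep] = 0      # keep the original first frames
mask[-keep:] = 0     # keep the original last frames
imageio.mimwrite("video_mask.mp4", mask, fps=16)
```

Load the resulting file in the *Video Mask* field together with the matching Control Video.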
Examples:
@ -340,9 +344,25 @@ Other recommended setttings for Vace:
- Set a medium-sized overlap window: long enough to give the model a sense of the motion but short enough so that any blurred overlapped frames do not turn the rest of the video into a blurred video
- Truncate at least the last 4 frames of each generated window, as Vace's last frames tend to be blurry
**WanGP integrates the Matanyone tool, which is tuned to work with Vace**.
### VACE and Sky Reels v2 Diffusion Forcing Slidig Window
With this mode (that works for the moment only with Vace and Sky Reels v2) you can merge mutiple Videos to form a very long video (up to 1 min).
This can be very useful to create, at the same time, a control video and a mask video that go together.\
For example, if you want to replace the face of a person in a video:
- load the video in the Matanyone tool
- click the face on the first frame and create a mask for it (if you have trouble selecting only the face, look at the tips below)
- generate both the control video and the mask video by clicking *Generate Video Matting*
- click *Export to current Video Input and Video Mask*
- in the *Reference Image* field of the Vace screen, load a picture of the replacement face
Please note that sometimes it may be useful to create *Background Masks*, for instance if you want to replace everything but a character that is in the video. You can do that by selecting *Background Mask* in the *Matanyone settings*
If you have some trouble creating the perfect mask, be aware of these tips:
- Using the Matanyone Settings you can also define Negative Point Prompts to remove parts of the current selection.
- Sometimes it is very hard to fit everything you want in a single mask; it may be much easier to combine multiple independent sub masks before producing the matting: each sub mask is created by selecting an area of an image and clicking the Add Mask button. Sub masks can then be enabled / disabled in the Matanyone settings.
### VACE, Sky Reels v2 Diffusion Forcing Sliding Window and LTX Video
With this mode (which works for the moment only with Vace, Sky Reels v2 and LTX Video) you can merge multiple videos to form a very long video (up to 1 min).
When combined with Vace, this feature can use the same control video to generate the full video that results from concatenating the different windows. For instance, the first 0-4s of the control video will be used to generate the first window, then the next 4-8s of the control video will be used to generate the second window, and so on. So if your control video contains a person walking, your generated video could contain up to one minute of this person walking.
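As a rough sketch of how the control video is consumed (assuming a window size of 81 frames, 5 overlapped frames and 4 discarded frames, see the next section), each new window re-reads the overlapped and discarded frames of the control video:

```python
# Illustrative only: which control-video frames feed each sliding window
window_size, overlap, discard = 81, 5, 4
step = window_size - overlap - discard
for w in range(3):
    start = w * step
    print(f"window {w + 1}: control frames {start}..{start + window_size - 1}")
# window 1: control frames 0..80
# window 2: control frames 72..152
# window 3: control frames 144..224
```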
@ -352,12 +372,16 @@ Sliding Windows are turned on by default and are triggered as soon as you try to
Although the window duration is set by the *Sliding Window Size* form field, the actual number of frames generated by each iteration will be less, because of the *overlap frames* and *discard last frames*:
- *overlap frames* : the first frames of a new window are filled with the last frames of the previous window in order to ensure continuity between the two windows
- *discard last frames* : quite often (Vace model Only) the last frames of a window have a worse quality. You can decide here how many ending frames of a new window should be dropped.
- *discard last frames* : sometimes (Vace 1.3B model only) the last frames of a window have worse quality. You can decide here how many ending frames of each new window should be dropped.
There is some inevitable quality degradation over time due to accumulated calculation errors. One trick to reduce or hide it is to add some noise (usually not noticeable) to the overlapped frames using the *add overlapped noise* option.
Number of Generated Frames = [Number of Windows - 1] * ([Window Size] - [Overlap Frames] - [Discard Last Frames]) + [Window Size]
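For example, with 3 windows of size 81, 5 overlapped frames and 4 discarded frames per window: (3 - 1) * (81 - 5 - 4) + 81 = 225 frames, i.e. roughly 14s at 16 fps. A minimal sketch of the same arithmetic (illustrative only, not WanGP code):

```python
def generated_frames(num_windows, window_size, overlap_frames, discard_last_frames):
    # Number of Generated Frames formula from above
    return (num_windows - 1) * (window_size - overlap_frames - discard_last_frames) + window_size

print(generated_frames(3, 81, 5, 4))  # 225
```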
Experimental: if your prompt is broken into multiple lines (each line separated by a carriage return), then each line of the prompt will be used for a new window. If there are more windows to generate than prompt lines, the last prompt line will be repeated.
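For instance (an illustrative prompt, not taken from the repository), with the two-line prompt below the first line drives the first window and the second line drives every remaining window:

```
A man walks along the beach at sunset
The man sits down and watches the waves
```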
### Command line parameters for Gradio Server
--i2v : launch the image to video generator\
--t2v : launch the text to video generator (default defined in the configuration)\

View File

@ -85,7 +85,7 @@ def get_frames_from_image(image_input, image_state):
model.samcontroler.sam_controler.reset_image()
model.samcontroler.sam_controler.set_image(image_state["origin_images"][0])
return image_state, image_info, image_state["origin_images"][0], \
gr.update(visible=True, maximum=10, value=10), gr.update(visible=True, maximum=len(frames), value=len(frames)), gr.update(visible=False, maximum=len(frames), value=len(frames)), \
gr.update(visible=True, maximum=10, value=10), gr.update(visible=False, maximum=len(frames), value=len(frames)), \
gr.update(visible=True), gr.update(visible=True), \
gr.update(visible=True), gr.update(visible=True),\
gr.update(visible=True), gr.update(visible=True), \
@ -273,6 +273,57 @@ def save_video(frames, output_path, fps):
return output_path
# image matting
def image_matting(video_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, refine_iter):
matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
if interactive_state["track_end_number"]:
following_frames = video_state["origin_images"][video_state["select_frame_number"]:interactive_state["track_end_number"]]
else:
following_frames = video_state["origin_images"][video_state["select_frame_number"]:]
if interactive_state["multi_mask"]["masks"]:
if len(mask_dropdown) == 0:
mask_dropdown = ["mask_001"]
mask_dropdown.sort()
template_mask = interactive_state["multi_mask"]["masks"][int(mask_dropdown[0].split("_")[1]) - 1] * (int(mask_dropdown[0].split("_")[1]))
for i in range(1,len(mask_dropdown)):
mask_number = int(mask_dropdown[i].split("_")[1]) - 1
template_mask = np.clip(template_mask+interactive_state["multi_mask"]["masks"][mask_number]*(mask_number+1), 0, mask_number+1)
video_state["masks"][video_state["select_frame_number"]]= template_mask
else:
template_mask = video_state["masks"][video_state["select_frame_number"]]
# operation error
if len(np.unique(template_mask))==1:
template_mask[0][0]=1
foreground, alpha = matanyone(matanyone_processor, following_frames, template_mask*255, r_erode=erode_kernel_size, r_dilate=dilate_kernel_size, n_warmup=refine_iter)
foreground_mat = False
output_frames = []
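# foreground_mat is hard-coded to False here, so the binarized alpha marks the
# background as 255 and the foreground as 0: the bitwise AND with the inverted mask
# keeps the original foreground pixels, and adding frame_grey fills the background
# with white, producing a white-background reference image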
for frame_origin, frame_alpha in zip(following_frames, alpha):
if foreground_mat:
frame_alpha[frame_alpha > 127] = 255
frame_alpha[frame_alpha <= 127] = 0
else:
frame_temp = frame_alpha.copy()
frame_alpha[frame_temp > 127] = 0
frame_alpha[frame_temp <= 127] = 255
output_frame = np.bitwise_and(frame_origin, 255-frame_alpha)
frame_grey = frame_alpha.copy()
frame_grey[frame_alpha == 255] = 255
output_frame += frame_grey
output_frames.append(output_frame)
foreground = output_frames
foreground_output = Image.fromarray(foreground[-1])
alpha_output = Image.fromarray(alpha[-1][:,:,0])
return foreground_output, gr.update(visible=True)
# video matting
def video_matting(video_state, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
@ -397,7 +448,7 @@ def restart():
"inference_times": 0,
"negative_click_times" : 0,
"positive_click_times": 0,
"mask_save": arg_mask_save,
"mask_save": False,
"multi_mask": {
"mask_names": [],
"masks": []
@ -457,6 +508,15 @@ def export_to_vace_video_input(foreground_video_output):
gr.Info("Masked Video Input transferred to Vace For Inpainting")
return "V#" + str(time.time()), foreground_video_output
def export_image(image_refs, image_output):
gr.Info("Masked Image transferred to Current Video")
# return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
if image_refs == None:
image_refs =[]
image_refs.append( image_output)
return image_refs
def export_to_current_video_engine(foreground_video_output, alpha_video_output):
gr.Info("Masked Video Input and Full Mask transferred to Current Video Engine For Inpainting")
# return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
@ -471,14 +531,17 @@ def teleport_to_vace_1_3B():
def teleport_to_vace_14B():
return gr.Tabs(selected="video_gen"), gr.Dropdown(value="vace_14B")
def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_video_guide_trigger):
def display(tabs, model_choice, vace_video_input, vace_video_mask, vace_image_refs, video_prompt_video_guide_trigger):
# my_tab.select(fn=load_unload_models, inputs=[], outputs=[])
media_url = "https://github.com/pq-yang/MatAnyone/releases/download/media/"
# download assets
gr.Markdown("Mast Edition is provided by MatAnyone")
gr.Markdown("<B>Mast Edition is provided by MatAnyone</B>")
gr.Markdown("If you have some trouble creating the perfect mask, be aware of these tips:")
gr.Markdown("- Using the Matanyone Settings you can also define Negative Point Prompts to remove parts of the current selection.")
gr.Markdown("- Sometime it is very hard to fit everything you want in a single mask, it may be much easier to combine multiple independent sub Masks before producing the Matting : each sub Mask is created by selecting an area of an image and by clicking the Add Mask button. Sub masks can then be enabled / disabled in the Matanyone settings.")
with gr.Column( visible=True):
with gr.Row():
@ -493,6 +556,11 @@ def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_
gr.Video(value="preprocessing/matanyone/tutorial_multi_targets.mp4", elem_classes="video")
with gr.Tabs():
with gr.TabItem("Video"):
click_state = gr.State([[],[]])
interactive_state = gr.State({
@ -568,9 +636,6 @@ def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_
scale=1)
mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False, scale=2)
gr.Markdown("---")
with gr.Column():
# input video
with gr.Row(equal_height=True):
with gr.Column(scale=2):
@ -613,6 +678,7 @@ def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_
export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [vace_video_input, vace_video_mask]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
# first step: get the video information
extract_frames_button.click(
fn=get_frames_from_video,
@ -706,3 +772,152 @@ def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_
inputs = [video_state, click_state,],
outputs = [template_frame,click_state],
)
with gr.TabItem("Image"):
click_state = gr.State([[],[]])
interactive_state = gr.State({
"inference_times": 0,
"negative_click_times" : 0,
"positive_click_times": 0,
"mask_save": False,
"multi_mask": {
"mask_names": [],
"masks": []
},
"track_end_number": None,
}
)
image_state = gr.State(
{
"user_name": "",
"image_name": "",
"origin_images": None,
"painted_images": None,
"masks": None,
"inpaint_masks": None,
"logits": None,
"select_frame_number": 0,
"fps": 30
}
)
with gr.Group(elem_classes="gr-monochrome-group", visible=True):
with gr.Row():
with gr.Accordion('MatAnyone Settings (click to expand)', open=False):
with gr.Row():
erode_kernel_size = gr.Slider(label='Erode Kernel Size',
minimum=0,
maximum=30,
step=1,
value=10,
info="Erosion on the added mask",
interactive=True)
dilate_kernel_size = gr.Slider(label='Dilate Kernel Size',
minimum=0,
maximum=30,
step=1,
value=10,
info="Dilation on the added mask",
interactive=True)
with gr.Row():
image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Num of Refinement Iterations", info="More iterations → More details & More time", visible=False)
track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Track end frame", visible=False)
with gr.Row():
point_prompt = gr.Radio(
choices=["Positive", "Negative"],
value="Positive",
label="Point Prompt",
info="Click to add positive or negative point for target mask",
interactive=True,
visible=False,
min_width=100,
scale=1)
mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False)
with gr.Column():
# input image
with gr.Row(equal_height=True):
with gr.Column(scale=2):
gr.Markdown("## Step1: Upload image")
with gr.Column(scale=2):
step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
with gr.Row(equal_height=True):
with gr.Column(scale=2):
image_input = gr.Image(label="Input Image", elem_classes="image")
extract_frames_button = gr.Button(value="Load Image", interactive=True, elem_classes="new_button")
with gr.Column(scale=2):
image_info = gr.Textbox(label="Image Info", visible=False)
template_frame = gr.Image(type="pil", label="Start Frame", interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
with gr.Row(equal_height=True, elem_classes="mask_button_group"):
clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, elem_classes="new_button", min_width=100)
add_mask_button = gr.Button(value="Add Mask", interactive=True, visible=False, elem_classes="new_button", min_width=100)
remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, elem_classes="new_button", min_width=100)
matting_button = gr.Button(value="Image Matting", interactive=True, visible=False, elem_classes="green_button", min_width=100)
# output image
with gr.Row(equal_height=True):
foreground_image_output = gr.Image(type="pil", label="Foreground Output", visible=False, elem_classes="image")
with gr.Row():
with gr.Row():
export_image_btn = gr.Button(value="Add to current Reference Images", visible=False, elem_classes="new_button")
with gr.Column(scale=2, visible= False):
alpha_image_output = gr.Image(type="pil", label="Alpha Output", visible=False, elem_classes="image")
alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
export_image_btn.click( fn=export_image, inputs= [vace_image_refs, foreground_image_output], outputs= [vace_image_refs]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
# first step: get the image information
extract_frames_button.click(
fn=get_frames_from_image,
inputs=[
image_input, image_state
],
outputs=[image_state, image_info, template_frame,
image_selection_slider, track_pause_number_slider,point_prompt, clear_button_click, add_mask_button, matting_button, template_frame,
foreground_image_output, alpha_image_output, export_image_btn, alpha_output_button, mask_dropdown, step2_title]
)
# second step: select images from slider
image_selection_slider.release(fn=select_image_template,
inputs=[image_selection_slider, image_state, interactive_state],
outputs=[template_frame, image_state, interactive_state], api_name="select_image")
track_pause_number_slider.release(fn=get_end_number,
inputs=[track_pause_number_slider, image_state, interactive_state],
outputs=[template_frame, interactive_state], api_name="end_image")
# click select image to get mask using sam
template_frame.select(
fn=sam_refine,
inputs=[image_state, point_prompt, click_state, interactive_state],
outputs=[template_frame, image_state, interactive_state]
)
# add different mask
add_mask_button.click(
fn=add_multi_mask,
inputs=[image_state, interactive_state, mask_dropdown],
outputs=[interactive_state, mask_dropdown, template_frame, click_state]
)
remove_mask_button.click(
fn=remove_multi_mask,
inputs=[interactive_state, mask_dropdown],
outputs=[interactive_state, mask_dropdown]
)
# image matting
matting_button.click(
fn=image_matting,
inputs=[image_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, image_selection_slider],
outputs=[foreground_image_output, export_image_btn]
)

Binary file not shown.

Binary file not shown.

View File

@ -111,7 +111,7 @@ class WanT2V:
self.adapt_vace_model()
def vace_encode_frames(self, frames, ref_images, masks=None, tile_size = 0, overlapped_latents = 0, overlap_noise = 0):
def vace_encode_frames(self, frames, ref_images, masks=None, tile_size = 0, overlapped_latents = None):
if ref_images is None:
ref_images = [None] * len(frames)
else:
@ -123,10 +123,10 @@ class WanT2V:
inactive = [i * (1 - m) + 0 * m for i, m in zip(frames, masks)]
reactive = [i * m + 0 * (1 - m) for i, m in zip(frames, masks)]
inactive = self.vae.encode(inactive, tile_size = tile_size)
# inactive = [ t * (1.0 - noise_factor) + torch.randn_like(t ) * noise_factor for t in inactive]
# if overlapped_latents > 0:
# for t in inactive:
# t[:, :overlapped_latents ] = t[:, :overlapped_latents ] * (1.0 - noise_factor) + torch.randn_like(t[:, :overlapped_latents ] ) * noise_factor
self.toto = inactive[0].clone()
if overlapped_latents != None :
# inactive[0][:, 0:1] = self.vae.encode([frames[0][:, 0:1]], tile_size = tile_size)[0] # redundant
inactive[0][:, 1:overlapped_latents.shape[1] + 1] = overlapped_latents
reactive = self.vae.encode(reactive, tile_size = tile_size)
latents = [torch.cat((u, c), dim=0) for u, c in zip(inactive, reactive)]
@ -190,13 +190,13 @@ class WanT2V:
num_frames = total_frames - prepend_count
if sub_src_mask is not None and sub_src_video is not None:
src_video[i], src_mask[i], _, _, _ = self.vid_proc.load_video_pair(sub_src_video, sub_src_mask, max_frames= num_frames, trim_video = trim_video - prepend_count, start_frame = start_frame, canvas_height = canvas_height, canvas_width = canvas_width, fit_into_canvas = fit_into_canvas)
# src_video is [-1, 1], 0 = inpainting area (in fact 127 in [0, 255])
# src_mask is [-1, 1], 0 = preserve original video (in fact 127 in [0, 255]) and 1 = Inpainting (in fact 255 in [0, 255])
# src_video is [-1, 1] (at this function output), 0 = inpainting area (in fact 127 in [0, 255])
# src_mask is [-1, 1] (at this function output), 0 = preserve original video (in fact 127 in [0, 255]) and 1 = Inpainting (in fact 255 in [0, 255])
src_video[i] = src_video[i].to(device)
src_mask[i] = src_mask[i].to(device)
if prepend_count > 0:
src_video[i] = torch.cat( [sub_pre_src_video, src_video[i]], dim=1)
src_mask[i] = torch.cat( [torch.zeros_like(sub_pre_src_video), src_mask[i]] ,1)
src_mask[i] = torch.cat( [torch.full_like(sub_pre_src_video, -1.0), src_mask[i]] ,1)
src_video_shape = src_video[i].shape
if src_video_shape[1] != total_frames:
src_video[i] = torch.cat( [src_video[i], src_video[i].new_zeros(src_video_shape[0], total_frames -src_video_shape[1], *src_video_shape[-2:])], dim=1)
@ -300,7 +300,8 @@ class WanT2V:
slg_end = 1.0,
cfg_star_switch = True,
cfg_zero_step = 5,
overlapped_latents = 0,
overlapped_latents = None,
return_latent_slice = None,
overlap_noise = 0,
model_filename = None,
**bbargs
@ -373,8 +374,10 @@ class WanT2V:
input_frames = [u.to(self.device) for u in input_frames]
input_ref_images = [ None if u == None else [v.to(self.device) for v in u] for u in input_ref_images]
input_masks = [u.to(self.device) for u in input_masks]
z0 = self.vace_encode_frames(input_frames, input_ref_images, masks=input_masks, tile_size = VAE_tile_size, overlapped_latents = overlapped_latents, overlap_noise = overlap_noise )
previous_latents = None
# if overlapped_latents != None:
# input_ref_images = [u[-1:] for u in input_ref_images]
z0 = self.vace_encode_frames(input_frames, input_ref_images, masks=input_masks, tile_size = VAE_tile_size, overlapped_latents = overlapped_latents )
m0 = self.vace_encode_masks(input_masks, input_ref_images)
z = self.vace_latent(z0, m0)
@ -442,8 +445,9 @@ class WanT2V:
if vace:
ref_images_count = len(input_ref_images[0]) if input_ref_images != None and input_ref_images[0] != None else 0
kwargs.update({'vace_context' : z, 'vace_context_scale' : context_scale})
if overlapped_latents > 0:
z_reactive = [ zz[0:16, ref_images_count:overlapped_latents + ref_images_count].clone() for zz in z]
if overlapped_latents != None:
overlapped_latents_size = overlapped_latents.shape[1] + 1
z_reactive = [ zz[0:16, 0:overlapped_latents_size + ref_images_count].clone() for zz in z]
if self.model.enable_teacache:
@ -453,13 +457,14 @@ class WanT2V:
if callback != None:
callback(-1, None, True)
for i, t in enumerate(tqdm(timesteps)):
if vace and overlapped_latents > 0 :
# noise_factor = overlap_noise *(i/(len(timesteps)-1)) / 1000
noise_factor = overlap_noise / 1000 # * (999-t) / 999
# noise_factor = overlap_noise / 1000 # * t / 999
for zz, zz_r in zip(z, z_reactive):
zz[0:16, ref_images_count:overlapped_latents + ref_images_count] = zz_r * (1.0 - noise_factor) + torch.randn_like(zz_r ) * noise_factor
if overlapped_latents != None:
# overlap_noise_factor = overlap_noise *(i/(len(timesteps)-1)) / 1000
overlap_noise_factor = overlap_noise / 1000
latent_noise_factor = t / 1000
for zz, zz_r, ll in zip(z, z_reactive, [latents]):
zz[0:16, ref_images_count:overlapped_latents_size + ref_images_count] = zz_r[:, ref_images_count:] * (1.0 - overlap_noise_factor) + torch.randn_like(zz_r[:, ref_images_count:] ) * overlap_noise_factor
ll[:, 0:overlapped_latents_size + ref_images_count] = zz_r * (1.0 - latent_noise_factor) + torch.randn_like(zz_r ) * latent_noise_factor
if target_camera != None:
latent_model_input = torch.cat([latents, source_latents], dim=1)
else:
@ -552,6 +557,13 @@ class WanT2V:
x0 = [latents]
if return_latent_slice != None:
if overlapped_latents != None:
# latents [:, 1:] = self.toto
for zz, zz_r, ll in zip(z, z_reactive, [latents]):
ll[:, 0:overlapped_latents_size + ref_images_count] = zz_r
latent_slice = latents[:, return_latent_slice].clone()
if input_frames == None:
if phantom:
# phantom post processing
@ -560,11 +572,9 @@ class WanT2V:
else:
# vace post processing
videos = self.decode_latent(x0, input_ref_images, VAE_tile_size)
del latents
del sample_scheduler
return videos[0] if self.rank == 0 else None
if return_latent_slice != None:
return { "x" : videos[0], "latent_slice" : latent_slice }
return videos[0]
def adapt_vace_model(self):
model = self.model

View File

@ -91,11 +91,11 @@ def calculate_new_dimensions(canvas_height, canvas_width, height, width, fit_int
return new_height, new_width
def resize_and_remove_background(img_list, budget_width, budget_height, rm_background, fit_into_canvas = False ):
if rm_background:
if rm_background > 0:
session = new_session()
output_list =[]
for img in img_list:
for i, img in enumerate(img_list):
width, height = img.size
if fit_into_canvas:
@ -113,9 +113,10 @@ def resize_and_remove_background(img_list, budget_width, budget_height, rm_backg
new_height = int( round(height * scale / 16) * 16)
new_width = int( round(width * scale / 16) * 16)
resized_image= img.resize((new_width,new_height), resample=Image.Resampling.LANCZOS)
if rm_background:
resized_image = remove(resized_image, session=session, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
output_list.append(resized_image)
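# rm_background modes (matching the 'Remove Background of Images References' drop-down in the UI):
# 0 = keep every background, 1 = remove it for every image, 2 = keep it for the
# first image (landscape / setting) and remove it for the following ones
# (note that 'and' binds tighter than 'or' in the condition below)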
if rm_background == 1 or rm_background == 2 and i > 0 :
# resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1,alpha_matting_background_threshold = 70, alpha_foreground_background_threshold = 100, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
output_list.append(resized_image) #alpha_matting_background_threshold = 30, alpha_foreground_background_threshold = 200,
return output_list

wgp.py (206 changed lines)
View File

@ -204,9 +204,6 @@ def process_prompt_and_add_tasks(state, model_choice):
if isinstance(image_refs, list):
image_refs = [ convert_image(tup[0]) for tup in image_refs ]
# os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
# from wan.utils.utils import resize_and_remove_background
# image_refs = resize_and_remove_background(image_refs, width, height, inputs["remove_background_image_ref"] ==1, fit_into_canvas= True)
if len(prompts) > 0:
@ -333,8 +330,10 @@ def process_prompt_and_add_tasks(state, model_choice):
if "O" in video_prompt_type :
keep_frames_video_guide= inputs["keep_frames_video_guide"]
video_length = inputs["video_length"]
if len(keep_frames_video_guide) ==0:
gr.Info(f"Warning : you have asked to reuse all the frames of the control Video in the Alternate Video Ending it. Please make sure the number of frames of the control Video is lower than the total number of frames to generate otherwise it won't make a difference.")
if len(keep_frames_video_guide) > 0:
gr.Info("Keeping Frames with Extending Video is not yet supported")
return
# gr.Info(f"Warning : you have asked to reuse all the frames of the control Video in the Alternate Video Ending it. Please make sure the number of frames of the control Video is lower than the total number of frames to generate otherwise it won't make a difference.")
# elif keep_frames >= video_length:
# gr.Info(f"The number of frames in the control Video to reuse ({keep_frames_video_guide}) in Alternate Video Ending can not be bigger than the total number of frames ({video_length}) to generate.")
# return
@ -349,11 +348,6 @@ def process_prompt_and_add_tasks(state, model_choice):
if isinstance(image_refs, list):
image_refs = [ convert_image(tup[0]) for tup in image_refs ]
# os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
# from wan.utils.utils import resize_and_remove_background
# image_refs = resize_and_remove_background(image_refs, width, height, inputs["remove_background_image_ref"] ==1)
if len(prompts) > 0:
prompts = ["\n".join(prompts)]
@ -1464,7 +1458,6 @@ lock_ui_attention = False
lock_ui_transformer = False
lock_ui_compile = False
preload =int(args.preload)
force_profile_no = int(args.profile)
verbose_level = int(args.verbose)
quantizeTransformer = args.quantize_transformer
@ -1482,17 +1475,21 @@ if os.path.isfile("t2v_settings.json"):
if not os.path.isfile(server_config_filename) and os.path.isfile("gradio_config.json"):
shutil.move("gradio_config.json", server_config_filename)
if not os.path.isdir("ckpts/umt5-xxl/"):
os.makedirs("ckpts/umt5-xxl/")
src_move = [ "ckpts/models_clip_open-clip-xlm-roberta-large-vit-huge-14-bf16.safetensors", "ckpts/models_t5_umt5-xxl-enc-bf16.safetensors", "ckpts/models_t5_umt5-xxl-enc-quanto_int8.safetensors" ]
tgt_move = [ "ckpts/xlm-roberta-large/", "ckpts/umt5-xxl/", "ckpts/umt5-xxl/"]
for src,tgt in zip(src_move,tgt_move):
if os.path.isfile(src):
try:
if os.path.isfile(tgt):
os.remove(src)
else:
shutil.move(src, tgt)
except:
pass
if not Path(server_config_filename).is_file():
server_config = {"attention_mode" : "auto",
"transformer_types": [],
@ -1755,7 +1752,10 @@ def get_default_settings(filename):
"flow_shift": 13,
"resolution": "1280x720"
})
elif get_model_type(filename) in ("vace_14B"):
ui_defaults.update({
"sliding_window_discard_last_frames": 0,
})
with open(defaults_filename, "w", encoding="utf-8") as f:
@ -2136,6 +2136,9 @@ def load_models(model_filename):
global transformer_filename, transformer_loras_filenames
model_family = get_model_family(model_filename)
perc_reserved_mem_max = args.perc_reserved_mem_max
preload =int(args.preload)
if preload == 0:
preload = server_config.get("preload_in_VRAM", 0)
new_transformer_loras_filenames = None
dependent_models = get_dependent_models(model_filename, quantization= transformer_quantization, dtype_policy = transformer_dtype_policy)
new_transformer_loras_filenames = [model_filename] if "_lora" in model_filename else None
@ -2259,7 +2262,8 @@ def apply_changes( state,
preload_model_policy_choice = 1,
UI_theme_choice = "default",
enhancer_enabled_choice = 0,
fit_canvas_choice = 0
fit_canvas_choice = 0,
preload_in_VRAM_choice = 0
):
if args.lock_config:
return
@ -2284,6 +2288,7 @@ def apply_changes( state,
"UI_theme" : UI_theme_choice,
"fit_canvas": fit_canvas_choice,
"enhancer_enabled" : enhancer_enabled_choice,
"preload_in_VRAM" : preload_in_VRAM_choice
}
if Path(server_config_filename).is_file():
@ -2456,26 +2461,20 @@ def refresh_gallery(state): #, msg
prompt = "<BR><DIV style='height:8px'></DIV>".join(prompts)
if enhanced:
prompt = "<U><B>Enhanced:</B></U><BR>" + prompt
list_uri = []
start_img_uri = task.get('start_image_data_base64')
start_img_uri = start_img_uri[0] if start_img_uri !=None else None
if start_img_uri != None:
list_uri += start_img_uri
end_img_uri = task.get('end_image_data_base64')
end_img_uri = end_img_uri[0] if end_img_uri !=None else None
if end_img_uri != None:
list_uri += end_img_uri
thumbnail_size = "100px"
if start_img_uri:
start_img_md = f'<img src="{start_img_uri}" alt="Start" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" />'
if end_img_uri:
end_img_md = f'<img src="{end_img_uri}" alt="End" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" />'
thumbnails = ""
for img_uri in list_uri:
thumbnails += f'<TD><img src="{img_uri}" alt="Start" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" /></TD>'
label = f"Prompt of Video being Generated"
html = "<STYLE> #PINFO, #PINFO th, #PINFO td {border: 1px solid #CCCCCC;background-color:#FFFFFF;}</STYLE><TABLE WIDTH=100% ID=PINFO ><TR><TD width=100%>" + prompt + "</TD>"
if start_img_md != "":
html += "<TD>" + start_img_md + "</TD>"
if end_img_md != "":
html += "<TD>" + end_img_md + "</TD>"
html += "</TR></TABLE>"
html = "<STYLE> #PINFO, #PINFO th, #PINFO td {border: 1px solid #CCCCCC;background-color:#FFFFFF;}</STYLE><TABLE WIDTH=100% ID=PINFO ><TR><TD width=100%>" + prompt + "</TD>" + thumbnails + "</TR></TABLE>"
html_output = gr.HTML(html, visible= True)
return gr.Gallery(selected_index=choice, value = file_list), html_output, gr.Button(visible=False), gr.Button(visible=True), gr.Row(visible=True), update_queue_data(queue), gr.Button(interactive= abort_interactive), gr.Button(visible= onemorewindow_visible)
@ -2680,7 +2679,7 @@ def generate_video(
sliding_window_overlap,
sliding_window_overlap_noise,
sliding_window_discard_last_frames,
remove_background_image_ref,
remove_background_images_ref,
temporal_upsampling,
spatial_upsampling,
RIFLEx_setting,
@ -2816,13 +2815,14 @@ def generate_video(
fps = 30
else:
fps = 16
latent_size = 8 if ltxv else 4
original_image_refs = image_refs
if image_refs != None and len(image_refs) > 0 and (hunyuan_custom or phantom or vace):
send_cmd("progress", [0, get_latest_status(state, "Removing Images References Background")])
os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
from wan.utils.utils import resize_and_remove_background
image_refs = resize_and_remove_background(image_refs, width, height, remove_background_image_ref ==1, fit_into_canvas= not vace)
image_refs = resize_and_remove_background(image_refs, width, height, remove_background_images_ref, fit_into_canvas= not vace)
update_task_thumbnails(task, locals())
send_cmd("output")
@ -2879,7 +2879,6 @@ def generate_video(
repeat_no = 0
extra_generation = 0
initial_total_windows = 0
max_frames_to_generate = video_length
if diffusion_forcing or vace or ltxv:
reuse_frames = min(sliding_window_size - 4, sliding_window_overlap)
else:
@ -2888,8 +2887,9 @@ def generate_video(
video_length += sliding_window_overlap
sliding_window = (vace or diffusion_forcing or ltxv) and video_length > sliding_window_size
if sliding_window:
discard_last_frames = sliding_window_discard_last_frames
default_max_frames_to_generate = video_length
if sliding_window:
left_after_first_window = video_length - sliding_window_size + discard_last_frames
initial_total_windows= 1 + math.ceil(left_after_first_window / (sliding_window_size - discard_last_frames - reuse_frames))
video_length = sliding_window_size
@ -2913,6 +2913,7 @@ def generate_video(
prefix_video_frames_count = 0
frames_already_processed = None
pre_video_guide = None
overlapped_latents = None
window_no = 0
extra_windows = 0
guide_start_frame = 0
@ -2920,6 +2921,8 @@ def generate_video(
gen["extra_windows"] = 0
gen["total_windows"] = 1
gen["window_no"] = 1
num_frames_generated = 0
max_frames_to_generate = default_max_frames_to_generate
start_time = time.time()
if prompt_enhancer_image_caption_model != None and prompt_enhancer !=None and len(prompt_enhancer)>0:
text_encoder_max_tokens = 256
@ -2955,38 +2958,50 @@ def generate_video(
while not abort:
if sliding_window:
prompt = prompts[window_no] if window_no < len(prompts) else prompts[-1]
extra_windows += gen.get("extra_windows",0)
if extra_windows > 0:
video_length = sliding_window_size
new_extra_windows = gen.get("extra_windows",0)
gen["extra_windows"] = 0
extra_windows += new_extra_windows
max_frames_to_generate += new_extra_windows * (sliding_window_size - discard_last_frames - reuse_frames)
sliding_window = sliding_window or extra_windows > 0
if sliding_window and window_no > 0:
num_frames_generated -= reuse_frames
if (max_frames_to_generate - prefix_video_frames_count - num_frames_generated) < latent_size:
break
video_length = min(sliding_window_size, ((max_frames_to_generate - num_frames_generated - prefix_video_frames_count + reuse_frames + discard_last_frames) // latent_size) * latent_size + 1 )
total_windows = initial_total_windows + extra_windows
gen["total_windows"] = total_windows
if window_no >= total_windows:
break
window_no += 1
gen["window_no"] = window_no
return_latent_slice = None
if reuse_frames > 0:
return_latent_slice = slice(-(reuse_frames - 1 + discard_last_frames ) // latent_size, None if discard_last_frames == 0 else -(discard_last_frames // latent_size) )
if hunyuan_custom:
src_ref_images = image_refs
elif phantom:
src_ref_images = image_refs.copy() if image_refs != None else None
elif diffusion_forcing or ltxv:
elif diffusion_forcing or ltxv or vace and "O" in video_prompt_type:
if vace:
video_source = video_guide
video_guide = None
if video_source != None and len(video_source) > 0 and window_no == 1:
keep_frames_video_source= 1000 if len(keep_frames_video_source) ==0 else int(keep_frames_video_source)
keep_frames_video_source = (keep_frames_video_source // latent_size ) * latent_size + 1
prefix_video = preprocess_video(None, width=width, height=height,video_in=video_source, max_frames= keep_frames_video_source , start_frame = 0, fit_canvas= fit_canvas, target_fps = fps, block_size = 32 if ltxv else 16)
prefix_video = prefix_video .permute(3, 0, 1, 2)
prefix_video = prefix_video .float().div_(127.5).sub_(1.) # c, f, h, w
prefix_video_frames_count = prefix_video.shape[1]
pre_video_guide = prefix_video[:, -reuse_frames:]
elif vace:
# video_prompt_type = video_prompt_type +"G"
prefix_video_frames_count = pre_video_guide.shape[1]
if vace:
height, width = pre_video_guide.shape[-2:]
if vace:
image_refs_copy = image_refs.copy() if image_refs != None else None # required since prepare_source do inplace modifications
video_guide_copy = video_guide
video_mask_copy = video_mask
if any(process in video_prompt_type for process in ("P", "D", "G")) :
prompts_max = gen["prompts_max"]
preprocess_type = None
if "P" in video_prompt_type :
progress_args = [0, get_latest_status(state,"Extracting Open Pose Information")]
@ -3005,8 +3020,11 @@ def generate_video(
if len(error) > 0:
raise gr.Error(f"invalid keep frames {keep_frames_video_guide}")
keep_frames_parsed = keep_frames_parsed[guide_start_frame: guide_start_frame + video_length]
if window_no == 1:
image_size = (height, width) # VACE_SIZE_CONFIGS[resolution_reformated] # default frame dimensions until it is set by video_src (if there is any)
image_size = (height, width) # default frame dimensions until it is set by video_src (if there is any)
src_video, src_mask, src_ref_images = wan_model.prepare_source([video_guide_copy],
[video_mask_copy ],
[image_refs_copy],
@ -3017,22 +3035,13 @@ def generate_video(
pre_src_video = [pre_video_guide],
fit_into_canvas = fit_canvas
)
# if window_no == 1 and src_video != None and len(src_video) > 0:
# image_size = src_video[0].shape[-2:]
prompts_max = gen["prompts_max"]
status = get_latest_status(state)
gen["progress_status"] = status
gen["progress_phase"] = ("Encoding Prompt", -1 )
callback = build_callback(state, trans, send_cmd, status, num_inference_steps)
progress_args = [0, merge_status_context(status, "Encoding Prompt")]
send_cmd("progress", progress_args)
# samples = torch.empty( (1,2)) #for testing
# if False:
try:
if trans.enable_teacache:
trans.teacache_counter = 0
trans.num_steps = num_inference_steps
@ -3040,6 +3049,10 @@ def generate_video(
trans.previous_residual = None
trans.previous_modulated_input = None
# samples = torch.empty( (1,2)) #for testing
# if False:
try:
samples = wan_model.generate(
input_prompt = prompt,
image_start = image_start,
@ -3049,7 +3062,7 @@ def generate_video(
input_masks = src_mask,
input_video= pre_video_guide if diffusion_forcing or ltxv else source_video,
target_camera= target_camera,
frame_num=(video_length // 4)* 4 + 1,
frame_num=(video_length // latent_size)* latent_size + 1,
height = height,
width = width,
fit_into_canvas = fit_canvas == 1,
@ -3076,7 +3089,8 @@ def generate_video(
causal_block_size = 5,
causal_attention = True,
fps = fps,
overlapped_latents = 0 if reuse_frames == 0 or window_no == 1 else ((reuse_frames - 1) // 4 + 1),
overlapped_latents = overlapped_latents,
return_latent_slice= return_latent_slice,
overlap_noise = sliding_window_overlap_noise,
model_filename = model_filename,
)
@ -3109,6 +3123,7 @@ def generate_video(
tb = traceback.format_exc().split('\n')[:-1]
print('\n'.join(tb))
send_cmd("error", new_error)
clear_status(state)
return
finally:
trans.previous_residual = None
@ -3118,33 +3133,42 @@ def generate_video(
print(f"Teacache Skipped Steps:{trans.teacache_skipped_steps}/{trans.num_steps}" )
if samples != None:
if isinstance(samples, dict):
overlapped_latents = samples.get("latent_slice", None)
samples= samples["x"]
samples = samples.to("cpu")
offload.last_offload_obj.unload_all()
gc.collect()
torch.cuda.empty_cache()
# time_flag = datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d-%Hh%Mm%Ss")
# save_prompt = "_in_" + original_prompts[0]
# file_name = f"{time_flag}_seed{seed}_{sanitize_file_name(save_prompt[:50]).strip()}.mp4"
# sample = samples.cpu()
# cache_video( tensor=sample[None].clone(), save_file=os.path.join(save_path, file_name), fps=16, nrow=1, normalize=True, value_range=(-1, 1))
if samples == None:
abort = True
state["prompt"] = ""
send_cmd("output")
else:
sample = samples.cpu()
if True: # for testing
torch.save(sample, "output.pt")
else:
sample =torch.load("output.pt")
# if True: # for testing
# torch.save(sample, "output.pt")
# else:
# sample =torch.load("output.pt")
if gen.get("extra_windows",0) > 0:
sliding_window = True
if sliding_window :
guide_start_frame += video_length
if discard_last_frames > 0:
sample = sample[: , :-discard_last_frames]
guide_start_frame -= discard_last_frames
if reuse_frames == 0:
pre_video_guide = sample[:,9999 :]
pre_video_guide = sample[:,9999 :].clone()
else:
# noise_factor = 200/ 1000
# pre_video_guide = sample[:, -reuse_frames:] * (1.0 - noise_factor) + torch.randn_like(sample[:, -reuse_frames:] ) * noise_factor
pre_video_guide = sample[:, -reuse_frames:]
pre_video_guide = sample[:, -reuse_frames:].clone()
num_frames_generated += sample.shape[1]
if prefix_video != None:
@ -3158,7 +3182,6 @@ def generate_video(
sample = sample[: , :]
else:
sample = sample[: , reuse_frames:]
guide_start_frame -= reuse_frames
exp = 0
@ -3252,15 +3275,9 @@ def generate_video(
print(f"New video saved to Path: "+video_path)
file_list.append(video_path)
send_cmd("output")
if sliding_window :
if max_frames_to_generate > 0 and extra_windows == 0:
current_length = sample.shape[1]
if (current_length - prefix_video_frames_count)>= max_frames_to_generate:
break
video_length = min(sliding_window_size, ((max_frames_to_generate - (current_length - prefix_video_frames_count) + reuse_frames + discard_last_frames) // 4) * 4 + 1 )
seed += 1
clear_status(state)
if temp_filename!= None and os.path.isfile(temp_filename):
os.remove(temp_filename)
offload.unload_loras_from_model(trans)
@ -3631,6 +3648,15 @@ def merge_status_context(status="", context=""):
else:
return status + " - " + context
def clear_status(state):
gen = get_gen_info(state)
gen["extra_windows"] = 0
gen["total_windows"] = 1
gen["window_no"] = 1
gen["extra_orders"] = 0
gen["repeat_no"] = 0
gen["total_generation"] = 0
def get_latest_status(state, context=""):
gen = get_gen_info(state)
prompt_no = gen["prompt_no"]
@ -3999,7 +4025,7 @@ def prepare_inputs_dict(target, inputs ):
inputs.pop("model_mode")
if not "Vace" in model_filename or not "phantom" in model_filename or not "hunyuan_video_custom" in model_filename:
unsaved_params = ["keep_frames_video_guide", "video_prompt_type", "remove_background_image_ref"]
unsaved_params = ["keep_frames_video_guide", "video_prompt_type", "remove_background_images_ref"]
for k in unsaved_params:
inputs.pop(k)
@ -4066,7 +4092,7 @@ def save_inputs(
sliding_window_overlap,
sliding_window_overlap_noise,
sliding_window_discard_last_frames,
remove_background_image_ref,
remove_background_images_ref,
temporal_upsampling,
spatial_upsampling,
RIFLEx_setting,
@ -4458,7 +4484,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
("Transfer Human Motion from the Control Video", "PV"),
("Transfer Depth from the Control Video", "DV"),
("Recolorize the Control Video", "CV"),
# ("Alternate Video Ending", "OV"),
("Extend Video", "OV"),
("Video contains Open Pose, Depth, Black & White, Inpainting ", "V"),
("Control Video and Mask video for Inpainting ", "MV"),
],
@ -4489,7 +4515,17 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
)
# with gr.Row():
remove_background_image_ref = gr.Checkbox(value=ui_defaults.get("remove_background_image_ref",1), label= "Remove Background of Images References", visible= "I" in video_prompt_type_value, scale =1 )
remove_background_images_ref = gr.Dropdown(
choices=[
("Keep Backgrounds of All Images (landscape)", 0),
("Remove Backgrounds of All Images (objects / faces)", 1),
("Keep it for first Image (landscape) and remove it for other Images (objects / faces)", 2),
],
value=ui_defaults.get("remove_background_images_ref",1),
label="Remove Background of Images References", scale = 3, visible= "I" in video_prompt_type_value
)
# remove_background_images_ref = gr.Checkbox(value=ui_defaults.get("remove_background_images_ref",1), label= "Remove Background of Images References", visible= "I" in video_prompt_type_value, scale =1 )
video_mask = gr.Video(label= "Video Mask (for Inpainting or Outpainting, white pixels = Mask)", visible= "M" in video_prompt_type_value, value= ui_defaults.get("video_mask", None))
@ -4730,7 +4766,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
else:
sliding_window_size = gr.Slider(5, 137, value=ui_defaults.get("sliding_window_size", 81), step=4, label="Sliding Window Size")
sliding_window_overlap = gr.Slider(1, 97, value=ui_defaults.get("sliding_window_overlap",5), step=4, label="Windows Frames Overlap (needed to maintain continuity between windows, a higher value will require more windows)")
sliding_window_overlap_noise = gr.Slider(0, 100, value=ui_defaults.get("sliding_window_overlap_noise",20), step=1, label="Noise to be added to overlapped frames to reduce blur effect")
sliding_window_overlap_noise = gr.Slider(0, 150, value=ui_defaults.get("sliding_window_overlap_noise",20), step=1, label="Noise to be added to overlapped frames to reduce blur effect")
sliding_window_discard_last_frames = gr.Slider(0, 20, value=ui_defaults.get("sliding_window_discard_last_frames", 8), step=4, label="Discard Last Frames of a Window (that may have bad quality)", visible = True)
@ -4811,7 +4847,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
image_prompt_type.change(fn=refresh_image_prompt_type, inputs=[state, image_prompt_type], outputs=[image_start, image_end, video_source, keep_frames_video_source] )
video_prompt_video_guide_trigger.change(fn=refresh_video_prompt_video_guide_trigger, inputs=[video_prompt_type, video_prompt_video_guide_trigger], outputs=[video_prompt_type, video_prompt_type_video_guide, video_guide, video_mask, keep_frames_video_guide])
video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_image_ref ])
video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_images_ref ])
video_prompt_type_video_guide.input(fn=refresh_video_prompt_type_video_guide, inputs = [video_prompt_type, video_prompt_type_video_guide], outputs = [video_prompt_type, video_guide, keep_frames_video_guide, video_mask])
show_advanced.change(fn=switch_advanced, inputs=[state, show_advanced, lset_name], outputs=[advanced_row, preset_buttons_rows, refresh_lora_btn, refresh2_row ,lset_name ]).then(
@ -5036,7 +5072,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
)
return ( state, loras_choices, lset_name, state,
video_guide, video_mask, video_prompt_video_guide_trigger, prompt_enhancer
video_guide, video_mask, image_refs, video_prompt_video_guide_trigger, prompt_enhancer
)
@ -5250,6 +5286,7 @@ def generate_configuration_tab(state, blocks, header, model_choice, prompt_enhan
value= profile,
label="Profile (for power users only, not needed to change it)"
)
preload_in_VRAM_choice = gr.Slider(0, 40000, value=server_config.get("preload_in_VRAM", 0), step=100, label="Number of MB of Models that are Preloaded in VRAM (0 will use Profile default)")
@ -5277,7 +5314,8 @@ def generate_configuration_tab(state, blocks, header, model_choice, prompt_enhan
preload_model_policy_choice,
UI_theme_choice,
enhancer_enabled_choice,
fit_canvas_choice
fit_canvas_choice,
preload_in_VRAM_choice
],
outputs= [msg , header, model_choice, prompt_enhancer_row]
)
@ -5661,7 +5699,7 @@ def create_demo():
theme = gr.themes.Soft(font=["Verdana"], primary_hue="sky", neutral_hue="slate", text_size="md")
with gr.Blocks(css=css, theme=theme, title= "WanGP") as main:
gr.Markdown("<div align=center><H1>Wan<SUP>GP</SUP> v5.2 <FONT SIZE=4>by <I>DeepBeepMeep</I></FONT> <FONT SIZE=3>") # (<A HREF='https://github.com/deepbeepmeep/Wan2GP'>Updates</A>)</FONT SIZE=3></H1></div>")
gr.Markdown("<div align=center><H1>Wan<SUP>GP</SUP> v5.21 <FONT SIZE=4>by <I>DeepBeepMeep</I></FONT> <FONT SIZE=3>") # (<A HREF='https://github.com/deepbeepmeep/Wan2GP'>Updates</A>)</FONT SIZE=3></H1></div>")
global model_list
tab_state = gr.State({ "tab_no":0 })
@ -5680,7 +5718,7 @@ def create_demo():
header = gr.Markdown(generate_header(transformer_filename, compile, attention_mode), visible= True)
with gr.Row():
( state, loras_choices, lset_name, state,
video_guide, video_mask, video_prompt_type_video_trigger, prompt_enhancer_row
video_guide, video_mask, image_refs, video_prompt_type_video_trigger, prompt_enhancer_row
) = generate_video_tab(model_choice=model_choice, header=header, main = main)
with gr.Tab("Informations", id="info"):
generate_info_tab()
@ -5688,7 +5726,7 @@ def create_demo():
from preprocessing.matanyone import app as matanyone_app
vmc_event_handler = matanyone_app.get_vmc_event_handler()
matanyone_app.display(main_tabs, model_choice, video_guide, video_mask, video_prompt_type_video_trigger)
matanyone_app.display(main_tabs, model_choice, video_guide, video_mask, image_refs, video_prompt_type_video_trigger)
if not args.lock_config:
with gr.Tab("Downloads", id="downloads") as downloads_tab:
generate_download_tab(lset_name, loras_choices, state)