Vace improvements

DeepBeepMeep 2025-05-23 21:51:00 +02:00
parent 6706709230
commit 86725a65d4
8 changed files with 631 additions and 343 deletions

View File

@@ -21,6 +21,7 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTV Video models
## 🔥 Latest News!!
* May 23 2025: 👋 Wan 2.1GP v5.21 : Improvements for Vace: better transitions between Sliding Windows, support for Image masks in Matanyone, new Extend Video for Vace, and different types of automated background removal
* May 20 2025: 👋 Wan 2.1GP v5.2 : Added support for Wan CausVid which is a distilled Wan model that can generate nice looking videos in only 4 to 12 steps.
The great thing is that Kijai (Kudos to him !) has created a CausVid Lora that can be combined with any existing Wan t2v 14B model such as Wan Vace 14B.
See instructions below on how to use CausVid.\
@@ -307,17 +308,20 @@ You can define multiple lines of macros. If there is only one macro line, the ap
### VACE ControlNet introduction
Vace is a ControlNet that allows you to do Video to Video and Reference to Video (inject your own images into the output video). It is probably one of the most powerful Wan models and you will be able to do amazing things when you master it: inject people or objects of your choice into a scene, animate a person, perform inpainting or outpainting, continue a video, ...
First you need to select the Vace 1.3B model or the Vace 14B model in the Drop Down box at the top. Please note that Vace works well for the moment only with videos up to 7s with the Riflex option turned on.
Besides the usual Text Prompt, three new types of visual hints can be provided (and combined !):
- *a Control Video*\
Based on your choice, you can decide to transfer the motion or the depth of the Control Video into a new Video. You can tell WanGP to use only the first n frames of the Control Video and to extrapolate the rest. You can also do inpainting: if the video contains areas of grey color 127, they will be considered as masks and will be filled based on the Text Prompt and the Reference Images (see the sketch after this list).
- *Reference Images*\
A Reference Image can be either a background / setting that you want to use for the video, or a person or object of your choice that you want to inject into the video. You can select multiple Reference Images. The integration of an object / person image is more efficient if its background is replaced by full white. For complex background removal you can use the Image version of the Matanyone tool that is embedded in WanGP, or you can use the fast on-the-fly background remover by selecting an option in the *Remove background* drop down box. Be careful not to remove the background of a reference image that is a landscape or setting (always the first reference image) that you want to use as a start image / background for the video. It helps greatly to reference and describe explicitly the injected objects / people of the Reference Images in the text prompt.
- *a Video Mask*\
This offers a stronger mechanism to tell Vace which parts should be kept (black) or replaced (white). You can also do inpainting / outpainting and fill the missing parts of a video more efficiently than with the video hint alone. For instance, if a video mask is white except at the beginning and at the end where it is black, the first and last frames will be kept and everything in between will be generated.
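To make the grey-127 convention concrete, here is a minimal sketch (not code from WanGP; the frame size and rectangle coordinates are arbitrary) of how an inpainting area could be painted into one frame of a Control Video:

```python
import numpy as np

# One control-video frame: pixels left untouched are transferred as-is, while
# the rectangle filled with grey 127 is treated by Vace as an area to
# regenerate from the Text Prompt / Reference Images.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # replace with a real frame
frame[200:520, 400:880] = 127                      # region to inpaint
```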
Examples:
@@ -336,13 +340,29 @@ There is also a guide that describes the various combination of hints (https://g
It seems you will get better results with Vace if you turn on "Skip Layer Guidance" with its default configuration.
Other recommended settings for Vace:
- Use a long prompt description, especially for the people / objects that are in the background and not in the reference images. This will ensure consistency between the windows.
- Set a medium size overlap window: long enough to give the model a sense of the motion but short enough so any overlapped blurred frames do not turn the rest of the video into a blurred video.
- Truncate at least the last 4 frames of each generated window, as the last frames generated by Vace tend to be blurry.
**WanGP integrates the Matanyone tool, which is tuned to work with Vace.**
This can be very useful to create at the same time a control video and a mask video that go together.\
For example, if you want to replace the face of a person in a video:
- load the video in the Matanyone tool
- click the face on the first frame and create a mask for it (if you have trouble selecting only the face, look at the tips below)
- generate both the control video and the mask video by clicking *Generate Video Matting*
- click *Export to current Video Input and Video Mask*
- in the *Reference Images* field of the Vace screen, load a picture of the replacement face

Please note that sometimes it may be useful to create *Background Masks*, for instance if you want to replace everything but a character that is in the video. You can do that by selecting *Background Mask* in the *Matanyone settings*.

If you have some trouble creating the perfect mask, be aware of these tips:
- In the Matanyone settings you can also define Negative Point Prompts to remove parts of the current selection.
- Sometimes it is very hard to fit everything you want in a single mask; it may be much easier to combine multiple independent sub masks before producing the matting (see the sketch after this list): each sub mask is created by selecting an area of the image and clicking the *Add Mask* button. Sub masks can then be enabled / disabled in the Matanyone settings.
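As an illustration of how several sub masks can end up in a single template mask, here is a minimal sketch mirroring the integer-label scheme used by the `image_matting` function later in this commit (the helper name and the assumption that each sub mask is a binary numpy array are illustrative):

```python
import numpy as np

def combine_sub_masks(sub_masks):
    # sub_masks: list of binary (0/1) arrays of identical shape, one per sub
    # mask created with the Add Mask button; each keeps its own integer label.
    template_mask = sub_masks[0] * 1
    for label, mask in enumerate(sub_masks[1:], start=2):
        template_mask = np.clip(template_mask + mask * label, 0, label)
    return template_mask
```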
### VACE, Sky Reels v2 Diffusion Forcing Sliding Window and LTX Video
With this mode (that works for the moment only with Vace, Sky Reels v2 and LTX Video) you can merge multiple Videos to form a very long video (up to 1 min).
When combined with Vace this feature can use the same control video to generate the full Video that results from concatenating the different windows. For instance the first 0-4s of the control video will be used to generate the first window, then the next 4-8s of the control video will be used to generate the second window, and so on. So if your control video contains a person walking, your generated video could contain up to one minute of this person walking.
@@ -352,12 +372,16 @@ Sliding Windows are turned on by default and are triggered as soon as you try to
Although the window duration is set by the *Sliding Window Size* form field, the actual number of frames generated by each iteration will be less, because of the *overlap frames* and *discard last frames*:
- *overlap frames* : the first frames of a new window are filled with the last frames of the previous window in order to ensure continuity between the two windows
- *discard last frames* : sometimes (Vace 1.3B model only) the last frames of a window have a worse quality. You can decide here how many ending frames of a new window should be dropped.
There is some inevitable quality degradation over time due to accumulated errors in the calculation. One trick to reduce or hide it is to add some noise (usually not noticeable) on the overlapped frames using the *add overlapped noise* option.
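Conceptually, this option blends a small amount of Gaussian noise into the overlapped latents, roughly as in the sketch below (variable names are illustrative; the division by 1000 matches the `overlap_noise / 1000` factor used in the model code later in this commit):

```python
import torch

def add_overlap_noise(overlapped_latents, overlap_noise):
    # overlap_noise is the UI value; 0 leaves the overlapped latents untouched
    factor = overlap_noise / 1000
    noise = torch.randn_like(overlapped_latents)
    return overlapped_latents * (1.0 - factor) + noise * factor
```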
Number of Generated Frames = [Number of Windows - 1] * ([Window Size] - [Overlap Frames] - [Discard Last Frames]) + [Window Size]
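For example, the formula can be checked with a small helper (a sketch, not code from the repository):

```python
def generated_frames(num_windows, window_size, overlap_frames, discard_last_frames):
    return (num_windows - 1) * (window_size - overlap_frames - discard_last_frames) + window_size

# 4 windows of 81 frames, with 8 overlapped and 4 discarded frames per window:
print(generated_frames(4, 81, 8, 4))  # 288
```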
Experimental: if your prompt is broken into multiple lines (each line separated by a carriage return), then each line of the prompt will be used for a new window. If there are more windows to generate than prompt lines, the last prompt line will be repeated.
### Command line parameters for Gradio Server
--i2v : launch the image to video generator\
--t2v : launch the text to video generator (default defined in the configuration)\

View File

@@ -1502,7 +1502,7 @@ class LTXVideoPipeline(DiffusionPipeline):
extra_conditioning_mask.append(conditioning_mask)
# Patchify the updated latents and calculate their pixel coordinates
init_latents, init_latent_coords = self.patchifier.patchify(
latents=init_latents
)
init_pixel_coords = latent_to_pixel_coords(

View File

@@ -85,7 +85,7 @@ def get_frames_from_image(image_input, image_state):
model.samcontroler.sam_controler.reset_image()
model.samcontroler.sam_controler.set_image(image_state["origin_images"][0])
return image_state, image_info, image_state["origin_images"][0], \
gr.update(visible=True, maximum=10, value=10), gr.update(visible=False, maximum=len(frames), value=len(frames)), \
gr.update(visible=True), gr.update(visible=True), \
gr.update(visible=True), gr.update(visible=True),\
gr.update(visible=True), gr.update(visible=True), \
@@ -273,6 +273,57 @@ def save_video(frames, output_path, fps):
return output_path
# image matting
def image_matting(video_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, refine_iter):
matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
if interactive_state["track_end_number"]:
following_frames = video_state["origin_images"][video_state["select_frame_number"]:interactive_state["track_end_number"]]
else:
following_frames = video_state["origin_images"][video_state["select_frame_number"]:]
if interactive_state["multi_mask"]["masks"]:
if len(mask_dropdown) == 0:
mask_dropdown = ["mask_001"]
mask_dropdown.sort()
template_mask = interactive_state["multi_mask"]["masks"][int(mask_dropdown[0].split("_")[1]) - 1] * (int(mask_dropdown[0].split("_")[1]))
for i in range(1,len(mask_dropdown)):
mask_number = int(mask_dropdown[i].split("_")[1]) - 1
template_mask = np.clip(template_mask+interactive_state["multi_mask"]["masks"][mask_number]*(mask_number+1), 0, mask_number+1)
video_state["masks"][video_state["select_frame_number"]]= template_mask
else:
template_mask = video_state["masks"][video_state["select_frame_number"]]
# operation error
if len(np.unique(template_mask))==1:
template_mask[0][0]=1
foreground, alpha = matanyone(matanyone_processor, following_frames, template_mask*255, r_erode=erode_kernel_size, r_dilate=dilate_kernel_size, n_warmup=refine_iter)
foreground_mat = False
output_frames = []
for frame_origin, frame_alpha in zip(following_frames, alpha):
if foreground_mat:
frame_alpha[frame_alpha > 127] = 255
frame_alpha[frame_alpha <= 127] = 0
else:
frame_temp = frame_alpha.copy()
frame_alpha[frame_temp > 127] = 0
frame_alpha[frame_temp <= 127] = 255
output_frame = np.bitwise_and(frame_origin, 255-frame_alpha)
frame_grey = frame_alpha.copy()
frame_grey[frame_alpha == 255] = 255
output_frame += frame_grey
output_frames.append(output_frame)
foreground = output_frames
foreground_output = Image.fromarray(foreground[-1])
alpha_output = Image.fromarray(alpha[-1][:,:,0])
return foreground_output, gr.update(visible=True)
# video matting
def video_matting(video_state, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
@@ -397,7 +448,7 @@ def restart():
"inference_times": 0,
"negative_click_times" : 0,
"positive_click_times": 0,
"mask_save": False,
"multi_mask": {
"mask_names": [],
"masks": []
@@ -457,6 +508,15 @@ def export_to_vace_video_input(foreground_video_output):
gr.Info("Masked Video Input transferred to Vace For Inpainting")
return "V#" + str(time.time()), foreground_video_output
def export_image(image_refs, image_output):
gr.Info("Masked Image transferred to Current Video")
# return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
if image_refs == None:
image_refs =[]
image_refs.append( image_output)
return image_refs
def export_to_current_video_engine(foreground_video_output, alpha_video_output):
gr.Info("Masked Video Input and Full Mask transferred to Current Video Engine For Inpainting")
# return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
@@ -471,14 +531,17 @@ def teleport_to_vace_1_3B():
def teleport_to_vace_14B():
return gr.Tabs(selected="video_gen"), gr.Dropdown(value="vace_14B")
def display(tabs, model_choice, vace_video_input, vace_video_mask, vace_image_refs, video_prompt_video_guide_trigger):
# my_tab.select(fn=load_unload_models, inputs=[], outputs=[])
media_url = "https://github.com/pq-yang/MatAnyone/releases/download/media/"
# download assets
gr.Markdown("<B>Mask Edition is provided by MatAnyone</B>")
gr.Markdown("If you have some trouble creating the perfect mask, be aware of these tips:")
gr.Markdown("- Using the Matanyone Settings you can also define Negative Point Prompts to remove parts of the current selection.")
gr.Markdown("- Sometime it is very hard to fit everything you want in a single mask, it may be much easier to combine multiple independent sub Masks before producing the Matting : each sub Mask is created by selecting an area of an image and by clicking the Add Mask button. Sub masks can then be enabled / disabled in the Matanyone settings.")
with gr.Column( visible=True):
with gr.Row():
@@ -493,216 +556,368 @@ def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_
gr.Video(value="preprocessing/matanyone/tutorial_multi_targets.mp4", elem_classes="video")
click_state = gr.State([[],[]])
interactive_state = gr.State({
"inference_times": 0,
"negative_click_times" : 0,
"positive_click_times": 0,
"mask_save": arg_mask_save,
"multi_mask": {
"mask_names": [],
"masks": []
},
"track_end_number": None,
}
)
video_state = gr.State( with gr.Tabs():
{ with gr.TabItem("Video"):
"user_name": "",
"video_name": "",
"origin_images": None,
"painted_images": None,
"masks": None,
"inpaint_masks": None,
"logits": None,
"select_frame_number": 0,
"fps": 16,
"audio": "",
}
)
with gr.Column( visible=True): click_state = gr.State([[],[]])
with gr.Row():
with gr.Accordion('MatAnyone Settings (click to expand)', open=False): interactive_state = gr.State({
"inference_times": 0,
"negative_click_times" : 0,
"positive_click_times": 0,
"mask_save": arg_mask_save,
"multi_mask": {
"mask_names": [],
"masks": []
},
"track_end_number": None,
}
)
video_state = gr.State(
{
"user_name": "",
"video_name": "",
"origin_images": None,
"painted_images": None,
"masks": None,
"inpaint_masks": None,
"logits": None,
"select_frame_number": 0,
"fps": 16,
"audio": "",
}
)
with gr.Column( visible=True):
with gr.Row(): with gr.Row():
erode_kernel_size = gr.Slider(label='Erode Kernel Size', with gr.Accordion('MatAnyone Settings (click to expand)', open=False):
minimum=0, with gr.Row():
maximum=30, erode_kernel_size = gr.Slider(label='Erode Kernel Size',
step=1, minimum=0,
value=10, maximum=30,
info="Erosion on the added mask", step=1,
interactive=True) value=10,
dilate_kernel_size = gr.Slider(label='Dilate Kernel Size', info="Erosion on the added mask",
minimum=0, interactive=True)
maximum=30, dilate_kernel_size = gr.Slider(label='Dilate Kernel Size',
step=1, minimum=0,
value=10, maximum=30,
info="Dilation on the added mask", step=1,
interactive=True) value=10,
info="Dilation on the added mask",
interactive=True)
with gr.Row():
image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Start Frame", info="Choose the start frame for target assignment and video matting", visible=False)
end_selection_slider = gr.Slider(minimum=1, maximum=300, step=1, value=81, label="Last Frame to Process", info="Last Frame to Process", visible=False)
track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="End frame", visible=False)
with gr.Row():
point_prompt = gr.Radio(
choices=["Positive", "Negative"],
value="Positive",
label="Point Prompt",
info="Click to add positive or negative point for target mask",
interactive=True,
visible=False,
min_width=100,
scale=1)
matting_type = gr.Radio(
choices=["Foreground", "Background"],
value="Foreground",
label="Matting Type",
info="Type of Video Matting to Generate",
interactive=True,
visible=False,
min_width=100,
scale=1)
mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False, scale=2)
# input video
with gr.Row(equal_height=True):
with gr.Column(scale=2):
gr.Markdown("## Step1: Upload video")
with gr.Column(scale=2):
step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
with gr.Row(equal_height=True):
with gr.Column(scale=2):
video_input = gr.Video(label="Input Video", elem_classes="video")
extract_frames_button = gr.Button(value="Load Video", interactive=True, elem_classes="new_button")
with gr.Column(scale=2):
video_info = gr.Textbox(label="Video Info", visible=False)
template_frame = gr.Image(label="Start Frame", type="pil",interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
with gr.Row():
clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, min_width=100)
add_mask_button = gr.Button(value="Set Mask", interactive=True, visible=False, min_width=100)
remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, min_width=100) # no use
matting_button = gr.Button(value="Generate Video Matting", interactive=True, visible=False, min_width=100)
with gr.Row():
gr.Markdown("")
# output video
with gr.Column() as output_row: #equal_height=True
with gr.Row():
with gr.Column(scale=2):
foreground_video_output = gr.Video(label="Masked Video Output", visible=False, elem_classes="video")
foreground_output_button = gr.Button(value="Black & White Video Output", visible=False, elem_classes="new_button")
with gr.Column(scale=2):
alpha_video_output = gr.Video(label="B & W Mask Video Output", visible=False, elem_classes="video")
alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
with gr.Row():
with gr.Row(visible= False):
export_to_vace_video_14B_btn = gr.Button("Export to current Video Input Video For Inpainting", visible= False)
with gr.Row(visible= True):
export_to_current_video_engine_btn = gr.Button("Export to current Video Input and Video Mask", visible= False)
export_to_vace_video_14B_btn.click( fn=teleport_to_vace_14B, inputs=[], outputs=[tabs, model_choice]).then(
fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [video_prompt_video_guide_trigger, vace_video_input, vace_video_mask])
export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [vace_video_input, vace_video_mask]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
# first step: get the video information
extract_frames_button.click(
fn=get_frames_from_video,
inputs=[
video_input, video_state
],
outputs=[video_state, video_info, template_frame,
image_selection_slider, end_selection_slider, track_pause_number_slider, point_prompt, matting_type, clear_button_click, add_mask_button, matting_button, template_frame,
foreground_video_output, alpha_video_output, foreground_output_button, alpha_output_button, mask_dropdown, step2_title]
)
# second step: select images from slider
image_selection_slider.release(fn=select_video_template,
inputs=[image_selection_slider, video_state, interactive_state],
outputs=[template_frame, video_state, interactive_state], api_name="select_image")
track_pause_number_slider.release(fn=get_end_number,
inputs=[track_pause_number_slider, video_state, interactive_state],
outputs=[template_frame, interactive_state], api_name="end_image")
# click select image to get mask using sam
template_frame.select(
fn=sam_refine,
inputs=[video_state, point_prompt, click_state, interactive_state],
outputs=[template_frame, video_state, interactive_state]
)
# add different mask
add_mask_button.click(
fn=add_multi_mask,
inputs=[video_state, interactive_state, mask_dropdown],
outputs=[interactive_state, mask_dropdown, template_frame, click_state]
)
remove_mask_button.click(
fn=remove_multi_mask,
inputs=[interactive_state, mask_dropdown],
outputs=[interactive_state, mask_dropdown]
)
# video matting
matting_button.click(
fn=show_outputs,
inputs=[],
outputs=[foreground_video_output, alpha_video_output]).then(
fn=video_matting,
inputs=[video_state, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
outputs=[foreground_video_output, alpha_video_output,foreground_video_output, alpha_video_output, export_to_vace_video_14B_btn, export_to_current_video_engine_btn]
)
# click to get mask
mask_dropdown.change(
fn=show_mask,
inputs=[video_state, interactive_state, mask_dropdown],
outputs=[template_frame]
)
# clear input
video_input.change(
fn=restart,
inputs=[],
outputs=[
video_state,
interactive_state,
click_state,
foreground_video_output, alpha_video_output,
template_frame,
image_selection_slider, end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
],
queue=False,
show_progress=False)
video_input.clear(
fn=restart,
inputs=[],
outputs=[
video_state,
interactive_state,
click_state,
foreground_video_output, alpha_video_output,
template_frame,
image_selection_slider , end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
],
queue=False,
show_progress=False)
# points clear
clear_button_click.click(
fn = clear_click,
inputs = [video_state, click_state,],
outputs = [template_frame,click_state],
)
with gr.TabItem("Image"):
click_state = gr.State([[],[]])
interactive_state = gr.State({
"inference_times": 0,
"negative_click_times" : 0,
"positive_click_times": 0,
"mask_save": False,
"multi_mask": {
"mask_names": [],
"masks": []
},
"track_end_number": None,
}
)
image_state = gr.State(
{
"user_name": "",
"image_name": "",
"origin_images": None,
"painted_images": None,
"masks": None,
"inpaint_masks": None,
"logits": None,
"select_frame_number": 0,
"fps": 30
}
)
with gr.Group(elem_classes="gr-monochrome-group", visible=True):
with gr.Row(): with gr.Row():
image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Start Frame", info="Choose the start frame for target assignment and video matting", visible=False) with gr.Accordion('MatAnyone Settings (click to expand)', open=False):
end_selection_slider = gr.Slider(minimum=1, maximum=300, step=1, value=81, label="Last Frame to Process", info="Last Frame to Process", visible=False) with gr.Row():
erode_kernel_size = gr.Slider(label='Erode Kernel Size',
minimum=0,
maximum=30,
step=1,
value=10,
info="Erosion on the added mask",
interactive=True)
dilate_kernel_size = gr.Slider(label='Dilate Kernel Size',
minimum=0,
maximum=30,
step=1,
value=10,
info="Dilation on the added mask",
interactive=True)
track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="End frame", visible=False) with gr.Row():
image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Num of Refinement Iterations", info="More iterations → More details & More time", visible=False)
track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Track end frame", visible=False)
with gr.Row():
point_prompt = gr.Radio(
choices=["Positive", "Negative"],
value="Positive",
label="Point Prompt",
info="Click to add positive or negative point for target mask",
interactive=True,
visible=False,
min_width=100,
scale=1)
mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False)
with gr.Column():
# input image
with gr.Row(equal_height=True):
with gr.Column(scale=2):
gr.Markdown("## Step1: Upload image")
with gr.Column(scale=2):
step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
with gr.Row(equal_height=True):
with gr.Column(scale=2):
image_input = gr.Image(label="Input Image", elem_classes="image")
extract_frames_button = gr.Button(value="Load Image", interactive=True, elem_classes="new_button")
with gr.Column(scale=2):
image_info = gr.Textbox(label="Image Info", visible=False)
template_frame = gr.Image(type="pil", label="Start Frame", interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
with gr.Row(equal_height=True, elem_classes="mask_button_group"):
clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, elem_classes="new_button", min_width=100)
add_mask_button = gr.Button(value="Add Mask", interactive=True, visible=False, elem_classes="new_button", min_width=100)
remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, elem_classes="new_button", min_width=100)
matting_button = gr.Button(value="Image Matting", interactive=True, visible=False, elem_classes="green_button", min_width=100)
# output image
with gr.Row(equal_height=True):
foreground_image_output = gr.Image(type="pil", label="Foreground Output", visible=False, elem_classes="image")
with gr.Row(): with gr.Row():
point_prompt = gr.Radio( with gr.Row():
choices=["Positive", "Negative"], export_image_btn = gr.Button(value="Add to current Reference Images", visible=False, elem_classes="new_button")
value="Positive", with gr.Column(scale=2, visible= False):
label="Point Prompt", alpha_image_output = gr.Image(type="pil", label="Alpha Output", visible=False, elem_classes="image")
info="Click to add positive or negative point for target mask",
interactive=True,
visible=False,
min_width=100,
scale=1)
matting_type = gr.Radio(
choices=["Foreground", "Background"],
value="Foreground",
label="Matting Type",
info="Type of Video Matting to Generate",
interactive=True,
visible=False,
min_width=100,
scale=1)
mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False, scale=2)
gr.Markdown("---")
with gr.Column():
# input video
with gr.Row(equal_height=True):
with gr.Column(scale=2):
gr.Markdown("## Step1: Upload video")
with gr.Column(scale=2):
step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
with gr.Row(equal_height=True):
with gr.Column(scale=2):
video_input = gr.Video(label="Input Video", elem_classes="video")
extract_frames_button = gr.Button(value="Load Video", interactive=True, elem_classes="new_button")
with gr.Column(scale=2):
video_info = gr.Textbox(label="Video Info", visible=False)
template_frame = gr.Image(label="Start Frame", type="pil",interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
with gr.Row():
clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, min_width=100)
add_mask_button = gr.Button(value="Set Mask", interactive=True, visible=False, min_width=100)
remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, min_width=100) # no use
matting_button = gr.Button(value="Generate Video Matting", interactive=True, visible=False, min_width=100)
with gr.Row():
gr.Markdown("")
# output video
with gr.Column() as output_row: #equal_height=True
with gr.Row():
with gr.Column(scale=2):
foreground_video_output = gr.Video(label="Masked Video Output", visible=False, elem_classes="video")
foreground_output_button = gr.Button(value="Black & White Video Output", visible=False, elem_classes="new_button")
with gr.Column(scale=2):
alpha_video_output = gr.Video(label="B & W Mask Video Output", visible=False, elem_classes="video")
alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button") alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
with gr.Row():
with gr.Row(visible= False):
export_to_vace_video_14B_btn = gr.Button("Export to current Video Input Video For Inpainting", visible= False)
with gr.Row(visible= True):
export_to_current_video_engine_btn = gr.Button("Export to current Video Input and Video Mask", visible= False)
export_to_vace_video_14B_btn.click( fn=teleport_to_vace_14B, inputs=[], outputs=[tabs, model_choice]).then( export_image_btn.click( fn=export_image, inputs= [vace_image_refs, foreground_image_output], outputs= [vace_image_refs]).then( #video_prompt_video_guide_trigger,
fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [video_prompt_video_guide_trigger, vace_video_input, vace_video_mask]) fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [vace_video_input, vace_video_mask]).then( #video_prompt_video_guide_trigger, # first step: get the image information
fn=teleport_to_video_tab, inputs= [], outputs= [tabs]) extract_frames_button.click(
fn=get_frames_from_image,
inputs=[
image_input, image_state
],
outputs=[image_state, image_info, template_frame,
image_selection_slider, track_pause_number_slider,point_prompt, clear_button_click, add_mask_button, matting_button, template_frame,
foreground_image_output, alpha_image_output, export_image_btn, alpha_output_button, mask_dropdown, step2_title]
)
# first step: get the video information # second step: select images from slider
extract_frames_button.click( image_selection_slider.release(fn=select_image_template,
fn=get_frames_from_video, inputs=[image_selection_slider, image_state, interactive_state],
inputs=[ outputs=[template_frame, image_state, interactive_state], api_name="select_image")
video_input, video_state track_pause_number_slider.release(fn=get_end_number,
], inputs=[track_pause_number_slider, image_state, interactive_state],
outputs=[video_state, video_info, template_frame, outputs=[template_frame, interactive_state], api_name="end_image")
image_selection_slider, end_selection_slider, track_pause_number_slider, point_prompt, matting_type, clear_button_click, add_mask_button, matting_button, template_frame,
foreground_video_output, alpha_video_output, foreground_output_button, alpha_output_button, mask_dropdown, step2_title]
)
# second step: select images from slider # click select image to get mask using sam
image_selection_slider.release(fn=select_video_template, template_frame.select(
inputs=[image_selection_slider, video_state, interactive_state], fn=sam_refine,
outputs=[template_frame, video_state, interactive_state], api_name="select_image") inputs=[image_state, point_prompt, click_state, interactive_state],
track_pause_number_slider.release(fn=get_end_number, outputs=[template_frame, image_state, interactive_state]
inputs=[track_pause_number_slider, video_state, interactive_state], )
outputs=[template_frame, interactive_state], api_name="end_image")
# click select image to get mask using sam # add different mask
template_frame.select( add_mask_button.click(
fn=sam_refine, fn=add_multi_mask,
inputs=[video_state, point_prompt, click_state, interactive_state], inputs=[image_state, interactive_state, mask_dropdown],
outputs=[template_frame, video_state, interactive_state] outputs=[interactive_state, mask_dropdown, template_frame, click_state]
) )
# add different mask remove_mask_button.click(
add_mask_button.click( fn=remove_multi_mask,
fn=add_multi_mask, inputs=[interactive_state, mask_dropdown],
inputs=[video_state, interactive_state, mask_dropdown], outputs=[interactive_state, mask_dropdown]
outputs=[interactive_state, mask_dropdown, template_frame, click_state] )
)
remove_mask_button.click( # image matting
fn=remove_multi_mask, matting_button.click(
inputs=[interactive_state, mask_dropdown], fn=image_matting,
outputs=[interactive_state, mask_dropdown] inputs=[image_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, image_selection_slider],
) outputs=[foreground_image_output, export_image_btn]
)
# video matting
matting_button.click(
fn=show_outputs,
inputs=[],
outputs=[foreground_video_output, alpha_video_output]).then(
fn=video_matting,
inputs=[video_state, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
outputs=[foreground_video_output, alpha_video_output,foreground_video_output, alpha_video_output, export_to_vace_video_14B_btn, export_to_current_video_engine_btn]
)
# click to get mask
mask_dropdown.change(
fn=show_mask,
inputs=[video_state, interactive_state, mask_dropdown],
outputs=[template_frame]
)
# clear input
video_input.change(
fn=restart,
inputs=[],
outputs=[
video_state,
interactive_state,
click_state,
foreground_video_output, alpha_video_output,
template_frame,
image_selection_slider, end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
],
queue=False,
show_progress=False)
video_input.clear(
fn=restart,
inputs=[],
outputs=[
video_state,
interactive_state,
click_state,
foreground_video_output, alpha_video_output,
template_frame,
image_selection_slider , end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
],
queue=False,
show_progress=False)
# points clear
clear_button_click.click(
fn = clear_click,
inputs = [video_state, click_state,],
outputs = [template_frame,click_state],
)

Binary file not shown.

Binary file not shown.

View File

@@ -111,7 +111,7 @@ class WanT2V:
self.adapt_vace_model()
def vace_encode_frames(self, frames, ref_images, masks=None, tile_size = 0, overlapped_latents = None):
if ref_images is None:
ref_images = [None] * len(frames)
else:
@@ -123,10 +123,10 @@ class WanT2V:
inactive = [i * (1 - m) + 0 * m for i, m in zip(frames, masks)]
reactive = [i * m + 0 * (1 - m) for i, m in zip(frames, masks)]
inactive = self.vae.encode(inactive, tile_size = tile_size)
self.toto = inactive[0].clone()
if overlapped_latents != None :
# inactive[0][:, 0:1] = self.vae.encode([frames[0][:, 0:1]], tile_size = tile_size)[0] # redundant
inactive[0][:, 1:overlapped_latents.shape[1] + 1] = overlapped_latents
reactive = self.vae.encode(reactive, tile_size = tile_size)
latents = [torch.cat((u, c), dim=0) for u, c in zip(inactive, reactive)]
@@ -190,13 +190,13 @@ class WanT2V:
num_frames = total_frames - prepend_count
if sub_src_mask is not None and sub_src_video is not None:
src_video[i], src_mask[i], _, _, _ = self.vid_proc.load_video_pair(sub_src_video, sub_src_mask, max_frames= num_frames, trim_video = trim_video - prepend_count, start_frame = start_frame, canvas_height = canvas_height, canvas_width = canvas_width, fit_into_canvas = fit_into_canvas)
# src_video is [-1, 1] (at this function output), 0 = inpainting area (in fact 127 in [0, 255])
# src_mask is [-1, 1] (at this function output), 0 = preserve original video (in fact 127 in [0, 255]) and 1 = Inpainting (in fact 255 in [0, 255])
src_video[i] = src_video[i].to(device)
src_mask[i] = src_mask[i].to(device)
if prepend_count > 0:
src_video[i] = torch.cat( [sub_pre_src_video, src_video[i]], dim=1)
src_mask[i] = torch.cat( [torch.full_like(sub_pre_src_video, -1.0), src_mask[i]] ,1)
src_video_shape = src_video[i].shape
if src_video_shape[1] != total_frames:
src_video[i] = torch.cat( [src_video[i], src_video[i].new_zeros(src_video_shape[0], total_frames -src_video_shape[1], *src_video_shape[-2:])], dim=1)
@@ -300,7 +300,8 @@ class WanT2V:
slg_end = 1.0,
cfg_star_switch = True,
cfg_zero_step = 5,
overlapped_latents = None,
return_latent_slice = None,
overlap_noise = 0,
model_filename = None,
**bbargs
@@ -373,8 +374,10 @@ class WanT2V:
input_frames = [u.to(self.device) for u in input_frames]
input_ref_images = [ None if u == None else [v.to(self.device) for v in u] for u in input_ref_images]
input_masks = [u.to(self.device) for u in input_masks]
previous_latents = None
# if overlapped_latents != None:
#     input_ref_images = [u[-1:] for u in input_ref_images]
z0 = self.vace_encode_frames(input_frames, input_ref_images, masks=input_masks, tile_size = VAE_tile_size, overlapped_latents = overlapped_latents )
m0 = self.vace_encode_masks(input_masks, input_ref_images)
z = self.vace_latent(z0, m0)
@@ -442,8 +445,9 @@ class WanT2V:
if vace:
ref_images_count = len(input_ref_images[0]) if input_ref_images != None and input_ref_images[0] != None else 0
kwargs.update({'vace_context' : z, 'vace_context_scale' : context_scale})
if overlapped_latents != None:
overlapped_latents_size = overlapped_latents.shape[1] + 1
z_reactive = [ zz[0:16, 0:overlapped_latents_size + ref_images_count].clone() for zz in z]
if self.model.enable_teacache:
@@ -453,13 +457,14 @@ class WanT2V:
if callback != None:
callback(-1, None, True)
for i, t in enumerate(tqdm(timesteps)):
if overlapped_latents != None:
# overlap_noise_factor = overlap_noise *(i/(len(timesteps)-1)) / 1000
overlap_noise_factor = overlap_noise / 1000
latent_noise_factor = t / 1000
for zz, zz_r, ll in zip(z, z_reactive, [latents]):
pass
zz[0:16, ref_images_count:overlapped_latents_size + ref_images_count] = zz_r[:, ref_images_count:] * (1.0 - overlap_noise_factor) + torch.randn_like(zz_r[:, ref_images_count:] ) * overlap_noise_factor
ll[:, 0:overlapped_latents_size + ref_images_count] = zz_r * (1.0 - latent_noise_factor) + torch.randn_like(zz_r ) * latent_noise_factor
if target_camera != None:
latent_model_input = torch.cat([latents, source_latents], dim=1)
else:
@@ -552,6 +557,13 @@ class WanT2V:
x0 = [latents]
if return_latent_slice != None:
if overlapped_latents != None:
# latents [:, 1:] = self.toto
for zz, zz_r, ll in zip(z, z_reactive, [latents]):
ll[:, 0:overlapped_latents_size + ref_images_count] = zz_r
latent_slice = latents[:, return_latent_slice].clone()
if input_frames == None:
if phantom:
# phantom post processing
@@ -560,11 +572,9 @@ class WanT2V:
else:
# vace post processing
videos = self.decode_latent(x0, input_ref_images, VAE_tile_size)
if return_latent_slice != None:
return { "x" : videos[0], "latent_slice" : latent_slice }
return videos[0]
def adapt_vace_model(self):
model = self.model

View File

@@ -91,11 +91,11 @@ def calculate_new_dimensions(canvas_height, canvas_width, height, width, fit_int
return new_height, new_width
def resize_and_remove_background(img_list, budget_width, budget_height, rm_background, fit_into_canvas = False ):
if rm_background > 0:
session = new_session()
output_list =[]
for i, img in enumerate(img_list):
width, height = img.size
if fit_into_canvas:
@@ -113,9 +113,10 @@ def resize_and_remove_background(img_list, budget_width, budget_height, rm_backg
new_height = int( round(height * scale / 16) * 16)
new_width = int( round(width * scale / 16) * 16)
resized_image= img.resize((new_width,new_height), resample=Image.Resampling.LANCZOS)
if rm_background == 1 or rm_background == 2 and i > 0 :
# resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1,alpha_matting_background_threshold = 70, alpha_foreground_background_threshold = 100, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
output_list.append(resized_image) #alpha_matting_background_threshold = 30, alpha_foreground_background_threshold = 200,
return output_list
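Note that `rm_background` is now an integer selector rather than a boolean. With Python operator precedence the condition above reads as `rm_background == 1 or (rm_background == 2 and i > 0)`, which matches the README note about keeping the background of the first (landscape / setting) reference image. A hedged sketch of that interpretation:

```python
def should_remove_background(rm_background, i):
    # 0: keep every background
    # 1: remove the background of every reference image
    # 2: keep the first image (landscape / setting) and strip the rest
    return rm_background == 1 or (rm_background == 2 and i > 0)
```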

wgp.py
View File

@@ -204,9 +204,6 @@ def process_prompt_and_add_tasks(state, model_choice):
if isinstance(image_refs, list):
image_refs = [ convert_image(tup[0]) for tup in image_refs ]
# os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
# from wan.utils.utils import resize_and_remove_background
# image_refs = resize_and_remove_background(image_refs, width, height, inputs["remove_background_image_ref"] ==1, fit_into_canvas= True)
if len(prompts) > 0:
@@ -333,8 +330,10 @@ def process_prompt_and_add_tasks(state, model_choice):
if "O" in video_prompt_type :
keep_frames_video_guide= inputs["keep_frames_video_guide"]
video_length = inputs["video_length"]
if len(keep_frames_video_guide) > 0:
gr.Info("Keeping Frames with Extending Video is not yet supported")
return
# gr.Info(f"Warning : you have asked to reuse all the frames of the control Video in the Alternate Video Ending it. Please make sure the number of frames of the control Video is lower than the total number of frames to generate otherwise it won't make a difference.")
# elif keep_frames >= video_length:
# gr.Info(f"The number of frames in the control Video to reuse ({keep_frames_video_guide}) in Alternate Video Ending can not be bigger than the total number of frames ({video_length}) to generate.")
# return
@@ -349,11 +348,6 @@ def process_prompt_and_add_tasks(state, model_choice):
if isinstance(image_refs, list):
image_refs = [ convert_image(tup[0]) for tup in image_refs ]
# os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
# from wan.utils.utils import resize_and_remove_background
# image_refs = resize_and_remove_background(image_refs, width, height, inputs["remove_background_image_ref"] ==1)
if len(prompts) > 0:
prompts = ["\n".join(prompts)]
@@ -1464,7 +1458,6 @@ lock_ui_attention = False
lock_ui_transformer = False
lock_ui_compile = False
preload =int(args.preload)
force_profile_no = int(args.profile)
verbose_level = int(args.verbose)
quantizeTransformer = args.quantize_transformer
@@ -1482,17 +1475,21 @@ if os.path.isfile("t2v_settings.json"):
if not os.path.isfile(server_config_filename) and os.path.isfile("gradio_config.json"):
shutil.move("gradio_config.json", server_config_filename)
if not os.path.isdir("ckpts/umt5-xxl/"):
os.makedirs("ckpts/umt5-xxl/")
src_move = [ "ckpts/models_clip_open-clip-xlm-roberta-large-vit-huge-14-bf16.safetensors", "ckpts/models_t5_umt5-xxl-enc-bf16.safetensors", "ckpts/models_t5_umt5-xxl-enc-quanto_int8.safetensors" ]
tgt_move = [ "ckpts/xlm-roberta-large/", "ckpts/umt5-xxl/", "ckpts/umt5-xxl/"]
for src,tgt in zip(src_move,tgt_move):
if os.path.isfile(src):
try:
if os.path.isfile(tgt):
os.remove(src) # shutil has no remove(); os.remove drops the already-migrated source file
else:
shutil.move(src, tgt)
except:
pass
if not Path(server_config_filename).is_file():
server_config = {"attention_mode" : "auto",
"transformer_types": [],
@@ -1755,7 +1752,10 @@ def get_default_settings(filename):
"flow_shift": 13,
"resolution": "1280x720"
})
elif get_model_type(filename) in ("vace_14B"):
ui_defaults.update({
"sliding_window_discard_last_frames": 0,
})
with open(defaults_filename, "w", encoding="utf-8") as f:
@@ -2136,6 +2136,9 @@ def load_models(model_filename):
global transformer_filename, transformer_loras_filenames
model_family = get_model_family(model_filename)
perc_reserved_mem_max = args.perc_reserved_mem_max
preload =int(args.preload)
if preload == 0:
preload = server_config.get("preload_in_VRAM", 0)
new_transformer_loras_filenames = None
dependent_models = get_dependent_models(model_filename, quantization= transformer_quantization, dtype_policy = transformer_dtype_policy)
new_transformer_loras_filenames = [model_filename] if "_lora" in model_filename else None
@@ -2259,7 +2262,8 @@ def apply_changes( state,
preload_model_policy_choice = 1,
UI_theme_choice = "default",
enhancer_enabled_choice = 0,
fit_canvas_choice = 0,
preload_in_VRAM_choice = 0
): ):
if args.lock_config: if args.lock_config:
return return
@ -2284,6 +2288,7 @@ def apply_changes( state,
"UI_theme" : UI_theme_choice, "UI_theme" : UI_theme_choice,
"fit_canvas": fit_canvas_choice, "fit_canvas": fit_canvas_choice,
"enhancer_enabled" : enhancer_enabled_choice, "enhancer_enabled" : enhancer_enabled_choice,
"preload_in_VRAM" : preload_in_VRAM_choice
} }
if Path(server_config_filename).is_file(): if Path(server_config_filename).is_file():
@ -2456,26 +2461,20 @@ def refresh_gallery(state): #, msg
prompt = "<BR><DIV style='height:8px'></DIV>".join(prompts) prompt = "<BR><DIV style='height:8px'></DIV>".join(prompts)
if enhanced: if enhanced:
prompt = "<U><B>Enhanced:</B></U><BR>" + prompt prompt = "<U><B>Enhanced:</B></U><BR>" + prompt
list_uri = []
start_img_uri = task.get('start_image_data_base64') start_img_uri = task.get('start_image_data_base64')
start_img_uri = start_img_uri[0] if start_img_uri !=None else None if start_img_uri != None:
list_uri += start_img_uri
end_img_uri = task.get('end_image_data_base64') end_img_uri = task.get('end_image_data_base64')
end_img_uri = end_img_uri[0] if end_img_uri !=None else None if end_img_uri != None:
list_uri += end_img_uri
thumbnail_size = "100px" thumbnail_size = "100px"
if start_img_uri: thumbnails = ""
start_img_md = f'<img src="{start_img_uri}" alt="Start" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" />' for img_uri in list_uri:
if end_img_uri: thumbnails += f'<TD><img src="{img_uri}" alt="Start" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" /></TD>'
end_img_md = f'<img src="{end_img_uri}" alt="End" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" />'
label = f"Prompt of Video being Generated" html = "<STYLE> #PINFO, #PINFO th, #PINFO td {border: 1px solid #CCCCCC;background-color:#FFFFFF;}</STYLE><TABLE WIDTH=100% ID=PINFO ><TR><TD width=100%>" + prompt + "</TD>" + thumbnails + "</TR></TABLE>"
html = "<STYLE> #PINFO, #PINFO th, #PINFO td {border: 1px solid #CCCCCC;background-color:#FFFFFF;}</STYLE><TABLE WIDTH=100% ID=PINFO ><TR><TD width=100%>" + prompt + "</TD>"
if start_img_md != "":
html += "<TD>" + start_img_md + "</TD>"
if end_img_md != "":
html += "<TD>" + end_img_md + "</TD>"
html += "</TR></TABLE>"
html_output = gr.HTML(html, visible= True) html_output = gr.HTML(html, visible= True)
return gr.Gallery(selected_index=choice, value = file_list), html_output, gr.Button(visible=False), gr.Button(visible=True), gr.Row(visible=True), update_queue_data(queue), gr.Button(interactive= abort_interactive), gr.Button(visible= onemorewindow_visible) return gr.Gallery(selected_index=choice, value = file_list), html_output, gr.Button(visible=False), gr.Button(visible=True), gr.Row(visible=True), update_queue_data(queue), gr.Button(interactive= abort_interactive), gr.Button(visible= onemorewindow_visible)
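The gallery header now collects every available start/end image into one list and renders one table cell per thumbnail, instead of two hard-coded slots. A small sketch of the same construction (pure string building, no Gradio required; the STYLE prefix is omitted):

```python
def build_prompt_html(prompt, start_uris=None, end_uris=None, thumbnail_size="100px"):
    """Render the prompt plus one <TD> thumbnail per start/end image URI."""
    list_uri = []
    if start_uris is not None:
        list_uri += start_uris
    if end_uris is not None:
        list_uri += end_uris
    thumbnails = ""
    for img_uri in list_uri:
        thumbnails += (
            f'<TD><img src="{img_uri}" style="max-width:{thumbnail_size}; '
            f'max-height:{thumbnail_size}; display: block; margin: auto; '
            f'object-fit: contain;" /></TD>'
        )
    return (
        "<TABLE WIDTH=100% ID=PINFO><TR><TD width=100%>"
        + prompt + "</TD>" + thumbnails + "</TR></TABLE>"
    )

# Two thumbnails end up as two extra cells after the prompt cell.
print(build_prompt_html("a cat", ["data:image/png;base64,AAA"], ["data:image/png;base64,BBB"]))
```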
@ -2680,7 +2679,7 @@ def generate_video(
sliding_window_overlap, sliding_window_overlap,
sliding_window_overlap_noise, sliding_window_overlap_noise,
sliding_window_discard_last_frames, sliding_window_discard_last_frames,
remove_background_image_ref, remove_background_images_ref,
temporal_upsampling, temporal_upsampling,
spatial_upsampling, spatial_upsampling,
RIFLEx_setting, RIFLEx_setting,
@ -2816,13 +2815,14 @@ def generate_video(
fps = 30 fps = 30
else: else:
fps = 16 fps = 16
latent_size = 8 if ltxv else 4
original_image_refs = image_refs original_image_refs = image_refs
if image_refs != None and len(image_refs) > 0 and (hunyuan_custom or phantom or vace): if image_refs != None and len(image_refs) > 0 and (hunyuan_custom or phantom or vace):
send_cmd("progress", [0, get_latest_status(state, "Removing Images References Background")]) send_cmd("progress", [0, get_latest_status(state, "Removing Images References Background")])
os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg") os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
from wan.utils.utils import resize_and_remove_background from wan.utils.utils import resize_and_remove_background
image_refs = resize_and_remove_background(image_refs, width, height, remove_background_image_ref ==1, fit_into_canvas= not vace) image_refs = resize_and_remove_background(image_refs, width, height, remove_background_images_ref, fit_into_canvas= not vace)
update_task_thumbnails(task, locals()) update_task_thumbnails(task, locals())
send_cmd("output") send_cmd("output")
@ -2879,7 +2879,6 @@ def generate_video(
repeat_no = 0 repeat_no = 0
extra_generation = 0 extra_generation = 0
initial_total_windows = 0 initial_total_windows = 0
max_frames_to_generate = video_length
if diffusion_forcing or vace or ltxv: if diffusion_forcing or vace or ltxv:
reuse_frames = min(sliding_window_size - 4, sliding_window_overlap) reuse_frames = min(sliding_window_size - 4, sliding_window_overlap)
else: else:
@ -2888,8 +2887,9 @@ def generate_video(
video_length += sliding_window_overlap video_length += sliding_window_overlap
sliding_window = (vace or diffusion_forcing or ltxv) and video_length > sliding_window_size sliding_window = (vace or diffusion_forcing or ltxv) and video_length > sliding_window_size
discard_last_frames = sliding_window_discard_last_frames
default_max_frames_to_generate = video_length
if sliding_window: if sliding_window:
discard_last_frames = sliding_window_discard_last_frames
left_after_first_window = video_length - sliding_window_size + discard_last_frames left_after_first_window = video_length - sliding_window_size + discard_last_frames
initial_total_windows= 1 + math.ceil(left_after_first_window / (sliding_window_size - discard_last_frames - reuse_frames)) initial_total_windows= 1 + math.ceil(left_after_first_window / (sliding_window_size - discard_last_frames - reuse_frames))
video_length = sliding_window_size video_length = sliding_window_size
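With `discard_last_frames` now defined unconditionally, the initial window count keeps the same arithmetic as before: each window after the first contributes `sliding_window_size - discard_last_frames - reuse_frames` new frames. A worked sketch of the `initial_total_windows` formula used when the sliding window kicks in:

```python
import math

def count_windows(video_length, sliding_window_size, reuse_frames, discard_last_frames):
    """Number of windows needed to cover video_length frames."""
    if video_length <= sliding_window_size:
        return 1
    left_after_first_window = video_length - sliding_window_size + discard_last_frames
    step = sliding_window_size - discard_last_frames - reuse_frames
    return 1 + math.ceil(left_after_first_window / step)

# e.g. 161 requested frames, 81-frame windows, 5 reused, 8 discarded -> 3 windows
print(count_windows(161, 81, 5, 8))
```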
@ -2913,6 +2913,7 @@ def generate_video(
prefix_video_frames_count = 0 prefix_video_frames_count = 0
frames_already_processed = None frames_already_processed = None
pre_video_guide = None pre_video_guide = None
overlapped_latents = None
window_no = 0 window_no = 0
extra_windows = 0 extra_windows = 0
guide_start_frame = 0 guide_start_frame = 0
@ -2920,6 +2921,8 @@ def generate_video(
gen["extra_windows"] = 0 gen["extra_windows"] = 0
gen["total_windows"] = 1 gen["total_windows"] = 1
gen["window_no"] = 1 gen["window_no"] = 1
num_frames_generated = 0
max_frames_to_generate = default_max_frames_to_generate
start_time = time.time() start_time = time.time()
if prompt_enhancer_image_caption_model != None and prompt_enhancer !=None and len(prompt_enhancer)>0: if prompt_enhancer_image_caption_model != None and prompt_enhancer !=None and len(prompt_enhancer)>0:
text_encoder_max_tokens = 256 text_encoder_max_tokens = 256
@ -2955,38 +2958,50 @@ def generate_video(
while not abort: while not abort:
if sliding_window: if sliding_window:
prompt = prompts[window_no] if window_no < len(prompts) else prompts[-1] prompt = prompts[window_no] if window_no < len(prompts) else prompts[-1]
extra_windows += gen.get("extra_windows",0) new_extra_windows = gen.get("extra_windows",0)
if extra_windows > 0:
video_length = sliding_window_size
gen["extra_windows"] = 0 gen["extra_windows"] = 0
extra_windows += new_extra_windows
max_frames_to_generate += new_extra_windows * (sliding_window_size - discard_last_frames - reuse_frames)
sliding_window = sliding_window or extra_windows > 0
if sliding_window and window_no > 0:
num_frames_generated -= reuse_frames
if (max_frames_to_generate - prefix_video_frames_count - num_frames_generated) < latent_size:
break
video_length = min(sliding_window_size, ((max_frames_to_generate - num_frames_generated - prefix_video_frames_count + reuse_frames + discard_last_frames) // latent_size) * latent_size + 1 )
total_windows = initial_total_windows + extra_windows total_windows = initial_total_windows + extra_windows
gen["total_windows"] = total_windows gen["total_windows"] = total_windows
if window_no >= total_windows: if window_no >= total_windows:
break break
window_no += 1 window_no += 1
gen["window_no"] = window_no gen["window_no"] = window_no
return_latent_slice = None
if reuse_frames > 0:
return_latent_slice = slice(-(reuse_frames - 1 + discard_last_frames ) // latent_size, None if discard_last_frames == 0 else -(discard_last_frames // latent_size) )
if hunyuan_custom: if hunyuan_custom:
src_ref_images = image_refs src_ref_images = image_refs
elif phantom: elif phantom:
src_ref_images = image_refs.copy() if image_refs != None else None src_ref_images = image_refs.copy() if image_refs != None else None
elif diffusion_forcing or ltxv: elif diffusion_forcing or ltxv or vace and "O" in video_prompt_type:
if vace:
video_source = video_guide
video_guide = None
if video_source != None and len(video_source) > 0 and window_no == 1: if video_source != None and len(video_source) > 0 and window_no == 1:
keep_frames_video_source= 1000 if len(keep_frames_video_source) ==0 else int(keep_frames_video_source) keep_frames_video_source= 1000 if len(keep_frames_video_source) ==0 else int(keep_frames_video_source)
keep_frames_video_source = (keep_frames_video_source // latent_size ) * latent_size + 1
prefix_video = preprocess_video(None, width=width, height=height,video_in=video_source, max_frames= keep_frames_video_source , start_frame = 0, fit_canvas= fit_canvas, target_fps = fps, block_size = 32 if ltxv else 16) prefix_video = preprocess_video(None, width=width, height=height,video_in=video_source, max_frames= keep_frames_video_source , start_frame = 0, fit_canvas= fit_canvas, target_fps = fps, block_size = 32 if ltxv else 16)
prefix_video = prefix_video .permute(3, 0, 1, 2) prefix_video = prefix_video .permute(3, 0, 1, 2)
prefix_video = prefix_video .float().div_(127.5).sub_(1.) # c, f, h, w prefix_video = prefix_video .float().div_(127.5).sub_(1.) # c, f, h, w
prefix_video_frames_count = prefix_video.shape[1]
pre_video_guide = prefix_video[:, -reuse_frames:] pre_video_guide = prefix_video[:, -reuse_frames:]
prefix_video_frames_count = pre_video_guide.shape[1]
elif vace: if vace:
# video_prompt_type = video_prompt_type +"G" height, width = pre_video_guide.shape[-2:]
if vace:
image_refs_copy = image_refs.copy() if image_refs != None else None # required since prepare_source do inplace modifications image_refs_copy = image_refs.copy() if image_refs != None else None # required since prepare_source do inplace modifications
video_guide_copy = video_guide video_guide_copy = video_guide
video_mask_copy = video_mask video_mask_copy = video_mask
if any(process in video_prompt_type for process in ("P", "D", "G")) : if any(process in video_prompt_type for process in ("P", "D", "G")) :
prompts_max = gen["prompts_max"]
preprocess_type = None preprocess_type = None
if "P" in video_prompt_type : if "P" in video_prompt_type :
progress_args = [0, get_latest_status(state,"Extracting Open Pose Information")] progress_args = [0, get_latest_status(state,"Extracting Open Pose Information")]
@ -3005,8 +3020,11 @@ def generate_video(
if len(error) > 0: if len(error) > 0:
raise gr.Error(f"invalid keep frames {keep_frames_video_guide}") raise gr.Error(f"invalid keep frames {keep_frames_video_guide}")
keep_frames_parsed = keep_frames_parsed[guide_start_frame: guide_start_frame + video_length] keep_frames_parsed = keep_frames_parsed[guide_start_frame: guide_start_frame + video_length]
if window_no == 1: if window_no == 1:
image_size = (height, width) # VACE_SIZE_CONFIGS[resolution_reformated] # default frame dimensions until it is set by video_src (if there is any) image_size = (height, width) # default frame dimensions until it is set by video_src (if there is any)
src_video, src_mask, src_ref_images = wan_model.prepare_source([video_guide_copy], src_video, src_mask, src_ref_images = wan_model.prepare_source([video_guide_copy],
[video_mask_copy ], [video_mask_copy ],
[image_refs_copy], [image_refs_copy],
@ -3017,29 +3035,24 @@ def generate_video(
pre_src_video = [pre_video_guide], pre_src_video = [pre_video_guide],
fit_into_canvas = fit_canvas fit_into_canvas = fit_canvas
) )
# if window_no == 1 and src_video != None and len(src_video) > 0:
# image_size = src_video[0].shape[-2:]
prompts_max = gen["prompts_max"]
status = get_latest_status(state) status = get_latest_status(state)
gen["progress_status"] = status gen["progress_status"] = status
gen["progress_phase"] = ("Encoding Prompt", -1 ) gen["progress_phase"] = ("Encoding Prompt", -1 )
callback = build_callback(state, trans, send_cmd, status, num_inference_steps) callback = build_callback(state, trans, send_cmd, status, num_inference_steps)
progress_args = [0, merge_status_context(status, "Encoding Prompt")] progress_args = [0, merge_status_context(status, "Encoding Prompt")]
send_cmd("progress", progress_args) send_cmd("progress", progress_args)
if trans.enable_teacache:
trans.teacache_counter = 0
trans.num_steps = num_inference_steps
trans.teacache_skipped_steps = 0
trans.previous_residual = None
trans.previous_modulated_input = None
# samples = torch.empty( (1,2)) #for testing # samples = torch.empty( (1,2)) #for testing
# if False: # if False:
try: try:
if trans.enable_teacache:
trans.teacache_counter = 0
trans.num_steps = num_inference_steps
trans.teacache_skipped_steps = 0
trans.previous_residual = None
trans.previous_modulated_input = None
samples = wan_model.generate( samples = wan_model.generate(
input_prompt = prompt, input_prompt = prompt,
image_start = image_start, image_start = image_start,
@ -3049,7 +3062,7 @@ def generate_video(
input_masks = src_mask, input_masks = src_mask,
input_video= pre_video_guide if diffusion_forcing or ltxv else source_video, input_video= pre_video_guide if diffusion_forcing or ltxv else source_video,
target_camera= target_camera, target_camera= target_camera,
frame_num=(video_length // 4)* 4 + 1, frame_num=(video_length // latent_size)* latent_size + 1,
height = height, height = height,
width = width, width = width,
fit_into_canvas = fit_canvas == 1, fit_into_canvas = fit_canvas == 1,
@ -3076,7 +3089,8 @@ def generate_video(
causal_block_size = 5, causal_block_size = 5,
causal_attention = True, causal_attention = True,
fps = fps, fps = fps,
overlapped_latents = 0 if reuse_frames == 0 or window_no == 1 else ((reuse_frames - 1) // 4 + 1), overlapped_latents = overlapped_latents,
return_latent_slice= return_latent_slice,
overlap_noise = sliding_window_overlap_noise, overlap_noise = sliding_window_overlap_noise,
model_filename = model_filename, model_filename = model_filename,
) )
@ -3109,6 +3123,7 @@ def generate_video(
tb = traceback.format_exc().split('\n')[:-1] tb = traceback.format_exc().split('\n')[:-1]
print('\n'.join(tb)) print('\n'.join(tb))
send_cmd("error", new_error) send_cmd("error", new_error)
clear_status(state)
return return
finally: finally:
trans.previous_residual = None trans.previous_residual = None
@ -3118,33 +3133,42 @@ def generate_video(
print(f"Teacache Skipped Steps:{trans.teacache_skipped_steps}/{trans.num_steps}" ) print(f"Teacache Skipped Steps:{trans.teacache_skipped_steps}/{trans.num_steps}" )
if samples != None: if samples != None:
if isinstance(samples, dict):
overlapped_latents = samples.get("latent_slice", None)
samples= samples["x"]
samples = samples.to("cpu") samples = samples.to("cpu")
offload.last_offload_obj.unload_all() offload.last_offload_obj.unload_all()
gc.collect() gc.collect()
torch.cuda.empty_cache() torch.cuda.empty_cache()
# time_flag = datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d-%Hh%Mm%Ss")
# save_prompt = "_in_" + original_prompts[0]
# file_name = f"{time_flag}_seed{seed}_{sanitize_file_name(save_prompt[:50]).strip()}.mp4"
# sample = samples.cpu()
# cache_video( tensor=sample[None].clone(), save_file=os.path.join(save_path, file_name), fps=16, nrow=1, normalize=True, value_range=(-1, 1))
if samples == None: if samples == None:
abort = True abort = True
state["prompt"] = "" state["prompt"] = ""
send_cmd("output") send_cmd("output")
else: else:
sample = samples.cpu() sample = samples.cpu()
if True: # for testing # if True: # for testing
torch.save(sample, "output.pt") # torch.save(sample, "output.pt")
else: # else:
sample =torch.load("output.pt") # sample =torch.load("output.pt")
if gen.get("extra_windows",0) > 0:
sliding_window = True
if sliding_window : if sliding_window :
guide_start_frame += video_length guide_start_frame += video_length
if discard_last_frames > 0: if discard_last_frames > 0:
sample = sample[: , :-discard_last_frames] sample = sample[: , :-discard_last_frames]
guide_start_frame -= discard_last_frames guide_start_frame -= discard_last_frames
if reuse_frames == 0: if reuse_frames == 0:
pre_video_guide = sample[:,9999 :] pre_video_guide = sample[:,9999 :].clone()
else: else:
# noise_factor = 200/ 1000 pre_video_guide = sample[:, -reuse_frames:].clone()
# pre_video_guide = sample[:, -reuse_frames:] * (1.0 - noise_factor) + torch.randn_like(sample[:, -reuse_frames:] ) * noise_factor num_frames_generated += sample.shape[1]
pre_video_guide = sample[:, -reuse_frames:]
if prefix_video != None: if prefix_video != None:
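After each window the sample tensor is trimmed and the overlap is cloned for the next window, while `num_frames_generated` now tracks progress directly instead of being recomputed from the assembled video. A sketch of that bookkeeping on a dummy frame tensor (torch only; shapes are illustrative, layout is c, f, h, w as in the diff):

```python
import torch

def finish_window(sample, reuse_frames, discard_last_frames):
    """Trim a generated window and return (trimmed, overlap, new_frame_count)."""
    if discard_last_frames > 0:
        sample = sample[:, :-discard_last_frames]
    if reuse_frames == 0:
        overlap = sample[:, 9999:].clone()           # empty tensor, nothing to reuse
    else:
        overlap = sample[:, -reuse_frames:].clone()  # frames fed to the next window
    return sample, overlap, sample.shape[1]

window = torch.randn(3, 81, 8, 8)                    # pretend 81 generated frames
trimmed, overlap, new_frames = finish_window(window, reuse_frames=5, discard_last_frames=8)
print(trimmed.shape[1], overlap.shape[1], new_frames)  # 73 5 73
```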
@ -3158,7 +3182,6 @@ def generate_video(
sample = sample[: , :] sample = sample[: , :]
else: else:
sample = sample[: , reuse_frames:] sample = sample[: , reuse_frames:]
guide_start_frame -= reuse_frames guide_start_frame -= reuse_frames
exp = 0 exp = 0
@ -3252,15 +3275,9 @@ def generate_video(
print(f"New video saved to Path: "+video_path) print(f"New video saved to Path: "+video_path)
file_list.append(video_path) file_list.append(video_path)
send_cmd("output") send_cmd("output")
if sliding_window :
if max_frames_to_generate > 0 and extra_windows == 0:
current_length = sample.shape[1]
if (current_length - prefix_video_frames_count)>= max_frames_to_generate:
break
video_length = min(sliding_window_size, ((max_frames_to_generate - (current_length - prefix_video_frames_count) + reuse_frames + discard_last_frames) // 4) * 4 + 1 )
seed += 1 seed += 1
clear_status(state)
if temp_filename!= None and os.path.isfile(temp_filename): if temp_filename!= None and os.path.isfile(temp_filename):
os.remove(temp_filename) os.remove(temp_filename)
offload.unload_loras_from_model(trans) offload.unload_loras_from_model(trans)
@ -3631,6 +3648,15 @@ def merge_status_context(status="", context=""):
else: else:
return status + " - " + context return status + " - " + context
def clear_status(state):
gen = get_gen_info(state)
gen["extra_windows"] = 0
gen["total_windows"] = 1
gen["window_no"] = 1
gen["extra_orders"] = 0
gen["repeat_no"] = 0
gen["total_generation"] = 0
def get_latest_status(state, context=""): def get_latest_status(state, context=""):
gen = get_gen_info(state) gen = get_gen_info(state)
prompt_no = gen["prompt_no"] prompt_no = gen["prompt_no"]
@ -3999,7 +4025,7 @@ def prepare_inputs_dict(target, inputs ):
inputs.pop("model_mode") inputs.pop("model_mode")
if not "Vace" in model_filename or not "phantom" in model_filename or not "hunyuan_video_custom" in model_filename: if not "Vace" in model_filename or not "phantom" in model_filename or not "hunyuan_video_custom" in model_filename:
unsaved_params = ["keep_frames_video_guide", "video_prompt_type", "remove_background_image_ref"] unsaved_params = ["keep_frames_video_guide", "video_prompt_type", "remove_background_images_ref"]
for k in unsaved_params: for k in unsaved_params:
inputs.pop(k) inputs.pop(k)
@ -4066,7 +4092,7 @@ def save_inputs(
sliding_window_overlap, sliding_window_overlap,
sliding_window_overlap_noise, sliding_window_overlap_noise,
sliding_window_discard_last_frames, sliding_window_discard_last_frames,
remove_background_image_ref, remove_background_images_ref,
temporal_upsampling, temporal_upsampling,
spatial_upsampling, spatial_upsampling,
RIFLEx_setting, RIFLEx_setting,
@ -4458,7 +4484,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
("Transfer Human Motion from the Control Video", "PV"), ("Transfer Human Motion from the Control Video", "PV"),
("Transfer Depth from the Control Video", "DV"), ("Transfer Depth from the Control Video", "DV"),
("Recolorize the Control Video", "CV"), ("Recolorize the Control Video", "CV"),
# ("Alternate Video Ending", "OV"), ("Extend Video", "OV"),
("Video contains Open Pose, Depth, Black & White, Inpainting ", "V"), ("Video contains Open Pose, Depth, Black & White, Inpainting ", "V"),
("Control Video and Mask video for Inpainting ", "MV"), ("Control Video and Mask video for Inpainting ", "MV"),
], ],
@ -4489,7 +4515,17 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
) )
# with gr.Row(): # with gr.Row():
remove_background_image_ref = gr.Checkbox(value=ui_defaults.get("remove_background_image_ref",1), label= "Remove Background of Images References", visible= "I" in video_prompt_type_value, scale =1 ) remove_background_images_ref = gr.Dropdown(
choices=[
("Keep Backgrounds of All Images (landscape)", 0),
("Remove Backgrounds of All Images (objects / faces)", 1),
("Keep it for first Image (landscape) and remove it for other Images (objects / faces)", 2),
],
value=ui_defaults.get("remove_background_images_ref",1),
label="Remove Background of Images References", scale = 3, visible= "I" in video_prompt_type_value
)
# remove_background_images_ref = gr.Checkbox(value=ui_defaults.get("remove_background_images_ref",1), label= "Remove Background of Images References", visible= "I" in video_prompt_type_value, scale =1 )
video_mask = gr.Video(label= "Video Mask (for Inpainting or Outpainting, white pixels = Mask)", visible= "M" in video_prompt_type_value, value= ui_defaults.get("video_mask", None)) video_mask = gr.Video(label= "Video Mask (for Inpainting or Outpainting, white pixels = Mask)", visible= "M" in video_prompt_type_value, value= ui_defaults.get("video_mask", None))
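The Reference Images checkbox becomes a three-way dropdown: 0 keeps every background, 1 strips them all, and 2 keeps the background only for the first image (treated as a landscape) while stripping the others. A hedged sketch of how such a mode could drive per-image removal; the `remove_bg` helper is a stand-in, not the rembg-based call WanGP actually uses:

```python
from typing import Callable, List

def apply_background_mode(images: List[object], mode: int,
                          remove_bg: Callable[[object], object]) -> List[object]:
    """Apply the 'Remove Background of Images References' dropdown semantics.

    mode 0: keep all backgrounds (landscape references)
    mode 1: remove the background of every image (objects / faces)
    mode 2: keep it for the first image only, remove it for the rest
    """
    out = []
    for i, img in enumerate(images):
        keep = mode == 0 or (mode == 2 and i == 0)
        out.append(img if keep else remove_bg(img))
    return out

# Example with strings standing in for PIL images:
print(apply_background_mode(["bg0", "bg1", "bg2"], mode=2,
                            remove_bg=lambda im: im + "-nobg"))
# -> ['bg0', 'bg1-nobg', 'bg2-nobg']
```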
@ -4730,7 +4766,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
else: else:
sliding_window_size = gr.Slider(5, 137, value=ui_defaults.get("sliding_window_size", 81), step=4, label="Sliding Window Size") sliding_window_size = gr.Slider(5, 137, value=ui_defaults.get("sliding_window_size", 81), step=4, label="Sliding Window Size")
sliding_window_overlap = gr.Slider(1, 97, value=ui_defaults.get("sliding_window_overlap",5), step=4, label="Windows Frames Overlap (needed to maintain continuity between windows, a higher value will require more windows)") sliding_window_overlap = gr.Slider(1, 97, value=ui_defaults.get("sliding_window_overlap",5), step=4, label="Windows Frames Overlap (needed to maintain continuity between windows, a higher value will require more windows)")
sliding_window_overlap_noise = gr.Slider(0, 100, value=ui_defaults.get("sliding_window_overlap_noise",20), step=1, label="Noise to be added to overlapped frames to reduce blur effect") sliding_window_overlap_noise = gr.Slider(0, 150, value=ui_defaults.get("sliding_window_overlap_noise",20), step=1, label="Noise to be added to overlapped frames to reduce blur effect")
sliding_window_discard_last_frames = gr.Slider(0, 20, value=ui_defaults.get("sliding_window_discard_last_frames", 8), step=4, label="Discard Last Frames of a Window (that may have bad quality)", visible = True) sliding_window_discard_last_frames = gr.Slider(0, 20, value=ui_defaults.get("sliding_window_discard_last_frames", 8), step=4, label="Discard Last Frames of a Window (that may have bad quality)", visible = True)
@ -4811,7 +4847,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
image_prompt_type.change(fn=refresh_image_prompt_type, inputs=[state, image_prompt_type], outputs=[image_start, image_end, video_source, keep_frames_video_source] ) image_prompt_type.change(fn=refresh_image_prompt_type, inputs=[state, image_prompt_type], outputs=[image_start, image_end, video_source, keep_frames_video_source] )
video_prompt_video_guide_trigger.change(fn=refresh_video_prompt_video_guide_trigger, inputs=[video_prompt_type, video_prompt_video_guide_trigger], outputs=[video_prompt_type, video_prompt_type_video_guide, video_guide, video_mask, keep_frames_video_guide]) video_prompt_video_guide_trigger.change(fn=refresh_video_prompt_video_guide_trigger, inputs=[video_prompt_type, video_prompt_video_guide_trigger], outputs=[video_prompt_type, video_prompt_type_video_guide, video_guide, video_mask, keep_frames_video_guide])
video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_image_ref ]) video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_images_ref ])
video_prompt_type_video_guide.input(fn=refresh_video_prompt_type_video_guide, inputs = [video_prompt_type, video_prompt_type_video_guide], outputs = [video_prompt_type, video_guide, keep_frames_video_guide, video_mask]) video_prompt_type_video_guide.input(fn=refresh_video_prompt_type_video_guide, inputs = [video_prompt_type, video_prompt_type_video_guide], outputs = [video_prompt_type, video_guide, keep_frames_video_guide, video_mask])
show_advanced.change(fn=switch_advanced, inputs=[state, show_advanced, lset_name], outputs=[advanced_row, preset_buttons_rows, refresh_lora_btn, refresh2_row ,lset_name ]).then( show_advanced.change(fn=switch_advanced, inputs=[state, show_advanced, lset_name], outputs=[advanced_row, preset_buttons_rows, refresh_lora_btn, refresh2_row ,lset_name ]).then(
@ -5036,7 +5072,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
) )
return ( state, loras_choices, lset_name, state, return ( state, loras_choices, lset_name, state,
video_guide, video_mask, video_prompt_video_guide_trigger, prompt_enhancer video_guide, video_mask, image_refs, video_prompt_video_guide_trigger, prompt_enhancer
) )
@ -5250,6 +5286,7 @@ def generate_configuration_tab(state, blocks, header, model_choice, prompt_enhan
value= profile, value= profile,
label="Profile (for power users only, not needed to change it)" label="Profile (for power users only, not needed to change it)"
) )
preload_in_VRAM_choice = gr.Slider(0, 40000, value=server_config.get("preload_in_VRAM", 0), step=100, label="Number of MB of Models that are Preloaded in VRAM (0 will use Profile default)")
@ -5277,7 +5314,8 @@ def generate_configuration_tab(state, blocks, header, model_choice, prompt_enhan
preload_model_policy_choice, preload_model_policy_choice,
UI_theme_choice, UI_theme_choice,
enhancer_enabled_choice, enhancer_enabled_choice,
fit_canvas_choice fit_canvas_choice,
preload_in_VRAM_choice
], ],
outputs= [msg , header, model_choice, prompt_enhancer_row] outputs= [msg , header, model_choice, prompt_enhancer_row]
) )
@ -5661,7 +5699,7 @@ def create_demo():
theme = gr.themes.Soft(font=["Verdana"], primary_hue="sky", neutral_hue="slate", text_size="md") theme = gr.themes.Soft(font=["Verdana"], primary_hue="sky", neutral_hue="slate", text_size="md")
with gr.Blocks(css=css, theme=theme, title= "WanGP") as main: with gr.Blocks(css=css, theme=theme, title= "WanGP") as main:
gr.Markdown("<div align=center><H1>Wan<SUP>GP</SUP> v5.2 <FONT SIZE=4>by <I>DeepBeepMeep</I></FONT> <FONT SIZE=3>") # (<A HREF='https://github.com/deepbeepmeep/Wan2GP'>Updates</A>)</FONT SIZE=3></H1></div>") gr.Markdown("<div align=center><H1>Wan<SUP>GP</SUP> v5.21 <FONT SIZE=4>by <I>DeepBeepMeep</I></FONT> <FONT SIZE=3>") # (<A HREF='https://github.com/deepbeepmeep/Wan2GP'>Updates</A>)</FONT SIZE=3></H1></div>")
global model_list global model_list
tab_state = gr.State({ "tab_no":0 }) tab_state = gr.State({ "tab_no":0 })
@ -5680,7 +5718,7 @@ def create_demo():
header = gr.Markdown(generate_header(transformer_filename, compile, attention_mode), visible= True) header = gr.Markdown(generate_header(transformer_filename, compile, attention_mode), visible= True)
with gr.Row(): with gr.Row():
( state, loras_choices, lset_name, state, ( state, loras_choices, lset_name, state,
video_guide, video_mask, video_prompt_type_video_trigger, prompt_enhancer_row video_guide, video_mask, image_refs, video_prompt_type_video_trigger, prompt_enhancer_row
) = generate_video_tab(model_choice=model_choice, header=header, main = main) ) = generate_video_tab(model_choice=model_choice, header=header, main = main)
with gr.Tab("Informations", id="info"): with gr.Tab("Informations", id="info"):
generate_info_tab() generate_info_tab()
@ -5688,7 +5726,7 @@ def create_demo():
from preprocessing.matanyone import app as matanyone_app from preprocessing.matanyone import app as matanyone_app
vmc_event_handler = matanyone_app.get_vmc_event_handler() vmc_event_handler = matanyone_app.get_vmc_event_handler()
matanyone_app.display(main_tabs, model_choice, video_guide, video_mask, video_prompt_type_video_trigger) matanyone_app.display(main_tabs, model_choice, video_guide, video_mask, image_refs, video_prompt_type_video_trigger)
if not args.lock_config: if not args.lock_config:
with gr.Tab("Downloads", id="downloads") as downloads_tab: with gr.Tab("Downloads", id="downloads") as downloads_tab:
generate_download_tab(lset_name, loras_choices, state) generate_download_tab(lset_name, loras_choices, state)