Merge branch 'main' into queue_editor

This commit is contained in:
Chris Malone 2025-09-12 14:44:12 +10:00 committed by GitHub
commit e69a406808
30 changed files with 1316 additions and 631 deletions

View File

@ -20,6 +20,31 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTX Video models
**Follow DeepBeepMeep on Twitter/X to get the Latest News**: https://x.com/deepbeepmeep
## 🔥 Latest Updates :
### September 11 2025: WanGP v8.5/8.55 - Wanna be a Cropper or a Painter ?
I have done some intensive internal refactoring of the generation pipeline to make it easier to support existing models and to add new ones. Nothing really visible, but this makes WanGP a little more future proof.
Otherwise in the news:
- **Cropped Input Image Prompts**: quite often the *Image Prompts* you provide (*Start Image, Input Video, Reference Image, Control Video, ...*) do not match your requested *Output Resolution*. Until now I used the resolution you gave either as a *Pixels Budget* or as an *Outer Canvas* for the Generated Video. However, on some occasions you really want the requested Output Resolution and nothing else. Besides, some models deliver much better Generations if you stick to one of their supported resolutions. To address this need I have added a new Output Resolution choice in the *Configuration Tab*: **Dimensions Correspond to the Output Width & Height as the Prompt Images will be Cropped to fit Exactly these Dimensions**. In short, if needed the *Input Prompt Images* will be cropped (center cropped for the moment). You will see this can make quite a difference for some models (a minimal center-crop sketch is shown right after this list).
- *Qwen Edit* now has a new sub Tab called **Inpainting** that lets you target with a brush which part of the *Image Prompt* you want to modify. This is quite convenient if you find that Qwen Edit usually modifies too many things. Of course, as there are more constraints for Qwen Edit, don't be surprised if it sometimes returns the original image unchanged. A piece of advice: describe in your *Text Prompt* where the parts you want to modify are located (for instance *to the left of the man*, *at the top*, ...).
The mask inpainting is fully compatible with the *Matanyone Mask generator*: first generate an *Image Mask* with Matanyone, transfer it to the current Image Generator and modify the mask with the *Paint Brush*. Speaking of Matanyone, I have fixed a bug that caused mask degradation with long videos (WanGP Matanyone is now as good as the original app and still requires 3 times less VRAM). A small sketch of the masked latent blending behind these inpainting modes appears at the end of these update notes.
- This **Inpainting Mask Editor** has also been added to *Vace Image Mode*. Vace is probably still one of the best Image Editors today. Here is a very simple & efficient workflow that does marvels with Vace:
Select *Vace Cocktail > Control Image Process = Perform Inpainting & Area Processed = Masked Area > Upload a Control Image, then draw your mask directly on top of the image & enter a text Prompt that describes the expected change > Generate > Below the Video Gallery click 'To Control Image' > Keep on doing more changes*.
For more sophisticated things the Vace Image Editor works very well too: try Image Outpainting, Pose Transfer, ...
For the best quality I recommend setting the option "*Generate a 9 Frames Long video...*" in the *Quality Tab*.
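As promised above, here is a minimal sketch of the kind of center crop applied to *Input Prompt Images* when the new Output Resolution mode is selected. It is illustrative only: the helper name and the exact scaling rule are assumptions, not WanGP's actual code.
```python
from PIL import Image

def center_crop_to_resolution(img: Image.Image, out_width: int, out_height: int) -> Image.Image:
    """Scale the prompt image so it covers the requested output resolution,
    then center crop the excess (illustrative sketch only)."""
    w, h = img.size
    # Scale so both dimensions cover the target while preserving aspect ratio.
    scale = max(out_width / w, out_height / h)
    new_w, new_h = round(w * scale), round(h * scale)
    img = img.resize((new_w, new_h), resample=Image.Resampling.LANCZOS)
    # Crop the centered out_width x out_height window.
    left = (new_w - out_width) // 2
    top = (new_h - out_height) // 2
    return img.crop((left, top, left + out_width, top + out_height))
```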
**update 8.55**: Flux Festival
- **Inpainting Mode** also added for *Flux Kontext*
- **Flux SRPO**: a new finetune with 3x better quality vs Flux Dev according to its authors. I have also created a *Flux SRPO USO* finetune, which is certainly the best open source *Style Transfer* tool available
- **Flux UMO**: model specialized in combining multiple reference objects / people together. Works quite well at 768x768
Good luck finding your way through all the Flux model names!
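For the curious, the new inpainting modes for Flux Kontext and Qwen Edit both boil down to the same idea: after each denoising step, the latents outside the mask are replaced by a re-noised copy of the original image, so only the masked area is actually repainted. A minimal sketch of that update rule, assuming flow-matching noise levels in [0, 1] (names are illustrative, not the exact WanGP code):
```python
import torch

def masked_inpaint_step(latents, original_latents, noise, mask, t_next):
    """Blend applied after each denoising update (illustrative sketch).

    latents:          freshly denoised latents
    original_latents: clean latents of the input image
    noise:            a fixed noise sample with the same shape
    mask:             1 where the image should be repainted, 0 elsewhere
    t_next:           the next noise level, in [0, 1]
    """
    # Re-noise the original image up to the next step's noise level...
    noisy_original = original_latents * (1.0 - t_next) + noise * t_next
    # ...and keep the denoised content only inside the mask.
    return noisy_original * (1 - mask) + mask * latents
```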
### September 5 2025: WanGP v8.4 - Take me to Outer Space
You have probably seen these short AI generated movies created using *Nano Banana* and the *First Frame - Last Frame* feature of *Kling 2.0*. The idea is to generate an image, modify a part of it with Nano Banana, and give these two images to Kling, which will generate the Video between them. Now use the previous Last Frame as the new First Frame, rinse and repeat, and you get a full movie.

View File

@ -7,8 +7,6 @@
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1_kontext_dev_bf16.safetensors",
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1_kontext_dev_quanto_bf16_int8.safetensors"
],
"image_outputs": true,
"reference_image": true,
"flux-model": "flux-dev-kontext"
},
"prompt": "add a hat",

View File

@ -0,0 +1,24 @@
{
"model": {
"name": "Flux 1 Dev UMO 12B",
"architecture": "flux",
"description": "FLUX.1 Dev UMO is a model that can Edit Images with a specialization in combining multiple image references (resized internally at 512x512 max) to produce an Image output. Best Image preservation at 768x768 Resolution Output.",
"URLs": "flux",
"flux-model": "flux-dev-umo",
"loras": ["https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-dev-UMO_dit_lora_bf16.safetensors"],
"resolutions": [ ["1024x1024 (1:1)", "1024x1024"],
["768x1024 (3:4)", "768x1024"],
["1024x768 (4:3)", "1024x768"],
["512x1024 (1:2)", "512x1024"],
["1024x512 (2:1)", "1024x512"],
["768x768 (1:1)", "768x768"],
["768x512 (3:2)", "768x512"],
["512x768 (2:3)", "512x768"]]
},
"prompt": "the man is wearing a hat",
"embedded_guidance_scale": 4,
"resolution": "768x768",
"batch_size": 1
}

View File

@ -2,12 +2,10 @@
"model": {
"name": "Flux 1 Dev USO 12B",
"architecture": "flux",
"description": "FLUX.1 Dev USO is a model specialized to Edit Images with a specialization in Style Transfers (up to two).",
"description": "FLUX.1 Dev USO is a model that can Edit Images with a specialization in Style Transfers (up to two).",
"modules": [ ["https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-dev-USO_projector_bf16.safetensors"]],
"URLs": "flux",
"loras": ["https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-dev-USO_dit_lora_bf16.safetensors"],
"image_outputs": true,
"reference_image": true,
"flux-model": "flux-dev-uso"
},
"prompt": "the man is wearing a hat",

defaults/flux_srpo.json Normal file
View File

@ -0,0 +1,15 @@
{
"model": {
"name": "Flux 1 SRPO Dev 12B",
"architecture": "flux",
"description": "By fine-tuning the FLUX.1.dev model with optimized denoising and online reward adjustment, SRPO improves its human-evaluated realism and aesthetic quality by over 3x.",
"URLs": [
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-srpo-dev_bf16.safetensors",
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-srpo-dev_quanto_bf16_int8.safetensors"
],
"flux-model": "flux-dev"
},
"prompt": "draw a hat",
"resolution": "1024x1024",
"batch_size": 1
}

View File

@ -0,0 +1,17 @@
{
"model": {
"name": "Flux 1 SRPO USO 12B",
"architecture": "flux",
"description": "FLUX.1 SRPO USO is a model that can Edit Images with a specialization in Style Transfers (up to two). It leverages the improved Image quality brought by the SRPO process",
"modules": [ "flux_dev_uso"],
"URLs": "flux_srpo",
"loras": "flux_dev_uso",
"flux-model": "flux-dev-uso"
},
"prompt": "the man is wearing a hat",
"embedded_guidance_scale": 4,
"resolution": "1024x1024",
"batch_size": 1
}

View File

@ -9,9 +9,7 @@
],
"attention": {
"<89": "sdpa"
},
"reference_image": true,
"image_outputs": true
}
},
"prompt": "add a hat",
"resolution": "1280x720",

View File

@ -4,7 +4,7 @@
"name": "Wan2.1 Standin 14B",
"modules": [ ["https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/Stand-In_wan2.1_T2V_14B_ver1.0_bf16.safetensors"]],
"architecture" : "standin",
"description": "The original Wan Text 2 Video model combined with the StandIn module to improve Identity Preservation. You need to provide a Reference Image with white background which is a close up of person face to transfer this person in the Video.",
"description": "The original Wan Text 2 Video model combined with the StandIn module to improve Identity Preservation. You need to provide a Reference Image with white background which is a close up of a person face to transfer this person in the Video.",
"URLs": "t2v"
}
}

View File

@ -13,28 +13,52 @@ class family_handler():
flux_schnell = flux_model == "flux-schnell"
flux_chroma = flux_model == "flux-chroma"
flux_uso = flux_model == "flux-dev-uso"
model_def_output = {
flux_umo = flux_model == "flux-dev-umo"
flux_kontext = flux_model == "flux-dev-kontext"
extra_model_def = {
"image_outputs" : True,
"no_negative_prompt" : not flux_chroma,
}
if flux_chroma:
model_def_output["guidance_max_phases"] = 1
extra_model_def["guidance_max_phases"] = 1
elif not flux_schnell:
model_def_output["embedded_guidance"] = True
extra_model_def["embedded_guidance"] = True
if flux_uso :
model_def_output["any_image_refs_relative_size"] = True
model_def_output["no_background_removal"] = True
model_def_output["image_ref_choices"] = {
extra_model_def["any_image_refs_relative_size"] = True
extra_model_def["no_background_removal"] = True
extra_model_def["image_ref_choices"] = {
"choices":[("No Reference Image", ""),("First Image is a Reference Image, and then the next ones (up to two) are Style Images", "KI"),
("Up to two Images are Style Images", "KIJ")],
"default": "KI",
"letters_filter": "KIJ",
"label": "Reference Images / Style Images"
}
model_def_output["lock_image_refs_ratios"] = True
return model_def_output
if flux_kontext:
extra_model_def["inpaint_support"] = True
extra_model_def["image_ref_choices"] = {
"choices": [
("None", ""),
("Conditional Images is first Main Subject / Landscape and may be followed by People / Objects", "KI"),
("Conditional Images are People / Objects", "I"),
],
"letters_filter": "KI",
}
extra_model_def["background_removal_label"]= "Remove Backgrounds only behind People / Objects except main Subject / Landscape"
elif flux_umo:
extra_model_def["image_ref_choices"] = {
"choices": [
("Conditional Images are People / Objects", "I"),
],
"letters_filter": "I",
"visible": False
}
extra_model_def["lock_image_refs_ratios"] = True
return extra_model_def
@staticmethod
def query_supported_types():
@ -118,15 +142,28 @@ class family_handler():
video_prompt_type = video_prompt_type.replace("I", "KI")
ui_defaults["video_prompt_type"] = video_prompt_type
if settings_version < 2.34:
ui_defaults["denoising_strength"] = 1.
@staticmethod
def update_default_settings(base_model_type, model_def, ui_defaults):
flux_model = model_def.get("flux-model", "flux-dev")
flux_uso = flux_model == "flux-dev-uso"
flux_umo = flux_model == "flux-dev-umo"
flux_kontext = flux_model == "flux-dev-kontext"
ui_defaults.update({
"embedded_guidance": 2.5,
})
if model_def.get("reference_image", False):
if flux_kontext or flux_uso:
ui_defaults.update({
"video_prompt_type": "KI",
"denoising_strength": 1.,
})
elif flux_umo:
ui_defaults.update({
"video_prompt_type": "I",
"remove_background_images_ref": 0,
})

View File

@ -23,44 +23,35 @@ from .util import (
)
from PIL import Image
def preprocess_ref(raw_image: Image.Image, long_size: int = 512):
# Get the width and height of the original image
image_w, image_h = raw_image.size
def resize_and_centercrop_image(image, target_height_ref1, target_width_ref1):
target_height_ref1 = int(target_height_ref1 // 64 * 64)
target_width_ref1 = int(target_width_ref1 // 64 * 64)
h, w = image.shape[-2:]
if h < target_height_ref1 or w < target_width_ref1:
# Compute the aspect ratio
aspect_ratio = w / h
if h < target_height_ref1:
new_h = target_height_ref1
new_w = new_h * aspect_ratio
if new_w < target_width_ref1:
new_w = target_width_ref1
new_h = new_w / aspect_ratio
else:
new_w = target_width_ref1
new_h = new_w / aspect_ratio
if new_h < target_height_ref1:
new_h = target_height_ref1
new_w = new_h * aspect_ratio
# Determine the long and short sides
if image_w >= image_h:
new_w = long_size
new_h = int((long_size / image_w) * image_h)
else:
aspect_ratio = w / h
tgt_aspect_ratio = target_width_ref1 / target_height_ref1
if aspect_ratio > tgt_aspect_ratio:
new_h = target_height_ref1
new_w = new_h * aspect_ratio
else:
new_w = target_width_ref1
new_h = new_w / aspect_ratio
# Resize the image with TVF.resize
image = TVF.resize(image, (math.ceil(new_h), math.ceil(new_w)))
# Compute the center-crop parameters
top = (image.shape[-2] - target_height_ref1) // 2
left = (image.shape[-1] - target_width_ref1) // 2
# Center crop with TVF.crop
image = TVF.crop(image, top, left, target_height_ref1, target_width_ref1)
return image
new_h = long_size
new_w = int((long_size / image_h) * image_w)
# Resize proportionally to the new width and height
raw_image = raw_image.resize((new_w, new_h), resample=Image.LANCZOS)
target_w = new_w // 16 * 16
target_h = new_h // 16 * 16
# Compute the crop origin for center cropping
left = (new_w - target_w) // 2
top = (new_h - target_h) // 2
right = left + target_w
bottom = top + target_h
# Perform the center crop
raw_image = raw_image.crop((left, top, right, bottom))
# Convert to RGB mode
raw_image = raw_image.convert("RGB")
return raw_image
def stitch_images(img1, img2):
# Resize img2 to match img1's height
@ -105,7 +96,7 @@ class model_factory:
# self.name= "flux-schnell"
source = model_def.get("source", None)
self.model = load_flow_model(self.name, model_filename[0] if source is None else source, torch_device)
self.model_def = model_def
self.vae = load_ae(self.name, device=torch_device)
siglip_processor = siglip_model = feature_embedder = None
@ -151,6 +142,8 @@ class model_factory:
n_prompt: str = None,
sampling_steps: int = 20,
input_ref_images = None,
image_guide= None,
image_mask= None,
width= 832,
height=480,
embedded_guidance_scale: float = 2.5,
@ -162,6 +155,7 @@ class model_factory:
video_prompt_type = "",
joint_pass = False,
image_refs_relative_size = 100,
denoising_strength = 1.,
**bbargs
):
if self._interrupt:
@ -170,10 +164,16 @@ class model_factory:
if n_prompt is None or len(n_prompt) == 0: n_prompt = "low quality, ugly, unfinished, out of focus, deformed, disfigure, blurry, smudged, restricted palette, flat colors"
device="cuda"
flux_dev_uso = self.name in ['flux-dev-uso']
image_stiching = not self.name in ['flux-dev-uso'] #and False
# image_refs_relative_size = 100
crop = False
flux_dev_umo = self.name in ['flux-dev-umo']
latent_stiching = self.name in ['flux-dev-uso', 'flux-dev-umo']
lock_dimensions= False
input_ref_images = [] if input_ref_images is None else input_ref_images[:]
if flux_dev_umo:
ref_long_side = 512 if len(input_ref_images) <= 1 else 320
input_ref_images = [preprocess_ref(img, ref_long_side) for img in input_ref_images]
lock_dimensions = True
ref_style_imgs = []
if "I" in video_prompt_type and len(input_ref_images) > 0:
if flux_dev_uso :
@ -183,43 +183,26 @@ class model_factory:
elif len(input_ref_images) > 1 :
ref_style_imgs = input_ref_images[-1:]
input_ref_images = input_ref_images[:-1]
if image_stiching:
if latent_stiching:
# latents stiching with resize
if not lock_dimensions :
for i in range(len(input_ref_images)):
w, h = input_ref_images[i].size
image_height, image_width = calculate_new_dimensions(int(height*image_refs_relative_size/100), int(width*image_refs_relative_size/100), h, w, 0)
input_ref_images[i] = input_ref_images[i].resize((image_width, image_height), resample=Image.Resampling.LANCZOS)
else:
# image stiching method
stiched = input_ref_images[0]
if "K" in video_prompt_type :
w, h = input_ref_images[0].size
height, width = calculate_new_dimensions(height, width, h, w, fit_into_canvas)
# actual rescale will happen in prepare_kontext
for new_img in input_ref_images[1:]:
stiched = stitch_images(stiched, new_img)
input_ref_images = [stiched]
else:
first_ref = 0
if "K" in video_prompt_type:
# image latents tiling method
w, h = input_ref_images[0].size
if crop :
img = convert_image_to_tensor(input_ref_images[0])
img = resize_and_centercrop_image(img, height, width)
input_ref_images[0] = convert_tensor_to_image(img)
else:
height, width = calculate_new_dimensions(height, width, h, w, fit_into_canvas)
input_ref_images[0] = input_ref_images[0].resize((width, height), resample=Image.Resampling.LANCZOS)
first_ref = 1
for i in range(first_ref,len(input_ref_images)):
w, h = input_ref_images[i].size
if crop:
img = convert_image_to_tensor(input_ref_images[i])
img = resize_and_centercrop_image(img, int(height*image_refs_relative_size/100), int(width*image_refs_relative_size/100))
input_ref_images[i] = convert_tensor_to_image(img)
else:
image_height, image_width = calculate_new_dimensions(int(height*image_refs_relative_size/100), int(width*image_refs_relative_size/100), h, w, fit_into_canvas)
input_ref_images[i] = input_ref_images[i].resize((image_width, image_height), resample=Image.Resampling.LANCZOS)
elif image_guide is not None:
input_ref_images = [image_guide]
else:
input_ref_images = None
if flux_dev_uso :
if self.name in ['flux-dev-uso', 'flux-dev-umo'] :
inp, height, width = prepare_multi_ip(
ae=self.vae,
img_cond_list=input_ref_images,
@ -238,6 +221,7 @@ class model_factory:
bs=batch_size,
seed=seed,
device=device,
img_mask=image_mask,
)
inp.update(prepare_prompt(self.t5, self.clip, batch_size, input_prompt))
@ -259,13 +243,19 @@ class model_factory:
return unpack(x.float(), height, width)
# denoise initial noise
x = denoise(self.model, **inp, timesteps=timesteps, guidance=embedded_guidance_scale, real_guidance_scale =guide_scale, callback=callback, pipeline=self, loras_slists= loras_slists, unpack_latent = unpack_latent, joint_pass = joint_pass)
x = denoise(self.model, **inp, timesteps=timesteps, guidance=embedded_guidance_scale, real_guidance_scale =guide_scale, callback=callback, pipeline=self, loras_slists= loras_slists, unpack_latent = unpack_latent, joint_pass = joint_pass, denoising_strength = denoising_strength)
if x==None: return None
# decode latents to pixel space
x = unpack_latent(x)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
x = self.vae.decode(x)
if image_mask is not None:
from shared.utils.utils import convert_image_to_tensor
img_msk_rebuilt = inp["img_msk_rebuilt"]
img= convert_image_to_tensor(image_guide)
x = img.squeeze(2) * (1 - img_msk_rebuilt) + x.to(img) * img_msk_rebuilt
x = x.clamp(-1, 1)
x = x.transpose(0, 1)
return x

View File

@ -190,6 +190,21 @@ class Flux(nn.Module):
v = swap_scale_shift(v)
k = k.replace("norm_out.linear", "final_layer.adaLN_modulation.1")
new_sd[k] = v
# elif not first_key.startswith("diffusion_model.") and not first_key.startswith("transformer."):
# for k,v in sd.items():
# if "double" in k:
# k = k.replace(".processor.proj_lora1.", ".img_attn.proj.lora_")
# k = k.replace(".processor.proj_lora2.", ".txt_attn.proj.lora_")
# k = k.replace(".processor.qkv_lora1.", ".img_attn.qkv.lora_")
# k = k.replace(".processor.qkv_lora2.", ".txt_attn.qkv.lora_")
# else:
# k = k.replace(".processor.qkv_lora.", ".linear1_qkv.lora_")
# k = k.replace(".processor.proj_lora.", ".linear2.lora_")
# k = "diffusion_model." + k
# new_sd[k] = v
# from mmgp import safetensors2
# safetensors2.torch_write_file(new_sd, "fff.safetensors")
else:
new_sd = sd
return new_sd

View File

@ -138,10 +138,12 @@ def prepare_kontext(
target_width: int | None = None,
target_height: int | None = None,
bs: int = 1,
img_mask = None,
) -> tuple[dict[str, Tensor], int, int]:
# load and encode the conditioning image
res_match_output = img_mask is not None
img_cond_seq = None
img_cond_seq_ids = None
if img_cond_list == None: img_cond_list = []
@ -150,9 +152,11 @@ def prepare_kontext(
for cond_no, img_cond in enumerate(img_cond_list):
width, height = img_cond.size
aspect_ratio = width / height
# Kontext is trained on specific resolutions, using one of them is recommended
_, width, height = min((abs(aspect_ratio - w / h), w, h) for w, h in PREFERED_KONTEXT_RESOLUTIONS)
if res_match_output:
width, height = target_width, target_height
else:
# Kontext is trained on specific resolutions, using one of them is recommended
_, width, height = min((abs(aspect_ratio - w / h), w, h) for w, h in PREFERED_KONTEXT_RESOLUTIONS)
width = 2 * int(width / 16)
height = 2 * int(height / 16)
@ -193,6 +197,19 @@ def prepare_kontext(
"img_cond_seq": img_cond_seq,
"img_cond_seq_ids": img_cond_seq_ids,
}
if img_mask is not None:
from shared.utils.utils import convert_image_to_tensor, convert_tensor_to_image
# image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
image_mask_latents = convert_image_to_tensor(img_mask.resize((target_width // 16, target_height // 16), resample=Image.Resampling.LANCZOS))
image_mask_latents = torch.where(image_mask_latents>-0.5, 1., 0. )[0:1]
image_mask_rebuilt = image_mask_latents.repeat_interleave(16, dim=-1).repeat_interleave(16, dim=-2).unsqueeze(0)
# convert_tensor_to_image( image_mask_rebuilt.squeeze(0).repeat(3,1,1)).save("mmm.png")
image_mask_latents = image_mask_latents.reshape(1, -1, 1).to(device)
return_dict.update({
"img_msk_latents": image_mask_latents,
"img_msk_rebuilt": image_mask_rebuilt,
})
img = get_noise(
bs,
target_height,
@ -264,6 +281,9 @@ def denoise(
loras_slists=None,
unpack_latent = None,
joint_pass= False,
img_msk_latents = None,
img_msk_rebuilt = None,
denoising_strength = 1,
):
kwargs = {'pipeline': pipeline, 'callback': callback, "img_len" : img.shape[1], "siglip_embedding": siglip_embedding, "siglip_embedding_ids": siglip_embedding_ids}
@ -271,6 +291,21 @@ def denoise(
if callback != None:
callback(-1, None, True)
original_image_latents = None if img_cond_seq is None else img_cond_seq.clone()
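# Inpainting / partial denoising: when an image mask is provided and denoising_strength < 1, skip the earliest timesteps and start from the original image latents re-noised to that level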
morph, first_step = False, 0
if img_msk_latents is not None:
randn = torch.randn_like(original_image_latents)
if denoising_strength < 1.:
first_step = int(len(timesteps) * (1. - denoising_strength))
if not morph:
latent_noise_factor = timesteps[first_step]
latents = original_image_latents * (1.0 - latent_noise_factor) + randn * latent_noise_factor
img = latents.to(img)
latents = None
timesteps = timesteps[first_step:]
updated_num_steps= len(timesteps) -1
if callback != None:
from shared.utils.loras_mutipliers import update_loras_slists
@ -280,10 +315,14 @@ def denoise(
# this is ignored for schnell
guidance_vec = torch.full((img.shape[0],), guidance, device=img.device, dtype=img.dtype)
for i, (t_curr, t_prev) in enumerate(zip(timesteps[:-1], timesteps[1:])):
offload.set_step_no_for_lora(model, i)
offload.set_step_no_for_lora(model, first_step + i)
if pipeline._interrupt:
return None
if img_msk_latents is not None and denoising_strength <1. and i == first_step and morph:
latent_noise_factor = t_curr/1000
img = original_image_latents * (1.0 - latent_noise_factor) + img * latent_noise_factor
t_vec = torch.full((img.shape[0],), t_curr, dtype=img.dtype, device=img.device)
img_input = img
img_input_ids = img_ids
@ -333,6 +372,14 @@ def denoise(
pred = neg_pred + real_guidance_scale * (pred - neg_pred)
img += (t_prev - t_curr) * pred
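# Outside the mask, re-anchor the latents to a re-noised copy of the original image so only the masked area is actually denoised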
if img_msk_latents is not None:
latent_noise_factor = t_prev
# noisy_image = original_image_latents * (1.0 - latent_noise_factor) + torch.randn_like(original_image_latents) * latent_noise_factor
noisy_image = original_image_latents * (1.0 - latent_noise_factor) + randn * latent_noise_factor
img = noisy_image * (1-img_msk_latents) + img_msk_latents * img
noisy_image = None
if callback is not None:
preview = unpack_latent(img).transpose(0,1)
callback(i, preview, False)

View File

@ -640,6 +640,38 @@ configs = {
shift_factor=0.1159,
),
),
"flux-dev-umo": ModelSpec(
repo_id="",
repo_flow="",
repo_ae="ckpts/flux_vae.safetensors",
params=FluxParams(
in_channels=64,
out_channels=64,
vec_in_dim=768,
context_in_dim=4096,
hidden_size=3072,
mlp_ratio=4.0,
num_heads=24,
depth=19,
depth_single_blocks=38,
axes_dim=[16, 56, 56],
theta=10_000,
qkv_bias=True,
guidance_embed=True,
eso= True,
),
ae_params=AutoEncoderParams(
resolution=256,
in_channels=3,
ch=128,
out_ch=3,
ch_mult=[1, 2, 4, 4],
num_res_blocks=2,
z_channels=16,
scale_factor=0.3611,
shift_factor=0.1159,
),
),
}

View File

@ -861,11 +861,6 @@ class HunyuanVideoSampler(Inference):
freqs_cos, freqs_sin = self.get_rotary_pos_embed(target_frame_num, target_height, target_width, enable_RIFLEx)
else:
if self.avatar:
w, h = input_ref_images.size
target_height, target_width = calculate_new_dimensions(target_height, target_width, h, w, fit_into_canvas)
if target_width != w or target_height != h:
input_ref_images = input_ref_images.resize((target_width,target_height), resample=Image.Resampling.LANCZOS)
concat_dict = {'mode': 'timecat', 'bias': -1}
freqs_cos, freqs_sin = self.get_rotary_pos_embed_new(129, target_height, target_width, concat_dict)
else:

View File

@ -51,6 +51,23 @@ class family_handler():
extra_model_def["tea_cache"] = True
extra_model_def["mag_cache"] = True
if base_model_type in ["hunyuan_custom_edit"]:
extra_model_def["guide_preprocessing"] = {
"selection": ["MV", "PV"],
}
extra_model_def["mask_preprocessing"] = {
"selection": ["A", "NA"],
"default" : "NA"
}
if base_model_type in ["hunyuan_custom_audio", "hunyuan_custom_edit", "hunyuan_custom"]:
extra_model_def["image_ref_choices"] = {
"choices": [("Reference Image", "I")],
"letters_filter":"I",
"visible": False,
}
if base_model_type in ["hunyuan_avatar"]: extra_model_def["no_background_removal"] = True
if base_model_type in ["hunyuan_custom", "hunyuan_custom_edit", "hunyuan_custom_audio", "hunyuan_avatar"]:
@ -141,6 +158,18 @@ class family_handler():
return hunyuan_model, pipe
@staticmethod
def fix_settings(base_model_type, settings_version, model_def, ui_defaults):
if settings_version<2.33:
if base_model_type in ["hunyuan_custom_edit"]:
video_prompt_type= ui_defaults["video_prompt_type"]
if "P" in video_prompt_type and "M" in video_prompt_type:
video_prompt_type = video_prompt_type.replace("M","")
ui_defaults["video_prompt_type"] = video_prompt_type
pass
@staticmethod
def update_default_settings(base_model_type, model_def, ui_defaults):
ui_defaults["embedded_guidance_scale"]= 6.0

View File

@ -300,9 +300,6 @@ class LTXV:
prefix_size, height, width = input_video.shape[-3:]
else:
if image_start != None:
frame_width, frame_height = image_start.size
if fit_into_canvas != None:
height, width = calculate_new_dimensions(height, width, frame_height, frame_width, fit_into_canvas, 32)
conditioning_media_paths.append(image_start.unsqueeze(1))
conditioning_start_frames.append(0)
conditioning_control_frames.append(False)

View File

@ -26,6 +26,15 @@ class family_handler():
extra_model_def["sliding_window"] = True
extra_model_def["image_prompt_types_allowed"] = "TSEV"
extra_model_def["guide_preprocessing"] = {
"selection": ["", "PV", "DV", "EV", "V"],
"labels" : { "V": "Use LTXV raw format"}
}
extra_model_def["mask_preprocessing"] = {
"selection": ["", "A", "NA", "XA", "XNA"],
}
return extra_model_def
@staticmethod

View File

@ -28,7 +28,7 @@ from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2Tokenizer, Aut
from .autoencoder_kl_qwenimage import AutoencoderKLQwenImage
from diffusers import FlowMatchEulerDiscreteScheduler
from PIL import Image
from shared.utils.utils import calculate_new_dimensions
from shared.utils.utils import calculate_new_dimensions, convert_image_to_tensor, convert_tensor_to_image
XLA_AVAILABLE = False
@ -563,6 +563,8 @@ class QwenImagePipeline(): #DiffusionPipeline
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
max_sequence_length: int = 512,
image = None,
image_mask = None,
denoising_strength = 0,
callback=None,
pipeline=None,
loras_slists=None,
@ -683,6 +685,7 @@ class QwenImagePipeline(): #DiffusionPipeline
device = "cuda"
prompt_image = None
image_mask_latents = None
if image is not None and not (isinstance(image, torch.Tensor) and image.size(1) == self.latent_channels):
image = image[0] if isinstance(image, list) else image
image_height, image_width = self.image_processor.get_default_height_width(image)
@ -694,14 +697,32 @@ class QwenImagePipeline(): #DiffusionPipeline
image_width = image_width // multiple_of * multiple_of
image_height = image_height // multiple_of * multiple_of
ref_height, ref_width = 1568, 672
if height * width < ref_height * ref_width: ref_height , ref_width = height , width
if image_height * image_width > ref_height * ref_width:
image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
image = image.resize((image_width,image_height), resample=Image.Resampling.LANCZOS)
if image_mask is None:
if height * width < ref_height * ref_width: ref_height , ref_width = height , width
if image_height * image_width > ref_height * ref_width:
image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
if (image_width,image_height) != image.size:
image = image.resize((image_width,image_height), resample=Image.Resampling.LANCZOS)
else:
# _, image_width, image_height = min(
# (abs(aspect_ratio - w / h), w, h) for w, h in PREFERRED_QWENIMAGE_RESOLUTIONS
# )
image_height, image_width = calculate_new_dimensions(height, width, image_height, image_width, False, block_size=multiple_of)
# image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
height, width = image_height, image_width
image_mask_latents = convert_image_to_tensor(image_mask.resize((width // 16, height // 16), resample=Image.Resampling.LANCZOS))
image_mask_latents = torch.where(image_mask_latents>-0.5, 1., 0. )[0:1]
image_mask_rebuilt = image_mask_latents.repeat_interleave(16, dim=-1).repeat_interleave(16, dim=-2).unsqueeze(0)
# convert_tensor_to_image( image_mask_rebuilt.squeeze(0).repeat(3,1,1)).save("mmm.png")
image_mask_latents = image_mask_latents.reshape(1, -1, 1).to(device)
prompt_image = image
image = self.image_processor.preprocess(image, image_height, image_width)
image = image.unsqueeze(2)
if image.size != (image_width, image_height):
image = image.resize((image_width, image_height), resample=Image.Resampling.LANCZOS)
# image.save("nnn.png")
image = convert_image_to_tensor(image).unsqueeze(0).unsqueeze(2)
has_neg_prompt = negative_prompt is not None or (
negative_prompt_embeds is not None and negative_prompt_embeds_mask is not None
@ -744,6 +765,8 @@ class QwenImagePipeline(): #DiffusionPipeline
generator,
latents,
)
original_image_latents = None if image_latents is None else image_latents.clone()
if image is not None:
img_shapes = [
[
@ -788,6 +811,18 @@ class QwenImagePipeline(): #DiffusionPipeline
negative_txt_seq_lens = (
negative_prompt_embeds_mask.sum(dim=1).tolist() if negative_prompt_embeds_mask is not None else None
)
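# Same inpainting / partial denoising setup as in the Flux pipeline: optionally skip the first steps and start from the original image latents re-noised to that level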
morph, first_step = False, 0
if image_mask_latents is not None:
randn = torch.randn_like(original_image_latents)
if denoising_strength < 1.:
first_step = int(len(timesteps) * (1. - denoising_strength))
if not morph:
latent_noise_factor = timesteps[first_step]/1000
# latents = original_image_latents * (1.0 - latent_noise_factor) + torch.randn_like(original_image_latents) * latent_noise_factor
latents = original_image_latents * (1.0 - latent_noise_factor) + randn * latent_noise_factor
timesteps = timesteps[first_step:]
self.scheduler.timesteps = timesteps
self.scheduler.sigmas= self.scheduler.sigmas[first_step:]
# 6. Denoising loop
self.scheduler.set_begin_index(0)
@ -797,10 +832,16 @@ class QwenImagePipeline(): #DiffusionPipeline
update_loras_slists(self.transformer, loras_slists, updated_num_steps)
callback(-1, None, True, override_num_inference_steps = updated_num_steps)
for i, t in enumerate(timesteps):
offload.set_step_no_for_lora(self.transformer, first_step + i)
if self.interrupt:
continue
if image_mask_latents is not None and denoising_strength <1. and i == first_step and morph:
latent_noise_factor = t/1000
latents = original_image_latents * (1.0 - latent_noise_factor) + latents * latent_noise_factor
self._current_timestep = t
# broadcast to batch dimension in a way that's compatible with ONNX/Core ML
timestep = t.expand(latents.shape[0]).to(latents.dtype)
@ -865,6 +906,13 @@ class QwenImagePipeline(): #DiffusionPipeline
# compute the previous noisy sample x_t -> x_t-1
latents_dtype = latents.dtype
latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
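# Keep regions outside the mask anchored to the original image latents at the matching noise level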
if image_mask_latents is not None:
next_t = timesteps[i+1] if i<len(timesteps)-1 else 0
latent_noise_factor = next_t / 1000
# noisy_image = original_image_latents * (1.0 - latent_noise_factor) + torch.randn_like(original_image_latents) * latent_noise_factor
noisy_image = original_image_latents * (1.0 - latent_noise_factor) + randn * latent_noise_factor
latents = noisy_image * (1-image_mask_latents) + image_mask_latents * latents
noisy_image = None
if latents.dtype != latents_dtype:
if torch.backends.mps.is_available():
@ -878,7 +926,7 @@ class QwenImagePipeline(): #DiffusionPipeline
self._current_timestep = None
if output_type == "latent":
image = latents
output_image = latents
else:
latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
latents = latents.to(self.vae.dtype)
@ -891,7 +939,9 @@ class QwenImagePipeline(): #DiffusionPipeline
latents.device, latents.dtype
)
latents = latents / latents_std + latents_mean
image = self.vae.decode(latents, return_dict=False)[0][:, :, 0]
output_image = self.vae.decode(latents, return_dict=False)[0][:, :, 0]
if image_mask is not None:
output_image = image.squeeze(2) * (1 - image_mask_rebuilt) + output_image.to(image) * image_mask_rebuilt
return image
return output_image

View File

@ -9,7 +9,7 @@ def get_qwen_text_encoder_filename(text_encoder_quantization):
class family_handler():
@staticmethod
def query_model_def(base_model_type, model_def):
model_def_output = {
extra_model_def = {
"image_outputs" : True,
"sample_solvers":[
("Default", "default"),
@ -18,8 +18,19 @@ class family_handler():
"lock_image_refs_ratios": True,
}
if base_model_type in ["qwen_image_edit_20B"]:
extra_model_def["inpaint_support"] = True
extra_model_def["image_ref_choices"] = {
"choices": [
("None", ""),
("Conditional Images is first Main Subject / Landscape and may be followed by People / Objects", "KI"),
("Conditional Images are People / Objects", "I"),
],
"letters_filter": "KI",
}
extra_model_def["background_removal_label"]= "Remove Backgrounds only behind People / Objects except main Subject / Landscape"
return model_def_output
return extra_model_def
@staticmethod
def query_supported_types():
@ -75,14 +86,18 @@ class family_handler():
if ui_defaults.get("sample_solver", "") == "":
ui_defaults["sample_solver"] = "default"
if settings_version < 2.32:
ui_defaults["denoising_strength"] = 1.
@staticmethod
def update_default_settings(base_model_type, model_def, ui_defaults):
ui_defaults.update({
"guidance_scale": 4,
"sample_solver": "default",
})
if model_def.get("reference_image", False):
if base_model_type in ["qwen_image_edit_20B"]:
ui_defaults.update({
"video_prompt_type": "KI",
"denoising_strength" : 1.,
})

View File

@ -103,6 +103,8 @@ class model_factory():
n_prompt = None,
sampling_steps: int = 20,
input_ref_images = None,
image_guide= None,
image_mask= None,
width= 832,
height=480,
guide_scale: float = 4,
@ -114,6 +116,7 @@ class model_factory():
VAE_tile_size = None,
joint_pass = True,
sample_solver='default',
denoising_strength = 1.,
**bbargs
):
# Generate with different aspect ratios
@ -174,8 +177,9 @@ class model_factory():
if n_prompt is None or len(n_prompt) == 0:
n_prompt= "text, watermark, copyright, blurry, low resolution"
if input_ref_images is not None:
if image_guide is not None:
input_ref_images = [image_guide]
elif input_ref_images is not None:
# image stiching method
stiched = input_ref_images[0]
if "K" in video_prompt_type :
@ -190,6 +194,7 @@ class model_factory():
prompt=input_prompt,
negative_prompt=n_prompt,
image = input_ref_images,
image_mask = image_mask,
width=width,
height=height,
num_inference_steps=sampling_steps,
@ -199,6 +204,7 @@ class model_factory():
pipeline=self,
loras_slists=loras_slists,
joint_pass = joint_pass,
denoising_strength=denoising_strength,
generator=torch.Generator(device="cuda").manual_seed(seed)
)
if image is None: return None

View File

@ -261,7 +261,7 @@ class WanAny2V:
def vace_latent(self, z, m):
return [torch.cat([zz, mm], dim=0) for zz, mm in zip(z, m)]
def fit_image_into_canvas(self, ref_img, image_size, canvas_tf_bg, device, fill_max = False, outpainting_dims = None, return_mask = False):
def fit_image_into_canvas(self, ref_img, image_size, canvas_tf_bg, device, full_frame = False, outpainting_dims = None, return_mask = False):
from shared.utils.utils import save_image
ref_width, ref_height = ref_img.size
if (ref_height, ref_width) == image_size and outpainting_dims == None:
@ -270,18 +270,23 @@ class WanAny2V:
else:
if outpainting_dims != None:
final_height, final_width = image_size
canvas_height, canvas_width, margin_top, margin_left = get_outpainting_frame_location(final_height, final_width, outpainting_dims, 8)
canvas_height, canvas_width, margin_top, margin_left = get_outpainting_frame_location(final_height, final_width, outpainting_dims, 1)
else:
canvas_height, canvas_width = image_size
scale = min(canvas_height / ref_height, canvas_width / ref_width)
new_height = int(ref_height * scale)
new_width = int(ref_width * scale)
if fill_max and (canvas_height - new_height) < 16:
if full_frame:
new_height = canvas_height
if fill_max and (canvas_width - new_width) < 16:
new_width = canvas_width
top = (canvas_height - new_height) // 2
left = (canvas_width - new_width) // 2
top = left = 0
else:
# if fill_max and (canvas_height - new_height) < 16:
# new_height = canvas_height
# if fill_max and (canvas_width - new_width) < 16:
# new_width = canvas_width
scale = min(canvas_height / ref_height, canvas_width / ref_width)
new_height = int(ref_height * scale)
new_width = int(ref_width * scale)
top = (canvas_height - new_height) // 2
left = (canvas_width - new_width) // 2
ref_img = ref_img.resize((new_width, new_height), resample=Image.Resampling.LANCZOS)
ref_img = TF.to_tensor(ref_img).sub_(0.5).div_(0.5).unsqueeze(1)
if outpainting_dims != None:
@ -302,7 +307,7 @@ class WanAny2V:
canvas = canvas.to(device)
return ref_img.to(device), canvas
def prepare_source(self, src_video, src_mask, src_ref_images, total_frames, image_size, device, keep_video_guide_frames= [], start_frame = 0, fit_into_canvas = None, pre_src_video = None, inject_frames = [], outpainting_dims = None, any_background_ref = False):
def prepare_source(self, src_video, src_mask, src_ref_images, total_frames, image_size, device, keep_video_guide_frames= [], start_frame = 0, pre_src_video = None, inject_frames = [], outpainting_dims = None, any_background_ref = False):
image_sizes = []
trim_video_guide = len(keep_video_guide_frames)
def conv_tensor(t, device):
@ -533,22 +538,16 @@ class WanAny2V:
any_end_frame = False
if image_start is None:
if infinitetalk:
new_shot = "Q" in video_prompt_type
if input_frames is not None:
image_ref = input_frames[:, 0]
if input_video is None: input_video = input_frames[:, 0:1]
new_shot = "Q" in video_prompt_type
else:
if pre_video_frame is None:
new_shot = True
else:
if input_ref_images is None:
input_ref_images, new_shot = [pre_video_frame], False
else:
input_ref_images, new_shot = [img.resize(pre_video_frame.size, resample=Image.Resampling.LANCZOS) for img in input_ref_images], "Q" in video_prompt_type
if input_ref_images is None: raise Exception("Missing Reference Image")
if input_ref_images is None:
if pre_video_frame is None: raise Exception("Missing Reference Image")
input_ref_images, new_shot = [pre_video_frame], False
new_shot = new_shot and window_no <= len(input_ref_images)
image_ref = convert_image_to_tensor(input_ref_images[ min(window_no, len(input_ref_images))-1 ])
if new_shot:
if new_shot or input_video is None:
input_video = image_ref.unsqueeze(1)
else:
color_correction_strength = 0 #disable color correction as transition frames between shots may have a completely different color level than the colors of the new shot
@ -847,7 +846,7 @@ class WanAny2V:
for i, t in enumerate(tqdm(timesteps)):
guide_scale, guidance_switch_done, trans, denoising_extra = update_guidance(i, t, guide_scale, guide2_scale, guidance_switch_done, switch_threshold, trans, 2, denoising_extra)
guide_scale, guidance_switch2_done, trans, denoising_extra = update_guidance(i, t, guide_scale, guide3_scale, guidance_switch2_done, switch2_threshold, trans, 3, denoising_extra)
offload.set_step_no_for_lora(trans, i)
offload.set_step_no_for_lora(trans, start_step_no + i)
timestep = torch.stack([t])
if timestep_injection:

View File

@ -35,7 +35,7 @@ class family_handler():
"label" : "Generation Type"
}
extra_model_def["image_prompt_types_allowed"] = "TSEV"
extra_model_def["image_prompt_types_allowed"] = "TSV"
return extra_model_def
@ -66,7 +66,11 @@ class family_handler():
def query_family_infos():
return {}
@staticmethod
def get_rgb_factors(base_model_type ):
from shared.RGB_factors import get_rgb_factors
latent_rgb_factors, latent_rgb_factors_bias = get_rgb_factors("wan", base_model_type)
return latent_rgb_factors, latent_rgb_factors_bias
@staticmethod
def query_model_files(computeList, base_model_type, model_filename, text_encoder_quantization):

View File

@ -110,18 +110,79 @@ class family_handler():
"tea_cache" : not (base_model_type in ["i2v_2_2", "ti2v_2_2" ] or multiple_submodels),
"mag_cache" : True,
"keep_frames_video_guide_not_supported": base_model_type in ["infinitetalk"],
"convert_image_guide_to_video" : True,
"sample_solvers":[
("unipc", "unipc"),
("euler", "euler"),
("dpm++", "dpm++"),
("flowmatch causvid", "causvid"), ]
})
if base_model_type in ["t2v"]:
extra_model_def["guide_custom_choices"] = {
"choices":[("Use Text Prompt Only", ""),("Video to Video guided by Text Prompt", "GUV")],
"default": "",
"letters_filter": "GUV",
"label": "Video to Video"
}
if base_model_type in ["infinitetalk"]:
extra_model_def["no_background_removal"] = True
# extra_model_def["at_least_one_image_ref_needed"] = True
extra_model_def["all_image_refs_are_background_ref"] = True
extra_model_def["guide_custom_choices"] = {
"choices":[
("Images to Video, each Reference Image will start a new shot with a new Sliding Window - Sharp Transitions", "QKI"),
("Images to Video, each Reference Image will start a new shot with a new Sliding Window - Smooth Transitions", "KI"),
("Sparse Video to Video, one Image will by extracted from Video for each new Sliding Window - Sharp Transitions", "QRUV"),
("Sparse Video to Video, one Image will by extracted from Video for each new Sliding Window - Smooth Transitions", "RUV"),
("Video to Video, amount of motion transferred depends on Denoising Strength - Sharp Transitions", "GQUV"),
("Video to Video, amount of motion transferred depends on Denoising Strength - Smooth Transitions", "GUV"),
],
"default": "KI",
"letters_filter": "RGUVQKI",
"label": "Video to Video",
"show_label" : False,
}
# extra_model_def["at_least_one_image_ref_needed"] = True
if vace_class:
extra_model_def["guide_preprocessing"] = {
"selection": ["", "UV", "PV", "DV", "SV", "LV", "CV", "MV", "V", "PDV", "PSV", "PLV" , "DSV", "DLV", "SLV"],
"labels" : { "V": "Use Vace raw format"}
}
extra_model_def["mask_preprocessing"] = {
"selection": ["", "A", "NA", "XA", "XNA", "YA", "YNA", "WA", "WNA", "ZA", "ZNA"],
}
extra_model_def["image_ref_choices"] = {
"choices": [("None", ""),
("Inject only People / Objects", "I"),
("Inject Landscape and then People / Objects", "KI"),
("Inject Frames and then People / Objects", "FI"),
],
"letters_filter": "KFI",
}
if base_model_type in ["standin"] or vace_class:
extra_model_def["lock_image_refs_ratios"] = True
extra_model_def["background_removal_label"]= "Remove Backgrounds behind People / Objects, keep it for Landscape or positioned Frames"
if base_model_type in ["standin"]:
extra_model_def["lock_image_refs_ratios"] = True
extra_model_def["image_ref_choices"] = {
"choices": [
("No Reference Image", ""),
("Reference Image is a Person Face", "I"),
],
"letters_filter":"I",
}
if base_model_type in ["phantom_1.3B", "phantom_14B"]:
extra_model_def["image_ref_choices"] = {
"choices": [("Reference Image", "I")],
"letters_filter":"I",
"visible": False,
}
if base_model_type in ["recam_1.3B"]:
extra_model_def["keep_frames_video_guide_not_supported"] = True
@ -141,6 +202,12 @@ class family_handler():
"default": 1,
"label" : "Camera Movement Type"
}
extra_model_def["guide_preprocessing"] = {
"selection": ["UV"],
"labels" : { "UV": "Control Video"},
"visible" : False,
}
if vace_class or base_model_type in ["infinitetalk"]:
image_prompt_types_allowed = "TVL"
elif base_model_type in ["ti2v_2_2"]:

View File

@ -7,7 +7,6 @@ import psutil
# import ffmpeg
import imageio
from PIL import Image
import cv2
import torch
import torch.nn.functional as F
@ -33,6 +32,8 @@ model_in_GPU = False
matanyone_in_GPU = False
bfloat16_supported = False
# SAM generator
import copy
class MaskGenerator():
def __init__(self, sam_checkpoint, device):
global args_device
@ -89,6 +90,7 @@ def get_frames_from_image(image_input, image_state):
"last_frame_numer": 0,
"fps": None
}
image_info = "Image Name: N/A,\nFPS: N/A,\nTotal Frames: {},\nImage Size:{}".format(len(frames), image_size)
set_image_encoder_patch()
select_SAM()
@ -717,27 +719,33 @@ def load_unload_models(selected):
def get_vmc_event_handler():
return load_unload_models
def export_to_vace_video_input(foreground_video_output):
gr.Info("Masked Video Input transferred to Vace For Inpainting")
return "V#" + str(time.time()), foreground_video_output
def export_image(image_refs, image_output):
gr.Info("Masked Image transferred to Current Video")
def export_image(state, image_output):
ui_settings = get_current_model_settings(state)
image_refs = ui_settings["image_refs"]
if image_refs == None:
image_refs =[]
image_refs.append( image_output)
return image_refs
ui_settings["image_refs"] = image_refs
gr.Info("Masked Image transferred to Current Image Generator")
return time.time()
def export_image_mask(image_input, image_mask):
gr.Info("Input Image & Mask transferred to Current Video")
return Image.fromarray(image_input), image_mask
def export_image_mask(state, image_input, image_mask):
ui_settings = get_current_model_settings(state)
ui_settings["image_guide"] = Image.fromarray(image_input)
ui_settings["image_mask"] = image_mask
gr.Info("Input Image & Mask transferred to Current Image Generator")
return time.time()
def export_to_current_video_engine( foreground_video_output, alpha_video_output):
def export_to_current_video_engine(state, foreground_video_output, alpha_video_output):
ui_settings = get_current_model_settings(state)
ui_settings["video_guide"] = foreground_video_output
ui_settings["video_mask"] = alpha_video_output
gr.Info("Original Video and Full Mask have been transferred")
# return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
return foreground_video_output, alpha_video_output
return time.time()
def teleport_to_video_tab(tab_state):
@ -746,15 +754,29 @@ def teleport_to_video_tab(tab_state):
return gr.Tabs(selected="video_gen")
def display(tabs, tab_state, server_config, vace_video_input, vace_image_input, vace_video_mask, vace_image_mask, vace_image_refs):
def display(tabs, tab_state, state, refresh_form_trigger, server_config, get_current_model_settings_fn): #, vace_video_input, vace_image_input, vace_video_mask, vace_image_mask, vace_image_refs):
# my_tab.select(fn=load_unload_models, inputs=[], outputs=[])
global image_output_codec, video_output_codec
global image_output_codec, video_output_codec, get_current_model_settings
get_current_model_settings = get_current_model_settings_fn
image_output_codec = server_config.get("image_output_codec", None)
video_output_codec = server_config.get("video_output_codec", None)
media_url = "https://github.com/pq-yang/MatAnyone/releases/download/media/"
click_brush_js = """
() => {
setTimeout(() => {
const brushButton = document.querySelector('button[aria-label="Brush"]');
if (brushButton) {
brushButton.click();
console.log('Brush button clicked');
} else {
console.log('Brush button not found');
}
}, 1000);
} """
# download assets
gr.Markdown("<B>Mast Edition is provided by MatAnyone and VRAM optimized by DeepBeepMeep</B>")
@ -871,7 +893,7 @@ def display(tabs, tab_state, server_config, vace_video_input, vace_image_input,
template_frame = gr.Image(label="Start Frame", type="pil",interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
with gr.Row():
clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, min_width=100)
add_mask_button = gr.Button(value="Set Mask", interactive=True, visible=False, min_width=100)
add_mask_button = gr.Button(value="Add Mask", interactive=True, visible=False, min_width=100)
remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, min_width=100) # no use
matting_button = gr.Button(value="Generate Video Matting", interactive=True, visible=False, min_width=100)
with gr.Row():
@ -892,7 +914,7 @@ def display(tabs, tab_state, server_config, vace_video_input, vace_image_input,
with gr.Row(visible= True):
export_to_current_video_engine_btn = gr.Button("Export to Control Video Input and Video Mask Input", visible= False)
export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [vace_video_input, vace_video_mask]).then( #video_prompt_video_guide_trigger,
export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [state, foreground_video_output, alpha_video_output], outputs= [refresh_form_trigger]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs])
@ -1089,10 +1111,10 @@ def display(tabs, tab_state, server_config, vace_video_input, vace_image_input,
# with gr.Column(scale=2, visible= True):
export_image_mask_btn = gr.Button(value="Set to Control Image & Mask", visible=False, elem_classes="new_button")
export_image_btn.click( fn=export_image, inputs= [vace_image_refs, foreground_image_output], outputs= [vace_image_refs]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs])
export_image_mask_btn.click( fn=export_image_mask, inputs= [image_input, alpha_image_output], outputs= [vace_image_input, vace_image_mask]).then( #video_prompt_video_guide_trigger,
export_image_btn.click( fn=export_image, inputs= [state, foreground_image_output], outputs= [refresh_form_trigger]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs])
export_image_mask_btn.click( fn=export_image_mask, inputs= [state, image_input, alpha_image_output], outputs= [refresh_form_trigger]).then( #video_prompt_video_guide_trigger,
fn=teleport_to_video_tab, inputs= [tab_state], outputs= [tabs]).then(fn=None, inputs=None, outputs=None, js=click_brush_js)
# first step: get the image information
extract_frames_button.click(
@ -1148,5 +1170,21 @@ def display(tabs, tab_state, server_config, vace_video_input, vace_image_input,
outputs=[foreground_image_output, alpha_image_output,foreground_image_output, alpha_image_output,bbox_info, export_image_btn, export_image_mask_btn]
)
nada = gr.State({})
# clear input
gr.on(
triggers=[image_input.clear], #image_input.change,
fn=restart,
inputs=[],
outputs=[
image_state,
interactive_state,
click_state,
foreground_image_output, alpha_image_output,
template_frame,
image_selection_slider, image_selection_slider, track_pause_number_slider,point_prompt, export_image_btn, export_image_mask_btn, bbox_info, clear_button_click,
add_mask_button, matting_button, template_frame, foreground_image_output, alpha_image_output, remove_mask_button, export_image_btn, export_image_mask_btn, mask_dropdown, nada, step2_title
],
queue=False,
show_progress=False)

View File

@ -2,7 +2,6 @@ import math
import torch
from typing import Optional, Union, Tuple
# @torch.jit.script
def get_similarity(mk: torch.Tensor,
ms: torch.Tensor,
@ -59,6 +58,7 @@ def get_similarity(mk: torch.Tensor,
del two_ab
# similarity = (-a_sq + two_ab)
similarity =similarity.float()
if ms is not None:
similarity *= ms
similarity /= math.sqrt(CK)

View File

@ -73,5 +73,5 @@ def matanyone(processor, frames_np, mask, r_erode=0, r_dilate=0, n_warmup=10):
if ti > (n_warmup-1):
frames.append((com_np*255).astype(np.uint8))
phas.append((pha*255).astype(np.uint8))
# phas.append(np.clip(pha * 255, 0, 255).astype(np.uint8))
return frames, phas

View File

@ -23,7 +23,7 @@ librosa==0.11.0
speechbrain==1.0.3
# UI & interaction
gradio==5.23.0
gradio==5.29.0
dashscope
loguru

View File

@ -4,6 +4,7 @@ from typing import Any, Dict, List, Optional, Sequence, Tuple, Union, Literal
import gradio as gr
import PIL
import time
from PIL import Image as PILImage
FilePath = str
@ -20,6 +21,9 @@ def get_list( objs):
return []
return [ obj[0] if isinstance(obj, tuple) else obj for obj in objs]
def record_last_action(st, last_action):
st["last_action"] = last_action
st["last_time"] = time.time()
class AdvancedMediaGallery:
def __init__(
self,
@ -60,9 +64,10 @@ class AdvancedMediaGallery:
self.state: Optional[gr.State] = None
self._initial_state: Dict[str, Any] = {
"items": items,
"selected": (len(items) - 1) if items else None,
"selected": (len(items) - 1) if items else 0, # None,
"single": bool(single_image_mode),
"mode": self.media_mode,
"last_action": "",
}
# ---------------- helpers ----------------
@ -210,6 +215,13 @@ class AdvancedMediaGallery:
def _on_select(self, state: Dict[str, Any], gallery, evt: gr.SelectData) :
# Mirror the selected index into state and the gallery (server-side selected_index)
st = get_state(state)
last_time = st.get("last_time", None)
if last_time is not None and abs(time.time()- last_time)< 0.5: # crappy trick to detect if onselect is unwanted (buggy gallery)
# print(f"ignored:{time.time()}, real {st['selected']}")
return gr.update(selected_index=st["selected"]), st
idx = None
if evt is not None and hasattr(evt, "index"):
ix = evt.index
@ -220,17 +232,28 @@ class AdvancedMediaGallery:
idx = ix[0] * max(1, int(self.columns)) + ix[1]
else:
idx = ix[0]
st = get_state(state)
n = len(get_list(gallery))
sel = idx if (idx is not None and 0 <= idx < n) else None
# print(f"image selected evt index:{sel}/{evt.selected}")
st["selected"] = sel
# return gr.update(selected_index=sel), st
# return gr.update(), st
return st
return gr.update(), st
def _on_upload(self, value: List[Any], state: Dict[str, Any]) :
# Fires when users upload via the Gallery itself.
# items_filtered = self._filter_items_by_mode(list(value or []))
items_filtered = list(value or [])
st = get_state(state)
new_items = self._paths_from_payload(items_filtered)
st["items"] = new_items
new_sel = len(new_items) - 1
st["selected"] = new_sel
record_last_action(st,"add")
return gr.update(selected_index=new_sel), st
def _on_gallery_change(self, value: List[Any], state: Dict[str, Any]) :
# Fires when users add/drag/drop/delete via the Gallery itself.
items_filtered = self._filter_items_by_mode(list(value or []))
# items_filtered = self._filter_items_by_mode(list(value or []))
items_filtered = list(value or [])
st = get_state(state)
st["items"] = items_filtered
# Keep selection if still valid, else default to last
@ -240,10 +263,9 @@ class AdvancedMediaGallery:
else:
new_sel = old_sel
st["selected"] = new_sel
# return gr.update(value=items_filtered, selected_index=new_sel), st
# return gr.update(value=items_filtered), st
return gr.update(), st
st["last_action"] ="gallery_change"
# print(f"gallery change: set sel {new_sel}")
return gr.update(selected_index=new_sel), st
def _on_add(self, files_payload: Any, state: Dict[str, Any], gallery):
"""
@ -252,7 +274,8 @@ class AdvancedMediaGallery:
and re-selects the last inserted item.
"""
# New items (respect image/video mode)
new_items = self._filter_items_by_mode(self._paths_from_payload(files_payload))
# new_items = self._filter_items_by_mode(self._paths_from_payload(files_payload))
new_items = self._paths_from_payload(files_payload)
st = get_state(state)
cur: List[Any] = get_list(gallery)
@ -298,30 +321,6 @@ class AdvancedMediaGallery:
if k is not None:
seen_new.add(k)
# Remove any existing occurrences of the incoming items from current list,
# BUT keep the currently selected item even if it's also in incoming.
cur_clean: List[Any] = []
# sel_item = cur[sel] if (sel is not None and 0 <= sel < len(cur)) else None
# for idx, it in enumerate(cur):
# k = key_of(it)
# if it is sel_item:
# cur_clean.append(it)
# continue
# if k is not None and k in seen_new:
# continue # drop duplicate; we'll reinsert at the target spot
# cur_clean.append(it)
# # Compute insertion position: right AFTER the (possibly shifted) selected item
# if sel_item is not None:
# # find sel_item's new index in cur_clean
# try:
# pos_sel = cur_clean.index(sel_item)
# except ValueError:
# # Shouldn't happen, but fall back to end
# pos_sel = len(cur_clean) - 1
# insert_pos = pos_sel + 1
# else:
# insert_pos = len(cur_clean) # no selection -> append at end
insert_pos = min(sel, len(cur) -1)
cur_clean = cur
# Build final list and selection
@ -330,6 +329,8 @@ class AdvancedMediaGallery:
st["items"] = merged
st["selected"] = new_sel
record_last_action(st,"add")
# print(f"gallery add: set sel {new_sel}")
return gr.update(value=merged, selected_index=new_sel), st
def _on_remove(self, state: Dict[str, Any], gallery) :
@ -342,8 +343,9 @@ class AdvancedMediaGallery:
return gr.update(value=[], selected_index=None), st
new_sel = min(sel, len(items) - 1)
st["items"] = items; st["selected"] = new_sel
# return gr.update(value=items, selected_index=new_sel), st
# return gr.update(value=items), st
record_last_action(st,"remove")
# print(f"gallery del: new sel {new_sel}")
return gr.update(value=items, selected_index=new_sel), st
def _on_move(self, delta: int, state: Dict[str, Any], gallery) :
st = get_state(state); items: List[Any] = get_list(gallery); sel = st.get("selected", None)
@ -354,11 +356,15 @@ class AdvancedMediaGallery:
return gr.update(value=items, selected_index=sel), st
items[sel], items[j] = items[j], items[sel]
st["items"] = items; st["selected"] = j
record_last_action(st,"move")
# print(f"gallery move: set sel {j}")
return gr.update(value=items, selected_index=j), st
def _on_clear(self, state: Dict[str, Any]) :
st = {"items": [], "selected": None, "single": get_state(state).get("single", False), "mode": self.media_mode}
# return gr.update(value=[], selected_index=0), st
record_last_action(st,"clear")
# print(f"Clear all")
return gr.update(value=[], selected_index=None), st
def _on_toggle_single(self, to_single: bool, state: Dict[str, Any]) :
st = get_state(state); st["single"] = bool(to_single)
@ -382,30 +388,38 @@ class AdvancedMediaGallery:
def mount(self, parent: Optional[gr.Blocks | gr.Group | gr.Row | gr.Column] = None, update_form = False):
if parent is not None:
with parent:
col = self._build_ui()
col = self._build_ui(update_form)
else:
col = self._build_ui()
col = self._build_ui(update_form)
if not update_form:
self._wire_events()
return col
def _build_ui(self) -> gr.Column:
def _build_ui(self, update = False) -> gr.Column:
with gr.Column(elem_id=self.elem_id, elem_classes=self.elem_classes) as col:
self.container = col
self.state = gr.State(dict(self._initial_state))
# self.gallery = gr.Gallery(
#     label=self.label,
#     value=self._initial_state["items"],
#     height=self.height,
#     columns=self.columns,
#     show_label=self.show_label,
#     preview= True,
#     # type="pil",
#     file_types= list(IMAGE_EXTS) if self.media_mode == "image" else list(VIDEO_EXTS),
#     selected_index=self._initial_state["selected"], # server-side selection
# )
if update:
self.gallery = gr.update(
value=self._initial_state["items"],
selected_index=self._initial_state["selected"], # server-side selection
label=self.label,
show_label=self.show_label,
)
else:
self.gallery = gr.Gallery(
value=self._initial_state["items"],
label=self.label,
height=self.height,
columns=self.columns,
show_label=self.show_label,
preview= True,
# type="pil", # very slow
file_types= list(IMAGE_EXTS) if self.media_mode == "image" else list(VIDEO_EXTS),
selected_index=self._initial_state["selected"], # server-side selection
)
# One-line controls
exts = sorted(IMAGE_EXTS if self.media_mode == "image" else VIDEO_EXTS) if self.accept_filter else None
@ -418,10 +432,10 @@ class AdvancedMediaGallery:
size="sm",
min_width=1,
)
self.btn_remove = gr.Button("Remove", size="sm", min_width=1)
self.btn_remove = gr.Button(" Remove ", size="sm", min_width=1)
self.btn_left = gr.Button("◀ Left", size="sm", visible=not self._initial_state["single"], min_width=1)
self.btn_right = gr.Button("Right ▶", size="sm", visible=not self._initial_state["single"], min_width=1)
self.btn_clear = gr.Button("Clear", variant="secondary", size="sm", visible=not self._initial_state["single"], min_width=1)
self.btn_clear = gr.Button(" Clear ", variant="secondary", size="sm", visible=not self._initial_state["single"], min_width=1)
return col
@ -430,14 +444,24 @@ class AdvancedMediaGallery:
self.gallery.select(
self._on_select,
inputs=[self.state, self.gallery],
# outputs=[self.state],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Media uploaded directly via the Gallery (click-to-add, drag and drop)
# self.gallery.change(
self.gallery.upload(
self._on_upload,
inputs=[self.gallery, self.state],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Gallery value changed by user actions (click-to-add, drag-drop, internal remove, etc.)
self.gallery.upload(
self._on_gallery_change,
inputs=[self.gallery, self.state],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Add via UploadButton
@ -445,6 +469,7 @@ class AdvancedMediaGallery:
self._on_add,
inputs=[self.upload_btn, self.state, self.gallery],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Remove selected
@ -452,6 +477,7 @@ class AdvancedMediaGallery:
self._on_remove,
inputs=[self.state, self.gallery],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Reorder using selected index, keep same item selected
@ -459,11 +485,13 @@ class AdvancedMediaGallery:
lambda st, gallery: self._on_move(-1, st, gallery),
inputs=[self.state, self.gallery],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
self.btn_right.click(
lambda st, gallery: self._on_move(+1, st, gallery),
inputs=[self.state, self.gallery],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
# Clear all
@ -471,6 +499,7 @@ class AdvancedMediaGallery:
self._on_clear,
inputs=[self.state],
outputs=[self.gallery, self.state],
trigger_mode="always_last",
)
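# Hypothetical usage sketch (not part of this commit): how the component might be
# mounted inside a Blocks app. The constructor arguments below are assumptions
# inferred from the attributes used above (label, media_mode, columns), not a
# documented signature.
#
#   with gr.Blocks() as demo:
#       gallery = AdvancedMediaGallery(label="Reference Images", media_mode="image", columns=4)
#       gallery.mount()   # builds the UI and wires the events (update_form=False)
#   demo.launch()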
# ---------------- public API ----------------

View File

@ -19,6 +19,7 @@ import tempfile
import subprocess
import json
from functools import lru_cache
os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
from PIL import Image
@ -188,6 +189,14 @@ def get_outpainting_full_area_dimensions(frame_height,frame_width, outpainting_d
frame_width = int(frame_width * (100 + outpainting_left + outpainting_right) / 100)
return frame_height, frame_width
def rgb_bw_to_rgba_mask(img, thresh=127):
a = img.convert('L').point(lambda p: 255 if p > thresh else 0) # alpha
out = Image.new('RGBA', img.size, (255, 255, 255, 0)) # white, transparent
out.putalpha(a) # white where alpha=255
return out
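# Illustrative usage (not part of this commit): turn a black-and-white inpainting
# mask into an RGBA overlay, white where the mask is above the threshold and fully
# transparent elsewhere, so it can be composited over a preview image.
#
#   mask = Image.open("mask.png")                       # hypothetical file names
#   overlay = rgb_bw_to_rgba_mask(mask, thresh=127)
#   preview = Image.open("input.png").convert("RGBA")
#   preview.alpha_composite(overlay)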
def get_outpainting_frame_location(final_height, final_width, outpainting_dims, block_size = 8):
outpainting_top, outpainting_bottom, outpainting_left, outpainting_right= outpainting_dims
raw_height = int(final_height / ((100 + outpainting_top + outpainting_bottom) / 100))
@ -207,30 +216,62 @@ def get_outpainting_frame_location(final_height, final_width, outpainting_dims
if (margin_left + width) > final_width or outpainting_right == 0: margin_left = final_width - width
return height, width, margin_top, margin_left
# def calculate_new_dimensions(canvas_height, canvas_width, image_height, image_width, fit_into_canvas, block_size = 16):
#     if fit_into_canvas == None:
def rescale_and_crop(img, w, h):
ow, oh = img.size
target_ratio = w / h
orig_ratio = ow / oh
if orig_ratio > target_ratio:
# Crop width first
nw = int(oh * target_ratio)
img = img.crop(((ow - nw) // 2, 0, (ow + nw) // 2, oh))
else:
# Crop height first
nh = int(ow / target_ratio)
img = img.crop((0, (oh - nh) // 2, ow, (oh + nh) // 2))
return img.resize((w, h), Image.LANCZOS)
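# Worked example (illustrative only): center-cropping a 1920x1080 frame to an
# 832x480 target.
#   target_ratio = 832/480 ≈ 1.733, orig_ratio = 1920/1080 ≈ 1.778 > target_ratio
#   -> crop the width: nw = int(1080 * 1.733) = 1872, crop box = (24, 0, 1896, 1080)
#   -> resize the 1872x1080 crop to 832x480 with LANCZOS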
def calculate_new_dimensions(canvas_height, canvas_width, image_height, image_width, fit_into_canvas, block_size = 16):
if fit_into_canvas == None or fit_into_canvas == 2:
# return image_height, image_width
return canvas_height, canvas_width
if fit_into_canvas:
if fit_into_canvas == 1:
scale1 = min(canvas_height / image_height, canvas_width / image_width)
scale2 = min(canvas_width / image_height, canvas_height / image_width)
scale = max(scale1, scale2)
else:
else: # fit_into_canvas == 0: treat the canvas as a pixel budget (mode 2, crop, already returned above)
scale = (canvas_height * canvas_width / (image_height * image_width))**(1/2)
new_height = round( image_height * scale / block_size) * block_size
new_width = round( image_width * scale / block_size) * block_size
return new_height, new_width
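# Illustrative behaviour (not part of this commit) for a 1080x1920 image and a
# 480x832 canvas with block_size=16:
#   fit_into_canvas None or 2 -> (480, 832): the canvas dimensions themselves
#                                (mode 2 expects the caller to crop the image to fit)
#   fit_into_canvas 1         -> (464, 832): largest aspect-preserving fit inside the canvas
#   fit_into_canvas 0         -> (480, 848): same pixel budget as the canvas, aspect preserved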
# def resize_and_remove_background(img_list, budget_width, budget_height, rm_background, ignore_first, fit_into_canvas = False ):
def calculate_dimensions_and_resize_image(image, canvas_height, canvas_width, fit_into_canvas, fit_crop, block_size = 16):
if fit_crop:
image = rescale_and_crop(image, canvas_width, canvas_height)
new_width, new_height = image.size
else:
image_width, image_height = image.size
new_height, new_width = calculate_new_dimensions(canvas_height, canvas_width, image_height, image_width, fit_into_canvas, block_size = block_size )
image = image.resize((new_width, new_height), resample=Image.Resampling.LANCZOS)
return image, new_height, new_width
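# Hypothetical usage sketch (not part of this commit): preparing a start image for
# a 480x832 output, either center-cropped to exactly that size or resized to fit.
#
#   img = Image.open("start.png").convert("RGB")        # hypothetical file name
#   img, h, w = calculate_dimensions_and_resize_image(img, 480, 832, fit_into_canvas=2, fit_crop=True)
#   # or: aspect-preserving resize, dimensions rounded to a multiple of block_size
#   img, h, w = calculate_dimensions_and_resize_image(img, 480, 832, fit_into_canvas=1, fit_crop=False)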
def resize_and_remove_background(img_list, budget_width, budget_height, rm_background, any_background_ref, fit_into_canvas = 0, block_size= 16, outpainting_dims = None ):
if rm_background:
session = new_session()
output_list =[]
for i, img in enumerate(img_list):
width, height = img.size
# if fit_into_canvas:
if fit_into_canvas == None or any_background_ref == 1 and i==0 or any_background_ref == 2:
if outpainting_dims is not None:
resized_image =img
elif img.size != (budget_width, budget_height):
resized_image= img.resize((budget_width, budget_height), resample=Image.Resampling.LANCZOS)
else:
resized_image =img
elif fit_into_canvas == 1:
white_canvas = np.ones((budget_height, budget_width, 3), dtype=np.uint8) * 255
scale = min(budget_height / height, budget_width / width)
new_height = int(height * scale)
@ -242,10 +283,10 @@ def resize_and_remove_background(img_list, budget_width, budget_height, rm_backg
resized_image = Image.fromarray(white_canvas)
else:
scale = (budget_height * budget_width / (height * width))**(1/2)
# new_height = int( round(height * scale / 16) * 16)
# new_width = int( round(width * scale / 16) * 16)
new_height = int( round(height * scale / block_size) * block_size)
new_width = int( round(width * scale / block_size) * block_size)
resized_image= img.resize((new_width,new_height), resample=Image.Resampling.LANCZOS)
# if rm_background and not (ignore_first and i == 0) :
if rm_background and not (any_background_ref and i==0 or any_background_ref == 2) :
# resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1,alpha_matting_background_threshold = 70, alpha_foreground_background_threshold = 100, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
output_list.append(resized_image) #alpha_matting_background_threshold = 30, alpha_foreground_background_threshold = 200,
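# Hypothetical usage sketch (not part of this commit; the return value is assumed to
# be the processed list): prepare reference images for an 832x480 budget, keeping the
# first image untouched as a background reference and stripping the background of the rest.
#
#   refs = [Image.open(p).convert("RGB") for p in ("bg.png", "person.png")]   # hypothetical files
#   processed = resize_and_remove_background(refs, 832, 480, rm_background=True,
#                                            any_background_ref=1, fit_into_canvas=0)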

wgp.py (978)

File diff suppressed because it is too large