WanGP remuxed

This commit is contained in:
deepbeepmeep 2025-08-04 02:28:19 +02:00
parent 35e4ee2b59
commit d2a9d5483d
21 changed files with 875 additions and 514 deletions

View File

@ -20,6 +20,28 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTX Video models
**Follow DeepBeepMeep on Twitter/X to get the Latest News**: https://x.com/deepbeepmeep
## 🔥 Latest Updates :
### August 4 2025: WanGP v7.6 - Remuxed
With this new version you won't have any excuse if there is no sound in your video.
*Continue Video* now works with any video that already has some sound (hint: Multitalk).
Also, on top of MMaudio and the various sound-driven models, I have added the ability to use your own soundtrack.
As a result you can apply a different sound source to each new video segment when doing a *Continue Video*.
For instance:
- first video part: use Multitalk with two people speaking
- second video part: apply your own soundtrack, which will gently follow the Multitalk conversation
- third video part: use a Vace effect, whose corresponding control audio will be concatenated to the rest of the audio
To multiply the combinations, I have also implemented *Continue Video* with the various image2video models.
Also:
- End Frame support added for LTX Video models
- Loras can now be targeted specifically at the High noise or Low noise models with Wan 2.2, check the Loras and Finetune guides
- Flux Krea Dev support
### July 30 2025: WanGP v7.5: Just another release ... Wan 2.2 part 2
Wan 2.2 image2video is now here: a very good model if you want to set Start and End frames. Two Wan 2.2 models delivered, only one to go ...

16
defaults/flux_krea.json Normal file
View File

@ -0,0 +1,16 @@
{
"model": {
"name": "Flux 1 Krea Dev 12B",
"architecture": "flux",
"description": "Cutting-edge output quality, with a focus on aesthetic photography..",
"URLs": [
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-krea-dev_bf16.safetensors",
"https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-krea-dev_quanto_bf16_int8.safetensors"
],
"image_outputs": true,
"flux-model": "flux-dev"
},
"prompt": "draw a hat",
"resolution": "1280x720",
"batch_size": 1
}

View File

@ -55,9 +55,16 @@ For instance if one adds a module *vace_14B* on top of a model with architecture
- *architecture* : architecture Id of the base model of the finetune (see previous section)
- *description*: description of the finetune that will appear at the top
- *URLs*: URLs of all the finetune versions (quantized / non quantized). WanGP will pick the version that is closest to the user's preferences. You will need to follow a naming convention to help WanGP identify the content of each version (see next section). Right now WanGP supports only 8-bit quantized models that have been quantized using **quanto**. WanGP offers a command switch to build such a quantized model easily (see below). *URLs* can also contain paths to local files to allow testing.
- *URLs2*: URLs of all the finetune versions (quantized / non quantized) of the weights used for the second phase of a model. For instance with Wan 2.2, the first phase contains the High Noise model weights and the second phase contains the Low Noise model weights. This feature can be used with models other than Wan 2.2 to combine different model weights during the same video generation.
- *modules*: this is a list of modules to be combined with the models referenced by the URLs. A module is a model extension that is merged with a model to expand its capabilities. Supported modules so far are: *vace_14B* and *multitalk*. For instance, the full Vace model is the fusion of a Wan text2video model and the Vace module.
- *preload_URLs* : URLs of files to download in all cases (used for instance to load quantization maps)
-*loras* : URLs of Loras that will applied before any other Lora specified by the user. These loras will be quite often Loras accelerator. For instance if you specified here the FusioniX Lora you will be able to reduce the number of generation steps to 10
-*loras_multipliers* : a list of float numbers that defines the weight of each Lora mentioned above.
- *loras* : URLs of Loras that will be applied before any other Lora specified by the user. These Loras will quite often be Lora accelerators. For instance, if you specify here the FusioniX Lora you will be able to reduce the number of generation steps to 10
- *loras_multipliers* : a list of float numbers or strings that define the weight of each Lora listed in *loras*. The string syntax is used if you want a Lora multiplier to change over the steps (please check the Loras doc) or if you want a multiplier to be applied only during a specific High Noise or Low Noise phase of a Wan 2.2 model. For instance, in the example below the multiplier is applied only during the High Noise phase: for the first half of the steps of that phase the multiplier is 1 and for the second half it is 1.1.
```
"loras" : [ "my_lora.safetensors"],
"loras_multipliers" : [ "1,1.1;0"]
```
- *auto_quantize*: if set to True and no quantized model URL is provided, WanGP will perform on-the-fly quantization if the user expects a quantized model
- *visible* : assumed to be true by default. If set to false the model will no longer be visible. This can be useful if you create a finetune to override a default model and hide it.
- *image_outputs* : turns any model that generates videos into a model that generates images. In practice it adapts the user interface for image generation and asks the model to generate a video with a single frame.
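Putting these fields together, here is a minimal sketch of a finetune definition. The nesting under *model* mirrors the *flux_krea* example added in this commit; the name, architecture id, URLs, Lora file and multipliers are placeholders for illustration only, not a real finetune:
```
{
  "model": {
    "name": "My Finetune 14B",
    "architecture": "<architecture id of the base model>",
    "description": "Short description displayed at the top of the UI",
    "URLs": [
      "https://example.com/my_finetune_bf16.safetensors",
      "https://example.com/my_finetune_quanto_bf16_int8.safetensors"
    ],
    "loras": ["my_lora.safetensors"],
    "loras_multipliers": ["1,1.1;0"],
    "auto_quantize": true
  },
  "prompt": "default prompt",
  "resolution": "1280x720"
}
```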

View File

@ -63,6 +63,26 @@ For dynamic effects over generation steps, use comma-separated values:
- First lora: 0.9 → 0.8 → 0.7
- Second lora: 1.2 → 1.1 → 1.0
With models like Wan 2.2 that internally use two diffusion models (*High Noise* / *Low Noise*), you can specify which Loras should be applied during a specific phase by separating the phases with a ";".
For instance, if you want to disable a Lora during the *High Noise* phase and enable it only during the *Low Noise* phase:
```
0;1
```
As usual, you can use any float value for a multiplier and have a multiplier vary throughout a phase for a given Lora:
```
0.9,0.8;1.2,1.1,1
```
In this example, multipliers 0.9 and 0.8 will be used during the *High Noise* phase and 1.2, 1.1 and 1 during the *Low Noise* phase.
Here is another example for two loras:
```
0.9,0.8;1.2,1.1,1
0.5;0,0.7
```
Note that the syntax for multipliers can also be used in a Finetune model definition file (except that each multiplier definition is a string inside a JSON list), as shown below.
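For instance, the two-Lora example above would look like this in a finetune definition (a sketch only, using the *loras_multipliers* field described in the Finetune guide):
```
"loras_multipliers" : [ "0.9,0.8;1.2,1.1,1", "0.5;0,0.7" ]
```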
## Lora Presets
Lora Presets are combinations of loras with predefined multipliers and prompts.

View File

@ -58,12 +58,18 @@ class model_factory:
# self.name= "flux-dev-kontext"
# self.name= "flux-dev"
# self.name= "flux-schnell"
self.model = load_flow_model(self.name, model_filename[0], torch_device)
source = model_def.get("source", None)
self.model = load_flow_model(self.name, model_filename[0] if source is None else source, torch_device)
self.vae = load_ae(self.name, device=torch_device)
# offload.change_dtype(self.model, dtype, True)
# offload.save_model(self.model, "flux-dev.safetensors")
if source is not None:
from wgp import save_model
save_model(self.model, model_type, dtype, None)
if save_quantized:
from wgp import save_quantized_model
save_quantized_model(self.model, model_type, model_filename[0], dtype, None)

View File

@ -343,7 +343,7 @@ def denoise(
updated_num_steps= len(timesteps) -1
if callback != None:
from wgp import update_loras_slists
from wan.utils.loras_mutipliers import update_loras_slists
update_loras_slists(model, loras_slists, updated_num_steps)
callback(-1, None, True, override_num_inference_steps = updated_num_steps)
from mmgp import offload

View File

@ -21,7 +21,7 @@ from PIL import Image
import numpy as np
import torchvision.transforms as transforms
import cv2
from wan.utils.utils import resize_lanczos, calculate_new_dimensions
from wan.utils.utils import calculate_new_dimensions, convert_tensor_to_image
from hyvideo.data_kits.audio_preprocessor import encode_audio, get_facemask
from transformers import WhisperModel
from transformers import AutoFeatureExtractor
@ -720,7 +720,6 @@ class HunyuanVideoSampler(Inference):
embedded_guidance_scale=6.0,
batch_size=1,
num_videos_per_prompt=1,
i2v_resolution="720p",
image_start=None,
enable_RIFLEx = False,
i2v_condition_type: str = "token_replace",
@ -846,39 +845,13 @@ class HunyuanVideoSampler(Inference):
denoise_strength = 0
ip_cfg_scale = 0
if i2v_mode:
if i2v_resolution == "720p":
bucket_hw_base_size = 960
elif i2v_resolution == "540p":
bucket_hw_base_size = 720
elif i2v_resolution == "360p":
bucket_hw_base_size = 480
else:
raise ValueError(f"i2v_resolution: {i2v_resolution} must be in [360p, 540p, 720p]")
# semantic_images = [Image.open(i2v_image_path).convert('RGB')]
semantic_images = [image_start.convert('RGB')] #
origin_size = semantic_images[0].size
h, w = origin_size
h, w = calculate_new_dimensions(height, width, h, w, fit_into_canvas)
closest_size = (w, h)
# crop_size_list = generate_crop_size_list(bucket_hw_base_size, 32)
# aspect_ratios = np.array([round(float(h)/float(w), 5) for h, w in crop_size_list])
# closest_size, closest_ratio = get_closest_ratio(origin_size[1], origin_size[0], aspect_ratios, crop_size_list)
ref_image_transform = transforms.Compose([
transforms.Resize(closest_size),
transforms.CenterCrop(closest_size),
transforms.ToTensor(),
transforms.Normalize([0.5], [0.5])
])
semantic_image_pixel_values = [ref_image_transform(semantic_image) for semantic_image in semantic_images]
semantic_image_pixel_values = torch.cat(semantic_image_pixel_values).unsqueeze(0).unsqueeze(2).to(self.device)
semantic_images = convert_tensor_to_image(image_start)
semantic_image_pixel_values = image_start.unsqueeze(0).unsqueeze(2).to(self.device)
with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=True):
img_latents = self.pipeline.vae.encode(semantic_image_pixel_values).latent_dist.mode() # B, C, F, H, W
img_latents.mul_(self.pipeline.vae.config.scaling_factor)
target_height, target_width = closest_size
target_height, target_width = image_start.shape[1:]
# ========================================================================
# Build Rope freqs

View File

@ -303,14 +303,15 @@ class LTXV:
frame_width, frame_height = image_start.size
if fit_into_canvas != None:
height, width = calculate_new_dimensions(height, width, frame_height, frame_width, fit_into_canvas, 32)
conditioning_media_paths.append(image_start)
conditioning_media_paths.append(image_start.unsqueeze(1))
conditioning_start_frames.append(0)
conditioning_control_frames.append(False)
prefix_size = 1
if image_end != None:
conditioning_media_paths.append(image_end)
conditioning_start_frames.append(frame_num-1)
conditioning_control_frames.append(False)
if image_end != None:
conditioning_media_paths.append(image_end.unsqueeze(1))
conditioning_start_frames.append(frame_num-1)
conditioning_control_frames.append(False)
if input_frames!= None:
conditioning_media_paths.append(input_frames)

View File

@ -132,11 +132,13 @@ import torch
def remux_with_audio(video_path: Path, output_path: Path, audio: torch.Tensor, sampling_rate: int):
from wan.utils.utils import extract_audio_tracks, combine_video_with_audio_tracks, cleanup_temp_audio_files
with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
temp_path = Path(f.name)
temp_path_str= str(temp_path)
import torchaudio
torchaudio.save(temp_path_str, audio.unsqueeze(0) if audio.dim() == 1 else audio, sampling_rate)
combine_video_with_audio_tracks(video_path, [temp_path_str], output_path )
temp_path.unlink(missing_ok=True)

View File

@ -76,7 +76,7 @@ def get_model(persistent_models = False, verboseLevel = 1) -> tuple[MMAudio, Fea
@torch.inference_mode()
def video_to_audio(video, prompt: str, negative_prompt: str, seed: int, num_steps: int,
cfg_strength: float, duration: float, video_save_path , persistent_models = False, verboseLevel = 1):
cfg_strength: float, duration: float, save_path , persistent_models = False, audio_file_only = False, verboseLevel = 1):
global device
@ -110,11 +110,17 @@ def video_to_audio(video, prompt: str, negative_prompt: str, seed: int, num_step
)
audio = audios.float().cpu()[0]
make_video(video, video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)
if audio_file_only:
import torchaudio
torchaudio.save(save_path, audio.unsqueeze(0) if audio.dim() == 1 else audio, seq_cfg.sampling_rate)
else:
make_video(video, video_info, save_path, audio, sampling_rate=seq_cfg.sampling_rate)
offloadobj.unload_all()
if not persistent_models:
offloadobj.release()
torch.cuda.empty_cache()
gc.collect()
return video_save_path
return save_path

View File

@ -69,6 +69,10 @@ def get_frames_from_image(image_input, image_state):
[[0:nearest_frame], [nearest_frame:], nearest_frame]
"""
if image_input is None:
gr.Info("Please select an Image file")
return [gr.update()] * 17
user_name = time.time()
frames = [image_input] * 2 # hardcode: mimic a video with 2 frames
image_size = (frames[0].shape[0],frames[0].shape[1])
@ -94,11 +98,12 @@ def get_frames_from_image(image_input, image_state):
gr.update(visible=True, maximum=10, value=10), gr.update(visible=False, maximum=len(frames), value=len(frames)), \
gr.update(visible=True), gr.update(visible=True), \
gr.update(visible=True), gr.update(visible=True),\
gr.update(visible=True), gr.update(visible=True), \
gr.update(visible=True), gr.update(value="", visible=True), gr.update(visible=False), \
gr.update(visible=True), gr.update(visible=False), \
gr.update(visible=False), gr.update(value="", visible=False), gr.update(visible=False), \
gr.update(visible=False), gr.update(visible=True), \
gr.update(visible=True)
# extract frames from upload video
def get_frames_from_video(video_input, video_state):
"""
@ -108,7 +113,9 @@ def get_frames_from_video(video_input, video_state):
Return
[[0:nearest_frame], [nearest_frame:], nearest_frame]
"""
if video_input is None:
gr.Info("Please select a Video file")
return [gr.update()] * 18
while model == None:
time.sleep(1)
@ -381,6 +388,7 @@ def save_video(frames, output_path, fps):
def mask_to_xyxy_box(mask):
rows, cols = np.where(mask == 255)
if len(rows) == 0 or len(cols) == 0: return []
xmin = min(cols)
xmax = max(cols) + 1
ymin = min(rows)
@ -449,13 +457,18 @@ def image_matting(video_state, interactive_state, mask_dropdown, erode_kernel_si
bbox_info = mask_to_xyxy_box(alpha_output)
h = alpha_output.shape[0]
w = alpha_output.shape[1]
bbox_info = [str(int(bbox_info[0]/ w * 100 )), str(int(bbox_info[1]/ h * 100 )), str(int(bbox_info[2]/ w * 100 )), str(int(bbox_info[3]/ h * 100 )) ]
bbox_info = ":".join(bbox_info)
if len(bbox_info) == 0:
bbox_info = ""
else:
bbox_info = [str(int(bbox_info[0]/ w * 100 )), str(int(bbox_info[1]/ h * 100 )), str(int(bbox_info[2]/ w * 100 )), str(int(bbox_info[3]/ h * 100 )) ]
bbox_info = ":".join(bbox_info)
alpha_output = Image.fromarray(alpha_output)
return foreground_output, alpha_output, bbox_info, gr.update(visible=True), gr.update(visible=True)
# return gr.update(value=foreground_output, visible= True), gr.update(value=alpha_output, visible= True), gr.update(value=bbox_info, visible= True), gr.update(visible=True), gr.update(visible=True)
return foreground_output, alpha_output, gr.update(visible = True), gr.update(visible = True), gr.update(value=bbox_info, visible= True), gr.update(visible=True), gr.update(visible=True)
# video matting
def video_matting(video_state, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
def video_matting(video_state,video_input, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
# if interactive_state["track_end_number"]:
# following_frames = video_state["origin_images"][video_state["select_frame_number"]:interactive_state["track_end_number"]]
@ -521,10 +534,21 @@ def video_matting(video_state, end_slider, matting_type, interactive_state, mask
file_name= video_state["video_name"]
file_name = ".".join(file_name.split(".")[:-1])
foreground_output = save_video(foreground, output_path="./mask_outputs/{}_fg.mp4".format(file_name), fps=fps)
# foreground_output = generate_video_from_frames(foreground, output_path="./results/{}_fg.mp4".format(video_state["video_name"]), fps=fps, audio_path=audio_path) # import video_input to name the output video
from wan.utils.utils import extract_audio_tracks, combine_video_with_audio_tracks, cleanup_temp_audio_files
source_audio_tracks, audio_metadata = extract_audio_tracks(video_input)
output_fg_path = f"./mask_outputs/{file_name}_fg.mp4"
output_fg_temp_path = f"./mask_outputs/{file_name}_fg_tmp.mp4"
if len(source_audio_tracks) == 0:
foreground_output = save_video(foreground, output_path=output_fg_path , fps=fps)
else:
foreground_output_tmp = save_video(foreground, output_path=output_fg_temp_path , fps=fps)
combine_video_with_audio_tracks(output_fg_temp_path, source_audio_tracks, output_fg_path, audio_metadata=audio_metadata)
cleanup_temp_audio_files(source_audio_tracks)
os.remove(foreground_output_tmp)
foreground_output = output_fg_path
alpha_output = save_video(alpha, output_path="./mask_outputs/{}_alpha.mp4".format(file_name), fps=fps)
# alpha_output = generate_video_from_frames(alpha, output_path="./results/{}_alpha.mp4".format(video_state["video_name"]), fps=fps, gray2rgb=True, audio_path=audio_path) # import video_input to name the output video
return foreground_output, alpha_output, gr.update(visible=True), gr.update(visible=True), gr.update(visible=True), gr.update(visible=True)
@ -912,7 +936,7 @@ def display(tabs, tab_state, vace_video_input, vace_image_input, vace_video_mask
inputs=[],
outputs=[foreground_video_output, alpha_video_output]).then(
fn=video_matting,
inputs=[video_state, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
inputs=[video_state, video_input, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
outputs=[foreground_video_output, alpha_video_output,foreground_video_output, alpha_video_output, export_to_vace_video_14B_btn, export_to_current_video_engine_btn]
)
@ -1053,7 +1077,7 @@ def display(tabs, tab_state, vace_video_input, vace_image_input, vace_video_mask
foreground_image_output = gr.Image(type="pil", label="Foreground Output", visible=False, elem_classes="image")
alpha_image_output = gr.Image(type="pil", label="Mask", visible=False, elem_classes="image")
with gr.Row(equal_height=True):
bbox_info = gr.Text(label ="Mask BBox Info (Left:Top:Right:Bottom)", interactive= False)
bbox_info = gr.Text(label ="Mask BBox Info (Left:Top:Right:Bottom)", visible = False, interactive= False)
with gr.Row():
# with gr.Row():
export_image_btn = gr.Button(value="Add to current Reference Images", visible=False, elem_classes="new_button")
@ -1116,7 +1140,7 @@ def display(tabs, tab_state, vace_video_input, vace_image_input, vace_video_mask
matting_button.click(
fn=image_matting,
inputs=[image_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, image_selection_slider],
outputs=[foreground_image_output, alpha_image_output,bbox_info, export_image_btn, export_image_mask_btn]
outputs=[foreground_image_output, alpha_image_output,foreground_image_output, alpha_image_output,bbox_info, export_image_btn, export_image_mask_btn]
)

View File

@ -17,7 +17,7 @@ gradio==5.23.0
numpy>=1.23.5,<2
einops
moviepy==1.0.3
mmgp==3.5.3
mmgp==3.5.5
peft==0.15.0
mutagen
pydantic==2.10.6
@ -46,5 +46,6 @@ soundfile
ffmpeg-python
pyannote.audio
pynvml
huggingface_hub[hf_xet]
# num2words
# spacy

View File

@ -141,7 +141,8 @@ class WanAny2V:
if save_quantized:
from wgp import save_quantized_model
save_quantized_model(self.model, model_type, model_filename[0], dtype, base_config_file)
if self.model2 is not None:
save_quantized_model(self.model2, model_type, model_filename[1], dtype, base_config_file, submodel_no=2)
self.sample_neg_prompt = config.sample_neg_prompt
if self.model.config.get("vace_in_dim", None) != None:
@ -357,7 +358,7 @@ class WanAny2V:
input_frames= None,
input_masks = None,
input_ref_images = None,
input_video=None,
input_video = None,
image_start = None,
image_end = None,
denoising_strength = 1.0,
@ -395,6 +396,7 @@ class WanAny2V:
conditioning_latents_size = 0,
keep_frames_parsed = [],
model_type = None,
model_mode = None,
loras_slists = None,
NAG_scale = 0,
NAG_tau = 3.5,
@ -475,67 +477,63 @@ class WanAny2V:
phantom = model_type in ["phantom_1.3B", "phantom_14B"]
fantasy = model_type in ["fantasy"]
multitalk = model_type in ["multitalk", "vace_multitalk_14B"]
recam = model_type in ["recam_1.3B"]
ref_images_count = 0
trim_frames = 0
extended_overlapped_latents = None
# image2video
lat_frames = int((frame_num - 1) // self.vae_stride[0]) + 1
if image_start != None:
# image2video
if model_type in ["i2v", "i2v_2_2", "fantasy", "multitalk", "flf2v_720p"]:
any_end_frame = False
if input_frames != None:
_ , preframes_count, height, width = input_frames.shape
if image_start is None:
_ , preframes_count, height, width = input_video.shape
lat_h, lat_w = height // self.vae_stride[1], width // self.vae_stride[2]
if hasattr(self, "clip"):
clip_context = self.clip.visual([input_frames[:, -1:]]) if model_type != "flf2v_720p" else self.clip.visual([input_frames[:, -1:], input_frames[:, -1:]])
if hasattr(self, "clip"):
clip_image_size = self.clip.model.image_size
clip_image = resize_lanczos(input_video[:, -1], clip_image_size, clip_image_size)[:, None, :, :]
clip_context = self.clip.visual([clip_image]) if model_type != "flf2v_720p" else self.clip.visual([clip_image , clip_image ])
clip_image = None
else:
clip_context = None
input_frames = input_frames.to(device=self.device).to(dtype= self.VAE_dtype)
enc = torch.concat( [input_frames, torch.zeros( (3, frame_num-preframes_count, height, width),
input_video = input_video.to(device=self.device).to(dtype= self.VAE_dtype)
enc = torch.concat( [input_video, torch.zeros( (3, frame_num-preframes_count, height, width),
device=self.device, dtype= self.VAE_dtype)],
dim = 1).to(self.device)
color_reference_frame = input_frames[:, -1:].clone()
input_frames = None
color_reference_frame = input_video[:, -1:].clone()
input_video = None
else:
preframes_count = 1
image_start = TF.to_tensor(image_start)
any_end_frame = image_end != None
any_end_frame = image_end is not None
add_frames_for_end_image = any_end_frame and model_type == "i2v"
if any_end_frame:
image_end = TF.to_tensor(image_end)
if add_frames_for_end_image:
frame_num +=1
lat_frames = int((frame_num - 2) // self.vae_stride[0] + 2)
trim_frames = 1
h, w = image_start.shape[1:]
height, width = image_start.shape[1:]
h, w = calculate_new_dimensions(height, width, h, w, fit_into_canvas)
width, height = w, h
lat_h = round(
h // self.vae_stride[1] //
height // self.vae_stride[1] //
self.patch_size[1] * self.patch_size[1])
lat_w = round(
w // self.vae_stride[2] //
width // self.vae_stride[2] //
self.patch_size[2] * self.patch_size[2])
h = lat_h * self.vae_stride[1]
w = lat_w * self.vae_stride[2]
img_interpolated = resize_lanczos(image_start, h, w).sub_(0.5).div_(0.5).unsqueeze(0).transpose(0,1).to(self.device) #, self.dtype
color_reference_frame = img_interpolated.clone()
if image_end!= None:
img_interpolated2 = resize_lanczos(image_end, h, w).sub_(0.5).div_(0.5).unsqueeze(0).transpose(0,1).to(self.device) #, self.dtype
height = lat_h * self.vae_stride[1]
width = lat_w * self.vae_stride[2]
image_start_frame = image_start.unsqueeze(1).to(self.device)
color_reference_frame = image_start_frame.clone()
if image_end is not None:
img_end_frame = image_end.unsqueeze(1).to(self.device)
if hasattr(self, "clip"):
clip_image_size = self.clip.model.image_size
image_start = resize_lanczos(image_start, clip_image_size, clip_image_size)
image_start = image_start.sub_(0.5).div_(0.5).to(self.device) #, self.dtype
if image_end!= None:
image_end = resize_lanczos(image_end, clip_image_size, clip_image_size)
image_end = image_end.sub_(0.5).div_(0.5).to(self.device) #, self.dtype
if image_end is not None: image_end = resize_lanczos(image_end, clip_image_size, clip_image_size)
if model_type == "flf2v_720p":
clip_context = self.clip.visual([image_start[:, None, :, :], image_end[:, None, :, :] if image_end != None else image_start[:, None, :, :]])
clip_context = self.clip.visual([image_start[:, None, :, :], image_end[:, None, :, :] if image_end is not None else image_start[:, None, :, :]])
else:
clip_context = self.clip.visual([image_start[:, None, :, :]])
else:
@ -543,17 +541,17 @@ class WanAny2V:
if any_end_frame:
enc= torch.concat([
img_interpolated,
torch.zeros( (3, frame_num-2, h, w), device=self.device, dtype= self.VAE_dtype),
img_interpolated2,
image_start_frame,
torch.zeros( (3, frame_num-2, height, width), device=self.device, dtype= self.VAE_dtype),
img_end_frame,
], dim=1).to(self.device)
else:
enc= torch.concat([
img_interpolated,
torch.zeros( (3, frame_num-1, h, w), device=self.device, dtype= self.VAE_dtype)
image_start_frame,
torch.zeros( (3, frame_num-1, height, width), device=self.device, dtype= self.VAE_dtype)
], dim=1).to(self.device)
image_start = image_end = img_interpolated = img_interpolated2 = None
image_start = image_end = image_start_frame = img_end_frame = None
msk = torch.ones(1, frame_num, lat_h, lat_w, device=self.device)
if any_end_frame:
@ -582,11 +580,12 @@ class WanAny2V:
kwargs.update({'clip_fea': clip_context})
# Recam Master
if target_camera != None:
if recam:
# this should in fact be in input_frames since it is a control video, not a video to be extended
target_camera = model_mode
width = input_video.shape[2]
height = input_video.shape[1]
input_video = input_video.to(dtype=self.dtype , device=self.device)
input_video = input_video.permute(3, 0, 1, 2).div_(127.5).sub_(1.)
source_latents = self.vae.encode([input_video])[0] #.to(dtype=self.dtype, device=self.device)
del input_video
# Process target camera (recammaster)
@ -718,8 +717,13 @@ class WanAny2V:
# init denoising
updated_num_steps= len(timesteps)
if callback != None:
from wan.utils.utils import update_loras_slists
update_loras_slists(self.model, loras_slists, updated_num_steps)
from wan.utils.loras_mutipliers import update_loras_slists
model_switch_step = updated_num_steps
for i, t in enumerate(timesteps):
if t <= switch_threshold:
model_switch_step = i
break
update_loras_slists(self.model, loras_slists, updated_num_steps, model_switch_step= model_switch_step)
callback(-1, None, True, override_num_inference_steps = updated_num_steps)
if sample_scheduler != None:

View File

@ -19,7 +19,7 @@ from wan.utils.utils import calculate_new_dimensions
from .utils.fm_solvers import (FlowDPMSolverMultistepScheduler,
get_sampling_sigmas, retrieve_timesteps)
from .utils.fm_solvers_unipc import FlowUniPCMultistepScheduler
from wan.utils.utils import update_loras_slists
from wan.utils.loras_mutipliers import update_loras_slists
class DTT2V:
@ -199,7 +199,6 @@ class DTT2V:
self,
input_prompt: Union[str, List[str]],
n_prompt: Union[str, List[str]] = "",
image_start: PipelineImageInput = None,
input_video = None,
height: int = 480,
width: int = 832,
@ -242,11 +241,6 @@ class DTT2V:
if input_video != None:
_ , _ , height, width = input_video.shape
elif image_start != None:
image_start = image_start
frame_width, frame_height = image_start.size
height, width = calculate_new_dimensions(height, width, frame_height, frame_width, fit_into_canvas)
image_start = np.array(image_start.resize((width, height))).transpose(2, 0, 1)
latent_length = (frame_num - 1) // 4 + 1
@ -276,18 +270,8 @@ class DTT2V:
output_video = input_video
if image_start is not None or output_video is not None: # i !=0
if output_video is not None:
prefix_video = output_video.to(self.device)
else:
causal_block_size = 1
causal_attention = False
ar_step = 0
prefix_video = image_start
prefix_video = torch.tensor(prefix_video).unsqueeze(1) # .to(image_embeds.dtype).unsqueeze(1)
if prefix_video.dtype == torch.uint8:
prefix_video = (prefix_video.float() / (255.0 / 2.0)) - 1.0
prefix_video = prefix_video.to(self.device)
if output_video is not None: # i !=0
prefix_video = output_video.to(self.device)
prefix_video = self.vae.encode(prefix_video.unsqueeze(0))[0] # [(c, f, h, w)]
predix_video_latent_length = prefix_video.shape[1]
truncate_len = predix_video_latent_length % causal_block_size

View File

@ -6,7 +6,7 @@ from .model import FantasyTalkingAudioConditionModel
from .utils import get_audio_features
import gc, torch
def parse_audio(audio_path, num_frames, fps = 23, device = "cuda"):
def parse_audio(audio_path, start_frame, num_frames, fps = 23, device = "cuda"):
fantasytalking = FantasyTalkingAudioConditionModel(None, 768, 2048).to(device)
from mmgp import offload
from accelerate import init_empty_weights
@ -24,7 +24,7 @@ def parse_audio(audio_path, num_frames, fps = 23, device = "cuda"):
wav2vec = Wav2Vec2Model.from_pretrained(wav2vec_model_dir, device_map="cpu").eval().requires_grad_(False)
wav2vec.to(device)
proj_model.to(device)
audio_wav2vec_fea = get_audio_features( wav2vec, wav2vec_processor, audio_path, fps, num_frames )
audio_wav2vec_fea = get_audio_features( wav2vec, wav2vec_processor, audio_path, fps, start_frame, num_frames)
audio_proj_fea = proj_model(audio_wav2vec_fea)
pos_idx_ranges = fantasytalking.split_audio_sequence( audio_proj_fea.size(1), num_frames=num_frames )

View File

@ -26,13 +26,18 @@ def save_video(frames, save_path, fps, quality=9, ffmpeg_params=None):
writer.close()
def get_audio_features(wav2vec, audio_processor, audio_path, fps, num_frames):
def get_audio_features(wav2vec, audio_processor, audio_path, fps, start_frame, num_frames):
sr = 16000
audio_input, sample_rate = librosa.load(audio_path, sr=sr)  # sampling rate is 16 kHz
if start_frame < 0:
pad = int(abs(start_frame)/ fps * sr)
audio_input = np.concatenate([np.zeros(pad), audio_input])
end_frame = num_frames
else:
end_frame = start_frame + num_frames
start_time = 0
# end_time = (0 + (num_frames - 1) * 1) / fps
end_time = num_frames / fps
start_time = start_frame / fps
end_time = end_frame / fps
start_sample = int(start_time * sr)
end_sample = int(end_time * sr)

View File

@ -762,7 +762,11 @@ class WanModel(ModelMixin, ConfigMixin):
offload.shared_state["_chipmunk_layers"] = None
def preprocess_loras(self, model_type, sd):
new_sd = {}
for k,v in sd.items():
if not k.endswith(".modulation.diff"):
new_sd[ k] = v
sd = new_sd
first = next(iter(sd), None)
if first == None:
return sd

View File

@ -74,7 +74,7 @@ def audio_prepare_single(audio_path, sample_rate=16000, duration = 0):
return human_speech_array
def audio_prepare_multi(left_path, right_path, audio_type = "add", sample_rate=16000, duration = 0):
def audio_prepare_multi(left_path, right_path, audio_type = "add", sample_rate=16000, duration = 0, pad = 0):
if not (left_path==None or right_path==None):
human_speech_array1 = audio_prepare_single(left_path, duration = duration)
human_speech_array2 = audio_prepare_single(right_path, duration = duration)
@ -91,7 +91,13 @@ def audio_prepare_multi(left_path, right_path, audio_type = "add", sample_rate=1
elif audio_type=='add':
new_human_speech1 = np.concatenate([human_speech_array1[: human_speech_array1.shape[0]], np.zeros(human_speech_array2.shape[0])])
new_human_speech2 = np.concatenate([np.zeros(human_speech_array1.shape[0]), human_speech_array2[:human_speech_array2.shape[0]]])
# don't include the padding in the summed audio, which is used to build the output audio track
sum_human_speechs = new_human_speech1 + new_human_speech2
if pad > 0:
new_human_speech1 = np.concatenate([np.zeros(pad), new_human_speech1])
new_human_speech2 = np.concatenate([np.zeros(pad), new_human_speech2])
return new_human_speech1, new_human_speech2, sum_human_speechs
def process_tts_single(text, save_dir, voice1):
@ -167,14 +173,13 @@ def process_tts_multi(text, save_dir, voice1, voice2):
return s1, s2, save_path_sum
def get_full_audio_embeddings(audio_guide1 = None, audio_guide2 = None, combination_type ="add", num_frames = 0, fps = 25, sr = 16000):
def get_full_audio_embeddings(audio_guide1 = None, audio_guide2 = None, combination_type ="add", num_frames = 0, fps = 25, sr = 16000, padded_frames_for_embeddings = 0):
wav2vec_feature_extractor, audio_encoder= custom_init('cpu', "ckpts/chinese-wav2vec2-base")
# wav2vec_feature_extractor, audio_encoder= custom_init('cpu', "ckpts/wav2vec")
new_human_speech1, new_human_speech2, sum_human_speechs = audio_prepare_multi(audio_guide1, audio_guide2, combination_type, duration= num_frames / fps)
pad = int(padded_frames_for_embeddings/ fps * sr)
new_human_speech1, new_human_speech2, sum_human_speechs = audio_prepare_multi(audio_guide1, audio_guide2, combination_type, duration= num_frames / fps, pad = pad)
audio_embedding_1 = get_embedding(new_human_speech1, wav2vec_feature_extractor, audio_encoder, sr=sr, fps= fps)
audio_embedding_2 = get_embedding(new_human_speech2, wav2vec_feature_extractor, audio_encoder, sr=sr, fps= fps)
full_audio_embs = []
if audio_guide1 != None: full_audio_embs.append(audio_embedding_1)
# if audio_guide1 != None: full_audio_embs.append(audio_embedding_1)

View File

@ -0,0 +1,91 @@
def preparse_loras_multipliers(loras_multipliers):
if isinstance(loras_multipliers, list):
return [multi.strip(" \r\n") if isinstance(multi, str) else multi for multi in loras_multipliers]
loras_multipliers = loras_multipliers.strip(" \r\n")
loras_mult_choices_list = loras_multipliers.replace("\r", "").split("\n")
loras_mult_choices_list = [multi.strip() for multi in loras_mult_choices_list if len(multi)>0 and not multi.startswith("#")]
loras_multipliers = " ".join(loras_mult_choices_list)
return loras_multipliers.split(" ")
def expand_slist(slists_dict, mult_no, num_inference_steps, model_switch_step ):
def expand_one(slist, num_inference_steps):
if not isinstance(slist, list): slist = [slist]
new_slist= []
if num_inference_steps <=0:
return new_slist
inc = len(slist) / num_inference_steps
pos = 0
for i in range(num_inference_steps):
new_slist.append(slist[ int(pos)])
pos += inc
return new_slist
phase1 = slists_dict["phase1"][mult_no]
phase2 = slists_dict["phase2"][mult_no]
if isinstance(phase1, float) and isinstance(phase2, float) and phase1 == phase2:
return phase1
return expand_one(phase1, model_switch_step) + expand_one(phase2, num_inference_steps - model_switch_step)
def parse_loras_multipliers(loras_multipliers, nb_loras, num_inference_steps, merge_slist = None, max_phases = 2, model_switch_step = None):
if model_switch_step is None:
model_switch_step = num_inference_steps
def is_float(element: any) -> bool:
if element is None:
return False
try:
float(element)
return True
except ValueError:
return False
loras_list_mult_choices_nums = []
slists_dict = { "model_switch_step": model_switch_step}
slists_dict["phase1"] = phase1 = [1.] * nb_loras
slists_dict["phase2"] = phase2 = [1.] * nb_loras
if isinstance(loras_multipliers, list) or len(loras_multipliers) > 0:
list_mult_choices_list = preparse_loras_multipliers(loras_multipliers)
for i, mult in enumerate(list_mult_choices_list):
current_phase = phase1
if isinstance(mult, str):
mult = mult.strip()
phase_mult = mult.split(";")
shared_phases = len(phase_mult) <=1
if len(phase_mult) > max_phases:
return "", "", f"Loras can not be defined for more than {max_phases} Denoising phases for this model"
for phase_no, mult in enumerate(phase_mult):
if phase_no > 0: current_phase = phase2
if "," in mult:
multlist = mult.split(",")
slist = []
for smult in multlist:
if not is_float(smult):
return "", "", f"Lora sub value no {i+1} ({smult}) in Multiplier definition '{multlist}' is invalid"
slist.append(float(smult))
else:
if not is_float(mult):
return "", "", f"Lora Multiplier no {i+1} ({mult}) is invalid"
slist = float(mult)
if shared_phases:
phase1[i] = phase2[i] = slist
else:
current_phase[i] = slist
else:
phase1[i] = phase2[i] = float(mult)
if merge_slist is not None:
slists_dict["phase1"] = phase1 = merge_slist["phase1"] + phase1
slists_dict["phase2"] = phase2 = merge_slist["phase2"] + phase2
loras_list_mult_choices_nums = [ expand_slist(slists_dict, i, num_inference_steps, model_switch_step ) for i in range(len(phase1)) ]
loras_list_mult_choices_nums = [ slist[0] if isinstance(slist, list) else slist for slist in loras_list_mult_choices_nums ]
return loras_list_mult_choices_nums, slists_dict, ""
def update_loras_slists(trans, slists_dict, num_inference_steps, model_switch_step = None ):
from mmgp import offload
sz = len(slists_dict["phase1"])
slists = [ expand_slist(slists_dict, i, num_inference_steps, model_switch_step ) for i in range(sz) ]
nos = [str(l) for l in range(sz)]
offload.activate_loras(trans, nos, slists )

View File

@ -18,6 +18,8 @@ import random
import ffmpeg
import os
import tempfile
import subprocess
import json
__all__ = ['cache_video', 'cache_image', 'str2bool']
@ -34,21 +36,6 @@ def seed_everything(seed: int):
if torch.backends.mps.is_available():
torch.mps.manual_seed(seed)
def expand_slist(slist, num_inference_steps ):
new_slist= []
inc = len(slist) / num_inference_steps
pos = 0
for i in range(num_inference_steps):
new_slist.append(slist[ int(pos)])
pos += inc
return new_slist
def update_loras_slists(trans, slists, num_inference_steps ):
from mmgp import offload
slists = [ expand_slist(slist, num_inference_steps ) if isinstance(slist, list) else slist for slist in slists ]
nos = [str(l) for l in range(len(slists))]
offload.activate_loras(trans, nos, slists )
def resample(video_fps, video_frames_count, max_target_frames_count, target_fps, start_target_frame ):
import math
@ -141,10 +128,12 @@ def convert_image_to_video(image):
return temp_video.name
def resize_lanczos(img, h, w):
img = Image.fromarray(np.clip(255. * img.movedim(0, -1).cpu().numpy(), 0, 255).astype(np.uint8))
img = (img + 1).float().mul_(127.5)
img = Image.fromarray(np.clip(img.movedim(0, -1).cpu().numpy(), 0, 255).astype(np.uint8))
img = img.resize((w,h), resample=Image.Resampling.LANCZOS)
return torch.from_numpy(np.array(img).astype(np.float32) / 255.0).movedim(-1, 0)
img = torch.from_numpy(np.array(img).astype(np.float32)).movedim(-1, 0)
img = img.div(127.5).sub_(1)
return img
def remove_background(img, session=None):
if session ==None:
@ -445,109 +434,180 @@ def create_progress_hook(filename):
return progress_hook(block_num, block_size, total_size, filename)
return hook
import tempfile, os
import ffmpeg
import os
import tempfile
def extract_audio_tracks(source_video, verbose=False, query_only= False):
def extract_audio_tracks(source_video, verbose=False, query_only=False):
"""
Extract all audio tracks from source video to temporary files.
Args:
source_video: Path to video with audio to extract
verbose: Enable verbose output (default: False)
Extract all audio tracks from a source video into temporary AAC files.
Returns:
List of temporary audio file paths, or empty list if no audio tracks
Tuple:
- List of temp file paths for extracted audio tracks
- List of corresponding metadata dicts:
{'codec', 'sample_rate', 'channels', 'duration', 'language'}
where 'duration' is set to container duration (for consistency).
"""
probe = ffmpeg.probe(source_video)
audio_streams = [s for s in probe['streams'] if s['codec_type'] == 'audio']
container_duration = float(probe['format'].get('duration', 0.0))
if not audio_streams:
if query_only: return 0
if verbose: print(f"No audio track found in {source_video}")
return [], []
if query_only:
return len(audio_streams)
if verbose:
print(f"Found {len(audio_streams)} audio track(s), container duration = {container_duration:.3f}s")
file_paths = []
metadata = []
for i, stream in enumerate(audio_streams):
fd, temp_path = tempfile.mkstemp(suffix=f'_track{i}.aac', prefix='audio_')
os.close(fd)
file_paths.append(temp_path)
metadata.append({
'codec': stream.get('codec_name'),
'sample_rate': int(stream.get('sample_rate', 0)),
'channels': int(stream.get('channels', 0)),
'duration': container_duration,
'language': stream.get('tags', {}).get('language', None)
})
ffmpeg.input(source_video).output(
temp_path,
**{f'map': f'0:a:{i}', 'acodec': 'aac', 'b:a': '128k'}
).overwrite_output().run(quiet=not verbose)
return file_paths, metadata
import subprocess
def combine_and_concatenate_video_with_audio_tracks(
save_path_tmp, video_path,
source_audio_tracks, new_audio_tracks,
source_audio_duration, audio_sampling_rate,
new_audio_from_start=False,
source_audio_metadata=None,
audio_bitrate='128k',
audio_codec='aac'
):
inputs, filters, maps, idx = ['-i', video_path], [], ['-map', '0:v'], 1
metadata_args = []
sources = source_audio_tracks or []
news = new_audio_tracks or []
duplicate_source = len(sources) == 1 and len(news) > 1
N = len(news) if source_audio_duration == 0 else max(len(sources), len(news)) or 1
for i in range(N):
s = (sources[i] if i < len(sources)
else sources[0] if duplicate_source else None)
n = news[i] if len(news) == N else (news[0] if news else None)
if source_audio_duration == 0:
if n:
inputs += ['-i', n]
filters.append(f'[{idx}:a]apad=pad_dur=100[aout{i}];')
idx += 1
else:
filters.append(f'anullsrc=r={audio_sampling_rate}:cl=mono,apad=pad_dur=100[aout{i}];')
else:
if s:
inputs += ['-i', s]
meta = source_audio_metadata[i] if source_audio_metadata and i < len(source_audio_metadata) else {}
needs_filter = (
meta.get('codec') != audio_codec or
meta.get('sample_rate') != audio_sampling_rate or
meta.get('channels') != 1 or
meta.get('duration', 0) < source_audio_duration
)
if needs_filter:
filters.append(
f'[{idx}:a]aresample={audio_sampling_rate},aformat=channel_layouts=mono,'
f'apad=pad_dur={source_audio_duration},atrim=0:{source_audio_duration},asetpts=PTS-STARTPTS[s{i}];')
else:
filters.append(
f'[{idx}:a]apad=pad_dur={source_audio_duration},atrim=0:{source_audio_duration},asetpts=PTS-STARTPTS[s{i}];')
if lang := meta.get('language'):
metadata_args += ['-metadata:s:a:' + str(i), f'language={lang}']
idx += 1
else:
filters.append(
f'anullsrc=r={audio_sampling_rate}:cl=mono,atrim=0:{source_audio_duration},asetpts=PTS-STARTPTS[s{i}];')
if n:
inputs += ['-i', n]
start = '0' if new_audio_from_start else source_audio_duration
filters.append(
f'[{idx}:a]aresample={audio_sampling_rate},aformat=channel_layouts=mono,'
f'atrim=start={start},asetpts=PTS-STARTPTS[n{i}];'
f'[s{i}][n{i}]concat=n=2:v=0:a=1[aout{i}];')
idx += 1
else:
filters.append(f'[s{i}]apad=pad_dur=100[aout{i}];')
maps += ['-map', f'[aout{i}]']
cmd = ['ffmpeg', '-y', *inputs,
'-filter_complex', ''.join(filters),
*maps, *metadata_args,
'-c:v', 'copy',
'-c:a', audio_codec,
'-b:a', audio_bitrate,
'-ar', str(audio_sampling_rate),
'-ac', '1',
'-shortest', save_path_tmp]
try:
# Check if source video has audio
probe = ffmpeg.probe(source_video)
audio_streams = [s for s in probe['streams'] if s['codec_type'] == 'audio']
if not audio_streams:
if query_only: return 0
if verbose:
print(f"No audio track found in {source_video}")
return []
if query_only: return len(audio_streams)
if verbose:
print(f"Found {len(audio_streams)} audio track(s)")
# Create temporary audio files for each track
temp_audio_files = []
for i in range(len(audio_streams)):
fd, temp_path = tempfile.mkstemp(suffix=f'_track{i}.aac', prefix='audio_')
os.close(fd) # Close file descriptor immediately
temp_audio_files.append(temp_path)
# Extract each audio track
for i, temp_path in enumerate(temp_audio_files):
(ffmpeg
.input(source_video)
.output(temp_path, **{f'map': f'0:a:{i}', 'acodec': 'aac'})
.overwrite_output()
.run(quiet=not verbose))
return temp_audio_files
except ffmpeg.Error as e:
print(f"FFmpeg error during audio extraction: {e}")
return 0 if query_only else []
except Exception as e:
print(f"Error during audio extraction: {e}")
return 0 if query_only else []
subprocess.run(cmd, check=True, capture_output=True, text=True)
except subprocess.CalledProcessError as e:
raise Exception(f"FFmpeg error: {e.stderr}")
def combine_video_with_audio_tracks(target_video, audio_tracks, output_video, verbose=False):
"""
Combine video with audio tracks. Output duration matches video length exactly.
Args:
target_video: Path to video to receive the audio
audio_tracks: List of audio file paths to combine
output_video: Path for the output video
verbose: Enable verbose output (default: False)
Returns:
True if successful, False otherwise
"""
import ffmpeg
import subprocess
import ffmpeg
def combine_video_with_audio_tracks(target_video, audio_tracks, output_video,
audio_metadata=None, verbose=False):
if not audio_tracks:
if verbose:
print("No audio tracks to combine")
return False
try:
# Get video duration to ensure exact alignment
video_probe = ffmpeg.probe(target_video)
video_duration = float(video_probe['streams'][0]['duration'])
if verbose:
print(f"Target video duration: {video_duration:.3f} seconds")
# Combine target video with all audio tracks, force video duration
video = ffmpeg.input(target_video).video
audio_inputs = [ffmpeg.input(audio_path).audio for audio_path in audio_tracks]
# Create output with video duration as master timing
inputs = [video] + audio_inputs
(ffmpeg
.output(*inputs, output_video,
vcodec='copy',
acodec='copy',
t=video_duration) # Force exact video duration
.overwrite_output()
.run(quiet=not verbose))
if verbose:
print(f"Successfully created {output_video} with {len(audio_tracks)} audio track(s) aligned to video duration")
return True
except ffmpeg.Error as e:
print(f"FFmpeg error during video combination: {e}")
return False
except Exception as e:
print(f"Error during video combination: {e}")
return False
if verbose: print("No audio tracks to combine."); return False
dur = float(next(s for s in ffmpeg.probe(target_video)['streams']
if s['codec_type'] == 'video')['duration'])
if verbose: print(f"Video duration: {dur:.3f}s")
cmd = ['ffmpeg', '-y', '-i', target_video]
for path in audio_tracks:
cmd += ['-i', path]
cmd += ['-map', '0:v']
for i in range(len(audio_tracks)):
cmd += ['-map', f'{i+1}:a']
for i, meta in enumerate(audio_metadata or []):
if (lang := meta.get('language')):
cmd += ['-metadata:s:a:' + str(i), f'language={lang}']
cmd += ['-c:v', 'copy', '-c:a', 'copy', '-t', str(dur), output_video]
result = subprocess.run(cmd, capture_output=not verbose, text=True)
if result.returncode != 0:
raise Exception(f"FFmpeg error:\n{result.stderr}")
if verbose:
print(f"Created {output_video} with {len(audio_tracks)} audio track(s)")
return True
def cleanup_temp_audio_files(audio_tracks, verbose=False):
"""

672
wgp.py

File diff suppressed because it is too large