Mirror of https://github.com/Wan-Video/Wan2.1.git
Synced 2025-12-15 19:53:22 +00:00

Commit d2a9d5483d (parent 35e4ee2b59): WanGP remuxed

22	README.md
@@ -20,6 +20,28 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTX Video models

**Follow DeepBeepMeep on Twitter/X to get the Latest News**: https://x.com/deepbeepmeep

## 🔥 Latest Updates:

### August 4 2025: WanGP v7.6 - Remuxed

With this new version you won't have any excuse if there is no sound in your video.

*Continue Video* now works with any video that already has some sound (hint: Multitalk).

Also, on top of MMAudio and the various sound-driven models, I have added the ability to use your own soundtrack.

As a result you can apply a different sound source to each new video segment when doing a *Continue Video*.

For instance:
- first video part: use Multitalk with two people speaking
- second video part: apply your own soundtrack, which will gently follow the Multitalk conversation
- third video part: use a Vace effect; its corresponding control audio will be concatenated to the rest of the audio

To multiply the combinations, I have also implemented *Continue Video* with the various image2video models.

Also:
- End Frame support added for LTX Video models
- Loras can now be targeted specifically at the High Noise or Low Noise model with Wan 2.2; check the Loras and Finetune guides
- Flux Krea Dev support

### July 30 2025: WanGP v7.5: Just another release ... Wan 2.2 part 2

Here is now Wan 2.2 image2video, a very good model if you want to set Start and End frames. Two Wan 2.2 models delivered, only one to go ...
16	defaults/flux_krea.json (new file)

@@ -0,0 +1,16 @@
{
    "model": {
        "name": "Flux 1 Krea Dev 12B",
        "architecture": "flux",
        "description": "Cutting-edge output quality, with a focus on aesthetic photography.",
        "URLs": [
            "https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-krea-dev_bf16.safetensors",
            "https://huggingface.co/DeepBeepMeep/Flux/resolve/main/flux1-krea-dev_quanto_bf16_int8.safetensors"
        ],
        "image_outputs": true,
        "flux-model": "flux-dev"
    },
    "prompt": "draw a hat",
    "resolution": "1280x720",
    "batch_size": 1
}
@@ -55,9 +55,16 @@ For instance if one adds a module *vace_14B* on top of a model with architecture

- *architecture* : architecture Id of the base model of the finetune (see previous section)
- *description*: description of the finetune that will appear at the top
- *URLs*: URLs of all the finetune versions (quantized / non quantized). WanGP will pick the version that is closest to the user's preferences. You will need to follow a naming convention to help WanGP identify the content of each version (see next section). Right now WanGP supports only 8-bit quantized models that have been quantized using **quanto**. WanGP offers a command switch to easily build such a quantized model (see below). *URLs* can also contain paths to local files to allow testing.
- *URLs2*: URLs of all the finetune versions (quantized / non quantized) of the weights used for the second phase of a model. For instance with Wan 2.2, the first phase contains the High Noise model weights and the second phase contains the Low Noise model weights. This feature can be used with models other than Wan 2.2 to combine different model weights during the same video generation.
- *modules*: this is a list of modules to be combined with the models referenced by the URLs. A module is a model extension that is merged with a model to expand its capabilities. Supported modules so far are *vace_14B* and *multitalk*. For instance, the full Vace model is the fusion of a Wan text2video model and the Vace module.
- *preload_URLs* : URLs of files to download no matter what (used for instance to load quantization maps)
- *loras* : URLs of Loras that will applied before any other Lora specified by the user. These loras will be quite often Loras accelerator. For instance if you specified here the FusioniX Lora you will be able to reduce the number of generation steps to
- *loras_multipliers* : a list of float numbers that defines the weight of each Lora mentioned above.
- *loras* : URLs of Loras that will be applied before any other Lora specified by the user. These will quite often be Lora accelerators. For instance, if you specify the FusioniX Lora here you will be able to reduce the number of generation steps to 10.
- *loras_multipliers* : a list of float numbers or strings that defines the weight of each Lora mentioned in *loras*. The string syntax is used if you want a Lora multiplier to change over the steps (please check the Loras doc) or if you want a multiplier to be applied only during a specific High Noise or Low Noise phase of a Wan 2.2 model. For instance, with the definition below the multiplier is applied only during the High Noise phase: for the first half of the steps of this phase the multiplier is 1 and for the other half 1.1.
```
"loras" : [ "my_lora.safetensors"],
"loras_multipliers" : [ "1,1.1;0"]
```

- *auto_quantize*: if set to True and no quantized model URL is provided, WanGP will perform on-the-fly quantization if the user expects a quantized model
- *visible* : assumed to be true by default. If set to false the model will no longer be visible. This can be useful if you create a finetune to override a default model and want to hide the original.
- *image_outputs* : turns any model that generates a video into a model that generates images. In practice it adapts the user interface for image generation and asks the model to generate a video with a single frame.
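To make the fields above concrete, here is a minimal sketch of a finetune definition assembled from Python. Every name, URL and Lora file below is a placeholder invented for illustration; only the key names come from the documentation above, and the layout simply mirrors defaults/flux_krea.json shown earlier in this commit.

```python
# Hypothetical finetune definition; names, URLs and the architecture id are placeholders.
import json

finetune_def = {
    "model": {
        "name": "My Wan 2.2 Finetune 14B",                                # display name (assumed)
        "architecture": "i2v_2_2",                                        # base architecture id (assumed)
        "description": "Example finetune definition for illustration.",
        "URLs": ["https://example.com/high_noise_bf16.safetensors"],      # phase 1 weights (placeholder)
        "URLs2": ["https://example.com/low_noise_bf16.safetensors"],      # phase 2 weights (placeholder)
        "loras": ["https://example.com/accelerator_lora.safetensors"],    # applied before any user Lora
        # Only active during the High Noise phase: 1 for the first half of its steps, 1.1 for the rest.
        "loras_multipliers": ["1,1.1;0"],
        "auto_quantize": True,
    },
    "prompt": "draw a hat",
    "resolution": "1280x720",
}

# Write the definition where WanGP looks for finetunes (path assumed).
with open("defaults/my_finetune.json", "w", encoding="utf-8") as f:
    json.dump(finetune_def, f, indent=4)
```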
@@ -63,6 +63,26 @@ For dynamic effects over generation steps, use comma-separated values:

- First lora: 0.9 → 0.8 → 0.7
- Second lora: 1.2 → 1.1 → 1.0

With models like Wan 2.2 that internally use two diffusion models (*High Noise* / *Low Noise*), you can specify which Loras should be applied during a specific phase by separating the phases with a ";".

For instance, if you want to disable a Lora during the *High Noise* phase and enable it only during the *Low Noise* phase:
```
0;1
```

As usual, you can use any float as a multiplier and have a multiplier vary throughout one phase for one Lora:
```
0.9,0.8;1.2,1.1,1
```
In this example, multipliers 0.9 and 0.8 will be used during the *High Noise* phase and 1.2, 1.1 and 1 during the *Low Noise* phase.

Here is another example for two loras:
```
0.9,0.8;1.2,1.1,1
0.5;0,0.7
```

Note that the syntax for multipliers can also be used in a Finetune model definition file (except that each multiplier definition is a string in a JSON list).
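For reference, this phase syntax is parsed by the new `wan/utils/loras_mutipliers.py` module added later in this commit. A minimal sketch of how a two-phase string expands into per-step multipliers, assuming the WanGP repository root is on `PYTHONPATH`; the step counts are illustrative only.

```python
# Minimal sketch: expanding "0.9,0.8;1.2,1.1,1" into one multiplier per denoising step.
from wan.utils.loras_mutipliers import parse_loras_multipliers, expand_slist

num_steps = 10     # total denoising steps (illustrative)
switch_step = 5    # step at which Wan 2.2 switches from High Noise to Low Noise (illustrative)

first_step_mults, slists, err = parse_loras_multipliers(
    "0.9,0.8;1.2,1.1,1", nb_loras=1,
    num_inference_steps=num_steps, model_switch_step=switch_step)
assert err == ""   # a non-empty string would describe a syntax error

# Per-step schedule for Lora 0: 0.9 then 0.8 over the High Noise steps,
# followed by 1.2, 1.1 and 1.0 over the Low Noise steps.
print(expand_slist(slists, 0, num_steps, switch_step))
```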
## Lora Presets

Lora Presets are combinations of loras with predefined multipliers and prompts.
@@ -58,12 +58,18 @@ class model_factory:

# self.name= "flux-dev-kontext"
# self.name= "flux-dev"
# self.name= "flux-schnell"
self.model = load_flow_model(self.name, model_filename[0], torch_device)
source = model_def.get("source", None)
self.model = load_flow_model(self.name, model_filename[0] if source is None else source, torch_device)

self.vae = load_ae(self.name, device=torch_device)

# offload.change_dtype(self.model, dtype, True)
# offload.save_model(self.model, "flux-dev.safetensors")

if not source is None:
from wgp import save_model
save_model(self.model, model_type, dtype, None)

if save_quantized:
from wgp import save_quantized_model
save_quantized_model(self.model, model_type, model_filename[0], dtype, None)
@@ -343,7 +343,7 @@ def denoise(

updated_num_steps= len(timesteps) -1
if callback != None:
from wgp import update_loras_slists
from wan.utils.loras_mutipliers import update_loras_slists
update_loras_slists(model, loras_slists, updated_num_steps)
callback(-1, None, True, override_num_inference_steps = updated_num_steps)
from mmgp import offload
@@ -21,7 +21,7 @@ from PIL import Image

import numpy as np
import torchvision.transforms as transforms
import cv2
from wan.utils.utils import resize_lanczos, calculate_new_dimensions
from wan.utils.utils import calculate_new_dimensions, convert_tensor_to_image
from hyvideo.data_kits.audio_preprocessor import encode_audio, get_facemask
from transformers import WhisperModel
from transformers import AutoFeatureExtractor
@@ -720,7 +720,6 @@ class HunyuanVideoSampler(Inference):

embedded_guidance_scale=6.0,
batch_size=1,
num_videos_per_prompt=1,
i2v_resolution="720p",
image_start=None,
enable_RIFLEx = False,
i2v_condition_type: str = "token_replace",
@@ -846,39 +845,13 @@ class HunyuanVideoSampler(Inference):

denoise_strength = 0
ip_cfg_scale = 0
if i2v_mode:
if i2v_resolution == "720p":
bucket_hw_base_size = 960
elif i2v_resolution == "540p":
bucket_hw_base_size = 720
elif i2v_resolution == "360p":
bucket_hw_base_size = 480
else:
raise ValueError(f"i2v_resolution: {i2v_resolution} must be in [360p, 540p, 720p]")

# semantic_images = [Image.open(i2v_image_path).convert('RGB')]
semantic_images = [image_start.convert('RGB')] #
origin_size = semantic_images[0].size
h, w = origin_size
h, w = calculate_new_dimensions(height, width, h, w, fit_into_canvas)
closest_size = (w, h)
# crop_size_list = generate_crop_size_list(bucket_hw_base_size, 32)
# aspect_ratios = np.array([round(float(h)/float(w), 5) for h, w in crop_size_list])
# closest_size, closest_ratio = get_closest_ratio(origin_size[1], origin_size[0], aspect_ratios, crop_size_list)
ref_image_transform = transforms.Compose([
transforms.Resize(closest_size),
transforms.CenterCrop(closest_size),
transforms.ToTensor(),
transforms.Normalize([0.5], [0.5])
])

semantic_image_pixel_values = [ref_image_transform(semantic_image) for semantic_image in semantic_images]
semantic_image_pixel_values = torch.cat(semantic_image_pixel_values).unsqueeze(0).unsqueeze(2).to(self.device)

semantic_images = convert_tensor_to_image(image_start)
semantic_image_pixel_values = image_start.unsqueeze(0).unsqueeze(2).to(self.device)
with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=True):
img_latents = self.pipeline.vae.encode(semantic_image_pixel_values).latent_dist.mode() # B, C, F, H, W
img_latents.mul_(self.pipeline.vae.config.scaling_factor)

target_height, target_width = closest_size
target_height, target_width = image_start.shape[1:]

# ========================================================================
# Build Rope freqs
@@ -303,14 +303,15 @@ class LTXV:

frame_width, frame_height = image_start.size
if fit_into_canvas != None:
height, width = calculate_new_dimensions(height, width, frame_height, frame_width, fit_into_canvas, 32)
conditioning_media_paths.append(image_start)
conditioning_media_paths.append(image_start.unsqueeze(1))
conditioning_start_frames.append(0)
conditioning_control_frames.append(False)
prefix_size = 1
if image_end != None:
conditioning_media_paths.append(image_end)
conditioning_start_frames.append(frame_num-1)
conditioning_control_frames.append(False)

if image_end != None:
conditioning_media_paths.append(image_end.unsqueeze(1))
conditioning_start_frames.append(frame_num-1)
conditioning_control_frames.append(False)

if input_frames!= None:
conditioning_media_paths.append(input_frames)
@@ -132,11 +132,13 @@ import torch

def remux_with_audio(video_path: Path, output_path: Path, audio: torch.Tensor, sampling_rate: int):
from wan.utils.utils import extract_audio_tracks, combine_video_with_audio_tracks, cleanup_temp_audio_files

with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as f:
temp_path = Path(f.name)
temp_path_str= str(temp_path)
import torchaudio
torchaudio.save(temp_path_str, audio.unsqueeze(0) if audio.dim() == 1 else audio, sampling_rate)

combine_video_with_audio_tracks(video_path, [temp_path_str], output_path )
temp_path.unlink(missing_ok=True)
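The remux helper above leans on the audio utilities reworked further down in this commit (`extract_audio_tracks`, `combine_video_with_audio_tracks`, `cleanup_temp_audio_files` in `wan/utils/utils.py`). A minimal sketch of the round trip they provide, assuming ffmpeg and the ffmpeg-python package are installed; the file paths are placeholders.

```python
# Minimal sketch of the audio round trip used throughout this commit; paths are placeholders.
from wan.utils.utils import (extract_audio_tracks,
                             combine_video_with_audio_tracks,
                             cleanup_temp_audio_files)

# Pull every audio track of the source clip into temporary AAC files,
# together with per-track metadata (codec, sample rate, channels, duration, language).
tracks, metadata = extract_audio_tracks("source_with_sound.mp4")

if tracks:
    # Remux the tracks onto a freshly generated (silent) video, using the video
    # duration as the master timing and preserving the track languages.
    combine_video_with_audio_tracks("generated_silent.mp4", tracks,
                                    "generated_with_sound.mp4",
                                    audio_metadata=metadata)
    # Remove the temporary AAC files once the remux is done.
    cleanup_temp_audio_files(tracks)
```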
@@ -76,7 +76,7 @@ def get_model(persistent_models = False, verboseLevel = 1) -> tuple[MMAudio, Fea

@torch.inference_mode()
def video_to_audio(video, prompt: str, negative_prompt: str, seed: int, num_steps: int,
cfg_strength: float, duration: float, video_save_path , persistent_models = False, verboseLevel = 1):
cfg_strength: float, duration: float, save_path , persistent_models = False, audio_file_only = False, verboseLevel = 1):

global device

@@ -110,11 +110,17 @@ def video_to_audio(video, prompt: str, negative_prompt: str, seed: int, num_step

)
audio = audios.float().cpu()[0]

make_video(video, video_info, video_save_path, audio, sampling_rate=seq_cfg.sampling_rate)

if audio_file_only:
import torchaudio
torchaudio.save(save_path, audio.unsqueeze(0) if audio.dim() == 1 else audio, seq_cfg.sampling_rate)
else:
make_video(video, video_info, save_path, audio, sampling_rate=seq_cfg.sampling_rate)

offloadobj.unload_all()
if not persistent_models:
offloadobj.release()

torch.cuda.empty_cache()
gc.collect()
return video_save_path
return save_path
@@ -69,6 +69,10 @@ def get_frames_from_image(image_input, image_state):

[[0:nearest_frame], [nearest_frame:], nearest_frame]
"""

if image_input is None:
gr.Info("Please select an Image file")
return [gr.update()] * 17

user_name = time.time()
frames = [image_input] * 2 # hardcode: mimic a video with 2 frames
image_size = (frames[0].shape[0],frames[0].shape[1])

@@ -94,11 +98,12 @@ def get_frames_from_image(image_input, image_state):

gr.update(visible=True, maximum=10, value=10), gr.update(visible=False, maximum=len(frames), value=len(frames)), \
gr.update(visible=True), gr.update(visible=True), \
gr.update(visible=True), gr.update(visible=True),\
gr.update(visible=True), gr.update(visible=True), \
gr.update(visible=True), gr.update(value="", visible=True), gr.update(visible=False), \
gr.update(visible=True), gr.update(visible=False), \
gr.update(visible=False), gr.update(value="", visible=False), gr.update(visible=False), \
gr.update(visible=False), gr.update(visible=True), \
gr.update(visible=True)


# extract frames from upload video
def get_frames_from_video(video_input, video_state):
"""

@@ -108,7 +113,9 @@ def get_frames_from_video(video_input, video_state):

Return
[[0:nearest_frame], [nearest_frame:], nearest_frame]
"""

if video_input is None:
gr.Info("Please select a Video file")
return [gr.update()] * 18

while model == None:
time.sleep(1)

@@ -381,6 +388,7 @@ def save_video(frames, output_path, fps):

def mask_to_xyxy_box(mask):
rows, cols = np.where(mask == 255)
if len(rows) == 0 or len(cols) == 0: return []
xmin = min(cols)
xmax = max(cols) + 1
ymin = min(rows)

@@ -449,13 +457,18 @@ def image_matting(video_state, interactive_state, mask_dropdown, erode_kernel_si

bbox_info = mask_to_xyxy_box(alpha_output)
h = alpha_output.shape[0]
w = alpha_output.shape[1]
bbox_info = [str(int(bbox_info[0]/ w * 100 )), str(int(bbox_info[1]/ h * 100 )), str(int(bbox_info[2]/ w * 100 )), str(int(bbox_info[3]/ h * 100 )) ]
bbox_info = ":".join(bbox_info)
if len(bbox_info) == 0:
bbox_info = ""
else:
bbox_info = [str(int(bbox_info[0]/ w * 100 )), str(int(bbox_info[1]/ h * 100 )), str(int(bbox_info[2]/ w * 100 )), str(int(bbox_info[3]/ h * 100 )) ]
bbox_info = ":".join(bbox_info)
alpha_output = Image.fromarray(alpha_output)
return foreground_output, alpha_output, bbox_info, gr.update(visible=True), gr.update(visible=True)
# return gr.update(value=foreground_output, visible= True), gr.update(value=alpha_output, visible= True), gr.update(value=bbox_info, visible= True), gr.update(visible=True), gr.update(visible=True)

return foreground_output, alpha_output, gr.update(visible = True), gr.update(visible = True), gr.update(value=bbox_info, visible= True), gr.update(visible=True), gr.update(visible=True)

# video matting
def video_matting(video_state, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
def video_matting(video_state,video_input, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
# if interactive_state["track_end_number"]:
# following_frames = video_state["origin_images"][video_state["select_frame_number"]:interactive_state["track_end_number"]]

@@ -521,10 +534,21 @@ def video_matting(video_state, end_slider, matting_type, interactive_state, mask

file_name= video_state["video_name"]
file_name = ".".join(file_name.split(".")[:-1])
foreground_output = save_video(foreground, output_path="./mask_outputs/{}_fg.mp4".format(file_name), fps=fps)
# foreground_output = generate_video_from_frames(foreground, output_path="./results/{}_fg.mp4".format(video_state["video_name"]), fps=fps, audio_path=audio_path) # import video_input to name the output video

from wan.utils.utils import extract_audio_tracks, combine_video_with_audio_tracks, cleanup_temp_audio_files
source_audio_tracks, audio_metadata = extract_audio_tracks(video_input)
output_fg_path = f"./mask_outputs/{file_name}_fg.mp4"
output_fg_temp_path = f"./mask_outputs/{file_name}_fg_tmp.mp4"
if len(source_audio_tracks) == 0:
foreground_output = save_video(foreground, output_path=output_fg_path , fps=fps)
else:
foreground_output_tmp = save_video(foreground, output_path=output_fg_temp_path , fps=fps)
combine_video_with_audio_tracks(output_fg_temp_path, source_audio_tracks, output_fg_path, audio_metadata=audio_metadata)
cleanup_temp_audio_files(source_audio_tracks)
os.remove(foreground_output_tmp)
foreground_output = output_fg_path

alpha_output = save_video(alpha, output_path="./mask_outputs/{}_alpha.mp4".format(file_name), fps=fps)
# alpha_output = generate_video_from_frames(alpha, output_path="./results/{}_alpha.mp4".format(video_state["video_name"]), fps=fps, gray2rgb=True, audio_path=audio_path) # import video_input to name the output video

return foreground_output, alpha_output, gr.update(visible=True), gr.update(visible=True), gr.update(visible=True), gr.update(visible=True)

@@ -912,7 +936,7 @@ def display(tabs, tab_state, vace_video_input, vace_image_input, vace_video_mask

inputs=[],
outputs=[foreground_video_output, alpha_video_output]).then(
fn=video_matting,
inputs=[video_state, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
inputs=[video_state, video_input, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
outputs=[foreground_video_output, alpha_video_output,foreground_video_output, alpha_video_output, export_to_vace_video_14B_btn, export_to_current_video_engine_btn]
)

@@ -1053,7 +1077,7 @@ def display(tabs, tab_state, vace_video_input, vace_image_input, vace_video_mask

foreground_image_output = gr.Image(type="pil", label="Foreground Output", visible=False, elem_classes="image")
alpha_image_output = gr.Image(type="pil", label="Mask", visible=False, elem_classes="image")
with gr.Row(equal_height=True):
bbox_info = gr.Text(label ="Mask BBox Info (Left:Top:Right:Bottom)", interactive= False)
bbox_info = gr.Text(label ="Mask BBox Info (Left:Top:Right:Bottom)", visible = False, interactive= False)
with gr.Row():
# with gr.Row():
export_image_btn = gr.Button(value="Add to current Reference Images", visible=False, elem_classes="new_button")

@@ -1116,7 +1140,7 @@ def display(tabs, tab_state, vace_video_input, vace_image_input, vace_video_mask

matting_button.click(
fn=image_matting,
inputs=[image_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, image_selection_slider],
outputs=[foreground_image_output, alpha_image_output,bbox_info, export_image_btn, export_image_mask_btn]
outputs=[foreground_image_output, alpha_image_output,foreground_image_output, alpha_image_output,bbox_info, export_image_btn, export_image_mask_btn]
)
@@ -17,7 +17,7 @@ gradio==5.23.0

numpy>=1.23.5,<2
einops
moviepy==1.0.3
mmgp==3.5.3
mmgp==3.5.5
peft==0.15.0
mutagen
pydantic==2.10.6

@@ -46,5 +46,6 @@ soundfile

ffmpeg-python
pyannote.audio
pynvml
huggingface_hub[hf_xet]
# num2words
# spacy
@@ -141,7 +141,8 @@ class WanAny2V:

if save_quantized:
from wgp import save_quantized_model
save_quantized_model(self.model, model_type, model_filename[0], dtype, base_config_file)

if self.model2 is not None:
save_quantized_model(self.model2, model_type, model_filename[1], dtype, base_config_file, submodel_no=2)
self.sample_neg_prompt = config.sample_neg_prompt

if self.model.config.get("vace_in_dim", None) != None:

@@ -357,7 +358,7 @@ class WanAny2V:

input_frames= None,
input_masks = None,
input_ref_images = None,
input_video=None,
input_video = None,
image_start = None,
image_end = None,
denoising_strength = 1.0,

@@ -395,6 +396,7 @@ class WanAny2V:

conditioning_latents_size = 0,
keep_frames_parsed = [],
model_type = None,
model_mode = None,
loras_slists = None,
NAG_scale = 0,
NAG_tau = 3.5,
@@ -475,67 +477,63 @@ class WanAny2V:

phantom = model_type in ["phantom_1.3B", "phantom_14B"]
fantasy = model_type in ["fantasy"]
multitalk = model_type in ["multitalk", "vace_multitalk_14B"]
recam = model_type in ["recam_1.3B"]

ref_images_count = 0
trim_frames = 0
extended_overlapped_latents = None

# image2video
lat_frames = int((frame_num - 1) // self.vae_stride[0]) + 1
if image_start != None:
# image2video
if model_type in ["i2v", "i2v_2_2", "fantasy", "multitalk", "flf2v_720p"]:
any_end_frame = False
if input_frames != None:
_ , preframes_count, height, width = input_frames.shape
if image_start is None:
_ , preframes_count, height, width = input_video.shape
lat_h, lat_w = height // self.vae_stride[1], width // self.vae_stride[2]
if hasattr(self, "clip"):
clip_context = self.clip.visual([input_frames[:, -1:]]) if model_type != "flf2v_720p" else self.clip.visual([input_frames[:, -1:], input_frames[:, -1:]])
if hasattr(self, "clip"):
clip_image_size = self.clip.model.image_size
clip_image = resize_lanczos(input_video[:, -1], clip_image_size, clip_image_size)[:, None, :, :]
clip_context = self.clip.visual([clip_image]) if model_type != "flf2v_720p" else self.clip.visual([clip_image , clip_image ])
clip_image = None
else:
clip_context = None
input_frames = input_frames.to(device=self.device).to(dtype= self.VAE_dtype)
enc = torch.concat( [input_frames, torch.zeros( (3, frame_num-preframes_count, height, width),
input_video = input_video.to(device=self.device).to(dtype= self.VAE_dtype)
enc = torch.concat( [input_video, torch.zeros( (3, frame_num-preframes_count, height, width),
device=self.device, dtype= self.VAE_dtype)],
dim = 1).to(self.device)
color_reference_frame = input_frames[:, -1:].clone()
input_frames = None
color_reference_frame = input_video[:, -1:].clone()
input_video = None
else:
preframes_count = 1
image_start = TF.to_tensor(image_start)
any_end_frame = image_end != None
any_end_frame = image_end is not None
add_frames_for_end_image = any_end_frame and model_type == "i2v"
if any_end_frame:
image_end = TF.to_tensor(image_end)
if add_frames_for_end_image:
frame_num +=1
lat_frames = int((frame_num - 2) // self.vae_stride[0] + 2)
trim_frames = 1

h, w = image_start.shape[1:]
height, width = image_start.shape[1:]

h, w = calculate_new_dimensions(height, width, h, w, fit_into_canvas)
width, height = w, h

lat_h = round(
h // self.vae_stride[1] //
height // self.vae_stride[1] //
self.patch_size[1] * self.patch_size[1])
lat_w = round(
w // self.vae_stride[2] //
width // self.vae_stride[2] //
self.patch_size[2] * self.patch_size[2])
h = lat_h * self.vae_stride[1]
w = lat_w * self.vae_stride[2]
img_interpolated = resize_lanczos(image_start, h, w).sub_(0.5).div_(0.5).unsqueeze(0).transpose(0,1).to(self.device) #, self.dtype
color_reference_frame = img_interpolated.clone()
if image_end!= None:
img_interpolated2 = resize_lanczos(image_end, h, w).sub_(0.5).div_(0.5).unsqueeze(0).transpose(0,1).to(self.device) #, self.dtype
height = lat_h * self.vae_stride[1]
width = lat_w * self.vae_stride[2]
image_start_frame = image_start.unsqueeze(1).to(self.device)
color_reference_frame = image_start_frame.clone()
if image_end is not None:
img_end_frame = image_end.unsqueeze(1).to(self.device)

if hasattr(self, "clip"):
clip_image_size = self.clip.model.image_size
image_start = resize_lanczos(image_start, clip_image_size, clip_image_size)
image_start = image_start.sub_(0.5).div_(0.5).to(self.device) #, self.dtype
if image_end!= None:
image_end = resize_lanczos(image_end, clip_image_size, clip_image_size)
image_end = image_end.sub_(0.5).div_(0.5).to(self.device) #, self.dtype
if image_end is not None: image_end = resize_lanczos(image_end, clip_image_size, clip_image_size)
if model_type == "flf2v_720p":
clip_context = self.clip.visual([image_start[:, None, :, :], image_end[:, None, :, :] if image_end != None else image_start[:, None, :, :]])
clip_context = self.clip.visual([image_start[:, None, :, :], image_end[:, None, :, :] if image_end is not None else image_start[:, None, :, :]])
else:
clip_context = self.clip.visual([image_start[:, None, :, :]])
else:
@@ -543,17 +541,17 @@ class WanAny2V:

if any_end_frame:
enc= torch.concat([
img_interpolated,
torch.zeros( (3, frame_num-2, h, w), device=self.device, dtype= self.VAE_dtype),
img_interpolated2,
image_start_frame,
torch.zeros( (3, frame_num-2, height, width), device=self.device, dtype= self.VAE_dtype),
img_end_frame,
], dim=1).to(self.device)
else:
enc= torch.concat([
img_interpolated,
torch.zeros( (3, frame_num-1, h, w), device=self.device, dtype= self.VAE_dtype)
image_start_frame,
torch.zeros( (3, frame_num-1, height, width), device=self.device, dtype= self.VAE_dtype)
], dim=1).to(self.device)

image_start = image_end = img_interpolated = img_interpolated2 = None
image_start = image_end = image_start_frame = img_end_frame = None

msk = torch.ones(1, frame_num, lat_h, lat_w, device=self.device)
if any_end_frame:
@@ -582,11 +580,12 @@ class WanAny2V:

kwargs.update({'clip_fea': clip_context})

# Recam Master
if target_camera != None:
if recam:
# should in fact be in input_frames since it is a control video, not a video to be extended
target_camera = model_mode
width = input_video.shape[2]
height = input_video.shape[1]
input_video = input_video.to(dtype=self.dtype , device=self.device)
input_video = input_video.permute(3, 0, 1, 2).div_(127.5).sub_(1.)
source_latents = self.vae.encode([input_video])[0] #.to(dtype=self.dtype, device=self.device)
del input_video
# Process target camera (recammaster)

@@ -718,8 +717,13 @@ class WanAny2V:

# init denoising
updated_num_steps= len(timesteps)
if callback != None:
from wan.utils.utils import update_loras_slists
update_loras_slists(self.model, loras_slists, updated_num_steps)
from wan.utils.loras_mutipliers import update_loras_slists
model_switch_step = updated_num_steps
for i, t in enumerate(timesteps):
if t <= switch_threshold:
model_switch_step = i
break
update_loras_slists(self.model, loras_slists, updated_num_steps, model_switch_step= model_switch_step)
callback(-1, None, True, override_num_inference_steps = updated_num_steps)

if sample_scheduler != None:
@@ -19,7 +19,7 @@ from wan.utils.utils import calculate_new_dimensions

from .utils.fm_solvers import (FlowDPMSolverMultistepScheduler,
get_sampling_sigmas, retrieve_timesteps)
from .utils.fm_solvers_unipc import FlowUniPCMultistepScheduler
from wan.utils.utils import update_loras_slists
from wan.utils.loras_mutipliers import update_loras_slists

class DTT2V:

@@ -199,7 +199,6 @@ class DTT2V:

self,
input_prompt: Union[str, List[str]],
n_prompt: Union[str, List[str]] = "",
image_start: PipelineImageInput = None,
input_video = None,
height: int = 480,
width: int = 832,

@@ -242,11 +241,6 @@ class DTT2V:

if input_video != None:
_ , _ , height, width = input_video.shape
elif image_start != None:
image_start = image_start
frame_width, frame_height = image_start.size
height, width = calculate_new_dimensions(height, width, frame_height, frame_width, fit_into_canvas)
image_start = np.array(image_start.resize((width, height))).transpose(2, 0, 1)

latent_length = (frame_num - 1) // 4 + 1

@@ -276,18 +270,8 @@ class DTT2V:

output_video = input_video

if image_start is not None or output_video is not None: # i !=0
if output_video is not None:
prefix_video = output_video.to(self.device)
else:
causal_block_size = 1
causal_attention = False
ar_step = 0
prefix_video = image_start
prefix_video = torch.tensor(prefix_video).unsqueeze(1) # .to(image_embeds.dtype).unsqueeze(1)
if prefix_video.dtype == torch.uint8:
prefix_video = (prefix_video.float() / (255.0 / 2.0)) - 1.0
prefix_video = prefix_video.to(self.device)
if output_video is not None: # i !=0
prefix_video = output_video.to(self.device)
prefix_video = self.vae.encode(prefix_video.unsqueeze(0))[0] # [(c, f, h, w)]
predix_video_latent_length = prefix_video.shape[1]
truncate_len = predix_video_latent_length % causal_block_size
@@ -6,7 +6,7 @@ from .model import FantasyTalkingAudioConditionModel

from .utils import get_audio_features
import gc, torch

def parse_audio(audio_path, num_frames, fps = 23, device = "cuda"):
def parse_audio(audio_path, start_frame, num_frames, fps = 23, device = "cuda"):
fantasytalking = FantasyTalkingAudioConditionModel(None, 768, 2048).to(device)
from mmgp import offload
from accelerate import init_empty_weights

@@ -24,7 +24,7 @@ def parse_audio(audio_path, num_frames, fps = 23, device = "cuda"):

wav2vec = Wav2Vec2Model.from_pretrained(wav2vec_model_dir, device_map="cpu").eval().requires_grad_(False)
wav2vec.to(device)
proj_model.to(device)
audio_wav2vec_fea = get_audio_features( wav2vec, wav2vec_processor, audio_path, fps, num_frames )
audio_wav2vec_fea = get_audio_features( wav2vec, wav2vec_processor, audio_path, fps, start_frame, num_frames)

audio_proj_fea = proj_model(audio_wav2vec_fea)
pos_idx_ranges = fantasytalking.split_audio_sequence( audio_proj_fea.size(1), num_frames=num_frames )
@@ -26,13 +26,18 @@ def save_video(frames, save_path, fps, quality=9, ffmpeg_params=None):

writer.close()


def get_audio_features(wav2vec, audio_processor, audio_path, fps, num_frames):
def get_audio_features(wav2vec, audio_processor, audio_path, fps, start_frame, num_frames):
sr = 16000
audio_input, sample_rate = librosa.load(audio_path, sr=sr) # sampling rate is 16 kHz
if start_frame < 0:
pad = int(abs(start_frame)/ fps * sr)
audio_input = np.concatenate([np.zeros(pad), audio_input])
end_frame = num_frames
else:
end_frame = start_frame + num_frames

start_time = 0
# end_time = (0 + (num_frames - 1) * 1) / fps
end_time = num_frames / fps
start_time = start_frame / fps
end_time = end_frame / fps

start_sample = int(start_time * sr)
end_sample = int(end_time * sr)
@@ -762,7 +762,11 @@ class WanModel(ModelMixin, ConfigMixin):

offload.shared_state["_chipmunk_layers"] = None

def preprocess_loras(self, model_type, sd):

new_sd = {}
for k,v in sd.items():
if not k.endswith(".modulation.diff"):
new_sd[ k] = v
sd = new_sd
first = next(iter(sd), None)
if first == None:
return sd
@@ -74,7 +74,7 @@ def audio_prepare_single(audio_path, sample_rate=16000, duration = 0):

return human_speech_array


def audio_prepare_multi(left_path, right_path, audio_type = "add", sample_rate=16000, duration = 0):
def audio_prepare_multi(left_path, right_path, audio_type = "add", sample_rate=16000, duration = 0, pad = 0):
if not (left_path==None or right_path==None):
human_speech_array1 = audio_prepare_single(left_path, duration = duration)
human_speech_array2 = audio_prepare_single(right_path, duration = duration)

@@ -91,7 +91,13 @@ def audio_prepare_multi(left_path, right_path, audio_type = "add", sample_rate=1

elif audio_type=='add':
new_human_speech1 = np.concatenate([human_speech_array1[: human_speech_array1.shape[0]], np.zeros(human_speech_array2.shape[0])])
new_human_speech2 = np.concatenate([np.zeros(human_speech_array1.shape[0]), human_speech_array2[:human_speech_array2.shape[0]]])

# don't include the padding in the summed audio, which is used to build the output audio track
sum_human_speechs = new_human_speech1 + new_human_speech2
if pad > 0:
new_human_speech1 = np.concatenate([np.zeros(pad), new_human_speech1])
new_human_speech2 = np.concatenate([np.zeros(pad), new_human_speech2])

return new_human_speech1, new_human_speech2, sum_human_speechs

def process_tts_single(text, save_dir, voice1):

@@ -167,14 +173,13 @@ def process_tts_multi(text, save_dir, voice1, voice2):

return s1, s2, save_path_sum


def get_full_audio_embeddings(audio_guide1 = None, audio_guide2 = None, combination_type ="add", num_frames = 0, fps = 25, sr = 16000):
def get_full_audio_embeddings(audio_guide1 = None, audio_guide2 = None, combination_type ="add", num_frames = 0, fps = 25, sr = 16000, padded_frames_for_embeddings = 0):
wav2vec_feature_extractor, audio_encoder= custom_init('cpu', "ckpts/chinese-wav2vec2-base")
# wav2vec_feature_extractor, audio_encoder= custom_init('cpu', "ckpts/wav2vec")

new_human_speech1, new_human_speech2, sum_human_speechs = audio_prepare_multi(audio_guide1, audio_guide2, combination_type, duration= num_frames / fps)
pad = int(padded_frames_for_embeddings/ fps * sr)
new_human_speech1, new_human_speech2, sum_human_speechs = audio_prepare_multi(audio_guide1, audio_guide2, combination_type, duration= num_frames / fps, pad = pad)
audio_embedding_1 = get_embedding(new_human_speech1, wav2vec_feature_extractor, audio_encoder, sr=sr, fps= fps)
audio_embedding_2 = get_embedding(new_human_speech2, wav2vec_feature_extractor, audio_encoder, sr=sr, fps= fps)

full_audio_embs = []
if audio_guide1 != None: full_audio_embs.append(audio_embedding_1)
# if audio_guide1 != None: full_audio_embs.append(audio_embedding_1)
91	wan/utils/loras_mutipliers.py (new file)

@@ -0,0 +1,91 @@
def preparse_loras_multipliers(loras_multipliers):
    if isinstance(loras_multipliers, list):
        return [multi.strip(" \r\n") if isinstance(multi, str) else multi for multi in loras_multipliers]

    loras_multipliers = loras_multipliers.strip(" \r\n")
    loras_mult_choices_list = loras_multipliers.replace("\r", "").split("\n")
    loras_mult_choices_list = [multi.strip() for multi in loras_mult_choices_list if len(multi)>0 and not multi.startswith("#")]
    loras_multipliers = " ".join(loras_mult_choices_list)
    return loras_multipliers.split(" ")

def expand_slist(slists_dict, mult_no, num_inference_steps, model_switch_step ):
    def expand_one(slist, num_inference_steps):
        if not isinstance(slist, list): slist = [slist]
        new_slist= []
        if num_inference_steps <=0:
            return new_slist
        inc = len(slist) / num_inference_steps
        pos = 0
        for i in range(num_inference_steps):
            new_slist.append(slist[ int(pos)])
            pos += inc
        return new_slist

    phase1 = slists_dict["phase1"][mult_no]
    phase2 = slists_dict["phase2"][mult_no]
    if isinstance(phase1, float) and isinstance(phase2, float) and phase1 == phase2:
        return phase1
    return expand_one(phase1, model_switch_step) + expand_one(phase2, num_inference_steps - model_switch_step)

def parse_loras_multipliers(loras_multipliers, nb_loras, num_inference_steps, merge_slist = None, max_phases = 2, model_switch_step = None):
    if model_switch_step is None:
        model_switch_step = num_inference_steps
    def is_float(element: any) -> bool:
        if element is None:
            return False
        try:
            float(element)
            return True
        except ValueError:
            return False
    loras_list_mult_choices_nums = []
    slists_dict = { "model_switch_step": model_switch_step}
    slists_dict["phase1"] = phase1 = [1.] * nb_loras
    slists_dict["phase2"] = phase2 = [1.] * nb_loras

    if isinstance(loras_multipliers, list) or len(loras_multipliers) > 0:
        list_mult_choices_list = preparse_loras_multipliers(loras_multipliers)
        for i, mult in enumerate(list_mult_choices_list):
            current_phase = phase1
            if isinstance(mult, str):
                mult = mult.strip()
                phase_mult = mult.split(";")
                shared_phases = len(phase_mult) <=1
                if len(phase_mult) > max_phases:
                    return "", "", f"Loras can not be defined for more than {max_phases} Denoising phases for this model"
                for phase_no, mult in enumerate(phase_mult):
                    if phase_no > 0: current_phase = phase2
                    if "," in mult:
                        multlist = mult.split(",")
                        slist = []
                        for smult in multlist:
                            if not is_float(smult):
                                return "", "", f"Lora sub value no {i+1} ({smult}) in Multiplier definition '{multlist}' is invalid"
                            slist.append(float(smult))
                    else:
                        if not is_float(mult):
                            return "", "", f"Lora Multiplier no {i+1} ({mult}) is invalid"
                        slist = float(mult)
                    if shared_phases:
                        phase1[i] = phase2[i] = slist
                    else:
                        current_phase[i] = slist
            else:
                phase1[i] = phase2[i] = float(mult)

    if merge_slist is not None:
        slists_dict["phase1"] = phase1 = merge_slist["phase1"] + phase1
        slists_dict["phase2"] = phase2 = merge_slist["phase2"] + phase2

    loras_list_mult_choices_nums = [ expand_slist(slists_dict, i, num_inference_steps, model_switch_step ) for i in range(len(phase1)) ]
    loras_list_mult_choices_nums = [ slist[0] if isinstance(slist, list) else slist for slist in loras_list_mult_choices_nums ]

    return loras_list_mult_choices_nums, slists_dict, ""

def update_loras_slists(trans, slists_dict, num_inference_steps, model_switch_step = None ):
    from mmgp import offload
    sz = len(slists_dict["phase1"])
    slists = [ expand_slist(slists_dict, i, num_inference_steps, model_switch_step ) for i in range(sz) ]
    nos = [str(l) for l in range(sz)]
    offload.activate_loras(trans, nos, slists )
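The `model_switch_step` argument of `update_loras_slists` is what the Wan 2.2 denoising loop (see the any2video.py hunk earlier in this commit) uses to split the per-phase multipliers: it scans the timestep schedule for the first step at or below the High/Low Noise switch threshold and passes that index along. A minimal sketch of that pattern, with the model, Lora state, timesteps and threshold left as placeholders prepared elsewhere by WanGP:

```python
# Sketch of how the denoising loop picks the phase-switch step for Wan 2.2;
# `model`, `loras_slists`, `timesteps` and `switch_threshold` are placeholders
# standing in for objects WanGP prepares before sampling starts.
from wan.utils.loras_mutipliers import update_loras_slists

def activate_phase_aware_loras(model, loras_slists, timesteps, switch_threshold):
    num_steps = len(timesteps)
    # Default: never switch, i.e. the High Noise multipliers cover every step.
    model_switch_step = num_steps
    for i, t in enumerate(timesteps):
        if t <= switch_threshold:
            model_switch_step = i
            break
    # Expand the per-phase multiplier lists into one value per step and hand
    # them to the offload runtime that applies the Loras during denoising.
    update_loras_slists(model, loras_slists, num_steps,
                        model_switch_step=model_switch_step)
```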
@@ -18,6 +18,8 @@ import random

import ffmpeg
import os
import tempfile
import subprocess
import json

__all__ = ['cache_video', 'cache_image', 'str2bool']

@@ -34,21 +36,6 @@ def seed_everything(seed: int):

if torch.backends.mps.is_available():
torch.mps.manual_seed(seed)

def expand_slist(slist, num_inference_steps ):
new_slist= []
inc = len(slist) / num_inference_steps
pos = 0
for i in range(num_inference_steps):
new_slist.append(slist[ int(pos)])
pos += inc
return new_slist

def update_loras_slists(trans, slists, num_inference_steps ):
from mmgp import offload
slists = [ expand_slist(slist, num_inference_steps ) if isinstance(slist, list) else slist for slist in slists ]
nos = [str(l) for l in range(len(slists))]
offload.activate_loras(trans, nos, slists )

def resample(video_fps, video_frames_count, max_target_frames_count, target_fps, start_target_frame ):
import math

@@ -141,10 +128,12 @@ def convert_image_to_video(image):

return temp_video.name

def resize_lanczos(img, h, w):
img = Image.fromarray(np.clip(255. * img.movedim(0, -1).cpu().numpy(), 0, 255).astype(np.uint8))
img = (img + 1).float().mul_(127.5)
img = Image.fromarray(np.clip(img.movedim(0, -1).cpu().numpy(), 0, 255).astype(np.uint8))
img = img.resize((w,h), resample=Image.Resampling.LANCZOS)
return torch.from_numpy(np.array(img).astype(np.float32) / 255.0).movedim(-1, 0)

img = torch.from_numpy(np.array(img).astype(np.float32)).movedim(-1, 0)
img = img.div(127.5).sub_(1)
return img

def remove_background(img, session=None):
if session ==None:

@@ -445,109 +434,180 @@ def create_progress_hook(filename):

return progress_hook(block_num, block_size, total_size, filename)
return hook
import tempfile, os
import ffmpeg
import os
import tempfile

def extract_audio_tracks(source_video, verbose=False, query_only= False):
def extract_audio_tracks(source_video, verbose=False, query_only=False):
"""
Extract all audio tracks from source video to temporary files.

Args:
source_video: Path to video with audio to extract
verbose: Enable verbose output (default: False)

Extract all audio tracks from a source video into temporary AAC files.

Returns:
List of temporary audio file paths, or empty list if no audio tracks
Tuple:
- List of temp file paths for extracted audio tracks
- List of corresponding metadata dicts:
{'codec', 'sample_rate', 'channels', 'duration', 'language'}
where 'duration' is set to container duration (for consistency).
"""
probe = ffmpeg.probe(source_video)
audio_streams = [s for s in probe['streams'] if s['codec_type'] == 'audio']
container_duration = float(probe['format'].get('duration', 0.0))

if not audio_streams:
if query_only: return 0
if verbose: print(f"No audio track found in {source_video}")
return [], []

if query_only:
return len(audio_streams)

if verbose:
print(f"Found {len(audio_streams)} audio track(s), container duration = {container_duration:.3f}s")

file_paths = []
metadata = []

for i, stream in enumerate(audio_streams):
fd, temp_path = tempfile.mkstemp(suffix=f'_track{i}.aac', prefix='audio_')
os.close(fd)

file_paths.append(temp_path)
metadata.append({
'codec': stream.get('codec_name'),
'sample_rate': int(stream.get('sample_rate', 0)),
'channels': int(stream.get('channels', 0)),
'duration': container_duration,
'language': stream.get('tags', {}).get('language', None)
})

ffmpeg.input(source_video).output(
temp_path,
**{f'map': f'0:a:{i}', 'acodec': 'aac', 'b:a': '128k'}
).overwrite_output().run(quiet=not verbose)

return file_paths, metadata


import subprocess

def combine_and_concatenate_video_with_audio_tracks(
save_path_tmp, video_path,
source_audio_tracks, new_audio_tracks,
source_audio_duration, audio_sampling_rate,
new_audio_from_start=False,
source_audio_metadata=None,
audio_bitrate='128k',
audio_codec='aac'
):
inputs, filters, maps, idx = ['-i', video_path], [], ['-map', '0:v'], 1
metadata_args = []
sources = source_audio_tracks or []
news = new_audio_tracks or []

duplicate_source = len(sources) == 1 and len(news) > 1
N = len(news) if source_audio_duration == 0 else max(len(sources), len(news)) or 1

for i in range(N):
s = (sources[i] if i < len(sources)
else sources[0] if duplicate_source else None)
n = news[i] if len(news) == N else (news[0] if news else None)

if source_audio_duration == 0:
if n:
inputs += ['-i', n]
filters.append(f'[{idx}:a]apad=pad_dur=100[aout{i}];')
idx += 1
else:
filters.append(f'anullsrc=r={audio_sampling_rate}:cl=mono,apad=pad_dur=100[aout{i}];')
else:
if s:
inputs += ['-i', s]
meta = source_audio_metadata[i] if source_audio_metadata and i < len(source_audio_metadata) else {}
needs_filter = (
meta.get('codec') != audio_codec or
meta.get('sample_rate') != audio_sampling_rate or
meta.get('channels') != 1 or
meta.get('duration', 0) < source_audio_duration
)
if needs_filter:
filters.append(
f'[{idx}:a]aresample={audio_sampling_rate},aformat=channel_layouts=mono,'
f'apad=pad_dur={source_audio_duration},atrim=0:{source_audio_duration},asetpts=PTS-STARTPTS[s{i}];')
else:
filters.append(
f'[{idx}:a]apad=pad_dur={source_audio_duration},atrim=0:{source_audio_duration},asetpts=PTS-STARTPTS[s{i}];')
if lang := meta.get('language'):
metadata_args += ['-metadata:s:a:' + str(i), f'language={lang}']
idx += 1
else:
filters.append(
f'anullsrc=r={audio_sampling_rate}:cl=mono,atrim=0:{source_audio_duration},asetpts=PTS-STARTPTS[s{i}];')

if n:
inputs += ['-i', n]
start = '0' if new_audio_from_start else source_audio_duration
filters.append(
f'[{idx}:a]aresample={audio_sampling_rate},aformat=channel_layouts=mono,'
f'atrim=start={start},asetpts=PTS-STARTPTS[n{i}];'
f'[s{i}][n{i}]concat=n=2:v=0:a=1[aout{i}];')
idx += 1
else:
filters.append(f'[s{i}]apad=pad_dur=100[aout{i}];')

maps += ['-map', f'[aout{i}]']

cmd = ['ffmpeg', '-y', *inputs,
'-filter_complex', ''.join(filters),
*maps, *metadata_args,
'-c:v', 'copy',
'-c:a', audio_codec,
'-b:a', audio_bitrate,
'-ar', str(audio_sampling_rate),
'-ac', '1',
'-shortest', save_path_tmp]

try:
# Check if source video has audio
probe = ffmpeg.probe(source_video)
audio_streams = [s for s in probe['streams'] if s['codec_type'] == 'audio']

if not audio_streams:
if query_only: return 0
if verbose:
print(f"No audio track found in {source_video}")
return []
if query_only: return len(audio_streams)
if verbose:
print(f"Found {len(audio_streams)} audio track(s)")

# Create temporary audio files for each track
temp_audio_files = []
for i in range(len(audio_streams)):
fd, temp_path = tempfile.mkstemp(suffix=f'_track{i}.aac', prefix='audio_')
os.close(fd) # Close file descriptor immediately
temp_audio_files.append(temp_path)

# Extract each audio track
for i, temp_path in enumerate(temp_audio_files):
(ffmpeg
.input(source_video)
.output(temp_path, **{f'map': f'0:a:{i}', 'acodec': 'aac'})
.overwrite_output()
.run(quiet=not verbose))

return temp_audio_files

except ffmpeg.Error as e:
print(f"FFmpeg error during audio extraction: {e}")
return 0 if query_only else []
except Exception as e:
print(f"Error during audio extraction: {e}")
return 0 if query_only else []
subprocess.run(cmd, check=True, capture_output=True, text=True)
except subprocess.CalledProcessError as e:
raise Exception(f"FFmpeg error: {e.stderr}")
def combine_video_with_audio_tracks(target_video, audio_tracks, output_video, verbose=False):
"""
Combine video with audio tracks. Output duration matches video length exactly.

Args:
target_video: Path to video to receive the audio
audio_tracks: List of audio file paths to combine
output_video: Path for the output video
verbose: Enable verbose output (default: False)

Returns:
True if successful, False otherwise
"""

import ffmpeg


import subprocess
import ffmpeg

def combine_video_with_audio_tracks(target_video, audio_tracks, output_video,
audio_metadata=None, verbose=False):
if not audio_tracks:
if verbose:
print("No audio tracks to combine")
return False

try:
# Get video duration to ensure exact alignment
video_probe = ffmpeg.probe(target_video)
video_duration = float(video_probe['streams'][0]['duration'])

if verbose:
print(f"Target video duration: {video_duration:.3f} seconds")

# Combine target video with all audio tracks, force video duration
video = ffmpeg.input(target_video).video
audio_inputs = [ffmpeg.input(audio_path).audio for audio_path in audio_tracks]

# Create output with video duration as master timing
inputs = [video] + audio_inputs
(ffmpeg
.output(*inputs, output_video,
vcodec='copy',
acodec='copy',
t=video_duration) # Force exact video duration
.overwrite_output()
.run(quiet=not verbose))

if verbose:
print(f"Successfully created {output_video} with {len(audio_tracks)} audio track(s) aligned to video duration")
return True

except ffmpeg.Error as e:
print(f"FFmpeg error during video combination: {e}")
return False
except Exception as e:
print(f"Error during video combination: {e}")
return False
if verbose: print("No audio tracks to combine."); return False

dur = float(next(s for s in ffmpeg.probe(target_video)['streams']
if s['codec_type'] == 'video')['duration'])
if verbose: print(f"Video duration: {dur:.3f}s")

cmd = ['ffmpeg', '-y', '-i', target_video]
for path in audio_tracks:
cmd += ['-i', path]

cmd += ['-map', '0:v']
for i in range(len(audio_tracks)):
cmd += ['-map', f'{i+1}:a']

for i, meta in enumerate(audio_metadata or []):
if (lang := meta.get('language')):
cmd += ['-metadata:s:a:' + str(i), f'language={lang}']

cmd += ['-c:v', 'copy', '-c:a', 'copy', '-t', str(dur), output_video]

result = subprocess.run(cmd, capture_output=not verbose, text=True)
if result.returncode != 0:
raise Exception(f"FFmpeg error:\n{result.stderr}")
if verbose:
print(f"Created {output_video} with {len(audio_tracks)} audio track(s)")
return True


def cleanup_temp_audio_files(audio_tracks, verbose=False):
"""