mirror of https://github.com/Wan-Video/Wan2.1.git, synced 2025-11-04 06:15:17 +00:00

attack of the clones

commit 7bcd724621 (parent 9fa267087b)

README.md (228 changes)

@ -20,6 +20,19 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTX Video models
**Follow DeepBeepMeep on Twitter/X to get the Latest News**: https://x.com/deepbeepmeep

## 🔥 Latest Updates :
### September 15 2025: WanGP v8.6 - Attack of the Clones

- The long awaited **Vace for Wan 2.2** is at last here, or maybe not: it has been released by the *Fun Team* of *Alibaba* and it is not official. You can play with the vanilla version (**Vace Fun**) or with the one accelerated with Loras (**Vace Fun Cocktail**)

- **First Frame / Last Frame for Vace**: Vace models are so powerful that they could do *First Frame / Last Frame* since day one using the *Injected Frames* feature. However, this required computing by hand the location of each end frame, since this feature expects frame positions. I have made these locations easier to express with the "L" alias:

For a video generated from scratch, *"1 L L L"* means the 4 Injected Frames will be injected like this: frame no 1 at the first position, the next frame at the end of the first window, then the following frame at the end of the next window, and so on.

If you *Continue a Video*, you just need *"L L L"*, since the first frame is already the last frame of the *Source Video*. In any case, remember that numeric frame positions (like "1") are aligned by default to the beginning of the source window, so low values such as 1 will be considered in the past unless you change this behaviour in *Sliding Window Tab / Control Video, Injected Frames alignment*. A toy sketch of how these positions expand is given after this list.

- **Qwen Inpainting** now exists in two versions: the original version from the previous release and a Lora based version. Each version has its pros and cons. For instance, the Lora version also supports **Outpainting**! However, it tends to slightly change the original image even outside the outpainted area.

- **Better Lipsync with all the Audio to Video models**: you probably noticed that *Multitalk*, *InfiniteTalk* or *Hunyuan Avatar* had so-so lipsync when the provided audio contained some background music. The problem should be solved now thanks to automated background music removal, all done by AI. Don't worry, you will still hear the music, as it is added back into the generated Video.
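Below is a toy Python sketch of how the "L" alias could expand into absolute frame positions. It assumes non-overlapping windows of 81 frames and serves only to illustrate the rule described above; it is not WanGP's actual parser, and the function name is made up.

```python
# Toy illustration of the "L" alias for Injected Frames positions.
# Assumption: non-overlapping sliding windows of 81 frames (the real
# implementation may handle window overlap and alignment differently).
def expand_positions(spec: str, window_size: int = 81) -> list[int]:
    positions = []
    window_end = window_size              # end of the first window
    for token in spec.split():
        if token == "L":
            positions.append(window_end)  # "L" = last frame of the current window
            window_end += window_size     # the next "L" targets the next window
        else:
            positions.append(int(token))  # a numeral is an explicit frame position
    return positions

print(expand_positions("1 L L L"))  # [1, 81, 162, 243] for a generation from scratch
print(expand_positions("L L L"))    # [81, 162, 243] when continuing a video
```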
### September 11 2025: WanGP v8.5/8.55 - Wanna be a Cropper or a Painter ?

I have done some intensive internal refactoring of the generation pipeline to ease support of existing models and the addition of new ones. Nothing really visible, but this makes WanGP a little more future proof.

@ -74,221 +87,6 @@ You will find below a 33s movie I have created using these two methods. Quality

*update 8.31: one shouldn't talk about bugs if one doesn't want to attract bugs*
### August 29 2025: WanGP v8.21 - Here Goes Your Weekend

- **InfiniteTalk Video to Video**: this feature can be used for Video Dubbing. Keep in mind that it is a *Sparse Video to Video*, that is, internally only one image is used per Sliding Window. However, thanks to the new *Smooth Transition* mode, each new clip is connected to the previous one and all the camera work is done by InfiniteTalk. If you don't get any transition, increase the number of frames of a Sliding Window (81 frames recommended)

- **StandIn**: a very light model specialized in Identity Transfer. I have provided two versions of StandIn: a basic one derived from the text 2 video model and another based on Vace. If used with Vace, the last reference frame given to Vace will also be used for StandIn

- **Flux ESO**: a new Flux derived *Image Editing tool*, but this one is specialized both in *Identity Transfer* and *Style Transfer*. Style has to be understood in its wide meaning: give a reference picture of a person and another one of Sushis and you will turn this person into Sushis
### August 24 2025: WanGP v8.1 - the RAM Liberator

- **Reserved RAM entirely freed when switching models**: you should get far fewer RAM-related out-of-memory errors. I have also added a button in *Configuration / Performance* that will release most of the RAM used by WanGP if you want to use another application without quitting WanGP
- **InfiniteTalk** support: an improved version of Multitalk that supposedly supports very long video generations based on an audio track. It exists in two flavors (*Single Speaker* and *Multi Speakers*) but doesn't seem to be compatible with Vace. One key new feature compared to Multitalk is that you can have different visual shots associated with the same audio: each Reference Frame you provide will be associated with a new Sliding Window. If only one Reference Frame is provided, it will be used for all windows. When Continuing a video, you can either continue the current shot (no Reference Frame) or add new shots (one or more Reference Frames).\
If you are not into audio, you can still use this model to generate infinitely long image2video, just select "no speaker". Last but not least, InfiniteTalk works with all the Loras accelerators.
- **Flux Chroma 1 HD** support: an uncensored Flux based model, lighter than Flux (8.9B versus 12B), that can fit entirely in VRAM with only 16 GB of VRAM. Unfortunately it is not distilled, so you will need CFG and a minimum of 20 steps
### August 21 2025: WanGP v8.01 - the killer of seven

- **Qwen Image Edit**: a Flux Kontext challenger (prompt driven image editing). Best results (including Identity preservation) will be obtained at 720p. Beyond that you may get image outpainting and / or lose identity preservation; below 720p, prompt adherence will be worse. Qwen Image Edit works with the Qwen Lora Lightning 4 steps. I have also unlocked all the resolutions for Qwen models. Bonus Zone: support for multiple image compositions, but identity preservation won't be as good.
- **On demand Prompt Enhancer** (needs to be enabled in the Configuration Tab) that you can use to Enhance a Text Prompt before starting a Generation. You can refine the Enhanced Prompt or change the original Prompt.
- Choice of a **Non censored Prompt Enhancer**. Beware, this one is VRAM hungry and will require 12 GB of VRAM to work
- **Memory Profile customizable per model**: useful, for instance, to set Profile 3 (preload the model entirely in VRAM) only for Image Generation models if you have 24 GB of VRAM. In that case Generation will be much faster, because with Image generators (contrary to Video generators) a lot of time is otherwise wasted in offloading
- **Expert Guidance Mode**: change the Guidance during the generation up to 2 times. Very useful with Wan 2.2 Lightning to reduce the slow motion effect. The idea is to insert a CFG phase before the 2 accelerated phases that follow and have no Guidance. I have added the finetune *Wan2.2 Vace Lightning 3 Phases 14B* with a prebuilt configuration. Please note that it is an 8 steps process although the Lora Lightning is 4 steps. This expert guidance mode is also available with Wan 2.1. A hedged sketch of such a phased configuration is given after this list.
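As an illustration, here is a minimal Python sketch that writes a phased-guidance finetune definition. The two-phase keys (`guidance_phases`, `guidance_scale`, `guidance2_scale`, `switch_threshold`, `num_inference_steps`, `flow_shift`) mirror the Vace Fun definitions added in this commit; the third-phase keys (`guidance3_scale`, `switch2_threshold`) and all the values are assumptions, not the actual *Wan2.2 Vace Lightning 3 Phases 14B* configuration.

```python
import json

# Hedged sketch of a 3-phase guidance finetune definition.
# Keys for two phases come from the JSON defaults added in this commit;
# the 3rd-phase keys and all values below are illustrative assumptions.
phased_config = {
    "model": {
        "name": "My Wan2.2 3-Phase Experiment",  # hypothetical finetune name
        "architecture": "vace_14B",
        "URLs": "vace_fun_14B_2_2",              # reuse the URLs of another finetune
        "URLs2": "vace_fun_14B_2_2",
        "group": "wan2_2",
    },
    "guidance_phases": 3,       # assumption: 3 phases instead of the 2 shown in this commit
    "num_inference_steps": 8,   # 8 steps overall, as described for the Lightning finetune
    "guidance_scale": 3.5,      # phase 1: a real CFG phase (illustrative value)
    "guidance2_scale": 1,       # phase 2: accelerated, no guidance
    "guidance3_scale": 1,       # phase 3: accelerated, no guidance (hypothetical key)
    "switch_threshold": 900,    # illustrative switch points; the second key is hypothetical
    "switch2_threshold": 875,
    "flow_shift": 2,
}

with open("finetunes/wan2_2_three_phase_sketch.json", "w") as f:
    json.dump(phased_config, f, indent=4)
```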
*WanGP 8.01 update: improved Qwen Image Edit Identity Preservation*

### August 12 2025: WanGP v7.7777 - Lucky Day(s)

This is your lucky day! Thanks to new configuration options that let you store generated Videos and Images in lossless compressed formats, you will find that they look two times better without doing anything!

Just kidding, they will only be marginally better, but at least this opens the way to professional editing.

Support:
- Video: x264, x264 lossless, x265
- Images: jpeg, png, webp, webp lossless

Generation Settings are stored in each of the above regardless of the format (that was the hard part).

Also you can now choose different output directories for images and videos.

Unexpected luck: fixed Lightning 8 steps for Qwen and Lightning 4 steps for Wan 2.2; now you just need a 1x multiplier, no weird numbers (a hedged sketch follows below).

*update 7.777: oops, got a crash with FastWan? Luck comes and goes, try a new update, maybe you will have a better chance this time*

*update 7.7777: Sometimes good luck seems to last forever. For instance, what if Qwen Lightning 4 steps could also work with WanGP?*
- https://huggingface.co/lightx2v/Qwen-Image-Lightning/resolve/main/Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors (Qwen Lightning 4 steps)
- https://huggingface.co/lightx2v/Qwen-Image-Lightning/resolve/main/Qwen-Image-Lightning-8steps-V1.1-bf16.safetensors (new improved version of Qwen Lightning 8 steps)
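For illustration, here is a minimal Python sketch of a finetune definition that attaches the Qwen Lightning 4-step Lora with a plain 1x multiplier. The `loras` / `loras_multipliers` keys mirror the Vace Fun Cocktail definition added in this commit; the model name, architecture id and output path are assumptions.

```python
import json

# Hedged sketch: attach the Qwen Lightning 4-step Lora with a 1x multiplier.
# "loras" / "loras_multipliers" follow the schema of the Vace Fun Cocktail
# definition added in this commit; everything else is an illustrative assumption.
qwen_lightning_finetune = {
    "model": {
        "name": "Qwen Image Lightning 4 steps (sketch)",  # hypothetical name
        "architecture": "qwen_image_20B",                 # assumption: the real architecture id may differ
        "loras": [
            "https://huggingface.co/lightx2v/Qwen-Image-Lightning/resolve/main/Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors",
        ],
        "loras_multipliers": [1],  # 1x multiplier, "no weird numbers"
    },
    "num_inference_steps": 4,      # 4-step Lightning schedule
    "guidance_scale": 1,           # accelerator Loras are used without CFG
}

with open("finetunes/qwen_lightning_4steps_sketch.json", "w") as f:
    json.dump(qwen_lightning_finetune, f, indent=4)
```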
### August 10 2025: WanGP v7.76 - Faster than the VAE ...
We have a funny one here today: FastWan 2.2 5B, the Fastest Video Generator, only 20s to generate 121 frames at 720p. The snag is that the VAE is twice as slow...
Thanks to Kijai for extracting the Lora that is used to build the corresponding finetune.

*WanGP 7.76: fixed the mess I made of the i2v models (the loras path was wrong for Wan 2.2 and Clip was broken)*

### August 9 2025: WanGP v7.74 - Qwen Rebirth part 2
Added support for the Qwen Lightning Lora for an 8 steps generation (https://huggingface.co/lightx2v/Qwen-Image-Lightning/blob/main/Qwen-Image-Lightning-8steps-V1.0.safetensors). The Lora is not normalized, so use a multiplier around 0.1.

Mag Cache support for all the Wan 2.2 models. Don't forget to set guidance to 1 and 8 denoising steps, and your gen will be 7x faster!

### August 8 2025: WanGP v7.73 - Qwen Rebirth
Ever wondered what impact not using Guidance has on a model that expects it? Just look at Qwen Image in WanGP 7.71, whose outputs were erratic. Somehow I had convinced myself that Qwen was a distilled model. In fact Qwen was dying for a negative prompt, and in WanGP 7.72 there is at last one for it.

As Qwen is not so picky after all, I have also added a quantized text encoder which reduces the RAM requirements of Qwen by 10 GB (the quantized text encoder produced garbage before).

Unfortunately the Sage bug is still there for older GPU architectures. Added an Sdpa fallback for these architectures.

*7.73 update: still a Sage / Sage2 bug for GPUs before RTX 40xx. I have added a detection mechanism that forces Sdpa attention if that's the case*
### August 6 2025: WanGP v7.71 - Picky, picky

This release comes with two new models:
- Qwen Image: a Commercial grade Image generator capable of injecting full sentences into the generated Image while still offering incredible visuals
- Wan 2.2 TextImage to Video 5B: the last Wan 2.2 model needed if you want to complete your Wan 2.2 collection (loras for this model can be stored in "\loras\5B")

There is a catch though, they are very picky if you want to get good generations: first, they both need lots of steps (50?) to show what they have to offer. Then, for Qwen Image I had to hardcode the supported resolutions, because if you try anything else you will get garbage. Likewise Wan 2.2 5B will remind you of Wan 1.0 if you don't ask for at least 720p.

*7.71 update: Added VAE Tiling for both Qwen Image and Wan 2.2 TextImage to Video 5B, for low VRAM during a whole gen.*
### August 4 2025: WanGP v7.6 - Remuxed

With this new version you won't have any excuse if there is no sound in your video.

*Continue Video* now works with any video that already has some sound (hint: Multitalk).

Also, on top of MMAudio and the various sound driven models, I have added the ability to use your own soundtrack.

As a result you can apply a different sound source to each new video segment when doing a *Continue Video*.

For instance:
- first video part: use Multitalk with two people speaking
- second video part: apply your own soundtrack, which will gently follow the Multitalk conversation
- third video part: use a Vace effect, and its corresponding control audio will be concatenated to the rest of the audio

To multiply the combinations I have also implemented *Continue Video* with the various image2video models.

Also:
- End Frame support added for LTX Video models
- Loras can now be targeted specifically at the High noise or Low noise models with Wan 2.2, check the Loras and Finetune guides
- Flux Krea Dev support
### July 30 2025: WanGP v7.5: Just another release ... Wan 2.2 part 2
Here is now Wan 2.2 image2video, a very good model if you want to set Start and End frames. Two Wan 2.2 models delivered, only one to go ...

Please note that although it is an image2video model, it is structurally very close to Wan 2.2 text2video (same layers with only a different initial projection). Given that Wan 2.1 image2video loras don't work too well with it (half of their tensors are not supported), I have decided that this model will look for its loras in the text2video loras folder instead of the image2video folder.

I have also optimized RAM management with Wan 2.2 so that loras and modules are loaded only once in RAM and Reserved RAM; this saves up to 5 GB of RAM, which can make a difference...

And this time I really removed Vace Cocktail Light, which gave a blurry vision.
### July 29 2025: WanGP v7.4: Just another release ... Wan 2.2 Preview
Wan 2.2 is here. The good news is that WanGP won't require a single byte of extra VRAM to run it and it will be as fast as Wan 2.1. The bad news is that you will need much more RAM if you want to leverage this new model entirely, since it has twice as many parameters.

So here is a preview version of Wan 2.2, that is, without the 5B model and Wan 2.2 image to video for the moment.

However, as I felt bad delivering only half of the wares, I gave you instead ..... **Wan 2.2 Vace Experimental Cocktail**!

A very good surprise indeed: the loras and Vace partially work with Wan 2.2. We will need to wait for the official Vace 2.2 release since some Vace features, like identity preservation, are broken.

Bonus zone: Flux multi images conditions has been added, or maybe not, if I broke everything while being distracted by Wan...

7.4 update: I forgot to update the version number. I also removed Vace Cocktail Light which didn't work well.
### July 27 2025: WanGP v7.3 : Interlude
While waiting for Wan 2.2, you will appreciate the model selection hierarchy, which is very useful to collect even more models. You will also appreciate that WanGP remembers which model you used last in each model family.
### July 26 2025: WanGP v7.2 : Ode to Vace
I am really convinced that Vace can do everything the other models can do, and in a better way, especially as Vace can be combined with Multitalk.

Here are some new Vace improvements:
- I have provided a default finetune named *Vace Cocktail*, which is a model created on the fly using the Wan text 2 video model and the Loras used to build FusioniX. The weight of the *Detail Enhancer* Lora has been reduced to improve identity preservation. Copy the model definition in *defaults/vace_14B_cocktail.json* into the *finetunes/* folder to change the Cocktail composition (a hedged sketch of doing this is given after this list). Cocktail already contains some Lora accelerators, so there is no need to add an AccVid, CausVid or FusioniX Lora on top. The whole point of Cocktail is to be able to build your own FusioniX (which originally is a combination of 4 loras) without the inconveniences of FusioniX.
- Talking about identity preservation, it tends to go away when one generates a single Frame instead of a Video, which is a shame for our Vace photoshop. But there is a solution: I have added an Advanced Quality option that tells WanGP to generate a little more than one frame (it will still keep only the first frame). It will be a little slower, but you will be amazed how Vace Cocktail combined with this option preserves identities (bye bye *Phantom*).
- As in practice I have observed that one switches frequently between *Vace text2video* and *Vace text2image*, I have put them in the same place; they are now just one tab away, no need to reload the model. Likewise *Wan text2video* and *Wan text2image* have been merged.
- Color fixing when using Sliding Windows. A new postprocessing step, *Color Correction*, applied automatically by default (you can disable it in the *Sliding Window* Advanced tab), will try to match the colors of the new window with those of the previous window. It doesn't fix all the unwanted artifacts of the new window, but at least it makes the transition smoother. Thanks to the Multitalk team for the original code.
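A minimal Python sketch of the workflow described in the first bullet: copy the Cocktail definition into *finetunes/* and lower the *Detail Enhancer* multiplier. The `loras` / `loras_multipliers` layout follows the Vace Fun Cocktail JSON added in this commit; the exact contents of *defaults/vace_14B_cocktail.json* and the chosen values are assumptions.

```python
import json
import shutil

# Hedged sketch: customise the Vace Cocktail composition.
# Assumes defaults/vace_14B_cocktail.json uses the same "loras" /
# "loras_multipliers" layout as the Vace Fun Cocktail file in this commit.
shutil.copy("defaults/vace_14B_cocktail.json", "finetunes/my_vace_cocktail.json")

with open("finetunes/my_vace_cocktail.json") as f:
    cocktail = json.load(f)

loras = cocktail["model"]["loras"]
multipliers = cocktail["model"]["loras_multipliers"]

# Lower the Detail Enhancer weight even further (found by name, not by fixed index).
for i, url in enumerate(loras):
    if "DetailEnhancer" in url:
        multipliers[i] = 0.1               # illustrative value

cocktail["model"]["name"] = "My Vace Cocktail"  # give the copy its own name

with open("finetunes/my_vace_cocktail.json", "w") as f:
    json.dump(cocktail, f, indent=4)
```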
Also, you will enjoy our new real time statistics (CPU / GPU usage, RAM / VRAM used, ...). Many thanks to **Redtash1** for providing the framework for this new feature! You need to go to the Config tab to enable real time stats.
### July 21 2025: WanGP v7.12
- Flux Family Reunion: *Flux Dev* and *Flux Schnell* have been invited aboard WanGP. To celebrate, Lora support for the Flux *diffusers* format has also been added.

- LTX Video upgraded to version 0.9.8: you can now generate 1800 frames (1 min of video!) in one go without a sliding window. With the distilled model it will take only 5 minutes on an RTX 4090 (you will need 22 GB of VRAM though). I have added options to select a higher number of frames if you want to experiment (go to Configuration Tab / General / Increase the Max Number of Frames, change the value and restart the App)

- LTX Video ControlNet: a Control Net that allows you, for instance, to transfer a Human motion or Depth from a control video. It is not as powerful as Vace but can produce interesting things, especially as you can now quickly generate a 1 min video. Under the hood, IC-Loras (see below) for Pose, Depth and Canny are automatically loaded for you, no need to add them.

- LTX IC-Lora support: these are special Loras that consume a conditional image or video.
Besides the pose, depth and canny IC-Loras that are transparently loaded, there is the *detailer* (https://huggingface.co/Lightricks/LTX-Video-ICLoRA-detailer-13b-0.9.8), which is basically an upsampler. Add the *detailer* as a Lora and use LTX Raw Format as the control net choice to use it.

- Matanyone is now also for the GPU Poor as its VRAM requirements have been divided by 2! (7.12 shadow update)

- Easier way to select video resolution
### July 15 2025: WanGP v7.0 is an AI Powered Photoshop
This release turns the Wan models into Image Generators. It goes far beyond merely allowing you to generate a video made of a single frame:
- Multiple Images generated at the same time so that you can choose the one you like best. It is highly VRAM optimized, so you can generate, for instance, 4 720p Images at the same time with less than 10 GB
- With *image2image* the original text2video WanGP becomes an image upsampler / restorer
- *Vace image2image* comes out of the box with image outpainting, person / object replacement, ...
- You can use, in one click, a newly generated Image as a Start Image or Reference Image for a Video generation

And to complete the full suite of AI Image Generators, Ladies and Gentlemen please welcome for the first time in WanGP: **Flux Kontext**.\
As a reminder, Flux Kontext is an image editor: give it an image and a prompt and it will make the change for you.\
This highly optimized version of Flux Kontext will make you feel that you have been cheated all this time, as WanGP Flux Kontext requires only 8 GB of VRAM to generate 4 images at the same time with no need for quantization.

WanGP v7 comes with *Image2image* vanilla and *Vace FusioniX*. However you can build your own finetune where you combine a text2video or Vace model with any combination of Loras.

Also in the news:
- You can now enter the *Bbox* for each speaker in *Multitalk* to precisely locate who is speaking. And to save some headaches, the *Image Mask generator* will give you the *Bbox* coordinates of an area you have selected.
- *Film Grain* post processing to add a vintage look to your video
- The *First Last Frame to Video* model should work much better now, as I recently discovered its implementation was not complete
- More power for the finetuners: you can now embed Loras directly in the finetune definition. You can also override the default models (titles, visibility, ...) with your own finetunes. Check the doc, which has been updated.
### July 10 2025: WanGP v6.7, is NAG a game changer? You tell me
Maybe you knew that already, but most *Lora accelerators* we use today (CausVid, FusioniX) don't use *Guidance* at all (that is, *CFG* is set to 1). This helps to get much faster generations, but the downside is that *Negative Prompts* are completely ignored (including the default ones set by the models). **NAG** (https://github.com/ChenDarYen/Normalized-Attention-Guidance) aims to solve that by injecting the *Negative Prompt* during the *attention* processing phase.

So WanGP 6.7 gives you NAG, but not just any NAG, a *Low VRAM* implementation, since the default one ends up being VRAM greedy. You will find NAG in the *General* advanced tab for most Wan models.

Use NAG especially when Guidance is set to 1. To turn it on, set the **NAG scale** to something around 10. There are other NAG parameters, **NAG tau** and **NAG alpha**, which I recommend changing only if you don't get good results by just playing with the NAG scale. Don't hesitate to share on this discord server the best combinations for these 3 parameters.

The authors of NAG claim that NAG can also be used with Guidance (CFG > 1) to improve prompt adherence.
### July 8 2025: WanGP v6.6, WanGP offers you **Vace Multitalk Dual Voices Fusionix Infinite** :
**Vace**, our beloved super Control Net, has been combined with **Multitalk**, the new king in town that can animate up to two people speaking (**Dual Voices**). It is accelerated by the **Fusionix** model, and thanks to *Sliding Windows* support and *Adaptive Projected Guidance* (much slower but should reduce the reddish effect with long videos) your two people will be able to talk for a very long time (which is an **Infinite** amount of time in the field of video generation).

Of course you will also get *Multitalk* vanilla, and *Multitalk 720p* as a bonus.

And since I am mister nice guy I have enclosed, as an exclusivity, an *Audio Separator* that will save you time isolating each voice when using Multitalk with two people.

As I feel like resting a bit, I haven't yet produced a nice sample Video to illustrate all these new capabilities. But here is the thing: I am sure you will publish your *Master Pieces* in the *Share Your Best Video* channel. The best ones will be added to the *Announcements Channel* and will bring eternal fame to their authors.

But wait, there is more:
- Sliding Windows support has been added everywhere with Wan models, so imagine: with text2video, recently upgraded in 6.5 into a video2video, you can now upsample very long videos regardless of your VRAM. The good old image2video model can now reuse the last image to produce new videos (as requested by many of you)
- I have also added the capability to transfer the audio of the original control video (Misc. advanced tab) and an option to preserve the fps in the generated video, so from now on you will be able to upsample / restore your old family videos and keep the audio at its original pace. Be aware that the duration will be limited to 1000 frames, as I still need to add streaming support for unlimited video sizes.

Also of interest:
- Extract video info from Videos that have not been generated by WanGP; even better, you can also apply post processing (Upsampling / MMAudio) to non WanGP videos
- Force the generated video fps to your liking, works very well with Vace when using a Control Video
- Ability to chain URLs of Finetune models (for instance, put the URLs of a model in your main finetune and reference this finetune in other finetune models to save time), as in the sketch after this list
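This chaining is exactly what the Vace Fun Cocktail definition added in this commit does: its `URLs` field is the string `"vace_fun_14B_2_2"` (the id of another finetune) rather than a list of download links. Below is a hedged Python sketch of how such a reference could be resolved; it mirrors the file layout of this commit and is not WanGP's actual loader code.

```python
import json

# Hedged sketch of how chained finetune URLs could be resolved: if "URLs" is a
# string, treat it as the id of another definition and reuse its URL list.
# File layout ("defaults/<id>.json", "model" block) follows this commit;
# the resolution logic itself is an assumption.
def resolve_urls(finetune_id: str, key: str = "URLs") -> list[str]:
    with open(f"defaults/{finetune_id}.json") as f:
        model = json.load(f)["model"]
    urls = model[key]
    if isinstance(urls, str):   # a string is a reference to another finetune id
        return resolve_urls(urls, key)
    return urls                 # a list is the actual set of download links

print(resolve_urls("vace_fun_14B_cocktail_2_2"))  # follows the chain to vace_fun_14B_2_2
```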
### July 2 2025: WanGP v6.5.1, WanGP takes care of you: lots of quality of life features:
- View directly inside WanGP the properties (seed, resolution, length, most settings...) of past generations
- In one click, use the newly generated video as a Control Video or a Source Video to be continued
- Manage multiple settings for the same model and switch between them using a dropdown box
- WanGP will keep the last generated videos in the Gallery and will remember the last model you used if you restart the app but keep the Web page open
- Custom resolutions: add a file in the WanGP folder with the list of resolutions you want to see in WanGP (look at the instruction readme in this folder)

Taking care of your life is not enough, you want new stuff to play with?
- MMAudio directly inside WanGP: add an audio soundtrack that matches the content of your video. By the way, it is a low VRAM MMAudio and 6 GB of VRAM should be sufficient. You will need to go to the *Extensions* tab of the WanGP *Configuration* to enable MMAudio
- Forgot to upsample your video during the generation? Want to try another MMAudio variation? Fear not, you can also apply upsampling or add an MMAudio track once the video generation is done. Even better, you can ask WanGP for multiple variations of MMAudio to pick the one you like best
- MagCache support: a new step skipping approach, supposed to be better than TeaCache. Makes a difference if you usually generate with a high number of steps
- SageAttention2++ support: not just the compatibility but also a slightly reduced VRAM usage
- Video2Video in Wan Text2Video: this is the paradox, a text2video model can become a video2video one if you start the denoising process later on an existing video (a schematic sketch of this idea follows after this list)
- FusioniX upsampler: this is an illustration of Video2Video in Text2Video. Use the FusioniX text2video model with an output resolution of 1080p and a denoising strength of 0.25 and you will get one of the best upsamplers (in only 2/3 steps, though you will need lots of VRAM). Increase the denoising strength and you will get one of the best Video Restorers
- Choice of Wan Samplers / Schedulers
- Support for more Lora formats
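A schematic numpy sketch of the Video2Video idea above: instead of starting the denoising from pure noise, the existing video's latent is mixed with noise according to the denoising strength and the sampler starts from that intermediate point. This is a generic flow-matching-style illustration, not WanGP's actual pipeline code, and all names in it are assumptions.

```python
import numpy as np

# Schematic illustration of "start the denoising later on an existing video".
# Generic flow-matching style mixing; not WanGP's actual sampler.
def video2video_start(video_latent: np.ndarray, strength: float, rng=np.random):
    """Return the starting latent and the remaining timesteps for a v2v run.

    strength=1.0  -> start from pure noise (plain text2video behaviour)
    strength=0.25 -> keep 75% of the source latent, denoise only the last 25%
    """
    noise = rng.standard_normal(video_latent.shape).astype(video_latent.dtype)
    t_start = strength                                  # fraction of the schedule left to run
    start_latent = (1.0 - t_start) * video_latent + t_start * noise
    timesteps = np.linspace(t_start, 0.0, num=max(1, int(10 * strength)))  # illustrative schedule
    return start_latent, timesteps

latent = np.zeros((16, 8, 64, 64), dtype=np.float32)     # placeholder encoded video latent
start, steps = video2video_start(latent, strength=0.25)  # upsampler-style light denoise
print(start.shape, len(steps))
```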
**If you upgraded to v6.5, please upgrade again to v6.5.1, as this fixes a bug that ignored all Loras beyond the first one**

See full changelog: **[Changelog](docs/CHANGELOG.md)**
@ -7,6 +7,7 @@
            "https://huggingface.co/DeepBeepMeep/Qwen_image/resolve/main/qwen_image_edit_20B_bf16.safetensors",
            "https://huggingface.co/DeepBeepMeep/Qwen_image/resolve/main/qwen_image_edit_20B_quanto_bf16_int8.safetensors"
        ],
        "preload_URLs": ["https://huggingface.co/DeepBeepMeep/Qwen_image/resolve/main/qwen_image_edit_inpainting.safetensors"],
        "attention": {
            "<89": "sdpa"
        }
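The `"attention"` map above pairs with the Sdpa fallback mentioned in the v7.73 notes: RTX 40xx cards are CUDA compute capability 8.9, so `"<89": "sdpa"` reads naturally as "below compute capability 8.9, fall back to sdpa attention". Here is a hedged Python sketch of how such a map could be resolved; it is an interpretation, not WanGP's actual code, and the default backend name is an assumption.

```python
import torch

# Hedged sketch: resolve an attention override map like {"<89": "sdpa"} against
# the local GPU. Interprets "<89" as "compute capability below 8.9"; this is an
# assumption based on the v7.73 notes, not WanGP's actual resolution logic.
def pick_attention(overrides: dict, default: str = "sage2") -> str:
    if not torch.cuda.is_available():
        return "sdpa"
    major, minor = torch.cuda.get_device_capability()
    capability = major * 10 + minor          # e.g. RTX 4090 -> 89, RTX 3090 -> 86
    for rule, backend in overrides.items():
        if rule.startswith("<") and capability < int(rule[1:]):
            return backend                   # e.g. pre-RTX40xx GPUs get "sdpa"
    return default

print(pick_attention({"<89": "sdpa"}))
```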
defaults/vace_fun_14B_2_2.json (new file, 24 lines)

@ -0,0 +1,24 @@
{
    "model": {
        "name": "Wan2.2 Vace Fun 14B",
        "architecture": "vace_14B",
        "description": "This is the Fun Vace 2.2 version, that is not the official Vace 2.2",
        "URLs": [
            "https://huggingface.co/DeepBeepMeep/Wan2.2/resolve/main/Wan2_2_Fun_VACE_A14B_HIGH_mbf16.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.2/resolve/main/Wan2_2_Fun_VACE_A14B_HIGH_quanto_mbf16_int8.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.2/resolve/main/Wan2_2_Fun_VACE_A14B_HIGH_quanto_mfp16_int8.safetensors"
        ],
        "URLs2": [
            "https://huggingface.co/DeepBeepMeep/Wan2.2/resolve/main/Wan2_2_Fun_VACE_A14B_LOW_mbf16.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.2/resolve/main/Wan2_2_Fun_VACE_A14B_LOW_quanto_mbf16_int8.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.2/resolve/main/Wan2_2_Fun_VACE_A14B_LOW_quanto_mfp16_int8.safetensors"
        ],
        "group": "wan2_2"
    },
    "guidance_phases": 2,
    "num_inference_steps": 30,
    "guidance_scale": 1,
    "guidance2_scale": 1,
    "flow_shift": 2,
    "switch_threshold": 875
}
defaults/vace_fun_14B_cocktail_2_2.json (new file, 28 lines)

@ -0,0 +1,28 @@
{
    "model": {
        "name": "Wan2.2 Vace Fun Cocktail 14B",
        "architecture": "vace_14B",
        "description": "This model has been created on the fly using the Wan text 2.2 video model and the Loras of FusioniX. The weight of the Detail Enhancer Lora has been reduced to improve identity preservation. This is the Fun Vace 2.2, that is not the official Vace 2.2",
        "URLs": "vace_fun_14B_2_2",
        "URLs2": "vace_fun_14B_2_2",
        "loras": [
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/loras_accelerators/Wan21_CausVid_14B_T2V_lora_rank32_v2.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/loras_accelerators/DetailEnhancerV1.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/loras_accelerators/Wan21_AccVid_T2V_14B_lora_rank32_fp16.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/loras_accelerators/Wan21_T2V_14B_MoviiGen_lora_rank32_fp16.safetensors"
        ],
        "loras_multipliers": [
            1,
            0.2,
            0.5,
            0.5
        ],
        "group": "wan2_2"
    },
    "guidance_phases": 2,
    "num_inference_steps": 10,
    "guidance_scale": 1,
    "guidance2_scale": 1,
    "flow_shift": 2,
    "switch_threshold": 875
}
@ -1,20 +1,154 @@
 | 
				
			|||||||
# Changelog
 | 
					# Changelog
 | 
				
			||||||
 | 
					
 | 
				
			||||||
## 🔥 Latest News
 | 
					## 🔥 Latest News
 | 
				
			||||||
### July 21 2025: WanGP v7.1 
 | 
					### August 29 2025: WanGP v8.21 -  Here Goes Your Weekend
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					- **InfiniteTalk Video to Video**: this feature can be used for Video Dubbing. Keep in mind that it is a *Sparse Video to Video*, that is internally only image is used by Sliding Window. However thanks to the new *Smooth Transition* mode, each new clip is connected to the previous and all the camera work is done by InfiniteTalk. If you dont get any transition, increase the number of frames of a Sliding Window (81 frames recommended)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					- **StandIn**: very light model specialized in Identity Transfer. I have provided two versions of Standin: a basic one derived from the text 2 video model and another based on Vace. If used with Vace, the last reference frame given to Vace will be also used for StandIn
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					- **Flux ESO**: a new Flux dervied *Image Editing tool*, but this one is specialized both in *Identity Transfer* and *Style Transfer*. Style has to be understood in its wide meaning: give a reference picture of a person and another one of Sushis and you will turn this person into Sushis
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### August 24 2025: WanGP v8.1 -  the RAM Liberator
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					- **Reserved RAM entirely freed when switching models**, you should get much less out of memory related to RAM. I have also added a button in *Configuration / Performance* that will release most of the RAM used by WanGP if you want to use another application without quitting WanGP 
 | 
				
			||||||
 | 
					- **InfiniteTalk** support: improved version of Multitalk that supposedly supports very long video generations based on an audio track. Exists in two flavors (*Single Speaker* and *Multi Speakers*) but doesnt seem to be compatible with Vace. One key new feature compared to Multitalk is that you can have different visual shots associated to the same audio: each Reference frame you provide you will be associated to a new Sliding Window. If only Reference frame is provided, it will be used for all windows. When Continuing a video, you can either continue the current shot (no Reference Frame) or add new shots (one or more Reference Frames).\
 | 
				
			||||||
 | 
					If you are not into audio, you can use still this model to generate infinite long image2video, just select "no speaker". Last but not least, Infinitetalk works works with all the Loras accelerators.
 | 
				
			||||||
 | 
					- **Flux Chroma 1 HD** support: uncensored flux based model and lighter than Flux (8.9B versus 12B) and can fit entirely in VRAM with only 16 GB of VRAM. Unfortunalely it is not distilled and you will need CFG at minimum 20 steps
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### August 21 2025: WanGP v8.01 - the killer of seven
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					- **Qwen Image Edit** : Flux Kontext challenger (prompt driven image edition). Best results (including Identity preservation) will be obtained at 720p. Beyond you may get image outpainting and / or lose identity preservation. Below 720p prompt adherence will be worse. Qwen Image Edit works with Qwen Lora Lightning 4 steps. I have also unlocked all the resolutions for Qwen models. Bonus Zone: support for multiple image compositions but identity preservation won't be as good.
 | 
				
			||||||
 | 
					- **On demand Prompt Enhancer** (needs to be enabled in Configuration Tab) that you can use to Enhance a Text Prompt before starting a Generation. You can refine the Enhanced Prompt or change the original Prompt.
 | 
				
			||||||
 | 
					- Choice of a **Non censored Prompt Enhancer**. Beware this is one is VRAM hungry and will require 12 GB of VRAM to work
 | 
				
			||||||
 | 
					- **Memory Profile customizable per model** : useful to set for instance Profile 3 (preload the model entirely in VRAM) with only Image Generation models, if you have 24 GB of VRAM. In that case Generation will be much faster because with Image generators (contrary to Video generators) as a lot of time is wasted in offloading 
 | 
				
			||||||
 | 
					- **Expert Guidance Mode**: change the Guidance during the generation up to 2 times. Very useful with Wan 2.2 Ligthning to reduce the slow motion effect. The idea is to insert a CFG phase before the 2 accelerated phases that follow and have no Guidance. I have added the finetune *Wan2.2 Vace Lightning 3 Phases 14B* with a prebuilt configuration. Please note that it is a 8 steps process although the lora lightning is 4 steps. This expert guidance mode is also available with Wan 2.1.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					*WanGP 8.01 update, improved Qwen Image Edit Identity Preservation*
 | 
				
			||||||
 | 
					### August 12 2025: WanGP v7.7777 - Lucky Day(s)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					This is your lucky day ! thanks to new configuration options that will let you store generated Videos and Images in lossless compressed formats, you will find they in fact they look two times better without doing anything !
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Just kidding, they will be only marginally better, but at least this opens the way to professionnal editing.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Support:
 | 
				
			||||||
 | 
					- Video: x264, x264 lossless, x265
 | 
				
			||||||
 | 
					- Images: jpeg, png, webp, wbp lossless
 | 
				
			||||||
 | 
					Generation Settings are stored in each of the above regardless of the format (that was the hard part).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Also you can now choose different output directories for images and videos.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					unexpected luck: fixed lightning 8 steps for Qwen, and lightning 4 steps for Wan 2.2, now you just need 1x multiplier no weird numbers. 
 | 
				
			||||||
 | 
					*update 7.777 : oops got a crash a with FastWan ? Luck comes and goes, try a new update, maybe you will have a better chance this time*
 | 
				
			||||||
 | 
					*update 7.7777 : Sometime good luck seems to last forever. For instance what if Qwen Lightning 4 steps could also work with WanGP ?*
 | 
				
			||||||
 | 
					- https://huggingface.co/lightx2v/Qwen-Image-Lightning/resolve/main/Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors (Qwen Lightning 4 steps)
 | 
				
			||||||
 | 
					- https://huggingface.co/lightx2v/Qwen-Image-Lightning/resolve/main/Qwen-Image-Lightning-8steps-V1.1-bf16.safetensors (new improved version of Qwen Lightning 8 steps)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### August 10 2025: WanGP v7.76 - Faster than the VAE ...
 | 
				
			||||||
 | 
					We have a funny one here today: FastWan 2.2 5B, the Fastest Video Generator, only 20s to generate 121 frames at 720p. The snag is that VAE is twice as slow... 
 | 
				
			||||||
 | 
					Thanks to Kijai for extracting the Lora that is used to build the corresponding finetune.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					*WanGP 7.76: fixed the messed up I did to i2v models (loras path was wrong for Wan2.2 and Clip broken)*
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### August 9 2025: WanGP v7.74 - Qwen Rebirth part 2
 | 
				
			||||||
 | 
					Added support for Qwen Lightning lora for a 8 steps generation (https://huggingface.co/lightx2v/Qwen-Image-Lightning/blob/main/Qwen-Image-Lightning-8steps-V1.0.safetensors). Lora is not normalized and you can use a multiplier around 0.1.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Mag Cache support for all the Wan2.2 models Don't forget to set guidance to 1 and 8 denoising steps , your gen will be 7x faster !
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### August 8 2025: WanGP v7.73 - Qwen Rebirth
 | 
				
			||||||
 | 
					Ever wondered what impact not using Guidance has on a model that expects it ? Just look at Qween Image in WanGP 7.71 whose outputs were erratic. Somehow I had convinced myself that Qwen was a distilled model. In fact Qwen was dying for a negative prompt. And in WanGP 7.72 there is at last one for him.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					As Qwen is not so picky after all I have added also quantized text encoder which reduces the RAM requirements of Qwen by 10 GB (the text encoder quantized version produced garbage before) 
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Unfortunately still the Sage bug for older GPU architectures. Added Sdpa fallback for these architectures.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					*7.73 update: still Sage / Sage2 bug for GPUs before RTX40xx. I have added a detection mechanism that forces Sdpa attention if that's the case*
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### August 6 2025: WanGP v7.71 - Picky, picky
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					This release comes with two new models :
 | 
				
			||||||
 | 
					- Qwen Image: a Commercial grade Image generator capable to inject full sentences in the generated Image while still offering incredible visuals
 | 
				
			||||||
 | 
					- Wan 2.2 TextImage to Video 5B: the last Wan 2.2 needed if you want to complete your Wan 2.2 collection (loras for this folder can be stored in "\loras\5B"     )
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					There is catch though, they are very picky if you want to get good generations: first they both need lots of steps (50 ?) to show what they have to offer. Then for Qwen Image I had to hardcode the supported resolutions, because if you try anything else, you will get garbage. Likewise Wan 2.2 5B will remind you of Wan 1.0 if you don't ask for at least  720p. 
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					*7.71 update: Added VAE Tiling for both Qwen Image and Wan 2.2 TextImage to Video 5B, for low VRAM during a whole gen.*
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### August 4 2025: WanGP v7.6 - Remuxed
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					With this new version you won't have any excuse if there is no sound in your video.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					*Continue Video* now works with any video that has already some sound (hint: Multitalk ).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Also, on top of MMaudio and the various sound driven models I have added the ability to use your own soundtrack.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					As a result you can apply a different sound source on each new video segment when doing a *Continue Video*. 
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					For instance:
 | 
				
			||||||
 | 
					- first video part: use Multitalk with two people speaking
 | 
				
			||||||
 | 
					- second video part: you apply your own soundtrack which will gently follow the multitalk conversation
 | 
				
			||||||
 | 
					- third video part: you use Vace effect and its corresponding control audio will be concatenated to the rest of the audio
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					To multiply the combinations I have also implemented *Continue Video* with the various image2video models.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Also:
 | 
				
			||||||
 | 
					- End Frame support added for LTX Video models
 | 
				
			||||||
 | 
					- Loras can now be targetted specifically at the High noise or Low noise models with Wan 2.2, check the Loras and Finetune guides
 | 
				
			||||||
 | 
					- Flux Krea Dev support
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### July 30 2025: WanGP v7.5:  Just another release ... Wan 2.2 part 2
 | 
				
			||||||
 | 
					Here is now Wan 2.2 image2video a very good model if you want to set Start and End frames. Two Wan 2.2 models delivered, only one to go ...
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Please note that although it is an image2video model it is structurally very close to Wan 2.2 text2video (same layers with only a different initial projection). Given that Wan 2.1 image2video loras don't work too well (half of their tensors are not supported), I have decided that this model will look for its loras in the text2video loras folder instead of the image2video folder.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					I have also optimized RAM management with Wan 2.2 so that loras and modules will be loaded only once in RAM and Reserved RAM, this saves up to 5 GB of RAM which can make a difference...
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					And this time I really removed Vace Cocktail Light which gave a blurry vision.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### July 29 2025: WanGP v7.4:  Just another release ... Wan 2.2 Preview
 | 
				
			||||||
 | 
					Wan 2.2 is here.  The good news is that WanGP wont require a single byte of extra VRAM to run it and it will be as fast as Wan 2.1. The bad news is that you will need much more RAM if you want to leverage entirely this new model since it has twice has many parameters.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					So here is a preview version of Wan 2.2 that is without the 5B model and Wan 2.2 image to video for the moment.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					However as I felt bad to deliver only half of the wares, I gave you instead .....** Wan 2.2 Vace Experimental Cocktail** !
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Very good surprise indeed, the loras and Vace partially work with Wan 2.2. We will need to wait for the official Vace 2.2 release since some Vace features are broken like identity preservation
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Bonus zone: Flux multi images conditions has been added, or maybe not if I broke everything as I have been distracted by Wan...
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					7.4 update: I forgot to update the version number. I also removed Vace Cocktail light which didnt work well.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### July 27 2025: WanGP v7.3 : Interlude
 | 
				
			||||||
 | 
					While waiting for Wan 2.2, you will appreciate the model selection hierarchy which is very useful to collect even more models. You will also appreciate that WanGP remembers which model you used last in each model family.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					### July 26 2025: WanGP v7.2 : Ode to Vace
 | 
				
			||||||
 | 
					I am really convinced that Vace can do everything the other models can do and in a better way especially as Vace can be combined with Multitalk.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Here are some new Vace improvements:
 | 
				
			||||||
 | 
					- I have provided a default finetune named *Vace Cocktail*  which is a model created on the fly using the Wan text 2 video model and the Loras used to build FusioniX. The weight of the *Detail Enhancer* Lora has been reduced to improve identity preservation. Copy the model definition in *defaults/vace_14B_cocktail.json* in the *finetunes/* folder to change the Cocktail composition. Cocktail contains already some Loras acccelerators so no need to add on top a Lora Accvid, Causvid or Fusionix, ... . The whole point of Cocktail is to be able  to build you own FusioniX (which originally is a combination of 4 loras) but without the inconvenient of FusioniX.
 | 
				
			||||||
 | 
					- Talking about identity preservation, it tends to go away when one generates a single Frame instead of a Video which is shame for our Vace photoshop. But there is a solution : I have added an Advanced Quality option, that tells WanGP to generate a little more than a frame (it will still keep only the first frame). It will be a little slower but you will be amazed how Vace Cocktail combined with this option will preserve identities (bye bye *Phantom*). 
 | 
				
			||||||
 | 
					- As in practise I have observed one switches frequently between *Vace text2video* and *Vace text2image* I have put them in the same place they are now just one tab away, no need to reload the model. Likewise *Wan text2video* and *Wan tex2image* have been merged.
 | 
				
			||||||
 | 
					- Color fixing when using Sliding Windows. A new postprocessing *Color Correction* applied automatically by default (you can disable it in the *Advanced tab Sliding Window*) will try to match the colors of the new window with that of the previous window. It doesnt fix all the unwanted artifacts of the new window but at least this makes the transition smoother. Thanks to the multitalk team for the original code.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
You will also enjoy our new real-time statistics (CPU / GPU usage, RAM / VRAM used, ...). Many thanks to **Redtash1** for providing the framework for this new feature! You need to go to the Config tab to enable real-time stats.
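For reference, a rough sketch of how such numbers can be polled (this is not the actual implementation contributed by Redtash1, just an illustration built on psutil and torch):

```python
import psutil
import torch

def snapshot_stats() -> dict:
    """Return a one-shot reading of CPU, RAM and (if available) VRAM usage."""
    stats = {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "ram_used_gb": psutil.virtual_memory().used / 1024**3,
    }
    if torch.cuda.is_available():
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        stats["vram_used_gb"] = (total_bytes - free_bytes) / 1024**3
    return stats
```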
### July 21 2025: WanGP v7.12

- Flux Family Reunion: *Flux Dev* and *Flux Schnell* have been invited aboard WanGP. To celebrate, Lora support for the Flux *diffusers* format has also been added.
- LTX Video upgraded to version 0.9.8: you can now generate 1800 frames (1 minute of video!) in one go without a sliding window. With the distilled model it takes only 5 minutes on an RTX 4090 (you will need 22 GB of VRAM though). I have added options to select a higher number of frames if you want to experiment (go to Configuration Tab / General / Increase the Max Number of Frames, change the value and restart the App).
- LTX Video ControlNet: a ControlNet that lets you, for instance, transfer human motion or depth from a control video. It is not as powerful as Vace, but it can produce interesting things, especially now that you can quickly generate a 1-minute video. Under the hood, the IC-Loras (see below) for Pose, Depth and Canny are loaded automatically for you, no need to add them.
- LTX IC-Lora support: these are special Loras that consume a conditional image or video.

Besides the pose, depth and canny IC-Loras that are loaded transparently, there is the *detailer* (https://huggingface.co/Lightricks/LTX-Video-ICLoRA-detailer-13b-0.9.8), which is basically an upsampler. Add the *detailer* as a Lora and choose LTX Raw Format as the ControlNet option to use it.
- Matanyone is now also for the GPU Poor, as its VRAM requirements have been divided by 2! (7.12 shadow update)

- Easier way to select video resolution
### July 15 2025: WanGP v7.0 is an AI Powered Photoshop

This release turns the Wan models into Image Generators. It goes way beyond simply allowing you to generate a video made of a single frame:
			|||||||
@ -107,7 +107,7 @@ class family_handler():
 | 
				
			|||||||
        ]
 | 
					        ]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    @staticmethod
 | 
					    @staticmethod
 | 
				
			||||||
    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized = False):
 | 
					    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized = False, submodel_no_list = None):
 | 
				
			||||||
        from .flux_main  import model_factory
 | 
					        from .flux_main  import model_factory
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        flux_model = model_factory(
 | 
					        flux_model = model_factory(
 | 
				
			||||||
 | 
				
			|||||||
@ -203,7 +203,7 @@ def prepare_kontext(
 | 
				
			|||||||
        image_mask_latents = convert_image_to_tensor(img_mask.resize((target_width // 16, target_height // 16), resample=Image.Resampling.LANCZOS))
 | 
					        image_mask_latents = convert_image_to_tensor(img_mask.resize((target_width // 16, target_height // 16), resample=Image.Resampling.LANCZOS))
 | 
				
			||||||
        image_mask_latents = torch.where(image_mask_latents>-0.5, 1., 0. )[0:1]
 | 
					        image_mask_latents = torch.where(image_mask_latents>-0.5, 1., 0. )[0:1]
 | 
				
			||||||
        image_mask_rebuilt = image_mask_latents.repeat_interleave(16, dim=-1).repeat_interleave(16, dim=-2).unsqueeze(0)
 | 
					        image_mask_rebuilt = image_mask_latents.repeat_interleave(16, dim=-1).repeat_interleave(16, dim=-2).unsqueeze(0)
 | 
				
			||||||
        convert_tensor_to_image( image_mask_rebuilt.squeeze(0).repeat(3,1,1)).save("mmm.png")
 | 
					        # convert_tensor_to_image( image_mask_rebuilt.squeeze(0).repeat(3,1,1)).save("mmm.png")
 | 
				
			||||||
        image_mask_latents = image_mask_latents.reshape(1, -1, 1).to(device)        
 | 
					        image_mask_latents = image_mask_latents.reshape(1, -1, 1).to(device)        
 | 
				
			||||||
        return_dict.update({
 | 
					        return_dict.update({
 | 
				
			||||||
            "img_msk_latents": image_mask_latents,
 | 
					            "img_msk_latents": image_mask_latents,
 | 
				
			||||||
 | 
				
			|||||||
@ -68,7 +68,13 @@ class family_handler():
 | 
				
			|||||||
                "visible": False,
 | 
					                "visible": False,
 | 
				
			||||||
            }
 | 
					            }
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        if base_model_type in ["hunyuan_avatar"]: extra_model_def["no_background_removal"] = True
 | 
					        if base_model_type in ["hunyuan_avatar"]: 
 | 
				
			||||||
 | 
					            extra_model_def["image_ref_choices"] = {
 | 
				
			||||||
 | 
					                "choices": [("Start Image", "KI")],
 | 
				
			||||||
 | 
					                "letters_filter":"KI",
 | 
				
			||||||
 | 
					                "visible": False,
 | 
				
			||||||
 | 
					            }
 | 
				
			||||||
 | 
					            extra_model_def["no_background_removal"] = True
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        if base_model_type in ["hunyuan_custom", "hunyuan_custom_edit", "hunyuan_custom_audio", "hunyuan_avatar"]:
 | 
					        if base_model_type in ["hunyuan_custom", "hunyuan_custom_edit", "hunyuan_custom_audio", "hunyuan_avatar"]:
 | 
				
			||||||
            extra_model_def["one_image_ref_needed"] = True
 | 
					            extra_model_def["one_image_ref_needed"] = True
 | 
				
			||||||
@ -123,7 +129,7 @@ class family_handler():
 | 
				
			|||||||
        } 
 | 
					        } 
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    @staticmethod
 | 
					    @staticmethod
 | 
				
			||||||
    def load_model(model_filename, model_type = None,  base_model_type = None, model_def = None, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized = False):
 | 
					    def load_model(model_filename, model_type = None,  base_model_type = None, model_def = None, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized = False, submodel_no_list = None):
 | 
				
			||||||
        from .hunyuan import HunyuanVideoSampler
 | 
					        from .hunyuan import HunyuanVideoSampler
 | 
				
			||||||
        from mmgp import offload
 | 
					        from mmgp import offload
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
				
			|||||||
@ -476,14 +476,14 @@ class LTXV:
 | 
				
			|||||||
        images = images.sub_(0.5).mul_(2).squeeze(0)
 | 
					        images = images.sub_(0.5).mul_(2).squeeze(0)
 | 
				
			||||||
        return images
 | 
					        return images
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    def get_loras_transformer(self, get_model_recursive_prop, video_prompt_type, **kwargs):
 | 
					    def get_loras_transformer(self, get_model_recursive_prop, model_type, video_prompt_type, **kwargs):
 | 
				
			||||||
        map = {
 | 
					        map = {
 | 
				
			||||||
            "P" : "pose",
 | 
					            "P" : "pose",
 | 
				
			||||||
            "D" : "depth",
 | 
					            "D" : "depth",
 | 
				
			||||||
            "E" : "canny",
 | 
					            "E" : "canny",
 | 
				
			||||||
        }
 | 
					        }
 | 
				
			||||||
        loras = []
 | 
					        loras = []
 | 
				
			||||||
        preloadURLs = get_model_recursive_prop(self.model_type,  "preload_URLs")
 | 
					        preloadURLs = get_model_recursive_prop(model_type,  "preload_URLs")
 | 
				
			||||||
        lora_file_name = ""
 | 
					        lora_file_name = ""
 | 
				
			||||||
        for letter, signature in map.items():
 | 
					        for letter, signature in map.items():
 | 
				
			||||||
            if letter in video_prompt_type:
 | 
					            if letter in video_prompt_type:
 | 
				
			||||||
 | 
				
			|||||||
@ -74,7 +74,7 @@ class family_handler():
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    @staticmethod
 | 
					    @staticmethod
 | 
				
			||||||
    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized = False):
 | 
					    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized = False, submodel_no_list = None):
 | 
				
			||||||
        from .ltxv import LTXV
 | 
					        from .ltxv import LTXV
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        ltxv_model = LTXV(
 | 
					        ltxv_model = LTXV(
 | 
				
			||||||
 | 
				
			|||||||
@ -569,6 +569,8 @@ class QwenImagePipeline(): #DiffusionPipeline
 | 
				
			|||||||
        pipeline=None,
 | 
					        pipeline=None,
 | 
				
			||||||
        loras_slists=None,
 | 
					        loras_slists=None,
 | 
				
			||||||
        joint_pass= True,
 | 
					        joint_pass= True,
 | 
				
			||||||
 | 
					        lora_inpaint = False,
 | 
				
			||||||
 | 
					        outpainting_dims = None,
 | 
				
			||||||
    ):
 | 
					    ):
 | 
				
			||||||
        r"""
 | 
					        r"""
 | 
				
			||||||
        Function invoked when calling the pipeline for generation.
 | 
					        Function invoked when calling the pipeline for generation.
 | 
				
			||||||
@ -704,7 +706,7 @@ class QwenImagePipeline(): #DiffusionPipeline
 | 
				
			|||||||
                    image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
 | 
					                    image_height, image_width = calculate_new_dimensions(ref_height, ref_width, image_height, image_width, False, block_size=multiple_of)
 | 
				
			||||||
                if (image_width,image_height) != image.size:
 | 
					                if (image_width,image_height) != image.size:
 | 
				
			||||||
                    image = image.resize((image_width,image_height), resample=Image.Resampling.LANCZOS) 
 | 
					                    image = image.resize((image_width,image_height), resample=Image.Resampling.LANCZOS) 
 | 
				
			||||||
            else:
 | 
					            elif not lora_inpaint:
 | 
				
			||||||
                # _, image_width, image_height = min(
 | 
					                # _, image_width, image_height = min(
 | 
				
			||||||
                #     (abs(aspect_ratio - w / h), w, h) for w, h in PREFERRED_QWENIMAGE_RESOLUTIONS
 | 
					                #     (abs(aspect_ratio - w / h), w, h) for w, h in PREFERRED_QWENIMAGE_RESOLUTIONS
 | 
				
			||||||
                # )
 | 
					                # )
 | 
				
			||||||
@ -721,8 +723,16 @@ class QwenImagePipeline(): #DiffusionPipeline
 | 
				
			|||||||
            if image.size != (image_width, image_height):
 | 
					            if image.size != (image_width, image_height):
 | 
				
			||||||
                image = image.resize((image_width, image_height), resample=Image.Resampling.LANCZOS)
 | 
					                image = image.resize((image_width, image_height), resample=Image.Resampling.LANCZOS)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            image = convert_image_to_tensor(image)
 | 
				
			||||||
 | 
					            if lora_inpaint:
 | 
				
			||||||
 | 
					                image_mask_rebuilt = torch.where(convert_image_to_tensor(image_mask)>-0.5, 1., 0. )[0:1]
 | 
				
			||||||
 | 
					                image_mask_latents = None
 | 
				
			||||||
 | 
					                green = torch.tensor([-1.0, 1.0, -1.0]).to(image) 
 | 
				
			||||||
 | 
					                green_image = green[:, None, None] .expand_as(image)
 | 
				
			||||||
 | 
					                image = torch.where(image_mask_rebuilt > 0, green_image, image)
 | 
				
			||||||
 | 
					                prompt_image = convert_tensor_to_image(image)
 | 
				
			||||||
 | 
					            image = image.unsqueeze(0).unsqueeze(2)
 | 
				
			||||||
            # image.save("nnn.png")
 | 
					            # image.save("nnn.png")
 | 
				
			||||||
            image = convert_image_to_tensor(image).unsqueeze(0).unsqueeze(2)
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
        has_neg_prompt = negative_prompt is not None or (
 | 
					        has_neg_prompt = negative_prompt is not None or (
 | 
				
			||||||
            negative_prompt_embeds is not None and negative_prompt_embeds_mask is not None
 | 
					            negative_prompt_embeds is not None and negative_prompt_embeds_mask is not None
 | 
				
			||||||
@ -940,7 +950,7 @@ class QwenImagePipeline(): #DiffusionPipeline
 | 
				
			|||||||
            )
 | 
					            )
 | 
				
			||||||
            latents = latents / latents_std + latents_mean
 | 
					            latents = latents / latents_std + latents_mean
 | 
				
			||||||
            output_image = self.vae.decode(latents, return_dict=False)[0][:, :, 0]
 | 
					            output_image = self.vae.decode(latents, return_dict=False)[0][:, :, 0]
 | 
				
			||||||
            if image_mask is not None:
 | 
					            if image_mask is not None and not lora_inpaint :  #not (lora_inpaint and outpainting_dims is not None):
 | 
				
			||||||
                output_image = image.squeeze(2) * (1 - image_mask_rebuilt) + output_image.to(image) * image_mask_rebuilt 
 | 
					                output_image = image.squeeze(2) * (1 - image_mask_rebuilt) + output_image.to(image) * image_mask_rebuilt 
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
				
			|||||||
@ -1,4 +1,6 @@
 | 
				
			|||||||
import torch
 | 
					import torch
 | 
				
			||||||
 | 
					import gradio as gr
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
def get_qwen_text_encoder_filename(text_encoder_quantization):
 | 
					def get_qwen_text_encoder_filename(text_encoder_quantization):
 | 
				
			||||||
    text_encoder_filename = "ckpts/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct_bf16.safetensors"
 | 
					    text_encoder_filename = "ckpts/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct_bf16.safetensors"
 | 
				
			||||||
@ -29,6 +31,16 @@ class family_handler():
 | 
				
			|||||||
            "letters_filter": "KI",
 | 
					            "letters_filter": "KI",
 | 
				
			||||||
            }
 | 
					            }
 | 
				
			||||||
            extra_model_def["background_removal_label"]= "Remove Backgrounds only behind People / Objects except main Subject / Landscape" 
 | 
					            extra_model_def["background_removal_label"]= "Remove Backgrounds only behind People / Objects except main Subject / Landscape" 
 | 
				
			||||||
 | 
					            extra_model_def["video_guide_outpainting"] = [2]
 | 
				
			||||||
 | 
					            extra_model_def["model_modes"] = {
 | 
				
			||||||
 | 
					                        "choices": [
 | 
				
			||||||
 | 
					                            ("Lora Inpainting: Inpainted area completely unrelated to occulted content", 1),
 | 
				
			||||||
 | 
					                            ("Masked Denoising : Inpainted area may reuse some content that has been occulted", 0),
 | 
				
			||||||
 | 
					                            ],
 | 
				
			||||||
 | 
					                        "default": 1,
 | 
				
			||||||
 | 
					                        "label" : "Inpainting Method",
 | 
				
			||||||
 | 
					                        "image_modes" : [2],
 | 
				
			||||||
 | 
					            }
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        return extra_model_def
 | 
					        return extra_model_def
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@ -58,7 +70,7 @@ class family_handler():
 | 
				
			|||||||
            }
 | 
					            }
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    @staticmethod
 | 
					    @staticmethod
 | 
				
			||||||
    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized = False):
 | 
					    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized = False, submodel_no_list = None):
 | 
				
			||||||
        from .qwen_main import model_factory
 | 
					        from .qwen_main import model_factory
 | 
				
			||||||
        from mmgp import offload
 | 
					        from mmgp import offload
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@ -99,5 +111,18 @@ class family_handler():
 | 
				
			|||||||
            ui_defaults.update({
 | 
					            ui_defaults.update({
 | 
				
			||||||
                "video_prompt_type": "KI",
 | 
					                "video_prompt_type": "KI",
 | 
				
			||||||
                "denoising_strength" : 1.,
 | 
					                "denoising_strength" : 1.,
 | 
				
			||||||
 | 
					                "model_mode" : 0,
 | 
				
			||||||
            })
 | 
					            })
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    def validate_generative_settings(base_model_type, model_def, inputs):
 | 
				
			||||||
 | 
					        if base_model_type in ["qwen_image_edit_20B"]:
 | 
				
			||||||
 | 
					            model_mode = inputs["model_mode"]
 | 
				
			||||||
 | 
					            denoising_strength= inputs["denoising_strength"]
 | 
				
			||||||
 | 
					            video_guide_outpainting= inputs["video_guide_outpainting"]
 | 
				
			||||||
 | 
					            from wgp import get_outpainting_dims
 | 
				
			||||||
 | 
					            outpainting_dims = get_outpainting_dims(video_guide_outpainting)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            if denoising_strength < 1 and model_mode == 1:
 | 
				
			||||||
 | 
					                gr.Info("Denoising Strength will be ignored while using Lora Inpainting")
 | 
				
			||||||
 | 
					            if outpainting_dims is not None and model_mode == 0 :
 | 
				
			||||||
 | 
					                return "Outpainting is not supported with Masked Denoising  "
 | 
				
			||||||
 | 
				
			|||||||
@ -44,7 +44,7 @@ class model_factory():
 | 
				
			|||||||
        save_quantized = False,
 | 
					        save_quantized = False,
 | 
				
			||||||
        dtype = torch.bfloat16,
 | 
					        dtype = torch.bfloat16,
 | 
				
			||||||
        VAE_dtype = torch.float32,
 | 
					        VAE_dtype = torch.float32,
 | 
				
			||||||
        mixed_precision_transformer = False
 | 
					        mixed_precision_transformer = False,
 | 
				
			||||||
    ):
 | 
					    ):
 | 
				
			||||||
    
 | 
					    
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@ -117,6 +117,8 @@ class model_factory():
 | 
				
			|||||||
        joint_pass = True,
 | 
					        joint_pass = True,
 | 
				
			||||||
        sample_solver='default',
 | 
					        sample_solver='default',
 | 
				
			||||||
        denoising_strength = 1.,
 | 
					        denoising_strength = 1.,
 | 
				
			||||||
 | 
					        model_mode = 0,
 | 
				
			||||||
 | 
					        outpainting_dims = None,
 | 
				
			||||||
        **bbargs
 | 
					        **bbargs
 | 
				
			||||||
    ):
 | 
					    ):
 | 
				
			||||||
        # Generate with different aspect ratios
 | 
					        # Generate with different aspect ratios
 | 
				
			||||||
@ -205,8 +207,16 @@ class model_factory():
 | 
				
			|||||||
            loras_slists=loras_slists,
 | 
					            loras_slists=loras_slists,
 | 
				
			||||||
            joint_pass = joint_pass,
 | 
					            joint_pass = joint_pass,
 | 
				
			||||||
            denoising_strength=denoising_strength,
 | 
					            denoising_strength=denoising_strength,
 | 
				
			||||||
            generator=torch.Generator(device="cuda").manual_seed(seed)
 | 
					            generator=torch.Generator(device="cuda").manual_seed(seed),
 | 
				
			||||||
 | 
					            lora_inpaint = image_mask is not None and model_mode == 1,
 | 
				
			||||||
 | 
					            outpainting_dims = outpainting_dims,
 | 
				
			||||||
        )        
 | 
					        )        
 | 
				
			||||||
        if image is None: return None
 | 
					        if image is None: return None
 | 
				
			||||||
        return image.transpose(0, 1)
 | 
					        return image.transpose(0, 1)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    def get_loras_transformer(self, get_model_recursive_prop, model_type, model_mode, **kwargs):
 | 
				
			||||||
 | 
					        if model_mode == 0: return [], []
 | 
				
			||||||
 | 
					        preloadURLs = get_model_recursive_prop(model_type,  "preload_URLs")
 | 
				
			||||||
 | 
					        return [os.path.join("ckpts", os.path.basename(preloadURLs[0]))] , [1]
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
				
			|||||||
@ -64,6 +64,7 @@ class WanAny2V:
 | 
				
			|||||||
        config,
 | 
					        config,
 | 
				
			||||||
        checkpoint_dir,
 | 
					        checkpoint_dir,
 | 
				
			||||||
        model_filename = None,
 | 
					        model_filename = None,
 | 
				
			||||||
 | 
					        submodel_no_list = None,
 | 
				
			||||||
        model_type = None, 
 | 
					        model_type = None, 
 | 
				
			||||||
        model_def = None,
 | 
					        model_def = None,
 | 
				
			||||||
        base_model_type = None,
 | 
					        base_model_type = None,
 | 
				
			||||||
@ -126,50 +127,65 @@ class WanAny2V:
 | 
				
			|||||||
        forcedConfigPath = base_config_file if len(model_filename) > 1 else None
 | 
					        forcedConfigPath = base_config_file if len(model_filename) > 1 else None
 | 
				
			||||||
        # forcedConfigPath = base_config_file = f"configs/flf2v_720p.json"
 | 
					        # forcedConfigPath = base_config_file = f"configs/flf2v_720p.json"
 | 
				
			||||||
        # model_filename[1] = xmodel_filename
 | 
					        # model_filename[1] = xmodel_filename
 | 
				
			||||||
 | 
					        self.model = self.model2 = None
 | 
				
			||||||
        source =  model_def.get("source", None)
 | 
					        source =  model_def.get("source", None)
 | 
				
			||||||
 | 
					        source2 = model_def.get("source2", None)
 | 
				
			||||||
        module_source =  model_def.get("module_source", None)
 | 
					        module_source =  model_def.get("module_source", None)
 | 
				
			||||||
 | 
					        module_source2 =  model_def.get("module_source2", None)
 | 
				
			||||||
        if module_source is not None:
 | 
					        if module_source is not None:
 | 
				
			||||||
            model_filename = [] + model_filename
 | 
					            self.model = offload.fast_load_transformers_model(model_filename[:1] + [module_source], modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath)
 | 
				
			||||||
            model_filename[1] = module_source
 | 
					        if module_source2 is not None:
 | 
				
			||||||
            self.model = offload.fast_load_transformers_model(model_filename, modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath)
 | 
					            self.model2 = offload.fast_load_transformers_model(model_filename[1:2] + [module_source2], modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath)
 | 
				
			||||||
        elif source is not None:
 | 
					        if source is not None:
 | 
				
			||||||
            self.model = offload.fast_load_transformers_model(source, modelClass=WanModel, writable_tensors= False, forcedConfigPath= base_config_file)
 | 
					            self.model = offload.fast_load_transformers_model(source, modelClass=WanModel, writable_tensors= False, forcedConfigPath= base_config_file)
 | 
				
			||||||
        elif self.transformer_switch:
 | 
					        if source2 is not None:
 | 
				
			||||||
            shared_modules= {}
 | 
					            self.model2 = offload.fast_load_transformers_model(source2, modelClass=WanModel, writable_tensors= False, forcedConfigPath= base_config_file)
 | 
				
			||||||
            self.model = offload.fast_load_transformers_model(model_filename[:1], modules = model_filename[2:], modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath,  return_shared_modules= shared_modules)
 | 
					 | 
				
			||||||
            self.model2 = offload.fast_load_transformers_model(model_filename[1:2], modules = shared_modules, modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath)
 | 
					 | 
				
			||||||
            shared_modules = None
 | 
					 | 
				
			||||||
        else:
 | 
					 | 
				
			||||||
            self.model = offload.fast_load_transformers_model(model_filename, modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath)
 | 
					 | 
				
			||||||
        
 | 
					 | 
				
			||||||
        # self.model = offload.load_model_data(self.model, xmodel_filename )
 | 
					 | 
				
			||||||
        # offload.load_model_data(self.model, "c:/temp/Phantom-Wan-1.3B.pth")
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
        self.model.lock_layers_dtypes(torch.float32 if mixed_precision_transformer else dtype)
 | 
					        if self.model is not None or self.model2 is not None:
 | 
				
			||||||
        offload.change_dtype(self.model, dtype, True)
 | 
					            from wgp import save_model
 | 
				
			||||||
 | 
					            from mmgp.safetensors2 import torch_load_file
 | 
				
			||||||
 | 
					        else:
 | 
				
			||||||
 | 
					            if self.transformer_switch:
 | 
				
			||||||
 | 
					                if 0 in submodel_no_list[2:] and 1 in submodel_no_list:
 | 
				
			||||||
 | 
					                    raise Exception("Shared and non shared modules at the same time across multipe models is not supported")
 | 
				
			||||||
 | 
					                
 | 
				
			||||||
 | 
					                if 0 in submodel_no_list[2:]:
 | 
				
			||||||
 | 
					                    shared_modules= {}
 | 
				
			||||||
 | 
					                    self.model = offload.fast_load_transformers_model(model_filename[:1], modules = model_filename[2:], modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath,  return_shared_modules= shared_modules)
 | 
				
			||||||
 | 
					                    self.model2 = offload.fast_load_transformers_model(model_filename[1:2], modules = shared_modules, modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath)
 | 
				
			||||||
 | 
					                    shared_modules = None
 | 
				
			||||||
 | 
					                else:
 | 
				
			||||||
 | 
					                    modules_for_1 =[ file_name for file_name, submodel_no in zip(model_filename[2:],submodel_no_list[2:] ) if submodel_no ==1 ]
 | 
				
			||||||
 | 
					                    modules_for_2 =[ file_name for file_name, submodel_no in zip(model_filename[2:],submodel_no_list[2:] ) if submodel_no ==2 ]
 | 
				
			||||||
 | 
					                    self.model = offload.fast_load_transformers_model(model_filename[:1], modules = modules_for_1, modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath)
 | 
				
			||||||
 | 
					                    self.model2 = offload.fast_load_transformers_model(model_filename[1:2], modules = modules_for_2, modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            else:
 | 
				
			||||||
 | 
					                self.model = offload.fast_load_transformers_model(model_filename, modelClass=WanModel,do_quantize= quantizeTransformer and not save_quantized, writable_tensors= False, defaultConfigPath=base_config_file , forcedConfigPath= forcedConfigPath)
 | 
				
			||||||
 | 
					        
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        if self.model is not None:
 | 
				
			||||||
 | 
					            self.model.lock_layers_dtypes(torch.float32 if mixed_precision_transformer else dtype)
 | 
				
			||||||
 | 
					            offload.change_dtype(self.model, dtype, True)
 | 
				
			||||||
 | 
					            self.model.eval().requires_grad_(False)
 | 
				
			||||||
        if self.model2 is not None:
 | 
					        if self.model2 is not None:
 | 
				
			||||||
            self.model2.lock_layers_dtypes(torch.float32 if mixed_precision_transformer else dtype)
 | 
					            self.model2.lock_layers_dtypes(torch.float32 if mixed_precision_transformer else dtype)
 | 
				
			||||||
            offload.change_dtype(self.model2, dtype, True)
 | 
					            offload.change_dtype(self.model2, dtype, True)
 | 
				
			||||||
 | 
					 | 
				
			||||||
        # offload.save_model(self.model, "wan2.1_text2video_1.3B_mbf16.safetensors", do_quantize= False, config_file_path=base_config_file, filter_sd=sd)
 | 
					 | 
				
			||||||
        # offload.save_model(self.model, "wan2.2_image2video_14B_low_mbf16.safetensors",  config_file_path=base_config_file)
 | 
					 | 
				
			||||||
        # offload.save_model(self.model, "wan2.2_image2video_14B_low_quanto_mbf16_int8.safetensors", do_quantize=True, config_file_path=base_config_file)
 | 
					 | 
				
			||||||
        self.model.eval().requires_grad_(False)
 | 
					 | 
				
			||||||
        if self.model2 is not None:
 | 
					 | 
				
			||||||
            self.model2.eval().requires_grad_(False)
 | 
					            self.model2.eval().requires_grad_(False)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        if module_source is not None:
 | 
					        if module_source is not None:
 | 
				
			||||||
            from wgp import save_model
 | 
					            save_model(self.model, model_type, dtype, None, is_module=True, filter=list(torch_load_file(module_source)), module_source_no=1)
 | 
				
			||||||
            from mmgp.safetensors2 import torch_load_file
 | 
					        if module_source2 is not None:
 | 
				
			||||||
            filter = list(torch_load_file(module_source))
 | 
					            save_model(self.model2, model_type, dtype, None, is_module=True, filter=list(torch_load_file(module_source2)), module_source_no=2)
 | 
				
			||||||
            save_model(self.model, model_type, dtype, None, is_module=True, filter=filter)
 | 
					        if not source is None:
 | 
				
			||||||
        elif not source is None:
 | 
					            save_model(self.model, model_type, dtype, None, submodel_no= 1)
 | 
				
			||||||
            from wgp import save_model
 | 
					        if not source2 is None:
 | 
				
			||||||
            save_model(self.model, model_type, dtype, None)
 | 
					            save_model(self.model2, model_type, dtype, None, submodel_no= 2)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        if save_quantized:
 | 
					        if save_quantized:
 | 
				
			||||||
            from wgp import save_quantized_model
 | 
					            from wgp import save_quantized_model
 | 
				
			||||||
            save_quantized_model(self.model, model_type, model_filename[0], dtype, base_config_file)
 | 
					            if self.model is not None:
 | 
				
			||||||
 | 
					                save_quantized_model(self.model, model_type, model_filename[0], dtype, base_config_file)
 | 
				
			||||||
            if self.model2 is not None:
 | 
					            if self.model2 is not None:
 | 
				
			||||||
                save_quantized_model(self.model2, model_type, model_filename[1], dtype, base_config_file, submodel_no=2)
 | 
					                save_quantized_model(self.model2, model_type, model_filename[1], dtype, base_config_file, submodel_no=2)
 | 
				
			||||||
        self.sample_neg_prompt = config.sample_neg_prompt
 | 
					        self.sample_neg_prompt = config.sample_neg_prompt
 | 
				
			||||||
@ -307,7 +323,7 @@ class WanAny2V:
 | 
				
			|||||||
                canvas = canvas.to(device)
 | 
					                canvas = canvas.to(device)
 | 
				
			||||||
        return ref_img.to(device), canvas
 | 
					        return ref_img.to(device), canvas
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    def prepare_source(self, src_video, src_mask, src_ref_images, total_frames, image_size,  device, keep_video_guide_frames= [], start_frame = 0, pre_src_video = None, inject_frames = [], outpainting_dims = None, any_background_ref = False):
 | 
					    def prepare_source(self, src_video, src_mask, src_ref_images, total_frames, image_size,  device, keep_video_guide_frames= [], pre_src_video = None, inject_frames = [], outpainting_dims = None, any_background_ref = False):
 | 
				
			||||||
        image_sizes = []
 | 
					        image_sizes = []
 | 
				
			||||||
        trim_video_guide = len(keep_video_guide_frames)
 | 
					        trim_video_guide = len(keep_video_guide_frames)
 | 
				
			||||||
        def conv_tensor(t, device):
 | 
					        def conv_tensor(t, device):
 | 
				
			||||||
@ -659,13 +675,15 @@ class WanAny2V:
 | 
				
			|||||||
            inject_from_start = False
 | 
					            inject_from_start = False
 | 
				
			||||||
            if input_frames != None and denoising_strength < 1 :
 | 
					            if input_frames != None and denoising_strength < 1 :
 | 
				
			||||||
                color_reference_frame = input_frames[:, -1:].clone()
 | 
					                color_reference_frame = input_frames[:, -1:].clone()
 | 
				
			||||||
                if overlapped_latents != None:
 | 
					                if prefix_frames_count > 0:
 | 
				
			||||||
                    overlapped_latents_frames_num = overlapped_latents.shape[2]
 | 
					                    overlapped_frames_num = prefix_frames_count
 | 
				
			||||||
                    overlapped_frames_num = (overlapped_latents_frames_num-1) * 4 + 1
 | 
					                    overlapped_latents_frames_num = (overlapped_frames_num - 1) // 4 + 1
 | 
				
			||||||
 | 
					                    # overlapped_latents_frames_num = overlapped_latents.shape[2]
 | 
				
			||||||
 | 
					                    # overlapped_frames_num = (overlapped_latents_frames_num-1) * 4 + 1
 | 
				
			||||||
                else: 
 | 
					                else: 
 | 
				
			||||||
                    overlapped_latents_frames_num = overlapped_frames_num  = 0
 | 
					                    overlapped_latents_frames_num = overlapped_frames_num  = 0
 | 
				
			||||||
                if len(keep_frames_parsed) == 0  or image_outputs or  (overlapped_frames_num + len(keep_frames_parsed)) == input_frames.shape[1] and all(keep_frames_parsed) : keep_frames_parsed = [] 
 | 
					                if len(keep_frames_parsed) == 0  or image_outputs or  (overlapped_frames_num + len(keep_frames_parsed)) == input_frames.shape[1] and all(keep_frames_parsed) : keep_frames_parsed = [] 
 | 
				
			||||||
                injection_denoising_step = int(sampling_steps * (1. - denoising_strength) )
 | 
					                injection_denoising_step = int( round(sampling_steps * (1. - denoising_strength),4) )
 | 
				
			||||||
                latent_keep_frames = []
 | 
					                latent_keep_frames = []
 | 
				
			||||||
                if source_latents.shape[2] < lat_frames or len(keep_frames_parsed) > 0:
 | 
					                if source_latents.shape[2] < lat_frames or len(keep_frames_parsed) > 0:
 | 
				
			||||||
                    inject_from_start = True
 | 
					                    inject_from_start = True
 | 
				
			||||||
 | 
				
			|||||||
@ -78,7 +78,7 @@ class family_handler():
 | 
				
			|||||||
        return family_handler.query_model_files(computeList, base_model_type, model_filename, text_encoder_quantization)
 | 
					        return family_handler.query_model_files(computeList, base_model_type, model_filename, text_encoder_quantization)
 | 
				
			||||||
    
 | 
					    
 | 
				
			||||||
    @staticmethod
 | 
					    @staticmethod
 | 
				
			||||||
    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized= False):
 | 
					    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized= False, submodel_no_list = None):
 | 
				
			||||||
        from .configs import WAN_CONFIGS
 | 
					        from .configs import WAN_CONFIGS
 | 
				
			||||||
        from .wan_handler import family_handler
 | 
					        from .wan_handler import family_handler
 | 
				
			||||||
        cfg = WAN_CONFIGS['t2v-14B']
 | 
					        cfg = WAN_CONFIGS['t2v-14B']
 | 
				
			||||||
 | 
				
			|||||||
@ -214,18 +214,20 @@ def process_tts_multi(text, save_dir, voice1, voice2):
 | 
				
			|||||||
    return s1, s2, save_path_sum
 | 
					    return s1, s2, save_path_sum
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
def get_full_audio_embeddings(audio_guide1 = None, audio_guide2 = None, combination_type ="add", num_frames =  0, fps = 25, sr = 16000, padded_frames_for_embeddings = 0, min_audio_duration = 0):
 | 
					def get_full_audio_embeddings(audio_guide1 = None, audio_guide2 = None, combination_type ="add", num_frames =  0, fps = 25, sr = 16000, padded_frames_for_embeddings = 0, min_audio_duration = 0, return_sum_only = False):
 | 
				
			||||||
    wav2vec_feature_extractor, audio_encoder= custom_init('cpu', "ckpts/chinese-wav2vec2-base")
 | 
					    wav2vec_feature_extractor, audio_encoder= custom_init('cpu', "ckpts/chinese-wav2vec2-base")
 | 
				
			||||||
    # wav2vec_feature_extractor, audio_encoder= custom_init('cpu', "ckpts/wav2vec")
 | 
					    # wav2vec_feature_extractor, audio_encoder= custom_init('cpu', "ckpts/wav2vec")
 | 
				
			||||||
    pad = int(padded_frames_for_embeddings/ fps * sr)
 | 
					    pad = int(padded_frames_for_embeddings/ fps * sr)
 | 
				
			||||||
    new_human_speech1, new_human_speech2, sum_human_speechs, duration_changed = audio_prepare_multi(audio_guide1, audio_guide2, combination_type, duration= num_frames / fps, pad = pad, min_audio_duration = min_audio_duration )
 | 
					    new_human_speech1, new_human_speech2, sum_human_speechs, duration_changed = audio_prepare_multi(audio_guide1, audio_guide2, combination_type, duration= num_frames / fps, pad = pad, min_audio_duration = min_audio_duration )
 | 
				
			||||||
    audio_embedding_1 = get_embedding(new_human_speech1, wav2vec_feature_extractor, audio_encoder, sr=sr, fps= fps)
 | 
					    if return_sum_only:
 | 
				
			||||||
    audio_embedding_2 = get_embedding(new_human_speech2, wav2vec_feature_extractor, audio_encoder, sr=sr, fps= fps)
 | 
					        full_audio_embs = None
 | 
				
			||||||
    full_audio_embs = []
 | 
					    else:
 | 
				
			||||||
    if audio_guide1 != None: full_audio_embs.append(audio_embedding_1)
 | 
					        audio_embedding_1 = get_embedding(new_human_speech1, wav2vec_feature_extractor, audio_encoder, sr=sr, fps= fps)
 | 
				
			||||||
    # if audio_guide1 != None: full_audio_embs.append(audio_embedding_1)
 | 
					        audio_embedding_2 = get_embedding(new_human_speech2, wav2vec_feature_extractor, audio_encoder, sr=sr, fps= fps)
 | 
				
			||||||
    if audio_guide2 != None: full_audio_embs.append(audio_embedding_2)
 | 
					        full_audio_embs = []
 | 
				
			||||||
    if audio_guide2 == None and not duration_changed: sum_human_speechs = None
 | 
					        if audio_guide1 != None: full_audio_embs.append(audio_embedding_1)
 | 
				
			||||||
 | 
					        if audio_guide2 != None: full_audio_embs.append(audio_embedding_2)
 | 
				
			||||||
 | 
					        if audio_guide2 == None and not duration_changed: sum_human_speechs = None
 | 
				
			||||||
    return full_audio_embs, sum_human_speechs
 | 
					    return full_audio_embs, sum_human_speechs
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
				
			|||||||
@ -166,7 +166,8 @@ class family_handler():
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
            extra_model_def["lock_image_refs_ratios"] = True
 | 
					            extra_model_def["lock_image_refs_ratios"] = True
 | 
				
			||||||
            extra_model_def["background_removal_label"]= "Remove Backgrounds behind People / Objects, keep it for Landscape or positioned Frames"
 | 
					            extra_model_def["background_removal_label"]= "Remove Backgrounds behind People / Objects, keep it for Landscape or positioned Frames"
 | 
				
			||||||
 | 
					            extra_model_def["video_guide_outpainting"] = [0,1]
 | 
				
			||||||
 | 
					            
 | 
				
			||||||
        if base_model_type in ["standin"]: 
 | 
					        if base_model_type in ["standin"]: 
 | 
				
			||||||
            extra_model_def["lock_image_refs_ratios"] = True
 | 
					            extra_model_def["lock_image_refs_ratios"] = True
 | 
				
			||||||
            extra_model_def["image_ref_choices"] = {
 | 
					            extra_model_def["image_ref_choices"] = {
 | 
				
			||||||
@ -293,7 +294,7 @@ class family_handler():
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    @staticmethod
 | 
					    @staticmethod
 | 
				
			||||||
    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized= False):
 | 
					    def load_model(model_filename, model_type, base_model_type, model_def, quantizeTransformer = False, text_encoder_quantization = None, dtype = torch.bfloat16, VAE_dtype = torch.float32, mixed_precision_transformer = False, save_quantized= False, submodel_no_list = None):
 | 
				
			||||||
        from .configs import WAN_CONFIGS
 | 
					        from .configs import WAN_CONFIGS
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        if test_class_i2v(base_model_type):
 | 
					        if test_class_i2v(base_model_type):
 | 
				
			||||||
@ -306,6 +307,7 @@ class family_handler():
 | 
				
			|||||||
            config=cfg,
 | 
					            config=cfg,
 | 
				
			||||||
            checkpoint_dir="ckpts",
 | 
					            checkpoint_dir="ckpts",
 | 
				
			||||||
            model_filename=model_filename,
 | 
					            model_filename=model_filename,
 | 
				
			||||||
 | 
					            submodel_no_list = submodel_no_list,
 | 
				
			||||||
            model_type = model_type,        
 | 
					            model_type = model_type,        
 | 
				
			||||||
            model_def = model_def,
 | 
					            model_def = model_def,
 | 
				
			||||||
            base_model_type=base_model_type,
 | 
					            base_model_type=base_model_type,
 | 
				
			||||||
@ -381,7 +383,7 @@ class family_handler():
 | 
				
			|||||||
        if base_model_type in ["fantasy"]:
 | 
					        if base_model_type in ["fantasy"]:
 | 
				
			||||||
            ui_defaults.update({
 | 
					            ui_defaults.update({
 | 
				
			||||||
                "audio_guidance_scale": 5.0,
 | 
					                "audio_guidance_scale": 5.0,
 | 
				
			||||||
                "sliding_window_size": 1, 
 | 
					                "sliding_window_overlap" : 1,
 | 
				
			||||||
            })
 | 
					            })
 | 
				
			||||||
 | 
					
 | 
				
			||||||
        elif base_model_type in ["multitalk"]:
 | 
					        elif base_model_type in ["multitalk"]:
 | 
				
			||||||
@ -398,6 +400,7 @@ class family_handler():
 | 
				
			|||||||
                "guidance_scale": 5.0,
 | 
					                "guidance_scale": 5.0,
 | 
				
			||||||
                "flow_shift": 7, # 11 for 720p
 | 
					                "flow_shift": 7, # 11 for 720p
 | 
				
			||||||
                "sliding_window_overlap" : 9,
 | 
					                "sliding_window_overlap" : 9,
 | 
				
			||||||
 | 
					                "sliding_window_size": 81, 
 | 
				
			||||||
                "sample_solver" : "euler",
 | 
					                "sample_solver" : "euler",
 | 
				
			||||||
                "video_prompt_type": "QKI",
 | 
					                "video_prompt_type": "QKI",
 | 
				
			||||||
                "remove_background_images_ref" : 0,
 | 
					                "remove_background_images_ref" : 0,
 | 
				
			||||||
 | 
				
			|||||||
							
								
								
									
								preprocessing/extract_vocals.py (new file, 69 lines)
							@ -0,0 +1,69 @@
 | 
				
			|||||||
 | 
					from pathlib import Path
 | 
				
			||||||
 | 
					import os, tempfile
 | 
				
			||||||
 | 
					import numpy as np
 | 
				
			||||||
 | 
					import soundfile as sf
 | 
				
			||||||
 | 
					import librosa
 | 
				
			||||||
 | 
					import torch
 | 
				
			||||||
 | 
					import gc
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					from audio_separator.separator import Separator
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					def get_vocals(src_path: str, dst_path: str, min_seconds: float = 8) -> str:
 | 
				
			||||||
 | 
					    """
 | 
				
			||||||
 | 
					    If the source audio is shorter than `min_seconds`, pad with trailing silence
 | 
				
			||||||
 | 
					    in a temporary file, then run separation and save only the vocals to dst_path.
 | 
				
			||||||
 | 
					    Returns the full path to the vocals file.
 | 
				
			||||||
 | 
					    """
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    default_device = torch.get_default_device()
 | 
				
			||||||
 | 
					    torch.set_default_device('cpu')
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    dst = Path(dst_path)
 | 
				
			||||||
 | 
					    dst.parent.mkdir(parents=True, exist_ok=True)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    # Quick duration check
 | 
				
			||||||
 | 
					    duration = librosa.get_duration(path=src_path)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					    use_path = src_path
 | 
				
			||||||
 | 
					    temp_path = None
 | 
				
			||||||
 | 
					    try:
 | 
				
			||||||
 | 
					        if duration < min_seconds:
 | 
				
			||||||
 | 
					            # Load (resample) and pad in memory
 | 
				
			||||||
 | 
					            y, sr = librosa.load(src_path, sr=None, mono=False)
 | 
				
			||||||
 | 
					            if y.ndim == 1:  # ensure shape (channels, samples)
 | 
				
			||||||
 | 
					                y = y[np.newaxis, :]
 | 
				
			||||||
 | 
					            target_len = int(min_seconds * sr)
 | 
				
			||||||
 | 
					            pad = max(0, target_len - y.shape[1])
 | 
				
			||||||
 | 
					            if pad:
 | 
				
			||||||
 | 
					                y = np.pad(y, ((0, 0), (0, pad)), mode="constant")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					            # Write a temp WAV for the separator
 | 
				
			||||||
 | 
					            fd, temp_path = tempfile.mkstemp(suffix=".wav")
 | 
				
			||||||
 | 
					            os.close(fd)
 | 
				
			||||||
 | 
					            sf.write(temp_path, y.T, sr)  # soundfile expects (frames, channels)
 | 
				
			||||||
 | 
					            use_path = temp_path
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        # Run separation: emit only the vocals, with your exact filename
 | 
				
			||||||
 | 
					        sep = Separator(
 | 
				
			||||||
 | 
					            output_dir=str(dst.parent),
 | 
				
			||||||
 | 
					            output_format=(dst.suffix.lstrip(".") or "wav"),
 | 
				
			||||||
 | 
					            output_single_stem="Vocals",
 | 
				
			||||||
 | 
					            model_file_dir="ckpts/roformer/" #model_bs_roformer_ep_317_sdr_12.9755.ckpt"
 | 
				
			||||||
 | 
					        )
 | 
				
			||||||
 | 
					        sep.load_model()
 | 
				
			||||||
 | 
					        out_files = sep.separate(use_path, {"Vocals": dst.stem})
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        out = Path(out_files[0])
 | 
				
			||||||
 | 
					        return str(out if out.is_absolute() else (dst.parent / out))
 | 
				
			||||||
 | 
					    finally:
 | 
				
			||||||
 | 
					        if temp_path and os.path.exists(temp_path):
 | 
				
			||||||
 | 
					            os.remove(temp_path)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					        torch.cuda.empty_cache()
 | 
				
			||||||
 | 
					        gc.collect()
 | 
				
			||||||
 | 
					        torch.set_default_device(default_device)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# Example:
 | 
				
			||||||
 | 
					# final = get_vocals("in/clip.mp3", "out/vocals.wav")
 | 
				
			||||||
 | 
					# print(final)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@ -100,7 +100,7 @@ class OptimizedPyannote31SpeakerSeparator:
 | 
				
			|||||||
        self.hf_token = hf_token
 | 
					        self.hf_token = hf_token
 | 
				
			||||||
        self._overlap_pipeline = None
 | 
					        self._overlap_pipeline = None
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    def separate_audio(self, audio_path: str, output1, output2 ) -> Dict[str, str]:
 | 
					    def separate_audio(self, audio_path: str, output1, output2, audio_original_path: str = None  ) -> Dict[str, str]:
 | 
				
			||||||
        """Optimized main separation function with memory management."""
 | 
					        """Optimized main separation function with memory management."""
 | 
				
			||||||
        xprint("Starting optimized audio separation...")
 | 
					        xprint("Starting optimized audio separation...")
 | 
				
			||||||
        self._current_audio_path = os.path.abspath(audio_path)        
 | 
					        self._current_audio_path = os.path.abspath(audio_path)        
 | 
				
			||||||
@ -128,7 +128,11 @@ class OptimizedPyannote31SpeakerSeparator:
 | 
				
			|||||||
        gc.collect()
 | 
					        gc.collect()
 | 
				
			||||||
        
 | 
					        
 | 
				
			||||||
        # Save outputs efficiently
 | 
					        # Save outputs efficiently
 | 
				
			||||||
        output_paths = self._save_outputs_optimized(waveform, final_masks, sample_rate, audio_path, output1, output2)
 | 
					        if audio_original_path is None:
 | 
				
			||||||
 | 
					            waveform_original = waveform
 | 
				
			||||||
 | 
					        else:
 | 
				
			||||||
 | 
					            waveform_original, sample_rate = self.load_audio(audio_original_path)
 | 
				
			||||||
 | 
					        output_paths = self._save_outputs_optimized(waveform_original, final_masks, sample_rate, audio_path, output1, output2)
 | 
				
			||||||
        
 | 
					        
 | 
				
			||||||
        return output_paths
 | 
					        return output_paths
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@ -835,7 +839,7 @@ class OptimizedPyannote31SpeakerSeparator:
 | 
				
			|||||||
        for turn, _, speaker in diarization.itertracks(yield_label=True):
 | 
					        for turn, _, speaker in diarization.itertracks(yield_label=True):
 | 
				
			||||||
            xprint(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
 | 
					            xprint(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
def extract_dual_audio(audio, output1, output2, verbose = False):
 | 
					def extract_dual_audio(audio, output1, output2, verbose = False, audio_original = None):
 | 
				
			||||||
    global verbose_output
 | 
					    global verbose_output
 | 
				
			||||||
    verbose_output = verbose
 | 
					    verbose_output = verbose
 | 
				
			||||||
    separator = OptimizedPyannote31SpeakerSeparator(
 | 
					    separator = OptimizedPyannote31SpeakerSeparator(
 | 
				
			||||||
@ -848,7 +852,7 @@ def extract_dual_audio(audio, output1, output2, verbose = False):
 | 
				
			|||||||
    import time
 | 
					    import time
 | 
				
			||||||
    start_time = time.time()
 | 
					    start_time = time.time()
 | 
				
			||||||
    
 | 
					    
 | 
				
			||||||
    outputs = separator.separate_audio(audio, output1, output2)
 | 
					    outputs = separator.separate_audio(audio, output1, output2, audio_original)
 | 
				
			||||||
    
 | 
					    
 | 
				
			||||||
    elapsed_time = time.time() - start_time
 | 
					    elapsed_time = time.time() - start_time
 | 
				
			||||||
    xprint(f"\n=== SUCCESS (completed in {elapsed_time:.2f}s) ===")
 | 
					    xprint(f"\n=== SUCCESS (completed in {elapsed_time:.2f}s) ===")
 | 
				
			||||||
 | 
				
			|||||||
@ -21,14 +21,15 @@ mutagen
 | 
				
			|||||||
pyloudnorm
 | 
					pyloudnorm
 | 
				
			||||||
librosa==0.11.0
 | 
					librosa==0.11.0
 | 
				
			||||||
speechbrain==1.0.3
 | 
					speechbrain==1.0.3
 | 
				
			||||||
 
 | 
					audio-separator==0.36.1
 | 
				
			||||||
 | 
					
 | 
				
			||||||
# UI & interaction
 | 
					# UI & interaction
 | 
				
			||||||
gradio==5.29.0
 | 
					gradio==5.29.0
 | 
				
			||||||
dashscope
 | 
					dashscope
 | 
				
			||||||
loguru
 | 
					loguru
 | 
				
			||||||
 | 
					
 | 
				
			||||||
# Vision & segmentation
 | 
					# Vision & segmentation
 | 
				
			||||||
opencv-python>=4.9.0.80
 | 
					opencv-python>=4.12.0.88
 | 
				
			||||||
segment-anything
 | 
					segment-anything
 | 
				
			||||||
rembg[gpu]==2.0.65
 | 
					rembg[gpu]==2.0.65
 | 
				
			||||||
onnxruntime-gpu
 | 
					onnxruntime-gpu
 | 
				
			||||||
 | 
				
			|||||||