Vace Outpainting, new Depth extractor

This commit is contained in:
parent 188225cdd7
commit 4d202db319
@@ -20,6 +20,13 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTX Video models

**Follow DeepBeepMeep on Twitter/X to get the Latest News**: https://x.com/deepbeepmeep

## 🔥 Latest Updates

### June 19 2025: WanGP v6.2, Vace even more Powercharged

Have I told you that I am a big fan of Vace? Here are more goodies to unleash its power:

- If you ever wanted to watch Star Wars in 4:3, just use the new *Outpainting* feature and it will add the missing bits of image at the top and the bottom of the screen. The best thing is that *Outpainting* can be combined with all the other Vace modifications, for instance you can change the main character of your favorite movie at the same time.
- More processes can be combined at the same time (for instance the depth process can be applied outside the mask).
- Upgraded the depth extractor to Depth Anything V2, which is much more detailed.

As a bonus, I have added two finetunes based on the Self-Forcing technology (which requires only 4 steps to generate a video): Wan 2.1 text2video Self-Forcing and Vace Self-Forcing. I know there is a Lora around, but the quality of the Lora is worse (at least with Vace) compared to the full model. Don't hesitate to share your opinion about this on the discord server.

### June 17 2025: WanGP v6.1, Vace Powercharged

Lots of improvements for Vace, the Mother of all Models:

- masks can now be combined with on-the-fly processing of a control video, for instance you can extract the motion of a specific person defined by a mask
@@ -1,6 +1,40 @@

# Changelog

## 🔥 Latest News

### June 19 2025: WanGP v6.2, Vace even more Powercharged

Have I told you that I am a big fan of Vace? Here are more goodies to unleash its power:

- If you ever wanted to watch Star Wars in 4:3, just use the new *Outpainting* feature and it will add the missing bits of image at the top and the bottom of the screen. The best thing is that *Outpainting* can be combined with all the other Vace modifications, for instance you can change the main character of your favorite movie at the same time.
- More processes can be combined at the same time (for instance the depth process can be applied outside the mask).
- Upgraded the depth extractor to Depth Anything V2, which is much more detailed.

### June 17 2025: WanGP v6.1, Vace Powercharged

Lots of improvements for Vace, the Mother of all Models:

- masks can now be combined with on-the-fly processing of a control video, for instance you can extract the motion of a specific person defined by a mask
- on-the-fly modification of masks: reversed masks (with the same mask you can modify the background instead of the people covered by the masks), enlarged masks (you can cover more area if for instance the person you are trying to inject is larger than the one in the mask), ...
- view these modified masks directly inside WanGP during the video generation to check that they are really as expected
- multiple frame injections: multiple frames can be injected at any location of the video
- expand past videos in one click: just select one generated video to expand it

Of course, all these new features work on all Vace finetunes (including Vace FusioniX).

Thanks also to Reevoy24 for adding a Notification sound at the end of a generation and for fixing the background color of the current generation summary.

### June 12 2025: WanGP v6.0
👋 *Finetune models*: You find the 20 models supported by WanGP not sufficient? Too impatient to wait for the next release to get support for a newly released model? Your prayers have been answered: if a new model is compatible with a model architecture supported by WanGP, you can add support for this model in WanGP yourself by just creating a finetune model definition. You can then store this model in the cloud (for instance on Huggingface) and the very light finetune definition file can easily be shared with other users. WanGP will download the finetuned model for them automatically.

To celebrate the new finetune support, here are a few finetune gifts (directly accessible from the model selection menu):
- *Fast Hunyuan Video*: generate t2v videos in only 6 steps
- *Hunyuan Video AccVideo*: generate t2v videos in only 5 steps
- *Wan FusioniX*: a combo of AccVideo / CausVid and other models that can generate high quality Wan videos in only 8 steps

One more thing...

The new finetune system can be used to combine complementary models: what happens when you combine Fusionix Text2Video and Vace Control Net?

You get **Vace FusioniX**: the Ultimate Vace Model, Fast (10 steps, no need for guidance) and with much better video quality than the original slower model (despite it being the best Control Net out there). Here goes one more finetune...

Check the *Finetune Guide* to create finetune model definitions and share them on the WanGP discord server.

### June 12 2025: WanGP v5.6
👋 *Finetune models*: You find the 20 models supported by WanGP not sufficient? Too impatient to wait for the next release to get support for a newly released model? Your prayers have been answered: if a new model is compatible with a model architecture supported by WanGP, you can add support for this model in WanGP yourself by just creating a finetune model definition. You can then store this model in the cloud (for instance on Huggingface) and the very light finetune definition file can easily be shared with other users. WanGP will download the finetuned model for them automatically.
@@ -88,6 +88,36 @@ python wgp.py --lora-preset mypreset.lset

- Presets include comments with usage instructions
- Share `.lset` files with other users

## Supported Formats

WanGP supports multiple lora formats:
- **Safetensors** (.safetensors)
- **Replicate** format
- **Standard PyTorch** (.pt, .pth)
## Self-Forcing lightx2v Lora (Video Generation Accelerator)

The Self-Forcing Lora was created by Kijai from the Self-Forcing lightx2v distilled Wan model. It can generate videos in only 2 steps and also offers a 2x speed improvement, since it doesn't require classifier-free guidance. It works with both t2v and i2v models.

### Setup Instructions
1. Download the Lora:
   ```
   https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors
   ```
2. Place in your `loras/` directory

### Usage
1. Select a Wan t2v or i2v model (e.g., Wan 2.1 text2video 13B or Vace 13B)
2. Enable Advanced Mode
3. In Advanced Generation Tab:
   - Set Guidance Scale = 1
   - Set Shift Scale = 5
4. In Advanced Lora Tab:
   - Select the Lora above
   - Set multiplier to 1
5. Set generation steps to 2-8
6. Generate!
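For reference, here is a minimal sketch of fetching this Lora from a script instead of the browser, assuming the `huggingface_hub` package is installed; the target directory is the `loras/` folder from step 2 above.

```python
# Hedged example: download the Self-Forcing lightx2v Lora into WanGP's loras/ folder.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Kijai/WanVideo_comfy",
    filename="Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors",
    local_dir="loras",
)
```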
## CausVid Lora (Video Generation Accelerator)

CausVid is a distilled Wan model that generates videos in 4-12 steps with a 2x speed improvement.

@@ -118,19 +148,14 @@ CausVid is a distilled Wan model that generates videos in 4-12 steps with 2x speed improvement

*Note: Lower steps = lower quality (especially motion)*

## Supported Formats

WanGP supports multiple lora formats:
- **Safetensors** (.safetensors)
- **Replicate** format
- **Standard PyTorch** (.pt, .pth)

## AccVid Lora (Video Generation Accelerator)

AccVid is a distilled Wan model that generates videos with a 2x speed improvement since classifier-free guidance is no longer needed (that is, cfg = 1).

### Setup Instructions
1. Download the CausVid Lora:
1. Download the AccVid Lora:

- for t2v models:
@@ -152,6 +177,9 @@ AccVid is a distilled Wan model that generates videos with a 2x speed improvement
   - Set Shift Scale = 5
4. The number of steps remains unchanged compared to what you would use with the original model, but generation will be two times faster since classifier-free guidance is not needed. A quick cost comparison is sketched below.
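A worked sanity check of that 2x claim (the step count is a hypothetical example): with classifier-free guidance each step roughly runs a conditional and an unconditional model pass, so dropping it halves the work per step while keeping the step count unchanged.

```python
# Rough cost comparison for a hypothetical 30-step generation.
steps = 30
passes_with_cfg = steps * 2     # conditional + unconditional model pass per step
passes_with_accvid = steps * 1  # cfg = 1: a single pass per step
print(passes_with_cfg / passes_with_accvid)  # 2.0 -> same steps, roughly half the compute
```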

https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors

## Performance Tips

### Fast Loading/Unloading
@@ -108,9 +108,16 @@ Please note that the term *Text2Video* refers to the underlying Wan architecture

<BR>

## Wan Special Loras
### Causvid
### Self-Forcing lightx2v Lora
- **Type**: Distilled model (Lora implementation)
- **Speed**: 4-12 steps generation, 2x faster
- **Speed**: 4-8 steps generation, 2x faster (no classifier-free guidance)
- **Compatible**: Works with t2v and i2v Wan 14B models
- **Setup**: Requires the Self-Forcing lightx2v Lora (see [LORAS.md](LORAS.md))

### Causvid Lora
- **Type**: Distilled model (Lora implementation)
- **Speed**: 4-12 steps generation, 2x faster (no classifier-free guidance)
- **Compatible**: Works with Wan 14B models
- **Setup**: Requires CausVid Lora (see [LORAS.md](LORAS.md))
docs/VACE.md (140 changed lines)

@@ -1,6 +1,6 @@
# VACE ControlNet Guide

VACE is a powerful ControlNet that enables Video-to-Video and Reference-to-Video generation. It allows you to inject your own images into output videos, animate characters, perform inpainting/outpainting, and continue videos.
VACE is a powerful ControlNet that enables Video-to-Video and Reference-to-Video generation. It allows you to inject your own images into output videos, animate characters, perform inpainting/outpainting, and continue existing videos.

## Overview

@@ -10,7 +10,8 @@ VACE is probably one of the most powerful Wan models available. With it, you can
- Perform video inpainting and outpainting
- Continue existing videos
- Transfer motion from one video to another
- Change the style of scenes while preserving depth
- Change the style of scenes while preserving the structure of the scenes

## Getting Started

@@ -18,79 +19,93 @@ VACE is probably one of the most powerful Wan models available. With it, you can
1. Select either "Vace 1.3B" or "Vace 13B" from the dropdown menu
2. Note: VACE works best with videos up to 7 seconds with the Riflex option enabled

You can also use any derived Vace model such as Vace FusioniX, or combine Vace with accelerator Loras such as Causvid.

### Input Types

VACE accepts three types of visual hints (which can be combined):

#### 1. Control Video
- Transfer motion or depth to a new video
- Use only the first n frames and extrapolate the rest
- Perform inpainting with grey color (127) as mask areas
- Grey areas will be filled based on text prompt and reference images
The Control Video is the source material that contains the instructions about what you want. Vace therefore expects the Control Video to contain visual hints about the type of processing expected: for instance replacing an area by something else, converting an Open Pose wireframe into human motion, colorizing an area, transferring the depth of an image area, ...

For example, anywhere your control video contains the color 127 (grey), it will be considered an area to be inpainted and replaced by the content of your text prompt and / or a reference image (see below). Likewise, if the frames of a Control Video contain an Open Pose wireframe (basically some straight lines tied together that describe the pose of a person), Vace will automatically turn this Open Pose into a real human based on the text prompt and any reference images (see below).

You can either build the Control Video yourself with the annotator tools provided by the Vace team (see the Vace resources at the bottom), or you can let WanGP (recommended option) generate a Vace-formatted Control Video on the fly based on information you provide.

WanGP will need the following information to generate a Vace Control Video:
- A *Control Video*: this video shouldn't have been altered by an annotator tool and can be taken straight from YouTube or your camera
- *Control Video Process*: this is the type of process you want to apply to the control video. For instance *Transfer Human Motion* will generate the Open Pose information from your video so that you can transfer this same motion to a generated character
- *Area Processed*: you can target the processing to a specific area. For instance, even if there are multiple people in the Control Video, you may want to replace only one of them. If you decide to target an area you will need to provide a *Video Mask* as well. These types of videos can be easily created using the Matanyone tool embedded with WanGP (see the doc of Matanyone below). WanGP can apply different types of process, one on the mask and another one outside the mask.

Another nice thing is that you can combine all the effects above with Outpainting, since WanGP will automatically create an outpainting area in the Control Video if you ask for this.

By default WanGP will ask Vace to generate new frames in the "same spirit" as the control video if the latter is shorter than the number of frames that you have requested.
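To make the convention concrete, here is a minimal sketch (using numpy, with arbitrary sizes and coordinates) of what one frame of a hand-built Vace Control Video could look like; normally WanGP generates these frames for you.

```python
import numpy as np

# One RGB frame of the Control Video (a dummy 480x832 frame here; in practice a frame
# taken from your source video).
frame = np.zeros((480, 832, 3), dtype=np.uint8)

# Grey (127) marks the region Vace should regenerate from the text prompt and/or
# reference images; everything else in the frame is kept as visual guidance.
frame[100:300, 200:400, :] = 127
control_frame = frame  # one frame of the Vace-formatted Control Video
```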
#### 2. Reference Images
- Use as background/setting for the video
- Inject people or objects of your choice
- Select multiple reference images
- **Tip**: Replace complex backgrounds with white for better object integration
- Always describe injected objects/people explicitly in your text prompt
With Reference Images you can inject people or objects of your choice into the Video.
You can also force Images to appear at specific frame numbers in the Video.

#### 3. Video Mask
- Stronger control over which parts to keep (black) or replace (white)
- Perfect for inpainting/outpainting
- Example: White mask except at beginning/end (black) keeps first/last frames while generating middle content
If the Reference Image is a person or an object, it is recommended to turn on the background remover, which will replace the background with the white color.
This is not needed for a background image or an injected frame at a specific position.

## Common Use Cases
It is recommended to describe injected objects/people explicitly in your text prompt so that Vace can connect the Reference Images to the newly generated video; this will increase the chance that you will find your injected people or objects.

### Motion Transfer
**Goal**: Animate a character of your choice using motion from another video
**Setup**:
- Reference Images: Your character
- Control Video: Person performing desired motion
- Text Prompt: Describe your character and the action

### Object/Person Injection
**Goal**: Insert people or objects into a scene
**Setup**:
- Reference Images: The people/objects to inject
- Text Prompt: Describe the scene and explicitly mention the injected elements
### Understanding the Vace Control Video and Mask format
As stated above, WanGP will adapt the Control Video and the Video Mask to meet your instructions. You can preview the first frames of the new Control Video and of the Video Mask in the Generation Preview box (just click a thumbnail) to check that your request has been properly interpreted. You can also ask WanGP to save the full generated Control Video and Video Mask in the main folder of WanGP by launching the app with the *--save-masks* command.

### Character Animation
**Goal**: Animate a character based on text description
**Setup**:
- Control Video: Video of person moving
- Text Prompt: Detailed description of your character
Look at the background colors of both the Control Video and the Video Mask:
The Mask Video is the most important, because depending on the color of its pixels the Control Video will be interpreted differently. If an area in the Mask is black, the corresponding Control Video area will be kept as is. On the contrary, if an area of the Mask is plain white, a Vace process will be applied to this area. If there isn't any Mask Video, the Vace process will apply to the whole video frames. The nature of the process itself depends on what there is in the Control Video for this area (a sketch of the mask convention follows this section).
- if the area is grey (127) in the Control Video, this area will be replaced by new content based on the text prompt or image references
- if an area represents a person in the wireframe Open Pose format, it will be replaced by a person animated with the motion described by the Open Pose. The appearance of the person will depend on the text prompt or image references
- if an area contains multiple shades of grey, these will be assumed to represent different levels of image depth and Vace will try to generate new content located at the same depth

### Style Transfer with Depth
**Goal**: Change scene style while preserving spatial relationships
**Setup**:
- Control Video: Original video (for depth information)
- Text Prompt: New style description
There are more Vace representations. For all the different mappings, please refer to the official Vace documentation.
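A minimal sketch of that mask convention, assuming you build the Video Mask yourself with numpy (frame count and resolution are arbitrary examples); Matanyone or WanGP normally produce this for you.

```python
import numpy as np

num_frames, height, width = 81, 480, 832

# Vace Video Mask convention: black (0) = keep the Control Video as is,
# white (255) = apply the Vace process (inpainting, pose transfer, depth, ...).
mask_video = np.full((num_frames, height, width, 3), 255, dtype=np.uint8)

# Keep the first and last frames untouched and regenerate everything in between
# (the "white mask except at beginning/end" example mentioned above).
mask_video[0] = 0
mask_video[-1] = 0
```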
## Integrated Matanyone Tool
### Example 1: Replace a Person in one video with another one while keeping the Background
1) In Vace, select *Control Video Process*=**Transfer human pose**, *Area processed*=**Masked area**
2) In *Matanyone Video Mask Creator*, load your source video and create a mask where you target a specific person
3) Click *Export to Control Video Input and Video Mask Input* to transfer both the original video, which now becomes the *Control Video*, and the black & white mask, which now defines the *Video Mask Area*
4) Back in Vace, in *Reference Image* select **Inject Landscapes / People / Objects** and upload one or several pictures of the new person
5) Generate

WanGP includes the Matanyone tool, specifically tuned for VACE workflows. This helps create control videos and masks simultaneously.
This also works with several people at the same time (you just need to mask several people in *Matanyone*). You can also play with the *Expand / Shrink Mask* slider if the new person is larger than the original one, and of course you can also use the text *Prompt* if you don't want to use an image for the swap.

### Creating Face Replacement Masks

### Example 2: Change the Background behind some characters
1) In Vace, select *Control Video Process*=**Inpainting**, *Area processed*=**Non Masked area**
2) In *Matanyone Video Mask Creator*, load your source video and create a mask where you target the people you want to keep
3) Click *Export to Control Video Input and Video Mask Input* to transfer both the original video, which now becomes the *Control Video*, and the black & white mask, which now defines the *Video Mask Area*
4) Generate

If instead *Control Video Process*=**Depth**, the background will still be different, but it will have a geometry similar to that of the control video.

### Example 3: Outpaint a Video to the Left and Inject a Character in this new area
1) In Vace, select *Control Video Process*=**Keep Unchanged**
2) In *Control Video Outpainting in Percentage*, enter the value 40 in the *Left* entry
3) In *Reference Image* select **Inject Landscapes / People / Objects** and upload one or several pictures of a person
4) Enter a *Prompt* such as "a person is coming from the left" (you will of course need a more accurate description)
5) Generate

### Creating Face / Object Replacement Masks
1. Load your video in Matanyone
2. Click on the face in the first frame
3. Create a mask for the face
4. Generate both control video and mask video with "Generate Video Matting"
5. Export to VACE with "Export to current Video Input and Video Mask"
6. Load replacement face image in Reference Images field
2. Click on the face or object in the first frame
3. Validate the mask by clicking **Set Mask**
4. Generate a copy of the control video (for easy transfers) and a new mask video by clicking "Generate Video Matting"
5. Export to VACE with *Export to Control Video Input and Video Mask Input*

### Advanced Matanyone Tips
- **Negative Point Prompts**: Remove parts from current selection
- **Sub Masks**: Create multiple independent masks, then combine them
- **Background Masks**: Select everything except the character (useful for background replacement)
- Enable/disable sub masks in Matanyone settings
- **Negative Point Prompts**: Remove parts from the current selection if the mask goes beyond the desired area
- **Sub Masks**: Create multiple independent masks, then combine them. This may be useful if you are struggling to select exactly what you want.

## Recommended Settings

### Quality Settings
- **Skip Layer Guidance**: Turn ON with default configuration for better results
- **Skip Layer Guidance**: Turn ON with default configuration for better results (useless with FusioniX or Causvid as there is no cfg)
- **Long Prompts**: Use detailed descriptions, especially for background elements not in reference images
- **Steps**: Use at least 15 steps for good quality, 30+ for best results
- **Steps**: Use at least 15 steps for good quality, 30+ for best results if you use the original Vace model. Only 8-10 steps are sufficient with Vace FusioniX or if you use Loras such as Causvid or Self-Forcing.

### Sliding Window Settings
For very long videos, configure sliding windows properly:
@@ -98,27 +113,24 @@ For very long videos, configure sliding windows properly:
- **Window Size**: Set appropriate duration for your content
- **Overlap Frames**: Long enough for motion continuity, short enough to avoid blur propagation
- **Discard Last Frames**: Remove at least 4 frames from each window (VACE 1.3B tends to blur final frames)
- **Add Overlapped Noise**: May or may not reduce quality degradation over time

### Background Removal
VACE includes automatic background removal options:
WanGP includes automatic background removal options:
- Use for reference images containing people/objects
- **Don't use** for landscape/setting reference images (first reference image)
- Multiple background removal types available
- **Don't use** this for landscape/setting reference images (the first reference image)
- If you are not happy with the automatic background removal tool, you can use the Image version of Matanyone for a precise background removal

## Window Sliding for Long Videos

Generate videos up to 1 minute by merging multiple windows:
The longer the video, the greater the quality degradation. However, the effect will be less visible if your generated video mostly reuses a non-altered control video.

### How It Works
- Each window uses the corresponding time segment from the control video
- Example: 0-4s control video → first window, 4-8s → second window, etc.
- Automatic overlap management ensures smooth transitions

### Settings
- **Window Size**: Duration of each generation window
- **Overlap Frames**: Frames shared between windows for continuity
- **Discard Last Frames**: Remove poor-quality ending frames
- **Add Overlapped Noise**: Reduce quality degradation over time

### Formula
```
Generated Frames = [Windows - 1] × [Window Size - Overlap - Discard] + Window Size
```
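A worked example of this formula with hypothetical settings, as a small sketch:

```python
# Sliding-window frame count for hypothetical settings.
windows = 3          # number of sliding windows
window_size = 81     # frames generated per window
overlap = 8          # frames shared with the previous window
discard = 4          # low-quality trailing frames dropped from each window

generated_frames = (windows - 1) * (window_size - overlap - discard) + window_size
print(generated_frames)  # (3 - 1) * 69 + 81 = 219 frames
```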
@@ -133,10 +145,7 @@

## Advanced Features

### Extend Video
Click "Extend the Video Sample, Please!" during generation to add more windows dynamically.

### Noise Addition
Add noise to overlapped frames to hide accumulated errors and quality degradation.
Click "Extend the Video Sample, Please!" during generation to add more windows dynamically. This can be useful if you didn't ask for many frames to be generated initially and are happy with the video parts already generated.

### Frame Truncation
Automatically remove lower-quality final frames from each window (recommended: 4 frames for VACE 1.3B).
@@ -181,7 +190,6 @@ Automatically remove lower-quality final frames from each window (recommended: 4 frames for VACE 1.3B)
4. Check control video quality

## Tips for Best Results

1. **Detailed Prompts**: Describe everything in the scene, especially elements not in reference images
2. **Quality Reference Images**: Use high-resolution, well-lit reference images
3. **Proper Masking**: Take time to create precise masks with Matanyone
@@ -1,7 +1,7 @@
{
	"model":
	{
		"name": "Wan text2video FusioniX 14B",
		"name": "Wan2.1 text2video FusioniX 14B",
		"architecture" : "t2v",
		"description": "A powerful merged text-to-video model based on the original WAN 2.1 T2V model, enhanced using multiple open-source components and LoRAs to boost motion realism, temporal consistency, and expressive detail.",
		"URLs": [
finetunes/t2v_sf.json (38 lines, new file)

@@ -0,0 +1,38 @@
{
    "model": {
        "name": "Wan2.1 text2video Self-Forcing 14B",
        "architecture": "t2v",
        "description": "This model is an advanced text-to-video generation model. This approach allows the model to generate videos with significantly fewer inference steps (4 or 8 steps) and without classifier-free guidance, substantially reducing video generation time while maintaining high quality outputs.",
        "URLs": [
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/wan2.1_StepDistill-CfgDistill_14B_bf16.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/wan2.1_StepDistill-CfgDistill_14B_quanto_bf16_int8.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/wan2.1_StepDistill-CfgDistill_14B_quanto_fp16_int8.safetensors"
        ],
        "author": "https://huggingface.co/lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill",
        "auto_quantize": true
    },
    "negative_prompt": "",
    "prompt": "",
    "resolution": "832x480",
    "video_length": 81,
    "seed": -1,
    "num_inference_steps": 4,
    "guidance_scale": 1,
    "flow_shift": 3,
    "embedded_guidance_scale": 6,
    "repeat_generation": 1,
    "multi_images_gen_type": 0,
    "tea_cache_setting": 0,
    "tea_cache_start_step_perc": 0,
    "loras_multipliers": "",
    "temporal_upsampling": "",
    "spatial_upsampling": "",
    "RIFLEx_setting": 0,
    "slg_switch": 0,
    "slg_start_perc": 10,
    "slg_end_perc": 90,
    "cfg_star_switch": 0,
    "cfg_zero_step": -1,
    "prompt_enhancer": "",
    "activated_loras": []
}
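For reference, a quick way to inspect such a finetune definition from Python (field names taken from the file above):

```python
import json

# Load the Self-Forcing finetune definition added by this commit.
with open("finetunes/t2v_sf.json") as f:
    definition = json.load(f)

print(definition["model"]["name"])        # Wan2.1 text2video Self-Forcing 14B
print(definition["num_inference_steps"])  # 4
for url in definition["model"]["URLs"]:   # checkpoints WanGP downloads automatically
    print(url)
```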
finetunes/vace_14B_sf.json (41 lines, new file)

@@ -0,0 +1,41 @@
{
    "model": {
        "name": "Vace Self-Forcing 14B",
        "architecture": "vace_14B",
        "modules": [
            "vace_14B"
        ],
        "description": "This model is a combination of Vace and an advanced text-to-video generation model. This approach allows the model to generate videos with significantly fewer inference steps (4 or 8 steps) and without classifier-free guidance, substantially reducing video generation time while maintaining high quality outputs.",
        "URLs": [
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/wan2.1_StepDistill-CfgDistill_14B_bf16.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/wan2.1_StepDistill-CfgDistill_14B_quanto_bf16_int8.safetensors",
            "https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/wan2.1_StepDistill-CfgDistill_14B_quanto_fp16_int8.safetensors"
        ],
        "author": "https://huggingface.co/lightx2v/Wan2.1-T2V-14B-StepDistill-CfgDistill",
        "auto_quantize": true
    },
    "negative_prompt": "",
    "prompt": "",
    "resolution": "832x480",
    "video_length": 81,
    "seed": -1,
    "num_inference_steps": 4,
    "guidance_scale": 1,
    "flow_shift": 3,
    "embedded_guidance_scale": 6,
    "repeat_generation": 1,
    "multi_images_gen_type": 0,
    "tea_cache_setting": 0,
    "tea_cache_start_step_perc": 0,
    "loras_multipliers": "",
    "temporal_upsampling": "",
    "spatial_upsampling": "",
    "RIFLEx_setting": 0,
    "slg_switch": 0,
    "slg_start_perc": 10,
    "slg_end_perc": 90,
    "cfg_star_switch": 0,
    "cfg_zero_step": -1,
    "prompt_enhancer": "",
    "activated_loras": []
}
preprocessing/depth_anything_v2/__init__.py (0 lines, new file)

preprocessing/depth_anything_v2/depth.py (56 lines, new file)

@@ -0,0 +1,56 @@
# -*- coding: utf-8 -*-
# Copyright (c) Alibaba, Inc. and its affiliates.
import numpy as np
import torch
from einops import rearrange
from PIL import Image


def convert_to_numpy(image):
    if isinstance(image, Image.Image):
        image = np.array(image)
    elif isinstance(image, torch.Tensor):
        image = image.detach().cpu().numpy()
    elif isinstance(image, np.ndarray):
        image = image.copy()
    else:
        raise TypeError(f'Unsupported datatype {type(image)}, only support np.ndarray, torch.Tensor, Pillow Image.')
    return image


class DepthV2Annotator:
    def __init__(self, cfg, device=None):
        from .dpt import DepthAnythingV2
        pretrained_model = cfg['PRETRAINED_MODEL']
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") if device is None else device
        self.model = DepthAnythingV2(encoder='vitl', features=256, out_channels=[256, 512, 1024, 1024]).to(self.device)
        self.model.load_state_dict(
            torch.load(
                pretrained_model,
                map_location=self.device
            )
        )
        self.model.eval()

    @torch.inference_mode()
    @torch.autocast('cuda', enabled=False)
    def forward(self, image):
        image = convert_to_numpy(image)
        depth = self.model.infer_image(image)

        # Normalize the raw depth map to [0, 255] and replicate it to 3 channels.
        depth_pt = depth.copy()
        depth_pt -= np.min(depth_pt)
        depth_pt /= np.max(depth_pt)
        depth_image = (depth_pt * 255.0).clip(0, 255).astype(np.uint8)

        depth_image = depth_image[..., np.newaxis]
        depth_image = np.repeat(depth_image, 3, axis=2)
        return depth_image


class DepthV2VideoAnnotator(DepthV2Annotator):
    def forward(self, frames):
        ret_frames = []
        for frame in frames:
            anno_frame = super().forward(np.array(frame))
            ret_frames.append(anno_frame)
        return ret_frames
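A hypothetical standalone use of the new annotator (the checkpoint path below is a placeholder; inside WanGP the config is wired up automatically):

```python
import numpy as np
from preprocessing.depth_anything_v2.depth import DepthV2VideoAnnotator

# 'PRETRAINED_MODEL' path is an assumed placeholder for the Depth Anything V2 ViT-L checkpoint.
annotator = DepthV2VideoAnnotator({"PRETRAINED_MODEL": "ckpts/depth_anything_v2_vitl.pth"})
frames = [np.zeros((480, 832, 3), dtype=np.uint8) for _ in range(4)]  # dummy RGB frames
depth_frames = annotator.forward(frames)  # list of H x W x 3 uint8 depth maps
```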
preprocessing/depth_anything_v2/dinov2.py (414 lines, new file)

@@ -0,0 +1,414 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

# References:
#   https://github.com/facebookresearch/dino/blob/main/vision_transformer.py
#   https://github.com/rwightman/pytorch-image-models/tree/master/timm/models/vision_transformer.py

from functools import partial
import math
import logging
from typing import Sequence, Tuple, Union, Callable

import torch
import torch.nn as nn
import torch.utils.checkpoint
from torch.nn.init import trunc_normal_

from .layers import Mlp, PatchEmbed, SwiGLUFFNFused, MemEffAttention, NestedTensorBlock as Block

logger = logging.getLogger("dinov2")


def named_apply(fn: Callable, module: nn.Module, name="", depth_first=True, include_root=False) -> nn.Module:
    if not depth_first and include_root:
        fn(module=module, name=name)
    for child_name, child_module in module.named_children():
        child_name = ".".join((name, child_name)) if name else child_name
        named_apply(fn=fn, module=child_module, name=child_name, depth_first=depth_first, include_root=True)
    if depth_first and include_root:
        fn(module=module, name=name)
    return module


class BlockChunk(nn.ModuleList):
    def forward(self, x):
        for b in self:
            x = b(x)
        return x


class DinoVisionTransformer(nn.Module):
    def __init__(
            self,
            img_size=224,
            patch_size=16,
            in_chans=3,
            embed_dim=768,
            depth=12,
            num_heads=12,
            mlp_ratio=4.0,
            qkv_bias=True,
            ffn_bias=True,
            proj_bias=True,
            drop_path_rate=0.0,
            drop_path_uniform=False,
            init_values=None,  # for layerscale: None or 0 => no layerscale
            embed_layer=PatchEmbed,
            act_layer=nn.GELU,
            block_fn=Block,
            ffn_layer="mlp",
            block_chunks=1,
            num_register_tokens=0,
            interpolate_antialias=False,
            interpolate_offset=0.1,
    ):
        """
        Args:
            img_size (int, tuple): input image size
            patch_size (int, tuple): patch size
            in_chans (int): number of input channels
            embed_dim (int): embedding dimension
            depth (int): depth of transformer
            num_heads (int): number of attention heads
            mlp_ratio (int): ratio of mlp hidden dim to embedding dim
            qkv_bias (bool): enable bias for qkv if True
            proj_bias (bool): enable bias for proj in attn if True
            ffn_bias (bool): enable bias for ffn if True
            drop_path_rate (float): stochastic depth rate
            drop_path_uniform (bool): apply uniform drop rate across blocks
            weight_init (str): weight init scheme
            init_values (float): layer-scale init values
            embed_layer (nn.Module): patch embedding layer
            act_layer (nn.Module): MLP activation layer
            block_fn (nn.Module): transformer block class
            ffn_layer (str): "mlp", "swiglu", "swiglufused" or "identity"
            block_chunks: (int) split block sequence into block_chunks units for FSDP wrap
            num_register_tokens: (int) number of extra cls tokens (so-called "registers")
            interpolate_antialias: (str) flag to apply anti-aliasing when interpolating positional embeddings
            interpolate_offset: (float) work-around offset to apply when interpolating positional embeddings
        """
        super().__init__()
        norm_layer = partial(nn.LayerNorm, eps=1e-6)

        self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models
        self.num_tokens = 1
        self.n_blocks = depth
        self.num_heads = num_heads
        self.patch_size = patch_size
        self.num_register_tokens = num_register_tokens
        self.interpolate_antialias = interpolate_antialias
        self.interpolate_offset = interpolate_offset

        self.patch_embed = embed_layer(img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.num_tokens, embed_dim))
        assert num_register_tokens >= 0
        self.register_tokens = (
            nn.Parameter(torch.zeros(1, num_register_tokens, embed_dim)) if num_register_tokens else None
        )

        if drop_path_uniform is True:
            dpr = [drop_path_rate] * depth
        else:
            dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule

        if ffn_layer == "mlp":
            logger.info("using MLP layer as FFN")
            ffn_layer = Mlp
        elif ffn_layer == "swiglufused" or ffn_layer == "swiglu":
            logger.info("using SwiGLU layer as FFN")
            ffn_layer = SwiGLUFFNFused
        elif ffn_layer == "identity":
            logger.info("using Identity layer as FFN")

            def f(*args, **kwargs):
                return nn.Identity()

            ffn_layer = f
        else:
            raise NotImplementedError

        blocks_list = [
            block_fn(
                dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                proj_bias=proj_bias,
                ffn_bias=ffn_bias,
                drop_path=dpr[i],
                norm_layer=norm_layer,
                act_layer=act_layer,
                ffn_layer=ffn_layer,
                init_values=init_values,
            )
            for i in range(depth)
        ]
        if block_chunks > 0:
            self.chunked_blocks = True
            chunked_blocks = []
            chunksize = depth // block_chunks
            for i in range(0, depth, chunksize):
                # this is to keep the block index consistent if we chunk the block list
                chunked_blocks.append([nn.Identity()] * i + blocks_list[i: i + chunksize])
            self.blocks = nn.ModuleList([BlockChunk(p) for p in chunked_blocks])
        else:
            self.chunked_blocks = False
            self.blocks = nn.ModuleList(blocks_list)

        self.norm = norm_layer(embed_dim)
        self.head = nn.Identity()

        self.mask_token = nn.Parameter(torch.zeros(1, embed_dim))

        self.init_weights()

    def init_weights(self):
        trunc_normal_(self.pos_embed, std=0.02)
        nn.init.normal_(self.cls_token, std=1e-6)
        if self.register_tokens is not None:
            nn.init.normal_(self.register_tokens, std=1e-6)
        named_apply(init_weights_vit_timm, self)

    def interpolate_pos_encoding(self, x, w, h):
        previous_dtype = x.dtype
        npatch = x.shape[1] - 1
        N = self.pos_embed.shape[1] - 1
        if npatch == N and w == h:
            return self.pos_embed
        pos_embed = self.pos_embed.float()
        class_pos_embed = pos_embed[:, 0]
        patch_pos_embed = pos_embed[:, 1:]
        dim = x.shape[-1]
        w0 = w // self.patch_size
        h0 = h // self.patch_size
        # we add a small number to avoid floating point error in the interpolation
        # see discussion at https://github.com/facebookresearch/dino/issues/8
        # DINOv2 with register modify the interpolate_offset from 0.1 to 0.0
        w0, h0 = w0 + self.interpolate_offset, h0 + self.interpolate_offset
        # w0, h0 = w0 + 0.1, h0 + 0.1

        sqrt_N = math.sqrt(N)
        sx, sy = float(w0) / sqrt_N, float(h0) / sqrt_N
        patch_pos_embed = nn.functional.interpolate(
            patch_pos_embed.reshape(1, int(sqrt_N), int(sqrt_N), dim).permute(0, 3, 1, 2),
            scale_factor=(sx, sy),
            # (int(w0), int(h0)), # to solve the upsampling shape issue
            mode="bicubic",
            antialias=self.interpolate_antialias
        )

        assert int(w0) == patch_pos_embed.shape[-2]
        assert int(h0) == patch_pos_embed.shape[-1]
        patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
        return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1).to(previous_dtype)

    def prepare_tokens_with_masks(self, x, masks=None):
        B, nc, w, h = x.shape
        x = self.patch_embed(x)
        if masks is not None:
            x = torch.where(masks.unsqueeze(-1), self.mask_token.to(x.dtype).unsqueeze(0), x)

        x = torch.cat((self.cls_token.expand(x.shape[0], -1, -1), x), dim=1)
        x = x + self.interpolate_pos_encoding(x, w, h)

        if self.register_tokens is not None:
            x = torch.cat(
                (
                    x[:, :1],
                    self.register_tokens.expand(x.shape[0], -1, -1),
                    x[:, 1:],
                ),
                dim=1,
            )

        return x

    def forward_features_list(self, x_list, masks_list):
        x = [self.prepare_tokens_with_masks(x, masks) for x, masks in zip(x_list, masks_list)]
        for blk in self.blocks:
            x = blk(x)

        all_x = x
        output = []
        for x, masks in zip(all_x, masks_list):
            x_norm = self.norm(x)
            output.append(
                {
                    "x_norm_clstoken": x_norm[:, 0],
                    "x_norm_regtokens": x_norm[:, 1: self.num_register_tokens + 1],
                    "x_norm_patchtokens": x_norm[:, self.num_register_tokens + 1:],
                    "x_prenorm": x,
                    "masks": masks,
                }
            )
        return output

    def forward_features(self, x, masks=None):
        if isinstance(x, list):
            return self.forward_features_list(x, masks)

        x = self.prepare_tokens_with_masks(x, masks)

        for blk in self.blocks:
            x = blk(x)

        x_norm = self.norm(x)
        return {
            "x_norm_clstoken": x_norm[:, 0],
            "x_norm_regtokens": x_norm[:, 1: self.num_register_tokens + 1],
            "x_norm_patchtokens": x_norm[:, self.num_register_tokens + 1:],
            "x_prenorm": x,
            "masks": masks,
        }

    def _get_intermediate_layers_not_chunked(self, x, n=1):
        x = self.prepare_tokens_with_masks(x)
        # If n is an int, take the n last blocks. If it's a list, take them
        output, total_block_len = [], len(self.blocks)
        blocks_to_take = range(total_block_len - n, total_block_len) if isinstance(n, int) else n
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in blocks_to_take:
                output.append(x)
        assert len(output) == len(blocks_to_take), f"only {len(output)} / {len(blocks_to_take)} blocks found"
        return output

    def _get_intermediate_layers_chunked(self, x, n=1):
        x = self.prepare_tokens_with_masks(x)
        output, i, total_block_len = [], 0, len(self.blocks[-1])
        # If n is an int, take the n last blocks. If it's a list, take them
        blocks_to_take = range(total_block_len - n, total_block_len) if isinstance(n, int) else n
        for block_chunk in self.blocks:
            for blk in block_chunk[i:]:  # Passing the nn.Identity()
                x = blk(x)
                if i in blocks_to_take:
                    output.append(x)
                i += 1
        assert len(output) == len(blocks_to_take), f"only {len(output)} / {len(blocks_to_take)} blocks found"
        return output

    def get_intermediate_layers(
            self,
            x: torch.Tensor,
            n: Union[int, Sequence] = 1,  # Layers or n last layers to take
            reshape: bool = False,
            return_class_token: bool = False,
            norm=True
    ) -> Tuple[Union[torch.Tensor, Tuple[torch.Tensor]]]:
        if self.chunked_blocks:
            outputs = self._get_intermediate_layers_chunked(x, n)
        else:
            outputs = self._get_intermediate_layers_not_chunked(x, n)
        if norm:
            outputs = [self.norm(out) for out in outputs]
        class_tokens = [out[:, 0] for out in outputs]
        outputs = [out[:, 1 + self.num_register_tokens:] for out in outputs]
        if reshape:
            B, _, w, h = x.shape
            outputs = [
                out.reshape(B, w // self.patch_size, h // self.patch_size, -1).permute(0, 3, 1, 2).contiguous()
                for out in outputs
            ]
        if return_class_token:
            return tuple(zip(outputs, class_tokens))
        return tuple(outputs)

    def forward(self, *args, is_training=False, **kwargs):
        ret = self.forward_features(*args, **kwargs)
        if is_training:
            return ret
        else:
            return self.head(ret["x_norm_clstoken"])


def init_weights_vit_timm(module: nn.Module, name: str = ""):
    """ViT weight initialization, original timm impl (for reproducibility)"""
    if isinstance(module, nn.Linear):
        trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)


def vit_small(patch_size=16, num_register_tokens=0, **kwargs):
    model = DinoVisionTransformer(
        patch_size=patch_size,
        embed_dim=384,
        depth=12,
        num_heads=6,
        mlp_ratio=4,
        block_fn=partial(Block, attn_class=MemEffAttention),
        num_register_tokens=num_register_tokens,
        **kwargs,
    )
    return model


def vit_base(patch_size=16, num_register_tokens=0, **kwargs):
    model = DinoVisionTransformer(
        patch_size=patch_size,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4,
        block_fn=partial(Block, attn_class=MemEffAttention),
        num_register_tokens=num_register_tokens,
        **kwargs,
    )
    return model


def vit_large(patch_size=16, num_register_tokens=0, **kwargs):
    model = DinoVisionTransformer(
        patch_size=patch_size,
        embed_dim=1024,
        depth=24,
        num_heads=16,
        mlp_ratio=4,
        block_fn=partial(Block, attn_class=MemEffAttention),
        num_register_tokens=num_register_tokens,
        **kwargs,
    )
    return model


def vit_giant2(patch_size=16, num_register_tokens=0, **kwargs):
    """
    Close to ViT-giant, with embed-dim 1536 and 24 heads => embed-dim per head 64
    """
    model = DinoVisionTransformer(
        patch_size=patch_size,
        embed_dim=1536,
        depth=40,
        num_heads=24,
        mlp_ratio=4,
        block_fn=partial(Block, attn_class=MemEffAttention),
        num_register_tokens=num_register_tokens,
        **kwargs,
    )
    return model


def DINOv2(model_name):
    model_zoo = {
        "vits": vit_small,
        "vitb": vit_base,
        "vitl": vit_large,
        "vitg": vit_giant2
    }

    return model_zoo[model_name](
        img_size=518,
        patch_size=14,
        init_values=1.0,
 | 
			
		||||
        ffn_layer="mlp" if model_name != "vitg" else "swiglufused",
 | 
			
		||||
        block_chunks=0,
 | 
			
		||||
        num_register_tokens=0,
 | 
			
		||||
        interpolate_antialias=False,
 | 
			
		||||
        interpolate_offset=0.1
 | 
			
		||||
    )
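
A quick sketch (not part of the commit) of how the factory above is typically driven: DINOv2("vitl") builds a ViT-L/14 backbone and get_intermediate_layers pulls the four feature maps the DPT head consumes. The import path is assumed from this commit's layout.

import torch
from preprocessing.depth_anything_v2.dinov2 import DINOv2   # path assumed from this commit

backbone = DINOv2("vitl")                      # ViT-L/14 backbone, trained at img_size=518
x = torch.randn(1, 3, 518, 518)                # H and W must be multiples of the 14-pixel patch
feats = backbone.get_intermediate_layers(x, [4, 11, 17, 23], return_class_token=True)
patch_tokens, cls_token = feats[0]             # (1, 37 * 37, 1024) and (1, 1024)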

210  preprocessing/depth_anything_v2/dpt.py  Normal file
@ -0,0 +1,210 @@
# -*- coding: utf-8 -*-
# Copyright (c) Alibaba, Inc. and its affiliates.
import cv2
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.transforms import Compose

from .dinov2 import DINOv2
from .util.blocks import FeatureFusionBlock, _make_scratch
from .util.transform import Resize, NormalizeImage, PrepareForNet


class DepthAnythingV2(nn.Module):
    def __init__(
            self,
            encoder='vitl',
            features=256,
            out_channels=[256, 512, 1024, 1024],
            use_bn=False,
            use_clstoken=False
    ):
        super(DepthAnythingV2, self).__init__()

        self.intermediate_layer_idx = {
            'vits': [2, 5, 8, 11],
            'vitb': [2, 5, 8, 11],
            'vitl': [4, 11, 17, 23],
            'vitg': [9, 19, 29, 39]
        }

        self.encoder = encoder
        self.pretrained = DINOv2(model_name=encoder)

        self.depth_head = DPTHead(self.pretrained.embed_dim, features, use_bn, out_channels=out_channels,
                                  use_clstoken=use_clstoken)

    def forward(self, x):
        patch_h, patch_w = x.shape[-2] // 14, x.shape[-1] // 14

        features = self.pretrained.get_intermediate_layers(x, self.intermediate_layer_idx[self.encoder],
                                                           return_class_token=True)

        depth = self.depth_head(features, patch_h, patch_w)
        depth = F.relu(depth)

        return depth.squeeze(1)

    @torch.no_grad()
    def infer_image(self, raw_image, input_size=518):
        image, (h, w) = self.image2tensor(raw_image, input_size)

        depth = self.forward(image)
        depth = F.interpolate(depth[:, None], (h, w), mode="bilinear", align_corners=True)[0, 0]

        return depth.cpu().numpy()

    def image2tensor(self, raw_image, input_size=518):
        transform = Compose([
            Resize(
                width=input_size,
                height=input_size,
                resize_target=False,
                keep_aspect_ratio=True,
                ensure_multiple_of=14,
                resize_method='lower_bound',
                image_interpolation_method=cv2.INTER_CUBIC,
            ),
            NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
            PrepareForNet(),
        ])

        h, w = raw_image.shape[:2]

        image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB) / 255.0

        image = transform({'image': image})['image']
        image = torch.from_numpy(image).unsqueeze(0)

        DEVICE = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
        image = image.to(DEVICE)

        return image, (h, w)


class DPTHead(nn.Module):
    def __init__(
            self,
            in_channels,
            features=256,
            use_bn=False,
            out_channels=[256, 512, 1024, 1024],
            use_clstoken=False
    ):
        super(DPTHead, self).__init__()

        self.use_clstoken = use_clstoken

        self.projects = nn.ModuleList([
            nn.Conv2d(
                in_channels=in_channels,
                out_channels=out_channel,
                kernel_size=1,
                stride=1,
                padding=0,
            ) for out_channel in out_channels
        ])

        self.resize_layers = nn.ModuleList([
            nn.ConvTranspose2d(
                in_channels=out_channels[0],
                out_channels=out_channels[0],
                kernel_size=4,
                stride=4,
                padding=0),
            nn.ConvTranspose2d(
                in_channels=out_channels[1],
                out_channels=out_channels[1],
                kernel_size=2,
                stride=2,
                padding=0),
            nn.Identity(),
            nn.Conv2d(
                in_channels=out_channels[3],
                out_channels=out_channels[3],
                kernel_size=3,
                stride=2,
                padding=1)
        ])

        if use_clstoken:
            self.readout_projects = nn.ModuleList()
            for _ in range(len(self.projects)):
                self.readout_projects.append(
                    nn.Sequential(
                        nn.Linear(2 * in_channels, in_channels),
                        nn.GELU()))

        self.scratch = _make_scratch(
            out_channels,
            features,
            groups=1,
            expand=False,
        )

        self.scratch.stem_transpose = None

        self.scratch.refinenet1 = _make_fusion_block(features, use_bn)
        self.scratch.refinenet2 = _make_fusion_block(features, use_bn)
        self.scratch.refinenet3 = _make_fusion_block(features, use_bn)
        self.scratch.refinenet4 = _make_fusion_block(features, use_bn)

        head_features_1 = features
        head_features_2 = 32

        self.scratch.output_conv1 = nn.Conv2d(head_features_1, head_features_1 // 2, kernel_size=3, stride=1, padding=1)
        self.scratch.output_conv2 = nn.Sequential(
            nn.Conv2d(head_features_1 // 2, head_features_2, kernel_size=3, stride=1, padding=1),
            nn.ReLU(True),
            nn.Conv2d(head_features_2, 1, kernel_size=1, stride=1, padding=0),
            nn.ReLU(True),
            nn.Identity(),
        )

    def forward(self, out_features, patch_h, patch_w):
        out = []
        for i, x in enumerate(out_features):
            if self.use_clstoken:
                x, cls_token = x[0], x[1]
                readout = cls_token.unsqueeze(1).expand_as(x)
                x = self.readout_projects[i](torch.cat((x, readout), -1))
            else:
                x = x[0]

            x = x.permute(0, 2, 1).reshape((x.shape[0], x.shape[-1], patch_h, patch_w))

            x = self.projects[i](x)
            x = self.resize_layers[i](x)

            out.append(x)

        layer_1, layer_2, layer_3, layer_4 = out

        layer_1_rn = self.scratch.layer1_rn(layer_1)
        layer_2_rn = self.scratch.layer2_rn(layer_2)
        layer_3_rn = self.scratch.layer3_rn(layer_3)
        layer_4_rn = self.scratch.layer4_rn(layer_4)

        path_4 = self.scratch.refinenet4(layer_4_rn, size=layer_3_rn.shape[2:])
        path_3 = self.scratch.refinenet3(path_4, layer_3_rn, size=layer_2_rn.shape[2:])
        path_2 = self.scratch.refinenet2(path_3, layer_2_rn, size=layer_1_rn.shape[2:])
        path_1 = self.scratch.refinenet1(path_2, layer_1_rn)

        out = self.scratch.output_conv1(path_1)
        out = F.interpolate(out, (int(patch_h * 14), int(patch_w * 14)), mode="bilinear", align_corners=True)
        out = self.scratch.output_conv2(out)

        return out


def _make_fusion_block(features, use_bn, size=None):
    return FeatureFusionBlock(
        features,
        nn.ReLU(False),
        deconv=False,
        bn=use_bn,
        expand=False,
        align_corners=True,
        size=size,
    )
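
For context (not part of the diff), this is roughly how the new depth pass drives the class above on a single frame; the checkpoint filename is hypothetical, the real weights are whatever WanGP downloads for the Depth Anything 2 extractor.

import cv2
import torch
from preprocessing.depth_anything_v2.dpt import DepthAnythingV2   # path assumed from this commit

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'

model = DepthAnythingV2(encoder='vitl', features=256, out_channels=[256, 512, 1024, 1024])
model.load_state_dict(torch.load('depth_anything_v2_vitl.pth', map_location='cpu'))  # hypothetical filename
model = model.to(device).eval()

frame = cv2.imread('frame_0001.png')           # BGR uint8, as OpenCV reads it
depth = model.infer_image(frame)               # float32 numpy array with the input's H x W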

11  preprocessing/depth_anything_v2/layers/__init__.py  Normal file
@ -0,0 +1,11 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

from .mlp import Mlp
from .patch_embed import PatchEmbed
from .swiglu_ffn import SwiGLUFFN, SwiGLUFFNFused
from .block import NestedTensorBlock
from .attention import MemEffAttention

79  preprocessing/depth_anything_v2/layers/attention.py  Normal file
@ -0,0 +1,79 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

# References:
#   https://github.com/facebookresearch/dino/blob/master/vision_transformer.py
#   https://github.com/rwightman/pytorch-image-models/tree/master/timm/models/vision_transformer.py

import logging

from torch import Tensor
from torch import nn

logger = logging.getLogger("dinov2")

try:
    from xformers.ops import memory_efficient_attention, unbind, fmha

    XFORMERS_AVAILABLE = True
except ImportError:
    logger.warning("xFormers not available")
    XFORMERS_AVAILABLE = False


class Attention(nn.Module):
    def __init__(
            self,
            dim: int,
            num_heads: int = 8,
            qkv_bias: bool = False,
            proj_bias: bool = True,
            attn_drop: float = 0.0,
            proj_drop: float = 0.0,
    ) -> None:
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim, bias=proj_bias)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x: Tensor) -> Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)

        q, k, v = qkv[0] * self.scale, qkv[1], qkv[2]
        attn = q @ k.transpose(-2, -1)

        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x


class MemEffAttention(Attention):
    def forward(self, x: Tensor, attn_bias=None) -> Tensor:
        if not XFORMERS_AVAILABLE:
            assert attn_bias is None, "xFormers is required for nested tensors usage"
            return super().forward(x)

        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)

        q, k, v = unbind(qkv, 2)

        x = memory_efficient_attention(q, k, v, attn_bias=attn_bias)
        x = x.reshape([B, N, C])

        x = self.proj(x)
        x = self.proj_drop(x)
        return x
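
A small, hedged shape check (not in the diff, import path assumed): MemEffAttention only diverges from the plain path when xFormers is installed, in which case it calls memory_efficient_attention with an optional attn_bias; otherwise it routes through Attention.forward, so shapes are preserved either way.

import torch
from preprocessing.depth_anything_v2.layers.attention import Attention  # path assumed

attn = Attention(dim=64, num_heads=8, qkv_bias=True)
tokens = torch.randn(2, 16, 64)                # (batch, tokens, dim)
out = attn(tokens)                             # attention keeps the (2, 16, 64) shape
assert out.shape == tokens.shape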

252  preprocessing/depth_anything_v2/layers/block.py  Normal file
@ -0,0 +1,252 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

# References:
#   https://github.com/facebookresearch/dino/blob/master/vision_transformer.py
#   https://github.com/rwightman/pytorch-image-models/tree/master/timm/layers/patch_embed.py

import logging
from typing import Callable, List, Any, Tuple, Dict

import torch
from torch import nn, Tensor

from .attention import Attention, MemEffAttention
from .drop_path import DropPath
from .layer_scale import LayerScale
from .mlp import Mlp


logger = logging.getLogger("dinov2")


try:
    from xformers.ops import fmha
    from xformers.ops import scaled_index_add, index_select_cat

    XFORMERS_AVAILABLE = True
except ImportError:
    # logger.warning("xFormers not available")
    XFORMERS_AVAILABLE = False


class Block(nn.Module):
    def __init__(
        self,
        dim: int,
        num_heads: int,
        mlp_ratio: float = 4.0,
        qkv_bias: bool = False,
        proj_bias: bool = True,
        ffn_bias: bool = True,
        drop: float = 0.0,
        attn_drop: float = 0.0,
        init_values=None,
        drop_path: float = 0.0,
        act_layer: Callable[..., nn.Module] = nn.GELU,
        norm_layer: Callable[..., nn.Module] = nn.LayerNorm,
        attn_class: Callable[..., nn.Module] = Attention,
        ffn_layer: Callable[..., nn.Module] = Mlp,
    ) -> None:
        super().__init__()
        # print(f"biases: qkv: {qkv_bias}, proj: {proj_bias}, ffn: {ffn_bias}")
        self.norm1 = norm_layer(dim)
        self.attn = attn_class(
            dim,
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            proj_bias=proj_bias,
            attn_drop=attn_drop,
            proj_drop=drop,
        )
        self.ls1 = LayerScale(dim, init_values=init_values) if init_values else nn.Identity()
        self.drop_path1 = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()

        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = ffn_layer(
            in_features=dim,
            hidden_features=mlp_hidden_dim,
            act_layer=act_layer,
            drop=drop,
            bias=ffn_bias,
        )
        self.ls2 = LayerScale(dim, init_values=init_values) if init_values else nn.Identity()
        self.drop_path2 = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()

        self.sample_drop_ratio = drop_path

    def forward(self, x: Tensor) -> Tensor:
        def attn_residual_func(x: Tensor) -> Tensor:
            return self.ls1(self.attn(self.norm1(x)))

        def ffn_residual_func(x: Tensor) -> Tensor:
            return self.ls2(self.mlp(self.norm2(x)))

        if self.training and self.sample_drop_ratio > 0.1:
            # the overhead is compensated only for a drop path rate larger than 0.1
            x = drop_add_residual_stochastic_depth(
                x,
                residual_func=attn_residual_func,
                sample_drop_ratio=self.sample_drop_ratio,
            )
            x = drop_add_residual_stochastic_depth(
                x,
                residual_func=ffn_residual_func,
                sample_drop_ratio=self.sample_drop_ratio,
            )
        elif self.training and self.sample_drop_ratio > 0.0:
            x = x + self.drop_path1(attn_residual_func(x))
            x = x + self.drop_path1(ffn_residual_func(x))  # FIXME: drop_path2
        else:
            x = x + attn_residual_func(x)
            x = x + ffn_residual_func(x)
        return x


def drop_add_residual_stochastic_depth(
    x: Tensor,
    residual_func: Callable[[Tensor], Tensor],
    sample_drop_ratio: float = 0.0,
) -> Tensor:
    # 1) extract subset using permutation
    b, n, d = x.shape
    sample_subset_size = max(int(b * (1 - sample_drop_ratio)), 1)
    brange = (torch.randperm(b, device=x.device))[:sample_subset_size]
    x_subset = x[brange]

    # 2) apply residual_func to get residual
    residual = residual_func(x_subset)

    x_flat = x.flatten(1)
    residual = residual.flatten(1)

    residual_scale_factor = b / sample_subset_size

    # 3) add the residual
    x_plus_residual = torch.index_add(x_flat, 0, brange, residual.to(dtype=x.dtype), alpha=residual_scale_factor)
    return x_plus_residual.view_as(x)


def get_branges_scales(x, sample_drop_ratio=0.0):
    b, n, d = x.shape
    sample_subset_size = max(int(b * (1 - sample_drop_ratio)), 1)
    brange = (torch.randperm(b, device=x.device))[:sample_subset_size]
    residual_scale_factor = b / sample_subset_size
    return brange, residual_scale_factor


def add_residual(x, brange, residual, residual_scale_factor, scaling_vector=None):
    if scaling_vector is None:
        x_flat = x.flatten(1)
        residual = residual.flatten(1)
        x_plus_residual = torch.index_add(x_flat, 0, brange, residual.to(dtype=x.dtype), alpha=residual_scale_factor)
    else:
        x_plus_residual = scaled_index_add(
            x, brange, residual.to(dtype=x.dtype), scaling=scaling_vector, alpha=residual_scale_factor
        )
    return x_plus_residual


attn_bias_cache: Dict[Tuple, Any] = {}


def get_attn_bias_and_cat(x_list, branges=None):
    """
    this will perform the index select, cat the tensors, and provide the attn_bias from cache
    """
    batch_sizes = [b.shape[0] for b in branges] if branges is not None else [x.shape[0] for x in x_list]
    all_shapes = tuple((b, x.shape[1]) for b, x in zip(batch_sizes, x_list))
    if all_shapes not in attn_bias_cache.keys():
        seqlens = []
        for b, x in zip(batch_sizes, x_list):
            for _ in range(b):
                seqlens.append(x.shape[1])
        attn_bias = fmha.BlockDiagonalMask.from_seqlens(seqlens)
        attn_bias._batch_sizes = batch_sizes
        attn_bias_cache[all_shapes] = attn_bias

    if branges is not None:
        cat_tensors = index_select_cat([x.flatten(1) for x in x_list], branges).view(1, -1, x_list[0].shape[-1])
    else:
        tensors_bs1 = tuple(x.reshape([1, -1, *x.shape[2:]]) for x in x_list)
        cat_tensors = torch.cat(tensors_bs1, dim=1)

    return attn_bias_cache[all_shapes], cat_tensors


def drop_add_residual_stochastic_depth_list(
    x_list: List[Tensor],
    residual_func: Callable[[Tensor, Any], Tensor],
    sample_drop_ratio: float = 0.0,
    scaling_vector=None,
) -> Tensor:
    # 1) generate random set of indices for dropping samples in the batch
    branges_scales = [get_branges_scales(x, sample_drop_ratio=sample_drop_ratio) for x in x_list]
    branges = [s[0] for s in branges_scales]
    residual_scale_factors = [s[1] for s in branges_scales]

    # 2) get attention bias and index+concat the tensors
    attn_bias, x_cat = get_attn_bias_and_cat(x_list, branges)

    # 3) apply residual_func to get residual, and split the result
    residual_list = attn_bias.split(residual_func(x_cat, attn_bias=attn_bias))  # type: ignore

    outputs = []
    for x, brange, residual, residual_scale_factor in zip(x_list, branges, residual_list, residual_scale_factors):
        outputs.append(add_residual(x, brange, residual, residual_scale_factor, scaling_vector).view_as(x))
    return outputs


class NestedTensorBlock(Block):
    def forward_nested(self, x_list: List[Tensor]) -> List[Tensor]:
        """
        x_list contains a list of tensors to nest together and run
        """
        assert isinstance(self.attn, MemEffAttention)

        if self.training and self.sample_drop_ratio > 0.0:

            def attn_residual_func(x: Tensor, attn_bias=None) -> Tensor:
                return self.attn(self.norm1(x), attn_bias=attn_bias)

            def ffn_residual_func(x: Tensor, attn_bias=None) -> Tensor:
                return self.mlp(self.norm2(x))

            x_list = drop_add_residual_stochastic_depth_list(
                x_list,
                residual_func=attn_residual_func,
                sample_drop_ratio=self.sample_drop_ratio,
                scaling_vector=self.ls1.gamma if isinstance(self.ls1, LayerScale) else None,
            )
            x_list = drop_add_residual_stochastic_depth_list(
                x_list,
                residual_func=ffn_residual_func,
                sample_drop_ratio=self.sample_drop_ratio,
                scaling_vector=self.ls2.gamma if isinstance(self.ls1, LayerScale) else None,
            )
            return x_list
        else:

            def attn_residual_func(x: Tensor, attn_bias=None) -> Tensor:
                return self.ls1(self.attn(self.norm1(x), attn_bias=attn_bias))

            def ffn_residual_func(x: Tensor, attn_bias=None) -> Tensor:
                return self.ls2(self.mlp(self.norm2(x)))

            attn_bias, x = get_attn_bias_and_cat(x_list)
            x = x + attn_residual_func(x, attn_bias=attn_bias)
            x = x + ffn_residual_func(x)
            return attn_bias.split(x)

    def forward(self, x_or_x_list):
        if isinstance(x_or_x_list, Tensor):
            return super().forward(x_or_x_list)
        elif isinstance(x_or_x_list, list):
            assert XFORMERS_AVAILABLE, "Please install xFormers for nested tensors usage"
            return self.forward_nested(x_or_x_list)
        else:
            raise AssertionError

34  preprocessing/depth_anything_v2/layers/drop_path.py  Normal file
@ -0,0 +1,34 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

# References:
#   https://github.com/facebookresearch/dino/blob/master/vision_transformer.py
#   https://github.com/rwightman/pytorch-image-models/tree/master/timm/layers/drop.py

from torch import nn


def drop_path(x, drop_prob: float = 0.0, training: bool = False):
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = x.new_empty(shape).bernoulli_(keep_prob)
    if keep_prob > 0.0:
        random_tensor.div_(keep_prob)
    output = x * random_tensor
    return output


class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""

    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)

28  preprocessing/depth_anything_v2/layers/layer_scale.py  Normal file
@ -0,0 +1,28 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
# Modified from: https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py


from typing import Union

import torch
from torch import Tensor
from torch import nn


class LayerScale(nn.Module):
    def __init__(
        self,
        dim: int,
        init_values: Union[float, Tensor] = 1e-5,
        inplace: bool = False,
    ) -> None:
        super().__init__()
        self.inplace = inplace
        self.gamma = nn.Parameter(init_values * torch.ones(dim))

    def forward(self, x: Tensor) -> Tensor:
        return x.mul_(self.gamma) if self.inplace else x * self.gamma

39  preprocessing/depth_anything_v2/layers/mlp.py  Normal file
@ -0,0 +1,39 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

# References:
#   https://github.com/facebookresearch/dino/blob/master/vision_transformer.py
#   https://github.com/rwightman/pytorch-image-models/tree/master/timm/layers/mlp.py

from typing import Callable, Optional
from torch import Tensor, nn


class Mlp(nn.Module):
    def __init__(
        self,
        in_features: int,
        hidden_features: Optional[int] = None,
        out_features: Optional[int] = None,
        act_layer: Callable[..., nn.Module] = nn.GELU,
        drop: float = 0.0,
        bias: bool = True,
    ) -> None:
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features, bias=bias)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features, bias=bias)
        self.drop = nn.Dropout(drop)

    def forward(self, x: Tensor) -> Tensor:
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x

90  preprocessing/depth_anything_v2/layers/patch_embed.py  Normal file
@ -0,0 +1,90 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

# References:
#   https://github.com/facebookresearch/dino/blob/master/vision_transformer.py
#   https://github.com/rwightman/pytorch-image-models/tree/master/timm/layers/patch_embed.py

from typing import Callable, Optional, Tuple, Union

from torch import Tensor
import torch.nn as nn


def make_2tuple(x):
    if isinstance(x, tuple):
        assert len(x) == 2
        return x

    assert isinstance(x, int)
    return (x, x)


class PatchEmbed(nn.Module):
    """
    2D image to patch embedding: (B,C,H,W) -> (B,N,D)

    Args:
        img_size: Image size.
        patch_size: Patch token size.
        in_chans: Number of input image channels.
        embed_dim: Number of linear projection output channels.
        norm_layer: Normalization layer.
    """

    def __init__(
        self,
        img_size: Union[int, Tuple[int, int]] = 224,
        patch_size: Union[int, Tuple[int, int]] = 16,
        in_chans: int = 3,
        embed_dim: int = 768,
        norm_layer: Optional[Callable] = None,
        flatten_embedding: bool = True,
    ) -> None:
        super().__init__()

        image_HW = make_2tuple(img_size)
        patch_HW = make_2tuple(patch_size)
        patch_grid_size = (
            image_HW[0] // patch_HW[0],
            image_HW[1] // patch_HW[1],
        )

        self.img_size = image_HW
        self.patch_size = patch_HW
        self.patches_resolution = patch_grid_size
        self.num_patches = patch_grid_size[0] * patch_grid_size[1]

        self.in_chans = in_chans
        self.embed_dim = embed_dim

        self.flatten_embedding = flatten_embedding

        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_HW, stride=patch_HW)
        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

    def forward(self, x: Tensor) -> Tensor:
        _, _, H, W = x.shape
        patch_H, patch_W = self.patch_size

        assert H % patch_H == 0, f"Input image height {H} is not a multiple of patch height {patch_H}"
        assert W % patch_W == 0, f"Input image width {W} is not a multiple of patch width: {patch_W}"

        x = self.proj(x)  # B C H W
        H, W = x.size(2), x.size(3)
        x = x.flatten(2).transpose(1, 2)  # B HW C
        x = self.norm(x)
        if not self.flatten_embedding:
            x = x.reshape(-1, H, W, self.embed_dim)  # B H W C
        return x

    def flops(self) -> float:
        Ho, Wo = self.patches_resolution
        flops = Ho * Wo * self.embed_dim * self.in_chans * (self.patch_size[0] * self.patch_size[1])
        if self.norm is not None:
            flops += Ho * Wo * self.embed_dim
        return flops

64  preprocessing/depth_anything_v2/layers/swiglu_ffn.py  Normal file
@ -0,0 +1,64 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

from typing import Callable, Optional

from torch import Tensor, nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    def __init__(
        self,
        in_features: int,
        hidden_features: Optional[int] = None,
        out_features: Optional[int] = None,
        act_layer: Callable[..., nn.Module] = None,
        drop: float = 0.0,
        bias: bool = True,
    ) -> None:
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.w12 = nn.Linear(in_features, 2 * hidden_features, bias=bias)
        self.w3 = nn.Linear(hidden_features, out_features, bias=bias)

    def forward(self, x: Tensor) -> Tensor:
        x12 = self.w12(x)
        x1, x2 = x12.chunk(2, dim=-1)
        hidden = F.silu(x1) * x2
        return self.w3(hidden)


try:
    from xformers.ops import SwiGLU

    XFORMERS_AVAILABLE = True
except ImportError:
    SwiGLU = SwiGLUFFN
    XFORMERS_AVAILABLE = False


class SwiGLUFFNFused(SwiGLU):
    def __init__(
        self,
        in_features: int,
        hidden_features: Optional[int] = None,
        out_features: Optional[int] = None,
        act_layer: Callable[..., nn.Module] = None,
        drop: float = 0.0,
        bias: bool = True,
    ) -> None:
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        hidden_features = (int(hidden_features * 2 / 3) + 7) // 8 * 8
        super().__init__(
            in_features=in_features,
            hidden_features=hidden_features,
            out_features=out_features,
            bias=bias,
        )
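
Worth noting (my arithmetic, not part of the diff): SwiGLUFFNFused shrinks the requested hidden width to two thirds and rounds it up to a multiple of 8, so mlp_ratio=4 on the giant model's 1536-dim embedding gives 4096 gated hidden units rather than 6144.

hidden_features = 1536 * 4                     # what Block asks the ffn_layer for
fused_hidden = (int(hidden_features * 2 / 3) + 7) // 8 * 8
assert fused_hidden == 4096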

0  preprocessing/depth_anything_v2/util/__init__.py  Normal file

151  preprocessing/depth_anything_v2/util/blocks.py  Normal file
@ -0,0 +1,151 @@
import torch.nn as nn


def _make_scratch(in_shape, out_shape, groups=1, expand=False):
    scratch = nn.Module()

    out_shape1 = out_shape
    out_shape2 = out_shape
    out_shape3 = out_shape
    if len(in_shape) >= 4:
        out_shape4 = out_shape

    if expand:
        out_shape1 = out_shape
        out_shape2 = out_shape * 2
        out_shape3 = out_shape * 4
        if len(in_shape) >= 4:
            out_shape4 = out_shape * 8

    scratch.layer1_rn = nn.Conv2d(in_shape[0], out_shape1, kernel_size=3, stride=1, padding=1, bias=False,
                                  groups=groups)
    scratch.layer2_rn = nn.Conv2d(in_shape[1], out_shape2, kernel_size=3, stride=1, padding=1, bias=False,
                                  groups=groups)
    scratch.layer3_rn = nn.Conv2d(in_shape[2], out_shape3, kernel_size=3, stride=1, padding=1, bias=False,
                                  groups=groups)
    if len(in_shape) >= 4:
        scratch.layer4_rn = nn.Conv2d(in_shape[3], out_shape4, kernel_size=3, stride=1, padding=1, bias=False,
                                      groups=groups)

    return scratch


class ResidualConvUnit(nn.Module):
    """Residual convolution module.
    """

    def __init__(self, features, activation, bn):
        """Init.

        Args:
            features (int): number of features
        """
        super().__init__()

        self.bn = bn

        self.groups = 1

        self.conv1 = nn.Conv2d(features, features, kernel_size=3, stride=1, padding=1, bias=True, groups=self.groups)

        self.conv2 = nn.Conv2d(features, features, kernel_size=3, stride=1, padding=1, bias=True, groups=self.groups)

        if self.bn == True:
            self.bn1 = nn.BatchNorm2d(features)
            self.bn2 = nn.BatchNorm2d(features)

        self.activation = activation

        self.skip_add = nn.quantized.FloatFunctional()

    def forward(self, x):
        """Forward pass.

        Args:
            x (tensor): input

        Returns:
            tensor: output
        """

        out = self.activation(x)
        out = self.conv1(out)
        if self.bn == True:
            out = self.bn1(out)

        out = self.activation(out)
        out = self.conv2(out)
        if self.bn == True:
            out = self.bn2(out)

        if self.groups > 1:
            out = self.conv_merge(out)

        return self.skip_add.add(out, x)


class FeatureFusionBlock(nn.Module):
    """Feature fusion block.
    """

    def __init__(
            self,
            features,
            activation,
            deconv=False,
            bn=False,
            expand=False,
            align_corners=True,
            size=None
    ):
        """Init.

        Args:
            features (int): number of features
        """
        super(FeatureFusionBlock, self).__init__()

        self.deconv = deconv
        self.align_corners = align_corners

        self.groups = 1

        self.expand = expand
        out_features = features
        if self.expand == True:
            out_features = features // 2

        self.out_conv = nn.Conv2d(features, out_features, kernel_size=1, stride=1, padding=0, bias=True, groups=1)

        self.resConfUnit1 = ResidualConvUnit(features, activation, bn)
        self.resConfUnit2 = ResidualConvUnit(features, activation, bn)

        self.skip_add = nn.quantized.FloatFunctional()

        self.size = size

    def forward(self, *xs, size=None):
        """Forward pass.

        Returns:
            tensor: output
        """
        output = xs[0]

        if len(xs) == 2:
            res = self.resConfUnit1(xs[1])
            output = self.skip_add.add(output, res)

        output = self.resConfUnit2(output)

        if (size is None) and (self.size is None):
            modifier = {"scale_factor": 2}
        elif size is None:
            modifier = {"size": self.size}
        else:
            modifier = {"size": size}

        output = nn.functional.interpolate(output, **modifier, mode="bilinear", align_corners=self.align_corners)
        output = self.out_conv(output)

        return output

159  preprocessing/depth_anything_v2/util/transform.py  Normal file
@ -0,0 +1,159 @@
import cv2
import numpy as np


class Resize(object):
    """Resize sample to given size (width, height).
    """

    def __init__(
            self,
            width,
            height,
            resize_target=True,
            keep_aspect_ratio=False,
            ensure_multiple_of=1,
            resize_method="lower_bound",
            image_interpolation_method=cv2.INTER_AREA,
    ):
        """Init.

        Args:
            width (int): desired output width
            height (int): desired output height
            resize_target (bool, optional):
                True: Resize the full sample (image, mask, target).
                False: Resize image only.
                Defaults to True.
            keep_aspect_ratio (bool, optional):
                True: Keep the aspect ratio of the input sample.
                Output sample might not have the given width and height, and
                resize behaviour depends on the parameter 'resize_method'.
                Defaults to False.
            ensure_multiple_of (int, optional):
                Output width and height are constrained to be a multiple of this parameter.
                Defaults to 1.
            resize_method (str, optional):
                "lower_bound": Output will be at least as large as the given size.
                "upper_bound": Output will be at max as large as the given size. (Output size might be smaller than given size.)
                "minimal": Scale as little as possible.  (Output size might be smaller than given size.)
                Defaults to "lower_bound".
        """
        self.__width = width
        self.__height = height

        self.__resize_target = resize_target
        self.__keep_aspect_ratio = keep_aspect_ratio
        self.__multiple_of = ensure_multiple_of
        self.__resize_method = resize_method
        self.__image_interpolation_method = image_interpolation_method

    def constrain_to_multiple_of(self, x, min_val=0, max_val=None):
        y = (np.round(x / self.__multiple_of) * self.__multiple_of).astype(int)

        if max_val is not None and y > max_val:
            y = (np.floor(x / self.__multiple_of) * self.__multiple_of).astype(int)

        if y < min_val:
            y = (np.ceil(x / self.__multiple_of) * self.__multiple_of).astype(int)

        return y

    def get_size(self, width, height):
        # determine new height and width
        scale_height = self.__height / height
        scale_width = self.__width / width

        if self.__keep_aspect_ratio:
            if self.__resize_method == "lower_bound":
                # scale such that output size is lower bound
                if scale_width > scale_height:
                    # fit width
                    scale_height = scale_width
                else:
                    # fit height
                    scale_width = scale_height
            elif self.__resize_method == "upper_bound":
                # scale such that output size is upper bound
                if scale_width < scale_height:
                    # fit width
                    scale_height = scale_width
                else:
                    # fit height
                    scale_width = scale_height
            elif self.__resize_method == "minimal":
                # scale as little as possible
                if abs(1 - scale_width) < abs(1 - scale_height):
                    # fit width
                    scale_height = scale_width
                else:
                    # fit height
                    scale_width = scale_height
            else:
                raise ValueError(f"resize_method {self.__resize_method} not implemented")

        if self.__resize_method == "lower_bound":
            new_height = self.constrain_to_multiple_of(scale_height * height, min_val=self.__height)
            new_width = self.constrain_to_multiple_of(scale_width * width, min_val=self.__width)
        elif self.__resize_method == "upper_bound":
            new_height = self.constrain_to_multiple_of(scale_height * height, max_val=self.__height)
            new_width = self.constrain_to_multiple_of(scale_width * width, max_val=self.__width)
        elif self.__resize_method == "minimal":
            new_height = self.constrain_to_multiple_of(scale_height * height)
            new_width = self.constrain_to_multiple_of(scale_width * width)
        else:
            raise ValueError(f"resize_method {self.__resize_method} not implemented")

        return (new_width, new_height)

    def __call__(self, sample):
        width, height = self.get_size(sample["image"].shape[1], sample["image"].shape[0])

        # resize sample
        sample["image"] = cv2.resize(sample["image"], (width, height), interpolation=self.__image_interpolation_method)

        if self.__resize_target:
            if "depth" in sample:
                sample["depth"] = cv2.resize(sample["depth"], (width, height), interpolation=cv2.INTER_NEAREST)

            if "mask" in sample:
                sample["mask"] = cv2.resize(sample["mask"].astype(np.float32), (width, height),
                                            interpolation=cv2.INTER_NEAREST)

        return sample


class NormalizeImage(object):
    """Normalize image by given mean and std.
    """

    def __init__(self, mean, std):
        self.__mean = mean
        self.__std = std

    def __call__(self, sample):
        sample["image"] = (sample["image"] - self.__mean) / self.__std

        return sample


class PrepareForNet(object):
    """Prepare sample for usage as network input.
    """

    def __init__(self):
        pass

    def __call__(self, sample):
        image = np.transpose(sample["image"], (2, 0, 1))
        sample["image"] = np.ascontiguousarray(image).astype(np.float32)

        if "depth" in sample:
            depth = sample["depth"].astype(np.float32)
            sample["depth"] = np.ascontiguousarray(depth)

        if "mask" in sample:
            sample["mask"] = sample["mask"].astype(np.float32)
            sample["mask"] = np.ascontiguousarray(sample["mask"])

        return sample
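A minimal sketch of how these three transforms are typically chained on a single frame before it reaches the depth model. The 518x518 target, the multiple-of-14 constraint and the ImageNet mean/std are assumptions taken from common Depth Anything V2 configurations, not from this diff; only the Resize, NormalizeImage and PrepareForNet classes come from the file above, and the import assumes the repo root is on PYTHONPATH.

```python
import cv2
import numpy as np
from preprocessing.depth_anything_v2.util.transform import Resize, NormalizeImage, PrepareForNet

# Dummy HxWx3 float image in [0, 1] standing in for a decoded video frame.
image = np.random.rand(480, 854, 3).astype(np.float32)

transforms = [
    Resize(width=518, height=518, resize_target=False, keep_aspect_ratio=True,
           ensure_multiple_of=14, resize_method="lower_bound",
           image_interpolation_method=cv2.INTER_CUBIC),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet stats (assumption)
    PrepareForNet(),  # HWC float32 -> contiguous CHW float32
]

sample = {"image": image}
for t in transforms:
    sample = t(sample)

print(sample["image"].shape)  # e.g. (3, 518, 924) for a 480x854 input
```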
@@ -675,7 +675,7 @@ def display(tabs, model_choice, vace_video_input, vace_video_mask, vace_image_re
                    with gr.Column() as output_row: #equal_height=True
                        with gr.Row():
                            with gr.Column(scale=2):
                                foreground_video_output = gr.Video(label="Original Video Output", visible=False, elem_classes="video")
                                foreground_video_output = gr.Video(label="Original Video Input", visible=False, elem_classes="video")
                                foreground_output_button = gr.Button(value="Black & White Video Output", visible=False, elem_classes="new_button")
                            with gr.Column(scale=2):
                                alpha_video_output = gr.Video(label="B & W Mask Video Output", visible=False, elem_classes="video")

@@ -498,6 +498,7 @@ def _make_pretrained_vitb_rn50_384(pretrained,
                                   hooks=None,
                                   use_vit_only=False):
    model = timm.create_model('vit_base_resnet50_384', pretrained=pretrained)
    # model = timm.create_model('vit_base_r50_s16_384.orig_in21k_ft_in1k', pretrained=pretrained)

    hooks = [0, 1, 8, 11] if hooks is None else hooks
    return _make_vit_b_rn50_backbone(

@@ -97,6 +97,7 @@ class WanT2V:
        # dtype = torch.bfloat16
        # offload.load_model_data(self.model, "ckpts/Wan14BT2VFusioniX_fp16.safetensors")
        offload.change_dtype(self.model, dtype, True)
        # offload.save_model(self.model, "wan2.1_selforcing_fp16.safetensors", config_file_path=base_config_file)
        # offload.save_model(self.model, "wan2.1_text2video_14B_mbf16.safetensors", config_file_path=base_config_file)
        # offload.save_model(self.model, "wan2.1_text2video_14B_quanto_mfp16_int8.safetensors", do_quantize=True, config_file_path=base_config_file)
        self.model.eval().requires_grad_(False)

299  wgp.py
@@ -45,7 +45,7 @@ AUTOSAVE_FILENAME = "queue.zip"
PROMPT_VARS_MAX = 10

target_mmgp_version = "3.4.9"
WanGP_version = "6.1"
WanGP_version = "6.2"
settings_version = 2
prompt_enhancer_image_caption_model, prompt_enhancer_image_caption_processor, prompt_enhancer_llm_model, prompt_enhancer_llm_tokenizer = None, None, None, None

@@ -252,7 +252,7 @@ def process_prompt_and_add_tasks(state, model_choice):
        if video_guide == None:
            gr.Info("You must provide a Control Video")
            return
        if "A" in video_prompt_type:
        if "A" in video_prompt_type and not "U" in video_prompt_type:
            if video_mask == None:
                gr.Info("You must provide a Video Mask")
                return

@@ -1705,7 +1705,7 @@ def get_model_name(model_type, description_container = [""]):
def get_model_filename(model_type, quantization ="int8", dtype_policy = ""):
    finetune_def = finetunes.get(model_type, None)
    if finetune_def != None:
        choices = [ "ckpts/" + os.path.basename(path) for path in finetune_def["URLs"] ]
        choices = [ ("ckpts/" + os.path.basename(path) if path.startswith("http") else path)  for path in finetune_def["URLs"] ]
    else:
        signature = model_signatures[model_type]
        choices = [ name for name in transformer_choices if signature in name]

@@ -2055,7 +2055,7 @@ def download_models(model_filename, model_type):
    shared_def = {
        "repoId" : "DeepBeepMeep/Wan2.1",
        "sourceFolderList" : [ "pose", "depth", "mask", "wav2vec", ""  ],
        "fileList" : [ [],[], ["sam_vit_h_4b8939_fp16.safetensors"], ["config.json", "feature_extractor_config.json", "model.safetensors", "preprocessor_config.json", "special_tokens_map.json", "tokenizer_config.json", "vocab.json"],
        "fileList" : [ [],["depth_anything_v2_vitl.pth"], ["sam_vit_h_4b8939_fp16.safetensors"], ["config.json", "feature_extractor_config.json", "model.safetensors", "preprocessor_config.json", "special_tokens_map.json", "tokenizer_config.json", "vocab.json"],
                [ "flownet.pkl"  ] ]
    }
    process_files_def(**shared_def)

@@ -2776,7 +2776,7 @@ def get_resampled_video(video_in, start_frame, max_frames, target_fps, bridge='t
    # print(f"frame nos: {frame_nos}")
    return frames_list

def get_preprocessor(process_type):
def get_preprocessor(process_type, inpaint_color):
    if process_type=="pose":
        from preprocessing.dwpose.pose import PoseBodyFaceVideoAnnotator
        cfg_dict = {
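The depth checkpoint added to fileList above lines up positionally with the "depth" entry of sourceFolderList, which is presumably how process_files_def decides where each file lands under ckpts/. process_files_def itself is not shown in this excerpt, so the zip below is only a self-contained check of that pairing:

```python
shared_def = {
    "repoId": "DeepBeepMeep/Wan2.1",
    "sourceFolderList": ["pose", "depth", "mask", "wav2vec", ""],
    "fileList": [[], ["depth_anything_v2_vitl.pth"], ["sam_vit_h_4b8939_fp16.safetensors"],
                 ["config.json", "feature_extractor_config.json", "model.safetensors",
                  "preprocessor_config.json", "special_tokens_map.json",
                  "tokenizer_config.json", "vocab.json"],
                 ["flownet.pkl"]],
}

# Entry i of fileList is expected to end up inside entry i of sourceFolderList (assumption).
assert len(shared_def["sourceFolderList"]) == len(shared_def["fileList"])
for folder, files in zip(shared_def["sourceFolderList"], shared_def["fileList"]):
    for name in files:
        print(f"ckpts/{folder + '/' if folder else ''}{name}")
# prints, among others, ckpts/depth/depth_anything_v2_vitl.pth, matching get_preprocessor below
```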
@@ -2784,22 +2784,34 @@ def get_preprocessor(process_type):
            "POSE_MODEL": "ckpts/pose/dw-ll_ucoco_384.onnx",
            "RESIZE_SIZE": 1024
        }
        anno_ins = PoseBodyFaceVideoAnnotator(cfg_dict)
        anno_ins = lambda img: PoseBodyFaceVideoAnnotator(cfg_dict).forward(img)[0]
    elif process_type=="depth":
        from preprocessing.midas.depth import DepthVideoAnnotator
        # from preprocessing.midas.depth import DepthVideoAnnotator
        # cfg_dict = {
        #     "PRETRAINED_MODEL": "ckpts/depth/dpt_hybrid-midas-501f0c75.pt"
        # }
        # anno_ins = lambda img: DepthVideoAnnotator(cfg_dict).forward(img)[0]

        from preprocessing.depth_anything_v2.depth import DepthV2VideoAnnotator
        cfg_dict = {
            "PRETRAINED_MODEL": "ckpts/depth/dpt_hybrid-midas-501f0c75.pt"
            "PRETRAINED_MODEL": "ckpts/depth/depth_anything_v2_vitl.pth"
            # "PRETRAINED_MODEL": "ckpts/depth/depth_anything_vitb14.pth"
        }
        anno_ins = DepthVideoAnnotator(cfg_dict)
        anno_ins = lambda img: DepthV2VideoAnnotator(cfg_dict).forward(img)[0]
    elif process_type=="gray":
        from preprocessing.gray import GrayVideoAnnotator
        cfg_dict = {}
        anno_ins = GrayVideoAnnotator(cfg_dict)
        anno_ins = lambda img: GrayVideoAnnotator(cfg_dict).forward(img)[0]
    elif process_type=="inpaint":
        anno_ins = lambda img : inpaint_color
        # anno_ins = lambda img : np.full_like(img, inpaint_color)
    else:
        anno_ins = None
        anno_ins = lambda img : img[0]
    return anno_ins

def preprocess_video_with_mask(input_video_path, input_mask_path, height, width,  max_frames, start_frame=0, fit_canvas = False, target_fps = 16, block_size= 16, expand_scale = 2, process_type = "inpainting", to_bbox = False, RGB_Mask = False, negate_mask = False, outpaint_outside_mask = False, inpaint_color = 127):
def preprocess_video_with_mask(input_video_path, input_mask_path, height, width,  max_frames, start_frame=0, fit_canvas = False, target_fps = 16, block_size= 16, expand_scale = 2, process_type = "inpaint", to_bbox = False, RGB_Mask = False, negate_mask = False, process_outside_mask = None, inpaint_color = 127, outpainting_dims = None):
    from wan.utils.utils import calculate_new_dimensions

    def mask_to_xyxy_box(mask):
        rows, cols = np.where(mask == 255)
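After this change every preprocessor, including the new Depth Anything V2 annotator, sits behind the same tiny interface: a callable that takes a one-frame list and returns either a processed frame or, for "inpaint", a flat color that is resolved later. The stand-ins below only illustrate that calling convention; the real annotators need their checkpoints under ckpts/ and are not instantiated here.

```python
import numpy as np

# Illustrative stand-in mimicking Annotator.forward(frames) -> list of processed frames.
def fake_depth_annotator(frames):
    return [np.stack([f.mean(axis=-1)] * 3, axis=-1).astype(np.uint8) for f in frames]

def get_preprocessor_sketch(process_type, inpaint_color):
    # Same dispatch shape as wgp.get_preprocessor after this commit:
    # every branch returns a callable taking a one-frame list.
    if process_type == "depth":
        return lambda img: fake_depth_annotator(img)[0]
    elif process_type == "inpaint":
        return lambda img: inpaint_color          # constant color, handled downstream
    else:
        return lambda img: img[0]                 # pass the frame through unchanged

frame = np.zeros((64, 64, 3), dtype=np.uint8)
preproc = get_preprocessor_sketch("depth", 127)
out = preproc([frame])                            # uniform call site: preproc([target_frame])
print(out.shape, out.dtype)                       # (64, 64, 3) uint8
```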
@@ -2819,12 +2831,14 @@ def preprocess_video_with_mask(input_video_path, input_mask_path, height, width,
        return None, None
    any_mask = input_mask_path != None
    pose_special = "pose" in process_type
    if process_type == "pose_depth":
        preproc = get_preprocessor("pose")
        preproc2 = get_preprocessor("depth")
    else:
        preproc = get_preprocessor(process_type)
        preproc2 = None
    if process_type == "pose_depth": process_type = "pose"
    any_identity_mask = False
    if process_type == "identity":
        any_identity_mask = True
        negate_mask = False
        process_outside_mask = None
    preproc = get_preprocessor(process_type, inpaint_color)
    preproc2 = get_preprocessor(process_outside_mask, inpaint_color) if process_type != process_outside_mask else preproc
    video = get_resampled_video(input_video_path, start_frame, max_frames, target_fps)
    if any_mask:
        mask_video = get_resampled_video(input_mask_path, start_frame, max_frames, target_fps)

@@ -2832,19 +2846,27 @@ def preprocess_video_with_mask(input_video_path, input_mask_path, height, width,
    if len(video) == 0 or any_mask and len(mask_video) == 0:
        return None, None

    if fit_canvas != None:
        frame_height, frame_width, _ = video[0].shape
    frame_height, frame_width, _ = video[0].shape

        if fit_canvas :
            scale1  = min(height / frame_height, width /  frame_width)
            scale2  = min(height / frame_width, width /  frame_height)
            scale = max(scale1, scale2)
    if outpainting_dims != None:
        outpainting_top, outpainting_bottom, outpainting_left, outpainting_right= outpainting_dims
        if fit_canvas != None:
            frame_height = int(frame_height * (100 + outpainting_top + outpainting_bottom) / 100)
            frame_width =  int(frame_width * (100 + outpainting_left + outpainting_right) / 100)
        else:
            scale =   ((height * width ) /  (frame_height * frame_width))**(1/2)
            frame_height,frame_width = height, width

        height = (int(frame_height * scale) // block_size) * block_size
        width = (int(frame_width * scale) // block_size) * block_size
    if fit_canvas != None:
        height, width = calculate_new_dimensions(height, width, frame_height, frame_width, fit_into_canvas = fit_canvas, block_size = block_size)

    if outpainting_dims != None:
        final_height, final_width = height, width
        height = int(height / ((100 + outpainting_top + outpainting_bottom) / 100))
        width =  int(width / ((100 + outpainting_left + outpainting_right) / 100))
        margin_top = int(outpainting_top/(100 + outpainting_top + outpainting_bottom) * final_height)
        if (margin_top + height) > final_height or outpainting_bottom == 0: margin_top = final_height - height
        margin_left = int(outpainting_left/(100 + outpainting_left + outpainting_right) * final_width)
        if (margin_left + width) > final_width or outpainting_right == 0: margin_left = final_width - width

    if any_mask:
        num_frames = min(len(video), len(mask_video))
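A worked example of the outpainting arithmetic above, assuming a 480x832 target canvas and 20% outpainting at the top and at the bottom. The control video then occupies a 342x832 inner window placed 68 pixels below the top edge; the formulas are exactly the ones in the hunk, only the numbers are illustrative.

```python
final_height, final_width = 480, 832                  # canvas size after outpainting
outpainting_top, outpainting_bottom = 20, 20           # percentages from outpainting_dims
outpainting_left, outpainting_right = 0, 0

# Inner window actually covered by the control video.
height = int(final_height / ((100 + outpainting_top + outpainting_bottom) / 100))    # 342
width = int(final_width / ((100 + outpainting_left + outpainting_right) / 100))      # 832

margin_top = int(outpainting_top / (100 + outpainting_top + outpainting_bottom) * final_height)   # 68
if (margin_top + height) > final_height or outpainting_bottom == 0:
    margin_top = final_height - height
margin_left = int(outpainting_left / (100 + outpainting_left + outpainting_right) * final_width)  # 0
if (margin_left + width) > final_width or outpainting_right == 0:
    margin_left = final_width - width

print(height, width, margin_top, margin_left)          # 342 832 68 0
```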
@@ -2852,14 +2874,20 @@ def preprocess_video_with_mask(input_video_path, input_mask_path, height, width,
        num_frames = len(video)
    masked_frames = []
    masks = []
    if any_identity_mask:
        any_mask = True

    for frame_idx in range(num_frames):
        frame = Image.fromarray(video[frame_idx].cpu().numpy()) #.asnumpy()
        frame = frame.resize((width, height), resample=Image.Resampling.LANCZOS)
        frame = np.array(frame)
        if any_mask:
            mask = Image.fromarray(mask_video[frame_idx].cpu().numpy()) #.asnumpy()
            mask = mask.resize((width, height), resample=Image.Resampling.LANCZOS)
            mask = np.array(mask)
            if any_identity_mask:
                mask = np.full( (height, width, 3), 0, dtype= np.uint8)
            else:
                mask = Image.fromarray(mask_video[frame_idx].cpu().numpy()) #.asnumpy()
                mask = mask.resize((width, height), resample=Image.Resampling.LANCZOS)
                mask = np.array(mask)

            if len(mask.shape) == 3 and mask.shape[2] == 3:
                mask = cv2.cvtColor(mask, cv2.COLOR_BGR2GRAY)

@@ -2880,40 +2908,45 @@ def preprocess_video_with_mask(input_video_path, input_mask_path, height, width,
                if pose_special:
                    original_mask = 255 - original_mask

        if preproc != None:
            if pose_special and any_mask:
                target_frame = np.where(original_mask[..., None], frame, 0)
            else:
                target_frame = frame
            # Image.fromarray(target_frame).save("preprocframe.png")
            processed_img = preproc.forward([target_frame])[0]
            if pose_special and outpaint_outside_mask:
                processed_img = np.where(processed_img ==0, inpaint_color, processed_img)

            if preproc2 != None:
                processed_img2 = preproc2.forward([frame])[0]
                processed_img = (processed_img.astype(np.uint16) + processed_img2.astype(np.uint16))/2
                processed_img = processed_img.astype(np.uint8)
            if any_mask:
                inverse_mask = mask == 0
                masked_frame = np.where(inverse_mask[..., None], inpaint_color if outpaint_outside_mask else frame, processed_img)
            else:
                masked_frame = processed_img

        if pose_special and any_mask:
            target_frame = np.where(original_mask[..., None], frame, 0)
        else:
            if any_mask and not outpaint_outside_mask:
                masked_frame = np.where(mask[..., None], inpaint_color, frame)
            else:
                masked_frame = np.full_like(frame, inpaint_color)
            target_frame = frame

        processed_img = preproc([target_frame])

        if any_mask:
            if preproc2 != None:
                frame = preproc2([frame])
            masked_frame = np.where(mask[..., None], processed_img, frame)
        else:
            masked_frame = processed_img

        if any_mask :
            if outpaint_outside_mask:
            if process_outside_mask != None:
                mask = np.full_like(mask, 255)
            mask = torch.from_numpy(mask)
            if RGB_Mask:
                mask =  mask.unsqueeze(-1).repeat(1,1,3)
            if outpainting_dims != None:
                full_frame= torch.full( (final_height, final_width, mask.shape[-1]), 255, dtype= torch.uint8, device= mask.device)
                full_frame[margin_top:margin_top+height, margin_left:margin_left+width] = mask
                mask = full_frame
            masks.append(mask)

        if isinstance(masked_frame, int):
            masked_frame= np.full( (height, width, 3), inpaint_color, dtype= np.uint8)

        masked_frame = torch.from_numpy(masked_frame)
        if masked_frame.shape[-1] == 1:
            masked_frame =  masked_frame.repeat(1,1,3).to(torch.uint8)

        if outpainting_dims != None:
            full_frame= torch.full( (final_height, final_width, masked_frame.shape[-1]),  inpaint_color, dtype= torch.uint8, device= masked_frame.device)
            full_frame[margin_top:margin_top+height, margin_left:margin_left+width] = masked_frame
            masked_frame = full_frame

        masked_frames.append(masked_frame)
    if args.save_masks:
        from preprocessing.dwpose.pose import save_one_video
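The heart of the new per-frame compositing is the single np.where above: inside the binary mask the frame is replaced by the preprocessed control signal, outside it the original (or outside-processed) frame is kept. A minimal, self-contained illustration of that line with dummy arrays:

```python
import numpy as np

h = w = 4
frame = np.full((h, w, 3), 200, dtype=np.uint8)          # stand-in for the original pixels
processed_img = np.full((h, w, 3), 50, dtype=np.uint8)   # stand-in for e.g. a depth map rendered as RGB
mask = np.zeros((h, w), dtype=np.uint8)
mask[:, 2:] = 255                                         # right half is the controlled area

# Same compositing expression as in the hunk above.
masked_frame = np.where(mask[..., None], processed_img, frame)
print(masked_frame[0, 0, 0], masked_frame[0, 3, 0])       # 200 (kept) 50 (processed)
```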
@@ -2922,10 +2955,14 @@ def preprocess_video_with_mask(input_video_path, input_mask_path, height, width,
        if any_mask:
            saved_masks = [mask.cpu().numpy() for mask in masks ]
            save_one_video("masks.mp4", saved_masks, fps=target_fps, quality=8, macro_block_size=None)
    preproc = None
    preproc2 = None
    gc.collect()
    torch.cuda.empty_cache()

    return torch.stack(masked_frames), torch.stack(masks) if any_mask else None

def preprocess_video(process_type, height, width, video_in, max_frames, start_frame=0, fit_canvas = None, target_fps = 16, block_size = 16):
def preprocess_video(height, width, video_in, max_frames, start_frame=0, fit_canvas = None, target_fps = 16, block_size = 16):

    frames_list = get_resampled_video(video_in, start_frame, max_frames, target_fps)

@@ -2953,12 +2990,7 @@ def preprocess_video(process_type, height, width, video_in, max_frames, start_fr
        frame = frame.resize((new_width,new_height), resample=Image.Resampling.LANCZOS)
        processed_frames_list.append(frame)

    anno_ins = get_preprocessor(process_type)

    if anno_ins == None:
        np_frames = [np.array(frame) for frame in processed_frames_list]
    else:
        np_frames = anno_ins.forward(processed_frames_list)
    np_frames = [np.array(frame) for frame in processed_frames_list]

    # from preprocessing.dwpose.pose import save_one_video
    # save_one_video("test.mp4", np_frames, fps=8, quality=8, macro_block_size=None)
@@ -3046,6 +3078,7 @@ def generate_video(
    frames_positions,
    video_guide,
    keep_frames_video_guide,
    video_guide_outpainting,
    video_mask,
    mask_expand,
    audio_guide,

@@ -3255,7 +3288,7 @@ def generate_video(
    source_video = None
    target_camera = None
    if "recam" in model_filename:
        source_video = preprocess_video("", width=width, height=height,video_in=video_source, max_frames= current_video_length, start_frame = 0, fit_canvas= fit_canvas == 1)
        source_video = preprocess_video(width=width, height=height,video_in=video_source, max_frames= current_video_length, start_frame = 0, fit_canvas= fit_canvas == 1)
        target_camera = model_mode

    audio_proj_split = None

@@ -3409,7 +3442,7 @@ def generate_video(
                    if video_source != None: refresh_preview["video_source"] = get_video_frame(video_source, 0)
                if video_source != None and len(video_source) > 0 and window_no == 1:
                    keep_frames_video_source= 1000 if len(keep_frames_video_source) ==0 else int(keep_frames_video_source)
                    prefix_video  = preprocess_video(None, width=width, height=height,video_in=video_source, max_frames= keep_frames_video_source , start_frame = 0, fit_canvas= sample_fit_canvas, target_fps = fps, block_size = 32 if ltxv else 16)
                    prefix_video  = preprocess_video(width=width, height=height,video_in=video_source, max_frames= keep_frames_video_source , start_frame = 0, fit_canvas= sample_fit_canvas, target_fps = fps, block_size = 32 if ltxv else 16)
                    prefix_video  = prefix_video .permute(3, 0, 1, 2)
                    prefix_video  = prefix_video .float().div_(127.5).sub_(1.) # c, f, h, w
                    pre_video_guide =  prefix_video[:, -reuse_frames:]

@@ -3423,24 +3456,39 @@ def generate_video(
                video_guide_copy = video_guide
                video_mask_copy = video_mask
                if "V" in video_prompt_type:
                    extra_label = ""
                    if "X" in video_prompt_type:
                        process_outside_mask = "inpaint"
                    elif "Y" in video_prompt_type:
                        process_outside_mask = "depth"
                        extra_label = " and Depth"
                    else:
                        process_outside_mask = None
                    preprocess_type = None
                    if "P" in video_prompt_type and "D" in video_prompt_type :
                        progress_args = [0, get_latest_status(state,"Extracting Open Pose and Depth Information")]
                        preprocess_type = "pose_depth"
                    elif "P" in video_prompt_type :
                        progress_args = [0, get_latest_status(state,"Extracting Open Pose Information")]
                    # if "P" in video_prompt_type and "D" in video_prompt_type :
                    #     progress_args = [0, get_latest_status(state,"Extracting Open Pose and Depth Information")]
                    #     preprocess_type = "pose_depth"
                    if "P" in video_prompt_type :
                        progress_args = [0, get_latest_status(state,f"Extracting Open Pose{extra_label} Information")]
                        preprocess_type = "pose"
                    elif "D" in video_prompt_type :
                        progress_args = [0, get_latest_status(state,"Extracting Depth Information")]
                        preprocess_type = "depth"
                    elif "C" in video_prompt_type :
                        progress_args = [0, get_latest_status(state,"Extracting Gray Level Information")]
                        progress_args = [0, get_latest_status(state,f"Extracting Gray Level{extra_label} Information")]
                        preprocess_type = "gray"
                    elif "M" in video_prompt_type :
                        progress_args = [0, get_latest_status(state,f"Creating Inpainting{extra_label} Mask")]
                        preprocess_type = "inpaint"
                    elif "U" in video_prompt_type :
                        progress_args = [0, get_latest_status(state,f"Creating Identity{extra_label} Mask")]
                        preprocess_type = "identity"
                    else:
                        progress_args = [0, get_latest_status(state,"Creating Inpainting Mask")]
                        preprocess_type = "inpainting"
                        progress_args = [0, get_latest_status(state,f"Creating Vace Generic{extra_label} Mask")]
                        preprocess_type = "vace"
                    send_cmd("progress", progress_args)
                    video_guide_copy, video_mask_copy = preprocess_video_with_mask(video_guide, video_mask, height=image_size[0], width = image_size[1], max_frames= current_video_length if guide_start_frame == 0 else current_video_length - reuse_frames, start_frame = guide_start_frame, fit_canvas = sample_fit_canvas, target_fps = fps,  process_type = preprocess_type, expand_scale = mask_expand, RGB_Mask = True, negate_mask = "N" in video_prompt_type, outpaint_outside_mask = "X" in video_prompt_type)
                    outpainting_dims = None if video_guide_outpainting== None or len(video_guide_outpainting) == 0 or video_guide_outpainting == "0 0 0 0" or video_guide_outpainting.startswith("#") else [int(v) for v in video_guide_outpainting.split(" ")]
                    video_guide_copy, video_mask_copy = preprocess_video_with_mask(video_guide, video_mask, height=image_size[0], width = image_size[1], max_frames= current_video_length if guide_start_frame == 0 else current_video_length - reuse_frames, start_frame = guide_start_frame, fit_canvas = sample_fit_canvas, target_fps = fps,  process_type = preprocess_type, expand_scale = mask_expand, RGB_Mask = True, negate_mask = "N" in video_prompt_type, process_outside_mask = process_outside_mask, outpainting_dims = outpainting_dims )
                    if video_guide_copy != None:
                        if sample_fit_canvas != None:
                            image_size = video_guide_copy.shape[-3: -1]
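The branches above decode single-letter flags packed into video_prompt_type: X/Y select how the area outside the mask is treated, while P, D, C, M, U or their absence select the main preprocessing mode. The helper below is not part of wgp.py; it simply mirrors that mapping so the combinations are easier to read off:

```python
def decode_control_flags(video_prompt_type):
    # Mirrors the dispatch introduced in this commit (illustration only).
    if "X" in video_prompt_type:
        process_outside_mask = "inpaint"
    elif "Y" in video_prompt_type:
        process_outside_mask = "depth"
    else:
        process_outside_mask = None

    if "P" in video_prompt_type:
        preprocess_type = "pose"
    elif "D" in video_prompt_type:
        preprocess_type = "depth"
    elif "C" in video_prompt_type:
        preprocess_type = "gray"
    elif "M" in video_prompt_type:
        preprocess_type = "inpaint"
    elif "U" in video_prompt_type:
        preprocess_type = "identity"
    else:
        preprocess_type = "vace"
    return preprocess_type, process_outside_mask

print(decode_control_flags("PVA"))   # ('pose', None)      transfer motion, mask provided
print(decode_control_flags("DVAY"))  # ('depth', 'depth')  depth inside the mask, depth outside too
```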
@@ -3634,6 +3682,7 @@ def generate_video(
                    else:
                        sample = torch.cat([ prefix_video[:, :-reuse_frames], sample], dim = 1)
                    prefix_video = None
                    guide_start_frame -= reuse_frames
                if sliding_window and window_no > 1:
                    if reuse_frames == 0:
                        sample = sample[: , :]

@@ -4622,6 +4671,8 @@ def load_settings_from_file(state, file_path):
        model_type = get_model_type(model_filename)
        if model_type == None:
            model_type = current_model_type
    elif not model_type in model_types:
        model_type = current_model_type
    defaults = state.get(model_type, None)
    defaults = get_default_settings(model_type) if defaults == None else defaults
    defaults.update(configs)

@@ -4661,6 +4712,7 @@ def save_inputs(
            model_mode,
            video_source,
            keep_frames_video_source,
            video_guide_outpainting,
            video_prompt_type,
            image_refs,
            frames_positions,

@@ -4736,10 +4788,6 @@ def refresh_image_prompt_type(state, image_prompt_type):
    any_video_source = len(filter_letters(image_prompt_type, "VLG"))>0
    return gr.update(visible = "S" in image_prompt_type ), gr.update(visible = "E" in image_prompt_type ), gr.update(visible = "V" in image_prompt_type) , gr.update(visible = any_video_source)

def refresh_video_prompt_type(state, video_prompt_type):
    return gr.Gallery(visible = "I" in video_prompt_type), gr.Video(visible= "V" in video_prompt_type),gr.Video(visible= "M" in video_prompt_type ), gr.Text(visible= "V" in video_prompt_type) , gr.Checkbox(visible= "I" in video_prompt_type)


def handle_celll_selection(state, evt: gr.SelectData):
    gen = get_gen_info(state)
    queue = gen.get("queue", [])
@@ -4859,25 +4907,29 @@ def refresh_video_prompt_type_image_refs(video_prompt_type, video_prompt_type_im
    video_prompt_type = add_to_sequence(video_prompt_type, video_prompt_type_image_refs)
    visible = "I" in video_prompt_type
    return video_prompt_type, gr.update(visible = visible),gr.update(visible = visible), gr.update(visible = visible and "F" in video_prompt_type_image_refs)

def refresh_video_prompt_type_video_guide(video_prompt_type, video_prompt_type_video_guide):
    video_prompt_type = del_in_sequence(video_prompt_type, "DPCMV")
    video_prompt_type = add_to_sequence(video_prompt_type, video_prompt_type_video_guide)
    visible = "V" in video_prompt_type
    return video_prompt_type, gr.update(visible = visible), gr.update(visible = visible), gr.update(visible= visible), gr.update(visible= visible and "A" in video_prompt_type ), gr.update(visible= visible and "A" in video_prompt_type )

def refresh_video_prompt_type_video_mask(video_prompt_type, video_prompt_type_video_mask):
    video_prompt_type = del_in_sequence(video_prompt_type, "XNA")
    video_prompt_type = del_in_sequence(video_prompt_type, "XYNA")
    video_prompt_type = add_to_sequence(video_prompt_type, video_prompt_type_video_mask)
    visible= "A" in video_prompt_type
    return video_prompt_type, gr.update(visible= visible), gr.update(visible= visible )

def refresh_video_prompt_video_guide_trigger(video_prompt_type, video_prompt_type_video_guide):
    video_prompt_type_video_guide = video_prompt_type_video_guide.split("#")[0]
    video_prompt_type = del_in_sequence(video_prompt_type, "DPCMV")
def refresh_video_prompt_type_video_guide(state, video_prompt_type, video_prompt_type_video_guide):
    video_prompt_type = del_in_sequence(video_prompt_type, "DPCMUV")
    video_prompt_type = add_to_sequence(video_prompt_type, video_prompt_type_video_guide)
    visible = "V" in video_prompt_type
    return video_prompt_type, video_prompt_type_video_guide, gr.update(visible= visible ), gr.update(visible= visible ), gr.update(visible= visible and "A" in video_prompt_type ), gr.update(visible= visible and "A" in video_prompt_type )
    mask_visible = visible and "A" in video_prompt_type and not "U" in video_prompt_type
    vace = get_base_model_type(state["model_type"]) in ("vace_1.3B","vace_14B")
    return video_prompt_type, gr.update(visible = visible), gr.update(visible = visible),gr.update(visible= visible and vace), gr.update(visible= visible and not "U" in video_prompt_type), gr.update(visible= mask_visible), gr.update(visible= mask_visible)

def refresh_video_prompt_video_guide_trigger(state, video_prompt_type, video_prompt_type_video_guide):
    video_prompt_type_video_guide = video_prompt_type_video_guide.split("#")[0]
    video_prompt_type = del_in_sequence(video_prompt_type, "DPCMUV")
    video_prompt_type = add_to_sequence(video_prompt_type, video_prompt_type_video_guide)
    visible = "V" in video_prompt_type
    mask_visible = visible and "A" in video_prompt_type and not "U" in video_prompt_type
    vace = get_base_model_type(state["model_type"]) in ("vace_1.3B","vace_14B")
    return video_prompt_type, video_prompt_type_video_guide, gr.update(visible= visible ),gr.update(visible= visible and vace), gr.update(visible= visible and not "U" in video_prompt_type), gr.update(visible= mask_visible), gr.update(visible= mask_visible)


def refresh_preview(state):
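del_in_sequence and add_to_sequence are defined elsewhere in wgp.py and are not part of this commit; the stand-ins below are an assumption about their behaviour (remove, respectively append, the given letters) used only to show how the dropdown choices rewrite video_prompt_type:

```python
def del_in_sequence(sequence, letters):        # assumed behaviour, for illustration only
    return "".join(c for c in sequence if c not in letters)

def add_to_sequence(sequence, letters):        # assumed behaviour, for illustration only
    return sequence + "".join(c for c in letters if c not in sequence)

video_prompt_type = "PVA"                                          # pose transfer + control video + mask
video_prompt_type = del_in_sequence(video_prompt_type, "DPCMUV")   # drop the previous guide choice -> "A"
video_prompt_type = add_to_sequence(video_prompt_type, "UV")       # user picked "Keep Unchanged"
print(video_prompt_type)                                           # 'AUV': "U" present, so the mask inputs are hidden
```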
@@ -4925,6 +4977,23 @@ def show_preview_column_modal(state, column_no):
    value = get_modal_image( list_uri[column_no],names[column_no]  )

    return -1, gr.update(value=value), gr.update(visible=True)

def update_video_guide_outpainting(video_guide_outpainting_value, value, pos):
    if len(video_guide_outpainting_value) <= 1:
        video_guide_outpainting_list = ["0"] * 4
    else:
        video_guide_outpainting_list = video_guide_outpainting_value.split(" ")
    video_guide_outpainting_list[pos] = str(value)
    if all(v=="0" for v in video_guide_outpainting_list):
        return ""
    return " ".join(video_guide_outpainting_list)

def refresh_video_guide_outpainting_row(video_guide_outpainting_checkbox, video_guide_outpainting):
    video_guide_outpainting = video_guide_outpainting[1:] if video_guide_outpainting_checkbox else "#" + video_guide_outpainting

    return gr.update(visible=video_guide_outpainting_checkbox), video_guide_outpainting


def generate_video_tab(update_form = False, state_dict = None, ui_defaults = None, model_choice = None, header = None, main = None):
    global inputs_names #, advanced
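The hidden text field stores the four outpainting percentages as a single "top bottom left right" string, collapsed to the empty string when everything is zero and prefixed with "#" while the checkbox is off (that prefix is toggled by refresh_video_guide_outpainting_row). Copying the helper above verbatim, a quick round trip:

```python
def update_video_guide_outpainting(video_guide_outpainting_value, value, pos):
    if len(video_guide_outpainting_value) <= 1:
        video_guide_outpainting_list = ["0"] * 4
    else:
        video_guide_outpainting_list = video_guide_outpainting_value.split(" ")
    video_guide_outpainting_list[pos] = str(value)
    if all(v == "0" for v in video_guide_outpainting_list):
        return ""
    return " ".join(video_guide_outpainting_list)

s = ""                                           # nothing set yet
s = update_video_guide_outpainting(s, 20, 0)     # Top %    -> "20 0 0 0"
s = update_video_guide_outpainting(s, 20, 1)     # Bottom % -> "20 20 0 0"
print(repr(s))                                   # '20 20 0 0'
print(repr(update_video_guide_outpainting("20 0 0 0", 0, 0)))   # back to all zeros -> ''
```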
@@ -5129,15 +5198,16 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                    if vace:
                        video_prompt_type_video_guide = gr.Dropdown(
                            choices=[
                                ("None", ""),
                                ("No Control Video", ""),
                                ("Transfer Human Motion", "PV"),
                                ("Transfer Depth", "DV"),
                                ("Transfer Human Motion & Depth", "DPV"),
                                # ("Transfer Human Motion & Depth", "DPV"),
                                ("Recolorize Control Video", "CV"),
                                ("Inpainting", "MV"),
                                ("Vace multi formats", "V"),
                                ("Vace raw format", "V"),
                                ("Keep Unchanged", "UV"),
                            ],
                            value=filter_letters(video_prompt_type_value, "DPCMV"),
                            value=filter_letters(video_prompt_type_value, "DPCMUV"),
                            label="Control Video Process", scale = 2, visible= True
                        )
                    elif hunyuan_video_custom_edit:

@@ -5146,7 +5216,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                                ("Inpaint Control Video", "MV"),
                                ("Transfer Human Motion", "PMV"),
                            ],
                            value=filter_letters(video_prompt_type_value, "DPCMV"),
                            value=filter_letters(video_prompt_type_value, "DPCMUV"),
                            label="Video to Video", scale = 3, visible= True
                        )
                    else:

@@ -5172,9 +5242,11 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                                ("Non Masked Area", "NA"),
                                ("Masked Area, rest Outpainted", "XA"),
                                ("Non Masked Area, rest Inpainted", "XNA"),
                                ("Masked Area, rest Depth", "YA"),
                                ("Non Masked Area, rest Depth", "YNA"),
                            ],
                            value= filter_letters(video_prompt_type_value, "XNA"),
                            visible=  "V" in video_prompt_type_value and not hunyuan_video_custom,
                            value= filter_letters(video_prompt_type_value, "XYNA"),
                            visible=  "V" in video_prompt_type_value and not "U" in video_prompt_type_value and not hunyuan_video_custom,
                            label="Area Processed", scale = 2
                        )
                    if vace:

@@ -5198,6 +5270,23 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non

                video_guide = gr.Video(label= "Control Video", visible= "V" in video_prompt_type_value, value= ui_defaults.get("video_guide", None),)
                keep_frames_video_guide = gr.Text(value=ui_defaults.get("keep_frames_video_guide","") , visible= "V" in video_prompt_type_value, scale = 2, label= "Frames to keep in Control Video (empty=All, 1=first, a:b for a range, space to separate values)" ) #, -1=last
                with gr.Column(visible= "V" in video_prompt_type_value and vace) as video_guide_outpainting_col:
                    video_guide_outpainting_value = ui_defaults.get("video_guide_outpainting","#")
                    video_guide_outpainting = gr.Text(value=video_guide_outpainting_value , visible= False)
                    with gr.Group():
                        video_guide_outpainting_checkbox = gr.Checkbox(label="Enable Outpainting on Control Video", value=len(video_guide_outpainting_value)>0 and not video_guide_outpainting_value.startswith("#") )
                        with gr.Row(visible = not video_guide_outpainting_value.startswith("#")) as video_guide_outpainting_row:
                            video_guide_outpainting_value = video_guide_outpainting_value[1:] if video_guide_outpainting_value.startswith("#") else video_guide_outpainting_value
                            video_guide_outpainting_list = [0] * 4 if len(video_guide_outpainting_value) == 0 else [int(v) for v in video_guide_outpainting_value.split(" ")]
                            video_guide_outpainting_top= gr.Slider(0, 100, value= video_guide_outpainting_list[0], step=5, label="Top %", show_reset_button= False)
                            video_guide_outpainting_bottom = gr.Slider(0, 100, value= video_guide_outpainting_list[1], step=5, label="Bottom %", show_reset_button= False)
                            video_guide_outpainting_left = gr.Slider(0, 100, value= video_guide_outpainting_list[2], step=5, label="Left %", show_reset_button= False)
                            video_guide_outpainting_right = gr.Slider(0, 100, value= video_guide_outpainting_list[3], step=5, label="Right %", show_reset_button= False)

                video_mask = gr.Video(label= "Video Mask Area (for Inpainting, white = Control Area, black = Unchanged)", visible= "A" in video_prompt_type_value and not "U" in video_prompt_type_value , value= ui_defaults.get("video_mask", None))

                mask_expand = gr.Slider(-10, 50, value=ui_defaults.get("mask_expand", 0), step=1, label="Expand / Shrink Mask Area", visible= "A" in video_prompt_type_value and not "U" in video_prompt_type_value )

                image_refs = gr.Gallery( label ="Start Image" if hunyuan_video_avatar else "Reference Images",
                        type ="pil",   show_label= True,
                        columns=[3], rows=[1], object_fit="contain", height="auto", selected_index=0, interactive= True, visible= "I" in video_prompt_type_value,
@@ -5216,10 +5305,6 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                )


                video_mask = gr.Video(label= "Video Mask Area (for Inpainting or Outpainting, white = Control Area, black = Unchanged)", visible= "A" in video_prompt_type_value, value= ui_defaults.get("video_mask", None))

                mask_expand = gr.Slider(-10, 50, value=ui_defaults.get("mask_expand", 0), step=1, label="Expand / Shrink Mask Area", visible= "A" in video_prompt_type_value)

            audio_guide = gr.Audio(value= ui_defaults.get("audio_guide", None), type="filepath", label="Voice to follow", show_download_button= True, visible= fantasy or hunyuan_video_avatar or hunyuan_video_custom_audio   )

            advanced_prompt = advanced_ui

@@ -5551,7 +5636,9 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
        extra_inputs = prompt_vars + [wizard_prompt, wizard_variables_var, wizard_prompt_activated_var, video_prompt_column, image_prompt_column,
                                      prompt_column_advanced, prompt_column_wizard_vars, prompt_column_wizard, lset_name, advanced_row, speed_tab, quality_tab,
                                      sliding_window_tab, misc_tab, prompt_enhancer_row, inference_steps_row, skip_layer_guidance_row,
                                      video_prompt_type_video_guide, video_prompt_type_video_mask, video_prompt_type_image_refs] # show_advanced presets_column,
                                      video_prompt_type_video_guide, video_prompt_type_video_mask, video_prompt_type_image_refs,
                                      video_guide_outpainting_col,video_guide_outpainting_top, video_guide_outpainting_bottom, video_guide_outpainting_left, video_guide_outpainting_right,
                                      video_guide_outpainting_checkbox, video_guide_outpainting_row] # show_advanced presets_column,
        if update_form:
            locals_dict = locals()
            gen_inputs = [state_dict if k=="state" else locals_dict[k]  for k in inputs_names] + [state_dict] + extra_inputs

@@ -5561,12 +5648,16 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
            target_settings = gr.Text(value = "settings", interactive= False, visible= False)

            image_prompt_type.change(fn=refresh_image_prompt_type, inputs=[state, image_prompt_type], outputs=[image_start, image_end, video_source, keep_frames_video_source] )
            video_prompt_video_guide_trigger.change(fn=refresh_video_prompt_video_guide_trigger, inputs=[video_prompt_type, video_prompt_video_guide_trigger], outputs=[video_prompt_type, video_prompt_type_video_guide, video_guide, keep_frames_video_guide, video_prompt_type_video_mask, video_mask, mask_expand])
            video_prompt_video_guide_trigger.change(fn=refresh_video_prompt_video_guide_trigger, inputs=[state, video_prompt_type, video_prompt_video_guide_trigger], outputs=[video_prompt_type, video_prompt_type_video_guide, video_guide, keep_frames_video_guide, video_guide_outpainting_col, video_prompt_type_video_mask, video_mask, mask_expand])
            video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_images_ref, frames_positions ])
            video_prompt_type_video_guide.input(fn=refresh_video_prompt_type_video_guide, inputs = [video_prompt_type, video_prompt_type_video_guide], outputs = [video_prompt_type, video_guide, keep_frames_video_guide, video_prompt_type_video_mask, video_mask, mask_expand])
            video_prompt_type_video_guide.input(fn=refresh_video_prompt_type_video_guide, inputs = [state, video_prompt_type, video_prompt_type_video_guide], outputs = [video_prompt_type, video_guide, keep_frames_video_guide, video_guide_outpainting_col, video_prompt_type_video_mask, video_mask, mask_expand])
            video_prompt_type_video_mask.input(fn=refresh_video_prompt_type_video_mask, inputs = [video_prompt_type, video_prompt_type_video_mask], outputs = [video_prompt_type, video_mask, mask_expand])
            multi_prompts_gen_type.select(fn=refresh_prompt_labels, inputs=multi_prompts_gen_type, outputs=[prompt, wizard_prompt])

            video_guide_outpainting_top.input(fn=update_video_guide_outpainting, inputs=[video_guide_outpainting, video_guide_outpainting_top, gr.State(0)], outputs = [video_guide_outpainting] )
            video_guide_outpainting_bottom.input(fn=update_video_guide_outpainting, inputs=[video_guide_outpainting, video_guide_outpainting_bottom,gr.State(1)], outputs = [video_guide_outpainting] )
            video_guide_outpainting_left.input(fn=update_video_guide_outpainting, inputs=[video_guide_outpainting, video_guide_outpainting_left,gr.State(2)], outputs = [video_guide_outpainting] )
            video_guide_outpainting_right.input(fn=update_video_guide_outpainting, inputs=[video_guide_outpainting, video_guide_outpainting_right,gr.State(3)], outputs = [video_guide_outpainting] )
            video_guide_outpainting_checkbox.input(fn=refresh_video_guide_outpainting_row, inputs=[video_guide_outpainting_checkbox, video_guide_outpainting], outputs= [video_guide_outpainting_row,video_guide_outpainting])
            show_advanced.change(fn=switch_advanced, inputs=[state, show_advanced, lset_name], outputs=[advanced_row, preset_buttons_rows, refresh_lora_btn, refresh2_row ,lset_name ]).then(
                fn=switch_prompt_type, inputs = [state, wizard_prompt_activated_var, wizard_variables_var, prompt, wizard_prompt, *prompt_vars], outputs = [wizard_prompt_activated_var, wizard_variables_var, prompt, wizard_prompt, prompt_column_advanced, prompt_column_wizard, prompt_column_wizard_vars, *prompt_vars])
            queue_df.select( fn=handle_celll_selection, inputs=state, outputs=[queue_df, modal_image_display, modal_container])
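The wiring above binds one callback to four sliders and uses gr.State to tell it which of the four percentages changed. A stripped-down sketch of that pattern, assuming a recent Gradio; the names and the simplified callback are illustrative, not the exact ones in wgp.py:

```python
import gradio as gr

def update_outpainting(encoded, value, pos):
    # Same idea as update_video_guide_outpainting: one space-separated "top bottom left right" string.
    parts = encoded.split(" ") if len(encoded) > 1 else ["0"] * 4
    parts[pos] = str(value)
    return "" if all(p == "0" for p in parts) else " ".join(parts)

with gr.Blocks() as demo:
    encoded = gr.Text(value="", visible=False)
    sliders = [gr.Slider(0, 100, value=0, step=5, label=f"{name} %")
               for name in ("Top", "Bottom", "Left", "Right")]
    for pos, slider in enumerate(sliders):
        # gr.State(pos) is how the shared callback learns which slider fired.
        slider.input(fn=update_outpainting,
                     inputs=[encoded, slider, gr.State(pos)],
                     outputs=[encoded])

# demo.launch()  # uncomment to try it locally
```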