This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute!
The WanSoundImageToVideo node sets up image-to-video generation with optional audio conditioning. It takes positive and negative conditioning together with a VAE model and produces a video latent plus processed conditioning. Optional reference images, audio encoder output, control videos, and motion references can be supplied to guide the generation process.

Inputs

| Parameter | Data Type | Required | Range | Description |
|---|---|---|---|---|
| positive | CONDITIONING | Yes | - | Positive conditioning that guides what content should appear in the generated video |
| negative | CONDITIONING | Yes | - | Negative conditioning that specifies what content should be avoided in the generated video |
| vae | VAE | Yes | - | VAE model used for encoding and decoding the video latent representations |
| width | INT | Yes | 16 to MAX_RESOLUTION | Width of the output video in pixels (default: 832, must be divisible by 16) |
| height | INT | Yes | 16 to MAX_RESOLUTION | Height of the output video in pixels (default: 480, must be divisible by 16) |
| length | INT | Yes | 1 to MAX_RESOLUTION | Number of frames in the generated video (default: 77; set in steps of 4 from the minimum of 1, for example 73, 77, 81) |
| batch_size | INT | Yes | 1 to 4096 | Number of videos to generate simultaneously (default: 1) |
| audio_encoder_output | AUDIOENCODEROUTPUT | No | - | Optional audio encoding that influences the video generation based on sound characteristics |
| ref_image | IMAGE | No | - | Optional reference image that provides visual guidance for the video content |
| control_video | IMAGE | No | - | Optional control video that guides the motion and structure of the generated video |
| ref_motion | IMAGE | No | - | Optional motion reference that provides guidance for movement patterns in the video |
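
The size constraints listed for width, height, and length can be checked before queuing a workflow. The following is a minimal sketch, not part of the node itself: `snap_dimension` and `snap_length` are hypothetical helper names, and the frame-count rule assumes that length moves in steps of 4 starting from its minimum of 1 (an assumption that the default of 77 fits).

```python
def snap_dimension(value: int, step: int = 16, minimum: int = 16) -> int:
    """Round value down to the nearest multiple of step, never below minimum."""
    return max(minimum, (value // step) * step)


def snap_length(frames: int, step: int = 4, minimum: int = 1) -> int:
    """Round frames down to the nearest value of the form minimum + k * step.

    Assumption: length is stepped by 4 from a minimum of 1 (1, 5, ..., 73, 77, 81).
    """
    if frames < minimum:
        return minimum
    return minimum + ((frames - minimum) // step) * step


# Example: arbitrary requested sizes snapped to values the node should accept.
width = snap_dimension(840)
height = snap_dimension(487)
length = snap_length(80)
print(width, height, length)
```

With these inputs the helpers land on the documented defaults of 832, 480, and 77.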

Outputs

| Output Name | Data Type | Description |
|---|---|---|
| positive | CONDITIONING | Processed positive conditioning that has been modified for video generation |
| negative | CONDITIONING | Processed negative conditioning that has been modified for video generation |
| latent | LATENT | Generated video representation in latent space that can be decoded into final video frames |
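
To illustrate how the three outputs listed above might be wired into a downstream sampler, here is a hedged sketch of a workflow fragment in ComfyUI's API (JSON-style) prompt format, written as a Python dict. It assumes that a link is expressed as `[source_node_id, output_index]` and that output indices follow the order in which the outputs are listed here. All node ids, the `output_link` helper, and the sampler input name `latent_image` are illustrative assumptions rather than something this page specifies.

```python
# Output slot order as listed in the Outputs section of this page (assumed).
OUTPUT_SLOTS = {"positive": 0, "negative": 1, "latent": 2}


def output_link(node_id: str, output_name: str) -> list:
    """Build an API-format link of the form [source_node_id, output_index]."""
    return [node_id, OUTPUT_SLOTS[output_name]]


# All node ids below ("6", "7", "39", "55", "56") are hypothetical placeholders.
workflow_fragment = {
    "55": {
        "class_type": "WanSoundImageToVideo",
        "inputs": {
            "positive": ["6", 0],
            "negative": ["7", 0],
            "vae": ["39", 0],
            "width": 832,
            "height": 480,
            "length": 77,
            "batch_size": 1,
            "audio_encoder_output": ["56", 0],  # optional input
        },
    },
}

# Inputs that a downstream sampler node would take from node "55".
sampler_inputs = {
    "positive": output_link("55", "positive"),
    "negative": output_link("55", "negative"),
    "latent_image": output_link("55", "latent"),
}
```

The optional inputs (`ref_image`, `control_video`, `ref_motion`) would be linked the same way when used, and simply omitted otherwise.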