The WanInfiniteTalkToVideo node generates video sequences from audio input. It uses a video diffusion model, conditioned on audio features extracted from one or two speakers, to produce a latent representation of a talking head video. The node can generate a new sequence or extend an existing one using previous frames for motion context.

Inputs

| Parameter | Data Type | Required | Range | Description |
| --- | --- | --- | --- | --- |
| mode | COMBO | Yes | "single_speaker", "two_speakers" | The audio input mode. "single_speaker" uses one audio input. "two_speakers" enables inputs for a second speaker and corresponding masks. |
| model | MODEL | Yes | - | The base video diffusion model. |
| model_patch | MODELPATCH | Yes | - | The model patch containing the audio projection layers. |
| positive | CONDITIONING | Yes | - | The positive conditioning to guide the generation. |
| negative | CONDITIONING | Yes | - | The negative conditioning to guide the generation. |
| vae | VAE | Yes | - | The VAE used to encode images into, and decode them from, the latent space. |
| width | INT | No | 16 - MAX_RESOLUTION | The width of the output video in pixels. Must be divisible by 16. (default: 832) |
| height | INT | No | 16 - MAX_RESOLUTION | The height of the output video in pixels. Must be divisible by 16. (default: 480) |
| length | INT | No | 1 - MAX_RESOLUTION | The number of frames to generate. (default: 81) |
| clip_vision_output | CLIPVISIONOUTPUT | No | - | Optional CLIP vision output for additional conditioning. |
| start_image | IMAGE | No | - | An optional starting image to initialize the video sequence. |
| audio_encoder_output_1 | AUDIOENCODEROUTPUT | Yes | - | The primary audio encoder output containing features for the first speaker. |
| motion_frame_count | INT | No | 1 - 33 | The number of previous frames to use as motion context when extending a sequence. (default: 9) |
| audio_scale | FLOAT | No | -10.0 - 10.0 | A scaling factor applied to the audio conditioning. (default: 1.0) |
| previous_frames | IMAGE | No | - | Optional previous video frames to extend from. |
| audio_encoder_output_2 | AUDIOENCODEROUTPUT | No | - | The second audio encoder output. Required when mode is set to "two_speakers". |
| mask_1 | MASK | No | - | The mask for the first speaker; required when using two audio inputs. |
| mask_2 | MASK | No | - | The mask for the second speaker; required when using two audio inputs. |
Parameter Constraints:
  • When mode is set to "two_speakers", the parameters audio_encoder_output_2, mask_1, and mask_2 become required.
  • If audio_encoder_output_2 is provided, both mask_1 and mask_2 must also be provided.
  • If mask_1 and mask_2 are provided, audio_encoder_output_2 must also be provided.
  • If previous_frames is provided, it must contain at least as many frames as specified by motion_frame_count.
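The constraints above can be sketched as a single validation routine. This is an illustrative helper, not the node's actual implementation; the function name and the convention of returning `True` or an error string are assumptions for the sketch.

```python
def validate_inputs(mode, audio_encoder_output_2=None, mask_1=None, mask_2=None,
                    previous_frames=None, motion_frame_count=9):
    """Hypothetical sketch of the parameter constraints described above.

    Returns True when the inputs are consistent, or an error string.
    """
    # In "two_speakers" mode the second audio input and both masks are required.
    if mode == "two_speakers":
        if audio_encoder_output_2 is None or mask_1 is None or mask_2 is None:
            return "two_speakers mode requires audio_encoder_output_2, mask_1, and mask_2"
    # The second audio input and the two masks must be provided together.
    if (audio_encoder_output_2 is not None) != (mask_1 is not None and mask_2 is not None):
        return "audio_encoder_output_2, mask_1, and mask_2 must be provided together"
    # Previous frames must cover the requested motion context.
    if previous_frames is not None and len(previous_frames) < motion_frame_count:
        return f"previous_frames must contain at least {motion_frame_count} frames"
    return True
```

For example, calling it with `mode="two_speakers"` but no masks would return an error string rather than `True`.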

Outputs

| Output Name | Data Type | Description |
| --- | --- | --- |
| model | MODEL | The patched model with audio conditioning applied. |
| positive | CONDITIONING | The positive conditioning, potentially modified with additional context (e.g., start image, CLIP vision). |
| negative | CONDITIONING | The negative conditioning, potentially modified with additional context. |
| latent | LATENT | The generated video sequence in latent space. |
| trim_image | INT | The number of frames from the start of the motion context that should be trimmed when extending a sequence. |
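When chaining generations, the `trim_image` output tells downstream nodes how many of the newly decoded frames duplicate the motion-context overlap and should be dropped before concatenation. A minimal sketch of that bookkeeping, with a hypothetical helper name and plain Python lists standing in for image batches:

```python
def extend_sequence(existing_frames, new_frames, trim_image):
    """Hypothetical helper: append newly decoded frames to a running
    sequence, dropping the first `trim_image` frames of the new batch,
    which overlap the motion context taken from the existing sequence."""
    return existing_frames + new_frames[trim_image:]

# Frames 2 and 3 were fed back in as motion context, so the new batch
# repeats them; trim_image=2 removes the duplicates before appending.
combined = extend_sequence([1, 2, 3], [2, 3, 4, 5], trim_image=2)
# combined is [1, 2, 3, 4, 5]
```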