The WanInfiniteTalkToVideo node generates video sequences from audio input. It uses a video diffusion model, conditioned on audio features extracted from one or two speakers, to produce a latent representation of a talking head video. The node can generate a new sequence or extend an existing one using previous frames for motion context.
Inputs
| Parameter | Data Type | Required | Range | Description |
|---|---|---|---|---|
| mode | COMBO | Yes | "single_speaker", "two_speakers" | The audio input mode. "single_speaker" uses one audio input. "two_speakers" enables inputs for a second speaker and corresponding masks. |
| model | MODEL | Yes | - | The base video diffusion model. |
| model_patch | MODELPATCH | Yes | - | The model patch containing audio projection layers. |
| positive | CONDITIONING | Yes | - | The positive conditioning to guide the generation. |
| negative | CONDITIONING | Yes | - | The negative conditioning to guide the generation. |
| vae | VAE | Yes | - | The VAE used for encoding images to and from the latent space. |
| width | INT | No | 16 - MAX_RESOLUTION | The width of the output video in pixels. Must be divisible by 16. (default: 832) |
| height | INT | No | 16 - MAX_RESOLUTION | The height of the output video in pixels. Must be divisible by 16. (default: 480) |
| length | INT | No | 1 - MAX_RESOLUTION | The number of frames to generate. (default: 81) |
| clip_vision_output | CLIPVISIONOUTPUT | No | - | Optional CLIP vision output for additional conditioning. |
| start_image | IMAGE | No | - | An optional starting image to initialize the video sequence. |
| audio_encoder_output_1 | AUDIOENCODEROUTPUT | Yes | - | The primary audio encoder output containing features for the first speaker. |
| motion_frame_count | INT | No | 1 - 33 | Number of previous frames to use as motion context when extending a sequence. (default: 9) |
| audio_scale | FLOAT | No | -10.0 - 10.0 | A scaling factor applied to the audio conditioning. (default: 1.0) |
| previous_frames | IMAGE | No | - | Optional previous video frames to extend from. |
| audio_encoder_output_2 | AUDIOENCODEROUTPUT | No | - | The second audio encoder output. Required when mode is set to "two_speakers". |
| mask_1 | MASK | No | - | Mask for the first speaker, required if using two audio inputs. |
| mask_2 | MASK | No | - | Mask for the second speaker, required if using two audio inputs. |
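Since width and height must each be divisible by 16, arbitrary source resolutions need to be snapped before being passed to the node. A minimal sketch of such a helper (hypothetical; not part of the node itself):

```python
def snap_to_multiple(value: int, multiple: int = 16) -> int:
    """Round a dimension down to the nearest valid multiple (hypothetical helper)."""
    return max(multiple, (value // multiple) * multiple)

# Snap an arbitrary source resolution to valid node inputs.
width = snap_to_multiple(850)   # -> 848
height = snap_to_multiple(475)  # -> 464
```

Rounding down (rather than up) avoids exceeding MAX_RESOLUTION; clamping to at least one multiple keeps the result within the documented minimum of 16.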
- When `mode` is set to `"two_speakers"`, the parameters `audio_encoder_output_2`, `mask_1`, and `mask_2` become required.
- If `audio_encoder_output_2` is provided, both `mask_1` and `mask_2` must also be provided.
- If `mask_1` and `mask_2` are provided, `audio_encoder_output_2` must also be provided.
- If `previous_frames` is provided, it must contain at least as many frames as specified by `motion_frame_count`.
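The dependency rules above can be sketched as a validation function. This is an illustrative stand-in (the function name and error messages are hypothetical), not the node's actual implementation:

```python
def validate_inputs(mode, audio_encoder_output_2=None, mask_1=None, mask_2=None,
                    previous_frames=None, motion_frame_count=9):
    """Check the input dependency rules described above (hypothetical helper)."""
    if mode == "two_speakers":
        # Two-speaker mode requires the second audio input and both masks.
        if audio_encoder_output_2 is None or mask_1 is None or mask_2 is None:
            raise ValueError(
                '"two_speakers" mode requires audio_encoder_output_2, mask_1, and mask_2'
            )
    # The second audio input and the two masks must be provided together.
    if (audio_encoder_output_2 is not None) != (mask_1 is not None and mask_2 is not None):
        raise ValueError(
            "audio_encoder_output_2, mask_1, and mask_2 must be provided together"
        )
    # Extending a sequence needs enough frames for the motion context.
    if previous_frames is not None and len(previous_frames) < motion_frame_count:
        raise ValueError("previous_frames must contain at least motion_frame_count frames")
```

For example, calling `validate_inputs("two_speakers")` without the second audio input raises, while `validate_inputs("single_speaker")` passes.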
Outputs
| Output Name | Data Type | Description |
|---|---|---|
| model | MODEL | The patched model with audio conditioning applied. |
| positive | CONDITIONING | The positive conditioning, potentially modified with additional context (e.g., start image, CLIP vision). |
| negative | CONDITIONING | The negative conditioning, potentially modified with additional context. |
| latent | LATENT | The generated video sequence in latent space. |
| trim_image | INT | The number of frames from the start of the motion context that should be trimmed when extending a sequence. |
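When chaining generations, the trim_image output tells a downstream step how many leading frames of the new clip overlap the motion context and should be discarded before concatenation. A minimal sketch of that stitching step, using plain lists as stand-ins for frame batches (the helper name is hypothetical):

```python
def stitch_extension(previous_frames, new_frames, trim_image):
    """Drop the overlapping motion-context frames, then append (hypothetical helper)."""
    return previous_frames + new_frames[trim_image:]

clip_a = list(range(81))         # stand-in for the first decoded clip (81 frames)
clip_b = list(range(100, 181))   # stand-in for the extension pass output (81 frames)
combined = stitch_extension(clip_a, clip_b, trim_image=9)
# combined has 81 + (81 - 9) = 153 frames
```

This mirrors the extension workflow: the last motion_frame_count frames of the previous clip are fed back in as previous_frames, so the corresponding frames at the start of the new clip are duplicates and must be trimmed.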