The WanInfiniteTalkToVideo node generates video sequences from audio input. It uses a video diffusion model, conditioned on audio features extracted from one or two speakers, to produce a latent representation of a talking head video. The node can generate a new sequence or extend an existing one using previous frames for motion context.

Inputs

| Parameter | Data Type | Required | Range | Description |
| --- | --- | --- | --- | --- |
| mode | COMBO | Yes | "single_speaker", "two_speakers" | The audio input mode. "single_speaker" uses one audio input. "two_speakers" enables inputs for a second speaker and corresponding masks. |
| model | MODEL | Yes | - | The base video diffusion model. |
| model_patch | MODELPATCH | Yes | - | The model patch containing the audio projection layers. |
| positive | CONDITIONING | Yes | - | The positive conditioning to guide the generation. |
| negative | CONDITIONING | Yes | - | The negative conditioning to guide the generation. |
| vae | VAE | Yes | - | The VAE used to encode images into, and decode them from, the latent space. |
| width | INT | No | 16 - MAX_RESOLUTION | The width of the output video in pixels. Must be divisible by 16. (default: 832) |
| height | INT | No | 16 - MAX_RESOLUTION | The height of the output video in pixels. Must be divisible by 16. (default: 480) |
| length | INT | No | 1 - MAX_RESOLUTION | The number of frames to generate. (default: 81) |
| clip_vision_output | CLIPVISIONOUTPUT | No | - | Optional CLIP vision output for additional conditioning. |
| start_image | IMAGE | No | - | An optional starting image to initialize the video sequence. |
| audio_encoder_output_1 | AUDIOENCODEROUTPUT | Yes | - | The primary audio encoder output containing features for the first speaker. |
| motion_frame_count | INT | No | 1 - 33 | The number of previous frames to use as motion context when extending a sequence. (default: 9) |
| audio_scale | FLOAT | No | -10.0 - 10.0 | A scaling factor applied to the audio conditioning. (default: 1.0) |
| previous_frames | IMAGE | No | - | Optional previous video frames to extend from. |
| audio_encoder_output_2 | AUDIOENCODEROUTPUT | No | - | The second audio encoder output. Required when mode is set to "two_speakers". |
| mask_1 | MASK | No | - | The mask for the first speaker; required when using two audio inputs. |
| mask_2 | MASK | No | - | The mask for the second speaker; required when using two audio inputs. |
Parameter Constraints:
  • When mode is set to "two_speakers", the parameters audio_encoder_output_2, mask_1, and mask_2 become required.
  • If audio_encoder_output_2 is provided, both mask_1 and mask_2 must also be provided.
  • If mask_1 and mask_2 are provided, audio_encoder_output_2 must also be provided.
  • If previous_frames is provided, it must contain at least as many frames as specified by motion_frame_count.
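The constraints above can be sketched as a single validation routine. This is an illustrative helper, not the node's actual implementation; the function name and the convention of returning `True` or an error string are assumptions for the sketch.

```python
def validate_inputs(mode, audio_encoder_output_2=None, mask_1=None, mask_2=None,
                    previous_frames=None, motion_frame_count=9):
    """Hypothetical sketch of the parameter constraints described above.

    Returns True when the inputs are consistent, or an error string.
    """
    # In "two_speakers" mode the second audio input and both masks are required.
    if mode == "two_speakers":
        if audio_encoder_output_2 is None or mask_1 is None or mask_2 is None:
            return "two_speakers mode requires audio_encoder_output_2, mask_1, and mask_2"
    # The second audio input and the two masks must be provided together.
    if (audio_encoder_output_2 is not None) != (mask_1 is not None and mask_2 is not None):
        return "audio_encoder_output_2, mask_1, and mask_2 must be provided together"
    # Previous frames must cover the requested motion context.
    if previous_frames is not None and len(previous_frames) < motion_frame_count:
        return f"previous_frames must contain at least {motion_frame_count} frames"
    return True
```

For example, calling it with `mode="two_speakers"` but no masks would return an error string rather than `True`.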

Outputs

| Output Name | Data Type | Description |
| --- | --- | --- |
| model | MODEL | The patched model with audio conditioning applied. |
| positive | CONDITIONING | The positive conditioning, potentially modified with additional context (e.g., start image, CLIP vision). |
| negative | CONDITIONING | The negative conditioning, potentially modified with additional context. |
| latent | LATENT | The generated video sequence in latent space. |
| trim_image | INT | The number of frames from the start of the motion context that should be trimmed when extending a sequence. |
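When chaining generations, the `trim_image` output tells downstream nodes how many of the newly decoded frames duplicate the motion-context overlap and should be dropped before concatenation. A minimal sketch of that bookkeeping, with a hypothetical helper name and plain Python lists standing in for image batches:

```python
def extend_sequence(existing_frames, new_frames, trim_image):
    """Hypothetical helper: append newly decoded frames to a running
    sequence, dropping the first `trim_image` frames of the new batch,
    which overlap the motion context taken from the existing sequence."""
    return existing_frames + new_frames[trim_image:]

# Frames 2 and 3 were fed back in as motion context, so the new batch
# repeats them; trim_image=2 removes the duplicates before appending.
combined = extend_sequence([1, 2, 3], [2, 3, 4, 5], trim_image=2)
# combined is [1, 2, 3, 4, 5]
```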