The TextEncodeHunyuanVideo_ImageToVideo node creates conditioning data for video generation by combining a text prompt with image embeddings. It uses a CLIP model to process the text input together with visual information from a CLIP vision output, generating tokens that blend the two sources according to the `image_interleave` setting.
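The interleaving idea can be illustrated with a small sketch (this is a conceptual illustration, not the node's actual implementation): one image token is inserted per `image_interleave` text tokens, so larger values spread image tokens more thinly and give the text prompt relatively more influence.

```python
def interleave(text_tokens, image_tokens, image_interleave):
    """Illustrative only: merge one image token per `image_interleave`
    text tokens. Higher values -> sparser image tokens -> the text
    prompt dominates the blended sequence."""
    merged = []
    remaining = list(image_tokens)
    for i, tok in enumerate(text_tokens):
        if remaining and i % image_interleave == 0:
            merged.append(remaining.pop(0))
        merged.append(tok)
    merged.extend(remaining)  # any leftover image tokens go at the end
    return merged


# With image_interleave=2, image tokens appear twice as often as with 4:
print(interleave(["t1", "t2", "t3", "t4"], ["i1", "i2"], 2))
print(interleave(["t1", "t2", "t3", "t4"], ["i1", "i2"], 4))
```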

Inputs

| Parameter | Data Type | Required | Range | Description |
|---|---|---|---|---|
| `clip` | CLIP | Yes | - | The CLIP model used for tokenization and encoding |
| `clip_vision_output` | CLIP_VISION_OUTPUT | Yes | - | Visual embeddings from a CLIP vision model that provide image context |
| `prompt` | STRING | Yes | - | The text description guiding the video generation; supports multiline input and dynamic prompts |
| `image_interleave` | INT | Yes | 1-512 | Balance between image and text influence; higher values give the text prompt more weight relative to the image (default: 2) |

Outputs

| Output Name | Data Type | Description |
|---|---|---|
| CONDITIONING | CONDITIONING | The conditioning data that combines text and image information for video generation |
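A ComfyUI node exposing these inputs and outputs could be declared roughly as follows. This is a hedged skeleton: the `INPUT_TYPES` entries mirror the tables above, but the `encode` body is a placeholder, not the node's real implementation.

```python
class TextEncodeHunyuanVideo_ImageToVideo:
    """Skeleton of the node's interface, following ComfyUI's
    class-attribute conventions. Illustrative only."""

    @classmethod
    def INPUT_TYPES(cls):
        # Input names and types as documented in the Inputs table.
        return {
            "required": {
                "clip": ("CLIP",),
                "clip_vision_output": ("CLIP_VISION_OUTPUT",),
                "prompt": ("STRING", {"multiline": True, "dynamicPrompts": True}),
                "image_interleave": ("INT", {"default": 2, "min": 1, "max": 512}),
            }
        }

    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "encode"

    def encode(self, clip, clip_vision_output, prompt, image_interleave):
        # Placeholder: the real node tokenizes the prompt together with
        # the image embeddings and encodes them with the CLIP model.
        raise NotImplementedError
```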