The Kandinsky5ImageToVideo node prepares conditioning and latent space data for video generation using the Kandinsky model. It creates an empty video latent tensor and can optionally encode a starting image to guide the initial frames of the generated video, modifying the positive and negative conditioning accordingly.

Inputs

| Parameter | Data Type | Required | Range | Description |
|---|---|---|---|---|
| positive | CONDITIONING | Yes | N/A | The positive conditioning prompts to guide the video generation. |
| negative | CONDITIONING | Yes | N/A | The negative conditioning prompts to steer the video generation away from certain concepts. |
| vae | VAE | Yes | N/A | The VAE model used to encode the optional starting image into the latent space. |
| width | INT | No | 16 to 8192 (step 16) | The width of the output video in pixels (default: 768). |
| height | INT | No | 16 to 8192 (step 16) | The height of the output video in pixels (default: 512). |
| length | INT | No | 1 to 8192 (step 4) | The number of frames in the video (default: 121). |
| batch_size | INT | No | 1 to 4096 | The number of video sequences to generate simultaneously (default: 1). |
| start_image | IMAGE | No | N/A | An optional starting image. If provided, it is encoded and used to replace the noisy start of the model’s output latents. |

Note: When a start_image is provided, it is automatically resized to the specified width and height using bilinear interpolation. Only the first length frames of the image batch (where length is the frame-count input) are encoded. The encoded latent is then injected into both the positive and negative conditioning to guide the video’s initial appearance.
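
The ranges in the table and the frame-selection rule in the note can be checked before a workflow is queued. The following is a minimal sketch, not part of the node itself: `PARAM_SPECS`, `check_params`, and `frames_used` are hypothetical helper names, and the reading of "step" as a grid starting at the minimum value is an assumption (it is consistent with the documented defaults).

```python
# Ranges copied from the Inputs table: name -> (min, max, step, default).
PARAM_SPECS = {
    "width": (16, 8192, 16, 768),
    "height": (16, 8192, 16, 512),
    "length": (1, 8192, 4, 121),
    "batch_size": (1, 4096, 1, 1),
}


def check_params(**overrides):
    """Return a full parameter dict, raising ValueError on invalid values."""
    unknown = set(overrides) - set(PARAM_SPECS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {}
    for name, (lo, hi, step, default) in PARAM_SPECS.items():
        value = overrides.get(name, default)
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} is outside {lo}..{hi}")
        # Assumption: valid values are lo, lo + step, lo + 2 * step, ...
        if (value - lo) % step != 0:
            raise ValueError(f"{name}={value} is off the step grid (step {step})")
        params[name] = value
    return params


def frames_used(num_start_frames, length):
    """Number of start_image frames that are encoded (the first `length`)."""
    return min(num_start_frames, length)
```

For example, `check_params(width=1024, length=61)` passes, while `check_params(length=120)` raises because 120 is not of the form 1 + 4k. With a single start image, `frames_used(1, 121)` is 1, so only the first frame is encoded.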

Outputs

| Output Name | Data Type | Description |
|---|---|---|
| positive | CONDITIONING | The modified positive conditioning, potentially updated with encoded start image data. |
| negative | CONDITIONING | The modified negative conditioning, potentially updated with encoded start image data. |
| latent | LATENT | A zero-filled video latent tensor, shaped for the specified dimensions. |
| cond_latent | LATENT | The clean, encoded latent representation of the provided start images. This is used internally to replace the noisy beginning of the generated video latents. |
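
To give a feel for what "shaped for the specified dimensions" can mean for the latent output, here is a rough sketch of the shape arithmetic. This page does not state the latent layout, so the channel count (16), the 8x spatial compression, and the 4x temporal compression below are all assumptions chosen as common values for video VAEs; `empty_latent_shape` is a hypothetical helper, not a ComfyUI function.

```python
def empty_latent_shape(width, height, length, batch_size=1,
                       channels=16, spatial_factor=8, temporal_factor=4):
    """Return an assumed (batch, channels, frames, height, width) latent shape.

    All three compression-related defaults are assumptions, not values
    confirmed by the node documentation.
    """
    # Assumption: the first frame maps to its own latent frame, and each
    # following group of `temporal_factor` frames maps to one more.
    latent_frames = (length - 1) // temporal_factor + 1
    return (
        batch_size,
        channels,
        latent_frames,
        height // spatial_factor,
        width // spatial_factor,
    )


# Shape for the documented defaults (768 x 512, 121 frames, batch size 1).
default_shape = empty_latent_shape(768, 512, 121)
print(default_shape)  # (1, 16, 31, 64, 96)
```

Under these assumptions, the step of 4 on length fits the frame formula: each valid length of the form 1 + 4k maps to exactly k + 1 latent frames with no remainder.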