This node prepares data for training by encoding images and text. It takes a list of images and a corresponding list of text captions, then uses a VAE model to convert the images into latent representations and a CLIP model to convert the text into conditioning data. The resulting paired latents and conditioning are output as lists, ready for use in training workflows.

Inputs

| Parameter | Data Type | Required | Range | Description |
|-----------|-----------|----------|-------|-------------|
| images | IMAGE | Yes | N/A | List of images to encode. |
| vae | VAE | Yes | N/A | VAE model for encoding images to latents. |
| clip | CLIP | Yes | N/A | CLIP model for encoding text to conditioning. |
| texts | STRING | No | N/A | List of text captions. Can be length n (matching images), 1 (repeated for all), or omitted (uses empty string). |
Parameter Constraints:

- The `texts` list must contain 0, 1, or exactly as many items as the `images` list. With 0 items, an empty string is used for every image; with 1 item, that single caption is repeated for all images.
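The caption-matching rule above can be sketched as a small helper. This is a hypothetical illustration of the constraint, not the node's actual implementation; the function name `broadcast_texts` is invented here:

```python
def broadcast_texts(texts, num_images):
    """Expand a caption list to match the image count.

    Hypothetical sketch of the rule described above:
    0 captions -> empty string for every image,
    1 caption  -> repeated for all images,
    n captions -> used as-is; any other length is an error.
    """
    if not texts:
        return [""] * num_images
    if len(texts) == 1:
        return texts * num_images
    if len(texts) == num_images:
        return list(texts)
    raise ValueError(
        f"Expected 0, 1, or {num_images} captions, got {len(texts)}"
    )


# Example: one caption is repeated across three images.
print(broadcast_texts(["a photo of a cat"], 3))
# Example: no captions yields empty strings.
print(broadcast_texts([], 2))
```

Passing any other list length (for example, two captions for three images) is rejected rather than silently truncated or padded.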

Outputs

| Output Name | Data Type | Description |
|-------------|-----------|-------------|
| latents | LATENT | List of latent dictionaries, one per input image. |
| conditioning | CONDITIONING | List of conditioning entries, paired one-to-one with the latents. |