This workflow generates a new image by fusing visual elements from two input images using vision-language models and a large language model for creative prompt synthesis.
First, an image style is selected. The style can either be chosen randomly from a predefined style list or provided manually as a fixed input.
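A minimal sketch of this selection step, outside ComfyUI: the style names and the `select_style` helper below are illustrative assumptions, not the workflow's actual node or list.

```python
import random

# Hypothetical style list; the list used by the workflow's style node may differ.
STYLES = ["cinematic", "watercolor", "cyberpunk", "oil painting", "analog film"]

def select_style(fixed_style: str | None = None, seed: int | None = None) -> str:
    """Return a manually fixed style if given, otherwise pick one at random."""
    if fixed_style:
        return fixed_style
    rng = random.Random(seed)  # optional seed for reproducible random choice
    return rng.choice(STYLES)
```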
Next, two input images are processed independently using two QwenVL-4B nodes. Each QwenVL node analyzes its input image and produces a detailed textual description of the visual content.
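Inside ComfyUI this step is handled by the QwenVL nodes themselves. As a rough standalone equivalent, a vision-capable model served by Ollama can be queried for a description; the model tag, prompt wording, and helper name below are assumptions, not part of the workflow.

```python
import ollama

def describe_image(image_path: str, model: str = "qwen2.5vl:7b") -> str:
    """Ask a vision-language model for a detailed description of one image."""
    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": "Describe this image in detail: subjects, setting, "
                       "composition, lighting, colors, and mood.",
            "images": [image_path],  # local file path; the client encodes it
        }],
    )
    return response["message"]["content"]
```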
The resulting image descriptions, together with the selected style, are then passed into an Ollama node. This node uses a large language model to extract, merge, and creatively recombine the key visual elements from both descriptions into a single, cohesive image prompt. The selected style is placed at the beginning of the prompt and is treated as the one and only style.
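A hedged sketch of this fusion step, again outside ComfyUI: the system instructions and function name are assumptions, while the default model follows the gpt-oss:120b recommendation below.

```python
import ollama

FUSION_INSTRUCTIONS = (
    "You are given two image descriptions and a target style. "
    "Extract the key visual elements from both descriptions and merge them "
    "into one cohesive image-generation prompt. Begin the prompt with the "
    "style '{style}' and use it as the only style. Return the prompt only."
)

def fuse_descriptions(desc_a: str, desc_b: str, style: str,
                      model: str = "gpt-oss:120b") -> str:
    """Combine two image descriptions into a single styled prompt."""
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": FUSION_INSTRUCTIONS.format(style=style)},
            {"role": "user",
             "content": f"Description 1:\n{desc_a}\n\nDescription 2:\n{desc_b}"},
        ],
    )
    return response["message"]["content"].strip()
```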
The default summarization and fusion model is gpt-oss:120b, but smaller models can be used to reduce VRAM requirements. Good results have also been achieved with Aya, Llama 3, and Qwen 3.
For image generation, the output resolution is determined dynamically: the workflow takes the total_pixels value derived from the first input image and scales the generated image to match it, so the result keeps roughly the same overall pixel count and level of detail as that input.
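The arithmetic behind this scaling can be sketched as follows; rounding to multiples of 64 and the helper name are assumptions, not details taken from the workflow.

```python
import math

def target_resolution(src_width: int, src_height: int,
                      aspect_ratio: float | None = None,
                      multiple: int = 64) -> tuple[int, int]:
    """Compute output width/height so the generated image keeps roughly the
    same total pixel count as the first input image."""
    total_pixels = src_width * src_height
    ratio = aspect_ratio if aspect_ratio is not None else src_width / src_height
    height = math.sqrt(total_pixels / ratio)
    width = height * ratio
    # Snap to a multiple the image model accepts (64 is a common assumption).
    width = max(multiple, round(width / multiple) * multiple)
    height = max(multiple, round(height / multiple) * multiple)
    return int(width), int(height)

# Example: a 3000x2000 source keeps ~6 MP in the generated image.
print(target_resolution(3000, 2000))  # (3008, 1984)
```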
Finally, the generated prompt, selected style, and computed image resolution are passed to the image generation node (Z-Image), which produces the final synthesized image.
Description
Extended version of my other workflow.