Cross-Image Attention for Zero-Shot Appearance Transfer

Tel Aviv University

Given two images depicting a source structure and a target appearance, our method generates an image merging the structure of one image with the appearance of the other. We do so in a zero-shot manner, requiring no optimization or model training, while supporting appearance transfer across images that may differ in size and shape.

Abstract

Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images, one depicting the target structure and the other specifying the desired appearance, our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that manipulate either the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.

Results



How does it work?

  • Given a structure image and an appearance image, we begin by inverting the two images into the latent space of a pretrained image diffusion model (minimal code sketches of the steps below follow this list).
  • At each timestep \(t\), we replace the self-attention layer with our cross-image attention layer, mixing the keys and values from \(z_t^{app}\) with the queries of \(z_t^{out}\).
  • To improve the output image quality, we introduce three extensions.
    1. We apply a contrast operation to the attention maps, encouraging the queries \(Q_{out}\) to attend to a smaller set of keys in \(K_{app}\).
    2. We introduce an appearance guidance mechanism akin to the classifier-free guidance used in text-guided image synthesis.
    3. We apply an AdaIN operation to \(z_{t-1}^{out}\) to better align its feature statistics with those of \(z_{t-1}^{app}\).
  • This process is repeated across multiple timesteps of the denoising process and multiple decoder layers of the network, resulting in a gradual transfer of appearance.
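
The sketches below illustrate these steps in PyTorch; they are minimal, hedged sketches rather than the official implementation. First, the inversion step: the snippet uses plain DDIM inversion as one standard choice, with `unet` (a noise predictor) and `alphas_cumprod` (the scheduler's cumulative alphas) as assumed inputs; the actual method may use a different inversion scheme.

    import torch

    @torch.no_grad()
    def ddim_invert(z0, unet, alphas_cumprod, timesteps):
        """Map a clean latent z0 to a noisy latent z_T by running the
        deterministic DDIM update in reverse (illustrative inversion choice).

        z0:             (B, C, H, W) latent of the encoded input image
        unet(z, t):     assumed noise predictor epsilon_theta(z, t)
        alphas_cumprod: 1-D tensor of cumulative alphas, indexed by timestep
        timesteps:      increasing sequence of timesteps, e.g. [1, 21, ..., T]
        """
        z = z0
        for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
            a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
            eps = unet(z, t_cur)
            # Estimate the clean latent at the current noise level, then
            # re-noise it to the next (higher) level using the same eps.
            z0_pred = (z - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
            z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
        return z

Both input images are inverted in the same manner, yielding the noisy latents used during denoising.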
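
Next, the cross-image attention layer itself: it is a small change to scaled dot-product self-attention in which the queries come from the output (structure) branch while the keys and values come from the appearance branch. The sketch also folds in the contrast operation of the first extension; the exact contrast formulation and the value of `beta` are illustrative assumptions.

    import torch

    def cross_image_attention(q_out, k_app, v_app, beta=1.5):
        """Cross-image attention with a simple contrast operation on the map.

        q_out: (B, heads, N_out, d) queries from the output/structure branch
        k_app: (B, heads, N_app, d) keys from the appearance branch
        v_app: (B, heads, N_app, d) values from the appearance branch
        beta:  contrast strength; > 1 sharpens the map (illustrative value)
        """
        d = q_out.shape[-1]
        attn = torch.softmax(q_out @ k_app.transpose(-1, -2) / d**0.5, dim=-1)

        # Contrast: push each row of the attention map away from its mean so
        # that every query attends to fewer appearance keys, then renormalize.
        mean = attn.mean(dim=-1, keepdim=True)
        attn = (mean + beta * (attn - mean)).clamp(min=0)
        attn = attn / attn.sum(dim=-1, keepdim=True)

        return attn @ v_app  # (B, heads, N_out, d)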
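
The appearance guidance of the second extension mirrors classifier-free guidance: the denoiser is evaluated once with vanilla self-attention and once with cross-image attention, and the two noise predictions are extrapolated. In this sketch, `eps_self` and `eps_cross` stand for those two U-Net evaluations, and the guidance scale is an illustrative value.

    def appearance_guidance(eps_self, eps_cross, scale=3.0):
        """Appearance guidance, analogous to classifier-free guidance.

        eps_self:  noise prediction with vanilla self-attention (no appearance)
        eps_cross: noise prediction with cross-image attention (appearance injected)
        scale:     guidance strength (illustrative value)
        """
        return eps_self + scale * (eps_cross - eps_self)

The guided prediction then takes the place of the usual noise estimate in the denoising update for \(z_t^{out}\).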
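
Finally, the AdaIN step of the third extension matches the per-channel statistics of the output latent to those of the appearance latent at the same timestep. This is the standard AdaIN formulation, sketched under the assumption that the latents are (B, C, H, W) tensors.

    def adain(z_out, z_app, eps=1e-5):
        """Shift the channel-wise mean/std of z_out to match those of z_app.

        z_out, z_app: (B, C, H, W) noisy latents at the same timestep; their
        spatial sizes may differ since only per-channel statistics are used.
        """
        mu_out = z_out.mean(dim=(2, 3), keepdim=True)
        std_out = z_out.std(dim=(2, 3), keepdim=True)
        mu_app = z_app.mean(dim=(2, 3), keepdim=True)
        std_app = z_app.std(dim=(2, 3), keepdim=True)
        return (z_out - mu_out) / (std_out + eps) * std_app + mu_app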

Cross-Image Attention for Implicit Correspondences

  • Our cross-image attention implicitly establishes semantic correspondences across images.
  • For each query (marked in red, green, and yellow), we compute the attention map between that query and all keys at a specific attention layer (a code sketch of this computation follows the list).
  • In the first two rows, we show the self-attention maps, which focus on semantically similar regions in the image. For instance, the yellow query attends to the legs of the giraffe in the structure image and to nearby grass pixels in the background of the appearance image.
  • In the bottom row, we use our cross-image attention. Now, each query on the giraffe corresponds to semantically similar regions of the zebra. For example, the red query attends to the head of the zebra, while the yellow query attends to its legs.
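
The visualization above can be reproduced by extracting the queries and keys of a single attention layer (e.g., via forward hooks) and inspecting the attention row of one chosen query pixel. A minimal sketch, with the hook code omitted and the tensor shapes assumed:

    import torch

    def attention_map_for_query(q_struct, k_app, query_index, app_hw):
        """Attention map of one structure query over all appearance keys.

        q_struct:    (N_struct, d) queries of one attention layer (structure branch)
        k_app:       (N_app, d) keys of the same layer (appearance branch);
                     pass the structure image's own keys for a self-attention map
        query_index: flattened index of the probed query pixel (e.g. on the giraffe)
        app_hw:      (h, w) spatial shape of the appearance features, h * w == N_app
        """
        d = q_struct.shape[-1]
        scores = q_struct[query_index] @ k_app.T / d**0.5  # (N_app,)
        attn = torch.softmax(scores, dim=-1)
        return attn.reshape(app_hw)  # upsample and overlay on the image for display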

Cross-Domain Appearance Transfer

Our approach can transfer appearance between objects from different domains. This transfer is possible even in a zero-shot setting thanks to the strong semantic correspondences already captured by the diffusion model itself.


Additional Results

Additional appearance transfer results obtained by our method.


BibTeX


      @misc{alaluf2023crossimage,
        title={Cross-Image Attention for Zero-Shot Appearance Transfer}, 
        author={Yuval Alaluf and Daniel Garibi and Or Patashnik and Hadar Averbuch-Elor and Daniel Cohen-Or},
        year={2023},
        eprint={2311.03335},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }