CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models

¹ CompVis @ LMU Munich, MCML
² Apple

Overview

Text-to-image generative models have become a prominent and powerful tool that excels at generating high-resolution realistic images. However, guiding the generative process of these models to consider detailed forms of conditioning reflecting style and/or structure information remains an open problem. In this paper, we present LoRAdapter, an approach that unifies both style and structure conditioning under the same formulation using a novel conditional LoRA block that enables zero-shot control. LoRAdapter is an efficient, powerful, and architecture-agnostic approach to condition text-to-image diffusion models, which enables fine-grained control conditioning during generation and outperforms recent state-of-the-art approaches.

How it works

Following the standard LoRA method, we keep the original weight matrix W_0 frozen and add two new trainable weight matrices A and B to each layer i that we want to adapt. Usually, A and B would be trained on a small dataset to capture a specific style or subject, resulting in an adapter that is fixed at inference time. Instead, we propose to dynamically apply a transformation φ to the embedding produced by the first LoRA matrix A. In practice, we implement φ as an affine transformation with scale and shift parameters γ and β, respectively. These are predicted by a mapping network that depends on the conditioning c.
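The description above can be read as h = W_0 x + B(γ(c) ⊙ (A x) + β(c)). The following is a minimal NumPy sketch of that reading for a single linear layer, not the released implementation. The dimensions, the zero initialization of B (borrowed from standard LoRA), and the single-linear-layer `mapping_network` are all assumptions made for illustration; the actual mapping network and the layers being adapted may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
d_in, d_out, rank, d_cond = 8, 6, 4, 5

W0 = rng.normal(size=(d_out, d_in))      # frozen pretrained weight W_0
A = rng.normal(size=(rank, d_in)) * 0.1  # trainable LoRA down-projection A
B = np.zeros((d_out, rank))              # trainable LoRA up-projection B (zero-init, as in standard LoRA)

# Hypothetical mapping network: one linear layer that maps the conditioning c
# to the scale gamma and shift beta. The bias starts at gamma = 1, beta = 0.
W_map = rng.normal(size=(2 * rank, d_cond)) * 0.1
b_map = np.concatenate([np.ones(rank), np.zeros(rank)])


def mapping_network(c):
    """Predict (gamma, beta), each of shape (rank,), from the conditioning c."""
    out = W_map @ c + b_map
    return out[:rank], out[rank:]


def conditional_lora_forward(x, c):
    """Compute W0 x + B * phi(A x), with phi(z) = gamma(c) * z + beta(c)."""
    gamma, beta = mapping_network(c)
    z = A @ x                # low-rank embedding from the first LoRA matrix
    z = gamma * z + beta     # affine transformation phi, conditioned on c
    return W0 @ x + B @ z    # frozen path plus conditional low-rank update


x = rng.normal(size=d_in)
c = rng.normal(size=d_cond)
h = conditional_lora_forward(x, c)  # shape (d_out,)
```

Because γ and β are recomputed from c on every call, the same trained A and B yield a different low-rank update for each conditioning input, which is what distinguishes this from an adapter that is fixed at inference time.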

Qualitative Comparison

Style

Samples from our method with style conditioning compared against other methods. We used an empty prompt and only conditioned on the image. We generally perform on par with IP-Adapter and outperform it on some samples. Note that the third image from the left is less degraded, and the third image from the right captures the mane of the horse better.

Structure

Samples from our method with structural conditioning compared against other methods. Note that for our method, especially compared with T2I Adapter, the details of the images are substantially more closely aligned with the depth prompt (see, e.g., the lamp in the background of the living room scene, the side table's legs, or the salad on the pizza).

Quantitative Comparison

Style

Best results are in bold. LoRAdapter needs the fewest parameters and is able to achieve state-of-the-art performance while also enabling direct structure control.

Structure

Best results are in bold. We evaluate cycle consistency (MSE-d), FID and LPIPS. Configurations A and B differ in the number of layers that are adapted, which results in a different number of parameters. LoRAdapter outperforms all other methods in all metrics.
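As a rough illustration of the cycle-consistency metric, the sketch below assumes that MSE-d denotes the mean squared error between the depth map used as conditioning and a depth map re-estimated from the generated image. This page does not define the metric, so that reading, the function name `mse_depth`, and the lack of any normalization are assumptions. The depth estimator itself is left out; both depth maps are taken as given arrays.

```python
import numpy as np


def mse_depth(cond_depth, estimated_depth):
    """Mean squared error between two depth maps of identical shape.

    cond_depth: depth map used to condition generation.
    estimated_depth: depth map predicted from the generated image
        (assumed to come from some external depth estimator).
    """
    cond = np.asarray(cond_depth, dtype=np.float64)
    est = np.asarray(estimated_depth, dtype=np.float64)
    if cond.shape != est.shape:
        raise ValueError("depth maps must have the same shape")
    return float(np.mean((cond - est) ** 2))
```

Under this reading, a lower value means the generated image follows the structure of the conditioning depth map more closely.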

BibTeX


@misc{stracke2024loradapter,
  title={CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control \& Altering of T2I Models},
  author={Nick Stracke and Stefan Andreas Baumann and Joshua Susskind and Miguel Angel Bautista and Björn Ommer},
  year={2024},
  eprint={2405.07913},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}