Subject-Specific Concept Control

Abstract

In recent years, advances in text-to-image (T2I) diffusion models have substantially improved the quality of generated images. However, fine-grained control over attributes remains a challenge because of the limitations of natural language prompts: for example, no continuous set of intermediate descriptions exists between "person" and "old person". Although many methods have been introduced that augment the model or the generation process to enable such control, methods that do not require a fixed reference image are limited to either global fine-grained control of attribute expression or coarse control localized to specific subjects, but not both simultaneously. We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained, subject-specific control of high-level attributes in text-to-image models. Based on this observation, we introduce one efficient optimization-free method and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts. We demonstrate that these directions can be used to augment the prompt text input with fine-grained control over attributes of specific subjects in a compositional manner (control over multiple attributes of a single subject) without having to adapt the diffusion model.

The images below each contain two subjects. We vary one attribute for each subject independently to showcase our subject-specific control. When looking at a single row or a single column, note how the given attribute (e.g. age) of one person changes while it stays constant for the other person.

We vary age for the man and the woman independently.

We vary width for the man and the woman independently.

We vary age for the man and the woman independently.

Method

Our method is based on the following findings (see paper for in-depth analysis):

Diffusion models are able to interpret modified prompt embeddings e' that do not directly correspond to a possible text prompt.
Fine-grained changes of a specific token embedding apply localized changes.
The space of token embeddings approximates a Euclidean space across similar anchor points, e.g. young <-> old for people. This enables composable, smooth local edits along fixed category-specific directions ∆e.
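The findings above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the embedding shape (77 tokens, 768 dimensions), the random stand-in embeddings, and the helper names `estimate_direction` and `apply_delta` are assumptions made for illustration. The direction is estimated here as a plain mean difference between subject-token embeddings from contrastive prompts, which is only one simple way to obtain a ∆e.

```python
import numpy as np


def estimate_direction(pos_embs: np.ndarray, neg_embs: np.ndarray) -> np.ndarray:
    """Hypothetical helper: mean difference between subject-token embeddings
    taken from contrastive prompts (e.g. "old person" vs. "young person")."""
    return pos_embs.mean(axis=0) - neg_embs.mean(axis=0)


def apply_delta(prompt_emb: np.ndarray, token_index: int,
                delta: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Return a modified prompt embedding e' in which only the embedding of
    one subject token is shifted along the direction `delta`."""
    edited = prompt_emb.copy()
    edited[token_index] += scale * delta
    return edited


# Random stand-ins for real text-encoder outputs (assumed shapes).
rng = np.random.default_rng(0)
prompt_emb = rng.normal(size=(77, 768))      # one embedding per token
pos_embs = rng.normal(size=(8, 768)) + 0.5   # subject token in "old ..." prompts
neg_embs = rng.normal(size=(8, 768)) - 0.5   # subject token in "young ..." prompts

delta_e = estimate_direction(pos_embs, neg_embs)
edited = apply_delta(prompt_emb, token_index=2, delta=delta_e, scale=0.5)
```

Only the row at `token_index` changes, which mirrors the observation that edits to a specific token embedding stay localized to the corresponding subject.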

Compositional Attribute Editing

Our method allows multiple attributes of the same subject to be stacked and continuously modulated.

Age and Price

Makeup and Age

Smile and Width
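As a rough sketch of what stacking means at the embedding level, several scaled directions can be summed onto the same subject token. The function `compose_edits`, the attribute names, and the random vectors below are hypothetical placeholders, not the authors' code.

```python
import numpy as np


def compose_edits(prompt_emb: np.ndarray, edits) -> np.ndarray:
    """Apply several (token_index, delta, scale) edits to a copy of the
    prompt embedding; edits on the same token simply add up."""
    edited = prompt_emb.copy()
    for token_index, delta, scale in edits:
        edited[token_index] += scale * delta
    return edited


# Random stand-ins for a real prompt embedding and learned directions.
rng = np.random.default_rng(1)
prompt_emb = rng.normal(size=(77, 768))
delta_age = rng.normal(size=768)
delta_smile = rng.normal(size=768)

subject_token = 2  # assumed position of the subject token in the prompt
stacked = compose_edits(prompt_emb, [
    (subject_token, delta_age, 0.8),     # make the subject older
    (subject_token, delta_smile, -0.4),  # reduce the smile
])
```

Because the edits are additive, each scale can be adjusted independently of the others.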

Continuous Attribute Modification

We can scale our attribute deltas for a smooth interpolation between two anchors.

Smoothly applied age slider in both the negative (young) and positive direction (old).
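A slider of this kind can be sketched by sweeping the scale of a single direction from negative to positive values. The helper `attribute_slider`, the scale range, and the random inputs are assumptions for illustration; in practice each resulting embedding would be passed to the diffusion model to render one image.

```python
import numpy as np


def attribute_slider(prompt_emb: np.ndarray, token_index: int,
                     delta: np.ndarray, scales) -> np.ndarray:
    """Return one edited prompt embedding per scale value, each shifting the
    subject token by `scale * delta`."""
    frames = []
    for scale in scales:
        edited = prompt_emb.copy()
        edited[token_index] += scale * delta
        frames.append(edited)
    return np.stack(frames)


# Random stand-ins for a real prompt embedding and an age direction.
rng = np.random.default_rng(2)
prompt_emb = rng.normal(size=(77, 768))
delta_age = rng.normal(size=768)

# Negative scales move toward "young", positive scales toward "old".
scales = np.linspace(-2.0, 2.0, num=5)
frames = attribute_slider(prompt_emb, token_index=2, delta=delta_age, scales=scales)
```

The middle entry (scale 0) reproduces the unmodified prompt embedding, and consecutive entries differ by a constant step along the direction.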

BibTeX

@misc{baumann2024attributecontrol,
  title={{C}ontinuous, {S}ubject-{S}pecific {A}ttribute {C}ontrol in {T}2{I} {M}odels by {I}dentifying {S}emantic {D}irections},
  author={Stefan Andreas Baumann and Felix Krause and Michael Neumayr and Nick Stracke and Vincent Tao Hu and Bj{\"o}rn Ommer},
  year={2024},
  eprint={2403.17064},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
