Yuval Alaluf*, Elad Richardson*, Gal Metzer, Daniel Cohen-Or
Tel Aviv University
* Denotes equal contribution
A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process. This choice greatly affects the visual fidelity, downstream editability, and disk space needed to store the learned concept. In this paper, we explore a new text-conditioning space that is dependent on both the denoising process timestep (time) and the denoising U-Net layers (space) and showcase its compelling properties. A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly. Instead, we propose to implicitly represent a concept in this space by optimizing a small neural mapper that receives the current time and space parameters and outputs the matching token embedding. In doing so, the entire personalized concept is represented by the parameters of the learned mapper, resulting in a compact, yet expressive, representation. Similarly to other personalization methods, the output of our neural mapper resides in the input space of the text encoder. We observe that one can significantly improve the convergence and visual fidelity of the concept by introducing a textual bypass, where our neural mapper additionally outputs a residual that is added to the output of the text encoder. Finally, we show how one can impose an importance-based ordering over our implicit representation, providing users control over the reconstruction and editability of the learned concept using a single trained model. We demonstrate the effectiveness of our approach over a range of concepts and prompts, showing our method's ability to generate high-quality and controllable compositions without fine-tuning any parameters of the generative model itself.
Personalization results of our method under a variety of prompts. Our expressive representation enables one to generate novel compositions of personalized concepts that achieve high visual fidelity and editability without tuning the generative model. The bottom row shows our method's unique ability to control the reconstruction-editability tradeoff at inference time with a single trained model.
Official implementation of our NeTI paper.
Our code relies on the environment in the official Stable Diffusion repository. To set up their environment, please run:
conda env create -f environment/environment.yaml
conda activate neti
In addition to these requirements, we added several others, which are listed in environment/requirements.txt. They are installed by the command above.
Hugging Face Diffusers Library
Our code relies on the diffusers library and the official Stable Diffusion v1.4 model.
Sample text-guided personalized generation results obtained with NeTI.
You can try out some of our trained models using our HuggingFace Spaces app here
As part of our code release and to assist with comparisons, we have also provided some of the trained models and datasets used in the paper.
All of our models can be found here. All datasets used from Textual Inversion can be found here.
Note that datasets taken from CustomDiffusion can be downloaded from their official implementation.
To train your own concept, you can simply run the scripts/train.py script and pass a config file specifying all training parameters. For example:
python scripts/train.py --config_path input_configs/train.yaml
Notes:
- All training arguments can be found in the RunConfig class in training/config.py and are set to their defaults according to the official paper.
- For parsing the config and its parameters, we use the pyrallis library.
To run inference on a trained model, you can run our scripts/inference.py script. An example config file is provided in input_configs/inference.yaml:
python scripts/inference.py --config_path input_configs/inference.yaml
Notes:
- You can either pass an input_dir and iteration, which we will then use to extract the corresponding mapper checkpoint and embeddings file, or you can directly pass specific values for mapper_checkpoint_path and learned_embeds_path.
- For specifying the prompts, you can either provide a list of prompts, or specify a path to a text file, where each line contains a prompt. For example:
  A photo of {}
  A photo of {} on a beach
  A colorful graffiti of {}
  - Note that the concept placement should be specified using {}.
  - We will replace {} with the concept's placeholder token that is saved in the mapper checkpoint.
- Prompts used in the paper's evaluations are provided in constants.py under PROMPTS.
- Please refer to the InferenceConfig class for more details on all parameters.
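The prompt handling described in the notes above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in scripts/inference.py: the helper names `load_prompts` and `fill_prompts` and the example token `<cat>` are hypothetical, and the real script reads the placeholder token from the mapper checkpoint.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def load_prompts(prompts: Optional[Sequence[str]] = None,
                 prompts_file: Optional[Path] = None) -> List[str]:
    """Return prompt templates from an explicit list or a one-per-line text file.

    Hypothetical helper: mirrors the two ways of specifying prompts, nothing more.
    """
    if prompts is not None:
        templates = list(prompts)
    elif prompts_file is not None:
        lines = Path(prompts_file).read_text(encoding="utf-8").splitlines()
        # Skip blank lines so a trailing newline does not yield an empty prompt.
        templates = [line.strip() for line in lines if line.strip()]
    else:
        raise ValueError("Provide either a list of prompts or a prompts file.")

    # Every template must mark where the concept goes.
    for template in templates:
        if "{}" not in template:
            raise ValueError(f"Prompt is missing the {{}} placeholder: {template!r}")
    return templates


def fill_prompts(templates: Sequence[str], placeholder_token: str) -> List[str]:
    """Substitute the concept's placeholder token for each {} marker."""
    # str.replace is used instead of str.format so other braces are left alone.
    return [template.replace("{}", placeholder_token) for template in templates]


if __name__ == "__main__":
    templates = load_prompts(prompts=["A photo of {}", "A photo of {} on a beach"])
    for prompt in fill_prompts(templates, "<cat>"):  # "<cat>" is an example token
        print(prompt)
```

Validating the templates up front is a design choice of this sketch: a prompt without {} would otherwise generate images that never mention the learned concept.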
All generated images will be saved to the path {cfg.inference_dir}/{prompt}. We will also save a grid of all images (in the case of multiple seeds) under {cfg.inference_dir}.
Using our dropout technique, users can control the balance between the generated image's visual and text fidelity at inference time.
To apply inference-time dropout, you can simply specify different values for truncation_idxs in the InferenceConfig.
If a list of truncation values is specified, then results for each truncation value will be saved separately.
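To give a feel for what a truncation value does, here is a minimal sketch that assumes the dropout behaves like nested dropout: the entries of a hidden vector at or beyond the truncation index are zeroed, and only the leading entries are kept. The function names and the toy vector are hypothetical and are not taken from the repository; the actual mapper operates on tensors inside the model.

```python
from typing import Dict, List, Sequence


def truncate_hidden(hidden: Sequence[float], truncation_idx: int) -> List[float]:
    """Keep the first `truncation_idx` entries of `hidden` and zero out the rest.

    Assumption: this mimics nested-dropout-style truncation on a plain list.
    """
    if not 0 <= truncation_idx <= len(hidden):
        raise ValueError(
            f"truncation_idx must be in [0, {len(hidden)}], got {truncation_idx}"
        )
    return [value if i < truncation_idx else 0.0 for i, value in enumerate(hidden)]


def sweep_truncations(hidden: Sequence[float],
                      truncation_idxs: Sequence[int]) -> Dict[int, List[float]]:
    """Apply each truncation value separately, one result per value."""
    return {idx: truncate_hidden(hidden, idx) for idx in truncation_idxs}


if __name__ == "__main__":
    toy_hidden = [0.9, -0.4, 0.2, 0.7]  # stand-in for a mapper hidden vector
    for idx, truncated in sweep_truncations(toy_hidden, [1, 2, 4]).items():
        print(idx, truncated)
```

Under this assumption, a smaller truncation value keeps less of the learned representation, which is how a single trained model could expose a range of behaviors at inference time.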
We use the same evaluation protocol as used in Textual Inversion. The main logic for computing the metrics can be found here.
Our code builds on the diffusers implementation of textual inversion and the unofficial implementation of XTI from cloneofsimo.
If you use this code for your research, please cite the following work:
@misc{alaluf2023neural,
title={A Neural Space-Time Representation for Text-to-Image Personalization},
author={Yuval Alaluf and Elad Richardson and Gal Metzer and Daniel Cohen-Or},
year={2023},
eprint={2305.15391},
archivePrefix={arXiv},
primaryClass={cs.CV}
}