We recommend setting up the environment as follows:
conda create -n ingredients python=3.11.0
conda activate ingredients
pip install -r requirements.txt
The model weights are available at 🤗 HuggingFace.
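If you prefer a scripted download, the weights can also be fetched with huggingface_hub. This is a minimal sketch; the repo id below is an assumption, so substitute the actual id from the model page.

# Sketch: download the model weights locally with huggingface_hub.
# NOTE: "feizc/Ingredients" is an assumed repo id; replace it with the
# id shown on the HuggingFace model page.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="feizc/Ingredients",  # assumed repo id
    local_dir="./ckpts",          # point --model_path here later
)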
We provide an inference script, infer.py,
for simple testing. Run the command as in the following example:
python infer.py \
--prompt "Two men in half bodies, are seated in a dimly lit room, possibly an office or meeting room, with a formal atmosphere." \
--model_path "/path/to/model" \
--seed 2025 \
--img_file_path 'asserts/0.jpg' 'asserts/1.jpg'
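To sweep over several seeds or prompts, a small driver can wrap the command above. This is only a convenience sketch around infer.py, not part of the repo.

# Sketch: run infer.py over multiple seeds via subprocess.
# Reuses the same arguments as the example command above.
import subprocess

prompt = ("Two men in half bodies, are seated in a dimly lit room, "
          "possibly an office or meeting room, with a formal atmosphere.")

for seed in (2025, 2026, 2027):
    subprocess.run(
        [
            "python", "infer.py",
            "--prompt", prompt,
            "--model_path", "/path/to/model",
            "--seed", str(seed),
            "--img_file_path", "asserts/0.jpg", "asserts/1.jpg",
        ],
        check=True,  # stop if a run fails
    )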
We also include the evaluation metrics code in the metric
folder, along with evaluation data, for comparing results on multi-ID customization tasks.
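As a rough illustration of the kind of identity-preservation metric used in multi-ID evaluation, one common choice is the cosine similarity between a face embedding of the reference photo and embeddings from frames of the generated video. The sketch below assumes the embeddings have already been extracted with a face recognition model; it is not the repo's actual metric code.

# Sketch: identity similarity as mean cosine similarity between face embeddings.
# Assumes ref_embedding and frame_embeddings come from some face recognition
# encoder (e.g. an ArcFace-style model); the extraction step is not shown.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_score(ref_embedding, frame_embeddings):
    # Average similarity between the reference face and each sampled frame.
    sims = [cosine_similarity(ref_embedding, f) for f in frame_embeddings]
    return float(np.mean(sims))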
Similar to ConsisID, Ingredients has high requirements for prompt quality. We suggest referring to the prompt format described in the link.
We highly recommend trying out our web demo with the following command; it incorporates all features currently supported by Ingredients.
python app.py
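For reference, a demo of this kind is typically a thin Gradio front end over the inference pipeline. The sketch below only illustrates that structure; every name in it is an assumption and the generation call is left as a placeholder, since app.py is the actual implementation.

# Sketch: a minimal Gradio front end in the spirit of app.py.
# All component and function names are illustrative assumptions.
import gradio as gr

def generate_video(prompt, face_image_1, face_image_2, seed):
    # Placeholder: the real demo wires this to the Ingredients pipeline
    # and returns the path of the generated video.
    raise NotImplementedError("connect to the Ingredients pipeline")

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Image(type="filepath", label="Reference face 1"),
        gr.Image(type="filepath", label="Reference face 2"),
        gr.Number(value=2025, label="Seed"),
    ],
    outputs=gr.Video(label="Generated video"),
)

if __name__ == "__main__":
    demo.launch()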
Coming soon: multi-stage training scripts and multi-ID text-video datasets.
You can prepare the video-text pair data in the required format (a hypothetical manifest is sketched after the commands below), and our experiments can be reproduced by simply running the training scripts:
# For stage 1
bash train_face.sh
# For stage 2
bash train_router.sh
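The expected data format is not spelled out here; as a purely hypothetical illustration, a video-text pair manifest could be a JSONL file where each line pairs a clip with its caption and reference face crops. All field names below are assumptions, not the format the training scripts actually read.

# Sketch: write a hypothetical video-text pair manifest as JSONL.
# The field names ("video", "caption", "face_images") are assumptions.
import json

samples = [
    {
        "video": "videos/clip_0001.mp4",
        "caption": "Two men are seated in a dimly lit office.",
        "face_images": ["faces/clip_0001_id0.jpg", "faces/clip_0001_id1.jpg"],
    },
]

with open("train_manifest.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")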
If you find this work useful for your research and applications, please cite us using this BibTeX:
@article{fei2025ingredients,
title={Ingredients: Blending Custom Photos with Video Diffusion Transformers},
author={Fei, Zhengcong and Li, Debang and Qiu, Di and Yu, Changqian and Fan, Mingyuan},
journal={arXiv preprint arXiv:2501.01790},
year={2025}
}
For any questions, please feel free to open an issue.
This project wouldn't be possible without the following open-source repositories: CogVideoX, ConsisID, Uniportrait, and Hunyuan Video.