We recommend setting up the environment as follows:
conda create -n ingredients python=3.11.0
conda activate ingredients
pip install -r requirements.txt
The model weights are available at 🤗 HuggingFace.
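If you prefer a scripted download, the weights can also be fetched with huggingface_hub. This is a minimal sketch; the repo id below is an assumption, so substitute the actual id from the model page.

# Sketch: download the model weights locally with huggingface_hub.
# NOTE: "feizc/Ingredients" is an assumed repo id; replace it with the
# id shown on the HuggingFace model page.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="feizc/Ingredients",  # assumed repo id
    local_dir="./ckpts",          # point --model_path here later
)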
We provide an inference script, infer.py,
for simple testing. Run the command as in the following example:
python infer.py \
--prompt "Two men in half bodies, are seated in a dimly lit room, possibly an office or meeting room, with a formal atmosphere." \
--model_path "/path/to/model" \
--seed 2025 \
--img_file_path 'asserts/0.jpg' 'asserts/1.jpg'
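To sweep over several seeds or prompts, a small driver can wrap the command above. This is only a convenience sketch around infer.py, not part of the repo.

# Sketch: run infer.py over multiple seeds via subprocess.
# Reuses the same arguments as the example command above.
import subprocess

prompt = ("Two men in half bodies, are seated in a dimly lit room, "
          "possibly an office or meeting room, with a formal atmosphere.")

for seed in (2025, 2026, 2027):
    subprocess.run(
        [
            "python", "infer.py",
            "--prompt", prompt,
            "--model_path", "/path/to/model",
            "--seed", str(seed),
            "--img_file_path", "asserts/0.jpg", "asserts/1.jpg",
        ],
        check=True,  # stop if a run fails
    )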
We also include the evaluation metrics code in the metric
folder, along with evaluation data, for comparing results on multi-ID customization tasks.
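As a rough illustration of the kind of identity-preservation metric used in multi-ID evaluation, one common choice is the cosine similarity between a face embedding of the reference photo and embeddings from frames of the generated video. The sketch below assumes the embeddings have already been extracted with a face recognition model; it is not the repo's actual metric code.

# Sketch: identity similarity as mean cosine similarity between face embeddings.
# Assumes ref_embedding and frame_embeddings come from some face recognition
# encoder (e.g. an ArcFace-style model); the extraction step is not shown.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_score(ref_embedding, frame_embeddings):
    # Average similarity between the reference face and each sampled frame.
    sims = [cosine_similarity(ref_embedding, f) for f in frame_embeddings]
    return float(np.mean(sims))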
Similar to ConsisID, Ingredients has high requirements for prompt quality. We suggest referring to the prompt format described in the link.
We highly recommend trying out our web demo with the following command; it incorporates all features currently supported by Ingredients.
python app.py
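For reference, a demo of this kind is typically a thin Gradio front end over the inference pipeline. The sketch below only illustrates that structure; every name in it is an assumption and the generation call is left as a placeholder, since app.py is the actual implementation.

# Sketch: a minimal Gradio front end in the spirit of app.py.
# All component and function names are illustrative assumptions.
import gradio as gr

def generate_video(prompt, face_image_1, face_image_2, seed):
    # Placeholder: the real demo wires this to the Ingredients pipeline
    # and returns the path of the generated video.
    raise NotImplementedError("connect to the Ingredients pipeline")

demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Image(type="filepath", label="Reference face 1"),
        gr.Image(type="filepath", label="Reference face 2"),
        gr.Number(value=2025, label="Seed"),
    ],
    outputs=gr.Video(label="Generated video"),
)

if __name__ == "__main__":
    demo.launch()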
Coming soon: multi-stage training scripts and multi-ID text-video datasets.
You can prepare the video-text pair data in the required format (a hypothetical manifest is sketched after the commands below), and our experiments can be reproduced by simply running the training scripts:
# For stage 1
bash train_face.sh
# For stage 2
bash train_router.sh
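The expected data format is not spelled out here; as a purely hypothetical illustration, a video-text pair manifest could be a JSONL file where each line pairs a clip with its caption and reference face crops. All field names below are assumptions, not the format the training scripts actually read.

# Sketch: write a hypothetical video-text pair manifest as JSONL.
# The field names ("video", "caption", "face_images") are assumptions.
import json

samples = [
    {
        "video": "videos/clip_0001.mp4",
        "caption": "Two men are seated in a dimly lit office.",
        "face_images": ["faces/clip_0001_id0.jpg", "faces/clip_0001_id1.jpg"],
    },
]

with open("train_manifest.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")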
If you find this work useful for your research and applications, please cite us using this BibTeX:
@article{fei2025ingredients,
title={Ingredients: Blending Custom Photos with Video Diffusion Transformers},
author={Fei, Zhengcong and Li, Debang and Qiu, Di and Yu, Changqian and Fan, Mingyuan},
journal={arXiv preprint arXiv:2501.01790},
year={2025}
}
For any questions, please feel free to open an issue.
This project wouldn't be possible without the following open-source repositories: CogVideoX, ConsisID, Uniportrait, and Hunyuan Video.