
🍌 Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Pico-Banana-400K is a large-scale dataset of ~400K text–image–edit triplets designed to advance research in text-guided image editing.
Each example contains:

  • an original image (from Open Images),
  • a human-like edit instruction, and
  • the edited result generated and verified by the Nano-Banana model.

The dataset spans 35 edit operations across 8 semantic categories, covering diverse transformations—from low-level color adjustments to high-level object, scene, and stylistic edits.


🧩 Key Features

| Feature | Description |
|---|---|
| Total Samples | ~257K single-turn text–image–edit triplets for SFT, ~56K single-turn text–image(positive)–image(negative)–edit examples for preference learning, and ~72K multi-turn text–image–edit sequences for multi-turn applications |
| Source | Open Images |
| Edit Operations | 35 across 8 semantic categories |
| Categories | Pixel & Photometric, Object-Level, Scene Composition, Stylistic, Text & Symbol, Human-Centric, Scale & Perspective, Spatial/Layout |
| Image Resolution | 512–1024 px |
| Prompt Generator | Gemini-2.5-Flash |
| Editing Model | Nano-Banana |
| Self-Evaluation | Automated judging pipeline using Gemini-2.5-Pro for edit quality |

🏗️ Dataset Construction

Pico-Banana-400K is built using a two-stage multimodal generation pipeline:

  1. Instruction Generation
    Each Open Images sample is passed to Gemini-2.5-Flash, which writes concise, natural-language editing instructions grounded in visible content. We also provide short instructions summarized by Qwen-2.5-Instruct-7B. Example:
    {
      "instruction": "Change the red car to blue."
    }
    
    
  2. Editing + Self-Evaluation
    The Nano-Banana model performs the edit, and the result is then automatically judged with a structured quality prompt that scores:
      • Instruction Compliance (40%)
      • Editing Realism (25%)
      • Preservation Balance (20%)
      • Technical Quality (15%)
    Only edits scoring above a strict threshold (~0.7) are labeled as successful and form the main dataset; the remaining ~56K are retained as failure cases for robustness and preference learning (see the scoring sketch below).
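
The weighted quality gate can be illustrated with a minimal sketch. The criterion weights and the ~0.7 threshold come from the description above; the `QualityScores` helper, its field names, and the example values are hypothetical, since the actual judging prompt and output format are not reproduced here.

```python
# Minimal sketch of the quality-gating step described above.
# Weights and the ~0.7 cutoff are from the text; field names and
# example scores are hypothetical illustrations.
from dataclasses import dataclass

WEIGHTS = {
    "instruction_compliance": 0.40,
    "editing_realism": 0.25,
    "preservation_balance": 0.20,
    "technical_quality": 0.15,
}
THRESHOLD = 0.7  # approximate acceptance threshold


@dataclass
class QualityScores:
    instruction_compliance: float
    editing_realism: float
    preservation_balance: float
    technical_quality: float

    def overall(self) -> float:
        # Weighted sum of the four judged criteria, each assumed to lie in [0, 1].
        return sum(WEIGHTS[name] * getattr(self, name) for name in WEIGHTS)


def is_successful(scores: QualityScores) -> bool:
    return scores.overall() >= THRESHOLD


# Example: 0.7*0.40 + 0.6*0.25 + 0.6*0.20 + 0.7*0.15 = 0.655, below the
# threshold, so this edit would be retained as a failure/preference sample.
example = QualityScores(0.7, 0.6, 0.6, 0.7)
print(example.overall(), is_successful(example))
```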

📊 Dataset Statistics

Pico-Banana-400K contains ~400K image-editing examples, covering a wide visual and semantic range drawn from real-world imagery.


🧭 Category Distribution

| Category | Description | Percentage |
|---|---|---|
| Object-Level Semantic | Add, remove, replace, or relocate objects | 35% |
| Scene Composition & Multi-Subject | Contextual and environmental transformations | 20% |
| Human-Centric | Edits involving clothing, expression, or appearance | 18% |
| Stylistic | Domain and artistic style transfer | 10% |
| Text & Symbol | Edits involving visible text, signs, or symbols | 8% |
| Pixel & Photometric | Brightness, contrast, and tonal adjustments | 5% |
| Scale & Perspective | Zoom, viewpoint, or framing changes | 2% |
| Spatial / Layout | Outpainting, composition, or canvas extension | 2% |

📂 Data Composition

  • Single-Turn SFT samples (successful edits): ~257K
  • Single-Turn Preference samples (failure cases): ~56K
  • Multi-Turn SFT samples (sequential edit sessions): ~72K
  • Gemini-generated instructions: concise, natural, and image-aware
  • Edit coverage: 35 edit types across 8 semantic categories
  • Image diversity: humans, objects, text-rich scenes, and more, drawn from Open Images

🖼️ Visualization

Below are representative examples from different categories:

| Category | Example |
|---|---|
| Object-Level | “Replace the red apple with a green one.” |
| Scene Composition | “Add sunlight streaming through the window.” |
| Human-Centric | “Change the person’s expression to smiling.” |
| Text & Symbol | “Uppercase the text on the billboard.” |
| Stylistic | “Convert the image to a Van Gogh painting style.” |

Pico-Banana-400K provides both breadth (diverse edit operations) and depth (quality-controlled multimodal supervision), making it a strong foundation for training and evaluating text-guided image editing models.

🧠 Applications

Pico-Banana-400K serves as a versatile resource for advancing controllable and instruction-aware image editing.
Beyond single-step editing, the dataset enables multi-turn, conversational editing and reward-based training paradigms.
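
As a rough illustration of the reward-based use case, the sketch below turns preference samples into chosen/rejected pairs suitable for DPO-style preference training. The record layout and field names (`instruction`, `source_image`, `chosen_edit`, `rejected_edit`) are hypothetical placeholders, not the released schema.

```python
# Sketch of building chosen/rejected pairs from Pico-Banana-400K preference
# samples for preference-based fine-tuning. Field names are hypothetical.
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class PreferenceSample:
    instruction: str     # the edit instruction
    source_image: str    # path or URL of the original Open Images photo
    chosen_edit: str     # edit that passed the quality gate (positive)
    rejected_edit: str   # edit that failed the quality gate (negative)


def to_preference_pairs(samples: Iterable[PreferenceSample]) -> Iterator[dict]:
    """Yield prompt/chosen/rejected dictionaries for a preference trainer."""
    for s in samples:
        yield {
            "prompt": {"instruction": s.instruction, "image": s.source_image},
            "chosen": s.chosen_edit,
            "rejected": s.rejected_edit,
        }
```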

📦 Dataset Download Guide

The Pico-Banana-400K dataset is hosted on Apple’s public CDN.
You can download each component (single-turn, multi-turn, and preference data) using the provided manifest files.


🖼️ 1. Single-Turn Edited Images

Manifest files: sft link and preference link

🖼️ 2. Multi-Turn Edited Images

Manifest file: multi-turn link

🖼️ 3. Source Images

URLs for downloading the source images are provided, alongside the edit instructions, in the sft link, preference link, and multi-turn link manifests above.
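
As a rough illustration of how a manifest might be consumed, the sketch below streams a JSONL manifest and downloads the referenced files with the standard library. The manifest format, field names (`source_image_url`, `edited_image_url`), and output layout are assumptions; adjust them to match the actual manifest contents.

```python
# Rough sketch of downloading images referenced by a Pico-Banana-400K manifest.
# Assumes a JSONL manifest (one record per line) with hypothetical field names;
# inspect the real manifest and adjust accordingly.
import json
import urllib.request
from pathlib import Path


def download(url: str, dest: Path) -> None:
    """Fetch a single file if it is not already present."""
    if dest.exists():
        return
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as f:
        f.write(resp.read())


def download_from_manifest(manifest_path: str, out_dir: str = "pico_banana") -> None:
    out = Path(out_dir)
    with open(manifest_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            # Hypothetical fields: the original Open Images photo and the edited result.
            for key in ("source_image_url", "edited_image_url"):
                url = record.get(key)
                if url:
                    download(url, out / key / f"{i:07d}.jpg")


# Example usage (the manifest filename is a placeholder):
# download_from_manifest("sft_manifest.jsonl")
```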

🧩 License

Pico-Banana-400K is released under the Creative Commons Attribution–NonCommercial–NoDerivatives (CC BY-NC-ND 4.0) license.

  ✅ Free for research and non-commercial use
  ❌ Commercial use and derivative redistribution are not permitted
  🖼️ Source images follow the Open Images (CC BY 2.0) license

By using this dataset, you agree to comply with the terms of both licenses.

📘 Citation

If you use 🍌 Pico-Banana-400K in your research, please cite it as follows:

@misc{qian2025picobanana,
  title        = {Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing},
  author       = {Yusu Qian and Eli Bocek-Rivele and Liangchen Song and Jiasen Lu and Jialing Tong and Yinfei Yang and Wenze Hu and Zhe Gan},
  year         = {2025},
  note         = {Dataset release (preprint / placeholder citation). Paper forthcoming.},
  url          = {https://github.com/apple/ml-pico-banana-400K},
}
