diff --git "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" index 78814ac2c..d0d4e06f5 100644 --- "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" +++ "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" @@ -109,20 +109,26 @@ query-response格式: ### 多模态 -对于多模态数据集,和上述任务的格式相同。区别在于增加了`images`, `videos`, `audios`几个key,分别代表多模态资源的url或者path(推荐使用绝对路径),`<image>` `<video>` `<audio>`标签代表了插入图片/视频/音频的位置,ms-swift支持多图片/视频/音频的情况。这些特殊tokens将在预处理的时候进行替换,参考[这里](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198)。下面给出的四条示例分别展示了纯文本,以及包含图像、视频和音频数据的数据格式。 +对于多模态数据集,和上述任务的格式相同。区别在于增加了`images`, `videos`, `audios`几个key,分别代表多模态资源的url或者path(推荐使用绝对路径),`<image>` `<video>` `<audio>`标签代表了插入图片/视频/音频的位置,ms-swift支持多图片/视频/音频的情况。这些特殊tokens将在预处理的时候进行替换,参考[这里](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198)。下面给出的示例分别展示了纯文本,以及包含图像、视频和音频数据的数据格式。 + +SWIFT 支持从 LMDB 数据库加载多模态资源,使用格式为 `lmdb://key@path_to_lmdb`。这对于存储和访问大量图像、视频、音频等资源非常有效,特别适合训练和推理时处理大规模多模态数据集。使用前请确保已安装 LMDB:`pip install lmdb`。 预训练: -``` +```jsonl {"messages": [{"role": "assistant", "content": "预训练的文本在这里"}]} {"messages": [{"role": "assistant", "content": "<image>是一只小狗,<image>是一只小猫"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]} +{"messages": [{"role": "assistant", "content": "<image>是一只从LMDB加载的小兔子"}], "images": ["lmdb://rabbit_img@/path/to/animals_lmdb"]} {"messages": [{"role": "assistant", "content": "<audio>描述了今天天气真不错"}], "audios": ["/xxx/x.wav"]} {"messages": [{"role": "assistant", "content": "<image>是一个大象,<video>是一只狮子在跑步"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]} +{"messages": [{"role": "assistant", "content": "<video>展示了太空中的星系"}], "videos": ["lmdb://space_video@/path/to/videos_lmdb"]} ``` 微调: ```jsonl {"messages": [{"role": "user", "content": "浙江的省会在哪?"}, {"role": "assistant", "content": "浙江的省会在杭州。"}]} {"messages": [{"role": "user", "content": "<image><image>两张图片有什么区别"}, {"role": "assistant", "content": "前一张是小猫,后一张是小狗"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]} +{"messages": [{"role": "user", "content": "<image>这个动物是什么?"}, {"role": "assistant", "content": "这是一只棕色的熊猫,很罕见的物种。"}], "images": ["lmdb://panda_img@/path/to/wildlife_lmdb"]} +{"messages": [{"role": "user", "content": "<image>和<image>这两种动物有什么区别?"}, {"role": "assistant", "content": "第一张图是老虎,第二张图是狮子。"}], "images": ["lmdb://tiger_img@/path/to/animals_lmdb", "lmdb://lion_img@/path/to/animals_lmdb"]} {"messages": [{"role": "user", "content": "<audio>语音说了什么"}, {"role": "assistant", "content": "今天天气真好呀"}], "audios": ["/xxx/x.mp3"]} {"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "<image>图片中是什么,<video>视频中是什么"}, {"role": "assistant", "content": "图片中是一个大象,视频中是一只小狗在草地上奔跑"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]} ``` diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md index 471db6e83..deece24b9 100644 --- a/docs/source_en/Customization/Custom-dataset.md +++ b/docs/source_en/Customization/Custom-dataset.md @@ -113,15 +113,18 @@ Please refer to [embedding训练文档](../BestPractices/Embedding.md#dataset-fo ### Multimodal -For multimodal datasets, the format is the same as the aforementioned tasks. 
diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md
index 471db6e83..deece24b9 100644
--- a/docs/source_en/Customization/Custom-dataset.md
+++ b/docs/source_en/Customization/Custom-dataset.md
@@ -113,15 +113,19 @@ Please refer to [embedding训练文档](../BestPractices/Embedding.md#dataset-fo
 
 ### Multimodal
 
-For multimodal datasets, the format is the same as the aforementioned tasks. The difference lies in the addition of several keys: `images`, `videos`, and `audios`, which represent the URLs or paths (preferably absolute paths) of multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate where to insert images, videos, or audio. MS-Swift supports multiple images, videos, and audio files. These special tokens will be replaced during preprocessing, as referenced [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The four examples below respectively demonstrate the data format for plain text, as well as formats containing image, video, and audio data.
+For multimodal datasets, the format is the same as the aforementioned tasks. The difference lies in the addition of several keys: `images`, `videos`, and `audios`, which represent the URLs or paths (preferably absolute paths) of multimodal resources. The tags `<image>`, `<video>`, and `<audio>` indicate where to insert images, videos, or audio. MS-Swift supports multiple images, videos, and audio files. These special tokens will be replaced during preprocessing, as referenced [here](https://github.com/modelscope/ms-swift/blob/main/swift/llm/template/template/qwen.py#L198). The examples below demonstrate the data format for plain text, as well as formats containing image, video, and audio data.
+
+ms-swift supports loading multimodal resources from LMDB databases using the format `lmdb://key@path_to_lmdb`. This is highly effective for storing and accessing large collections of images, videos, audio files, and other resources, especially during training and inference with large-scale multimodal datasets. Make sure to install LMDB first: `pip install lmdb`.
 
 Pre-training:
 
 ```jsonl
 {"messages": [{"role": "assistant", "content": "Pre-trained text goes here"}]}
 {"messages": [{"role": "assistant", "content": "<image>is a puppy, <image>is a kitten"}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "assistant", "content": "<image>is a rabbit loaded from LMDB"}], "images": ["lmdb://rabbit_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "assistant", "content": "<audio>describes how nice the weather is today"}], "audios": ["/xxx/x.wav"]}
 {"messages": [{"role": "assistant", "content": "<image>is an elephant, <video>is a lion running"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
+{"messages": [{"role": "assistant", "content": "<video>shows galaxies in space"}], "videos": ["lmdb://space_video@/path/to/videos_lmdb"]}
 ```
 
 Supervised Fine-tuning:
 
@@ -129,6 +133,8 @@ Supervised Fine-tuning:
 ```jsonl
 {"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}
 {"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}, {"role": "assistant", "content": "The first one is a kitten, and the second one is a puppy."}], "images": ["/xxx/x.jpg", "/xxx/x.png"]}
+{"messages": [{"role": "user", "content": "<image>What is this animal?"}, {"role": "assistant", "content": "This is a brown panda, a very rare species."}], "images": ["lmdb://panda_img@/path/to/wildlife_lmdb"]}
+{"messages": [{"role": "user", "content": "<image> and <image> What is the difference between these two animals?"}, {"role": "assistant", "content": "The first image is a tiger, and the second image is a lion."}], "images": ["lmdb://tiger_img@/path/to/animals_lmdb", "lmdb://lion_img@/path/to/animals_lmdb"]}
 {"messages": [{"role": "user", "content": "<audio>What did the audio say?"}, {"role": "assistant", "content": "The weather is really nice today."}], "audios": ["/xxx/x.mp3"]}
 {"messages": [{"role": "system", "content": "You are a helpful and harmless assistant."}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video?"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass."}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
 ```
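One note on failure modes: a missing key only surfaces as a `KeyError` deep in preprocessing (see `load_file` below), so it may be worth validating references up front. A minimal sketch, assuming a `train.jsonl` in the format above (only `images` is checked here; `videos` and `audios` work the same way):

```python
# Sketch: verify that every lmdb:// image reference in a dataset resolves.
# 'train.jsonl' and the URI layout follow the doc examples above.
import json

import lmdb

with open('train.jsonl', encoding='utf-8') as f:
    uris = {u for line in f for u in json.loads(line).get('images', []) if u.startswith('lmdb://')}

for uri in sorted(uris):
    # lmdb://key@path_to_lmdb -> (key, path_to_lmdb)
    key, _, lmdb_dir = uri[len('lmdb://'):].partition('@')
    env = lmdb.open(lmdb_dir, readonly=True, lock=False)
    with env.begin(write=False) as txn:
        if txn.get(key.encode()) is None:
            raise KeyError(f'missing key {key!r} in {lmdb_dir}')
    env.close()
```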
did the audio say?"}, {"role": "assistant", "content": "The weather is really nice today."}], "audios": ["/xxx/x.mp3"]} {"messages": [{"role": "system", "content": "You are a helpful and harmless assistant."}, {"role": "user", "content": "<image>What is in the image, <video>What is in the video?"}, {"role": "assistant", "content": "The image shows an elephant, and the video shows a puppy running on the grass."}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]} ``` diff --git a/swift/llm/template/vision_utils.py b/swift/llm/template/vision_utils.py index 0fc486c67..ec8daf9af 100644 --- a/swift/llm/template/vision_utils.py +++ b/swift/llm/template/vision_utils.py @@ -4,7 +4,7 @@ import os import re from io import BytesIO -from typing import Any, Callable, List, TypeVar, Union +from typing import Any, Callable, Dict, List, Optional, TypeVar, Union import numpy as np import requests @@ -13,6 +13,13 @@ from swift.utils import get_env_args +# Try to import lmdb, but don't fail if it's not available +try: + import lmdb + LMDB_AVAILABLE = True +except ImportError: + LMDB_AVAILABLE = False + # >>> internvl IMAGENET_MEAN = (0.485, 0.456, 0.406) IMAGENET_STD = (0.229, 0.224, 0.225) @@ -99,6 +106,9 @@ def rescale_image(img: Image.Image, max_pixels: int) -> Image.Image: _T = TypeVar('_T') +# Cache for LMDB environments and read transactions to avoid reopening +_LMDB_ENV_CACHE: Dict[str, Any] = {} +_LMDB_TXN_CACHE: Dict[str, Any] = {} def load_file(path: Union[str, bytes, _T]) -> Union[BytesIO, _T]: res = path @@ -111,6 +121,38 @@ def load_file(path: Union[str, bytes, _T]) -> Union[BytesIO, _T]: request_kwargs['timeout'] = timeout content = requests.get(path, **request_kwargs).content res = BytesIO(content) + elif path.startswith('lmdb://'): + if not LMDB_AVAILABLE: + raise ImportError( + "LMDB support requires the 'lmdb' package to be installed. " + "Please install it with 'pip install lmdb'." + ) + # Parse LMDB path format: lmdb://key@path_to_lmdb + _, _, lmdb_url = path.partition('lmdb://') + key, sep, lmdb_dir = lmdb_url.partition('@') + + # Verify format validity with a single check + if not sep or not key or not lmdb_dir or '@' in lmdb_dir: + raise ValueError("LMDB path must be in format: lmdb://key@path_to_lmdb (with exactly one '@')") + + # Use cached environment or create a new one + env = _LMDB_ENV_CACHE.get(lmdb_dir) + if env is None: + env = lmdb.open(lmdb_dir, readonly=True, lock=False, max_readers=1024, max_spare_txns=2) + _LMDB_ENV_CACHE[lmdb_dir] = env + + # Get or create read transaction + txn = _LMDB_TXN_CACHE.get(lmdb_dir) + if txn is None: + txn = env.begin(write=False) + _LMDB_TXN_CACHE[lmdb_dir] = txn + + # Get data using the cached transaction + encoded_key = key.encode() + data = txn.get(encoded_key) + if data is None: + raise KeyError(f"Key '{key}' not found in LMDB at '{lmdb_dir}'") + res = BytesIO(data) elif os.path.exists(path) or (not path.startswith('data:') and len(path) <= 200): path = os.path.abspath(os.path.expanduser(path)) with open(path, 'rb') as f: