Problems with running self-constructed datasets #646

mssssss123 · 2023-09-16T02:44:06Z

Hi, thank you for the great work!

I used the img2dataset library to convert the image-text dataset I built into the webdataset format, so each sample consists of jpg, txt, and json files. When training with open_clip, I don't want to be limited to the jpg and txt files: I also saved a second type of text in an attribute of the json file. I would like to support three cases during training: one uses the text from the txt file, one uses the text from an attribute in the json file, and one uses a mixture of the two.

Can you give me some suggestions on how to use it or modify the code?

gabrielilharco (Maintainer) · 2023-09-28T13:22:30Z

Hi @mssssss123. You can modify the webdataset pipeline to do that. In particular, in this line of open_clip/src/training/data.py (line 391 in f692ec9):

wds.rename(image="jpg;png;jpeg;webp", text="txt"),

we map the file extensions to the inputs of the image and text towers. So you can use different keys for the text argument there, depending on what you want to do.
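As a rough illustration of the three cases from the question, here is a minimal sketch of a per-sample function that picks the caption before the rename step. It is not part of open_clip: the function name `select_caption`, the `mode` values, and the json attribute name `"caption"` are all hypothetical, and "mixture" is assumed here to mean a random choice between the two captions for each sample. Because a json entry holds a whole metadata object rather than a plain string, the sketch extracts one attribute from it and writes the result back under the `txt` key, so the existing `text="txt"` mapping could stay unchanged.

```python
import json
import random


def select_caption(sample, mode="txt", json_field="caption", rng=random):
    """Return a copy of `sample` whose "txt" entry holds the chosen caption.

    `mode` and `json_field` are hypothetical names used for illustration.
    The function accepts both raw (bytes) and already decoded (str / dict)
    entries, so it does not depend on where it sits relative to decoding.
    """
    txt = sample.get("txt")
    if isinstance(txt, bytes):
        txt = txt.decode("utf-8")

    meta = sample.get("json")
    if isinstance(meta, (bytes, str)):
        meta = json.loads(meta)
    alt = meta.get(json_field) if isinstance(meta, dict) else None

    if mode == "txt":
        caption = txt
    elif mode == "json":
        caption = alt
    elif mode == "mixed":
        # Assumption: "mixture" means picking one of the available captions
        # at random for each sample.
        candidates = [c for c in (txt, alt) if c]
        caption = rng.choice(candidates) if candidates else None
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    if caption is None:
        raise KeyError(f"no caption available for mode {mode!r}")

    out = dict(sample)  # shallow copy, the input sample is left untouched
    out["txt"] = caption
    return out


if __name__ == "__main__":
    demo = {
        "__key__": "000000001",
        "txt": b"a photo of a dog",
        "json": b'{"caption": "a dog running on grass"}',
    }
    for m in ("txt", "json", "mixed"):
        print(m, "->", select_caption(demo, mode=m)["txt"])
```

One possible way to wire this in, again as an assumption rather than the project's own approach, is to add a stage such as `wds.map(functools.partial(select_caption, mode="mixed"))` to the pipeline ahead of the `wds.rename(...)` line, with the mode exposed through a training argument of your choosing.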

EIFY · 2023-09-30T15:59:17Z

@mssssss123 You may want to take a look at EIFY@66e4603, which enables open_clip to use Redcaps json caption files.
