Regarding the subset of training dataset #202

WesLee88524 · 2024-11-24T20:27:08Z

Hi, this is great work.
However, while trying to reproduce it, I noticed that you only used 6.9M of the 11M images in the SA-1B dataset. Could you release the exact image list to make reproduction easier? Similarly, only part of the conceptual-12m dataset was used.
Thanks!

feifeiobama · 2024-11-24T20:42:47Z

For SA-1B, it is crucial to filter out the watermarked images. We didn't have a good detector, so we adopted a naive approach: filtering out samples whose prompts contain human-related words. There may be better ways to filter these image datasets.
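
A minimal sketch of this kind of keyword-based filtering, assuming each image has a text prompt keyed by an image ID. The word list, the function names, and the sample IDs below are hypothetical and purely illustrative; they are not the list or code actually used for training.

```python
import re

# Hypothetical keyword list, for illustration only (not the list used in training).
HUMAN_WORDS = {
    "person", "people", "man", "men", "woman", "women",
    "boy", "girl", "child", "children", "face",
}

_WORD_RE = re.compile(r"[a-z]+")


def mentions_human(prompt: str) -> bool:
    """Return True if the prompt contains any whole word from HUMAN_WORDS."""
    return any(tok in HUMAN_WORDS for tok in _WORD_RE.findall(prompt.lower()))


def filter_image_ids(prompts: dict[str, str]) -> list[str]:
    """Keep the IDs of images whose prompts do not mention humans."""
    return [image_id for image_id, p in prompts.items() if not mentions_human(p)]


# Made-up sample data: image ID -> prompt.
prompts = {
    "sa_1": "A woman walking a dog on the beach",
    "sa_2": "A mountain lake at sunrise",
    "sa_3": "Humane society building exterior",
}
print(filter_image_ids(prompts))
```

Matching on whole words rather than substrings keeps prompts such as "Humane society" or "mango tree" from being dropped by accident. A filter like this only approximates watermark removal, so some watermarked images may remain and some clean ones may be discarded.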

WesLee88524 · 2024-11-25T15:41:59Z

Thank you for your reply. Could you please release the exact image list if possible? Manually filtering the data would be both time-consuming and inefficient.
