Regarding the subset of training dataset #202

Open
WesLee88524 opened this issue Nov 24, 2024 · 2 comments

Comments

@WesLee88524

Hi, this is great work.
However, while trying to reproduce it, I noticed that you used only 6.9M of the 11M images in the SA-1B dataset. Could you release the exact image list to make reproduction easier? Similarly, only part of the conceptual-12m dataset was used.
Thanks!

@feifeiobama
Collaborator

For SA-1B, it is crucial to filter out the watermarked images. We didn't have a good detector, so we adopted a naive approach: filtering out samples whose prompts contain human-related words. There may be better ways to filter these image datasets.
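
The naive filter described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the word list, the `(image_id, caption)` sample layout, and the function names are hypothetical, and the actual list and pipeline used for the 6.9M subset are not given in this thread.

```python
import re

# Hypothetical word list for illustration only; the real list is not published here.
HUMAN_WORDS = {
    "person", "people", "man", "woman", "men", "women",
    "boy", "girl", "child", "children", "face",
}

# Tokenize into whole lowercase words so "mountain" does not match "man".
_WORD_RE = re.compile(r"[a-z]+")


def mentions_human(caption: str) -> bool:
    """Return True if any whole word in the caption is in HUMAN_WORDS."""
    tokens = _WORD_RE.findall(caption.lower())
    return any(token in HUMAN_WORDS for token in tokens)


def filter_captions(samples):
    """Keep the ids of samples whose caption has no human-related word.

    `samples` is an iterable of (image_id, caption) pairs (assumed layout).
    """
    return [image_id for image_id, caption in samples
            if not mentions_human(caption)]
```

Running `filter_captions` over the captioned samples and saving the returned ids would give an explicit image list. Whole-word matching is a deliberate choice here, since plain substring matching would also drop captions such as "a mountain lake".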

@WesLee88524
Author

> For SA-1B, it is crucial to filter out the watermarked images. We didn't have a good detector, so we adopted a naive approach: filtering out samples whose prompts contain human-related words. There may be better ways to filter these image datasets.

Thank you for your reply. Could you please release the exact image list if possible? Manually filtering the data would be both time-consuming and inefficient.
