Add option to allow newlines in captions #283

Status: Open (3 commits, base: main)


achalddave

Some datasets (e.g., YFCC) have newlines in captions, which causes pyarrow's csv module to error by default. This PR allows passing --newlines-in-captions True to img2dataset, which in turn tells pyarrow to allow newlines in CSV values.

The YFCC-15M descriptions can have newlines in the caption, which
causes pyarrow's csv module to error by default. This commit allows
passing --newlines-in-captions True to img2dataset, which tells
pyarrow to allow newlines in CSV values.
@rom1504
Owner

rom1504 commented Apr 23, 2023

Could you add an example of a dataset for which this is needed, please?

@achalddave
Author

I needed this for YFCC 100M - did you want that in the README/in the repo somewhere?

@rom1504
Owner

rom1504 commented May 28, 2023

Yes, if you could add it in https://github.com/rom1504/img2dataset/tree/main/dataset_examples that would be great.

@ldfandian
Contributor

ldfandian commented Jul 3, 2023

I also need this. (I have a crawler that gives me many raw web image-text pairs with newlines in the text title.)
Looking forward to this being merged, @achalddave.

@rom1504
Owner

rom1504 commented Jul 15, 2023

Could you please rebase on head and resolve the conflicts?
