AdamRain / YFCC15M_downloader Public


A subset of YFCC100M. Tools, checking scripts, and web drive links for downloading the dataset (uncompressed).


YFCC15M_downloader

A subset of YFCC100M. This repository provides download tools, checking scripts, and web drive links for the dataset.

We follow DeCLIP's dataset preparation process:

First, download DeCLIP's YFCC15M label file, 'yfcc15m_clean_open_data.json', from Google Drive.
Extract the URLs from the JSON file and split them into several URL list files for download using split_download_task.py.
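The extract-and-split step can be sketched as follows. This is a minimal illustration, not the contents of split_download_task.py. It assumes each record in the JSON file is a dict with a "url" field and that the top-level value is either a list of records or a dict mapping IDs to records. The function names and the urls_NNN.txt file naming are hypothetical.

```python
import json
from pathlib import Path


def extract_urls(records):
    """Collect image URLs from the loaded label data.

    Assumption: every record is a dict with a "url" field, and the top-level
    JSON value is a list of records or a dict mapping IDs to records.
    """
    if isinstance(records, dict):
        records = records.values()
    return [rec["url"] for rec in records if isinstance(rec, dict) and "url" in rec]


def split_into_chunks(urls, num_parts):
    """Split the URL list into num_parts contiguous, nearly equal chunks."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    base, extra = divmod(len(urls), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        # The first `extra` chunks take one additional URL each.
        size = base + (1 if i < extra else 0)
        chunks.append(urls[start:start + size])
        start += size
    return chunks


def write_url_lists(json_path, out_dir, num_parts):
    """Write one plain-text URL list per chunk, one URL per line."""
    with open(json_path, encoding="utf-8") as f:
        urls = extract_urls(json.load(f))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_into_chunks(urls, num_parts)):
        path = out_dir / f"urls_{i:03d}.txt"  # hypothetical naming scheme
        path.write_text("".join(url + "\n" for url in chunk), encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_url_lists("yfcc15m_clean_open_data.json", "url_lists", 16)` would produce 16 list files that can be downloaded independently, so a failed or interrupted part can be retried on its own.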
Crawl the images directly from the URLs using auto_download.bat (this uses Wget, which you may need to install). The .bat file is for Windows; on Linux you will need to rewrite it as a shell script. Or simply download from the links below!
- You can stop the process and start it again later if something goes wrong. Wget will skip files that are already downloaded, and the log files are cleaned.
- Errors are recorded in the log files. Before restarting the download, it is recommended to run clean_err_file_from_logs.py to filter out and delete the faulty files.
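For non-Windows systems, one possible replacement for the .bat file is a small Python wrapper around Wget. The sketch below is an assumption about how such a wrapper could look, not a translation of auto_download.bat, whose exact flags are not shown here. The directory layout, the retry and timeout values, and the function names are hypothetical. `--no-clobber` is the Wget option that skips files that already exist, which matches the resume behaviour described above.

```python
import subprocess
from pathlib import Path


def build_wget_command(url_list, out_dir, log_file):
    """Build a Wget invocation for one URL list file."""
    return [
        "wget",
        f"--input-file={url_list}",      # read URLs from the list, one per line
        f"--directory-prefix={out_dir}",  # save images into out_dir
        "--no-clobber",                   # skip files that already exist on disk
        "--tries=3",                      # hypothetical retry count
        "--timeout=30",                   # hypothetical timeout in seconds
        f"--append-output={log_file}",    # append messages to the log file
    ]


def download_all(list_dir, out_dir, log_dir):
    """Run Wget once per URL list file found in list_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    for url_list in sorted(Path(list_dir).glob("*.txt")):
        log_file = Path(log_dir) / (url_list.stem + ".log")
        cmd = build_wget_command(url_list, out_dir, log_file)
        # Wget exits with a non-zero status when any URL fails, so failures
        # are not raised here; they are left in the log for later inspection.
        subprocess.run(cmd, check=False)
```

Calling `download_all("url_lists", "images", "logs")` again after an interruption only fetches the files that are still missing.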
Check the downloaded images using check_images.py.
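As a rough illustration of what an integrity check can look like, the sketch below tests whether a file starts and ends with the JPEG start-of-image and end-of-image markers. It is not the logic of check_images.py, which is not shown here, and it assumes the images are JPEG files. It is only a heuristic: a truncated download usually lacks the end marker, but a valid JPEG with trailing bytes would be flagged, and a full check would decode the image with an imaging library.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def has_jpeg_markers(head: bytes, tail: bytes) -> bool:
    """Return True if the first and last two bytes are the JPEG markers."""
    return head == JPEG_SOI and tail == JPEG_EOI


def looks_like_complete_jpeg(path) -> bool:
    """Check the markers of one file without reading the whole image."""
    path = Path(path)
    if path.stat().st_size < 4:
        return False  # too small to hold both markers
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(-2, 2)  # two bytes before the end of the file
        tail = f.read(2)
    return has_jpeg_markers(head, tail)


def find_suspect_files(image_dir, pattern="*.jpg"):
    """List files under image_dir that fail the marker check."""
    return [p for p in sorted(Path(image_dir).rglob(pattern))
            if not looks_like_complete_jpeg(p)]
```

Files returned by `find_suspect_files("images")` are candidates for deletion and re-download.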

Dataset info:

The dataset should contain 15,388,848 images.
We managed to crawl 15,061,747 of them.
Total space occupied: 867.73 GB.
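The figures above imply the following coverage and average image size. This is a quick worked calculation; it assumes that "G" means 10^9 bytes, which the source does not state.

```python
EXPECTED_IMAGES = 15_388_848
CRAWLED_IMAGES = 15_061_747
TOTAL_BYTES = 867.73e9  # assumption: 867.73 G means 867.73 * 10**9 bytes

missing = EXPECTED_IMAGES - CRAWLED_IMAGES
coverage = CRAWLED_IMAGES / EXPECTED_IMAGES
avg_kb = TOTAL_BYTES / CRAWLED_IMAGES / 1e3

print(f"missing images: {missing:,}")   # missing images: 327,101
print(f"coverage: {coverage:.2%}")      # coverage: 97.87%
print(f"average size: {avg_kb:.1f} kB per image")
```

About 2% of the original URLs could not be crawled, so a local copy should be expected to fall slightly short of the full 15,388,848 images.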

Web Drive links:

If a link fails, please open an issue and leave a message.

2024-11-13 update: You can use the bypy tool to download the files from the Baidu Yun web drive.
