This project is released under a Creative Commons Attribution-NonCommercial 4.0 International Public License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/4.0/legalcode
python<version number>_download_facescrub.py downloads the FaceScrub dataset described in
H.-W. Ng, S. Winkler. A data-driven approach to cleaning large face datasets. Proc. IEEE International Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.
If you are using Python 2, use the script python2_download_facescrub.py.
If you are using Python 3, use python3_download_facescrub.py.
In particular, the Python 3 version has been updated by ottocho to support multi-threading.
This code was tested on Ubuntu 14.04 and Mac OS X El Capitan.
I suggest shuffling the dataset file before running the script. This should make it less likely that you flood a single website with requests, especially if you use the multi-threaded version with many threads running simultaneously. Some servers may block you if they detect many simultaneous, sustained connections from one source, and some may return watermarked or corrupted images for the same reason.
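One possible way to do the shuffling is sketched below in Python 3. It is not part of the download scripts: the helper names `shuffle_keep_header` and `shuffle_file` are hypothetical, and the sketch assumes the dataset file is UTF-8 text whose first line is the header (as noted for --start_at_line further down), so that line is kept in place.

```python
import random


def shuffle_keep_header(lines, seed=None):
    """Return a new list with the first (header) line kept in place
    and all remaining lines in random order."""
    if not lines:
        return []
    header, rows = lines[0], list(lines[1:])
    # A private Random instance avoids touching the global random state.
    random.Random(seed).shuffle(rows)
    return [header] + rows


def shuffle_file(src_path, dst_path, seed=None):
    """Write a shuffled copy of src_path to dst_path (assumes UTF-8 text)."""
    with open(src_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    with open(dst_path, "w", encoding="utf-8") as f:
        f.write("\n".join(shuffle_keep_header(lines, seed)) + "\n")


# Example call (the output file name is an arbitrary choice):
# shuffle_file("actors_users_normal_bbox.txt", "actors_users_normal_bbox_shuffled.txt")
```

If you shuffle, pass the shuffled file to the download script. Line numbers in the log and in --start_at_line / --end_at_line then refer to the shuffled file, not the original.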
- requests
pip install requests
If your Python version is < 2.7.9, install the requests security extras to suppress "InsecurePlatformWarning".
pip install "requests[security]"
If your Python installation is Anaconda, you may need to conda install cryptography before you pip install "requests[security]".
See this Stack Overflow post for details.
If the above requests[security] installation fails, you might need to install additional packages on your system.
Consult this link for instructions.
More details on this issue can be found in this Stack Overflow post and the urllib3 documentation.
- PIL
# Interchangeable with PIL. Can be ignored if you already have PIL installed
pip install Pillow
- python-magic (optional)
# Optional, but good to have, for robustly detecting file type.
# Might be difficult to get working on Windows. In that case, ignore it.
# If a file's type cannot be determined, it will be removed.
pip install python-magic
- First, obtain the FaceScrub files containing links to the images from http://vintage.winklerbros.net/facescrub.html
- Next, set MY_USER_AGENT_STRING in the script. You can obtain it by visiting a site such as https://www.whatismybrowser.com/detect/what-is-my-user-agent
- Finally, run python<version number>_download_facescrub.py to download the dataset.
Note: actors_users_normal_bbox.txt is obtained from http://vintage.winklerbros.net/facescrub.html.
# In the following code, <version number> is "2" if you use Python 2, and "3" if you use Python 3.
# To download and save full size images only
python python<version number>_download_facescrub.py actors_users_normal_bbox.txt actors/
# To download and save full size images along with cropped faces
python python<version number>_download_facescrub.py actors_users_normal_bbox.txt actors/ --crop_face
# Additional (optional) arguments to set the log file name, timeout (10 seconds),
# max retries (3), start the download at line 10 (note: line 1 is the header) and
# end at line 20. Leave out --end_at_line to download until the end of the file.
python python<version number>_download_facescrub.py actors_users_normal_bbox.txt actors/ \
--crop_face --logfile=download.log --timeout=10 --max_retries=3 --start_at_line=10 --end_at_line=20
The above commands save full size images to the directory actors/images and, if --crop_face is given, cropped faces to actors/faces.
The naming convention is <name>_<image_id>.<ext> for full size images and <name>_<image_id>_<face_id>.<ext> for face images.
Note that <ext> is the file extension matching the image's format. It need not be "jpeg".
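As an illustration of the naming convention, the following sketch builds both kinds of file name. The helper names and the sample values are hypothetical, and the sketch assumes the name is used exactly as given; the scripts may normalize names differently.

```python
def full_image_name(name, image_id, ext):
    """Build a file name of the form <name>_<image_id>.<ext>."""
    return "{}_{}.{}".format(name, image_id, ext)


def face_image_name(name, image_id, face_id, ext):
    """Build a file name of the form <name>_<image_id>_<face_id>.<ext>."""
    return "{}_{}_{}.{}".format(name, image_id, face_id, ext)


# Made-up sample values, for illustration only.
print(full_image_name("Jane Doe", 12, "png"))      # Jane Doe_12.png
print(face_image_name("Jane Doe", 12, 34, "png"))  # Jane Doe_12_34.png
```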
All error messages in the log are of the form "Line <number>: <error message>: <url>", in case users are interested in them.
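Because the error format is regular, the log can be post-processed, for example to list which lines of the dataset file failed. The sketch below is not part of the download scripts. The names `LOG_LINE`, `parse_log_line` and `failed_line_numbers` are hypothetical, the sample message is made up, and the pattern assumes each URL starts with a scheme such as http:// and contains no spaces.

```python
import re

# "Line <number>: <error message>: <url>". The message itself may contain
# colons, so the URL is anchored at the end of the line and must have a scheme.
LOG_LINE = re.compile(r"^Line (\d+): (.*): (\S+://\S+)$")


def parse_log_line(line):
    """Return (line_number, error_message, url), or None if the line
    does not match the error format."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    number, message, url = match.groups()
    return int(number), message, url


def failed_line_numbers(log_lines):
    """Return the sorted, de-duplicated dataset line numbers that had errors."""
    parsed = (parse_log_line(line) for line in log_lines)
    return sorted({entry[0] for entry in parsed if entry is not None})


# Made-up sample entry, for illustration only.
sample = "Line 12: HTTP Error 404: Not Found: http://example.com/a.jpg"
print(parse_log_line(sample))
```

The resulting line numbers could then be used with --start_at_line and --end_at_line to retry a specific range.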