An experiment in browser automation
This library pulls images out of Google Images search results and saves them to disk. The neat trick is that it doesn't save the thumbnails shown in the search results; instead, it saves the original source images (i.e., the high-res versions) that the search results refer to.
This is made possible by the Chrome Remote Debugging API, which also means you've discovered the first gotcha: this only works with the Chrome browser.
The name is short for Google Images Downloader.
The usual will work, but with a caveat:
$ python -m pip install gidler
The caveat is that you're probably going to need Python >= 3.5 for this. I don't have a lot of free time for hobby projects, and they're how I experiment with new Python features. It is an incredible amount of work to make a Python package that works everywhere (I've done it for other projects), and I just don't have the time or energy here. If you want it to work on 2.7 and you provide a working PR, I will very likely merge it in; I just don't have time to do it myself.
However: you don't actually need to do all that work. Just use Anaconda Python. Using conda, you can create a new environment with the right version of Python, and then pip install into that:
$ conda create -n mygidlerenv python=3.5
$ source activate mygidlerenv
(mygidlerenv) $ python -m pip install gidler
First start up Chrome with remote debugging activated on a specific port:
$ <chrome executable> --remote-debugging-port=9222
Now we can play that instance like a marionette!
Example using the Chromium browser (on my Mac):
$ open /Users/calebhattingh/Applications/Chromium.app \
    --args --remote-debugging-port=9222
If you get this working on Windows or Linux, let me know and I'll add more examples here.
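Once Chrome (or Chromium) is up, you can sanity-check the debugging endpoint before pointing gidler at it. Here's a small sketch (not part of gidler itself) that asks Chrome's standard /json endpoint for its open tabs:

    import json
    from urllib.request import urlopen

    # Chrome exposes metadata about its open tabs at this endpoint
    # when started with --remote-debugging-port=9222.
    resp = urlopen("http://localhost:9222/json")
    tabs = json.loads(resp.read().decode("utf-8"))

    for tab in tabs:
        # Each entry also carries a webSocketDebuggerUrl, which is what
        # the remote debugging protocol actually connects to.
        print(tab.get("title"), "->", tab.get("url"))

If that prints a list of tabs, the port is live and gidler can talk to it.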
You can execute the module directly from the command-line:
$ python -m gidler -p 9222 --max 5 -q "mandala"
This:
- Starts up gidler...
- ...on port 9222 (this must match the port we gave Chrome)...
- ...returning no more than 5 images...
- ...with a query string of "mandala".
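For reference, those flags map naturally onto argparse. The following is just a hypothetical reconstruction of the shape of the interface, not the actual gidler source:

    import argparse

    # Hypothetical sketch of the CLI shown above; the real gidler
    # parser may differ in names and defaults.
    parser = argparse.ArgumentParser(prog="gidler")
    parser.add_argument("-p", "--port", type=int, default=9222,
                        help="Chrome remote debugging port (must match "
                             "the port Chrome was started with)")
    parser.add_argument("--max", type=int, default=None,
                        help="maximum number of images to fetch")
    parser.add_argument("-q", "--query", required=True,
                        help="Google Images search query")
    args = parser.parse_args()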
This query string is exactly what you would type into the Google Images search box, so, e.g., this works: "site:deviantart.com sketch portrait"
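Under the hood, a query like that just gets URL-encoded into a Google Images search URL. Roughly like the sketch below (the exact parameters gidler sends may differ; tbm=isch selects image search, and tbs=isz:l is the "large" size filter mentioned further down):

    from urllib.parse import urlencode

    query = "site:deviantart.com sketch portrait"
    params = {
        "tbm": "isch",   # image search
        "q": query,      # passed through exactly as you'd type it
        "tbs": "isz:l",  # the hard-coded "large" filter (see below)
    }
    url = "https://www.google.com/search?" + urlencode(params)
    print(url)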
You can also run python -m gidler -h to see the help.
It works on my machine™.
The script tells Chrome to perform an image search using the query string given on the CLI. Then the content of the results page is parsed to extract the original image URLs, which are downloaded separately with urllib inside a thread pool of 8 workers (yet another hard-coded setting that will eventually become a CLI option...).
This means that Google only gets hit with the initial search query, not all the subsequent (large) image downloads.
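The download stage is roughly equivalent to this sketch, assuming you already have the list of extracted URLs (the real code also applies the naming scheme described below):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlretrieve

    urls = [
        # The original image URLs parsed out of the results page, e.g.
        # "https://upload.example.com/full/mandala.jpg",
    ]

    def fetch(url):
        # Name the local file after the last path segment of the URL.
        filename = url.rsplit("/", 1)[-1] or "image"
        urlretrieve(url, filename)
        return filename

    # 8 workers is the hard-coded pool size mentioned above.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for saved in pool.map(fetch, urls):
            print("saved", saved)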
Currently, several things are hard-coded:
- The "large" filter is automatically set. This is quite restrictive, and is probably not what you want all the time. This should be a CLI option``*``. If you peek in the source code, you'll see some documentation about all the possible settings; you can even specify width and height requirements. None of that is configurable yet though"*".
- If no max is given, all the images on the first page of results are fetched. The code even forces scroll actions to the bottom of the page in order to get Chrome to load all 400 results. This might not be what you want.
- The images are saved into a new subfolder in the local folder. This should be a CLI option*.
- The subfolder name is a slugified version of the query string, plus a small uuid (so that you can run the same query multiple times without collisions); see the sketch after this list.
- The image names are the original image names, also prefixed with a small uuid to avoid collisions in case multiple images have the same filename.
- Timeouts and other applied pauses are all hard-coded. The pauses are largely there to give Chrome a chance to complete the previous instruction. I tweaked these for my situation, but you may find longer pauses are necessary.
- The work was done on OS X. I have no idea* whether this will work on other platforms.
*PRs welcome.
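As promised above, here is a sketch of that naming scheme, using python-slugify (already in the requirements). The 8-character truncation of the uuid is an assumption for illustration; the actual length in the source may differ:

    import os
    import uuid
    from slugify import slugify

    def make_dest_folder(query):
        # e.g. "site:deviantart.com sketch portrait"
        #   -> "site-deviantart-com-sketch-portrait-3f9c21ab"
        name = "{}-{}".format(slugify(query), uuid.uuid4().hex[:8])
        os.makedirs(name, exist_ok=True)
        return name

    def make_image_name(original_name):
        # Prefix with a short uuid so two images that share a filename
        # don't clobber each other on disk.
        return "{}-{}".format(uuid.uuid4().hex[:8], original_name)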
Yes, yes, I know there are other tools. I wanted a more lightweight option. Currently, this library really only depends on Chrome and Python, although there are several of the usual suspects in the requirements. (At the time of writing, the requirements are chromote and python-slugify, but those each bring in a few other things, like requests, ws4py and so on.)
The chromote package provides a Python abstraction over the Chrome Remote Debugging API. Currently, chromote uses the websocket-client package, which has been terribly unstable for me: sometimes ws.recv() returns, but with nothing. In my fork I changed it to use the high-quality ws4py package, and since then the connection to the debugging API has been rock solid.
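For the curious, the protocol underneath is just JSON messages over that websocket. A minimal ws4py sketch (the tab address is a placeholder; in practice you'd use the webSocketDebuggerUrl from the /json listing shown earlier):

    import json
    import time
    from ws4py.client.threadedclient import WebSocketClient

    class DevTools(WebSocketClient):
        def received_message(self, message):
            # Every response and event from Chrome arrives as JSON text.
            print(json.loads(str(message)))

    # Placeholder address: substitute a real webSocketDebuggerUrl here.
    ws = DevTools("ws://localhost:9222/devtools/page/<tab-id>")
    ws.connect()

    # Ask the tab to navigate; the "id" correlates the eventual reply.
    ws.send(json.dumps({
        "id": 1,
        "method": "Page.navigate",
        "params": {"url": "https://www.google.com/"},
    }))

    time.sleep(2)  # crude pause to let responses arrive
    ws.close()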