An experiment in browser automation
This library pulls images out of Google Images search results and saves them to disk. The neat trick is that it doesn't save the thumbnails shown in the search results; instead, it saves the original source images (i.e., the high-res versions) that the search results refer to.
This is made possible by the Chrome Remote Debugging API, which also means you've discovered the first gotcha: this only works with the Chrome browser.
The name is short for Google Images Downloader.
The usual will work, but with a caveat:
$ python -m pip install gidler
The caveat is that you're probably going to need Python >= 3.5 for this. I don't have a lot of free time for hobby projects, and they're how I experiment with new Python features. It is an incredible amount of work to make a Python package that works everywhere (I've done it for other projects), and I just don't have the time or energy here. If you want it to work on 2.7 and you provide a working PR, I will very likely merge it in; I just don't have time to do it myself.
However: you don't actually need to do all that work. Just use Anaconda Python. Using conda, you can create a new environment with the right version of Python, and then pip install into that:
$ conda create -n mygidlerenv python=3.5
$ source activate mygidlerenv
(mygidlerenv) $ python -m pip install gidler
First start up Chrome with remote debugging activated on a specific port:
$ <chrome executable> --remote-debugging-port=9222
Now we can play that instance like a marionette!
Example using the Chromium browser (on my Mac):
$ open /Users/calebhattingh/Applications/Chromium.app \
    --args --remote-debugging-port=9222
If you get this working on Windows or Linux, let me know and I'll add more examples here.
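Once Chrome (or Chromium) is up, you can sanity-check the debugging endpoint before pointing gidler at it. Here's a small sketch (not part of gidler itself) that asks Chrome's standard /json endpoint for its open tabs:

    import json
    from urllib.request import urlopen

    # Chrome exposes metadata about its open tabs at this endpoint
    # when started with --remote-debugging-port=9222.
    resp = urlopen("http://localhost:9222/json")
    tabs = json.loads(resp.read().decode("utf-8"))

    for tab in tabs:
        # Each entry also carries a webSocketDebuggerUrl, which is what
        # the remote debugging protocol actually connects to.
        print(tab.get("title"), "->", tab.get("url"))

If that prints a list of tabs, the port is live and gidler can talk to it.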
You can execute the module directly from the command-line:
$ python -m gidler -p 9222 --max 5 -q "mandala"
This:
- Starts up gidler...
- ...on port 9222 (this must match the port we gave Chrome)...
- ...returning no more than 5 images...
- ...with a query string of "mandala".
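For reference, those flags map naturally onto argparse. The following is just a hypothetical reconstruction of the shape of the interface, not the actual gidler source:

    import argparse

    # Hypothetical sketch of the CLI shown above; the real gidler
    # parser may differ in names and defaults.
    parser = argparse.ArgumentParser(prog="gidler")
    parser.add_argument("-p", "--port", type=int, default=9222,
                        help="Chrome remote debugging port (must match "
                             "the port Chrome was started with)")
    parser.add_argument("--max", type=int, default=None,
                        help="maximum number of images to fetch")
    parser.add_argument("-q", "--query", required=True,
                        help="Google Images search query")
    args = parser.parse_args()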
This query string is exactly what you would type into the Google Images search box, so, e.g., this works: "site:deviantart.com sketch portrait"
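Under the hood, a query like that just gets URL-encoded into a Google Images search URL. Roughly like the sketch below (the exact parameters gidler sends may differ; tbm=isch selects image search, and tbs=isz:l is the "large" size filter mentioned further down):

    from urllib.parse import urlencode

    query = "site:deviantart.com sketch portrait"
    params = {
        "tbm": "isch",   # image search
        "q": query,      # passed through exactly as you'd type it
        "tbs": "isz:l",  # the hard-coded "large" filter (see below)
    }
    url = "https://www.google.com/search?" + urlencode(params)
    print(url)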
You can also run python -m gidler -h to see the help.
It works on my machine™.
The script tells Chrome to perform an image search using the query string given on the CLI. Then the content of the results page is parsed to extract the original image URLs, which are downloaded separately with urllib inside a thread pool of 8 workers (yet another hard-coded setting that will eventually become a CLI option...).
This means that Google only gets hit with the initial search query, not all the subsequent (large) image downloads.
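The download stage is roughly equivalent to this sketch, assuming you already have the list of extracted URLs (the real code also applies the naming scheme described below):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlretrieve

    urls = [
        # The original image URLs parsed out of the results page, e.g.
        # "https://upload.example.com/full/mandala.jpg",
    ]

    def fetch(url):
        # Name the local file after the last path segment of the URL.
        filename = url.rsplit("/", 1)[-1] or "image"
        urlretrieve(url, filename)
        return filename

    # 8 workers is the hard-coded pool size mentioned above.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for saved in pool.map(fetch, urls):
            print("saved", saved)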
Currently, several things are hard-coded:
- The "large" filter is automatically set. This is quite restrictive, and is probably not what you want all the time. This should be a CLI option``*``. If you peek in the source code, you'll see some documentation about all the possible settings; you can even specify width and height requirements. None of that is configurable yet though"*".
- If no max is given, all the images on the first page of results are fetched. The code even forces scroll actions to the bottom of the page in order to get Chrome to load all 400 results. This might not be what you want.
- The images are saved into a new subfolder in the local folder. This should be a CLI option*.
- The subfolder name is a slugified version of the query string, plus a small uuid (so that you can run the same query multiple times without collisions); see the sketch after this list.
- The image names are the original image names, also prefixed with a small uuid to avoid collisions in case multiple images have the same filename.
- Timeouts and other applied pauses are all hard-coded. The pauses are largely there to give Chrome a chance to complete the previous instruction. I tweaked these for my situation, but you may find longer pauses are necessary.
- The work was done on OS X. I have no idea* whether this will work on other platforms.
*PRs welcome.
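As promised above, here is a sketch of that naming scheme, using python-slugify (already in the requirements). The 8-character truncation of the uuid is an assumption for illustration; the actual length in the source may differ:

    import os
    import uuid
    from slugify import slugify

    def make_dest_folder(query):
        # e.g. "site:deviantart.com sketch portrait"
        #   -> "site-deviantart-com-sketch-portrait-3f9c21ab"
        name = "{}-{}".format(slugify(query), uuid.uuid4().hex[:8])
        os.makedirs(name, exist_ok=True)
        return name

    def make_image_name(original_name):
        # Prefix with a short uuid so two images that share a filename
        # don't clobber each other on disk.
        return "{}-{}".format(uuid.uuid4().hex[:8], original_name)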
Yes, yes, I know there are other tools. I wanted a more lightweight option. Currently, this library really only depends on Chrome and Python, although there are several of the usual suspects in the requirements. (At the time of writing, the requirements are chromote and python-slugify, but those each bring in a few other things, like requests, ws4py and so on.)
The chromote package provides a Python abstraction over the Chrome Remote Debugging API. Currently, chromote uses the websocket-client package, which has been terribly unstable for me: sometimes ws.recv() returns, but with nothing. In my fork I changed it to use the high-quality ws4py package, and since then the connection to the debugging API has been rock solid.
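For the curious, the protocol underneath is just JSON messages over that websocket. A minimal ws4py sketch (the tab address is a placeholder; in practice you'd use the webSocketDebuggerUrl from the /json listing shown earlier):

    import json
    import time
    from ws4py.client.threadedclient import WebSocketClient

    class DevTools(WebSocketClient):
        def received_message(self, message):
            # Every response and event from Chrome arrives as JSON text.
            print(json.loads(str(message)))

    # Placeholder address: substitute a real webSocketDebuggerUrl here.
    ws = DevTools("ws://localhost:9222/devtools/page/<tab-id>")
    ws.connect()

    # Ask the tab to navigate; the "id" correlates the eventual reply.
    ws.send(json.dumps({
        "id": 1,
        "method": "Page.navigate",
        "params": {"url": "https://www.google.com/"},
    }))

    time.sleep(2)  # crude pause to let responses arrive
    ws.close()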