
Add download script #59

Merged (29 commits, Apr 4, 2015)

Conversation

vdumoulin
Contributor

This PR introduces a fuel-download script to download built-in datasets.

Its implementation is similar to fuel-convert's: the script lets you choose among the download functions listed in fuel.downloaders.__all__.

I changed the default location to the current working directory following a discussion in #54.

The workflow for downloading a built-in dataset now looks like this:

cd $FUEL_DATA_PATH
fuel-download mnist
fuel-convert mnist

@vdumoulin
Contributor Author

@bartvm The last thing needed for this PR to be ready for review is to change where BinarizedMNIST expects the HDF5 file from $FUEL_DATA_PATH/binarized_mnist/ to $FUEL_DATA_PATH, following our discussion in #54.

In order to do that, the Travis cache needs to be cleared.

@bartvm
Member

bartvm commented Mar 28, 2015

Sure, do you want me to clear it now? Or did you not make the changes yet?

(Excerpt from the diff under review:)

    """
    base_url = ('http://www.cs.toronto.edu/~larocheh/public/datasets/' +
                'binarized_mnist/binarized_mnist_')
    call(['curl', base_url + 'train.amat', '-o',
Member

I'd prefer using requests for this since curl might not be installed on some systems (they might have wget, or something different entirely), plus it's just nicer if we don't have to use bash commands.

Member

We'll need something like this btw: http://stackoverflow.com/a/16696317

Member

Or actually, maybe this is a better option, because most people already have it installed: http://stackoverflow.com/a/27389016

Contributor Author

Are you sure most people have urllib3 installed?

Contributor Author

I managed to get it working with six.moves.urllib, although for some reason the object returned by urllib.request.urlopen can't be used as a context manager, which makes my implementation a bit less clean than I would like.
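For reference, a minimal sketch of this workaround (the `fetch` name is illustrative): contextlib.closing wraps the object returned by urlopen so it can be used in a `with` statement even on Python 2, where it is not a context manager.

```python
# Sketch of the workaround described above: contextlib.closing gives
# `with` semantics for urlopen's return value on both Python 2 and 3.
from contextlib import closing

try:
    from six.moves.urllib.request import urlopen  # py2/py3 compatibility layer
except ImportError:
    from urllib.request import urlopen  # stdlib fallback on Python 3

def fetch(url):
    # closing() guarantees the response is closed even if read() raises.
    with closing(urlopen(url)) as response:
        return response.read()
```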

@vdumoulin
Contributor Author

I just pushed a commit that makes the changes, could you validate that the changes to .travis.yml are correct?

Now would be a good time to clear the cache, in order to remove the unneeded $TRAVIS_BUILD_DIR/binarized_mnist.

@bartvm
Member

bartvm commented Mar 28, 2015

Cache cleared!

@vdumoulin
Contributor Author

I just rebased and force-pushed.

@bartvm
Member

bartvm commented Mar 28, 2015

Btw, not sure, but I think you should have rights to clear the cache as well. If you've got the Travis CLI installed it should be a matter of travis login followed by travis cache --repo bartvm/fuel --delete.

@vdumoulin
Contributor Author

Good to know, thanks!

@vdumoulin
Contributor Author

The only thing left to verify is whether Travis will cache the generated binarized_mnist.hdf5.

@bartvm
Member

bartvm commented Mar 28, 2015

It will say at the very bottom of the Travis log which files it is caching.

@dwf
Contributor

dwf commented Mar 29, 2015

Just in reference to urllib3: it's a very stable package and it's pure Python, so it's basically Python packaging on Easy Mode, and I'd be fine depending on that if it helps us at all. (Odd trivia, it is written and maintained by my undergrad homeboy @shazow -- I've never worked on a package that had any reason to use it... OH GOD WE'RE CROSSING THE STREAMS).

@dwf
Contributor

dwf commented Mar 29, 2015

I'm going to raise something because I want to very soon add ImageNet support to this tool: some datasets that we would like built-in do not have a public download URL. In ImageNet's case, there is a download URL, but not a public one. What would be a good way to handle this? I'm thinking argparse subcommands are the most flexible option.

@shazow

shazow commented Mar 29, 2015

If you don't care about anything, then yoloswag stdlib urllib is fine. If you care about verified HTTPS/TLS certs, or about re-using sockets/connection pooling, thread safety, retries, etc., then you probably want urllib3+certifi. Alternatively you could use requests, which bundles urllib3+certifi and a bunch of other things. (Hi @dwf!)

@vdumoulin
Contributor Author

@dwf We could turn functions defined in fuel.downloaders into functions that accept a subparser and fill in its arguments.
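A minimal sketch of that idea (the names `fill_subparser` and `dataset` are illustrative, not necessarily Fuel's actual API): each downloader module exposes a function that receives its own subparser and configures it, and the main script wires them together.

```python
# Illustrative sketch of the subparser-configuring pattern described
# above; names are hypothetical, not Fuel's actual API.
import argparse

def mnist_fill_subparser(subparser):
    """Configure the `mnist` subcommand with its own arguments."""
    subparser.add_argument('-d', '--directory', default='.',
                           help='where to save the downloaded files')

parser = argparse.ArgumentParser(prog='fuel-download')
subparsers = parser.add_subparsers(dest='dataset')
# Each downloader module would register itself like this.
mnist_fill_subparser(subparsers.add_parser('mnist'))

args = parser.parse_args(['mnist', '--directory', '/tmp'])
```

This keeps per-dataset options (credentials for ImageNet, say) out of the main script entirely.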

@vdumoulin
Contributor Author

I've refactored things a bit. We're now using urllib3 (thanks @shazow!) to download, and downloading has been factored into the fuel.downloaders.base.download function like @bartvm suggested.

@dwf I managed to get subparsers working for fuel-download. Functions defined in fuel.downloaders now take an argparse.ArgumentParser object as input and configure it.

The logic in fuel.downloaders.binarized_mnist could be further extracted into a generic function when downloading the data amounts to downloading a list of files.
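A rough sketch of what a urllib3-based download helper can look like (the name and signature are illustrative, not necessarily what fuel.downloaders.base.download ended up being): stream the response in chunks into an open file handle, so large datasets are never held in memory.

```python
# Illustrative sketch of a chunked download built on urllib3; the real
# fuel.downloaders.base.download may differ in signature and details.
import urllib3

def download(url, file_handle, chunk_size=1024):
    http = urllib3.PoolManager()
    # preload_content=False defers reading the body so we can stream it.
    response = http.request('GET', url, preload_content=False)
    for chunk in response.stream(chunk_size):
        file_handle.write(chunk)
    response.release_conn()
```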

@vdumoulin
Contributor Author

@dwf You mentioned testing the conversion script. Do you have a suggestion on how this could be done effectively?

@vdumoulin
Contributor Author

@dwf Ping RE.: conversion script testing. Do you have a suggestion on how this could be done effectively?

@dwf
Contributor

dwf commented Apr 1, 2015 via email

@vdumoulin
Contributor Author

Sounds good. One thing that could be done is to factor the logic of concatenating data sources and creating attributes into a function that operates on file handles, and test that function instead.

@vdumoulin
Contributor Author

I made a small step in the right direction by factoring the logic of filling the HDF5 file out of converters.binarized_mnist.

I could go one step further and make conversion methods accept either a path or a file handle for input/output files so we can give our mock files as input.

The only thing that's not clear to me yet is what we'll do about shapes: conversion scripts will likely make assumptions about the size of the data stored in the input file which will break if we make small mock files. Maybe there's a way to store sparse arrays filled with zeros that won't blow up memory usage but will retain the right shapes?

@vdumoulin
Contributor Author

Maybe this would be the way to go.

@vdumoulin
Contributor Author

@dwf @bartvm I feel like everything that needed to be addressed for this PR has been addressed; in fact, this PR even started spilling onto other issues.

I propose that we leave unit testing conversion modules for another PR and merge this one if everything's okay on your end.

That will allow me to rebase PR #64 and go forward with it.

(Excerpt from the diff under review:)

        os.remove(f)
    else:
        for url, f in zip(urls, files):
            with open(f, 'w') as file_handle:
Member

Should be wb I think, else Python 3 complains.
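The complaint being referred to, sketched minimally: Python 3 refuses to write bytes to a file opened in text mode, while binary mode accepts them (file path and payload below are placeholders).

```python
# Minimal illustration of why 'wb' is needed: downloaded content is
# bytes, and Python 3 rejects bytes written to a text-mode file.
import os
import tempfile

payload = b'downloaded bytes'
path = os.path.join(tempfile.mkdtemp(), 'out.dat')

try:
    with open(path, 'w') as f:   # text mode expects str
        f.write(payload)         # raises TypeError on Python 3
except TypeError:
    pass

with open(path, 'wb') as f:      # binary mode accepts bytes
    f.write(payload)
```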

@bartvm
Member

bartvm commented Apr 3, 2015

I'm happy to merge this as soon as Python 3 on Travis is pacified. There are some tiny details I'd change, but I'd rather merge and beautify afterwards than stall PRs for too long.

@dwf
Contributor

dwf commented Apr 3, 2015 via email

@vdumoulin
Contributor Author

What ended up fixing the test failures was changing the mode from 'w' to 'wb' in fuel.downloaders.base.default_downloader and removing the encode part in test_download.

I'm not 100% confident in the fix, could any of you confirm that I'm not doing something stupid that just happens to make these tests pass?

@dwf
Contributor

dwf commented Apr 3, 2015

wb is good practice anyway for cross-platform compatibility, but I'll have a look.

(Excerpt from tests/test_downloaders.py:)

    f = tempfile.SpooledTemporaryFile()
    download(iris_url, f)
    f.seek(0)
    assert hashlib.sha256(f.read()).hexdigest() == iris_hash
Member

Looks good to me, but maybe switch to MD5? On my computer that's almost 3 times faster for a 1GB file (1.4 s vs 3.6 s), which might pay off for very large datasets.

Contributor

I'm -0.5 on MD5, given that it has devastating security problems, but I suppose it's not a big risk since we're not using it to verify executables.

Member

Yeah, it's just for file download integrity, right? Spreading a virus through fuel-download is probably not the most efficient attack vector anyway :p

Contributor Author

Sounds good, I changed it. When we do implement checksum verification of downloaded files, I guess that's something that can be left as a user choice, provided that we compute the checksums using whatever options we want to offer.
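If checksum verification does land, a chunked helper along these lines would keep memory flat regardless of file size (the function name is hypothetical, not Fuel code; the algorithm parameter is exactly the md5-vs-sha256 choice discussed above).

```python
# Hypothetical sketch of chunked checksum computation; not Fuel's code.
import hashlib

def file_checksum(path, algorithm='sha256', chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large datasets never sit in memory."""
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        # iter() with a sentinel reads until f.read() returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```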

@vdumoulin
Contributor Author

Looks like all tests are passing (except Coveralls, which fails for a meager 0.01% decrease in test coverage).

bartvm added a commit that referenced this pull request Apr 4, 2015
@bartvm bartvm merged commit 13362d7 into mila-iqia:master Apr 4, 2015
@vdumoulin vdumoulin deleted the download_script branch April 4, 2015 20:59