
Add download script #59

Merged (29 commits, Apr 4, 2015)

Conversation

vdumoulin
Contributor

This PR introduces a fuel-download script to download built-in datasets.

Its implementation is similar to fuel-convert's: the script lets you choose among the download functions listed in fuel.downloaders.__all__.

I changed the default location to the current working directory following a discussion in #54.

The workflow for downloading a built-in dataset now looks like this:

cd $FUEL_DATA_PATH
fuel-download mnist
fuel-convert mnist

@vdumoulin
Contributor Author

@bartvm The last thing needed for this PR to be ready for review is to change where BinarizedMNIST expects the HDF5 file from $FUEL_DATA_PATH/binarized_mnist/ to $FUEL_DATA_PATH, following our discussion in #54.

In order to do that, the Travis cache needs to be cleared.

@bartvm
Member

bartvm commented Mar 28, 2015

Sure, do you want me to clear it now? Or did you not make the changes yet?

(Excerpt from the diff under review:)

    """
    base_url = ('http://www.cs.toronto.edu/~larocheh/public/datasets/' +
                'binarized_mnist/binarized_mnist_')
    call(['curl', base_url + 'train.amat', '-o',
Member

I'd prefer using requests for this since curl might not be installed on some systems (they might have wget, or something different entirely), plus it's just nicer if we don't have to use bash commands.

Member

We'll need something like this btw: http://stackoverflow.com/a/16696317

Member

Or actually, maybe this is a better option, because most people already have it installed: http://stackoverflow.com/a/27389016

Contributor Author

Are you sure most people have urllib3 installed?

Contributor Author

I managed to get it working with six.moves.urllib, although for some reason the object returned by urllib.request.urlopen can't be used as a context manager, which makes my implementation a bit less clean than I would like.
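For reference, a minimal sketch of this workaround (the `fetch` name is illustrative): contextlib.closing wraps the object returned by urlopen so it can be used in a `with` statement even on Python 2, where it is not a context manager.

```python
# Sketch of the workaround described above: contextlib.closing gives
# `with` semantics for urlopen's return value on both Python 2 and 3.
from contextlib import closing

try:
    from six.moves.urllib.request import urlopen  # py2/py3 compatibility layer
except ImportError:
    from urllib.request import urlopen  # stdlib fallback on Python 3

def fetch(url):
    # closing() guarantees the response is closed even if read() raises.
    with closing(urlopen(url)) as response:
        return response.read()
```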

@vdumoulin
Contributor Author

I just pushed a commit that makes the changes, could you validate that the changes to .travis.yml are correct?

Now would be a good time to clear the cache, in order to remove the unneeded $TRAVIS_BUILD_DIR/binarized_mnist.

@bartvm
Member

bartvm commented Mar 28, 2015

Cache cleared!

@vdumoulin
Contributor Author

I just rebased and force-pushed.

@bartvm
Member

bartvm commented Mar 28, 2015

Btw, not sure, but I think you should have rights to clear the cache as well. If you've got the Travis CLI installed it should be a matter of travis login followed by travis cache --repo bartvm/fuel --delete.

@vdumoulin
Contributor Author

Good to know, thanks!

@vdumoulin
Contributor Author

The only thing left to verify is whether Travis will cache the generated binarized_mnist.hdf5.

@bartvm
Member

bartvm commented Mar 28, 2015

It will say at the very bottom of the Travis log which files it is caching.

@dwf
Contributor

dwf commented Mar 29, 2015

Just in reference to urllib3: it's a very stable package and it's pure Python, so it's basically Python packaging on Easy Mode, and I'd be fine depending on that if it helps us at all. (Odd trivia, it is written and maintained by my undergrad homeboy @shazow -- I've never worked on a package that had any reason to use it... OH GOD WE'RE CROSSING THE STREAMS).

@dwf
Contributor

dwf commented Mar 29, 2015

I'm going to raise something because I want to very soon add ImageNet support to this tool: some datasets that we would like built-in do not have a public download URL. In ImageNet's case, there is a download URL, but not a public one. What would be a good way to handle this? I'm thinking argparse subcommands are the most flexible option.

@shazow

shazow commented Mar 29, 2015

If you don't care about anything, then yoloswag stdlib urllib is fine. If you care about verified HTTPS/TLS certs, or about re-using sockets/connection pooling, thread safety, retries, etc., then you probably want urllib3+certifi. Alternatively you could use requests, which bundles urllib3+certifi and a bunch of other things. (Hi @dwf!)

@vdumoulin
Contributor Author

@dwf We could turn functions defined in fuel.downloaders into functions that accept a subparser and fill in its arguments.
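A minimal sketch of that idea (the names `fill_subparser` and `dataset` are illustrative, not necessarily Fuel's actual API): each downloader module exposes a function that receives its own subparser and configures it, and the main script wires them together.

```python
# Illustrative sketch of the subparser-configuring pattern described
# above; names are hypothetical, not Fuel's actual API.
import argparse

def mnist_fill_subparser(subparser):
    """Configure the `mnist` subcommand with its own arguments."""
    subparser.add_argument('-d', '--directory', default='.',
                           help='where to save the downloaded files')

parser = argparse.ArgumentParser(prog='fuel-download')
subparsers = parser.add_subparsers(dest='dataset')
# Each downloader module would register itself like this.
mnist_fill_subparser(subparsers.add_parser('mnist'))

args = parser.parse_args(['mnist', '--directory', '/tmp'])
```

This keeps per-dataset options (credentials for ImageNet, say) out of the main script entirely.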

@vdumoulin
Contributor Author

I've refactored things a bit. We're now using urllib3 (thanks @shazow!) to download, and downloading has been factored into the fuel.downloaders.base.download function like @bartvm suggested.

@dwf I managed to get subparsers working for fuel-download. Functions defined in fuel.downloaders now take an argparse.ArgumentParser object as input and configure it.

The logic in fuel.downloaders.binarized_mnist could be further extracted into a generic function when downloading the data amounts to downloading a list of files.
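A rough sketch of what a urllib3-based download helper can look like (the name and signature are illustrative, not necessarily what fuel.downloaders.base.download ended up being): stream the response in chunks into an open file handle, so large datasets are never held in memory.

```python
# Illustrative sketch of a chunked download built on urllib3; the real
# fuel.downloaders.base.download may differ in signature and details.
import urllib3

def download(url, file_handle, chunk_size=1024):
    http = urllib3.PoolManager()
    # preload_content=False defers reading the body so we can stream it.
    response = http.request('GET', url, preload_content=False)
    for chunk in response.stream(chunk_size):
        file_handle.write(chunk)
    response.release_conn()
```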

@vdumoulin
Contributor Author

@dwf You mentioned testing the conversion script. Do you have a suggestion on how this could be done effectively?

@vdumoulin
Contributor Author

@dwf Ping RE.: conversion script testing. Do you have a suggestion on how this could be done effectively?

@dwf
Contributor

dwf commented Apr 1, 2015 via email

@vdumoulin
Contributor Author

Sounds good. One thing that could be done is to factor the logic of concatenating data sources and creating attributes into a function that operates on file handles, and test that function instead.

@vdumoulin
Contributor Author

I made a small step in the right direction by factoring the logic of filling the HDF5 file out of converters.binarized_mnist.

I could go one step further and make conversion methods accept either a path or a file handle for input/output files so we can give our mock files as input.

The only thing that's not clear to me yet is what we'll do about shapes: conversion scripts will likely make assumptions about the size of the data stored in the input file which will break if we make small mock files. Maybe there's a way to store sparse arrays filled with zeros that won't blow up memory usage but will retain the right shapes?

@vdumoulin
Contributor Author

Maybe this would be the way to go.

@vdumoulin
Contributor Author

@dwf @bartvm I feel like everything that needed to be addressed for this PR has been addressed; in fact, this PR even started spilling onto other issues.

I propose that we leave unit testing conversion modules for another PR and merge this one if everything's okay on your end.

That will allow me to rebase PR #64 and go forward with it.

(Excerpt from the diff under review:)

        os.remove(f)
    else:
        for url, f in zip(urls, files):
            with open(f, 'w') as file_handle:
Member

Should be wb I think, else Python 3 complains.
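The complaint being referred to, sketched minimally: Python 3 refuses to write bytes to a file opened in text mode, while binary mode accepts them (file path and payload below are placeholders).

```python
# Minimal illustration of why 'wb' is needed: downloaded content is
# bytes, and Python 3 rejects bytes written to a text-mode file.
import os
import tempfile

payload = b'downloaded bytes'
path = os.path.join(tempfile.mkdtemp(), 'out.dat')

try:
    with open(path, 'w') as f:   # text mode expects str
        f.write(payload)         # raises TypeError on Python 3
except TypeError:
    pass

with open(path, 'wb') as f:      # binary mode accepts bytes
    f.write(payload)
```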

@bartvm
Member

bartvm commented Apr 3, 2015

I'm happy to merge this as soon as Python 3 on Travis is pacified. There are some tiny details I'd change, but I'd rather merge and beautify afterwards than stall PRs for too long.

@dwf
Contributor

dwf commented Apr 3, 2015 via email

@vdumoulin
Contributor Author

What ended up fixing the test failures was changing the mode from 'w' to 'wb' in fuel.downloaders.base.default_downloader and removing the encode part in test_download.

I'm not 100% confident in the fix, could any of you confirm that I'm not doing something stupid that just happens to make these tests pass?

@dwf
Contributor

dwf commented Apr 3, 2015

wb is good practice anyway for cross-platform compatibility, but I'll have a look.

(Excerpt from tests/test_downloaders.py:)

    f = tempfile.SpooledTemporaryFile()
    download(iris_url, f)
    f.seek(0)
    assert hashlib.sha256(f.read()).hexdigest() == iris_hash
Member

Looks good to me, but maybe switch to MD5? On my computer that's almost 3 times faster for a 1GB file (1.4 s vs 3.6 s), which might pay off for very large datasets.

Contributor

I'm -0.5 on MD5, given that it has devastating security problems, but I suppose it's not a big risk since we're not using it to verify executables.

Member

Yeah, it's just for file download integrity, right? Spreading a virus through fuel-download is probably not the most efficient attack vector anyway :p

Contributor Author

Sounds good, I changed it. When we do implement checksum verification of downloaded files, I guess that's something that can be left as a user choice, provided that we compute the checksums using whatever options we want to offer.
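If checksum verification does land, a chunked helper along these lines would keep memory flat regardless of file size (the function name is hypothetical, not Fuel code; the algorithm parameter is exactly the md5-vs-sha256 choice discussed above).

```python
# Hypothetical sketch of chunked checksum computation; not Fuel's code.
import hashlib

def file_checksum(path, algorithm='sha256', chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large datasets never sit in memory."""
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        # iter() with a sentinel reads until f.read() returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```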

@vdumoulin
Contributor Author

Looks like all tests are passing (except Coveralls, which fails for a meager 0.01% decrease in test coverage).

bartvm added a commit that referenced this pull request Apr 4, 2015
@bartvm bartvm merged commit 13362d7 into mila-iqia:master Apr 4, 2015
@vdumoulin vdumoulin deleted the download_script branch April 4, 2015 20:59