Add download script #59
Conversation
Sure, do you want me to clear it now? Or did you not make the changes yet?
```python
    """
    base_url = ('http://www.cs.toronto.edu/~larocheh/public/datasets/' +
                'binarized_mnist/binarized_mnist_')
    call(['curl', base_url + 'train.amat', '-o',
```
I'd prefer using `requests` for this, since `curl` might not be installed on some systems (they might have `wget`, or something different entirely), plus it's just nicer if we don't have to use bash commands.
We'll need something like this btw: http://stackoverflow.com/a/16696317
Or actually, maybe this is a better option, because most people already have it installed: http://stackoverflow.com/a/27389016
Are you sure most people have urllib3 installed?
I managed to get it working with `six.moves.urllib`, although for some reason the object returned by `urllib.request.urlopen` can't be used as a context manager, which makes my implementation a bit less clean than I would like.
I just pushed a commit that makes the changes, could you validate them? Now would be a good time to clear the cache, in order to remove the unneeded …
Cache cleared!
Force-pushed from `23062e7` to `49f4767`.

I just rebased and force-pushed.
Btw, not sure, but I think you should have rights to clear the cache as well. If you've got the Travis CLI installed it should be a matter of …
Good to know, thanks!
The only thing left to verify is whether Travis will cache the generated …
It will say at the very bottom of the Travis log which files it is caching.
Just in reference to …
I'm going to raise something because I want to very soon add ImageNet support to this tool: some datasets that we would like built-in do not have a public download URL. In ImageNet's case, there is a download URL, but not a public one. What would be a good way to handle this? I'm thinking argparse subcommands are the most flexible option.
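A minimal sketch of the subcommand idea: one `argparse` subparser per dataset, each free to define its own flags (e.g. a credentials flag for ImageNet). The `build_parser`/`downloaders` names are illustrative, not taken from the PR:

```python
import argparse


def build_parser(downloaders):
    """Build a fuel-download-style command line with one subcommand per
    dataset. `downloaders` maps dataset names to functions that receive
    the parsed arguments."""
    parser = argparse.ArgumentParser(prog='fuel-download')
    subparsers = parser.add_subparsers(dest='dataset')
    for name, func in downloaders.items():
        subparser = subparsers.add_parser(name)
        subparser.add_argument('-d', '--directory', default='.')
        subparser.set_defaults(func=func)
    return parser
```

A dataset with no public URL could then add, say, a `--url` or `--username` argument to its own subparser without touching the others.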
If you don't care about anything, then yoloswag stdlib `urllib` is fine. If you care about verified HTTPS/TLS certs, then you probably want `urllib3`+`certifi`, or if you care about re-using sockets/connection pooling, thread safety, retries, etc. Alternatively you could use `requests`, which bundles `urllib3`+`certifi` and a bunch of other things. (Hi @dwf!)
@dwf We could turn functions defined in …
I've refactored things a bit. We're now using …

@dwf I managed to get subparsers working for fuel-download. Functions defined in … The logic in …
@dwf You mentioned testing the conversion script. Do you have a suggestion on how this could be done effectively?
@dwf Ping re: conversion script testing. Do you have a suggestion on how this could be done effectively?
I have been thinking about this, especially in light of my in-progress script for ImageNet. I think one way is to create in-memory mock files with `io.BytesIO` that are much smaller than the originals and verify things that way. What do you think? (Going to have another look and try to offer testability refactoring comments specifically.)
Sounds good. One of the things that could be done is factor the logic of concatenating data sources and creating attributes into a function that operates on file handles and test that function instead.
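The suggestion above can be sketched as follows; `concatenate_sources` is a hypothetical name for such a handle-level helper, and the test substitutes small `io.BytesIO` mocks for the real dataset files:

```python
import io


def concatenate_sources(in_handles, out_handle):
    """Merge several open data sources into one output handle.
    Because it only sees file-like objects, tests can pass io.BytesIO
    mocks instead of multi-gigabyte dataset files."""
    for handle in in_handles:
        out_handle.write(handle.read())
```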
I made a small step in the right direction by factoring … out of …

I could go one step further and make conversion methods accept either a path or a file handle for input/output files so we can give our mock files as input. The only thing that's not clear to me yet is what we'll do about shapes: conversion scripts will likely make assumptions about the size of the data stored in the input file, which will break if we make small mock files. Maybe there's a way to store sparse arrays filled with zeros that won't blow up memory usage but will retain the right shapes?
Maybe this would be the way to go.
@dwf @bartvm I feel like everything that needed to be addressed for this PR has been addressed; in fact, this PR even started spilling onto other issues. I propose that we leave unit testing conversion modules for another PR and merge this one if everything's okay on your end. That will allow me to rebase PR #64 and go forward with it.
```python
        os.remove(f)
else:
    for url, f in zip(urls, files):
        with open(f, 'w') as file_handle:
```
Should be `wb` I think, else Python 3 complains.
I'm happy to merge this as soon as Python 3 on Travis is pacified. There are some tiny details I'd change, but I'd rather merge and beautify afterwards than stall PRs for too long.
Agreed, let's merge once Travis is happy.
What ended up fixing the test failures was changing the mode from `'w'` to `'wb'`.

I'm not 100% confident in the fix, could any of you confirm that I'm not doing something stupid that just happens to make these tests pass?
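The behaviour behind the fix can be demonstrated directly: downloaded data is `bytes`, and Python 3 refuses to write `bytes` to a text-mode handle, while binary mode works on both Python 2 and 3 (the payload below is just an illustrative stand-in for downloaded data):

```python
import os
import tempfile

payload = b'\x89binary data'  # downloads are bytes, not text

# Binary mode works on both Python 2 and 3:
with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
    f.write(payload)

# Text mode raises TypeError on Python 3 ("write() argument must be
# str, not bytes"), which is what the Travis failures pointed at:
try:
    with open(f.name, 'w') as g:
        g.write(payload)
    text_mode_ok = True
except TypeError:
    text_mode_ok = False

os.remove(f.name)
```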
`wb` is good practice anyway for cross-platform compatibility, but I'll have a …
```python
f = tempfile.SpooledTemporaryFile()
download(iris_url, f)
f.seek(0)
assert hashlib.sha256(f.read()).hexdigest() == iris_hash
```
Looks good to me, but maybe switch to MD5? On my computer that's almost 3 times faster for a 1GB file (1.4 s vs 3.6 s), which might pay off for very large datasets.
I'm -0.5 on MD5, given that it has devastating security problems, but I suppose it's not a big risk since we're not using it to verify executables.
Yeah, it's just for file download integrity, right? Spreading a virus through fuel-download is probably not the most efficient attack vector anyway :p
Sounds good, I changed it. When we do implement checksum verification of downloaded files, I guess that's something that can be left as a user choice, provided that we compute the checksums using whatever options we want to offer.
Looks like all tests are passing (except Coveralls, which fails for a meager 0.01% decrease in test coverage).
This PR introduces a `fuel-download` script to download built-in datasets.

Its implementation is similar to `fuel-convert`'s: the script lets the user choose among the download functions listed in `fuel.downloaders.__all__`.

I changed the default location to the current working directory following a discussion in #54.

The workflow for downloading a built-in dataset now looks like this:
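The workflow example itself was cut off in the thread; a minimal sketch under the assumption that datasets are addressed by name and converted with the companion `fuel-convert` script (the dataset name and ordering here are illustrative):

```shell
# Hypothetical invocation, run from the target data directory:
#   fuel-download binarized_mnist    # fetch the raw files
#   fuel-convert binarized_mnist     # convert them to Fuel's format
```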