Script to fetch receipt images #35

Merged
merged 47 commits into from
Sep 14, 2016

Conversation

cuducos (Collaborator) commented Sep 2, 2016

This is a program to download receipts to a local directory (fix #33).

$ python src/fetch_receipts.py
    This script downloads the receipt images from the Lower House server.

    Be aware that downloading everything might use more than 1 TB of disk
    space.  Because of that you have to specify one `target` directory (where
    to save the files) and optionally you can specify with `--limit` the number
    of images to be downloaded.

    If the `target` directory exists and already has some saved receipts,
    these receipts will not be downloaded again (and they will not count when
    using `--limit` either).

    In other words, if you already have 42 receipts in your target folder,
    running the command with a limit of 8 will end up in a directory with 50
    files: the 42 you already had and 8 freshly downloaded ones.


positional arguments:
  target                Directory where images will be saved.

optional arguments:
  -h, --help            show this help message and exit
  -l LIMIT, --limit LIMIT
                        Limit the number of receipts to be saved

It downloads the files to a target other than data/ because the images tend to use a huge amount of disk space (possibly more than 1 TB).

That said, people might want to point it to an external volume (using the target positional argument) and/or limit the number of receipts to be downloaded (using the --limit optional argument).
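
For illustration, the skip-and-limit behaviour described above could be sketched roughly like this. This is not the code in src/fetch_receipts.py: `pending_receipts` and the file names are hypothetical, and the sketch assumes a receipt is identified by its file name inside `target`.

```python
import os
import tempfile


def pending_receipts(receipt_names, target, limit=None):
    """Return the receipt file names that still need to be downloaded.

    Hypothetical helper: files already present in `target` are skipped
    and do not count towards `limit`.
    """
    existing = set(os.listdir(target)) if os.path.isdir(target) else set()
    pending = [name for name in receipt_names if name not in existing]
    return pending if limit is None else pending[:limit]


# 42 receipts already saved and a limit of 8: only 8 new ones are selected
with tempfile.TemporaryDirectory() as target:
    names = ['{}.pdf'.format(n) for n in range(100)]
    for name in names[:42]:
        open(os.path.join(target, name), 'w').close()
    todo = pending_receipts(names, target, limit=8)
    print(len(todo))  # 8
```

Applying the limit after filtering out the existing files is what makes a second run with `--limit 8` end up with 50 files instead of stopping immediately.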

It uses a new external library (added to conda_requirements.txt) called humanize. You can manually install it with pip install humanize if you want.

Finally, I recommend that everyone with a spare TB at home save the receipts until we can raise money to afford a virtual drive for the project with that amount of free space.

Using it to download everything to /Volumes/Narnia/Serenata:

$ python src/fetch_receipts.py /Volumes/Narnia/Serenata

Using it to download 10,000 receipts to /Volumes/Narnia/Serenata:

$ python src/fetch_receipts.py --limit 10000 /Volumes/Narnia/Serenata

'errors': list(),
'skipped': list()
}
for receipt in self.receipts:
Contributor

I don't think downloading the receipts belongs in the class constructor. Maybe it's a good idea to move it into a run method that fetches the receipts, plus a print_report method that prints the error messages, if there are any.
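
A rough sketch of what that split could look like. The class name, the `fetch` callable and the `downloaded` key are illustrative assumptions, not taken from this PR; only the `errors` and `skipped` lists and the loop over `self.receipts` appear in the diff.

```python
class Receipts:
    def __init__(self, receipts, fetch):
        # The constructor only stores state; nothing is downloaded here
        self.receipts = receipts
        self.fetch = fetch  # hypothetical callable that downloads one receipt
        self.report = {'downloaded': list(), 'errors': list(), 'skipped': list()}

    def run(self):
        """Fetch every receipt and record what happened to each one."""
        for receipt in self.receipts:
            try:
                saved = self.fetch(receipt)
            except Exception as error:
                self.report['errors'].append((receipt, str(error)))
                continue
            key = 'downloaded' if saved else 'skipped'
            self.report[key].append(receipt)
        return self.report

    def print_report(self):
        """Print one line per receipt that could not be fetched."""
        for receipt, message in self.report['errors']:
            print('Error fetching {}: {}'.format(receipt, message))


def fake_fetch(receipt):
    # Stand-in for the real download: fails once and skips once
    if receipt == 'broken':
        raise IOError('not found')
    return receipt != 'cached'


job = Receipts(['a', 'cached', 'broken'], fake_fetch)
job.run()
job.print_report()  # Error fetching broken: not found
```

With this shape, creating the object has no side effects, so it can be built and inspected without hitting the network.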

Collaborator Author

That sounds like a good idea, thanks!

Do you wanna step in and send a PR to this branch? These days my workload is focused on Issue #34. Otherwise, I'll catch up here later.

@cuducos cuducos changed the title Script to fetch receipt images (fix #33) Script to fetch receipt images Sep 8, 2016
Collaborator Author

cuducos commented Sep 8, 2016

Done, @mtrovo! Way better now. Thanks for the advice.

cuducos and others added 5 commits September 8, 2016 13:52
Signed-off-by: Patrick José Pereira <patrickelectric@gmail.com>
Signed-off-by: Patrick José Pereira <patrickelectric@gmail.com>
Add missing apostrophe
Contributor

mtrovo commented Sep 10, 2016

Hey @cuducos, I made this change on a local branch and added some improvements to download images in parallel. Can you take a look at #50?
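
As a generic illustration of parallel downloads (this is not the code from #50), a thread pool from the standard library is one common way to do it. `fetch_all`, `fetch` and `workers` are hypothetical names, and `fetch` is assumed to raise an exception when a download fails.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(receipts, fetch, workers=4):
    """Run `fetch(receipt)` for every receipt using a pool of threads.

    Returns a (done, errors) pair; `errors` holds (receipt, message) tuples.
    """
    done, errors = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its receipt so failures can be reported
        futures = {pool.submit(fetch, receipt): receipt for receipt in receipts}
        for future in as_completed(futures):
            receipt = futures[future]
            try:
                future.result()
            except Exception as error:
                errors.append((receipt, str(error)))
            else:
                done.append(receipt)
    return done, errors
```

Threads suit this kind of job because the work is mostly waiting on the network, and collecting the errors in the main thread keeps the report in one place.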

Irio and others added 13 commits September 10, 2016 21:39
Since many files are generated based on previous timestamps, leaving
these lines here generates filenames with multiple timestamps, which is
not wanted.
Following convention proposed in CONTRIBUTING.
Prevent unnecessary downloads/uploads of datasets and documentation
)

* Clean up

* 💅

* Add Jarbas

* Add documentation related to src/

* Add documentation ref. available datasets

* Typos & minor fixes

* Correct filename
Fix typos on CONTRIBUTING.md
Change setup script to Python 3 shebang
@cuducos cuducos merged commit 22b0482 into master Sep 14, 2016
@cuducos cuducos deleted the ec-fetch-receipts branch October 19, 2016 19:50
Development

Successfully merging this pull request may close these issues.

Backup pictures from all receipts
9 participants