This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Disk cache the manifests, fixes #392 #403

Merged: 7 commits merged into mozilla-services:master on Apr 19, 2018

Conversation

peterbe (Contributor) commented Apr 13, 2018

The gist of this is that the downloading of the .csv.gz files is now broken up into steps.

  1. First clear out any old downloaded .csv.gz files that are older than 48 hours.
  2. For each .csv.gz file in the manifest.json, instead of downloading it and yielding each CSV row, we download it to $CSV_DOWNLOAD_DIRECTORY (*)
  3. After the file has been downloaded, instead of reading it from a network URL, we read it from the local file.

(*) In docker-compose this becomes ./csv-download-directory but outside Docker it becomes $TMPDIR/csv-download-directory. This way it's kept between stopping and starting Docker.
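
A minimal synchronous sketch of that flow. The helper names (clear_old_downloads, download_if_missing, iter_csv_rows) and the use of urllib are illustrative assumptions; the actual PR code is async and downloads the files from S3:

import csv
import gzip
import os
import tempfile
import time
import urllib.request

CSV_DOWNLOAD_DIRECTORY = os.path.join(tempfile.gettempdir(), 'csv-download-directory')
DOWNLOAD_MAX_AGE = 60 * 60 * 24 * 2  # 48 hours, in seconds

def clear_old_downloads(directory, max_age=DOWNLOAD_MAX_AGE):
    # Step 1: delete previously downloaded .csv.gz files older than 48 hours.
    if not os.path.isdir(directory):
        return
    now = time.time()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.endswith('.csv.gz') and now - os.stat(path).st_mtime > max_age:
            os.remove(path)

def download_if_missing(url, md5checksum, directory=CSV_DOWNLOAD_DIRECTORY):
    # Step 2: download the .csv.gz file to disk, keyed by its MD5 checksum,
    # unless a non-empty copy is already cached there.
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, md5checksum + '.csv.gz')
    if not (os.path.isfile(file_path) and os.stat(file_path).st_size > 0):
        urllib.request.urlretrieve(url, file_path)
    return file_path

def iter_csv_rows(file_path):
    # Step 3: yield each CSV row from the local file instead of a network stream.
    with gzip.open(file_path, 'rt') as f:
        yield from csv.reader(f)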

peterbe requested a review from leplatrem on April 13, 2018 19:36
if not os.path.isdir(download_directory):
    os.mkdir(download_directory)

print("LOOKING AT", download_directory)

peterbe (Contributor Author):

Oops. I have to delete this.


# Make sure the directory exists if it wasn't already created.
if not os.path.isdir(download_directory):
    os.mkdir(download_directory)

Collaborator:

nit: use os.makedirs(..., exist_ok=True)?

os.mkdir(download_directory)
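
The suggested variant would look roughly like this (sketch only):

import os

# Create the directory, including missing parents, without erroring if it already exists.
os.makedirs(download_directory, exist_ok=True)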

# Look for old download junk in the download directory.
too_old = 60 * 60 * 24 * 2 # two days

Collaborator:

move to a DOWNLOAD_MAX_AGE constant?
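
i.e. something like this (sketch; the constant name comes from the suggestion above, its placement is an assumption):

# Module-level constant instead of an inline magic number.
DOWNLOAD_MAX_AGE = 60 * 60 * 24 * 2  # two days, in seconds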

files['MD5checksum'] + '.csv.gz'
)
# The file already exists and has data.
if os.path.isfile(file_path) and os.stat(file_path).st_size:

Collaborator:

nit: add explicit st_size > 0

peterbe (Contributor Author):

Out of curiosity; why?

Isn't it explicit already that the test is for the st_size to be anything greater than 0?

Collaborator:

Matter of taste maybe :)

gzip_chunk = await source.read(chunk_size)
if not gzip_chunk:
    break  # End of response.
await destination.write(gzip_chunk)

Collaborator:

Here maybe you should try/except and delete the partially downloaded file when an error happens. Or maybe you don't want to check the md5sum when resuming?

peterbe (Contributor Author):

I was totally not sure how to do this in the asyncio world, especially not how to test it. But I did write this, locally:

import time
import os
import asyncio
import aiohttp
import aiofiles

CHUNK_SIZE = 1024 * 256  # 256 KB


async def download_csv(loop, url):
    file_path = '/tmp/' + os.path.basename(url).split('?')[0]
    # Remove any leftover file from a previous run.
    os.path.isfile(file_path) and os.remove(file_path)

    async with aiohttp.ClientSession(loop=loop) as session:
        try:
            async with aiofiles.open(file_path, 'wb') as destination:
                async with session.get(url) as resp:
                    print(resp.status)
                    while True:
                        print(resp.status, end='', flush=True)
                        # Deliberately slow the loop down so the server
                        # can be killed mid-download while testing.
                        time.sleep(0.05)
                        chunk = await resp.content.read(CHUNK_SIZE)
                        if not chunk:
                            break
                        await destination.write(chunk)
        except aiohttp.client_exceptions.ClientPayloadError:
            print('\n')
            mb = os.stat(file_path).st_size / 1024 / 1024
            print(f'WROTE {mb:.1f}MB')
            # Clean up the partially downloaded file.
            os.remove(file_path)
            raise
    print('\nall Done!')


def run():
    loop = asyncio.get_event_loop()
    url = 'http://10.0.0.80:8080/file.csv'
    loop.run_until_complete(download_csv(loop, url))


if __name__ == '__main__':
    run()

That's using plain aiohttp.ClientSession(loop=loop) though, not await s3_client.get_object(...).

When I run that in one terminal, then quickly switch to the other terminal where I run the HTTP server on :8080 and kill it, the except block successfully cleans up the half-downloaded file.

r?

except ClientPayloadError:
    if os.path.exists(file_path):
        os.remove(file_path)
    raise

Collaborator:

r+ :)

peterbe merged commit 178f97a into mozilla-services:master on Apr 19, 2018
peterbe deleted the disk-cache-the-manifests-fixes-392 branch on April 19, 2018 13:18