Generating existing from scratch is expensive #446

peterbe · 2018-05-02T20:35:29Z

Suppose you delete the .records-hashes-$SERVER_NAME.json file. That means this code right here:

Lines 158 to 161 in ee484fa

    new_records = client.get_records(
        _since=previous_run_etag,
        pages=float('inf')
    )

is going to extract every single record from the Kinto database into a monster list of dicts.

I measured it with memory_profiler in my Docker setup, against a Postgres database of about 500,000 records. Generating that new_records list eats up about 1700 MB.
See https://irccloud.mozilla.com/pastebin/ZhAoBDia/

The solution has to be to consume that paginated result as a stream and, for each batch of 10,000 records, make the hashes and then reuse that allocated memory for the next 10,000 records.
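
A rough sketch of that streaming idea, assuming the records arrive as an iterable of batches. The names hash_record and hashes_from_batches are hypothetical, and SHA-256 over key-sorted JSON is only an assumed hashing scheme, not necessarily what this project does:

```python
import hashlib
import json
from typing import Dict, Iterable, List


def hash_record(record: dict) -> str:
    # Assumed scheme: SHA-256 of the record serialized with sorted keys,
    # so the same record always produces the same hash.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def hashes_from_batches(batches: Iterable[List[dict]]) -> Dict[str, str]:
    """Build an id -> hash mapping while holding one batch at a time."""
    hashes: Dict[str, str] = {}
    for batch in batches:
        for record in batch:
            hashes[record["id"]] = hash_record(record)
        # `batch` is rebound on the next iteration, so the previous
        # batch of full records becomes eligible for garbage collection.
    return hashes
```

Only the small id -> hash mapping grows with the size of the database; the full record dicts never exist in memory all at once, provided the batches come from a lazy source such as a generator.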

@leplatrem I don't see a whole lot in the documentation about how to do the pagination in any other way. Do you have some ideas?

bqbn · 2018-05-02T21:41:26Z

Did some manual runs on -stage using v1.2.1, and they show that the program requires at least 5 GB of memory to finish; otherwise it gets OOM-killed.

Until this is fixed, we can expect the program to need even more memory as the database keeps growing.

leplatrem · 2018-05-03T09:01:00Z

> @leplatrem I don't see a whole lot in the documentation about how to do the pagination in any other way. Do you have some ideas?

Indeed, the pagination is not exposed in the kinto-http client :(

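As a workaround, using raw requests is relatively easy (untested):

    import requests

    url = records_url
    # The query parameters are only needed on the first request;
    # the Next-Page URL is expected to carry them already.
    params = {"_since": timestamp, "_limit": 1000}
    while True:
        resp = requests.get(url, params=params)
        page_records = resp.json()["data"]
        # ... process(page_records)
        url = resp.headers.get("Next-Page")
        if url is None:
            break
        params = None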
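
The loop above could also be wrapped in a generator, so the caller just iterates over pages. This is a minimal sketch, not the code that ended up in the fix: iter_pages and fetch_page are hypothetical names, and it uses the standard library's urllib instead of requests so that it has no third-party dependency. The fetch function is passed in so the pagination logic can be exercised without a server:

```python
import json
from typing import Callable, Iterator, List, Optional, Tuple
from urllib.request import urlopen

# One page of records plus the URL of the next page (None on the last page).
Page = Tuple[List[dict], Optional[str]]


def fetch_page(url: str) -> Page:
    """Fetch one page and return its records and the Next-Page header."""
    with urlopen(url) as resp:
        body = json.load(resp)
        return body["data"], resp.headers.get("Next-Page")


def iter_pages(
    first_url: str, fetch: Callable[[str], Page] = fetch_page
) -> Iterator[List[dict]]:
    """Yield one page of records at a time, following Next-Page links."""
    url: Optional[str] = first_url
    while url is not None:
        records, url = fetch(url)
        yield records
```

Here first_url is assumed to already include the query string (for example the _since and _limit parameters). Because pages are yielded lazily, only one page of records needs to be held in memory at any time.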

peterbe · 2018-05-03T16:57:38Z

Fixed by 71f34fd

peterbe pushed a commit that referenced this issue May 3, 2018

optimize fetch_existing, fixes #447 #445 #446 (#448)

71f34fd

* optimize fetch_existing, fixes #447 #445 #446
* remove fetch_existing migration test
* test fixes
* refactoring endpoint variable

peterbe closed this as completed May 3, 2018
