This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Generating existing from scratch is expensive #446

Closed
peterbe opened this issue May 2, 2018 · 3 comments

Comments

@peterbe
Contributor

peterbe commented May 2, 2018

Suppose you delete the .records-hashes-$SERVER_NAME.json file. That means this code right here:

new_records = client.get_records(
    _since=previous_run_etag,
    pages=float('inf')
)
is going to extract every single record from the Kinto database into a monster list of dicts.

I measured it with memory_profiler in my Docker container, against a Postgres database of about 500,000 records. Building that new_records list eats up about 1,700 MB.
See https://irccloud.mozilla.com/pastebin/ZhAoBDia/

The solution has to be to consume that paginated result as a stream: for each batch of 10,000 records, compute the hashes, then reuse that allocated memory for the next 10,000 records.
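The batching idea could be sketched roughly as follows. This is only an illustration: `hash_record` and `hash_in_batches` are hypothetical names, not this project's functions, and the hashing scheme (SHA-256 over sorted-key JSON) is an assumption. An in-memory generator stands in for the paginated API.

```python
import hashlib
import json


def hash_record(record):
    """Return a stable hex digest for one record (hypothetical scheme)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def hash_in_batches(pages):
    """Consume an iterable of record pages, keeping only id -> hash.

    Each page (a list of dicts) can be garbage-collected once the loop
    moves on, so peak memory is roughly one page plus the hashes.
    """
    hashes = {}
    for page in pages:
        for record in page:
            hashes[record["id"]] = hash_record(record)
    return hashes


# Stand-in for the paginated API: two pages of three records each.
fake_pages = ([{"id": f"{p}-{i}", "value": i} for i in range(3)] for p in range(2))
result = hash_in_batches(fake_pages)
```

The point of the design is that the full records never accumulate in one list; only the much smaller id-to-hash mapping survives across pages.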

@leplatrem I don't see much in the documentation about how to do the pagination in any other way. Do you have any ideas?

@bqbn
Collaborator

bqbn commented May 2, 2018

I did some manual runs on -stage using v1.2.1, and they show that the program needs at least 5 GB of memory to finish; otherwise it gets OOM-killed.

Until this is fixed, we can expect the program to need more and more memory as the database keeps growing.

@leplatrem
Collaborator

> @leplatrem I don't see much in the documentation about how to do the pagination in any other way. Do you have any ideas?

Indeed, the pagination is not exposed in the kinto-http client :(

As a workaround, using raw requests is relatively easy (untested):

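import requests

url = records_url
params = {"_since": timestamp, "_limit": 1000}  # only needed for the first request
while True:
    resp = requests.get(url, params=params)
    page_records = resp.json()["data"]  # json is a method in requests
    # ... process(page_records)
    url = resp.headers.get("Next-Page")  # full URL of the next page, if any
    if url is None:
        break
    params = None  # assumes the Next-Page URL already carries the query string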
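One way to make a loop like this reusable is to wrap it in a generator that yields one page at a time. The sketch below is an assumption-laden illustration, not kinto-http's API: `iter_pages`, `FakeResponse`, `fake_get` and the `example.test` URLs are all hypothetical, and `get` is expected to be any callable shaped like `requests.get`.

```python
def iter_pages(get, url, params=None):
    """Yield one page of records at a time, following Next-Page headers.

    `get` is any callable with the same shape as `requests.get`.
    """
    while url:
        resp = get(url, params=params)
        resp.raise_for_status()
        yield resp.json()["data"]
        url = resp.headers.get("Next-Page")
        params = None  # assumed: Next-Page already includes the query string


class FakeResponse:
    """Tiny stand-in for requests.Response, for demonstration only."""

    def __init__(self, data, next_page=None):
        self._data = data
        self.headers = {"Next-Page": next_page} if next_page else {}

    def raise_for_status(self):
        pass

    def json(self):
        return {"data": self._data}


def fake_get(url, params=None):
    """Serve two hard-coded pages instead of hitting a real server."""
    first = "http://example.test/records"
    second = "http://example.test/records?page=2"
    pages = {
        first: FakeResponse([{"id": "a"}], next_page=second),
        second: FakeResponse([{"id": "b"}]),
    }
    return pages[url]


pages = list(iter_pages(fake_get, "http://example.test/records", {"_limit": 1}))
```

Against a real server, the call would presumably look like `iter_pages(requests.get, records_url, {"_since": timestamp, "_limit": 1000})`, with each yielded page processed and then discarded.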

peterbe pushed a commit that referenced this issue May 3, 2018
* optimize fetch_existing, fixes #447 #445 #446

* remove fetch_existing migration test

* test fixes

* refactoring endpoint variable
@peterbe
Contributor Author

peterbe commented May 3, 2018

Fixed by 71f34fd

@peterbe peterbe closed this as completed May 3, 2018