Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For speed reasons, in cases where called repeatedly in some process, can we do more memory caching? #9

Closed
odscjames opened this issue Feb 28, 2019 · 10 comments

Comments

@odscjames
Copy link
Contributor

odscjames commented Feb 28, 2019

For speed reasons, in cases where called repeatedly in some process ....

eg Kingfisher calling once for each piece of data

.... can we do more memory caching?

eg Cache the complied schema with extensions - have seen a case with complex extensions where the process was working slowly and we wondered if caching compiled schema would help the speed?

From @yolile

@odscjames odscjames changed the title For speed reasons, in cases where called repeatdely in some process, can we do more memory caching? For speed reasons, in cases where called repeatedly in some process, can we do more memory caching? Feb 28, 2019
@odscjames
Copy link
Contributor Author

odscjames commented Mar 1, 2019

Me and @kindly had call and think we found this:

https://github.com/OpenDataServices/lib-cove/blob/master/libcove/lib/common.py#L230 is the def load_codelist(url):
This calls response = requests.get(url)
ie Getting the codelists is not cached!!!

If we cached the request to get the codelist that would really help - can maybe use https://github.com/OpenDataServices/lib-cove/blob/master/libcove/lib/tools.py#L8 cached_get_request

BUT

This was deliberately not used before because in a long running process we didn't want to get a cached item that might then become stale. ALso CoVE was meant to be used by people testing extensions, so didn't want to cache.

So having this as an option (off by defaults, on in kingfisher)?

Also check - is request to get extension JSON cached?

@odscjames
Copy link
Contributor Author

Also check - is request to get extension JSON cached?

The call to extension.json IS NOT CACHED
The call to release-schema.json is cached IF cache_schema option is used.

https://github.com/open-contracting/lib-cove-ocds/blob/master/libcoveocds/schema.py#L179 apply_extensions func

@odscjames
Copy link
Contributor Author

The answer should include an option so you can choose

  • only cache Core stuff
  • cache everything

Because of "CoVE was meant to be used by people testing extensions" , CoVE would use first option.

Kingfisher would use second option.

@jpmckinney
Copy link
Member

I agree – if, for example, we start a long-running scrape and check, it's fine to cache the schema, codelists, extensions from when the collection is first created, as we're trying to evaluate the data against the schema that was available at that time. There are edge cases (e.g. publisher changes schema and also re-publishes all their data while the scraper is still running), but I'd consider this rare (and in any case we'd probably need to restart the scrape since the data itself had become too stale).

@odscjames
Copy link
Contributor Author

I agree – ....

I've written this up in the linked ticket .... we have 2 separate problems here. The first problem is just to make sure we cache requests for the hour or so a check process takes to run - currently we are DOS-ing servers, we just realised! Doing this will speed up the checks process and mean we don't DOS servers (always good). The second problem is bigger and I've written that up fully in the linked ticket.

@jpmckinney
Copy link
Member

jpmckinney commented Mar 1, 2019

I was thinking of a case like Colombia which takes an estimated 10 days to check (and however long to scrape) – not an hour. But, yes, sounds good.

@yolile
Copy link
Member

yolile commented Mar 1, 2019

(and however long to scrape)

It only takes 5 hours approx to scrape the full dataset.

The problem with the checking is they have 20 declared extensions. And the estimated 10 days is running the checking over differents collections (each one with differents parts of the data) in parallel . So maybe doing this could be a recommendation for check large datasets.

@odscjames
Copy link
Contributor Author

So maybe doing this could be a recommendation for check large datasets.

Actually, running the checks like that has other implications, and we'd want to sort that out. I'd like to tackle this one step at the time.

Caching requests is definitely the right thing to - for DOS reasons and speed reasons. If we do that and find that checks still take far to long we can look at this again. But I suspect that this will really help with speed.

@odscjames
Copy link
Contributor Author

We don't have time to do this properly now, ............... but WHO NEEDS TO DO ANYTHING PROPERLY ANYWAY! :-P

In the branch "master-cache-all-requests" there are 2 commits from V0.1.0 (the version Kingfisher was using), the first adds a speed test and the second makes all requests cached. I ran the tests on the live box and a page of Colombia data .... checking the same file 100 times went from 223 seconds to ..... 22!!!!!!!!!!!!!!!!!!!!!!!!!!!! Now, the test doesn't do other things (like the database access) so I'm not saying we'll see that % speed gain in Kingfisher but clearly we are going to see something pretty good.

This is NOT a long term solution - this obviously means we can't get any new work on this library into Kingfisher. This is a temporary fix for now.

odscjames added a commit that referenced this issue Apr 11, 2019
#9

Also fix to ocds_json_output() func - pass config to Schema class
odscjames added a commit to OpenDataServices/lib-cove that referenced this issue Apr 11, 2019
odscjames added a commit to OpenDataServices/lib-cove that referenced this issue Apr 11, 2019
odscjames added a commit that referenced this issue Apr 11, 2019
#9

Also fix to ocds_json_output() func - pass config to Schema class
odscjames added a commit to OpenDataServices/lib-cove that referenced this issue Apr 11, 2019
odscjames added a commit that referenced this issue Apr 11, 2019
#9

Also fix to ocds_json_output() func - pass config to Schema class
@odscjames
Copy link
Contributor Author

OpenDataServices/lib-cove#16 needs to be done and released first.

Then #14

odscjames added a commit that referenced this issue Apr 12, 2019
#9

Also fix to ocds_json_output() func - pass config to Schema class
odscjames added a commit that referenced this issue Apr 12, 2019
#9

Also fix to ocds_json_output() func - pass config to Schema class
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants