For speed reasons, in cases where called repeatedly in some process, can we do more memory caching? #9
Comments
@kindly and I had a call and think we found this: https://github.com/OpenDataServices/lib-cove/blob/master/libcove/lib/common.py#L230 is the `def load_codelist(url):`. If we cached the request to get the codelist, that would really help - we can maybe use `cached_get_request` at https://github.com/OpenDataServices/lib-cove/blob/master/libcove/lib/tools.py#L8. BUT this was deliberately not used before, because in a long-running process we didn't want to get a cached item that might then become stale. Also, CoVE was meant to be used by people testing extensions, so we didn't want to cache. So should we have this as an option (off by default, on in Kingfisher)? Also check - is the request to get the extension JSON cached?
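A minimal sketch of that idea, assuming a requests-based helper (the real `cached_get_request` in libcove/lib/tools.py may differ in its details):

```python
import csv
import functools
import io

import requests


@functools.lru_cache(maxsize=None)
def cached_get_request(url):
    # Memoized for the lifetime of the process: repeated checks in one
    # run fetch each codelist URL from the network only once.
    response = requests.get(url)
    response.raise_for_status()
    return response


def load_codelist(url):
    # Hypothetical wrapper mirroring libcove.lib.common.load_codelist:
    # read the "Code" column of the codelist CSV into a set.
    reader = csv.DictReader(io.StringIO(cached_get_request(url).text))
    return {row["Code"] for row in reader}
```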
The call to get extension.json IS NOT CACHED: https://github.com/open-contracting/lib-cove-ocds/blob/master/libcoveocds/schema.py#L179, in the apply_extensions func.
The answer should include an option so you can choose.
Because "CoVE was meant to be used by people testing extensions", CoVE would use the first option (no caching). Kingfisher would use the second option (caching).
I agree – if, for example, we start a long-running scrape and check, it's fine to cache the schema, codelists and extensions from when the collection is first created, as we're trying to evaluate the data against the schema that was available at that time. There are edge cases (e.g. a publisher changes their schema and also re-publishes all their data while the scraper is still running), but I'd consider this rare (and in any case we'd probably need to restart the scrape, since the data itself would have become too stale).
I've written this up in the linked ticket. We have two separate problems here. The first problem is just to make sure we cache requests for the hour or so a check process takes to run - currently we are DoS-ing servers, we just realised! Doing this will speed up the check process and mean we don't DoS servers (always good). The second problem is bigger, and I've written it up fully in the linked ticket.
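One way to get "cache for the hour or so a check takes" without risking indefinitely stale responses is a small TTL cache; a sketch with illustrative names:

```python
import time

import requests

_cache = {}
TTL_SECONDS = 60 * 60  # roughly the length of one check run


def get_with_ttl(url):
    # Re-fetch only if there is no cached copy, or the copy is older
    # than the TTL, so a long-running process still picks up upstream
    # changes eventually.
    now = time.monotonic()
    entry = _cache.get(url)
    if entry is None or now - entry[0] > TTL_SECONDS:
        _cache[url] = (now, requests.get(url))
    return _cache[url][1]
```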
I was thinking of a case like Colombia, which takes an estimated 10 days to check (and however long to scrape) – not an hour. But, yes, sounds good.
It only takes approx. 5 hours to scrape the full dataset. The problem with the checking is that they have 20 declared extensions. And the estimated 10 days is from running the checking over different collections (each one with different parts of the data) in parallel. So maybe doing this could be a recommendation for checking large datasets.
Actually, running the checks like that has other implications, and we'd want to sort that out. I'd like to tackle this one step at a time. Caching requests is definitely the right thing to do - for DoS reasons and speed reasons. If we do that and find that checks still take far too long, we can look at this again. But I suspect that this will really help with speed.
We don't have time to do this properly now ... but WHO NEEDS TO DO ANYTHING PROPERLY ANYWAY! :-P In the branch "master-cache-all-requests" there are 2 commits from v0.1.0 (the version Kingfisher was using): the first adds a speed test and the second makes all requests cached. I ran the tests on the live box against a page of Colombia data ... checking the same file 100 times went from 223 seconds to ... 22! Now, the test doesn't do other things (like the database access), so I'm not saying we'll see that percentage speed gain in Kingfisher, but clearly we are going to see something pretty good. This is NOT a long-term solution - it obviously means we can't get any new work on this library into Kingfisher. This is a temporary fix for now.
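The speed test could look roughly like this (a sketch: `check_file` stands in for whatever wrapper calls the library for one file; the 223 s → 22 s figures above came from the live box, not from this code):

```python
import time


def time_checks(check_file, filename, runs=100):
    # Run the same check repeatedly, as Kingfisher does across a
    # collection; with cached requests, every run after the first
    # skips all network fetches for schema, codelists and extensions.
    start = time.monotonic()
    for _ in range(runs):
        check_file(filename)
    return time.monotonic() - start


# e.g. elapsed = time_checks(my_check_file, "colombia_page.json")
```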
#9 Also fix to ocds_json_output() func - pass config to Schema class
OpenDataServices/lib-cove#16 needs to be done and released first. Then #14
For speed reasons, in cases where this is called repeatedly in some process ....
eg Kingfisher calling it once for each piece of data
.... can we do more memory caching?
eg Cache the compiled schema with extensions - we have seen a case with complex extensions where the process was working slowly, and we wondered if caching the compiled schema would help the speed?
From @yolile
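For the compiled-schema idea, a hedged sketch of in-memory caching keyed by the schema and extension patch URLs (helper names here are hypothetical; the real extension logic in lib-cove-ocds uses a full JSON Merge Patch implementation):

```python
import functools

import requests


def _fetch_json(url):
    return requests.get(url).json()


def _deep_merge(base, patch):
    # Simplified merge for illustration; a real JSON Merge Patch also
    # handles deletion of keys via nulls.
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            _deep_merge(base[key], value)
        else:
            base[key] = value
    return base


@functools.lru_cache(maxsize=None)
def compiled_schema(schema_url, extension_patch_urls):
    # Pass a tuple so the arguments stay hashable, e.g.
    # compiled_schema(url, tuple(sorted(patch_urls))).
    # Callers must treat the result as read-only: lru_cache hands back
    # the same dict object every time.
    schema = _fetch_json(schema_url)
    for patch_url in extension_patch_urls:
        _deep_merge(schema, _fetch_json(patch_url))
    return schema
```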