For speed reasons, in cases where called repeatedly in some process, can we do more memory caching? #9
Comments
@kindly and I had a call and think we found this: https://github.com/OpenDataServices/lib-cove/blob/master/libcove/lib/common.py#L230 is the `def load_codelist(url):`. If we cached the request to get the codelist, that would really help - we can maybe use `cached_get_request` at https://github.com/OpenDataServices/lib-cove/blob/master/libcove/lib/tools.py#L8. BUT this was deliberately not used before, because in a long-running process we didn't want to get a cached item that might then become stale. Also, CoVE was meant to be used by people testing extensions, so we didn't want to cache. So should we have this as an option (off by default, on in Kingfisher)? Also check - is the request to get the extension JSON cached?
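A minimal sketch of that idea, assuming a requests-based helper (the real `cached_get_request` in libcove/lib/tools.py may differ in its details):

```python
import csv
import functools
import io

import requests


@functools.lru_cache(maxsize=None)
def cached_get_request(url):
    # Memoized for the lifetime of the process: repeated checks in one
    # run fetch each codelist URL from the network only once.
    response = requests.get(url)
    response.raise_for_status()
    return response


def load_codelist(url):
    # Hypothetical wrapper mirroring libcove.lib.common.load_codelist:
    # read the "Code" column of the codelist CSV into a set.
    reader = csv.DictReader(io.StringIO(cached_get_request(url).text))
    return {row["Code"] for row in reader}
```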
The call to get extension.json IS NOT CACHED: https://github.com/open-contracting/lib-cove-ocds/blob/master/libcoveocds/schema.py#L179, in the apply_extensions func.
The answer should include an option so you can choose.
Because "CoVE was meant to be used by people testing extensions", CoVE would use the first option (no caching). Kingfisher would use the second option (caching).
I agree – if, for example, we start a long-running scrape and check, it's fine to cache the schema, codelists and extensions from when the collection is first created, as we're trying to evaluate the data against the schema that was available at that time. There are edge cases (e.g. a publisher changes their schema and also re-publishes all their data while the scraper is still running), but I'd consider this rare (and in any case we'd probably need to restart the scrape, since the data itself would have become too stale).
I've written this up in the linked ticket. We have two separate problems here. The first problem is just to make sure we cache requests for the hour or so a check process takes to run - currently we are DoS-ing servers, we just realised! Doing this will speed up the check process and mean we don't DoS servers (always good). The second problem is bigger, and I've written it up fully in the linked ticket.
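One way to get "cache for the hour or so a check takes" without risking indefinitely stale responses is a small TTL cache; a sketch with illustrative names:

```python
import time

import requests

_cache = {}
TTL_SECONDS = 60 * 60  # roughly the length of one check run


def get_with_ttl(url):
    # Re-fetch only if there is no cached copy, or the copy is older
    # than the TTL, so a long-running process still picks up upstream
    # changes eventually.
    now = time.monotonic()
    entry = _cache.get(url)
    if entry is None or now - entry[0] > TTL_SECONDS:
        _cache[url] = (now, requests.get(url))
    return _cache[url][1]
```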
I was thinking of a case like Colombia, which takes an estimated 10 days to check (and however long to scrape) – not an hour. But, yes, sounds good.
It only takes approx. 5 hours to scrape the full dataset. The problem with the checking is that they have 20 declared extensions. And the estimated 10 days is from running the checking over different collections (each one with different parts of the data) in parallel. So maybe doing this could be a recommendation for checking large datasets.
Actually, running the checks like that has other implications, and we'd want to sort that out. I'd like to tackle this one step at a time. Caching requests is definitely the right thing to do - for DoS reasons and speed reasons. If we do that and find that checks still take far too long, we can look at this again. But I suspect that this will really help with speed.
We don't have time to do this properly now ... but WHO NEEDS TO DO ANYTHING PROPERLY ANYWAY! :-P In the branch "master-cache-all-requests" there are 2 commits from v0.1.0 (the version Kingfisher was using): the first adds a speed test and the second makes all requests cached. I ran the tests on the live box against a page of Colombia data ... checking the same file 100 times went from 223 seconds to ... 22! Now, the test doesn't do other things (like the database access), so I'm not saying we'll see that percentage speed gain in Kingfisher, but clearly we are going to see something pretty good. This is NOT a long-term solution - it obviously means we can't get any new work on this library into Kingfisher. This is a temporary fix for now.
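The speed test could look roughly like this (a sketch: `check_file` stands in for whatever wrapper calls the library for one file; the 223 s → 22 s figures above came from the live box, not from this code):

```python
import time


def time_checks(check_file, filename, runs=100):
    # Run the same check repeatedly, as Kingfisher does across a
    # collection; with cached requests, every run after the first
    # skips all network fetches for schema, codelists and extensions.
    start = time.monotonic()
    for _ in range(runs):
        check_file(filename)
    return time.monotonic() - start


# e.g. elapsed = time_checks(my_check_file, "colombia_page.json")
```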
#9 Also fix to ocds_json_output() func - pass config to Schema class
OpenDataServices/lib-cove#16 needs to be done and released first. Then #14
For speed reasons, in cases where this is called repeatedly in some process ....
eg Kingfisher calling it once for each piece of data
.... can we do more memory caching?
eg Cache the compiled schema with extensions - we have seen a case with complex extensions where the process was working slowly, and we wondered if caching the compiled schema would help the speed?
From @yolile
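For the compiled-schema idea, a hedged sketch of in-memory caching keyed by the schema and extension patch URLs (helper names here are hypothetical; the real extension logic in lib-cove-ocds uses a full JSON Merge Patch implementation):

```python
import functools

import requests


def _fetch_json(url):
    return requests.get(url).json()


def _deep_merge(base, patch):
    # Simplified merge for illustration; a real JSON Merge Patch also
    # handles deletion of keys via nulls.
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            _deep_merge(base[key], value)
        else:
            base[key] = value
    return base


@functools.lru_cache(maxsize=None)
def compiled_schema(schema_url, extension_patch_urls):
    # Pass a tuple so the arguments stay hashable, e.g.
    # compiled_schema(url, tuple(sorted(patch_urls))).
    # Callers must treat the result as read-only: lru_cache hands back
    # the same dict object every time.
    schema = _fetch_json(schema_url)
    for patch_url in extension_patch_urls:
        _deep_merge(schema, _fetch_json(patch_url))
    return schema
```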