Use time_limit contextmanager #152

Closed
wants to merge 3 commits

Conversation

@jpmckinney (Member) commented May 8, 2019

… instead of requiring operations to check run_until_timestamp.

I don't see the advantage of having to check run_until_timestamp in a dozen locations (and counting), compared to just allowing an exception to be raised on a timer.

The only potential downside to this PR is if the operations are not safe to interrupt, but it looks like we have used transactions and have ordered operations in a way that is safe.

This PR also replaces some print calls with logger calls.

One possible improvement to this PR is to catch the TimeoutException in the CLI command, so that users / cronmail get a nice message without a backtrace.
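
For reference, the pattern this PR relies on looks roughly like the sketch below. This is only an illustration of one common way to implement such a context manager, not necessarily the PR's actual code: the names TimeoutException, time_limit and process_collection are hypothetical, and the SIGALRM approach only works in the main thread on Unix.

```python
import signal
from contextlib import contextmanager


class TimeoutException(Exception):
    """Raised when the time limit is reached."""


@contextmanager
def time_limit(seconds):
    """Raise TimeoutException if the block runs longer than `seconds`."""
    def handler(signum, frame):
        raise TimeoutException('Timed out after {} seconds'.format(seconds))

    previous = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the previous handler


# Usage: the operation no longer checks run_until_timestamp itself; it is
# interrupted wherever it happens to be when the limit expires.
# with time_limit(run_for_seconds):
#     process_collection()
```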

@jpmckinney requested a review from odscjames on May 8, 2019 at 16:19
@odscjames

The only potential downside to this PR is if the operations are not safe to interrupt, but it looks like we have used transactions

Yes, I tried to use Postgres transactions to always write a complete operation, just because you never know when something might crash, and you don't want it left in a state the next process can't recover from.

I'm still not sure I like just killing something without warning, though. Let me check in with some others who have more experience with multi-threaded Python than me.

One possible improvement to this PR is to catch the TimeoutException in the CLI command, so that users / cronmail get a nice message without a backtrace.

That would be good - we already get a lot of meaningless KeyboardInterrupt errors in Sentry.

@jpmckinney (Member, Author) commented May 13, 2019

Does the current implementation take advantage of multiple threads? It looks like the code only uses one thread. The transform-collections docs page also says that only one such command can be run at once.

We can de-prioritize this PR, as I think we need to reconsider the general architecture of Kingfisher Process, which doesn't scale… And one of the original goals for Kingfisher was to be able to support continuous validation (so, large amounts of data at more frequent intervals than we process now) and to meet our needs over the longer term (where we expect there to be more OCDS data available, such that serial operations won't meet our needs).

A more fit architecture would look roughly like:

  • Kingfisher Scrape sends a request to a destination that enqueues a message about the new data
  • any number of "new data" consumers dequeue a message, store the data in the database, and then, based on a routing slip that was passed to Kingfisher Scrape when the collection was started, determine whether to enqueue a message (with the data) to checker or upgrade queues
  • any number of "checker" consumers read a message, optionally force 1.1 checks based on the routing slip, check the data, and store the results
  • any number of "upgrade" consumers read a message, upgrade the data, and store the results in a new collection
  • if the routing slip had indicated that upgraded data should then be compiled, then whatever triggered a scrape (potentially involving another queue and consumer) will have created empty collections for the original data and upgraded data, and enqueued a message to compile the upgraded collection
  • any number of "compile" consumers implement the Aggregator pattern (or derived patterns like Scatter-Gather) so that they wait for the collection to be complete, before doing any work
  • to indicate that a collection is done (while avoiding race conditions), Scrape would send a special message to the "new data" queue with the total number of messages it had sent, which would be written to the database, so that the "compile" consumers can determine whether all data has been written to the database before starting work
    • this special message would also be passed through to the upgrade queue, so that the upgrade consumer can also note the total number of messages for the upgraded collection

The above is a rough sketch and doesn't describe how errors are handled, etc. (though I've mapped it out along with the above on paper), but this is a fair amount of re-work, which will have to be planned carefully…
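
As one concrete illustration of the "new data" consumer above, such a worker might look roughly like the following. This is a sketch only: it assumes RabbitMQ via pika as the broker, and the queue names, the routing_slip message field and store_data are all hypothetical.

```python
import json

import pika


def store_data(message):
    """Hypothetical: write the data to the database inside a transaction."""


def callback(channel, method, properties, body):
    message = json.loads(body)
    store_data(message)

    # The routing slip passed to Kingfisher Scrape when the collection was
    # started decides which queues the data is forwarded to.
    routing_slip = message.get('routing_slip', [])
    if 'check' in routing_slip:
        channel.basic_publish(exchange='', routing_key='kingfisher_check', body=body)
    if 'upgrade' in routing_slip:
        channel.basic_publish(exchange='', routing_key='kingfisher_upgrade', body=body)

    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
for queue in ('kingfisher_new_data', 'kingfisher_check', 'kingfisher_upgrade'):
    channel.queue_declare(queue=queue, durable=True)

channel.basic_qos(prefetch_count=1)  # one unacknowledged message per consumer
channel.basic_consume(queue='kingfisher_new_data', on_message_callback=callback)
channel.start_consuming()  # any number of these processes can run in parallel
```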

Right now there's a mix of signals, time-limited (--runforseconds) batch processes, deployment-wide defaults (standard pipeline, default checks), and a single queue (a Redis key, which lacks useful features from brokers like RabbitMQ) that is polled with BLPOP in a long-running process… which doesn't provide the flexibility (e.g. collection-specific pipelines), extensibility (quickly and easily integrate new components for consuming messages for other purposes, like quality checks beyond CoVE) or scalability that we want.
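
For contrast, the single-queue pattern described above is essentially a loop like this (again a sketch; the Redis key and handler are made-up names):

```python
import json

import redis


def handle_item(item):
    """Hypothetical: process one item of work (load, check, compile, ...)."""


connection = redis.Redis()

while True:
    # BLPOP blocks until an item is pushed onto the single queue, so one
    # long-running process works through items serially, one at a time.
    _, raw = connection.blpop('kingfisher_work')
    handle_item(json.loads(raw))
```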

Update: Noting that the checker consumer will need to load the schema/codelists against which to check from a cache, viz. open-contracting/lib-cove-ocds#9 and #122

@odscjames

Does the current implementation take advantage of multiple threads? It looks like the code only uses one thread.

No, multiple threads are not the model we went for in Process for this iteration (Scrapy provides that out of the box, so the Scrape side does have it). Instead we planned to use multiple processes, with the PostgreSQL database always ultimately tracking the state of the system (and hence the work to do), and with a message system to trigger a worker in a separate process to start work.

It's not totally serial working though; we already have some parallel working, and we have scope to introduce more - we have already discussed an option in #151 (comment), for instance.

This iteration came from a set of user requirements 6 months ago and was planned to last 18 months. While it's not perfect, I think it's built on a good base, and we have ideas for iterative improvements to add more.

A more fit architecture would look roughly like:....

There is a lot in here to unpack, starting with the new set of user requirements - it would be good to dig into those more deeply and see how the new data quality work changes the user requirements we worked against when initially building this. Given the breadth of that, it may be better suited to a sprint planning session or similar than to this thread.

But I would hope that any changes can be made iteratively on top of the existing work; we already have the Scrape/Process split, which should help with that, and I think there are a lot of possibilities to build on what is already in the Process side.

@jpmckinney (Member, Author)

Optimistically closing ahead of Django branch. Comments are linked from relevant issues.
