Use time_limit contextmanager #152

Closed
wants to merge 3 commits

Conversation

@jpmckinney (Member) commented May 8, 2019

… instead of requiring operations to check run_until_timestamp.

I don't see the advantage of having to check run_until_timestamp in a dozen locations (and counting), compared to just allowing an exception to be raised on a timer.

The only potential downside to this PR is if the operations are not safe to interrupt, but it looks like we have used transactions and have ordered operations in a way that is safe.

This PR also replaces some print calls with logger calls.

One possible improvement to this PR is to catch the TimeoutException in the CLI command, so that users / cronmail get a nice message without a backtrace.
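
For reference, the pattern this PR relies on looks roughly like the sketch below. This is only an illustration of one common way to implement such a context manager, not necessarily the PR's actual code: the names TimeoutException, time_limit and process_collection are hypothetical, and the SIGALRM approach only works in the main thread on Unix.

```python
import signal
from contextlib import contextmanager


class TimeoutException(Exception):
    """Raised when the time limit is reached."""


@contextmanager
def time_limit(seconds):
    """Raise TimeoutException if the block runs longer than `seconds`."""
    def handler(signum, frame):
        raise TimeoutException('Timed out after {} seconds'.format(seconds))

    previous = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the previous handler


# Usage: the operation no longer checks run_until_timestamp itself; it is
# interrupted wherever it happens to be when the limit expires.
# with time_limit(run_for_seconds):
#     process_collection()
```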

@jpmckinney requested a review from odscjames on May 8, 2019 at 16:19
@odscjames

The only potential downside to this PR is if the operations are not safe to interrupt, but it looks like we have used transactions

Yes, I tried to use Postgres transactions to always write a complete operation, just because you never know when something might crash, and you don't want it left in a state the next process can't recover from.

I'm still not sure I like just killing something without warning, though. Let me check in with some others who have more experience with multi-threaded Python than me.

One possible improvement to this PR is to catch the TimeoutException in the CLI command, so that users / cronmail get a nice message without a backtrace.

That would be good - we already get a lot of meaningless KeyboardInterrupt errors in Sentry.

@jpmckinney (Member, Author) commented May 13, 2019

Does the current implementation take advantage of multiple threads? It looks like the code only uses one thread. The transform-collections docs page also says that only one such command can be run at once.

We can de-prioritize this PR, as I think we need to reconsider the general architecture of Kingfisher Process, which doesn't scale… And one of the original goals for Kingfisher was to be able to support continuous validation (so, large amounts of data at more frequent intervals than we process now) and to meet our needs over the longer term (where we expect there to be more OCDS data available, such that serial operations won't meet our needs).

A more fit architecture would look roughly like:

  • Kingfisher Scrape sends a request to a destination that enqueues a message about the new data
  • any number of "new data" consumers dequeue a message, store the data in the database, and then, based on a routing slip that was passed to Kingfisher Scrape when the collection was started, determine whether to enqueue a message (with the data) to checker or upgrade queues
  • any number of "checker" consumers read a message, optionally force 1.1 checks based on the routing slip, check the data, and store the results
  • any number of "upgrade" consumers read a message, upgrade the data, and store the results in a new collection
  • if the routing slip had indicated that upgraded data should then be compiled, then whatever triggered a scrape (potentially involving another queue and consumer) will have created empty collections for the original data and upgraded data, and enqueued a message to compile the upgraded collection
  • any number of "compile" consumers implement the Aggregator pattern (or derived patterns like Scatter-Gather) so that they wait for the collection to be complete, before doing any work
  • to indicate that a collection is done (while avoiding race conditions), Scrape would send a special message to the "new data" queue with the total number of messages it had sent, which would be written to the database, so that the "compile" consumers can determine whether all data has been written to the database before starting work
    • this special message would also be passed through to the upgrade queue, so that the upgrade consumer can also note the total number of messages for the upgraded collection

The above is a rough sketch and doesn't describe how errors are handled, etc. (though I've mapped it out along with the above on paper), but this is a fair amount of re-work, which will have to be planned carefully…
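
As one concrete illustration of the "new data" consumer above, such a worker might look roughly like the following. This is a sketch only: it assumes RabbitMQ via pika as the broker, and the queue names, the routing_slip message field and store_data are all hypothetical.

```python
import json

import pika


def store_data(message):
    """Hypothetical: write the data to the database inside a transaction."""


def callback(channel, method, properties, body):
    message = json.loads(body)
    store_data(message)

    # The routing slip passed to Kingfisher Scrape when the collection was
    # started decides which queues the data is forwarded to.
    routing_slip = message.get('routing_slip', [])
    if 'check' in routing_slip:
        channel.basic_publish(exchange='', routing_key='kingfisher_check', body=body)
    if 'upgrade' in routing_slip:
        channel.basic_publish(exchange='', routing_key='kingfisher_upgrade', body=body)

    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
for queue in ('kingfisher_new_data', 'kingfisher_check', 'kingfisher_upgrade'):
    channel.queue_declare(queue=queue, durable=True)

channel.basic_qos(prefetch_count=1)  # one unacknowledged message per consumer
channel.basic_consume(queue='kingfisher_new_data', on_message_callback=callback)
channel.start_consuming()  # any number of these processes can run in parallel
```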

Right now there's a mix of signals, time-limited (--runforseconds) batch processes, deployment-wide defaults (standard pipeline, default checks), and a single queue (a Redis key, which lacks useful features from brokers like RabbitMQ) that is polled with BLPOP in a long-running process… which doesn't provide the flexibility (e.g. collection-specific pipelines), extensibility (quickly and easily integrate new components for consuming messages for other purposes, like quality checks beyond CoVE) or scalability that we want.
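
For contrast, the single-queue pattern described above is essentially a loop like this (again a sketch; the Redis key and handler are made-up names):

```python
import json

import redis


def handle_item(item):
    """Hypothetical: process one item of work (load, check, compile, ...)."""


connection = redis.Redis()

while True:
    # BLPOP blocks until an item is pushed onto the single queue, so one
    # long-running process works through items serially, one at a time.
    _, raw = connection.blpop('kingfisher_work')
    handle_item(json.loads(raw))
```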

Update: Noting that the checker consumer will need to load the schema/codelists against which to check from a cache, viz. open-contracting/lib-cove-ocds#9 and #122

@odscjames

Does the current implementation take advantage of multiple threads? It looks like the code only uses one thread.

No, multiple threads are not the model we went for in Process for this iteration (Scrapy provides that out of the box, so the Scrape side does have it). Instead we planned to use multiple processes, with the PostgreSQL database always ultimately tracking the state of the system (and hence the work to do), and with a message system to trigger a worker in a separate process to start work.

It's not totally serial working though; we already have some parallel working, and we have scope to introduce more - we have already discussed an option in #151 (comment), for instance.

This iteration came from a set of user requirements 6 months ago and was planned to last 18 months. While it's not perfect, I think it's built on a good base, and we have ideas for iterative improvements to add more.

A more fit architecture would look roughly like:....

There is a lot in here to unpack, starting with the new set of user requirements - it would be good to dig into those more deeply and see how the new data quality work changes the user requirements we worked against when initially building this. Given the breadth of that, it may be better suited to a sprint planning session or similar than to this thread.

But I would hope that any changes can be made iteratively on top of the existing work; we already have the Scrape/Process split, which should help with that, and I think there are a lot of possibilities to build on what is already in the Process side.

@jpmckinney (Member, Author)

Optimistically closing ahead of Django branch. Comments are linked from relevant issues.
