Use time_limit contextmanager #152
Conversation
Yes, I tried to use Postgres transactions to always write a complete operation, just because you never know when something might crash, and you don't want it left in a state that the next process can't recover from. I'm still not sure I like just killing something without warning, though. Let me check in with some others who have more experience with multi-threaded Python than me.
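The transaction-per-operation pattern described above might look roughly like this with psycopg2 (a minimal illustrative sketch only: the connection details, table name and `write_operation` function are made up here, and the project's actual code differs):

```python
import psycopg2

# Illustrative connection; the real connection settings are configured elsewhere.
connection = psycopg2.connect('dbname=kingfisher')


def write_operation(rows):
    # The connection context manager commits on success and rolls back on any
    # exception, so a crash or timeout never leaves a half-written operation.
    with connection:
        with connection.cursor() as cursor:
            for row in rows:
                cursor.execute('INSERT INTO release (data) VALUES (%s)', (row,))
```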
That would be good - we already get a lot of meaningless KeyboardInterrupt errors in Sentry.
Does the current implementation take advantage of multiple threads? It looks like the code only uses one thread. The transform-collections docs page also says that only one such command can be run at once. We can de-prioritize this PR, as I think we need to reconsider the general architecture of Kingfisher Process, which doesn't scale… One of the original goals for Kingfisher was to be able to support continuous validation (so, large amounts of data at more frequent intervals than we process now) and to meet our needs over the longer term (where we expect there to be more OCDS data available, such that serial operations won't meet our needs). A more suitable architecture would look roughly like:
The above is a rough sketch and doesn't describe how errors are handled, etc. (though I've mapped it out along with the above on paper), but this is a fair amount of re-work, which will have to be planned carefully… Right now there's a mix of signals, time-limited (

Update: Noting that the checker consumer will need to load the schema/codelists against which to check from a cache, viz. open-contracting/lib-cove-ocds#9 and #122
Earlier discussion: open-contracting-archive/kingfisher-vagrant#297 (comment)
No, multiple threads are not the model we went for in Process for this iteration (Scrapy provides that out of the box, so the Scrape side does). Instead we planned to use multiple processes, with the PostgreSQL database always ultimately tracking the state of the system (and hence the work to do) and with a message system to trigger a worker in a separate process to start work. It's not totally serial working though; we already have some parallel working and we have scope to introduce more - we have already discussed an option in #151 (comment) for instance. This iteration came from a set of user requirements 6 months ago and was planned to last 18 months. While it's not perfect, I think it's on a good base and we have an idea of iterative improvements to add more.
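As a generic illustration of that pattern (PostgreSQL as the source of truth for pending work, with separate worker processes claiming it), a worker loop might look roughly like the sketch below. The table, column names and `process_file` function are hypothetical, not Kingfisher Process's actual schema or code:

```python
import time

import psycopg2

connection = psycopg2.connect('dbname=kingfisher')  # assumed connection settings


def run_worker():
    while True:
        with connection:  # commits on success, rolls back on error
            with connection.cursor() as cursor:
                # Claim one pending item, skipping rows locked by other workers.
                cursor.execute(
                    "SELECT id FROM collection_file WHERE state = 'pending' "
                    "LIMIT 1 FOR UPDATE SKIP LOCKED"
                )
                row = cursor.fetchone()
                if row:
                    process_file(row[0])  # hypothetical processing function
                    cursor.execute(
                        "UPDATE collection_file SET state = 'done' WHERE id = %s",
                        (row[0],),
                    )
        if not row:
            # No pending work; wait (or block on a message) before polling again.
            time.sleep(5)
```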
There is a lot in here to unpack, starting with the new set of user requirements - it would be good to dig into those more deeply and see how the new data quality work changes the user requirements we worked against when initially building this. Given the breadth of that, it may be better suited to a sprint planning session or similar than to this thread. But I would hope that any changes can be made iteratively on top of the existing work; we already have the Scrape/Process split, which should help with that, and I think there are a lot of possibilities to build on what is already in the Process side.
Optimistically closing ahead of Django branch. Comments are linked from relevant issues.
… instead of requiring operations to check run_until_timestamp.
I don't see the advantage of having to check `run_until_timestamp` in a dozen locations (and counting), compared to just allowing an exception to be raised on a timer. The only potential downside to this PR is if the operations are not safe to interrupt, but it looks like we have used transactions and have ordered operations in a way that is safe.
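For reference, a `time_limit` context manager of this kind is commonly built on `signal.alarm`. The sketch below is illustrative only: the names `time_limit` and `TimeoutException` are assumed to match the PR, and the real implementation may differ.

```python
import signal
from contextlib import contextmanager


class TimeoutException(Exception):
    """Raised when the time limit is exceeded."""


@contextmanager
def time_limit(seconds):
    # Raise TimeoutException when the alarm fires, instead of requiring
    # operations to check run_until_timestamp themselves.
    # Note: SIGALRM is Unix-only, and the handler only runs in the main thread.
    def handler(signum, frame):
        raise TimeoutException(f'Timed out after {seconds} seconds')

    previous = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        # Cancel the pending alarm and restore the previous handler.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)
```

Any code inside `with time_limit(n): ...` is then interrupted by the exception, which is why the safety of interruption (transactions, ordering of operations) matters.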
This PR also replaces some `print` calls with logger calls. One possible improvement to this PR is to catch the TimeoutException in the CLI command, so that users / cronmail get a nice message without a backtrace.
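A sketch of that improvement, reusing the `time_limit`/`TimeoutException` sketch above (the command structure, argument name and `do_work` function are hypothetical, not the actual Kingfisher Process CLI):

```python
import sys


def run_command(args):
    try:
        with time_limit(args.run_for_seconds):
            do_work(args)  # hypothetical entry point for the command's work
    except TimeoutException:
        # Exit cleanly so users and cron mail see a short message, not a traceback.
        print('Stopped: the time limit was reached before all work finished.',
              file=sys.stderr)
        sys.exit(0)
```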