Stop Web API processing large files; instead pass through Redis and process in worker #171
(1) A queueing system should be part of kingfisher-process' requirements. This tool is primarily for internal use. For external use, installing Redis is simple. The scenario where a user can run arbitrary Python code but can't install Redis is too narrow to bother supporting. Making it a requirement instead of an optional dependency will allow us to have one code path instead of two.
(2) Instead of taking a piecemeal approach of fixing one queue-related bottleneck at a time, we should describe the desired architecture (started in #152 (comment)) and then implement it system-wide. That way, we'll have a consistent, systematic solution.
(3) We should definitely have different queues for different types of work. Ideally, as part of this, we'd rename the check queue (currently
(4) Based on the last call with Rob, Duncan and David, I think it's fine to use Redis as the queueing system. However, we should implement a thin 'adapter' layer, so that it's easier to switch it out later. In other words, all parts of kingfisher-process that need to interact with the queue would use methods from a queue module that are thin wrappers around
I consider (2) a blocker for this issue.
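A minimal sketch of what the adapter in (4) might look like, assuming the redis-py client. The module name (`queue_adapter`), queue naming, and JSON serialization are all illustrative rather than an agreed design:

```python
# queue_adapter.py — hypothetical thin wrapper around Redis, so the rest of
# kingfisher-process never calls the Redis client directly and the backend
# can be swapped later. Assumes the redis-py package.
import json

import redis

_client = redis.Redis(host='localhost', port=6379, db=0)


def publish(queue_name, message):
    """Serialize a message dict and append it to the named queue."""
    _client.rpush(queue_name, json.dumps(message))


def consume(queue_names, timeout=0):
    """Block until a message is available on any of the named queues.

    BLPOP checks the keys in the order given, so earlier names act as
    higher-priority queues. Returns (queue_name, message_dict), or None
    if the timeout expires.
    """
    result = _client.blpop(queue_names, timeout=timeout)
    if result is None:
        return None
    queue_name, payload = result
    return queue_name.decode(), json.loads(payload)
```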
Since we've discussed #152 a little more, I think we can proceed with the first part of my comment there through this issue, viz.:
In other words, I think this means:
This change should support multiple workers interacting simultaneously with the Redis data. Re: a question on the call about use cases / user needs, there are some in #152 before and after the long bullet list.
This is essentially the core of the new technical proposal. Closing in favor of more atomic issues.
From open-contracting/kingfisher-collect#154, where we were having problems with large files hitting the memory limits of the web process.
We should use the Redis queue: when a large file comes in, the web process simply stores the details and contents (if appropriate - sometimes a file path is passed instead), returns a response to the caller as quickly as possible, and puts a message on the queue. A Redis worker then processes it.
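As a rough illustration of the web-side flow, the handler below records only lightweight details and enqueues a message; the queue name and message fields are assumptions, and `publish` is the adapter sketched above:

```python
# Hypothetical web-side handler: record the submission, put a message on
# the queue, and respond immediately.
from queue_adapter import publish  # thin adapter sketched earlier

LOAD_QUEUE = 'kingfisher:load'  # illustrative queue name


def handle_file_submission(collection_id, filename, file_path):
    # Store only lightweight details (here, the path on disk); do not read
    # or parse the file contents in the web process.
    message = {
        'collection_id': collection_id,
        'filename': filename,
        'path': file_path,
    }
    publish(LOAD_QUEUE, message)
    # Return to the caller as quickly as possible; a worker does the rest.
    return {'success': True}
```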
This smooths demand on the system: if files are processed as soon as they arrive at the HTTP API, there can be periods of high demand and of low demand, causing crashes and/or wasted idle resources. Workers can take longer to process large files without an HTTP client waiting for a response, and because the number of workers is fixed, you can freely let them use more memory. (With web workers, a sudden burst of requests can spawn many threads and exhaust memory.)
There should be a config variable with the size limit of files that the web process handles immediately; anything above that limit would be processed via the queue. This should default to -1, meaning never queue, since some installations might not have Redis.
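A minimal sketch of that threshold check, assuming the config variable is a byte count; the name `MAX_INLINE_FILE_SIZE` is illustrative:

```python
# Hypothetical size-threshold check. Files up to MAX_INLINE_FILE_SIZE bytes
# are processed directly in the web process; larger files are queued.
# -1 (the default) means never queue, for installations without Redis.
import os

MAX_INLINE_FILE_SIZE = -1  # assumed config variable; default: never queue


def should_queue(file_path):
    if MAX_INLINE_FILE_SIZE == -1:
        return False  # no queue configured; always process in the web process
    return os.path.getsize(file_path) > MAX_INLINE_FILE_SIZE
```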
These messages may be deemed higher priority than the check messages, and thus they may go through a different queue from the current check queue.
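One way to get that priority, sketched below under the same assumptions as the adapter above: BLPOP scans the listed keys in order, so a worker drains the load queue before the check queue, and because BLPOP is atomic, several workers can share the same queues safely. The queue names and handlers are placeholders:

```python
# Hypothetical worker loop giving load messages priority over check messages.
from queue_adapter import consume  # thin adapter sketched earlier


def process_load(message):
    ...  # placeholder: load and store the file


def process_check(message):
    ...  # placeholder: run checks


def run_worker():
    while True:
        # Keys are checked in order, so 'kingfisher:load' wins over
        # 'kingfisher:check' whenever both have pending messages.
        item = consume(['kingfisher:load', 'kingfisher:check'], timeout=5)
        if item is None:
            continue  # timed out; block again
        queue_name, message = item
        if queue_name == 'kingfisher:load':
            process_load(message)
        else:
            process_check(message)
```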