
Stop Web API processing large files; instead pass through Redis and process in worker #171

Closed

odscjames opened this issue Jul 3, 2019 · 4 comments

Labels: feature Relating to loading data from the web API or CLI command

@odscjames

From open-contracting/kingfisher-collect#154, where we were having problems with large files hitting memory limits of the web process.

We should use the Redis queue: when a large file comes in, the web process simply stores the details and contents (if appropriate; sometimes a file path is passed), returns a response to the caller as quickly as possible, and puts a message on the queue. A Redis worker then processes it.

This smooths demand on the system: if files are processed only when they arrive at the web HTTP API, there will be periods of high demand and of low demand, causing crashes and/or wasted idle resources. Workers can take longer to process larger files without an HTTP agent waiting for a response, and because the number of workers is limited, you can freely let them use more memory. (With web workers, many requests may arrive at once, and the many threads that start up can exhaust your memory.)

There should be a config variable with the maximum size of file to process immediately; anything above that would be processed via the queue. This should default to -1, meaning never use the queue, as some installations might not have a Redis queue.
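
A minimal sketch of this flow, assuming redis-py and an environment variable for the limit. The names `LARGE_FILE_SIZE_LIMIT`, `submit_file`, `store_file` and the `kingfisher_new_data` queue are illustrative, not existing kingfisher-process code:

```python
import json
import os

import redis

# -1 means "never enqueue": installations without a Redis queue process
# everything in the web request, as before.
LARGE_FILE_SIZE_LIMIT = int(os.environ.get("LARGE_FILE_SIZE_LIMIT", "-1"))

redis_conn = redis.Redis()


def store_file(collection_id, filename, local_path):
    """Stand-in for the existing synchronous processing in the web process."""


def submit_file(collection_id, filename, local_path):
    if LARGE_FILE_SIZE_LIMIT == -1 or os.path.getsize(local_path) <= LARGE_FILE_SIZE_LIMIT:
        store_file(collection_id, filename, local_path)  # process within the web request
    else:
        # Store only the details, return to the caller quickly, and let a
        # Redis worker do the heavy lifting.
        redis_conn.lpush(
            "kingfisher_new_data",
            json.dumps({"collection_id": collection_id, "filename": filename, "path": local_path}),
        )
```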

These messages may be deemed higher priority than the check messages, and thus may go through a different queue from the current check messages.

@odscjames

@jpmckinney

(1) A queueing system should be part of kingfisher-process' requirements. This tool is primarily for internal use. For external use, installing Redis is simple. The scenario where a user can run arbitrary Python code but can't install Redis is too narrow to bother supporting. Making it a requirement instead of an optional dependency will allow us to have one code path instead of two.

(2) Instead of taking a piecemeal approach of fixing one queue-related bottleneck at a time, we should describe the desired architecture (started in #152 (comment)) and then implement it system-wide. That way, we'll have a consistent, systematic solution.

(3) We should definitely have different queues for different types of work. Ideally, as part of this, we'd rename the check queue (currently kingfisher_work) to something less generic. We might even want to have different queues for different jobs (e.g. when checking one collection, that collection gets its own queue); this makes it easier to, for example, prioritize the workers for a particular collection.

(4) Based on the last call with Rob, Duncan and David, I think it's fine to use Redis as the queueing system. However, we should implement a thin 'adapter' layer, so that it's easier to switch it out later. In other words, all parts of kingfisher-process that need to interact with the queue would use methods from a queue module that are thin wrappers around redis library methods.

I consider (2) a blocker for this issue.
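
To illustrate (4), a rough sketch of what such a thin adapter module might look like (the module name `broker.py` and the functions `publish` and `consume` are assumptions, not an existing API): callers would use these wrappers instead of the redis library directly, so the backend could be swapped later.

```python
# broker.py (hypothetical module name)
import redis

_connection = None


def _conn():
    global _connection
    if _connection is None:
        _connection = redis.Redis()
    return _connection


def publish(queue_name, message):
    """Push a message onto the named queue."""
    _conn().lpush(queue_name, message)


def consume(queue_name, timeout=0):
    """Block until a message is available (or the timeout elapses) and return it."""
    item = _conn().brpop(queue_name, timeout=timeout)
    return item[1] if item else None
```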

@jpmckinney commented Sep 25, 2019

Since we've discussed #152 a little more, I think we can proceed with the first part of my comment there through this issue, viz.:

Kingfisher Scrape sends a request to a destination that enqueues a message about the new data, and then any number of "new data" consumers dequeue a message and store the data in the database.

In other words, I think this means:

  • SubmitFileView just enqueues a message in a new queue that is specific to new data (i.e. it no longer does what comes after the current TODO in that method)
    • If we want to plan ahead, we can use a Redis sorted set instead of a list, so that (in future) an analyst can start a Scrapy job, setting the priority as an additional parameter, and have that passed through to Process when Scrapy makes its requests (see the sketch after this list).
  • A new worker (a CLI command in the current architecture) similar to ProcessRedisQueueCLICommand is implemented, to do the work that SubmitFileView used to.
    • We don't want to simply re-use the generic process-redis-queue command, because that gives less flexibility to control, e.g., how many workers are specifically working on new data. That said, we can perhaps have an abstract class for Redis workers.
    • I don't think we need to continue the pattern of time-limited runs in this new command, as this worker should just run forever. We can create follow-up issues to monitor that it is running, and an issue in deploy to stop and restart it whenever Process is re-deployed (when the command receives an interrupt, it should catch the exception and stop safely).
    • While doing this, instead of using the redis package directly, we can add methods in our redis.py (which should perhaps be renamed to broker.py) for interacting with Redis. This would make it easier to substitute a different backend if we chose to at a later date.
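
A rough sketch of such a long-running "new data" worker, assuming redis-py and the sorted-set option mentioned above (the `kingfisher_new_data` queue name and `store_new_file` are hypothetical, not the actual ProcessRedisQueueCLICommand code):

```python
import json

import redis

redis_conn = redis.Redis()


def store_new_file(message):
    """Stand-in for the work SubmitFileView used to do after its TODO."""


def run_new_data_worker():
    try:
        while True:
            # With a sorted set, Scrapy requests could ZADD with a priority
            # score, and the worker pops the lowest score first (Redis >= 5.0).
            item = redis_conn.bzpopmin("kingfisher_new_data", timeout=0)
            if item is None:
                continue
            _queue, payload, _score = item
            store_new_file(json.loads(payload))
    except KeyboardInterrupt:
        # Stop safely when interrupted, e.g. when Process is re-deployed.
        pass
```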

This change should support multiple workers interacting simultaneously with the Redis data.

Re: a question on the call about use cases / user needs, there are some in #152 before and after the long bullet list.

@jpmckinney jpmckinney added feature Relating to loading data from the web API or CLI command and removed framework: queue labels Aug 13, 2020
@jpmckinney

This is essentially the core of the new technical proposal. Closing in favor of more atomic issues.
