
Stop Web API processing large files; instead pass through Redis and process in worker #171

Closed

odscjames opened this issue Jul 3, 2019 · 4 comments

Labels: feature Relating to loading data from the web API or CLI command

@odscjames

From open-contracting/kingfisher-collect#154, where we were having problems with large files hitting memory limits of the web process.

We should use the Redis queue: when a large file comes in, the web process simply stores the details and contents (if appropriate; sometimes a file path is passed), returns a response to the caller as quickly as possible, and puts a message on the queue. A Redis worker then processes it.

This smooths demand on the system: if files are processed only when they arrive at the web HTTP API, there will be periods of high demand and of low demand, causing crashes and/or wasted idle resources. Workers can take longer to process larger files without an HTTP agent waiting for a response, and because the number of workers is limited, you can freely let them use more memory. (With web workers, many requests may arrive at once, and the many threads that start up can exhaust your memory.)

There should be a config variable with the maximum size of file to process immediately; anything above that would be processed via the queue. This should default to -1, meaning never use the queue, as some installations might not have a Redis queue.
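
A minimal sketch of this flow, assuming redis-py and an environment variable for the limit. The names `LARGE_FILE_SIZE_LIMIT`, `submit_file`, `store_file` and the `kingfisher_new_data` queue are illustrative, not existing kingfisher-process code:

```python
import json
import os

import redis

# -1 means "never enqueue": installations without a Redis queue process
# everything in the web request, as before.
LARGE_FILE_SIZE_LIMIT = int(os.environ.get("LARGE_FILE_SIZE_LIMIT", "-1"))

redis_conn = redis.Redis()


def store_file(collection_id, filename, local_path):
    """Stand-in for the existing synchronous processing in the web process."""


def submit_file(collection_id, filename, local_path):
    if LARGE_FILE_SIZE_LIMIT == -1 or os.path.getsize(local_path) <= LARGE_FILE_SIZE_LIMIT:
        store_file(collection_id, filename, local_path)  # process within the web request
    else:
        # Store only the details, return to the caller quickly, and let a
        # Redis worker do the heavy lifting.
        redis_conn.lpush(
            "kingfisher_new_data",
            json.dumps({"collection_id": collection_id, "filename": filename, "path": local_path}),
        )
```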

These messages may be deemed higher priority than the check messages, and thus may go through a different queue from the current check messages.

@odscjames

@jpmckinney

(1) A queueing system should be part of kingfisher-process' requirements. This tool is primarily for internal use. For external use, installing Redis is simple. The scenario where a user can run arbitrary Python code but can't install Redis is too narrow to bother supporting. Making it a requirement instead of an optional dependency will allow us to have one code path instead of two.

(2) Instead of taking a piecemeal approach of fixing one queue-related bottleneck at a time, we should describe the desired architecture (started in #152 (comment)) and then implement it system-wide. That way, we'll have a consistent, systematic solution.

(3) We should definitely have different queues for different types of work. Ideally, as part of this, we'd rename the check queue (currently kingfisher_work) to something less generic. We might even want to have different queues for different jobs (e.g. when checking one collection, that collection gets its own queue); this makes it easier to, for example, prioritize the workers for a particular collection.

(4) Based on the last call with Rob, Duncan and David, I think it's fine to use Redis as the queueing system. However, we should implement a thin 'adapter' layer, so that it's easier to switch it out later. In other words, all parts of kingfisher-process that need to interact with the queue would use methods from a queue module that are thin wrappers around redis library methods.

I consider (2) a blocker for this issue.
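
To illustrate (4), a rough sketch of what such a thin adapter module might look like (the module name `broker.py` and the functions `publish` and `consume` are assumptions, not an existing API): callers would use these wrappers instead of the redis library directly, so the backend could be swapped later.

```python
# broker.py (hypothetical module name)
import redis

_connection = None


def _conn():
    global _connection
    if _connection is None:
        _connection = redis.Redis()
    return _connection


def publish(queue_name, message):
    """Push a message onto the named queue."""
    _conn().lpush(queue_name, message)


def consume(queue_name, timeout=0):
    """Block until a message is available (or the timeout elapses) and return it."""
    item = _conn().brpop(queue_name, timeout=timeout)
    return item[1] if item else None
```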

@jpmckinney commented Sep 25, 2019

Since we've discussed #152 a little more, I think we can proceed with the first part of my comment there through this issue, viz.:

Kingfisher Scrape sends a request to a destination that enqueues a message about the new data, and then any number of "new data" consumers dequeue a message and store the data in the database.

In other words, I think this means:

  • SubmitFileView just enqueues a message in a new queue that is specific to new data (i.e. it no longer does what comes after the current TODO in that method)
    • If we want to plan ahead, we can use a Redis sorted set instead of a list, so that (in future) an analyst can start a Scrapy job, setting the priority as an additional parameter, and have that passed through to Process when Scrapy makes its requests (see the sketch after this list).
  • A new worker (a CLI command in the current architecture) similar to ProcessRedisQueueCLICommand is implemented, to do the work that SubmitFileView used to.
    • We don't want to simply re-use the generic process-redis-queue command, because that gives less flexibility to control, e.g., how many workers are specifically working on new data. That said, we can perhaps have an abstract class for Redis workers.
    • I don't think we need to continue the pattern of time-limited runs in this new command, as this worker should just run forever. We can create follow-up issues to monitor that it is running, and an issue in deploy to stop and restart it whenever Process is re-deployed (when the command receives an interrupt, it should catch the exception and stop safely).
    • While doing this, instead of using the redis package directly, we can add methods in our redis.py (which should perhaps be renamed to broker.py) for interacting with Redis. This would make it easier to substitute a different backend if we chose to at a later date.
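
A rough sketch of such a long-running "new data" worker, assuming redis-py and the sorted-set option mentioned above (the `kingfisher_new_data` queue name and `store_new_file` are hypothetical, not the actual ProcessRedisQueueCLICommand code):

```python
import json

import redis

redis_conn = redis.Redis()


def store_new_file(message):
    """Stand-in for the work SubmitFileView used to do after its TODO."""


def run_new_data_worker():
    try:
        while True:
            # With a sorted set, Scrapy requests could ZADD with a priority
            # score, and the worker pops the lowest score first (Redis >= 5.0).
            item = redis_conn.bzpopmin("kingfisher_new_data", timeout=0)
            if item is None:
                continue
            _queue, payload, _score = item
            store_new_file(json.loads(payload))
    except KeyboardInterrupt:
        # Stop safely when interrupted, e.g. when Process is re-deployed.
        pass
```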

This change should support multiple workers interacting simultaneously with the Redis data.

Re: a question on the call about use cases / user needs, there are some in #152 before and after the long bullet list.

@jpmckinney jpmckinney added feature Relating to loading data from the web API or CLI command and removed framework: queue labels Aug 13, 2020
@jpmckinney

This is essentially the core of the new technical proposal. Closing in favor of more atomic issues.
