Rest Services for API requests #24
Comments
👍 I think having some kind of built-in RESTful API would help uptake a lot. Flask is a good one to use since it's really straightforward and has a lot of community support.
In PR #52, @yrik brings up a good point of discussion, which is how we should actually implement the REST API. My initial comments above dictate that the REST API should pass the request through to Kafka, and let the Kafka Monitor handle the validation and access into the cluster. The problem with that approach is that it does not give any kind of feedback to the user who made the REST call. This can be mitigated in a couple of ways:

1. The REST service loads the Kafka Monitor plugins itself and validates the request before feeding it in, like the PR proposes
2. The REST endpoint stays very lightweight, and the Kafka Monitor is updated to send feedback back across Kafka to the endpoint so the caller knows what happened
The problem with both of these approaches is that we will either need to create dynamic rest services for each plugin, or use a unified endpoint. Because we don't know ahead of time what plugins will be loaded, and the REST service may not be deployed on the same machine as the Kafka Monitor, I recommend going with point 2, where the REST endpoint is very lightweight and we update the Kafka Monitor to give better feedback across Kafka back to that endpoint, in order to let the caller know what happened.
Actually, we would not need fixed plugins with the first approach. We would just duplicate the validation check at the API feed level: iterate over the plugins and check which one matches the request data. The second option is also good, as it does not require any code duplication. I would be very thankful if you could give several hints toward a realization in that case. Question: how would you suggest implementing an API that gets info about a crawler? Another question: I would like to make an API that returns a list of all running crawlers; how could that be implemented?
With the first approach we are creating duplicate code to load the plugins from another component, and then iterating over them through a library that may not even live on the same box. Scrapy Cluster is distributed, and we don't want to lock the REST services to the Kafka Monitor box. The second option I think is cleaner, provided we use Kafka to spit messages back out to the default outbound topic. In the REST service you then have a daemon thread that is reading from Kafka's outbound firehose and checking each response against the requests still waiting on an answer. The API request to list all of the current crawlers is documented; please look at the Kafka Monitor's stats API.
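For reference, a stats request fed through the Kafka Monitor looks roughly like the sketch below; the exact field set and accepted `stats` values should be checked against the Kafka Monitor API documentation:

```python
# A stats-style request to the Kafka Monitor; "stats": "crawler" asks
# for information about the running crawler machines. Field names are
# per the Kafka Monitor's stats plugin docs.
stats_request = {
    'uuid': 'some-unique-id',   # lets the response be matched to this request
    'appid': 'testapp',
    'stats': 'crawler',
}
```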
Just faced a delay of several hours on an overloaded machine with an info request.
Ok, I will try to do it in the following way (sketched below).
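A minimal sketch of that idea, assuming the rest service and the Kafka Monitor share a machine, and reusing the `kafka_monitor.py feed` command line from the project quickstart (the `/feed` endpoint path is an assumption):

```python
import json
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/feed', methods=['POST'])
def feed():
    # Serialize the incoming JSON and hand it to the Kafka Monitor CLI.
    # This only works when kafka_monitor.py lives on the same machine.
    request_json = json.dumps(request.get_json())
    output = subprocess.check_output(
        ['python', 'kafka_monitor.py', 'feed', request_json])
    return jsonify({'result': output.decode('utf-8')})

if __name__ == '__main__':
    app.run()
```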
@yrik while I am not a fan of using the subprocess module to call the shell (it assumes your Kafka Monitor and rest service are on the same machine), that is certainly a step in the right direction. We need to analyze the request_json somehow and determine whether we need to wait for a response or not (unless we convert over to generating responses for everything in the Kafka Monitor). The flow of data for the latter is then:

1. The rest service receives the JSON request and tags it with a unique identifier
2. The rest service feeds the request into the Kafka Monitor's inbound Kafka topic
3. The Kafka Monitor validates the request, acts on it, and writes a response back out to the outbound Kafka topic
4. A daemon thread in the rest service reads the outbound firehose and matches each response to the waiting request by its identifier
5. The rest service returns the matched response to the original caller
How is that workflow? ^
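A minimal sketch of that workflow with Flask and kafka-python; the topic names (`demo.incoming`, `demo.outbound_firehose`) follow the cluster defaults, and it assumes the Kafka Monitor echoes the request's `uuid` in its response:

```python
import json
import threading
import uuid

from flask import Flask, jsonify, request
from kafka import KafkaConsumer, KafkaProducer

app = Flask(__name__)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# In-flight requests: uuid -> Event the handler blocks on, plus a
# slot for the matched response.
pending = {}
results = {}

def firehose_reader():
    # Daemon thread: tail the outbound firehose and hand responses
    # back to waiting REST calls by matching on uuid.
    consumer = KafkaConsumer(
        'demo.outbound_firehose',
        bootstrap_servers='localhost:9092',
        value_deserializer=lambda v: json.loads(v.decode('utf-8')))
    for message in consumer:
        key = message.value.get('uuid')
        if key in pending:
            results[key] = message.value
            pending[key].set()

@app.route('/feed', methods=['POST'])
def feed():
    data = request.get_json()
    key = data.setdefault('uuid', str(uuid.uuid4()))
    event = threading.Event()
    pending[key] = event
    producer.send('demo.incoming', data)
    # Block briefly until the Kafka Monitor answers over Kafka.
    ok = event.wait(timeout=10)
    pending.pop(key, None)
    if ok:
        return jsonify(results.pop(key))
    return jsonify({'error': 'timed out waiting for a response'}), 504

if __name__ == '__main__':
    threading.Thread(target=firehose_reader, daemon=True).start()
    app.run()
```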
yes, sounds like a good plan. |
This branch begins the work on the rest endpoint outlined in #24. It adds a new Flask rest endpoint that will be used to work with the scraping cluster and interact with the Kafka API. The work will reside in `/rest` and will eventually have both offline and online tests, plus a new section of documentation. Notable new files here:

* rest_service.py - the rest endpoint code
* settings.py - rest service settings
* requirements.txt - the bare minimum requirements to run the rest service
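A settings file for the new service might hold little more than the connection details; the names below are assumptions patterned on the other components' settings conventions, not the contents of the merged file:

```python
# settings.py - rest service settings (a sketch; names assumed to
# mirror the Kafka Monitor's settings conventions)
KAFKA_HOSTS = 'localhost:9092'       # comma-separated Kafka brokers
KAFKA_TOPIC_PREFIX = 'demo'          # topic prefix shared with the cluster
FLASK_PORT = 5343                    # port the rest service listens on
WAIT_FOR_RESPONSE_TIME = 10          # seconds to block on the firehose
```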
We need a set of REST services to be able to pass crawl requests into the Kafka API, to be processed by the Kafka Monitor. Ideally this uses something small like Flask and will run on a server that has Kafka access only. The rest services should not bypass the Kafka Monitor/Redis Monitor architecture, but provide a front-end rest endpoint for submitting things to, and reading things from, Kafka.
This API should allow the passthrough of any JSON that needs to flow into a Kafka Monitor plugin, and in cases where there is an expected response, it should return the JSON response from Kafka. At that point it behaves just like a typical rest service.
Note that the rest endpoint should not try to serve streaming data from the firehose, but rather serve very specific requests.
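To illustrate, submitting a crawl request over HTTP would then carry exactly the JSON the Kafka Monitor's scraper plugin already accepts; the endpoint path and port here are assumptions:

```python
import requests

# The payload is the same JSON the Kafka Monitor already accepts for
# crawl requests; the rest service just passes it through to Kafka.
payload = {
    'url': 'http://istresearch.com',
    'appid': 'testapp',
    'crawlid': 'abc123',
}
response = requests.post('http://localhost:5343/feed', json=payload)
print(response.json())
```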