Rest Services for API requests #24

Closed
madisonb opened this issue Nov 14, 2015 · 8 comments


@madisonb
Collaborator

We need a set of REST services to pass crawl requests into the Kafka API to be processed by the Kafka Monitor. Ideally this uses something small like Flask and runs on a server that has access to Kafka only. The REST services should not bypass the Kafka/Redis Monitor architecture, but instead provide a front-end REST endpoint for submitting requests to, and reading results from, Kafka.

This API should allow the passthrough of any JSON that needs to flow into a Kafka Monitor plugin, and in cases where there is an expected response, it should return the JSON response from Kafka. At that point it behaves just like a REST service.

Note that the REST endpoint should not try to serve streaming data from the firehose, but rather handle very specific requests.
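
A minimal sketch of the kind of passthrough endpoint being described, assuming the kafka-python client; the route, topic name, and broker address are illustrative assumptions, not a settled design:

    # Minimal sketch of the passthrough idea above; the route, topic name,
    # and broker address are assumptions, not a settled design.
    import json

    from flask import Flask, jsonify, request
    from kafka import KafkaProducer

    app = Flask(__name__)
    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',
        value_serializer=lambda v: json.dumps(v).encode('utf-8'))

    @app.route('/feed', methods=['POST'])
    def feed():
        # pass the raw JSON straight through to the Kafka Monitor's inbound topic
        producer.send('demo.incoming', request.get_json(force=True))
        producer.flush()
        return jsonify({'status': 'submitted'}), 200

    if __name__ == '__main__':
        app.run()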

@madisonb madisonb added this to the Scrapy Cluster 1.2 milestone Nov 14, 2015
@jasonrhaas
Contributor

👍 I think having some kind of built-in RESTful API would help uptake a lot. Flask is a good one to use since it's really straightforward and has a lot of community support.

@madisonb
Collaborator Author

In PR #52, @yrik brings up a good point of discussion: how we should actually implement the REST API. My initial comments above dictate that the REST API should pass the request through to Kafka, and let the Kafka Monitor handle validation and access into the cluster.

The problem with that approach is that it does not give any kind of feedback to the user who made the REST call. This can be mitigated in a number of ways:

  1. Use the Kafka Monitor inside of the REST service, which requires extra imports, and also requires us to either load a static set of plugins or to hard-code them.
  2. Alter the Kafka Monitor to send an outbound action back to Kafka indicating whether the request was accepted or rejected, and if accepted, have the REST call wait for more data for a certain period of time.

The problem with both of these approaches is that we will either need to create dynamic REST services for each plugin, or use a single unified /feed endpoint (just like the Kafka Monitor) and nothing else.

Because we don't know ahead of time which plugins will be loaded, and the REST service may not be deployed on the same machine as the Kafka Monitor, I recommend going with option 2, where the REST endpoint is very lightweight and we update the Kafka Monitor to give better feedback across Kafka back to that endpoint, in order to let the caller know what happened.
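
For discussion, here is one hypothetical shape for the acknowledgement object the Kafka Monitor could write back to Kafka; the field names are assumptions only, not a settled schema:

    # Hypothetical acknowledgement object the Kafka Monitor could write back
    # to Kafka after validating a request; field names are assumptions only.
    ack = {
        'uuid': 'abc123',          # echoed from the request for correlation
        'appid': 'testapp',        # echoed for routing to demo.outbound_<appid>
        'accepted': True,          # False if no plugin validated the request
        'expect_response': True,   # whether more data will follow on an outbound topic
    }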

@yrik

yrik commented Mar 15, 2016

Actually, we would not need a fixed set of plugins with the first approach. We would just need to duplicate the validation check at the API feed level: iterate over the plugins and check which one matches the request data.

The second option is also good, as it does not require any code duplication. I would be very thankful if you could give a few hints toward a realization in that case.

A question: how would you suggest implementing an API that gets info for a crawler? Send a request to Kafka and read responses in an indefinite loop until one matches? I have hit cases where there is no response with stat data, so that could be a big issue with such an approach.

Another question: I would like to make an API that returns a list of all running crawlers. How could that be implemented?

@madisonb
Collaborator Author

With the first approach we are creating duplicate code to load the plugins from another component, and then iterating over them through a library that may not even live on the same box. Scrapy Cluster is distributed and we don't want to lock the REST services to the Kafka Monitor box.

The second option I think is cleaner, provided we use Kafka to spit messages back out to the default demo.outbound_firehose and demo.outbound_<appid> topics. You just need to make a Kafka producer, and we need to standardize the object that is passed back.

In the REST service, you then have a daemon thread that reads from Kafka's outbound firehose and checks the uuid of every incoming message. There is always a response from stat API requests and action API requests, and it is a commonly seen mistake to think there is not. Most of the time this is due to not running the kafkadump script the whole time, or to a misconfigured Kafka installation.
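
A rough sketch of that daemon-thread pattern, assuming the kafka-python client; the shared dict and the broker address are assumptions for illustration:

    # Rough sketch of the daemon thread described above, using kafka-python;
    # the shared dict and the broker address are assumptions.
    import json
    import threading

    from kafka import KafkaConsumer

    responses = {}  # uuid -> message, shared with the REST request handlers

    def _listen():
        consumer = KafkaConsumer(
            'demo.outbound_firehose',
            bootstrap_servers='localhost:9092',
            value_deserializer=lambda m: json.loads(m.decode('utf-8')))
        for message in consumer:
            # stash every message by uuid so a waiting REST call can pick it up
            uuid = message.value.get('uuid')
            if uuid:
                responses[uuid] = message.value

    threading.Thread(target=_listen, daemon=True).start()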

The API request to list all of the current crawlers is documented here; take a look at the crawler, spider, or machine stat flag. This should get you all the info you need about which crawlers are running on which machines.

@yrik

yrik commented Mar 16, 2016

There is always a response from stat API requests and action API requests, and it is a commonly seen mistake to think there is not.

I just hit a delay of several hours... on an overloaded machine with an info request.

@yrik

yrik commented Mar 16, 2016

In the REST service, you then have a daemon thread that is reading from Kafka's outbound firehose and checking the uuid against all incoming messages.

OK, I will try to do it in the following way:


import json
import subprocess
import time

def api_feed(request_json):
    # feed the request into the cluster via the Kafka Monitor CLI;
    # the CLI expects a JSON string, so serialize the dict first
    subprocess.call(["python", "kafka_monitor.py", "feed", json.dumps(request_json)])
    # poll until a message with a matching uuid shows up on the outbound topic
    result = None
    while not result:
        result = find_msg_by_uuid(request_json['uuid'])
        time.sleep(0.1)  # avoid a tight busy-wait

    return result

@madisonb
Collaborator Author

@yrik while I am not a fan of using the subprocess module to call the shell (this assumes your Kafka Monitor and REST service are on the same machine), that is certainly a step in the right direction. We need to analyze the request_json somehow and determine whether we need to wait for a response or not (unless we convert over to generating responses for everything in the Kafka Monitor). The flow of data for the latter is then as follows (see the sketch after the list):

  • REST API gets a request
  • REST API writes the JSON to Kafka
  • Kafka Monitor receives the request
  • Kafka Monitor writes the request to Redis (standard)
  • Kafka Monitor writes an object to Kafka indicating it received the message
    • the object dictates whether the REST service should expect a response on a different Kafka topic
  • REST service listens to Kafka, waiting for the standard response object
  • REST service processes the response object
    • if the object says "no more data", the REST service returns status code 200
    • if the object says "more data, wait on outbound topic", the REST service now waits on the new Kafka topic (status code 102)
      • if a certain amount of time passes with no data, the REST service returns status code 204
      • if the REST service gets data, it returns the data with status code 200
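
A hedged sketch of the REST-service side of that flow: wait_for_message is a hypothetical helper that polls the shared responses dict from the daemon-thread sketch above, the timeout values are arbitrary, and the intermediate 102 is omitted since a synchronous handler cannot emit it mid-request.

    # Sketch of the REST-service side of the workflow above; wait_for_message
    # is a hypothetical helper and the timeout values are arbitrary.
    ACK_TIMEOUT = 5     # seconds to wait for the Kafka Monitor acknowledgement
    DATA_TIMEOUT = 10   # seconds to wait for data on the outbound topic

    def handle_request(uuid):
        ack = wait_for_message(uuid, timeout=ACK_TIMEOUT)
        if ack is None:
            return {}, 504          # no acknowledgement from the Kafka Monitor
        if not ack.get('expect_response'):
            return {}, 200          # object says "no more data"
        data = wait_for_message(uuid, timeout=DATA_TIMEOUT)
        if data is None:
            return {}, 204          # waited, nothing arrived
        return data, 200            # data came back on the outbound topic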

How is that workflow? ^

@yrik

yrik commented Mar 17, 2016

Yes, sounds like a good plan.

madisonb pushed a commit that referenced this issue Oct 16, 2016
This branch begins the work on the REST endpoint outlined in #24. It adds a new Flask REST endpoint that will be used to work with the scraping cluster and interact with the Kafka API.

The work will reside in `/rest` and will eventually have offline tests, online tests, and a new section of documentation. Notable new files here:

* rest_service.py - the rest endpoint code
* settings.py - rest service settings
* requirements.txt - the bare minimum requirements to run the rest service