Rest Services for API requests #24

Closed
madisonb opened this issue Nov 14, 2015 · 8 comments


@madisonb
Collaborator

We need a set of REST services to pass crawl requests into the Kafka API to be processed by the Kafka Monitor. Ideally this uses something small like Flask and runs on a server that has access to Kafka only. The REST services should not bypass the Kafka/Redis Monitor architecture, but instead provide a front-end REST endpoint for submitting requests to, and reading results from, Kafka.

This API should allow the passthrough of any JSON that needs to flow into a Kafka Monitor plugin, and in cases where there is an expected response, it should return the JSON response from Kafka. At that point it behaves just like a REST service.

Note that the REST endpoint should not try to serve streaming data from the firehose, but rather handle very specific requests.
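
A minimal sketch of the kind of passthrough endpoint being described, assuming the kafka-python client; the route, topic name, and broker address are illustrative assumptions, not a settled design:

    # Minimal sketch of the passthrough idea above; the route, topic name,
    # and broker address are assumptions, not a settled design.
    import json

    from flask import Flask, jsonify, request
    from kafka import KafkaProducer

    app = Flask(__name__)
    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',
        value_serializer=lambda v: json.dumps(v).encode('utf-8'))

    @app.route('/feed', methods=['POST'])
    def feed():
        # pass the raw JSON straight through to the Kafka Monitor's inbound topic
        producer.send('demo.incoming', request.get_json(force=True))
        producer.flush()
        return jsonify({'status': 'submitted'}), 200

    if __name__ == '__main__':
        app.run()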

@madisonb madisonb added this to the Scrapy Cluster 1.2 milestone Nov 14, 2015
@jasonrhaas
Contributor

👍 I think having some kind of built-in RESTful API would help uptake a lot. Flask is a good one to use since it's really straightforward and has a lot of community support.

@madisonb
Collaborator Author

In PR #52, @yrik brings up a good point of discussion: how we should actually implement the REST API. My initial comments above dictate that the REST API should pass the request through to Kafka, and let the Kafka Monitor handle validation and access into the cluster.

The problem with that approach is that it does not give any kind of feedback to the user who made the REST call. This can be mitigated in a number of ways:

  1. Use the Kafka Monitor inside of the REST service, which requires extra imports, and also requires us to either load a static set of plugins or to hard-code them.
  2. Alter the Kafka Monitor to send an outbound action back to Kafka indicating whether the request was accepted or rejected, and if accepted, have the REST call wait for more data for a certain period of time.

The problem with both of these approaches is that we will either need to create dynamic REST services for each plugin, or use a single unified /feed endpoint (just like the Kafka Monitor) and nothing else.

Because we don't know ahead of time which plugins will be loaded, and the REST service may not be deployed on the same machine as the Kafka Monitor, I recommend going with option 2, where the REST endpoint is very lightweight and we update the Kafka Monitor to give better feedback across Kafka back to that endpoint, in order to let the caller know what happened.
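
For discussion, here is one hypothetical shape for the acknowledgement object the Kafka Monitor could write back to Kafka; the field names are assumptions only, not a settled schema:

    # Hypothetical acknowledgement object the Kafka Monitor could write back
    # to Kafka after validating a request; field names are assumptions only.
    ack = {
        'uuid': 'abc123',          # echoed from the request for correlation
        'appid': 'testapp',        # echoed for routing to demo.outbound_<appid>
        'accepted': True,          # False if no plugin validated the request
        'expect_response': True,   # whether more data will follow on an outbound topic
    }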

@yrik

yrik commented Mar 15, 2016

Actually, we would not need a fixed set of plugins with the first approach. We would just need to duplicate the validation check at the API feed level: iterate over the plugins and check which one matches the request data.

The second option is also good, as it does not require any code duplication. I would be very thankful if you could give a few hints toward a realization in that case.

A question: how would you suggest implementing an API that gets info for a crawler? Send a request to Kafka and read responses in an indefinite loop until one matches? I have hit cases where there is no response with stat data, so that could be a big issue with such an approach.

Another question: I would like to make an API that returns a list of all running crawlers. How could that be implemented?

@madisonb
Collaborator Author

With the first approach we are creating duplicate code to load the plugins from another component, and then iterating over them through a library that may not even live on the same box. Scrapy Cluster is distributed and we don't want to lock the REST services to the Kafka Monitor box.

The second option I think is cleaner, provided we use Kafka to spit messages back out to the default demo.outbound_firehose and demo.outbound_<appid> topics. You just need to make a Kafka producer, and we need to standardize the object that is passed back.

In the REST service, you then have a daemon thread that reads from Kafka's outbound firehose and checks the uuid of every incoming message. There is always a response from stat API requests and action API requests, and it is a commonly seen mistake to think there is not. Most of the time this is due to not running the kafkadump script the whole time, or to a misconfigured Kafka installation.
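
A rough sketch of that daemon-thread pattern, assuming the kafka-python client; the shared dict and the broker address are assumptions for illustration:

    # Rough sketch of the daemon thread described above, using kafka-python;
    # the shared dict and the broker address are assumptions.
    import json
    import threading

    from kafka import KafkaConsumer

    responses = {}  # uuid -> message, shared with the REST request handlers

    def _listen():
        consumer = KafkaConsumer(
            'demo.outbound_firehose',
            bootstrap_servers='localhost:9092',
            value_deserializer=lambda m: json.loads(m.decode('utf-8')))
        for message in consumer:
            # stash every message by uuid so a waiting REST call can pick it up
            uuid = message.value.get('uuid')
            if uuid:
                responses[uuid] = message.value

    threading.Thread(target=_listen, daemon=True).start()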

The API request to list all of the current crawlers is documented here; take a look at the crawler, spider, or machine stat flag. This should get you all the info you need about which crawlers are running on which machines.

@yrik

yrik commented Mar 16, 2016

There is always a response from stat API requests and action API requests, and it is a commonly seen mistake to think there is not.

I just hit a delay of several hours... on an overloaded machine with an info request.

@yrik

yrik commented Mar 16, 2016

In the REST service, you then have a daemon thread that is reading from Kafka's outbound firehose and checking the uuid against all incoming messages.

OK, I will try to do it in the following way:


import json
import subprocess
import time

def api_feed(request_json):
    # feed the request into the cluster via the Kafka Monitor CLI;
    # the CLI expects a JSON string, so serialize the dict first
    subprocess.call(["python", "kafka_monitor.py", "feed", json.dumps(request_json)])
    # poll until a message with a matching uuid shows up on the outbound topic
    result = None
    while not result:
        result = find_msg_by_uuid(request_json['uuid'])
        time.sleep(0.1)  # avoid a tight busy-wait

    return result

@madisonb
Collaborator Author

@yrik while I am not a fan of using the subprocess module to call the shell (this assumes your Kafka Monitor and REST service are on the same machine), that is certainly a step in the right direction. We need to analyze the request_json somehow and determine whether we need to wait for a response or not (unless we convert over to generating responses for everything in the Kafka Monitor). The flow of data for the latter is then as follows (see the sketch after the list):

  • REST API gets a request
  • REST API writes the JSON to Kafka
  • Kafka Monitor receives the request
  • Kafka Monitor writes the request to Redis (standard)
  • Kafka Monitor writes an object to Kafka indicating it received the message
    • the object dictates whether the REST service should expect a response on a different Kafka topic
  • REST service listens to Kafka, waiting for the standard response object
  • REST service processes the response object
    • if the object says "no more data", the REST service returns status code 200
    • if the object says "more data, wait on outbound topic", the REST service now waits on the new Kafka topic (status code 102)
      • if a certain amount of time passes with no data, the REST service returns status code 204
      • if the REST service gets data, it returns the data with status code 200
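
A hedged sketch of the REST-service side of that flow: wait_for_message is a hypothetical helper that polls the shared responses dict from the daemon-thread sketch above, the timeout values are arbitrary, and the intermediate 102 is omitted since a synchronous handler cannot emit it mid-request.

    # Sketch of the REST-service side of the workflow above; wait_for_message
    # is a hypothetical helper and the timeout values are arbitrary.
    ACK_TIMEOUT = 5     # seconds to wait for the Kafka Monitor acknowledgement
    DATA_TIMEOUT = 10   # seconds to wait for data on the outbound topic

    def handle_request(uuid):
        ack = wait_for_message(uuid, timeout=ACK_TIMEOUT)
        if ack is None:
            return {}, 504          # no acknowledgement from the Kafka Monitor
        if not ack.get('expect_response'):
            return {}, 200          # object says "no more data"
        data = wait_for_message(uuid, timeout=DATA_TIMEOUT)
        if data is None:
            return {}, 204          # waited, nothing arrived
        return data, 200            # data came back on the outbound topic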

How is that workflow? ^

@yrik

yrik commented Mar 17, 2016

Yes, sounds like a good plan.

madisonb pushed a commit that referenced this issue Oct 16, 2016
This branch begins the work on the REST endpoint outlined in #24. It adds a new Flask REST endpoint that will be used to work with the scraping cluster and interact with the Kafka API.

The work will reside in `/rest` and will eventually have offline tests, online tests, and a new section of documentation. Notable new files here:

* rest_service.py - the rest endpoint code
* settings.py - rest service settings
* requirements.txt - the bare minimum requirements to run the rest service