Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Data Ingestor Batch Processing With Long Running and Finite Functionality #1505

Closed
wants to merge 3 commits into from

Conversation

axsaucedo
Copy link
Contributor

@axsaucedo axsaucedo commented Mar 5, 2020

Data Ingestor Batch Processing

Following from the learnings obtained whilst building the multi-streaming async implementation via #1447 this PR contains a version that decouples the batch functionality into a new component called the "Data Ingestor".

The Data Ingestor is a generic extra component within a Seldon Deployment which allows for batch capabilities outlined in the requirements for #1413 and #1391, namely:

  • [1.1] Long running batch jobs
  • [1.2] Non-Long running batch jobs
  • [2.1] Jobs that are asynchronous
  • [2.2] Jobs that terminate when finite dataset has been processed

Currently the functionality for [2.2] is still being explored but it's now a matter of what the best practice would be to implement; that is whether we want to handle the functionality from the Operator or the Seldon Deployment itself.

Furthermore, the functionality for [2.3] Scheduled jobs, is not covered in this PR and will be explored in a future PR.

Long Running Capabilities

After some research on the internals on GRPC, it was confirmed that unary GRPC calls establish the same HTTP2 tunneling channel for bidirectional message passing to the one that is set up for GRPC streaming, and hence it's built for reliable long connections. I recommend this fantastic youtube talk which dives into the internals of GRPC: https://www.youtube.com/watch?v=S7WIYLcPS1Y

The only thing is that we may have to consider to potentially restrict batch to either only work on GRPC, or somehow make it very explicit that for any "long running" capabilities, it would require for the graph to be set up with GRPC. This could be added into the CRD definition as "long-running": "true/false" which coudl be compulsory so that the check is done explicitly, but certainly something to explore further.

MOre specifically, this is the overview that explains in high level the GRPC connection:

image

Functionality Added

This PR encompasses the following functionality:

  • Seldon Core Python library extension to support BatchWorker class wrapper
  • Simple "finite-dataset" example that consumes from ELK stack once
  • Simple "streaming dataset" which continuously consumes and publishes from Kafka cluster

The Kafka example can be found at the following link: https://github.com/SeldonIO/seldon-core/blob/868fecd0078555e7d76778a63bc961e9079ef6da/examples/batch/kafka_streaming/README.md

image

The ELK example can be found at the following link: https://github.com/SeldonIO/seldon-core/blob/868fecd0078555e7d76778a63bc961e9079ef6da/examples/batch/elastic_search_simple/README.md

image

Still TODO in this PR

  • Explore how to have re-usable source/process/sink classes (a la prepackaged)
    • Currently data ingestors seem qutie generic so they could be built separately (like TFserving example) to simplify these examples
  • Explore CRD structure for data ingestor, as well as terminology to be represented in CRD
  • For kafka example potentially use defaults for group_id and topics based on predictor identifiers
  • Explore ways to make available multiprocess-safe logger to methods
  • A structured way to handle failures on either job level or data point level
  • Explore whether restricting to just GRPC to ensure long running
  • Explore best way to deal with data logging
  • Explore best way to handle this batch functionality from management tool perspective

@seldondev
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: axsaucedo

If they are not already assigned, you can assign the PR to them by writing /assign @axsaucedo in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@seldondev
Copy link
Collaborator

Thu Mar 5 07:51:57 UTC 2020
The logs for [pr-build] [1] will show after the pipeline context has finished.
https://github.com/SeldonIO/seldon-core/blob/gh-pages/jenkins-x/logs/SeldonIO/seldon-core/PR-1505/1.log

impatient try
jx get build logs SeldonIO/seldon-core/PR-1505 --build=1

@seldondev
Copy link
Collaborator

Thu Mar 5 07:52:00 UTC 2020
The logs for [lint] [2] will show after the pipeline context has finished.
https://github.com/SeldonIO/seldon-core/blob/gh-pages/jenkins-x/logs/SeldonIO/seldon-core/PR-1505/2.log

impatient try
jx get build logs SeldonIO/seldon-core/PR-1505 --build=2

@axsaucedo
Copy link
Contributor Author

axsaucedo commented Mar 5, 2020

Long Running Capabilities

After some research on the internals on GRPC, it was confirmed that unary GRPC calls establish the same HTTP2 tunneling channel for bidirectional message passing to the one that is set up for GRPC streaming, and hence it's built for reliable long connections. I recommend this fantastic youtube talk which dives into the internals of GRPC: https://www.youtube.com/watch?v=S7WIYLcPS1Y

The only thing is that we may have to consider to potentially restrict batch to either only work on GRPC, or somehow make it very explicit that for any "long running" capabilities, it would require for the graph to be set up with GRPC. This could be added into the CRD definition as "long-running": "true/false" which coudl be compulsory so that the check is done explicitly, but certainly something to explore further.

More specifically, this is the overview that explains in high level the GRPC connection:

image

@seldondev
Copy link
Collaborator

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale

@seldondev seldondev added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2020
@axsaucedo
Copy link
Contributor Author

Closing in favour of #1714

@axsaucedo axsaucedo closed this May 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants