Pipeline Parallel Ingest (CSVFileScanFragment operator) #536
Currently, parallel ingest is only accessible through a REST call to the coordinator. The coordinator builds a query plan and sends a CSVFileScanFragment with the appropriate byte ranges to each worker.

We somehow need to introduce parallel ingest to MyriaL. The tricky part will be figuring out how to pipeline this operator. Should Raco be the one that figures out how to split the byte ranges? I think we discussed this before, and we preferred not to introduce any AWS/S3 API into Raco.
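For context, here is a minimal sketch of the kind of byte-range splitting the coordinator performs today; the class, method, and field names are illustrative, not Myria's actual API:

```java
// Illustrative sketch of coordinator-side byte-range splitting for parallel
// ingest. Class, method, and field names are assumptions, not Myria's API.
import java.util.ArrayList;
import java.util.List;

final class ByteRange {
  final long start; // inclusive byte offset
  final long end;   // inclusive byte offset

  ByteRange(long start, long end) {
    this.start = start;
    this.end = end;
  }
}

final class IngestPlanner {
  /** Splits [0, objectSize) into one contiguous range per worker. */
  static List<ByteRange> splitByteRanges(long objectSize, int numWorkers) {
    List<ByteRange> ranges = new ArrayList<>(numWorkers);
    long chunk = objectSize / numWorkers;
    for (int i = 0; i < numWorkers; i++) {
      long start = i * chunk;
      // The last worker absorbs the remainder of the object.
      long end = (i == numWorkers - 1) ? objectSize - 1 : start + chunk - 1;
      ranges.add(new ByteRange(start, end));
    }
    return ranges;
  }
}
```

In this style of parallel CSV scan the split points only need to be approximate: a fragment whose range starts mid-row skips ahead to the first newline, and a fragment whose range ends mid-row keeps reading past its end offset to finish the row, so no record is lost or read twice.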
Comments

tagging @senderista
I thought an alternative we discussed in the past was to have Raco encode a pseudo-operator, run at the coordinator, that would be initialized only with the S3 URL and would dynamically create and dispatch the local operators (i.e., adding the CSVFileScanFragment operators at runtime). cc @jingjingwang
I vaguely remember talking about this. I do remember that we agreed to give Raco only the S3 URL, but I'm not familiar with how it can create the plan from there or whether there is another existing operator that does something like this.
So would initializing the operator at runtime within MyriaX be the way to go? I ran into a situation where I need this feature, so I'm trying to tackle it at the moment. So here was my thinking:

We could make it so the fake operator encoding has information about the worker ID (Raco could probably take care of this?). That way, when we initialize the operator, we have the worker ID from the encoding, and we can then create a new CSVFileScanFragment with that worker's byte range. Not sure if this would be the best approach.
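A rough sketch of that idea, assuming a hypothetical encoding class; the names, fields, and the CSVFileScanFragment constructor signature below are all assumptions rather than Myria's actual encoding API:

```java
// Hypothetical "fake" operator encoding that Raco could emit knowing only the
// S3 URL; the worker ID is stamped in per worker. All names and the
// CSVFileScanFragment constructor signature are assumptions.
public class CsvS3ScanEncoding {
  public String s3Uri;   // the only thing Raco has to produce
  public int workerId;   // filled in per worker (e.g., by Raco at plan time)
  public int numWorkers;

  /** Builds the real operator at runtime, once the object size can be looked up. */
  public CSVFileScanFragment construct() {
    long objectSize = s3ObjectSize(s3Uri); // hypothetical helper (S3 HEAD request)
    // Same even-split arithmetic as in the sketch above, assuming worker IDs
    // run contiguously from 1 to numWorkers.
    long chunk = objectSize / numWorkers;
    long start = (workerId - 1L) * chunk;
    long end = (workerId == numWorkers) ? objectSize - 1 : start + chunk - 1;
    return new CSVFileScanFragment(s3Uri, start, end);
  }

  private static long s3ObjectSize(String uri) {
    // e.g., AmazonS3#getObjectMetadata(bucket, key).getContentLength()
    throw new UnsupportedOperationException("sketch only");
  }
}
```

The catch, as the next comment points out, is that this bakes in the assumption that worker IDs are contiguous.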
FWIW I'm doing something vaguely similar to this right now in this PR: uwescience/myria#858. One question is whether we can assume that all worker IDs are contiguous. If so, then each worker's operator instance could initialize its own byte range (taking the minimum partition size into account) after it determines the object size; but we probably can't assume this if we want to be fault-tolerant. I think we can dispense with the assumption of contiguous worker IDs as long as we ensure that all workers share the same live-workers view. We should be able to do this in […] (FYI, an […]).
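A sketch of that scheme: each worker derives its slot from a shared, consistent live-workers view, so IDs need not be contiguous. The names and the minimum-partition-size constant are assumptions:

```java
// Sketch of byte-range assignment that does not assume contiguous worker IDs.
// "liveWorkerIds" stands in for the shared live-workers view mentioned above;
// the 5 MB minimum partition size is an assumed placeholder.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RangeAssigner {
  static final long MIN_PARTITION_SIZE = 5L * 1024 * 1024;

  /** Returns {start, end} for this worker, or null if it gets no partition. */
  static long[] myRange(long objectSize, List<Integer> liveWorkerIds, int myId) {
    // Sorting gives every worker the same total order over the shared view,
    // so each worker computes its slot independently yet consistently.
    List<Integer> sorted = new ArrayList<>(liveWorkerIds);
    Collections.sort(sorted);
    int slot = sorted.indexOf(myId);
    if (slot < 0) {
      throw new IllegalStateException("worker " + myId + " not in live view");
    }
    // Honor the minimum partition size: use fewer partitions when the object
    // is too small to give every live worker MIN_PARTITION_SIZE bytes.
    int partitions = (int) Math.min(sorted.size(),
        Math.max(1, objectSize / MIN_PARTITION_SIZE));
    if (slot >= partitions) {
      return null; // this worker sits out; the object is too small
    }
    long chunk = objectSize / partitions;
    long start = slot * chunk;
    long end = (slot == partitions - 1) ? objectSize - 1 : start + chunk - 1;
    return new long[] {start, end};
  }
}
```

Because every worker sorts the same view, no coordinator-side precomputation of offsets is needed, and the scheme keeps working even if the set of live worker IDs has gaps.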
This sounds good, thanks! I'll try it.
Sorry, I think I'm still missing something. So I'm adding this logic under the […]?
What I meant was to pass the live-workers view to the […].