
Implement howPartitioned.isBroadcast attribute in Myria catalog #814

Closed
senderista opened this issue Feb 24, 2016 · 10 comments

@senderista
Contributor

We need to persist whether a relation has been broadcast to all workers in order for Raco to push joins into DbQueryScan when one input is a broadcast relation.

@jingjingwang
Contributor

My current thinking on this problem is as follows. We rely on two "functions" when distributing relations: partitionFunction, which maps a tuple to a partition, and cellPartition, which maps a partition to a set of workers (see GenericShuffleProducer). cellPartition is a one-to-one mapping in all cases except BroadcastProducer and HyperShuffleProducer.

I'd like to merge these two functions into one that maps a tuple to a set of "destinations" (I don't want to use the word "workers" here, since I'd like to stay in the logical world for now). Let's call this new function "distributeFunction" for now. The number of destinations is a shared property that all subclasses of distributeFunction should support. BroadcastProducer would then have a distributeFunction that maps every tuple to the full set {1, 2, ..., # of destinations}.

After this refactoring, we can simply rely on the JSON encoding/decoding of a distributeFunction to read from/write to the catalog, the query plan, etc. It also makes the interfaces in GenericShuffleProducer simpler, since the only difference between producers is their distributeFunction. How do you feel about this?
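
For concreteness, here is a minimal sketch of the shape such a distributeFunction hierarchy could take. All class and method names below (DistributeFunction, BroadcastDistributeFunction, HashDistributeFunction, getDestinations) are hypothetical placeholders, not the existing MyriaX API:

```java
import java.util.BitSet;

/**
 * Hypothetical sketch of the proposed abstraction: one function mapping a
 * tuple directly to a set of logical destinations. Names and signatures are
 * placeholders, not the existing MyriaX classes.
 */
abstract class DistributeFunction<T> {
  protected final int numDestinations;

  protected DistributeFunction(final int numDestinations) {
    this.numDestinations = numDestinations;
  }

  /** Returns the 0-based destination indices that should receive the tuple. */
  abstract BitSet getDestinations(T tuple);
}

/** Broadcast: every tuple goes to all destinations. */
final class BroadcastDistributeFunction<T> extends DistributeFunction<T> {
  BroadcastDistributeFunction(final int numDestinations) {
    super(numDestinations);
  }

  @Override
  BitSet getDestinations(final T tuple) {
    final BitSet all = new BitSet(numDestinations);
    all.set(0, numDestinations); // {0, 1, ..., numDestinations - 1}
    return all;
  }
}

/** Hash partitioning: each tuple goes to exactly one destination. */
final class HashDistributeFunction<T> extends DistributeFunction<T> {
  HashDistributeFunction(final int numDestinations) {
    super(numDestinations);
  }

  @Override
  BitSet getDestinations(final T tuple) {
    final BitSet one = new BitSet(numDestinations);
    one.set(Math.floorMod(tuple.hashCode(), numDestinations));
    return one;
  }
}
```

With this shape, broadcast is no longer a special producer but just one concrete distributeFunction, so the same JSON encoding could describe hash shuffles, broadcasts, and HyperShuffle-style multi-destination routing.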

@senderista
Contributor Author

@jingjingwang that makes sense to me. @bmyerz, what do you think?

@bmyerz
Member

bmyerz commented Mar 1, 2016

It makes sense to me functionally, but the design implications are not yet clear to me.

  • How do we write the distributeFunction? Is there a grammar for writing it that includes the number-of-destinations variable and an opaque hash function?
  • In Raco, I think it is most straightforward to have fairly high-level encodings (e.g., hash-partitioned on which attributes / broadcast) for use in the rules. I can vaguely envision a more expressive algebra, but I don't want to invent it until it's needed.

Does this make sense?

@senderista
Contributor Author

Given recent user interest in this feature, it sounds like we should revisit this and address @bmyerz's concerns about Raco representation.

@bmyerz
Member

bmyerz commented Apr 27, 2016

If the main scenarios are these two: 1) a replicated table and 2) a partitioned table, then I would vote in favor of high-level encodings in the catalog:

In Raco, I think it is most straightforward to have fairly high-level encodings (e.g., hash-partitioned on which attributes / broadcast) for use in the rules. I can vaguely envision a more expressive algebra, but I don't want to invent it until it's needed.
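
To make "high-level encodings" concrete, here is one hypothetical shape for such a catalog entry; the class, enum, and field names are illustrative only, not the actual Myria catalog schema:

```java
/**
 * Hypothetical sketch of a high-level "how partitioned" catalog entry; the
 * class, enum, and field names are illustrative, not the actual Myria
 * catalog schema.
 */
final class HowPartitioned {
  enum Kind { HASH_PARTITIONED, BROADCAST }

  private final Kind kind;
  /** Column indices used for hashing; empty unless kind == HASH_PARTITIONED. */
  private final int[] hashColumns;

  private HowPartitioned(final Kind kind, final int[] hashColumns) {
    this.kind = kind;
    this.hashColumns = hashColumns;
  }

  static HowPartitioned hashPartitioned(final int... hashColumns) {
    return new HowPartitioned(Kind.HASH_PARTITIONED, hashColumns);
  }

  static HowPartitioned broadcast() {
    return new HowPartitioned(Kind.BROADCAST, new int[0]);
  }

  /** The flag a Raco rule could test before pushing a join into DbQueryScan. */
  boolean isBroadcast() {
    return kind == Kind.BROADCAST;
  }

  int[] getHashColumns() {
    return hashColumns.clone();
  }
}
```

A Raco rule could then test the broadcast flag (or the equivalent field in the JSON the catalog serves) to decide whether to push a join into DbQueryScan, without needing a fully general partitioning algebra.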

@jingjingwang
Contributor

I agree with high-level encoding in the catalog. On the MyriaX side, I'll refactor these partition functions into a generic one. It's like how we have all kinds of ShuffleProducers in the operator JSON encoding, but when they are instantiated as Java classes, everything is a GenericShuffleProducer.
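
As an illustration of that pattern (reusing the hypothetical DistributeFunction classes sketched earlier in this thread; GenericProducer is a stand-in for GenericShuffleProducer, not the real class), the various producer encodings could all decode to one generic Java operator:

```java
import java.util.BitSet;

/**
 * Illustrative only: several operator JSON encodings could all decode to one
 * generic producer, differing only in the DistributeFunction they carry.
 * GenericProducer stands in for GenericShuffleProducer and reuses the
 * hypothetical DistributeFunction classes sketched earlier in this thread.
 */
final class GenericProducer<T> {
  private final DistributeFunction<T> distributeFunction;

  GenericProducer(final DistributeFunction<T> distributeFunction) {
    this.distributeFunction = distributeFunction;
  }

  /** Routing is delegated entirely to the distribute function. */
  BitSet destinationsFor(final T tuple) {
    return distributeFunction.getDestinations(tuple);
  }

  /** What a "BroadcastProducer" encoding would decode to. */
  static <T> GenericProducer<T> broadcast(final int numDestinations) {
    return new GenericProducer<T>(new BroadcastDistributeFunction<T>(numDestinations));
  }

  /** What a hash "ShuffleProducer" encoding would decode to. */
  static <T> GenericProducer<T> hashShuffle(final int numDestinations) {
    return new GenericProducer<T>(new HashDistributeFunction<T>(numDestinations));
  }
}
```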

@senderista
Contributor Author

I like that approach.

@jingjingwang
Contributor

I'll address #773 together with this refactoring.

@senderista
Contributor Author

@jingjingwang how does the decision to collapse the tuple->partition and partition->worker mappings into a single tuple->destination mapping comport with the elasticity design in #851, where we maintain a separate mapping of partitions to workers in the catalog? Where does distributeFunction fit into that mapping? If partitions no longer have physical affinity to nodes (because they can be reshuffled for elastic load balancing), how do the broadcast and HyperShuffle operators ensure physical colocation of data on the same node?

@jingjingwang
Contributor

I think this issue has been resolved by #863, so I'm closing it.
