Redesign data distribution to support elasticity #851
Comments
Just some clarification for the ingest part. If I understand this correctly, each worker will have some number of partitions (basically I was looking at the difference between […]). Also, will this happen to new tables that are created by users as they're building queries? Or are all […]?
@jortiz16 yes, the design is for each worker to store approximately […]. I skimmed the thread you linked on […]. All […].
I see. Yes, describing how we can replace the […]. Although now I'm wondering whether this could cause a shuffle bottleneck for small relations. I guess this would also depend on the size of the cluster.
As I mentioned, we'll enforce a minimum size threshold for each partition, so a very small relation might be partitioned across only one or a few workers, which should mitigate the risk of shuffle becoming a bottleneck. I haven't thought about values for the threshold and would welcome any suggestions (the threshold would certainly be configurable at cluster creation).
This is a great summary of what we discussed. A few thoughts:

1) To move a partition, I would be in favor of issuing a proper MyriaL query, to use existing machinery as much as possible, instead of using a Postgres `COPY` statement.

2) We should make sure that the function which assigns partitions to workers is the same for all relations, so that tables hash-partitioned on their join attributes can be joined without any data shuffling.

3) How will we implement the following: "Finally, we will physically partition the data on the workers assigned nonempty partitions, using a given `PartitionFunction`"? One approach is for shuffle operators to know about the mapping of partitions to workers. This is something I've been advocating for a while: shuffle operators should have one function that maps tuples to buckets and another that maps buckets to output buffers. For parallel ingest, each worker can be told which partitions to ingest.
@mbalazin Your point 2) about ensuring that the assignment of partitions to workers is consistent across relations is an important one, and one that I overlooked. This goal would clearly be compromised by a purely arbitrary explicit mapping of partitions to workers, and also (partially) by the optimization I mentioned where small relations could be collapsed into one or a few partitions to preserve the minimum size threshold for partitions. The latter would break our in-place optimizations for joins etc.

My first thought was to maintain the same mapping for all relations from the hash space to partition IDs, and from partition IDs to workers, but to broadcast relations that don't meet the minimum size threshold when partitioned across all workers. That would increase ingest overhead for small relations, and possibly cost significant space in aggregate, but would avoid any shuffle overhead for subsequent queries. (Another problem with this solution is that for large clusters, the partition size for a relation partitioned across all workers might be too small to meet the minimum partition size threshold, while the relation itself might be large enough to exceed our maximum partition size threshold when broadcast as a single partition.)

Thinking about it more, though, I don't think this can work without something like consistent hashing. If we maintain an ordinary hash-based mapping from partition IDs to worker IDs, then even with the extra level of indirection, we end up moving almost all data when a worker is added or removed, just as we would without the level of indirection. The only difference is that we don't have to rescan and rehash the data itself; we just rehash the partition IDs to worker IDs and move the partitions as a unit. So if we want to confine writes to only the new node when a node is added, and still maintain determinism in the partition->worker mapping, I don't think the approach of an explicit mapping will work, and consistent hashing is the only deterministic approach I can think of that will (deterministic after workers have been assigned their vnodes on the hash ring, of course).

Re: point 1), I'm all in favor of reusing existing facilities, but I'm not sure how we'd use MyriaL to move a single partition from one worker to another, when the partition mapping is (intentionally) opaque to MyriaL. Did you have some new MyriaL syntax in mind?

Re: point 3), is it OK for shuffle operators to just query the catalog to determine the mapping of partitions to workers, or did you want them to be passed an explicit mapping? This will depend on our solution to 2), of course...
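To make the "almost all data moves" point concrete, here is a small illustrative calculation (a sketch only; the 256 partition count and the simple modular partition-to-worker mapping are assumptions, not part of the design):

```python
NUM_PARTITIONS = 256  # hypothetical partition count

def moved_fraction(old_workers: int, new_workers: int) -> float:
    """Fraction of partitions that change workers under an ordinary modular mapping."""
    moved = sum(1 for p in range(NUM_PARTITIONS) if p % old_workers != p % new_workers)
    return moved / NUM_PARTITIONS

print(moved_fraction(8, 9))  # ~0.88: adding one worker relocates almost every partition
```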
So I think I'm converging on a two-level hashing design: the first level is ordinary hashing mod […]. The question remains how we should determine the number of vnodes for each worker, given […].
Whether we broadcast a relation or partition it is a physical tuning […]. We don't have to move most of the data when workers are added/removed; this […]. One approach to using MyriaL to shuffle data might be to name the tables […]. For shuffle operators, it might be better for them to receive the mapping in […].
Let's discuss this in person with a whiteboard at the next Myria meeting.
I think that rendezvous hashing would be a lot simpler to implement than consistent hashing and shouldn't suffer from the problems I noted above. Basically, we would just concatenate each partition ID with each worker ID, feed the result to a hash function, and assign each partition ID to the worker ID that yields the largest hash value. This has similar properties to consistent hashing: when a worker goes away, you just assign each of its partitions to the worker that yields the next-largest hash value, and when a worker is added, only partitions which now maximize their hash value on that worker need to be moved.
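A minimal sketch of the rendezvous (highest-random-weight) scheme described above, assuming SHA-1 as the hash and integer partition/worker IDs (both assumptions made for illustration):

```python
import hashlib

def hrw_worker(partition_id: int, worker_ids: list[int]) -> int:
    """Assign a partition to the live worker whose hash of (partition, worker) is largest."""
    def score(worker_id: int) -> int:
        digest = hashlib.sha1(f"{partition_id}:{worker_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(worker_ids, key=score)

# If a worker leaves, each of its partitions falls to its next-highest-scoring worker;
# if a worker joins, only partitions that now score highest on it are moved.
assignment = {p: hrw_worker(p, [0, 1, 2, 3]) for p in range(256)}
```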
I guess the balls-in-bins argument above applies to any function that assigns partitions to workers uniformly at random, so the number of partitions will always be […].
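A quick way to see the balls-in-bins effect for coarse-grained partitions (a sketch; the partition and worker counts are arbitrary):

```python
import random
from collections import Counter

def max_min_ratio(num_partitions: int, num_workers: int, trials: int = 100) -> float:
    """Average ratio of the most- to least-loaded worker under uniformly random assignment."""
    ratios = []
    for _ in range(trials):
        counts = Counter(random.randrange(num_workers) for _ in range(num_partitions))
        loads = [counts.get(w, 0) for w in range(num_workers)]
        ratios.append(max(loads) / max(1, min(loads)))
    return sum(ratios) / trials

# With only 256 partitions over 64 workers, random placement leaves some workers
# holding several times as many partitions as others.
print(max_min_ratio(256, 64))
```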
Interestingly, Oracle's distributed in-memory SQL database uses rendezvous hashing to assign in-memory partitions to nodes: http://www.vldb.org/pvldb/vol8/p1630-mukherjee.pdf
Random idea: could we use quasirandom sequences to place workers/vnodes on the circle rather than pseudorandom hash functions? This should give us better uniformity and hopefully avoid superlinear growth in partitions with respect to workers. Example: https://en.wikipedia.org/wiki/Van_der_Corput_sequence
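For reference, the base-2 van der Corput sequence is easy to generate; the idea floated here is to place the i-th vnode at `van_der_corput(i)` on the unit circle instead of at a pseudorandom hash value (a sketch, not part of the design):

```python
def van_der_corput(n: int, base: int = 2) -> float:
    """n-th element of the base-`base` van der Corput sequence: a low-discrepancy value in [0, 1)."""
    q, denom = 0.0, 1.0
    while n:
        denom *= base
        n, rem = divmod(n, base)
        q += rem / denom
    return q

positions = [van_der_corput(i) for i in range(8)]
# [0.0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]: evenly spread, unlike random placement
```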
So I think a very simple improvement on "discrete consistent hashing" (i.e., hashing to a set of […]) […].

A similar approach might work for rendezvous hashing. One might concatenate the bit representations of worker IDs and partition IDs and bit-reverse them, or pick a different (pseudo)random permutation of worker IDs for each partition. Again, for a fixed partition, there can be no collisions between workers, since this is a permutation. (One simple way to compactly specify a pseudorandom permutation on worker IDs, randomized over partition IDs, would be to hash all worker IDs using a hash function keyed on the partition ID and sort the worker IDs by their hashes.) The latter approach would guarantee no ties among workers when we are picking the highest-hashed available worker ID for a given partition, but would still give a pseudorandom distribution of partitions over workers, so it would suffer from the collision problems described above. These problems are easily soluble for consistent hashing, but I'm not sure what the solution would be for rendezvous hashing (which I otherwise prefer for its simplicity).
OK, I think I've devised a quasirandom version of rendezvous hashing. Let […]. Note that because the bit-reversal permutation is its own inverse, we do not actually need to generate the sequence. We can reconstruct the sequence number for any cell in the matrix by mapping it to its position in the unrolled vector (i.e., […]). I think that this should give us excellent uniformity (avoiding the pathological collision behavior we get from pseudorandom functions), while being fully deterministic and cheap to calculate. Like the original rendezvous hashing, it nicely avoids the virtual-nodes hack of consistent hashing. I need to do some simulations to verify that partition allocation across workers is as uniform as I expect.
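The matrix construction itself is cut off above, but the building block it relies on, a bit-reversal permutation that is its own inverse, looks like this (a sketch; `nbits` would be the base-2 log of the relevant dimension):

```python
def bit_reverse(x: int, nbits: int) -> int:
    """Reverse the low `nbits` bits of x."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (x & 1)
        x >>= 1
    return result

# Because the permutation is an involution, a cell's sequence number can be recovered
# by reversing again, so the sequence never needs to be materialized.
assert all(bit_reverse(bit_reverse(i, 8), 8) == i for i in range(256))
```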
So there's a fly in the ointment: even if we have perfectly uniform hashing with no collisions, we still need a number of partitions quadratic in the number of workers to guarantee uniform redistribution of data when a worker joins or leaves. Since we want to relocate […].
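One way to make the quadratic requirement concrete, under the assumption that the new worker's 1/(W+1) share must come evenly, in whole partitions, from every existing worker:

```python
def min_partitions_for_even_moves(num_workers: int) -> int:
    # With W workers and P partitions, each worker holds P/W partitions. When a worker
    # joins, each existing worker should shed P/W - P/(W+1) = P/(W*(W+1)) partitions;
    # for that to be at least one whole partition, P must be at least W*(W+1).
    return num_workers * (num_workers + 1)

print(min_partitions_for_even_moves(16))   # 272
print(min_partitions_for_even_moves(100))  # 10100
```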
Um, I guess I should mention that all this fancy hashing nonsense is, strictly speaking, unnecessary. I could just modify my original idea of maintaining an explicit mapping between partitions and workers, so that it maps partition IDs, rather than partitions of relations, to workers. This would be in some ways simpler than a deterministic function, but it would require its own logic to redistribute partitions uniformly across workers on cluster membership changes, and it would mean we would have to query the master catalog whenever we needed to know the worker hosting all partitions with a given ID. The deterministic approaches would allow workers to determine partition assignment knowing only cluster membership, and they would be adaptable to decentralized, masterless contexts as well.
Note on my "quasirandom" version of rendezvous hashing: it doesn't work. I simulated the algorithm for row-major, column-major, and Z-order traversals of the matrix, and the uniformity of the permutation always causes a single worker to be assigned all partitions. Using hash functions instead of a permutation to rank the workers in each column yields exactly the expected results: […]
So I did some simulations of "quasirandom consistent hashing" and found some interesting behavior. […]

Given these drawbacks, we should consider whether we really need a deterministic, stateless mapping function from partitions to workers, or whether we're willing to live with an explicit mapping that can guarantee the most uniform possible distribution of partitions across workers (the logic could be as simple as redistributing one partition to/from each worker, in worker ID order, on each pass, with as many passes as necessary). If we want "coarse-grained partitions" (i.e., the number of partitions is of the order of the number of workers), then I don't think we can use a pseudorandom approach like classical consistent hashing (for reasons I described earlier), and "quasirandom consistent hashing" is the best stateless approach I can think of so far that is compatible with coarse-grained partitions.
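The "explicit mapping plus simple redistribution logic" alternative could be as small as the following sketch (a greedy variant of the per-pass redistribution described above, not a settled design; the data structures are assumptions):

```python
def rebalance(assignment: dict[int, int], workers: list[int]) -> dict[int, int]:
    """Reassign partitions so per-worker counts differ by at most one: first place
    partitions from departed workers, then move one partition at a time from the
    most-loaded to the least-loaded worker, breaking ties by worker ID."""
    loads = {w: [] for w in workers}
    orphans = []  # partitions whose worker has left the cluster
    for partition, worker in sorted(assignment.items()):
        (loads[worker] if worker in loads else orphans).append(partition)

    def least_loaded() -> int:
        return min(workers, key=lambda w: (len(loads[w]), w))

    def most_loaded() -> int:
        return max(workers, key=lambda w: (len(loads[w]), -w))

    for partition in orphans:
        loads[least_loaded()].append(partition)
    while len(loads[most_loaded()]) - len(loads[least_loaded()]) > 1:
        loads[least_loaded()].append(loads[most_loaded()].pop())
    return {p: w for w, parts in loads.items() for p in parts}

# Example: worker 3 joins a 3-worker cluster; only the partitions needed to level the load move.
new_map = rebalance({p: p % 3 for p in range(256)}, workers=[0, 1, 2, 3])
```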
I found a couple more variations on consistent hashing but haven't had time to read them yet, so I don't know if they are relevant to our "coarse-grained, sequentially numbered partition" scenario. http://arxiv.org/pdf/1406.2294v1.pdf
I simulated "jump consistent hashing" (http://arxiv.org/pdf/1406.2294v1.pdf) and the distribution was pretty terrible--not surprising, since it's a pseudorandom algorithm and hence works well only for very fine-grained entities, i.e., keys, not partitions.
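For reference, a direct Python port of the algorithm from the linked Lamping/Veach paper, plus the kind of quick skew check described above (the 256-partition/16-worker sizes are arbitrary):

```python
from collections import Counter

def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash, as given in the paper linked above."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int(float(b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

# Distributing a coarse-grained partition space (256 partitions) over 16 workers:
print(Counter(jump_consistent_hash(p, 16) for p in range(256)))
```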
We might be able to mitigate partition skew somewhat for multitenant clusters by specializing the hash function per namespace: generate a random bitmask by hashing […].
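The end of this suggestion is cut off above; assuming the bitmask is derived from the namespace name and XORed into the tuple's key hash (both assumptions), the idea might look like:

```python
import hashlib

NUM_PARTITIONS = 256  # hypothetical cluster-wide setting

def namespace_mask(namespace: str) -> int:
    # Assumed: a stable 64-bit mask derived from the namespace name.
    return int.from_bytes(hashlib.sha1(namespace.encode()).digest()[:8], "big")

def partition_of(namespace: str, key_hash: int) -> int:
    # XORing a per-namespace mask permutes partition IDs differently for each tenant,
    # so different tenants' hot keys are less likely to pile onto the same partitions.
    return (key_hash ^ namespace_mask(namespace)) % NUM_PARTITIONS
```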
Given the shortcomings of all the stateless approaches to partition assignment we've looked at, we've decided to go with the original approach: an explicit mapping of partition IDs to worker IDs.
Rereading the Dynamo paper, I noticed that there's extensive discussion of the partitioning optimization for consistent hashing in section 6.2. We should cite this as prior work. (The difference from our approach is that they assume a number of partitions much larger than the number of nodes, which is necessary for their randomized approach of assigning partitions to nodes to work.)
## Issue description

The ability to add and remove workers dynamically on a running cluster (required, e.g., by @jortiz16's PerfEnforce work) will require some fundamental changes to the Myria architecture. Briefly, we need to add a level of indirection between workers and the data they host. This new level of data storage granularity will still be called "partitions", but these partitions will be much smaller than what can fit on a single worker, and will be immutable, though mobile (i.e., they will never be split or merged as workers are added and removed). The new hash space for our hash partitioning will be the range of these partition IDs (which will be fixed for the life of the cluster), rather than the range of worker IDs (which can grow and shrink over the life of the cluster). Instead of rehashing data when the range of worker IDs changes, existing data partitions will be moved around among workers, in a manner similar to, but distinct from, consistent hashing.
## How `Relation`s are represented in the Myria catalog

A Myria `Relation` will continue to represent a set of partitions, but those partitions will now be 1) fixed in number for the life of the cluster, 2) much more fine-grained (and hence more numerous), and 3) no longer mapped 1:1 to workers.

### Fixed partitions
The number of partitions can be configured only at the time of cluster creation. If it is ever changed, all data in all relations will have to be repartitioned (by hashing or round-robin distribution). Conceptually, all relations have `NUM_PARTITIONS` partitions, but some of them may be empty if a minimum size threshold (in bytes or tuples) is enforced for each partition. The set of nonempty partitions is always a prefix of the range of all partitions.
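As an illustration of the minimum-size rule (the threshold value and byte-based accounting are assumptions, not settled design):

```python
NUM_PARTITIONS = 256            # hypothetical cluster-wide setting, fixed at creation
MIN_PARTITION_BYTES = 64 << 20  # hypothetical minimum partition size (64 MB)

def num_nonempty_partitions(relation_bytes: int) -> int:
    """Nonempty partitions form a prefix [0, k); a small relation gets fewer nonempty
    partitions so that each one stays above the minimum size threshold."""
    if relation_bytes == 0:
        return 0
    return max(1, min(NUM_PARTITIONS, relation_bytes // MIN_PARTITION_BYTES))
```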
### Fine-grained partitions

To minimize the overhead of repartitioning when workers are added to or removed from the cluster, we move immutable partitions around among workers rather than repartitioning the data. Instead of directly hashing tuples to workers, tuples are hashed to partitions, where the partition space is fixed for the life of the cluster. These partitions are fine-grained enough to be evenly redistributed across a large cluster when just one worker is added or removed. A reasonable value for the number of partitions might be `2^6`-`2^8`.

### Explicit mapping from partitions to workers
To achieve performance similar to consistent hashing (only a small amount of data needs to be moved when nodes are added or removed), without the overhead of rescanning all data to repartition it (as we would need to do when cluster membership changes), we exploit the fact that we have a centralized system with a stable leader and an existing metadata store. Rather than using a randomized approach to map partitions to workers, we maintain an explicit mapping at the coordinator from partitions to workers. (The mapping function can be essentially arbitrary, as long as it distributes partitions sufficiently uniformly across workers.) The result is that, like consistent hashing, we only need to move a small amount of data when cluster membership changes, but unlike consistent hashing, we only read the data we are going to move. Moreover, like consistent hashing, when a new node is added, only that node receives writes, so nodes currently servicing queries incur no additional write workload.
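A minimal sketch of the two levels described here, with an in-memory dict standing in for the coordinator's catalog (the hash function, partition count, and round-robin initial mapping are all assumptions):

```python
import hashlib

NUM_PARTITIONS = 256  # fixed for the life of the cluster (hypothetical value)

def partition_of(key: bytes) -> int:
    # Level 1: tuples hash into the fixed partition space, never directly to workers.
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big") % NUM_PARTITIONS

# Level 2: an explicit partition -> worker mapping maintained by the coordinator.
partition_to_worker = {p: p % 4 for p in range(NUM_PARTITIONS)}  # e.g. 4 workers

def worker_of(key: bytes) -> int:
    return partition_to_worker[partition_of(key)]
```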
## How `Relation`s are stored and queried in Postgres

Each `Relation` in the Myria master catalog is distributed over `NUM_PARTITIONS` partitions, physically implemented as Postgres tables on Myria workers. The Postgres implementation of a relation is a view which simply unions the tables representing all partitions of the relation. To simplify implementation, the view definition will be identical on all workers, and each partition will be represented by a Postgres table on all workers, including workers which do not host that partition. (Tables representing partitions not hosted on a worker will simply be empty, and as noted above, some partitions may be empty on all workers.) Hence, moving a partition off a worker means issuing a `TRUNCATE TABLE` statement for the partition's local table, and moving a partition to a worker could be implemented with a Postgres `COPY` statement targeting the partition's local table. The overhead of scanning empty tables should be minimal relative to the ease of implementation this uniform approach affords (especially if Postgres maintains accurate statistics, which it should, given that the tables are immutable between partition rebalancings).
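A sketch of how a worker's local schema and the partition-move statements could be generated; the table/view naming, the use of `UNION ALL` (partitions are disjoint), and the `COPY` options are assumptions, not settled design:

```python
NUM_PARTITIONS = 256  # hypothetical

def partition_table(relation: str, partition_id: int) -> str:
    return f"{relation}__p{partition_id}"  # hypothetical naming convention

def create_view_sql(relation: str) -> str:
    # Identical view definition on every worker: the union of all partition tables,
    # including tables for partitions this worker does not host (they are simply empty).
    union = "\nUNION ALL\n".join(
        f"SELECT * FROM {partition_table(relation, p)}" for p in range(NUM_PARTITIONS)
    )
    return f"CREATE VIEW {relation} AS\n{union};"

def drop_partition_sql(relation: str, partition_id: int) -> str:
    # Moving a partition off a worker is a TRUNCATE of its local table ...
    return f"TRUNCATE TABLE {partition_table(relation, partition_id)};"

def receive_partition_sql(relation: str, partition_id: int) -> str:
    # ... and moving it onto a worker could be a COPY into the corresponding table.
    return f"COPY {partition_table(relation, partition_id)} FROM STDIN WITH (FORMAT binary);"
```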
## Changes to Myria data operations

### Ingesting a new `Relation`

When a new `Relation` is ingested (e.g., via MyriaL `IMPORT` and `STORE`), we will create `NUM_PARTITIONS` new Postgres tables in the Postgres database of each worker in the cluster, each table named for its corresponding partition ID. We will additionally create a new Postgres view in each worker's database consisting of the `UNION` of all partition tables for the relation. We will then create an initial mapping of partitions to workers in the catalog, based on the number of bytes/tuples in the relation. Finally, we will physically partition the data on the workers assigned nonempty partitions, using a given `PartitionFunction`.
### Deleting an existing `Relation`

When an existing `Relation` is deleted (e.g., through the `DELETE` REST API), we will first tombstone its metadata in the catalog in a single SQLite transaction, then delete the corresponding Postgres views and tables on each worker (in a local transaction, with idempotent retry globally), and finally delete its metadata from the catalog in a single SQLite transaction.

## New Myria cluster membership operations
### AddWorkers
When a new node is added to the cluster, we will create a fixed number of new workers (e.g., one for each CPU core on the node). We will then query the catalog and create new tables and views on each new worker for each existing relation, as described above. Initially we could require the user to manually rebalance relations, e.g., by issuing `SCAN` and `STORE` commands from MyriaL. Later we could automate the rebalancing, carefully considering (and testing!) its impact on running queries (but as noted above, all writes will be confined to the new node, which is not yet online, and all reads will be evenly distributed across the cluster).

### RemoveWorkers
When an existing node is removed from the cluster, we must first move all of its partitions to workers on other nodes (this, of course, inverts the I/O workload of adding a new node: reads are confined to the node being removed, but all nodes, in general, must accept writes). In principle, this cannot be made fault-tolerant without replication, but hosting Postgres tablespaces on persistent EBS volumes would probably give us acceptable durability, and backing up data to S3 would yield essentially 100% durability.