Inspired by the Snowflake paper, we want to decouple Myria's storage requirements from its compute requirements. To accomplish this, we will refactor Myria's storage architecture as follows. Every ingested relation will be split into a fixed, cluster-independent number of partitions chosen according to the data size (some small power of 2, like 64, 128, or 256, might be appropriate). A Myria cluster will ingest the data into the specified number of partitions (each of which corresponds to a Postgres table), using a user-specified partitioning function, and then dump the partitions as tables in Postgres binary format into S3. A manifest file describing the original data source, the schema of the relation, the number of partitions, the partitioning scheme, and the location of each partition file will also be uploaded to S3 at a well-known location (following some naming convention). Another Myria cluster can then ingest the relation by downloading the manifest file, converting it into metadata stored in the system catalog, mapping each partition to a worker ID, and loading the partition files as Postgres tables.
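As a rough sketch, the manifest and its S3 layout might look something like the following (the field names, key convention, and `write_manifest` helper are illustrative assumptions, not an actual Myria format):

```python
# Hypothetical manifest layout; field names and the S3 key convention are
# illustrative, not the actual Myria catalog format.
import json
import boto3

def write_manifest(bucket, user, program, relation, schema,
                   num_partitions, partition_scheme):
    """Upload a manifest describing a partitioned relation to a well-known S3 key."""
    manifest = {
        "relation": {"user": user, "program": program, "name": relation},
        "schema": schema,                     # e.g. [["id", "LONG_TYPE"], ["name", "STRING_TYPE"]]
        "numPartitions": num_partitions,      # fixed and cluster-independent, e.g. 64/128/256
        "partitionScheme": partition_scheme,  # e.g. {"type": "hash", "columns": ["id"]}
        "partitionFiles": [
            # one Postgres binary-format dump per partition
            "s3://{}/{}/{}/{}/part-{:03d}.bin".format(bucket, user, program, relation, i)
            for i in range(num_partitions)
        ],
    }
    key = "{}/{}/{}/manifest.json".format(user, program, relation)
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=json.dumps(manifest, indent=2))
    return key
```

A second cluster would only need the bucket and the naming convention to reconstruct the catalog entry and the partition-to-worker mapping from this single file.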
Actually fetching the partition files needed to compute a given query could be done lazily, with a caching approach as in the Snowflake design. Each node has a Postgres instance with limited storage capacity, which serves as a "partition cache". When the Postgres instance approaches capacity and a new partition needs to be loaded, some other partition must be evicted. The Snowflake paper uses exact LRU as the eviction policy, which is feasible to implement since the number of cached objects is very small. (Since the Postgres instance on each node is shared among all workers on that node, the cache structure must be shared as well.)
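A minimal sketch of such a per-node partition cache, assuming hypothetical `load_partition_from_s3` and `drop_partition` hooks that COPY a partition file into Postgres and DROP the corresponding table:

```python
# Minimal sketch of a per-node "partition cache" with exact LRU eviction.
# The load/drop callbacks are hypothetical stand-ins for fetching a partition
# file from S3 into Postgres and dropping the corresponding Postgres table.
import threading
from collections import OrderedDict

class PartitionCache:
    def __init__(self, capacity, load_partition_from_s3, drop_partition):
        self.capacity = capacity            # max partitions the Postgres instance can hold
        self.load = load_partition_from_s3  # download + COPY a partition into Postgres
        self.drop = drop_partition          # DROP TABLE for an evicted partition
        self.lru = OrderedDict()            # partition id -> True, least recently used first
        self.lock = threading.Lock()        # shared among all workers on the node

    def get(self, partition_id):
        """Ensure a partition is resident in Postgres, evicting LRU partitions if needed."""
        with self.lock:
            if partition_id in self.lru:
                self.lru.move_to_end(partition_id)        # mark as most recently used
                return
            while len(self.lru) >= self.capacity:
                victim, _ = self.lru.popitem(last=False)  # exact LRU: evict the oldest entry
                self.drop(victim)
            self.load(partition_id)
            self.lru[partition_id] = True
```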
When cluster membership changes, the assignment of partitions to workers must change as well (see elasticity design in #851). Using the caching approach, we can reshuffle data among workers lazily: partitions that remain on workers they are no longer assigned to simply persist in the cache until they are evicted in favor of a new partition (they will always be chosen as the first victims). Instead of transferring a partition peer-to-peer from the old owner to the new owner, the old owner simply allows the partition to expire from its cache, while the new owner loads it on demand from S3 when it is actually required. This greatly simplifies the elasticity design and implementation.
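In terms of the cache sketch above, victim selection would just consult the current partition-to-worker assignment (the `choose_victim` helper and the `assigned` set are hypothetical; the assignment itself would come from the coordinator per #851):

```python
# Sketch of assignment-aware eviction for lazy reshuffling: partitions no longer
# assigned to this node are always evicted first; otherwise fall back to exact LRU.
def choose_victim(lru, assigned):
    """Pick the partition to evict: stale (unassigned) partitions first, then LRU order."""
    for partition_id in lru:      # OrderedDict iterates least recently used first
        if partition_id not in assigned:
            return partition_id   # leftover from a previous assignment: evict immediately
    return next(iter(lru))        # no stale partitions: evict the plain LRU victim
```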
The downside of loading all partitions on demand, of course, is that queries will have to wait for all their required partitions to be present in the cache. If the partitions are small enough and we have enough workers available, we should be able to get good parallel throughput from S3. The effectiveness of the cache depends entirely on usage patterns. We will have to find an appropriate balance between storage costs (for the local or remote storage device that backs the Postgres cache) and query performance (for queries that would be penalized by cache evictions). We could instrument cache evictions alongside query logging and apply various learning algorithms offline to the resulting traces, or (much more ambitiously) even dynamically change the size of the EBS volumes backing Postgres in response to an online learning algorithm or feedback controller.
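For example, the partition files a query needs could be downloaded in parallel ahead of execution; the `fetch_partitions` helper and key names below are illustrative:

```python
# Sketch of fetching a query's partition files from S3 in parallel; with small
# partitions and enough concurrent requests, aggregate download throughput stays high.
import boto3
from concurrent.futures import ThreadPoolExecutor

def fetch_partitions(bucket, keys, dest_dir, max_parallel=8):
    """Download the partition files a query needs, overlapping the S3 requests."""
    s3 = boto3.client("s3")

    def fetch(key):
        local_path = "{}/{}".format(dest_dir, key.replace("/", "_"))
        s3.download_file(bucket, key, local_path)  # one Postgres binary dump per key
        return local_path

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(fetch, keys))
```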
The result of these design changes is that users can share data with complete performance and fault isolation. Myria clusters will need storage space only for what they use, while they are using it. Even clusters that are underprovisioned in storage space will merely swap cached partitions excessively, rather than run out of local disk space (which can have unpredictable and unpleasant consequences).
There are several implications of this design for the configuration of Myria clusters. One is that since clusters are now decoupled from durable storage of the data they process, there is not much reason for them to keep running, or even to exist, while they are idle. This makes the stop/start feature of myria-cluster (which requires on-demand instances) less compelling, and spot instances more compelling. Also, since Postgres is now an ephemeral local cache, using persistent remote storage like EBS as its backing store, rather than cheaper, lower-latency local storage, is no longer a good tradeoff. (The one exception may be the ability of EBS volumes to be dynamically resized in response to excessive swapping of the Postgres "cache".)
This design should be implemented as part of the elasticity implementation, since it depends on fine-grained partitions and the ability to dynamically reassign partitions to workers.