What is Data Sharding? #218
Data Sharding is an improvement made to the Serverless Architecture whereby stateful Tasks are replaced with semi-stateful Functions. This improved architecture offers lower provisioning latency, higher reliability, and (potentially) lower cost, at the expense of a slightly more complex codebase. It is scheduled for the Fall 2022 Release.

Assumptions
Metrics
Compression

Compressed read throughput of a Lambda from S3 is 5 × 75MB/s = 375MB/s. This factor of 5 is related to the reduction in size when going from CSV to Parquet, but is also similar to the average compression factor that one can expect when properly re-encoding a wide range of datatypes that would otherwise be encoded on 64 bits. For example, when encoding timestamps, 96% of them can be encoded using a single bit (Source). Similarly, when encoding 64-bit numerical values in time series, each value can be encoded using as little as 1.37 bytes (Source).

Data Lifecycle

Whenever a table is loaded, data is fetched from S3 by a fleet of AWS Lambda functions and kept in memory for the entire duration of the user session (typically one hour or more). The user is billed by AWS only when Lambda functions are invoked, at the rate of $0.017 for every TB·second. This approach is explained in the following article. Unfortunately, whenever a Lambda experiences a cold restart, the data it previously held in memory is lost and needs to be retrieved from S3, which is very slow (75MB/s). Furthermore, a MapReduce job is only as fast as its slowest Map function. Therefore, the higher the probability that one Map function experiences a cold start, the slower the job will be. Knowing that the probability of a cold start upon invocation is 1%, the probability of experiencing at least one cold start is 1 minus the probability of the complementary event, which is the probability of experiencing no cold start at all. If we assume that we have 100 Map lambdas, this probability is 1 − 0.99¹⁰⁰ ≈ 63%. There are two things that we can do to reduce that risk: replicating data across Lambda functions, and splitting it into shards, as detailed in the Data Partition and Data Shards sections below.
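As a sanity check of the cold-start math, here is a short Python snippet reproducing the figures used in this post (the 1% per-invocation cold-start probability is the assumption stated above; 220 is the fleet size derived in the next section):

```python
# Probability that at least one Map Lambda in the fleet experiences a cold start,
# assuming independent cold starts with probability p_cold per invocation.
def p_at_least_one_cold_start(n_lambdas: int, p_cold: float = 0.01) -> float:
    return 1 - (1 - p_cold) ** n_lambdas

for n in (100, 220):
    print(f"{n} lambdas -> {p_at_least_one_cold_start(n):.0%}")
# 100 lambdas -> 63%
# 220 lambdas -> 89%
```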
Because the use of AWS Lambda functions is billed by GB·second, we want to keep the amount of RAM used as low as possible. Therefore, the dataset cached in memory should be replicated once (two copies). As a result, we need at least 2TB of RAM. Furthermore, we need some headroom, which we will set at 10% (2.2TB total). We will see later on that effective headroom becomes 20% thanks to the use of default and backup partitions. Knowing that each Lambda function gives us 10GB of RAM, we will need 220 of them. This leads to a probability of cold start of 1 − 0.99²²⁰ ≈ 89%. Taking this 10% headroom into account, each Lambda function will need to load 9GB of data from S3. With a compressed read throughput of 375MB/s, this will take 24s. This is considered to be acceptable, as it will be done only once, at the very beginning of the user session, or whenever new large datasets must be loaded during a session.

If data is stored in EFS rather than S3, uncompressed throughput is 10GB/s. Therefore, compressed throughput is 50GB/s, and downloading 1TB should take about 20s, which is comparable to the performance attained with S3. Unfortunately, this is the maximum aggregated throughput that an arbitrary number of Lambdas can benefit from, unlike S3, whose aggregated throughput increases linearly with the number of Lambda functions. Therefore, EFS is recommended for datasets up to 10TB, while S3 should be used for larger datasets (to be confirmed with actual benchmarks).

Data Partition

Data will be loaded by the Lambda functions across two partitions:

- a default partition, holding the shards that the Lambda function is primarily responsible for processing;
- a backup partition, holding replicas of shards whose default copies are held by other Lambda functions.
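For reference, the sizing above is easy to recompute; the sketch below simply restates the post's assumptions (1TB dataset, 2 copies, 10% headroom, 10GB per Lambda, compression factor of 5) as a small Python script:

```python
# Fleet sizing for a 1TB dataset cached in RAM across AWS Lambda functions,
# using the assumptions stated above.
DATASET_TB         = 1.0
REPLICAS           = 2
HEADROOM           = 0.10
LAMBDA_RAM_GB      = 10
S3_COMPRESSED_MBS  = 5 * 75   # 375 MB/s per Lambda (compression factor of 5)
EFS_COMPRESSED_GBS = 5 * 10   # 50 GB/s aggregate (compression factor of 5)

ram_needed_gb      = DATASET_TB * 1000 * REPLICAS * (1 + HEADROOM)   # 2200 GB
n_lambdas          = round(ram_needed_gb / LAMBDA_RAM_GB)            # 220
data_per_lambda_gb = LAMBDA_RAM_GB * (1 - HEADROOM)                  # 9 GB
s3_load_s  = data_per_lambda_gb * 1000 / S3_COMPRESSED_MBS           # ~24 s, per Lambda, in parallel
efs_load_s = DATASET_TB * 1000 / EFS_COMPRESSED_GBS                  # ~20 s, shared aggregate limit

print(n_lambdas, round(s3_load_s), round(efs_load_s))                # 220 24 20
```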
During the Map phase, each Lambda processes the data found within its default partition, unless data found within its backup partition is required because of the cold start of another Lambda, as will be explained later on. Therefore, the actual data being processed is 9GB/2 = 4.5GB. As a result, the effective headroom is 20% instead of 10%. For the time being, we will assume that this is sufficient, but further analysis of the transforms will be required to establish this fact. With a default partition of 4.5GB and 20 columns of 64 bits, this means about 28.4M rows. Processing them with operations that take 1ns per row, total processing time is 28ms per transform, which interestingly is in the same order of magnitude as the latency of an SNS message passing (30ms), making it possible to envision very interactive processes between the Reduce and Map functions. For reference, sorting these 28.4M rows with DuckDB will take 852ms (Source).

Data Shards

Data stored within partitions (default or backup) is split into shards. The actual number of shards might depend upon many different parameters, but we will assume that it is on the order of 10. This means that each shard is about 450MB in size, and can be loaded from S3 within 1.2s (assuming a compression factor of 5). Furthermore, the 10 shards found in the default partition of a Lambda function are replicated within the backup partitions of 10 other Lambda functions. This ensures that all default data of a Lambda function experiencing a cold start can be found in the backup partitions of 10 others. The probability of having at least one of these backup Lambda functions experiencing a cold start at the same time is approximately 1 − 0.99¹⁰ ≈ 10%; across the whole fleet, with an expected 2.2 cold starts out of 220 Lambda functions, this translates into a probability of cache miss of about 2.2 × 10% ≈ 22%. And to further reduce the risk of cache miss, backup data should be distributed in such a way that no pairs of Lambda functions back each other up. In other words, if a default shard of Lambda A is backed up by Lambda B, then no default shard of Lambda B should be backed up by Lambda A. With such an architecture, a Map Lambda tasked with processing a backup shard will never do so for more than one backup shard. This means that it will experience a compute overhead of 10% if there are 10 shards per partition, or 2.8ms on top of the 28ms per transform (and much less if the transform takes longer to execute).
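The "no mutual backups" constraint can be met with a very simple offset-based placement. The scheme below is only an illustration (not necessarily the placement that will ship): shard i of Lambda j is backed up by Lambda (j + i + 1) mod N, which spreads the 10 backups of a Lambda over 10 distinct Lambdas and, as long as the number of shards is smaller than half the fleet size, guarantees that no two Lambdas ever back each other up:

```python
# Hypothetical shard placement: shard i of Lambda j is backed up by Lambda (j + i + 1) % N.
# With SHARDS < N_LAMBDAS / 2, no two Lambda functions ever back each other up.
N_LAMBDAS = 220
SHARDS    = 10   # shards per default partition

def backup_lambda(default_lambda: int, shard: int) -> int:
    """Return the Lambda holding the backup copy of `shard` of `default_lambda`."""
    return (default_lambda + shard + 1) % N_LAMBDAS

# Sanity check 1: the backup relation contains no mutual pairs.
backs_up = {(backup_lambda(j, i), j) for j in range(N_LAMBDAS) for i in range(SHARDS)}
assert all((default, backup) not in backs_up for (backup, default) in backs_up)

# Sanity check 2: each Lambda's default data is spread over 10 distinct backup Lambdas.
assert all(len({backup_lambda(j, i) for i in range(SHARDS)}) == SHARDS
           for j in range(N_LAMBDAS))
```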
Now, a 22% probability of cache miss is still high, and we can reduce it in two ways:

- Reducing the number of shards from 10 down to 5 reduces this probability from 22% down to 11%, or exactly half. But it also doubles the size of a shard from 450MB to 900MB, thereby doubling its loading time from 1.2s to 2.4s. Furthermore, it increases the compute overhead of the Map Lambdas used to process the backup shards from 10% to 20%, or 5.6ms on top of the 28ms per transform (33.6ms total). As a result, the tradeoff is between experiencing more frequent minor slowdowns versus less frequent but slightly more significant slowdowns. It is unclear which option will lead to a better user experience, and the number of shards should probably be parameterized so that multiple experiments can be conducted, or so that users can select their preferred exception handling model.
- Increasing the number of replicas from 2 to 3 would increase the cost of transforms by 50%, but it would dramatically reduce the probability of cache miss, since a miss would then require all three copies of a shard to be lost to simultaneous cold starts (see this article). This is probably the best option for 1TB, as the probability of cache miss drops by over two orders of magnitude, from 22% down to 0.11%. Most interestingly, this probability remains proportional to the number of Lambda functions.
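To make the tradeoff explicit, here is a rough estimate of the cache-miss probability, assuming independent cold starts (a back-of-the-envelope sketch, not the exact formula referenced above): a shard is lost only if every Lambda holding one of its copies is cold at the same time, and a union bound over all shards then gives the order of magnitude of the figures discussed in this post:

```python
# Rough expected probability of at least one cache miss across the fleet, assuming
# independent cold starts: a given shard is missed only if its default Lambda AND
# every Lambda holding a backup copy of it are cold at the same time.
def p_cache_miss(n_lambdas: int, shards_per_partition: int, replicas: int,
                 p_cold: float = 0.01) -> float:
    p_one_shard_missed = p_cold ** replicas            # all copies of one shard lost at once
    total_shards = n_lambdas * shards_per_partition    # number of default shards in the fleet
    return total_shards * p_one_shard_missed           # linear (union-bound) estimate

print(f"{p_cache_miss(220, 10, 2):.2%}")   # 22.00% -> 10 shards, 2 replicas
print(f"{p_cache_miss(220,  5, 2):.2%}")   # 11.00% -> 5 shards, 2 replicas
print(f"{p_cache_miss(220, 10, 3):.2%}")   # 0.22%  -> 10 shards, 3 replicas
print(f"{p_cache_miss(220,  5, 3):.2%}")   # 0.11%  -> 5 shards, 3 replicas
```

Being a union bound, this estimate is linear in the number of Lambda functions, which is why the probability of cache miss is proportional to the fleet size, as noted above.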