What is Data Sharding? #218
Data Sharding is an improvement made to the Serverless Architecture whereby stateful Tasks are replaced with semi-stateful Functions. This improved architecture offers lower provisioning latency, higher reliability, and (potentially) lower cost, at the expense of a slightly more complex codebase. It is scheduled for the Fall 2022 Release.

Assumptions
Metrics
Compression

Compressed read throughput of a Lambda from S3 is 5 × 75MB/s = 375MB/s. This factor of 5 is related to the reduction in size when going from CSV to Parquet, but is also similar to the average compression factor that one can expect when properly re-encoding a wide range of datatypes that would otherwise be encoded on 64 bits. For example, when encoding timestamps, 96% of them can be encoded using a single bit (Source). Similarly, when encoding 64-bit numerical values in time series, each value can be encoded using as little as 1.37 bytes (Source).

Data Lifecycle

Whenever a table is loaded, data is fetched from S3 by a fleet of AWS Lambda functions and kept in memory for the entire duration of the user session (typically one hour or more). The user is billed by AWS only when Lambda functions are invoked, at the rate of $0.017 for every TB·second. This approach is explained in the following article. Unfortunately, whenever a Lambda experiences a cold restart, the data it previously held in memory is lost and needs to be retrieved from S3, which is very slow (75MB/s). Furthermore, a MapReduce job is only as fast as its slowest Map function. Therefore, the higher the probability that one Map function experiences a cold start, the slower the job will be. Knowing that the probability of a cold start upon invocation is 1%, the probability of experiencing at least one cold start is 1 minus the probability of the complementary event, which is the probability of experiencing no cold start at all. If we assume that we have 100 Map lambdas, this probability is 1 − 0.99¹⁰⁰ ≈ 63%. There are two things that we can do to reduce that risk: replicating data across Lambda functions, and splitting it into shards, as detailed in the Data Partition and Data Shards sections below.
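As a sanity check of the cold-start math, here is a short Python snippet reproducing the figures used in this post (the 1% per-invocation cold-start probability is the assumption stated above; 220 is the fleet size derived in the next section):

```python
# Probability that at least one Map Lambda in the fleet experiences a cold start,
# assuming independent cold starts with probability p_cold per invocation.
def p_at_least_one_cold_start(n_lambdas: int, p_cold: float = 0.01) -> float:
    return 1 - (1 - p_cold) ** n_lambdas

for n in (100, 220):
    print(f"{n} lambdas -> {p_at_least_one_cold_start(n):.0%}")
# 100 lambdas -> 63%
# 220 lambdas -> 89%
```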
Because the use of AWS Lambda functions is billed by GB·second, we want to keep the amount of RAM used as low as possible. Therefore, the dataset cached in memory should be replicated once (two copies). As a result, we need at least 2TB of RAM. Furthermore, we need some headroom, which we will set at 10% (2.2TB total). We will see later on that effective headroom becomes 20% thanks to the use of default and backup partitions. Knowing that each Lambda function gives us 10GB of RAM, we will need 220 of them. This leads to a probability of cold start of 1 − 0.99²²⁰ ≈ 89%. Taking this 10% headroom into account, each Lambda function will need to load 9GB of data from S3. With a compressed read throughput of 375MB/s, this will take 24s. This is considered to be acceptable, as it will be done only once, at the very beginning of the user session, or whenever new large datasets must be loaded during a session.

If data is stored in EFS rather than S3, uncompressed throughput is 10GB/s. Therefore, compressed throughput is 50GB/s, and downloading 1TB should take about 20s, which is comparable to the performance attained with S3. Unfortunately, this is the maximum aggregated throughput that an arbitrary number of Lambdas can benefit from, unlike S3, whose aggregated throughput increases linearly with the number of Lambda functions. Therefore, EFS is recommended for datasets up to 10TB, while S3 should be used for larger datasets (to be confirmed with actual benchmarks).

Data Partition

Data will be loaded by the Lambda functions across two partitions:

- a default partition, holding the shards that the Lambda function is primarily responsible for processing;
- a backup partition, holding replicas of shards whose default copies are held by other Lambda functions.
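For reference, the sizing above is easy to recompute; the sketch below simply restates the post's assumptions (1TB dataset, 2 copies, 10% headroom, 10GB per Lambda, compression factor of 5) as a small Python script:

```python
# Fleet sizing for a 1TB dataset cached in RAM across AWS Lambda functions,
# using the assumptions stated above.
DATASET_TB         = 1.0
REPLICAS           = 2
HEADROOM           = 0.10
LAMBDA_RAM_GB      = 10
S3_COMPRESSED_MBS  = 5 * 75   # 375 MB/s per Lambda (compression factor of 5)
EFS_COMPRESSED_GBS = 5 * 10   # 50 GB/s aggregate (compression factor of 5)

ram_needed_gb      = DATASET_TB * 1000 * REPLICAS * (1 + HEADROOM)   # 2200 GB
n_lambdas          = round(ram_needed_gb / LAMBDA_RAM_GB)            # 220
data_per_lambda_gb = LAMBDA_RAM_GB * (1 - HEADROOM)                  # 9 GB
s3_load_s  = data_per_lambda_gb * 1000 / S3_COMPRESSED_MBS           # ~24 s, per Lambda, in parallel
efs_load_s = DATASET_TB * 1000 / EFS_COMPRESSED_GBS                  # ~20 s, shared aggregate limit

print(n_lambdas, round(s3_load_s), round(efs_load_s))                # 220 24 20
```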
During the Map phase, each Lambda processes the data found within its default partition, unless data found within its backup partition is required because of the cold start of another Lambda, as will be explained later on. Therefore, the actual data being processed is 9GB/2 = 4.5GB. As a result, the effective headroom is 20% instead of 10%. For the time being, we will assume that this is sufficient, but further analysis of the transforms will be required to establish this fact. With a default partition of 4.5GB and 20 columns of 64 bits, this means about 28.4M rows. Processing them with operations that take 1ns per row, total processing time is 28ms per transform, which interestingly is in the same order of magnitude as the latency of an SNS message passing (30ms), making it possible to envision very interactive processes between the Reduce and Map functions. For reference, sorting these 28.4M rows with DuckDB will take 852ms (Source).

Data Shards

Data stored within partitions (default or backup) is split into shards. The actual number of shards might depend upon many different parameters, but we will assume that it is on the order of 10. This means that each shard is about 450MB in size, and can be loaded from S3 within 1.2s (assuming a compression factor of 5). Furthermore, the 10 shards found in the default partition of a Lambda function are replicated within the backup partitions of 10 other Lambda functions. This ensures that all default data of a Lambda function experiencing a cold start can be found in the backup partitions of 10 others. The probability of having at least one of these backup Lambda functions experiencing a cold start at the same time is approximately 1 − 0.99¹⁰ ≈ 10%; across the whole fleet, with an expected 2.2 cold starts out of 220 Lambda functions, this translates into a probability of cache miss of about 2.2 × 10% ≈ 22%. And to further reduce the risk of cache miss, backup data should be distributed in such a way that no pairs of Lambda functions back each other up. In other words, if a default shard of Lambda A is backed up by Lambda B, then no default shard of Lambda B should be backed up by Lambda A. With such an architecture, a Map Lambda tasked with processing a backup shard will never do so for more than one backup shard. This means that it will experience a compute overhead of 10% if there are 10 shards per partition, or 2.8ms on top of the 28ms per transform (and much less if the transform takes longer to execute).
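The "no mutual backups" constraint can be met with a very simple offset-based placement. The scheme below is only an illustration (not necessarily the placement that will ship): shard i of Lambda j is backed up by Lambda (j + i + 1) mod N, which spreads the 10 backups of a Lambda over 10 distinct Lambdas and, as long as the number of shards is smaller than half the fleet size, guarantees that no two Lambdas ever back each other up:

```python
# Hypothetical shard placement: shard i of Lambda j is backed up by Lambda (j + i + 1) % N.
# With SHARDS < N_LAMBDAS / 2, no two Lambda functions ever back each other up.
N_LAMBDAS = 220
SHARDS    = 10   # shards per default partition

def backup_lambda(default_lambda: int, shard: int) -> int:
    """Return the Lambda holding the backup copy of `shard` of `default_lambda`."""
    return (default_lambda + shard + 1) % N_LAMBDAS

# Sanity check 1: the backup relation contains no mutual pairs.
backs_up = {(backup_lambda(j, i), j) for j in range(N_LAMBDAS) for i in range(SHARDS)}
assert all((default, backup) not in backs_up for (backup, default) in backs_up)

# Sanity check 2: each Lambda's default data is spread over 10 distinct backup Lambdas.
assert all(len({backup_lambda(j, i) for i in range(SHARDS)}) == SHARDS
           for j in range(N_LAMBDAS))
```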
Now, a 22% probability of cache miss is still high, and we can reduce it in two ways:

- Reducing the number of shards from 10 down to 5 reduces this probability from 22% down to 11%, or exactly half. But it also doubles the size of a shard from 450MB to 900MB, thereby doubling its loading time from 1.2s to 2.4s. Furthermore, it increases the compute overhead of the Map Lambdas used to process the backup shards from 10% to 20%, or 5.6ms on top of the 28ms per transform (33.6ms total). As a result, the tradeoff is between experiencing more frequent minor slowdowns versus less frequent but slightly more significant slowdowns. It is unclear which option will lead to a better user experience, and the number of shards should probably be parameterized so that multiple experiments can be conducted, or so that users can select their preferred exception handling model.
- Increasing the number of replicas from 2 to 3 would increase the cost of transforms by 50%, but it would dramatically reduce the probability of cache miss, since a miss would then require all three copies of a shard to be lost to simultaneous cold starts (see this article). This is probably the best option for 1TB, as the probability of cache miss drops by over two orders of magnitude, from 22% down to 0.11%. Most interestingly, this probability remains proportional to the number of Lambda functions.
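To make the tradeoff explicit, here is a rough estimate of the cache-miss probability, assuming independent cold starts (a back-of-the-envelope sketch, not the exact formula referenced above): a shard is lost only if every Lambda holding one of its copies is cold at the same time, and a union bound over all shards then gives the order of magnitude of the figures discussed in this post:

```python
# Rough expected probability of at least one cache miss across the fleet, assuming
# independent cold starts: a given shard is missed only if its default Lambda AND
# every Lambda holding a backup copy of it are cold at the same time.
def p_cache_miss(n_lambdas: int, shards_per_partition: int, replicas: int,
                 p_cold: float = 0.01) -> float:
    p_one_shard_missed = p_cold ** replicas            # all copies of one shard lost at once
    total_shards = n_lambdas * shards_per_partition    # number of default shards in the fleet
    return total_shards * p_one_shard_missed           # linear (union-bound) estimate

print(f"{p_cache_miss(220, 10, 2):.2%}")   # 22.00% -> 10 shards, 2 replicas
print(f"{p_cache_miss(220,  5, 2):.2%}")   # 11.00% -> 5 shards, 2 replicas
print(f"{p_cache_miss(220, 10, 3):.2%}")   # 0.22%  -> 10 shards, 3 replicas
print(f"{p_cache_miss(220,  5, 3):.2%}")   # 0.11%  -> 5 shards, 3 replicas
```

Being a union bound, this estimate is linear in the number of Lambda functions, which is why the probability of cache miss is proportional to the fleet size, as noted above.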