SCP 3619: costing for serialiseData #4480

Merged: 8 commits merged into master, Mar 22, 2022
Conversation

@kwxm kwxm commented Mar 19, 2022

Costing serialiseData and equalsData is a little tricky because we measure the size of Data objects using only a single number and execution times can be very different for objects of the same size. For example, here's a plot of serialisation times:
[plot: SerialiseData]

The red line is a regression line obtained by standard linear regression, and it clearly underestimates the serialisation times for many inputs. This PR attempts to fit a more conservative model. We do this by discarding everything below the line and fitting another linear model to the remaining data, repeating until we get a line which lies above at least 90% of the original data (or until we've performed twenty iterations, but with the data here we only require two iterations). We also go to some trouble to force the fitted model to have a sensible intercept, partly because our benchmark results are biased towards small values.
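The fitting itself is done in the repository's R costing code; purely to illustrate the idea, here is a minimal Haskell sketch of the loop just described, using a closed-form simple linear regression. All of the names (olsFit, coverage, conservativeFit) are invented for this sketch, and the intercept handling mentioned above is omitted.

```haskell
-- Illustration only: a minimal sketch of the iterative "conservative" fit.
-- The real fitting is done in the R costing code, which also constrains
-- the intercept (omitted here).
data Line = Line { intercept :: Double, slope :: Double } deriving Show

-- Ordinary least-squares fit of y = a + b*x.
olsFit :: [(Double, Double)] -> Line
olsFit pts = Line a b
  where
    n  = fromIntegral (length pts)
    mx = sum (map fst pts) / n
    my = sum (map snd pts) / n
    b  = sum [(x - mx) * (y - my) | (x, y) <- pts]
       / sum [(x - mx) ^ (2 :: Int) | (x, _) <- pts]
    a  = my - b * mx

predictAt :: Line -> Double -> Double
predictAt (Line a b) x = a + b * x

-- Fraction of the points lying on or below the line.
coverage :: Line -> [(Double, Double)] -> Double
coverage l pts =
  fromIntegral (length [() | (x, y) <- pts, y <= predictAt l x])
    / fromIntegral (length pts)

-- Fit, discard the points below the line, refit, and repeat until the line
-- lies above at least 90% of the *original* data or we have done twenty
-- iterations (the data in this PR needed only two).
conservativeFit :: [(Double, Double)] -> Line
conservativeFit original = go (20 :: Int) original
  where
    go :: Int -> [(Double, Double)] -> Line
    go k pts =
      let l = olsFit pts
      in if k <= 1 || coverage l original >= 0.9
           then l
           else go (k - 1) [(x, y) | (x, y) <- pts, y > predictAt l x]
```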

Here are the results of applying this method to the benchmark figures for serialiseData and equalsData.

SerialiseData

[plot: SerialiseData-fitted]

The bound for serialiseData underestimates 7.8% of the datapoints; for these points (the ones above the red line), the observed value exceeds the predicted value by a factor of up to 2.9x, with a mean of 2.08x (most of this happens for small sizes: see below). The prediction exceeds the observed values in the remaining 92.2% of the data, by a factor of up to 20.3x (ie, the ratio (predicted time)/(observed time) is 20.3); the mean overestimate is 4.68x.
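For concreteness, the percentages and factors quoted here (and for equalsData below) are just ratios of observed to predicted times and vice versa. A sketch of the bookkeeping, reusing the hypothetical Line and predictAt from the previous sketch:

```haskell
import Data.List (partition)

-- How the percentages and factors quoted above are computed.
data FitSummary = FitSummary
  { fracUnderestimated :: Double  -- fraction of points lying above the line
  , maxUnderFactor     :: Double  -- worst observed/predicted ratio for those points
  , meanUnderFactor    :: Double
  , maxOverFactor      :: Double  -- worst predicted/observed ratio for the rest
  , meanOverFactor     :: Double
  } deriving Show

summarise :: Line -> [(Double, Double)] -> FitSummary
summarise l pts = FitSummary
  { fracUnderestimated = fromIntegral (length under) / fromIntegral (length pts)
  , maxUnderFactor     = maxOr1  [y / p | (p, y) <- under]
  , meanUnderFactor    = meanOr1 [y / p | (p, y) <- under]
  , maxOverFactor      = maxOr1  [p / y | (p, y) <- over]
  , meanOverFactor     = meanOr1 [p / y | (p, y) <- over]
  }
  where
    withPredictions = [(predictAt l x, y) | (x, y) <- pts]
    (under, over)   = partition (\(p, y) -> y > p) withPredictions
    maxOr1  [] = 1
    maxOr1  xs = maximum xs
    meanOr1 [] = 1
    meanOr1 xs = sum xs / fromIntegral (length xs)
```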

The graph above is for Data objects of size up to about 880,000, which is quite large (and look at the times!). If we zoom in on things of size up to 5000 we get the following graph:
[plot: SerialiseData-fitted-small]

Here we see a series of observations for small objects heading upwards at a very steep angle (you can just about see these in the previous graph if you look closely at the bottom left corner), and these account for most of the large underpredictions. I'm not sure if these points represent a real trend (ie, whether we could construct larger objects which fall on the same steep line) or if they're just some peculiarity of small data. If we increase the gradient of the red line so that it lies above most of these points, then we end up with a costing function which overestimates costs for larger data by a factor of 200 or more, so we probably don't want to do that unless we really can have larger data objects which behave badly; if that is the case, then a better generator would give us data leading to a better model without having to change any R code.

EqualsData

The same method produces good results for equalsData as well. Here's what it does for the full dataset:
[plot: EqualsData-fitted]
Only 1.5% of the observations lie above the line; for these points, the observed value exceeds the predicted value by a factor of up to 1.27x, with a mean of 1.11x. The prediction exceeds the observed values in the remaining 98.5% of the data, by a factor of up to 12.9x; the mean overestimate is 2.11x.

If we zoom in on the bottom left we see that we don't get the apparently atypical observations that we got for serialiseData, even though the benchmarks for the two functions use exactly the same inputs:
[plot: EqualsData-fitted-small]

Conclusion

This method appears to give us quite accurate upper bounds on execution times for functions which have to traverse entire Data objects. Because of the non-homogeneous nature of Data these bounds are quite conservative. Note that the inferred costs are quite expensive: for example the costing function for serialiseData would charge 20.7µs for serialising an object of size 50, 40.1µs for size 100, and 309.3µs for size 1000. For equalsData the costs would be 2.1µs, 3.02µs, and 19.7µs. We could decrease costs by reverting to a standard linear model, but then we'd end up undercharging for some inputs. It would be useful to know what sort of Data objects people will be serialising in practice, and how large they are likely to be. We could also do with a better generator for Data: see SCP-3653.
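For anyone unfamiliar with the costing machinery: the costing function here is just a one-variable linear function of the argument's size, evaluated per call, as in the following sketch. The coefficients shown are placeholders for illustration, not the values fitted in this PR.

```haskell
-- A one-variable linear costing function in microseconds.  The coefficients
-- below are PLACEHOLDERS for illustration only; they are not the values
-- actually fitted in this PR.
data LinearCost = LinearCost
  { costIntercept :: Double  -- microseconds
  , costSlope     :: Double  -- microseconds per unit of Data size
  }

chargeFor :: LinearCost -> Integer -> Double
chargeFor (LinearCost c0 c1) size = c0 + c1 * fromIntegral size

-- e.g. map (chargeFor hypotheticalSerialiseDataCost) [50, 100, 1000]
hypotheticalSerialiseDataCost :: LinearCost
hypotheticalSerialiseDataCost = LinearCost 1.0 0.3  -- made-up numbers
```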

@kwxm kwxm requested review from bezirg and michaelpj March 19, 2022 12:56
@kwxm kwxm added the Benchmarks and Costing labels Mar 19, 2022
@michaelpj

> It would be useful to know what sort of Data objects people will be serialising in practice, and how large they are likely to be.

I've asked the Hydra people to send you some.

@michaelpj michaelpj left a comment

I had a brief look at the R code and it looks sensible!

```diff
@@ -13,14 +13,14 @@ library(broom, quietly=TRUE, warn.conflicts=FALSE)
 
 
 ## At present, times in the becnhmarking data are typically of the order of
-## 10^(-6) seconds. WE SCALE THESE UP TO MILLISECONDS because the resulting
+## 10^(-6) seconds. WE SCALE THESE UP TO MICROSECONDS because the resulting
```
👍

@michaelpj

Would be good for Nikos to look too.

@thealmarty

Nice! It looks from the graphs as though adding another regressor with an exponent might fit better. Not sure if we want that complexity, though.

@michaelpj

Okay, this looks good for now; we can refine it later.

@michaelpj michaelpj merged commit bd97960 into master Mar 22, 2022
@kwxm kwxm deleted the kwxm/costing/serialiseData branch March 22, 2022 14:11
@kwxm kwxm commented Mar 25, 2022

I had a closer look at the times for serialisation and equality checking. Recall that Data is defined as

```haskell
data Data =
      Constr Integer [Data]
    | Map [(Data, Data)]
    | List [Data]
    | I Integer
    | B BS.ByteString
```

I generated four different types of samples (a rough sketch of generators for these follows the list):

  • List(I): a Data object consisting of a single List node containing I objects wrapping Integers. I generated lists of length 10, 50, and 1000, each containing Integers of size (ie, memoryUsage) 10, 100, and 1000.
  • List(B): like List(I), but with bytestrings.
  • Tree(List): a tree built from List nodes, with I 0 at the leaves. The lists were of varying sizes up to length 350.
  • Tree(Map): a tree built from Map nodes (ie, lists of pairs of Data objects), with I 0 at the leaves. The lists of pairs were of varying sizes up to length 170.
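The generators themselves aren't shown here; the following sketch (with invented helper names and size parameters, not the code actually used for the benchmarks) shows roughly how such samples can be built from the Data constructors above.

```haskell
import qualified Data.ByteString.Char8 as BSC

-- Rough sketch only: the helper names and sizes are invented for
-- illustration and are not the benchmark generators.

-- List(I): a single List node containing n I nodes, each wrapping an
-- integer occupying roughly w 64-bit words (memoryUsage ~ w).
listOfI :: Int -> Int -> Data
listOfI n w = List [I (2 ^ (64 * w) - 1) | _ <- [1 .. n]]

-- List(B): as List(I), but with bytestrings of 8*w bytes instead.
listOfB :: Int -> Int -> Data
listOfB n w = List [B (BSC.replicate (8 * w) 'x') | _ <- [1 .. n]]

-- Tree(List): a tree of List nodes with I 0 at the leaves.
treeOfLists :: Int -> Int -> Data
treeOfLists depth width
  | depth <= 0 = I 0
  | otherwise  = List [treeOfLists (depth - 1) width | _ <- [1 .. width]]

-- Tree(Map): as Tree(List), but with Map nodes containing pairs.
treeOfMaps :: Int -> Int -> Data
treeOfMaps depth width
  | depth <= 0 = I 0
  | otherwise  = Map [ (treeOfMaps (depth - 1) width, treeOfMaps (depth - 1) width)
                     | _ <- [1 .. width] ]
```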

SerialiseData

The raw data for SerialiseData looks like this:

[plot: SerialiseData1]

Clearly serialising lists of Integers is much more expensive than serialising anything else. This is presumably because the CBOR encoding is quite complicated, especially for large integers.

We can get a more detailed view of what happens for the other types by increasing the vertical scale and cutting off the larger figures for List(I):
[plot: SerialiseData2]
Serialising the two different types of tree costs approximately the same and is a bit more expensive than serialising bytestrings, at least for small sizes.

The objects in these graphs are of very uniform types. Presumably, if we were to generate objects with a greater mixture of node types, containing integers and bytestrings with a wide range of sizes, we'd get a much more solid fan shape.

EqualsData

I also benchmarked EqualsData with the same data (comparing each sample with a fresh copy of the same data to get worst case times). The full set of results looks like this:
[plot: EqualsData1]
Now the cost of traversing the two tree types dominates the execution time, being significantly more expensive than comparing lists of integers and strings. This contrasts with SerialiseData, where processing large integers seems to be the most expensive aspect.

If we zoom in on the bottom of the graph we get this:
[plot: EqualsData2]
We see several rays for the List(I) and List(B) data. These are caused by the differing sizes of the list contents (recall, for example, that we have integers of size 10, 100, and 1000). Again, if the data were more mixed we'd get a more uniform fan shape.
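Incidentally, the reason for comparing each sample with a fresh copy rather than with itself is that the worst case for an equality check is two arguments which are equal but unshared, so the comparison has to traverse both structures completely and can't benefit from any sharing. A sketch of the idea, assuming the Data type quoted above with its usual derived Eq instance and BS = Data.ByteString:

```haskell
-- Sketch of the "fresh copy" idea: rebuild the value node by node so that
-- the equality check cannot benefit from sharing between its two arguments.
deepCopy :: Data -> Data
deepCopy (Constr i ds) = Constr i (map deepCopy ds)
deepCopy (Map ps)      = Map [(deepCopy k, deepCopy v) | (k, v) <- ps]
deepCopy (List ds)     = List (map deepCopy ds)
deepCopy (I n)         = I n
deepCopy (B b)         = B (BS.copy b)  -- copy the underlying bytes too

-- The worst case for equality: comparing two equal but unshared values,
-- so the comparison can never stop early.
worstCaseEqual :: Data -> Bool
worstCaseEqual d = d == deepCopy d
```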

Conclusion

The fact that we have non-uniform data but only a single measure of size is a nuisance, especially because the time depends on the size in different ways for SerialiseData and EqualsData. The cost model derived in this PR probably overcharges for serialisation of Data objects, but this is necessary because we have to defend against the genuinely expensive cost of serialising large integers, even though these may not appear in good-faith validators.

To mitigate this we'd have to at least rebalance the costs of processing the different node types (losing the close connection to memory usage in the process), or allow different builtins to use different size functions for objects of the same size (and conceivably even allow a single builtin to use different costing functions for different arguments of the same type). This wouldn't be a great technical challenge, but it would require significant changes to the costing infrastructure.
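As a purely hypothetical illustration of "different size functions for objects of the same size": a serialisation-oriented measure might weight I nodes by the size of the integers they contain more heavily than other nodes, along the lines of the sketch below. Nothing like this is implemented here; the weight is made up, and Data and BS are as in the definition quoted above.

```haskell
-- Hypothetical sketch only: a serialisation-specific size measure that
-- charges extra for large integers, in contrast to the single uniform
-- size used by the current cost model.
serialisationSize :: Data -> Integer
serialisationSize d = case d of
  Constr _ ds -> 1 + sum (map serialisationSize ds)
  Map ps      -> 1 + sum [serialisationSize k + serialisationSize v | (k, v) <- ps]
  List ds     -> 1 + sum (map serialisationSize ds)
  I n         -> 1 + integerWeight * integerWords n
  B b         -> 1 + fromIntegral (BS.length b `div` 8)
  where
    integerWeight = 4  -- made-up factor reflecting the expensive integer encoding
    -- Number of 64-bit words needed to hold n (roughly what memoryUsage reports).
    integerWords n = go (abs n) 1
      where
        go m acc =
          if m < 2 ^ (64 :: Int) then acc else go (m `div` 2 ^ (64 :: Int)) (acc + 1)
```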
