SCP 3619: costing for serialiseData #4480

Merged: 8 commits merged into master, Mar 22, 2022
Conversation

@kwxm kwxm commented Mar 19, 2022

Costing serialiseData and equalsData is a little tricky because we measure the size of Data objects using only a single number and execution times can be very different for objects of the same size. For example, here's a plot of serialisation times:
[plot: SerialiseData]

The red line is a regression line obtained by standard linear regression, and it clearly underestimates the serialisation times for many inputs. This PR attempts to fit a more conservative model. We do this by discarding everything below the line and fitting another linear model to the remaining data, repeating until we get a line which lies above at least 90% of the original data (or until we've performed twenty iterations, but with the data here we only require two iterations). We also go to some trouble to force the fitted model to have a sensible intercept, partly because our benchmark results are biased towards small values.
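The fitting itself is done in the repository's R costing code; purely to illustrate the idea, here is a minimal Haskell sketch of the loop just described, using a closed-form simple linear regression. All of the names (olsFit, coverage, conservativeFit) are invented for this sketch, and the intercept handling mentioned above is omitted.

```haskell
-- Illustration only: a minimal sketch of the iterative "conservative" fit.
-- The real fitting is done in the R costing code, which also constrains
-- the intercept (omitted here).
data Line = Line { intercept :: Double, slope :: Double } deriving Show

-- Ordinary least-squares fit of y = a + b*x.
olsFit :: [(Double, Double)] -> Line
olsFit pts = Line a b
  where
    n  = fromIntegral (length pts)
    mx = sum (map fst pts) / n
    my = sum (map snd pts) / n
    b  = sum [(x - mx) * (y - my) | (x, y) <- pts]
       / sum [(x - mx) ^ (2 :: Int) | (x, _) <- pts]
    a  = my - b * mx

predictAt :: Line -> Double -> Double
predictAt (Line a b) x = a + b * x

-- Fraction of the points lying on or below the line.
coverage :: Line -> [(Double, Double)] -> Double
coverage l pts =
  fromIntegral (length [() | (x, y) <- pts, y <= predictAt l x])
    / fromIntegral (length pts)

-- Fit, discard the points below the line, refit, and repeat until the line
-- lies above at least 90% of the *original* data or we have done twenty
-- iterations (the data in this PR needed only two).
conservativeFit :: [(Double, Double)] -> Line
conservativeFit original = go (20 :: Int) original
  where
    go :: Int -> [(Double, Double)] -> Line
    go k pts =
      let l = olsFit pts
      in if k <= 1 || coverage l original >= 0.9
           then l
           else go (k - 1) [(x, y) | (x, y) <- pts, y > predictAt l x]
```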

Here are the results of applying this method to the benchmark figures for serialiseData and equalsData.

SerialiseData

[plot: SerialiseData-fitted]

The bound for serialiseData underestimates 7.8% of the datapoints; for these points (the ones above the red line), the observed value exceeds the predicted value by a factor of up to 2.9x, with a mean of 2.08x (most of this happens for small sizes: see below). The prediction exceeds the observed values in the remaining 92.2% of the data, by a factor of up to 20.3x (ie, the ratio (predicted time)/(observed time) is 20.3); the mean overestimate is 4.68x.
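For concreteness, the percentages and factors quoted here (and for equalsData below) are just ratios of observed to predicted times and vice versa. A sketch of the bookkeeping, reusing the hypothetical Line and predictAt from the previous sketch:

```haskell
import Data.List (partition)

-- How the percentages and factors quoted above are computed.
data FitSummary = FitSummary
  { fracUnderestimated :: Double  -- fraction of points lying above the line
  , maxUnderFactor     :: Double  -- worst observed/predicted ratio for those points
  , meanUnderFactor    :: Double
  , maxOverFactor      :: Double  -- worst predicted/observed ratio for the rest
  , meanOverFactor     :: Double
  } deriving Show

summarise :: Line -> [(Double, Double)] -> FitSummary
summarise l pts = FitSummary
  { fracUnderestimated = fromIntegral (length under) / fromIntegral (length pts)
  , maxUnderFactor     = maxOr1  [y / p | (p, y) <- under]
  , meanUnderFactor    = meanOr1 [y / p | (p, y) <- under]
  , maxOverFactor      = maxOr1  [p / y | (p, y) <- over]
  , meanOverFactor     = meanOr1 [p / y | (p, y) <- over]
  }
  where
    withPredictions = [(predictAt l x, y) | (x, y) <- pts]
    (under, over)   = partition (\(p, y) -> y > p) withPredictions
    maxOr1  [] = 1
    maxOr1  xs = maximum xs
    meanOr1 [] = 1
    meanOr1 xs = sum xs / fromIntegral (length xs)
```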

The graph above is for Data objects of size up to about 880,000, which is quite large (and look at the times!). If we zoom in on things of size up to 5000 we get the following graph:
[plot: SerialiseData-fitted-small]

Here we see a series of observations for small objects heading upwards at a very steep angle (you can just about see these in the previous graph if you look closely at the bottom left corner), and these account for most of the large underpredictions. I'm not sure if these points represent a real trend (ie, whether we could construct larger objects which fall on the same steep line) or if they're just some peculiarity of small data. If we increase the gradient of the red line so that it lies above most of these points, then we end up with a costing function which overestimates costs for larger data by a factor of 200 or more, so we probably don't want to do that unless we really can have larger data objects which behave badly; if that is the case, then a better generator would give us data leading to a better model without having to change any R code.

EqualsData

The same method produces good results for equalsData as well. Here's what it does for the full dataset:
[plot: EqualsData-fitted]
Only 1.5% of the observations lie above the line; for these points, the observed value exceeds the predicted value by a factor of up to 1.27x, with a mean of 1.11x. The prediction exceeds the observed values in the remaining 98.5% of the data, by a factor of up to 12.9x; the mean overestimate is 2.11x.

If we zoom in on the bottom left we see that we don't get the apparently atypical observations that we got for serialiseData, even though the benchmarks for the two functions use exactly the same inputs:
[plot: EqualsData-fitted-small]

Conclusion

This method appears to give us quite accurate upper bounds on execution times for functions which have to traverse entire Data objects. Because of the non-homogeneous nature of Data these bounds are quite conservative. Note that the inferred costs are quite expensive: for example the costing function for serialiseData would charge 20.7µs for serialising an object of size 50, 40.1µs for size 100, and 309.3µs for size 1000. For equalsData the costs would be 2.1µs, 3.02µs, and 19.7µs. We could decrease costs by reverting to a standard linear model, but then we'd end up undercharging for some inputs. It would be useful to know what sort of Data objects people will be serialising in practice, and how large they are likely to be. We could also do with a better generator for Data: see SCP-3653.
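For anyone unfamiliar with the costing machinery: the costing function here is just a one-variable linear function of the argument's size, evaluated per call, as in the following sketch. The coefficients shown are placeholders for illustration, not the values fitted in this PR.

```haskell
-- A one-variable linear costing function in microseconds.  The coefficients
-- below are PLACEHOLDERS for illustration only; they are not the values
-- actually fitted in this PR.
data LinearCost = LinearCost
  { costIntercept :: Double  -- microseconds
  , costSlope     :: Double  -- microseconds per unit of Data size
  }

chargeFor :: LinearCost -> Integer -> Double
chargeFor (LinearCost c0 c1) size = c0 + c1 * fromIntegral size

-- e.g. map (chargeFor hypotheticalSerialiseDataCost) [50, 100, 1000]
hypotheticalSerialiseDataCost :: LinearCost
hypotheticalSerialiseDataCost = LinearCost 1.0 0.3  -- made-up numbers
```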

@kwxm kwxm requested review from bezirg and michaelpj March 19, 2022 12:56
@kwxm kwxm added the Benchmarks and Costing labels Mar 19, 2022
@michaelpj

> It would be useful to know what sort of Data objects people will be serialising in practice, and how large they are likely to be.

I've asked the Hydra people to send you some.

@michaelpj michaelpj left a comment

I had a brief look at the R code and it looks sensible!

```diff
@@ -13,14 +13,14 @@ library(broom, quietly=TRUE, warn.conflicts=FALSE)
 
 
 ## At present, times in the becnhmarking data are typically of the order of
-## 10^(-6) seconds. WE SCALE THESE UP TO MILLISECONDS because the resulting
+## 10^(-6) seconds. WE SCALE THESE UP TO MICROSECONDS because the resulting
```
👍

@michaelpj

Would be good for Nikos to look too.

@thealmarty

Nice! It looks from the graphs as though adding another regressor with an exponent might fit better. Not sure if we want that complexity, though.

@michaelpj

Okay, this looks good for now; we can refine it later.

@michaelpj michaelpj merged commit bd97960 into master Mar 22, 2022
@kwxm kwxm deleted the kwxm/costing/serialiseData branch March 22, 2022 14:11
@kwxm kwxm commented Mar 25, 2022

I had a closer look at the times for serialisation and equality checking. Recall that Data is defined as

```haskell
data Data =
      Constr Integer [Data]
    | Map [(Data, Data)]
    | List [Data]
    | I Integer
    | B BS.ByteString
```

I generated four different types of samples (a rough sketch of generators for these follows the list):

  • List(I): a Data object consisting of a single List node containing I objects wrapping Integers. I generated lists of length 10, 50, and 1000, each containing Integers of size (ie, memoryUsage) 10, 100, and 1000.
  • List(B): like List(I), but with bytestrings.
  • Tree(List): a tree built from List nodes, with I 0 at the leaves. The lists were of varying sizes up to length 350.
  • Tree(Map): a tree built from Map nodes (ie, lists of pairs of Data objects), with I 0 at the leaves. The lists of pairs were of varying sizes up to length 170.
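The generators themselves aren't shown here; the following sketch (with invented helper names and size parameters, not the code actually used for the benchmarks) shows roughly how such samples can be built from the Data constructors above.

```haskell
import qualified Data.ByteString.Char8 as BSC

-- Rough sketch only: the helper names and sizes are invented for
-- illustration and are not the benchmark generators.

-- List(I): a single List node containing n I nodes, each wrapping an
-- integer occupying roughly w 64-bit words (memoryUsage ~ w).
listOfI :: Int -> Int -> Data
listOfI n w = List [I (2 ^ (64 * w) - 1) | _ <- [1 .. n]]

-- List(B): as List(I), but with bytestrings of 8*w bytes instead.
listOfB :: Int -> Int -> Data
listOfB n w = List [B (BSC.replicate (8 * w) 'x') | _ <- [1 .. n]]

-- Tree(List): a tree of List nodes with I 0 at the leaves.
treeOfLists :: Int -> Int -> Data
treeOfLists depth width
  | depth <= 0 = I 0
  | otherwise  = List [treeOfLists (depth - 1) width | _ <- [1 .. width]]

-- Tree(Map): as Tree(List), but with Map nodes containing pairs.
treeOfMaps :: Int -> Int -> Data
treeOfMaps depth width
  | depth <= 0 = I 0
  | otherwise  = Map [ (treeOfMaps (depth - 1) width, treeOfMaps (depth - 1) width)
                     | _ <- [1 .. width] ]
```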

SerialiseData

The raw data for SerialiseData looks like this:

[plot: SerialiseData1]

Clearly serialising lists of Integers is much more expensive than serialising anything else. This is presumably because the CBOR encoding is quite complicated, especially for large integers.

We can get a more detailed view of what happens for the other types by increasing the vertical scale and cutting off the larger figures for List(I):
[plot: SerialiseData2]
Serialising the two different types of tree costs approximately the same and is a bit more expensive than serialising bytestrings, at least for small sizes.

The objects in these graphs are of very uniform types. Presumably, if we were to generate objects with a greater mixture of node types, containing integers and bytestrings with a wide range of sizes, we'd get a much more solid fan shape.

EqualsData

I also benchmarked EqualsData with the same data (comparing each sample with a fresh copy of the same data to get worst case times). The full set of results looks like this:
[plot: EqualsData1]
Now the cost of traversing the two tree types dominates the execution time, being significantly more expensive than comparing lists of integers and strings. This contrasts with SerialiseData, where processing large integers seems to be the most expensive aspect.

If we zoom in on the bottom of the graph we get this:
[plot: EqualsData2]
We see several rays for the List(I) and List(B) data. These are caused by the differing sizes of the list contents (recall, for example, that we have integers of size 10, 100, and 1000). Again, if the data were more mixed we'd get a more uniform fan shape.
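Incidentally, the reason for comparing each sample with a fresh copy rather than with itself is that the worst case for an equality check is two arguments which are equal but unshared, so the comparison has to traverse both structures completely and can't benefit from any sharing. A sketch of the idea, assuming the Data type quoted above with its usual derived Eq instance and BS = Data.ByteString:

```haskell
-- Sketch of the "fresh copy" idea: rebuild the value node by node so that
-- the equality check cannot benefit from sharing between its two arguments.
deepCopy :: Data -> Data
deepCopy (Constr i ds) = Constr i (map deepCopy ds)
deepCopy (Map ps)      = Map [(deepCopy k, deepCopy v) | (k, v) <- ps]
deepCopy (List ds)     = List (map deepCopy ds)
deepCopy (I n)         = I n
deepCopy (B b)         = B (BS.copy b)  -- copy the underlying bytes too

-- The worst case for equality: comparing two equal but unshared values,
-- so the comparison can never stop early.
worstCaseEqual :: Data -> Bool
worstCaseEqual d = d == deepCopy d
```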

Conclusion

The fact that we have non-uniform data but only a single measure of size is a nuisance, especially because the time depends on the size in different ways for SerialiseData and EqualsData. The cost model derived in this PR probably overcharges for serialisation of Data objects, but this is necessary because we have to defend against the genuinely expensive cost of serialising large integers, even though these may not appear in good-faith validators.

To mitigate this we'd have to at least rebalance the costs of processing the different node types (losing the close connection to memory usage in the process), or allow different builtins to use different size functions for objects of the same size (and conceivably even allow a single builtin to use different costing functions for different arguments of the same type). This wouldn't be a great technical challenge, but it would require significant changes to the costing infrastructure.
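As a purely hypothetical illustration of "different size functions for objects of the same size": a serialisation-oriented measure might weight I nodes by the size of the integers they contain more heavily than other nodes, along the lines of the sketch below. Nothing like this is implemented here; the weight is made up, and Data and BS are as in the definition quoted above.

```haskell
-- Hypothetical sketch only: a serialisation-specific size measure that
-- charges extra for large integers, in contrast to the single uniform
-- size used by the current cost model.
serialisationSize :: Data -> Integer
serialisationSize d = case d of
  Constr _ ds -> 1 + sum (map serialisationSize ds)
  Map ps      -> 1 + sum [serialisationSize k + serialisationSize v | (k, v) <- ps]
  List ds     -> 1 + sum (map serialisationSize ds)
  I n         -> 1 + integerWeight * integerWords n
  B b         -> 1 + fromIntegral (BS.length b `div` 8)
  where
    integerWeight = 4  -- made-up factor reflecting the expensive integer encoding
    -- Number of 64-bit words needed to hold n (roughly what memoryUsage reports).
    integerWords n = go (abs n) 1
      where
        go m acc =
          if m < 2 ^ (64 :: Int) then acc else go (m `div` 2 ^ (64 :: Int)) (acc + 1)
```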
