CIP-0042 | New plutus builtin: serialiseBuiltinData #218

ch1bo · 2022-02-10T13:00:12Z

This CIP proposes to add a new builtin for serialising BuiltinData to BuiltinByteString

michaelpj

Thanks for doing this! I think this is a good idea and would accept this change. Given that there is little work to do apart from costing (which the Plutus team probably needs to do anyway), the Plutus team will handle the implementation.

michaelpj · 2022-02-10T13:38:51Z

CIP-0036/README.md

+
+### Binary data format
+
+Behind the scene, we expect this function to use a well-known encoding format to ease construction of such serialisation off-chain (in particular, for non-Haskell off-chain contract codes). A natural choice of binary data format in this case is [CBOR][] which is:


This is the specification, let's specify exactly what it does! I agree, it should use exactly the off-chain CBOR encoding of Data.

That said, this will change something. Right now, I think we could change the CBOR encoding of Data to some other valid CBOR way of representing the same thing. But if we wire it in as a builtin then we must never change it, or at least be extremely careful about it.

michaelpj · 2022-02-10T13:41:26Z

CIP-0036/README.md

+
+### Cost Model
+
+The `Data` type is a recursive data-type, so costing it properly is a little tricky. The Plutus source code defines an instance of `ExMemoryUsage` for `Data` with [the following interesting note](https://github.com/input-output-hk/plutus/blob/37b28ae0dc702e3a66883bb33eaa5e1156ba4922/plutus-core/plutus-core/src/PlutusCore/Evaluation/Machine/ExMemory.hs#L205-L225):


In the terminology of CIP-35, this is the size metric for Data. Confusingly named yes...

michaelpj · 2022-02-10T13:43:06Z

CIP-0036/README.md

+
+## Alternatives
+
+* We have identified that the cost mainly stems from concatenating bytestrings; so possibly, an alternative to this proposal could be a better way to concatenate (or to cost) bytestrings (Builders in Plutus?)


While this is possible, it would probably require a much larger expansion of the number of builtins (at least one new type, probably quite a number of operations on it), which is undesirable.

We also would not be happy with these "alternatives" :)

michaelpj · 2022-02-10T13:43:53Z

CIP-0036/README.md

+
+* We have identified that the cost mainly stems from concatenating bytestrings; so possibly, an alternative to this proposal could be a better way to concatenate (or to cost) bytestrings (Builders in Plutus?)
+
+* If costing for `BuiltinData` is unsatisfactory, maybe we want have only well-known input types, e.g. `TxIn`, `TxOut`, `Value` and so on.. `WellKnown t => t -> BuiltinByteString`


This isn't really feasible: we'd have to make each of those builtin types, which we really don't want to do, especially since e.g. TxOut may not be the same in different versions of Plutus (e.g. once we add inline datums the fields will change).

michaelpj · 2022-02-10T13:46:51Z

CIP-0036/README.md

+
+In this particular context, those elements are transaction outputs (a.k.a. `TxOut`). While Plutus already provides built-in for hashing data-structure (e.g. `sha2_256 :: BuiltinByteString -> BuiltinByteString`), it does not provide generic ways of serialising some data type to `BuiltinByteString`.
+
+In an attempt to pursue our work, we have implemented [an on-chain library (plutus-cbor)][plutus-cbor] for encoding data-types as structured [CBOR / RFC 8949][CBOR] in a _relatively efficient_ way (although still quadratic, it is as efficient as it can be with Plutus' available built-ins) and measured the memory and CPU cost of encoding `TxOut` **in a script validator on-chain**. 


I'm confused: here you say that your implementation is quadratic, but below you say it's linear (and the graph agrees with that).

We would expect quadratic growth, but only see linear growth. Maybe due to this: https://input-output-rnd.slack.com/archives/C21UF2WVC/p1644899404438899?thread_ts=1644875367.010429&cid=C21UF2WVC?

^ I think that's unlikely to account for this. Costs of bytestring operations are based on the size of the bytestring, and that was a change that changed the size of the empty bytestring from 1 to 0 and hence caused slight cost differences between the cost of comparing bytestrings in two different Plutus Core versions. It shouldn't affect the basic shape of the cost function for appending bytestrings, which is linear in the sum of the sizes of the arguments (so the total cost of repeated applications with increasingly large inputs should indeed be quadratic).

I think these graphs are in fact parabolas, but you haven't gone far enough out to see that. I increased the maximum list size in `Main.hs` to 500 and changed `pparams` in `Validator.hs` to `def{_maxTxExUnits = ExUnits 9999999999999999 9999999999999999}` and I got the following graph (blue=memory, green=cpu):

This does look a bit more quadratic. I think the explanation for the original graphs is that bytestring concatenation is pretty cheap. The current cpu cost of calling our appendBytestring function with arguments of size a and b is given by 396231 + 621*(a+b) (see here), which increases very slowly with a and b (sizes here are unfortunately measured in 64-bit words, not bytes, and the literal numbers are notionally picoseconds). If you add up a lot of these, a and b have to get pretty big before the a+b term becomes significant relative to the constant term, so this makes the total cost of adding up lots of bytestrings look linear for small inputs. I couldn't see how to work out the sizes of the bytestrings involved in your examples, but if we knew that then we could check that they're not large enough to cause quadratic behaviour to become apparent.

The memory cost is just given by a+b (ie, the size of the result) and again is measured in 64-bit words, so I think you'd have to be adding up quite a large number of large bytestrings to see quadratic memory usage; again, knowing the sizes of the things being concatenated would be useful.

We got the numbers for the cpu cost function by using Criterion to run the appendBytestring function with inputs of sizes up to 5000, and then fitting a linear function to the execution times; the model fits the data pretty closely, so I think that concatenation times do increase pretty slowly, presumably because the underlying function in Data.ByteString is calling C's memcpy to do the hard work.
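
To make the argument above concrete, here is a small sketch that adds up the quoted per-call CPU cost, `396231 + 621*(a+b)`, over a run of appends onto a growing accumulator. The workload (equal-sized chunks appended one at a time) and the names `appendCpu`, `totalAppendCpu`, `totalAppendCpuClosed` and `quadraticShare` are assumptions made for illustration; they are not taken from plutus-cbor or from the Plutus cost model code.

```haskell
-- CPU cost (notionally picoseconds) of a single append, using the constants
-- quoted in the comment above. Sizes a and b are in 64-bit words.
appendCpu :: Integer -> Integer -> Integer
appendCpu a b = 396231 + 621 * (a + b)

-- Total CPU cost of building a bytestring by appending n chunks of `chunk`
-- words each onto a growing accumulator (a hypothetical workload).
-- At step i the accumulator already holds (i - 1) * chunk words.
totalAppendCpu :: Integer -> Integer -> Integer
totalAppendCpu chunk n =
  sum [appendCpu ((i - 1) * chunk) chunk | i <- [1 .. n]]

-- Closed form of the same sum: a term linear in n plus a term quadratic in n.
totalAppendCpuClosed :: Integer -> Integer -> Integer
totalAppendCpuClosed chunk n =
  396231 * n + 621 * chunk * n * (n + 1) `div` 2

-- Fraction of the total cost contributed by the quadratic term (for n >= 1).
quadraticShare :: Integer -> Integer -> Double
quadraticShare chunk n =
  fromIntegral quadratic / fromIntegral (totalAppendCpuClosed chunk n)
  where
    quadratic = 621 * chunk * n * (n + 1) `div` 2
```

Loading this in GHCi and comparing `quadraticShare 1 10` with `quadraticShare 1 500` shows how small the quadratic contribution is for short runs, which is consistent with the curves looking linear until the list size is pushed much further out.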

michaelpj · 2022-02-10T13:49:40Z

CIP-0036/README.md

+* Favoring manipulation of structured `Data` is an appealing alternative to many `ByteString` manipulation use-cases;
+* CBOR as encoding is a well-known and widely used standard in Cardano, existing tools can be used;
+* The hypothesis on the cost model here is that serialisation cost would be proportional to the `ExMemoryUsage` for `Data`; which means, given the current implementation, proportional to the number and total memory usage of nodes in the `Data` tree-like structure.
+* Benchmarking the costs of serialising `TxOut` values between [plutus-cbor][] and [cborg][] confirms [cborg][] and the existing [encodeData][]'s implementation in Plutus as a great candidate for implementing the built-in: 


This is a comparison in Haskell, right? i.e. of the Haskell version of plutus-cbor versus the "main" Haskell CBOR implementation.

Worth calling out, because it's therefore not necessarily indicative of the difference between the plutus-cbor version compiled to PLC and the builtin version. I expect the improvement will be even greater, but I'm not 100% sure. At any rate it will probably not look quite like this.

Yes, it's in Haskell.

michaelpj · 2022-02-10T13:52:30Z

CIP-0036/README.md

+We define a new Plutus built-in function with the following type signature:
+
+```hs
+serialiseBuiltinData :: BuiltinData -> BuiltinByteString


Suggested change

-serialiseBuiltinData :: BuiltinData -> BuiltinByteString
+serialiseData :: data -> bytestring

To fit PLC rather than Haskell source. The types are just data and bytestring in PLC. Applies throughout the doc.

michaelpj · 2022-02-10T13:54:06Z

CIP-0036/README.md

+
+- [ ] Using the existing _sizing metric_ for `Data`, we need to determine a costing function (using existing tooling / benchmarks? TBD)
+- [ ] The Hydra Team creates a PR which adds the built-in to PlutusV1 and PlutusV2 and uses a suitable cost function
+- [ ] Release it as a backward-compatible change within the next hard-fork


Releasing the change is out of scope for the CIP process, I believe.

michaelpj · 2022-02-11T09:55:49Z

CIP-0036/README.md

+* Such built-in is generic enough to also cover a wider set of use-cases, while nicely fitting ours;
+* Favoring manipulation of structured `Data` is an appealing alternative to many `ByteString` manipulation use-cases;
+* CBOR as encoding is a well-known and widely used standard in Cardano, existing tools can be used;
+* The hypothesis on the cost model here is that serialisation cost would be proportional to the `ExMemoryUsage` for `Data`; which means, given the current implementation, proportional to the number and total memory usage of nodes in the `Data` tree-like structure.


This should actually be fairly easy to validate: we have random generators for Data, so you could just run those through criterion and plot the serialization time against the size metric.
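
As a rough illustration of what "size metric" means in this hypothesis, the sketch below computes a node-count-plus-contents size over a local stand-in for `Data`. The local `Data` type only mirrors the shape of the Plutus type, and `nodeOverhead`, `integerWords`, `byteStringWords` and `dataSize` are assumptions for illustration; the real `ExMemoryUsage` instance in Plutus may use different constants and rounding.

```haskell
import qualified Data.ByteString as BS

-- Local stand-in that mirrors the shape of Plutus' Data type.
data Data
  = Constr Integer [Data]
  | Map [(Data, Data)]
  | List [Data]
  | I Integer
  | B BS.ByteString

-- Hypothetical flat overhead charged for every node in the tree.
nodeOverhead :: Integer
nodeOverhead = 4

-- Size of an integer in 64-bit words (at least one word).
integerWords :: Integer -> Integer
integerWords n = max 1 (go (abs n))
  where
    go 0 = 0
    go m = 1 + go (m `div` 2 ^ (64 :: Int))

-- Size of a bytestring in 64-bit words, rounded up.
byteStringWords :: BS.ByteString -> Integer
byteStringWords bs = (fromIntegral (BS.length bs) + 7) `div` 8

-- Size metric: per-node overhead plus the size of the node's contents,
-- summed over the whole tree in a single traversal.
dataSize :: Data -> Integer
dataSize d = nodeOverhead + case d of
  Constr _ fields -> sum (map dataSize fields)
  Map entries     -> sum [dataSize k + dataSize v | (k, v) <- entries]
  List items      -> sum (map dataSize items)
  I n             -> integerWords n
  B bs            -> byteStringWords bs
```

Plotting measured serialisation time against a metric of this kind, for randomly generated values, is the check suggested above: if the hypothesis holds, the points should fall close to a straight line.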

L-as · 2022-02-11T10:14:17Z

Would you ever serialise a data without hashing it immediately? Wouldn't dataHash be more convenient, in that you're guaranteed to use the same hashing algorithm as what's used for txInfoData?

michaelpj · 2022-02-11T12:14:15Z

> Would you ever serialise a data without hashing it immediately? Wouldn't dataHash be more convenient, in that you're guaranteed to use the same hashing algorithm as what's used for txInfoData?

We don't want to specify the hashing algorithm, users might want to use different ones. Matching the hashing of datums in txInfoData isn't currently a goal of this CIP (maybe it should be?). That is conceivably something that you could want, if you wanted to check the datum for an output that only specifies a hash, given that you can compute the datum in the validator. Of course, having serialiseData also lets you do that just fine.

L-as · 2022-02-11T14:21:00Z

Why would you want to use a different hashing algorithm? I honestly can't imagine a single use case.

michaelpj · 2022-02-11T16:40:40Z

> Why would you want to use a different hashing algorithm? I honestly can't imagine a single use case.

I think we should generally let our users decide that. This way is more compositional, so we don't have to put ourselves in the position of judging that there are "no use cases" and then being wrong.

Also it lets you do other things, like serialize it and then sign the serialized form.

L-as · 2022-02-15T09:39:24Z

Well that makes no sense. You'd sign the hash anyway.

L-as · 2022-02-15T09:42:09Z

Do we want deserialiseData :: ByteString -> Data too? I get the argument you make about this being more compositional, but dataHash would likely be substantially more efficient due to not having to pass in the entire serialised bytestring into the interpreter. I'm not sure how you implement off-chain dataHash now, but it theoretically does not need to keep the entire serialised bytestring in memory at any one time.

michaelpj · 2022-02-15T10:53:42Z

> Do we want deserialiseData :: ByteString -> Data too?

No, deserialization is far more complex than serialization. Serialization is one pass over the input data, deserialization may involve backtracking, error handling, who knows what. Much harder to cost.

> I get the argument you make about this being more compositional, but dataHash would likely be substantially more efficient due to not having to pass in the entire serialised bytestring into the interpreter.

I doubt it would be a very large bytestring. We're not talking about working with gigabytes of data here.

L-as · 2022-02-15T15:27:09Z

I made #222. I still feel like dataHash is the approach to go for.

michaelpj · 2022-02-15T16:33:25Z

(I'd like to see this CIP proposal include a discussion of dataHash as an alternative, regardless!)

ch1bo · 2022-02-15T18:13:19Z

It's true that the Hydra use case would also hash the resulting ByteString (for verifying Merkle Tree proofs), but we felt that sha3_256 . serialiseBuiltinData (or a different hash algorithm) is straight-forward to use, does not add any additional cost, is more flexible and does not increase the scope of this CIP unnecessarily.

Probably worth listing in the alternatives section.
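
The compositional approach discussed in this thread can be sketched as follows. This assumes a `plutus-tx` version whose `PlutusTx.Builtins` module exports `serialiseData` alongside the hashing builtins; `hashDataWith` and `exampleHash` are hypothetical names introduced here, not part of the proposal.

```haskell
import PlutusTx.Builtins
  ( BuiltinByteString
  , BuiltinData
  , mkI
  , serialiseData
  , sha2_256
  )

-- Hypothetical helper: serialise once, then apply whichever hash function
-- the caller picks (sha2_256, sha3_256, blake2b_256, ...).
hashDataWith
  :: (BuiltinByteString -> BuiltinByteString)
  -> BuiltinData
  -> BuiltinByteString
hashDataWith hash = hash . serialiseData

-- Example: hash the Data encoding of the integer 42 with SHA2-256.
exampleHash :: BuiltinByteString
exampleHash = hashDataWith sha2_256 (mkI 42)
```

A dedicated `dataHash` builtin would fix the hash function in advance, whereas this shape leaves the choice to the script author and also leaves the serialised bytes available for other uses.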

L-as · 2022-02-15T19:18:03Z

If you look at my CIP you'll notice that it's conceptually a whole lot simpler.

L-as · 2022-02-15T19:22:11Z

I think it's also clear that `dataHash` would be more efficient given how hashing works, as I've explained in the CIP I made.

KtorZ · 2022-04-12T17:25:30Z

@ch1bo @KtorZ (can I ping myself? 🤔)

Proposal approved to be merged as proposed under the condition that the strategy for the cost model that's been decided with the Plutus is outlined in the proposal.

ch1bo

@KtorZ Updated the proposal with latest developments and updated its CIP code to 42.

I also added a note on the binary data format of Data 👇

ch1bo · 2022-04-13T15:30:40Z

CIP-0042/README.md

+
+- [x] Using the existing _sizing metric_ for `Data`, we need to determine a costing function (using existing tooling / benchmarks? TBD)
+- [x] The Plutus team updates plutus to add the built-in to PlutusV1 and PlutusV2 and uses a suitable cost function
+- [ ] The binary format of `Data` is documented and embraced as an interface within `plutus`.


@michaelpj I imagine the binary CBOR format which you chose is the same as was existing before (should be the one also sketched above) and we want to stick with it? If yes, we ought to document it and make sure it's not changing accidentally without proper versioning. What do you think?

@ch1bo not sure what you mean by 'document it' 🤔 ? What more than what we did already (except maybe moving it to a separate file)?

That. Document it for users and maintainers. It's not easily discoverable here and might be better located in plutus. Also it should be checked that the actual binary format stays that way etc.

ch1bo · 2022-05-03T09:36:27Z

I think this should not be labeled as Waiting for Author anymore!?

ch1bo · 2022-05-06T09:48:14Z

Likely useful for any future readers: @kwxm has done a great job on a deeper analysis on how serialiseData will improve on script costs in the hands-on case study of Hydra over here: https://github.com/input-output-hk/hydra-poc/blob/7006b630de27dde2a2c93e238f96121b42b29ff6/SERIALISATION.md

TL;DR: a 50-60% improvement in script costs over the on-chain CBOR encoding we had been using, measured on the Plutus version we were using at the time.

KtorZ

cc @SebastienGllmt @rphair @crptmppt

I'd like to merge that one as discussed in previous meetings, though we haven't approved it formally.

L-as · 2022-06-16T15:33:49Z

Why don't we have deserialiseData?

michaelpj · 2022-06-16T16:45:37Z

You asked that before and I replied here: #218 (comment)

Draft a new plutus builtin: serialiseBuiltinData

cfa21e7

ch1bo force-pushed the ch1bo/new-plutus-builtin-serialiseBuiltinData branch from 370a7cc to cfa21e7 Compare February 10, 2022 13:02

Fix cip header

2d1b6a8

ch1bo force-pushed the ch1bo/new-plutus-builtin-serialiseBuiltinData branch from 3195f65 to 2d1b6a8 Compare February 10, 2022 13:06

JaredCorduan mentioned this pull request Feb 10, 2022

Question about txInfoData IntersectMBO/cardano-ledger#2649

Closed

michaelpj approved these changes Feb 10, 2022

kwxm mentioned this pull request Feb 10, 2022

Fix typo in comment IntersectMBO/plutus#4390

Merged

9 tasks

michaelpj reviewed Feb 11, 2022

L-as mentioned this pull request Feb 15, 2022

CIP-0043 | New Plutus Core built-in dataHash #222

Closed

bezirg mentioned this pull request Mar 4, 2022

SCP-2417: Add builtin function: serialiseData IntersectMBO/plutus#4447

Merged

9 tasks

KtorZ changed the title ~~New plutus builtin: serialiseBuiltinData~~ CIP-42? | New plutus builtin: serialiseBuiltinData Mar 17, 2022

ch1bo mentioned this pull request Mar 22, 2022

Spike: Switch to babbage and use serializeBuiltinData cardano-scaling/hydra#280

Closed

2 tasks

mangelsjover added the State: Waiting for Author Proposal showing lack of documented progress by authors. label Apr 12, 2022

ch1bo added 4 commits April 13, 2022 17:25

Update CIP code to 42

f7e1e48

Update section about the cost model

b36a38e

Rename serialiseBuiltinData -> serialiseData

4af308a

The plutus team chose this terminology and it's more consistent with other builtins like `equalsData`.

Add a step on the path to active to ensure binary compatibility going forward

d1c06ba

ch1bo commented Apr 13, 2022

KtorZ removed the State: Waiting for Author Proposal showing lack of documented progress by authors. label May 11, 2022

KtorZ approved these changes May 11, 2022

KtorZ changed the title ~~CIP-42? | New plutus builtin: serialiseBuiltinData~~ CIP-0042? | New plutus builtin: serialiseBuiltinData May 11, 2022

KtorZ changed the title ~~CIP-0042? | New plutus builtin: serialiseBuiltinData~~ CIP-0042 | New plutus builtin: serialiseBuiltinData May 11, 2022

KtorZ added Candidate CIP labels May 11, 2022

rphair approved these changes May 11, 2022

rphair merged commit bd40b8c into cardano-foundation:master May 11, 2022

perturbing mentioned this pull request Jul 14, 2022

CIP-0068 | Datum Metadata Standard #299

Merged

effectfully mentioned this pull request Jun 20, 2023

On-chain function to convert from Integer to ByteString, or to parse ByteString into Integer IntersectMBO/plutus#3657

Closed

4 tasks

rphair mentioned this pull request Dec 22, 2023

CIP-0042 | Adjust preamble and structure w.r.t CIP-0001 #698

Merged

