Custom serialization #41
Merged
@@ -0,0 +1,139 @@ | ||
- Feature Name: custom_serialization | ||
- Start Date: 2016-10-07 | ||
- RFC PR: | ||
- Pony Issue: | ||
|
||
# Summary | ||
|
||
This feature would allow the programmer to specify custom serialization/deserialization that would be run as part of Pony's built-in serialization/deserialization process. The custom methods would run after the built-in serialization and deserialization, and would allow the programmer to:
* specify the number of bytes to use for serialization | ||
* write bytes to the serialization buffer | ||
* read bytes from the serialization buffer | ||
|
||
# Motivation | ||
|
||
This is primarily intended as a way to allow programmers to provide systems for serializing and deserializing objects that contain `Pointer` fields. Currently the runtime will raise an error if an object that contains a `Pointer` field is serialized. For example, consider a situation where a Pony program has logic that is implemented in C and has objects that store pointers to data that is used by C code. It is currently impossible to call `Serialise.apply(...)` on the objects to create serialized representations of them. This becomes an especially pressing issue when attempting to write API code for handling user-created objects that may or may not contain `Pointer` fields, depending on the implementation choices made by the user. | ||
|
||
# Detailed design | ||
|
||
A class providing its own serialization system would need to implement methods for serializing and deserializing data, as well as a method for conveying the number of additional bytes needed for custom serialization of the object. The runtime would call these methods at the appropriate time to generate serialized data and deserialize that data. Pony's built-in serialization would still be performed on the objects; this system is intended to allow *additional* data to be stored in the serialized representation of the object and recovered by deserialization.
|
||
## What Gets Serialized And Deserialized | ||
|
||
The intent of this system is to allow the programmer to specify a way to use Pony's existing serialization system to work with objects that contain `Pointer` fields to C data structures. Consequently the expectation is that the system would only be used to serialize and deserialize `Pointer` fields, since the other fields are already serialized by Pony's built-in system. However, there is nothing preventing the user from including information from other fields in the serialized representation, nor from using the serialized data to modify non-`Pointer` fields during deserialization. | ||
|
||
## Methods | ||
|
||
All of the following methods must be implemented for custom serialization: | ||
* `fun _serialise_space(): USize` -- returns the number of bytes to reserve for custom serialization | ||
* `fun _serialise(bytes: Pointer[U8] tag)` -- takes in a pointer to the location in the serialization buffer that has been reserved for this object's extra data, writes a serialized representation of its data to the buffer | ||
* `fun ref _deserialise(bytes: Pointer[U8] tag)` -- takes in a pointer to the location in the deserialization buffer that represents the object's extra data, reads the data out, and modifies the object using that data | ||
|
||
## Behavior Changes | ||
|
||
Currently the runtime raises an error if the program attempts to serialize an object that has a `Pointer` field. This would need to be changed to allow these objects to be serialized. | ||
|
||
## Serialization Format | ||
|
||
Currently the serialization format represents an object in a byte array like this: | ||
|
||
``` | ||
address : [ word 1] [ word 2] [ word 3] [ word 4] ... [ word X] [ X+1 ] [ X+2 ] | ||
value : [type id] [field 1] [field 2] [field 3] ... [type id] [field 1] [field 2] | ||
[------------------- object 1 --------] ... [-------- object 2 ---------] | ||
``` | ||
|
||
The `type id` is the index of the object's type in the class descriptor table. Each field is either:
1. A raw value that represents a type such as an integer or floating point number
2. A number that gives the offset, within the byte array, of the serialized representation of the object stored in this field
|
||
(Strings and arrays are handled in a similar but distinct manner, but we can avoid that discussion for now.) | ||
|
||
Assume we have code like this: | ||
|
||
```
class Foo
  let _a: Bar = Bar
  let _b: U64 = 14
  let _c: U64 = 19

class Bar
  let _f1: U64 = 24
  let _f2: U64 = 27

// ...
let x = Foo
let sx = Serialise(x, auth)
```
|
||
then the serialized form stored by `sx` might look like this (assuming 8-byte words): | ||
|
||
``` | ||
address: [ 0x00 ] [ 0x08 ] [ 0x10 ] [ 0x18 ] [ 0x20 ] [ 0x28 ] [ 0x30 ] | ||
value: [ 135 ] [ 0x20 ] [ 14 ] [ 19 ] [ 95 ] [ 24 ] [ 27 ] | ||
[----------- Foo instance ------------] [-------- Bar instance -----] | ||
``` | ||
|
||
In this case, the `Foo` class has a `type id` of `135` (address `0x00`), the first field of that object (address `0x08`) points to the object that will be deserialized from position `0x20`, the `Bar` class has a `type id` of `95` (address `0x20`), and the rest of the fields are filled with representations of their numeric values. | ||
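
The layout above can be traced by hand. The following C sketch is only an illustration, not the runtime's actual code: it models the buffer as raw bytes holding 8-byte words, and the `Foo`/`Bar` structs, `word_at`, and `read_foo` are hypothetical names introduced for this example. The type ids `135` and `95` are the ones from the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory shapes for the two example classes. */
typedef struct { uint64_t f1, f2; } Bar;
typedef struct { Bar a; uint64_t b, c; } Foo;

/* Read one 8-byte word at a byte offset in the serialized buffer. */
static uint64_t word_at(const uint8_t *buf, size_t offset) {
  uint64_t w;
  memcpy(&w, buf + offset, sizeof w);
  return w;
}

/* Rebuild a Foo whose representation starts at `offset`.
   Returns 0 on success, -1 if a type id does not match. */
static int read_foo(const uint8_t *buf, size_t offset, Foo *out) {
  assert(buf != NULL && out != NULL);
  if (word_at(buf, offset) != 135) return -1;        /* type id of Foo */

  /* The first field is not a value: it is the offset of the Bar object. */
  size_t bar_off = (size_t)word_at(buf, offset + 8);
  if (word_at(buf, bar_off) != 95) return -1;        /* type id of Bar */

  out->a.f1 = word_at(buf, bar_off + 8);
  out->a.f2 = word_at(buf, bar_off + 16);
  out->b = word_at(buf, offset + 16);
  out->c = word_at(buf, offset + 24);
  return 0;
}
```

The point of the sketch is that the `Bar` instance is located through the value stored in the first field of `Foo`, not by assuming it directly follows the `Foo` fields.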
|
||
### Change To The Serialization Format | ||
|
||
A class that provides custom serialization will provide a method called `_serialise_space()` that returns the number of bytes that must be added to the end of the object's representation for additional serialization data. The `_serialise_space()` function could always return the same value if all objects of the class are serialized to the same number of bytes, or it could calculate a value based on the serialization format and the size of the object being serialized. The details are entirely up to the implementer. The extra serialization data will appear after the object's fields and before the next object in the byte array. | ||
|
||
Changing the last example slightly, assume we have code like this: | ||
|
||
```
class Foo
  let _a: Bar = Bar
  let _b: U64 = 14
  var _c: Pointer[U8] = Pointer[U8]

  fun _serialise_space(): USize => 8

  fun _serialise(bytes: Pointer[U8] tag) =>
    // write 0xBEEF to `bytes`
    None

  fun ref _deserialise(bytes: Pointer[U8] tag) =>
    // ...
    None

class Bar
  let _f1: U64 = 24
  let _f2: U64 = 27

// ...
let x = Foo
let sx = Serialise(x, auth)
```
|
||
then the serialized form stored by `sx` might look like this (assuming 8-byte words): | ||
|
||
``` | ||
address: [ 0x00 ] [ 0x08 ] [ 0x10 ] [ 0x18 ] [ 0x20 ] [ 0x28 ] [ 0x30 ] [ 0x38 ] | ||
value: [ 135 ] [ 0x28 ] [ 14 ] [ 19 ] [ 0xBEEF] [ 95 ] [ 24 ] [ 27 ] | ||
[----------- Foo instance ----------------------] [-------- Bar instance -----] | ||
``` | ||
|
||
Addresses `0x20` through `0x27` contain the extra data generated by the custom serializer. The deserializer is responsible for converting the serialized representation into a deserialized object and assigning that object to the correct field. | ||
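
The offsets in this table follow from simple arithmetic: each object occupies one word for its type id, one word per field, and then whatever `_serialise_space()` returned. The C sketch below reproduces the table under that assumption. It is not the runtime's implementation, and `object_size`, `put_word`, and `write_example` are hypothetical helpers named for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { WORD = 8 };

/* Bytes occupied by one object: type id, fields, then the custom area. */
static size_t object_size(size_t field_count, size_t serialise_space) {
  return WORD * (1 + field_count) + serialise_space;
}

/* Store one 8-byte word at a byte offset in the buffer. */
static void put_word(uint8_t *buf, size_t offset, uint64_t value) {
  memcpy(buf + offset, &value, WORD);
}

/* Lay out the example Foo (3 fields, 8 custom bytes) followed by Bar.
   Returns the total number of bytes written. */
static size_t write_example(uint8_t *buf, size_t buf_len) {
  size_t foo_off = 0;
  size_t extra_off = foo_off + object_size(3, 0);  /* 0x20 */
  size_t bar_off = foo_off + object_size(3, 8);    /* 0x28 */
  size_t total = bar_off + object_size(2, 0);      /* 0x40 */
  assert(buf != NULL && buf_len >= total);

  put_word(buf, foo_off, 135);          /* type id of Foo */
  put_word(buf, foo_off + 8, bar_off);  /* _a: offset of the Bar object */
  put_word(buf, foo_off + 16, 14);      /* _b */
  put_word(buf, foo_off + 24, 19);      /* third field slot, value as shown in the table */
  put_word(buf, extra_off, 0xBEEF);     /* what _serialise would write */
  put_word(buf, bar_off, 95);           /* type id of Bar */
  put_word(buf, bar_off + 8, 24);       /* _f1 */
  put_word(buf, bar_off + 16, 27);      /* _f2 */
  return total;
}
```

Because the custom area is counted as part of the `Foo` representation, the offset stored in the first field moves from `0x20` to `0x28`, which is the only change the built-in fields see.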
|
||
# How We Teach This | ||
|
||
This should be taught as part of the C FFI documentation in the tutorial because it is intended to be used by objects that already interact with the C FFI in some way. There should also be a Pony pattern that addresses how to serialize and deserialize objects that have `Pointer` fields. | ||
|
||
This is an advanced feature, so it would not change the way that Pony is taught to new users. | ||
|
||
An email to the user mailing list and inclusion in the tutorial should be sufficient for letting existing users know about the feature. | ||
|
||
# How We Test This | ||
|
||
This should be tested in a way that is similar to how the existing C FFI unit tests work. The appropriate Pony class functions should be provided, which in turn call C functions that do the necessary work. An object will be created, serialized, and deserialized, and then the two objects will be compared for equality. This will require another C function to compare the structures that are pointed to. | ||
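
As a rough sketch of the C side of such a test, the helpers below serialize, deserialize, and compare a small structure. Everything here is an assumption made for illustration: `point_t` and the `point_*` functions are hypothetical, and the Pony methods `_serialise_space`, `_serialise`, and `_deserialise` would be expected to call them through the FFI.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical C structure that a Pony object would hold a Pointer to. */
typedef struct { int32_t x; int32_t y; } point_t;

/* Would back `_serialise_space`: bytes needed for one point. */
size_t point_serialise_space(void) { return 2 * sizeof(int32_t); }

/* Would back `_serialise`: copy the fields into the reserved bytes. */
void point_serialise(const point_t *p, uint8_t *bytes) {
  assert(p != NULL && bytes != NULL);
  memcpy(bytes, &p->x, sizeof p->x);
  memcpy(bytes + sizeof p->x, &p->y, sizeof p->y);
}

/* Would back `_deserialise`: allocate a new point from the reserved bytes.
   Returns NULL if allocation fails; the caller owns the result. */
point_t *point_deserialise(const uint8_t *bytes) {
  point_t *p = malloc(sizeof *p);
  if (p == NULL) return NULL;
  memcpy(&p->x, bytes, sizeof p->x);
  memcpy(&p->y, bytes + sizeof p->x, sizeof p->y);
  return p;
}

/* Comparison helper for the test: 1 if both points hold the same values. */
int point_equal(const point_t *a, const point_t *b) {
  return a->x == b->x && a->y == b->y;
}
```

The comparison function is needed because the original and the deserialized object hold different pointers, so equality has to be checked on the pointed-to data.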
|
||
# Drawbacks | ||
|
||
This implementation involves working with pointers and assumes the use of the C FFI, so it could encourage the use of unsafe code, which would undermine one of the main features of Pony. It places most of the burden of doing the right thing on the programmer. | ||
|
||
The plan allows the runtime to serialize objects with pointer fields. If the user does not provide the appropriate serialization methods, then the deserialized object will contain a null pointer, which will most likely cause the program to crash if it attempts to access that field.
|
||
There is an added runtime cost associated with checking for the existence of serialization functions. | ||
|
||
# Alternatives | ||
|
||
A programmer can create a serialization system of their own if they wish to do so. It will not be usable by code that relies on the built-in serialization system, so the program would need to differentiate between classes that use the built-in mechanism and ones that provide their own serialization.
|
||
# Unresolved questions | ||
|
||
There is still a question of which calls should be compiled into the program and which should be made at run time. Because of the way that serialization is implemented, it may be easier to determine whether or not a type provides deserialization functions at run time than to conditionally include them at compile time. From my initial investigation, it appears that the call to `_serialise` can be added at compile time, but the calls to `_serialise_space(...)` and `_deserialise(...)` would be easier to make at run time. There are probably ways to do everything at compile time, but I believe this would make the code more difficult to reason about. Having said that, the serialization system is already complex in several ways, so perhaps adding a marginal amount of extra complexity would be worth the marginal performance improvement.
I think this should be `fun tag` to make sure the result is a "constant". That is, a `fun tag` with no arguments should always produce the same result every time, barring any use of capability-insecure ambient authority.

I'm mainly concerned about the user accessing fields to produce the `USize` result for this method - it seems like this needs to be a constant value so it will be the same on the `serialise` end and the `deserialise` end.

Never mind this point - it seems like the ability to read from fields is part of your intention. I suppose as long as the `_serialise_space` is written as part of the serialised representation it can be read on the other side without calling the method.

The intention is that the API user is responsible for storing the size of the representation if that is required for deserialization. The size would be stored as part of the representation. I believe this is the ideal solution because it gives the developer more control over how much space is used by custom serialization, and also because storing the size and then passing it to the `_deserialise` function doesn't provide a real benefit in terms of safety, nor in terms of convenience in many cases.

Assume that we have an object with a field that is always serialized into a representation of the same size. In that case, storing the size and passing it as part of the call to `_deserialise` doesn't provide any useful information, because the API user already knows how many bytes to read and can therefore ignore the passed size. Storing the size in this case is wasted space in the representation.

Assume that we have an object with a field whose representation can vary in size (like an array of integers). In this case, the size could be used by the API user to avoid reading beyond the end of the byte array of the representation. The argument here is that there is a degree of safety given by providing the size and using it to stay within bounds. However, the API user is free to ignore that size parameter, or may make a mistake in implementing the logic that does bounds checking, at which point we are no safer than we were if we made the user responsible for the decisions of whether and how to pass size information.

Assume that we have an object with two pointer fields. Since there is only one serialization area for both objects, the overall size could be passed to `_deserialise`, but then the function would still have to deal with determining the size of each of the individual representations. As above, the only use I could see for having this size information is to use it to avoid reading off the end of the buffer, but the user could ignore this value or use it incorrectly, in which case it becomes useless.

Having Pony store the size and then pass it to the `_deserialise` function provides a small amount of convenience (but not safety) in some (but not all) cases; consequently I think that storing information about the size of a representation should be left to the API user.

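
To make the variable-size case concrete, here is a minimal C sketch of a custom representation that stores its own element count, so nothing beyond the byte pointer has to be passed to `_deserialise`. The `ints_*` functions and the choice of an 8-byte count prefix are assumptions made for this illustration; they are not part of the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Space to reserve: an 8-byte count followed by the elements. */
static size_t ints_serialise_space(size_t count) {
  return sizeof(uint64_t) + count * sizeof(int32_t);
}

/* Write the count, then the elements, into the reserved bytes. */
static void ints_serialise(const int32_t *items, size_t count, uint8_t *bytes) {
  assert(bytes != NULL);
  uint64_t n = count;
  memcpy(bytes, &n, sizeof n);
  if (count > 0) memcpy(bytes + sizeof n, items, count * sizeof(int32_t));
}

/* Read the count back first, then copy at most `capacity` elements.
   Returns the stored count so the caller can detect truncation. */
static size_t ints_deserialise(const uint8_t *bytes, int32_t *out, size_t capacity) {
  assert(bytes != NULL);
  uint64_t n;
  memcpy(&n, bytes, sizeof n);
  size_t take = n < capacity ? (size_t)n : capacity;
  if (take > 0) memcpy(out, bytes + sizeof n, take * sizeof(int32_t));
  return (size_t)n;
}
```

An object whose representation always has the same size would simply omit the count prefix, which is the space saving argued for above.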
If the serialised representation does not have a way of indicating how many bytes of "user space" were granted (requested by the `_serialise_space` method), how is the Pony implementation able to deserialise the buffer? You point to the overhead of storing an "extra" size-indicating word, but surely the deserialisation mechanism will have to have some way of knowing where the user space ends and the next serialised object begins.

@jemc it depends on what you mean by "next serialized object begins". In the serialized representation, each field contains either a value (if the field stores a numeric value) or a "pointer" to the place in the buffer that represents the object that goes in the field. So there isn't really a concept of "next" in the representation, only fields that point to other objects in the buffer. For all objects except `String`s and `Array`s, the size of the object's representation is dictated by the object itself, so there is no need to store the size. In the case of `String`s and `Array`s, the size is stored because these data structures are inherently variable in length.

Just to be clear here, deserialization (as currently implemented) doesn't work by linearly running through the buffer. Rather, the root object is the first object that appears in the buffer. From then on, the location in the buffer of other objects is determined by the location in the buffer indicated by the field pointer. You can actually stick big chunks of meaningless bytes in the representation and it will be fine as long as there are no fields that point to those bytes. There is no "next serialized object", only objects that were indicated by fields in another object.

Got it, thanks for explaining.