
namedtuple or dataclasses to implement a general format for data 'packets' #66

Open
sneakers-the-rat opened this issue Feb 4, 2021 · 3 comments


@sneakers-the-rat
Contributor

Stubbing out an idea:

In many cases data needs to move around the system in 'packets' -- e.g. a frame from a camera needs both the image array and its timestamp, a task stage will return a prespecified set of data, and so on.

We want some predictable way of

a) specifying what fields we expect to have in a given 'packet',
b) providing a uniform way of accessing fields of data -- e.g. for transformations, we don't want to have to write a thousand subclasses to handle whether the data is a pandas dataframe or a numpy array...
c) handling common transformations like unit conversions, serialization, and compression.

Dataclasses seem like the natural builtin way of doing this; we may also want to provide some interface for declaring them on the fly, more in the syntax of namedtuple.
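A minimal sketch of that idea (the names `Frame` and the fields here are hypothetical, not part of Autopilot): a statically declared dataclass packet alongside its `make_dataclass` equivalent for namedtuple-style, on-the-fly declaration.

```python
from dataclasses import dataclass, make_dataclass, asdict

# Hypothetical packet declared up front:
@dataclass(frozen=True)
class Frame:
    timestamp: float
    image: list  # would be a numpy array in practice

# ...and the on-the-fly, namedtuple-style equivalent:
Frame2 = make_dataclass(
    "Frame2", [("timestamp", float), ("image", list)], frozen=True
)

f = Frame(timestamp=1612400000.0, image=[[0, 1], [1, 0]])
# asdict gives a uniform dict view for transformations/serialization
print(asdict(f))
```

Frozen instances give tuple-like immutability while keeping named-field access, which covers points (a) and (b) above.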

@jerlich

jerlich commented Feb 6, 2021

My current thought on this is to use msgpack: if you always have "name", "type", and "data" fields, you can essentially create a minimal type system that is cross-platform (since basically every language supports msgpack). The ZMQ message would then have just two parts: one for routing and a msgpack byte array. This does add some overhead, but msgpack is pretty efficient, and it would mean that if someone wanted to write a message sink in a language other than Python, it would be fine as long as ZMQ and msgpack were available (which is basically all mainstream languages). For data frames, I think Arrow or Feather is the best-supported format for interoperability, so you could go pandas -> Arrow -> serialize -> Arrow -> DataFrames.jl.
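A sketch of that envelope, with caveats: the stdlib `json` module stands in for msgpack here so the example runs without the dependency, but `msgpack.packb`/`msgpack.unpackb` would slot in the same way; `pack_message`/`unpack_message` are illustrative names, not an existing API.

```python
import json  # stand-in: msgpack.packb / msgpack.unpackb have the same shape


def pack_message(name, type_, data):
    """Encode the minimal name/type/data envelope as bytes."""
    return json.dumps({"name": name, "type": type_, "data": data}).encode()


def unpack_message(payload):
    return json.loads(payload.decode())


# The ZMQ message would then be just two frames: [routing_frame, payload]
zmq_frames = [b"station-01", pack_message("lever_press", "event", {"t_ms": 1250})]
print(unpack_message(zmq_frames[1]))
```

Because the payload is self-describing, a non-Python sink only needs a ZMQ socket and a msgpack decoder to dispatch on "type".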

@sneakers-the-rat
Contributor Author

Yes yes -- at the level of serialization we're currently using something similar, though with JSON rather than msgpack; checking msgpack out now, it does look very attractive. I need to add this to the docs, because some of this isn't explicitly described there, but messages are currently structured like this:

top-level zmq "frames":
[to_0, to_i, ... to_n, to_final, serialized_msg]

where a message can be routed through multiple recipients (multiple to entries), though typically there is only one. A final to_final field indicates the final intended recipient (a quirk of the 'dual-layer' messaging system, where Station objects route messages between Nodes, that really needs #48 to happen to it) -- the recipient that should deserialize the message. When routed over multiple hops, the message isn't deserialized, just forwarded.

The serialized message (abstracted by the Message class: https://docs.auto-pi-lot.com/en/latest/autopilot.core.networking.html#autopilot.core.networking.Message ) is a JSON dictionary with, at minimum:

  • id (unique message identifier),
  • to (recipient),
  • sender (networking id of sending node),
  • key (type of message, mapped to some callback "listen" method),
  • timestamp, and
  • ttl (time to live: resend the message up to n times until the recipient has issued a receipt),

as well as an optional value field carrying the message's payload (e.g. a 'DATA' message contains some data...). Since numpy arrays aren't JSON-serializable, they're compressed with blosc and then base64-encoded (the Message._serialize_numpy and ._deserialize_numpy methods).
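A rough sketch of that serialize path, under stated assumptions: `zlib` stands in for blosc (same compress/decompress shape, no extra dependency), and `serialize_array`/`deserialize_array` are illustrative helpers, not the actual Message methods; the field names follow the list above.

```python
import base64
import json
import time
import uuid
import zlib


def serialize_array(raw: bytes) -> str:
    # zlib stands in for blosc here; compress first, then base64 so the
    # result is JSON-safe text
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def deserialize_array(s: str) -> bytes:
    return zlib.decompress(base64.b64decode(s))


msg = {
    "id": str(uuid.uuid4()),  # unique message identifier
    "to": "T",                # recipient
    "sender": "C",            # networking id of sending node
    "key": "DATA",            # type of message, mapped to a 'listen' callback
    "timestamp": time.time(),
    "ttl": 5,                 # resend up to n times until receipted
    "value": {"frame": serialize_array(b"\x00\x01" * 512)},
}

wire = json.dumps(msg).encode()  # what actually travels inside the ZMQ frame
restored = json.loads(wire)
assert deserialize_array(restored["value"]["frame"]) == b"\x00\x01" * 512
```

Compressing before encoding matters: base64 output is high-entropy text that compresses poorly, while the raw array bytes compress well.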

What I'm thinking is some recursive abstraction for "packets" of data (that would then provide some means of serialization/deserialization for transport):

Problem:

  • A lot of data has multiple fields to it, one example would be video frames, which at minimum consist of some timestamp and an array (most if not all data will have timestamps), and currently they're shuttled around as tuples and accessed by just knowing a priori that img[0] will be the timestamp and img[1] will be the array.
  • Additionally, multiple individual "data" elements are routinely transported together, like multiple values from a stage of a task, or if multiple frames of a video are sent as a batch, etc.
  • In addition to "carrying" data, we also need to specify what data is to be expected, as in TrialData, PLOT, and PARAMS descriptors in the Task class, by elements of the GUI, prefs, and so on.
  • Data also often needs a reliable notion of "unit." Currently the system relies on using one scale of units (e.g. times always represented in ms), but that convention is implicit and unreliable: it causes problems like the one fixed by @cxrodgers in fix: set_duration and Solenoid.duration are both in ms, no conversion… #63, requires a bunch of inelegant overhead (e.g. converting between ms and s to use time.sleep()), and causes bad ambiguities (like when setting a color on an LED, where 1 is ambiguous because intensity can be set as a float from 0-1 or as an 8-bit value from 0-255).

Requirements:

  • The class should be able to represent values recursively, so that arbitrary "packetizations" of data can be encapsulated and manipulated with a single API, while each "level" of data still behaves appropriately. For example, a stage of a task could return several computed and measured values in stage_data, which would have its own stage_data.timestamp, but could also contain stage_data.button_pressed.timestamp and stage_data.button_pressed.value, which could be used with the appropriate unit-conversion methods like stage_data.lever_position.m or stage_data.lever_position.mm.
  • The class should support serialization and deserialization for use with Messages, and should be able to dump its fields to other python builtins/pandas representations for interfacing with other libraries.
  • The class should be able to specify "expected" data and types, unifying the disparate methods used across gui, prefs, Task, plots, and others. An expected data type should be able to provide numpy and tables data formats and column descriptors.
  • The class should respect the operations of its base type, so for example Volt(10) + Volt(20) = Volt(30), but should also handle conversions seamlessly, like combining Time(.001, unit="s") + Time(1000, unit="ms") = Time(1001, unit="ms") or requesting a specific unit with Time(1, unit="s").ms
  • The class should be compatible with type hints, so rather than a single class it should be possible to specify something like def method(self, argument: unit.Milliseconds)
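As a sketch of the unit-handling requirement (purely hypothetical, not a proposed implementation), a minimal Time class that normalizes through a common base unit:

```python
from dataclasses import dataclass

_TO_MS = {"ms": 1.0, "s": 1000.0}  # conversion factors; extend as needed


@dataclass(frozen=True)
class Time:
    value: float
    unit: str = "ms"

    def _in_ms(self) -> float:
        return self.value * _TO_MS[self.unit]

    def __add__(self, other: "Time") -> "Time":
        # one possible convention: normalize sums to the base unit (ms)
        return Time(self._in_ms() + other._in_ms(), "ms")

    @property
    def ms(self) -> float:
        return self._in_ms()

    @property
    def s(self) -> float:
        return self._in_ms() / 1000.0


total = Time(0.001, unit="s") + Time(1000, unit="ms")
print(total)                 # Time(value=1001.0, unit='ms')
print(Time(1, unit="s").ms)  # 1000.0
```

Routing every operation through one base unit keeps the conversion table small (one factor per unit) rather than quadratic in the number of units; a library like pint does something similar at much larger scale.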

I know other libraries must have something like this (pretty sure Brian2 does), so I'll look around for inspiration rather than trying to think it through from scratch.

@sneakers-the-rat
Contributor Author
