Additional value types? #8

Open
Simran-B opened this issue Feb 17, 2016 · 3 comments

Comments

@Simran-B
Contributor

A rather random list of things that came to my mind:

  • sets (ordered/unordered), unique values only
  • bags (unordered), as opposed to arrays, which are ordered bags (non-unique and ordered)
  • sorted variants of sets and bags (?)
  • enums (symbols, bitfields, ...)
  • union structs (?)
  • NaN (invalid number) / N/A (known to be missing its value) / undefined
  • UTF-16 encoded text, which may use up less memory/space/bandwidth for certain content (Chinese text for instance), but add a cost for (de-)serialization, because JSON requires text encoding to be UTF-8

@neunhoef
Member

First comment: There are not many type bytes left for extensions. Furthermore, the specification is already complex enough for my taste. Therefore I would only be in favour of additions if they bring a lot of additional value and cannot easily be emulated with the available types (or custom types).
I think that arrays can be used for sets (ordered/unordered) and bags, so IMHO this does not give enough reasons to add further complications.
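
A minimal sketch of that emulation, in plain Python rather than any VelocyPack API (the helper names `as_set` and `bags_equal` are made up for illustration): a set is just an array that the application deduplicates, and a bag is an array compared without regard to order.

```python
from collections import Counter

def as_set(values):
    """Emulate a sorted set with a plain array: drop duplicates, then sort."""
    return sorted(set(values))

def bags_equal(a, b):
    """Compare two arrays as bags: same elements, same multiplicities, any order."""
    return Counter(a) == Counter(b)

tags = ["b", "a", "b", "c"]
print(as_set(tags))                             # ['a', 'b', 'c']
print(bags_equal(tags, ["c", "b", "b", "a"]))   # True
print(bags_equal(tags, ["a", "b", "c"]))        # False
```

The uniqueness and ordering rules live in application code here, which is the trade-off being described: no extra type byte, but no guarantee carried by the data itself.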
Enums would have to be declared somewhere, VelocyPack is intentionally schema-free, so there is no good place to declare them. One can use integers in most applications.
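
One way this could look in practice, sketched in plain Python: the enum is declared in application code (the `Color` class and the `doc` dictionary are hypothetical), and only the integer ends up in the stored document.

```python
from enum import IntEnum

class Color(IntEnum):
    """Hypothetical enum that exists only in application code."""
    RED = 0
    GREEN = 1
    BLUE = 2

# The document stores a plain integer, so no schema is needed in the format.
doc = {"name": "apple", "color": int(Color.RED)}

# On read, the application maps the integer back to the enum member.
restored = Color(doc["color"])
print(doc["color"], restored.name)  # 0 RED
```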
Union structs: I do not understand, since an object is something like a struct and unions do not seem to make sense here.
NaN: we do have IEEE double, which has cases for NaN and infinity and the like. I do not think we should have a separate type for this.
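
A small sketch of this point using Python's standard `struct` module: NaN and the infinities survive a round trip through the 8-byte IEEE 754 double encoding. The little-endian byte order and the helper name `roundtrip` are chosen for illustration only.

```python
import math
import struct

def roundtrip(x):
    """Pack a float as an 8-byte IEEE 754 double and unpack it again."""
    raw = struct.pack("<d", x)
    return raw, struct.unpack("<d", raw)[0]

for value in (float("nan"), float("inf"), float("-inf")):
    raw, back = roundtrip(value)
    # NaN never compares equal to itself, so test it with math.isnan.
    print(raw.hex(), back, math.isnan(back))
```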
UTF-16 does save some space for certain textual content. I would think that this is an edge case (I know, there are many Chinese people!) and UTF-8 isn't too bad for this. In case of need one can always stick the text into a binary blob.

Sorry for being against these suggestions, but I think we always have to keep in mind that every single additional type makes the implementation for another language more complicated.

@Simran-B
Contributor Author

Thanks for the detailed reasoning! More types would mean more work indeed and I think pretty much everything can be modeled with already specified types (fuzzy dates being my personal favorite).

Sorted sets seemed interesting to me, in particular for imports, because the type could signal to the DBMS that the data doesn't need to be sorted after import (because it already is) and that no duplicate values are to be expected. But that's not very useful I guess, and it even moves structural concerns to the data type level, which belong on the document level to stay schema-free at the DB level (like with enums).
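
Without such a type, an importer could still verify the same property in a single linear pass. A minimal sketch, assuming the values are mutually comparable (the function name `is_sorted_set` is hypothetical):

```python
def is_sorted_set(values):
    """True if values are strictly increasing, i.e. sorted with no duplicates."""
    return all(a < b for a, b in zip(values, values[1:]))

print(is_sorted_set([1, 3, 7]))   # True
print(is_sorted_set([1, 3, 3]))   # False, contains a duplicate
print(is_sorted_set([3, 1, 7]))   # False, not sorted
```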

I did not know IEEE double could handle NaN and Infinity. Doesn't it also support +0 and -0?
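
IEEE 754 doubles do distinguish +0 and -0: the two compare equal but differ in the sign bit. A short check with Python's `struct` module, using big-endian packing only so that the sign bit appears in the first hex digit:

```python
import math
import struct

pos = struct.pack(">d", 0.0)
neg = struct.pack(">d", -0.0)

print(pos.hex())    # 0000000000000000
print(neg.hex())    # 8000000000000000
print(0.0 == -0.0)  # True, the values compare equal

# The sign is still recoverable after unpacking.
print(math.copysign(1.0, struct.unpack(">d", neg)[0]))  # -1.0
```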

About unions: it's more about how you access the data, not how it's stored actually... I realize that now. I think you would use a binary block of data on the DB level and interpret it in different ways on the application level if necessary.
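
A sketch of that idea in plain Python: the same four stored bytes are read in different ways by the application, much like a C union. The blob contents and variable names are illustrative, not taken from any VelocyPack API.

```python
import struct

# Four bytes stored as an opaque binary blob (here: the float32 encoding of 1.0).
blob = struct.pack("<f", 1.0)

# The application decides how to interpret the very same bytes.
as_float = struct.unpack("<f", blob)[0]
as_uint = struct.unpack("<I", blob)[0]
as_bytes = tuple(blob)

print(as_float)      # 1.0
print(hex(as_uint))  # 0x3f800000
print(as_bytes)      # (0, 0, 128, 63)
```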

Regarding UTF-16: non-ANSI people probably form the majority, but yeah, a blob is always an option and the space savings might not be that large after all, because characters from the 1-byte range (whitespace, punctuation, digits, ...) are frequently used in texts with mostly 2-4 byte characters and they may offset the difference.
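
The size trade-off can be measured directly. A small sketch with made-up sample strings, using `utf-16-le` so that no byte order mark is counted:

```python
def sizes(text):
    """Return (UTF-8 byte length, UTF-16 byte length) for a string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "ascii": "hello, world",
    "chinese": "你好世界",
    "mixed": "价格: 100 元",
}

for name, text in samples.items():
    print(name, sizes(text))
# ascii (12, 24)
# chinese (12, 8)
# mixed (15, 18)
```

Pure CJK text shrinks by a third in UTF-16, but the mixed sample, with its digits, spaces, and ASCII punctuation, comes out larger than in UTF-8.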

That said, there's probably no type really missing in the VPack specs!

@Simran-B
Contributor Author

Simran-B commented Mar 11, 2016

Some additional data types available in CBOR (all UTF-8 strings):

  • URI
  • base64url
  • base64
  • RegExp
  • MIME message

Three more random thoughts:

  • VPack data split across multiple files - a convention for naming continuation files? (similar to split rar archives, .rar, .r00, .r01, ...)
  • Embedding of meta-data (dropped on conversion to JSON?), such as VPack version, date and time of creation, license / restrictions, integrity hash
  • Subtree hashes for fast deep-equality checks?
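
The subtree-hash idea could be sketched roughly as follows, on JSON-like Python values rather than on actual VPack bytes. The function name `subtree_hash`, the choice of SHA-256, and the type-tag bytes are all assumptions made for illustration.

```python
import hashlib
import json

def subtree_hash(value):
    """Return a SHA-256 hex digest of a JSON-like value, built from child digests."""
    h = hashlib.sha256()
    if isinstance(value, dict):
        h.update(b"o")  # tag: object
        for key in sorted(value):  # key order must not affect the digest
            h.update(json.dumps(key).encode("utf-8"))
            h.update(subtree_hash(value[key]).encode("ascii"))
    elif isinstance(value, list):
        h.update(b"a")  # tag: array, element order matters
        for item in value:
            h.update(subtree_hash(item).encode("ascii"))
    else:
        h.update(b"s")  # tag: scalar
        h.update(json.dumps(value).encode("utf-8"))
    return h.hexdigest()

a = {"x": [1, 2, {"y": None}], "z": "s"}
b = {"z": "s", "x": [1, 2, {"y": None}]}
c = {"x": [1, 2, {"y": 0}], "z": "s"}

print(subtree_hash(a) == subtree_hash(b))  # True, same content
print(subtree_hash(a) == subtree_hash(c))  # False, a nested value differs
```

This sketch recomputes every digest on each call. The speed-up for deep-equality checks would only appear if the digests were stored alongside each subtree, so that comparing two trees means comparing two stored hashes.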
