Proposal: Stop using JSON as the standard response format #8330

Closed
jsternberg opened this issue Apr 27, 2017 · 8 comments
Comments

@jsternberg
Contributor

Feature Request

I think we should stop using JSON as the standard recommended response format. JSON is not a suitable technology for our use case and imposes limitations that prevent the system from operating correctly in all of the cases we claim to support.

Proposal: Support an alternative format and recommend that all clients use that format instead of JSON. JSON will still be supported, and always will be, but that doesn't mean we have to actively use it.

Current behavior: JSON is the only practically usable format. While CSV exists, it is intended for exporting, not for general-purpose marshaling/unmarshaling.

Desired behavior: Support an alternative format that acts as the standard marshaling format used by client libraries.

JSON will still be the default, but not the recommended format. So if a client doesn't ask for a specific format, it gets JSON along with all of its flaws, but client libraries will be modified and encouraged to use the alternative format instead.

Use case: JSON imposes a number of limitations on clients that unmarshal the response (which is all of them).

  1. JSON cannot reliably represent integers above 2^53 because most implementations parse every number as an IEEE-754 double, and doubles cannot exactly represent integers beyond 2^53. Storing values as large integers will therefore commonly lead to the database returning the wrong value. This isn't the fault of the storage engine or the query engine, but of the marshaling format, which loses the precision (see the sketch after this list).
  2. JSON cannot tell the difference between a float and an integer because it represents them the same way. After an insert like cpu value=2i, the client has no idea whether it's dealing with integers or floats, so it will return [float64(2)]. If you instead try to use the decimal point to decide, then cpu value=2 (which should be a float) will come back as an integer, and the client could return something like [int64(2), float64(2.5)] for a single float field. For a user-friendly client, we need to tell the client which type it is instead of forcing it to guess.
  3. JSON has poor performance characteristics. Marshaling and unmarshaling commonly require a lot of memory allocations, and the structure itself adds a lot of extra bytes. While not as bad as XML, it's not nearly as compact or fast as an interchange format should be. Even the JSON serializers that generate marshaling code ahead of time aren't nearly as fast as serializers for some of the other formats.
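
A minimal sketch, using only the Go standard library, that demonstrates the first two problems (the values are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Problem 1: an int64 above 2^53 cannot be represented exactly by a
	// float64, which is what JSON decoders typically produce for numbers.
	big := int64(9007199254740993) // 2^53 + 1
	b, _ := json.Marshal(big)

	var v interface{}
	json.Unmarshal(b, &v)
	fmt.Printf("%T %.0f\n", v, v) // float64 9007199254740992 -- off by one

	// Problem 2: an integer and a float with the same value produce
	// identical bytes on the wire; the type information is simply gone.
	bi, _ := json.Marshal(int64(2))
	bf, _ := json.Marshal(float64(2))
	fmt.Println(string(bi), string(bf)) // 2 2
}
```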

The last one isn't that big of a concern. While performance is always a consideration, we have not found marshaling to be a limiting factor; developer usability comes first. Some possible formats that need to be investigated:

  • MessagePack
  • Protocol Buffers
  • Thrift
  • Flat Buffers

All of these should meet the minimum requirements. Since JSON is still supported and still acts as the global interchange format, we can afford a more complex unmarshaling story here, since this format is aimed at supported clients.
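
As an illustration of what a typed format buys us, here is a sketch using MessagePack via the third-party github.com/vmihailenco/msgpack library (the library choice is mine for the example, not part of the proposal; any MessagePack implementation behaves similarly):

```go
package main

import (
	"fmt"

	"github.com/vmihailenco/msgpack/v5"
)

func main() {
	// Integers and floats get distinct wire encodings, so the type survives.
	bi, _ := msgpack.Marshal(int64(2))   // encoded as an integer
	bf, _ := msgpack.Marshal(float64(2)) // encoded as a float: a different wire type
	fmt.Printf("int:   % x\nfloat: % x\n", bi, bf)

	// Large integers round-trip exactly; no precision is lost above 2^53.
	in := int64(9007199254740993) // 2^53 + 1
	b, _ := msgpack.Marshal(in)
	var out int64
	msgpack.Unmarshal(b, &out)
	fmt.Println(in == out, out) // true 9007199254740993
}
```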

Additionally, we should consider whether the current schema for returned values is still appropriate. While the JSON format itself should never change, the current schema format doesn't make much sense. One of the big issues is that the columns for a statement are returned with each series, so with something like GROUP BY * the columns are repeated for every series even though they belong with the result, not the series (see the abridged example below). We should also evaluate whether there's a better way to represent the response output than chunking, since chunking is fairly confusing and also hard to implement from the client side.
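
An abridged example of the current response shape, showing the identical columns array repeated in every series under GROUP BY *:

```json
{
  "results": [{
    "statement_id": 0,
    "series": [
      {"name": "cpu", "tags": {"host": "a"}, "columns": ["time", "value"], "values": [["2017-04-27T00:00:00Z", 2]]},
      {"name": "cpu", "tags": {"host": "b"}, "columns": ["time", "value"], "values": [["2017-04-27T00:00:00Z", 3]]}
    ]
  }]
}
```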

@jsternberg
Contributor Author

/cc @stuartcarnie @rbetts

@nathanielc
Contributor

nathanielc commented Apr 27, 2017

@jsternberg Kapacitor suffers from this because it cannot tell whether queried data is an int64 or a float64.
See influxdata/kapacitor#1338

I do not have a strong opinion on which serialization format to use, just that it must correctly communicate type information. My preference would be a self-describing format, since that makes writing clients easier: no need for an IDL or generated code. Of the list you provide, I believe only MessagePack is self-describing.
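
To make "self-describing" concrete, a MessagePack payload can be decoded with no schema at all; a small sketch, again assuming the github.com/vmihailenco/msgpack library:

```go
package main

import (
	"fmt"

	"github.com/vmihailenco/msgpack/v5"
)

func main() {
	// Encode a map, then decode it knowing nothing about its shape.
	// Protocol Buffers, Thrift, and FlatBuffers would all require an
	// IDL definition and generated code to do the same thing.
	b, _ := msgpack.Marshal(map[string]interface{}{"value": int64(2)})

	var v interface{}
	msgpack.Unmarshal(b, &v)
	fmt.Printf("%#v\n", v) // a map with typed values, recovered without a schema
}
```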

As for changing the schema, I don't think we should. Let's tackle one problem at a time, and the current problem is type-information loss in our serialization format.

NOTE: It is possible for the columns per group to differ if the fields are scoped to a tag set. I actually added a test case for this to Kapacitor this morning; see influxdata/kapacitor#1320

@jwilder
Contributor

jwilder commented Apr 27, 2017

I'm in favor of this. We moved away from JSON in the write path for similar reasons. I like that we can add additional formats while maintaining backwards compatibility.

@jsternberg
Contributor Author

If we want a quicker fix for this, and since we can support multiple formats anyway, we can always merge #7154 (after a rebase) and have it in by 1.3. I think FlatBuffers is likely the way to go in the future, but MessagePack would definitely be an improvement, and I don't think supporting both is a negative.

@nathanielc
Contributor

+1 for adding msgpack sooner rather than later. It is just a serialization protocol change, so it doesn't affect existing clients or behavior. Clients (like Kapacitor) that want to take advantage of it can.
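
For instance, opting in could happen through ordinary HTTP content negotiation. A hypothetical sketch of a client requesting msgpack on the query endpoint; the application/x-msgpack media type here is an assumption about how the server would advertise the format:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Ask for a msgpack-encoded response; a server that doesn't support
	// the media type would simply keep answering with JSON.
	u := "http://localhost:8086/query?db=mydb&q=" + url.QueryEscape("SELECT * FROM cpu")
	req, _ := http.NewRequest("GET", u, nil)
	req.Header.Set("Accept", "application/x-msgpack")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Header.Get("Content-Type"), len(body))
}
```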

@rbetts rbetts self-assigned this Apr 27, 2017
@rbetts
Contributor

rbetts commented Apr 27, 2017

There is a set of related issues that we have gathered for shortly after the 1.3.0 milestone; this is on that list.

@mcbortolozzo

The project I'm currently working on involves an in-memory IoT analysis tool tightly coupled to a single-instance disk database. InfluxDB provided good features for most of my requirements, but the improved performance of one of the proposed formats, in terms of both memory usage and time, would be very useful. I hadn't seen the msgpack implementation, so I also ran some tests with a protobuf encoder just to get an overall idea of the possible improvements, and the results were similar. I'd be interested in having a similar option too.

@rbetts
Contributor

rbetts commented Oct 26, 2017

@jsternberg added a msgpack response format: #8897
