Proposal: New response structure #8450

Closed
jsternberg opened this issue Jun 3, 2017 · 3 comments

Comments

@jsternberg
Contributor

This issue is directly tied to #8330 and really can't be separated from it. If accepted, both of these would be implemented at the same time.

Feature Request

I think we should reconsider the response structure we want clients to use. I do not think the current one is friendly to either users or API writers. To be clear, when I say response structure, I mean the way the data is organized as it comes back; I do not mean the serialization format like JSON, MessagePack, Protocol Buffers, etc. The structure itself is broken and based on concepts that haven't existed in the query engine since before 1.0 (or at least not how I think of them while working on the query engine). Changing the serialization format alone will not fix the underlying issues you run into when programming clients.

I've seen issues indicating that people have difficulty with large amounts of data, and I have experienced the problem myself while trying to work on the experimental client.

Large amounts of data mean I want to be able to process the stream iteratively, but the current format makes this difficult: we return "results" that contain "series" (which we call "rows" in the code and "series" everywhere that is user facing), and inside of those series are the "values" (which should be called rows).
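
For reference, here is a minimal sketch of that nesting as Go types, with the JSON tags matching the field names as they appear on the wire (the type names themselves are just illustrative):

```go
// Illustrative types only; the JSON tags match the wire format described above.
package response

// Series is what the API calls "series" and the server code calls "rows".
type Series struct {
	Name    string          `json:"name"`
	Columns []string        `json:"columns"`
	Values  [][]interface{} `json:"values"` // the actual rows
}

// Result groups the series produced by one statement.
type Result struct {
	StatementID int      `json:"statement_id"`
	Series      []Series `json:"series"`
}

// Response is the top-level envelope: three levels of nesting before a row.
type Response struct {
	Results []Result `json:"results"`
}
```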

When we need to break the data down so it gets returned in chunks, we use the concept of a partial result or a partial series. A partial result says that the next result is actually part of the current result, and a partial series means the next series is part of the current series. But once you get into the nesting and actually read from the cursor, you end up with absurdly complicated code just to correctly determine which result the next series belongs to. As an example, here is the code I created for the experimental client that fully handles partials so the user does not have to worry about them: https://github.com/influxdata/influxdb-client/blob/2d000abc81dd6588d94ee5578810e9fc64bec5ab/json_cursor.go

I consider this code unreadable, and I believe it would hamper adoption since you cannot realistically use chunking. I say this as the person who added that feature as an answer to large amounts of data.

But the idea behind chunking was to make it a default so data would be streamed rather than forcing a potentially large amount of memory to be allocated on the server. It was supposed to solve problems. While it has not solved those problems, that doesn't mean the original problems went away. We still need a way to stream points back to the user instead of returning them as one JSON blob.

With that, I think we should discuss ways we can potentially change the format. Internal to the query engine, we already return the data in a way that would work, but we then aggregate it and return it in our monster format that is difficult, if not impossible, to read correctly. We will obviously have to maintain the same format for JSON and CSV, but I think we should consider a structure that works better with streams for MessagePack before we merge that, and then start using it for any future clients.

@Sineos

Sineos commented Jun 3, 2017

Maybe handle this according to the HTTP standard in RFC 2616, which leaves it up to the client. See e.g. https://benramsey.com/blog/2008/05/206-partial-content-and-range-requests/

@jsternberg
Contributor Author

Thank you @Sineos, but I don't think the problem is as simple as something that can be solved within the HTTP standard. Partial content and range requests are a good idea, and I've considered them before, but they don't necessarily match the query workflow.

To expand, the issue is mostly with how things are encoded. Each chunk that is sent back to the client has to be parseable on its own, so you can't just split a JSON response between two chunks, or you won't be able to read it iteratively. Because of that, we need to send complete JSON chunks instead. But we send back the chunks with nested structures like {"results":[{"statement_id":0,"series":[...]}]}. If a series inside of a result is split, we don't have the ability to split an array within a map in JSON, so we had to add "partial": true to signal to the client that we were doing that. While that makes it possible to reconstruct the full set of results, it doesn't make actually iterating through those series or rows any easier.
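
To make that concrete, here is a minimal Go sketch of what a client sees when a series is split across two chunks. The payloads are invented for illustration, but the "partial" flags follow the semantics described above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

type series struct {
	Name    string          `json:"name"`
	Columns []string        `json:"columns"`
	Values  [][]interface{} `json:"values"`
	Partial bool            `json:"partial"`
}

type result struct {
	StatementID int      `json:"statement_id"`
	Series      []series `json:"series"`
	Partial     bool     `json:"partial"`
}

type response struct {
	Results []result `json:"results"`
}

func main() {
	// Chunk 1 marks both the result and the series as partial, so the client
	// must remember that chunk 2 continues them rather than starting anew.
	stream := `{"results":[{"statement_id":0,"series":[{"name":"cpu","columns":["time","value"],"values":[[1,0.5]],"partial":true}],"partial":true}]}
{"results":[{"statement_id":0,"series":[{"name":"cpu","columns":["time","value"],"values":[[2,0.6]]}]}]}`

	dec := json.NewDecoder(strings.NewReader(stream))
	for {
		var resp response
		if err := dec.Decode(&resp); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		for _, r := range resp.Results {
			for _, s := range r.Series {
				fmt.Println(s.Name, s.Values, "partial:", s.Partial)
			}
		}
	}
}
```

Tracking those flags across results and series simultaneously is exactly where the cursor code linked above becomes hard to follow.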

So the idea is to instead find some way to split the array within results and the array within series so that they can be divided between different chunks within the HTTP response without causing any issues. The working example I have in my head at the moment is something like this (mostly stolen from how Transfer-Encoding: chunked is implemented). Each of the following would be sent as a separate JSON or MessagePack blob (the format doesn't matter):

response header: {} or {"error": "error message"}
|- number of results in chunk (0 terminates this block)
|- result header: {"statement_id":0, "columns":["time", ...], "messages": [...]}
|--- number of series in this chunk
...

While the idea isn't fully fleshed out, the point is that the server doesn't have to close the nested structure for the result, because a nested structure doesn't exist. Instead, there's a sequence of blobs in a specified order that is meant to be read iteratively. That makes it easy to read each entry one by one.
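
As a rough illustration of how a client might consume such a stream, here is a sketch in Go that reads the response header followed by count-prefixed result headers. Since the format isn't finalized, every detail here (the field names, the zero terminator, the sample blobs) is an assumption taken from the outline above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

type responseHeader struct {
	Error string `json:"error,omitempty"`
}

type resultHeader struct {
	StatementID int      `json:"statement_id"`
	Columns     []string `json:"columns"`
}

func main() {
	// One response header, a result count, one result header, and a zero
	// count terminating the block, each sent as its own blob.
	stream := `{}
1
{"statement_id":0,"columns":["time","value"]}
0`

	dec := json.NewDecoder(strings.NewReader(stream))

	var hdr responseHeader
	if err := dec.Decode(&hdr); err != nil {
		panic(err)
	}
	if hdr.Error != "" {
		panic(hdr.Error)
	}

	for {
		var n int
		if err := dec.Decode(&n); err != nil {
			if err == io.EOF {
				break
			}
			panic(err)
		}
		if n == 0 {
			break // zero terminates the results block
		}
		for i := 0; i < n; i++ {
			var rh resultHeader
			if err := dec.Decode(&rh); err != nil {
				panic(err)
			}
			fmt.Println("result", rh.StatementID, rh.Columns)
			// series counts and row blobs would follow here in the full format
		}
	}
}
```

Note how the reader never has to close or reconstruct a nested structure; each blob stands on its own.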

I hope that helps explain the idea behind the issues well enough. I haven't fully fleshed out the format, but the way I currently envision it would work especially well with MessagePack.

@Sineos

Sineos commented Jun 6, 2017

I was thinking more about how the client side handles the chunks, like HTTP range requests but built into Influx:

  1. Client: Runs a query, requesting ranged results
  2. Server: For that query I will return 10,000 results with this {...} structure
  3. Client: Send me the first 1,000 results
  4. Server: Sends the first 1,000 as fully contained JSON or blobs
  5. Client: Processes the first 1,000, then requests the next 1,000

This means the client can decide (depending on its processing power and the type of application) at what speed it requests the data.
Whether this is encapsulated in a chunked HTTP stream or not is up to the HTTP protocol layer and mostly transparent to the client.
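
InfluxDB has no such range negotiation today, but the client loop could be approximated with InfluxQL's LIMIT and OFFSET. A minimal sketch, where the host, database, measurement, and page size are all assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type response struct {
	Results []struct {
		Series []struct {
			Values [][]interface{} `json:"values"`
		} `json:"series"`
	} `json:"results"`
}

func main() {
	const pageSize = 1000
	for offset := 0; ; offset += pageSize {
		// The client controls the pace: it asks for the next page only when
		// it has finished processing the current one.
		q := fmt.Sprintf("SELECT * FROM cpu LIMIT %d OFFSET %d", pageSize, offset)
		resp, err := http.Get("http://localhost:8086/query?db=mydb&q=" + url.QueryEscape(q))
		if err != nil {
			panic(err)
		}
		var r response
		if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
			panic(err)
		}
		resp.Body.Close()

		rows := 0
		for _, res := range r.Results {
			for _, s := range res.Series {
				rows += len(s.Values)
			}
		}
		fmt.Printf("processed %d rows at offset %d\n", rows, offset)
		if rows < pageSize {
			break // short page means we've reached the end
		}
	}
}
```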
