Poor performance on JSON serialization when using HTTP query #7250
Possibly related to #7154.
So I spent a bunch of time on this and came out with what amounted to nothing. I was running performance comparisons between the tinylib/msgp library and the JSON encoder we've got right now and, weirdly, saw no difference between them. In the benchmarks, msgpack was definitely faster, but when I profiled an actual query against the server, the response time was mostly the same. I also tried discarding the output on the wire completely so I didn't have to compare different serialization methods at all; the results were mostly the same.

I'll try to find some time this week to reproduce my results (since I'm relaying them from memory right now), but I think the current blocker is within the query engine. While making the serialization faster is a noble goal, I don't think it will produce any sizeable gains, because a lot of the hotspots I found in the heat map were the garbage collector and channel operations. The garbage collector shows up because we don't handle memory as efficiently as we probably should, and the channels because selecting raw fields produces channels at the moment. More testing is needed to find what hotspots are there and how to optimize them, but I don't think the JSON encoder is the current hotspot.
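For reference, a hedged sketch of the kind of isolated encoding benchmark described above, writing to io.Discard so wire and disk costs drop out; the `row` struct and values are simplified stand-ins, not InfluxDB's actual result types:

```go
// results_bench_test.go
package results

import (
	"encoding/json"
	"io"
	"testing"
)

// row is a simplified stand-in for one serialized series.
type row struct {
	Name    string          `json:"name"`
	Columns []string        `json:"columns"`
	Values  [][]interface{} `json:"values"`
}

// makeRows builds a synthetic result set with n rows of three values each.
func makeRows(n int) []row {
	values := make([][]interface{}, n)
	for i := range values {
		values[i] = []interface{}{"2016-09-01T00:00:00Z", int64(i), float64(i)}
	}
	return []row{{Name: "cpu", Columns: []string{"time", "count", "value"}, Values: values}}
}

// BenchmarkEncodeJSON measures only the serialization path by discarding the output.
func BenchmarkEncodeJSON(b *testing.B) {
	rows := makeRows(10000)
	enc := json.NewEncoder(io.Discard)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if err := enc.Encode(rows); err != nil {
			b.Fatal(err)
		}
	}
}
```

Running it with `go test -bench=. -benchmem` also reports allocation counts alongside timing, which is useful when the suspected hotspot is the garbage collector rather than the encoder itself.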
After reading your reply, I think there may be another cause if serialization is not the hotspot: creating a large number of query result objects one by one, somewhat like the allocation pattern sketched after this comment.
I'm not familiar with golang or its compiler. A naively implemented compiler and garbage collector may turn such an operation into a huge number of memory allocation and freeing system calls, bouncing the process back and forth between user space and kernel space, which may cost a lot of CPU. As far as I know, such code performs poorly in Python; I'm not sure whether golang, as a compiled language, behaves the same way. Maybe you should use techniques like a memory pool to improve performance. By "poor" I mean compared with a popular RDBMS like MySQL: I created a similar table with a B-Tree index on a "time" column, and when the result set was large, MySQL did much better. As you said, more performance testing and profiling is needed to check whether optimization can bring a significant improvement, or whether we have simply reached the limits of golang. Anyway, thanks for considering performance improvements. My use case is listing API statistics in our systems for a period of time, so results with thousands of rows are quite common. If you need any detailed information to help, I'm glad to share it.
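To make the two ideas above concrete (per-object allocation versus reusing objects from a pool), here is a hedged Go sketch; the `Row` type and function names are made up for illustration, and Go's `sync.Pool` stands in for the "memory pool" suggestion rather than anything InfluxDB actually does:

```go
package main

import (
	"fmt"
	"sync"
)

// Row is a stand-in for one query result row.
type Row struct {
	Values []interface{}
}

// buildRowsNaive allocates every row (and its Values slice) individually, so a
// large result set means many small heap allocations for the GC to track.
func buildRowsNaive(n int) []*Row {
	rows := make([]*Row, 0, n)
	for i := 0; i < n; i++ {
		rows = append(rows, &Row{Values: []interface{}{int64(i), float64(i)}})
	}
	return rows
}

// rowPool reuses Row objects between requests instead of allocating fresh ones,
// Go's closest built-in analogue to a memory pool.
var rowPool = sync.Pool{
	New: func() interface{} { return &Row{Values: make([]interface{}, 0, 8)} },
}

// buildRowsPooled fills rows taken from the pool, reusing their backing slices.
func buildRowsPooled(n int) []*Row {
	rows := make([]*Row, 0, n)
	for i := 0; i < n; i++ {
		r := rowPool.Get().(*Row)
		r.Values = append(r.Values[:0], int64(i), float64(i))
		rows = append(rows, r)
	}
	return rows
}

// releaseRows returns rows to the pool once the response has been written.
func releaseRows(rows []*Row) {
	for _, r := range rows {
		rowPool.Put(r)
	}
}

func main() {
	rows := buildRowsPooled(3)
	fmt.Println(len(rows), rows[0].Values)
	releaseRows(rows)
	_ = buildRowsNaive(3)
}
```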
I used easyjson for serialization / deserialization of billions of JSON messages over web sockets. Their benchmarks are pretty comprehensive too.
In addition, is it worth adding support for RFC 7464: JavaScript Object Notation (JSON) Text Sequences? If the client includes an appropriate Accept header, the response could be streamed as a header object followed by one row per line, e.g.:
{ "name": "cluster", "columns": ["time", "clusterID", "copyShardReq", "createIteratorReq", "expandSourcesReq", "fieldDimensionsReq", "hostname", "nodeID", "removeShardReq", "writeShardFail", "writeShardPointsReq"] }
["2017-04-24T23:59:10Z","535175895417456351",0,0,0,0,"stuart-influx.local","data-0:8088",0,0,0,0]
["2017-04-24T23:59:20Z","535175895417456351",0,0,0,0,"stuart-influx.local","data-0:8088",0,0,0,0]
I don't actually think there is a marshaling problem. I tried to make marshaling faster when I looked at this a while ago, but it didn't speed up real performance because the limiting factors weren't in marshaling. Further, we should consider moving away from JSON and just supporting JSON as a debugging mechanism. JSON causes a bunch of other unrelated issues since it can't differentiate between floats and ints, and it cannot accurately represent integers above 2^53. Since we support the full range of signed 64-bit integers (up to 2^63 - 1), this is a bit inconvenient.
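For illustration, a small Go snippet (not from this issue) showing the 2^53 problem with the standard library's default decoding behaviour:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

func main() {
	var big int64 = 9007199254740993 // 2^53 + 1, not representable as a float64

	b, _ := json.Marshal(big) // marshaling is fine: "9007199254740993"

	// Default decoding turns every JSON number into a float64, so the value
	// silently rounds to 9007199254740992.
	var v interface{}
	json.Unmarshal(b, &v)
	fmt.Printf("%T %v\n", v, v) // float64 9.007199254740992e+15

	// Decoding with UseNumber (or into a typed int64 field) keeps it exact.
	dec := json.NewDecoder(bytes.NewReader(b))
	dec.UseNumber()
	var n interface{}
	dec.Decode(&n)
	fmt.Printf("%T %v\n", n, n) // json.Number 9007199254740993
}
```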
Agreed – we'll start with code-generated serialization |
I have added results of some detailed analysis here. I used easyjson to generate serialization methods for the core types involved in encoding query results.
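For context, a rough sketch of how easyjson code generation is typically wired up; the `Row` type and file name here are simplified stand-ins rather than the actual types from that analysis, and the command shown reflects easyjson's documented usage:

```go
// row.go - a simplified stand-in for the result types being serialized.
// Running `easyjson -all row.go` generates a row_easyjson.go file with
// reflection-free MarshalJSON/UnmarshalJSON and MarshalEasyJSON/UnmarshalEasyJSON
// methods for the marked types.
package models

//easyjson:json
type Row struct {
	Name    string            `json:"name,omitempty"`
	Tags    map[string]string `json:"tags,omitempty"`
	Columns []string          `json:"columns,omitempty"`
	Values  [][]interface{}   `json:"values,omitempty"`
}
```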
Once long ago, I was annoyed by things to do with JSON encoding and decoding when talking to influxdata, and I was thinking about that and thought "this has to be expensive, right?", and thus I somehow ended up here, because I stumbled across the csv/msgpack support on the server side, which don't have corresponding functionality on the client side, which is why I never noticed that they existed.

I was a little surprised at the relatively small magnitude of the improvements from improving the marshalling code, but after looking at it more closely, I think this is probably a data structure issue. 10k rows with three values in each row is 10k slices of three interface{}, each of which in turn has to have a pointer to an underlying object. This is a fairly large volume of pointers for the GC to track, and it probably means they're all different allocations; I'm pretty sure that they will be after unmarshalling, in any event.

Performance might be improved by coalescing the allocations for the backing store, but my intuition is that if you really want to reduce the memory/CPU overhead much, you'd need to switch to slices of underlying types. So, instead of each row having a slice of interface{}, each series would have a slice of interface{} -- each member of which would be a slice of a concrete type, holding the concrete values for one column. So, if you have a column of timestamps, that would be a single slice of 10,000 time.Time, instead of 10,000 individual interface{} each wrapping a time.Time.
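A hedged sketch of the two layouts being contrasted above; the type names are illustrative, not a proposal for InfluxDB's actual structures:

```go
package sketch

import "time"

// RowOriented mirrors the current shape: one small []interface{} per row, so
// a 10k-row, 3-column series means 10k inner slices plus a boxed value behind
// every cell, all separate allocations for the GC to trace.
type RowOriented struct {
	Columns []string
	Values  [][]interface{} // one inner slice per row
}

// ColumnOriented follows the suggestion above: one entry per column, each a
// slice of a concrete type, so a column of 10,000 timestamps is a single
// []time.Time allocation instead of 10,000 boxed time.Time values.
type ColumnOriented struct {
	Columns []string
	Data    []interface{} // e.g. Data[0].([]time.Time), Data[1].([]float64)
}

// newTimeColumn allocates an entire time column as one contiguous backing array.
func newTimeColumn(n int) []time.Time {
	return make([]time.Time, n)
}
```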
FWIW, on my system, dumping the same exact largeish (~1.5M row) query (…)

(Also, I was about to say "oh, and never mind, ints and pointers can be inlined in interfaces", but they actually can't since apparently around Go 1.5, because that caused problems for the GC.)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions. |
I'm running an influxdb v0.13 instance on Ubuntu 14.04 LTS. It seems that influxdb shows poor performance when the result JSON of an HTTP query is large.

For example, when I simply run a `count` query on a field named `count`, the result JSON is quite small. Time stats:

However, when I want to list the field (725675 rows in total as shown above), the query becomes slow. The size of the result JSON is 19 MB. Time stats:

And it becomes slower if I select more fields (even if I duplicate the same field `count`). The size of the result JSON is 22 MB. Time stats:

I guess the poor performance is due to the JSON serialization process on large datasets in influxdb. The HTTP query goes through the `lo` interface, so network speed is not the cause. Neither is the hard drive's I/O, as I have tried redirecting `stdout` to `/dev/null` and the time cost remains the same. Would you mind doing some profiling on JSON serialization?
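For reference, a hedged Go sketch of the kind of measurement described above: run the query against the HTTP API over loopback and discard the body, so only server-side processing and serialization are timed. The database name and query string are placeholders, not the actual schema from this report:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	q := url.Values{}
	q.Set("db", "mydb")                       // placeholder database name
	q.Set("q", "SELECT count FROM api_stats") // placeholder query
	u := "http://localhost:8086/query?" + q.Encode()

	start := time.Now()
	resp, err := http.Get(u)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Discard the JSON body so client-side parsing and disk I/O don't
	// contribute to the measured time.
	n, _ := io.Copy(io.Discard, resp.Body)
	fmt.Printf("read %d bytes in %s\n", n, time.Since(start))
}
```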