Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: sort object to accelerate FindMember method #1978

Open
lymslive opened this issue Jan 4, 2022 · 4 comments
Open

Suggestion: sort object to accelerate FindMember method #1978

lymslive opened this issue Jan 4, 2022 · 4 comments

Comments

@lymslive
Copy link

lymslive commented Jan 4, 2022

I also concern about the linear complexity O(n) of FindMember() methods as some other users.
And I also noticed there is a macro to enable std::multimap to represent an object, but I think that solution may not be perfect, as it is not memory friendly and the layout is not compatible with before.
So I bring up another approach as following:

  1. add a SortObject() method for GenericValue type, to sort object by the key name, which is typically in complexity of O(NlogN) ;
  2. add a FindMemberBinary() method to find member in an already sorted object, which is in O(logN).
  3. Extend the GenericValue::Flag which is in the last two bytes, there is still unused bit, we can use one bit to mark the object is sorted, then the FindMember() method knows when a binary search is suitable and when should use the old linear search method.

I also think I can understand why rapidjson use linear search by default, beside for simple implementation. When the key number of object is small, there is no significant difference between linear search and binary search(or map find). I noticed in many practical projects (for example HTTP API body), there is only several keys in a json object, and so the default behavior of rapidjson works well in such cases.
The idea is give the choice to programmer, who can decide whether there is need to sort the object before (many repeated) FindMembers.
Sort object requires no extra memory, but after any AddMember(), the object may not keep full sorted. So this basic approach is especially suitable for read-only json document/tree. We could further extend the GenericValue::Flag to mark some object need to reserve sorted in AddMember(), while in trade of O(N) complexity.
We can call the read-only sorted flag as "Week Sorted", the writeable reserved sorted flag as "Strong Sorted", however I think the later flag may be less useful than the previous one.

@miloyip
Copy link
Collaborator

miloyip commented Jan 4, 2022

There is an analysis long time ago #102 .
Maybe a post-process which sorts on all objects in the tree, and then using another API (such as the one you propose), can be an option.

@dinomight
Copy link

This kind of sounds like a post-process routine that indexes the data. That doesn't seem like it's something GenericValue itself should do, but instead some utility class wrapping a GenericValue could do. Mainly because moving all that content around including any underlying objects/strings/values just to order things sounds super expensive when you could instead build up an index of all the keys and map them directly to pointers held within the wrapped GenericValue. This starts to get even more tempting if you have C++17 std::string_view in your tool kit to potentially avoid copying the key values. The sorting would generally be cheap for small containers (where O(N) isn't bad, but indexing is "expensive"), but really expensive on larger containers (where the overhead of moving stuff around may exceed the indexing overhead). It'd be interesting to see where the break even point is on size of the JSON object vs number of lookups and whether just doing the linear search is better than sorting/binary search or better than indexing.

In my past experience doing optimizations, string comparisons are kind of your enemy. I would fear even with O(log N) lookup, you'd probably still find performance lacking if you are doing enough key lookups. That's why even if you had std::multimap, you'd probably still be unhappy. If you start moving into the unordered containers, like std::unordered_multimap, you start to get better performance and may start to find the string hash function as the primary hotspot. That all leads to my recommendation of some sort of indexing wrapper around a GenericValue so that the user can better control the performance characteristics. It could even be able to decide if it should index or linear search based on the size of the JSON object if there is performance data to support one way over the other.

@lymslive
Copy link
Author

lymslive commented Jan 6, 2022

Yes, sorting object can be considered as a kind of post-process, and easily put in other util lib, and at first I just try to write a free function something like:

namespace jsonutil
{
void SortObject(rapidjson::Value& json);
const raidjson::Value* FindMemberBinary(const rapidjson::Value& json, const std::string& name);
}

But with this, it cannot be used in oop manner. I think it would be better if we have official support for this, then can write as json.SortObject() and json.FindMemberBinary(name).
I Agree it is index wrapper system if resort to std::multimap or std::unordered_multimap, but not necessary when sort object in place, as it requires no extra memory and fully back compatible in layout.

When compare key name string, I also have another opinion. Since we always have GetStringLength() available, can we compare string length first, then string content, saying smaller string is LESS THAN larger string, it would be faster than typical strcmp. The idea is that we only need a determined order among object keys, to make binary search run correctly and find some specified key.

I also noticed there is reserved space for hash code in string type rapidjson::Value, which seams not used up by now.

@dinomight
Copy link

My primary concern with a sort and binary search of the JSON object is more the difficulty in knowing when the linear search will have better performance characteristics. Having to move objects around in the layout isn't going to be cheap in some scenarios and may require some space to hold elements while performing swaps. Additionally, since there's reliance on a precondition, it would be very easy for a person to misuse the API and forget the sort call (and that's ignoring a modifiable object where additional items being added won't guarantee the sort order is maintained).

std::multimap wouldn't be an index, but is instead equivalent to the sort and binary search we're talking about here, just with the extra memory overhead. std::unordered_multimap is closer to an indexed approach as it will hash the keys for near constant time lookup. std::unordered_multimap is a C++11 feature and is probably why RapidJSON is targeting the ordered version when you enable members map.

My curiosity had me looking at another library's implementation of key lookups (simdjson, ArduinoJSON). They have a linear search as well through the elements and have their own string compare methods (which are quite close to the rapidjson approach, believe it or not). Makes me wonder if optimizing key lookups in a JSON object are not actually a high priority for most users of JSON libraries as the key lookup itself is "fast enough" to get the values to the places where they are accessed a lot (meaning the key lookup isn't in the performance critical path) or the key lookup isn't used at all (instead, iterate through members on their own so string compare happens once per element and there is no lookup).

GenericValue has a method for string comparison that's called StringEqual that's used for DoFindMember, which is the underlying method used to do member lookup within an object. It does verify equivalent string length before using memcmp to compare the memory contents. strcmp is not invoked.

It might be better to provide users the option of a more optimized key lookup at the cost of insertion overhead (meaning it would impact parse times). In that case, the values could enforce sorted order always or provide some other mechanism to enable optimized key lookups. I don't know how simple that would be to provide in the current architecture.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants