Member query performance #102

miloyip · 2014-08-11T16:07:00Z

Since the beginning of RapidJSON, the performance of FindMember() has been O(n). I noted this myself a long time ago:

https://code.google.com/p/rapidjson/issues/detail?id=5&can=1&q=performance

However, today I received another e-mail complaining about this.

I cannot think of a good solution for this, but I would like to write down my thoughts here for public discussion.

String equality test

Querying a member by key necessarily involves a string equality test.

Currently the string equality test is implemented as:

bool GenericValue::StringEqual(const GenericValue& rhs) const {
    return data_.s.length == rhs.data_.s.length &&
        (data_.s.str == rhs.data_.s.str // fast path for constant string
        || memcmp(data_.s.str, rhs.data_.s.str, sizeof(Ch) * data_.s.length) == 0);
}

There are three condition checks:

1. Length check: strings are unequal if their lengths are unequal. O(1), because the length is either specified or pre-calculated.
2. Constant string shortcut: strings that point to the same address (and have the same length) are equal. O(1).
3. Memory comparison: O(m), where m is the common length of both strings. The worst case is when the two strings are equal.

One idea for improvement is to hash each string into a hash code. I noticed that it is possible to add a 32-bit hash code without additional memory overhead, and I took this into account in the design of the DOM:

class GenericValue {
// ...
    struct String {
        const Ch* str;
        SizeType length;
        unsigned hashcode;  //!< reserved
    };  // 12 bytes in 32-bit mode, 16 bytes in 64-bit mode
};

The hash code can be evaluated during parsing when a new option is enabled. A new variation of SetString() or of the constructor can then evaluate the hash code of a string, and that string can be used to query a member. We may initialize hashcode = 0 to represent an invalid hash code. The code may then become:

bool GenericValue::StringEqual(const GenericValue& rhs) const {
    if (data_.s.length != rhs.data_.s.length) return false;
    if (data_.s.str == rhs.data_.s.str) return true;  // fast path for constant string
    if (data_.s.hashcode == 0 || rhs.data_.s.hashcode == 0 || data_.s.hashcode == rhs.data_.s.hashcode)
        return memcmp(data_.s.str, rhs.data_.s.str, sizeof(Ch) * data_.s.length) == 0;
    return false;
}

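Although the worst case of the test is still O(m), the test finishes in O(1) whenever both hash codes are valid and differ, which is the case for almost all pairs of unequal strings. Two unequal strings can still share a hash code, in which case memcmp() decides.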

For a member query, most string equality tests are expected to fail, so this may improve performance. However, evaluating the hash code of each string adds an initial O(m) cost per string.
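The hash-assisted equality test can be tried in isolation with a minimal sketch. The HashedString struct, the MakeHashed() helper and the choice of FNV-1a are assumptions made for illustration; they are not RapidJSON types or its actual hash function.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the string part of GenericValue (not RapidJSON's real type).
struct HashedString {
    const char* str;
    std::size_t length;
    std::uint32_t hashcode;  // 0 means "hash code not computed"
};

// 32-bit FNV-1a, picked here only as an example hash function.
inline std::uint32_t Fnv1a(const char* s, std::size_t n) {
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < n; ++i) {
        h ^= static_cast<unsigned char>(s[i]);
        h *= 16777619u;
    }
    return h == 0 ? 1u : h;  // keep 0 reserved for "invalid"
}

// Builds a string whose hash code is evaluated up front (the O(m) initial cost).
inline HashedString MakeHashed(const char* s, std::size_t n) {
    return HashedString{s, n, Fnv1a(s, n)};
}

inline bool StringEqual(const HashedString& a, const HashedString& b) {
    if (a.length != b.length) return false;  // O(1) length check
    if (a.str == b.str) return true;         // O(1) same-address shortcut
    // O(1) rejection: both hash codes are valid and they differ.
    if (a.hashcode != 0 && b.hashcode != 0 && a.hashcode != b.hashcode) return false;
    // Hash codes match or are missing: fall back to the O(m) comparison.
    return std::memcmp(a.str, b.str, a.length) == 0;
}
```

A string with hashcode == 0 still compares correctly; it just never benefits from the O(1) rejection.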

Associative Array

A JSON object is basically an associative array, which maps string keys to JSON values.
Currently members are stored in a std::vector-like manner. The order of members is controlled by the user: it depends on the sequence of AddMember() calls or on the JSON being parsed. This is actually an important feature when the user wants to maintain the order of members.

However, without an additional data structure, the query time is of course O(n).

There are other possibilities for representing associative array:

Sorted array.

After parsing an object, its members are sorted by key in O(n log n). Queries can then use binary search, improving query time to O(log n). Adding a new member requires sorting again later (amortized O(log n)). There is no additional space overhead.

Note that this requires lexicographic string comparison (not just equality), which is an O(m) operation, where m is the length of the shorter of the two strings.
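The sorted-array idea can be sketched with standard algorithms. The Member alias (a string key with an int value) and the function names are assumptions made for illustration, not RapidJSON code.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical member type: a key with an int standing in for a JSON value.
using Member = std::pair<std::string, int>;

// Sort once after parsing: O(n log n) lexicographic key comparisons.
inline void SortMembers(std::vector<Member>& members) {
    std::stable_sort(members.begin(), members.end(),
                     [](const Member& a, const Member& b) { return a.first < b.first; });
}

// Binary search: O(log n) comparisons. Requires members to be sorted by key.
inline const Member* FindSorted(const std::vector<Member>& members, const std::string& key) {
    auto it = std::lower_bound(members.begin(), members.end(), key,
                               [](const Member& m, const std::string& k) { return m.first < k; });
    return (it != members.end() && it->first == key) ? &*it : nullptr;
}
```

std::stable_sort is used so that members with duplicate keys keep their relative order; the original insertion order of distinct keys is lost, as the comparison table notes.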

Hash table.

The simplest way is to use open addressing, so that the object is still stored in a single buffer. The hash code can be computed as in the previous section. Insert, query and remove are O(1). When a new member is added and the load ratio (member count divided by capacity) exceeds a constant, the hash table needs to be rebuilt in O(n). In addition, iterating over members is linear in the capacity, not the count.
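A minimal open-addressing sketch with linear probing shows the single buffer and the rebuild on load ratio. The OpenTable class, the int value type, the maximum load ratio of 1/2 and the use of std::hash are all assumptions made for illustration; removal is left out because it would need tombstones or re-insertion.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical table keyed by string with int values; not RapidJSON code.
class OpenTable {
public:
    explicit OpenTable(std::size_t capacity = 8)
        : slots_(capacity < 2 ? 2 : capacity), count_(0) {}

    void Insert(const std::string& key, int value) {
        // Rebuild (O(n)) when the load ratio would exceed 1/2.
        if ((count_ + 1) * 2 > slots_.size()) Rebuild(slots_.size() * 2);
        Slot& s = slots_[Probe(key)];
        if (!s.used) { s.used = true; s.key = key; ++count_; }
        s.value = value;  // an existing key is overwritten
    }

    const int* Find(const std::string& key) const {
        const Slot& s = slots_[Probe(key)];
        return s.used ? &s.value : nullptr;
    }

    std::size_t Count() const { return count_; }
    std::size_t Capacity() const { return slots_.size(); }

private:
    struct Slot { bool used = false; std::string key; int value = 0; };

    // Index of the slot holding key, or of the first empty slot on its probe path.
    // Terminates because the load ratio stays at or below 1/2.
    std::size_t Probe(const std::string& key) const {
        std::size_t i = std::hash<std::string>()(key) % slots_.size();
        while (slots_[i].used && slots_[i].key != key) i = (i + 1) % slots_.size();
        return i;
    }

    void Rebuild(std::size_t newCapacity) {
        std::vector<Slot> old;
        old.swap(slots_);
        slots_.assign(newCapacity, Slot());
        count_ = 0;
        for (const Slot& s : old)
            if (s.used) Insert(s.key, s.value);
    }

    std::vector<Slot> slots_;  // the single buffer; iterating it costs O(capacity)
    std::size_t count_;
};
```

Iterating the members of such a table means walking slots_ and skipping unused slots, which is why iteration is linear in the capacity rather than in the member count.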

Operation        Current (Vector)   Sorted Array         Hash Table
Custom Order     Yes                No                   No
Sorted by Key    Maybe              Yes                  No
Initialization   O(n)               O(n log n)           O(n)
AddMember        Amortized O(1)     Amortized O(log n)   Amortized O(1)
RemoveMember     O(n)               O(n)                 O(1)
FindMember       O(n)               O(log n)             O(1)
IterateMember    O(n)               O(n)                 O(capacity)
Resize           O(n)               O(n)                 O(capacity)
ObjectEquality   O(n^2)             O(n)                 O(n)
Space            O(capacity)        O(capacity)          O(capacity)

n = number of members.
The hidden factor m (string length of the key) is not shown.
In the hash table, n <= capacity * load_ratio. In the others, n <= capacity.

If we want to support some or all of these data structures, we have two options: (1) use a template parameter to select one at compile time; or (2) use flags to select one at run time, with overhead in all related operations.
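Option (1) could look roughly like the following policy-based sketch. All names here (KeyValue, UnsortedPolicy, SortedPolicy, BasicObject) are hypothetical, and the sorted policy inserts in place at O(n) per insertion instead of re-sorting later, purely to keep the example short.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, int>;  // hypothetical member type

// Keeps insertion order; lookup is a linear scan.
struct UnsortedPolicy {
    static void Insert(std::vector<KeyValue>& m, KeyValue x) { m.push_back(std::move(x)); }
    static const KeyValue* Find(const std::vector<KeyValue>& m, const std::string& k) {
        for (const KeyValue& e : m)
            if (e.first == k) return &e;
        return nullptr;
    }
};

// Keeps members sorted by key; lookup is a binary search.
struct SortedPolicy {
    static bool Less(const KeyValue& e, const std::string& k) { return e.first < k; }
    static void Insert(std::vector<KeyValue>& m, KeyValue x) {
        auto pos = std::lower_bound(m.begin(), m.end(), x.first, Less);
        m.insert(pos, std::move(x));
    }
    static const KeyValue* Find(const std::vector<KeyValue>& m, const std::string& k) {
        auto it = std::lower_bound(m.begin(), m.end(), k, Less);
        return (it != m.end() && it->first == k) ? &*it : nullptr;
    }
};

// The storage strategy is fixed at compile time, so there is no run-time flag check.
template <typename Policy = UnsortedPolicy>
class BasicObject {
public:
    void AddMember(std::string key, int value) {
        Policy::Insert(members_, KeyValue(std::move(key), value));
    }
    const KeyValue* FindMember(const std::string& key) const {
        return Policy::Find(members_, key);
    }
    const std::vector<KeyValue>& Members() const { return members_; }

private:
    std::vector<KeyValue> members_;
};
```

With this shape, BasicObject<> keeps the current behaviour while BasicObject<SortedPolicy> opts into faster lookups, at the price of two distinct types that cannot be mixed freely.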

resty-daze · 2014-08-29T03:27:24Z

I think you can combine a hash table with a linked list to get most of the features here.

Just make the hash table node something like:

struct Node {
     T *value;
     Node* prev;
     Node* next;
};

You can then use either the linked list or the hash table, depending on which function you are implementing. The only cost is the memory needed to store the pointers.
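A rough sketch of this combination, built from standard containers rather than a hand-written node: a std::list keeps the user's member order and a std::unordered_map indexes into it. The OrderedObject class and its int values are assumptions made for illustration, and overwriting a duplicate key is a simplification (JSON objects may contain repeated names).

```cpp
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical object that keeps insertion order and offers hashed lookup.
class OrderedObject {
public:
    using Member = std::pair<std::string, int>;
    using Iterator = std::list<Member>::const_iterator;

    void AddMember(const std::string& key, int value) {
        auto found = index_.find(key);
        if (found != index_.end()) { found->second->second = value; return; }
        members_.emplace_back(key, value);        // the list preserves insertion order
        index_[key] = std::prev(members_.end());  // the hash table points at the list node
    }

    const int* FindMember(const std::string& key) const {
        auto found = index_.find(key);
        return found == index_.end() ? nullptr : &found->second->second;
    }

    bool RemoveMember(const std::string& key) {
        auto found = index_.find(key);
        if (found == index_.end()) return false;
        members_.erase(found->second);  // O(1) unlink; other list iterators stay valid
        index_.erase(found);
        return true;
    }

    // Iteration walks the list, so it is linear in the member count.
    Iterator begin() const { return members_.begin(); }
    Iterator end() const { return members_.end(); }

private:
    std::list<Member> members_;
    std::unordered_map<std::string, std::list<Member>::iterator> index_;
};
```

The extra memory is the pair of list pointers per member plus the hash table itself, and each key is stored twice in this simplified version.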

geniushuai · 2016-03-24T16:35:55Z

Why not hash only GenericMember.name? It is not necessary to hash every string.

gvollant · 2022-08-25T13:38:26Z

Commit 71f0fa7 added a map (so a sorted array).

miloyip added the performance label Aug 12, 2014

miloyip mentioned this issue Aug 28, 2014

Three new APIs are added for JSON object type. #119

Merged

miloyip mentioned this issue Apr 11, 2015

JSON Pointer #297

Closed

8 tasks

miloyip mentioned this issue Apr 19, 2015

Could you publish a release version? henshao/jsoncpp#2

Closed

miloyip added this to the v1.1 Beta milestone Apr 24, 2015

miloyip mentioned this issue Jun 3, 2015

Why member iterator is random access one? #352

Closed

miloyip mentioned this issue Jan 31, 2016

Performance in miloyip/nativejson-benchmark nlohmann/json#202

Closed

miloyip mentioned this issue Mar 4, 2016

Slow member find #570

Closed

miloyip removed this from the v1.1 Beta milestone Apr 15, 2016

miloyip mentioned this issue Apr 24, 2018

Performance between rapidjson.FindMember and unordered_map find #1228

Open

miloyip mentioned this issue Jan 4, 2022

Suggestion: sort object to accelerate FindMember method #1978

Open
