Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Member query performance #102

Open
miloyip opened this issue Aug 11, 2014 · 3 comments
Open

Member query performance #102

miloyip opened this issue Aug 11, 2014 · 3 comments

Comments

@miloyip
Copy link
Collaborator

miloyip commented Aug 11, 2014

From the beginning of RapidJSON, performance of FindMember() is O(n). I personally also jot this down a long time ago.

https://code.google.com/p/rapidjson/issues/detail?id=5&can=1&q=performance

However, today I got e-mail complaining on this again.

I cannot think of a good solution for this. But I would like to write down the thoughts here for public discussion.

String equality test

When querying member by a key, it must involve string equality test operation.

Currently the string equality test is implemented as:

bool GenericValue::StringEqual(const GenericValue& rhs) const {
    return data_.s.length == rhs.data_.s.length &&
        (data_.s.str == rhs.data_.s.str // fast path for constant string
        || memcmp(data_.s.str, rhs.data_.s.str, sizeof(Ch) * data_.s.length) == 0);
}

There are three condition checks:

  1. Strings are unequal if their lengths are unequal. O(1) because length is specified or pre-calculated.
  2. Constant string shortcut for strings pointed to same address (and have same length). O(1)
  3. Compare memory for O(m), where m is the length of both strings. The worst case is when two strings are equal.

An idea for improvement is to hash the string into a hash code. I noticed that it is possible to add 32-bit hash code without additional memory overhead, and have considered this in the design of the DOM:

class GenericValue {
// ...
    struct String {
        const Ch* str;
        SizeType length;
        unsigned hashcode;  //!< reserved
    };  // 12 bytes in 32-bit mode, 16 bytes in 64-bit mode
};

The hashcode can be evaluated during parsing with an new option. And then a new variation of SetString() or constructor can evaluate the hash code of a string. Finally, use that string to query member. We may initialize hashcode = 0 to represent an invalid hash code. Then the code may become:

bool GenericValue::StringEqual(const GenericValue& rhs) const {
    if (data_.s.length != rhs.data_.s.length) return false;
    if (data_.s.str == rhs.data_.s.str) return true;  // fast path for constant string
    if (data_.s.hashcode == 0 || rhs.data_.s.hashcode == 0 || data_.s.hashcode == rhs.data_.s.hashcode)
        return memcmp(data_.s.str, rhs.data_.s.str, sizeof(Ch) * data_.s.length) == 0;
    return false;
}

Although the worst case of the test is still O(m), the test can be finished in O(1) when two strings are unequal (thus their hash codes are unequal).

For member query, most string equal tests should be false, so this may improve performance. However, there will be additional O(m) costs for evaluating hash code of each string initially.

Associative Array

A JSON object is basically an associative array, which maps string keys to JSON values.
Currently members are stored in a std::vector like manner. Order of members are controlled by user, which depends on the sequence of AddMember() or the JSON being parsed. This is actually an important feature when user want to maintain the order of members.

However, of course, the query time is O(n) without additional data structure.

There are other possibilities for representing associative array:

  1. Sorted array.

    After parsing an object, the members are sorted by their keys by O(n log n). Binary search can be used in query, improving query time to O(log n). Adding new member needs sorting again later (amortized O(log n)). No additional space overhead.

    Besides, note that this requires comparison of string lexicographically (not equality), which is an O(m) operation for m equals to minimal length of two strings.

  2. Hash table.

    The simple way is using open addressing, so that it will still be a single buffer. Hash code can be computed as in the last section. Insert, query and remove is O(1). When adding new member and the load ratio (member count divided by capacity) is over a constant, the hash table need to be rebuild by O(n). In addition, iterating members will be linear to the capacity, not the count.

    Current (Vector) Sorted Array Hash Table
    Custom Order Yes No No
    Sorted by Key May be Yes No
    Initialization O(n) O(n log n) O(n)
    AddMember Amortized O(1) Amortized O(log n) Amortized O(1)
    RemoveMember O(n) O(n) O(1)
    FindMember O(n) O(log n) O(1)
    IterateMember O(n) O(n) O(capacity)
    Resize O(n) O(n) O(capacity)
    ObjectEquality O(n^2) O(n) O(n)
    Space O(capacity) O(capacity) O(capacity)
  • n = number of member
  • The hidden m (string length of key) is not shown.
  • In hash table, n <= capacity * load_ratio. In others, n <= capacity.

If we want to support either or all of these data structures, we have two options: (1) use template parameter to specify which to use in compile-time; or (2) use flags to specify in run-time with overheads in all related operations.

@resty-daze
Copy link

I think that you can use hash table with a linked list to achieve most functions here.

Just make HashTable Node something like:

struct Node {
     T *value;
     Node* prev;
     Node* next;
};

You can choose to use LinkList or HashTable when implement different functions. Only cost is the memory space to save the pointers.

@geniushuai
Copy link

why not just only hash for GenericMember.name. Not necessary for each string.

@gvollant
Copy link
Contributor

the commit 71f0fa7 added a map (so a sorted array)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants