Parse json containing an object with unknown keys as a sorted vector of tables #291

trws · 2024-08-26T15:50:48Z

I've been very impressed with flatcc so far, but I have run into a challenge I can't work my way out of. I have an application that makes heavy use of json, where we would really like to switch incrementally to a more structured, schema-oriented format, and have the option to go binary later. It's mainly C with some C++, so flatcc looks like a fantastic fit. The trick is, we have a lot of existing code and files with pre-generated json that we need to keep supporting. I've been doing some tests with one of the more horrendous examples, a 400+MB json document, a very small chunk of which looks like this:

{
    "R_lite": [
      {
        "rank": "0",
        "children": {
          "core": "0-23"
        }
      },
      {
        "rank": "1-60",
        "children": {
          "core": "0-15"
        }
      }
    ]
}

The "children" key of each entry under "R_lite" is a generic map of string to string where the key is a resource type and the value is an IDset (range-encoded set of ints separated by commas). There are a few things like this in our json documents, and I can't seem to work out how to make a schema that can parse these without hitting "expected array" or similar. IIRC they could be handled as "unknown fields" but that wouldn't get me the data I need. Here's what I've been trying:

table pair
{
    key: string (required, key);
    val: string;
}

table Rlite
{
    rank: string;
    children: [pair] (sorted);
}

Hence the "expected array" error. Is there a way to mark in the schema that a vector of tables should be parsed/generated as an object rather than a json array, or some other way to parse this json with jsoncc?

mikkelfj · 2024-08-26T18:12:01Z

Maintainer

Thanks for the positive feedback.
There isn't really a way to do what you want with the JSON syntax you present, except by using something like jq to translate the data to a new syntax before piping it into the FlatCC JSON parser, but that might also work rather well in practice.

From memory without testing, you can create a new syntax using your suggested schema:

{
    "R_lite": [
      {
        "rank": "0",
        "children": [
          { "key": "core", "val": "0-23" }
        ]
      },
      {
        "rank": "1-60",
        "children": [
          { "key": "core", "val": "0-15" }
        ]
      }
    ]
}

You can add a "key" attribute to the key field in the schema, as you have already done, which lets FlatBuffers in general support sorting and binary searching. FlatCC specifically also supports the sorted attribute (which you also use), which causes it to generate a sort function that traverses the entire buffer and sorts all fields marked as sorted. You can also call the sort operation on specific fields individually without the sorted attribute. That has to do with allowing a table with a key attribute to be used both in sorted and unsorted tables. Note that FlatCC does not automatically sort during construction, for technical and performance reasons, but it sorts a newly created buffer in place if asked to do so. FlatCC also supports unsorted search through linear scan methods.
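To make the sort-then-search idea concrete, here is a minimal sketch in plain C. It is only an analogy: `pair_t`, `pairs_sort` and `pairs_find` are hypothetical names invented for illustration, and none of this is the code that flatcc generates for the schema. It shows an explicit sort step after construction, followed by a binary search on the key.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical plain-C stand-in for the `pair` table; not flatcc-generated code. */
typedef struct {
    const char *key;
    const char *val;
} pair_t;

/* Order pairs by key, the role the `key` attribute plays in the schema. */
static int pair_cmp(const void *a, const void *b)
{
    return strcmp(((const pair_t *)a)->key, ((const pair_t *)b)->key);
}

/* Explicit sort after construction; nothing is sorted automatically. */
static void pairs_sort(pair_t *pairs, size_t n)
{
    qsort(pairs, n, sizeof pairs[0], pair_cmp);
}

/* Binary search; only valid once pairs_sort() has run. Returns NULL if absent. */
static const char *pairs_find(const pair_t *pairs, size_t n, const char *key)
{
    assert(key != NULL);
    pair_t probe = { key, NULL };
    const pair_t *hit = bsearch(&probe, pairs, n, sizeof pairs[0], pair_cmp);
    return hit ? hit->val : NULL;
}
```

Calling `pairs_find` on an unsorted array would give unreliable results, which is why the sort has to be requested explicitly before any keyed lookup.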

A more FlatBuffers-native, but less JSON-native, structure would separate the keys and values into two arrays. Here you would search the key array and use the resulting index to find the value. This approach is much more efficient since it does not need to construct an object for each item, but either way should work:


table Rlite
{
    rank: string;
    childkeys: [string];
    childvals: [string];
}

I have not shown the corresponding JSON, but I guess you have the idea by now, or please ask again.
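As a rough illustration of the split layout, the sketch below searches a key array and uses the resulting index into a parallel value array. It is plain C over ordinary string arrays, not the flatcc reader API. The function names and the JSON shape in the comment are assumptions, since the thread does not show either.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Assumed JSON shape for the split schema (not shown in the thread):
 *   { "rank": "0", "childkeys": ["core"], "childvals": ["0-23"] }
 */

/* Linear scan over the key array; returns the index, or -1 when not found. */
static ptrdiff_t child_index(const char *const *keys, size_t n, const char *key)
{
    assert(key != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(keys[i], key) == 0) return (ptrdiff_t)i;
    }
    return -1;
}

/* The index found in the key array selects the entry in the value array. */
static const char *child_value(const char *const *keys, const char *const *vals,
                               size_t n, const char *key)
{
    ptrdiff_t i = child_index(keys, n, key);
    return i < 0 ? NULL : vals[i];
}
```

The two arrays must stay the same length and in the same order, so any sort applied to the keys has to permute the values identically.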

Note that in this case there are neither sorted nor key attributes, but you should still be able to sort. I guess you could add the sorted attribute to the childkeys field, but it has been a while since I worked with this.

Incidentally, your approach, or my suggested alternative, is what I recommend when you are dealing with arbitrary key-value data. This is one reason FlatCC does not support FlexBuffers, which are available in Google's C++ project. I feel that JSON works just as well as any other format when you have odd data, and FlatCC has a comparatively faster JSON to FlatBuffers parser. That said, you might also want to look into FlexBuffers, but it probably won't help you with parsing JSON directly, and it will be slower.

Side note:
I have been working on and off on a separate JSON superset configuration language that can understand common JSON patterns, and potentially parse them into FlatBuffers, but it is sort of low priority slow work, and it would not be portable to FlatBuffers tooling, except as data in form of flatbuffers. A use case could be parsing Open Street Map data.

11 replies

mikkelfj Aug 28, 2024
Maintainer

A quick search suggests that jq might not be fast enough for you:
https://stackoverflow.com/questions/62825963/improving-performance-when-using-jq-to-process-large-files

If you only have a few different patterns, a custom C JSON preparser might work better, as I also hinted at earlier on.

trws Aug 28, 2024
Author

Oh! Of course, that makes sense. We actually use jq all over the project for testing and other things (all json formatting, basically), but yeah, it wouldn't be fast enough in our messaging setup. If you're curious, everything is open source over at github.com/flux-framework, mainly in the flux-core repository. We mostly use jansson, which is ergonomic and nice but not all that fast. It would however work just fine for parsing substrings where we hit a generic json value while parsing the rest of a message with flatcc. Is there an example or similar for a custom encoding of an element, or for getting the skipped spans while parsing a json document?

mikkelfj Aug 28, 2024
Maintainer

Generic json parser.

flatcc/src/runtime/json_parser.c, line 639 in b09f8f4:

    const char *flatcc_json_parser_generic_json(flatcc_json_parser_t *ctx, const char *buf, const char *end)

It might not be super optimised as it is used for edge cases, but it should still be decent.
There are plenty of ways to hack a parser to be much simpler and faster if you can assume the input source is either valid or will be rejected later. Note that the generated flatcc JSON parser uses a trick where it reads 8 bytes at a time, then treats them as a big endian number in order to get the early characters most significant (I don't recall all the details, but little endian was a bad fit). You can then shift and mask the trailing bytes and compare to a known constant. In this way you can very quickly search for keywords up to 8 bytes in length. I believe the parser also uses a trick for quickly testing if a character is present within a given 64-bit word, which is useful for looking for curly brackets while skipping space. I certainly have such code somewhere. This is not exactly SIMD level, but it is fairly fast and portable.
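The 8-byte keyword comparison can be sketched roughly as follows. This is not taken from the flatcc sources: `load_be64` and `starts_with_keyword` are hypothetical helpers, and the byte-by-byte load is chosen for portability and clarity rather than speed. It only illustrates the idea of loading big endian, masking the trailing bytes, and comparing against a constant.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load up to 8 bytes as a big-endian 64-bit word, zero-padded on the right,
 * so the first character ends up in the most significant byte. */
static uint64_t load_be64(const char *buf, size_t len)
{
    uint64_t w = 0;
    size_t n = len < 8 ? len : 8;
    for (size_t i = 0; i < n; ++i) {
        w |= (uint64_t)(unsigned char)buf[i] << (56 - 8 * i);
    }
    return w;
}

/* Test whether buf (len bytes available) starts with keyword kw (1..8 bytes)
 * by masking off the bytes that follow the keyword and comparing once. */
static int starts_with_keyword(const char *buf, size_t len, const char *kw)
{
    assert(buf != NULL && kw != NULL);
    size_t kwlen = strlen(kw);
    if (kwlen == 0 || kwlen > 8 || len < kwlen) return 0;
    /* Keep the top kwlen bytes; a shift by 64 is undefined, so handle 8 apart. */
    uint64_t mask = kwlen == 8 ? ~UINT64_C(0) : ~(~UINT64_C(0) >> (8 * kwlen));
    return (load_be64(buf, len) & mask) == load_be64(kw, kwlen);
}
```

In a real scanner the keyword constants would be precomputed, so each candidate costs one load, one mask and one compare.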

Altogether, you can just create a state machine scanning for keywords, then look for where to modify the stream, maybe create a logging array for changes, then scan over that array while copying the input buffer to output. That would be more than fast enough, and easy if you don't have to do it too many times in a different manner.
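A very reduced sketch of such a preparser is given below. It is a single-pass simplification (it rewrites while copying instead of logging changes first), and it rests on loud assumptions: the input is valid JSON, the key to rewrite is the hard-coded "children" from the example, and the object under it is a flat map of string to string. `copy_string` and `rewrite_children` are hypothetical names, not part of flatcc.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy one JSON string token, quotes included, keeping backslash escapes.
 * Assumes *in points at the opening quote. */
static void copy_string(const char **in, char **out)
{
    const char *p = *in;
    char *o = *out;
    *o++ = *p++;                               /* opening quote */
    while (*p && *p != '"') {
        if (*p == '\\' && p[1]) *o++ = *p++;   /* keep the escape prefix */
        *o++ = *p++;
    }
    if (*p == '"') *o++ = *p++;                /* closing quote */
    *in = p;
    *out = o;
}

/* Rewrite every  "children": { "k": "v", ... }
 * into           "children": [{"key":"k","val":"v"}, ...]
 * Returns the output length. The caller provides `out`; under the stated
 * assumptions 4 * strlen(in) + 1 bytes is a conservative size. */
static size_t rewrite_children(const char *in, char *out)
{
    static const char kw[] = "\"children\"";
    char *o = out;
    assert(in != NULL && out != NULL);
    while (*in) {
        if (*in != '"') { *o++ = *in++; continue; }
        int is_kw = strncmp(in, kw, sizeof kw - 1) == 0;
        copy_string(&in, &o);
        if (!is_kw) continue;
        while (*in == ':' || *in == ' ' || *in == '\t' || *in == '\r' || *in == '\n')
            *o++ = *in++;                      /* keep colon and whitespace */
        if (*in != '{') continue;              /* not an object: leave as is */
        ++in;
        *o++ = '[';
        for (int first = 1; ; first = 0) {
            in += strspn(in, " \t\r\n,");      /* skip separators before a key */
            if (*in != '"') break;             /* '}' ends the object */
            o += sprintf(o, "%s{\"key\":", first ? "" : ",");
            copy_string(&in, &o);
            in += strspn(in, " \t\r\n:");      /* skip to the value */
            o += sprintf(o, ",\"val\":");
            if (*in == '"') copy_string(&in, &o);   /* string values only */
            *o++ = '}';
        }
        if (*in == '}') ++in;
        *o++ = ']';
    }
    *o = '\0';
    return (size_t)(o - out);
}
```

For a 400+MB document one would probably stream this in chunks rather than hold both buffers in memory, and add handling for non-string values, which this sketch deliberately leaves out.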

mikkelfj Aug 28, 2024
Maintainer

BTW: I did read about flux in detail but I have set up a slurm cluster recently and managed cluster messaging with Flatbuffers over MQTT, so it does look like it is in that ballpark.

mikkelfj Aug 29, 2024
Maintainer

quickly testing if a character is present within a given 64-bit word

https://graphics.stanford.edu/~seander/bithacks.html
section: Determine if a word has a byte equal to n
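For reference, a 64-bit form of that bit hack could look like the sketch below. The constants follow the technique described on the linked page, widened from 32 to 64 bits. The function names and the `load_word` helper are made up for this example and are not flatcc identifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read 8 bytes into a word; byte order is irrelevant for a presence test. */
static uint64_t load_word(const char *p)
{
    uint64_t w;
    assert(p != NULL);
    memcpy(&w, p, sizeof w);
    return w;
}

/* Nonzero if any byte of v is 0x00. */
static uint64_t has_zero_byte(uint64_t v)
{
    return (v - UINT64_C(0x0101010101010101)) & ~v & UINT64_C(0x8080808080808080);
}

/* Nonzero if any byte of v equals c: the XOR turns matching bytes into zero. */
static uint64_t has_byte(uint64_t v, unsigned char c)
{
    return has_zero_byte(v ^ (UINT64_C(0x0101010101010101) * c));
}
```

A scanner skipping whitespace can test a whole word for '{' this way and only fall back to a byte loop when the result is nonzero.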
