Expose the underlying tapes as ArrayBuffer #34

nojvek · 2020-04-16T12:09:41Z

Reading the code, I see that the fast part of simdjson is parsing the JSON bytes and creating two buffers (tapes). One is the JSON tape, which marks the start, end, and type of each element. The other is a string tape that contains the parsed strings in UTF-8.

JavaScript offers a nice way of fast buffer indexing and getting our values via TypedArrays and ArrayBuffers. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays

This would mean that the iteration part of getting values out could be done in pure JS. I.e., it would be technically possible to stream the buffers as binary data to the browser and have the JSON iteration work there too.
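As a rough sketch of what reading a single tape word in pure JS could look like: this assumes little-endian 64-bit words with the type tag in the most significant byte and the payload in the remaining 56 bits, which is the layout the dumper further down this thread relies on. `decodeTapeWord` is a made-up name, and the word is built by hand rather than produced by simdjson.

```javascript
// Split one 64-bit tape word into its type tag and payload.
// Assumed layout: little-endian word, type in the top byte, payload in the low 56 bits.
function decodeTapeWord(view, byteOffset) {
  const word = view.getBigUint64(byteOffset, true);
  const type = String.fromCharCode(Number(word >> 56n));
  const payload = word & 0x00ffffffffffffffn;
  return { type, payload };
}

// Hand-built word for illustration: type '"' (string) with payload 42.
const buf = new ArrayBuffer(8);
const view = new DataView(buf);
view.setBigUint64(0, (BigInt('"'.charCodeAt(0)) << 56n) | 42n, true);

const decoded = decodeTapeWord(view, 0);
console.log(decoded.type, decoded.payload);
```

Nothing here touches Node-specific APIs, so the same function should work on an ArrayBuffer received in a browser.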

Or one could dump the tapes as files and get zero-cost parsing by simply mmapping a file and iterating over gigabytes of JSON tape, like FlatBuffers.

https://google.github.io/flatbuffers/
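A minimal sketch of the dump-and-reload idea, with caveats: Node has no built-in mmap, so this uses `fs.readFileSync`, which copies the file into memory (a true zero-copy version would need an mmap addon). The two-word tape and the temp file name are invented for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical tape: two little-endian 64-bit words, both of type 'r'.
const tape = new DataView(new ArrayBuffer(16));
tape.setBigUint64(0, (BigInt('r'.charCodeAt(0)) << 56n) | 1n, true);
tape.setBigUint64(8, BigInt('r'.charCodeAt(0)) << 56n, true);

// Dump the raw tape bytes to disk.
const file = path.join(os.tmpdir(), `tape-${process.pid}.buffer`);
fs.writeFileSync(file, new Uint8Array(tape.buffer));

// Load them back and view the bytes in place (no parsing step).
const loaded = fs.readFileSync(file);
const loadedView = new DataView(loaded.buffer, loaded.byteOffset, loaded.length);
const firstType = String.fromCharCode(loadedView.getUint8(7));
const firstPayload = loadedView.getUint32(0, true);
console.log(firstType, firstPayload);

fs.unlinkSync(file); // clean up the temp file
```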

I also don’t think lazyParse as the only function is a great interface. The underlying simdjson has a concept of elements and iterators, and JavaScript has a similar concept of iterators too. Without that, one would need to resort to Proxy hacks, which are a bit too magical at times. I think we can expose a much nicer object/array iterator-based interface for the underlying tape.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Iterators_and_Generators

https://codeburst.io/a-simple-guide-to-es6-iterators-in-javascript-with-examples-189d052c3d8e

This would mean there’d be two submodules. One is a fast jsonStr -> {jsonTape, strsTape}.

The other takes {jsonTape, strsTape} -> elemIterator.
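To make the second submodule concrete, here is a sketch of a generator that walks the two tapes and yields one entry per element. `tapeEntries` and `word` are hypothetical names, the tape for `["hi", 7]` is built by hand, and its structural payload values are illustrative rather than guaranteed to match what simdjson emits.

```javascript
const WORD = 8; // bytes per 64-bit tape word

// Hypothetical helper: yields one {type, value} entry per tape element.
function* tapeEntries(tapeView, strView) {
  const decoder = new TextDecoder(); // global in Node 11+ and in browsers
  for (let i = 0; i < tapeView.byteLength; i += WORD) {
    const type = String.fromCharCode(tapeView.getUint8(i + 7));
    switch (type) {
      case 'r': case '[': case ']': case '{': case '}':
        // Structural entries: the payload is an index into the tape.
        yield { type, value: tapeView.getUint32(i, true) };
        break;
      case '"': {
        // Payload is an offset into the string tape: 4-byte length, then bytes.
        const strIdx = tapeView.getUint32(i, true);
        const strLen = strView.getUint32(strIdx, true);
        const bytes = new Uint8Array(strView.buffer, strView.byteOffset + strIdx + 4, strLen);
        yield { type, value: decoder.decode(bytes) };
        break;
      }
      case 'l': // the number itself lives in the next tape word
        i += WORD;
        yield { type, value: tapeView.getBigInt64(i, true) };
        break;
      case 'u':
        i += WORD;
        yield { type, value: tapeView.getBigUint64(i, true) };
        break;
      case 'd':
        i += WORD;
        yield { type, value: tapeView.getFloat64(i, true) };
        break;
      case 't': yield { type, value: true }; break;
      case 'f': yield { type, value: false }; break;
      case 'n': yield { type, value: null }; break;
      default:
        throw new Error(`unknown tape type ${type}`);
    }
  }
}

// Hand-built tape for ["hi", 7]; payloads are illustrative.
const word = (type, payload = 0n) => (BigInt(type.charCodeAt(0)) << 56n) | payload;
const words = [word('r', 6n), word('[', 6n), word('"', 0n), word('l'), 7n, word(']', 1n), word('r')];
const tapeView = new DataView(new ArrayBuffer(words.length * WORD));
words.forEach((w, k) => tapeView.setBigUint64(k * WORD, w, true));

// String tape: length 2, then "hi", then a zero terminator byte.
const strView = new DataView(new ArrayBuffer(7));
strView.setUint32(0, 2, true);
strView.setUint8(4, 'h'.charCodeAt(0));
strView.setUint8(5, 'i'.charCodeAt(0));

const entries = [...tapeEntries(tapeView, strView)];
for (const e of entries) console.log(e.type, e.value);
```

A generator keeps the walk lazy, so a caller can stop early with `break` without decoding the rest of the tape.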

Hopefully I’m making sense.

I’m happy to write the JS part of the code. I just need to figure out how to export the buffers using N-API.

luizperes · 2020-04-16T18:43:23Z

Thanks @nojvek, that does make sense! I've been meaning to add the rest of the API (but failed to document that in an issue).

The iterator part, in my opinion, should stay on the C++ side, since simdjson already has an API for "JSON Pointer" and support for iterators (https://github.com/simdjson/simdjson/blob/master/doc/basics.md#json-pointer). If we reimplemented it in this library, we would have to update our "internal" API (or rewrite it, depending on the change) every time upstream changed its own API, just to keep it up to date.

nojvek · 2020-04-16T20:39:55Z

Not opposed to the idea of using the existing C++ interface.

I wrote a pure JS tape dumper because I wanted to understand the underlying mechanics of simdjson. Some neat ideas.

const fs = require(`fs`);
const carsTapeBuffer = fs.readFileSync(`${__dirname}/tape.buffer`);
const carsStrBuffer = fs.readFileSync(`${__dirname}/str.buffer`);

const TapeType = {
  ROOT: 'r'.charCodeAt(0),
  START_ARRAY: '['.charCodeAt(0),
  START_OBJECT: '{'.charCodeAt(0),
  END_ARRAY: ']'.charCodeAt(0),
  END_OBJECT: '}'.charCodeAt(0),
  STRING: '"'.charCodeAt(0),
  INT64: 'l'.charCodeAt(0),
  UINT64: 'u'.charCodeAt(0),
  DOUBLE: 'd'.charCodeAt(0),
  TRUE_VALUE: 't'.charCodeAt(0),
  FALSE_VALUE: 'f'.charCodeAt(0),
  NULL_VALUE: 'n'.charCodeAt(0),
};

/**
 * @param {DataView} tapeBufView
 * @param {DataView} strBufView
 */
function dumpTape(tapeBufView, strBufView) {
  console.log(tapeBufView);
  console.log(strBufView);
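  const size64 = 8; // sizeof(uint64_t), one tape word
  const size32 = 4; // sizeof(uint32_t), the string length prefix
  const textDecoder = new TextDecoder();

  for (let tapeIdx = 0, len = tapeBufView.byteLength; tapeIdx < len; tapeIdx += size64) {
    // The type tag is the most significant byte of each little-endian word.
    const elemType = tapeBufView.getUint8(tapeIdx + 7);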
    switch (elemType) {
      case TapeType.ROOT:
      case TapeType.START_ARRAY:
      case TapeType.START_OBJECT:
      case TapeType.END_ARRAY:
      case TapeType.END_OBJECT: {
        const offset = tapeBufView.getUint32(tapeIdx, true);
        console.log(String.fromCharCode(elemType), offset);
        break;
      }
      case TapeType.TRUE_VALUE:
      case TapeType.FALSE_VALUE:
      case TapeType.NULL_VALUE: {
        console.log(String.fromCharCode(elemType));
        break;
      }
      case TapeType.STRING: {
        // The payload is an offset into the string buffer, where a 4-byte
        // length prefix is followed by the string bytes.
        const strIdx = tapeBufView.getUint32(tapeIdx, true);
        const strLen = strBufView.getUint32(strIdx, true);
        const str = textDecoder.decode(new DataView(strBufView.buffer, strBufView.byteOffset + strIdx + size32, strLen));
        console.log(String.fromCharCode(elemType), str);
        break;
      }
      case TapeType.INT64: {
        // Numbers occupy the next tape word, so skip ahead before reading.
        tapeIdx += size64;
        const elemVal = tapeBufView.getBigInt64(tapeIdx, true);
        console.log(String.fromCharCode(elemType), elemVal);
        break;
      }
      case TapeType.UINT64: {
        tapeIdx += size64;
        const elemVal = tapeBufView.getBigUint64(tapeIdx, true);
        console.log(String.fromCharCode(elemType), elemVal);
        break;
      }
      case TapeType.DOUBLE: {
        tapeIdx += size64;
        const elemVal = tapeBufView.getFloat64(tapeIdx, true);
        console.log(String.fromCharCode(elemType), elemVal);
        break;
      }
      default: {
        throw new Error(`unknown type ${elemType}, this should never happen`);
      }
    }
  }
}

dumpTape(
  new DataView(carsTapeBuffer.buffer, carsTapeBuffer.byteOffset, carsTapeBuffer.length),
  new DataView(carsStrBuffer.buffer, carsStrBuffer.byteOffset, carsStrBuffer.length)
);

luizperes · 2020-04-16T22:55:28Z

I see, with your example, I re-read what you wrote and that makes a lot of sense. So would it suffice if I exposed the two available tapes? (I only need to check if it is possible to access them within C++ without modifying the headers, preferably)

nojvek · 2020-04-17T00:23:44Z

That would be great if it's easily exposable without having a big perf impact.

luizperes · 2020-04-17T07:56:11Z

Hi @nojvek,

I implemented what you asked on the branch buffers. Creating ArrayBuffers seems to have some overhead in N-API, and it doesn't actually improve performance as we initially thought it would. Funny thing is: if I switch it to an External (keeping a C++ pointer to it), we then see the good results (> 1 GB/s).

Do you know if there is a way of converting an external object into an array buffer?

nojvek changed the title ~~Expose the underlying json tape as ArrayBuffer~~ Expose the underlying tapes as ArrayBuffer Apr 16, 2020

luizperes added a commit that referenced this issue Apr 17, 2020

Trying different approaches, such as exposing tapes (#34)

0980ec6

luizperes added a commit that referenced this issue Apr 17, 2020

Trying different approaches, such as exposing tapes (#34)

f9f29f2

luizperes added a commit that referenced this issue Apr 17, 2020

Trying different approaches, such as exposing tapes (#34)

20b2ece

luizperes added a commit that referenced this issue Apr 17, 2020

Trying different approaches, such as exposing tapes (#34)

43edd6b
