Expose the underlying tapes as ArrayBuffer #34
Comments
Thanks @nojvek, that does make sense! I've been meaning to add the rest of the API (but failed to document that in an issue). The iterator part, in my opinion, should be kept in the C++ side since
Not opposed to the idea of using the existing C++ interface. I wrote a pure JS tape dumper because I wanted to understand the underlying mechanics of simdjson. Some neat ideas.
const fs = require(`fs`);
const carsTapeBuffer = fs.readFileSync(`${__dirname}/tape.buffer`);
const carsStrBuffer = fs.readFileSync(`${__dirname}/str.buffer`);

// Type tags stored in the top byte of each 64-bit tape element.
const TapeType = {
  ROOT: 'r'.charCodeAt(0),
  START_ARRAY: '['.charCodeAt(0),
  START_OBJECT: '{'.charCodeAt(0),
  END_ARRAY: ']'.charCodeAt(0),
  END_OBJECT: '}'.charCodeAt(0),
  STRING: '"'.charCodeAt(0),
  INT64: 'l'.charCodeAt(0),
  UINT64: 'u'.charCodeAt(0),
  DOUBLE: 'd'.charCodeAt(0),
  TRUE_VALUE: 't'.charCodeAt(0),
  FALSE_VALUE: 'f'.charCodeAt(0),
  NULL_VALUE: 'n'.charCodeAt(0),
};

/**
 * Prints every element of the JSON tape, resolving strings via the string tape.
 * @param {DataView} tapeBufView
 * @param {DataView} strBufView
 */
function dumpTape(tapeBufView, strBufView) {
  console.log(tapeBufView);
  console.log(strBufView);
  const size64 = 8; // sizeof(uint64_t)
  const size32 = 4; // sizeof(uint32_t), the string length prefix
  const textDecoder = new TextDecoder();
  for (let tapeIdx = 0, len = tapeBufView.byteLength; tapeIdx < len; tapeIdx += size64) {
    // Elements are little-endian, so the type tag (top byte) sits at byte 7.
    const elemType = tapeBufView.getUint8(tapeIdx + 7);
    switch (elemType) {
      case TapeType.ROOT:
      case TapeType.START_ARRAY:
      case TapeType.START_OBJECT:
      case TapeType.END_ARRAY:
      case TapeType.END_OBJECT: {
        // Only the low 32 bits of the payload are read here.
        const offset = tapeBufView.getUint32(tapeIdx, true);
        console.log(String.fromCharCode(elemType), offset);
        break;
      }
      case TapeType.TRUE_VALUE:
      case TapeType.FALSE_VALUE:
      case TapeType.NULL_VALUE: {
        console.log(String.fromCharCode(elemType));
        break;
      }
      case TapeType.STRING: {
        // The payload is a byte offset into the string tape, where a
        // 32-bit length is followed by the UTF-8 bytes.
        const strIdx = tapeBufView.getUint32(tapeIdx, true);
        const strLen = strBufView.getUint32(strIdx, true);
        const str = textDecoder.decode(
          new DataView(strBufView.buffer, strBufView.byteOffset + strIdx + size32, strLen)
        );
        console.log(String.fromCharCode(elemType), str);
        break;
      }
      case TapeType.INT64: {
        // Numbers live in the tape element that follows the tag.
        tapeIdx += size64;
        const val = tapeBufView.getBigInt64(tapeIdx, true);
        console.log(String.fromCharCode(elemType), val);
        break;
      }
      case TapeType.UINT64: {
        tapeIdx += size64;
        const elemVal = tapeBufView.getBigUint64(tapeIdx, true);
        console.log(String.fromCharCode(elemType), elemVal);
        break;
      }
      case TapeType.DOUBLE: {
        tapeIdx += size64;
        const elemVal = tapeBufView.getFloat64(tapeIdx, true);
        console.log(String.fromCharCode(elemType), elemVal);
        break;
      }
      default: {
        throw new Error(`unknown type ${elemType}, this should never happen`);
      }
    }
  }
}

// A Node Buffer may be a slice of a larger pool, so pass byteOffset and length explicitly.
dumpTape(
  new DataView(carsTapeBuffer.buffer, carsTapeBuffer.byteOffset, carsTapeBuffer.length),
  new DataView(carsStrBuffer.buffer, carsStrBuffer.byteOffset, carsStrBuffer.length)
);
I see. With your example, I re-read what you wrote and it makes a lot of sense. So would it suffice if I exposed the two available tapes? (I only need to check whether they can be accessed from C++, preferably without modifying the headers.)
That would be great if they can be exposed easily without a big perf impact.
Hi @nojvek, I implemented what you asked on the branch buffers. Creating Do you know if there is a way of converting an external object into an array buffer?
Reading the code, I see that the fast part of simdjson is parsing the JSON bytes and creating two buffers/tapes. One is the JSON tape, which marks the start, end, and type of each element. The other is a string tape that contains the parsed strings in UTF-8.
JavaScript offers a fast way of indexing buffers and reading values via TypedArrays and ArrayBuffers. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Typed_arrays
This would mean that the iteration part of getting values out could be done in pure JS, i.e., it would be technically possible to stream the buffers as binary data to the browser and have the JSON iteration work there too.
Or one could dump the tapes as files and get zero-cost parsing by simply mmapping a file and iterating over gigabytes of JSON tape, like FlatBuffers.
https://google.github.io/flatbuffers/
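As a rough sketch of the typed-array indexing idea, the snippet below decodes tape elements through a BigUint64Array. It assumes the layout used by the dumper in this thread (one 64-bit word per element, type tag in the top byte, payload in the low 56 bits); encodeTapeWord and decodeTapeWord are hypothetical names, not part of simdjson or this package. BigUint64Array reads in host byte order, so this also assumes a little-endian machine.

```javascript
// Assumed layout: type tag in the top 8 bits, payload in the low 56 bits.
const PAYLOAD_MASK = (1n << 56n) - 1n;

// Hypothetical helper: packs a type character and payload into one tape word.
function encodeTapeWord(typeChar, payload) {
  return (BigInt(typeChar.charCodeAt(0)) << 56n) | (BigInt(payload) & PAYLOAD_MASK);
}

// Hypothetical helper: splits a tape word back into its type and payload.
function decodeTapeWord(tape, wordIdx) {
  const word = tape[wordIdx];
  return {
    type: String.fromCharCode(Number(word >> 56n)),
    payload: word & PAYLOAD_MASK,
  };
}

// A hand-built tape for an empty array: root, '[', ']', root.
const tape = new BigUint64Array([
  encodeTapeWord('r', 3),
  encodeTapeWord('[', 2),
  encodeTapeWord(']', 1),
  encodeTapeWord('r', 0),
]);

for (let i = 0; i < tape.length; i++) {
  const { type, payload } = decodeTapeWord(tape, i);
  console.log(i, type, payload);
}
```

When wrapping an existing buffer instead of a hand-built one, the byte offset passed to BigUint64Array must be a multiple of 8; a DataView, as in the dumper, avoids that constraint.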
I also don’t think lazyParse as the only function is a great interface. The underlying simdjson has a concept of elements and iterators, and JavaScript has a similar concept of iterators. Otherwise one would need to resort to proxy hacks, which are a bit too magical. I think we can expose a much nicer object/array iterator-based interface for the underlying tape.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Iterators_and_Generators
https://codeburst.io/a-simple-guide-to-es6-iterators-in-javascript-with-examples-189d052c3d8e
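One possible shape for such an interface is a generator that lazily yields tape entries. This is only a sketch: tapeEntries and buildSampleTape are hypothetical names, and the layout (8-byte little-endian words, type tag in byte 7, numbers stored in the word after the tag) is assumed from the dumper in this thread rather than taken from the simdjson headers.

```javascript
// Hypothetical helper: lazily yields one entry per tape element.
function* tapeEntries(tapeView) {
  for (let i = 0; i < tapeView.byteLength; i += 8) {
    const type = String.fromCharCode(tapeView.getUint8(i + 7));
    switch (type) {
      case 'l': // signed integer stored in the next word
        i += 8;
        yield { type, value: tapeView.getBigInt64(i, true) };
        break;
      case 'u': // unsigned integer stored in the next word
        i += 8;
        yield { type, value: tapeView.getBigUint64(i, true) };
        break;
      case 'd': // double stored in the next word
        i += 8;
        yield { type, value: tapeView.getFloat64(i, true) };
        break;
      default: // structural tags, strings, true/false/null
        yield { type, payload: tapeView.getUint32(i, true) };
    }
  }
}

// Hand-built tape for the document [1.5, true], for illustration only.
function buildSampleTape() {
  const view = new DataView(new ArrayBuffer(7 * 8));
  const tag = (wordIdx, ch, payload = 0) => {
    view.setUint32(wordIdx * 8, payload, true);
    view.setUint8(wordIdx * 8 + 7, ch.charCodeAt(0));
  };
  tag(0, 'r', 6);
  tag(1, '[', 5);
  tag(2, 'd');
  view.setFloat64(3 * 8, 1.5, true);
  tag(4, 't');
  tag(5, ']', 1);
  tag(6, 'r', 0);
  return view;
}

const entries = [...tapeEntries(buildSampleTape())];
const types = entries.map((e) => e.type).join('');
console.log(types, entries[2].value);
```

Because it is a generator, callers can use for...of and stop early without decoding the rest of the tape.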
This would mean there’d be two submodules. One is a fast jsonStr -> {jsonTape, strsTape}.
The other takes {jsonTape, strsTape} -> elemIterator.
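On the JS side, the second submodule would mostly be small decoders over the two buffers. As an illustration, here is how a string could be resolved from strsTape, assuming each string is stored as a 4-byte little-endian length followed by its UTF-8 bytes, as the dumper in this thread reads it. readString and packStrings are hypothetical names, and packStrings only builds sample data; it does not reproduce what simdjson writes.

```javascript
// Hypothetical helper: decodes the string stored at a byte offset in the string tape.
function readString(strsTape, offset) {
  const view = new DataView(strsTape.buffer, strsTape.byteOffset, strsTape.byteLength);
  const len = view.getUint32(offset, true);
  // subarray creates a view, so no bytes are copied before decoding.
  return new TextDecoder().decode(strsTape.subarray(offset + 4, offset + 4 + len));
}

// Hypothetical helper: builds a sample string tape in the assumed layout.
function packStrings(strings) {
  const encoder = new TextEncoder();
  const encoded = strings.map((s) => encoder.encode(s));
  const total = encoded.reduce((n, bytes) => n + 4 + bytes.length, 0);
  const strsTape = new Uint8Array(total);
  const view = new DataView(strsTape.buffer);
  const offsets = [];
  let pos = 0;
  for (const bytes of encoded) {
    offsets.push(pos);
    view.setUint32(pos, bytes.length, true); // length prefix
    strsTape.set(bytes, pos + 4);            // UTF-8 payload
    pos += 4 + bytes.length;
  }
  return { strsTape, offsets };
}

const { strsTape, offsets } = packStrings(['make', 'Škoda']);
const decoded = offsets.map((offset) => readString(strsTape, offset));
console.log(decoded);
```

A STRING entry on the JSON tape would carry one of these offsets as its payload, so the element iterator only needs both buffers to resolve values.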
Hopefully I’m making sense.
I’m happy to write the JS part of the code. I just need to figure out how to export the buffers using N-API.