serialized invertedIndex: more space-efficient format possible? #268
I generally agree on size reduction. However, I think object keys are used for fast retrieval and intrinsic uniqueness, whereas your example would require iteration over each entry.
@Andargor sure, in the in-memory representation the property maps are much more suitable. But lunr.js has explicit deserialization / serialization code that maps the in-memory data structures to JSON. So moving to a more efficient serialization does not imply changing the runtime structure; property hash lookup in the native JS engine is unbeatably fast, so that needs to stay, of course.
In 2.1.0-alpha.1 the structure of the index didn't need to change; only the vector references were slightly altered from "docRef" to "fieldName/docRef". That said, it's still worth discussing what the format of a serialised index should be.

There are a couple of things that need to be balanced when designing the serialisation format. Obviously the size (after minifying and compressing) is important: all other things being equal, a serialisation format that leads to a smaller amount of data to transfer is a win. How long a serialised index takes to deserialise also matters, and again, the quicker the better. Balancing these two isn't always straightforward. Take the two examples from this issue: the existing version is basically just a

I think the approach I want to take is to have something that is simple and fast to deserialise and reasonably small on the wire. I want to publish the schema for this format (it'll still be JSON, so probably JSON Schema), which would then allow more specific minification, such as the structure suggested at the top of the issue.

I'm currently trying this out with a binary format for the index, which will be generated from the JSON serialised index. Deserialisation can either re-create lunr objects directly, or deserialise to the standard JSON format and rely on the built in

This way, for small to medium sized indexes, the built-in serialisation will work well, and if people want smaller index sizes, or faster deserialisation, they can use a plugin that caters to their needs.

What do you think?
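A rough sketch of the plugin idea described above. The minified layout and the `expand` helper are hypothetical (the real serialised structure of a lunr 2.x index differs); only `lunr.Index.load` is lunr's actual deserialisation entry point:

```javascript
// Hypothetical plugin: ship a compact index where each field maps to a
// plain array of doc refs, expand it back to the object-keyed form in
// the browser, then hand the result to lunr.Index.load.
function expand(min) {
  const invertedIndex = {};
  for (const [token, fields] of Object.entries(min.invertedIndex)) {
    const entry = {};
    for (const [field, refs] of Object.entries(fields)) {
      // Re-create the object-keyed runtime form: fast property lookup,
      // one (empty) metadata object per doc ref.
      entry[field] = Object.fromEntries(refs.map((ref) => [ref, {}]));
    }
    invertedIndex[token] = entry;
  }
  return { ...min, invertedIndex };
}

// Usage in the page would be roughly:
//   const idx = lunr.Index.load(expand(minifiedJson));
```

The point is that the compact wire format and the fast in-memory property maps are independent concerns: the expansion step bridges them without touching lunr's runtime structures.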
Just for another perspective: my current application has a serialized index that is 20 MB in size (including the wrapping search-function HTML, which is 1-2 KB). With node/express compression enabled, only 4.3 MB are actually transferred on the wire. I'm just wondering what real gains a marginally smaller index would bring to the table. I would think the smaller cleartext index would compress slightly worse, offsetting any advantage. I'm not sure what processing advantages there would be, if any, since in my current example the transfer impacts user experience far more than the local decompression. Just thinking out loud.
You're all right: it's a question that's only possible to answer by actually testing with real-world data (compressed and uncompressed). I tried stuff like this a lot in the past, across Java and JavaScript (server and browser, etc.), and found just a few commonalities:
My actual background here is that the site is statically generated (Jekyll-based), which is the reason why lunr.js is a good fit at all, and "hosted" by just dumping it on S3. So there is no compression without further DevOps effort, scripts, etc., and the uncompressed size does get interesting.
@nkuehn regarding compression: if you can't compress your assets before deploying to S3, you can look at setting up CloudFront, which can do the compression on the fly for you; you might also be able to do the same with other CDN providers.

As for binary formats, serialisation performance isn't really an issue, as it is done offline at build time. Obviously it shouldn't take hours, but it's unlikely to be a priority. For deserialisation, it's unlikely to be as quick as

I'd be surprised if a custom binary format doesn't provide a significant decrease in the size of an index, but I need to test this.

Getting back to the original issue: if there was a published schema for the plain JSON serialisation format for indexes, you could build on this to create a minified version as in your example.
PS on the topic: I ran across a bigger analysis the Uber engineering team did a while ago: https://eng.uber.com/trip-data-squeeze/ Probably a comparable data set (trip logs, i.e. a relatively repetitive structure). TL;DR: you can get 20% smaller even after compression with a schema-free but optimized format like CBOR or MessagePack. You could "emulate" such a structure in a custom but then totally unreadable JSON that consists of just nested arrays etc., but that's not a real improvement IMHO.
Since the index format will be incompatible with v2.1 anyway (discussed e.g. in #263), it could be an opportunity to optimize it a bit with regard to size (if possible!).
One thing I wonder about is whether the empty objects in the serialized index will ever be used for something. I guess they could be the place where "metadataWhitelist" fields go, but it's a bit opaque to me in the code; I can't really see them being used.
Here's an example invertedIndex entry in my serialized index (stemmed token "custom"). It's 584 characters minified. The index covers just ca. 80 documents, so it's not even an extreme case.
Does anything speak against serializing the document references as an array? I'm not sure whether the IDs are integers or strings; I'm assuming integers here, but strings are trivial too (though less efficient). It's just 262 characters minified now, which is a 55% saving on that part of the index.
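A toy version of the comparison. The token, field names, and doc refs are made up and much smaller than the 584-character example above, and the nesting is simplified relative to the real lunr layout; the point is just the shape of the two encodings:

```javascript
// Hypothetical object-keyed entry, roughly as currently serialised:
// each doc ref is a key pointing at an (empty) metadata object.
const objectForm = {
  custom: { title: { 1: {}, 7: {} }, body: { 1: {}, 2: {}, 7: {} } },
};

// Proposed compact form: one array of doc refs per field.
const arrayForm = {
  custom: { title: [1, 7], body: [1, 2, 7] },
};

const before = JSON.stringify(objectForm).length; // 66 chars
const after = JSON.stringify(arrayForm).length;   // 41 chars
console.log(`${before} -> ${after}, ${Math.round((1 - after / before) * 100)}% saved`);
```

Even on this tiny entry the array form saves roughly 38% before compression; the saving grows with the number of doc refs per token, since each ref drops its quotes, colon, and `{}`.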
Even if it's just 30% in less pronounced cases (keywords that appear less frequently) and just 15% on the overall index size, I think it's worth considering, since the key advantage of lunr.js over other search libraries is that you can run it in-browser, and the index size is a key bottleneck.