Is this valid for index size reduction? #98
Comments
Thanks for looking into this. The delta encoding for locations makes sense, but I would want to review the changes to better understand their impact. Regarding the field number, the reason it has to be stored for every location is to support searching composite fields and being able to remember which original field a match came from. It's a useful feature, but I'm certainly open to ideas to save space wasted in this area. I would be open to reviewing the changes for the delta encoding; changing the field number may require further discussion.
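To make the composite-field point concrete, here is a hedged illustration (the type and function names below are mine, not bluge's or ice's API): a composite field merges term locations from several source fields into one searchable field, and the per-location field number is what lets a hit in the composite field be attributed back to the field it originally came from.

```go
package compsketch

// Location is an illustrative decoded location.
type Location struct {
	Field      uint16 // original (source) field number
	Pos        uint64
	Start, End uint64
}

// mergeIntoComposite gathers locations from several source fields into one
// list for the composite field, keeping each location's original field number.
func mergeIntoComposite(sources map[uint16][]Location) []Location {
	var merged []Location
	for fieldNum, locs := range sources {
		for _, l := range locs {
			l.Field = fieldNum // remember which field this came from
			merged = append(merged, l)
		}
	}
	return merged
}
```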
Thanks for the reply - I will try to put something together for you to look at soon, and also see if I can study composite fields.
Running the tests helped me see the composite problem; I should have done that earlier. Since then I have handled composite fields and thought of more improvements. For reference, the current state of my optimizations is this commit: waddyano/ice@a5ffbee
@waddyano I took a quick look. I didn't review closely, but the approach looks good from reading the description. One thing I see is that you have sections of code guarded with a condition like …
Thanks for the comments - I am used to accepting the extra complexity in exchange for less code duplication, but I will adjust.
This might really just apply to ice, but I thought it might be better to ask here.
I have been experimenting with using bluge for indexing text files. The main fields are indexed but not stored. The index size seems rather large to me, so I have been looking at ways to reduce it.
After cobbling together some code to print which data consumed the space, I found that by far the bulk of it is the location lists in the postings data.
Looking at what the processing generated, I tried storing all location information as deltas: each end as an offset from its start, and each subsequent start as an offset from the previous end (only the first start is stored as an absolute value). Everything stays in increasing order and is always processed in sequence, so this seems to work fine. The deltas are computed per location list. Since everything is written as varints, smaller integers mean less space, at the cost of a little arithmetic during reads.
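A minimal sketch of that delta scheme, assuming the layout described above (the names are illustrative, not ice's actual encoder):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// span is a stand-in for one location's byte offsets.
type span struct{ start, end uint64 }

// encodeLocations writes one location list: the first start as-is, every end
// as (end - start), and every later start as (start - previous end).
func encodeLocations(locs []span) []byte {
	var buf bytes.Buffer
	tmp := make([]byte, binary.MaxVarintLen64)
	prevEnd := uint64(0)
	for i, l := range locs {
		start := l.start
		if i > 0 {
			start -= prevEnd // delta from the previous location's end
		}
		n := binary.PutUvarint(tmp, start)
		buf.Write(tmp[:n])
		n = binary.PutUvarint(tmp, l.end-l.start) // end as offset from start
		buf.Write(tmp[:n])
		prevEnd = l.end
	}
	return buf.Bytes()
}

// decodeLocations reverses the deltas with a little arithmetic on read.
func decodeLocations(data []byte, count int) []span {
	r := bytes.NewReader(data)
	out := make([]span, 0, count)
	prevEnd := uint64(0)
	for i := 0; i < count; i++ {
		d, _ := binary.ReadUvarint(r)
		start := d
		if i > 0 {
			start += prevEnd
		}
		length, _ := binary.ReadUvarint(r)
		end := start + length
		out = append(out, span{start, end})
		prevEnd = end
	}
	return out
}

func main() {
	in := []span{{10, 14}, {20, 27}, {35, 40}}
	fmt.Println(decodeLocations(encodeLocations(in), len(in))) // [{10 14} {20 27} {35 40}]
}
```

In this toy example every encoded value (10, 4, 6, 7, 8, 5) fits in a single varint byte, which is where the savings come from compared with absolute offsets.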
The second observation was that it didn't seem necessary to store the field number in every location, so I removed it and instead pick it up from the dictionary the location list belongs to.
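On the read side this could look something like the following sketch (hypothetical names, not ice's actual types): the field number is no longer decoded from the stream; the reader simply copies it from the dictionary that owns the postings list.

```go
package sketch

import (
	"bytes"
	"encoding/binary"
)

// Location shows only the fields relevant to this change.
type Location struct {
	Field uint16
	Pos   uint64
}

// readLocationV2 decodes one location without a field number in the stream;
// the caller passes the field of the dictionary the postings list belongs to.
func readLocationV2(r *bytes.Reader, dictField uint16) (Location, error) {
	pos, err := binary.ReadUvarint(r)
	if err != nil {
		return Location{}, err
	}
	// start/end would follow, using the delta scheme sketched above
	return Location{Field: dictField, Pos: pos}, nil
}
```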
I did create a version 2 format, and locally I have code which can read and write both formats in one module.
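One way the single-module approach could be organized is a simple dispatch on the on-disk format version; this is only a sketch, and the version constants and function names are assumptions rather than ice's actual code:

```go
package sketch2

import "fmt"

// locationDecoder decodes one postings list's location data; the concrete
// layouts are elided here, only the version dispatch matters.
type locationDecoder func(locBytes []byte, dictField uint16) error

func decodeLocationsV1(locBytes []byte, dictField uint16) error { return nil } // absolute offsets, field stored per location
func decodeLocationsV2(locBytes []byte, dictField uint16) error { return nil } // delta offsets, field taken from the dictionary

// decoderForVersion lets a single reader module handle both formats.
func decoderForVersion(v uint32) (locationDecoder, error) {
	switch v {
	case 1:
		return decodeLocationsV1, nil
	case 2:
		return decodeLocationsV2, nil
	default:
		return nil, fmt.Errorf("unsupported segment format version %d", v)
	}
}
```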
So far this seems to work and has reduced my index size by 38%. Is there some case where this will go wrong? And is there interest in me trying to put together a real change for this, or should I just create my own segment plugin?