
Compress indexes #43

Open: wants to merge 6 commits into base: master
Conversation

dvryaboy (Contributor)

I added an interface for index reading/writing and provided an alternate representation of the index, which should shrink our index files by about 4x. I haven't tested on real data, but unit tests pass. Please comment.

TODO: make the order in which index serdes are tried configurable via properties, à la Hadoop's compression codecs, and make the writer configurable as well (right now I just hardcode the writer implementation).
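For context on where the size win comes from: the stock index file stores one 8-byte absolute offset per compressed block, while the deltas between consecutive offsets fit in a few variable-length-int bytes. A rough sketch of the comparison (class name, helper, and block sizes below are illustrative, not this PR's actual code):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: contrasts the stock index layout (one 8-byte absolute
// offset per LZO block) with a delta + variable-length-int layout.
public class IndexSizeSketch {

    // Protobuf-style unsigned varint: 7 payload bits per byte,
    // high bit set on every byte except the last.
    static void writeVarLong(DataOutputStream out, long v) throws IOException {
        while ((v & ~0x7FL) != 0) {
            out.writeByte((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.writeByte((int) v);
    }

    // Returns {stockBytes, compactBytes} for n blocks of ~64 KB each
    // (hypothetical block sizes, chosen only for the comparison).
    static int[] sizes(int n) {
        try {
            ByteArrayOutputStream stock = new ByteArrayOutputStream();
            DataOutputStream stockOut = new DataOutputStream(stock);
            ByteArrayOutputStream compact = new ByteArrayOutputStream();
            DataOutputStream compactOut = new DataOutputStream(compact);
            long prev = 0;
            for (int i = 0; i < n; i++) {
                long offset = i * 65536L;
                stockOut.writeLong(offset);              // stock: 8 bytes per block
                writeVarLong(compactOut, offset - prev); // delta: at most 3 bytes here
                prev = offset;
            }
            return new int[]{stock.size(), compact.size()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] s = sizes(1000);
        System.out.println(s[0] + " bytes vs " + s[1] + " bytes");
    }
}
```

With these particular block sizes the compact layout is roughly 2.7x smaller; smaller deltas (e.g. encoding block sizes rather than offsets) push it further.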

* Will only be called after prepareToRead().
* @return number of block offsets that will be read back.
*/
public int numBlocks();
Contributor

This one is a bit problematic to implement, as it forces us to process the whole file at once.

Contributor Author

Fixed.

@dvryaboy (Contributor Author)

Rewrote LzoTinyOffsets to use the VarInt implementation from Mahout, and got rid of the numBlocks() method in the interface.
Tests pass; still haven't tested on real data.
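For reference, Mahout's Varint uses the protobuf-style encoding: 7 payload bits per byte, with the high bit marking continuation, so small values cost 1-2 bytes instead of a fixed 4 or 8. A minimal self-contained sketch (not Mahout's actual source, though writeUnsignedVarInt mirrors its signature):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative protobuf-style unsigned varint, the scheme used by
// Mahout's org.apache.mahout.math.Varint.
public class VarIntSketch {

    static void writeUnsignedVarInt(int value, DataOutput out) throws IOException {
        while ((value & 0xFFFFFF80) != 0) {
            out.writeByte((value & 0x7F) | 0x80); // more bytes follow
            value >>>= 7;
        }
        out.writeByte(value & 0x7F);              // final byte, high bit clear
    }

    static int readUnsignedVarInt(DataInput in) throws IOException {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = in.readByte() & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Round-trips a value and returns how many bytes it took on the wire.
    static int encodedSize(int value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeUnsignedVarInt(value, new DataOutputStream(buf));
            int decoded = readUnsignedVarInt(
                    new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            if (decoded != value) throw new IllegalStateException("round-trip failed");
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodedSize(127));   // 1 byte
        System.out.println(encodedSize(128));   // 2 bytes
        System.out.println(encodedSize(65536)); // 3 bytes
    }
}
```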

os.writeInt(firstBlockSize);
wroteFirstBlock = true;
} else {
int delta = ((int) (offset - currOffset)) - firstBlockSize;
Contributor

How about writing the delta from the previous block size? This will also adapt well to compressibility changes over a large file (extremely rare).

Then we don't need wroteFirstBlock, since prevBlockSize would be zero.
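A minimal sketch of this suggestion, assuming a zigzag varint so that negative size deltas also stay small (class and method names are hypothetical, not LzoTinyOffsets itself):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of the reviewer's suggestion: encode each compressed block's size
// as a delta from the PREVIOUS block's size. With prevBlockSize starting
// at 0, the first block needs no special case, so no wroteFirstBlock flag.
public class DeltaFromPrevSketch {

    // ZigZag maps small signed deltas to small unsigned varints.
    static void writeSignedVarInt(int value, DataOutput out) throws IOException {
        int zigzag = (value << 1) ^ (value >> 31);
        while ((zigzag & 0xFFFFFF80) != 0) {
            out.writeByte((zigzag & 0x7F) | 0x80);
            zigzag >>>= 7;
        }
        out.writeByte(zigzag & 0x7F);
    }

    static byte[] encodeBlockSizes(int[] blockSizes) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            int prevBlockSize = 0;            // zero before the first block
            for (int size : blockSizes) {
                writeSignedVarInt(size - prevBlockSize, out);
                prevBlockSize = size;         // adapts as compressibility drifts
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Similar-sized blocks: after the first, each delta costs 1 byte.
        byte[] enc = encodeBlockSizes(new int[]{65000, 65010, 64995, 65002});
        System.out.println(enc.length + " bytes");
    }
}
```

Because consecutive LZO blocks usually compress similarly, the delta from the previous block size is near zero, so most entries take a single byte.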

@dvryaboy (Contributor Author)

@sjlee check out this ancient pull request. The goal here is to make LZO indexes significantly smaller, making split calculation, etc., much faster. It's meant to be backwards-compatible (new hadoop-lzo can read both new and old indexes; old hadoop-lzo can't read new indexes, of course). It also introduces versioning, in case we want to mess with this further.

If this is interesting, I can take a pass at making this mergeable with current master.

@sjlee (Collaborator) commented Sep 2, 2014

It does sound interesting. Could you give it a shot and let me know? Thanks.

@CLAassistant commented Jul 18, 2019

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

4 participants