Compress indexes #43
base: master
Conversation
 * Will only be called after prepareToRead().
 * @return number of block offsets that will be read back.
 */
public int numBlocks();
This one is a bit problematic to implement, and it forces us to process the whole file at once.
Fixed.
Rewrote LzoTinyOffsets to use the VarInt implementation from Mahout, and got rid of the numBlocks() method in the interface.
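For context, a minimal sketch of the general varint technique the comment refers to (this is an illustration of the encoding idea, not Mahout's actual Varint class): each byte carries 7 bits of payload, and the high bit signals whether more bytes follow, so small values take one or two bytes instead of a fixed four.

```java
import java.io.*;

// Sketch of variable-length int encoding: 7 payload bits per byte,
// high bit set means "more bytes follow". Small values stay small.
public class VarIntSketch {
    static void writeUnsignedVarInt(int value, DataOutput out) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80); // low 7 bits + continuation flag
            value >>>= 7;
        }
        out.writeByte(value); // final byte, continuation flag clear
    }

    static int readUnsignedVarInt(DataInput in) throws IOException {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = in.readByte();
            value |= (b & 0x7F) << shift; // accumulate 7 bits at a time
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writeUnsignedVarInt(300, out); // encodes in 2 bytes instead of 4
        writeUnsignedVarInt(5, out);   // encodes in 1 byte instead of 4
        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(readUnsignedVarInt(in)); // 300
        System.out.println(readUnsignedVarInt(in)); // 5
        System.out.println(bytes.size());           // 3
    }
}
```

Because block-size deltas in an LZO index are usually small, varint-encoding them is where most of the index shrinkage comes from.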
  os.writeInt(firstBlockSize);
  wroteFirstBlock = true;
} else {
  int delta = ((int) (offset - currOffset)) - firstBlockSize;
How about writing the delta from the previous block size instead?
That would also adapt well to compressibility changes over a large file (extremely rare).
Then we wouldn't need wroteFirstBlock, since prevBlockSize would start at zero.
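The reviewer's suggestion can be sketched as follows. This is a hypothetical illustration (the method names and sample offsets are made up, not from the patch): each block's size is stored as a signed delta from the previous block's size, so the first block needs no special case because the "previous" size starts at zero, and similar-sized blocks yield tiny deltas.

```java
import java.util.*;

// Sketch of delta-from-previous-block-size encoding. Block sizes in a
// compressed file tend to be similar, so the deltas stay near zero and
// would varint-encode compactly. No wroteFirstBlock flag is needed.
public class DeltaEncodeSketch {
    static List<Integer> encode(long[] offsets) {
        List<Integer> deltas = new ArrayList<>();
        int prevBlockSize = 0; // starts at 0, so block 1 needs no special case
        long prevOffset = 0;
        for (long offset : offsets) {
            int blockSize = (int) (offset - prevOffset);
            deltas.add(blockSize - prevBlockSize); // delta from previous size
            prevBlockSize = blockSize;
            prevOffset = offset;
        }
        return deltas;
    }

    static long[] decode(List<Integer> deltas) {
        long[] offsets = new long[deltas.size()];
        int blockSize = 0;
        long offset = 0;
        for (int i = 0; i < deltas.size(); i++) {
            blockSize += deltas.get(i); // undo the size delta
            offset += blockSize;        // undo the offset delta
            offsets[i] = offset;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical block offsets: sizes 65536, 65564, 65510.
        long[] offsets = {65536, 131100, 196610};
        List<Integer> deltas = encode(offsets);
        System.out.println(deltas); // [65536, 28, -54]
        System.out.println(Arrays.equals(offsets, decode(deltas))); // true
    }
}
```

After the first entry, everything is a small signed number, which is exactly what a varint scheme compresses best.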
@sjlee check out this ancient pull request. The goal here is to make LZO indexes significantly smaller, making split calculation, etc., much faster. It's meant to be backwards-compatible (new hadoop-lzo can read both new and old indexes; old hadoop-lzo can't read new indexes, of course). It also introduces versioning, in case we want to mess with this further. If this is interesting, I can take a pass at making this mergeable with current master.
It does sound interesting. Could you give it a shot and let me know? Thanks.
|
I added an interface for index reading/writing and provided an alternate representation of the index, which should shrink our index files by about 4x. I haven't tested on real data, but unit tests pass. Please comment.
TODO: make the order in which index serdes are tried configurable via properties, a la Hadoop's compression codecs, and make the writer configurable as well (right now I just hardcode the writer implementation).
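The TODO above could look something like the following sketch. The interface, class names (other than LzoTinyOffsets, which the thread mentions), and the magic value are all hypothetical placeholders, not the actual patch: readers are tried in a configurable order, with a versioned new format recognized by a magic header and the legacy bare-offsets format as the fallback.

```java
import java.util.*;

// Hypothetical sketch: try index serdes in a configurable order until one
// recognizes the file header, similar in spirit to how Hadoop resolves
// compression codecs from a property list.
public class SerdeOrderSketch {
    interface IndexSerde {
        boolean accepts(int firstInt); // does this serde recognize the header?
        String name();
    }

    // Placeholder version magic for the new format (illustrative value only).
    static final int VERSION_MAGIC = 0xFFFFFFFF;

    // New, compact format: only claims files starting with the version magic.
    static final IndexSerde TINY = new IndexSerde() {
        public boolean accepts(int firstInt) { return firstInt == VERSION_MAGIC; }
        public String name() { return "LzoTinyOffsetsSerde"; }
    };

    // Legacy format: a bare list of longs, so it accepts anything (must be last).
    static final IndexSerde LEGACY = new IndexSerde() {
        public boolean accepts(int firstInt) { return true; }
        public String name() { return "LegacyIndexSerde"; }
    };

    // Try each serde in the configured order; first match wins.
    static String pickSerde(List<IndexSerde> order, int firstInt) {
        for (IndexSerde s : order) {
            if (s.accepts(firstInt)) return s.name();
        }
        throw new IllegalArgumentException("no serde recognizes this index");
    }

    public static void main(String[] args) {
        List<IndexSerde> order = Arrays.asList(TINY, LEGACY);
        System.out.println(pickSerde(order, VERSION_MAGIC)); // new format
        System.out.println(pickSerde(order, 4096));          // falls back to legacy
    }
}
```

Putting the legacy serde last keeps old indexes readable, which matches the backwards-compatibility goal stated earlier in the thread.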