Use LZMA - better compression, faster decompression #1

itkach · 2010-10-27T00:14:51Z

(Originally reported by itkach on Feb 11, 2009 at BitBucket)

LZMA promises a better compression ratio than gzip and bzip2, and faster decompression than bzip2. Using LZMA in the aard format to compress articles may result in smaller .aar files and better word lookup performance.

itkach · 2010-10-27T00:14:51Z

(Commented by itkach on Feb 11, 2009 at BitBucket)

http://tukaani.org/xz has been suggested as a possible implementation,
although it doesn't seem to have Python bindings.
http://www.joachim-bauch.de/projects/python/pylzma/ looks more promising.
In any case, the benefits of using LZMA need to be explored further.

itkach · 2010-10-27T00:14:52Z

(Commented by itkach on Feb 17, 2009 at BitBucket)

Initial evaluation didn't indicate any substantial improvement from using
LZMA compression. Compiled with LZMA, the Simple English wiki 20081126 dump is
55 MB instead of 56 MB, and the first volume of English Wikipedia is 2337 MB
instead of 2384 MB. In both cases the size is reduced by only ~2%. This is with
pylzma 0.3. Decompression is also only marginally faster than bz2: ~5% on
medium-size articles (~15 KB).
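
A comparison of this kind can be sketched in a few lines. The sketch below is not the script used for the numbers above: it assumes the `lzma` module that ships with Python 3.3 and later (which postdates this thread) rather than pylzma 0.3, and `compare_codecs` and the synthetic `sample` article are hypothetical names introduced for illustration. Synthetic text is far more repetitive than a real article, so the ratios it prints say nothing about real dumps.

```python
import bz2
import lzma
import time
import zlib


def compare_codecs(article: bytes, repeats: int = 20) -> dict:
    """Return compressed size and mean decompression time per codec."""
    codecs = {
        "zlib": (zlib.compress, zlib.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(article)
        # Sanity check: the round trip must reproduce the input.
        assert decompress(packed) == article
        start = time.perf_counter()
        for _ in range(repeats):
            decompress(packed)
        elapsed = (time.perf_counter() - start) / repeats
        results[name] = {"size": len(packed), "seconds": elapsed}
    return results


# Hypothetical stand-in for a medium-size (~15 KB) article.
sample = b"".join(
    b"Article sentence number %d about compression. " % i for i in range(350)
)
for name, stats in compare_codecs(sample).items():
    print(name, stats["size"], "bytes", f"{stats['seconds'] * 1e6:.1f} us")
```

To get figures comparable to the ones above, `sample` would have to be replaced with real article text taken from a dump, and the timings averaged over many articles.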

itkach · 2010-10-27T00:14:52Z

(Commented by anonymous on Mar 12, 2009 at BitBucket)

Aha, I'm disappointed. Are you using the default compression, or -9,
i.e. maximum compression? And do you also compress using Python, or is Python
used just to decompress in the reader?

Also, I have found another Python implementation, pyliblzma, which also seems
to support the new xz format:

https://launchpad.net/pyliblzma

itkach · 2010-10-27T00:14:52Z

(Commented by itkach on Mar 16, 2009 at BitBucket)

pylzma was used both for compression and decompression, with default
compression parameters. I tried some variations, but the defaults seemed to
yield the best results.

I'll see if pyliblzma can do better. I wouldn't hold my breath though: each
article is compressed individually, so neither bzip2 nor LZMA achieves the
same compression ratios as with gigantic files. In fact, a significant
number of articles are just too short to benefit from any compression:
the compressed text plus compression format headers is bigger than the
original uncompressed text. LZMA compression not being part of the Python
standard library is also a significant obstacle: adopting it would mean
compiling and packaging it for Windows and Maemo, and possibly other platforms
where it's not easy for users to get or build binaries.
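
The effect of per-article compression on short texts is easy to reproduce. The following is a minimal sketch, assuming the `lzma` module from Python 3.3 and later, whose default output is the xz container; its header overhead may differ from what pylzma 0.3 wrote. The `overhead` helper and the sample articles are hypothetical and exist only for illustration.

```python
import bz2
import lzma


def overhead(text: bytes) -> dict:
    """Compressed size minus original size per codec (positive means growth)."""
    return {
        "bz2": len(bz2.compress(text)) - len(text),
        "lzma": len(lzma.compress(text)) - len(text),
    }


# Hypothetical short article: the format headers outweigh any savings.
short_article = b"A very short article."
print(overhead(short_article))

# Hypothetical collection of many similar short articles, compressed
# one by one (as in the aard format) versus as a single stream.
articles = [b"Entry %d: a very short article." % i for i in range(1000)]
individually = sum(len(lzma.compress(a)) for a in articles)
together = len(lzma.compress(b"".join(articles)))
print(individually, together)
```

Compressing each entry on its own pays the header cost once per article and cannot reuse redundancy across articles, which is why the per-article total comes out larger than the single-stream size here.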
