This repository has been archived by the owner on Nov 25, 2019. It is now read-only.

Use LZMA - better compression, faster decompression #1

Open
itkach opened this issue Oct 27, 2010 · 4 comments

Comments

itkach commented Oct 27, 2010

(Originally reported by itkach on Feb 11, 2009 at BitBucket)

LZMA promises a better compression ratio than gzip and bzip2, and faster decompression than bzip2. Using LZMA in the aard format to compress articles may result in smaller .aar files and better word lookup performance.

itkach commented Oct 27, 2010

(Commented by itkach on Feb 11, 2009 at BitBucket)

http://tukaani.org/xz has been suggested as a possible implementation,
although it doesn't seem to have Python bindings.
http://www.joachim-bauch.de/projects/python/pylzma/ looks more promising.
In any case, the benefits of using LZMA need to be explored further.

itkach commented Oct 27, 2010

(Commented by itkach on Feb 17, 2009 at BitBucket)

Initial evaluation didn't indicate any substantial improvement from using
LZMA compression. Compiled with LZMA, the Simple English wiki 20081126 dump is
55 Mb instead of 56 Mb, and the first volume of English Wikipedia is 2337 Mb
instead of 2384 Mb; in both cases the size is reduced by only ~2%. This is with
pylzma 0.3. Decompression is also only marginally faster than bz2: ~5% on
medium-size articles (~15 Kb).
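
A minimal sketch of this kind of measurement, for anyone who wants to repeat it. It assumes Python 3.3 or later, where `lzma` is in the standard library; the evaluation above used pylzma 0.3, whose API differs. The `compare_codecs` and `sample_articles` helpers and the synthetic word list are hypothetical stand-ins, not part of the aard tools, and synthetic text will not reproduce the sizes quoted above.

```python
import bz2
import lzma  # stdlib since Python 3.3; the thread itself used pylzma 0.3
import random
import time
import zlib

# Codec name -> (compress, decompress), all with default parameters.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def sample_articles(count=50, size=15_000, seed=0):
    """Build synthetic 'articles' of `size` bytes each (stand-in for real data)."""
    rng = random.Random(seed)
    words = ["aard", "dictionary", "article", "lookup", "wiki",
             "volume", "compression", "reader", "index", "format"]
    articles = []
    for _ in range(count):
        text = " ".join(rng.choice(words) for _ in range(size))
        articles.append(text.encode("utf-8")[:size])
    return articles


def compare_codecs(articles, repeats=20):
    """Compress every article on its own; report size and decompression time."""
    raw_total = sum(len(a) for a in articles)
    report = {}
    for name, (compress, decompress) in CODECS.items():
        blobs = [compress(a) for a in articles]
        start = time.perf_counter()
        for _ in range(repeats):
            for blob in blobs:
                decompress(blob)
        elapsed = time.perf_counter() - start
        packed_total = sum(len(b) for b in blobs)
        report[name] = {
            "bytes": packed_total,
            "ratio": packed_total / raw_total,
            "decompress_seconds": elapsed / repeats,
        }
    return report


if __name__ == "__main__":
    for name, stats in compare_codecs(sample_articles()).items():
        print(name, stats)
```

Replacing `sample_articles()` with articles read from a real dump is what would make the numbers comparable to the ones reported here.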

itkach commented Oct 27, 2010

(Commented by anonymous on Mar 12, 2009 at BitBucket)

aha, I'm disappointed. Are you using the default compression, or do you use -9,
i.e. maximum compression? And do you also compress using Python, or is Python
used just to decompress in the reader?

Also, I have found another Python implementation, pyliblzma, which seems to
support the new xz format as well:

https://launchpad.net/pyliblzma

itkach commented Oct 27, 2010

(Commented by itkach on Mar 16, 2009 at BitBucket)

pylzma was used both for compression and decompression, with default
compression parameters. I tried some variations, but defaults seemed to yield
best results.

I'll see if pyliblzma can do better. I wouldn't hold my breath though: each
article is compressed individually, so neither bzip2 nor LZMA achieves the
compression ratios it does on gigantic files. In fact, a significant
number of articles are just too short to benefit from any compression:
the compressed text plus the compression format headers is bigger than the
original uncompressed text. LZMA not being part of the Python standard library
is also a significant obstacle: adopting it would mean compiling and packaging
it for Windows and Maemo and possibly other platforms where it's not easy for
users to get or build binaries.
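
To illustrate the short-article point, here is a rough sketch that compresses one article at a time and falls back to storing the raw bytes when no codec makes it smaller. The one-byte tag layout and the `pack_article` / `unpack_article` names are hypothetical, chosen for illustration only, and are not the actual .aar format. It also assumes Python 3.3+, where `lzma` ships with the standard library, which was not the case when this comment was written.

```python
import bz2
import lzma  # stdlib only since Python 3.3
import zlib

# Hypothetical one-byte codec tags, not the real .aar layout.
RAW, ZLIB, BZ2, LZMA = 0, 1, 2, 3

_COMPRESSORS = {ZLIB: zlib.compress, BZ2: bz2.compress, LZMA: lzma.compress}
_DECOMPRESSORS = {
    RAW: lambda payload: payload,
    ZLIB: zlib.decompress,
    BZ2: bz2.decompress,
    LZMA: lzma.decompress,
}


def pack_article(data: bytes) -> bytes:
    """Return tag byte + smallest payload; keep raw bytes if nothing helps."""
    best_tag, best = RAW, data
    for tag, compress in _COMPRESSORS.items():
        candidate = compress(data)
        if len(candidate) < len(best):
            best_tag, best = tag, candidate
    return bytes([best_tag]) + best


def unpack_article(blob: bytes) -> bytes:
    """Read the tag byte and decode the rest with the matching codec."""
    tag, payload = blob[0], blob[1:]
    return _DECOMPRESSORS[tag](payload)


if __name__ == "__main__":
    short = b"Aard"
    # Format headers alone outweigh a very short article.
    print(len(short), len(zlib.compress(short)),
          len(bz2.compress(short)), len(lzma.compress(short)))
    print(pack_article(short)[0] == RAW)
```

Trying every codec per article costs compile time but not lookup time, since the reader only runs the one decompressor named by the tag.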
