Swap gzip with bzip2 #15

eric-mh · 2020-11-18T22:57:25Z

Gzip occasionally has issues with the underlying zlib library. It's a bit difficult to reproduce, but errors look like:

...
  File "wordninja.py", line 80, in <module>
    DEFAULT_LANGUAGE_MODEL = LanguageModel(os.path.join(os.path.dirname(os.path.abspath(__file__)),'wordninja','wordninja_words.txt.gz'))
  File "wordninja.py", line 31, in __init__
    with gzip.open(word_file) as f:
  File "/usr/local/lib/python3.7/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/local/lib/python3.7/gzip.py", line 175, in __init__
    raw = _GzipReader(fileobj)
  File "/usr/local/lib/python3.7/gzip.py", line 380, in __init__
    wbits=-zlib.MAX_WBITS)
  File "/usr/local/lib/python3.7/_compression.py", line 53, in __init__
    self._decompressor = self._decomp_factory(**self._decomp_args)
ValueError: Invalid initialization option

Swapping the gzip files for bzip2 files should be a harmless way to prevent those errors.

bzip2 generation steps:

gunzip wordninja_words.txt.gz
bzip2 wordninja_words.txt

gunzip test_lang.txt.gz
bzip2 test_lang.txt
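For anyone who needs to convert their own model files without the command-line tools, the same recompression can be sketched in Python. This is only an illustration: `gz_to_bz2` is a hypothetical helper name, not something in wordninja. Unlike the shell commands above, it leaves the original `.gz` file in place.

```python
import bz2
import gzip
import shutil


def gz_to_bz2(gz_path, bz2_path):
    """Stream-decompress a gzip file and recompress it as bzip2.

    Hypothetical helper for illustration only. The original .gz file
    is left untouched.
    """
    with gzip.open(gz_path, "rb") as src, bz2.open(bz2_path, "wb") as dst:
        # Copy in chunks so large word lists are never fully held in memory.
        shutil.copyfileobj(src, dst)


if __name__ == "__main__":
    gz_to_bz2("wordninja_words.txt.gz", "wordninja_words.txt.bz2")
```

Since this reads through `gzip`, it has to run on a host where the gzip error does not occur.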

tests:

python3 test.py   
......
----------------------------------------------------------------------
Ran 6 tests in 0.003s

OK

keredson · 2020-11-18T23:37:59Z

That's concerning. Did this happen to you? Is there an example somewhere?

This is complicated a bit because people can supply their own model files. We'd need to not break compatibility.

eric-mh · 2020-11-18T23:44:38Z

Yep, I can't figure out what's behind it though. It occasionally happens on some CentOS 8 hosts where everything else looks fine.

It boils down to this call sometimes raising a ValueError and sometimes not:

zlib.decompressobj(-15)
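A small probe like the following could be run on an affected host to capture the failure together with the zlib versions Python sees. It is a diagnostic sketch only: `probe_raw_deflate` is a made-up name, and nothing in this thread confirms that a version mismatch is the cause.

```python
import zlib


def probe_raw_deflate():
    """Try to build the raw-deflate decompressor that gzip.py asks for.

    Returns a dict with the zlib version Python was compiled against,
    the version actually loaded at runtime, and whether the call worked.
    """
    info = {
        "compiled_against": zlib.ZLIB_VERSION,
        "loaded_at_runtime": zlib.ZLIB_RUNTIME_VERSION,
    }
    try:
        # Equivalent to zlib.decompressobj(-15): negative wbits means a raw
        # deflate stream with no zlib or gzip header.
        zlib.decompressobj(-zlib.MAX_WBITS)
        info["ok"] = True
    except ValueError as exc:
        info["ok"] = False
        info["error"] = str(exc)
    return info


if __name__ == "__main__":
    print(probe_raw_deflate())
```

Comparing the output from a failing host with the output from a healthy one might help narrow things down.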

Good call on compatibility. This change should really keep gzip support too.

keredson · 2020-11-25T17:57:56Z

wordninja.py

 class LanguageModel(object):
  def __init__(self, word_file):
    # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
-    with gzip.open(word_file) as f:
-      words = f.read().decode().split()
+    if check_magic(word_file, FileTypeMagicBytesRe.BZIP_FILE):


Since we have the filename, why not just check the extension?

eric-mh · 2020-12-04 (edited)

Just in case there's any discrepancy between the file extension and the actual file type. That said, it really depends on what the expected behavior is when it's fed bad input.

Should it:

Accept the extension and just raise an error if it fails to process?

Ignore the extension and process the file as best it can?

Ignore the extension, but emit a warning if the extension doesn't match?
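To make the third option concrete, here is a minimal sketch that picks the decompressor from the file's leading bytes and warns when the extension disagrees. It is not the `check_magic` implementation from this PR: the names `sniff_compression` and `read_words` are hypothetical, and the only facts assumed are the standard signatures (gzip files start with `1f 8b`, bzip2 files with `BZh`).

```python
import bz2
import gzip
import os
import warnings

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip file
BZIP2_MAGIC = b"BZh"       # first three bytes of every bzip2 file

_OPENERS = {"gzip": gzip.open, "bzip2": bz2.open}
_EXTENSIONS = {".gz": "gzip", ".bz2": "bzip2"}


def sniff_compression(path):
    """Return 'gzip', 'bzip2', or None based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(3)  # only the signature is read, not the whole file
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return None


def read_words(path):
    """Read a whitespace-separated word list, trusting content over extension."""
    kind = sniff_compression(path)
    if kind is None:
        raise ValueError("%s is neither gzip nor bzip2" % path)
    expected = _EXTENSIONS.get(os.path.splitext(path)[1].lower())
    if expected is not None and expected != kind:
        warnings.warn(
            "%s has a %s signature but a %s extension" % (path, kind, expected)
        )
    with _OPENERS[kind](path, "rb") as f:
        return f.read().decode().split()
```

With this shape, existing user-supplied `.gz` models keep working, and a mislabeled file still loads but leaves a trace in the warnings.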

Eric Hui added 3 commits November 18, 2020 14:40

use bzip instead of gzip

2023d86

update bz2 in readme

f3881b4

update bz2 in manifest

4033e41

Eric Hui added 3 commits November 18, 2020 16:02

update with gzip support for compatibility

00ac408

check filetype for decompression

f7fb37f

update ver

ed6b29c

keredson reviewed Nov 25, 2020


more efficient read on file check

0e358b8

