Swap gzip with bzip2 #15

Open: eric-mh wants to merge 7 commits into base: master

@eric-mh commented Nov 18, 2020

Gzip occasionally has issues with the underlying zlib library. It's a bit difficult to reproduce, but errors look like:

...
  File "wordninja.py", line 80, in <module>
    DEFAULT_LANGUAGE_MODEL = LanguageModel(os.path.join(os.path.dirname(os.path.abspath(__file__)),'wordninja','wordninja_words.txt.gz'))
  File "wordninja.py", line 31, in __init__
    with gzip.open(word_file) as f:
  File "/usr/local/lib/python3.7/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/usr/local/lib/python3.7/gzip.py", line 175, in __init__
    raw = _GzipReader(fileobj)
  File "/usr/local/lib/python3.7/gzip.py", line 380, in __init__
    wbits=-zlib.MAX_WBITS)
  File "/usr/local/lib/python3.7/_compression.py", line 53, in __init__
    self._decompressor = self._decomp_factory(**self._decomp_args)
ValueError: Invalid initialization option

Swapping the gzip file for a bzip2 one should be a harmless way to avoid those errors.

bzip generation steps:

gunzip wordninja_words.txt.gz
bzip2 wordninja_words.txt

gunzip test_lang.txt.gz
bzip2 test_lang.txt

tests:

python3 test.py   
......
----------------------------------------------------------------------
Ran 6 tests in 0.003s

OK

@keredson (Owner) commented

That's concerning. Did this happen to you? Is there an example somewhere?

This is a bit complicated because people can supply their own model files, so we'd need to avoid breaking compatibility.

@eric-mh (Author) commented Nov 18, 2020

Yep, I can't figure out what's behind it though. It occasionally happens on some CentOS 8 hosts where everything else looks fine.

It boils down to this call sometimes raising a ValueError and sometimes not:

zlib.decompressobj(-15)
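
A small diagnostic along these lines might help narrow it down on an affected host. This is only a sketch: `probe_raw_deflate` is a hypothetical helper, not part of wordninja, and comparing the compile-time and runtime zlib versions is just one thing worth ruling out, not a confirmed cause.

```python
import zlib


def probe_raw_deflate(wbits=-15):
    """Try to create a raw-deflate decompressor and report the outcome.

    Hypothetical helper for debugging; -15 is the same value gzip.py
    passes as -zlib.MAX_WBITS.
    """
    info = {
        # zlib version Python was compiled against
        "compiled_against": zlib.ZLIB_VERSION,
        # zlib version actually loaded by the interpreter
        "loaded_at_runtime": zlib.ZLIB_RUNTIME_VERSION,
        "wbits": wbits,
    }
    try:
        zlib.decompressobj(wbits)
        info["error"] = None
    except ValueError as exc:
        info["error"] = str(exc)
    return info


if __name__ == "__main__":
    print(probe_raw_deflate())
```

Running it on a failing host and on a healthy one and comparing the two dictionaries would show whether the hosts differ in the zlib they load.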

Good call on compatibility; this change should really keep gzip support too.

class LanguageModel(object):
    def __init__(self, word_file):
        # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
        with gzip.open(word_file) as f:
            words = f.read().decode().split()
        if check_magic(word_file, FileTypeMagicBytesRe.BZIP_FILE):
@keredson (Owner) commented

since we have the filename, why not just check the extension?
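
For illustration, an extension-based loader could look roughly like the sketch below. The names `OPENERS` and `read_words_by_extension` are hypothetical and do not come from this PR; the only assumption is that model files end in `.gz` or `.bz2`.

```python
import bz2
import gzip
import os

# Hypothetical mapping from file extension to opener; not part of wordninja.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def read_words_by_extension(word_file):
    """Pick the decompressor from the file extension and return the word list."""
    ext = os.path.splitext(word_file)[1].lower()
    try:
        opener = OPENERS[ext]
    except KeyError:
        raise ValueError("unsupported model file extension: %r" % ext)
    # Both openers default to binary read mode, matching the existing
    # f.read().decode().split() pattern.
    with opener(word_file) as f:
        return f.read().decode().split()
```

With this approach a mislabeled file simply fails inside the decompressor, so existing `.gz` model files supplied by users keep working unchanged.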

@eric-mh (Author) commented Dec 4, 2020

Just in case the file extension doesn't match the actual file type, although that really depends on what the expected behavior is when it's fed bad input.

Should it:

  1. Trust the extension and just raise an error if processing fails?
  2. Ignore the extension and just process the file as best it can?
  3. Ignore the extension, but emit a warning if the extension doesn't match?
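
As a rough sketch of option 3, the loader could sniff the leading magic bytes and warn on a mismatch. Everything here is hypothetical (`FORMATS`, `open_model_file`, `read_words`) and is not the `check_magic` helper from this PR; it relies only on the standard gzip (`\x1f\x8b`) and bzip2 (`BZh`) file signatures.

```python
import bz2
import gzip
import warnings

# (magic prefix, opener, conventional extension); hypothetical table.
FORMATS = [
    (b"\x1f\x8b", gzip.open, ".gz"),
    (b"BZh", bz2.open, ".bz2"),
]


def open_model_file(word_file):
    """Choose a decompressor from the file's magic bytes, warning on a
    mismatched extension (option 3 above)."""
    with open(word_file, "rb") as raw:
        head = raw.read(3)
    for magic, opener, ext in FORMATS:
        if head.startswith(magic):
            if not word_file.lower().endswith(ext):
                warnings.warn(
                    "%s has %s content but a different extension"
                    % (word_file, ext)
                )
            return opener(word_file)
    raise ValueError("unrecognized compression format: %s" % word_file)


def read_words(word_file):
    """Return the whitespace-separated words stored in a model file."""
    with open_model_file(word_file) as f:
        return f.read().decode().split()
```

Dropping the `warnings.warn` call would turn this into option 2, so the choice between the two is mostly a question of how noisy the library should be.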
