-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Swap gzip with bzip2 #15
base: master
Are you sure you want to change the base?
Conversation
That's concerning. Did this happen to you? Is there an example somewhere? This is complicated a bit because people can supply their own model files. We'd need to not break compatibility. |
Yep, I can't figure out what's behind it though. It occasionally happens on some CentOS 8 hosts where everything else looks fine. It boils down to this call sometimes throwing a value error and sometimes not:
That's a good call with compatibility, this change should really come with gzip support too. |
class LanguageModel(object): | ||
def __init__(self, word_file): | ||
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability). | ||
with gzip.open(word_file) as f: | ||
words = f.read().decode().split() | ||
if check_magic(word_file, FileTypeMagicBytesRe.BZIP_FILE): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we have the filename, why not just check the extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case there's any discrepancy between the file extension and its type. Although that really depends on what's the expected behavior if it's fed bad inputs.
Should it:
- Accept the extension and just throw out errors if it fails to process?
- Ignore the extension and just process it best it can?
- Ignore the extension, but pop out a warning if the extension doesn't match?
Gzip occasionally has issues with the underlying zlib library. It's a bit difficult to reproduce, but errors look like:
Swapping out the gzip file with bzip should be a harmless option to prevent those
bzip generation steps:
tests: