Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Functionality to accept compressed files as input to predict when using a Predictor #5299

Closed
wants to merge 9 commits into from

Conversation

Dbhasin1
Copy link
Contributor

@Dbhasin1 Dbhasin1 commented Jul 2, 2021

Fixes #5237

Changes proposed in this pull request:

  • Uses the open_compressed helper function in allennlp/allennlp/common/file_utils.py to handle compressed files in allennlp/allennlp/commands/predict.py and if our method to detect the compression format fails, asks the user to manually specify the format type as an argument to the predict class (if they haven't already stated so)
  • Apart from gz and bz2 formats, added provision to deal with the lzma file format

Before submitting

  • [ ✅] I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.
  • codecov/patch reports high test coverage (at least 90%).
    You can find this under the "Actions" tab of the pull request once the other checks have finished.

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but needs a few tweaks.

I would be OK with not having the manual override functionality at all. If you name your gzip file "something.bz2", you deserve what you get. But if you have a scenario in mind where we need this, let's keep it. It's good practice.

Comment on lines 224 to 238
try:
with open_compressed(input_file) as file_input:
for line in file_input:
if not line.isspace():
yield self._predictor.load_line(line)
except OSError:
if self.compression_type:
with open_compressed(input_file, self.compression_type) as file_input:
for line in file_input:
if not line.isspace():
yield self._predictor.load_line(line)
else:
print(
"Automatic detection of compression type failed, please specify the compression type argument"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic needs to be the other way around. If the compression type is specified, we have to always respect it. If it's not specified, we autodetect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, will incorporate it

@@ -1085,20 +1085,27 @@ def get_file_extension(path: str, dot=True, lower: bool = True):


def open_compressed(
filename: Union[str, PathLike], mode: str = "rt", encoding: Optional[str] = "UTF-8", **kwargs
filename: Union[str, PathLike],
compression_type: str = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the last parameter, so we don't break existing usage of positional arguments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the type of this should be Optional[None].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the make typecheck commands gave me incompatibility error on changing it to Optional[None] so I kept it as Optional[str], will that be fine?

elif filename.endswith(".bz2"):
import bz2
compression_modules = {"gz": "gzip", "bz2": "bz2", "lzma": "lzma"}
if not compression_type:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I pass in an empty string for compression_type, we will go down this path. I don't think that's what we want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I simply consider an empty string for compression_type to be equivalent to it being 'None' ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be surprising to a user of the library. You can just compare to None:

Suggested change
if not compression_type:
if compression_type is None:

open_fn = module.open
break
else:
module = __import__(compression_modules[extension])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think extension is undefined here? Or am I blind and can't see it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that was an error at my end, will fix it

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close to done!

@@ -152,6 +160,7 @@ def __init__(
batch_size: int,
print_to_console: bool,
has_dataset_reader: bool,
compression_type: str = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type should be Optional[str].

Comment on lines +241 to +242
except OSError:
print("please specify the correct compression type argument.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why OSError? I don't think you have to catch this exception at all. If it fails, it fails. The only thing we might want to do is make sure that open_compressed() throws exceptions that are understandable.

open_fn = module.open
else:
for extension in compression_modules:
if filename.endswith(extension):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use os.path.splitext() here to make that detection? I don't want a file named info.fogbugz to show up as a gzip file.

@dirkgr
Copy link
Member

dirkgr commented Feb 23, 2022

Since I can't push to your branch and you are MIA, I'm continuing this in #5578.

@dirkgr dirkgr closed this Feb 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accept compressed files as input to predict when using a Predictor
2 participants