Functionality to accept compressed files as input to predict when using a Predictor #5299

Dbhasin1 · 2021-07-02T14:35:07Z

Fixes #5237

Changes proposed in this pull request:

Uses the open_compressed helper function in allennlp/allennlp/common/file_utils.py to handle compressed files in allennlp/allennlp/commands/predict.py. If automatic detection of the compression format fails, the user is asked to specify the format manually as an argument to the predict class (if they haven't already done so).
In addition to the gz and bz2 formats, adds support for the lzma file format.
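
For readers unfamiliar with the three formats: the standard-library gzip, bz2 and lzma modules each expose an open function with the same text-mode interface, which is what makes a single helper practical. The following is a minimal sketch under that assumption, not the code in this PR; OPENERS, read_nonblank_lines and round_trip are hypothetical names used only for illustration.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Hypothetical mapping from a compression type to the stdlib opener for it.
OPENERS = {"gz": gzip.open, "bz2": bz2.open, "lzma": lzma.open}


def read_nonblank_lines(path, compression_type):
    """Return the non-blank lines of a compressed text file."""
    open_fn = OPENERS[compression_type]
    # "rt" selects text mode, so the opener decodes bytes to str.
    with open_fn(path, "rt", encoding="UTF-8") as handle:
        return [line for line in handle if not line.isspace()]


def round_trip(compression_type, text):
    """Write `text` with the given compression, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "inputs." + compression_type)
        with OPENERS[compression_type](path, "wt", encoding="UTF-8") as handle:
            handle.write(text)
        return read_nonblank_lines(path, compression_type)
```

Blank lines are skipped with the same `line.isspace()` check that the predict.py diff in this PR uses.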

Before submitting

[✅] I've read and followed all steps in the Making a pull request section of the CONTRIBUTING docs.
I've updated or added any relevant docstrings following the syntax described in the Writing docstrings section of the CONTRIBUTING docs.
If this PR fixes a bug, I've added a test that will fail without my fix.
If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

All GitHub Actions jobs for my pull request have passed.
codecov/patch reports high test coverage (at least 90%).
You can find this under the "Actions" tab of the pull request once the other checks have finished.

dirkgr

Looks good, but needs a few tweaks.

I would be OK with not having the manual override functionality at all. If you name your gzip file "something.bz2", you deserve what you get. But if you have a scenario in mind where we need this, let's keep it. It's good practice.

dirkgr · 2021-07-06T23:15:05Z

allennlp/commands/predict.py

+            try:
+                with open_compressed(input_file) as file_input:
+                    for line in file_input:
+                        if not line.isspace():
+                            yield self._predictor.load_line(line)
+            except OSError:
+                if self.compression_type:
+                    with open_compressed(input_file, self.compression_type) as file_input:
+                        for line in file_input:
+                            if not line.isspace():
+                                yield self._predictor.load_line(line)
+                else:
+                    print(
+                        "Automatic detection of compression type failed, please specify the compression type argument"
+                    )


This logic needs to be the other way around. If the compression type is specified, we have to always respect it. If it's not specified, we autodetect.

That makes sense, will incorporate it
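
The ordering requested in this thread can be sketched as follows. This is a rough illustration rather than the PR's implementation; resolve_compression and SUPPORTED are hypothetical names, and the extension-based fallback is only one possible form of autodetection.

```python
import os
from typing import Optional

# Compression types named in this PR.
SUPPORTED = ("gz", "bz2", "lzma")


def resolve_compression(filename: str, compression_type: Optional[str] = None) -> Optional[str]:
    """Pick the compression type, giving an explicit argument priority."""
    if compression_type is not None:
        # An explicit choice always wins, even if the file name disagrees.
        return compression_type
    # Nothing was specified, so fall back to autodetection from the extension.
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return extension if extension in SUPPORTED else None
```

With this ordering there is no retry path: a wrong explicit type simply fails when the file is read, instead of being silently replaced by a guess.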

dirkgr · 2021-07-06T23:15:46Z

allennlp/common/file_utils.py

@@ -1085,20 +1085,27 @@ def get_file_extension(path: str, dot=True, lower: bool = True):


 def open_compressed(
-    filename: Union[str, PathLike], mode: str = "rt", encoding: Optional[str] = "UTF-8", **kwargs
+    filename: Union[str, PathLike],
+    compression_type: str = None,


This should be the last parameter, so we don't break existing usage of positional arguments.

Also, the type of this should be Optional[None].

The make typecheck command gave me an incompatibility error when I changed it to Optional[None], so I kept it as Optional[str]. Will that be fine?
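
To illustrate why parameter position matters here, the sketch below compares two stand-in signatures. Neither is the real open_compressed (both omit **kwargs and do not open anything); open_appended, open_inserted and legacy_call are hypothetical names. The annotation uses Optional[str], which is what the later review round asks for.

```python
from typing import Optional, Tuple

Args = Tuple[str, str, Optional[str], Optional[str]]


def open_appended(
    filename: str,
    mode: str = "rt",
    encoding: Optional[str] = "UTF-8",
    compression_type: Optional[str] = None,  # new parameter goes last
) -> Args:
    """Stand-in that only reports how its arguments were bound."""
    return (filename, mode, encoding, compression_type)


def open_inserted(
    filename: str,
    compression_type: Optional[str] = None,  # new parameter in second position
    mode: str = "rt",
    encoding: Optional[str] = "UTF-8",
) -> Args:
    """Stand-in with the ordering from the diff under review."""
    return (filename, mode, encoding, compression_type)


# A call written against the original signature, passing mode positionally.
legacy_call = ("inputs.gz", "rb")
```

With open_inserted, the positional "rb" from the legacy call is bound to compression_type instead of mode; with open_appended it still reaches mode.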

dirkgr · 2021-07-06T23:17:35Z

allennlp/common/file_utils.py

-    elif filename.endswith(".bz2"):
-        import bz2
+    compression_modules = {"gz": "gzip", "bz2": "bz2", "lzma": "lzma"}
+    if not compression_type:


If I pass in an empty string for compression_type, we will go down this path. I don't think that's what we want.

Can I simply treat an empty string for compression_type as equivalent to None?

I think that would be surprising to a user of the library. You can just compare to None:

Suggested change

-    if not compression_type:
+    if compression_type is None:
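
The difference between the two checks shows up only for falsy values that are not None, such as an empty string. A small sketch with hypothetical helper names:

```python
from typing import Optional


def branch_with_truthiness(compression_type: Optional[str]) -> str:
    """Mirror `if not compression_type:` from the diff under review."""
    if not compression_type:
        return "autodetect"
    return "explicit"


def branch_with_is_none(compression_type: Optional[str]) -> str:
    """Mirror the suggested `if compression_type is None:` check."""
    if compression_type is None:
        return "autodetect"
    return "explicit"
```

Both functions agree for None and for "gz". They disagree for "", where the truthiness check silently falls back to autodetection while the `is None` check treats the value as an explicit (and presumably invalid) choice that can be reported to the caller.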

dirkgr · 2021-07-06T23:19:58Z

allennlp/common/file_utils.py

+                open_fn = module.open
+                break
+    else:
+        module = __import__(compression_modules[extension])


I think extension is undefined here? Or am I blind and can't see it?

Yep, that was an error at my end, will fix it

dirkgr

Close to done!

dirkgr · 2021-07-23T00:17:26Z

allennlp/commands/predict.py

@@ -152,6 +160,7 @@ def __init__(
        batch_size: int,
        print_to_console: bool,
        has_dataset_reader: bool,
+        compression_type: str = None,


The type should be Optional[str].

dirkgr · 2021-07-23T00:20:21Z

allennlp/commands/predict.py

+                except OSError:
+                    print("please specify the correct compression type argument.")


Why OSError? I don't think you have to catch this exception at all. If it fails, it fails. The only thing we might want to do is make sure that open_compressed() throws exceptions that are understandable.
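
One way to read "exceptions that are understandable" is to validate the compression type up front and let everything else propagate. The following is a sketch under that assumption only; get_open_fn and COMPRESSION_MODULES are hypothetical names, and the choice of ValueError is not taken from this PR.

```python
import bz2
import gzip
import lzma

# Hypothetical lookup table from compression type to stdlib module.
COMPRESSION_MODULES = {"gz": gzip, "bz2": bz2, "lzma": lzma}


def get_open_fn(compression_type: str):
    """Return the opener for a compression type, or fail with a clear message."""
    try:
        return COMPRESSION_MODULES[compression_type].open
    except KeyError:
        supported = ", ".join(sorted(COMPRESSION_MODULES))
        # `from None` hides the internal KeyError so the user sees one error.
        raise ValueError(
            f"Unknown compression type {compression_type!r}; expected one of: {supported}"
        ) from None
```

Errors raised while reading a file whose contents do not match the chosen type are left uncaught here, in line with "if it fails, it fails".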

dirkgr · 2021-07-23T00:24:54Z

allennlp/common/file_utils.py

+        open_fn = module.open
+    else:
+        for extension in compression_modules:
+            if filename.endswith(extension):


Can you use os.path.splitext() here to make that detection? I don't want a file named info.fogbugz to show up as a gzip file.
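
The false positive described in this comment can be reproduced in a few lines. This is an illustrative sketch; detect_by_suffix and detect_by_splitext are hypothetical names, and the mapping is copied from the compression_modules dictionary in the diff.

```python
import os
from typing import Optional

COMPRESSION_MODULES = {"gz": "gzip", "bz2": "bz2", "lzma": "lzma"}


def detect_by_suffix(filename: str) -> Optional[str]:
    """Detection as in the diff under review: a bare suffix match."""
    for extension, module_name in COMPRESSION_MODULES.items():
        if filename.endswith(extension):
            return module_name
    return None


def detect_by_splitext(filename: str) -> Optional[str]:
    """Detection on the real extension, as the reviewer suggests."""
    # splitext returns the final extension with its leading dot, e.g. ".gz".
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return COMPRESSION_MODULES.get(extension)
```

For "info.fogbugz" the suffix match reports gzip because the name ends in the letters "gz", while the splitext version sees the extension "fogbugz" and reports nothing. A name like "data.jsonl.gz" is still detected as gzip by the splitext version.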

dirkgr · 2022-02-23T18:58:31Z

Since I can't push to your branch and you are MIA, I'm continuing this in #5578.

Dbhasin1 and others added 4 commits July 2, 2021 18:56

functionality to accept compressed files as input to predict

0a38254

test for lzma format included

316b15e

minor logical error

47349d4

Merge branch 'main' into zipped-file-handler

79d7e96

dirkgr suggested changes Jul 6, 2021

View reviewed changes

suggested changes incorporated

0ac8f67

dirkgr suggested changes Jul 23, 2021

View reviewed changes

dirkgr added 4 commits July 22, 2021 17:25

Merge branch 'main' into zipped-file-handler

85bddc7

Merge branch 'main' into zipped-file-handler

ef821b9

Merge branch 'main' into zipped-file-handler

026e42a

Merge branch 'main' into zipped-file-handler

23c50c7

dirkgr closed this Feb 23, 2022
