Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a feature to override file extension defining compression #607

Closed
3 tasks done
dmcguire81 opened this issue Apr 19, 2021 · 5 comments · Fixed by #609
Closed
3 tasks done

Add a feature to override file extension defining compression #607

dmcguire81 opened this issue Apr 19, 2021 · 5 comments · Fixed by #609

Comments

@dmcguire81
Copy link
Contributor

Problem description

Transparent compression/decompression is a killer feature that could be made more broadly applicable, if decoupled from the extension of the underlying "file". Tying application-handling behavior to "file" extension is not a universally portable idea (for example, on classic UNIX, a "shebang" line encodes the same information), and some decoupling already exists in the current ignore_ext flag to smart_open.open. This feature would complete the decoupling by allowing the extension to be effectively overridden to force a compression-handling behavior, by supplying a override_ext string to smart_open.open. At that point, users would be free to name "files" without extensions, and still have the ability to take advantage of the great feature.

Steps/code to reproduce the problem

N/A (not a defect), but any file path without an extension is not eligible for transparent compression/decompression, currently.

Versions

>>> import platform, sys, smart_open
>>> print(platform.platform())
Darwin-19.6.0-x86_64-i386-64bit
>>> print("Python", sys.version)
Python 3.7.8 (default, Jul 27 2020, 17:21:35) 
[Clang 11.0.3 (clang-1103.0.32.59)]
>>> print("smart_open", smart_open.__version__)
smart_open 5.0.0.dev0

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
@mpenkov
Copy link
Collaborator

mpenkov commented Apr 20, 2021

I think the current way we're handling compression is not ideal. It isn't obvious what the ignore_ext flag is doing, and it has no effect on files without an extension, as you've mentioned in the ticket and the associated PR.

This has been bothering me for a while, and I've come up with two solutions:

  1. An additional parameter. As I've mentioned in the PR, it's not a particularly good idea. We already have tons of parameters. Plus, explaining how this new parameter will work in tandem with the existing ignore_ext flag will also confuse people.
  2. Replace ignore_ext parameter with something that's more flexible and easier to explain to people. Of course, this would be a backwards-incompatible change, so we'd have to make a new major release for it.

Unless we find other alternatives, 2 is the way to go, but it will take some time to get it right. OTOH, we'd replace ignore_ext with a parameter like "compression", which will be an enum-like variable with the following values:

  • none: no compression at all, equivalent to the current ignore_ext=True
  • extension: default behavior, infer compression from extension, equievalent to current ignore_ext=False
  • sniff: sniff compression from the first bytes of the file (during reading only)
  • gz: gzip
  • zst: zstandard
  • ... and others

@aperiodic
Copy link
Contributor

  1. Replace ignore_ext parameter with something that's more flexible and easier to explain to people. Of course, this would be a backwards-incompatible change, so we'd have to make a new major release for it.

I agree that this is the best option, but why does it have to be backwards-incompatible? After the compression parameter is introduced, couldn't the implementation translate ignore_ext = False to compression = from_extension and ignore_ext = True to compression = none while emitting a deprecation warning? This would allow the new parameter to be introduced and ignore_ext to be deprecated while maintaining backwards-compatibility, and without the explainability and interaction downsides of the first approach.

(If both ignore_ext and compression are passed, then I think raising an exception is the best behavior, but this doesn't break backwards-compatibility.)

@mpenkov
Copy link
Collaborator

mpenkov commented Apr 20, 2021

Good point @aperiodic. We could introduce the new parameter without immediately removing the old one.

@dmcguire81
Copy link
Contributor Author

Thanks @mpenkov. Any feedback on when a release might get cut?

@mpenkov
Copy link
Collaborator

mpenkov commented May 7, 2021

Probably within the next month or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants