Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to read from io.StringIO #570

Closed
mpenkov opened this issue Dec 22, 2020 · 14 comments · Fixed by #660
Closed

Unable to read from io.StringIO #570

mpenkov opened this issue Dec 22, 2020 · 14 comments · Fixed by #660
Labels

Comments

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 22, 2020

$ python
Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> buf = io.StringIO('hello world')
>>> buf.seek(0)
0
>>> import smart_open
>>> smart_open.open(buf).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/codecs.py", line 500, in read
    data = self.bytebuffer + newdata
TypeError: can't concat str to bytes
@mpenkov mpenkov added the bug label Dec 22, 2020
@mpenkov
Copy link
Collaborator Author

mpenkov commented Dec 22, 2020

version 4.0.1

@markopy
Copy link
Contributor

markopy commented Jan 18, 2021

I briefly looked into this because one of the tests dealing with it failed.

If you want to support this I think the only way is to explicitly check for StringIO and return it unmodified from open(). Also raise an error if mode='b' or if newline is specified because those can't be supported. encoding and errors can be ignored because no encoding actually takes place with StringIO.

@mpenkov
Copy link
Collaborator Author

mpenkov commented Jan 19, 2021

I don't like the idea of an explicit check. It's too narrow.

How about a fin.peek(0) to see what's in the actual stream? If it's text, we can't do anything other than return it as is.

@markopy
Copy link
Contributor

markopy commented Jan 20, 2021

I don't think StringIO has a peek method. Even if it did, other's would rarely implement it. You would have to do a read followed by rewinding with seek which limits you to random access files.

In general I'm not sure accepting file objects even makes sense. Why would you call smart_open or even the built in open if you already have a file object? I can see the point of having StringIO/BytesIO work for being able to mock something in memory but beyond that I'm not convinced.

Supporting StringIO also breaks the concept of smart_open always treating the underlying file as a byte stream which complicates things and may set you up for quite a bit of maintenance pain in the future.

@mpenkov
Copy link
Collaborator Author

mpenkov commented Jan 21, 2021

@markopy You make valid points.

In general I'm not sure accepting file objects even makes sense. Why would you call smart_open or even the built in open if you already have a file object?

@piskvorky WDYT? What was the motivation of allowing smart_open to accept a file object as input? It breaks symmetry with the built-in function.

>>> open(open('setup.py'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected str, bytes or os.PathLike object, not _io.TextIOWrapper
>>> 

Perhaps we should do exactly as the built-in option, and accept the above three classes as inputs exclusively?

@piskvorky
Copy link
Owner

piskvorky commented Jan 21, 2021

I'm not sure, would have to dig through the project history. I like @markopy 's argument, makes sense to keep smart_open's mission clean and contained.

@markopy
Copy link
Contributor

markopy commented Jan 21, 2021

Just one thing to add: For complete consistency with built in open you also need to accept integer file descriptors as pointed out in #528.

@piskvorky
Copy link
Owner

I found this: #32. @gojomo what was your motivation there, do you remember?

@gojomo
Copy link
Contributor

gojomo commented Jan 21, 2021

I can't recall the motivation. I vaguely think some bit of example code or downstream code in Gensim may have needed this behavior, and at the time I may have thought that even Python's open() worked this same way. There might have been a request via another issue or forum.

Given my minimal comment & your rapid review & integration, we'd probably talked about it somewhere else, but I'm not sure where.

If we try removing the passthrough case & running all smart_open & gensim tests, maybe we'll find something that breaks?

@mpenkov
Copy link
Collaborator Author

mpenkov commented Jan 22, 2021

Are you interested in investigating this further @markopy ?

@markopy
Copy link
Contributor

markopy commented Jan 22, 2021

The change from #32 now resides in _open_binary_stream so it implicitly assumes the file is opened in binary mode. In that context failing for StringIO or anything else that returns something other than bytes on read makes sense.

There are a whole bunch of tests explicitly for SmartOpenFileObjTest as well as some using BytesIO to test other things. There is also a test for this use case:

        Does our detection of the `name` attribute work with wrapped open()-ed streams?

        We `open()` a file with ".bz2" extension, pass the file object to `smart_open()` and check that
        we read decompressed data. This behavior is driven by detecting the `name` attribute in
        `_open_binary_stream()`.

I find support for this somewhat questionable but I don't have time to investigate why this might exist and whether to remove it again makes sense.

The easiest thing to do is to keep the current support for handling binary file objects but not extend it to text streams like StringIO.

@mpenkov
Copy link
Collaborator Author

mpenkov commented Jan 23, 2021

@piskvorky Removing support for open-ing a buffer (as opposed to a string) seems to have no effect on gensim tests.

What do you think about removing this functionality?

@piskvorky
Copy link
Owner

piskvorky commented Jan 23, 2021

I have no chips on the table – if gensim (and i guess gensim-data) continue to work, let's axe it!

It's a top-level API change though… does it warrant a major version bump?

@mpenkov
Copy link
Collaborator Author

mpenkov commented Jan 24, 2021

Yes, I think it does, even though it doesn't seem like people use this particular feature much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants