Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeDecodeError in stdout/err thread #556

Closed
r4lv opened this issue Feb 11, 2021 · 8 comments
Closed

UnicodeDecodeError in stdout/err thread #556

r4lv opened this issue Feb 11, 2021 · 8 comments

Comments

@r4lv
Copy link

r4lv commented Feb 11, 2021

Dear amoffat,

First, thank you for such an amazing project!

Lately, I've been using sh to generate LaTeX documents with latexmk, and sh gets stuck with an UnicodeDecodeError — I think this is related to _tee, and some strange stdout output. I'm using sh like this:

cmd = sh.latexmk("-pdf", blabla, _out=sys.stdout, _err_to_out=True, _tee="out", _bg=True, _bg_exc=True)
try:
    cmd.wait()
except KeyboardInterrupt:
    cmd.kill()
except sh.ErrorReturnCode as e:
    # process e.stdout, ...

and it never finishes the wait(). This is the error I obtain:

Underfull \hbox (badness 10000) in paragraph at lines 2--2
Overfull \hbox (0.9766pt too wide) in paragraph at lines 9--10
Exception in thread STDOUT/ERR thread for pid 5456:
Traceback (most recent call last):
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 1637, in wrap
    fn(*rgs, **kwargs)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2585, in output_thread
    done = f.read()
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 3049, in read
    self.write_chunk(chunk)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 3026, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2928, in process
    handler.write(encode(chunk))
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2918, in encode
    def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 22: invalid start byte

(I'm getting the same error on python 3.7)

I "solved" that problem by adding an errors="ignore" to the chunk.decode in line 2918, but I'm not sure if there are any side-effects:

sh/sh.py

Lines 2916 to 2920 in 12420ed

def get_file_chunk_consumer(handler):
if getattr(handler, "encoding", None):
def encode(chunk): return chunk.decode(handler.encoding) # noqa: E731
else:
def encode(chunk): return chunk # noqa: E731

0xf6 seems to be the german ö / o with diaeresis, I don't know why utf-8 couldn't decode that — would you have an idea?

@amoffat
Copy link
Owner

amoffat commented Feb 12, 2021

Hi @r4lv. If you run chardet.detect on your input, what encoding does it say it is?

@r4lv
Copy link
Author

r4lv commented Feb 25, 2021

Thank you for the idea!
The input file is {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}, and the log file LaTeX produces (which should match the stdout) is {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''} — something's fishy there. I tried adding _encoding="latin1" and/or _decode_errors="ignore" to my call to sh, but without success. Even when passing _encoding, the handler.encoding from line 2917 still defaults to utf-8. Is this due to the _tee, _out, etc which I pass too?

@amoffat
Copy link
Owner

amoffat commented Mar 4, 2021

Hmm, sh should respect your _encoding value. I'm having a hard time seeing how it could default to utf-8 when you pass it in explicitly.

This is where it gets its default value, which would be your system encoding (but you are overriding it)

sh/sh.py

Line 1220 in 12420ed

"encoding": DEFAULT_ENCODING,

The encoding value is passed into two StreamReader objects (these read from stdout and stderr)
here

sh/sh.py

Lines 2206 to 2212 in 12420ed

self._stdout_stream = StreamReader(
self.log.get_child("streamreader", "stdout"),
self._stdout_parent_fd, stdout, self._stdout,
ca["out_bufsize"], ca["encoding"],
ca["decode_errors"], stdout_pipe,
save_data=save_stdout
)

and here

sh/sh.py

Lines 2233 to 2238 in 12420ed

self._stderr_stream = StreamReader(
Logger("streamreader"),
self._stderr_parent_fd, stderr, self._stderr,
ca["err_bufsize"], ca["encoding"], ca["decode_errors"],
stderr_pipe, save_data=save_stderr
)

The StreamReaders create a chunk processor, using that passed-in encoding directly, here:

sh/sh.py

Lines 2999 to 3000 in 12420ed

self.process_chunk, self.finish_chunk_processor = \
determine_how_to_feed_output(handler, encoding, decode_errors)

A "file chunk consumer" is created because your output (stdout) has a write method on it, here:

sh/sh.py

Lines 2896 to 2897 in 12420ed

elif hasattr(handler, "write"):
process, finish = get_file_chunk_consumer(handler)

In the file chunk consumer, the encoding is picked up off of the StreamReader handler here:

sh/sh.py

Line 2918 in 12420ed

def encode(chunk): return chunk.decode(handler.encoding) # noqa: E731

So if you are specifying "latin1", I am not seeing how "utf8" makes it to line 2918. I don't see anywhere where the "encoding" parameter is overwritten by anything. 🤔

@ecederstrand
Copy link
Collaborator

Can you capture your stdout and attach it here? That would make it easier to reproduce the error and narrow down the problem.

@ecederstrand
Copy link
Collaborator

We'll need a reproducible example to be able to debug this. I'm closing the issue, but feel free to reopen if you get back to this.

@felixonmars
Copy link
Contributor

I hit this recently too. It seems that when the descriptor passed to _out is something like StringIO instead of BytesIO, sh will attempt to decode the output:

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=sys.stdout, _tee=True)
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
Exception in thread STDOUT/ERR thread for pid 1573984:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
    fn(*rgs, **kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
    done = f.read()
  File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
    self.write_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2936, in process
    handler.write(encode(chunk))
  File "/usr/lib/python3.10/site-packages/sh.py", line 2926, in encode
    def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=StringIO(), _tee=True)
Exception in thread STDOUT/ERR thread for pid 1576798:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
    fn(*rgs, **kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
    done = f.read()
  File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
    self.write_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2977, in process
    handler.write(chunk.decode(encoding, decode_errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=BytesIO(), _tee=True)
b'pub   rsa4096 2018-05-15 [SC]\n      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174\nuid           [ unknown] Marcus M\xfcller (GNU Radio Maintainer) <mmueller@gnuradio.org>\nuid           [ unknown] Marcus M\xfcller <marcus@hostalia.de>\n\n'

With _decode_errors="ignore" set, it starts to work for StringIO, but still not for sys.stdout:

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=StringIO(), _tee=True, _decode_errors="ignore")
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
uid           [ unknown] Marcus Mller (GNU Radio Maintainer) <mmueller@gnuradio.org>
uid           [ unknown] Marcus Mller <marcus@hostalia.de>

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=sys.stdout, _tee=True, _decode_errors="ignore")
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
Exception in thread STDOUT/ERR thread for pid 1579387:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
    fn(*rgs, **kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
    done = f.read()
  File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
    self.write_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2936, in process
    handler.write(encode(chunk))
  File "/usr/lib/python3.10/site-packages/sh.py", line 2926, in encode
    def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174

@ecederstrand
Copy link
Collaborator

There must be something more affecting this. I cannot reproduce the above on OS X and Python 3.10.1:

>>> import sh, sys
>>> from io import StringIO
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "E6B0152CE5614F6680EEFDC25D11C3B7A2E59699", _out=StringIO(), _tee=True, _decode_errors="ignore")
pub   ed25519 2022-10-19 [SC] [expires: 2024-10-18]
      E6B0 152C E561 4F66 80EE  FDC2 5D11 C3B7 A2E5 9699
uid           [ ultim. ] John Doe <john@example.com>
sub   cv25519 2022-10-19 [E] [expires: 2024-10-18]

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "E6B0152CE5614F6680EEFDC25D11C3B7A2E59699", _out=sys.stdout, _tee=True, _decode_errors="ignore")
pub   ed25519 2022-10-19 [SC] [expires: 2024-10-18]
      E6B0 152C E561 4F66 80EE  FDC2 5D11 C3B7 A2E5 9699
uid           [ ultim. ] John Doe <john@example.com>
sub   cv25519 2022-10-19 [E] [expires: 2024-10-18]

pub   ed25519 2022-10-19 [SC] [expires: 2024-10-18]
      E6B0 152C E561 4F66 80EE  FDC2 5D11 C3B7 A2E5 9699
uid           [ ultim. ] John Doe <john@example.com>
sub   cv25519 2022-10-19 [E] [expires: 2024-10-18]

@felixonmars
Copy link
Contributor

The output of gpg in your example are all valid ASCII characters, which won't trigger this problem.

The example I provided actually contains a special character:

$ gpg --fingerprint D74F9F146E7F755783583158B343B2BA293E5174
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
uid           [ unknown] Marcus Müller (GNU Radio Maintainer) <mmueller@gnuradio.org>
uid           [ unknown] Marcus Müller <marcus@hostalia.de>

felixonmars added a commit to felixonmars/sh that referenced this issue Oct 19, 2022
This fixes the example I provided at
amoffat#556 (comment)
by making fd/file_chunk_consumer respect decode_errors just like the
other consumers.
amoffatgmi pushed a commit that referenced this issue Feb 8, 2023
This fixes the example I provided at
#556 (comment)
by making fd/file_chunk_consumer respect decode_errors just like the
other consumers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants