UnicodeDecodeError in stdout/err thread #556

r4lv · 2021-02-11T12:23:25Z

Dear amoffat,

First, thank you for such an amazing project!

Lately, I've been using sh to generate LaTeX documents with latexmk, and sh gets stuck with an UnicodeDecodeError — I think this is related to _tee, and some strange stdout output. I'm using sh like this:

cmd = sh.latexmk("-pdf", blabla, _out=sys.stdout, _err_to_out=True, _tee="out", _bg=True, _bg_exc=True)
try:
    cmd.wait()
except KeyboardInterrupt:
    cmd.kill()
except sh.ErrorReturnCode as e:
    # process e.stdout, ...

and it never finishes the wait(). This is the error I obtain:

Underfull \hbox (badness 10000) in paragraph at lines 2--2
Overfull \hbox (0.9766pt too wide) in paragraph at lines 9--10
Exception in thread STDOUT/ERR thread for pid 5456:
Traceback (most recent call last):
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 1637, in wrap
    fn(*rgs, **kwargs)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2585, in output_thread
    done = f.read()
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 3049, in read
    self.write_chunk(chunk)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 3026, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2928, in process
    handler.write(encode(chunk))
  File "/Users/r4lv/.pyenv/versions/3.9.1/Python.framework/Versions/3.9/lib/python3.9/site-packages/sh.py", line 2918, in encode
    def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 22: invalid start byte

(I'm getting the same error on python 3.7)

I "solved" that problem by adding an errors="ignore" to the chunk.decode in line 2918, but I'm not sure if there are any side-effects:

sh/sh.py

Lines 2916 to 2920 in 12420ed

    
           def get_file_chunk_consumer(handler): 
        
               if getattr(handler, "encoding", None): 
        
                   def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731 
        
               else: 
        
                   def encode(chunk): return chunk  # noqa: E731

0xf6 seems to be the german ö / o with diaeresis, I don't know why utf-8 couldn't decode that — would you have an idea?

The text was updated successfully, but these errors were encountered:

amoffat · 2021-02-12T02:14:35Z

Hi @r4lv. If you run chardet.detect on your input, what encoding does it say it is?

r4lv · 2021-02-25T22:31:09Z

Thank you for the idea!
The input file is {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}, and the log file LaTeX produces (which should match the stdout) is {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''} — something's fishy there. I tried adding _encoding="latin1" and/or _decode_errors="ignore" to my call to sh, but without success. Even when passing _encoding, the handler.encoding from line 2917 still defaults to utf-8. Is this due to the _tee, _out, etc which I pass too?

amoffat · 2021-03-04T00:19:44Z

Hmm, sh should respect your _encoding value. I'm having a hard time seeing how it could default to utf-8 when you pass it in explicitly.

This is where it gets its default value, which would be your system encoding (but you are overriding it)

sh/sh.py

Line 1220 in 12420ed

"encoding": DEFAULT_ENCODING,

The encoding value is passed into two StreamReader objects (these read from stdout and stderr)
here

sh/sh.py

Lines 2206 to 2212 in 12420ed

    
           self._stdout_stream = StreamReader( 
        
               self.log.get_child("streamreader", "stdout"), 
        
               self._stdout_parent_fd, stdout, self._stdout, 
        
               ca["out_bufsize"], ca["encoding"], 
        
               ca["decode_errors"], stdout_pipe, 
        
               save_data=save_stdout 
        
           )

and here

sh/sh.py

Lines 2233 to 2238 in 12420ed

    
           self._stderr_stream = StreamReader( 
        
               Logger("streamreader"), 
        
               self._stderr_parent_fd, stderr, self._stderr, 
        
               ca["err_bufsize"], ca["encoding"], ca["decode_errors"], 
        
               stderr_pipe, save_data=save_stderr 
        
           )

The StreamReaders create a chunk processor, using that passed-in encoding directly, here:

sh/sh.py

Lines 2999 to 3000 in 12420ed

    
           self.process_chunk, self.finish_chunk_processor = \ 
        
               determine_how_to_feed_output(handler, encoding, decode_errors)

A "file chunk consumer" is created because your output (stdout) has a write method on it, here:

sh/sh.py

Lines 2896 to 2897 in 12420ed

    
           elif hasattr(handler, "write"): 
        
               process, finish = get_file_chunk_consumer(handler)

In the file chunk consumer, the encoding is picked up off of the StreamReader handler here:

sh/sh.py

Line 2918 in 12420ed

def encode(chunk): return chunk.decode(handler.encoding) # noqa: E731

So if you are specifying "latin1", I am not seeing how "utf8" makes it to line 2918. I don't see anywhere where the "encoding" parameter is overwritten by anything. 🤔

ecederstrand · 2021-05-24T15:37:30Z

Can you capture your stdout and attach it here? That would make it easier to reproduce the error and narrow down the problem.

ecederstrand · 2021-06-07T09:50:03Z

We'll need a reproducible example to be able to debug this. I'm closing the issue, but feel free to reopen if you get back to this.

felixonmars · 2022-10-17T16:13:05Z

I hit this recently too. It seems that when the descriptor passed to _out is something like StringIO instead of BytesIO, sh will attempt to decode the output:

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=sys.stdout, _tee=True)
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
Exception in thread STDOUT/ERR thread for pid 1573984:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
    fn(*rgs, **kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
    done = f.read()
  File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
    self.write_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2936, in process
    handler.write(encode(chunk))
  File "/usr/lib/python3.10/site-packages/sh.py", line 2926, in encode
    def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=StringIO(), _tee=True)
Exception in thread STDOUT/ERR thread for pid 1576798:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
    fn(*rgs, **kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
    done = f.read()
  File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
    self.write_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2977, in process
    handler.write(chunk.decode(encoding, decode_errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=BytesIO(), _tee=True)
b'pub   rsa4096 2018-05-15 [SC]\n      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174\nuid           [ unknown] Marcus M\xfcller (GNU Radio Maintainer) <mmueller@gnuradio.org>\nuid           [ unknown] Marcus M\xfcller <marcus@hostalia.de>\n\n'

With _decode_errors="ignore" set, it starts to work for StringIO, but still not for sys.stdout:

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=StringIO(), _tee=True, _decode_errors="ignore")
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
uid           [ unknown] Marcus Mller (GNU Radio Maintainer) <mmueller@gnuradio.org>
uid           [ unknown] Marcus Mller <marcus@hostalia.de>

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "D74F9F146E7F755783583158B343B2BA293E5174", _out=sys.stdout, _tee=True, _decode_errors="ignore")
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
Exception in thread STDOUT/ERR thread for pid 1579387:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 1641, in wrap
    fn(*rgs, **kwargs)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2593, in output_thread
    done = f.read()
  File "/usr/lib/python3.10/site-packages/sh.py", line 3057, in read
    self.write_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 3034, in write_chunk
    self.should_quit = self.process_chunk(chunk)
  File "/usr/lib/python3.10/site-packages/sh.py", line 2936, in process
    handler.write(encode(chunk))
  File "/usr/lib/python3.10/site-packages/sh.py", line 2926, in encode
    def encode(chunk): return chunk.decode(handler.encoding)  # noqa: E731
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 33: invalid start byte
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174

ecederstrand · 2022-10-19T09:53:04Z

There must be something more affecting this. I cannot reproduce the above on OS X and Python 3.10.1:

>>> import sh, sys
>>> from io import StringIO
>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "E6B0152CE5614F6680EEFDC25D11C3B7A2E59699", _out=StringIO(), _tee=True, _decode_errors="ignore")
pub   ed25519 2022-10-19 [SC] [expires: 2024-10-18]
      E6B0 152C E561 4F66 80EE  FDC2 5D11 C3B7 A2E5 9699
uid           [ ultim. ] John Doe <john@example.com>
sub   cv25519 2022-10-19 [E] [expires: 2024-10-18]

>>> sh.env("LC_CTYPE=POSIX", "gpg", "--fingerprint", "E6B0152CE5614F6680EEFDC25D11C3B7A2E59699", _out=sys.stdout, _tee=True, _decode_errors="ignore")
pub   ed25519 2022-10-19 [SC] [expires: 2024-10-18]
      E6B0 152C E561 4F66 80EE  FDC2 5D11 C3B7 A2E5 9699
uid           [ ultim. ] John Doe <john@example.com>
sub   cv25519 2022-10-19 [E] [expires: 2024-10-18]

pub   ed25519 2022-10-19 [SC] [expires: 2024-10-18]
      E6B0 152C E561 4F66 80EE  FDC2 5D11 C3B7 A2E5 9699
uid           [ ultim. ] John Doe <john@example.com>
sub   cv25519 2022-10-19 [E] [expires: 2024-10-18]

felixonmars · 2022-10-19T10:38:05Z

The output of gpg in your example are all valid ASCII characters, which won't trigger this problem.

The example I provided actually contains a special character:

$ gpg --fingerprint D74F9F146E7F755783583158B343B2BA293E5174
pub   rsa4096 2018-05-15 [SC]
      D74F 9F14 6E7F 7557 8358  3158 B343 B2BA 293E 5174
uid           [ unknown] Marcus Müller (GNU Radio Maintainer) <mmueller@gnuradio.org>
uid           [ unknown] Marcus Müller <marcus@hostalia.de>

This fixes the example I provided at amoffat#556 (comment) by making fd/file_chunk_consumer respect decode_errors just like the other consumers.

This fixes the example I provided at #556 (comment) by making fd/file_chunk_consumer respect decode_errors just like the other consumers.

ecederstrand closed this as completed Jun 7, 2021

felixonmars added a commit to felixonmars/sh that referenced this issue Oct 19, 2022

Make fd/file_chunk_consumer respect decode_errors

5c7a65e

This fixes the example I provided at amoffat#556 (comment) by making fd/file_chunk_consumer respect decode_errors just like the other consumers.

felixonmars mentioned this issue Oct 19, 2022

Make fd/file_chunk_consumer respect decode_errors #628

Merged

amoffatgmi pushed a commit that referenced this issue Feb 8, 2023

Make fd/file_chunk_consumer respect decode_errors

95f60ea

This fixes the example I provided at #556 (comment) by making fd/file_chunk_consumer respect decode_errors just like the other consumers.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UnicodeDecodeError in stdout/err thread #556

UnicodeDecodeError in stdout/err thread #556

r4lv commented Feb 11, 2021

amoffat commented Feb 12, 2021

r4lv commented Feb 25, 2021

amoffat commented Mar 4, 2021 •

edited

Loading

ecederstrand commented May 24, 2021

ecederstrand commented Jun 7, 2021

felixonmars commented Oct 17, 2022

ecederstrand commented Oct 19, 2022

felixonmars commented Oct 19, 2022

UnicodeDecodeError in stdout/err thread #556

UnicodeDecodeError in stdout/err thread #556

Comments

r4lv commented Feb 11, 2021

amoffat commented Feb 12, 2021

r4lv commented Feb 25, 2021

amoffat commented Mar 4, 2021 • edited Loading

ecederstrand commented May 24, 2021

ecederstrand commented Jun 7, 2021

felixonmars commented Oct 17, 2022

ecederstrand commented Oct 19, 2022

felixonmars commented Oct 19, 2022

amoffat commented Mar 4, 2021 •

edited

Loading