[BUG] Script referenced by `shellpipe` containing UTF-8 multibyte characters is not applied correctly #25

knutwannheden · 2022-01-28T08:12:41Z

Describe the bug
Since I have a somewhat complex setup I will have to try to distill it to a simple reproducer first, but it looks like there is a problem with encoding of input and or output when using the shellpipe filter. It appears like the platform default encoding is applied (CP1252 on Windows) and I am not aware of any way to

To Reproduce
I will try to work something out. But a simple script like this already causes failures on Windows:

#!/bin/bash

tmpfile=$(mktemp /tmp/names.XXXXXX)
tee $tmpfile | sed $'s/a/👻/g'
rm $tmpfile

Example error:

  File "C:\Users\knutw\AppData\Roaming\Python\Python310\site-packages\webchanges\filters.py", line 1430, in pipe_filter
    return subprocess.run(
  File "C:\Python310\lib\subprocess.py", line 503, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "C:\Python310\lib\subprocess.py", line 1149, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "C:\Python310\lib\subprocess.py", line 1517, in _communicate
    self._stdin_write(input)
  File "C:\Python310\lib\subprocess.py", line 1083, in _stdin_write
    self.stdin.write(input)
  File "C:\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u21b5' in position 1912: character maps to <undefined>

Expected behavior
I don't know the internals of webchanges but I think it may make sense to pass input and output streams in binary mode with shellpipe and then where required encode and decode strings using a default of UTF-8, which can possibly be configured somehow.

Screen scrape/screenshots
If applicable, add screen scrape or screenshots to help explain your problem.

Version info

2022-01-28 09:11:57,495 cli[46860] DEBUG: webchanges: 3.9 Copyright 2020- Mike Borsetti
2022-01-28 09:11:57,495 cli[46860] DEBUG: CPython: 3.10.1 ('tags/v3.10.1:2cd268a', 'Dec  6 2021 19:10:37') MSC v.1929 64 bit (AMD64)
2022-01-28 09:11:57,550 cli[46860] DEBUG: System: Windows-10-10.0.22000-SP0

Additional context
See also relevant code fromurlwatch: https://github.com/thp/urlwatch/blob/b1d4bc8526dc8ec680c50d3e3476797d18915635/lib/urlwatch/filters.py#L925-L926

webchanges appears to use text=True while not specifying any encoding. I have in my environment set the variable PYTHONIOENCODING=UTF-8, but I am relatively new to Python so I don't exactly know when and where this has any effect.

The text was updated successfully, but these errors were encountered:

knutwannheden · 2022-01-28T08:13:31Z

Please let me know if I can provide more information.

mborsetti · 2022-01-28T08:41:10Z

Hi Knut,

I love your bug reports for their detail; thanks.

Are you running Python in UTF-8 mode as per here?

knutwannheden · 2022-01-28T09:34:35Z

Are you running Python in UTF-8 mode as per here?

Thanks for the pointer. I wasn't aware of the PYTHONUTF8=1 environment variable. That indeed appears to solve my problems. I assume PYTHONIOENCODING=UTF-8 is a "thing from the past" then.

I will do some more testing, but otherwise I think this may already be a good enough solution. It would probably make sense to point this out in the docs (for all us poor Windows users), especially since this project is perceived as a drop-in replacement for urlwatch, which doesn't appear to suffer from this problem.

But maybe you come to a different conclusion and decide to also change how your application communicates with sub-processes. It would certainly make sense with some tests, to make sure that both the input and output of UTF-8 characters works properly (also on Windows...).

I will report back my test results.

mborsetti · 2022-01-28T10:11:11Z

It would probably make sense to point this out in the docs (for all us poor Windows users), especially since this project is perceived as a drop-in replacement for urlwatch, which doesn't appear to suffer from this problem.

Yep, already written up in the filters doc for the next release. Nice catch about adding it to the migration documentation too.

But maybe you come to a different conclusion and decide to also change how your application communicates with sub-processes. It would certainly make sense with some tests, to make sure that both the input and output of UTF-8 characters works properly (also on Windows...).

Philosophically, I prefer to adhere to Python standards as opposed to make the app behave in "unexpected" ways. So I am leaning towards using the standard Python UTF-8 mode and standard subprocess module, and just documenting it.

The tests do have UTF-8 characters in them on purpose, but they do run in Python UTF-8 mode on my Windows machine (and succeed).

I will report back my test results.

Thanks, appreciate it!

knutwannheden · 2022-01-28T10:50:36Z

Philosophically, I prefer to adhere to Python standards as opposed to make the app behave in "unexpected" ways. So I am leaning towards using the standard Python UTF-8 mode and standard subprocess module, and just documenting it.

Makes sense. I was just thinking that it may make sense to not set text=True, since the tool could also be used to watch binary data like PDF files. For text the tool would then just encode the text using the standard encoding (which I assume will respect PYTHONUTF8).

I still have quite a few discrepancies compared to urlwatch, but I think that is mostly due to the html2text filter which produces a different output. I will have to go through them all, which could take a while, since I have watches on a few hundred pages...

I will close this issue for now.

knutwannheden · 2022-01-28T10:58:27Z

I still have quite a few discrepancies compared to urlwatch, but I think that is mostly due to the html2text filter which produces a different output. I will have to go through them all, which could take a while, since I have watches on a few hundred pages...

Regarding this comment: I now notice that it is just that the default for the method is different (html2text instead of strip_tags aka re). I now remember reading that in the docs...

knutwannheden closed this as completed Jan 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Script referenced by `shellpipe` containing UTF-8 multibyte characters is not applied correctly #25

[BUG] Script referenced by `shellpipe` containing UTF-8 multibyte characters is not applied correctly #25

knutwannheden commented Jan 28, 2022

knutwannheden commented Jan 28, 2022

mborsetti commented Jan 28, 2022

knutwannheden commented Jan 28, 2022

mborsetti commented Jan 28, 2022

knutwannheden commented Jan 28, 2022

knutwannheden commented Jan 28, 2022

[BUG] Script referenced by shellpipe containing UTF-8 multibyte characters is not applied correctly #25

[BUG] Script referenced by shellpipe containing UTF-8 multibyte characters is not applied correctly #25

Comments

knutwannheden commented Jan 28, 2022

knutwannheden commented Jan 28, 2022

mborsetti commented Jan 28, 2022

knutwannheden commented Jan 28, 2022

mborsetti commented Jan 28, 2022

knutwannheden commented Jan 28, 2022

knutwannheden commented Jan 28, 2022

[BUG] Script referenced by `shellpipe` containing UTF-8 multibyte characters is not applied correctly #25

[BUG] Script referenced by `shellpipe` containing UTF-8 multibyte characters is not applied correctly #25