Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Script referenced by shellpipe containing UTF-8 multibyte characters is not applied correctly #25

Closed
knutwannheden opened this issue Jan 28, 2022 · 6 comments

Comments

@knutwannheden
Copy link

Describe the bug
Since I have a somewhat complex setup I will have to try to distill it to a simple reproducer first, but it looks like there is a problem with encoding of input and or output when using the shellpipe filter. It appears like the platform default encoding is applied (CP1252 on Windows) and I am not aware of any way to

To Reproduce
I will try to work something out. But a simple script like this already causes failures on Windows:

#!/bin/bash

tmpfile=$(mktemp /tmp/names.XXXXXX)
tee $tmpfile | sed $'s/a/👻/g'
rm $tmpfile

Example error:

  File "C:\Users\knutw\AppData\Roaming\Python\Python310\site-packages\webchanges\filters.py", line 1430, in pipe_filter
    return subprocess.run(
  File "C:\Python310\lib\subprocess.py", line 503, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "C:\Python310\lib\subprocess.py", line 1149, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "C:\Python310\lib\subprocess.py", line 1517, in _communicate
    self._stdin_write(input)
  File "C:\Python310\lib\subprocess.py", line 1083, in _stdin_write
    self.stdin.write(input)
  File "C:\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u21b5' in position 1912: character maps to <undefined>

Expected behavior
I don't know the internals of webchanges but I think it may make sense to pass input and output streams in binary mode with shellpipe and then where required encode and decode strings using a default of UTF-8, which can possibly be configured somehow.

Screen scrape/screenshots
If applicable, add screen scrape or screenshots to help explain your problem.

Version info

2022-01-28 09:11:57,495 cli[46860] DEBUG: webchanges: 3.9 Copyright 2020- Mike Borsetti
2022-01-28 09:11:57,495 cli[46860] DEBUG: CPython: 3.10.1 ('tags/v3.10.1:2cd268a', 'Dec  6 2021 19:10:37') MSC v.1929 64 bit (AMD64)
2022-01-28 09:11:57,550 cli[46860] DEBUG: System: Windows-10-10.0.22000-SP0

Additional context
See also relevant code fromurlwatch: https://github.com/thp/urlwatch/blob/b1d4bc8526dc8ec680c50d3e3476797d18915635/lib/urlwatch/filters.py#L925-L926

webchanges appears to use text=True while not specifying any encoding. I have in my environment set the variable PYTHONIOENCODING=UTF-8, but I am relatively new to Python so I don't exactly know when and where this has any effect.

@knutwannheden
Copy link
Author

Please let me know if I can provide more information.

@mborsetti
Copy link
Owner

Hi Knut,

I love your bug reports for their detail; thanks.

Are you running Python in UTF-8 mode as per here?

@knutwannheden
Copy link
Author

Are you running Python in UTF-8 mode as per here?

Thanks for the pointer. I wasn't aware of the PYTHONUTF8=1 environment variable. That indeed appears to solve my problems. I assume PYTHONIOENCODING=UTF-8 is a "thing from the past" then.

I will do some more testing, but otherwise I think this may already be a good enough solution. It would probably make sense to point this out in the docs (for all us poor Windows users), especially since this project is perceived as a drop-in replacement for urlwatch, which doesn't appear to suffer from this problem.

But maybe you come to a different conclusion and decide to also change how your application communicates with sub-processes. It would certainly make sense with some tests, to make sure that both the input and output of UTF-8 characters works properly (also on Windows...).

I will report back my test results.

@mborsetti
Copy link
Owner

It would probably make sense to point this out in the docs (for all us poor Windows users), especially since this project is perceived as a drop-in replacement for urlwatch, which doesn't appear to suffer from this problem.

Yep, already written up in the filters doc for the next release. Nice catch about adding it to the migration documentation too.

But maybe you come to a different conclusion and decide to also change how your application communicates with sub-processes. It would certainly make sense with some tests, to make sure that both the input and output of UTF-8 characters works properly (also on Windows...).

Philosophically, I prefer to adhere to Python standards as opposed to make the app behave in "unexpected" ways. So I am leaning towards using the standard Python UTF-8 mode and standard subprocess module, and just documenting it.

The tests do have UTF-8 characters in them on purpose, but they do run in Python UTF-8 mode on my Windows machine (and succeed).

I will report back my test results.

Thanks, appreciate it!

@knutwannheden
Copy link
Author

Philosophically, I prefer to adhere to Python standards as opposed to make the app behave in "unexpected" ways. So I am leaning towards using the standard Python UTF-8 mode and standard subprocess module, and just documenting it.

Makes sense. I was just thinking that it may make sense to not set text=True, since the tool could also be used to watch binary data like PDF files. For text the tool would then just encode the text using the standard encoding (which I assume will respect PYTHONUTF8).

I still have quite a few discrepancies compared to urlwatch, but I think that is mostly due to the html2text filter which produces a different output. I will have to go through them all, which could take a while, since I have watches on a few hundred pages...

I will close this issue for now.

@knutwannheden
Copy link
Author

I still have quite a few discrepancies compared to urlwatch, but I think that is mostly due to the html2text filter which produces a different output. I will have to go through them all, which could take a while, since I have watches on a few hundred pages...

Regarding this comment: I now notice that it is just that the default for the method is different (html2text instead of strip_tags aka re). I now remember reading that in the docs...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants