Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON API: Encoding? #2273

Closed
enkore opened this issue Mar 8, 2017 · 13 comments
Closed

JSON API: Encoding? #2273

enkore opened this issue Mar 8, 2017 · 13 comments
Assignees
Milestone

Comments

@enkore
Copy link
Contributor

enkore commented Mar 8, 2017

From #2249

Or we fix IO encoding to UTF-8 (irrespective of locale) in JSON mode, which probably makes more sense and is less error prone for downstream developers anyway.

I.e., --log-json → stderr text is always UTF-8, stdin text (so not "borg create -") is always UTF-8 (prompts, passwords), env-vars are read as UTF-8 (dito)

--json → stdout text is always UTF-8

@enkore enkore added this to the 1.1 - near future goals milestone Aug 6, 2017
@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.0rc2, 1.1 - near future goals Aug 15, 2017
@ThomasWaldmann
Copy link
Member

As this is a quite fundamental change, guess it should get some testing, in rc2.

Guess we won't have problems if the json is written to a file, not so sure about when it gets output on the console. Or piped and processed badly by other side.

@enkore
Copy link
Contributor Author

enkore commented Aug 16, 2017

Guess we won't have problems if the json is written to a file

When you write text (strings) to stdout/stderr, then they are encoded to bytes using an encoding guessed by Python. That's independent of whether stdout is connected to a TTY/terminal or redirected to a file.

@ThomasWaldmann
Copy link
Member

Yes, but you suggested to always use utf-8.

So how does e.g. a cygwin console or latin1/ascii console react when you output utf-8 on it?
Guess we could live with funny characters, but it shouldn't crash or hang.

Or when doing borg --json | othertool (and othertool guesses encoding), it might guess wrong when utf-8 is not the native system / fs encoding? If othertool is specialized on borg, it would use the right encoding, but if not, could it be told to use utf-8?

@enkore enkore self-assigned this Aug 18, 2017
@enkore
Copy link
Contributor Author

enkore commented Aug 18, 2017

Depends on #2925;

Note: borg list already uses UTF-8 regardless of system preference (via safe_encode), but only for listing archive contents.

Yes, but you suggested to always use utf-8.

The alternative is to make step one of using --json: "Replicate the way Python guesses encodings [which changes over Python releases]." i.e. "Use Python.". That's not acceptable.

@RonnyPfannschmidt
Copy link
Contributor

well - a completely ascii-save way could be to do unicode-escape then all unicode is escaped as \u...

@knutov
Copy link

knutov commented Aug 25, 2017

just as an idea - why not to skip support for latin1 and other non-unicode terminals now?

latin1 symbols inside utf8 will look the same on latin1 terminal I suppose. Other symbols will not be readable anyway, there is no good solution for this. And there is iconv for those who need 8bit encodings and knows what he is doing.

@enkore
Copy link
Contributor Author

enkore commented Aug 27, 2017

I prepared an initial, functionally incomplete patch I was completely dissatisfied with. I've been working to fix this for good by replacing most of these interactions (os.environ, input/yes, get_passphrase, ...) to use a iosys-class that determines encoding (from Python) and decodes stuff. But this is still incomplete and touches many of the more annoying parts of the code, so it may be reasonable to just go forward with rc2 and perhaps even 1.1.0 without having this resolved yet — on most (Linux/BSD) systems it will "mostly just work", because UTF-8 is a very widespread locale codeset and typically assumed. (OpenBSD has an especially good grip on things here for a Unix, because they only support UTF-8 and 7-bit ASCII). In this case it may be best to add a short note in the docs to say that encoding will be finalized to UTF-8 later.

This will fall apart on Linux when no locale is configured (because Python will fallback to 7-bit ASCII), or glibc things no locale is configured, or considers the configuration invalid (e.g. partial or missing locale files). And of course every locale that is not UTF-8.

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Aug 27, 2017

OK, so let's have some docs now and the fix later.

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1 - near future goals, 1.1.0rc2 Aug 27, 2017
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Sep 7, 2017
@enkore enkore added the bug label Sep 7, 2017
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Sep 8, 2017
ThomasWaldmann added a commit that referenced this issue Sep 9, 2017
document utf-8 locale requirement for json mode, #2273 (#3009)
@enkore enkore modified the milestones: 1.1.x, 1.1.0 release Sep 9, 2017
@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Sep 18, 2017

From https://docs.python.org/3.4/library/sys.html#sys.stdin / sys.stdout / sys.stderr:

The character encoding is platform-dependent.
Under Windows, if the stream is interactive (that is, if its isatty() method returns
True), the console codepage is used, otherwise the ANSI code page.
Under other platforms, the locale encoding is used (see locale.getpreferredencoding()).

Under all platforms though, you can override this value by setting the
PYTHONIOENCODING environment variable before starting Python.

More recent docs:

@enkore enkore removed their assignment Oct 14, 2017
@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.1rc1, 1.1.2rc1 Oct 14, 2017
@ThomasWaldmann ThomasWaldmann removed this from the 1.1.2rc1 milestone Nov 4, 2017
@ThomasWaldmann ThomasWaldmann added this to the 1.1.x milestone Nov 4, 2017
@ghost ghost mentioned this issue Aug 26, 2021
@hexagonrecursion
Copy link
Contributor

hexagonrecursion commented Dec 7, 2021

OK, so let's have some docs now and the fix for 1.1.1 or so.

The docs were added Sep 9, 2017 #3019 (document utf-8 locale requirement for json mode). It looks like you forgot to remove the "documentation" label

@ThomasWaldmann
Copy link
Member

Thanks for the hint, I removed the documentation label.

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.x, 1.2.x Feb 20, 2022
@ThomasWaldmann ThomasWaldmann modified the milestones: 1.2.x, 2.x Jun 26, 2022
@ThomasWaldmann ThomasWaldmann modified the milestones: 2.x, 2.0.0b5 Jan 29, 2023
@borgbackup borgbackup deleted a comment from enkore Jan 29, 2023
@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jan 29, 2023

For stdin/stdout/stderr and JSON emitted on stdout (see frontends.rst), guess we could extend #3019 and just point there from the docs, so users invoking borg can adjust their environment variables if they do not use a locale with utf-8 encoding already:

https://docs.python.org/3.8/library/sys.html#sys.stdin reads:

Under all platforms, you can override the character encoding by setting the PYTHONIOENCODING environment variable before starting Python or by using the new -X utf8 command line option and PYTHONUTF8 environment variable. However, for the Windows console, this only applies when PYTHONLEGACYWINDOWSSTDIO is also set.

@ThomasWaldmann
Copy link
Member

Hmm, did I miss something or can this "fix" just be recommending to use PYTHONIOENCODING=utf-8 if one expects (JSON) streams to be in utf-8 on a legacy OS installation that does not use a utf-8 locale already?

PYTHONUTF8 seems way too intrusive and influences how a lot of stuff works - this could break stuff that worked before.

@ThomasWaldmann ThomasWaldmann self-assigned this Feb 1, 2023
ThomasWaldmann added a commit that referenced this issue Feb 1, 2023
document another way to get UTF-8 encoding on stdin/stdout/stderr, fixes #2273
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants