Changing console's OutputEncoding on linux to unicode generates garbage #29735

Open
JustArchi opened this issue Jun 1, 2019 · 19 comments

Comments

@JustArchi
Contributor

Hello.

I've experimented a bit in my cross-platform app by setting Console.OutputEncoding = Encoding.Unicode; globally before the first console entry gets written.

On Windows, as expected, the output encoding is changed nicely and the console displays the whole range of symbols, including Cyrillic characters.

On Linux, the encoding is also changed, but the console generates garbage from this point onwards, where previously the Cyrillic characters would also show properly (probably due to UTF-8 already being the default there).

Judging by my own research based on this line, I'd expect that changing the encoding on Linux would truly be a no-op that doesn't affect anything, or at worst would throw an exception to handle at runtime, but instead it broke display that worked previously.

I'm not sure if this is intended or not, I apologize in advance if it is but I couldn't find any issue that relates to my problem. Feel free to close it in this case.

Otherwise, feel free to check the issue yourself, it should be enough to launch code similar to below on any linux machine:

Console.OutputEncoding = Encoding.Unicode;
Console.WriteLine("привет");

In my case, it prints ?@825B. It's important to test it with Cyrillic or something more obscure, because for ASCII text the extra 00 byte of each UTF-16 character is written to the terminal as a NUL and thus not displayed.
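The ?@825B output can be reproduced without a terminal. Below is a minimal sketch (the class and method names are illustrative and not part of the original report): encoding "привет" with Encoding.Unicode (UTF-16LE) produces the byte pairs 3F 04, 40 04, and so on, and keeping only the printable ASCII bytes leaves exactly the characters that showed up.

```csharp
using System;
using System.Text;

static class Utf16Demo
{
    // Hex dump of the UTF-16LE bytes that Encoding.Unicode produces for the text.
    public static string HexDump(string text)
        => BitConverter.ToString(Encoding.Unicode.GetBytes(text));

    // Keeps only the bytes that fall in the printable ASCII range (0x20..0x7E),
    // a rough model of what stays visible once the bytes reach the terminal.
    public static string PrintableAscii(string text)
    {
        var visible = new StringBuilder();
        foreach (byte b in Encoding.Unicode.GetBytes(text))
        {
            if (b >= 0x20 && b < 0x7F)
                visible.Append((char)b);
        }
        return visible.ToString();
    }

    static void Main()
    {
        Console.WriteLine(HexDump("привет"));        // 3F-04-40-04-38-04-32-04-35-04-42-04
        Console.WriteLine(PrintableAscii("привет")); // ?@825B
    }
}
```

The low byte of each Cyrillic code unit happens to land in the ASCII range (U+043F gives 3F, which is '?'), while the high byte 04 is a control character.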

As you can expect, this issue also affects OS X.

Thank you in advance for looking into this issue.

@JustArchi changed the title from "Changing console's OutputEncoding on linux generates garbage" to "Changing console's OutputEncoding on linux to unicode generates garbage" on Jun 2, 2019
@danmoseley
Member

On Ubuntu in WSL at least this does not repro

dan@DESKTOP-CJJBOPC:~/2$ dotnet run
?�@�8�2�5�B�
 dan@DESKTOP-CJJBOPC:~/2$ ls
2.csproj  bin  obj  out  Program.cs

@JustArchi
Contributor Author

JustArchi commented Jun 2, 2019

@danmosemsft it doesn't? Your output shows the garbage I refer to, caused by the mismatch between the console encoding used by .NET Core and the actual encoding of the terminal. It should print "привет" instead, like on Windows.

@danmoseley
Member

@JustArchi I misunderstood, you mean that writing through the Console object is messed up. I guess so, at least, my next string is double spaced:

dan@DESKTOP-CJJBOPC:~/2$ dotnet run
?�@�8�2�5�B�
 h e l l o
 dan@DESKTOP-CJJBOPC:~/2$

@stephentoub
Member

stephentoub commented Jun 2, 2019

Changing Console.OutputEncoding changes the encoding used to translate strings to the bytes written out to the underlying stream, whether that's to a terminal or redirected. There's no good way I'm aware of to programmatically tell the terminal to change the encoding it uses to decode written bytes, but we dutifully follow the request so that a terminal manually changed or redirected output gets the correct data. I'm not sure what else we could do here. What are you suggesting is the viable alternative?
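To illustrate the point about the writer and the reader having to agree, here is a small sketch that models both sides as Encoding instances. Transcode is a hypothetical helper, not a Console API, and the second Encoding only stands in for whatever the terminal or the consumer of redirected output uses to decode the bytes.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // Encodes with the writer's encoding and decodes with the reader's,
    // modelling a program and a consumer that may or may not agree.
    public static string Transcode(string text, Encoding writer, Encoding reader)
        => reader.GetString(writer.GetBytes(text));

    static void Main()
    {
        // Reader matches the writer (for example, redirected output read back as UTF-16).
        Console.WriteLine(Transcode("привет", Encoding.Unicode, Encoding.Unicode) == "привет"); // True

        // Reader assumes UTF-8 while the writer produced UTF-16.
        Console.WriteLine(Transcode("привет", Encoding.Unicode, Encoding.UTF8) == "привет");    // False

        // For ASCII text every second character decodes as NUL; shown here as '.'.
        Console.WriteLine(Transcode("hello", Encoding.Unicode, Encoding.UTF8).Replace('\0', '.')); // h.e.l.l.o.
    }
}
```

The interleaved NULs are a likely explanation for the double-spaced "h e l l o" above, assuming the terminal renders each NUL as a blank.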

@JustArchi
Contributor Author

JustArchi commented Jun 2, 2019

@stephentoub I thought of 3 approaches to this issue (all 3 apply when !Console.IsOutputRedirected):

  • Determine if it's possible to change terminal encoding, if yes, we should use it (I guess it's not viable based on the note and the fact that this is not the case yet)
  • Determine that it's not possible to change terminal encoding and decide to make a real no-op operation that won't affect anything.
  • Determine that it's not possible to change terminal encoding and decide to throw an exception to signal that to the consumer.

The only remaining question is whether we can safely assume that the terminal encoding can't be changed (because we have the Unix spec and we're sure that we're dealing with a terminal). If that's the case, it should be addressed through one of the last two approaches above, as opposed to the existing logic, which leaves the console encoding in a state that no consumer would want.

My current idea involves not doing anything when calling Console.OutputEncoding setter if we're on unix and !Console.IsOutputRedirected. The alternative is throwing an exception instead. Both are possible only if we can reliably tell a terminal apart from file redirection, which I'm not sure about and I believe you know better.

The objective is to somehow improve on the current result, which leaves the console encoding in a state that no consumer would want, without manually adding an if for the Windows platform.

Thanks!

@stephentoub
Member

> My current idea involves not doing anything when calling Console.OutputEncoding setter if we're on unix and !Console.IsOutputRedirected

Someone can manually change the encoding of their terminal, in which case they'd want to use Console.OutputEncoding to match. If we start ignoring that request or throwing, it breaks that use case.

@JustArchi
Contributor Author

That's true, but when you can't change the encoding of an already-set terminal, the runtime should have logic for detecting and applying to the encoding that was already set (whether it's UTF-8 or anything else), and since you can't change OutputEncoding without breaking that terminal's output, the setter could be a no-op.

@JustArchi
Contributor Author

To the best of my knowledge you can't change the encoding of already established terminal output on Unix (as opposed to Windows), which is why the logical solution to me is for the runtime to detect that encoding (and adapt to it), while making all future calls to the output encoding setter a no-op.

@stephentoub
Member

> should have logic for detecting and applying to the encoding that was already set

How?

@JustArchi
Contributor Author

That's a good question, I don't know, maybe you have some idea 😅.

@JustArchi
Contributor Author

The majority of Unix applications seem to depend on LANG, LC_CTYPE and LC_ALL for that, although it's true that it's not really stating the encoding per-se, but what the terminal should be using.
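For reference, a rough sketch of reading that hint from the locale variables. CodesetFromLocale is a hypothetical helper, not something the runtime exposes, and the precedence used (LC_ALL, then LC_CTYPE, then LANG) is the usual POSIX order. The result only says what the environment claims the terminal should be using, not what it actually decodes with.

```csharp
using System;

static class LocaleHint
{
    // Returns the codeset part (e.g. "UTF-8") of the first non-empty value among
    // LC_ALL, LC_CTYPE and LANG, or null when that value names no codeset.
    public static string CodesetFromLocale(string lcAll, string lcCtype, string lang)
    {
        foreach (string value in new[] { lcAll, lcCtype, lang })
        {
            if (string.IsNullOrEmpty(value))
                continue; // unset or empty: fall through to the next variable

            int dot = value.IndexOf('.');
            if (dot < 0)
                return null; // e.g. "C" or "POSIX": no codeset stated

            string codeset = value.Substring(dot + 1);
            int at = codeset.IndexOf('@'); // strip a modifier such as "@euro"
            return at >= 0 ? codeset.Substring(0, at) : codeset;
        }
        return null;
    }

    static void Main()
    {
        string hint = CodesetFromLocale(
            Environment.GetEnvironmentVariable("LC_ALL"),
            Environment.GetEnvironmentVariable("LC_CTYPE"),
            Environment.GetEnvironmentVariable("LANG"));
        Console.WriteLine(hint ?? "(no codeset stated)");
    }
}
```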

@stephentoub
Member

> maybe you have some idea

I'm not aware of any good way to reliably determine the encoding the terminal is using (if anyone knows of one, please share). And without that, I don't think this is actionable.

@stephentoub
Member

> although it's true that it's not really stating the encoding per-se

Right

@JustArchi
Contributor Author

Which is why I'm not really suggesting any particular solution as I don't feel comfortable enough doing so, I'm just brainstorming potential approaches to the problem in order to determine whether there is anything we can do to improve in regard to this issue.

If you feel like there is nothing we can do to improve this use case then I fully understand that and the issue can be closed, I just thought that perhaps there is some possible improvement here in regards to avoiding breaking the encoding for unaware customers.

@svick
Contributor

svick commented Jun 2, 2019

@JustArchi

What is the use case? Why are you setting Console.OutputEncoding in the first place?

Personally, I think it's really confusing that Encoding.Unicode actually means UTF-16, but I'm not sure there's anything that can be done about that.

@JustArchi
Contributor Author

@svick I'm changing the encoding on Windows to have consistent display of more obscure characters for my users. This includes making Cyrillic characters display properly on systems with a non-Cyrillic OS language (instead of a bunch of ????).

Accidentally, this line caused a regression on Linux/macOS setups, since there it changed the output encoding without changing the terminal encoding, so I was forced to wrap the line above in a conditional if (windows) in order to avoid breaking Linux/macOS.

The idea was that the runtime could handle it in a smart way on Linux/macOS instead of changing the output encoding without changing the terminal's, but I guess there is no good way to go about this. The end objective was to have a single call that works regardless of OS, instead of me sticking to the current if (windows) (which is not really wrong, but could be done better, and I'd prefer to rely on .NET Core to handle things such as determining whether the underlying OS can do what I ask the CLR to do).

@svick
Contributor

svick commented Jun 2, 2019

@JustArchi I think a good approach to do that is to use Console.OutputEncoding = Encoding.UTF8; everywhere. On Windows, setting OutputEncoding notifies the OS, which means both UTF-16 and UTF-8 should work:

https://github.com/dotnet/corefx/blob/aa0c037c1f64c91f73698d0607dea16904d08da8/src/System.Console/src/System/ConsolePal.Windows.cs#L113-L120

On Unix, that doesn't work, so you should stick with the default UTF-8:

https://github.com/dotnet/corefx/blob/aa0c037c1f64c91f73698d0607dea16904d08da8/src/System.Console/src/System/ConsolePal.Unix.cs#L746-L750

@JustArchi
Contributor Author

JustArchi commented Jun 2, 2019

@svick This is actually what I've decided to go with, but I still have the if for Windows in place, because if by any chance somebody is running a terminal in non-UTF-8 then I can't do anything about that anyway, so I rely on what the runtime can detect in this case and I don't intend to change it at all. If a Unix terminal were running in something like ASCII, then forcing UTF-8 output could still screw it up, because characters that do not fit in a single byte would be interpreted as several wrong characters instead of just the ASCII fallback of ?. So the best solution for me right now is to use UTF-8 on Windows and never attempt to change the output encoding at all on non-Windows boxes.
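The approach described above could look roughly like the sketch below. ConsoleSetup and UseUtf8OnWindowsOnly are illustrative names rather than the actual code from the app; the only runtime APIs used are RuntimeInformation.IsOSPlatform and the Console.OutputEncoding setter.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class ConsoleSetup
{
    // Switches the console to UTF-8 on Windows only; elsewhere the runtime's
    // default output encoding is left untouched. Returns true if it was changed.
    public static bool UseUtf8OnWindowsOnly()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            return false; // Linux/macOS: do not touch OutputEncoding

        Console.OutputEncoding = Encoding.UTF8;
        return true;
    }

    static void Main()
    {
        UseUtf8OnWindowsOnly();
        Console.WriteLine("привет");
    }
}
```

Call it once at startup, before the first console write, as in the original report.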

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the Future milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Jul 6, 2020
@dotnet-policy-service dotnet-policy-service bot added backlog-cleanup-candidate An inactive issue that has been marked for automated closure. no-recent-activity labels Jan 6, 2025
@JustArchi
Contributor Author

It's mentioned in #52374 so I'm not sure if we need a standalone issue, but the problem of course still applies and is not resolved at the time of posting.

@dotnet-policy-service dotnet-policy-service bot removed no-recent-activity backlog-cleanup-candidate An inactive issue that has been marked for automated closure. labels Jan 6, 2025