Output Windows Encoding Problem #67

oTnTh opened this issue Dec 26, 2021 · 16 comments


oTnTh commented Dec 26, 2021

SciTEUser.properties:

code.page=65001
output.code.page=936

t.au3:

$s = 'test中文测试test'
ConsoleWrite($s & @CRLF)

[screenshots: 001, 002]

I think it's an encoding problem. AutoIt-VSCode should convert the encoding of the output strings before writing them to the Output panel of VSCode.

Thanks.


vanowm commented Sep 27, 2022

Unable to reproduce (I copied the code from here, so it may have been saved as Unicode).

AutoIt-VSCode v1.0.9
Visual Studio Code v1.71.2

As far as I can tell AutoIt-VSCode doesn't use SciTE configurations.


oTnTh commented Oct 4, 2022

The default code page depends on the Windows language settings; for Chinese it's cp936.

So I have to put these lines in my SciTEUser.properties to get correct output in the Output panel of SciTE.

code.page=65001
output.code.page=936

VSCode doesn't have anything similar, and that causes my problem.

Please take a look at this script:

; Convert an AutoIt (UTF-16) string to the given code page via WideCharToMultiByte
Func _StringToCodepage($sStr, $iCodepage)
    ; First call: query the required buffer size in bytes
    Local $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, _
            "int", StringLen($sStr), "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    Local $tCP = DllStructCreate("char[" & $aResult[0] & "]")
    ; Second call: perform the conversion into the buffer
    $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, _
            "int", StringLen($sStr), "struct*", $tCP, "int", $aResult[0], "ptr", 0, "ptr", 0)
    Return DllStructGetData($tCP, 1)
EndFunc   ;==>_StringToCodepage

$cp = DllCall("kernel32.dll", "int", "GetACP")
ConsoleWrite("Default Codepage: " & $cp[0] & @CRLF)
ConsoleWrite('----------------' & @CRLF)

; Unicode: U+4E2D U+6587
$strA = "中文"
ConsoleWrite("$strA: " & $strA & @CRLF)
ConsoleWrite(String(StringToBinary($strA)) & @CRLF)
ConsoleWrite('----------------' & @CRLF)

$strB = _StringToCodepage($strA, 65001)
ConsoleWrite("$strB: " & $strB & @CRLF)
ConsoleWrite(String(StringToBinary($strB)) & @CRLF)
ConsoleWrite('----------------' & @CRLF)

[screenshot 204711]

In SciTE, with output.code.page=936, everything worked as expected.

[screenshot 205251]

VSCode assumes the output is encoded as UTF-8, which it is not.
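The mismatch described above can be reproduced outside AutoIt. The following is a minimal sketch, assuming a Node.js build with full ICU so that the built-in `TextDecoder` accepts the `gbk` label. The byte values are the GBK and UTF-8 encodings of "中文"; the variable names are illustrative only.

```javascript
// Bytes a program would emit for "中文" when the ANSI code page is 936 (GBK).
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);
// Bytes of the same text after an explicit conversion to UTF-8 (code page 65001).
const utf8Bytes = Buffer.from([0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87]);

// TextDecoder is a Node.js global; the 'gbk' label needs full ICU (assumption).
const asUtf8 = new TextDecoder('utf-8');
const asGbk = new TextDecoder('gbk');

// GBK bytes are not valid UTF-8, so replacement characters (U+FFFD) appear.
const gbkReadAsUtf8 = asUtf8.decode(gbkBytes);
// Decoding with the matching code page recovers the text.
const gbkReadAsGbk = asGbk.decode(gbkBytes);
// UTF-8 bytes read as UTF-8 are fine...
const utf8ReadAsUtf8 = asUtf8.decode(utf8Bytes);
// ...but the same bytes read as GBK turn into unrelated characters.
const utf8ReadAsGbk = asGbk.decode(utf8Bytes);

console.log({ gbkReadAsUtf8, gbkReadAsGbk, utf8ReadAsUtf8, utf8ReadAsGbk });
```

This mirrors the two screenshots: a reader that assumes code page 936 shows $strA correctly and $strB garbled, while a reader that assumes UTF-8 does the opposite.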


vanowm commented Oct 4, 2022

VSCode doesn't seem to have cp936:
[screenshot]
Just copying and pasting the example code works fine. Can you attach a sample file?


oTnTh commented Oct 4, 2022

cp936 is GBK, a superset of GB2312.

GB18030 is a superset of GBK, but it adds four-byte sequences (characters take 1, 2, or 4 bytes), so it has a separate identifier, cp54936.

I don't know anything about the VSCode Extension API. If there's nothing like GetACP(), a setting such as autoit.outputCodePage would be good enough for me.

If the output of AutoIt is converted from autoit.outputCodePage to UTF-8 before it is written to the Output panel of VSCode, the problem should be solved.
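A rough sketch of that conversion step, not the extension's actual code: it assumes Node.js with full ICU and uses the built-in `TextDecoder` rather than any particular library. `CODE_PAGE_LABELS` and `createOutputDecoder` are hypothetical names, and the table covers only the code pages mentioned in this thread.

```javascript
// Hypothetical mapping from Windows code page numbers to encoding labels.
const CODE_PAGE_LABELS = {
  936: 'gbk',
  54936: 'gb18030',
  65001: 'utf-8',
};

// Returns a function that decodes successive output chunks from the given code page.
function createOutputDecoder(codePage) {
  const label = CODE_PAGE_LABELS[codePage];
  if (!label) {
    throw new RangeError(`Unsupported code page: ${codePage}`);
  }
  const decoder = new TextDecoder(label);
  // stream: true keeps an incomplete multi-byte character buffered
  // until the next chunk arrives, so split characters are not corrupted.
  return (chunk) => decoder.decode(chunk, { stream: true });
}

const decode = createOutputDecoder(936);
// "中文" in GBK, deliberately split in the middle of the second character.
const text = decode(Buffer.from([0xd6, 0xd0, 0xce])) + decode(Buffer.from([0xc4]));
console.log(text);
```

Decoding in streaming mode matters because process output arrives in arbitrary chunks, and a chunk boundary can fall between the two bytes of one GBK character.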


oTnTh commented Oct 4, 2022

The encoding of the script file is not relevant to this problem.

Can you show me the output of my script in VSCode, please?

t.au3.txt


vanowm commented Oct 5, 2022

[screenshot]

I guess we are out of luck on this one. Almost 7 years since it was requested...


oTnTh commented Oct 5, 2022

Wow, a text editor (sort of) that cannot handle text encoding. I didn't expect that.

Seems there's nothing we can do now.

Thanks for your time.


vanowm commented Oct 6, 2022

Well, technically, if you can see text of your code properly - it handles encoding properly...it's the output of another application that it's having issues with...


oTnTh commented Oct 6, 2022

Even now (Win11 22H2), PowerShell and CMD use ANSI (i.e. cp936 for Chinese) as the default code page.

If I compile my script as a CUI EXE, here is the output:

[screenshot 235240]

Same as the output in SciTE.

ConsoleWrite is intended to write to STDOUT, and the default code page of STDOUT is ANSI.
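From the reading side, STDOUT is only a stream of bytes, and the reader has to choose how to decode it. The sketch below illustrates this with a stand-in child process (a second Node.js instance, not a real AutoIt program) that writes the GBK bytes of "中文". It assumes a CommonJS script and a Node.js build with full ICU.

```javascript
const { spawnSync } = require('node:child_process');

// Stand-in for a compiled CUI program: a child process that writes the raw
// GBK bytes of "中文" to STDOUT (hypothetical, for illustration only).
const childScript = 'process.stdout.write(Buffer.from([0xd6, 0xd0, 0xce, 0xc4]))';
const result = spawnSync(process.execPath, ['-e', childScript]);

// Without an `encoding` option, result.stdout is a Buffer of untouched bytes.
const rawBytes = result.stdout;
// The reader decides the interpretation; here the bytes are known to be GBK.
const decoded = new TextDecoder('gbk').decode(rawBytes);
console.log(rawBytes, decoded);
```

Keeping the output as a Buffer until the code page is known avoids an early, lossy UTF-8 interpretation.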

As a user, I would love to have a solution, but I can't say that AutoIt is wrong.

Also, I think it's not fair to you. You did a great job, but CJK users have to choose.


vanowm commented Oct 6, 2022

Maybe as a workaround you could use this for now:
https://www.autoitscript.com/forum/topic/208189--

vanowm added a commit to vanowm/AutoIt-VSCode that referenced this issue Oct 8, 2022

vanowm commented Oct 8, 2022

The proposed #123 adds a new option, Output Code Page.
In this particular case I had to set it to gbk to get the proper result:
[screenshot]

With cp936 I get a different $strB result:
[screenshot]


oTnTh commented Oct 8, 2022

Wow, thank you for continuing to work on this.

strB is not a valid GBK string, so when we try to convert strB from GBK to UTF-8, the result is meaningless.

I think you can ignore the difference in AutoIt-VSCode.


However, there are some differences between GBK and CP936.

You could think of GBK as ECMAScript 7 (the standard) and CP936 as Chrome V8 (one implementation of it).

If a code point is undefined in the standard, the author of a character map can decide how to handle the conversion.

Take a look at this:

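// npm package "encoding"; the signature is convert(input, toCharset, fromCharset)
const encoding = require('encoding');

// UTF-8 bytes of "中文"
const buf = Buffer.from([0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87]);
// Misread the UTF-8 bytes as GBK / CP936 and convert them to UTF-8
const resultB1 = encoding.convert(buf, 'utf-8', 'gbk');
const resultB2 = encoding.convert(buf, 'utf-8', 'cp936');
console.log(resultB1);
console.log(resultB2);
console.log('-----------------------------------');

// Convert the results back from UTF-8 to GBK / CP936
const resultC1 = encoding.convert(resultB1, 'gbk', 'utf-8');
const resultC2 = encoding.convert(resultB2, 'cp936', 'utf-8');
console.log(resultC1);
console.log(resultC2);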

Output:

<Buffer e6 b6 93 ee 85 9f e6 9e 83>
<Buffer e6 b6 93 ef bf bd e9 8f 82 ef bf bd>
-----------------------------------
<Buffer e4 b8 ad e6 96 87>
<Buffer e4 b8 3f e6 96 3f>

Even though strB is not a valid GBK string, with the gbk argument we didn't lose any data after the two conversions, while with cp936 two bytes were replaced by 0x3f ("?").

I'm not sure, but I guess that's why the GBK charmap of iconv-lite is not compatible with CP936.


vanowm commented Oct 8, 2022

It's all Chinese to me (pun intended)

Maybe it would be more suitable to report it to iconv-lite.
If the PR goes forward, it will use the iconv-lite library instead of encoding.


oTnTh commented Oct 8, 2022

It's not a bug in iconv-lite. Chinese users recognize the differences between GBK and CP936; they have to.

Text encodings are a real pain in the ass, really. Problems can pop up anywhere.

But native English speakers don't have to deal with them, so it's hard to explain. Like you said, it's all Chinese.

So I'm very grateful to you, I really am.


vanowm commented Oct 8, 2022

So, the question is: does SciTE have the same issue? (In your screenshot it looks exactly like VSCode after conversion.)
Or should I suspend the PR until we find a 100% working solution?


oTnTh commented Oct 9, 2022

Generally, when we see garbled text, it just means "Something is wrong here".

As long as iconv-lite handles normal text correctly, I think we can ignore the details.
