Encoding detection from XML header #148238

ams-tschoening · 2022-04-27T07:37:40Z

XML files very likely declare their character encoding, which VS Code does not take into account properly, unlike other editors. In the past there was only very limited automatic encoding detection, but things have changed in the meantime, and VS Code now tries to detect the encoding of text files, e.g. when users select a different encoding. From my experience, though, the suggestions it makes for XML files are very likely wrong as well, while detection could be pretty safe by looking at the file itself.

The following is an example of an XML file encoded using windows-1252 and properly declaring that. VS Code opens the file using UTF-8 by default, most likely because no workspace config is available, and suggests windows-1255 as its encoding.

A similar issue was closed in the past because encoding detection was lacking and it was unlikely to be implemented. Things have changed now, but because that issue has been closed, one can't vote on it anymore to get it reconsidered for implementation.

Handling XML files properly is especially important because VS Code doesn't support the charset property of EditorConfig yet, which would otherwise allow setting different encodings for different file names. Additionally, VS Code's support for file-extension-specific settings is limited as well, which prevents assigning different, but known, encodings to multiple XML file names or even paths.

I have, for example, a lot of legacy projects using XML files for configs, and depending on the age of those projects, many of them use different encodings. Additionally, some of the software doesn't even use a somewhat recent XML parser, or any parser at all, and is therefore limited to really only supporting windows-1252. It would be great if VS Code were better at handling those XML files.

Thanks!


bpasero · 2022-04-27T09:55:56Z

Prior: #36230

ams-tschoening · 2022-05-04T07:18:25Z

Hi people!

Pinging you because you were interested in better encoding detection for XML files in the past in multiple different issues, sadly many of them closed in the meantime without any change implemented at all. It would be great if you could have a look at this additional attempt, vote for it and spread the word. I think this is still important and a problem for lots of users, especially professional ones, who need to deal with legacy and mixed encodings of XML files pretty often.

Thanks!

@aadsm
@AdamL67
@alrekr
@boozook
@fortinmike
@IgnakhinSS
@LinAGKar
@LinoBarreca
@luigli
@mortenseifert
@odinmillion
@rubenprins
@ujay68

LinoBarreca · 2022-05-04T08:02:00Z

They aren't able to implement it, it seems.
That's how it will go: first they will turn it into a feature request (oh wait... you already got to that point), then they will mark it as a duplicate or close it for some other reason, like they did with mine. The management is just a joke.

lextm · 2022-05-17T07:23:57Z

@LinoBarreca VSCode has more than 14 million active users (Feb 2021), so if what you reported/suggested is important, it shouldn't be that hard to get enough votes and draw the necessary attention.

Many feature requests like this one, no matter how critical their reporters claim them to be, only matter to a handful of users, so they are bound to make way for others. That is not a case of "aren't able" or "just a joke".

Why should Microsoft make windows-1252 encoding support for XML perfect and delay other features that I want? I haven't touched a file in that encoding for decades.

sharkkor · 2022-05-17T07:51:13Z

@lextm This is not about a specific encoding, but about how to determine it from a document's header. A bunch of other simple editors and IDEs can do this, so I don't think this request is strange.

P.S. By the way, VS can also do it. The team definitely has everything needed to implement this, and the request does not look difficult. VS Code now implements a more complex mechanism for determining the encoding from the content (it doesn't work very well), but there is no simple detection of the encoding from the header.

ams-tschoening · 2022-05-17T09:49:27Z

@sharkkor Feel free to vote, I didn't see you yet. ;-)

ujay68 · 2022-05-17T18:11:04Z

Except, in this case and regarding XML specifically, it's not a lacking feature, but a bug. As has been commented here and in other related issues for years, the XML specification makes encoding detection (and using UTF-8 in the absence of an explicit encoding declaration) mandatory. A system that claims to process XML content and does not follow XML's encoding rules is simply not conforming to the XML specification.

lextm · 2022-05-17T18:27:10Z

@ujay68 I hope you have supporting material for your claim that "the XML specification makes encoding detection (and using UTF-8 in the absence of an explicit encoding declaration) mandatory".

What I can see from 4.3.3 https://www.w3.org/TR/xml/#charencoding is,

"All XML processors must be able to read entities in both the UTF-8 and UTF-16 encodings."
"Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them."

If something is described as "may be desired", it can be absent and not mandatory.

ujay68 · 2022-05-17T22:19:27Z

Read on:

§ 4.3.3: […] In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. […]

§ F.1 (formally "Non-Normative", but I know of no XML processing software other than Code that ignores this), describes the mechanism for auto-detecting encodings, ending with:

[…] the XML encoding declaration will not work if any software changes the entity's character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the entity.

That is exactly what happens when you open an XML file in Code and forget to switch Code to that encoding manually: You get a mix of encodings within the file and the file is broken.
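To make that failure mode concrete, here is a minimal sketch of the round trip, assuming a hypothetical one-line sample document and an editor that decodes as UTF-8 with replacement characters and saves as UTF-8 again. The sample content and the function name are made up for illustration and are not taken from VS Code.

```python
DECLARED = "windows-1252"  # encoding named in the sample's XML declaration


def simulate_utf8_round_trip(on_disk: bytes) -> bytes:
    """Mimic an editor that wrongly assumes UTF-8 when opening and saving."""
    # The windows-1252 byte for "ü" (0xFC) is not valid UTF-8, so the editor
    # can only show a replacement character (U+FFFD) in its place.
    shown = on_disk.decode("utf-8", errors="replace")
    # The user adds an element containing a non-ASCII character.
    edited = shown.replace("</name>", "</name><city>Köln</city>")
    # The file is written back as UTF-8; the declaration is left untouched.
    return edited.encode("utf-8")


# Hypothetical sample document, stored on disk as windows-1252.
original = '<?xml version="1.0" encoding="windows-1252"?><name>Müller</name>'
on_disk = original.encode(DECLARED)

saved = simulate_utf8_round_trip(on_disk)

# A parser that trusts the declaration decodes the saved bytes as windows-1252.
seen_by_parser = saved.decode(DECLARED)
```

In this sketch the parser ends up seeing `Mï¿½ller` and `KÃ¶ln`: the original character is lost, and the newly typed one is stored in an encoding the declaration does not name.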

lextm · 2022-05-18T01:28:31Z

@ujay68 What you quoted from 4.3.3 is irrelevant as far as I can see. "it is a fatal error for an entity" talks about what the entity (XML element in the document) should be, not what VS Code (an XML processor) should do. (Other "fatal error" references are similar).

As for section F, given its "Non-Normative" status, I don't think much can be expected, except the kindness of someone to implement it. In short, this is an algorithm specific to XML documents.

More interestingly, by checking the actual VS Code source code, I think I learned more about why XML-specific encoding guessing is not in place:

VS Code XML extension

It has no guessing logic embedded, so when you open special XML files the actual encoding picked up is determined by the VS Code editor infrastructure,

https://github.com/microsoft/vscode/tree/1.67.1/extensions/xml

VS Code editor infrastructure

The core editor itself seems to use a different encoding guessing approach (as @sharkkor pointed out),

https://github.com/microsoft/vscode/blob/1.67.1/src/vs/workbench/services/textfile/common/encoding.ts

And no doubt it won't work very well with XML.

Possible solution

To permanently resolve this issue, someone will have to enhance the default XML extension, or write another XML extension to hijack the initial file loading process, where the algorithm from section F can be implemented and used. That's not an easy task. Just my five cents.
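For illustration only, the core of such header-based detection could look roughly like the following Python sketch. It is a simplified reading of the section F idea, not VS Code code: `sniff_xml_encoding` is a hypothetical name, and the sketch only covers byte order marks and declarations written in an ASCII-compatible encoding, leaving out the BOM-less UTF-16/UTF-32 and EBCDIC cases that section F also describes.

```python
import codecs
import re

# Byte order marks, with UTF-32 checked before UTF-16 because the UTF-32 LE
# BOM starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches an XML declaration at the very start of the data and captures the
# value of its encoding pseudo-attribute.
_DECLARATION = re.compile(
    rb'<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding of an XML file from its first bytes."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    match = _DECLARATION.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a BOM nor an encoding declaration: the XML spec requires UTF-8.
    return "utf-8"
```

A caller would pass the first kilobyte or so of the file, for example `sniff_xml_encoding(data[:1024])`, and use the result to decode the rest. The hard part pointed out above, getting the editor to consult such a function while loading a file, is not addressed by this sketch.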

Either someone volunteers to conquer the challenges, or you get enough votes to allocate some resources from Microsoft.

Good luck.

ams-tschoening · 2022-05-18T06:27:25Z

@ujay68 What you quoted from 4.3.3 is irrelevant as far as I can see. "it is a fatal error for an entity" talks about what the entity (XML element in the document) should be, not what VS Code (an XML processor) should do.

That is a wrong interpretation. The important part is not "for an entity", but "[...] entity including an encoding declaration". This refers to the whole XML document and not to individual elements within it, and it is that document which VS Code reads and forwards to some XML processing, which is exactly what the paragraph talks about. You can't mix different encodings at element level within one and the same document, and you can't (easily) override the encoding of an individual element relative to the whole document using external metadata on the transport channel like HTTP or MIME.

So you are trying to overcomplicate things by picking individual words and putting them into the wrong context. But it's pretty easy actually. Though, this whole discussion about individual phrases doesn't matter too much anyway, VSCode is simply lacking at least an important feature lots of other tools provide. And no, it's no argument to tell people to simply use other tools then, people know that... :-)

VSCodeTriageBot · 2022-05-19T03:00:37Z

🙂 This feature request received a sufficient number of community upvotes and we moved it to our backlog. To learn more about how we handle feature requests, please see our documentation.

Happy Coding!

VSCodeTriageBot · 2022-12-06T13:32:18Z

We closed this issue because we don't plan to address it in the foreseeable future. If you disagree and feel that this issue is crucial: we are happy to listen and to reconsider.

If you wonder what we are up to, please see our roadmap and issue reporting guidelines.

Thanks for your understanding, and happy coding!

vscode-triage-bot assigned deepak1556 Apr 27, 2022

deepak1556 assigned bpasero and unassigned deepak1556 Apr 27, 2022

deepak1556 removed the triage-needed label Apr 27, 2022

bpasero added the file-encoding and feature-request labels Apr 27, 2022

bpasero removed their assignment Apr 27, 2022

VSCodeTriageBot modified the milestones: Backlog Candidates, Backlog May 19, 2022

bpasero added the *out-of-scope label Dec 6, 2022

VSCodeTriageBot closed this as not planned Dec 6, 2022
