Encoding detection from XML header #148238
Comments
Prior: #36230 |
Hi people! Pinging you because you were interested in better encoding detection for XML files in the past in multiple different issues; sadly, many of them were closed in the meantime without any implemented change at all. It would be great if you could have a look at this additional attempt, vote for it and spread the word. I think this is still important and a problem for lots of users, especially professional ones, who need to deal with legacy and mixed encodings of XML files pretty often. Thanks! @aadsm |
They aren't able to implement it, it seems. |
@LinoBarreca VSCode has more than 14 million active users (Feb 2021), so if what you reported/suggested is important, it shouldn't be that hard to get enough votes and draw the necessary attention. Many feature requests like this one, no matter how critical the reporters claim they are, only matter to a handful of users, so they are bound to make way for others. That is neither "aren't able" nor "just a joke". Why should Microsoft make windows-1252 encoding support for XML perfect and delay other features that I want? I haven't touched a file in that encoding for decades. |
@lextm this is not about a specific encoding, but about how to determine it from a document's header. A bunch of other simple editors and IDEs can do this, so I don't think this request is strange. P.S. By the way, VS can do it too. The guys definitely have everything they need to implement this, and this request does not look difficult. VSCode now implements a more complex mechanism for determining the encoding by content (it doesn't work very well), but there is no simple detection of the encoding from the header. |
@sharkkor Feel free to vote, I didn't see you yet. ;-) |
Except, in this case and regarding XML specifically, it's not a lacking feature, but a bug. As has been commented here and in other related issues for years, the XML specification makes encoding detection (and using UTF-8 in the absence of an explicit encoding declaration) mandatory. A system that claims to process XML content and does not follow XML's encoding rules is simply not conforming to the XML specification. |
@ujay68 I hope you have supporting materials for your claim of "the XML specification makes encoding detection (and using UTF-8 in the absence of an explicit encoding declaration) mandatory". What I can see from 4.3.3 https://www.w3.org/TR/xml/#charencoding is,
If something is described as "may be desired", it can be absent and not mandatory. |
Read on: § 4.3.3: […] In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. […] § F.1 (formally "Non-Normative", but I know of no XML processing software other than Code that ignores this), describes the mechanism for auto-detecting encodings, ending with: […] the XML encoding declaration will not work if any software changes the entity's character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the entity. That is exactly what happens when you open an XML file in Code and forget to switch Code to that encoding manually: You get a mix of encodings within the file and the file is broken. |
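As a rough illustration of the kind of header-based detection discussed here, the following is a minimal Python sketch loosely modelled on Appendix F.1 of the XML specification. It is not how VS Code or any particular XML processor works; the function name `detect_xml_encoding` and the 1024-byte look-ahead are assumptions made for this example, and EBCDIC as well as the unusual UCS-4 byte orders from F.1 are left out.

```python
import codecs
import re

# Byte signatures, checked in order. UTF-32 comes before UTF-16 so that the
# UTF-32 LE BOM (FF FE 00 00) is not mistaken for the UTF-16 LE BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    # No BOM: the first character "<" already reveals the code unit width.
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the text, e.g. <?xml version="1.0" encoding="windows-1252"?>
_DECLARATION = re.compile(
    r"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def detect_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes (hypothetical helper)."""
    for prefix, name in _SIGNATURES:
        if data.startswith(prefix):
            return name
    # ASCII-compatible family: the declaration itself is plain ASCII, so it
    # can be read before the real encoding is known.
    head = data[:1024].decode("ascii", errors="replace")
    match = _DECLARATION.match(head)
    # Without a BOM or an encoding declaration, the entity must be UTF-8.
    return match.group(1).lower() if match else "utf-8"
```

For a file that starts with `<?xml version="1.0" encoding="windows-1252"?>`, this sketch returns the declared name, which Python's codec lookup accepts as an alias of `cp1252`.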
@ujay68 What you quoted from 4.3.3 is irrelevant as far as I can see. "it is a fatal error for an entity" talks about what the entity (XML element in the document) should be, not what VS Code (an XML processor) should do. (Other "fatal error" references are similar.) As for section F, given its "Non-Normative" status, I don't think much can be expected, except the kindness of someone to implement it. In short, this is an algorithm specific to XML documents. More interestingly, by checking the actual VS Code source code, I think I learned more about why XML-specific encoding guessing is not in place.
VS Code XML extension
It has no guessing logic embedded, so when you open special XML files the actual encoding picked up is determined by the VS Code editor infrastructure, https://github.com/microsoft/vscode/tree/1.67.1/extensions/xml
VS Code editor infrastructure
The core editor itself seems to use a different encoding guessing approach (as @sharkkor pointed out), and no doubt it won't work very well with XML.
Possible solution
To permanently resolve this issue, someone will have to enhance the default XML extension, or write another XML extension to hijack the initial file loading process, where the algorithm in section F can be implemented and utilized. That's not an easy task.
Just my five cents. Either someone volunteers to conquer the challenges, or you get enough votes to allocate some resources from Microsoft. Good luck. |
That is a wrong interpretation, the important part is not "for an entity", but "[...]entity including an encoding declaration". This refers to the whole XML document and not individual elements within that, which is read by VSCode and forwarded to some XML processing, which is exactly what the paragraph talks about. You can't mix different encodings on element level within one and the same document and you can't (easily) override individual element encoding compared to the whole document using external metadata on the transport channel like HTTP or MIME. So you are trying to overcomplicate things by picking individual words and putting them into the wrong context. But it's pretty easy actually. Though, this whole discussion about individual phrases doesn't matter too much anyway, VSCode is simply lacking at least an important feature lots of other tools provide. And no, it's no argument to tell people to simply use other tools then, people know that... :-) |
🙂 This feature request received a sufficient number of community upvotes and we moved it to our backlog. To learn more about how we handle feature requests, please see our documentation. Happy Coding! |
We closed this issue because we don't plan to address it in the foreseeable future. If you disagree and feel that this issue is crucial: we are happy to listen and to reconsider. If you wonder what we are up to, please see our roadmap and issue reporting guidelines. Thanks for your understanding, and happy coding! |
XML files very likely declare the character encoding they use, which VS Code does not properly take into account, unlike other editors. While in the past there was very limited automatic encoding detection, things have changed in the meantime and VS Code tries to detect the encoding of text files when users e.g. select a different encoding. In my experience, though, the suggestions made for XML files are very likely wrong as well, while things could be pretty safe when looking at the file itself.
The following is an example of some XML file encoded using windows-1252 and properly declaring that. VScode opens the file using UTF-8 by default, most likely because no workspace config is available, and suggests it being windows-1255.
A similar issue has been closed in the past because of a lack of encoding detection and it was unlikely to be implemented. Things have changed now, but because the bug has been closed, one can't vote on it anymore or alike to get reconsidered for implementation.
Handling XML files properly is especially important because VSCode doesn't support charset of EditorConfig yet, which would otherwise allow setting different encodings for different file names. Additionally, VSCode's support for file-extension-specific settings is limited as well, which prevents assigning different, but known, encodings to multiple XML file names or even paths.
I have, for example, a lot of legacy projects using XML files for configs, and depending on the age of those projects, many of them use different encodings. Additionally, some of the software doesn't even use a somewhat recent XML parser, or a parser at all, and is therefore limited to really only supporting windows-1252. It would be great if VSCode were better at handling those XML files.
Thanks!
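To make the failure mode concrete, here is a small Python sketch of what can happen to a windows-1252 file that is read as UTF-8 and written back. The sample content and the assumption that undecodable bytes are turned into U+FFFD replacement characters before saving are illustrative only; the exact behaviour of any given editor may differ.

```python
# A windows-1252 encoded document that correctly declares its encoding.
original = '<?xml version="1.0" encoding="windows-1252"?><name>Müller</name>'
raw = original.encode("cp1252")

# Read with the wrong encoding: the single byte 0xFC ("ü" in windows-1252)
# is not valid UTF-8, so it is replaced by U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Write back as UTF-8: the replacement character is now stored in the file,
# while the declaration still claims windows-1252.
resaved = as_utf8.encode("utf-8")

# A legacy consumer that trusts the declaration no longer sees "Müller".
seen_by_legacy_tool = resaved.decode("cp1252")

# Reading the untouched bytes with the declared encoding works as expected.
correct = raw.decode("cp1252")
```

Here `correct` still contains the original name, whereas `seen_by_legacy_tool` does not, because the original byte was lost during the UTF-8 round trip.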