--gnu-emacs yes mangles UTF-8 filenames #468
Comments
@jidanni thanks for the issue. This is for sure a bug! If I embed that UTF-8 filename, which Google tells me means "Taiwan TG butterfly garden" - nice name, if the translation is right - tidy correctly outputs the same UTF-8 into the body of the tidied document... but makes a real cheese of it when outputting that same UTF-8 sequence in the emacs-type warning string ;=(( And where did that "Warning: discarding invalid character code 143" come from? A test in Windows showed more of these warnings, and also showed the mangled UTF-8 filename - ignore the fact that the codepage I had in use, 1252, does not support UTF-8 display. The filename becomes just a set of 8-bit values... not pretty, but not mangled...
Must look at the output mechanism used when writing to the error file, or console if none... It is not handling valid UTF-8 sequences correctly... Will certainly look at that as time permits, but if anyone else has some clues, help would be appreciated... thanks... Regards, Geoff. |
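One way to see where code 143 could come from is to list the UTF-8 bytes of the filename. The sketch below is only an illustration, not tidy code: `is_c1_byte`, `count_c1_bytes`, `first_c1_byte` and `dump_bytes` are hypothetical helper names, and `kName` is the UTF-8 encoding of 台灣TG蝶園 written out by hand.

```c
#include <stddef.h>
#include <stdio.h>

/* UTF-8 bytes of the filename stem from this issue: 台灣TG蝶園 */
const unsigned char kName[] =
    "\xE5\x8F\xB0" "\xE7\x81\xA3" "TG" "\xE8\x9D\xB6" "\xE5\x9C\x92";

/* Hypothetical helper: true for bytes in the range 0x80..0x9F. */
int is_c1_byte(unsigned char b)
{
    return b >= 0x80 && b <= 0x9F;
}

/* Count how many bytes of a NUL-terminated string fall in 0x80..0x9F. */
size_t count_c1_bytes(const unsigned char *s)
{
    size_t n = 0;
    for (; *s; ++s)
        if (is_c1_byte(*s))
            ++n;
    return n;
}

/* Return the first byte in 0x80..0x9F, or 0 if there is none. */
unsigned first_c1_byte(const unsigned char *s)
{
    for (; *s; ++s)
        if (is_c1_byte(*s))
            return *s;
    return 0;
}

/* Print ASCII bytes as-is and every other byte as a 3-digit octal escape. */
void dump_bytes(const unsigned char *s)
{
    for (; *s; ++s) {
        if (*s < 0x80)
            putchar(*s);
        else
            printf("\\%03o", (unsigned)*s);
    }
    putchar('\n');
}
```

For this name, five bytes fall in the 0x80..0x9F range, and the first of them is 0x8F, which is decimal 143, the code named in the warning. That is consistent with the warning being triggered by the filename bytes rather than by the document.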
I thought the Warning: discarding invalid character code 143 stuff was
coming from the file, so I snipped the rest. As I didn't send a file,
they must instead be coming from the filename. Perhaps the warnings
should mention on each line that they are coming from the filename...
|
@jidanni ok, understand a snip of sort of repeated warnings... no problem... and it took me a while to figure out exactly where they were coming from... and that provided the clue... Have found what looks like a simple fix - changing the input encoding from ASCII to RAW:
diff --git a/src/config.c b/src/config.c
index ddb677c..d001d3d 100644
--- a/src/config.c
+++ b/src/config.c
@@ -934,7 +934,7 @@ Bool TY_(ParseConfigValue)( TidyDocImpl* doc, TidyOptionId optId, ctmbstr optval
         if (optId == TidyOutFile)
             doc->config.cfgIn = TY_(BufferInput)( doc, &inbuf, RAW );
         else
-            doc->config.cfgIn = TY_(BufferInput)( doc, &inbuf, ASCII );
+            doc->config.cfgIn = TY_(BufferInput)( doc, &inbuf, RAW );
         doc->config.c = GetC( &doc->config );
         status = option->parser( doc, option );
Obviously, if it turns out correct, then we can actually get rid of the [...]
Would appreciate if you could apply the patch, and test the [...]
You might wonder, like I did at first, why is this not [...] Well, tidy internally translates [...] But this emacs format just copies the string with [...] So if we used [...]
Well it is not only for the filename, and they sort of do, in what they do not show. If this was a [...] So they are warnings before any actual document parsing has started, so the absence of this info indicates they are from when the configuration options are read, one of which would be the [...] A possible minor enhancement of the message might be, say - [...]
That is, if no [...] In this case they will not occur if the config reads are configured with [...] Hope you, and others using filenames outside the ASCII range, get a chance to test and report... thanks... |
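The effect of switching the config input from ASCII to RAW can be illustrated with a toy model. This is not tidy's source: `decode_ascii_like` is a hypothetical stand-in which assumes that an ASCII-style reader maps bytes 0x80..0x9F through the Windows-1252 table, discards the five bytes that table leaves undefined, and that only the low 8 bits of each mapped value survive when the string is later copied out byte by byte. `decode_raw` simply passes the bytes through.

```c
#include <stddef.h>
#include <string.h>

/* Windows-1252 code points for bytes 0x80..0x9F; 0 marks an undefined slot. */
static const unsigned short kWin1252[32] = {
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
};

/* RAW-style model: every byte is passed through unchanged. */
size_t decode_raw(const unsigned char *in, size_t n, unsigned char *out)
{
    memcpy(out, in, n);
    return n;
}

/*
 * ASCII-style model (an assumption, see the text above): bytes 0x80..0x9F
 * are mapped through Windows-1252, undefined ones are discarded, and only
 * the low 8 bits of the result are written out. If first_discarded is not
 * NULL and still 0, it receives the first discarded byte value.
 */
size_t decode_ascii_like(const unsigned char *in, size_t n,
                         unsigned char *out, unsigned *first_discarded)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ++i) {
        unsigned c = in[i];
        if (c >= 0x80 && c <= 0x9F) {
            unsigned mapped = kWin1252[c - 0x80];
            if (mapped == 0) {
                if (first_discarded && *first_discarded == 0)
                    *first_discarded = c;
                continue; /* byte is dropped */
            }
            c = mapped;
        }
        out[o++] = (unsigned char)(c & 0xFF);
    }
    return o;
}
```

Fed the 14 UTF-8 bytes of 台灣TG蝶園, the ASCII-style model keeps 11 bytes, reports 143 as the first discarded value, and yields the same byte sequence as the mangled prefix in the report (\345\260\347\243TG\350\266\345S^Y), while the RAW-style model returns the name unchanged. The match suggests the model is close to what happens, but it remains a sketch under the stated assumptions.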
(I only test .debs... when they reach Debian sid.)
Anyways if you don't output Filename Problem: ... ... ...
then nobody will guess the messages are not about HTML,
as 99.999% of them usually are.
|
@jidanni, thanks for the brief reply...
Well, that is sad, because it is only after testing that this fix can reach [...] And there are not that many of us, in dev, that have filenames that are outside the ASCII range... it took some effort for me to create one, first in Linux, and then in Windows... I certainly hope others, using filenames outside the ASCII range, get a chance to test and report if it fixes the [...]
Well, as mentioned it is not only a [...] I have looked at this, and it would not be hard to do, but it involves first setting up the [...] As stated, at present that prefix generation is controlled by the code [...] I have yet to prove to myself that an [...] Will certainly be considering this, but this is a little outside providing the correct gnu-emacs prefix option... And for people who do not like to get into source patching, have pushed an [...]
Now you should have a local [...] As stated, hope others, especially using filenames outside the ASCII range, test and report it fixes the [...] |
As requested, looking forward to users testing branch [...] |
@geoffmcl I just ran into this issue with 5.2.0 on macOS 10.12.2 (with UTF-8 encoded Chinese characters in filenames), and I can confirm that the [...] |
@zmwangx thanks for testing this... Although you have been the only tester so far, I am convinced changing [...] Accordingly, have pushed a fix to [...] Although the [...] If you, or others, could pull [...] |
@zmwangx thanks for testing, so am closing this... |
Sorry I cannot help, because I am busy. |
$ tidy -i --wrap 122 --drop-proprietary-attributes yes --quote-nbsp no
-asxhtml --gnu-emacs yes -utf8
/tmp/台灣TG蝶園.html > /tmp/台灣TG蝶園.tidy
Warning: discarding invalid character code 143
/tmp/\345\260\347\243TG\350\266\345S^Y.html:22:56046: Warning: inserting implicit
/tmp/\345\260\347\243TG\350\266\345S^Y.html:22:56046: Warning: replacing unexpected button with
/tmp/\345\260\347\243TG\350\266\345S^Y.html:22:55870: Warning: missing ...