Don't treat BOM escape sequence as hidden character. #18909

Gusted · 2022-02-26T00:51:32Z

The BOM sequence is a common, harmless escape sequence; it shouldn't be shown as a hidden character.
Follows GitHub's behavior.
Resolves BOMs are treated as illegal escape sequences in at least the diff view /commit/* #18837


lunny · 2022-02-26T02:06:50Z

Could we add some test?

silverwind · 2022-02-26T07:42:03Z

Might need to add the first two (three if we care about little endian) rows of this table: https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

Gusted · 2022-02-26T09:01:15Z

> Might need to add the first two (three if we care about little endian) rows of this table: https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

Not sure it's correct to add them. We use the utf8.DecodeRune function, so I'm not sure we should include UTF-16 (those bytes would need to be decoded as UTF-16 first before we know the actual rune).

silverwind · 2022-02-26T09:11:30Z

They might decode the same, not sure. Also see:

https://github.com/sindresorhus/strip-bom/blob/b80d7bc94e79b4744d92a2dc6328c91d9afe9775/index.js#L6-L10

If you add a test, it should tell you whether that works, though :)

Gusted · 2022-02-26T09:45:26Z

> They might decode the same, not sure. Also see:

It's being shown as a broken codepoint, no need to add it.

silverwind · 2022-02-26T09:47:23Z

UTF-16 BOM is certainly the more widespread one in use, a lot of files created by Microsoft software will add it, for example CSV exports from Excel.

Gusted · 2022-02-26T09:48:23Z

> UTF-16 BOM is certainly the more widespread one in use, a lot of files created by Microsoft software will add it, for example CSV exports from Excel.

That seems like a wider issue to me: every UTF-16 codepoint would be shown as a broken codepoint, because the current code won't decode it correctly.

silverwind · 2022-02-26T09:59:48Z

> That seems like a wider issue to me: every UTF-16 codepoint would be shown as a broken codepoint, because the current code won't decode it correctly.

It's generally not an issue because stuff like Office documents is at least partially binary data, so it doesn't render as text. CSV is an exception though, but we actually have a separate renderer for it too.

Gusted · 2022-02-26T10:00:56Z

Should it be fixed within this PR or in a separate one? Adding the logic for UTF-16 sounds a bit out of scope for this PR.

silverwind · 2022-02-26T10:03:03Z

I'd like to see at least both the UTF-8 and UTF-16 BOMs identified here, if it's not that hard. The UTF-16 BOM is very commonplace. The Unicode standard does not recommend BOM usage for UTF-8, so I think it's not in much use generally.

Gusted · 2022-02-26T10:06:52Z

> I'd like to see at least both the UTF-8 and UTF-16 BOMs identified here, if it's not that hard. The UTF-16 BOM is very commonplace. The Unicode standard does not recommend BOM usage for UTF-8, so I think it's not in much use generally.

Including the UTF-16 BOM requires "fixing" the logic so that UTF-16 codepoints are not reported as broken codepoints, because that check runs before any of this logic is carried out.

Gusted · 2022-02-26T10:23:31Z

> Including the UTF-16 BOM requires "fixing" the logic so that UTF-16 codepoints are not reported as broken codepoints, because that check runs before any of this logic is carried out.

https://go.dev/play/p/dEzAtdBP3oq It seems like utf16 doesn't pick up the BOM codepoints correctly. I might be using it the wrong way.

wxiaoguang · 2022-02-26T12:38:24Z

UTF-16 and UTF-8 are totally different; their relation is like that of PNG and WebP, there is nothing in common. ~~Gitea doesn't support UTF-16 rendering either.~~ (well, it seems Gitea does support it...)

And UTF-16 has big-endian and little-endian variants.

For UTF-8 code or a UTF-8 processor, UTF-16 data (including the BOM) is totally invalid binary data.

silverwind · 2022-02-26T12:54:56Z

All I know is that CSVs exported by Excel have the UTF-16 BOM but the data is otherwise plain ASCII. Basically, the UTF-16 BOM should not trigger any warnings, and should ideally just be removed before display.
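A rough sketch of what "removed before display" could look like at the byte level, assuming the content is otherwise ASCII or UTF-8 as described. The `stripBOM` helper and the `boms` table are hypothetical, not Gitea code.

```go
package main

import (
	"bytes"
	"fmt"
)

// boms lists the leading byte sequences this sketch strips.
var boms = [][]byte{
	{0xEF, 0xBB, 0xBF}, // UTF-8
	{0xFE, 0xFF},       // UTF-16, big endian
	{0xFF, 0xFE},       // UTF-16, little endian
}

// stripBOM is a hypothetical helper: it returns data without a leading BOM,
// or data unchanged when no known BOM is present.
func stripBOM(data []byte) []byte {
	for _, bom := range boms {
		if bytes.HasPrefix(data, bom) {
			return data[len(bom):]
		}
	}
	return data
}

func main() {
	csv := []byte("\xef\xbb\xbfname,value\n")
	fmt.Printf("%q\n", stripBOM(csv))
}
```

Note that FF FE is also the start of the UTF-32 little-endian BOM (FF FE 00 00), so a byte-level check like this cannot tell those two apart on its own.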

Gusted · 2022-02-26T13:08:02Z

I think it's clear that UTF-16 is a wider issue and shouldn't be included in this PR. It should only be included if there's a clear test case showing that the BOM is shown as a hidden character (not as a broken codepoint).

wxiaoguang · 2022-02-26T13:10:32Z

> All I know is that CSVs exported by Excel have the UTF-16 BOM but the data is otherwise plain ASCII. Basically, the UTF-16 BOM should not trigger any warnings, and should ideally just be removed before display.

Are you sure your UTF-16 files can be rendered in Gitea correctly? If not, it's an unrelated problem.

silverwind · 2022-02-26T13:23:20Z

Here's an example file:

https://try.gitea.io/silverwind/symlink-test/src/branch/master/utf16bom.txt
https://github.com/silverwind/symlink-test/blob/master/utf16bom.txt

Gitea warns on "hidden" characters and, strangely enough, renders some Japanese symbols and other garbage.
GitHub does not warn and renders the Unicode replacement character.

Still, I think this sequence should not trigger a warning.
Why it even warns when the characters are not actually hidden is another topic.

silverwind · 2022-02-26T13:41:03Z

Yeah the rendering of that file is a separate issue.

Actually, IIRC, one has to put the UTF-16 BOM into UTF-8 encoded documents for Excel to correctly identify UTF-8 content (it otherwise interprets it as ASCII), so it is certainly a non-standard edge case.

We should not need to support UTF-16, but the sample file should at least not warn on that BOM. But it can be fixed in a separate issue along with the rendering to make rendering match GitHub.

Gusted · 2022-02-26T13:46:09Z

> Yeah the rendering of that file is a separate issue.

From what I can see in the code, that's the issue to solve first, before we can properly support this UTF-16 BOM. The bytes given to the EscapeControlReader function aren't the actual content of the file: []byte{239, 187, 191, 231, 145, 165, 231, 141, 180} (which, converted to a string, is the 2 Japanese(?) characters) vs:

-> % hexdump -C utf16bom.txt
00000000  fe ff 74 65 73 74 0a                              |..test.|
00000007

fe ff 74 65 73 74 0a -> 254, 255, 116, 101, 115, 116, 10
If this function doesn't get the correct input, it can't recognize the BOM correctly.
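One way to see how those nine bytes can arise from the seven bytes on disk: treat the file as big-endian UTF-16 and re-encode the result as UTF-8. The sketch below is an illustration, not the actual Gitea conversion path. It assumes the content is decoded as UTF-16BE and that the trailing odd byte is dropped, and the helper name `utf16BEToUTF8` is made up.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16BEToUTF8 is a hypothetical helper: it interprets raw as big-endian
// UTF-16 code units and returns the same text encoded as UTF-8.
// Assumption: a trailing odd byte is silently dropped.
func utf16BEToUTF8(raw []byte) []byte {
	units := make([]uint16, 0, len(raw)/2)
	for i := 0; i+1 < len(raw); i += 2 {
		units = append(units, uint16(raw[i])<<8|uint16(raw[i+1]))
	}
	return []byte(string(utf16.Decode(units)))
}

func main() {
	// The bytes from the hexdump: fe ff 74 65 73 74 0a
	raw := []byte{0xFE, 0xFF, 't', 'e', 's', 't', '\n'}
	fmt.Println(utf16BEToUTF8(raw))
	// prints: [239 187 191 231 145 165 231 141 180]
}
```

The pairs 74 65 and 73 74 become the codepoints U+7465 and U+7374, which are the two CJK characters seen in the rendered file, and FE FF becomes U+FEFF, which is EF BB BF in UTF-8.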

wxiaoguang · 2022-02-26T13:48:53Z

Hmm ..... Sorry I made a mistake.

It seems that currently Gitea does support UTF-16 BOM:

https://try.gitea.io/wxiaoguang/test/src/branch/master/test-utf-16be-bom.txt

Well, we need some more work ............

wxiaoguang · 2022-02-26T13:54:54Z

> > Yeah the rendering of that file is a separate issue.
>
> From what I can see in the code, that's the issue to solve first, before we can properly support this UTF-16 BOM. The bytes given to the EscapeControlReader function aren't the actual content of the file: []byte{239, 187, 191, 231, 145, 165, 231, 141, 180} (which, converted to a string, is the 2 Japanese(?) characters) vs:
>
> -> % hexdump -C utf16bom.txt
> 00000000  fe ff 74 65 73 74 0a                              |..test.|
> 00000007
>
> fe ff 74 65 73 74 0a -> 254, 255, 116, 101, 115, 116, 10
> If this function doesn't get the correct input, it can't recognize the BOM correctly.

fe ff 74 65 73 74 0a |..test.| is not valid UTF-16 content for test.

See my example test-utf-16be-bom.txt:

00000000: fe ff 4f 60 59 7d ff 0c 4e 16 75 4c 00 0a 00 68  ..O`Y}..N.uL...h
00000010: 00 65 00 6c 00 6c 00 6f 00 2c 00 20 00 77 00 6f  .e.l.l.o.,. .w.o
00000020: 00 72 00 6c 00 64 00 0a                          .r.l.d..

There are 00 bytes in between for the ASCII chars of hello world.

Gusted · 2022-02-26T13:56:01Z

> Hmm ..... Sorry I made a mistake.
>
> It seems that currently Gitea does support UTF-16 BOM:
>
> https://try.gitea.io/wxiaoguang/test/src/branch/master/test-utf-16be-bom.txt
>
> Well, we need some more work ............

Tested the file: the current PR handles it correctly and doesn't show any hidden characters.

wxiaoguang · 2022-02-26T13:57:14Z

Could we just check the first rune? Generally LGTM.

And add the UTF-16 file for the tests.

(I just tried to add a commit that only checks the first rune; if it's wrong, please revert ....)

Gusted · 2022-02-26T14:10:02Z

> (I just tried to add a commit that only checks the first rune; if it's wrong, please revert ....)

Yeah, I noticed that as well while trying to add it; apparently HTML text can be among the first runes 😅. Seems like Git decided it was a nice time to force push...

wxiaoguang · 2022-02-26T15:01:20Z

Sorry, I pushed into this PR by mistake (and reverted it with a force-push 😂)

My proposed solution is: Gusted#2

It passes the unit tests and only checks the first rune for a BOM.

Details behind the problems:

UTF-8/16/32 all use the same codepoint for the BOM; in fact, Gitea can read UTF-16 content, convert it to UTF-8 internally, and then render it.
The old unit tests could not handle the BOM correctly, so I added a function addPrefix for the unit tests.
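A minimal sketch of the "only check the first rune" idea, not the code from Gusted#2. It assumes the content has already been converted to UTF-8 and uses Unicode category Cf (format characters, which includes U+FEFF) as a stand-in for whatever set of hidden characters the real check uses. The name `countHidden` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// bom is the byte order mark codepoint, shared by UTF-8, UTF-16 and UTF-32.
const bom = '\uFEFF'

// countHidden is a hypothetical scanner: it counts invisible format
// characters (category Cf) in text, but skips a BOM when it is the very
// first rune. A U+FEFF anywhere else is still counted.
func countHidden(text string) int {
	n := 0
	for i, r := range text {
		if i == 0 && r == bom {
			continue // leading BOM is harmless
		}
		if unicode.Is(unicode.Cf, r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countHidden("\uFEFFhello"))   // leading BOM only
	fmt.Println(countHidden("he\uFEFFllo"))   // U+FEFF in the middle
	fmt.Println(countHidden("\uFEFF\u202Ex")) // BOM, then a right-to-left override
}
```

Restricting the exemption to position 0 keeps a U+FEFF buried in the middle of a line visible to the reader, since there it acts as a zero-width character rather than a byte order mark.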

CI says that the unit-tests PASS ~~

Don't treat BOM escape sequence as hidden character.

b562eae

- The BOM sequence is a common, harmless escape sequence; it shouldn't be shown as a hidden character.
- Follows GitHub's behavior.
- Resolves go-gitea#18837

Gusted added the type/bug label Feb 26, 2022

Merge branch 'main' into fix-bom-escape

6db2679

Gusted added this to the 1.17.0 milestone Feb 26, 2022

Gusted added the backport/v1.16 label Feb 26, 2022

Gusted pushed a commit to Gusted/gitea that referenced this pull request Feb 26, 2022

Don't treat BOM escape sequence as hidden character. (go-gitea#18909)

33f0f49

Backport go-gitea#18909

Gusted mentioned this pull request Feb 26, 2022

Don't treat BOM escape sequence as hidden character. (#18909) #18910

Merged

Gusted added the backport/done All backports for this PR have been created label Feb 26, 2022

lunny added the skip-changelog This PR is irrelevant for the (next) changelog, for example bug fixes for unreleased features. label Feb 26, 2022

GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Feb 26, 2022

Gusted added 2 commits February 26, 2022 10:46

Add test

dc2b174

Merge branch 'main' into fix-bom-escape

489ff43

Merge branch 'main' into fix-bom-escape

be1a4bb

go-gitea deleted a comment from codecov-commenter Feb 26, 2022

GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Feb 26, 2022

Add UTF-16 test case

81d2ced

Gusted force-pushed the fix-bom-escape branch from 62184d2 to 81d2ced Compare February 26, 2022 14:09

Fix newlines

cec21e6

wxiaoguang force-pushed the fix-bom-escape branch from 7d6bc60 to cec21e6 Compare February 26, 2022 14:58

wxiaoguang and others added 3 commits February 26, 2022 23:13

refactor

b5065b7

Merge pull request #2 from wxiaoguang/fix-bom-escape

0903867

refactor

fix fmt

91a5c86

wxiaoguang approved these changes Feb 26, 2022

View reviewed changes

zeripath approved these changes Feb 26, 2022

View reviewed changes

GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Feb 26, 2022

Merge branch 'main' into fix-bom-escape

4723a6c

lunny approved these changes Feb 26, 2022

View reviewed changes

zeripath merged commit bf2867d into go-gitea:main Feb 26, 2022

6543 pushed a commit that referenced this pull request Feb 26, 2022

Don't treat BOM escape sequence as hidden character. (#18909) (#18910)

4fb718d

* Don't treat BOM escape sequence as hidden character. (#18909) Backport #18909

go-gitea locked and limited conversation to collaborators Apr 28, 2022
