Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't treat BOM escape sequence as hidden character. #18909

Merged
merged 11 commits into from
Feb 26, 2022

Conversation

Gusted
Copy link
Contributor

@Gusted Gusted commented Feb 26, 2022

- BOM sequence is a common non-harmfull escape sequence, it shouldn't be
shown as hidden character.
- Follows GitHub's behavior.
- Resolves go-gitea#18837
@Gusted Gusted added this to the 1.17.0 milestone Feb 26, 2022
Gusted pushed a commit to Gusted/gitea that referenced this pull request Feb 26, 2022
@Gusted Gusted added the backport/done All backports for this PR have been created label Feb 26, 2022
@lunny lunny added the skip-changelog This PR is irrelevant for the (next) changelog, for example bug fixes for unreleased features. label Feb 26, 2022
@lunny
Copy link
Member

lunny commented Feb 26, 2022

Could we add some test?

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Feb 26, 2022
@silverwind
Copy link
Member

silverwind commented Feb 26, 2022

Might need to to the first two (three if we care about little endian) of this table: https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

Might need to to the first two (three if we care about little endian) of this table: https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

Not sure if it's correctly to add it, we uses the utf8.DecodeRune function so I'm not sure if we should include utf16(as they would need to be parsed by utf16 first before we know the actual rune)

@silverwind
Copy link
Member

silverwind commented Feb 26, 2022

They might decode the same, not sure. Also see:

https://github.com/sindresorhus/strip-bom/blob/b80d7bc94e79b4744d92a2dc6328c91d9afe9775/index.js#L6-L10

If you add a test, it should tell you thought whether that works :)

@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

They might decode the same, not sure. Also see:

image
It's being shown as a broken codepoint, no need to add it.

@silverwind
Copy link
Member

UTF-16 BOM is certainly the more widespread one in use, a lot of files created by Microsoft software will add it, for example CSV exports from Excel.

@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

UTF-16 BOM is certainly the more widespread one in use, a lot of files created by Microsoft software will add it, for example CSV exports from Excel.

Seems like a wider issue to me, every UTF-16 codepoint would be shown as broken codepoint, as they won't be decoded correctly by the current code.

@silverwind
Copy link
Member

Seems like a wider issue to me, every UTF-16 codepoint would be shown as broken codepoint, as they won't be decoded correctly by the current code.

It's generally not an issue because stuff like Office documents are at least partially binary data, so they don't render as text. CSV is an exception thought, but we actually have a separate renderer for it too.

@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

Should it be fixed within this PR or a separate? Adding the logic for utf16 sounds a bit out-of-scope for this PR.

@silverwind
Copy link
Member

silverwind commented Feb 26, 2022

I'd like to see at least both UTF-8 and UTF-16 BOM to be identified here, if it's not that hard. UTF-16 BOM is very commonplace. UTF-8 does not recommend BOM usage, so I think it's not in much use generally.

@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

I'd like to see at least both UTF-8 and UTF-16 BOM to be identified here, if it's not that hard. UTF-16 BOM is very commonplace. UTF-8 does not recommend BOM usage, so I think it's not in much use generally.

Including UTF-16 BOM requires "fixing" the logic to not error utf-16 codepoints as broken codepoints. As they are checked before any of this logic is being carried out.

@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

Including UTF-16 BOM requires "fixing" the logic to not error utf-16 codepoints as broken codepoints. As they are checked before any of this logic is being carried out.

https://go.dev/play/p/dEzAtdBP3oq Seems like utf16 doesn't pick up the BOM codepoints correctly. Might be using it the wrong way.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Feb 26, 2022

UTF-16 and UTF-8 are totally different, the relation of them is just like one is PNG and the other is WebP, there is nothing in common. Gitea doesn't support UTF-16 rendering either. (well, it seems Gitea does support...)

And UTF-16 has big-endian and little-endian variants.

For a UTF-8 code/processor, the UTF-16 data (including BOM) is totally invalid binary data.

@silverwind
Copy link
Member

silverwind commented Feb 26, 2022

All i know is that CSVs exported by Excel have the UTF-16 BOM but the data otherwise is plain ASCII. Basically, UTF-16 BOM should not trigger any warnings, and should ideally just be removed before display.

@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

I think it's clear that this UTF-16 is wider issue and shouldn't be included into this PR, unless there's a clear testcase that shows the BOM is shown as hidden character(not as broken codepoint) it should be included.

@wxiaoguang
Copy link
Contributor

All i know is that CSVs exported by Excel have the UTF-16 BOM but the data otherwise is plain ASCII. Basically, UTF-16 BOM should not trigger any warnings, and should ideally just be removed before display.

Are you sure your UTF-16 files can be rendered in Gitea correctly? If not, it's a unrelated problem.

@silverwind
Copy link
Member

silverwind commented Feb 26, 2022

Here's an example file:

https://try.gitea.io/silverwind/symlink-test/src/branch/master/utf16bom.txt
https://github.com/silverwind/symlink-test/blob/master/utf16bom.txt

gitea warns on "hidden" characters and strangely enough renders some japanese symbols and other garbage.
github does not warn and renders unicode replacement character.

Still, I think this sequence should not trigger a warning.
Why it even warns when the characters are not actually hidden is another topic.

@silverwind
Copy link
Member

silverwind commented Feb 26, 2022

Yeah the rendering of that file is a separate issue.

Actually, IIRC correctly, one has to put the UTF-16 BOM into UTF-8 encoded documents for Excel to correctly identify UTF-8 content (it otherwise interprets as ASCII), so it is certainly a non-standard edge case.

We should not need to support UTF-16, but the sample file should at least not warn on that BOM. But it can be fixed in a separate issue along with the rendering to make rendering match GitHub.

@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

Yeah the rendering of that file is a separate issue.

From what I can see in the code, that's the issue to be solved first, before we can properly support this UTF-16 BOM. As the bytes that are given to the EscapeControlReader function isn't what the actual content of the file is: []byte{239, 187, 191, 231, 145, 165, 231, 141, 180}(which converted into string is the 2 japanese(?) characters) vs:

-> % hexdump -C utf16bom.txt
00000000  fe ff 74 65 73 74 0a                              |..test.|
00000007

fe ff 74 65 73 74 0a -> 254, 255, 116, 101, 115, 116, 10
If this function doesn't get the correct input, it's not possible to correctly recognize the BOM correctly.

@wxiaoguang
Copy link
Contributor

Hmm ..... Sorry I made a mistake.

It seems that currently Gitea does support UTF-16 BOM:

https://try.gitea.io/wxiaoguang/test/src/branch/master/test-utf-16be-bom.txt

Well, we need some more work ............

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Feb 26, 2022

Yeah the rendering of that file is a separate issue.

From what I can see in the code, that's the issue to be solved first, before we can properly support this UTF-16 BOM. As the bytes that are given to the EscapeControlReader function isn't what the actual content of the file is: []byte{239, 187, 191, 231, 145, 165, 231, 141, 180}(which converted into string is the 2 japanese(?) characters) vs:

-> % hexdump -C utf16bom.txt
00000000  fe ff 74 65 73 74 0a                              |..test.|
00000007

fe ff 74 65 73 74 0a -> 254, 255, 116, 101, 115, 116, 10 If this function doesn't get the correct input, it's not possible to correctly recognize the BOM correctly.

fe ff 74 65 73 74 0a |..test.| is an incorrect UTF-16 content for test.

See my examples test-utf-16be-bom.txt:

00000000: fe ff 4f 60 59 7d ff 0c 4e 16 75 4c 00 0a 00 68  ..O`Y}..N.uL...h
00000010: 00 65 00 6c 00 6c 00 6f 00 2c 00 20 00 77 00 6f  .e.l.l.o.,. .w.o
00000020: 00 72 00 6c 00 64 00 0a                          .r.l.d..

There are 00s between for ASCII chars hello world.

@go-gitea go-gitea deleted a comment from codecov-commenter Feb 26, 2022
@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

Hmm ..... Sorry I made a mistake.

It seems that currently Gitea does support UTF-16 BOM:

https://try.gitea.io/wxiaoguang/test/src/branch/master/test-utf-16be-bom.txt

Well, we need some more work ............

Tested the file, the current PR covers this file correctly and doesn't show and hidden characters.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Feb 26, 2022

Could we just check the first Rune? Generally LGTM.

And add the UTF-16 file for the tests.

(I just tried to add a commit about only checking the first Rune, if it's wrong, please revert ....)

@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Feb 26, 2022
@Gusted
Copy link
Contributor Author

Gusted commented Feb 26, 2022

(I just tried to add a commit about only checking the first Rune, if it's wrong, please revert ....)

Yeah noticed that as well while trying to add it, apparently HTML text can be some of the first runes 😅 , Seems like Git decided it was a nice time to force push...

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Feb 26, 2022

Sorry I pushed into this PR by mistake (and reverted by a force-push 😂)

My proposed solution is: Gusted#2

It can pass the unit tests, and only check the first Rune for BOM.

Details behind the problems:

  1. UTF-8/16/32 all use the same codepoint for BOM, actually Gitea can read UTF-16 content and convert into UTF-8 internally then render it.
  2. The old unit tests can not handle BOM correctly, so I added a function addPrefix for the unit test.

Copy link
Contributor

@wxiaoguang wxiaoguang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI says that the unit-tests PASS ~~

@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Feb 26, 2022
@zeripath zeripath merged commit bf2867d into go-gitea:main Feb 26, 2022
6543 pushed a commit that referenced this pull request Feb 26, 2022
* Don't treat BOM escape sequence as hidden character. (#18909)

Backport #18909
zjjhot added a commit to zjjhot/gitea that referenced this pull request Feb 27, 2022
* giteaofficial/main:
  Fix page and missing return on unadopted repos API (go-gitea#18848)
  [skip ci] Updated licenses and gitignores
  Allow adminstrator teams members to see other teams (go-gitea#18918)
  Update nginx reverse proxy docs (go-gitea#18922)
  Don't treat BOM escape sequence as hidden character. (go-gitea#18909)
  Remove CodeMirror dependencies (go-gitea#18911)
  Uncapitalize errors (go-gitea#18915)
  Disable service worker by default (go-gitea#18914)
  Set is_empty in fixtures (go-gitea#18869)
  Don't update email for organisation (go-gitea#18905)
  Correctly link URLs to users/repos with dashes, dots or underscores (go-gitea#18890)
  Set is_private in fixtures. (go-gitea#18868)
  Fix team management UI (go-gitea#18886)
  Update JS dependencies (go-gitea#18898)
  Fix migration v210 (go-gitea#18892)
  migrations: add test for importing pull requests in gitea uploader (go-gitea#18752)
Chianina pushed a commit to Chianina/gitea that referenced this pull request Mar 28, 2022
* Don't treat BOM escape sequence as hidden character.

- BOM sequence is a common non-harmfull escape sequence, it shouldn't be
shown as hidden character.
- Follows GitHub's behavior.
- Resolves go-gitea#18837

Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
@go-gitea go-gitea locked and limited conversation to collaborators Apr 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/done All backports for this PR have been created lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. skip-changelog This PR is irrelevant for the (next) changelog, for example bug fixes for unreleased features. type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants