Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Charset detection returns inconsistent results #8474

Closed
guillep2k opened this issue Oct 12, 2019 · 7 comments
Closed

Charset detection returns inconsistent results #8474

guillep2k opened this issue Oct 12, 2019 · 7 comments
Labels
issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented type/testing

Comments

@guillep2k
Copy link
Member

Running the same string through charset.DetectEncoding() may return different results when called different times. This is what has been making our CI tests fail randomly at TestToUTF8WithFallback and TestToUTF8.

The underlying library for encoding detection is github.com/gogits/chardet, and it runs the given string through many "detectors" that return a level of confidence each.

The problem comes from the fact that these detectors are ran in goroutines and return their calculations through a channel. Many of them return the same level of confidence, and the first to report wins.

https://github.com/gogs/chardet/blob/2404f777256163ea3eadb273dada5dcb037993c0/detector.go#L95-L111

So, the strings that were failing in the charset tests are detected as ISO-8859-1 most of the time, but from time to time they're detected as ISO-8859-2 which produces a different string when converted to UTF-8, thus making the test fail.

From the library point of view this is not strictly a bug, since all it does is guessing the character set. However, it's not unreasonable to expect reproducibility in the results.

This situation is probably causing Gitea to parse strings inconsistently from time to time.

So, the test in charset is "easy to fix": we could just delete it or reduce its expectations. Obviously this defeats the purpose of having the test. Fixing the library will probably be much harder.

What saddens me is that I knew about it and totally forgot:

// due to a race condition in `chardet` library, it could either detect
// "ISO-8859-1" or "IS0-8859-2" here. Technically either is correct, so
// we accept either.
assert.Contains(t, encoding, "ISO-8859")

@blueworrybear
Copy link
Contributor

Hi, I was wondering if we could solve this problem by sorting not only Confidence but also Charset string?

Since the original code has sorted after results collected from channel, it seems to me it's probably okay to sort with Charset in the same time. Therefore ISO-8859-1 will always be returned. 🤔

@guillep2k
Copy link
Member Author

I've thought perhaps better is to add bits to the confidence level. Let's say confidence is currently a 32-bit value (didn't check); we can make it 64-bit. The higher 32 will be the current confidence as calculated. We can statically (const) assign each charset a priority number based on e.g. number of Google search results and use that value in the 32 lower bits, so:

[          new calculated value        ]
[       32bit      ][      32bit       ]
[  old confidence  ][  const priority  ]

It's the same idea as yours, but it has the advantage of simplifying the comparision and also we can make sense of which ones should be first.

Of course confidence must be the important parameter here.

But the thing is, this should be taken care of upstream (in the library itself). I could try and send a PR there, perhaps. But getting it accepted could be a problem; they must share the same view as us about the "bug" and that's not necessarily true.

@guillep2k
Copy link
Member Author

Another possibility is to get rid of the channels and goroutines and call the detectors sequentially. IMHO, since the library is waiting for all of them to cast a result, the goroutines only give an advantage on occasional requests (or a single benchmark) in a many-core CPU. Otherwise they add a lot of overhead (bad cache utilization, thread initialization, etc.) that could be used to serve other requests. Many web sites only have a core or two as they are low-cost VMs.

@typeless
Copy link
Contributor

@guillep2k Agreed. Using goroutines would make more sense if the charset detection is an I/O bound operation or is executed distributively. I expect that it is not the case.

@stale
Copy link

stale bot commented Feb 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. I am here to help clear issues left open even if solved or waiting for more insight. This issue will be closed if no further activity occurs during the next 2 weeks. If the issue is still valid just add a comment to keep it alive. Thank you for your contributions.

@stale stale bot added the issue/stale label Feb 15, 2020
@guillep2k
Copy link
Member Author

This still causes some CI errors from time to time.

@stale stale bot removed the issue/stale label Feb 15, 2020
@lunny lunny added the issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented label Feb 15, 2020
@wxiaoguang
Copy link
Contributor

No more failures now

@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented type/testing
Projects
None yet
Development

No branches or pull requests

6 participants