This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Add encoding auto-detection logic #120

Closed · wants to merge 7 commits

Conversation

@50Wliu (Contributor) commented Mar 6, 2017

This is a possible fourth part to the move-encoding-detection-into-core saga. If merged, it will allow atom/text-buffer#220 to use these methods instead of having to require jschardet and iconv-lite by itself.

One thing that I'm not happy about is the use of readFileSync. Making the whole chain use callbacks feels ugly to me and is also inconsistent with setEncoding and getEncoding, though it's definitely a possibility.

Fixes atom/encoding-selector#51
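
For reference, a minimal sketch of the detect-then-decode pair such methods would wrap, assuming the standard jschardet and iconv-lite APIs; the helper names and the 0.95 cutoff are illustrative, not necessarily what this PR exposes:

```js
const jschardet = require('jschardet')
const iconv = require('iconv-lite')

// Detect the charset of a raw byte Buffer; returns null when jschardet
// isn't confident enough to commit to an answer.
function detectEncoding (bytes) {
  const {encoding, confidence} = jschardet.detect(bytes)
  return encoding && confidence >= 0.95 ? encoding : null
}

// Decode the bytes with the detected charset, falling back to UTF-8.
function decode (bytes) {
  return iconv.decode(bytes, detectEncoding(bytes) || 'utf8')
}
```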

@50Wliu (Contributor, Author) commented Jun 13, 2017

Update: readFileSync is not going to be an acceptable solution, especially because this is done on file load.

@50Wliu (Contributor, Author) commented Aug 7, 2017

Caveats:

  • The current test file for cp1255 is actually cp1252 (according to @Ben3eeE).
  • Atom's text-editor.js is incorrectly detected as cp1252 instead of utf8.
  • An encoding is returned once the confidence is above 0.95 (see the sketch after this list). Is that threshold too low? Raising it will make detection take longer, but maybe that's acceptable.
  • Since detection is now asynchronous, I'm not sure what will happen if you manage to edit the file before the encoding is changed.
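
A sketch of how that threshold could work with asynchronous reads instead of readFileSync; the 0.95 cutoff is the value mentioned above, while the helper name, chunk handling, and early-stop behavior are illustrative assumptions:

```js
const fs = require('fs')
const jschardet = require('jschardet')

function detectFileEncoding (filePath, callback) {
  const stream = fs.createReadStream(filePath)
  let bytes = Buffer.alloc(0)
  let done = false
  const finish = (err, encoding) => {
    if (done) return
    done = true
    stream.destroy()
    callback(err, encoding)
  }

  stream.on('data', (chunk) => {
    // Re-run detection on everything read so far (fine for a sketch;
    // a real implementation would want an incremental detector).
    bytes = Buffer.concat([bytes, chunk])
    const {encoding, confidence} = jschardet.detect(bytes)
    // Stop reading as soon as we're confident enough.
    if (encoding && confidence >= 0.95) finish(null, encoding)
  })
  stream.on('end', () => finish(null, null)) // threshold never reached
  stream.on('error', (err) => finish(err))
}
```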

@nathansobo (Contributor)

It's late on a Friday and I'm in a hurry, so please forgive me if any of this feedback misses important details.

My concern with the approach in this group of PRs is that we do a lot of redundant I/O whenever we detect a non-default encoding. We first do I/O in superstring to load the buffer before detecting the encoding, then we do I/O again to feed bytes to jschardet, and then we update the encoding, triggering another reload.

It seems like we want to detect the encoding before loading the buffer at all. There are two options, of varying difficulty to implement:

The first is to keep detection in JS and do a limited amount of I/O in TextBuffer.load in order to feed some bytes to jschardet. I would want a guarantee that we would stop if we didn't detect the encoding after some number of chunks so as not to read the entire file. Note that there are two code paths in TextBuffer.load, one taking a path and one taking a file.
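
A rough sketch of that first option, assuming a bounded prefix read is acceptable; `detectFromPrefix` and `maxBytes` are hypothetical names, and this ignores the file-based code path:

```js
const fs = require('fs')
const jschardet = require('jschardet')

// Read at most maxBytes from the start of the file, run detection on that
// prefix, and give up (returning null) rather than reading the entire file.
function detectFromPrefix (filePath, maxBytes, callback) {
  fs.open(filePath, 'r', (err, fd) => {
    if (err) return callback(err)
    const prefix = Buffer.alloc(maxBytes)
    fs.read(fd, prefix, 0, maxBytes, 0, (err, bytesRead) => {
      fs.close(fd, () => {})
      if (err) return callback(err)
      const {encoding, confidence} = jschardet.detect(prefix.slice(0, bytesRead))
      callback(null, encoding && confidence >= 0.95 ? encoding : null)
    })
  })
}
```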

The other approach is more efficient but harder to implement: do the encoding detection in superstring's load_file method. This method already does I/O and performs an encoding conversion. Perhaps it could be modified to optionally detect the encoding first, based on the first batch of read bytes. This would avoid the redundant I/O but raise the difficulty and risk.

@maxbrunsfeld Do you have thoughts on this? How bad would it be to do enough I/O in the JS side of buffer loading to detect the encoding?

@maxbrunsfeld (Contributor) commented Nov 4, 2017

I'm not sure how long encoding detection takes, and how much data needs to be loaded. If we want it to happen on a background thread as part of superstring's file loading, we could definitely do it in C++. The uchardet C++ library seems to be used by a lot of things.

EDIT - Hmm, ☝️ that library is MPL-licensed. Not sure if that's OK for Atom.

@nathansobo (Contributor)

@50Wliu want to try your hand at some C++?

@50Wliu (Contributor, Author) commented Nov 4, 2017

We'll see 😬
I'll probably need a lot of hand-holding though. It'll be my first time writing any C/C++.

@50Wliu (Contributor, Author) commented Dec 19, 2017

OK, I took a look at uchardet. This is my first time trying to understand C++ code, so my analysis could be way off.

It looks like their model is "give us as many bytes as you have, and call uchardet_get_charset(ud) once you're finished". The uchardet header doesn't expose confidence information, so we can't opportunistically short-circuit when the confidence is high enough; on the flip side, uchardet could give us a charset that it's only 20% confident about and we wouldn't know any better.

So "detect[ing] the encoding based on the first batch of read bytes" may be unreliable using uchardet. It would also misreport the encoding if the special characters are only later on in the file.

Thoughts?

@nathansobo (Contributor)

@50Wliu In an auto-detection scenario, it might be okay to buffer up all the bytes (or up to some reasonable maximum) without transcoding them, in order to hand them to uchardet first. This would still be better than performing the actual I/O twice.
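
In JS terms, with jschardet standing in for uchardet, that buffer-once idea might look like this sketch; the function name and the UTF-8 fallback are assumptions, not anything from superstring:

```js
const fs = require('fs')
const jschardet = require('jschardet')
const iconv = require('iconv-lite')

// One pass of I/O: keep the raw bytes, detect their charset, then transcode
// the same buffer, so the file is never read twice.
function loadWithDetection (filePath, callback) {
  fs.readFile(filePath, (err, bytes) => {
    if (err) return callback(err)
    const {encoding, confidence} = jschardet.detect(bytes)
    const charset = (encoding && confidence >= 0.95) ? encoding : 'utf8'
    callback(null, iconv.decode(bytes, charset))
  })
}
```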

@calebmeyer

Not sure if this helps, but it looks like this is what VS Code is doing for detection: https://github.com/Microsoft/vscode/pull/21416/files

They're also using the jschardet library.

@50Wliu closed this Sep 29, 2021

Successfully merging this pull request may close these issues.

Uncaught Error: "toString()" failed