This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Add encoding auto-detection logic #120

Closed · wants to merge 7 commits

Conversation

@50Wliu (Contributor) commented Mar 6, 2017

This is a possible fourth part to the move-encoding-detection-into-core saga. If merged, it will allow atom/text-buffer#220 to use these methods instead of having to require jschardet and iconv-lite by itself.

One thing that I'm not happy about is the use of readFileSync. Making the whole chain use callbacks feels ugly to me and is also inconsistent with setEncoding and getEncoding, though it's definitely a possibility.

Fixes atom/encoding-selector#51
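
For reference, a minimal sketch of the detect-then-decode pair such methods would wrap, assuming the standard jschardet and iconv-lite APIs; the helper names and the 0.95 cutoff are illustrative, not necessarily what this PR exposes:

```js
const jschardet = require('jschardet')
const iconv = require('iconv-lite')

// Detect the charset of a raw byte Buffer; returns null when jschardet
// isn't confident enough to commit to an answer.
function detectEncoding (bytes) {
  const {encoding, confidence} = jschardet.detect(bytes)
  return encoding && confidence >= 0.95 ? encoding : null
}

// Decode the bytes with the detected charset, falling back to UTF-8.
function decode (bytes) {
  return iconv.decode(bytes, detectEncoding(bytes) || 'utf8')
}
```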

@50Wliu (Contributor, Author) commented Jun 13, 2017

Update: readFileSync is not going to be an acceptable solution, especially because this is done on file load.

@50Wliu (Contributor, Author) commented Aug 7, 2017

Caveats:

  • The current test file for cp1255 is actually cp1252 (according to @Ben3eeE).
  • Atom's text-editor.js is incorrectly detected as cp1252 instead of utf8.
  • An encoding is returned once the confidence is above 0.95 (see the sketch after this list). Is that threshold too low? Raising it will make detection take longer, but maybe that's acceptable.
  • Since detection is now asynchronous, I'm not sure what will happen if you manage to edit the file before the encoding is changed.
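
A sketch of how that threshold could work with asynchronous reads instead of readFileSync; the 0.95 cutoff is the value mentioned above, while the helper name, chunk handling, and early-stop behavior are illustrative assumptions:

```js
const fs = require('fs')
const jschardet = require('jschardet')

function detectFileEncoding (filePath, callback) {
  const stream = fs.createReadStream(filePath)
  let bytes = Buffer.alloc(0)
  let done = false
  const finish = (err, encoding) => {
    if (done) return
    done = true
    stream.destroy()
    callback(err, encoding)
  }

  stream.on('data', (chunk) => {
    // Re-run detection on everything read so far (fine for a sketch;
    // a real implementation would want an incremental detector).
    bytes = Buffer.concat([bytes, chunk])
    const {encoding, confidence} = jschardet.detect(bytes)
    // Stop reading as soon as we're confident enough.
    if (encoding && confidence >= 0.95) finish(null, encoding)
  })
  stream.on('end', () => finish(null, null)) // threshold never reached
  stream.on('error', (err) => finish(err))
}
```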

@nathansobo (Contributor)

It's late on a Friday and I'm in a hurry, so please forgive me if any of this feedback misses important details.

My concern with the approach in this group of PRs is that we do a lot of redundant I/O whenever we detect a non-default encoding. We first do I/O in superstring to load the buffer before detecting the encoding, then we do I/O again to feed bytes to jschardet, and then we update the encoding, triggering another reload.

It seems like we want to detect the encoding before loading the buffer at all. There are two options, of varying difficulty to implement:

The first is to keep detection in JS and do a limited amount of I/O in TextBuffer.load in order to feed some bytes to jschardet. I would want a guarantee that we would stop if we didn't detect the encoding after some number of chunks so as not to read the entire file. Note that there are two code paths in TextBuffer.load, one taking a path and one taking a file.
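
A rough sketch of that first option, assuming a bounded prefix read is acceptable; `detectFromPrefix` and `maxBytes` are hypothetical names, and this ignores the file-based code path:

```js
const fs = require('fs')
const jschardet = require('jschardet')

// Read at most maxBytes from the start of the file, run detection on that
// prefix, and give up (returning null) rather than reading the entire file.
function detectFromPrefix (filePath, maxBytes, callback) {
  fs.open(filePath, 'r', (err, fd) => {
    if (err) return callback(err)
    const prefix = Buffer.alloc(maxBytes)
    fs.read(fd, prefix, 0, maxBytes, 0, (err, bytesRead) => {
      fs.close(fd, () => {})
      if (err) return callback(err)
      const {encoding, confidence} = jschardet.detect(prefix.slice(0, bytesRead))
      callback(null, encoding && confidence >= 0.95 ? encoding : null)
    })
  })
}
```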

The other approach is more efficient but harder to implement: do the encoding detection in superstring's load_file method. This method already does I/O and performs an encoding conversion. Perhaps it could be modified to optionally detect the encoding first, based on the first batch of read bytes. This would avoid the redundant I/O but raise the difficulty and risk.

@maxbrunsfeld Do you have thoughts on this? How bad would it be to do enough I/O in the JS side of buffer loading to detect the encoding?

@maxbrunsfeld (Contributor) commented Nov 4, 2017

I'm not sure how long encoding detection takes, and how much data needs to be loaded. If we want it to happen on a background thread as part of superstring's file loading, we could definitely do it in C++. The uchardet C++ library seems to be used by a lot of things.

EDIT - Hmm, ☝️ that library is MPL-licensed. Not sure if that's OK for Atom.

@nathansobo (Contributor)

@50Wliu want to try your hand at some C++?

@50Wliu (Contributor, Author) commented Nov 4, 2017

We'll see 😬
I'll probably need a lot of hand-holding though. It'll be my first time writing any C/C++.

@50Wliu (Contributor, Author) commented Dec 19, 2017

OK, I took a look at uchardet. This is my first time trying to understand C++ code, so my analysis could be way off.

It looks like their model is "give us as many bytes as you have, and call uchardet_get_charset(ud) once you're finished". The uchardet header doesn't expose confidence information, so we can't opportunistically short-circuit when the confidence is high enough; on the flip side, uchardet could give us a charset that it's only 20% confident about and we wouldn't know any better.

So "detect[ing] the encoding based on the first batch of read bytes" may be unreliable using uchardet. It would also misreport the encoding if the special characters are only later on in the file.

Thoughts?

@nathansobo (Contributor)

@50Wliu In an auto-detection scenario, it might be okay to buffer up all the bytes (or up to some reasonable maximum) without transcoding them, in order to hand them to uchardet first. This would still be better than performing the actual I/O twice.
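
In JS terms, with jschardet standing in for uchardet, that buffer-once idea might look like this sketch; the function name and the UTF-8 fallback are assumptions, not anything from superstring:

```js
const fs = require('fs')
const jschardet = require('jschardet')
const iconv = require('iconv-lite')

// One pass of I/O: keep the raw bytes, detect their charset, then transcode
// the same buffer, so the file is never read twice.
function loadWithDetection (filePath, callback) {
  fs.readFile(filePath, (err, bytes) => {
    if (err) return callback(err)
    const {encoding, confidence} = jschardet.detect(bytes)
    const charset = (encoding && confidence >= 0.95) ? encoding : 'utf8'
    callback(null, iconv.decode(bytes, charset))
  })
}
```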

@calebmeyer

Not sure if this helps, but it looks like this is what VS Code is doing for detection: https://github.com/Microsoft/vscode/pull/21416/files

They're also using the jschardet library.

@50Wliu closed this Sep 29, 2021

Successfully merging this pull request may close these issues.

Uncaught Error: "toString()" failed