Updated 60526fc to ee5d65e
Stream the file buffer to jschardet to avoid hangs when opening large files
Caveats:
It's late on a Friday and I'm in a hurry, so please forgive me if this feedback is missing important details.

My concern with the approach in this group of PRs is that we do a lot of redundant I/O whenever we're detecting a non-default encoding: we first do I/O in superstring to load the buffer before detecting the encoding, then we do I/O again to feed bytes to jschardet, then we update the encoding, which triggers another reload. It seems like we want to detect the encoding before loading the buffer at all. There are two options of varying difficulty to implement:

- The first is to keep detection in JS and do a limited amount of I/O in
- The other approach is more optimal but harder to implement: do the encoding detection in superstring in the

@maxbrunsfeld Do you have thoughts on this? How bad would it be to do enough I/O on the JS side of buffer loading to detect the encoding?
I'm not sure how long encoding detection takes, and how much data needs to be loaded. If we want it to happen on a background thread as part of superstring's file loading, we could definitely do it in C++. The uchardet C++ library seems to be used by a lot of things. EDIT - Hmm, ☝️ that library is MPL-licensed. Not sure if that's ok for Atom.
@50Wliu want to try your hand at some C++?
We'll see 😬 |
Ok, I took a look at uchardet. This is my first time trying to understand C++ code, so my analysis could be way off. It looks like their model is "give us as many bytes as you have, and call

So "detect[ing] the encoding based on the first batch of read bytes" may be unreliable using uchardet. It would also misreport the encoding if the special characters only appear later in the file. Thoughts?
@50Wliu In an auto-detection scenario, it might be okay to buffer up all the bytes (or up to some reasonable maximum) without transcoding them, in order to hand them to uchardet first. This would still be better than performing the actual I/O twice.
Not sure if this helps, but it looks like this is what VS Code is doing for detection: https://github.com/Microsoft/vscode/pull/21416/files They're also using the jschardet library.
This is a possible fourth part to the move-encoding-detection-into-core saga. If merged, it will allow atom/text-buffer#220 to use these methods instead of having to require jschardet and iconv-lite by itself.
One thing that I'm not happy about is the use of `readFileSync`. Making the whole chain use callbacks feels ugly to me and is also inconsistent with `setEncoding` and `getEncoding`, though it's definitely a possibility.

Fixes atom/encoding-selector#51