Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate to move from jschardet to chardet #159896

Closed
joaomoreno opened this issue Sep 2, 2022 · 7 comments
Closed

Investigate to move from jschardet to chardet #159896

joaomoreno opened this issue Sep 2, 2022 · 7 comments
Labels
feature-request Request for new features or functionality file-encoding File encoding type issues
Milestone

Comments

@joaomoreno
Copy link
Member

joaomoreno commented Sep 2, 2022

We currently use https://github.com/aadsm/jschardet and want to explore replacing that with https://github.com/runk/node-chardet

@joaomoreno joaomoreno added this to the September 2022 milestone Sep 2, 2022
@bpasero bpasero added plan-item VS Code - planned item for upcoming file-encoding File encoding type issues labels Sep 2, 2022
@bpasero
Copy link
Member

bpasero commented Sep 2, 2022

It does not look very good for our unit tests though. There is indication that node-chardet alters detected encodings. From this CI run:

2022-09-02T12:49:06.0440233Z �[31m  4 failing�[0m
2022-09-02T12:49:06.0440772Z 
2022-09-02T12:49:06.0443281Z �[0m  1) Encoding
2022-09-02T12:49:06.0444176Z        autoGuessEncoding (ASCII):
2022-09-02T12:49:06.0445053Z �[0m�[31m     �[0m�[90m
2022-09-02T12:49:06.0445942Z   AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
2022-09-02T12:49:06.0447439Z   + actual - expected
2022-09-02T12:49:06.0448015Z   
2022-09-02T12:49:06.0448597Z   + 'iso88591'
2022-09-02T12:49:06.0449212Z   - null
2022-09-02T12:49:06.0450096Z       at Context.<anonymous> (file:///__w/1/s/out-build/vs/workbench/services/textfile/test/node/encoding/encoding.test.js:158:20)
2022-09-02T12:49:06.0450872Z �[0m
2022-09-02T12:49:06.0459666Z �[0m  2) Encoding
2022-09-02T12:49:06.0460714Z        autoGuessEncoding (ShiftJIS):
2022-09-02T12:49:06.0473807Z 
2022-09-02T12:49:06.0475036Z       �[31m�[0m
2022-09-02T12:49:06.0475690Z       �[32m+ expected�[0m �[31m- actual�[0m
2022-09-02T12:49:06.0475977Z 
2022-09-02T12:49:06.0476504Z       �[31m-windows1252�[0m
2022-09-02T12:49:06.0477065Z       �[32m+shiftjis�[0m
2022-09-02T12:49:06.0477592Z       �[0m�[90m
2022-09-02T12:49:06.0478097Z   AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
2022-09-02T12:49:06.0478732Z   + actual - expected
2022-09-02T12:49:06.0479141Z   
2022-09-02T12:49:06.0479640Z   + 'windows1252'
2022-09-02T12:49:06.0480185Z   - 'shiftjis'
2022-09-02T12:49:06.0480993Z       at Context.<anonymous> (file:///__w/1/s/out-build/vs/workbench/services/textfile/test/node/encoding/encoding.test.js:164:20)
2022-09-02T12:49:06.0481949Z �[0m
2022-09-02T12:49:06.0482479Z �[0m  3) Encoding
2022-09-02T12:49:06.0482911Z        autoGuessEncoding (CP1252):
2022-09-02T12:49:06.0483155Z 
2022-09-02T12:49:06.0483654Z       �[31m�[0m
2022-09-02T12:49:06.0484229Z       �[32m+ expected�[0m �[31m- actual�[0m
2022-09-02T12:49:06.0484489Z 
2022-09-02T12:49:06.0484994Z       �[31m-iso88591�[0m
2022-09-02T12:49:06.0485571Z       �[32m+windows1252�[0m
2022-09-02T12:49:06.0486135Z       �[0m�[90m
2022-09-02T12:49:06.0486746Z   AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
2022-09-02T12:49:06.0487471Z   + actual - expected
2022-09-02T12:49:06.0488090Z   
2022-09-02T12:49:06.0488798Z   + 'iso88591'
2022-09-02T12:49:06.0489514Z   - 'windows1252'
2022-09-02T12:49:06.0490419Z       at Context.<anonymous> (file:///__w/1/s/out-build/vs/workbench/services/textfile/test/node/encoding/encoding.test.js:170:20)
2022-09-02T12:49:06.0491212Z �[0m
2022-09-02T12:49:06.0491859Z �[0m  4) Files - NativeTextFileService i/o
2022-09-02T12:49:06.0492568Z        readStream - autoguessEncoding:
2022-09-02T12:49:06.0492853Z 
2022-09-02T12:49:06.0493455Z       �[31m�[0m
2022-09-02T12:49:06.0494103Z       �[32m+ expected�[0m �[31m- actual�[0m
2022-09-02T12:49:06.0494392Z 
2022-09-02T12:49:06.0495005Z       �[31m-iso88591�[0m
2022-09-02T12:49:06.0495634Z       �[32m+windows1252�[0m
2022-09-02T12:49:06.0496231Z       �[0m�[90m
2022-09-02T12:49:06.0496781Z   AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
2022-09-02T12:49:06.0497493Z   + actual - expected
2022-09-02T12:49:06.0497947Z   
2022-09-02T12:49:06.0498506Z   + 'iso88591'
2022-09-02T12:49:06.0499109Z   - 'windows1252'
2022-09-02T12:49:06.0500027Z       at Context.<anonymous> (file:///__w/1/s/out-build/vs/workbench/services/textfile/test/common/textFileService.io.test.js:400:20)
2022-09-02T12:49:06.0500815Z �[0m
2022-09-02T12:49:06.0501028Z 
2022-09-02T12:49:06.0501302Z 
2022-09-02T12:49:07.4049515Z ##[error]Bash exited with code '1'.

@joaomoreno
Copy link
Member Author

Let's go through these one by one:

[
  { confidence: 28, name: 'windows-1252', lang: 'no' },
  { confidence: 14, name: 'windows-1254', lang: 'tr' },
  { confidence: 13, name: 'windows-1250', lang: 'ro' },
  { confidence: 13, name: 'windows-1253', lang: 'el' },
  { confidence: 10, name: 'Shift_JIS', lang: 'ja' },
  { confidence: 10, name: 'Big5', lang: 'zh' },
  { confidence: 10, name: 'GB18030', lang: 'zh' }
]
  • CP1252 actually looks OK. Our sample matches both CP1252 as well as iso88591. Check the screenshot below. I would actually improve our sample here to contain more CP1252 specific bytes.
[
  { confidence: 48, name: 'ISO-8859-1', lang: 'en' },
  { confidence: 16, name: 'ISO-8859-2', lang: 'cs' },
  { confidence: 10, name: 'Shift_JIS', lang: 'ja' },
  { confidence: 10, name: 'Big5', lang: 'zh' },
  { confidence: 10, name: 'GB18030', lang: 'zh' },
  { confidence: 6, name: 'ISO-8859-9', lang: 'tr' }
]

image

@bpasero bpasero changed the title Investigate dropping jschardet for chardet Investigate to move from jschardet to chardet Sep 3, 2022
@bpasero
Copy link
Member

bpasero commented Sep 3, 2022

Trying to run the jschardet tests against chardet. Main difference is that jschardet accepts a string while chardet only accepts Uint8Array so I am trying to convert the hex encoded input via:

Uint8Array.from([...str].map(v => v.charCodeAt(0)))

(from https://stackoverflow.com/questions/73063232/how-to-convert-a-hex-binary-string-to-uint8array)

Looks like still some failures:

Test Suites: 1 failed, 1 passed, 2 total
Tests:       13 failed, 7 skipped, 21 passed, 41 total
Snapshots:   0 total
Time:        0.37 s, estimated 1 s
Ran all test suites.

To reproduce

@joaomoreno
Copy link
Member Author

joaomoreno commented Sep 5, 2022

Main difference is that jschardet accepts a string while chardet only accepts Uint8Array

Now that you point that out, it seems like memory usage is another argument to switch to chardet. jschardet forces us to first convert byte arrays from disk into "utf8" strings, so it can run its detection and we potentially convert the byte array once again into the actual encoding. chardet would allow us to detect the encoding right from the byte array and then we convert it on the fly.

To reproduce

Awesome!

Looks like still some failures:

Yeah, differences are to be expected. Can you go through them one by one and see what are the real red flags?

@bpasero
Copy link
Member

bpasero commented Sep 5, 2022

Now that you point that out, it seems like memory usage is another argument to switch to chardet. jschardet forces us to first convert byte arrays from disk into "utf8" strings,

Not really, we pass a limited amount of bytes to jschardet:

const AUTO_ENCODING_GUESS_MAX_BYTES = 512 * 128; // set an upper limit for the number of bytes we pass on to jschardet

But yeah, its a small win of not having to call this:

const binaryString = encodeLatin1(limitedBuffer.buffer);

Can you go through them one by one and see what are the real red flags?

I am not sure how I can reason about a failing jschardet test, but what we can do is report each test issue as an issue to chardet and see what the author has to say about them and then continue from there.

@joaomoreno
Copy link
Member Author

I am not sure how I can reason about a failing jschardet test, but what we can do is report each test issue as an issue to chardet and see what the author has to say about them and then continue from there.

It's the same argumentation I had in my comment above:

  • Some of these failures are OK. For example, ascii vs ISO-8859-1.
  • Others are not. For example, windows-1252 vs ISO-8859-1.
  • Others are just very likely due to small sample inputs. For example, the UTF-32 ones. I wonder if (1) this would affect us, since it isn't likely that users depend on this feature for very small files and (2) increasing the sample size would turn the tests green.

@bpasero bpasero added feature-request Request for new features or functionality and removed plan-item VS Code - planned item for upcoming labels Sep 27, 2022
@bpasero bpasero removed their assignment Sep 27, 2022
@joaomoreno joaomoreno closed this as not planned Won't fix, can't repro, duplicate, stale Oct 7, 2022
@joaomoreno
Copy link
Member Author

Closing this as not planned.

@github-actions github-actions bot locked and limited conversation to collaborators Nov 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature-request Request for new features or functionality file-encoding File encoding type issues
Projects
None yet
Development

No branches or pull requests

2 participants