latin1 supplement separator not working #195

sushovannits · 2021-02-16T23:47:54Z

Operating System: macOS Catalina 10.15.7
Node Version: 14.13.0
NPM Version: 6.14.8
csv-parser Version: 3.0.0

Expected Behavior

Using "¬" as the separator in a CSV file should work. The output should look like:

[ { a: '1', 'b': '2' } ]

Actual Behavior

Parsing is incorrect. It produces something like this:

[ { 'a�b': '1�2' } ]

How Do We Reproduce?

Test file:

a¬b
1¬2

Code:

const csv = require('csv-parser');
const fs = require('fs');
var iconv = require('iconv-lite');
const fileName = ('/tmp/test.csv')
const results = []
fs.createReadStream(fileName)
// .pipe(iconv.decodeStream("utf-8"))
// .pipe(iconv.decodeStream("latin1"))
// .pipe(iconv.encodeStream("utf-8"))
.pipe(csv({
    separator: '¬'
}))
.on('data', (row) => {
    results.push(row)
})
.on('end', () => {
    // console.log(JSON.stringify(results, null, 4))
    console.log(results)
    console.log('CSV file successfully processed');
});


MakersAll8 · 2024-08-30T01:13:21Z

I ran into the same issue and wasted a few working days on this. Hope it will help someone in future.

Root cause of problem

The character code of ¬ is 172 in decimal (0xac in hex). That is above 127, the top of the ASCII range, which makes it an extended ASCII character.

JavaScript strings are UTF-16 internally, and Buffer.from(string) encodes them as UTF-8 by default, so ¬ becomes the two bytes c2 ac. Latin-1 stores ¬ as the single byte ac. The two are not compatible: read as Latin-1, the leading c2 shows up as a stray Â, and read as UTF-8, a lone ac is an invalid byte that gets swapped for the replacement character �.

> const b1 = Buffer.from('¬')
undefined
> b1
<Buffer c2 ac>

> const b2 = Buffer.from([172])
undefined
> b2
<Buffer ac>

> const b3 = Buffer.from([0xac])
undefined
> b3
<Buffer ac>

The parser only supports a one-byte character as the separator, so with '¬' it is effectively splitting on c2. It doesn't matter whether you actually used ¬, § or ±: any character in the range 128-191 (U+0080 to U+00BF) starts with c2 in UTF-8 and will be treated as your separator. As a result, if your UTF-8 text contains any character in this range, the line will split there.

If possible, tell your team to pick a character between 1 and 127. These code points are stored as the same single byte in ASCII, Latin-1 and UTF-8, so it doesn't matter which of those the file was saved in. As long as it is from 1-127, you're good.
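To make the byte-level picture concrete, here is a small Node sketch that uses only the built-in Buffer. The helper name firstUtf8Byte is made up for this illustration and is not part of csv-parser.

```javascript
// Illustration only: firstUtf8Byte is a hypothetical helper, not a csv-parser API.
const firstUtf8Byte = (ch) => Buffer.from(ch, 'utf8')[0];

// Code points 128-191 (U+0080 to U+00BF) all start with 0xc2 in UTF-8.
const notSign = firstUtf8Byte('\u00ac');     // ¬ is stored as c2 ac
const sectionSign = firstUtf8Byte('\u00a7'); // § is stored as c2 a7
const plusMinus = firstUtf8Byte('\u00b1');   // ± is stored as c2 b1

// A Latin-1 file stores ¬ as the single byte 0xac.
// Decoded as UTF-8, that lone byte is invalid and becomes U+FFFD.
const latin1Line = Buffer.from([0x61, 0xac, 0x62]); // "a¬b" saved as Latin-1
const decodedAsUtf8 = latin1Line.toString('utf8');

console.log(notSign.toString(16), sectionSign.toString(16), plusMinus.toString(16));
console.log(JSON.stringify(decodedAsUtf8));
```

All three characters share the same first byte, which is why a separator given as a string cannot tell them apart, and why the Latin-1 file from the report never splits at all.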

Solution

const NOT_SIGN = [0xac];
// or const NOT_SIGN = [172];
// or const NOT_SIGN = new Uint8Array([0xac])

const parser = () =>
  csv({
    // @ts-ignore the source code is not typed correctly. We can pass a single byte in an array as well.
    separator: NOT_SIGN,
  });

Why does this work?

This works because the source code invokes Buffer.from(separator) to find out which byte serves as the separator (a comma by default) for parseLine, and the TypeScript declaration erroneously types separator as string, where it should have been string | [number] | Uint8Array. One caveat: the Uint8Array can only be of length 1, and I don't know a good way to type that. Maybe something like below, creating a new type and somehow using a type guard? I asked ChatGPT to create the code below.

type Uint8ArrayLength1 = Uint8Array & { length: 1 };

function createUint8ArrayLength1(value: number): Uint8ArrayLength1 {
    const array = new Uint8Array([value]);
    if (array.length !== 1) {
        throw new Error("Uint8Array must have a length of 1");
    }
    return array as Uint8ArrayLength1;
}
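The difference between passing a string and passing a byte array can be checked in isolation. The sketch below assumes the behaviour described above (the first byte of Buffer.from(separator) is what gets used); separatorByte is a hypothetical stand-in written for this comment, not csv-parser's actual code.

```javascript
// Hypothetical stand-in for the described behaviour: keep only the first
// byte of Buffer.from(separator). This is NOT csv-parser's real source.
function separatorByte(separator) {
  const [byte] = Buffer.from(separator);
  return byte;
}

// A string goes through UTF-8 first, so ¬ yields its lead byte 0xc2.
const fromString = separatorByte('\u00ac');
// An array or Uint8Array is taken byte for byte, so 0xac survives.
const fromArray = separatorByte([0xac]);
const fromTypedArray = separatorByte(new Uint8Array([0xac]));
// ASCII characters are one byte either way.
const fromComma = separatorByte(',');

console.log({ fromString, fromArray, fromTypedArray, fromComma });
```

Under that assumption, only the array forms produce the byte that a Latin-1 file actually contains.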

Debug from the source code

If you have no control over what encoding the csv was saved in, or if you are not sure if your team used ascii, latin1, utf-8, or utf-16, just change the source code in your node_modules folder, and add a console.log to log out separator and buffer from the parseLine function.

Alternatively, just use iconv as suggested by the doc and pipe your file stream from its original encoding into utf-8 and handle it from there. If you've read this far, you already have the bits and pieces to figure out the appropriate actions for your use case.
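If you are unsure which encoding a file uses, looking at the raw bytes around ¬ can help before you choose between the byte-array separator and transcoding. The following is a rough sketch using only Node built-ins; guessNotSignEncoding and latin1ToUtf8 are hypothetical helpers, the heuristic only looks at ¬, and it can be fooled (for example by a Latin-1 file that really contains "Â¬").

```javascript
// Hypothetical heuristic: how is ¬ stored in this buffer?
function guessNotSignEncoding(buf) {
  let sawBareByte = false;
  for (let i = 0; i < buf.length; i++) {
    if (buf[i] !== 0xac) continue;
    // c2 ac is the UTF-8 form of ¬
    if (i > 0 && buf[i - 1] === 0xc2) return 'utf8';
    // a bare ac byte matches the Latin-1 form
    sawBareByte = true;
  }
  return sawBareByte ? 'latin1' : 'unknown';
}

// Built-in stand-in for what an iconv latin1 -> utf-8 step would produce.
function latin1ToUtf8(buf) {
  return Buffer.from(buf.toString('latin1'), 'utf8');
}

const latin1File = Buffer.from([0x61, 0xac, 0x62]); // "a¬b" saved as Latin-1
const utf8File = latin1ToUtf8(latin1File);          // 61 c2 ac 62

console.log(guessNotSignEncoding(latin1File), guessNotSignEncoding(utf8File));
console.log(utf8File);
```

Note that after transcoding, ¬ occupies two bytes, so the one-byte separator limit described earlier would still seem to apply; transcoding fixes the cell contents, not necessarily the split.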

Jest test

A Jest test can look something like the one below; just replace my section sign (§) with your not sign (¬).

import { Readable, Writable } from 'node:stream';
import parser from '../util/csvParser';
import { pipeline } from 'stream/promises';

const mockLogger = {
  log: jest.fn(),
  error: jest.fn(),
  warn: jest.fn(),
  debug: jest.fn(),
  verbose: jest.fn(),
};

describe('csvParser using §', () => {
  it('should parse CSV with default options given csv with row number', async () => {
    expect.assertions(1);

    const inputString =
      '  §header1§header2\n1§value1§value2\n2   §value3§value4';
    const hexArray = [];
    for (let i = 0; i < inputString.length; i++) {
      const hexValue = inputString.charCodeAt(i);
      hexArray.push(
        `0x${hexValue.toString(16).toUpperCase().padStart(2, '0')}`,
      );
    }
    const csvData = Buffer.from(hexArray);

    const output = new Writable({
      objectMode: true,
      write(chunk, encoding, callback) {
        results.push(chunk);
        callback();
      },
    });

    const input = new Readable({
      read() {
        this.push(csvData);
        this.push(null);
      },
    });
    const results = [];

    await pipeline(input, parser(mockLogger), output);

    expect(results).toEqual([
      { HEADER1: 'value1', HEADER2: 'value2' },
      { HEADER1: 'value3', HEADER2: 'value4' },
    ]);
  });

  it('should parse CSV with default options given csv without row number', async () => {
    expect.assertions(1);

    const inputString = 'header1§header2\nvalue1§value2\nvalue3§value4';
    const hexArray = [];
    for (let i = 0; i < inputString.length; i++) {
      const hexValue = inputString.charCodeAt(i);
      hexArray.push(
        `0x${hexValue.toString(16).toUpperCase().padStart(2, '0')}`,
      );
    }
    const csvData = Buffer.from(hexArray);

    const output = new Writable({
      objectMode: true,
      write(chunk, encoding, callback) {
        results.push(chunk);
        callback();
      },
    });

    const input = new Readable({
      read() {
        this.push(csvData);
        this.push(null);
      },
    });
    const results = [];

    await pipeline(input, parser(mockLogger), output);

    expect(results).toEqual([
      { HEADER1: 'value1', HEADER2: 'value2' },
      { HEADER1: 'value3', HEADER2: 'value4' },
    ]);
  });
});
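The loops in the tests above build one byte per character from charCodeAt. For input whose characters are all below 256, Node's 'latin1' encoding should give the same bytes in a single call. A minimal sketch comparing the two (the variable names are only for this illustration):

```javascript
const inputString = 'header1\u00a7header2\nvalue1\u00a7value2'; // \u00a7 is §

// Loop form, as in the tests, but pushing numbers instead of hex strings.
const codes = [];
for (let i = 0; i < inputString.length; i++) {
  codes.push(inputString.charCodeAt(i));
}
const viaLoop = Buffer.from(codes);

// Shortcut: 'latin1' writes the low byte of each UTF-16 code unit.
const viaLatin1 = Buffer.from(inputString, 'latin1');

console.log(viaLoop.equals(viaLatin1));
```

Either form produces a buffer where § is the single byte a7, which is what a Latin-1 CSV on disk would contain.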

MakersAll8 · 2024-08-30T01:57:57Z

opened a PR for this

MakersAll8 mentioned this issue Aug 30, 2024

fix: update separator type to support extended ascii character with c… #238

