latin1 supplement separator not working #195

Open
sushovannits opened this issue Feb 16, 2021 · 2 comments

@sushovannits

  • Operating System: macOS Catalina 10.15.7
  • Node Version: 14.13.0
  • NPM Version: 6.14.8
  • csv-parser Version: 3.0.0

Expected Behavior

Using "¬" as a separator in a CSV file should work. The output should look like:

[ { a: '1', 'b': '2' } ]

Actual Behavior

Parsing is incorrect and produces something like this:

[ { 'a�b': '1�2' } ]

How Do We Reproduce?

Test file:

a¬b
1¬2

Code:

const csv = require('csv-parser');
const fs = require('fs');
var iconv = require('iconv-lite');
const fileName = ('/tmp/test.csv')
const results = []
fs.createReadStream(fileName)
// .pipe(iconv.decodeStream("utf-8"))
// .pipe(iconv.decodeStream("latin1"))
// .pipe(iconv.encodeStream("utf-8"))
.pipe(csv({
    separator: '¬'
}))
.on('data', (row) => {
    results.push(row)
})
.on('end', () => {
    // console.log(JSON.stringify(results, null, 4))
    console.log(results)
    console.log('CSV file successfully processed');
});

@MakersAll8

MakersAll8 commented Aug 30, 2024

I ran into the same issue and wasted a few working days on this. Hope it will help someone in future.

Root cause of problem

The character code of ¬ is 172 in decimal (0xac in hex), which is above 127, the top of the ASCII range, making it an extended ASCII character.

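JavaScript strings are UTF-16 internally, but Buffer.from(string) encodes them as UTF-8 by default, so ¬ (U+00AC) becomes the two bytes c2 ac. A Latin-1 file stores the same character as the single byte ac. The two encodings do not agree on bytes above 127: in Latin-1, c2 is a different character (Â), and in UTF-8 a lone ac is an invalid sequence, so decoding replaces it with the replacement character (U+FFFD).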

> const b1 = Buffer.from('¬')
undefined
> b1
<Buffer c2 ac>

> const b2 = Buffer.from([172])
undefined
> b2
<Buffer ac>

> const b3 = Buffer.from([0xac])
undefined
> b3
<Buffer ac>
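
To see both sides of the mismatch in one place, here is a minimal sketch using only Node's built-in Buffer. The variable names are mine for illustration and nothing here comes from csv-parser itself.

```javascript
const NOT_SIGN = '\u00ac'; // "¬"

// The same character, encoded two ways.
const asUtf8 = Buffer.from(NOT_SIGN, 'utf8');     // two bytes: c2 ac
const asLatin1 = Buffer.from(NOT_SIGN, 'latin1'); // one byte: ac

// Reading the Latin-1 byte as UTF-8 fails: 0xac on its own is a
// continuation byte with no lead byte, so it decodes to U+FFFD.
const misread = asLatin1.toString('utf8');
// Reading it as Latin-1 gives the original character back.
const correct = asLatin1.toString('latin1');

console.log(asUtf8.toString('hex'), asLatin1.toString('hex')); // c2ac ac
console.log(misread === '\ufffd', correct === NOT_SIGN);       // true true
```

This is the same replacement character that shows up in the 'a�b' header in the original report.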

The parser only supports a one-byte separator, and it is effectively using c2 to split. So it doesn't matter whether you actually used ¬, §, or another character: any character in the range 128-191 (U+0080 to U+00BF) starts with c2 in UTF-8 and is treated as your separator. As a result, if your UTF-8 text contains any character in this range, the line will split there.
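
The effect can be reproduced without the library. The sketch below is a rough stand-in, not csv-parser's code: splitOnByte is a hypothetical helper that splits a line on one byte and decodes each cell as UTF-8, and SEP is hard-coded to c2 on the assumption described above.

```javascript
// Assumption: the parser ends up with c2, the first UTF-8 byte of "¬".
const SEP = 0xc2;

// Hypothetical stand-in for a one-byte line splitter.
function splitOnByte(line, byte) {
  const cells = [];
  let start = 0;
  for (let i = 0; i < line.length; i++) {
    if (line[i] === byte) {
      cells.push(line.subarray(start, i).toString('utf8'));
      start = i + 1;
    }
  }
  cells.push(line.subarray(start).toString('utf8'));
  return cells;
}

// Latin-1 file: bytes 61 ac 62. There is no c2, so nothing splits and the
// lone ac decodes to U+FFFD, giving one cell 'a\ufffdb'.
const latin1Cells = splitOnByte(Buffer.from('a\u00acb', 'latin1'), SEP);

// UTF-8 file with "¬": bytes 61 c2 ac 62. It splits on c2, but the
// leftover ac stays glued to the next cell as U+FFFD.
const utf8NotCells = splitOnByte(Buffer.from('a\u00acb', 'utf8'), SEP);

// UTF-8 file with "§" (c2 a7): it splits here too, although "§" was
// never meant to be the separator.
const utf8SectionCells = splitOnByte(Buffer.from('a\u00a7b', 'utf8'), SEP);

console.log(latin1Cells, utf8NotCells, utf8SectionCells);
```

The first result matches the single 'a�b' column from the report; the other two show why every character in that range behaves like the separator.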

If possible, tell your team to pick a separator between 1 and 127. It doesn't matter whether the file is ASCII, Latin-1, or UTF-8: all of them store characters 1-127 as the same single byte, so you're good.

Solution

const NOT_SIGN = [0xac];
// or const NOT_SIGN = [172];
// or const NOT_SIGN = new Uint8Array([0xac])

const parser = () =>
  csv({
    // @ts-ignore the source code is not typed correctly. We can pass a single byte in an array as well.
    separator: NOT_SIGN,
  });

Why does this work?

This works because the source code invokes Buffer.from(separator) to determine which byte serves as the separator (comma) for parseLine, and the TypeScript declaration erroneously types separator as string, where it should have been string | [number] | Uint8Array. There is a caveat: the Uint8Array can only be of length 1, and I don't know a good way to type that. Maybe something like the code below, creating a new type and somehow using a type guard? I asked ChatGPT to create it.

type Uint8ArrayLength1 = Uint8Array & { length: 1 };

function createUint8ArrayLength1(value: number): Uint8ArrayLength1 {
    const array = new Uint8Array([value]);
    if (array.length !== 1) {
        throw new Error("Uint8Array must have a length of 1");
    }
    return array as Uint8ArrayLength1;
}
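
To make the Buffer.from(separator) explanation concrete, here is a small sketch of what each form of separator resolves to. resolveSeparatorByte is a hypothetical name of mine that only mirrors the behaviour described above (keep the first byte); it is not the library's actual function.

```javascript
// Hypothetical stand-in: keep only the first byte of Buffer.from(separator).
function resolveSeparatorByte(separator) {
  const [byte] = Buffer.from(separator);
  return byte;
}

const fromString = resolveSeparatorByte('\u00ac');             // "¬" as a string
const fromArray = resolveSeparatorByte([0xac]);                // single byte in an array
const fromTyped = resolveSeparatorByte(new Uint8Array([172])); // same byte, typed array
const fromComma = resolveSeparatorByte(',');                   // the default separator

console.log(
  [fromString, fromArray, fromTyped, fromComma].map((b) => b.toString(16)),
);
// [ 'c2', 'ac', 'ac', '2c' ]
```

Only the array and typed-array forms produce ac, the byte a Latin-1 file actually contains, which is why passing [0xac] works where the string '¬' does not.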

Debug from the source code

If you have no control over the encoding the CSV was saved in, or you are not sure whether your team used ASCII, Latin-1, UTF-8, or UTF-16, just edit the source code in your node_modules folder and add a console.log in the parseLine function to print separator and buffer.
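
If you would rather not touch node_modules, a quick check on the raw bytes can also narrow things down. This is only a sketch under my own assumptions: looksLikeUtf8 is a hypothetical helper built on the standard TextDecoder with fatal: true, and the sample buffers stand in for the first line of your file.

```javascript
// Hypothetical helper: true if the bytes decode cleanly as UTF-8.
// Pure ASCII also passes, because ASCII is valid UTF-8.
function looksLikeUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Stand-ins for the first line of a file saved in each encoding.
const latin1Sample = Buffer.from('a\u00acb', 'latin1');
const utf8Sample = Buffer.from('a\u00acb', 'utf8');

console.log(looksLikeUtf8(latin1Sample), latin1Sample.toString('hex')); // false 61ac62
console.log(looksLikeUtf8(utf8Sample), utf8Sample.toString('hex'));     // true 61c2ac62
```

The hex dump shows directly whether the separator is stored as ac or as c2 ac.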

Alternatively, just use iconv as suggested by the docs and pipe your file stream from its original encoding into UTF-8 and handle it from there. If you've read this far, you already have the bits and pieces to figure out the appropriate actions for your use case.

Jest test

A Jest test can look something like the one below; just replace my section sign (§) with your not sign (¬).

import { Readable, Writable } from 'node:stream';
import parser from '../util/csvParser';
import { pipeline } from 'stream/promises';

const mockLogger = {
  log: jest.fn(),
  error: jest.fn(),
  warn: jest.fn(),
  debug: jest.fn(),
  verbose: jest.fn(),
};

describe('csvParser using §', () => {
  it('should parse CSV with default options given csv with row number', async () => {
    expect.assertions(1);

    const inputString =
      '  §header1§header2\n1§value1§value2\n2   §value3§value4';
    const hexArray = [];
    for (let i = 0; i < inputString.length; i++) {
      const hexValue = inputString.charCodeAt(i);
      hexArray.push(
        `0x${hexValue.toString(16).toUpperCase().padStart(2, '0')}`,
      );
    }
    const csvData = Buffer.from(hexArray);

    const output = new Writable({
      objectMode: true,
      write(chunk, encoding, callback) {
        results.push(chunk);
        callback();
      },
    });

    const input = new Readable({
      read() {
        this.push(csvData);
        this.push(null);
      },
    });
    const results = [];

    await pipeline(input, parser(mockLogger), output);

    expect(results).toEqual([
      { HEADER1: 'value1', HEADER2: 'value2' },
      { HEADER1: 'value3', HEADER2: 'value4' },
    ]);
  });

  it('should parse CSV with default options given csv without row number', async () => {
    expect.assertions(1);

    const inputString = 'header1§header2\nvalue1§value2\nvalue3§value4';
    const hexArray = [];
    for (let i = 0; i < inputString.length; i++) {
      const hexValue = inputString.charCodeAt(i);
      hexArray.push(
        `0x${hexValue.toString(16).toUpperCase().padStart(2, '0')}`,
      );
    }
    const csvData = Buffer.from(hexArray);

    const output = new Writable({
      objectMode: true,
      write(chunk, encoding, callback) {
        results.push(chunk);
        callback();
      },
    });

    const input = new Readable({
      read() {
        this.push(csvData);
        this.push(null);
      },
    });
    const results = [];

    await pipeline(input, parser(mockLogger), output);

    expect(results).toEqual([
      { HEADER1: 'value1', HEADER2: 'value2' },
      { HEADER1: 'value3', HEADER2: 'value4' },
    ]);
  });
});

@MakersAll8

opened a PR for this
