Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CVS delimiter detection is wrong on importing csv language files #2853

Closed
notz opened this issue Jul 30, 2020 · 8 comments · Fixed by #2857
Closed

CVS delimiter detection is wrong on importing csv language files #2853

notz opened this issue Jul 30, 2020 · 8 comments · Fixed by #2857
Assignees
Labels

Comments

@notz
Copy link

notz commented Jul 30, 2020

Subject of the issue/enhancement/features

The csv delimiter detection introduced in #2737 is not working correctly.
The introduced regex can't work correctly:

     const firstLineMatches = fileContent.match(/.+\/"{0,1}.{1}/);

It probably should look like:

    const firstLineMatches = fileContent.match(/.+\/"{0,1}[,;\t]{1}/);

Reference:

const firstLineMatches = fileContent.match(/.+\/"{0,1}.{1}/);

@oliverfoster
Copy link
Member

oliverfoster commented Jul 30, 2020

How is it not working correctly? What issue are you trying to solve?

I don't see how switching from . (any character) to [,;\t] (only comma, semicolon and tab) changes anything. It only limits the types of delimiter you could have. The code would function identically otherwise.

This line is trying to find the delimiter at the end of the first column. It says "match any character one or more times, a slash, followed by either 0 or 1 speechmarks, and then any other next character". It's the any other next character that would be the discovered delimiter. The only place I could see this not work would be if there were escaped speechmarks in the first column - which should never happen given that the first column is always a JSON property path and a _id.

Is it that the match doesn't work when the first column isn't wrapped with speechmarks?

Update: It was that there were slashes in the second column and the regex as defined is greedy

notz added a commit to notz/adapt_framework that referenced this issue Jul 30, 2020
@notz
Copy link
Author

notz commented Jul 30, 2020

The regex is matching any character.

.+ -> min 1 char 
\"{0,1} -> must not match 
.{1} -> one char 

So it can match on any place after the first char

In my case it detected "p" as delimiter. The regex is still not save to use but better. Ideally it will be removed or a library is used therefore.

@oliverfoster
Copy link
Member

Can you post the first line of your csv? I want to find something more fitting.

@notz
Copy link
Author

notz commented Jul 30, 2020

I can't post they complete line because it's a private project. But it's like that:

"articles/5e857dd5f99bb68c5cbf6515/body/","<p><strong>Le client est en ligne</strong></p>

@oliverfoster
Copy link
Member

Yea that makes sense.

@oliverfoster
Copy link
Member

oliverfoster commented Jul 30, 2020

Yea, your suggestion of limiting the delimiters seems like the only sensible option. This article suggests pipe and space should be included and we can probably drop the rest of the regex just matching the delimited list directly. /[,;\t| ]{1}/

These main use-cases would all work fine (no marks, marks, marks+html):

articles/5e857dd5f99bb68c5cbf6515/body/,Adapt
"articles/5e857dd5f99bb68c5cbf6515/body/","Adapt"
"articles/5e857dd5f99bb68c5cbf6515/body/","<p>Adapt</p>"

@notz
Copy link
Author

notz commented Jul 30, 2020

The best is to use this https://www.npmjs.com/package/detect-csv or this https://www.npmjs.com/package/csv-string#detectinput--string--string

With the regex their may still be a problem. Detection should also only be done if not defined via command line.

@oliverfoster
Copy link
Member

oliverfoster commented Jul 30, 2020

I completely disagree about using a library to do this, both of these libraries detect delimiters in ways that are almost identical to a regex detect-csv, csv-string and they are both designed for generic data. In contrast, we know what the data looks like on the first line of all of these files and we don't need the overhead of another library.

To have detection only be done if not defined at the command line, change this line to default null instead of a comma then check for null before this line, default to comma here, default to null here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment