delimiter is not correct for some csv file #35

hcheng2002cn · 2021-04-27T20:13:11Z

Hi,

We have a sample CSV file:

bytearray(b'fake data'),20:53:06,2019-09-01T19:28:21
bytearray(b'fake data'),19:33:15,2005-02-15T19:10:31
bytearray(b'fake data'),10:43:05,1992-10-12T14:49:24
bytearray(b'fake data'),10:36:49,1999-07-18T17:27:55
bytearray(b'fake data'),03:33:35,1982-04-24T17:38:45
bytearray(b'fake data'),14:49:47,1983-01-05T22:17:42
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45

The delimiter guessed by clevercsv is ":", but it should be ",".

Thanks.

hcheng2002cn · 2021-04-28T17:06:54Z

Maybe it is because each row has four ":" characters, which is more than the two ","?
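One way to check this guess is to count the cells each candidate delimiter produces per row. The snippet below is only an illustrative sketch using a plain string split on three of the sample rows; it is not how CleverCSV parses files, and the helper name `cells_per_row` is made up for this example.

```python
# Three rows copied from the sample file above.
rows = [
    "bytearray(b'fake data'),20:53:06,2019-09-01T19:28:21",
    "bytearray(b'fake data'),19:33:15,2005-02-15T19:10:31",
    "bytearray(b'fake data'),10:43:05,1992-10-12T14:49:24",
]


def cells_per_row(rows, delimiter):
    """Return the set of distinct cell counts produced by a naive split."""
    return {len(row.split(delimiter)) for row in rows}


comma_counts = cells_per_row(rows, ",")
colon_counts = cells_per_row(rows, ":")

# Both delimiters give the same number of cells on every row,
# so row regularity alone cannot separate them.
print(comma_counts, colon_counts)
```

Both candidates split every row into a constant number of cells (three for "," and five for ":"), so the two delimiters look equally regular from this angle.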

GjjvdBurg · 2021-04-28T18:56:53Z

Hi @hcheng2002cn, thanks for raising this issue. It's indeed because you have four : and two , that the scoring method gets confused. Here is the full output from CleverCSV, which you might have seen already:

$ clevercsv detect -v test.csv
... 
SimpleDialect(',', '', ''):	P =       14.666667	T =        0.666667	Q =        9.777778
...
SimpleDialect(':', '', ''):	P =       17.600000	T =        0.600000	Q =       10.560000
Detected: SimpleDialect(':', '', '')

(I've removed some output for conciseness). In this output P is the pattern score, T is the type score, and Q is their product (higher Q wins). Using the : delimiter you get five columns, three of which have a recognized type (resulting in T = 3/5 = 0.6), whereas with , as delimiter you get three columns, two of which have a recognized type (so T = 2/3).
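To make the arithmetic concrete, here is a small sketch that recomputes Q from the P values printed above and the column counts described in this comment. The function name `q_score` is hypothetical and this is not CleverCSV's internal code; it only reproduces the product Q = P × T.

```python
def q_score(pattern_score, typed_columns, total_columns):
    """Multiply the pattern score by the fraction of columns with a known type."""
    type_score = typed_columns / total_columns
    return pattern_score * type_score


# P values copied from the `clevercsv detect -v` output above.
q_comma = q_score(14.666667, typed_columns=2, total_columns=3)
q_colon = q_score(17.600000, typed_columns=3, total_columns=5)

# The candidate with the higher Q is selected.
winner = "," if q_comma > q_colon else ":"
print(round(q_comma, 6), round(q_colon, 6), winner)
```

Although "," has the higher type score (2/3 against 3/5), the larger pattern score for ":" is enough to give it the higher product.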

In general it should be rare that you get a constant number of cells over all rows with distinct delimiters, but you've identified one of the cases where this can happen. The easiest fix would be to add bytearray(b'...') as a known data type, so that the correct dialect gets the highest score. This is a little bit hacky, but it doesn't seem to negatively affect the detection results, so I'll add that fix to the package.
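As a rough illustration of why recognizing this extra type helps, the sketch below uses a hypothetical regular expression for `bytearray(b'...')` cells and recomputes Q for the "," dialect. The pattern is an assumption made for this example and is not necessarily the one used in the package. The sketch also assumes that all three comma-separated cells are then recognized (T = 3/3) and that the pattern scores P stay as printed above.

```python
import re

# Hypothetical pattern for a cell such as bytearray(b'fake data').
# This is an assumption for illustration, not CleverCSV's actual regex.
BYTEARRAY_RE = re.compile(r"^bytearray\(b(['\"]).*\1\)$")


def is_bytearray_cell(cell):
    """Return True if the whole cell looks like a bytearray literal."""
    return BYTEARRAY_RE.match(cell) is not None


# With "," the first cell is a complete bytearray literal ...
print(is_bytearray_cell("bytearray(b'fake data')"))
# ... whereas with ":" the first cell has trailing text and does not match.
print(is_bytearray_cell("bytearray(b'fake data'),20"))

# Assumed effect on the scores: T for "," rises from 2/3 to 3/3,
# while the ":" dialect keeps T = 3/5.
q_comma_fixed = 14.666667 * (3 / 3)
q_colon = 17.600000 * (3 / 5)
print(q_comma_fixed > q_colon)
```

Under these assumptions the "," dialect ends up with the higher Q, which is the outcome the fix is aiming for.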

P.s.: If you're really curious, you can read more about how the dialect detection works here 😄

hcheng2002cn · 2021-04-28T19:00:56Z

@GjjvdBurg Thanks so much for the information!

GjjvdBurg · 2021-04-28T23:27:14Z

@hcheng2002cn I've updated the package to include the fix, could you let me know if this indeed solves your issue? Thanks!

hcheng2002cn · 2021-04-28T23:33:55Z

Sure, thanks! I will post an update once I have tested it.

GjjvdBurg closed this as completed in e31b3ff Apr 28, 2021

GjjvdBurg added a commit that referenced this issue Apr 28, 2021

CleverCSV Release 0.6.8

87e934e

* Add a "bytearray" type to address a specific failure case (#35).
* Minor clarifications to licensing.
