delimiter is not correct for some csv file #35

Closed
hcheng2002cn opened this issue Apr 27, 2021 · 5 comments

Comments

@hcheng2002cn

Hi,

We have a sample CSV file:

bytearray(b'fake data'),20:53:06,2019-09-01T19:28:21
bytearray(b'fake data'),19:33:15,2005-02-15T19:10:31
bytearray(b'fake data'),10:43:05,1992-10-12T14:49:24
bytearray(b'fake data'),10:36:49,1999-07-18T17:27:55
bytearray(b'fake data'),03:33:35,1982-04-24T17:38:45
bytearray(b'fake data'),14:49:47,1983-01-05T22:17:42
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45

The delimiter guessed by clevercsv is ":", but it should be ",".

Thanks.
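For anyone reading along, a possible interim workaround is to skip detection and state the delimiter explicitly. The sketch below uses only Python's standard `csv` module on two rows copied from the sample above; it is an illustration of the expected parse, not a CleverCSV API.

```python
import csv
import io

# Two rows copied from the sample file in this issue.
sample = (
    "bytearray(b'fake data'),20:53:06,2019-09-01T19:28:21\n"
    "bytearray(b'fake data'),19:33:15,2005-02-15T19:10:31\n"
)

# Pass the delimiter explicitly instead of relying on detection.
rows = list(csv.reader(io.StringIO(sample), delimiter=","))

for row in rows:
    print(row)
```

Each row should come back as three cells: the bytearray literal, the time, and the timestamp.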

@hcheng2002cn
Author

Maybe it is because each row has four ":" characters, which is more than the two ","?

@GjjvdBurg
Collaborator

Hi @hcheng2002cn, thanks for raising this issue. It's indeed because you have four : and two , that the scoring method gets confused. Here is the full output from CleverCSV, which you might have seen already:

$ clevercsv detect -v test.csv
... 
SimpleDialect(',', '', ''):	P =       14.666667	T =        0.666667	Q =        9.777778
...
SimpleDialect(':', '', ''):	P =       17.600000	T =        0.600000	Q =       10.560000
Detected: SimpleDialect(':', '', '')

(I've removed some output for conciseness). In this output P is the pattern score, T is the type score, and Q is their product (higher Q wins). Using the : delimiter you get five columns, three of which have a recognized type (resulting in T = 3/5 = 0.6), whereas with , as delimiter you get three columns, two of which have a recognized type (so T = 2/3).
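The arithmetic above can be checked with a short sketch. The P values are copied from the verbose output and the T fractions from the explanation; the naive `str.split` is only an assumption to illustrate the cell counts and is not how CleverCSV parses rows.

```python
# First row of the sample file from this issue.
row = "bytearray(b'fake data'),20:53:06,2019-09-01T19:28:21"

# Naive split (ignores quoting) just to count cells per candidate delimiter.
cells_by_delim = {d: row.split(d) for d in (",", ":")}

# Pattern scores copied from the `clevercsv detect -v` output.
P = {",": 14.666667, ":": 17.600000}
# Type scores: recognized columns / total columns, as stated in the comment.
T = {",": 2 / 3, ":": 3 / 5}

# Q is the product of the two scores; the highest Q wins.
Q = {d: P[d] * T[d] for d in P}
detected = max(Q, key=Q.get)

for d in P:
    print(repr(d), len(cells_by_delim[d]), "cells, Q =", round(Q[d], 6))
print("detected:", repr(detected))
```

The larger pattern score for ":" outweighs its lower type score, which is why ":" wins here.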

In general it should be rare that you get a constant number of cells over all rows with distinct delimiters, but you've identified one of the cases where this can happen. The easiest fix would be to add bytearray(b'...') as a known data type, so that the correct dialect gets the highest score. This is a little bit hacky, but it doesn't seem to negatively affect the detection results, so I'll add that fix to the package.
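As a rough sketch of why recognizing the bytearray cell changes the outcome: the regular expression and the helper name below are hypothetical and are not the pattern used in the package. With the first column typed, T for "," becomes 3/3, while the ":" split is unaffected because its first cell still ends in ",20".

```python
import re

# Hypothetical pattern for cells like bytearray(b'...'); not CleverCSV's own.
BYTEARRAY_RE = re.compile(r"^bytearray\(b(['\"]).*\1\)$")


def is_bytearray_literal(cell: str) -> bool:
    """Return True if the whole cell looks like a bytearray(b'...') literal."""
    return BYTEARRAY_RE.match(cell) is not None


# First cell under each candidate delimiter (from the sample rows).
comma_cell = "bytearray(b'fake data')"
colon_cell = "bytearray(b'fake data'),20"

# Pattern scores copied from the verbose output in this thread.
P = {",": 14.666667, ":": 17.600000}
# Type scores once the bytearray cell counts as a known type for ",".
T = {
    ",": (2 + is_bytearray_literal(comma_cell)) / 3,
    ":": (3 + is_bytearray_literal(colon_cell)) / 5,
}
Q = {d: P[d] * T[d] for d in P}
detected = max(Q, key=Q.get)
print("detected:", repr(detected))
```

Under these assumptions the comma dialect scores about 14.67 against 10.56 for the colon dialect.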

P.S.: If you're really curious, you can read more about how the dialect detection works here 😄

@hcheng2002cn
Author

hcheng2002cn commented Apr 28, 2021

@GjjvdBurg Thanks so much for the information!

GjjvdBurg added a commit that referenced this issue Apr 28, 2021
* Add a "bytearray" type to address a specific failure case (#35).
* Minor clarifications to licensing.

@GjjvdBurg
Collaborator

@hcheng2002cn I've updated the package to include the fix. Could you let me know whether this solves your issue? Thanks!

@hcheng2002cn
Author

hcheng2002cn commented Apr 28, 2021 via email
