Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Irreversible empty string handling by fread() and fwrite() #2214

Closed
ezwelty opened this issue Jun 22, 2017 · 4 comments
Closed

Irreversible empty string handling by fread() and fwrite() #2214

ezwelty opened this issue Jun 22, 2017 · 4 comments

Comments

@ezwelty
Copy link

ezwelty commented Jun 22, 2017

First off, thank you for this fantastic package. It effortlessly powers many of my data caving adventures.

Sometimes, it's necessary to distinguish between null (NA) and empty ("") strings, and I'm trying to establish a pipeline that preserves this distinction with minimal markup. This doesn't currently work. To work, fread() would need to distinguish between and "", and fwrite() ideally would quote empty strings when quote = "auto".

Consider this data.table:

dt <- data.table::data.table(chr = c(NA, "", "a"), num = c(NA, NA, 1))

Here is the fwrite output with quote="auto":

csv <- paste(
  capture.output(
    data.table::fwrite(dt, quote = "auto")
  ),
  collapse = "\n"
))
cat(csv)
chr,num
,
,
a,1

The empty string is not quoted, and thus indistinguishable from the null string. They are both read back in as empty strings:

data.table::fread(csv)
   chr num
1:      NA
2:      NA
3:   a   1

If instead we force quotes, the distinction is kept between the null and empty strings:

csv_quoted <- paste(
  capture.output(
    data.table::fwrite(dt, quote = TRUE)
  ),
  collapse = "\n"
)
cat(csv_quoted)
"chr","num"
,
"",
"a",1

However, there is no way to read them back in as such. Either they are both empty:

data.table::fread(csv_quoted)
   chr num
1:      NA
2:      NA
3:   a   1

Or both null:

data.table::fread(csv_quoted, na.strings = "")
   chr num
1:  NA  NA
2:  NA  NA
3:   a   1
@MichaelChirico
Copy link
Member

MichaelChirico commented Jun 22, 2017 via email

@ezwelty
Copy link
Author

ezwelty commented Jun 22, 2017

@MichaelChirico Sorry, I should have added that:

  • Unfortunately "NA" (or any other stand-in) needs to be a valid character string.
  • The files aren't necessarily produced by me.

For context, this is in an attempt to implement the Tabular Data Package specification, which requires a distinction to be made between empty and null strings. This is easily done in JSON with null, but seems less standard in CSV.

So can I read this file:

cat("chr,num\n,\n\"\",\na,1")
chr,num
,
"",
a,1

in as:

data.table::data.table(chr = c(NA, "", "a"), num = c(NA, NA, 1))

and back out again without data loss? At least not currently, although fwrite() with quote=TRUE clearly knows the difference between null and empty strings.

Is this a current limitation or somehow anti-csv and thus by-design?

@st-pasha
Copy link
Contributor

st-pasha commented Jun 22, 2017

So assuming that the correct representation of the file is

chr,num↵
,↵
"",↵
"a",1↵

-- there are 2 issues here: one with fread and another with fwrite.

In order for fwrite to create a file like this with quote="auto", it would first need to check whether there are any NAs in each string column, and if there are then quote every field in the column. Such check is possible, but would increase the run time. Alternatively, we could do no checks up-front, and then only force-quote empty fields. This would impose no runtime overhead and would be sufficient for fread to understand the purpose, but is not guaranteed to be well-understood by other csv readers.

In order for fread to correctly read a file like this, there should be two separate "types" of string fields: one which treats void fields as empty strings, and another that treats them as NAs. Then upon seeing first quoted empty string "" the first type would bump to the second. This seems like a relatively simple change, however not much can be done while Matt is on vacation.

Since we need a separate Issue per each feature request, I'm splitting this into two.

@ezwelty
Copy link
Author

ezwelty commented Jun 24, 2017

@st-pasha That's an excellent overview of the issue(s), thank you. My uninformed opinion would favor only force-quoting empty strings with quote="auto" – as I see you have, since one could just use quote=TRUE to quote all strings but NA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants