I'm working with a series of files, one of which has the UTF-8 BOM (byte order mark) at the beginning of the file, i.e. the three bytes 0xEF 0xBB 0xBF.
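For anyone wanting to check whether a given file carries the BOM, here is a minimal sketch in base R. The helper name `has_utf8_bom` is hypothetical (it is not part of base R or data.table), and the temporary file only stands in for the real data.

```r
# Hypothetical helper (not part of base R or data.table): returns TRUE if the
# file starts with the three-byte UTF-8 BOM, EF BB BF.
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Self-contained demonstration on a temporary file that starts with a BOM
# and has trailing white space on its only line.
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("000067182Abel^x  \n")), tmp)
has_utf8_bom(tmp)
```

Reading the bytes with `readBin` avoids any encoding-aware text layer that might silently drop or convert the BOM before you get to look at it.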
As noted here, the default behavior of read.csv is now to detect and delete the BOM. Unfortunately, for me at least, fread seems to have converted the three characters into a space.
Fortunately, strip.white removes this before returning the data.table; unfortunately, my file also has lots of important trailing white space, so I need to set strip.white = FALSE, negating this.
That is, it has treated the first 3 characters as being a space. With strip.white = TRUE, this space disappears in the output.
I compare this to the behavior of read.csv (also a nuisance to use because the file is on the large side):
> read.csv("11STAFF.txt", sep = "^", header = FALSE, stringsAsFactors = FALSE)$V1[1]
[1] "000067182Abel Nancy FW19554 2011R187 70 70 45880 21809 1 00070007030020530050KGKG1616N100 Abbotsford Sch Dist Abbotsford Elementary 61010Clark County 04PO Box A Abbotsford WI 54405-0901 510 W Hemlock St Abbotsford WI 54405 Abbotsford WI54405-0901Abbotsford WI54405 715-223-4281 Gary Gunderson NNN "
That is, read.csv seems to have deleted the BOM and kept the trailing white space. Just a shame that it's so slow.
For now, I've simply added deleting the BOM to my clean-up routine alluded to here, but it seems like fread should match the behavior of read.csv here.
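A workaround of this kind could look roughly like the sketch below. It is not the actual clean-up routine referred to above. The helper `strip_utf8_bom` is a hypothetical name, and the approach (copying the file without its first three bytes, then reading the copy) is just one option, assuming the file fits in memory.

```r
# Hypothetical helper: copy `path` to a new file without a leading UTF-8 BOM.
# All other bytes, including trailing white space, are left untouched.
strip_utf8_bom <- function(path, out = tempfile(fileext = ".txt")) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  writeBin(bytes, out)
  out
}

# Self-contained demonstration on a small temporary file with a BOM.
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("000067182Abel  ^x  \n")), tmp)
clean <- strip_utf8_bom(tmp)
first_line <- readLines(clean, n = 1L)

# With the BOM gone, the copy can be read with strip.white = FALSE
# (only attempted if data.table is installed).
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(clean, sep = "^", header = FALSE, strip.white = FALSE)
}
```

Working on raw bytes keeps the trailing white space exactly as it is in the source file, which is the property that matters here.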
Here's a link to the file I'm working with (caveat clickor: it's a scary executable link, and also of non-trivial size, ~80 MB. For whatever reason they decided to "zip" the file with an executable. My only word of reassurance is that you can tell it's a US government website): http://lbstat.dpi.wi.gov/sites/default/files/imce/lbstat/exe/11STAFF.exe
To see the BOM, run:
Here's some relevant output from fread with verbose = TRUE: