[R-Forge #5360] Add fill=T to fread #536
Requested here as well (on a file from CBOE). It seems likely that Excel generates such files. They could potentially be quite large, so it would be worth fread supporting them: http://stackoverflow.com/questions/25339552/how-to-read-cboe-csv-file-using-data-table/25341502?noredirect=1#comment39526821_25341502
For what it is worth, our problematic dataset is data from the Clinical Practice Research Datalink in the UK (the additional clinical details file, where things like blood pressure, cholesterol, body weight, etc. are stored). It is very commonly used in epidemiology and health services research. That one is not Excel-based.
I often have data dumps taken from ad server data in a list format: Reading with a fill=T flag would create a binary matrix
becomes
Now with a quick awk script I can transform these:
awk -F '[ ,\\[\\]]+' '{for (i=2; i<NF; i++) print $1,$i}' $1 >> "transformed_$1"
I am then able to use fread and post-process (I personally read into a sparse data matrix). But the use case is obviously much more about not having to awk data files prior to reading and then converting. This proves to be significantly faster than something like:
# Read a ragged file by sizing the table to the widest row.
ReadMaxCSVCols <- function(f, sep = ",", quote = "\"'", header = FALSE, ...) {
  # count.fields scans every line, so nc is the maximum field count in the file
  nc <- max(count.fields(f, sep = sep, quote = quote))
  read.table(f,
             sep = sep,
             quote = quote,
             header = header,
             fill = TRUE,  # pad short rows with NA
             col.names = paste0("V", seq_len(nc)),
             ...)
}
foo <- data.table(ReadMaxCSVCols("myfile.txt"))
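To illustrate why the `count.fields` step in the workaround above matters, here is a minimal base R sketch. The file contents are made up for illustration. `read.table` documents that it infers the number of columns from the first five lines of input unless `col.names` is longer, so a wide row that appears later is not accounted for unless the width is supplied up front.

```r
# Made-up ragged file: the widest row is the sixth line.
f <- tempfile(fileext = ".txt")
writeLines(c("1,2", "3,4", "5,6", "7,8", "9,10", "11,12,13,14"), f)

# Without col.names, the column count comes from the first five lines only.
naive <- read.table(f, sep = ",", header = FALSE, fill = TRUE)

# Supplying col.names sized to the widest row keeps every field in its own column.
nc <- max(count.fields(f, sep = ","))
padded <- read.table(f, sep = ",", header = FALSE, fill = TRUE,
                     col.names = paste0("V", seq_len(nc)))
unlink(f)

print(dim(naive))
print(padded)
```

The cost is that the file is scanned twice, once by `count.fields` and once by `read.table`, which is part of why this route is slow on large files.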
@mpearmain Thanks, really useful. Tuple columns like VALUE was what
@markdanese Great, yes very useful to know, thanks. Could you post a link to a sample file perhaps (or a made-up example of 3 or 4 lines that's close would be great). I had a look at http://www.cprd.com/ and it seems huge and varied ... and interesting. We could do fill=TRUE, but might sep2= into a
The list probably won't help. It is a simple flat file and would probably be easiest as columns, to create a complete table. I took a small file and changed individual digits randomly so that this is not identifiable. This Dropbox link should allow you to get the .txt file: Thanks for your help, and let me know if this file doesn't work.
Hi Matt, I think you've hit the nail on the head regarding what you want to do afterwards. To me the main use is to load as fast as possible and with a consistent structure, and the list mechanism would allow for this. I'm looking to do binary matrix factorization, so a full or sparse matrix is the end point and the list isn't ideal, but it adds structure. If I am given a list of cols, I can of course transform it into a DT or matrix; my concern is the overhead of the transform operation, which means running a few awk or sed scripts beforehand may still be the best option in my situation.
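As a rough sketch of the transform being discussed, the following base R code splits list-style lines into long (id, value) pairs and then into a 0/1 matrix. The input format (`id [v1, v2, ...]`) and the sample values are assumptions inferred from the field separators in the awk one-liner, since the original example data is not shown here.

```r
# Hypothetical input lines; the real ad server format may differ.
lines <- c("u1 [a, b, c]", "u2 [b]", "u3 [a, d]")

# Split on runs of spaces, commas and square brackets.
parts <- strsplit(lines, "[\\[\\] ,]+", perl = TRUE)

# Long format: one row per (id, value) pair.
long <- data.frame(
  id    = rep(vapply(parts, `[`, character(1), 1L), lengths(parts) - 1L),
  value = unlist(lapply(parts, `[`, -1L)),
  stringsAsFactors = FALSE
)

# Dense 0/1 incidence matrix: rows are ids, columns are distinct values.
m <- (table(long$id, long$value) > 0) * 1L

print(long)
print(m)
```

For large inputs a sparse representation built from the long pairs would be the more likely end point; the dense matrix here is only to keep the sketch dependency-free.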
Submitted by: Michele Carriero; Assigned to: Nobody; R-Forge link
Since this option is being added to rbind I wonder if it could be added to fread too, in order to reflect the read.table feature.
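For reference, this is a small sketch of the `read.table` behaviour the request refers to, using made-up inline data: with `fill = TRUE`, rows that have fewer fields than the widest row are padded with `NA` instead of raising an error.

```r
# Made-up ragged input: the second and third data rows are short.
txt <- "a,b,c
1,2,3
4,5
6"

df <- read.table(text = txt, sep = ",", header = TRUE, fill = TRUE)
print(df)
```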