[R-Forge #4931] Support file connections for fread #561
Comments
This would be a pretty awesome feature that I am after as well!! |
I agree, this would be great. |
👍 I need this! |
An interesting use case for this feature would be reading chunks from the CSV file and passing each chunk to a worker that computes additive and/or semi-additive metrics. I have a working example using read.table (read.csv2) to deal with a CSV file that doesn't fit in available memory (assuming all fields are needed in the processing). However, it is still not possible to use all workers efficiently given the slow nature of read.table. I have great expectations for fread with a file connection as input. The snippet (a reconstructed sketch; the iterator internals and worker body were lost in extraction):
library(doSNOW)   # SOCK cluster (Windows; memory copy-on-call; slower)
library(data.table)
chunkSize <- 250000
conn <- file("big.csv", open = "r")
it <- iter(function() {
  # read the next chunk from the open connection
  as.data.table(read.csv2(conn, nrows = chunkSize, header = FALSE))
})
somefun <- function(dt) {
  # per-chunk aggregation (body omitted in the original comment)
}
allaggreg <- foreach(slice = it, .packages = 'data.table',
                     .combine = 'rbind', .inorder = FALSE) %dopar%
  somefun(slice)
setkey(allaggreg, CATEGORICAL_VARIABLE_A)
close(conn) |
+1. I wish I could fread(bzfile("file.csv")). |
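A workaround worth noting here (an addition on my part, not stated in the thread): fread can consume the output of a shell command via its cmd= argument, e.g. fread(cmd = "gzip -dc file.csv.gz"); the same pattern applies with bzcat for .bz2 files. The shell side of that idea, with hypothetical file names, is just streaming decompression through a pipe:

```shell
# Hypothetical file names; the uncompressed data never touches disk,
# it is decompressed to stdout and piped onward.
printf 'a,b\n1,2\n3,4\n' > demo.csv
gzip -f demo.csv                 # writes demo.csv.gz, removes demo.csv
gzip -dc demo.csv.gz             # stream-decompress; pipe this anywhere
```

fread reads whatever the command writes to stdout, so no temporary uncompressed copy is created.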
+1. I'd like to process |
@mauriciocramos - does this not work for your case:
or
|
I would like this too! I have a huge CSV file (about 7 GB) and a small amount of RAM (about 8 GB). I chunk with a for loop and the skip and nrows parameters to extract some features. While it is efficient at the beginning, it becomes very slow toward the end. I would like to remember where I was in the file, instead of scanning from the start each time to find the beginning of my chunk. I think using a connection could help. I hope I was clear; thank you in advance. |
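The slowdown described above comes from skip= re-scanning the file from the top for every chunk. A sketch of the alternative being asked for (remember a byte offset, resume there), using hypothetical file names:

```shell
printf 'header\nrow1\nrow2\nrow3\n' > big.csv
# Read the first chunk (header + 1 row) and record how many bytes it used.
offset=$(head -n 2 big.csv | wc -c)
# Resume directly at the saved offset instead of re-reading skipped lines.
tail -c +$((offset + 1)) big.csv
```

Each chunk costs only the bytes it contains, so the read stays fast at the end of the file, which is exactly what skip= cannot provide.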
@st-pasha made several relevant points to this issue in #1721, for example:
Based on those points I actually don't think data.table needs to support general file connections or chunking. Sure, it would be convenient, but probably not worth the future trouble. There are already several existing solutions:
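One of the existing approaches in that spirit (my illustration, with hypothetical file names): split the file on disk first, then point fread at each piece, since each piece is an ordinary file that fread handles today:

```shell
printf 'r1\nr2\nr3\nr4\nr5\n' > data.csv
split -l 2 data.csv chunk_       # produces chunk_aa, chunk_ab, chunk_ac
for f in chunk_*; do
  # each piece can be passed to fread(f) as a plain file path
  wc -l < "$f"
done
```

The cost is the extra disk copy, which is the trade-off the connection/chunking proposals are trying to avoid.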
|
a note to explore after implementing: use
instead of outputting it to disk as is done now if I'm not mistaken. |
Curious here if
At a glance I think this implementation doesn't satisfy the "chunked read" use case; am I missing anything else? |
@MichaelChirico This would not satisfy my usecase. The idea of using large bzip'd files is precisely to avoid spilling to (slow) disk. If |
I hope the feature would enable reading a few lines of the file at a time and processing those lines much faster than with readLines. |
If it's just a few lines, |
For example, I want to go through a gz file, which is from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz and is more than 2 GB. If I use |
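For the gene2accession.gz case above, one streaming alternative (my sketch, with a hypothetical stand-in file) is to pipe gzip -dc into a line-range filter, so only the window of interest is ever materialized rather than the full multi-GB decompression:

```shell
printf 'line1\nline2\nline3\nline4\n' | gzip > demo.gz
# Stream-decompress and keep only lines 2-3; no uncompressed temp file.
gzip -dc demo.gz | sed -n '2,3p'
```

Advancing the sed range on each pass gives a crude chunked read over the compressed file without connection support in fread itself.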
Or, can |
+1. for |
Submitted by: Chris Neff; Assigned to: Nobody; R-Forge link
I use a corporate internal networked file system for much of my data, and so oftentimes I need to call read.csv with a file connection. fread doesn't support this yet.
Namely I would like the following to work:
f = file("~/path/to/file.csv")
dt = fread(f)