readcsv performance #3350

I am loading a 100MB csv file using `readcsv`, and it takes 70 seconds. The file contains comma-separated values that are a mix of integers and strings. It is 1,600,000 rows and 9 columns. Some rows have a 10th column as well, but that just becomes part of the last column. The profiler reveals that a majority of the time is spent in `split`, which is not unexpected, but it would be nice to load such files quickly.

Cc: @tanmaykm
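For anyone wanting to reproduce this kind of measurement, a minimal profiling sketch (the file path is a placeholder, and this is written against current Julia, where `readcsv` has become `DelimitedFiles.readdlm`, rather than the 0.2-era API in this thread):

```julia
using Profile, DelimitedFiles

# Time the load; "sales.csv" is a placeholder path.
@time data = readdlm("sales.csv", ',')

# Profile it to see where the time goes (e.g. in split/parse).
Profile.clear()
@profile readdlm("sales.csv", ',')
Profile.print(maxdepth = 12)
```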
I noticed that `countlines` uses a local 8K buffer to read data into and works off that. That seems to add to its speed. Should that be adopted in `readuntil` also? Another approach could be to have a CSV data type that just keeps the data as a blob and cells as offsets into it. Then `getindex` can return strings (or any other type) on the fly. This should work well for cases where the CSV is read-only or where the CSV needs to be filtered through a user-supplied function.
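That blob-plus-offsets idea could look roughly like this (a sketch with hypothetical names, not any actual implementation; it assumes an unquoted file where every row, including the last, ends in a newline):

```julia
# Sketch: keep the raw bytes and a matrix of per-cell byte ranges;
# materialize a String only when a cell is actually indexed.
struct BlobCSV
    data::Vector{UInt8}
    cells::Matrix{UnitRange{Int}}   # cells[i, j] = byte range of cell (i, j)
end

Base.getindex(c::BlobCSV, i::Int, j::Int) = String(c.data[c.cells[i, j]])

# Build the offsets for a simple unquoted file with known dimensions.
function BlobCSV(bytes::Vector{UInt8}, nrows::Int, ncols::Int)
    cells = Matrix{UnitRange{Int}}(undef, nrows, ncols)
    start, row, col = 1, 1, 1
    for i in eachindex(bytes)
        row > nrows && break
        b = bytes[i]
        if b == UInt8(',') || b == UInt8('\n')
            cells[row, col] = start:i-1
            start = i + 1
            if b == UInt8('\n')
                row += 1; col = 1
            else
                col += 1
            end
        end
    end
    return BlobCSV(bytes, cells)
end
```

Indexing `c[i, j]` then allocates only the one string it returns, and a read-only filter can walk the byte ranges without allocating at all.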
@loladiro Should …?
Also, I believe that buffered I/O will only help address about 20% of the performance issue here. |
File reading still uses the old I/O system, which is already buffered. |
This may not have to do with buffered I/O, but rather with the sheer number of read calls and the number of strings created. However, reading into a buffer and then working line by line within that buffer does seem to be much faster.
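To illustrate the point (a sketch, not what `readuntil` or `countlines` actually do): the first version below issues a stream read per line and allocates a `String` for each, while the second reads one big chunk and scans it in memory.

```julia
# Many small reads: one readline call (and one String) per row.
function count_fields_per_line(io::IO)
    counts = Int[]
    while !eof(io)
        line = readline(io)
        push!(counts, count(==(','), line) + 1)
    end
    return counts
end

# One big read, then scan the buffer in memory: far fewer I/O calls
# and no per-line String allocation.
function count_fields_buffered(io::IO)
    buf = read(io)              # one read of the whole stream
    counts = Int[]
    fields = 1
    for b in buf
        if b == UInt8(',')
            fields += 1
        elseif b == UInt8('\n')
            push!(counts, fields)
            fields = 1
        end
    end
    return counts
end
```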
We need to skip the encoding check that ….
Comparing our …. The DataFrames ….
R's …. I did try ….
The performance improvements are huge. Anything @tanmaykm can do to make IO faster will be a big gain in the future. @JeffBezanson and I were chatting a while back about …. I will try to get back to ….
Just as a side note: the biggest problem with CSV files in the wild is that they're not purely line-delimited. Lines of a CSV file are allowed to contain raw newlines inside them as long as the field is quoted.
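For illustration, here is a minimal quote-aware record splitter (a sketch of the usual CSV quoting rules, not code from base): a newline only ends a record outside quotes, and `""` inside a quoted field is an escaped quote.

```julia
# Split a CSV byte buffer into records, treating newlines inside
# double-quoted fields as data rather than record terminators.
function split_records(bytes::Vector{UInt8})
    records = String[]
    inquotes = false
    start = 1
    i = 1
    while i <= length(bytes)
        b = bytes[i]
        if b == UInt8('"')
            if inquotes && i < length(bytes) && bytes[i+1] == UInt8('"')
                i += 1              # "" is an escaped quote; stay inside quotes
            else
                inquotes = !inquotes
            end
        elseif b == UInt8('\n') && !inquotes
            push!(records, String(bytes[start:i-1]))
            start = i + 1
        end
        i += 1
    end
    start <= length(bytes) && push!(records, String(bytes[start:end]))
    return records
end

# split_records(Vector{UInt8}("a,\"x\ny\",b\nc,d\n")) returns two records:
# "a,\"x\ny\",b" and "c,d".
```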
With mixed types, you almost always want a DataFrame. However, there are cases where a CSV file is well-formed and is all numbers or all strings, in which case an array is good enough. It would be nice to refactor ….
@johnmyleswhite Even the current …. That said, now that we are at it, it is worth making ….
Put the initial code up as a package at https://github.com/tanmaykm/CSV.jl for comments. It exports a ….
Interesting thread. I've actually been working on related work for the Sqlite package over the last few days in writing `readdlmsql`:

```julia
@time Sqlite.readdlmsql(Pkg.dir() * "/Sqlite/test/sales.csv";sql="select * from sales",name="sales")
elapsed time: 13.848334564 seconds

@time readcsv2(Pkg.dir() * "/Sqlite/test/sales.csv")
elapsed time: 10.939370505 seconds
```

This is using @tanmaykm's new `readcsv2`. For comparison, R's `read.csv` on the same file:

```r
system.time(t <- read.csv("C:/Users/karbarcca/Google Drive/Dropbox/Dropbox/Sears/Teradata/Sears-Julia/sales.csv"))
   user  system elapsed
   7.75    0.13    8.02
```

Definitely some work to do here overall to get Julia up to speed. I just pushed the new update to the Sqlite package.
Since SQLite is compiled C code, it would be expected to perform better, particularly when most of the columns are integers and there is little overhead in receiving them in Julia. It will be interesting to see how much can be gained in Julia by having numbers parsed directly out of byte buffers.
I guess that for columns that one knows are ints, either inferred or supplied by the user, it is worth avoiding the overhead of creating substrings. The experiment of extending ….
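A sketch of what that could look like (my own illustration, not `readcsv` internals): parse an integer straight out of a byte range of the buffer, with no `String` or `SubString` in between.

```julia
# Parse a (possibly negative) decimal integer directly from a byte
# range of a buffer, with no intermediate string allocation.
# Assumes the range is non-empty and contains only '-' and digits.
function parse_int_bytes(buf::Vector{UInt8}, range::UnitRange{Int})
    i, stop = first(range), last(range)
    neg = false
    if buf[i] == UInt8('-')
        neg = true
        i += 1
    end
    n = 0
    while i <= stop
        b = buf[i]
        UInt8('0') <= b <= UInt8('9') || error("not a digit at byte $i")
        n = 10n + Int(b - UInt8('0'))
        i += 1
    end
    return neg ? -n : n
end

# parse_int_bytes(Vector{UInt8}("42,-7"), 1:2) == 42
# parse_int_bytes(Vector{UInt8}("42,-7"), 4:5) == -7
```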
There are two separate file formats: tsv/dlm files, where each row is one line, and CSV, which is more complicated. We could support only the first in base, and have `readcsv` in DataFrames.
I think we are using CSV loosely as a term for the simple files, where each row is one line, for the purposes of this discussion. However, I agree that …. I think that the current ….
In the past few days there was a 2x regression in the performance of ….
One thing I was going to mention with regard to "CSV files in the wild" with quoted values containing …. Would this be useful in base …?
I think we should push all these real-world requirements into DataFrames.