CSV.read() return value #2
I'm still working out the interface for using Julia functions on a Table type, since it's entirely compatible through SQLite's UDF interface; I just need to iron out how exactly a user does that.
I thought I'd seen something about a NamedTuple package; I wonder if that might work well.
In my data loading experiments, I used a Union of arrays of Nullables, e.g. Union(Array{Nullable{Int64},1}, ...). This is essentially the typed version of Pandas's basic data structure and seems to work well enough. That said, my focus was on solving the performance issues readdlm and DataFrames' read had (vs. R and Pandas) when trying to load more than 10 million rows, which is a different problem from writing a generic CSV parser. And it works, in that we beat R's read.table handily and come within a constant factor of data.table's fread. In a production-grade system, inferring a column's type (vs. declaring it) gets tricky, but is still doable. I am experimenting with approaches to this that don't lose performance, and also with implementing operators on the loaded dataset, which gets interesting in the presence of really large datasets and calls for query optimization, etc. In short, I'm focused on loading and working with really large datasets, which is a different, if intersecting, problem from writing a generic CSV parser; that is much more complex, with many gnarly special cases. This might be blasphemy, but I think the right approach is to just choose a reasonable data structure and go with it, then test for performance/memory problems and refactor as required. I vote for Dict{String,NullableArray{T}}. My only suggestion, FWIW, is that you might want to do timing/memory checks with large (10-100 million row) datasets post-implementation, to confirm we are getting better performance than @jiahao found with the existing readdlm implementations (see JuliaLang/julia#10428 for details). I'm fairly sure that Dict{String,NullableArray{T}} will work, and worst case someone can write a converter to whatever other data structure is required. My 2 cents. @johnmyleswhite @davidagold @StefanKarpinski @jiahao @quinnj
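(For concreteness, a rough sketch of what that kind of return value could look like; this assumes the NullableArrays.jl constructors and is an illustration only, not CSV.jl's actual API.)
using NullableArrays

# Hypothetical result for a three-column file: one NullableArray per column,
# keyed by header name; each column keeps its own element type and a Bool
# missingness mask.
result = Dict(
    "id"    => NullableArray([1, 2, 3]),                              # complete Int column
    "name"  => NullableArray(["a", "b", "c"], [false, true, false]),  # second entry missing
    "score" => NullableArray([1.5, 2.5, 3.5]),                        # Float64 column
)

score = result["score"]   # columns are looked up by header name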
I really don't like either of the Dict{String,...} approaches. It means that you have to grow N different vectors (N == number of columns) as you read data, and what happens when there is no header?
I really would love to have a
I think that the CSV reader should ideally be composable with different consumers that construct different data types. I.e. the parsing and type inference logic should go in one place, while different consumers can provide different logic for constructing different data structures. Going even further, it should be possible to decouple the input format from this as well, so that you can have N data formats and M data types and not require N*M readers to handle all possible combinations. The tricky bit is getting the abstraction right to allow this flexibility without sacrificing performance.
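(A minimal sketch of that decoupling, with hypothetical names; nothing here is an existing API.)
# The parser side only hands over typed cells; each consumer decides how to
# store them. N formats and M sinks then compose without N*M dedicated readers.
abstract ParsedSource   # one subtype per input format (CSV, TSV, fixed-width, ...)
abstract Sink           # one subtype per output structure (Dict of columns, DataFrame, SQLite table, ...)

# The only contract between the two sides:
#   eachcell(source)              returns an iterator of (row, col, typed value)
#   store!(sink, row, col, value) puts the value wherever the sink wants
function stream!(source::ParsedSource, sink::Sink)
    for (row, col, val) in eachcell(source)
        store!(sink, row, col, val)
    end
    return sink
end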
Will comment more on this later, but I started making an abstract CSV parser once in the past: https://github.com/johnmyleswhite/CSVReaders.jl. FWIW, I'm much more excited about Jacob's work than about finishing the one I started.
Just looked at that; it looks like a nice abstraction. Would it make sense to fit Jacob's code into that framework, adding support for some of the stuff that Jacob described above?
@johnmyleswhite, how do you feel about the idea of that abstraction, having given it a try? Are you meh on the idea of that decoupling in principle, or just want a fresh start on the implementation and/or design?
I like the idea as a goal to aspire to, but I'd have to work on it again to say whether it could ever be made to work. It's a use case where you need to be obsessive about inlining, but I'm still not sure that's sufficient, because some type information might not be available at the right points in time.
FWIW, my Pipelines.jl idea is the same kind of I/O abstraction for hooking up arbitrary readers/writers. I've come to realize that it's certainly more of a "phase 2" kind of idea, where a lot of the actual readers/writers still need a lot of development (hence my push on CSV, SQLite, ODBC, etc.). I certainly think it's the right long-term idea, but for now I'm more focused on doing a few connections between types (CSV=>Julia structure, CSV=>SQLite table, ODBC=>CSV, ODBC=>SQLite, etc.). For this issue, I'd just like to figure out a good Julia-side structure to return the contents of a CSV file. I think I'm leaning towards Dict{String,NullableVector{T}} or DataFrame with NullableVector{T} columns.
I'd suggest
@quinnj @johnmyleswhite Are you not concerned about 1) what to do with headerless CSV files, and 2) the performance issue of having to expand each column separately when a new row is added?
Good point, John. Where did Jeff's PR for making Base.Dict ordered by default land? It'd be great to avoid another package dependency for this. @ScottPJones, nope, not concerned. For headerless CSVs (which are the minority in my experience), we can auto-generate column names (i.e. "Column1", "Column2", etc.), like DataFrames, R's data.frame, data.table, pandas, etc. all do; that's pretty much standard in these cases. Note you can also provide your own column names if you want. I'm also not sure what your point about "expand each column separately" means. You can't add rows to a Matrix anyway, so whether it's a Dict{String,Vector} or a DataFrame, it's the same operation of adding additional elements to the individual columns. I'm also not sure why we would necessarily care about the performance of adding rows to a Julia structure in a CSV parsing package; that seems like a concern for Base or DataStructures or DataFrames.
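(A tiny sketch of that auto-naming fallback; the helper name is hypothetical, not CSV.jl's actual function.)
# With no header row, synthesize "Column1", "Column2", ... unless the caller
# supplied names explicitly.
function default_column_names(ncols::Int)
    return ["Column$i" for i = 1:ncols]
end

default_column_names(3)   # => ["Column1", "Column2", "Column3"]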
When you read the next row from a CSV, then you need to
No, all the fastest CSV readers pre-allocate vectors for parsing; no pushing necessary.
Yes, just like DataFrames and, to a certain extent, Julia arrays in general (all column-oriented).
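(For illustration, a tiny sketch of the pre-allocation pattern mentioned above, assuming the row count, or a good estimate of it, is known from an initial scan of the file.)
# Columns are sized once up front and then filled by index; no per-row push!
# or growing of vectors is needed.
nrows = 1_000_000
col = Array(Float64, nrows)            # pre-allocated column (0.4-style constructor)
for row = 1:nrows
    col[row] = parse(Float64, "3.14")  # stand-in for parsing one field of one row
end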
That still doesn't address the issue of having to hit M (number of columns) cache lines for every set or access.
@ScottPJones, I'd love to see you take a crack at a CSV parser! I think it may open your eyes a little more to the real costs and issues; it feels a little like you're picking specific topics you think may be issues without really looking at the current code or having tried things out yourself. Potential cache-line misses and most other implementation nuances like this are totally swamped by the actual text parsing => typed value operation. Do some profiling on the current code. The current issue with row-orientation is the fact that you can't actually make tuples! Go ahead, try to write a row parser that returns a tuple. That's what Keno is currently working on with JuliaLang/julia#12113. Even with that functionality, I'm still not sure about the tradeoff of having to continuously allocate row tuples while parsing vs. the current industry-wide approach of pre-allocating typed column vectors. Sure, there are interface arguments where someone may want to parse by iterating over row tuples, but from an overall performance perspective, I'm going to guess that you're going to have a hard time making row-oriented parsing as fast as column-oriented parsing; not to mention that other languages' CSV readers would probably all be row-oriented if that were a magic solution for speed.
@quinnj I have written a CSV parser before, and I'm just raising the issues that I remember from years ago (plus ones that are specific to Julia). I'm not sure about the "current industry-wide approach" you talk about; can you give examples?
@ScottPJones, your participation in this discussion is becoming increasingly frustrating. I would really recommend you take some time to look at the code in this repository, the DataFrames readtable code, and John's CSVReaders code, and take some time to actually code something that does it better, to get a better feel for the kind of issues we're discussing for parsing CSV files into appropriately typed Julia structures. You may have a lot of "industry experience", but so far I feel like I'm catching you up to what we're actually trying to do here vs. other, current best-in-class CSV readers (which I've already mentioned several times so far). If you just want a python or Java CSV reader, Base has
@quinnj Why are you getting frustrated?
@ScottPJones I do agree with @quinnj that your comments here are quite frustrating. It sounds like you have mounted a legal defense against every line Jacob has said. You have brought a number of unrelated issues into this comment, and how is "someone saying something at JuliaCon" even meaningful to support any argument? And then there is the time-tested "I have implemented everything before and I know what is right" argument. What do cache lines have to do with this discussion? Please see the original topic. I suggest you please delete all your comments on this thread so that we can keep the discussion to the topic at hand.
@ViralBShah I would say that @quinnj's comments have been rather frustrating. He started this off by asking for recommendations, and since I've implemented a CSV parser before, I responded. Would you rather that only people with no experience implementing this before respond? Your comment is actually the one that has absolutely nothing technical to contribute.
Ok, if you reject the examples of Pandas and R as good CSV readers, please point us to the CSV reader we should be looking at or comparing against instead.
@ScottPJones Talking about the performance of a data structure without specifying an access pattern is meaningless. Column oriented:
Since the majority use case is linear processing, column oriented is definitely the way to go here. If you want to do random access and find out that memory access is limiting (i.e. even with random access you have a small, very hot working set that barely fits in cache), then you'll have to build an appropriate data structure and ingest the CSV into it (i.e. what databases do). The goal is not to have the CSV parser spit out a data structure which works for every use case, since such a thing does not exist.
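(For concreteness, a rough illustration of the two layouts being contrasted; toy data only, not actual CSV.jl structures.)
# Column-oriented: one contiguous vector per column, so a whole-column pass
# (e.g. a mean) walks memory sequentially.
columns = Dict("id" => [1, 2, 3], "score" => [1.5, 2.5, 3.5])
mean_score = sum(columns["score"]) / length(columns["score"])

# Row-oriented: one element per row holding all of its fields, so reading one
# record touches one place, but a whole-column pass strides across every row.
rows = [(1, 1.5), (2, 2.5), (3, 3.5)]
mean_score_rows = sum([r[2] for r in rows]) / length(rows)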
@JeffBezanson I never rejected them, I just don't see them as being "industry wide"; R in particular is very much specific to a particular segment. @carnaval If you look at what I wrote earlier, I said very specifically that there were definitely use cases where a column oriented structure would be better, and that it would be good if the parser could create either a column oriented structure or something like a
Again, I said that it would be good if it output into different formats, and not just a single column oriented one.
The extant CSV parsers I am aware of fall into two categories.
Category 1 parsers can be implemented in Julia essentially with code like
rows = [split(line, ',') for line in readlines(open("data.csv"))]
Essentially all of @quinnj's options fall into Category 2.
P.S. It turns out that
Regarding the
@jiahao CSV parsers are a bit more complicated than what you've described for "Category 1"; read RFC 4180 for even the simple "standard" CSV format. And just because a parser is "row oriented" doesn't mean that values aren't parsed into numeric values, dates, etc. CSV files are often used as a way of dumping RDBMS tables to be loaded into another DBMS. More recently I've seen more use of JSON for that sort of thing, though.
@ScottPJones what I wrote for the examples I gave in Category 1 is correct. The parsers I linked to literally do not do any other parsing, except for
I did read RFC 4180, which states very clearly in the preamble that "It does not specify an Internet standard of any kind." and on page 4 that 'Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files'.
Thanks for your comment @jiahao; that's a good summary of the landscape that I'm aware of. Also nice find on the csv-spectrum testing suite; those should complement my set nicely. At this point, I think I'm going to give Dict{String,NullableVector{T}} and DataFrame with NullableVector{T} columns both a try and report back with some results.
In general, yes: given enough columns that you fill up L1 by fetching a cache line worth of column data (but not so many that you'll hit L2 anyway before being done with a single row, and provided you do little-to-no computation), you have a potential efficiency cliff here. We could argue about performance speculation all day, but I can already tell you what's going to happen anyway: @quinnj will implement what he needs to make parsing fast in Julia (i.e. columns are easier) and operations typical of numeric work much more efficient (say, mean(column)). If someone (e.g. you?) feels that in one use case it's not efficient enough, (s)he'll be free to contribute the additional flexibility.
@jiahao I don't see how the code there with
What happens to CSV files that are too large to fit in memory? One of the nice aspects of the SQLite DB wrapping
EDIT: Also, @quinnj, would help developing the DataFrame w/ NullableVector{T} columns be useful to you?
One idea I had been kicking around (but haven't written any code for, so take it with a grain of salt) was to make the CSV type parametric with a tuple argument, similar to the ideas being discussed in JuliaData/DataFrames.jl#744. To make things more concrete, you would have something like
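(The snippet that belonged here did not survive extraction; what follows is a hypothetical reconstruction, consistent with the CSVTyped{Ts} generated-function sketch later in the thread.)
# Hypothetical reconstruction: a format type that carries its column types as a
# tuple type parameter, plus an instance describing three columns of
# (ASCIIString, Float64, Float64).
immutable CSVTyped{Ts} end

csvfmt = CSVTyped{Tuple{ASCIIString, Float64, Float64}}()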
which would represent a 3-column csv format, with the corresponding column types. Then you would write something like
a, b, c = readline(file, csvfmt)
which would call
A, B, C = readall(file, csvfmt)
which would return corresponding vectors (or
Then the general
@simonbyrne, I've actually gone down a similar road of thought. I think there's a lot of potential there; the major hiccup is what @JeffBezanson hilariously pointed out during a JuliaCon hackathon: there's unfortunately no way to "make" a tuple. I got to the point of actually writing the code when I realized this. There's no way to create an "empty" typed tuple and then set the values. I'm hopeful Keno's work will alleviate this, but currently I don't know of any efficient workarounds here.
Can't you do it via a vector? e.g.
t = [ASCIIString, Float64, Float64]
tt = Tuple{t...}
Oh, sorry, I misunderstood. What I had in mind (but forgot to mention) was that
EDIT: Roughly what I had in mind:
@generated function readline{Ts}(io::IO, ::CSVTyped{Ts})
    b = Expr(:block)
    r = Expr(:tuple)
    for T in Ts.parameters   # iterate over the element types of the Tuple type parameter
        g = gensym()
        push!(b.args, :($g = parse($T, readfield(io))))
        push!(r.args, g)
    end
    quote
        $b
        $r
    end
end
I don't get this business about not being able to "make" a tuple.
julia> x(1,2.3)
(a => 1, b => 2.3)

julia> y = (1, 2.35)
(1,2.35)

julia> x(y...)
(a => 1, b => 2.35)

julia> typeof(ans)
NamedTuples._NT_ab{Int64,Float64}

Is that not making a tuple, which can be addressed by field name or index, which retains the types?
Ah, thanks for the code snippet @simonbyrne. Indeed, I hadn't thought of the generated function +
Is there any risk of running afoul of
Note, you can do things like this:
@simonbyrne's idea of generating the CSV reader might be the solution to a lot of issues, IMO, and could make the performance fly compared to anything else out there.
@simonbyrne doesn't
My benchmarking in JuliaLang/julia#10428 showed that
Can we choose some specific files we want to optimize against? For small files, I think the staged function is going to be very slow, but it's not clear to me how much of a win there is for larger files.
I have my test TSV files in 10428 that we can use as large test data.
Is that file (or a simulated equivalent) available somewhere?
@simonbyrne the data are proprietary, but I can get you access on the research machine.
I've got a ton of datasets that I can send over. I particularly like optimizing against the files Wes used for Pandas: http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/
I'd appreciate any and all test files people have access to and are willing to share. Feel free to share a link or email me personally. Great find @ihnorton, I'll definitely do some benchmarking vs. that site.
Sent you an e-mail, @quinnj, with some of the bigger datasets I've been profiling against.
Here's a discussion on https://www.linkedin.com/grp/post/5144163-5836134323916398595
The default return type of
function getfield!{T}(io::IOBuffer, dest::NullableVector{T}, ::Type{T}, opts, row, col)
    @inbounds val, null = CSV.getfield(io, T, opts, row, col) # row + datarow
    @inbounds dest.values[row], dest.isnull[row] = val, null
    return
end

function Data.stream!(source::CSV.Source, sink::Data.Table)
    rows, cols = size(source)
    types = Data.types(source)
    for row = 1:rows, col = 1:cols
        @inbounds T = types[col]
        CSV.getfield!(source.data, Data.unsafe_column(sink, col, T), T, source.options, row, col) # row + datarow
    end
    return sink
end
I'd like to decide on the Julia structure that CSV.read() returns. Speak now or forever hold your peace (or write your own parser, I don't care). The current candidates are:
I'm leaning towards Dict{String,NullableArray{T}} as it's the most straightforward.
@johnmyleswhite @davidagold @StefanKarpinski @jiahao @RaviMohan