DataFrames replacement that has no type-uncertainty #744
Comments
Thanks for writing this up, Tom. One point I'd make: we should try to decouple interface and implementation. Whether DataFrames are row-oriented or column-oriented shouldn't matter that much if we can offer tolerable performance for both iterating over rows and extracting whole columns. (Of course, defining tolerable performance is a tricky matter.) In particular, I'd like to see a DataFrames implementation that directly wraps SQLite3. Defining the interface at that level means that you can switch between SQLite3's in-memory database and something custom written for Julia readily depending on your particular application's performance characteristics. |
I am thinking about adding a PostgreSQL client for Julia and I would like to expose an interface via DataFrames, so I am very much in favour of decoupling interface and implementation. I would like to see an interface that takes into account datasets that are larger than the available memory on the client and require some sort of streaming. |
A bit against-the-grain question: so if a goal of dataframe is to be an in-memory front end of a potentially much larger database (or a powerful engine), is performance of this thin layer that critical? It almost feels that flexibility and expressive power trumps raw performance in that case. |
@tonyhffong I think the intention is that it can be both: there will be a pure julia DataFrame for general use, but you can also swap this for a different backend without changing code. One other topic perhaps worth considering is indexing: in particular, what interface to use, and do we want pandas-style hierarchical indexing? |
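To make the interface/implementation split concrete, here is a minimal sketch in current Julia syntax (not the 0.4-era syntax used elsewhere in this thread). The names `AbstractTable`, `column`, `nrows`, `ColumnTable`, and `RowTable` are hypothetical and are not part of DataFrames. The only point is that `column_mean` is written once against the interface and runs unchanged on either storage layout.

```julia
# Hypothetical interface: a backend is any subtype that implements
# `column` and `nrows`. None of these names come from DataFrames.
abstract type AbstractTable end
function column end
function nrows end

# Column-oriented backend: one vector per column.
struct ColumnTable <: AbstractTable
    columns::Dict{Symbol,Vector{Float64}}
end
column(t::ColumnTable, name::Symbol) = t.columns[name]
nrows(t::ColumnTable) = length(first(values(t.columns)))

# Row-oriented backend: one dictionary per row.
struct RowTable <: AbstractTable
    rows::Vector{Dict{Symbol,Float64}}
end
column(t::RowTable, name::Symbol) = [row[name] for row in t.rows]
nrows(t::RowTable) = length(t.rows)

# User code depends only on the interface, not on the storage layout.
column_mean(t::AbstractTable, name::Symbol) = sum(column(t, name)) / nrows(t)

ct = ColumnTable(Dict(:x => [1.0, 2.0, 3.0]))
rt = RowTable([Dict(:x => 1.0), Dict(:x => 2.0), Dict(:x => 3.0)])
println(column_mean(ct, :x))
println(column_mean(rt, :x))
```

A database-backed table would, under the same assumption, be a third subtype whose `column` method issues a query instead of reading memory.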
If dot overloading (JuliaLang/julia#1974) is implemented well, could |
Since that would expand to |
If we had "rerun type inference after inlining in some cases" from #3440, that would probably also be sufficient to make |
I mentioned this on the mailing list, but some kind of |
@one-more-minute, see JuliaLang/julia#414 |
This is just a framing point, but one way that I'd like to talk about this issue is in terms of "leaving money on the table", rather than in terms of performance optimization. A standard database engine gives you lots of type information that we're just throwing out. We're not exploiting knowledge about whether a column is of type T and we're also not exploiting knowledge about whether a column contains nulls or not. Many other design issues (e.g. row-oriented vs. column-oriented) depend on assumptions about how data will be accessed, but type information is a strict improvement (up to compilation costs and the risks of over-using memory from compiling too many variants of a function). |
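A small sketch of what is being left on the table, in current Julia syntax. `UntypedTable` and `TypedTable` are hypothetical types made up for this illustration and are not the DataFrames implementation. `Base.return_types` is used only to ask the compiler what it can infer about a column in each case.

```julia
# Hypothetical table whose columns are stored without type information.
struct UntypedTable
    columns::Vector{Any}
end

# Hypothetical table whose column types are part of the table's own type.
struct TypedTable{C<:Tuple}
    columns::C
end

first_column(t) = t.columns[1]

untyped = UntypedTable(Any[[1, 2, 3], [1.0, 2.0, 3.0]])
typed = TypedTable(([1, 2, 3], [1.0, 2.0, 3.0]))

# Ask inference what it knows about the first column for each table type.
inferred_untyped = Base.return_types(first_column, (typeof(untyped),))[1]
inferred_typed = Base.return_types(first_column, (typeof(typed),))[1]

println(inferred_untyped)
println(inferred_typed)
```

Both tables hold the same data. Only the second lets the compiler specialize code that consumes the column, at the cost of one compiled variant per distinct table type.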
I tried to write down two very simple pieces of code that exemplify the very deep conceptual problems we need to solve if we'd like to unify the DataFrames and SQL table models of computation: https://gist.github.com/johnmyleswhite/584cd12bb51c27a19725 |
To reiterate @tonyhffong's point, I wonder, maybe naively, why one cannot use an SQLite in-memory database and an interface a la dplyr to carry out all the analyses. I have the impression that database engines have solved and optimised a lot of issues that we are trying to address here. Besides, one of the main frustrations with R (at least mine) is the fact that large data sets cannot be handled directly. This would also remedy that issue. I can foresee some limitations with this approach
|
Using SQLite3 as a backend is something that would be worth exploring. There are also lots of good ideas in dplyr. That said, I don't really think that using SQLite3 resolves the biggest unsolved problem, which is how to express to Julia that we have very strong type information about dataframes/databases, but which is only available at run-time. To borrow an idea from @simonster, the big issue is how to do something like:
The best case scenario I can see for this function is to defer compilation of everything after the call to |
Your example is what I meant by writing efficient custom functions. A possibility that I can see is to "somehow" compile Julia functions in SQLite user-defined functions. But this is probably cumbersome. |
@johnmyleswhite Here is a python data interop protocol for working with external databases etc through blaze (numpy/pandas 2 ecosystem) http://datashape.pydata.org/overview.html It is currently being used only to lower/ JIT expressions on numpy arrays, but facilitates interop and discovery with other backends: http://matthewrocklin.com/blog/work/2014/11/19/Blaze-Datasets/ Not sure if there are ideas here that can help in any way, but thought I would drop it in regardless. |
These things are definitely useful. I think we need to think about how they interact with Julia's existing static-analysis JIT. |
Glad it is helpful. Here is the coordinating library that connects to these projects: https://github.com/ContinuumIO/blaze It itself has some good ideas for chunking, streaming etc Here is a scheduler for doing out of core ops: http://matthewrocklin.com/blog/work/2015/01/16/Towards-OOC-SpillToDisk/ The graphs optimized to remove unnecessary computation dask/dask#20 Maybe after some introspection, dataframes can use blocks.jl to stream database into memory transparently. Does Julia have facilities to build and optimize parallel scheduling expression graphs? |
@johnmyleswhite I ended up coding something that could be useful during your talk about this earlier today, and then I found this issue so I figured the most convenient option might be to discuss it here. I came up with a rough sketch of a type-stable and type-specific implementation of DataFrames; a gist can be found here. @simonbyrne's prototype ended up heavily informing how I structured the code, so it should look somewhat similar. Pros for this implementation:
Cons:
|
Note that, given the above implementation, it's also pretty easy to add type-stable/specific methods for Edit: Just added this to the gist for good measure. Seeing it in action:
julia> df = @dframe(:numbers = collect(1:10), :letters = 'a':'j')
DataFrame{Tuple{Array{Int64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([1,2,3,4,5,6,7,8,9,10],'a':1:'j'),Tuple{:numbers,:letters})
julia> df[1]
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
julia> df[2, Field{:numbers}]
2
This implementation also easily supports row slicing:
julia> df[1:3, Field{:numbers}]
3-element Array{Int64,1}:
1
2
3
(...though I'm not sure whether the lack of support for row slicing is by design or not in the current DataFrames implementation) |
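For readers without access to the gist, the idea can be sketched in current Julia syntax. This is not the code from the gist: `TypedFrame` and its constructor are hypothetical stand-ins, and only `Field{:name}` follows the indexing shown above. The column lookup uses a generated (staged) function so that the position of a column is resolved from the type parameters at compile time.

```julia
# Marker type that carries a column name in the type domain.
struct Field{name} end

# Hypothetical frame: column types `C` and column names `names` are both
# type parameters, so every column's type is known to the compiler.
struct TypedFrame{C<:Tuple,names}
    columns::C
end

function TypedFrame(names::Tuple{Vararg{Symbol}}, columns::Tuple)
    length(names) == length(columns) ||
        throw(ArgumentError("need exactly one name per column"))
    return TypedFrame{typeof(columns),names}(columns)
end

# Generated function: the column position is computed once per frame type
# and spliced into the method body as a literal integer.
@generated function Base.getindex(df::TypedFrame{C,names},
                                  ::Type{Field{name}}) where {C,names,name}
    idx = findfirst(==(name), names)
    idx === nothing && return :(throw(KeyError($(QuoteNode(name)))))
    return :(df.columns[$idx])
end

# Row (or row-range) access goes through the typed column lookup.
Base.getindex(df::TypedFrame, rows, f::Type{<:Field}) = df[f][rows]

df = TypedFrame((:numbers, :letters), (collect(1:10), 'a':'j'))
println(df[2, Field{:numbers}])
println(df[1:3, Field{:numbers}])
```

Indexing with a plain `Symbol` would not give the same guarantee, because the symbol's value is not part of the argument type.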
Nice @jrevels! The main drawback I see is that type-stable indexing is still cumbersome. You need |
@tshort True. I'm not sure that this implementation could ever support a syntax that clean, but I can naively think of a few alternatives that could at least make it a little easier to deal with:
julia> const numbers = Field{:numbers}
Field{:numbers}
julia> df[numbers]
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
This results in a nice syntax, but of course requires reserving a name for the constant. This could be done automatically as part of the |
Unless we specialise DataFrames into the language somehow (e.g., by making them act like Modules), I suspect that this sort of idea is likely to be the most useful. The key downside from a performance perspective is going to be the JIT overhead every time you apply a function to a new DataFrame signature, though I'm not sure how important this is likely to be in practice. |
This is great. I'm totally onboard with this. My one source of hesitation is that I'm not really sure what the performance problems we want to solve are. What I like about this is that it puts us in a position to use staged functions to make iterating over the rows of a DataFrame fast. That's a big win. But I think we still need radical changes to introduce indexing into DataFrames and even more work to provide support for SQL-style queries. In an ideal world, we could revamp DataFrames with this trick while also exploring a SQLite-backed approach. If we're worried about labor resources, I wonder if we need to write out some benchmarks that establish what kinds of functionality we want to provide and what we want to make fast. |
If you guys decide that it's worth it, I'd love to help refactor DataFrames to use the proposed implementation. I'd just have to get up to speed with where the package is at, given that such a refactor might also entail breaking API changes. If you think that DataFrames isn't ready, or that a large refactor is untenable given the SQL-ish direction that you want to eventually trend towards, that's also cool. Once decisions have been made as to what should be done, feel free to let me know how I can help.
I imagine that the average user won't be generating so many different types of DataFrames in a given session that it will ever be an issue (though I could be wrong in this assumption). |
I tend to agree. I'd be surprised if the number of distinct DataFrames exceeded 100 in most programs, although I'm sure somebody will attempt to build a million column DataFrame by incremental calls to |
I still have the feeling that this is reinventing the wheel if the end goal is to have an SQL-like/style syntax! I recently stumbled upon monetdb (https://www.monetdb.org/Home). They have embedded R in the database (https://www.monetdb.org/content/embedded-r-monetdb) and developed a package to access it from R via dplyr. I know this is not "pure" Julia, but I could very well imagine that one can work with such a technology, which would also make it possible to have user-defined Julia functions "transformed" into database-level functions. Next step would be to have something like this http://hannes.muehleisen.org/ssdbm2014-r-embedded-monetdb-cr.pdf Would an idea like this be worth exploring? |
@teucer The problem is that for Julia to generate efficient code, it needs to know the types of the input variables when compiling the functions. dplyr and its connections to databases are very interesting, but they don't solve the type-stability issue if we want fast Julia code operating on data frames. |
For a quick summary, see @johnmyleswhite's post: |
@MikeInnes - this link is now broken, but I'm rather interested in seeing the direction you were taking. Do you still have the code around somewhere? |
I'd also like to check out the gist. |
Sure, you can actually see the whole code at Data.jl and I put up a self-contained gist here. Data.jl is basically just a fleshed-out version where the full DataFrame object is built on top of the TypedDict. Honestly though, when I was playing around with this I'm not sure the type inferrability of dataframe indexing made a whole lot of difference. You end up having to factor operations out into acting on columns anyway (see e.g. DecisionTrees.jl). For me the biggest wins came from (1) replacing |
We should probably document the ways in which non-concrete types harm the performance of code written for DataFrames until we have a better design that we can all agree on. Each of the problems that comes up is very different from the others.
df = get_data_frame()
for r in eachrow(df)
    do_something_per_row(r)
end

df = get_data_frame()
do_something_to_whole_table(df)

df = get_data_frame()
do_something_to_columns(df[:a], df[:b], df[:c])
function do_something_to_columns(a, b, c)
    s = 0.0
    for i in 1:length(a)
        s += a[i] + b[i] + c[i]
    end
    return s
end |
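The third pattern above can be made runnable as a sketch of a function barrier, in current Julia syntax. `LooseTable`, `sum_columns`, and `total` are hypothetical names chosen for this illustration. The table lookup is deliberately untyped, as in the DataFrames design under discussion. The loop lives in a separate function, so Julia compiles it for the concrete column types it receives at the call.

```julia
# Hypothetical table with untyped column storage: the compiler only
# knows that a column lookup returns `Any`.
struct LooseTable
    columns::Dict{Symbol,Any}
end
Base.getindex(t::LooseTable, name::Symbol) = t.columns[name]

# Kernel: specialized on the concrete types of a, b and c, so the
# loop body itself contains no dynamic dispatch.
function sum_columns(a, b, c)
    s = 0.0
    for i in eachindex(a, b, c)
        s += a[i] + b[i] + c[i]
    end
    return s
end

# Outer function: one dynamic dispatch at the call to the kernel,
# paid once per call rather than once per row.
total(t::LooseTable) = sum_columns(t[:a], t[:b], t[:c])

t = LooseTable(Dict{Symbol,Any}(
    :a => [1.0, 2.0],
    :b => [3.0, 4.0],
    :c => [5.0, 6.0],
))
result = total(t)
println(result)
```

The first pattern (iterating rows of an untyped table) has no such barrier available per row, which is why it is the harder case.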
@johnmyleswhite What is a "kernel function" in this context? |
It's possible I don't understand what other people mean, but I'm using kernel function to refer to a function |
Hi everyone, At work we need to manage data with multiple types and I've also been working on how to do typed data containers in Julia and have created a reasonably functional package called Tables.jl. My approach is a little more complex (and heavy on metaprogramming) than the example given by @MikeInnes. But it also includes a first-class I have seen recent additions to Julia 0.5 that fix a few speed problems with tuples, so this should become a reasonably fast approach (it should already be better than DataFrames with no care taken with regards to type stability, but I haven't done any in-depth benchmarking beyond inspecting the generated llvm/native code). Furthermore, using the new I would appreciate any feedback. I was also thinking of making a PR to METADATA to make it public, but I definitely didn't want to annoy or upset the JuliaStats community either, so I am also seeking comments regarding this (i.e. if there was soon-to-be-released something similar from you guys, or if there was vehement opposition to having multiple packages with some overlap in functionality, etc). |
Great, @andyferris! Please feel free to register Tables. Looks like lots of good stuff in there. The API for Tables is wordy and noisy. That may be the price for type stability. To simplify, could Are you planning conversion methods to DataFrames? |
@tshort Thanks, and yes, that is quite possible. At the very least, I just tried
julia> function f(x::Symbol)
           Base.@_pure_meta
           Val{x}
       end
f (generic function with 1 method)
julia> g() = f(:a)
g (generic function with 1 method)
julia> code_warntype(g,())
Variables:
#self#::#g
Body:
begin # none, line 1:
return $(QuoteNode(Val{:a}))
end::Type{Val{:a}}
in nightly. This seems promising to me. At the very least, this doesn't seem possible in 0.4:
julia> @inline f2(x::Symbol) = Val{x}
f2 (generic function with 1 method)
julia> g2() = f2(:a)
g2 (generic function with 1 method)
julia> code_warntype(g2,())
Variables:
#self#::#g2
Body:
begin # none, line 1:
return (top(apply_type))(Main.Val,:a)::Type{_<:Val{T}}
end::Type{_<:Val{T}}
This change will make the interface for Tables.jl quite a bit nicer to use. Would be interesting if the constructors could be cleaned up too. And yes, methods to eat and spit out DataFrames are planned... at the very least that will provide I/O in the short-term. |
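The same experiment can be sketched in current Julia syntax, assuming a Julia 1.x compiler, where constant propagation is expected to carry the literal symbol through an ordinary function without the internal `Base.@_pure_meta` macro. `lift` and `column_a` are made-up names for this illustration.

```julia
# Lift a runtime Symbol into the type domain.
lift(name::Symbol) = Val(name)

# The symbol is a literal here, so the compiler can propagate it
# through `lift` as a constant.
column_a() = lift(:a)

# Ask inference for the return type of `column_a`.
inferred = Base.return_types(column_a, ())[1]
println(inferred)
```

When the symbol is only known at run time (for example, read from a file), no constant is available to propagate, and the result is inferred only as some `Val`.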
Great news, @andyferris. I'd shoot for v0.5 then. That would make the "everyday" API much nicer. |
Really interesting! I hadn't anticipated that I have a few questions, which may well reflect my misunderstanding:
(Finally, the name bikeshedding session: I think |
Thanks @nalimilan. Yes, the Let me address your questions
Table{FieldIndex{(Field{:A,Int64}(), Field{:B,Float64}())}(), Tuple{Int64,Float64}, Tuple{Array{Int64,1},Array{Float64,1}}}(([1,2,3], [2.0,4.0,6.0]))
Yuk, right? Anyway, I realize it is dreadful and some information can be computed (the element types from the array types, perhaps the fields could just be names, I don't know).
As for the macros, I wanted to allow both field names and
field_A = Field{:A,Int64}()
cell_1 = @cell(field_A = 42)
cell_2 = @cell(A::Int64 = 42) # == cell_1
I wasn't sure how to make both inputs work. Different macros? Use a symbol, like
I believe this could be simplified, but some things might have to be checked later in the process. Also, working with Tuple types in 0.4 is a bit painful. I'm guessing there will be enough metaprogramming power in Finally, regarding the package name, I shortened the module name myself when I got frustrated with typing a longer name during development (frequent |
Sorry for the late reply. I wonder why you wouldn't be able to simplify the type to this: Finally, I don't think you need to store both the type of the columns and their element type: |
Yes Milan, good idea and I agree and have thought about this lately, but I haven't got around to doing a type-design iteration yet. And yes the invariants can be checked - it's just a matter of when. I had included both the element types and storage types in the header because I was trying to see how far one could go without generated functions, because The other place the element type is useful is getting the correct function signature, for instance to push a Do you think The other two possibilities are:
or
I was considering the latter. A field name might be carried in a Your opinion? |
I would go with the latter ( |
Indeed, I agree with you @nalimilan. I expect I'll go that way when I find the time. :) As a general comment, I don't think there is much top-down imposition nor much community consensus on how Anyway, that is just an aside. I have some deeper questions for actual R-data.frames/data.table/dplyr and python-pandas users. I've come up with a macro @select which easily combines the notation of dplyr's
@select(table, col1, newname = col2, newcol::newtype = col1 -> f(col1))
where it has three possible behaviours, respectively: selecting a column, renaming a column (and including it) and creating a new column via a row-by-row calculation. The syntax is quite constrained by the semantics of Julia, but any comments or suggested improvements would be welcome! I tried to emulate |
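The three behaviours can be spelled out by hand on a plain `NamedTuple` of columns, in current Julia syntax. This is not the `@select` macro from Tables.jl and makes no claim about what it expands to. `table`, `f`, and the column names are placeholders taken from or modelled on the call above.

```julia
# A table as a NamedTuple of columns (placeholder data).
table = (col1 = [1, 2, 3], col2 = ["x", "y", "z"])

# Placeholder row-wise function.
f(x) = 2 * x

selected = (
    col1 = table.col1,                            # select a column as-is
    newname = table.col2,                         # rename a column
    newcol = Float64[f(x) for x in table.col1],   # new typed column, row by row
)

println(keys(selected))
println(selected.newcol)
```

Because the result is itself a `NamedTuple`, the name and element type of every output column are part of its type.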
@andyferris have you seen this? https://github.com/JuliaStats/DataFramesMeta.jl |
Sounds like a good strategy. Like @datnamer, I think the state of our reflections with regard to this is in DataFramesMeta (see also the issues in that project). It would make sense to follow SQL/LINQ/dplyr as much as possible, unless there's a good reason not to. There has also been some discussion regarding APIs here: #369. |
This may be premature, but are there any thoughts on SQL code generation? Edit: There is also this https://github.com/MikeInnes/Flow.jl. I wonder if that can help (query planning, delayed eval etc) |
@datnamer: quick answer regarding SQL is "no". But thank you to both of you for interesting reads (I definitely had seen DataFramesMeta quite some time ago but hadn't thought about it lately). Regarding #369, the point is just syntax where you write code in devectorized form? I think my new |
Well, mostly, but it would also allow you to refer to all existing columns as variables, as well as to create new ones like you would create variables. So it would take a complete code block instead of short expressions passed as separate arguments. See the example I provided there. It's particularly useful when combining several variables using |
Closing as we're not going to make the |
@johnmyleswhite started this interesting email thread:
https://groups.google.com/forum/#!topic/julia-dev/hS1DAUciv3M
Discussions included:
df[1, Field{:a}()] instead of df[1, :a].
field"a", could make the column indexing look a bit better for the example above.
df.a but not as df[:a] (same issue as Simon's approach).
The following issues in Base may help with type certainty:
a.b
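For context, current Julia releases allow `a.b` to be overloaded through `Base.getproperty`. Below is a minimal sketch of how `df.a` could then be both convenient and inferrable, assuming columns are stored in a `NamedTuple`. `NamedFrame` and `sum_a` are hypothetical names for this illustration and are not the DataFrames implementation.

```julia
# Hypothetical frame wrapping a NamedTuple of columns, so column names
# and column types are both part of the frame's type.
struct NamedFrame{NT<:NamedTuple}
    data::NT
end

# Forward `df.name` to the wrapped NamedTuple. `getfield` is used to
# reach the real field, since `df.data` would itself be intercepted.
Base.getproperty(df::NamedFrame, name::Symbol) =
    getproperty(getfield(df, :data), name)
Base.propertynames(df::NamedFrame) = keys(getfield(df, :data))

df = NamedFrame((a = [1, 2, 3], b = [0.5, 1.5, 2.5]))

# `df.a` uses a literal name, so the compiler can resolve the column
# and its element type when compiling this function.
sum_a(frame) = sum(frame.a)

println(sum_a(df))
```

The trade-off raised earlier in the thread still applies: each distinct set of column names and types is a new frame type, and functions are compiled once per type.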