DataFrames replacement that has no type-uncertainty #744

Closed
tshort opened this issue Dec 17, 2014 · 87 comments

@tshort
Contributor

tshort commented Dec 17, 2014

@johnmyleswhite started this interesting email thread:

https://groups.google.com/forum/#!topic/julia-dev/hS1DAUciv3M

Discussions included:

  • Row-based storage vs. column-based storage -- both approaches have advantages.
  • @simonbyrne prototyped an approach to type certainty using a composite type that embeds the column information as type parameters. Staged functions are used for getindex. To maintain type certainty, indexing has to be done as df[1, Field{:a}()] instead of df[1, :a].
  • @one-more-minute noted that a string macro, field"a", could make the column indexing look a bit better for the example above.
  • CompositeDataFrames are one approach to type certainty. Types are certain if indexed as df.a but not as df[:a] (same issue as Simon's approach).
  • The function boundary (or lack of) causes issues. For example, we'd like to be able to read in a DataFrame and do analysis within the same function.

The following issues in Base may help with type certainty:

@johnmyleswhite
Contributor

Thanks for writing this up, Tom.

One point I'd make: we should try to decouple interface and implementation. Whether DataFrames are row-oriented or column-oriented shouldn't matter that much if we can offer tolerable performance for both iterating over rows and extracting whole columns. (Of course, defining tolerable performance is a tricky matter.) In particular, I'd like to see a DataFrames implementation that directly wraps SQLite3. Defining the interface at that level means that you can readily switch between SQLite3's in-memory database and something custom-written for Julia, depending on your particular application's performance characteristics.

@vchuravy
Contributor

I am thinking about adding a PostgreSQL client for Julia and I would like to expose an interface via DataFrames, so I am very much in favour of decoupling interface and implementation.

I would like to see an interface that takes into account datasets that are larger than the available memory on the client and require some sort of streaming.

@tonyhffong
Contributor

A bit of an against-the-grain question: if a goal of DataFrames is to be an in-memory front end for a potentially much larger database (or a powerful engine), is the performance of this thin layer that critical? It almost feels that flexibility and expressive power trump raw performance in that case.

@simonbyrne
Contributor

@tonyhffong I think the intention is that it can be both: there will be a pure julia DataFrame for general use, but you can also swap this for a different backend without changing code.

One other topic perhaps worth considering is indexing: in particular, what interface to use, and do we want pandas-style hierarchical indexing?

@stevengj

If dot overloading (JuliaLang/julia#1974) is implemented well, could df[1, :a] be replaced by df[1].a while still using @simonbyrne's staged-function tricks?

@MikeInnes
Contributor

Since that would expand to getfield(df[1], Field{:a}()) you could certainly use the staged function trick in principle. But that depends heavily on how efficient df[1] is as well. Data frame slices might be needed.

@simonster
Contributor

If we had "rerun type inference after inlining in some cases" from #3440, that would probably also be sufficient to make df[1, :a] work. Or some hack to provide type information for arbitrary functions given the Exprs they were passed, similar to what inference.jl does for tupleref.

@MikeInnes
Contributor

I mentioned this on the mailing list, but some kind of @pure notation that gives the compiler freedom to partially evaluate the function when it has compile-time-known arguments would also be sufficient – certainly for this and other indexing purposes, and perhaps others.

@stevengj

@one-more-minute, see JuliaLang/julia#414

@johnmyleswhite
Contributor

This is just a framing point, but one way that I'd like to talk about this issue is in terms of "leaving money on the table", rather than in terms of performance optimization. A standard database engine gives you lots of type information that we're just throwing out. We're not exploiting knowledge about whether a column is of type T and we're also not exploiting knowledge about whether a column contains nulls or not. Many other design issues (e.g. row-oriented vs. column-oriented) depend on assumptions about how data will be accessed, but type information is a strict improvement (up to compilation costs and the risks of over-using memory from compiling too many variants of a function).

@johnmyleswhite
Contributor

I tried to write down two very simple pieces of code that exemplify the very deep conceptual problems we need to solve if we'd like to unify the DataFrames and SQL table models of computation: https://gist.github.com/johnmyleswhite/584cd12bb51c27a19725

@teucer

teucer commented Dec 31, 2014

To reiterate @tonyhffong's point, I wonder, maybe naively, why one cannot use an SQLite in-memory database and an interface a la dplyr to carry out all the analyses. I have the impression that database engines have solved and optimised a lot of issues that we are trying to address here. Besides one of the main frustrations with R (at least mine) is the fact that large data sets cannot be handled directly. This would also remediate that issue.

I can foresee some limitations with this approach:

  • efficient custom functions
  • unsupported datatypes in SQLite
  • conversion from tables to e.g. vectors and matrices and vice versa

@johnmyleswhite
Contributor

Using SQLite3 as a backend is something that would be worth exploring. There are also lots of good ideas in dplyr.

That said, I don't really think that using SQLite3 resolves the biggest unsolved problem, which is how to express to Julia that we have very strong type information about dataframes/databases, but which is only available at run-time. To borrow an idea from @simonster, the big issue is how to do something like:

function sum_column_1(path::String)
    df = readtable(path)

    s = 0.0

    for row in eachrow(df)
        s += row.column_1
    end

    return s
end

The best case scenario I can see for this function is to defer compilation of everything after the call to readtable and then compile the rest of the body of the function after readtable has produced a concrete type for df. There are, of course, other ways to achieve this effect (including calling a second function from inside of this function), but it's a shame that naive code like the above should suffer from so much type uncertainty that could, in principle, be avoided by deferring some of the type-specialization process.
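
The "second function" workaround mentioned above is usually called a function barrier. Below is a minimal sketch of it, written for current Julia (1.x) rather than the 0.3/0.4 versions discussed in this thread. `load_columns` is a hypothetical stand-in for readtable, and the Dict-of-columns storage is an assumption made only to keep the example self-contained; it is not how DataFrames stores data.

```julia
# Hypothetical stand-in for readtable: the column's element type depends on
# run-time data, so the caller cannot know it at compile time.
function load_columns(use_floats::Bool)
    col = use_floats ? [1.0, 2.0, 3.0] : [1, 2, 3]
    return Dict{Symbol,Any}(:column_1 => col)
end

# Kernel: Julia compiles a specialized version for each concrete column type.
function sum_kernel(col::AbstractVector)
    s = zero(eltype(col))
    for x in col
        s += x
    end
    return s
end

# Outer function: a single dynamic dispatch happens at the call to sum_kernel;
# the loop inside the kernel then runs on a concretely typed vector.
function sum_column_1(use_floats::Bool)
    df = load_columns(use_floats)
    return sum_kernel(df[:column_1])
end

sum_column_1(true)   # sums a Vector{Float64}
sum_column_1(false)  # sums a Vector{Int}
```

The cost is that the hot loop has to live in a separate function, which is exactly the inconvenience the comment above is describing.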

@teucer

teucer commented Dec 31, 2014

Your example is what I meant by writing efficient custom functions. A possibility that I can see is to "somehow" compile Julia functions in SQLite user-defined functions. But this is probably cumbersome.

@garborg mentioned this issue Jan 1, 2015
@datnamer

datnamer commented Feb 5, 2015

@johnmyleswhite Here is a python data interop protocol for working with external databases etc through blaze (numpy/pandas 2 ecosystem) http://datashape.pydata.org/overview.html

It is currently being used only to lower/ JIT expressions on numpy arrays, but facilitates interop and discovery with other backends: http://matthewrocklin.com/blog/work/2014/11/19/Blaze-Datasets/

Not sure if there are ideas here that can help in any way, but thought I would drop it in regardless.

@johnmyleswhite
Contributor

These things are definitely useful. I think we need to think about how they interact with Julia's existing static-analysis JIT.

@datnamer

datnamer commented Feb 5, 2015

Glad it is helpful. Here is the coordinating library that connects to these projects: https://github.com/ContinuumIO/blaze. It has some good ideas of its own for chunking, streaming, etc.

Here is a scheduler for doing out of core ops: http://matthewrocklin.com/blog/work/2015/01/16/Towards-OOC-SpillToDisk/

The graphs are optimized to remove unnecessary computation: dask/dask#20

Maybe after some introspection, dataframes can use blocks.jl to stream database into memory transparently. Does Julia have facilities to build and optimize parallel scheduling expression graphs?

@jrevels

jrevels commented Jun 26, 2015

@johnmyleswhite I ended up coding something that could be useful during your talk about this earlier today, and then I found this issue so I figured the most convenient option might be to discuss it here.

I came up with a rough sketch of a type-stable and type-specific implementation of DataFrames; a gist can be found here. @simonbyrne's prototype ended up heavily informing how I structured the code, so it should look somewhat similar.

Pros for this implementation:

  • Totally type-stable/specific construction and accessor methods
  • Basically provides the same advantages of a CompositeDataFrame, but in a general implementation, eliminating the need to implement new composite types for new kinds of DataFrames.

Cons:

  • I wrote it using Julia v0.4; it might be difficult to convert to a form suitable for v0.3, but maybe Compat could handle it?
  • The @dframe constructor macro, while type-stable, does change the syntax a little bit compared to the current DataFrame(; kwargs...) constructor: mainly, it's a macro and not a normal type constructor. Secondarily, when writing the kwargs pairs, one must actually add the colon to the key symbol: DataFrame(a=collect(1:10)) vs. @dframe(:a=collect(1:10)).

@jrevels

jrevels commented Jun 26, 2015

Note that, given the above implementation, it's also pretty easy to add type-stable/specific methods for getindex that have the other indexing behaviors currently defined on DataFrames (e.g. df[i, Field{:s}], df[i]).

Edit: Just added this to the gist for good measure. Seeing it in action:

julia> df = @dframe(:numbers = collect(1:10), :letters = 'a':'j')
DataFrame{Tuple{Array{Int64,1},StepRange{Char,Int64}},Tuple{:numbers,:letters}}(([1,2,3,4,5,6,7,8,9,10],'a':1:'j'),Tuple{:numbers,:letters})

julia> df[1]
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> df[2, Field{:numbers}]
2

This implementation also easily supports row slicing:

julia> df[1:3, Field{:numbers}]
3-element Array{Int64,1}:
 1
 2
 3

(...though I'm not sure whether the lack of support for row slicing is by design or not in the current DataFrames implementation)
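
For readers without access to the gist, here is a minimal sketch of the general idea (column names carried as a type parameter, lookup resolved by a generated function), written for current Julia (1.x). It is not the code from the gist: `TypedFrame`, its NamedTuple-based constructor, and the exact method signatures are hypothetical and chosen only for illustration.

```julia
# Singleton tag type carrying a column name as a type parameter.
struct Field{name} end

# Column names live in the type (Names); the columns are stored as a tuple.
struct TypedFrame{Names,Cols<:Tuple}
    columns::Cols
end

# Hypothetical convenience constructor from a NamedTuple of columns.
function TypedFrame(nt::NamedTuple{Names}) where {Names}
    cols = values(nt)
    return TypedFrame{Names,typeof(cols)}(cols)
end

# The generator runs once per (frame type, field) pair and emits a direct
# positional access, so the column position is fixed at compile time.
@generated function Base.getindex(df::TypedFrame{Names}, ::Type{Field{name}}) where {Names,name}
    i = findfirst(isequal(name), Names)
    i === nothing && return :(throw(KeyError($(QuoteNode(name)))))
    return :(df.columns[$i])
end

# Row (or row-range) access reuses the column lookup.
Base.getindex(df::TypedFrame, rows, f::Type{<:Field}) = df[f][rows]

df = TypedFrame((numbers = collect(1:10), letters = collect('a':'j')))
df[Field{:numbers}]        # the whole column
df[2, Field{:numbers}]     # a single cell
df[1:3, Field{:numbers}]   # a row slice
```

Because the position `i` is computed when the method is generated, the compiler sees a plain `df.columns[1]` or `df.columns[2]`, whose type follows from the `Cols` parameter.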

@tshort
Contributor Author

tshort commented Jun 26, 2015

Nice @jrevels! The main drawback I see is that type-stable indexing is still cumbersome. You need df[Field{:numbers}] rather than df.numbers.

@jrevels

jrevels commented Jun 26, 2015

@tshort True. I'm not sure that this implementation could ever support a syntax that clean, but I can naively think of a few alternatives that could at least make it a little easier to deal with:

  1. Have an access macro: @field df.numbers that expands to df[Field{:numbers}]. This is still kind of annoying to write, but at least reads like the nice syntax you propose. Coded properly, you could write something like @field df.numbers[1] + df.numbers[2] and the macro could expand each use of df.* to df[Field{:*}].
  2. Shorten the name of the Field type, e.g. abstract fld{f} so that access looks like df[fld{:numbers}]. This makes it a bit easier to type, but IMO makes it even harder to read, and is probably uglier than is acceptable.
  3. Use constants assigned to the appropriate Field{f} type. For example:
julia> const numbers = Field{:numbers}
Field{:numbers}

julia> df[numbers]
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

This results in a nice syntax, but of course requires reserving a name for the constant. This could be done automatically as part of the @dframe macro, but I think unexpectedly introducing new constants into the user's environment might be too intrusive.
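
Option 1 above can be sketched as a small expression-rewriting macro. This is an illustration in current Julia (1.x), not code from the gist: the `Frame` wrapper and the helper `rewrite_fields` are hypothetical, and the rewrite is deliberately naive (it rewrites every `x.name` it finds, including module-qualified names such as `Base.sum`).

```julia
struct Field{name} end

# Hypothetical minimal container so that the example is self-contained.
struct Frame{NT<:NamedTuple}
    columns::NT
end
Base.getindex(df::Frame, ::Type{Field{name}}) where {name} = getfield(df, :columns)[name]

# Recursively replace `x.name` with `x[Field{:name}]`; leave everything else alone.
rewrite_fields(ex) = ex
function rewrite_fields(ex::Expr)
    if ex.head === :. && length(ex.args) == 2 && ex.args[2] isa QuoteNode
        return Expr(:ref, rewrite_fields(ex.args[1]), Expr(:curly, :Field, ex.args[2]))
    end
    return Expr(ex.head, map(rewrite_fields, ex.args)...)
end

macro field(ex)
    return esc(rewrite_fields(ex))
end

df = Frame((numbers = collect(1:10), letters = collect('a':'j')))

# Expands to df[Field{:numbers}][1] + df[Field{:numbers}][2]
total = @field df.numbers[1] + df.numbers[2]
```

Since the macro escapes its result, `Field` is resolved in the caller's scope, so this sketch assumes `Field` is visible wherever `@field` is used.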

@simonbyrne
Contributor

Unless we specialise DataFrames into the language somehow (e.g., by making them act like Modules), I suspect that this sort of idea is likely to be the most useful.

The key downside from a performance perspective is going to be the JIT overhead every time you apply a function to a new DataFrame signature, though I'm not sure how important this is likely to be in practice.

@johnmyleswhite
Contributor

This is great. I'm totally onboard with this.

My one source of hesitation is that I'm not really sure what the performance problems we want to solve are. What I like about this is that it puts us in a position to use staged functions to make iterating over the rows of a DataFrame fast. That's a big win. But I think we still need radical changes to introduce indexing into DataFrames and even more work to provide support for SQL-style queries.

In an ideal world, we could revamp DataFrames with this trick while also exploring a SQLite-backed approach. If we're worried about labor resources, I wonder if we need to write out some benchmarks that establish what kinds of functionality we want to provide and what we want to make fast.

@jrevels

jrevels commented Jun 26, 2015

If you guys decide that it's worth it, I'd love to help refactor DataFrames to use the proposed implementation. I'd just have to get up to speed with where the package is at, given that such a refactor might also entail breaking API changes. If you think that DataFrames isn't ready, or that a large refactor is untenable given the SQL-ish direction that you want to eventually trend towards, that's also cool. Once decisions have been made as to what should be done, feel free to let me know how I can help.

The key downside from a performance perspective is going to be the JIT overhead every time you apply a function to a new DataFrame signature, though I'm not sure how important this is likely to be in practice.

I imagine that the average user won't be generating so many different types of DataFrames in a given session that it will ever be an issue (though I could be wrong in this assumption).

@johnmyleswhite
Contributor

I imagine that the average user won't be generating so many different types of DataFrames in a given session that it will ever be an issue (though I could be wrong in this assumption).

I tend to agree. I'd be surprised if the number of distinct DataFrames exceeded 100 in most programs, although I'm sure somebody will attempt to build a million column DataFrame by incremental calls to hcat.

@teucer

teucer commented Jun 26, 2015

I still have the feeling that this is reinventing the wheel if the end goal is to have an SQL-like/style syntax!

I recently stumbled upon monetdb (https://www.monetdb.org/Home).

They have embedded R in the database (https://www.monetdb.org/content/embedded-r-monetdb) and developed a package to access from R via dplyr.

I know this is not "pure" Julia, but I could very well imagine working with such a technology, which would also make it possible to have user-defined Julia functions "transformed" into database-level functions. The next step would be to have something like this: http://hannes.muehleisen.org/ssdbm2014-r-embedded-monetdb-cr.pdf

Would an idea like this be worth exploring?

@nalimilan
Member

@teucer The problem is that for Julia to generate efficient code, it needs to know the types of the input variables when compiling the functions. dplyr and its connections to databases are very interesting, but they don't solve the type-stability issue if we want fast Julia code operating on data frames.

@simonbyrne
Contributor

@c42f

c42f commented Feb 24, 2016

Gist here, use at your own risk, may set Julia on fire etc.

@MikeInnes - this link is now broken, but I'm rather interested in seeing the direction you were taking. Do you still have the code around somewhere?

@datnamer

I'd also like to check out the gist.

@MikeInnes
Contributor

Sure, you can actually see the whole code at Data.jl and I put up a self-contained gist here. Data.jl is basically just a fleshed-out version where the full DataFrame object is built on top of the TypedDict.

Honestly though, when I was playing around with this I'm not sure the type inferrability of dataframe indexing made a whole lot of difference. You end up having to factor operations out into acting on columns anyway (see e.g. DecisionTrees.jl).

For me the biggest wins came from (1) replacing DataArray with Vector{Nullable}, (2) specialising columns where appropriate (removing Nullable, pooled arrays etc.) (3) writing fast column-level kernels for high-level operations. That gets you performance on par with scikit-learn without having to mess around with matrices and strange data encodings.
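
Point (2) can be illustrated with a small sketch in current Julia (1.3 or later, for `nonmissingtype`). It uses `Missing` as a stand-in for the Nullable type mentioned above, which is an assumption: `Missing` did not exist when this comment was written, and the helper name `narrow` is hypothetical.

```julia
# Return a column with a concrete element type when no values are missing;
# otherwise return the column unchanged. The return type depends on the data,
# so this is meant to run once (e.g. at load time), outside any hot loop.
function narrow(col::AbstractVector)
    T = nonmissingtype(eltype(col))
    return any(ismissing, col) ? col : convert(Vector{T}, col)
end

clean = narrow(Union{Missing,Float64}[1.0, 2.0, 3.0])   # becomes Vector{Float64}
dirty = narrow([1.0, missing, 3.0])                      # left as it is
```

Column-level kernels, as in point (3), can then be written against the narrowed vectors.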

@johnmyleswhite
Contributor

We should probably document the ways in which non-concrete types harm the performance of code written for DataFrames until we have a better design that we can all agree on. Each of the problems that comes up is very different from the others.

  • You try to stream through individual rows of a DataFrame as atomic chunks of data. This basically never works because the rows would need to be tuples to be acceptably fast. Basically, this pattern is always a warning sign:
df = get_data_frame()

for r in eachrow(df)
    do_something_per_row(r)
end
  • You try to operate on a DataFrame without using any "kernel functions", which really means that you try to call a function on a DataFrame, when you actually need to call functions on its columns instead. This is the problem that a type-stable DataFrame solves: it allows you to write functions that operate on a DataFrame as a single argument. This lets you write code that looks like this without having to worry about what happens inside of do_something_to_whole_table(df).
df = get_data_frame()

do_something_to_whole_table(df)
  • You use kernel functions that operate on the columns of a DataFrame, but the columns are DataArrays that generate scalar values of type Union{T, NAtype}. These scalar values get boxed and computations on them go through dynamic dispatch, so the computations are much slower than you'd get if you worked with a vector whose elements were all a primitive type like Float64.
df = get_data_frame()

do_something_to_columns(df[:a], df[:b], df[:c])

function do_something_to_columns(a, b, c)
    s = 0.0
    for i in 1:length(a)
        s += a[i] + b[i] + c[i]
    end
    return s
end
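
To make the first point concrete: when the columns are available as a tuple of concretely typed vectors, each row can be produced as a tuple whose type is known to the compiler. The following is a minimal sketch in current Julia (1.x), assuming plain `Vector` columns rather than DataArrays; `sum_rows` is a hypothetical name.

```julia
# Iterate rows as tuples. Because `cols` is a tuple of concretely typed
# vectors, zip(cols...) yields rows of a single concrete tuple type.
function sum_rows(cols::Tuple)
    s = 0.0
    for row in zip(cols...)
        s += sum(row)
    end
    return s
end

a = [1.0, 2.0, 3.0]
b = [10, 20, 30]

eltype(zip(a, b))   # Tuple{Float64, Int64}
sum_rows((a, b))    # 66.0
```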

@davidagold

@johnmyleswhite What is a "kernel function" in this context?

@johnmyleswhite
Contributor

It's possible I don't understand what other people mean, but I'm using kernel function to refer to a function f(args...) all of whose arguments are the columns of a DataFrame after extracting them and resolving any type-uncertainty. This function is where the core computational work happens; as long as it's fast, the rest of the system has acceptable performance.

@andyferris
Member

Hi everyone,

At work we need to manage data with multiple types and I've also been working on how to do typed data containers in Julia and have created a reasonably functional package called Tables.jl.

My approach is a little more complex (and heavy on metaprogramming) than the example given by @MikeInnes. But it also includes a first-class Row tuple-wrapper which actually makes dealing with data in row chunks fast (@johnmyleswhite - in Julia we can have our cake and eat it). Or you can extract the columns / raw data / etc as necessary.

I have seen recent additions to Julia 0.5 that fix a few speed problems with tuples, so this should become a reasonably fast approach (it should already be better than DataFrames with no care taken with regards to type stability, but I haven't done any in-depth benchmarking beyond inspecting the generated llvm/native code). Furthermore, using the new @pure meta and factoring out some of the metaprogramming (into a separate typed-dict package or something) should simplify the code significantly and allow for static compilation in the future.

I would appreciate any feedback. I was also thinking of making a PR to METADATA to make it public, but I definitely didn't want to annoy or upset the JuliaStats community either, so I am also seeking comments regarding this (i.e. if there was soon-to-be-released something similar from you guys, or if there was vehement opposition to having multiple packages with some overlap in functionality, etc).

@tshort
Contributor Author

tshort commented Mar 13, 2016

Great, @andyferris! Please feel free to register Tables. Looks like lots of good stuff in there. The API for Tables is wordy and noisy. That may be the price for type stability. To simplify, could @pure allow tbl[:A] to be treated like tbl[Val{:A}]?

Are you planning conversion methods to DataFrames?

@andyferris
Member

@tshort Thanks, and yes, that is quite possible. At the very least, I just tried

julia> function f(x::Symbol)
         Base.@_pure_meta
         Val{x}
       end
f (generic function with 1 method)

julia> g() = f(:a)
g (generic function with 1 method)

julia> code_warntype(g,())
Variables:
  #self#::#g

Body:
  begin  # none, line 1:
      return $(QuoteNode(Val{:a}))
  end::Type{Val{:a}}

in nightly. This seems promising to me. At the very least, this doesn't seem possible in 0.4:

julia> @inline f2(x::Symbol) = Val{x}
f2 (generic function with 1 method)

julia> g2() = f2(:a)
g2 (generic function with 1 method)

julia> code_warntype(g2,())
Variables:
  #self#::#g2

Body:
  begin  # none, line 1:
      return (top(apply_type))(Main.Val,:a)::Type{_<:Val{T}}
  end::Type{_<:Val{T}}

This change will make the interface for Tables.jl quite a bit nicer to use. Would be interesting if the constructors could be cleaned up too.

And yes, methods to eat and spit out DataFrames are planned... at the very least that will provide I/O in the short-term.
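
For comparison, here is how the same idea (a symbol literal forwarded to a `Val`-based method) might look in current Julia (1.x), where the compiler's constant propagation typically plays the role that `@_pure_meta` plays in the nightly output above. This is a sketch under that assumption; `MiniTable` and `first_a` are hypothetical names and not part of Tables.jl.

```julia
# Hypothetical minimal table: columns stored in a NamedTuple.
struct MiniTable{NT<:NamedTuple}
    cols::NT
end

# Type-stable entry point: the column name is part of the argument's type.
Base.getindex(t::MiniTable, ::Val{name}) where {name} = getfield(t, :cols)[name]

# Convenience entry point: forwards a Symbol to the Val-based method.
# With a literal such as t[:A], the compiler can usually propagate the constant.
Base.getindex(t::MiniTable, name::Symbol) = t[Val(name)]

first_a(t) = t[:A][1]

t = MiniTable((A = [1, 2, 3], B = [2.0, 4.0, 6.0]))
first_a(t)
```

Whether a particular call is actually inferred can be checked with `@code_warntype first_a(t)`, in the same way as the `code_warntype` calls above.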

@tshort
Contributor Author

tshort commented Mar 13, 2016

Great news, @andyferris. I'd shoot for v0.5 then. That would make the "everyday" API much nicer.

@nalimilan
Member

Really interesting! I hadn't anticipated that @pure would allow tbl[:A] to be type-stable.

I have a few questions, which may well reflect my misunderstanding:

  1. Why do @table, @field and @cell need to be macros?
  2. Likely related: Why are type annotations in @table(A::Int64=[1,2,3], B::Float64=[2.0,4.0,6.0]) or @cell(A::Int64=1) required? I guess they can be useful if you want to force an abstract type to be used when passing a single value to @cell, but shouldn't inference be enough in other cases?
  3. I don't understand the significance of, or the need for, FieldIndex. Shouldn't this information be stored as a Tuple{} of field types in the table or row objects, instead of being exposed to the user?

(Finally, the name bikeshedding session: I think Tables.jl is too general. I also had a package called that way, and I renamed it to FreqTables.jl before registering it, as tables can mean very different things in different fields. How about DataTables.jl, which is the name you use at the top of README.md? Anyway, probably better discuss this in the METADATA.jl registration PR to keep the present thread focused.)

@andyferris
Member

Thanks @nalimilan. Yes, the @pure thing is pretty cool... I'm not certain yet if it is a "compiler hint" or follows a clear set of rules. Currently Julia's type inference system gives up and goes on holiday when, e.g., passing around large tuples and so-on. Fortunately, generated functions seem to cause it to not be lazy (for obvious reasons).

Let me address your questions

  1. The macros are only for convenience. The inner constructor for the table is something like:
Table{FieldIndex{(Field{:A,Int64}(), Field{:B,Float64}())}(), Tuple{Int64,Float64}, Tuple{Array{Int64,1},Array{Float64,1}}}(([1,2,3], [2.0,4.0,6.0]))

Yuk, right? Anyway, I realize it is dreadful and some information can be computed (the element types from the array types, perhaps the fields could just be names, I don't know).

  2. That was just my original design. A Field combines a name and a type, just like the fields of type and immutable objects. The programmer designs a table with specific data in mind - and data can be converted upon input, or whatever, transparently. The type contract is never broken.

As for the macros, I wanted to allow both field names and Field objects to be used, so the non-type-annotated version assumes a field already exists, like:

field_A = Field{:A,Int64}()
cell_1 = @cell(field_A = 42)
cell_2 = @cell(A::Int64 = 42) # == cell_1

I wasn't sure how to make both inputs work. Different macros? Use a symbol, like @cell(:A = 42)? I think the API could definitely be refined.

  3. Great question. This was a design decision I made for a couple of reasons. First, I liked that I could write an object where some invariants are checked in the constructor (e.g. it won't allow two fields with the same name, or any called :Row). Similarly, being a distinct type means I can dispatch many things with the type, such as call with a tuple which creates a Row or Table depending on the types.

I believe this could be simplified, but some things might have to be checked later in the process. Also, working with Tuple types in 0.4 is a bit painful. I'm guessing there will be enough metaprogramming power in @pure that some of these things can be done differently (and without generated functions).

Finally, regarding the package name, I shortened the module name myself when I got frustrated with typing a longer name during development (frequent reload commands mean I use import, not using). Tables.jl, DataTables.jl and TypedTables.jl are all acceptable to me, but I plan to bring this up in the PR.

@nalimilan
Member

Sorry for the late reply. I wonder why you wouldn't be able to simplify the type to this: Table{Tuple{:A, :B}, Tuple{Array{Int64,1}, Array{Float64, 1}}}, i.e. a tuple type of column names and a tuple type of column types. This wouldn't prevent checking for invariants at all. Then Table(A=[1,2,3], B=[2.0,4.0,6.0]) could easily infer the type parameters from its arguments.

Finally, I don't think you need to store both the type of the columns and their element type: eltype should always provide this information if you call it on the column type. This should help reduce the number of parameters.

@andyferris
Member

Yes Milan, good idea. I agree and have thought about this lately, but I haven't got around to doing a type-design iteration yet. And yes, the invariants can be checked - it's just a matter of when. I had included both the element types and storage types in the header because I was trying to see how far one could go without generated functions, and eltypes (plural, not singular) fails to produce good type information on Julia 0.4 if it is not a generated function. One could also use Julia 0.5 pure functions to fix that. But as it stands, most of the type-stability (and zero run-time overhead) relies heavily on generated functions.

The other place the element type is useful is getting the correct function signature, for instance to push a Row onto a Table. But it is superfluous - one just moves the checks into the function (a generated/pure helper function should lead to zero run-time overhead).

Do you think Tuple{:A,:B} is OK? Interestingly, it's a type that can't be instantiated, and Stefan and/or Jeff once saw I had done that before and they thought it was a mistake that Julia even allows it (sorry, I can't seem to find the reference to that discussion right now).

The other two possibilities are:

Table{Tuple{Val{:A}, Val{:B}}, Tuple{Array{Int64,1}, Array{Float64, 1}}}

or

Table{(:A, :B), Tuple{Array{Int64,1}, Array{Float64, 1}}}

I was considering the latter. A field name might be carried in a Val of a symbol, and a collection of fields (currently FieldIndex) is a Val of a tuple of symbols. Or, I could make a new type like Name and Names for these, which would have many methods defined (like currently I can take the union of two FieldIndexs - I would feel bad for defining a bunch of methods on Val). Or, fields and indices can stay as they are, with full type information.

Your opinion?

@nalimilan
Member

I would go with the latter ((:A, :B)), as indeed the column names are not types, they are just parameters. And I don't think Val is supposed to be used that way, it's just a way to dispatch on a specific value when calling a function.
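
As a point of reference, the NamedTuple type that was later added to Julia (in 0.7, after this discussion) has the same shape as the layout preferred above: a tuple of symbols for the names and a Tuple type for the columns. The snippet below, for current Julia (1.1 or later, for `fieldtypes`), only inspects a NamedTuple of vectors; it is not Tables.jl code, and it also shows that the element types are recoverable from the column types via eltype.

```julia
tbl = (A = [1, 2, 3], B = [2.0, 4.0, 6.0])

# Names are values in the first type parameter; column types are in the second.
typeof(tbl)   # NamedTuple{(:A, :B), Tuple{Vector{Int64}, Vector{Float64}}}

colnames = keys(tbl)                   # (:A, :B)
coltypes = fieldtypes(typeof(tbl))     # (Vector{Int64}, Vector{Float64})

# Element types follow from the column types, so they need not be stored.
eltypes = map(eltype, coltypes)        # (Int64, Float64)
```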

@andyferris
Member

Indeed, I agree with you @nalimilan. I expect I'll go that way when I find the time. :)

As a general comment, I don't think there is much top-down imposition nor much community consensus on how Val should be used. It is defined in Base as immutable Val{T}; end without comment or any methods defined for manipulating Vals; furthermore there are AFAIK just a few comments online discussing its use (e.g. Val{T}() vs Val{T}). Certainly, Tuple{Val{x}} can at least be instantiated as (Val{x}(),), so it is better than Tuple{:A,:B}. Also compare using Val-types vs other singletons in Base like LinearFast() which are passed as values not types.

Anyway, that is just an aside. I have some deeper questions for actual R-data.frames/data.table/dplyr and python-pandas users. I've come up with a macro @select which easily combines the notation of dplyr's select and mutate together (my Tables are immutable, so mutate doesn't even make sense in Julia). It is used like:

@select(table, col1, newname = col2, newcol::newtype = col1 -> f(col1))

where it has three possible behaviours, respectively: selecting a column, renaming a column (and including it) and creating a new column via a row-by-row calculation. The syntax is quite constrained by the semantics of Julia, but any comments or suggested improvements would be welcome!

I tried to emulate dplyr-like syntax since it seems to have the preferable syntax (compared to R's data.table, which people seem to prefer for speed not syntax). As such, I was hoping to implement @filter, @arrange etc. Suggestions and greedy desires from other data manipulators would be useful at this early stage!

@datnamer

@nalimilan
Member

Sounds like a good strategy. Like @datnamer, I think the state of our reflections with regard to this is in DataFramesMeta (see also the issues in that project). It would make sense to follow SQL/LINQ/dplyr as much as possible, unless there's a good reason not to.

There has also been some discussion regarding APIs here: #369.

@datnamer

This may be premature, but are there any thoughts on SQL code generation?

Edit: There is also this https://github.com/MikeInnes/Flow.jl. I wonder if that can help (query planning, delayed eval etc)

@andyferris
Member

@datnamer: quick answer regarding SQL is "no". But thank you to both of you for interesting reads (I definitely had seen DataFramesMeta quite some time ago but hadn't thought about it lately).

Regarding #369, is the point just a syntax where you write code in devectorized form? I think my new @select macro should be fine for this since it uses a hidden comprehension, which allows for pretty generic code inside. The user can declare the element type but not the container type (so it's hard to use NullableArrays or pooled data structures - I need to think more about this).

@nalimilan
Member

Regarding #369, is the point just a syntax where you write code in devectorized form? I think my new @select macro should be fine for this since it uses a hidden comprehension, which allows for pretty generic code inside. The user can declare the element type but not the container type (so it's hard to use NullableArrays or pooled data structures - I need to think more about this).

Well, mostly, but it would also allow you to refer to all existing columns as variables, as well as to create new ones like you would create variables. So it would take a complete code block instead of short expressions passed as separate arguments. See the example I provided there. It's particularly useful when combining several variables using if to create a new one. Then there's also the question of propagating nulls automatically, which is a bit harder.

@nalimilan
Member

Closing as we're not going to make the DataFrame type encode column type information in 1.0.
