
[RFC] Establishing a Julian precedent for translating R functions that use delayed evaluation #472

Closed

Conversation

johnmyleswhite
Contributor

This small PR implements a basic @select macro. This macro is just a demo, which I'm using to start a conversation about how I'd like Julia to approach problems for which R would make use of delayed evaluation.

The core premises behind this new approach are simple:

Premise 1: Many uses of delayed evaluation in R can be conceptualized as the extension of the caller's scope to include new variables that correspond to the column names of a DataFrame. This can only be done at run-time when the names of the columns are known, which is why we've been imitating R operations by passing expressions to DataFrames. But that approach has proven hard to work with, because it means that many parts of the caller's scope are difficult to access without careful escaping operations. It also makes code a little ugly, because one has to write things like df[:(A > $x), :].

Premise 2: Julia should not provide delayed evaluation until we know that it is unavoidable. Instead of doing computations that augment the caller's scope, we should provide a tool for mixing column names with the caller's scope that can be handled entirely at macro compile-time. This requires that the user state at code-writing time which variables are column names. I would argue that this added burden is a good thing, because it makes the magic we're about to perform more explicit and systematic.

Premise 3: We can have users distinguish the caller's scope from the column names of a DataFrame by asking them to write explicit symbols, which are turned into quoted symbols by the parser. Thus the caller's x is written as x, whereas the column name x is written as :x.

The examples below demonstrate this approach and show that it substantially simplifies indexing DataFrames using complex expressions:

using DataArrays
using DataFrames

df = DataFrame(A = 1:3, B = [2, 1, 2])
x = [2, 1, 0]

@select df :A .> 1
@select df :B .> 1
@select df :A .> x
@select df :B .> x
@select df :A .> :B
@select df sin(:A) .> cos(:B)
@select df cos(:A) .> sin(:B)
@select df cos(x) .> sin(:B) + 1
@select df sin(x) .> sin(:B) + 1
@select df sin(:A) .> sin(:B) + 1
@select df sin(:A) .> sin(:B)
@select df sin(:A) .> exp(cos(:B) - 1)
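
To make the mechanics concrete, here is a minimal sketch of the rewriting idea. It is not the PR's implementation: rewrite_columns and @select_sketch are made-up names, it uses present-day DataFrames indexing (df[!, :A]), and it takes an "escape everything" shortcut to macro hygiene purely for brevity. The point is that quoted symbols become column lookups, while every other name, including functions and variables like x, still resolves in the caller's scope.

using DataFrames

# Walk the condition expression, replacing quoted symbols (:A) with column lookups.
rewrite_columns(ex, d) =
    ex isa QuoteNode ? :($d[!, $ex]) :
    ex isa Expr      ? Expr(ex.head, map(a -> rewrite_columns(a, d), ex.args)...) :
    ex

macro select_sketch(df, condition)
    d = gensym("df")                   # temporary, so the DataFrame expression is evaluated once
    mask = rewrite_columns(condition, d)
    # Deliberately unhygienic: escape the whole expansion so that x, sin, etc.
    # resolve where the macro is called.
    esc(quote
        let $d = $df
            $d[$mask, :]
        end
    end)
end

# For example, @select_sketch df :A .> x expands to (roughly) df[df[!, :A] .> x, :].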

Importantly, the indirection introduced by the @select macro will allow us to gradually reimplement these operations in more efficient ways that exploit indices in DataFrames. It will also allow us to implement these operations in a way that, like LINQ, would apply to both DataFrames and databases.

@johnmyleswhite
Contributor Author

The major complication with this approach is the question of how formulas should work. To be consistent with this idiom, we'd have to write things like lm(:Y ~ :X), which I don't like. It's not clear to me that we can easily reverse the sense of :, though, and use it only for variables rather than for column names.

One solution, which I'd be alright with, is to assume that formulas simply do not have any access to the enclosing scope. They just describe transformations of DataFrames into matrices. This is a little awkward, but formulas already constitute their own DSL, so it's at least defensible.

@nalimilan
Member

Makes sense. But I think I'd prefer writing it with parentheses: df2 = @select(df, :A .> 1) looks more standard.

Regarding formulas: maybe both lm(:Y ~ :X) and lm(:(Y ~ X)) could be supported, except that in the latter case you wouldn't be allowed to access the enclosing scope (most common case).

Finally, would the existing df[:(A .> 1), "col1"] syntax still be supported? It has the advantage of being similar to general array indexing. Would @select allow selecting one or several columns, i.e. would it completely supersede getindex()?

@johnmyleswhite
Contributor Author

Happy to make a style guide that advocates using parentheses. I've been avoiding them so far because I'd actually prefer having something that reads more like SQL (e.g. newdf = @SELECT FROM df WHERE :A > 1), but that's somewhere between really hard and impossible to implement given Julia's current parser.

As of yesterday, Julia got patched in a way that I think should lead us to fully deprecate lm(:(Y ~ X)). See point below for rationale.

I'd like to get rid of df[:(A .> 1), "col1"] completely since I think it encourages the bad habit of passing expressions around for run-time evaluation. We should not ever be calling eval in our codebase.

I would like to augment @select so that you could write @select df (:ColA, :ColB) :ColC .> x.

@nalimilan
Member

Actually I don't really like SQL's syntax. :-) It looks very out of place in a C-like language like Julia. Anyway, you'd only have @select, not FROM or WHERE.

@select df (:ColA, :ColB) :ColC .> x would be nice, but why change the order of arguments? @select(df, :ColC .> x, (:ColA, :ColB)) sounds more logical to me. There's also the debate of using a tuple rather than an array (as in getindex()). Finally, we need a way to select columns without any conditions on the rows. Something like @select(df, :, (:ColA, :ColB))?

About formulas, you mentioned in your first comment that you envisaged disallowing access to the enclosing scope. How would you do that if lm(:(Y ~ X)) is not supported?

EDIT: condition on the columns rather than on the rows.

@johnmyleswhite
Contributor Author

I'm happy to use @select(df, :ColC .> x, (:ColA, :ColB)) to minimize the changes between asking for all columns and asking for only a subset. join already doesn't take parameters in the standard SQL order, so there's no need to try hard to get more SQL compatibility.
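
Illustratively, and assuming the same kind of expansion sketched earlier in the thread, the column tuple would only change the trailing part of the generated indexing expression:

# Hypothetical expansion, for illustration only:
#   @select(df, :ColC .> x, (:ColA, :ColB))
# would become, roughly,
#   df[df[!, :ColC] .> x, [:ColA, :ColB]]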

To be honest, I haven't fully thought out how formulas will work now that they're baked into Julia's parser, but my sense is that we'll do something like the following (a rough sketch follows the list):

  • lm(Z ~ X + Y) will be parsed as lm(@~(Z, X + Y)) in Julia going forward.
  • @~(Z, X + Y) will expand to Formula(:(Z), :(X + Y)).
  • ModelFrame(Formula(:Z, :(X + Y)), df) will interpret all variables as column names or transformations of column names relative to df.
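
For concreteness, here is a rough sketch of that desugaring, written as an explicit macro since the exact parser hook isn't settled here. FormulaSketch and @formula_sketch are made-up names, and the real Formula/ModelFrame machinery differs; the point is only that both sides of the formula are stored unevaluated, so a ModelFrame-like consumer can later interpret every symbol in them as a column of the supplied DataFrame.

struct FormulaSketch
    lhs::Any   # e.g. :Z
    rhs::Any   # e.g. :(X + Y)
end

macro formula_sketch(ex)
    (ex isa Expr && ex.head == :call && ex.args[1] == :~) ||
        error("expected an expression of the form lhs ~ rhs")
    :(FormulaSketch($(QuoteNode(ex.args[2])), $(QuoteNode(ex.args[3]))))
end

f = @formula_sketch Z ~ X + Y
f.lhs   # :Z
f.rhs   # :(X + Y)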

Having said all of that, the one unfortunate thing is that ModelFrame needs to add every function we want to support to its DSL definition. Evaluating some things in the enclosing scope would be nice, but it seems hard to do that without then calling hcat instead of just filling in a matrix column by column.

@johnmyleswhite
Contributor Author

@StefanKarpinski, it would be great to hear what you think of this proposal since I know you disliked the Expr indexing.

@johnmyleswhite
Contributor Author

Thinking about this more, one big problem is that the column names need to be known at macro compile-time, which means @select can't be used inside of other functions that might take in column names as arguments.

I'm starting to think that we should just not allow anything that even resembles delayed evaluation. The R idioms we're trying to match are very convenient when you're typing one-off commands, but they don't allow for safe composition.
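
For anyone following along, the failure mode looks like this with the sketch from earlier in the thread (or with any scheme that rewrites quoted symbols at macro-expansion time):

# The column name only exists at run time, so there is no quoted symbol for the
# macro to rewrite; the expansion looks for a column literally named :colname.
function filter_above(df, colname, threshold)
    @select_sketch df :colname .> threshold
end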

@johnmyleswhite
Contributor Author

Having just read this book section on R's metaprogramming tools, I'm almost completely convinced that the pursuit of saving a few characters is not worth the costs of introducing brittle, non-composable and opaque macros all over our data workflows. I'm going to close this PR. I think it's time to give up on trying to emulate delayed evaluation.

What we need is an improved select function that only uses conventional function evaluation techniques.

@johnmyleswhite johnmyleswhite deleted the jmw/select branch January 11, 2014 21:30
@HarlanH
Contributor

HarlanH commented Jan 11, 2014

For me, saying "very convenient when you're typing one-off commands, but..." raises a red flag. I think we should really focus on finding ways to do exactly that -- make Julia convenient for interactive data analysis. R's level of convenience probably isn't viable, but anything that takes the analyst out of their flow and makes them focus on the code is problematic, and I hope it can be minimized. But I don't think the idiomatic, abbreviated syntax used for interactive work necessarily needs to be the recommended syntax for code written in functions. Is there a solution where macros that emulate delayed evaluation work only at the REPL, with more straightforward approaches for programming?

@johnmyleswhite
Contributor Author

I don't think we should ever allow anything in Julia to work only in the REPL.

I'm a huge believer in "there will only be one right way to do it" after spending the last year trying to get the JuliaData libraries to work properly. Right now, the DataFrames codebase is much too hard to reason about because there are too many different ways of doing the same thing and most of those ways are subtly broken.

Rails manages to do lots of clever interactions with databases using, IIRC, only standard function evaluation. It's not at all clear to me that we can't make something that's effective for interactive data analysis and also high-performance.

@nalimilan
Member

I agree with Harlan that convenience is not something to neglect. We should try to find solutions that are both correct and convenient, as much as possible. Having macros that only work at the REPL doesn't sound like a good idea, but having convenient, less efficient macros for the REPL or user code, and efficient versions for libraries or programs, can be a good solution; that doesn't mean there should be a dozen ways of doing the same thing.

That said, what plan do you have in mind to replace @select?

@johnmyleswhite
Contributor Author

The problem with the proposed @select macro isn't that it's slow (it's most definitely not): it's that it creates a tool that can't be used reliably inside of other functions because all of the column names have to be fully known at code-writing time. We can't introduce features that will fail in unexpected and subtle ways unless you're familiar with the internals of DataFrames. Our functionality needs to be much less subtle than that. The really important type of convenience for Julia is having a language that isn't totally incomprehensible to end users. In this case, that means that whatever solution we pick, it needs to somehow also work if the names of the columns are stored in variables.

I don't have a clear plan yet, but somehow we need to define a select function that takes in a data store and a set of constraints and returns a subset of the data store that satisfies those constraints (a rough sketch of this shape appears at the end of this comment). Rails used to do things like find(db, constraints = {:date => today()}). What we need is a mechanism that extends that approach in a way that will allow us to specify Boolean expressions over constraints (ANDs and ORs) where the constraints can refer both to the column names (likely specified as symbols) and to variables in the enclosing scope. The trick is that you want to be able to write expressions like sin(exp(ColA)) that refer to transformations of the columns of the data store.

As such, the question becomes how you handle three classes of names:

  • Names that refer to values in the enclosing scope and can be evaluated before evaluating the select function.
  • Names that refer to the data store scope, which can be replaced with references to the columns of the data store.
  • Names that refer to values in the enclosing scope, which happen to refer to names in the data store scope.

Possibly we need to introduce an additional quoting mechanism like quoted(x) for this third case.
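
As a concrete starting point for the first two classes, here is a purely function-based sketch; select_rows is a made-up name, not an existing DataFrames API. Constraints pair a column symbol with an element-wise predicate, so the predicate can close over values from the enclosing scope, and the column symbol is ordinary data that can itself be stored in a variable:

using DataFrames

function select_rows(df::DataFrame, constraints::Pair...)
    mask = trues(nrow(df))
    for (col, pred) in constraints
        mask .&= pred.(df[!, col])     # AND the constraints together
    end
    return df[mask, :]
end

df = DataFrame(Col1 = 1:2, Col2 = 3:4)
y = 0
colname = :Col2
select_rows(df, :Col1 => (v -> v > y), colname => (v -> sin(v) > 0.5))

In this shape the third class arguably never arises, since columns are only ever named by explicit symbols and everything else is evaluated by the ordinary rules.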

@johnmyleswhite
Contributor Author

Any construction we're going to introduce needs to be at least as powerful as the following example shows standard indexing is:

df = DataFrame(Col1 = 1:2, Col2 = 3:4)
x = [2, 1]
y = 0
z = "Col2"
df[(x .> 1) & (df["Col1"] .> y), z]

Pushing on this topic just makes me appreciate our standard indexing operations even more.
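
For reference, the same query written against present-day DataFrames indexing, still using nothing but ordinary evaluation:

mask = (x .> 1) .& (df[!, "Col1"] .> y)   # the caller's x and y mix freely with column lookups
df[mask, z]                               # z holds the name of the column to return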

@tshort
Contributor

tshort commented Jan 12, 2014

I like this direction. It's worth keeping this open to discuss more options and explore how far it takes us. It's better than indexing with expressions. For the case where the author doesn't know the column name ahead of time, they could use normal-style indexing, like @select df :A .> df[colname], or we could find another way to designate that, maybe something like @select df :A .> ~colname.

I prefer the function-like syntax over the straight macro syntax. I think I'd also rather see the notation flipped, so that anything not flagged as external (except functions) is assumed to be a column name. The function-like syntax could be extended to LINQ-ish syntax.

We still may end up deciding to stick with plain indexing, but I think it's worth exploring more. I do hope we get df.Col1. That will ease things if we end up just using regular indexing. It'll tend to encourage short DataFrame names, but I do that already anyway.

@johnmyleswhite
Contributor Author

Ok, I'll open a new issue for further debate (or maybe just refer to the existing LINQ issue). I removed the branch behind this PR, so it's not possible to reopen this thread.

Assuming that everything's a column name except when flagged seems tricky to get right at the implementation level, although I wholeheartedly agree that it would be a better idiom, since it would be consistent with formula models. The code I demoed is so concise because you can assume that anything in a quote node is safe to treat as a column name. We might pull off the reverse and assume that every non-quoted symbol is a column name, but I'm not totally sure offhand that it would work. It seems tricky to get right since things like function call expressions refer to most entities using symbols.
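
To make that last point concrete, here is what a macro actually receives (plain Julia, nothing DataFrames-specific):

# Under the reversed convention the macro would have to decide that :sin names a
# function while :A and :B name columns, yet all three arrive as plain symbols;
# only their position in the call structure distinguishes them.
ex = :(sin(A) > B)
ex.head          # :call
ex.args          # [:>, :(sin(A)), :B]
ex.args[2].args  # [:sin, :A]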

To be clear, I'm ok with df.Col1 as long as we remove df["Col1"]. I just don't want both to exist. I'd be totally happy simultaneously imposing the rule that every column name must be a valid Julia identifier since that's an invariant we'd need to make df.Col1 indexing work everywhere and would also make formulas work better.

@tshort
Contributor

tshort commented Jan 12, 2014

I'm good with removing df["Col1"] as long as there's a way to access a column with a variable, maybe df[:, mycol] or df.(mycol). That's a good point on column names being valid Julia identifiers.

@johnmyleswhite
Contributor Author

At one point df.(mycol) would have worked. Not sure if that's still true.

df[:, mycol] is potentially in danger of not working in the future as a mechanism for accessing the column as a whole. Specifically, : is still magically rewritten by the parser (we actually do this ourselves now in the newest versions of the @data and @pdata macros) into a specific range object. Since we're going to start producing SubDataArrays soon under indexing, that means you wouldn't get the underlying DataArray, but only a view on it. I'd guess that @StefanKarpinski will therefore rewrite how we handle : as part of his work on array views.

@nalimilan
Member

Reversing the syntax and assuming symbols are column names unless escaped (in a way we still need to settle on, though it could also be $, as an extension of interpolation, instead of ~) sounds like a good solution. The practical implementation needs investigation, but at least there does not seem to be any flaw in the design (which is what matters most). Maybe that would be a place where staged functions would help (to know which symbol is a function and which is a variable).
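
For what it's worth, under that convention a call site might read something like the following; this is hypothetical syntax, shown only to illustrate the escape:

# Bare symbols (A, B) would be treated as column names, while $x would be
# interpolated from the caller's scope, by analogy with expression interpolation:
#   @select(df, sin(A) .> sin(B) .+ $x)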
