[RFC] Establishing a Julian precedent for translating R functions that use delayed evaluation #472
The major complication with this approach is the question of how formulas should work. To be consistent with this idiom, we'd have to write things like One solution, which I'd be alright with is to assume that formulas simply do not have any access to the enclosing scope. They just describe transformations of DataFrames into matrices. This is a little awkward, but formulas already constitute their own DSL, so it's at least defensible. |
Makes sense. But I think I'd prefer writing it with parentheses: Regarding formulas: maybe both Finally, would the existing |
Happy to make a style guide that advocates using parentheses. I've been avoiding them so far because I'd actually prefer having something that reads more like SQL (e.g. As of yesterday, Julia got patched in a way that I think should lead us to fully deprecate I'd like to get rid of I would like to augment |
Actually I don't really like SQL's syntax. :-) It looks very out of place in a C-like syntax like Julia's. Anyway, you only have
About formulas, you mentioned in your first comment that you envisaged disallowing access to the enclosing scope. How would you do that if EDIT: condition on the columns rather than on the rows. |
I'm happy to use To be honest, I haven't fully thought out how formulas will work now that they're baked into Julia's parser, but my sense is that we'll do something like the following:
Saying all of that, the one unfortunate thing is that |
@StefanKarpinksi, it would be great to hear what you think of this proposal since I know you disliked the |
Thinking about this more, one big problem is that the column names need to be known at macro compile-time, which means I'm starting to think that we should just not allow anything that even resembles delayed evaluation. The R idioms we're trying to match are very convenient when you're typing one-off commands, but they don't allow for safe composition. |
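To make the compile-time constraint concrete, here is a minimal sketch showing that a macro receives the caller's syntax rather than run-time values. The macro name `@received` is hypothetical and is not part of DataFrames. If a column name is stored in a variable, the macro only ever sees the variable's name:

```julia
# Hypothetical demo macro: hands back its argument as data, exactly as the macro saw it.
macro received(ex)
    return QuoteNode(ex)
end

col = :A                 # a column name that is only known at run time
seen = @received(col)    # the macro was given the symbol `col`, not the value :A
println(seen)            # prints: col
```

Any macro that needs to know which symbols are column names therefore cannot learn them from a variable like `col`; they have to appear literally in the code.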
Having just read this book section on R's metaprogramming tools, I'm almost completely convinced that the pursuit of saving a few characters is not worth the costs of introducing brittle, non-composable and opaque macros all over our data workflows. I'm going to close this PR. I think it's time to give up on trying to emulate delayed evaluation. What we need is an improved |
For me, saying "very convenient when you're typing one-off commands, On Sat, Jan 11, 2014 at 4:03 PM, John Myles White
|
I don't think we should ever allow anything in Julia to work only in the REPL. I'm a huge believer in "there will only be one right way to do it" after spending the last year trying to get the JuliaData libraries to work properly. Right now, the DataFrames codebase is much too hard to reason about because there are too many different ways of doing the same thing and most of those ways are subtly broken. Rails manages to do lots of clever interactions with databases using, IIRC, only standard function evaluation. It's not at all clear to me that we can't make something that's effective for interactive data analysis and also high-performance. |
I agree with Harlan that convenience is not something to neglect. We should try to find both correct and convenient solutions, as much as possible. Having macros that only work at the REPL doesn't sound like a good idea, but having less efficient, convenient macros for the REPL or user code, and efficient versions for libraries or programs, can be a good solution; that doesn't mean there should be a dozen ways of doing the same thing. That said, what plan do you have in mind to replace |
The problem with the proposed I don't have a clear plan yet, but somehow we need to define a As such, the question becomes how you handle three classes of names:
Possibly we need to introduce an additional quoting mechanism like |
Any construction we're going to introduce needs to be at least as powerful as the following example shows standard indexing is:
Pushing on this topic just makes me appreciate our standard indexing operations even more. |
I like this direction. It's worth keeping open to discuss more options and explore how far this takes us. It's better than indexing with expressions. For the case where the author doesn't know the column name ahead of time, the author could use normal style indexing, like I prefer the function-like syntax over the straight macro syntax. I think I'd also rather see the notation flipped, so that anything not flagged as external (except functions) is assumed to be a column name. The function-like syntax could be extended to LINQ-ish syntax. We still may end up deciding to stick with plain indexing, but I think it's worth exploring more. I do hope we get |
Ok, I'll open a new issue for further debate (or maybe just refer to the existing LINQ issue). I removed the branch behind this PR, so it's not possible to reopen this thread. Assuming that everything's a column name except when flagged seems tricky to get right at the implementation level, although I wholeheartedly agree that it would be a better idiom since it would be consistent with formula models. The code I demoed out is so concise because you can assume that anything in a quote node is safe to treat as a column name. We might pull off the reverse and assume that every non-quoted symbol is a column name, but I'm not totally sure offhand if that would work. It seems tricky to get right since things like function call expressions refer to most entities using symbols. To be clear, I'm ok with |
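A small sketch of why the reversed convention seems tricky, assuming a naive rule of "every bare symbol is a column name". The helper `bare_symbols` is hypothetical and only walks the expression tree; it is not code from the PR:

```julia
# Collect every bare Symbol that appears anywhere in an expression tree.
bare_symbols(ex) = Symbol[]                 # literals, quoted symbols, etc.
bare_symbols(ex::Symbol) = [ex]
bare_symbols(ex::Expr) =
    reduce(vcat, (bare_symbols(a) for a in ex.args); init = Symbol[])

found = bare_symbols(:(log(A) > x))
# found == [:>, :log, :A, :x]
```

The function names `>` and `log` come back alongside `A` and `x`, so a rule based on bare symbols would need extra logic to tell functions, caller variables, and columns apart. Quoted symbols such as `:A` parse to `QuoteNode`s and are never confused with any of those.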
I'm good with removing |
At one point
|
Reversing the syntax and assuming symbols are column names if not escaped (in a way we need to find, but it could also be |
This small PR implements a basic `@select` macro. This macro is just a demo, which I'm using to start a conversation about how I'd like Julia to approach problems for which R would make use of delayed evaluation.
The core premises behind this new approach are simple:
Premise 1: Many uses of delayed evaluation in R can be conceptualized as the extension of the caller's scope to include new variables that correspond to the column names of a DataFrame. This can only be done at run-time when the names of the columns are known, which is why we've been imitating R operations by passing expressions to DataFrames. But that approach has proven hard to work with, because it means that many parts of the caller's scope are difficult to access without careful escaping operations. It also makes code a little ugly, because one has to write things like `df[:(A > $x), :]`.
Premise 2: Julia should not provide delayed evaluation until we know that it is unavoidable. Instead of doing computations that augment the caller's scope, we should provide a tool for mixing column names with the caller's scope that can be handled entirely at macro compile-time. This requires that the user state at code-writing time which variables are column names. I would argue that this added burden is a good thing, because it makes the magic we're about to perform more explicit and systematic.
Premise 3: We can have users distinguish the caller's scope from the column names of a DataFrame by asking them to write explicit symbols, which are turned into quoted symbols by the parser. Thus the caller's `x` is written as `x`, whereas the column name `x` is written as `:x`.
The examples below demonstrate this approach and show that it substantially simplifies indexing DataFrames using complex expressions:
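As a rough illustration of Premise 3, here is a minimal, dependency-free sketch. It is not the PR's actual `@select` implementation. The names `@select_rows` and `rewrite_columns` are hypothetical, and a `NamedTuple` of equal-length vectors stands in for a DataFrame. Quoted symbols are rewritten into column lookups at macro-expansion time, and everything else is left to the caller's scope:

```julia
# Rewrite an expression so that each quoted symbol (`:A` parses to QuoteNode(:A))
# becomes a lookup of that column at row `i` of `tbl`. All other nodes are untouched.
rewrite_columns(ex, tbl, i) = ex
rewrite_columns(ex::QuoteNode, tbl, i) =
    ex.value isa Symbol ? :(getproperty($tbl, $ex)[$i]) : ex
rewrite_columns(ex::Expr, tbl, i) =
    Expr(ex.head, (rewrite_columns(a, tbl, i) for a in ex.args)...)

# Hypothetical macro: returns the indices of the rows for which `cond` holds.
macro select_rows(tbl, cond)
    t, i = gensym("tbl"), gensym("i")     # fresh names, so caller variables are not captured
    body = rewrite_columns(cond, t, i)
    return esc(quote
        let $t = $tbl
            [$i for $i in eachindex(first($t)) if $body]
        end
    end)
end

# Assumed stand-in for a DataFrame: a NamedTuple of column vectors.
df = (A = [1, 5, 3, 8], B = ["w", "x", "y", "z"])
x = 2                                     # lives in the caller's scope, written as plain `x`
rows = @select_rows(df, (:A > x) && (:B != "z"))
# rows == [2, 3]
```

Because the whole rewrite happens on syntax, no expression has to be evaluated against the DataFrame at run-time, and `x` needs no `$` escaping. The cost is the one noted in the thread: column names must appear literally in the code.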
Importantly, the indirection introduced by the `@select` macro will allow us to gradually reimplement these operations in more efficient ways that exploit indices in DataFrames. It will also allow us to implement these operations in a way that, like LINQ, would apply to both DataFrames and databases.