ENH: Allow use of colon operator to slice ranges by column names #393

Closed
sglyon opened this issue Nov 6, 2013 · 28 comments

@sglyon

sglyon commented Nov 6, 2013

This seems like reasonable functionality that is currently not implemented:

julia> df = DataFrame(quote
       A = [1:10]
       B = [1:10] .* 2
       C = [1:10] .* 3
       end
       )
10x3 DataFrame:
          A  B  C
[1,]      1  2  3
[2,]      2  4  6
[3,]      3  6  9
[4,]      4  8 12
[5,]      5 10 15
[6,]      6 12 18
[7,]      7 14 21
[8,]      8 16 24
[9,]      9 18 27
[10,]    10 20 30


julia> df["A":"B"]
ERROR: no method colon(ASCIIString,ASCIIString)

I would expect the return value to be something like:

10x2 DataFrame:
          A  B
[1,]      1  2
[2,]      2  4
[3,]      3  6
[4,]      4  8
[5,]      5 10
[6,]      6 12
[7,]      7 14
[8,]      8 16
[9,]      9 18
[10,]    10 20
@johnmyleswhite
Contributor

I don't think this is feasible given Julia's semantics. Let me explain my concerns:

  • Julia doesn't treat length-1 strings and characters as equivalent. Characters have a natural counting order, so that 'a':'d' actually makes sense. In contrast, "a":"d" doesn't really make sense because Julia doesn't impose any counting order on strings: Julia implements only a lexicographic ordering on strings.
  • Even if we were to invent a meaning for "a":"d", I would be fairly opposed to any attempt to make it work for DataFrames, but not work elsewhere in the language. In general, I think Julia has the great virtue of almost purely local semantics, which ensure that expressions have well-defined meanings in all contexts and don't vary based on surrounding factors. Making "a":"d" mean something inside of brackets that it doesn't mean outside of them would break this contract. If "a":"d" were to acquire meaning, that meaning should be defined in the core language, not in a library. If we could get all of the people in charge of Julia's core language to agree on a proper counting ordering for strings, then we could try to do this.
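
The distinction in the first bullet can be checked directly. This is a minimal sketch that assumes a Julia 1.x release, where the failure surfaces as an exception rather than the 2013 "no method colon" message quoted in the original post:

```julia
# Characters have a successor relation, so a range over them is well defined.
chars = collect('a':'d')

# Strings do have a lexicographic ordering, so comparing them works...
ordered = "a" < "d"

# ...but constructing a range from two strings throws, since there is no
# notion of a step from one string to the next.
err = try
    "a":"d"
    nothing
catch e
    e
end

println(chars)
println(ordered)
println(typeof(err))
```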

@kmsquire
Contributor

kmsquire commented Nov 6, 2013

Hi John, I think that "a" and "d" are meant to be column labels in a DataFrame, and that "a":"d" is specifically meant to change meaning depending on the order of the columns in a dataframe (as in pandas). For example, one DataFrame might define columns as ["a", "mean", "var", "d"], and "a":"d" would be interpreted as columns 1 through 4, but a different DataFrame could have ["a", "z", "d"], and "a":"d" would be columns 1:3. I don't think there's any intent or use for "a":"d" to have a global meaning.

The only trick in actually defining the function here is that it has the name colon(a::String, b::String), since colons are a special syntax used to define symbols.

@sglyon
Author

sglyon commented Nov 6, 2013

Thanks @kmsquire, that is exactly what I intended in the original post

@johnmyleswhite
Contributor

I'm not sure I like the idea of allowing the meaning of an expression like "A":"D" to vary depending on the surrounding container. It starts to require something like delayed evaluation. I suppose Julia does already have end, but I'm kind of loath to encourage that kind of magic to spread outside of the core language.

That said, I'll defer to majority opinion if other people really like this idea.

@simonster
Contributor

This proposal bears some ideological similarity to JuliaLang/julia#1032, but I agree with @johnmyleswhite that it's a little awkward, especially outside of Base. From an implementation standpoint, there's no way to avoid giving "a":"d" global meaning; if we define colon(a::String, b::String), no other code can define its own meaning for that syntax.

@kmsquire
Contributor

kmsquire commented Nov 7, 2013

> if we define colon(a::String, b::String), no other code can define its own meaning for that syntax.

Yeah, I see your point. (Actually, I think other code could define the same thing, it's just that the code last compiled would win, which wouldn't be good for consistency...)

One way forward would be to decide in Base that colon(a::String, b::String) always produces a StrColon(a,b) type, and then other code could dispatch on that.

@johnmyleswhite
Contributor

I spent today thinking about this. In addition to @simonster's concerns about introducing a meaning for "a":"d" that's not in Base, what makes me uncomfortable about this is that it will make code hard to reason about in isolation. If this were in Julia code, you would need to know a lot about the context of an indexing operation to know what "a":"d" would evaluate to. I think that's bad for people reading code. It would probably also make it hard to write static analysis tools for Julia.

These concerns are actually not a problem for a construct like a[1:end], which can always be rewritten as a[1:length(a)] without any knowledge about the context in which they are evaluated.
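
To make the rewrite concrete, here is a small sketch assuming an ordinary 1-based Vector (the variable names are only for illustration):

```julia
a = [10, 20, 30, 40]

# `end` inside the brackets refers to the last index of `a`, so for a
# plain Vector both expressions select the same elements.
with_end = a[1:end]
with_length = a[1:length(a)]

println(with_end == with_length)
```

Nothing outside the indexing expression itself is needed to resolve `end`, which is the contrast being drawn with "a":"d".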

@HarlanH
Contributor

HarlanH commented Nov 7, 2013

Another option that doesn't require funny syntax is to put back the group feature in column names. Spencer, for a while, before it got hard to manage, we had a feature where you could give a name to a group of columns, then use that as a reference, or a formula in glm or whatever: outcome ~ Manipulated + Context or whatever.


@sglyon
Author

sglyon commented Nov 7, 2013

I actually came across this feature in an old PR while searching for hierarchical indexing. I noticed that the PR was merged, but was surprised to see that I couldn't find the functionality.

Why did it get removed?

@tshort
Contributor

tshort commented Nov 7, 2013

The grouping feature added quite a bit of complexity that was difficult to support as the code base changed.

@kmsquire
Contributor

kmsquire commented Nov 7, 2013

Countering @johnmyleswhite, the purpose of df["a":"d"] would be to include all columns from "a" to "d". The alternative, currently, would be

julia> df[index(df)["A"]:index(df)["C"]]
10x2 DataFrame:
          A  B
[1,]      1  2
[2,]      2  4
[3,]      3  6
[4,]      4  8
[5,]      5 10
[6,]      6 12
[7,]      7 14
[8,]      8 16
[9,]      9 18
[10,]    10 20

Although I can certainly reason about what it means, I find that notation rather ugly.

Another option, of course, is just to use numbers, as df[1:3]. I find that much harder to reason about, and much harder to write if I want something beyond the 7th column.

@johnmyleswhite
Contributor

I'd like to hear what someone in Julia core thinks of this, since this change might end up affecting the whole language and not just this package.

For me, what's not so great about this approach is that I use strings as indices when I don't care about the order of columns in the DataFrame and I use numbers when I do care.

But to use this syntax, I have to care about the order of the strings -- saying "a":"d" only makes sense if you have perfect knowledge of the order of the columns. What happens when someone adds a new column between "a" and "d"? Your old code breaks unexpectedly?

Without knowing something about all of the columns in the DataFrame, you don't even know how many columns you'll get back. That's a non-trivial change from all of the non-expression based indexing we currently have.

Anyway, I'll back down and merge this kind of change if others really want it.

-- John


@tshort
Contributor

tshort commented Nov 8, 2013

Here are other ideas on this theme.

df[ cols"colZ:colB" ]
df[ :(colZ : colB) ]
df[ colrange(df, "colZ", "colB") ]  # you can do this now, but you might be better off with:
colrange(df, "colZ", "colB")  # again, you can do this now
df[ colrange("colZ", "colB") ]  # here colrange() is a curried function 

If I were to need this a lot (and I don't), I'd probably use the colrange(df, "colZ", "colB") option.

The first two of these ideas could also be used to give column names without quotes like:

df[ cols"colZ, colB, colA" ]
df[ :(colZ, colB, colA) ]

The curried function option is interesting in that you could have a numerics function that selects numeric columns, and it could be used as df[ numerics ].
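
A position-based helper along these lines can be sketched without touching `:` at all. The version below works on a plain vector of column names rather than a DataFrame, and both `colrange` methods are hypothetical illustrations, not necessarily the signature referred to above:

```julia
# Hypothetical helper: positions of all columns from `from` to `to`,
# in the order they appear in `names`.
function colrange(names::Vector{String}, from::String, to::String)
    i = findfirst(isequal(from), names)
    j = findfirst(isequal(to), names)
    (i === nothing || j === nothing) && throw(ArgumentError("unknown column name"))
    return i:j
end

# Curried form: returns a function that resolves the range once it is
# given the column names of a particular table.
colrange(from::String, to::String) = names -> colrange(names, from, to)

cols = ["colA", "colZ", "mean", "colB", "colC"]

r = colrange(cols, "colZ", "colB")   # resolved immediately
selector = colrange("colZ", "colB")  # resolved later, per table

println(r)
println(cols[selector(cols)])
```

In this sketch the result depends on column order: if `to` appears before `from`, the range `i:j` is simply empty.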

Anyway, I think Stefan said once that we already have too many ways to do things, so I probably shouldn't fan the fire:)

@kmsquire
Contributor

kmsquire commented Nov 8, 2013

@johnmyleswhite, it might just be that I use DataFrames in a slightly different way than you're used to.

I have some tables where the format is prespecified (e.g., chromosome name, location, + specific columns with information about those regions), which I mostly interact with in pandas. Order matters, at least for the first 3-8 columns, and ordering within groups somewhat matters after that. There may be 250-300 columns. Of course, I don't want to look at all columns at once, but sometimes I want a group of them where I know the first and last label. Plus I want the genomic location, and possibly some other info from the first few columns. So, e.g., I'd like to be able to do:

df[["CHROM", "POS", "REF", "ALT", "DISEASES_PHENOTYPES":"Consequence_severest"], :]

This tells me a lot about what's in the resulting table (genomic location and disease information).

There might be other ways to do this in julia, and if so, that's great. (@tshort, thanks for the colrange pointer!) I'd just like the method to be not too much less flexible, expressive, or understandable than what I do now in pandas.

@johnmyleswhite
Contributor

That use case does make this seem much more reasonable.

Let's see what @StefanKarpinski, @ViralBShah or @JeffBezanson think. If any of them are on board, I'll stop complaining.

@StefanKarpinski
Member

Overloading : like this seems like a big no-no to me. However, the use-case does make some sense. One thought is to use "bar".."foo" to mean the interval of strings that are lexicographically between "bar" and "foo", but that only helps here if the column names are lexicographically ordered. I kind of think that explicitly taking indices is kind of a good thing since otherwise it's a bit weird for column ordering to be significant. Maybe there could be a convenience function for this?
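
For illustration, the lexicographic-interval reading can be sketched with a plain predicate instead of a `..` operator (which Base does not define for strings). The column names here are made up:

```julia
names_in_table = ["foo", "alpha", "car", "bar", "zeta", "egg"]

# Keep the names that sort between "bar" and "foo" (inclusive) under
# Julia's default string ordering, regardless of their position.
selected = filter(n -> "bar" <= n <= "foo", names_in_table)

println(selected)
```

The selection ignores where the columns sit in the table, which is why it only matches a positional range when the columns happen to be stored in sorted order.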

@nalimilan
Member

+1. Lexicographic order sounds more robust than order of columns in the DataFrame. I think such a feature is supported in common statistical software (SAS, Stata IIRC). A separate colrange() function for the latter would be useful, but not for [].

@johnmyleswhite
Contributor

I'm glad other people are also a little turned off by this suggestion.

Let's bikeshed the best name for colrange to determine how to do this. This should be easy to implement once we agree on the interface.

@kmsquire
Contributor

kmsquire commented Nov 9, 2013

I'd only ask that something like this be permissible:

df[["CHROM", "POS", "REF", "ALT", 
    colrange("DISEASES_PHENOTYPES","Consequence_severest")], :]

@quinnj
Member

quinnj commented Sep 7, 2017

Yeah, supporting (:col1):(:col3) is clearly not going to happen these days, but if someone wants to take a stab at a string macro or use of the .. operator, I think it could be entertained. Part of the issue is that DataFrames has moved to symbol indexing, and :col1..:col3 won't work because it tries to parse the ..: operator, which isn't valid.

@bkamins mentioned this issue Jan 15, 2019
@bkamins
Member

bkamins commented Jul 25, 2019

I would close this. We now have a fixed standard indexing API. If someone needs to do this, Tables.columnindex can be used to get what is needed (it is not the shortest syntax imaginable, but it is good enough IMO):

start, stop = Tables.columnindex.(Ref(df), (:col1, :col2))  # broadcast over both names
select(df, start:stop)

Feel free to reopen this if you disagree.

@bkamins closed this as completed Jul 25, 2019
@nalimilan
Member

I think we should support something like JuliaDB's select(t, Between(start, stop)). That's something that also exists in dplyr.

@nalimilan reopened this Jul 25, 2019
@bkamins
Member

bkamins commented Jul 25, 2019

OK. Then Between should be moved to DataAPI.jl first. I am OK to add this.

@bkamins
Member

bkamins commented Jul 25, 2019

@quinnj + @piever: do you think it should go to DataAPI.jl or Tables.jl?

@piever

piever commented Jul 26, 2019

I don't have a strong preference either way; maybe DataAPI makes the most sense, as it really is just an API. Slightly off-topic, I would also suggest adding the All selector, which takes the union of all selectors: https://juliacomputing.github.io/JuliaDB.jl/latest/api/#IndexedTables.All (if one wants to select two intervals, for example).

@bkamins
Member

bkamins commented Jul 26, 2019

Sure - adding All has been a pending request. So let us move both Between and All to DataAPI.jl.

@quinnj - are you OK with this?

@quinnj
Member

quinnj commented Jul 26, 2019

Sure

@bkamins
Member

bkamins commented Aug 11, 2019

Added in #1914
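
A minimal sketch of the selector discussed above, reusing the data from the original post. It assumes a DataFrames.jl release recent enough to export `Between` and to return column names as strings:

```julia
using DataFrames

df = DataFrame(A = 1:10, B = (1:10) .* 2, C = (1:10) .* 3)

# Between selects every column from the first name to the second,
# by position in this particular data frame.
sub = select(df, Between(:A, :B))

println(names(sub))
```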

@bkamins closed this as completed Aug 11, 2019