
Clarify position on iteration API #2254

Closed
andyferris opened this issue May 14, 2020 · 34 comments · Fixed by #2662
Labels
decision, non-breaking (The proposed change is not breaking)
Milestone

Comments

@andyferris
Member

andyferris commented May 14, 2020

I watched the awesome recent AMA and realized you guys are preparing for a 1.0 release 🎉

There's one decision about DataFrames I haven't got my head around, and I'm unsure whether it would lead to breaking API changes: the iteration API.

In some cases a DataFrame opts into some iteration API functions:

first(df)
last(df)
filter(row -> predicate(row.column), df)

while in other cases it does not:

iterate(df)
map(df)
reduce(df)

It would seem cleanest to me to pick a side of the fence and either:

  1. Implement row-iteration of DataFrame.
  2. Require filter(predicate, eachrow(df)), first(eachrow(df)), last(eachrow(df)) before releasing a version 1.0.

If doing 1, we'd want to check whether any implemented methods contradict this interpretation before releasing 1.0, but the actual implementation could come later. I think most of the column-iteration functionality has been removed, though.

Sorry if this has been discussed or decided elsewhere and I'm not aware of the outcome.


For some background, I quite often use filter on DataFrames and find this interface quite useful. I do tend to enjoy using Vector{NamedTuple} and TypedTables.jl because I also have access to map, reduce and mapreduce from Base, plus all the functions in SplitApplyCombine which exist for, well, splitting, and applying, and combining, especially for tables. I would love to support DataFrame in SplitApplyCombine, for example. (And there are many other great packages out there that rely on iteration, too).

My personal bias has always been towards "a table is a relation, which is a collection of rows, and we can use relational algebra to manipulate it". I don't see a particular advantage in having DataFrame support neither column nor row iteration (requiring eachrow and eachcol) when you could choose one by default.
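
As a point of comparison, here is a Base-only sketch of option 1's end state, modeling rows as plain NamedTuples (an assumption of this sketch; DataFrames would pass DataFrameRows instead). When a table is literally a collection of rows, the whole collection API comes for free:

```julia
# A table modeled as a plain collection of rows: every Base
# collection function works without any special-casing.
rows = [(name = "a", x = 1), (name = "b", x = 2), (name = "c", x = 3)]

first(rows)                 # first row
last(rows)                  # last row
filter(r -> r.x > 1, rows)  # the last two rows
map(r -> r.x * 10, rows)    # [10, 20, 30]
foldl((acc, r) -> acc + r.x, rows; init = 0)  # 6
```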

@bkamins
Member

bkamins commented May 14, 2020

In general I tend towards supporting the iteration API. There are, however, the following issues that need to be discussed.

  1. I think iterate is just fine to add; then we are also fine to add other methods in Base which rely on a "row"-oriented interface (#2251 "Implement empty / empty! as the truncate operation", #2229 "findfirst/findlast/nextind/prevind not working for eachcol in v0.21.0" applied also to AbstractDataFrame, reintroducing length)
  2. Deciding on #2194 "filter and broadcasting for GroupedDataFrame"
  3. Dealing with #2097 "rand(::GroupedDataFrame) sampler?" for GroupedDataFrame and DataFrame; similarly #2088 "Additional functions supported for DataFrame.jl" and #2048 "Add shuffle, shuffle! functions"
  4. With map and reduce there is the problem of what the output type of the result should be (#2053 "add map for AbstractDataFrame"; this is also the reason why we deprecated map for GroupedDataFrame)
  5. There is a tricky question about what type to pass to functions (DataFrameRow, NamedTuple, or positional arguments), but I think this can be worked out later (in general we should have different options here, as different things are good depending on whether the table is wide or not)

Out of these, point 4 (and in consequence point 2) is the most crucial and hardest to decide, so I would concentrate the discussion on it first. The general questions are:

  • if I apply map to a DataFrame and we process it row-wise what should be the type of the result
  • if I apply reduce to a DataFrame and we process it row-wise what should be the type of the result
  • if I apply map to a GroupedDataFrame what should be the type of the result
  • if I apply reduce to a GroupedDataFrame what should be the type of the result
  • if I apply broadcasting to a GroupedDataFrame what should be the type of the result
  • relatedly (does not have to be resolved now and in this thread but I note it for completeness) if I apply broadcasting to a DataFrameRow what should be the type of the result (here a link is to NamedTuple which currently does not allow broadcasting)

@bkamins bkamins added decision non-breaking The proposed change is not breaking labels May 14, 2020
@bkamins bkamins added this to the 1.0 milestone May 14, 2020
@nalimilan
Member

To add one point to @bkamins' list: one tricky point is that broadcast applies the function to each cell (which is very useful). So if we supported map to apply an operation on each row it would be a bit inconsistent. Not necessarily a problem, though.

Overall I don't think we have to settle this before 1.0. At #1514 we decided DataFrames are collections of rows, so functions from the collection API that we implement follow that convention. But we don't have to implement all of them right now, especially since you can always use eachrow.

@piever

piever commented May 14, 2020

Concerning the point about reduce, I would expect reduce(gdf::GroupedDataFrame, :old => + => :new) to be the same as combine(gdf, :old => sum => :new) (once the groups are reduced to just one element, IMO there isn't much point in keeping the grouping).

I think reduce on a GroupedDataFrame is relevant because you get support for:

  • Transducers
  • OnlineStats

With the added bonus that these operations would multithread automatically.
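
piever's efficiency point can be illustrated in plain Base: a streaming group reduction never materializes per-group subvectors. This is a hypothetical helper (groupreduce is a made-up name, not DataFrames.jl API), operating on a Vector{NamedTuple} stand-in for a grouped table:

```julia
# Made-up sketch: fold each value into its group's accumulator on
# the fly, instead of extracting per-group subvectors first. This is
# the optimization a dedicated `reduce` entry point makes possible.
function groupreduce(op, keycol::Symbol, valcol::Symbol, rows; init)
    out = Dict{Any,typeof(init)}()
    for r in rows
        k = getproperty(r, keycol)
        out[k] = op(get(out, k, init), getproperty(r, valcol))
    end
    return out
end

rows = [(g = "a", x = 1), (g = "b", x = 2), (g = "a", x = 3)]
groupreduce(+, :g, :x, rows; init = 0)  # Dict("a" => 4, "b" => 2)
```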

@bkamins
Member

bkamins commented May 14, 2020

So my thinking was a bit different. I thought that reduce(op, gdf::GroupedDataFrame) would apply op as a reduction to elements of gdf (which are SubDataFrames).

On the other hand reduce(op, df::AbstractDataFrame) would operate row-wise (with possibly allowing column selectors).

So what you write would rather be combine(gdf, :old => (x -> reduce(+, x)) => :new), which is already allowed (and the same as just combine(gdf, :old => sum => :new), as you noted).

@piever

piever commented May 14, 2020

So what you write would be rather combine(gdf, :old => (x -> reduce(+, x)) => :new) which is already allowed (and the same as just combine(gdf, :old => sum => :new) as you noted)

In my experience with JuliaDB, the problem is that this approach is much more wasteful, because you need to actually extract the subvectors. OTOH if you know that the user is applying a reduce operation, many optimizations become possible (compared to :old => (x -> reduce(+, x)) => :new, where you cannot know that the user did not need the whole vector).

I suspect the same rationale applies here, because you can simply perform the reduction when iterating on gdf.idxs with gdf::GroupedDataFrame.

@bkamins
Member

bkamins commented May 14, 2020

Yes, this is true (and internally we do it in a different way). I am simply referring to what the API of reduce should be (i.e. it should reduce over an iterable, not over iterables nested within this iterable).

@nalimilan
Member

Yes, overall one potential issue is that iteration over DataFrame and GroupedDataFrame gives very different objects. So all functions that rely on it can appear quite inconsistent/unintuitive to users who want to apply them to both types. OTOH select, transform and combine have more consistent behaviors in that regard, but they don't follow iteration.

Something that just occurred to me is that we could use combine(gd, :x => Reduce(op) => :newx) for reductions (similar to ByRow). That's actually how it works internally, and the corresponding objects could almost be exported as-is. Then we wouldn't need a new function at all.

@bkamins
Member

bkamins commented May 14, 2020

we could use combine(gd, :x => Reduce(op) => :newx)

💯

However, this will solve the issue only partially, we still have many other functions to decide how they should be handled (or not handled - what we do now, because we are unsure what we want 😄).

@andyferris
Member Author

andyferris commented May 18, 2020

Awesome. 😄

In general I note a lot of interface questions like "what should map on ... return?". I agree all this is hard - very hard - in fact I would suggest not answering these questions at all and deferring to some higher interface!

For example, a Vector{NamedTuple} makes a relatively handy table/dataframe, straight from Base. A NamedTuple is a bit too specific (much like Tuple compared to AbstractVector) and really needs generalizing to a more versatile row type; while we have the Tables.jl definition based on getproperty these days, I'm thinking more about Dictionaries.AbstractDictionary{Symbol} for rows. So as an interface for a table or dataframe I'm thinking of, basically, AbstractArray{<:AbstractDictionary{Symbol, Any}} where each row has identical indices (i.e. keys(table[1]) === keys(table[2]), etc.).

This immediately restricts you (in a good way!). Starting with df::AbstractArray{<:AbstractDictionary{Symbol, Any}}:

  • if I apply map to a DataFrame and we process it row-wise what should be the type of the result.

map(f, df) always gives an array; if f returns rows then the answer is also a table/dataframe.

  • if I apply reduce to a DataFrame and we process it row-wise what should be the type of the result

For mapreduce(f, op, df) it depends on f and op, but you are just doing mapreduce over an AbstractArray, so the answer is given.

  • if I apply map to a GroupedDataFrame what should be the type of the result
  • if I apply reduce to a GroupedDataFrame what should be the type of the result

If a table is an array-of-dictionaries what is your grouped data frame? I see space for two different objects, actually, which you want to be able to switch between with ease. The first is a dictionary-of-tables (i.e. dictionary-of-arrays-of-dictionaries), which is what SplitApplyCombine.group gives when the input is a table, and it's pretty easy to use e.g. map on each group before combining them together again. The second is where you may also want to turn that object into a view of a "flat" table where the outer dictionary key becomes one of the columns (so the internals are akin to what JuliaDB calls NDSparse or what SQL DBs call "partitioned tables"). With the former you can map/filter/reduce over the groups, with the latter you can map/filter/reduce over the individual rows - just depends what you want to do.
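
The "dictionary-of-tables" view above can be sketched in plain Base; group_by is a made-up helper here, mimicking what SplitApplyCombine.group does for a table input:

```julia
# Made-up helper mimicking SplitApplyCombine.group: a grouped table
# as a dictionary-of-tables (a dictionary of arrays of rows).
function group_by(f, rows)
    groups = Dict{Any,Vector{eltype(rows)}}()
    for r in rows
        push!(get!(groups, f(r), eltype(rows)[]), r)
    end
    return groups
end

rows = [(g = "a", x = 1), (g = "b", x = 2), (g = "a", x = 3)]
gd = group_by(r -> r.g, rows)
# map/filter/reduce now operate over the *groups*:
Dict(k => length(v) for (k, v) in gd)  # Dict("a" => 2, "b" => 1)
```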

IMO using external interfaces like AbstractVector and some container type which can represent "rows" will not only make your life easier as implementers but more importantly your users won't have to guess because they will just be using the interface they know from Base. Even if you don't literally subtype AbstractVector you can still use these arguments to drive your intuition (and you can bet your users probably will, whether you want them to or not).


To add one point to @bkamins' list: one tricky point is that broadcast applies the function to each cell (which is very useful).

I feel this is general; I would frequently like to use broadcast on Vector{Vector{Float64}} say but we don't have syntax for that - I would be very tempted to find an elegant solution to the general problem and use that instead. (In short I personally feel that broadcast should generalize map and they should coincide when the keys of all the inputs are identical, which is what happens with arrays).

@nalimilan
Member

I agree we should decide based on comparisons with vectors of rows. The answer is pretty clear: map on data frames should operate on rows and return a data frame, and map on GroupedDataFrame should operate on data frames and return a GroupedDataFrame. But I don't think I'd be willing to trade the current behavior of broadcast on data frames, which is very useful, for consistency with map, which is less useful IMO.

@bkamins
Member

bkamins commented May 18, 2020

As for broadcast I am pretty sure that what we do now (treat data frame as a matrix) is what we should keep. It is inconsistent with the rest of the design (unfortunately), but this is what all users asked for for a long time. Also something like:

df[!, several_new_cols_and_some_old_cols] .= 0

makes perfect sense currently, but it is only possible because we treat data frame like a matrix.

@andyferris
Member Author

andyferris commented May 18, 2020

and map on GroupedDataFrame should operate on data frames and return a GroupedDataFrame

To be clear - I'd argue the type of map(f, gdf) would depend on what f returns. If data frames are returned then yes you'd get a GroupedDataFrame back, but if it returns rows or scalars you might want something else.

@nalimilan
Member

Ah yes that's an interesting option.

@bkamins
Member

bkamins commented May 19, 2020

But this would then have to apply to map on DataFrame also - right?

@andyferris
Member Author

I suppose I'd want a Vector most of the time when the result of f in map(f, df) is not some kind of "row"?

@tkf
Contributor

tkf commented May 20, 2020

I agree with @andyferris. I think map(f, df) should return a DataFrame only if f returns a row-like object. Otherwise a plain Vector makes sense (e.g., map(_ -> 1.0, df) should give ones(length(df))). For this, I think we need something like

Tables.isrow(::NamedTuple) = true
Tables.isrow(::AbstractDict{<:Union{Symbol,String}}) = true
Tables.isrow(::DataFrameRow) = true
Tables.isrow(_) = false

Likewise, I think map(f, gdf) should return a GroupedDataFrame only if f returns a rowtable-like object (or maybe table-like is OK). We have Tables.isrowtable so this is implementable. But what should happen in other cases? Just return a Dict{GroupKey,Whatever}? Or would it be nice to have GroupedDataRow ("dictionary-of-dictionary") when f returns a row-like object? What about GroupedDataCell ("dictionary")?


BTW, I find it nice to conceptualize map by

map(f, xs) = reduce(concatenate, (singleton(k, f(v)) for (k, v) in pairs(xs)))

where xs is either a DataFrame or GroupedDataFrame. Defining map is equivalent to finding appropriate definitions of concatenate and singleton. Note that the output type of singleton would be the output type of map. In general, singleton can depend on the input type (i.e., singleton(k, f(v), typeof(xs)) instead of singleton(k, f(v))).

Importantly, this connects map and reduce and avoids inconsistency in the design.

For example, map(f, ::Vector) in Base can be implemented (conceptualized) with concatenate = vcat and singleton(_, x) = [x].
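
That Vector case can be checked in a few lines (mymap and concat are throwaway names for this sketch, not proposed API):

```julia
# Verifying the map-via-reduce conceptualization for Vector,
# with concatenate = vcat and singleton(_, x) = [x].
concat(a, b) = vcat(a, b)
singleton(_, x) = [x]
mymap(f, xs) = reduce(concat, (singleton(k, f(v)) for (k, v) in pairs(xs)))

mymap(x -> x^2, [1, 2, 3]) == map(x -> x^2, [1, 2, 3])  # true
```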

A possible definition for dataframes is something like

concatenate(a::AbstractDataFrame, b::AbstractDataFrame) = vcat(a, b)
concatenate(a::GroupedDataFrame, b::GroupedDataFrame) = mergewith(concatenate, a, b)
concatenate(a::AbstractDict, b::AbstractDict) = merge(a, b)
concatenate(a::Any, b::Any) = vcat(a, b)  # fallback

# called via `map(f, ::DataFrame)`
singleton(::Integer, x) = Tables.isrow(x) ? DataFrame([x]) : [x]

# called via `map(f, ::GroupedDataFrame)`
function singleton(k::GroupKey, v)
    if Tables.isrowtable(v)
        return GroupedDataFrame(k, DataFrame(v))  # made-up constructor
    elseif Tables.isrow(v)
        return Dict(k => DataFrameRow(v))  # made-up constructor
        # return GroupedDataRow(k, DataFrameRow(v))  # made-up type
    else
        return Dict(k => v)
        # return GroupedDataCell(k, v)  # made-up type
    end
end

@andyferris
Member Author

Or would it be nice to have GroupedDataRow ("dictionary-of-dictionary") when f returns a row-like object?

Yes. I believe we need tables (collections of rows) that index like an array and other tables that index like a dictionary. I'd be tempted to use the latter to model a table with a primary key (though strictly speaking it's unnecessary for the dictionary key to be one of the columns; it can live "outside" like row indices).

@tkf
Contributor

tkf commented May 21, 2020

I believe we need tables (collections of rows) that index like an array and other tables that index like a dictionary.

Are you talking about Tables.rowaccess vs Tables.columnaccess or something else? Don't we need additional kinds of "table" here? With Haskell-ish notation, I think we need to handle

[K -> V]                 | Vector{Dict{K,V}}                      "row table"
K -> [V]                 | Dict{K,Vector{V}}                      "column table"
(K1 -> V1) -> [K2 -> V2] | Dict{Dict{K1,V1},Vector{Dict{K2,V2}}}  "GroupedDataFrame"
(K1 -> V1) -> (K2 -> V2) | Dict{Dict{K1,V1},Dict{K2,V2}}          "GroupedDataRow"
(K1 -> V1) -> V2         | Dict{Dict{K1,V1},V2}                   "GroupedDataCell"

With the SQL-like approach I'd imagine the latter three would be flattened to a table. But I thought @nalimilan's comment above #2254 (comment) indicates we are dealing with (K1 -> V1) -> [K2 -> V2]. How do you treat (K1 -> V1) -> [K2 -> V2] and (K1 -> V1) -> (K2 -> V2) like tables?

BTW, I find the naming GroupedDataFrame unfortunate if it really is (K1 -> V1) -> [K2 -> V2]. I think it'd be nice to rename it to DataFrameGroups or something and then re-introduce GroupedDataFrame as a row table view of DataFrameGroups.

@bkamins
Member

bkamins commented May 21, 2020

Indeed the points you raise are good. I have the following comments:

  • we need to keep DataFrames.jl as simple as possible; it is meant to be an entry-level package, so we should keep in mind that most users will not be able to digest all the details
  • in general we can do this for a 2.0 release of DataFrames.jl, but for the 1.0 release we will probably stick with what we have
  • making all this happen first requires agreement in Tables.jl about the common API (actually, I think none of what we write here should be DataFrames.jl-specific)

Given this I have another perspective. Maybe we should always return a Vector of what is returned in map and just later delegate to the user what the user wants to do with this Vector (in particular if one wants a DataFrame it can be easily created later using one command). This is of course not the most elegant solution, but maybe it will be the easiest for users to digest in practice?

@quinnj
Member

quinnj commented May 21, 2020

FWIW, I'm not sure we'd really need a Tables.isrow because we have Tables.Row; i.e. you can wrap anything that satisfies the AbstractRow interface like Tables.Row(rowlike) and it provides consistent behaviors and is a dispatchable type. It should also be efficient and not even allocate in cases where it's just a temporary wrapper.

@tkf
Contributor

tkf commented May 21, 2020

Maybe we should always return a Vector of what is returned in map and just later delegate to the user what the user wants to do with this Vector (in particular if one wants a DataFrame it can be easily created later using one command).

You can already use [f(x) for x in df] or collect(f(x) for x in df) to express that the result should be a Vector. I think they are pretty concise. So, if you want a fixed return type, I think map should return a DataFrame/GroupedDataFrame when the input is DataFrame/GroupedDataFrame.

Having realized collect(f(x) for x in df), I'm actually a bit more inclined to map(f, ::DataFrame) :: DataFrame. Sorry @andyferris :)

The map(f, _) implementation can assert that the output of f is row/table-like, or maybe automatically "upcast" to the correct type. It's probably safer to avoid accidentally wrapping a table in a row. Something like:

# called via `map(f, ::DataFrame)`
function singleton(::Integer, v)
    if Tables.istable(v)
        error("Expected a row-like object. Please use `Iterators.flatten` when ",
              "the number of rows may change.")
    elseif Tables.isrow(v)
        return DataFrame([v])
    else
        return DataFrame(value=[v])
    end
end

# called via `map(f, ::GroupedDataFrame)`
function singleton(k::GroupKey, v)
    if Tables.istable(v)
        return GroupedDataFrame(k, DataFrame(v))
    elseif Tables.isrow(v)
        return GroupedDataFrame(k, DataFrame([v]))
    else
        return GroupedDataFrame(k, DataFrame(value=[v]))
    end
end

we have Tables.Row

Ah, good point. I think being explicit is better here.

@nalimilan
Member

I agree that the fact that comprehensions always return vectors makes it less useful to have map do the same thing. That's why map on GroupedDataFrame didn't return vectors (before its deprecation).

Regarding singleton, actually we already have a similar logic in combine to decide how to interpret the return value (whether it's a single value, a single row, or a table). But there when a scalar is returned we automatically generate a column name to store it. This is essential there since it's not very useful to return a vector without grouping information, and as @bkamins said it keeps things simple for users. So it could make sense to do the same for map(f, ::AbstractDataFrame).

@jonas-schulze
Contributor

I really like the simplicity of Haskell's fmap :: (a -> b) -> f a -> f b, which I always think of when reasoning about Julia's map. Mapping a function a -> b over a "container" f a gives a container f b of the same "structure" f but with elements of type b instead of a. Once understood, it is easy to remember and reason about.

This should be the same conception as in #2254 (comment).

If we consider a DataFrame as a collection of rows, DF Row of sorts, applying the above concept I would expect this:

  • map(fn, df::DataFrame)::DataFrame. As such, I also only expect this to work if the function/callable fn maps a row to another row-alike, i.e. fn as Row -> Row2 (this should be the DataFrame containing the new columns of the current transform). If I need a sharper tool, I wouldn't complain using combine instead.
  • map(fn, gdf::GroupedDataFrame)::GroupedDataFrame. Here, the function can return basically anything, e.g. a row-alike or a table-alike (e.g. for GDF (DF Row) and fn as DF Row -> T, I expect a GDF T). If I want a DataFrame or Vector, that's only one simple concatenation/reduce/combine away.

For reduction, that should completely depend on fn. The least surprise of reduce/foldl for me would be this:

  • reduce(fn, df::DataFrame; init::T)::T for fn as T -> Row -> T.
  • reduce(fn, gdf::GroupedDataFrame; init::T)::T for fn as T -> DF Row -> T.

I wouldn't say that I have sufficiently understood broadcasting, so I'm leaving it out.

I have to admit that I haven't thought too much about pleasantries like map(:old => f => :new, _), which I hope are just syntactic sugar. For DataFrames.jl specific functions like transform and combine they are fine, but for map I would expect a more bare-bones/low-level behavior.

@bkamins
Member

bkamins commented Jan 27, 2021

For GroupedDataFrame, map in the form you propose is just combine with a single function passed as the first argument (with the tweak ungroup=false for GroupedDataFrame).

For AbstractDataFrame it is more useful as its combine equivalent is not super user friendly (but combine(df, :cols => ByRow(fun) => AsTable) is essentially the same).

The question is: how useful would having these map functions defined be (apart from aesthetic considerations)?

@nalimilan
Member

nalimilan commented Jan 27, 2021

map(fn, gdf::GroupedDataFrame)::GroupedDataFrame. Here, the function can return basically anything, e.g. a row-alike or a table-alike (e.g. for GDF (DF Row) and fn as DF Row -> T, I expect a GDF T). If I want a DataFrame or Vector, that's only one simple concatenation/reduce/combine away.

As noted above, since GroupedDataFrame is a collection of SubDataFrames, the only consistent definition of map is to have it pass SubDataFrame objects to the function. That's one of the reasons why we provide other functions like combine which have more similar behaviors on DataFrame and GroupedDataFrame. EDIT: the same line of reasoning applies to reduce.

@jonas-schulze
Contributor

The question is: how useful would having these map functions defined be (apart from aesthetic considerations)?

For discoverability, I think, it would be great. Also, for a newcomer to the package, or to non-MS-Excel empirical data analysis in general (especially without much knowledge of SQL and its parallels in many data frame APIs across different languages), having a simple map would lower the bar to entry, I'd say.

Oftentimes, one is not able to just "scroll over the data" or "take a glimpse" and get meaningful insight. For a newcomer/Excel-leaver, it takes a paradigm shift to instead formulate a request about the data and have some package execute that request. That goes hand in hand with a new language, and I think a simple map, which doesn't try to be clever and is therefore easy to understand, would be the best introduction to that new paradigm. Once understood, or once its limitations are clear, combine and friends and their domain-specific syntax are easier to get into.

Also, oftentimes I don't need more than a simple map; and not having to re-read the documentation of a package is a plus.

@bkamins bkamins added and then removed the breaking label Mar 13, 2021
@bkamins
Member

bkamins commented Mar 13, 2021

I was thinking about it and came to the following recommendation:

  • deprecate filter(src => predicate, df) and filter(src => predicate, gdf); only allow filter(predicate, df) and filter(predicate, gdf) and direct users to subset for more advanced functionality
  • add iterate to AbstractDataFrame that would iterate rows
  • all functions except filter should use their generic implementations (otherwise things get super messy with a lot of corner cases; if a row-like object is returned e.g. from map, it is easy to just pass it to the DataFrame constructor to get a data frame); filter is special because:
    1. it is a legacy functionality we do not want to break
    2. the contract for it in Julia Base is "Return a copy of collection a, removing elements for which f is false." so the collection type should stay unchanged
    3. filter in base does not work with iterators (as opposed to map, mapreduce etc.) so it is special anyway

If we agree to this I will implement it before 1.0 release.

The difference between AbstractDataFrame and DataFrameRows is that AbstractDataFrame would only implement the iteration protocol, while DataFrameRows is an AbstractVector{<:DataFrameRow}. This is significant because DataFrameRows will then support more functions from Julia Base; on the other hand, it allows us to have different behavior between e.g. map and broadcasting for AbstractDataFrame.

@nalimilan
Member

Sounds good, except that I'm not totally sure what you mean by this:

  • all functions except filter should use their generic implementations (otherwise things get super messy with a lot of corner cases; if a row-like object is returned e.g. from map, it is easy to just pass it to the DataFrame constructor to get a data frame);

What functions do you have in mind? For map at least as we discussed above it sounds more useful to return a DataFrame than a vector.

@bkamins
Member

bkamins commented Mar 13, 2021

The point is that I do not want to specify these functions but rather indicate that they should do whatever they do generically.

For instance for map the implementation would have to be something like:

  1. apply standard map
  2. then check if all returned elements are row-like with the same schema
  3. if yes then return DataFrame
  4. if not then return a vector of produced elements

The point is that "then check if all returned elements are row-like with the same schema" is:

  1. very hard to get right
  2. not easy for users to learn when the test passes
  3. inefficient (as we have to do a runtime check of every element)
  4. saving very little typing (as a returned vector can easily be transformed into a DataFrame via the constructor)
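
For concreteness, the schema check in step 2 could look roughly like this Base-only sketch, which simplifies "row-like" to NamedTuple (an assumption; DataFrames would also need to accept DataFrameRow, AbstractDict, etc.). Note it must touch every element, which is the inefficiency mentioned above:

```julia
# Sketch of step 2: all elements row-like (here: NamedTuple) and
# sharing one schema (same keys). An O(n) runtime check.
function same_row_schema(elements)
    isempty(elements) && return false
    all(e -> e isa NamedTuple, elements) || return false
    ks = keys(first(elements))
    return all(e -> keys(e) == ks, elements)
end

same_row_schema([(a = 1, b = 2), (a = 3, b = 4)])  # true
same_row_schema([(a = 1, b = 2), (a = 3,)])        # false
same_row_schema([1, 2, 3])                         # false
```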

@andyferris
Member Author

This seems like a solid plan to me, @bkamins.

  1. then check if all returned elements are row-like with the same schema

As a point of reference, in TypedTables I can just check if Core.Compiler.return_type gives a concrete NamedTuple. That's not going to work well here though, since the input is a DataFrameRow rather than a named tuple, and the output shouldn't necessarily be a NamedTuple either. I think you are likely correct that it is hard to get this right! (The case where f itself is of some special function type dedicated to building rows is OK, though.)

deprecate filter(src => predicate, df)

Just an aside — I kind of feel that Symbol should be callable (with (s::Symbol)(x) = getproperty(x, s), like in Clojure) and then this is legitimately filter(predicate ∘ :column, df). You can even optimize that pretty easily. (I don't know what to do for src => f => dest, though...)
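
The same idea can be sketched today without changing Base (and without type piracy on Symbol) via a tiny callable wrapper; Getter is a made-up name for illustration. Note the composition order: the property extraction runs first, then the predicate:

```julia
# Made-up wrapper standing in for a callable Symbol:
# Getter(:col)(row) == getproperty(row, :col), composable with ∘.
struct Getter
    name::Symbol
end
(g::Getter)(x) = getproperty(x, g.name)

rows = [(column = 1,), (column = 5,)]
predicate(x) = x > 2
filter(predicate ∘ Getter(:column), rows)  # keeps only (column = 5,)
```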

@bkamins
Member

bkamins commented Mar 14, 2021

(s::Symbol)(x) = getproperty(x, s)

This is a nice idea, but it would have to go into Julia Base. Have you discussed it there?

check returned elements

In the end I think it is acceptable to just do the runtime check. map is going to be slow anyway, as it will be type unstable. So essentially functions like filter and map will become relatively slow, but conforming to the Julia Base API.

Then we should also update map for GroupedDataFrame to produce a GroupedDataFrame if the returned elements are table-like (now it produces a vector). I will make a PR making all these changes and we can discuss the consequences there.

@andyferris
Member Author

This is a nice idea, but it would have to go into Julia Base. Have you discussed it there?

No, I haven't. I would like to but haven't found the time.

@nalimilan
Member

Maybe I'm missing something, but rather than collecting all returned values into a vector or relying on return_type, can't we just take the same approach as map/collect in Base and combine in DataFrames, i.e. check the type of the first returned value and decide what to do based on that? If we want to allow mixing different return types, we could fall back to a slow path if we get an element of a different type from the first one.

@bkamins
Member

bkamins commented Mar 14, 2021

You can get e.g. DataFrameRow objects of the same type but with different sets of columns (of course this is an artificial example).

I have opened #2662 to discuss the details.


7 participants