Clarify position on iteration API #2254
Comments
In general I tend towards supporting the iteration API. There are, however, the following issues that require discussion.
Of those points, point 4 (and in consequence point 2) is crucial to decide on and the hardest, so I would concentrate the discussion on them first. The general questions are:
To add one point to @bkamins' list: one tricky point is that Overall I don't think we have to settle this before 1.0. At #1514 we decided DataFrames are collections of rows, so functions from the collection API that we implement follow that convention. But we don't have to implement all of them right now, especially since you can always use
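The collection-of-rows convention mentioned above can be illustrated with the explicit row iterator that already exists today. A minimal sketch, assuming DataFrames.jl is installed (behaviour shown is the row-wise convention, not a proposal):

```julia
using DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# Row-wise filter directly on the DataFrame (the #1514 convention):
filter(r -> r.a > 1, df)   # 2-row DataFrame

# Explicit row iterator, always available regardless of the decision:
first(eachrow(df))         # first row as a DataFrameRow
collect(eachrow(df))       # Vector of DataFrameRow
```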
Concerning the point about I think
With the added bonus that these operations would multithread automatically.
So my thinking was a bit different. I thought that On the other hand So what you write would be rather
In my experience with JuliaDB, the problem is that this approach is much more wasteful, because you need to actually extract the subvectors. OTOH if you know that the user is applying a I suspect the same rationale applies here, because you can simply perform the reduction when iterating on
Yes - this is true (and internally we do it in a different way). I am simply referring to what the API of
Yes, overall one potential issue is that iteration over Something that just occurred to me is that we could use
💯 However, this will solve the issue only partially; we still have many other functions to decide how they should be handled (or not handled, which is what we do now, because we are unsure what we want 😄).
Awesome. 😄 In general I note a lot of interface questions like "what should For example, a This immediately restricts you (in a good way!). Starting with
For
If a table is an array-of-dictionaries what is your grouped data frame? I see space for two different objects, actually, which you want to be able to switch between with ease. The first is a dictionary-of-tables (i.e. dictionary-of-arrays-of-dictionaries), which is what IMO using external interfaces like
I feel this is general; I would frequently like to use broadcast on
I agree we should decide based on comparisons with vectors of rows. The answer is pretty clear:
As for
makes perfect sense currently, but it is only possible because we treat a data frame like a matrix.
To be clear - I'd argue the type of
Ah yes, that's an interesting option.
But this would then have to apply to
I suppose I'd want a
I agree with @andyferris. I think

```julia
Tables.isrow(::NamedTuple) = true
Tables.isrow(::AbstractDict{<:Union{Symbol,String}}) = true
Tables.isrow(::DataFrameRow) = true
Tables.isrow(_) = false
```

Likewise, I think

BTW, I find it nice to conceptualize

```julia
map(f, xs) = reduce(concatenate, (singleton(k, f(v)) for (k, v) in pairs(xs)))
```

where Importantly, this connects For example, A possible definition for dataframes is something like

```julia
concatenate(a::AbstractDataFrame, b::AbstractDataFrame) = vcat(a, b)
concatenate(a::GroupedDataFrame, b::GroupedDataFrame) = mergewith(concatenate, a, b)
concatenate(a::AbstractDict, b::AbstractDict) = merge(a, b)
concatenate(a::Any, b::Any) = vcat(a, b) # fallback

# called via `map(f, ::DataFrame)`
singleton(::Integer, x) = Tables.isrow(x) ? DataFrame([x]) : [x]

# called via `map(f, ::GroupedDataFrame)`
function singleton(k::GroupKey, v)
    if Tables.isrowtable(v)
        return GroupedDataFrame(k, DataFrame(v)) # made-up constructor
    elseif Tables.isrow(v)
        return Dict(k => DataFrameRow(v)) # made-up constructor
        # return GroupedDataRow(k, DataFrameRow(v)) # made-up type
    else
        return Dict(k => v)
        # return GroupedDataCell(k, v) # made-up type
    end
end
```
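The map-as-reduce decomposition sketched above can be exercised with plain Base containers, no DataFrames required. A minimal sketch (`mymap` is a hypothetical name chosen to avoid clashing with `Base.map`):

```julia
# Generic map written as reduce-over-singletons, specialized to Vectors.
concatenate(a::AbstractVector, b::AbstractVector) = vcat(a, b)
singleton(::Integer, x) = [x]

mymap(f, xs) = reduce(concatenate, (singleton(k, f(v)) for (k, v) in pairs(xs)))

mymap(x -> x^2, [1, 2, 3])  # == [1, 4, 9]
```

This makes the connection between `map` and `reduce` concrete: changing what `singleton` and `concatenate` return is exactly what would let `map` produce a `DataFrame`, a `GroupedDataFrame`, or a plain vector.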
Yes. I believe we need tables (collections of rows) that index like an array and other tables that index like a dictionary. I'd be tempted to use the latter to model a table with a primary key (though strictly speaking it's unnecessary for the dictionary key to be one of the columns, it can live "outside" like row indices).
Are you talking about

```
[K -> V]                 | Vector{Dict{K,V}}                      "row table"
K -> [V]                 | Dict{K,Vector{V}}                      "column table"
(K1 -> V1) -> [K2 -> V2] | Dict{Dict{K1,V1},Vector{Dict{K2,V2}}}  "GroupedDataFrame"
(K1 -> V1) -> (K2 -> V2) | Dict{Dict{K1,V1},Dict{K2,V2}}          "GroupedDataRow"
(K -> V1) -> V2          | Dict{Dict{K,V1},V2}                    "GroupedDataCell"
```

With the SQL-like approach I'd imagine the latter three would be flattened to a table. But I thought @nalimilan's comment above #2254 (comment) indicates we are dealing with

BTW, I find the naming
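The shapes in the table above can be mocked up with Base containers only, using NamedTuples as rows. This is a rough sketch of the conceptual layout, not any actual DataFrames API:

```julia
# "row table": a vector of row-like values
rowtable = [(a = 1, b = "x"), (a = 2, b = "y")]

# "column table": a mapping from column name to column vector
coltable = Dict(:a => [1, 2], :b => ["x", "y"])

# "GroupedDataFrame"-like: group key => sub-table (vector of rows)
grouped = Dict((b = "x",) => [(a = 1,)], (b = "y",) => [(a = 2,)])

# "GroupedDataCell"-like: group key => single aggregated value
cells = Dict((b = "x",) => 1, (b = "y",) => 2)
```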
Indeed the points you raise are good. I have the following comments:
Given this I have another perspective. Maybe we should always return a |
FWIW, I'm not sure we'd really need a
You can already use Having realized The

```julia
# called via `map(f, ::DataFrame)`
function singleton(::Integer, v)
    if Tables.istable(v)
        error("Expected a row-like object. Please use `Iterators.flatten` when ",
              "the number of rows may change.")
    elseif Tables.isrow(v)
        return DataFrame([v])
    else
        return DataFrame(value=[v])
    end
end

# called via `map(f, ::GroupedDataFrame)`
function singleton(k::GroupKey, v)
    if Tables.istable(v)
        return GroupedDataFrame(k, DataFrame(v))
    elseif Tables.isrow(v)
        return GroupedDataFrame(k, DataFrame([v]))
    else
        return GroupedDataFrame(k, DataFrame(value=[v]))
    end
end
```
Ah, good point. I think being explicit is better here.
I agree that the fact that comprehensions always return vectors makes it less useful to have

Regarding
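For concreteness, a comprehension over `eachrow` always yields a plain `Vector`, whatever the source container. A small sketch, assuming DataFrames.jl:

```julia
using DataFrames

df = DataFrame(a = 1:3)

# The comprehension returns Vector{Int}, not a DataFrame:
[r.a * 2 for r in eachrow(df)]  # == [2, 4, 6]
```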
I really like the simplicity of Haskell's This should be the same conception as in #2254 (comment). If we consider a
For reduction, that should completely depend on
I wouldn't say that I have sufficiently understood broadcasting, so I'm leaving it out. I have to admit that I haven't thought too much about pleasantries like
For For The question is how useful having these
As noted above, since
For discoverability, I think, it would be great. Also, for a newcomer to the package or to non-MS-Excel empirical data analysis in general, especially without much knowledge about SQL and its parallels in many data frame APIs across different languages, having a simple

Oftentimes, one is not able to just "scroll over the data" or "take a glimpse" and get meaningful insight. For a newcomer/Excel-leaver, it takes a paradigm shift to instead formulate a request about the data and have some package execute that request. That goes hand in hand with a new language, and I think a simple

Also, oftentimes I don't need more than a simple
I was thinking about it and came to the following recommendation:
If we agree to this, I will implement it before the 1.0 release. The difference between
Sounds good, except that I'm not totally sure what you mean by this:
What functions do you have in mind? For
The point is that I do not want to specify these functions but rather indicate that they should do whatever they do generically. For instance for
The point is that "then check if all returned elements are row-like with the same schema" is:
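The runtime check described above ("all returned elements are row-like with the same schema") could look roughly like this. `isrowlike` and `same_schema` are hypothetical helpers sketched on NamedTuples, not the real Tables.jl predicates:

```julia
# Hypothetical runtime schema check for mapped results.
isrowlike(x) = x isa NamedTuple

# All elements row-like and sharing a single set of property names?
same_schema(xs) = all(isrowlike, xs) &&
                  length(unique(map(propertynames, xs))) == 1

same_schema([(a = 1, b = 2), (a = 3, b = 4)])  # true
same_schema([(a = 1,), (b = 2,)])              # false
same_schema([1, 2, 3])                         # false
```

The cost is one extra pass over the collected results, which is the trade-off against inference-based approaches discussed below.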
This seems like a solid plan to me, @bkamins.
As a point of reference, in TypedTables I can just check if
Just an aside — I kind of feel that
This is a nice idea but it would have to go to Julia Base. Have you discussed it there?
In the end I think it is acceptable to just do a runtime check. Then we should also update
No, I haven't. I would like to but haven't found the time.
Maybe I'm missing something, but rather than collecting all returned values to a vector or relying on
You can get e.g. I have opened #2662 to discuss the details.
I watched the awesome recent AMA and realized you guys are preparing for a 1.0 release 🎉

There's probably one decision I haven't got my head around with DataFrames and whether it would lead to breaking API changes, and that is to do with the iteration API.

In some cases a `DataFrame` opts into some iteration API functions, while in other cases it does not.

It would seem cleanest to me to pick a side of the fence before releasing a version 1.0: either commit to `DataFrame` supporting these functions directly, or require the explicit forms `filter(predicate, eachrow(df))`, `first(eachrow(df))`, `last(eachrow(df))`.

If doing 1., then we'd want to see if there are any methods implemented that contradict this interpretation before releasing 1.0, but the actual implementation could come later. I think most of the column iteration functionality has been removed, though.

Sorry if this has been discussed or decided elsewhere and I'm not aware of the outcome.

For some background, I quite often use `filter` on `DataFrame`s and find this interface quite useful. I do tend to enjoy using `Vector{NamedTuple}` and TypedTables.jl because I also have access to `map`, `reduce` and `mapreduce` from `Base`, plus all the functions in SplitApplyCombine which exist for, well, splitting, and applying, and combining, especially for tables. I would love to support `DataFrame` in SplitApplyCombine, for example. (And there are many other great packages out there that rely on iteration, too.)

My personal bias has always been towards: a table is a relation, which is a collection of rows, and we can use relational algebra to manipulate it. I don't see a particular advantage of having `DataFrame` not support iterating either columns or rows and requiring `eachrow` and `eachcol` when you could choose one by default.