
Improve copy of SubDataFrame and DataFrameRow #1564

Merged: bkamins merged 3 commits into JuliaData:master from the copy_cleanup branch on Nov 1, 2018

Conversation

bkamins (Member) commented on Oct 12, 2018

@nalimilan This is a follow-up to #1534 and should be merged only if that one is merged (I separated it out because we already have enough changes there).

What it does:

  • change copy for SubDataFrame so that it is guaranteed to return a DataFrame in the future (the same functionality as today, but using syntax that accounts for the upcoming changes to getindex that are currently in their deprecation period)
  • introduce copy for DataFrameRow (see the sketch after this list)
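A minimal sketch of the intended behavior, using current DataFrames.jl indexing syntax (the exact code in this PR may differ):

    using DataFrames

    df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

    sdf = view(df, 2:3, :)      # SubDataFrame
    dfr = first(eachrow(df))    # DataFrameRow

    copy(sdf)   # materializes the view as a standalone DataFrame
    copy(dfr)   # copies the row into an independent object (its exact type is discussed below)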

The second point is actually the crucial one: it is a simple change consistent with the logic of #1534, but it raises one important question. Should the returned NamedTuple have the minimal (narrowest) second type parameter, derived from the actual values in the row (this is what is implemented now), or should it have a wide type specification consistent with the eltype of each DataFrame column?

The difference is that:

  • the narrow type specification will possibly be faster when the NamedTuple is passed to a function, and it might be slightly faster to create (not 100% sure, though);
  • the wider type specification means that every NamedTuple created from a row of a given DataFrame has the same concrete type.

I am not sure which is better (I have implemented the first, but I have a feeling that the second approach may be preferable). See the sketch below.
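To make the two options concrete (an illustrative sketch, not code from this PR), consider a column whose eltype is Union{Int, Missing}:

    col = Union{Int, Missing}[1, missing]

    # narrow: the NamedTuple type depends on the values of the particular row
    typeof((a = col[1],))   # NamedTuple{(:a,), Tuple{Int64}}
    typeof((a = col[2],))   # NamedTuple{(:a,), Tuple{Missing}}

    # wide: every row gets the same concrete type, derived from eltype(col)
    NamedTuple{(:a,), Tuple{Union{Int, Missing}}}((col[1],))
    NamedTuple{(:a,), Tuple{Union{Int, Missing}}}((col[2],))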

nalimilan (Member) commented:
Thanks. I'd be inclined to use the wider types instead, with the idea that they preserve more information and make things more "type-stable" in the sense that the type does not depend on the particular row (though of course inference doesn't know the column types of a DataFrameRow). This is useful if you need to know what type to expect from the remaining rows when allocating a column, as sketched below. It's possible that the wider type would be slightly less efficient (e.g. Union{T,Missing} rather than T), but taking a row in isolation doesn't make much sense.

Cc: @quinnj
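For instance (a hedged sketch, not from this PR), a sink that receives rows as NamedTuples with wide field types can allocate a destination column that will also hold rows it has not seen yet:

    # the first row carries the wide field type even though its value is not missing
    nt = NamedTuple{(:a,), Tuple{Union{Int, Missing}}}((1,))

    T = fieldtype(typeof(nt), :a)   # Union{Int64, Missing}
    dest = Vector{T}(undef, 0)      # can also hold later rows that contain missing
    push!(dest, nt.a)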

nalimilan (Member) commented:
See discussion at JuliaData/Tables.jl#28.

bkamins (Member, Author) commented on Oct 26, 2018

@nalimilan Given the decision in #1534, what should copy on a DataFrameRow produce? A DataFrame or something else?

Also:

Base.convert(::Type{Array}, r::DataFrameRow) = convert(Array, parent(r)[row(r),:])

this method seems strange given our current approach. Maybe we should deprecate it, support only Vector, and additionally add Vector(dfr) and Vector{T}(dfr) methods? (A sketch of what those could look like follows.)
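A hypothetical sketch of what such methods could do (the helper name row_to_vector and its body are illustrative assumptions, not the code eventually merged):

    using DataFrames

    # collect a DataFrameRow's values into a Vector, optionally forcing the element type
    row_to_vector(dfr::DataFrameRow) = [dfr[n] for n in names(dfr)]
    row_to_vector(::Type{T}, dfr::DataFrameRow) where {T} = T[dfr[n] for n in names(dfr)]

    df = DataFrame(a = [1, 2], b = [3.5, 4.5])
    dfr = first(eachrow(df))
    row_to_vector(dfr)            # 2-element vector holding 1 and 3.5
    row_to_vector(Float64, dfr)   # [1.0, 3.5]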

nalimilan (Member) commented:
Maybe a NamedTuple would be better, as it is more similar to a DataFrameRow than to a DataFrame (e.g. indexing it gives a scalar).

The convert methods indeed need adjusting, but I think we'd better keep them, as there's no reason to support only the constructors.
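For illustration, this is how copy on a DataFrameRow behaves in released DataFrames.jl versions (the exact syntax available at the time of this PR may have differed):

    using DataFrames

    df = DataFrame(a = [1, 2], b = ["x", "y"])
    dfr = first(eachrow(df))

    nt = copy(dfr)   # (a = 1, b = "x"), a NamedTuple
    nt.a             # 1; accessing a field gives a scalar, just like indexing a DataFrameRow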

bkamins (Member, Author) commented on Oct 26, 2018

OK.

With the convert methods, the problem is that only convert(Array, dfr) currently works, and it produces a matrix, not a vector. In particular, convert(Vector, dfr) and convert(Matrix, dfr) currently fail.

nalimilan (Member) commented on Oct 26, 2018

OK. I guess we should at least support convert(Vector, dfr); it's a bit weird to see DataFrameRows as matrices.

Does this PR need anything more?

bkamins (Member, Author) commented on Oct 26, 2018

I have added a doc string and it should be good.
I will handle conversions in a separate PR.

bkamins merged commit 10eb8d2 into JuliaData:master on Nov 1, 2018
bkamins deleted the copy_cleanup branch on Nov 1, 2018