Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Clean up operators #351

Merged
merged 13 commits into from
Sep 20, 2013
Merged

Conversation

simonster
Copy link
Contributor

There are two main goals here: to improve performance by allowing type inference to happen for most of these operations, and to reduce the amount of repetitive code. See #327 for more background.

Some notes:

  • I'm not a huge fan of the giant lists of operators at the top of the file, but I left them in since they're used in dataframe_blocks.jl. My main grievance is that they make it hard to tell what's being defined where without jumping around in the file. A secondary issue is that the operator categories that make sense in operators.jl don't necessarily make sense elsewhere. For example, ./ needs to be defined separately in operators.jl for type reasons, so it's not in array_arithmetic_operators, and at the moment this also means it's not handled in dataframe_blocks.jl.
  • Would it be reasonable to give all AbstractDataArrays isna and data methods that take indices? The isna method would return a Bool, and the data method would return dv[i] if dv[i] != NA and could return anything otherwise. This would permit efficient type inference without accessing fields directly, which would let me remove the special cases for DataArrays and speed up other AbstractDataArrays.
  • The pairwise and cumulative vector operators will give errors instead of returning NA when there are undefined values in the array underlying the DataArray, which can only happen for non-bits types. This is left over from the old implementation, and I'm not sure whether it's worth fixing, or how much effort it's worth putting into performance if I do.
  • I only made cosmetic changes to col* and row*, but these could use some more work. The API should probably change to be more Julian (Make API more Julian #159) and there are other considerations as well (see Clean up basic functions like mean and std #325 and Implement na_rm for math functions? #259).
  • I'm not too sure about the behavior of all. Whether we return NA or false depends on the order of the vector, i.e., all([false, NA]) == false whereas all([NA, false]) == NA. I haven't changed this, since I wanted the existing tests to pass, but is this really what we want?
  • I didn't touch the rle methods, since I'm not sure they're sufficiently commonly used to be worth optimizing.
  • I still need to go through the tests and make sure that the coverage is still somewhere close to full.

@johnmyleswhite
Copy link
Contributor

Is this generally good to merge? I'd like to get something (even if it's a draft) merged soon, so that I can finish the work I've started splitting DataArrays out of DataFrames.

@simonster
Copy link
Contributor Author

It passes the current tests, but it might be worth adding some tests to operators.jl for PooledDataArray, since most operators now have separate code paths for DataArray and AbstractDataArray. I'll try to get to that today.

@johnmyleswhite
Copy link
Contributor

That would be great. I'll give it a review now. Sorry for taking so long to look at this. Please ping me in the future if you think I'm neglecting something. Or just merge it if you're happy with it.

@nfoti
Copy link
Contributor

nfoti commented Sep 19, 2013

I can move the code in PR #354 to the new repo when it's made.

@johnmyleswhite
Copy link
Contributor

That would be great. And the same comments I made are relevant there: if I'm behind with a PR, please ping me. I've gotten to have more to do now than I can easily manage, so any reminders are really helpful.

function similar{T}(d::DataArray{T}, dims::Dims)
DataArray(Array(T, dims), trues(dims))
end
similar(d::DataArray, T, dims::Dims) = DataArray(Array(T, dims), trues(dims))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we revert this change? I've edited this on my own and now feel like similar should not initialize the na bit mask.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change this not to initialize the na mask, but I still think we only need a single similar function with all the arguments. The one- and two-argument versions of similar in Base will just call this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. I'll make that change on my end.

@simonster
Copy link
Contributor Author

Okay, I've added some additional tests and fixed two bugs they picked up, one preexisting and one new. I think this should be good to merge.

@johnmyleswhite
Copy link
Contributor

Ok. Let's merge this. Then I'll start the split into DataFrames and DataArrays. After that, we can review this stuff again.

johnmyleswhite added a commit that referenced this pull request Sep 20, 2013
@johnmyleswhite johnmyleswhite merged commit af27a63 into JuliaData:master Sep 20, 2013
@simonster simonster mentioned this pull request Sep 20, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants