Feature: add ak.packed
#912
@jpivarski I'm going to be working on this but please feel free to make suggestions at any time! |
I'm not sure how to handle |
The word that we use for that is "materialize," though I think it would always be true for broadcasting or recursively applying. Operations should be expected to make a lazy array non-lazy. That's what this does: |
Great. I sort of agree. I think maybe we don't add the flag for that reason, also because |
This In this comment, I'll write a list of rules that
Before applying the above rules, option-type nodes (IndexedOptionArray, ByteMaskedArray, BitMaskedArray, UnmaskedArray) should be simplified and UnionArrays should be simplified. This prevents double-masking and unnecessary unions. In principle, we shouldn't be seeing these double-masked or unnecessary unions, but bugs happen and a function like this should be defensive against that. All of the above rules are intended to make pickled and otherwise serialized data as small as they can be, without resorting to compression techniques (like run-length encoding, variable-length encoding) or packing bytes into bits (we'll leave them as bits if they're bits; otherwise no). |
Sure, I had the same thoughts as I pushed the PR. Let's do it.
Part of the logic for using |
The reasons one might want to do that are (a) to simplify/reduce boilerplate and/or (b) keep all "knowledge" of how to recurse in one place. Some functions are made simpler with For (b), that ship has sailed. The "knowledge" of what kinds of children each node type has is scattered throughout the codebase, mostly for the reason that centralizing it would make the code harder to read, and (you may have seen) some parts are already hard enough. Because of this lack of DRY centralization, we have lost the ability to easily add a new node type. That's something I accepted at the beginning because of how previous attempts went, and therefore put in an effort to get the set of node types right, knowing that it would be hard to add new ones in the future. For examples of the ship sailing, look at convert.py. |
Ping (because GitHub doesn't email us about edited comments, only new comments). I've posted a list of rules in #912 (comment). |
You're right. It's a combination of the two, but taking a step back and thinking about it - there is no saving on code here, it's just trampolining into the same function which implements most of the types. I'll move over to a recursion when I am back at it. |
I'm not yet familiar with all of these types (though the docs make their usage clear). Are you suggesting that this should be a two-pass process, with the first pass operating upon option types only? |
No, it can be a single pass, but when you encounter, say, an IndexedOptionArray, do

    elif isinstance(layout, ak._util.indexedoptiontypes):
        if isinstance(layout.content, ak._util.optiontypes):  # the bad case we want to protect against
            return recurse(layout.simplify())
        # apply the rules for IndexedOptionArray

This won't infinitely recurse because For unions, it will be somewhat trickier because some unions can be simplified and others can't. There's a rule that I can help you with when you get to it: a UnionArray can be simplified when all of its Also note that we can |
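For readers following along, here is a minimal sketch of why a guard like this terminates. It uses toy classes rather than the real awkward layouts: `Leaf`, `Option`, and this `recurse` are hypothetical stand-ins, and the mask-merging in `simplify` is an assumption made only for illustration. The point is that `simplify` is called only when two option nodes are stacked, and each call removes one level, so the recursion cannot loop.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    """Hypothetical stand-in for a leaf node holding plain values."""
    values: list


@dataclass
class Option:
    """Hypothetical stand-in for an option-type node."""
    mask: list       # True where the value is missing
    content: object  # either a Leaf or another Option

    def simplify(self):
        # Collapse Option(Option(x)) into a single Option(x).
        if isinstance(self.content, Option):
            merged = [a or b for a, b in zip(self.mask, self.content.mask)]
            return Option(merged, self.content.content)
        return self


def recurse(layout):
    if isinstance(layout, Option):
        if isinstance(layout.content, Option):  # the double-masked case
            # Each simplify() removes one level of nesting, so this terminates.
            return recurse(layout.simplify())
        # Guard not triggered: apply the (toy) rule for a single option node.
        return Option(list(layout.mask), recurse(layout.content))
    return Leaf(list(layout.values))


double = Option([False, True, False], Option([True, False, False], Leaf([1, 2, 3])))
packed = recurse(double)
```

After the call, `packed` is a single `Option` directly over a `Leaf`, with the two masks combined.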
For unions, I think the behaviour that we want is:
(1.) is already handled by |
Yes, this is right. And UnionArray's Just using UnionArray's The guard is easier for option-types, since option-type's |
Yes, I need to think about this a bit more. I'm not sure whether the
conversion is safe though. How do we handle converting uint64 to floats?
I'm afk right now but the thought occurred to me that the maxint is bigger
than 2^53-1.
…On Sat, 12 Jun 2021, 18:26 Jim Pivarski, ***@***.***> wrote:
For unions, I think the behaviour that we want is:
1. to lift any immediate child unions into the parent union
2. to merge any mergeable contents of the new top-level union
3. to compact the array contents and their associated indices.
Does this match your expectations?
Yes, this is right. And UnionArray's simplify does this—the merging of
ints and floats (and complex) is intentional: it's a choice to consider
them subtypes of each other. (I can't conceive of a real-world advantage of
a "union of ints and floats" over "just floats." There's a potential
distinction in precision, but again, I can't conceive of a *real-world*
case where you'd want to mix them functionally without mixing them
numerically.)
Just using UnionArray's simplify before going in and packing the contents
is fine. What I thought would be tricky would be determining if simplify
would change anything before calling it. If you always call simplify and
recurse on the result of the simplification, it would be an infinite
recursive loop. I was talking about how to write the guard.
The guard is easier for option-types, since option-type's simplify just
ensures that two option-type nodes aren't double-stacked, nothing more.
It's easy to write the guard for that.
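As an illustration of step 3 in the quoted list (compacting the contents and their indices), here is a rough sketch on plain Python lists. `compact_union` is a hypothetical helper, not part of awkward, and it assumes a union is described by parallel `tags` and `index` lists plus a list of `contents`. Each content is projected down to the elements the union actually references, and the index is renumbered to count from zero within each content.

```python
def compact_union(tags, index, contents):
    """Project each content to the elements the union uses and renumber the index.

    tags[k] selects which content element k comes from;
    index[k] is the position of element k inside that content.
    """
    new_contents = [[] for _ in contents]
    new_index = []
    for tag, i in zip(tags, index):
        # The element's new position is the next free slot in its content.
        new_index.append(len(new_contents[tag]))
        new_contents[tag].append(contents[tag][i])
    return list(tags), new_index, new_contents


tags = [1, 1, 0, 0]
index = [3, 0, 2, 2]
contents = [[10, 11, 12, 13], [20, 21, 22, 23]]
new_tags, new_index, new_contents = compact_union(tags, index, contents)
```

In this toy version, unreferenced elements are dropped and an element referenced twice is copied twice, which trades a little duplication for a simple counting index.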
|
It's actually the same conversion as NumPy. (We let them decide the "right" way to merge numbers and then match it.) |
Hmm. I can see that NumPy provides such a function; it just seems like a last resort because of the lack of representation for large ints. |
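The precision concern raised here can be checked without any library. This is only a small sketch using Python's built-in `float`, which is a 64-bit IEEE 754 double on common platforms; the variable names are illustrative.

```python
UINT64_MAX = 2**64 - 1
LARGEST_EXACT = 2**53  # integers up to this magnitude convert to float64 exactly

# 2**53 survives a round trip through float unchanged...
exact = int(float(LARGEST_EXACT)) == LARGEST_EXACT

# ...but 2**53 + 1 has no float64 representation and is rounded.
rounded = int(float(LARGEST_EXACT + 1))

# The largest uint64 value is rounded up to 2**64 when converted.
max_as_float = int(float(UINT64_MAX))
```

So merging uint64 with floats can silently change large values, which is the trade-off being discussed.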
@jpivarski how would you feel about exposing Furthermore, I don't think there is a recursion risk here. If |
To implement the truncation step, I've added a private |
Why do we care whether the content has |
Now to push the tests... |
I've made a mistake in the |
OK, this is looking like it is nearly ready for a review. I have not yet implemented the axis / depth argument. I don't think we want to restrict this to only a single axis if As you can imagine, the reason that I want to pass a depth parameter is that for routines like |
Additionally, the tests are probably not quite finished - I don't thoroughly check that identities and parameters are preserved, or that the contents are always simplified too. Perhaps some input as to how much testing we want to do would be helpful. |
When implementing the depth parameter (I settled on

    import numpy as np
    import awkward as ak

    a = ak.layout.NumpyArray(np.arange(4))
    b = ak.layout.RegularArray(ak.layout.NumpyArray(np.arange(12)), 3)
    layout = ak.layout.UnionArray8_64(
        ak.layout.Index8([1, 1, 0, 0]), ak.layout.Index64([0, 1, 0, 1]), [a, b]
    )
    layout.axis_wrap_if_negative(-1)

I expect this to just return |
I think
I think it would be more useful to leave the partitions as partitions. |
@jpivarski I was motivated to work on this by #910. In that case, we don't need to flatten every axis unless the user passes |
This reverts commit 08b2b08.
Sorry for the git noise. I rebased to drop the first commit. |
This is looking good! Just a few comments inline.
This won't involve a copy.
OK, since making the discussed changes, I added a missing pass-through case for |
I just made I added documentation (that I'm not 100% sure will render correctly; I'm just crossing my fingers and will look at awkward-array.org when it's done). I don't think you have any other changes to make, so I'm enabling auto-merge. If there's something else you want me to add, let me know right away and I can stop the process! |
This will simplify the length and order violating layouts such that external operations are compatible with per-layout transformations like ak.unflatten
- .content to len(original) * size with special handling (pass through) if size == 0
- toListOffsetArray64(true) (the true means starting at zero)
- toListOffsetArray64(true) (it's a pass-through if it's already true that offsets[0] == 0)
- contents to len(original)
- project() it
- toIndexedOptionArray if the content does not have PrimitiveType. Doing so will naturally lead to the right kind of index. If not changing the type (because the content has PrimitiveType), at least truncate the content length to len(original).
- toByteMaskedArray if the content has PrimitiveType; otherwise, we want to project the content such that the non-negative index values become simple counting... the index should end up looking like 0, 1, 2, -1, 3, -1, -1, -1, 4, 5.... That will take some thought.
- toIndexedOptionArray if the content does not have PrimitiveType. If not changing the type, at least truncate the content length to len(original).
- index by projecting each of the contents
- PartitionedArray concatenate partitions?
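Two of the rules above can be sketched on plain Python lists to show what "packed" means for the index buffers. These are hypothetical helpers for illustration only, not awkward functions: `pack_list` rebuilds a list's content so that offsets start at zero, and `pack_indexed_option` projects the content so that the non-negative index values become a simple count, with -1 kept for missing entries.

```python
def pack_list(starts, stops, content):
    """Rebuild content in list order and return offsets with offsets[0] == 0."""
    offsets = [0]
    new_content = []
    for start, stop in zip(starts, stops):
        new_content.extend(content[start:stop])
        offsets.append(len(new_content))
    return offsets, new_content


def pack_indexed_option(index, content):
    """Project content so that valid index entries count 0, 1, 2, ... in order."""
    new_index = []
    new_content = []
    for i in index:
        if i < 0:
            new_index.append(-1)  # missing value stays missing
        else:
            new_index.append(len(new_content))
            new_content.append(content[i])
    return new_index, new_content


offsets, list_content = pack_list([4, 0, 2], [6, 2, 2], list(range(7)))
opt_index, opt_content = pack_indexed_option([3, 0, -1, 2, -1], ["a", "b", "c", "d"])
```

In both cases the packed content holds only the elements that are actually referenced, in the order they are referenced.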