Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: add ak.packed #912

Merged
merged 55 commits into from
Jun 14, 2021
Merged

Feature: add ak.packed #912

merged 55 commits into from
Jun 14, 2021

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Jun 11, 2021

This will simplify the length and order violating layouts such that external operations are compatible with per-layout transformations like ak.unflatten.

  • EmptyArray: unchanged
  • NumpyArray: converted to contiguous, if not already
  • RegularArray: truncate the content to len(original) * size with special handling (pass through) if size == 0
  • ListArray: convert toListOffsetArray64(true) (the true means starting at zero)
  • ListOffsetArray: convert toListOffsetArray64(true) (it's a pass-through if it's already true that offsets[0] == 0)
  • RecordArray: truncate all the contents to len(original)
  • IndexedArray: project() it
  • ByteMaskedArray: convert toIndexedOptionArray if the content does not have PrimitiveType. Doing so will naturally lead to the right kind of index. If not changing the type (because the content has PrimitiveType), at least truncate the content length to len(original).
  • IndexedOptionArray: convert toByteMaskedArray if the content has PrimitiveType; otherwise, we want to project the content such that the non-negative index values become simple counting... the index should end up looking like 0, 1, 2, -1, 3, -1, -1, -1, 4, 5.... That will take some thought.
  • BitMaskedArray: convert toIndexedOptionArray if the content does not have PrimitiveType. If not changing the type, at least truncate the content length to len(original).
  • UnmaskedArray: unchanged (but recursively descend, of course)
  • UnionArray: simplify the index by projecting each of the contents
  • VirtualArray: materialize and recursively descend
  • PartitionedArray concatenate partitions?

@agoose77 agoose77 changed the title Feature add layout simplification Feature: add layout simplification Jun 11, 2021
@agoose77 agoose77 force-pushed the feature-add-simplify branch from 4ec56fe to 43f6fcf Compare June 11, 2021 14:32
@agoose77
Copy link
Collaborator Author

@jpivarski I'm going to be working on this but please feel free to make suggestions at any time!

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 11, 2021

I'm not sure how to handle VirtualArray. It could just pass through to avoid realising it, but equally, it would be confusing if you had to trigger a load in order for simplify to traverse into it. Perhaps using the same flag materialize=True as recursive_walk (but default to True) would be the right thing here?

@jpivarski
Copy link
Member

Perhaps a flag realise_virtual (or something)

The word that we use for that is "materialize," though I think it would always be true for broadcasting or recursively applying. Operations should be expected to make a lazy array non-lazy. That's what this does:

https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/_util.py#L1312-L1313

@agoose77
Copy link
Collaborator Author

Perhaps a flag realise_virtual (or something)

The word that we use for that is "materialize," though I think it would always be true for broadcasting or recursively applying. Operations should be expected to make a lazy array non-lazy. That's what this does:

https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/_util.py#L1312-L1313

Great. I sort of agree. I think maybe we don't add the flag for that reason, also because recursively_apply doesn't define the same arg.

@jpivarski
Copy link
Member

jpivarski commented Jun 11, 2021

This simplify function looks very similar to one we've wanted for a while that would be called ak.packed: #746. I think you should make this into that: packing does simplify, but it also removes inaccessible elements.

In this comment, I'll write a list of rules that ak.packed should have. Since you'd be doing something on just about every node, it would probably make more sense to not try to shoehorn it into ak._util.recursively_apply. That function is for avoiding boilerplate when you only want to affect one node level; here, you'll be affecting them all.

  • EmptyArray: unchanged
  • NumpyArray: converted to contiguous, if not already
  • RegularArray: truncate the content to len(original) * size with special handling (pass through) if size == 0
  • ListArray: convert toListOffsetArray64(true) (the true means starting at zero)
  • ListOffsetArray: convert toListOffsetArray64(true) (it's a pass-through if it's already true that offsets[0] == 0)
  • RecordArray: truncate all the contents to len(original)
  • IndexedArray: project() it
  • IndexedOptionArray: convert toByteMaskedArray if the content has PrimitiveType; otherwise, we want to project the content such that the non-negative index values become simple counting... the index should end up looking like 0, 1, 2, -1, 3, -1, -1, -1, 4, 5.... That will take some thought.
  • ByteMaskedArray: convert toIndexedOptionArray if the content does not have PrimitiveType. Doing so will naturally lead to the right kind of index. If not changing the type (because the content has PrimitiveType), at least truncate the content length to len(original).
  • BitMaskedArray: convert toIndexedOptionArray if the content does not have PrimitiveType. If not changing the type, at least truncate the content length to len(original).
  • UnmaskedArray: unchanged (but recursively descend, of course)
  • UnionArray: simplify the index by projecting each of the contents
  • VirtualArray: materialize and recursively descend

Before applying the above rules, option-type nodes (IndexedOptionArray, ByteMaskedArray, BitMaskedArray, UnmaskedArray) should be simplified and UnionArrays should be simplified. This prevents double-masking and unnecessary unions. In principle, we shouldn't be seeing these double-masked or unnecessary unions, but bugs happen and a function like this should be defensive against that.

All of the above rules are intended to make pickled and otherwise serialized data as small as they can be, without resorting to compression techniques (like run-length encoding, variable-length encoding) or packing bytes into bits (we'll leave them as bits if they're bits; otherwise no).

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 11, 2021

This simplify function looks very similar to one we've wanted for a while that would be called ak.packed: #746. I think you should make this into that: packing does simplify, but it also removes inaccessible elements.

Sure, I had the same thoughts as I pushed the PR. Let's do it.

In this comment, I'll write a list of rules that ak.packed should have. Since you'd be doing something on just about every node, it would probably make more sense to not try to shoehorn it into ak._util.recursively_apply. That function is for avoiding boilerplate when you only want to affect one node level; here, you'll be affecting them all.

Part of the logic for using recursively_apply is that I don't want to re-implement the type-aware tree logic that it already implements. What are you thoughts there?

@agoose77 agoose77 changed the title Feature: add layout simplification Feature: add ak.packed Jun 11, 2021
@jpivarski
Copy link
Member

Part of the logic for using recursively_apply is that I don't want to re-implement the type-aware tree logic that it already implements. What are you thoughts there?

The reasons one might want to do that are (a) to simplify/reduce boilerplate and/or (b) keep all "knowledge" of how to recurse in one place.

Some functions are made simpler with recursively_apply because they only do something specialized for one node type, all others "pass through" in a way that is very similar from one function to another. As long as we were talking about that (e.g. in unflatten), that's a good argument to use and possibly expand recursively_apply. But for this case, as you'll see when I finish the comment above, there will be something non-trivial to do for nearly every node type. Trying to shoehorn this function into recursively_apply would only make it more complex, not less complex. It would be like trying to read some callback-heavy Javascript code. Sometimes, a straight-through, top-to-bottom Fortran style is more appropriate—whatever makes it easier to read, and that depends on context. So motivation (a) doesn't apply to this one.

For (b), that ship has sailed. The "knowledge" of what kinds of children each node type has is scattered throughout the codebase, mostly for the reason that centralizing it would make the code harder to read, and (you may have seen) some parts are already hard enough. Because of this lack of DRY centralization, we have lost the ability to easily add a new node type. That's something I accepted at the beginning because of how previous attempts went, and therefore put in an effort to get the set of node types right, knowing that it would be hard to add new ones in the future.

For examples of the ship sailing, look at convert.py.

@jpivarski
Copy link
Member

Ping (because GitHub doesn't email us about edited comments, only new comments). I've posted a list of rules in #912 (comment).

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 11, 2021

Part of the logic for using recursively_apply is that I don't want to re-implement the type-aware tree logic that it already implements. What are you thoughts there?

The reasons one might want to do that are (a) to simplify/reduce boilerplate and/or (b) keep all "knowledge" of how to recurse in one place.

You're right. It's a combination of the two, but taking a step back and thinking about it - there is no saving on code here, it's just trampolining into the same function which implements most of the types. I'll move over to a recursion when I am back at it.

@agoose77
Copy link
Collaborator Author

Before applying the above rules, option-type nodes (IndexedOptionArray, ByteMaskedArray, BitMaskedArray, UnmaskedArray) should be simplified and UnionArrays should be simplified. This prevents double-masking and unnecessary unions. In principle, we shouldn't be seeing these double-masked or unnecessary unions, but bugs happen and a function like this should be defensive against that.

I'm not yet familiar with all of these types (though the docs make their usage clear). Are you suggesting that this should be a two-pass process, with the first pass operating upon option types only?

@jpivarski
Copy link
Member

Are you suggesting that this should be a two-pass process, with the first pass operating upon option types only?

No, it can be a single pass, but when you encounter, say, an IndexedOptionArray, do

elif isinstance(layout, ak._util.indexedoptiontypes):
    if isinstance(layout.content, ak._util.optiontypes):   # the bad case we want to protect against
        return recurse(layout.simplify())
    # apply the rules for IndexedOptionArray

This won't infinitely recurse because layout.simplify() will reduce the number of levels of option-types containing option-types by 1. Eventually, the option-type node won't contain an option-type node: it will eventually reach that state because it's possible to turn any option-type of option-type into just an option-type. But that also means that it's essential to guard this recursion with isinstance(layout.content, ak._util.optiontypes).

For unions, it will be somewhat trickier because some unions can be simplified and others can't. There's a rule that I can help you with when you get to it: a UnionArray can be simplified when all of its contents are mergeable with each other. (Actually, you'd be able to work that out, I believe.)

Also note that we can isinstance groups of classes with similar properties using

https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/_util.py#L66-L105

@agoose77
Copy link
Collaborator Author

For unions, it will be somewhat trickier because some unions can be simplified and others can't. There's a rule that I can help you with when you get to it: a UnionArray can be simplified when all of its contents are mergeable with each other. (Actually, you'd be able to work that out, I believe.)

Also note that we can isinstance groups of classes with similar properties using

https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/_util.py#L66-L105

For unions, I think the behaviour that we want is:

  1. to lift any immediate child unions into the parent union
  2. to merge any mergable contents of the new top-level union
  3. to compact the array contents and their associated indices.
    Does this match your expectations?

(1.) is already handled by layout.simplify, but doesn't behave as I would expect where the array types aren't compatible (e.g float vs int). I wonder if this needs to be done manually instead.
(2.) this is also (I think) handled by layout.simplify, but doesn't behave well as discussed. Effectively, I think we want to find the partition of the set of contents with the smallest cardinality?

@jpivarski
Copy link
Member

For unions, I think the behaviour that we want is:

  1. to lift any immediate child unions into the parent union
  2. to merge any mergable contents of the new top-level union
  3. to compact the array contents and their associated indices.
    Does this match your expectations?

Yes, this is right. And UnionArray's simplify does this—the merging of ints and floats (and complex) is intentional: it's a choice to consider them subtypes of each other. (I can't conceive of a real-world advantage of a "union of ints and floats" over "just floats." There's a potential distinction in precision, but again, I can't conceive of a real-world case where you'd want to mix them functionally without mixing them numerically.)

Just using UnionArray's simplify before going in and packing the contents is fine. What I thought would be tricky would be determining if simplify would change anything before calling it. If you always call simplify and recurse on the result of the simplification, it would be an infinite recursive loop. I was talking about how to write the guard.

The guard is easier for option-types, since option-type's simplify just ensures that two option-type nodes aren't double-stacked, nothing more. It's easy to write the guard for that.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 12, 2021 via email

@jpivarski
Copy link
Member

Yes, I need to think about this a bit more. I'm not sure whether the conversion is safe though. How do we handle converting uint64 to floats? I'm afk right now but the thought occurred to me that the maxint is bigger than the 2^53-1.

It's actually the same conversion as NumPy. (We let them decide the "right" way to merge numbers and then match it.)

@agoose77
Copy link
Collaborator Author

Hmm. I can see that NumPy provide such a function, it just seems like a last-resort because of the lack of representation for large ints.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 13, 2021

@jpivarski how would you feel about exposing referentially_equal, e.g. as __eq__? I think this would be useful, and would avoid us having to test for simplification. Looking at simplify_uniontype, it seems that all I need to test for is whether any contents are also union arrays, right? The merge ability is just a secondary optimisation for the contents of these nested union arrays, but from the top-level perspective, simplify_uniontype will always merge the nested union's contents with its own.

Furthermore, I don't think there a recursion risk here. If simplify returns a non UnionArray type, then we recurse on the result. Otherwise, we treat the simplification as a success (it doesn't matter if it actually did anything), and simplify the contents.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 13, 2021

To implement the truncation step, I've added a private truncate helper that introduces a new IndexedArray64 that wraps the existing content. Is this a bad idea? My thought process is that we then do not need to implement a new routine that truncates each type - Awkward already knows how to do this.

@agoose77
Copy link
Collaborator Author

Why do we care whether the content has PrimitiveType in the masked types?

@agoose77
Copy link
Collaborator Author

Now to push the tests...

@agoose77
Copy link
Collaborator Author

I've made a mistake in the IndexedOptionArray case which I need to fix. I also need to make changes to stop mutating existing properties of layouts in-place.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 14, 2021

OK, this is looking like it is nearly ready for a review. I have not yet implemented the axis / depth argument. I don't think we want to restrict this to only a single axis if depth is not None. Instead, I would probably want to define the upper bound (i.e. simplify to a given inclusive depth). The depth = None case can be used to simplify the entire structure. Do you think this is reasonable?

As you can imagine, the reason that I want to pass a depth parameter is that for routines like unflatten, we do not need any simplification below the axis of unflattening, so it would be unnecessary to invoke the copies etc that would be required to flatten the entire subtree.

@agoose77
Copy link
Collaborator Author

Additionally, the tests are probably not quite finished - I don't thoroughly check that identities and parameters are preserved, or that the contents are always simplified too. Perhaps some input as to how much testing we want to do would be helpful.

@agoose77
Copy link
Collaborator Author

agoose77 commented Jun 14, 2021

When implementing the depth parameter (I settled on to_axis because it's up-to-and-including the axis), I observed that UnionArray raises an exception when asked to wrap axis=-1 for the following layout

a = ak.layout.NumpyArray(np.arange(4))
b = ak.layout.RegularArray(ak.layout.NumpyArray(np.arange(12)), 3)
layout = ak.layout.UnionArray8_64(
    ak.layout.Index8([1, 1,  0, 0]), ak.layout.Index64([0,1, 0,1]), [a, b]
)
layout.axis_wrap_if_negative(-1)

I expect this to just return -1, because each array has its own concept of axis=-1. Do you have any thoughts here?

@jpivarski
Copy link
Member

I think ak.packed should always apply to all levels: no axis (depth) parameter. I can't imagine why anyone would want to partially pack an array: it's an "invisible" feature (like virtualness, partitioning, categoricals with IndexedArray, etc.). It would be hard to even know that it has been applied to one axis and not another—it's something one would want to apply to a whole array before ak.to_buffers or to reduce the indirection in an array and make it faster to access.

  • PartitionedArray concatenate partitions?

I think it would be more useful to leave the partitions as partitions. ak.repartition generalizes the process of concatenating them, so it would be good if ak.packed and ak.repartition can be used independently. Since ak.packed is important for storage, we don't want it to interfere with another storage-related property.

@agoose77
Copy link
Collaborator Author

@jpivarski I was motivated to work on this by #910. In that case, we don't need to flatten every axis unless the user passes axis=-1, and indeed, one wouldn't want to flatten additional axes unless required to, because it introduces many copies of the data. For example, if I wanted to add structure to a wide RecordArray, ak.unflatten would then potentially require copying all of the branches (to project them).

@agoose77
Copy link
Collaborator Author

Sorry for the git noise. I rebased to drop the first commit.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good! Just a few comments inline.

src/awkward/operations/structure.py Outdated Show resolved Hide resolved
src/awkward/operations/structure.py Outdated Show resolved Hide resolved
src/awkward/operations/structure.py Outdated Show resolved Hide resolved
src/awkward/operations/structure.py Outdated Show resolved Hide resolved
@agoose77
Copy link
Collaborator Author

OK, since making the discussed changes, I added a missing pass-through case for PartitionedArray, and another for Record

@jpivarski jpivarski linked an issue Jun 14, 2021 that may be closed by this pull request
@jpivarski
Copy link
Member

I just made ak.packed a preprocessing step for pickling, which resolves scikit-hep/awkward-0.x#246 and #701. Also, I think there's some alliterative possibilities there.

I added documentation (that I'm not 100% sure will render correctly; I'm just crossing my fingers and will look at awkward-array.org when it's done). I don't think you have any other changes to make, so I'm enabling auto-merge. If there's something else you want me to add, let me know right away and I can stop the process!

@jpivarski jpivarski enabled auto-merge (squash) June 14, 2021 22:00
@jpivarski jpivarski merged commit 0fe7791 into main Jun 14, 2021
@jpivarski jpivarski deleted the feature-add-simplify branch June 14, 2021 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

We need an ak.packed function
2 participants