Feature: add `ak.packed` #912

agoose77 · 2021-06-11T14:30:03Z

This will simplify the length and order violating layouts such that external operations are compatible with per-layout transformations like ak.unflatten.

agoose77 · 2021-06-11T14:33:17Z

@jpivarski I'm going to be working on this but please feel free to make suggestions at any time!

agoose77 · 2021-06-11T14:38:24Z

I'm not sure how to handle VirtualArray. It could just pass through to avoid realising it, but equally, it would be confusing if you had to trigger a load in order for simplify to traverse into it. Perhaps using the same flag materialize=True as recursive_walk (but default to True) would be the right thing here?

jpivarski · 2021-06-11T14:45:52Z

Perhaps a flag realise_virtual (or something)

The word that we use for that is "materialize," though I think it would always be true for broadcasting or recursively applying. Operations should be expected to make a lazy array non-lazy. That's what this does:

https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/_util.py#L1312-L1313

agoose77 · 2021-06-11T14:47:57Z

Perhaps a flag realise_virtual (or something)

The word that we use for that is "materialize," though I think it would always be true for broadcasting or recursively applying. Operations should be expected to make a lazy array non-lazy. That's what this does:

https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/_util.py#L1312-L1313

Great. I sort of agree. I think maybe we don't add the flag for that reason, also because recursively_apply doesn't define the same arg.

jpivarski · 2021-06-11T14:50:32Z

This simplify function looks very similar to one we've wanted for a while that would be called ak.packed: #746. I think you should make this into that: packing does simplify, but it also removes inaccessible elements.

In this comment, I'll write a list of rules that ak.packed should have. Since you'd be doing something on just about every node, it would probably make more sense to not try to shoehorn it into ak._util.recursively_apply. That function is for avoiding boilerplate when you only want to affect one node level; here, you'll be affecting them all.

EmptyArray: unchanged
NumpyArray: converted to contiguous, if not already
RegularArray: truncate the content to len(original) * size with special handling (pass through) if size == 0
ListArray: convert toListOffsetArray64(true) (the true means starting at zero)
ListOffsetArray: convert toListOffsetArray64(true) (it's a pass-through if it's already true that offsets[0] == 0)
RecordArray: truncate all the contents to len(original)
IndexedArray: project() it
IndexedOptionArray: convert toByteMaskedArray if the content has PrimitiveType; otherwise, we want to project the content such that the non-negative index values become simple counting... the index should end up looking like 0, 1, 2, -1, 3, -1, -1, -1, 4, 5.... That will take some thought.
ByteMaskedArray: convert toIndexedOptionArray if the content does not have PrimitiveType. Doing so will naturally lead to the right kind of index. If not changing the type (because the content has PrimitiveType), at least truncate the content length to len(original).
BitMaskedArray: convert toIndexedOptionArray if the content does not have PrimitiveType. If not changing the type, at least truncate the content length to len(original).
UnmaskedArray: unchanged (but recursively descend, of course)
UnionArray: simplify the index by projecting each of the contents
VirtualArray: materialize and recursively descend

Before applying the above rules, option-type nodes (IndexedOptionArray, ByteMaskedArray, BitMaskedArray, UnmaskedArray) should be simplified and UnionArrays should be simplified. This prevents double-masking and unnecessary unions. In principle, we shouldn't be seeing these double-masked or unnecessary unions, but bugs happen and a function like this should be defensive against that.

All of the above rules are intended to make pickled and otherwise serialized data as small as they can be, without resorting to compression techniques (like run-length encoding, variable-length encoding) or packing bytes into bits (we'll leave them as bits if they're bits; otherwise no).

agoose77 · 2021-06-11T14:52:50Z

This simplify function looks very similar to one we've wanted for a while that would be called ak.packed: #746. I think you should make this into that: packing does simplify, but it also removes inaccessible elements.

Sure, I had the same thoughts as I pushed the PR. Let's do it.

In this comment, I'll write a list of rules that ak.packed should have. Since you'd be doing something on just about every node, it would probably make more sense to not try to shoehorn it into ak._util.recursively_apply. That function is for avoiding boilerplate when you only want to affect one node level; here, you'll be affecting them all.

Part of the logic for using recursively_apply is that I don't want to re-implement the type-aware tree logic that it already implements. What are you thoughts there?

jpivarski · 2021-06-11T15:02:37Z

Part of the logic for using recursively_apply is that I don't want to re-implement the type-aware tree logic that it already implements. What are you thoughts there?

The reasons one might want to do that are (a) to simplify/reduce boilerplate and/or (b) keep all "knowledge" of how to recurse in one place.

Some functions are made simpler with recursively_apply because they only do something specialized for one node type, all others "pass through" in a way that is very similar from one function to another. As long as we were talking about that (e.g. in unflatten), that's a good argument to use and possibly expand recursively_apply. But for this case, as you'll see when I finish the comment above, there will be something non-trivial to do for nearly every node type. Trying to shoehorn this function into recursively_apply would only make it more complex, not less complex. It would be like trying to read some callback-heavy Javascript code. Sometimes, a straight-through, top-to-bottom Fortran style is more appropriate—whatever makes it easier to read, and that depends on context. So motivation (a) doesn't apply to this one.

For (b), that ship has sailed. The "knowledge" of what kinds of children each node type has is scattered throughout the codebase, mostly for the reason that centralizing it would make the code harder to read, and (you may have seen) some parts are already hard enough. Because of this lack of DRY centralization, we have lost the ability to easily add a new node type. That's something I accepted at the beginning because of how previous attempts went, and therefore put in an effort to get the set of node types right, knowing that it would be hard to add new ones in the future.

For examples of the ship sailing, look at convert.py.

jpivarski · 2021-06-11T15:23:11Z

Ping (because GitHub doesn't email us about edited comments, only new comments). I've posted a list of rules in #912 (comment).

agoose77 · 2021-06-11T15:33:53Z

Part of the logic for using recursively_apply is that I don't want to re-implement the type-aware tree logic that it already implements. What are you thoughts there?

The reasons one might want to do that are (a) to simplify/reduce boilerplate and/or (b) keep all "knowledge" of how to recurse in one place.

You're right. It's a combination of the two, but taking a step back and thinking about it - there is no saving on code here, it's just trampolining into the same function which implements most of the types. I'll move over to a recursion when I am back at it.

agoose77 · 2021-06-11T16:13:16Z

Before applying the above rules, option-type nodes (IndexedOptionArray, ByteMaskedArray, BitMaskedArray, UnmaskedArray) should be simplified and UnionArrays should be simplified. This prevents double-masking and unnecessary unions. In principle, we shouldn't be seeing these double-masked or unnecessary unions, but bugs happen and a function like this should be defensive against that.

I'm not yet familiar with all of these types (though the docs make their usage clear). Are you suggesting that this should be a two-pass process, with the first pass operating upon option types only?

jpivarski · 2021-06-11T17:58:36Z

Are you suggesting that this should be a two-pass process, with the first pass operating upon option types only?

No, it can be a single pass, but when you encounter, say, an IndexedOptionArray, do

elif isinstance(layout, ak._util.indexedoptiontypes):
    if isinstance(layout.content, ak._util.optiontypes):   # the bad case we want to protect against
        return recurse(layout.simplify())
    # apply the rules for IndexedOptionArray

This won't infinitely recurse because layout.simplify() will reduce the number of levels of option-types containing option-types by 1. Eventually, the option-type node won't contain an option-type node: it will eventually reach that state because it's possible to turn any option-type of option-type into just an option-type. But that also means that it's essential to guard this recursion with isinstance(layout.content, ak._util.optiontypes).

For unions, it will be somewhat trickier because some unions can be simplified and others can't. There's a rule that I can help you with when you get to it: a UnionArray can be simplified when all of its contents are mergeable with each other. (Actually, you'd be able to work that out, I believe.)

Also note that we can isinstance groups of classes with similar properties using

https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/_util.py#L66-L105

agoose77 · 2021-06-12T16:51:28Z

For unions, it will be somewhat trickier because some unions can be simplified and others can't. There's a rule that I can help you with when you get to it: a UnionArray can be simplified when all of its contents are mergeable with each other. (Actually, you'd be able to work that out, I believe.)

Also note that we can isinstance groups of classes with similar properties using

https://github.com/scikit-hep/awkward-1.0/blob/94de4e5112ad3a2d5fc9c2ec0fc29e242543a0d6/src/awkward/_util.py#L66-L105

For unions, I think the behaviour that we want is:

to lift any immediate child unions into the parent union
to merge any mergable contents of the new top-level union
to compact the array contents and their associated indices.
Does this match your expectations?

(1.) is already handled by layout.simplify, but doesn't behave as I would expect where the array types aren't compatible (e.g float vs int). I wonder if this needs to be done manually instead.
(2.) this is also (I think) handled by layout.simplify, but doesn't behave well as discussed. Effectively, I think we want to find the partition of the set of contents with the smallest cardinality?

jpivarski · 2021-06-12T17:26:05Z

For unions, I think the behaviour that we want is:

to lift any immediate child unions into the parent union

to merge any mergable contents of the new top-level union

to compact the array contents and their associated indices.
Does this match your expectations?

Yes, this is right. And UnionArray's simplify does this—the merging of ints and floats (and complex) is intentional: it's a choice to consider them subtypes of each other. (I can't conceive of a real-world advantage of a "union of ints and floats" over "just floats." There's a potential distinction in precision, but again, I can't conceive of a real-world case where you'd want to mix them functionally without mixing them numerically.)

Just using UnionArray's simplify before going in and packing the contents is fine. What I thought would be tricky would be determining if simplify would change anything before calling it. If you always call simplify and recurse on the result of the simplification, it would be an infinite recursive loop. I was talking about how to write the guard.

The guard is easier for option-types, since option-type's simplify just ensures that two option-type nodes aren't double-stacked, nothing more. It's easy to write the guard for that.

agoose77 · 2021-06-12T21:17:17Z

Yes, I need to think about this a bit more. I'm not sure whether the conversion is safe though. How do we handle converting uint64 to floats? I'm afk right now but the thought occurred to me that the maxint is bigger than the 2^53-1.

…

On Sat, 12 Jun 2021, 18:26 Jim Pivarski, ***@***.***> wrote: For unions, I think the behaviour that we want is: 1. to lift any immediate child unions into the parent union 2. to merge any mergable contents of the new top-level union 3. to compact the array contents and their associated indices. Does this match your expectations? Yes, this is right. And UnionArray's simplify does this—the merging of ints and floats (and complex) is intentional: it's a choice to consider them subtypes of each other. (I can't conceive of a real-world advantage of a "union of ints and floats" over "just floats." There's a potential distinction in precision, but again, I can't conceive of a *real-world* case where you'd want to mix them functionally without mixing them numerically.) Just using UnionArray's simplify before going in and packing the contents is fine. What I thought would be tricky would be determining if simplify would change anything before calling it. If you always call simplify and recurse on the result of the simplification, it would be an infinite recursive loop. I was talking about how to write the guard. The guard is easier for option-types, since option-type's simplify just ensures that two option-type nodes aren't double-stacked, nothing more. It's easy to write the guard for that. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#912 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJQZHO3TWORBCON6LXLNE3TSOKDTANCNFSM46Q3NZEQ> .

jpivarski · 2021-06-12T22:16:48Z

Yes, I need to think about this a bit more. I'm not sure whether the conversion is safe though. How do we handle converting uint64 to floats? I'm afk right now but the thought occurred to me that the maxint is bigger than the 2^53-1.

It's actually the same conversion as NumPy. (We let them decide the "right" way to merge numbers and then match it.)

agoose77 · 2021-06-13T14:30:37Z

Hmm. I can see that NumPy provide such a function, it just seems like a last-resort because of the lack of representation for large ints.

agoose77 · 2021-06-13T14:48:38Z

@jpivarski how would you feel about exposing referentially_equal, e.g. as __eq__? I think this would be useful, and would avoid us having to test for simplification. Looking at simplify_uniontype, it seems that all I need to test for is whether any contents are also union arrays, right? The merge ability is just a secondary optimisation for the contents of these nested union arrays, but from the top-level perspective, simplify_uniontype will always merge the nested union's contents with its own.

Furthermore, I don't think there a recursion risk here. If simplify returns a non UnionArray type, then we recurse on the result. Otherwise, we treat the simplification as a success (it doesn't matter if it actually did anything), and simplify the contents.

agoose77 · 2021-06-13T15:25:16Z

To implement the truncation step, I've added a private truncate helper that introduces a new IndexedArray64 that wraps the existing content. Is this a bad idea? My thought process is that we then do not need to implement a new routine that truncates each type - Awkward already knows how to do this.

agoose77 · 2021-06-13T16:59:46Z

Why do we care whether the content has PrimitiveType in the masked types?

agoose77 · 2021-06-13T17:51:38Z

Now to push the tests...

agoose77 · 2021-06-13T19:27:49Z

I've made a mistake in the IndexedOptionArray case which I need to fix. I also need to make changes to stop mutating existing properties of layouts in-place.

agoose77 · 2021-06-14T06:22:28Z

OK, this is looking like it is nearly ready for a review. I have not yet implemented the axis / depth argument. I don't think we want to restrict this to only a single axis if depth is not None. Instead, I would probably want to define the upper bound (i.e. simplify to a given inclusive depth). The depth = None case can be used to simplify the entire structure. Do you think this is reasonable?

As you can imagine, the reason that I want to pass a depth parameter is that for routines like unflatten, we do not need any simplification below the axis of unflattening, so it would be unnecessary to invoke the copies etc that would be required to flatten the entire subtree.

agoose77 · 2021-06-14T06:23:40Z

Additionally, the tests are probably not quite finished - I don't thoroughly check that identities and parameters are preserved, or that the contents are always simplified too. Perhaps some input as to how much testing we want to do would be helpful.

agoose77 · 2021-06-14T07:29:56Z

When implementing the depth parameter (I settled on to_axis because it's up-to-and-including the axis), I observed that UnionArray raises an exception when asked to wrap axis=-1 for the following layout

a = ak.layout.NumpyArray(np.arange(4))
b = ak.layout.RegularArray(ak.layout.NumpyArray(np.arange(12)), 3)
layout = ak.layout.UnionArray8_64(
    ak.layout.Index8([1, 1,  0, 0]), ak.layout.Index64([0,1, 0,1]), [a, b]
)
layout.axis_wrap_if_negative(-1)

I expect this to just return -1, because each array has its own concept of axis=-1. Do you have any thoughts here?

jpivarski · 2021-06-14T13:53:38Z

I think ak.packed should always apply to all levels: no axis (depth) parameter. I can't imagine why anyone would want to partially pack an array: it's an "invisible" feature (like virtualness, partitioning, categoricals with IndexedArray, etc.). It would be hard to even know that it has been applied to one axis and not another—it's something one would want to apply to a whole array before ak.to_buffers or to reduce the indirection in an array and make it faster to access.

PartitionedArray concatenate partitions?

I think it would be more useful to leave the partitions as partitions. ak.repartition generalizes the process of concatenating them, so it would be good if ak.packed and ak.repartition can be used independently. Since ak.packed is important for storage, we don't want it to interfere with another storage-related property.

agoose77 · 2021-06-14T13:56:18Z

@jpivarski I was motivated to work on this by #910. In that case, we don't need to flatten every axis unless the user passes axis=-1, and indeed, one wouldn't want to flatten additional axes unless required to, because it introduces many copies of the data. For example, if I wanted to add structure to a wide RecordArray, ak.unflatten would then potentially require copying all of the branches (to project them).

This reverts commit 08b2b08.

agoose77 · 2021-06-14T15:14:30Z

Sorry for the git noise. I rebased to drop the first commit.

jpivarski

This is looking good! Just a few comments inline.

src/awkward/operations/structure.py

This won't involve a copy.

agoose77 · 2021-06-14T20:33:23Z

OK, since making the discussed changes, I added a missing pass-through case for PartitionedArray, and another for Record

jpivarski · 2021-06-14T22:00:05Z

I just made ak.packed a preprocessing step for pickling, which resolves scikit-hep/awkward-0.x#246 and #701. Also, I think there's some alliterative possibilities there.

I added documentation (that I'm not 100% sure will render correctly; I'm just crossing my fingers and will look at awkward-array.org when it's done). I don't think you have any other changes to make, so I'm enabling auto-merge. If there's something else you want me to add, let me know right away and I can stop the process!

agoose77 changed the title ~~Feature add layout simplification~~ Feature: add layout simplification Jun 11, 2021

agoose77 force-pushed the feature-add-simplify branch from 4ec56fe to 43f6fcf Compare June 11, 2021 14:32

agoose77 changed the title ~~Feature: add layout simplification~~ Feature: add ak.packed Jun 11, 2021

agoose77 mentioned this pull request Jun 14, 2021

ak.unflatten does not respect IndexedArray w.r.t counts #910

Closed

agoose77 added 10 commits June 14, 2021 16:10

Test: add RegularArray test

e047462

Test: add BitMaskedArray test

f4b31d6

Test: add ByteMaskedArray test

62e605b

Feature: add highlevel argument to packed

d9fad97

Bugfix: call apply for ListArray

77413b5

Revert "Bugfix: call apply for ListArray"

e60623f

This reverts commit 08b2b08.

Test: fix tests for RegularArray

ab493c5

Bugfix: fix RegularArray

0ffdf40

Refactor: make implementation private with hidden _axis argument

c86ef41

Test: test axis argument

ad58d33

agoose77 force-pushed the feature-add-simplify branch from 4619efa to ad58d33 Compare June 14, 2021 15:10

agoose77 mentioned this pull request Jun 14, 2021

Feature: add pass_apply argument to util #918

Merged

jpivarski reviewed Jun 14, 2021

View reviewed changes

src/awkward/operations/structure.py Outdated Show resolved Hide resolved

src/awkward/operations/structure.py Outdated Show resolved Hide resolved

src/awkward/operations/structure.py Outdated Show resolved Hide resolved

src/awkward/operations/structure.py Outdated Show resolved Hide resolved

agoose77 added 9 commits June 14, 2021 16:54

Refactor: explicitly test for r != 0

5e64582

Refactor: remove implicit elif via return

537ef7a

Refactor: raise AssertionError on unrecognised layouts

347c0e7

Refactor: use __getitem__ to truncate layouts

13d8a5e

This won't involve a copy.

Feature: support PartitionedArray (passthrough)

0085b3e

Feature: support Record (passthrough)

e4cb3c0

Test: add test for PartitionedArrays

0de3e24

Test: add test for Record type

6046822

Feature: add case for Record type

9584915

jpivarski linked an issue Jun 14, 2021 that may be closed by this pull request

We need an ak.packed function #746

Closed

ak.packed is now a step in pickling; added documentation.

834b3b8

jpivarski enabled auto-merge (squash) June 14, 2021 22:00

jpivarski merged commit 0fe7791 into main Jun 14, 2021

jpivarski deleted the feature-add-simplify branch June 14, 2021 22:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: add `ak.packed` #912

Feature: add `ak.packed` #912

agoose77 commented Jun 11, 2021 •

edited

Loading

agoose77 commented Jun 11, 2021

agoose77 commented Jun 11, 2021 •

edited

Loading

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021

jpivarski commented Jun 11, 2021 •

edited

Loading

agoose77 commented Jun 11, 2021 •

edited

Loading

jpivarski commented Jun 11, 2021

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021 •

edited

Loading

agoose77 commented Jun 11, 2021

jpivarski commented Jun 11, 2021

agoose77 commented Jun 12, 2021

jpivarski commented Jun 12, 2021

agoose77 commented Jun 12, 2021 via email

jpivarski commented Jun 12, 2021

agoose77 commented Jun 13, 2021

agoose77 commented Jun 13, 2021 •

edited

Loading

agoose77 commented Jun 13, 2021 •

edited

Loading

agoose77 commented Jun 13, 2021

agoose77 commented Jun 13, 2021

agoose77 commented Jun 13, 2021

agoose77 commented Jun 14, 2021 •

edited

Loading

agoose77 commented Jun 14, 2021

agoose77 commented Jun 14, 2021 •

edited

Loading

jpivarski commented Jun 14, 2021

agoose77 commented Jun 14, 2021

agoose77 commented Jun 14, 2021

jpivarski left a comment

agoose77 commented Jun 14, 2021

jpivarski commented Jun 14, 2021

Feature: add ak.packed #912

Feature: add ak.packed #912

Conversation

agoose77 commented Jun 11, 2021 • edited Loading

agoose77 commented Jun 11, 2021

agoose77 commented Jun 11, 2021 • edited Loading

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021

jpivarski commented Jun 11, 2021 • edited Loading

agoose77 commented Jun 11, 2021 • edited Loading

jpivarski commented Jun 11, 2021

jpivarski commented Jun 11, 2021

agoose77 commented Jun 11, 2021 • edited Loading

agoose77 commented Jun 11, 2021

jpivarski commented Jun 11, 2021

agoose77 commented Jun 12, 2021

jpivarski commented Jun 12, 2021

agoose77 commented Jun 12, 2021 via email

jpivarski commented Jun 12, 2021

agoose77 commented Jun 13, 2021

agoose77 commented Jun 13, 2021 • edited Loading

agoose77 commented Jun 13, 2021 • edited Loading

agoose77 commented Jun 13, 2021

agoose77 commented Jun 13, 2021

agoose77 commented Jun 13, 2021

agoose77 commented Jun 14, 2021 • edited Loading

agoose77 commented Jun 14, 2021

agoose77 commented Jun 14, 2021 • edited Loading

jpivarski commented Jun 14, 2021

agoose77 commented Jun 14, 2021

agoose77 commented Jun 14, 2021

jpivarski left a comment

Choose a reason for hiding this comment

agoose77 commented Jun 14, 2021

jpivarski commented Jun 14, 2021

Feature: add `ak.packed` #912

Feature: add `ak.packed` #912

agoose77 commented Jun 11, 2021 •

edited

Loading

agoose77 commented Jun 11, 2021 •

edited

Loading

jpivarski commented Jun 11, 2021 •

edited

Loading

agoose77 commented Jun 11, 2021 •

edited

Loading

agoose77 commented Jun 11, 2021 •

edited

Loading

agoose77 commented Jun 13, 2021 •

edited

Loading

agoose77 commented Jun 13, 2021 •

edited

Loading

agoose77 commented Jun 14, 2021 •

edited

Loading

agoose77 commented Jun 14, 2021 •

edited

Loading