Designing smart modules #1

francois-rozet · 2023-06-03T05:57:01Z

francois-rozet
Jun 3, 2023
Maintainer

Hello @patrick-kidger, we can continue the discussion we started in jax-ml/jax#16170 here. I chose to do it in this repo rather than pollute the issues of Equinox. To summarize, we agree that automatically detecting static leaves would be better than the current Equinox behavior. We both propose an approach that is based on the tree.

My approach is to detect static leaves when flattening the tree. The advantage is that everything, from the types to the very structure of the tree, can be modified in-place, as users would expect from a Python object. The main drawback is that it does not work with the current implementation of jax.vmap, which assumes that the tree structure does not change when leaves are modified. In my opinion, this is actually a limitation of JAX and should/could be fixed. In fact, JAX already has a similar behavior with None objects.

>>> jax.tree_util.tree_flatten((0, 1))
([0, 1], PyTreeDef((*, *)))
>>> jax.tree_util.tree_flatten((0, None))
([0], PyTreeDef((*, None)))
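
To make the flatten-time approach concrete, here is a minimal sketch of what such a rule could look like. It is not Inox's actual implementation: the class name `AutoNamespace` and the helper `is_array` are hypothetical, and the sketch assumes that only NumPy/JAX arrays are dynamic and that all other attributes are hashable. It shows that the tree structure then depends on the type of each attribute.

```python
import jax
import jax.numpy as jnp
import numpy as np

def is_array(x):
    # Assumption: only NumPy and JAX arrays count as dynamic leaves.
    return isinstance(x, (np.ndarray, jax.Array))

@jax.tree_util.register_pytree_node_class
class AutoNamespace:
    """Hypothetical namespace that sorts attributes into dynamic/static on flatten."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def tree_flatten(self):
        items = sorted(self.__dict__.items())
        # Arrays become children (leaves); everything else goes to the aux data.
        keys = tuple(k for k, v in items if is_array(v))
        children = tuple(v for k, v in items if is_array(v))
        static = tuple((k, v) for k, v in items if not is_array(v))
        return children, (keys, static)

    @classmethod
    def tree_unflatten(cls, aux, children):
        keys, static = aux
        return cls(**dict(zip(keys, children)), **dict(static))

a = AutoNamespace(weight=jnp.ones(3), bias=0.0)             # bias is a float
b = AutoNamespace(weight=jnp.ones(3), bias=jnp.zeros(()))   # bias is an array

n_leaves_a = len(jax.tree_util.tree_leaves(a))  # only weight is a leaf
n_leaves_b = len(jax.tree_util.tree_leaves(b))  # weight and bias are leaves
same_structure = jax.tree_util.tree_structure(a) == jax.tree_util.tree_structure(b)
```

Here `a` and `b` have different tree structures even though they are instances of the same class, which is exactly the property that transformations assuming a fixed structure cannot handle.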

You propose to detect static leaves when attributes are set (in __setattr__). The advantage is that tree_map (or equivalent composition of tree_flatten and tree_unflatten) would not change the tree structure. The main drawback is that in-place modifications of attributes would not work as users expect, as __getattr__ would create deep copies of the attribute. For instance,

module.container = []
module.container.append('something')  # container is still empty!

It might also add some overhead (although compiled away by jax.jit).
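
The pitfall above can be reproduced in plain Python. This is only a toy sketch under the assumption that attributes are wrapped on assignment and that reads return a copy of the wrapped value; `Static` and `WrappingModule` are hypothetical names, not classes from Equinox or Inox.

```python
import copy

class Static:
    """Hypothetical wrapper marking a value as static."""

    def __init__(self, value):
        self.value = value

class WrappingModule:
    """Toy module that wraps every attribute on assignment."""

    def __setattr__(self, name, value):
        # Detection happens here; this toy wraps everything for simplicity.
        object.__setattr__(self, name, Static(value))

    def __getattribute__(self, name):
        attr = object.__getattribute__(self, name)
        if isinstance(attr, Static):
            # Reads hand out a copy, so mutations never reach the stored value.
            return copy.deepcopy(attr.value)
        return attr

module = WrappingModule()
module.container = []
module.container.append('something')  # mutates a temporary copy
stored = module.container             # still an empty list
```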

Edit: Actually, I now think your approach should be more efficient. The only operation that is not JIT-optimized in JAX is tree_flatten, which is necessary to compare input PyTrees and detect cache collisions. With your approach, tree_flatten is much simpler and, hence, faster.

francois-rozet · 2023-06-03T09:08:23Z

francois-rozet
Jun 3, 2023
Maintainer Author

I pushed a new branch (dev) which implements your approach. I think this is quite nice, especially as tree_flatten and tree_unflatten are now inherited from Namespace. However, it would require heavy warnings in the documentation that in-place modifications do NOT work as one would expect.

What remains is to tackle the issue of the "stateful" layers, like BatchNorm and Dropout. I think a functional approach where the layers return their own state is too cumbersome and introduces a lot of complexity (for both users and developers). As you said, my approach fails silently when the module enters "pure function" territory, that is jax.jit and jax.vmap (I don't think jax.grad is an issue as you rarely need the gradient of the buffers anyway). However, this can be fixed easily by also returning the module/buffers along with the outputs.

We can either warn the user and give them the responsibility to handle this or introduce small wrappers around jax.jit and jax.vmap that would also return the inputs after application.
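
As a rough illustration of the second option, a wrapper around jax.jit could return the inputs alongside the outputs. This is a sketch only: `jit_returning_inputs` is a hypothetical helper, and a plain dict stands in for a module with buffers.

```python
import jax
import jax.numpy as jnp

def jit_returning_inputs(f):
    """Hypothetical wrapper: jit `f` and also return its inputs after the call."""
    @jax.jit
    def wrapped(*args):
        out = f(*args)
        return out, args  # `args` carries any in-place updates made by `f`
    return wrapped

def step(state, x):
    state['count'] = state['count'] + 1  # in-place update of a mutable pytree
    return 2 * x

state = {'count': jnp.zeros(())}

# Plain jax.jit: the update is applied to a traced copy and silently lost.
y = jax.jit(step)(state, jnp.ones(()))
count_after_plain_jit = float(state['count'])

# Wrapper: the updated inputs come back explicitly.
y, (new_state, _) = jit_returning_inputs(step)(state, jnp.ones(()))
count_after_wrapper = float(new_state['count'])
```

The caller still has to rebind `state = new_state`; the wrapper only makes the update visible, it does not mutate the original object.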

0 replies

patrick-kidger · 2023-06-03T18:10:13Z

patrick-kidger
Jun 3, 2023

So my strongest opinion is about this statement:

In my opinion, this is actually a limitation of JAX and should/could be fixed.

I strongly disagree!

Equinox makes heavy use of the assumption. I previously mentioned that it's what's needed to make equinox.tree_at work; it's also used substantially in many of the more advanced APIs inside equinox.internal. So I would very much like JAX not to lift that assumption, as it would probably break everything I do downstream!

Regarding simplicity of tree_flatten: yup, this is really important for the reason you describe. See this comment on how Distrax does something similar to what you're describing instead, and is very very slow as a result.

Moving on to in-place updates: Equinox deliberately has frozen modules, which you cannot modify outside of __init__. This functional style (a) avoids the problem that changes aren't propagated through flatten+unflatten, as already discussed, and (b) fixes one of the limitations that PyTorch has had on this front when doing data parallelism alongside stateful operations. In their case, e.g. BatchNorm updates will only use the data from just one device, because they had a different BatchNorm(...) instance for each device, and there was no way to express that the stateful operation should be propagated between them.

1 reply

francois-rozet Jun 4, 2023
Maintainer Author

IMO, JAX should not be a limiting factor for downstream tasks. If it is possible to drop this assumption without hurting JAX's performance/complexity, then it should be dropped. Downstream libraries/frameworks can make this assumption if they want/need to, but it is not JAX's job to impose it.

Anyway, as I have adopted your approach for static leaves, this is not an issue anymore (I hope).

Although I understand why frozen modules can be useful/safer, I think they make manipulating modules cumbersome. To some extent, the popularity of PyTorch is due to the ease and intuitiveness of composing/manipulating modules in-place. Want to replace the net.layers[0].activation layer? Just do it: net.layers[0].activation = nn.ReLU()! No need to introduce eqx.tree_at-like functions. Note that even if I allow in-place modifications, flatten+unflatten still makes a deep copy of Inox modules, so I think (a) is not an issue.

Now, because modules are frozen, stateful layers in Equinox must take/return their own state as input/output. This leads to a cascade of modules that must handle the state of their layers, hence introducing a lot of code. Stepping back, what happens for the user is something like

def f(params, buffers, x):
    model = build(params)
    y, buffers = model(x, buffers)
    return y, buffers

where buffers represent the state of stateful layers. With Inox's in-place approach, the previous code becomes

def f_bis(params, buffers, x):
    model = build(params, buffers)
    y = model(x)  # buffers is updated in-place
    return y, buffers

There is still a need for users to handle the buffers, but only outside of the module! For jax.jit and jax.vmap, however, f and f_bis are actually the exact same functions. Hence, if some sharding/data parallelism is possible with Equinox's approach, it should be possible with Inox's as well.
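
A runnable toy version of the two styles may help. It is a sketch under simplifying assumptions: the `build` step is dropped, the "layer" is a hypothetical running-mean subtraction, and plain Python floats replace arrays.

```python
def updated_mean(momentum, mean, x):
    # Exponential running mean, the only "state" in this toy example.
    return momentum * mean + (1 - momentum) * x

def f(params, buffers, x):
    # Functional style: state goes in, new state comes out.
    mean = updated_mean(params['momentum'], buffers['mean'], x)
    return x - mean, {'mean': mean}

def f_bis(params, buffers, x):
    # In-place style: the layer mutates the buffers it was given.
    buffers['mean'] = updated_mean(params['momentum'], buffers['mean'], x)
    return x - buffers['mean'], buffers

params = {'momentum': 0.9}
y1, b1 = f(params, {'mean': 0.0}, 1.0)
y2, b2 = f_bis(params, {'mean': 0.0}, 1.0)
same = (y1 == y2) and (b1 == b2)  # both styles expose the same signature and results
```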

patrick-kidger · 2023-06-04T17:18:39Z

patrick-kidger
Jun 4, 2023

IMO, JAX should not be a limiting factor for downstream tasks. If it is possible to drop this assumption without hurting JAX's performance/complexity, then it should be dropped. Downstream libraries/frameworks can make this assumption if they want/need to, but it is not JAX's job to impose it.

Once again I disagree. Changing this in JAX now would lead to further fragmentation in the ecosystem. (Which frankly we already have more than enough of -- Equinox vs Haiku vs Flax being the most prominent example.) Right now Equinox is compatible with every pytree, and thus compatible with essentially everything in the JAX ecosystem.

If we make this distinction of a privileged "special" kind of pytree then that goes away -- we end up with a mini "Equinox ecosystem" as being the only place where we could make compatibility guarantees. From a library author perspective: the loss of a shared language makes it much harder for new projects to get started. From an end user perspective: it means that two unrelated libraries are much less likely to be compatible.

Note that even if I allow in-place modifications, flatten+unflatten still makes a deep copy of Inox modules, so I think (a) is not an issue.

Flattening and unflattening happens implicitly in many places in JAX. Every time you cross jax.{jit, grad, vmap, ...} etc for one thing. When using higher-order primitives like lax.{scan, while_loop, ...} etc. is another. When doing anything nontrivial it becomes difficult to keep track of where this has happened -- and if you miss just one instance of this, then your code silently returns the wrong results. This is not scalable to larger codebases with multiple developers.

Now, because modules are frozen, stateful layers in Equinox must take/return their own state as input/output.

I'd argue this is a feature. For example when doing batch norm + neural ODEs, you will actually evaluate the layer multiple times, but only want to do the stateful update just once. (Else you get different normalisations at different points of the solve, which breaks the ODE-like structure.) Decoupling evaluation from stateful update makes this possible.
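
A minimal sketch of that decoupling, in plain Python and with a toy layer (the function `batch_norm` below is hypothetical and is not Equinox's BatchNorm): the layer only proposes a new state, and the caller decides when to commit it.

```python
def batch_norm(x, state, momentum=0.9):
    """Toy functional layer: returns the output and a proposed new state."""
    y = [xi - state['mean'] for xi in x]
    batch_mean = sum(x) / len(x)
    new_state = {'mean': momentum * state['mean'] + (1 - momentum) * batch_mean}
    return y, new_state

state = {'mean': 0.0}
x = [1.0, 3.0]

# Several evaluations within one solver step, all against the same state.
outputs = []
for stage in range(4):
    y, proposed = batch_norm(x, state)
    outputs.append(y)

# The stateful update is committed exactly once, after the step.
state = proposed
```

Because the state is not touched between stages, every stage sees the same normalisation.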

6 replies

patrick-kidger Jun 4, 2023

Right, but unless you're careful to always return the in-place-updated copy from each region, then you silently don't see the updates. We mooted this exact idea back when Equinox was first designed, and the ease of making a mistake was always the reason we didn't pursue it seriously.
Yeah, I believe you could find an alternate API, e.g. a flag for whether to actually make the in-place update or not. This flag is unnecessary for an explicit approach to statefulness though.
The main issue here is backward compatibility, specifically with leaves that are intended to be dynamic, but are initialised as bool/int/float/complex and implicitly promoted. I know I've done this several times. Perhaps only use auto-dynamic/static detection if attributes weren't declared in advance. (And continue to use that as the API for manual control over dynamic/static, for those few use cases where it's really needed.)

I'm not too worried about the backward incompatibility with modules not being dataclasses any more. I suspect that's used relatively rarely. We might even be able to fake it by adding __dataclass_fields__ and __dataclass_params__ to the instance after __init__. (Rather than the class, where those attributes are usually put.)

Realistically I don't expect to have time to put this change together any time soon, though! As discussed it's probably a bit tricky, a bit backward-incompatible, and ends up introducing an inconsistency: we would no longer need filter_jit (everything non-array is static), we would still need filter_grad (to separate inexact arrays from int/bool arrays).

francois-rozet Jun 5, 2023
Maintainer Author

I'll try to find something that makes updates more explicit. Maybe a Stateless module that can wrap another and make its updates functional.

Concerning the inconsistency, the Buffer class in Inox was initially not meant for in-place updates, but simply marking arrays that should not be optimized by gradient descent, similar to PyTorch's Buffer. This allows the partition method of Inox to return the list of params, list of buffers and tree-builder, instead of two trees like eqx.partition. Then, what the user is supposed to optimize is very explicit: the list of params. This is why I don't provide any filter_* helper: I want users to use partition for training.

Also, you mention in your docs that it is faster to make the training step take the flattened tree (because flattening is slow) as input. With Inox's partition, this is not necessary, as the partitions are already flat.

Realistically I don't expect to have time to put this change together any time soon, though!

In the meantime, can I add you as co-author of Inox?

patrick-kidger Jun 6, 2023

Also, you mention in your docs that it is faster to make the training step take the flattened tree (because flattening is slow) as input.

Actually, it's usually pretty negligible in most cases!

With Inox's partition, this is not necessary, as the partitions are already flat.

Even so this is an interesting idea. If I were to try this again I might give this a go.

In the meantime, can I add you as co-author of Inox?

I'm not sure I understand. Realistically I won't have time to make code contributions, if that's what you're asking.

francois-rozet Jun 6, 2023
Maintainer Author

Oh no no, I don't expect you to make code contributions. As you said, introducing the changes we discussed in Equinox is tricky, notably because of backward incompatibilities. However, these changes are already implemented in Inox. So it could become a philosophical successor/sibling of Equinox. But, I think you should be credited as co-author (in the License, PyPI and documentation), as Inox is heavily inspired by Equinox, and you contributed to its design through our discussions.

patrick-kidger Jun 7, 2023

Ah, I'd prefer not to be listed as an author, but thank you for the offer!

ASEM000 · 2023-06-08T17:57:40Z

ASEM000
Jun 8, 2023

Hello, I ran into the described Jax problem; I have authored a pytree library and tested some of the suggestions/ideas first-hand that @francois-rozet / @patrick-kidger mentioned.

From my experience as a user/designer of pytree-based libraries, it's tempting to add cool features like automatic detection of static fields/overriding some magic methods (like PyTorch itself does in 'Module'), etc. Still, this is fine if the user knows what's happening underneath. However, this is only true on occasion.
For example, I believe the Flax PyTreeNode/Equinox/simple_tree 'static field' is an example of this; the static field values are placed in tree metadata when flattening and must be hashable, but due to a Jax bug, non-hashable values are allowed in the metadata. Users do not receive any warning about non-hashability. The PalmJax implementation in Equinox has a fault like this, which took me some time to discover.
Another issue with static fields is speed; the flattening process will always loop through the fields, even if they are non-static; as @patrick-kidger mentioned, flattening is all over Jax internals/other libraries. This is one reason why libraries adopting this approach have a perf hit.
Third, while you may not want to change the static field, you may need to use jax.tree_map to filter depending on its value; if you constantly mark it as a static field, you lose this ability.

For in-place updates, The Immutability assumption is a fundamental assumption to abide by; if you break this assumption, I am pretty sure you will encounter all sorts of problems (as I have discovered). For tree modification, in my library this net.layers[0].activation = nn.ReLU() becomes net=net.at['layers'].at[0].at['activation'].set(nn.ReLU). This functionality is fully compatible with pytrees that are registered in the jax path registry and is decoupled from the module design.
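
For readers unfamiliar with out-of-place updates, the idea can be sketched with the standard library alone. This is not the `.at` API mentioned above nor `equinox.tree_at`; it uses `dataclasses.replace` on hypothetical frozen classes `Layer` and `Net` to show that the original tree is left untouched and a new one is returned.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Layer:
    activation: str

@dataclass(frozen=True)
class Net:
    layers: tuple

net = Net(layers=(Layer('tanh'), Layer('tanh')))

# Build a new first layer, then a new net that shares the untouched layers.
new_first = replace(net.layers[0], activation='relu')
new_net = replace(net, layers=(new_first,) + net.layers[1:])
```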

P.S. I really like the logo. :D

2 replies

francois-rozet Jun 12, 2023
Maintainer Author

Hello @ASEM000, thank you for the feedback and sorry for the delay.

the static field values are placed in tree metadata when flattening and must be hashable, but due to a Jax bug, non-hashable values are allowed in the metadata. Users do not receive any warning about non-hashability.

It should be fairly easy to add a warning for non-hashability, both in Equinox and Inox. However, I am not sure this is desirable as some non-hashable objects, like lambda functions, are perfectly fine to put inside a module. I think a bigger issue in Equinox is the impossibility to distinguish optimizable arrays (parameters) from non-optimizable ones (buffers). This is likely the reason for the PalmJax issue you mention.

Another issue with static fields is speed; the flattening process will always loop through the fields, even if they are non-static; as @patrick-kidger mentioned, flattening is all over Jax internals/other libraries. This is one reason why libraries adopting this approach have a perf hit.

I'm not sure to get your point. The flattening process will always iterate over the fields, regardless of the way static leaves are handled.

For in-place updates, The Immutability assumption is a fundamental assumption to abide by;

I disagree. JAX never assumes that PyTrees are immutable. In fact, lists and dicts, which are core PyTree objects, are mutable. It can be useful to make some PyTrees immutable, for the sake of safety, but making all PyTrees immutable is a convenience nightmare.

P.S. I really like the logo. :D

Thanks!

ASEM000 Jun 13, 2023

It should be fairly easy to add a warning for non-hashability, both in Equinox and Inox.

This is a reasonable solution.

I think a bigger issue in Equinox is the impossibility to distinguish optimizable arrays (parameters) from non-optimizable ones (buffers).

This is not an equinox problem.

I'm not sure to get your point. The flattening process will always iterate over the fields, regardless of the way static leaves are handled.

Since classes store their values in dicts, it's usually slower to iterate over a dict and perform some logic (static field logic) compared to using their built-in methods to flatten them. The example shows iterations vs. dict flatten method; I did not consider any static field logic, instance checks, and so on, which could take longer. My point is that regardless of whether a static field exists or not, you have to iterate, which would yield, in general, lower perf.

import dis 

a = { str(i): i for i in range(1_000)}

def flatten_dict(d):
    keys = []
    values = []
    for k, v in d.items():
        keys.append(k)
        values.append(v)
    return keys, values

# %timeit flatten_dict(a)
# 58.2 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

dis.dis(flatten_dict)  # dis.dis prints the disassembly itself and returns None
#   6           0 BUILD_LIST               0
#               2 STORE_FAST               1 (keys)

#   7           4 BUILD_LIST               0
#               6 STORE_FAST               2 (values)

#   8           8 LOAD_FAST                0 (d)
#              10 LOAD_METHOD              0 (items)
#              12 CALL_METHOD              0
#              14 GET_ITER
#         >>   16 FOR_ITER                14 (to 46)
#              18 UNPACK_SEQUENCE          2
#              20 STORE_FAST               3 (k)
#              22 STORE_FAST               4 (v)

#   9          24 LOAD_FAST                1 (keys)
#              26 LOAD_METHOD              1 (append)
#              28 LOAD_FAST                3 (k)
#              30 CALL_METHOD              1
#              32 POP_TOP

#  10          34 LOAD_FAST                2 (values)
#              36 LOAD_METHOD              1 (append)
#              38 LOAD_FAST                4 (v)
#              40 CALL_METHOD              1
#              42 POP_TOP
#              44 JUMP_ABSOLUTE            8 (to 16)

#  11     >>   46 LOAD_FAST                1 (keys)
#              48 LOAD_FAST                2 (values)
#              50 BUILD_TUPLE              2
#              52 RETURN_VALUE


def flatten_dict(d):
    return tuple(d.keys()), tuple(d.values())

# %timeit flatten_dict(a)
# 7.83 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

dis.dis(flatten_dict)  # dis.dis prints the disassembly itself and returns None
#  53           0 LOAD_GLOBAL              0 (tuple)
#               2 LOAD_FAST                0 (d)
#               4 LOAD_METHOD              1 (keys)
#               6 CALL_METHOD              0
#               8 CALL_FUNCTION            1
#              10 LOAD_GLOBAL              0 (tuple)
#              12 LOAD_FAST                0 (d)
#              14 LOAD_METHOD              2 (values)
#              16 CALL_METHOD              0
#              18 CALL_FUNCTION            1
#              20 BUILD_TUPLE              2
#              22 RETURN_VALUE

You have a nice and very polished package; congrats on putting it together. 👍

francois-rozet · 2024-01-04T21:47:10Z

francois-rozet
Jan 4, 2024
Maintainer Author

Hello @patrick-kidger and @ASEM000 👋 I hope you are doing well. Just to say that I finally had the time to finish the interface of Inox. I think it is quite nice now!

In the end, detecting and wrapping static leaves on __setattr__ wasn't that good of an idea. Flattening and unflattening was fast(er), but __getattribute__ (which happens quite often) was very slow, especially for complex attributes (list, dict, ...). In addition, it meant that __getattribute__ returned a copy of the attribute, which led to unexpected behaviors (e.g. module.layers.append(...) does not modify module).

I therefore went back to detecting static leaves during flattening, which is slightly slower (a factor 5x for flattening in my experiments) but remains negligible with respect to actual tensor operations. In addition, flattening/unflattening can be avoided (with jax.jit) by training flattened modules, which is now easier thanks to the Module.pure method, which returns (a) the treedef of the module and (b) the state of the module as one or several flat dictionaries of arrays. This pure method is also handy to transform stateful modules into pure modules, which are safer to use!
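
The general idea of training on a flattened module can be sketched with plain JAX utilities. This does not use `Module.pure` (whose exact signature is not shown in this thread); it flattens a hypothetical parameter dict once with `jax.tree_util.tree_flatten` and passes only the flat list of leaves through jax.jit.

```python
import jax
import jax.numpy as jnp

params = {'w': jnp.ones((3,)), 'b': jnp.zeros(())}

# Flatten once, outside of the training loop.
leaves, treedef = jax.tree_util.tree_flatten(params)

def loss(leaves, x, y):
    # Rebuilding the tree happens at trace time only.
    p = jax.tree_util.tree_unflatten(treedef, leaves)
    pred = jnp.dot(p['w'], x) + p['b']
    return (pred - y) ** 2

@jax.jit
def step(leaves, x, y):
    grads = jax.grad(loss)(leaves, x, y)
    return [l - 0.1 * g for l, g in zip(leaves, grads)]

leaves = step(leaves, jnp.ones((3,)), jnp.array(1.0))

# Rebuild the structured parameters only when they are needed again.
params = jax.tree_util.tree_unflatten(treedef, leaves)
```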

Note that I previously had an issue with jax.vmap, which would not support detecting static leaves during flattening. This was in fact trivial to fix by observing that vmap uses object() instances as placeholders.

Thank you for the very interesting discussions!

3 replies

francois-rozet Jan 5, 2024
Maintainer Author

Hello @patrick-kidger, regarding your comment jax-ml/jax#16170 (comment). I agree that if libraries assume that leaves can be replaced without changing the tree structure, dynamically detecting non-array leaves becomes an issue. However, I still think that the assumption itself should not be made.

The issue with detecting non-array leaves after __init__ is that it makes manipulating models (e.g. replacing/removing layers) quite annoying during and after initialization. Providing something like the .at method of @ASEM000's pytreeclass could be a viable solution though.

The last idea I have, which I am not sure I like (because it introduces a lot of complexity), would be to provide a freeze method that would save the current structure of the tree and whether each leaf is static or not. Later-on, flattening/unflattening would not use the type of the current leaves but the data saved during freeze. WDYT?

patrick-kidger Jan 5, 2024

However, I still think that the assumption itself should not be made.

Unfortunately that's just how JAX is designed! I don't think any of us get to change that now, and a lot of code has already been written assuming this invariant.

manipulating models (e.g. replacing/removing layers) quite annoying during and after initialization

Take a look at equinox.tree_at, which I think does this quite elegantly. (You could certainly also put a different syntax on this if you prefer.) I don't see how this is related to dynamic/static detection of leaves though.

I am not sure I like (because it introduces a lot of complexity)

Haha! Yes, I don't like it either. Better to be simple; better not to have the footgun of "you forgot to call freeze."

francois-rozet Jan 5, 2024
Maintainer Author

better not to have the footgun of "you forgot to call freeze."

Oh no! I meant as a replacement of the current behavior. So a tree would simply not be flattenable without calling freeze first (or freeze would be called automatically on the first flatten, but reset on __setattr__ and .at).

ASEM000 · 2024-01-06T04:54:23Z

ASEM000
Jan 6, 2024

Happy to hear from you again and happy holidays!

So, IMO, I came to two conclusions: 1) keep everything explicit, 2) the same class instance should have the same flatten rule.

For 1), the problem is that a) you are doing extra host work every time you invoke a for loop/tree_map + conditionals to decide which leaf is static/frozen (this scales with your leaf count). The performance hit becomes visible for small nets. b) If you hide the process from the user, then detecting what is trainable/not trainable may be a footgun, and recompilation behavior becomes unclear.

For 2), I think I want to follow standard pytree rules like list/dict, where any instance has the same rule. This simplifies the reasoning about them. So I prefer the flatten step ~ vars(tree) to match dict. This speeds up the overall performance compared to other impls.

My current approach is to use a leafless wrapper (Static in your impl), which you can then use in two different ways:

a) A descriptor-based approach instead of overriding __setattr__/__getattr__ to wrap/unwrap your desired leaf (similar to flax.struct.field(pytree_node=False)/equinox.static_field). This design has a couple of benefits: 1) it decouples any logic from the class flatten rule, yielding faster perf overall, 2) it can be used with any pytree class (not special-cased to flax.struct.PytreeNode or eqx.Module), 3) static/frozen becomes a special application of this approach, so you can have buffer_field for instance, 4) simplified internals, because you have a single way to handle static/freezing via wrapping.
b) Static + tree_map to wrap a leaf before some jax boundary (e.g. jit) based on some condition (e.g. is it a jax type) and unwrap it after. This has a couple of benefits I discussed here.

Correct me if I'm wrong, but your main motivation is to make static/non-static decisions hidden from the user at the module level (smart modules), so you don't have to recreate all jax transforms to understand this (like Patrick's work with filtered jax transforms). I think this is a good goal, but my feeling (might be incorrect ofc) is that users eventually have to understand this point in case they are hit with some error/weird behavior when interacting with other libraries or jax transforms. You can see a sample issue either in flax here / equinox here and here where there is some confusion about the static/non-static and jaxtype/non-jaxtype behavior. I think if we promise the user that we will handle everything on their behalf, it's frustrating when they hit some error/behavior that they don't understand and then they have to understand this behavior to debug it. So we either keep our promise and handle everything on their behalf (which is hard if we are dealing with low-level jax), or we do not make a promise and introduce the concept of static/non-static jaxtype/non-jaxtype to the user at the expense of a steeper learning curve.

I am interested to know about your statefulness handling. I will read more about your stateful modules, as this is (IMO) one of the tricky problems in pytree-based approaches. Let me know what you think.

3 replies

francois-rozet Jan 6, 2024
Maintainer Author

Happy new year!

Regarding 1) my take is that only arrays (NumPy and JAX arrays) should be considered leaves. Implicitly casting float or int to arrays is very rarely what users want, and if they really want a scalar array, they can always write jnp.array(3.1415). Also, if a str or a boolean flag is modified then it should recompile as it is possibly not the same function anymore! Inox does not implicitly choose which part of the tree is static, but explicitly imposes that only (and all) arrays are not static. The difference with Equinox is that I want this to be always true, not only with custom transformations.

Regarding 2) in my tests, there is a 5-10x factor between flattening inox.tree_util.Namespace and inox.tree_util.Auto (namespace with auto static leaf detection), but it remains in the same order of magnitude as other libraries for flattening+unflattening (while doing the filtering work, which remains to be done for other libraries).

I tested the descriptor approach, but did not like it because it does not work for arbitrary PyTrees. You have to assign the parameters, buffers, hyperparams, ... directly to the module and handling lists/dicts of mixed parameters/constants/layers/functions is impossible (or you fall in the same issues as detecting static leaves on __setattr__). Finally, it introduces many concepts, classes and functions (which is bad for a user, see static_field of Equinox) and is annoying to write (verbose).

I like the explicit tree_mask of Serket, but I think it is quite annoying for users. You have to remember to mask/unmask before/after every transformation, which will be an issue with other JAX libraries (same as filter_* transformations of Equinox actually).

I think if we promise the user that we will handle everything on their behalf, its frustrating when they hit some error/behavior that they dont understand and then they have to understand this behavior to debug it.

There will always be bugs, regardless of the abstraction. My take is that users don't need to understand the implementation as long as they understand the intent/rules. There is only one rule with Inox modules: arrays are leaves and non-arrays are not. IMO, a library should not be intentionally less convenient to use/harder to understand in order to educate users to its inner workings.

ASEM000 Jan 6, 2024

my take is that only arrays (NumPy and JAX arrays) should be considered leaves

you will deviate from jax notion of allowed jaxtypes.

if a str or a boolean flag is modified then it should recompile

In all cases it will recompile. The issue mentioned that explicitly defining str as static is probably better.

I tested the descriptor approach, but did not like it because it does not work for arbitrary PyTrees. You have to assign the parameters, buffers, hyperparams, ... directly to the module and handling lists/dicts of mixed parameters/constants/layers/functions is impossible (or you fall in the same issues as detecting static leaves on setattr).

There is some usage for it, like buffer field for batchnorm stats array. It works on python level so can be reasoned about as another descriptor.

before/after every transformation, which will be an issue with other JAX libraries

I am afraid you misunderstood it, please take a close look at an example in the docs like training PINN to see how it works with optax/jaxopt seamlessly with no code change and without having to do the before/after masking for each transformation. Compared to your approach, this approach is faster, incurs a minimal number of copies, is more explicit, and works with any pytree.

a library should not be intentionally less convenient to use/harder to understand in order to educate users to its inner workings.

Note that I was talking about overall jax behavior, not my library. I meant jax behavior such as why cases like jax.jit(lambda x,y:x)(1, "a") raise an error.
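
For reference, the behavior in question can be reproduced as follows. The sketch relies on jax.jit rejecting a str argument with a TypeError (a str is not a valid JAX type) and on the standard `static_argnums` parameter to mark the argument as static explicitly.

```python
import jax

f = lambda x, y: x

# A str is not a valid JAX type, so it cannot be traced as a dynamic argument.
try:
    jax.jit(f)(1, "a")
    raised = False
except TypeError:
    raised = True

# Declaring the second argument as static makes the same call valid.
result = jax.jit(f, static_argnums=1)(1, "a")
```

Changing the value of the static argument (e.g. "a" to "b") triggers a recompilation, which is the behavior users eventually have to be aware of.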

francois-rozet Jan 6, 2024
Maintainer Author

you will deviate from jax notion of allowed jaxtypes

I think that is fine. I would even argue that it reduces the likelihood of mistakes (e.g. forgetting that int/float could be "trainable").

The issue mentioned that explicitly defining str as static is probably better.

Well that was my point: Inox explicitly imposes that all str are static.

There is some usage for it, like buffer field for batchnorm stats array.

Using jax.lax.stop_gradient for buffers makes sense, but this might become an issue if one wants to train the running statistics later on (which happens!).

I am afraid you misunderstood it [...] without having to do the before/after masking for each transformation.

I meant before/after transformations of functions that take the module as input. So for instance, if as part of the network you need to jax.vmap a layer, or jax.lax.scan a module (the module being the carry), or you have an internal layer which you want to jax.jit, then you need to mask/unmask. What is great about sk.tree_mask is that, as you said, it works for any PyTree and you don't need to overwrite native transformations like Equinox does. However you still have to mask/unmask (filter) for every transformation that traces the module (minus those where you don't actually apply the module, and can keep it masked).

francois-rozet · 2024-01-08T15:59:08Z

francois-rozet
Jan 8, 2024
Maintainer Author

Hello @patrick-kidger and @ASEM000, sorry to bother you again. I tried my last idea (see #1 (reply in thread)) and it was terrible 🤣 In the end, I think you guys are right: automatic detection of static leaves during flattening makes Inox modules (a) error prone around JAX transformations and (b) incompatible with part of the JAX ecosystem.

Therefore, I decided to make my modules dumb PyTrees (namespaces) and adopted the lifted transformation approach of Equinox. Roughly, for a function f and a transformation jax.transform,

y = inox.transform(f)(x)

is equivalent to

g = lambda x: tree_mask(f(tree_unmask(x)))
y = tree_unmask(jax.transform(g)(tree_mask(x)))

The lifting is not tailored to each transformation, which means that Inox transformations might be slightly slower than Equinox transformations. However, the simplicity and generality make it possible to keep exactly the same interface as the base transformation. For better performance, users should use tree_mask directly, like in Serket.
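The equivalence above can be sketched in a self-contained way as follows. This is not the Inox or Serket code: the `Static` wrapper, the `is_array` test and the `lift` helper are hypothetical stand-ins, and the masking rule (anything without shape and dtype is static) is an assumption made for the example.

```python
import jax
import jax.numpy as jnp
from jax import tree_util

class Static:
    """Hypothetical wrapper that hides a non-array leaf in the tree structure."""
    def __init__(self, value):
        self.value = value

# No children: the wrapped value is stored as auxiliary data,
# so it must be hashable and comparable (e.g. str, int, tuple).
tree_util.register_pytree_node(
    Static,
    lambda s: ((), s.value),
    lambda value, children: Static(value),
)

def is_array(x):
    return hasattr(x, "shape") and hasattr(x, "dtype")

def tree_mask(tree):
    return tree_util.tree_map(lambda x: x if is_array(x) else Static(x), tree)

def tree_unmask(tree):
    is_static = lambda x: isinstance(x, Static)
    return tree_util.tree_map(
        lambda x: x.value if is_static(x) else x, tree, is_leaf=is_static
    )

def lift(transform):
    """Wrap a JAX transformation so that it accepts non-array leaves."""
    def lifted(f, *args, **kwargs):
        def g(*xs):
            return tree_mask(f(*tree_unmask(xs)))
        transformed = transform(g, *args, **kwargs)
        def wrapper(*xs):
            return tree_unmask(transformed(*tree_mask(xs)))
        return wrapper
    return lifted

module = {"weight": jnp.ones(3), "name": "linear"}

def apply(module, x):
    return {"y": module["weight"] * x, "name": module["name"]}

out = lift(jax.jit)(apply)(module, jnp.arange(3.0))
```

Note that every call to `lift(jax.jit)(apply)` builds a new inner function `g`, so the JAX cache is only reused when the returned wrapper itself is reused.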

These changes effectively make Inox a mini-version of Equinox, which I mention in the README along with Serket for the inspiration. Thanks again for the discussions!

P.S. Equinox and Inox are now pretty much entirely compatible, at least for evaluation. Inox still uses a Parameter wrapper to indicate parameter arrays, which makes it possible to split arrays into different partitions.

8 replies

francois-rozet Jan 8, 2024
Maintainer Author

I think it's the same idea! Really nice! The only difference is that I cache the result of the inner wrapper to hit the cache of the JAX transformation if a function were to be transformed twice.

I also wanted to do the "inline" one (which I called executor instead of transformation), but it was harder to write a single wrapper as args/kwargs could be arguments of the executor or the function.

francois-rozet Jan 8, 2024
Maintainer Author

I just skimmed through Serket/Sepes and oh-my! The amount of documentation and tutorials is crazy for a lone dev 🤯 It's truly a goldmine, but I must say the number of features is a little overwhelming. I hope your work finds its audience!

ASEM000 Jan 11, 2024

Also take a look here, to compare with your sharing ref mechanism.

francois-rozet Jan 12, 2024
Maintainer Author

@ASEM000 Hmmmm, I am not sure I like it as it adds a lot of boilerplate code. Also, doesn't it mean that the weights that are not used are still present in the tree? So if something else uses the weights without running _tied_call, it would not be the right weights. Another way to do something similar without the value_and_tree is to override tree_unflatten like

@classmethod
def tree_unflatten(cls, static, leaves):
    self = super().tree_unflatten(static, leaves)
    self.dec1.weight = self.enc1.weight.T  # or with self.at if frozen
    self.dec2.weight = self.enc2.weight.T
    return self

which ensures that the weights are always tied.
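A runnable toy version of this idea, using only jax.tree_util (the class `TiedAutoencoder` and its attribute names are hypothetical, and unlike the snippet above the decoder weight is not stored as a leaf at all):

```python
import jax
import jax.numpy as jnp
from jax import tree_util

@tree_util.register_pytree_node_class
class TiedAutoencoder:
    def __init__(self, enc_weight):
        self.enc_weight = enc_weight
        self.dec_weight = enc_weight.T

    def tree_flatten(self):
        # Only the encoder weight is a leaf; the decoder weight is derived.
        return (self.enc_weight,), None

    @classmethod
    def tree_unflatten(cls, aux, children):
        self = object.__new__(cls)
        (self.enc_weight,) = children
        # Re-tie the weights every time the module is rebuilt.
        self.dec_weight = self.enc_weight.T
        return self

    def __call__(self, x):
        return self.dec_weight @ (self.enc_weight @ x)

model = TiedAutoencoder(jnp.arange(6.0).reshape(2, 3))
updated = tree_util.tree_map(lambda w: 2 * w, model)
```

One caveat of this pattern: JAX sometimes unflattens trees with placeholder leaves that are not arrays, in which case the `.T` access in tree_unflatten would need a guard.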

What do you think about inox.nn.share? It is easy to use and very explicit about what is shared/not-shared. Surprisingly it even works with reference cycles!

ASEM000 Jan 19, 2024

Also, doesn't it mean that the weights that are not used are still present in the tree?

You can set it to None at initialization. I did not do so, to show that sharing happens within methods.

So if something else uses the weights without running _tied_call, it would not be the right weights.

Running _tied_call directly will raise an error because it modifies the tree in place; only tied_call is allowed to work. So, if you do not use tied_call, no weight sharing is observed. As you know, the reason I repeatedly set the ref during the method call is the referential transparency principle.

What do you think about inox.nn.share

I experimented with something similar to this approach (ref marker) a while back, but I settled on the current approach, mainly because it handles stateful updates (unlike your impl, it's not allowed in my impl) and ref sharing with the same impl. I do not have a strong feeling about either, but in general you can think of your approach as an early step (you have to handle the state at init time) vs. my impl, which is a delayed step (set the ref within the method).

@ASEM000 Hmmmm, I am not sure I like it as it adds a lot of boilerplate code.

I have an easier time using it, specifically when using function transformations on methods that have stateful updates; the canonical example for this use case is this. In fact, writing this example the way it's written (similar to PyTorch with stateful updates) was the main motivation to start writing this package.
