Support replica groups for distributed batchnorm #42
Conversation
The changes look good to me.
@@ -65,6 +66,8 @@ def apply(self,
    scale_init: initializer for scale, by default, one.
    axis_name: the axis name used to combine batch statistics from multiple
      devices. See `jax.pmap` for a description of axis names (default: None).
+   replica_groups: the custom replica groups used to combine batch statistics
Maybe this is known to everyone but me, but can we document better what "replica groups" are?
I think it should refer to lax.psum/pmean for more details
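For reference, `lax.pmean` with an `axis_name` averages across every device participating in the `pmap`; the point of this PR is to optionally restrict that reduction to subgroups of devices. A minimal sketch of the existing, ungrouped behavior (function and variable names are illustrative only):

```python
from functools import partial
import jax
import jax.numpy as jnp
from jax import lax

# Default behavior: pmean reduces over *all* devices in the pmap.
@partial(jax.pmap, axis_name='batch')
def global_batch_mean(x):
    return lax.pmean(x.mean(axis=0), axis_name='batch')

n_dev = jax.local_device_count()
x = jnp.arange(n_dev * 4.0).reshape(n_dev, 4)  # [devices, per-device batch]
print(global_batch_mean(x))  # every device sees the same global mean
```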
I looked up the documentation and source code of `lax.psum`, but I can't find the term "replica groups" there.
Apologies if I'm just "new here", but given that JAX lacks internal docstrings, I think we need to own explaining this in the Flax API, unless we can point people to other JAX references (which would be better!).
The `replica_groups` kwarg was added in jax-ml/jax#2382, though we might end up using a different name. (The name and concept are definitely not known to everyone, but the idea is something many people want to express: doing batch normalization over more than just the examples on each accelerator alone, but less than the entire global batch.) TF code often uses a `distributed_group_size` keyword argument and then converts that into XLA replica groups later.
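A sketch of that conversion, assuming devices are grouped contiguously and the device count divides evenly by the group size (the helper name is hypothetical, not an existing API):

```python
def group_size_to_replica_groups(num_devices, group_size):
    """Turn a flat group size into explicit replica groups, e.g.
    (num_devices=8, group_size=2) -> [[0, 1], [2, 3], [4, 5], [6, 7]]."""
    assert num_devices % group_size == 0, "group_size must divide num_devices"
    return [list(range(start, start + group_size))
            for start in range(0, num_devices, group_size)]
```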
Got it. Perhaps a link to that pull request in the docstring is a simple solution?
Actually I have one more question about this: What is the benefit of using …
Yes; the difference is only that distributed BN is a library feature that ideally wouldn't affect the top-level training loop. But I'll try seeing if I can do what I need with nested pmap; it looks like that might be easier than wiring through replica group support in a way we're fully happy with.
Hi @jekbradbury -- what's the latest on this? Have you been able to use nested `pmap`?
Looks like this PR may be stale. I'll close it for now, but feel free to re-open with additional context as appropriate.
Requires jax-ml/jax#2382. Needed for large-scale (i.e., small per-device batch size) training of ResNets, where the ideal number of examples to normalize over seems to be about 128 (normalizing over the whole pmap is both unnecessarily slow and gives worse results).
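For illustration, here is a rough sketch of grouped batch-statistic reduction using the `axis_index_groups` keyword that `jax.lax.pmean` exposes today (the thread above discusses it under the name `replica_groups`; the group size, shapes, and epsilon below are made up, and this is not the PR's actual implementation):

```python
from functools import partial
import jax
import jax.numpy as jnp
from jax import lax

n_dev = jax.local_device_count()
group_size = 2  # illustrative; the PR targets ~128 examples per normalization group
assert n_dev % group_size == 0, "requires a device count divisible by group_size"
groups = [list(range(i, i + group_size)) for i in range(0, n_dev, group_size)]

@partial(jax.pmap, axis_name='batch')
def grouped_batch_norm(x):
    # Statistics are shared only among devices within the same group,
    # not across the entire pmap.
    mean = lax.pmean(x.mean(axis=0), 'batch', axis_index_groups=groups)
    var = lax.pmean(((x - mean) ** 2).mean(axis=0), 'batch', axis_index_groups=groups)
    return (x - mean) * lax.rsqrt(var + 1e-5)

x = jnp.ones((n_dev, 8, 16))  # [devices, per-device batch, features]
y = grouped_batch_norm(x)
```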