
Conversation

@tengyifei
Collaborator

@tengyifei tengyifei commented Aug 23, 2024

Add the lowering of scan to the HLO While op.

Introduce scan_layers, which can sequentially apply a stack of layers using scan underneath.

Expand the unit tests to cover linear layers and decoders.
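
As a rough usage sketch of the two entry points (the module paths and exact signatures below are assumptions and may differ from the merged code):

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.experimental.scan import scan
from torch_xla.experimental.scan_layers import scan_layers

device = xm.xla_device()

# scan threads a carry through a step function over the leading dim of xs.
def step(carry, x):
  new_carry = carry + x
  y = new_carry * 2
  return new_carry, y

init = torch.zeros(4, device=device)
xs = torch.randn(10, 4, device=device)
final_carry, ys = scan(step, init, xs)

# scan_layers applies a stack of structurally identical layers sequentially,
# lowered to a single HLO While loop instead of an unrolled graph.
layers = [nn.Linear(4, 4).to(device) for _ in range(8)]
out = scan_layers(layers, torch.randn(2, 4, device=device))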

@JackCaoG
Collaborator

======================================================================
ERROR: test_decoder_model (__main__.ApplyLayersTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/xla/xla/pytorch/xla/test/test_apply_layers.py", line 77, in test_decoder_model
    from decoder_only_model import DecoderOnlyConfig, DecoderOnlyModel  # type:ignore
ModuleNotFoundError: No module named 'decoder_only_model'

You can't just import it; you need to set up the import directory correctly. Take a look at https://github.com/pytorch/xla/blob/master/test/dynamo/test_dynamo_dynamic_shape.py#L1-L6
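
A hedged sketch of the import-path setup pattern in the linked file (the exact lines there may differ): add the directory that holds the helper module to sys.path before importing it.

import os
import sys

# Make the directory that holds decoder_only_model.py importable. The path here
# is illustrative; adjust the number of dirname() hops to the test layout.
example_folder = os.path.dirname(os.path.abspath(__file__))
sys.path.append(example_folder)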

@tengyifei
Collaborator Author

@JackCaoG Thanks, I followed your example and got it working.

at::Tensor input = MakeTensorFromXlaLiteral(literal, dtype);
results[param_ids[i]] = input;
std::optional param_id = lowering_ctx.GetParameterId(device_data[i]);
XLA_CHECK(param_id.has_value());
Collaborator

When would it not have a value?

Collaborator Author

When GetParameterId receives a BackendData that is not a parameter in this lowering context, it returns std::nullopt. However, this loop only iterates over parameters (line 1071, const std::vector<torch::lazy::BackendDataPtr>& device_data = lowering_ctx.GetParametersData();), so we expect every BackendData there to have an ID. It seems good to enforce this invariant.

example_layer = deepcopy(next(iter(layers)))

# Hollow out the weights and biases in the example layer.
example_layer = example_layer.to_empty(device=None)
Collaborator

is this not going to impact the cloned arg?

Collaborator Author

Could you clarify this question -- I thought to_empty is going to destroy the values inside example_layer, so I deepcopy it beforehand as a backup.
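
A minimal sketch (layer shapes are illustrative) of why the deepcopy comes first: to_empty() swaps the parameters for uninitialized storage, so the copy is hollowed out while the original layer keeps its weights.

import copy
import torch
import torch.nn as nn

layers = [nn.Linear(4, 4) for _ in range(3)]
original_weight = layers[0].weight.detach().clone()

example_layer = copy.deepcopy(next(iter(layers)))
example_layer = example_layer.to_empty(device=None)

# Hollowing out the copy leaves the original layer's weights untouched.
assert torch.equal(layers[0].weight, original_weight)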


def step_fn(grad_carry, pytree: Tuple[torch.Tensor, torch.Tensor,
                                      torch.Tensor]):
  grad_y, carry, x = pytree
Collaborator

is this a typo?

Collaborator Author

I don't think so -- pytree is a tuple of the output gradient at the current step (grad_y), the carry at the current step (carry), and the input at the current step (x).
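
For illustration, a self-contained sketch (not the PR's implementation, and assuming the step returns the updated gradient carry plus that step's input gradient) of how a backward step with this signature is typically driven: walk the saved per-step values in reverse, threading the gradient carry through.

def reverse_scan(step_fn, init_grad_carry, grad_ys, carries, xs):
  grad_carry = init_grad_carry
  grad_xs = []
  for grad_y, carry, x in zip(reversed(grad_ys), reversed(carries), reversed(xs)):
    grad_carry, grad_x = step_fn(grad_carry, (grad_y, carry, x))
    grad_xs.append(grad_x)
  grad_xs.reverse()
  return grad_carry, grad_xs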

@miladm
Collaborator

miladm commented Sep 20, 2024

@tengyifei is this PR a 2.5 candidate?

@tengyifei
Collaborator Author

@miladm yes, I'd like to backport this to 2.5 after addressing the comments etc.

with torch.enable_grad():
  fw_compiler, get_fwd = _make_get_graph_compiler()
  bw_compiler, get_bwd = _make_get_graph_compiler()
  fn_compiled = aot_function(
Collaborator

Will this fail if there are tensors within fn that were not provided as parameters, since they are not 'fake' tensors? Say, for model parameters if we have a module inside fn that we wish to also trace fwd/bwd on. Is this targeted as a follow-up?

Collaborator Author

Yes, it will. That's why I added scan_layers (previously named apply_layers) in this PR to extract module parameters and functionalize the module.
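
A minimal sketch, assuming torch.func.functional_call, of that idea: pull each layer's parameters out and call the layer functionally, so the scanned fn only sees explicit tensor arguments that AOTAutograd can substitute with fake tensors.

import torch
import torch.nn as nn
from torch.func import functional_call

layer = nn.Linear(8, 8)
params = dict(layer.named_parameters())

def fn(x, params):
  # The layer's weights arrive as explicit arguments rather than being read
  # from the module's state.
  return functional_call(layer, params, (x,))

y = fn(torch.randn(2, 8), params)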

@tengyifei tengyifei force-pushed the yifeit/scan branch 5 times, most recently from 7f457b9 to 13b40f5 Compare November 21, 2024 01:22
@tengyifei tengyifei force-pushed the yifeit/scan branch 2 times, most recently from 41acb50 to 4020bb5 Compare November 21, 2024 01:33
@tengyifei tengyifei requested a review from JackCaoG November 21, 2024 06:23
@tengyifei tengyifei changed the title scan and apply_layers scan and scan_layers Nov 21, 2024
@tengyifei tengyifei force-pushed the yifeit/scan branch 3 times, most recently from edf8c12 to 98222b2 Compare November 22, 2024 18:39
@tengyifei
Collaborator Author

This is ready for another look. For ease of review I added two additional commits for stuff we talked about offline:

  1. Test that the XLA compiler can propagate SPMD sharding annotations just fine through the While op and the Body computations.
  2. Change the name of the PyLoweringContext to FnComputation when it appears in the HLO.

@JackCaoG
Collaborator

This is ready for another look. For ease of review I added two additional commits for stuff we talked about offline:

  1. Test that the XLA compiler can propagate SPMD sharding annotations just fine through the While op and the Body computations.
  2. Change the name of the PyLoweringContext to FnComputation when it appears in the HLO.

Please move the SPMD stuff to a different PR; I will try to finish reviewing this PR today (GitHub needs a better way to stack PRs...).

@tengyifei tengyifei force-pushed the yifeit/scan branch 2 times, most recently from 60ef219 to 3006c8e Compare November 22, 2024 19:06
@tengyifei
Collaborator Author

tengyifei commented Nov 22, 2024

Please move the SPMD stuff to a different PR

Sure, done.

This commit adds the lowering of scan to the HLO While op. It also
introduces apply_layers, which can sequentially apply a stack of layers
using scan underneath.

In this milestone we use AOTAutograd to obtain the backward of the
function being scanned. Users can either save the activations in
fn or recompute them by passing different graph partitioners to
AOTAutograd.

Also give the lowered fn computation a more meaningful name.
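
For context, a hedged sketch of the AOTAutograd mechanism described above: default_partition saves the forward activations, while a rematerializing partitioner such as min_cut_rematerialization_partition can recompute them in the backward instead. The identity "compilers" simply return the traced FX graphs.

import torch
from functorch.compile import aot_function, default_partition

def fn(carry, x):
  return torch.sin(carry) + x

def identity_compiler(gm, example_inputs):
  return gm  # hand back the torch.fx.GraphModule unchanged

fn_compiled = aot_function(
    fn,
    fw_compiler=identity_compiler,
    bw_compiler=identity_compiler,
    partition_fn=default_partition,
)
out = fn_compiled(torch.randn(3, requires_grad=True), torch.randn(3))
out.sum().backward()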
@tengyifei
Collaborator Author

Thanks for the review. I'll merge this first. I'm looking at caching the fn computation properly, but that will take some time. I might be able to send that PR over (and the SPMD one) by next Monday.

@tengyifei tengyifei merged commit 51575db into master Nov 23, 2024
12 checks passed
@ydwu4
Contributor

ydwu4 commented Dec 16, 2024

Hi! PyTorch also has a scan operator https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/scan.py. Wondering if we want to consolidate the efforts and what's the plan going forward?

@tengyifei
Collaborator Author

@ydwu4 does the PyTorch scan operator support autograd? I remember there's an issue tracking autograd support. It would be great to use the upstream op, but without autograd support we can't use it for training.

@ydwu4
Contributor

ydwu4 commented Dec 16, 2024

Sounds good. On the PyTorch side, we've been prioritizing getting inference working e2e and will get to autograd next half. It would be great if we could reduce fragmentation with a single front-end op.

@tengyifei
Collaborator Author

That's a great idea.

If you look at the current scan impl in PyTorch/XLA, it uses AOTAutograd to derive a backward graph to implement the backward pass of scan. That API has a lot of limitations, and IIUC Dynamo is the well-supported frontend.
