[scan] Avoid re-tracing the combine function on every call

## 🚀 Feature

It should be possible to cache the traced graphs in `torch_xla.experimental.scan` so that we don't re-trace on every call.

## Motivation

Today `torch_xla.experimental.scan` and `scan_layers` trace the user function with both AOTAutograd (to get the backward) and LazyTensor (to lower it to HLO). AOTAutograd is very slow, so we can easily become tracing-bound. For example, `python3 examples/train_decoder_only_base.py` takes 1min30s, but `python3 examples/train_decoder_only_base.py scan.decoder_with_scan.DecoderWithScan` takes 4min.

## Pitch

We could wait for `torch.scan` to support autograd (cf. https://github.com/pytorch/xla/pull/7901#issuecomment-2546903424), but that will take a long time. In the meantime, we can implement some simple caching keyed on the `id` of the input function/module.

The caching should be opt-in because it's only sound if the function is pure. We can add an `assume_pure=True` argument to `scan` so that it only uses the cache when the user confirms that their function is pure.

[scan] Avoid re-tracing the combine function on every call #8632
