XLM-R model OOM (PyTorch XLA limitations vs TF) #1870

Closed
tmabraham opened this issue Apr 3, 2020 · 146 comments

@tmabraham

I am trying to train an XLM-R model in Kaggle Kernels with TPU enabled. There was a TF kernel that was able to do this successfully:
https://www.kaggle.com/xhlulu/jigsaw-tpu-xlm-roberta

However, attempts to train a similar model with PyTorch XLA have not been successful due to OOM errors. I tried to keep the code as similar as possible and made sure all non-XLA variables (dataset, model, etc.) were defined globally so they weren't replicated 8 times. I am actually using a smaller version of the model (base vs. large) and much lower batch sizes. I even tried the multi-threading interface (which is apparently now deprecated), as I read that multi-threading uses less memory. In all cases I get OOM errors. In most cases, it loads the model and runs the forward pass, but fails when calculating the loss. In some cases, it fails at loss.backward().

I have two questions related to this:

  1. Is there a way to get PyTorch XLA to work with this model in Kaggle TPU kernels, as was possible with TF?
  2. What are the limitations of PyTorch XLA usage on TPU compared to TF usage on TPU? Are there certain models that cannot be used?

@jysohn23 (Collaborator) commented Apr 3, 2020

There is a fundamental difference between the PyTorch/XLA and TF/TPU paradigms. PT/TPU builds all the graphs, initializes the weights, runs the input pipelines, etc. on the host VM and then feeds the TPUs, whereas TF/TPU builds the TF graph, converts it into an XLA graph, and hands it over to the TPU to do all the heavy lifting.

Also, based on the Kaggle Kernel you posted, I assume the SIGKILL was issued due to host RAM OOM (not memory on the TPU cores), though we'd need to check the kernel logs to know for sure. @ifigotin may be working on bumping that limit, but I'll let him chime in on the status.

@tmabraham (Author) commented Apr 3, 2020

@jysohn23 Thanks for your reply. So does this difference in paradigm lead to some models not working, either due to OOM problems or other problems? If so, why?

I would also note that Kaggle did give me a separate message saying:

Your notebook tried to allocate more memory than is available.

So you are probably right; I realize now it is probably VM RAM OOM. How can I reduce the memory usage in this notebook?

@dlibenzi (Collaborator) commented Apr 3, 2020

Can you try this?

!free -h

@dlibenzi (Collaborator) commented Apr 3, 2020

And this:

!cat /proc/cpuinfo | grep processor | wc -l

@jysohn23 (Collaborator) commented Apr 3, 2020

@tmabraham Yeah, PT/TPU sometimes uses more RAM on the GCE VM, whereas TF/TPU uses memory on the TPU side. But as long as you can get a GCE VM with more RAM you should be fine.

@tmabraham (Author)

@dlibenzi There is a total of 18 GB of memory:

              total        used        free      shared  buff/cache   available
Mem:            18G        866M         12G        1.0M        5.5G         17G
Swap:            0B          0B          0B

The output of the second command is 4

@tmabraham (Author)

@jysohn23 Are there any steps I can take to reduce VM RAM usage in my notebook?

@dlibenzi (Collaborator) commented Apr 3, 2020

How big are the x_train and x_valid tensors?

@tmabraham (Author)

@dlibenzi x_train is (435775,128) and y_train is (435775,1). Note the TF TPU kernel uses (435775,192).

@dlibenzi (Collaborator) commented Apr 3, 2020

That gets encoded AFAICT. Can you print the final x_train shape?

@tmabraham (Author)

@dlibenzi I didn't understand. This is the tokenized/encoded shape. There are 435775 sentences and when encoded the representation has max_len equal to 128. This is the shape of the data that I use when creating my TensorDataset.

@tmabraham (Author)

Just wanted to follow up on this... Is there any way this can be fixed? Or is this a limitation of PyTorch XLA?

@tmabraham (Author)

Also, I will note that I get similar OOM problems when using bert-base-cased, which is supposed to be a model of the same size. I haven't investigated this thoroughly, so I don't know if it's the same issue or not.

@tmabraham (Author)

@jysohn23 @dlibenzi Sorry to keep bothering you. Just wanted to see if this is a known issue or if there is something I am supposed to do in my environment to prevent this issue.

@jysohn23 (Collaborator) commented Apr 7, 2020

Hi @tmabraham, sorry for the late reply. As far as I can tell, it's a limitation of the VM that Kaggle provides. Can you try running the exact same thing in a Colab notebook? Make a copy of this Colab notebook sample we provide: https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb and paste in your content.

@tmabraham (Author)

@jysohn23 Yes I was able to get something that seems to work in Colab.

> @tmabraham Yeah, PT/TPU sometimes uses more RAM on the GCE VM, whereas TF/TPU uses memory on the TPU side. But as long as you can get a GCE VM with more RAM you should be fine.

Are there steps we can take to reduce PT/TPU RAM usage, or is this an inherent limitation of PyTorch XLA?

@dlibenzi (Collaborator) commented Apr 7, 2020

So I ran your Kaggle notebook, and after tokenization, there are about 3GB left.


There are 11GB buffer cache, but the dataset seems pretty big.

@jysohn23 (Collaborator) commented Apr 7, 2020

To be clear, it's not a limitation of pytorch/xla, but rather an imbalance in the resources that are given out for free. On Kaggle they grant a couple of CPU cores and a few GB of RAM to feed 8 TPU cores. You'd have the exact same problem if you were given 4 free V100s on Kaggle kernels with only a couple of CPU cores and a few GB of RAM. You can try creating the model before forking the processes, as long as it is treated as read-only, to reduce the memory footprint caused by the model weights.

@tmabraham (Author) commented Apr 7, 2020

@jysohn23 I still think this could be a limitation of PyTorch XLA because there is a TF kernel that works in Kaggle Kernels. Maybe there are some optimizations in TF that are not possible in PyTorch XLA?

I am creating the model before forking the processes.

@jysohn23 (Collaborator) commented Apr 7, 2020

PT/TPU vs TF/TPU is not an apples-to-apples comparison as they have different paradigms: #1870 (comment)

@tmabraham (Author) commented Apr 7, 2020

@jysohn23 I understand that, but I guess I was hoping there was still something that could be done to reduce the higher RAM usage of PyTorch XLA compared to TF on TPU. I guess the answer is no.

I will try to ask Kaggle if they can potentially increase the RAM for the VMs. If not, I will train in Colab. Thanks for the clarification!

@jysohn23 (Collaborator) commented Apr 8, 2020

I ran your Kaggle kernel and tried adding things like del df_train, df_valid, which freed up about 500 MB, but it still looks like it runs out of memory right as we do model = mx.to(xm.xla_device()) in the xmp context, since at that point we create 8 copies of the model when sending it to the device 😞.
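
For what it's worth, a minimal sketch of that kind of host-RAM trimming (df_train and df_valid are the DataFrame names used in the kernel; treat the exact names as placeholders):

import gc

# Drop the raw DataFrames once the tokenized arrays have been built ...
del df_train, df_valid
# ... and trigger a garbage collection so the objects are actually freed.
gc.collect()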

@dlibenzi (Collaborator)

We have made a change (which should be on nightly) that lowers the host memory utilization.
@tmabraham mind giving it a try on nightly?

@dlibenzi (Collaborator) commented Apr 10, 2020

I tried myself with nightly and it trains:

https://www.kaggle.com/davidelibenzi/simple-xlmr-tpu-pytorch

The trick is adding --version nightly to the env-setup script.
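
For reference, the setup cell then looks like this (the same commands are quoted later in this thread; the extra --apt-packages arguments used there are optional):

!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version nightly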


@Jerry2001Qu

In that kernel: https://www.kaggle.com/davidelibenzi/simple-xlmr-tpu-pytorch

nprocs is set to 1 instead of 8.

I'm running into essentially the same issue in my own Kaggle kernel. Bert Base loads fine without OOM issues; however, XLM-RoBERTa-Base does not. I don't even dare try XLM-RoBERTa-Large, which TensorFlow is able to load, as can be seen in this kernel: https://www.kaggle.com/xhlulu/jigsaw-tpu-xlm-roberta

@dlibenzi (Collaborator) commented Apr 10, 2020

Have tried with 8 as well. With nightly, it trains.
Rerunning now ...

@dlibenzi (Collaborator) commented Apr 10, 2020

So, we normally recommend this kind of structure:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(...):
  model = Net().to(xm.xla_device())
  ...

xmp.spawn(_mp_fn, ..., start_method='fork')

But if the VM is RAM-starved and the model has a hefty parameter count, the recommended setup is more like:

# Create once at global scope to share pages with child processes.
model = Net()

def _mp_fn(...):
  model.to(xm.xla_device())
  ...

xmp.spawn(_mp_fn, ..., start_method='fork')

That, together with fork(2), makes sure there will be only one copy of the model's parameters in PyTorch CPU host memory, shared copy-on-write with the child processes.

@Jerry2001Qu

Oh that's awesome. @dlibenzi can you commit the updated Kaggle Kernel? I don't see --version nightly on the notebook when I click your link.

@dlibenzi (Collaborator)

Since the input data is pretty tiny here, and given that this Colab is running tight on memory, I suggest not using the data loaders.
I honestly suggest you take my notebook as a baseline.
There are two major things there:

  1. Loading the model onto the device in a serialized fashion (see the sketch after this list)
  2. Creating a file-indexed dataset to minimize the initial footprint
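
As a rough illustration of point 1, serialized loading can be approximated by letting each process send its model to the device in turn (a sketch only, built from public xm helpers; the notebook itself may do this differently):

import torch_xla.core.xla_model as xm

def to_device_serialized(model):
    device = xm.xla_device()
    for rank in range(xm.xrt_world_size()):
        if xm.get_ordinal() == rank:
            # Only one process copies its weights to the TPU core at a time,
            model = model.to(device)
        # while the others wait for their turn at the barrier.
        xm.rendezvous('serialized_to_device_{}'.format(rank))
    return model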

@AdityaSoni19031997 commented Apr 21, 2020

Inspired by @dlibenzi's streaming code idea: I am pretty sure it will seem complicated to many people who haven't worked directly with binary streaming (at least for me it was), so I have added inline comments wherever necessary and put it up as an inline_commented_gist.

I hope it helps everyone who comes across this issue in the future!

My question is this, @dlibenzi:
We can stream data directly using numpy's memmap as well, or using lmdb. Is there any benefit of the BytesIO streaming that those approaches lack?

  • One obvious limitation is that with memmap you cannot store tensors of different shapes; anything else in your experience?

If we don't do bucketing while caching, then we can wrap a memmap directly in a torch.utils.data.Dataset as shown below (and it should be as optimal as the BytesIO version, right?):

import numpy as np
import torch


class MyStreamingDataset(torch.utils.data.Dataset):
    # np.array(x_train).shape -> (435712, 3)

    def __init__(self, data_memmap, target_memmap, shape=()):
        self.data = np.memmap(data_memmap, shape=shape, mode="r", dtype="int32")
        self.target = np.memmap(target_memmap, shape=(shape[1],), mode="r", dtype="int32")
        self.shape = shape

    def __len__(self):
        return self.shape[1]

    def __getitem__(self, idx):
        # The mem-map contains input_ids, masks, targets in that index order.
        return np.array(self.data[0][idx]), np.array(self.data[1][idx]), np.array(self.target[idx])

Thanks a lot!

@dlibenzi (Collaborator)

There is no need to buffer reads, for (at least) two reasons.
First, there are at least two buffering layers underneath already: the Python one and the OS one.
Second, with shuffling (which defaults to true), read offsets are all over the place, so there is very little locality for caching to exploit.

@dlibenzi (Collaborator) commented May 4, 2020

We have added a new API which makes global model sharing and serialized to() less hacky:

WRAPPED_MODEL = MpModelWrapper(MyNetwork())
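
A minimal usage sketch, assuming the wrapper is exposed through torch_xla.distributed.xla_multiprocessing as in the Colab linked a few comments below (MyNetwork is a placeholder):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# Wrap the model once at global scope so the fork()ed children share the host copy.
WRAPPED_MODEL = xmp.MpModelWrapper(MyNetwork())

def _mp_fn(index):
    device = xm.xla_device()
    # to() moves the shared model to each TPU core without every process
    # doing the transfer at the same time.
    model = WRAPPED_MODEL.to(device)
    ...

xmp.spawn(_mp_fn, args=(), start_method='fork')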

@psinger commented May 25, 2020

@dlibenzi Thanks - does this solve some of the memory issues?

@dlibenzi (Collaborator)

Yes, it does.
It avoids all processes doing the send-to-device at the same time.
We have also updated the Colab with the example usage:

https://colab.sandbox.google.com/github/pytorch/xla/blob/master/contrib/colab/mnist-training.ipynb

@taylanbil (Collaborator) commented May 25, 2020 via email

@psinger commented May 27, 2020

Thanks @dlibenzi & @taylanbil - this is really helpful.

Unfortunately, I am still struggling with saving the model weights for large models. The model is pulled into host memory when doing so and the kernel (on Kaggle) fails.

I think I asked that before, but is there any way to optimize / circumvent that?

@dlibenzi (Collaborator)

You are using xm.save() right?

@psinger commented May 28, 2020

Yes exactly.

@taylanbil (Collaborator)

@dlibenzi, although super sub-optimal, could we do this?

  • xm.save from master (ordinal 0) while the others sit in rendezvous-1
  • ordinal 0 reaches rv-1
  • ordinal 1 xm.saves while ordinals 0, 2-7 sit in rv-2
  • ordinal 1 reaches rv-2
  • ordinal 2 xm.saves while the others sit in rv-3
  • ...

etc.

@dlibenzi (Collaborator)

No.
We already have only ordinal 0 actually fetch device data to CPU tensors, and save.
The issue is that at that point the memory is already low, and even if only one process fetches the tensors to CPU, it OOMs.

@taylanbil (Collaborator)

I thought we send data to CPU from all devices, and only save from the master after we send.

https://github.com/pytorch/xla/blob/master/torch_xla/core/xla_model.py#L631

Is that not true?

@dlibenzi (Collaborator)

No.
We sync from all devices. Sync leaves data on device.
Then we fetch CPU data from master in order to call torch.save().
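
An illustrative paraphrase of that flow (not the actual xm.save() implementation; see xla_model.py for the real code):

import torch
import torch_xla.core.xla_model as xm

def save_from_master(state_dict, path):
    xm.mark_step()                 # flush pending work; tensors stay on the device
    xm.rendezvous('pre_save')      # wait until every core is done
    if xm.is_master_ordinal():     # only ordinal 0 fetches tensors to CPU and saves
        cpu_state = {k: v.cpu() for k, v in state_dict.items()}
        torch.save(cpu_state, path)
    xm.rendezvous('post_save')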

@taylanbil (Collaborator) commented May 28, 2020 via email

@dlibenzi (Collaborator)

So the memory issue with the PyTorch serialization comes from the fact that not only must all the CPU tensors be loaded into host memory at the same time, but PyTorch also uses a memory buffer to store them.
This effectively doubles the memory required:

https://github.com/pytorch/pytorch/blob/e029d678b63d8970d92e7d9713af74eb0ea69ad8/torch/serialization.py#L462

I have created #2140, which streams tensors to CPU (and then to file) one at a time.
But this requires using the matching load() API.
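
If I read the change right, usage would look roughly like this (the module path torch_xla.utils.serialization is an assumption here; double-check against #2140):

import torch_xla.utils.serialization as xser

# Save: tensors are streamed to CPU and written to file one at a time.
xser.save(model.state_dict(), 'model.bin')

# Load: must go through the matching load(), not torch.load().
state_dict = xser.load('model.bin')
model.load_state_dict(state_dict)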

@psinger commented Jun 5, 2020

@dlibenzi Thanks again for this!

Just tried it and I get the following error:
FileExistsError: [Errno 17] File exists: 'model.bin.tensors'

I am trying to save the model in the _run method, at the same spot where I would call xm.save.

@dlibenzi (Collaborator) commented Jun 5, 2020

I was envisioning the API to explicitly fail on existing checkpoints, but I realized this is not the normal torch.save() behavior.
Let me fix that ...

@dlibenzi (Collaborator) commented Jun 5, 2020

#2173

@zcain117 added the kaggle label Jul 1, 2020
@garyongguanjie

> @dlibenzi Thanks again for this!
>
> Just tried it and I get the following error:
> FileExistsError: [Errno 17] File exists: 'model.bin.tensors'
>
> I am trying to save the model in the _run method, at the same spot where I would call xm.save.

May I ask how you fixed this?

@dlibenzi (Collaborator)

Made the serialization module overwrite an existing checkpoint, like torch.save() would.

@kbrajwani

Hey @dlibenzi,
I was trying to train a GPT large model on Kaggle. It runs fine, but I am getting memory issues. When I try to track down where the memory goes, it shows me that running this script:
! curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
! python pytorch-xla-env-setup.py --version nightly --apt-packages libomp5 libopenblas-dev
takes 3 GB of RAM.
So my question is: how can I free that RAM?
