We tried adding the following line to torch/distributed/fsdp/_init_utils.py:
tensor.to("cuda:0")
However, this operation raises another error:
│ /raid/ganesh/namitha/miniconda3/envs/icl_as_ft/lib/python3.9/site-packages/torch/distributed/fsd │
│ │
│ 753 │ │ device: Optional[torch.device] = None │
│ 754 │ │ # For `use_orig_params=True`, permit non-uniform `requires_grad` │
│ 755 │ │ for tensor in tensors: │
│ ❱ 756 │ │ │ tensor.to("cuda:0") │
│ 757 │ │ │ if isinstance(tensor, FlatParameter): │
│ 758 │ │ │ │ raise ValueError("Cannot flatten a `FlatParameter`") │
│ 759 │ │ │ if dtype is None and not tensor.is_floating_point(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
NotImplementedError: Cannot copy out of meta tensor; no data!
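This error is expected behavior: a meta tensor holds only shape/dtype metadata and no storage, so there is no data to copy to a device. A minimal sketch reproducing the failure and the usual workaround (allocating fresh storage instead of copying):

```python
import torch

# A "meta" tensor carries only metadata -- it has no storage,
# so there is nothing to copy to another device.
t = torch.empty(4, 4, device="meta")

try:
    t.to("cpu")  # same failure mode as the .to("cuda:0") patch above
except NotImplementedError as e:
    print(e)  # Cannot copy out of meta tensor; no data!

# Note also that Tensor.to is out-of-place: the patched line
# `tensor.to("cuda:0")` discards its return value, so it would have
# no effect even on a non-meta tensor.

# To materialize a meta tensor, allocate fresh (uninitialized) storage
# with the same shape and dtype instead of copying:
real = torch.empty_like(t, device="cpu")
```

The values in `real` are uninitialized and must still be filled in, e.g. by loading checkpoint weights or calling an init function.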
We have made other changes to pile-tiny.yaml, scripts/train.py and scripts/util.py to make them compatible with training. I am attaching a zip of those files here: changes.zip
However, we circumvented this issue by commenting out the raised error (within torch/distributed/fsdp/_init_utils.py) as follows:
except BaseException as e:
warnings.warn(
"Unable to call `reset_parameters()` for module on meta "
f"device with error {str(e)}. Please ensure that your module of"
f"type {type(module)} implements a `reset_parameters()` method."
)
#raise e
I have attached the entire file within changes.zip, just in case.
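For what it's worth, the warning above is emitted when FSDP cannot call reset_parameters() on a module that was built on the meta device. Rather than patching site-packages, the usual pattern is to give each module a reset_parameters() method and materialize it with nn.Module.to_empty() before initialization. A minimal sketch, assuming this is the failure mode (TinyBlock and materialize are illustrative names we chose, not part of the OLMo or PyTorch code):

```python
import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Illustrative module that supports meta-device construction."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # Called to (re)initialize parameter values once real storage exists.
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)


def materialize(module: nn.Module, device: torch.device) -> None:
    # Allocate real (uninitialized) storage for the module's parameters,
    # then fill in values via reset_parameters().
    module.to_empty(device=device)
    if callable(getattr(module, "reset_parameters", None)):
        module.reset_parameters()


# Build on the meta device (no memory used), then materialize where needed:
with torch.device("meta"):
    block = TinyBlock(8)
materialize(block, torch.device("cpu"))  # or a CUDA device on the GPU host
```

A helper like `materialize` (with the device bound) can also be passed to FSDP via its `param_init_fn` argument, which is the supported hook for initializing meta-device modules instead of editing `_init_utils.py`.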
Hi @adityakusupati, this is Prateek Chanda from GRI. @Advaid-Deepak and I were experimenting with MatFormer-OLMo to try out a few ideas externally, and we ran into the issues shown above when fine-tuning from a MatFormer checkpoint.
We would really appreciate it if you could point out any steps we may have missed.
Thanks for your interest. I am unsure what is happening here as well. The MatFormer-OLMo models are not competitive enough to run experiments on (barring scaling laws) and get meaningful results.
The only good MatFormer models publicly released are the MatViT models in Scenic, which are actually SOTA as regular ViT models and a drop-in replacement.
As of now I am unable to look at this closely and can only do so after the second week of May. The script and README are what I used to restart my training runs from a checkpoint when something failed, which implies that fine-tuning should work similarly.
We were trying to fine-tune a MatFormer checkpoint (MatFormer-OLMo-180M). We used the following command to call the training script, where the folder given in load_path is obtained by downloading from the link in the README for MatFormer-OLMo-180M. However, running this gives us the following error, which we are unable to resolve.