Resizing HF token embeddings with PipelineModule #1010
I don't know much about DeepSpeed's version of pipeline parallelism - I've only worked with PyTorch's native version of it. I do know that I had to add a gather and re-partition step under ZeRO-3 for exactly this situation of resizing embeddings - twice in this code. Does pipeline have a similar feature? I just don't know this side of DeepSpeed.
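For reference, the gather/re-partition pattern under ZeRO-3 looks roughly like the following sketch (a hypothetical helper, not the exact code referenced above; it assumes deepspeed.zero.GatheredParameters and an initialized process group):

```python
import torch
import deepspeed


def copy_embeddings_zero3(old_emb: torch.nn.Embedding, new_emb: torch.nn.Embedding) -> None:
    """Hypothetical sketch of copying rows between embedding matrices under ZeRO-3."""
    # Under ZeRO-3 each parameter is partitioned across ranks, so the weights
    # must be gathered before they can be read or written; leaving the context
    # re-partitions them and broadcasts the changes made by modifier_rank.
    n = min(old_emb.num_embeddings, new_emb.num_embeddings)
    with deepspeed.zero.GatheredParameters([old_emb.weight, new_emb.weight], modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            new_emb.weight.data[:n, :] = old_emb.weight.data[:n, :]
```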
@stas00 have you already migrated HF towards native pipeline parallelism with PyTorch 1.8? If so, can you point me to that? For pipe, I don't think there's an equivalent feature to the one you listed, @stas00, although I have a few questions about that:
Would also like @ShadenSmith's and @tjruwase's thoughts here, because I think (2) above (i.e., ZeRO performance on HF classes) is incredibly poor, as seen with ZeRO-2. I have a separate GitHub issue ongoing with @tjruwase about this. I feel like pipeline or even 3D parallelism would be better than ZeRO-3/Infinity because of the reduced communication volume with pipeline parallelism. Fitting massive models in GPU memory is one thing; being able to train them fast by minimizing communication volume is another. ZeRO-3/Infinity may help with the former, but it looks like pipeline (or more broadly, 3D) parallelism is the better solution because it also allows the latter.
re: Pipeline Parallelism: Most HF models are too complicated. All Pipe approaches that I tried require:
So after spending weeks on this I gave up (or rather parked the idea). I managed to make a pipeline using two PyTorch pipelines, because a single pipeline can't handle conditional modules, which encoder/decoder models are. The performance was terrible; I couldn't get GPU utilization above 50% across 2 GPUs. The PyTorch Pipe API has been becoming more user-friendly w.r.t. (2) and will soon handle any inputs/outputs. In order to convert HF models to a pipeline, the models have to drop complex features like past key values and hidden-state aggregation - this was the most difficult part. I made a workaround using closures, but it doesn't scale well. If you want to see some really crazy code, that experimental PR is full of it. Bottom line - to make pipelines work, the models have to be stripped down to a flat sequence of layers with simple tensor inputs/outputs (see the sketch below).
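To illustrate the shape a model has to take for these pipeline approaches, here is a minimal sketch (hypothetical layer sizes, not an HF model): a flat stack where each layer consumes the previous layer's tensors and returns tensors, with no keyword arguments, no None placeholders, and no per-layer caches such as past key/values.

```python
import torch.nn as nn

# A pipeline-friendly model: a flat sequence of layers connected purely by
# tensor inputs/outputs, which a pipeline engine can split into stages.
# Sizes are placeholders, roughly GPT-2-small-like.
pipeline_friendly = nn.Sequential(
    nn.Embedding(50257, 768),
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    nn.Linear(768, 50257),
)
```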
re: Performance testing: I hope to start doing that in the next few days, now that ZeRO-Infinity has been merged. As usual I will open an issue on HF transformers and start sharing the results. We plan to do extensive benchmarking including SageMaker, JAX, Megatron-LM and DeepSpeed, of course. I'm not sure if FairScale will be included - the last time one of us looked it was not complete, but perhaps they have caught up; I was too busy with the DeepSpeed integration and with bf16-pretrained models getting NaNs under fp16/mixed precision/DeepSpeed to have time to look. One other approach I hope to include is FlexFlow https://github.com/flexflow/flexflow - I hope we will now be able to convert our models to a torch.fx trace, which is a prerequisite for FlexFlow. I highly recommend you check it out - the paper looks very interesting - but I haven't had a chance to see it in action yet and hope this will change soon. @michaelbenayoun has been making awesome progress on proxying the symbolic tracing via huggingface/transformers#11475, which should enable FlexFlow usage with HF transformers.
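For context, a torch.fx trace is produced with symbolic tracing, roughly like this minimal sketch (a toy module, not an HF model):

```python
import torch
from torch import fx


class Block(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


# symbolic_trace records the forward pass as a graph of operations that
# graph-level tools (e.g. a FlexFlow frontend) could consume.
traced = fx.symbolic_trace(Block())
print(traced.graph)
```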
With an HF model class, one can resize the token embeddings to account for any newly added special tokens; there's no upper limit. In the usual scenario it looks something like the snippet below (this isn't necessarily working code; I may have gotten the tokenizer APIs slightly wrong).
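A sketch of the usual flow (GPT-2 and the added tokens here are just placeholders):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Placeholder checkpoint; any model/tokenizer pair works the same way.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register two new special tokens with the tokenizer.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<speaker1>", "<speaker2>"]}
)

# Grow the input embedding matrix to cover the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))
```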
The last line essentially allocates 2 new indices for the newly added special tokens in the input embeddings matrix, and initializes their embeddings with random weights.
Now in the pipeline regime, one cannot just resize the token embeddings after initialization of the PipelineModule, since the module would have already split the model across pipeline stages. Is it possible to provide a callback/mechanism with PipelineModule that allows resizing and fresh initialization of newly added special-token embeddings for downstream users?

Also, shouldn't this be a problem with the implementation of pipeline (and more generally 3D) parallelism in the DeepSpeedExamples repo too? A user of a model that's been pre-trained with pipeline parallelism would certainly have some basic downstream needs such as the addition of special tokens for fine-tuning.

@ShadenSmith @stas00
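For reference, under the current API the embedding apparently has to be sized for the expanded vocabulary before the layers are handed to PipelineModule. A rough sketch (hypothetical layer classes and sizes, not HF code; assumes a DeepSpeed-initialized process group):

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

base_vocab = 50257
num_new_special_tokens = 2  # e.g. the count returned by tokenizer.add_special_tokens(...)
vocab = base_vocab + num_new_special_tokens
hidden, heads = 768, 12

# The vocabulary size is baked in before partitioning; once PipelineModule
# has split the layers across stages, the embedding weight lives on a single
# stage and cannot simply be resized from user code.
layers = [
    LayerSpec(nn.Embedding, vocab, hidden),
    LayerSpec(nn.TransformerEncoderLayer, hidden, heads),
    LayerSpec(nn.Linear, hidden, vocab),
]
model = PipelineModule(layers=layers, num_stages=2)
```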