Wrong weight initialization for TF t5 model #13327
Comments
I agree! Would you like to open a PR to fix it? :-)
Will try to do it in the coming days.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
dshirron pushed a commit to dshirron/transformers that referenced this issue on Nov 2, 2021.
Rocketknight1 pushed a commit that referenced this issue on Nov 3, 2021:
Fix of issue #13327: Wrong weight initialization for TF t5 model * run black formatter * fix typo * remove my name tag from comments. Co-authored-by: Shirron <dan.shirron@intel.com>
LysandreJik pushed a commit that referenced this issue on Nov 16, 2021:
Fix of issue #13327: Wrong weight initialization for TF t5 model * run black formatter * fix typo * remove my name tag from comments. Co-authored-by: Shirron <dan.shirron@intel.com>
Environment info
Who can help
@patil-suraj
@patrickvonplaten
Information
Model I am using: Pre-training T5-base
The problem arises when using:
The tasks I am working on are:
Expected behavior
Before updating the init weights to a normal distribution (as in transformers/src/transformers/models/t5/modeling_flax_t5.py), the loss was stuck at 4.5 (unlike the Flax behaviour). After updating the init weights, I get the same behaviour as in Flax and reach a loss below 2.
Example:
In the Flax code, class FlaxT5DenseReluDense, lines 95-96:
wi_init_std = self.config.initializer_factor * (self.config.d_model ** -0.5)
wo_init_std = self.config.initializer_factor * (self.config.d_ff ** -0.5)
In the TF code, the default initializer is used. My suggested fix:
wi_initializer = tf.keras.initializers.RandomNormal(mean=0, stddev=config.initializer_factor * (config.d_model ** -0.5))
wo_initializer = tf.keras.initializers.RandomNormal(mean=0, stddev=config.initializer_factor * (config.d_ff ** -0.5))
self.wi = tf.keras.layers.Dense(config.d_ff, use_bias=False, name="wi", kernel_initializer=wi_initializer)
self.wo = tf.keras.layers.Dense(config.d_model, use_bias=False, name="wo", kernel_initializer=wo_initializer)
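To see why the default matters: Keras Dense layers default to glorot_uniform, whose effective stddev differs from the T5 scheme. A minimal numpy sketch (not the actual transformers code; T5-base shapes and initializer_factor = 1.0 are assumed) comparing the two distributions for the wi projection:

```python
import numpy as np

# Compare the stddev of the proposed T5-style normal init with the effective
# stddev of Keras' default glorot_uniform, for the wi kernel of T5-base.
# Assumed shapes: wi maps d_model=768 -> d_ff=3072.
rng = np.random.default_rng(0)
d_model, d_ff, factor = 768, 3072, 1.0

# Proposed init (mirrors the Flax T5 formula): N(0, factor * d_model**-0.5)
t5_std = factor * d_model ** -0.5
wi_t5 = rng.normal(0.0, t5_std, size=(d_model, d_ff))

# Keras default glorot_uniform: U(-limit, limit), limit = sqrt(6/(fan_in+fan_out));
# a uniform on [-limit, limit] has stddev limit/sqrt(3) = sqrt(2/(fan_in+fan_out)).
limit = np.sqrt(6.0 / (d_model + d_ff))
glorot_std = limit / np.sqrt(3.0)
wi_glorot = rng.uniform(-limit, limit, size=(d_model, d_ff))

print(f"T5 init sample std:     {wi_t5.std():.4f} (target {t5_std:.4f})")
print(f"Glorot init sample std: {wi_glorot.std():.4f} (target {glorot_std:.4f})")
```

With these shapes the two stddevs differ by roughly a factor of 1.6 (about 0.036 vs 0.023), which is consistent with training dynamics diverging from the Flax reference.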
This is relevant for all weights and embeddings initialization.