Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute true loss Flax examples #18458

Closed

Conversation

duongna21
Copy link
Contributor

What does this PR do?

'True' losses should be computed in Flax examples, as discussed with @sanchit-gandhi.

Who can review?

cc @sanchit-gandhi @patrickvonplaten

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Aug 3, 2022

The documentation is not available anymore as the PR was closed or merged.

@duongna21 duongna21 changed the title Compute true loss flax examples Compute true loss Flax examples Aug 3, 2022
Copy link
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @duongna21, thanks for the PR - the 'true loss' implementation looks clean! There appear to be a few unintentional changes that have crept into the PR (deletion of lines, changes to drop_last_batch. It might simply be a case of rebasing onto main to fix these:

git fetch upstream
git rebase upstream/main

Comment on lines -891 to +897
eval_batch_idx = generate_batch_splits(eval_samples_idx, eval_batch_size, drop_last=False)
eval_batch_idx = generate_batch_splits(eval_samples_idx, eval_batch_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unintentional change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanchit-gandhi I noticed that with drop_last=False the eval loss will become nan at the beginning of training, but eval accuracy is still on track. It appears to occur with both run_bart_dlm_flax and run_summarization so I temporarily turned it off. It would be great if you could take a look and fix it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! What immediately jumps out to me is that num_labels is 0, causing the 'true loss' to be nan. You didn't get this behaviour previously with the pmap operation? What eval batch size are you using?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bug occurs without dividing loss by num_labels actually. I took a quick look at the eval loss computed on every token at each step, the last losses was nan.

python run_summarization_flax.py \
	--output_dir ./bart-base-xsum \
	--model_name_or_path facebook/bart-base \
	--tokenizer_name facebook/bart-base \
	--dataset_name="xsum" \
	--do_train --do_eval --do_predict --predict_with_generate \
	--num_train_epochs 6 \
	--learning_rate 5e-5 --warmup_steps 0 \
	--per_device_train_batch_size 64 \
	--per_device_eval_batch_size 64 \
	--overwrite_output_dir \
	--max_source_length 512 --max_target_length 64 \
	--push_to_hub 

Printed tensor has a shape of (8, 64, 64), provided that I have 8 TPU cores, per_device_eval_batch_size=64 and max_target_length=64.
ảnh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no! That's interesting to see. Is this an artefact of using the psum? As in, were the losses nan when we used a pmap previously? If so, we'll need to address this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The losses were nan right after loss = optax.softmax_cross_entropy(logits, soft_labels). Great if you could have a look at this issue!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, this problem doesn't seem to be related to this PR. Flax examples are young so we'll improve it step by step :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @duongna21, just wondering what the status is on the nan issue being discussed here. I'm running into this issue while using the recently added run_bart_dlm_flax.py script but am very new to Flax/Jax so haven't been able to really make sense of it yet. Is run_bart_dlm_flax.py useable for model pre-training in its current state? Just as a side note, I haven't seen the nan issue when running run_t5_mlm_flax.py on my training data. Thanks in advance for any clarification!

Copy link
Contributor Author

@duongna21 duongna21 Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannonk I'm sorry for the late reply. I believe this error is related to the drop_last=False option. The training will be fine if you set drop_last=True, at the cost that a few examples in the last batch will be skipped. Nice to see anyone is able to work on this weird bug.

Comment on lines -931 to +937
eval_batch_idx = generate_batch_splits(eval_samples_idx, eval_batch_size, drop_last=False)
eval_batch_idx = generate_batch_splits(eval_samples_idx, eval_batch_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unintentional change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here! Is drop_last causing issues with the eval loss?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

examples/flax/language-modeling/run_bart_dlm_flax.py Outdated Show resolved Hide resolved
ydshieh and others added 27 commits September 14, 2022 16:13
* Fix DocumentQuestionAnsweringPipelineTests

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
…face#18814)

* add gpt-neox-japanese model and tokenizer as new model

* Correction to PR's comment for GPT NeoX Japanese
- Fix to be able to use gpu
- Add comment # Copied... at the top of RotaryEmbedding
- Implement nn.Linear instead of original linear class
- Add generation test under @slow

* fix bias treatment for gpt-neox-japanese

* Modidy gpt-neox-japanese following PR
- add doc for bias_dropout_add
- style change following a PR comment

* add document for gpt-neox-japanese

* remove unused import from gpt-neox-japanese

* fix README for gpt-neox-japanese
* Automate check for new pipelines and metadata update

* Add Datasets to quality extra
…ce#19001)

* Fix a broken link for deepspeed ZeRO inference

* fix link

Co-authored-by: Stas Bekman <stas@stason.org>
correct the import statement
* Small replacement

- replace `modules_to_not_convert` by `module_to_not_convert`

* refactor a bit

- changed variables name
- now output a list
- change error message

* make style

* add list

* make style

* change args name

Co-authored-by: stas00 <stas00@users.noreply.github.com>

* fix comment

* fix typo

Co-authored-by: stas00 <stas00@users.noreply.github.com>

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: stas00 <stas00@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Updated test values

The image segmentation pipeline tests - tests/pipelines/test_pipelines_image_segmentation.py - were failing after the merging of huggingface#1849  (49e44b2). This was due to the difference in rescaling. Previously the images were rescaled by `image = image / 255`. In the new commit, a `rescale` method was added, and images rescaled using `image = image * scale`. This was known to cause small differences in the processed images (see
[PR comment](huggingface#18499 (comment))).

Testing locally, changing the `rescale` method to divide by a scale factor (255) resulted in the tests passing. It was therefore decided the test values could be updated, as there was no logic difference between the commits.

* Use double quotes, like previous example

* Fix up
* Fix test_save_load for TFViTMAEModelTest

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
…face#19034)

* Override save() to use the serving signature as the default

* Replace int32 with int64 in all our serving signatures

* Remember one very important line so as not to break every test at once

* Dtype fix for TFLED

* dtype fix for shift_tokens_right in general

* Dtype fixes in mBART and RAG

* Fix dtypes for test_unpack_inputs

* More dtype fixes

* Yet more mBART + RAG dtype fixes

* Yet more mBART + RAG dtype fixes

* Add a check that the model actually has a serving method
* init PR

* optimize top p and add edge case

* styling

* style

* revert tf and flax test

* add edge case test for FLAX and TF

* update doc with smallest set sampling for top p

* make style
* Fixing OPT fast tokenizer option.

* Remove dependency on `pt`.

* Move it to GPT2 tokenization tests.

* Added a few tests.
* Fix CI for custom tokenizers

* Add nightly tests

* Run CI, run!

* Fix paths

* Typos

* Fix test
* Enable torchdynamo tests

* make style

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
…uggingface#18702)

* Adds package and requirement spec output to version check exception

It's difficult to understand what package is affected when `got_ver`
here comes back None, so output the requirement and the package. The
requirement probably contains the package but let's output both for good
measure.

Non-exhaustive references for this problem aside from my own encounter:

* https://stackoverflow.com/questions/70151167/valueerror-got-ver-is-none-when-importing-tensorflow
* https://discuss.huggingface.co/t/valueerror-got-ver-is-none/17465
* UKPLab/sentence-transformers#1186
* huggingface#13356

I speculate that the root of the error comes from a conflict of
conda-managed and pip-managed Python packages but I've not yet proven
this.

* Combines version presence check and streamlines exception message

See also: huggingface#18702 (comment)

Co-authored-by: Stas Bekman <stas@stason.org>
- set `use_cache` to `True` for consistency with other `transformers` models
* Support for ConvNext

* Support for Wav2Vec2

* Support for Resnet

* Fix small issue in test_modeling_convnext
…P16 input (huggingface#18746)

* Adding cast to fp32 in convnext layernorm to prevent rounding errors in the case of fp16 input

* Trigger CI
rchan26 and others added 3 commits October 10, 2022 09:25
* Remove dependency of Roberta in Blenderbot

* Move Copied from statements to each method of the Roberta classes

* Remove copied from line for mask_token.setter

* update output from example in docs
The sequence_masked variable is actually the part of the sequence that is kept unmasked for the encoder. This commit renames the variable.
* Add `OPTForQuestionAnswering`

- added `OPTForQuestionAnswering` class based on `BloomForQuestionAnswering`
- added `OPTForQuestionAnswering` in common tests
- all common tests pass
- make fixup done

* added docstrings for OPTForQuestionAnswering

* Fix docstrings for OPTForQuestionAnswering
Copy link
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @duongna21! Sorry for the late review here. In general this looks good to me! Two things from my review:

  1. Could you rebase onto main? This should update all the jax.tree_map calls to jax.tree_util.tree_map
  2. Is using drop_last=False giving nan for the eval loss? If so, we can dig into this on a separate issue. Ideally we should keep the changes for this PR related to the 'true' Flax loss. We can open a separate issue/PR for any other side issues!

Comment on lines -891 to +897
eval_batch_idx = generate_batch_splits(eval_samples_idx, eval_batch_size, drop_last=False)
eval_batch_idx = generate_batch_splits(eval_samples_idx, eval_batch_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's drop_last=False which is causing the eval loss to be nan?

Copy link
Contributor Author

@duongna21 duongna21 Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanchit-gandhi Thanks for the review. Rebased.

Comment on lines -931 to +937
eval_batch_idx = generate_batch_splits(eval_samples_idx, eval_batch_size, drop_last=False)
eval_batch_idx = generate_batch_splits(eval_samples_idx, eval_batch_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here! Is drop_last causing issues with the eval loss?

amyeroberts and others added 23 commits October 10, 2022 14:48
* simplify loop

* add featur extractor

* add model

* start conversion

* add dropout

* initial commit of test files

* copnversion for all models

* update processor for correct padding

* update feature extraction

* update integration test logits match

* fmnt: off for the logits

* on the fly mel bank

* small nit

* update test

* update tokenizer

* nit feature extraction

* update

* update tokenizer test

* adds logit processor and update tokenizer to get supress tokens

* style

* clean convert

* revert to original modeling tf utils

* Update

* update

* nit

* clean convert file

* update tests and nits

* quality

* slow generation test

* ffn_dim to allow customization

* update readme

* add to toctreee

* start fixing integration tests

* update tests and code

* fix feature extractor

* fix config tests common

* update code to fix tests

* fix feature exctractor

* nit feature extraction

* update test for new feature extractor

* style

* add absrtact

* large logits wioth custom decoder input ids

* wraap around is otrch available

* fix feature extractor

* correct logits for whisper small.en

* nit

* fix encoder_attentino_mask

* some fixes

* remove unnecessary inputs

* nits

* add normalizer file

* update etst tokenization

* fix attention mask not defined

* fix generate

* remove uncoder attention mask useless

* update test modeling whisper

* update condfig to add second non supress tokens

* nits on feature exrtactor

* nit for test tokenizers

* update etsts

* update tests

* update tokenization test

* fixup

* invalidated hf token. Clean convert openai to whisper

* fix logit tests

* fixup

* Add model to README

* Fix doc tests

* clean merge

* revert toc_tree changes

* remove useless LogitProcessor

* Update whisper .mdx

* update config file doc

* update configuration docstring

* update test tokenization

* update test tokenization

* update tokenization whisper
Added copied from where needed

* update feature extraction

* nit test name

* style

* quality

* remove get suppress tokens and update non_speech tokens global variables

* Update src/transformers/models/whisper/feature_extraction_whisper.py

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* clean modeling whisper and test
Removed the attention mask arguments that are deprecated

* fix large test

* Add multilingual audio test, and translate test

* style

* fix larg multilingual test

* nits

* add copied from for attention layer

* remove attention masks in doc

* add english normalizer

* Update docs/source/en/model_doc/whisper.mdx

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* update tokenization test

* remove copied from in whisper attention : no bias in k_proj only

* wrap around dependencies in english normalizer

* style

* correct import generation logits

* for now, wrap feature extractor with torch

* remove torch depencies for feature extraction and style

* Update src/transformers/models/whisper/convert_openai_whisper_to_tfms.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/models/whisper/configuration_whisper.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update docs/source/en/model_doc/whisper.mdx

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* fixup

* nit

* update logitds

* style

* nit

* nits and fix final tests

* add `is_more_itertools_available` to utils

* quality

* add begin supress tokens, supress tokens to generate args and config

* clean supressTokensLogitProcessor in generation logits

* Nit naming

* add supressTokensAtBegin

* udpate tests, supress tokens to None or correct values

* nit and style

* update RAG to fit test and generate_logit

* add copy pasted statment on english normalizer

* add arguments to config_common_kwargs

* Update src/transformers/generation_utils.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update src/transformers/generation_logits_process.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* revert changes based on reviews

* update doc and nits

* Update src/transformers/models/whisper/configuration_whisper.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* more nits

* last nits

* update test configuration common

* add BART name in decoder attention mask documentation

* Update src/transformers/models/whisper/modeling_whisper.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* style

* nit

* nit

* add english.json file to git

* nits on documentation

* nit

* nits

* last styling

* add main toctree file

* remove sentence piece dependency

* clean init file

* fix tokenizer that has no dependencies on sentencepiece

* update whisper init file, nit

* remove english.json file

* add get decoder prompt id

* All weights loading

* Remove hanging pdb

* Fixup and tidy up

* Use same copied from as PT model

* Remove whitespace changes

* Remove torch references

* Tie embeddings

* Remove logits processor input to generate

* Update logit values

* revert changes and add forced logit processor

* nit

* clean normalizer

* remove protected

* Add logit processors and update generation code & tests

* Some tidy up

* Update docstring

* update

* update based on review

* Update src/transformers/models/whisper/configuration_whisper.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/whisper/configuration_whisper.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update to reflect changes on the PT model branch

* Tidy up

* Remove extra whitespace

* Fix test - make input ids small enough we can append

* Include upstream changes on main

* PR comments - add batch tests, remove comments & defaults

* Fix model output imports

* Update src/transformers/models/whisper/modeling_tf_whisper.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update src/transformers/generation_tf_logits_process.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update src/transformers/models/whisper/modeling_tf_whisper.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update src/transformers/models/whisper/modeling_tf_whisper.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update tests/models/whisper/test_modeling_tf_whisper.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update src/transformers/models/whisper/modeling_tf_whisper.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update src/transformers/models/whisper/modeling_tf_whisper.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update docstring example

* Update src/transformers/models/whisper/modeling_tf_whisper.py

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* Remove changes to adjust_logits_during_generation function

* Update src/transformers/models/whisper/modeling_tf_whisper.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Tidy up imports that don't require TF

* Update tests - skip and no more skip

* Update tests/generation/test_generation_tf_logits_process.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update src/transformers/models/whisper/modeling_tf_whisper.py

* Update src/transformers/models/whisper/modeling_tf_whisper.py

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* Add training flags

* Add (skipped) XLA generation tests

* Add embedding correctness test

* Add constant ids for generation tests

* Make logits finding a bit tidier

* Remove unused args

* xla generation enabled

* Don't skip XLA tests anymore

* Fix tests - add position ids to expected signature and update rag generation

* Undo method reorder

* Remove added whitespace

* Remove copy-paste gradient checkopint ref

* Remove

* Trigger CI - (issue with refs when pulling)

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: NielsRogge <niels.rogge1@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Joao Gante <joao@huggingface.co>
* fix conflicts

* start translating

* proof check

* add toc

* fix errors and typos
The momentum value for PyTorch and TensorFlow batch normalization layers is not equivalent. The TensorFlow value should be (1 - pytorch_momentum) in order to ensure the correct updates are applied to the running mean and running variance calculations. We wouldn't observe a difference loading a pretrained model and performing inference, but evaluation outputs would change after some training steps.
* Fixed a non-working hyperlink in the README.md file

The hyperlink to the community notebooks was outdated.

* Fixing missing double slash in hyperlink
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
@patrickvonplaten
Copy link
Contributor

@duongna21 super sorry it seems like the git commit history got messed up :-/ Any chance you could re-submit your PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.