Crash at the end of training #9
Comments

Original issue:

Hi, I tried running the SQuAD model this morning (on a single GPU with gradient accumulation over 3 steps), but after 3 hours of training my job failed with the following output:

I was running the code, unmodified, from commit 3bfbc21. Is this an issue you know about?
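As an aside, gradient accumulation over 3 steps, as used in the run above, typically looks like the minimal PyTorch sketch below; the names and shapes are illustrative assumptions, not the actual run_squad.py code.

```python
import torch

# Minimal sketch of gradient accumulation (illustrative only, not the actual
# run_squad.py code): gradients from 3 small batches are summed before a
# single optimizer update, emulating a 3x larger effective batch size.
accumulation_steps = 3
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=3e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(9):  # stand-in for iterating over a real DataLoader
    inputs = torch.randn(4, 10)
    labels = torch.randint(0, 2, (4,))
    loss = loss_fn(model(inputs), labels)
    # Scale the loss so the accumulated gradient matches the large-batch average.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note that when the number of batches is not a multiple of `accumulation_steps`, the leftover gradients at the end of the loop never trigger an optimizer step; real training loops usually flush them after the loop.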
Here's the specific command I ran for more context:
Hi Kerem, yes, I fixed this bug yesterday in commit 2c5d993 (a bug with batches of dimension 1). I got good results with these hyperparameters last night:

```bash
python run_squad.py \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_PYTORCH_DIR/pytorch_model.bin \
  --do_train \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ../debug_squad/
```

I found:

```json
{"f1": 88.52381567990474, "exact_match": 81.22043519394512}
```

Feel free to reopen the issue if needed.
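For readers hitting similar crashes: a frequent cause of failures on the very last batch is a batch of size 1 interacting badly with `squeeze()`. The sketch below is a hypothetical illustration of that general pitfall, not the actual change made in commit 2c5d993.

```python
import torch

# Hypothetical illustration of the "batch of dimension 1" pitfall (not the
# actual fix in commit 2c5d993). When the dataset size is not divisible by
# the batch size, the final batch can contain a single example.
logits = torch.randn(1, 384, 1)  # (batch=1, seq_len=384, 1)

bad = logits.squeeze()     # squeeze() drops ALL size-1 dims -> shape (384,)
good = logits.squeeze(-1)  # squeeze(-1) keeps the batch dim -> shape (1, 384)

assert bad.dim() == 1 and good.dim() == 2
# Code expecting a (batch, seq_len) tensor crashes on `bad` only when the
# size-1 batch arrives -- i.e. at the very end of an epoch or of training.
```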
Closed
stevezheng23 added a commit to stevezheng23/transformers that referenced this issue on Mar 24, 2020: fix issues in new quac-kd runner
LysandreJik added a commit that referenced this issue on Apr 10, 2020:

* Initial commit to get BERT + run_glue.py on TPU
* Add README section for TPU and address comments.
* Cleanup TPU bits from run_glue.py (#3). The TPU runner is currently implemented in https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py. We plan to upstream this directly into `huggingface/transformers` (either the `master` or `tpu` branch) once it's been more thoroughly tested.
* No need to call `xm.mark_step()` explicitly (#4), since for gradient accumulation we're accumulating on batches from a `ParallelLoader` instance, which marks the step itself on next().
* Resolve R/W conflicts from multiprocessing (#5)
* Add XLNet to the list of models for `run_glue_tpu.py` (#6)
* Add RoBERTa to the list of models in TPU GLUE (#7)
* Add RoBERTa and DistilBert to the list of models in TPU GLUE (#8)
* Use barriers to reduce duplicate work/resources (#9)
* Shard the eval dataset and aggregate eval metrics (#10). Also, instead of calling `eval_loss.item()` every time, do summation with tensors on device; change defaultdict to float; reduce the prediction and label tensors instead of the metrics. As brought up during review, some metrics like F1 cannot be aggregated via averaging, and GLUE task metrics depend largely on the dataset, so instead we sync the prediction and label tensors so that the metrics can be computed accurately on those.
* Only use tb_writer from master (#11)
* Apply huggingface black code formatting; style
* Remove `--do_lower_case` as the example uses cased models
* Add an option to specify the tensorboard logdir; this is needed for our testing framework, which checks regressions against key metrics written by the summary writer.
* Use configuration for `xla_device`
* Prefix TPU-specific comments.
* num_cores clarification and namespaced eval metrics
* Cache the features file under `args.cache_dir` instead of under `args.data_dir`; this is needed as our test infra uses a data_dir with a read-only filesystem.
* Rename `run_glue_tpu` to `run_tpu_glue`

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
rraminen pushed a commit to rraminen/transformers that referenced this issue on Jun 3, 2022.
jlamypoirier added a commit to jlamypoirier/transformers that referenced this issue on Apr 4, 2023: dockerfile; formatting and fixes; cleanup; style
xloem pushed a commit to xloem/transformers that referenced this issue on Apr 9, 2023:

* Update trainer and model flows to accommodate sparseml
* Disable FP16 on QAT start (huggingface#12): override LRScheduler when using LRModifiers; disable FP16 on QAT start; keep the wrapped scaler object for training after disabling
* Use QATMatMul in the DistilBERT model class (huggingface#41)
* Remove double quantization of the output of the context layer (huggingface#45)
* Fix DataParallel validation forward signatures (huggingface#47): generalize forward_fn selection
* Best model after epoch (huggingface#46)
* Fix scaler check for non-FP16 mode in trainer (huggingface#38)
* MobileBERT QAT (huggingface#55): remove duplicate quantization of the vocabulary
* Enable a QATWrapper for non-parameterized matmuls in BERT self-attention (huggingface#9)
* Utils and auxiliary changes; update Zoo stub loading for the SparseZoo 1.1 refactor (huggingface#54)
* Add a flag to signal that the NM integration is active (huggingface#32)
* Add recipe_name to file names
* Fix errors introduced in the manual cherry-pick upgrade

Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
sim-so added a commit to sim-so/transformers that referenced this issue on Apr 23, 2023. The squashed commit messages 1 through 11 are identical: Update docs/source/ko/tasks/summarization.mdx (Co-authored-by: Wonhyeong Seo <wonhseo@kakao.com>)
jameshennessytempus pushed a commit to jameshennessytempus/transformers that referenced this issue on Jun 1, 2023.
ocavue pushed a commit to ocavue/transformers that referenced this issue on Sep 13, 2023: Use merged decoders
younesbelkada pushed a commit to younesbelkada/transformers that referenced this issue on Mar 14, 2024.
LysandreJik pushed a commit that referenced this issue on Mar 15, 2024:

* Cohere Model Release (#1)
* Remove unnecessary files and code (#2): some cleanup
* Delete cohere-model directory (#3)
* Make Fix (#5)
* PR fixes (#6): fixes for the PR format; src/transformers/models/auto/tokenization_auto.py
* Tokenizer test (#8): tokenizer test; format fix
* Adding Docs and other minor changes (#7)
* Add modeling tests (#9)
* Smol Fix (#11): tokenization tests are fixed; format fixes; fix PR doc tests; fix PR style check; small changes in cohere.md
* FIX: Address final comments for transformers integration (#13): fix modeling final nits and add a proper test file; for now leave empty tests; add an integration test; push a new test
* Fix modeling cohere (#14)
* Update chat templates to use the new API (#15)

Co-authored-by: ahmetustun <ahmetustun89@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
lcong pushed a commit to lcong/transformers that referenced this issue on Apr 9, 2024: Update 16_tensorboard.py
ArthurZucker pushed a commit that referenced this issue on Apr 9, 2024.
itazap pushed a commit that referenced this issue on May 14, 2024:
(Same Cohere Model Release commit message as in the Mar 15, 2024 entry above.)
SangbumChoi added a commit to SangbumChoi/transformers that referenced this issue on Aug 22, 2024: add num_nms parameters and set to 100
ZYC-ModelCloud pushed a commit to ZYC-ModelCloud/transformers that referenced this issue on Nov 14, 2024:

* remove (zeros -= 1)
* add warning
* support backwards compatibility
* support and fix bug
* remove unnecessary param
* fix test_q4 bug
* fix bug of double converting
* Update _utils.py
* FIX type error
* module is nn.Module
* need to return module; sync name
* modify default format to gptq_v2
* fix: need to return model
* remove fixme and default to gptq_v2 for quantize_config
* save _qlinear_kernel and allow saving to the older format
* fix name
* pass quantize
* update
* store quant log/stats in a dict slice and return them to the user in quantize()
* accept a saved quant_log in quantize() and calculate the diff
* tqdm the layer loop
* log awq vs autogptq outputs in the awq compat test
* fix: cached models are not compatible with the new PR; add v2 to the cache file name
* add a deprecation warning for loading .bin/.pt weights
* add missing termcolor requirement
* fix spelling
* fix triton v2
* rename quant log column 'name' to 'module'
* ruff
* add quantization tests for sym=False
* fix type hint
* more testing; fix a serialization bug; no additional dependency
* fix version
* no need for ... in tqdm
* use threadpoolctl to limit packing threads
* sync layer # with the visual tqdm
* use thread limit 1: as good as 4, and 1 beats 16 threads in testing
* fix saving of gptq (v1)
* deep copy
* remove todo: verified
* TEST/DEBUG underflow protection and output underflow stats; fix reversed underflow condition
* force underflow math (testing shows this is better than skipping the math)
* disable serialization of sym=False to v1 by default; disable loading of v1 sym=False by default
* revert adding the underflow check/stats
* pass test_quant and test both v1 and v2 save/load
* performance fix for convert_v1/v2()
* bump the version so we can delimit models made pre/post PR
* add meta and meta.quantizer to quantized_config.json
* fix json save and add a meta check to test_quantization; distutils is deprecated by Python, so add a dependency on packaging
* fix failed test
* fix awq unpack/repacking thread regression
* remove a highly flaky tiny-Mistral test with nonsensical input/output
* now that we can detect the quant producer, we don't need use_unsafe_math for loading
* update tests
* default to gptq v1 for maximum compatibility and remove the use_unsafe_math check in save_quantized
* misc
* separate the concepts of meta.quantizer and meta.packer (intel/auto-round as an example)
* clean
* test allowing loading of a quantized lm_head
* rename
* fix quantized lm_head loading
* sync with main
* ADD GLM model support

Co-authored-by: leejunjae <qwopqwop200@gmail.com>
Co-authored-by: Liurl26 <lrl@lbx.dev>
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
ZYC-ModelCloud pushed a commit to ZYC-ModelCloud/transformers that referenced this issue on Nov 14, 2024: Update README.md; Update README.md