RuntimeError when training coref example project #11734

Closed
itssimon opened this issue Nov 2, 2022 · 5 comments · Fixed by explosion/spacy-experimental#30
Labels
  • bug: Bugs and behaviour differing from documentation
  • experimental: Experimental components and features
  • feat / coref: Feature: Coreference resolution

Comments


itssimon commented Nov 2, 2022

This is an issue in the experimental coref project. If this is not the right place to capture issues regarding that repository, please let me know.

Running spacy project run all to train the coref model leads to the following error:

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Full output:

ℹ Running workflow 'all'

================================= preprocess =================================
ℹ Skipping 'preprocess': nothing changed

=============================== train-cluster ===============================
Running command: /home/jovyan/.envs/clinex/bin/python3.10 -m spacy train configs//cluster.cfg -g 0 --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy -o training/cluster --training.max_epochs 20
ℹ Saving to output directory: training/cluster
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2022-11-02 17:39:33,675] [INFO] Set up nlp object from config
[2022-11-02 17:39:33,694] [INFO] Pipeline: ['transformer', 'coref']
[2022-11-02 17:39:33,699] [INFO] Created vocabulary
[2022-11-02 17:39:33,701] [INFO] Finished initializing nlp object
Downloading config.json: 100%|██████████| 481/481 [00:00<00:00, 423kB/s]
Downloading vocab.json: 100%|██████████| 878k/878k [00:01<00:00, 568kB/s]
Downloading merges.txt: 100%|██████████| 446k/446k [00:01<00:00, 385kB/s]
Downloading tokenizer.json: 100%|██████████| 1.29M/1.29M [00:01<00:00, 978kB/s]
Downloading pytorch_model.bin: 100%|██████████| 478M/478M [00:18<00:00, 27.7MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "/home/jovyan/.envs/clinex/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jovyan/.envs/clinex/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy/cli/_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/typer/main.py", line 532, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy/cli/train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy/cli/train.py", line 72, in train
    nlp = init_nlp(config, use_gpu=use_gpu)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy/training/initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy/language.py", line 1323, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy_experimental/coref/coref_component.py", line 357, in initialize
    self.model.initialize(X=X, Y=Y)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/thinc/model.py", line 299, in initialize
    self.init(self, X=X, Y=Y)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/thinc/layers/chain.py", line 92, in init
    curr_input = layer.predict(curr_input)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/thinc/model.py", line 315, in predict
    return self._func(self, X, is_train=False)[0]
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy_experimental/coref/coref_model.py", line 85, in coref_forward
    return model.layers[0](X, is_train)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/thinc/layers/pytorchwrapper.py", line 143, in forward
    Ytorch, torch_backprop = model.shims[0](Xtorch, is_train)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/thinc/shims/pytorch.py", line 72, in __call__
    return self.predict(inputs), lambda a: ...
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/thinc/shims/pytorch.py", line 90, in predict
    outputs = self._model(*inputs.args, **inputs.kwargs)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy_experimental/coref/pytorch_coref_model.py", line 77, in forward
    pairwise = self.pairwise(top_indices)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/.envs/clinex/lib/python3.10/site-packages/spacy_experimental/coref/pytorch_coref_model.py", line 269, in forward
    distance = (word_ids.unsqueeze(1) - word_ids[top_indices]).clamp_min_(min=1)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

How to reproduce the behaviour

git clone https://github.com/explosion/projects.git
cd projects/experimental/coref
# Prepare CoNLL 2012 data...
spacy project run all

Your Environment

  • spaCy version: 3.4.2
  • Platform: Ubuntu 22.04
  • Python version: 3.10.6
  • Pipelines: en_coreference_web_trf (3.4.0a0), en_core_web_md (3.4.1)
  • PyTorch version: 1.13.0
  • CUDA version: 11.7

itssimon changed the title from "RuntimeError when training coref experimental example project" to "RuntimeError when training coref example project" on Nov 2, 2022
adrianeboyd added the bug and experimental labels on Nov 2, 2022
polm added the feat / coref label on Nov 7, 2022


polm commented Nov 7, 2022

Thanks for the report and sorry you're having trouble with this.

I notice you're using Torch 1.13.0, which was just released. Can you downgrade to a 1.12 version and see if that works? This may be related to #11742.
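
For anyone scripting this check, here is a minimal sketch of a guard that tells whether an installed PyTorch version string predates 1.13. The helper names `torch_version_tuple` and `predates_1_13` are hypothetical and are not part of spaCy, spacy-experimental, or PyTorch; the only assumption is that the version string starts with `major.minor`, as in `1.12.1` or `1.13.0+cu117`.

```python
import re


def torch_version_tuple(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.13.0+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def predates_1_13(version: str) -> bool:
    """Return True for versions older than 1.13 (hypothetical helper)."""
    return torch_version_tuple(version) < (1, 13)


# The versions mentioned in this thread:
reported = predates_1_13("1.13.0")   # the failing setup
suggested = predates_1_13("1.12.1")  # the suggested downgrade
```

In a real environment the argument would be `torch.__version__`.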


itssimon commented Nov 7, 2022

You're right, it works fine with PyTorch 1.12.1!

I suspect it's a separate issue to #11742 though, as there was no fatal error / segfault.


polm commented Nov 7, 2022

Thanks for confirming that works.

I guess the specific error is different, but I think 1.13 has brought many changes that we'll need to test more thoroughly, so in general I would hold off on 1.13 for now unless you're feeling adventurous.


polm commented Nov 15, 2022

The fix for this is in the main branch at spacy-experimental, but it'll be a little while til a release yet, so for the moment I'd still recommend using Torch 1.12.
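
For readers who hit the same message in their own code, the sketch below shows the general kind of change this error calls for: putting the indexed tensor and the index tensor on the same device before indexing. It is not the actual patch in spacy-experimental, which is not shown in this thread. The function name `pairwise_distance` and the sample values are made up for illustration; only the indexing expression is taken from the traceback above.

```python
import torch


def pairwise_distance(word_ids: torch.Tensor, top_indices: torch.Tensor) -> torch.Tensor:
    """Distance from each word to its candidate antecedents, floored at 1."""
    # Assumption: aligning devices is all that is needed. The tensor being
    # indexed is moved to the device of the index tensor before indexing.
    word_ids = word_ids.to(top_indices.device)
    # Same expression as in the traceback (pytorch_coref_model.py, line 269).
    return (word_ids.unsqueeze(1) - word_ids[top_indices]).clamp_min_(min=1)


# Illustrative data: 4 words, 2 candidate antecedents per word.
device = "cuda" if torch.cuda.is_available() else "cpu"
word_ids = torch.arange(4)  # created on the CPU by default
top_indices = torch.tensor([[0, 0], [0, 1], [1, 0], [2, 1]], device=device)

distance = pairwise_distance(word_ids, top_indices)
```

Creating `word_ids` directly on the target device (for example with the `device=` argument of `torch.arange`) would avoid the extra copy. Which of the two the real fix uses is not stated here.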

@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022