OverflowError: Invalid Nan value when encoding double #11757

Closed
IllyShaieb opened this issue Nov 7, 2022 · 8 comments
Labels
feat / textcat (Feature: Text Classifier), training (Training and updating models)

Comments

@IllyShaieb

How to reproduce the behaviour

An extract of my Python code:

    ...

    def train(self) -> None:
        """
        Train the model on the given dataset for the given number of epochs.
        """
        if self.config.rebuild_data:
            self._prepare_data()

        if self.config.rebuild_config:
            os.system(
                "poetry run python -m spacy init fill-config .\\data\\intents\\base_config.cfg .\\data\\intents\\config.cfg"
            )

        os.system(
            "poetry run python -m spacy train .\\data\\intents\\config.cfg --output .\\models\\intents"
        )

    def _make_spacy_docs(
        self, data: list[tuple[str, str]], labels_to_idx: dict[str, int]
    ) -> list:
        """
        Helper function to take a list of texts and labels and
        create a list of spaCy docs.

        returns: A list of spaCy docs.
        """
        docs = []
        for doc, label in tqdm(self.nlp.pipe(data, as_tuples=True), total=len(data)):
            doc.cats[label] = labels_to_idx[label]
            docs.append(doc)

        return docs

    def _prepare_data(self) -> None:
        """
        Helper function to prepare and save the data for training and validation.
        """
        dataset = data.IntentClassifierDataset(
            Path(self.config.data_path), shuffle=True
        )
        train_data, test_data = dataset.split(self.config.train_percentage)

        labels_to_idx = {label: idx for idx, label in enumerate(dataset.intents)}

        train_docs = self._make_spacy_docs(train_data.values.tolist(), labels_to_idx)
        test_docs = self._make_spacy_docs(test_data.values.tolist(), labels_to_idx)

        train_bin = DocBin(docs=train_docs)
        test_bin = DocBin(docs=test_docs)

        train_bin.to_disk(self.config.train_data_save_path)
        test_bin.to_disk(self.config.valid_data_save_path)

    ...

After creating the data and running the spaCy train loop I get an overflow error:

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 59/59 [00:00<00:00, 629.33it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 556.84it/s]
✔ Auto-filled config with all values
✔ Saved config
data\intents\config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
ℹ Saving to output directory: models\intents
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-11-07 00:54:24,127] [INFO] Set up nlp object from config
[2022-11-07 00:54:24,149] [INFO] Pipeline: ['textcat']
[2022-11-07 00:54:24,155] [INFO] Created vocabulary
[2022-11-07 00:54:24,157] [INFO] Finished initializing nlp object
[2022-11-07 00:54:24,204] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE
---  ------  ------------  ----------  ------
⚠ Aborting and saving the final best model. Encountered exception:
OverflowError('Invalid Nan value when encoding double')
Traceback (most recent call last):
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 122, in train
    raise e
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 110, in train
    save_checkpoint(is_best_checkpoint)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 67, in save_checkpoint
    before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\language.py", line 2060, in to_disk
    util.to_disk(path, serializers, exclude)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\util.py", line 1339, in to_disk
    writer(path / key)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\language.py", line 2051, in <lambda>
    serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\srsly\_json_api.py", line 75, in write_json
    json_data = json_dumps(data, indent=indent)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\srsly\_json_api.py", line 26, in json_dumps
    result = ujson.dumps(data, indent=indent, escape_forward_slashes=False)
OverflowError: Invalid Nan value when encoding double

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\<USER>\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\<USER>\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\__main__.py", line 4, in <module>
    setup_cli()
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\cli\_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\typer\main.py", line 532, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\cli\train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\cli\train.py", line 75, in train
    train_nlp(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 126, in train
    save_checkpoint(False)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 67, in save_checkpoint
    before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\language.py", line 2060, in to_disk
    util.to_disk(path, serializers, exclude)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\util.py", line 1339, in to_disk
    writer(path / key)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\language.py", line 2051, in <lambda>
    serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\srsly\_json_api.py", line 75, in write_json
    json_data = json_dumps(data, indent=indent)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\srsly\_json_api.py", line 26, in json_dumps
    result = ujson.dumps(data, indent=indent, escape_forward_slashes=False)
OverflowError: Invalid Nan value when encoding double

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.10
  • spaCy Version Used: 3.4.2
  • Environment Information: Using poetry to run my python code
@polm
Contributor

polm commented Nov 7, 2022

This looks like a duplicate of #10217. The issue is that you are using the enumerated ID of each label as the value of doc.cats[label], which is not how it is intended to be used: the value should be 1 for true labels and 0 otherwise. This should work if you change your scores to be assigned that way.
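
A minimal sketch of that assignment, assuming one true label per text. The helper name `make_cats` and the intent names are hypothetical and not part of spaCy or the reporter's project; only the 1-for-true, 0-otherwise convention comes from the comment above.

```python
from typing import Dict, List


def make_cats(true_label: str, all_labels: List[str]) -> Dict[str, float]:
    """Build a doc.cats-style dict: 1.0 for the true label, 0.0 for the rest."""
    if true_label not in all_labels:
        raise ValueError(f"Unknown label: {true_label!r}")
    return {label: (1.0 if label == true_label else 0.0) for label in all_labels}


# Hypothetical intent names, for illustration only.
intents = ["greeting", "goodbye", "weather"]
cats = make_cats("goodbye", intents)
print(cats)  # {'greeting': 0.0, 'goodbye': 1.0, 'weather': 0.0}
```

In the loop from the report, this would stand in for the enumerated-index assignment, for example `doc.cats = make_cats(label, list(labels_to_idx))`.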

Out of curiosity, how did you come up with the current code for assigning labels?

@polm added the training, feat / textcat, and resolved labels on Nov 7, 2022
@IllyShaieb
Author

Brilliant, thank you. I was using the code in this Medium article building-a-text-classifier-with-spacy.

I was under the assumption that the label had to be unique. Maybe the documentation needs to be clearer?

@github-actions bot removed the resolved label on Nov 7, 2022
@polm
Contributor

polm commented Nov 7, 2022

It's true that the label has to be unique - duplicate labels will be treated as the same thing - but the label value does not have to be unique, it just needs to be 0 or 1.

This is actually called out in the API docs for textcat, which were updated to be clearer as part of #9041. Is there any other place you were checking where we could make this more explicit?

I will look at adding a check for this during training.
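
Until such a check exists, something similar could be run on your own data before it is written to disk. The sketch below is not the check spaCy added; `check_cats` is a hypothetical helper that simply enforces the "0 or 1" rule described above.

```python
from typing import Dict


def check_cats(cats: Dict[str, float]) -> None:
    """Raise ValueError if any category value is not exactly 0 or 1."""
    bad = {label: value for label, value in cats.items() if value not in (0, 1)}
    if bad:
        raise ValueError(f"doc.cats values must be 0 or 1, got: {bad}")


check_cats({"greeting": 1.0, "goodbye": 0.0})  # valid, passes silently

try:
    # Enumerated IDs used as values, as in the original report.
    check_cats({"greeting": 0, "goodbye": 1, "weather": 2})
except ValueError as err:
    print(err)  # doc.cats values must be 0 or 1, got: {'weather': 2}
```

Calling it on each `doc.cats` before building the `DocBin` would surface the problem at data preparation time rather than when the checkpoint is saved.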

@IllyShaieb
Author

Ah I did not read the documentation properly. I think it seemed a bit overwhelming (which I think is a little my fault) which is why I went to other tutorials.

I think the error message is also not very clear. I think there should be some sort of page in the spaCy docs, under TextCategorizer or Training Pipelines & Models, with common issues and examples of how to fix them.

@polm
Contributor

polm commented Nov 7, 2022

Thanks for the suggestions, and the note that the documentation seems overwhelming.

The error message here is definitely unhelpful, and I've written a PR (#11763) to help with that part.
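
For context on why the message is hard to act on: the traceback shows the failure while `self.meta` is being written to meta.json, so some float in that metadata is NaN, but the error does not say which one. A rough diagnostic sketch follows; `find_nan_paths` and the sample dict are hypothetical, and which field was actually NaN in this report is not stated in the thread.

```python
import math
from typing import Any, List


def find_nan_paths(obj: Any, path: str = "") -> List[str]:
    """Return the key paths of every float NaN inside nested dicts and lists."""
    if isinstance(obj, float) and math.isnan(obj):
        return [path or "<root>"]
    if isinstance(obj, dict):
        items = [(f"{path}.{key}" if path else str(key), value) for key, value in obj.items()]
    elif isinstance(obj, (list, tuple)):
        items = [(f"{path}[{i}]", value) for i, value in enumerate(obj)]
    else:
        return []
    found: List[str] = []
    for child_path, value in items:
        found.extend(find_nan_paths(value, child_path))
    return found


# Hypothetical metadata, assuming a score became NaN during training.
meta = {"name": "intents", "performance": {"cats_score": float("nan"), "textcat_loss": 0.5}}
print(find_nan_paths(meta))  # ['performance.cats_score']
```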

We do have an FAQ label in Discussions, and a top-level FAQ, but I'll see if there's some way we can make these more accessible from the main docs.

@IllyShaieb
Author

Thank you very much 😃

@polm added the resolved label on Nov 8, 2022
@github-actions
Contributor

This issue has been automatically closed because it was answered and there was no follow-up discussion.

@github-actions bot removed the resolved label on Nov 15, 2022
@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked the thread as resolved and limited conversation to collaborators on Dec 16, 2022