OverflowError: Invalid Nan value when encoding double #11757

IllyShaieb · 2022-11-07T01:16:54Z

How to reproduce the behaviour

An extract of my Python code:

    ...

    def train(self) -> None:
        """
        Train the model on the given dataset for the given number of epochs.
        """
        if self.config.rebuild_data:
            self._prepare_data()

        if self.config.rebuild_config:
            os.system(
                "poetry run python -m spacy init fill-config .\\data\\intents\\base_config.cfg .\\data\\intents\\config.cfg"
            )

        os.system(
            "poetry run python -m spacy train .\\data\\intents\\config.cfg --output .\\models\\intents"
        )

    def _make_spacy_docs(
        self, data: list[tuple[str, str]], labels_to_idx: dict[str, int]
    ) -> list:
        """
        Helper function that takes a list of texts and labels and
        creates a list of spaCy docs.

        returns: A list of spaCy docs.
        """
        docs = []
        for doc, label in tqdm(self.nlp.pipe(data, as_tuples=True), total=len(data)):
            doc.cats[label] = labels_to_idx[label]
            docs.append(doc)

        return docs

    def _prepare_data(self) -> None:
        """
        Helper function to prepare and save the data for training and validation.
        """
        dataset = data.IntentClassifierDataset(
            Path(self.config.data_path), shuffle=True
        )
        train_data, test_data = dataset.split(self.config.train_percentage)

        labels_to_idx = {label: idx for idx, label in enumerate(dataset.intents)}

        train_docs = self._make_spacy_docs(train_data.values.tolist(), labels_to_idx)
        test_docs = self._make_spacy_docs(test_data.values.tolist(), labels_to_idx)

        train_bin = DocBin(docs=train_docs)
        test_bin = DocBin(docs=test_docs)

        train_bin.to_disk(self.config.train_data_save_path)
        test_bin.to_disk(self.config.valid_data_save_path)

    ...

After creating the data and running the spaCy train loop I get an overflow error:

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 59/59 [00:00<00:00, 629.33it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 556.84it/s]
✔ Auto-filled config with all values
✔ Saved config
data\intents\config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
ℹ Saving to output directory: models\intents
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-11-07 00:54:24,127] [INFO] Set up nlp object from config
[2022-11-07 00:54:24,149] [INFO] Pipeline: ['textcat']
[2022-11-07 00:54:24,155] [INFO] Created vocabulary
[2022-11-07 00:54:24,157] [INFO] Finished initializing nlp object
[2022-11-07 00:54:24,204] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE
---  ------  ------------  ----------  ------
⚠ Aborting and saving the final best model. Encountered exception:
OverflowError('Invalid Nan value when encoding double')
Traceback (most recent call last):
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 122, in train
    raise e
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 110, in train
    save_checkpoint(is_best_checkpoint)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 67, in save_checkpoint
    before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\language.py", line 2060, in to_disk
    util.to_disk(path, serializers, exclude)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\util.py", line 1339, in to_disk
    writer(path / key)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\language.py", line 2051, in <lambda>
    serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\srsly\_json_api.py", line 75, in write_json
    json_data = json_dumps(data, indent=indent)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\srsly\_json_api.py", line 26, in json_dumps
    result = ujson.dumps(data, indent=indent, escape_forward_slashes=False)
OverflowError: Invalid Nan value when encoding double

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\<USER>\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\<USER>\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\__main__.py", line 4, in <module>
    setup_cli()
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\cli\_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\typer\main.py", line 532, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\cli\train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\cli\train.py", line 75, in train
    train_nlp(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 126, in train
    save_checkpoint(False)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\training\loop.py", line 67, in save_checkpoint
    before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\language.py", line 2060, in to_disk
    util.to_disk(path, serializers, exclude)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\util.py", line 1339, in to_disk
    writer(path / key)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\spacy\language.py", line 2051, in <lambda>
    serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\srsly\_json_api.py", line 75, in write_json
    json_data = json_dumps(data, indent=indent)
  File "C:\Users\<USER>\AppData\Local\pypoetry\Cache\virtualenvs\ace-_xZX6yZF-py3.10\lib\site-packages\srsly\_json_api.py", line 26, in json_dumps
    result = ujson.dumps(data, indent=indent, escape_forward_slashes=False)
OverflowError: Invalid Nan value when encoding double
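
Both tracebacks end in `ujson.dumps` while `meta.json` is being written, which suggests that a NaN value (presumably a score) ended up in the pipeline metadata. As a rough analogue only, the sketch below uses Python's built-in `json` module in strict mode instead of `srsly`/`ujson`, and the `meta` dict is a hypothetical stand-in for the real metadata. It shows that a strict JSON encoder refuses to serialize NaN, although the stdlib raises `ValueError` rather than the `OverflowError` seen here.

```python
import json

# Hypothetical stand-in for pipeline metadata that contains a NaN score.
meta = {"performance": {"cats_score": float("nan")}}

try:
    # allow_nan=False makes the stdlib encoder strict about NaN/Infinity,
    # which are not valid JSON values.
    json.dumps(meta, allow_nan=False)
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```

With the default `allow_nan=True`, the stdlib encoder would instead emit the non-standard token `NaN`.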

Your Environment

Operating System: Windows 10
Python Version Used: 3.10
spaCy Version Used: 3.4.2
Environment Information: Using Poetry to run my Python code

polm · 2022-11-07T04:37:51Z

This looks like a duplicate of #10217. The issue seems to be that you are using the enumerated ID of each label as the value of doc.cats[label], which is not how it is intended to be used: the value should be 1 for true labels and 0 otherwise. This should work if you change your values to be assigned that way.
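
The convention described above can be sketched in plain Python as follows. The helper name `make_cats` and the intent names are hypothetical and only for illustration. The function builds a cats-style dictionary in which the true label maps to 1.0 and every other label maps to 0.0.

```python
def make_cats(true_label: str, all_labels: list[str]) -> dict[str, float]:
    """Return a cats-style dict: 1.0 for the true label, 0.0 for all others."""
    if true_label not in all_labels:
        raise ValueError(f"Unknown label: {true_label!r}")
    return {label: 1.0 if label == true_label else 0.0 for label in all_labels}


# Hypothetical intent names, for illustration only.
intents = ["greeting", "goodbye", "weather"]
cats = make_cats("goodbye", intents)
print(cats)  # {'greeting': 0.0, 'goodbye': 1.0, 'weather': 0.0}
```

In the `_make_spacy_docs` loop from the report, this would presumably replace the index assignment, for example `doc.cats = make_cats(label, list(labels_to_idx))`.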

Out of curiosity, how did you come up with the current code for assigning labels?

IllyShaieb · 2022-11-07T08:08:48Z

Brilliant, thank you. I was using the code in this Medium article building-a-text-classifier-with-spacy.

I was under the assumption that the label had to be unique. Maybe the documentation needs to be clearer?

polm · 2022-11-07T09:32:21Z

It's true that the label has to be unique - duplicate labels will be treated as the same thing - but the label value does not have to be unique, it just needs to be 0 or 1.

This is actually called out in the API docs for textcat, which were updated to be clearer as part of #9041. Is there any other place you were checking where we could make this more explicit?

I will look at adding a check for this during training.
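
A check of this kind could look roughly like the sketch below. It is not spaCy's actual implementation; the function name `find_invalid_cats` is hypothetical, and it only illustrates the rule stated in this thread that every value must be 0 or 1.

```python
def find_invalid_cats(cats: dict[str, float]) -> dict[str, float]:
    """Return the entries whose value is neither 0 nor 1."""
    return {label: value for label, value in cats.items() if value not in (0, 1)}


good = {"greeting": 0.0, "goodbye": 1.0}
# Label index used as the value, as in the code from the report.
bad = {"greeting": 0, "goodbye": 1, "weather": 2}

print(find_invalid_cats(good))  # {}
print(find_invalid_cats(bad))   # {'weather': 2}
```

Running such a check on every doc before building the `DocBin` would surface the problem early, instead of as an `OverflowError` at checkpoint time.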

IllyShaieb · 2022-11-07T10:23:53Z

Ah, I did not read the documentation properly. It seemed a bit overwhelming (which I think is partly my fault), which is why I went to other tutorials.

I think the error message is also not very clear. Maybe there should be a page in the spaCy docs, under the TextCategorizer or Training Pipelines & Models sections, with common issues and examples of how to fix them.

polm · 2022-11-07T10:30:52Z

Thanks for the suggestions, and the note that the documentation seems overwhelming.

The error message here is definitely unhelpful, and I've written a PR (#11763) to help with that part.

We do have an FAQ label in Discussions, and a top-level FAQ, but I'll see if there's some way we can make these more accessible from the main docs.

IllyShaieb · 2022-11-07T10:33:49Z

Thank you very much 😃

github-actions · 2022-11-15T00:06:16Z

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions · 2022-12-16T00:02:10Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

polm mentioned this issue on Nov 7, 2022 in #11763, "Check textcat values for validity" (merged).

