Using existing training.json throws error #99

Open
mzagaja opened this issue Apr 29, 2020 · 3 comments


mzagaja commented Apr 29, 2020

When I try to use an existing training.json file on a dataset, I get errors instead of output:

csvdedupe --config_file=processors/csvdedupe-config.json --training_file=training.json --settings_file=processors/learned_settings data/finished/arts-and-cultural-assets-massachusetts-clustered.csv > test2.csv
INFO:root:imported 2673 rows
INFO:root:using fields: ['Name', 'Municipality']
INFO:root:taking a sample of 1500 possible pairs
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (sortedAcronym, Municipality), SimplePredicate: (wholeFieldPredicate, Name))
INFO:root:reading labeled examples from training.json
INFO:dedupe.api:reading training from file
Traceback (most recent call last):
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/predicates.py", line 168, in __call__
    doc_id = self.index._doc_to_id[doc]
AttributeError: 'NoneType' object has no attribute '_doc_to_id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 650, in readTraining
    self.markPairs(training_pairs)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 730, in markPairs
    self.active_learner.mark(examples, y)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/labeler.py", line 359, in mark
    learner.fit_transform(self.pairs, self.y)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/labeler.py", line 195, in fit_transform
    recall=1.0)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 26, in learn
    dupe_cover = Cover(self.blocker.predicates, matches)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 379, in __init__
    self._cover(predicates, pairs)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 387, in _cover
    in enumerate(pairs)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 389, in <setcomp>
    set(predicate(record_2, target=True)))}
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/predicates.py", line 170, in __call__
    raise AttributeError("Attempting to block with an index "
AttributeError: Attempting to block with an index predicate without indexing records

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/bin/csvdedupe", line 8, in <module>
    sys.exit(launch_new_instance())
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvdedupe.py", line 180, in launch_new_instance
    d.main()
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvdedupe.py", line 110, in main
    self.dedupe_training(deduper)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvhelpers.py", line 257, in dedupe_training
    deduper.readTraining(tf)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 653, in readTraining
    raise UserWarning('Training data has records not known '
UserWarning: Training data has records not known to the active learner. Read training in before initializing the active learner with the sample method, or use the prepare_training method.

This was reportedly resolved in dedupeio/dedupe#761 on the dedupe side, but it still occurs here.
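
The log above shows the sample being taken before the labeled examples are read, and the final UserWarning asks for the opposite order or for prepare_training. A minimal sketch of that suggestion follows. It assumes that `prepare_training` accepts a `training_file` keyword (taken from the dedupe docs, not checked against 1.10.0); `train_from_existing` is a hypothetical helper, and `deduper` and `data` stand in for the objects csvdedupe builds.

```python
import os


def train_from_existing(deduper, data, training_path):
    """Hypothetical helper: load saved labels while preparing training.

    Assumes `deduper` exposes prepare_training(data, training_file=...).
    """
    if os.path.exists(training_path):
        # Hand the open file to prepare_training so the labeled pairs are
        # read as part of preparation instead of after sampling.
        with open(training_path) as tf:
            deduper.prepare_training(data, training_file=tf)
    else:
        # No saved labels yet: prepare without a training file.
        deduper.prepare_training(data)
```

Whether csvdedupe can adopt this directly depends on how it wires sampling and labeling together, which this sketch does not cover.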


ghost commented Sep 21, 2020

csvdedupe requires dedupe>=1.6,<2, which currently resolves to 1.10.0.
That version was released on 9 Jan 2020.
dedupeio/dedupe#761 was merged on 10 Aug 2019, so in theory the installed version should already include that fix.

Perhaps this is a separate issue?
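
For reference, the pin dedupe>=1.6,<2 admits 1.10.0 because version components compare as integers (10 > 6), not as decimals. A small stdlib-only sketch of that comparison; `parse` and `satisfies_pin` are hypothetical helpers that handle plain dotted numeric versions only, not the full PEP 440 grammar.

```python
def parse(version):
    """Turn '1.10.0' into (1, 10, 0); plain dotted integers only."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pin(version, lower="1.6", upper="2"):
    """Check lower <= version < upper, mirroring 'dedupe>=1.6,<2'."""
    return parse(lower) <= parse(version) < parse(upper)


print(satisfies_pin("1.10.0"))  # True
print(satisfies_pin("2.0.0"))   # False
```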


chrismp commented Oct 11, 2021

Hello, I also recently ran csvdedupe for the first time. After I finished, a training.json file was created. When I tried running csvdedupe again, I got the same error as @mzagaja. I have dedupe v1.10.0 installed.


regel commented Jan 21, 2023

Replacing the readTraining function in dedupe/api.py with the following code fixes the issue. I will try to submit a patch to the maintainers.

    def readTraining(self, training_file):
        '''
        Read training from previously built training data file object

        Arguments:

        training_file -- file object containing the training data
        '''
        logger.info('reading training from file')
        self.training_pairs = json.load(training_file,
                                        cls=serializer.dedupe_decoder)
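
Before re-running, it may help to confirm that the existing training.json actually contains labeled pairs. A minimal stdlib sketch, assuming the file follows dedupe's layout of a JSON object with `match` and `distinct` lists of record pairs; `count_labeled_pairs` is a hypothetical helper, and plain `json.load` is used instead of dedupe's decoder, which should be enough for counting.

```python
import json


def count_labeled_pairs(path):
    """Return how many pairs are stored under each label in a training file."""
    with open(path) as f:
        training = json.load(f)
    # Missing keys are treated as empty lists rather than errors.
    return {label: len(training.get(label, []))
            for label in ("match", "distinct")}
```

Calling `count_labeled_pairs("training.json")` returns a dict with one count per label.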
