Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using Existing training.json throws an error #815

Closed
mzagaja opened this issue Apr 29, 2020 · 0 comments
Closed

Using Existing training.json throws an error #815

mzagaja opened this issue Apr 29, 2020 · 0 comments

Comments

@mzagaja
Copy link

mzagaja commented Apr 29, 2020

When trying to use an existing training.json file on a dataset instead of getting output I have errors thrown:

csvdedupe --config_file=processors/csvdedupe-config.json --training_file=training.json --settings_file=processors/learned_settings data/finished/arts-and-cultural-assets-massachusetts-clustered.csv > test2.csv
INFO:root:imported 2673 rows
INFO:root:using fields: ['Name', 'Municipality']
INFO:root:taking a sample of 1500 possible pairs
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (sortedAcronym, Municipality), SimplePredicate: (wholeFieldPredicate, Name))
INFO:root:reading labeled examples from training.json
INFO:dedupe.api:reading training from file
Traceback (most recent call last):
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/predicates.py", line 168, in __call__
    doc_id = self.index._doc_to_id[doc]
AttributeError: 'NoneType' object has no attribute '_doc_to_id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 650, in readTraining
    self.markPairs(training_pairs)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 730, in markPairs
    self.active_learner.mark(examples, y)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/labeler.py", line 359, in mark
    learner.fit_transform(self.pairs, self.y)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/labeler.py", line 195, in fit_transform
    recall=1.0)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 26, in learn
    dupe_cover = Cover(self.blocker.predicates, matches)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 379, in __init__
    self._cover(predicates, pairs)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 387, in _cover
    in enumerate(pairs)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/training.py", line 389, in <setcomp>
    set(predicate(record_2, target=True)))}
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/predicates.py", line 170, in __call__
    raise AttributeError("Attempting to block with an index "
AttributeError: Attempting to block with an index predicate without indexing records

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/bin/csvdedupe", line 8, in <module>
    sys.exit(launch_new_instance())
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvdedupe.py", line 180, in launch_new_instance
    d.main()
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvdedupe.py", line 110, in main
    self.dedupe_training(deduper)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/csvdedupe/csvhelpers.py", line 257, in dedupe_training
    deduper.readTraining(tf)
  File "/Users/mzagaja/.virtualenvs/dedupe-examples/lib/python3.7/site-packages/dedupe/api.py", line 653, in readTraining
    raise UserWarning('Training data has records not known '
UserWarning: Training data has records not known to the active learner. Read training in before initializing the active learner with the sample method, or use the prepare_training method.

Allegedly resolved in #761 on the dedupe side, but still manifesting here.

@mzagaja mzagaja closed this as completed Apr 29, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant