Active Learning sampling quality #983

tonca · 2022-03-18T10:07:38Z

After using the dedupe library for a while in the context of video content reconciliation, we encountered some situations where the Active Learning sampling is very poor. This makes it difficult to built a good training set for the classifier and by consequence the reconciliation results are poor as well.

For instance, we made some tests and tried to reconcile 2 well known public data provider (iMDB and TMDB), which contain reciprocal references to be used as ground truth and good metadata, plus we could build the dataset knowing that all entries could be reconciled in a many-to-one approach (set 1 is contained in set 2 -> 100% recall is theoretically possible).

We tried to reconcile episodes and used few fields in the process (episode title, series title, season number, episode number, series year). The Active Learning sampling were quite balanced between positive and negative examples, therefore it has been quite effortless to collect 10 samples of positive and negative pairs. The final results were quite satisfying as well: recall 78%, precision 98%. Moreover, by scrolling through the results, we noticed that the model learned to ignore the episode title field, which were not consistent between datasets.

Afterwards we decided to perform a second test by removing the episode title field, but keeping everything else as in the previous test (same dataset, same configurations). This time the Active Learning sample were quite poor: almost all pairs were wrong (it took more than 200 pairs to obtain 8 positive). The final reconciliation in this case were also poor: recall 15% and precision 91%.

I would ask then if it is possible to mitigate this kind of issues:

Is it important to balance the active learning pairs? in the second test we fed 200 negative vs 8 positive pairs. Can this be the cause of a low recall?
How is it the model explainable? How would you suggest to investigate bad reconciliation results in general?
Do you have any idea of which are the possible causes of bad sampling in this specific test case?

Thank you for your great work,

Antonio

The text was updated successfully, but these errors were encountered:

fgregg · 2022-03-18T13:12:04Z

can you try the better sampling branch and let me know if that is giving you better results?

tonca · 2022-03-21T17:41:00Z

Hi again,
I tried to run it but I am getting an error. I am not sure how to fix it.

'RecordLinkBlockLearner' object has no attribute 'candidates'
  File "/home/ubuntu/dedupe/dedupe/labeler.py", line 218, in candidate_scores
    labels = self.predict(self.candidates)
  File "/home/ubuntu/dedupe/dedupe/labeler.py", line 374, in pop
    probabilities = learner.candidate_scores()
  File "/home/ubuntu/dedupe/dedupe/api.py", line 1143, in uncertain_pairs
    return [self.active_learner.pop()]
  File "/home/ubuntu/dedupe/dedupe/convenience.py", line 44, in console_label
    uncertain_pairs = deduper.uncertain_pairs()
  File "/home/ubuntu/kf-reconciliation-on-premises/bin/dedupe_reconciliation.py", line 64, in reconcile
    dedupe.console_label(linker)
  File "/home/ubuntu/kf-reconciliation-on-premises/bin/dedupe_bootstrapping.py", line 58, in main
    reconciled_df = dedupe_reconciliation.reconcile(
  File "/home/ubuntu/kf-reconciliation-on-premises/bin/dedupe_bootstrapping.py", line 80, in <module>
    main()

fgregg · 2022-03-21T17:50:09Z

Looks like you are doing record linkage and not deduping, i haven't updated the code for that code path yet.

tonca · 2022-03-21T18:27:58Z

Yes, exactly. I'll wait then, thank you.

fgregg · 2022-03-29T04:08:39Z

@tonca, the better_sampling branch has been updated for record link #982

fgregg · 2022-04-12T00:42:59Z

closing for now due to lack of feedback.

fgregg closed this as completed Apr 12, 2022

github-actions bot locked as resolved and limited conversation to collaborators Apr 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Active Learning sampling quality #983

Active Learning sampling quality #983

tonca commented Mar 18, 2022

fgregg commented Mar 18, 2022

tonca commented Mar 21, 2022

fgregg commented Mar 21, 2022

tonca commented Mar 21, 2022

fgregg commented Mar 29, 2022

fgregg commented Apr 12, 2022

Active Learning sampling quality #983

Active Learning sampling quality #983

Comments

tonca commented Mar 18, 2022

fgregg commented Mar 18, 2022

tonca commented Mar 21, 2022

fgregg commented Mar 21, 2022

tonca commented Mar 21, 2022

fgregg commented Mar 29, 2022

fgregg commented Apr 12, 2022