
Update an existing model, rather than learning a new one from scratch each time? #672

Closed
DustinReagan opened this issue May 31, 2018 · 7 comments

Comments

DustinReagan commented May 31, 2018

Thanks for this project, it's been very useful to me! However, I have a small question/issue:

Say I have a trained model and periodically get new training data that I'd like to use to update my model.

From what I can tell, it's impossible to load an existing settings file (which I believe contains the previously learned predicates?), add some new marked pairs, then train the model. Instead, it seems I have to:

  1. Re-load & resample my data.
  2. Load up my existing training file.
  3. Call 'markPairs' with my new training data.
  4. Re-write my training file.
  5. Call 'train'.
  6. Re-write my settings file.

It would be nice if I could skip step 1, since it seems to take the longest, and in theory simply loading my existing settings file should get me to that point.

What I'm doing now (pseudocode):

data_d = readData()                  # load and pre-process every record
deduper = dedupe.Dedupe(fieldDefinitions)
deduper.sample(data_d)               # slow: draws a fresh sample of record pairs
deduper.readTraining(trainingFile)   # previously labeled pairs
deduper.markPairs(newTrainingPairs)  # add the new labeled examples
deduper.writeTraining(trainingFile)
deduper.train()
deduper.writeSettings(settingsFile)

What I'd like to be able to do:

deduper = dedupe.Dedupe(fieldDefinitions)
deduper.readSettings(settingsFile)
deduper.readTraining(trainingFile)
deduper.markPairs(newTrainingPairs)
deduper.writeTraining(trainingFile)
deduper.train()
deduper.writeSettings(settingsFile)

Am I missing something?

Edit: thinking about it a bit more, I guess the model would need to have samples loaded up to re-train on the new data anyhow...so there's no way to skip the data load/sample step?

Thanks,
Dustin

@jpatel531

+1

@coommark

+1

jpipas commented Aug 30, 2018

This is my first time using dedupe - trying to dedupe 1.5 million records (mostly names/addresses).

However, isn't this what StaticDedupe and the other Static* classes are for - reloading a saved settings file? I've trained a model, saved the settings and training files, and then loaded the settings file via the StaticDedupe class the next time around. The settings file does appear to be architecture-specific, though: it seems you can't transfer settings files between machines.
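For the reload-and-score path described above, a minimal sketch (assuming the dedupe 1.x API; `settings_file` and `data_d` are hypothetical placeholder names):

```python
def dedupe_with_saved_settings(settings_file, data_d, threshold=0.5):
    """Cluster records using a previously saved settings file, with no retraining."""
    import dedupe  # deferred so the sketch is readable without the library installed
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)  # no sampling or training step
    # Returns clusters of duplicate records above the given score threshold.
    return deduper.match(data_d, threshold)
```

Note this only covers scoring: StaticDedupe can match new data against the saved model, but (as discussed below) it cannot be trained further.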

@adriennefranke

StaticDedupe and the other Static* classes do not have a 'train' method. So if you want to do any additional training, active labeling, etc., you have to reload and resample the data into Dedupe/RecordLink/Gazetteer, write the settings and labeled examples out, and then reload them with a Static* class later. See #679. This is a little annoying, though. Does anyone know a way to just add labeled examples to the model without having to reload/resample all the data?
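The full retrain cycle described in this thread can be sketched as one function (assuming the dedupe 1.x API; all argument names are hypothetical placeholders):

```python
def retrain_with_new_labels(data_d, field_definitions, new_pairs,
                            training_file, settings_file):
    """Resample, merge old and new labeled pairs, retrain, and save the model."""
    import dedupe  # deferred so the sketch is readable without the library installed
    deduper = dedupe.Dedupe(field_definitions)
    deduper.sample(data_d)            # the slow, so-far-unavoidable step
    with open(training_file) as f:
        deduper.readTraining(f)       # previously labeled pairs
    deduper.markPairs(new_pairs)      # e.g. {'match': [...], 'distinct': [...]}
    with open(training_file, 'w') as f:
        deduper.writeTraining(f)      # persist the merged labels
    deduper.train()
    with open(settings_file, 'wb') as f:
        deduper.writeSettings(f)      # reload later via StaticDedupe
```

The resample step is still required because the learned blocking rules are fit against a sample of the data, not stored in a form that can be incrementally extended.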

@Yashad10

+1

@suyashdb

@adriennefranke @DustinReagan Were you able to add labeled examples to the model without having to reload all the data?

If not, how did you handle the case where new data keeps arriving in batches and needs to be merged into the master deduped data?

ieriii commented Apr 20, 2020

Hi,
I've just opened a pull request to the pandas-dedupe library, a wrapper around dedupe. The changes allow users to update an existing model rather than training a new one from scratch each time. See pandas-dedupe/pull/26

Note: the pull request has now been merged to master.
Updating the model is the default in pandas_dedupe.
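A minimal sketch of that usage (assuming pandas-dedupe's `dedupe_dataframe` entry point and its `update_model` parameter; exact signature may differ by version):

```python
def update_dedupe_model(df, fields):
    """Retrain on top of saved settings/training files instead of from scratch."""
    import pandas_dedupe  # deferred so the sketch is readable without the library
    # With update_model=True, pandas-dedupe reuses the settings and training
    # files it previously wrote and opens a labeling session for the new data.
    return pandas_dedupe.dedupe_dataframe(df, fields, update_model=True)
```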

@fgregg fgregg closed this as completed Jan 19, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2022