
Update an existing model, rather than learning a new one from scratch each time? #672

Closed
DustinReagan opened this issue May 31, 2018 · 7 comments

Comments

DustinReagan commented May 31, 2018

Thanks for this project, it's been very useful to me! However, I have a small question/issue:

Say I have a trained model and periodically get new training data that I'd like to use to update my model.

From what I can tell, it's impossible to load an existing settings file (which I believe contains the previously learned predicates?), add some new marked pairs, then train the model. Instead, it seems I have to:

  1. Re-load & resample my data.
  2. Load up my existing training file.
  3. Call 'markPairs' with my new training data.
  4. Re-write my training file.
  5. Call 'train'.
  6. Re-write my settings file.

It would be nice if I could skip step 1, since it seems to take the longest, and in theory simply loading my existing settings file should get me to that point.

What I'm doing now (pseudocode):

data_d = readData()                  # load and pre-process every record
deduper = dedupe.Dedupe(fieldDefinitions)
deduper.sample(data_d)               # slow: draws a fresh sample of record pairs
deduper.readTraining(trainingFile)   # previously labeled pairs
deduper.markPairs(newTrainingPairs)  # add the new labeled examples
deduper.writeTraining(trainingFile)
deduper.train()
deduper.writeSettings(settingsFile)

What I'd like to be able to do:

deduper = dedupe.Dedupe(fieldDefinitions)
deduper.readSettings(settingsFile)
deduper.readTraining(trainingFile)
deduper.markPairs(newTrainingPairs)
deduper.writeTraining(trainingFile)
deduper.train()
deduper.writeSettings(settingsFile)

Am I missing something?

Edit: thinking about it a bit more, I guess the model would need to have samples loaded up to re-train on the new data anyhow...so there's no way to skip the data load/sample step?

Thanks,
Dustin

@jpatel531

+1

@coommark

+1

jpipas commented Aug 30, 2018

This is my first time using dedupe - trying to dedupe 1.5 million records (mostly names/addresses).

However, isn't this what StaticDedupe and the other Static* classes are for - reloading a saved settings file? I've trained a model, saved the settings and training files, and then loaded the settings file via the StaticDedupe class the next time around. The settings file does appear to be architecture-specific, though: it seems you can't transfer settings files between machines.
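For the reload-and-score path described above, a minimal sketch (assuming the dedupe 1.x API; `settings_file` and `data_d` are hypothetical placeholder names):

```python
def dedupe_with_saved_settings(settings_file, data_d, threshold=0.5):
    """Cluster records using a previously saved settings file, with no retraining."""
    import dedupe  # deferred so the sketch is readable without the library installed
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)  # no sampling or training step
    # Returns clusters of duplicate records above the given score threshold.
    return deduper.match(data_d, threshold)
```

Note this only covers scoring: StaticDedupe can match new data against the saved model, but (as discussed below) it cannot be trained further.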

@adriennefranke

StaticDedupe and the other Static* classes do not have a 'train' method. So if you want to do any additional training, active labeling, etc., you have to reload and resample the data into Dedupe/RecordLink/Gazetteer, write the settings and labeled examples out, and then reload them with a Static* class later. See #679. This is a little annoying, though. Does anyone know a way to just add labeled examples to the model without having to reload/resample all the data?
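The full retrain cycle described in this thread can be sketched as one function (assuming the dedupe 1.x API; all argument names are hypothetical placeholders):

```python
def retrain_with_new_labels(data_d, field_definitions, new_pairs,
                            training_file, settings_file):
    """Resample, merge old and new labeled pairs, retrain, and save the model."""
    import dedupe  # deferred so the sketch is readable without the library installed
    deduper = dedupe.Dedupe(field_definitions)
    deduper.sample(data_d)            # the slow, so-far-unavoidable step
    with open(training_file) as f:
        deduper.readTraining(f)       # previously labeled pairs
    deduper.markPairs(new_pairs)      # e.g. {'match': [...], 'distinct': [...]}
    with open(training_file, 'w') as f:
        deduper.writeTraining(f)      # persist the merged labels
    deduper.train()
    with open(settings_file, 'wb') as f:
        deduper.writeSettings(f)      # reload later via StaticDedupe
```

The resample step is still required because the learned blocking rules are fit against a sample of the data, not stored in a form that can be incrementally extended.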

@Yashad10

+1

@suyashdb

@adriennefranke @DustinReagan Were you able to add labeled examples to the model without having to reload all the data?

If not, how did you handle the case where new data keeps arriving in batches and needs to be merged into the master deduped data?

ieriii commented Apr 20, 2020

Hi,
I've just opened a pull request to the pandas-dedupe library, a wrapper around dedupe. The changes allow users to update an existing model rather than training a new one from scratch each time. See pandas-dedupe/pull/26

Note: the pull request has now been merged to master.
Updating the model is the default in pandas_dedupe.
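A minimal sketch of that usage (assuming pandas-dedupe's `dedupe_dataframe` entry point and its `update_model` parameter; exact signature may differ by version):

```python
def update_dedupe_model(df, fields):
    """Retrain on top of saved settings/training files instead of from scratch."""
    import pandas_dedupe  # deferred so the sketch is readable without the library
    # With update_model=True, pandas-dedupe reuses the settings and training
    # files it previously wrote and opens a labeling session for the new data.
    return pandas_dedupe.dedupe_dataframe(df, fields, update_model=True)
```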

@fgregg fgregg closed this as completed Jan 19, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2022