Optionally detect similar features during PK-less reimport. #336

Merged (1 commit), Dec 18, 2020

@olsen232 olsen232 commented Dec 16, 2020

Description

Detects features that have changed slightly during a re-import from a data source without a primary key, and reimports them with the same primary key as last time so that they show up as edits rather than inserts.
Features are currently considered similar if they differ in only one field. This could be changed or made configurable in the future if better strategies are discovered.
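
For illustration only, the "differ in only one field" rule could be sketched roughly like this. This is not the code in sno/pk_generation.py: the helper names and the representation of a feature as a plain dict of field name to value are assumptions made for this example.

```python
def count_differing_fields(old_feature, new_feature):
    """Count the fields whose values differ between two features.

    Both features are assumed (for this sketch) to be dicts sharing one schema.
    """
    return sum(
        1 for field in old_feature if old_feature[field] != new_feature.get(field)
    )


def is_similar(old_feature, new_feature):
    """Hypothetical predicate: similar means exactly one field differs."""
    if old_feature.keys() != new_feature.keys():
        return False
    return count_differing_fields(old_feature, new_feature) == 1


old = {"name": "Main St", "lanes": 2, "geom": "LINESTRING(0 0, 1 1)"}
new = {"name": "Main Street", "lanes": 2, "geom": "LINESTRING(0 0, 1 1)"}
print(is_similar(old, new))
```

Identical features are deliberately not counted as similar here, on the assumption that they are already matched up before similarity detection runs.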

Usage:
sno import SOURCE --replace-existing --similarity-detection-limit=X

where SOURCE is a data source without primary keys,
and X is a number that should generally be larger than the expected number of edits (or inserts + deletes), but not so large that detection takes forever. (Setting it to a thousand when there are a thousand edits results in a million feature-to-feature comparisons.)

If there are too many inserts + deletes to check, none of them are checked, and they all remain inserts and deletes: the new features are assigned new primary keys and treated as separate from the old features.
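
A minimal sketch of how the limit and the all-or-nothing fallback might fit together is shown below. It is not taken from this PR's diff. The function names, the dict-based feature representation, and the exact way the limit is compared against the insert and delete counts are all assumptions. The comparison is chosen so that a limit of 1000 with 1000 edits still runs, giving the million comparisons mentioned above.

```python
def count_differing_fields(old_feature, new_feature):
    """Count fields whose values differ (features are assumed to be dicts)."""
    return sum(
        1 for field in old_feature if old_feature[field] != new_feature.get(field)
    )


def match_similar_features(deleted, inserted, limit):
    """Pair inserted features with deleted ones so their old PKs can be reused.

    deleted:  dict of old primary key -> old feature
    inserted: list of new features that do not yet have a primary key
    limit:    hypothetical stand-in for --similarity-detection-limit
    Returns a dict of index into `inserted` -> reused primary key.
    """
    # Zero means "don't do similarity detection"; too many candidates means
    # nothing is checked and everything stays an insert or a delete.
    if limit <= 0 or max(len(deleted), len(inserted)) > limit:
        return {}

    available = dict(deleted)
    reused = {}
    for index, new_feature in enumerate(inserted):
        # Worst case this inner search makes len(deleted) comparisons per insert.
        match = next(
            (
                pk
                for pk, old_feature in available.items()
                if count_differing_fields(old_feature, new_feature) == 1
            ),
            None,
        )
        if match is not None:
            reused[index] = match
            del available[match]  # each old PK is reused at most once
    return reused


deleted = {1: {"name": "a", "height": 1}, 2: {"name": "b", "height": 2}}
inserted = [{"name": "b", "height": 3}, {"name": "z", "height": 9}]
print(match_similar_features(deleted, inserted, limit=10))
```

In this sketch the first inserted feature takes over primary key 2 and would show up as an edit, while the second has no similar counterpart and would be assigned a new primary key.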

--similarity-detection-limit is currently set to zero by default - zero means "don't do similarity detection". Our mirroring pipeline may want to set it to some higher number. A non-zero but low default might also benefit end-users, once we have end-users doing reimports of PK-less datasets (we don't yet).

Related links:

#212

Checklist:

  • Have you reviewed your own change?
  • Have you included test(s)?
  • Have you updated the changelog?

@craigds craigds left a comment

nice work! 🎉

@olsen232 olsen232 merged commit dbfdd06 into master Dec 18, 2020
@olsen232 olsen232 deleted the similarity branch December 18, 2020 02:40