Optionally detect similar features during PK-less reimport. #336
Description
Detects features which have changed slightly during a re-import from a data source without a primary key, and reimports them with the same primary key as last time, so that they show as edits rather than inserts.
Features are currently considered similar if they differ by only one field. This could be changed or made configurable in the future if better strategies are discovered.
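As a rough illustration of the "differ by only one field" rule, here is a minimal sketch. It assumes each feature is available as a plain dict of field name to value with the primary key already stripped, and it reads "only one field" as exactly one differing value; the function name `differs_by_one_field` is hypothetical and not taken from this PR.

```python
def differs_by_one_field(old_feature, new_feature):
    """Return True if both features have the same fields and exactly one value differs.

    Assumption: features are dicts of field name -> value, without the primary key.
    """
    if old_feature.keys() != new_feature.keys():
        # Different schemas are never treated as similar in this sketch.
        return False
    differing = sum(1 for key in old_feature if old_feature[key] != new_feature[key])
    return differing == 1
```

For example, `{"name": "Oak", "height": 10}` and `{"name": "Oak", "height": 12}` would count as similar, while two features that differ in both `name` and `height` would not.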
Usage:
sno import SOURCE --replace-existing --similarity-detection-limit=X
where SOURCE is a data source without primary keys,
and X is a limit that should generally be larger than the expected number of edits (or inserts + deletes), but not so large that detection takes forever. (Setting it to a thousand when there are a thousand edits will result in a million feature-to-feature comparisons.)
If there are too many inserts + deletes to check, none of them will be checked, and they will all remain inserts and deletes: the new features will be assigned new primary keys and treated as separate from the old features.
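The limit behaviour described above could be sketched roughly as follows. This is not the code from this PR: the names `count_differing_fields` and `reuse_primary_keys` are hypothetical, features are assumed to be plain dicts without their primary key, and the sketch assumes the limit is applied to the number of deletes and the number of inserts separately, which the description does not spell out.

```python
def count_differing_fields(a, b):
    """Count fields whose values differ; a field missing on one side is read as None."""
    return sum(1 for key in a.keys() | b.keys() if a.get(key) != b.get(key))


def reuse_primary_keys(deleted, inserted, limit):
    """Map the index of each inserted feature to the old primary key it should reuse.

    deleted:  dict of old primary key -> feature that disappeared in the reimport
    inserted: list of features that appeared in the reimport
    limit:    0 disables detection; otherwise nothing is checked if either side exceeds it
    """
    if limit <= 0 or len(deleted) > limit or len(inserted) > limit:
        # Too many candidates (or detection disabled): everything stays insert/delete.
        return {}

    unmatched = dict(deleted)
    reused = {}
    for index, new_feature in enumerate(inserted):
        # Worst case this is len(deleted) * len(inserted) comparisons.
        match = next(
            (pk for pk, old in unmatched.items()
             if count_differing_fields(old, new_feature) == 1),
            None,
        )
        if match is not None:
            reused[index] = match
            del unmatched[match]  # each old feature is matched at most once
    return reused
```

Inserted features that are not in the returned mapping would be assigned new primary keys, as described above. Matching each old feature at most once is a design choice of this sketch, made so that two new features cannot claim the same primary key.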
--similarity-detection-limit is currently set to zero by default - zero means "don't do similarity detection". Our mirroring pipeline may want to set it to some higher number. A non-zero but low default might also benefit end-users, once we have end-users doing reimports of PK-less datasets (we don't yet).
Related links:
#212
Checklist: