Optionally detect similar features during PK-less reimport. #336

olsen232 · 2020-12-16T22:26:57Z

Description

Detects features that have changed slightly during a re-import from a data source without a primary key, and reimports them with the same primary key as last time, so that they show up as edits rather than inserts.
Features are currently considered similar if they differ by only one field. This could be changed / configurable in the future if better strategies are discovered.
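As a rough illustration of the "differ by only one field" rule, here is a minimal sketch. It is not the code in sno/pk_generation.py: the function names and the dict-based feature representation are hypothetical, and it assumes that identical features have already been matched elsewhere, so "similar" means exactly one differing field.

```python
# Hypothetical sketch: features are plain dicts of non-PK field values.
_MISSING = object()  # sentinel so a missing field differs from an explicit None


def count_differing_fields(old_feature, new_feature):
    """Count the fields whose values differ between two features."""
    keys = old_feature.keys() | new_feature.keys()
    return sum(
        1
        for key in keys
        if old_feature.get(key, _MISSING) != new_feature.get(key, _MISSING)
    )


def is_similar(old_feature, new_feature):
    """Assumption: identical features are matched elsewhere, so two features
    are 'similar' only when exactly one field differs."""
    return count_differing_fields(old_feature, new_feature) == 1
```

For example, a road feature whose lane count changes from 2 to 3 while its name and geometry stay the same would count as similar under this sketch; one whose name also changed would not.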

Usage:
sno import SOURCE --replace-existing --similarity-detection-limit=X

where SOURCE is a data source without primary keys,
and X is a limit that should generally be larger than the number of edits (or inserts + deletes), but not so large that detection takes forever. For example, setting it to 1,000 when there are 1,000 edits will result in 1,000,000 feature-to-feature comparisons.

In the case where there are too many inserts + deletes to check, none of them will be checked, and they will all remain inserts and deletes - the new features will be assigned new primary keys and treated as separate from the old features.
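The limit behaviour described above can be sketched as follows. This is an illustration only, not the merged implementation: `find_similar_pairs` and its arguments are hypothetical names, and the exact way the limit is compared against the insert and delete counts is an assumption (here, detection is skipped if either count exceeds the limit, which bounds the work at limit × limit comparisons).

```python
# Hypothetical sketch of limit-guarded similarity matching.
_MISSING = object()  # sentinel so a missing field differs from an explicit None


def differing_fields(old_feature, new_feature):
    """Count the fields whose values differ between two feature dicts."""
    keys = old_feature.keys() | new_feature.keys()
    return sum(
        1
        for key in keys
        if old_feature.get(key, _MISSING) != new_feature.get(key, _MISSING)
    )


def find_similar_pairs(deleted, inserted, limit):
    """deleted: dict of old primary key -> feature dict.
    inserted: list of new feature dicts that have no old primary key.
    Returns a dict mapping index in `inserted` -> old primary key to reuse.
    """
    # Assumption: a limit of zero disables detection, and exceeding the limit
    # on either side means nothing is checked at all.
    if limit <= 0 or len(deleted) > limit or len(inserted) > limit:
        return {}

    available = dict(deleted)  # old features not yet claimed by a new feature
    reused = {}
    for index, new_feature in enumerate(inserted):
        # Worst case: len(deleted) * len(inserted) comparisons overall.
        match = next(
            (
                pk
                for pk, old_feature in available.items()
                if differing_fields(old_feature, new_feature) == 1
            ),
            None,
        )
        if match is not None:
            reused[index] = match
            del available[match]
    return reused
```

Any inserted feature that does not appear in the returned mapping would keep its newly assigned primary key and remain a plain insert.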

--similarity-detection-limit is currently set to zero by default - zero means "don't do similarity detection". Our mirroring pipeline may want to set it to some higher number. A non-zero but low default might also benefit end-users, once we have end-users doing reimports of PK-less datasets (we don't yet).

Checklist:

Have you reviewed your own change?
Have you included test(s)?
Have you updated the changelog?

sno/init.py

sno/pk_generation.py

craigds

nice work! 🎉

olsen232 requested review from craigds and rcoup December 16, 2020 22:27

olsen232 force-pushed the similarity branch from 9674c10 to c4926d4 Compare December 16, 2020 22:28

craigds requested changes Dec 17, 2020

rcoup reviewed Dec 18, 2020

Optionally detect similar features during PK-less reimport.

f1e5bfc

olsen232 force-pushed the similarity branch from c4926d4 to f1e5bfc Compare December 18, 2020 01:13

olsen232 requested review from rcoup and craigds December 18, 2020 01:14

craigds approved these changes Dec 18, 2020

olsen232 merged commit dbfdd06 into master Dec 18, 2020

olsen232 deleted the similarity branch December 18, 2020 02:40
