Merge pull request #119 from ludwiglierhammer/dups_again
do some duplicate check adjustments
ludwiglierhammer authored Oct 21, 2024
2 parents 4df6579 + 01ddb73 commit afd31ba
Showing 5 changed files with 722 additions and 219 deletions.
7 changes: 7 additions & 0 deletions CHANGES.rst
@@ -23,6 +23,10 @@ New features and enhancements
* ``mdf_reader.read``: optionally, set both external schema and code table paths and external schema file (:issue:`47`, :pull:`111`)
* ``cdm_mapper``: Change both columns history and report_quality during duplicate_check (:pull:`112`)
* ``cdm_mapper``: optionally, set column names to be ignored while duplicate check (:pull:`115`)
* ``cdm_mapper``: optionally, set offset values for duplicate_check (:pull:`119`)
* ``cdm_mapper``: optionally, set column entries to be ignored while duplicate_check (:pull:`119`)
* ``cdm_mapper``: add both column names ``station_speed`` and ``station_course`` to default duplicate check list (:pull:`119`)
* ``cdm_mapper``: optionally, re-index data in ascending order according to the number of nulls in each row (:pull:`119`)

Breaking changes
^^^^^^^^^^^^^^^^
@@ -74,6 +78,9 @@ Internal changes
* ``metmetpy``: use function ``overwrite_data`` in all platform type correction functions (:pull:`89`)
* rename ``data_model`` into ``imodel`` (:pull:`103`)
* implement assertion tests for module operations (:pull:`104`)
* ``cdm_mapper``: put settings for duplicate check in _duplicate_settings (:pull:`119`)
* ``cdm_mapper``: use pandas.apply function instead of for loops in duplicate_check (:pull:`119`)
* adding some more duplicate checks to testing suite (:pull:`119`)

Bug fixes
^^^^^^^^^
57 changes: 57 additions & 0 deletions cdm_reader_mapper/cdm_mapper/_duplicate_settings.py
@@ -0,0 +1,57 @@
"""Settings for duplicate check."""

from __future__ import annotations

from recordlinkage import Compare
from recordlinkage.compare import Numeric

_method_kwargs = {
    "left_on": "report_timestamp",
    "window": 5,
    "block_on": ["primary_station_id"],
}

_compare_kwargs = {
    "primary_station_id": {"method": "exact"},
    "longitude": {
        "method": "numeric",
        "kwargs": {"method": "step", "offset": 0.11},
    },
    "latitude": {
        "method": "numeric",
        "kwargs": {"method": "step", "offset": 0.11},
    },
    "report_timestamp": {
        "method": "date2",
        "kwargs": {"method": "gauss", "offset": 60.0},
    },
    "station_speed": {
        "method": "numeric",
        "kwargs": {"method": "step", "offset": 0.09},
    },
    "station_course": {
        "method": "numeric",
        "kwargs": {"method": "step", "offset": 0.9},
    },
}

_histories = {
    "duplicate_status": "Added duplicate information - flag",
    "duplicates": "Added duplicate information - duplicates",
}


class Date2(Numeric):
    """Copy of ``rl.compare.Numeric`` class."""

    pass


def date2(self, *args, **kwargs):
    """New method for ``rl.Compare`` object using ``Date2`` object."""
    compare = Date2(*args, **kwargs)
    self.add(compare)
    return self


Compare.date2 = date2
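
The numeric entries in ``_compare_kwargs`` all use the ``step`` method with a per-column ``offset``. As a rough illustration of what those offsets might mean in practice, the sketch below assumes that a step comparison scores 1.0 when two values differ by at most the offset and 0.0 otherwise. It is a pure-Python stand-in and not the ``recordlinkage`` implementation; ``step_similarity``, ``compare_reports`` and the two sample reports are hypothetical names and values introduced only for this example.

```python
# Illustrative sketch only: a pure-Python stand-in for the "step" comparison
# configured in _compare_kwargs. This is NOT the recordlinkage implementation.

# Offsets copied from the numeric entries of _compare_kwargs above.
OFFSETS = {
    "longitude": 0.11,
    "latitude": 0.11,
    "station_speed": 0.09,
    "station_course": 0.9,
}


def step_similarity(a: float, b: float, offset: float) -> float:
    """Return 1.0 if a and b differ by at most offset, else 0.0 (assumed semantics)."""
    return 1.0 if abs(a - b) <= offset else 0.0


def compare_reports(left: dict, right: dict) -> dict:
    """Score every configured numeric column of two hypothetical report dicts."""
    return {
        column: step_similarity(left[column], right[column], offset)
        for column, offset in OFFSETS.items()
    }


# Two made-up reports: close in longitude, speed and course, but not in latitude.
report_a = {"longitude": 10.0, "latitude": 54.0, "station_speed": 5.0, "station_course": 180.0}
report_b = {"longitude": 10.1, "latitude": 54.3, "station_speed": 5.0, "station_course": 180.5}

scores = compare_reports(report_a, report_b)
print(scores)
```

Under this assumption, only the latitude column scores 0.0 for the sample pair, because 0.3 degrees exceeds the 0.11 offset; how the per-column scores are combined into a duplicate decision is not shown in this diff.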