Regression v3 matcher #176

jbothma · 2024-09-17T15:14:35Z

- Add the regression-v3 matcher, copied from regression-v1.
- Split train/test data by pair group key so that connected entities land entirely in either train or test.
- Parallelise feature generation for training.
- Add SimpleImputer to fill NaN feature values with that feature's mean.
- A single name_similarity feature takes the max of name_match, name_token_overlap and name_levenshtein.
  - This helps with name features otherwise getting negative coefficients.
- address_match is NaN when values aren't available.
  - This makes the coefficient positive.
- The name_match component of name_similarity is scaled to 0..1, favouring names with a longer longest matching token and more matching tokens.
- A symmetric form of name_fingerprint_levenshtein is used for non-Person pairs, for alignment of tokens.
- dob_similarity replaces dob_matches, dob_year_matches and dob_year_disjoint with a single feature scoring
  - high for a high-precision match,
  - lower for edits and year match,
  - and negatively for a precise date mismatch beyond an edit distance of 2 on precise dates.
- country_match (replacing country_mismatch in the feature list) scores positively when countries overlap, negatively when countries are disjoint, NaN otherwise.
- position_country_mismatch scores negatively when Position:country is disjoint.
- security_isin_mismatch scores negatively when Security:isin is disjoint.
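
To illustrate why returning NaN for unavailable values can help, here is a minimal standard-library sketch of mean imputation followed by standardisation. It is a simplified stand-in for the SimpleImputer(strategy="mean") and StandardScaler steps, not the PR's code; the helper name `impute_and_scale` and the sample column are hypothetical.

```python
import math
from statistics import fmean, pstdev
from typing import List


def impute_and_scale(column: List[float]) -> List[float]:
    """Fill NaN with the column mean, then scale to zero mean and unit variance.

    Hypothetical helper; assumes at least one non-NaN value in the column.
    """
    observed = [v for v in column if not math.isnan(v)]
    mean = fmean(observed)
    filled = [mean if math.isnan(v) else v for v in column]
    # Imputing the mean does not shift the column mean, so imputed rows
    # end up at exactly 0 after centring.
    std = pstdev(filled) or 1.0  # guard against a constant column
    return [(v - mean) / std for v in filled]


# Made-up address_match values; NaN marks pairs with no address on one side.
address_match = [1.0, math.nan, 0.0, 0.5, math.nan]
scaled = impute_and_scale(address_match)
```

Under these assumptions a missing value contributes nothing to the linear term of the regression, whereas encoding "missing" as 0.0 would make it look like a strong mismatch.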

Before the feature changes, with just the chronological pairs support, name_levenshtein had a negative coefficient. Changes to make name_levenshtein positive resulted in name_token_overlap's coefficient becoming negative. So name_match, name_token_overlap and name_levenshtein have been combined into a single feature, taking the max.
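
As a rough illustration of the combined feature, the sketch below reproduces the name_match scaling and the max-of-three combination on plain values. The constants and the 0.5 weight on token overlap are copied from the names.py diff in this PR; the function names `name_match_score` and `combined_name_similarity` and the sample names are hypothetical, and the entity handling and normalisation of the real code are left out.

```python
from typing import Set

# Constants as they appear in the names.py diff of this PR.
MATCH_BASE_SCORE = 0.7
MAX_BONUS_LENGTH = 100
LENGTH_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_LENGTH
MAX_BONUS_QTY = 10
QTY_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_QTY


def name_match_score(left: Set[str], right: Set[str]) -> float:
    """Score exact matches between two sets of already-normalised names."""
    common = left & right
    if not common:
        return 0.0
    longest = max(len(name) for name in common)
    length_bonus = min(longest, MAX_BONUS_LENGTH) * LENGTH_BONUS_FACTOR
    quantity_bonus = min(len(common), MAX_BONUS_QTY) * QTY_BONUS_FACTOR
    # Averaging the two bonuses keeps the total at or below 1.0.
    return MATCH_BASE_SCORE + (length_bonus + quantity_bonus) / 2


def combined_name_similarity(
    match: float, token_overlap: float, levenshtein: float
) -> float:
    """Take the strongest of the three name signals."""
    return max(match, 0.5 * token_overlap, levenshtein)


# One shared 17-character name: 0.7 + (17 * 0.003 + 1 * 0.03) / 2
score = name_match_score({"acme holdings ltd", "acme"}, {"acme holdings ltd"})
combined = combined_name_similarity(match=0.0, token_overlap=1.0, levenshtein=0.4)
```

Because only the maximum is passed to the regression, the three signals no longer compete for a coefficient, which is the effect described above.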

TODO

- Clear feature docstrings
- Type annotations
- See if name_token_overlap scaling is too aggressive and can be taught to chill

jbothma · 2024-09-19T14:47:59Z

Comparing regression_v1 and regression_v3

Common subdirectories: nomenklatura/matching/regression_v1/__pycache__ and nomenklatura/matching/regression_v3/__pycache__
diff -u nomenklatura/matching/regression_v1/misc.py nomenklatura/matching/regression_v3/misc.py
--- nomenklatura/matching/regression_v1/misc.py	2024-09-09 11:14:58
+++ nomenklatura/matching/regression_v3/misc.py	2024-09-15 09:15:13
@@ -1,8 +1,9 @@
 from followthemoney.proxy import E
 from followthemoney.types import registry
+import numpy as np
 
 from nomenklatura.matching.regression_v1.util import tokenize_pair, compare_levenshtein
-from nomenklatura.matching.compare.util import has_overlap, extract_numbers
+from nomenklatura.matching.compare.util import has_overlap, extract_numbers, is_disjoint
 from nomenklatura.matching.util import props_pair, type_pair
 from nomenklatura.matching.util import max_in_sets, has_schema
 from nomenklatura.util import normalize_name
@@ -18,6 +19,8 @@
 def address_match(query: E, result: E) -> float:
     """Text similarity between addresses."""
     lv, rv = type_pair(query, result, registry.address)
+    if not (lv and rv):
+        return np.nan
     lvn = [normalize_name(v) for v in lv]
     rvn = [normalize_name(v) for v in rv]
     return max_in_sets(lvn, rvn, compare_levenshtein)
@@ -61,3 +64,19 @@
         return 0.0
     lv, rv = type_pair(query, result, registry.identifier)
     return 1.0 if has_overlap(lv, rv) else 0.0
+
+
+def position_country_mismatch(query: E, result: E) -> float:
+    """Whether positions have the same country or not"""
+    if not has_schema(query, result, "Position"):
+        return 0.0
+    lv, rv = type_pair(query, result, registry.country)
+    return 1.0 if is_disjoint(lv, rv) else 0
+
+
+def security_isin_mismatch(query: E, result: E) -> float:
+    """Both entities are linked to different ISIN codes."""
+    if not has_schema(query, result, "Security"):
+        return 0.0
+    qv, rv = props_pair(query, result, ["isin"])
+    return 1.0 if is_disjoint(qv, rv) else 0.0
diff -u nomenklatura/matching/regression_v1/model.py nomenklatura/matching/regression_v3/model.py
--- nomenklatura/matching/regression_v1/model.py	2024-02-13 13:25:35
+++ nomenklatura/matching/regression_v3/model.py	2024-09-15 09:29:11
@@ -5,48 +5,48 @@
 from sklearn.pipeline import Pipeline  # type: ignore
 from followthemoney.proxy import E
 
-from nomenklatura.matching.regression_v1.names import first_name_match
-from nomenklatura.matching.regression_v1.names import family_name_match
-from nomenklatura.matching.regression_v1.names import name_levenshtein, name_match
-from nomenklatura.matching.regression_v1.names import name_token_overlap, name_numbers
-from nomenklatura.matching.regression_v1.misc import phone_match, email_match
-from nomenklatura.matching.regression_v1.misc import address_match, address_numbers
-from nomenklatura.matching.regression_v1.misc import identifier_match, birth_place
-from nomenklatura.matching.regression_v1.misc import org_identifier_match
-from nomenklatura.matching.compare.countries import country_mismatch
+
+from nomenklatura.matching.regression_v3.names import first_name_match, name_similarity
+from nomenklatura.matching.regression_v3.names import family_name_match
+from nomenklatura.matching.regression_v3.names import name_levenshtein, name_match
+from nomenklatura.matching.regression_v3.names import name_token_overlap, name_numbers
+from nomenklatura.matching.regression_v3.misc import phone_match, email_match, position_country_mismatch
+from nomenklatura.matching.regression_v3.misc import address_match, address_numbers
+from nomenklatura.matching.regression_v3.misc import identifier_match, birth_place
+from nomenklatura.matching.regression_v3.misc import org_identifier_match
+from nomenklatura.matching.regression_v3.misc import security_isin_mismatch
 from nomenklatura.matching.compare.gender import gender_mismatch
 from nomenklatura.matching.compare.dates import dob_matches, dob_year_matches
-from nomenklatura.matching.compare.dates import dob_year_disjoint
+from nomenklatura.matching.compare.dates import dob_year_disjoint, dob_similarity
+from nomenklatura.matching.compare.countries import country_match
 from nomenklatura.matching.types import FeatureDocs, FeatureDoc, MatchingResult
 from nomenklatura.matching.types import CompareFunction, Encoded, ScoringAlgorithm
 from nomenklatura.matching.util import make_github_url
 from nomenklatura.util import DATA_PATH
 
 
-class RegressionV1(ScoringAlgorithm):
+class RegressionV3(ScoringAlgorithm):
     """A simple matching algorithm based on a regression model."""
 
-    NAME = "regression-v1"
+    NAME = "regression-v3"
     MODEL_PATH = DATA_PATH.joinpath(f"{NAME}.pkl")
     FEATURES: List[CompareFunction] = [
-        name_match,
-        name_token_overlap,
         name_numbers,
-        name_levenshtein,
+        name_similarity,
         phone_match,
         email_match,
         identifier_match,
-        dob_matches,
-        dob_year_matches,
-        dob_year_disjoint,
+        dob_similarity,
         first_name_match,
         family_name_match,
         birth_place,
         gender_mismatch,
-        country_mismatch,
+        country_match,
+        position_country_mismatch,
         org_identifier_match,
         address_match,
         address_numbers,
+        security_isin_mismatch,
     ]
 
     @classmethod
diff -u nomenklatura/matching/regression_v1/names.py nomenklatura/matching/regression_v3/names.py
--- nomenklatura/matching/regression_v1/names.py	2024-09-09 11:14:58
+++ nomenklatura/matching/regression_v3/names.py	2024-09-18 22:39:00
@@ -1,14 +1,24 @@
+from statistics import mean
 from typing import Iterable, Set
 from followthemoney.proxy import E
 from followthemoney.types import registry
+import numpy as np
 
-from nomenklatura.matching.regression_v1.util import tokenize_pair, compare_levenshtein
+from nomenklatura.matching.regression_v3.util import tokenize_pair, compare_levenshtein
 from nomenklatura.matching.compare.util import is_disjoint, has_overlap, extract_numbers
-from nomenklatura.matching.util import props_pair, type_pair
+from nomenklatura.matching.compare.names import aligned_levenshtein, name_fingerprint_levenshtein, symmetric_aligned_levenshtein
+from nomenklatura.matching.util import has_schema, props_pair, type_pair
 from nomenklatura.matching.util import max_in_sets
 from nomenklatura.util import fingerprint_name
 
 
+MATCH_BASE_SCORE = 0.7
+MAX_BONUS_LENGTH = 100
+LENGTH_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_LENGTH
+MAX_BONUS_QTY = 10
+QTY_BONUS_FACTOR = (1 - MATCH_BASE_SCORE) / MAX_BONUS_QTY
+
+
 def normalize_names(raws: Iterable[str]) -> Set[str]:
     names = set()
     for raw in raws:
@@ -21,43 +31,77 @@
 def name_levenshtein(left: E, right: E) -> float:
     """Consider the edit distance (as a fraction of name length) between the two most
     similar names linked to both entities."""
-    lv, rv = type_pair(left, right, registry.name)
-    lvn, rvn = normalize_names(lv), normalize_names(rv)
-    return max_in_sets(lvn, rvn, compare_levenshtein)
+    if has_schema(left, right, "Person"):
+        lv, rv = type_pair(left, right, registry.name)
+        lvn, rvn = normalize_names(lv), normalize_names(rv)
+        return max_in_sets(lvn, rvn, compare_levenshtein)
+    else:
+        return name_fingerprint_levenshtein(left, right, symmetric_aligned_levenshtein)
 
 
 def first_name_match(left: E, right: E) -> float:
     """Matching first/given name between the two entities."""
     lv, rv = tokenize_pair(props_pair(left, right, ["firstName"]))
+    if not (lv and rv):
+        return np.nan
     return 1.0 if has_overlap(lv, rv) else 0.0
 
 
 def family_name_match(left: E, right: E) -> float:
     """Matching family name between the two entities."""
     lv, rv = tokenize_pair(props_pair(left, right, ["lastName"]))
+    if not (lv and rv):
+        return np.nan
     return 1.0 if has_overlap(lv, rv) else 0.0
 
 
 def name_match(left: E, right: E) -> float:
-    """Check for exact name matches between the two entities."""
+    """
+    Check for exact name matches between the two entities.
+
+    Having any completely matching name initially scores 0.8.
+    A length bonus is added based on the length of the longest common name up to 100 chars.
+    A quantity bonus is added based on the number of common names up to 10.
+
+    The maximum score is 1.0.
+    No matches scores 0.0.
+    """
     lv, rv = type_pair(left, right, registry.name)
     lvn, rvn = normalize_names(lv), normalize_names(rv)
-    common = [len(n) for n in lvn.intersection(rvn)]
-    max_common = max(common, default=0)
-    if max_common == 0:
+    common = sorted(lvn.intersection(rvn), key=lambda n: len(n), reverse=True)
+    if not common:
         return 0.0
-    return float(max_common)
+    score = MATCH_BASE_SCORE
+    longest_common = common[0]
+    length_bonus = min(len(longest_common), MAX_BONUS_LENGTH) * LENGTH_BONUS_FACTOR
+    quantity_bonus = min(len(common), MAX_BONUS_QTY) * QTY_BONUS_FACTOR
+    return score + (length_bonus + quantity_bonus) / 2
 
 
 def name_token_overlap(left: E, right: E) -> float:
     """Evaluate the proportion of identical words in each name."""
-    lv, rv = tokenize_pair(type_pair(left, right, registry.name))
-    common = lv.intersection(rv)
-    tokens = min(len(lv), len(rv))
-    return float(len(common)) / float(max(2.0, tokens))
+    lvt, rvt = tokenize_pair(type_pair(left, right, registry.name))
+    common = lvt.intersection(rvt)
+    tokens = min(len(lvt), len(rvt))
+    if tokens == 0:
+        return 0.0
+    return float(len(common)) / tokens
 
 
 def name_numbers(left: E, right: E) -> float:
     """Find if names contain numbers, score if the numbers are different."""
     lv, rv = type_pair(left, right, registry.name)
     return 1.0 if is_disjoint(extract_numbers(lv), extract_numbers(rv)) else 0.0
+
+
+def name_similarity(left: E, right: E) -> float:
+    """Compute the similarity between the names of two entities, picking the max from
+    a full string match, token overlap-based score, and levenshtein distance-based
+    score."""
+    return max(
+        [
+            name_match(left, right),
+            0.5 * name_token_overlap(left, right),
+            name_levenshtein(left, right),
+        ]
+    )
diff -u nomenklatura/matching/regression_v1/train.py nomenklatura/matching/regression_v3/train.py
--- nomenklatura/matching/regression_v1/train.py	2024-09-06 12:44:09
+++ nomenklatura/matching/regression_v3/train.py	2024-09-13 17:28:35
@@ -1,19 +1,20 @@
 import logging
 import numpy as np
 import multiprocessing
-from typing import Iterable, List, Tuple
+from typing import List, Tuple
 from pprint import pprint
 from numpy.typing import NDArray
 from sklearn.pipeline import make_pipeline  # type: ignore
 from sklearn.preprocessing import StandardScaler  # type: ignore
-from sklearn.model_selection import train_test_split  # type: ignore
+from sklearn.model_selection import GroupShuffleSplit  # type: ignore
 from sklearn.linear_model import LogisticRegression  # type: ignore
+from sklearn.impute import SimpleImputer  # type: ignore
 from sklearn import metrics  # type: ignore
-from concurrent.futures import ThreadPoolExecutor
+from concurrent.futures import ProcessPoolExecutor
 
 from nomenklatura.judgement import Judgement
 from nomenklatura.matching.pairs import read_pairs, JudgedPair
-from nomenklatura.matching.regression_v1.model import RegressionV1
+from nomenklatura.matching.regression_v3.model import RegressionV3
 from nomenklatura.util import PathLike
 
 log = logging.getLogger(__name__)
@@ -22,20 +23,20 @@
 def pair_convert(pair: JudgedPair) -> Tuple[List[float], int]:
     """Encode a pair of training data into features and target."""
     judgement = 1 if pair.judgement == Judgement.POSITIVE else 0
-    features = RegressionV1.encode_pair(pair.left, pair.right)
+    features = RegressionV3.encode_pair(pair.left, pair.right)
     return features, judgement
 
 
 def pairs_to_arrays(
-    pairs: Iterable[JudgedPair],
+    pairs: List[JudgedPair],
 ) -> Tuple[NDArray[np.float32], NDArray[np.float32]]:
     """Parallelize feature computation for training data"""
     xrows = []
     yrows = []
     threads = multiprocessing.cpu_count()
     log.info("Compute threads: %d", threads)
-    with ThreadPoolExecutor(max_workers=threads) as excecutor:
-        results = excecutor.map(pair_convert, pairs)
+    with ProcessPoolExecutor(max_workers=threads) as executor:
+        results = executor.map(pair_convert, pairs, chunksize=1000)
         for idx, (x, y) in enumerate(results):
             if idx > 0 and idx % 10000 == 0:
                 log.info("Computing features: %s....", idx)
@@ -45,42 +46,49 @@
     return np.array(xrows), np.array(yrows)
 
 
-def train_matcher(pairs_file: PathLike) -> None:
+def train_matcher(pairs_file: PathLike, splits: int = 1) -> None:
     pairs = []
     for pair in read_pairs(pairs_file):
-        # HACK: support more eventually:
-        # if not pair.left.schema.is_a("LegalEntity"):
-        #     continue
         if pair.judgement == Judgement.UNSURE:
             pair.judgement = Judgement.NEGATIVE
-        # randomize_entity(pair.left)
-        # randomize_entity(pair.right)
         pairs.append(pair)
-    # random.shuffle(pairs)
-    # pairs = pairs[:30000]
     positive = len([p for p in pairs if p.judgement == Judgement.POSITIVE])
     negative = len([p for p in pairs if p.judgement == Judgement.NEGATIVE])
     log.info("Total pairs loaded: %d (%d pos/%d neg)", len(pairs), positive, negative)
+
     X, y = pairs_to_arrays(pairs)
-    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
-    # logreg = LogisticRegression(class_weight={0: 95, 1: 1})
-    # logreg = LogisticRegression(penalty="l1", solver="liblinear")
-    logreg = LogisticRegression(penalty="l2")
-    log.info("Training model...")
-    pipe = make_pipeline(StandardScaler(), logreg)
-    pipe.fit(X_train, y_train)
-    coef = logreg.coef_[0]
-    coefficients = {n.__name__: c for n, c in zip(RegressionV1.FEATURES, coef)}
-    RegressionV1.save(pipe, coefficients)
-    print("Coefficients:")
-    pprint(coefficients)
-    y_pred = pipe.predict(X_test)
-    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
-    print("Confusion matrix:\n", cnf_matrix)
-    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
-    print("Precision:", metrics.precision_score(y_test, y_pred))
-    print("Recall:", metrics.recall_score(y_test, y_pred))
+    groups = [p.group for p in pairs]
+    gss = GroupShuffleSplit(n_splits=splits, test_size=0.33)
+    for split, (train_indices, test_indices) in enumerate(
+        gss.split(X, y, groups=groups), 1
+    ):
+        X_train = [X[i] for i in train_indices]
+        X_test = [X[i] for i in test_indices]
+        y_train = [y[i] for i in train_indices]
+        y_test = [y[i] for i in test_indices]
 
-    y_pred_proba = pipe.predict_proba(X_test)[::, 1]
-    auc = metrics.roc_auc_score(y_test, y_pred_proba)
-    print("Area under curve:", auc)
+        print()
+        log.info("Training model...(split %d)" % split)
+        logreg = LogisticRegression(penalty="l2")
+        pipe = make_pipeline(
+            SimpleImputer(strategy="mean"),
+            StandardScaler(),
+            logreg,
+        )
+        pipe.fit(X_train, y_train)
+        coef = logreg.coef_[0]
+        coefficients = {n.__name__: c for n, c in zip(RegressionV3.FEATURES, coef)}
+        RegressionV3.save(pipe, coefficients)
+
+        print("Coefficients:")
+        pprint(coefficients)
+        y_pred = pipe.predict(X_test)
+        cnf_matrix = metrics.confusion_matrix(y_test, y_pred, normalize="all") * 100
+        print("Confusion matrix (% of all):\n", cnf_matrix)
+        print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
+        print("Precision:", metrics.precision_score(y_test, y_pred))
+        print("Recall:", metrics.recall_score(y_test, y_pred))
+
+        y_pred_proba = pipe.predict_proba(X_test)[::, 1]
+        auc = metrics.roc_auc_score(y_test, y_pred_proba)
+        print("Area under curve:", auc)
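
The group-aware split in train.py relies on sklearn's GroupShuffleSplit. The idea can be sketched with the standard library alone: shuffle the distinct group keys, reserve a share of them for testing, and assign every pair according to its group. This is a simplified stand-in and not the sklearn implementation; `group_split` and the sample group keys are hypothetical.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def group_split(
    groups: Sequence[Hashable], test_size: float = 0.33, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with no group on both sides."""
    # Sort first so the shuffle is reproducible for a given seed.
    unique = sorted(set(groups), key=repr)
    random.Random(seed).shuffle(unique)
    # test_size is applied to the number of groups, not the number of pairs.
    n_test = max(1, round(len(unique) * test_size))
    test_groups = set(unique[:n_test])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test


# One made-up group key per judged pair.
groups = ["a", "a", "b", "c", "c", "c", "d"]
train_idx, test_idx = group_split(groups)
train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}
```

Keeping each group on one side means a model cannot score well on the test set merely by having seen other pairs from the same cluster of connected entities during training.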

jbothma added 16 commits September 6, 2024 10:51

7981631 Add regression v3 crawler based on v1 with disjoint cluster training data
9e5b6ae Add SimpleImputer to treat NaN as blank and fill with mean
9b3ffcf Make name_levenshtein use alignment for non-persons
7ecd69f Scale name_token_overlap inversely with number of names
ad82f9e Combine three different name features because... either one goes negative, or they all hover around 0
1c96cfd Replace multiple date features with one that combines the ideas
b126b17 Fix clauses
087e284 Make address match positive
3f80039 Retrain
cf37310 Merge branch 'main' into reg-v3-base-fix-odd-coefficients
26ad188 Retrain
0d9f777 Update test scores for recent tweaks
4eac909 Discourage position matches for disjoint countries
22f2077 add isin mismatch feature chill country and first/lastname
4544687 Use symmetric aligned levenshtein to avoid double work
5d59c8d Update expected scores

jbothma changed the title ~~Reg v3 base fix odd coefficients~~ Regression v3 matcher Oct 31, 2024
