Remove clustering pairwise output format #2264

ADBond · 2024-07-17T09:39:23Z

This removes the options on cluster_pairwise_predictions_at_threshold() that allowed the output to be in pairwise 'edge' format instead of a list of nodes with cluster labels.

The rationale is that the feature does not seem to be used much, and removing it simplifies the code and the API. All the pairwise output did was essentially join the clusters table to the edges table, which can easily be done manually if desired (or we could re-provide a method for doing so if there is a burning desire from users).

RobinL

Thanks!

samirnoman · 2024-07-30T17:48:56Z

Can you please provide an alternative method to produce the same pairwise output as cluster_pairwise_predictions_at_threshold?
In my case, this output is the main result of my pipeline because it simplifies reviewing and approving the clustering. I usually provide it as an Excel file for my coworkers as the final output of the deduplication process.

It would be very convenient to have this functionality (even with a different method).

Right now, I have managed to port my code to the new version (v4).
I'm only stuck at this final step so far.

I tried to study the old code in order to reproduce it in my own code, but the old code used an intermediate table (representative) which is not available after calling the method.

Thank you for your great work

ADBond · 2024-07-31T12:01:16Z

Hi @samirnoman - we can certainly have a look at providing a way to have this functionality in the new version.

In the meantime, you should be able to get similar results using code along these lines:

```python
...
df_e = linker.inference.predict(threshold_match_weight=-5)
df_c = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_e, 0.95
)

sql = f"""
SELECT
    e.*,  -- or whatever subset of columns you wish to retain
    l.cluster_id AS cluster_id_l,
    r.cluster_id AS cluster_id_r
FROM
    {df_e.physical_name} e
LEFT JOIN
    {df_c.physical_name} l
ON e.unique_id_l = l.unique_id
LEFT JOIN
    {df_c.physical_name} r
ON e.unique_id_r = r.unique_id
-- if you only want intra-cluster links, otherwise drop this line - pairwise_filter=True
-- (joined columns are compared directly, since not every backend accepts select aliases in WHERE)
WHERE l.cluster_id = r.cluster_id
ORDER BY
    cluster_id_l, cluster_id_r
"""

# this is the SplinkDataFrame with output as with pairwise_mode=True
df_pairwise = db_api.sql_to_splink_dataframe_checking_cache(
    sql=sql, output_tablename_templated="pairwise_clusters"
)
```

You may need to tweak the SQL slightly if, for example, you use a custom unique_id_column_name.
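As an illustration of that tweak, here is a minimal sketch on toy data. It uses Python's built-in sqlite3 rather than Splink, so it runs without a linker. The function name `pairwise_clusters`, its `uid` parameter and the table names are hypothetical, and the edge columns are assumed to follow the `<uid>_l` / `<uid>_r` naming pattern.

```python
import sqlite3


def pairwise_clusters(edges, clusters, uid="unique_id", intra_cluster_only=True):
    """Label both ends of each edge with its cluster_id (toy sqlite3 version).

    edges:    iterable of (left id, right id, match_probability)
    clusters: iterable of (cluster_id, id)
    uid:      name of the ID column; it is interpolated into the SQL text,
              so it must be a trusted identifier.
    """
    con = sqlite3.connect(":memory:")
    con.execute(f"CREATE TABLE edges ({uid}_l, {uid}_r, match_probability)")
    con.execute(f"CREATE TABLE clusters (cluster_id, {uid})")
    con.executemany("INSERT INTO edges VALUES (?, ?, ?)", edges)
    con.executemany("INSERT INTO clusters VALUES (?, ?)", clusters)

    # Keep only links whose two ends fall in the same cluster, if requested.
    where = "WHERE l.cluster_id = r.cluster_id" if intra_cluster_only else ""
    rows = con.execute(
        f"""
        SELECT e.*, l.cluster_id AS cluster_id_l, r.cluster_id AS cluster_id_r
        FROM edges AS e
        LEFT JOIN clusters AS l ON e.{uid}_l = l.{uid}
        LEFT JOIN clusters AS r ON e.{uid}_r = r.{uid}
        {where}
        ORDER BY cluster_id_l, cluster_id_r, e.{uid}_l, e.{uid}_r
        """
    ).fetchall()
    con.close()
    return rows


# Toy input: records 1, 2 and 3 share cluster 1; record 4 is on its own.
edges = [(1, 2, 0.99), (2, 3, 0.97), (3, 4, 0.40)]
clusters = [(1, 1), (1, 2), (1, 3), (4, 4)]

intra = pairwise_clusters(edges, clusters)
everything = pairwise_clusters(edges, clusters, intra_cluster_only=False)
custom = pairwise_clusters(edges, clusters, uid="person_id")
```

With the filter on, only the two links inside cluster 1 come back; with it off, the weak 3-4 link is also returned, labelled with two different cluster IDs.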

samirnoman · 2024-07-31T13:16:09Z

Thank you so much for this code. It will certainly help me move forward and finish upgrading my scripts. I will try it today and let you know if I face any problems.

I also appreciate your consideration to include this functionality in the future

best regards

samirnoman · 2024-07-31T14:50:11Z

I tried the code you sent earlier and it worked well almost without modification.
I turned it into a function to make it generic and easier to call from my script. I also tried to make it handle composite uid_cols properly:

```python
from splink import Linker
# Import path assumed for Splink 4; adjust if your version exposes it elsewhere.
from splink.internals.splink_dataframe import SplinkDataFrame


def make_pairwise_clusters(
    linker: Linker,
    df_predict: SplinkDataFrame,
    df_clusters: SplinkDataFrame,
    pairwise_filter=True,
):
    db_api = linker._db_api
    uid_cols = linker._settings_obj.column_info_settings.unique_id_input_columns
    if uid_cols:
        # One equality per ID column, so composite IDs are matched on every part.
        uid_concat_l = " and ".join(f"l.{c.name} = n.{c.name_l}" for c in uid_cols)
        uid_concat_r = " and ".join(f"r.{c.name} = n.{c.name_r}" for c in uid_cols)
    else:
        uid_concat_l = "l.unique_id = n.unique_id_l"
        uid_concat_r = "r.unique_id = n.unique_id_r"
    # Compare the joined columns directly: select aliases are not visible
    # in WHERE in every SQL dialect.
    filter_cond = "where l.cluster_id = r.cluster_id" if pairwise_filter else ""
    sql = f"""
        select
            n.*,
            l.cluster_id as cluster_id_l,
            r.cluster_id as cluster_id_r
        from {df_predict.physical_name} as n
        left join
        {df_clusters.physical_name} as l
            on {uid_concat_l}
        left join
        {df_clusters.physical_name} as r
            on {uid_concat_r}
        {filter_cond}
        order by
            cluster_id_l, cluster_id_r
    """
    # this is the SplinkDataFrame with output as with pairwise_mode=True
    df_pairwise = db_api.sql_to_splink_dataframe_checking_cache(
        sql=sql, output_tablename_templated="pairwise_clusters"
    )
    return df_pairwise
```
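The composite-ID handling in the function above can be checked in isolation, without a linker or database. The following is a minimal sketch that rebuilds only the join-condition string from plain column names. `join_condition` and its parameters are hypothetical, and it assumes unquoted `<name>_l` / `<name>_r` column naming rather than Splink's own column objects.

```python
def join_condition(uid_cols, side, clusters_alias, edges_alias="n"):
    """Build the ON clause linking one end of an edge to the clusters table.

    uid_cols: plain column names making up the (possibly composite) ID
    side:     "l" or "r", i.e. which end of the edge is being joined
    """
    if side not in ("l", "r"):
        raise ValueError("side must be 'l' or 'r'")
    if not uid_cols:
        raise ValueError("at least one ID column is required")
    # One equality per ID column, combined with AND.
    return " and ".join(
        f"{clusters_alias}.{col} = {edges_alias}.{col}_{side}" for col in uid_cols
    )


single = join_condition(["unique_id"], "l", "l")
composite = join_condition(["source_dataset", "unique_id"], "r", "r")
```

A single ID column gives one equality, while a composite ID gives one equality per part, so two records are only joined when every part matches.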

remove pairwise output format option from clustering method

5157aa9

RobinL approved these changes Jul 17, 2024

ADBond merged commit 1e1bcfb into splink4_dev Jul 17, 2024
25 checks passed

ADBond deleted the remove-clustering-pairwise-format branch July 17, 2024 09:52
