Remove clustering pairwise output format #2264

Merged
ADBond merged 1 commit into splink4_dev from remove-clustering-pairwise-format on Jul 17, 2024

Conversation

ADBond
Contributor

@ADBond ADBond commented Jul 17, 2024

This removes the options on cluster_pairwise_predictions_at_threshold() that allowed the output to be returned in pairwise 'edge' format instead of as a list of nodes with cluster labels.

The rationale is that the feature does not seem to be widely used, and removing it simplifies both the code and the API. All the pairwise output did was essentially join the clusters table to the edges table, which can easily be done manually if desired (or we could re-provide a method for doing so if there is a burning desire from users).
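As a rough illustration of what "joining the clusters table to the edges table" means, here is a minimal sketch on toy data. It uses Python's built-in sqlite3 purely as a stand-in database, not Splink itself; the table names `edges` and `clusters` and the sample rows are made up for the example, and the column names are assumed to follow the default `unique_id` convention.

```python
import sqlite3

# Toy stand-ins for the predictions (edges) table and the clusters (nodes) table.
# Table names and rows are hypothetical; this is not Splink's own storage.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE edges (unique_id_l INTEGER, unique_id_r INTEGER, match_probability REAL);
    CREATE TABLE clusters (unique_id INTEGER, cluster_id INTEGER);
    INSERT INTO edges VALUES (1, 2, 0.99), (2, 3, 0.97), (3, 4, 0.40);
    INSERT INTO clusters VALUES (1, 1), (2, 1), (3, 1), (4, 4);
    """
)

# Attach a cluster label to each end of every edge by joining clusters twice.
pairwise_sql = """
    SELECT
        e.unique_id_l,
        e.unique_id_r,
        e.match_probability,
        l.cluster_id AS cluster_id_l,
        r.cluster_id AS cluster_id_r
    FROM edges AS e
    LEFT JOIN clusters AS l ON e.unique_id_l = l.unique_id
    LEFT JOIN clusters AS r ON e.unique_id_r = r.unique_id
    ORDER BY e.unique_id_l, e.unique_id_r
"""
pairwise_rows = conn.execute(pairwise_sql).fetchall()

# Keeping only edges whose two ends share a cluster gives the intra-cluster links.
intra_cluster_rows = [row for row in pairwise_rows if row[3] == row[4]]
```

In this toy data the low-probability edge between records 3 and 4 spans two clusters, so it appears in `pairwise_rows` but not in `intra_cluster_rows`.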

Member

@RobinL RobinL left a comment

Thanks!

@ADBond ADBond merged commit 1e1bcfb into splink4_dev Jul 17, 2024
25 checks passed
@ADBond ADBond deleted the remove-clustering-pairwise-format branch July 17, 2024 09:52
@samirnoman

samirnoman commented Jul 30, 2024

Can you please provide an alternative method that produces the same pairwise output as cluster_pairwise_predictions_at_threshold?
In my case, this output is the main result of my pipeline because it simplifies reviewing and approving the clustering. I usually provide it as an Excel file to my coworkers as the final output of the deduplication process.

It would be very convenient to have this functionality (even with a different method).

Right now, I have managed to port my code to the new version (v4).
I'm only stuck at this final step so far.

I tried to study the old code in order to reproduce it in my own code, but the old code used an intermediate table (representative) that is not available after calling the method.

Thank you for your great work.

@ADBond
Contributor Author

ADBond commented Jul 31, 2024

Hi @samirnoman - we can certainly have a look at providing a way to have this functionality in the new version.

In the meantime, you should be able to get similar results using code along these lines:

```python
...
df_e = linker.inference.predict(threshold_match_weight=-5)
df_c = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_e, 0.95
)

sql = f"""
SELECT
    e.*,  -- or whatever subset of columns you wish to retain
    l.cluster_id AS cluster_id_l,
    r.cluster_id AS cluster_id_r
FROM
    {df_e.physical_name} e
LEFT JOIN
    {df_c.physical_name} l
ON e.unique_id_l = l.unique_id
LEFT JOIN
    {df_c.physical_name} r
ON e.unique_id_r = r.unique_id
WHERE l.cluster_id = r.cluster_id  -- keeps only intra-cluster links; drop this line to keep all links
ORDER BY
    cluster_id_l, cluster_id_r
"""

# db_api is the database API object that the linker was created with.
# this is the SplinkDataFrame with output as with pairwise_mode=True
df_pairwise = db_api.sql_to_splink_dataframe_checking_cache(
    sql=sql, output_tablename_templated="pairwise_clusters"
)
```

You may need to tweak the SQL slightly if, for example, you use a custom unique_id_column_name.

@samirnoman

Thank you so much for this code. It will certainly help me move forward and finish upgrading my scripts. I will try it today and let you know if I run into any problems.

I also appreciate your willingness to consider including this functionality in the future.

Best regards

@samirnoman

samirnoman commented Jul 31, 2024

I tried the code you sent earlier and it worked well, almost without modification.
I turned it into a function to make it generic and easier to call from my script. I also tried to make it handle composite uid_cols properly.

```python
# Assumes Linker and SplinkDataFrame have already been imported from Splink.
def make_pairwise_clusters(
    linker: Linker,
    df_predict: SplinkDataFrame,
    df_clusters: SplinkDataFrame,
    pairwise_filter=True,
):
    db_api = linker._db_api
    uid_cols = linker._settings_obj.column_info_settings.unique_id_input_columns
    if uid_cols:
        # Join on every unique ID column so that composite keys are handled.
        uid_concat_l = " and ".join([f"l.{c.name} = n.{c.name_l}" for c in uid_cols])
        uid_concat_r = " and ".join([f"r.{c.name} = n.{c.name_r}" for c in uid_cols])
    else:
        uid_concat_l = "l.unique_id = n.unique_id_l"
        uid_concat_r = "r.unique_id = n.unique_id_r"
    # Optionally keep only the links whose two records share a cluster.
    filter_cond = "where l.cluster_id = r.cluster_id" if pairwise_filter else ""
    sql = f"""
        select
            n.*,
            l.cluster_id as cluster_id_l,
            r.cluster_id as cluster_id_r
        from {df_predict.physical_name} as n
        left join
        {df_clusters.physical_name} as l
            on {uid_concat_l}
        left join
        {df_clusters.physical_name} as r
            on {uid_concat_r}
        {filter_cond}
        order by
            cluster_id_l, cluster_id_r
    """
    # this is the SplinkDataFrame with output as with pairwise_mode=True
    df_pairwise = db_api.sql_to_splink_dataframe_checking_cache(
        sql=sql, output_tablename_templated="pairwise_clusters"
    )
    return df_pairwise
```
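To make the composite-key handling easier to see in isolation, here is a small sketch of how the join conditions are assembled. `FakeColumn` and `build_join_condition` are hypothetical helpers written only for this illustration; `FakeColumn` mimics just the `name`, `name_l` and `name_r` attributes the function above reads, and it assumes plain `_l`/`_r` suffixes with no identifier quoting, which may differ from what Splink's own column objects return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeColumn:
    """Hypothetical stand-in exposing only name, name_l and name_r."""

    name: str

    @property
    def name_l(self) -> str:
        return f"{self.name}_l"

    @property
    def name_r(self) -> str:
        return f"{self.name}_r"


def build_join_condition(columns, cluster_alias, edge_alias, side):
    """Build an ON clause matching every ID column for one side ('l' or 'r')."""
    attr = f"name_{side}"
    return " and ".join(
        f"{cluster_alias}.{c.name} = {edge_alias}.{getattr(c, attr)}"
        for c in columns
    )


# A composite key such as (source_dataset, unique_id), as used when linking
# several input tables. The column names here are illustrative.
composite_key = [FakeColumn("source_dataset"), FakeColumn("unique_id")]
condition_l = build_join_condition(composite_key, "l", "n", "l")
condition_r = build_join_condition(composite_key, "r", "n", "r")
```

With a single `unique_id` column the same helper collapses to the simple condition used in the fallback branch of the function.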
