Clustering allows match weight args not just match probability #2454

RobinL · 2024-10-07T07:36:30Z

This PR allows all clustering functions to take a match weight threshold argument instead of a match probability threshold.

You can now provide either threshold (but not both), which makes the behaviour consistent with e.g. inference.predict().
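As a rough illustration of the "either, but not both" behaviour, here is a minimal sketch of how the two kinds of threshold could be reconciled. It assumes the usual definition of match weight as log2 of the odds p / (1 - p). The helper names (`prob_to_match_weight`, `match_weight_to_prob`, `resolve_threshold`) are hypothetical and are not taken from the PR's actual code.

```python
import math
from typing import Optional


def prob_to_match_weight(probability: float) -> float:
    """Match weight as log2 of the odds p / (1 - p). Requires 0 < p < 1."""
    return math.log2(probability / (1 - probability))


def match_weight_to_prob(match_weight: float) -> float:
    """Inverse of prob_to_match_weight: odds = 2 ** weight, p = odds / (1 + odds)."""
    odds = 2.0 ** match_weight
    return odds / (1 + odds)


def resolve_threshold(
    threshold_match_probability: Optional[float] = None,
    threshold_match_weight: Optional[float] = None,
) -> Optional[float]:
    """Hypothetical helper: accept either threshold and return a probability.

    Raises ValueError if both are given. Returns None if neither is given.
    """
    if threshold_match_probability is not None and threshold_match_weight is not None:
        raise ValueError(
            "Provide threshold_match_probability or threshold_match_weight, not both"
        )
    if threshold_match_weight is not None:
        return match_weight_to_prob(threshold_match_weight)
    return threshold_match_probability
```

Under this definition a match weight of 0 corresponds to a match probability of 0.5, so `resolve_threshold(threshold_match_weight=0.0)` and `resolve_threshold(threshold_match_probability=0.5)` select the same edges.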

RobinL · 2024-11-25T08:44:09Z

A possible chart showing how the number of clusters varies with the match weight (mw) threshold:

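import altair as alt
import pandas as pd
from splink.clustering import cluster_pairwise_predictions_at_multiple_thresholds

# nodes, edges, db_api and thresholds_weights are assumed to be defined earlier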

cc = cluster_pairwise_predictions_at_multiple_thresholds(
    nodes,
    edges,
    node_id_column_name="my_id",
    db_api=db_api,
    # match_probability_thresholds=thresholds,
    match_weight_thresholds=thresholds_weights,
    output_cluster_summary_stats=True,
)


dc_df = cc.as_duckdbpyrelation().df()

# Define the options for the x-axis
x_axis_options = ['threshold_match_probability', 'threshold_match_weight']

# Create a selection parameter with radio buttons
x_axis_param = alt.param(
    name='x_field',
    bind=alt.binding_radio(options=x_axis_options, name='X-axis: '),
    value='threshold_match_probability'
)

# Base chart with dynamic x-axis based on the parameter
base_chart = (
    alt.Chart(dc_df)
    .transform_fold(
        fold=x_axis_options,
        as_=['variable', 'x_value']
    )
    .transform_filter(
        alt.datum.variable == x_axis_param
    )
    .add_params(x_axis_param)
    .encode(
        x=alt.X('x_value:Q', title='X-axis')
    )
    .properties(width=400, height=150)
)

# Define the subcharts
num_clusters = (
    base_chart.mark_line()
    .encode(
        y=alt.Y("num_clusters:Q", title="Number of Clusters")
    )
    .properties(title="Number of Clusters vs X-axis")
)

max_cluster_size = (
    base_chart.mark_line()
    .encode(
        y=alt.Y("max_cluster_size:Q", title="Max Cluster Size")
    )
    .properties(title="Maximum Cluster Size vs X-axis")
)

avg_cluster_size = (
    base_chart.mark_line()
    .encode(
        y=alt.Y("avg_cluster_size:Q", title="Average Cluster Size")
    )
    .properties(title="Average Cluster Size vs X-axis")
)

# Combine the charts
combined_chart = alt.vconcat(
    num_clusters, max_cluster_size, avg_cluster_size
).resolve_scale(y="independent")
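To make the plotted quantities concrete, the following is a small pure-Python sketch of what per-threshold summary statistics represent. It computes connected components with union-find at each match weight threshold. This is not Splink's implementation: the function name `cluster_summary_stats`, the `(id_l, id_r, match_weight)` edge layout, and the `>=` comparison are assumptions made for illustration. Only the output column names are taken from the chart code above.

```python
from collections import Counter


def cluster_summary_stats(node_ids, edges, match_weight_thresholds):
    """Summarise clusters at each threshold (illustrative only).

    node_ids: non-empty list of hashable ids.
    edges: iterable of (id_l, id_r, match_weight) tuples.
    """
    edges = list(edges)
    rows = []
    for threshold in match_weight_thresholds:
        # Every node starts in its own cluster
        parent = {n: n for n in node_ids}

        def find(x):
            # Follow parent pointers to the root, halving the path as we go
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        # Merge the clusters joined by any edge at or above the threshold
        for left, right, weight in edges:
            if weight >= threshold:
                parent[find(left)] = find(right)

        sizes = Counter(find(n) for n in node_ids)
        rows.append(
            {
                "threshold_match_weight": threshold,
                "num_clusters": len(sizes),
                "max_cluster_size": max(sizes.values()),
                "avg_cluster_size": len(node_ids) / len(sizes),
            }
        )
    return rows


example_rows = cluster_summary_stats(
    node_ids=[1, 2, 3, 4],
    edges=[(1, 2, 5.0), (2, 3, 1.0), (3, 4, -2.0)],
    match_weight_thresholds=[-3, 0, 3, 10],
)
```

As the threshold rises, fewer edges qualify, so the number of clusters can only stay the same or increase while the maximum cluster size can only stay the same or shrink.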

RobinL and others added 6 commits October 6, 2024 08:09

factor out logic

b78757d

allow match weight to be passed to clustering

fa6d0ad

cluster multiple threshold allow mw

d2929b5

allow match weight thresholds

29fc665

cast float

e274a76

threshold match weight can now be provided

3bf528f

RobinL added 2 commits November 25, 2024 09:12

fix bug in cluster at multiple

136391b

column names in cluster_pairwise_predictions_at_multiple_thresholds to reflect mw

76bd4f5

RobinL changed the title ~~(WIP) Clustering allows mw~~ (WIP) Clustering allows match weight args not just match probability Nov 25, 2024

RobinL added 3 commits November 25, 2024 09:53

fix annotations

14295a7

add tests

50f6000

add test that mw and cluster give same result

6e55df2

RobinL changed the title ~~(WIP) Clustering allows match weight args not just match probability~~ Clustering allows match weight args not just match probability Nov 25, 2024

RobinL and others added 7 commits November 25, 2024 13:46

Merge branch 'master' into clustering_allows_mw

348ad24

add additional persisted df

6996a2f

Use parquet rather than checkpoint to break lineage

71e8ba7

The reason for this is that checkpointing can cause the schema to be lost if there are zero rows, whereas parquet preserves the schema.

unique name for unstable reps

7db4cf2

try running tests with checkpoint

427e060

better function name

14d9f9b

fix ordering - turn arg to match prob earlier

3ceddde

RobinL merged commit 2aa78da into master Nov 26, 2024
25 checks passed

RobinL deleted the clustering_allows_mw branch November 26, 2024 13:56
