Support duckdbpyrelation as input type #2375

Merged
merged 6 commits into from
Sep 3, 2024

Conversation

RobinL
Member

@RobinL RobinL commented Sep 3, 2024

Currently we don't support passing a DuckDBPyRelation (the object that most closely resembles a dataframe in the duckdb python API) into the linker.

That is, you can't do:

in_df = con.read_parquet("mydf.parquet")
linker = Linker(in_df, settings, db_api)

This is bad because many users will instead go through pandas (e.g. df = pd.read_csv(), then pass df to Linker), which is inefficient and risks data typing issues.

Users currently can do this, but it's not documented:

linker = Linker("mydf.parquet", settings, db_api)

which under the hood uses duckdb natively to load in the parquet.

This is also not entirely desirable because it gives the user no control over how mydf.parquet is read by duckdb. This is particularly important for csv reads (e.g. passing mydf.csv), where the user may wish to specify types.

This PR adds support for passing in a DuckDBPyRelation.

Note that, since the existing code was treating a DuckDBPyRelation the same as a pandas DataFrame, and both support being passed to con.register, the existing code actually worked already.

The only thing that was broken was a separate, independent bug: we weren't ensuring that the input tables were a list.
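The fix itself isn't shown in this description. As a rough sketch of the kind of normalisation involved, assuming a hypothetical helper name (`ensure_is_list` here is illustrative, not necessarily what the PR uses):

```python
def ensure_is_list(input_tables):
    """Hypothetical helper: wrap a single input table in a list.

    A single table (a path string, a dataframe, a relation, ...) becomes a
    one-element list; a list or tuple of tables is returned as a list.
    """
    if isinstance(input_tables, (list, tuple)):
        return list(input_tables)
    return [input_tables]


single = ensure_is_list("mydf.parquet")
several = ensure_is_list(["left.parquet", "right.parquet"])
```

The explicit isinstance check avoids relying on whether the input object happens to be iterable, which matters for table-like objects and strings.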

I have also added support for retrieving a DuckDBPyRelation from the SplinkDataFrame outputs of the duckdb linker, using as_duckdbpyrelation.

Finally, I have allowed the duckdb connection to be specified as :default:. This is the connection duckdb uses when the user doesn't specify one, so for example when you run:

duckdb.read_parquet()

as opposed to

con.read_parquet()

the default connection is used.

This solves the problem that if you used duckdb.read_parquet() to load in a dataframe, and then passed it to a linker, you'd get an error:

InvalidInputException: Invalid Input Error: The relation you are attempting to register was not made from this connection

The solution is either:

con = duckdb.connect()
db_api = DuckDBAPI(connection=con)
df = con.read_parquet("myfile.parquet")
linker = Linker(df, settings, db_api)

or

df = duckdb.read_parquet("myfile.parquet")
db_api = DuckDBAPI(":default:")
linker = Linker(df, settings, db_api)

@RobinL
Member Author

RobinL commented Sep 3, 2024

Runnable example:

import duckdb

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

in_df = duckdb.read_parquet("del.parquet")
db_api = DuckDBAPI(connection=":default:")

# Create settings
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)


linker = Linker(in_df, settings, db_api)

pairwise_predictions = linker.inference.predict(threshold_match_weight=-5)
pairwise_predictions.as_duckdbpyrelation()

@RobinL RobinL requested a review from ADBond September 3, 2024 15:09
Contributor

@ADBond ADBond left a comment

Great! All feels like it's a bit neater as well.

@RobinL RobinL merged commit 8e41c4e into master Sep 3, 2024
25 checks passed
@RobinL RobinL deleted the duckdb_py_relation branch September 3, 2024 18:23