Support duckdbpyrelation as input type #2375
Merged
Currently we don't support passing a `DuckDBPyRelation` (the object that most closely resembles a dataframe in the duckdb Python API) into the linker. That is, you can't do:
This is bad because many users will go through pandas (e.g. `df = pd.read_csv()` and pass `df` to `Linker`), which is inefficient and risks data typing issues.

Users currently can do this, but it's not documented:
which under the hood uses duckdb natively to load in the parquet.
This is also not completely desirable because it doesn't give the user any control over how `mydf.parquet` is read by duckdb. This is particularly important for csv reads (e.g. passing `mydf.csv`), where the user may wish to specify types.

This PR adds support for passing in a `DuckDBPyRelation`.

Note that, since the existing code was treating a `DuckDBPyRelation` the same as a pandas DataFrame, and both support being passed to `con.register`, the existing code actually worked already. The only thing that was broken was a separate, independent bug where we weren't ensuring that the input tables were a list.
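To illustrate the list bug mentioned above, here is a minimal sketch of the kind of normalisation involved. The helper name `ensure_is_list` and its exact behaviour are assumptions made for illustration; the actual fix in the PR may look different.

```python
from typing import Any, List


def ensure_is_list(input_tables: Any) -> List[Any]:
    """Return the input tables as a list (hypothetical helper).

    A single table (a DataFrame, a relation, or a path string) is wrapped
    in a one-element list. Lists and tuples are returned as a new list.
    Strings are deliberately not iterated over, so a path stays one item.
    """
    if isinstance(input_tables, (list, tuple)):
        return list(input_tables)
    return [input_tables]


single = ensure_is_list("mydf.parquet")
several = ensure_is_list(["left.parquet", "right.parquet"])
```

With the input normalised this way, downstream code can loop over the tables without caring whether the user passed one table or several.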
I have also added support for retrieving a `DuckDBPyRelation` from SplinkDataFrame outputs from the duckdb linker with `as_duckdbpyrelation`.
Finally, I have allowed the duckdb connection to be specified as `:default:`. The default connection is the one duckdb uses when the user doesn't bother to specify a connection, so for example when you run:
as opposed to
the default connection is used.
This solves the problem that if you used `duckdb.read_parquet()` to load in a dataframe, and then passed it to a linker, you'd get an error:

The solution is either:
or