Adding DataChain.column(...) and fixing functions and types #226
Amazing change.
Comments are inline.
tests/unit/lib/test_datachain.py (outdated)
    ds1 = ds.mutate(new=ds.column("id") - 1)
    assert ds1.signals_schema.values["new"] is int
    ds1.save("ds1")
I don't see any value in saving this. For saving, you might need another unit test. Let's make these tests independent.
PS: I don't think you need a separate test for saving the int type :)
Refactored the tests (split them into several tests without save) and added a separate one with save.
Really nice implementation, thank you! One comment is inline.
Next steps to fully close #148 (not in this PR): we need to think about whether we can do a better job with untyped columns and do this conversion under the hood, like:
ds.mutate(col=(C("num") - 2) / C("count"))
-->
ds.mutate(col=(ds.column("num") - 2) / ds.column("count"))
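The conversion suggested above amounts to walking the expression and swapping each untyped column for a typed one taken from the schema. A minimal sketch of that idea on a toy expression tree, assuming hypothetical Col, Const and BinOp node classes that stand in for SQLAlchemy expressions (none of these names are DataChain or SQLAlchemy API):

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Col:
    """Hypothetical column node; type is None when the column is untyped."""
    name: str
    type: Optional[type] = None


@dataclass(frozen=True)
class Const:
    """Hypothetical literal node."""
    value: object


@dataclass(frozen=True)
class BinOp:
    """Hypothetical binary operation node."""
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Const, BinOp]


def bind_types(expr: Expr, schema: dict) -> Expr:
    """Return a copy of expr in which every Col takes its type from schema.

    Raises KeyError if a column name is not present in the schema.
    """
    if isinstance(expr, Col):
        return Col(expr.name, schema[expr.name])
    if isinstance(expr, BinOp):
        return BinOp(
            expr.op,
            bind_types(expr.left, schema),
            bind_types(expr.right, schema),
        )
    return expr  # constants are left untouched


def has_untyped_column(expr: Expr) -> bool:
    """True if any column anywhere in the tree has no type."""
    if isinstance(expr, Col):
        return expr.type is None
    if isinstance(expr, BinOp):
        return has_untyped_column(expr.left) or has_untyped_column(expr.right)
    return False


# (C("num") - 2) / C("count"), written with the toy nodes
untyped = BinOp("/", BinOp("-", Col("num"), Const(2)), Col("count"))
typed = bind_types(untyped, {"num": int, "count": int})
```

A real implementation would also have to descend into function arguments, which is the part flagged as uncertain in this thread.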
tests/unit/lib/test_datachain.py (outdated)
    def test_mutate_with_expression_without_type(catalog):
        with pytest.raises(CompileError):
It should be wrapped in a DataChain exception class, e.g. class DataChainColumnError(DataChainParamsError): ..., in the mutate() or signals_schema.mutate() method, I guess.
SQLAlchemy is an implementation detail and should not be exposed to users.
Done
Yes, this should be done in a follow-up. We need to traverse the expression. The problematic part could be traversing function arguments, as I'm not sure the API offers that, but I guess it should be possible.
…terative/datachain into ilongin/148-datachain-column-with-type
❤️
This PR tries to fix the general issue of .mutate() receiving an expression or column without a type, which ends in an error when .save() is called. A couple of things are added / fixed:
- .column(name: str) method in DataChain, which returns a Column instance with its type defined, looked up by column name in the current schema of that DataChain instance. SQLAlchemy expressions built from typed columns have an inferred type, while those built from untyped columns end up without one (the result is NullType), so .mutate() should use this method instead of a raw Column(...) when adding expressions for newly added columns.
- avg function with type float. Before, we were just re-importing this one from SQLAlchemy and it had no specific type when used in expressions, which left the whole expression without a type. Now it will always have type float. Other functions that are re-imported directly from SQLAlchemy seem to have a defined type in expressions and work fine in .mutate().
- sql_to_python() function. Before, this only checked a subset of all possible SQL types and failed if the type was, for example, INTEGER, which is DB-specific and different from the generic Integer.
- .mutate() with complex expressions
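The sql_to_python() failure described above can be illustrated with a small sketch. It assumes stand-in classes that mirror only the one relevant fact about SQLAlchemy's type hierarchy (INTEGER is a subclass of the generic Integer); the mapping and both lookup functions are hypothetical and are not the code from this PR:

```python
class Integer:
    """Stand-in for the generic SQLAlchemy Integer type."""


class INTEGER(Integer):
    """Stand-in for the uppercase INTEGER type, a subclass of Integer."""


class Float:
    """Stand-in for the generic Float type."""


class String:
    """Stand-in for the generic String type."""


# Hypothetical mapping keyed by generic types only.
GENERIC_TO_PYTHON = {Integer: int, Float: float, String: str}


def exact_lookup(sql_type):
    """Old-style check: matches only the exact class, so INTEGER is missed."""
    return GENERIC_TO_PYTHON.get(type(sql_type))


def sql_to_python(sql_type):
    """Walk the class hierarchy so subclasses resolve to their generic parent."""
    for klass in type(sql_type).__mro__:
        if klass in GENERIC_TO_PYTHON:
            return GENERIC_TO_PYTHON[klass]
    return None  # unknown type; the fallback policy is left open in this sketch
```

Here exact_lookup(INTEGER()) finds nothing, while sql_to_python(INTEGER()) resolves to int through the parent class.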