Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding DataChain.column(...) and fixing functions and types #226

Merged
merged 20 commits into from
Aug 6, 2024

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented Aug 2, 2024

This PR tries to fix general issue of .mutate() receiving expression or column without type, which ends up with error when .save() is called.

Couple of things are added / fixed:

  • New .column(name: str) method in DataChain which returns Column instance with defined type by some column name from current schema of that DataChain instance. SQLAlchemy expressions that are build with columns with types have inferred type and those with columns without type ends up without inferred type (its NullType as result) so .mutate() should use this method instead raw Column(...) when adding expressions for new added columns.
  • Redefining avg function with type float. Before, we were just re-importing this one from sqlalchemy and it didn't have specific type when used in expressions which would make whole expression without type. Now it will always have float type. Other functions that are re-imported directly from SQLAlchemy seems to have defined type in expressions and are working ok in .mutate() for example
  • Fixing sql_to_python() function. Before, this was only checking for a subset of all possible SQL types and was failing if type was for example INTEGER which is DB specific and different from generic Intteger.
  • Adding tests for .mutate() with complex expressions

@ilongin ilongin marked this pull request as draft August 2, 2024 09:45
Copy link

cloudflare-workers-and-pages bot commented Aug 2, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: e927da9
Status: ✅  Deploy successful!
Preview URL: https://53605818.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-148-datachain-column.datachain-documentation.pages.dev

View logs

@ilongin ilongin linked an issue Aug 2, 2024 that may be closed by this pull request
@ilongin ilongin marked this pull request as ready for review August 2, 2024 12:34
Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing change.

Comments are inline.

src/datachain/lib/dc.py Outdated Show resolved Hide resolved
tests/unit/lib/test_datachain.py Outdated Show resolved Hide resolved
tests/unit/lib/test_datachain.py Outdated Show resolved Hide resolved
tests/unit/lib/test_datachain.py Show resolved Hide resolved

ds1 = ds.mutate(new=ds.column("id") - 1)
assert ds1.signals_schema.values["new"] is int
ds1.save("ds1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any value in saving this. For saving - you might need another unit-test. Let's make these test independent.

PS: I don't think you need a separate test for saving int type :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored tests (split them to multiple ones without save) and added one separate with save

@ilongin ilongin requested a review from dmpetrov August 5, 2024 15:37
Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice implementation, thank you! One comment is inline.

Next steps in order to fully close #148 (not in this PR): We need to think if we can do a better job in case of not typed columns and do this conversion under the hood like:

ds.mutate(col=(C("num") - 2) / C("count"))
--> 
ds.mutate(col=(ds.column("num") - 2) / ds.column("count"))



def test_mutate_with_expression_without_type(catalog):
with pytest.raises(CompileError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be wrapped with DC exception class DataChainColumError(DataChainParamsError): .... In mutate() or signals_schema.mutate() methods, I guess.

Sqlalchemy is implementation detail and should not be exposed to users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ilongin
Copy link
Contributor Author

ilongin commented Aug 5, 2024

Really nice implementation, thank you! One comment is inline.

Next steps in order to fully close #148 (not in this PR): We need to think if we can do a better job in case of not typed columns and do this conversion under the hood like:

ds.mutate(col=(C("num") - 2) / C("count"))
--> 
ds.mutate(col=(ds.column("num") - 2) / ds.column("count"))

Yes, this should be done in followup. We need to traverse expression. Problematic part could be to traverse function arguments as I'm not sure they offer hat in API but I guess it should be possible

Copy link
Contributor

@dreadatour dreadatour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@ilongin ilongin merged commit bc7608b into main Aug 6, 2024
13 of 18 checks passed
@ilongin ilongin deleted the ilongin/148-datachain-column-with-type branch August 6, 2024 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Column type inference fails after using mutate to subtract two columns
3 participants