Remove storage from dataset query and refactor related codebase #367
Deploying datachain-documentation
Latest commit: 7dfeea9
Status: ✅ Deploy successful!
Preview URL: https://dd4aad5a.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-340-remove-storage-f.datachain-documentation.pages.dev
That's a lot of changes in a single PR; I wish it could be broken up.
Some comments:
- Why don't we just remove `Catalog.apply_udf()`? We don't actually use it. `index_tar()` only exists to support some tests; couldn't we get rid of it?
- Removing `Changed` is a good idea.
I know, but the problem was that listing from
Removed
I would leave it for now and remove in separate PR as I would like to add some
Looks good to me. It would be great to make sure all Studio tests pass before merging.
I'm rather uncomfortable with the general approach to `DatasetQuery` tests: we end up with a bunch of hacks to support a random mix of old- and new-style code, and I feel that they don't really test much that we actually care about any more.
I think we should rather try to port them to the new style, using `DataChain` and `DataModel` (or discard them if we already have a `DataChain` equivalent), and then use the hacks if there are any left.
tests/unit/test_data_storage.py (outdated)
@@ -31,6 +31,7 @@
@pytest.mark.parametrize("tree", [COMPLEX_TREE], indirect=True)
def test_dir_expansion(cloud_test_catalog, version_aware, cloud_type):
    pytest.skip("Skipping as dir expansion must be re-implemented in application level")
This test is quite useful. Without it, I fear that the dir expansion code will bitrot very quickly.
OK, I restored it. I needed to add a simple util function for the "old" schema with rooted file signals, as the current implementation of dir expansion only works with those and with the legacy CLI code that's still using them.
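For readers unfamiliar with the term, here is a minimal sketch of what dir expansion means in this thread, assuming it derives directory entries from a flat listing of file paths. The `expand_dirs` helper and the sample paths are hypothetical and only illustrate the idea; this is not the datachain implementation.

```python
from pathlib import PurePosixPath


def expand_dirs(file_paths):
    """Return every ancestor directory implied by the given file paths.

    Hypothetical helper for illustration only.
    """
    dirs = set()
    for path in file_paths:
        # The parents of "a/b/c.txt" are "a/b", "a" and "."; skip the "." root.
        for parent in PurePosixPath(path).parents:
            if str(parent) != ".":
                dirs.add(str(parent))
    return sorted(dirs)


# Sample listing (made-up paths): files only, no directory entries.
files = ["cats/cat1.jpg", "dogs/others/dog4.jpg", "description"]
print(expand_dirs(files))
```

A test such as `test_dir_expansion` would then check that a listing containing only files still yields the expected directory entries.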
And I think that porting or removing the tests should be done first, i.e. that this should wait for #438.
Just when I thought I was done with resolving conflicts :D ... OK, I will wait for that. I guess it's better than Matt doing the rebasing and ending up removing a bunch of code anyway.
Removes `IndexingStep` from `DatasetQuery`, as this part of the codebase will not be reachable any more.
Since we use listing / indexing from `DatasetQuery` a lot in tests and related code, this PR needed to refactor those as well. In those places we now use `DataChain.from_storage(...)`.
Changes in more detail:
- Using `DataChain.from_storage()` instead of `DatasetQuery` listing in `Catalog.create_dataset_from_source()` and `Catalog.apply_udf()`
- Added `signal_name` to `Catalog.apply_udf()` to name the new signal that the UDF creates
- Removed `add_storage_dependency` etc.
- Removed `DatasetQuery.changed()`: it is not used in `DataChain` at all, so it's "dead" code. It also uses the old listing columns, so it wouldn't work anyway. We can easily bring it back if there is a need
- Changed `index_tar` from builtins to work with the new feature file schema. Currently hardcoded to use the `file` prefix.
- Merged `DatasetQuery.subtract()` and `DatasetQuery._subtract()`: no need to have two
- Changed `listed_bucket` from `conftest` to use the new listing
- Changed `test_dataset_query.py` to use the new listing with `DataChain.from_storage()` and the new file-based schema instead of deprecated columns
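As a rough illustration of the `subtract()` / `_subtract()` item, the operation can be pictured as a single anti-join helper instead of a public method that delegates to a private one. This is only a sketch under assumptions: rows are modeled as plain dicts, and the `on` parameter and the `path` column are hypothetical, not the real `DatasetQuery` signature.

```python
def subtract(rows, other, on=("path",)):
    """Return the rows whose key columns match no row in `other`.

    Hypothetical stand-in for illustration; not the DatasetQuery API.
    """
    # Collect the key tuples present in the dataset being subtracted.
    seen = {tuple(row[col] for col in on) for row in other}
    # Keep only rows whose key tuple was not seen.
    return [row for row in rows if tuple(row[col] for col in on) not in seen]


rows = [{"path": "cats/cat1.jpg"}, {"path": "dogs/dog1.jpg"}]
other = [{"path": "dogs/dog1.jpg"}]
print(subtract(rows, other))
```

With one entry point, the key columns are chosen in a single place, which is the sense in which a second method is unnecessary.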