Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove storage from dataset query and refactor related codebase #367

Merged
merged 62 commits into from
Sep 17, 2024

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented Aug 28, 2024

Removes IndexingStep from DatasetQuery as this part of the codebase will not be reachable any more.
Since we use listing / indexing from DatasetQuery a lot in tests and related code, this PR needed to refactor those as well. In those places we now use DataChain.from_storage(...)

Changes in more details:

  1. Using DataChain.from_storage() instead of DatasetQuery listing in functions: Catalog.create_dataset_from_source() and Catalog.apply_udf()
  2. Added new argument signal_name to Catalog.apply_udf() for naming of the new signal that will be created with UDF
  3. Refactor dataset dependency insert functions - removed add_storage_dependency etc.
  4. Removed DatasetQuery.changed() -> this one is not used in DataChain at all so it's "dead" code. It also uses old listing columns so it won't work anyway. We can easily return this back if there will be a need
  5. Refactoring index_tar from builtins to work with new feature file schema. Currently hardcoded to use file prefix.
  6. Unified DatasetQuery.subtract() and DatasetQuery._subtract() -> no need to have two
  7. Stopped using old listings in tests - refactor listed_bucket from conftest to use new listing
  8. Refactoring test_dataset_query.py to use new listing with DataChain.from_storage() and to use new file based schema instead deprecated columns
  9. Refactor other tests to use new listing and file based schema
  10. Skipping dir expansion algorithm test for now as it will be moved to application level and tests will need refactoring

@ilongin ilongin changed the base branch from main to ilongin/329-refactor-storages August 28, 2024 01:27
@ilongin ilongin marked this pull request as draft August 28, 2024 01:27
@ilongin ilongin linked an issue Aug 28, 2024 that may be closed by this pull request
Copy link

cloudflare-workers-and-pages bot commented Aug 28, 2024

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 7dfeea9
Status: ✅  Deploy successful!
Preview URL: https://dd4aad5a.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-340-remove-storage-f.datachain-documentation.pages.dev

View logs

@ilongin ilongin force-pushed the ilongin/329-refactor-storages branch from 2e125ae to 8537347 Compare September 2, 2024 12:52
Base automatically changed from ilongin/329-refactor-storages to main September 5, 2024 09:00
Copy link
Contributor

@rlamy rlamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a lot of changes in a single PR, I wish it could be broken up.

Some comments:

  • Why don't we just remove Catalog.apply_udf()? We don't actually use it.
  • index_tar() only exists to support some tests, couldn't we get rid of it?
  • Removing Changed is a good idea.

@ilongin
Copy link
Contributor Author

ilongin commented Sep 12, 2024

That's a lot of changes in a single PR, I wish it could be broken up.

I know, but the problem was that listing from DatasetQuery was used in a lot of places (mostly tests) so needed to change it all together

*Why don't we just remove Catalog.apply_udf()? We don't actually use it.

Removed

index_tar() only exists to support some tests, couldn't we get rid of it?

I would leave it for now and remove in separate PR as I would like to add some DataChain test with tar (without this low level index_tar thing) as well, so I don't want to extend this already too big PR.

@ilongin ilongin requested a review from rlamy September 12, 2024 01:54
@ilongin ilongin requested a review from dreadatour September 12, 2024 12:05
Copy link
Contributor

@dreadatour dreadatour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good to me, it will be great to make sure if all Studio tests will pass before merge.

Copy link
Contributor

@rlamy rlamy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm rather uncomfortable with the general approach to DatasetQuery tests: we end up with a bunch of hacks to support a random mix of old- and new-style code, and I feel that they don't really test much that we actually care about any more.
I think we should rather try to port them to the new-style, using DataChain and DataModel (or discard them if we already have a `DataChain equivalent), and then use the hacks if there are any left.

@@ -31,6 +31,7 @@

@pytest.mark.parametrize("tree", [COMPLEX_TREE], indirect=True)
def test_dir_expansion(cloud_test_catalog, version_aware, cloud_type):
pytest.skip("Skipping as dir expansion must be re-implemented in application level")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is quite useful. Without it, I fear that the dir expansion code willl bitrot very quickly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, returned it. Needed to add simple util function to "old" schema with rooted file signals as current implementation of dir expansion only works with those and legacy CLI code that's still using them.

@ilongin
Copy link
Contributor Author

ilongin commented Sep 12, 2024

I'm rather uncomfortable with the general approach to DatasetQuery tests: we end up with a bunch of hacks to support a random mix of old- and new-style code, and I feel that they don't really test much that we actually care about any more. I think we should rather try to port them to the new-style, using DataChain and DataModel (or discard them if we already have a `DataChain equivalent), and then use the hacks if there are any left.

DatasetQuery should ideally be agnostic about our schema, but it's more handy if we use our current file based one.
Note that the goal of this PR was to remove dead indexing / listing code from DatasetQuery that has only been used in tests so tests needed some refactoring.
I agree that we should think about removing them or porting to DataChain but that can be done in separate PR.

@ilongin ilongin requested a review from rlamy September 12, 2024 23:55
@rlamy
Copy link
Contributor

rlamy commented Sep 13, 2024

I agree that we should think about removing them or porting to DataChain but that can be done in separate PR.

And I think that porting or removing the tests should be done first, i.e. that this should wait for #438.

@ilongin
Copy link
Contributor Author

ilongin commented Sep 13, 2024

I agree that we should think about removing them or porting to DataChain but that can be done in separate PR.

And I think that porting or removing the tests should be done first, i.e. that this should wait for #438.

Just when I thought I'm done with resolving conflicts :D ... Ok, I will wait for that .. I guess it's better then Matt doing rebasing and ending up removing bunch of code anyway.

@ilongin ilongin merged commit 0abafcd into main Sep 17, 2024
37 of 38 checks passed
@ilongin ilongin deleted the ilongin/340-remove-storage-from-dataset-query branch September 17, 2024 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove storages from DatasetQuery
3 participants