Remove storage from dataset query and refactor related codebase #367

ilongin · 2024-08-28T01:27:01Z

Removes IndexingStep from DatasetQuery, as this part of the codebase will no longer be reachable.
Since listing / indexing from DatasetQuery is used a lot in tests and related code, this PR needed to refactor those as well. In those places we now use DataChain.from_storage(...)

Changes in more detail:

Using DataChain.from_storage() instead of DatasetQuery listing in Catalog.create_dataset_from_source() and Catalog.apply_udf()
Added a new argument, signal_name, to Catalog.apply_udf() for naming the new signal that the UDF creates
Refactored dataset dependency insert functions - removed add_storage_dependency etc.
Removed DatasetQuery.changed() -> it is not used in DataChain at all, so it is "dead" code. It also uses old listing columns, so it would not work anyway. We can easily bring it back if needed
Refactored index_tar from builtins to work with the new feature file schema. Currently hardcoded to use the file prefix.
Unified DatasetQuery.subtract() and DatasetQuery._subtract() -> no need to have two
Stopped using old listings in tests - refactored listed_bucket from conftest to use the new listing
Refactored test_dataset_query.py to use the new listing with DataChain.from_storage() and the new file-based schema instead of deprecated columns
Refactored other tests to use the new listing and file-based schema
Skipping the dir expansion algorithm test for now, as it will be moved to the application level and the tests will need refactoring
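
For readers unfamiliar with the term in the last item, here is a minimal sketch of the idea behind dir expansion, assuming it means deriving the directory entries implied by a flat listing of file paths. It is not the project's implementation, which this PR does not show; the function name `expand_dirs` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def expand_dirs(file_paths):
    """Return sorted (path, is_dir) pairs for the given files plus
    every parent directory implied by their paths."""
    files = {str(PurePosixPath(p)) for p in file_paths}
    dirs = set()
    for path in files:
        for parent in PurePosixPath(path).parents:
            # PurePosixPath("a/b").parents ends with ".", which is the root
            # of the listing and not a directory entry of its own.
            if str(parent) != ".":
                dirs.add(str(parent))
    # A path that is listed as a file is reported as a file.
    dirs -= files
    entries = [(p, False) for p in files] + [(d, True) for d in dirs]
    return sorted(entries)


if __name__ == "__main__":
    listing = ["cats/cat1", "cats/cat2", "description", "dogs/others/dog4"]
    for path, is_dir in expand_dirs(listing):
        print(("dir " if is_dir else "file"), path)
```

A test for a real implementation would typically compare such derived directory entries against a known tree, which is roughly what a parametrized `test_dir_expansion` would need to keep checking.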

cloudflare-workers-and-pages · 2024-08-28T01:28:13Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`7dfeea9`
Status:	✅ Deploy successful!
Preview URL:	https://dd4aad5a.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-340-remove-storage-f.datachain-documentation.pages.dev


rlamy

That's a lot of changes in a single PR, I wish it could be broken up.

Some comments:

Why don't we just remove Catalog.apply_udf()? We don't actually use it.
index_tar() only exists to support some tests, couldn't we get rid of it?
Removing Changed is a good idea.

ilongin · 2024-09-12T01:53:41Z

> That's a lot of changes in a single PR, I wish it could be broken up.

I know, but the problem was that listing from DatasetQuery was used in a lot of places (mostly tests), so I needed to change it all together

> Why don't we just remove Catalog.apply_udf()? We don't actually use it.

Removed

> index_tar() only exists to support some tests, couldn't we get rid of it?

I would leave it for now and remove it in a separate PR, as I would like to add a DataChain test with tar (without this low-level index_tar thing) as well, and I don't want to extend this already too big PR.

dreadatour

Looks good to me. It would be great to make sure all Studio tests pass before merging.

rlamy

I'm rather uncomfortable with the general approach to DatasetQuery tests: we end up with a bunch of hacks to support a random mix of old- and new-style code, and I feel that they don't really test much that we actually care about any more.
I think we should rather try to port them to the new style, using DataChain and DataModel (or discard them if we already have a `DataChain` equivalent), and then use the hacks if there are any left.

src/datachain/catalog/catalog.py

src/datachain/cli.py

tests/func/test_catalog.py

tests/func/test_query.py

rlamy · 2024-09-12T14:41:59Z

tests/unit/test_data_storage.py

@@ -31,6 +31,7 @@

 @pytest.mark.parametrize("tree", [COMPLEX_TREE], indirect=True)
 def test_dir_expansion(cloud_test_catalog, version_aware, cloud_type):
+    pytest.skip("Skipping as dir expansion must be re-implemented in application level")


This test is quite useful. Without it, I fear that the dir expansion code will bitrot very quickly.

Ok, returned it. I needed to add a simple util function for the "old" schema with rooted file signals, since the current implementation of dir expansion only works with those and with the legacy CLI code that still uses them.
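
As an illustration only, a utility of this kind might look like the sketch below. It assumes that nested file signals are stored as flat columns carrying a `file__` prefix and that "rooted" means the same columns without that prefix. The helper name `to_rooted_columns`, the prefix, and the column names are assumptions made for this sketch, not code from this PR.

```python
FILE_PREFIX = "file__"  # assumed prefix for nested file signal columns


def to_rooted_columns(row, prefix=FILE_PREFIX):
    """Return a copy of `row` in which keys starting with `prefix` have
    the prefix stripped; all other keys are kept unchanged."""
    rooted = {}
    for key, value in row.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in rooted:
            # Refuse to silently overwrite an existing column.
            raise ValueError(f"column name collision on {new_key!r}")
        rooted[new_key] = value
    return rooted


if __name__ == "__main__":
    row = {"sys__id": 1, "file__path": "cats/cat1", "file__size": 4}
    print(to_rooted_columns(row))
```

Raising on a collision rather than overwriting is a design choice of this sketch: a dataset that already has a rooted `path` column next to `file__path` would otherwise lose data silently.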

ilongin · 2024-09-12T23:55:13Z

> I'm rather uncomfortable with the general approach to DatasetQuery tests: we end up with a bunch of hacks to support a random mix of old- and new-style code, and I feel that they don't really test much that we actually care about any more. I think we should rather try to port them to the new-style, using DataChain and DataModel (or discard them if we already have a `DataChain` equivalent), and then use the hacks if there are any left.

DatasetQuery should ideally be agnostic about our schema, but it's more handy if we use our current file based one.
Note that the goal of this PR was to remove dead indexing / listing code from DatasetQuery that has only been used in tests, so the tests needed some refactoring.
I agree that we should think about removing them or porting them to DataChain, but that can be done in a separate PR.

rlamy · 2024-09-13T11:59:49Z

> I agree that we should think about removing them or porting to DataChain but that can be done in separate PR.

And I think that porting or removing the tests should be done first, i.e. that this should wait for #438.

ilongin · 2024-09-13T13:20:10Z

> > I agree that we should think about removing them or porting to DataChain but that can be done in separate PR.
>
> And I think that porting or removing the tests should be done first, i.e. that this should wait for #438.

Just when I thought I was done resolving conflicts :D ... Ok, I will wait for that. I guess it's better than Matt doing the rebasing and ending up removing a bunch of code anyway.

ilongin changed the base branch from main to ilongin/329-refactor-storages August 28, 2024 01:27

ilongin marked this pull request as draft August 28, 2024 01:27

ilongin linked an issue Aug 28, 2024 that may be closed by this pull request

Remove storages from DatasetQuery #340

Closed

ilongin force-pushed the ilongin/329-refactor-storages branch from 2e125ae to 8537347 Compare September 2, 2024 12:52

Base automatically changed from ilongin/329-refactor-storages to main September 5, 2024 09:00

ilongin added 24 commits September 5, 2024 12:11

first version of from_storage without deprecated listing

c7af79f

first version of from_storage without deprecated listing

4411ecf

fixing tests and removing prints, refactoring

5b05dfa

refactoring listing static methods

519150e

fixing non recursive queries

74f8726

using ctc in test session

fb30121

fixing json

5e049e0

fixing windows tests

c9f4bf8

returning to all tests

782badd

added session on cloud test catalog and refactoring tests

6e7b4db

refactoring and fixing tests

355ff79

fixing apply_udf and its tests

df0cac7

refactoring tests and related codebase

d6a2e9c

first version of from_storage without deprecated listing

caf8c45

first version of from_storage without deprecated listing

2d788c4

fixing tests and removing prints, refactoring

f3a8a12

refactoring listing static methods

62bb15f

fixing non recursive queries

218e088

using ctc in test session

afe0609

fixing json

f9a033e

removed not needed catalog storage methods and their related codebase

985f918

fixing windows tests

c04da97

returning to all tests

ca4fd38

fixing dataset dependencies

1613abd

ilongin added 2 commits September 11, 2024 14:14

fixing test

9161d1d

resolving conflicts

1048844

rlamy reviewed Sep 11, 2024


ilongin added 5 commits September 11, 2024 16:53

fixing tests

da5c7c5

fixing test

d8aeee6

Merge branch 'main' into ilongin/340-remove-storage-from-dataset-query

f670c10

removing apply udf

e5e9acc

remove parallel

01135ba

Merge branch 'main' into ilongin/340-remove-storage-from-dataset-query

63867ba

ilongin requested a review from rlamy September 12, 2024 01:54

removed -vvv from tests

5f02cdc

ilongin requested a review from dreadatour September 12, 2024 12:05

merging with main

1319724

dreadatour approved these changes Sep 12, 2024


rlamy reviewed Sep 12, 2024


ilongin added 3 commits September 12, 2024 16:52

removing not used functions and related tests

09a54c7

returned skipped test, returned output assert, fix print in CLI

df72597

returned dir expansion test

9baf069

ilongin requested a review from rlamy September 12, 2024 23:55

ilongin added 2 commits September 13, 2024 01:56

merging with main

9942335

merging with main

1722105

rlamy mentioned this pull request Sep 13, 2024

Convert index_tar to a new-style generator #439

Closed

ilongin added 2 commits September 17, 2024 12:16

merging with main

6e4efab

removing not needed method and added one test

7dfeea9

ilongin merged commit 0abafcd into main Sep 17, 2024
37 of 38 checks passed

ilongin deleted the ilongin/340-remove-storage-from-dataset-query branch September 17, 2024 14:10
