Implement sys feature, and rename id/random columns #28

skshetry · 2024-07-12T15:23:05Z

Check individual commits.

dmpetrov · 2024-07-12T16:23:28Z

It feels like the logic should be the opposite: the fields should be excluded by default everywhere, and users can enable them in specific API calls such as from_storage(sys=True) or from_dataset(sys=True).

skshetry · 2024-07-12T16:43:25Z

We have a lot of internals that depend on it. E.g., all of the indexing queries, UDFs, and joins require id columns. We also order by id in several places.

We use the random column for chunking, etc.

Another thing is, we try to preserve data as much as possible. E.g.:

DatasetQuery(name="test").save()

This preserves all of the columns, including id and random; we don't generate anything.

Similarly, DatasetQuery.generate() adds a column, but it preserves id and random.

skshetry · 2024-07-12T16:47:00Z

Ideally, we need a notion of default columns, in which select() returns all columns except id and random unless specifically asked.

DatasetQuery.select(C.id, C.random, *query.columns).results()
# contains `id` and `random` columns

DatasetQuery.results()
# excludes `id` and `random` columns

But this is very hard to differentiate at the query level, as we might get columns from other queries, or tables.
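
A minimal sketch of the "default columns" idea described above, written in plain Python rather than against DataChain's actual query API. The names `SYS_COLUMNS` and `visible_columns` are hypothetical and exist only for illustration; the sketch assumes the column list is already known, which is exactly the part that is hard at the query level.

```python
# Hypothetical illustration only: not DataChain's implementation.
# Column names follow this discussion (`id`, `random`).
SYS_COLUMNS = ("id", "random")


def visible_columns(all_columns, requested=None):
    """Return the columns to expose to the caller.

    If the caller requested columns explicitly, honor that list as-is
    (system columns included). Otherwise hide the system columns.
    """
    if requested is not None:
        return list(requested)
    return [c for c in all_columns if c not in SYS_COLUMNS]


columns = ["id", "random", "name", "size"]

# Default view: system columns are filtered out.
default_view = visible_columns(columns)

# Explicit request: system columns are returned because they were asked for.
explicit_view = visible_columns(columns, requested=["id", "random", "name", "size"])
```

The filter only works when one place knows the full column list; once columns arrive from other queries or tables, that knowledge is no longer centralized.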

dmpetrov · 2024-07-13T05:10:52Z

I'm talking about user facing API.

Internally we have to always save the columns.
For users - only when these sys columns were requested directly.

Practically, it means the request should look like dc.map(res=lambda sys, file: ...) or dc.map(res=func(), params=["sys"])

If we decide to do that in the lower level calls (which is optional until we can make it in map()): @udf(params=("sys__id") ...).
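
One way to picture this opt-in behavior is a mapper that inspects the function's parameter names and passes `sys` only when the function asks for it. This is a rough sketch in plain Python, not DataChain's `map()`; `map_rows` and the row layout are assumptions made for illustration.

```python
import inspect


def map_rows(rows, func):
    """Call `func` on each row, passing only the fields it names as parameters.

    Hypothetical helper: a function that does not declare `sys` never sees
    the system fields, even though every row carries them internally.
    """
    params = list(inspect.signature(func).parameters)
    return [func(**{name: row[name] for name in params}) for row in rows]


# Rows always carry the system fields internally.
rows = [
    {"sys": {"id": 1, "random": 42}, "file": "a.txt"},
    {"sys": {"id": 2, "random": 7}, "file": "b.txt"},
]

# The function does not mention `sys`, so it is not passed.
without_sys = map_rows(rows, lambda file: file.upper())

# The function names `sys` explicitly, so it receives the system fields.
with_sys = map_rows(rows, lambda sys, file: (sys["id"], file))
```

Here the system fields are always stored, and whether the user sees them depends only on the signature of the function they supply.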

dmpetrov

Looks good! thank you.

src/datachain/lib/dc.py

dmpetrov · 2024-07-16T08:31:47Z

src/datachain/lib/dc.py

+        with super().select(*db_signals).as_iterable() as rows:
+            yield from rows
+
+    def results(


Shouldn't it be collect_flatten()?

What should we do for existing results and to_records?

dmpetrov · 2024-07-16T08:35:51Z

src/datachain/lib/dc.py

+        self, row_factory: Optional[Callable] = None, **kwargs
+    ) -> list[tuple[Any, ...]]:
+        rows = self.iterate_flatten()
+        if row_factory:


Do we need this? It looks smart but the usage is not clear. If it's needed - we should probably introduce this in other methods.

It's just trying to imitate what we do in DatasetQuery.results(). It also makes it easier to implement to_records()?

dmpetrov · 2024-07-16T08:37:36Z

src/datachain/query/dataset.py

@@ -1515,30 +1506,35 @@ def offset(self, offset: int) -> "Self":
        query.steps.append(SQLOffset(offset))
        return query

+    def as_scalar(self) -> Any:


dmpetrov mentioned this pull request Jul 13, 2024

To pandas - hierarchical multi header #22

Merged

skshetry marked this pull request as ready for review July 15, 2024 11:24

skshetry changed the title ~~exclude sys columns on results()~~ rename sys columns; exclude them on results() etc. APIs Jul 15, 2024

skshetry requested a review from a team July 15, 2024 11:27

rlamy approved these changes Jul 15, 2024

dmpetrov approved these changes Jul 16, 2024

skshetry changed the title ~~rename sys columns; exclude them on results() etc. APIs~~ Implement sys feature, and rename id/random columns Jul 16, 2024

skshetry commented Jul 16, 2024

tests/unit/lib/test_datachain.py (resolved)

rlamy reviewed Jul 16, 2024

pyproject.toml (outdated, resolved)

rlamy reviewed Jul 16, 2024

src/datachain/data_storage/warehouse.py (outdated, resolved)

rlamy mentioned this pull request Jul 16, 2024

Remove legacy file signals in DataChain.from_storage() #32

Closed

skshetry added 3 commits July 17, 2024 09:31

rename id and random columns to sys__id and sys__random

5933f92

add support for Sys feature schema

4e09d51

implement settings(include_sys=True|False) API and add tests

1289885

skshetry merged commit 5cf20d3 into iterative:main Jul 17, 2024
11 of 12 checks passed

skshetry deleted the exclude_sys_cols branch July 17, 2024 04:38
