always include sys signals #81

skshetry · 2024-07-18T10:58:39Z

This PR:

- Renames DatasetQuery.results() and DatasetQuery.to_records() to db_results() and to_db_records().
- Always adds sys signals during select(...), and removes them in as_iterable.
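For readers unfamiliar with the pattern, here is a minimal sketch of "always select sys signals, strip them on iteration". It is a toy model, not the datachain implementation: the `ToyChain` class and the column names `sys__id` and `sys__rand` are assumptions made for illustration.

```python
# Hypothetical names for the technical columns; assumed for this sketch only.
SYS_COLUMNS = ("sys__id", "sys__rand")


class ToyChain:
    """Toy stand-in for a chain object; not the real DataChain class."""

    def __init__(self, rows, columns):
        self.rows = rows        # list of dicts, each holding every column
        self.columns = columns  # currently selected columns, sys ones included

    def select(self, *names):
        # Always carry the sys columns along, whatever the caller asked for.
        kept = SYS_COLUMNS + tuple(n for n in names if n not in SYS_COLUMNS)
        return ToyChain(self.rows, kept)

    def as_iterable(self):
        # Strip the sys columns only where rows are handed to the user.
        visible = [c for c in self.columns if c not in SYS_COLUMNS]
        for row in self.rows:
            yield tuple(row[c] for c in visible)


rows = [
    {"sys__id": 1, "sys__rand": 42, "name": "a", "size": 10},
    {"sys__id": 2, "sys__rand": 7, "name": "b", "size": 20},
]
chain = ToyChain(rows, tuple(rows[0])).select("name")
```

In this sketch the intermediate query still has `sys__id` and `sys__rand` available for later steps, while `list(chain.as_iterable())` yields only the user-visible `name` values.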

TODO

cleanup

src/datachain/lib/dc.py

rlamy

LGTM so far. I guess it makes sense to handle this only in DataChain, but then we should try to remove all special handling for sys columns from DatasetQuery.

I don't like that _select() and select() are so similar. It would probably be better to add an undocumented flag, but feel free to clean up however you like.

dmpetrov

A question is inline. Please answer.

In general, that's a very interesting idea, but I need time to think a bit more about this. What I'm not sure about right now:

- id & rand are technical signals that are required for map and shuffle. I'm not sure we need to expose this at the app level. It feels like it belongs at the DB level.
- This approach requires downloading ID & RAND all the time. In some cases it triples the amount of data to transfer. At the DB level it can be easily optimized.
- An open question: can we make RAND optional, so that it's created only when the user asks? It will require separate datasets and a shuffled/randomized dataset.

src/datachain/lib/dc.py

skshetry · 2024-07-19T01:26:35Z

> id & rand are technical signals that are required for map and shuffle. I'm not sure we need to expose this at the app level. It feels like it belongs at the DB level.

It might be hard to do at the DB level because of subqueries. We won't always know if id and rand should be added to select() or not.

> An open question: can we make RAND optional, so that it's created only when the user asks? It will require separate datasets and a shuffled/randomized dataset.

Chunking for UDFs also requires rand.
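As a rough illustration of why chunking can depend on a stored random value, the sketch below buckets rows by `rand % n_chunks`. This is an assumption about the general technique, not a description of how datachain chunks UDF input; `split_into_chunks` and the `rand` key are hypothetical names.

```python
def split_into_chunks(rows, n_chunks):
    """Bucket rows by their stored random value (hypothetical helper)."""
    chunks = [[] for _ in range(n_chunks)]
    for row in rows:
        # The bucket depends only on the row's own rand value, so it stays
        # the same no matter how the rows are ordered or filtered upstream.
        chunks[row["rand"] % n_chunks].append(row)
    return chunks


rows = [{"id": i, "rand": i} for i in range(6)]
chunks = split_into_chunks(rows, 3)
```

Because the assignment uses a per-row value rather than row position, each worker could compute its own chunk independently, which is one reason such a column is convenient to keep around.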

dmpetrov · 2024-07-19T04:51:43Z

> We won't always know if id and rand should be added to select() or not.

We do know! It should be added in 100% of the cases 🙂

discussed privately

skshetry marked this pull request as ready for review July 18, 2024 14:13

skshetry mentioned this pull request Jul 18, 2024

Remove legacy signals in from_storage() #72

Merged

skshetry requested a review from rlamy July 18, 2024 14:15

rlamy approved these changes Jul 18, 2024

dmpetrov previously requested changes Jul 18, 2024

skshetry added 2 commits July 19, 2024 14:44

rename DatasetQuery.results/to_records to db_results/to_db_records

8a7a50d

always select sys signals by default

4258be1

skshetry mentioned this pull request Jul 19, 2024

Added input params to distinct() #96

Merged

skshetry requested a review from dmpetrov July 19, 2024 10:14

cleanup

c5de917

skshetry merged commit dcc08c4 into iterative:main Jul 19, 2024
11 checks passed

skshetry deleted the fix-sys branch July 19, 2024 10:28
