Implement sys feature, and rename id/random columns #28

Merged
merged 3 commits into from
Jul 17, 2024

Conversation

skshetry
Member

@skshetry skshetry commented Jul 12, 2024

Check individual commits.

@dmpetrov
Member

It feels like the logic should be the opposite: the fields should be excluded by default everywhere, and the user can enable them in specific API calls such as from_storage(sys=True) or from_dataset(sys=True).

@skshetry
Member Author

We have a lot of internals that depend on these columns. For example, all of the indexing queries, UDFs, and joins require id columns. We also order by id in several places.

We use the random column for chunking, etc.

Another thing is that we try to preserve data as much as possible. For example:

DatasetQuery(name="test").save()

This preserves all of the columns, including id and random; we don't generate anything.

Similarly, DatasetQuery.generate() adds a column but preserves id and random.

@skshetry
Member Author

skshetry commented Jul 12, 2024

Ideally, we need a notion of default columns, where select() returns all columns except id and random unless they are specifically requested.

DatasetQuery.select(C.id, C.random, *query.columns).results()
# contains `id` and `random` columns
DatasetQuery.results()
# excludes `id` and `random` columns

But this is very hard to differentiate at the query level, as we might get columns from other queries, or tables.
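
To make the "default columns" idea above concrete, here is a minimal sketch that works on plain column names only. It deliberately sidesteps the hard part mentioned above (columns arriving from other queries or tables), and `SYS_COLUMNS` and `visible_columns` are hypothetical names for illustration, not part of the DatasetQuery API.

```python
# Hypothetical helper; the column names "id" and "random" come from the discussion.
SYS_COLUMNS = ("id", "random")


def visible_columns(all_columns, requested=None):
    """Return the columns a result set should expose.

    With no explicit request, hide the system columns. With an explicit
    request, return exactly what was asked for, in the requested order.
    """
    if requested:
        missing = [c for c in requested if c not in all_columns]
        if missing:
            raise KeyError(f"unknown columns: {missing}")
        return list(requested)
    return [c for c in all_columns if c not in SYS_COLUMNS]


columns = ["id", "random", "name", "size"]
print(visible_columns(columns))  # ['name', 'size']
print(visible_columns(columns, ["id", "random", "name"]))  # ['id', 'random', 'name']
```

An explicit request wins over the default, so internal callers that need id or random can still ask for them by name.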

@dmpetrov
Member

I'm talking about the user-facing API.

  1. Internally, we always have to save the columns.
  2. For users, the sys columns should appear only when they are requested directly.

Practically, it means the request should look like dc.map(res=lambda sys, file: ..., ) or dc.map(res=func(), params=["sys"])

If we decide to do that in the lower-level calls (which is optional until we can support it in map()): @udf(params=("sys__id") ...).
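
One way to read the opt-in proposal above is that the callable's own parameter names (or an explicit `params` list) decide which fields it receives, so `sys` is passed only when asked for. The sketch below illustrates that with the standard `inspect` module; `requested_params` and `apply_row` are hypothetical helpers and the dict-shaped row is an assumption, not how map() is actually implemented.

```python
import inspect


def requested_params(func, params=None):
    """Names of the fields a callable asks for.

    An explicit `params` list wins; otherwise the names are read from the
    callable's signature.
    """
    if params is not None:
        return list(params)
    return list(inspect.signature(func).parameters)


def apply_row(func, row, params=None):
    """Call `func` with only the requested fields of `row` (a plain dict here)."""
    names = requested_params(func, params)
    return func(*(row[name] for name in names))


# Assumed row shape: the sys fields grouped under a single "sys" key.
row = {"sys": {"id": 1, "random": 42}, "file": "a.txt"}

print(apply_row(lambda file: file.upper(), row))  # A.TXT
print(apply_row(lambda sys, file: (sys["id"], file), row))  # (1, 'a.txt')
```

A function that never names `sys` never sees it, which matches the "only when requested directly" rule for users while the columns stay stored internally.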

@skshetry skshetry marked this pull request as ready for review July 15, 2024 11:24
@skshetry skshetry changed the title exclude sys columns on results() rename sys columns; exclude them on results() etc. APIs Jul 15, 2024
@skshetry skshetry requested a review from a team July 15, 2024 11:27
Member

@dmpetrov dmpetrov left a comment

Looks good! Thank you.

        with super().select(*db_signals).as_iterable() as rows:
            yield from rows

    def results(
Member

Shouldn't it be collect_flatten()?

Member Author

What should we do for existing results and to_records?

        self, row_factory: Optional[Callable] = None, **kwargs
    ) -> list[tuple[Any, ...]]:
        rows = self.iterate_flatten()
        if row_factory:
Member

Do we need this? It looks smart, but the usage is not clear. If it's needed, we should probably introduce it in the other methods as well.

Member Author

It's just trying to imitate what we do in DatasetQuery.results(). It also makes it easier to implement to_records()?
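
As a rough illustration of why a `row_factory` hook makes a to_records()-style method cheap to write, here is a standalone sketch. The `row_factory(columns, row)` call shape is an assumption, and these free functions are hypothetical stand-ins rather than the methods in this PR.

```python
from typing import Any, Callable, Iterable, Optional, Sequence


def results(
    rows: Iterable[Sequence[Any]],
    columns: Sequence[str],
    row_factory: Optional[Callable[[Sequence[str], Sequence[Any]], Any]] = None,
) -> list:
    """Materialize rows as tuples, or as whatever `row_factory` builds."""
    if row_factory is None:
        return [tuple(row) for row in rows]
    return [row_factory(columns, row) for row in rows]


def to_records(rows: Iterable[Sequence[Any]], columns: Sequence[str]) -> list:
    """Rows as dicts keyed by column name, built on top of results()."""
    return results(rows, columns, row_factory=lambda cols, row: dict(zip(cols, row)))


cols = ["name", "size"]
data = [("a.txt", 10), ("b.txt", 20)]

print(results(data, cols))  # [('a.txt', 10), ('b.txt', 20)]
print(to_records(data, cols))  # [{'name': 'a.txt', 'size': 10}, {'name': 'b.txt', 'size': 20}]
```

With the hook in place, the record form is a one-line wrapper instead of a second copy of the iteration logic.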

@@ -1515,30 +1506,35 @@ def offset(self, offset: int) -> "Self":
        query.steps.append(SQLOffset(offset))
        return query

    def as_scalar(self) -> Any:
Member

good idea

@skshetry skshetry changed the title rename sys columns; exclude them on results() etc. APIs Implement sys feature, and rename id/random columns Jul 16, 2024
@skshetry skshetry merged commit 5cf20d3 into iterative:main Jul 17, 2024
11 of 12 checks passed
@skshetry skshetry deleted the exclude_sys_cols branch July 17, 2024 04:38

3 participants