Epic: move indexing to an application layer #39

Closed
7 of 8 tasks
Tracked by #33
skshetry opened this issue Jun 18, 2024 · 6 comments

@skshetry
Member

skshetry commented Jun 18, 2024

Description

I.e., make it based on a feature schema and, if possible, implement it with UDFs.

@skshetry skshetry added the "bug" (Something isn't working) label Jun 18, 2024
@skshetry skshetry self-assigned this Jun 18, 2024
@skshetry skshetry removed the "bug" (Something isn't working) label Jun 18, 2024
@skshetry skshetry changed the title Move indexing to application layer Move indexing to an application layer Jun 18, 2024
@skshetry skshetry removed their assignment Jul 8, 2024
@dmpetrov dmpetrov transferred this issue from another repository Jul 13, 2024
@ilongin
Contributor

ilongin commented Jul 22, 2024

We need to think about how to deal with the additional tables that are created during indexing, such as buckets or partials. This is not just a normal UDF that outputs rows into a dataset table; it also needs to insert into the buckets and partials tables.
It's easy for us to implement this ourselves, but if we want users to implement their own indexing, maybe we need to provide a framework that handles this implicitly (users should not have to care about those tables explicitly)... WDYT?

@shcheklein
Member

I think we should start getting rid of partials. They are too complicated for the value they provide. Same with buckets / sources: I would reconsider them and drop them as well.

Each path that we pass to from_storage can create a versioned dataset. We can decide to reuse those (as a way to cache things) with some expiration date, etc.
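For illustration only, the reuse-with-expiration idea above could look roughly like the following sketch. `Listing`, `ListingCache`, and `get_or_index` are hypothetical names invented for this example and are not part of any existing API; the actual storage listing step is left out, and the time-to-live policy is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Listing:
    """Hypothetical record of one indexed storage path (one dataset version)."""
    uri: str
    version: int
    created_at: datetime


class ListingCache:
    """Toy in-memory cache: reuse the latest listing of a path until it expires."""

    def __init__(self, ttl: timedelta) -> None:
        self.ttl = ttl
        self._by_uri: dict[str, list[Listing]] = {}

    def latest(self, uri: str) -> Optional[Listing]:
        versions = self._by_uri.get(uri)
        return versions[-1] if versions else None

    def get_or_index(self, uri: str, now: Optional[datetime] = None) -> Listing:
        if now is None:
            now = datetime.now(timezone.utc)
        current = self.latest(uri)
        # Reuse the existing version while it is younger than the TTL.
        if current is not None and now - current.created_at < self.ttl:
            return current
        # Otherwise record a new version. A real implementation would
        # list the storage path here; that step is omitted in this sketch.
        version = current.version + 1 if current is not None else 1
        fresh = Listing(uri, version, now)
        self._by_uri.setdefault(uri, []).append(fresh)
        return fresh
```

With a one-hour TTL, two calls for the same path within the hour return the same version, and a call after expiry creates the next version instead of overwriting the old one.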

What are the major things we are losing by getting rid of buckets, sources, and partials?

@ilongin
Contributor

ilongin commented Jul 22, 2024

Partials are needed to be able to index part of a bucket and to avoid re-indexing subdirectories. I have a feeling, though, that this can all be done without the partials table, just on the fly, but this needs to be investigated.
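As a rough sketch of what "on the fly" might mean here, assuming the set of already-indexed path prefixes can be obtained from existing listings: a subdirectory needs no re-indexing if it equals an indexed prefix or lies under one. `is_covered` is a hypothetical helper written for this example, not an existing function.

```python
from collections.abc import Iterable


def _norm(path: str) -> str:
    """Strip leading/trailing slashes so 'a/b/' and 'a/b' compare equal."""
    return path.strip("/")


def is_covered(path: str, indexed_prefixes: Iterable[str]) -> bool:
    """Return True if `path` equals, or lies under, an already-indexed prefix."""
    target = _norm(path)
    for prefix in indexed_prefixes:
        p = _norm(prefix)
        # An empty prefix means the whole bucket was indexed.
        # The "/" guard keeps 'images/train' from matching 'images/training'.
        if p == "" or target == p or target.startswith(p + "/"):
            return True
    return False
```

The check is computed from the prefixes alone, so no separate bookkeeping table is needed in this simplified model; whether that holds for the real indexing flow is exactly what would need to be investigated.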

@dmpetrov
Member

> I think we should start getting rid of partials. T

and

> that this can all be done even without that partials table, just on the fly but this needs to be investigated.

Both are good ideas! Let's try to simplify this as much as we can.

> We need to think how to deal with additional tables that are created during indexing, like buckets or partials. So this is not just normal UDF that has an output of some rows in a dataset table

Right. We need to find a way to fit the buckets (as well as partials, if needed) into "just normal UDF" and normal datasets. I hope these datasets won't be visible to users (by default).
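One possible shape for this, purely as a sketch: the user supplies only a function that yields file rows, and a small runner writes both the user-facing dataset and the bookkeeping dataset, with the latter hidden from listings by default. `Catalog`, `run_indexing`, the `__` prefix, and the `__sources` name are all assumptions made up for this example.

```python
from collections.abc import Callable, Iterable, Iterator

HIDDEN_PREFIX = "__"  # assumed naming convention for internal datasets


class Catalog:
    """Toy in-memory catalog; a stand-in, not a real class from the project."""

    def __init__(self) -> None:
        self._datasets: dict[str, list[dict]] = {}

    def append(self, name: str, rows: Iterable[dict]) -> None:
        self._datasets.setdefault(name, []).extend(rows)

    def rows(self, name: str) -> list[dict]:
        return list(self._datasets.get(name, []))

    def list_datasets(self, include_hidden: bool = False) -> list[str]:
        return sorted(
            name
            for name in self._datasets
            if include_hidden or not name.startswith(HIDDEN_PREFIX)
        )


def run_indexing(
    catalog: Catalog, uri: str, list_files: Callable[[str], Iterator[dict]]
) -> None:
    """Run a user-supplied lister and do the bookkeeping on the user's behalf."""
    rows = list(list_files(uri))
    catalog.append(uri, rows)  # the normal, user-visible dataset
    # Bookkeeping goes into an ordinary dataset that is hidden by default.
    catalog.append(HIDDEN_PREFIX + "sources", [{"uri": uri, "num_files": len(rows)}])


def fake_lister(uri: str) -> Iterator[dict]:
    """Example user code: it only yields rows and never touches bookkeeping."""
    for name in ("a.txt", "b.txt"):
        yield {"path": f"{uri}/{name}"}
```

Calling `run_indexing(catalog, "s3://bucket/data", fake_lister)` leaves one visible dataset, while `list_datasets(include_hidden=True)` also shows the internal one.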

@shcheklein shcheklein changed the title Move indexing to an application layer Epic: move indexing to an application layer Jul 31, 2024
@shcheklein
Member

Prioritizing this. It's an epic. Need to add first steps.

@ilongin
Contributor

ilongin commented Jul 31, 2024

I can take over this one and make a plan / subtasks
