Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: handle nulls when creating indices with cuda #2910

Merged
merged 3 commits into from
Sep 19, 2024

Conversation

westonpace
Copy link
Contributor

No description provided.

@github-actions github-actions bot added enhancement New feature or request python labels Sep 18, 2024
if len(columns) == 1 and filter.lower() == f"{columns[0]} is not null":
table = pc.drop_null(table)
elif filter is not None:
raise Exception(f"Can't yet run filter <{filter}> in-memory")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is maybe a bit frail at the moment.

The problem is that I can't apply the filter with lance because I am using a limit/offset and they won't be pushed down if a filter is present.

Potential solutions:

  • Add a scan_range to lance that interprets limit/offset as a pre-filter limit/offset.
  • Add filter support to take (and just pass a contiguous range to take)
  • Use duckdb or something else to apply the filter here

@westonpace westonpace merged commit 3f935d5 into lancedb:main Sep 19, 2024
11 checks passed
logging.info("Done sampling: centroids shape: %s", init_centroids.shape)

ds = TorchDataset(
dataset,
batch_size=20480,
columns=[column],
samples=sample_size,
filter=filt,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using filters here seems to cause a significant performance hit now (on the order of several minutes vs several seconds for kmeans clustering).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe worth a fast path if we know there are no nulls then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants