Refactor README and Vicinity class to support any serializable item type #56

Merged 9 commits on Jan 24, 2025

Conversation

davidberenstein1957 (Contributor) commented Jan 20, 2025

I was playing around and realised the items don't necessarily need to be strings.

Update

  • Updated README.md to clarify that items can be strings or other serialisable objects.
  • Modified the Vicinity class to accept a broader range of item types by changing type hints from str to Any in several methods.
  • Enhanced the insert and delete methods to handle non-string tokens appropriately, ensuring that items can be checked and managed regardless of their type.

Question
Also, why do we use tokens as argument names for items in the insert and delete methods?

Example
Something like the following works now. It is slower during insertion and deletion, but I would say the flexibility is worth it.

import numpy as np

from vicinity import Backend, Metric, Vicinity

# Create some dummy data
items = [
    {"name": "triforce", "id": 0},
    {"name": "master sword", "id": 1},
    {"name": "hylian shield", "id": 2},
    {"name": "boomerang", "id": 3},
    {"name": "hookshot", "id": 4},
]


vectors = np.random.rand(len(items), 128)

# Initialize the Vicinity instance (using the basic backend and cosine metric)
vicinity = Vicinity.from_vectors_and_items(
    vectors=vectors, items=items, backend_type=Backend.BASIC, metric=Metric.COSINE
)

# Create a query vector
query_vector = np.random.rand(128)

# Query for nearest neighbors with a top-k search
results = vicinity.query(query_vector, k=3)

# Query for nearest neighbors with a threshold search
results = vicinity.query_threshold(query_vector, threshold=0.9)

# Delete every item, then re-insert the same items with their vectors
vicinity.delete(items)
vicinity.insert(items, vectors)

# Query with a list of query vectors
query_vectors = np.random.rand(5, 128)
results = vicinity.query(query_vectors, k=3)
print(results)

# Save the vector store to disk and load it back
vicinity.save("my_vector_store", overwrite=True)
vicinity = Vicinity.load("my_vector_store")

results = vicinity.query(query_vectors, k=3)
print(results)

codecov-commenter commented Jan 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

| Files with missing lines | Coverage Δ |
|---|---|
| tests/conftest.py | 100.00% <100.00%> (ø) |
| tests/test_vicinity.py | 100.00% <100.00%> (ø) |
| vicinity/vicinity.py | 98.62% <100.00%> (+0.94%) ⬆️ |

- Simplified the logic for checking and appending tokens in the insert method, ensuring that duplicate tokens are properly managed.
- Updated the `items` fixture to return a mix of dictionaries and strings based on index parity.
- Modified `test_vicinity_insert_duplicate` to use the updated `items` fixture for inserting items.
- Adjusted `test_vicinity_delete_and_query` to reference items by their indices instead of hardcoded values.
- Enhanced the Vicinity class to streamline token management, ensuring proper handling of duplicates and improving error messaging for token deletions.
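
A minimal sketch of why duplicate handling changes once items are no longer plain strings: dictionaries are not hashable, so a `set` lookup is unavailable and membership falls back to an equality scan over a list. The helper name `append_unique` and the exact error message are assumptions for illustration; this is not the Vicinity implementation.

```python
from typing import Any


def append_unique(existing: list[Any], new_items: list[Any]) -> list[Any]:
    """Append new_items to existing unless any of them is already stored.

    Hypothetical helper. `in` on a list compares with ==, so it works for
    unhashable items such as dicts, at the cost of a linear scan per item.
    Duplicates within new_items itself are not checked here.
    """
    duplicates = [item for item in new_items if item in existing]
    if duplicates:
        raise ValueError(f"Items {duplicates} are already in the vector space.")
    existing.extend(new_items)
    return existing


# Mixed dict and string items, as in the updated test fixture
items = [{"name": "triforce", "id": 0}, "master sword"]
append_unique(items, [{"name": "hookshot", "id": 4}])
```

The linear scan is one plausible source of the slower inserts mentioned in the PR description; a hash-based lookup would only be possible for hashable items.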
stephantul (Member) left a comment

This looks good, thanks! One improvement I see is that we can explicitly raise a ValueError when trying to save if the items are not JSON serializable, e.g., by catching a JSONEncodeError and giving a more informative error message.
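
The suggestion could look roughly like the sketch below. It assumes the standard-library `json` module, which raises `TypeError` for unserializable objects; the serializer and exception type actually used by the library may differ, and `dump_items` is a hypothetical name.

```python
import json
from typing import Any


def dump_items(items: list[Any]) -> str:
    """Serialize items to JSON, raising a descriptive ValueError on failure."""
    try:
        return json.dumps(items)
    except TypeError as e:
        # json.dumps raises TypeError for objects it cannot encode
        raise ValueError(
            "Items must be JSON serializable to be saved; "
            f"serialization failed with: {e}"
        ) from e


# Dicts and strings serialize without problems
serialized = dump_items([{"name": "triforce", "id": 0}, "boomerang"])
```

Chaining with `from e` keeps the original encoder error in the traceback, so the more informative message does not hide the underlying cause.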

Another improvement is the typing, but I'll do that in a follow-up because I actually don't know what the best way is 💀

davidberenstein1957 and others added 3 commits January 20, 2025 19:41
Co-authored-by: Stephan Tulkens <stephantul@gmail.com>
…ling

- Replaced the nested loop for checking duplicates with a single extend operation for tokens.
- Improved efficiency by directly appending tokens to the items list, ensuring proper management of duplicates.
…ling

- Replaced the nested loop for token matching with a more efficient list comprehension.
- Enhanced error messaging to specify which tokens were not found in the vector space.
- Added a try-except block around the JSON serialization process to catch JSONEncodeError.
- Introduced a new pytest fixture `non_serializable_items` that generates a list of non-serializable objects for testing.
- Added a test case `test_vicinity_save_and_load_non_serializable_items` to verify that saving a Vicinity instance with non-serializable items raises a JSONEncodeError.
- Updated the Vicinity class documentation to specify that JSONEncodeError may be raised if items are not serializable.
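
As a rough illustration of the deletion change described in this commit, token matching with list comprehensions and an error that names the missing tokens might look like this. The function name `indices_to_delete` and the message wording are assumptions, not the code that was merged.

```python
from typing import Any


def indices_to_delete(stored: list[Any], tokens: list[Any]) -> list[int]:
    """Return the positions of tokens in stored, or fail naming the missing ones.

    Hypothetical helper: equality-based matching so dict items work too.
    """
    missing = [token for token in tokens if token not in stored]
    if missing:
        raise ValueError(f"Tokens {missing} were not found in the vector space.")
    return [i for i, item in enumerate(stored) if item in tokens]


stored = [{"name": "triforce", "id": 0}, "master sword", {"name": "boomerang", "id": 3}]
to_remove = indices_to_delete(stored, ["master sword", {"name": "boomerang", "id": 3}])
```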
@stephantul stephantul merged commit 2a8272c into MinishLab:main Jan 24, 2025
5 checks passed