Add .ratio to embedbase sdk #95

Open
benjaminshafii opened this issue May 11, 2023 · 3 comments

Comments

@benjaminshafii
Member

It would be so helpful in my use case:

results = await client.dataset(recipe_id, farm_id, user_id, location_id).search(question, max_token=3000, ratio=[.7, .1, .1, .1])

The max_token budget is split according to ratio, so limits of 2100, 300, 300, and 300 tokens are applied to the four datasets, respectively.

Originally posted by @ccomkhj in #71 (comment)
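
To make the arithmetic in the request concrete, here is a minimal sketch of how a total max_token budget could be divided by ratio. The function name split_token_budget is hypothetical and not part of the embedbase SDK; normalizing the ratios and handing leftover tokens to the largest remainders are assumptions, since the request does not say how rounding should work.

```python
from fractions import Fraction
from math import floor


def split_token_budget(max_token, ratio):
    """Split max_token across datasets in proportion to ratio.

    Hypothetical helper, not an embedbase API. Ratios are normalized,
    so they do not have to sum to exactly 1 (an assumption).
    """
    if max_token < 0:
        raise ValueError("max_token must be non-negative")
    if not ratio or any(r < 0 for r in ratio):
        raise ValueError("ratio must be a non-empty list of non-negative numbers")

    # Parse via str() so that 0.7 becomes exactly 7/10, avoiding float drift.
    shares = [Fraction(str(r)) for r in ratio]
    total = sum(shares)
    if total == 0:
        raise ValueError("ratio must contain at least one positive value")

    exact = [max_token * s / total for s in shares]
    budgets = [floor(x) for x in exact]

    # Give tokens lost to flooring to the largest fractional remainders.
    leftover = max_token - sum(budgets)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - budgets[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


print(split_token_budget(3000, [.7, .1, .1, .1]))  # [2100, 300, 300, 300]
```

With this approach the per-dataset limits always sum to max_token, even when the ratios do not divide the budget evenly.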

@louis030195
Contributor

Sure! From a technical point of view, this just requires a small tweak to the SDK plus a new endpoint in the embedbase instance that takes a list of datasets instead of a single dataset to query.

@louis030195
Contributor

@hotkartoffel let's say you have

"Basil is a green plant that needs daily water..." in the green_plants dataset
and "Basil is a green plant that needs daily water..." in the general_plants dataset

upon running

results = await client.dataset("green_plants", "general_plants").search("How should I take care of my basil when the leaves turn yellow?")

I assume you expect to receive distinct results (no duplicates)?
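
One way to read "distinct results" for the basil example is exact-text deduplication across datasets. The sketch below assumes a hypothetical Result shape (dataset, text, score) and a made-up helper dedupe_exact; neither is confirmed by the thread or by the embedbase SDK, and the scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    # Hypothetical result shape, used only for this illustration.
    dataset: str
    text: str
    score: float


def dedupe_exact(results):
    """Keep one result per distinct text, preferring the highest score."""
    best = {}
    for r in results:
        # Normalize whitespace and case so trivial differences still match.
        key = " ".join(r.text.split()).lower()
        if key not in best or r.score > best[key].score:
            best[key] = r
    return sorted(best.values(), key=lambda r: r.score, reverse=True)


merged = [
    Result("green_plants", "Basil is a green plant that needs daily water...", 0.82),
    Result("general_plants", "Basil is a green plant that needs daily water...", 0.80),
]
distinct = dedupe_exact(merged)
print(len(distinct), distinct[0].dataset)  # 1 green_plants
```

This only catches identical text; paraphrased duplicates would need a similarity-based check instead.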

louis030195 added a commit that referenced this issue May 15, 2023
@louis030195 louis030195 mentioned this issue May 15, 2023
10 tasks
@ccomkhj

ccomkhj commented May 16, 2023

Q1. How do you determine from the embedding vectors whether two entries are duplicates? (Exactly the same embedding vector, or simply a similarity score?)

Q2. If the "green_plants" dataset contains multiple near-duplicates, how does the search algorithm treat them?
e.g. "basil is green plants" on page 32 of a PDF.
"basil is plants which are mostly green" on page 111 of the PDF.
... (This kind of generic sentence can repeat 10+ times.)

results = await client.dataset("green_plants").search("What is the color of basil and its taste?")
-> Could too much color info suppress the taste info?

Distinct results are always good, but I wonder how you decide what counts as distinct.
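
To illustrate the "similarity score" option from Q1, here is a minimal sketch of near-duplicate filtering by cosine similarity. It is not how embedbase decides distinctness (the thread does not say); the function names, the 0.95 threshold, the two-dimensional toy embeddings, and the relevance scores are all assumptions made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(candidates, threshold=0.95):
    """Greedy filter over (text, embedding, relevance) tuples.

    Visits candidates from most to least relevant and keeps one only if
    it is below the threshold against everything already kept.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if all(cosine(cand[1], k[1]) < threshold for k in kept):
            kept.append(cand)
    return kept


# Toy data: the embeddings and relevance values are invented.
candidates = [
    ("basil is green plants", [1.0, 0.0], 0.9),
    ("basil is plants which are mostly green", [0.99, 0.05], 0.8),
    ("a sentence about the taste of basil", [0.0, 1.0], 0.7),
]
kept_texts = [c[0] for c in drop_near_duplicates(candidates)]
print(kept_texts)
```

Under these assumptions the second color sentence is dropped as a near-duplicate of the first, which leaves room for the taste sentence instead of letting repeated color statements fill the token budget.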

3 participants