Extractor pipeline 2.0 #417

Closed · davidmezzetti opened this issue Feb 1, 2023 · 2 comments

@davidmezzetti
Member

This is a major update on the path to Generative Semantic Search.

The extractor pipeline was one of the first components in txtai, going all the way back to 1.0. Since then, much has changed, both within txtai and externally. This pipeline has a lot of potential, but it needs a couple of updates.

Make the following upgrades to the Extractor pipeline.

  • Ability to run embeddings searches. Given that content storage is supported, text can be retrieved directly from the embeddings instance.
  • In addition to extractive QA, support text generation models, sequence-to-sequence models and custom pipelines.
  • Better detection of when a tokenizer should be used (word vector models only).

These changes will enable a prompt-driven approach to question answering with LLMs. This includes Hugging Face models as well as external services such as OpenAI and Cohere. Services can be called directly or through another library such as LangChain. Custom pipelines only need to implement a `__call__` interface.
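
A minimal sketch of how the upgraded pipeline could be used for prompt-driven question answering, assuming a txtai 5.4-style API where Extractor accepts an embeddings instance plus a sequence-to-sequence model path; the model names, prompt text and sample data below are illustrative:

```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

data = [
    "Canada's last fully intact ice shelf has suddenly collapsed",
    "Maine man wins $1M from $25 lottery ticket"
]

# Content-enabled index, so the extractor can pull context text straight from the embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Sequence-to-sequence model generates the answer from the retrieved context
extractor = Extractor(embeddings, "google/flan-t5-base")

def prompt(question):
    return f"""Answer the following question using only the context below. Say 'no answer' if the question can't be answered.
Question: {question}
Context: """

# Queue format: (name, query, question, snippet) - the query runs an embeddings search,
# the question is the prompt passed to the model along with the retrieved context
print(extractor([("answer", "ice shelf", prompt("What happened to the ice shelf?"), False)]))
```

In the same spirit, a custom pipeline (for example, a thin wrapper around the OpenAI or Cohere API) could be passed in place of the model path, as long as it implements `__call__`.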

@4entertainment

Hi @davidmezzetti! Thank you for all your work. I want to use the "Cohere/Cohere-embed-english-v3.0" embedding model in the following code:

```python
# %%capture  (Jupyter cell magic from the original notebook; omit when running as a script)

from txtai import Embeddings

# Works with a list, dataset or generator
data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day"
]

# Create an embeddings
embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")

# Create an index for the list of text
embeddings.index(data)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))
```

I have my Cohere API key. How can I do that?
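
A minimal sketch of one possible approach, assuming txtai's external vectorization support (`method="external"` with a `transform` function) and the v4-style `cohere.Client.embed` call; the model name, `input_type` value and `COHERE_API_KEY` environment variable are illustrative assumptions, and the `data` list is reused from the snippet above:

```python
import os

import cohere
import numpy as np

from txtai import Embeddings

# Cohere client, key read from an environment variable (illustrative assumption)
client = cohere.Client(os.environ["COHERE_API_KEY"])

def transform(inputs):
    # Convert a batch of text into embeddings with Cohere's embed endpoint
    response = client.embed(
        texts=list(inputs),
        model="embed-english-v3.0",
        input_type="search_document"
    )
    return np.array(response.embeddings, dtype=np.float32)

# External vectorization: txtai calls transform() instead of loading a local model
embeddings = Embeddings({"method": "external", "transform": transform})

# data is the list from the snippet above
embeddings.index(data)

print(embeddings.search("feel good story", 1))
```

Note that Cohere v3 embedding models distinguish document and query inputs; this sketch uses `input_type="search_document"` for both, so a query-aware transform would likely give better results.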

@davidmezzetti
Member Author
