Extractor pipeline 2.0 #417

Closed · davidmezzetti opened this issue Feb 1, 2023 · 2 comments

@davidmezzetti
Member

This is a major update on the path to Generative Semantic Search.

The extractor pipeline was one of the first components in txtai, going all the way back to 1.0. Since then, much has changed, both within txtai and externally. This pipeline has a lot of potential, but it needs a couple of updates.

Make the following upgrades to the Extractor pipeline.

  • Ability to run embeddings searches. Given that content storage is supported, text can be retrieved directly from the embeddings instance.
  • In addition to extractive QA, support text generation models, sequence-to-sequence models and custom pipelines.
  • Better detection of when a tokenizer should be used (word vector models only).

These changes will enable a prompt-driven approach to question answering with LLMs. This includes Hugging Face models as well as external services such as OpenAI and Cohere. Services can be called directly or through another library such as LangChain. Custom pipelines only need to implement a `__call__` interface.
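
A minimal sketch of how the upgraded pipeline could be used for prompt-driven question answering, assuming a txtai 5.4-style API where Extractor accepts an embeddings instance plus a sequence-to-sequence model path; the model names, prompt text and sample data below are illustrative:

```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

data = [
    "Canada's last fully intact ice shelf has suddenly collapsed",
    "Maine man wins $1M from $25 lottery ticket"
]

# Content-enabled index, so the extractor can pull context text straight from the embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Sequence-to-sequence model generates the answer from the retrieved context
extractor = Extractor(embeddings, "google/flan-t5-base")

def prompt(question):
    return f"""Answer the following question using only the context below. Say 'no answer' if the question can't be answered.
Question: {question}
Context: """

# Queue format: (name, query, question, snippet) - the query runs an embeddings search,
# the question is the prompt passed to the model along with the retrieved context
print(extractor([("answer", "ice shelf", prompt("What happened to the ice shelf?"), False)]))
```

In the same spirit, a custom pipeline (for example, a thin wrapper around the OpenAI or Cohere API) could be passed in place of the model path, as long as it implements `__call__`.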

@4entertainment

Hi @davidmezzetti! Thank you for all your work. I want to use the "Cohere/Cohere-embed-english-v3.0" embedding model in the following code:

```python
# %%capture  (Jupyter cell magic from the original notebook; omit when running as a script)

from txtai import Embeddings

# Works with a list, dataset or generator
data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day"
]

# Create an embeddings
embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")

# Create an index for the list of text
embeddings.index(data)

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))
```

I have my Cohere API key. How can I do that?
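
A minimal sketch of one possible approach, assuming txtai's external vectorization support (`method="external"` with a `transform` function) and the v4-style `cohere.Client.embed` call; the model name, `input_type` value and `COHERE_API_KEY` environment variable are illustrative assumptions, and the `data` list is reused from the snippet above:

```python
import os

import cohere
import numpy as np

from txtai import Embeddings

# Cohere client, key read from an environment variable (illustrative assumption)
client = cohere.Client(os.environ["COHERE_API_KEY"])

def transform(inputs):
    # Convert a batch of text into embeddings with Cohere's embed endpoint
    response = client.embed(
        texts=list(inputs),
        model="embed-english-v3.0",
        input_type="search_document"
    )
    return np.array(response.embeddings, dtype=np.float32)

# External vectorization: txtai calls transform() instead of loading a local model
embeddings = Embeddings({"method": "external", "transform": transform})

# data is the list from the snippet above
embeddings.index(data)

print(embeddings.search("feel good story", 1))
```

Note that Cohere v3 embedding models distinguish document and query inputs; this sketch uses `input_type="search_document"` for both, so a query-aware transform would likely give better results.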

@davidmezzetti
Member Author
