Adding Metadata To Embeddings #75

david-saeger · 2024-09-06T19:19:51Z

Id like to add s3 metadata to my embeddings during the embedding creation process and realized that I wasnt sure the best place to do that. I wasnt sure if forking the project and adding to the file processing would be ideal or if there was something I could do by defining a ragLambdaLayer as descibed here

LISA/example_config.yaml

Line 16 in 2c3b03b

# ragLayerPath: /path/to/rag_layer.zip

In truth I think I am just a little uncertain what these lambda layers do or how to use them. Do they replace the current rag api or add to it?

petermuller · 2024-09-09T20:20:39Z

Hi David! For the layer zip files, those are just a way to pre-package the layer as we have it defined here and then move the source into a region of your choice, ideally for network-isolated environments where we can't pull dependencies on the fly.

As for what you'd like to do, it sounds like we may need to update the RAG API itself (and we welcome pull requests against the develop branch 🎉 )

If you are willing to hack on LISA to add this, my first guess would be around this area: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py

Specifically this function is what we call to generate the initial embeddings: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L125-L151

and then similaritySearch is doing the embedding call for the prompt text https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L80-L107

We're using LangChain under the hood, and we've created a form of langchain-compatible openai binding for embeddings specifically over here: https://github.com/awslabs/LISA/blob/develop/lisa-sdk/lisapy/langchain.py#L102-L153 (ignore other things in the file, there are some unused clients that we need to clean up 😬 )

So if there's a solution you had in mind or could point us in a direction to help with, I think these would be the best starting points. I'm not sure if this answers your question or helps guide in a direction, so please let me know!

david-saeger · 2024-09-09T20:41:18Z

This makes sense and is helpful. I imagine that folks won't want all s3 metadata translated to embeddings, do you think it would make sense to check for a prefix a la if s3 object metadata is prefixed with _lisa_ (or something) then it is translated to vector metadata. Figure its worth asking before heading down the wrong path.

…

On Mon, Sep 9, 2024, 4:21 PM Peter Muller ***@***.***> wrote: Hi David! For the layer zip files, those are just a way to pre-package the layer as we have it defined here and then move the source into a region of your choice, ideally for network-isolated environments where we can't pull dependencies on the fly. As for what you'd like to do, it sounds like we may need to update the RAG API itself (and we welcome pull requests against the develop branch 🎉 ) If you are willing to hack on LISA to add this, my first guess would be around this area: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py Specifically this function is what we call to generate the initial embeddings: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L125-L151 and then similaritySearch is doing the embedding call for the prompt text https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L80-L107 We're using LangChain under the hood, and we've created a form of langchain-compatible openai binding for embeddings specifically over here: https://github.com/awslabs/LISA/blob/develop/lisa-sdk/lisapy/langchain.py#L102-L153 (ignore other things in the file, there are some unused clients that we need to clean up 😬 ) So if there's a solution you had in mind or could point us in a direction to help with, I think these would be the best starting points. I'm not sure if this answers your question or helps guide in a direction, so please let me know! — Reply to this email directly, view it on GitHub <#75 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQBXN23WXX4DLGB3OCAGTWLZVX7K3AVCNFSM6AAAAABNZE7EV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZZGAYTENBRGQ> . You are receiving this because you authored the thread.Message ID: ***@***.***>

petermuller · 2024-09-09T23:25:11Z

I could see the possibility of adding some fields to the related APIs to add another map to the requests, such that those will contain the additional metadata. The metadata is attached at the Document level, so we could possibly make it as part of the API per document, or if we assume a list of files already in S3, then there's the possibility for us to edit the processing function to add more metadata than just the document location over here: https://github.com/awslabs/LISA/blob/develop/lambda/utilities/file_processing.py#L146

So for your suggestion, would the LISA prefix be related to the metadata already on the S3 object? As in something along the lines of:

Upload file to S3 with Object metadata attached
Use LISA ingestion to consume / embed files
Per file, check if there's S3 metadata (optionally: and check if the metadata is prefixed with a LISA-known prefix)
Add metadata to metadata dictionary that is processed along with the Document object
Metadata is now returned with the document text for requested vectors

Is this the workflow you're thinking of?

david-saeger · 2024-09-10T13:12:50Z

Yeah that was my first thought. Not sure if tying the vector metadata to S3 metadata is out of line with the goals of the project for some reason but unless your averse I can put it in a PR.

…

On Mon, Sep 9, 2024, 7:25 PM Peter Muller ***@***.***> wrote: I could see the possibility of adding some fields to the related APIs to add another map to the requests, such that those will contain the additional metadata. The metadata is attached at the Document <https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html> level, so we could possibly make it as part of the API per document, or if we assume a list of files already in S3, then there's the possibility for us to edit the processing function to add more metadata than just the document location over here: https://github.com/awslabs/LISA/blob/develop/lambda/utilities/file_processing.py#L146 So for your suggestion, would the LISA prefix be related to the metadata already on the S3 object? As in something along the lines of: 1. Upload file to S3 with Object metadata attached 2. Use LISA ingestion to consume / embed files 3. Per file, check if there's S3 metadata (optionally: and check if the metadata is prefixed with a LISA-known prefix) 4. Add metadata to metadata dictionary that is processed along with the Document object 5. Metadata is now returned with the document text for requested vectors Is this the workflow you're thinking of? — Reply to this email directly, view it on GitHub <#75 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AQBXN24EK3OWTYLKZJ6S6UDZVYU6ZAVCNFSM6AAAAABNZE7EV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZZGMZDEMRTHA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

david-saeger · 2024-09-12T13:42:55Z

had to step away for the past couple days but kind of circling back to where I was originally when looking through this and trying to figure out a path to get the rag functionality I need. I understand that the layer zip files are included in the config optionally for network isolation. Could they not also serve to replace the RAG functionality if that is my end goal?

Something I am thinking through is what I need out of RAG is pretty boutique and I am doubtful it will be useful to other LISA users (likely would include custom embedding generation logic that is specific to the shape of specific documents) so figure any contribution I make here would end up looking like: place custom functionality somewhere (likely in the form of a lambda) and use it to replace some part or all of the rag API and I am questioning if this already exists in plain site or if I am missing something.

petermuller · 2024-09-13T04:26:22Z

No worries at all!

I've been thinking on this one for a little bit too and I think the main issue in our way is that our implementation of the RAG feature is fairly limited from the UI. Direct invocation via curl command or similar isn't really documented, but as I'm staring at it, I can see that it is possible to upload a custom list of keys to the rag store, so long as the exist in the LISA-provided document bucket (which is also something that we could edit to be user-provider). And with that, we could then provide additional metadata as part of the ingest_documents request. Several routes to go from here, but possible ones:

any metadata given in a call is added to all docs that are processed (keys: [docname1, docname2], metadata: something_applied_to_all)
metadata is applied to objects whose key match in a key:val relationship (keys: [docname1, docname2], metadata: {docname1: metadata1, docname2: metadata2})

And to answer your question, yes the rag layer could be used that way, but then it's a lot harder for us to support that way or improve on our existing things. I would say even based on all of this, we would still welcome a pull request with your ideas in it, and we can work to find the best path forward on it. If the goal for now is to just make a utility outside of the Chat UI to ingest documents with metadata, I think that backwards compatible changes to the repository API would be fine (as long as it doesn't break the current functionality then I'm good 👍 )

Some points of interest for that:

LISA/lambda/repository/lambda_functions.py

Line 137 in 0e824eb

docs = process_record(s3_keys=body["keys"], chunk_size=chunk_size, chunk_overlap=chunk_overlap)
- this line takes a list of S3 keys (associated with the LISA-generated rag documents bucket), and will chunk them and add a hardcoded metadata of the s3 location for the text
LISA/lambda/utilities/file_processing.py

Line 146 in 0e824eb

docs = [Document(page_content=extracted_text, metadata=_get_metadata(s3_uri=s3_uri))]
- this line is what associates the metadata to the langchain Document, and the metadata is just a hardcoded dict right here:
  
  LISA/lambda/utilities/file_processing.py
  
  Lines 36 to 37 in 0e824eb
  
  def _get_metadata(s3_uri: str) -> dict:
  
  return {"source": s3_uri}
I think what we could do is accept a "metadata" parameter in the APIGW API and then pass that into the file_processing file where we append it to the metadata dictionary that's already there

just some ideas and totally not prescriptive by any means!

david-saeger · 2024-09-13T16:49:20Z

Great ideas peter! I think these are great ways to get to the goal I expressed of adding metadata to vector embeddings. I think I may have convoluted the thread here with a second and related goal I have which I am having a harder time thinking through in terms of how to add in a way that could be useful to the broader LISA community that motivated this comment #75 (comment)

Ill leave it hear in case you have thoughts but recognize it should be in another ticket and think I have the information I was seeking about metadata creation.

Basically I would like to be able to use boutique embedding creation logic so that I could parse a document and include some a prior knowledge about its shape in the embedding creation process so that I can for instance inject a title and subheading for each chunk generated from a section in a policy document.

Looking through the codebase I believe that would require replacing the routine here

LISA/lambda/utilities/file_processing.py

Line 59 in 0e824eb

    
           def _generate_chunks(docs: List[Document], chunk_size: Optional[int], chunk_overlap: Optional[int]) -> List[Document]:

on a one off bases for a particular document which is hard for me to think through how to implement in a way that is useful to anybody else. As I am writing this It strikes me that the answer may be just in just generate boutique embeddings locally and send them direct to pgvector. Do you see any issues with that?

bedanley · 2025-01-21T18:21:36Z

As of v3.5.0, we now have a Document meta table that stores additional information about a document outside the vector store. This is primarily used for managing ingested documents, but this might be a good place to store your additional data fields. Unfortunately, it won't address chunk-level metadata as stated above.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Metadata To Embeddings #75

Adding Metadata To Embeddings #75

david-saeger commented Sep 6, 2024

petermuller commented Sep 9, 2024

david-saeger commented Sep 9, 2024 via email

petermuller commented Sep 9, 2024

david-saeger commented Sep 10, 2024 via email

david-saeger commented Sep 12, 2024 •

edited

Loading

petermuller commented Sep 13, 2024

david-saeger commented Sep 13, 2024

bedanley commented Jan 21, 2025

Adding Metadata To Embeddings #75

Adding Metadata To Embeddings #75

Comments

david-saeger commented Sep 6, 2024

petermuller commented Sep 9, 2024

david-saeger commented Sep 9, 2024 via email

petermuller commented Sep 9, 2024

david-saeger commented Sep 10, 2024 via email

david-saeger commented Sep 12, 2024 • edited Loading

petermuller commented Sep 13, 2024

david-saeger commented Sep 13, 2024

bedanley commented Jan 21, 2025

david-saeger commented Sep 12, 2024 •

edited

Loading