
Embedding Seems Unsuccessful for Chroma + OpenAI #7963

Closed
3 of 14 tasks
kong75 opened this issue Jul 19, 2023 · 6 comments
Labels
🤖:bug  Related to a bug, vulnerability, unexpected error with an existing feature
Ɑ: embeddings  Related to text embedding models module
Ɑ: vector store  Related to vector store module

Comments

kong75 commented Jul 19, 2023

System Info

openai==0.27.7
langchain==0.0.237
chromadb==0.4.2

Platform: Windows 11
Python Version: 3.10

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

In the file below, I was expecting db_collection to contain embeddings when printed. However, the output looks like this:

db_collection {'ids': ['1234_5678_1'], 'embeddings': None, 'metadatas': [{'source': 'Test0720.txt'}], 'documents': ['Nuclear power in the United States is provided by 99 commercial reactors with a net capacity of 100,350 megawatts (MW), 65 pressurized water reactors and 34 boiling water reactors.\n\nIn 2016 they produced a total of 805.3 terawatt-hours of electricity, which accounted for 19.7% of the nation's total electric energy generation.\n\nIn 2016, nuclear energy comprised nearly 60 percent of U.S. emission-free generation.']}

The value for "embeddings" is empty.

Here is the code:

import os
from flask import Blueprint, request, jsonify
from werkzeug.utils import secure_filename
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader


chroma_bp = Blueprint('chroma_bp', __name__, url_prefix='/v1/resource')

openai_key = os.getenv('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = openai_key


@chroma_bp.route('/save_to_chroma', methods=['POST'])
def api_handler():
    file = request.files['file']
    user_id = request.form.get('user_id')
    file_id = request.form.get('file_id')

    try:
        response = create_chroma_db_from_file(file, file_id, user_id)
        return jsonify({'response': 'Chroma DB created successfully'}), 200
    except Exception as e:
        print(f"Exception: {e}")  # Debug print statement
        return jsonify({'error': str(e)}), 500


def create_chroma_db_from_file(file, file_id, user_id):
    filename = secure_filename(file.filename)
    file.save(filename)

    # load the document and split it into chunks
    loader = TextLoader(filename)
    documents = loader.load()

    # split it into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    print(f"Number of documents: {len(docs)}")
    print(f"Documents:", docs)

    # create the OpenAI embedding function
    embeddings = OpenAIEmbeddings(openai_api_key=openai_key)

    # load it into Chroma

    ids = [f"{file_id}_{user_id}_{i}" for i in range(1, len(docs) + 1)]
    db = Chroma.from_documents(
        documents=docs, embedding=embeddings, ids=ids, persist_directory="../chromadb")
    print(f"db", db)
    print(f"db_collection", db._collection.get(ids=[ids[0]]))
    db.persist()

    # query it
    query = "Nuclear power in the United States is provided by 99 commercial reactors with a net capacity of 100,350 megawatts (MW), 65 pressurized water reactors and 34 boiling water reactors. In 2016 they produced a total of 805.3 terawatt-hours of electricity, which accounted for 19.7% of the nation's total electric energy generation. In 2016, nuclear energy comprised nearly 60 percent of U.S. emission-free generation."
    search_result = db.similarity_search(query)

    # print results
    print(search_result[0].page_content)

    os.remove(filename)

    return True

Expected behavior

The embeddings are created successfully and can be seen in the logs. Thank you!

dosubot bot added the Ɑ: embeddings, Ɑ: vector store, and 🤖:bug labels on Jul 19, 2023

kong75 commented Jul 21, 2023

Just found out that this is expected behavior. A follow-up question: if no vectors are stored in the DB, how is similarity calculated (I assume the query is not embedded during the similarity search either)? Are we using some default embedding provided by Chroma?

dosubot bot commented Jul 21, 2023

Answer generated by a 🤖

Answer

I understand that you're puzzled by the 'None' value for the 'embeddings' field when you print the Chroma database collection. You're also curious about how the similarity search works if the embeddings aren't directly stored in the database.

Let me clarify this for you. The 'None' value you're seeing is actually expected behavior. The Chroma database doesn't store the embeddings directly. Instead, it keeps a compressed representation of these embeddings. When you print the collection, it shows 'None' for the embeddings because the actual embeddings aren't directly accessible.

As for the similarity search, it operates by comparing these compressed representations. When you execute a similarity search, Chroma decompresses the stored representations to compute the similarity scores. So, even though you don't see the embeddings when you print the collection, rest assured they are there in a compressed form and are utilized for similarity searches.

This is just one possible explanation based on the information you've provided and my understanding of the Chroma database. If you find this explanation accurate and feel it could be beneficial to others, I encourage you to make a pull request to update the documentation. This way, other users who encounter the same confusion can quickly find the answer.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

rjarun8 commented Jul 21, 2023

@kong75 Is it possible to provide the dummy/masked data that was used here, so the issue can be reproduced?

kong75 commented Jul 22, 2023

@rjarun8 Thank you for replying! I think the response above from dosu has answered my question. I just did not know that embeddings are not shown when the collection is printed by default.

kong75 commented Jul 22, 2023

Embeddings are not available when printed by default.
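
To answer the earlier follow-up about how similarity search can work at all: the vectors are stored in the collection, they are simply not returned by default when it is printed. At query time the vector store embeds the query text with the same embedding function that was passed to Chroma.from_documents and compares it against the stored vectors. A rough sketch of the idea (not the exact library code path), reusing the embeddings, db, and query objects from the snippet in the issue:

# Roughly what db.similarity_search(query) does internally:
# 1. embed the query text with the same embedding function used at ingest time
query_vector = embeddings.embed_query(query)  # the OpenAI API call happens here
# 2. let Chroma compare that vector against the vectors already stored in the collection
results = db._collection.query(
    query_embeddings=[query_vector],
    n_results=4,  # the default k used by similarity_search
)
print(results["documents"][0])  # documents of the nearest neighbours for the (single) query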

kong75 closed this as completed Jul 22, 2023
@braun-viathan
Contributor

See https://docs.trychroma.com/troubleshooting: embeddings are not passed by default.
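
If you do want to inspect the stored vectors, Chroma's get() also accepts an include parameter; the default include list simply leaves embeddings out. A minimal sketch, reusing the db and ids objects from the snippet in the issue:

# Explicitly ask Chroma to return the stored vectors alongside documents and metadata
record = db._collection.get(
    ids=[ids[0]],
    include=["embeddings", "documents", "metadatas"],
)
print(record["embeddings"])  # one vector per id; 1536 floats each with OpenAI's default text-embedding-ada-002 model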
