
Embedding Seems Unsuccessful for Chroma + OpenAI #7963

Closed
3 of 14 tasks
kong75 opened this issue Jul 19, 2023 · 6 comments
Labels
🤖:bug  Related to a bug, vulnerability, unexpected error with an existing feature
Ɑ: embeddings  Related to text embedding models module
Ɑ: vector store  Related to vector store module

Comments

kong75 commented Jul 19, 2023

System Info

openai==0.27.7
langchain==0.0.237
chromadb==0.4.2

Platform: Windows 11
Python Version: 3.10

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

In the file below, I was expecting db_collection to contain embeddings when printed. However, the output looks like this:

db_collection {'ids': ['1234_5678_1'], 'embeddings': None, 'metadatas': [{'source': 'Test0720.txt'}], 'documents': ['Nuclear power in the United States is provided by 99 commercial reactors with a net capacity of 100,350 megawatts (MW), 65 pressurized water reactors and 34 boiling water reactors.\n\nIn 2016 they produced a total of 805.3 terawatt-hours of electricity, which accounted for 19.7% of the nation's total electric energy generation.\n\nIn 2016, nuclear energy comprised nearly 60 percent of U.S. emission-free generation.']}

The value for "embeddings" is empty.

Here is the code:

import os
from flask import Blueprint, request, jsonify
from werkzeug.utils import secure_filename
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader


chroma_bp = Blueprint('chroma_bp', __name__, url_prefix='/v1/resource')

openai_key = os.getenv('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = openai_key


@chroma_bp.route('/save_to_chroma', methods=['POST'])
def api_handler():
    file = request.files['file']
    user_id = request.form.get('user_id')
    file_id = request.form.get('file_id')

    try:
        response = create_chroma_db_from_file(file, file_id, user_id)
        return jsonify({'response': 'Chroma DB created successfully'}), 200
    except Exception as e:
        print(f"Exception: {e}")  # Debug print statement
        return jsonify({'error': str(e)}), 500


def create_chroma_db_from_file(file, file_id, user_id):
    filename = secure_filename(file.filename)
    file.save(filename)

    # load the document and split it into chunks
    loader = TextLoader(filename)
    documents = loader.load()

    # split it into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    print(f"Number of documents: {len(docs)}")
    print(f"Documents:", docs)

    # create the OpenAI embedding function
    embeddings = OpenAIEmbeddings(openai_api_key=openai_key)

    # load it into Chroma

    ids = [f"{file_id}_{user_id}_{i}" for i in range(1, len(docs) + 1)]
    db = Chroma.from_documents(
        documents=docs, embedding=embeddings, ids=ids, persist_directory="../chromadb")
    print(f"db", db)
    print(f"db_collection", db._collection.get(ids=[ids[0]]))
    db.persist()

    # query it
    query = "Nuclear power in the United States is provided by 99 commercial reactors with a net capacity of 100,350 megawatts (MW), 65 pressurized water reactors and 34 boiling water reactors. In 2016 they produced a total of 805.3 terawatt-hours of electricity, which accounted for 19.7% of the nation's total electric energy generation. In 2016, nuclear energy comprised nearly 60 percent of U.S. emission-free generation."
    search_result = db.similarity_search(query)

    # print results
    print(search_result[0].page_content)

    os.remove(filename)

    return True

Expected behavior

The embeddings are created successfully and can be seen in the logs. Thank you!

dosubot bot added the Ɑ: embeddings, Ɑ: vector store, and 🤖:bug labels on Jul 19, 2023

kong75 commented Jul 21, 2023

Just found out that this is expected behavior. A follow-up question: if no vectors are stored in the DB, how is similarity calculated (I assume the query is not embedded during the similarity search either)? Are we using some default embedding provided by Chroma?

dosubot bot commented Jul 21, 2023

Answer generated by a 🤖

Answer

I understand that you're puzzled by the 'None' value for the 'embeddings' field when you print the Chroma database collection. You're also curious about how the similarity search works if the embeddings aren't directly stored in the database.

Let me clarify this for you. The 'None' value you're seeing is actually expected behavior. The Chroma database doesn't store the embeddings directly. Instead, it keeps a compressed representation of these embeddings. When you print the collection, it shows 'None' for the embeddings because the actual embeddings aren't directly accessible.

As for the similarity search, it operates by comparing these compressed representations. When you execute a similarity search, Chroma decompresses the stored representations to compute the similarity scores. So, even though you don't see the embeddings when you print the collection, rest assured they are there in a compressed form and are utilized for similarity searches.

This is just one possible explanation based on the information you've provided and my understanding of the Chroma database. If you find this explanation accurate and feel it could be beneficial to others, I encourage you to make a pull request to update the documentation. This way, other users who encounter the same confusion can quickly find the answer.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

rjarun8 commented Jul 21, 2023

@kong75 Is it possible to provide the dummy/masked data that was used here, so the issue can be reproduced?

kong75 commented Jul 22, 2023

@rjarun8 Thank you for replying! I think the response above from dosu has answered my question. I just did not know that embeddings are not shown when the collection is printed by default.

kong75 commented Jul 22, 2023

Embeddings are not available when printed by default.
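
To answer the earlier follow-up about how similarity search can work at all: the vectors are stored in the collection, they are simply not returned by default when it is printed. At query time the vector store embeds the query text with the same embedding function that was passed to Chroma.from_documents and compares it against the stored vectors. A rough sketch of the idea (not the exact library code path), reusing the embeddings, db, and query objects from the snippet in the issue:

# Roughly what db.similarity_search(query) does internally:
# 1. embed the query text with the same embedding function used at ingest time
query_vector = embeddings.embed_query(query)  # the OpenAI API call happens here
# 2. let Chroma compare that vector against the vectors already stored in the collection
results = db._collection.query(
    query_embeddings=[query_vector],
    n_results=4,  # the default k used by similarity_search
)
print(results["documents"][0])  # documents of the nearest neighbours for the (single) query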

kong75 closed this as completed Jul 22, 2023
@braun-viathan
Contributor

See https://docs.trychroma.com/troubleshooting: embeddings are not passed by default.
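
If you do want to inspect the stored vectors, Chroma's get() also accepts an include parameter; the default include list simply leaves embeddings out. A minimal sketch, reusing the db and ids objects from the snippet in the issue:

# Explicitly ask Chroma to return the stored vectors alongside documents and metadata
record = db._collection.get(
    ids=[ids[0]],
    include=["embeddings", "documents", "metadatas"],
)
print(record["embeddings"])  # one vector per id; 1536 floats each with OpenAI's default text-embedding-ada-002 model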
