An error message is displayed after the knowledge base file is uploaded #2772

Closed
3 tasks done
joginwen opened this issue Mar 9, 2024 · 10 comments
Labels
🐞 bug Something isn't working 🤔 cant-reproduce We can not reproduce it or the information is limited

Comments


joginwen commented Mar 9, 2024

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.5.8

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Create a knowledge base
  2. Upload the md file and set all policies to default
  3. Wait a moment. The text embedding step for the document ends in an Error state.

Error message screenshot:
image

docker-worker-1 container error logs

[2024-03-09 16:20:56,842: ERROR/MainProcess] consume document failed
Traceback (most recent call last):
  File "/app/api/core/model_runtime/model_providers/__base/text_embedding_model.py", line 31, in invoke
    return self._invoke(model, credentials, texts, user)
  File "/app/api/core/model_runtime/model_providers/openai/text_embedding/text_embedding.py", line 71, in _invoke
    embeddings_batch, embedding_used_tokens = self._embedding_invoke(
  File "/app/api/core/model_runtime/model_providers/openai/text_embedding/text_embedding.py", line 184, in _embedding_invoke
    return ([list(np.frombuffer(base64.b64decode(data.embedding), dtype="float32")) for data in response.data],
  File "/app/api/core/model_runtime/model_providers/openai/text_embedding/text_embedding.py", line 184, in <listcomp>
    return ([list(np.frombuffer(base64.b64decode(data.embedding), dtype="float32")) for data in response.data],
  File "/usr/local/lib/python3.10/base64.py", line 80, in b64decode
    s = _bytes_from_decode_data(s)
  File "/usr/local/lib/python3.10/base64.py", line 45, in _bytes_from_decode_data
    raise TypeError("argument should be a bytes-like object or ASCII "
TypeError: argument should be a bytes-like object or ASCII string, not 'list'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/api/core/indexing_runner.py", line 70, in run
    self._load(
  File "/app/api/core/indexing_runner.py", line 667, in _load
    index_processor.load(dataset, chunk_documents)
  File "/app/api/core/rag/index_processor/processor/paragraph_index_processor.py", line 59, in load
    vector.create(documents)
  File "/app/api/core/rag/datasource/vdb/vector_factory.py", line 120, in create
    embeddings = self._embeddings.embed_documents([document.page_content for document in texts])
  File "/app/api/core/embedding/cached_embedding.py", line 51, in embed_documents
    raise ex
  File "/app/api/core/embedding/cached_embedding.py", line 35, in embed_documents
    embedding_result = self._model_instance.invoke_text_embedding(
  File "/app/api/core/model_manager.py", line 95, in invoke_text_embedding
    return self.model_type_instance.invoke(
  File "/app/api/core/model_runtime/model_providers/__base/text_embedding_model.py", line 33, in invoke
    raise self._transform_invoke_error(e)
core.model_runtime.errors.invoke.InvokeError: [openai] Error: argument should be a bytes-like object or ASCII string, not 'list'

✔️ Expected Behavior

The knowledge base can be used normally

❌ Actual Behavior

The uploaded document cannot be processed properly

@joginwen joginwen added the 🐞 bug Something isn't working label Mar 9, 2024

dosubot bot commented Mar 9, 2024

Hey there @joginwen! Great to see you around again. Hope you're doing well!

Thank you for providing detailed information about the issue you're experiencing. Based on the error logs you've shared, it seems like the error is occurring during the text embedding process. Specifically, the error message TypeError: argument should be a bytes-like object or ASCII string, not 'list' suggests that the function base64.b64decode() is receiving a list as an argument, whereas it expects a bytes-like object or ASCII string.
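
The TypeError in the log can be reproduced in isolation. The snippet below is only a minimal sketch: it assumes the embeddings endpoint returned a plain list of floats rather than a base64 string, and the values are made up for illustration.

```python
import base64

# Hypothetical stand-in for one `data.embedding` value when the endpoint
# answers with plain floats instead of a base64 string (values are made up).
embedding = [0.1, 0.2, 0.3]

try:
    # b64decode only accepts bytes-like objects or ASCII strings.
    base64.b64decode(embedding)
except TypeError as exc:
    message = str(exc)
    print(message)
```

If the printed message matches the one in the worker log, the shape of the response is a more likely culprit than the uploaded file.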

This error might be due to the format or content of the uploaded .md file. It's possible that the text embedding function is encountering a list where it expects a string, causing the TypeError.

As a first step, I would recommend checking the content of the .md file you're trying to upload. Make sure that it doesn't contain any unusual formatting or data types that might be interpreted as a list during processing.

If the issue persists, you might want to try the solutions suggested in similar issues:

  1. Archiving and unarchiving your documents as suggested in this issue.
  2. Changing the volumes path as suggested in this issue.

Please try these steps and let us know if the issue still persists. We're here to help!


@JohnJyong
Contributor

This has been fixed in PR #2781. Thanks for your feedback, @joginwen.

@joginwen
Author

This has been fixed in PR #2781. Thanks for your feedback, @joginwen.

After updating the code, I tested again and the same error occurred:

[2024-03-11 12:31:59,277: ERROR/MainProcess] consume document failed
Traceback (most recent call last):
  File "/app/api/core/model_runtime/model_providers/__base/text_embedding_model.py", line 31, in invoke
    return self._invoke(model, credentials, texts, user)
  File "/app/api/core/model_runtime/model_providers/openai/text_embedding/text_embedding.py", line 71, in _invoke
    embeddings_batch, embedding_used_tokens = self._embedding_invoke(
  File "/app/api/core/model_runtime/model_providers/openai/text_embedding/text_embedding.py", line 184, in _embedding_invoke
    return ([list(np.frombuffer(base64.b64decode(data.embedding), dtype="float32")) for data in response.data],
  File "/app/api/core/model_runtime/model_providers/openai/text_embedding/text_embedding.py", line 184, in <listcomp>
    return ([list(np.frombuffer(base64.b64decode(data.embedding), dtype="float32")) for data in response.data],
  File "/usr/local/lib/python3.10/base64.py", line 80, in b64decode
    s = _bytes_from_decode_data(s)
  File "/usr/local/lib/python3.10/base64.py", line 45, in _bytes_from_decode_data
    raise TypeError("argument should be a bytes-like object or ASCII "
TypeError: argument should be a bytes-like object or ASCII string, not 'list'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/api/core/indexing_runner.py", line 71, in run
    self._load(
  File "/app/api/core/indexing_runner.py", line 674, in _load
    index_processor.load(dataset, chunk_documents)
  File "/app/api/core/rag/index_processor/processor/paragraph_index_processor.py", line 59, in load
    vector.create(documents)
  File "/app/api/core/rag/datasource/vdb/vector_factory.py", line 120, in create
    embeddings = self._embeddings.embed_documents([document.page_content for document in texts])
  File "/app/api/core/embedding/cached_embedding.py", line 51, in embed_documents
    raise ex
  File "/app/api/core/embedding/cached_embedding.py", line 35, in embed_documents
    embedding_result = self._model_instance.invoke_text_embedding(
  File "/app/api/core/model_manager.py", line 95, in invoke_text_embedding
    return self.model_type_instance.invoke(
  File "/app/api/core/model_runtime/model_providers/__base/text_embedding_model.py", line 33, in invoke
    raise self._transform_invoke_error(e)
core.model_runtime.errors.invoke.InvokeError: [openai] Error: argument should be a bytes-like object or ASCII string, not 'list'
[2024-03-11 12:31:59,281: INFO/MainProcess] Processed dataset: b41d31bf-1279-46a9-8ad4-b37706b1f2a2 latency: 2.85326619999978

@JohnJyong
Contributor

Have you restarted the Celery service? @joginwen

@joginwen
Author

joginwen commented Mar 12, 2024

Have you restarted the Celery service? @joginwen

My deployment runs on Docker Compose. After updating to the latest main branch code, I mount the local source directory over /app/api inside the containers, replacing the code shipped in the image.

docker-compose fragment

version: '3.1'
services:
  # API service
  api:
    volumes:
      # Mount the storage directory to the container, for storing user files.
      - ./volumes/app/storage:/app/api/storage
      - /Users/liuda/Documents/work/ai/dify/api:/app/api
  worker:
    volumes:
      # Mount the storage directory to the container, for storing user files.
      - ./volumes/app/storage:/app/api/storage
      - /Users/liuda/Documents/work/ai/dify/api:/app/api

I ran docker-compose down and then docker-compose up -d to restart everything. After testing again, the error still appears.

@AI-HR

AI-HR commented Mar 14, 2024

May I ask if you have resolved it?

@joginwen
Author

May I ask if you have resolved it?

No, I upgraded to version 0.5.9 and still got the same error.


@AI-HR

AI-HR commented Mar 18, 2024

I have the same error and don't know how to fix it.

@zjbshk

zjbshk commented Mar 18, 2024

I have encountered the same issue as well. I reproduced it using the following code:
image

I discovered that the issue was due to my proxy not sending the "encoding_format": "base64" parameter to OpenAI's /v1/embeddings endpoint. Below is the response I received after including the encoding_format parameter, which can be correctly parsed.

import base64
import numpy as np

# Parsed response body from /v1/embeddings when "encoding_format": "base64" is sent.
# The "embedding" value is empty in this post.
resp = {
    "data": [
        {
            "embedding": "",
            "index": 0,
            "object": "embedding"
        }
    ],
    "model": "text-embedding-ada-002",
    "object": "list",
    "usage": {
        "prompt_tokens": 421,
        "total_tokens": 421
    }
}
embedding_data = resp['data'][0]['embedding']
# Decode the base64 string and convert it to a NumPy array of float32 type
decoded_array = np.frombuffer(base64.b64decode(embedding_data), dtype=np.float32)

print(decoded_array)

# The same decoding, written the way it appears in the traceback
data = resp['data'][0]
print(np.frombuffer(base64.b64decode(data['embedding']), dtype="float32"))
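
To check whether a proxy honours the parameter, it may help to compare what is sent with what comes back. The following is a rough sketch, not Dify's code: `build_embeddings_request` and `embedding_formats` are hypothetical helper names, and the two sample replies are made up (the string "AAAAPw==" is the base64 form of a single float32 value, 0.5).

```python
def build_embeddings_request(texts, model="text-embedding-ada-002"):
    """Build a JSON body for /v1/embeddings that asks for base64 output."""
    return {"model": model, "input": list(texts), "encoding_format": "base64"}


def embedding_formats(response):
    """Report which embedding formats appear in a parsed /v1/embeddings reply."""
    return {
        "base64" if isinstance(item["embedding"], str) else "float"
        for item in response["data"]
    }


# Made-up sample replies: one honours encoding_format, one ignores it.
base64_reply = {"data": [{"embedding": "AAAAPw==", "index": 0, "object": "embedding"}]}
float_reply = {"data": [{"embedding": [0.5], "index": 0, "object": "embedding"}]}

print(build_embeddings_request(["hello"]))
print(embedding_formats(base64_reply))
print(embedding_formats(float_reply))
```

A reply classified as "float" even though the request asked for base64 would point at the proxy dropping or ignoring `encoding_format`.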

As a workaround, you can comment out the line extra_model_kwargs['encoding_format'] = 'base64' in the file api/core/model_runtime/model_providers/openai/text_embedding/text_embedding.py.


This is the result of my own testing. I am not sure whether you are encountering the same issue.
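
Another option would be a decoder that accepts either format. This is only a sketch under assumptions, not the fix that was merged: `decode_embedding` is a hypothetical helper, it uses the standard library instead of NumPy, and it assumes base64 payloads hold little-endian float32 values.

```python
import base64
import struct


def decode_embedding(embedding):
    """Return an embedding as a list of floats, whichever format was received."""
    if isinstance(embedding, str):
        # Assumption: the string is base64 of packed little-endian float32 values.
        raw = base64.b64decode(embedding)
        count = len(raw) // 4
        # struct.unpack raises struct.error if the byte length is not a multiple of 4.
        return list(struct.unpack(f"<{count}f", raw))
    # Otherwise assume the endpoint already returned a sequence of numbers.
    return [float(value) for value in embedding]


# Round trip with values that float32 represents exactly.
encoded = base64.b64encode(struct.pack("<3f", 0.5, -1.0, 2.0)).decode("ascii")
print(decode_embedding(encoded))
print(decode_embedding([0.5, -1.0, 2.0]))
```

Both calls print the same three values, so a caller would not need to know whether a proxy stripped the `encoding_format` parameter.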

@andylzming

After upgrading to 0.5.9, the following error is reported when uploading a document to the knowledge base:

[2024-03-21 15:22:38,510: INFO/MainProcess] Task tasks.document_indexing_task.document_indexing_task[f2551a43-5607-4269-a61b-e1360c4c92f8] received
[2024-03-21 15:22:38,517: INFO/MainProcess] Start process document: 950e9171-108d-481a-9eee-da396574a68f
[2024-03-21 15:22:39,054: DEBUG/MainProcess] Created new connection using: 3075a88a36494dc3afd4685ad08b42fe
[2024-03-21 15:22:39,656: ERROR/MainProcess] RPC error: [insert_rows], <DataNotMatchException: (code=1, message=Attempt to insert an unexpected field to collection without enabling dynamic field)>, <Time:{'RPC start': '2024-03-21 15:22:39.653886', 'RPC error': '2024-03-21 15:22:39.656706'}>
[2024-03-21 15:22:39,657: ERROR/MainProcess] Failed to insert batch starting at entity: 0/11
[2024-03-21 15:22:39,657: ERROR/MainProcess] Failed to insert batch starting at entity: 0/11
[2024-03-21 15:22:39,657: ERROR/MainProcess] consume document failed
Traceback (most recent call last):
  File "/app/api/core/indexing_runner.py", line 70, in run
    self._load(
  File "/app/api/core/indexing_runner.py", line 667, in _load
    index_processor.load(dataset, chunk_documents)
  File "/app/api/core/rag/index_processor/processor/paragraph_index_processor.py", line 59, in load
    vector.create(documents)
  File "/app/api/core/rag/datasource/vdb/vector_factory.py", line 121, in create
    self._vector_processor.create(
  File "/app/api/core/rag/datasource/vdb/milvus/milvus_vector.py", line 75, in create
    self.add_texts(texts, embeddings)
  File "/app/api/core/rag/datasource/vdb/milvus/milvus_vector.py", line 101, in add_texts
    raise e
  File "/app/api/core/rag/datasource/vdb/milvus/milvus_vector.py", line 95, in add_texts
    ids = self._client.insert(collection_name=self._collection_name, data=batch_insert_list)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/milvus_client/milvus_client.py", line 206, in insert
    raise ex from ex
  File "/usr/local/lib/python3.10/site-packages/pymilvus/milvus_client/milvus_client.py", line 198, in insert
    res = conn.insert_rows(collection_name, insert_batch, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 127, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 123, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 162, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 102, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 68, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 501, in insert_rows
    request = self._prepare_row_insert_request(
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 482, in _prepare_row_insert_request
    return Prepare.row_insert_param(
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/prepare.py", line 422, in row_insert_param
    return cls._parse_row_request(request, fields_info, enable_dynamic, entities)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/prepare.py", line 370, in _parse_row_request
    raise DataNotMatchException(message=ExceptionsMessage.InsertUnexpectedField)
pymilvus.exceptions.DataNotMatchException: <DataNotMatchException: (code=1, message=Attempt to insert an unexpected field to collection without enabling dynamic field)>
[2024-03-21 15:22:39,663: INFO/MainProcess] Processed dataset: bd66a1d2-d871-42c4-8fe7-4275be32a591 latency: 1.1507785804569721


8 participants