fix: Fix load bug #795

Merged: 7 commits, Jan 12, 2024

Conversation

jimlloyd (Contributor) commented Jan 8, 2024

Please describe the purpose of this pull request.
This is a partial fix for #755 for local embeddings. We might want to hold this until we can extend the fix to other embedding endpoint types, but given that users are experiencing pain over this, perhaps it is worth making it available now and fixing the others later.

The change checks the length of the token sequence and, whenever the sequence exceeds the model's limit, truncates the text to exactly the model limit.
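The truncation step described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: a whitespace split stands in for the embedding model's real tokenizer (in practice memgpt counts tokens with the configured tokenizer, e.g. tiktoken's cl100k_base), and the limit value is an example.

```python
MODEL_TOKEN_LIMIT = 8191  # example limit; the real value comes from the model config

def tokenize(text: str) -> list:
    """Stand-in tokenizer: whitespace split. The actual fix counts tokens
    with the embedding model's own tokenizer."""
    return text.split()

def truncate_to_model_limit(text: str, limit: int = MODEL_TOKEN_LIMIT) -> str:
    """Return text unchanged if it fits; otherwise cut it to exactly `limit` tokens."""
    tokens = tokenize(text)
    if len(tokens) <= limit:
        return text
    print(f"Warning: text is too long ({len(tokens)} tokens), truncating to {limit} tokens.")
    return " ".join(tokens[:limit])
```

With a real tokenizer the only change would be encoding to token ids, slicing to `limit`, and decoding back to text.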

How to test
It may be worthwhile to first test without this change so that you can locate instances that exhibit the #755 bug.
Try memgpt load directory --recursive --name <some-name> --input-dir <your-dir>
It's unclear to me which types of files are most problematic. I found it easiest to reproduce the problem with directories containing a lot of C++ code, but users have reported the problem with .pdf and, I think, .epub files.

Once you have a directory that reproduces the bug, check out this branch and try again. However, you may also
need the new --extensions <extension-list> option. The default extensions currently configured in this branch are ".txt,.md,.pdf". If you reproduced the bug with one of these file types, you don't need the option.

I ran the test with --extensions .md,.hpp,.cpp

Have you tested this PR?
Yes. With the fix, the output should be similar to the output produced in the past when no files failed to load.

Related issues or PRs
#755

Is your PR over 500 lines of code?
No


jimlloyd (Contributor, Author) commented Jan 8, 2024

Those test errors are not meaningful to me. I have not tried to run the tests locally but can do so this evening. Let me know if you understand them and have any suggestions for what I need to do to fix them. Or are these known failures still not resolved from the big breaking changes?

jimlloyd (Contributor, Author) commented

@cpacker do you want me to resolve merge conflicts and repush?

memgpt/embeddings.py (review comment resolved)
jimlloyd (Contributor, Author) commented

@cpacker I found something that fixes the failing CI test. But when I run memgpt load directory ... --extensions .md,.hpp,.cpp memgpt gives a usage error:

Usage: memgpt load directory [OPTIONS]
Try 'memgpt load directory --help' for help.
Error: No such option: --extensions

And when I run memgpt load directory --help, the help message is missing both the extensions option and the user_id option that are declared for the load_directory function.

Do you understand what might be causing this? I am totally new to typer so I don't know its quirks.

cpacker (Collaborator) commented Jan 12, 2024

> @cpacker I found something that fixes the failing CI test. But when I run memgpt load directory ... --extensions .md,.hpp,.cpp memgpt gives a usage error: [...]

I'm not getting the --extensions error:

(pymemgpt-py3.11) MemGPT-jimlloyd % memgpt load directory --name memgpt_docs --input-dir docs  --extensions .md,.hpp,.cpp
Parsing nodes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 152/152 [00:00<00:00, 2758.03it/s]
Generating embeddings:   0%|                                                                                                                                      | 0/192 [00:00<?, ?it/s]

Maybe it's an environment issue (the env is pointing at a stale version of the file)?
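One quick way to test the stale-environment theory is to ask Python which copy of the package the interpreter would actually load and compare it against the checkout. A minimal sketch; `locate_module` is a hypothetical helper written for illustration, not part of memgpt:

```python
import importlib.util
from typing import Optional

def locate_module(name: str) -> Optional[str]:
    """Return the file path Python would load for `name`,
    or None if the module is not importable at all."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None

# e.g. print(locate_module("memgpt")) and check whether the path points at
# your development checkout or at a stale site-packages install
```

If the printed path is under site-packages rather than the working tree, the CLI is running the old code without the new --extensions option.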

cpacker (Collaborator) left a comment:

LGTM

cpacker merged commit c8dd115 into letta-ai:main on Jan 12, 2024
3 checks passed
flobotde commented Feb 4, 2024

It seems to me that I am still facing this with version 0.3, where this fix should be included (I checked my memgpt source code in both files), as far as I can tell.
Maybe it is just a problem with my local environment, or it has nothing to do with the problem described above? If not, I can raise another issue if you want.

UPDATE: This error only occurred for me with your embedding server embeddings.memgpt.ai.
Using the embedding server from OpenAI (the text-embedding-ada-002 model) I had no problems!

Description of the problem I am facing (hopefully anyone can reproduce it):
1. Made a fresh setup of memgpt (removed the .memgpt folder) with memgpt v0.3.
2. Ran memgpt configure quickstart (using your model and embedding memgpt API server; many thanks and coffee for providing that, by the way :) ).
My config file is also in the attached files.
3. Changed some presets (which I think shouldn't affect my problem, but who knows?):
3.1 defaulted preset = memgpt_docs
3.2 defaulted persona = memgpt_doc
3.3 added persona: memgpt add persona --name deepracer_doc -f deepracer_doc.txt
3.4 added human: me as a human, simply an edited copy of cs_phd.txt
4. Now I wanted to add a directory of files containing this PDF file (among others): https://docs.aws.amazon.com/pdfs/deepracer/latest/developerguide/awsracerdg.pdf
Embeddings were created for all the other files in there with no problem (e.g. F8-BP-2023-Jakl-Vincent-thesis.txt).

memgpt load directory --name deepracer --input-dir=deepracer --recursive

Warning: couldn't find tokenizer for model BAAI/bge-large-en-v1.5, using default tokenizer cl100k_base
Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 440/440 [00:00<00:00, 689.90it/s]
Generating embeddings:  30%|████████████████████████████████████████████▉                                                                                                          | 400/1344 [04:13<09:20,  1.69it/s]Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 114, in _call_api
    embedding = response_json["data"][0]["embedding"]
KeyError: 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/bin/memgpt", line 8, in <module>
    sys.exit(app())
  File "/opt/homebrew/lib/python3.10/site-packages/typer/main.py", line 328, in __call__
    raise e
  File "/opt/homebrew/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/homebrew/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/cli/cli_load.py", line 227, in load_directory
    store_docs(str(name), docs, user_id)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/cli/cli_load.py", line 117, in store_docs
    index = VectorStoreIndex.from_documents(docs, service_context=service_context, show_progress=True)
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/base.py", line 112, in from_documents
    return cls(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 53, in __init__
    super().__init__(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/base.py", line 75, in __init__
    index_struct = self.build_index_from_nodes(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 274, in build_index_from_nodes
    return self._build_index_from_nodes(nodes, **insert_kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 246, in _build_index_from_nodes
    self._add_nodes_to_index(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 199, in _add_nodes_to_index
    nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 107, in _get_node_with_embedding
    id_to_embed_map = embed_nodes(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/utils.py", line 137, in embed_nodes
    new_embeddings = embed_model.get_text_embedding_batch(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/core/embeddings/base.py", line 256, in get_text_embedding_batch
    embeddings = self._get_text_embeddings(cur_batch)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 168, in _get_text_embeddings
    embeddings = [self._get_text_embedding(text) for text in texts]
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 168, in <listcomp>
    embeddings = [self._get_text_embedding(text) for text in texts]
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 164, in _get_text_embedding
    embedding = self._call_api(text)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 116, in _call_api
    raise TypeError(f"Got back an unexpected payload from text embedding function, response=\n{response_json}")
TypeError: Got back an unexpected payload from text embedding function, response=
{'message': 'Input validation error: `inputs` must have less than 512 tokens. Given: 2519', 'code': 413, 'type': 'Validation'}
Generating embeddings:  30%|█████████████████████████████████████████████▉                                                                                                         | 409/1344 [04:17<09:48,  1.59it/s]

5. Another try was converting the PDF to a text file (with pdf2text.py):

memgpt load directory --name deepracer-txt --input-files=files/deepracer-txt/awsracerdg.txt 

Warning: couldn't find tokenizer for model BAAI/bge-large-en-v1.5, using default tokenizer cl100k_base
Warning: text is too long (95618 tokens), truncating to 8191 tokens.
Source deepracer-txt for user 00000000-0000-0000-0000-5eab23008ef0 already exists.
Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 24.94it/s]
Generating embeddings:  12%|████████████████████████████                                                                                                                                                                                                                     | 10/86 [00:05<00:39,  1.93it/s]Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 114, in _call_api
    embedding = response_json["data"][0]["embedding"]
KeyError: 'data'

During handling of the above exception, another exception occurred:
...

  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 164, in _get_text_embedding
    embedding = self._call_api(text)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 116, in _call_api
    raise TypeError(f"Got back an unexpected payload from text embedding function, response=\n{response_json}")
TypeError: Got back an unexpected payload from text embedding function, response=
{'message': 'Input validation error: `inputs` must have less than 512 tokens. Given: 1461', 'code': 413, 'type': 'Validation'}
Generating embeddings:  22%|█████████████████████████████████████████████████████▏                                                                                                                                                                                           | 19/86 [00:07<00:26,  2.49it/s]

6. Even splitting the text file into smaller ones with 20 pages per file didn't help. Even a 10-page file didn't work for me, which is very weird! (You can find all those smaller files in my attachment.)

memgpt load directory --name deepracer-txt --input-files=files/deepracer-txt/awsracerdg_page_001-010.txt

Warning: couldn't find tokenizer for model BAAI/bge-large-en-v1.5, using default tokenizer cl100k_base
Source deepracer-txt for user 00000000-0000-0000-0000-5eab23008ef0 already exists.
Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.12it/s]
Generating embeddings:  14%|█████████████████████████████████                                                                                                                                                                                                                | 10/73 [00:04<00:30,  2.03it/s]Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 114, in _call_api
    embedding = response_json["data"][0]["embedding"]
KeyError: 'data'

During handling of the above exception, another exception occurred:
...

  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 164, in _get_text_embedding
    embedding = self._call_api(text)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 116, in _call_api
    raise TypeError(f"Got back an unexpected payload from text embedding function, response=\n{response_json}")
TypeError: Got back an unexpected payload from text embedding function, response=
{'message': 'Input validation error: `inputs` must have less than 512 tokens. Given: 1451', 'code': 413, 'type': 'Validation'}
Generating embeddings:  26%|██████████████████████████████████████████████████████████████▋                                                                                                                                                                                  | 19/73 [00:06<00:19,  2.84it/s]
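The three failures above share one pattern: the client truncates to 8191 tokens (the cl100k_base fallback limit), while the embeddings.memgpt.ai endpoint rejects anything of 512 tokens or more, so even truncated inputs are still too large for the server. A hedged sketch of splitting text into pieces under the endpoint's limit; a whitespace split stands in for the real tokenizer, and the 511-token cap is derived from the server's "must have less than 512 tokens" error message:

```python
ENDPOINT_TOKEN_LIMIT = 511  # server requires strictly fewer than 512 tokens

def chunk_for_endpoint(text: str, limit: int = ENDPOINT_TOKEN_LIMIT) -> list:
    """Split `text` into consecutive pieces of at most `limit` tokens each,
    so every piece individually satisfies the endpoint's validation."""
    tokens = text.split()  # stand-in for the embedding model's tokenizer
    return [" ".join(tokens[i:i + limit]) for i in range(0, len(tokens), limit)]
```

A fix along these lines would embed each chunk separately (or truncate to the endpoint's limit rather than the model's 8191), which is consistent with the PR's note that other embedding endpoint types still need the same treatment.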

bugreport-795.zip

norton120 pushed a commit to norton120/MemGPT that referenced this pull request Feb 15, 2024
Co-authored-by: Charles Packer <packercharles@gmail.com>
quantumalchemy commented

This same error is showing its ugly head again in 0.3.4:
memgpt load directory --name filename --input-files=files/filename.txt

cpacker (Collaborator) commented Mar 5, 2024

Thanks for the report @quantumalchemy - @sarahwooders and I are looking into it.

mattzh72 pushed a commit that referenced this pull request Oct 9, 2024
Co-authored-by: Charles Packer <packercharles@gmail.com>