fix: Fix load bug #795

Merged: 7 commits, Jan 12, 2024

Conversation

jimlloyd (Contributor) commented Jan 8, 2024

Please describe the purpose of this pull request.
This is a partial fix for #755 for local embeddings. We might want to hold this until we can extend the fix to other embedding endpoint types, but given that users are experiencing pain over this, perhaps it is worth making it available now and fixing the others later.

The change checks the length of the token sequence and, whenever the sequence exceeds the model's limit, truncates the text to exactly the model limit.
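The truncation step described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: a whitespace split stands in for the embedding model's real tokenizer (in practice memgpt counts tokens with the configured tokenizer, e.g. tiktoken's cl100k_base), and the limit value is an example.

```python
MODEL_TOKEN_LIMIT = 8191  # example limit; the real value comes from the model config

def tokenize(text: str) -> list:
    """Stand-in tokenizer: whitespace split. The actual fix counts tokens
    with the embedding model's own tokenizer."""
    return text.split()

def truncate_to_model_limit(text: str, limit: int = MODEL_TOKEN_LIMIT) -> str:
    """Return text unchanged if it fits; otherwise cut it to exactly `limit` tokens."""
    tokens = tokenize(text)
    if len(tokens) <= limit:
        return text
    print(f"Warning: text is too long ({len(tokens)} tokens), truncating to {limit} tokens.")
    return " ".join(tokens[:limit])
```

With a real tokenizer the only change would be encoding to token ids, slicing to `limit`, and decoding back to text.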

How to test
It may be worthwhile to first test without this change so that you can locate instances that exhibit the #755 bug.
Try memgpt load directory --recursive --name <some-name> --input-dir <your-dir>
It's unclear to me which types of files are most problematic. I found it easiest to reproduce the problem with directories containing a lot of C++ code, but users have reported the problem with .pdf and, I think, .epub files.

Once you have a directory that reproduces the bug, check out this branch and try again. However, you may also
need the new --extensions <extension-list> option. The default extensions currently configured in this branch are ".txt,.md,.pdf". If you reproduced the bug with one of these file types, you don't need the option.

I ran the test with --extensions .md,.hpp,.cpp

Have you tested this PR?
Yes. With the fix, the output should be similar to the output produced in the past when no files failed to load.

Related issues or PRs
#755

Is your PR over 500 lines of code?
No


jimlloyd (Contributor, Author) commented Jan 8, 2024

Those test errors are not meaningful to me. I have not tried to run the tests locally but can do so this evening. Let me know if you understand them and have any suggestions for what I need to do to fix them. Or are these known failures still not resolved from the big breaking changes?

jimlloyd (Contributor, Author) commented

@cpacker do you want me to resolve merge conflicts and repush?

memgpt/embeddings.py (review comment resolved)
jimlloyd (Contributor, Author) commented

@cpacker I found something that fixes the failing CI test. But when I run memgpt load directory ... --extensions .md,.hpp,.cpp memgpt gives a usage error:

Usage: memgpt load directory [OPTIONS]
Try 'memgpt load directory --help' for help.
Error: No such option: --extensions

And when I run memgpt load directory --help, the help message is missing both the extensions option and the user_id option that are declared for the load_directory function.

Do you understand what might be causing this? I am totally new to typer so I don't know its quirks.

cpacker (Collaborator) commented Jan 12, 2024

> @cpacker I found something that fixes the failing CI test. But when I run memgpt load directory ... --extensions .md,.hpp,.cpp memgpt gives a usage error: [...]

I'm not getting the --extensions error:

(pymemgpt-py3.11) MemGPT-jimlloyd % memgpt load directory --name memgpt_docs --input-dir docs  --extensions .md,.hpp,.cpp
Parsing nodes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 152/152 [00:00<00:00, 2758.03it/s]
Generating embeddings:   0%|                                                                                                                                      | 0/192 [00:00<?, ?it/s]

Maybe it's an environment issue (the env is pointing at a stale version of the file)?
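One quick way to test the stale-environment theory is to ask Python which copy of the package the interpreter would actually load and compare it against the checkout. A minimal sketch; `locate_module` is a hypothetical helper written for illustration, not part of memgpt:

```python
import importlib.util
from typing import Optional

def locate_module(name: str) -> Optional[str]:
    """Return the file path Python would load for `name`,
    or None if the module is not importable at all."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None

# e.g. print(locate_module("memgpt")) and check whether the path points at
# your development checkout or at a stale site-packages install
```

If the printed path is under site-packages rather than the working tree, the CLI is running the old code without the new --extensions option.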

cpacker (Collaborator) left a comment:

LGTM

cpacker merged commit c8dd115 into letta-ai:main on Jan 12, 2024
3 checks passed
flobotde commented Feb 4, 2024

It seems to me that I am still facing this with version 0.3, where this fix should be included (I checked my memgpt source code in both files), as far as I can tell.
Maybe it is just a problem with my local environment, or it has nothing to do with the problem described above? If not, I can raise another issue if you want.

UPDATE: This error only occurred for me with your embedding server embeddings.memgpt.ai.
Using the embedding server from OpenAI (the text-embedding-ada-002 model) I had no problems!

Description of the problem I am facing (hopefully anyone can reproduce it):
1. Made a fresh setup of memgpt (removed the .memgpt folder) with memgpt v0.3.
2. Ran memgpt configure quickstart (using your model and embedding memgpt API server; many thanks and coffee for providing that, by the way :) ).
My config file is also in the attached files.
3. Changed some presets (which I think shouldn't affect my problem, but who knows?):
3.1 defaulted preset = memgpt_docs
3.2 defaulted persona = memgpt_doc
3.3 added persona: memgpt add persona --name deepracer_doc -f deepracer_doc.txt
3.4 added human: me as a human, simply an edited copy of cs_phd.txt
4. Now I wanted to add a directory of files containing this PDF file (among others): https://docs.aws.amazon.com/pdfs/deepracer/latest/developerguide/awsracerdg.pdf
Embeddings were created for all the other files in there with no problem (e.g. F8-BP-2023-Jakl-Vincent-thesis.txt).

memgpt load directory --name deepracer --input-dir=deepracer --recursive

Warning: couldn't find tokenizer for model BAAI/bge-large-en-v1.5, using default tokenizer cl100k_base
Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 440/440 [00:00<00:00, 689.90it/s]
Generating embeddings:  30%|████████████████████████████████████████████▉                                                                                                          | 400/1344 [04:13<09:20,  1.69it/s]Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 114, in _call_api
    embedding = response_json["data"][0]["embedding"]
KeyError: 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/bin/memgpt", line 8, in <module>
    sys.exit(app())
  File "/opt/homebrew/lib/python3.10/site-packages/typer/main.py", line 328, in __call__
    raise e
  File "/opt/homebrew/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/homebrew/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/homebrew/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/cli/cli_load.py", line 227, in load_directory
    store_docs(str(name), docs, user_id)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/cli/cli_load.py", line 117, in store_docs
    index = VectorStoreIndex.from_documents(docs, service_context=service_context, show_progress=True)
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/base.py", line 112, in from_documents
    return cls(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 53, in __init__
    super().__init__(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/base.py", line 75, in __init__
    index_struct = self.build_index_from_nodes(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 274, in build_index_from_nodes
    return self._build_index_from_nodes(nodes, **insert_kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 246, in _build_index_from_nodes
    self._add_nodes_to_index(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 199, in _add_nodes_to_index
    nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 107, in _get_node_with_embedding
    id_to_embed_map = embed_nodes(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/indices/utils.py", line 137, in embed_nodes
    new_embeddings = embed_model.get_text_embedding_batch(
  File "/opt/homebrew/lib/python3.10/site-packages/llama_index/core/embeddings/base.py", line 256, in get_text_embedding_batch
    embeddings = self._get_text_embeddings(cur_batch)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 168, in _get_text_embeddings
    embeddings = [self._get_text_embedding(text) for text in texts]
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 168, in <listcomp>
    embeddings = [self._get_text_embedding(text) for text in texts]
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 164, in _get_text_embedding
    embedding = self._call_api(text)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 116, in _call_api
    raise TypeError(f"Got back an unexpected payload from text embedding function, response=\n{response_json}")
TypeError: Got back an unexpected payload from text embedding function, response=
{'message': 'Input validation error: `inputs` must have less than 512 tokens. Given: 2519', 'code': 413, 'type': 'Validation'}
Generating embeddings:  30%|█████████████████████████████████████████████▉                                                                                                         | 409/1344 [04:17<09:48,  1.59it/s]

5. Another try was converting the PDF to a text file (with pdf2text.py):

memgpt load directory --name deepracer-txt --input-files=files/deepracer-txt/awsracerdg.txt 

Warning: couldn't find tokenizer for model BAAI/bge-large-en-v1.5, using default tokenizer cl100k_base
Warning: text is too long (95618 tokens), truncating to 8191 tokens.
Source deepracer-txt for user 00000000-0000-0000-0000-5eab23008ef0 already exists.
Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 24.94it/s]
Generating embeddings:  12%|████████████████████████████                                                                                                                                                                                                                     | 10/86 [00:05<00:39,  1.93it/s]Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 114, in _call_api
    embedding = response_json["data"][0]["embedding"]
KeyError: 'data'

During handling of the above exception, another exception occurred:
...

  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 164, in _get_text_embedding
    embedding = self._call_api(text)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 116, in _call_api
    raise TypeError(f"Got back an unexpected payload from text embedding function, response=\n{response_json}")
TypeError: Got back an unexpected payload from text embedding function, response=
{'message': 'Input validation error: `inputs` must have less than 512 tokens. Given: 1461', 'code': 413, 'type': 'Validation'}
Generating embeddings:  22%|█████████████████████████████████████████████████████▏                                                                                                                                                                                           | 19/86 [00:07<00:26,  2.49it/s]

6. Even splitting the text file into smaller ones with 20 pages per file didn't help. Even a 10-page file didn't work for me, which is very weird! (You can find all those smaller files in my attachment.)

memgpt load directory --name deepracer-txt --input-files=files/deepracer-txt/awsracerdg_page_001-010.txt

Warning: couldn't find tokenizer for model BAAI/bge-large-en-v1.5, using default tokenizer cl100k_base
Source deepracer-txt for user 00000000-0000-0000-0000-5eab23008ef0 already exists.
Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 19.12it/s]
Generating embeddings:  14%|█████████████████████████████████                                                                                                                                                                                                                | 10/73 [00:04<00:30,  2.03it/s]Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 114, in _call_api
    embedding = response_json["data"][0]["embedding"]
KeyError: 'data'

During handling of the above exception, another exception occurred:
...

  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 164, in _get_text_embedding
    embedding = self._call_api(text)
  File "/opt/homebrew/lib/python3.10/site-packages/memgpt/embeddings.py", line 116, in _call_api
    raise TypeError(f"Got back an unexpected payload from text embedding function, response=\n{response_json}")
TypeError: Got back an unexpected payload from text embedding function, response=
{'message': 'Input validation error: `inputs` must have less than 512 tokens. Given: 1451', 'code': 413, 'type': 'Validation'}
Generating embeddings:  26%|██████████████████████████████████████████████████████████████▋                                                                                                                                                                                  | 19/73 [00:06<00:19,  2.84it/s]
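The three failures above share one pattern: the client truncates to 8191 tokens (the cl100k_base fallback limit), while the embeddings.memgpt.ai endpoint rejects anything of 512 tokens or more, so even truncated inputs are still too large for the server. A hedged sketch of splitting text into pieces under the endpoint's limit; a whitespace split stands in for the real tokenizer, and the 511-token cap is derived from the server's "must have less than 512 tokens" error message:

```python
ENDPOINT_TOKEN_LIMIT = 511  # server requires strictly fewer than 512 tokens

def chunk_for_endpoint(text: str, limit: int = ENDPOINT_TOKEN_LIMIT) -> list:
    """Split `text` into consecutive pieces of at most `limit` tokens each,
    so every piece individually satisfies the endpoint's validation."""
    tokens = text.split()  # stand-in for the embedding model's tokenizer
    return [" ".join(tokens[i:i + limit]) for i in range(0, len(tokens), limit)]
```

A fix along these lines would embed each chunk separately (or truncate to the endpoint's limit rather than the model's 8191), which is consistent with the PR's note that other embedding endpoint types still need the same treatment.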

bugreport-795.zip

norton120 pushed a commit to norton120/MemGPT that referenced this pull request Feb 15, 2024
Co-authored-by: Charles Packer <packercharles@gmail.com>
quantumalchemy commented

This same error is showing its ugly head again in 0.3.4:
memgpt load directory --name filename --input-files=files/filename.txt

cpacker (Collaborator) commented Mar 5, 2024

Thanks for the report @quantumalchemy - @sarahwooders and I are looking into it.

mattzh72 pushed a commit that referenced this pull request Oct 9, 2024
Co-authored-by: Charles Packer <packercharles@gmail.com>