Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix "NotImplementedError" with the atransform_documents() method in MarkdownifyTransformer (this issue includes a suggested fix) #27865

Open
5 tasks done
rparkr opened this issue Nov 3, 2024 · 0 comments · May be fixed by #27866
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@rparkr
Copy link

rparkr commented Nov 3, 2024

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Linked to discussion: #27760

Currently, the MarkdownifyTransformer raises a NotImplementedError if a user calls the atransform_documents() method:

async def atransform_documents(
self,
documents: Sequence[Document],
**kwargs: Any,
) -> Sequence[Document]:
raise NotImplementedError

Here's a minimum reproducible example:

from langchain_community.document_transformers import MarkdownifyTransformer
from langchain_core.documents import Document

docs = [
    Document(
        page_content=(
            """
            <h1>The first doc</h1>
            <p>Here's the first document.</p>
            """
        ),
        metadata={"source": "doc1"}
    ),
    Document(
        page_content=(
            """
            <h1>The second doc</h1>
            <p>Here's the second document.</p>
            """
        ),
        metadata={"source": "doc2"}
    )
]

md = MarkdownifyTransformer()
md_docs = await md.atransform_documents(docs)
print(md_docs)

Expected result:

[Document(metadata={'source': 'doc1'}, page_content="# The first doc\n\nHere's the first document."), Document(metadata={'source': 'doc2'}, page_content="# The second doc\n\nHere's the second document.")]

or, when formatted:

[
    Document(
        metadata={'source': 'doc1'},
        page_content="# The first doc\n\nHere's the first document."
    ),
    Document(
        metadata={'source': 'doc2'},
        page_content="# The second doc\n\nHere's the second document."
    )
]

Actual result:

NotImplementedError

Traceback (most recent call last):
  File "/home/ryan/langchain_test/.venv/lib/python3.11/site-packages/marimo/_runtime/executor.py", line 131, in execute_cell_async
    await eval(cell.body, glbls)
  Cell marimo://langchain_markdownify_async.py#cell=cell-0
, line 26, in <module>
    md_docs = await md.atransform_documents(docs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryan/langchain_test/.venv/lib/python3.11/site-packages/langchain_community/document_transformers/markdownify.py", line 83, in atransform_documents
    raise NotImplementedError
NotImplementedError

Recommended resolution

I implemented the atransform_documents() method here and would be happy to contribute a PR.

Error Message and Stack Trace (if applicable)

NotImplementedError

Traceback (most recent call last):
  File "/home/ryan/langchain_test/.venv/lib/python3.11/site-packages/marimo/_runtime/executor.py", line 131, in execute_cell_async
    await eval(cell.body, glbls)
  Cell marimo://langchain_markdownify_async.py#cell=cell-0
, line 26, in <module>
    md_docs = await md.atransform_documents(docs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryan/langchain_test/.venv/lib/python3.11/site-packages/langchain_community/document_transformers/markdownify.py", line 83, in atransform_documents
    raise NotImplementedError
NotImplementedError

Description

The MarkdownifyTransformer class in langchain_community.document_transformers has not yet implemented the atransform_documents method.

I would like to use that method within an application that uses asynchronous methods for document loading, embedding generation, and semantic search.

I created an example implementation here and would be happy to contribute a PR with the changes.

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Sun Oct 20 15:04:22 UTC 2024
Python Version: 3.11.10 (main, Sep 9 2024, 22:11:19) [Clang 18.1.8 ]

Package Information

langchain_core: 0.3.15
langchain: 0.3.7
langchain_community: 0.3.5
langsmith: 0.1.139
langchain_text_splitters: 0.3.2

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
httpx: 0.27.2
httpx-sse: 0.4.0
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.11
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.6.1
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.35
tenacity: 9.0.0
typing-extensions: 4.12.2

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Nov 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant