Fix "NotImplementedError" with the `atransform_documents()` method in `MarkdownifyTransformer` (this issue includes a suggested fix) #27865

rparkr · 2024-11-03T02:09:31Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Linked to discussion: #27760

Currently, the MarkdownifyTransformer raises a NotImplementedError if a user calls the atransform_documents() method:

langchain/libs/community/langchain_community/document_transformers/markdownify.py

Lines 78 to 83 in 3b956b3

    
           async def atransform_documents( 
        
               self, 
        
               documents: Sequence[Document], 
        
               **kwargs: Any, 
        
           ) -> Sequence[Document]: 
        
               raise NotImplementedError

Here's a minimum reproducible example:

from langchain_community.document_transformers import MarkdownifyTransformer
from langchain_core.documents import Document

docs = [
    Document(
        page_content=(
            """
            <h1>The first doc</h1>
            <p>Here's the first document.</p>
            """
        ),
        metadata={"source": "doc1"}
    ),
    Document(
        page_content=(
            """
            <h1>The second doc</h1>
            <p>Here's the second document.</p>
            """
        ),
        metadata={"source": "doc2"}
    )
]

md = MarkdownifyTransformer()
md_docs = await md.atransform_documents(docs)
print(md_docs)

Expected result:

[Document(metadata={'source': 'doc1'}, page_content="# The first doc\n\nHere's the first document."), Document(metadata={'source': 'doc2'}, page_content="# The second doc\n\nHere's the second document.")]

or, when formatted:

[
    Document(
        metadata={'source': 'doc1'},
        page_content="# The first doc\n\nHere's the first document."
    ),
    Document(
        metadata={'source': 'doc2'},
        page_content="# The second doc\n\nHere's the second document."
    )
]

Actual result:

NotImplementedError

Traceback (most recent call last):
  File "/home/ryan/langchain_test/.venv/lib/python3.11/site-packages/marimo/_runtime/executor.py", line 131, in execute_cell_async
    await eval(cell.body, glbls)
  Cell marimo://langchain_markdownify_async.py#cell=cell-0
, line 26, in <module>
    md_docs = await md.atransform_documents(docs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryan/langchain_test/.venv/lib/python3.11/site-packages/langchain_community/document_transformers/markdownify.py", line 83, in atransform_documents
    raise NotImplementedError
NotImplementedError

Recommended resolution

I implemented the atransform_documents() method here and would be happy to contribute a PR.

Error Message and Stack Trace (if applicable)

NotImplementedError

Traceback (most recent call last):
  File "/home/ryan/langchain_test/.venv/lib/python3.11/site-packages/marimo/_runtime/executor.py", line 131, in execute_cell_async
    await eval(cell.body, glbls)
  Cell marimo://langchain_markdownify_async.py#cell=cell-0
, line 26, in <module>
    md_docs = await md.atransform_documents(docs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ryan/langchain_test/.venv/lib/python3.11/site-packages/langchain_community/document_transformers/markdownify.py", line 83, in atransform_documents
    raise NotImplementedError
NotImplementedError

Description

The MarkdownifyTransformer class in langchain_community.document_transformers has not yet implemented the atransform_documents method.

I would like to use that method within an application that uses asynchronous methods for document loading, embedding generation, and semantic search.

I created an example implementation here and would be happy to contribute a PR with the changes.

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Sun Oct 20 15:04:22 UTC 2024
Python Version: 3.11.10 (main, Sep 9 2024, 22:11:19) [Clang 18.1.8 ]

Package Information

langchain_core: 0.3.15
langchain: 0.3.7
langchain_community: 0.3.5
langsmith: 0.1.139
langchain_text_splitters: 0.3.2

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
httpx: 0.27.2
httpx-sse: 0.4.0
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.11
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.6.1
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.35
tenacity: 9.0.0
typing-extensions: 4.12.2

The text was updated successfully, but these errors were encountered:

dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Nov 3, 2024

rparkr linked a pull request Nov 3, 2024 that will close this issue

community: implement the atransform_documents method for MarkdownifyTransformer #27866

Draft

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix "NotImplementedError" with the `atransform_documents()` method in `MarkdownifyTransformer` (this issue includes a suggested fix) #27865

Fix "NotImplementedError" with the `atransform_documents()` method in `MarkdownifyTransformer` (this issue includes a suggested fix) #27865

rparkr commented Nov 3, 2024 •

edited

Loading

Fix "NotImplementedError" with the atransform_documents() method in MarkdownifyTransformer (this issue includes a suggested fix) #27865

Fix "NotImplementedError" with the atransform_documents() method in MarkdownifyTransformer (this issue includes a suggested fix) #27865

Comments

rparkr commented Nov 3, 2024 • edited Loading

Checked other resources

Example Code

Recommended resolution

Error Message and Stack Trace (if applicable)

Description

System Info

System Information

Package Information

Optional packages not installed

Other Dependencies

Fix "NotImplementedError" with the `atransform_documents()` method in `MarkdownifyTransformer` (this issue includes a suggested fix) #27865

Fix "NotImplementedError" with the `atransform_documents()` method in `MarkdownifyTransformer` (this issue includes a suggested fix) #27865

rparkr commented Nov 3, 2024 •

edited

Loading