HTMLHeaderTextSplitter won't run (maxHead) #13149
🤖 Hello @AndyMik90, I'm an assistive bot here to help you with your issue while we wait for a human maintainer. I'm able to help address bugs, answer questions, and even guide you on how to become a contributor to the project. Rest assured, I'm here to help! I'm currently reviewing the problem you've encountered with HTMLHeaderTextSplitter in the LangChain repository. I'll be back shortly with a more comprehensive solution! Stay tuned!
🤖 The error you're encountering, `lxml.etree.XSLTApplyError: maxHead`, is caused by a size limitation in the XSLT processor that `lxml` uses when transforming large HTML documents.

One possible solution could be to split your HTML file into smaller chunks before processing it with the `HTMLHeaderTextSplitter`. Another possible solution could be to increase the maximum size limit of the XSLT processor. However, this would involve modifying the source code of the libxslt library and recompiling it, which may not be feasible or desirable.

Please note that these are just possible solutions and may not work in all cases. If you continue to encounter issues, please provide more details about the size and structure of your HTML file, and any other relevant information that could help in troubleshooting the issue.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. If you want to continue the conversation, start your reply with @dosu-bot.
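A rough sketch of the first suggestion (pre-splitting the HTML into smaller fragments before handing each one to the splitter). The helper name, batch size, and file name are illustrative, and header metadata will not carry across fragment boundaries:

```python
from bs4 import BeautifulSoup
from langchain.text_splitter import HTMLHeaderTextSplitter


def iter_html_batches(html: str, batch_size: int = 200):
    """Yield smaller HTML fragments built from the body's top-level elements,
    batch_size elements at a time."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body or soup
    top_level = [child for child in body.children if getattr(child, "name", None)]
    for i in range(0, len(top_level), batch_size):
        yield "".join(str(child) for child in top_level[i:i + batch_size])


splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
)

docs = []
with open("large_page.html", encoding="utf-8") as f:  # placeholder file name
    for fragment in iter_html_batches(f.read()):
        # Each fragment is small enough to avoid the XSLT size limit.
        docs.extend(splitter.split_text(fragment))
```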
Hi, @AndyMik90, I'm helping the LangChain team manage their backlog and am marking this issue as stale.

From what I understand, you raised an issue regarding the HTMLHeaderTextSplitter not running due to a maxHead error when attempting to split a large HTML file. Dosubot provided a detailed response explaining the error and suggested potential solutions for resolving the issue, such as splitting the HTML file into smaller chunks or modifying the underlying libraries. They also requested more details for further troubleshooting.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, kindly let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!
I'm having the same problem, and I'm using the latest version of the LangChain repository.
Me too. Have you found any solutions?
This is indeed a problem with the lxml/XSLT transformation that HTMLHeaderTextSplitter relies on. It is high time we switch to another dependency.
I've just hit this problem too. Anyone know of any workarounds? |
+1 Looking for a workaround |
My workaround is to subclass HTMLHeaderTextSplitter and override split_text_from_file. In there I call xsltproc via subprocess.run just for the transformation, using the same html_chunks_with_headers.xslt.
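A minimal sketch of the transformation step of that workaround, assuming `xsltproc` is on the PATH and that `html_chunks_with_headers.xslt` has been copied locally from the installed langchain package (the helper name and stylesheet path are illustrative). The overridden `split_text_from_file` would then parse the returned XML and assemble `Document` objects the same way the original method does:

```python
import subprocess

# Local copy of the stylesheet bundled with langchain (path is illustrative).
XSLT_PATH = "html_chunks_with_headers.xslt"


def transform_with_xsltproc(html_path: str) -> str:
    """Apply the header-chunking stylesheet with the external xsltproc binary,
    sidestepping the maxHead limit hit by lxml's in-process XSLT engine."""
    result = subprocess.run(
        ["xsltproc", XSLT_PATH, html_path],
        capture_output=True,
        text=True,
        check=True,  # raise if xsltproc reports an error
    )
    return result.stdout  # transformed XML, ready for the usual element extraction
```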
I have overridden the class. I used bs4 instead of lxml, which has been working for me so far.
|
…MLHeaderTextSplitter for Improved Large HTML File Processing (#27678)

This pull request updates the `HTMLHeaderTextSplitter` by replacing the `split_text_from_file` method's implementation. The original method used `lxml` and XSLT for processing HTML files, which caused `lxml.etree.XSLTApplyError: maxHead` when handling large HTML documents due to limitations in the XSLT processor.

Fixes #13149

By switching to BeautifulSoup (`bs4`), we achieve:

- **Improved Performance and Reliability:** BeautifulSoup efficiently processes large HTML files without the errors associated with `lxml` and XSLT.
- **Simplified Dependencies:** Removes the dependency on `lxml` and external XSLT files, relying instead on the widely used `beautifulsoup4` library.
- **Maintained Functionality:** The new method replicates the original behavior, ensuring compatibility with existing code and preserving the extraction of content and metadata.

**Issue:** This change addresses issues related to processing large HTML files with the existing `HTMLHeaderTextSplitter` implementation. It resolves problems where users encounter `lxml.etree.XSLTApplyError: maxHead` due to large HTML documents.

**Dependencies:**

- **BeautifulSoup (`beautifulsoup4`):** The `beautifulsoup4` library is now used for parsing HTML content.
  - Installation: `pip install beautifulsoup4`

**Code Changes:** Updated the `split_text_from_file` method in `HTMLHeaderTextSplitter` as follows:

```python
def split_text_from_file(self, file: Any) -> List[Document]:
    """Split HTML file using BeautifulSoup.

    Args:
        file: HTML file path or file-like object.

    Returns:
        List of Document objects with page_content and metadata.
    """
    from bs4 import BeautifulSoup
    from langchain.docstore.document import Document
    import bs4

    # Read the HTML content from the file or file-like object
    if isinstance(file, str):
        with open(file, 'r', encoding='utf-8') as f:
            html_content = f.read()
    else:
        # Assuming file is a file-like object
        html_content = file.read()

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract the header tags and their corresponding metadata keys
    headers_to_split_on = [tag[0] for tag in self.headers_to_split_on]
    header_mapping = dict(self.headers_to_split_on)

    documents = []

    # Find the body of the document
    body = soup.body if soup.body else soup

    # Find all header tags in the order they appear
    all_headers = body.find_all(headers_to_split_on)

    # If there's content before the first header, collect it
    first_header = all_headers[0] if all_headers else None
    if first_header:
        pre_header_content = ''
        for elem in first_header.find_all_previous():
            if isinstance(elem, bs4.Tag):
                text = elem.get_text(separator=' ', strip=True)
                if text:
                    pre_header_content = text + ' ' + pre_header_content
        if pre_header_content.strip():
            documents.append(Document(
                page_content=pre_header_content.strip(),
                metadata={}  # No metadata since there's no header
            ))
    else:
        # If no headers are found, return the whole content
        full_text = body.get_text(separator=' ', strip=True)
        if full_text.strip():
            documents.append(Document(
                page_content=full_text.strip(),
                metadata={}
            ))
        return documents

    # Process each header and its associated content
    for header in all_headers:
        current_metadata = {}
        header_name = header.name
        header_text = header.get_text(separator=' ', strip=True)
        current_metadata[header_mapping[header_name]] = header_text

        # Collect all sibling elements until the next header of the same or higher level
        content_elements = []
        for sibling in header.find_next_siblings():
            if sibling.name in headers_to_split_on:
                # Stop at the next header
                break
            if isinstance(sibling, bs4.Tag):
                content_elements.append(sibling)

        # Get the text content of the collected elements
        current_content = ''
        for elem in content_elements:
            text = elem.get_text(separator=' ', strip=True)
            if text:
                current_content += text + ' '

        # Create a Document if there is content
        if current_content.strip():
            documents.append(Document(
                page_content=current_content.strip(),
                metadata=current_metadata.copy()
            ))
        else:
            # If there's no content, but we have metadata, still create a Document
            documents.append(Document(
                page_content='',
                metadata=current_metadata.copy()
            ))

    return documents
```

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
System Info
Notebook with latest langchain
Who can help?
No response
Information
Related Components
Reproduction
Trying the HTMLHeaderTextSplitter in a notebook.
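A minimal sketch of the reproduction, assuming a local HTML file large enough to trip the limit (the header mapping and file name are illustrative; the import path may differ between LangChain versions):

```python
from langchain.text_splitter import HTMLHeaderTextSplitter

# Map header tags to the metadata keys they should populate (illustrative choices).
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# "large_page.html" is a placeholder for a sufficiently large HTML document.
# With the lxml/XSLT-based implementation this raises the maxHead error
# instead of returning per-section Documents with header metadata.
docs = splitter.split_text_from_file("large_page.html")
```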