Added a MHTML document loader #6311
    "source": self.file_path,
    "title": title,
}
return [Document(page_content=text, metadata=metadata)]
Could there be more than one HTML part, or will there always be at most one?
I don't think so. According to the spec:
HTML [RFC 1866] defines a powerful means of specifying multimedia
documents. These multimedia documents consist of a text/html root
resource (object) and other subsidiary resources (image, video clip,
applet, etc. objects) referenced by Uniform Resource Identifiers
(URIs) within the text/html root resource. When an HTML multimedia
document is retrieved by a browser, each of these component resources
is individually retrieved in real time from a location, and using a
protocol, specified by each URI.
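To check a given file against the question above, one option is to list the MIME content type of every part. The following is a minimal sketch using only Python's standard `email` package, not the code from this PR. `SAMPLE_MHTML` and `list_part_types` are hypothetical names made up for illustration, and the sample is hand-written rather than a real browser export.

```python
import email
from email import policy

# Hand-written stand-in for a saved page: one text/html root resource
# plus one subsidiary image resource (hypothetical sample data).
SAMPLE_MHTML = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="----sketch-boundary"; type="text/html"

------sketch-boundary
Content-Type: text/html; charset="utf-8"
Content-Location: https://example.com/

<html><head><title>Example</title></head><body><p>Hello</p></body></html>
------sketch-boundary
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Location: https://example.com/logo.png

iVBORw0KGgo=
------sketch-boundary--
"""


def list_part_types(raw: str) -> list[str]:
    """Return the content type of every leaf (non-multipart) MIME part."""
    message = email.message_from_string(raw, policy=policy.default)
    return [
        part.get_content_type()
        for part in message.walk()
        if not part.is_multipart()
    ]


part_types = list_part_types(SAMPLE_MHTML)
html_count = part_types.count("text/html")
print(part_types, html_count)
```

Counting the `text/html` entries this way shows whether a particular file carries more than one HTML part. It does not settle what every producer of MHTML files emits.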
Thanks for contributing! Please add an example notebook; you can see other example notebooks for loaders here as a guide:
I fixed the lint errors. Can you please add a short notebook with example usage? Then we can deploy.
753f15a to 3797fe8
MHTML is an interesting format because it is used both for emails and for archived web pages. Some scraping projects store pages on disk to process them later, and MHTML is a good fit for that use case. This loader is heavily inspired by the BeautifulSoup HTML loader, but it first extracts the HTML part from the MHTML file.
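The extraction step described above can be sketched roughly as follows. This is not the loader from this PR: it swaps BeautifulSoup for the standard-library `html.parser` so that it runs without extra dependencies, and it returns a plain dict instead of a LangChain `Document`. The names `load_mhtml`, `_TextAndTitle`, and `SAMPLE` are assumptions made up for illustration.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextAndTitle(HTMLParser):
    """Collect the <title> and text nodes (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Note: this naive version also keeps <script>/<style> text.
        if self._in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def load_mhtml(raw: str, source: str) -> dict:
    """Parse an MHTML string and return text plus metadata for its first HTML part."""
    message = email.message_from_string(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            parser = _TextAndTitle()
            parser.feed(part.get_content())  # decoded to str via the part's charset
            parser.close()
            return {
                "page_content": "\n".join(parser.chunks),
                "metadata": {"source": source, "title": parser.title.strip()},
            }
    # No HTML part found: return an empty result rather than raising.
    return {"page_content": "", "metadata": {"source": source, "title": ""}}


# Hypothetical hand-written sample with a single HTML part.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="----sketch"

------sketch
Content-Type: text/html; charset="utf-8"

<html><head><title>Saved page</title></head><body><h1>Hello</h1><p>World</p></body></html>
------sketch--
"""

doc = load_mhtml(SAMPLE, "page.mhtml")
print(doc)
```

Stopping at the first `text/html` part mirrors the assumption discussed in the review thread that a saved page has a single HTML root resource.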
3797fe8 to 3bd124e
I added a notebook. Merging this now!