Added a MHTML document loader #6311

Merged
merged 3 commits into from
Jun 25, 2023

masylum (Contributor) commented Jun 16, 2023

MHTML is an interesting format because it is used both for emails and for archived web pages. Some scraping projects store pages on disk to process them later, and MHTML is a good fit for that use case.

This loader is heavily inspired by the BeautifulSoup HTML loader; the difference is that it first extracts the HTML part from the MHTML file.
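
To make the idea above concrete, here is a minimal sketch of pulling the HTML part out of an MHTML document and turning it into text plus metadata. It is not the code from this PR: it uses only the Python standard library (`email` and `html.parser`) instead of BeautifulSoup, and the sample document, the `load_mhtml_text` function, and the `_TextExtractor` class are hypothetical names made up for illustration.

```python
import email
from email import policy
from html.parser import HTMLParser
from typing import Optional

# Tiny hand-written MHTML sample (hypothetical content, for illustration only).
SAMPLE_MHTML = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/html; charset="utf-8"
Content-Location: https://example.com/

<html><head><title>Example page</title></head>
<body><h1>Hello</h1><p>Archived text.</p></body></html>
--BOUNDARY
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Location: https://example.com/pixel.png

iVBORw0KGgo=
--BOUNDARY--
"""


class _TextExtractor(HTMLParser):
    """Collects the <title> text and all other non-blank text nodes."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def load_mhtml_text(raw: str, source: str) -> Optional[dict]:
    """Return text and metadata for the first text/html part, or None."""
    message = email.message_from_string(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            parser = _TextExtractor()
            parser.feed(part.get_content())  # decoded HTML as str
            parser.close()
            return {
                "page_content": "\n".join(parser.chunks),
                "metadata": {"source": source, "title": parser.title},
            }
    return None


doc = load_mhtml_text(SAMPLE_MHTML, source="example.mht")
print(doc)
```

The sketch stops at the first `text/html` part it finds, which is an assumption about how such a loader might behave rather than a statement about the merged implementation.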

"source": self.file_path,
"title": title,
}
return [Document(page_content=text, metadata=metadata)]

Contributor:

Could there be more than one HTML part, or will there always be at most one?

Contributor Author:

I don't think so. According to the spec:

HTML [RFC 1866] defines a powerful means of specifying multimedia
documents. These multimedia documents consist of a text/html root
resource (object) and other subsidiary resources (image, video clip,
applet, etc. objects) referenced by Uniform Resource Identifiers
(URIs) within the text/html root resource. When an HTML multimedia
document is retrieved by a browser, each of these component resources
is individually retrieved in real time from a location, and using a
protocol, specified by each URI.

https://datatracker.ietf.org/doc/html/rfc2557
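
Whether a particular archive contains one `text/html` part or several can also be checked empirically. The following is a small sketch using Python's standard `email` module; the `html_locations` helper and the sample document are hypothetical, and the sample was written by hand to contain two HTML parts purely to show what the check reports. It makes no claim about what any specific browser or tool actually emits.

```python
import email
from email import policy

# Hand-written sample with two text/html parts (hypothetical, for illustration).
TWO_HTML_PARTS = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/html; charset="utf-8"
Content-Location: https://example.com/

<html><body><iframe src="https://example.com/frame.html"></iframe></body></html>
--BOUNDARY
Content-Type: text/html; charset="utf-8"
Content-Location: https://example.com/frame.html

<html><body><p>Framed content</p></body></html>
--BOUNDARY--
"""


def html_locations(raw: str) -> list:
    """List the Content-Location of every text/html part, in document order."""
    message = email.message_from_string(raw, policy=policy.default)
    return [
        str(part["Content-Location"])
        for part in message.walk()
        if part.get_content_type() == "text/html"
    ]


locations = html_locations(TWO_HTML_PARTS)
print(len(locations), locations)
```

Running a check like this over a few real `.mht` files would show whether a loader that returns a single document is dropping any HTML parts.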

rlancemartin (Collaborator) commented:

Thanks for contributing! Please add an example notebook; you can see other example notebooks for loaders here as a guide:

https://github.com/hwchase17/langchain/tree/master/docs/extras/modules/data_connection/document_loaders/integrations

rlancemartin (Collaborator) commented Jun 20, 2023

I fixed the lint errors.

Can you please add a short notebook with example usage? Then we can deploy.

dev2049 added the labels "03 enhancement" (Enhancement of existing functionality) and "Ɑ: doc loader" (Related to document loader module, not documentation) on Jun 21, 2023
masylum and others added 3 commits June 25, 2023 12:51

rlancemartin (Collaborator) commented:

I added a notebook. Merging this now!

rlancemartin merged commit 87802c8 into langchain-ai:master on Jun 25, 2023