Added a MHTML document loader #6311

masylum · 2023-06-16T20:41:53Z

MHTML is a very interesting format since it's used both for emails and for archived webpages. Some scraping projects want to store pages on disk to process them later, and MHTML is perfect for that use case.

This is heavily inspired by the BeautifulSoup HTML loader, but it extracts the HTML part from the MHTML file.
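To make the description concrete, here is a minimal sketch of the idea, not the code from this PR. It assumes the MHTML file can be parsed as a MIME message with Python's standard `email` package, and it uses the standard `html.parser` as a stand-in for BeautifulSoup so the example has no third-party dependencies. The names `TitleAndText`, `load_mhtml`, and `SAMPLE` are hypothetical.

```python
import email
from email import policy
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class TitleAndText(HTMLParser):
    """Hypothetical helper: collects the <title> text and other text chunks."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.chunks: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            # Note: this also keeps <script>/<style> text; a real loader
            # would probably want to filter those out.
            self.chunks.append(data.strip())


def load_mhtml(mhtml_text: str, source: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (text, metadata) for the first text/html part, or None."""
    message = email.message_from_string(mhtml_text, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() != "text/html":
            continue
        parser = TitleAndText()
        # get_content() undoes the transfer encoding and charset for us.
        parser.feed(part.get_content())
        parser.close()
        metadata = {"source": source, "title": parser.title.strip()}
        return " ".join(parser.chunks), metadata
    return None


# A tiny hand-written MHTML document, used only for illustration.
SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="b1"; type="text/html"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "Content-Location: https://example.com/\n"
    "\n"
    "<html><head><title>Hi</title></head>\n"
    "<body><p>Hello caf=C3=A9</p></body></html>\n"
    "--b1--\n"
)

result = load_mhtml(SAMPLE, "page.mhtml")
```

The `source` and `title` keys mirror the metadata fields visible in the diff under review; everything else is an assumption made for the sketch.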

vercel · 2023-06-16T20:41:55Z

@masylum is attempting to deploy a commit to the LangChain Team on Vercel.

A member of the Team first needs to authorize it.

dev2049 · 2023-06-16T23:06:41Z

langchain/document_loaders/mhtml.py

+                        "source": self.file_path,
+                        "title": title,
+                    }
+                    return [Document(page_content=text, metadata=metadata)]


Could there be more than one HTML part, or will there always be at most one?

masylum · 2023-06-17

I don't think so. According to the spec:

HTML [RFC 1866] defines a powerful means of specifying multimedia
documents. These multimedia documents consist of a text/html root
resource (object) and other subsidiary resources (image, video clip,
applet, etc. objects) referenced by Uniform Resource Identifiers
(URIs) within the text/html root resource. When an HTML multimedia
document is retrieved by a browser, each of these component resources
is individually retrieved in real time from a location, and using a
protocol, specified by each URI.

https://datatracker.ietf.org/doc/html/rfc2557
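The thread does not settle the question with a test, so one way to check a particular file is to list every `text/html` part it contains. The following is a rough sketch under the assumption that the file parses as a MIME message with Python's standard `email` package; `list_html_parts` and `SAMPLE` are hypothetical names, and the sample document is hand-written to contain two HTML parts on purpose.

```python
import email
from email import policy
from typing import List, Optional, Tuple


def list_html_parts(mhtml_text: str) -> List[Tuple[Optional[str], int]]:
    """Return (Content-Location, decoded length) for every text/html part."""
    message = email.message_from_string(mhtml_text, policy=policy.default)
    found: List[Tuple[Optional[str], int]] = []
    for part in message.walk():
        if part.get_content_type() == "text/html":
            location = part["Content-Location"]
            found.append(
                (
                    str(location) if location is not None else None,
                    len(part.get_content()),
                )
            )
    return found


# Hand-written sample with two text/html parts and one stylesheet part.
SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="b1"; type="text/html"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Location: https://example.com/\n"
    "\n"
    "<html><body><p>root page</p></body></html>\n"
    "--b1\n"
    "Content-Type: text/css\n"
    "Content-Location: https://example.com/style.css\n"
    "\n"
    "p { color: red; }\n"
    "--b1\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Location: https://example.com/frame\n"
    "\n"
    "<html><body><p>embedded frame</p></body></html>\n"
    "--b1--\n"
)

parts = list_html_parts(SAMPLE)
```

If a real file reports more than one entry, a loader that returns after the first `text/html` part would only see the first of them.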

rlancemartin · 2023-06-19T06:20:42Z

Thanks for contributing! Please add an example notebook; you can see other example notebooks for loaders here as a guide:

https://github.com/hwchase17/langchain/tree/master/docs/extras/modules/data_connection/document_loaders/integrations

rlancemartin · 2023-06-20T16:43:39Z

I fixed the Lint errors.

Can you please add a short notebook w/ example usage? Then, we can deploy.

vercel · 2023-06-21T02:40:39Z

The latest updates on your projects.

1 Ignored Deployment

Name	Status	Preview	Comments	Updated (UTC)
langchain	⬜️ Ignored (Inspect)			Jun 25, 2023 8:05pm


rlancemartin · 2023-06-25T20:12:05Z

I added a notebook. Merging this now!

dev2049 reviewed Jun 16, 2023


hwchase17 assigned rlancemartin Jun 19, 2023

rlancemartin force-pushed the mhtml-document-loader branch from 753f15a to 3797fe8 on June 20, 2023 18:49

dev2049 added the labels "03 enhancement" (Enhancement of existing functionality) and "Ɑ: doc loader" (Related to document loader module, not documentation) on Jun 21, 2023

masylum and others added 3 commits June 25, 2023 12:51

Format (0195b86)

Add ntbk (3bd124e)

rlancemartin force-pushed the mhtml-document-loader branch from 3797fe8 to 3bd124e on June 25, 2023 20:05

rlancemartin merged commit 87802c8 into langchain-ai:master Jun 25, 2023

This was referenced Jun 25, 2023

Zep Authentication #6725 (Closed)

Zep Authentication #6728 (Merged)
