🚚 Loaders: URL loader to quickly load content from web pages #37

alchaplinsky · 2023-05-20T19:39:27Z

Scope

Basic URL loader for loading content from web pages.
Good as a starting point, but in general a web loader should be more robust and handle the various edge cases of content loading.

Changes

Add nokogiri as a dependency
Add URL loader
Add specs

andreibondarev · 2023-05-20T20:53:40Z

@alchaplinsky Could you please add the info about this Loader to the README?

rickychilcott · 2023-05-20T22:58:15Z

This is really cool @alchaplinsky!

I see no major code changes that are needed, but instead of "url" do you think we could make this Loaders::Webpage? The reason being that I intend to allow the framework to read a url, so that the framework can handle downloading on behalf of the user.

You can leave the remainder of your code the same (i.e. you downloading the file) and I can pull that functionality out of the loader when I implement the downloading functionality.

alchaplinsky · 2023-05-21T09:57:43Z

Hey @rickychilcott, thanks for your review on this one!
I'm not really attached to the naming of this particular class. However, I might've done a poor job of describing the intention behind this PR and the loader it introduces, since this is just the first step and it only supports parsing HTML pages for now.

The idea is that a URL can return various types of content (HTML, images, PDF, JSON, audio, video, etc.). My intention was to build a URL loader that would automatically apply the proper parser/decoder based on the MIME type and return the response data, so that users of the library can just use it instead of deciding between a dozen different loaders that fetch data by URL.

BTW, this might be something you're planning to work on, so let's align our efforts to avoid double work. WDYT?
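
To make the MIME-type idea above a bit more concrete, here is a minimal sketch of dispatching on the Content-Type header. The PARSERS registry and parse_by_mime_type are hypothetical names, not part of this PR, and only two content types are covered for illustration.

```ruby
require "json"

# Hypothetical registry mapping MIME types to parser callables.
# None of these names exist in the PR; they only illustrate the dispatch idea.
PARSERS = {
  "application/json" => ->(body) { JSON.parse(body) },
  "text/plain" => ->(body) { body.strip }
}.freeze

# Pick a parser from a Content-Type header value such as
# "application/json; charset=utf-8" and apply it to the response body.
def parse_by_mime_type(content_type, body)
  # Drop parameters like "; charset=utf-8" and normalize the case.
  mime_type = content_type.to_s.split(";").first.to_s.strip.downcase
  parser = PARSERS.fetch(mime_type) do
    raise ArgumentError, "no parser registered for #{mime_type.inspect}"
  end
  parser.call(body)
end
```

Supporting a new content type would then mean registering one more entry instead of adding a separate loader that callers have to choose between.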

andreibondarev · 2023-05-21T13:05:09Z

@alchaplinsky Would it make sense to name the class Loaders::HTML then? I don't have any strong feeling one way or the other. It seems like we've got 3 concepts emerging: loaders, parsers and maybe chunkers.

rickychilcott · 2023-05-21T14:56:51Z

Yeah, I agree @andreibondarev. Loaders::HTML might be the right path.

As lovely as "one loader to rule them all" sounds @alchaplinsky, that might only be possible by splitting it into separate loaders (Loaders::Image, Loaders::JSON, Loaders::Audio, etc.) because each will need a different type of processing.

As a quick follow-up PR, I'd like to have the framework handle downloading of URLs, so you can add "paths" as file paths or URLs and the framework will just hand the loader a Pathname, which will be either a file on the server/home location or in /tmp.

That ok for you @alchaplinsky?
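
A rough sketch of what "the framework hands the loader a Pathname" could look like, assuming plain HTTP(S) downloads into the system temp directory. resolve_to_pathname and DEFAULT_FETCHER are hypothetical names, and the injectable fetcher is an assumption made so the download step can be stubbed.

```ruby
require "net/http"
require "pathname"
require "tempfile"
require "uri"

# Hypothetical default download step; any callable taking a URI works.
DEFAULT_FETCHER = ->(uri) { Net::HTTP.get(uri) }

# Hypothetical helper: turn a file path or an http(s) URL into a Pathname.
def resolve_to_pathname(source, fetcher: DEFAULT_FETCHER)
  uri = begin
    URI.parse(source.to_s)
  rescue URI::InvalidURIError
    nil
  end

  # Anything that is not an http(s) URL is treated as a local path.
  return Pathname.new(source.to_s) unless uri.is_a?(URI::HTTP)

  # Downloaded content is written to the system temp directory,
  # keeping the original extension so a loader can still be chosen by it.
  extension = File.extname(uri.path.to_s)
  file = Tempfile.create(["download", extension])
  file.binmode
  file.write(fetcher.call(uri))
  file.close
  Pathname.new(file.path)
end
```

With this shape the loaders themselves never need to know whether the content started out as a URL. The caller is responsible for deleting the downloaded file when done.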

andreibondarev · 2023-05-21T15:02:55Z

lib/loaders/url.rb

+module Loaders
+  class URL < Base
+    # We only look for headings and paragraphs
+    TEXT_CONTENT_TAGS = %w[h1 h2 h3 h4 h5 h6 p]


What about <span> tags?

There may be a ton more fringe cases where people use <article> (other HTML5 tags) to add content.

But again, perhaps it's a near-future concern, not an immediate one.

I've been thinking about this. And for now I think the above is optimal.
span is too granular. I tried including them and I was just getting a lot of clutter (random words that were wrapped in a span on the page for whatever reason).
article (and similar tags) sit a level above paragraphs. So, if the page markup is semantic, an article should contain paragraphs, and we don't need to query articles; we just focus on the content. If the markup is poor and tags are used randomly, then we'll lose some of the info. But I think it is better for now to capture less information than to capture all of the text from the page as random bits squashed together.

This all gets quite interesting when you really dig in. It hasn't been updated for a while, but https://github.com/cantino/ruby-readability might be the ticket: it works at a higher level than using nokogiri directly and is going to do a better job than just pulling the inner HTML of a select set of tags.

@rickychilcott Nice find, we should evaluate this gem! It looks like it's covering a ton of edge cases that would take a long time to develop on our own.

andreibondarev · 2023-05-21T15:06:29Z

@alchaplinsky @rickychilcott How about we rename this to Loaders::HTML for now and merge? Then I'd propose that we get together and discuss the next iteration. Let's look at it through an Agile lens 🤓

alchaplinsky · 2023-05-21T15:44:57Z

Agree on Loaders::HTML for this one, as that is basically what it does so far. I think we'll need to refactor and move things around anyway.

To your point @rickychilcott, that's exactly what I was thinking: having unique modules responsible for different types of content. However, I think we'll need a somewhat more sophisticated structure than just individual loaders.

The way I see it (just giving some context so that we can discuss it later): the goal is to be able to load different types of data from different sources and use it for training a model, putting it into a vector database, etc.

So I see 3 layers here:

Ingesting layer: Data can be loaded from a local file (that's how current pdf, docx loaders are set up) or from a URL. It can also be streamed. So we need different types of "Ingest" modules.
Parsing layer: Based on file extension or a mime type we use different "parsers" to extract text from html, parse json, parse a PDF, etc.
Storage layer: It is a good idea to put data into a tmp file (definitely useful for streaming), but putting it into memory might be the more common option. If you use langchainrb within a Rails app hosted on Heroku, where you don't have access to file storage, and you don't want to bother uploading to S3, you would just load it into memory. So this part should also be flexible to the user's needs.

That's high level. There's of course a lot of other use cases and nuances. Let's discuss together as @andreibondarev suggested.
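
As a starting point for that discussion, here is one way the three layers could be wired together. Every class, constant, and method name below is hypothetical and nothing here exists in the PR. It only covers a local-file source and two MIME types, and is meant as a sketch rather than a proposed design.

```ruby
require "json"
require "stringio"
require "tempfile"

# Ingesting layer (hypothetical): a source returns raw content plus a MIME type.
class FileSource
  MIME_TYPES = { ".json" => "application/json", ".txt" => "text/plain" }.freeze

  def initialize(path)
    @path = path
  end

  def fetch
    mime = MIME_TYPES.fetch(File.extname(@path), "application/octet-stream")
    [File.read(@path), mime]
  end
end

# Parsing layer (hypothetical): one callable per MIME type, returning text.
PARSERS_BY_MIME = {
  "application/json" => ->(raw) { JSON.pretty_generate(JSON.parse(raw)) },
  "text/plain" => ->(raw) { raw }
}.freeze

# Storage layer (hypothetical): keep the text in memory or in a temp file.
class MemoryStore
  def write(text)
    StringIO.new(text)
  end
end

class TempfileStore
  def write(text)
    file = Tempfile.new("loaded")
    file.write(text)
    file.rewind
    file
  end
end

# Wire the layers together; each one can be swapped independently.
def load_data(source, store: MemoryStore.new)
  raw, mime = source.fetch
  parser = PARSERS_BY_MIME.fetch(mime) do
    raise ArgumentError, "unsupported type: #{mime}"
  end
  store.write(parser.call(raw))
end
```

Both stores return an IO-like object, so calling code can read the result the same way whether it lives in memory or on disk. A URL or streaming source would only need to respond to fetch.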

rickychilcott · 2023-05-21T17:29:09Z

Let's keep talking, but the way you laid it out makes it clear to me. Should we consider renaming Loaders to Parsers?

andreibondarev · 2023-05-21T23:54:42Z

Great job @alchaplinsky!

alchaplinsky added 2 commits May 20, 2023 21:35

Add url loader to quickly load content from web pages

142c560

Stylistic changes

58cb70d

alchaplinsky force-pushed the main branch from 7780414 to 58cb70d May 20, 2023 19:49

Mitigate security risk with URI.open

31d5190

Add URL loader to README.md

a79ff62

andreibondarev reviewed May 21, 2023

Rename URL loader to HTML

bf38c20

andreibondarev merged commit 1c2165c into patterns-ai-core:main May 21, 2023
