Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚚 Loaders: URL loader to quickly load content from web pages #37

Merged
merged 5 commits into from
May 21, 2023

Conversation

alchaplinsky
Copy link
Contributor

Scope

Basic URL loader for loading content from web pages.
Good as a starting point, but in general web loader should be more robust and handle different edge cases with content loading.

Changes

  • Add nokogiri as a dependency
  • Add URL loader
  • Add specs

@andreibondarev
Copy link
Collaborator

@alchaplinsky Could you please add the info about this Loader to the README?

@rickychilcott
Copy link
Contributor

This is really cool @alchaplinsky!

I see no major code changes that are needed, but instead of "url" do you think we could make this Loaders::Webpage? The reason being that I intend to allow the framework to read a url, so that the framework can handle downloading on behalf of the user.

You can leave the remainder of your code the same (i.e. you downloading the file) and I can pull that functionality out of the loader when I implement the downloading functionality.

@alchaplinsky
Copy link
Contributor Author

Hey @rickychilcott, thanks for your review on this one!
I'm not really attached to the naming of this particular class. However, I might've done a poor job of describing the intention behind this PR and the loader introduced. Since this is just the first step and it only supports parsing HTML pages for now.

The idea is that a URL can return various types of content (HTML, images, pdf, json, audio video, etc.). And my intention was to build a URL Loader that would automatically apply proper parser/decoder based on mime type and return response data. So that users of the library can just use it instead of deciding between a dozed on different loaders that fetch data by URL.

BTW, this might be something you're planning to work on, so let's align our efforts to avoid double work. WDYT?

@andreibondarev
Copy link
Collaborator

andreibondarev commented May 21, 2023

@alchaplinsky Would it make sense to name the class Loaders::HTML then? I don't have any strong feeling one way or the other. It seems like we've got 3 concepts emerging: loaders, parsers and maybe chunkers.

@rickychilcott
Copy link
Contributor

Yeah, I agree @andreibondarev. Loaders::HTML might be the right path.

As lovely as "one loader to rule them all" sounds @alchaplinsky, that might only be possible by splitting it into their own unique loaders Loaders::Image, Loaders::JSON, Loaders::Audio etc. because they will need different types of processing.

As a quick follow up PR, I'd like to have the framework handle downloading of urls -- so you can add "paths" as file paths or urls and the framework will just handle the loader a Pathname which will be either a file on the server/home location or /tmp.

That ok for you @alchaplinsky?

module Loaders
class URL < Base
# We only look for headings and paragraphs
TEXT_CONTENT_TAGS = %w[h1 h2 h3 h4 h5 h6 p]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about <span> tags?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a ton more fringe cases where people use <article> (other HTML5 tags) to add content.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But again -- perhaps it's a future near-future concern not an immediate one.

Copy link
Contributor Author

@alchaplinsky alchaplinsky May 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this. And for now I think the above is optimal.
span is too granular. I tried having them and I was just getting a lot of clutter (random words that were just wrapped into span on the page for whatever reason.
article (and similar) are superior to paragraph. So, if the page markup is semantic, then article should contain paragraphs and we don't need to query articles we just focus on the content. If the markup is not good and tags are used randomly, then we'll loose some of the info. But I think it is better for now to have less important information than all of the text from the page which is just random bits squashed together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all gets quite interesting when you really dig in. It hasn't been updated for a while, but https://github.com/cantino/ruby-readability might be the ticket to just work at an even higher level than nokogiri directly and is going to do a better job than just pulling inner html on a select set of tags.

Copy link
Collaborator

@andreibondarev andreibondarev May 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rickychilcott Nice fine, we should evaluate this gem! It looks like it's covering a ton of edge cases that would take a long time to develop on our own.

@andreibondarev
Copy link
Collaborator

@alchaplinsky @rickychilcott How about we rename this to Loaders::HTML for now and merge, and then I'd actually propose that we get together and discuss the next iteration. Let's use an Agile lens to look at it from 🤓

@alchaplinsky
Copy link
Contributor Author

Agree on Loaders::HTML for this one as it is basically what it is doing so far. I think we'll anyway need to refactor and move things around.

To your point @rickychilcott that's exactly what I was thinking - having unique modules responsible for different types of content. However, I think we'll need to have a bit more sophisticated structure that just individual loaders.

The way I see it (just giving some context so that we can discuss it later): Goal is to have an ability to load different types of data from different sources to use it for training a model, putting it into vector database, etc.

So I see 3 layers here:

  • Ingesting layer: Data can be loaded from a local file (that's how current pdf, docx loaders are set up) or from a URL. It can also be streamed. So we need different types of "Ingest" modules.

  • Parsing layer: Based on file extension or a mime type we use different "parsers" to extract text from html, parse json, parse a PDF, etc.

  • Storage layer: It is a good idea to put data into some tmp file (would be definitely useful for streaming), however putting it into memory might be more common option. If you use langchainrb within your rails app hosted on heroku and you don't have access to file storage and you don't want to bother uploading it to S3 you would just load it to memory. So, this part should also be flexible to user's needs.

That's high level. There's of course a lot of other use cases and nuances. Let's discuss together as @andreibondarev suggested.

@rickychilcott
Copy link
Contributor

Agree on Loaders::HTML for this one as it is basically what it is doing so far. I think we'll anyway need to refactor and move things around.

To your point @rickychilcott that's exactly what I was thinking - having unique modules responsible for different types of content. However, I think we'll need to have a bit more sophisticated structure that just individual loaders.

The way I see it (just giving some context so that we can discuss it later): Goal is to have an ability to load different types of data from different sources to use it for training a model, putting it into vector database, etc.

So I see 3 layers here:

  • Ingesting layer: Data can be loaded from a local file (that's how current pdf, docx loaders are set up) or from a URL. It can also be streamed. So we need different types of "Ingest" modules.
  • Parsing layer: Based on file extension or a mime type we use different "parsers" to extract text from html, parse json, parse a PDF, etc.
  • Storage layer: It is a good idea to put data into some tmp file (would be definitely useful for streaming), however putting it into memory might be more common option. If you use langchainrb within your rails app hosted on heroku and you don't have access to file storage and you don't want to bother uploading it to S3 you would just load it to memory. So, this part should also be flexible to user's needs.

That's high level. There's of course a lot of other use cases and nuances. Let's discuss together as @andreibondarev suggested.

Let's keep talking but how you laid it out, makes it clear to me. Should we consider renaming Loaders to Parsers?

@andreibondarev andreibondarev merged commit 1c2165c into patterns-ai-core:main May 21, 2023
@andreibondarev
Copy link
Collaborator

Great job @alchaplinsky!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants