-
-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🚚 Loaders: URL loader to quickly load content from web pages #37
Conversation
@alchaplinsky Could you please add the info about this Loader to the README? |
This is really cool @alchaplinsky! I see no major code changes that are needed, but instead of "url" do you think we could make this You can leave the remainder of your code the same (i.e. you downloading the file) and I can pull that functionality out of the loader when I implement the downloading functionality. |
Hey @rickychilcott, thanks for your review on this one! The idea is that a URL can return various types of content (HTML, images, pdf, json, audio video, etc.). And my intention was to build a URL Loader that would automatically apply proper parser/decoder based on mime type and return response data. So that users of the library can just use it instead of deciding between a dozed on different loaders that fetch data by URL. BTW, this might be something you're planning to work on, so let's align our efforts to avoid double work. WDYT? |
@alchaplinsky Would it make sense to name the class |
Yeah, I agree @andreibondarev. As lovely as "one loader to rule them all" sounds @alchaplinsky, that might only be possible by splitting it into their own unique loaders As a quick follow up PR, I'd like to have the framework handle downloading of urls -- so you can add "paths" as file paths or urls and the framework will just handle the loader a That ok for you @alchaplinsky? |
lib/loaders/url.rb
Outdated
module Loaders | ||
class URL < Base | ||
# We only look for headings and paragraphs | ||
TEXT_CONTENT_TAGS = %w[h1 h2 h3 h4 h5 h6 p] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about <span>
tags?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a ton more fringe cases where people use <article>
(other HTML5 tags) to add content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But again -- perhaps it's a future near-future concern not an immediate one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking about this. And for now I think the above is optimal.
span
is too granular. I tried having them and I was just getting a lot of clutter (random words that were just wrapped into span on the page for whatever reason.
article
(and similar) are superior to paragraph. So, if the page markup is semantic, then article should contain paragraphs and we don't need to query articles
we just focus on the content. If the markup is not good and tags are used randomly, then we'll loose some of the info. But I think it is better for now to have less important information than all of the text from the page which is just random bits squashed together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all gets quite interesting when you really dig in. It hasn't been updated for a while, but https://github.com/cantino/ruby-readability might be the ticket to just work at an even higher level than nokogiri directly and is going to do a better job than just pulling inner html on a select set of tags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rickychilcott Nice fine, we should evaluate this gem! It looks like it's covering a ton of edge cases that would take a long time to develop on our own.
@alchaplinsky @rickychilcott How about we rename this to |
Agree on To your point @rickychilcott that's exactly what I was thinking - having unique modules responsible for different types of content. However, I think we'll need to have a bit more sophisticated structure that just individual loaders. The way I see it (just giving some context so that we can discuss it later): Goal is to have an ability to load different types of data from different sources to use it for training a model, putting it into vector database, etc. So I see 3 layers here:
That's high level. There's of course a lot of other use cases and nuances. Let's discuss together as @andreibondarev suggested. |
Let's keep talking but how you laid it out, makes it clear to me. Should we consider renaming Loaders to Parsers? |
Great job @alchaplinsky! |
Scope
Basic URL loader for loading content from web pages.
Good as a starting point, but in general web loader should be more robust and handle different edge cases with content loading.
Changes