🚚 Loaders: URL loader to quickly load content from web pages #37
Merged
Changes from 4 commits
Commits (5):
142c560 Add url loader to quickly load content from web pages (alchaplinsky)
58cb70d Stylistic changes (alchaplinsky)
31d5190 Mitigate security risk with URI.open (alchaplinsky)
a79ff62 Add URL loader to README.md (alchaplinsky)
bf38c20 Rename URL loader to HTML (alchaplinsky)
@@ -0,0 +1,38 @@
# frozen_string_literal: true

require "open-uri"

module Loaders
  class URL < Base
    # We only look for headings and paragraphs
    TEXT_CONTENT_TAGS = %w[h1 h2 h3 h4 h5 h6 p]

    #
    # This Loader parses URL into a text.
    # If you'd like to use it directly you can do so like this:
    # Loaders::URL.new("https://nokogiri.org/").load
    #
    def initialize(url)
      depends_on "nokogiri"
      require "nokogiri"

      @url = url
    end

    # Check that url is a valid URL
    def loadable?
      !!(@url =~ URI::DEFAULT_PARSER.make_regexp)
    end

    def load
      return unless response.status.first == "200"

      doc = Nokogiri::HTML(response.read)
      doc.css(TEXT_CONTENT_TAGS.join(",")).map(&:inner_text).join("\n\n")
    end

    def response
      @response ||= URI.parse(@url).open
    end
  end
end
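A note on `loadable?`: as far as I can tell, the pattern returned by `URI::DEFAULT_PARSER.make_regexp` is meant for finding URIs inside text, so it is not anchored and it accepts any scheme. If the loader should only accept whole HTTP(S) URLs, a stricter check might look like the sketch below. `http_url?` is a hypothetical helper used for illustration only; it is not part of this PR.

```ruby
require "uri"

# Hypothetical helper (an assumption, not code from this PR).
# Returns true only when the whole string parses as an http/https URL
# with a non-empty host.
def http_url?(candidate)
  uri = URI.parse(candidate)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  # Strings such as "invalid url" cannot be parsed at all.
  false
end

puts http_url?("https://www.example.com")
puts http_url?("invalid url")
puts http_url?("ftp://example.com/file.txt")
```

Parsing the whole string rather than matching a pattern inside it also rejects schemes such as `file:` or `ftp:`, which seems in line with the intent of the "Mitigate security risk with URI.open" commit.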
@@ -0,0 +1,45 @@
# frozen_string_literal: true

RSpec.describe Loaders::URL do
  let(:url) { "https://www.example.com" }
  let(:status) { ["200", "OK"] }
  let(:body) { "<html><body><h1>Lorem Ipsum</h1><p>Dolor sit amet.</p></body></html>" }
  let(:response) { double("response", status: status, read: body) }

  before do
    allow(URI).to receive(:parse).and_return(double(open: response))
  end

  describe "#load" do
    subject { described_class.new(url).load }

    context "successful response" do
      it "loads url" do
        expect(subject).to eq("Lorem Ipsum\n\nDolor sit amet.")
      end
    end

    context "error response" do
      let(:status) { ["404", "Not Found"] }
      let(:body) { "<html><body><h1>Not Found</h1></body></html>" }

      it "loads url" do
        expect(subject).to eq(nil)
      end
    end
  end

  describe "#loadable?" do
    subject { described_class.new(url).loadable? }

    context "with valid url" do
      it { is_expected.to be_truthy }
    end

    context "with invalid url" do
      let(:url) { "invalid url" }

      it { is_expected.to be_falsey }
    end
  end
end
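The "error response" context stubs a response object whose status is 404. With a real request, open-uri raises `OpenURI::HTTPError` for 4xx/5xx responses instead of returning a response, so a caller may also want to rescue that error. A minimal sketch, assuming a hypothetical `fetch_body` helper with an injectable `opener` (neither is part of this PR), so the behaviour can be exercised without network access:

```ruby
require "open-uri"
require "stringio"

# Default opener mirrors the PR's `URI.parse(@url).open` call.
DEFAULT_OPENER = ->(url) { URI.parse(url).open }

# Hypothetical helper (an assumption, not code from this PR).
# Returns the response body, or nil when open-uri reports an HTTP error.
def fetch_body(url, opener: DEFAULT_OPENER)
  opener.call(url).read
rescue OpenURI::HTTPError
  # open-uri raises for 4xx/5xx instead of handing back a response.
  nil
end

# Stand-ins for real responses, so no request is made here.
ok_opener = ->(_url) { StringIO.new("<p>Hello</p>") }
not_found_opener = lambda do |_url|
  raise OpenURI::HTTPError.new("404 Not Found", StringIO.new)
end

p fetch_body("https://www.example.com", opener: ok_opener)
p fetch_body("https://www.example.com/missing", opener: not_found_opener)
```

Returning `nil` on error keeps the same contract the spec above expects from `#load` for a failed request.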
What about `<span>` tags?
There may be a ton more fringe cases where people use `<article>` (or other HTML5 tags) to add content.
But again, perhaps it's a near-future concern, not an immediate one.
I've been thinking about this, and for now I think the above is optimal. `span` is too granular: I tried including spans and was just getting a lot of clutter (random words that were wrapped in a span on the page for whatever reason). `article` (and similar) sits above the paragraph in the hierarchy. So, if the page markup is semantic, an article should contain paragraphs, and we don't need to query `article` elements; we just focus on the content. If the markup is not good and tags are used randomly, then we'll lose some of the info. But I think it is better for now to have less, but important, information than all of the text from the page, which is just random bits squashed together.
This all gets quite interesting when you really dig in. It hasn't been updated for a while, but https://github.com/cantino/ruby-readability might be the ticket: it works at a higher level than using nokogiri directly and is likely to do a better job than just pulling the inner text of a select set of tags.
@rickychilcott Nice find, we should evaluate this gem! It looks like it covers a ton of edge cases that would take a long time to handle on our own.