🚚 Loaders: URL loader to quickly load content from web pages #37

Merged
merged 5 commits into from
May 21, 2023
Changes from 4 commits
1 change: 1 addition & 0 deletions Gemfile.lock
@@ -278,6 +278,7 @@ DEPENDENCIES
  hugging-face (~> 0.3.3)
  langchainrb!
  milvus (~> 0.9.0)
  nokogiri (~> 1.13)
  pdf-reader (~> 1.4)
  pinecone (~> 0.1.6)
  pry-byebug (~> 3.10.0)
1 change: 1 addition & 0 deletions README.md
@@ -258,6 +258,7 @@ Need to read data from various sources? Load it up.
| docx | Loaders::Docx | `gem "docx", branch: "master", git: "https://github.com/ruby-docx/docx.git"` |
| pdf | Loaders::PDF | `gem "pdf-reader", "~> 1.4"` |
| text | Loaders::Text | |
| url | Loaders::URL | `gem "nokogiri", "~> 1.13"` |

## Examples
Additional examples available: [/examples](https://github.com/andreibondarev/langchainrb/tree/main/examples)
1 change: 1 addition & 0 deletions langchain.gemspec
@@ -41,6 +41,7 @@ Gem::Specification.new do |spec|
  spec.add_development_dependency "google_search_results", "~> 2.0.0"
  spec.add_development_dependency "hugging-face", "~> 0.3.3"
  spec.add_development_dependency "milvus", "~> 0.9.0"
  spec.add_development_dependency "nokogiri", "~> 1.13"
  spec.add_development_dependency "pdf-reader", "~> 1.4"
  spec.add_development_dependency "pinecone", "~> 0.1.6"
  spec.add_development_dependency "qdrant-ruby", "~> 0.9.0"
1 change: 1 addition & 0 deletions lib/langchain.rb
@@ -58,6 +58,7 @@ module Loaders
  autoload :Docx, "loaders/docx"
  autoload :PDF, "loaders/pdf"
  autoload :Text, "loaders/text"
  autoload :URL, "loaders/url"
end

autoload :Loader, "loader"
38 changes: 38 additions & 0 deletions lib/loaders/url.rb
@@ -0,0 +1,38 @@
# frozen_string_literal: true

require "open-uri"

module Loaders
  class URL < Base
    # We only look for headings and paragraphs
    TEXT_CONTENT_TAGS = %w[h1 h2 h3 h4 h5 h6 p]
Collaborator

What about <span> tags?

Collaborator

There may be many more fringe cases where people use `<article>` (or other HTML5 tags) to hold content.

Collaborator

But again, perhaps it's a near-future concern, not an immediate one.

Contributor Author

@alchaplinsky May 21, 2023

I've been thinking about this, and for now I think the above is optimal.
`span` is too granular. I tried including it and just got a lot of clutter (random words that were wrapped in a `span` on the page for whatever reason).
`article` (and similar tags) sit above paragraphs in the hierarchy. So if the page markup is semantic, the `article` contains paragraphs, and we don't need to query articles; we can just focus on the content. If the markup is poor and tags are used randomly, we'll lose some of the info. But for now I think it is better to capture less of the important information than to return all of the text from the page as random bits squashed together.

Contributor

This all gets quite interesting when you really dig in. It hasn't been updated for a while, but https://github.com/cantino/ruby-readability might be the ticket: it works at a higher level than using nokogiri directly and should do a better job than just pulling the inner HTML of a select set of tags.

Collaborator

@andreibondarev May 21, 2023

@rickychilcott Nice find, we should evaluate this gem! It looks like it covers a ton of edge cases that would take a long time to develop on our own.


    #
    # This Loader parses URL into a text.
    # If you'd like to use it directly you can do so like this:
    # Loaders::URL.new("https://nokogiri.org/").load
    #
    def initialize(url)
      depends_on "nokogiri"
      require "nokogiri"

      @url = url
    end

    # Check that url is a valid URL
    def loadable?
      !!(@url =~ URI::DEFAULT_PARSER.make_regexp)
    end

    def load
      return unless response.status.first == "200"

      doc = Nokogiri::HTML(response.read)
      doc.css(TEXT_CONTENT_TAGS.join(",")).map(&:inner_text).join("\n\n")
    end

    def response
      @response ||= URI.parse(@url).open
    end
  end
end
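
`URI::DEFAULT_PARSER.make_regexp` builds a pattern that is not anchored and is not limited to HTTP, so `loadable?` as written may also accept strings that merely contain a URI-like substring or that use another scheme. A minimal sketch of a stricter check, assuming only absolute http(s) URLs should be loadable; the `http_url?` helper is hypothetical and not part of this PR:

```ruby
require "uri"

# Hypothetical helper: true only for absolute http(s) URLs that have a host.
# URI::HTTPS is a subclass of URI::HTTP, so both schemes pass the type check.
def http_url?(string)
  uri = URI.parse(string)
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  # Strings that cannot be parsed at all (e.g. containing spaces) are rejected.
  false
end

puts http_url?("https://www.example.com")
puts http_url?("invalid url")
puts http_url?("mailto:someone@example.com")
```

Parsing the whole string rather than matching a regexp means surrounding text makes the input invalid instead of being silently ignored.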
45 changes: 45 additions & 0 deletions spec/loaders/url_spec.rb
@@ -0,0 +1,45 @@
# frozen_string_literal: true

RSpec.describe Loaders::URL do
  let(:url) { "https://www.example.com" }
  let(:status) { ["200", "OK"] }
  let(:body) { "<html><body><h1>Lorem Ipsum</h1><p>Dolor sit amet.</p></body></html>" }
  let(:response) { double("response", status: status, read: body) }

  before do
    allow(URI).to receive(:parse).and_return(double(open: response))
  end

  describe "#load" do
    subject { described_class.new(url).load }

    context "successful response" do
      it "loads url" do
        expect(subject).to eq("Lorem Ipsum\n\nDolor sit amet.")
      end
    end

    context "error response" do
      let(:status) { ["404", "Not Found"] }
      let(:body) { "<html><body><h1>Not Found</h1></body></html>" }

      it "loads url" do
        expect(subject).to eq(nil)
      end
    end
  end

  describe "#loadable?" do
    subject { described_class.new(url).loadable? }

    context "with valid url" do
      it { is_expected.to be_truthy }
    end

    context "with invalid url" do
      let(:url) { "invalid url" }

      it { is_expected.to be_falsey }
    end
  end
end
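
The "error response" context stubs a response object whose status is `404`. With the real `open-uri`, a 4xx or 5xx answer typically raises `OpenURI::HTTPError` rather than returning a response, so a caller may want to rescue it. A minimal sketch under that assumption; `fetch_body` and its `opener:` parameter are hypothetical names used only for illustration and are not part of this PR:

```ruby
require "open-uri"
require "stringio"

# Hypothetical helper: returns the response body, or nil when the server
# answers with an HTTP error status (open-uri raises OpenURI::HTTPError).
# `opener` is injectable so the behaviour can be exercised without network access.
def fetch_body(url, opener: ->(u) { URI.parse(u).open })
  opener.call(url).read
rescue OpenURI::HTTPError
  nil
end

# Usage without touching the network: fake a successful response and a 404.
ok = fetch_body(
  "https://www.example.com",
  opener: ->(_) { StringIO.new("<p>Hi</p>") }
)
missing = fetch_body(
  "https://www.example.com/missing",
  opener: ->(_) { raise OpenURI::HTTPError.new("404 Not Found", StringIO.new) }
)

puts ok.inspect
puts missing.inspect
```

Injecting the opener keeps the example self-contained; in a spec the same effect could be reached by stubbing `open` to raise the error instead of returning a double with a `404` status.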