Fail-fast web scraping. grubby adds a layer of utility and error-checking atop the marvelous Mechanize gem. See API listing below, or browse the full documentation.
The following code scrapes stories from the Hacker News front page:
require "grubby"
class HackerNews < Grubby::PageScraper
scrapes(:items) do
page.search!(".athing").map{|element| Item.new(element) }
end
class Item < Grubby::Scraper
scrapes(:story_link){ source.at!("a.storylink") }
scrapes(:story_url){ expand_url(story_link["href"]) }
scrapes(:title){ story_link.text }
scrapes(:comments_link, optional: true) do
source.next_sibling.search!(".subtext a").find do |link|
link.text.match?(/comment|discuss/)
end
end
scrapes(:comments_url, if: :comments_link) do
expand_url(comments_link["href"])
end
scrapes(:comment_count, if: :comments_link) do
comments_link.text.to_i
end
def expand_url(url)
url.include?("://") ? url : source.document.uri.merge(url).to_s
end
end
end
# The following line will raise an exception if anything goes wrong
# during the scraping process. For example, if the structure of the
# HTML does not match expectations due to a site change, the script will
# terminate immediately with a helpful error message. This prevents bad
# data from propagating and causing hard-to-trace errors.
hn = HackerNews.scrape("https://news.ycombinator.com/news")
# Your processing logic goes here:
hn.items.take(10).each do |item|
puts "* #{item.title}"
puts " #{item.story_url}"
puts " #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
puts
end
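When a crash is not acceptable, the failure can be rescued at the call site instead. A minimal sketch, assuming scraper failures surface as `Grubby::Scraper::Error` (consult the full documentation for the exact exception classes):

```ruby
require "grubby"

begin
  hn = HackerNews.scrape("https://news.ycombinator.com/news")
rescue Grubby::Scraper::Error => e
  # Assumed error class: the message identifies which `scrapes` field
  # failed, which makes site-structure changes easy to diagnose.
  warn "Scrape failed: #{e.message}"
  exit 1
end
```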
Hacker News also offers a JSON API, which may be more robust for scraping purposes. grubby can scrape JSON just as well:
require "grubby"
class HackerNews < Grubby::JsonScraper
scrapes(:items) do
# API returns array of top 500 item IDs, so limit as necessary
json.take(10).map do |item_id|
Item.scrape("https://hacker-news.firebaseio.com/v0/item/#{item_id}.json")
end
end
class Item < Grubby::JsonScraper
scrapes(:story_url){ json["url"] || hn_url }
scrapes(:title){ json["title"] }
scrapes(:comments_url, optional: true) do
hn_url if json["descendants"]
end
scrapes(:comment_count, optional: true) do
json["descendants"]&.to_i
end
def hn_url
"https://news.ycombinator.com/item?id=#{json["id"]}"
end
end
end
hn = HackerNews.scrape("https://hacker-news.firebaseio.com/v0/topstories.json")
# Your processing logic goes here:
hn.items.each do |item|
puts "* #{item.title}"
puts " #{item.story_url}"
puts " #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
puts
end
grubby's core API is built around the following classes. Several of them gain fail-fast method variants, such as the `search!` and `at!` calls used in the examples above (see the sketch after this list):

- Grubby
- Scraper
- PageScraper
- JsonScraper
- Mechanize::File
- Mechanize::Page
- Mechanize::Page::Link
- URI
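For instance, the fail-fast variants raise a descriptive error when nothing matches, rather than returning `nil` or an empty result. A minimal sketch, reusing the selectors from the example above:

```ruby
require "grubby"

page = Grubby.new.get("https://news.ycombinator.com/news")

page.search!(".athing")   # matching elements, or raises if none are found
page.at!("a.storylink")   # first matching element, or raises (instead of nil)
```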
grubby also loads several gems that extend Ruby objects with utility methods. The gems are listed below; see each gem's documentation for a complete API listing. A small usage sketch follows the list.
- Active Support
- casual_support
- gorge
- mini_sanity
- pleasant_path
- ryoba
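For example, Active Support's string helpers are handy when cleaning scraped text. A minimal sketch; the explicit `require` lines are a precaution, in case grubby pulls in only part of Active Support:

```ruby
require "grubby"
require "active_support/core_ext/string/filters"  # String#squish
require "active_support/core_ext/object/blank"    # Object#presence

"  Show HN:\n   an example  ".squish  # => "Show HN: an example"
"   ".presence                        # => nil (handy for optional fields)
```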
Install the `grubby` gem.
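For example, add it to your Gemfile:

```ruby
# Gemfile
gem "grubby"
```

or install it directly with `gem install grubby`.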
Run `rake test` to run the tests.