Wombat

Web crawler/scraper with an elegant DSL which extracts structured data from web pages

Web scraper with an elegant DSL that parses structured data from web pages.

Usage:

gem install wombat

Note: Requires Ruby 1.9.3 or later (activesupport requires Ruby >= 1.9.3).

Scraping a page:

The simplest way to use Wombat is by calling Wombat.crawl and passing it a block:

require 'wombat'

Wombat.crawl do
  base_url "http://www.github.com"
  path "/"

  headline xpath: "//h1"
  subheading css: "p.subheading"

  what_is({ css: ".teaser h3" }, :list)

  links do
    explore xpath: '//*[@id="wrapper"]/div[1]/div/ul/li[1]/a' do |e|
      e.gsub(/Explore/, "Love")
    end

    search css: '.search'
    features css: '.features'
    blog css: '.blog'
  end
end

The code above returns the following hash:

{
  "headline"=>"Build software better, together.",
  "subheading"=> "Powerful collaboration, review, and code management for open source and private development projects.",
  "what_is"=> [
    "Great collaboration starts with communication.",
    "Manage and contribute from all your devices.",
    "The world’s largest open source community."
  ],
  "links"=> {
    "explore"=>"Love GitHub",
    "search"=>"Search",
    "features"=>"Features",
    "blog"=>"Blog"
  }
}
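
As a small sketch of how the returned hash might be consumed, the snippet below copies part of the result shown above into a literal (so it runs without the gem or network access) and reads values back out. The variable names (`result`, `headline`, `teaser_count`, `explore_label`, `blog_label`) are illustrative assumptions, not part of Wombat.

```ruby
# Literal copy of part of the hash shown above. In real use this value
# would be the return value of Wombat.crawl; it is hard-coded here so
# the sketch runs on its own.
result = {
  "headline" => "Build software better, together.",
  "what_is" => [
    "Great collaboration starts with communication.",
    "Manage and contribute from all your devices.",
    "The world’s largest open source community."
  ],
  "links" => {
    "explore" => "Love GitHub",
    "search" => "Search",
    "features" => "Features",
    "blog" => "Blog"
  }
}

# Top-level properties are keyed by plain strings.
headline = result["headline"]

# A property declared with :list comes back as an Array.
teaser_count = result["what_is"].length

# A nested block (links do ... end) comes back as a nested Hash.
explore_label = result["links"]["explore"]

# Hash#fetch raises KeyError on a missing key, which surfaces typos early.
blog_label = result.fetch("links").fetch("blog")

puts headline
puts "#{teaser_count} teasers"
puts explore_label
puts blog_label
```

Note that the keys are strings, not symbols, so `result[:headline]` would return `nil` on this hash.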

This is just a sneak peek of what Wombat can do. For the complete documentation, please check the links below:

Wiki

API Documentation

Changelog

Contributing to Wombat

Contributors

Copyright

Copyright (c) 2012 Felipe Lima. See LICENSE.txt for further details.