Jsoup Extension

Dmitriy Zayceff edited this page Apr 24, 2015 · 13 revisions

Jsoup is a library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is based on the jsoup Java library (jsoup.org).

Usage

Fetch a URL and parse the response:

use php\jsoup\Jsoup;

$doc = Jsoup::connect("http://en.wikipedia.org/")->get();
$newsHeadlines = $doc->select("#mp-itn b a");

foreach ($newsHeadlines as $element) {
   echo "- {$element->text()}\n";
}

To parse HTML from a string, use the static Jsoup::parseText(string $html) method, or Jsoup::parseText(string $html, string $baseUri) if the page came from the web, so that relative URLs can be resolved against the base URI.

$html = "<html><head><title>First parse</title></head>
         <body><p>Parsed HTML into a doc.</p></body></html>";
$doc = Jsoup::parseText($html);

echo $doc->title();
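The two-argument form is not shown above, so here is a minimal sketch of it. The HTML fragment and the base URI are made up for illustration. It also assumes that the extension passes jsoup's "abs:" attribute prefix through to the underlying Java library unchanged, which this page does not confirm.

```php
use php\jsoup\Jsoup;

// Hypothetical fragment and base URI, chosen only for illustration.
$html = '<p><a href="/docs/start.html">Getting started</a></p>';
$doc = Jsoup::parseText($html, "http://example.com/");

foreach ($doc->select("a[href]") as $link) {
    // "href" returns the attribute as written in the markup.
    echo $link->attr("href"), "\n";

    // Assumption: the "abs:" prefix is forwarded to jsoup, which
    // resolves the value against the base URI given to parseText().
    echo $link->attr("abs:href"), "\n";
}
```

If the prefix is not supported by the extension, the second call would need to be replaced by resolving the relative link against the base URI by hand.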

Load a document from a URL with parameters:

$doc = Jsoup::connect("http://example.com")
  ->data("query", "Java")
  ->userAgent("Mozilla")
  ->cookie("auth", "token")
  ->timeout(3000) // 3 sec
  ->post();

Load a document from a file:

$doc = Jsoup::parse('path/to/file.html', 'UTF-8', 'http://mysite.com/');

Use DOM methods to navigate a document:

$input = "/tmp/input.html";
$doc = Jsoup::parse($input, "UTF-8", "http://example.com/");

$content = $doc->select("#content");
$links   = $content->select("a");

foreach ($links as $link) {
  $linkHref = $link->attr("href");
  $linkText = $link->text();
}

Use selector syntax to find elements:

$input = "/tmp/input.html";
$doc = Jsoup::parse($input, "UTF-8", "http://example.com/");

$links = $doc->select("a[href]"); // a elements that have an href attribute
$pngs = $doc->select("img[src$=.png]"); // img elements whose src ends with .png

$masthead = $doc->select("div.masthead")->first();  // first div with class=masthead

$resultLinks = $doc->select("h3.r > a"); // a elements that are direct children of h3 with class=r