Page Meta is a PHP library that can retrieve detailed info on any URL from the internet! It uses data from HTML meta tags and OpenGraph, with a fallback to detailed HTML scraping.
- Works for any valid URL on the internet!
- Follows page redirects
- Uses all scraping methods available: HTML tags, OpenGraph, Schema data
- Display Info Cards for links in an article
- Rich preview for links in messaging apps
- Extract info from a user-submitted URL
Add layered/page-meta as a dependency in your project's composer.json file:
$ composer require layered/page-meta
Create a UrlPreview instance, then call the loadUrl($url) method with your URL as the first argument. Preview data is retrieved with the get($section) or getAll() methods:
require 'vendor/autoload.php';
$preview = new Layered\PageMeta\UrlPreview([
    'HTTP_USER_AGENT' => 'Mozilla/5.0 (compatible; YourApp/1.0; +https://example.com)'
]);
$preview->loadUrl('https://www.instagram.com/p/BbRyo_Kjqt1/');
$allPageData = $preview->getAll(); // contains all scraped data
$siteInfo = $preview->get('site'); // get general info about the website
The library downloads the HTML source of the URL you provided, then uses specialized scrapers to extract pieces of information.
Core scrapers can be seen in src/scrapers/, and they extract general info for a page: title, author, description, page type, main image, etc.
If you would like to extract a new field, see the Extending the library section.
User Agent or extra headers can make a big difference when downloading HTML from a website. There are some websites that forbid scraping and hide the content when they detect a tool like this one. Make sure to read their dev docs & TOS.
The default User Agent is blocked on sites like Twitter, Instagram, Facebook and others. A workaround is to use this one (thanks for the tip PVGrad):
'HTTP_USER_AGENT' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
Returned data will be an Array with the following format:
{
"site": {
"secure": true,
"url": "https:\/\/www.instagram.com",
"icon": "https:\/\/www.instagram.com\/static\/images\/ico\/favicon-192.png\/b407fa101800.png",
"language": "en",
"responsive": true,
"name": "Instagram"
},
"page": {
"type": "photo",
"url": "https:\/\/www.instagram.com\/p\/BbRyo_Kjqt1\/",
"title": "GitHub on Instagram",
"description": "There\u2019s still time to join the #GitHubGameOff and build a game inspired by throwbacks. Get started\u2026",
"image": {
"url": "https:\/\/scontent-mad1-1.cdninstagram.com\/vp\/73b1790d77548031327e64ee83196706\/5B4AD567\/t51.2885-15\/e35\/23421974_1768724519826754_3855913942043852800_n.jpg"
}
},
"author": {
"name": "GitHub",
"handle": "@github",
"url": "https:\/\/www.instagram.com\/github\/"
},
"app_links": {
"ios": {
"url": "nflx:\/\/www.netflix.com\/title\/80014749",
"app_store_id": "363590051",
"app_name": "Netflix",
"store_url": "https:\/\/itunes.apple.com\/us\/app\/Netflix\/id363590051"
},
"android": {
"url": "nflx:\/\/www.netflix.com\/title\/80014749",
"package": "com.netflix.mediaclient",
"app_name": "Netflix",
"store_url": "https:\/\/play.google.com\/store\/apps\/details?id=com.netflix.mediaclient"
}
}
}
See UrlPreview::getAll() for info on each returned field.
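As an illustration of the "Info Cards" use case, here is a minimal sketch that turns an array in the format above into an HTML card. The function name renderInfoCard and the CSS class are hypothetical and not part of the library; the sketch only assumes the site and page keys shown in the sample, and treats every field as optional.

```php
<?php
/**
 * Render a minimal HTML info card from data shaped like getAll() output.
 * Hypothetical helper for illustration: not part of the library.
 */
function renderInfoCard(array $data): string
{
    $page = $data['page'] ?? [];
    $site = $data['site'] ?? [];

    // Escape every scraped value: it comes from a third-party page.
    $esc = function ($value): string {
        return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    };

    // Only link to http(s) URLs; anything else falls back to "#".
    $url = (string) ($page['url'] ?? '');
    if (!preg_match('#^https?://#i', $url)) {
        $url = '#';
    }

    $html = '<a class="info-card" href="' . $esc($url) . '">';
    if (!empty($page['image']['url'])) {
        $html .= '<img src="' . $esc($page['image']['url']) . '" alt="">';
    }
    $html .= '<strong>' . $esc($page['title'] ?? '') . '</strong>';
    if (!empty($page['description'])) {
        $html .= '<p>' . $esc($page['description']) . '</p>';
    }
    if (!empty($site['name'])) {
        $html .= '<small>' . $esc($site['name']) . '</small>';
    }

    return $html . '</a>';
}

// Sample data using a subset of the fields shown above.
$card = renderInfoCard([
    'site' => ['name' => 'Instagram'],
    'page' => [
        'url'   => 'https://www.instagram.com/p/BbRyo_Kjqt1/',
        'title' => 'GitHub on Instagram',
    ],
]);
echo $card, PHP_EOL;
```

In a real project the array would come from $preview->getAll() instead of the hard-coded sample.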
The UrlPreview class provides the following public methods:
Start the UrlPreview instance. Pass extra headers to send when requesting the page URL
Load and start the scrape process for any valid URL
Get all data scraped from page
Return: Array with scraped data in the following format:
- site - info about the website
  - url - main site URL
  - name - site name, ex: 'Instagram' or 'Medium'
  - secure - Boolean true|false. True if the site is served over HTTPS
  - responsive - Boolean true|false. True if the site has a viewport meta tag present. Basic check for responsiveness
  - icon - site icon
  - language - ISO 639-1 language code, ex: en, es
- page - info about the page at the current URL
  - type - page type, ex: website, article, profile, video, etc
  - url - canonical URL for the page
  - title - page title
  - description - page description
  - image - Array containing image info, if present:
    - url - image URL
    - width - image width
    - height - image height
  - video - Array containing video info, if found on page:
    - url - video URL
    - width - video width
    - height - video height
- author - info about the content author, ex:
  - name - author's name on a blog, person's name on social network sites
  - handle - social media site username
  - url - author URL for more articles, or profile URL on social network sites
- app_links - Array containing apps linked to the page, like:
  - ios - iOS app
    - url - link for in-app action, ex: 'nflx://www.netflix.com/title/80014749'
    - app_store_id - Apple AppStore app ID
    - app_name - name of the app
    - store_url - link to the installable app
  - android - Android app
    - url - link for in-app action, ex: 'nflx://www.netflix.com/title/80014749'
    - package - Android PlayStore app ID
    - app_name - name of the app
    - store_url - link to the installable app
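Because sections such as app_links are only filled in when the page declares them, code that reads the result should not assume a key exists. The sketch below picks the store link for one platform; appStoreUrl is a hypothetical helper, not a library function, and it relies only on the ios / android and store_url keys listed above.

```php
<?php
/**
 * Return the store URL for a platform ('ios' or 'android') from the
 * app_links section, or null when the page declared no such app.
 * Hypothetical helper for illustration: not part of the library.
 */
function appStoreUrl(array $appLinks, string $platform): ?string
{
    $store = $appLinks[$platform]['store_url'] ?? null;

    return is_string($store) && $store !== '' ? $store : null;
}

// Sample section containing only an iOS entry.
$appLinks = [
    'ios' => [
        'url'          => 'nflx://www.netflix.com/title/80014749',
        'app_store_id' => '363590051',
        'app_name'     => 'Netflix',
        'store_url'    => 'https://itunes.apple.com/us/app/Netflix/id363590051',
    ],
];

var_dump(appStoreUrl($appLinks, 'ios'));     // the iTunes URL
var_dump(appStoreUrl($appLinks, 'android')); // NULL: no Android entry in this sample
```

With the library, the section would typically be obtained through $preview->get('app_links').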
Get data in one scraped section: site, page, author or app_links
Return: Array with the section's scraped data. See UrlPreview::getAll() for the data format
Attach an event listener to UrlPreview for data processing or the scrape process. Arguments:
- $eventName - on which event to listen. Available:
  - page.scrape - fired when the scraping process starts
  - data.filter - fired when data is requested by getData() or getAll() methods
- $listener - a callable reference, which will get the $event parameter with available data
- $priority - order in which the callable should be executed
If you need more scraped data for a URL, more functionality can be attached to the PageMeta library. Example for returning the 'Terms and Conditions' link from pages:
use Symfony\Component\EventDispatcher\Event;

$previewer = new \Layered\PageMeta\UrlPreview;
$previewer->addListener('page.scrape', function(Event $event) {
    $currentScrapedData = $event->getData(); // check data from other scrapers
    $crawler = $event->getCrawler(); // instance of DomCrawler Symfony Component

    $termsLink = '';
    $crawler->filter('a[href*=terms]')->each(function($node) use(&$termsLink) {
        $termsLink = $node->attr('href'); // keeps the last matching link
    });

    // forwards the scraped data
    $event->addData('site', [
        'termsLink' => $termsLink
    ]);
});
$previewer->loadUrl('http://github.com');
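A listener can also build on what other scrapers have already found. The sketch below fills in a missing page title with the site name. It rests on assumptions that should be checked against the library source: that getData() returns the data collected so far keyed by section, and that addData() merges the given fields into that section. The variable name $titleFallback is made up for this example.

```php
<?php
/**
 * Listener sketch for the 'page.scrape' event.
 * Assumptions (not confirmed by the docs above): getData() returns the
 * data collected so far, keyed by section, and addData() merges fields
 * into the named section.
 */
$titleFallback = function ($event) {
    $data = $event->getData();

    // Leave the title alone when another scraper already found one.
    if (!empty($data['page']['title'])) {
        return;
    }

    // Fall back to the site name, if one is known.
    if (!empty($data['site']['name'])) {
        $event->addData('page', [
            'title' => $data['site']['name'],
        ]);
    }
};
```

It would be attached with $previewer->addListener('page.scrape', $titleFallback) before calling loadUrl(). For the fallback to see the other scrapers' results it has to run after them, so the $priority argument matters here; which direction runs later is not stated above and should be verified.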
Please report any issues here on GitHub.