Skip to content

A PHP-based web crawler that efficiently scrapes and indexes web content. It follows links, retrieves data, and stores it in a structured format. Features include customizable crawl depth, domain targeting, and robots.txt compliance. Perfect for building data pipelines or conducting web scraping projects.

Notifications You must be signed in to change notification settings

elijahcroft/web_crawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

PHP Web Crawler

This is a basic web crawler built in PHP that navigates through web pages starting from a given URL. It extracts metadata such as titles, descriptions, and keywords from the pages it visits.

Features

  • Crawls web pages recursively starting from a given URL.
  • Extracts the following metadata:
    • Page Title
    • Meta Description
    • Meta Keywords
  • Resolves relative links to absolute URLs.
  • Handles edge cases like JavaScript links and duplicate URLs.
  • Outputs metadata in JSON format.
  • Tracks all visited URLs.

Prerequisites

  • PHP installed on your system (version 7.0 or higher).
  • A local or live server to host and test the crawler.

How to Use

  1. Clone the repository or download the script.
  2. Place the script in your server's directory (e.g., htdocs for XAMPP or similar).
  3. Update the $start variable with the URL you want to crawl. For example:
    $start = "http://example.com";

About

A PHP-based web crawler that efficiently scrapes and indexes web content. It follows links, retrieves data, and stores it in a structured format. Features include customizable crawl depth, domain targeting, and robots.txt compliance. Perfect for building data pipelines or conducting web scraping projects.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages