spidy

A simple web crawler written in Go.

Description

This tool crawls through the links on any given page and lists the URLs it finds.

Dependencies and Building

It uses goquery to parse HTML pages and extract the links from them. There are no other external dependencies for the application.
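For reference, this is roughly how goquery can be used to pull anchor hrefs out of a fetched page; the fetchLinks helper below is an illustrative sketch, not necessarily how spidy implements it.

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// fetchLinks is an illustrative helper: it downloads a page and
// returns the href values of all anchor tags found in it.
func fetchLinks(pageURL string) ([]string, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			links = append(links, href)
		}
	})
	return links, nil
}

func main() {
	links, err := fetchLinks("https://xkcd.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```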

You can run make dep to install the dependencies.

Note: If you want to run make lint you need to have golint installed.

Building

To build a binary for macOS (Darwin) you can simply run make build. If you want to build the binary for Linux, run the command below on a Linux instance.

go build -i -o spidy

Running

Simply run the application with the URL you want to crawl as the first argument.

./spidy https://xkcd.com

It prints all the links that belong to the same site. So in the example above, all the links on the https://xkcd.com domain will be listed.
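A same-host check like this can be done with the standard library's net/url package; the sameSite helper below is a hypothetical sketch, not necessarily spidy's actual implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// sameSite reports whether link points to the same host as base.
// Relative links are resolved against base first.
func sameSite(base *url.URL, link string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	return base.ResolveReference(u).Host == base.Host
}

func main() {
	base, _ := url.Parse("https://xkcd.com")
	fmt.Println(sameSite(base, "/about"))                  // true (relative link)
	fmt.Println(sameSite(base, "https://xkcd.com/1/"))     // true
	fmt.Println(sameSite(base, "https://www.example.com")) // false
}
```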

Considerations and Limitations

As of now, it doesn't limit the concurrency, but it would be a good idea to cap the concurrency at the number of CPU cores available on the machine.
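One common way to cap concurrency in Go is a buffered-channel semaphore sized to the number of cores. The sketch below shows the pattern with placeholder work; it is not how spidy currently behaves.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	urls := []string{"https://xkcd.com/1/", "https://xkcd.com/2/", "https://xkcd.com/3/"}

	// Semaphore: at most NumCPU crawls run at the same time.
	sem := make(chan struct{}, runtime.NumCPU())
	var wg sync.WaitGroup

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done

			fmt.Println("crawling", u) // placeholder for the real fetch/parse work
		}(u)
	}
	wg.Wait()
}
```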

Also, while testing the application I quickly found that crawling through the links recursively can take hours or never terminate, so it would also be a good idea to limit the depth of the crawl.
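A depth limit can be enforced by threading a counter through the recursion and stopping at zero. The crawl function below is a minimal sketch that builds on the illustrative fetchLinks helper shown earlier; neither name is part of spidy's actual code.

```go
// crawl visits pageURL and recurses into its links until depth reaches zero.
// visited guards against revisiting the same URL; fetchLinks is assumed to
// return the page's outgoing links, as in the earlier sketch.
func crawl(pageURL string, depth int, visited map[string]bool) {
	if depth <= 0 || visited[pageURL] {
		return
	}
	visited[pageURL] = true

	links, err := fetchLinks(pageURL)
	if err != nil {
		return
	}
	for _, l := range links {
		crawl(l, depth-1, visited)
	}
}
```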

Planned Enhancements

  1. Sanitise the user input and reject an incorrect starting URL (see the sketch after this list).
  2. Limit the concurrency of the process.
  3. Set the depth until which the crawling should be done.
  4. Modify the result struct so that a sitemap of some kind can be printed.
  5. Add more unit tests.
  6. Set up a CI/CD system that runs some validations before merging.
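For enhancement 1, the validation could be as simple as parsing the argument and requiring an absolute http(s) URL. The validateStartURL helper below is hypothetical, just to illustrate the idea.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// validateStartURL is a hypothetical check for the first CLI argument:
// it must parse as an absolute URL with an http or https scheme.
func validateStartURL(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return nil, fmt.Errorf("missing host in %q", raw)
	}
	return u, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: spidy <url>")
		os.Exit(1)
	}
	u, err := validateStartURL(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid starting URL:", err)
		os.Exit(1)
	}
	fmt.Println("starting crawl at", u.String())
}
```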