A simple web crawler written in Go.
This tool crawls the links on any given page and lists the URLs it finds.
It uses goquery to parse the HTML pages and extract the links from them. There are no other external dependencies for the application.
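The link-extraction step with goquery could look roughly like the sketch below. This is an illustration of the technique rather than the project's actual code; `extractLinks` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// extractLinks is a hypothetical helper showing how goquery can be used
// to pull href attributes out of the anchor tags on a page.
func extractLinks(pageURL string) ([]string, error) {
	res, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()

	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			links = append(links, href)
		}
	})
	return links, nil
}

func main() {
	links, err := extractLinks("https://xkcd.com")
	if err != nil {
		log.Fatal(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```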
You can run `make dep` to install the dependencies.

Note: if you want to run `make lint`, you need to have golint installed.
To build a binary for Mac (Darwin), simply run `make build`. If you want to build the binary for Linux, run the command below on a Linux instance:

```sh
go build -i -o spidy
```
Simply run the application with the URL you want to crawl as the first argument:

```sh
./spidy https://xkcd.com
```

It will print all the links that belong to the same site. So in the example above, all the links on the https://xkcd.com domain will be listed.
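The "same site" check boils down to comparing hostnames. A minimal sketch of such a filter, under the assumption that relative links count as same-site, might look like this (`sameSite` is an assumed name, not necessarily what the code uses):

```go
package main

import (
	"fmt"
	"net/url"
)

// sameSite keeps a discovered link only if its host matches the host of
// the starting URL. This is an illustrative check, not the project's code.
func sameSite(start, link string) bool {
	su, err := url.Parse(start)
	if err != nil {
		return false
	}
	lu, err := url.Parse(link)
	if err != nil {
		return false
	}
	// Relative links have an empty host and stay on the same site.
	return lu.Host == "" || lu.Host == su.Host
}

func main() {
	fmt.Println(sameSite("https://xkcd.com", "https://xkcd.com/archive/")) // true
	fmt.Println(sameSite("https://xkcd.com", "https://example.org/"))      // false
}
```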
As of now, it doesn't limit the concurrency, but it would be a good idea to limit the concurrency to the number of CPU cores available on the machine.
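One common way to do that in Go is a buffered channel used as a semaphore, sized to `runtime.NumCPU()`. The sketch below only illustrates the pattern; the URL list and the fetch step are placeholders, not the crawler's real code.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	urls := []string{"https://xkcd.com/1/", "https://xkcd.com/2/", "https://xkcd.com/3/"}

	// Buffered channel as a semaphore: at most NumCPU fetches run at once.
	sem := make(chan struct{}, runtime.NumCPU())
	var wg sync.WaitGroup

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			fmt.Println("crawling", u) // placeholder for the real fetch
		}(u)
	}
	wg.Wait()
}
```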
Also, while testing the application I quickly found that crawling through the links recursively can take hours or never finish. So it would be a good idea to limit the depth of crawling.
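A straightforward way to cap the depth is to pass the remaining depth along with each URL and stop recursing when it reaches zero. The sketch below assumes hypothetical names (`crawl`, `fetchLinks`); the stub `fetchLinks` only exists so the example compiles on its own.

```go
package main

import "fmt"

// fetchLinks stands in for the real goquery-based extraction; it returns
// nothing so this sketch stays self-contained.
func fetchLinks(pageURL string) ([]string, error) {
	return nil, nil
}

// crawl visits a page and recurses into its links, decrementing depth at
// every level and stopping once it hits zero or a page was seen before.
func crawl(pageURL string, depth int, visited map[string]bool) {
	if depth <= 0 || visited[pageURL] {
		return
	}
	visited[pageURL] = true
	fmt.Println("visiting", pageURL)

	links, err := fetchLinks(pageURL)
	if err != nil {
		return
	}
	for _, link := range links {
		crawl(link, depth-1, visited)
	}
}

func main() {
	crawl("https://xkcd.com", 2, make(map[string]bool))
}
```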
- Sanitise the user input and reject an invalid starting URL (see the sketch after this list).
- Limit the concurrency of the process.
- Set the depth until which the crawling should be done.
- Modify the result struct so that a sitemap of some kind can be printed.
- Add more unit tests.
- Set up a CI/CD system that runs some validations before merging.
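For the first item, validation could be as simple as parsing the argument and requiring an absolute http(s) URL. This is only a possible approach; `validateStartURL` is an assumed name, not part of the current code.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"os"
)

// validateStartURL is an illustrative check for the starting URL: it must
// parse, be absolute, and use the http or https scheme.
func validateStartURL(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	if !u.IsAbs() || (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return nil, errors.New("starting URL must be an absolute http(s) URL")
	}
	return u, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: spidy <url>")
		os.Exit(1)
	}
	u, err := validateStartURL(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("starting crawl at", u.String())
}
```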