web-link-scraper

Simple REST API to locate hyperlinks in a single supplied URL.

The API is defined and documented in Swagger - see https://app.swaggerhub.com/apis/seanprince/Web-Link-Scraper/v1

The Jsoup HTML parser (see https://jsoup.org/) was used to speed the implementation. Jsoup was chosen as it is widely used (1007 usages) and actively being developed recently (last release Nov 17).

Assumptions about the requirement:

Use of the verb 'locate' in the task specification implies information about where the link resides in the HTML document should also be returned by the API. I've chosen to return the CSS selector for the link to meet this requirement.

To build and run tests:

In IntelliJ:
- Open the project at <src checkout root>/web-link-scraper
- Open the Maven Projects windows by clicking View | Tool Windows | Maven Projects
- Double-click web-link-scraper > Lifecycle > install
Or, from a command prompt:
- Go to folder <src checkout root>/web-link-scraper
- Run command mvn install

To run the REST API:

Download and install Tomcat 8 (see https://tomcat.apache.org/tomcat-8.0-doc/setup.html)
In IntelliJ:
- Edit the provided wls-rest-service Run Configuration by selecting Run | Edit Configurations...
- On the Server tab, click Configure next to Application server
- Ensure that the Tomcat home folder is set to the folder where you've installed Tomcat
- Start Tomcat by clicking Run | Run wls-rest-service. The endpoint will be http://localhost:8086/wls-rest/v1/links

If this was a 'real' development task, I would additionally:

Write automated acceptance testing of the REST API, eg using POSTMan scripts
Enhance the scraper to be kind to the sites it was scraping by:
- Caching previously scraped sites within a configurable period
- Add a configurable time-out to cope with long document download times

Possible enhancements (that go beyond the scope of the requirement):

Add 'deep' link scraping to a configurable level
Paging and limiting results
Sorting results

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
core		core
rest		rest
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml
wls-logback-configuration.xml		wls-logback-configuration.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

web-link-scraper

About

Releases

Packages

Languages

License

seanprince/web-link-scraper

Folders and files

Latest commit

History

Repository files navigation

web-link-scraper

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages