company-scraper-server is a web server app built with Node.js and the hapi framework. Its purpose is to scrape company information from sources such as linkedin.com and societe.com.
Here are some high-level design diagrams describing 3 different use cases:
Disclaimer: there is actually no cache involved in this version of company-scraper-server; the diagrams show how it would ideally work. A cache could be implemented with Redis or Memcached, for example, but that seems a little overkill for now.
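As a rough sketch only, the cache-aside flow the diagrams describe could look like the following with Redis and the ioredis client. Nothing below exists in the codebase; `scrapeCompany` is a hypothetical stand-in for the real scraping service.

```js
const Redis = require('ioredis');

const redis = new Redis(); // defaults to redis://127.0.0.1:6379
const ONE_DAY_IN_SECONDS = 60 * 60 * 24;

// Hypothetical stand-in; the real scraper would drive Puppeteer.
async function scrapeCompany(url) {
  return { url, name: 'ACME', scrapedAt: new Date() };
}

// Cache-aside: return the cached entry if present, otherwise scrape,
// cache the result with a TTL, and return it.
async function getCompany(url) {
  const cached = await redis.get(url);
  if (cached) return JSON.parse(cached);

  const company = await scrapeCompany(url);
  await redis.set(url, JSON.stringify(company), 'EX', ONE_DAY_IN_SECONDS);
  return company;
}
```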
company-scraper-server serves a REST API implemented with hapi. It has 2 routes:
Returns a list of company pages (URLs) matching query, from linkedin.com and societe.com.
- Parameters

```js
{
  query: 'company_name'
}
```
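For illustration, a hapi route for this endpoint could look roughly like the sketch below. The `/search` path and `searchService` are assumptions, not the project's actual names.

```js
const Hapi = require('@hapi/hapi');

// Hypothetical stand-in for the service that queries both sources.
const searchService = {
  async findCompanyPages(query) {
    return {
      linkedin: [`https://www.linkedin.com/company/${query}`],
      societe: [`https://www.societe.com/societe/${query}`],
    };
  },
};

const server = Hapi.server({ port: 3000 });

server.route({
  method: 'GET',
  path: '/search', // hypothetical path
  handler: (request) => searchService.findCompanyPages(request.query.query),
});

server.start().then(() => console.log(`Server running at ${server.info.uri}`));
```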
Returns company information scraped from the given company page URL(s).
- Parameters

```js
{
  linkedin: 'https://www.linkedin.com/company/company-name',
  societe: 'https://www.societe.com/societe/company-name'
}
```
At least 1 URL is required (linkedin, societe, or both).
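A hedged sketch of how that constraint could be enforced with Joi, hapi's usual validation companion: Joi's `.or()` rule fails unless at least one of the listed keys is present. The field names come from the parameters above; everything else is illustrative.

```js
const Joi = require('joi');

// Fails validation unless `linkedin`, `societe`, or both are present.
const schema = Joi.object({
  linkedin: Joi.string().uri(),
  societe: Joi.string().uri(),
}).or('linkedin', 'societe');

console.log(schema.validate({ linkedin: 'https://www.linkedin.com/company/acme' }).error); // undefined (valid)
console.log(schema.validate({}).error); // ValidationError: must contain at least one of [linkedin, societe]
```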
Scraping services use Puppeteer to extract data from company pages.
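To make that concrete, here is a minimal Puppeteer flow of the kind such a service performs; the selector and the field extracted are invented for the example.

```js
const puppeteer = require('puppeteer');

// Illustrative only: open the page and pull one piece of text out of it.
async function scrapeCompanyName(url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Evaluate inside the page context; 'h1' is a made-up selector.
    return await page.$eval('h1', (el) => el.textContent.trim());
  } finally {
    await browser.close();
  }
}
```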
Company information is persisted in a MongoDB collection named companies. Thus, when a company whose data was already scraped is requested again, its data is returned from the DB. A scheduled job cleans out old companies (it should run every day at midnight).
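As a sketch under stated assumptions, the lookup and the midnight cleanup could be wired with the official mongodb driver and node-cron. The companies collection name comes from the text above; the field names and the 30-day retention window are invented.

```js
const { MongoClient } = require('mongodb');
const cron = require('node-cron');

const client = new MongoClient(process.env.MONGODB_URI);

// Return previously scraped data instead of scraping again.
// The `linkedinUrl` field name is an assumption.
async function findCompany(linkedinUrl) {
  await client.connect(); // no-op if already connected
  return client.db().collection('companies').findOne({ linkedinUrl });
}

// '0 0 * * *' fires every day at midnight: delete companies scraped
// more than 30 days ago (`scrapedAt` and the window are assumptions).
cron.schedule('0 0 * * *', async () => {
  await client.connect();
  const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
  await client.db().collection('companies').deleteMany({ scrapedAt: { $lt: cutoff } });
});
```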
- Node v10+
- yarn
- A LinkedIn account: we need to be logged in to scrape company data from linkedin.com
- A MongoDB database: you can create one on MongoDB Atlas if necessary (see https://docs.mongodb.com/manual/tutorial/atlas-free-tier-setup/)
- A .env file at the project's root for environment variables
- Copy the .env.example file and rename it to .env
- Add LinkedIn credentials
- Add MongoDB connection string URI
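Once filled in, the .env could end up looking something like this; the variable names below are guesses, the authoritative list is in .env.example:

```
# Hypothetical variable names; check .env.example for the real ones
LINKEDIN_EMAIL=you@example.com
LINKEDIN_PASSWORD=your-password
MONGODB_URI=mongodb+srv://user:password@cluster0.mongodb.net/companies
```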
```bash
$ yarn install   # install dependencies
$ yarn serve     # start dev server with nodemon
$ yarn lint      # lint files
$ yarn lint:fix  # lint and fix files
$ yarn test      # run unit tests with Jest
```