company-scraper-server is a web server app built with Node.js and the hapi framework. Its purpose is to scrape company information from sources such as linkedin.com and societe.com.
Here are some high-level design diagrams describing 3 different use cases:
Disclaimer: there is actually no cache involved in this version of company-scraper-server; the diagrams show how it would ideally work. A cache could be implemented with Redis or Memcached, for example, but that seems a little overkill for now.
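As a rough sketch only, the cache-aside flow the diagrams describe could look like the following with Redis and the ioredis client. Nothing below exists in the codebase; `scrapeCompany` is a hypothetical stand-in for the real scraping service.

```js
const Redis = require('ioredis');

const redis = new Redis(); // defaults to redis://127.0.0.1:6379
const ONE_DAY_IN_SECONDS = 60 * 60 * 24;

// Hypothetical stand-in; the real scraper would drive Puppeteer.
async function scrapeCompany(url) {
  return { url, name: 'ACME', scrapedAt: new Date() };
}

// Cache-aside: return the cached entry if present, otherwise scrape,
// cache the result with a TTL, and return it.
async function getCompany(url) {
  const cached = await redis.get(url);
  if (cached) return JSON.parse(cached);

  const company = await scrapeCompany(url);
  await redis.set(url, JSON.stringify(company), 'EX', ONE_DAY_IN_SECONDS);
  return company;
}
```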
company-scraper-server serves a REST API implemented with hapi. It has 2 routes:
Returns a list of company pages (URLs) matching query, from linkedin.com and societe.com.
- Parameters

```js
{
  query: 'company_name'
}
```
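For illustration, a hapi route for this endpoint could look roughly like the sketch below. The `/search` path and `searchService` are assumptions, not the project's actual names.

```js
const Hapi = require('@hapi/hapi');

// Hypothetical stand-in for the service that queries both sources.
const searchService = {
  async findCompanyPages(query) {
    return {
      linkedin: [`https://www.linkedin.com/company/${query}`],
      societe: [`https://www.societe.com/societe/${query}`],
    };
  },
};

const server = Hapi.server({ port: 3000 });

server.route({
  method: 'GET',
  path: '/search', // hypothetical path
  handler: (request) => searchService.findCompanyPages(request.query.query),
});

server.start().then(() => console.log(`Server running at ${server.info.uri}`));
```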
Returns company information scraped from the given company page URL(s).
- Parameters

```js
{
  linkedin: 'https://www.linkedin.com/company/company-name',
  societe: 'https://www.societe.com/societe/company-name'
}
```
At least 1 URL is required (linkedin, societe, or both).
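A hedged sketch of how that constraint could be enforced with Joi, hapi's usual validation companion: Joi's `.or()` rule fails unless at least one of the listed keys is present. The field names come from the parameters above; everything else is illustrative.

```js
const Joi = require('joi');

// Fails validation unless `linkedin`, `societe`, or both are present.
const schema = Joi.object({
  linkedin: Joi.string().uri(),
  societe: Joi.string().uri(),
}).or('linkedin', 'societe');

console.log(schema.validate({ linkedin: 'https://www.linkedin.com/company/acme' }).error); // undefined (valid)
console.log(schema.validate({}).error); // ValidationError: must contain at least one of [linkedin, societe]
```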
Scraping services use Puppeteer to extract data from company pages.
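To make that concrete, here is a minimal Puppeteer flow of the kind such a service performs; the selector and the field extracted are invented for the example.

```js
const puppeteer = require('puppeteer');

// Illustrative only: open the page and pull one piece of text out of it.
async function scrapeCompanyName(url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Evaluate inside the page context; 'h1' is a made-up selector.
    return await page.$eval('h1', (el) => el.textContent.trim());
  } finally {
    await browser.close();
  }
}
```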
Company information is persisted in a MongoDB collection named companies. Thus, when a company whose data was already scraped is requested again, its data is returned from the DB. A scheduled job cleans out old companies (it should run every day at midnight).
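As a sketch under stated assumptions, the lookup and the midnight cleanup could be wired with the official mongodb driver and node-cron. The companies collection name comes from the text above; the field names and the 30-day retention window are invented.

```js
const { MongoClient } = require('mongodb');
const cron = require('node-cron');

const client = new MongoClient(process.env.MONGODB_URI);

// Return previously scraped data instead of scraping again.
// The `linkedinUrl` field name is an assumption.
async function findCompany(linkedinUrl) {
  await client.connect(); // no-op if already connected
  return client.db().collection('companies').findOne({ linkedinUrl });
}

// '0 0 * * *' fires every day at midnight: delete companies scraped
// more than 30 days ago (`scrapedAt` and the window are assumptions).
cron.schedule('0 0 * * *', async () => {
  await client.connect();
  const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
  await client.db().collection('companies').deleteMany({ scrapedAt: { $lt: cutoff } });
});
```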
- Node v10+
- yarn
- A LinkedIn account: we need to be logged in to scrape company data from linkedin.com
- A MongoDB database: you can create one on MongoDB Atlas if necessary (see https://docs.mongodb.com/manual/tutorial/atlas-free-tier-setup/)
- A .env file at the project's root for environment variables
- Copy the .env.example file and rename it to .env
- Add LinkedIn credentials
- Add MongoDB connection string URI
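Once filled in, the .env could end up looking something like this; the variable names below are guesses, the authoritative list is in .env.example:

```
# Hypothetical variable names; check .env.example for the real ones
LINKEDIN_EMAIL=you@example.com
LINKEDIN_PASSWORD=your-password
MONGODB_URI=mongodb+srv://user:password@cluster0.mongodb.net/companies
```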
```bash
$ yarn install   # install dependencies
$ yarn serve     # start dev server with nodemon
$ yarn lint      # lint files
$ yarn lint:fix  # lint and fix files
$ yarn test      # run unit tests with Jest
```