papers_crawler

Crawl paper information (title, abstract, and DOI) from

  1. ScienceDirect
  2. MDPI

and store it in MongoDB.

Check out item.py for the details of the scraped item structure.
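For orientation, a minimal sketch of that structure, assuming the title/abstract/DOI fields described in this README; item.py in the repo is the authoritative definition, and the class name below is hypothetical.

```python
# Minimal sketch of the scraped item structure (hypothetical class name);
# see item.py in the repo for the authoritative definition.
import scrapy

class PaperItem(scrapy.Item):
    title = scrapy.Field()     # paper title
    abstract = scrapy.Field()  # paper abstract (may be null for some papers)
    doi = scrapy.Field()       # digital object identifier
```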

requirements

  1. Docker
  2. Python 3.7

run steps

  1. docker pull mongo:4.2.3
  2. docker run -p 27017:27017 -d --name mongo mongo:4.2.3
  3. run either of the methods below to install dependencies
  • pip install -r requirement.txt
  • pipenv install --python 3.7, then pipenv shell
  4. run the crawler: scrapy crawl sciencedirect --loglevel=INFO, where sciencedirect is a spider name
  • run scrapy list to display the available spiders
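Under the hood, scraped items reach MongoDB through a Scrapy item pipeline. A minimal sketch of such a pipeline, assuming the localhost connection above and the items collection queried later in this README (the database name and the repo's actual pipeline may differ):

```python
# Minimal sketch of a MongoDB storage pipeline; the repo's actual
# pipeline and database name may differ.
import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["papers"]["items"]  # "papers" is a hypothetical db name

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```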

visualize data

I recommend Robo 3T for querying the data; otherwise the mongo shell works.

  1. connect to the server
  2. db.getCollection('items').find({'abstract': {$ne: null}})
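The same query can be run from Python with pymongo, assuming a database named papers (adjust to the actual name used by the crawler):

```python
# Equivalent query via pymongo; "papers" is a hypothetical database name.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
items = client["papers"]["items"]

# Find all items whose abstract is present (non-null).
for item in items.find({"abstract": {"$ne": None}}):
    print(item.get("title"), item.get("doi"))
```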

notes

  1. visit https://docs.scrapy.org/en/latest/topics/jobs.html for how to pause and resume crawls
  • e.g. scrapy crawl sciencedirect -s JOBDIR=crawl_jobs/sciencedirect
  2. you should expose the data files inside the Docker container to the host machine
  • first pick a folder on the host, for example /Users/bryan/workplace/papers_crawler/mongodb
  • run docker run -p 27017:27017 -d -v /Users/bryan/workplace/papers_crawler/mongodb:/data/db/ --name mongo mongo:4.2.3
