F1000Scraper

F1000Research is an open-access publishing platform. It provides an API for retrieving the XML or PDF of articles published in F1000Research. F1000Scraper is a Python wrapper for scraping these articles as XML and parsing the XML.

Usage

Collecting data using start and end date of the articles

Currently, the only functionality this wrapper provides is collecting data by date range, using the date option in the API. After downloading the files, run scrape.py from the api directory as follows:

python3 scrape.py <date_from> <date_to> <output_directory_path> <output_format> <keyword in the title (optional)>

where

  • date_from is the start date, either a date of the form "dd-mm-yyyy" or just "*".
  • date_to is the end date, either a date of the form "dd-mm-yyyy" or just "*".
  • output_directory_path is the path where the data files should be saved.
  • output_format must be either xml or pdf.
  • keyword is an optional argument. When given, only articles within the date range whose title contains the keyword are downloaded.
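The argument rules above can be checked before any download starts. The following is a minimal sketch, not the code in scrape.py: the names `parse_date_arg` and `check_output_format` are hypothetical, and returning `None` for "*" is an assumption made here for illustration.

```python
from datetime import date, datetime
from typing import Optional

DATE_FORMAT = "%d-%m-%Y"          # dd-mm-yyyy, the form described above
ALLOWED_FORMATS = {"xml", "pdf"}  # the two documented output formats


def parse_date_arg(value: str) -> Optional[date]:
    """Return a date for a "dd-mm-yyyy" string, or None for the "*" wildcard.

    Hypothetical helper: mapping "*" to None is an assumption of this sketch.
    """
    if value == "*":
        return None
    # strptime raises ValueError when the text does not fit the format
    return datetime.strptime(value, DATE_FORMAT).date()


def check_output_format(value: str) -> str:
    """Return the format unchanged if it is xml or pdf, else raise ValueError."""
    if value not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be xml or pdf, got {value!r}")
    return value
```

A caller could run both helpers on the command-line arguments and stop with an error message if either raises `ValueError`.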

Example 1

python3 scrape.py 01-01-2019 01-01-2020 data/ xml

The above command will download articles in XML format from 1st January 2019 to 1st January 2020, and save them to the data folder in the current directory.

Example 2

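python3 scrape.py 01-01-2019 "*" data/ pdf

The above command will download articles in PDF format from 1st January 2019 to today's date, and save them to the data folder in the current directory. The asterisk is quoted because most shells would otherwise expand an unquoted * into the file names in the current directory.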
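When the script is launched from another Python program, passing the arguments as a list avoids shell expansion of "*" altogether. The sketch below assumes scrape.py sits in a directory named api, as described in the Usage section; `build_command` and `run_scrape` are hypothetical helper names, not part of F1000Scraper.

```python
import subprocess
from typing import List, Optional


def build_command(date_from: str, date_to: str, output_dir: str,
                  output_format: str, keyword: Optional[str] = None) -> List[str]:
    """Assemble the scrape.py argument list in the documented order."""
    cmd = ["python3", "scrape.py", date_from, date_to, output_dir, output_format]
    if keyword is not None:
        cmd.append(keyword)  # optional title keyword goes last
    return cmd


def run_scrape(cmd: List[str], api_dir: str = "api") -> None:
    """Run the command from the api directory (assumed location of scrape.py).

    No shell is involved, so "*" reaches the script unchanged.
    """
    subprocess.run(cmd, check=True, cwd=api_dir)
```

For example, `run_scrape(build_command("01-01-2019", "*", "data/", "pdf"))` would correspond to Example 2.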

Disclaimer

This is a work in progress.

Contributors

  • Shahan Ali Memon
  • Bedoor AlShebli