This repository has been archived by the owner on Jun 3, 2024. It is now read-only.

PDFBoT API

PDFBoT is a tool for accurately extracting the body text of articles in PDF format.

Publications

Please cite the following paper if you are using our tool. Thanks!

Environment

macOS with Python 3.6+ installed

Dependencies

  • pdf2htmlEX
  • bs4
  • flask
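Before launching the server, it can help to confirm that the three dependencies are available. The following is a minimal sketch, not part of PDFBoT itself: the helper name `check_dependencies` is hypothetical, and it assumes the pdf2htmlEX executable is installed on your PATH under the name `pdf2htmlEX`.

```python
import importlib.util
import shutil


def check_dependencies():
    """Return a dict mapping each dependency name to True if it was found.

    Hypothetical helper, not part of PDFBoT. Assumes the pdf2htmlEX binary
    is reachable on PATH as "pdf2htmlEX".
    """
    return {
        # pdf2htmlEX is an external command-line program, so look it up on PATH.
        "pdf2htmlEX": shutil.which("pdf2htmlEX") is not None,
        # bs4 and flask are Python packages, so check that they are importable.
        "bs4": importlib.util.find_spec("bs4") is not None,
        "flask": importlib.util.find_spec("flask") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Any entry reported as MISSING needs to be installed before running main.py.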

Usage

Command to run the PDFBoT API on a local computer

python3 main.py

Once the built-in server is launched, head over to http://127.0.0.1:5000/. Follow the instructions to upload a PDF file and extract its text.

Run the source code to extract the text

Step 1: Convert the PDF to HTML with the function "pdftohtml_test" in main.py (L72).

  • The input is the URL of the PDF document.
  • The output is the path of the corresponding HTML file.

Step 2: Extract the body text from the HTML file with the function "getTextFrom2HTML" in extractTextFrom2colHTML.py.

  • The input is the path of the HTML file.
  • The output is a list of strings; each string represents one paragraph of the article.
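The two steps above can be chained as in the sketch below. It is an illustration under stated assumptions rather than the project's own code: the wrapper name `extract_body_text` is hypothetical, and it assumes that each function takes a single positional argument and returns the value described above. The two functions are passed in as parameters so that the wrapper does not depend on how main.py behaves when imported.

```python
from typing import Callable, Iterable, List


def extract_body_text(
    pdf_url: str,
    convert: Callable[[str], str],
    extract: Callable[[str], Iterable[str]],
) -> List[str]:
    """Run the two-step pipeline and return the article's paragraphs.

    Hypothetical wrapper, not part of PDFBoT.
    convert: step 1, maps a PDF URL to the path of the generated HTML file.
    extract: step 2, maps an HTML path to the paragraphs of body text.
    """
    html_path = convert(pdf_url)      # step 1: PDF -> HTML
    paragraphs = extract(html_path)   # step 2: HTML -> paragraphs
    return list(paragraphs)


# Assumed usage with the functions named in this README (signatures are
# assumptions; the URL is a placeholder):
#
#   from main import pdftohtml_test
#   from extractTextFrom2colHTML import getTextFrom2HTML
#   paragraphs = extract_body_text(
#       "https://example.org/paper.pdf", pdftohtml_test, getTextFrom2HTML
#   )
#   for paragraph in paragraphs:
#       print(paragraph)
```

Note that importing `main` may have side effects, such as starting the Flask server, if the module does not guard its startup code; check main.py before using the import shown in the comment.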