Re-implementation of PDFNLT/postprocess in Python.
This module is designed for Python3 (3.5 or later). You can install this module and dependent libraries just by one shot:
$ python3 setup.py install
Execute the postprocess
package with Python3 interpreter. You can specify one
or more XHTML files to process:
$ python3 -m postprocess XHTML ...
The outputs go to ./out
directory as default. To output the files to the
other places, use -o
option:
$ python3 -m postprocess -o DIR XHTML ...
Further options can be found with:
$ python3 -m postprocess -h
This project includes the following modules. Please see the docstrings in each moudle file for the details.
- Textualizer (
textualizer.py
) - FigureTagger (
figuretagger.py
) - MathTagger (
mathtagger.py
) - CiteDetector (
citedetector.py
)
The following files will be generated in the output DIR:
<input>.cite.tsv
: List of citations in TSV form<input>.math.tsv
: List of mathematical expressions in TSV form<input>.sent.tsv
: List of sentences in TSV form<input>.txt
: Plain text extracted from the input<input>.word.tsv
: List of words in TSV form<input>.xhtml
: The postprocessed XHTML with following attributes
data-sent-id
: sentence iddata-from
: start position of a worddata-to
: end position of a worddata-cite-id
: paragraph id corresponds to the reference paper in reference sectiondata-cite-end
: word id where the citation tokens enddata-cite-type
: citation type automatically determined by [Nanba, H., et al. 2000]
Citation types represent the reason for citation:- Type B: Citations that show other researchers' theories or methods for the theoretical basis
- Type C: Citations to point out the problems or gaps in related works
- Type O: Citations other than types B and C
data-cite-type-cue
: the citation types are determined based on this cue phrasesdata-math
: indicates in-line formulas. The range of a formula starts with "B-Math" and while "I-Math" continuesdata-figref
: indicator of figure references. Starts with aB-FIG
and continues during theI-FIG
sdata-figref-id
: indicates the figure which it refers to. Corresponds todata-fig
tagdata-figref-end
: the word spanid
of which the figure reference ends
Some modules in the postprocess package also provide independent CLI.
You can execute "tag" and "learn" operation. Try following for the details:
$ python3 -m postprocess.mathtagger -h
- Nanba, H., Kando, N., Okumura, M.: Classification of research papers using citation links and citation types: Towards automatic review article generation. (2000)