Generating/checking matching pdf file paths for visual linking outside of parser #134

j-rausch · 2018-09-02T00:08:05Z

Is your feature request related to a problem? Please describe.
At the moment we use HtmlDocPreprocessor to separately generate pre-processed documents that are fed into the parser.

If we want to extract visual features, we currently need a corresponding PDF file for each input document. The PDF file path is currently fetched inside the parser, which is initialized with a pdf_path argument. This couples the parser to input data generation. Furthermore, we can only test whether a matching PDF file exists once the ParserUDF.apply() method is called, because we know nothing about the HTML input files before that point.

Describe the solution you'd like
Have a separate generator that handles generating and checking the matching PDF file paths, which are then passed to the parser.apply() method, e.g. parser.apply((doc, text), pdf_path, **kwargs).
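
A minimal sketch of what such a generator could look like, using only the standard library. The names `find_missing_pdfs` and `match_pdf_paths` are hypothetical, and so is the assumption that each document is a `(doc_name, text)` pair whose PDF is stored as `<doc_name>.pdf` in a single directory. None of this is existing Fonduer API.

```python
import os
from typing import Iterable, Iterator, List, Tuple


def find_missing_pdfs(doc_names: Iterable[str], pdf_dir: str) -> List[str]:
    """Return the document names that have no matching PDF in pdf_dir."""
    return [
        name
        for name in doc_names
        if not os.path.isfile(os.path.join(pdf_dir, name + ".pdf"))
    ]


def match_pdf_paths(
    docs: Iterable[Tuple[str, str]], pdf_dir: str
) -> Iterator[Tuple[Tuple[str, str], str]]:
    """Yield ((doc_name, text), pdf_path) pairs for the parser to consume.

    Assumption: the PDF for a document is named <doc_name>.pdf.
    Raises FileNotFoundError when a document has no matching PDF.
    """
    for doc_name, text in docs:
        pdf_path = os.path.join(pdf_dir, doc_name + ".pdf")
        if not os.path.isfile(pdf_path):
            raise FileNotFoundError(
                f"No matching PDF for document {doc_name!r}: {pdf_path}"
            )
        yield (doc_name, text), pdf_path
```

With this split, the existence check can run over all document names before any parsing starts, and the parser itself would no longer need to know where the PDFs live.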

Describe alternatives you've considered
Extend HtmlDocPreprocessor to return three-value tuples (doc, text, pdf_path) if a visual_linking_pdf_path is provided.
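
For comparison, this alternative could be sketched as a thin wrapper around whatever the preprocessor yields. The function name `with_pdf_paths`, the `(doc_name, text)` input shape, and the `<doc_name>.pdf` naming rule are assumptions made for illustration, not HtmlDocPreprocessor's actual interface.

```python
import os
from typing import Iterable, Iterator, Optional, Tuple


def with_pdf_paths(
    docs: Iterable[Tuple[str, str]],
    visual_linking_pdf_path: Optional[str] = None,
) -> Iterator[tuple]:
    """Yield (doc_name, text) pairs, or (doc_name, text, pdf_path) triples
    when a visual_linking_pdf_path directory is provided."""
    for doc_name, text in docs:
        if visual_linking_pdf_path is None:
            # No visual linking requested: keep the two-value shape.
            yield doc_name, text
        else:
            # Assumption: the PDF is named <doc_name>.pdf in that directory.
            pdf_path = os.path.join(visual_linking_pdf_path, doc_name + ".pdf")
            yield doc_name, text, pdf_path
```

The drawback of this shape is that consumers have to handle two tuple lengths, which is one argument for keeping path generation in a separate generator.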

Additional context
One thing to consider is that, in the future, there may be other ways of visual linking that do not require PDF files.

lukehsiao added the "enhancement" and "help wanted" labels on Sep 5, 2018
