You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe.
At the moment we use HtmlDocPreprocessor to separately generate pre-processed documents that are fed into the parser.
If we want to extract visual features, we currently need corresponding pdf files for each input document. Fetching the pdf file path currently happens inside parser, which is initialized with a pdf_path argument. This couples the parser with input data generation. Furthermore, we can only test whether a matching pdf file exists, when the ParserUDF.apply() method is called, because we have no knowledge about the html input files before.
Describe the solution you'd like
Have a (separate) generator that handles generation and checking of the matching pdf file paths, which are fed into the parser.apply() method, e.g. parser.apply((doc,text), pdf_path, **kwargs).
Describe alternatives you've considered
Extend HtmlDocPreprocessor to return tuples of three values (doc,text,pdf_path), if a visual_linking_pdf_path is provided.
Additional context
One thing to consider is that there are also other ways of visual linking that would not require PDF files in the future.
The text was updated successfully, but these errors were encountered:
Is your feature request related to a problem? Please describe.
At the moment we use
HtmlDocPreprocessor
to separately generate pre-processed documents that are fed into the parser.If we want to extract visual features, we currently need corresponding pdf files for each input document. Fetching the pdf file path currently happens inside parser, which is initialized with a
pdf_path
argument. This couples the parser with input data generation. Furthermore, we can only test whether a matching pdf file exists, when the ParserUDF.apply() method is called, because we have no knowledge about the html input files before.Describe the solution you'd like
Have a (separate) generator that handles generation and checking of the matching pdf file paths, which are fed into the
parser.apply()
method, e.g.parser.apply((doc,text), pdf_path, **kwargs)
.Describe alternatives you've considered
Extend HtmlDocPreprocessor to return tuples of three values (doc,text,pdf_path), if a
visual_linking_pdf_path
is provided.Additional context
One thing to consider is that there are also other ways of visual linking that would not require PDF files in the future.
The text was updated successfully, but these errors were encountered: