@igorbrigadir This is the output of all the PDFs concatenated then exported to HTML using Acrobat Pro CC. The index page has an extra search box pointed at GitHub, but that's about it.
Thanks! That might be a good pipeline - might be easier to clean up the HTML result rather than the extracted text from PDF - selecting elements (footers, headers, bullets) by style.
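For what it's worth, here is a rough sketch of what selecting elements by style could look like, using only Python's standard `html.parser`. The class names in `DROP_CLASSES` are hypothetical; the actual names Acrobat emits would have to be read off the exported HTML. It also assumes the markup is well nested (every non-void tag is closed), which may not hold for the real export.

```python
from html.parser import HTMLParser

# Hypothetical class names; inspect the real export to find the actual ones.
DROP_CLASSES = {"header", "footer"}


class StyleFilter(HTMLParser):
    """Collect text, skipping any element whose class is in DROP_CLASSES."""

    # Void elements have no closing tag, so they must not affect nesting depth.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Once inside a dropped element, count every nested tag as well.
        if self.skip_depth or classes & DROP_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of `html`, one chunk per line."""
    parser = StyleFilter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `extract_text(...)` on a page keeps the body paragraphs and drops anything nested inside an element carrying one of the listed classes. Bullets could be handled the same way, by mapping a class to a Markdown prefix instead of dropping it.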
@igorbrigadir good point. Can you let us know how you get on? I've taken a quick look at the @cr3ative version but not done an in-depth comparison vs simple text extraction.
PDF-to-text extraction sometimes produces jumbled text, e.g.:
Original source:
Markdown formatted preview:
Correct the issue when formatting in MD, or attempt re-extraction and hope for a better outcome?