Glitchy source text - correct, or re-extract? #4

tlongers · 2016-07-08T13:52:10Z

PDF-to-text extraction sometimes produces jumbled text, e.g.

Original source:

Markdown formatted preview:

Correct the issue when formatting in MD, or attempt re-extraction and hope for a better outcome?

igorbrigadir · 2016-07-08T15:49:50Z

@cr3ative has a good cleaned-up version at https://github.com/cr3ative/chilcot-html in case that's useful to compare (not sure how it was extracted though)

cr3ative · 2016-07-08T15:57:27Z

@igorbrigadir This is the output of all the PDFs concatenated then exported to HTML using Acrobat Pro CC. The index page has an extra search box pointed at GitHub, but that's about it.

igorbrigadir · 2016-07-09T14:30:59Z

Thanks! That might be a good pipeline - might be easier to clean up the HTML result rather than the extracted text from PDF - selecting elements (footers, headers, bullets) by style.
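A rough sketch of what selecting by style could look like, using only the Python standard library. The class names below (`header`, `footer`, `page-number`) are hypothetical placeholders, not what the Acrobat export actually emits; the real HTML would need to be inspected first to see which classes or inline styles mark the running headers and footers.

```python
from html.parser import HTMLParser

# Hypothetical class names: replace with whatever the exported HTML really uses.
DROP_CLASSES = {"header", "footer", "page-number"}
# Tags with no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
# Tags after which a line break is emitted in the text output.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4"}


class StyleFilter(HTMLParser):
    """Collect text, skipping any element whose class is in drop_classes."""

    def __init__(self, drop_classes):
        super().__init__()
        self.drop_classes = set(drop_classes)
        self.skip_depth = 0  # > 0 while inside a dropped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested tag inside a dropped element
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.drop_classes.intersection(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def extract_body_text(html, drop_classes=DROP_CLASSES):
    """Return the text of `html` with the dropped elements removed."""
    parser = StyleFilter(drop_classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

This assumes the export is reasonably well-formed HTML and that headers and footers carry a distinguishing class. If they are only distinguishable by inline `style` attributes, the same approach would apply with a check on `style` instead of `class`.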

rufuspollock · 2016-07-10T08:10:28Z

@igorbrigadir good point. Can you let us know how you get on? I've taken a quick look at the @cr3ative version but have not done an in-depth comparison vs simple text extraction.
