Glitchy source text - correct, or re-extract? #4

Open
tlongers opened this issue Jul 8, 2016 · 4 comments

@tlongers

tlongers commented Jul 8, 2016

PDF-to-text extraction sometimes produces jumbled text, e.g.:

Original source:
[Screenshot: the-report-of-the-iraq-inquiry_introduction_pdf__page_2_of_19_]

Markdown formatted preview:

[Screenshot: introduction_md__80__]

Correct the issue when formatting in MD, or attempt re-extraction and hope for a better outcome?

@igorbrigadir

@cr3ative has a good cleaned-up version at https://github.com/cr3ative/chilcot-html in case that's useful to compare (not sure how it was extracted, though).

@cr3ative

cr3ative commented Jul 8, 2016

@igorbrigadir This is the output of all the PDFs concatenated then exported to HTML using Acrobat Pro CC. The index page has an extra search box pointed at GitHub, but that's about it.

@igorbrigadir

Thanks! That might be a good pipeline: it could be easier to clean up the HTML result than the text extracted from the PDF, by selecting elements (footers, headers, bullets) by style.
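
A rough sketch of what "selecting elements by style" could look like, using only the Python standard library. The class names `page-header` and `page-footer` are made up for illustration; the real Acrobat export would need inspecting to see which classes or inline styles it actually uses. It also assumes reasonably well-formed markup (every non-void tag is closed).

```python
from html.parser import HTMLParser

# Hypothetical class names: inspect the real export to find the actual ones.
DROP_CLASSES = {"page-header", "page-footer"}

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class StyleFilter(HTMLParser):
    """Collect text, skipping any element whose class is in DROP_CLASSES."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Start skipping at a dropped element, and track nesting inside it.
        if self.skip_depth or classes & DROP_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_body_text(html):
    """Return the text of `html` with header/footer elements removed."""
    parser = StyleFilter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

For example, feeding it a page fragment with a header `div`, a paragraph, and a footer `div` should return only the paragraph text. The same idea could be extended to bullets by mapping a list-item class to a Markdown `- ` prefix instead of dropping it.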

@rufuspollock

@igorbrigadir good point. Can you let us know how you get on? I've taken a quick look at the @cr3ative version but haven't done an in-depth comparison against simple text extraction.
