Extract all Text Objects #83

Closed
Banguiskode opened this issue Dec 22, 2019 · 2 comments

Comments

@Banguiskode

Hi!
First of all, thank you for this great tool.
I have a question:
I would like to extract a table (or a list) of the text objects together with their properties. Is that possible?
Thanks

@sambitdash
Owner

sambitdash commented Dec 22, 2019

@Banguiskode thank you for your interest in the library. Your expectations are captured as enhancements #2, #7, #11 and #17.

The PDF specification has no simple mechanism for marking tabular structures as tables, so tables have to be reconstructed by post-processing the text positions extracted from the file. PDFIO does not provide an explicit API for this, but pdPageEvaluate can be extended to extract the text data and their positions. Tagged PDF does support describing tabular structure, but only a small fraction of the PDF files in circulation actually implement that part of the specification to any great extent. If you would like to contribute to PDFIO by implementing any of these features, we will be happy to accept PRs.
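
To illustrate the kind of post-processing meant here, below is a minimal sketch in Python (not PDFIO code) that groups text fragments into table rows by their positions. The `TextFragment` type, the `group_into_rows` helper, and the `y_tolerance` value are all hypothetical names chosen for this example; it assumes you have already obtained each fragment's text and its (x, y) origin by some other means, and that y grows upward as in PDF user space.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextFragment:
    """Hypothetical record: one piece of text and its origin on the page."""
    text: str
    x: float
    y: float


def group_into_rows(fragments: List[TextFragment],
                    y_tolerance: float = 2.0) -> List[List[str]]:
    """Group fragments whose baselines are within y_tolerance into rows.

    Rows are returned top to bottom; cells within a row left to right.
    """
    rows: List[List[TextFragment]] = []
    # PDF y coordinates grow upward, so descending y is top-to-bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        # Compare against the first fragment of the current row.
        if rows and abs(rows[-1][0].y - frag.y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


# Made-up sample data, only to show the shape of the input and output.
sample = [
    TextFragment("Qty", 200.0, 700.5),
    TextFragment("Name", 50.0, 700.0),
    TextFragment("3", 200.0, 679.2),
    TextFragment("Apple", 50.0, 680.0),
]
table = group_into_rows(sample)
print(table)
```

Real documents usually need more than this (column detection from x positions, merged cells, rotated text), so treat it as a starting point under the stated assumptions.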

Since the intent of this issue is already captured by the other issues, I will close it with this comment.

@Banguiskode
Author

Thank you very much for your answer!
