Extract metadata from PDF #50

Closed
pronguen opened this issue Jul 10, 2019 · 7 comments
Comments

@pronguen
Contributor

No description provided.

@pronguen pronguen added the new label Jul 10, 2019
@sebdeleze sebdeleze added this to the Sprint 3 milestone Jul 15, 2019
@sebdeleze
Contributor

@pronguen You can find the JSON and XML outputs for the files you provided in the folder documents/documentation/projets/SONAR/dev/test_grobid_extraction/.

I can improve the formatting of the JSON files, but I need to know which data are relevant first. Could we discuss that together?

@sebdeleze
Contributor

A test page for PDF metadata extraction is now available on the DEV website:

https://sonardev.test.rero.ch/pdf-extractor/test

You must be logged in to access this page (admin@sonar.ch - 123456).

Metadata extraction takes between 20 and 30 seconds.

@pronguen
Contributor Author

I completed the evaluation file (documents/documentation/projets/SONAR/dev/test_grobid_extraction/evaluation.ods). Main results:

  1. some PDFs don't generate any output (bug)
  2. encoding problems with some characters
  3. the extraction doesn't work on theses or dissertations. More generally, it works well almost exclusively on postprints (articles), which are also the most common document type in SONAR

For point 3, we can enable the extraction in the interface only for articles (we first need to ask which document type the user is going to add).

In general, an absence of information in the extraction output is less problematic than incorrect information.
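
The restriction proposed for point 3 could look roughly like the sketch below. It is only an illustration: the type names and the `should_offer_extraction` function are hypothetical and do not come from the SONAR code base.

```python
# Hypothetical set of document types for which extraction is offered.
# The actual type identifiers used by SONAR are not given in this thread.
EXTRACTION_ENABLED_TYPES = {"article", "postprint"}


def should_offer_extraction(document_type: str) -> bool:
    """Return True if PDF metadata extraction should be offered for this type."""
    return document_type.strip().lower() in EXTRACTION_ENABLED_TYPES
```

The interface would then ask for the document type first and call the extractor only when this check passes, so theses and dissertations simply get an empty form rather than incorrect values.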

@sebdeleze
Contributor

Thanks for your feedback! Based on your evaluation file, do I need to make any corrections or additional checks now, or will we do that during the real implementation?

@pronguen
Contributor Author

For point 1, you could

  • have a look and try to understand why it didn't work for some PDFs
  • or at least make sure that the whole upload process works even if the metadata extraction does not.
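
A minimal sketch of the second option, assuming the extractor and the storage step are passed in as plain callables. The names `upload_with_optional_metadata`, `extract` and `store` are hypothetical and are not taken from the SONAR code:

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def upload_with_optional_metadata(
    pdf_bytes: bytes,
    extract: Callable[[bytes], dict],
    store: Callable[[bytes, Optional[dict]], str],
) -> str:
    """Store the PDF even when metadata extraction raises an error."""
    try:
        metadata: Optional[dict] = extract(pdf_bytes)
    except Exception:
        # Log the failure for later analysis, then continue without metadata.
        logger.exception("Metadata extraction failed; uploading without it")
        metadata = None
    return store(pdf_bytes, metadata)
```

With this shape, a failing extraction leaves the user with an empty form to fill in by hand instead of a broken upload.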

For point 2: we should correct it, but I guess it is not a priority.

@sebdeleze
Contributor

The errors in parsing documents are not caused by Grobid, but by the JSON formatting I was using.
As the metadata structure can differ significantly from one file to another and therefore cause errors, I decided to remove this formatting and keep the Grobid output exactly as it is. The result is more verbose and a little less readable, but there are no more errors, and we can be sure that any remaining problems come from Grobid and not from this formatting.
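
For illustration only, a schema-agnostic conversion of Grobid's TEI XML output to JSON might look like the sketch below. This is not the code used on the DEV website (the thread does not show it); the function names and the `@attributes` / `#text` keys are assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET


def _local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.split("}", 1)[-1]


def element_to_dict(elem: ET.Element) -> dict:
    """Convert an element to a dict without assuming any fixed structure.

    Every child tag maps to a list, so repeated and missing elements
    never break the conversion. Tail text (mixed content) is ignored.
    """
    node: dict = {}
    if elem.attrib:
        node["@attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    for child in elem:
        node.setdefault(_local_name(child.tag), []).append(element_to_dict(child))
    return node


def xml_to_json(xml_string: str) -> str:
    """Serialize an XML document to JSON, keeping non-ASCII characters."""
    root = ET.fromstring(xml_string)
    return json.dumps({_local_name(root.tag): element_to_dict(root)}, ensure_ascii=False)
```

Because nothing here depends on a particular field being present, a file with an unusual structure produces a different JSON shape rather than an error, which is the trade-off described above.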

The result is available on the DEV website.

Are you OK with that?

@pronguen
Contributor Author

Thank you! Yes, ok!
