Extract metadata from PDF #50

pronguen · 2019-07-10T06:55:28Z

No description provided.

sebdeleze · 2019-07-17T14:32:54Z

@pronguen You can find in folder documents/documentation/projets/SONAR/dev/test_grobid_extraction/ the outputs in JSON and XML for the files you provided.

I can improve the formatting of the JSON files, but I need to know which data are relevant first. Could we discuss that together?

sebdeleze · 2019-07-24T07:16:32Z

A test page for PDF metadata extraction is now available on the DEV website:

https://sonardev.test.rero.ch/pdf-extractor/test

You must be logged in to access this page (admin@sonar.ch - 123456).

Metadata extraction takes between 20 and 30 seconds.

pronguen · 2019-07-24T11:52:08Z

I completed the evaluation file (documents/documentation/projets/SONAR/dev/test_grobid_extraction/evaluation.ods). Main results:

1. Some PDFs don't generate any output (bug).
2. Some characters have encoding problems.
3. The extraction doesn't work on theses or dissertations. More generally, it works well almost only on postprints (articles), which are also the most common document type in SONAR.

For point 3, we can enable the extraction in the interface only for articles (we first need to ask which document type the user is going to add).
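The gating idea for point 3 could look roughly like the sketch below. The document-type labels and the function name are hypothetical and only serve as illustration; the actual SONAR vocabulary may differ.

```python
# Hypothetical document-type labels (assumption, not SONAR's real vocabulary).
EXTRACTABLE_TYPES = {"article", "postprint"}


def should_offer_extraction(document_type: str) -> bool:
    """Return True when automatic metadata extraction should be offered."""
    # Normalise the label so "Article" and " article " are treated alike.
    return document_type.strip().lower() in EXTRACTABLE_TYPES
```

The interface would then ask for the document type first and show the extraction step only when this check passes.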

In general, an absence of information in the extraction output is less problematic than incorrect information.

sebdeleze · 2019-07-24T12:01:53Z

Thanks for your feedback! Based on your evaluation file, do I need to make corrections or run additional checks now, or will we do that during the real implementation?

pronguen · 2019-07-24T12:37:09Z

For point 1, you could:

- have a look and try to understand why it didn't work for some PDFs,
- or at least make sure that the whole upload process works even if the metadata extraction does not.

For point 2: we should correct it, but I guess it is not a priority.
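The second option for point 1 (keep the upload working when extraction fails) might be sketched as follows. `upload_pdf`, `extract` and `store` are hypothetical names standing in for the real upload step, the Grobid call and the storage step. This is an assumption about the shape of the code, not the actual SONAR implementation.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def upload_pdf(
    pdf_bytes: bytes,
    extract: Callable[[bytes], dict],
    store: Callable[[bytes, Optional[dict]], str],
) -> str:
    """Store the PDF even when metadata extraction fails.

    `extract` and `store` are hypothetical callables (assumptions) for the
    extraction service call and the repository storage step.
    """
    metadata: Optional[dict] = None
    try:
        metadata = extract(pdf_bytes)
    except Exception:
        # Extraction is optional here: log the failure and continue
        # with no metadata rather than aborting the upload.
        logger.exception("Metadata extraction failed; continuing without it")
    return store(pdf_bytes, metadata)
```

Storing `None` instead of a partial guess is in line with the remark that missing information is less problematic than incorrect information.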

sebdeleze · 2019-07-25T06:59:14Z

Errors in parsing documents are not caused by Grobid, but by the JSON formatting I was using.
As the metadata structure can be significantly different from one file to another and therefore cause errors, I decided to remove this formatting and keep exactly the Grobid output. The result is more verbose and a little less readable, but there are no more errors and we will be sure that any problems come from Grobid and not from this formatting.
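To illustrate why a rigid formatting step can break when the structure varies, here is a minimal sketch of reading one field defensively from the TEI XML that Grobid returns. The function name is hypothetical, and the element path is an assumption based on the usual TEI header layout. It is not the formatting code that was removed.

```python
import xml.etree.ElementTree as ET
from typing import Optional

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_title(tei_xml: str) -> Optional[str]:
    """Return the main title from a TEI document, or None if it is absent."""
    root = ET.fromstring(tei_xml)
    # Assumed location of the title inside the TEI header.
    node = root.find(
        "./tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS
    )
    # A missing or empty element yields None instead of raising an error.
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()
```

Any later formatting layer would need this kind of check for every field, which is part of the trade-off against keeping the raw output.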

The result is available on the DEV website.

Are you OK with that?

pronguen · 2019-07-25T08:27:04Z

Thank you! Yes, ok!

pronguen added the new label Jul 10, 2019

sebdeleze added this to the Sprint 3 milestone Jul 15, 2019

sebdeleze closed this as completed Oct 8, 2019
