Extract metadata from PDF #50
Comments
@pronguen You can find the JSON and XML outputs for the files you provided in the folder documents/documentation/projets/SONAR/dev/test_grobid_extraction/. I can format the JSON files better, but I first need to know which data are relevant. Could we discuss that together?
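To make the formatting question concrete, here is a minimal sketch of how a Grobid TEI XML header could be reduced to a small JSON structure. The chosen fields (title, authors, abstract), the function name `tei_header_to_dict` and the sample document are assumptions for illustration only; this is not the formatter actually used in SONAR. Missing fields are returned as `None` or an empty list rather than guessed.

```python
import json
import xml.etree.ElementTree as ET

# TEI namespace used by Grobid's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written sample in the shape of a Grobid header result.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">Sample title</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName>
          <forename type="first">Ada</forename><surname>Lovelace</surname>
        </persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
    <profileDesc><abstract><p>Some   abstract.</p></abstract></profileDesc>
  </teiHeader>
</TEI>"""


def tei_header_to_dict(tei_xml: str) -> dict:
    """Keep only a few (assumed relevant) fields from a TEI header."""
    root = ET.fromstring(tei_xml)

    # Title: None when the element or its text is absent.
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = None
    if title_el is not None and title_el.text and title_el.text.strip():
        title = title_el.text.strip()

    # Authors: join the name parts found under each persName.
    authors = []
    source = root.find(".//tei:sourceDesc", TEI_NS)
    if source is not None:
        for pers in source.findall(".//tei:author/tei:persName", TEI_NS):
            parts = [el.text.strip() for el in pers if el.text and el.text.strip()]
            if parts:
                authors.append(" ".join(parts))

    # Abstract: collect all nested text and normalise whitespace.
    abstract = None
    abstract_el = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    if abstract_el is not None:
        text = " ".join("".join(abstract_el.itertext()).split())
        abstract = text or None

    return {"title": title, "authors": authors, "abstract": abstract}


if __name__ == "__main__":
    print(json.dumps(tei_header_to_dict(SAMPLE_TEI), indent=2))
```

Once we agree on the relevant data, the returned dictionary is the only part that would need to change.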
A test page for PDF metadata extraction is now available on the DEV website: https://sonardev.test.rero.ch/pdf-extractor/test You must be logged in to access this page (admin@sonar.ch - 123456). Metadata extraction takes between 20 and 30 seconds.
I completed the evaluation file (documents/documentation/projets/SONAR/dev/test_grobid_extraction/evaluation.ods). Main results:
For point 3, we can enable extraction in the interface only for articles (we first need to ask which document type the user is going to add). In general, an absence of information in the extraction output is less problematic than incorrect information.
Thanks for your feedback! Based on your evaluation file, do I need to make any corrections or additional checks now? Or will we do that during the actual implementation?
For point 1, you could
For point 2: we should correct it, but I guess it is not a priority.
The errors in parsing documents are not caused by Grobid, but by the JSON formatting I was using. The result is available on the DEV website. Are you OK with that?
Thank you! Yes, OK!