Problematic Medline Document: 23700993 #10
Department of Medical Informatics and Biostatistics, Iuliu Haieganu University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, 400349 Cluj, Romania. |
The issue is as follows: We use VTD-XML to import Medline XML into the database. Since VTD cannot deal with Unicode supplementary characters properly, the resulting XML contains invalid characters (control sequences or similar). Supplementary characters start at code point U+10000 and therefore need more than 16 bits, so in UTF-16 they are represented as surrogate pairs of two 16-bit code units each. VTD only uses the lower 16 bits. |
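To make the failure mode concrete, here is a minimal Java sketch. It is not VTD-XML code: the class `SurrogateDemo` and the method `truncateTo16Bits` are hypothetical names that only imitate the "lower 16 bits" behavior described above, and the two code points are examples chosen for illustration. It compares the correct surrogate-pair encoding from `Character.toChars` with a plain 16-bit truncation.

```java
public class SurrogateDemo {

    // Correct handling: a supplementary code point becomes a surrogate pair
    // (two 16-bit UTF-16 code units); a BMP code point stays a single unit.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Assumed faulty handling (illustration only, not taken from VTD-XML):
    // everything above the lower 16 bits is silently dropped.
    static char truncateTo16Bits(int codePoint) {
        return (char) (codePoint & 0xFFFF);
    }

    public static void main(String[] args) {
        int[] examples = {0x1D11E, 0x10005}; // two supplementary code points
        for (int cp : examples) {
            char[] units = toUtf16(cp);
            System.out.printf("U+%X: surrogates %04X %04X, truncated %04X%n",
                    cp, (int) units[0], (int) units[1], (int) truncateTo16Bits(cp));
        }
    }
}
```

Under this assumption, U+1D11E is truncated to U+D11E, an unrelated character, and U+10005 is truncated to U+0005, a control character that is not allowed in XML 1.0. That would be consistent with the invalid characters seen in the imported documents.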
We currently have an internal VTD-XML version, which I put together following the instructions of the VTD-XML author. With it, the Unicode JUnit test written to demonstrate the wrong behavior passes. I built a new version of julie-medline-manager against this VTD-XML version and started re-importing the Medline XML from scratch. All pipelines should be updated to this version. For julie-medline-manager:
If julie-xml-tools is used directly, then:
We then have to check manually whether everything is fine for the documents in question. |
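The manual check could be supported by a small scan for characters that XML 1.0 does not allow. The following is only a sketch: `XmlCharCheck` and its methods are hypothetical helpers, not part of julie-xml-tools or VTD-XML. The ranges are those of the `Char` production in the XML 1.0 specification.

```java
public class XmlCharCheck {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index (in UTF-16 code units) of the first invalid
    // character, or -1 if the whole text is valid XML 1.0 content.
    static int firstInvalidIndex(String text) {
        int i = 0;
        while (i < text.length()) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate is returned as-is and is rejected above.
            int cp = text.codePointAt(i);
            if (!isValidXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Running `firstInvalidIndex` over the stored text of a suspicious document, for example the one from this issue, would point to the first control character or unpaired surrogate, if there is one.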
@khituras What is the status here? |
The newest version of VTD-XML includes the fix. This issue is fixed. |