Dataset (metadata) Title with '&' (ampersand) in the text causes 'Publish' to fail. #3845
Comments
@mdmADA thanks for the detailed bug report! |
It looks like Thanks for the line numbers, @mdmADA . |
Still DV version 4.6.1. The characters "&" and ";" in the Metadata Description cause publish to fail, so they should probably be avoided in the metadata completely. Having "&" in the Description field gave the usual "Datacite unavailable" error in the UI and the following error in server.log (different from the one in the original issue): Caused by: java.lang.RuntimeException: Response code: 400, [xml] xml error: cvc-complex-type.2.4.a: Invalid content was found starting with element 'p'. One of '{"http://datacite.org/schema/kernel-3":br}' is expected. Interestingly enough, when I validated the xml (as written to the doidataciteregistercache table) against the external schemas, it passed. I guessed the "element p" in the server.log error message referred to the "p" in "& amp;" (I had to add the space so the amp; part would show up). Perhaps it is actually the ";" causing the issue, however, since we had to change all of the ";" in the Description text to "," to get it to publish. I am not sure if I should create a new issue or edit the title of this one to reflect that it is not just "&" and not just the title where these issues occur... let me know and I will do as advised. Thanks! |
DV 4.6.1. Running into the same problem with < br >, < p >, etc. markup in the Metadata description causing publish to fail. It seems to boil down to this: the valid HTML required throughout the UI is being sent, incorrectly, as if it were valid XML to the Datacite DOI URL. In the DOIDataciteRegisterService createIdentifier() method, it calls: The DatasetVersion getDescription() method calls the MarkupChecker.sanitizeBasicHTML() method. This makes sure that the description text is valid HTML and converts < br >< /br > to < br >< br >. < br >< br > is valid HTML but not XML, as XML requires the closing tag. I believe that a bare & is not allowed in XML either, so it needs to be escaped before sending the XML to Datacite. I am sure that whoever is assigned the bug fix can figure this out but maybe my own investigations can assist... |
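To illustrate the escaping mentioned above, here is a minimal sketch of turning raw text into XML-safe character data before it is placed into a metadata document. `XmlEscaper` is a hypothetical helper written for this example; it is not a Dataverse class and not necessarily how the eventual fix works.

```java
// Hypothetical helper (not part of Dataverse): escapes raw text so it can be
// embedded as XML character data or inside an attribute value.
final class XmlEscaper {
    private XmlEscaper() {}

    /** Replaces the five XML special characters with their predefined entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, `XmlEscaper.escape("Surveys & Polls")` yields `Surveys &amp; Polls`. The helper assumes raw, unescaped input and should be applied exactly once; text that already contains `&amp;` would be escaped a second time.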
Hi Phil. I believe jsoup is behaving as it should (no bugs) in that it properly sanitizes text input to valid html. The issue is that this html is being sent to Datacite as part of the XML in the postMetadata() method for DOI minting.
=> The description is saved to the datasetfieldvalue table with the select value from datasetfieldvalue where value like '%
paper for conference plus materials
The |
@mdmADA yes, this is helping. Thanks! 😄 |
@mdmADA I started working on this issue and pushed some scratch work to pdurbin@4ed4c00 if you're interested. You're definitely right about how Dataverse sometimes sends XML that's not well formed to DataCite. I'll keep updating this issue with my progress. |
I gave @rbhatta99 a brain dump this morning and just pushed a branch called @mdmADA neither one of us is able to reproduce the "& in the title" bug. I'm not sure why. We definitely can exercise the " Along the way, I discovered that while using DataCite rather than EZID, I can't publish a dataset created via SWORD because @matthew-a-dunlap @rbhatta99 and I observed that descriptions of datasets aren't even being shown at https://search.datacite.org/works/10.7910/dvn/eiwf4p, so part of me wonders if an easy fix would be to always send an empty string. Does DataCite do anything with descriptions? Maybe @jggautier or @pameyer would know. I'll keep hacking away but I wanted to give an update and get that branch pushed. |
as of commit dd55c08 on develop (v 4.7.1), a dataset still gets published with an & in the title. |
Which version of the DataCite XML schema are we sending to DataCite's API? From a quick check (edit test file and re-run validator), v3.1 doesn't support html tags in the description. |
We're using 3.1. Both 3.1 and the newest version, 4.0, pretty strongly recommend including the description (although I also noticed that dataset descriptions aren't displayed on search.datacite). Edit: Some dataset descriptions aren't displayed on search.datacite, but this one is: https://search.datacite.org/works/10.6084/m9.figshare.4223907, and the Datacite.xml you can export from that page includes the dataset description, whereas the Datacite.xml export on this page, https://search.datacite.org/works/10.7910/dvn/eiwf4p, doesn't. If we plan on using DataCite's service to add datacite metadata in schema.org/JSON-LD format to dataset pages (#3793), then finding a way to send DataCite the description would help even more. |
Maybe there is a difference between 4.6.1 and 4.7.1 (I am not sure when we will move to that version) with the & in the title? I am using this fake dataset to test: https://dataverse-dev.ada.edu.au/dataset.xhtml?persistentId=doi:10.5072/82/AH5ZIW When I add & to the title (and change nothing else in the metadata), this is the xml attempted to send to Datacite:
This is the error in server.log: [2017-08-14T14:19:56.645+1000] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.DataCiteRESTfullClient] [tid: _ThreadID=50 _ThreadName=jk-connector(1)] [timeMillis: 150268 2017-08-14T14:19:56.646+1000] [glassfish 4.1] [WARNING] [AS-EJB-00056] [javax.enterprise.ejb.container] [tid: _ThreadID=50 _ThreadName=jk-connector(1)] [timeMillis: 1502684396 . Caused by: java.lang.RuntimeException: Response code: 400, [xml] xml error: The entity name must immediately follow the '&' in the entity reference. If I make the simple change of `&` to 'and', it publishes. Not sure what the difference is... |
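The 400 response quoted above is a well-formedness failure, which can be reproduced locally without contacting DataCite. The sketch below uses only the JDK's built-in XML parser. `WellFormedCheck` and the sample strings are made up for this example, and the check covers well-formedness only, not validation against the DataCite schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Dataverse): reports whether a string parses
// as well-formed XML.
final class WellFormedCheck {
    private WellFormedCheck() {}

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Covers a bare '&' as well as unclosed tags such as <br>.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser unavailable", e);
        }
    }
}
```

With this helper, `isWellFormed("<title>Surveys & Polls</title>")` is false, while the same title written with `&amp;` or with the word "and" parses cleanly, which matches the behaviour described in this comment.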
I just created pull request #4075 and am moving this issue to code review. @jggautier seemed ok with sending plain text to DataCite so we're stripping out HTML tags (he started a thread at https://groups.google.com/d/msg/datacite-metadata/Di5TSstfafU/Zki5n44CAgAJ ). (Thanks to @rbhatta99 we do have code ready to go in 64e1ee4 to escape the HTML tags instead if that's what's desired.) |
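As a rough illustration of the "strip the tags, send plain text" approach, here is a deliberately naive sketch using only the JDK. It is not the code from the pull request; `PlainText` is a hypothetical name, and a real implementation would more likely rely on an HTML parser such as jsoup than on a regular expression.

```java
// Hypothetical helper (not the code from the pull request): reduces a sanitized
// HTML fragment to plain text by dropping anything that looks like a tag.
final class PlainText {
    private PlainText() {}

    static String stripTags(String html) {
        // Replace each tag with a space so words separated only by markup stay apart.
        String withoutTags = html.replaceAll("<[^>]*>", " ");
        // Collapse the whitespace runs left behind and trim the ends.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}
```

The regex treats any `<...>` run as a tag, so it would also swallow literal angle brackets in the text, and it leaves character references such as `&nbsp;` untouched. The stripped result would therefore still need to be checked or escaped before it goes into the XML.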
Hi @mdmADA and @philippconzett, does stripping all html from the description metadata sent to DataCite work for ADA? |
In 2b466e7 and 47feb73 I addressed code review from @scolapasta . Moving to QA. |
In (delayed) reply to @jggautier about stripping out all HTML from the description to send to DataCite, I think that is perfectly acceptable for ADA... Thanks! |
@kcondon here's the template I mentioned we use when sending XML to DataCite:
(From If you run Like I said, the fix is to strip out HTML from the description before it's inserted into the template above. While we were in there, we also now strip out HTML from the description when inserting it into the "meta" tags that Zotero and other tools consume (#1393). The "meta" code/template looks like this:
Hope this helps. |
A dataset title containing the character '&' gives the UI error "This dataset may not be published because the DataCite Service is currently inaccessible. Please try again. Does the issue continue to persist? Please contact Dataverse Support for assistance."
However, the logfile shows the following xml error:
Response code: 400, [xml] xml error: The entity name must immediately follow the '&' in the entity reference.
at edu.harvard.iq.dataverse.DataCiteRESTfullClient.postMetadata(DataCiteRESTfullClient.java:183)
at edu.harvard.iq.dataverse.DOIDataCiteRegisterService.createIdentifier(DOIDataCiteRegisterService.java:89)
A colleague had tried to publish a dataset with '&' in the title and got this error.
I took a guess and replaced the & with 'and' and Publish succeeded. So the '&' is definitely the issue.
Adding a '&' to the subtitle metadata field did not cause an error so I am not sure if the issue applies only to the title or to other fields as well.
A more informative error message for the user would be nice to have (this applies to all DOI-related errors since the user only sees the error message listed above which gives no indication of the actual underlying problem).
Or escaping problematic characters in forming the xml for the DOI minting...
-M.