Dataset (metadata) Title with '&' (ampersand) in the text causes 'Publish' to fail. #3845

Closed
mdmADA opened this issue May 22, 2017 · 18 comments

Comments

@mdmADA
Contributor

mdmADA commented May 22, 2017

A dataset title containing the character '&' gives the UI error "This dataset may not be published because the DataCite Service is currently inaccessible. Please try again. Does the issue continue to persist? Please contact Dataverse Support for assistance."

However, the logfile shows the following xml error:

Response code: 400, [xml] xml error: The entity name must immediately follow the '&' in the entity reference.
at edu.harvard.iq.dataverse.DataCiteRESTfullClient.postMetadata(DataCiteRESTfullClient.java:183)
at edu.harvard.iq.dataverse.DOIDataCiteRegisterService.createIdentifier(DOIDataCiteRegisterService.java:89)

A colleague had tried to publish a dataset with '&' in the title and got this error.
I took a guess and replaced the & with 'and' and Publish succeeded. So the '&' is definitely the issue.

Adding a '&' to the subtitle metadata field did not cause an error so I am not sure if the issue applies only to the title or to other fields as well.

A more informative error message for the user would be nice to have (this applies to all DOI-related errors since the user only sees the error message listed above which gives no indication of the actual underlying problem).

Or escaping problematic characters in forming the xml for the DOI minting...

-M.

@pdurbin
Member

pdurbin commented May 22, 2017

@mdmADA thanks for the detailed bug report!

@pdurbin
Member

pdurbin commented May 22, 2017

It looks like retString = client.postMetadata(xmlMetadata) at https://github.com/IQSS/dataverse/blob/v4.6.1/src/main/java/edu/harvard/iq/dataverse/DOIDataCiteRegisterService.java#L89 is getting a RuntimeException because a "201" response code is not coming back from DataCite here: https://github.com/IQSS/dataverse/blob/v4.6.1/src/main/java/edu/harvard/iq/dataverse/DataCiteRESTfullClient.java#L183

Thanks for the line numbers, @mdmADA .
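On the request for a more informative message: a minimal sketch of how the response code and body could be turned into something the user can act on. `DoiErrorMessages` and `userMessageFor` are hypothetical names, not existing Dataverse code. The only facts taken from this thread are that 201 is the success code and that the malformed XML came back as a 400 with the reason in the body; treating 5xx as "service inaccessible" is an assumption.

```java
/**
 * Hypothetical helper (not Dataverse code): maps a DataCite HTTP status
 * and response body to a message that hints at the real cause.
 */
final class DoiErrorMessages {

    private DoiErrorMessages() {
    }

    static String userMessageFor(int statusCode, String responseBody) {
        String body = responseBody == null ? "" : responseBody.trim();
        if (statusCode == 201) {
            return "Metadata registered.";
        }
        if (statusCode == 400) {
            // In this thread, a 400 meant DataCite rejected the XML itself.
            return "DataCite rejected the dataset metadata: " + body;
        }
        if (statusCode >= 500) {
            // Assumption: only server-side errors mean the service is down.
            return "The DataCite service is currently inaccessible. Please try again.";
        }
        return "DataCite returned an unexpected response (" + statusCode + "): " + body;
    }

    public static void main(String[] args) {
        System.out.println(userMessageFor(400,
                "[xml] xml error: The entity name must immediately follow the '&' in the entity reference."));
    }
}
```

With something like this, the ampersand case would no longer be reported as an outage.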

@pdurbin pdurbin added the User Role: Curator Curates and reviews datasets, manages permissions label Jul 4, 2017
@mdmADA
Contributor Author

mdmADA commented Jul 26, 2017

Still DV version 4.6.1.

The characters "&" and ";" in the Metadata Description cause publish to fail so they should probably be avoided in the metadata completely.

Having "&" in the Description field gave the usual "Datacite unavailable" error in the UI and the following error in server.log (different from the one in the original issue):

Caused by: java.lang.RuntimeException: Response code: 400, [xml] xml error: cvc-complex-type.2.4.a: Invalid content was found starting with element 'p'. One of '{"http://datacite.org/schema/kernel-3":br}' is expected.
at edu.harvard.iq.dataverse.DataCiteRESTfullClient.postMetadata(DataCiteRESTfullClient.java:183)
at edu.harvard.iq.dataverse.DOIDataCiteRegisterService.createIdentifier(DOIDataCiteRegisterService.java:89)

Interestingly enough, when I validated the xml (as written to the doidataciteregistercache table) against the external schemas, it passed.

I guessed the "element p" in the server.log error message referred to the "p" in "& amp;" (had to put the space so the amp; part would show up). Perhaps it is actually the ";" causing the issue, however, since we had to change all of the ";" in the Description text to "," to get it to publish.

I am not sure if I should create a new issue or edit the title of this one to reflect that it is not just "&" and not just the title where these issues occur.... let me know and I will do as advised.

Thanks!

@mdmADA
Contributor Author

mdmADA commented Aug 3, 2017

DV 4.6.1.

Running into same problem with < br >, < p >, etc markup in the Metadata description causing publish to fail.

It seems to boil down to this: HTML that is valid (and required) throughout the UI is being sent, incorrectly, to the Datacite DOI URL as if it were valid XML.

In the DOIDataciteRegisterService createIdentifier() method, it calls:
metadataTemplate.setDescription(dataset.getLatestVersion().getDescription());

The DatasetVersion getDescription() method calls the MarkupChecker.sanitizeBasicHTML() method.

This makes sure that the description text is valid html and converts < br >< /br > to < br >< br >.

< br >< br > is valid HTML but not XML as XML requires the closing tag.
The publishing doi process requires valid xml and so throws an Exception due to the 'sanitized' HTML being invalid XML.

I believe that & is not allowed in XML either so needs to be escaped before sending the XML to Datacite.

I am sure that whoever is assigned the bug fix can figure this out but maybe my own investigations can assist...
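To illustrate the escaping idea, here is a minimal sketch using only the standard library. `XmlText.escape` is a hypothetical helper name, not something that exists in Dataverse. It replaces the five characters that XML predefines entities for, so that a value can be placed in element text or in a quoted attribute.

```java
/** Hypothetical helper (not Dataverse code): escapes text for use inside XML. */
final class XmlText {

    private XmlText() {
    }

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                case '\'':
                    out.append("&apos;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Output dataset testing & self ingest"));
    }
}
```

The escaping would apply only to the copy of the metadata embedded in the XML sent for DOI minting; the value stored in the database and shown in the UI would stay as entered.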

pdurbin added a commit that referenced this issue Aug 3, 2017
@pdurbin
Member

pdurbin commented Aug 3, 2017

@mdmADA can you please look at b1ae906 and let me know if those tests match your expectations? I'm trying to understand if there's a bug in the library we're using (jsoup).

@mdmADA
Contributor Author

mdmADA commented Aug 8, 2017

Hi Phil. I believe jsoup is behaving as it should (no bugs) in that it properly sanitizes text input to valid html.

The issue is that this html is being sent to Datacite as part of the XML in the postMetadata() method for DOI minting.


The html being sent as part of the XML renders that XML invalid so Datacite is throwing a "400" status with "bad xml".



Example: Enter <br></br> into the Description field and hit 'Save Change'.

=> The description is saved to the datasetfieldvalue table with the <br></br> intact:

select value from datasetfieldvalue where value like '%<br>%';

                 value                      

paper for conference plus materials <br></br>\r+
testing line breaks



Now hit 'Publish'.
=> In the DOIDataCiteRegisterService createIdentifier() method, the dataset.getLatestVersion().getDescription() calls the MarkupChecker.sanitizeBasicHTML() on this description text which correctly converts the <br></br> to <br><br> as it should for valid html. This is then embedded as part of the xml sent by the postMetadata() method (see the description element):

<?xml version="1.0" encoding="UTF-8"?> <resource xsi:schemaLocation="http://datacite.org/schema/kernel-3 http://schema.datacite.org/meta/kernel-3/metadata.xsd" xmlns="http://datacite.org/schema/kernel-3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <identifier identifierType="DOI">10.5072/82/AH5ZIW</identifier> <creators><creator><creatorName>Test Author 1</creatorName><affiliation>ANU</affiliation></creator><creator><creatorName>Test Author 2</creatorName><affiliation>ANU</affiliation></creator></creators> <titles> <title>Output dataset-testing and self-ingest for Replication Materials</title> </titles> <publisher>DEV ADA Dataverse</publisher> <publicationYear>2017</publicationYear> <resourceType resourceTypeGeneral="Dataset"/> <descriptions> <description descriptionType="Abstract">paper for conference plus materials <br> <br> testing line breaks</description> </descriptions> <contributors><contributor contributorType="ContactPerson"><contributorName>Contributor 1</contributorName><affiliation>ANU</affiliation></contributor></contributors> </resource>

The <br><br> however, is not valid XML and Datacite is throwing a 400 'bad xml' error because of it.
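The well-formedness problem can be reproduced without contacting Datacite. The following is a minimal sketch using the JDK's built-in XML parser; `WellFormedCheck` is a hypothetical class name. It checks well-formedness only, not validity against the kernel-3 schema, so passing this check does not guarantee that Datacite accepts the document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not Dataverse code): tests whether a string parses as XML. */
final class WellFormedCheck {

    private WellFormedCheck() {
    }

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unclosed <br> tags, as produced by the HTML sanitizer.
        System.out.println(isWellFormed("<description>materials <br> <br> testing</description>"));
        // A bare ampersand, as in the title case.
        System.out.println(isWellFormed("<title>testing & self ingest</title>"));
    }
}
```

Both calls in `main` should report the input as not well-formed, which matches the two 400 errors seen in server.log.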


Finally, I tested if using &lt;br&gt; allows publishing. While it does, the <br> tags show up in the UI as part of the text which is not what we want either.


I assume this is an issue when trying to include ANY of the html tags (<em>,<b>,etc) in the description Metadata field...


I hope I am describing this clearly!!

@pdurbin
Member

pdurbin commented Aug 8, 2017

@mdmADA yes, this is helping. Thanks! 😄

@pdurbin
Member

pdurbin commented Aug 8, 2017

@mdmADA I started working on this issue and pushed some scratch work to pdurbin@4ed4c00 if you're interested. You're definitely right about how Dataverse sometimes sends XML that's not well formed to DataCite. I'll keep updating this issue with my progress.

@jggautier jggautier changed the title Dataset (metadata) Title with '&' in the text causes 'Publish' to fail. Dataset (metadata) Title with '&' (ampersand) in the text causes 'Publish' to fail. Aug 10, 2017
@djbrooke djbrooke added this to the 4.8 - Large Data Upload Integration milestone Aug 10, 2017
pdurbin added a commit that referenced this issue Aug 11, 2017
@pdurbin
Member

pdurbin commented Aug 11, 2017

I gave @rbhatta99 a brain dump this morning and just pushed a branch called 3845-datacite-xml as a common starting point.

@mdmADA neither of us is able to reproduce the "& in the title" bug. I'm not sure why.

We definitely can exercise the "<br></br> in the description" bug.

Along the way, I discovered that while using DataCite rather than EZID, I can't publish a dataset created via SWORD because contributorName wasn't being sent. I pushed a fix in e93d6b3 to my branch and this relates strongly to #3802 and #3839.

@matthew-a-dunlap @rbhatta99 and I observed that descriptions of datasets aren't even being shown at https://search.datacite.org/works/10.7910/dvn/eiwf4p so part of me wonders if an easy fix would be to always send an empty string. Does DataCite do anything with descriptions? Maybe @jggautier or @pameyer would know.

I'll keep hacking away but I wanted to give an update and get that branch pushed.

@rbhatta99 rbhatta99 self-assigned this Aug 11, 2017
@rbhatta99
Contributor

As of commit dd55c08 on develop (v 4.7.1), a dataset still gets published with an & in the title, although adding HTML tags to the description still causes publishing to fail.

@pameyer
Contributor

pameyer commented Aug 11, 2017

Which version of the DataCite XML schema are we sending to DataCite's API? From a quick check (edit test file and re-run validator), v3.1 doesn't support html tags in the description.

@jggautier
Contributor

jggautier commented Aug 11, 2017

We're using 3.1. Both 3.1 and the newest version, 4.0, pretty strongly recommend including the description (although I also noticed that dataset descriptions aren't displayed on search.datacite).

Edit: Some dataset descriptions aren't displayed on search.datacite, but this one is: https://search.datacite.org/works/10.6084/m9.figshare.4223907, and the Datacite.xml you can export from that page includes the dataset description, whereas the Datacite.xml export on this page, https://search.datacite.org/works/10.7910/dvn/eiwf4p, doesn't. If we plan on using DataCite's service to add datacite metadata in schema.org/JSON-LD format to dataset pages (#3793), then finding a way to send DataCite the description would help even more.

@mdmADA
Contributor Author

mdmADA commented Aug 14, 2017

Maybe there is a difference between 4.6.1 and 4.7.1 (I am not sure when we will move to that version) with the & in the title?

I am using this fake dataset to test: https://dataverse-dev.ada.edu.au/dataset.xhtml?persistentId=doi:10.5072/82/AH5ZIW

When I add & to the title (and change nothing else in the metadata), this is the xml that Dataverse attempts to send to Datacite:

<?xml version="1.0" encoding="UTF-8"?> <resource xsi:schemaLocation="http://datacite.org/schema/kernel-3 http://schema.datacite.org/meta/kernel-3/metadata.xsd" xmlns="http://datacite.org/schema/kernel-3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <identifier identifierType="DOI">10.5072/82/AH5ZIW</identifier> <creators><creator><creatorName>Test Author 1</creatorName><affiliation>ANU</affiliation></creator><creator><creatorName>Test Author 2</creatorName><affiliation>ANU</affiliation></creator></creators> <titles> <title>Output dataset testing & self ingest for Replication Materials</title> </titles> <publisher>DEV ADA Dataverse</publisher> <publicationYear>2017</publicationYear> <resourceType resourceTypeGeneral="Dataset"/> <descriptions> <description descriptionType="Abstract">paper for conference plus materials testing line breaks</description> </descriptions> <contributors><contributor contributorType="ContactPerson"><contributorName>Contributor 1</contributorName><affiliation>ANU</affiliation></contributor></contributors> </resource>


This is the error in server.log:

[2017-08-14T14:19:56.645+1000] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.DataCiteRESTfullClient] [tid: _ThreadID=50 _ThreadName=jk-connector(1)] [timeMillis: 1502684396645] [levelValue: 1000] [[
Response code: 400, [xml] xml error: The entity name must immediately follow the '&' in the entity reference.]]

[2017-08-14T14:19:56.646+1000] [glassfish 4.1] [WARNING] [AS-EJB-00056] [javax.enterprise.ejb.container] [tid: _ThreadID=50 _ThreadName=jk-connector(1)] [timeMillis: 1502684396646] [levelValue: 900] [[
A system exception occurred during an invocation on EJB DOIDataCiteRegisterService, method: public java.lang.String edu.harvard.iq.dataverse.DOIDataCiteRegisterService.createIdentifier(java.lang.String,java.util.HashMap,edu.harvard.iq.dataverse.Dataset) throws java.io.IOException]]

.
.
.

Caused by: java.lang.RuntimeException: Response code: 400, [xml] xml error: The entity name must immediately follow the '&' in the entity reference.
at edu.harvard.iq.dataverse.DataCiteRESTfullClient.postMetadata(DataCiteRESTfullClient.java:186)
at edu.harvard.iq.dataverse.DOIDataCiteRegisterService.createIdentifier(DOIDataCiteRegisterService.java:92)


If I make the simple change of `&` to 'and', it publishes. Not sure what the difference is...

pdurbin added a commit that referenced this issue Aug 14, 2017
@pdurbin pdurbin mentioned this issue Aug 14, 2017
5 tasks
@pdurbin
Member

pdurbin commented Aug 14, 2017

I just created pull request #4075 and am moving this issue to code review. @jggautier seemed ok with sending plain text to DataCite so we're stripping out HTML tags (he started a thread at https://groups.google.com/d/msg/datacite-metadata/Di5TSstfafU/Zki5n44CAgAJ ). (Thanks to @rbhatta99 we do have code ready to go in 64e1ee4 to escape the HTML tags instead if that's what's desired.)
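For readers who want a feel for what "stripping out HTML tags" means, here is a rough sketch. It is not the code from the pull request: it is a crude regex-based stand-in that uses only the standard library, and `PlainTextDescription.stripTags` is a hypothetical name. A real implementation would more likely lean on an HTML parser such as jsoup, which is already in use for sanitizing.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not the pull request's code): crude HTML tag removal. */
final class PlainTextDescription {

    // Anything between '<' and the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private PlainTextDescription() {
    }

    static String stripTags(String html) {
        // Replace each tag with a space so words on either side stay separated.
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the runs of whitespace left behind.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("paper for conference plus materials <br> <br> testing line breaks"));
    }
}
```

Note that the stripped text can still contain a bare `&`, so plain text would still need XML escaping before it goes into the document sent to DataCite.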

@pdurbin pdurbin removed the User Role: Curator Curates and reviews datasets, manages permissions label Aug 14, 2017
@jggautier
Contributor

Hi @mdmADA and @philippconzett, does stripping all html from the description metadata sent to DataCite work for ADA?

@pdurbin
Member

pdurbin commented Aug 15, 2017

In 2b466e7 and 47feb73 I addressed code review from @scolapasta . Moving to QA.

@mdmADA
Contributor Author

mdmADA commented Aug 16, 2017

In (delayed) reply to @jggautier about stripping out all HTML from the description to send to DataCite, I think that is perfectly acceptable for ADA... Thanks!

@kcondon kcondon self-assigned this Aug 17, 2017
@pdurbin
Member

pdurbin commented Aug 17, 2017

@kcondon here's the template I mentioned we use when sending XML to DataCite:

<?xml version="1.0" encoding="UTF-8"?>
<resource xsi:schemaLocation="http://datacite.org/schema/kernel-3 http://schema.datacite.org/meta/kernel-3/metadata.xsd"
          xmlns="http://datacite.org/schema/kernel-3"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <identifier identifierType="DOI">${identifier}</identifier>
    <creators>${creators}</creators>
    <titles>
        <title>${title}</title>
    </titles>
    <publisher>${publisher}</publisher>
    <publicationYear>${publisherYear}</publicationYear>
    <resourceType resourceTypeGeneral="Dataset"/>
    <descriptions>
        <description descriptionType="Abstract">${description}</description>
    </descriptions>
    <contributors>{$contributors}</contributors>
</resource>

(From src/main/resources/edu/harvard/iq/dataverse/datacite_metadata_template.xml in the code.)
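As a point of comparison only (this is not what the fix does): when values are substituted into a text template like this one, the caller has to escape each value itself. An alternative is to let an XML writer produce the elements so that escaping happens automatically. The sketch below uses the JDK's StAX `XMLStreamWriter`; `TitleElement` is a hypothetical class name and only the `title` element is shown.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical sketch (not Dataverse code): writes one element with StAX. */
final class TitleElement {

    private TitleElement() {
    }

    static String write(String title) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("title");
            // writeCharacters escapes markup characters such as '&' and '<'.
            xml.writeCharacters(title);
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(write("Output dataset testing & self ingest for Replication Materials"));
    }
}
```

The trade-off is that the whole document would be built in code rather than read from a template file.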

If you run asadmin set-log-levels edu.harvard.iq.dataverse.DOIDataCiteRegisterService=FINE you can see the XML in server.log right after it's constructed.

Like I said, the fix is to strip out HTML from the description before it's inserted into the template above. While we were in there, we also now strip out HTML from the description when inserting it into the "meta" tags that Zotero and other tools consume (#1393). The "meta" code/template looks like this:

<meta name="DC.identifier" content="#{DatasetPage.persistentId}"/>
<meta name="DC.type" content="Dataset"/>
<meta name="DC.title" content="#{DatasetPage.title}"/>
<meta name="DC.date" content="#{DatasetPage.publicationDate}"/>
<meta name="DC.publisher" content="#{DatasetPage.publisher}" />
<meta name="DC.description" content="#{DatasetPage.description}" />

Hope this helps.
