Publish Dataset - Fails when metadata contains HTML entities w/special characters such as #3328
Comments
Looking more closely at my data, I found some UTF-8 "No-Break Space" characters (aka \xa0, aka &nbsp;). I purged them and the error disappeared. After that I hit a second error, about an argument that is optional for Dataverse but apparently mandatory for DataCite: Here, the contributorName is not part of the contributor field of a dataset in Dataverse. It refers to the dataset contact name, which is optional for a dataset but causes an error with DataCite, because it is mandatory there yet not asked for anywhere (except in the error message). After adding a name to my datasetContact field, I could finally publish my imported dataset :) So, to summarize, there is 1 major problem and 1 optional problem:
|
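The workaround described above (purging the no-break spaces before import) can be sketched in Python, the language of the reporter's import scripts. This is a minimal sketch, not the reporter's actual script: the helper name `purge_nbsp` and the sample metadata fragment are hypothetical, and it assumes the metadata is an ordinary JSON structure of dicts, lists and strings.

```python
import json

NBSP = "\u00a0"  # the no-break space character, i.e. \xa0

def purge_nbsp(value):
    """Recursively replace no-break spaces in every string of a JSON-like structure."""
    if isinstance(value, str):
        # Handle both the HTML entity and the literal character.
        return value.replace("&nbsp;", " ").replace(NBSP, " ")
    if isinstance(value, list):
        return [purge_nbsp(item) for item in value]
    if isinstance(value, dict):
        return {key: purge_nbsp(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged

# Hypothetical fragment for illustration only, not the attached json.txt.
metadata = {"typeName": "dsDescriptionValue", "value": "Survey\u00a0data&nbsp;2016"}
cleaned = purge_nbsp(metadata)
print(json.dumps(cleaned))
```

Replacing with a plain space loses the non-breaking behaviour, which is usually acceptable for metadata fields.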
See also details at http://irclog.iq.harvard.edu/dataverse/2016-09-02#i_40373 |
We've had another report of the "optional" item from this issue regarding |
@djbrooke |
I believe there's another case of this problem reported here today: https://groups.google.com/d/msg/dataverse-community/v6jgyewGqlI/peTMCoW-DAAJ |
We talked about this in standup. The expectation is that code changes that were already made had fixed this, so we should retest. |
So, to test this I need to:
|
Yeah, HTML content such as |
Is there a known list or pattern to unsupported tags or would just trying a known case be enough? |
I see, is this issue essentially a duplicate of #3845? |
OK, still fails: This works: This fails: |
[2017-09-14T17:19:50.302-0400] [glassfish 4.1] [SEVERE] [] [edu.harvard.iq.dataverse.DataCiteRE
[2017-09-14T17:19:50.304-0400] [glassfish 4.1] [WARNING] [AS-EJB-00056] [javax.enterprise.ejb.c
[2017-09-14T17:19:50.304-0400] [glassfish 4.1] [WARNING] [] [javax.enterprise.ejb.container] [t
javax.ejb.TransactionRolledbackLocalException: Exception thrown from bean
[2017-09-14T17:19:50.307-0400] [glassfish 4.1] [WARNING] [] [edu.harvard.iq.dataverse.DOIDataCiteServiceBean] [tid: _ThreadID=56 _ThreadName=jk-connector(4)] [timeMillis: 1505423990307] [levelValue: 900] [[ |
Bummer. Thanks for checking, @kcondon. The reason this got on my radar is that @amberleahey linked to it from the "Dev Efforts by the Dataverse Community" spreadsheet at https://docs.google.com/spreadsheets/d/1pl9U0_CtWQ3oz6ZllvSHeyB0EG1M_vZEC_aZ7hREnhE/edit?usp=sharing and indicated that @kevinworthington put a fix in the production installation for Scholars Portal. She called it "DataCite metadata validation (strip out HTML encoding)". For #3845 we stripped out HTML tags but we didn't look at entities like |
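To illustrate the difference between stripping tags and handling entities, here is a minimal Python sketch of one possible approach: decode named HTML entities into real characters, then re-escape only what XML requires. It is not the Dataverse fix (Dataverse is written in Java) and not the Scholars Portal patch; the function name `html_entities_to_xml_text` is hypothetical. It relies only on the standard-library calls `html.unescape` and `xml.sax.saxutils.escape`.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def html_entities_to_xml_text(text):
    """Turn HTML-entity-laden text into text that is safe inside an XML element."""
    decoded = html.unescape(text)  # "&nbsp;" -> "\xa0", "&amp;" -> "&"
    return escape(decoded)         # re-escape only &, < and >

safe = html_entities_to_xml_text("Fish&nbsp;&amp;&nbsp;chips")

# The result parses as XML, because no undeclared named entity remains.
element = ET.fromstring("<description>" + safe + "</description>")
print(repr(element.text))
```

The no-break space survives as a literal character, which XML accepts without any entity declaration.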
So if we didn't look at entities, why did we bother testing it? |
It's good to know where the bugs are. All Dataverse installations using EZID will need to move to DataCite: http://www.cdlib.org/cdlinfo/2017/08/28/ezid-service-update-august-2017/ |
I know but if we did not specifically fix entities why would we expect them to work? This does not require an answer, just something to think about going forward. |
This issue wasn't on my radar when I worked on #3845. In the next retrospective I'll mention that we need a better way to group together related issues so we can fix and QA them in a single pull request. |
I guess my point isn't that you didn't fix it or that it wasn't organized in such a way, it was that we knew it wasn't fixed but we tested it rather than just acknowledging that it was something we have still to do. That does not make sense to me. That's all. We can continue off line in stand up. |
Unfortunately @philippconzett is suffering from this bug in production as originally reported at https://groups.google.com/d/msg/dataverse-community/kVktk1nzBG8/4wsIsZlBBgAJ He's running 4.7.1 but the bug is in the "develop" branch as well. In the stack trace below that Philipp sent via https://help.hmdc.harvard.edu/Ticket/Display.html?id=254989 the Here's the code in question that should validate the XML before it's sent to DataCite:
Here's the stack trace showing line the
|
FWIW, DataCite supports only a few HTML tags in the DOI description field, but that includes |
@mfenner thanks! @4tikhonov you'll need to add the |
@pdurbin, it makes no sense to add |
@4tikhonov ok, what would be a more sustainable solution? Markdown support? |
@pdurbin, probably it would be better to replace |
The subset that we picked ( Something that we "unofficially" support is LaTeX, which for example the folks at CERN make heavy use of and that we support via MathJax (but that is tricky). |
@4tikhonov one challenge with that is for those that have used |
Hey @qqmyers - you mentioned you may have a fix for this issue that you deployed on QDR. |
@djbrooke, @jmjamison - not sure if anything is needed. I just repeated putting odd chars in the description as mheppler did in Aug 2019 (above) and it publishes OK: https://demo.dataverse.org/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2F0LD1LQ&version=1.1 Does that mean this issue is fixed? Or is there a better test case that demonstrates the problem? |
Is demo configured with a fake DOI provider? That was what I missed if you are referencing my test on my local dev environment above. Just putting these here...
|
I'm assuming the 10.70122 is a real DOI of some sort (on the test DataCite server?), but I don't know. |
@qqmyers I believe that the rest of the DOI "10.70122/FK2/0LD1LQ" with the "FK2" shoulder is the "fake" provider part. |
The FakePIDProvider just returns "fakeIdentifier" (see dataverse/src/main/java/edu/harvard/iq/dataverse/pidproviders/FakePidProviderServiceBean.java Lines 46 to 47 in 54a47cf
|
Not only are they not resolvable (DataCite 404 when clicked), but the Developer Guide implies the registration is disabled, which I believe is why these special characters only fail on production installations with "real" DOI providers.
|
Can someone verify what demo.dataverse.org is running? While it's true that the FAKE provider would generate 'realistic looking' DOIs and not be a good test, running with the DataCite provider would be a good test and looks similar. (demo.dataverse.org is using the authority 10.70122, which looks more like a DataCite test authority than the obsolete 10.5072 authority I've seen used for the FAKE provider.) Also, this is working at QDR on a dev server using the DataCite provider and their test system. So either demo is also using DataCite (in which case this is a valid test and the basic problem is fixed), or demo is on FAKE (in which case this is not a valid test, and QDR must have some code change that needs to be merged, since the problem is fixed there). |
@qqmyers Yes, it is a test domain: https://mds.test.datacite.org |
@kcondon - thanks. So - I think that means this issue is solved in 5.3. @jmjamison - can you try your specific case on https://demo.dataverse.org and confirm? Are there any other test cases in the notes above that need to be tested? (I tried & in the description and mheppler's string with more variants, but there could be some other test that is not yet handled.) Also - adding a note from the QDR dev server - I can see the & at DataCite in the Fabrica interface, so it is definitely being transmitted correctly (and not just getting cleaned out of the metadata before sending, etc.) |
I can't explain why things are working on demo and not elsewhere (see https://groups.google.com/d/msgid/dataverse-community/cf4032af-e12b-4dfc-ba6d-7804b35e4f40n%40googlegroups.com). That said, I do see that the datacite.xml produced on demo is not correctly escaped, i.e. in my browser I see an error when trying to view https://demo.dataverse.org/api/datasets/export?exporter=Datacite&persistentId=doi%3A10.70122/FK2/0LD1LQ : QDR has a change that does the escaping, which may also resolve the problem with publishing. At least at QDR, my browser shows the formatted XML when I look at a dataset with '&', '>' and '<' chars in the description, and I can publish and then retrieve the XML from the test DataCite server (via Fabrica) to see that the description is there with escaping, e.g. PR to follow. |
More investigation: It looks like DataCite is silently removing unescaped chars rather than failing during publish. @jmjamison - guessing that you use EZID (which may not be doing that and just failing instead)? For DataCite, the PR above, which escapes the description, results in DataCite (i.e. as seen in Fabrica) having the escaped chars and the Dataverse DataCite.xml export being valid XML. For EZID, we'll need a similar change in AbstractGlobalIdServiceBean.getMetadataFromDvObject(). It shouldn't hurt anything regardless, but it would be good to confirm first that the publication failure is with that service. Another fun note for DataCite: if you put an & in the title (now or after we add escaping to the description), DataCite decides to remove the unescaped and escaped versions of the characters from the whole XML it is sent, so even the properly escaped & chars in a description disappear if DataCite sees an unescaped & in the title. So we may want to do more work to make sure all fields are escaped, even ones that don't normally get special chars in them, to avoid odd effects if/when someone does enter them in some other field. |
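One way to get "all fields escaped" without escaping each field by hand is to build the XML through a serializer, which escapes every text node it writes. The following Python sketch only illustrates that idea; it is not the QDR change or the Dataverse code (which is Java), the function `build_record` is hypothetical, and the element names are a simplified stand-in for the real DataCite schema (no namespaces, no wrapper elements).

```python
import xml.etree.ElementTree as ET

def build_record(title, description):
    """Serialize a simplified record; ElementTree escapes &, < and > in text nodes."""
    resource = ET.Element("resource")
    ET.SubElement(resource, "title").text = title
    ET.SubElement(resource, "description").text = description
    return ET.tostring(resource, encoding="unicode")

# Special characters in both fields, including the title.
xml_text = build_record("Salt & Pepper", "Values with x < y & y > z")
print(xml_text)

# Round trip: the output is well-formed and the original text is recovered.
parsed = ET.fromstring(xml_text)
print(parsed.find("title").text)
```

Because the escaping happens at serialization time, a field that rarely contains special characters (such as the title) is treated the same as the description.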
Confirmed with @jmjamison that they use EZID, so the mysteries are solved. The PR doesn't yet have the same fix in the EZID code, but it should be straightforward to add. Not a huge priority, as removing the offending chars is a workaround. |
I added datasets to a Dataverse installation that uses DataCite for DOIs. The datasets were added with Python scripts, using the Dataverse Python API to first create a "simple" dataset (only the required elements) and then update its metadata with JSON (built by extracting data from xlsx documents).
The problem appears when I try to publish the dataset. Publishing fails with the error message: "Error – This dataset may not be published because the DataCite Service is currently inaccessible. Please try again. Does the issue continue to persist? Please contact Dataverse Support for assistance. "
Logs are here: dataverse_event_published_error_log.txt
The logs contain this message: [xml] xml error: The entity "nbsp" was referenced, but not declared.]
So I tried the DataCite API directly with "custom" XML files and found something: if the XML contains a "&nbsp;" entity, I get the same error. When I remove it, everything is fine. I also tried adding an entity declaration ( ]>) that defines nbsp (the value is &# followed by 160, between the quotes); the nbsp was then replaced and everything worked like a charm. But that only works when calling the API directly. The problem is that Dataverse doesn't handle the nbsp entity (which perhaps comes from my imports).
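The experiment above can be reproduced locally with any XML parser. The Python sketch below uses the standard-library `xml.etree.ElementTree` and three tiny documents: one with an undeclared `&nbsp;`, one that declares the entity in the DOCTYPE, and one that uses the numeric reference directly. The root element name `description` and the exact DOCTYPE text are assumptions for illustration, since the original declaration did not survive in this report, and the error wording differs from the one DataCite returned.

```python
import xml.etree.ElementTree as ET

# Named entity without a declaration: not well-formed XML.
undeclared = "<description>a&nbsp;b</description>"

# Same content, with nbsp declared as character 160 in the internal DTD subset.
declared = ('<!DOCTYPE description [<!ENTITY nbsp "&#160;">]>'
            "<description>a&nbsp;b</description>")

# Numeric character reference: needs no declaration at all.
numeric = "<description>a&#160;b</description>"

def try_parse(xml_text):
    """Return the element text, or a short message if the document is rejected."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError as error:
        return f"parse error: {error}"

for label, document in [("undeclared", undeclared),
                        ("declared", declared),
                        ("numeric", numeric)]:
    print(label, "->", repr(try_parse(document)))
```

Of the two working variants, the numeric reference is the simpler one to generate, because it does not depend on a DOCTYPE being present.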
Here is the JSON example that was requested for testing on a test server that uses DataCite:
json.txt