-
Notifications
You must be signed in to change notification settings - Fork 495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
7127 Move <sumDscr> and <dataAccs> child elements to the right sequence #7135
7127 Move <sumDscr> and <dataAccs> child elements to the right sequence #7135
Conversation
@pdurbin @jggautier So in this PR I changed the child elements of
to
because this is DDI valid. But since we lose the logical grouping of the geographic locations, I decided to contact the DDI Alliance to hear their opinion. Wendy Thomas, the chair of the technical committee, suggested the following XML:
Apparently the geogCover element is intended to be unstructured so you can add a complete set of information. I suggest to update the PR to format the XML as suggested, and also parse the DDI XML accordingly when importing. How do you guys feel about this? |
That change makes sense to me and appears to be valid against the Codebook schema. (Here's the XML I exported from 7127-fix-ddi-export-element-sequence and adjusted: test_ddi.txt (had to change the .xml extension to .txt so the file would upload in this comment)).
Would that include importing using harvesting and importing using Dataverse's API endpoint for importing DDI XML? Should this be moved to the Community Dev column for now? |
For sure!
I think @pdurbin can answer this. |
I'm not especially knowledgeable about the changes being proposed. Here's my suggestion:
(I don't think there are any UI/UX changes so there's probably no need to bring in @TaniaSchlatter ) Like #7108 these are geospatial changes but as I mentioned there I'm not sure when the new GDCC geopatial working group will have its first meeting so it probably doesn't make sense to wait. |
I can talk more about the changes during next Tuesday's tech hour. I agree, this wouldn't involve changes in the Dataverse UI. @JingMa87's latest proposal could affect UX by improving discoverability of some datasets, especially when dataset metadata from one Dataverse repository is imported into another Dataverse. The loss of the use of the |
Issue at DDI Alliance: https://ddi-alliance.atlassian.net/browse/DDICODE-71 |
I'd just like to note that we had a great conversation about this issue during tech hours on Tuesday and @jggautier did a fantastic job of talking about and demo'ing various nuances that would have taken a lot of back and forth in writing. I don't think I'm the only one who expressed some nervousness about attempting to put semi-structured text into a freeform field like "geogCover". As mentioned above, that field could look like this:
Personally, I'm wondering if we can wait for DDI to give us more granularity with a future version of their spec. I'm not sure what the next steps are for this pull request. I believe we'd like to give @JingMa87 some direction. 😄 |
@pdurbin @jggautier Alternatively, we could go from something like this:
to something like this:
This way we would keep the data granular and also be DDI valid. Wendy mentioned that the nation element doesn't necessarily have to be used for our nation data. |
In the meeting last month that @pdurbin mentioned, @landreev said he'd leave a comment when he can about one or more other ways to add the metadata to the "geogCover" field in a semi-structured way. That was held up by work on Dataverse 5 and he's "out of the office" this week. In the meantime, @JingMa87 the most recent solution you described made me wonder if I really understand the goal of this PR. I've been assuming that you've been working on improving the DDI Codebook xml that Dataverse produces in order to help with a migration effort you're working on. Is that right? If so, are there DDI xml files you could share that you need to help migrate into Dataverse? I'm thinking that some examples could help us weigh the pros and cons of different solutions. The same thing with examples from other non-Dataverse based repos that are using DDI Codebook to store and distribute geospatial metadata, especially Nesstar-based repos. Several admins of Nesstar-based repos have expressed interest in moving to Dataverse, using the DDI Codebook metadata that software produces. In the Dataverse community forum thread at https://groups.google.com/g/dataverse-community/c/kben6-8jRxE I describe the most recent interest, and I could follow up with them. ADA also migrated from Nesstar, they were in the DDI Codebook mapping weeds for some time and might be able to share helpful info about how they migrated their geospatial metadata. |
@jggautier Although at our organization we're busy migrating, I'm not actually making this fix for that migration effort. The sole goal here is to have the DDI be valid according to the codebook. |
If Dataverse imports DDI Codebook metadata that contains...
how will it know that there are two locations? Right now, Dataverse is using DDI-C's
on import it combines every value of
Since the DDI-C schema technically doesn't allow us to rely on the placement of the
it's exported in DDI-C as:
This would mean that there would be a loss of granularity if someone tries to export DDI-C metadata from one Dataverse repository into another:
But I think that's acceptable considering how little structured granularity the DDI-C schema allows in this case, and Dataverse repositories have the option to use Dataverse's own JSON metadata export to exchange metadata among themselves.
Dataverse could map them to its Geographic Coverage "Nation" field:
I can't imagine that any other type of repository that exports DDI-C metadata is using both the And if a newer version of a Dataverse-based repository with these proposed import/export rules wants to import (such as harvest) from an older Dataverse repository that's using the existing import/export rules, those Dataverse repositories can avoid information loss by using Dataverse's own JSON metadata export. Do you think this proposal makes sense @JingMa87? |
As Julian pointed out correctly, the DDI import code in Dataverse, as currently implemented, relies on the order of the "nation" and "geogCover" tags for properly placing the field values into separate compound metadata fields.
the order of the fields is the only thing that tells Dataverse to group "San Jose" with "United States". BTW, I am assuming that the attributes We may however need to reconsider that goal now - i.e. whether this is still a hard requirement, to be able to perform the export-and-import round trip, and end up with a perfectly identical dataset (at least with the metadata fields that can be exported to DDI). It's important to keep in mind that when this project was started, it was social sciences-centered; and the DDI was considered to be the primary format for exchanging metadata between instances. Neither is true anymore. So it would be a valid argument, that it may be irrelevant, how Dataverse handles DDI produced by another Dataverse... Simply because nobody should be doing that? Because our proprietary JSON - that's actually designed to encode all the possible metadata fields unambiguously - is the format for exchanging between Dataverses. I'm not sure we are there yet. We may still have some cases where we don't have a JSON round trip guaranteed to work properly. And where using the DDI would actually produce better results. Also, we may still be recommending DDI as a migration solution in some situations. (and there are still installations that are only using the citation and social sciences metadata exclusively). But my main point is that we should address all of this and maybe make some decisions for going forward, before we proceed to eliminate the remaining idiosyncrasies in the export. I'll be adding more info/thoughts... |
(I know I repeated a few things in the last comment that must have been obvious to everybody else who was already part of the discussion. I wanted to emphasize the part about the order being treated as meaningful. For the sake of other developers who may be joining in). From a quick look at the DDI import code, it obviously relies on the order in which the "nation" and "geogCover" are intermixed; and there is no attempt to As I said, we may want to reconsider how important it is to ensure a 1:1 export-import round trip via DDI. We all agree that we don't want to have that at the expense of producing XML that's illegal under the DDI schema. But I could accept that maybe it's not even necessary, for example, to be able to export a Dataverse compound field "geographicCoverage" with the child fields "country: USA" and "otherGeographicCoverage: Silicon Valley", and to have these 2 values guaranteed to end up in the same compound field, after an export and reimport... However, I feel like it is actually more important to ensure that we make sensible decisions when we try to import "arbitrary" DDI, that's not produced by another Dataverse. And making that hard-coded assumption about the order is almost certainly going to result in an unintended grouping of the imported values in that situation. (Also worth noting, that, if I'm reading the code correctly, even with these export and import hacks currently in place, a 1:1 round trip is still not guaranteed! For example, a compound field |
So, for the purposes of wrapping up and closing this issue, I like the solution described above, of inserting structured text into a
On the import side the code will need to check if the text found inside this tag matches the structure above; and if so, take the values from it and insert them into a single
will produce TWO compound field values in the dataverse metadata on the importing end: Another alternative to the above solution I can think of:
I believe this would require fewer modifications on the import end. In fact, the current import code will import the fragments above as 2 separate Thoughts? |
For the purposes of going forward with the DDI export and import in general - this is clearly outside the scope of this issue/PR, but something we need to address; I hope we can talk about this during the tech hour tomorrow. We should probably check the rest of the code for dataverse-specific hacks like the one above. Some of them may not be producing outright illegal XML, but may still not be ideal to keep, since they may be resulting in unintended behavior when importing non-Dataverse produced DDI. We should revisit the very notion of relying on special formatting in order to achieve a 1:1 export-import round trip. Simply because we should not view this format as intended for exchanging between dataverses. Producing valid, quality DDI and being able to import the format as produced by other institutions should be a higher priority. |
Hey @JingMa87 - do you have any thoughts on @landreev's feedback above? This may be more than you anticipated working on and it's fine if this additional work is not something you're able to take on. Let me know if you'd be interested in continuing the work here and, if so, let us know if you have any questions. Thanks for all the PRs so far! |
@djbrooke Thanks for the amazing thoughts on this topic! I also think that it makes sense to put structured text into a
Unfortunately I don't have time to make these changes anytime soon, since I've taken on a role in the migration at DANS. Hopefully someone else has time for this! |
OK, great, but let's try to put it on the schedule soon... while we remember how we agreed to implement it, etc. etc. :) |
What this PR does / why we need it: Improves the DDI export
Which issue(s) this PR closes: 7127
Closes #7127
Special notes for your reviewer: Looking good
Suggestions on how to test this: Make a DDI export for a dataset with geospatial data, then run the XML through a validator. You can validate here https://www.freeformatter.com/xml-validator-xsd.html with this XSD https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd. You do need to fix the import elements first. These are the correct ones:
Does this PR introduce a user interface change? If mockups are available, please link/include them here: No
Is there a release notes update needed for this change?: No
Additional documentation: