Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

7127 Move <sumDscr> and <dataAccs> child elements to the right sequence #7135

Closed
wants to merge 1 commit into from
Closed

7127 Move <sumDscr> and <dataAccs> child elements to the right sequence #7135

wants to merge 1 commit into from

Conversation

JingMa87
Copy link
Contributor

@JingMa87 JingMa87 commented Jul 28, 2020

What this PR does / why we need it: Improves the DDI export

Which issue(s) this PR closes: 7127

Closes #7127

Special notes for your reviewer: Looking good

Suggestions on how to test this: Make a DDI export for a dataset with geospatial data, then run the XML through a validator. You can validate here https://www.freeformatter.com/xml-validator-xsd.html with this XSD https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd. You do need to fix the import elements first. These are the correct ones:

<xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/xml.xsd"/>
<xs:import namespace="http://www.w3.org/1999/xhtml" schemaLocation="https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/ddi-xhtml11.xsd"/>
<xs:import namespace="http://purl.org/dc/terms/" schemaLocation="https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/dcterms.xsd"/>

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No

Is there a release notes update needed for this change?: No

Additional documentation:

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 19.637% when pulling f3e21f3 on JingMa87:7127-fix-ddi-export-element-sequence into baa521d on IQSS:develop.

@JingMa87
Copy link
Contributor Author

JingMa87 commented Jul 29, 2020

@pdurbin @jggautier So in this PR I changed the child elements of <sumDscr> from

<sumDscr>
    <nation>Netherlands</nation>
    <geogCover>The Hague</geogCover>
    <geogCover>South Holland</geogCover>
    <geogCover>Segbroek</geogCover>
    <nation>United States</nation>
    <geogCover>San Jose</geogCover>
    <geogCover>California</geogCover>
    <geogCover>Silicon Valley</geogCover>
</sumDscr>

to

<sumDscr>
    <nation>Netherlands</nation>
    <nation>United States</nation>
    <geogCover>The Hague</geogCover>
    <geogCover>South Holland</geogCover>
    <geogCover>Segbroek</geogCover>
    <geogCover>San Jose</geogCover>
    <geogCover>California</geogCover>
    <geogCover>Silicon Valley</geogCover>
</sumDscr>

because this is DDI valid. But since we lose the logical grouping of the geographic locations, I decided to contact the DDI Alliance to hear their opinion. Wendy Thomas, the chair of the technical committee, suggested the following XML:

<geogCover>
    Country/Nation: United States
    State/Province: California
    City: San Jose
    Other: Silicon Valley
</geogCover>
<geogCover>  
    Country/Nation:  Netherlands 
    State/Province:  South Holland 
    City:  The Hague 
    Other:  Segbroek 
</geogCover> 

Apparently the geogCover element is intended to be unstructured so you can add a complete set of information. I suggest to update the PR to format the XML as suggested, and also parse the DDI XML accordingly when importing. How do you guys feel about this?

@jggautier
Copy link
Contributor

jggautier commented Jul 30, 2020

That change makes sense to me and appears to be valid against the Codebook schema. (Here's the XML I exported from 7127-fix-ddi-export-element-sequence and adjusted: test_ddi.txt (had to change the .xml extension to .txt so the file would upload in this comment)).

and also parse the DDI XML accordingly when importing

Would that include importing using harvesting and importing using Dataverse's API endpoint for importing DDI XML?

Should this be moved to the Community Dev column for now?

@JingMa87
Copy link
Contributor Author

Would that include importing using harvesting and importing using Dataverse's API endpoint for importing DDI XML?

For sure!

Should this be moved to the Community Dev column for now?

I think @pdurbin can answer this.

@pdurbin
Copy link
Member

pdurbin commented Jul 30, 2020

I'm not especially knowledgeable about the changes being proposed. Here's my suggestion:

  • For now, keep it in Code Review (that is, the ball is in IQSS's court)
  • If @jggautier is willing, present what's going on the developers at tech hours (Tuesdays at 3pm) run by @scolapasta
  • Afterward, @jggautier or I can report back to @JingMa87 here with "go ahead" or a different direction.

(I don't think there are any UI/UX changes so there's probably no need to bring in @TaniaSchlatter )

Like #7108 these are geospatial changes but as I mentioned there I'm not sure when the new GDCC geopatial working group will have its first meeting so it probably doesn't make sense to wait.

@jggautier
Copy link
Contributor

jggautier commented Jul 30, 2020

I can talk more about the changes during next Tuesday's tech hour.

I agree, this wouldn't involve changes in the Dataverse UI. @JingMa87's latest proposal could affect UX by improving discoverability of some datasets, especially when dataset metadata from one Dataverse repository is imported into another Dataverse. The loss of the use of the <nation> element might reduce discoverability of datasets from non-Dataverse repositories whose DDI Codebook metadata is imported into a Dataverse repository.

@JingMa87
Copy link
Contributor Author

Issue at DDI Alliance: https://ddi-alliance.atlassian.net/browse/DDICODE-71

@pdurbin
Copy link
Member

pdurbin commented Aug 7, 2020

I'd just like to note that we had a great conversation about this issue during tech hours on Tuesday and @jggautier did a fantastic job of talking about and demo'ing various nuances that would have taken a lot of back and forth in writing.

I don't think I'm the only one who expressed some nervousness about attempting to put semi-structured text into a freeform field like "geogCover". As mentioned above, that field could look like this:

<geogCover>
    Country/Nation: United States
    State/Province: California
    City: San Jose
    Other: Silicon Valley
</geogCover>

Personally, I'm wondering if we can wait for DDI to give us more granularity with a future version of their spec.

I'm not sure what the next steps are for this pull request. I believe we'd like to give @JingMa87 some direction. 😄

@JingMa87
Copy link
Contributor Author

JingMa87 commented Aug 7, 2020

@pdurbin @jggautier Alternatively, we could go from something like this:

<sumDscr>
    <nation>Netherlands</nation>
    <geogCover>The Hague</geogCover>
    <geogCover>South Holland</geogCover>
    <geogCover>Segbroek</geogCover>
    <nation>United States</nation>
    <geogCover>San Jose</geogCover>
    <geogCover>California</geogCover>
    <geogCover>Silicon Valley</geogCover>
</sumDscr>

to something like this:

<sumDscr>
    <geogCover>Netherlands</nation>
    <geogCover>The Hague</geogCover>
    <geogCover>South Holland</geogCover>
    <geogCover>Segbroek</geogCover>
    <geogCover>United States</nation>
    <geogCover>San Jose</geogCover>
    <geogCover>California</geogCover>
    <geogCover>Silicon Valley</geogCover>
</sumDscr>

This way we would keep the data granular and also be DDI valid. Wendy mentioned that the nation element doesn't necessarily have to be used for our nation data.

@jggautier
Copy link
Contributor

I'm not sure what the next steps are for this pull request. I believe we'd like to give @JingMa87 some direction. 😄

In the meeting last month that @pdurbin mentioned, @landreev said he'd leave a comment when he can about one or more other ways to add the metadata to the "geogCover" field in a semi-structured way. That was held up by work on Dataverse 5 and he's "out of the office" this week.

In the meantime, @JingMa87 the most recent solution you described made me wonder if I really understand the goal of this PR. I've been assuming that you've been working on improving the DDI Codebook xml that Dataverse produces in order to help with a migration effort you're working on. Is that right? If so, are there DDI xml files you could share that you need to help migrate into Dataverse? I'm thinking that some examples could help us weigh the pros and cons of different solutions.

The same thing with examples from other non-Dataverse based repos that are using DDI Codebook to store and distribute geospatial metadata, especially Nesstar-based repos. Several admins of Nesstar-based repos have expressed interest in moving to Dataverse, using the DDI Codebook metadata that software produces. In the Dataverse community forum thread at https://groups.google.com/g/dataverse-community/c/kben6-8jRxE I describe the most recent interest, and I could follow up with them. ADA also migrated from Nesstar, they were in the DDI Codebook mapping weeds for some time and might be able to share helpful info about how they migrated their geospatial metadata.

@JingMa87
Copy link
Contributor Author

@jggautier Although at our organization we're busy migrating, I'm not actually making this fix for that migration effort. The sole goal here is to have the DDI be valid according to the codebook.

@jggautier
Copy link
Contributor

If Dataverse imports DDI Codebook metadata that contains...

<sumDscr>
    <geogCover>Netherlands</nation>
    <geogCover>The Hague</geogCover>
    <geogCover>South Holland</geogCover>
    <geogCover>Segbroek</geogCover>
    <geogCover>United States</nation>
    <geogCover>San Jose</geogCover>
    <geogCover>California</geogCover>
    <geogCover>Silicon Valley</geogCover>
</sumDscr>

how will it know that there are two locations?

Right now, Dataverse is using DDI-C's <nation> element to figure out where locations begin and end. But if Dataverse sees no <nation> elements...

<sumDscr>
    <geogCover>The Hague</geogCover>
    <geogCover>South Holland</geogCover>
    <geogCover>Segbroek</geogCover>
    <geogCover>San Jose</geogCover>
    <geogCover>California</geogCover>
    <geogCover>Silicon Valley</geogCover>
</sumDscr>

on import it combines every value of <geogCover> into one Dataverse Geographic Coverage "Other" field:

Other: The Hague, South Holland, Segbroek, San Jose, California, Silicon Valley

Since the DDI-C schema technically doesn't allow us to rely on the placement of the <nation> element to figure out where one geographic location ends and another begins, and if we want to avoid putting semi-structured text in the freeform field <geogCover>, I think another set of options would be:

  1. Concatenate all values of a location into one <geogCover> element. So when someone enters dataset metadata like...

Country/Nation: United States
State/Province: California
City: San Jose
Other: Silicon Valley

Country/Nation: Netherlands
State/Province: South Holland
City: The Hague
Other: Segbroek

it's exported in DDI-C as:

<sumDscr>
    <geogCover>San Jose, California, United States, Silicon Valley<geogCover>
    <geogCover>The Hague, South Holland, Netherlands, Segbroek</geogCover>
<sumDscr>

This would mean that there would be a loss of granularity if someone tries to export DDI-C metadata from one Dataverse repository into another:

Other: San Jose, California, United States, Silicon Valley

Other: The Hague, South Holland, Netherlands, Segbroek

But I think that's acceptable considering how little structured granularity the DDI-C schema allows in this case, and Dataverse repositories have the option to use Dataverse's own JSON metadata export to exchange metadata among themselves.

  1. If Dataverse tries to import DDI that contains <nation> elements...
<sumDscr>
    <nation>Netherlands</nation>
    <nation>Netherlands</nation>
</sumDscr>

Dataverse could map them to its Geographic Coverage "Nation" field:

Country/Nation: United States

Country/Nation: Netherlands

I can't imagine that any other type of repository that exports DDI-C metadata is using both the <nation> and <geogCover> elements to describe a single location (unless they're publishing invalid DDI-C, too) so I think we wouldn't need to account for that situation.

And if a newer version of a Dataverse-based repository with these proposed import/export rules wants to import (such as harvest) from an older Dataverse repository that's using the existing import/export rules, those Dataverse repositories can avoid information loss by using Dataverse's own JSON metadata export.

Do you think this proposal makes sense @JingMa87?

@landreev
Copy link
Contributor

As Julian pointed out correctly, the DDI import code in Dataverse, as currently implemented, relies on the order of the "nation" and "geogCover" tags for properly placing the field values into separate compound metadata fields.
i.e. in this example:

<sumDscr>
    <nation>Netherlands</nation>
    <geogCover>The Hague</geogCover>
    <nation>United States</nation>
    <geogCover>San Jose</geogCover>
</sumDscr>

the order of the fields is the only thing that tells Dataverse to group "San Jose" with "United States".
Yes, this is wrong and illegal. Both in terms of breaking the DDI schema; and in the very notion of relying on the order of XML fields for meaningful purposes. But that was the reason and rational for why the export was implemented like that in the first place. Being able to export and re-import, between Dataverses, without losing any data, was the goal. And that means that we need to address both sides, the export and import, at the same time - rather than just fixing the order of the fields in export.

BTW, I am assuming that the attributes source="DVN" that we used to have throughout the DDI - that were also illegal, and that we appear to have dropped recently - were intended for the same purpose: So that the import code could tell that this DDI was produced by another Dataverse, and apply some special, Dataverse-proprietary rules/assumptions about the meaning of certain fields. Again, not necessarily a good solution, but that was the original goal.

We may however need to reconsider that goal now - i.e. whether this is still a hard requirement, to be able to perform the export-and-import round trip, and end up with a perfectly identical dataset (at least with the metadata fields that can be exported to DDI). It's important to keep in mind that when this project was started, it was social sciences-centered; and the DDI was considered to be the primary format for exchanging metadata between instances. Neither is true anymore. So it would be a valid argument, that it may be irrelevant, how Dataverse handles DDI produced by another Dataverse... Simply because nobody should be doing that? Because our proprietary JSON - that's actually designed to encode all the possible metadata fields unambiguously - is the format for exchanging between Dataverses.

I'm not sure we are there yet. We may still have some cases where we don't have a JSON round trip guaranteed to work properly. And where using the DDI would actually produce better results. Also, we may still be recommending DDI as a migration solution in some situations. (and there are still installations that are only using the citation and social sciences metadata exclusively).

But my main point is that we should address all of this and maybe make some decisions for going forward, before we proceed to eliminate the remaining idiosyncrasies in the export.

I'll be adding more info/thoughts...

@landreev
Copy link
Contributor

(I know I repeated a few things in the last comment that must have been obvious to everybody else who was already part of the discussion. I wanted to emphasize the part about the order being treated as meaningful. For the sake of other developers who may be joining in).

From a quick look at the DDI import code, it obviously relies on the order in which the "nation" and "geogCover" are intermixed; and there is no attempt to
only apply that logic if the DDI is confirmed to have been produced by another Dataverse. This is clearly wrong, and needs to be changed.

As I said, we may want to reconsider how important it is to ensure a 1:1 export-import round trip via DDI. We all agree that we don't want to have that at the expense of producing XML that's illegal under the DDI schema. But I could accept that maybe it's not even necessary, for example, to be able to export a Dataverse compound field "geographicCoverage" with the child fields "country: USA" and "otherGeographicCoverage: Silicon Valley", and to have these 2 values guaranteed to end up in the same compound field, after an export and reimport... However, I feel like it is actually more important to ensure that we make sensible decisions when we try to import "arbitrary" DDI, that's not produced by another Dataverse. And making that hard-coded assumption about the order is almost certainly going to result in an unintended grouping of the imported values in that situation.

(Also worth noting, that, if I'm reading the code correctly, even with these export and import hacks currently in place, a 1:1 round trip is still not guaranteed! For example, a compound field
"geographicCoverage" ( "country: USA", "state: California")
will become
"geographicCoverage" ( "country: USA", "otherGeographicCoverage: California")
in the importing Dataverse...)

@landreev
Copy link
Contributor

So, for the purposes of wrapping up and closing this issue, I like the solution described above, of inserting structured text into a geogCover field:

Country/Nation: United States
State/Province: California
City: San Jose
Other: Silicon Valley

On the import side the code will need to check if the text found inside this tag matches the structure above; and if so, take the values from it and insert them into a single geographicCoverage compound metatada field. If the text inside the field is not structured, I believe we should simply treat every <geogCover> or <nation> tag inside a <sumDscr> section as independent, and not make any attempts to group them together. i.e.

<sumDscr>
   <nation>United States</nation>
   <geogCover>California</geogCover>
</sumDscr>

will produce TWO compound field values in the dataverse metadata on the importing end: geographicCoverage: (country: United States) and geographicCoverage: (otherGeographicCoverage: California).
I honestly feel like this is the best we can do. (i.e., we should not assume any structure or order where none is guaranteed!)

Another alternative to the above solution I can think of:
Since <sumDscr> is itself a repeatable section, instead of packaging structured text inside a <geogCover> field, we can simply export every geographicCoverage compound field value as its own complete <sumDscr> section:

<sumDscr>
   <nation>United States</nation>
   <geogCover>California</geogCover>
</sumDscr>
<sumDscr>
   <nation>United States</nation>
   <geogCover>Silicon Valley</geogCover>
</sumDscr>

I believe this would require fewer modifications on the import end. In fact, the current import code will import the fragments above as 2 separate geographicCoverage compound field values without any modifications. (although we still want to modify that code regardless, because of the assumptions built into it that we should not be making about DDIs produced elsewhere, see the comment above).

Thoughts?

@landreev
Copy link
Contributor

For the purposes of going forward with the DDI export and import in general - this is clearly outside the scope of this issue/PR, but something we need to address; I hope we can talk about this during the tech hour tomorrow. We should probably check the rest of the code for dataverse-specific hacks like the one above. Some of them may not be producing outright illegal XML, but may still not be ideal to keep, since they may be resulting in unintended behavior when importing non-Dataverse produced DDI. We should revisit the very notion of relying on special formatting in order to achieve a 1:1 export-import round trip. Simply because we should not view this format as intended for exchanging between dataverses. Producing valid, quality DDI and being able to import the format as produced by other institutions should be a higher priority.

@djbrooke
Copy link
Contributor

Hey @JingMa87 - do you have any thoughts on @landreev's feedback above? This may be more than you anticipated working on and it's fine if this additional work is not something you're able to take on. Let me know if you'd be interested in continuing the work here and, if so, let us know if you have any questions. Thanks for all the PRs so far!

@JingMa87
Copy link
Contributor Author

JingMa87 commented Sep 28, 2020

@djbrooke Thanks for the amazing thoughts on this topic! I also think that it makes sense to put structured text into a geogCover field:

Country/Nation: United States
State/Province: California
City: San Jose
Other: Silicon Valley

Unfortunately I don't have time to make these changes anytime soon, since I've taken on a role in the migration at DANS. Hopefully someone else has time for this!

@djbrooke
Copy link
Contributor

Hey @JingMa87 - thanks for letting us know. All the praise goes to @landreev :) We'll close this PR out, as we'll want to revisit this in the way outlined above. Thanks again!

@djbrooke djbrooke closed this Sep 28, 2020
@landreev
Copy link
Contributor

@djbrooke

... We'll close this PR out, as we'll want to revisit this in the way outlined above. Thanks again!

OK, great, but let's try to put it on the schedule soon... while we remember how we agreed to implement it, etc. etc. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some elements in the DDI XML response have an incorrect sequence
6 participants