DOI references not matching RFC 2629 DTD #228

Open
ronaldtse opened this issue Jul 6, 2022 · 17 comments
Assignees
Labels
bug Something isn't working Crossref Related to Crossref integration for DOI retrieval retrieval Generally related to API endpoints for searching and retrieving citation data

Comments

@ronaldtse
Collaborator

From @kesara ietf-tools/xml2rfc#804 (comment)

Tests are failing because reference.DOI.10.1145/2975159 doesn't have date element under front element.
This violates rfc2629.dtd.

Originally posted by @kesara in ietf-tools/xml2rfc#804 (comment)

@ronaldtse ronaldtse added the bug Something isn't working label Jul 6, 2022
@ronaldtse
Collaborator Author

@stefanomunarini can we add tests to validate bibitems (selection of tests across all datasets) against the BibXML schema?
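A minimal sketch of what such a check could look like, assuming plain structural checks with the standard library rather than full DTD or schema validation. The function name `check_reference` and the sample document are hypothetical and are not part of bibxml-service or relaton-py. It only reports findings; whether a missing `<date>` counts as a failure is left to the caller.

```python
import xml.etree.ElementTree as ET


def check_reference(xml_text):
    """Return a list of human-readable findings for one BibXML <reference>."""
    findings = []
    root = ET.fromstring(xml_text)
    if root.tag != "reference":
        findings.append(f"root element is <{root.tag}>, expected <reference>")
        return findings
    front = root.find("front")
    if front is None:
        findings.append("missing <front>")
        return findings
    title = front.find("title")
    if title is None or not (title.text or "").strip():
        findings.append("missing or empty <front>/<title>")
    if front.find("date") is None:
        findings.append("missing <front>/<date>")
    return findings


# Hypothetical sample shaped like the failing item: a title but no <date>.
sample = """<reference anchor="DOI.10.1145/2975159">
  <front>
    <title>Jupiter rising</title>
    <author fullname="Placeholder Author"/>
  </front>
  <seriesInfo name="DOI" value="10.1145/2975159"/>
</reference>"""

print(check_reference(sample))  # ['missing <front>/<date>']
```

A test suite could run a function like this over a sample of items from each dataset and collect the findings per source.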

@kesara
Member

kesara commented Jul 6, 2022

bibxml-service reference: https://bib.ietf.org/public/rfc/bibxml-doi/reference.DOI.10.1145/2975159.xml
current tools.ietf.org reference: http://xml2rfc.tools.ietf.org/public/rfc/bibxml-doi/reference.DOI.10.1145/2975159.xml

New bibxml-service output's title is incomplete.
Also it lacks <seriesInfo name="Communications of the ACM" value="Vol. 59, pp. 88-97"/>.

@ronaldtse
Collaborator Author

ronaldtse commented Jul 6, 2022

Incomplete title

Original: "Jupiter rising: a decade of clos topologies and centralized control in Google's datacenter network"
New: "Jupiter rising"

Right, this needs to be fixed (@strogonoff ). I think it may be fixed by #215 (@stefanomunarini ).

<seriesInfo name="Communications of the ACM" value="Vol. 59, pp. 88-97"/>.

@kesara while this could be useful, in <seriesInfo>, the "name" attribute value is explicitly invalid according to RFC 7991:

2.47.3.  "name" Attribute (Mandatory)

   The name of the series.  The currently known values are "RFC",
   "Internet-Draft", and "DOI".  The RFC Series Editor may change this
   list in the future.

@strogonoff
Collaborator

strogonoff commented Jul 6, 2022

  • Yes, we don’t adapt any dates from Crossref to Relaton format yet. It looks like Feat/crossref integration expansion #215 needs to expand on that ASAP.

  • <title>: it looks like IETF xml2rfc tools concatenated title and subtitle using a colon. Relaton-py could do that if that’s reliable. Currently, relaton-py’s bibxml serializer doesn’t do any such title adaptation and ends up using the first available title when serializing to BibXML. We can either change BibXML serialization in relaton-py, or change the way we format the main title when parsing Crossref data in bibxml-service.

  • “Communications of the ACM” is apparently taken from container-title, we could use that when creating a bibliographic item from Crossref data if that’s always how it should be parsed.

    Edit: It appears that two pending PRs by @stefanomunarini to bibxml-service and relaton-py make it so that container-title is used to define bibliographic item locality, and locality is used by relaton.serializers.bibxml to generate <seriesInfo>. Which looks like what we want! I think it’s on me that they have not been merged yet…

Can anyone point to the preexisting Crossref API handler in IETF’s xml2rfc tools (i.e., what code runs under /public/rfc/bibxml-doi/)? https://github.com/ietf-tools/xml2rfc-bibxml doesn’t seem to have it 🤔

@rjsparks
Member

rjsparks commented Jul 6, 2022

What you're looking for is in the RFP, in the section for bibxml7.

@ronaldtse
Collaborator Author

I'm a little perplexed: our doi2ietf already implements dates, so why are they not serialised into BibXML?

Yes we need to adopt the dates from the Crossref API and map them to the Relaton model.

Relaton supports these date/time types:

Crossref metadata includes the following date/times:

  • indexed: ignore
  • created: Relaton created
  • deposited: Relaton created
  • issued: Relaton issued
  • published: Relaton published
  • published-online: ignore

  • <title>: it looks like IETF xml2rfc tools concatenated title and subtitle using a colon. Relaton-py could do that if that’s reliable. Currently, relaton-py’s bibxml serializer doesn’t do any such title adaptation and ends up using the first available title when serializing to BibXML. We can either change BibXML serialization in relaton-py, or change the way we format the main title when parsing Crossref data in bibxml-service.

We should concatenate the Crossref title and subtitle at the doi2ietf level.
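A rough sketch of that concatenation, assuming the Crossref work record exposes `title` and `subtitle` as lists of strings and that only the first entry of each is used. The helper name `join_title` is hypothetical, and the sample record is built from the two titles quoted earlier in this thread rather than copied from a Crossref response.

```python
def join_title(work):
    """Join the first Crossref title and subtitle with a colon, if both exist."""
    titles = work.get("title") or []
    subtitles = work.get("subtitle") or []
    main = titles[0].strip() if titles else ""
    sub = subtitles[0].strip() if subtitles else ""
    if main and sub:
        return f"{main}: {sub}"
    # Fall back to whichever part is present (or an empty string).
    return main or sub


# Assumed shape of the relevant fields for the item discussed in this issue.
work = {
    "title": ["Jupiter rising"],
    "subtitle": [
        "a decade of clos topologies and centralized control "
        "in Google's datacenter network"
    ],
}

print(join_title(work))
```

Records without a subtitle pass through unchanged, so the adaptation should not affect items that already serialize with a complete title.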

  • “Communications of the ACM” is apparently taken from container-title, we could use that when creating a bibliographic item from Crossref data if that’s always how it should be parsed.

As I pointed out in https://github.com/ietf-ribose/bibxml-service/issues/228#issuecomment-1175771552 , we really want explicit permission from @rjsparks that this is correct usage of <seriesInfo>. Thanks.

@strogonoff
Collaborator

strogonoff commented Jul 7, 2022

@ronaldtse

our doi2ietf already implements dates, so why are they not serialised into BibXML?

We are not using doi2ietf for at least these two reasons:

  1. doi2ietf-py used obsolete, unsupported Crossref API to retrieve data.
  2. doi2ietf-py’s purpose was to transform Crossref API data directly into BibXML, not to Relaton. We don’t need that; we need to transform Crossref to Relaton, and use relaton-py’s bibxml serializer after that.

With that in mind, it was faster to bypass doi2ietf-py and implement this directly in bibxml-service and relaton-py.

Yes we need to adopt the dates from the Crossref API and map them to the Relaton model.

Yes, @stefanomunarini’s PRs should take care of all that. They aim to port the requisite functionality from doi2ietf-py into both the bibxml-service Crossref DOI parser and the relaton-py serializer. I’ll merge them once we confirm that the new <seriesInfo name> values are acceptable, because the PRs include that change as well.
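For illustration, the date mapping proposed earlier in this thread (Crossref `created` and `deposited` to Relaton `created`, `issued` to `issued`, `published` to `published`, everything else ignored) could be sketched as below. This is not the code in the pending PRs. It assumes Crossref dates arrive as `{"date-parts": [[year, month, day]]}` with month and day optional, it returns plain dictionaries rather than relaton-py model objects, and it lets `created` win over `deposited` when both are present, which is an arbitrary choice. The sample values are made up.

```python
# Crossref field -> Relaton date type; fields not listed here are ignored.
CROSSREF_TO_RELATON = {
    "created": "created",
    "deposited": "created",
    "issued": "issued",
    "published": "published",
}


def format_date_parts(parts):
    """Turn [2001, 2, 3], [2001, 2] or [2001] into an ISO-like string."""
    year, rest = parts[0], parts[1:3]
    return "-".join([f"{year:04d}"] + [f"{p:02d}" for p in rest])


def map_dates(work):
    """Return a list of {"type", "value"} dicts, one per Relaton date type."""
    dates = {}
    for field, relaton_type in CROSSREF_TO_RELATON.items():
        date_parts = (work.get(field) or {}).get("date-parts") or []
        # Skip absent or empty dates such as {"date-parts": [[None]]}.
        if date_parts and date_parts[0] and date_parts[0][0] is not None:
            # The first field seen for a type wins (assumption).
            dates.setdefault(relaton_type, format_date_parts(date_parts[0]))
    return [{"type": t, "value": v} for t, v in dates.items()]


# Made-up values, for illustration only.
work = {
    "indexed": {"date-parts": [[2005, 6, 7]]},
    "created": {"date-parts": [[2001, 2, 3]]},
    "deposited": {"date-parts": [[2004, 5, 6]]},
    "issued": {"date-parts": [[2001, 2]]},
    "published": {"date-parts": [[2001]]},
}

print(map_dates(work))
```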

@strogonoff strogonoff pinned this issue Jul 7, 2022
@strogonoff strogonoff added retrieval Generally related to API endpoints for searching and retrieving citation data Crossref Related to Crossref integration for DOI retrieval labels Jul 7, 2022
@rjsparks
Member

rjsparks commented Jul 7, 2022

@ronaldtse It is expected that seriesinfo will have more than the 3 possible names listed in 7991. We will make sure that gets clarified in 7991bis. A better thing to read at the moment is the seriesInfo entry at https://authors.ietf.org/en/rfcxml-vocabulary

@ajeanmahoney
Collaborator

Note that the RPC uses seriesInfo for documents that are part of a series and have a unique value. Examples of document series include RFC, IEEE Std, ITU Recommendation, DOI, 3GPP TR, 3GPP TS, ISO/IEC, and FIPS PUB.
The RPC uses refcontent to capture journal or conference proceedings information: journal or conference title, volumes, pages, conference location, etc. For example,

<refcontent>Communications of the ACM, Vol. 59, pp. 88-97</refcontent>

@ronaldtse
Collaborator Author

Thanks @rjsparks @ajeanmahoney .

Valid values for <seriesInfo> "name"

Is the seriesInfo value from a controlled vocabulary or free form text? If the former, it would be great to have the specifications.

https://authors.ietf.org/en/rfcxml-vocabulary seems to describe the "name" attribute as the name of the standardization organization outside of IETF ("other names such as "ISO", "W3C" for exist for other standardisation organisations")

Is "name" supposed to take the "series name" or the "organization name"?

From the illustrative list provided it looks like it is the "series name" (which makes sense given the element name), not the "organization name".

Some questions regarding the example list:

  1. I understand the separation of "3GPP TS" and "3GPP TR". Are "ISO/IEC", "ISO/IEC TR" and "ISO/IEC TS" also separated (and there are other deliverable types as well)?
  2. IEEE offers other deliverable types that are not standards, such as "Recommended Practices" and "Guidelines". Should they be considered series?
  3. "ITU Recommendations" are published as "ITU-T Recommendations" and "ITU-R Recommendations". They also have a dozen deliverable types. Are they supported?
  4. NIST and W3C are also supported by bibxml-service.

Proper structuring of a DOI entry in BibXML

The item in question has source metadata provided through this Crossref link:

Notice that "Communications of the ACM" appears in container-title.

As specified by @ajeanmahoney , this information is to be in <refcontent>, not <seriesInfo>, and should look like this:

<refcontent>Communications of the ACM, Vol. 59, pp. 88-97</refcontent>

This formatted reference string can only be built from the raw Crossref metadata, by also including these elements:

"page":"88-97",
"volume":"59"
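As a sketch of how that string might be assembled, assuming `container-title` is a list of strings and `volume` and `page` are plain strings in the Crossref record. The function name `build_refcontent` is hypothetical, and the "Vol." and "pp." labels simply follow the example above; a real implementation would probably want "p." for single pages.

```python
import xml.etree.ElementTree as ET


def build_refcontent(work):
    """Build 'Journal, Vol. N, pp. A-B' from whichever fields are present."""
    parts = []
    container = work.get("container-title") or []
    if container:
        parts.append(container[0])
    if work.get("volume"):
        parts.append(f"Vol. {work['volume']}")
    if work.get("page"):
        parts.append(f"pp. {work['page']}")
    return ", ".join(parts)


# Only the fields quoted in this thread; the rest of the record is omitted.
work = {
    "container-title": ["Communications of the ACM"],
    "volume": "59",
    "page": "88-97",
}

element = ET.Element("refcontent")
element.text = build_refcontent(work)
print(ET.tostring(element, encoding="unicode"))
# <refcontent>Communications of the ACM, Vol. 59, pp. 88-97</refcontent>
```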

I would like to confirm with @ajeanmahoney that:

  1. Every BibXML item generated from DOI will be using refcontent, not seriesInfo.
  2. We will programmatically construct the formatted reference string in refcontent using Crossref metadata. This is about citation rendering.

Thanks!

@ajeanmahoney
Collaborator

seriesInfo name and value attributes take freeform text. The name attribute holds the name of the series. The RPC uses the following seriesInfo names:

  • 3GPP TR
  • 3GPP TS
  • BCP
  • DOI
  • FIPS PUB
  • ISBN
  • IEEE Std (note the lack of a period at the end)
  • ISO/IEC
  • ISSN
  • ITU Recommendation
  • RFC

These are what we have identified so far. We will be discussing this list this week.
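Since the attribute is freeform, a list like this cannot be enforced, but it could still drive a soft check. The sketch below is one possible shape for that, assuming the names above are stored as a plain set (with "ISBN" spelled as it is later in this thread). The function name `unfamiliar_series_names` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Series names listed by the RPC in this thread; expected to change over time.
RPC_SERIES_NAMES = {
    "3GPP TR", "3GPP TS", "BCP", "DOI", "FIPS PUB", "ISBN",
    "IEEE Std", "ISO/IEC", "ISSN", "ITU Recommendation", "RFC",
}


def unfamiliar_series_names(xml_text):
    """Return seriesInfo names in the document that are not on the list."""
    root = ET.fromstring(xml_text)
    names = [el.get("name", "") for el in root.iter("seriesInfo")]
    return [name for name in names if name not in RPC_SERIES_NAMES]


# Hypothetical item that puts journal information into <seriesInfo>.
sample = """<reference anchor="DOI.10.1145/2975159">
  <front><title>Jupiter rising</title></front>
  <seriesInfo name="DOI" value="10.1145/2975159"/>
  <seriesInfo name="Communications of the ACM" value="Vol. 59, pp. 88-97"/>
</reference>"""

print(unfamiliar_series_names(sample))  # ['Communications of the ACM']
```

A hit here is only a hint that the value may belong in `<refcontent>` instead; it is not a validation error.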

@ronaldtse
Collaborator Author

Thanks @ajeanmahoney. Since there's going to be a discussion, if you don't mind, let us provide some additional input 😉

Basis:

  • RFC, BCP are IETF-related series and there is no issue
  • ISBN, ISSN and DOI are internationally standardized identifiers (ISO standards), no problems with those

Questions:

  1. I believe seriesInfo name should support all series that BibXML service supports today (as part of the ietf-tools suite), including those published by the following organizations:
    • 3GPP
    • IEEE
    • NIST
    • W3C
    • IANA
  2. Consider whether seriesInfo name (for organizations external to IAB/IETF) represents the name of the SDO, or a document type of the SDO. Developers and users would certainly prefer a consistent application. Amongst values supported today:
    • document types: 3GPP TS, 3GPP TR, ITU Recommendation, IEEE Std
    • organization name: ISO/IEC
    • series name: FIPS PUB (published by the Department of Commerce as executed by NIST)

Thanks!

@strogonoff
Collaborator

strogonoff commented Jul 28, 2022

Tests are failing because reference.DOI.10.1145/2975159 doesn't have date element under front element. This violates rfc2629.dtd.

Originally posted by @kesara in ietf-tools/xml2rfc#804 (comment)

Can I clarify where <date> is required? It’s not in this spec.

@ronaldtse While this particular issue may have been resolved, since we can rely on DOI to provide at least one date, we cannot be so sure with some other sources.

For example, we have recently found that some 3GPP documents are lacking dates, and this may be the case with other sources.

  • We may be facing the same choice as with missing authors: some authoritative sources simply seem to not provide the data, so we could either violate the spec or we could provide a placeholder date that is not real.
  • Regardless of that choice, it seems warranted to go through all sources and take note of all items that are missing any dates. Perhaps in some cases, like with 3GPP, we could reuse some auxiliary data available to us.

@ajeanmahoney
Collaborator

There are some cases where a date is never provided in a bib entry (IANA registry entries, for instance).

Sometimes, an author points to a landing page for a spec (a 3GPP or IEEE entry may fall into this category). Those kinds of entries don't have dates. I haven't looked to see if the bibxml-service datastore contains landing-page references.

@rjsparks
Member

References without dates are syntactically legal and appropriate in cases like those Jean calls out above. But when the document does have a publication date (as with the original DOI this ticket was opened for), the date must be provided, well formed, in the reference.

@rjsparks
Member

I think I've pointed this out in other places, but rfc2629.dtd is not v3 rfcxml - it is strict v2, and while we want to be v2 backwards compatible as much as we can be, there are many RFCs in the v2 era that were published with references that didn't contain dates. In short, date cannot be treated as a mandatory element here.
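One way to read the last two comments is that a serializer should emit `<date>` whenever the source provides one and leave it out otherwise, rather than inventing a placeholder. A minimal sketch of that behaviour follows, assuming a hypothetical `build_front` helper that is not relaton-py's serializer. It omits `<author>` for brevity and writes the month as an English name taken from `calendar.month_name`, which depends on the process locale.

```python
import calendar
import xml.etree.ElementTree as ET


def build_front(title, year=None, month=None, day=None):
    """Build a <front> element; add <date> only when at least a year is known."""
    front = ET.Element("front")
    ET.SubElement(front, "title").text = title
    if year is not None:
        attrs = {"year": str(year)}
        if month is not None:
            attrs["month"] = calendar.month_name[month]
            # A day is only meaningful when the month is known.
            if day is not None:
                attrs["day"] = str(day)
        ET.SubElement(front, "date", attrs)
    return front


# An undated entry (e.g. a landing page) gets no <date> element at all.
undated = build_front("Example landing page")
# A dated entry carries whatever precision the source provided.
dated = build_front("Example article", year=2016, month=8)

print(ET.tostring(undated, encoding="unicode"))
print(ET.tostring(dated, encoding="unicode"))
```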

5 participants