Automate processing of IETF (group) drafts #1135

Merged
tidoust merged 2 commits into main from ietf-logic
Nov 22, 2023

@tidoust tidoust commented Nov 21, 2023

The code now recognizes IETF draft documents that have a datatracker.ietf.org URL:

  • It associates them with the IETF organization
  • It can compute a useful shortname (in theory, that code can return a truncated shortname because there is no direct way to check whether the Internet-Draft name contains a group ID).
  • It extracts the group's ID from the nightly URL (that code could be further improved to fetch the actual group name; right now, the code only knows about the "HTTP" working group).
  • It associates IETF documents from the HTTP WG to the right repository.
  • It computes the better-looking nightly URL at www.ietf.org or at httpwg.org for HTTP WG documents.

This makes it possible to simplify the IETF data in specs.json a bit.

Note that the code still cannot automatically process drafts that have been submitted by individuals, even when these drafts are targeted at a group. Such drafts should be associated with the individuals that submitted them and not with any group. A couple of spec entries, which incorrectly referenced the Network WG or the HTTP WG, were fixed accordingly in specs.json.

This fixes #1122, but note that the code does not need to fetch the datatracker page for the time being.
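
To make the URL handling described above more concrete, here is a minimal Python sketch (not the actual browser-specs code, which is JavaScript). It assumes working group drafts are named `draft-ietf-<group>-<title>` and that the shortname is the `<title>` part; the function name and the returned keys are hypothetical. Individual drafts are left untouched, since nothing in the name says whether the second token is a group ID.

```python
import re
from urllib.parse import urlparse

# Accepts an optional "/html/" segment and an optional two-digit revision
# suffix such as "-13". The pattern is an assumption for this sketch.
DRAFT_PATH = re.compile(r"^/doc/(?:html/)?(draft-[a-z0-9-]+?)(?:-\d{2})?/?$")


def parse_datatracker_draft(url):
    """Return basic info about a datatracker draft URL, or None if it is not one."""
    parsed = urlparse(url)
    if parsed.hostname != "datatracker.ietf.org":
        return None
    match = DRAFT_PATH.match(parsed.path)
    if match is None:
        return None

    name = match.group(1)
    parts = name.split("-")
    if parts[1] == "ietf" and len(parts) > 3:
        # Assumed naming convention: draft-ietf-<group>-<title>
        group_id = parts[2]
        shortname = "-".join(parts[3:])
    else:
        # Individual submission: the name cannot tell us whether a group
        # ID is present, so keep the full name rather than truncate it.
        group_id = None
        shortname = name

    return {
        "organization": "IETF",
        "name": name,
        "groupId": group_id,
        "shortname": shortname,
    }
```

For example, `parse_datatracker_draft("https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-digest-headers")` yields the group ID `httpbis` and the shortname `digest-headers`.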


@dontcallmedom dontcallmedom left a comment

This looks good to me, but while researching this PR, I discovered the datatracker API that I think would make a lot of this more reliable:
https://datatracker.ietf.org/api/

See e.g.:
https://datatracker.ietf.org/api/v1/doc/document/?name=draft-ietf-httpbis-digest-headers&format=json
https://datatracker.ietf.org/api/v1/group/group/1718/?format=json

tidoust commented Nov 22, 2023

Ah, I was looking into datatracker docs and was about to make the same point :)

Perhaps even simpler, there is a simplified JSON file that contains all the information we need and that would be more readily ingestible (if we use the API, we'll have to hop through a few endpoints to collect the info), e.g.:

https://datatracker.ietf.org/doc/draft-zern-webp/doc.json
https://datatracker.ietf.org/doc/rfc6797/doc.json

Also, I don't really understand how to get information about a document published as an RFC with the main API document endpoint, e.g. https://datatracker.ietf.org/api/v1/doc/document/?name=rfc6797&format=json does not work. It seems that one has to know the draft name instead, as in https://datatracker.ietf.org/api/v1/doc/document/?name=draft-ietf-websec-strict-transport-sec&format=json and the easiest way to map the rfc to the draft name would be to fetch the related doc.json.

(Edit: The datatracker API way to map the rfc to the draft name seems to be through the /doc/docalias endpoint: https://datatracker.ietf.org/api/v1/doc/docalias/rfc6797/)
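
As a rough illustration of the `doc.json` approach, the Python sketch below builds the URL from a document name following the two examples above and fetches it. The helper names are hypothetical, and the shape of the returned JSON is not assumed here.

```python
import json
from urllib.request import urlopen


def doc_json_url(name):
    """Build the doc.json URL for a draft or RFC name, e.g. "rfc6797"."""
    return f"https://datatracker.ietf.org/doc/{name}/doc.json"


def fetch_doc_json(name, timeout=10):
    """Fetch and decode the simplified JSON description of a document."""
    with urlopen(doc_json_url(name), timeout=timeout) as response:
        return json.load(response)
```

The same pattern works for drafts and RFCs, which avoids the draft/RFC name mapping problem mentioned above.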

The code now fetches all the information it needs for IETF drafts and RFCs from
the IETF datatracker using the Simplified Documents API:
https://datatracker.ietf.org/api/#simplified-documents

This makes it possible to retrieve the latest revision of a document to build
the nightly URL, and to fetch information about the group that standardizes the
document, if any.
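
A minimal Python sketch of that step, under stated assumptions: the document description is assumed to expose `name`, `rev` and `group` keys, and the two URL patterns are assumptions for illustration rather than something this commit confirms. The sample payload is made up, not a recorded datatracker response.

```python
def summarize_document(doc):
    """Derive a nightly URL and group info from a document description."""
    name = doc["name"]
    if name.startswith("rfc"):
        # Assumed URL pattern for published RFCs
        nightly = f"https://www.rfc-editor.org/rfc/{name}"
    else:
        # Assumed URL pattern for the latest revision of a draft
        nightly = f"https://www.ietf.org/archive/id/{name}-{doc['rev']}.html"

    group = doc.get("group") or {}
    return {
        "nightly": nightly,
        "groupName": group.get("name"),
        "groupAcronym": group.get("acronym"),
    }


# Illustrative payload only
sample = {
    "name": "draft-ietf-httpbis-digest-headers",
    "rev": "13",
    "group": {"name": "HTTP", "acronym": "httpbis"},
}
summary = summarize_document(sample)
```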

IETF documents may be linked to a group, an area, or be part of what IETF calls
individual submissions. Areas and individual submissions still link to a "group"
page at IETF, so the code just takes that info from datatracker as-is. As a
result, individual submissions are no longer associated with the author who
submitted the document, but that does not seem needed in any case.

The code throws when an IETF document that it knows under a certain name has
been published under a different name, to alert us that the canonical URL needs
to change in browser-specs. Name changes typically happen when a document
transitions to a working group, or when it gets published as an RFC.
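
The name check could look roughly like the Python sketch below. The exception class, the function name and the `name` key are hypothetical, chosen for illustration only.

```python
class NameChangedError(Exception):
    """Raised when datatracker reports a different name for a known document."""


def check_canonical_name(known_name, doc):
    """Return the document name, or raise if it no longer matches."""
    current_name = doc["name"]
    if current_name != known_name:
        raise NameChangedError(
            f'"{known_name}" is now published as "{current_name}", '
            "the canonical URL needs to be updated"
        )
    return current_name
```

Failing loudly here is a deliberate choice: silently following the new name would leave a stale canonical URL in the list.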
@tidoust tidoust merged commit db649a8 into main Nov 22, 2023
1 check passed
@tidoust tidoust deleted the ietf-logic branch November 22, 2023 13:59
tidoust added a commit that referenced this pull request Nov 23, 2023
Take 3 :)

PR #1135 actually had a couple of issues that made the code essentially useless
because it only ran on a handful of IETF specs:
- the code favored info from Specref over info from IETF
- the code only really applied to drafts due to a buggy RegExp

Fixing these problems yielded a new issue: the assumption that HTTP WG specs
are always available under `httpwg.org` turns out to be wrong. Also, there are
other specs that are not published by the HTTP WG but that still have an
`httpwg.org` version. The code now looks at the actual list of specs in the
underlying GitHub repository: https://github.com/httpwg/httpwg.github.io.
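
A minimal Python sketch of that lookup, assuming the list of files in the repository has already been retrieved as a set of paths. The directory names and the `<name>.html` file naming are assumptions for illustration, not a description of the repository's actual layout.

```python
def pick_nightly_url(name, default_url, httpwg_paths,
                     directories=("specs", "http-extensions")):
    """Prefer the httpwg.org version of a spec when the repository has one."""
    for directory in directories:
        path = f"{directory}/{name}.html"
        if path in httpwg_paths:
            return f"https://httpwg.org/{path}"
    # No httpwg.org version: keep the URL computed from datatracker info
    return default_url


# Hypothetical listing, not the real repository contents
listing = {"specs/rfc9110.html"}
chosen = pick_nightly_url(
    "rfc9110", "https://www.rfc-editor.org/rfc/rfc9110", listing)
fallback = pick_nightly_url(
    "rfc6797", "https://www.rfc-editor.org/rfc/rfc6797", listing)
```

Because the decision depends only on the listing, it no longer matters which group published the spec.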

As a result, the nightly URL of all IETF specs that have an `httpwg.org`
version now targets that version, implementing the suggestion in #933 (see
that issue for the list of affected specs). A companion PR was sent to Specref
to implement a similar switch there:
tobie/specref#766

The code also looks at the obsolescence data in datatracker and sets the
`standing` and `obsoletedBy` properties accordingly. This fixes #327.
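
As a sketch of the obsolescence handling in Python: the function below takes a spec entry and a list of names that obsolete it. How that list is read from datatracker is not shown, and the `"discontinued"` value for `standing` is an assumption.

```python
def apply_obsolescence(spec, obsoleted_by):
    """Return a copy of the spec entry updated with obsolescence info."""
    updated = dict(spec)
    if obsoleted_by:
        updated["standing"] = "discontinued"  # assumed value
        updated["obsoletedBy"] = list(obsoleted_by)
    return updated


entry = apply_obsolescence(
    {"shortname": "rfc7230", "standing": "good"},
    ["rfc9110", "rfc9112"],
)
```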
tidoust added a commit that referenced this pull request Nov 23, 2023
Take 3 :)

PR #1135 actually had a couple of issues that made the code essentially useless
because it only ran on a handful of IETF specs:
- the code favored info from Specref over info from IETF
- the code only really applied to drafts due to a buggy RegExp

Fixing these problems yielded a new issue: the assumption that HTTP WG specs
are always available under `httpwg.org` turns out to be wrong. Also, there are
other specs that are not published by the HTTP WG but that still have an
`httpwg.org` version. The code now looks at the actual list of specs in the
underlying GitHub repository: https://github.com/httpwg/httpwg.github.io.

As a result, the nightly URL of all IETF specs that have an `httpwg.org`
version now targets that version, implementing the suggestion in #937.
A companion PR was sent to Specref to implement a similar switch there:
tobie/specref#766

The code also looks at the obsolescence data in datatracker and sets the
`standing` and `obsoletedBy` properties accordingly. This fixes #327.
Successfully merging this pull request may close these issues.

Learn to parse IETF datatracker pages