Revamp GitHub tag versions, debugging individual projects, and more! #237

MattTheCuber · 2024-12-31T17:52:51Z

Closes #236

TODO

Decide if pre-releases, release candidates, post fix releases, dev releases, etc. count as releases.
Decide what to do with Cataclysm: Dark Days Ahead's releases, like 0.H, which are not PEP 440 compliant.
Update README.md.
Generate and manually validate projects.json.

Summary

This PR includes the following changes and features:

Created classes for handling data and grouping functionality in tools\gen_projects_json.py.
Rewrote how GitHub tags are parsed in tools\gen_projects_json.py using the PEP 440 compliant standard and custom regex on a per-project basis.
Added subcommands to tools\gen_projects_json.py for generate, info, and tags.
Refactored and simplified projects.yaml to not require nearly as much manual input.

Tag parsing

Tag parsing is much more generic now, with the project-specific handling moved to entries in the projects.yaml file. The base implementation simply tries to parse the tag name as a PEP 440 compliant Version object (from packaging.version). If that fails, the tag is ignored for all aspects of the data generation (first, latest, count, etc.). Many repositories (maybe a third) use non-compliant tag names. To solve this, each project can define custom regexes to apply to tags. For example, the Vala project uses tags that look like this: VALA_0_0_0. The updated entry adds a custom regex to convert this to a compliant version:

  - name: Vala
    gh_url: https://github.com/GNOME/vala
    tag_regex_subs:
      - search: ^VALA_(\d+)_(\d+)_(\d+)$
        replace: \1.\2.\3

Additionally, many projects use tag name prefixes. For example, the StreamEx project uses versions that look like streamex-0.8.3. To fix this, simply remove the prefix with this regex:

  - name: StreamEx
    gh_url: https://github.com/amaembo/streamex
    tag_regex_subs:
      - remove: ^streamex-

This system feels miles better (no offense intended). It also simplifies the code quite a bit and enables automatic parsing for many projects that previously could not be handled automatically (React, FreeCAD, Haskell bytestring, OpenSSL, MAME, Window Maker, ReactOS, three.js, google-api-client, rand, distlib, etc.).
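
As a rough illustration of the substitution step described above, here is a minimal sketch using only the standard library. The function name apply_tag_subs and the sample tag VALA_0_56_17 are hypothetical and not taken from the PR; the keys (search, replace, remove) mirror the projects.yaml entries shown above. The search pattern here places the + inside each group, (\d+), so multi-digit components are captured whole. The resulting string would then be handed to packaging.version.Version, which is not shown here.

```python
import re


def apply_tag_subs(tag: str, subs: list[dict[str, str]]) -> str:
    """Apply each configured substitution in order and return the rewritten tag.

    Hypothetical helper: `subs` mirrors a project's tag_regex_subs list.
    """
    for sub in subs:
        if "remove" in sub:
            # A "remove" entry deletes whatever the pattern matches.
            tag = re.sub(sub["remove"], "", tag)
        else:
            # A "search"/"replace" pair rewrites the tag using backreferences.
            tag = re.sub(sub["search"], sub["replace"], tag)
    return tag


vala_subs = [{"search": r"^VALA_(\d+)_(\d+)_(\d+)$", "replace": r"\1.\2.\3"}]
streamex_subs = [{"remove": r"^streamex-"}]

print(apply_tag_subs("VALA_0_56_17", vala_subs))        # 0.56.17
print(apply_tag_subs("streamex-0.8.3", streamex_subs))  # 0.8.3
```

Tags that match none of the patterns pass through unchanged, so the later PEP 440 check decides whether they are kept.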

gen_projects_json.py CLI

gen_projects_json.py --help

> python .\tools\gen_projects_json.py --help
usage: gen_projects_json.py [-h] [-u USER] [-k TOKEN] [--disable-caching] {generate,info,tags} ...

Generate or update project.json using projects.yaml.

positional arguments:
  {generate,info,tags}  Available commands
    generate            Generate an updated projects.json file.
    info                Print automatically pulled info for a GitHub project for debugging.
    tags                Print all sorted tags for a GitHub project for debugging.

options:
  -h, --help            show this help message and exit
  -u, --user USER       GitHub Username for API authentication. Falls back to the "GH_USER" environment variable.
  -k, --token TOKEN     A path to a file containing a GitHub personal access token for API authentication. Falls back to the "GH_TOKEN" environment variable.
  --disable-caching     Flag to disable caching. Falls back to the "ZV_DISABLE_CACHING" environment variable.
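
For readers unfamiliar with argparse subcommands, a layout like the one in this help text could be wired up roughly as follows. This is a sketch under assumptions, not the script's actual code: build_parser is a hypothetical name, and for simplicity the shared options are attached to the subcommands only, whereas the usage line above shows them on the top-level parser as well.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    # Shared options live on parent parsers so each subcommand can reuse them.
    auth = argparse.ArgumentParser(add_help=False)
    auth.add_argument("-u", "--user", default=os.getenv("GH_USER"),
                      help="GitHub username for API authentication.")
    auth.add_argument("-k", "--token", default=os.getenv("GH_TOKEN"),
                      help="Path to a file containing a personal access token.")

    caching = argparse.ArgumentParser(add_help=False)
    caching.add_argument("--disable-caching", action="store_true",
                         default=bool(os.getenv("ZV_DISABLE_CACHING")),
                         help="Disable caching.")

    parser = argparse.ArgumentParser(prog="gen_projects_json.py")
    commands = parser.add_subparsers(dest="command", required=True,
                                     help="Available commands")

    # "generate" takes the auth and caching options; "info" and "tags"
    # take the auth options plus one positional argument.
    commands.add_parser("generate", parents=[auth, caching])
    for name in ("info", "tags"):
        sub = commands.add_parser(name, parents=[auth])
        sub.add_argument("name_or_link",
                         help="Exact projects.yaml entry name or GitHub link.")
    return parser


args = build_parser().parse_args(["tags", "3proxy", "--user", "octocat"])
print(args.command, args.name_or_link, args.user)  # tags 3proxy octocat
```

The environment variables are read once, when the parser is built, and an explicit flag on the command line takes precedence over them.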

Generate

gen_projects_json.py generate --help

> python .\tools\gen_projects_json.py generate --help
usage: gen_projects_json.py generate [-h] [-u USER] [-k TOKEN] [--disable-caching]

options:
  -h, --help         show this help message and exit
  -u, --user USER    GitHub Username for API authentication. Falls back to the "GH_USER" environment variable.
  -k, --token TOKEN  A path to a file containing a GitHub personal access token for API authentication. Falls back to the "GH_TOKEN" environment variable.
  --disable-caching  Flag to disable caching. Falls back to the "ZV_DISABLE_CACHING" environment variable.

This command did not change.

Info

gen_projects_json.py info --help

> python .\tools\gen_projects_json.py info --help
usage: gen_projects_json.py info [-h] [-u USER] [-k TOKEN] name_or_link

positional arguments:
  name_or_link       The project.yaml exact entry name or GitHub link.

options:
  -h, --help         show this help message and exit
  -u, --user USER    GitHub Username for API authentication. Falls back to the "GH_USER" environment variable.
  -k, --token TOKEN  A path to a file containing a GitHub personal access token for API authentication. Falls back to the "GH_TOKEN" environment variable.

The info command lets you view what would be written to projects.json for a specified project, which makes debugging easier. You can pass either a GitHub link or the exact name of an entry in projects.yaml.

Tags

gen_projects_json.py tags --help

> python .\tools\gen_projects_json.py tags --help
usage: gen_projects_json.py tags [-h] [-u USER] [-k TOKEN] name_or_link

positional arguments:
  name_or_link       The project.yaml exact entry name or GitHub link.

options:
  -h, --help         show this help message and exit
  -u, --user USER    GitHub Username for API authentication. Falls back to the "GH_USER" environment variable.
  -k, --token TOKEN  A path to a file containing a GitHub personal access token for API authentication. Falls back to the "GH_TOKEN" environment variable.

This command is super helpful with the new tagging system for building and testing regexes. When adding a new library, simply pass the GitHub address to see whether any tags are non-compliant (and therefore need a regex). From there you can see every parsed version, every duplicate version (which can result from improper regex patterns), and every failed version. Here is a demonstration output:

> python .\tools\gen_projects_json.py tags https://github.com/test/repo
Processing https://github.com/test/repo

Parsed tags:
v0.5.0 (parsed as 0.5.0)
v0.4.0-RC1 (parsed as 0.4.0rc1)
v0.4.0 (parsed as 0.4.0)
v0.3.0 (parsed as 0.3.0)
v0.2.0 (parsed as 0.2.0)
v0.1.0-beta.1 (parsed as 0.1.0b1)
v0.1.0 (parsed as 0.1.0)

Failed tags:
latest (tried latest)
test-ci-1 (tried test-ci-1)

Here is the output for a more complicated example:

  - name: 3proxy
    gh_url: https://github.com/z3APA3A/3proxy
    tag_regex_subs:
      - remove: ^3proxy-

> python .\tools\gen_projects_json.py tags 3proxy
Processing 3proxy

Parsed tags:
0.9.4 (parsed as 0.9.4)
0.9.3 (parsed as 0.9.3)
0.9.2 (parsed as 0.9.2)
0.9.1 (parsed as 0.9.1)
0.9.0 (parsed as 0.9.0)
0.9.0-rc (parsed as 0.9.0rc0)
0.8.13 (parsed as 0.8.13)
0.8.12 (parsed as 0.8.12)
0.8.11 (parsed as 0.8.11)
0.8.10 (parsed as 0.8.10)
0.8.9 (parsed as 0.8.9)
0.8.8 (parsed as 0.8.8)
3proxy-0.8.7 (parsed as 0.8.7)
3proxy-0.8.6 (parsed as 0.8.6)
3proxy-0.8.5 (parsed as 0.8.5)
3proxy-0.8.4 (parsed as 0.8.4)
3proxy-0.8.3 (parsed as 0.8.3)
3proxy-0.8.2 (parsed as 0.8.2)
3proxy-0.7.1.4 (parsed as 0.7.1.4)
3proxy-0.8.1 (parsed as 0.8.1)
3proxy-0.8.0 (parsed as 0.8.0)
3proxy-0.8-pre (parsed as 0.8rc0)
3proxy-0.7.1.3 (parsed as 0.7.1.3)
3proxy-0.7.1.2 (parsed as 0.7.1.2)
v0.7.1.2 (parsed as 0.7.1.2)
v0.7.1.1 (parsed as 0.7.1.1)
v0.7.1 (parsed as 0.7.1)
v0.7 (parsed as 0.7)

Duplicate tags:
3proxy-0.8.8 (parsed as 0.8.8)

In this second example we can see a duplicate tag, which is fine in this case since there are actually two tags with the same version.
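
The three buckets in the output above (parsed, duplicate, failed) can be thought of along these lines. This is only a sketch: classify_tags is a hypothetical name, and a deliberately strict stand-in regex replaces the PEP 440 parsing that the PR does with packaging.version.Version, so it accepts far fewer tag shapes than the real tool.

```python
import re

# Stand-in validity check: an optional "v" followed by dotted integers.
# This is NOT PEP 440; it only keeps the sketch dependency-free.
SIMPLE_VERSION = re.compile(r"^v?(\d+(?:\.\d+)*)$")


def classify_tags(tags, remove_pattern=None):
    """Sort raw tag names into parsed, duplicate, and failed buckets."""
    parsed, duplicates, failed = [], [], []
    seen = set()
    for tag in tags:
        # Optionally strip a configured prefix before validating.
        candidate = re.sub(remove_pattern, "", tag) if remove_pattern else tag
        match = SIMPLE_VERSION.match(candidate)
        if match is None:
            failed.append(tag)
            continue
        version = match.group(1)
        if version in seen:
            # A second tag that resolves to an already-seen version.
            duplicates.append((tag, version))
        else:
            seen.add(version)
            parsed.append((tag, version))
    return parsed, duplicates, failed


parsed, duplicates, failed = classify_tags(
    ["0.8.8", "3proxy-0.8.8", "v0.7", "latest"], remove_pattern=r"^3proxy-"
)
print(parsed)      # [('0.8.8', '0.8.8'), ('v0.7', '0.7')]
print(duplicates)  # [('3proxy-0.8.8', '0.8.8')]
print(failed)      # ['latest']
```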

mahmoud · 2025-01-03T19:00:24Z

Just catching up on this now. Very cool! I think the optional regex transform per project makes a lot of sense. Definitely miles better, no offense taken.

To your TODO questions:

We're looking far beyond the Python ecosystem, and I'd expect PEP440 is probably too strict. The schema you have with match/replace/remove is fine, but instead of passing it to PEP440, we can say, if that string starts with 0, then the project is zerover. Versions that don't match the initial regex are ignored. We only log a failure if no releases match (the regex or URL is probably wrong).

This should definitely help with the huge increase in monorepos (architecturally good imo), but complicates tagging. Lots of server/0.1.0 / client/6.1.2-type situations.

In terms of release count, I'm fine merging/ignoring suffixed releases (dev/pre/post) with their equivalent numeric release.
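
To make the counting idea concrete, here is one possible reading of it as a sketch. The name summarize_releases and the returned keys are hypothetical; the input is assumed to be already-cleaned version strings ordered newest first, and suffixed releases are folded into their numeric release by keeping only the leading dotted-number part.

```python
import logging
import re

log = logging.getLogger("zerover-sketch")

# Leading dotted-number part of a version, e.g. "0.4.0" out of "0.4.0rc1".
NUMERIC_CORE = re.compile(r"^(\d+(?:\.\d+)*)")


def summarize_releases(versions):
    """Count distinct numeric releases and check whether the newest starts with 0."""
    releases = []
    for version in versions:
        match = NUMERIC_CORE.match(version)
        if match is None:
            continue  # Strings without a numeric start are silently ignored.
        if match.group(1) not in releases:
            releases.append(match.group(1))
    if not releases:
        # Only complain when nothing matched at all.
        log.warning("no tags matched; the regex or URL is probably wrong")
        return None
    return {
        "release_count": len(releases),
        "latest": releases[0],
        "zerover": releases[0].startswith("0"),
    }


print(summarize_releases(["0.5.0", "0.4.0rc1", "0.4.0", "0.1.0b1", "0.1.0", "latest"]))
# {'release_count': 3, 'latest': '0.5.0', 'zerover': True}
```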

MattTheCuber · 2025-01-03T22:59:38Z

Thanks for the input!

MattTheCuber · 2025-01-04T00:05:47Z

We're looking far beyond the Python ecosystem, and I'd expect PEP440 is probably too strict. The schema you have with match/replace/remove is fine, but instead of passing it to PEP440, we can say, if that string starts with 0, then the project is zerover. Versions that don't match the initial regex are ignored. We only log a failure if no releases match (the regex or URL is probably wrong).

I understand and agree. The trouble is that 99% of releases are PEP 440 compliant after regex parsing. The only one that is not is Cataclysm: Dark Days Ahead. However, there could be many more in the future, so it makes sense to not add restrictions. The part that makes this difficult is counting the number of releases. I'll see what I can do.

MattTheCuber · 2025-01-04T00:23:29Z

Hmm, this is proving challenging since I based most of the logic around the use of the Version object and custom regex to filter out undesired/duplicate tags like 0.8.0-beta1-candidate1, 0.10.2.0-KAFKA-5526, v1.4.4-changelog, v0.0.0-20230206210201-441728b4c075, sshuttle-0.60-macos-bin, v2.1.3plusPR822, clamav-0.98-dmgxar, tor-0.0.6incompat-merged, etc. All of these releases were duplicates of other releases and aren't directly parsable with PEP 440. Adding a more generic versioning checker that searched for tags beginning with X. would accept all of these and not be able to tell that they are duplicates...

mahmoud

Idea for the regex. Especially given how many projects we're already tracking, the chances that one of them adds a new suffix are quite high, and maintaining the list would get pretty involved.

mahmoud · 2025-01-04T00:58:07Z

tools/gen_projects_json.py

+from packaging.version import InvalidVersion, Version
+
+
+class RegexSubstituionDict(TypedDict):


RegexSubstituionDict -> RegexSubstitutionDict

mahmoud · 2025-01-04T01:02:51Z

tools/gen_projects_json.py

+        return True
+
+    def process_name(self, regex_subs: list[RegexSubstituionDict] | None = None):
+        for sub in regex_subs or []:


To handle the issue of suffixes, the first thing that comes to mind for me is to just treat the search regex as a prefix that must match, and we can append our own suffix portion to the configured regex, something to the effect of [^\d].*. Then we just snip off all suffixes after the last matched part. Do you think that would work?

MattTheCuber

Can you elaborate on this?

mahmoud

I'll try! Regex is one of those things that can be easier to do than say. :)

So if I recall, the way version stuff currently works is that I have an ignore (or skip?) step and a strip step. One matched tags we don't want, and the other cleaned up versions we did.

Ideally we could just have one kind of regex that matched and extracted. We can start with a really general default, something like: ^\D*(\d+(?:\D\d+)+)\D*

This already works for simpler cases like Julia that require a tag_regex_match to remove suffixes now.

Of course, this runs into issues with, e.g., hashicorp vault, which has tagging for subcomponents. We could make every entry with a static prefix revert to regex, but I'd suggest:

  - name: HashiCorp Vault
    gh_url: https://github.com/hashicorp/vault
    emeritus: true
    tag_match:
      - prefix: v

So we build the regex for the project: v(\d+(?:\D\d+)+)\D*

For cases like stellarium which have two version formats, I'm thinking something like:

  - name: Stellarium
    gh_url: https://github.com/Stellarium/stellarium
    tag_match:
      - prefix: v
      - regex: stellarium-(\d+-\d+-\d+)

And then for the second pattern, we stick the \D* on at the end. So the second regex would be ^stellarium-(\d+-\d+-\d+)\D*. We always tack on the \D* and pull the first group from re.match, which has the effect of dropping suffixes. And if the first character of any match is 0, that's 0ver. If it doesn't match, we try other regexes. For the purposes of release counting and assessing whether the project is currently 0ver, we only look at releases that match a regex.
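
A minimal sketch of how that proposal might look in code, assuming the tag_match schema from the two entries above. The names build_patterns, extract_version, and is_zerover are hypothetical, and the sample tags are illustrative rather than taken from the real repositories.

```python
import re

# General default from the comment above, plus its capture group on its own.
DEFAULT_PATTERN = r"^\D*(\d+(?:\D\d+)+)\D*"
VERSION_CORE = r"(\d+(?:\D\d+)+)"


def build_patterns(tag_match=None):
    """Turn a project's tag_match entries into compiled regexes."""
    if not tag_match:
        return [re.compile(DEFAULT_PATTERN)]
    patterns = []
    for entry in tag_match:
        if "prefix" in entry:
            # Static prefix: escape it and append the generic version core.
            body = re.escape(entry["prefix"]) + VERSION_CORE
        else:
            # Custom regex: used as given; its first group is the version.
            body = entry["regex"]
        # Always tack on \D* so trailing non-digit suffixes are dropped.
        patterns.append(re.compile("^" + body + r"\D*"))
    return patterns


def extract_version(tag, patterns):
    """Return the first group of the first matching pattern, else None."""
    for pattern in patterns:
        match = pattern.match(tag)
        if match:
            return match.group(1)
    return None


def is_zerover(version):
    # Rule from the comment: a match whose first character is 0 is 0ver.
    return version.startswith("0")


stellarium = build_patterns(
    [{"prefix": "v"}, {"regex": r"stellarium-(\d+-\d+-\d+)"}]
)
print(extract_version("v0.22.1", stellarium))            # 0.22.1
print(extract_version("stellarium-0-12-4", stellarium))  # 0-12-4
print(extract_version("sdk/v0.1.0", build_patterns([{"prefix": "v"}])))  # None
```

One caveat with this core pattern: because (?:\D\d+)+ accepts any single non-digit as a separator, a suffix that itself contains digits (a date stamp or a commit hash, say) can be absorbed into the captured group rather than dropped.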

I think this gives us a pretty robust mechanism. Ideally one where we won't be in regexland every other day because some project decided to get cute with their tags :) lmk what you think!

mahmoud · 2025-01-04T03:36:46Z

Also for Cataclysm in particular, I say we just kick it over to being manual. :P

MattTheCuber added 10 commits December 30, 2024 20:58

feat: add regex tag versioning per project WIP

f73f16c

docs: add docstring to parse_tags

c81f8bf

fix: finish implementing new versioning code

3fd02ca

feat: switch to classes and GraphQL

056dee2

fix: update projects.yaml

92a7928

fix: remove unneeding manual data

0eb5c80

feat: add star counts for gitlab repos

16f49f8

fix: resolve todos

3b6530a

chore: update requirements

7a7208e

Merge branch 'master' into revamp-tag-versions

8903a33

MattTheCuber mentioned this pull request Jan 3, 2025

Improving GitHub API usage in CI #240

Open

Merge branch 'master' into revamp-tag-versions

1233458

feat: dont include pre/post/dev releases

347a316

mahmoud reviewed Jan 4, 2025


MattTheCuber mentioned this pull request Jan 6, 2025

Fix project: n8n #242

Open
