site scraping & name/version parsing #234

bkw777 · 2023-05-04T01:13:29Z

bkw777
May 4, 2023
Maintainer

Hashing out a plan to parse the kernel package name & version strings by picking them apart into their constituent tokens instead of the current mess with regexes and unsafe assumptions in explicit program logic.

The code does not have this yet. Since it's a little complicated and takes several stages or layers to work backwards from the full strings down to the component parts in a robust way, I'm writing this to think out loud, document the idea, and get it more organized: get at least the half-baked idea down on paper, then go over it again to improve it as necessary.

One of the long-standing difficulties with this app has been parsing the kernel name & version strings in any kind of reliable way that actually works on all the different names & versions that have been made on the mainline-ppa site over the years. Until recently I thought there just was no rule that applied reliably, just loose rules that only happen to work most of the time for some stretches of time but with random exceptions scattered all over.

However now I think there is a way to pick them apart and wanted to document that.

First, just recognize that each source of kernels has its own rules for how it constructs names and version strings. So the first step is to stop trying to use one parser routine for all kernel packages found on a system and on the mainline-ppa site.

When dissecting a name or version string to figure out the kernel version, flavor, arch, etc, use a different set of rules for each source of kernels: kernels from the stock ubuntu apt repos are different from the mainline-ppa, which are different from any other source. Currently mainline still has almost the same logic as the original ukuu. For the rest of this document, assume we have split that up into separate parsers for each possible source of kernel package names/versions, and we are only talking about the parser for the kernel.ubuntu.com source here.

Some of the following rules would collide with strings that come from the stock ubuntu apt repo, meaning some ubuntu kernels would match some of these patterns but not have these meanings. This is why I say to just not include the ubuntu kernels here. They still get handled and presented in the ui etc, they would just have their own parser that picks apart their name/version strings according to whatever rules are right for those packages.

This will also eventually allow us to support other sources of kernels besides kernel.ubuntu.com. But even before that, it would improve the support of kernel.ubuntu.com because it will allow us to support all the different flavors of kernels instead of just -generic which is all we support right now.

After that, the basic idea is, instead of trying to regex on the whole string, to break the string up into tokens by working backwards in stages or layers from how it was constructed. The filenames and/or package names & versions printed by dpkg are all built out of pieces, and then those pieces are combined again into bigger pieces, etc, and by the end it's no longer simple to split apart the whole thing by any sort of simple delimiter or regex. I.e., you can't just count the dashes or dots etc.
But you CAN do that in sort of onion layers, where you split on a delimiter to get major sections, then pick apart those separate sections according to different rules and patterns appropriate for each section individually.

Eventually at the end you have a bunch of separated, labeled, meaningful pieces that can be used to sort, identify, and classify kernel versions and kernel packages more robustly than we do now.

Here is a selection of different-looking linux-image*.deb filenames, all from kernel.ubuntu.com:

linux-image-unsigned-6.3.0-060300-generic_6.3.0-060300.202304232030_amd64.deb
linux-image-6.3.0-060300-generic-lpae_6.3.0-060300.202304232030_armhf.deb
linux-image-unsigned-6.3.0-060300rc4-generic-64k_6.3.0-060300rc4.202303262231_arm64.deb
linux-image-unsigned-5.10.80-051080-lowlatency_5.10.80-051080.202111181432_amd64.deb
linux-image-unsigned-5.10.80-051080-generic_5.10.80-051080.202111181432_amd64.deb
linux-image-4.5.0-040500rc5-generic_4.5.0-040500rc5.201602201730_ppc64el.deb
linux-image-4.1.0-040100rc5-generic_4.1.0-040100rc5.201505250235_i386.deb
linux-image-3.19.8-031908ckt23-generic_3.19.8-031908ckt23.201607121433_i386.deb
linux-image-3.19.8-031908ckt23-generic-lpae_3.19.8-031908ckt23.201607121433_armhf.deb
linux-image-3.19.8-031908ckt23-lowlatency_3.19.8-031908ckt23.201607121433_amd64.deb
linux-image-2.6.32-02063265-generic_2.6.32-02063265.201805131616_amd64.deb
linux-image-2.6.32-02063265-virtual_2.6.32-02063265.201805131616_amd64.deb
linux-image-2.6.32-02063265-preempt_2.6.32-02063265.201805131616_amd64.deb
linux-image-2.6.32-02063265-server_2.6.32-02063265.201805131616_amd64.deb
linux-image-2.6.32-02063264+drm3326-virtual_2.6.32-02063264+drm3326.201805132225_amd64.deb
linux-image-3.1.5-030105-generic_3.1.5-030105.201112091259_amd64.deb
linux-image-3.1.6-030106-generic-pae_3.1.6-030106.201112211719_i386.deb
linux-image-3.2.35-030235-omap_3.2.35-030235.201212061235_armel.deb
linux-image-3.2.35-030235-highbank_3.2.35-030235.201212061235_armhf.deb
linux-image-extra-3.2.35-030235-virtual_3.2.35-030235.201212061235_amd64.deb
linux-image-3.14.0-031400rc6-lowlatency_3.14.0-031400rc6.201403100035_amd64.deb
linux-image-4.5.4-040504-generic_4.5.4-040504.201606100356_s390x.deb

Sometimes there is yet more version info that isn't anywhere in the filename.
These are both the exact same version 4.6.0rc6-generic-amd64
The build timestamps are different, and so they don't fully collide, but they target different ubuntu versions, and that info is only in the parent directory name, and inside the .deb. This is a separate issue from picking apart the name/version into tokens, and simple enough to deal with, so I'm not talking about this right now.
Just wanted to show it to show that it's not being ignored or missed.
.../v4.6-rc6-wily/linux-image-4.6.0-040600rc6-generic_4.6.0-040600rc6.201605012031_amd64.deb
.../v4.6-rc6-yakkety/linux-image-4.6.0-040600rc6-generic_4.6.0-040600rc6.201606100520_amd64.deb

First just strip off the .deb

The first layer is to split on underscore, into NAME_VERSION_ARCH
linux-image-2.6.32-02063264+drm3326-virtual_2.6.32-02063264+drm3326.201805132225_amd64

NAME=linux-image-2.6.32-02063264+drm3326-virtual
VERSION=2.6.32-02063264+drm3326.201805132225
ARCH=amd64
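
A minimal Python sketch of this first layer (the function name `split_deb_filename` is made up for illustration, it is not from the mainline code). It only strips the `.deb` and splits on `_`, and refuses anything that doesn't have exactly three sections:

```python
def split_deb_filename(filename):
    """Split a mainline-ppa .deb filename into (NAME, VERSION, ARCH)."""
    # Strip the trailing ".deb" if present.
    stem = filename[:-4] if filename.endswith(".deb") else filename
    # First layer: NAME_VERSION_ARCH, delimited by underscores.
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError("expected NAME_VERSION_ARCH, got %r" % filename)
    name, version, arch = parts
    return name, version, arch


name, version, arch = split_deb_filename(
    "linux-image-2.6.32-02063264+drm3326-virtual"
    "_2.6.32-02063264+drm3326.201805132225_amd64.deb"
)
print(name)     # linux-image-2.6.32-02063264+drm3326-virtual
print(version)  # 2.6.32-02063264+drm3326.201805132225
print(arch)     # amd64
```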

Each of these then gets picked apart by its own rules and patterns.

ARCH is done. ARCH is a final token not built out of other smaller tokens.

There is a lot of redundancy, and so the token naming will get confusing because there are "version" numbers/strings within both the NAME and VERSION top-level tokens. I guess let's prefix lower level token names with n or v to show a "version" that came from the NAME field is nVER vs a "version" that came from the VERSION field is vVER, etc.
I think for sorting and comparing version numbers later, it makes the most sense to use the ones from the VERSION field.

Next, work on the VERSION field. 1, because it's more consistent / less ambiguous than the NAME field, and 2, because we can use the unambiguous version numbers from the VERSION field to match & subtract from the NAME field to resolve some things that would otherwise be ambiguous in the NAME field.

VERSION
6.3.0-060300.202304232030
6.3.0-060300rc4.202303262231
3.19.8-031908ckt23.201607121433
2.6.32-02063264+drm3326.201805132225
2.6.32-02063265.201805131616

Split on the last "." into vVER.vDT (vVER itself contains dots, so only the final "." is the delimiter)
vVER=2.6.32-02063264+drm3326
vDT=201805132225

vDT
datetime from the VERSION field
could obviously be parsed into YYYY MM DD hh mm but there is no need to
vDT is essentially another final token like ARCH
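
A small sketch of that split, assuming Python and a made-up function name `split_version`. Since vVER has dots of its own (2.6.32...), the sketch splits on the last "." only, and checks that what comes after it is all digits:

```python
def split_version(version):
    """Split the VERSION field into (vVER, vDT) on the last '.'."""
    vver, sep, vdt = version.rpartition(".")
    # vDT should be a run of digits (the build datetime stamp).
    if not sep or not vdt.isdigit():
        raise ValueError("no trailing datetime in %r" % version)
    return vver, vdt


print(split_version("2.6.32-02063264+drm3326.201805132225"))
# ('2.6.32-02063264+drm3326', '201805132225')
```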

vVER
version from the VERSION field
6.3.0-060300
6.3.0-060300rc4
3.19.8-031908ckt23
2.6.32-02063264+drm3326
2.6.32-02063265

Don't assume there is only one '-' even though there is in all these examples.
What's probably reliable is the first '-', so split on specifically the first '-',
into vvDISPLAY-vvMACHINE (struggling with names... we ain't done yet either)
vvDISPLAY = human-readable display version of version from the VERSION field
vvMACHINE = fixed-length machine readable version of version from the VERSION field
(insert monty python spam sketch, version version spam tomatos version eggs version version & version)
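
Sketch of the first-'-' split in Python (`split_vver` is a hypothetical name). `str.partition` splits on the first occurrence only, so any later '-' stays inside vvMACHINE:

```python
def split_vver(vver):
    """Split vVER into (vvDISPLAY, vvMACHINE) on the first '-' only."""
    display, sep, machine = vver.partition("-")
    if not sep:
        raise ValueError("no '-' in %r" % vver)
    return display, machine


print(split_vver("3.19.8-031908ckt23"))       # ('3.19.8', '031908ckt23')
print(split_vver("2.6.32-02063264+drm3326"))  # ('2.6.32', '02063264+drm3326')
```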

I think vVER is always entirely duplicated in the NAME field too, so although we will continue parsing and dissecting vVER into its smaller meaningful parts, we will probably also use vVER as a whole later when parsing NAME.

vvDISPLAY
6.3.0
3.19.8
could be split obviously on "." into vvdMAJOR.vvdMINOR.vvdMICRO
but we may not bother because we may want to use the fixed-length versions from vvMACHINE for indexing and sorting anyway, and only use the collapsed version for display, in which case, just use the whole vvDISPLAY for that as it is.
For now, consider vvDISPLAY another final token even though it obviously does have sub-parts.
Maybe we split it for some reason later but if so it's trivial and not part of the main ambiguity problem we're working on right now.

vvMACHINE
060300
060300rc4
031908ckt23
02063264+drm3326
02063265

fixed-length first 6 bytes
2 bytes vvmMAJOR - final token
2 bytes vvmMINOR - final token
2 bytes vvmMICRO - final token

call the remainder vvmEXT

vvmEXT
(null)
rc4
ckt23
64+drm3326
65

vvmEXT is a final token, although there is a bit more logic to do on it, where the meaning of vvmEXT is different for RC kernels vs all others. But in either case (rc or not) it doesn't need to be dissected any further.
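
The fixed-length slicing could look something like this in Python. The `Machine` tuple and `split_machine` are names invented for this sketch. The pieces are kept as strings so the leading zeros survive for sorting:

```python
from typing import NamedTuple


class Machine(NamedTuple):
    major: str  # vvmMAJOR
    minor: str  # vvmMINOR
    micro: str  # vvmMICRO
    ext: str    # vvmEXT, may be empty


def split_machine(machine):
    """Slice vvMACHINE into three 2-byte fields plus the remainder."""
    head = machine[:6]
    if len(head) != 6 or not (head.isascii() and head.isdigit()):
        raise ValueError("expected 6 leading digits in %r" % machine)
    return Machine(head[0:2], head[2:4], head[4:6], machine[6:])


print(split_machine("060300rc4"))
# Machine(major='06', minor='03', micro='00', ext='rc4')
print(split_machine("02063264+drm3326").ext)
# 64+drm3326
```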

RC kernels do need special separate handling. They may be toggled included or excluded for consideration or display, and when included, they need to be sorted unnaturally, where 123rc4 sorts lower than 123, while 123foo sorts higher than 123.

So, regex for ^rc\d.*
If it matches, flag the kernel with is_unstable
otherwise no further changes or chopping
use is_unstable later for filtering and/or sorting the rc kernels differently from the rest
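
One way to get that unnatural ordering is a sort key that ranks rc below the plain release and everything else above it. This is only a sketch: `sort_key` is a made-up name, "060300foo" is a made-up example string, and comparing the rc number as an integer (so rc10 sorts after rc9) is my assumption, not something decided above:

```python
import re

# Same idea as ^rc\d.* above, but capturing the rc number.
RC_RE = re.compile(r"^rc(\d+)")


def is_unstable(ext):
    """True if vvmEXT marks a release candidate."""
    return RC_RE.match(ext) is not None


def sort_key(machine):
    """Sort key for a vvMACHINE string: rc < plain release < anything else."""
    numeric, ext = machine[:6], machine[6:]
    rc = RC_RE.match(ext)
    if rc:
        return (numeric, 0, int(rc.group(1)), ext)
    if ext == "":
        return (numeric, 1, 0, ext)
    # Any other extension sorts above the plain release, by plain string order.
    return (numeric, 2, 0, ext)


versions = ["060300foo", "060300", "060300rc4", "060300rc10", "060299"]
print(sorted(versions, key=sort_key))
# ['060299', '060300rc4', '060300rc10', '060300', '060300foo']
```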

That's the VERSION field parsed completely.
vvDISPLAY + '-' + vvmMAJOR + vvmMINOR + vvmMICRO + vvmEXT + '.' + vDT

now the NAME field

NAME
linux-image-unsigned-6.3.0-060300-generic
linux-image-unsigned-6.3.0-060300rc4-generic-64k
linux-image-6.3.0-060300-generic-lpae
linux-image-3.19.8-031908ckt23-generic-lpae
linux-image-2.6.32-02063264+drm3326-virtual
linux-image-3.2.35-030235-omap
linux-image-extra-3.2.35-030235-virtual
linux-image-2.6.32-02063265-server
linux-image-3.1.6-030106-generic-pae

first strip off the leading 'linux-image-'

unsigned-6.3.0-060300-generic
unsigned-6.3.0-060300rc4-generic-64k
6.3.0-060300-generic-lpae
6.3.0-060300-generic
3.19.8-031908ckt23-generic-lpae
2.6.32-02063264+drm3326-virtual
2.6.32-02063265+drm3326-preempt
3.2.35-030235-omap
extra-3.2.35-030235-virtual
2.6.32-02063265-server
3.1.6-030106-generic-pae
4.6.0-040600rc6-lowlatency
3.2.55-030255-highbank

we have one ugly problem...
both 'linux-image' and 'linux-image-unsigned' are kernel image packages,
but 'linux-image-extra' is NOT.
Either '-extra' needs ugly monkey-patch special handling to exclude it (but still include it as an associated package like linux-headers),
or '-unsigned' needs an equivalent special handling to include it.
Either way it's a hard-coding that won't handle some unknown future new names.

Let's change the initial filter to regex linux-image(-unsigned)?-\d.*
Tolerable as long as it doesn't get much worse than that
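
That filter as a Python sketch (`is_kernel_image` is a hypothetical helper name). The `-\d` after the optional `-unsigned` is what keeps `linux-image-extra-...` out:

```python
import re

IMAGE_RE = re.compile(r"linux-image(-unsigned)?-\d.*")


def is_kernel_image(name):
    """True for linux-image-<digit>... and linux-image-unsigned-<digit>..."""
    return IMAGE_RE.fullmatch(name) is not None


print(is_kernel_image("linux-image-unsigned-6.3.0-060300-generic"))  # True
print(is_kernel_image("linux-image-3.2.35-030235-omap"))             # True
print(is_kernel_image("linux-image-extra-3.2.35-030235-virtual"))    # False
```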

unsigned-6.3.0-060300-generic
unsigned-6.3.0-060300rc4-generic-64k
6.3.0-060300-generic-lpae
6.3.0-060300-generic
3.19.8-031908ckt23-generic-lpae
2.6.32-02063264+drm3326-virtual
2.6.32-02063265+drm3326-preempt
3.2.35-030235-omap
2.6.32-02063265-server
3.1.6-030106-generic-pae
4.6.0-040600rc6-lowlatency
3.2.55-030255-highbank

If the string starts with 'unsigned-', set nSIGNED='unsigned-' and strip the leading 'unsigned-',
else nSIGNED=''
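
In Python that step is about this much (`split_signed` is a made-up name). It expects the string with 'linux-image-' already stripped:

```python
def split_signed(rest):
    """Return (nSIGNED, remainder) for a NAME with 'linux-image-' removed."""
    prefix = "unsigned-"
    if rest.startswith(prefix):
        return prefix, rest[len(prefix):]
    return "", rest


print(split_signed("unsigned-6.3.0-060300rc4-generic-64k"))
# ('unsigned-', '6.3.0-060300rc4-generic-64k')
print(split_signed("3.1.6-030106-generic-pae"))
# ('', '3.1.6-030106-generic-pae')
```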

6.3.0-060300-generic
6.3.0-060300rc4-generic-64k
6.3.0-060300-generic-lpae
6.3.0-060300-generic
3.19.8-031908ckt23-generic-lpae
2.6.32-02063264+drm3326-virtual
2.6.32-02063265+drm3326-preempt
3.2.35-030235-omap
2.6.32-02063265-server
3.1.6-030106-generic-pae
4.6.0-040600rc6-lowlatency
3.2.55-030255-highbank

Remove previously derived vVER (and a trailing '-') from each.
We have vVER as a whole string, so we don't have to try to figure out the difference between '-'s within the version vs after the version in 'generic-pae' etc, just match the whole vVER verbatim and delete that without caring what's in it.
Call the entire remainder nFLAVOR
Don't worry about splitting 'generic-pae' into 'generic' and 'pae',
let's just call everything remaining as the flavor, and build a list of flavors based on whatever we find.
If we want to provide ui or settings toggles to hide/show some of the flavors, do it only dynamically by first scanning all kernels and displaying a list of whatever flavors were discovered, because they are almost unpredictable.
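
A sketch of the match-and-subtract step, assuming Python and a made-up function name `extract_flavor`. It never looks inside vVER, it just requires the string to start with vVER plus a '-' and returns whatever is left:

```python
def extract_flavor(rest, vver):
    """Remove vVER and the following '-' from the front; the rest is nFLAVOR."""
    prefix = vver + "-"
    if not rest.startswith(prefix):
        raise ValueError("%r does not start with %r" % (rest, prefix))
    return rest[len(prefix):]


print(extract_flavor("3.1.6-030106-generic-pae", "3.1.6-030106"))
# generic-pae
print(extract_flavor("2.6.32-02063264+drm3326-virtual", "2.6.32-02063264+drm3326"))
# virtual
```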

generic
generic-64k
generic-lpae
generic
generic-lpae
virtual
preempt
omap
server
generic-pae
lowlatency
highbank

The de-duped list of those, for any single ARCH, would be a shorter list.
This is essentially a worst-case that no one would ever see unless they set previous-majors to -1 to get the entire history of all versions.
generic
generic-64k
generic-lpae
virtual
preempt
omap
server
generic-pae
lowlatency
highbank
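
Building that discovered-flavors list per ARCH could be as simple as the following sketch. `flavors_by_arch` is a hypothetical helper, and the input is assumed to be (ARCH, nFLAVOR) pairs already produced by the parsing above:

```python
from collections import defaultdict


def flavors_by_arch(kernels):
    """Collect the de-duped, sorted flavors seen for each ARCH."""
    found = defaultdict(set)
    for arch, flavor in kernels:
        found[arch].add(flavor)
    return {arch: sorted(flavors) for arch, flavors in found.items()}


seen = [
    ("amd64", "generic"),
    ("amd64", "lowlatency"),
    ("amd64", "generic"),
    ("armhf", "generic-lpae"),
]
print(flavors_by_arch(seen))
# {'amd64': ['generic', 'lowlatency'], 'armhf': ['generic-lpae']}
```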

So NAME is done. The final tokens that make up NAME are:
linux-image + '-' + nSIGNED + vVER + '-' + nFLAVOR

That's a little messier but in the end we don't care about anything from the NAME field except getting a clean value for the flavor isolated reliably from all the other junk, which we got, so all we actually care about from this is nFLAVOR.
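
Putting all the layers together, here is a rough end-to-end sketch in Python. Everything named here (`Kernel`, `parse_mainline_deb`) is made up for illustration and is not the actual mainline code. It returns None for anything that doesn't fit the kernel.ubuntu.com patterns described above:

```python
import re
from typing import NamedTuple, Optional


class Kernel(NamedTuple):
    arch: str
    vdt: str
    display: str
    major: str
    minor: str
    micro: str
    ext: str
    is_unstable: bool
    signed: str
    flavor: str


def parse_mainline_deb(filename) -> Optional[Kernel]:
    stem = filename[:-4] if filename.endswith(".deb") else filename
    parts = stem.split("_")                    # NAME_VERSION_ARCH
    if len(parts) != 3:
        return None
    name, version, arch = parts
    if re.fullmatch(r"linux-image(-unsigned)?-\d.*", name) is None:
        return None                            # e.g. linux-image-extra
    vver, _, vdt = version.rpartition(".")     # last '.'
    display, _, machine = vver.partition("-")  # first '-'
    rest = name[len("linux-image-"):]
    signed = "unsigned-" if rest.startswith("unsigned-") else ""
    rest = rest[len(signed):]
    if not vver or not rest.startswith(vver + "-"):
        return None                            # NAME must repeat vVER verbatim
    flavor = rest[len(vver) + 1:]
    ext = machine[6:]
    unstable = re.match(r"rc\d", ext) is not None
    return Kernel(arch, vdt, display, machine[0:2], machine[2:4],
                  machine[4:6], ext, unstable, signed, flavor)


k = parse_mainline_deb(
    "linux-image-unsigned-6.3.0-060300rc4-generic-64k"
    "_6.3.0-060300rc4.202303262231_arm64.deb"
)
print(k.display, k.ext, k.is_unstable, k.flavor, k.arch)
# 6.3.0 rc4 True generic-64k arm64
```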

bkw777 · 2024-03-23T08:14:04Z

bkw777
Mar 23, 2024
Maintainer Author

the code has actually worked this way for some time now
