Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: version parsing #240

Merged
merged 20 commits into from
Jul 4, 2023
Merged

Conversation

baszalmstra
Copy link
Collaborator

@baszalmstra baszalmstra commented Jun 30, 2023

This PR is a follow-up to #227. It refactors the parsing of versions to use nom instead of the code that was ported from conda.

Using nom allows removing some obscure regexes and should make it easier to integrate the same parsing logic in other places like the MatchSpec and VersionSpec parsing logic which are up next for a refactor.

The parser is about 3x faster in the common case and 2x faster in more complex cases.

With the new parser, I was also able to extract more information from the string which makes it possible to reproduce the exact string representation it was created from (with some obscure exceptions). Because of that, I was able to remove the "normalized" string representation from the struct further reducing the size from 128 to 112 bytes.

As you review this PR you will notice that some version string representations have changed. This is because the string representation is now constructed from the version itself. However, some cases are very hard to reproduce, for instance, the number 06 is represented as 6 in our code. While reviewing, make sure you check these changes and decide if they make sense or not.

This code again makes use of some bit twiddling to achieve all this but I wrapped all this behavior into structs to make it simple to reason about (see Segment and Flags).

Things that currently fail to parse but did previously parse correctly are:

  • 1__ parsed as 1._
  • 1-- parsed as 1._

I will look into those!

@@ -3106,7 +3106,7 @@ expression: "conda_lock.packages_for_platform(Platform::Linux64).collect::<Vec<_
manager: conda
platform: linux-64
dependencies:
certifi: ">=2020.06.20"
certifi: ">=2020.6.20"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happened here? o.O

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's what I explained in:

As you review this PR you will notice that some version string representations have changed. This is because the string representation is now constructed from the version itself. However, some cases are very hard to reproduce, for instance, the number 06 is represented as 6 in our code. While reviewing, make sure you check these changes and decide if they make sense or not.

@wolfv
Copy link
Contributor

wolfv commented Jul 1, 2023

This is awesome, of course! I am wondering wether we need (or need not) be able to restore versions identically. There are some places where it might be "required", or confusing if it doesn't happen:

  • writing metadata in rattler-build
  • writing repodata.json files
  • validating signatures

To the last point: the current way of signing packages in conda works by signing the metadata of a package (including the dependencies). To sign the metadata, the entire record from the repodata.json is serialized into a specific JSON format, and then hashed, and the hash is signed (this is a rough description of the process).

I am not sure we want to implement the way conda package signing works (it's a pretty custom implementation, and sigstore.dev might be a better fit) but it is at least one place where restoring versions exactly is necessary.

We could add some sort of flag to the parser, so that if our representation doesn't match up, we store either a fancy diff or the original string. Or we could wrap the type with an ExactReprVersion type that implements this functionality.

Or - and this is controversial - we only store the version as string and reparse it every time we want to compare it / bump it / ... :)

@baszalmstra
Copy link
Collaborator Author

Yeah indeed. In that case it would make more sense to store a string alonside the Version seperately. Dont think it should be a concern of the Version struct indeed.

@baszalmstra baszalmstra marked this pull request as ready for review July 2, 2023 18:06
@baszalmstra
Copy link
Collaborator Author

@wolfv Should I also introduce a type that retains the original version string and use that in the RepoData?

@baszalmstra
Copy link
Collaborator Author

5234e3b adds VersionWithSource which is used in the places where Versions are parsed.

@wolfv wolfv merged commit dc6f190 into conda:main Jul 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants