JLAP support (#197)

* adding initial module for JLAP support; still a WIP
* making a few updates to the JLAP module
* updating doc string
* initial commit to add a blake2b hash to the state file
* removing accidentally committed jlap stuff
* removing unnecessary dependency
* updating imports
* fixing formatting issues
* updating this branch to instead switch the blake2 hash implementation to blake2b
* fixing formatting issues
* attempting to fix an issue with the windows test runner
* saving progress so far
* range request works better now
* adding the variant checks for JLAP
* adding a new JLAPManager object to hold a lot of the data and logic needed for fetching, updating and patching; still incomplete
* making a few updates to JLAPManager struct; adding blake2_hash field
* caching and updating the jlap file is more-or-less working
* I think this actually works! Still need to write tests though 🙄; coming soon...
* adding the start of some testing
* finished adding first test ✅
* Added more tests and example

  The tests work when just running the jlap module but fail when running the entire test suite.
* Fixing test related stuff

  Because I used the `tokio::fs` module in my JLAP code I was seeing unpredictable behavior when mixing in the `std::fs` module with my tests. Switching the tests to use `tokio::fs` instead appears to have resolved this issue.
* Adding hash verification; test refactor

  Here, I'm adding a hash verification as the last step in a successful jlap patch operation. The `patch_repo_data` function now returns this updated hash. I'm guessing it will be useful for updating the `*.state.json` file. This commit also refactors the tests a little bit to make them easier to reason about (trying to hide some of the setup code in their own functions).
* couple of small changes
* Updates to return the hash object instead of string

  This commit also makes changes to the rattler_digest code. I think it's better to store the hash type there rather than in the rattler_repodata::fetch::cache module that is private. I also updated the way that the hashes are generated to take advantage of the rattler_digest library too.
* Re-working the way JLAP works

  This commit is a large refactor based on a conversation I had with a colleague. Instead of caching the JLAP file itself, we now just store information about the request in the .state.json file. This helps simplify the code quite a bit.
* clippy issues; updating docs
* more updates based on review comments
* Refactoring to make code easier to read plus more!

  This commit does some refactoring to hopefully increase code readability. It introduces a new `JLAP` struct which holds information related to the JLAP response.
* Adds working checksum validation 🙌

  This commit finishes the validate_checksum method. It also updates the error messages to be a little less redundant (removing the `Error` suffix).
* changes based on comments from review
* Updates the serializers for hash values

  This commit updates the serializers used for hash values; it relies on the `serde_as` macro now. I had to change some values in the `Cargo.toml` file to get this to work.
* more tweaks and fixes
* addressing more comments from review
* more suggestions from review
* updating comment
* moving hex decoding to serde
* we always need to return the latest iv value we get by running validate_checksum
* use Blake2b in `validate_cached_state`
* updating fetch_repo_data so it returns earlier when it successfully fetches JLAP data
* Getting closer to a working JLAP

  What still should be done:
  - We should save the headers of the JLAP request to respect the cache timeouts
  - This doesn't play well with the progress bar yet
* updating docs and adding a new error for when the parsing of the checksum fails
* Lots of refactoring from manual testing

  This commit includes a lot of refactoring I did based on manual testing with real repodata (still needs to be included as unit tests 😬). The case I was not handling well was empty JLAP responses. This happens when there is no data to update. I was also saving the wrong values for the new initialization vector and had to address this too.
* fixing documentation error
* adding another test case to handle jlap responses with no new patches
* adding another test case to make sure that the range not satisfiable error handling logic works as expected
* refactoring tests to make them easier to read
* Refactor test cases and adds ordering of repo data

  This commit contains two things:
  1. I added a way for the serializer to order the repo data. This is necessary to make the blake2 hashes line up correctly. It does incur a performance penalty.
  2. I refactored the tests to make use of `rstest` and test cases. I did this because there was a lot of redundant code in the test module.

---------

Co-authored-by: Wolf Vollprecht <w.vollprecht@gmail.com>
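
The last point of the message, that the serializer must order the repo data for the blake2 hashes to line up, can be illustrated with a small standard-library sketch. This is not the code from this commit: `render` and `render_sorted` are hypothetical helpers, and the JSON-like rendering is deliberately simplistic (no escaping). It only shows that the same entries serialize to different bytes, and so to different hashes, unless the key order is fixed first.

```rust
// Illustrative sketch only; `render` and `render_sorted` are hypothetical
// helpers and are not part of rattler.

/// Render key/value pairs as a compact JSON-like object, in the order given.
/// No escaping is performed; this is only meant for simple demo strings.
fn render(entries: &[(&str, &str)]) -> String {
    let body: Vec<String> = entries
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", k, v))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// Render the same pairs after sorting them by key, so that the output bytes
/// no longer depend on the order in which the entries were supplied.
pub fn render_sorted(entries: &[(&str, &str)]) -> String {
    let mut sorted = entries.to_vec();
    sorted.sort_by(|a, b| a.0.cmp(b.0));
    render(&sorted)
}
```

Two inputs holding the same entries in a different order give different strings from `render` but identical strings from `render_sorted`. Byte-for-byte stability of this kind is what a content hash needs, and the extra sorting step is consistent with the performance penalty the message mentions.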
conda · Jun 12, 2023 · ce910d4
1 parent b5f117b
Showing 9 changed files with 1,180 additions and 26 deletions.
diff --git a/crates/rattler_digest/Cargo.toml b/crates/rattler_digest/Cargo.toml
@@ -17,6 +17,7 @@ hex = "0.4.3"
 serde = { version = "1.0.163", features = ["derive"], optional = true }
 sha2 = "0.10.6"
 md-5 = "0.10.5"
+blake2 = "0.10.6"
 serde_with = "3.0.0"
 
 [features]

diff --git a/crates/rattler_digest/src/lib.rs b/crates/rattler_digest/src/lib.rs
@@ -46,6 +46,8 @@ pub mod serde;
 
 pub use digest;
 
+use blake2::digest::consts::U32;
+use blake2::{Blake2b, Blake2bMac};
 use digest::{Digest, Output};
 use std::io::Read;
 use std::{fs::File, io::Write, path::Path};
@@ -59,6 +61,12 @@ pub type Sha256Hash = sha2::digest::Output<Sha256>;
 /// A type alias for the output of an MD5 hash.
 pub type Md5Hash = md5::digest::Output<Md5>;
 
+/// A type for a blake2b hasher with a 32 byte (256 bit) digest.
+pub type Blake2b256 = Blake2b<U32>;
+
+/// A type alias for a keyed blake2b MAC with a 32 byte (256 bit) output.
+pub type Blake2bMac256 = Blake2bMac<U32>;
+
 /// Compute a hash of the file at the specified location.
 pub fn compute_file_digest<D: Digest + Default + Write>(
     path: impl AsRef<Path>,

diff --git a/crates/rattler_repodata_gateway/Cargo.toml b/crates/rattler_repodata_gateway/Cargo.toml
@@ -30,14 +30,16 @@ serde = { version = "1.0.163", features = ["derive"] }
 serde_json = { version = "1.0.96" }
 pin-project-lite = "0.2.9"
 md-5 = "0.10.5"
-rattler_digest = { version = "0.2.0", path = "../rattler_digest", features = ["tokio"] }
+rattler_digest = { version = "0.2.0", path = "../rattler_digest", features = ["tokio", "serde"] }
 rattler_conda_types = { version = "0.2.0", path = "../rattler_conda_types", optional = true }
 fxhash = { version = "0.2.1", optional = true }
 memmap2 = { version = "0.6.2", optional = true }
 ouroboros = { version = "0.15.6", optional = true }
-serde_with = { version = "3.0.0", optional = true }
+serde_with = "3.0.0"
 superslice = { version = "1.0.0", optional = true }
 itertools = { version = "0.10.5", optional = true }
+json-patch = "1.0.0"
+hex = { version = "0.4.3", features = ["serde"] }
 
 [target.'cfg(unix)'.dependencies]
 libc = "0.2"
@@ -59,4 +61,4 @@ rstest = "0.17.0"
 default = ['native-tls']
 native-tls = ['reqwest/native-tls']
 rustls-tls = ['reqwest/rustls-tls']
-sparse = ["rattler_conda_types", "memmap2", "ouroboros", "serde_with", "superslice", "itertools", "serde_json/raw_value"]
+sparse = ["rattler_conda_types", "memmap2", "ouroboros",  "superslice", "itertools", "serde_json/raw_value"]
diff --git a/crates/rattler_repodata_gateway/src/fetch/cache/mod.rs b/crates/rattler_repodata_gateway/src/fetch/cache/mod.rs
@@ -1,15 +1,12 @@
 mod cache_headers;
 
-use blake2::digest::consts::U32;
-use blake2::Blake2b;
 pub use cache_headers::CacheHeaders;
+use rattler_digest::{serde::SerializableHash, Blake2b256};
 use serde::{Deserialize, Deserializer, Serialize, Serializer};
+use serde_with::serde_as;
 use std::{fs::File, io::Read, path::Path, str::FromStr, time::SystemTime};
 use url::Url;
 
-/// Custom blake2b type
-pub type Blake2b256 = Blake2b<U32>;
-
 /// Representation of the `.state.json` file alongside a `repodata.json` file.
 #[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct RepoDataState {
@@ -51,6 +48,9 @@ pub struct RepoDataState {
 
     /// Whether or not JLAP is available for the subdirectory
     pub has_jlap: Option<Expiring<bool>>,
+
+    /// State information related to JLAP
+    pub jlap: Option<JLAPState>,
 }
 
 impl RepoDataState {
@@ -80,6 +80,35 @@ impl FromStr for RepoDataState {
     }
 }
 
+/// Used inside of the `RepoDataState` to store information related to our JLAP state
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct JLAPState {
+    /// Initialization Vector (IV) for the JLAP file; this is found on the first line of the
+    /// JLAP file.
+    #[serde(rename = "iv", with = "hex")]
+    pub initialization_vector: Vec<u8>,
+
+    /// Current position to use for the bytes offset in the range request for JLAP
+    #[serde(rename = "pos")]
+    pub position: u64,
+
+    /// Footer contains metadata about the JLAP file such as which url it is for
+    pub footer: JLAPFooter,
+}
+
+/// Represents the metadata for a JLAP file, which is typically found at the very end
+#[serde_as]
+#[derive(Debug, Clone, Default, Serialize, Deserialize)]
+pub struct JLAPFooter {
+    /// This is not actually a full URL, just the last part of it (i.e. the filename
+    /// `repodata.json`). That's why we store it as a [`String`]
+    pub url: String,
+
+    /// blake2b hash of the latest `repodata.json` file
+    #[serde_as(as = "SerializableHash::<rattler_digest::Blake2b256>")]
+    pub latest: blake2::digest::Output<Blake2b256>,
+}
+
 /// Represents a value and when the value was last checked.
 #[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct Expiring<T> {
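
The `JLAPState` fields added above store the initialization vector as a hex string (`iv`) and a byte offset (`pos`) for the next range request. The sketch below shows, using only the standard library, how such values might be turned into an HTTP `Range` header and converted to and from hex. The helper names `range_header`, `encode_hex`, and `decode_hex` are hypothetical and do not appear in this commit, which relies on the `hex` crate through serde. The open-ended `bytes=<pos>-` form is an assumption about how the offset is used.

```rust
// Illustrative sketch only; these helpers are hypothetical and are not part
// of rattler. The commit itself uses the `hex` crate via serde.

/// Build the value of an HTTP `Range` header that asks for everything from
/// `position` to the end of the remote file (assumed open-ended form).
pub fn range_header(position: u64) -> String {
    format!("bytes={}-", position)
}

/// Encode bytes as a lowercase hex string, the textual form used for `iv`.
pub fn encode_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Decode a hex string into bytes. Returns `None` when the input has an odd
/// length or contains a character that is not a hex digit.
pub fn decode_hex(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // All characters are ASCII at this point, so byte slicing is safe.
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}
```

With a stored `pos` of 1024, `range_header` yields `bytes=1024-`, so only bytes appended to the JLAP file since the last fetch would be requested. The commit message notes that a "range not satisfiable" response also has to be handled, which this sketch does not cover.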

diff --git a/...he/snapshots/rattler_repodata_gateway__fetch__cache__test__parse_repo_data_state_one.snap b/...he/snapshots/rattler_repodata_gateway__fetch__cache__test__parse_repo_data_state_one.snap
@@ -13,4 +13,5 @@ has_zst:
   last_checked: "2023-02-13T14:08:50Z"
 has_bz2: ~
 has_jlap: ~
+jlap: ~
 
diff --git a/...he/snapshots/rattler_repodata_gateway__fetch__cache__test__parse_repo_data_state_two.snap b/...he/snapshots/rattler_repodata_gateway__fetch__cache__test__parse_repo_data_state_two.snap
@@ -16,4 +16,5 @@ has_bz2:
   value: true
   last_checked: "2023-05-18T13:59:07.112638Z"
 has_jlap: ~
+jlap: ~