Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

delegate_hashed_bins must hash a target path as it appears in metadata #995

Closed
lukpueh opened this issue Mar 9, 2020 · 8 comments
Closed

Comments

@lukpueh
Copy link
Member

lukpueh commented Mar 9, 2020

Description of issue or feature request:
Fix in accordance with #957 (and #963)!!

The repository tool's delegate_hashed_bins creates hash bins (delegated targets roles) and assigns passed target file paths based on their hash. Later it calls delegate for each bin passing the relevant paths. delegate then might left-strip the targets directory path from each targets path, changing the targets path in a way that it no longer is in the right bin.

Current behavior:
delegate_hashed_bins might hash target path based on a different path than the one that will appear in the (delegated) targets metadata.

Expected behavior:
delegate_hashed_bins must hash a target path as it appears in metadata.

@lukpueh
Copy link
Member Author

lukpueh commented Mar 13, 2020

On a related side-note:

delegated_hashed_bins might generally benefit from some refactoring/clean up. E.g. it does not feel quite necessary to create a dictionary that has regularly incrementing indices as keys:
https://github.com/theupdateframework/tuf/blob/4fe29d138df02035eed9716657cfe0a76e17e178/tuf/repository_tool.py#L2513-L2515

The refactor could be done while fixing above or separately. Below is some semi-pseudo-code, that shows how I would distribute target paths in bins. Feel free to use as inspiration, enhancements/optimizations are welcome... :)

def distribute_in_bins(target_paths, bin_count):
  prefix_length = len(f"{bin_count - 1:x}")
  prefix_count = 16 ** prefix_length
  bin_size = prefix_count // bin_count

  # Create targets object for each bin and assign it to a list, where the index
  # equals the lower bound of the hash prefix range served by the bin divided
  # by the bin size
  ordered_roles = []
  for idx in range(0, prefix_count, bin_size):
    name = f"{idx:0{prefix_length}x}-{idx+bin_size-1:0{prefix_length}x}"
    role = {"name": name, "target_paths": []}
    ordered_roles.append(role)

  # Use floor division to assign targets to their serving bin
  for target_path in target_paths:
    # hash_prefix = _get_hash(target_path)[:prefix_length]
    hash_prefix = target_path # <-- only for testing purpose (FIXME)
    ordered_roles[int(hash_prefix, 16) // bin_size]["target_paths"].append(target_path)

  return ordered_roles

# Pass hex strings instead of paths to better visualize functionality (see FIXME above)
distribute_in_bins([f"{x:02x}" for x in range(0,256)], 32)

@trishankatdatadog
Copy link
Member

Huh, I am curious how that floor division works, it looks like base-10 divided by base-16...

@lukpueh
Copy link
Member Author

lukpueh commented Mar 13, 2020

Well spotted, @trishankatdatadog. It misses an int(hash_prefix, 16) line to convert the hex string to a number.

@lukpueh
Copy link
Member Author

lukpueh commented Mar 13, 2020

Also, as hinted above, this snippet should not be copy-pasted as is. It will need some fleshing out to replace the existing functionality in repository_tool. I just wanted to outline the algorithm I had in mind, for distributing targets in an ordered list of bins, where the index equals the bin size divided by the lower bound of the hash prefix range served by the bin.

@trishankatdatadog
Copy link
Member

Understood, @lukpueh. That bin indexing logic still seems wrong, but I'll double-check later. Understood that it's a sketch!

@lukpueh
Copy link
Member Author

lukpueh commented Mar 13, 2020

@trishankatdatadog, you were right. The variables in the division were switched plus two other minor issues (see edit history of comment). For my defense I only quickly wrote this up today from some pen and paper notes I took last week. Apologies. :) You should be able to copy-paste the updated snippet into your interpreter now and it should show that it works.

@trishankatdatadog
Copy link
Member

Thanks @lukpueh !

joshuagl added a commit to joshuagl/tuf that referenced this issue Mar 20, 2020
Add a helper function to determine the name of a bin that a hashed
targetfile will be delegated to.

Based sketches by Lukas Puehringer in issues theupdateframework#994 & theupdateframework#995

Signed-off-by: Joshua Lock <jlock@vmware.com>
joshuagl added a commit to joshuagl/tuf that referenced this issue Mar 24, 2020
Add a helper function to determine the name of a bin that a hashed
targetfile will be delegated to.

Based sketches by Lukas Puehringer in issues theupdateframework#994 & theupdateframework#995

Signed-off-by: Joshua Lock <jlock@vmware.com>
joshuagl added a commit to joshuagl/tuf that referenced this issue Mar 24, 2020
Simplify the delegate_hashed_bins function based on the semi-pseudo-code
by Lukas Puehringer in theupdateframework#995.

Signed-off-by: Joshua Lock <jlock@vmware.com>
joshuagl added a commit to joshuagl/tuf that referenced this issue Mar 26, 2020
Add a helper function to determine the name of a bin that a hashed
targetfile will be delegated to.

Based sketches by Lukas Puehringer in issues theupdateframework#994 & theupdateframework#995

Signed-off-by: Joshua Lock <jlock@vmware.com>
joshuagl added a commit to joshuagl/tuf that referenced this issue Mar 26, 2020
Simplify the delegate_hashed_bins function based on the semi-pseudo-code
by Lukas Puehringer in theupdateframework#995.

Signed-off-by: Joshua Lock <jlock@vmware.com>
joshuagl added a commit to joshuagl/tuf that referenced this issue Mar 26, 2020
Add a helper function to determine the name of a bin that a hashed
targetfile will be delegated to.

Based sketches by Lukas Puehringer in issues theupdateframework#994 & theupdateframework#995

Signed-off-by: Joshua Lock <jlock@vmware.com>
joshuagl added a commit to joshuagl/tuf that referenced this issue Mar 26, 2020
Simplify the delegate_hashed_bins function based on the semi-pseudo-code
by Lukas Puehringer in theupdateframework#995.

Signed-off-by: Joshua Lock <jlock@vmware.com>
joshuagl added a commit to joshuagl/tuf that referenced this issue Mar 30, 2020
Add a helper function to determine the name of a bin that a hashed
targetfile will be delegated to.

Based sketches by Lukas Puehringer in issues theupdateframework#994 & theupdateframework#995

Signed-off-by: Joshua Lock <jlock@vmware.com>
joshuagl added a commit to joshuagl/tuf that referenced this issue Mar 30, 2020
Simplify the delegate_hashed_bins function based on the semi-pseudo-code
by Lukas Puehringer in theupdateframework#995.

Signed-off-by: Joshua Lock <jlock@vmware.com>
joshuagl added a commit to joshuagl/tuf that referenced this issue Apr 1, 2020
Simplify the delegate_hashed_bins function based on the semi-pseudo-code
by Lukas Puehringer in theupdateframework#995.

Signed-off-by: Joshua Lock <jlock@vmware.com>
@lukpueh
Copy link
Member Author

lukpueh commented Apr 1, 2020

Fixed in #1007

@lukpueh lukpueh closed this as completed Apr 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants