
Properly describe hashed bin delegation in the specification #210

Open
joshuagl opened this issue Feb 17, 2022 · 6 comments

Comments

@joshuagl
Member

Hashed bin delegation is not well documented in the specification. One of the better and more frequently referenced descriptions is in PEP 458.

We might add this to the relatively new repository operations section of the specification.

Originally posted by @joshuagl in theupdateframework/taps#148 (review)

@lukpueh
Member

lukpueh commented Feb 18, 2022

Maybe an inspirational resource for the repository side:
I recently wrote an extensively commented hashed bin delegation example implementation using the python-tuf Metadata API: theupdateframework/python-tuf...hashed_bin_delegation.py
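The core of the scheme can be sketched in a few lines. This is an illustrative example only (not the python-tuf API): a target is assigned to one of 2^k bins by the leading hex digits of the SHA-256 hash of its path. The constants are assumed example values.

```python
import hashlib

NUMBER_OF_BINS = 32                  # must be a power of two
PREFIX_LEN = 8                       # hex digits of the digest we inspect
PREFIXES_PER_BIN = 16**PREFIX_LEN // NUMBER_OF_BINS

def find_bin_index(target_path: str) -> int:
    """Return the index of the bin responsible for target_path."""
    digest = hashlib.sha256(target_path.encode()).hexdigest()
    # Each bin covers a contiguous range of hash prefixes.
    return int(digest[:PREFIX_LEN], 16) // PREFIXES_PER_BIN

print(find_bin_index("foo/bar/baz.tar.gz"))
```

Because the mapping is a pure function of the target path, clients can compute the responsible bin themselves without downloading every bin's metadata.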

@lukpueh
Member

lukpueh commented Feb 18, 2022

Feature request:
When describing this in the spec, it would be great to also describe how to choose an appropriate number of bins, given an expected number of target files.

@lukpueh
Member

lukpueh commented Apr 11, 2023

> Feature request: When describing this in the spec, it would be great to also describe how to choose an appropriate number of bins, given an expected number of target files.

I think the optimal "load factor" is what we are looking for here.

@lukpueh
Member

lukpueh commented Oct 18, 2023

> I think the optimal "load factor" is what we are looking for here.

This is nonsense: the load factor relates to the cost of retrieving a value from a hash table, which is a completely different optimisation problem from the one we are interested in.

@lukpueh
Member

lukpueh commented Oct 18, 2023

I gave some more thought to how to find the optimal number of bins for a given number of target files, and came up with the following simple problem definition:

for a chosen 'number of targets'
we want the optimal 'number of bins' (a power of 2)
where 'size of snapshot metadata' + 'size of a single bin metadata' is minimal

For snapshot and bin metadata sizes we only need to consider those parts that change for different numbers of targets and bins:

'size of snapshot metadata' = 'size of a single bin meta entry' * 'number of bins'
'size of a single bin metadata' = 'size of a single target file meta entry' * ceil('number of targets' / 'number of bins')

Someone with mad math skills can maybe solve this using a fancy (discrete?) optimisation technique.

Until then we can use this script or this Google sheet.
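The two size formulas above are simple enough to brute-force over powers of two. A sketch, assuming illustrative per-entry byte sizes (the real values depend on your metadata format, hash lengths, and path lengths):

```python
import math

# Assumed example sizes in bytes; substitute measurements from real metadata.
SNAPSHOT_ENTRY_SIZE = 90   # one bin's meta entry in snapshot
TARGET_ENTRY_SIZE = 200    # one target file's meta entry in a bin

def cost(number_of_targets: int, number_of_bins: int) -> int:
    """Size of snapshot metadata plus size of a single bin's metadata."""
    snapshot_size = SNAPSHOT_ENTRY_SIZE * number_of_bins
    bin_size = TARGET_ENTRY_SIZE * math.ceil(number_of_targets / number_of_bins)
    return snapshot_size + bin_size

def optimal_bins(number_of_targets: int, max_exponent: int = 20) -> int:
    """Power of two that minimises the combined size."""
    candidates = [2**k for k in range(max_exponent + 1)]
    return min(candidates, key=lambda b: cost(number_of_targets, b))

for n in (100, 10_000, 1_000_000):
    print(n, optimal_bins(n))
```

With these assumed sizes, 100 targets land at 16 bins; the optimum scales roughly with the square root of the number of targets, since the continuous minimum of `a*b + c*n/b` lies at `b = sqrt(c*n/a)`.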

@trishankatdatadog
Member

Good Q, Lukas. I think a simple objective to optimise is sizeof(snapshot) == sizeof(targets+delegations). Hopefully a Python numerical optimiser can help find the values?
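Setting the two sizes equal actually has a closed form, so no numerical optimiser is needed: solving `snapshot_entry * b == target_entry * n / b` gives `b = sqrt(target_entry * n / snapshot_entry)`. A sketch with assumed entry sizes, rounding the result to the nearest power of two:

```python
import math

# Assumed example sizes in bytes, as above.
SNAPSHOT_ENTRY_SIZE = 90
TARGET_ENTRY_SIZE = 200

def balanced_bins(number_of_targets: int) -> int:
    """Number of bins (power of two) that balances snapshot size
    against the size of a single bin's metadata."""
    b = math.sqrt(TARGET_ENTRY_SIZE * number_of_targets / SNAPSHOT_ENTRY_SIZE)
    return 2 ** round(math.log2(b))

print(balanced_bins(10_000))
```

For 10,000 targets and these sizes the balance point is about 149 bins, which rounds to 128; that agrees closely with the minimum of the sum, since the sum `a*b + c*n/b` is minimised exactly where its two terms are equal.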
