
Getting rid of internally soft-linked hard-links #855

Closed
enkore opened this issue Apr 7, 2016 · 4 comments

@enkore
Contributor

enkore commented Apr 7, 2016

Today in the Borg "Getting rid of…" show: soft-linked hard-links.

This distinction between "regular files" and "regular files with nlink>1" has been a bit of a troublemaker in various places, because it makes it hard to work on subsets of all items. The original solution with the 'source' attribute is nice, because it avoids storing the chunk id list twice, and because it makes it straightforward to link all links together when extracting (the full archive, not a subset).

When working with subsets, this solution fails and we kludged stuff together to make it work, but it isn't nice.
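
To make the subset problem concrete, here is a rough, hypothetical sketch (made-up item dicts and a toy extractor, not Borg's actual Item class or extract code) of the current 'source' scheme and where it breaks when only part of an archive is selected:

```python
# Made-up item dicts illustrating the current scheme: only the first item seen
# for an inode carries 'chunks', later hardlinks carry a 'source' path instead.
archive_items = [
    {"path": "data/a", "chunks": ["id1", "id2"]},   # hardlink "master": has the chunk list
    {"path": "data/b", "source": "data/a"},         # hardlink "slave": only a back-reference
]

def extract_subset(items, wanted_paths):
    """Toy extractor for a subset of paths; not Borg code."""
    by_path = {item["path"]: item for item in items}
    for path in wanted_paths:
        item = by_path[path]
        if "source" in item and item["source"] not in wanted_paths:
            # The chunk list lives only in the master item, which was not
            # selected -- this is exactly the subset problem described above.
            raise LookupError("hardlink master %r not in subset" % item["source"])
        # ... fetch chunks / create the hardlink here ...

extract_subset(archive_items, ["data/b"])   # fails: the master "data/a" was not selected
```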


Ideas

  • Let it be
  • Just put the "chunks" in every file, ignore 'source' except when extracting (to link 'em together); a rough sketch of this follows the list
    • the chunk id list will probably need more space, but for really large files the deduplication of the item metadata should kick in nicely.
    • 1.0 still does the right thing, but if we drop the compat code we have now, the troubles for old archives remain. We could shift blame to "recreate" and reduce the compat code to that single occurrence.
  • Could drop 'source' entirely (=> 1.0 would extract each link independently) and keep an index of hard links outside of the 'items' stream
  • ?
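
A rough sketch of the second idea above: every item carries its own chunk list, and 'source' is only consulted at extract time to re-link. The fetch_chunk(id) helper below is a made-up stand-in for repository access, not an actual Borg function:

```python
import os

# Under this idea both items carry 'chunks'; 'source' is kept purely as a hint
# so that an extraction can recreate the hardlink when both ends are present.
archive_items = [
    {"path": "data/a", "chunks": ["id1", "id2"]},
    {"path": "data/b", "chunks": ["id1", "id2"], "source": "data/a"},
]

def extract(items, dest, fetch_chunk):
    """Toy extractor: works for any subset, re-links hardlinks when possible."""
    extracted = {}
    for item in items:
        out = os.path.join(dest, item["path"])
        os.makedirs(os.path.dirname(out), exist_ok=True)
        src = item.get("source")
        if src in extracted:
            os.link(extracted[src], out)        # both ends present: recreate the hardlink
        else:
            with open(out, "wb") as f:          # otherwise: write from the item's own chunks
                for chunk_id in item["chunks"]:
                    f.write(fetch_chunk(chunk_id))
        extracted[item["path"]] = out
```

Extracting only data/b would then simply write the file from its own chunk list, instead of needing the kludges mentioned above.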
@ThomasWaldmann
Member

Guess we need to keep the compat code until we do another major release that requires running an upgrade procedure anyway. So, tag this "2.0"?

@enkore
Contributor Author

enkore commented Apr 17, 2016

Ape had the good idea of doing this via recreate, i.e. removing all the hardlink_master cruft, implementing a clean solution, and keeping the old handling only in recreate (where it's one of the simpler variants, especially compared with diff!). I'd say this would become a feasible option for 1.2 or 1.3 if recreate has proven reliable in 1.1.

@ThomasWaldmann
Member

ThomasWaldmann commented Aug 20, 2016

See also #1473 - one reason the problem there occurs relatively early is that the chunk list of a file is contained in the ITEM, making the item big (the other reason is having extremely many items).

If we moved the chunk list into an INODE (and referenced the inode objects from the item), #1473 would be very much relaxed, as the item metadata stream would shrink a lot.

Also, for this ticket here, we could reference the same INODE objects from multiple ITEMs to model hardlinks in a natural way.

Note that an INODE cannot just be one storage object (MAX_OBJECT_SIZE = 20 MiB), as that only stores ~500,000 object references; with ~2 MiB per file content chunk, this would mean a file size limit of ~1 TB, which is too low.

So, we could have a primary (small) list of object IDs in the ITEM, and each of these objects contains a secondary list of references to content objects, so we get n * 1 TB.
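
Back-of-the-envelope numbers behind the two paragraphs above, assuming roughly 40 bytes per chunk reference (the exact per-reference size in Borg may differ):

```python
MAX_OBJECT_SIZE = 20 * 1024**2        # 20 MiB per storage object
REF_SIZE = 40                          # assumed bytes per chunk reference
CHUNK_SIZE = 2 * 1024**2               # ~2 MiB per file content chunk

refs_per_object = MAX_OBJECT_SIZE // REF_SIZE       # ~524,288, i.e. the ~500,000 above
single_level_limit = refs_per_object * CHUNK_SIZE   # ~1 TiB of file data per INODE object

# With one level of indirection: a primary list of n object IDs in the ITEM,
# each pointing at a secondary list of references to content objects.
n = 1000
two_level_limit = n * single_level_limit            # n * ~1 TiB
print(refs_per_object, single_level_limit // 1024**4, two_level_limit // 1024**4)
```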

An optimization could be done to avoid the indirection for small files: just have the primary list directly point to content objects (as it is now) - this could also be the "compatibility mode".

Note: I talked about INODE above. In UNIX filesystems, the metadata of the file (except the name) is usually also stored in the INODE. We could discuss doing that, or we could just implement the block list part of an INODE.

@ThomasWaldmann
Member

Closing in favour of #2325.
