-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add file metadata and permissions to UnixFSv1 #220
Conversation
This is based on the upcoming spec spec change ipfs/specs#220
Question: what happens when a message containing these new properties is parsed by someone using the old protobuf schema? Are the properties ignored or does it fail? |
unixfs/README.md
Outdated
/* file metadata */ | ||
optional uint32 mtime = 7; | ||
optional uint32 atime = 8; | ||
optional uint32 ctime = 9; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@djdv pointed out that windows supports creation time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not supporting fractional time is an oversight. Especially provided we might be stuck with this stop-gap for quite a while. I urge you to consider doing the following:
message Timespec {
// Use of int64 is deliberate - negative epoch is super unlikely
// Using sint64 would introduce zig-zag encoding ( harder to eyeball )
// The varint representing the time of writing this is 5 bytes long
// It will remain so until October 26, 3058 ( 34,359,738,367 )
required int64 UnixEpoch = 1;
// fixed32 ( always 4 bytes, no varint ), as nanosecs are often > 2^28
// https://developers.google.com/protocol-buffers/docs/proto#scalar
// NOTE: on the wire this value will be *little* endian
optional fixed32 UnixEpochNsec = 2;
}
And using Timespec
for mtime
/ctime
/etc ( atime
is a really odd choice to carry but since it is optional it's not a big deal )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to offer some perspective from the field in terms of what I have and have not, ever, needed when working with package management as well as containerland:
-
mtime comes up sometimes.
-
atime is completely useless. It will often get set again in the course of... anything else you're doing. The
atime
vsrelatime
vsnoatime
flags on common filesystems have also muddied the field beyond all recognition. There is approximately one program in the universe known to look at atime, and it's something you've never heard of. atime should be disregarded. atime provides purely negative value. -
ctime is completely useless. It can't even be expressed unless you're writing filesystem drivers. While it so happens that we are, and we can, that doesn't mean it's a good idea, that it's useful, or that we should. For example, if we were to support commands that just unpack/manifest files on an existing (e.g. ext4) filesystem, we would not be able to preserve ctime anyway. Most other tooling exists within this limitation as well, and as a result, the world has evolved to effectively disregard ctime for any significant purposes; e.g., if there was a piece of software that would not work correctly if ctimes changed, it would fail to work in any container system, anything that uses tarballs, etc; effectively, it would be nonviable, and resultingly, such software tends not to exist.
I'd think long and hard before perpetuating these pieces of metadata just because we 'can'. They're remarkably useless, unused, and in many contexts unusable.
- fractional timestamps I also have very mixed feelings on. We can observe that most tar systems don't do them, in practice. Yes, I know there's an extension for that. Go check if your tar command actually does that by default. Go stdlib tar doesn't. Docker and such systems have elided fractional components for ages as well, since they've been using the go stdlib tar libraries. By and large, nobody seems to have noticed or cared. We can look at the rate of serious problems being reported about timestamp precision in e.g. docker and use it as a barometer for how much fractional timestamps matter. I'd say it's "not much".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only want to store what is needed for the top level use cases (package managers) given this is not a permanent or even our ideal solution to these problems. My assumption is that they need something close to what stat
gives them, which includes atime
and ctime
, but if nobody needs this we shouldn’t preserve this data.
My experience with fractional timestamps is that most FS API’s don’t give them to you by default and require an additional flag or option in order to see them, which didn’t meet the bar I had set, but again, if package managers need this then we can do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, note that high precision 64bit timestamps are in the current schema for unixfsv2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what’s the consensus here? keep all 3 or cut down to ctime and mtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO a prerequisite for adding these info like 'ctime' to unixfs should be to have documented answers to:
- If a command like
ipfs get
is going to unpack content from unixfs into a tree of dirs and files on a regular (say, ext4, just to pick something) filesystem, and this were run by a non root user, what should the behavior be? - Is there any use to ctime other than being effectively a random number generator in the input data, which users cannot for any practical purposes control? Should commands to add files to unixfs have a default behavior to set it to a "zero" value? If so, does that change the expected behavior of either 'unpack' or mount commands?
... because whoever implements code around this will discover these questions about five minutes after we merge this.
(Or maybe not, and we quietly introduce a defacto random number generator to our nice content addressable system without noticing, which would be... worse.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should note I have fairly strong and (I'd like to think) fairly informed opinions about this, because I've previously wrestled with it in the Timeless project... and while I'm not here to tell you I had The Solution over there, I will tell you that underestimating the complexity of the normalizing filters system that's almost certain to end up growing around this area would be a serious mistake. It's a touchy thing and has been a source of many headaches and bugs for me. Treat it cautiously and especially be wary of adding data if there's no plan for what its normalizers would look and act like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could split this PR into two parts - merge the mode
property (which seems relatively uncontentious and solves some real-world problems our users have today) and bottom out the *time
related stuff separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There’s two questions here that seem to be getting merged together:
- Should anyone ever be allowed to add these time properties to the File.
- How should IPFS use these properties, and should that behavior be optional or by default.
I’m inclined to follow @achingbrain’s advise and split it at this point. However, I’m going to keep mtime
in the schema since it seems to be the least controversial and it gives us some room to experiment with *time
issues in IPFS. Without it being in the schema we have no room to even experiment sufficiently.
We can work through the *time
issues in the initial implementations and decide before we ship whether or not we actually want to use it in IPFS.
Ignored. |
I'd like to just mention a couple links to prior art that's not merely prior art, but also particularly easy to read and review for inspirations:
Both of these (as well as the specs of tar, which I'm assuming everyone's at least given a cursory glance at already) are highly worth a quick skim just to see what other people have covered when trying to map this terrain. |
I think all file or dir metadata must be in separate metadata block. This will be more flexible than put it in block with links to sub blocks. Don't make mistakes as in bit torrent. Separate metadata block can be included in dirrectory block for each file or directory that it contains. It will protect links between dir block and data that it contains. Metadata block can wrap first file or dir block if it need (wery small files or dirs). It can contain links to blocks in different format (raw). Put it here: message Metadata {
optional string MimeType = 1;
/* file metadata */
optional uint32 mtime = 7;
optional uint32 atime = 8;
optional uint32 ctime = 9;
/* file permissions */
optional uint32 permissions = 10;
optional uint32 uid = 11;
optional uint32 gid = 12;
} |
@ivan386 could you elaborate what do you mean by this? @achingbrain @warpfork note that my suggestions are just that: suggestions, based on the obscenely long time I've spent thinking about all this. The proposed PR is not usable by me in its entirety ( due to 1:1 mapping of entries/metadata-objects ), and that is fine: a properly of a robust framework is the ability to have multiple experiments run simultaneously. Please do not take my suggestions seriously enough to potentially derail this PR ;) |
Every change of metadata change block hash that contain it. If include file/dir metadata to it root block it will be many root hashes for same file/dir with diffrent metadata. Leaf nodes will be same but root block different. This root block can be lost becouse it wery rare. It break chain and file will be inacceseble from parent block. All file data will be awaliable in IPFS but we can't get it without root block. Only way to return that root block in IPFS is to generate it from file data again. But if file root block will contain ctime or other metadata we can't gues what it was. And we will generate diffrent root hash without ability recover old root hash for that file. Separate Metadata block wery small for each file/dir it can be included in parent dir block by using of identity hash. This dir block will contain all files metadata and links to file/dir root block without metadata that always same for same data and not so rare. In this case we can lost links to root blocks to files/dirs only with parent dir block. And can recover root file block only by using file data. |
The file node is already, effectively, a root node of metadata pointing to linked file data. It’s not clear to me what another layer of indirection (metadata message) would preserve if it’s still what everyone needs to link to. The current use case for the metadata message is to define a We discussed this in the issue and the consensus seemed to be that this was the best path forward. |
@mikeal content-type is optional field. mtime can be used in HTTP to. This example link to dir block that contain:
All three links point to same file. |
Wanted to quickly check whether "mode" will fit the use case described for guix such that executable bit can be contained within file metadata? According to @Ambrevar at IPFS camp, that's one of the few remaining blocker to get the guix-ipfs-cache over the finish line - https://github.com/ipfs/camp/blob/master/DEEP_DIVES/package-managers/README.md#guix-patch-notes-that-may-be-helpful-in-other-package-manager-efforts |
@momack2 yes. The mode allows encoding the executable bit along with all the other bits (sticky, read, write, etc.). Although, we should probably define how we're using the mode as we don't want bits like "directory". |
@momack2: Just to confirm what @Stebalien said, yes that would do it! Note that the link to https://github.com/fps/guix-ipfs-cache is a different (more hacky) proof of concept. The proper patch waiting to be merged can be reviewed at https://issues.guix.gnu.org/issue/33899. |
Want to raise one last question from my corner of the ring, I am afraid this is being lost... In #220 (comment) @Stebalien said:
Given that this detour is taken for the sake of the package management / nix-like-things groups: does it actually make sense to do this all as a new generic wrapper node that we will have to support forever? To reiterate on (again @Stebalien)'s point in #217 (comment):
As far as I can tell, for the original set of users waiting on this, there is no usecase where a piece of data needs POSIX attributes without a filename. The filename is already only available as a directory wrapper node, there is no other way to attach name to an "anonymous stream of data". What makes other attributes different enough to warrant a non-directory anonymous wrapper-node? |
The current document represents the closest thing we have to a consensus but it doesn’t feel like it’s solid enough to merge yet given the current spec in master is quite stable. At this point, I think that we should move forward with an implementation in at least one language (probably JS) and come back to the spec once we have some feedback from the implementation. I’ll update the PR to |
Update:
|
@@ -37,6 +37,12 @@ message Data { | |||
|
|||
optional uint64 hashType = 5; | |||
optional uint64 fanout = 6; | |||
|
|||
/* file permissions */ | |||
optional uint32 mode = 7; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrew which bits of the mode do we need for package managers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@achingbrain may know this too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Executable is the main one, I think everything else is "nice to have".
@mikeal was this documented anywhere? |
Prior discussion in #217