-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update go-ipld-git to a go-ipld-prime codec #46
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM, though take it with a grain of salt as this is the first time I look into git object formats. I also assume that the existing test suite is extensive enough to give us confidence of no regressions.
_, ok := t.entries[p] | ||
if !ok { | ||
return nil | ||
t := Type.Tree__Repr.NewBuilder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow why we have a Tree__Repr builder here, and its result is built and assigned into a Tree builder, which is then again built. It seems like the other decode funcs follow the same pattern. Wouldn't a single Build per Decode be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this came out of tests:
the default is to use a basic.any as the assembler, and if that was directly used to make a list / map you end up with a top level basic node rather than the schema'd type that we have in this package.
When that's true, the decode path can't fast-path, and has to re assign that basic node into the schema type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with mvdan here -- it's a little odd to have a concrete reference like this in a method that is already accepting a NodeAssembler as a parameter. When there is a NodeAssembler as a parameter, I would usually expect it to just be used directly. (And maybe documentation can state something about "it is recommended to use a Tree__ReprAssembler here for best results".)
This doesn't seem dangerously incorrect as written, but does seem a little baroque. And it is probably doing a little more memory dancing than is necessary or optimal, too.
tree.go
Outdated
return err | ||
} | ||
for { | ||
node, err := DecodeTreeEntry(rd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before; why not pass the la.AssembleValue()
assembler to the decoder and let it assign directly into it?
|
||
func cidToSha(c cid.Cid) []byte { | ||
h := c.Hash() | ||
return h[len(h)-20:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, isn't this a bit wasteful if we only end up using part of the result? Is this to strip the multihash prefix? Might be good to document it a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right. this just uses the sha1 part of the multihash without the prefix to provide the git-compatible hash.
@willscott @mvdan this codec changes all of the struct fields to camel-case, |
Some additional thoughts here where the differences in forms can be seen from a pathing perspective: ipfs/kubo#8287 If you're OK with it, I'd like to try and get the forms here back to the original. |
That sounds fine to me - I'm not sure if ipld-prime codegen expects capitalized names, for the sake of exported names in the generated Go code. |
I also have no problem with changing name case back for consistency |
0436bc7 now has the field names matching the original (lower-case, and some other minor adjustments). I think we need to do something about this strict need to match Go conventions on these names since it's not always going to match what you want your data model form to look like. |
Out of curiosity, would it be enough to add "repr" mapping of field names so that the repr layer uses lowercase names? Aside from that, I wonder if it's a bit weird for the schema to have the field names be lowercase, but the type names be uppercase. |
Sadly, no, because it's pathing at the data model layer that we want to care about, not encoding. Having |
This seems fine to me, it actually looks pretty reasonable to my eyes (eyes that have been trained on C-family-style and aren't as accustomed to the upper-case first character rule, granted). Have a look at the schemas that Hannah wrote up in https://github.com/ipfs/go-unixfsnode/blob/main/data/gen/main.go (although the schema field names don't match what's being generated, but you can see the same thing going on there). |
SGTM, thanks for confirming on both accounts. I've replied to ipld/go-ipld-prime#206 :) |
Found in ipfs/go-ipld-git#46 integrating into go-ipfs which wants to load typed nodes when it encounters typed links. It errors because it's trying to load the typed node into the type of the typed link rather than the referenced type.
Found in ipfs/go-ipld-git#46 integrating into go-ipfs which wants to load typed nodes when it encounters typed links. It errors because it's trying to load the typed node into the type of the typed link rather than the referenced type.
waiting for working link system (ipld/go-ipld-prime#143) to test sha1 (currently link system doesn't do the hashing)
@mvdan @willscott I've brought this up to what I think is a mergable state:
and a few other minor things. The main difference that I'm in two minds about is date handling - we're pulling out For now I think it'd be fine to just leave this as is and leave it to the user to reform the stringified version if they want it, we could add back in a utility function to do that if needed, or evolve this code further in the future. IMO this can be merged now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of these changes, mostly the lower-casing of fields, looks good to me.
We should get some sort of nod from stewards before merging, i think?
gen/gen.go
Outdated
}, schema.SpawnStructRepresentationStringjoin(" "))) | ||
schema.SpawnStructField("date", "String", false, false), | ||
schema.SpawnStructField("timezone", "String", false, false), | ||
schema.SpawnStructField("email", "String", false, false), | ||
schema.SpawnStructField("name", "String", false, false), | ||
}, schema.SpawnStructRepresentationMap(map[string]string{}))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you change this to a map rather than a stringjoin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's not being used as a stringjoin in any way? We have a custom PersonInfo#GitString()
that's used to encode it in the git format and parsePersonInfo()
to parse the input string for decoding. Stringjoin here is only going to mean that when it's encoded with dag-cbor or dag-json that it comes out as <date> <timezone> <email> <name>
, which IMO is lossy and probably not what we're wanting out of this. Unless I'm missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<date> <timezone> <email> <name>
is the git format though. I would hope that wouldn't be lossy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's $name <$email> $maybedate $maybetimezone
- it's lossy because the email has a <>
around it and we can use the <
to delineate the name which itself can contain any number of spaces, so simple stringjoin isn't enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with Rod's take here: this seems better to do at the codec layer and yield up to data model as a map.
If I was going to transliterate some git data to dag-json, I'd probably want this to become a map in dag-json. So that means the codec should do this parse/transform immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
type GpgSig string | ||
|
||
type PersonInfo struct { | ||
date String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you're not keeping dates in a human-friendly single string format, I wonder if you could make this unix timestamp an Int too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good to me, left a few comments/questions. I'm sure you folks have much greater context on the go-ipld-prime usage than I do.
It looks like there are some breaking changes to the data output by this codec. We should be dropping the list of breaking changes into the release tag for when we release this so I think it'd be great if we could writeup the list of breaking changes to the codec (at the spec level) for anyone who happens to be affected. For the purposes of review having an explanation of why a breaking change was made would also be appreciated.
I don't know if we have guidelines on how codec names should match to the internal names of the format we're representing with IPLD, but I'd suspect in general we'd try to match the native names.
"date": "1503667703", | ||
"timezone": "+0200", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea here that this is easier to work with an a single object? It doesn't seem to match how the Git spec handles things internally. Although perhaps it's fair for us to try and make things easier for our users here.
I was talking with @Stebalien about this and as far as we can tell there hasn't been much tooling developed around this codec so making a breaking change that helps people out and seems sane is probably fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(delayed reply to this) the main idea here is to keep the data lossless and not do too many tricks to make it palatable to the user.
Encoded form looks something like:
author A U Thor <author@example.com> 1465981137 +0000
But there's even some flexibility in what those final two fields can contain and apparently whether we get two, one or zero of them (I'm not sure about that detail though). So we treat them as strings and present them to the user as they are.
The current version of this codec keeps them internally but only presents a nice human-readable form as a DAG node, which isn't going to work when we're treating DAG traversal as traversal over the data model like we're doing with ipld-prime.
We did discuss (in this thread and other places) some options for making the human-readable form emerge out of the data to be available, but @willscott made the case that since this codec is a bit of a reference codec that we'd want to point others to for how to implement one that we'd better keep it simple for now.
So, lossless data, pure data model, no trickery, user gets to interpret the nodes as they want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big picture, LGTM.
+1 big happy on the fix to capitizalization (field names being downcase). That makes more sense to me.
Making sure we've got all the loose ends covered here:
|
Yes, and a review for other breaking changes. Posting what the release notes for this release would look like would also be great since as soon as this is merged we're going to cut a release anyway and that release is going to need release notes 😄. |
Have fixed up Original code has a Will put together some basic release notes next week. |
* mergeTag -> mergetag * tagType -> type
Release notes: BREAKING CHANGES
|
Might I suggest we drop that change info into a (I'm pretty happy with the loose format in go-ipld-prime/CHANGELOG.md#released-changes, if having an outline to riff from greases the wheels. :)) |
(Informational note, no action required: I found myself nerdsniped by |
Have added a changelog and fixed the readme bork. Normally I don't like changelogs because it's too easy for them to get out of sync, but recently I've adopted the release-on-every-merge practice in JS and it includes both tag + github release + changelog updates and all I need to do is make sure my commit messages are informative and standardised, e.g. https://github.com/rvagg/cborg/blob/master/CHANGELOG.md & https://github.com/rvagg/cborg/releases & https://www.npmjs.com/package/cborg all from pressing the merge button. Can recommend, although Go folks have other things to worry about in the release process I suppose. |
Sweet. I think this is officially ready for the mergeparty (later this week? early next week?) where we land a bunch of these similar PRs in other repos at once. |
func init() { | ||
mc.RegisterEncoder(cid.GitRaw, Encode) | ||
mc.RegisterDecoder(cid.GitRaw, Decode) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary for functioning, or just a convenience function for updating the global registry via module imports?
@warpfork I recall your aversion to global registries. Does encouraging codec creators to do this, but codec users to use custom LinkSystems when needed seem like a reasonable compromise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe this was the compromise we ended up at, yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's exactly it.
Global registries have downsides, but using package-init-time's natural synchronization is just too valuable. And having ways to opt-out of using the global registry, if some really needs to, seems to mitigate the biggest downsides handily. So: 👍
I have created a draft release & tag for this, I'm working on some assumptions here about Go (and PL) release practices that may not be quite correct. So someone correct me in the next day or two before I press merge and create the release & tag. I would prefer to use semver properly and do a v1.0.0 but I don't think we're in the habit of doing that with Go so I'm assuming v0.1.0 is appropriate, a tag is sufficient and a GitHub release is a bonus. |
The existing go-ipld-git codec allows for translation between git file data, and go-ipld-format node representations of the IPLD dag. This work attempts to instead provide a go-ipld-prime representation of that dag.
fix #45