Update go-ipld-git to a go-ipld-prime codec #46

Merged
merged 22 commits into from
Aug 12, 2021
Changes from 6 commits
87 changes: 69 additions & 18 deletions README.md
@@ -3,17 +3,8 @@ Git ipld format

[![](https://img.shields.io/badge/made%20by-Protocol%20Labs-blue.svg?style=flat-square)](http://ipn.io)
[![](https://img.shields.io/badge/project-IPFS-blue.svg?style=flat-square)](http://ipfs.io/)
[![](https://img.shields.io/badge/freenode-%23ipfs-blue.svg?style=flat-square)](http://webchat.freenode.net/?channels=%23ipfs)
[![Coverage Status](https://codecov.io/gh/ipfs/go-ipld-git/branch/master/graph/badge.svg)](https://codecov.io/gh/ipfs/go-ipld-git/branch/master)
[![Travis CI](https://travis-ci.org/ipfs/go-ipld-git.svg?branch=master)](https://travis-ci.org/ipfs/go-ipld-git)

> An ipld codec for git objects allowing path traversals across the git graph!

Note: This is WIP and may not be an entirely correct parser.

## Lead Maintainer

[Łukasz Magiera](https://github.com/magik6k)
rvagg marked this conversation as resolved.
> An IPLD codec for git objects allowing path traversals across the git graph.

## Table of Contents

@@ -29,19 +20,49 @@ go get github.com/ipfs/go-ipld-git
```

## About

This is an IPLD codec which handles git objects. Objects are transformed
into IPLD graph in the following way:
into IPLD graph as detailed below. Objects are demonstrated here using both
[IPLD Schemas](https://ipld.io/docs/schemas/) and example JSON forms.

### Commit

```ipldsch
type GpgSig string

type PersonInfo struct {
date String
Review comment (Contributor):

since you're not keeping dates in a human-friendly single string format, I wonder if you could make this unix timestamp an Int too.

timezone String
email String
name String
}

type Commit struct {
author optional PersonInfo
committer optional PersonInfo
message String
parents [&Commit]
tree &Tree # see "Tree" section below
encoding optional String
signature optional GpgSig
mergeTag [Tag]
other [String]
}
```

As JSON, real data would look something like:

* Commit:
```json
{
"author": {
"date": "1503667703 +0200",
"date": "1503667703",
"timezone": "+0200",
"email": "author@mail",
"name": "Author Name"
},
"committer": {
"date": "1503667703 +0200",
"date": "1503667703",
"timezone": "+0200",
Review comment on lines +64 to +65:

Is the idea here that this is easier to work with than a single object? It doesn't seem to match how the Git spec handles things internally. Although perhaps it's fair for us to try and make things easier for our users here.

I was talking with @Stebalien about this and as far as we can tell there hasn't been much tooling developed around this codec so making a breaking change that helps people out and seems sane is probably fine.

Review comment (Member):

(delayed reply to this) the main idea here is to keep the data lossless and not do too many tricks to make it palatable to the user.

Encoded form looks something like:

author A U Thor <author@example.com> 1465981137 +0000

But there's even some flexibility in what those final two fields can contain and apparently whether we get two, one or zero of them (I'm not sure about that detail though). So we treat them as strings and present them to the user as they are.

The current version of this codec keeps them internally but only presents a nice human-readable form as a DAG node, which isn't going to work when we're treating DAG traversal as traversal over the data model like we're doing with ipld-prime.

We did discuss (in this thread and other places) some options for making the human-readable form emerge out of the data, but @willscott made the case that, since this is something of a reference codec that we'd point others to when they implement their own, we'd better keep it simple for now.

So, lossless data, pure data model, no trickery, user gets to interpret the nodes as they want.

"email": "author@mail",
"name": "Author Name"
},
@@ -51,10 +72,22 @@ into IPLD graph in the following way:
],
"tree": <LINK>
}
```

### Tag

```ipldsch
type Tag struct {
message String
object &Any
tag String
tagger PersonInfo
tagType String
}
```

* Tag:
As JSON, real data would look something like:

```json
{
"message": "message\n",
@@ -67,12 +100,23 @@ into IPLD graph in the following way:
"email": "author@mail",
"name": "Author Name"
},
"type": "commit"
"tagType": "commit"
rvagg marked this conversation as resolved.
}
```

### Tree

```ipldsch
type Tree {String:TreeEntry}

type TreeEntry struct {
mode String
hash &Any
}
```

* Tree:
As JSON, real data would look something like:

```json
{
"file.name": {
@@ -87,11 +131,18 @@ into IPLD graph in the following way:
}
```

### Blob

```ipldsch
type Blob bytes
```

As JSON, real data would look something like:

* Blob:
```json
"<base64 of 'blob <size>\0<data>'>"
```

## Contribute

PRs are welcome!
72 changes: 36 additions & 36 deletions commit.go
@@ -23,9 +23,9 @@ func DecodeCommit(na ipld.NodeAssembler, rd *bufio.Reader) error {
}

c := _Commit{
Parents: _ListParents{[]_Link{}},
MergeTag: _ListTag{[]_Tag{}},
Other: _ListString{[]_String{}},
parents: _ListParents{[]_LinkCommit{}},
mergeTag: _ListTag{[]_Tag{}},
other: _ListString{[]_String{}},
}
for {
line, _, err := rd.ReadLine()
@@ -53,30 +53,30 @@ func decodeCommitLine(c Commit, line []byte, rd *bufio.Reader) error {
return err
}

c.GitTree = _LinkTree{cidlink.Link{Cid: shaToCid(sha)}}
c.tree = _LinkTree{cidlink.Link{Cid: shaToCid(sha)}}
case bytes.HasPrefix(line, []byte("parent ")):
psha, err := hex.DecodeString(string(line[7:]))
if err != nil {
return err
}

c.Parents.x = append(c.Parents.x, _Link{cidlink.Link{Cid: shaToCid(psha)}})
c.parents.x = append(c.parents.x, _LinkCommit{cidlink.Link{Cid: shaToCid(psha)}})
case bytes.HasPrefix(line, []byte("author ")):
a, err := parsePersonInfo(line)
if err != nil {
return err
}

c.Author = _PersonInfo__Maybe{m: schema.Maybe_Value, v: a}
c.author = _PersonInfo__Maybe{m: schema.Maybe_Value, v: a}
case bytes.HasPrefix(line, []byte("committer ")):
com, err := parsePersonInfo(line)
if err != nil {
return err
}

c.Committer = _PersonInfo__Maybe{m: schema.Maybe_Value, v: com}
c.committer = _PersonInfo__Maybe{m: schema.Maybe_Value, v: com}
case bytes.HasPrefix(line, []byte("encoding ")):
c.Encoding = _String__Maybe{m: schema.Maybe_Value, v: &_String{string(line[9:])}}
c.encoding = _String__Maybe{m: schema.Maybe_Value, v: _String{string(line[9:])}}
case bytes.HasPrefix(line, []byte("mergetag object ")):
sha, err := hex.DecodeString(string(line)[prefixMergetag:])
if err != nil {
@@ -88,7 +88,7 @@ func decodeCommitLine(c Commit, line []byte, rd *bufio.Reader) error {
return err
}

c.MergeTag.x = append(c.MergeTag.x, *mt)
c.mergeTag.x = append(c.mergeTag.x, *mt)

if rest != nil {
err = decodeCommitLine(c, rest, rd)
@@ -101,33 +101,33 @@ func decodeCommitLine(c Commit, line []byte, rd *bufio.Reader) error {
if err != nil {
return err
}
c.Sig = _GpgSig__Maybe{m: schema.Maybe_Value, v: sig}
c.signature = _GpgSig__Maybe{m: schema.Maybe_Value, v: sig}
case len(line) == 0:
rest, err := ioutil.ReadAll(rd)
if err != nil {
return err
}

c.Message = _String{string(rest)}
c.message = _String{string(rest)}
default:
c.Other.x = append(c.Other.x, _String{string(line)})
c.other.x = append(c.other.x, _String{string(line)})
}
return nil
}

func decodeGpgSig(rd *bufio.Reader) (GpgSig, error) {
func decodeGpgSig(rd *bufio.Reader) (_GpgSig, error) {
out := _GpgSig{}

line, _, err := rd.ReadLine()
if err != nil {
return nil, err
return out, err
}

out := _GpgSig{}

if string(line) != " " {
if strings.HasPrefix(string(line), " Version: ") || strings.HasPrefix(string(line), " Comment: ") {
out.x += string(line) + "\n"
} else {
return nil, fmt.Errorf("expected first line of sig to be a single space or version")
return out, fmt.Errorf("expected first line of sig to be a single space or version")
}
} else {
out.x += " \n"
@@ -136,7 +136,7 @@ func decodeGpgSig(rd *bufio.Reader) (GpgSig, error) {
for {
line, _, err := rd.ReadLine()
if err != nil {
return nil, err
return out, err
}

if bytes.Equal(line, []byte(" -----END PGP SIGNATURE-----")) {
@@ -146,7 +146,7 @@ func decodeGpgSig(rd *bufio.Reader) (GpgSig, error) {
out.x += string(line) + "\n"
Review comment (Contributor):

I'd strongly advise you to use something like strings.Builder instead; each += on a string is an allocation and copy of the entire thing. Just twenty of these operations could already add up to showing up on CPU profiles.

Review comment (Contributor Author):

this is copied verbatim from https://github.com/ipfs/go-ipld-git/blob/master/git.go#L310

I'll file as an issue to optimize if we end up caring

}

return &out, nil
return out, nil
}

func encodeCommit(n ipld.Node, w io.Writer) error {
@@ -158,33 +158,33 @@ func encodeCommit(n ipld.Node, w io.Writer) error {

buf := new(bytes.Buffer)
Review comment (Contributor):

if you think this buffer could get large, you could also use a bufio.Writer, which will write to the final writer in chunks. But if we think it will never be more than a few kilobytes, it's likely irrelevant.


fmt.Fprintf(buf, "tree %s\n", hex.EncodeToString(c.GitTree.sha()))
for _, p := range c.Parents.x {
fmt.Fprintf(buf, "tree %s\n", hex.EncodeToString(c.tree.sha()))
for _, p := range c.parents.x {
fmt.Fprintf(buf, "parent %s\n", hex.EncodeToString(p.sha()))
Review comment (Contributor):

you can simplify these by using %x instead :)

Review comment (Contributor Author):

i'll leave as is since copied verbatim
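
A small sketch of the `%x` suggestion: for a `[]byte`, the `%x` verb produces the same lowercase hex as `hex.EncodeToString`. The helper names and byte values below are made up for illustration.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// viaEncode formats a parent line the way the current code does.
// (Hypothetical helper name, for illustration only.)
func viaEncode(sha []byte) string {
	return fmt.Sprintf("parent %s\n", hex.EncodeToString(sha))
}

// viaVerb formats the same line using the %x verb directly on the bytes.
func viaVerb(sha []byte) string {
	return fmt.Sprintf("parent %x\n", sha)
}

func main() {
	sha := []byte{0x00, 0x0f, 0xde, 0xad} // illustrative bytes, not a real hash
	fmt.Print(viaEncode(sha))
	fmt.Print(viaVerb(sha))
	fmt.Println(viaEncode(sha) == viaVerb(sha))
}
```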

}
fmt.Fprintf(buf, "author %s\n", c.Author.v.GitString())
fmt.Fprintf(buf, "committer %s\n", c.Committer.v.GitString())
if c.Encoding.m == schema.Maybe_Value {
fmt.Fprintf(buf, "encoding %s\n", c.Encoding.v.x)
fmt.Fprintf(buf, "author %s\n", c.author.v.GitString())
fmt.Fprintf(buf, "committer %s\n", c.committer.v.GitString())
if c.encoding.m == schema.Maybe_Value {
fmt.Fprintf(buf, "encoding %s\n", c.encoding.v.x)
}
for _, mtag := range c.MergeTag.x {
fmt.Fprintf(buf, "mergetag object %s\n", hex.EncodeToString(mtag.Object.sha()))
fmt.Fprintf(buf, " type %s\n", mtag.TagType.x)
fmt.Fprintf(buf, " tag %s\n", mtag.Tag.x)
fmt.Fprintf(buf, " tagger %s\n \n", mtag.Tagger.GitString())
fmt.Fprintf(buf, "%s", mtag.Text.x)
for _, mtag := range c.mergeTag.x {
fmt.Fprintf(buf, "mergetag object %s\n", hex.EncodeToString(mtag.object.sha()))
fmt.Fprintf(buf, " type %s\n", mtag.tagType.x)
fmt.Fprintf(buf, " tag %s\n", mtag.tag.x)
fmt.Fprintf(buf, " tagger %s\n \n", mtag.tagger.GitString())
fmt.Fprintf(buf, "%s", mtag.message.x)
}
if c.Sig.m == schema.Maybe_Value {
if c.signature.m == schema.Maybe_Value {
fmt.Fprintln(buf, "gpgsig -----BEGIN PGP SIGNATURE-----")
fmt.Fprint(buf, c.Sig.v.x)
fmt.Fprint(buf, c.signature.v.x)
fmt.Fprintln(buf, " -----END PGP SIGNATURE-----")
}
for _, line := range c.Other.x {
for _, line := range c.other.x {
fmt.Fprintln(buf, line.x)
}
fmt.Fprintf(buf, "\n%s", c.Message.x)
fmt.Fprintf(buf, "\n%s", c.message.x)

fmt.Printf("encode commit len: %d \n", buf.Len())
// fmt.Printf("encode commit len: %d \n", buf.Len())
// fmt.Printf("out: %s\n", string(buf.Bytes()))
_, err := fmt.Fprintf(w, "commit %d\x00", buf.Len())
if err != nil {
53 changes: 26 additions & 27 deletions gen/gen.go
@@ -18,41 +18,40 @@ func main() {
ts.Accumulate(schema.SpawnList("ListString", "String", false))
ts.Accumulate(schema.SpawnLink("Link"))
ts.Accumulate(schema.SpawnStruct("PersonInfo", []schema.StructField{
schema.SpawnStructField("Name", "String", false, false),
schema.SpawnStructField("Email", "String", false, false),
schema.SpawnStructField("Date", "String", false, false),
schema.SpawnStructField("Timezone", "String", false, false),
}, schema.SpawnStructRepresentationStringjoin(" ")))
schema.SpawnStructField("date", "String", false, false),
schema.SpawnStructField("timezone", "String", false, false),
schema.SpawnStructField("email", "String", false, false),
schema.SpawnStructField("name", "String", false, false),
}, schema.SpawnStructRepresentationMap(map[string]string{})))
Review comment (Contributor Author):

why did you change this to a map rather than a stringjoin?

Review comment (Member):

Because it's not being used as a stringjoin in any way? We have a custom PersonInfo#GitString() that's used to encode it in the git format and parsePersonInfo() to parse the input string for decoding. Stringjoin here would only mean that, when it's encoded with dag-cbor or dag-json, it comes out as <date> <timezone> <email> <name>, which IMO is lossy and probably not what we want out of this. Unless I'm missing something?

Review comment (Contributor Author):

<date> <timezone> <email> <name> is the git format though. I would hope that wouldn't be lossy.

Review comment (Member):

it's $name <$email> $maybedate $maybetimezone - it's lossy because the email has a <> around it and we can use the < to delineate the name which itself can contain any number of spaces, so simple stringjoin isn't enough

Review comment (Member, @warpfork, Jul 30, 2021):

Agree with Rod's take here: this seems better to do at the codec layer and yield up to data model as a map.

If I was going to transliterate some git data to dag-json, I'd probably want this to become a map in dag-json. So that means the codec should do this parse/transform immediately.
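
To illustrate the point made in this thread, here is a minimal sketch of why a plain space-separated stringjoin cannot represent the `$name <$email> $maybedate $maybetimezone` form, and how the angle brackets can serve as delimiters instead. `parsePerson` and `personInfo` are hypothetical names for illustration; this is not the codec's `parsePersonInfo`, and it assumes the leading `author ` or `committer ` keyword has already been stripped.

```go
package main

import (
	"fmt"
	"strings"
)

// personInfo mirrors the four string fields of the PersonInfo schema type.
type personInfo struct {
	Name, Email, Date, Timezone string
}

// parsePerson is a hypothetical helper (not the codec's parsePersonInfo).
// It splits "$name <$email> $maybedate $maybetimezone" on the angle
// brackets, so a name containing spaces is kept as one field.
func parsePerson(s string) (personInfo, error) {
	lt := strings.IndexByte(s, '<')
	gt := strings.LastIndexByte(s, '>')
	if lt < 0 || gt < lt {
		return personInfo{}, fmt.Errorf("no <email> found in %q", s)
	}
	p := personInfo{
		Name:  strings.TrimSpace(s[:lt]),
		Email: s[lt+1 : gt],
	}
	// Date and timezone are optional and kept as opaque strings.
	rest := strings.Fields(s[gt+1:])
	if len(rest) > 0 {
		p.Date = rest[0]
	}
	if len(rest) > 1 {
		p.Timezone = rest[1]
	}
	return p, nil
}

func main() {
	line := "A U Thor <author@example.com> 1465981137 +0000"
	p, err := parsePerson(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
	// A naive split on single spaces yields six tokens for four fields.
	fmt.Println(len(strings.Split(line, " ")))
}
```

Note that this sketch trims whitespace around the name, so it is not strictly byte-for-byte lossless; a real codec would need to decide how to preserve such edge cases.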

ts.Accumulate(schema.SpawnString("GpgSig"))
ts.Accumulate(schema.SpawnStruct("Tag", []schema.StructField{
schema.SpawnStructField("Object", "Link", false, false),
schema.SpawnStructField("TagType", "String", false, false),
schema.SpawnStructField("Tag", "String", false, false),
schema.SpawnStructField("Tagger", "PersonInfo", false, false),
schema.SpawnStructField("Text", "String", false, false),
}, schema.SpawnStructRepresentationMap(map[string]string{"Type": "TagType"})))
schema.SpawnStructField("message", "String", false, false),
schema.SpawnStructField("object", "Link", false, false),
schema.SpawnStructField("tag", "String", false, false),
schema.SpawnStructField("tagger", "PersonInfo", false, false),
schema.SpawnStructField("tagType", "String", false, false),
}, schema.SpawnStructRepresentationMap(map[string]string{})))
ts.Accumulate(schema.SpawnList("ListTag", "Tag", false))
ts.Accumulate(schema.SpawnList("ListParents", "Link", false)) //Todo: type 'Parents' links
ts.Accumulate(schema.SpawnLinkReference("LinkCommit", "Commit"))
ts.Accumulate(schema.SpawnList("ListParents", "LinkCommit", false))
ts.Accumulate(schema.SpawnStruct("Commit", []schema.StructField{
schema.SpawnStructField("GitTree", "LinkTree", false, false),
schema.SpawnStructField("Parents", "ListParents", false, false),
schema.SpawnStructField("Message", "String", false, false),
schema.SpawnStructField("Author", "PersonInfo", true, false),
schema.SpawnStructField("Committer", "PersonInfo", true, false),
schema.SpawnStructField("Encoding", "String", true, false),
schema.SpawnStructField("Sig", "GpgSig", true, false),
schema.SpawnStructField("MergeTag", "ListTag", false, false),
schema.SpawnStructField("Other", "ListString", false, false),
schema.SpawnStructField("author", "PersonInfo", true, false),
schema.SpawnStructField("committer", "PersonInfo", true, false),
schema.SpawnStructField("message", "String", false, false),
schema.SpawnStructField("parents", "ListParents", false, false),
schema.SpawnStructField("tree", "LinkTree", false, false),
schema.SpawnStructField("encoding", "String", true, false),
schema.SpawnStructField("signature", "GpgSig", true, false),
schema.SpawnStructField("mergeTag", "ListTag", false, false),
schema.SpawnStructField("other", "ListString", false, false),
}, schema.SpawnStructRepresentationMap(map[string]string{})))
ts.Accumulate(schema.SpawnBytes("Blob"))

ts.Accumulate(schema.SpawnList("Tree", "TreeEntry", false))
ts.Accumulate(schema.SpawnMap("Tree", "String", "TreeEntry", false))
ts.Accumulate(schema.SpawnLinkReference("LinkTree", "Tree"))
ts.Accumulate(schema.SpawnStruct("TreeEntry", []schema.StructField{
schema.SpawnStructField("Mode", "String", false, false),
schema.SpawnStructField("Name", "String", false, false),
schema.SpawnStructField("Hash", "Link", false, false),
}, schema.SpawnStructRepresentationStringjoin(" ")))
schema.SpawnStructField("mode", "String", false, false),
schema.SpawnStructField("hash", "Link", false, false),
}, schema.SpawnStructRepresentationMap(map[string]string{})))

if errs := ts.ValidateGraph(); errs != nil {
for _, err := range errs {