New version of ocaml-git #395

Merged
merged 173 commits into from
Sep 2, 2020

dinosaure
Member

The new version of ocaml-git

The initial goal of this PR is to MirageOS-ize ocaml-git. Indeed, if you
look into the details, the implementation of ocaml-git currently used by
MirageOS is the Mem implementation, which is a simple Hashtbl.t (and only
needs a hash algorithm and the caml-runtime).

This PR provides another possible way to use ocaml-git with MirageOS.
The main problem is the implementation needed to Make a Git store, which is
currently too close to POSIX.

Side-effect

The PR also takes the opportunity to update, and fix bugs which are intrinsic
to:

  • the PACK encoder/decoder
  • the Smart protocol

Underlying needed layout

Maybe 2 years ago, I started to think of the Git store as 2 spaces where:

  • the first one should contain recent objects (possibly volatile)

    Also known as loose objects - these objects take advantage of the
    underlying file-system to store/search Git objects. The layout is close to a
    simple radix tree over the hex-representation of the hash produced by the
    chosen hash algorithm, where:

    • a file is a leaf
    • a directory is a node of the radix tree
  • the second one should contain lifelong objects

    Also known as pack files, each of which contains several objects.
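
To make the radix-tree layout concrete, here is a minimal sketch of how a hex-encoded hash maps to the path of a loose object in a regular Git repository: the first two hex characters name the directory (the node) and the rest names the file (the leaf). The function name `loose_path` is only illustrative and is not part of ocaml-git.

```ocaml
(* Illustrative helper (hypothetical name, not part of ocaml-git): map a
   hex-encoded hash to the relative path of its loose object. *)
let loose_path hex =
  if String.length hex < 3 then invalid_arg "loose_path: hash too short";
  (* The first two hex characters are the directory (node of the tree). *)
  let dir = String.sub hex 0 2 in
  (* The remaining characters are the file name (leaf of the tree). *)
  let file = String.sub hex 2 (String.length hex - 2) in
  String.concat "/" [ "objects"; dir; file ]
```

For example, the hash of the empty blob, `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, ends up under `objects/e6/`.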

Let's call them the minor (loose) and major (pack) heaps. From that, what do
these spaces need in the Unix world and/or the MirageOS world?

For the minor heap, it should be simple, as it needs:

module type MINOR = sig
  type t
  type uid

  val exists : t -> uid -> bool
  val length : t -> uid -> int64
  val map : t -> uid -> pos:int64 -> int -> bigstring
  val append : t -> uid -> bigstring -> unit
  val appendv : t -> uid -> bigstring list -> unit
  val list : t -> uid list
end

Here append/appendv (atomic/non-atomic) create and fill the object
uid in t (which is a representation of the minor heap given by the user).
For the UNIX world, a machinery of several syscalls is needed (stat,
create, write and close) and for MirageOS, we are still able to use a simple
Hashtbl.t or something better (in terms of memory consumption/performance).
But the real constraint to fit into both worlds is:

module Make (Hash : HASH) (Minor : MINOR with type uid = Hash.t)

As we said, for the Unix world, Git considers the file-system as a radix tree
where paths (keys) are the hash of the Git object.
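
As an illustration of the "simple Hashtbl.t" option for MirageOS, here is a minimal in-memory sketch of the operations listed in MINOR. Everything in it is an assumption made for the example: the module name `Mem_minor`, the choice of `string` for `uid`, the extra `create` function and the two conversion helpers are hypothetical and are not the implementation shipped by this PR.

```ocaml
(* Hypothetical in-memory minor heap. Names are illustrative only. *)
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Small helpers to move between [string] and [bigstring]. *)
let bigstring_of_string s : bigstring =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b

let string_of_bigstring (b : bigstring) =
  String.init (Bigarray.Array1.dim b) (Bigarray.Array1.get b)

module Mem_minor = struct
  type uid = string (* assumption: a hex-encoded hash *)
  type t = (uid, Buffer.t) Hashtbl.t

  let create () : t = Hashtbl.create 16
  let exists t uid = Hashtbl.mem t uid

  let length t uid =
    match Hashtbl.find_opt t uid with
    | Some buf -> Int64.of_int (Buffer.length buf)
    | None -> 0L

  (* Append one chunk to the object [uid], creating it if needed. *)
  let append t uid (chunk : bigstring) =
    let buf =
      match Hashtbl.find_opt t uid with
      | Some buf -> buf
      | None ->
          let buf = Buffer.create 64 in
          Hashtbl.add t uid buf;
          buf
    in
    for i = 0 to Bigarray.Array1.dim chunk - 1 do
      Buffer.add_char buf (Bigarray.Array1.get chunk i)
    done

  let appendv t uid chunks = List.iter (append t uid) chunks

  (* Return at most [len] bytes of [uid] starting at [pos].
     Raises [Not_found] if [uid] does not exist. *)
  let map t uid ~pos len : bigstring =
    let buf = Hashtbl.find t uid in
    let pos = Int64.to_int pos in
    let len = max 0 (min len (Buffer.length buf - pos)) in
    let out = Bigarray.Array1.create Bigarray.char Bigarray.c_layout len in
    for i = 0 to len - 1 do
      Bigarray.Array1.set out i (Buffer.nth buf (pos + i))
    done;
    out

  let list t = Hashtbl.fold (fun uid _ acc -> uid :: acc) t []
end
```

A Unix implementation would expose the same operations on top of stat, create, write and close; only the representation of `t` changes.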

The major heap is a bit more complex, since we can have several PACK
files which each store several objects. These objects are then indexed by
an *.idx file.

So we can represent this space with:

module type MAJOR = sig
  type t
  type fd
  type uid
  
  val create : t -> uid -> fd
  val append : t -> fd -> string -> unit
  val close : t -> fd -> unit
  val length : t -> fd -> int64
  val map : t -> fd -> pos:int64 -> int -> bigstring
  
  val list : t -> uid list
  
  val move : t -> uid -> uid -> unit
end

With this interface, we assume that the creation of a PACK file (which contains
several Git objects) and the way to fill it are not atomic (unlike the
minor heap).

This interface is close to POSIX (but less close than what we currently have).
However, we can treat this interface as an append-only interface. Again, this
interface can easily be implemented by a simple Hashtbl.t or something better.
For the Unix world, we can take the opportunity to use Unix.O_APPEND.

With this new design, the Store implementation of Git can easily fit into
MirageOS without a huge requirement (unlike before, when a real file-system was
needed).
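
The append-only workflow can be sketched in memory as follows: a PACK file is created under a temporary uid, filled with successive append calls, closed, and finally moved to its definitive uid. This is only a sketch under assumptions: the module name `Mem_major`, the constructor `v`, the use of `string` for `uid` and the fact that `map` returns a `string` instead of a bigstring are simplifications made for the example.

```ocaml
(* Hypothetical in-memory major heap. Names are illustrative only. *)
module Mem_major = struct
  type uid = string
  type fd = { contents : Buffer.t; mutable closed : bool }
  type t = (uid, fd) Hashtbl.t

  let v () : t = Hashtbl.create 8

  let create (t : t) uid =
    let fd = { contents = Buffer.create 4096; closed = false } in
    Hashtbl.replace t uid fd;
    fd

  (* The only operation that writes: bytes always go at the end. *)
  let append (_ : t) fd str =
    if fd.closed then invalid_arg "append: closed file";
    Buffer.add_string fd.contents str

  let close (_ : t) fd = fd.closed <- true
  let length (_ : t) fd = Int64.of_int (Buffer.length fd.contents)

  (* Simplification: returns a [string] rather than a bigstring. *)
  let map (_ : t) fd ~pos len =
    let pos = Int64.to_int pos in
    let total = Buffer.length fd.contents in
    if pos < 0 || pos > total then invalid_arg "map: out of bounds";
    Buffer.sub fd.contents pos (max 0 (min len (total - pos)))

  let list (t : t) = Hashtbl.fold (fun uid _ acc -> uid :: acc) t []

  (* Rename [src] to [dst], e.g. from a temporary name to the final one. *)
  let move (t : t) src dst =
    match Hashtbl.find_opt t src with
    | None -> invalid_arg "move: unknown uid"
    | Some fd ->
        Hashtbl.remove t src;
        Hashtbl.replace t dst fd
end
```

Nothing here ever rewrites bytes already written, which is what makes the interface a candidate for Unix.O_APPEND on the Unix side.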

However, another space with some specific requirements exists: the
way to store references in Git. In detail, this area is mutable (unlike
Minor and Major) and should ensure atomicity when we want to test and
set a reference - similar to the CAS atomic operation.

module type REF = sig
  type t
  
  val atomic_write : t -> Reference.t -> string -> unit
  val atomic_read : t -> Reference.t -> string
  val atomic_rm : t -> Reference.t -> unit
  val list : t -> Reference.t list
end
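
A test-and-set on top of these operations could look like the sketch below. It is only an in-memory illustration under assumptions: references are plain strings instead of Reference.t, `atomic_read` returns an option, and the names `Mem_ref`, `v` and `test_and_set` are hypothetical. With a single-threaded Hashtbl.t the check and the update cannot be interleaved; a Unix implementation would need a lock or a rename to get the same guarantee.

```ocaml
(* Hypothetical in-memory reference store. Names are illustrative only. *)
module Mem_ref = struct
  type reference = string (* assumption: stands in for Reference.t *)
  type t = (reference, string) Hashtbl.t

  let v () : t = Hashtbl.create 8
  let atomic_read (t : t) r = Hashtbl.find_opt t r
  let atomic_write (t : t) r value = Hashtbl.replace t r value
  let atomic_rm (t : t) r = Hashtbl.remove t r
  let list (t : t) = Hashtbl.fold (fun r _ acc -> r :: acc) t []

  (* Update [r] to [set] only if its current value is [test].
     Returns [true] when the update happened. *)
  let test_and_set t r ~test ~set =
    if atomic_read t r = test then begin
      (match set with
       | Some value -> atomic_write t r value
       | None -> atomic_rm t r);
      true
    end
    else false
end
```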

With these separate spaces, I think it is easier to localise an error and to
trace what a simple Git.read/Git.write really does over them. Git_unix
provides these spaces according to the layout of a Git repository. And even if,
in reality, these spaces live in one large common space (the file-system), we
can isolate them from each other if we want.

Newcomers

Carton

Maybe 2 or 3 years ago, the idea of extracting the design of the PACK file, to
make it usable by something other than Git, led to the carton project. After
several iterations, the API was fixed one year ago and its integration
into ocaml-git was planned.

The main goals of this sub-project are:

  1. be able to use it in another context (a separation from Git's logic was needed)
  2. limit it to a few syscalls to be able to use it in MirageOS. A side-effect of
    this is the possibility to load a PACK file into a unikernel with
    caravan and have another implementation of a read-only KV-store
    for MirageOS.
  3. test it outside the context of Git

By design, carton needs only the map syscall to read an object and the
append syscall to generate a PACK file. It was also the opportunity to test the
type ('a, 's) io more deeply (to see the limitations of such a design, etc.) and
it seems clear that the result is good enough to:

  1. fit into ocaml-git
  2. be usable by another layout than Git
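
To illustrate why map and append are enough, the sketch below writes and reads the 12-byte header of a PACK file (the "PACK" signature, a big-endian 32-bit version and a big-endian 32-bit object count) using only an append-like output and a map-like input. The function names are hypothetical and this is not carton's API.

```ocaml
(* Illustrative only: these functions are not part of carton. *)

(* Build the 12-byte PACK header that a writer would [append] first. *)
let pack_header ~objects =
  let b = Bytes.create 12 in
  Bytes.blit_string "PACK" 0 b 0 4;
  Bytes.set_int32_be b 4 2l; (* version 2 *)
  Bytes.set_int32_be b 8 (Int32.of_int objects);
  Bytes.to_string b

(* A [map]-like function over an in-memory string, for the example. *)
let map_string s ~pos len =
  let pos = Int64.to_int pos in
  String.sub s pos (min len (String.length s - pos))

(* Read the header back through [map] and return the number of objects. *)
let read_pack_header ~map =
  let hdr = map ~pos:0L 12 in
  if String.length hdr < 12 || String.sub hdr 0 4 <> "PACK" then
    Error `Not_a_pack
  else
    let version = String.get_int32_be hdr 4 in
    if version <> 2l then Error (`Unsupported_version version)
    else Ok (Int32.to_int (String.get_int32_be hdr 8))
```

The same reader works unchanged whether `map` is backed by mmap on Unix or by a block of memory in a unikernel.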

(not so) Minor updates

carton drove the update to decompress.1.0.0 and duff.0.3, where:

  • the new version of decompress fixes many bugs in inflation/deflation
    and the process is faster than before. See these articles about decompress
  • the new version of duff fixes 32-bit support, so that this library (and,
    transitively, ocaml-git) can be used on some exotic architectures

Tests over the PACK file

Of course, due to the separation between Git's logic and the PACK file, we
are able to focus our tests on the format of the PACK file independently of Git
assumptions (format of Git objects, hash algorithm, layout of a Git repository).

Some fuzzers found in the official Git project were added to keep the same
assumptions, and the update takes the opportunity to fix some bugs in the /PACK
engine/. All tests are available in the test/carton/ directory.

The intrinsic possibility about ocaml-git

Given what carton requires to decode/encode a PACK file, the
new design on top of carton unlocks the ability to reduce the definition of the
Major heap to the signature given above.

Loose object

Because the question of the PACK file is now resolved by carton, we
can easily /formalise/ the way to extract a /loose/ object. Internally,
ocaml-git comes with a new sub-library git.loose which has 3 derivations:

  • git.loose-lwt
  • git.loose-git
  • git-unix.loose-unix

This sub-library (like carton) unlocks the ability to shape this layout into the
Minor interface given above. It also adds the ability, again, to test this
part of ocaml-git without Git assumptions - where the layout is only a
radix-tree of deflated objects.

Encore

A new release of encore is available where the API of this library
is better than before. The question encore answers is: how to produce an encoder
and a decoder from a common description.

The new API takes advantage of GADTs to propose a DSL to describe a format.
From it, we are able to derive an angstrom parser or a lavoisier
encoder.

With this update, I did not get any regressions in the tests and the encoder was
simplified to focus on the initial goal of encore: ensure the /isomorphism/
between the encoder and the decoder.
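
The idea of one description giving both directions can be sketched with a tiny GADT. This is deliberately not encore's API: the type `fmt`, its constructors and the line-based encoding are all assumptions made for the example, and the only point is that `encode` and `decode` are derived from the same value.

```ocaml
(* Illustrative mini-DSL; this is NOT the encore API. *)
type _ fmt =
  | Int : int fmt (* decimal digits followed by '\n' *)
  | Line : string fmt (* text without '\n', followed by '\n' *)
  | Pair : 'a fmt * 'b fmt -> ('a * 'b) fmt

(* Derive an encoder from the description. *)
let rec encode : type a. a fmt -> Buffer.t -> a -> unit =
 fun fmt buf v ->
  match fmt with
  | Int ->
      Buffer.add_string buf (string_of_int v);
      Buffer.add_char buf '\n'
  | Line ->
      Buffer.add_string buf v;
      Buffer.add_char buf '\n'
  | Pair (a, b) ->
      let x, y = v in
      encode a buf x;
      encode b buf y

(* Derive a decoder from the same description. It returns the value and
   the position just after it. *)
let rec decode : type a. a fmt -> string -> int -> a * int =
 fun fmt s pos ->
  match fmt with
  | Int ->
      let stop = String.index_from s pos '\n' in
      (int_of_string (String.sub s pos (stop - pos)), stop + 1)
  | Line ->
      let stop = String.index_from s pos '\n' in
      (String.sub s pos (stop - pos), stop + 1)
  | Pair (a, b) ->
      let x, pos = decode a s pos in
      let y, pos = decode b s pos in
      ((x, y), pos)

let to_string fmt v =
  let buf = Buffer.create 16 in
  encode fmt buf v;
  Buffer.contents buf

let of_string fmt s = fst (decode fmt s 0)
```

Because both functions follow the structure of the same `fmt` value, a round trip through `to_string` and `of_string` gives back the original value for any well-formed input.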

This update takes the opportunity to fix a bug in ocaml-git
which appears when we need to extract a large object. A test was added to
ensure that we properly fixed the problem.

Finally, the update of encore unlocks the ability to compile ocaml-git with
js_of_ocaml and fixes the issue about that.

Conduit

See conduit about that.

Not So Smart (nss)

Since version 2 of ocaml-git, I have discovered several bugs in the way we
push or pull a Git repository. Even if ocaml-git works in most cases,
it appears that the negotiation engine does something wrong.

I decided to rewrite it and fix the problems in the negotiation engine.

Then, following the work from @hannesm, I decided to properly integrate a way to
use SSH (with awa-ssh). Of course, the new version of conduit
helps me to do what I want here.

But the biggest change is the removal of the duplication between the TCP, HTTP
and SSH implementations of the Smart protocol. Indeed, even if Git does the
same thing when it wants to push/fetch, some details differ and the current
version of ocaml-git already integrates some (incorrect) divergences between
the TCP and the HTTP implementations.

The goal was to restart from zero and focus on what the negotiation engine
really does, to be able to use it over any layered protocol. Thanks to
colombe for giving me the key to the right abstraction.

Transparent integration with ocaml-git

The Smart protocol wants to do only 2 things:

  1. do a negotiation/synchronisation with a peer
  2. receive or send a PACK file

For these 2 tasks, the idea of the Git format, the layout of the store or, more
generally, the idea of a Git repository is outside the scope of the protocol.
nss only wants to provide a way to get or send a PACK file from a context -
this way, the requirements for the negotiation are limited to a few operations:

type ('uid, 'ref, 'v, 'g, 's) access = {
  get : 'uid -> ('uid, 'v, 'g) store -> ('v option, 's) io;
  parents : 'uid -> ('uid, 'v, 'g) store -> ('v list, 's) io;
  deref : ('uid, 'v, 'g) store -> 'ref -> ('uid option, 's) io;
  locals : ('uid, 'v, 'g) store -> ('ref list, 's) io;
}

Again, the notion of a Git object is outside the implementation of the PACK file
(carton), so nss does not need to know the format of a Git commit but only
the way to get parents of a commit.
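
As an illustration of how little is needed, the sketch below computes the commits to put in a PACK file for a push using nothing but a `parents` function. It is a strong simplification and not the negotiation engine of nss: it runs in direct style (no `io` type), uids are strings, and the names `Uids`, `to_send`, `have` and `want` are hypothetical.

```ocaml
(* Illustrative only: a direct-style walk, not the nss negotiation engine. *)
module Uids = Set.Make (String)

(* [parents] is the only thing the walk needs to know about a commit.
   [have] is the set of commits the peer is known to have, [want] the
   local tips we would like to send. The walk stops at any commit
   listed in [have]. *)
let to_send ~parents ~have ~want =
  let rec go seen = function
    | [] -> seen
    | uid :: rest ->
        if Uids.mem uid seen || Uids.mem uid have then go seen rest
        else go (Uids.add uid seen) (parents uid @ rest)
  in
  Uids.elements (go Uids.empty want)
```

Nothing in this function inspects the content of a commit, which is the property the `access` record above relies on.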

Then, from a set of commits, we should be able to create a PACK file (push).
The fetch operation is a bit more complex, since we must analyse the
PACK file to produce an index of it. But, again, all of these operations are
available outside Git's notions - and, of course, outside the Git scope.

Regression

Of course, the first goal of nss was to fix negotiation bugs and remove the
duplication between the TCP, SSH and HTTP protocols. All previous regression
tests were added and pass, and all buggy situations such as this trouble were
added as tests over all protocols (mostly to ensure a good behaviour of our
negotiation engine).

However, the negotiation engine of Git and ocaml-git is not well
defined/formalised. We can imagine another perspective, such as a version 3 of
the Smart protocol, to be able to fetch/push - this is not the goal of this
PR, but it is definitely a cool and close goal for ocaml-git.

Performances

I did not do detailed benchmarks, but the update to decompress.1.0.0 alone
helps performance, of course. Then, the scheduling between the protocol process,
the reception of the PACK and its analysis (this is what happens when you
git clone) seems better. A macro benchmark tells us that this new
implementation is faster than before.

However, I did not have the time to benchmark all of that and I mostly rely on
the work done on decompress to say that it's faster than before.

Functor or not functor?

carton, nss and loose were made with the same design, without the logic of
the I/O scheduling. With this new view, functors are used sparingly and
mostly at the end of the development process to provide an easy-to-use API
over LWT or ASYNC.

More concretely, all types defined in these sub-libraries are outside the scope
of functors and their definitions don't depend on the I/O scheduling.

type 'a t

module Make (S : S) = struct
  type nonrec t = S.t t
end

The Git core library follows this new design, where the existence of the commit,
for example, does not depend on the application of a functor. The functor
only specialises the definition with the given new type.

At another layer, such as the Value module, we have fewer constraints
and it is easier for the compiler to infer a type equality even if we forget to
add a constraint.

In that case, all types provided by git and the functions to manipulate them
(without any knowledge of the hash algorithm used) are defined outside the scope
of functors.
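
A slightly larger version of this pattern, with a type parameterised by the hash and a functor which only specialises it, could look like the sketch below. The record fields, the `HASH` signature and the function names are assumptions made for the example and do not reproduce the actual definitions of the git library.

```ocaml
(* Illustrative only: not the actual definitions of the git library. *)

(* The commit exists independently of any functor; it is only
   parameterised by the type of hashes. *)
type 'hash commit = { tree : 'hash; parents : 'hash list; message : string }

(* Functions which do not need to know the hash algorithm live outside
   the functor too. *)
let parents c = c.parents
let is_root c = c.parents = []

module type HASH = sig
  type t

  val equal : t -> t -> bool
end

(* The functor only specialises the type and adds what needs [Hash]. *)
module Make (Hash : HASH) = struct
  type t = Hash.t commit

  let has_parent (c : t) h = List.exists (Hash.equal h) c.parents
end

(* Example instantiation with strings standing in for hashes. *)
module String_commit = Make (String)
```

Since `String_commit.t` is just `string commit`, a value built outside the functor can be passed to `String_commit.has_parent` without any extra type constraint.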

Conclusion

I think this PR adds a lot of possibilities for MirageOS and it is a real
step forward for performance and compatibility with Git and its behaviour.

It paves the way for a better integration with MirageOS, of course, and opens
some possibilities such as:

  • a real GC
  • how to shallow commits
  • a MirageOS server

The split clears the way to add other logic which is closer to Git
than the format of the PACK file, the loose file or the way to synchronise a Git
repository with a peer.

Finally, the Git core library is only about Git:

  1. format of Git objects
  2. glue between PACK and loose layouts
  3. glue to a protocol

These pieces co-exist but can be used separately.

@dinosaure
Member Author

dinosaure commented Aug 13, 2020

It seems that the current state of this PR works perfectly with irmin with minor changes. However, we should add more tests. There is another TODO list for this PR:

  • test the HTTP layer correctly
    • launch an HTTP git server (due to bubblewrap)
  • delete ocurl and use cohttp instead
  • add some tests about references
    • test about cycles
    • properly integrate packed-refs into git-unix (and remove it from store)
    • atomic_wr is atomic
  • properly re-implement git-unix
    • handles EINTR
    • a good impl. of a recursive mkdir
    • a good impl. of a recursive rmdir
    • use rename on the minor and major heap
    • let the user choose the tmp directory
  • accept only PACKv2
  • split commits

And I think it's done but this PR seems to be a good bedrock for the release of git.2.2.0.

- ENCODER / DECODER
  These interfaces are used to describe a non-blocking interface. From a new
  perspective, these interfaces are useless where `angstrom` and `encore` are
  sufficient to obtain a non-blocking encoder/decoder (see [Angstrom.state] and
  [Encore.Lavoisier.state])
- META
  A special interface (with an argument) to expose the format of a Git
  object from a /meta/ syntax (the module argument). However, since
  `encore.0.6`, this way is not used anymore (a GADT is used instead)
- DESC
  A description of the produced format from a /meta/ decoder language such as
  `angstrom` or a /meta/ encoder language such as `encore.lavoisier`. Due to
  `encore.0.6`, such interface is not used anymore.

  Instead, each Git object should provide a [val format : t Encore.t] which
  can derive to an `angstrom`'s parser or a `lavoisier`'s encoder.
- INFLATE / DEFLATE
  Due to the fact that Zlib is used by a middle layer of Git (eg. PACK file),
  it's unnecessary to /functorize/ `ocaml-git` over these interfaces.
- FILE / DIR / MAPPER / FS
  The first entry-point of this PR: removing the need for an implementation
  of a file-system. As a result, such POSIX-close interfaces must be removed!

- null: [= digest_string ""], this value is used to initialise some values
  such as a [Hash.t array] with an impossible value (Git should __never__
  create a Git object with the hash [digest_string ""])
- length: a better name for [digest_size]
- feed: be able to feed a [Bigstringaf.t] value
@dinosaure
Member Author

204 commits later, I think it's done.

@dinosaure force-pushed the ng branch 2 times, most recently from bb93091 to 2e518b0 (August 27, 2020 10:14)

The interface will include only:
- S.DIGEST
- S.BASE

The interface defines only one new type [hash] - and [Make] constrains it.
A type [t] exists outside the /functor/ and it is reused by the /functor/.

We define a type [t] which represents the Git Blob object. Independently of the
hash implementation used, the object exists. Some functions are exposed to
manipulate it (with documentation).

The interface will include only:
- S.DIGEST
- S.BASE
- Some functions to manipulate a tree
- a [format] value which describes the format of a Git Tree object

The interface defines only one new type [hash] - and [Make] constrains it.
A type [t] exists outside the /functor/ and it is reused by the /functor/.

We define a type [t] which represents the Git Tree object. Independently of the
hash implementation used, the object exists (it is parameterized by ['hash]).
Some functions are exposed to manipulate it.

The description of the Tree format is represented by an [encore] value
[format] (and exposed by the interface). This patch is a translation from
the /meta-syntax/ to [encore]'s combinators.

This module is a replacement of the old [Helper] module. It provides a way to
calculate the hash of a Git object given by its OCaml representation.

The way to calculate the hash is:
1) serialise the Git object
2) start with a /header/ ([kind length\000])
3) feed the context with the serialised Git object

A Git object can be big. Instead of serialising it entirely, we /stream/ it to
limit the memory footprint. NOTE: the memory footprint is __not__ really
limited (see [test/tree/test] to understand why) but, at least, the serialised
value is cut into many /small/ [string]s.

NOTE: we don't need temporary buffers anymore to calculate the hash of a Git
object.
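
The streaming computation can be sketched as follows: the header ([kind length\000]) is fed first, then each small chunk of the serialised object is fed in order. The sketch is parameterised by a `feed` function so that it does not depend on a particular hash library; the names `header` and `digest_stream` are hypothetical, and the real code feeds a digestif context rather than the generic `ctx` used here.

```ocaml
(* Illustrative only: [feed] stands in for the hash library's update
   function and [ctx] for its context. *)

(* The header which prefixes every Git object before hashing. *)
let header kind length = kind ^ " " ^ string_of_int length ^ "\000"

(* Feed the header, then every chunk of the serialised object, without
   ever building the whole serialisation in memory. *)
let digest_stream ~feed ctx ~kind ~length chunks =
  let ctx = feed ctx (header kind length) in
  Seq.fold_left feed ctx chunks
```

Note that the length must be known before the first chunk is fed, since it is part of the header.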
@dinosaure
Member Author

REVDEP is the only point where ocaml-git fails and that's expected - the work is already done on this side (to port irmin to this new version). As I said at the beginning, all tests are re-integrated and it's well tested with irmin. To help other projects, I will merge this PR and open an issue about ocurl.

@dinosaure dinosaure merged commit 436e84d into mirage:master Sep 2, 2020
dinosaure added a commit to dinosaure/opam-repository that referenced this pull request Jan 9, 2021
… git-unix (3.0.0)

CHANGES:

- Rewrite of `ocaml-git` (@dinosaure, mirage/ocaml-git#395)
- Delete useless constraints on digestif's signature (@dinosaure, mirage/ocaml-git#399)
- Add support of CoHTTP with UNIX and MirageOS (@ulugbekna, mirage/ocaml-git#400)
- Add progress reporting on fetch command (@ulugbekna, mirage/ocaml-git#405)
- Lint dependencies on packages (`git-cohttp-unix` and `git-cohttp-mirage`)
  and update to the last version of CoHTTP (@hannesm, mirage/ocaml-git#407)
- Fix internal `Cstruct_append` implementation (@dinosaure, mirage/ocaml-git#401)
- Implement shallow commit (@dinosaure, mirage/ocaml-git#402)
- Update to `conduit.3.0.0` (@dinosaure, mirage/ocaml-git#408) (deleted by the integration of `mimic`)
- Delete use of `ocurl` (@dinosaure, mirage/ocaml-git#410)
- Delete the useless **old** `git-mirage` package (@hannesm, mirage/ocaml-git#411)
- Fix about unresolved endpoint with `conduit.3.0.0` (@dinosaure, mirage/ocaml-git#412)
- Refactors fetch command (@ulugbekna, mirage/ocaml-git#404)
- Fix ephemerons about temporary devices (@dinosaure, mirage/ocaml-git#413)
- Implementation of `ogit-fetch` as an example (@ulugbekna, mirage/ocaml-git#406)
- Rename `nss` to `git-nss` (@dinosaure, mirage/ocaml-git#415)
- Refactors `git-nss` (@ulugbekna, mirage/ocaml-git#416)
- Update README.md (@ulugbekna, mirage/ocaml-git#417)
- Replace deprecated `Fmt` functions (@ulugbekna, mirage/ocaml-git#421)
- Delete physical equality (@ulugbekna, mirage/ocaml-git#422)
- Rename `prelude` argument by `uses_git_transport` (@ulugbekna, mirage/ocaml-git#423)
- Refactors Smart decoder (@ulugbekna, mirage/ocaml-git#424)
- Constraint to use `fmt.0.8.7` (@dinosaure, mirage/ocaml-git#425)
- Small refactors in `git-nss` (@dinosaure, mirage/ocaml-git#427)
- Delete `conduit.3.0.0` and replace it by `mimic` (@dinosaure, mirage/ocaml-git#428)
- Delete the useless `verify` function on `fetch` and `push` (@dinosaure, mirage/ocaml-git#429)
- Delete `pin-depends` on `awa` (@dinosaure, mirage/ocaml-git#431)
dinosaure added a commit to dinosaure/opam-repository that referenced this pull request Jan 9, 2021
…t-unix and git-mirage (3.0.0)

Successfully merging this pull request may close these issues.

js_of_ocaml compatibility