Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review and Refactor of git/utils.ts for Serving Vault Git Repositories #298

Closed
joshuakarp opened this issue Nov 29, 2021 · 28 comments · Fixed by #709
Closed

Review and Refactor of git/utils.ts for Serving Vault Git Repositories #298

joshuakarp opened this issue Nov 29, 2021 · 28 comments · Fixed by #709
Assignees
Labels
development Standard development discussion Requires discussion enhancement New feature or request r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management research Requires research

Comments

@joshuakarp
Copy link
Contributor

joshuakarp commented Nov 29, 2021

Specification

The underlying workings of git/utils.ts is largely a mystery to our team. We want to perform a deep dive into the workings of this code, and remove/refactor any of our own implementation, potentially also replacing their usage in favour of the isomorphic-git/internal-apis. There seems to be some overlap with what our git utils perform, and what iso-git provides.

To clarify... we use git in 2 ways: as a client and as a server.

When using isogit, it only provides the primitives to use git as a client. It does not provide the primitives to use git as a server.

Why do we use git as a client and server? Well our entire secrets management system is built as a stack, from bottom to top:

  1. Encrypted Key Value Database (soon to have transparent block based encryption) - js-db
  2. Virtual Filesystem - js-encryptedfs
  3. Virtual Git plus Filesystem - isogit (this is what constitutes a "vault")
  4. Virtual Unix Operations - Incorporate Unix Commands to Secrets Subcommand - pk secrets * Polykey-CLI#32 (this is what creates "vault operations")
  5. Frontends - Polykey-CLI, Polykey-Desktop, Polykey-Mobile (and then presented over an RPC interface to different front-ends)

As you can see, the "Virtual Git plus Filesystem" layer makes use of isogit. A virtual git repository is created on the virtual filesystem. This is how we are able to then stack unix operations with roughly similar semantics later, in order to maintain API consistency with what developers are used to.

Now the issue is that isogit by itself can act like a git client. That means it can clone/pull repositories, it can create repositories, it can create commits, it can manage the git objects... etc.

However isogit cannot "serve" the repositories. And this is essential to be able to share our vaults.

To serve our repositories we need to know how to act like a "git server" when we use isogit or something were to use git to try and clone/pull our vaults.

So a long time ago, we developed the git server primitives and put them all into git/utils.ts. However this code has not been heavily tested, nor has it gone through any refactoring since then. And since then we have forgotten how it all works, and it's not in the programming style that we have developed over the last few years working on Polykey.

So the git/utils.ts needs to go under review, and most likely some heavy refactoring with the addition of tests, to ensure that acting like a git server makes sense here.

Now there is definitely cross over between isogit and the git/utils.ts. However we aren't aware of what these are. And there probably should be a bunch of "shared types" and similar constructs. We just need to get into it, and get it done.

How this was addressed

After reviewing the code as it was, we ended up just cleaning it up by refactoring the http response to be streamed using an async generator. Most of the utility functions were removed in favour of using the isomorphic-git plumbing functions for interacting with the git data structure.

Ultimately this didn't speed things up very much, however, it did significantly reduce the complexity of the git domain. So it is much easier to follow along with what is happening. There is also a slight improvement with pulling a repo. We use the haves provided by the client to only grab the minimum objects required for the pull.

Additional context

Tasks

  1. Review the whole git domain
  2. Refactor the git domain with the intention of simplifying the code and improving performance when cloning/pulling a vault.
@joshuakarp joshuakarp added enhancement New feature or request development Standard development research Requires research discussion Requires discussion labels Nov 29, 2021
@joshuakarp joshuakarp added the epic Big issue with multiple subissues label Nov 29, 2021
@scottmmorris
Copy link
Contributor

scottmmorris commented Nov 29, 2021

Something else I forgot to mention was this: https://github.com/MatrixAI/js-polykey/pull/266/files#r743402477

As a general comment from what I have seen, when cloning and pulling isomorphic-git searched for the packed-refs file inside the .git directory. It should be auto generated when you initialise a git repository. I haven't been able to find any issues related to it on isomorphic git so it could be something to do with our implementation. I haven't looked too much into it

In another PR I might have a look at the need to write this file here. I remember that cloning and pulling with iso-git fails without (and for some reason it isnt automatically written). This may be because we are looking for it in the git utility functions rather than it being a requirement in the iso-git library. See the packedrefs function in git/utils.ts

@CMCDragonkai
Copy link
Member

Comment about how isomorphic-git library is only intended for client-side usage. We need to address this comment #266 (comment) in order to refactor and good documentation and understanding of how our git server actually works.

@CMCDragonkai
Copy link
Member

@tegefaulkes based on our discussions and review of the git utilities in #266, can you link all those discussions and reviews in this issue?

@tegefaulkes
Copy link
Contributor

I'm going to copy relevant information here rather than link to every comment. But the original thread is here. #266 (comment)


Summary

There are 3 main functions here.

  • request
  • handleInfoRequest
  • handlePackRequest

request returns a function that is used as the custom HTTP handler that we give to the git clone and pull functions. It's job is to convert the HTTP requests the git functions make into GRPC calls and then return them in the proper HTTP response format. It calls the GRPC functions agentService.ts:vaultsGitInfoGet and agentService.ts:vaultsGitPackGet.
agentService.ts:vaultsGitInfoGet calls vaultManager.handleInfoRequest and agentService.ts:vaultsGitPackGet callsvaultManager.handlePackRequest. Overall this is just streaming the data and the request function is returning that as the required HTTP response.

vaultManager.handleInfoRequest job is pretty simple. It's getting a list of all refs. resolving them to their sha hash and then formatting it into a stream. I think the idea here is just to get a full list of refs+hashes.

vaultManager.handlePackRequest Is a lot more complex. There is a lot going on here but it boils down to..

  1. Getting a list of refs that need to be retrieved.
  2. getting a list of commits related to the refs.
  3. getting all of the git objects related to the commits.
  4. packing and formatting all of the git objects.
  5. sending all of the packed objects back through a stream.

This is all handled by the top level functions log, listObjects and pack. Everything under that is for working with the git data structure.



    ┌──────────────────┐      ┌────────────────┐
    │                  │      │                │
    │  Vault Manager   ├──────► IsoGit         │
    │ ---------------- │      │                │  HTTP ┌──────────┐
    │                  │      │  Clone()───────┼───────►  GRPC    │
    │                  │      │  Pull()        │  GET &│  Client  │
    │   Vault map ◄────┼──────┼──┐   ┌───────┐ │  POST │          │
    │   Add vault      │      └──┼───┼───────┼─┘       │          │
    │                  │         │   │       │         └┬─┬───────┘
    └──────────────────┘         │   │       │          │ │ GET
                                 │   │       ▼          │ │Server stream
                               ┌─┴───▼────┐  ┌────────┐ │ └──────►
                               │ vault    │  │        │ │
                               │ internal─┼──┤► EFS   │ └────────►
                               │          │  │        │  Duplex stream
                               └──────────┘  └────────┘   POST



Useful technical documentation

https://www.git-scm.com/docs/http-protocol
https://github.com/git/git/tree/master/Documentation/technical
Documentation of note:

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jan 21, 2022

@tegefaulkes It's important also to write up how we realised that isogit is a client-only/focused library, which means it does not provide functionality useful for the server-side of git logic.

The git/utils.ts is mostly (if not all) used for the server-side of git. One problem is that the git/utils.ts seems quite messy with a lot of magic constants that isn't explained, and it wasn't clear where or what standards the code in git/utils.ts was derived from. It's also written in a very C-like manner with a web of procedures calling each other.

Therefore what we need to do is clean up git/utils.ts by refactoring them into TypeScript constructs, with proper types, unions, records... etc, and with all constants abstracted out and documented, and instead of a web of procedural functions, we can replace it with classes and functional data-flow.

Try to provide a summary that explains the WHY as well (not just WHAT or HOW).

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jan 21, 2022

From: #266 (comment)

Once all the residual todos excluding the git utilities refactoring is done as we have discussed in our meeting today @tegefaulkes then this PR should be ready to merge.

The git utilities refactoring can be done later, but we have a better understanding of what the git utilities is for, and how it was written. The subsequent refactoring of the git utilities has to be based on reference documentation:

It should make use of the diagram:

Any usage of magic strings, magic numbers should be removed/abstracted from the git utilities and if they need to exit, be made into documented constants. All functions that are part of some overall protocol of either a data format or communication format should also be bundled together. This means the utils.ts can be turned into a utils directory and all the functions that relate a single format can be bundled together. For example https://user-images.githubusercontent.com/22425593/147191011-faf83025-2865-4118-a91d-7a44d40e556d.png this shows that the git utilities is very imperative API, just random functions that are called together to construct some sort of data structure. The functions could be "object-orientised" to fit our overall Polykey style. It looks like the existing git utilities is based on a C-like programming style, very imperative, very procedural. We should have objects/classes instead that construct the domains we need for handling the git request.

IMG_20211224_124659.jpg

@CMCDragonkai
Copy link
Member

Details of this issue should be collated up top, and the issue description should have a fleshed out spec so we know what to do.

@CMCDragonkai
Copy link
Member

Relevant issue here: isomorphic-git/isomorphic-git#1071 regarding some of the work on isomorphic git server side and initial discussion about his here: #43

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management label Jul 24, 2022
@CMCDragonkai
Copy link
Member

CMCDragonkai commented Aug 8, 2022

A blast from the past (shows the previous dependency DAG, that we now have codesee generating automatically):

image

See the GitBackend in particular.

@CMCDragonkai CMCDragonkai changed the title Conduct review of git/utils.ts Review and Refactor of git/utils.ts for the Git Server Jul 13, 2023
@CMCDragonkai CMCDragonkai changed the title Review and Refactor of git/utils.ts for the Git Server Review and Refactor of git/utils.ts for Serving Vault Git Repositories Jul 13, 2023
@CMCDragonkai
Copy link
Member

@tegefaulkes with the new RPC system. What changed with respect to this? #298 (comment)

@tegefaulkes
Copy link
Contributor

Mostly just the handlers for the RPC were converted. I think this is still WIP as it's not fully tested yet.

@CMCDragonkai
Copy link
Member

Yea the server side and client side need a deep review.

    ┌──────────────────┐      ┌────────────────┐
    │                  │      │                │
    │  Vault Manager   ├──────► IsoGit         │
    │ ---------------- │      │                │  HTTP ┌──────────┐
    │                  │      │  Clone()───────┼───────►  GRPC    │
    │                  │      │  Pull()        │  GET &│  Client  │
    │   Vault map ◄────┼──────┼──┐   ┌───────┐ │  POST │          │
    │   Add vault      │      └──┼───┼───────┼─┘       │          │
    │                  │         │   │       │         └┬─┬───────┘
    └──────────────────┘         │   │       │          │ │ GET
                                 │   │       ▼          │ │Server stream
                               ┌─┴───▼────┐  ┌────────┐ │ └──────►
                               │ vault    │  │        │ │
                               │ internal─┼──┤► EFS   │ └────────►
                               │          │  │        │  Duplex stream
                               └──────────┘  └────────┘   POST

@tegefaulkes
Copy link
Contributor

So right now the git domain is a bit of a mess. It's under-documented, Structured poorly and generally is just the oldest code in the repo. So we need to go through it and refactor it. The refactor has two main goals.

  1. Update the structure and inline documentation to be more in line with the rest of the code base. This means structuring it better and fully explaining what is going on in the code. Right now it's all very magical and how-you-doing so it's very hard to follow along.
  2. We need to optimise the code so it functions faster. As of now I''m unsure how much the git domain factors into how slow vaults are cloned from a remote node. But we need to look into optimising this section as part of the general optimisation work. There are two potential avenues for optimising the git domain.
    a. We're not really streaming here. There are a few parts where it just accumulates a whole bunch of data before moving to the next step. So we batch -> stream -> batch. We should look into doing a pure streaming solution here. Even the git.clone method takes the response body as an async iterator. So it's fully possible to full end to end streaming here.
    b. The HTTP protocol for git is meant to be very efficient. It will only request blobs that it's missing. So we can skip a bunch of work when cloning if we fully support this aspect of the protocol.

@tegefaulkes
Copy link
Contributor

I'm not entirely sure how to restructure this. It already mostly conforms to how we'd structure things. Except maybe renaming utils.ts to git.ts.

@CMCDragonkai
Copy link
Member

You should setup:

  1. Any classes representing high level cohesive functionality with internal state
  2. A types.ts for all data structure types
  3. A utils.ts that should contain all the relevant atomics and primitives that don't have internal state
  4. A index.ts to export everything

I believe within the git system there's stuff that corresponds to internal state those should become classes to create.

@CMCDragonkai
Copy link
Member

All those magic strings and magic data should all become constants put into utils.ts, with a Typescript type for each of these.

Look into functional decomposition of streaming, batching, and turn into using WebStreams in the same way I've done that for QUIC. Like the creation of CloneStream or whatever.

@CMCDragonkai
Copy link
Member

Consider this a discovery/prototyping phase, then diagram out a few different alternatives, but when reading the code, it should be obvious what the structure of a clone operation is, what a pull operation is.

Also what are the main data structures - whether external or internal that should exist? Make them explicit. Are there maps we need to consult, are there arrays to keep indexed, are there just stream constructs?

@CMCDragonkai
Copy link
Member

There's alot we can do.

@CMCDragonkai
Copy link
Member

The whole clone and pull which uses HTTP streams - because we are now using JS-rpc, and we have our own muxing and demuxing protocol, I'm thinking that you would want to preserve the HTTP protocol and properly write a wrapping and unwrapping functions. That way in the future we could maintain compatibility with real git protocols, while wrapping them into our js-rpc system. It should be made as simple and as fast as possible - benchmarks should be created for these in our benches.

From a haskell perspective, always try to think in terms of symmetry and reflexivity.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented May 1, 2024

Almost every single object - whether packs, refs or all other primitive concepts that we're using should all be given their own type in types.ts and a corresponding constructor and deconstructor.

An OOP style would create things like class GitTag, class GitRef, class GitObject... etc.

@CMCDragonkai
Copy link
Member

Any usage of magic strings, magic numbers should be removed/abstracted from the git utilities and if they need to exit, be made into documented constants. All functions that are part of some overall protocol of either a data format or communication format should also be bundled together. This means the utils.ts can be turned into a utils directory and all the functions that relate a single format can be bundled together. For example https://user-images.githubusercontent.com/22425593/147191011-faf83025-2865-4118-a91d-7a44d40e556d.png this shows that the git utilities is very imperative API, just random functions that are called together to construct some sort of data structure. The functions could be "object-orientised" to fit our overall Polykey style. It looks like the existing git utilities is based on a C-like programming style, very imperative, very procedural. We should have objects/classes instead that construct the domains we need for handling the git request.

@tegefaulkes
Copy link
Contributor

Moving forward I'm going to do the following.

  1. Create a http.ts file that has factories for creating webstreams that provide the HTTP protocol responses. The high level aspect of streaming these responses will be handled here.
  2. The utils.ts file will contain smaller utilities for generating the smaller aspects of the HTTP response. This includes any magic values as constants and utilities such as generating the line length header and multiplexing.
  3. All aspects of the domain will be reviewed for proper typing. Typing will be updated as required.
  4. Optimisations will be applied as needed. Already I can see that every other line of the pack section is just 0005� repeated. It's the equivalent of sending and empty string between each actual packet of data.

Copy link
Contributor

This is a very old spec. I won't modify it too much besides changing the task list and adding a footnote.

@CMCDragonkai
Copy link
Member

Why reopen this? IS there another PRs targeting this?

@tegefaulkes
Copy link
Contributor

I didn't? Might be some linear weirdness but I haven't done anything besides merge this and then rebase some other PRs.

@tegefaulkes
Copy link
Contributor

Linear had the PR as non-closing since I only referenced it with the Ref keyword. It seems then it merged, this was closed but the github automation moved the linear issue back to inProgress? Likely reopening this.

I'm closing it again.

@tegefaulkes
Copy link
Contributor

I'm not sure the linear sync is working properly. The linear issue didn't auto close after I closed this. This is a little frustrating.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented May 15, 2024

I didn't reference any PRs in my issues. Not sure if that is thing that is causing problems.

@CMCDragonkai CMCDragonkai removed the epic Big issue with multiple subissues label Aug 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development discussion Requires discussion enhancement New feature or request r&d:polykey:core activity 1 Secret Vault Sharing and Secret History Management research Requires research
4 participants