Generate huge repo for testing purposes #156

Open
aguschin opened this issue May 11, 2022 · 8 comments
Assignees
Labels
housekeeping CI, tests, maintenance and dev productivity

Comments

@aguschin
Contributor

We need to generate a big repo with 10k+ git tags and 10k+ commits:

  • to check how GTO will handle those
  • to use in Studio BE for benchmarking and driving the development

If GTO / Studio processes those git tags too slowly, we need to see what we can optimize:

  • maybe we can limit the tags that check-ref analyzes to only those that reference a single commit
  • maybe we can optimize how GTO constructs its internal registry state, or make it lazier than it is now (e.g. read things in parts/chunks?)
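
A minimal sketch of how such a test repo could be generated with plain git calls. The `<artifact>@v<version>` tag naming and the `model-N` artifact names are assumptions made for illustration and should be checked against the GTO version in use; `tag_names`, `git` and `generate_repo` are hypothetical helpers, not part of GTO or dvcgen.

```python
import subprocess
from pathlib import Path


def tag_names(n_artifacts, n_versions):
    """Yield tag names in an ASSUMED `<artifact>@v<semver>` scheme."""
    for a in range(n_artifacts):
        for v in range(1, n_versions + 1):
            yield f"model-{a}@v0.0.{v}"


def git(repo, *args):
    """Run a git command inside `repo`, raising if it fails."""
    subprocess.run(["git", "-C", str(repo), *args], check=True, capture_output=True)


def generate_repo(repo, n_artifacts, n_versions):
    """Create one empty commit per tag and put a lightweight tag on it."""
    repo = Path(repo)
    repo.mkdir(parents=True, exist_ok=True)
    git(repo, "init", "-q")
    # Set an identity locally so this also works on a clean CI machine.
    git(repo, "config", "user.email", "bench@example.com")
    git(repo, "config", "user.name", "bench")
    for name in tag_names(n_artifacts, n_versions):
        git(repo, "commit", "-q", "--allow-empty", "-m", f"commit for {name}")
        git(repo, "tag", name)
```

Calling `generate_repo("/tmp/gto-bench", 100, 100)` would give 10k commits and 10k tags. It spawns two git processes per tag, so it is slow at that scale; something like `git fast-import` would probably be a better fit for a one-off bulk import.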

cc @amritghimire

@aguschin aguschin added the housekeeping CI, tests, maintenance and dev productivity label May 11, 2022
@aguschin aguschin self-assigned this May 11, 2022
@aguschin aguschin moved this to Todo in MLEM + GTO May 11, 2022
@shcheklein
Member

Can we use https://github.com/iterative/dvcgen ?

@aguschin
Contributor Author

Thanks, I'll take a look. @Suor I see you've created the repo, do you think extending it to generate git tags for GTO makes sense?

@aguschin
Contributor Author

Ok, I've experimented a bit with dvcgen plus a custom script that generates git tags that register/promote GTO artifacts. No annotations were created, only registrations/promotions.
What I've learnt so far:

  1. Right now, GTO is pretty slow.
    • If you have ~100 git tags in your repo, most commands take 0.5-1.5 seconds 🐢
    • If you have ~1500 git tags in your repo, most commands take 6-12 seconds 🐌
  2. GTO is slowed down both by gto-created tags and by other git tags.
  3. Pretty much all commands are slow, since they construct the internal "state of the registry" to validate the action the user wants to execute (check-ref too). E.g. if you want to register a new version and bump the patch part, GTO needs to read all git tags in the repo beforehand 🤔

Thoughts:

  1. This will very soon become tedious for Studio users.
  2. Studio BE was right to use an internal representation and store it in a DB.
  3. I'll look into this and try to make it faster by the end of the week.

@amritghimire

The GTO functionality that we would like to see optimized, at minimum, is:

  • RepoIndexManager.from_repo(path).get_commit_index(commit.sha). We do cache RepoIndexManager.from_repo(path), so it would be good to confirm that get_commit_index doesn't reconstruct the internal state of the registry, as you mentioned. We use index.state.items from the response.
  • check_ref(path, tag) from the API. We call it for each tag name during the update. A possible enhancement would be a caching mechanism, at minimum similar to the one above for get_commit_index, that is reused across multiple iterations. Similarly, if check_ref could return a result without actually going through all tags, that would be great.
  • The API for get_stages. If we could use the same git registry for both this and the previous item for caching, that could also be a great enhancement.
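
The caching idea for repeated per-tag lookups could be sketched roughly as follows. This is not GTO's API: `StateCache`, `lookup_tag` and the `build_state` callable are hypothetical stand-ins, and the assumption is only that the expensive step (reading all tags into some state) can be done once and then reused for every tag name.

```python
from typing import Callable, Dict, Optional


class StateCache:
    """Build the (expensive) registry state once per repo and reuse it."""

    def __init__(self, build_state: Callable[[str], Dict[str, dict]]):
        # Hypothetical builder: maps a repo path to {tag_name: event_info}.
        self._build_state = build_state
        self._states: Dict[str, Dict[str, dict]] = {}

    def lookup_tag(self, repo_path: str, tag: str) -> Optional[dict]:
        """Return the info for `tag`, building the state on first use only."""
        if repo_path not in self._states:
            self._states[repo_path] = self._build_state(repo_path)
        return self._states[repo_path].get(tag)

    def invalidate(self, repo_path: str) -> None:
        """Drop the cached state, e.g. after new tags were fetched."""
        self._states.pop(repo_path, None)
```

With something like this, an update that checks N tag names pays for one state construction instead of N, at the cost of having to call `invalidate` whenever the repo's tags change.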

@aguschin
Contributor Author

Thanks, @amritghimire! I'll see what I can do. Re your items:

  • get_commit_index itself neither changes anything nor reconstructs the internal registry state
  • got it
  • got it

@aguschin aguschin moved this from Todo to In Progress in MLEM + GTO Jun 1, 2022
@aguschin
Contributor Author

aguschin commented Jun 3, 2022

@aguschin
Contributor Author

aguschin commented Jul 12, 2022

I've experimented a bit more; here is an updated list of observations:

  1. The biggest problem is reading git tag details. git tag -l | xargs git show takes around 1.5s for 1k tags (I assume the Python wrappers just call the Git CLI underneath). At the same time, gto show for this repo takes about 4s.
  2. Irrelevant git tags are now filtered out and don't increase execution time (so the earlier observation I posted above is no longer valid).
  3. pygit2 is as slow as gitpython in the most time-consuming task: reading all git tag references and getting information out of them. So I assume the Python wrapper is not what adds the overhead here.
  4. git tag -l | xargs git show runs 10x faster than git tag -l | xargs -n 1 git show. A hypothesis: if we could read the info for all git tags at once, that would be faster. But going through the git tags in Python takes ~1.5s, about the same time as git tag -l | xargs git show, so I'm not sure it could be faster (I don't know how it's implemented in pygit2/gitpython). Maybe we could call https://git-scm.com/docs/git-show-ref or https://git-scm.com/docs/git-for-each-ref directly instead and parse the results.
  5. Maybe things would speed up if I implemented "lazy reading" for annotated tags. For some operations (like gto show) the annotation information (author, date, message) could be irrelevant. That could remove this 1.4s delay. Another relevant option is to optimize the Registry State calculations and remove them where they're not needed. E.g. if you ask about model-0, we read model-0 git tags only.
  6. One more option is to create a cache file and store git-tag-related information there. Then GTO would read the information about each git tag from the cache file only once (this could be a problem if someone deletes a git tag and recreates it with a different time/author; I think this caching mechanism won't spot that).
  7. Surprisingly, git pack-refs, which packs all refs into a single file (supposedly to speed up work with the repo), increases gto show time from 4s to 14-21s. This is unfortunate, since for huge repos some users will surely pack their refs. At the same time, both git show variants take approximately the same time as on the non-packed repo.
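
The "call git for-each-ref directly and parse the results" idea from item 4, combined with the "read model-0 tags only" idea from item 5, might look roughly like this. It is a sketch, not what GTO does: `read_tags` and `parse_for_each_ref` are hypothetical names, and a pattern such as `refs/tags/model-0@*` assumes an `<artifact>@...` tag naming scheme.

```python
import subprocess
from typing import Dict, List

# Format atoms documented for `git for-each-ref`; %00 emits a NUL separator.
FIELDS = ["refname:short", "objecttype", "objectname", "*objectname",
          "taggername", "taggerdate:iso-strict", "contents:subject"]
FORMAT = "%00".join(f"%({f})" for f in FIELDS)


def parse_for_each_ref(output: str) -> List[Dict[str, object]]:
    """Parse one NUL-separated record per line into a list of dicts."""
    tags = []
    for line in output.split("\n"):
        if not line:
            continue
        name, objtype, sha, peeled, tagger, date, subject = line.split("\x00", 6)
        annotated = objtype == "tag"
        tags.append({
            "name": name,
            "annotated": annotated,
            # Annotated tags point at a tag object; the peeled SHA is the commit.
            "commit": peeled if annotated else sha,
            "tagger": tagger or None,  # empty for lightweight tags
            "date": date or None,
            "subject": subject,
        })
    return tags


def read_tags(repo: str, pattern: str = "refs/tags") -> List[Dict[str, object]]:
    """Read all matching tags with a single git process."""
    out = subprocess.run(
        ["git", "-C", repo, "for-each-ref", f"--format={FORMAT}", pattern],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_for_each_ref(out)
```

For example, `read_tags(".", "refs/tags/model-0@*")` would only list tags for one artifact. Whether this is actually faster than going through pygit2/gitpython on a 10k-tag repo would still need to be measured.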

Could use some opinions here @Suor @shcheklein
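
On the cache-file option: one way to notice a deleted-and-recreated annotated tag is to key each cache entry on the SHA of the object the tag ref points to, since the tagger, date and message are part of the hashed tag object. A minimal sketch, where `refresh`, `load_cache`, `save_cache` and the `read_details` callable are hypothetical and the cheap `{tag_name: object_sha}` listing is assumed to come from something like git show-ref:

```python
import json
from pathlib import Path
from typing import Callable, Dict


def load_cache(path: str) -> Dict[str, dict]:
    """Load the cache file, or start empty if it does not exist yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_cache(path: str, cache: Dict[str, dict]) -> None:
    Path(path).write_text(json.dumps(cache))


def refresh(refs: Dict[str, str], cache: Dict[str, dict],
            read_details: Callable[[str], dict]) -> Dict[str, dict]:
    """Return an updated cache for the current `refs` ({tag_name: object_sha}).

    Details are re-read only for tags that are new or whose SHA changed;
    tags that no longer exist are dropped.
    """
    fresh = {}
    for name, sha in refs.items():
        entry = cache.get(name)
        if entry is None or entry["sha"] != sha:
            entry = {"sha": sha, "details": read_details(name)}
        fresh[name] = entry
    return fresh
```

A lightweight tag points straight at a commit, so recreating it on the same commit leaves the SHA unchanged, but in that case there is no tagger or date stored on the tag to go stale either.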

@aguschin aguschin mentioned this issue Jul 18, 2022
@aguschin aguschin removed this from MLEM + GTO Aug 29, 2022
@aguschin
Contributor Author

related to #88


3 participants