Crazy Idea: git backend registry #97

Closed
josh opened this issue Oct 13, 2014 · 21 comments

Comments

@josh

josh commented Oct 13, 2014

@maccman and I had a pretty crazy, but I think genius, idea to just have the registry in a git repo.

https://github.com/josh/birdsnest

It addresses many of the concerns in #73 and requires very little maintenance and upkeep from the bower team. Making the infrastructure more complex instead of less seems like a setup for failure given our current maintainership and budget issues. I think we can turn our limited resources into an advantage here.

/cc @paulirish @sheerun @rayshan @benschwarz

@benschwarz
Member

Looks fun, @josh

The impractical side of birdsnest not running any code is that it won't record usage statistics.

Otherwise, manually registering packages via pull request would prove to be more work than what we already have (and cause delays in package registration).

These seem like blockers to me right now, sorry to rain on your parade… I think the idea is pretty cool, but I'd be interested to know your response to these concerns. Any plans?

@josh
Author

josh commented Oct 13, 2014

The impractical side of birdsnest not running any code is that it won't record usage statistics.

I'd argue the registry is a poor place to track downloads since it isn't even offering them. Git clones and release downloads are recorded in the GitHub traffic graphs, which even include downloads that don't go through name resolution.

In terms of stats, do you have numbers on how much people care about this hit data? I imagine most people neither know it exists nor care. I think we should value decentralization and mirroring over any centralized analytics system for an open registry.

(Footnote: No one looks at their DNS queries to tell how much traffic their sites get)

Otherwise, manually registering packages via pull request would prove to be more work than what we already have (and cause delays in package registration).

Automatic merging of new packages is trivial to set up with a bot. The situation today for disputing package names and removals is a complete mess, and pretty much nothing gets resolved: 1) the process is manual, requiring us maintainers to ssh into the db and remove packages, and 2) there's no structured process for having a public discussion about removals. Things like delisting your own packages can be automated by a merge bot.

A full author history would give us an audit log to track down the original person that submitted a package or proposed a rename. Today the database is completely anonymous and completely fails us here.

@benschwarz
Member

I'd argue the registry is a poor place to track downloads since it isn't even offering them. Git clones and release downloads are recorded in the GitHub traffic graphs, which even include downloads that don't go through name resolution.

In terms of stats, do you have numbers on how much people care about this hit data? I imagine most people neither know it exists nor care. I think we should value decentralization and mirroring over any centralized analytics system for an open registry.

I don't have usage analytics for http://bower.io/stats/, @rayshan does though.

Automatic merging of new packages is trivial to set up with a bot. The situation today for disputing package names and removals is a complete mess, and pretty much nothing gets resolved: 1) the process is manual, requiring us maintainers to ssh into the db and remove packages, and 2) there's no structured process for having a public discussion about removals. Things like delisting your own packages can be automated by a merge bot.

👍 SGTM.

A full author history would give us an audit log to track down the original person that submitted a package or proposed a rename. Today the database is completely anonymous and completely fails us here.

YES!

@josh
Author

josh commented Oct 13, 2014

So I still think there would be a place for separate "bower package index services". http://customelements.io would be a good example here. Community sites can handle niche curation, nicely displayed metadata, and more extensive full-text search over package readmes.

Potentially, "install" hits could still be managed by a standalone service that would be optional to participate in. Keeping it separate from the core registry solves many of the security and service availability issues.

Definitely a good question. I do feel a little biased because I'd personally like to opt out of download tracking and choose privacy instead.

@sheerun
Contributor

sheerun commented Oct 13, 2014

Pros:

  • Companies could more easily create their own repositories.
  • We have a clear history of edits to the registry.
  • One could "pin" the registry to a given version and not worry about continuous changes.

Cons:

  • Need to change old registry so it syncs with github repo.
  • Need to write a bot for accepting new components (also causes a lot of e-mail noise)
  • What to do with bower register command for old bower? Release a patch that disables it?
  • Problems with file limit on some systems? (about 20 000 files in one directory)
  • Long clone delays as the number of edits and new components grows

I propose a middle-ground solution: use a GitHub repository as the storage for the registry, instead of Redis.

Advantages:

  1. It wouldn't break the old bower register and new bower unregister commands.
  2. No need for a bot accepting new components (users can send PRs only for edits).
  3. No e-mail spam from new component merges (only edit requests).
  4. No long clones, no need to worry about the file limit (the registry is still used through http).

@patrickkettner
Member

Couldn't bower register be a command that opens a PR with the proper information?

Also, what OS's filesystem has a limit anywhere near that low?

patrick

@sheerun
Contributor

sheerun commented Oct 13, 2014

I meant the already-implemented bower register command and all the bower users who haven't upgraded.

Also, I'm pretty sure it's not possible to automatically open a pre-filled PR on GitHub.

@patrickkettner
Member

I meant the already-implemented bower register command and all the bower users who haven't upgraded.

Isn't that already the case for anyone who doesn't update now? I would think the best thing to do would be to create a new error code from the server that results in an 'update your client' message on the client (to help with any similar issue in the future), and then keep a thin server up that just returns that error until the hits drop below an acceptable threshold.

Also, I'm pretty sure it's not possible to automatically open a pre-filled PR on GitHub.

'course it is. https://developer.github.com/v3/pulls/#create-a-pull-request How else would GitHub clients work?

patrick

@sheerun
Contributor

sheerun commented Oct 13, 2014

You need to fork the registry, clone it, commit the change, push it, authorize with the GitHub API, and send a PR via the API. That's pretty hard to implement. Also, I still don't see how to show a pre-filled PR instead of sending one.

The other issues are too many e-mail notifications and having to write an auto-accepting bot.

I think it's easier and cleaner to write a script that commits the new entry instead of accepting a PR. Of course one could send a new entry via PR, but it wouldn't be necessary given that the bower register command still works.

@josh
Author

josh commented Oct 13, 2014

Need to change old registry so it syncs with github repo.

The existing registry can act as a proxy to the new API to preserve compatibility with old clients.

Need to write a bot for accepting new components (also causes a lot of e-mail noise)

I'd probably suggest no one directly watch the "repo", but set up a team for people to mention when something requires direct review. It's a pretty typical GitHub workflow for repos with high-volume issue trackers.

Also, I'm pretty sure it's not possible to automatically open a pre-filled PR on GitHub.

Well, luckily I'm just the person to make that happen.

Problems with file limit on some systems? (about 20 000 files in one directory)

If we cared about FAT32, it's 65,534. I'd say we don't, and every other modern FS supports trillions. Most length limits are on the file name itself.

This is more of a format issue, but you can also shard on the first letter, e.g. packages/j/jquery
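
As a rough illustration of the first-letter sharding mentioned above, the mapping from package name to file path could look like the sketch below. The function name `shardedPath`, the default `packages` root, and the choice to lowercase the shard letter are all assumptions made for this example, not part of birdsnest.

```javascript
// Hypothetical helper: map a package name to a path sharded by first letter.
// The "packages" root and lowercase shard directory are assumptions.
function shardedPath(name, root = "packages") {
  if (typeof name !== "string" || name.length === 0) {
    throw new Error("package name must be a non-empty string");
  }
  // Use the first character as the shard directory, e.g. "j" for "jquery".
  const shard = name[0].toLowerCase();
  return `${root}/${shard}/${name}`;
}

console.log(shardedPath("jquery")); // packages/j/jquery
```

With roughly 20,000 names spread over a few dozen shard directories, each directory would hold far fewer entries than any of the limits discussed here.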

Long clone delays as the number of edits and new components grows

Pretty unlikely. The current registry is about 4MB. Even with git history, gzip is amazing at compressing this stuff.

To put performance in perspective, a cold bower install jquery would have to fetch a little 4MB repo with names (just once), then clone down an entire 22MB jquery repo. Bower's perf issues are in package repo fetches and updates, not this registry.

@patrickkettner
Member

On Mon, Oct 13, 2014 at 3:39 PM, Adam Stankiewicz notifications@github.com wrote:

You need to fork the registry, clone it, commit the change, authorize with the GitHub API, and send a PR via the API. That's pretty hard to implement. Also, I still don't see how to show a pre-filled PR instead of sending one.

The proposed search was implemented by cloning the same repo, so if that plan was followed, the only hurdle would be a one-time authorization.

As far as emails go, why would anyone subscribe to it? Even if someone did for some reason, they could easily opt out.

@josh
Author

josh commented Oct 13, 2014

You need to fork the registry, clone it, commit the change, authorize with the GitHub API, and send a PR via the API. That's pretty hard to implement. Also, I still don't see how to show a pre-filled PR instead of sending one.

Web flow baby.

Something like https://github.com/josh/birdsnest/new/gh-pages/packages/foo?body=https://github.com/josh/foo.git could prefill the filename and body, and you just have to submit it.
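
A client could assemble a link of that shape with a few lines of code. The sketch below only mirrors the URL pattern quoted in the comment above; the `body` query parameter is taken from that example and has not been checked against GitHub's web UI, and `prefilledRegisterUrl` with its default arguments is a hypothetical name chosen for illustration.

```javascript
// Hypothetical helper: build a "new file" web-flow link shaped like the
// example in the comment above. The "body" parameter name comes from that
// example and is an unverified assumption.
function prefilledRegisterUrl(name, repoUrl, registry = "josh/birdsnest", branch = "gh-pages") {
  const url = new URL(
    `https://github.com/${registry}/new/${branch}/packages/${encodeURIComponent(name)}`
  );
  // searchParams.set percent-encodes the repository URL for us.
  url.searchParams.set("body", repoUrl);
  return url.toString();
}

console.log(prefilledRegisterUrl("foo", "https://github.com/josh/foo.git"));
```

The appeal of this approach is that the client never needs an API token: the user's browser session handles authentication when the link is opened.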

@sheerun
Contributor

sheerun commented Oct 13, 2014

Need to change old registry so it syncs with github repo.

The existing registry can act as a proxy to the new API to preserve compatibility with old clients.

Here are the routes:

app.get('/packages', routes.packages.list);
app.get('/packages/:name', routes.packages.fetch);
app.get('/packages/search/:name', routes.packages.search);
app.post('/packages', routes.packages.create);
app.del('/packages/:name', routes.packages.remove);

Only /packages/:name could be easily proxied.

It's hard to proxy GET /packages. In bower it's used by bower search with no argument, and who knows what else uses it (e.g. your update.sh script). Maybe some trick like cloning and rebuilding every minute would work.
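
One way the periodic rebuild could work is to regenerate the full list from a local checkout of the repo. The sketch below assumes one file per package whose contents are the package URL (the plain-text layout discussed in this thread) and assumes the response is an array of `{ name, url }` objects; `buildPackageList` is a hypothetical name, and the git fetch and scheduling are left out.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: rebuild the GET /packages payload from a checkout.
// Assumes each file in packagesDir is named after a package and contains
// only that package's URL.
function buildPackageList(packagesDir) {
  return fs
    .readdirSync(packagesDir, { withFileTypes: true })
    .filter((entry) => entry.isFile())
    .map((entry) => ({
      name: entry.name,
      url: fs.readFileSync(path.join(packagesDir, entry.name), "utf8").trim(),
    }))
    // Sort by name so the output is stable between rebuilds.
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}
```

A proxy could run this after each fetch and serve the cached result, so old clients keep getting a single response from GET /packages.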

/packages/search/:name is used by bower search :name. It would be nice if it used http://bower.io/search/ instead, but for now the same endpoint has to serve both search and package lookup (see the docs on the registry.search config option).

POST /packages and DELETE /packages/:name could automatically commit to the repository?

Or should we drop support for some endpoints and make someone angry? :)

Long clone delays as the number of edits and new components grows

Pretty unlikely. The current registry is about 4MB. Even with git history, gzip is amazing at compressing this stuff.

Still, currently if you want to install only jquery, there's one ~1KB request to /packages/jquery.

To put performance in perspective, a cold bower install jquery would have to fetch a little 4MB repo with names (just once), then clone down an entire 22MB jquery repo. Bower's perf issues are in repo fetches and updates, not this registry.

Bower usually downloads packaged .zip tags. For jquery it's 750KB, just one file.

@sheerun
Contributor

sheerun commented Oct 13, 2014

Just FYI I'm not against this. I just want to list possible issues.

@sheerun
Contributor

sheerun commented Oct 13, 2014

Bower's search-server uses GET /packages as well.

@josh
Author

josh commented Oct 13, 2014

Just FYI I'm not against this. I just want to list possible issues.

Haha for sure.

There's room for some debate just around the storage format alone. Originally I had just a single .txt file, but I think separate files avoid merge conflicts, take file sorting out of the question, and lead to slightly better git object compression across changes.

File values could potentially be a full JSON object with other metadata, but I really don't know what else would need to be a core concern. Keywords and dependencies are described in the package's metadata, which works better since the author can just change it.

There's also an interesting issue about package name validation with regard to the FS. I mostly blame @maccman for the original poor validation. For example, he removed a package from the registry that had managed to register its name as the empty string "". It also seems like names such as component/foo are valid today, but that seems very problematic for the bower install phase: there's no way to install both component and component/foo.
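
To make the two failure cases above concrete (the empty name and names containing a slash), here is a minimal sketch of a stricter check. The rules themselves, including the 50-character cap, the lowercase-only character set, and the ban on leading dots, are assumptions picked for illustration; they are not bower's actual validation rules.

```javascript
// Hypothetical validator: every rule here is an assumption for illustration,
// chosen so that names map safely onto one file per package.
function isValidPackageName(name) {
  if (typeof name !== "string") return false;
  // Reject the empty string and overly long names (50 is an arbitrary cap).
  if (name.length === 0 || name.length > 50) return false;
  // Reject path separators so "component/foo" cannot shadow "component".
  if (name.includes("/") || name.includes("\\")) return false;
  // Reject dotfiles and the special "." and ".." entries.
  if (name.startsWith(".")) return false;
  // Allow only lowercase letters, digits, dots, dashes and underscores.
  return /^[a-z0-9._-]+$/.test(name);
}

console.log(isValidPackageName(""));              // false
console.log(isValidPackageName("component/foo")); // false
console.log(isValidPackageName("jquery"));        // true
```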

josh changed the title from "git backend registry" to "Crazy Idea: git backend registry" on Oct 13, 2014

@sheerun
Contributor

sheerun commented Oct 13, 2014

The file format could be JSON:

{
  "url": "git://github.com/jquery/jquery"
}

It allows for tricks like the following, even with the current bower version:

bower install jquery --config.registry=https://sheerun.github.com/birdsnest

Theoretically one could host their own bower repo even now, just by forking birdsnest.
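
If both layouts discussed in this thread had to coexist for a while (a file holding just the URL, and a file holding a JSON object with a "url" key), a reader could accept either. This is only a sketch under that assumption; `parseEntry` is a hypothetical name and the error messages are illustrative.

```javascript
// Hypothetical parser: accepts either a bare URL or a JSON object with a
// "url" string, the two entry formats floated in this thread.
function parseEntry(text) {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    throw new Error("entry is empty");
  }
  if (trimmed.startsWith("{")) {
    // JSON form, e.g. { "url": "git://github.com/jquery/jquery" }
    const data = JSON.parse(trimmed);
    if (typeof data.url !== "string" || data.url.length === 0) {
      throw new Error('entry is missing a "url" string');
    }
    return { url: data.url };
  }
  // Plain-text form: the whole file is the URL.
  return { url: trimmed };
}

console.log(parseEntry('{ "url": "git://github.com/jquery/jquery" }').url);
console.log(parseEntry("git://github.com/jquery/jquery\n").url);
```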

@sheerun
Contributor

sheerun commented Oct 13, 2014

What needs to be done:

  • Set up a staging registry for testing
  • Implement proxying of /packages/:name to REPO_URL/packages/:name
  • Implement periodic fetching and generation of /packages from REPO_URL
  • Proxy /packages/search/:name to http://bower.io (or deploy http://bower.io under http://bower.herokuapp.com)
  • Let POST /packages/ create a new commit (or return an error with a pre-filled PR URL)
  • Let DELETE /packages/:name create a new commit (or return an error with a pre-filled PR URL)
  • Check that everything works properly and mirror the bower registry in the GitHub repository
  • Immediately deploy the staging registry under the production endpoint
  • Change the documentation about editing the registry.
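
The /packages/:name proxying step in the list above mostly comes down to a path rewrite. A minimal sketch of that mapping follows, kept free of any HTTP framework so it stays self-contained; `upstreamUrl` is a hypothetical name, and the example REPO_URL value is a placeholder, not a real deployment.

```javascript
// Hypothetical helper: map an incoming registry path to the static file
// under REPO_URL. Only /packages/:name is handled; list and search
// requests return null so the caller can treat them separately.
function upstreamUrl(requestPath, repoUrl) {
  const match = /^\/packages\/([^/]+)$/.exec(requestPath);
  if (!match) return null;
  // Strip trailing slashes so the joined URL has exactly one separator.
  const base = repoUrl.replace(/\/+$/, "");
  return `${base}/packages/${match[1]}`;
}

// Placeholder REPO_URL used purely for illustration.
const REPO_URL = "https://example.invalid/birdsnest/";
console.log(upstreamUrl("/packages/jquery", REPO_URL));
// https://example.invalid/birdsnest/packages/jquery
console.log(upstreamUrl("/packages/search/jquery", REPO_URL));
// null
```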

Did I miss something? Is it really worth it?

@josh
Author

josh commented Oct 13, 2014

Did I miss something? Is it really worth it?

I think it needs more buy in from other involved peeps.

josh closed this as completed on Nov 17, 2014

@patrickkettner
Member

why the closure?

@paulirish
Member

I believe it's because there's not enough engineering interest currently, so it's better to keep it closed, as it's unlikely to happen.
