Choco Daily All-Package Extract #2144

Open
roysubs opened this issue Oct 21, 2020 · 9 comments

Comments

@roysubs

roysubs commented Oct 21, 2020

I appreciate that Chocolatey is a package manager in the same way as apt, yum, pacman etc, and that the FOSS version, like any open source project, requires us to be a bit more hands on and roll our own DIY solutions for some things, but it occurs to me that package discovery is a needlessly frustrating, slow, and awkward user experience (and as DevOps / SysAdmin types are the main users of this great package manager, it seems odd to cripple the main users). For example, there is no way to find newly released packages (as in brand new, newly created, as opposed to updates), and tying a package name to its description is really awkward, since the output from choco list says nothing about the package (great if you know what a package name is for, terrible if you have to repeatedly go through choco list --description to resolve it). This first made me write a simple PowerShell script to interrogate the server, but it is very slow and wasteful to query the server over and over, so, after I discussed this with Paul Broadwith, he suggested that I propose a feature request for discussion.

My proposal is very simple: Once per day, a query is run on the DB to extract the following information:
Name,Version,PackageSize,DateOfPackageCreation,DateOfLastPublish,DateOfLastApproval,StatusOfLastApproval,Description(multiple fields?),LinkToGithubProject,LinkToWebPage,Maintainer,OtherUsefulMetaData1,OtherUsefulMetaData2,...etc

This information can then be available to download as a CSV or XML or JSON (in addition, publish the output as a simple web page sortable by column, with Name field being a hyperlink to the Chocolatey page for that package, and the Status or Maintainer field being a hyperlink to the Github project - I think such a page would become a super-efficient resource to find packages).

After that, I do not need to interrogate the server at all while I am hunting for packages, as I can just use PowerShell tools like Select-String to interrogate the .CSV - only when I have found the exact packages that I want to download, I can immediately go to a choco install to get what I need.
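To make the idea concrete, here is a minimal sketch of the kind of offline search described above, written in Python rather than with Select-String. It assumes a CSV whose columns follow the field list proposed earlier; no such extract exists today, and the package names, versions, and descriptions below are made up purely for illustration.

```python
import csv
import io

# Made-up sample of the proposed daily extract. The column names follow the
# field list suggested in this issue; this is not a real Chocolatey format.
SAMPLE_CSV = """Name,Version,Description
demo-aws-cli,1.0.0,Command line interface for a cloud provider
demo-ssh-client,0.7.4,SSH and telnet client for Windows
demo-azure-tools,2.1.0,Command line tools for managing cloud resources
"""


def search_packages(csv_text, term):
    """Return rows whose Name or Description contains term (case-insensitive)."""
    needle = term.lower()
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if needle in row["Name"].lower() or needle in row["Description"].lower()
    ]


# Every match already carries its description, so no follow-up query is needed.
for row in search_packages(SAMPLE_CSV, "command line"):
    print(f'{row["Name"]} {row["Version"]}: {row["Description"]}')
```

The whole search runs against text already on disk, which is the point of the proposal: the server is contacted once to fetch the extract, and never again while browsing.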

Pros:
• Normally, if I am hunting for packages, I might make 20 queries against the server (firstly, choco list <searchterm>, then multiple choco list <searchterm> --description to get some idea of what the package is - and each step of this is really quite slow to come back with information...). This is tedious and frustrating. With the above, on any day that I want to query the DB, I would just use curl (or maybe a switch in choco list also) to grab the daily-dump from any site hosting it and query that.
• The result is massively faster retrieval of information, allowing users (both FOSS and paid subscribers) to very quickly discover new and interesting packages that are available. i.e. instead of interrogating the server (a waste of server resources!), I would just repeatedly interrogate the CSV/XML/JSON, and this would take milliseconds instead of the quite slow process that I currently have to go through.
• Overall queries on the server should be reduced. This might not be a huge factor, but anything that can reduce stress on the DB is probably desirable in general. Additionally, the daily extract could be held in any other location, thus spreading query-load away from the DB server.

Cons:
• None. Setting up the daily dump would be trivially simple; query server-side once per day to create the CSV / XML / JSON etc, then all users can query that dump lightning fast instead of the frustrating and slow repeated choco list requests over and over. Having that one query available as a CSV daily would make usability / package discovery a dream compared to the current slog-fest.
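As a rough illustration of the server-side step, the sketch below serialises a list of package records to both CSV and JSON. It is only a sketch under stated assumptions: the records are made up, the field names are taken from the list proposed above, and how the records would actually be read from the Chocolatey database is not known here and is left out.

```python
import csv
import io
import json

# Made-up records standing in for the result of the once-a-day database query.
# The field names are assumptions based on the list proposed in this issue.
FIELDS = ["Name", "Version", "DateOfPackageCreation", "Description"]
RECORDS = [
    {"Name": "demo-tool", "Version": "1.0.0",
     "DateOfPackageCreation": "2020-10-20",
     "Description": "An example, with a comma"},
    {"Name": "demo-editor", "Version": "2.3.1",
     "DateOfPackageCreation": "2019-05-02",
     "Description": "Another example"},
]


def to_csv(records, fields):
    """Serialise records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialise the same records to a JSON array."""
    return json.dumps(records, indent=2)


csv_text = to_csv(RECORDS, FIELDS)
json_text = to_json(RECORDS)
print(csv_text)
```

Using the csv module rather than joining strings by hand matters for free-text fields such as Description, which can contain commas, quotes, or newlines.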

Why this would benefit all Chocolatey users:
• I have been told that there is a daily email with new/updated package information but that it doesn't contain a good overview of the information.
• The website is also not the best way to get overview information, as we cannot easily use tools to interrogate it for the packages that we want. DevOps / CLI users are the core user-base of Chocolatey, so helping those users can only help the spread of the tool.
• It would allow people to have a much faster holistic overview of the totality of the available packages, which allows the community to more easily test and find issues with broken or redundant packages, to help keep the environment clean. The more easily FOSS users can manipulate that totality view of the package list, the easier it would be for them to compile useful summaries for management to show how the paid subscription could piggy-back off the huge maintained package list (giving techies the ability to quickly compile such justification is good).

My own workaround to search for a term is a super-clunky, nasty kludge and horribly slow, but I can finally extract something useful. https://gist.github.com/roysubs/09454e20d54adeb805f14a4ec89deaa9
e.g. choco-index azure aws git will compile all packages that match these search terms, present them in an easy-to-read format, and save the output. It is very slow, and it is a real shame to have to do this when having a single daily query from the server as a CSV would make this all doable in a fraction of a second instead of the 15-30 minutes that it takes to collect the above, and so much simpler for Chocolatey users to discover more of the great packages that are in the repository.

I would very much appreciate consideration of this, as I think it would be trivially simple to implement (just a simple query run once per day), with only upsides for all users. It would make package discovery massively easier than very slow CLI choco list --description runs, or having to go to a website and do search after search (not exactly how DevOps / SysAdmins who like to work fast and automate work!), and it just makes for a much faster and more efficient user experience (i.e. my command just to find packages with azure/aws/git takes 7 minutes to run ... surely a daily extract is a better way to return such basic information and allow Chocolatey users to do many searches instantly instead of the equivalent of 7 minutes per run like the above?)

@gep13
Member

gep13 commented Oct 26, 2020

@roysubs I believe I am right in saying that what you want is what we refer to as package indexing. There is an open issue around this here:

#820

This is something that has been on our roadmap for a while, and something that will eventually get implemented.

@roysubs
Author

roysubs commented Oct 26, 2020

Thanks! I think that you are right that ultimately, a local cache could possibly achieve a similar goal, but would it include the important Description fields for every package so that I could do free-form searches to look for packages that match certain criteria in the package name or Description fields? And would it include the FirstPackageReleaseDate, so that I can quickly do a search to say "what completely brand new packages appeared in the last week" (this is quite distinct from packages that have been updated, which is a very different thing)? If it could achieve these things, that would be great, but I think that what I see in #820 is that it will do none of those things... :-( (please correct me if I am wrong!?).
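The distinction between "brand new" and "recently updated" can be sketched as follows. This assumes the extract carries a creation date per package (called DateOfPackageCreation here, after the field list in the original proposal) in ISO format; both the column names and the rows are hypothetical.

```python
import csv
import io
from datetime import date, timedelta

# Made-up rows: the first package was republished recently but created long
# ago, so it should NOT count as a brand new package.
SAMPLE_CSV = """Name,DateOfPackageCreation,DateOfLastPublish
demo-old-tool,2018-03-01,2020-10-24
demo-new-tool,2020-10-22,2020-10-22
demo-newer-tool,2020-10-25,2020-10-26
"""


def new_packages(csv_text, today, days=7):
    """Return names of packages first created within the last `days` days."""
    cutoff = today - timedelta(days=days)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["Name"] for row in reader
        if date.fromisoformat(row["DateOfPackageCreation"]) >= cutoff
    ]


print(new_packages(SAMPLE_CSV, today=date(2020, 10, 26)))
# prints ['demo-new-tool', 'demo-newer-tool']
```

Filtering on the creation date rather than the last publish date is what separates genuinely new packages from updates to existing ones.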

Also, since this has been on the roadmap for over 4 years, what do you think of just running that query once per day at your side that collects this information into a simple CSV published somewhere on the chocolatey site (that I can grab with Invoke-WebRequest)? This would turn CLI package discovery from a death-rattle crawl nightmare of running scripts that take 10-15 minutes just to pull the most basic package discovery information (as per my script above), into a joy and a pleasure to use the command line for all queries to find useful packages from the repo! (and I would appreciate this so much <all-of-my-fingers-and-toes-crossed!>).

@mkevenaar
Contributor

@roysubs why not simply use the current available RSS feed? http://feeds.feedburner.com/Chocolatey

@roysubs
Author

roysubs commented Oct 26, 2020

Thanks for that, I wasn't aware of it. I've had a look now, but it seems that this doesn't help with package discovery?

e.g. Would the RSS feed allow me to take a query (say "aws") and from that list every package that matches it in its package name or Description fields, and see information about the size of the package, FirstReleaseDate, LastPublishedDate, AppVersion etc, thus making it super-easy to pin down packages, find new packages etc?

It is useful to see newly published packages, and I might be wrong, but this does not seem to help (much) with package discovery (i.e. A "good user experience" would be: being able to very quickly crawl the available packages via an offline CSV in seconds instead of the 10-15 minute torture that I have to endure on the CLI just to return basic package information as described above).

@roysubs
Author

roysubs commented Oct 31, 2020

A bit more on my situation and why so many people are in the same situation as me, and why this is useful: Like most people, my company has not (yet) bought Chocolatey, and so, like the vast majority of Chocolatey users, I am exclusively using the Community driven CCR ( chocolatey.org/packages ). This is primarily what I am referring to, since 99+% of what I will access and install will come from there (until my company hopefully purchases a license).

So, all new users to Chocolatey (that have not started working in an organisation that has a paid license) will be focused on the CCR as the central location from which they will a) search for packages, b) install packages. It is this package discovery process that I am most interested in, and the "user experience" here is frankly, awful. The website is great, the tools are great, but if I want to do a quick search on some search term, like say choco list browser, the query return is slow and because zero package information is attached to that output, I then have to do choco list <package> --description.

I feel that this massively hurts the experience for new users to Chocolatey and turns them away. I've always found this package discovery aspect incredibly frustrating, and everyone that I've shown it to finds this very slow and awkward. I'm just trying to ask for some way that the CCR can be quickly queried - I feel that this will greatly improve the user experience.

Maybe a better way overall to achieve this would be as follows:

choco list --updateofflinecache # Get a dump of all of the above fields (without downloading any packages!).
choco list --downloadlibrarycsv # Get a CSV snapshot of the environment with the fields mentioned above.
choco list <searchterm> --offlinecache # Do an instant search just on the local cache.
choco list <packagename> --description --offlinecache # Do an instant package query only the localcache.

Sorry to describe the user experience of this aspect of using choco as bad. I love Chocolatey, but as you can see from my script above, ~10 minutes just to do basic discovery on packages that I'd like to test out is horribly slow, and it just seems that it could be so easily fixed (either as above or via a daily CSV that has all of these fields exported from the DB etc)?

@TheCakeIsNaOH
Member

> Like most people, my company has not (yet) bought Chocolatey, and so, like the vast majority of Chocolatey users, I am exclusively using the Community driven CCR ( chocolatey.org/packages ).

You can use a custom repository without purchasing Chocolatey, and you can purchase Chocolatey and still use the chocolatey.org repository. Purchasing Chocolatey is not linked to using a custom repository.

Most Nuget repositories work fine with Chocolatey. Nexus, Artifactory, ProGet, MyGet all work.
https://chocolatey.org/docs/how-to-host-feed

> I've always found this package discovery aspect incredibly frustrating, and everyone that I've shown it to finds this very slow and awkward.

If you already know the name of the software you are looking for, I've found it to be fine.

If you are looking to discover new pieces of software via finding the Chocolatey packages for that software, then yes, I would agree that it is not ideal. But in this case, I'm not sure that is the intent of the search/list command. I think this would need to be part of a larger discussion for if Chocolatey.org should also be a discovery platform for software, not just a repository.

@roysubs
Author

roysubs commented Oct 31, 2020

Exactly, I'm not talking about when I know the software already. Don't you think that "discovery" of available packages is not only a nice-to-have, but actually a core function that people can find more packages that might be of interest to them?

For example, when I do choco list console, I might discover 15 packages that are of interest to me, and that in turn might launch me into finding a tool that was unrelated to my original search but which turns out to be a real benefit to me that I was not aware of before I did that search. For this, the output of choco list console is horribly lacking from a user experience perspective because it just tells you the package name and that information is next to useless. e.g. To someone that does not know what "putty" is, that name could mean anything (ok, we all know what "putty" is, but you see what I mean about package names).

In turn, the ability to discover packages in really quick succession with 20 searches for various different things also massively increases the overall reach of lesser-known Chocolatey packages, making it a) a ton more appealing and enjoyable to use, and b) making some of those less-downloaded packages come within reach of people doing random searches for things and tinkering with "oh, ok, that minor package that is not downloaded a lot looks really interesting for something that I'm working on..." and they install it and find out that it really helps them with some task. It seems such an important oversight to just neuter people's ability to quickly do random searches for things. Are these not worthwhile user experience benefits? And I mean, at such a low cost, because it's just allowing people to cache the metadata, or to have a CSV with the fields that I mentioned above that I would happily open in Excel or use PowerShell to do regex searches through. Searching on the metadata instantaneously would be a real pleasure, where you can whip through tons of random searches in seconds instead of 30+ minutes (quite literally!) to do 5 or 10 searches and then have to tie each package name to a meaningful description via choco list <packagename> --description.

I know what you mean, but repositories are big places, discovery is the means to get a handle on what is in that repository and that just seems to be a natural match to me; it feels currently like having a database but no easy way to search that information (well, of course, that's exactly what it is! lol). i.e. Shouldn't "easily searchable" be a core component of a repository?

@pauby
Member

pauby commented Jul 15, 2021

> Don't you think that "discovery" of available packages is not only a nice-to-have, but actually a core function that people can find more packages that might be of interest to them?

This is what community.chocolatey.org/packages can be used for.

Until there is a local index to be able to pull that information from I don't see how it would be feasible for every choco list to pull back the information you are looking for.

So I think you're really looking at this coming in with package indexing.

@roysubs
Author

roysubs commented Oct 16, 2021

When you say "This is what community.chocolatey.org/packages can be used for", I completely agree, but consider this: if I am on Linux (say Ubuntu), I can do apt search azure, then look at that list, then find a package that I'm interested in and do apt show <packagename> and up pops the full description (extremely quickly). Doing the same in Chocolatey is frustratingly slow - Chocolatey is primarily about the cli; in fact, it's "the command-line PowerShell package manager that people who have used Linux dreamed about having on Windows for many years", but that cli user experience, to get very simple descriptive information, is frustratingly slow as above, and hence why I built my clunky little function in the above Gist to discover / index information from the database.

I completely agree that this is about package indexing, but, just the meta-data, as a clone of the community repository on my hard disk is somewhat overkill.

"Until there is a local index to be able to pull that information from I don't see how it would be feasible for every choco list to pull back the information you are looking for." My understanding is limited and I'm really keen for your insight: if there is a way to create a local index that I can query, but without the overkill of having to download the entire repository, then I'd love to know how. i.e. just the meta-data / descriptions, etc (I don't mind if it takes me 5 minutes to download that, as I'd probably only refresh it once a month, so that's fine), and, once downloaded, I can direct Chocolatey to scrutinise that local index so that I can do discovery on the console, as fast or faster than in Linux, where I could say "show me every package that contains the word 'ssh' somewhere in the Description field". This would be lightning fast, while my script to do the same (my simple function linked to above) currently takes multiple minutes from the console. Maybe there is a way to achieve this without cloning the entire repository locally; I'm keen to try anything that you suggest (and sorry if I don't fully understand some of the ways that Chocolatey does things). 👍
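For what it's worth, a metadata-only local index of the kind asked about here could look roughly like the sketch below. It is not how Chocolatey works or plans to work: it simply loads made-up (name, description) pairs into SQLite from Python's standard library, runs the "contains 'ssh' in the description" query, and includes a hypothetical helper for the once-a-month refresh check.

```python
import sqlite3
from datetime import datetime, timedelta

# Made-up metadata rows; a real index would be filled from a downloaded extract.
ROWS = [
    ("demo-ssh-client", "A made-up SSH terminal client"),
    ("demo-editor", "A made-up text editor"),
    ("demo-sync", "Made-up file sync over ssh and rsync"),
]


def build_index(rows):
    """Load (name, description) pairs into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE packages (name TEXT PRIMARY KEY, description TEXT)")
    conn.executemany("INSERT INTO packages VALUES (?, ?)", rows)
    conn.commit()
    return conn


def search_description(conn, word):
    """Return names whose description contains word.

    SQLite's LIKE is case-insensitive for ASCII by default. The % and _
    characters in `word` are not escaped in this sketch.
    """
    cursor = conn.execute(
        "SELECT name FROM packages WHERE description LIKE ? ORDER BY name",
        (f"%{word}%",),
    )
    return [name for (name,) in cursor]


def is_stale(downloaded_at, now, max_age_days=30):
    """True when the local copy is older than the chosen refresh interval."""
    return now - downloaded_at > timedelta(days=max_age_days)


conn = build_index(ROWS)
print(search_description(conn, "ssh"))
# prints ['demo-ssh-client', 'demo-sync']
```

Only names and descriptions are stored, so the index stays small compared with a clone of the repository, and the query touches nothing but the local file or memory.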

5 participants