
SourceRank 2.0 #1916

Open · andrew opened this issue Jan 9, 2018 · 43 comments
andrew commented Jan 9, 2018

SourceRank 2.0

Below are my thoughts on the next big set of changes to "SourceRank", the metric that Libraries.io calculates for each project. It produces a number that can be used for sorting lists and weighting search results, as well as for encouraging good practices that improve the quality and discoverability of open source projects.

Goals:

  • Comparable number for packages within a given ecosystem
  • Easily understandable for a human to scan without explanation
  • Clear breakdown available of factors that went into the final number
  • Any negative factors that contribute to the score should be fixable by the maintainer
  • Changes should be trackable over time
  • Less popular projects that are good quality and well maintained should still score well
  • Doing work to improve a project's SourceRank should encourage following best practices

History:

SourceRank was inspired by Google PageRank, as a better alternative score to GitHub stars.

The main element of the score is the number of open source software projects that depend upon a package.

If a lot of projects depend upon a package, that implies some other things about that package:

  • It's working software; people tend to remove dependencies that are broken
  • It's documented well enough to be used by the people who have added it as a dependency

Problems with 1.0:

Sourcerank 1.0 doesn't have a ceiling on the score; the project with the highest score is mocha, with a sourcerank of 32. When a user is shown an arbitrary number it's very difficult to know whether that number is good or bad. Ideally the number should either be out of a total (i.e. 7/10), a percentage (82%), or a grade of some kind like B+.

Some of the elements of sourcerank cannot be fixed, as they judge actions from years ago with the same weight as recent actions. For example, "Follows SemVer?" will punish a project for having published an invalid semver number many years ago, even if the project has followed semver perfectly for the past couple of years. Recent behaviour should have more impact than past behaviour.
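
One way to express that idea (purely illustrative, not a committed design) is to weight each release's semver validity by how recent it is:

# Illustrative only: weight each release's semver validity by recency, so an
# invalid version number from years ago barely affects the score.
# `releases` is assumed to be an array of hashes like
# { valid: true, age_in_years: 0.5 } - not an actual Libraries.io structure.
def recency_weighted_semver_score(releases)
  return 100 if releases.empty?
  weights = releases.map { |r| 0.5**r[:age_in_years] } # one-year half-life
  total   = releases.zip(weights).sum { |r, w| (r[:valid] ? 100 : 0) * w }
  (total / weights.sum).round
end

recency_weighted_semver_score([
  { valid: false, age_in_years: 5.0 }, # old mistake, weight ~0.03
  { valid: true,  age_in_years: 0.2 }  # recent release, weight ~0.87
]) # => 97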

If a project solves a very specific niche problem, or tends to be used within closed source applications a lot more than within open source projects (LDAP connectors, payment gateway clients, etc.), then its usage data within open source will be small and sourcerank 1.0 will rank it lower.

Related to small usage numbers, a project within a smaller ecosystem will currently get a low score even if it is the most used project within that whole ecosystem, when compared to a much larger ecosystem (Elm vs JavaScript, for example).

Popularity should be based on the ecosystem which the package exists within rather than within the whole Libraries.io universe.

The quality of, or issues with, a project's dependencies are not taken into account. If adding a package brings with it some bad quality dependencies, that should affect the score; similarly, the total number of direct and transitive dependencies should be taken into account.

Projects that host their development repository on GitHub currently get a much better score than ones hosted on GitLab, Bitbucket or elsewhere. Whilst being able to see and contribute to a project's development is important, where that happens should not influence the score simply because of how easy it is for us to access that data.

Sourcerank is currently only calculated at the project level, but in practice the sourcerank varies by version of a project as well, for a number of quality factors.

Things we want to avoid:

Past performance is not indicative of future results: when dealing with volunteers and open source projects where the license says "THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND", we shouldn't reward extra unpaid work.

Metrics that can easily be gamed by doing things that are bad for the community, e.g. hammering the download endpoint of a package manager to boost the numbers.

Relying too heavily on any metrics that can't be collected for some package managers; download counts and dependent repository counts are two examples.

Proposed changes:

  • Group SourceRank breakdown into 3 or 4 sections (usage, quality, community, distribution)
  • Usage/popularity metrics should be based on the scale of the project's ecosystem
  • Overall score should be between 0 and 100
  • Time based score elements should focus on more recent behaviours
  • SourceRank of Dependencies (both direct and transitive) should be taken into account

Potential 2.0 Factors/Groupings:

Usage/Popularity

  • Dependent projects and repositories
  • Stars

Quality

  • Basic info present
  • Following semver recently
  • Source repository present
  • Documentation (readme present)
  • SourceRank of runtime dependencies on latest version
  • Outdated status of runtime dependencies on latest version
  • Status (prerelease, new, deprecated, unmaintained, yanked etc)
  • License present?
  • License compatibility of dependencies
  • Prerelease?
  • Major version prerelease

Community/maintenance

  • Bus factor (number of committers, or percentage share of commits per committer)
  • How responsive the issue tracker is
  • Overall repository activity levels
  • Frequency of releases

Reference Links:

SourceRank 1.0 docs: https://docs.libraries.io/overview#sourcerank

SourceRank 1.0 implementation:

Metrics repo: https://github.com/librariesio/metrics

Npms score breakdown: https://api.npms.io/v2/package/redis


andrew commented Jan 9, 2018

The initial pass at implementing SourceRank 2.0 will be focused on realigning how scores are calculated and shown using existing metrics and data that we have, rather than collecting/adding new metrics.

The big pieces are:

  • scale the sourcerank between 0 and 100
  • usage and popularity metrics should only be considered within the package's ecosystem
  • the scores of top-level dependencies of the latest version of a project should be taken into account
  • Breakdown should be separated into three sections, with their own scores (Usage, Quality, Community)
  • Only consider recent releases when looking at historical data (like semver validity)
  • A good project with source hosted on GitLab or Bitbucket should have a similar score to one hosted on GitHub


andrew commented Mar 27, 2018

There are a few places where the sourcerank 1.0 details/score are exposed in the API that need to be kept around for backwards compatibility:

  • ProjectSerializer - the rank field is present wherever project records are serialized for the API
  • sourcerank api endpoint which gives a breakdown of the values that go into sourcerank 1.0 score
  • project search sort by rank option
  • rank columns in the open data release files: projects and projects_with_repository_fields

For the first pass, sourcerank 2.0 is going to focus on just packages published to package managers; we'll save the repository sourcerank update for later, as it doesn't quite fit with the focus on projects.

So the following repo sourcerank details in the api will remain the same:

  • RepositorySerializer - the rank field is present wherever repository records are serialized for the API
  • repo search sort by rank option


andrew commented Mar 27, 2018

Starting to implement bits over here: #2056


andrew commented Mar 27, 2018

One other thing that springs to mind: the first pass of the implementation will focus at the project level and only really consider the latest release.

Then we'll move on to tackle #475 which will store more details at a per-version level, which will then allow us to calculate the SourceRank for each version.


andrew commented Mar 27, 2018

Making good progress on filling in the details of the calculator. The current approach for scoring is to take the average of the different category scores, and for each category to take the average of the individual scores that go into it, with a maximum score of 100.


andrew commented Mar 27, 2018

Example of the current breakdown of implemented scores:

  • overall score: 88.7
    • popularity_score: 80
      • dependent_projects_score: 70
      • dependent_repositories_score: 90
    • community_score: 99
      • contribution_docs_score: 99
    • quality_score: 87
      • basic_info_score: 87
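
Using the figures above, the averaging works out roughly like this (a sketch, not the actual calculator code):

# Sketch of the averaging approach: each category score is the mean of its
# component scores, and the overall score is the mean of the category scores.
def average(scores)
  return 0 if scores.empty?
  scores.sum / scores.length.to_f
end

popularity = average([70, 90]) # dependent_projects, dependent_repositories
community  = average([99])     # contribution_docs
quality    = average([87])     # basic_info
average([popularity, community, quality]).round(1) # => 88.7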


andrew commented Mar 27, 2018

Things to think about soon:

  • How to calculate, store and show the change in score for a given category over time
  • How to visualize what goes into a score and how to improve it

Also, because scores now take into account other projects within an ecosystem, we'll likely want to recalculate lots of scores at the same time in an efficient way (rough sketch after this list), for example:

  • Recalculate every elm module when the most popular elm module jumps in popularity
  • Recalculate the scores for every gem that depends on rails when the score of rails changes significantly
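
A rough sketch of what that cascade could look like (the model and worker names here are illustrative, not the actual code):

# Illustrative sketch: when a project's score changes significantly, queue the
# projects that depend on it for recalculation.
SIGNIFICANT_CHANGE = 5 # placeholder threshold

def requeue_dependents(project, old_score, new_score)
  return if (new_score - old_score).abs < SIGNIFICANT_CHANGE

  project.dependent_project_ids.each do |id| # assumed association
    ProjectScoreWorker.perform_async(id)     # hypothetical background job
  end
end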


andrew commented Mar 28, 2018

Added an actual method to output the breakdown of the score:

{
  "popularity": {
    "dependent_projects": 0,
    "dependent_repositories": 0
  },
  "community": {
    "contribution_docs": {
      "code_of_conduct": false,
      "contributing": false,
      "changelog": false
    }
  },
  "quality": {
    "basic_info": {
      "description": true,
      "homepage": false,
      "repository_url": false,
      "keywords": true,
      "readme": false,
      "license": true
    },
    "status": 100
  }
}
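
One plausible way (not necessarily the exact formula used here) to turn a boolean checklist like basic_info into a 0-100 sub-score is the share of checks that pass:

# Sketch: score a boolean checklist as the percentage of checks that pass.
def checklist_score(checks)
  return 100 if checks.empty?
  (checks.values.count(true) * 100.0 / checks.length).round
end

checklist_score({ description: true, homepage: false, repository_url: false,
                  keywords: true, readme: false, license: true }) # => 50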


andrew commented Mar 29, 2018

Thinking about dependency-related scores, here's my current thinking (rough sketch after this list):

  • one score for the average SourceRank of every top-level runtime dependency
  • if a package has zero dependencies, then 100/100
  • packages with excessive numbers of top-level runtime dependencies should be marked down, independently of the score of each dependency
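
A minimal sketch of those rules (the threshold and penalty below are placeholders, not decided values):

# Sketch of the proposed dependency scoring rules:
#   - zero runtime dependencies => 100
#   - otherwise, the average SourceRank of the top-level runtime dependencies,
#     with an extra penalty for an excessive dependency count.
EXCESSIVE_DEPENDENCIES = 100 # placeholder threshold

def dependencies_score(dependency_scores)
  return 100 if dependency_scores.empty?
  average = dependency_scores.sum / dependency_scores.length.to_f
  penalty = dependency_scores.length > EXCESSIVE_DEPENDENCIES ? 20 : 0 # placeholder penalty
  [(average - penalty).round, 0].max
end

dependencies_score([])           # => 100
dependencies_score([43, 45, 42]) # => 43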

Eventually we should also look at the size and complexity of the package itself, as these rules could otherwise encourage vendoring dependencies to avoid them reducing the score.


andrew commented Mar 29, 2018

First pass at an implementation for the calculator is complete in #2056, going to kick the tires with some data next


andrew commented Mar 29, 2018

Current output for a locally synced copy of Split:

{
  "popularity": {
    "dependent_projects": 0,
    "dependent_repositories": 0,
    "stars": 0,
    "forks": 0,
    "watchers": 0
  },
  "community": {
    "contribution_docs": {
      "code_of_conduct": true,
      "contributing": true,
      "changelog": true
    },
    "recent_releases": 0,
    "brand_new": 100,
    "contributors": 100,
    "maintainers": 50
  },
  "quality": {
    "basic_info": {
      "description": true,
      "homepage": true,
      "repository_url": true,
      "keywords": true,
      "readme": true,
      "license": true
    },
    "status": 100,
    "multiple_versions": 100,
    "semver": 100,
    "stable_release": 100
  },
  "dependencies": {
    "outdated_dependencies": 100,
    "dependencies_count": 0,
    "direct_dependencies": {
      "sinatra": 42.99146412037037,
      "simple-random": 45.08333333333333,
      "redis": 42.58333333333333
    }
  }
}


jab commented Mar 31, 2018

I just found an issue with SourceRank's "Follows SemVer" scoring that is not mentioned here. I'm hoping it can be resolved in the next version of SourceRank, and that this is the right place to bring this up:

The relevant standard that Python packages must adhere to for versioning is defined in PEP 440:
https://packaging.python.org/tutorials/distributing-packages/#standards-compliance-for-interoperability

As that page says, “the recommended versioning scheme is based on Semantic Versioning, but adopts a different approach to handling pre-releases and build metadata”.

Here is an example Python package where all the release versions are both PEP 440- and SemVer-compliant.

But the pre-release versions are PEP 440-compliant only. Otherwise interoperability with Python package managers and other tooling would break.

The current SourceRank calculation gives this package (and many other Python packages like it) 0 points for "Follows SemVer" due to having published pre-releases. (And only counting what was published recently wouldn't fix this.)

Can you please update the SourceRank calculation to take this into account? Perhaps just ignoring the SemVer requirement for pre-release versions of Python packages, which it’s impossible for them to satisfy without breaking interop within the Python ecosystem, would be a relatively simple fix?

Thanks for your consideration and for the great work on libraries.io!


andrew commented Apr 3, 2018

@jab ah, I didn't know that. Yeah, I think that makes sense; rubygems has a similar semver-invalid prerelease format.

I was thinking about adding in some ecosystem specific rules/changes, this is a good one to experiment with.
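
For illustration (a simplified pattern, not the production check), a strict SemVer match rejects the PEP 440-style pre-release strings that Python tooling requires:

# Simplified strict-SemVer pattern, for illustration only.
SEMVER = /\A\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?\z/

"1.0.0-rc.1" =~ SEMVER # => 0   (SemVer-style pre-release, matches)
"1.0.0rc1"   =~ SEMVER # => nil (valid PEP 440 pre-release, not valid SemVer)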


andrew commented Apr 3, 2018

Sourcerank 2.0 things I've been thinking about over the long weekend:

  • We'll need to cache the overall score in a separate field on the projects table to rank on, most likely called sourcerank_2, with a default of 0
  • It would be good to know when the sourcerank was last calculated, to easily find the projects with the most outdated score to update next: most likely a datetime field called sourcerank_2_last_calculated that defaults to null for projects that have never had a score calculated
  • Now that dependency scores can be recursive, it would be good to be able to see how many top-level runtime dependencies a project has, which ends up being "how many top-level runtime dependencies does a version have", which we can then cache onto projects for the latest version. This then allows us to easily find zero-dependency projects, perfect for calculating the sourcerank for first. DB field for runtime dependencies count on Versions #2079
  • To get the timeseries element for sourcerank 2.0 to show improvement over time, we should store the whole breakdown for each project on a regular basis; without keeping a history we won't easily be able to recreate older breakdowns, as the values of dependent repos and dependent projects change over time. Need to ponder this some more, but I'm currently thinking we store it in a separate table and keep a score per project each week.
  • We don't have 100% support across all package managers, so the score needs to be able to skip certain checks rather than giving 0. On a similar note, if a project's source is on Bitbucket there's no concept of stars; in that case it shouldn't be given zero stars, as that would have a big impact on the overall popularity score (see the averaging sketch below).
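
A minimal sketch of what "skip rather than score zero" could look like when averaging:

# Sketch: nil scores (checks we can't measure for this ecosystem/host) are
# left out of the average instead of counting as zero.
def average_ignoring_nils(scores)
  present = scores.compact
  return nil if present.empty?
  present.sum / present.length.to_f
end

average_ignoring_nils([80, nil, 90])             # => 85.0
average_ignoring_nils([0.0, 0.0, nil, nil, nil]) # => 0.0 (stars/forks/watchers skipped)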


chris48s commented Apr 3, 2018

Packagist/composer also defines similar: https://getcomposer.org/doc/04-schema.md#version

It seems like "follows [ecosystem specific version schema]" would be a better criteria than "follows SemVer", but I can see that this could add a lot of additional complexity when you deal with a lot of languages.


andrew commented Apr 3, 2018

I've added runtime_dependencies_count fields to both projects and versions, and I'm running some background jobs to backfill those counts for all package managers where we have both the concept of Versions and Dependencies.

Also highlights the need for #543 sooner rather than later for all the package managers that use Tags rather than Versions.


andrew commented Apr 4, 2018

Most of the runtime_dependencies_count tasks finished overnight; just Python and Node.js are still running.

Versions updated: 11,658,167
Projects updated: 716,828

And just for fun, the average number of runtime dependencies across:

all versions: 3.03
all projects: 1.69

Project data broken down by ecosystem:

"NuGet"=>1.71,
"Haxelib"=>0.65,
"Packagist"=>1.98,
"Homebrew"=>1.16,
"CPAN"=>4.23,
"Atom"=>1.14,
"Dub"=>0.72,
"Elm"=>2.44,
"Puppet"=>1.37,
"Pub"=>2.08,
"Rubygems"=>1.45,
"Cargo"=>0.51,
"Maven"=>0.95,
"Hex"=>1.31,
"NPM"=>2.52,
"CRAN=>0.0

*CRAN doesn't use "runtime" as a dependency kind, so might need to tweak that slightly; "imports" seems like the most appropriate.

Python is still running:

"Pypi"=>0.08


andrew commented Apr 4, 2018

Some other things to note down:

  • we should store the raw numbers used, not just the individual scores, in the breakdown, so we can update the scoring mechanism and recompute old scores easily
  • we should use log scaling rather than linear scaling for popularity metrics (rough sketch after this list)
  • might want to store the sourcerank version with the breakdown to easily find old scores that need updating after an algo change
  • should be able to pass in values for max_dependent_projects, max_dependent_repositories etc when initializing the SourceRankCalculator object to enable faster score generation for multiple projects from the same ecosystem
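
A rough sketch of log scaling a popularity metric against the ecosystem's maximum:

# Sketch: compare a project's count to the largest count in its ecosystem on a
# log scale, producing a 0-100 score that doesn't swamp smaller projects.
def log_scaled_score(value, max_value)
  return nil if max_value.nil? || max_value.zero? # metric unsupported: skip
  return 0 if value.nil? || value.zero?
  (Math.log10(value + 1) / Math.log10(max_value + 1) * 100).round
end

log_scaled_score(500, 40_000)   # => 59
log_scaled_score(5_000, 40_000) # => 80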

Next up I'm going to add the sourcerank_2 and sourcerank_2_last_calculated fields to projects, then experiment with generating some actual rank figures for one ecosystem.


andrew commented Apr 4, 2018

Here's what the breakdown looks like now for rails (on my laptop, missing some data)

{
  "overall_score": 81,
  "popularity": {
    "score": 60.0,
    "dependent_projects": 0.0,
    "dependent_repositories": 0.0,
    "stars": 100.0,
    "forks": 100.0,
    "watchers": 100.0
  },
  "community": {
    "score": 93.33333333333333,
    "contribution_docs": {
      "code_of_conduct": true,
      "contributing": true,
      "changelog": false
    },
    "recent_releases": 100,
    "brand_new": 100,
    "contributors": 100,
    "maintainers": 100
  },
  "quality": {
    "score": 80.0,
    "basic_info": {
      "description": true,
      "homepage": true,
      "repository_url": true,
      "keywords": true,
      "readme": true,
      "license": true
    },
    "status": 100,
    "multiple_versions": 100,
    "semver": 0,
    "stable_release": 100
  },
  "dependencies": {
    "score": 89.66666666666667,
    "outdated_dependencies": 100,
    "dependencies_count": 89,
    "direct_dependencies": {
      "sprockets-rails": 72,
      "railties": 83,
      "bundler": 77,
      "activesupport": 84,
      "activerecord": 75,
      "activemodel": 85,
      "activejob": 83,
      "actionview": 82,
      "actionpack": 82,
      "actionmailer": 81,
      "actioncable": 81
    }
  }
}


andrew commented Apr 4, 2018

More fiddling with local data, tables to compare sourcerank 1 and 2 scores this time.

Top 25 local ruby projects ordered by sourcerank 1:

Name SourceRank 1 SourceRank 2
activesupport 18 86/100
rspec 18 77/100
sinatra 17 80/100
bundler 17 76/100
rubocop 17 81/100
actionpack 17 82/100
rdoc 17 81/100
activemodel 17 85/100
kramdown 16 78/100
activerecord 16 75/100
split 16 74/100
railties 15 83/100
rails 15 81/100
actioncable 15 81/100
test-unit 15 79/100
activejob 15 83/100
actionmailer 15 81/100
concurrent-ruby 15 75/100
pry 15 70/100
simplecov 15 78/100
cucumber 15 76/100
actionview 15 82/100
guard-rspec 14 72/100
json_pure 14 70/100
ffi 14 75/100

Top 25 local ruby projects by both sourcerank 1 and 2

SourceRank 1 SourceRank 2
activesupport (18) activesupport (86)
rspec (18) activemodel (85)
sinatra (17) railties (83)
bundler (17) activejob (83)
rubocop (17) actionpack (82)
actionpack (17) actionview (82)
rdoc (17) rubocop (81)
activemodel (17) rdoc (81)
kramdown (16) actionmailer (81)
activerecord (16) rails (81)
split (16) actioncable (81)
railties (15) sinatra (80)
rails (15) test-unit (79)
actioncable (15) kramdown (78)
test-unit (15) simplecov (78)
activejob (15) rspec (77)
actionmailer (15) redcarpet (77)
concurrent-ruby (15) sprockets (77)
pry (15) arel (77)
simplecov (15) bundler (76)
cucumber (15) cucumber (76)
actionview (15) webmock (76)
guard-rspec (14) thor (76)
json_pure (14) i18n (76)
ffi (14) mini_mime (76)

Same as the first table but with top 50 and github star column included:

Name SourceRank 1 SourceRank 2 Stars
activesupport 18 86/100 39180
rspec 18 77/100 2237
activemodel 17 85/100 39180
actionpack 17 82/100 39180
rdoc 17 81/100 459
rubocop 17 81/100 8875
sinatra 17 80/100 9907
bundler 17 76/100 4152
kramdown 16 78/100 1199
activerecord 16 75/100 39180
split 16 74/100 2105
railties 15 83/100 39180
activejob 15 83/100 39180
actionview 15 82/100 39180
actioncable 15 81/100 39180
rails 15 81/100 39180
actionmailer 15 81/100 39180
test-unit 15 79/100 170
simplecov 15 78/100 3325
cucumber 15 76/100 4974
concurrent-ruby 15 75/100 4020
pry 15 70/100 5100
redcarpet 14 77/100 4180
thor 14 76/100 4012
webmock 14 76/100 2795
ffi 14 75/100 1462
guard-rspec 14 72/100 1122
racc 14 71/100 347
mocha 14 71/100 921
json_pure 14 70/100 495
i18n 13 76/100 696
capybara 13 75/100 8293
rake-compiler 13 73/100 444
rdiscount 13 70/100 764
aruba 13 69/100 803
hoe-bundler 13 66/100 6
RedCloth 13 66/100 430
coderay 13 65/100 704
sprockets 12 77/100 555
liquid 12 75/100 6251
gherkin 12 75/100 249
uglifier 12 74/100 512
slop 12 74/100 886
globalid 12 74/100 591
erubi 12 74/100 198
backports 12 73/100 283
rails-html-sanitizer 12 73/100 143
mime-types 12 72/100 248
mustermann 12 69/100 564
hoe-git 12 68/100 24

All of these are missing a lot of the "popularity" indicators as I just synced a few hundred rubygems locally without all the correct dependent counts.


andrew commented Apr 4, 2018

And at the bottom end of the chart:

Name SourceRank 1 SourceRank 2
gem_plugin 2 43/100
shellany 3 45/100
text-hyphen 3 48/100
method_source 3 48/100
fastthread 3 51/100
text-format 4 36/100
coveralls 4 42/100
shotgun 4 43/100
cucumber-wire 4 43/100
spoon 4 43/100
rack-mount 4 44/100
therubyracer 4 44/100
actionwebservice 4 44/100
rbench 4 44/100
codeclimate-test-reporter 4 44/100
mini_portile 4 46/100
abstract 4 47/100
mongrel 4 48/100
win32console 4 50/100
markaby 4 51/100
cgi_multipart_eof_fix 4 52/100
activestorage 5 47/100
colorize 5 48/100
pry-doc 5 48/100
rest-client 5 48/100
http-cookie 5 48/100
tool 5 48/100
polyglot 5 49/100
jsminc 5 49/100
multi_test 5 50/100
activeresource 5 50/100
guard-compat 5 52/100
mini_portile2 5 52/100
redis 5 52/100
temple 5 52/100
less 5 53/100
eventmachine 5 54/100
coffee-script-source 5 54/100
tilt 5 54/100
erubis 5 54/100
diff-lcs 5 54/100
rubyforge 5 57/100
yard 5 60/100
fakeredis 6 49/100
launchy 6 50/100
slim 6 52/100
simple-random 6 53/100
term-ansicolor 6 53/100
oedipus_lex 6 53/100
haml 6 53/100


andrew commented Apr 4, 2018

Taking text-format, a low-scoring project, as an example of things to possibly change:

{
  "overall_score": 36,
  "popularity": {
    "score": 0.0,
    "dependent_projects": 0.0,
    "dependent_repositories": 0,
    "stars": 0,
    "forks": 0,
    "watchers": 0
  },
  "community": {
    "score": 30.0,
    "contribution_docs": {
      "code_of_conduct": false,
      "contributing": false,
      "changelog": false
    },
    "recent_releases": 0,
    "brand_new": 100,
    "contributors": 0,
    "maintainers": 50
  },
  "quality": {
    "score": 66.66666666666666,
    "basic_info": {
      "description": true,
      "homepage": true,
      "repository_url": false,
      "keywords": false,
      "readme": false,
      "license": false
    },
    "status": 100,
    "multiple_versions": 0,
    "semver": 100,
    "stable_release": 100
  },
  "dependencies": {
    "score": 49.0,
    "outdated_dependencies": 0,
    "dependencies_count": 99,
    "direct_dependencies": {
      "text-hyphen": 48
    }
  }
}
  • no accessible source repo (homepage is pointed at rubyforge which is dead)
  • readme and contribution_docs probably shouldn't be false if no repo is present, should just be nil and skipped
  • contributors should be nil and skipped if repo isn't present
  • stars, forks and watchers should be nil and skipped if repo isn't present

Looking inside the source of the gem, there is:

  • changelog
  • readme
  • Mentions the license as Ruby or Artistic in the readme

We can definitely add support for detecting changelog and readme to the version-level metadata and feed that back in here once complete.


andrew commented Apr 5, 2018

Updated the calculator to not punish projects that aren't on GitHub. Here's the new breakdown for text-format, with the score increased from 36 to 42:

{
  "overall_score": 42,
  "popularity": {
    "score": 0.0,
    "dependent_projects": 0.0,
    "dependent_repositories": 0,
    "stars": null,
    "forks": null,
    "watchers": null
  },
  "community": {
    "score": 50.0,
    "contribution_docs": {
      "code_of_conduct": null,
      "contributing": null,
      "changelog": null
    },
    "recent_releases": 0,
    "brand_new": 100,
    "contributors": null,
    "maintainers": 50
  },
  "quality": {
    "score": 68.0,
    "basic_info": {
      "description": true,
      "homepage": true,
      "repository_url": false,
      "keywords": false,
      "readme": null,
      "license": false
    },
    "status": 100,
    "multiple_versions": 0,
    "semver": 100,
    "stable_release": 100
  },
  "dependencies": {
    "score": 49.0,
    "outdated_dependencies": 0,
    "dependencies_count": 99,
    "direct_dependencies": {
      "text-hyphen": 48
    }
  }
}


andrew commented Apr 5, 2018

Other related areas to think about when it comes to different levels of support for package manager features we have:

  • if we don't support parsing dependencies for that package manager, the dependency score should be skipped
  • if we don't support parsing maintainers for that package manager, the maintainers score should be skipped


andrew commented Apr 5, 2018

  • I wonder if the outdated_dependencies_score should be a percentage based on the number of dependencies that are outdated (quick sketch below)
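
A quick sketch of that idea:

# Sketch: score outdated dependencies as the share that are up to date,
# rather than all-or-nothing.
def outdated_dependencies_score(total, outdated)
  return 100 if total.zero?
  ((total - outdated) * 100.0 / total).round
end

outdated_dependencies_score(10, 3) # => 70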


andrew commented Apr 5, 2018

Probably also want to skip the dependent_* popularity scores if we don't have support for them in that ecosystem; just wondering how many ecosystems that will affect:

Name Dependent Projects Dependent Repos
Alcatraz false false
Atom true true
Bower false true
CPAN true true
CRAN true true
Cargo true true
Carthage false true
Clojars false true
CocoaPods false true
Dub true true
Elm true true
Emacs false false
Go false true
Hackage false true
Haxelib true true
Hex true true
Homebrew true false
Inqlude false false
Julia false true
Maven true true
Meteor false true
npm true true
Nimble false false
NuGet true true
Packagist true true
PlatformIO false false
Pub true true
Puppet true false
PureScript false false
PyPI true true
Racket false false
Rubygems true true
Shards false true
Sublime false false
SwiftPM false true
WordPress false false


andrew commented Apr 5, 2018

Three of the double-false package managers in the table are editor plugins and don't really do dependencies: Alcatraz, Emacs, Sublime.

The others are either smaller, lack support for versions on our side, or don't have a concept of dependencies: Inqlude, Nimble, PlatformIO, PureScript, Racket, WordPress.


andrew commented Apr 5, 2018

Atom is also a little weird here: it depends on npm modules and uses package.json, so it doesn't really have either, but is flagged as having both. Basically, none of the editor plugins really work for the dependent_* scores.

andrew mentioned this issue Apr 5, 2018
@amyeastment

Three (very early) concepts for the popover that explains what the SourceRank 2.0 rating is
[screenshot]

A few known issues we need to tackle:

  • How to present this such that someone understands higher score = better
  • What we want to relay - issues only, or the full status?
  • The branding of SourceRank 2.0 - what it's officially called, the color scheme/branding (not to mention making sure we choose something accessible!)
  • How we elegantly handle having some bits of data for some ecosystems, but not having bits of data for other ecosystems - do we just not show things, or call out we don't have that data in the UI?
  • The "Meta-sourcerank 2.0" when looking at the dependencies for a particular project and how we present that
  • @andrew raised the point that comparing the data across ecosystems is kinda like apples and oranges since different communities have different levels of expected activity/quality/tolerance for things like total dependencies...and I think it highlighted for me that we may want to try as much as possible to get a user to select their ecosystem first before showing them lists of packages where they might be comparing their SourceRank (like in search results or Explore, for example)...just food for thought.


andrew commented Apr 9, 2018

A couple other screenshots of bits I was experimenting with on Friday in this branch: https://github.com/librariesio/libraries.io/tree/sourcerank-view

[screenshot]

[screenshot]


andrew commented Apr 17, 2018

Making progress on Sourcerank 2.0 (now known as Project Score, because trademarks 😅) again. I've merged and deployed #2056 and have calculated the scores for all the Rust packages on Cargo; will report back on the score breakdowns shortly.


andrew commented Apr 17, 2018

Graphing the distribution of sourcerank 1.0 scores of all cargo modules (raw data):

[screenshot]

vs the distribution of sourcerank 2.0 scores of all cargo modules (raw data):

[screenshot]


andrew commented Apr 18, 2018

Similar graphs for Hex, the elixir package manager:

Sourcerank 1.0 distribution:

[screenshot]

Sourcerank 2.0 distribution:

[screenshot]


andrew commented Apr 18, 2018

Projects with low scores are receiving quite a large boost from having zero or very few high scoring dependencies, which made me think that maybe we should skip dependency scores for projects with no dependencies.

But @kszu made a good point in Slack: rake is very highly used and has no dependencies, which is seen as a plus, so skipping the dependencies score for it would lower its score.

We do skip the whole dependency block on a per-ecosystem basis if there's no support for measuring dependencies, but if we do support it, it feels like we should keep the same set of rules for each package within a given ecosystem.


andrew commented Apr 19, 2018

Successfully generated the new scores for all rubygems; the distribution curve is looking pretty good:

[chart: rubygems sourcerank 2 distribution]


andrew commented Apr 23, 2018

I've now implemented a ProjectScoreCalculationBatch class which calculates scores for a number of projects in a single ecosystem and returns the dependent projects on the ones where the scores changed.

This is backed by a set of queues stored in Redis (one for each platform) containing the ids of projects that need a recalculation; the queues slowly empty after each run, which avoids recalculating things over and over.

Overall calculating the score for a number of projects in an ecosystem is much faster than sourcerank 1.0, mostly because the ProjectScoreCalculationBatch preloads much of the information required from the database.

Projects from Rubygems, Cargo and Hex are automatically being queued for recalculation after being saved; will be enabling more platforms once the initial scores have been calculated.
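
Roughly, a per-platform queue looks like this (class and key names are illustrative, not the exact implementation):

require 'redis'

# Illustrative sketch of a per-platform recalculation queue backed by a Redis list.
class ScoreQueue
  def initialize(platform, redis: Redis.new)
    @key   = "project_score_queue:#{platform.downcase}"
    @redis = redis
  end

  # Queue project ids for recalculation.
  def push(project_ids)
    @redis.rpush(@key, project_ids) unless project_ids.empty?
  end

  # Pull a batch of ids off the queue for the next calculation run.
  def pop_batch(limit = 100)
    ids = []
    while ids.length < limit && (id = @redis.lpop(@key))
      ids << id
    end
    ids
  end
end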


andrew commented Apr 23, 2018

It's now enabled on: alcatraz, atom, cargo, carthage, dub, elm, emacs, haxelib, hex, homebrew, inqlude, julia, nimble, pub, purescript, racket, rubygems, sublime, swiftpm

ProjectScoreCalculationBatch.run_all is being run on a cron job every 10 mins; if all goes well overnight, we'll do the initial score calculations for some more of the larger platforms.

Next steps:

  • start storing the score breakdowns in the database
  • add a page that shows the most recent breakdown
  • index score into elasticsearch


andrew commented Apr 24, 2018

Project scores are now being calculated for all platforms; the backlog is pretty long and will likely take around 24 hours to work through all the calculations.


andrew commented Apr 24, 2018

Adding a basic score overview page:

[screenshot]


andrew commented Apr 30, 2018

Going to work on improving the explanations around each element of the score breakdown page, as well as including the raw data that goes into the breakdown object (number of stars, contributor count, etc). That way it can be stored in the database without requiring the calculator to load data on demand, and we'll be able to show historic changes for each element.


gpotter2 commented Jun 8, 2018

Hello,

I have a question about sourcerank: what is the point of using GitHub stars?

I don't really understand what sourcerank is aiming to do: half of its elements are about the project's health and how close it is to standards (basic info, readme, license...), whereas the other half is about the project's popularity (contributors, stars and dependents).

If sourcerank were aiming to show how healthy a package is, the "popularity" information would be useless, and its health requirements should be even stricter. On the other hand, if the goal is to compare it with its forks, clones or similar projects, then the impact of the "popularity" information should be greater.

For instance, imagine two similar projects doing the same thing, both perfectly well configured. They will mostly have the very same sourcerank.
Even if one has about 3000 stars and the other only 400, they will have the very same "github stars" impact (log(400) ≈ 2.6, rounded to 3; log(3000) ≈ 3.5, rounded to 3). Maybe the use of logarithms is too strong.

Some projects are one-person projects, updated once in a while; sourcerank does not reflect the project's activity.
GitHub's Pulse system is super effective: maybe sourcerank could also be based on the activity of the project (number of issues / PRs / releases per week/month/year/whatever), what GitHub calls "code frequency". Maybe it could also register the number of forks, or whether the project has a Read the Docs page...

Is it planned for the 2.0 version to fix those issues? I've seen great improvements above, and was also wondering whether the way this data is calculated has changed drastically or not.

I really enjoy libraries.io, and thank you all for your amazing work !

Regards,


filips123 commented Nov 21, 2018

One more problem (I think it has already been posted here) is that many projects have a 0 SemVer score because they didn't follow SemVer in their early releases. How do you plan to fix that?

Also, there could be a problem with outdated dependencies. Maybe you could only count this if more than one dependency is outdated, or weight it depending on the release type (major, minor, patch).

There are also some problems with the "not brand new" check: sometimes the score for this is 0 because the project has changed its name or repository.

When do you plan to release SourceRank 2.0 to main website?
