Systematic considerations for generating initial ratings #30

cytusine0 · 2024-02-29T20:11:16Z

cytusine0
Feb 29, 2024
Collaborator

Starting a discussion for this to keep some better records as we build this in over the next few weeks. Thoughts very welcome!

Overall, the point of starting people out at different ratings is that a tournament player's starting rank is significant prior knowledge that should be taken into account when we have a Bayesian rating algorithm. While everyone begins with some rating volatility, it is best to assume that a generic rank 1000 player beats a generic rank 100000 player a significant fraction of the time to make the "mixing of ratings" faster.

At the current moment, we only have a formula coded in to fit osu!std. Ignoring the multiplier scaling, players start off with a standard deviation (volatility) of 5 and a rating of 45 - 3.2 ln(rank), clipped from below and above at 5 and 30. The reasons for this choice are the following:

At one point in data collection, we found that the distribution of players in our data had a median around 12000, and in fact ln(rank) was essentially normally distributed. Thus a formula like this would give a starting distribution of ratings which is close to bell-shaped. Furthermore, the Plackett-Luce model documentation recommends an initialization of sigma = mu/3, and indeed a rank 12000 player starts at a rating around 15.
The clipping of ratings does two things. It ensures that players with very low ranks (currently worse than ~250k) are all treated as essentially the same "beginners" in rating, which makes sense especially because showing up in rating already requires qualification in some accepted tournament and thus skill is not unbounded from below. And it also ensures that very top players (currently better than rank ~100) are not given an additional boost, so that climbing in rating leaderboards requires performing well against other players and not just having a high starting rank.
Beyond this, the only free parameter to vary is the "spread" of ratings across rank ranges. From what I remember, this was fairly arbitrary -- we chose some numbers that seemed okay, and then we saw whether the subsequent spread of ratings for players with some starting rating (say between 20 and 22) looked good after processing data. (If our spread had been too wide, we would have seen high initial ratings drop quickly and low initial ratings climb significantly because we were too overconfident in high-ranked players' performance.)

However, there are a few problems we'll want to address as we continue thinking about this:

The current formula is reliant on some self-reference to the data (choosing the median rank in step 1 and shifting tournament performances perhaps due to changing "deranking meta" in step 3). On its own, this is definitely not a bad thing -- we need to tune our model to the situation we want to use it for -- but those numbers can change. If we suddenly have hundreds of new 100k-range tournaments pop up, that would mean many players entering at a low default rating. Would we want to code in a change to account for this, perhaps adjusting the initial ratings everyone is assigned every 6 months? Or would it be more sensible to stick to a fixed formula so that the actual rating numbers consistently mean something?
Initial ratings are based on osu!track's historical rank data (and not taking BWS into account, since a player who just starts playing tournaments has no badges). However, being rank 100 in 2019 is very different from being rank 100 today, so we are starting old players with higher ratings than new players of the same skill. Rating decay means the effects of this aren't so noticeable on leaderboards now, but are there any systematic ways we should account for this comparison? One idea suggested was to use historical pp numbers, but this may be a little iffy due to reworks.
Does this general "3-step process" make sense for other gamemodes as well? That is, is it sensible to look at the median rank of all catch players in the database and use that as a centering point for initial ratings, and then (assuming we have enough data to do so) run rating calculations to see what a good spread by rank would be?

Any thoughts on any of these points would be appreciated -- I'm making this a single discussion point because all of these questions are sort of connected and answers to any of them would be useful for others.

cytusine0 · 2024-02-29T21:07:37Z

cytusine0
Feb 29, 2024
Collaborator Author

I should also clarify: if you want a sense of what these numbers actually "mean" (in reference to something like chess), the current multiplier we're using for all ratings is 45. That is, a high 5 digit would start off with a rating around 675, and all players start off with a rating between 225 and 1350. With our current data, that puts the very top players at a rating around 3000 after all matches are processed.

0 replies

m1ntleaf · 2024-02-29T21:21:26Z

m1ntleaf
Feb 29, 2024

If you could clarify, how would you incorporate osutrack data, and what about players who do not have any data on the website?

Also, I think that adjusting ratings can be done, but at a less frequent rate, such as yearly. Admittedly, I'm not too experienced in rating systems such as this one, but I don't think ratings will deviate far enough from the median. Maybe use a metric to automatically adjust the rating according to percentile? Although, that could be flawed if, again, the playerbase changes too rapidly.

4 replies

hburn7 Feb 29, 2024
Maintainer

how would you incorporate osutrack data

Initial ratings are based on osu!track's historical rank data

what about players who do not have any data on the website

We currently use the current rank from the osu! API for the ruleset of interest.

cytusine0 Feb 29, 2024
Collaborator Author

Good questions! Currently, we look at the earliest match in our database for each player and take the closest-known osutrack rank point to that, which is our best guess of "when this person started playing tournaments, they were rank X". If there's no data on osutrack at all, we're just defaulting to their current global rank. It's possible we need to think about things like "using the average of the opponents in the first match" or "just defauling to the standard value of 15" instead, but it's not so clear yet.

On the second point, yes it could be a good idea to do things automatically by percentile -- the more iffy thought to keep in mind is if o!TR gets used for screening, and then someone who was above 1500 rating suddenly might drop below it because we have to do a recalculation of the median or something. Suddenly dropping that on people might be a bit weird for the screening process, even if we have a predefined time for that to occur.

m1ntleaf Feb 29, 2024

Thank you for the response! My thought is to implement a separate Elitebot-like system that estimates star rating comfort, so that tournament organizers can set that as a benchmark. Perhaps this could be similar or correlated with the rating value, or not.

cytusine0 Mar 15, 2024
Collaborator Author

Elitebotix-like star rating measurements do exist in ETX and SIP already and I think they do have a reasonable meaning, but I think the philosophy we're taking here is specifically to have a system which doesn't need to take difficulty into account (because it cares about performance, not skill). I do think we want to encourage tournament organizers to compare results across systems, and I do know some of the o!TR team members are excited about adding ratings for specific mods and otherwise increasing the amount of raw statistics that we have. But personally I'm not really sure how much we want to build that into the fundamental concept for now.

SourMongoose · 2024-02-29T23:21:39Z

SourMongoose
Feb 29, 2024
Collaborator

I think the range of 225-1350 makes sense for starting ratings.

Ideally a change in initial rating should not affect someone's rating after multiple tournaments - it might be worth ignoring osutrack data completely just to stay consistent across all players.

5 replies

cytusine0 Mar 14, 2024
Collaborator Author

Can you clarify what you mean by "change in initial rating should not affect someone's rating after multiple tournaments?" The osutrack data from the past is already established and won't change, and the main cases where we'd want to use the historical data are when e.g. top 1000 players started out in their career playing 5 digit tournaments years earlier. (We wouldn't want a situation where it is very easy to farm rating off of them if they were in fact at a 5 digit level at the time.) It's true for consistency that there might be some players who don't have this historical data, but if I remember correctly that's actually relatively rare. @hburn7 do we have any numbers on that / do you have a rough sense of how much osutrack data tends to be available?

SourMongoose Mar 15, 2024
Collaborator

Basically just mean that someone starting at, say 600 vs 800 shouldn't affect their final rating too much after inputting a bunch of tournament results.

cytusine0 Mar 15, 2024
Collaborator Author

Ah, yes, I think this is true. But that does make me feel more okay with using osutrack data even if it's missing for some people -- it means that we're making the best attempt that we can at reflecting everyone's starting ability and it won't seriously affect any global leaderboards / eventual ratings in the long run, but it cuts down weird historical rating changes to the extent that we can.

hburn7 Mar 15, 2024
Maintainer

@cytusine0 I don't have any exact numbers, but it seems highly reliable. I have nothing but glowing remarks for the osu!track dev, he will maintain it for as long as he can. As long as osutrack is available, it seems like a no-brainer to keep using it.

hburn7 Mar 15, 2024
Maintainer

This query returns 0 results, so yea - seems to be highly available: select * from players where earliest_osu_global_rank is null and rank_standard is not null

cytusine0 · 2024-03-15T01:27:59Z

cytusine0
Mar 15, 2024
Collaborator Author

Okay, here's one piece of systematic data so we can point to it later. Here are the rough star ratings for the Quarterfinals pools in each of the rank ranges' world cups (which I think is an okay benchmark for the typical difficulty of the rank range at the time -- choosing QF here because that's near the "average" pool that people would play):

Year	Open	3 Digit	4 Digit	5 Digit	6 Digit
2018	6.5
2019	6.8	6.6
2020	6.8	6.7	6.6	5.9
2021	6.8	6.8		6.0	4.9
2022	7.0	7.0	6.6	6.15	5.0
2023	7.2	6.8	6.9	6.3	5.1

These numbers are stratified enough that I think we don't actually need any further correction to "old player initial ratings" other than the decay that we're already doing (though I think how exactly decay is calculated could still use some discussion later).

If anyone thinks there's a problem with how I'm thinking about this / wants to gather their own data, feel free to add it in a reply here!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Systematic considerations for generating initial ratings #30

{{title}}

Replies: 4 comments 9 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Systematic considerations for generating initial ratings #30

cytusine0 Feb 29, 2024 Collaborator

Replies: 4 comments · 9 replies

cytusine0 Feb 29, 2024 Collaborator Author

m1ntleaf Feb 29, 2024

hburn7 Feb 29, 2024 Maintainer

cytusine0 Feb 29, 2024 Collaborator Author

m1ntleaf Feb 29, 2024

cytusine0 Mar 15, 2024 Collaborator Author

SourMongoose Feb 29, 2024 Collaborator

cytusine0 Mar 14, 2024 Collaborator Author

SourMongoose Mar 15, 2024 Collaborator

cytusine0 Mar 15, 2024 Collaborator Author

hburn7 Mar 15, 2024 Maintainer

hburn7 Mar 15, 2024 Maintainer

cytusine0 Mar 15, 2024 Collaborator Author

cytusine0
Feb 29, 2024
Collaborator

Replies: 4 comments 9 replies

cytusine0
Feb 29, 2024
Collaborator Author

m1ntleaf
Feb 29, 2024

hburn7 Feb 29, 2024
Maintainer

cytusine0 Feb 29, 2024
Collaborator Author

cytusine0 Mar 15, 2024
Collaborator Author

SourMongoose
Feb 29, 2024
Collaborator

cytusine0 Mar 14, 2024
Collaborator Author

SourMongoose Mar 15, 2024
Collaborator

cytusine0 Mar 15, 2024
Collaborator Author

hburn7 Mar 15, 2024
Maintainer

hburn7 Mar 15, 2024
Maintainer

cytusine0
Mar 15, 2024
Collaborator Author