Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MBS-13801: Remove limits on tag cloud subqueries #3400

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

reosarevok
Copy link
Member

@reosarevok reosarevok commented Oct 31, 2024

Fix MBS-13801

Problem

The counts on the tag list / cloud do not adequately represent reality, except for the top genres. That's because each subquery is limited, so any genres not on the top 200 and tags not on the top 50 for any specific entity are just ignored entirely for that entity.

These limits were introduced in ace41c6 under the assumption exact numbers weren't too important because we didn't show them, and the results were reasonable enough.

The first was already a wrong assumption at the time, since we did show the counts on hover in the cloud page. It's even worse now, since we have a list version with counts front and center.

The second assumption seems also a bit off on testing without the limits; it brings a few entirely new tags to the list, since the current query undercounts tags that are used a bit on many entity types in comparison with tags used a lot on one entity type only.

Solution

This just removes the per-subquery limits, leaving only a limit on the full query to return the top 200 genres and 50 tags. The original worry why these limits were added (a slow query that got a second or so faster with them) still applies, of course. That said, this query runs once, and then is cached for 24h. It seems acceptable to have one person a day wait a bit longer for their results in exchange for actually correct results when the count differences are in the thousands on the lower ranks of the lists (the top is of course used everywhere and already has the right counts mostly).

Testing

Manually, checking that the lower ranks of the genres list and especially a big chunk of the tags list change significantly for the better without the limits (the "music video" tag, which was missing and led to the report, was on position 51 before - it jumps around 2k uses and 12 positions with the unabridged data).

@reosarevok reosarevok added the Bug Bugs that should be checked/fixed soonish label Oct 31, 2024
@mwiencek
Copy link
Member

Makes sense to me. I tested the query on hendrix before and after the patch, and the difference was more like 200ms than 1s, so I'm guessing the PG v12 -> v16 upgrade improved things quite a bit.

The commit message seems slightly exaggerated to me though. 😄

The second assumption seems also dodgy on testing without the limits; it brings a bunch of entirely new tags to the list, since the current query severely undercounts tags that are used a bit on many entity types in comparison with tags used a lot on one entity type only.

When ace41c6 was committed, we only had the tag cloud (no list, and no genres either). Any tags that would've been added/removed from the cloud due to the 200-limit would've been in the smallest font size possible: if you compared them side-by-side it'd be difficult to even figure out all the ones that changed. So "the results were reasonable enough" doesn't seem that dodgy to me at the time the commit was made (ignoring the inaccurate counts on hover, which does seem like an oversight).

(BTW, prior to that commit, the tag cloud didn't update at all, and was displaying data that was many years old.)

And I was curious what the "bunch of entirely new tags" are, and it's five genres and two tags that would be added/removed. It is a bunch, but since the current query "severely undercounts," it sounded like things would be much worse. 😄

Genre differences:

old - new:

     name      
---------------
 horror punk
 harsh noise
 future funk
 acid jazz
 dungeon synth
(5 rows)

new - old:

      name       
-----------------
 alternative pop
 grime
 neo soul
 hard trance
 spoken word
(5 rows)

Tag differences:

old - new:

         name         
----------------------
 tribal
 alternative and punk
(2 rows)

new - old:

  name  
--------
 80s
 garage
(2 rows)

@reosarevok
Copy link
Member Author

Huh. It certainly brought more tags in earlier, such as "music video" :) But I see that one grew enough now that it shows up in the current list.

Any tags that would've been added/removed from the cloud due to the 200-limit would've been in the smallest font size possible: if you compared them side-by-side it'd be difficult to even figure out all the ones that changed

Sure, my point was that even then we showed the exact counts on hover, so it wasn't just size differences :) That's what I meant with the assumption that we don't show counts being wrong.

I'm not trying to say that commit wasn't an improvement over the broken tag cloud, please don't take it as such! :) If the commit message seems over the top to you I can always reword it.

@mwiencek
Copy link
Member

Huh. It certainly brought more tags in earlier, such as "music video" :) But I see that one grew enough now that it shows up in the current list.

Yeah, I noticed that one shifted, and things may have been worse before given the right conditions.

Sure, my point was that even then we showed the exact counts on hover, so it wasn't just size differences :) That's what I meant with the assumption that we don't show counts being wrong.

I agree, but that was the first wrong assumption ("exact numbers weren't too important because we didn't show them"). I was just talking about the second one ("results were reasonable enough"):

The second assumption seems also dodgy on testing without the limits; it brings a bunch of entirely new tags to the list, since the current query severely undercounts tags that are used a bit on many entity types in comparison with tags used a lot on one entity type only.

The old commit was basically saying it gave reasonably accurate results for the purposes of displaying a tag cloud, which I still agree with. :) My only complaint with the wording above is that the sequence of "dodgy, bunch of entirely new tags, severely undercounts" make it sound like the cloud will be vastly improved, but in reality it's hard to notice unless you're looking at the list UI and the counts, which I think is what you meant.

These were introduced in ace41c6
under the assumption exact numbers weren't too important
because we didn't show them, and the results were reasonable enough.

The first was already a wrong assumption at the time, since we
did show the counts on hover in the cloud page. It's even worse now,
since we have a list version with counts front and center.

The second assumption seems also a bit off on testing without the limits;
it brings a few entirely new tags to the list, since the current query
undercounts tags that are used a bit on many entity types
in comparison with tags used a lot on one entity type only.

The original worry why these limits were added (a slow query that got
a second or so faster with them) still applies, of course. That said, this
query runs once, and then is cached for 24h. It seems acceptable to have
one person a day wait a bit longer for their results in exchange for
actually correct results when the count differences are in the
thousands on the lower ranks of the lists (the top is of course used
everywhere and already has the right counts mostly).
@reosarevok
Copy link
Member Author

I reworded the commit message to be more chill about it, since it was a bit alarmist in hindsight 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Bugs that should be checked/fixed soonish
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants