MBS-13801: Remove limits on tag cloud subqueries #3400

reosarevok · 2024-10-31T08:33:06Z

Fix MBS-13801

Problem

The counts on the tag list / cloud do not adequately represent reality, except for the top genres. That's because each subquery is limited, so any genres not on the top 200 and tags not on the top 50 for any specific entity are just ignored entirely for that entity.

These limits were introduced in ace41c6 under the assumption exact numbers weren't too important because we didn't show them, and the results were reasonable enough.

The first was already a wrong assumption at the time, since we did show the counts on hover in the cloud page. It's even worse now, since we have a list version with counts front and center.

The second assumption seems also a bit off on testing without the limits; it brings a few entirely new tags to the list, since the current query undercounts tags that are used a bit on many entity types in comparison with tags used a lot on one entity type only.

Solution

This just removes the per-subquery limits, leaving only a limit on the full query to return the top 200 genres and 50 tags. The original worry why these limits were added (a slow query that got a second or so faster with them) still applies, of course. That said, this query runs once, and then is cached for 24h. It seems acceptable to have one person a day wait a bit longer for their results in exchange for actually correct results when the count differences are in the thousands on the lower ranks of the lists (the top is of course used everywhere and already has the right counts mostly).

Testing

Manually, checking that the lower ranks of the genres list and especially a big chunk of the tags list change significantly for the better without the limits (the "music video" tag, which was missing and led to the report, was on position 51 before - it jumps around 2k uses and 12 positions with the unabridged data).

mwiencek · 2024-11-14T05:11:10Z

Makes sense to me. I tested the query on hendrix before and after the patch, and the difference was more like 200ms than 1s, so I'm guessing the PG v12 -> v16 upgrade improved things quite a bit.

The commit message seems slightly exaggerated to me though. 😄

The second assumption seems also dodgy on testing without the limits; it brings a bunch of entirely new tags to the list, since the current query severely undercounts tags that are used a bit on many entity types in comparison with tags used a lot on one entity type only.

When ace41c6 was committed, we only had the tag cloud (no list, and no genres either). Any tags that would've been added/removed from the cloud due to the 200-limit would've been in the smallest font size possible: if you compared them side-by-side it'd be difficult to even figure out all the ones that changed. So "the results were reasonable enough" doesn't seem that dodgy to me at the time the commit was made (ignoring the inaccurate counts on hover, which does seem like an oversight).

(BTW, prior to that commit, the tag cloud didn't update at all, and was displaying data that was many years old.)

And I was curious what the "bunch of entirely new tags" are, and it's five genres and two tags that would be added/removed. It is a bunch, but since the current query "severely undercounts," it sounded like things would be much worse. 😄

Genre differences:

old - new:

     name      
---------------
 horror punk
 harsh noise
 future funk
 acid jazz
 dungeon synth
(5 rows)

new - old:

      name       
-----------------
 alternative pop
 grime
 neo soul
 hard trance
 spoken word
(5 rows)

Tag differences:

old - new:

         name         
----------------------
 tribal
 alternative and punk
(2 rows)

new - old:

  name  
--------
 80s
 garage
(2 rows)

reosarevok · 2024-11-14T08:12:31Z

Huh. It certainly brought more tags in earlier, such as "music video" :) But I see that one grew enough now that it shows up in the current list.

Any tags that would've been added/removed from the cloud due to the 200-limit would've been in the smallest font size possible: if you compared them side-by-side it'd be difficult to even figure out all the ones that changed

Sure, my point was that even then we showed the exact counts on hover, so it wasn't just size differences :) That's what I meant with the assumption that we don't show counts being wrong.

I'm not trying to say that commit wasn't an improvement over the broken tag cloud, please don't take it as such! :) If the commit message seems over the top to you I can always reword it.

mwiencek · 2024-11-14T16:54:02Z

Huh. It certainly brought more tags in earlier, such as "music video" :) But I see that one grew enough now that it shows up in the current list.

Yeah, I noticed that one shifted, and things may have been worse before given the right conditions.

Sure, my point was that even then we showed the exact counts on hover, so it wasn't just size differences :) That's what I meant with the assumption that we don't show counts being wrong.

I agree, but that was the first wrong assumption ("exact numbers weren't too important because we didn't show them"). I was just talking about the second one ("results were reasonable enough"):

The second assumption seems also dodgy on testing without the limits; it brings a bunch of entirely new tags to the list, since the current query severely undercounts tags that are used a bit on many entity types in comparison with tags used a lot on one entity type only.

The old commit was basically saying it gave reasonably accurate results for the purposes of displaying a tag cloud, which I still agree with. :) My only complaint with the wording above is that the sequence of "dodgy, bunch of entirely new tags, severely undercounts" make it sound like the cloud will be vastly improved, but in reality it's hard to notice unless you're looking at the list UI and the counts, which I think is what you meant.

These were introduced in ace41c6 under the assumption exact numbers weren't too important because we didn't show them, and the results were reasonable enough. The first was already a wrong assumption at the time, since we did show the counts on hover in the cloud page. It's even worse now, since we have a list version with counts front and center. The second assumption seems also a bit off on testing without the limits; it brings a few entirely new tags to the list, since the current query undercounts tags that are used a bit on many entity types in comparison with tags used a lot on one entity type only. The original worry why these limits were added (a slow query that got a second or so faster with them) still applies, of course. That said, this query runs once, and then is cached for 24h. It seems acceptable to have one person a day wait a bit longer for their results in exchange for actually correct results when the count differences are in the thousands on the lower ranks of the lists (the top is of course used everywhere and already has the right counts mostly).

reosarevok · 2024-11-15T09:59:36Z

I reworded the commit message to be more chill about it, since it was a bit alarmist in hindsight 😅

reosarevok added the Bug Bugs that should be checked/fixed soonish label Oct 31, 2024

reosarevok force-pushed the MBS-13801 branch from 7ab39b3 to 81b2a59 Compare November 14, 2024 18:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MBS-13801: Remove limits on tag cloud subqueries #3400

MBS-13801: Remove limits on tag cloud subqueries #3400

reosarevok commented Oct 31, 2024 •

edited

Loading

mwiencek commented Nov 14, 2024

reosarevok commented Nov 14, 2024

mwiencek commented Nov 14, 2024

reosarevok commented Nov 15, 2024

MBS-13801: Remove limits on tag cloud subqueries #3400

Are you sure you want to change the base?

MBS-13801: Remove limits on tag cloud subqueries #3400

Conversation

reosarevok commented Oct 31, 2024 • edited Loading

Fix MBS-13801

Problem

Solution

Testing

mwiencek commented Nov 14, 2024

reosarevok commented Nov 14, 2024

mwiencek commented Nov 14, 2024

reosarevok commented Nov 15, 2024

reosarevok commented Oct 31, 2024 •

edited

Loading