Explainable discrepancy in tag stats CSV download counts and tab labels on /tag/air-quality #8642

Open
jywarren opened this issue Oct 20, 2020 · 2 comments
Labels
documentation feature lacks proper documentation or needs more documents Epic


jywarren commented Oct 20, 2020

Jeanette from the PL staff noted a discrepancy - when downloading a CSV and summing notes, questions, and wikis, the totals Jeanette got are:

From /stats: notes = 206; questions = 97; wikis = 42

However this was for a range of: https://publiclab.org/tag/air-quality/stats?utf8=%E2%9C%93&start=01-01-2010&end=14-10-2020&commit=Go

These don't match the tab totals shown at https://publiclab.org/tag/air-quality, of:

247 notes, 140 questions, 53 wikis (note one more question was shown since Jeanette's screenshot)


Exact discrepancy

A full date range CSV I got showed:

303 notes | 97 questions | 42 wikis

That means we are showing discrepancies of:

-56 notes | 42 questions | 11 wikis (each figure is how many MORE the /tag page shows than the stats CSV; a negative number means the CSV shows more)
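
The differences above are just the /tag page totals minus the CSV totals. A minimal sketch of that arithmetic, assuming 139 questions (the value from the screenshot, before the extra question appeared); the hash names are illustrative only:

```ruby
# Totals shown on the /tag/air-quality tabs (139 questions, per the screenshot).
tag_page = { notes: 247, questions: 139, wikis: 53 }
# Totals summed from the full-date-range stats CSV.
stats_csv = { notes: 303, questions: 97, wikis: 42 }

# Positive means the /tag page shows MORE than the CSV; negative means fewer.
discrepancy = tag_page.map { |type, count| [type, count - stats_csv[type]] }.to_h
puts discrepancy
```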

Known sources of discrepancy

First, some of the questions counted on the /tag page are notes tagged with question:air-quality that lack the air-quality tag itself. This accounts for some or all of the 139 - 97 = 42 questions discrepancy (139 being the count in Jeanette's screenshot).

Second, the stats pages do not count notes, questions, or wikis whose tags have air-quality as a parent tag (a system we are trying to phase out). The last line of this section of code shows those extra nodes getting included for the /tag/air-quality page.

I was able to find 61 notes and 11 wikis that bear a child tag of air-quality, which has affected this count. That seems to account for the wikis discrepancy.

irb(main):034:0> Node.where(status: 1, type: 'note').includes(:revision, :tag).references(:term_data, :node_revisions).where('term_data.name = ?', 'air-quality').size
=> 304
irb(main):035:0> Node.where(status: 1, type: 'note').includes(:revision, :tag).references(:term_data, :node_revisions).where('term_data.name = ? OR term_data.parent = ?', 'air-quality', 'air-quality').size
=> 365

After accounting for those 61 extra notes, the CSV actually shows 61 + 56 = 117 notes that are not shown on the /tag page.

But, according to these lines, we exclude all questions of any kind from this note count. Let's see how that affects the count:

irb(main):035:0> Node.where(status: 1, type: 'note').includes(:revision, :tag).references(:term_data, :node_revisions).where('term_data.name = ? OR term_data.parent = ?', 'air-quality', 'air-quality').where('node.nid NOT IN (?)', @qids).size
=> 247

So, that took us from 365 to 247, if we are including parent tags. That's the number shown on /tag/air-quality.

Without counting parent tags or questions, we get 206 notes, versus 303 in the CSV.

Let's look at where the CSV is being compiled:

plots2/app/models/tag.rb

Lines 216 to 239 in 27a3839

def contribution_graph_making(type = 'note', start = Time.now - 1.year, fin = Time.now)
  weeks = {}
  week = span(start, fin)
  while week >= 1
    # initialising month variable with the month of the starting day
    # of the week
    month = (fin - (week * 7 - 1).days)
    # Now fetching the weekly data of notes or wikis
    current_week =
      Tag.nodes_for_period(
        type,
        nids,
        (fin.to_i - week.weeks.to_i).to_s,
        (fin.to_i - (week - 1).weeks.to_i).to_s
      ).size
    weeks[(month.to_f * 1000)] = current_week
    week -= 1
  end
  weeks
end

This is a little convoluted, but I traced through it and it seems OK.
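
To make the week arithmetic easier to follow, here is a rough standalone sketch of the time windows that loop appears to pass to Tag.nodes_for_period. It uses plain Ruby rather than ActiveSupport; weekly_windows and SECONDS_PER_WEEK are hypothetical names, and week_count stands in for whatever span(start, fin) returns, which is not shown in this thread:

```ruby
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

# Returns [window_start, window_end] pairs in epoch seconds, oldest first,
# mirroring the (fin - week.weeks, fin - (week - 1).weeks) bounds in the loop.
# week_count is an assumption standing in for span(start, fin).
def weekly_windows(fin, week_count)
  week_count.downto(1).map do |week|
    [fin.to_i - week * SECONDS_PER_WEEK, fin.to_i - (week - 1) * SECONDS_PER_WEEK]
  end
end

fin = Time.utc(2020, 10, 14)
windows = weekly_windows(fin, 3)
windows.each { |from, to| puts "#{Time.at(from).utc} .. #{Time.at(to).utc}" }
```

Each window ends exactly where the next one begins, and the last one ends at fin. Whether a node created exactly on a boundary falls into one window or two depends on how nodes_for_period compares timestamps, which is not covered here.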

Running Tag.nodes_for_period() on the whole 10 year span returned 248, which is only 1 off:

irb(main):051:0> Tag.nodes_for_period('note',nids,(Time.now - 10.year).to_i, Time.now.to_i).size
=> 248

That's for the same nids collection as we got for the tags page - with parent tags, and excluding questions. Let's try running it without the parent tags, but leaving the questions in...

irb(main):060:0> Tag.nodes_for_period('note',nids,(Time.now - 10.year).to_i, Time.now.to_i).size
=> 305

OK, so the discrepancy seems to be (within an error of 2 notes) that the stats are excluding parent tags and including questions.
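
As an illustration of how those two counting rules diverge, here is a small self-contained sketch. It does not use the real plots2 models: SketchNode, PARENTS, the example-child-tag name, and both counting methods are hypothetical stand-ins for the behavior described above.

```ruby
# Hypothetical, simplified stand-in for a node: id, tag names, question flag.
SketchNode = Struct.new(:nid, :tags, :question)

# Assumed parent relationships for the sketch: child tag name => parent tag name.
PARENTS = { 'example-child-tag' => 'air-quality' }.freeze

def tagged?(node, tag, include_children:)
  node.tags.any? do |name|
    name == tag || (include_children && PARENTS[name] == tag)
  end
end

# /tag page rule as described above: base tag OR child tags, questions excluded.
def tag_page_note_count(nodes, tag)
  nodes.count { |n| tagged?(n, tag, include_children: true) && !n.question }
end

# Stats CSV rule as described above: base tag only, questions included.
def stats_note_count(nodes, tag)
  nodes.count { |n| tagged?(n, tag, include_children: false) }
end

nodes = [
  SketchNode.new(1, ['air-quality'], false),       # counted by both rules
  SketchNode.new(2, ['air-quality'], true),        # question: stats only
  SketchNode.new(3, ['air-quality'], true),        # question: stats only
  SketchNode.new(4, ['example-child-tag'], false), # child tag: /tag page only
  SketchNode.new(5, ['water-quality'], false)      # unrelated: neither
]

puts "tag page: #{tag_page_note_count(nodes, 'air-quality')}"
puts "stats CSV: #{stats_note_count(nodes, 'air-quality')}"
```

With this toy data the two rules disagree even though both are applied to the same nodes, which is the same shape of mismatch as 247 vs. 303 above.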


Takeaway

I believe this means that we don't need to change any queries, but we should add some of these caveats to the stats pages for those wondering. I can make a first-timers-only (FTO) issue once we settle on explanatory text!

Linking this thread to this explanation of questions counts on tag pages: #8246

@jywarren jywarren added this to the Metrics milestone Oct 20, 2020
@jywarren jywarren changed the title Discrepancy in tag stats CSV download counts and tab labels on /tag/air-quality Explainable discrepancy in tag stats CSV download counts and tab labels on /tag/air-quality Oct 20, 2020
@jywarren jywarren added the documentation feature lacks proper documentation or needs more documents label Oct 20, 2020

jywarren commented Oct 20, 2020

The explanatory text currently says:

The graphs above are stacked, and questions are counted both on their own as well as part of the tally for notes (because they are a form of note).

So the text could be expanded to:

The graphs above are stacked, and questions are counted both on their own as well as part of the tally for notes (because they are a form of note). Additional discrepancies may come from the tag page also listing questions tagged with "question:_____" but lacking the base tag, and also listing notes with only "child tags" of the base tag, in a system we are planning to slowly deprecate.

@jywarren
Member Author

Link to "deprecating tag aliasing" - #6367
