Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spike: finalize the plan for transition to Make Data Count, how to display the metrics, how to handle legacy counts #75

Closed
pdurbin opened this issue Feb 5, 2020 · 56 comments
Assignees
Labels
NIH OTA: 1.5.1 collection: 5 | 1.5.1 | Standardize download metrics for the Harvard Dataverse repository... pm.GREI-d-1.5.1 NIH, yr1, aim5, task1: Standardize download metrics pm.GREI-d-1.5.2 NIH, yr1, aim5, task2: WG with other repositories to follow Make Data Count recommendations Size: 10 A percentage of a sprint.

Comments

@pdurbin
Copy link
Member

pdurbin commented Feb 5, 2020

Established Dataverse installations that have been operating for years might be reluctant to turn on Make Data Count (MDC) because the download counts will be reset to zero unless something is done to somehow copy the "classic" download counts into the new "datasetmetrics" database table that powers MDC download metrics. For example, Harvard Dataverse has over 10 million "classic" downloads:

Screen Shot 2020-02-06 at 11 47 41 AM

Many Dataverse installations probably don't have all the Apache (or Glassfish or whatever) access logs from years ago lying around but the database table filedownload could be used as a source for timestamps of downloads from the "classic" system. After standup on 2020-02-05 @djbrooke @kcondon talked about this and I made the following diagram (best to open it in a new window since the text is so small):

make-data-count

source for the image above: make-data-count.uml.txt

This is what I added to the diagram, which is based on http://guides.dataverse.org/en/4.19/admin/make-data-count.html#architecture

== Historical Logging ==
sysadmin --> exportLogsApi : GET /api/admin/mdc/exportLogs
exportLogsApi --> log : all history from database
main.py --> log : read historical logs
main.py --> datasetMetrics : write metrics to datasetmetrics table (using SUSHI, as below)
main.py --> reports : send metrics to DataCite

This is a bit hand wavy because we'd still use SUSHI as indicated by the Log Processing part of the diagram.

Roughly, the idea is this:

  • Create a new Dataverse API for sysadmins to use to export from Dataverse a series logs that are compatible with Counter Processor (one per month for 10 years, for example)
  • Use Counter Processor to populate the new "datasetmetrics" table used by MDC by processing those logs that were exported.
  • Use Counter Processor to send the historical data to DataCite.

See also pull request IQSS/dataverse#6543

@djbrooke
Copy link
Contributor

djbrooke commented Feb 5, 2020

  • Does COUNTER care if there are missing fields from these counts we move over?
  • What we currently missing? What would we miss out on if we don't comply with COUNTER?
  • We don't expect to do any log scraping as part of this

@qqmyers
Copy link
Member

qqmyers commented Feb 5, 2020

FWIW: I think the primary thing Counter is doing that affects the aggregate counts being shown today is to remove robot hits. If robots have been being excluded by a robots.txt file, then this may not be too important. However, the way it works is to review ip addresses versus a blacklist, so not having ip addresses would stop that. I haven't found any ip address info in the database - not sure if there's some other field that would at least indicate a human/non-robot access. I'm not sure if having a null ip will cause it to break or not - I suspect breaking as one of the errors I previously found (logging un-published downloads) was something I caught because counter threw an error with a null field.

In terms of finer breakdowns, counter also uses ip address to find countries, and I think it looks for sessioncookie info to assess unique counts. The sessioninfo in the filedownload table might be of use there. The human/machine separation is from the user-agent - not sure there's anything in the tables to guess at that (or at least break browser from curl api calls).

@pdurbin
Copy link
Member Author

pdurbin commented Feb 6, 2020

@qqmyers thanks for your comment. Yes, there are a lot of unknowns here, which is why this issue got estimated as a large yesterday in sprint planning.

Heads up that I just updated the description to have more than just a diagram and a link to your pull request. I hope it helps clarify what we talked about.

Further questions and comments are welcome from all, of course!

@jggautier
Copy link
Collaborator

jggautier commented Mar 23, 2020

1. What are we currently missing [regarding downloads]?

It's not clear to me, and perhaps this investigation can include figuring out if and how the following information about pre-MDC counts can be determined:

  • Legitimate machine access (as opposed to internet robots and crawlers)
  • Regular versus machine downloads/explores. Regular means "typical user behavior" and machine means "typical [non-robot and crawler] machine behavior"
  • Downloads/explores that exclude double clicks. If a file is downloaded/explored in the same session within at most 30 seconds, that must count as only one download/explore.
  • Unique downloads/explores. How many times did a download/explore happen in a session?
  • Countrycodes. In what country was the user/machine when she/it downloaded the file?
  • Dataset versions. What dataset version was the user/machine on when she/it downloaded the file?
  • Dates of downloads. For many file downloads in the Harvard Dataverse database, there is no timestamp, so we don't know when (what year, month, day, time of day) the file was downloaded. Is this the case for any other Dataverse installation?

2. Does COUNTER care if there are missing fields from these counts we move over?

"Section 3.3.9: Missing and Unknown Field Values" of the COUNTER guide says that COUNTER will accept missing/unknown values for fields, and indicates how to express missing values for optional and required fields.

Which fields are optional and required? The COUNTER guide says to refer to "Section 8" of the "Research Data SUSHI API Specification". This article cites this API documentation as the Research Data SUSHI API Specification. I can't tell what section 8 is, but there's a page called "dataset usage" model that has red asterisks next to a few fields, like dataset title, type, and platform. Are these the required fields, and is everything else optional?

3. What would we miss out on if we don't comply with COUNTER?

Are the auditors checking on a pass/fail basis? Or are they grading implementations on a scale? (The bottom of the COUNTER guide lists companies doing audits.)


From the previous comments, it sounds like we're limiting ourselves to info in the database because it's the source that Dataverse-based repositories are most likely to have access to. Not every Dataverse-based repository will have access to access logs, which is the only other option.

@djbrooke, for questions 2 and 3 should we schedule a meeting with the COUNTER team? If we can confirm that many of the types of metrics in the first question can be missing from monthly reports, e.g. machine counts, unique counts, dataset versions, it might make the conversion easier (or possible).

@qqmyers has already done some digging in the database for parts of the first question. It seems like the big question is if the database can help us determine human (regular) downloads/explores versus machine downloads/explores that exclude bots and crawlers. I also don't remember a way to do this using info in the database, but I can look. I also plan to explore the database to see how we might determine when a download/explore happened within a session and on what dataset version a file download happened.

If anyone else with time or familiar with the database can help answer these questions, please feel free :)

@qqmyers
Copy link
Member

qqmyers commented Mar 23, 2020

FWIW: If the access logs are available for some instances and they do a better job than the db, it might be worth considering a split - use access logs or do the dual counts as I proposed earlier. They should co-exist well - the cut-over date for dual counts could be the start of time (1970!) by default so it wouldn't trigger if the access logs had been mined.

@jggautier
Copy link
Collaborator

jggautier commented Mar 23, 2020

I agree @qqmyers.

Just ran into a Google doc that lists methods for getting info for MDC counts. It's targeted to Harvard Dataverse, but I think it supports the idea that the access logs would be the only way for Dataverse-based repositories to distinguish between "regular" and "machine" downloads (and exclude bots by filtering by IP address), which is required for converting "classic" downloads. We can't send request counts that combine "regular" and "machine" downloads.

We also can't send counts unless we can exclude double clicks, which Harvard Dataverse can't do because ~630k older download counts don't have timestamps.

So even if we can send reports that are missing most info about requests (e.g. unique counts, dataset versions, dates of requests), it seems that repositories that want to provide MDC counts but (1) cannot use access logs or (2) have download counts without timestamps cannot convert their "classic" download counts. For those repositories, they can either:

  • not provide pre-MDC counts or
  • offer those counts somehow in addition to MDC counts (e.g. the dual count proposal)

That Google doc also brings up another question about sessions: When determining unique and total counts, can we use the session cookie ID (instead of the user cookie ID) to define a session?

That Google doc defines session cookie IDs and user cookie IDs. The article (page 27) says that session cookie IDs (which Dataverse's database does store) can be used to filter double clicks, although it's not the most reliable way. This filtering also must be done for MDC compliant metrics.

But when determining unique versus total counts, can Dataverse also define a session by the session cookie ID instead of a user cookie ID? That Google doc says we don't have user cookie IDs.

More questions to ask the COUNTER team.

@jggautier
Copy link
Collaborator

jggautier commented Mar 26, 2020

Best effort (instead of all-or-nothing)

@djbrooke shared a confirmation from Daniella Lowenberg that no fields are required and that a best effort should be made to convert "classic", non-standard counts to standard counts. The guides and specs say counts need to exclude bot/crawler activity and double clicks.

If Dataverse repositories believe or are not sure if their download/explore counts include bot/crawler activity, they should make their best effort to exclude as much as possible by:

  • matching the ip addresses in their access logs to lists of known bot/crawler ip addresses
  • ?

Dataverse repositories should also make their best effort to exclude double clicks from their "classic" counts, which can be done by using the timestamp and session cookie ID of each download/explore recorded. Timestamps and session cookie IDs may be found in the Dataverse repository's database and/or its access logs.

After this best effort for filtering counts, counts can be converted to MDC counts, even if the repository believes some counts may represent bot/crawler activity or double clicks.

This line of reasoning seems different than the thinking behind the proposal that @qqmyers supports for providing "classic counts" in addition to MDC counts, but...

Counts with unknown month/year

The standardized reports are made monthly, so I'm assuming that Dataverse needs to report the month and year when each download/explore occurred. For example, in this example JSON report, the "Period" object on line 70 must be included and must contain a "begin-date" key with a date value that is the first day of the report's month.

This requirement seems justified since, just like excluding bots and double clicks, knowing when some activity happened improves the quality of the metric.

Are there repositories that have downloads/explores whose months and years can't be determined? If yes, what should be done about those counts?

Harvard Dataverse's database doesn't have timestamps for around 600k download/explore counts. I think the GitHub issue about this (IQSS/dataverse#3324) doesn't mention trying to find this information from access logs. But @landreev or @scolapasta, would you know if this was already considered?

I queried the database to see which datasets in which dataverses would lose download/explore counts if Harvard Dataverse removed the counts with no timestamps, and can share the results if needed.

@scolapasta
Copy link

@jggautier we did consider it, but unfortunately, these downloads are from before we tracked downloads individually, i.e. we just had a count. When we went to the current model, we created a row for each, but didn't know if the specific date. Access logs from that time are no longer available. (discussed with @landreev to confirm, as well)

@jggautier
Copy link
Collaborator

Great, thanks @scolapasta and @landreev!

I think we could contact installations to find out:

  • Which installations have counts that can't be converted because the month/year is unknown.
  • What those installations want to do about the counts where the month/year is unknown. (Discarding counts that can't be converted would mean that many of the older datasets in Harvard Dataverse would lose the majority of their counts.)

Would the "best effort" instructions we provide to installations for converting the timestamped counts include what @pdurbin wrote earlier?:

  • Create a new Dataverse API for sysadmins to use to export from Dataverse a series logs that are compatible with Counter Processor (one per month for 10 years, for example)
  • Use Counter Processor to populate the new "datasetmetrics" table used by MDC by processing those logs that were exported.
  • Use Counter Processor to send the historical data to DataCite.

@djbrooke
Copy link
Contributor

djbrooke commented May 1, 2020

@jggautier thanks! All sounds good.

For converting counts, instead of contacting other installations we could make a decision on our own for Harvard Dataverse and then provide the guidance/infrastructure to installations so they can do what we did (or use some parts of the guidance/infrastructure to make their own decisions about old counts).

Generally, what @pdurbin suggested makes sense as well.

@jggautier
Copy link
Collaborator

jggautier commented May 1, 2020

Some questions and observations that might help with the decision:

  • Already noted that if we don't make available (in any way) the download counts that can't be converted, that would mean many of the older datasets in Harvard Dataverse would lose the majority of their existing download counts. Could the decision to not make the "classic" counts available in any way be based on how these dataset owners are using these counts (such as reporting them to funders) and what would happen if the classic counts were unavailable to depositors/people who manage those datasets?
  • Assuming that when MDC is turned on, we'll continue to store and collect "classic" download counts in the Dataverse database and we could somehow make those "classic" counts available (not in the UI), if there are people or systems who will get dataset metrics only from DataCite in the MDC standard, those people/systems wouldn't know about the download counts that couldn't be converted to the standard. For example, someone using DataCite's APIs to programmatically get activity counts of datasets or another platform using DataCite's APIs to display activity counts wouldn't know about a particular dataset's 12k "classic" downloads that couldn't be converted. For a better indication of that dataset's popularity/quality, those people/systems would need to get the "classic" download counts from Harvard Dataverse.
  • If the "classic" download counts are somehow provided to dataset depositors/curators and others, will making those download counts harder to find, because they won't be displayed in the UI, make them less effective at indicating popularity/quality? For example, will the counts be less effective if they're only in a tool tip or available only using the API and not as easy to find as the MDC counts?
  • When MDC is turned on, what if Harvard Dataverse never reports/displays MDC download counts for the 1446 datasets whose "classic" download counts can't be converted? For those datasets, only the non-MDC download counts would be available. Harvard Dataverse would still provide pageview counts for those datasets, and the MDC reports sent to DataCite for those datasets would always include only the pageviews. (That is, in the MDC reports, "requests" would never be included for those datasets, and the "investigations" would include only pageviews.) People/systems couldn't get download counts from DataCite for these datasets.
  • Or what if Harvard Dataverse does report displays only MDC download counts, including for those 1446 datasets, but also indicates that the dataset has download counts that couldn't be converted to the MDC standard and are available in some other way (like using the Dataverse API)?
  • @djbrooke asked what if we add dates to downloads that don't have them? The date might be some date before Dataverse started tracking when files were downloaded. @scolapasta said that was before guestbooks were introduced. Or maybe the date could be 2008-06-01, the first day of the month before 2008-07-31, which is the earliest download date (guestbookresponse.responsetime) recorded in Harvard Dataverse.

@jggautier
Copy link
Collaborator

We decided first to work on converting counts that have timestamps and exclude counts that don't. While that work continues, we'll keep looking into what to do with the counts that don't have timestamps (including asking other repositories if they ran into this issue and how they handled it, and if the MDC leaders have suggestions).

Continuing from what @pdurbin wrote about converting timestamped counts:

  1. Create a new Dataverse API for sysadmins to use to export from Dataverse a series of logs that are compatible with Counter Processor (one per month for 10 years, for example)
  • Does that mean that this sysadmin API will take the timestamped counts and use information in the database and access logs to try to:
    • remove download counts caused by bot/crawler activity?
    • distinguish between duplicate and unique counts?
  • The API would export a series of logs. Are the logs the JSON reports that are sent to DataCite (like this example), mentioned in the third step below?
  1. Use Counter Processor to populate the new "datasetmetrics" table used by MDC by processing those logs that were exported.
  1. Use Counter Processor to send the historical data to DataCite.

This will include all counts up to and including the previous month, right? For example, if on June 15 the Counter Processor is used to populate the new "datasetmetrics" table and send the historical data to DataCite, the historical data would be activity (downloads/explores) that happened up to and including May.

Then when the repository starts sending regular monthly reports to DataCite, it'll continue with the following complete month? In the example, the report would be sent sometime in July and include activity in June.

@qqmyers
Copy link
Member

qqmyers commented May 14, 2020

counter_processor should be able to remove bot and duplicate counts if the info is available, so if the sysadmin has a way of determining these things, it might be easiest if they just include that info in the made-up log rather than trying to process themselves. (I'm not sure I understand what info is available but suppose, for example that only one person has requested access to restricted files in a dataset and there are three quick downloads of them. If there isn't any info about the session or IP address to give counter-processor, one could still infer that they are from the same person. If so, rather than trying to calculate whether the downloads should all be counted, one could just add made-up info about the session to the log and let counter-processor do the math. Similar for other things - rather than making the decision in new code, one could just make sure the choice being made about whether its the same person, or whether it was a robot is conveyed in the made-up log file and counter-processor can handle it from there.)

counter-processor nominally runs daily and reports in the current month, up to yesterday. Once it's processed, it reports to Dataverse, which clears the table for the current month and takes the new aggregate info, and it updates with DataCite - again reporting new aggregate numbers that are used in place of the old ones.

To process old months, you can tell counter-processor the year-month combo to process and will process all the logs for that month and send one report to Dataverse and DataCite, i.e. no need to run it for each day in an old month. Similarly, if the cron breaks and you skip some days, as long as you process again in that month, counter-processor will catch-up and report all the days it missed.

@landreev
Copy link
Collaborator

Aside from the question of how much, and what kind of extra info can be extracted from the logs, I just want to make sure we don't count on it too much. Specifically when the downloads that happened in the past are concerned. In our own production environment we only keep the access logs around for some number of months. We have never assumed that it was something we want to archive and store permanently. We should also assume that the way other installations handle this may vary, spanning the full range from permanently archiving all their access logs, to not having the access logs enabled at all.

For the purposes of accounting for ongoing downloads, it is safe to assume that the logs for, say, the past week are always going to be present. (And we can instruct other installations that this is required). But it would probably be worth to ensure that we store all the information needed for the MDC metrics right away. So that going back to reading access logs isn't going to be necessary (this may already be the case, going forward, I'm just not 100% sure from reading the above).

@landreev
Copy link
Collaborator

landreev commented May 14, 2020

@qqmyers (et al) On the issue of recognizing "bot/crawler activity":
In the context of our own prod. environment here at Harvard, when we talk about "bot activity" we often mean a very specific type of activity: A very large portion of our downloads here are "partial downloads" from some automated archival network - where a user publishes a (large, multi-GB) file and then armies of these automated clients keep making calls for very small portions of these files, often just a few bytes at a time, with the byte range argument added to the download urls. These clients make thousands of these calls; that, in the Dataverse accounting system result in full GuestBookResponse entries; so for all practical purposes there's no way to tell that those were not real, full downloads of that multi-GB file! I opened IQSS/dataverse#5957 for this last year; would be super easy, to add some simple way for marking these downloads (or to not create "guest book responses" for these at all!) but for whatever reason it ended up on the back burner.
(for the record, the only reason this even works is that we redirect to S3 - the extra offset/length parameter is passed to AWS which recognizes it. Dataverse, on the other hand, does not. So for a locally stored file it would actually make an honest attempt to serve the entire file! - another potential easy fix).
In our production here I can go back and recognize these partial downloads in the access logs - but only because I added "%header.range%" to the Glassfish access log configuration. But, again, this can only be done for as long as the logs are kept around.

I guess the bottom line is, this would be something very specific to individual installations, and their use patterns. And subject to their admins looking at the logs and recognizing some use patterns as bot activity. (the word "bot" in the logged %header.user-agent% field can be a giveaway too). But I don't think we'll be able to provide scripts that would reliably do this for other installations. (And I understand that this is somewhat along the lines of what you were saying above too).

@qqmyers
Copy link
Member

qqmyers commented May 14, 2020

@landreev
Once mdc logging is enabled, all the info needed is in the new logs - no need for access logs anymore
FYI - I'm @qqmyers (one -e) so I'm not getting your @qqmeyers
For bots, counter-processor uses dynamic lists for bots and machine counts (see https://raw.githubusercontent.com/CDLUC3/Make-Data-Count/master/user-agents/lists/robot.txt for the list of words/regexs it looks for) . As long as the user agent info is available, the range info may not be needed.
(FWIW - with range queries - that's what's needed to be efficient with mime-type recognition (#6762 ) - you usually only need the first few to ~1K bytes, so making range queries possible in the StorageIO classes (file included) will be needed there. That should help the non-redirect download case above if range handling is allowed in the api.)

@landreev
Copy link
Collaborator

I'm @qqmyers (one -e) so I'm not getting your @qqmeyers

lol, yes - but do note that I eventually noticed, and corrected it!

@landreev
Copy link
Collaborator

As long as the user agent info is available, the range info may not be needed.

True. Most of our byte range access requests are from this user agent "reqwest": (but do note almost just as many with no user agent supplied)

# grep -v 'NULL-HEADER-RANGE' access/server_access_log.2020-0[45]* |
> grep 'GET /api/access/datafile' |
> awk '{print $9}' |
> sort | uniq -c | sort -nr +0 -1
 266606 "NULL-USER-AGENT"
 203893 "reqwest/0.9.24"
  94405 "reqwest/0.9.22"
   3744 "reqwest/0.9.19"
   1729 "Mozilla/5.0
    489 "Mozilla/4.0
     17 "Core/1.63.5221.400
     14 "okhttp/3.9.0"
     14 "facebookexternalhit/1.1
     12 "reqwest/0.9.20"
      5 "FDM/5.1.38.7312"
      2 "Python-urllib/3.7"
      1 "Wget/1.19.4
      1 "HUAWEI_SCL-TL00_TD/5.0

@jggautier
Copy link
Collaborator

jggautier commented May 15, 2020

Thanks. This is great! I'd like to make sure we're using the same terms for things, preferably the terms used in the COUNTER spec, to reduce confusion (maybe just my own confusion as the least technical person here :) ). For example, @qqmyers you wrote that "counter-processor uses dynamic lists for bots and machine counts." These dynamic lists are used for recognizing activity that should be excluded, right? But the COUNTER spec allows for counting "legitimate" machine activity, like scripts run on a user's computer. I wouldn't want anyone else reading this thread to think that machine counts and bot counts are the same thing. It's important that any method for excluding activity doesn't exclude legitimate machine counts.

I've been keeping a table (just moved it to a Google Sheet) so that I understand how the decisions we make, and the decisions made by CDL's counter-processor, affect counts in any situation I'm aware of, including the instance of partial download counts reported in IQSS/dataverse#5957. If this might be helpful for anyone else, please feel free to contribute or question an interpretation. (I know that for this issue's scope, only the situations involving downloads/explores/requests matter, since Dataverse installations haven't been counting other types of activity, like pageviews (outside of third-party analytics tools)).

@landreev. You wrote in IQSS/dataverse#5957 that the group that downloaded only a part of many files did that "to keep their distributed copies in sync". I just want to make sure I understand why we should label that activity as bot/crawler activity. The COUNTER spec defines "Internet Robots and Crawlers" as agents whose actions are "intended for search indexing and related applications".

  • So do we agree that machine activity related to keeping distributed copies in sync would fall under "related applications", or that it's a process of that group's search indexing, and should be considered bot/crawler activity?
  • Are we suggesting that any machine activity where only part of the file is downloaded should be considered bot/crawler activity? When someone tells a machine to download only part of a file, is it only for search indexing/related purposes?

Since there's no "one-size-fits-all" method for recognizing "bot/crawler activity", are there a number of methods we can recommend to installations to ensure that they exclude bots/crawlers? It sounds like installations can't always rely on user agents or the dynamic lists that counter-processor uses (we couldn't have anticipated that "reqwest" would start download parts of thousands of files). But if I understand what @qqmyers wrote, if an installation discovers activity that was reported as legitimate but should have been excluded as bot/crawler activity, that installation can re-send corrected reports for the months of that activity.

@landreev
Copy link
Collaborator

@jggautier Re: all those automated partial downloads: I can look up the actual conversation with the user who published the original file that started the whole thing. I don't remember all the details tbh; that big 2.5GB file is an archive of some TeX macros. Why the individual clients need to make thousands of partial downloads at regular intervals exactly, I can't necessarily explain. But I do think it's not a stretch to say that it probably falls under "search and indexing and related applications".

My main arguments in favor of NOT counting these as regular downloads: It just feels wrong. I'm not talking about the MDC metrics - just the fact that these thousands of downloads of just a few bytes each are shown as if somebody actually downloaded a multi-GB file thousands of times. It does likely distort our download statistics at this point. Also, it's kind of a waste of resources on the Dataverse side - each download count is actually a database table entry (in the GuestBookResponse table).

Another way of looking at this, maybe we should not focus on the "partial", but instead on the fact that they are repeated downloads from the same IP address. So, perhaps we should count the first one as a regular download; but ignore the consecutive hundreds of them if they are happening within some number of hours; or within the same day etc. This way we don't need to pay attention to whether they are requesting just a few select bytes, or the whole file.

@landreev
Copy link
Collaborator

(Just want to emphasize that I don't have a strong opinion on how we should be treating these "partial downloads". My only strong opinion is that what we are doing now feels wrong. Not counting any partial downloads would be a bit over-the-top. Asking for 1GB from a 2GB file, for example, does feel like a "real" download... So maybe my other idea, not counting repeated downloads on the same file over a short period of time is a better bet?)

@jggautier
Copy link
Collaborator

jggautier commented May 18, 2020

Ah, thanks @landreev. I'm leaning toward thinking that the partial downloads reported in IQSS/dataverse#5957 should be excluded from the MDC downloads, too (should not be populated in the datasetmetrics table so shouldn't be reported to DataCite).

I've always assumed that IQSS/dataverse#5957 wasn't prioritized because this MDC work would take care of problems like these, by excluding this type of activity or differentiating unique from total counts, and "regular" counts from machine counts.

Does that make sense? If it's not clear from the conversation with the user who published the original file that started the whole thing in IQSS/dataverse#5957, if you like I could reach out to them to ask.

There might be Dataverse installations that continue to provide non-MDC counts, which also contain repeated downloads of the same file over a short period of time. But I think it's better to encourage and help Dataverse repositories to follow a standard other platforms are following, too.

Also, it's kind of a waste of resources on the Dataverse side - each download count is actually a database table entry (in the GuestBookResponse table).

Maybe this could be resolved as the GuestBookResponse table is further separated from the download stats? (Of course I'm thinking this is out of scope for this issue/spike.)

I think the next steps are:

  1. Determine if the partial downloads described in Download counts for partial downloads dataverse#5957 should be excluded as bot/crawler activity or counted as legitimate machine activity (does that involve adding a user agent to a blacklist?)
  2. Test populating the datasetmetrics table (also excluding downloads/explores in the guestbookresponse and/or filedownload tables that don't have timestamps)
  3. Compare the MDC counts in the datasetmetrics table to the "classic" counts in the guestbookresponse and/or filedownload tables to make sure everything's working as we expect (Seeing the differences will also help us determine how to communicate the changes in counts once they're later displayed in the Harvard Dataverse UI)
  4. Once we're satisfied, populate the datasetmetrics table in Harvard Dataverse's production database and send the reports to DataCite

@landreev
Copy link
Collaborator

landreev commented Jun 4, 2020

@jggautier
@qqmyers
@pdurbin
@scolapasta
@djbrooke
Back to this issue.
A piece of good news: we have significantly more saved access logs than I realized/reported yesterday. Turns out I squirreled away and saved the logs from the LTS-maintained servers back in 2018, as we were moving to AWS. Continuing the squirrel analogy, somewhat gratuitously, it appeared that I had forgotten where exactly those nuts were buried. But I found them after all, so we now have the apache logs for the entire, or most of the, lifespan of Dataverse 4 - beginning April 2015. Or the time period during which 90% of all the "legacy" downloads were served.
This does NOT necessarily mean that we'll be able to reliably track every single GuestBookResponse record from the database to its access log entry (there may be some gaps, or logging failures). And some log entries are still missing at least some of the info that MDC likes to have. But we should have enough of that information for this whole idea of cooking up MDC logs retroactively to be looking like a viable option again; as we were having some doubts about it yesterday.

I still don't fully know how to get it done. But I now understand what we are trying to achieve, enough to know what questions to ask. So I'm going to be doing that, below.

@landreev
Copy link
Collaborator

landreev commented Jun 4, 2020

@jggautier
@qqmyers
@pdurbin
@scolapasta
@djbrooke
OK, for most (?) of you, the summaries below is something you already know. This is for my benefit - let me know if I'm still not getting it right.

  • The reason we want to convert these "legacy" downloads into MDC is that otherwise the current counts will reset to 0 when we turn MDC on.
  • QDR/@qqmyers's solution (#6543, show legacy and MDC metrics, both) was considered, but rejected. (IMHO - I feel like it's worth reconsidering).
  • The things we are hoping to extract from the logs, that our GuestBookResponse records do not have:
    USER AGENT: (browser, software client, etc.) important because Counter Processor relies on it to identify "machine access" requests. (Although, in an apparently rather crude/primitive way!) None of our GuestBookResponse entries have that; some of our logs will have that.
    IP ADDRESS:
    SESSION ID: Same as above; CP uses this info to filter out repeated ("double click") requests. The access logs entries will have this info for some downloads, but not for all.
    Questions:
  1. Is it ok to cook up MDC logs for the downloads for which we don't have all the info above? Rather than dropping them? This may have been actually answered, in that "best effort" discussion way above; but I'm just making sure. Sounds like yes, I am cooking up MDC entries, for them with "-" in place of the missing values above for these, correct?
  2. What exactly am I doing for the downloads for which we don't have ANY information at all? These are the prehistoric, DVN-period downloads for which we only know the counts per file, not even dates. There's no way to retrieve that information. These constitute only ~5% of all the counts (~700K out of 15M). But there are probably specific files for which these legacy^2 downloads constitute most of the counts. So I don't think we want to drop them outright. But then the only way to get them counted w/ MDC would be to make up some fake dates for them. Up in the issue there's a quote from the MDC spec that says that everything in that CP log file is "optional". But then CP logs themselves are dated (there's one log file per day); so basically I would need to pick some fake date in that "pre-historic" range, and place that download into the CP log for that day. This feels pretty... sketchy - but I can't think of a better way. And I am open to suggestions.
  3. Again, this may have been clear to everybody else but me, but to confirm: Even for the downloads for which all the extra information can be obtained from the logs, the MDC counts may end up lower than the previously maintained legacy counts. Especially so, actually - since MDC uses that extra information to identify some downloads as "double clicks" and such, and drop them from the counts. Provided the above is correct, is everybody ok with this? It kinda sounds like at least some people were actually viewing these numbers as meaningful and important... are they going to be ok seeing them go down?
  4. Please note that the issue of the "Tectonic" downloads, that was discussed at great lengths above is NOT an open question anymore; that's been resolved and I know what to do with them exactly.

Sorry that was so long. I'm not personally convinced all that MDC business is worth this amount of attention we've already put into it. But it sounds like it's important to at least some people - so here we are.

@landreev
Copy link
Collaborator

landreev commented Jun 4, 2020

@jggautier
@qqmyers
@pdurbin
@scolapasta
@djbrooke
What exactly is the time frame for this anyway? Is there any need to start showing MDC metrics in our production starting any specific date?
Per @qqmyers' advice, I have enabled MDC logging on our prod servers. Without switching to displaying MDC metrics just yet.

@landreev
Copy link
Collaborator

landreev commented Jun 4, 2020

@jggautier
@qqmyers
@pdurbin
@scolapasta
@djbrooke
Unrelated to the legacy counts, but a couple of questions about the MDC logs:

  1. Is that by design that we are saving the storage identifier (for ex., s3://dvn-cloud:1523be...) in the filename column? Shouldn't it be the real filename, from the FileMetadata, under which the user is actually downloading it? (is it ever used for any meaningful purpose?)
  2. Not a question, but - yes, it does look like even going forward there will still be some required pre-processing that we'll need to be doing on these MDC log files before feeding them to Count Processor; consulting the access log file in the process. Specifically, the ip address in the counter*.log file on our prod. servers is NOT the real ip, but the ip of the AWS ELB proxy. I think in practical terms, that would result in Counter Processor potentially mis-typing many guest downloads from different users as repeated downloads coming from the same address. So we'll need to either get the real address from the access log file; or add code to the application to (optionally) get it from the x-forwarded-for, or another, configurable header or cookie.
    Modifying the "user agent" field for what we know are "machine downloads" is another case.

@qqmyers
Copy link
Member

qqmyers commented Jun 4, 2020

  1. Not sure why this choice was made - seems like the real filename makes more sense unless there's something in the spec that specifies you want something more unique (e.g. than readme.txt)
  2. BUMMER! Looks like QDR has this issue as well and the apache logs we have also capture the LB IP addresses as far as I can tell (still checking). Aside from potentially causing collisions, where I think the session Ids will still help distinguish, this will completely break geo-spatial assignments which is another important add of MDC. Regardless of how fast back logs are processed, getting good info for new counts, whether through local config of AWS LBs or code change should probably be prioritized (and shared with the community assuming others use load balancing.)

@mreekie mreekie added the Size: 10 A percentage of a sprint. label Jan 17, 2023
@landreev
Copy link
Collaborator

landreev commented Jan 17, 2023

Linking the issue IQSS/dataverse#9025 in the main project - a lot of the more recent discussion concerning this issue was happening there. As we revisit it during this spike, let's make sure to take any potentially useful information there into consideration.

@landreev landreev changed the title Spike: populate Make Data Count downloads from "classic" downloads in database Spike: finalize the plan for transition to Make Data Count, how to display the metrics, how to handle legacy counts Jan 17, 2023
@mreekie mreekie moved this from 1️⃣ ▶ORDERED BACKLOG (Stefano) to 3️⃣▶ 💨👟SPRINT READY BACKLOG in IQSS Dataverse Project Jan 18, 2023
@mreekie mreekie moved this from 3️⃣▶ 💨👟SPRINT READY BACKLOG to 4️⃣▶⏱In This Sprint in IQSS Dataverse Project Jan 26, 2023
@mreekie mreekie moved this from 4️⃣▶⏱In This Sprint to 1️⃣ ▶ORDERED BACKLOG (Stefano) in IQSS Dataverse Project Jan 27, 2023
@mreekie mreekie moved this from 1️⃣ ▶ORDERED BACKLOG (Stefano) to 3️⃣▶ 💨👟SPRINT READY BACKLOG in IQSS Dataverse Project Jan 27, 2023
@jggautier
Copy link
Collaborator

jggautier commented Mar 2, 2023

I'm listing some of the collections in the Harvard Dataverse whose admins either rely on download counts now or have told us that they are very interested in being able to rely on them, such as for measuring the impact of their data and data sharing efforts. We/I can talk to the admins of these collections so that how we implement Make Data Count in Harvard Dataverse is informed by a better understanding of needs of users in the Harvard Dataverse:

@mreekie mreekie added pm.GREI-d-1.5.1 NIH, yr1, aim5, task1: Standardize download metrics pm.GREI-d-1.5.2 NIH, yr1, aim5, task2: WG with other repositories to follow Make Data Count recommendations labels Mar 20, 2023
@mreekie mreekie moved this from SPRINT READY to Clear of the Backlog in IQSS Dataverse Project Mar 27, 2023
@mreekie mreekie moved this from Clear of the Backlog to SPRINT READY in IQSS Dataverse Project Mar 27, 2023
@landreev
Copy link
Collaborator

It does sound to me like it has been established, that there are local users/collections who value their existing download counts. While it may have some value to further investigate their needs, I'm not sure if it's really necessary for the purposes of deciding how to proceed, with the dev. plan. (We already know we can't afford to drop the existing counts).
It does sound like we have a degree of consensus that we want to implement what we've been referring to as the "QDR solution" - an option to display both the old-style counts, collected prior to the start of the MDC records, and the MDC metrics. This of course will still be optional; an installation will still have the options to stick with the "classic", non-MDC counts, or the MDC counts exclusively.
This is also explained in some detail in the linked issue https://github.com/IQSS/dataverse/issues/9025.
So let's "finalize" the plan by prioritizing either merging the existing IQSS/dataverse#6543, or if that one is too old, by pulling in the QDR changes via a new pr.

@pdurbin
Copy link
Member Author

pdurbin commented Apr 11, 2023

Discussed at standup. No objection to showing both counts.

The next step is probably to see if we can update and merge Jim's pull request:

@pdurbin pdurbin assigned mreekie and unassigned landreev and jggautier Apr 11, 2023
@pdurbin
Copy link
Member Author

pdurbin commented Apr 12, 2023

We gave the PR a 10

This one: IQSS/dataverse#6543

@pdurbin pdurbin closed this as completed Apr 12, 2023
@github-project-automation github-project-automation bot moved this from SPRINT READY to Clear of the Backlog in IQSS Dataverse Project Apr 12, 2023
@jggautier
Copy link
Collaborator

jggautier commented Nov 27, 2023

For the sake of posterity, I should say that although I wrote that I'd like to learn from admins of collections in Harvard Dataverse who generally rely on metrics, I wasn't able to talk with them about this. One of the things I wanted to learn was if it was necessary to show both counts.

Unfortunately @landreev let me know that my comment was taken to mean that these users would like both counts to show, which may or may not be true, and supported the idea of a solution where both counts are shown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NIH OTA: 1.5.1 collection: 5 | 1.5.1 | Standardize download metrics for the Harvard Dataverse repository... pm.GREI-d-1.5.1 NIH, yr1, aim5, task1: Standardize download metrics pm.GREI-d-1.5.2 NIH, yr1, aim5, task2: WG with other repositories to follow Make Data Count recommendations Size: 10 A percentage of a sprint.
Projects
Status: No status
Development

No branches or pull requests

7 participants