Spike: finalize the plan for transition to Make Data Count, how to display the metrics, how to handle legacy counts #75
FWIW: I think the primary thing Counter is doing that affects the aggregate counts shown today is removing robot hits. If robots have already been excluded by a robots.txt file, then this may not be too important. However, the way it works is to check IP addresses against a blacklist, so not having IP addresses would prevent that. I haven't found any IP address info in the database, and I'm not sure if there's some other field that would at least indicate a human/non-robot access. I'm not sure whether a null IP will cause it to break or not; I suspect it will, since one of the errors I previously found (logging un-published downloads) was something I caught because counter threw an error on a null field. In terms of finer breakdowns, counter also uses the IP address to find countries, and I think it looks for session cookie info to assess unique counts. The sessioninfo in the filedownload table might be of use there. The human/machine separation is from the user agent; I'm not sure there's anything in the tables to guess at that (or at least to distinguish a browser from curl API calls).
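To illustrate the IP-blacklist filtering described above, here is a minimal sketch; this is not counter-processor's actual code, and the CIDR ranges and hit records are made-up placeholders, not a real robot list:

```python
# Sketch of IP-blacklist robot filtering, in the spirit of what
# counter-processor does. The CIDR ranges below are placeholders,
# not a real robot list.
import ipaddress

ROBOT_NETS = [ipaddress.ip_network(c) for c in ("192.0.2.0/24", "198.51.100.0/24")]

def is_robot_ip(ip):
    # A null/missing IP can't be checked, so treat it as non-robot
    # here; counter itself may instead error out on a null field.
    if ip is None:
        return False
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ROBOT_NETS)

hits = [("203.0.113.5", "file1"), ("192.0.2.77", "file1"), (None, "file2")]
human_hits = [(ip, f) for ip, f in hits if not is_robot_ip(ip)]
print(human_hits)  # the 192.0.2.77 hit is dropped
```

Without IP addresses in the database, none of this filtering can run, which is the concern above.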
@qqmyers thanks for your comment. Yes, there are a lot of unknowns here, which is why this issue got estimated as a large yesterday in sprint planning. Heads up that I just updated the description to have more than just a diagram and a link to your pull request. I hope it helps clarify what we talked about. Further questions and comments are welcome from all, of course!
It's not clear to me, and perhaps this investigation can include figuring out if and how the following information about pre-MDC counts can be determined:
"Section 3.3.9: Missing and Unknown Field Values" of the COUNTER guide says that COUNTER will accept missing/unknown values for fields, and indicates how to express missing values for optional and required fields. Which fields are optional and which are required? The COUNTER guide says to refer to "Section 8" of the "Research Data SUSHI API Specification". This article cites this API documentation as the Research Data SUSHI API Specification. I can't tell what section 8 is, but there's a page called the "dataset usage" model that has red asterisks next to a few fields, like dataset title, type, and platform. Are these the required fields, and is everything else optional?
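For illustration only, a minimal item under that "dataset usage" model might look like the sketch below, assuming the asterisk-marked fields (dataset title, type, platform) are the required ones; the field names here are approximations and would need to be checked against the actual SUSHI spec:

```json
{
  "dataset-title": "Example Dataset",
  "data-type": "dataset",
  "platform": "Example Repository",
  "performance": [
    {
      "period": { "begin-date": "2019-05-01", "end-date": "2019-05-31" },
      "instance": [
        { "metric-type": "total-dataset-requests", "count": 12 }
      ]
    }
  ]
}
```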
Are the auditors checking on a pass/fail basis, or are they grading implementations on a scale? (The bottom of the COUNTER guide lists companies doing audits.)

From the previous comments, it sounds like we're limiting ourselves to info in the database because it's the source that Dataverse-based repositories are most likely to have access to. Not every Dataverse-based repository will have access to access logs, which are the only other option.

@djbrooke, for questions 2 and 3 should we schedule a meeting with the COUNTER team? If we can confirm that many of the types of metrics in the first question can be missing from monthly reports, e.g. machine counts, unique counts, dataset versions, it might make the conversion easier (or possible).

@qqmyers has already done some digging in the database for parts of the first question. It seems like the big question is whether the database can help us determine human ("regular") downloads/explores versus machine downloads/explores while excluding bots and crawlers. I also don't remember a way to do this using info in the database, but I can look. I also plan to explore the database to see how we might determine when a download/explore happened within a session and on what dataset version a file download happened. If anyone else with time or familiarity with the database can help answer these questions, please feel free :)
FWIW: If the access logs are available for some instances and they do a better job than the db, it might be worth considering a split: use access logs, or do the dual counts as I proposed earlier. They should co-exist well; the cut-over date for dual counts could be the start of time (1970!) by default, so it wouldn't trigger if the access logs had been mined.
I agree @qqmyers. Just ran into a Google doc that lists methods for getting info for MDC counts. It's targeted to Harvard Dataverse, but I think it supports the idea that the access logs would be the only way for Dataverse-based repositories to distinguish between "regular" and "machine" downloads (and exclude bots by filtering by IP address), which is required for converting "classic" downloads. We can't send request counts that combine "regular" and "machine" downloads. We also can't send counts unless we can exclude double clicks, which Harvard Dataverse can't do because ~630k older download counts don't have timestamps.

So even if we can send reports that are missing most info about requests (e.g. unique counts, dataset versions, dates of requests), it seems that repositories that want to provide MDC counts but (1) cannot use access logs or (2) have download counts without timestamps cannot convert their "classic" download counts. Those repositories can either:
That Google doc also brings up another question about sessions: When determining unique and total counts, can we use the session cookie ID (instead of the user cookie ID) to define a session? That Google doc defines session cookie IDs and user cookie IDs. The article (page 27) says that session cookie IDs (which Dataverse's database does store) can be used to filter double clicks, although it's not the most reliable way. This filtering must also be done for MDC-compliant metrics. But when determining unique versus total counts, can Dataverse also define a session by the session cookie ID instead of a user cookie ID? That Google doc says we don't have user cookie IDs. More questions to ask the COUNTER team.
Best effort (instead of all-or-nothing)

@djbrooke shared a confirmation from Daniella Lowenberg that no fields are required and that a best effort should be made to convert "classic", non-standard counts to standard counts. The guides and specs say counts need to exclude bot/crawler activity and double clicks. If Dataverse repositories believe, or are not sure whether, their download/explore counts include bot/crawler activity, they should make their best effort to exclude as much as possible by:
Dataverse repositories should also make their best effort to exclude double clicks from their "classic" counts, which can be done by using the timestamp and session cookie ID of each recorded download/explore. Timestamps and session cookie IDs may be found in the Dataverse repository's database and/or its access logs. After this best-effort filtering, counts can be converted to MDC counts, even if the repository believes some counts may represent bot/crawler activity or double clicks. This line of reasoning seems different from the thinking behind the proposal that @qqmyers supports for providing "classic" counts in addition to MDC counts, but...

Counts with unknown month/year

The standardized reports are made monthly, so I'm assuming that Dataverse needs to report the month and year when each download/explore occurred. For example, in this example JSON report, the "Period" object on line 70 must be included and must contain a "begin-date" key whose value is the first day of the report's month. This requirement seems justified since, just like excluding bots and double clicks, knowing when some activity happened improves the quality of the metric.

Are there repositories that have downloads/explores whose months and years can't be determined? If yes, what should be done about those counts? Harvard Dataverse's database doesn't have timestamps for around 600k download/explore counts. I think the GitHub issue about this (IQSS/dataverse#3324) doesn't mention trying to find this information in access logs. But @landreev or @scolapasta, would you know if this was already considered? I queried the database to see which datasets in which dataverses would lose download/explore counts if Harvard Dataverse removed the counts with no timestamps, and can share the results if needed.
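As a sketch of the double-click filtering described above: the record shape and threshold below are illustrative (not Dataverse's actual schema), with the 30-second window following the COUNTER double-click convention:

```python
# Collapse "double clicks": repeat downloads of the same file within the
# same session inside a 30-second window count only once. Records are
# (session_cookie_id, file_id, unix_timestamp) tuples; the 30s threshold
# follows the COUNTER double-click convention.
DOUBLE_CLICK_WINDOW = 30  # seconds

def filter_double_clicks(records):
    last_seen = {}  # (session, file) -> timestamp of last hit
    counted = []
    for session, file_id, ts in sorted(records, key=lambda r: r[2]):
        key = (session, file_id)
        if key in last_seen and ts - last_seen[key] <= DOUBLE_CLICK_WINDOW:
            last_seen[key] = ts  # refresh, treating it as a rolling window
            continue
        last_seen[key] = ts
        counted.append((session, file_id, ts))
    return counted

events = [("s1", "f1", 100), ("s1", "f1", 110), ("s1", "f1", 200), ("s2", "f1", 105)]
print(len(filter_double_clicks(events)))  # 3: the hit at t=110 is collapsed
```

Counts without timestamps can't go through a filter like this at all, which is the problem with the ~600k rows mentioned above.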
@jggautier we did consider it, but unfortunately these downloads are from before we tracked downloads individually, i.e. we just had a count. When we went to the current model, we created a row for each, but didn't know the specific date. Access logs from that time are no longer available. (Discussed with @landreev to confirm, as well.)
Great, thanks @scolapasta and @landreev! I think we could contact installations to find out:
Would the "best effort" instructions we provide to installations for converting the timestamped counts include what @pdurbin wrote earlier?:
@jggautier thanks! All sounds good. For converting counts, instead of contacting other installations we could make a decision on our own for Harvard Dataverse and then provide the guidance/infrastructure to installations so they can do what we did (or use some parts of the guidance/infrastructure to make their own decisions about old counts). Generally, what @pdurbin suggested makes sense as well.
Some questions and observations that might help with the decision:
We decided first to work on converting counts that have timestamps and exclude counts that don't. While that work continues, we'll keep looking into what to do with the counts that don't have timestamps (including asking other repositories if they ran into this issue and how they handled it, and if the MDC leaders have suggestions). Continuing from what @pdurbin wrote about converting timestamped counts:
This will include all counts up to and including the previous month, right? For example, if on June 15 the Counter Processor is used to populate the new "datasetmetrics" table and send the historical data to DataCite, the historical data would be activity (downloads/explores) that happened up to and including May. Then when the repository starts sending regular monthly reports to DataCite, it'll continue with the following complete month? In the example, the report would be sent sometime in July and include activity in June.
counter-processor should be able to remove bot and duplicate counts if the info is available, so if the sysadmin has a way of determining these things, it might be easiest for them to just include that info in the made-up log rather than trying to process it themselves. (I'm not sure what info is available, but suppose, for example, that only one person has requested access to restricted files in a dataset and there are three quick downloads of them. If there isn't any info about the session or IP address to give counter-processor, one could still infer that they are from the same person. If so, rather than trying to calculate whether the downloads should all be counted, one could just add made-up info about the session to the log and let counter-processor do the math. Similarly for other things: rather than making the decision in new code, one could just make sure the choice being made, about whether it's the same person or whether it was a robot, is conveyed in the made-up log file, and counter-processor can handle it from there.)

counter-processor nominally runs daily and reports on the current month, up to yesterday. Once it's processed, it reports to Dataverse, which clears the table for the current month and takes the new aggregate info, and it updates DataCite, again reporting new aggregate numbers that are used in place of the old ones. To process old months, you can tell counter-processor the year-month combo to process, and it will process all the logs for that month and send one report to Dataverse and DataCite, i.e. no need to run it for each day in an old month. Similarly, if the cron breaks and you skip some days, as long as you process again in that month, counter-processor will catch up and report all the days it missed.
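For example, reprocessing a past month along the lines described above might amount to setting the year-month in counter-processor's config rather than running once per day. This is a hedged sketch from memory of its sample config; verify the exact option names against the counter-processor README:

```yaml
# Sketch of reprocessing an old month with counter-processor.
# Option names here are approximations of its sample config;
# check the counter-processor README before relying on them.
year_month: "2019-03"        # process all of March 2019 in one run
log_name_pattern: "/var/logs/counter_{year_month_day}.log"
output_file: "/tmp/make-data-count-report"
```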
Aside from the question of how much, and what kind of, extra info can be extracted from the logs, I just want to make sure we don't count on it too much, specifically where downloads that happened in the past are concerned. In our own production environment we only keep the access logs around for some number of months. We have never assumed that it was something we want to archive and store permanently. We should also assume that the way other installations handle this may vary, spanning the full range from permanently archiving all their access logs to not having access logs enabled at all. For the purposes of accounting for ongoing downloads, it is safe to assume that the logs for, say, the past week are always going to be present. (And we can instruct other installations that this is required.) But it would probably be worth ensuring that we store all the information needed for the MDC metrics right away, so that going back to reading access logs isn't going to be necessary (this may already be the case going forward; I'm just not 100% sure from reading the above).
@qqmyers (et al) On the issue of recognizing "bot/crawler activity": I guess the bottom line is, this would be something very specific to individual installations, and their use patterns. And subject to their admins looking at the logs and recognizing some use patterns as bot activity. (the word "bot" in the logged |
@landreev |
lol, yes - but do note that I eventually noticed, and corrected it! |
True. Most of our byte range access requests are from this user agent "reqwest": (but do note almost just as many with no user agent supplied)
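A sketch of user-agent-based classification: the patterns below are illustrative only (counter-processor pulls its actual robot patterns from a maintained list rather than hard-coding them), and as noted above, an empty user agent has to be handled too:

```python
import re

# Illustrative robot patterns; counter-processor fetches a maintained
# regex list rather than hard-coding one like this.
ROBOT_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"bot", r"crawl", r"spider")]

def classify_user_agent(ua):
    if not ua:
        return "unknown"   # many requests arrive with no user agent at all
    if any(p.search(ua) for p in ROBOT_PATTERNS):
        return "robot"
    return "human-or-machine"  # "machine" (e.g. curl) can still be legitimate

print(classify_user_agent("Googlebot/2.1"))  # robot
print(classify_user_agent("reqwest"))        # human-or-machine, until flagged
```

Note that "reqwest" sails through a pattern list like this, which is exactly why user agents alone aren't a reliable filter.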
Thanks. This is great! I'd like to make sure we're using the same terms for things, preferably the terms used in the COUNTER spec, to reduce confusion (maybe just my own confusion, as the least technical person here :) ). For example, @qqmyers you wrote that "counter-processor uses dynamic lists for bots and machine counts." These dynamic lists are used for recognizing activity that should be excluded, right? But the COUNTER spec allows for counting "legitimate" machine activity, like scripts run on a user's computer. I wouldn't want anyone else reading this thread to think that machine counts and bot counts are the same thing. It's important that any method for excluding activity doesn't exclude legitimate machine counts.

I've been keeping a table (just moved it to a Google Sheet) so that I understand how the decisions we make, and the decisions made by CDL's counter-processor, affect counts in any situation I'm aware of, including the instance of partial download counts reported in IQSS/dataverse#5957. If this might be helpful for anyone else, please feel free to contribute or question an interpretation. (I know that for this issue's scope, only the situations involving downloads/explores/requests matter, since Dataverse installations haven't been counting other types of activity, like pageviews (outside of third-party analytics tools).)

@landreev, you wrote in IQSS/dataverse#5957 that the group that downloaded only a part of many files did that "to keep their distributed copies in sync". I just want to make sure I understand why we should label that activity as bot/crawler activity. The COUNTER spec defines "Internet Robots and Crawlers" as agents whose actions are "intended for search indexing and related applications".
Since there's no one-size-fits-all method for recognizing bot/crawler activity, are there a number of methods we can recommend to installations to ensure that they exclude bots/crawlers? It sounds like installations can't always rely on user agents or the dynamic lists that counter-processor uses (we couldn't have anticipated that "reqwest" would start downloading parts of thousands of files). But if I understand what @qqmyers wrote, if an installation discovers activity that was reported as legitimate but should have been excluded as bot/crawler activity, that installation can re-send corrected reports for the months of that activity.
@jggautier Re: all those automated partial downloads: I can look up the actual conversation with the user who published the original file that started the whole thing. I don't remember all the details, tbh; that big 2.5GB file is an archive of some TeX macros. Why the individual clients need to make thousands of partial downloads at regular intervals exactly, I can't necessarily explain. But I do think it's not a stretch to say that it probably falls under "search indexing and related applications".

My main arguments in favor of NOT counting these as regular downloads: It just feels wrong. I'm not talking about the MDC metrics, just the fact that these thousands of downloads of just a few bytes each are shown as if somebody actually downloaded a multi-GB file thousands of times. It likely distorts our download statistics at this point. Also, it's kind of a waste of resources on the Dataverse side: each download count is actually a database table entry (in the GuestBookResponse table).

Another way of looking at this: maybe we should not focus on the "partial", but instead on the fact that they are repeated downloads from the same IP address. So perhaps we should count the first one as a regular download, but ignore the consecutive hundreds of them if they are happening within some number of hours, or within the same day, etc. This way we don't need to pay attention to whether they are requesting just a few select bytes or the whole file.
(Just want to emphasize that I don't have a strong opinion on how we should be treating these "partial downloads". My only strong opinion is that what we are doing now feels wrong. Not counting any partial downloads would be a bit over-the-top. Asking for 1GB from a 2GB file, for example, does feel like a "real" download... So maybe my other idea, not counting repeated downloads of the same file over a short period of time, is a better bet?)
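The idea above could be sketched like this, purely for illustration; the 24-hour window and the record shape are assumptions, not anything implemented in Dataverse:

```python
# Count only the first download per (ip, file) within a given window,
# ignoring rapid repeats from the same address, per the idea above.
# The 24-hour window is an arbitrary choice for illustration.
WINDOW = 24 * 3600  # seconds

def count_downloads(records, window=WINDOW):
    first_seen = {}  # (ip, file) -> timestamp of last counted download
    total = 0
    for ip, file_id, ts in sorted(records, key=lambda r: r[2]):
        key = (ip, file_id)
        if key not in first_seen or ts - first_seen[key] >= window:
            first_seen[key] = ts
            total += 1
    return total

# 3 quick partial downloads of one file from one IP -> 1 count,
# plus 1 download from a second IP -> 2 total
events = [("203.0.113.1", "f1", 0), ("203.0.113.1", "f1", 60),
          ("203.0.113.1", "f1", 120), ("198.51.100.9", "f1", 90)]
print(count_downloads(events))  # 2
```

This counts by IP and time only, so it never needs to inspect byte ranges.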
Ah, thanks @landreev. I'm leaning toward thinking that the partial downloads reported in IQSS/dataverse#5957 should be excluded from the MDC downloads, too (they should not be populated in the datasetmetrics table, so shouldn't be reported to DataCite). I've always assumed that IQSS/dataverse#5957 wasn't prioritized because this MDC work would take care of problems like these, by excluding this type of activity or differentiating unique from total counts, and "regular" counts from machine counts. Does that make sense? If it's not clear from the conversation with the user who published the original file that started the whole thing in IQSS/dataverse#5957, I could reach out to them to ask if you like. There might be Dataverse installations that continue to provide non-MDC counts, which also contain repeated downloads of the same file over a short period of time. But I think it's better to encourage and help Dataverse repositories to follow a standard that other platforms are following, too.
Maybe this could be resolved as the GuestBookResponse table is further separated from the download stats? (Of course I'm thinking this is out of scope for this issue/spike.) I think the next steps are:
@jggautier I still don't fully know how to get it done. But I now understand what we are trying to achieve, enough to know what questions to ask. So I'm going to be doing that, below.
@jggautier
Sorry that was so long. I'm not personally convinced all that MDC business is worth the amount of attention we've already put into it. But it sounds like it's important to at least some people, so here we are.
@jggautier |
@jggautier
Linking the issue IQSS/dataverse#9025 in the main project - a lot of the more recent discussion concerning this issue was happening there. As we revisit it during this spike, let's make sure to take any potentially useful information there into consideration. |
I'm listing some of the collections in Harvard Dataverse whose admins either rely on download counts now or have told us that they are very interested in being able to rely on them, such as for measuring the impact of their data and data-sharing efforts. We/I can talk to the admins of these collections so that how we implement Make Data Count in Harvard Dataverse is informed by a better understanding of the needs of users in Harvard Dataverse:
It does sound to me like it has been established that there are local users/collections who value their existing download counts. While it may have some value to further investigate their needs, I'm not sure it's really necessary for the purposes of deciding how to proceed with the dev plan. (We already know we can't afford to drop the existing counts.)
Discussed at standup. No objection to showing both counts. The next step is probably to see if we can update and merge Jim's pull request: |
We gave the PR a 10. This one: IQSS/dataverse#6543
For the sake of posterity, I should say that although I wrote that I'd like to learn from admins of collections in Harvard Dataverse who generally rely on metrics, I wasn't able to talk with them about this. One of the things I wanted to learn was whether it was necessary to show both counts. Unfortunately, @landreev let me know that my comment was taken to mean that these users would like both counts to be shown, which may or may not be true, and supported the idea of a solution where both counts are shown.
Established Dataverse installations that have been operating for years might be reluctant to turn on Make Data Count (MDC) because the download counts will be reset to zero unless something is done to somehow copy the "classic" download counts into the new "datasetmetrics" database table that powers MDC download metrics. For example, Harvard Dataverse has over 10 million "classic" downloads:
Many Dataverse installations probably don't have all the Apache (or Glassfish or whatever) access logs from years ago lying around, but the database table "filedownload" could be used as a source for timestamps of downloads from the "classic" system. After standup on 2020-02-05, @djbrooke and @kcondon talked about this and I made the following diagram (best to open it in a new window since the text is so small).

Source for the image above: make-data-count.uml.txt
This is what I added to the diagram, which is based on http://guides.dataverse.org/en/4.19/admin/make-data-count.html#architecture
This is a bit hand-wavy because we'd still use SUSHI as indicated by the "Log Processing" part of the diagram. Roughly, the idea is this:
See also pull request IQSS/dataverse#6543
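The backfill idea above amounts to bucketing "classic" download timestamps from the "filedownload" table into per-month totals, the granularity MDC reports need. A rough sketch (the function and record shape here are hypothetical, not code from the pull request):

```python
from collections import Counter
from datetime import datetime

def monthly_counts(download_times):
    """Bucket classic download timestamps into (year, month) totals.
    Timestamp-less rows (None), like Harvard Dataverse's ~600k old
    counts, are set aside rather than guessed at."""
    buckets = Counter()
    missing = 0
    for ts in download_times:
        if ts is None:
            missing += 1
        else:
            buckets[(ts.year, ts.month)] += 1
    return buckets, missing

times = [datetime(2019, 3, 5), datetime(2019, 3, 20), datetime(2019, 4, 1), None]
counts, missing = monthly_counts(times)
print(counts[(2019, 3)], counts[(2019, 4)], missing)  # 2 1 1
```

The `missing` bucket is exactly the set of counts the thread debates: they can't be converted, only kept as a separate "classic" total.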