Spike: Test Counter Processor for use by Dataverse in implementation of Make Data Count #5385

Closed
matthew-a-dunlap opened this issue Dec 7, 2018 · 5 comments

Comments

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Dec 7, 2018

For Dataverse to generate aggregated usage data to comply with the Make Data Count specifications, raw data access logs will need to be processed for a variety of reasons. Furthermore, this aggregated data needs to be sent to the Make Data Count servers for centralized access and further use.

An implementation already exists that fills both of these needs, namely Counter Processor. Before we commit to this tool, we need to confirm that it provides enough value to justify an additional dependency. Even if we decide not to use it inside Dataverse, exploring it will help us understand the practical considerations in generating and processing our logs. The tool could also serve as a "mock" implementation to test against while we develop our own implementation in-house.

@matthew-a-dunlap
Contributor Author

Reasons for processing:

  • Anonymizing data
  • Removing duplicate clicks within 30 seconds
  • Aggregating multiple interactions within a session as unique
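
The last two reasons above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Counter Processor implements it: the function names, the `(timestamp, session_id, url)` event shape, and the choice to keep the first click of a burst are all assumptions made for the example, and anonymization is not covered.

```python
from datetime import datetime, timedelta

# Window taken from the "duplicate clicks within 30 seconds" rule above.
DOUBLE_CLICK_WINDOW = timedelta(seconds=30)


def drop_double_clicks(events):
    """Drop repeat hits on the same URL by the same session within 30 seconds.

    `events` is a list of (timestamp, session_id, url) tuples sorted by
    timestamp. This hypothetical helper keeps the first click of a burst;
    the actual specification may require keeping a different one.
    """
    last_seen = {}
    kept = []
    for timestamp, session_id, url in events:
        key = (session_id, url)
        previous = last_seen.get(key)
        if previous is None or timestamp - previous > DOUBLE_CLICK_WINDOW:
            kept.append((timestamp, session_id, url))
        # Always remember the latest hit so a chain of rapid clicks collapses.
        last_seen[key] = timestamp
    return kept


def count_unique(events):
    """Count each (session, url) pair once, however often it was hit."""
    return len({(session_id, url) for _, session_id, url in events})


start = datetime(2018, 5, 31, 14, 52, 18)
events = [
    (start, "s1", "/dataset/A"),
    (start + timedelta(seconds=5), "s2", "/dataset/A"),
    (start + timedelta(seconds=10), "s1", "/dataset/A"),  # double click, dropped
    (start + timedelta(seconds=50), "s1", "/dataset/A"),  # 40 s later, kept
]
kept = drop_double_clicks(events)
print(len(kept), count_unique(kept))  # prints: 3 2
```

Here three of the four hits survive the double-click filter, and they collapse to two unique session/URL pairs.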

@pdurbin
Member

pdurbin commented Dec 10, 2018

It's pretty hacky but I started trying to stand up counter-processor in Vagrant in 02c5538

@djbrooke
Contributor

The outcome of this spike will be a well-informed decision to use the processor, fork it, or write our own. We will also need a plan for handling multiple-server situations.

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Dec 19, 2018

Counter-processor is creating sushi more or less as expected. We haven't been able to test the outputted sushi against DataCite at this point, but it's a step in the right direction.

There are some questions about counter-processor and the outputted sushi:

  • Does counter processor actually process file names and their sizes into sushi?
  • Does IPv6 actually work as a raw input?
  • Is publisher_id actually required by Make Data Count? This is the one field we don't have in Dataverse (unless I'm mistaken).

Sample raw logs piped into counter processor. Two lines are from Dataverse and one is from the counter-processor sample logs:

#Fields: event_time	client_ip	session_cookie_id	user_cookie_id	user_id	request_url	identifier	filename	size	user-agent	title	publisher	publisher_id	authors	publication_date	version	other_id	target_url	publication_year
2018-05-31T14:52:18-0500	0.0.0.1	-	-	3939	http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/QFL54A	hdl:10.5072/FK2/QFL54A	testfile.txt	33	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36	fake	Root	grid.266093.8	Admin, Dataverse	2018-05-13T20:39:31Z	1	-	http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/QFL54A	4242	2018
2018-05-31T16:55:18-0500	128.195.188.234	testsession	testcookie	:guest	http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/QFL54A	doi:10.5072/FK2/QFL54B	-	-	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36	Another Title	Root	1	Admin, Dataverse | Otherman	2018-05-15T20:39:31Z	1	testother	-	42	2018
2018-05-31T00:04:18-07:00	0.0.0.0	-	-	-	http://dash.ucmerced.edu/stash/downloads/file_download/1904	doi:10.7272/Q6KW5CXVz	All_PD/PD-RAW16_P046X2.hdr	348	-	Predicting Cognitive Decline	UC San Francisco	grid.266102.1	Michael W. Weiner	2013-08-22T17:26:04Z	1	-	https://datashare.ucsf.edu/stash/dataset/doi:10.7272/Q6KW5CXVz	2012
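
For anyone who wants to inspect logs in this shape before feeding them to counter-processor, here is a minimal parsing sketch. It assumes the file is tab-separated with a `#Fields:` header line and that `-` marks an empty value, as the sample above suggests. `parse_log` is a hypothetical helper, not part of counter-processor, and the three-column sample is shortened for illustration.

```python
FIELDS_PREFIX = "#Fields:"


def parse_log(lines):
    """Turn tab-separated log lines into dicts keyed by the #Fields header.

    A value of "-" is mapped to None. zip() stops at the shorter sequence,
    so surplus trailing values on a row are silently ignored.
    """
    field_names = None
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(FIELDS_PREFIX):
            field_names = line[len(FIELDS_PREFIX):].strip().split("\t")
            continue
        if field_names is None:
            raise ValueError("log row found before the #Fields header")
        values = line.split("\t")
        records.append(
            {
                name: (None if value == "-" else value)
                for name, value in zip(field_names, values)
            }
        )
    return records


# Shortened, illustrative input: only three of the columns shown above.
sample = [
    "#Fields: event_time\tclient_ip\tfilename\n",
    "2018-05-31T00:04:18-07:00\t0.0.0.0\t-\n",
]
records = parse_log(sample)
print(records[0]["client_ip"], records[0]["filename"])  # prints: 0.0.0.0 None
```

Comparing `len(values)` with `len(field_names)` inside the loop would be a cheap way to flag rows whose column count does not match the header.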

Outputted sushi from counter processor:

{
  "report-header": {
    "report-name": "dataset report",
    "report-id": "DSR",
    "release": "rd1",
    "created": "2018-05-31",
    "created-by": "Dash",
    "report-attributes": [],
    "reporting-period": {
      "begin-date": "2018-05-01",
      "end-date": "2018-05-31"
    },
    "report-filters": [],
    "exceptions": [
      {}
    ]
  },
  "report_datasets": [
    {
      "dataset-title": "fake",
      "dataset-id": [
        {
          "type": "hdl",
          "value": "10.5072/FK2/QFL54A"
        }
      ],
      "dataset-contributors": [
        {
          "type": "name",
          "value": "Admin, Dataverse"
        }
      ],
      "dataset-dates": [
        {
          "type": "pub-date",
          "value": "2018-05-13"
        }
      ],
      "platform": "Dash",
      "publisher": "Root",
      "publisher-id": [
        {
          "type": "grid",
          "value": "266093.8"
        }
      ],
      "data-type": "dataset",
      "yop": "4242",
      "uri": "http://localhost:8080/dataset.xhtml?persistentId=doi:10.5072/FK2/QFL54A",
      "performance": [
        {
          "period": {
            "begin-date": "2018-05-01",
            "end-date": "2018-05-31"
          },
          "instance": []
        }
      ]
    },
    {
      "dataset-title": "Another Title",
      "dataset-id": [
        {
          "type": "doi",
          "value": "10.5072/FK2/QFL54B"
        }
      ],
      "dataset-contributors": [
        {
          "type": "name",
          "value": "Admin, Dataverse "
        },
        {
          "type": "name",
          "value": " Otherman"
        }
      ],
      "dataset-dates": [
        {
          "type": "pub-date",
          "value": "2018-05-15"
        }
      ],
      "platform": "Dash",
      "publisher": "Root",
      "publisher-id": [
        {
          "type": "",
          "value": ""
        }
      ],
      "data-type": "dataset",
      "yop": "42",
      "uri": null,
      "performance": [
        {
          "period": {
            "begin-date": "2018-05-01",
            "end-date": "2018-05-31"
          },
          "instance": []
        }
      ]
    },
    {
      "dataset-title": "Predicting Cognitive Decline",
      "dataset-id": [
        {
          "type": "doi",
          "value": "10.7272/Q6KW5CXVz"
        }
      ],
      "dataset-contributors": [
        {
          "type": "name",
          "value": "Michael W. Weiner"
        }
      ],
      "dataset-dates": [
        {
          "type": "pub-date",
          "value": "2013-08-22"
        }
      ],
      "platform": "Dash",
      "publisher": "UC San Francisco",
      "publisher-id": [
        {
          "type": "grid",
          "value": "266102.1"
        }
      ],
      "data-type": "dataset",
      "yop": "2012",
      "uri": "https://datashare.ucsf.edu/stash/dataset/doi:10.7272/Q6KW5CXVz",
      "performance": [
        {
          "period": {
            "begin-date": "2018-05-01",
            "end-date": "2018-05-31"
          },
          "instance": [
            {
              "access-method": "machine",
              "metric-type": "total-dataset-requests",
              "count": 1,
              "country-counts": {}
            },
            {
              "access-method": "machine",
              "metric-type": "total-dataset-investigations",
              "count": 1,
              "country-counts": {}
            }
          ]
        }
      ]
    }
  ]
}
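
A quick way to see which datasets actually received counts is to total the `instance` entries per dataset. The sketch below assumes the key names exactly as they appear in the output above (including `report_datasets` with an underscore). `summarize` is a hypothetical helper, and the embedded report is trimmed to the fields it reads.

```python
def summarize(report):
    """Map each dataset title to its summed counts per metric type."""
    summary = {}
    for dataset in report.get("report_datasets", []):
        totals = {}
        for performance in dataset.get("performance", []):
            for instance in performance.get("instance", []):
                metric = instance["metric-type"]
                totals[metric] = totals.get(metric, 0) + instance["count"]
        summary[dataset["dataset-title"]] = totals
    return summary


# Trimmed copy of two datasets from the report above.
report = {
    "report_datasets": [
        {"dataset-title": "fake", "performance": [{"instance": []}]},
        {
            "dataset-title": "Predicting Cognitive Decline",
            "performance": [
                {
                    "instance": [
                        {"metric-type": "total-dataset-requests", "count": 1},
                        {"metric-type": "total-dataset-investigations", "count": 1},
                    ]
                }
            ],
        },
    ]
}
summary = summarize(report)
for title, totals in summary.items():
    print(title, totals)
```

On the report above, this makes it easy to spot that both Dataverse rows came out with an empty `instance` list, while the counter-processor sample row produced one request and one investigation.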

@matthew-a-dunlap
Contributor Author

After discussion with @djbrooke, we decided that this spike investigating Counter Processor is done and can be closed.
