
Generate verbose data usage logs for processing and communication with Make Data Count #5384

Closed
matthew-a-dunlap opened this issue Dec 7, 2018 · 5 comments


matthew-a-dunlap commented Dec 7, 2018

To generate information on views/downloads/citations for Make Data Count, the first step is to log the raw usage data for later processing.

Our current goal in this logging is to have the syntax match what is used by Counter Processor. This is so we can potentially use Counter Processor to process our raw logs. Even if we do not use Counter Processor, this syntax is a good starting point for our development efforts.
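As a rough illustration of the target, Counter Processor ingests tab-separated daily log files. The sketch below is hypothetical: the field names and order are made up for illustration and would need to match whatever Counter Processor actually expects.

```python
import datetime

# Illustrative only: the real field list and order must match Counter
# Processor's expected log format; these names are placeholders.
def format_log_entry(event_time, client_ip, session_id, doi, filename, size, user_agent):
    fields = [event_time.isoformat(), client_ip, session_id, doi,
              filename, str(size), user_agent]
    # Missing values are conventionally logged as "-" placeholders.
    return "\t".join(f if f else "-" for f in fields)

line = format_log_entry(datetime.datetime(2018, 5, 1), "127.0.0.1", "",
                        "doi:10.5072/FK2/QFL54A", "data.tab", 1234, "curl/7.61")
print(line)
```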

See this doc for some thoughts about our design path.

@matthew-a-dunlap (Contributor, Author) commented:

This example from counter-processor is a good reference for how we want to format our logs.

matthew-a-dunlap pushed a commit that referenced this issue Dec 13, 2018
Actual logging is only taking place on the dataset page, and that is incomplete
matthew-a-dunlap pushed a commit that referenced this issue Dec 18, 2018
@matthew-a-dunlap (Contributor, Author) commented:

To work on this I've piped some log output from my logging code into counter-processor (using @pdurbin's Vagrant setup in #5385). It parses my log file into the database OK, but then chokes when trying to generate the SUSHI report. Not done, but definitely progress.

(Note: between runs, if you delete both files in the state folder, it will start over from scratch.)
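That state-reset step could be scripted; `reset_counter_state` below is a hypothetical convenience helper, not part of counter-processor itself.

```python
import glob
import os

# Hypothetical helper for the manual step described above: deleting the
# files in counter-processor's state folder forces a full re-run.
def reset_counter_state(state_dir="state"):
    removed = []
    for path in glob.glob(os.path.join(state_dir, "*")):
        os.remove(path)
        removed.append(path)
    return removed
```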

[root@standalone counter-processor-a73dbced06f0ac2f0d85231e4d9dd4f21bee8487]# CONFIG_FILE=/dataverse/scripts/vagrant/counter-processor-config.yaml python36 main.py
Running report for 2018-05-01T00:00:00 to 2018-06-01T00:00:00
31 daily log file(s) will be added to the database
Last processed date: not processed yet for 2018-05
processing sample_logs/counter_2018-05-01.log
processing sample_logs/counter_2018-05-02.log
processing sample_logs/counter_2018-05-03.log
processing sample_logs/counter_2018-05-04.log
processing sample_logs/counter_2018-05-05.log
processing sample_logs/counter_2018-05-06.log
processing sample_logs/counter_2018-05-07.log
processing sample_logs/counter_2018-05-08.log
processing sample_logs/counter_2018-05-09.log
processing sample_logs/counter_2018-05-10.log
processing sample_logs/counter_2018-05-11.log
processing sample_logs/counter_2018-05-12.log
processing sample_logs/counter_2018-05-13.log
processing sample_logs/counter_2018-05-14.log
processing sample_logs/counter_2018-05-15.log
processing sample_logs/counter_2018-05-16.log
processing sample_logs/counter_2018-05-17.log
processing sample_logs/counter_2018-05-18.log
processing sample_logs/counter_2018-05-19.log
processing sample_logs/counter_2018-05-20.log
processing sample_logs/counter_2018-05-21.log
processing sample_logs/counter_2018-05-22.log
processing sample_logs/counter_2018-05-23.log
processing sample_logs/counter_2018-05-24.log
processing sample_logs/counter_2018-05-25.log
processing sample_logs/counter_2018-05-26.log
processing sample_logs/counter_2018-05-27.log
processing sample_logs/counter_2018-05-28.log
processing sample_logs/counter_2018-05-29.log
processing sample_logs/counter_2018-05-30.log
processing sample_logs/counter_2018-05-31.log

Calculating stats for doi:10.5072/FK2/QFL54A
Traceback (most recent call last):
  File "main.py", line 42, in <module>
    my_report.output()
  File "/scripts/vagrant/counter-processor-a73dbced06f0ac2f0d85231e4d9dd4f21bee8487/output_processor/json_report.py", line 19, in output
    body = {'report_datasets': [self.dict_for_id(my_id) for my_id in self.ids_to_process ] }
  File "/scripts/vagrant/counter-processor-a73dbced06f0ac2f0d85231e4d9dd4f21bee8487/output_processor/json_report.py", line 19, in <listcomp>
    body = {'report_datasets': [self.dict_for_id(my_id) for my_id in self.ids_to_process ] }
  File "/scripts/vagrant/counter-processor-a73dbced06f0ac2f0d85231e4d9dd4f21bee8487/output_processor/json_report.py", line 67, in dict_for_id
    return js_meta.descriptive_dict()
  File "/scripts/vagrant/counter-processor-a73dbced06f0ac2f0d85231e4d9dd4f21bee8487/output_processor/json_metadata.py", line 32, in descriptive_dict
    'publisher-id': [ { 'type': self.meta.publisher_id_type(), 'value': self.meta.publisher_id_bare() } ],
  File "/scripts/vagrant/counter-processor-a73dbced06f0ac2f0d85231e4d9dd4f21bee8487/models/metadata_item.py", line 40, in publisher_id_type
    m = re.search('^([a-zA-Z]{2,10})[\:\.\=].+', self.publisher_id)
  File "/usr/lib64/python3.6/re.py", line 182, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
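The traceback bottoms out in `re.search` being handed a non-string `self.publisher_id`. A standalone reproduction of the failure mode (this mirrors the regex call in `metadata_item.py` but is not counter-processor's actual code):

```python
import re

# When a log line carries no publisher id, publisher_id ends up None,
# and re.search requires a str or bytes-like second argument.
def publisher_id_type(publisher_id):
    m = re.search(r'^([a-zA-Z]{2,10})[:.=].+', publisher_id)
    return m.group(1) if m else None

print(publisher_id_type("grid.266093.8"))  # -> grid
try:
    publisher_id_type(None)
except TypeError:
    print("TypeError, as in the traceback above")
```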


pdurbin commented Dec 20, 2018

self.publisher_id

I was just passing along to @sekmiller that @matthew-a-dunlap has already identified the issue with "publisher_id" above, but from running the code as of 2e2965c we are not yet populating publisher-id (I just noticed //entry.setPublisherId(); is commented out).

As I mentioned at #4821 (comment) the sample log uses "publisher" and "publisher id" like this:

publisher: UC Irvine
publisher_id: grid.266093.8

GRID seems to be https://en.wikipedia.org/wiki/Global_Research_Identifier_Database (perhaps a little like ISNI?)

A workaround for now is to hack on the logs and replace "-" with "1" in the appropriate "publisher id" field so that counter-processor doesn't throw the exception above.
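That log hack could be scripted. This is a hypothetical sketch: `PUB_ID_IDX` is an assumed position for the publisher_id field in the tab-separated log lines and would need to be checked against the actual log format.

```python
# Assumption: publisher_id is the 13th tab-separated field (index 12).
# Verify this against your own log files before using.
PUB_ID_IDX = 12

def patch_line(line):
    # Replace a "-" placeholder in the publisher_id field with "1" so
    # counter-processor's regex gets a real string instead of choking.
    fields = line.rstrip("\n").split("\t")
    if len(fields) > PUB_ID_IDX and fields[PUB_ID_IDX] == "-":
        fields[PUB_ID_IDX] = "1"
    return "\t".join(fields) + "\n"
```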

I guess I'm a little confused about who the publisher is supposed to be and if that publisher will always have an id.

matthew-a-dunlap pushed a commit that referenced this issue Jan 4, 2019
Lots of comments in this commit, will clean up soon
matthew-a-dunlap pushed a commit that referenced this issue Jan 4, 2019
matthew-a-dunlap pushed a commit that referenced this issue Jan 4, 2019
Next step is to pipe these into counter processor and see what happens on the MDC server
matthew-a-dunlap pushed a commit that referenced this issue Jan 7, 2019
matthew-a-dunlap pushed a commit that referenced this issue Jan 7, 2019
This is incomplete, as most downloads don't log to the guestbook until the end via the DownloadInstanceWriter, and it is unclear how to get this info that late in the pipe.
matthew-a-dunlap pushed a commit that referenced this issue Jan 8, 2019
Access API & DownloadInstanceWriter now have the info needed to create an MDC entry. This included adding a new constructor for MDC logging that takes uriInfo and headers.
matthew-a-dunlap pushed a commit that referenced this issue Jan 8, 2019
At this point our logs with the custom regex are readable by Counter Processor and it generates SUSHI. Right now it is not reporting ANY unique investigations, and it almost certainly should. More investigation is needed.
@matthew-a-dunlap (Contributor, Author) commented:

It seems that our lack of a publisher for our Datasets/Files is a problem when submitting to Make Data Count; those fields are required.

We can bypass this check if we submit it blank with a type chosen:

"publisher": "", "publisher-id": [{"type": "client-id", "value": ""}]

But I suspect this is not good practice and may bite us down the road.

This is the error if it's omitted:

- "#/report-datasets/0": "The property '#/report-datasets/0' did not contain a required property of 'publisher' in schema 7757177d-ae02-5888-8cdf-d748b3fb8616#"
- "#/report-datasets/0": "The property '#/report-datasets/0' did not contain a required property of 'publisher-id' in schema 7757177d-ae02-5888-8cdf-d748b3fb8616#"

This is the error if it's blank:

- "#/report-datasets/0/publisher-id/0/type": "The property '#/report-datasets/0/publisher-id/0/type' value \"\" did not match one of the following values: isni, orcid, grid, urn, client-id in schema 7757177d-ae02-5888-8cdf-d748b3fb8616#"


pdurbin commented Jan 11, 2019

I processed the logs into a SUSHI report, which I tried to import into the datasetmetrics table, but it didn't work. The commands I used:

[root@uiswhlpt3621005 counter-processor-0.0.1]# cd /tmp/counter-processor-0.0.1
[root@uiswhlpt3621005 counter-processor-0.0.1]# rm -f state/counter_db_2019-01.sqlite3 state/statefile.json 
[root@uiswhlpt3621005 counter-processor-0.0.1]# CONFIG_FILE=/root/make-data-count/counter-processor-config.yaml python36 main.py
Running report for 2019-01-01T00:00:00 to 2019-02-01T00:00:00
11 daily log file(s) will be added to the database
Last processed date: not processed yet for 2019-01
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-01.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-02.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-03.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-04.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-05.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-06.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-07.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-08.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-09.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-10.log
processing /usr/local/glassfish4/glassfish/domains/domain1/logs/counter_2019-01-11.log

Calculating stats for doi:10.5072/FK2/BL2IBM

Writing JSON report to /dataverse/sushi_sample_logs.json
[root@uiswhlpt3621005 counter-processor-0.0.1]# curl -s -X POST 'http://localhost:8080/api/admin/makeDataCount/:persistentId/addUsageMetricsFromSushiReport?reportOnDisk=/dataverse/sushi_sample_logs.json&persistentId=doi:10.5072/FK2/BL2IBM' | jq .
{
  "status": "OK",
  "data": {
    "message": "Dummy Data has been added to dataset :persistentId"
  }
}
[root@uiswhlpt3621005 counter-processor-0.0.1]# psql -U dvnuser dvndb -c 'select * from datasetmetrics;'
 id | countrycode | downloadstotal | downloadsunique | monthyear | viewstotal | viewsunique | dataset_id 
----+-------------+----------------+-----------------+-----------+------------+-------------+------------
(0 rows)

[root@uiswhlpt3621005 counter-processor-0.0.1]# 

I had to tweak the config.yaml file: counter-processor-config.yaml.txt

I was running 69d48c7

Here's the sushi json file: sushi69d48c7.json.txt
