Put all system messages in collection note table #242

Closed

jpmckinney opened this issue Dec 16, 2019 · 11 comments

jpmckinney commented Dec 16, 2019

In summary:

  • Add columns to the collection_note table for:
    • severity (warning, error)
    • the source (e.g. the store, upgrade, or compile step)
    • the type of subject (e.g. file item, file, collection)
    • the primary key of the subject (e.g. 1, 3, 10)
    • Update: In v2, the collection_note table has a data JSON field, but there are so few occasions for errors or warnings at present that it doesn't really matter to have the additional detail.
  • Have the check step store errors in this table, instead of release_check_error and record_check_error
    • v2 doesn't catch these errors, which shouldn't occur anyway.
  • Have the load step store warnings and errors in this table, instead of the warnings and errors columns of the collection_file and collection_file_item tables
    • v2 stores FileError items from Kingfisher Collect as CollectionNote rows.
    • Per Correctly handle NUL characters (control code) #263, v1 removed a lot of control characters for no reason. It then appended a warning about each control code it replaced. In v2, we only replace the NUL escape sequence, and this is not notable for logging.
    • v1 stored errors about invalid JSON. v2 (and Kingfisher Collect) doesn't allow invalid JSON to reach its queues.
    • v1 had many warnings in ocdskingfisherprocess/transform/compile_releases.py (including a warning for two records with the same OCID, which is impossible per the database index). v2's record_compiler creates notes for these (with more precise messages).
  • As upgrade-1-0-to-1-1: Store warnings (in collection_note table) #67, compile-releases: Store warnings (in collection_note table) #222, Compile Releases Transform - include records table #63 (and any others) are closed, their warnings should be logged in the collection_note table

There are likely existing Python packages that implement warning and/or logging handlers that store messages in the DB.


The warnings and errors produced by the upgrade #67 and compile #222 transforms can be logged by a new logger, whose handler stores the messages in the database, associated with the relevant collection.
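
As a rough illustration, here is a minimal sketch of such a handler, assuming psycopg2 and the existing collection_note columns (id, collection_id, note, stored_at); the class name, logger name, and connection string are hypothetical:

    import logging

    import psycopg2  # assumes the PostgreSQL driver used by Kingfisher Process

    class CollectionNoteHandler(logging.Handler):
        """Stores each log record as a note on the given collection."""

        def __init__(self, connection, collection_id):
            super().__init__()
            self.connection = connection
            self.collection_id = collection_id

        def emit(self, record):
            with self.connection, self.connection.cursor() as cursor:
                cursor.execute(
                    "INSERT INTO collection_note (collection_id, note, stored_at) "
                    "VALUES (%s, %s, now())",
                    (self.collection_id, self.format(record)),
                )

    # Attach the handler to the logger used by the upgrade or compile step.
    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    logger = logging.getLogger("compile")  # hypothetical logger name
    logger.addHandler(CollectionNoteHandler(connection, collection_id=10))
    logger.warning("Two records have the same OCID")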

That way, analysts can query the table for warnings/errors, and the web UI can also report the number and list of warnings/errors.

This table can also be used for the warnings/errors that are presently stored separately on each file and file item as JSON (while preserving the ID of the file or file item to which the message relates). I assume that if I'm checking a collection before analyzing it, it'd be useful to see all warnings/errors in one place, rather than having to separately query the file and file item tables for any potential warnings/errors. I also assume that, most of the time, it's more useful to know that a particular error occurred within a collection than to know the specific file/item on which it occurred (though the new table would still make it easy to know that). Do others agree with moving all such messages to one table?

The new table would thus store the severity (warning, error), the message, the source (e.g. the store, upgrade, or compile step), and the subject (e.g. file item 1, file 3, collection 10).
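
For concreteness, a sketch of those columns as a PostgreSQL migration via psycopg2; the column names and types are illustrative, not a settled design:

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        cursor.execute("""
            ALTER TABLE collection_note
                ADD COLUMN severity text,      -- 'warning' or 'error'
                ADD COLUMN source text,        -- e.g. 'store', 'upgrade', 'compile'
                ADD COLUMN subject_type text,  -- e.g. 'collection_file_item', 'collection_file', 'collection'
                ADD COLUMN subject_id integer  -- primary key of the subject row
        """)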

jpmckinney modified the milestones: V2 preparation, V2 consultation Dec 16, 2019

romifz commented Dec 16, 2019

To be honest, I never had the need to check these messages, but the proposal sounds good.


yolile commented Dec 17, 2019

I agree with this, and I also suggest storing the logs about the insertions in the database as well (when Scrapy sends a POST to Kingfisher Process). We had cases where we received a 500 or similar error from Kingfisher Process without knowing why.


jpmckinney commented Dec 17, 2019

@yolile Do you want to store a log of successful insertions (e.g. "[filename] stored in collection_file")? Or is it sufficient to store the request that caused the 500 error?

Also, in future, please open issues for any unexpected errors, so that they can be investigated. I have access to Sentry, which might provide more detail on some 500 errors.


yolile commented Dec 17, 2019

I think it would be OK to store only the non-200 ones.

"Also, in future, please open issues for any unexpected errors, so that they can be investigated. I have access to Sentry, which might provide more detail on some 500 errors."

Ok, I will

@jpmckinney

I just noticed that the release_check_error and record_check_error tables are to store exceptions raised while trying to check a release/record (rather than to store errors in the data). Both tables are presently empty. Since these tables are just for logging exceptions that might occur unexpectedly, I figure they can also be moved to the table proposed in this issue.

Since the tables are empty, I figure there is no disruption to existing practices if they are dropped.

@duncandewhurst

I'm happy with the proposal.

Like @romifz, I also haven't actually checked any of these errors before.

If it's important to check these, even when it looks like a load has run successfully, then perhaps we should document them in a loading data checklist.

@jpmckinney

Noting that I'll handle this in the new system, as writing a PR for the present system will touch too many parts of the code (since it's not just a simple database change).


pindec commented Dec 20, 2019

A separate table would be helpful, and we can/should add checks and explanatory notes to the data feedback colab notebooks.

The #222 issue meant that the data which data journalists were looking at did not accurately reflect the source: because of how the data was structured, the compile script removed data it considered duplicated. As we increase OCDS data use/re-use, e.g. via views, we need to verify that the compilation and upgrade scripts that generate those views haven't removed data due to data quality issues.

Because the existing system didn't throw/capture errors, we only realised the issue belatedly.


odscjames commented Apr 22, 2020

Personal

I wanted to comment and say I'm taking a lesson from this. The warnings and errors columns on several tables were put in to address the use case that analysts need to know if there were problems converting or handling the data. But I don't think I put enough effort into communicating that to analysts and discussing it with them. The lesson for me is to make sure I communicate more on things like this.

Where to store

I don't think it makes much difference whether warnings / errors are stored in columns on data tables or in a separate table. In either case we should be able to write something so it's easy for analysts to see the info.

Changes to suggested scheme

However, if we are going to move everything to a separate table (currently collection_note), I'd like to suggest some changes to the proposal at the top:

Links

"the type of subject (e.g. file item, file, collection)"
"the primary key of the subject (e.g. 1, 3, 10)"

Firstly, notes already have a collection_id foreign key, NOT NULL. All notes have to be in one collection. I think this will and should still hold; I have not seen any use cases where a note should span multiple collections.

Can I suggest instead directly adding:

  • collection_file_id Foreign Key, Nullable
  • collection_file_item_id Foreign Key, Nullable

This allows:

  • easier queries when looking up the notes for a file (see the example query after this list)
  • proper foreign keys and possibly indexes, likely making such queries faster
  • linking a note to multiple things
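
A sketch of the kind of query this would enable, assuming psycopg2 and a filename column on collection_file (all names illustrative):

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        cursor.execute("""
            SELECT collection_note.note
            FROM collection_note
            JOIN collection_file ON collection_file.id = collection_note.collection_file_id
            WHERE collection_file.filename = %s
        """, ("data.json",))
        for (note,) in cursor.fetchall():
            print(note)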

OCID

Add an OCID string column, nullable. This allows storing messages about an OCID and helps solve the problem I just realised and described in #283.

Source

"the source (e.g. the store, upgrade, or compile step)"

It would be good to explore a bit more what values are expected here, as I'm not clear at the moment. From what I think this means, my next suggestion should cover it:

Human Readable

I would like to add a machine-readable version of all automatic messages put in here. Currently we just log human-readable messages (this is historical; the collection note table was not originally intended to be used for this), but those are very difficult to search through. For example, we currently log things like:

  OCID ocds-213czf-000-00001 could not be compiled because merge library threw an error: MissingDateKeyError The `date` field of at least one release is missing.

If I as an analyst want to list all the OCIDs that have bad date data, it's difficult: I have to write something to take apart a human-readable string, and it's very possible that string might change in the future. Also, what about i18n?

Instead, I'd rather log a JSON blob with some values that are designed for searching:

 {"type": "merge-error", "error-type": "MissingDateKeyError", "ocid": "ocds-213czf-000-00001"}

Now writing a query to find all the OCIDs is easier and much more robust, and the UI that reports on this can be internationalized later if we need that.
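
For example, if the blob is stored in a jsonb column named data, PostgreSQL's JSON operators make the query simple. A sketch, assuming psycopg2 (all names illustrative):

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        cursor.execute("""
            SELECT data ->> 'ocid'
            FROM collection_note
            WHERE data ->> 'type' = 'merge-error'
              AND data ->> 'error-type' = 'MissingDateKeyError'
        """)
        ocids = [row[0] for row in cursor.fetchall()]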

(P.S. This is similar to the approach we are trying to take to solve some i18n issues in CoVE.)

So actually I suggest not using collection_note

The collection note table was designed solely for human-written notes.

Its use case was that we wanted to know which analyst had started a collection, so if we had questions about it, we knew who to go to.

That's why all the examples in docs like https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-collect.html include a standard note template, to try and encourage its use.

For example:

curl http://scrape:PASSWORD@collect.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME."

I would suggest that the data a human-written "note" generates is very different from the data a machine-written log entry generates.

And thus I would suggest separating them:

  • Keep the collection_note for human written notes only.
  • Create a new table collection_log for machine generated warnings and errors.

Summary

  • Keep collection_note for human written notes only.
  • Start a new table collection_log with (so far; sketched after this list):
    • collection_id NOT NULL
    • collection_file_id Foreign Key, Nullable
    • collection_file_item_id Foreign Key, Nullable
    • data JSON NOT NULL
    • level - Warning or Error or other
  • All code that currently puts automatically generated notes in the collection_note table should put them in the collection_log table instead.
  • Have the check step store errors in collection_log table, instead of release_check_error and record_check_error
  • Have the load step store warnings and errors in collection_log table, instead of the warnings and errors columns of the collection_file and collection_file_item tables
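
To make the table concrete, a sketch in PostgreSQL via psycopg2, following the column list above; the types and constraint details are assumptions, not a final design:

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE collection_log (
                id serial PRIMARY KEY,
                collection_id integer NOT NULL REFERENCES collection (id),
                collection_file_id integer REFERENCES collection_file (id),
                collection_file_item_id integer REFERENCES collection_file_item (id),
                data jsonb NOT NULL,
                level text NOT NULL  -- 'warning', 'error' or other
            )
        """)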


jpmckinney commented Apr 22, 2020

Thanks! The issue description was relatively high-level, so I'll flesh it out here. A few clarifications:

  • The issue description describes what kinds of columns to add. So, yes, id, collection_id, note, stored_at would remain.

With respect to:

  • "the type of subject (e.g. file item, file, collection)" and "the primary key of the subject (e.g. 1, 3, 10)"
    • The idea here is to have a polymorphic association, e.g. a subject_type column with values like collection_file, collection_file_item, release, etc., and a subject_id column. Together, they identify a row in a table. There'd be a composite index on both for fast queries (see the sketch after this list). However, there wouldn't be a foreign key for data integrity. Since this is a log message, I don't think data integrity is a concern; a common use case for logs is for them to be able to survive what they are about.
    • We could instead have a foreign key for every possible subject, but (1) we'll have to migrate to add a column whenever we add a new subject, (2) we'll have to update code that reads the table to consider the new column, and (3) in terms of data integrity, this model would allow a log message that relates to multiple (or no) subjects. (3) can be fixed with a more complex table definition, but I'd rather keep definitions simple.
  • "the source (e.g. the store, upgrade, or compile step)"
    • It's important to know not only what a message is about, but also what operation triggered the message. We can of course get very specific (e.g. identify specific Python methods), but these messages are intended to be useful to analysts, in which case identifying higher-level operations like "store" (from local load or web), "upgrade", "compile" are more useful. We can explore what these should be when work is done on this issue.
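
A sketch of the polymorphic association and its composite index, via psycopg2 (the index name and all other names are illustrative):

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        # Composite index for fast lookups; deliberately no foreign key,
        # so a note can outlive the row it describes.
        cursor.execute("""
            CREATE INDEX collection_note_subject
            ON collection_note (subject_type, subject_id)
        """)
        # Example: all notes about collection_file 3.
        cursor.execute("""
            SELECT note FROM collection_note
            WHERE subject_type = 'collection_file' AND subject_id = %s
        """, (3,))
        notes = [row[0] for row in cursor.fetchall()]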

In terms of changes to the proposal:

  • I don't like JSON blobs as they are much harder to migrate and have much poorer constraints. Instead of blobs we can just have columns for whatever keys we think the blobs should have. That way, all messages are guaranteed to have the same schema without extra effort, which makes it much easier to read the messages in a predictable way.
  • There is only one human-authored collection note per collection (with one exception), and it typically just identifies who started the collection. If analysts launched crawls via a web interface (related Have an overview of system, across Collect, Process #295) with user accounts (see Send email notifications when all steps are done #273), then this could also just be another "system" message. As such, I don't see a significant line between human-authored and machine-authored. I also think it makes more sense to have a single stream of log messages rather than two. Users can easily filter between the two (e.g. the source of a human-authored message can be user).
  • We could have an ocid column, though I'm not sure how frequently an analyst will want to search messages by OCID. We can confirm when work starts on this issue. I'm imagining the web interface (related Have an overview of system, across Collect, Process #295) would have a box for full-text search (among other filter options), in which case I don't think it matters much to the analyst whether the search matches an OCID in a dedicated ocid column or if it matches a note containing that OCID.

jpmckinney added the error handling label and removed the framework label Aug 13, 2020

jpmckinney commented Jul 4, 2023

This was done in v2.

Note that v2's release_compiler doesn't warn about these from v1's _process_releases method:

  • multiple releases with a 'compiled' tag (of which the first is taken)
    • v2 would merge these with any others
  • one release with a 'compiled' tag (which is taken)
    • v2 would merge this with any others
  • undated releases (in which case the dated releases are used)
    • v2 would error on the undated releases

I probably discussed this with Datlab at the time and decided the logic was unnecessary, or undesirable (i.e. we don't trust that publishers merged releases correctly, unless we have no dated releases).

If we want to add the logic back, we can refer to v2's record_compiler and the old _process_releases (above link).

Note that the behavior for the first two is what both v1 and v2 do for records, in the case where the record has at least 1 dated release and no linked releases (i.e. it ignores whether any of those were tagged as compiled).

Since release packages are never expected to contain linked releases (this has never been witnessed), the current v2 behavior is fine. We can re-add the logic if we encounter undated releases again (not yet observed in any existing publications).
