Put all system messages in collection note table #242

Closed

jpmckinney opened this issue Dec 16, 2019 · 11 comments

jpmckinney commented Dec 16, 2019

In summary:

  • Add columns to the collection_note table for:
    • severity (warning, error)
    • the source (e.g. the store, upgrade, or compile step)
    • the type of subject (e.g. file item, file, collection)
    • the primary key of the subject (e.g. 1, 3, 10)
    • Update: In v2, the collection_note table has a data JSON field, but there are so few occasions for errors or warnings at present that it doesn't really matter to have the additional detail.
  • Have the check step store errors in this table, instead of release_check_error and record_check_error
    • v2 doesn't catch these errors, which shouldn't occur anyway.
  • Have the load step store warnings and errors in this table, instead of the warnings and errors columns of the collection_file and collection_file_item tables
    • v2 stores FileError items from Kingfisher Collect as CollectionNote rows.
    • Per Correctly handle NUL characters (control code) #263, v1 removed a lot of control characters for no reason. It then appended a warning about each control code it replaced. In v2, we only replace the NUL escape sequence, and this is not notable for logging.
    • v1 stored errors about invalid JSON. v2 (and Kingfisher Collect) doesn't allow invalid JSON to reach its queues.
    • v1 had many warnings in ocdskingfisherprocess/transform/compile_releases.py (including a warning for two records with the same OCID, which is impossible per the database index). v2's record_compiler creates notes for these (with more precise messages).
  • As upgrade-1-0-to-1-1: Store warnings (in collection_note table) #67, compile-releases: Store warnings (in collection_note table) #222, Compile Releases Transform - include records table #63 (and any others) are closed, their warnings should be logged in the collection_note table

There are likely existing Python packages that implement warning and/or logging handlers that store messages in the DB.


The warnings and errors produced by the upgrade #67 and compile #222 transforms can be logged by a new logger, whose handler stores the messages in the database, associated with the relevant collection.
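
As a rough illustration, here is a minimal sketch of such a handler, assuming psycopg2 and the existing collection_note columns (id, collection_id, note, stored_at); the class name, logger name, and connection string are hypothetical:

    import logging

    import psycopg2  # assumes the PostgreSQL driver used by Kingfisher Process

    class CollectionNoteHandler(logging.Handler):
        """Stores each log record as a note on the given collection."""

        def __init__(self, connection, collection_id):
            super().__init__()
            self.connection = connection
            self.collection_id = collection_id

        def emit(self, record):
            with self.connection, self.connection.cursor() as cursor:
                cursor.execute(
                    "INSERT INTO collection_note (collection_id, note, stored_at) "
                    "VALUES (%s, %s, now())",
                    (self.collection_id, self.format(record)),
                )

    # Attach the handler to the logger used by the upgrade or compile step.
    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    logger = logging.getLogger("compile")  # hypothetical logger name
    logger.addHandler(CollectionNoteHandler(connection, collection_id=10))
    logger.warning("Two records have the same OCID")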

That way, analysts can query the table for warnings/errors, and the web UI can also report the number and list of warnings/errors.

This table can also be used for the warnings/errors that are presently stored separately on each file and file item as JSON (while preserving the ID of the file or file item to which the message relates). I assume that if I'm checking a collection before analyzing it, it'd be useful to see all warnings/errors in one place, rather than having to separately query the file and file item tables for any potential warnings/errors. I also assume that, most of the time, it's more useful to know that a particular error occurred within a collection than to know the specific file/item on which it occurred (though the new table would still make it easy to know that). Do others agree with moving all such messages to one table?

The new table would thus store the severity (warning, error), the message, the source (e.g. the store, upgrade, or compile step), and the subject (e.g. file item 1, file 3, collection 10).
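
For concreteness, a sketch of those columns as a PostgreSQL migration via psycopg2; the column names and types are illustrative, not a settled design:

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        cursor.execute("""
            ALTER TABLE collection_note
                ADD COLUMN severity text,      -- 'warning' or 'error'
                ADD COLUMN source text,        -- e.g. 'store', 'upgrade', 'compile'
                ADD COLUMN subject_type text,  -- e.g. 'collection_file_item', 'collection_file', 'collection'
                ADD COLUMN subject_id integer  -- primary key of the subject row
        """)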

jpmckinney modified the milestones: V2 preparation, V2 consultation Dec 16, 2019

romifz commented Dec 16, 2019

To be honest, I never had the need to check these messages, but the proposal sounds good.


yolile commented Dec 17, 2019

I agree with this, and I also suggest storing the logs about the insertions in the database as well (when Scrapy sends a POST to Kingfisher Process). We had cases where we received a 500 or similar error from Kingfisher Process without knowing why.


jpmckinney commented Dec 17, 2019

@yolile Do you want to store a log of successful insertions (e.g. "[filename] stored in collection_file")? Or is it sufficient to store the request that caused the 500 error?

Also, in future, please open issues for any unexpected errors, so that they can be investigated. I have access to Sentry, which might provide more detail on some 500 errors.


yolile commented Dec 17, 2019

I think it would be OK to store only the non-200 ones.

"Also, in future, please open issues for any unexpected errors, so that they can be investigated. I have access to Sentry, which might provide more detail on some 500 errors."

Ok, I will

@jpmckinney

I just noticed that the release_check_error and record_check_error tables are to store exceptions raised while trying to check a release/record (rather than to store errors in the data). Both tables are presently empty. Since these tables are just for logging exceptions that might occur unexpectedly, I figure they can also be moved to the table proposed in this issue.

Since the tables are empty, I figure there is no disruption to existing practices if they are dropped.

@duncandewhurst

I'm happy with the proposal.

Like @romifz, I also haven't actually checked any of these errors before.

If it's important to check these, even when it looks like a load has run successfully, then perhaps we should document them in a loading data checklist.

@jpmckinney

Noting that I'll handle this in the new system, as writing a PR for the present system will touch too many parts of the code (since it's not just a simple database change).


pindec commented Dec 20, 2019

A separate table would be helpful, and we can/should add checks and explanatory notes to the data feedback colab notebooks.

The #222 issue meant that the data which data journalists were looking at did not accurately reflect the source: because of how the data was structured, the compile script removed data it considered duplicated. As we increase OCDS data use/re-use, e.g. via views, we need to verify that the compilation and upgrade scripts that generate those views haven't removed data due to data quality issues.

Because the existing system didn't throw/capture errors, we only realised the issue belatedly.


odscjames commented Apr 22, 2020

Personal

I wanted to comment and say I'm taking a lesson from this. The warnings and errors columns on several tables were put in to address the use case that analysts need to know if there were problems converting or handling the data. But I don't think I put enough effort into communicating that to analysts and discussing it with them. The lesson for me is to make sure I communicate more on things like this.

Where to store

I don't think it makes much difference whether warnings / errors are stored in columns on data tables or in a separate table. In either case we should be able to write something so it's easy for analysts to see the info.

Changes to suggested scheme

However, if we are going to move everything to a separate table (currently collection_note), I'd like to suggest some changes to the proposal at the top:

Links

"the type of subject (e.g. file item, file, collection)"
"the primary key of the subject (e.g. 1, 3, 10)"

Firstly, notes already have a collection_id foreign key, NOT NULL. All notes have to be in one collection. I think this will and should still hold; I have not seen any use cases where a note should span multiple collections.

Can I suggest instead directly adding:

  • collection_file_id Foreign Key, Nullable
  • collection_file_item_id Foreign Key, Nullable

This allows:

  • easier queries when looking up the notes for a file (see the example query after this list)
  • proper foreign keys and possibly indexes, likely making such queries faster
  • linking a note to multiple things
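
A sketch of the kind of query this would enable, assuming psycopg2 and a filename column on collection_file (all names illustrative):

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        cursor.execute("""
            SELECT collection_note.note
            FROM collection_note
            JOIN collection_file ON collection_file.id = collection_note.collection_file_id
            WHERE collection_file.filename = %s
        """, ("data.json",))
        for (note,) in cursor.fetchall():
            print(note)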

OCID

Add an OCID string column, nullable. This allows storing messages about an OCID and helps solve the problem I just realised and described in #283.

Source

"the source (e.g. the store, upgrade, or compile step)"

It would be good to explore a bit more what values are expected here, as I'm not clear at the moment. From what I think this means, my next suggestion should cover it:

Human Readable

I would like to add a machine-readable version of all automatic messages put in here. Currently we just log human-readable messages (this is historical; the collection note table was not originally intended to be used for this), but those are very difficult to search through. For example, we currently log things like:

  OCID ocds-213czf-000-00001 could not be compiled because merge library threw an error: MissingDateKeyError The `date` field of at least one release is missing.

If I as an analyst want to list all the OCIDs that have bad date data, it's difficult: I have to write something to take apart a human-readable string, and it's very possible that string might change in the future. Also, what about i18n?

Instead, I'd rather log a JSON blob with some values that are designed for searching:

 {"type": "merge-error", "error-type": "MissingDateKeyError", "ocid": "ocds-213czf-000-00001"}

Now writing a query to find all the OCIDs is easier and much more robust, and the UI that reports on this can be internationalized later if we need that.
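
For example, if the blob is stored in a jsonb column named data, PostgreSQL's JSON operators make the query simple. A sketch, assuming psycopg2 (all names illustrative):

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        cursor.execute("""
            SELECT data ->> 'ocid'
            FROM collection_note
            WHERE data ->> 'type' = 'merge-error'
              AND data ->> 'error-type' = 'MissingDateKeyError'
        """)
        ocids = [row[0] for row in cursor.fetchall()]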

(P.S. This is similar to the approach we are trying to take to solve some i18n issues in CoVE.)

So actually I suggest not using collection_note

The collection note table was designed solely for human-written notes.

Its use case was that we wanted to know which analyst had started a collection, so if we had questions about it, we knew who to go to.

That's why all the examples in docs like https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-collect.html include a standard note template, to try and encourage its use.

For example:

curl http://scrape:PASSWORD@collect.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=spider_name -d note="Started by NAME."

I would suggest that the data a human-written "note" generates is very different from the data a machine-written log entry generates.

And thus I would suggest separating them:

  • Keep the collection_note for human written notes only.
  • Create a new table collection_log for machine generated warnings and errors.

Summary

  • Keep collection_note for human written notes only.
  • Start a new table collection_log with (so far; sketched after this list):
    • collection_id NOT NULL
    • collection_file_id Foreign Key, Nullable
    • collection_file_item_id Foreign Key, Nullable
    • data JSON NOT NULL
    • level - Warning or Error or other
  • All code that currently puts automatically generated notes in the collection_note table should put them in the collection_log table instead.
  • Have the check step store errors in collection_log table, instead of release_check_error and record_check_error
  • Have the load step store warnings and errors in collection_log table, instead of the warnings and errors columns of the collection_file and collection_file_item tables
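
To make the table concrete, a sketch in PostgreSQL via psycopg2, following the column list above; the types and constraint details are assumptions, not a final design:

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE collection_log (
                id serial PRIMARY KEY,
                collection_id integer NOT NULL REFERENCES collection (id),
                collection_file_id integer REFERENCES collection_file (id),
                collection_file_item_id integer REFERENCES collection_file_item (id),
                data jsonb NOT NULL,
                level text NOT NULL  -- 'warning', 'error' or other
            )
        """)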


jpmckinney commented Apr 22, 2020

Thanks! The issue description was relatively high-level, so I'll flesh it out here. A few clarifications:

  • The issue description describes what kinds of columns to add. So, yes, id, collection_id, note, stored_at would remain.

With respect to:

  • "the type of subject (e.g. file item, file, collection)" and "the primary key of the subject (e.g. 1, 3, 10)"
    • The idea here is to have a polymorphic association, e.g. a subject_type column with values like collection_file, collection_file_item, release, etc., and a subject_id column. Together, they identify a row in a table. There'd be a composite index on both for fast queries (see the sketch after this list). However, there wouldn't be a foreign key for data integrity. Since this is a log message, I don't think data integrity is a concern; a common use case for logs is for them to be able to survive what they are about.
    • We could instead have a foreign key for every possible subject, but (1) we'll have to migrate to add a column whenever we add a new subject, (2) we'll have to update code that reads the table to consider the new column, and (3) in terms of data integrity, this model would allow a log message that relates to multiple (or no) subjects. (3) can be fixed with a more complex table definition, but I'd rather keep definitions simple.
  • "the source (e.g. the store, upgrade, or compile step)"
    • It's important to know not only what a message is about, but also what operation triggered the message. We can of course get very specific (e.g. identify specific Python methods), but these messages are intended to be useful to analysts, in which case identifying higher-level operations like "store" (from local load or web), "upgrade", "compile" are more useful. We can explore what these should be when work is done on this issue.
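
A sketch of the polymorphic association and its composite index, via psycopg2 (the index name and all other names are illustrative):

    import psycopg2

    connection = psycopg2.connect("dbname=kingfisher")  # illustrative
    with connection, connection.cursor() as cursor:
        # Composite index for fast lookups; deliberately no foreign key,
        # so a note can outlive the row it describes.
        cursor.execute("""
            CREATE INDEX collection_note_subject
            ON collection_note (subject_type, subject_id)
        """)
        # Example: all notes about collection_file 3.
        cursor.execute("""
            SELECT note FROM collection_note
            WHERE subject_type = 'collection_file' AND subject_id = %s
        """, (3,))
        notes = [row[0] for row in cursor.fetchall()]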

In terms of changes to the proposal:

  • I don't like JSON blobs as they are much harder to migrate and have much poorer constraints. Instead of blobs we can just have columns for whatever keys we think the blobs should have. That way, all messages are guaranteed to have the same schema without extra effort, which makes it much easier to read the messages in a predictable way.
  • There is only one human-authored collection note per collection (with one exception), and it typically just identifies who started the collection. If analysts launched crawls via a web interface (related Have an overview of system, across Collect, Process #295) with user accounts (see Send email notifications when all steps are done #273), then this could also just be another "system" message. As such, I don't see a significant line between human-authored and machine-authored. I also think it makes more sense to have a single stream of log messages rather than two. Users can easily filter between the two (e.g. the source of a human-authored message can be user).
  • We could have an ocid column, though I'm not sure how frequently an analyst will want to search messages by OCID. We can confirm when work starts on this issue. I'm imagining the web interface (related Have an overview of system, across Collect, Process #295) would have a box for full-text search (among other filter options), in which case I don't think it matters much to the analyst whether the search matches an OCID in a dedicated ocid column or if it matches a note containing that OCID.

jpmckinney added the error handling label and removed the framework label Aug 13, 2020

jpmckinney commented Jul 4, 2023

This was done in v2.

Note that v2's release_compiler doesn't warn about these from v1's _process_releases method:

  • multiple releases with a 'compiled' tag (of which the first is taken)
    • v2 would merge these with any others
  • one release with a 'compiled' tag (which is taken)
    • v2 would merge this with any others
  • undated releases (in which case the dated releases are used)
    • v2 would error on the undated releases

I probably discussed this with Datlab at the time and decided the logic was unnecessary, or undesirable (i.e. we don't trust that publishers merged releases correctly, unless we have no dated releases).

If we want to add the logic back, we can refer to v2's record_compiler and the old _process_releases (above link).

Note that the behavior for the first two is what both v1 and v2 do for records, in the case where the record has at least 1 dated release and no linked releases (i.e. it ignores whether any of those were tagged as compiled).

Since release packages are never expected to contain linked releases (this has never been witnessed), the current v2 behavior is fine. We can re-add the logic if we encounter undated releases again (not yet observed in any existing publications).
