Put all system messages in collection note table #242
To be honest, I never had the need to check these messages, but the proposal sounds good.
I agree with this, and also suggest storing logs about the insertions in the database as well (when Scrapy sends a POST to Kingfisher Process). We had cases where we received a 500 or similar error from Kingfisher Process without knowing why.
@yolile Do you want to store a log of successful insertions (e.g. "[filename] stored in collection_file")? Or is it sufficient to store the request that caused the 500 error? Also, in future, please open issues for any unexpected errors, so that they can be investigated. I have access to Sentry, which might provide more detail on some 500 errors.
I think that it would be OK to store only the non-200 ones.
Ok, I will.
I just noticed that the tables are empty. Since that's the case, I figure there is no disruption to existing practices if they are dropped.
I'm happy with the proposal. Like @romifz, I also haven't actually checked any of these errors before. If it's important to check these, even when a load appears to have run successfully, then perhaps we should document them in a loading-data checklist.
Noting that I'll handle this in the new system, as writing a PR for the present system would touch too many parts of the code (since it's not just a simple database change).
A separate table would be helpful, and we can/should add checks and explanatory notes to the data feedback Colab notebooks. Issue #222 meant that the data the data journalists were looking at did not accurately reflect the source: because of the way the data was structured, the compile script removed data it considered to be repeat data. As we increase OCDS data use/re-use, e.g. via views, we need to verify that the compilation and upgrade scripts used to generate those views haven't removed data due to data quality issues. Because the existing system didn't throw/capture errors, we only realized the issue belatedly.
Personal

I wanted to comment and say I'm taking a lesson from this. The warnings and errors columns on several tables were put in to answer the use case that analysts need to know if there were problems converting/handling the data. But I don't think I put as much effort into communicating that to analysts and discussing it with them. I'm taking a lesson from this to make sure I communicate more on things like this.

Where to store

I don't think it makes much difference whether warnings/errors are stored in columns on data tables or in a separate table. In either case, we should be able to write something so that it's easy for analysts to see the info.

Changes to suggested scheme

However, if we are going to move everything to a separate table (currently collection_note), some suggested changes follow.

Links
Firstly, notes already have a link to the collection. Can I suggest instead directly adding:
This allows messages to be joined directly to the rows they concern.

OCID

Add an ocid column.

Source
It would be good to explore a bit more what things are expected to be in this value, as I'm not clear at the moment. From what I think this means, my next suggestion will cover that:

Human Readable

I would like to add a machine-readable version of all automatic messages put in here. Currently we just log human-readable messages (this is historical; the collection note table was not originally intended to be used for this), but they are very difficult to search through. For example, we currently log things like:
If I as an analyst want to list all the OCIDs that have bad date data, it's difficult: I have to write something to take apart a human string, and it's very possible that string might change in the future. Also, what about I18N? Instead, I'd rather log a JSON blob with some values that are designed for searching:
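A minimal sketch of what such searchable blobs might look like. The field names ("type", "ocid", "field") and the message types are illustrative assumptions, not Kingfisher Process's actual schema:

```python
# Hypothetical machine-readable log entries. The "type" value is a stable,
# searchable code; any human-readable rendering (and its translation) can
# happen later in the UI.
notes = [
    {"type": "bad-date", "ocid": "ocds-213czf-000-00001",
     "field": "tender/tenderPeriod/endDate", "value": "31/12/2019"},
    {"type": "bad-date", "ocid": "ocds-213czf-000-00002",
     "field": "date", "value": "not a date"},
    {"type": "missing-ocid", "file": "page-3.json"},
]

# Listing all OCIDs with bad date data is now a filter, not string-parsing:
bad_date_ocids = sorted({n["ocid"] for n in notes if n["type"] == "bad-date"})
print(bad_date_ocids)
```

With the blobs stored in a JSON column, the same filter becomes a simple SQL query over that column.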
Now writing a query to find all the OCIDs is easier and much more robust, and the UI that reports on this can be I18N'd later if we need that. (P.S. This is similar to the approach we are trying to take to solve some I18N issues in CoVE.)

So actually I suggest not using collection_note

The collection note table was designed solely for human-written notes. Its use case was that we wanted to know which analyst had started a collection, so that if we had a question about it, we knew who to go to. That's why all the examples in docs like https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-collect.html include a standard note template, to try and encourage its use. E.g.:
I would suggest that the data a human-written "note" generates is very different from the data a machine-written log entry generates, and thus I would suggest separating them:
Summary
Thanks! The issue description was relatively high-level, so I'll flesh it out here. A few clarifications:
With respect to:
In terms of changes to the proposal:
This was done in v2.
I probably discussed this with Datlab at the time and decided the logic was unnecessary, or undesirable (i.e. we don't trust that publishers merged releases correctly, unless we have no dated releases). If we want to add the logic back, we can refer to v2's code. Note that the behavior for the first two is what both v1 and v2 do for records, in the case where the record has at least one dated release and no linked releases (i.e. it ignores whether any of those were tagged as compiled). Since release packages are never expected to contain linked releases (this has never been witnessed), the current v2 behavior is fine. We can re-add the logic if we encounter undated releases again (we haven't yet, for any existing publication).
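The record rule described above (merge the dated releases when a record has at least one dated release and no linked releases, ignoring the compiled tag) can be sketched as follows. This is a hypothetical helper, not Kingfisher Process's actual code, and detecting a linked release by the presence of a url key is an assumption:

```python
def releases_to_merge(record):
    """Return the releases to merge for a record, or None if the record
    can't be compiled under the rule described above. (Sketch only.)"""
    releases = record.get("releases", [])
    # Assumption: a linked release points at data hosted elsewhere via a
    # "url" key, while an embedded (full) release has no "url" key.
    linked = [r for r in releases if "url" in r]
    dated = [r for r in releases if r.get("date")]
    if dated and not linked:
        # Merge all dated releases, ignoring whether any is tagged 'compiled'.
        return dated
    return None
```

So a record containing only full, dated releases is compiled from all of them; a record containing any linked release (or no dated releases) is skipped.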
In summary:

- Add columns to the collection_note table for the new message details. (v2's collection_note table has a data JSON field, but there are so few occasions for errors or warnings at present that it doesn't really matter to have the additional detail.)
- Have the check step store errors in this table, instead of release_check_error and record_check_error.
- Have the load step store warnings and errors in this table, instead of the warnings and errors columns of the collection_file and collection_file_item tables.
- Have ocdskingfisherprocess/transform/compile_releases.py store its warnings in the collection_note table (including a warning for two records with the same OCID, which is impossible per the database index). v2's record_compiler creates notes for these (with more precise messages).

There are likely existing Python packages that implement warning and/or logging handlers that store messages in the DB.
The warnings and errors produced by the upgrade #67 and compile #222 transforms can be logged by a new logger, whose handler stores the messages in the database, associated with the relevant collection.
That way, analysts can query the table for warnings/errors, and the web UI can also report the number and list of warnings/errors.
This table can also be used for the warnings/errors that are presently stored separately on each file and file item as JSON (while preserving the ID of the file or file item to which the message relates). I assume that if I'm checking a collection before analyzing it, it'd be useful to see all warnings/errors in one place – rather than having to separately query the file and file item tables for any potential warnings/errors – and I assume that most of the time, it's more useful to know that a particular error occurred within a collection than to know the specific file/item on which it occurred (though the new table would still make it easy to know that). Do others agree with moving all such messages to one table?
The new table would thus store the severity (warning, error), the message, the source (e.g. the store, upgrade, or compile step), and the subject (e.g. file item 1, file 3, collection 10).
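The logger-plus-handler idea can be sketched with Python's standard logging module. The message table, its columns, and the use of sqlite3 are assumptions for illustration; the real system would use its own schema and PostgreSQL:

```python
import logging
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE message (
        id INTEGER PRIMARY KEY,
        collection_id INTEGER,
        severity TEXT,  -- 'warning' or 'error'
        source TEXT,    -- e.g. 'store', 'upgrade', 'compile'
        subject TEXT,   -- e.g. 'collection_file 3'
        message TEXT
    )
""")

class DatabaseHandler(logging.Handler):
    """Store each warning/error in the message table (sketch)."""

    def __init__(self, conn, collection_id):
        super().__init__(level=logging.WARNING)
        self.conn = conn
        self.collection_id = collection_id

    def emit(self, record):
        self.conn.execute(
            "INSERT INTO message (collection_id, severity, source, subject, message) "
            "VALUES (?, ?, ?, ?, ?)",
            (self.collection_id, record.levelname.lower(), record.name,
             getattr(record, "subject", None), record.getMessage()),
        )

# Each transform step gets a logger named after itself; the subject is
# passed via the standard `extra` mechanism.
logger = logging.getLogger("compile")
logger.propagate = False
logger.addHandler(DatabaseHandler(conn, collection_id=10))
logger.warning("duplicate OCID", extra={"subject": "collection_file 3"})

rows = conn.execute("SELECT severity, source, subject, message FROM message").fetchall()
print(rows)  # [('warning', 'compile', 'collection_file 3', 'duplicate OCID')]
```

Analysts can then query one table for all of a collection's warnings/errors, and the web UI can count or list them per collection with a single query.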