DF/091: improve error handling #320
Conversation
The same operations were performed in `process_input_ds()` and `get_output_ds_info()`; now they are collected under the name `get_ds_info()`, while all the input/output DS type-related operations are left to `process_*_ds()`.
If a single dataset record does not contain enough information to construct the service fields for ES indexing, it does not mean that all the datasets should be skipped.
Changes in error handling:

1. If Rucio returns `DatasetIdentifierNotFound`, set `deleted` to `True` (previously this was treated as any other `RucioException`).
2. In case of `RucioException`, mark the message as "incomplete" -- this record will be written to ES in "update" mode, not "insert" (previously `deleted` and `bytes` would be set to `True` and `-1`).
3. If some of the required fields cannot be extracted from Rucio (or have the value `null`), they are left unset (previously they would be set to `None`).
4. If some of the fields that were supposed to be added at this stage are missing from the result message, the message is marked as `_incomplete`.

What is not good in this logic: if a dataset was removed from Rucio, it will always be marked as "requiring update". But for now we have only this option: to use "update" instead of "insert", we need to know that the message is incomplete; and when a message coming to Stage 019 is incomplete, it should be marked as "requiring update" for further investigation. Maybe the logic could be extended and `_incomplete` turned into two different markers: "update since the original source removed useful data" and "update since we failed to connect to the original source", but... not now.
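Taken together, a minimal sketch of how the refactored `get_ds_info()` might implement these rules (the exception import path, the scope handling, and the set of required fields are assumptions for illustration, not the actual stage code):

```python
from rucio.common.exception import DataIdentifierNotFound, RucioException

REQUIRED_FIELDS = ('bytes', 'events')   # illustrative field set


def get_ds_info(name, client, scope='some.scope'):
    """Fill dataset fields according to rules 1-4 above."""
    ds = {}
    try:
        meta = client.get_metadata(scope=scope, name=name)
    except DataIdentifierNotFound:
        # (1) Not found in Rucio == removed.
        ds['deleted'] = True
        return ds
    except RucioException:
        # (2) Any other Rucio problem: record is incomplete and must be
        #     written to ES in "update" mode instead of "insert".
        ds['_incomplete'] = True
        return ds
    for field in REQUIRED_FIELDS:
        # (3) Fields missing or null in Rucio are left unset.
        if meta.get(field) is not None:
            ds[field] = meta[field]
    if not all(f in ds for f in REQUIRED_FIELDS):
        # (4) Some expected field is still missing: mark the message.
        ds['_incomplete'] = True
    return ds
```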
At first glance the changes look fine; I will verify the logic and the checklist later.
```
2020-03-05 19:28:04 (DEBUG) (ProcessorStage) Traceback (most recent call last):
  File "<...>/ProcessorStage.py", line 231, in run
    if msg and process(self, msg):
  File "<...>/091_datasetsRucio/datasets_processing.py", line 233, in process_input_ds
    data.update(ds)
UnboundLocalError: local variable 'ds' referenced before assignment
```
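For reference, the shape of the bug behind such a traceback is a variable assigned only on the success path; a hypothetical reduction and one conventional fix:

```python
def process_input_ds(stage, message):          # failing shape
    data = message.content()
    try:
        ds = get_ds_info(data)                 # raises before 'ds' is assigned
    except RucioException:
        pass
    data.update(ds)                            # -> UnboundLocalError


def process_input_ds_fixed(stage, message):    # 'ds' defined on every path
    data = message.content()
    ds = {'_incomplete': True}
    try:
        ds = get_ds_info(data)
    except RucioException:
        pass
    data.update(ds)
    return True
```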
`/Utils/Dataflow/test/utils/compare_ndjson_files.py` shows changes like these:

```
Record seem to differ for uid=mc15_13TeV.387007.MGPy8EG_A14N_BB_direct_300_295_MET100.merge.AOD.e3994_a766_a821_r7676_tid08124406_00
key = deleted:
  Items missed in (2): (1) False
  Items missed in (1): (2) True
Item missed in (2): 'bytes'
Item missed in (2): 'events'
Key missed in (1): '_incomplete'
```

...which means:

* the DS with the given uid was removed (`deleted: True`);
* fields `bytes` and `events` are not present in the new record;
* the key `_incomplete` is added.

So when this record gets to 019 (esFormat) and 069 (upload2es), it will be written in "update" mode: `bytes` and `events` won't be changed, `deleted` will be set to `True`, and `_update_required` will be set to `True`. The latter may not be the best option, but it is the only flag we have right now to say that there is something wrong about the given record... and here "wrong" means that Rucio did not provide us with the information we needed.
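In record terms, the diff above corresponds to something like the following (all field values here are made up for illustration):

```python
# File (1): record produced before this PR.
record_before = {
    'deleted': False,
    'bytes': 1234567890,
    'events': 10000,
}

# File (2): record produced with this PR -- unknown fields are left unset,
# and the record is flagged so that 019/069 switch to "update" mode.
record_after = {
    'deleted': True,
    '_incomplete': True,
}
```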
Proper error message is missing for a non-existing dataset; it seems that ...
By the way, not sure if it is in the scope of this PR, but missing ...
Input datasets are similarly tied to ...
Thank you for checking things out and providing examples [1].
Right, and it is an expected behaviour, for the "non-existing" dataset cannot be distinguished from the "deleted" one. By "expected behaviour" I understand the following:
I don't see any problem in this scenario, but it doesn't mean there are none -- please let me know if I missed something. But if we just log this information for every dataset removed, it will be as useful as in Stage 095 -- far too many records about "no data found", which I usually have to filter out from the log before reading. Of course, at some level of detail we would like to have this information, but as for now we don't have level-based control over the logs, so what goes there is very custom. And for now I just didn't want to know about every dataset removed from Rucio, so I decided not to log it. But any other suggestion is welcome.

Update: I see, I added this as a "thing to check" to the checklist in the PR's description. Crossed it out, for it was a bit too paranoid an item.
Not every task must have an output; e.g. it may be an aborted task, or I don't know what. So it is not an error, it is just a task without output datasets -- nothing to worry about.
Right. In the case of input datasets, it is supposed to modify the input message; so if no modification is needed, it just outputs the original version. But in the case of output datasets, it is supposed to generate new messages from scratch; so if no messages are generated, nothing is passed further. And the original message is dropped from the message flow anyway, since the downstream stages expect messages with dataset, not task, metadata.

[1] It is more convenient to provide the command you're running in every example, even if it's the same (maybe skipping the long log at the stage's beginning), and with the input data passed to the stage via ...
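Schematically, the two behaviours described above might look like this (the `stage`/`message` API and the helper `make_message()` are simplified assumptions, not the actual pyDKB interface):

```python
def process_input_ds(stage, message):
    """Input DS: modify the incoming task message and pass it on."""
    data = message.content()
    data.update(get_ds_info(data.get('primary_input'), stage.rucio_client))
    stage.output(message)            # no new info -> original version goes on
    return True


def process_output_ds(stage, message):
    """Output DS: generate new dataset messages from scratch; the
    original task message is dropped from the flow either way."""
    for name in message.content().get('output', []):
        ds = get_ds_info(name, stage.rucio_client)
        stage.output(make_message(ds))   # nothing generated -> nothing sent
    return True
```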
Aha, I see. Ok then.
True.
You are right, my bad.
While (re-)checking things myself, I stumbled over this kind of error:
Before this PR:
The latter looks noisy but accurate (FATAL (why we stopped) + ERROR (what brought us here) + DEBUG (what and where it happened)); but the new version with ...

So I think I will cancel the approval for now (to not merge the PR eventually) and fix this behaviour.

Update: fixed in 72f374d.
When the stage fails to load the `rucio.client` module, it does not stop the execution -- for in case the `--skip` option is specified, we do not actually need the module to go on. The script exits only when it tries to initialize the client -- but since the initialization occurs within a `try/except` clause, it first stumbles over the expected exception name (`DataIdentifierNotFound`), which it failed to load from `rucio.client`. We had a similar issue with `RucioException` once, so now the new exception is handled similarly.
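The handling pattern described here would look roughly like this (a sketch; the actual import paths and fallback details in the stage may differ):

```python
try:
    import rucio.client
    from rucio.common.exception import RucioException, DataIdentifierNotFound
except ImportError:
    # With --skip the client is never used, so do not fail at import time;
    # define placeholders so later 'except ...' clauses remain valid names.
    rucio = None

    class RucioException(Exception):
        """Placeholder: never raised, only referenced in except clauses."""

    class DataIdentifierNotFound(RucioException):
        """Placeholder, handled the same way as RucioException above."""
```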
@Evildoor,
How did you test it? All I get is the "proper error message", but the stage exits instead of going on and producing the 'incomplete' messages:
Or, if valid certificate/key files are passed with an incorrect user name:
I am not sure what exactly I was thinking more than a month ago, but now the following statement sounds fair to me:
In other words, when the configuration is changed, the one making changes must make sure that after them the process operates normally. This sounds fair to me, for if the new configuration is incorrect, it's better to know about it immediately, not after a few weeks when someone from the outside starts wondering: "why doesn't the DKB provide information about dataset sizes for the latest tasks any more?" From this point of view, the behaviour stated above (when stage 091 stops after noticing invalid credentials) is the correct one, while the list of things-to-be-checked should be updated accordingly.
I have a feeling that we had already discussed this issue, and even went back and forth a bit, changing our decision at least once, but I could not find anything to prove my suspicion. It seems that we did discuss another similar issue, and that's what's triggering my memory. It's also a question about "to stop or not to stop stage execution when credentials are wrong", but for AMI instead of Rucio. The verdict was "yes, stop" -- you are suggesting the same thing here, and I have no objections to your logic, so I'll agree here as well.
Hm. I did several tests with various problems and was satisfied with them, but can't remember what exactly I did. Never mind, I'll run new tests, since we had a discussion about whether to interrupt the stage or not, AND are having issues here. Results (bold highlights the cases which look wrong to me):
That's all I could think of, for now.
@Evildoor, during my tests (see ...):

In my case, it claimed ...

Here it claimed that the file does not exist.
In both cases: I can guess that the `_incomplete` marker you got might be a correct result -- datasets mentioned in the sample with data from 2018 may already be removed. I made tests with the dataset name ...

However, I don't know about the "error messages" you mention: I saw none. As for this:

-- I also got the same or similar puzzling logs in a couple of other cases. I'll see what I can do about it and get back soon.

About "should or should not the stage fail":
Thank you for finding it: I also felt that there were multiple approaches to this question, but thought that all of them took place at weekly meetings and none was fixed in text form. Good to know some of the conclusions were in fact written down, and even better -- that back then the decision was pretty much the same. I think it is worth documenting somewhere in the installation/development guide as a "best practice" for stage development, so that we (and someone else) wouldn't stumble over it over and over again for each stage. If, of course, one remembers to check the guide instead of reinventing the wheel :)
The Rucio client module may raise not only `RucioException`, but also some other exceptions: e.g. `IOError` (when the files provided in the configuration as the user certificate/key cannot be read). What we need to do in case of an error in `init_rucio_client()` is to make sure that this error can be distinguished from any other error occurring during the Rucio client usage. In other words, when we fail to initialise the client, it is not a "Rucio error" (indicating that we have some problems with Rucio), but a "dataflow error" (indicating that we have some problems with the stage, which is unable to do its job). Calling `sys.exit()` is a bit severe: yes, we need to stop the stage execution if we can't initialise the client; but it would also be correct to say that we need to stop the dataflow. And if `DataflowException` is raised during the `process()` execution, it is up to the common library to take care of the problem (output the necessary details and interrupt the stage -- or do whatever is the default action when a stage says "I have a problem operating as a part of the dataflow process").
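A sketch of the suggested approach (the `DataflowException` import location is an assumption about the pyDKB layout):

```python
import rucio.client
from pyDKB.common.exceptions import DataflowException   # assumed location


def init_rucio_client():
    """Initialise the Rucio client, or fail as a *dataflow* error."""
    try:
        return rucio.client.Client()
    except Exception as err:
        # A failure here is not a "Rucio error" of normal operation but a
        # problem with the stage itself; raising DataflowException leaves it
        # to the common library to report the details and stop the stage.
        raise DataflowException("Failed to initialise Rucio client: %s" % err)
```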
Yes, I've noticed a similar thing, and did my tests by performing a successful attempt last in a series of tests -- but it seems that it wasn't enough, despite my observations. Thanks for pointing out where the token is. Now, with its removal before each attempt, the results are the same as yours.
If my memory serves, I tested on batches that included non-deleted datasets, but...
... the messages are obsolete, since removal of the token gets rid of them.
Good.
Processing interrupts; the message is a bit puzzling:
Not sure if we can catch the difference between this and an actual case of a directory being specified as a key/certificate. Even if the answer is "no", this can probably be left as-is, because it points towards the credentials, after all.
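If distinguishing such cases ever becomes desirable, a pre-flight check on the configured paths might help; a purely illustrative sketch (reusing the assumed `DataflowException` from above):

```python
import os


def check_credentials_files(cert, key):
    """Give a more specific error before handing paths to the client."""
    for label, path in (('certificate', cert), ('key', key)):
        if not os.path.exists(path):
            raise DataflowException("User %s not found: %s" % (label, path))
        if os.path.isdir(path):
            raise DataflowException("User %s is a directory, not a file: %s"
                                    % (label, path))
        if not os.access(path, os.R_OK):
            raise DataflowException("User %s is not readable: %s"
                                    % (label, path))
```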
Might be useful if we raise it instead of any other exception (when we want to specify our own exception message, but would still like to keep the original exception information as well).
...to make the traceback shorter, and the topmost error message -- more informative.
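One standard way to get "own message on top, original information preserved" is exception chaining (Python 3 syntax shown; on Python 2 the three-argument `raise` would be needed instead):

```python
try:
    client = rucio.client.Client()
except Exception as err:
    # The topmost message is ours and informative; the original exception
    # is kept as the cause and still shown below it in the traceback.
    raise DataflowException("Failed to initialise Rucio client") from err
```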
Conflicts: `Utils/Dataflow/pyDKB/VERSION`
Yes, these are the cases I mentioned as producing "similar puzzling logs".
Well, it feels like all the issues discovered in this PR (and around it) are fully addressed now.
Weird, either I got no messages about updates here, or I did and missed them.
Changes in error handling:

1. If Rucio returns `DatasetIdentifierNotFound`, set `deleted` to `True` (previously this was treated as any other `RucioException`).
2. In case of `RucioException`, mark the message as "incomplete" -- this record will be written to ES in "update" mode, not "insert" (previously `deleted` and `bytes` would be set to `True` and `-1`).
3. If some of the required fields cannot be extracted from Rucio (or have the value `null`), they are left unset (previously they would be set to `None`).
4. If some of the fields that were supposed to be added at this stage are missing from the result message, the message is marked as `_incomplete`.

Additional changes:

* `update` action for incomplete messages [1].

[1] It was not done initially since we wanted to first check how many "incomplete" records we're gonna have before making a decision: should records be "updated" on the fly, or maybe should some separate process update records marked with `_update_required` once a week or so. But now the goal is to make data integration more reliable; as for statistics, we may teach 069 to count "update" records... anyway, it should start counting records sooner or later, so why not.

What to check:

* ... `_incomplete` marker;
* `rucio.client` failed to load: (new item) ... `rucio.client` import; output message is marked as `_incomplete` (replaced with the one below);
* (To Be Discussed) ~~proper error message~~ (not found == removed, nothing to worry about); ... `_incomplete`;
* ... `_update_required` service field (in this case we will have a record with some data missed for sure);
* ... `_update_required` field (in this case we want to believe that we already have some of the missed data in the ES);
* with the `--update` option specified: ... `--update`; ... `"doc_as_upsert": true` (and without an `"upsert"` document); ... `_update_required` marker.

[2] Variants of "incorrect credentials": ...
[3] ...
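For context, a hedged example of what an ES bulk "update" entry with `doc_as_upsert` looks like (the index/type/id values here are made up):

```python
# One entry of a bulk request in "update" mode: fields already present in
# the stored document are preserved, and doc_as_upsert tells ES to create
# the document from 'doc' if it does not exist (no separate "upsert" doc).
action = {'update': {'_index': 'some_index', '_type': 'some_type',
                     '_id': 'mc15_13TeV...'}}
body = {
    'doc': {
        'deleted': True,
        '_update_required': True,   # flag: record needs further attention
    },
    'doc_as_upsert': True,
}
```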