DF/091: improve error handling #320

Merged · 13 commits · Apr 13, 2020
Conversation

@mgolosova (Collaborator) commented Feb 17, 2020

Changes in error handling.

  1. If Rucio returns DatasetIdentifierNotFound, set deleted to
    True (previously this was treated like any other RucioException).
  2. In case of a RucioException, mark the message as "incomplete" -- this
    record will be written to ES in "update" mode, not "insert"
    (previously deleted and bytes would be set to True and -1).
  3. If some of the required fields cannot be extracted from Rucio (or have
    value null), they are left unset (previously they would be set to
    None).
  4. If some of the fields that were supposed to be added at this stage
    are missing from the resulting message, the message is marked as
    _incomplete.
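A minimal sketch of how rules 1-4 could fit together (hypothetical code, not the actual stage; `fetch_metadata`, the locally defined exception classes and the field names stand in for the real Rucio calls and message fields):

```python
# Stand-ins for the rucio.common.exception classes (illustrative only).
class RucioException(Exception):
    pass

class DataIdentifierNotFound(RucioException):
    pass

REQUIRED_FIELDS = ('bytes', 'events')  # illustrative field names

def get_ds_info(ds_name, fetch_metadata):
    """Build a dataset record, applying the four error-handling rules."""
    ds = {'datasetname': ds_name}
    try:
        mdata = fetch_metadata(ds_name)
    except DataIdentifierNotFound:
        # Rule 1: "not found" means "removed", not a failure.
        ds['deleted'] = True
    except RucioException:
        # Rule 2: any other Rucio error -> record stays incomplete.
        pass
    else:
        for field in REQUIRED_FIELDS:
            # Rule 3: missing/null values are left unset, not set to None.
            if mdata.get(field) is not None:
                ds[field] = mdata[field]
    # Rule 4: mark the message if required fields are missing.
    if not all(f in ds for f in REQUIRED_FIELDS):
        ds['_incomplete'] = True
    return ds
```

The key point is the asymmetry: "not found" yields a definite fact (deleted=True), while any other failure only downgrades the record to "update" mode via _incomplete.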

Additional changes:

  • Stage 019 is set to always use the update action for incomplete messages [1];
  • fixed a minor bug in logic: when output datasets were being processed, all datasets were dropped as soon as a single one had problems with the ES indexing service fields definition;
  • Stage 091: refactoring;
  • Stage 091: samples update.

[1] It was not done initially because we wanted to first check how many "incomplete" records we would get before making a decision: should records be "updated" on the fly, or should some separate process update records marked with _update_required once a week or so.
But now the goal is to make data integration more reliable; as for statistics, we can teach 069 to count "update" records -- it has to start counting records sooner or later anyway, so why not.


What to check:

  • general logic: things should really work as described below:
  • behaviour when there's no Rucio errors:
    • records are generated without _incomplete marker;
    • -//-
  • behaviour in case of Rucio errors:
    • rucio.client failed to load: (new item)
      • stage stops with error message indicating problems with rucio.client import;
      • -//-
    • invalid credentials [2]:
      • proper error message;
      • -//-
      • output message is marked as _incomplete (To Be Discussed) (replaced with the one below);
      • interrupt stage execution;
      • -//-
    • dataset not found:
      • proper error message (not found == removed, nothing to worry about);
      • output message is marked as _incomplete;
      • -//-
  • interaction with Stage 019 [3]:
    • not "incomplete" messages are treated as usual (decorated with "insert" action JSON);
    • -//-
    • "incomplete" messages are decorated with "update" action JSON;
    • -//-
    • "update" records are to be:
      • inserted with _update_required service field (in this case we will have record with some data missed for sure);
      • -//-
      • used as update for existing records without _update_required field (in this case we want to believe that we already have some of the missed data in the ES).
      • -//-
    • in case of --update option specified:
      • "incomplete" messages are treated just the same as without --update;
      • not "incomplete" messages are passed with "doc_as_upsert": true (and without an "upsert" document);
      • not "incomplete" messages are passed without _update_required marker.
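The decoration rules above can be sketched roughly as follows (a hedged illustration: the action/body layout follows the Elasticsearch bulk API, where "insert" corresponds to the index action; the function and field handling are assumptions, not the actual Stage 019 code):

```python
import json

def decorate(message, update_mode=False):
    """Sketch: wrap a stage-output message in an ES bulk action pair.

    The `_id`/`_incomplete` handling and the index metadata are
    illustrative; the real stage may carry more metadata.
    """
    meta = {'_id': message.pop('_id', None)}
    incomplete = message.pop('_incomplete', False)
    if incomplete:
        # Incomplete records must not overwrite existing data: update an
        # existing doc, or insert a stub flagged with _update_required.
        action = {'update': meta}
        body = {'doc': message,
                'upsert': dict(message, _update_required=True)}
    elif update_mode:
        # --update: complete records become upserts of themselves.
        action = {'update': meta}
        body = {'doc': message, 'doc_as_upsert': True}
    else:
        # Normal case: plain "insert" (index action in bulk terms).
        action = {'index': meta}
        body = message
    return json.dumps(action) + '\n' + json.dumps(body)
```

Note how the incomplete branch encodes both checklist cases at once: if the record does not exist yet, ES inserts the upsert document (which carries _update_required); if it does exist, only the partial doc is applied.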

[2] Variants of "incorrect credentials":

  • incorrect account name;
  • incorrect password for the certificate;
  • empty string provided as certificate;
  • empty string provided as key;
  • no certificate provided;
  • no key provided;
  • cert file does not exist;
  • key file does not exist;
  • certificate is a plain text file;
  • key is a plain text file.

[3]

```
cd Utils/Dataflow/091_datasetsRucio

# Prepare input samples for Stage 019
head input/sample2018.ndjson \
  | VIRTUAL_ENV=~dkb/.rucio ./datasets_processing.py -m s -E '' -t o \
  | tr $'\n' $'\x1e' \
  > output_ds_2018
head input/sample2018.ndjson \
  | VIRTUAL_ENV=~dkb/.rucio ./datasets_processing.py -m s -E '' -t i \
  | ../016_task2es/task2es.py -m s -E '' \
  | tr $'\n' $'\x1e' > input_ds_2018

# Run Stage 019
cat output_ds_2018 | ../019_esFormat/run.sh | less
cat input_ds_2018 | ../019_esFormat/run.sh | less
```

Same operations were performed in `process_input_ds()` and
`get_output_ds_info()`; now they are collected under the name
`get_ds_info()`, while all the input/output DS type related operations
are left to `process_*_ds()`.
If a single dataset record does not contain enough information to
construct service fields for ES indexing, it does not mean that all the
datasets should be skipped.
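The bug fix described above can be sketched like this (illustrative names; `make_es_fields` stands in for whatever builds the ES service fields):

```python
def process_output_datasets(datasets, make_es_fields):
    """Sketch: skip only the problematic dataset record,
    instead of dropping the whole batch."""
    results = []
    for ds in datasets:
        try:
            ds.update(make_es_fields(ds))
        except KeyError:
            # Not enough info for ES indexing: skip this record only.
            continue
        results.append(ds)
    return results
```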
Changes in error handling.
1. If Rucio returns `DatasetIdentifierNotFound`, set `deleted` to
   `True` (previously was treated as any other `RucioException`).
2. In case of `RucioException` mark message as "incomplete" -- this
   record will be written to ES in "update" mode, not "insert"
   (previously `deleted` and `bytes` would be set to `True` and `-1`).
3. If some of the required fields cannot be extracted from Rucio (or have
   value `null`), they are left unset (previously they would be set to
   `None`).
4. If some of the fields that were supposed to be added at this stage
   are missing from the resulting message, the message is marked as
   `_incomplete`.

What is not good in this logic: if a dataset was removed from Rucio, it
will always be marked as "requiring update". But for now this is the only
option we have: to use "update" instead of "insert", we need to know that
the message is incomplete; and when a message coming to Stage 019 is
incomplete -- it should be marked as "requiring update" for further
investigation.

Maybe the logic could be extended and "_incomplete" should be turned
into two different markers: "update since the original source removed
useful data" and "update since we failed to connect to the original
source", but... not now.
@Evildoor (Contributor) commented Mar 4, 2020

On first glance the changes look fine, will verify the logic and the checklist later.

```
2020-03-05 19:28:04 (DEBUG) (ProcessorStage) Traceback (most recent call last):
(==)   File "<...>/ProcessorStage.py", line 231, in run
(==)     if msg and process(self, msg):
(==)   File "<...>/091_datasetsRucio/datasets_processing.py", line 233, in process_input_ds
(==)     data.update(ds)
(==) UnboundLocalError: local variable 'ds' referenced before assignment
```
@mgolosova (Collaborator, Author)

Rucio stage (?)

`/Utils/Dataflow/test/utils/compare_ndjson_files.py` shows changes like these:

```
Record seem to differ for uid=mc15_13TeV.387007.MGPy8EG_A14N_BB_direct_300_295_MET100.merge.AOD.e3994_a766_a821_r7676_tid08124406_00
key = deleted:
Items missed in (2):
(1) False
Items missed in (1):
(2) True
Item missed in (2): 'bytes'
Item missed in (2): 'events'
Key missed in (1): '_incomplete'
```

...which means:
* DS with given uid was removed ('deleted: True');
* fields 'bytes' and 'events' are not present in the new record;
* key '_incomplete' is added.

So when this record gets to the 019 (esFormat) and 069 (upload2es), it
will be written in the "update" mode: 'bytes' and 'events' won't be
changed, 'removed' will be set to True, and '_update_required' will be
set to True.

The latter may not be the best option, but it is the only flag we have
right now to say that there's something wrong with the given record... and
here "wrong" means that Rucio did not provide us with the information we
needed.
witxka added a commit that referenced this pull request Mar 17, 2020
@Evildoor (Contributor)

Proper error message is missing for a non-existing dataset, it seems that RucioException does not trigger in process_output_ds():

```
[vaulov@aiatlas171 091_datasetsRucio]$ ./datasets_processing.py -s s -d s
(TRACE) Set VIRTUAL_ENV: /afs/cern.ch/user/v/vaulov/Work/dkb/Utils/Dataflow/091_datasetsRucio/.rucio
2020-03-26 11:05:30 (WARN) (pyDKB.dataflow.cds) Submodule failed (No module named invenio_client.contrib)
2020-03-26 11:05:30 (INFO) (ProcessorStage) Configuration parameters:
2020-03-26 11:05:30 (INFO) (ProcessorStage)   hdfs         : 'False'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   ds_type      : 'o'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   dest         : 's'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   input_dir    : 'None'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   skip_process : 'False'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   eom          : '\n'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   output_dir   : 'out'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   source       : 's'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   eop          : ''
2020-03-26 11:05:30 (INFO) (ProcessorStage)   mode         : 'f'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   config       : 'None'
2020-03-26 11:05:30 (INFO) (ProcessorStage)   input_files  : '[]'
2020-03-26 11:05:30 (INFO) (ProcessorStage) Starting stage execution.
{"taskname": "mc16_13Te0_yyWW_llvv.recon.e6666_e5984_a875_r9364", "taskid": 102, "primary_input": "mc16_13T75_tid14331895_00", "output":"abc"}
{"datasetname": "abc", "deleted": true, "_type": "output_dataset", "_parent": 102, "_incomplete": true, "_id": "abc"}
```

By the way, not sure if it is in the scope of this PR, but missing "output" produces no feedback altogether:

```
2020-03-26 11:08:09 (INFO) (ProcessorStage) Starting stage execution.
{}
{"output":"abc"}
(WARN) Skip message (not enough info for ES indexing).
2020-03-26 11:08:20 (INFO) (ProcessorStage) Stopping stage.
```

Input datasets are similarly tied to "primary_input", but they pass the message further without changes instead of dropping it.

@mgolosova (Collaborator, Author) commented Mar 27, 2020

@Evildoor,

Thank you for checking things out and providing examples [1].

Proper error message is missing for a non-existing dataset, it seems that RucioException does not trigger in process_output_ds():

```
091_datasetsRucio$ echo '{"taskname": "mc16_13Te0_yyWW_llvv.recon.e6666_e5984_a875_r9364", "taskid": 102, "primary_input": "mc16_13T75_tid14331895_00", "output":"abc"}' |  ./datasets_processing.py -s s -d s
(TRACE) Set VIRTUAL_ENV: /root/.rucio
2020-03-27 12:10:38 (WARN) (pyDKB.dataflow.cds) Submodule failed (No module named invenio_client.contrib)
2020-03-27 12:10:38 (INFO) (ProcessorStage) Configuration parameters:
2020-03-27 12:10:38 (INFO) (ProcessorStage)   hdfs         : 'False'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   ds_type      : 'o'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   dest         : 's'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   input_dir    : 'None'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   skip_process : 'False'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   eom          : '\n'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   output_dir   : 'out'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   source       : 's'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   eop          : ''
2020-03-27 12:10:38 (INFO) (ProcessorStage)   mode         : 'f'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   config       : 'None'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   input_files  : '[]'
2020-03-27 12:10:38 (INFO) (ProcessorStage) Starting stage execution.
{"datasetname": "abc", "deleted": true, "_type": "output_dataset", "_parent": 102, "_incomplete": true, "_id": "abc"}
2020-03-27 12:10:38 (INFO) (ProcessorStage) Stopping stage.
```

Right, and it is expected behaviour, for a "non-existing" dataset cannot be distinguished from a "deleted" one. The RucioException in this case is very precise: DataIdentifierNotFound; and it is handled in get_ds_info() (check here).

By "expected behaviour" I understand the following:

  • stage does not fail;
  • stage generates a record that will add the following information to ES:
    • Task with ID=102 has output dataset with name="abc";
    • Dataset with name="abc" has property deleted=True;
    • Last time we checked we couldn't get the whole information about dataset with name="abc", so we might want to try it again some day;
  • stage is ready to process the next record.

I don't see any problem in this scenario, but it doesn't mean there are none -- please let me know if I missed something.

But if we log this information for every removed dataset, it will be as useful as in Stage 095 -- far too many records about "no data found", which I usually have to filter out of the log before reading. Of course, at some level of detail we would like to have this information, but as of now we don't have level-based control of the logs, so what goes there is very custom. And for now I just didn't want to know about every dataset removed from Rucio, so I decided not to log it. But any other RucioException -- yes, it should be logged, since I'm not sure what it might be and what to do about it.

Update: I see, I added this as a "thing to check" to the checklist in the PR's description. Crossed it out, for it was a bit too paranoid an item.


By the way, not sure if it is in the scope of this PR, but missing "output" produces no feedback altogether

Not every task must have output; e.g. it may be an aborted task or something else. So it is not an error, it is just a task without output datasets -- nothing to worry about.

Input datasets are similarly tied to "primary_input", but they pass the message further without changes instead of dropping it.

Right. In case of input datasets the stage is supposed to modify the input message; so if no modification is needed -- it just outputs the original version. But in case of output datasets, it is supposed to generate new messages from scratch; so if no messages are generated -- nothing is passed further. And the original message is dropped from the message flow anyway, since the downstream stages expect messages with dataset, not task, metadata.
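The asymmetry described here might be sketched as (hypothetical signatures, not the actual stage code; `get_info` stands in for the Rucio lookup):

```python
def process_input_ds(message, get_info):
    """Sketch: input-dataset processing enriches the incoming task
    message in place and always passes it on."""
    name = message.get('primary_input')
    if name:
        message.update(get_info(name))
    return [message]          # original message continues downstream

def process_output_ds(message, get_info):
    """Sketch: output-dataset processing emits new dataset records;
    the task message itself is dropped from the flow."""
    names = message.get('output') or []
    if isinstance(names, str):
        names = [names]
    return [get_info(n) for n in names]   # may be empty: nothing emitted
```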


[1] It is more convenient to provide the command you're running in every example, even if it's the same one (maybe skipping the long log at the stage's beginning), with the input data passed to the stage via echo (or cat for multiline input) -- not from the keyboard. This allows reproducing your experiment with a single copy-and-paste action and makes it easier to distinguish the input data from the output.

@Evildoor (Contributor)

@Evildoor,

Thank you for checking things out and providing examples [1].

Proper error message is missing for a non-existing dataset, it seems that RucioException does not trigger in process_output_ds():

```
091_datasetsRucio$ echo '{"taskname": "mc16_13Te0_yyWW_llvv.recon.e6666_e5984_a875_r9364", "taskid": 102, "primary_input": "mc16_13T75_tid14331895_00", "output":"abc"}' |  ./datasets_processing.py -s s -d s
(TRACE) Set VIRTUAL_ENV: /root/.rucio
2020-03-27 12:10:38 (WARN) (pyDKB.dataflow.cds) Submodule failed (No module named invenio_client.contrib)
2020-03-27 12:10:38 (INFO) (ProcessorStage) Configuration parameters:
2020-03-27 12:10:38 (INFO) (ProcessorStage)   hdfs         : 'False'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   ds_type      : 'o'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   dest         : 's'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   input_dir    : 'None'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   skip_process : 'False'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   eom          : '\n'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   output_dir   : 'out'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   source       : 's'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   eop          : ''
2020-03-27 12:10:38 (INFO) (ProcessorStage)   mode         : 'f'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   config       : 'None'
2020-03-27 12:10:38 (INFO) (ProcessorStage)   input_files  : '[]'
2020-03-27 12:10:38 (INFO) (ProcessorStage) Starting stage execution.
{"datasetname": "abc", "deleted": true, "_type": "output_dataset", "_parent": 102, "_incomplete": true, "_id": "abc"}
2020-03-27 12:10:38 (INFO) (ProcessorStage) Stopping stage.
```

Right, and it is expected behaviour, for a "non-existing" dataset cannot be distinguished from a "deleted" one. The RucioException in this case is very precise: DataIdentifierNotFound; and it is handled in get_ds_info() (check here).

By "expected behaviour" I understand the following:

  • stage does not fail;

  • stage generates a record that will add the following information to ES:

    • Task with ID=102 has output dataset with name="abc";
    • Dataset with name="abc" has property deleted=True;
    • Last time we checked we couldn't get the whole information about dataset with name="abc", so we might want to try it again some day;
  • stage is ready to process the next record.

I don't see any problem in this scenario, but it doesn't mean there are none -- please let me know if I missed something.

But if we log this information for every removed dataset, it will be as useful as in Stage 095 -- far too many records about "no data found", which I usually have to filter out of the log before reading. Of course, at some level of detail we would like to have this information, but as of now we don't have level-based control of the logs, so what goes there is very custom. And for now I just didn't want to know about every dataset removed from Rucio, so I decided not to log it. But any other RucioException -- yes, it should be logged, since I'm not sure what it might be and what to do about it.

Update: I see, I added this as a "thing to check" to the checklist in the PR's description. Crossed it out, for it was a bit too paranoid an item.

Aha, I see. Ok then.

By the way, not sure if it is in the scope of this PR, but missing "output" produces no feedback altogether

Not every task must have output; e.g. it may be an aborted task or something else. So it is not an error, it is just a task without output datasets -- nothing to worry about.

Input datasets are similarly tied to "primary_input", but they pass the message further without changes instead of dropping it.

Right. In case of input datasets the stage is supposed to modify the input message; so if no modification is needed -- it just outputs the original version. But in case of output datasets, it is supposed to generate new messages from scratch; so if no messages are generated -- nothing is passed further. And the original message is dropped from the message flow anyway, since the downstream stages expect messages with dataset, not task, metadata.

True.

[1] It is more convenient to provide the command you're running in every example, even if it's the same one (maybe skipping the long log at the stage's beginning), with the input data passed to the stage via echo (or cat for multiline input) -- not from the keyboard. This allows reproducing your experiment with a single copy-and-paste action and makes it easier to distinguish the input data from the output.

You are right, my bad.

Evildoor previously approved these changes Mar 30, 2020
@mgolosova (Collaborator, Author) commented Mar 31, 2020

While (re-)checking things myself stumbled over this kind of error:
After this PR:

```
091_datasetsRucio]$ head -n 1 input/sample2018.ndjson | VIRTUAL_ENV=./ ./datasets_processing.py -m s
(ERROR) Could not load rucio configuration file rucio.cfg.Rucio looks in the following directories for a configuration file, in order:
        ${RUCIO_HOME}/etc/rucio.cfg
        /opt/rucio/etc/rucio.cfg
        ${VIRTUAL_ENV}/etc/rucio.cfg.
2020-03-31 12:19:20 (WARN) (pyDKB.dataflow.cds) Submodule failed (No module named invenio_client.contrib)
2020-03-31 12:19:20 (INFO) (ProcessorStage) Configuration parameters:
<...>
2020-03-31 12:19:20 (INFO) (ProcessorStage) Starting stage execution.
(FATAL) Failed to initialize Rucio client: module not loaded.
2020-03-31 12:19:20 (ERROR) (ProcessorStage) global name 'DataIdentifierNotFound' is not defined
2020-03-31 12:19:20 (DEBUG) (ProcessorStage) Traceback (most recent call last):
(==)   File "/home/dkb/dkb-dev.git/Utils/Dataflow/091_datasetsRucio/../pyDKB/dataflow/stage/ProcessorStage.py", line 231, in run
(==)     if msg and process(self, msg):
(==)   File "./datasets_processing.py", line 149, in process_output_ds
(==)     ds = get_ds_info(ds_name, mfields)
(==)   File "./datasets_processing.py", line 271, in get_ds_info
(==)     except DataIdentifierNotFound, err:
(==) NameError: global name 'DataIdentifierNotFound' is not defined
2020-03-31 12:19:20 (INFO) (ProcessorStage) Stopping stage.
```

Before this PR:

```
(ERROR) Could not load rucio configuration file rucio.cfg.Rucio looks in the following directories for a configuration file, in order:
        ${RUCIO_HOME}/etc/rucio.cfg
        /opt/rucio/etc/rucio.cfg
        ${VIRTUAL_ENV}/etc/rucio.cfg.
2020-03-31 12:24:52 (WARN) (pyDKB.dataflow.cds) Submodule failed (No module named invenio_client.contrib)
2020-03-31 12:24:52 (INFO) (ProcessorStage) Configuration parameters:
<...>
2020-03-31 12:24:52 (INFO) (ProcessorStage) Starting stage execution.
(FATAL) Failed to initialize Rucio client: module not loaded.
2020-03-31 12:24:52 (ERROR) (ProcessorStage) Module not found or misconfigured: 'rucio'
2020-03-31 12:24:52 (DEBUG) (ProcessorStage) Traceback (most recent call last):
(==)   File "/home/dkb/dkb-dev.git/Utils/Dataflow/091_datasetsRucio/../pyDKB/dataflow/stage/ProcessorStage.py", line 231, in run
(==)     if msg and process(self, msg):
(==)   File "./datasets_processing.py", line 144, in process_output_ds
(==)     ds = get_output_ds_info(dataset)
(==)   File "./datasets_processing.py", line 231, in get_output_ds_info
(==)     mdata = get_metadata(dataset, mfields.keys())
(==)   File "./datasets_processing.py", line 275, in get_metadata
(==)     rucio_client = get_rucio_client()
(==)   File "./datasets_processing.py", line 116, in get_rucio_client
(==)     init_rucio_client()
(==)   File "./datasets_processing.py", line 105, in init_rucio_client
(==)     raise DataflowException("Module not found or misconfigured: 'rucio'")
(==) DataflowException: Module not found or misconfigured: 'rucio'
2020-03-31 12:24:52 (INFO) (ProcessorStage) Stopping stage.
```

The latter looks noisy but accurate (FATAL (why we stopped) + ERROR (what brought us here) + DEBUG (what happened and where)); but the new version with global name ... not defined is definitely not what we'd like to see when we failed to load the Rucio client module...

So I think I will cancel the approval for now (so that the PR is not eventually merged) and fix this behaviour.


Update: fixed in 72f374d.
Also added an item to the PR's things-to-be-checked list.

@mgolosova dismissed Evildoor’s stale review March 31, 2020 10:35

There's an issue to be fixed.

When the stage fails to load the `rucio.client` module, it does not stop
execution -- for in case the `--skip` option is specified we do not
actually need the module to go on.

The script exits only when it tries to initialize the client -- but since
the initialization occurs within a `try/except` clause, it first stumbles
over the expected exception name (`DataIdentifierNotFound`), which it
failed to load from `rucio.client`. We had a similar issue with
`RucioException` once, so now the new exception is handled the same way.
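One way the fix can work (a sketch; the actual commit may differ) is a guarded import that defines placeholder exception classes when rucio is unavailable, so the later `except` clauses never hit a NameError:

```python
# Sketch: keep the exception names defined even if the rucio module is
# missing, so `except DataIdentifierNotFound` never raises NameError.
try:
    from rucio.common.exception import RucioException, DataIdentifierNotFound
except ImportError:
    class RucioException(Exception):
        """Placeholder used when the rucio module failed to load."""

    class DataIdentifierNotFound(RucioException):
        """Placeholder used when the rucio module failed to load."""
```

With this guard, the failure surfaces later as a clean "module not loaded" error at client initialization instead of an unrelated NameError in an except clause.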
@mgolosova (Collaborator, Author)

@Evildoor,
I'm having problems with checking this item:

  • behaviour in case of Rucio errors:
    • invalid credentials:
      • proper error message;
      • output message is marked as _incomplete;

How did you test it?

All I get is the "proper error message", but the stage exits instead of going on and producing the 'incomplete' messages:

```
091_datasetsRucio]$ cat ./etc/rucio.cfg
[common]
loglevel = DEBUG

[client]
rucio_host = https://voatlasrucio-server-prod.cern.ch:443
auth_host = https://voatlasrucio-auth-prod.cern.ch:443
ca_cert = /etc/pki/tls/certs/CERN-bundle.pem
account = dkb
client_cert = /etc/motd
client_key = /etc/motd
auth_type = x509
request_retries = 3
091_datasetsRucio]$
091_datasetsRucio]$ head -n 1 input/sample2018.ndjson | VIRTUAL_ENV=./ ./datasets_processing.py -m s -t i
2020-03-31 14:41:45 (WARN) (pyDKB.dataflow.cds) Submodule failed (No module named invenio_client.contrib)
2020-03-31 14:41:45 (INFO) (ProcessorStage) Configuration parameters:
<...>
2020-03-31 14:41:45 (INFO) (ProcessorStage) Starting stage execution.
Internal error: Request for authentication token returned no result!
(ERROR) Failed to initialize Rucio client.
(ERROR) Cannot authenticate.
(==) Details: x509 authentication failed for account=dkb with identity={'client_cert': '/etc/motd', 'client_key': '/etc/motd'}.
2020-03-31 14:41:45 (ERROR) (ProcessorStage) 1
2020-03-31 14:41:45 (DEBUG) (ProcessorStage) Traceback (most recent call last):
(==)   File "/home/dkb/dkb-dev.git/Utils/Dataflow/091_datasetsRucio/../pyDKB/dataflow/stage/ProcessorStage.py", line 231, in run
<...traceback...>
(==) SystemExit: 1
2020-03-31 14:41:45 (INFO) (ProcessorStage) Stopping stage.
```

Or, if valid certificate/key files passed with incorrect user name:

```
<...>
2020-03-31 14:47:38 (INFO) (ProcessorStage) Starting stage execution.
(ERROR) Failed to initialize Rucio client.
(ERROR) Cannot authenticate.
(==) Details: Cannot authenticate to account dkb with given credentials.
<...>
```

@mgolosova (Collaborator, Author) commented Mar 31, 2020

I am not sure what exactly I was thinking more than a month ago, but now the following statement sounds fair to me:

If a stage is misconfigured and cannot operate properly (e.g. if 091 is provided
with invalid credentials), it means that the whole integration process is
misconfigured (for the process' configuration is a combination of its stages'
configurations -- plus, of course, the stages themselves, their execution
instructions, ordering, etc.)

If the integration process is misconfigured, it must not try to skip the
misconfigured stages. It must fail. For if it does not, it will be harder to
spot the misconfiguration.

So if a stage is misconfigured -- it must fail.

In other words, when the configuration is changed, the one making the changes must make sure that the process operates normally afterwards. Which sounds fair to me, for if the new configuration is incorrect -- it's better to know about it immediately, not a few weeks later, when someone from the outside starts wondering: "why doesn't the DKB provide information about dataset sizes for the latest tasks any more?"


From this point of view, the behaviour stated above (when stage 091 stops after noticing invalid credentials) is the correct one, while the list of things-to-be-checked should be updated accordingly.

@Evildoor (Contributor) commented Apr 1, 2020

I am not sure what exactly I was thinking more than a month ago, but now the following statement sounds fair to me:

If a stage is misconfigured and cannot operate properly (e.g. if 091 is provided
with invalid credentials), it means that the whole integration process is
misconfigured (for the process' configuration is a combination of its stages'
configurations -- plus, of course, the stages themselves, their execution
instructions, ordering, etc.)

If the integration process is misconfigured, it must not try to skip the
misconfigured stages. It must fail. For if it does not, it will be harder to
spot the misconfiguration.

So if a stage is misconfigured -- it must fail.

In other words, when the configuration is changed, the one making the changes must make sure that the process operates normally afterwards. Which sounds fair to me, for if the new configuration is incorrect -- it's better to know about it immediately, not a few weeks later, when someone from the outside starts wondering: "why doesn't the DKB provide information about dataset sizes for the latest tasks any more?"

From this point of view, the behaviour stated above (when stage 091 stops after noticing invalid credentials) is the correct one, while the list of things-to-be-checked should be updated accordingly.

I have a feeling that we had already discussed this issue, and even went back and forth a bit, changing our decision at least once, but I could not find anything to prove my suspicion. It seems that we did discuss another, similar issue, and that's what's triggering my memory: it was also a question of "to stop or not to stop stage execution when credentials are wrong", but for AMI instead of Rucio. The verdict was "yes, stop" -- you are suggesting the same thing here, and I have no objections to your logic, so I'll agree here as well.

@Evildoor (Contributor) commented Apr 1, 2020

@Evildoor,
I'm having problems with checking this item:

  • behaviour in case of Rucio errors:

    • invalid credentials:

      • proper error message;
      • output message is marked as _incomplete;

How did you test it?

Hm. I did several tests with various problems and was satisfied with them, but I can't remember what exactly I did. Never mind -- I'll run new tests, since we had a discussion about whether to interrupt the stage or not, AND we are having issues here. Results (bold highlights the cases which look wrong to me):

  • Incorrect account name given in config: interrupt with complaint about credentials.
  • Incorrect password supplied for the certificate: interrupt, but the error message is puzzling:
```
[vaulov@aiatlas171 091_datasetsRucio]$ cat input/sample2018.ndjson | ./datasets_processing.py -m s
(TRACE) Set VIRTUAL_ENV: /afs/cern.ch/user/v/vaulov/Work/dkb/Utils/Dataflow/091_datasetsRucio/.rucio
2020-04-01 21:17:04 (WARN) (pyDKB.dataflow.cds) Submodule failed (No module named invenio_client.contrib)
2020-04-01 21:17:04 (INFO) (ProcessorStage) Configuration parameters:
2020-04-01 21:17:04 (INFO) (ProcessorStage)   hdfs         : 'False'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   ds_type      : 'o'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   dest         : 's'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   input_dir    : 'None'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   skip_process : 'False'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   eom          : '\n'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   output_dir   : 'out'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   source       : 's'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   eop          : ''
2020-04-01 21:17:04 (INFO) (ProcessorStage)   mode         : 's'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   config       : 'None'
2020-04-01 21:17:04 (INFO) (ProcessorStage)   input_files  : '[]'
2020-04-01 21:17:04 (INFO) (ProcessorStage) Starting stage execution.
Enter PEM pass phrase:
2020-04-01 21:17:07 (ERROR) (ProcessorStage) [Errno 22] Invalid argument
2020-04-01 21:17:07 (DEBUG) (ProcessorStage) Traceback (most recent call last):
(==)   File "/afs/cern.ch/user/v/vaulov/Work/dkb/Utils/Dataflow/091_datasetsRucio/../pyDKB/dataflow/stage/ProcessorStage.py", line 231, in run
(==)     if msg and process(self, msg):
(==)   File "./datasets_processing.py", line 150, in process_output_ds
(==)     ds = get_ds_info(ds_name, mfields)
(==)   File "./datasets_processing.py", line 267, in get_ds_info
(==)     mdata = get_metadata(dataset, mfields.keys())
(==)   File "./datasets_processing.py", line 311, in get_metadata
(==)     metadata = rucio_client.get_metadata(scope=scope, name=dataset)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/didclient.py", line 402, in get_metadata
(==)     r = self._send_request(url, type='GET')
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/baseclient.py", line 319, in _send_request
(==)     self.__get_token()
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/baseclient.py", line 553, in __get_token
(==)     if not self.__get_token_x509():
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/baseclient.py", line 398, in __get_token_x509
(==)     verify=self.ca_cert)
(==)   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 498, in get
(==)     return self.request('GET', url, **kwargs)
(==)   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 486, in request
(==)     resp = self.send(prep, **send_kwargs)
(==)   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 598, in send
(==)     r = adapter.send(request, **kwargs)
(==)   File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 370, in send
(==)     timeout=timeout
(==)   File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
(==)     chunked=chunked)
(==)   File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 343, in _make_request
(==)     self._validate_conn(conn)
(==)   File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
(==)     conn.connect()
(==)   File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 344, in connect
(==)     ssl_context=context)
(==)   File "/usr/lib/python2.7/site-packages/urllib3/util/ssl_.py", line 336, in ssl_wrap_socket
(==)     context.load_cert_chain(certfile, keyfile)
(==) IOError: [Errno 22] Invalid argument
2020-04-01 21:17:07 (INFO) (ProcessorStage) Stopping stage.
```

  • No certificate given in config: interrupt with message about missing certificate.
  • No key given in config: processing with "incomplete" and error messages.
  • Incorrect certificate (name of a non-existent file is given): processing with "incomplete" and error messages.
  • Incorrect key (name of a non-existent file is given): interrupt with message about missing file.
  • Incorrect certificate (name of a text file is given): processing with "incomplete" and error messages.
  • Incorrect key (name of a text file is given): processing with "incomplete" and error messages.

That's all I could think of, for now.
Note: I wanted to provide error messages for the other cases as well, but they contain things, such as paths to correct credentials, that I don't want to put here in a public discussion (yes, they are also protected by other means, but still), and redacting them out defeats the whole purpose, or would require way too much time. The same goes for the changes to the Rucio config that were made to provoke the desired result.

@mgolosova
Collaborator Author

@Evildoor,
there are some difficulties with testing due to the internal logic of the Rucio client module. In particular, if a user managed to get authenticated once, their "token" is stored in /tmp/$USER/.rucio_<account> and is reused at the next authentication attempt.

During my tests (see [2] in the PR description) -- this time -- I removed the locally stored token after every successful authentication attempt, to make sure that each test was done in a "fresh" environment. And here are the results that differ from yours:

  • No key given in config: processing with "incomplete" and error messages.

In my case, it claimed Internal error: Request for authentication token returned no result! (meaning that no locally stored token was found) and exited normally.

  • Incorrect certificate (name of a non-existent file is given): processing with "incomplete" and error messages.

Here it claimed that the file does not exist.

  • Incorrect certificate (name of a text file is given): processing with "incomplete" and error messages.
  • Incorrect key (name of a text file is given): processing with "incomplete" and error messages.

In both cases: Internal error: Request for authentication token returned no result! and exit.

I can guess that the "_incomplete" marker you got might be a correct result -- datasets mentioned in the sample with data from 2018 may already be removed. I ran tests with the dataset name "mc16_13TeV.395366.MGPy8EG_A14N23LO_CC_directZlHlWv_550.simul.HITS.e6715_e5984_a875_tid14195904_00" specified as a value for both "primary_input" and "output" fields, and it gave me "*deleted": false and other Rucio-related fields.

However, I don't know about the "error messages" you mention: I saw none.

As for this:

  • Incorrect password supplied for the certificate: interrupt, but error message is puzzling

-- I also got the same or similar puzzling logs in a couple of other cases. I'll see what I can do about it and get back soon.


About "should or should not the stage fail":

I have a feeling that we had already discussed this issue <...> but for AMI instead of Rucio (#282 (review)). The verdict was "yes, stop" - you are suggesting the same thing here, and I have no objections to your logic, so I'll agree here as well.

Thank you for finding it: I also felt that there had been multiple discussions of this question, but thought that all of them took place at weekly meetings and none was fixed in text form. Good to know some of the conclusions were in fact written down, and even better -- that back then the decision was pretty much the same. I think it is worth documenting somewhere in the installation/development guide as a "best practice" for stage development, so that we (and anyone else) wouldn't stumble over it over and over again for each stage. If, of course, one remembers to check the guide instead of reinventing the wheel :)

The Rucio client module may raise not only `RucioException` but also some
other exceptions: e.g. `IOError` (when the files provided in the
configuration as user certificate/key cannot be read).

What we need to do in case of an error in `init_rucio_client()` is to
make sure that this error can be distinguished from any other error
occurring during the Rucio client usage. In other words -- when we fail to
initialise the client, it is not a "Rucio error" (indicating that we
have some problems with Rucio), but a "dataflow error" (indicating that
we have some problems with the stage, which is unable to do its job).

Calling `sys.exit()` is a bit severe: yes, we need to stop the stage
execution if we can't initialise the client; but it would also be
correct to say that we need to stop the dataflow. And, if
`DataflowException` is raised during the `process()` execution, it is
up to the common library to take care about the problem (output
necessary details and interrupt the stage -- or do whatever is the
default action when a stage says "I have a problem operating as a part
of the dataflow process").
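The commit message above can be sketched in code like this (a minimal illustration, not the actual pyDKB API: the `DataflowException` name matches the message, but `init_client` and the `client_factory` parameter -- which stands in for `rucio.client.Client` -- are assumptions):

```python
class DataflowException(Exception):
    """The stage cannot operate as a part of the dataflow process."""


def init_client(client_factory):
    """Initialise a client, turning any failure into a dataflow error.

    Whatever goes wrong here (RucioException, IOError, ...) is not a
    "Rucio error" but a "dataflow error": the stage is unable to do its
    job, and the common library decides how to interrupt the stage.
    """
    try:
        return client_factory()
    except Exception as err:
        raise DataflowException("Failed to initialise Rucio client: %s"
                                % err)
```

With this, the stage code never calls `sys.exit()` itself; raising `DataflowException` from `process()` leaves the interruption to the common library.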
@Evildoor
Contributor

Evildoor commented Apr 3, 2020

@Evildoor,
there are some difficulties with testing due to the internal logic of the Rucio client module. In particular, if user managed to get authenticated once, its "token" is stored in /tmp/$USER/.rucio_<account> and is used at the next authentication attempt.

Yes, I've noticed a similar thing, and did my tests by performing a successful attempt last in each series of tests -- but it seems that it wasn't enough, despite my observations. Thanks for pointing out where the token is. Now, with its removal before each attempt, the results are the same as yours.
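For the record, the per-test cleanup can be sketched as below (the account value and the token file name are stand-ins for illustration; only the cache directory path is from the discussion above):

```shell
# Directory where the Rucio client caches the auth token for the
# current account ("testacct" is a stand-in default).
RUCIO_ACCOUNT="${RUCIO_ACCOUNT:-testacct}"
TOKEN_DIR="/tmp/${USER:-$(id -un)}/.rucio_${RUCIO_ACCOUNT}"

# Simulate a token left over from a previous successful run.
mkdir -p "$TOKEN_DIR"
touch "$TOKEN_DIR/auth_token_for_account"

# Cleanup step executed before each test, so every attempt has to
# authenticate in a "fresh" environment.
rm -rf "$TOKEN_DIR"
```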

I can guess that "_incomplete" marker you got might be a correct result -- datasets mentioned in the sample with data from 2018 may already be removed. I make tests with dataset name "mc16_13TeV.395366.MGPy8EG_A14N23LO_CC_directZlHlWv_550.simul.HITS.e6715_e5984_a875_tid14195904_00" specified as a value for both "primary_input" and "output" fields, and it gives me "*deleted": false and other Rucio-related fields.

If my memory serves, I tested on batches that included non-deleted datasets, but...

However I don't know about "error messages" you mention: I saw none.

... the messages are obsolete, since removal of the token gets rid of them.

As for this:

  • Incorrect password supplied for the certificate: interrupt, but error message is puzzling

-- I also got same or similar puzzling logs in a couple of other cases. I'll see what I can do about it and get back soon.

Good.

  • empty string provided as certificate;
  • empty string provided as key;

Processing is interrupted; the message is a bit puzzling:

[vaulov@aiatlas171 091_datasetsRucio]$ cat input_data | ./datasets_processing.py -m s
(TRACE) Set VIRTUAL_ENV: /afs/cern.ch/user/v/vaulov/Work/dkb/Utils/Dataflow/091_datasetsRucio/.rucio
2020-04-03 15:44:52 (WARN) (pyDKB.dataflow.cds) Submodule failed (No module named invenio_client.contrib)
2020-04-03 15:44:52 (INFO) (ProcessorStage) Configuration parameters:
2020-04-03 15:44:52 (INFO) (ProcessorStage)   hdfs         : 'False'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   ds_type      : 'o'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   dest         : 's'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   input_dir    : 'None'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   skip_process : 'False'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   eom          : '\n'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   output_dir   : 'out'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   source       : 's'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   eop          : ''
2020-04-03 15:44:52 (INFO) (ProcessorStage)   mode         : 's'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   config       : 'None'
2020-04-03 15:44:52 (INFO) (ProcessorStage)   input_files  : '[]'
2020-04-03 15:44:52 (INFO) (ProcessorStage) Starting stage execution.
2020-04-03 15:44:52 (ERROR) (ProcessorStage) [Errno 21] Is a directory
2020-04-03 15:44:52 (DEBUG) (ProcessorStage) Traceback (most recent call last):
(==)   File "/afs/cern.ch/user/v/vaulov/Work/dkb/Utils/Dataflow/091_datasetsRucio/../pyDKB/dataflow/stage/ProcessorStage.py", line 231, in run
(==)     if msg and process(self, msg):
(==)   File "./datasets_processing.py", line 144, in process_output_ds
(==)     ds = get_output_ds_info(dataset)
(==)   File "./datasets_processing.py", line 231, in get_output_ds_info
(==)     mdata = get_metadata(dataset, mfields.keys())
(==)   File "./datasets_processing.py", line 275, in get_metadata
(==)     rucio_client = get_rucio_client()
(==)   File "./datasets_processing.py", line 116, in get_rucio_client
(==)     init_rucio_client()
(==)   File "./datasets_processing.py", line 101, in init_rucio_client
(==)     rucio_client = rucio.client.Client()
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/client.py", line 64, in __init__
(==)     super(Client, self).__init__(rucio_host=rucio_host, auth_host=auth_host, account=account, ca_cert=ca_cert, auth_type=auth_type, creds=creds, timeout=timeout, user_agent=user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/accountclient.py", line 44, in __init__
(==)     super(AccountClient, self).__init__(rucio_host, auth_host, account, ca_cert, auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/accountlimitclient.py", line 40, in __init__
(==)     auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/metaclient.py", line 45, in __init__
(==)     super(MetaClient, self).__init__(rucio_host, auth_host, account, ca_cert, auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/pingclient.py", line 35, in __init__
(==)     super(PingClient, self).__init__(rucio_host, auth_host, account, ca_cert, auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/replicaclient.py", line 45, in __init__
(==)     super(ReplicaClient, self).__init__(rucio_host, auth_host, account, ca_cert, auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/requestclient.py", line 37, in __init__
(==)     auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/rseclient.py", line 47, in __init__
(==)     auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/scopeclient.py", line 46, in __init__
(==)     super(ScopeClient, self).__init__(rucio_host, auth_host, account, ca_cert, auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/didclient.py", line 56, in __init__
(==)     auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/ruleclient.py", line 39, in __init__
(==)     super(RuleClient, self).__init__(rucio_host, auth_host, account, ca_cert, auth_type, creds, timeout, dq2_wrapper)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/subscriptionclient.py", line 38, in __init__
(==)     super(SubscriptionClient, self).__init__(rucio_host, auth_host, account, ca_cert, auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/lockclient.py", line 44, in __init__
(==)     auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/configclient.py", line 44, in __init__
(==)     super(ConfigClient, self).__init__(rucio_host, auth_host, account, ca_cert, auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/touchclient.py", line 41, in __init__
(==)     super(TouchClient, self).__init__(rucio_host, auth_host, account, ca_cert, auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/importclient.py", line 35, in __init__
(==)     auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/exportclient.py", line 35, in __init__
(==)     auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/credentialclient.py", line 35, in __init__
(==)     auth_type, creds, timeout, user_agent)
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/baseclient.py", line 227, in __init__
(==)     self.__authenticate()
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/baseclient.py", line 646, in __authenticate
(==)     self.__get_token()
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/baseclient.py", line 553, in __get_token
(==)     if not self.__get_token_x509():
(==)   File "/usr/lib/python2.7/site-packages/rucio/client/baseclient.py", line 398, in __get_token_x509
(==)     verify=self.ca_cert)
(==)   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 498, in get
(==)     return self.request('GET', url, **kwargs)
(==)   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 486, in request
(==)     resp = self.send(prep, **send_kwargs)
(==)   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 598, in send
(==)     r = adapter.send(request, **kwargs)
(==)   File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 370, in send
(==)     timeout=timeout
(==)   File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
(==)     chunked=chunked)
(==)   File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 343, in _make_request
(==)     self._validate_conn(conn)
(==)   File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
(==)     conn.connect()
(==)   File "/usr/lib/python2.7/site-packages/urllib3/connection.py", line 344, in connect
(==)     ssl_context=context)
(==)   File "/usr/lib/python2.7/site-packages/urllib3/util/ssl_.py", line 336, in ssl_wrap_socket
(==)     context.load_cert_chain(certfile, keyfile)
(==) IOError: [Errno 21] Is a directory

Not sure if we can catch the difference between this and an actual case of a directory being specified as a key/certificate. Even if the answer is "no", this can probably be left as-is, because it points towards the credentials, after all.
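One way to surface a friendlier message than the low-level `IOError` from deep inside the SSL stack is a pre-flight check of the credential paths before creating the client (a sketch under assumptions: the function name and the error wording are not from the actual code):

```python
import os


def check_credentials(cert, key):
    """Fail early, with a readable message, if the configured
    certificate/key paths are not regular files (empty string,
    directory, non-existent path, ...)."""
    for label, path in (("client certificate", cert),
                        ("client key", key)):
        if not path or not os.path.isfile(path):
            raise ValueError("Rucio %s is not a readable file: %r"
                             % (label, path))
```

This would catch both the "empty string" and the "directory" cases before the Rucio client ever touches the SSL layer, while genuinely unreadable-but-existing files would still fall through to the original `IOError`.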

Might be useful if we raise it instead of any other exception (when we want
to specify our own exception message, but would still like to keep the
original exception information as well).
...to make the traceback shorter, and the topmost error message -- more
informative.
@mgolosova
Collaborator Author

@Evildoor,

  • empty string provided as certificate;
  • empty string provided as key;

Processing is interrupted; the message is a bit puzzling.

Yes, these are the cases I mentioned as producing "similar puzzling logs".
The new version should provide a more readable output for all IOErrors raised during an attempt to load the certificate from files. However, that Is a directory phrase -- as well as Invalid argument in the case of a wrong password -- is still there; it's not very friendly, but it is the message originally generated by some system library call, so I don't think it would be a good idea to reword it somehow.

@mgolosova
Collaborator Author

Well, it feels like all the issues discovered in this PR (and around it) are fully addressed now.
Did I miss something?

Contributor

@Evildoor Evildoor left a comment

Weird, either I got no messages about updates here, or I did and missed them.

@mgolosova mgolosova merged commit 4e2373b into master Apr 13, 2020
@mgolosova mgolosova deleted the 091-error-handling branch April 13, 2020 10:44
@mgolosova mgolosova mentioned this pull request Apr 13, 2020