Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error Handling (Network Error) needs improvement - should reconnect #156

Closed
aio-witc opened this issue Dec 30, 2023 · 1 comment · Fixed by #164
Closed

Error Handling (Network Error) needs improvement - should reconnect #156

aio-witc opened this issue Dec 30, 2023 · 1 comment · Fixed by #164

Comments

@aio-witc
Copy link

Context:

  • (internal: This is the discussed OpenDNSSEC "lost key" issue.)
  • we noticed random errors and unresponsive apps when using the pkcs11 module
  • we were finally able to reproduce it
  • underlying cause is a firewall, that drops idle sessions after a timeout (7200s in the original setting, 300s in the lab setup)
  • just reconnecting would have worked

Suggestions:

  • shield the applications from network issues
  • send keepalives (possibly SO_KEEPALIVE) - this might be a hotfix, but does not solve the basic issue. In case of a network outage, all pkcs11 dependent services would fail or hang for 15mins.
  • implement timeouts (configurable) - the pkcs11 should not "hang" for 15mins before returning an error (see logs!). e.g. if the NetHSM does not reply within 3s, try to reconnect (or return an error, if no reconnects are configured)
  • try to reconnect - when the timeout has expired, the module should transparently try to reconnect and to re-establish the session. This should be configurable as well (e.g. "try 10 times with 3s between the attempts")
  • CKR_DEVICE_REMOVED might be a better error for the case, when the connected timed out (Chapter 5.1.2 in the spec), in my limited understanding, CKR_DEVICE_ERROR is not expected from C_Sign*.

Message from python script:

<PrivateKey label='witcLoadTest' id='776974634c6f616454657374' EC>
waiting for 1s
waiting for 61s
waiting for 121s
waiting for 181s
waiting for 241s
waiting for 301s
Traceback (most recent call last):
  File "/home/awitc/dev/pkcs11/main.py", line 26, in <module>
    signature=key.sign(text)
              ^^^^^^^^^^^^^^
  File "/home/awitc/dev/pkcs11/venv/lib64/python3.12/site-packages/pkcs11/types.py", line 939, in sign
    return self._sign(data, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pkcs11/_pkcs11.pyx", line 1072, in pkcs11._pkcs11.SignMixin._sign
  File "pkcs11/_pkcs11.pyx", line 1083, in pkcs11._pkcs11.SignMixin._sign
  File "pkcs11/_errors.pyx", line 88, in pkcs11._pkcs11.assertRV
pkcs11.exceptions.DeviceError

tcpdump:

16:44:09.355851 IP 10.8.201.83.https > 10.8.97.141.59324: Flags [.], ack 2244710, win 65535, length 0
16:44:09.356861 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2244710:2244989, ack 1999208, win 6159, length 279
16:44:09.356994 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2244989:2245084, ack 1999208, win 6159, length 95
16:44:09.359853 IP 10.8.201.83.https > 10.8.97.141.59324: Flags [.], ack 2244989, win 65535, length 0
16:44:09.359854 IP 10.8.201.83.https > 10.8.97.141.59324: Flags [P.], seq 1999208:1999540, ack 2245084, win 65535, length 332
16:44:09.359854 IP 10.8.201.83.https > 10.8.97.141.59324: Flags [.], ack 2245084, win 65535, length 0
16:44:09.400549 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [.], ack 1999540, win 6159, length 0
16:49:10.361754 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245363, ack 1999540, win 6159, length 279
16:49:10.361916 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245363:2245458, ack 1999540, win 6159, length 95
16:49:10.569948 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:49:10.985992 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:49:11.809731 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:49:13.473990 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:49:16.801723 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:49:23.458046 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:49:36.770040 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:50:03.394152 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:50:58.689995 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:52:45.186021 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:54:48.065967 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:56:50.946011 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
16:58:53.826008 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
17:00:56.705984 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374
17:02:59.585967 IP 10.8.97.141.59324 > 10.8.201.83.https: Flags [P.], seq 2245084:2245458, ack 1999540, win 6159, length 374

Log:

[2023-12-30T15:44:09Z DEBUG ureq::unit] writing prelude: POST /api/v1/keys/witcLoadTest/sign HTTP/1.1
    Host: zg-pxx-hsmlab.zg.ch
    user-agent: pkcs11-rs/0.1.0
    authorization: ***
    accept: application/json
    content-type: application/json
    Content-Length: 73
[2023-12-30T15:44:09Z DEBUG ureq::response] Body entirely buffered (length: 112)
[2023-12-30T15:44:09Z DEBUG ureq::pool] adding stream to pool: https|zg-pxx-hsmlab.zg.ch|443 -> Stream(RustlsStream)
[2023-12-30T15:44:09Z DEBUG ureq::unit] response 200 to POST https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign
[2023-12-30T15:44:09Z DEBUG ureq::pool] pulling stream from pool: https|zg-pxx-hsmlab.zg.ch|443 -> Stream(RustlsStream)
[2023-12-30T15:44:09Z DEBUG ureq::unit] sending request (reused connection) POST https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign
[2023-12-30T15:44:09Z DEBUG ureq::unit] writing prelude: POST /api/v1/keys/witcLoadTest/sign HTTP/1.1
    Host: zg-pxx-hsmlab.zg.ch
    user-agent: pkcs11-rs/0.1.0
    authorization: ***
    accept: application/json
    content-type: application/json
    Content-Length: 73
[2023-12-30T15:44:09Z DEBUG ureq::response] Body entirely buffered (length: 112)
[2023-12-30T15:44:09Z DEBUG ureq::pool] adding stream to pool: https|zg-pxx-hsmlab.zg.ch|443 -> Stream(RustlsStream)
[2023-12-30T15:44:09Z DEBUG ureq::unit] response 200 to POST https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign
[2023-12-30T15:49:10Z DEBUG ureq::pool] pulling stream from pool: https|zg-pxx-hsmlab.zg.ch|443 -> Stream(RustlsStream)
[2023-12-30T15:49:10Z DEBUG ureq::unit] sending request (reused connection) POST https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign
[2023-12-30T15:49:10Z DEBUG ureq::unit] writing prelude: POST /api/v1/keys/witcLoadTest/sign HTTP/1.1
    Host: zg-pxx-hsmlab.zg.ch
    user-agent: pkcs11-rs/0.1.0
    authorization: ***
    accept: application/json
    content-type: application/json
    Content-Length: 73
[2023-12-30T16:05:02Z DEBUG ureq::stream] dropping stream: Stream(RustlsStream)
[2023-12-30T16:05:02Z ERROR nethsm_pkcs11::backend] Request error : https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign: Network Error: Network Error: Error encountered in the status line: Connection timed out (os error 110)

Python test script:

import pkcs11
import secrets
import time

lib = pkcs11.lib('./nethsm-pkcs11-v1.0.0-x86_64-fedora.38.so')
token = lib.get_token(token_label='LocalHSM')
session=token.open()
key=session.get_key(object_class=pkcs11.ObjectClass.PRIVATE_KEY, label='witcLoadTest')

for i in range(0, 60):
    for j in range(0, 1000):
        text=secrets.token_bytes(16)
        signature=key.sign(text)

    waiting=i*60+1
    print(f"waiting for {waiting}s")
    time.sleep(waiting)
@daringer
Copy link

daringer commented Jan 2, 2024

  • implement timeouts (configurable) - the pkcs11 should not "hang" for 15mins before returning an error (see logs!). e.g. if the NetHSM does not reply within 3s, try to reconnect (or return an error, if no reconnects are configured)
  • try to reconnect - when the timeout has expired, the module should transparently try to reconnect and to re-establish the session. This should be configurable as well (e.g. "try 10 times with 3s between the attempts")

we would first target these two issues with:

  • better default timeout
  • configurable timeout
  • reconnecting after timeout occurs

and then re-evaluate this issue accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants