(internal: This is the discussed OpenDNSSEC "lost key" issue.)
Context:
We noticed random errors and unresponsive applications when using the PKCS#11 module.
We were finally able to reproduce the problem.
The underlying cause is a firewall that drops idle sessions after a timeout (7200 s in the original setting, 300 s in the lab setup).
Simply reconnecting would have worked.
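One commonly used mitigation for idle-session drops of this kind is TCP keepalive, with probes spaced shorter than the firewall's idle timeout. The following is a minimal sketch using Python's standard `socket` module, not a description of how the NetHSM module sets up its connections. The function name `enable_keepalive` and the default values are assumptions chosen for illustration, and the `TCP_KEEP*` options are platform specific (available on Linux), so the sketch guards each one.

```python
import socket


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=4):
    """Turn on TCP keepalive for an existing socket (illustrative helper).

    idle_s should be shorter than the firewall's idle timeout
    (300 s in the lab setup described above).
    """
    # Portable part: ask the OS to send keepalive probes on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Platform-specific tuning; these constants do not exist everywhere.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between subsequent probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

Keepalive only prevents the idle drop. It does not help when the network path is actually down, so it complements timeouts and reconnects and does not replace them.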
Suggestions:
Shield the applications from network issues:
Send keepalives (possibly SO_KEEPALIVE). This might be a hotfix, but it does not solve the basic issue: in case of a network outage, all PKCS#11-dependent services would still fail or hang for 15 minutes.
Implement configurable timeouts. The PKCS#11 module should not hang for 15 minutes before returning an error (see logs). For example, if the NetHSM does not reply within 3 s, try to reconnect (or return an error if no reconnects are configured).
Try to reconnect. When the timeout has expired, the module should transparently try to reconnect and re-establish the session. This should be configurable as well (e.g. "try 10 times with 3 s between attempts").
CKR_DEVICE_REMOVED might be a better error for the case where the connection timed out (Chapter 5.1.2 in the spec). In my limited understanding, CKR_DEVICE_ERROR is not expected from C_Sign*.
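The timeout and reconnect behaviour suggested above could look roughly like the following. This is a generic Python sketch of the control flow only, not the module's implementation (the module itself is not written in Python). The function `call_with_reconnect` and its parameters `timeout_s`, `max_attempts` and `delay_s` are hypothetical names, and the defaults mirror the example numbers from the suggestions (3 s timeout, 10 attempts, 3 s between attempts).

```python
import time


def call_with_reconnect(request, reconnect, *, timeout_s=3.0,
                        max_attempts=10, delay_s=3.0, sleep=time.sleep):
    """Run request(timeout_s); on timeout or connection failure, reconnect and retry.

    request   -- callable taking a timeout in seconds and returning a result
    reconnect -- callable that discards the stale connection and opens a new one
    All names here are illustrative assumptions, not real module options.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Each individual request is bounded by timeout_s.
            return request(timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt == max_attempts:
                break  # out of attempts, report the failure below
            sleep(delay_s)   # wait before the next attempt
            reconnect()      # re-establish the connection/session
    # Bounded total wait: the caller gets an error instead of hanging.
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

With these defaults, a caller would wait at most about a minute (10 attempts of up to 3 s each, plus 9 pauses of 3 s) before receiving an error, compared with the roughly 15 minutes observed in the logs.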
Message from Python script:
<PrivateKey label='witcLoadTest' id='776974634c6f616454657374' EC>
waiting for 1s
waiting for 61s
waiting for 121s
waiting for 181s
waiting for 241s
waiting for 301s
Traceback (most recent call last):
  File "/home/awitc/dev/pkcs11/main.py", line 26, in <module>
    signature=key.sign(text)
              ^^^^^^^^^^^^^^
  File "/home/awitc/dev/pkcs11/venv/lib64/python3.12/site-packages/pkcs11/types.py", line 939, in sign
    return self._sign(data, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pkcs11/_pkcs11.pyx", line 1072, in pkcs11._pkcs11.SignMixin._sign
  File "pkcs11/_pkcs11.pyx", line 1083, in pkcs11._pkcs11.SignMixin._sign
  File "pkcs11/_errors.pyx", line 88, in pkcs11._pkcs11.assertRV
pkcs11.exceptions.DeviceError
Log:
[2023-12-30T15:44:09Z DEBUG ureq::unit] writing prelude: POST /api/v1/keys/witcLoadTest/sign HTTP/1.1
Host: zg-pxx-hsmlab.zg.ch
user-agent: pkcs11-rs/0.1.0
authorization: ***
accept: application/json
content-type: application/json
Content-Length: 73
[2023-12-30T15:44:09Z DEBUG ureq::response] Body entirely buffered (length: 112)
[2023-12-30T15:44:09Z DEBUG ureq::pool] adding stream to pool: https|zg-pxx-hsmlab.zg.ch|443 -> Stream(RustlsStream)
[2023-12-30T15:44:09Z DEBUG ureq::unit] response 200 to POST https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign
[2023-12-30T15:44:09Z DEBUG ureq::pool] pulling stream from pool: https|zg-pxx-hsmlab.zg.ch|443 -> Stream(RustlsStream)
[2023-12-30T15:44:09Z DEBUG ureq::unit] sending request (reused connection) POST https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign
[2023-12-30T15:44:09Z DEBUG ureq::unit] writing prelude: POST /api/v1/keys/witcLoadTest/sign HTTP/1.1
Host: zg-pxx-hsmlab.zg.ch
user-agent: pkcs11-rs/0.1.0
authorization: ***
accept: application/json
content-type: application/json
Content-Length: 73
[2023-12-30T15:44:09Z DEBUG ureq::response] Body entirely buffered (length: 112)
[2023-12-30T15:44:09Z DEBUG ureq::pool] adding stream to pool: https|zg-pxx-hsmlab.zg.ch|443 -> Stream(RustlsStream)
[2023-12-30T15:44:09Z DEBUG ureq::unit] response 200 to POST https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign
[2023-12-30T15:49:10Z DEBUG ureq::pool] pulling stream from pool: https|zg-pxx-hsmlab.zg.ch|443 -> Stream(RustlsStream)
[2023-12-30T15:49:10Z DEBUG ureq::unit] sending request (reused connection) POST https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign
[2023-12-30T15:49:10Z DEBUG ureq::unit] writing prelude: POST /api/v1/keys/witcLoadTest/sign HTTP/1.1
Host: zg-pxx-hsmlab.zg.ch
user-agent: pkcs11-rs/0.1.0
authorization: ***
accept: application/json
content-type: application/json
Content-Length: 73
[2023-12-30T16:05:02Z DEBUG ureq::stream] dropping stream: Stream(RustlsStream)
[2023-12-30T16:05:02Z ERROR nethsm_pkcs11::backend] Request error : https://zg-pxx-hsmlab.zg.ch/api/v1/keys/witcLoadTest/sign: Network Error: Network Error: Error encountered in the status line: Connection timed out (os error 110)
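To make the timing in the log explicit, the short calculation below derives the idle gap and the hang duration from the timestamps in the log above. It is only arithmetic on those three timestamps; the variable names are illustrative.

```python
from datetime import datetime

# Timestamps copied from the log (UTC).
last_success = datetime.fromisoformat("2023-12-30T15:44:09+00:00")
request_sent = datetime.fromisoformat("2023-12-30T15:49:10+00:00")
error_logged = datetime.fromisoformat("2023-12-30T16:05:02+00:00")

# Time the pooled connection sat idle before being reused.
idle = request_sent - last_success
# Time the sign call was blocked before the error surfaced.
hang = error_logged - request_sent

minutes, seconds = divmod(int(hang.total_seconds()), 60)
print(f"idle before reuse: {idle.total_seconds():.0f} s")
print(f"blocked for: {hang.total_seconds():.0f} s ({minutes} min {seconds} s)")
```

The idle gap of 301 s is just past the 300 s firewall timeout of the lab setup, and the call then blocks for 952 s. That duration is in the range one would expect from the Linux default TCP retransmission limit (`net.ipv4.tcp_retries2` = 15, documented as a hypothetical timeout of about 924.6 s), although the log alone does not confirm that this is the mechanism.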
Python test script:
import pkcs11
import secrets
import time

lib = pkcs11.lib('./nethsm-pkcs11-v1.0.0-x86_64-fedora.38.so')
token = lib.get_token(token_label='LocalHSM')
session = token.open()
key = session.get_key(object_class=pkcs11.ObjectClass.PRIVATE_KEY, label='witcLoadTest')

for i in range(0, 60):
    # Sign 1000 random 16-byte messages back to back.
    for j in range(0, 1000):
        text = secrets.token_bytes(16)
        signature = key.sign(text)
    # Then stay idle for a growing interval (1 s, 61 s, 121 s, ...).
    waiting = i*60 + 1
    print(f"waiting for {waiting}s")
    time.sleep(waiting)