Backup of huge disk breaks because of inactivity_timeout #71
Labels: bug (Issue is a bug or fix for a bug)
nirs added a commit to nirs/ovirt-imageio that referenced this issue on May 24, 2022

In commit d5e9c75 (http: Configurable inactivity timeout) we changed the default timeout from 60 seconds to 15. The reasoning was that clients have no reason to connect and keep the connection idle for a long time. Once a client connects, it is expected to start sending requests. On the first request, the socket timeout is replaced by the ticket inactivity timeout, set by the user creating the transfer.

It turns out that there is a valid use case for idle clients, and the shorter timeout breaks downloads of big images (reproduced with an 8 TiB image). The failure flow is:

1. The client connects and sends an EXTENTS request.
2. While the EXTENTS request is collecting data, the client opens multiple download connections.
3. The connected download threads wait on a queue for work, but since the EXTENTS request has not finished, the connections are idle.
4. After 15 seconds the server closes the idle connections.
5. When the EXTENTS request finishes, the client fails to send a request to the server.

Here is an example failure on the client side from an ovirt-stress backup run:

    ...
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 288, in copy
        self._src.write_to(self._dst, req.length, self._buf)
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 207, in write_to
        res = self._get(length)
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 432, in _get
        self._con.request("GET", self.url.path, headers=headers)
      File "/usr/lib64/python3.6/http/client.py", line 1273, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/usr/lib64/python3.6/http/client.py", line 1319, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/usr/lib64/python3.6/http/client.py", line 1268, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/usr/lib64/python3.6/http/client.py", line 1044, in _send_output
        self.send(msg)
      File "/usr/lib64/python3.6/http/client.py", line 1004, in send
        self.sock.sendall(data)
    BrokenPipeError: [Errno 32] Broken pipe

Looking in the server logs, we can see that the EXTENTS request took about 24 seconds:

    2022-05-24 16:22:57,548 INFO (Thread-75) [extents] [local] EXTENTS transfer=39f14719-2533-45d5-8315-9d0b577d5732 context=zero
    ...
    2022-05-24 16:23:44,907 INFO (Thread-75) [http] CLOSE connection=75 client=local [connection 1 ops, 47.359750 s] [dispatch 2 ops, 47.297816 s] [extents 2 ops, 47.296568 s]

Downloading an 8 TiB disk is an edge case, but this can also happen with smaller images on a very fragmented file system, or if anything else causes the EXTENTS request to be slow.

Revert the timeout back to the previous value used in oVirt 4.4. We may shorten the timeout once we support partial extents:
https://bugzilla.redhat.com/1924940

Fixes oVirt#71
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
nirs added a commit to nirs/ovirt-imageio that referenced this issue on May 25, 2022
nirs added a commit that referenced this issue on May 25, 2022
nirs added and removed the "needs testing" (Candidate for downstream testing) label on May 25, 2022
An incremental backup of an 8 TiB disk can fail if getting the image
extents takes more than 15 seconds.
Flow is:
The issue is caused by #14, which shortened the socket timeout to 15 seconds.
Before this change, the timeout was 60 seconds.
Slow extents are a known issue: we don't support partial extents, so the
client must get the extents for the entire image, and for very large images
this can be slow. This is tracked in https://bugzilla.redhat.com/1924940
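The failure mode described here, a server closing a connection that stayed idle longer than its socket timeout, can be reproduced in miniature with plain sockets. The sketch below is not ovirt-imageio code: the helper names `serve_once` and `request_after_idle` are hypothetical, and the timeouts are scaled down from 15 and 60 seconds to fractions of a second so that it runs quickly.

```python
import socket
import threading
import time


def serve_once(listener, idle_timeout):
    """Accept one connection and drop it if no request arrives in time."""
    conn, _ = listener.accept()
    with conn:
        conn.settimeout(idle_timeout)
        try:
            data = conn.recv(1024)
            if data:
                conn.sendall(b"ok")
        except socket.timeout:
            # The client stayed idle for too long; leaving the "with"
            # block closes the connection, like the server in this issue.
            pass


def request_after_idle(idle_time, server_timeout):
    """Connect, stay idle for idle_time seconds, then try one request.

    Returns True if the server answered, False if the connection was
    already closed by the server.
    """
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    server = threading.Thread(
        target=serve_once, args=(listener, server_timeout))
    server.start()
    try:
        with socket.create_connection(("127.0.0.1", port)) as client:
            # Stands in for a download thread waiting for EXTENTS to finish.
            time.sleep(idle_time)
            try:
                client.sendall(b"GET")
                return client.recv(1024) == b"ok"
            except OSError:
                # BrokenPipeError and ConnectionResetError are OSErrors.
                return False
    finally:
        server.join()
        listener.close()


if __name__ == "__main__":
    print("idle longer than timeout:", request_after_idle(0.5, 0.1))
    print("idle shorter than timeout:", request_after_idle(0.05, 1.0))
```

Depending on timing, the failed request may surface as an empty read, a connection reset, or the broken pipe reported in this issue; the sketch treats all three as the same failure.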
Log
Analysis
In the logs of the connection getting EXTENTS we see:
We have 2 calls: one with context=zero, and one with context=dirty. The call with
context=zero is needed since we don't have a way to get the image size without
calling extents. This will be eliminated once #67 is fixed.
It is likely that both EXTENTS calls took the same time, since they return similar
data. Only about 3 GiB of the image is in use, so both the dirty and zero extents
requests report an unchanged, clean area of about 8189 GiB. So each EXTENTS call
took ~24 seconds.
If we revert the short socket timeout (15 seconds) back to 60 seconds, this flow
should work in most cases. It can still break if extents is slower, for example
with a big image on a very fragmented file system.
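The timing argument can be checked with a few lines of arithmetic. This is a rough sketch using the figures from the CLOSE log line quoted in the referenced commit message (2 extents ops in 47.296568 seconds). It assumes, as the analysis does, that both calls took about the same time and that a download connection stays idle for roughly one EXTENTS call. The function name is hypothetical.

```python
# Figures from the CLOSE log line in the referenced commit message.
EXTENTS_OPS = 2
EXTENTS_SECONDS = 47.296568

SHORT_TIMEOUT = 15  # socket timeout after the change, in seconds
LONG_TIMEOUT = 60   # socket timeout before the change, in seconds


def idle_connection_survives(idle_seconds, timeout):
    """Return True if a connection idle for idle_seconds is still open."""
    return idle_seconds < timeout


# Assumption: both EXTENTS calls took about the same time.
per_call = EXTENTS_SECONDS / EXTENTS_OPS

print(f"{per_call:.1f} s per EXTENTS call")
print("15 s timeout:", idle_connection_survives(per_call, SHORT_TIMEOUT))
print("60 s timeout:", idle_connection_survives(per_call, LONG_TIMEOUT))
```

With these figures an idle connection is closed under the 15 second timeout and survives under the 60 second one, but any EXTENTS call slower than 60 seconds would fail the same way.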