Backup of huge disk breaks because of inactivity_timeout #71

Closed
nirs opened this issue May 24, 2022 · 0 comments · Fixed by #74

nirs commented May 24, 2022

Trying to do an incremental backup of an 8 TiB disk can fail if getting image
extents takes more than 15 seconds.

The flow is (see the sketch after the list):

  1. Client opens the first connection and sends an OPTIONS request.
  2. Client sends an EXTENTS request.
  3. Client opens 4 download connections (they wait until EXTENTS returns).
  4. EXTENTS is very slow; 15 seconds pass...
  5. The server read timeout expires, and the idle connections are closed.
  6. EXTENTS returns, and the client starts sending requests.
  7. The client fails with BrokenPipeError since the server closed the other side.
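
Below is a minimal sketch of this flow using plain http.client. The daemon
address, the /images/<ticket>/extents path, and the Range header are
illustrative assumptions, not the actual imageio client code:

    import http.client

    HOST, PORT = "localhost", 54322                        # hypothetical daemon address
    PATH = "/images/39f14719-2533-45d5-8315-9d0b577d5732"  # hypothetical transfer URL

    # 1-2. First connection: ask for the extents of the whole image.
    extents_con = http.client.HTTPConnection(HOST, PORT)
    extents_con.request("GET", PATH + "/extents?context=dirty")

    # 3. While waiting for the extents reply, open 4 download connections.
    #    They are connected but idle - no request has been sent on them yet.
    download_cons = [http.client.HTTPConnection(HOST, PORT) for _ in range(4)]
    for con in download_cons:
        con.connect()

    # 4-5. If extents takes longer than the server's 15 second socket timeout,
    #      the server closes the idle download connections.
    extents = extents_con.getresponse().read()

    # 6-7. The first request sent on a connection the server already closed
    #      fails with BrokenPipeError: [Errno 32] Broken pipe.
    for con in download_cons:
        con.request("GET", PATH, headers={"Range": "bytes=0-65535"})
        con.getresponse().read()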

The issue is caused by #14, which shortened the socket timeout to 15 seconds.
Before this change, the timeout was 60 seconds.

Slow extents is a known issue - we don't support partial extents, so the
client must get extents for the entire image, and for very large images
this can be slow. This is tracked in https://bugzilla.redhat.com/1924940

Log

2022-05-24 16:22:57,509 INFO    (Thread-73) [tickets] [local] ADD transfer={'dirty': True, 'ops': ['read'], 'size': 8796093022208, 'sparse': False, 'inactivity_timeout': 60, 'transfer_id': '39f14719-2533-45d5-8315-9d0b577d5732', 'uuid': 'e751adce-fd08-48d8-89af-b22e6ebff8d9', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/39f14719-2533-45d5-8315-9d0b577d5732.sock'}
2022-05-24 16:22:57,510 INFO    (Thread-73) [http] CLOSE connection=73 client=local [connection 1 ops, 0.001157 s] [dispatch 1 ops, 0.000419 s]
2022-05-24 16:22:57,545 INFO    (Thread-74) [http] OPEN connection=74 client=::ffff:192.168.122.23
2022-05-24 16:22:57,546 INFO    (Thread-74) [images] [::ffff:192.168.122.23] OPTIONS transfer=39f14719-2533-45d5-8315-9d0b577d5732
2022-05-24 16:22:57,547 INFO    (Thread-74) [http] CLOSE connection=74 client=::ffff:192.168.122.23 [connection 1 ops, 0.002147 s] [dispatch 1 ops, 0.000830 s]
2022-05-24 16:22:57,548 INFO    (Thread-75) [http] OPEN connection=75 client=local
2022-05-24 16:22:57,548 INFO    (Thread-75) [extents] [local] EXTENTS transfer=39f14719-2533-45d5-8315-9d0b577d5732 context=zero
2022-05-24 16:22:59,043 INFO    (Thread-76) [http] OPEN connection=76 client=local
2022-05-24 16:22:59,044 INFO    (Thread-76) [http] CLOSE connection=76 client=local [connection 1 ops, 0.000561 s] [dispatch 1 ops, 0.000156 s]
2022-05-24 16:23:03,064 INFO    (Thread-77) [http] OPEN connection=77 client=local
2022-05-24 16:23:03,064 INFO    (Thread-77) [http] CLOSE connection=77 client=local [connection 1 ops, 0.000575 s] [dispatch 1 ops, 0.000144 s]
2022-05-24 16:23:11,074 INFO    (Thread-78) [http] OPEN connection=78 client=local
2022-05-24 16:23:11,075 INFO    (Thread-78) [http] CLOSE connection=78 client=local [connection 1 ops, 0.000532 s] [dispatch 1 ops, 0.000125 s]
2022-05-24 16:23:21,101 INFO    (Thread-79) [http] OPEN connection=79 client=local
2022-05-24 16:23:21,101 INFO    (Thread-79) [http] CLOSE connection=79 client=local [connection 1 ops, 0.000600 s] [dispatch 1 ops, 0.000192 s]
2022-05-24 16:23:21,429 INFO    (Thread-80) [http] OPEN connection=80 client=local
2022-05-24 16:23:21,430 INFO    (Thread-81) [http] OPEN connection=81 client=local
2022-05-24 16:23:21,432 INFO    (Thread-82) [http] OPEN connection=82 client=local
2022-05-24 16:23:21,433 INFO    (Thread-75) [extents] [local] EXTENTS transfer=39f14719-2533-45d5-8315-9d0b577d5732 context=dirty
2022-05-24 16:23:21,435 INFO    (Thread-83) [http] OPEN connection=83 client=local
2022-05-24 16:23:31,130 INFO    (Thread-84) [http] OPEN connection=84 client=local
2022-05-24 16:23:31,131 INFO    (Thread-84) [http] CLOSE connection=84 client=local [connection 1 ops, 0.000530 s] [dispatch 1 ops, 0.000127 s]
2022-05-24 16:23:36,445 WARNING (Thread-80) [http] Timeout reading or writing to socket: timed out
2022-05-24 16:23:36,445 INFO    (Thread-80) [http] CLOSE connection=80 client=local [connection 1 ops, 15.015437 s]
2022-05-24 16:23:36,445 WARNING (Thread-81) [http] Timeout reading or writing to socket: timed out
2022-05-24 16:23:36,445 INFO    (Thread-81) [http] CLOSE connection=81 client=local [connection 1 ops, 15.014717 s]
2022-05-24 16:23:36,445 WARNING (Thread-82) [http] Timeout reading or writing to socket: timed out
2022-05-24 16:23:36,445 INFO    (Thread-82) [http] CLOSE connection=82 client=local [connection 1 ops, 15.013580 s]
2022-05-24 16:23:36,450 WARNING (Thread-83) [http] Timeout reading or writing to socket: timed out
2022-05-24 16:23:36,450 INFO    (Thread-83) [http] CLOSE connection=83 client=local [connection 1 ops, 15.015075 s]
2022-05-24 16:23:41,214 INFO    (Thread-85) [http] OPEN connection=85 client=local
2022-05-24 16:23:41,215 INFO    (Thread-85) [http] CLOSE connection=85 client=local [connection 1 ops, 0.000699 s] [dispatch 1 ops, 0.000201 s]
2022-05-24 16:23:44,907 INFO    (Thread-75) [http] CLOSE connection=75 client=local [connection 1 ops, 47.359750 s] [dispatch 2 ops, 47.297816 s] [extents 2 ops, 47.296568 s]
2022-05-24 16:23:51,272 INFO    (Thread-86) [http] OPEN connection=86 client=local
2022-05-24 16:23:51,273 INFO    (Thread-86) [tickets] [local] REMOVE transfer=39f14719-2533-45d5-8315-9d0b577d5732
2022-05-24 16:23:51,275 INFO    (Thread-86) [http] CLOSE connection=86 client=local [connection 1 ops, 0.002394 s] [dispatch 1 ops, 0.000553 s]

Analysis

In the log of the connection serving the EXTENTS requests we see:

2022-05-24 16:23:44,907 INFO    (Thread-75) [http] CLOSE connection=75 client=local [connection 1 ops, 47.359750 s] [dispatch 2 ops, 47.297816 s] [extents 2 ops, 47.296568 s]

We have 2 calls - one with context=zero and one with context=dirty. The call with
context=zero is needed since we don't have a way to get the image size without
calling extents. This will be eliminated when #67 is fixed.

It is likely that both EXTENTS calls took about the same time, since they return
similar data. The image holds only about 3 GiB of data, so both the dirty and zero
extents requests cover an unchanged clean area of about 8189 GiB. So each EXTENTS
call took ~24 seconds.
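
The ~24 second estimate follows from the CLOSE line above, which reports the
dispatch time for both EXTENTS ops on the same connection:

    # Per-call time derived from the CLOSE line of connection 75:
    # [dispatch 2 ops, 47.297816 s] - two EXTENTS calls on one connection.
    dispatch_seconds = 47.297816
    extents_ops = 2
    per_call = dispatch_seconds / extents_ops   # ~23.6 s, well above the 15 s socket timeout
    print(f"{per_call:.1f} seconds per EXTENTS call")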

If we revert the short socket timeout (15 seconds) back to 60 seconds, this flow
should work in most cases. It can still break if extents is even slower, for example
with a big image on a very fragmented file system.

@nirs nirs added the bug Issue is a bug or fix for a bug label May 24, 2022
nirs added a commit to nirs/ovirt-imageio that referenced this issue May 24, 2022
In commit d5e9c75 (http: Configurable inactivity timeout) we
changed the default timeout from 60 seconds to 15. The reasoning was
that clients have no reason to connect and keep the connection idle for
a long time. Once a client connects, it is expected to start sending
requests. On the first request, the socket timeout is replaced by the
ticket inactivity timeout, set by the user creating the transfer.
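
A minimal sketch of this timeout handling, with illustrative names
(DEFAULT_SOCKET_TIMEOUT, Ticket, read_request) that are assumptions rather
than the actual imageio internals:

    import socket
    from dataclasses import dataclass

    DEFAULT_SOCKET_TIMEOUT = 60   # seconds; commit d5e9c75 lowered this to 15

    @dataclass
    class Ticket:
        inactivity_timeout: int   # set by the user creating the transfer

    def read_request(conn: socket.socket) -> bytes:
        # Placeholder: read just enough of the request to find the ticket.
        return conn.recv(4096)

    def handle_connection(conn: socket.socket, lookup_ticket) -> None:
        # A new connection only gets the generic socket timeout...
        conn.settimeout(DEFAULT_SOCKET_TIMEOUT)
        first_request = read_request(conn)    # idle clients time out here
        ticket = lookup_ticket(first_request)
        # ...and from the first request on, the per-transfer inactivity
        # timeout takes over.
        conn.settimeout(ticket.inactivity_timeout)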

It turns out that there is a valid use case for idle clients, and the
shorter timeout breaks downloads of big images (reproduced with an 8 TiB
image). The failure flow is:

1. Client connects and sends an EXTENTS request.
2. While the EXTENTS request is collecting data, the client opens multiple
   download connections.
3. The connected download threads wait on a queue for work, but since the
   EXTENTS request did not finish, the connections are idle.
4. After 15 seconds the server closes the idle connections.
5. When the EXTENTS request finishes, the client fails to send requests to
   the server.

Here is an example failure on the client side from an ovirt-stress backup run:

      ...
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 288, in copy
        self._src.write_to(self._dst, req.length, self._buf)
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 207, in write_to
        res = self._get(length)
      File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 432, in _get
        self._con.request("GET", self.url.path, headers=headers)
      File "/usr/lib64/python3.6/http/client.py", line 1273, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/usr/lib64/python3.6/http/client.py", line 1319, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/usr/lib64/python3.6/http/client.py", line 1268, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/usr/lib64/python3.6/http/client.py", line 1044, in _send_output
        self.send(msg)
      File "/usr/lib64/python3.6/http/client.py", line 1004, in send
        self.sock.sendall(data)
    BrokenPipeError: [Errno 32] Broken pipe

Looking at the server logs, we can see that the EXTENTS request took about
24 seconds:

    2022-05-24 16:22:57,548 INFO    (Thread-75) [extents] [local]
    EXTENTS transfer=39f14719-2533-45d5-8315-9d0b577d5732 context=zero
    ...
    2022-05-24 16:23:44,907 INFO    (Thread-75) [http] CLOSE
    connection=75 client=local [connection 1 ops, 47.359750 s] [dispatch
    2 ops, 47.297816 s] [extents 2 ops, 47.296568 s]

Downloading an 8 TiB disk is an edge case, but this can happen with smaller
images on a very fragmented file system, or if there is another reason
that causes the EXTENTS request to be slow.

Revert the timeout back to the previous value used in oVirt 4.4.

We may shorten the timeout once we support partial extents:
https://bugzilla.redhat.com/1924940

Fixes oVirt#71

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
@nirs nirs added this to the ovirt-4.5.1 milestone May 24, 2022
nirs added a commit to nirs/ovirt-imageio that referenced this issue May 25, 2022
@nirs nirs closed this as completed in #74 May 25, 2022
nirs added a commit that referenced this issue May 25, 2022
@nirs nirs added and then removed the needs testing (Candidate for downstream testing) label May 25, 2022