
Add HA autotests #427

Closed
kinvaris opened this issue Feb 6, 2017 · 4 comments


kinvaris commented Feb 6, 2017

The test will require the parent hypervisor information for the first tests; these can later be modified to kill certain processes instead.
Flow (a code sketch follows the list):

  • Find a vpool with sync DTL
  • Get two storagerouters
  • Determine which one gets shot and which one survives (requires a 3-node setup)
  • Create a VM on a compute node (a third node which is not involved)
  • Start writing (to the filesystem on the VM)
  • Shoot the storagerouter
  • Validate the edge connection, the IOPS downtime and the IO errors
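
A rough sketch of how this flow could be automated; this is an assumption-heavy outline, not existing CI code, and every helper callable (get_vpool, get_storagerouters, start_fio, force_off_node, validate_edge, validate_io) is hypothetical:

import time

def run_ha_test(get_vpool, get_storagerouters, start_fio, force_off_node,
                validate_edge, validate_io, settle_time=300):
    """Shoot one storagerouter while IO is running and validate the failover."""
    vpool = get_vpool(dtl_mode='sync')                 # find a vpool with sync DTL
    survivor, victim = get_storagerouters(vpool)[:2]   # requires a 3-node setup
    io_handle = start_fio(target=victim)               # IO comes from a third, uninvolved compute node
    time.sleep(30)                                     # let the IO stabilise before the kill
    started = time.time()
    force_off_node(victim)                             # shoot the storagerouter
    validate_edge(vpool, expected_owner=survivor, timeout=settle_time)
    downtime, io_errors = validate_io(io_handle)
    assert io_errors == 0, 'IO errors detected during failover'
    print('Failover took ~{0:.0f}s, IO downtime {1:.0f}s'.format(time.time() - started, downtime))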
wimpers added this to the Fargo RC4 milestone Feb 9, 2017

wimpers commented Feb 9, 2017

@JeffreyDevloo can you please update on status?

@pploegaert

Jeffrey's update:
Results and experiences

  • Virtual environments with bridged network interfaces do not close their connections on a forced power-off
  • A BS of 1 MB (while the vpool uses 4 KB; qd=64) gave 'func=xfer, error=Device or resource busy' - for further information please check with cnanakos
  • Constantly creating and deleting a volume with the same name can give MDS errors: the slave was not properly configured at the moment the volume owner was shut down

My virtual environment:

  • All of my test results were obtained using fio instead of a VM, because VM creation was incredibly slow.
  • HA kicked in for bs=4k with fio read/write settings (100,0)
  • Average fio downtime: 135 seconds for 1 volume

export LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_TIME=60; export LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_INTVL=20; export LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_PROBES=3; \
/tmp/fio.bin.latest --iodepth=32 --rw=readwrite --bs=4k --direct=1 --rwmixread=100 --rwmixwrite=0 \
    --ioengine=openvstorage --hostname=10.100.69.121 --port=26203 --protocol=tcp --enable_ha=1 \
    --group_reporting=1 --name=test --volumename=ci_scenario_hypervisor_ha_test_vdisk01 \
    --name=test2 --volumename=cnanakos3; exec bash


JeffreyDevloo commented Feb 22, 2017

Problems along the way

Use of a not-so-powerful environment

Wait times were a lot longer. I initially tested with virtual machines, but their setup took way too much time, so I switched to fio to get quick results.

Not killing the volumedriver process but killing the whole node

The connection remains open when the node is shot, because the host can no longer respond at all (if it were still alive but simply no longer listening on the port, the kernel would tear the connection down itself). This initially gave some issues, to the point where we had to adjust the keepalive/timeout settings.
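
For illustration only: the LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_* / NETWORK_XIO_KEEPALIVE_* values used below map onto the standard Linux TCP keepalive socket options. A minimal Python sketch of the mechanism (not the volumedriver's actual implementation):

import socket

def keepalive_socket(host, port, idle=60, intvl=20, probes=3):
    """Open a TCP connection that notices a silently dead peer via keepalives.

    After `idle` seconds without traffic, up to `probes` probes are sent `intvl`
    seconds apart; if none are answered, the kernel aborts the connection.
    """
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options are Linux-specific.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, intvl)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock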

Being able to manage the parent hypervisor of my environment

The KVM SDK that I wrote for creating and managing VMs cannot take a password without a serious configuration change on the hypervisor.
Instead I opted for a key-based approach, and this should also be used if the test is added to the rest of the test suite.
Example:
Generate a new SSH key and spread it across all VMs of the cluster. Add the public key to the authorized keys on the parent hypervisor. (A sketch of using this access to power off a node follows the config below.)
Add the following config under .ssh/config on the VMs:

Host 10.100.69.222  # <- hypervisor IP
    HostName 10.100.69.222 # <- hypervisor IP
    IdentityFile ~/.ssh/id_rsa_hypervisor
    User root
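
A minimal sketch of using this key-based access to hard power-off a storagerouter VM from the parent hypervisor; force_off_vm, the VM name and the key path are assumptions, while 'virsh destroy' is the libvirt equivalent of pulling the plug:

import os
import subprocess

def force_off_vm(hypervisor_ip, vm_name, keyfile='~/.ssh/id_rsa_hypervisor'):
    """Hard power-off a nested storagerouter VM via the parent KVM hypervisor."""
    subprocess.check_call([
        'ssh', '-i', os.path.expanduser(keyfile),
        '-o', 'StrictHostKeyChecking=no',
        'root@{0}'.format(hypervisor_ip),
        'virsh destroy {0}'.format(vm_name),   # force off, no graceful shutdown
    ])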

Threading in Python

I am no expert in threading and thread management. Working with threads to perform different tasks, such as monitoring, was a challenge; controlling the flow of those threads was even more of one, as I needed to coordinate them around reading my shared resource.
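
A minimal pattern for this kind of coordination, assuming a background monitor thread that samples a shared list guarded by a lock and is stopped via an event; this is an illustrative sketch, not the actual test code:

import threading

class IoMonitor(object):
    """Background monitor that periodically samples IO state while fio runs."""

    def __init__(self, sample_fn, interval=1.0):
        self._sample_fn = sample_fn            # e.g. reads fio output or vdisk state
        self._interval = interval
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self.samples = []                      # the shared resource
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True

    def _run(self):
        while not self._stop.is_set():
            sample = self._sample_fn()
            with self._lock:                   # guard the shared resource
                self.samples.append(sample)
            self._stop.wait(self._interval)    # sleep, but wake up early on stop()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        with self._lock:
            return list(self.samples)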

Test cases

Used settings:

The settings below are now the default:

LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_TIME=60
LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_INTVL=20
LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_PROBES=3

NETWORK_XIO_KEEPALIVE_TIME=60
NETWORK_XIO_KEEPALIVE_INTVL=20
NETWORK_XIO_KEEPALIVE_PROBES=3

Fio cmd:

screen -S fio -dm bash -c 'while /tmp/fio.bin.latest --name=test --iodepth=32 --rw=readwrite --bs=4k --direct=1 --rwmixread=100 --rwmixwrite=0 --ioengine=openvstorage --hostname=10.100.69.121 --port=26203 --protocol=tcp --volumename=ci_scenario_hypervisor_ha_test_vdisk12 --enable_ha=1 --verify=crc32c-intel --verifysort=1 --verify_fatal=1 --verify_backlog=1000000; do :; done; exec bash'

Fail over 1 volume

  • Volume failed over in 125-135 seconds
  • Moved to the volumedriver that the edge connected from
  • No validation errors

Fail over 25 volumes

  • Volumes failed over in 135-145 seconds
  • Moved to the volumedriver that the edge connected from
  • No validation errors

Fail over 50 volumes

  • Volumes failed over in 145-155 seconds
  • Moved to the volumedriver that the edge connected from
  • No validation errors

Fail over 100 volumes

  • Had to increase the CPU power of my compute node to avoid throttling (4 -> 8 CPUs)
  • Lowered the iodepth to 1
  • Got hit by
[ERROR] - timerfd_create failed. Too many open files
[2017/02/21-16:22:09.164037] xio_context.c:160 [ERROR] - context's workqueue create failed. Too many open files

and only 55-56 volumes failed over. The others reported IO errors (expected, since all connections were aborted due to the FD limit)

  • Split fio up into 2 fio processes with 50 volumes each, since 50 volumes had managed to fail over
    Same issue: 86 volumes managed to fail over
  • Increased the FD limit of my sessions with 'ulimit -n 4096;' before starting the screen (a Python equivalent is sketched after this list)
  • All 100 volumes failed over
  • Volumes failed over in 220s
  • No validation errors
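
If the test harness spawns fio itself, the same FD-limit fix can also be applied programmatically instead of via the shell; a sketch using Python's resource module (ensure_fd_limit is a hypothetical helper name):

import resource

def ensure_fd_limit(minimum=4096):
    """Raise the soft RLIMIT_NOFILE to `minimum`, the equivalent of 'ulimit -n 4096'.

    The soft limit can only be raised up to the hard limit without extra privileges.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < minimum:
        resource.setrlimit(resource.RLIMIT_NOFILE, (min(minimum, hard), hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)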

Increasing the BS to 1 MB as it previously showed errors

fio cmd:

screen -S fio -dm bash -c 'while /tmp/fio.bin.latest --name=test --iodepth=4 --rw=readwrite --bs=1m --direct=1 --rwmixread=100 --rwmixwrite=0 --ioengine=openvstorage --hostname=10.100.69.121 --port=26203 --protocol=tcp --volumename=ci_scenario_hypervisor_ha_test_vdisk12 --enable_ha=1 --verify=crc32c-intel --verifysort=1 --verify_fatal=1 --verify_backlog=1000000; do :; done; exec bash'

Changes in cmd:

  • bs = 1m
  • iodepth = 4

Fail over 1 volume

  • All volumes migrated in 123s
  • No validation errors

Fail over 25 volumes

  • All volumes migrated in 160s
  • No validation errors

Fail over 50 volumes

  • All volumes migrated in 180s
  • No validation errors

Fail over 100 volumes

  • Used the increased FD limit
  • Split up into two fio processes
  • All 100 volumes failed over
  • Volumes failed over in 224s
  • No validation errors

@JeffreyDevloo

Fixed by #442
