
Add HA autotests #427

Closed
kinvaris opened this issue Feb 6, 2017 · 4 comments


kinvaris commented Feb 6, 2017

The test will require the parent hypervisor information for the first tests; these can later be modified to kill certain processes instead.
Flow (a code sketch follows the list):

  • Find a vpool with sync DTL
  • Get two storagerouters
  • Determine which one gets shot and which one survives (requires a 3-node setup)
  • Create a VM on a compute node (a third node which is not involved)
  • Start writing (to the filesystem on the VM)
  • Shoot the storagerouter
  • Validate the edge connection, the IOPS downtime and the IO errors
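
A rough sketch of how this flow could be automated; this is an assumption-heavy outline, not existing CI code, and every helper callable (get_vpool, get_storagerouters, start_fio, force_off_node, validate_edge, validate_io) is hypothetical:

import time

def run_ha_test(get_vpool, get_storagerouters, start_fio, force_off_node,
                validate_edge, validate_io, settle_time=300):
    """Shoot one storagerouter while IO is running and validate the failover."""
    vpool = get_vpool(dtl_mode='sync')                 # find a vpool with sync DTL
    survivor, victim = get_storagerouters(vpool)[:2]   # requires a 3-node setup
    io_handle = start_fio(target=victim)               # IO comes from a third, uninvolved compute node
    time.sleep(30)                                     # let the IO stabilise before the kill
    started = time.time()
    force_off_node(victim)                             # shoot the storagerouter
    validate_edge(vpool, expected_owner=survivor, timeout=settle_time)
    downtime, io_errors = validate_io(io_handle)
    assert io_errors == 0, 'IO errors detected during failover'
    print('Failover took ~{0:.0f}s, IO downtime {1:.0f}s'.format(time.time() - started, downtime))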
wimpers added this to the Fargo RC4 milestone Feb 9, 2017

wimpers commented Feb 9, 2017

@JeffreyDevloo can you please update on status?

@pploegaert

Jeffrey's update:
Results and experiences

  • Virtual environments with bridged network interfaces do not close their connections on a forced power-off
  • A BS of 1 MB (while the vpool uses 4 KB; qd=64) gave 'func=xfer, error=Device or resource busy' - for further information please check with cnanakos
  • Constantly creating and deleting a volume with the same name can give MDS errors: the slave was not properly configured at the moment the volume owner was shut down

My virtual environment:

  • All of my test results were obtained using fio instead of a VM, because VM creation was incredibly slow.
  • HA kicked in for bs=4k with fio read/write settings (100,0)
  • Average fio downtime: 135 seconds for 1 volume

export LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_TIME=60; export LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_INTVL=20; export LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_PROBES=3; \
/tmp/fio.bin.latest --iodepth=32 --rw=readwrite --bs=4k --direct=1 --rwmixread=100 --rwmixwrite=0 \
    --ioengine=openvstorage --hostname=10.100.69.121 --port=26203 --protocol=tcp --enable_ha=1 \
    --group_reporting=1 --name=test --volumename=ci_scenario_hypervisor_ha_test_vdisk01 \
    --name=test2 --volumename=cnanakos3; exec bash


JeffreyDevloo commented Feb 22, 2017

Problems along the way

Use of a not-so-powerful environment

Wait times were a lot longer. I initially tested with virtual machines, but their setup took way too much time, so I switched to fio to get quick results.

Not killing the volumedriver process but killing the whole node

The connection remains open when the node is shot, because the host can no longer respond at all (if it were still alive but simply no longer listening on the port, the kernel would tear the connection down itself). This initially gave some issues, to the point where we had to adjust the keepalive/timeout settings.
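
For illustration only: the LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_* / NETWORK_XIO_KEEPALIVE_* values used below map onto the standard Linux TCP keepalive socket options. A minimal Python sketch of the mechanism (not the volumedriver's actual implementation):

import socket

def keepalive_socket(host, port, idle=60, intvl=20, probes=3):
    """Open a TCP connection that notices a silently dead peer via keepalives.

    After `idle` seconds without traffic, up to `probes` probes are sent `intvl`
    seconds apart; if none are answered, the kernel aborts the connection.
    """
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options are Linux-specific.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, intvl)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock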

Being able to manage the parent hypervisor of my environment

The KVM SDK that I wrote for creating and managing VMs cannot take a password without a serious configuration change on the hypervisor.
Instead I opted for a key-based approach, and this should also be used if the test is added to the rest of the test suite.
Example:
Generate a new SSH key and spread it across all VMs of the cluster. Add the public key to the authorized keys on the parent hypervisor. (A sketch of using this access to power off a node follows the config below.)
Add the following config under .ssh/config on the VMs:

Host 10.100.69.222  # <- hypervisor IP
    HostName 10.100.69.222 # <- hypervisor IP
    IdentityFile ~/.ssh/id_rsa_hypervisor
    User root
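
A minimal sketch of using this key-based access to hard power-off a storagerouter VM from the parent hypervisor; force_off_vm, the VM name and the key path are assumptions, while 'virsh destroy' is the libvirt equivalent of pulling the plug:

import os
import subprocess

def force_off_vm(hypervisor_ip, vm_name, keyfile='~/.ssh/id_rsa_hypervisor'):
    """Hard power-off a nested storagerouter VM via the parent KVM hypervisor."""
    subprocess.check_call([
        'ssh', '-i', os.path.expanduser(keyfile),
        '-o', 'StrictHostKeyChecking=no',
        'root@{0}'.format(hypervisor_ip),
        'virsh destroy {0}'.format(vm_name),   # force off, no graceful shutdown
    ])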

Threading in Python

I am no expert in threading and thread management. Working with threads to perform different tasks, such as monitoring, was a challenge; controlling the flow of those threads was even more of one, as I needed to coordinate them around reading my shared resource.
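
A minimal pattern for this kind of coordination, assuming a background monitor thread that samples a shared list guarded by a lock and is stopped via an event; this is an illustrative sketch, not the actual test code:

import threading

class IoMonitor(object):
    """Background monitor that periodically samples IO state while fio runs."""

    def __init__(self, sample_fn, interval=1.0):
        self._sample_fn = sample_fn            # e.g. reads fio output or vdisk state
        self._interval = interval
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self.samples = []                      # the shared resource
        self._thread = threading.Thread(target=self._run)
        self._thread.daemon = True

    def _run(self):
        while not self._stop.is_set():
            sample = self._sample_fn()
            with self._lock:                   # guard the shared resource
                self.samples.append(sample)
            self._stop.wait(self._interval)    # sleep, but wake up early on stop()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        with self._lock:
            return list(self.samples)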

Test cases

Used settings:

The settings below are now the default:

LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_TIME=60
LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_INTVL=20
LIBOVSVOLUMEDRIVER_XIO_KEEPALIVE_PROBES=3

NETWORK_XIO_KEEPALIVE_TIME=60
NETWORK_XIO_KEEPALIVE_INTVL=20
NETWORK_XIO_KEEPALIVE_PROBES=3

Fio cmd:

screen -S fio -dm bash -c 'while /tmp/fio.bin.latest --name=test --iodepth=32 --rw=readwrite --bs=4k --direct=1 --rwmixread=100 --rwmixwrite=0 --ioengine=openvstorage --hostname=10.100.69.121 --port=26203 --protocol=tcp --volumename=ci_scenario_hypervisor_ha_test_vdisk12 --enable_ha=1 --verify=crc32c-intel --verifysort=1 --verify_fatal=1 --verify_backlog=1000000; do :; done; exec bash'

Fail over 1 volume

  • Volume failed over in 125-135 seconds
  • Moved to the volumedriver that the edge connected from
  • No validation errors

Fail over 25 volumes

  • Volumes failed over in 135-145 seconds
  • Moved to the volumedriver that the edge connected from
  • No validation errors

Fail over 50 volumes

  • Volumes failed over in 145-155 seconds
  • Moved to the volumedriver that the edge connected from
  • No validation errors

Fail over 100 volumes

  • Had to increase the CPU power of my compute node to avoid throttling (4 -> 8 CPUs)
  • Lowered the iodepth to 1
  • Got hit by
[ERROR] - timerfd_create failed. Too many open files
[2017/02/21-16:22:09.164037] xio_context.c:160 [ERROR] - context's workqueue create failed. Too many open files

and only 55-56 volumes failed over. The others reported IO errors (expected, since all connections were aborted due to the FD limit)

  • Split fio up into 2 fio processes with 50 volumes each, since 50 volumes had managed to fail over
    Same issue: 86 volumes managed to fail over
  • Increased the FD limit of my sessions with 'ulimit -n 4096;' before starting the screen (a Python equivalent is sketched after this list)
  • All 100 volumes failed over
  • Volumes failed over in 220s
  • No validation errors
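
If the test harness spawns fio itself, the same FD-limit fix can also be applied programmatically instead of via the shell; a sketch using Python's resource module (ensure_fd_limit is a hypothetical helper name):

import resource

def ensure_fd_limit(minimum=4096):
    """Raise the soft RLIMIT_NOFILE to `minimum`, the equivalent of 'ulimit -n 4096'.

    The soft limit can only be raised up to the hard limit without extra privileges.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < minimum:
        resource.setrlimit(resource.RLIMIT_NOFILE, (min(minimum, hard), hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)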

Increasing the BS to 1 MB as it previously showed errors

fio cmd:

screen -S fio -dm bash -c 'while /tmp/fio.bin.latest --name=test --iodepth=4 --rw=readwrite --bs=1m --direct=1 --rwmixread=100 --rwmixwrite=0 --ioengine=openvstorage --hostname=10.100.69.121 --port=26203 --protocol=tcp --volumename=ci_scenario_hypervisor_ha_test_vdisk12 --enable_ha=1 --verify=crc32c-intel --verifysort=1 --verify_fatal=1 --verify_backlog=1000000; do :; done; exec bash'

Changes in cmd:

  • bs = 1m
  • iodepth = 4

Fail over 1 volume

  • All volumes migrated in 123s
  • No validation errors

Fail over 25 volumes

  • All volumes migrated in 160s
  • No validation errors

Fail over 50 volumes

  • All volumes migrated in 180s
  • No validation errors

Fail over 100 volumes

  • Used the increased FD limit
  • Split up into two fio processes
  • All 100 volumes failed over
  • Volumes failed over in 224s
  • No validation errors

@JeffreyDevloo

Fixed by #442
