Race condition when creating machines? #402
I suspect it is a race condition because this main.tf (with the workers serialized after the masters via `depends_on`) completes successfully:

```hcl
provider "libvirt" {
  uri = "qemu:///system"
}

resource "libvirt_network" "tectonic_net" {
  name      = "tectonic"
  mode      = "nat"
  bridge    = "tt0"
  domain    = "tt.testing"
  addresses = ["192.168.124.0/24"]

  dns = [{
    local_only = true
  }]

  autostart = true
}

locals {
  master_ips = ["192.168.124.11", "192.168.124.12"]
  worker_ips = ["192.168.124.51", "192.168.124.52"]
}

resource "libvirt_domain" "master" {
  count  = "2"
  name   = "master${count.index}"
  memory = "2048"
  vcpu   = "2"

  console {
    type        = "pty"
    target_port = 0
  }

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-master-${count.index}"
    addresses  = ["${local.master_ips[count.index]}"]
  }
}

resource "libvirt_domain" "worker" {
  count  = "2"
  name   = "worker${count.index}"
  memory = "1024"
  vcpu   = "2"

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-worker-${count.index}"
    addresses  = ["${local.worker_ips[count.index]}"]
  }

  depends_on = ["libvirt_domain.master"]
}
```
It's surprising that serializing the domains helps with what seems (from the error message) to be a domain/network race. Does adding …
Ah, the error message is just busted. Your full log shows the underlying "Hash operation not allowed during iteration" failure.
Hi @wking @abhinavdahiya, thanks for reporting.
@wking, for libvirt logs see https://wiki.libvirt.org/page/DebugLogs (it could be interesting if you find something in the libvirt logs). Just looking at your logs, to me this looks like a race condition, namely that the network is created but not yet visible to the domains, which therefore don't see it. But I'm curious to see the Terraform debug logs too, thanks in advance.
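For reference, a minimal sketch of how to capture both sets of logs (the filter strings and file paths here are only illustrative defaults; the DebugLogs wiki page above covers the full options):

```sh
# Enable libvirt daemon debug logging (per https://wiki.libvirt.org/page/DebugLogs):
# add these two lines to /etc/libvirt/libvirtd.conf, then restart the daemon.
#   log_filters="1:libvirt 1:qemu 1:util"
#   log_outputs="1:file:/var/log/libvirt/libvirtd.log"
sudo systemctl restart libvirtd

# Capture Terraform's own debug output to a file as well.
TF_LOG=DEBUG TF_LOG_PATH=./terraform-debug.log terraform apply
```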
I'm able to reproduce it in about 9/10 cases. Running this on: …
Here is the libvirt bug for "Hash operation not allowed during iteration": rhbz#1576464.
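If you want to check that you are hitting the same failure, the message should show up in the daemon log (assuming the debug logging set up above, or the systemd journal on a systemd host):

```sh
# Search the libvirt daemon log and the journal for the hash-iteration error.
grep -i 'hash operation not allowed during iteration' /var/log/libvirt/libvirtd.log
journalctl -u libvirtd | grep -i 'hash operation not allowed'
```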
@wking @steveej thank you for the info. @wking yeah, this is a really well-known and annoying bug on the libvirt side. I have also encountered it many times, and we are also impacted by it in other projects... 🎶
@abhinavdahiya I don't think that reverting on your project would help in that case (from my pov). The actual solution is to upgrade the libvirt pkg to one with the patch (this is the solution, and we will also upgrade our systems as we have the patch). (We might also add a note in a future version to recommend a libvirt version which contains the fix for this bug.)
Afaik there are workarounds, both on the user side and in the codebase.
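For illustration, one user-side mitigation in this spirit (just an option that works with stock Terraform, not necessarily what was recommended here) is to stop Terraform from creating the domains concurrently at all:

```sh
# Limit Terraform to a single concurrent resource operation, so the libvirt
# domains are created one at a time instead of racing each other in libvirtd.
terraform apply -parallelism=1
```

This trades apply speed for avoiding the concurrent hash-table access in affected libvirt versions.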
I'm running RHEL's libvirt 3.9.0-14.el7_5.6, which has a patch:

```console
$ rpm -q --changelog libvirt | head -n3
* Tue Jun 05 2018 Jiri Denemark <jdenemar@redhat.com> - 3.9.0-14.el7_5.6
- logging: Don't inhibit shutdown in system daemon (rhbz#1573268)
- util: don't check for parallel iteration in hash-related functions (rhbz#1581364)
```

rhbz#1581364 is a backport to RHEL 7.5 of the rhbz#1576464 I linked earlier. The upstream commit fixing this (linked from rhbz#1576464) is 4d7384eb, which punts serialization up to the callers. I'm not sure if the issue we're seeing here is because:

a. The old check (before libvirt/libvirt@4d7384eb) was overly strict.

(a) and (b) would be addressable by patching libvirt. (c) might be an issue with this repository. @MalloZup, does that make sense? Do you know which case applies?
We are seeing race-like symptoms when creating multiple domain sets in parallel. For more info: dmacvicar/terraform-provider-libvirt#402. The issue suggests that it's most probably a libvirt bug that we can avoid by serializing the master and worker domain sets.
@wking thank you for your precise comment. From my pov it is great news that you cannot reproduce with the latest libvirt (#402 (comment)). I plan to update my setup to the next libvirt-devel version containing that patch, so I can verify that we don't have other issues.
I would 98.99% (the 0.99 is for com purposes 😄) exclude the (c) hypothesis, i.e. locking problems and race conditions caused by this repository: the mutex mechanism is working well at the pool level when we create volumes, and we always lock before refreshing pools.
So to me, once the libvirt version is high enough the problem should disappear (I will test this this week). For any info/question feel free to ping me, and thanks for your collaboration and info 👍
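As a quick sanity check before and after upgrading, these commands (assuming a standard libvirt install) report the daemon and library versions in use:

```sh
# Report the libvirt daemon and client/library versions.
libvirtd --version
virsh version
```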
According to Bug 1581364, which you linked as the fix, it says … which reads to me as case (a), and I'm not sure if (b) is also the case, because the call chain isn't obvious to me yet. I'm wondering if there's anything we can do to diminish the problem on the client side.
openshift/installer#226 is a client-side workaround (using the `depends_on` serialization of the worker domains after the masters, as in the main.tf above).
Closing since this is cleared. Thanks to all contributors and Linux lovers, no matter which distro, for the info sharing. See you in the next issue or PR :)
Version Reports:
Distro version of host: Fedora 28
Terraform Version Report: Terraform v0.11.8
Libvirt version: 4.1.0
terraform-provider-libvirt plugin version (git-hash): 6c9b294
Description of Issue/Question
Setup
Steps to Reproduce Issue
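A minimal reproduction, assuming the main.tf shown above, is simply:

```sh
# Initialize the provider plugins and apply the configuration above.
terraform init
terraform apply
```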
Terraform should have created the `tectonic` network, 2 master machines, and 2 worker machines. But terraform exits with an error:
Complete output