Race condition when creating machines? #402
I suspect it is a race condition because this main.tf (with the workers serialized after the masters via `depends_on`) completes successfully:

```hcl
provider "libvirt" {
  uri = "qemu:///system"
}

resource "libvirt_network" "tectonic_net" {
  name      = "tectonic"
  mode      = "nat"
  bridge    = "tt0"
  domain    = "tt.testing"
  addresses = ["192.168.124.0/24"]

  dns = [{
    local_only = true
  }]

  autostart = true
}

locals {
  master_ips = ["192.168.124.11", "192.168.124.12"]
  worker_ips = ["192.168.124.51", "192.168.124.52"]
}

resource "libvirt_domain" "master" {
  count  = "2"
  name   = "master${count.index}"
  memory = "2048"
  vcpu   = "2"

  console {
    type        = "pty"
    target_port = 0
  }

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-master-${count.index}"
    addresses  = ["${local.master_ips[count.index]}"]
  }
}

resource "libvirt_domain" "worker" {
  count  = "2"
  name   = "worker${count.index}"
  memory = "1024"
  vcpu   = "2"

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-worker-${count.index}"
    addresses  = ["${local.worker_ips[count.index]}"]
  }

  depends_on = ["libvirt_domain.master"]
}
```
It's surprising that serializing the domains helps with what seems (from the error message) to be a domain/network race. Does adding …
Ah, the error message is just busted. Your full log shows the underlying "Hash operation not allowed during iteration" failure.
Hi @wking @abhinavdahiya, thanks for reporting.
@wking, for libvirt logs see https://wiki.libvirt.org/page/DebugLogs (it could be interesting if you find something in the libvirt logs). Just looking at your logs, to me this looks like a race condition, namely that the network is created but not yet visible to the domains, which therefore don't see it. But I'm curious to see the Terraform debug logs too, thanks in advance.
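For reference, a minimal sketch of how to capture both sets of logs (the filter strings and file paths here are only illustrative defaults; the DebugLogs wiki page above covers the full options):

```sh
# Enable libvirt daemon debug logging (per https://wiki.libvirt.org/page/DebugLogs):
# add these two lines to /etc/libvirt/libvirtd.conf, then restart the daemon.
#   log_filters="1:libvirt 1:qemu 1:util"
#   log_outputs="1:file:/var/log/libvirt/libvirtd.log"
sudo systemctl restart libvirtd

# Capture Terraform's own debug output to a file as well.
TF_LOG=DEBUG TF_LOG_PATH=./terraform-debug.log terraform apply
```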
I'm able to reproduce it in about 9/10 cases. Running this on: …
Here is the libvirt bug for "Hash operation not allowed during iteration": rhbz#1576464.
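If you want to check that you are hitting the same failure, the message should show up in the daemon log (assuming the debug logging set up above, or the systemd journal on a systemd host):

```sh
# Search the libvirt daemon log and the journal for the hash-iteration error.
grep -i 'hash operation not allowed during iteration' /var/log/libvirt/libvirtd.log
journalctl -u libvirtd | grep -i 'hash operation not allowed'
```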
@wking @steveej thank you for the info. @wking yeah, this is a really well-known and annoying bug on the libvirt side. I have also encountered it many times, and we are also impacted by it in other projects... 🎶
@abhinavdahiya I don't think that reverting on your project would help in that case (from my pov). The actual solution is to upgrade the libvirt pkg to one with the patch (this is the solution, and we will also upgrade our systems as we have the patch). (We might also add a note in a future version to recommend a libvirt version which contains the fix for this bug.)
Afaik there are workarounds, both on the user side and in the codebase.
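For illustration, one user-side mitigation in this spirit (just an option that works with stock Terraform, not necessarily what was recommended here) is to stop Terraform from creating the domains concurrently at all:

```sh
# Limit Terraform to a single concurrent resource operation, so the libvirt
# domains are created one at a time instead of racing each other in libvirtd.
terraform apply -parallelism=1
```

This trades apply speed for avoiding the concurrent hash-table access in affected libvirt versions.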
I'm running RHEL's libvirt 3.9.0-14.el7_5.6, which has a patch:

```console
$ rpm -q --changelog libvirt | head -n3
* Tue Jun 05 2018 Jiri Denemark <jdenemar@redhat.com> - 3.9.0-14.el7_5.6
- logging: Don't inhibit shutdown in system daemon (rhbz#1573268)
- util: don't check for parallel iteration in hash-related functions (rhbz#1581364)
```

rhbz#1581364 is a backport to RHEL 7.5 of the rhbz#1576464 I linked earlier. The upstream commit fixing this (linked from rhbz#1576464) is 4d7384eb, which punts serialization up to the callers. I'm not sure if the issue we're seeing here is because:

a. The old check (before libvirt/libvirt@4d7384eb) was overly strict.

(a) and (b) would be addressable by patching libvirt. (c) might be an issue with this repository. @MalloZup, does that make sense? Do you know which case applies?
We are seeing race-like symptoms when creating multiple domain sets in parallel. For more info: dmacvicar/terraform-provider-libvirt#402. The issue suggests that it's most probably a libvirt bug that we can avoid by serializing the master and worker domain sets.
@wking thank you for your precise comment. From my pov it is great news that you cannot reproduce with the latest libvirt (#402 (comment)). I plan to update my setup to the next libvirt-devel version containing that patch, so I can verify that we don't have other issues.
I would 98.99% (the 0.99 is for com purposes 😄) exclude the (c) hypothesis, i.e. locking problems and race conditions caused by this repository: the mutex mechanism is working well at the pool level when we create volumes, and we always lock before refreshing pools.
So to me, once the libvirt version is high enough the problem should disappear (I will test this this week). For any info/question feel free to ping me, and thanks for your collaboration and info 👍
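As a quick sanity check before and after upgrading, these commands (assuming a standard libvirt install) report the daemon and library versions in use:

```sh
# Report the libvirt daemon and client/library versions.
libvirtd --version
virsh version
```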
According to Bug 1581364, which you linked as the fix, it says … which reads to me as case (a), and I'm not sure if (b) is also the case, because the call chain isn't obvious to me yet. I'm wondering if there's anything we can do to diminish the problem on the client side.
openshift/installer#226 is a client-side workaround (using the `depends_on` serialization of the worker domains after the masters, as in the main.tf above).
Closing since this is cleared. Thanks to all contributors and Linux lovers, no matter which distro, for the info sharing. See you in the next issue or PR :)
Version Reports:
Distro version of host: Fedora 28
Terraform Version Report: Terraform v0.11.8
Libvirt version: 4.1.0
terraform-provider-libvirt plugin version (git-hash): 6c9b294
Description of Issue/Question
Setup
Steps to Reproduce Issue
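A minimal reproduction, assuming the main.tf shown above, is simply:

```sh
# Initialize the provider plugins and apply the configuration above.
terraform init
terraform apply
```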
Terraform should have created the `tectonic` network, 2 master machines, and 2 worker machines. But terraform exits with an error:
Complete output