
Running multiple kata-container instances on one host fails on Arm64 #843

Closed
Weichen81 opened this issue Oct 19, 2018 · 14 comments
Labels
bug Incorrect behaviour

Comments

@Weichen81
Contributor

Description of problem

Arm64 uses a block device as the rootfs in the guest. If we run two or more kata-container instances
on one host, we get the following error:

root@entos-thunderx2-desktop:~# docker run -dt --runtime kata-runtime ubuntu
7baf2c0f26100b0e642ad4122ce9ea1cb556fb3ed6cfcb454cb0586cbbe6194d
root@entos-thunderx2-desktop:~# docker run -dt --runtime kata-runtime ubuntu
2065fcf560c136091b5e69b6a12c95de94308309bb7fd5309df852503752203a
docker: Error response from daemon: OCI runtime create failed: qemu-system-aarch64: -device virtio-blk,drive=image-9f100592ac95eec6,scsi=off,config-wce=off: Failed to get "write" lock
Is another process using the image?: unknown.

This is because all kata-container instances share the same rootfs image, and every instance
wants to open the file with RW permission. Since we use the RAW format for the rootfs image, the
file is locked by the first instance, and every later instance fails to acquire the write lock.
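
For illustration, the QEMU image lock can be reproduced without Kata at all; a minimal sketch, assuming a local raw image (the machine flags and the rootfs.img path are illustrative):

# Since QEMU 2.10, QEMU takes a write lock on an image file it opens
# read-write, so a second instance opening the same raw image fails:
qemu-system-aarch64 -machine virt -cpu host -enable-kvm -nographic \
    -drive file=rootfs.img,format=raw,if=virtio &
qemu-system-aarch64 -machine virt -cpu host -enable-kvm -nographic \
    -drive file=rootfs.img,format=raw,if=virtio
# => Failed to get "write" lock
#    Is another process using the image?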

We tried changing the RAW format to QCOW or QCOW2. With the copy-on-write
feature, we can run two or three instances at the same time. But as the number of instances increases,
instance creation becomes slower and slower. I think this is caused by QCOW/QCOW2 itself; QCOW/QCOW2 has not been used massively in the cloud, where most platforms use network block devices for virtual machines.
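
For reference, the copy-on-write setup we tried looks roughly like this; a sketch, with illustrative file names:

# One shared, read-only backing file plus a small per-instance overlay;
# writes go to the overlay, so no instance needs the write lock on rootfs.img.
# (-F names the backing format; older qemu-img takes -o backing_fmt=raw instead.)
qemu-img create -f qcow2 -b rootfs.img -F raw instance1.qcow2
qemu-img create -f qcow2 -b rootfs.img -F raw instance2.qcow2
# Each guest then opens only its own overlay:
#   -drive file=instance1.qcow2,format=qcow2,if=virtio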

My question is:
can we use NBD for kata-container instances to bypass this issue?
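
For concreteness, an NBD-based setup could look like the sketch below (hypothetical; Kata does not implement this today, and the socket path is illustrative). Note that a shared export would still need a per-instance writable layer somewhere:

# Export the rootfs image over a unix socket with qemu-nbd:
qemu-nbd --read-only --persistent --socket=/run/kata-rootfs.sock rootfs.img &
# Each guest attaches to the export instead of opening the file directly:
#   -drive file=nbd:unix:/run/kata-rootfs.sock,format=raw,readonly=on,if=virtio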

I know x86_64 uses persistent memory (nvdimm) for the rootfs, so it doesn't have a similar issue. @jodh-intel @gnawux @Pennyzct



@egernst added the bug Incorrect behaviour label Nov 1, 2018
@Weichen81
Contributor Author

@Pennyzct and I have done some investigation of this issue. We have considered three methods:

  1. Do some work to improve QCOW/QCOW2. --> This needs a lot of time to fix and verify, and we have some legal issues doing it.
  2. Use NBD for kata-containers. --> But it seems we would have to modify the OCI specification.
  3. Use the same method as x86_64 --> enable nvdimm support for Arm64.

Both of us thought that if method #3 works, it would be the best choice, so we tried it first.
Luckily, after some effort, we got nvdimm working on Arm64: we can run multiple kata-containers
instances on one host without hitting the QCOW/QCOW2 issue.

Here are what we have done:

  1. Upgrade the host and guest kernels to 4.20-rc3 and apply Suzuki's patch series:
    https://patchwork.kernel.org/patch/10531723/
  2. Apply Eric Auger's NVDIMM patches for QEMU:
    https://patchwork.kernel.org/cover/10647305/
  3. Enable CONFIG_ACPI_NFIT in the guest kernel.
  4. Use the same QEMU NVDIMM parameters as x86_64 (see the sketch after this list).
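
For reference, the setup we mean looks roughly like this; a sketch under the patched QEMU, with illustrative sizes and paths (the real command line is built by the runtime):

# Guest kernel config for the ACPI NFIT / nvdimm path:
#   CONFIG_ACPI_NFIT=y
#   CONFIG_LIBNVDIMM=y
# QEMU side: back the rootfs image with a file-backed nvdimm device, as on x86_64:
qemu-system-aarch64 \
    -machine virt,nvdimm=on \
    -m 2048M,slots=2,maxmem=4G \
    -object memory-backend-file,id=mem0,share=on,mem-path=kata-containers.img,size=128M \
    -device nvdimm,id=nv0,memdev=mem0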

Yes, after doing the above work, we can run Kata Containers with NVDIMM on Arm64. But our concerns
are:

  1. We have to upgrade the host and guest kernels to 4.20-rc3. Currently we have pinned the host kernel
    versions, and the guest kernel is 4.14.x.
  2. We have to maintain these patches ourselves until they are merged upstream.
  3. Arm64 and x86_64 may use different guest kernel versions.

WDYT @jodh-intel @gnawux @grahamwhaley @egernst @bergwolf

@gnawux
Member

gnawux commented Nov 23, 2018

Firstly, glad to hear the nvdimm method works, and looking forward to it.

Secondly, @egernst talked with me about the kernel version things, and we both want to upgrade the guest kernel to 4.19 at least.

Personally, I think different versions for different architectures should not be a blocker.

What do others think?

@grahamwhaley
Contributor

Nice work!
Agreed, different versions on different arches would not be unexpected, I think. Generally we try to keep up with the latest 'longterm' kernel, so we get fixes and backports but don't churn too much on 'stable' or 'head' kernels. But if an arch needs a feature that means it has to live on 'stable' for now, then so be it.

@Weichen81
Contributor Author

@gnawux @grahamwhaley @egernst @jodh-intel
I have updated the Linux kernel of the Arm CI server to v4.20-rc4. @Pennyzct will test the related patches (kata, QEMU and guest kernel) on the CI server. If the tests pass, we will start sending PRs. So could you please stop running PR CI on the Arm server until @Pennyzct finishes her tests?

@jodh-intel
Contributor

/cc @chavafg.

@grahamwhaley
Contributor

@Weichen81 @chavafg @Pennyzct - OK, what I've done is change the label on the arm slave node from arm_node to arm_node_XXX. That should stop any of the jobs matching that as a build node, so should not schedule any builds on the slave.
Once you are done with your updates and have verified them, we (probably myself or @chavafg) will remove the _XXX from the label, and the backlog of jobs should start flowing again.
(note to @chavafg - I've not marked the node as 'offline' - just changing the labels has worked well for me in the past when working on the metrics nodes - I guess if there were more folks tinkering then updating the offline status with a full comment of how and why would be more appropriate ;-) )

@chavafg
Contributor

chavafg commented Nov 28, 2018

thanks @grahamwhaley

@Pennyzct
Contributor

Pennyzct commented Dec 3, 2018

Hi all @gnawux @grahamwhaley @jodh-intel @chavafg, I have done all the nvdimm-related tests on the ARM CI, and the host kernel of the ARM CI has now been updated to 4.20-rc4 as requested.

root@testing-1:~# uname -a
Linux testing-1 4.20.0-rc4 #1 SMP Wed Nov 28 14:44:27 CST 2018 aarch64 aarch64 aarch64 GNU/Linux

So could anyone help me bring the ARM CI back online? After that, I can submit a bundle of pull requests to make aarch64 nvdimm-supported in kata-runtime. 😊

@grahamwhaley
Contributor

Sure @Pennyzct - I'll bring the ARM CI slave back online, and we'll see how it goes :-)
You should be able to monitor how the builds are going at http://jenkins.katacontainers.io/computer/arm01_slave/builds
There look to be 5 pending jobs or so, which should start being processed.

@Pennyzct
Contributor

Pennyzct commented Dec 3, 2018

@grahamwhaley thanks.😝

@amshinde
Member

amshinde commented Dec 7, 2018

@Weichen81 QEMU has a share-rw flag that you can pass to virtio-blk to allow an image to be shared among several VMs; did you take a look at that?
I added a similar flag for passing block devices on x86_64:
70edc56
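
For context, on the QEMU command line that flag looks roughly like this (drive id illustrative):

# share-rw=on tells QEMU not to demand exclusive write access for this
# device, so several guests can open the same image read-write:
-device virtio-blk,drive=image0,share-rw=on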

@Weichen81
Contributor Author

@amshinde We tried the share-rw flag before, but when we ran more than three kata-containers, startup became slower and slower. That is why we wanted to try nvdimm for Arm64.

@Pennyzct
Contributor

Hi~ @amshinde thanks for the proposal~
I was reading the related share-rw docs in QEMU. They say:
If the guest can safely share the disk image with other writers the @code{-device ...,share-rw=on} parameter can be used. This is only safe if the guest is running software, such as a cluster file system, that coordinates disk accesses to avoid corruption.
FWIW, this option doesn't provide extra write protection between multiple VMs, and Kata doesn't provide disk access coordination for the rootfs, so I think it is risky to use this solution. ;)

@Weichen81
Contributor Author

@Pennyzct Yes, I had forgotten that: in my tests, if I did writes in one instance, the other instances would get ext4-fs errors, because there is no software to tell the other instances to refresh their file system caches from disk.

Pennyzct added a commit to Pennyzct/runtime that referenced this issue Mar 5, 2019
Original guest image was represented as a block device in qemu-aarch64,
and it brings up a write lock error when running multiple containers.
Thanks to the new expanded IPA_SIZE feature in kernel 4.20 and
Eric Auger's related patch set in qemu (which are still under upstream
review), we can fully support nvdimm on arm64.

Fixes: kata-containers#843

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this issue Mar 5, 2019
Since we overrode the func appendImage for aarch64, we should also
provide a related unit test.

Fixes: kata-containers#843

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this issue Mar 5, 2019
dax is not fully supported on arm64, so we disable dax for now.

Fixes: kata-containers#843

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this issue Mar 6, 2019
Original guest image was represented as a block device in qemu-aarch64,
and it brings up a write lock error when running multiple containers.
Thanks to the new expanded IPA_SIZE feature in kernel 4.20 and
Eric Auger's related patch set in qemu (which are still under upstream
review), we can fully support nvdimm on arm64.

Fixes: kata-containers#843

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this issue Mar 6, 2019
Since we overrode the func appendImage for aarch64, we should also
provide a related unit test.

Fixes: kata-containers#843

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this issue Mar 7, 2019
Since we overrode the func appendImage for aarch64, we should also
provide a related unit test.

Depends-on: github.com/kata-containers/packaging#377

Fixes: kata-containers#843

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Pennyzct added a commit to Pennyzct/runtime that referenced this issue Mar 7, 2019
Since we overrode the func appendImage for aarch64, we should also
provide a related unit test.

Fixes: kata-containers#843

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
egernst pushed a commit to egernst/runtime that referenced this issue Feb 9, 2021
Improve the PR porting GitHub action by referencing a central script to
handle the checks rather than hard-coding them in the workflow YAML.

This ensures all PRs use the latest porting policy encoded in the
script and makes maintenance easier.

Related: kata-containers/kata-containers#634

Fixes: kata-containers#843.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>