
Oardocker does not work on the latest docker version anymore #54

Open
xy124 opened this issue Jan 9, 2020 · 6 comments

@xy124

xy124 commented Jan 9, 2020

To reproduce:

$ docker --version
Docker version 18.09.7, build 2d0083d
$ oardocker --version
oardocker, version 1.5.0.dev0
$ oardocker init -f build install http://oar-ftp.imag.fr/oar/2.5/sources/testing/oar-2.5.8.tar.gz start connect frontend

docker@frontend ~
$ oarsub -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=1
Interactive mode: waiting...
Starting...
ERROR: some resources did not respond

docker@frontend ~
$ 

root@server /home/docker
$ grep error /var/log/oar.log 
[TAKTUK OUTPUT] node1-1: perl - init (79): error > sh: 9: cannot create /dev/oar_cgroups_links/blkio//oar/blkio.weight: Permission denied
[TAKTUK OUTPUT] node1-1: perl - init (79): error > /bin/echo: write error: Broken pipe
[TAKTUK OUTPUT] node1-1: perl - init (79): error > [job_resource_manager_cgroups][1][node1][ERROR] Failed to create cgroup /oar
[error] [2020-01-09 14:10:10.373] [bipbip 1] /!\ Some nodes are inaccessible (CPUSET_ERROR):
[info] [2020-01-09 14:10:10.471] [NodeChangeState] Set nodes to suspected after error (CPUSET_ERROR): node1

... in case this is useful: oardocker is using Debian Stretch here.

It is only a guess that this is related to the Docker version; before an update, everything worked fine.
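
For what it is worth, a few checks inside a node container can confirm where the cgroup write fails (a hedged diagnostic, not from the thread itself; it assumes node1 can be reached with oardocker connect the same way as the other containers, and uses the paths from the log above):

$ oardocker connect -l root node1
# Is the blkio controller known to and enabled in this kernel?
grep blkio /proc/cgroups
# Is the controller mounted/linked where OAR expects it?
ls -ld /dev/oar_cgroups_links/blkio
# Does the per-job cgroup directory exist at all?
ls -ld /dev/oar_cgroups_links/blkio/oar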

@augu5te
Contributor

augu5te commented Jan 9, 2020

I think your cgroup hierarchy is in a bad state. The simple approach to clean it is to reboot your machine.
Note: with the same versions of docker and oardocker on NixOS there is no issue.
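
A few host-side checks can show whether the hierarchy really is in a bad state before resorting to a reboot (a hedged sketch; none of these commands are specific to oardocker):

# Which cgroup filesystems are currently mounted, and with which controllers?
mount | grep cgroup
# Which controllers does the kernel know about, and are they enabled?
cat /proc/cgroups
# Any leftover mounts or links from previous runs?
ls /sys/fs/cgroup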

@xy124
Author

xy124 commented Jan 9, 2020

 docker info:
Containers: 126
 Running: 6
 Paused: 0
 Stopped: 120
Images: 473
Server Version: 18.09.7
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 
runc version: N/A
init version: v0.18.0 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 5.0.0-37-generic
Operating System: Ubuntu 18.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.04GiB
Name: narrenkappe
ID: JY2T:IHEG:7RHB:P3WJ:LCNZ:6PU7:ECD4:G7ZJ:FM5X:PZQ4:WO7I:FGO3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

@augu5te
Contributor

augu5te commented Jan 9, 2020

And the output of "docker info" on my laptop is:

Containers: 78
Running: 0
Paused: 0
Stopped: 78
Images: 74
Server Version: 18.09.7
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version:
runc version: N/A
init version: v0.18.0 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.19.80
Operating System: NixOS 19.03.173676.d1dff0bcd9f (Koi)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.44GiB
Name: akira
ID: WJKP:H7FR:UWVC:HAK4:3GYK:QYPW:QOEC:E6SD:YOFR:YWKF:YJKJ:N6OB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: true

@xy124
Author

xy124 commented Mar 19, 2020

The bug still persists, but the error looks different now:

$ pip install git+https://github.com/oar-team/oar-docker.git
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.4 LTS
Release:	18.04
Codename:	bionic
$ oardocker --version
oardocker, version 1.5.0.dev0
$ docker --version
Docker version 19.03.6, build 369ce74a3c
$ oardocker init -f build install git+https://github.com/oar-team/oar.git start connect frontend

OAR version is git hash: a0d1c4045ffc1748cb59d77c46ccd4fc7a2400e7

docker@frontend ~
$ oarsub -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=1
Interactive mode: waiting...
Starting...
ERROR: some resources did not respond

The following helps to work around it:

$ oardocker connect -l root server
root@server ~
vi /etc/oar/job_resource_manager_cgroups.pl

Then apply the following patch manually (basically, it removes all blkio handling):

--- job_resource_manager_cgroups.pl	2020-03-19 11:56:51.257232670 +0100
+++ a.pl	2020-03-19 11:54:22.751497831 +0100
@@ -140,7 +140,7 @@
             flock(LOCKFILE,LOCK_EX) or exit_myself(17,"flock failed: $!");
             if (!(-r $Cgroup_directory_collection_links.'/cpuset/tasks')){
                 if (!(-r $OS_cgroups_path.'/cpuset/tasks')){
-                    my $cgroup_list = "cpuset,cpu,cpuacct,devices,freezer,blkio";
+                    my $cgroup_list = "cpuset,cpu,cpuacct,devices,freezer";
                     $cgroup_list .= ",memory" if ($ENABLE_MEMCG eq "YES");
                     if (system('oardodo mkdir -p '.$Cgroup_mount_point.' &&
                                 oardodo mount -t cgroup -o '.$cgroup_list.' none '.$Cgroup_mount_point.' || exit 1
@@ -152,7 +152,6 @@
                                 oardodo ln -s '.$Cgroup_mount_point.' '.$Cgroup_directory_collection_links.'/cpuacct &&
                                 oardodo ln -s '.$Cgroup_mount_point.' '.$Cgroup_directory_collection_links.'/devices &&
                                 oardodo ln -s '.$Cgroup_mount_point.' '.$Cgroup_directory_collection_links.'/freezer &&
-                                oardodo ln -s '.$Cgroup_mount_point.' '.$Cgroup_directory_collection_links.'/blkio &&
                                 [ "'.$ENABLE_MEMCG.'" =  "YES" ] && oardodo ln -s '.$Cgroup_mount_point.' '.$Cgroup_directory_collection_links.'/memory || true
                                ')){
                         exit_myself(4,"Failed to mount cgroup pseudo filesystem");
@@ -167,7 +166,6 @@
                                 oardodo ln -s '.$OS_cgroups_path.'/cpuacct '.$Cgroup_directory_collection_links.'/cpuacct &&
                                 oardodo ln -s '.$OS_cgroups_path.'/devices '.$Cgroup_directory_collection_links.'/devices &&
                                 oardodo ln -s '.$OS_cgroups_path.'/freezer '.$Cgroup_directory_collection_links.'/freezer &&
-                                oardodo ln -s '.$OS_cgroups_path.'/blkio '.$Cgroup_directory_collection_links.'/blkio &&
                                 [ "'.$ENABLE_MEMCG.'" =  "YES" ] && oardodo ln -s '.$OS_cgroups_path.'/memory '.$Cgroup_directory_collection_links.'/memory || true
                                ')){
                         exit_myself(4,"Failed to link existing OS cgroup pseudo filesystem");
@@ -183,8 +181,7 @@
                              done
                              /bin/echo 0 | cat > '.$Cgroup_directory_collection_links.'/cpuset/'.$Cpuset->{cpuset_path}.'/cpuset.cpu_exclusive &&
                              cat '.$Cgroup_directory_collection_links.'/cpuset/cpuset.mems > '.$Cgroup_directory_collection_links.'/cpuset/'.$Cpuset->{cpuset_path}.'/cpuset.mems &&
-                             cat '.$Cgroup_directory_collection_links.'/cpuset/cpuset.cpus > '.$Cgroup_directory_collection_links.'/cpuset/'.$Cpuset->{cpuset_path}.'/cpuset.cpus &&
-                             /bin/echo 1000 | cat > '.$Cgroup_directory_collection_links.'/blkio/'.$Cpuset->{cpuset_path}.'/blkio.weight
+                             cat '.$Cgroup_directory_collection_links.'/cpuset/cpuset.cpus > '.$Cgroup_directory_collection_links.'/cpuset/'.$Cpuset->{cpuset_path}.'/cpuset.cpus
                             ')){
                     exit_myself(4,"Failed to create cgroup $Cpuset->{cpuset_path}");
                 }
@@ -250,9 +247,6 @@
         # TODO: Need to do more tests to validate so remove this feature
         #       Some values are not working when echoing
         $IO_ratio = 1000;
-        if (system( '/bin/echo '.$IO_ratio.' | cat > '.$Cgroup_directory_collection_links.'/blkio/'.$Cpuset_path_job.'/blkio.weight')){
-            exit_myself(5,"Failed to set the blkio.weight to $IO_ratio");
-        }
 
         if ($ENABLE_DEVICESCG eq "YES"){
             my @devices_deny = ();
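
If editing with vi is inconvenient, the same change can be applied with patch(1) (a hedged sketch; /tmp/no-blkio.patch is only an example name for a copy of the diff placed in the server container beforehand, where OAR keeps the manager script):

$ oardocker connect -l root server
root@server ~
$ cd /etc/oar
# Apply the diff above to the named file; header names in the diff are ignored.
$ patch job_resource_manager_cgroups.pl < /tmp/no-blkio.patch

After that, re-running oarsub -I from the frontend should no longer hit the CPUSET_ERROR.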

@augu5te
Contributor

augu5te commented Mar 19, 2020

Looks like "Blkio CG" should be an option like MEMCG; I'll ask Pierre about this.
Thanks for your investigation.

@npf
Copy link
Contributor

npf commented Mar 19, 2020

Yes, the interface of the blkio cgroup changed a bit with recent kernels. Commenting out the blkio lines in the job_resource_manager is the quick fix.

The real fix would be to adapt to the latest blkio interface. PRs welcome.
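
For reference, one possible direction (a hedged sketch of such an adaptation, not the project's actual fix): on recent kernels the legacy blkio.weight file may be absent (it belonged to the CFQ scheduler), while blkio.bfq.weight exists when BFQ is in use, so the manager could probe for whichever file is writable before setting the IO weight and skip the step otherwise. A shell-level illustration, using the path layout from the log above:

# Hedged sketch only: pick whichever blkio weight interface is writable.
CG=/dev/oar_cgroups_links/blkio/oar
IO_RATIO=1000
if [ -w "$CG/blkio.weight" ]; then
    /bin/echo "$IO_RATIO" > "$CG/blkio.weight"        # legacy CFQ interface
elif [ -w "$CG/blkio.bfq.weight" ]; then
    /bin/echo "$IO_RATIO" > "$CG/blkio.bfq.weight"    # BFQ interface on newer kernels
else
    echo "no writable blkio weight interface, skipping" >&2
fi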
