Image cache from ISO not being loaded #9892

Open
rothgar opened this issue Dec 7, 2024 · 6 comments

Comments

@rothgar
Member

rothgar commented Dec 7, 2024

Bug Report

I've been trying to get the image cache to work with an ISO on a bare metal machine.

I created the ISO with the image cache via:

docker run --rm -t -v $PWD/image-cache.oci:/tmp/ -v $PWD/_out:/secureboot:ro -v $PWD/_out:/out -v /dev:/dev --privileged ghcr.io/siderolabs/imager:v1.9.0-beta.0 iso --image-cache /tmp/

When I mount the ISO on my local machine, I can see the image cache directory and blobs.

I'm generating a default machine config and applying the following patch:

machine:
  features:
    imageCache:
      localEnabled: true
---
apiVersion: v1alpha1
kind: VolumeConfig
name: IMAGECACHE
provisioning:
  diskSelector:
    match: "system_disk"
  minSize: 4GB
  maxSize: 4GB

After the machine is installed, the imagecacheconfig says the copy was skipped, but I'm not sure why:

node: 192.168.4.26
metadata:
    namespace: cri
    type: ImageCacheConfigs.cri.talos.dev
    id: image-cache
    version: 4
    owner: cri.ImageCacheConfigController
    phase: running
    created: 2024-12-07T01:00:52Z
    updated: 2024-12-07T01:00:57Z
spec:
    status: ready
    copyStatus: skipped
    roots:
        - /system/imagecache/disk

When I look through the registryd logs, I can see that none of the images were found:

192.168.4.26: 2024-12-07T01:03:40.209Z ERROR failed to handle request {"error": "1 error occurred:\n\t* stat /system/imagecache/disk/blob/sha256:0bc8e07f93b4cc11ff57d8ea896ed9d3622c4e79bbd7ac3d03d684a269a6ac8a: no such file or directory\n\n"}
192.168.4.26: 2024-12-07T01:03:40.228Z INFO image request {"method": "GET", "url": "/v2/coredns/coredns/blobs/sha256:ceabd226a7a8e199afc46c5f32d5c23fe98001ff25e6a8c7ca5f4490241256c2?ns=registry.k8s.io", "remote_addr": "127.0.0.1:48630", "name": "coredns/coredns", "digest": "sha256:ceabd226a7a8e199afc46c5f32d5c23fe98001ff25e6a8c7ca5f4490241256c2", "is_blob": true, "registry": "registry.k8s.io"}
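
To see at a glance which blobs the cache could not serve, the error lines can be reduced to a list of digests. This is only a minimal sketch: `missing_blobs` is a hypothetical helper name, and it assumes the log format matches the two lines quoted above (an error message containing "no such file or directory" and a full sha256 digest).

```shell
#!/bin/sh
# Hypothetical helper: print the unique blob digests that registryd
# reported as missing. Reads registryd log text on stdin.
# Assumes the log format shown in the lines quoted above.
missing_blobs() {
  grep 'no such file or directory' |
    grep -o 'sha256:[0-9a-f]\{64\}' |
    sort -u
}
```

For example, `missing_blobs < registryd.log` could be run against the registryd.log from a support bundle. INFO lines such as the image request above are ignored because they do not contain the error text.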

When I look at the filesystem, the /system/imagecache/disk directory is empty:

talosctl list /system/imagecache/disk/ --talosconfig ./talosconfig -n 192.168.4.26 -e 192.168.4.26
NODE           NAME
192.168.4.26   .
192.168.4.26   lost+found

Logs

support.zip

Environment

  • Talos version:
talosctl version --nodes 192.168.4.26 --talosconfig ../talosconfig -e 192.168.4.26  
Client:
        Tag:         v1.9.0-beta.0
        SHA:         580805ba
        Built:       
        Go version:  go1.23.3
        OS/Arch:     linux/amd64
Server:
        NODE:        192.168.4.26
        Tag:         v1.9.0-beta.0
        SHA:         580805ba
        Built:       
        Go version:  go1.23.3
        OS/Arch:     linux/amd64
        Enabled:     RBAC
  • Kubernetes version:
Client Version: v1.31.3
Kustomize Version: v5.4.2
Server Version: v1.32.0-rc.0
  • Platform: metal
@smira
Member

smira commented Dec 7, 2024

We would need the logs of the machine as it was installing to dig any further.

After the install the state looks okay given that the ISO was removed.

@rothgar
Member Author

rothgar commented Dec 7, 2024

Is there a way to get those logs? Once I sent the config, the API was unavailable.

Maybe if I connect it to Omni?

@smira
Member

smira commented Dec 9, 2024

Getting logs is no different from any other Talos usage. Serial console logs are the easiest; make sure console=ttyS0 is set on boot.

IPMI, Omni, or a network log receiver are also options. It depends on your environment.

@rothgar
Member Author

rothgar commented Dec 9, 2024

I tried again with a system that already had Talos installed on the disk but was in maintenance mode. Here's what I did.

I generated an 8GB raw disk image and wrote it to an NVMe drive with dd. I verified that the drive had an IMAGECACHE partition, then put the drive into my machine and booted it. Talos came up in maintenance mode.

I then generated a config with the following patch:

machine:
  features:
    imageCache:
      localEnabled: true

I applied it to the machine with a full (default) machine config. The nice thing about doing it this way is that the machine didn't have to reboot, so I didn't lose any logs.

I then bootstrapped the system and checked the image cache config. It reports the cache as ready (the copy was skipped, which is expected in this case).

talosctl get imagecacheconfig -o yaml -e 192.168.4.26 -n 192.168.4.26 --talosconfig ../talosconfig                    
node: 192.168.4.26
metadata:
    namespace: cri
    type: ImageCacheConfigs.cri.talos.dev
    id: image-cache
    version: 4
    owner: cri.ImageCacheConfigController
    phase: running
    created: 2024-12-09T23:10:11Z
    updated: 2024-12-09T23:12:02Z
spec:
    status: ready
    copyStatus: skipped
    roots:
        - /system/imagecache/disk
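
When comparing runs, it can help to reduce this resource to the two fields that matter here. The following is a rough sketch, not a Talos tool: `image_cache_status` is a hypothetical helper, and it assumes the YAML layout shown above, with `status:` and `copyStatus:` each on their own line.

```shell
#!/bin/sh
# Hypothetical helper: print "<status> <copyStatus>" from the YAML that
# `talosctl get imagecacheconfig -o yaml` produces (read on stdin).
# Assumes the field layout shown in the output above.
image_cache_status() {
  awk '
    $1 == "status:"     { status = $2 }
    $1 == "copyStatus:" { copy = $2 }
    END { print status, copy }
  '
}
```

Piping the command shown above into `image_cache_status` would print `ready skipped` for this node.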

When I look at registryd.log, I can see all the image requests, and none of them have errors like they did with the ISO.

Here's the full support zip for future reference: support.zip

I'm still trying to figure out why the ISO isn't working for me and to get full logs from the installer.

@rothgar
Member Author

rothgar commented Dec 11, 2024

We dug into this some more, and the basic issue is that I'm not actually booting my physical servers from a real CD. I'm using a USB drive with Ventoy, which boots the ISO image.

Here's what we found. Booting the system with Ventoy shows the following discovered volumes:

NODE   NAMESPACE   TYPE               ID        VERSION   TYPE   SIZE     DISCOVERED   LABEL   PARTITIONLABEL
       runtime     DiscoveredVolume   loop0     1         disk   74 MB    squashfs             
       runtime     DiscoveredVolume   nvme0n1   1         disk   512 GB                        
       runtime     DiscoveredVolume   sda       1         disk   62 GB

I also tried a USB drive, first by restoring the image in GNOME Disks and then by writing the ISO to it with sudo dd if=_out/metal-amd64.iso of=/dev/sdd. In both cases the discovered volumes looked similar but included a partition:

NODE   NAMESPACE   TYPE               ID        VERSION   TYPE        SIZE     DISCOVERED   LABEL   PARTITIONLABEL
       runtime     DiscoveredVolume   loop0     1         disk        74 MB    squashfs             
       runtime     DiscoveredVolume   nvme0n1   1         disk        512 GB                        
       runtime     DiscoveredVolume   sda       1         disk        125 GB   gpt                  
       runtime     DiscoveredVolume   sda1      1         partition   125 GB

Finally, I used a KVM to attach a virtual ISO over USB. This worked: the ISO was detected properly by Talos and the image cache worked as expected.

NODE   NAMESPACE   TYPE               ID        VERSION   TYPE   SIZE     DISCOVERED   LABEL                 PARTITIONLABEL
       runtime     DiscoveredVolume   loop0     1         disk   74 MB    squashfs                           
       runtime     DiscoveredVolume   nvme0n1   1         disk   512 GB                                      
       runtime     DiscoveredVolume   sr0       1         disk   543 MB   iso9660      TALOS_V1_9_0_BETA_1
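
The three tables differ in one visible way: only the working case lists a volume detected as iso9660. As a quick check before installing, that can be tested with a small filter. This is a sketch under that assumption only; `find_iso9660_volumes` is a hypothetical helper, and the input is assumed to be a discovered volumes table in the layout shown above (for example from `talosctl get discoveredvolumes`, which is not shown in this thread).

```shell
#!/bin/sh
# Hypothetical helper: print the ID of every discovered volume whose
# detected filesystem is iso9660. Reads the table text on stdin and
# exits non-zero when no such volume is listed.
find_iso9660_volumes() {
  awk '
    {
      id = ""; iso = 0
      for (i = 1; i <= NF; i++) {
        # The ID column follows the resource type, whether or not NODE is filled in.
        if ($i == "DiscoveredVolume") id = $(i + 1)
        if ($i == "iso9660") iso = 1
      }
      if (iso && id != "") { print id; found = 1 }
    }
    END { exit found ? 0 : 1 }
  '
}
```

Run against the KVM table above it prints `sr0`; run against the Ventoy or dd tables it prints nothing and returns a non-zero status.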

One idea to address this would be to allow the image cache to be pulled from a different location via disk label or partition.

rothgar added a commit to rothgar/talos that referenced this issue Dec 12, 2024
Clarifying information from siderolabs#9892

Signed-off-by: Justin Garrison <justin.garrison@siderolabs.com>
@smira
Member

smira commented Dec 12, 2024

I still believe an ISO flashed directly to the USB drive should work correctly; that sda/sda1 output looks incorrect to me.

rothgar added a commit to rothgar/talos that referenced this issue Dec 12, 2024
Clarifying information from siderolabs#9892

Signed-off-by: Justin Garrison <justin.garrison@siderolabs.com>
smira pushed a commit to rothgar/talos that referenced this issue Dec 13, 2024
Clarifying information from siderolabs#9892

Signed-off-by: Justin Garrison <justin.garrison@siderolabs.com>
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>