Fixing node issues

Once you find an issue on the node that can affect any user job

  1. Create an issue in GitLab describing the problem and taint the node with the issue number

    kubectl taint nodes node-1 nautilus.io/issue=issue_number:NoSchedule
    1. If you’re using the zsh-scripts.sh script from the Ansible repo, use its node_maint function to automagically create the issue and taint the node with it
  2. Drain the node, then check that all pods except the system ones are gone

  3. Check that all storage volumes are successfully unmounted from the node (both lsblk devices and k8s volumeattachments)

  4. Check in linstor controller pod that no linstor volumes exist on the node (even diskless ones)

  5. Delete the spegel pod: it does not tolerate the nautilus.io/issue taint and won’t restart. If you need containerd to work, remove the registry config_path from its config:

    sudo -s
    sed -i '/^[[:space:]]*config_path = "\/etc\/containerd\/certs\.d"$/d' /etc/containerd/config.toml
    rm -rf /etc/containerd/certs.d
    systemctl restart containerd
  6. Fix the issue
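
Steps 2 and 3 above can be sketched as a few shell helpers. This is a minimal sketch, not the project's own tooling: the function names are hypothetical (they are not taken from zsh-scripts.sh), and the drain flags are an assumption about what the cluster needs. `--ignore-daemonsets` leaves DaemonSet-managed pods in place and `--delete-emptydir-data` allows evicting pods that use emptyDir volumes.

```shell
# Hypothetical helpers; none of these names come from zsh-scripts.sh.

drain_node() {
  # Cordon the node and evict its pods. DaemonSet-managed pods are left alone,
  # and pods using emptyDir volumes are evicted (their emptyDir data is lost).
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

pods_on_node() {
  # List every pod still bound to the node, in all namespaces.
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}

attachments_on_node() {
  # Print VolumeAttachment objects that still reference the node.
  # Empty output means Kubernetes reports nothing attached there.
  kubectl get volumeattachments --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached |
    awk -v node="$1" '$2 == node'
}
```

For example, `drain_node node-1 && pods_on_node node-1 && attachments_on_node node-1`. The helpers only cover the Kubernetes side, so block devices still have to be checked on the node itself with `lsblk`.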

When the issue is fixed

  1. Verify the node connectivity

  2. If spegel was disabled, put back the config for it and restart containerd:

    /etc/containerd/config.toml
    [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d"
  3. Close the issue in GitLab and untaint the node

  4. Check that all pods are running on the node without issues
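
Steps 2 to 4 above could be scripted along these lines. This is a sketch under stated assumptions: the function names are hypothetical, and the config file is assumed to still contain the `[plugins."io.containerd.grpc.v1.cri".registry]` section header, which the earlier `sed` command leaves in place. If the header is missing, the helper changes nothing and the setting has to be added by hand.

```shell
# Hypothetical helpers sketching steps 2 to 4.

restore_registry_config() {
  # $1: path to the containerd config, normally /etc/containerd/config.toml.
  conf="$1"
  # Leave the file alone if the setting is already there.
  if grep -qF 'config_path = "/etc/containerd/certs.d"' "$conf"; then
    return 0
  fi
  # Re-add the setting on the line after the CRI registry section header.
  # Writing back with cat keeps the original file's owner and mode.
  awk '
    { print }
    index($0, "[plugins.\"io.containerd.grpc.v1.cri\".registry]") > 0 {
      print "  config_path = \"/etc/containerd/certs.d\""
    }
  ' "$conf" > "$conf.tmp" && cat "$conf.tmp" > "$conf" && rm -f "$conf.tmp"
}

untaint_node() {
  # A trailing "-" removes every taint with this key, whatever its value.
  kubectl taint nodes "$1" nautilus.io/issue-
}

unhealthy_pods_on_node() {
  # Print pods on the node whose STATUS column is not Running or Completed.
  kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=$1" |
    awk '$4 != "Running" && $4 != "Completed"'
}
```

A possible sequence, run as root on the node for the first line: `restore_registry_config /etc/containerd/config.toml && systemctl restart containerd`, then `untaint_node node-1` and, once pods have been scheduled again, `unhealthy_pods_on_node node-1`, which should print nothing.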

This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019.