Fixing node issues
Once you find an issue on the node that can affect any user job
Create an issue in GitLab describing the problem and taint the node with the issue number:
    kubectl taint node node-1 nautilus.io/issue=issue_number:NoSchedule
- If you’re using the zsh-scripts.sh script from the Ansible repo, use the node_maint function to automagically create the issue and taint the node with it
Drain the node. Check that all pods except the system ones are gone
Check that all storage volumes are successfully unmounted from the node (both lsblk devices and k8s volumeattachments)
Check in the linstor controller pod that no linstor volumes exist on the node (even diskless ones)
Delete the spegel pod - it’s not tolerating the nautilus.io/issue taint and won’t restart. If you need containerd to work, change the config:
    sudo -s
    sed -i '/^[[:space:]]*config_path = "\/etc\/containerd\/certs\.d"$/d' /etc/containerd/config.toml
    rm -rf /etc/containerd/certs.d
    systemctl restart containerd
Fix the issue
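The taint, drain, and volume-check steps above could be wrapped in a few shell helpers. This is only a sketch: the function names are hypothetical, the node name and issue number are passed in as arguments, and the drain flags are an assumption about how you want DaemonSet pods and emptyDir volumes handled.

```shell
# Hypothetical helpers; none of these names come from zsh-scripts.sh.

# Taint the node with the GitLab issue number.
taint_for_issue() {
  kubectl taint node "$1" "nautilus.io/issue=$2:NoSchedule"
}

# Drain the node. The two flags are an assumption: they let the drain proceed
# past DaemonSet-managed pods and pods using emptyDir volumes (emptyDir data is lost).
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# List every pod still scheduled on the node, in all namespaces.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# List Kubernetes VolumeAttachments that still reference the node.
# Prints nothing when none are left.
attachments_on_node() {
  kubectl get volumeattachments --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ATTACHED:.status.attached \
    | awk -v node="$1" '$2 == node'
}
```

For example, run `taint_for_issue node-1 issue_number` and `drain_node node-1`, then confirm that `pods_on_node node-1` shows only system pods and `attachments_on_node node-1` prints nothing. On the node itself, `lsblk` shows any block devices that are still attached.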
When the issue is fixed
If spegel was disabled, put back the config for it in /etc/containerd/config.toml and restart containerd:
    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d"
Close the issue in GitLab and untaint the node
Check that all pods are running on the node without issues
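Returning the node to service could look like the sketch below. The function names are hypothetical, and the uncordon step is an assumption: it applies when the node was cordoned by `kubectl drain` earlier.

```shell
# Hypothetical helpers for returning a node to service.

# Remove the nautilus.io/issue taint; the trailing "-" removes every taint with that key.
untaint_node() {
  kubectl taint node "$1" nautilus.io/issue-
}

# Assumption: the node was cordoned by "kubectl drain", so mark it schedulable again.
uncordon_node() {
  kubectl uncordon "$1"
}

# List pods on the node that are not in the Running phase.
# Completed job pods (phase Succeeded) are listed too and are harmless.
unhealthy_pods_on_node() {
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$1,status.phase!=Running"
}
```

After `untaint_node node-1` and `uncordon_node node-1`, give the scheduler a few minutes and then run `unhealthy_pods_on_node node-1`. A pod in the Running phase can still have containers that are not ready, so the READY column of a plain pod listing is worth a look as well.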

This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019.