Fixing node issues

Once you find an issue on the node that can affect any user job

Create an issue in GitLab describing the problem and taint the node with the issue number
```
kubectl taint nodes node-1 nautilus.io/issue=issue_number:NoSchedule
```
If you use the zsh-scripts.sh script from the Ansible repo, its node_maint function creates the issue and taints the node with it automatically
Drain the node. Check that all pods except the system ones are gone
Check that all storage volumes are successfully unmounted from the node (both lsblk devices and k8s volumeattachments)
In the linstor controller pod, check that no linstor volumes exist on the node (not even diskless ones)
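
The drain and verification steps above can be sketched as a few shell helpers. This is a minimal sketch, not part of the cluster tooling: the function names are hypothetical, and the drain flags are an assumption about what the cluster needs (DaemonSet pods skipped, emptyDir data discarded).

```shell
# Hypothetical helpers; the names are illustrative only.

# Drain the node. The flags are an assumption; adjust them to your policy.
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# List the pods still scheduled on the node, across all namespaces.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Print the names of VolumeAttachments that still reference the node.
# Prints nothing once every volume is detached.
attachments_on_node() {
  kubectl get volumeattachments --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName |
    awk -v node="$1" '$2 == node { print $1 }'
}
```

For example, `attachments_on_node node-1` should print nothing before you continue. Block devices still have to be checked with `lsblk` on the node itself.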

Delete the spegel pod. It does not tolerate the nautilus.io/issue taint, so it won't restart. If you need containerd to keep working on the node, remove the spegel registry configuration:

    sudo -s
    sed -i '/^[[:space:]]*config_path = "\/etc\/containerd\/certs\.d"$/d' /etc/containerd/config.toml
    rm -rf /etc/containerd/certs.d
    systemctl restart containerd
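
Before editing the live file, it can help to preview what the `sed` expression above removes. The helpers below are a sketch with hypothetical names. They take the config path as an argument, so they can be tried on a copy of `/etc/containerd/config.toml` first, and neither of them modifies the file.

```shell
# Hypothetical helper: succeeds if the file still contains the config_path
# line that the sed command above deletes.
has_certs_config_path() {
  grep -Eq '^[[:space:]]*config_path = "/etc/containerd/certs\.d"$' "$1"
}

# Hypothetical helper: print the config to stdout with that line removed.
# Same expression as above, but without -i, so the file is left untouched.
preview_without_config_path() {
  sed '/^[[:space:]]*config_path = "\/etc\/containerd\/certs\.d"$/d' "$1"
}
```

Running `has_certs_config_path /etc/containerd/config.toml && echo "still configured"` after the edit is a quick way to confirm the line is gone.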

Fix the issue

When the issue is fixed

Verify the node connectivity

If spegel was disabled, put back the config for it and restart containerd:

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d"
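
Because the earlier `sed` command deletes only the `config_path` line, the `[plugins."io.containerd.grpc.v1.cri".registry]` header is probably still in the file, and adding it a second time would produce a duplicate TOML table. The sketch below assumes the header is still present and prints a config with the line restored under it. The function name is hypothetical, and the output goes to stdout so it can be reviewed before it replaces the real file.

```shell
# Hypothetical helper: print the config with the config_path line restored
# directly under the existing registry table header.
# Assumes the header line is still present in the file.
with_certs_config_path() {
  # If the line is already there, print the file unchanged.
  if grep -Fq 'config_path = "/etc/containerd/certs.d"' "$1"; then
    cat "$1"
    return
  fi
  awk '
    { print }
    /^[ \t]*\[plugins\."io\.containerd\.grpc\.v1\.cri"\.registry\][ \t]*$/ {
      print "      config_path = \"/etc/containerd/certs.d\""
    }
  ' "$1"
}
```

After reviewing the output and writing it back to `/etc/containerd/config.toml`, restart containerd with `systemctl restart containerd` as before.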

Close the issue in GitLab and untaint the node
Check that all pods are running on the node without issues
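
The final steps can be sketched as follows. The function names are hypothetical. The uncordon step is an addition based on how `kubectl drain` works: it marks the node unschedulable, so the node most likely needs to be uncordoned as well as untainted.

```shell
# Hypothetical helpers; the names are illustrative only.

# Remove the issue taint (a trailing "-" after the key removes it) and
# make the node schedulable again, since drain leaves it cordoned.
return_node_to_service() {
  kubectl taint nodes "$1" nautilus.io/issue-
  kubectl uncordon "$1"
}

# Print pods on the node whose phase is neither Running nor Succeeded.
# Prints nothing when every pod looks healthy at the phase level.
unhealthy_pods_on_node() {
  kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=$1" \
    -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase |
    awk '$3 != "Running" && $3 != "Succeeded"'
}
```

The pod phase is only a coarse signal. A pod in a crash loop can still report `Running`, so check restart counts with `kubectl get pods -o wide` as well.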

This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019.