Problem Statement
Google’s Compute Engine uses a process called ‘live migration’ to handle software/hardware upgrades to nodes. The workload of a running node is shifted to a new node, and the upgrade then proceeds. There is usually a small delay while the new node is provisioned into the cluster, which may cause the following issue:
- A data disk attached to the original node may not be attached to the newly provisioned node once it has been assigned. This causes the cStor pool pods on that node to go into CrashLoopBackOff state.
OpenEBS version
Any version
OpenEBS Storage Engine
cStor
Symptoms
- cStor pool pods on the node undergoing live migration go into CrashLoopBackOff state (this can be confirmed as shown below).
- The node undergoing live migration does not have its data disk attached.
- If enough pool pods go into CrashLoopBackOff state, quorum is lost and the volumes may go into read-only state. This in turn may send the application pods into CrashLoopBackOff state.
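The state of the pool pods can be checked with kubectl. A minimal check, assuming OpenEBS is installed in the openebs namespace (adjust the namespace if your installation differs):

kubectl get pods -n openebs -o wide

Pool pods scheduled on the node that went through live migration will show CrashLoopBackOff in the STATUS column, and the NODE column shows where each pod is scheduled.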
Troubleshooting
The disk still has to be attached to the node manually, either from Google’s Cloud Shell or through the Compute Engine settings in the GUI.
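Before reattaching anything, it can help to confirm which disks are currently attached to the affected node. One way, with the node name and zone as placeholders, is to describe the instance and inspect the disks section of the output:

gcloud compute instances describe <node-name> --zone <zone>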
Solution
If the blockdevices have at least one unique identifier among WWN, Model, Serial, and Vendor, the user can manually attach the block device to the newly provisioned node. Google’s Cloud Shell can be used to reattach the disk to the node; it can be accessed by clicking on the 'Connect' option and then clicking on 'Run in Cloud Shell'.
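The identifiers mentioned above are recorded on the blockdevice resources managed by NDM, so they can be cross-checked before reattaching. A rough sketch, assuming OpenEBS and NDM run in the openebs namespace:

kubectl get blockdevice -n openebs
kubectl get blockdevice <blockdevice-name> -n openebs -o yaml

Fields such as model, serial and vendor appear under the device details in the YAML output (exact field names may vary across OpenEBS versions).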
Execute the following command to attach the disk to the node:
gcloud compute instances attach-disk <node-name> --disk <disk-name> --device-name <device-name> --zone <zone>
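For instance, with purely hypothetical node, disk and zone names, the command might look like:

gcloud compute instances attach-disk gke-cluster-node-1 --disk openebs-disk-1 --device-name openebs-disk-1 --zone us-central1-a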
The list of disks in the project can be obtained using the following command:
gcloud compute disks list
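If the project has many disks, the output can be narrowed down to the disk of interest, for example by piping through grep (the disk name below is a placeholder):

gcloud compute disks list | grep <disk-name>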
Note: ZFS scans the labels on all available disks, identifies the disks that belong to a pool, and adds them back to their respective pool.
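Once the disk is reattached and the pool pod restarts, the pool state can be verified from inside the pool pod. A minimal sketch, assuming the pool pod runs in the openebs namespace and has a container named cstor-pool (pod and container names may differ in your setup):

kubectl exec -it <cstor-pool-pod-name> -n openebs -c cstor-pool -- zpool status

A healthy pool reports its state as ONLINE, and the pool pod should return to Running state shortly afterwards.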