An OpenShift downgrade story (Part 1)
Ever since I started working with Red Hat OpenShift 4.x, I have been curious about why a version downgrade is not recommended or supported for OCP v4.x. I understand that going back to an older version is not a normal requirement, but the option would be really helpful in some critical situations, such as hitting a severe bug (with no feasible workaround) in the latest version or triggering an upgrade by mistake.
Recently, I was fortunate enough to experience a catastrophe involving this downgrade procedure. In my scenario, an upgrade was mistakenly initiated to a different version (v4.7.42) than the one we had planned for (v4.7.34). Due to organizational policies, the cluster was strictly required to end up on v4.8.28, and that was the beginning of the trouble: there was no upgrade path available from v4.7.42 to v4.8.28 at that time.
Now, when I think of going back to a previous version, the following two options seem preferable:
1. Running `oc adm upgrade --to-image=<image> --force`.
2. Restoring the etcd backup captured right before initiating the upgrade.
We proceeded with option 2, as it appeared to be the more trustworthy of the two. The etcd backup was taken when the cluster was at v4.6.32. So, we’ll be going back to v4.6.32 from v4.7.42.
We performed the restore as per the steps specified in the disaster recovery section of the product documentation. The restoration progressed well, all in the expected manner, which made us feel that the hype around "no downgrades recommended" might be just a casual design choice and not a big deal. But the nightmare started once the etcd restoration was completed.
So, once we restored the cluster, all the operators were back at v4.6.32. However, the machine-config operator was not healthy. On checking further, we observed that the MCO (machine-config operator) had marked all the cluster nodes as degraded, and the reason was an ignition version mismatch (the details will unravel as you read on). This left us in a state where the cluster was at v4.6.32 but the cluster nodes remained updated as per v4.7.42.
Why could the nodes not roll back to the previous version?
We should all be aware of the importance of ignition configs for RHCOS nodes and of the ignition spec versions specified in the `machineconfigs` managed by MCO. (Details on how ignition works can be found here.)
With OpenShift v4.7, RHCOS started supporting ignition spec v3.2; OpenShift v4.6, however, supported only up to v3.1.
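To make the mismatch concrete, here is a minimal Python sketch of the kind of version check involved. It is not the MCO's actual code: the `MAX_IGNITION_SPEC` table, `parse_version` and `is_supported` are hypothetical names made up for illustration, and the only facts it encodes are the two spec versions mentioned above.

```python
# Hypothetical sketch: NOT the machine-config operator's real implementation.
# Highest ignition spec version understood by each OpenShift minor release,
# as stated in this article.
MAX_IGNITION_SPEC = {
    "4.6": (3, 1),
    "4.7": (3, 2),
}


def parse_version(text):
    """Turn '3.2.0' or 'v3.2' into a comparable (major, minor) tuple."""
    parts = text.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def is_supported(ocp_minor, ignition_version):
    """Return True if the given OpenShift minor release understands the spec."""
    return parse_version(ignition_version) <= MAX_IGNITION_SPEC[ocp_minor]


# A dict shaped like the 'ignition' stanza of an ignition config (assumed layout).
node_config = {"ignition": {"version": "3.2.0"}}
version = node_config["ignition"]["version"]

print(is_supported("4.7", version))  # True
print(is_supported("4.6", version))  # False
```

The second call is our situation in a nutshell: a v4.6 control plane being handed a v3.2 config that only v4.7 knows how to read.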
So, with the cluster nodes already at RHCOS v4.7 (ignition spec v3.2), the MCO at v4.6, not being friends with ignition v3.2, marked it as unknown and unsupported. The following is an excerpt of the error thrown by MCO:
Let’s talk about the steps we took to bypass the unstable state of the cluster in Part 2!