In Part 1, we discussed how a simple mistake in selecting the update channel left the cluster in a dreadful state, but there are no problems we cannot solve together, and very few that we can solve by ourselves. Here, we’ll be talking about the approach we took.
As we eventually had to upgrade our cluster to a stable v4.7.34 once it was restored to v4.6, we agreed to initiate the upgrade while the cluster was still in the same (unstable) state. Please be reminded that this is not a recommended or supported way to upgrade as per the product guidelines, but given the panic we were facing at the time, we had no choice other than to give it a try. So, with all the operators at v4.6.32 and all the cluster nodes marked as degraded by the MCO, we began the upgrade to v4.7.34 (this time, we chose the correct channel 😅).
Given the unhealthy state of the cluster, the upgrade was obviously not a piece of cake. It went well up to the network operator, but then we saw things going wrong again. Most of the newly created pods (router, sdn, etc.) were stuck in the Pending or ContainerCreating state for ~20 minutes, without any relevant logs or events. We looked through the container logs on specific nodes with no success. So, as a workaround, we rebooted one of the nodes at random, and voilà, the trick worked!
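For anyone who wants to spot such pods quickly, here is a minimal sketch of how the stuck ones could be picked out of a pod list. It assumes the input is the parsed JSON of a standard Kubernetes pod list (for example, what `oc get pods -A -o json` returns); the function names `parse_ts` and `stuck_pods` and the 20-minute threshold are illustrative choices, not something we actually ran during the incident.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse a Kubernetes timestamp such as '2021-10-05T10:00:00Z' as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_pods(pod_list, now, threshold=timedelta(minutes=20)):
    """Return (namespace, name, node, reason) for pods Pending longer than threshold.

    A pod in ContainerCreating still reports phase 'Pending'; the
    'ContainerCreating' string shows up as the waiting reason of a container.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        meta = pod["metadata"]
        if now - parse_ts(meta["creationTimestamp"]) < threshold:
            continue
        # Use the first waiting reason if any container reports one.
        reasons = [
            cs.get("state", {}).get("waiting", {}).get("reason")
            for cs in status.get("containerStatuses", [])
        ]
        reason = next((r for r in reasons if r), "Pending")
        node = pod.get("spec", {}).get("nodeName")
        stuck.append((meta["namespace"], meta["name"], node, reason))
    return stuck
```

Grouping the result by node gives a rough idea of which nodes are struggling to create containers, which is more or less what we ended up discovering by hand.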
With so much happening, the nodes weren’t really in good shape to allow the creation of new containers, and a reboot must have helped to reset everything and let the containers be created. Rebooting each node took a lot of manual effort (we have a lot of workers in our cluster), but we couldn’t complain as long as it was helping us. Almost all the operators upgraded to v4.7.34 with these manual interventions, and now it was time for the MCO to update, the scariest of all, but guess what,
The thing you’re most afraid of is the very thing that’ll set you free!
And so did the MCO. As soon as the MCO pod rolled out with the v4.7.34 image, it no longer considered the ignition version (3.2.0) unsupported. (ref: Ignition Spec version vs OCP version)
As the MCO no longer had any problem with the ignition version in the current machine-configs on the nodes, it rolled out the changes smoothly, in order of their ‘creationTimestamp’:
- First, it updated the nodes to v4.6.32 (as the machine-config was regenerated after the etcd restore).
- Then, it updated all the nodes to v4.7.34.
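To make the ordering above a little more concrete, here is a small sketch that lists the rendered machine-configs of a pool from oldest to newest. It assumes the input is the parsed JSON of a MachineConfig list (for example, from `oc get machineconfig -o json`) and that rendered configs follow the usual `rendered-<pool>-<hash>` naming; the helper name `rendered_in_order` is made up for illustration. It only shows the ordering, not the MCO's actual rollout logic.

```python
from datetime import datetime, timezone


def rendered_in_order(mc_list, pool):
    """Return names of rendered machine-configs for `pool`, oldest first."""
    prefix = f"rendered-{pool}-"

    def created(mc):
        # Kubernetes timestamps look like '2021-10-05T09:00:00Z' (UTC).
        ts = mc["metadata"]["creationTimestamp"]
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

    rendered = [
        mc for mc in mc_list.get("items", [])
        if mc["metadata"]["name"].startswith(prefix)
    ]
    return [mc["metadata"]["name"] for mc in sorted(rendered, key=created)]
```

In our case, the config regenerated after the etcd restore (v4.6.32) would appear before the one generated by the v4.7.34 upgrade, matching the two steps listed above.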
Once the cluster was upgraded to v4.7.34, we monitored it for a day to confirm that nothing was broken after so much had happened to its state.
In the meantime, we went down the RCA lane to understand why converting the ignition spec from v2.x or v3.1.0 to a higher version (v3.2.0) is easily feasible, but the reverse is not.
When we update a cluster, the MCO automatically generates new machine-configs with the relevant ignition spec version, and we could find the reason here (ref: RHBZ #1947477):
The translate.go code translates older versions of the ignition spec to the newer version (3) when the MCO detects that the existing configuration of the machine is on an older spec version.
However, as the ignition spec version translation is a one-way trip at the moment, the 4.6 MCO could not understand v3.2 configs, which led to the failures. This might be a potential enhancement in future releases, but until it’s released, prefer not to downgrade OpenShift to a previous version, or if you must, analyze the version compatibilities first.
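The one-way nature of the translation can be illustrated with a tiny sketch. This is not the MCO's code (which is written in Go); it is a simplified Python model that assumes each MCO release understands ignition specs up to a maximum version and nothing newer. The `MAX_SPEC` table reflects what we observed in this incident (3.1.0 for 4.6, 3.2.0 for 4.7), and the names `parse_version` and `can_handle` are hypothetical.

```python
# Assumed mapping of MCO release to the newest ignition spec it understands,
# based on what we observed in this incident.
MAX_SPEC = {"4.6": "3.1.0", "4.7": "3.2.0"}


def parse_version(version):
    """Turn a plain 'x.y.z' string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def can_handle(mco_release, config_spec):
    """True if this MCO release can read (or translate up from) config_spec.

    Older specs can be translated forward; newer specs cannot be translated
    back, so anything above the release's maximum is rejected.
    """
    return parse_version(config_spec) <= parse_version(MAX_SPEC[mco_release])
```

With this model, `can_handle("4.7", "3.1.0")` and `can_handle("4.7", "2.2.0")` are true, while `can_handle("4.6", "3.2.0")` is false, which is exactly the situation our nodes were in after the downgrade.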
I am continuing my research into how this situation could be resolved without going through an upgrade. That might involve some tweaks using rpm-ostree or osImageURLs. Stay tuned for details on that in the next blog!