I run my OKD cluster on a set of VMs in VMware ESXi. My backups consist of shutting down all VMs and then copying them somewhere else – so full shutdown and backup.
The cluster setup was done using the steps outlined in OKD 4.5 small cluster on ESX.
This has worked well and I have been able to restart the OKD cluster after a full restore with no issues. But this time the cluster wouldn't start: the OKD web console didn't work, and when I monitored the VMs and the ESX hosts the CPU, RAM, disk and network usage were all suspiciously low.
I struggled for a long time to figure out how to start up what is essentially a really old VM backup set. I finally got it figured out, I think, but only by following the steps, order and timing I have outlined below.
So I am now starting up a full backup set that was taken on 13/1/2021, and today is 14/3/2021, so the backups are two months old.
I have two ESXi servers: “HP3” has the services and control-plane nodes and “Lenovo5” has the compute/worker node.
Failed restore
To demonstrate the problem I started the services node and the control plane node; both are on the “HP3” ESX server. Below is the memory and CPU usage for the 24 minutes after I started the nodes at 19:14. As you can see there was very little activity:
When you try to log in you just get:
Prechecks
Before starting, check that the date and time are correct on all the ESX servers, so the initial time on the VMs will be more or less in sync prior to NTP kicking in.
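If you have SSH enabled on the hosts you can check this from the command line; a quick sketch, assuming your ESXi build supports the “esxcli hardware clock” and “esxcli system time” namespaces:
esxcli hardware clock get
esxcli system time get
The first shows the BIOS/RTC clock and the second the running system time; both should agree across all hosts.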
Services node
To begin with, start only the services node. This doesn’t actually start OKD, but it is important because the services node provides the NTP, DNS and proxy services the other nodes depend on.
There are a few things below you should ensure are done prior to starting the other nodes.
Add a firewall rule for NTP and restart firewalld:
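It is worth confirming those services actually came up before going further; a quick check, assuming the service names from the original cluster-build guide (adjust to match whatever your services node runs):
systemctl status chronyd named haproxy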
firewall-cmd --permanent --zone=public --add-port=123/udp
systemctl restart firewalld
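You can confirm the rule took effect, and that 123/udp is now listed, with:
firewall-cmd --zone=public --list-ports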
Then I edit the chrony config. Take a backup of “/etc/chrony.conf” first, for example:
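cp -p /etc/chrony.conf /etc/chrony.conf.orig
For me, in New Zealand, this is the chrony.conf file I used: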
#
# Example chrony file from zoyinc.com
#
# Using New Zealand NTP servers - Please set to your local NTP public servers
#server 43.252.70.34
server 0.pool.ntp.org
server 1.pool.ntp.org
server 2.pool.ntp.org
server 3.pool.ntp.org
server 216.239.35.0
server 216.239.35.4
# Record the rate at which the system clock gains/loses time.
driftfile /var/lib/chrony/drift
# Allow the system clock to be stepped in the first three updates
# if its offset is larger than 1 second.
makestep 1.0 3
# Enable kernel synchronization of the real-time clock (RTC).
rtcsync
# Allow NTP client access from local network.
allow 192.168.0.0/16
# Serve time even if not synchronized to a time source.
local stratum 10
# Specify directory for log files.
logdir /var/log/chrony
# Select which information is logged.
log measurements statistics tracking
Now restart:
systemctl restart chronyd.service
Check the sources for chrony by running “chronyc sources”. This should return something like:
[root@okd4-services ~]# chronyc sources
210 Number of sources = 6
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^? time1.google.com 1 6 1 28 -984us[ -984us] +/- 80ms
^? time2.google.com 1 6 1 28 -18us[ -18us] +/- 60ms
^? ns1.att.wlg.telesmart.co> 2 6 1 29 -1792us[-1792us] +/- 14ms
^? ip-103-106-65-219.addr.l> 2 6 1 30 -576us[ -576us] +/- 38ms
^? 101-100-146-146.myrepubl> 2 6 1 30 +962us[ +962us] +/- 52ms
^? ns2.tdc.akl.telesmart.co> 2 6 1 30 -813us[ -813us] +/- 6027us
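You can also sanity-check chrony itself: “chronyc tracking” shows whether the services node has synchronized with a source, and once the other nodes are up “chronyc clients” should list them:
chronyc tracking
chronyc clients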
It appears that the control-plane/worker nodes use UTC, so for consistency enable UTC on the services VM by running:
timedatectl set-timezone UTC
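You can verify the change with “timedatectl”, which should now report the time zone as UTC:
timedatectl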
Initial startup
Now that the services node is up and configured start up the control plane node – do NOT start the compute/worker node yet.
On the services node run:
export KUBECONFIG=/opt/okd4/install_dir/auth/kubeconfig
oc get csr
Because you have just started the control plane this will return:
[root@okd4-services ~]# oc get csr
Unable to connect to the server: x509: certificate has expired or is not yet valid
[root@okd4-services ~]#
Keep running “oc get csr” until you get a certificate signing request. While you wait you may see the following:
[root@okd4-services ~]# oc get csr
No resources found
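Rather than re-running the command by hand you can poll it, for example:
watch -n 15 oc get csr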
This could take a few minutes so be patient:
[root@okd4-services ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-jnpdf 28s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
Approve this certificate using “oc adm certificate approve <csr name>”:
[root@okd4-services ~]# oc adm certificate approve csr-jnpdf
certificatesigningrequest.certificates.k8s.io/csr-jnpdf approved
[root@okd4-services ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-jnpdf 42s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
Keep looking for new requests; we are expecting a “system:node” csr for the control plane. It will look like:
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-rxwww 19s kubernetes.io/kubelet-serving system:node:okd4-control-plane-1.lab.okd.local Pending
csr-jnpdf 63s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
You need to also approve this as before:
[root@okd4-services ~]# oc adm certificate approve csr-rxwww
certificatesigningrequest.certificates.k8s.io/csr-rxwww approved
[root@okd4-services ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-rxwww 41s kubernetes.io/kubelet-serving system:node:okd4-control-plane-1.lab.okd.local Approved,Issued
csr-jnpdf 85s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
Wait for web console to come up
At this point simply wait for the web console to come up; this could take 10 minutes. Once the web console comes up you will also see a lot more CPU activity and memory usage compared to the earlier screenshots from when OKD didn’t start.
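If you want more visibility while you wait, you can watch the cluster operators converge; the web console is ready once the “console” operator reports Available:
oc get clusteroperators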
Start the compute/worker node
Now that the web console is up you will be able to see some things, but others are still not showing:
So now start the compute/worker node.
As before, keep monitoring for CSRs by running “oc get csr” on the services node:
[root@okd4-services ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-2d68r 26s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-jnpdf 15m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-rxwww 14m kubernetes.io/kubelet-serving system:node:okd4-control-plane-1.lab.okd.local Approved,Issued
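If several requests queue up, there is a handy one-liner (from the OpenShift docs) that approves everything currently pending:
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve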
Approve the certificates as they come through. Once things have stabilized you should see:
[root@okd4-services ~]# oc get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-2d68r 2m4s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-jnpdf 17m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Approved,Issued
csr-rxwww 15m kubernetes.io/kubelet-serving system:node:okd4-control-plane-1.lab.okd.local Approved,Issued
csr-wlbkc 61s kubernetes.io/kubelet-serving system:node:okd4-compute-1.lab.okd.local Approved,Issued
It is worth noting that the above four certificate signing requests comprise two kubelet-serving requests, one each for the control-plane and compute nodes, plus two matching kube-apiserver-client-kubelet bootstrap requests.
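At this point both nodes should also be registered and Ready if you run:
oc get nodes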
Within, say, 5 minutes you should see the web console looking much healthier:
The CPU and memory on the HP3 ESX host also look a lot healthier. Note I started the control plane node at 19:33 and the compute node at 19:56: