Upgrade
-------
 - The upgrade may break things, causing long cluster outages, or may even require
   a complete re-install. So far I have found a problem with 'kube-service-catalog',
   but I am not sure the problems are limited to it. Furthermore, we are currently
   using the 'latest' tag of several docker images (heketi is an example of a
   critical service on the 'latest' tag). An update may break these as well.

kube-service-catalog
--------------------
 - The update of 'kube-service-catalog' breaks the OpenShift health check
        curl -k https://apiserver.kube-service-catalog.svc/healthz
   It complains about 'etcd'. The specific etcd check
        curl -k https://apiserver.kube-service-catalog.svc/healthz/etcd
   reports that all servers are unreachable (a probe sketch is included at the end
   of this section).

 - In fact etcd is working and the cluster is mostly functional. Occasionally it
   may suffer from the bug described here:
        https://github.com/kubernetes/kubernetes/issues/47131
   The 'oc' queries become extremely slow and the healthz service reports that
   there are too many connections. Killing the 'kube-service-catalog/apiserver'
   helps for a while, but the problem returns occasionally.

 - The information below is an attempt to understand the reason. In fact, it is a
   list of things that are NOT the reason. The only solution found so far is to
   prevent the update of 'kube-service-catalog' by setting
        openshift_enable_service_catalog: false

 - The problem only occurs if the 'openshift_service_catalog' role is executed. It
   results in some miscommunication between 'apiserver' and/or 'controller-manager'
   and etcd. The cluster is still operational, so the connection is not completely
   lost, but it does not work as expected in some circumstances.

 - There are no significant changes. Exactly the same docker images are installed.
   The only change in '/etc' is the updated certificates used by 'apiserver' and
   'controller-manager'.
   * The certificates are located in '/etc/origin/service-catalog/' on the first
     master server; 'oc adm ca' is used to generate them. However, the certificates
     in this folder are not used directly. They are merely temporary files used to
     generate 'secrets/service-catalog-ssl', which is what 'apiserver' and
     'controller-manager' actually use. The provisioning code is in
        openshift-ansible/roles/openshift_service_catalog/tasks/generate_certs.yml
     It can't be disabled completely because the registered 'apiserver_ca' variable
     is used in install.yml, but the actual generation can be skipped and the old
     files re-used to generate the secret.
   * I have tried modifying the role to keep the old certificates. The healthz
     check was still broken afterwards. So this update is not the problem (or at
     least not the sole problem).

 - The 'etcd' cluster seems OK. On all nodes, etcd can be verified using
        etcdctl3 member list
   (a cluster-wide health check sketch is included at the end of this section).
   * The last command is actually a bash alias which executes
        ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 member list
     etcd actually serves two ports: 2379 (clients) and 2380 (peers). One idea was
     that the second port might have problems. I tried changing 2379 to 2380 in the
     command above and it failed; however, it fails the same way when the cluster
     is in a healthy state, so this proves nothing.
   * Another idea was that the certificates are re-generated for the wrong IPs/names
     and, hence, certificate validation fails; or that the originally generated CA
     is the one registered with etcd. This is certainly not the (only) issue, as the
     problem persists even if we keep the certificates intact. I also verified that
     the newly generated certificates are essentially identical to the old ones and
     contain the correct hostnames (see the certificate comparison sketch at the end
     of this section).
   * The last idea was that 'asb-etcd' is actually broken. It complains:
        2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
     However, the same error is present in the log directly after the install,
     while the cluster is completely healthy.

 - Networking also does not seem to be an issue. The configurations during install
   and upgrade are exactly the same. All names are defined in /etc/hosts.
   Furthermore, the names in /etc/hosts are resolved (and reverse-resolved) by the
   provided dnsmasq server, i.e. ipeshift1 resolves to 192.168.13.1 using nslookup
   and 192.168.13.1 resolves back to ipeshift1 (see the resolution check sketch
   below). So the configuration is indistinguishable from a proper one with a
   properly configured DNS.
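 - Sketch of the health-endpoint probe referenced above. A minimal script, assuming
   it runs on a cluster node where the *.svc names resolve through the cluster DNS;
   only the two endpoints mentioned above are queried:
        #!/bin/bash
        # Probe the service-catalog apiserver health endpoints and show the HTTP status.
        BASE=https://apiserver.kube-service-catalog.svc
        for path in /healthz /healthz/etcd; do
            echo "=== $BASE$path ==="
            # -k: the serving certificate is signed by the service-catalog CA,
            #     not by a publicly trusted one
            curl -sk -w '\nHTTP status: %{http_code}\n' "$BASE$path"
        done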
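 - Sketch of the certificate comparison mentioned above. Assumptions: the old files
   were backed up to /root/service-catalog.bak before the upgrade (a hypothetical
   path), and the file names ca.crt/apiserver.crt match what generate_certs.yml
   produces; adjust if they differ:
        #!/bin/bash
        # Compare subject, issuer, validity, and SANs of the old and new certificates.
        OLD=/root/service-catalog.bak        # hypothetical backup of the pre-upgrade files
        NEW=/etc/origin/service-catalog
        for f in ca.crt apiserver.crt; do
            for d in "$OLD" "$NEW"; do
                echo "=== $d/$f ==="
                openssl x509 -in "$d/$f" -noout -subject -issuer -dates
                # The SANs are what hostname validation in the healthz check depends on
                openssl x509 -in "$d/$f" -noout -text | grep -A1 'Subject Alternative Name'
            done
        done
        # The secret actually consumed by apiserver/controller-manager can be dumped
        # for comparison as well (no assumptions about its internal keys):
        oc -n kube-service-catalog get secret service-catalog-ssl -o yaml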
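 - Sketch of the cluster-wide etcd check behind the 'etcdctl3' alias. Assumption:
   the MASTERS list below is a placeholder and must be replaced with the real etcd
   member hostnames; the certificate paths are the standard openshift-ansible ones:
        #!/bin/bash
        # Query health and membership of every etcd member over the client port 2379.
        MASTERS="ipeshift1 ipeshift2 ipeshift3"   # placeholder list of etcd members
        for m in $MASTERS; do
            echo "=== $m ==="
            ETCDCTL_API=3 /usr/bin/etcdctl \
                --cert /etc/etcd/peer.crt \
                --key /etc/etcd/peer.key \
                --cacert /etc/etcd/ca.crt \
                --endpoints "https://$m:2379" \
                endpoint health
        done
        # Membership as seen from the local node (same as the 'etcdctl3 member list' alias)
        ETCDCTL_API=3 /usr/bin/etcdctl \
            --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt \
            --endpoints "https://$(hostname):2379" member list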
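 - Sketch of the forward/reverse resolution check mentioned above. It walks over
   /etc/hosts and asks the configured resolver (i.e. the node-local dnsmasq), so it
   tests the DNS path rather than the hosts file itself; loopback entries are
   skipped:
        #!/bin/bash
        # Verify that every name in /etc/hosts resolves forward and back via DNS,
        # e.g. ipeshift1 -> 192.168.13.1 and 192.168.13.1 -> ipeshift1.
        grep -Ev '^[[:space:]]*(#|$|127\.|::1)' /etc/hosts | while read -r ip names; do
            for name in $names; do
                echo "=== $name / $ip ==="
                nslookup "$name" | grep -A1 'Name:'   # forward lookup
                nslookup "$ip"   | grep 'name ='      # reverse (PTR) lookup
            done
        done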