diff --git a/docs/upgrade.txt b/docs/upgrade.txt
new file mode 100644
index 0000000..b4f22d6
--- /dev/null
+++ b/docs/upgrade.txt
@@ -0,0 +1,64 @@
+Upgrade
+-------
+ - The 'upgrade' may break things, causing long cluster outages, or may even require a complete
+   re-install. Currently, I have found a problem with 'kube-service-catalog', but I am not sure the
+   problems are limited to it. Furthermore, we are currently using the 'latest' tag of several docker
+   images (heketi is an example of a critical service on the 'latest' tag). An update may break these
+   as well.
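+   A minimal sketch for spotting such images before an upgrade, assuming 'oc' cluster-admin access:
+     # list unique container images that reference the 'latest' tag across all namespaces
+     oc get pods --all-namespaces -o jsonpath='{.items[*].spec.containers[*].image}' \
+       | tr ' ' '\n' | grep ':latest' | sort -u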
+
+kube-service-catalog
+--------------------
+ - The update of 'kube-service-catalog' breaks the OpenShift health check
+       curl -k https://apiserver.kube-service-catalog.svc/healthz
+   It complains about 'etcd'. The specific etcd check
+       curl -k https://apiserver.kube-service-catalog.svc/healthz/etcd
+   reports that all servers are unreachable.
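+   Two further queries may narrow this down (a sketch; the '?verbose' parameter and the
+   'v1beta1.servicecatalog.k8s.io' APIService name are assumptions for this OpenShift version):
+     # per-check health output from the aggregated apiserver
+     curl -k 'https://apiserver.kube-service-catalog.svc/healthz?verbose'
+     # availability of the service-catalog APIService as seen by the main apiserver
+     oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml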
+
+ - In fact, etcd is working and the cluster is mostly functional. Occasionally, it may suffer from the
+   bug described here:
+       https://github.com/kubernetes/kubernetes/issues/47131
+   The 'oc' queries become extremely slow and the healthz service reports that there are too many
+   connections. Killing the 'kube-service-catalog/apiserver' helps for a while, but the problem returns
+   occasionally.
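+   A sketch of that 'kill' step, assuming the apiserver pods carry the 'app=apiserver' label (verify
+   with 'oc -n kube-service-catalog get pods --show-labels' first):
+     # delete the service-catalog apiserver pods; they are recreated automatically
+     oc -n kube-service-catalog delete pod -l app=apiserver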
+
+ - The information below is an attempt to understand the reason. In fact, it is a list of things that
+   are NOT the reason. The only solution found so far is to prevent the update of 'kube-service-catalog'
+   by setting
+       openshift_enable_service_catalog: false
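+   A minimal sketch of where this goes, assuming an INI-style openshift-ansible inventory with the
+   usual '[OSEv3:vars]' group:
+       [OSEv3:vars]
+       openshift_enable_service_catalog=false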
+
+ - The problem only occurs if the 'openshift_service_catalog' role is executed. It results in some
+   miscommunication between 'apiserver' and/or 'control-manager' and etcd. Still, the cluster is
+   operational, so the connection is not completely lost, but it does not work as expected in some
+   circumstances.
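+   The miscommunication can be inspected from the pod side (a sketch; pod names are taken from the
+   first command, they are not fixed identifiers):
+     # find the pod names, then look for etcd connection errors in their logs
+     oc -n kube-service-catalog get pods -o wide
+     oc -n kube-service-catalog logs <apiserver-pod>
+     oc -n kube-service-catalog logs <controller-manager-pod>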
+
+ - There are no significant changes. Exactly the same docker images are installed. The only change in
+   '/etc' is the updated certificates used by 'apiserver' and 'control-manager'.
+   * The certificates are located in '/etc/origin/service-catalog/' on the first master server.
+     'oc adm ca' is used for the generation. However, the certificates in this folder are not used
+     directly. They are merely temporary files used to generate 'secrets/service-catalog-ssl', which is
+     used by 'apiserver' and 'control-manager'. The provisioning code is in:
+       openshift-ansible/roles/openshift_service_catalog/tasks/generate_certs.yml
+     It can't be disabled completely, as the registered 'apiserver_ca' variable is used in install.yml,
+     but the actual generation can be skipped and the old files re-used to generate the secret. The old
+     and new certificates can be compared as sketched after this list.
+   * I have tried to modify the role to keep the old certificates. The healthz check was still broken
+     afterwards. So, this update is not the problem (or at least not the sole problem).
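+   A sketch of that comparison, assuming the generated files keep the usual 'apiserver.crt' name under
+   '/etc/origin/service-catalog/':
+     # hostnames/SANs baked into the regenerated apiserver certificate
+     openssl x509 -in /etc/origin/service-catalog/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'
+     # certificate material actually mounted by the pods
+     oc -n kube-service-catalog get secret service-catalog-ssl -o yaml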
+
+ - The 'etcd' cluster seems OK. On all nodes, etcd can be verified using
+       etcdctl3 member list
+   * The last command is actually a bash alias which executes
+       ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 member list
+   * One idea was that maybe the second port has problems: etcd serves two ports, 2379 (clients) and
+     2380 (peers). I tried changing 2379 to 2380 in the command above and it failed. However, it does
+     not work either when the cluster is in a healthy state, so this is not conclusive.
+   * Another idea was that the certificates are re-generated for the wrong IPs/names and, hence,
+     certificate validation fails. Or that the originally generated CA is registered with etcd. This is
+     certainly not the (only) issue, as the problem persists even if we keep the certificates intact.
+     However, I also verified that the newly generated certificates are essentially identical to the old
+     ones and contain the correct hostnames inside (a TLS-level check against the client port is
+     sketched after this list).
+   * The last idea was that 'asb-etcd' is actually broken. It complains
+       2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
+     However, the same error is present in the log directly after install, while the cluster is
+     completely healthy.
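+   Two additional checks (a sketch; 'endpoint health' is standard etcdctl v3, and the openssl line only
+   verifies the TLS handshake on the client port against the etcd CA):
+     ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 endpoint health
+     openssl s_client -connect `hostname`:2379 -cert /etc/etcd/peer.crt -key /etc/etcd/peer.key -CAfile /etc/etcd/ca.crt </dev/null 2>/dev/null | grep 'Verify return code'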
+
+ - Networking also does not seem to be an issue. The configuration during install and upgrade is exactly
+   the same. All names are defined in /etc/hosts. Furthermore, the names in /etc/hosts are resolved (and
+   reverse-resolved) by the provided dnsmasq server. I.e. ipeshift1 resolves to 192.168.13.1 using
+   nslookup, and 192.168.13.1 resolves back to ipeshift1. So, the configuration is indistinguishable
+   from a proper one with a properly configured DNS.
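+   A quick way to repeat this check for every node (a sketch; only 'ipeshift1' appears above, the other
+   host names have to be filled in):
+     for h in ipeshift1 <other-nodes>; do
+       ip=$(getent hosts $h | awk '{print $1}')
+       echo "$h -> $ip -> $(nslookup $ip | awk '/name =/ {print $NF}')"
+     done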
+
+ \ No newline at end of file