Upgrade
-------
 - The 'upgrade' may break things, causing long cluster outages, or may even require a complete re-install.
 Currently, I have found a problem with 'kube-service-catalog', but I am not sure the problems are limited to it.
 Furthermore, we are currently using the 'latest' tag for several docker images (heketi is an example of a critical
 service on the 'latest' tag), so an update may break things.
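 A quick way to list the running containers that are still on the 'latest' tag (a sketch, not part of the
 original notes; it assumes docker is the container runtime on the nodes):
        # list unique images of running containers and keep only those tagged ':latest'
        docker ps --format '{{.Image}}' | sort -u | grep ':latest'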
 
kube-service-catalog
--------------------
 - An update of 'kube-service-catalog' breaks the OpenShift health check
        curl -k https://apiserver.kube-service-catalog.svc/healthz
 It complains about 'etcd'. The specific etcd check
        curl -k https://apiserver.kube-service-catalog.svc/healthz/etcd
 reports that all servers are unreachable.
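 A more detailed view can be obtained from the verbose health output and from the status of the aggregated
 API as seen by OpenShift (a sketch; the APIService name is the standard service-catalog one and is an
 assumption here):
        # verbose output lists every individual health check
        curl -k 'https://apiserver.kube-service-catalog.svc/healthz?verbose'
        # status of the aggregated service-catalog API
        oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml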
 
 - In fact, etcd is working and the cluster is mostly functional. Occasionally, it may suffer from the bug
 described here:
        https://github.com/kubernetes/kubernetes/issues/47131
 The 'oc' queries become extremely slow and the healthz service reports that there are too many connections.
 Killing the 'kube-service-catalog/apiserver' pod helps for a while, but the problem returns occasionally.
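 Restarting that pod amounts to deleting it and letting the deployment re-create it (a sketch; the actual
 pod name has to be taken from the namespace listing):
        # find the apiserver pod of the service catalog
        oc get pods -n kube-service-catalog
        # delete it; the deployment re-creates it automatically
        oc delete pod <apiserver-pod-name> -n kube-service-catalog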

 - The information below is an attempt to understand the reason. In fact, it is a list of things that
 are NOT the reason. The only solution found is to prevent the update of 'kube-service-catalog' by setting
         openshift_enable_service_catalog: false
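 If the variable is kept in the ansible inventory (ini format) rather than in group_vars, the equivalent
 setting would look as follows (a sketch; '[OSEv3:vars]' is the usual openshift-ansible section name):
        [OSEv3:vars]
        # skip the openshift_service_catalog role during install/upgrade
        openshift_enable_service_catalog=false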

 - The problem only occurs if the 'openshift_service_catalog' role is executed. It results in some
 miscommunication between 'apiserver' and/or 'controller-manager' and etcd. Still, the cluster is
 operational, so the connection is not completely lost, but it is not working as expected in some
 circumstances.
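 The state of both components after the role has run can be inspected as follows (a sketch; pod names
 have to be substituted from the listing):
        # deployments and pods created by the openshift_service_catalog role
        oc get deployments,pods -n kube-service-catalog
        # logs of the two components
        oc logs <apiserver-pod> -n kube-service-catalog
        oc logs <controller-manager-pod> -n kube-service-catalog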

 - There are no significant changes. The exact same docker images are installed. The only change in
 '/etc' is the updated certificates used by 'apiserver' and 'controller-manager'.
    * The certificates are located in '/etc/origin/service-catalog/' on the first master server.
    'oc adm ca' is used for generation. However, the certificates in this folder are not used directly. They
    are merely temporary files used to generate 'secrets/service-catalog-ssl', which is used by
    'apiserver' and 'controller-manager'. The provisioning code is in:
        openshift-ansible/roles/openshift_service_catalog/tasks/generate_certs.yml
    It can't be disabled completely, as the registered 'apiserver_ca' variable is used in install.yml, but
    the actual generation can be skipped and the old files re-used to generate the secret.
    * I have tried to modify the role to keep the old certificates. The healthz check was still broken
    afterwards. So, the certificate update is not the problem (or at least not the sole problem).
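 The old and regenerated certificates can be compared with openssl (a sketch; the file name under
 '/etc/origin/service-catalog/' and the backup location are assumptions):
        # subject and SANs of the regenerated certificate
        openssl x509 -in /etc/origin/service-catalog/apiserver.crt -noout -subject
        openssl x509 -in /etc/origin/service-catalog/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'
        # run the same commands against a pre-upgrade backup copy and diff the output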
 
 - The 'etcd' cluster seems OK (additional health checks are sketched after this list). On all nodes,
 etcd can be verified using
            etcdctl3 member list
    * The last command is actually a bash alias which executes
        ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 member list
    etcd actually serves two ports: 2379 (clients) and 2380 (peers). One idea was that maybe the
    second port had problems. I tried changing 2379 to 2380 in the command above and it failed.
    However, the same command fails even when the cluster is in a healthy state, so this is not indicative.
    * Another idea was that the certificates are re-generated for the wrong IPs/names and, hence, certificate
    validation fails, or that the originally generated CA is still the one registered with etcd. This is
    certainly not the (only) issue, as the problem persists even if we keep the certificates intact. I also
    verified that the newly generated certificates are essentially identical to the old ones and contain the
    correct hostnames.
    * The last idea was that 'asb-etcd' is broken. It complains
        2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
    However, the same error is present in the log directly after the install, while the cluster is completely
    healthy.
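 Besides 'member list', the etcd v3 API provides per-endpoint checks; with the same flags as the alias
 above, a sketch would be (assuming the installed etcdctl supports these subcommands):
        ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 endpoint health
        ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 endpoint status -w table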
    
 - Networking also does not seem to be an issue. The configurations during install and upgrade are exactly the same.
 All names are defined in /etc/hosts. Furthermore, the names in /etc/hosts are resolved (and reverse-resolved)
 by the provided dnsmasq server. I.e. ipeshift1 resolves to 192.168.13.1 using nslookup and 192.168.13.1 resolves
 back to ipeshift1. So, the configuration is indistinguishable from a proper setup with properly configured DNS.
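 The forward/reverse resolution described above can be re-checked on any node with the host/IP from the
 example (a sketch):
        # forward lookup: should return 192.168.13.1
        nslookup ipeshift1
        # reverse lookup: should return ipeshift1
        nslookup 192.168.13.1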