authorSuren A. Chilingaryan <csa@suren.me>2018-03-11 19:56:38 +0100
committerSuren A. Chilingaryan <csa@suren.me>2018-03-11 19:56:38 +0100
commitf3c41dd13a0a86382b80d564e9de0d6b06fb1dbf (patch)
tree3522ce77203da92bb2b6f7cfa2b0999bf6cc132c /docs
parent6bc3a3ac71e11fb6459df715536fec373c123a97 (diff)
Various fixes before moving to hardware installation
Diffstat (limited to 'docs')
-rw-r--r--  docs/ands_ansible.txt                              |   2
-rw-r--r--  docs/backup.txt                                    |  26
-rw-r--r--  docs/consistency.txt                               |  36
-rw-r--r--  docs/managment.txt                                 | 166
-rw-r--r--  docs/network.txt                                   |  58
-rw-r--r--  docs/pods.txt                                      |  13
-rw-r--r--  docs/regions.txt                                   |  16
-rw-r--r--  docs/samples/templates/00-katrin-restricted.yml.j2 |  44
-rw-r--r--  docs/samples/vars/run_oc.yml                       |   6
-rw-r--r--  docs/samples/vars/variants.yml                     |  33
-rw-r--r--  docs/troubleshooting.txt                           | 210
-rw-r--r--  docs/upgrade.txt                                   |  64
12 files changed, 673 insertions, 1 deletions
diff --git a/docs/ands_ansible.txt b/docs/ands_ansible.txt
index 80a7cf0..70800e1 100644
--- a/docs/ands_ansible.txt
+++ b/docs/ands_ansible.txt
@@ -89,7 +89,7 @@ Ansible parameters (global)
glusterfs_version group_vars
glusterfs_transport group_vars
- - OPenShift specific
+ - OpenShift specific
ands_openshift_labels setup/configs Labels to assign to the nodes
ands_openshift_projects setup/configs List of projects to configure (with GlusterFS endpoints, etc.)
ands_openshift_users setup/configs Optional list of user names with contacts
diff --git a/docs/backup.txt b/docs/backup.txt
new file mode 100644
index 0000000..1b25592
--- /dev/null
+++ b/docs/backup.txt
@@ -0,0 +1,26 @@
+Critical directories and services
+---------------------------------
+ - etcd database [ once ]
+ * There are etcd2 and etcd3 APIs. OpenShift 3.5+ uses etcd3, but the documentation
+ still describes the etcd2-style backup. etcd3 is backward compatible with etcd2,
+ so we can run the etcd2 backup as well. The open question is whether we need to back up
+ both ways (OpenShift 3.5 definitely has etcd3 data) or only etcd3, treating the etcd2
+ instructions as a documentation bug. A combined sketch is given at the end of this file.
+ * etcd3
+ etcdctl3 --endpoints="192.168.213.1:2379" snapshot save snapshot.db
+ * etcd2
+ etcdctl backup --data-dir /var/lib/etcd/ --backup-dir .
+ cp "$ETCD_DATA_DIR"/member/snap/db member/snap/db
+
+ - heketi topology [ once ]
+ heketi-cli -s http://heketi-storage.glusterfs.svc.cluster.local:8080 --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info --json
+
+ - Gluster volume information [ storage nodes ]
+ * /var/lib/glusterd/glusterd.info
+ * /var/lib/glusterd/peers
+ * /var/lib/glusterd/glustershd - not mentioned in docs
+
+ - etc [ all nodes ]
+ * /etc/origin/ - Only *.key *.crt from /etc/origin/master in docs
+ * /etc/etcd - Not mentioned
+ * /etc/docker - Only certs.d
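+
+Combined backup sketch
+----------------------
+ A minimal sketch, not a tested script. It assumes it is run as root on the first master (which
+ is also a storage node), that the 'etcdctl3' alias and an 'oc' cluster-admin login are available,
+ and that /mnt/backup (hypothetical path) has enough space.
+ BACKUP="/mnt/backup/$(date +%Y%m%d)"
+ mkdir -p "$BACKUP"
+ # etcd3 snapshot (the etcd2-style backup can be added as discussed above)
+ etcdctl3 --endpoints="192.168.213.1:2379" snapshot save "$BACKUP/etcd-snapshot.db"
+ # heketi topology
+ heketi-cli -s http://heketi-storage.glusterfs.svc.cluster.local:8080 --user admin \
+ --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" \
+ topology info --json > "$BACKUP/heketi-topology.json"
+ # per-node configuration (the /var/lib/glusterd files only exist on storage nodes)
+ tar czf "$BACKUP/$(hostname)-config.tar.gz" /etc/origin /etc/etcd /etc/docker \
+ /var/lib/glusterd/glusterd.info /var/lib/glusterd/peers /var/lib/glusterd/glustershd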
diff --git a/docs/consistency.txt b/docs/consistency.txt
new file mode 100644
index 0000000..127d9a7
--- /dev/null
+++ b/docs/consistency.txt
@@ -0,0 +1,36 @@
+General overview
+=================
+ - etcd services (worth checking both ports)
+ etcdctl3 --endpoints="192.168.213.1:2379" member list - only lists the members, it does not check their health
+ oc get cs - only the etcd entries are meaningful (the other component checks fail on OpenShift)
+ - All nodes and pods are fine and running and all pvc are bound
+ oc get nodes
+ oc get pods --all-namespaces -o wide
+ oc get pvc --all-namespaces -o wide
+ - API health check
+ curl -k https://apiserver.kube-service-catalog.svc/healthz
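+
+ A minimal sketch combining these checks (assumes a cluster-admin 'oc' login and the etcdctl3
+ alias; the greps only print problematic entries, so empty output is good):
+ etcdctl3 --endpoints="192.168.213.1:2379" endpoint health
+ oc get nodes --no-headers | grep -v ' Ready'
+ oc get pods --all-namespaces --no-headers | grep -vE 'Running|Completed'
+ oc get pvc --all-namespaces --no-headers | grep -v Bound
+ curl -sk https://apiserver.kube-service-catalog.svc/healthz; echo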
+
+Storage
+=======
+ - Heketi status
+ heketi-cli -s http://heketi-storage.glusterfs.svc.cluster.local:8080 --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)" topology info
+ - Status of the Gluster volumes and their bricks (with heketi, individual bricks quite often go offline)
+ gluster volume info
+ ./gluster.sh info all_heketi
+ - Check available storage space on system partition and LVM volumes (docker, heketi, ands)
+ Run 'df -h' and 'lvdisplay' on each node
+
+Networking
+==========
+ - Check that both internal and external addresses are resolvable from all hosts.
+ * I.e. we should be able to resolve 'google.com'
+ * And we should be able to resolve 'heketi-storage.glusterfs.svc.cluster.local'
+
+ - Check that the keepalived service is up and the corresponding ips are actually assigned to one
+ of the nodes (the vagrant provisioner would remove the keepalived-tracked ips, but keepalived will
+ continue running without noticing it)
+
+ - Ensure that cluster_name is not still overridden to the first master (which we do temporarily
+ during the provisioning of the OpenShift plays)
+
+ \ No newline at end of file
diff --git a/docs/managment.txt b/docs/managment.txt
new file mode 100644
index 0000000..1eca8a8
--- /dev/null
+++ b/docs/managment.txt
@@ -0,0 +1,166 @@
+DOs and DONTs
+=============
+ Here we discuss things we should do and we should not do!
+
+ - Scaling the cluster up is normally painless. Both nodes and masters can be added
+ quickly and without much trouble afterwards.
+
+ - The upgrade procedure may cause problems. The main trouble is that many pods are
+ configured to use the 'latest' tag, and the latest versions bring the latest problems (some
+ of the tags can be pinned to an actual version, but finding out what is broken and why takes
+ a lot of effort)...
+ * Currently, there are problems if 'kube-service-catalog' is updated (see the discussion
+ in docs/upgrade.txt). While it seems nothing really changes, the connection between
+ apiserver and etcd breaks down (at least for health checks). The installation remains
+ pretty much usable, but not in a healthy state. This particular update is blocked by setting
+ openshift_enable_service_catalog: false
+ Then, the pod is left in 'Error' state, but can easily be recovered by deleting it and
+ allowing the system to re-create a new pod.
+ * However, as the cause is unclear, it is possible that something else will break as time
+ passes and new images are released. It is ADVISED to check the upgrade in staging first.
+ * During the upgrade other system pods may also get stuck in 'Error' state (as explained
+ in troubleshooting) and block the flow of the upgrade. Just delete them and allow the
+ system to re-create them to continue.
+ * After the upgrade, it is necessary to verify that all pods are operational and to
+ restart the ones in 'Error' state.
+
+ - Re-running the install play will break on heketi. And it will DESTROY the heketi topology!
+ DON'T DO IT! Instead, the individual components can be re-installed separately.
+ * For instance, to reinstall 'openshift-ansible-service-broker' use
+ openshift-install-service-catalog.yml
+ * There is a way to prevent the plays from touching heketi, we need to define
+ openshift_storage_glusterfs_is_missing: False
+ openshift_storage_glusterfs_heketi_is_missing: False
+ But I am not sure whether that is the only major issue.
+
+ - A few administrative tools can cause trouble. Don't run
+ * oc adm diagnostics
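+
+ For reference, a minimal sketch of where the variables mentioned above go (the exact group_vars
+ file used by setup.sh is installation specific, so treat the location as an assumption):
+ # ansible inventory / group_vars for the OpenShift plays
+ openshift_enable_service_catalog: false # block the problematic kube-service-catalog update
+ openshift_storage_glusterfs_is_missing: False # keep re-runs away from the existing heketi topology
+ openshift_storage_glusterfs_heketi_is_missing: False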
+
+
+Failures / Immediate
+========
+ - We need to remove the failed node from the etcd cluster
+ etcdctl3 --endpoints="192.168.213.1:2379" member list
+ etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>
+
+ - Further, if the node is gone for good, the following is required on all remaining nodes
+ * Delete the node
+ oc delete node <node_name>
+ * Remove it also from ETCD_INITIAL_CLUSTER in /etc/etcd/etcd.conf on all etcd nodes
+ * Remove the failed node from the 'etcdClientInfo' section in /etc/origin/master/master-config.yaml and restart the master API
+ systemctl restart origin-master-api.service
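+
+ A minimal sketch of the whole removal, assuming the failed node is 'ipeshift2' with IP
+ 192.168.213.2 (hypothetical values, adjust to the real host):
+ etcdctl3 --endpoints="192.168.213.1:2379" member list
+ etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid_of_ipeshift2>
+ oc delete node ipeshift2
+ # on every remaining etcd node: drop the member from the initial cluster list
+ sed -i 's|,\?ipeshift2=https://192.168.213.2:2380||' /etc/etcd/etcd.conf
+ # on every master: remove https://192.168.213.2:2379 from the 'urls' list in the
+ # etcdClientInfo section of /etc/origin/master/master-config.yaml, then
+ systemctl restart origin-master-api.service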
+
+Scaling / Recovery
+=======
+ - One important point.
+ * If we lost the data on a storage node, it should be re-added with a different name (otherwise
+ the GlusterFS recovery would be significantly more complicated)
+ * If the Gluster bricks are preserved, we may keep the name. I have not tried it, but according to
+ the documentation it should be possible to reconnect the node and synchronize. Still, it may be
+ easier to use a new name anyway to simplify the procedure.
+ * Simple OpenShift nodes may be re-added with the same name, no problem.
+
+ - Next we need to perform all preparation steps (the --limit should not be applied as we normally
+ need to update CentOS on all nodes to synchronize software versions, list all nodes in the /etc/hosts
+ files, etc).
+ ./setup.sh -i staging prepare
+
+ - OpenShift scaling is provided as several ansible plays (scale-masters, scale-nodes, scale-etcd).
+ * Running 'masters' will also install configured 'nodes' and 'etcd' daemons
+ * I guess running 'nodes' will also handle 'etcd' daemons, but I have not checked.
+
+Problems
+--------
+ - There should be no problems if a simple node crashes, but things may go wrong if one of the
+ masters crashes. And things definitely will go wrong if the complete cluster is cut from the power.
+ * Some pods will be stuck pulling images. This happens if the node running docker-registry has crashed
+ and no persistent storage was used to back the registry. It can be fixed by re-running the build
+ and rolling out the latest version from the dc.
+ oc -n adei start-build adei
+ oc -n adei rollout latest mysql
+ OpenShift will trigger the rollout automatically after some time, but it will take a while. The builds,
+ it seems, have to be started manually.
+ * After a long outage some CronJobs will stop executing. The reason is a protection against
+ excessive load combined with missing defaults. The fix is easy, just set how much time the OpenShift
+ scheduler allows a CronJob to start before considering it failed (see also the patch-all sketch
+ after this section):
+ oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 10 }}'
+
+ - If we forgot to remove the old host from the etcd cluster, the OpenShift node will be configured, but etcd
+ will not be installed. We then need to remove the node as explained above and run the scale-up of the etcd
+ cluster.
+ * On multiple occasions, the etcd daemon failed after a reboot and had to be restarted manually.
+ If half of the daemons are broken, 'oc' will block.
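+
+ A minimal sketch for patching all CronJobs of a namespace at once (the 'adei' namespace and the
+ 120 second deadline are just examples):
+ for cj in $(oc -n adei get cronjobs -o name); do
+ oc -n adei patch "$cj" --patch '{ "spec": {"startingDeadlineSeconds": 120 }}'
+ done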
+
+
+
+Storage / Recovery
+=======
+ - Furthermore, it is necessary to add glusterfs pods on the new storage nodes. This is not performed
+ automatically by the scale plays. The 'glusterfs' play should be executed with the additional options
+ specifying that we are just re-configuring nodes. We can check whether all storage nodes are serviced with
+ oc -n glusterfs get pods -o wide
+ Both the OpenShift and etcd clusters should be in a proper state before running this play. Fixing and re-running
+ should not be an issue.
+
+ - More details:
+ https://docs.openshift.com/container-platform/3.7/day_two_guide/host_level_tasks.html
+
+
+Heketi
+------
+ - With heketi things are straightforward, we need to mark the node broken. Then heketi will automatically move the
+ bricks to other servers (as it sees fit).
+ * Accessing heketi
+ heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)"
+ * Getting the required ids
+ heketi-cli topology info
+ * Removing node
+ heketi-cli node info <failed_node_id>
+ heketi-cli node disable <failed_node_id>
+ heketi-cli node remove <failed_node_id>
+ * That's it. A few self-healing daemons are running which should bring the volumes in order automatically.
+ * The node will still persist in the heketi topology as failed, but will not be used ('node delete' could potentially remove it, but it is failing)
+
+ - One problem with heketi: it may start volumes before the bricks are ready. Consequently, it may run volumes with several bricks offline. This should be
+ checked and fixed by restarting the volumes (see the sketch below).
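+
+ A minimal sketch for spotting and restarting such volumes (run on a storage node or inside a
+ gluster pod; the grep pattern assumes the usual 'Online ... : Y/N' layout of the detail output,
+ and restarting a volume briefly disrupts the pods using it):
+ for vol in $(gluster volume list); do
+ if gluster volume status "$vol" detail | grep -q '^Online.*: N'; then
+ echo "volume $vol has offline bricks, restarting it"
+ gluster --mode=script volume stop "$vol"
+ gluster volume start "$vol"
+ fi
+ done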
+
+KaaS Volumes
+------------
+ There are two modes.
+ - If we migrated to a new server, we need to migrate the bricks (force is required because
+ the source brick is dead and its data can't be copied)
+ gluster volume replace-brick <volume> <src_brick> <dst_brick> commit force
+ * Healing daemons are running and nothing else has to be done.
+ * There are plays and scripts available to move all bricks automatically; a concrete example is given at the end of this subsection.
+
+ - If we kept the name and the data is still there, it should also be relatively easy
+ to perform the migration (not checked). We should also have backups of all this data.
+ * Ensure Gluster is not running on the failed node
+ oadm manage-node ipeshift2 --schedulable=false
+ oadm manage-node ipeshift2 --evacuate
+ * Verify the gluster pod is not active. It may be running, but not ready.
+ Could be double checked with 'ps'.
+ oadm manage-node ipeshift2 --list-pods
+ * Get the original Peer UUID of the failed node (by running on healthy node)
+ gluster peer status
+ * And create '/var/lib/glusterd/glusterd.info' similar to the one on the
+ healthy nodes, but with the found UUID.
+ * Copy the peer files from the healthy nodes to /var/lib/glusterd/peers. We need to
+ copy from 2 nodes as a node does not hold peer information about itself.
+ * Create mount points and re-schedule gluster pod. See more details
+ https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/sect-replacing_hosts
+ * Start healing
+ gluster volume heal VOLNAME full
+
+ - However, if the data is lost, it is quite complicated to recover using the same server name.
+ We should rename the server and use the first approach instead.
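+
+ A concrete sketch of the first approach, with purely hypothetical names (volume 'kaas', failed
+ node 'ipeshift2', replacement node 'ipeshift5'; the brick paths depend on how the volume was created):
+ gluster volume replace-brick kaas \
+ ipeshift2:/mnt/ands/kaas/brick ipeshift5:/mnt/ands/kaas/brick \
+ commit force
+ gluster volume heal kaas full
+ gluster volume heal kaas info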
+
+
+
+Scaling
+=======
+We currently have several assumptions which will probably not hold true for larger clusters
+ - Gluster
+ To simplify matters we just reference the servers in the storage group manually
+ The arbiter may work for several groups and we should define several brick paths in this case
diff --git a/docs/network.txt b/docs/network.txt
new file mode 100644
index 0000000..a164d36
--- /dev/null
+++ b/docs/network.txt
@@ -0,0 +1,58 @@
+Configuration
+=============
+openshift_ip Infiniband IPs for fast communication (it is also used for the ADEI/MySQL bridge
+ and so should reside on the fast network)
+openshift_hostname The 'cluster' host name. Should match the real host name for certificate validation.
+ So, it should be set if the default ip does not resolve to the host name
+openshift_public_ip We may either skip this or set it to our 192.168.26.xxx network. Usage is unclear
+openshift_public_hostname I guess it is also for certificates, but used while communicating with external systems
+openshift_master_cluster_hostname Internal cluster load-balancer or just pointer to master host
+openshift_public_master_cluster_hostname The main cluster gateway
+
+
+Complex Network
+===============
+Some things in the OpenShift ansible scripts are still implemented with the assumption that we have
+a simple network configuration with a single interface communicating with the world. There
+are several options to change this:
+ openshift_set_node_ip - This variable configures nodeIP in the node configuration. This
+ variable is needed in cases where it is desired for node traffic to go over an interface
+ other than the default network interface.
+ openshift_ip - This variable overrides the cluster internal IP address for the system.
+ Use this when using an interface that is not configured with the default route.
+ openshift_hostname - This variable overrides the internal cluster host name for the system.
+ Use this when the system’s default IP address does not resolve to the system host name.
+Furthermore, if we use infiniband, which is not accessible to the outside world, we need to set
+ openshift_public_ip - Use this for cloud installations, or for hosts on networks using
+ a network address translation
+ openshift_public_hostname - Use this for cloud installations, or for hosts on networks
+ using a network address translation (NAT).
+
+ This, however, is not honored by all system components. Some provisioning code and
+installed scripts still detect a kind of 'main system ip' to look for the
+services. This ip is identified either as 'ansible_default_ip' or by code trying
+to find the ip which is used to send packets over the default route. Ansible in the end does
+the same thing. This plays out badly for several reasons.
+ - We have keepalived ips moving between systems. The scripts actually catch
+ these moving ips instead of the fixed ip bound to the system.
+ - There could be several default routes. While this is not a problem per se, the scripts do not
+ expect it and may fail.
+
+For instance, consider the script '99-origin-dns.sh' in /etc/NetworkManager/dispatcher.d.
+ * def_route=$(/sbin/ip route list match 0.0.0.0/0 | awk '{print $3 }')
+ 1) does not expect multiple default routes and will pick just a random one. Then, the
+ * if [[ ${DEVICE_IFACE} == ${def_route_int} ]]; then
+ check may fail and resolv.conf will not be updated because the interface that just came up
+ appears not to be on the default route, although it actually is. Furthermore,
+ * def_route_ip=$(/sbin/ip route get to ${def_route} | awk '{print $5}')
+ 2) is ignorant of keepalived and may bind to a keepalived ip.
+
+ But I am not sure the problems are limited to this script. There could be other places with
+ the same logic. Some details are here:
+ https://docs.openshift.com/container-platform/3.7/admin_guide/manage_nodes.html#manage-node-change-node-traffic-interface
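+
+ A possible workaround for the patched script, shown only as a sketch (it assumes the keepalived
+ VIPs appear as 'secondary' addresses on the interface, which holds when they share the subnet of
+ the fixed address):
+ # take the first permanent global IPv4 on the interface that triggered the dispatcher,
+ # instead of whatever address happens to sit on the default route
+ def_route_ip=$(/sbin/ip -4 -o addr show dev "${DEVICE_IFACE}" scope global \
+ | awk '!/secondary/ { sub(/\/.*/, "", $4); print $4; exit }')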
+
+Hostnames
+=========
+ The linux host name (uname -a) should match the hostnames assigned to the openshift nodes. Otherwise, certificate
+ verification will fail. It seems a minor issue as the system continues functioning, but it is better avoided. The check can be
+ performed against etcd, whose certificate verification complains if the names do not match, e.g.
+ etcdctl3 --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key --cacert=/etc/etcd/ca.crt --endpoints="192.168.213.1:2379,192.168.213.3:2379,192.168.213.4:2379" member list
diff --git a/docs/pods.txt b/docs/pods.txt
new file mode 100644
index 0000000..b84f42f
--- /dev/null
+++ b/docs/pods.txt
@@ -0,0 +1,13 @@
+Updating Daemon Set
+===================
+ - Not trivial. We need to
+ a) Re-create the ds
+ * Manually change 'imagePullPolicy' to 'Always' if it is set to 'IfNotPresent'
+ b) Destroy all its pods and allow the ds to recreate them
+
+ - Sample: Updating gluster
+ oc -n glusterfs delete ds/glusterfs-storage
+ oc -n glusterfs process glusterfs IMAGE_NAME=chsa/gluster-centos IMAGE_VERSION=312 > gluster.json
+ *** Edit
+ oc -n glusterfs create -f gluster.json
+ oc -n glusterfs delete pods -l 'glusterfs=storage-pod'
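+
+ If only the 'imagePullPolicy' needs to change, a sketch of patching the existing ds instead of
+ re-creating it (the json-patch path assumes the gluster container is the first one in the pod template):
+ oc -n glusterfs patch ds/glusterfs-storage --type=json \
+ -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/imagePullPolicy", "value": "Always"}]'
+ oc -n glusterfs delete pods -l 'glusterfs=storage-pod'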
diff --git a/docs/regions.txt b/docs/regions.txt
new file mode 100644
index 0000000..88b8f5e
--- /dev/null
+++ b/docs/regions.txt
@@ -0,0 +1,16 @@
+region=infra Infrastructure nodes which are used by OpenShift to run the router and registry services. These are
+ more or less the ipekatrin* nodes down in the basement.
+region=prod Production servers (ipecompute*, etc.) located anywhere, but I expect only the basement.
+region=dev Temporary nodes
+
+zone=default Basement
+zone=404 Second server room on 4th floor
+zone=student Student room
+zone=external Other external places
+
+
+
+production: 1 Specifies all production servers (no extra load, no occasional reboots)
+ This includes 'infra' and 'prod' regions.
+server: 1 Like production, but with occasional reboots and some extra testing load possible
+permanent: 1 Non-production systems, but which are permanently connected to OpenShift
diff --git a/docs/samples/templates/00-katrin-restricted.yml.j2 b/docs/samples/templates/00-katrin-restricted.yml.j2
new file mode 100644
index 0000000..6221f30
--- /dev/null
+++ b/docs/samples/templates/00-katrin-restricted.yml.j2
@@ -0,0 +1,44 @@
+# Overriding SCC rules to allow arbitrary gluster mounts in restricted containers
+---
+allowHostDirVolumePlugin: false
+allowHostIPC: false
+allowHostNetwork: false
+allowHostPID: false
+allowHostPorts: false
+allowPrivilegedContainer: false
+allowedCapabilities: null
+apiVersion: v1
+defaultAddCapabilities: null
+fsGroup:
+ type: MustRunAs
+groups:
+- system:authenticated
+kind: SecurityContextConstraints
+metadata:
+ annotations:
+ kubernetes.io/description: restricted denies access to all host features and requires
+ pods to be run with a UID, and SELinux context that are allocated to the namespace. This
+ is the most restrictive SCC.
+ creationTimestamp: null
+ name: katrin-restricted
+priority: null
+readOnlyRootFilesystem: false
+requiredDropCapabilities:
+- KILL
+- MKNOD
+- SYS_CHROOT
+- SETUID
+- SETGID
+runAsUser:
+ type: MustRunAsRange
+seLinuxContext:
+ type: MustRunAs
+supplementalGroups:
+ type: RunAsAny
+volumes:
+- glusterfs
+- configMap
+- downwardAPI
+- emptyDir
+- persistentVolumeClaim
+- secret
diff --git a/docs/samples/vars/run_oc.yml b/docs/samples/vars/run_oc.yml
new file mode 100644
index 0000000..a464549
--- /dev/null
+++ b/docs/samples/vars/run_oc.yml
@@ -0,0 +1,6 @@
+oc:
+ - template: "[0-3]*"
+ - template: "[4-6]*"
+ - resource: "route/apache"
+ oc: "expose svc/kaas --name apache --hostname=apache.{{ openshift_master_default_subdomain }}"
+ - template: "*"
diff --git a/docs/samples/vars/variants.yml b/docs/samples/vars/variants.yml
new file mode 100644
index 0000000..c7a27b4
--- /dev/null
+++ b/docs/samples/vars/variants.yml
@@ -0,0 +1,33 @@
+# First port is exposed
+
+pods:
+ kaas:
+ variant: "{{ ands_prefer_docker | default(false) | ternary('docker', 'centos') }}"
+ centos:
+ service: { host: "{{ katrin_node }}", ports: [ 80/8080, 443/8043 ] }
+ sched: { replicas: 1, selector: { master: 1 } }
+ selector: { master: 1 }
+ images:
+ - image: "centos/httpd-24-centos7"
+ mappings:
+ - { name: "etc", path: "apache2-kaas-centos", mount: "/etc/httpd" }
+ - { name: "www", path: "kaas", mount: "/opt/rh/httpd24/root/var/www/html" }
+ - { name: "log", path: "apache2-kaas", mount: "/var/log/httpd24" }
+ probes:
+ - { port: 8080, path: '/index.html' }
+ docker:
+ service: { host: "{{ katrin_node }}", ports: [ 80/8080, 443/8043 ] }
+ sched: { replicas: 1, selector: { master: 1 } }
+ selector: { master: 1 }
+ images:
+ - image: "httpd:2.2"
+ mappings:
+ - { name: "etc", path: "apache2-kaas-docker", mount: "/usr/local/apache2/conf" }
+ - { name: "www", path: "kaas", mount: "/usr/local/apache2/htdocs" }
+ - { name: "log", path: "apache2-kaas", mount: "/usr/local/apache2/logs" }
+ probes:
+ - { port: 8080, path: '/index.html' }
+
+
+
+ \ No newline at end of file
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
new file mode 100644
index 0000000..b4ac8e7
--- /dev/null
+++ b/docs/troubleshooting.txt
@@ -0,0 +1,210 @@
+The following services have to be running
+-----------------------------------------
+ Etcd:
+ - etcd
+
+ Node:
+ - origin-node
+
+ Master nodes:
+ - origin-master-api
+ - origin-master-controllers
+ - origin-master itself is not supposed to be running (it is replaced by the api and controllers services above)
+
+ Required Services:
+ - lvm2-lvmetad.socket
+ - lvm2-lvmetad.service
+ - docker
+ - NetworkManager
+ - firewalld
+ - dnsmasq
+ - openvswitch
+
+ Extra Services:
+ - ssh
+ - ntp
+ - openvpn
+ - ganesha (on master nodes, optional)
+
+The following pods have to be running
+-------------------------------------
+ Kubernetes System
+ - kube-service-catalog/apiserver
+ - kube-service-catalog/controller-manager
+
+ OpenShift Main Services
+ - default/docker-registry
+ - default/registry-console
+ - default/router (3 replicas)
+ - openshift-template-service-broker/api-server (daemonset, on all nodes)
+
+ OpenShift Secondary Services
+ - openshift-ansible-service-broker/asb
+ - openshift-ansible-service-broker/asb-etcd
+
+ GlusterFS
+ - glusterfs-storage (daemonset, on all storage nodes)
+ - glusterblock-storage-provisioner-dc
+ - heketi-storage
+
+ Metrics (openshift-infra):
+ - hawkular-cassandra
+ - hawkular-metrics
+ - heapster
+
+
+Debugging
+=========
+ - Ensure system consistency as explained in 'consistency.txt' (incomplete)
+ - Check current pod logs and possibly logs for last failed instance
+ oc logs <pod name> --tail=100 [-p] - dc/name or ds/name as well
+ - Verify initialization steps (check if all volumes are mounted)
+ oc describe <pod name>
+ - It is worth looking at the pod environment
+ oc env po <pod name> --list
+ - It is worth connecting to the running container with an 'rsh' session to see the running processes,
+ internal logs, etc. A 'debug' session will start a new instance of the pod instead.
+ - Check whether the corresponding pv/pvc are bound. Check the logs for the pv.
+ * Even if the 'pvc' is bound, the 'pv' may have problems with its backend.
+ * Check the logs here: /var/lib/origin/plugins/kubernetes.io/glusterfs/
+ - Another frequent problem is a failing 'postStart' hook or 'livenessProbe'. As the pod
+ immediately crashes, it is not possible to connect. Remedies are:
+ * Set a larger initial delay before the probe is checked.
+ * Try to remove the hook and execute it manually in an 'rsh'/'debug' session.
+ - Determine the node running the pod and check the host logs in '/var/log/messages'
+ * Particularly logs of 'origin-master-controllers' are of interest
+ - Check which docker images are actually downloaded on the node
+ docker images
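+
+ A typical sequence for a single failing pod, as a sketch (the namespace and pod name are
+ hypothetical):
+ ns=adei; pod=adei-autogen-update-1520000000-abcde
+ oc -n "$ns" describe pod "$pod" # events, mounts, probe configuration
+ oc -n "$ns" logs "$pod" --tail=100 # current instance
+ oc -n "$ns" logs "$pod" --tail=100 -p # previous (crashed) instance, if any
+ oc -n "$ns" rsh "$pod" # shell inside the running container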
+
+network
+=======
+ - There is a NetworkManager script which should adjust /etc/resolv.conf to use the local dnsmasq server.
+ This is based on '/etc/NetworkManager/dispatcher.d/99-origin-dns.sh' which does not play well
+ if OpenShift is running on a non-default network interface. I provided a patched version, but it is
+ worth verifying
+ * that nameserver is pointing to the host itself (but not localhost, this is important
+ to allow running pods to use it)
+ * that correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
+ * In some cases it was necessary to restart dnsmasq (but that could also have been for different reasons)
+ If the script misbehaves, it is possible to call it manually like this:
+ DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
+
+
+etcd (and general operability)
+====
+ - A few of these services may seem running according to 'systemctl', but actually misbehave. Then, it
+ may be necessary to restart them manually. I have noticed this with
+ * lvm2-lvmetad.socket (pvscan will complain about problems)
+ * origin-node
+ * etcd but BEWARE of too enthusiastic restarting:
+ - However, restarting etcd many times is BAD as it may trigger a severe problem with
+ 'kube-service-catalog/apiserver'. The bug description is here
+ https://github.com/kubernetes/kubernetes/issues/47131
+ - Due to the problem mentioned above, all 'oc' queries become very slow. There is no proper
+ solution suggested. But killing the 'kube-service-catalog/apiserver' pod helps for a while.
+ The pod is restarted and response times are back in order.
+ * Another way to see this problem is to query the 'healthz' service, which would tell that
+ there are too many clients and, please, retry later.
+ curl -k https://apiserver.kube-service-catalog.svc/healthz
+
+ - On node crash, the etcd database may get corrupted.
+ * There is no easy fix. Backup/restore is not working.
+ * The easiest option is to remove the failed etcd member from the cluster.
+ etcdctl3 --endpoints="192.168.213.1:2379" member list
+ etcdctl3 --endpoints="192.168.213.1:2379" member remove <hexid>
+ * Add it to the [new_etcd] section in the inventory and run the openshift-etcd play to scale up the etcd cluster.
+
+ - There is a health check provided by the cluster
+ curl -k https://apiserver.kube-service-catalog.svc/healthz
+ it may complain about etcd problems. This seems to be triggered by the OpenShift upgrade. The real cause and
+ remedy are unclear, but the installation keeps mostly working. The discussion is in docs/upgrade.txt
+
+ - There is also a different etcd which is an integral part of the ansible service broker:
+ 'openshift-ansible-service-broker/asb-etcd'. If investigated with 'oc logs' it complains
+ about:
+ 2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
+ WARNING: 2018/03/07 20:54:48 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
+ Nevertheless, it seems to work without much trouble. The error message seems to be caused by the
+ certificate verification code introduced in etcd 3.2. There are multiple bug reports on
+ the issue.
+
+pods (failed pods, rogue namespaces, etc...)
+====
+ - After crashes / upgrades some pods may end up in 'Error' state. This quite often happens to
+ * kube-service-catalog/controller-manager
+ * openshift-template-service-broker/api-server
+ Normally, they should be deleted. Then, OpenShift will auto-restart pods and they likely will run without problems.
+ for name in $(oc get pods -n openshift-template-service-broker | grep Error | awk '{ print $1 }' ); do oc -n openshift-template-service-broker delete po $name; done
+ for name in $(oc get pods -n kube-service-catalog | grep Error | awk '{ print $1 }' ); do oc -n kube-service-catalog delete po $name; done
+
+ - Other pods will fail with 'ImagePullBackOff' after a cluster crash. The problem is that ImageStreams populated by 'builds' will
+ not be recreated automatically. By default the OpenShift docker registry is stored on ephemeral disks and is lost on crash. The builds should be
+ re-executed manually.
+ oc -n adei start-build adei
+
+ - Furthermore, after long outages the CronJobs will stop functioning. The reason can be found by analyzing '/var/log/messages' or, more specifically,
+ systemctl status origin-master-controllers
+ it will contain something like:
+ 'Cannot determine if <namespace>/<cronjob> needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.'
+ * The reason is that after 100 missed (or failed) launch periods the controller stops trying, to avoid excessive load. The remedy is to set 'startingDeadlineSeconds',
+ which tells the system that if a cronJob has failed to start within the allocated interval we stop trying until the next start period. The misses are then only
+ counted within the specified deadline, i.e. the deadline should be set so that fewer than 100 launch periods fit into it.
+ https://github.com/kubernetes/kubernetes/issues/45825
+ * The running CronJobs can be easily patched with
+ oc -n adei patch cronjob/adei-autogen-update --patch '{ "spec": {"startingDeadlineSeconds": 120 }}'
+
+ - Sometimes there are rogue namespaces stuck in 'deleting' state. There are hundreds of possible reasons, but mainly
+ * Crash of both masters during population / destruction of OpenShift resources
+ * Running 'oc adm diagnostics'
+ It is unclear how to remove them manually, but it seems that if we run an
+ * OpenShift upgrade, the namespaces are gone (but there could be a bunch of new problems).
+ * ... I don't know whether install, etc. may cause the same trouble...
+
+ - There are also rogue pods (mainly due to problems with unmounting lost storage), etc. If 'oc delete' does not
+ work for a long time, it is worth
+ * Determining the host running the failed pod with 'oc get pods -o wide'
+ * Going to that host, killing the processes and stopping the container using the docker command
+ * Looking in '/var/lib/origin/openshift.local.volumes/pods' for the remnants of the container
+ - This can be done with 'find . -name heketi*' or something like that...
+ - There could be problematic mounts which can be freed with a lazy umount
+ - The folders for removed pods may (and should) be removed.
+
+ - Looking into '/var/log/messages', it is sometimes possible to spot various errors like
+ * Orphaned pod "212074ca-1d15-11e8-9de3-525400225b53" found, but volume paths are still present on disk.
+ The volumes can be removed in '/var/lib/origin/openshift.local.volumes/pods' on the corresponding node
+ * PodSandbox "aa28e9c7605cae088838bb4c9b92172083680880cd4c085d93cbc33b5b9e8910" from runtime service failed: ...
+ - We can find and remove the corresponding container (the short id is just the first letters of the long id)
+ docker ps -a | grep aa28e9c76
+ docker rm <id>
+ - We can further just destroy all containers which are not running (it will actually try to remove all of them,
+ but only an error message will be printed for the running ones)
+ docker ps -aq --no-trunc | xargs docker rm
+
+
+Storage
+=======
+ - Running a lot of pods may exhaust the available storage. It is worth checking whether
+ * There is enough Docker storage for containers (lvm)
+ * There is enough Heketi storage for dynamic volumes (lvm)
+ * The root file system on nodes still has space for logs, etc.
+ In particular there is a big problem for ansible-run virtual machines. The system disk is stored
+ under '/root/VirtualBox VMs' and, unlike the second hard drive, is not cleaned/destroyed on 'vagrant
+ destroy'. So, it should be cleaned manually.
+
+ - Problems with pvc's can be evaluated by running
+ oc -n openshift-ansible-service-broker describe pvc etcd
+ Furthermore, it is worth looking in the folder with volume logs. For each 'pv' it stores subdirectories
+ for the pods executed on this host which mount this pv, holding the logs for these pods.
+ /var/lib/origin/plugins/kubernetes.io/glusterfs/
+
+ - Heketi is problematic.
+ * It is worth checking that the topology is fine and all nodes and bricks are online.
+ heketi-cli -s http://heketi-storage-glusterfs.openshift.suren.me --user admin --secret "$(oc get secret heketi-storage-admin-secret -n glusterfs -o jsonpath='{.data.key}' | base64 -d)"
+ - Furthermore, the heketi gluster volumes may be started, but with multiple bricks offline. This can
+ be checked with
+ gluster volume status <vol> detail
+ * If not all bricks are online, it is likely enough to just restart the volume
+ gluster volume stop <vol>
+ gluster volume start <vol>
+ * This may break services depending on provisioned 'pv' like 'openshift-ansible-service-broker/asb-etcd'
+
diff --git a/docs/upgrade.txt b/docs/upgrade.txt
new file mode 100644
index 0000000..b4f22d6
--- /dev/null
+++ b/docs/upgrade.txt
@@ -0,0 +1,64 @@
+Upgrade
+-------
+ - The 'upgrade' may break things, causing long cluster outages, or may even require a complete re-install.
+ Currently, I found a problem with 'kube-service-catalog', but I am not sure problems are limited to it.
+ Furthermore, we are currently using the 'latest' tag of several docker images (heketi is an example of a critical
+ service on the 'latest' tag). An update may break things.
+
+kube-service-catalog
+--------------------
+ - Update of 'kube-service-catalog' breaks OpenShift health check
+ curl -k https://apiserver.kube-service-catalog.svc/healthz
+ It complains about 'etcd'. The specific etcd check
+ curl -k https://apiserver.kube-service-catalog.svc/healthz/etcd
+ reports that all servers are unreachable.
+
+ - In fact etcd is working and the cluster is mostly functional. Occasionally, it may suffer from the bug
+ described here:
+ https://github.com/kubernetes/kubernetes/issues/47131
+ The 'oc' queries become extremely slow and the healthz service reports that there are too many connections.
+ Killing the 'kube-service-catalog/apiserver' pod helps for a while, but the problem returns occasionally.
+
+ - The information below is an attempt to understand the reason. In fact, it is a list specifying what
+ is NOT the reason. The only solution found is to prevent the update of 'kube-service-catalog' by setting
+ openshift_enable_service_catalog: false
+
+ - The problem only occurs if the 'openshift_service_catalog' role is executed. It results in some
+ miscommunication of 'apiserver' and/or 'controller-manager' with etcd. Still, the cluster is
+ operational, so the connection is not completely lost, but it is not working as expected in some
+ circumstances.
+
+ - There are no significant changes. Exactly the same docker images are installed. The only change in
+ '/etc' is the updated certificates used by 'apiserver' and 'controller-manager'.
+ * The certificates are located in '/etc/origin/service-catalog/' on the first master server.
+ 'oc adm ca' is used for the generation. However, the certificates in this folder are not used directly. They
+ are merely temporary files used to generate 'secrets/service-catalog-ssl' which is used by
+ 'apiserver' and 'controller-manager'. The provisioning code is in:
+ openshift-ansible/roles/openshift_service_catalog/tasks/generate_certs.yml
+ it can't be disabled completely as the registered 'apiserver_ca' variable is used in install.yml, but the
+ actual generation can be skipped and the old files re-used to generate the secret.
+ * I have tried to modify the role to keep the old certificates. The healthz check was still broken afterwards.
+ So, this update is not the problem (or at least not the sole problem).
+
+ - The 'etcd' cluster seems OK. On all nodes, the etcd can be verified using
+ etcdctl3 member list
+ * The last command is actually a bash alias which executes
+ ETCDCTL_API=3 /usr/bin/etcdctl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --endpoints https://`hostname`:2379 member list
+ Actually, etcd is serving two ports: 2379 (clients) and 2380 (peers). One idea was that maybe the
+ second port has problems. I tried changing 2379 to 2380 in the command above and it failed.
+ However, it does not work either when the cluster is in a healthy state.
+ * Another idea was that the certificates are re-generated for wrong ips/names and, hence, certificate validation
+ fails. Or that the originally generated CA is registered with etcd. This is certainly not the (only) issue,
+ as the problem persists even if we keep the certificates intact. However, I also verified that the newly generated
+ certificates are completely identical to the old ones and contain the correct hostnames inside.
+ * The last idea was that it is actually 'asb-etcd' that is broken. It complains
+ 2018-03-07 20:54:48.791735 I | embed: rejected connection from "127.0.0.1:43066" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority", ServerName "")
+ However, the same error is present in the log directly after the install, while the cluster is completely
+ healthy.
+
+ - Networking also does not seem to be an issue. The configurations during install and upgrade are exactly the same.
+ All names are defined in /etc/hosts. Furthermore, the names in /etc/hosts are resolved (and back-resolved)
+ by the provided dnsmasq server. I.e. ipeshift1 resolves to 192.168.13.1 using nslookup and 192.168.13.1 resolves
+ back to ipeshift1. So, the configuration is indistinguishable from a proper one with properly configured DNS.
+
+ \ No newline at end of file