Diffstat (limited to 'docs/troubleshooting.txt')
-rw-r--r--  docs/troubleshooting.txt | 49
1 file changed, 49 insertions(+), 0 deletions(-)
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index b4ac8e7..ef3c206 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -60,6 +60,8 @@ Debugging
oc logs <pod name> --tail=100 [-p] - dc/name or ds/name as well
- Verify initialization steps (check if all volumes are mounted)
oc describe <pod name>
+ - Security (SCC) problems are visible if the replication controller is queried
+ oc -n adei get rc/mysql-1 -o yaml
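+ (SCC failures typically show up in the rc events as 'unable to validate against any security context
+ constraint'; 'oc -n adei describe rc/mysql-1' presents the same events in a more readable form)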
- It is worth looking at the pod environment
oc env po <pod name> --list
- It is worth connecting to the running container with an 'rsh' session and checking the running processes,
@@ -85,6 +87,7 @@ network
* that the nameserver is pointing to the host itself (but not to localhost; this is important
to allow running pods to use it)
* that correct upstream nameservers are listed in '/etc/dnsmasq.d/origin-upstream-dns.conf'
+ * that correct upstream nameservers are listed in '/etc/origin/node/resolv.conf'
* In some cases, it was necessary to restart dnsmasq (but it could also have been for different reasons)
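These checks can be done quickly on the node, e.g.
grep nameserver /etc/resolv.conf
cat /etc/dnsmasq.d/origin-upstream-dns.conf
cat /etc/origin/node/resolv.conf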
If the script misbehaves, it is possible to call it manually like this:
DEVICE_IFACE="eth1" ./99-origin-dns.sh eth1 up
@@ -96,6 +99,7 @@ etcd (and general operability)
may be needed to restart them manually. I have noticed it with
* lvm2-lvmetad.socket (pvscan will complain on problems)
* node-origin
+ * glusterd in container (just kill the misbehaving pod, it will be recreated)
* etcd, but BEWARE of too enthusiastic restarting:
- However, restarting etcd many times is BAD as it may trigger a severe problem with
'kube-service-catalog/apiserver'. The bug description is here
@@ -181,6 +185,13 @@ pods (failed pods, rogue namespaces, etc...)
docker ps -aq --no-trunc | xargs docker rm
+Builds
+======
+ - After changing the storage for the integrated Docker registry, it may refuse builds with HTTP error 500. It is necessary
+ to run:
+ oadm policy reconcile-cluster-roles
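+ By default this command only reports the proposed changes; depending on the OpenShift version they
+ may need to be applied explicitly with the '--confirm' flag, e.g.
+ oadm policy reconcile-cluster-roles --confirm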
+
+
Storage
=======
- Running a lot of pods may exhaust the available storage. It is worth checking if
@@ -208,3 +219,41 @@ Storage
gluster volume start <vol>
* This may break services depending on provisioned 'pv' like 'openshift-ansible-service-broker/asb-etcd'
+ - If something has gone wrong, heketi may end up creating a bunch of new volumes, corrupting its database, and crashing,
+ refusing to start. Here is the recovery procedure.
+ * Sometimes, it is still possible to start it by setting the 'HEKETI_IGNORE_STALE_OPERATIONS' environment
+ variable on the container.
+ oc -n glusterfs env dc heketi-storage -e HEKETI_IGNORE_STALE_OPERATIONS=true
+ * Even if it works, it does not solve the main issue with corruption. It is necessary to start a
+ debugging pod for heketi (oc debug), export the corrupted database, fix it, and save it back. Having
+ a database backup could save a lot of hassle in finding what is amiss.
+ heketi db export --dbfile heketi.db --jsonfile /tmp/q.json
+ oc cp glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q.json q.json
+ cat q.json | python -m json.tool > q2.json
+ ...... Fixing .....
+ oc cp q2.json glusterfs/heketi-storage-3-jqlwm-debug:/tmp/q2.json
+ heketi db import --dbfile heketi2.db --jsonfile /tmp/q2.json
+ cp heketi2.db /var/lib/heketi/heketi.db
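+ Before touching the database, a backup can be taken directly from the running pod, e.g. (the pod
+ name below is just an example, substitute the actual one)
+ oc cp glusterfs/heketi-storage-3-jqlwm:/var/lib/heketi/heketi.db heketi.db.bak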
+ * If a bunch of disks has been created, there are still various left-overs. First, the Gluster volumes
+ have to be cleaned. The idea is to compare the 'vol_' prefixed volumes in Heketi and Gluster and
+ remove the ones not present in Heketi. There is a script for this in 'ands/scripts'; a rough sketch follows.
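+ A minimal sketch of such a comparison (the parsing of 'heketi-cli' output may need adjustment; the
+ script in 'ands/scripts' should be preferred)
+ gluster volume list | grep '^vol_' | sort > /tmp/gluster_vols
+ heketi-cli volume list | grep -o 'vol_[0-9a-f]*' | sort > /tmp/heketi_vols
+ comm -23 /tmp/gluster_vols /tmp/heketi_vols     # volumes known to Gluster, but not to Heketi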
+ * There are LVM volumes left over from Gluster (or even allocated, but never associated with Gluster due to
+ various failures, so this clean-up is worth doing independently). On each node we can easily find
+ volumes created today
+ lvdisplay -o name,time -S 'time since "2018-03-16"'
+ or, again, we can compare the LVM volumes which are used by Gluster bricks with those which are not. The latter
+ ones should be cleaned up. Again, there is a script for this; a rough comparison is sketched below.
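+ A rough starting point for such a comparison (mapping brick mount paths to LV names depends on the
+ heketi naming scheme, so some manual inspection is still needed)
+ gluster volume info | awk '/^Brick[0-9]+:/ {print $2}'      # bricks currently used by Gluster
+ lvs -o lv_path --noheadings                                  # all LVM logical volumes on the node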
+
+Performance
+===========
+ - To find out whether OpenShift restricts the usage of system resources, we can 'rsh' into the container and check
+ the cgroup limits in sysfs
+ /sys/fs/cgroup/cpuset/cpuset.cpus
+ /sys/fs/cgroup/memory/memory.limit_in_bytes
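+ e.g., checked without opening an interactive session
+ oc rsh <pod name> cat /sys/fs/cgroup/memory/memory.limit_in_bytes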
+
+
+Various
+=======
+ - IPMI may cause problems as well. In particular, the mounted CDrom may start complaining. The easiest fix is
+ just to remove it from the running system with
+ echo 1 > /sys/block/sdd/device/delete
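+ The right block device can be double-checked beforehand, e.g. with
+ lsblk -o NAME,MODEL,SIZE
+ (the IPMI virtual CDrom usually shows up with a vendor-specific 'Virtual CDROM'-like model string).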