Diffstat (limited to 'docs/troubleshooting.txt')
-rw-r--r-- docs/troubleshooting.txt 33
1 files changed, 30 insertions, 3 deletions
diff --git a/docs/troubleshooting.txt b/docs/troubleshooting.txt
index 5eb0cc7..459143e 100644
--- a/docs/troubleshooting.txt
+++ b/docs/troubleshooting.txt
@@ -28,9 +28,9 @@ The services has to be running
Pods has to be running
----------------------
- Kubernetes System
+ Kubernetes System - apparently integration with public cloud resources
- kube-service-catalog/apiserver
- - kube-service-catalog/controller-manager
+ - kube-service-catalog/controller-manager - this seems optional
OpenShift Main Services
- default/docker-registry
@@ -39,7 +39,7 @@ Pods has to be running
- openshift-template-service-broker/api-server (daemonset, on all nodes)
OpenShift Secondary Services
- - openshift-ansible-service-broker/asb
+ - openshift-ansible-service-broker/asb - this is optional
- openshift-ansible-service-broker/asb-etcd
GlusterFS
@@ -132,6 +132,25 @@ etcd (and general operability)
certificate verification code which introduced in etcd 3.2. There are multiple bug repports on
the issue.
+services
+========
+ - kube-service-catalog/controller-manager might get stuck in CrashLoopBackOff. This does not seem to matter in the current setup.
+ * The problem is an expired certificate of kube-service-catalog/apiserver. This can be checked with
+ curl 'https://172.30.183.21:443/apis/servicecatalog.k8s.io/v1beta1'
+ * The certificates are located in '/etc/origin/service-catalog' and can be verified.
+ * There is possibly a way to renew it. However, while prototyping the cluster, the cluster got severely broken each time
+ an upgrade was executed. The new certificate in 'service-catalog' was one of the very few things which actually changed
+ in the upgrade. Therefore, it might be dangerous to replace it.
+ * On the other hand, no services seem to be missing in the current configuration.
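The certificate check mentioned above could be scripted roughly as follows. This is only a sketch: it assumes the 'openssl' command-line tool is installed on the master, and the file name 'apiserver.crt' is a guess (the notes only name the directory '/etc/origin/service-catalog'). The helper names are made up for illustration.

```python
import ssl
import subprocess
import time

# Assumption: the file name is hypothetical; only the directory is given in the notes.
CERT_PATH = "/etc/origin/service-catalog/apiserver.crt"


def parse_not_after(line):
    """Turn an 'openssl x509 -enddate' line, e.g.
    'notAfter=Jan  5 09:34:43 2018 GMT', into a Unix timestamp (UTC)."""
    return ssl.cert_time_to_seconds(line.strip().split("=", 1)[1])


def days_left(not_after, now=None):
    """Days until expiry; a negative value means the certificate has expired."""
    now = time.time() if now is None else now
    return (not_after - now) / 86400.0


def cert_days_left(path=CERT_PATH):
    """Ask openssl for the expiry date of a PEM certificate file."""
    out = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate", "-in", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return days_left(parse_not_after(out))
```

Calling cert_days_left() on a master node would then return a negative number if the certificate has indeed expired.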
+
+nodes: domino failures
+=====
+ - If the OpenShift cluster is overloaded, we might get domino failures when a single node goes offline (even a temporary disconnect, e.g. due to a restart of origin-node) and all its pods
+ are rescheduled to other nodes of the cluster.
+ * The increased load may then push some other nodes offline (for a short while) and cause all pods to be rescheduled from them as well.
+ * This might continue indefinitely as one node gets disconnected after another, pods get rescheduled, and the process never stops.
+ * The only solution is to temporarily remove some pods, e.g. the ADEI pods can be easily removed and then provisioned back.
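The cascade described above can be illustrated with a toy model. It is not a description of the real scheduler: it assumes identical nodes, an evenly spread load, and a hard per-node capacity, and all numbers and names are invented for illustration.

```python
def surviving_nodes(nodes, capacity, total_load):
    """Toy model of the domino effect.

    One node drops out and its pods are spread evenly over the remaining
    nodes. Whenever the per-node load exceeds 'capacity', one more node
    drops out and the load is spread again.
    """
    alive = nodes - 1  # the initial single-node failure
    while alive > 0 and total_load / alive > capacity:
        alive -= 1
    return alive


# Overloaded cluster: losing one node of five takes down all the others.
print(surviving_nodes(nodes=5, capacity=10, total_load=42))  # 0
# Same cluster after temporarily removing some pods: the cascade stops.
print(surviving_nodes(nodes=5, capacity=10, total_load=30))  # 4
```

In this model the per-node load only grows as nodes drop out, so once the threshold is crossed the cascade cannot stop on its own. Lowering the total load (removing pods) is the only way out, which matches the advice above.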
+
pods: very slow scheduling (normal start time in seconds range), failed pods, rogue namespaces, etc...
====
- OpenShift has numerous problems with clean-up resources after the pods. The problems are more likely to happen on the
@@ -287,6 +306,14 @@ Storage
=======
- The offline bricks can be brough back into the service with the follwoing command
gluster volume start openshift force
+ If this doesn't help, the volume should be stopped and started again
+ gluster volume stop openshift
+ gluster volume start openshift
+
+ This might cause problems for the services. The pods will likely continue to run, but will
+ not be able to access the mounted volumes. In particular, adei-frontends/adei-cachers are affected.
+ So, these services have to be restarted manually in some cases.
+
- Running a lot of pods may exhaust available storage. It worth checking if
* There is enough Docker storage for containers (lvm)