author Suren A. Chilingaryan <csa@suren.me> 2020-01-22 03:16:06 +0100
committer Suren A. Chilingaryan <csa@suren.me> 2020-01-22 03:16:06 +0100
commit 1e8153c2af051ce48d5aa08d3dbdc0d0970ea532
tree 7bb1441a87521aa8c3c5524f95fa645850a6826e /docs/problems.txt
parent e0b1b53f21095707af87a095934e971d788a90c7
Document another problem with lost IPs and exhausting of SDN IP range
Diffstat (limited to 'docs/problems.txt')
-rw-r--r-- docs/problems.txt 20
1 files changed, 14 insertions, 6 deletions
diff --git a/docs/problems.txt b/docs/problems.txt
index 099193a..3b652ec 100644
--- a/docs/problems.txt
+++ b/docs/problems.txt
@@ -13,13 +13,14 @@ Client Connection
box pops up.
-Rogue network interfaces on OpenVSwitch bridge
-==============================================
+Leaked resources after node termination: Rogue network interfaces on OpenVSwitch bridge, unreclaimed IPs in pod-network, ...
+=======================================
Sometimes OpenShift fails to clean up properly after a terminated pod. The actual reason is unclear, but
the problem gets more severe if an extreme number of images is present in the local Docker storage.
Several thousand images definitely intensify this problem.
- * The issue is discussed here:
+ * The issues are discussed here:
https://bugzilla.redhat.com/show_bug.cgi?id=1518684
+ https://bugzilla.redhat.com/show_bug.cgi?id=1518912
* The problem can be detected by looking at the output of:
ovs-vsctl show
@@ -30,6 +31,12 @@ Rogue network interfaces on OpenVSwitch bridge
* With time, new rogue interfaces are created faster and faster. At some point, this really
slows down the system and causes pod failures (if many pods are re-scheduled in parallel) even
if not many rogue interfaces are still present.
+ * Furthermore, there is a limited range of IPs allocated for the pod-network at each node. Whether
+ it is caused by the lost bridges or is an unrelated resource-management problem in OpenShift,
+ these IPs also start to leak. As the number of leaked IPs increases, it takes longer for OpenShift
+ to find an IP which is still free, and pod scheduling slows down further. At some point, the complete
+ range of IPs will get exhausted and pods will fail to start (after a long wait in the Scheduling state)
+ on the affected node.
* Even if scheduling does not fail, it takes several minutes to schedule a pod on the affected nodes.
Cause:
@@ -38,7 +45,6 @@ Rogue network interfaces on OpenVSwitch bridge
* Could be related to the 'container kill failed' problem explained in the section below.
Cannot kill container ###: rpc error: code = 2 desc = no such process
-
Solutions:
* According to RedHat, the temporary solution is to reboot the affected node (in my case, this just temporarily reduces the rate at
which new spurious interfaces appear, but does not prevent the problem completely). The problem
@@ -46,8 +52,10 @@ Rogue network interfaces on OpenVSwitch bridge
* The simplest work-around is to just remove the rogue interfaces. They will be re-created, but performance
problems only start after hundreds accumulate.
ovs-vsctl del-port br0 <iface>
- * It seems helpful to purge unused docker images to reduce the rate of interface apperance.
-
+ * Similarly, the unused IPs could be cleaned in "/var/lib/cni/networks/openshift-sdn"; just check with "docker ps" whether the docker
+ container referenced in each IP file is still running. Afterwards, the 'origin-node' service
+ should be restarted.
+ * It also seems helpful to purge unused docker images to reduce the rate of interface appearance.
Status:
* A cron job is installed which cleans rogue interfaces once their number hits 25.
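
The IP clean-up described in this change could be sketched in Python roughly as follows. This is only a sketch under assumptions that docs/problems.txt does not state: each file in the directory is assumed to be named after the reserved IP and to start with the ID of the owning container, and bookkeeping files with other names are skipped. The function names are hypothetical. The list of running containers is taken from "docker ps -q --no-trunc". The code only reports candidates and deletes nothing.

```python
import ipaddress
import os
import subprocess


def running_container_ids():
    """Return the full IDs of running containers as reported by docker ps."""
    result = subprocess.run(
        ["docker", "ps", "-q", "--no-trunc"],
        check=True, capture_output=True, text=True,
    )
    return set(result.stdout.split())


def find_stale_ip_files(cni_dir, running_ids):
    """List IP reservation files whose container is not in running_ids.

    Assumption: the file name is the reserved IP and the first
    whitespace-separated token in the file is the container ID.
    """
    stale = []
    for name in sorted(os.listdir(cni_dir)):
        try:
            ipaddress.ip_address(name)
        except ValueError:
            continue  # not an IP reservation file, leave it alone
        path = os.path.join(cni_dir, name)
        with open(path) as handle:
            tokens = handle.read().split()
        if not tokens:
            continue  # empty file: owner unknown, do not report it
        if tokens[0] not in running_ids:
            stale.append(path)
    return stale
```

A possible use is find_stale_ip_files("/var/lib/cni/networks/openshift-sdn", running_container_ids()). The returned paths are candidates for manual review and removal, after which the node service has to be restarted as noted in the solutions list.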