Actions Required
================
 * The long-term solution to the 'rogue' interfaces problem is unclear. It may require an update to OpenShift 3.9 or later.
 However, the proposed work-around should suffice unless the execution rate grows significantly.
 * All other problems found in the logs can be ignored.
 

Rogue network interfaces on OpenVSwitch bridge
==============================================
 Sometimes OpenShift fails to clean up properly after a terminated pod. The actual reason is unclear.
  * The issue is discussed here:
        https://bugzilla.redhat.com/show_bug.cgi?id=1518684
  * Affected interfaces can be spotted by inspecting the output of (see the sketch below):
    ovs-vsctl show
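
  A minimal sketch for spotting them, assuming (as in the bug report above) that rogue ports
  show up in the 'ovs-vsctl show' output with a 'No such device' error for the vanished veth
  device:

    # Count OVS ports whose backing network device no longer exists.
    # The exact error string is an assumption taken from the bug report.
    ovs-vsctl show | grep -c "No such device"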

 Problems:
  * As the number of rogue interfaces grows, it starts to impact performance. Operations with
  ovs slow down and at some point the pods scheduled to the affected node fail to start due to
  timeouts. This is indicated in 'oc describe' as: 'failed to create pod sandbox' (see the sketch below)
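
  A quick way to check whether pods are already hitting this is to search recent events for
  the sandbox error quoted above; a minimal sketch (the grep pattern is simply that message):

    # Look for pods failing with the sandbox error across all namespaces.
    oc get events --all-namespaces | grep -i "failed to create pod sandbox"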

 Cause:
  * Unclear, but it seems the periodic ADEI cron jobs cause the issue.
  * Could be related to the 'container kill failed' problem explained in the section below.
     Cannot kill container ###: rpc error: code = 2 desc = no such process

         
 Solutions:
  * According to RedHat the temporary solution is to reboot the affected node (not tested yet). The problem
  should go away, but may re-appear after a while.
  * The simplest work-around is to just remove the rogue interfaces. They will be re-created, but performance
  problems only start after hundreds accumulate.
    ovs-vsctl del-port br0 <iface>
  
 Status:
   * A cron job is installed which cleans up rogue interfaces once their number hits 25 (see the sketch below).
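
   A minimal sketch of such a cleanup job (the bridge name 'br0' and the threshold of 25 are
   taken from above; detecting rogue ports via the 'No such device' error in 'ovs-vsctl show'
   is an assumption based on the bug report):

     #!/bin/bash
     # Remove OVS ports whose backing veth device has disappeared, but only once
     # THRESHOLD of them have accumulated. Intended to run from cron on each node.
     # The port name is assumed to match the vanished veth device name.
     BRIDGE=br0
     THRESHOLD=25

     rogue=$(ovs-vsctl show | sed -n 's/.*could not open network device \(veth[^ ]*\).*/\1/p')
     count=$(echo "$rogue" | grep -c '^veth')

     if [ "$count" -ge "$THRESHOLD" ]; then
         for iface in $rogue; do
             ovs-vsctl del-port "$BRIDGE" "$iface"
         done
     fi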


Orphaning / pod termination problems in the logs
================================================
 There are several classes of problems reported in the system log with unknown repercussions. Currently, I
 don't see any negative side effects, except that some of these issues may trigger the "rogue interfaces" problem.

 ! container kill failed because of 'container not found' or 'no such process': "Cannot kill container ###: rpc error: code = 2 desc = no such process"

   Despite the error, the containers are actually killed and the pods destroyed. However, this error likely triggers
   the problem with rogue interfaces staying on the OpenVSwitch bridge.
    
  Scenario:
    * Happens with short-lived containers

 - containerd: unable to save f7c3e6c02cdbb951670bc7ff925ddd7efd75a3bb5ed60669d4b182e5337dec23:d5b9394468235f7c9caca8ad4d97e7064cc49cd59cadd155eceae84545dc472a starttime: read /proc/81994/stat: no such process
   containerd: f7c3e6c02cdbb951670bc7ff925ddd7efd75a3bb5ed60669d4b182e5337dec23:d5b9394468235f7c9caca8ad4d97e7064cc49cd59cadd155eceae84545dc472a (pid 81994) has become an orphan, killing it
    
  Scenario:
    This happens every couple of minutes and is attributed to perfectly alive and running pods (the affected
    pod can be identified from the container hash as sketched below).
    * For instance, ipekatrin1 was complaining about some ADEI pod.
    * After I removed this pod, it immediately started complaining about a 'glusterfs' replica.
    * If the 'glusterfs' pod is re-created, the problem persists.
    * It seems only a single pod is affected at any given moment (at least this was always true
    on ipekatrin1 & ipekatrin2 while I was researching the problem).
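
    To figure out which pod a given orphan message refers to, the container hash from the log
    line can be mapped back to a pod. A minimal sketch, assuming the first hash in the message
    is the docker container ID and that the standard kubelet labels are present:

      # Map a docker container ID (first hash in the containerd message) to its pod.
      # The label names are the ones normally set by the kubelet; treat them as an
      # assumption if the docker setup differs.
      CID=f7c3e6c02cdbb951670bc7ff925ddd7efd75a3bb5ed60669d4b182e5337dec23
      docker inspect --format \
          '{{ index .Config.Labels "io.kubernetes.pod.namespace" }}/{{ index .Config.Labels "io.kubernetes.pod.name" }}' \
          "$CID"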
    
  Relations:
    * This problem is not aligned with the 'container not found' problem above. That one happens with short-lived containers which
    actually get destroyed. This one is triggered for persistent containers which keep running. And in fact this problem is triggered
    significantly more frequently.

  Cause:
    * Seems related to docker health checks due to a bug in docker 1.12* which is resolved in 1.13.0rc2
        https://github.com/moby/moby/issues/28336
        
  Problems:
    * The only consequence seems to be excessive logging, according to the discussion in the issue.

  Solution: Ignore for now
    * docker-1.13 had some problems with groups (I don't remember exactly) and it was decided not to run it with the current version of KaaS.
    * Only update docker after extensive testing on the development cluster, or not at all.

 - W0625 03:49:34.231471   36511 docker_sandbox.go:337] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "...": Unexpected command output nsenter: cannot open /proc/63586/ns/net: No such file or directory
 - W0630 21:40:20.978177    5552 docker_sandbox.go:337] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "...": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "..."
  Scenario:
    * It seems this can be ignored, see the RH bug.
    * Happens with short-lived containers (adei cron jobs)

  Relations:
    * This is also not aligned with the 'container not found' problem. The times in the logs differ significantly.
    * It is also not aligned with the 'orphan' problem.

  Cause:
    ? https://bugzilla.redhat.com/show_bug.cgi?id=1434950 

 - E0630 14:05:40.304042    5552 glusterfs.go:148] glusterfs: failed to get endpoints adei-cfg[an empty namespace may not be set when a resource name is provided]
   E0630 14:05:40.304062    5552 reconciler.go:367] Could not construct volume information: MountVolume.NewMounter failed for volume "kubernetes.io/glusterfs/4

    I guess this is some configuration issue and can probably be ignored (a first check is sketched at the end of this section).

  Scenario:
    * Reported for long-running pods with persistent volumes (katrin, adai-db)
    * Also seems to be an unrelated set of problems.
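
    If this ever needs a follow-up, a first check might be whether the referenced endpoints
    object actually exists in the project of the pod that mounts the volume. A minimal sketch
    using plain 'oc' commands (the 'adei' project name here is just a placeholder):

      # Verify the glusterfs endpoints referenced by the volume exist and list the servers.
      oc get endpoints adei-cfg -n adei
      oc describe endpoints adei-cfg -n adei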