Sep 24 13:34:18 ipekatrin2 kernel: Memory cgroup out of memory: Kill process 57372 (mongod) score 1984 or sacrifice child
Sep 24 13:34:22 ipekatrin2 origin-node: I0924 13:34:22.704691 93115 kubelet.go:1921] SyncLoop (container unhealthy): "mongodb-2-6j5w7_services(b350130e-ac45-11e9-bbd6-0cc47adef0e6)"
Sep 24 13:34:29 ipekatrin2 origin-node: I0924 13:34:29.774596 93115 kubelet.go:1888] SyncLoop (PLEG): "mongodb-2-6j5w7_services(b350130e-ac45-11e9-bbd6-0cc47adef0e6)", event: &pleg.PodLifecycleEvent{ID:"b350130e-ac45-11e9-bbd6-0cc47adef0e6", Type:"ContainerStarted", Data:"1d485a4dd86b8f7ff24649789eee000d55319ef64d9b447c532a43fadce2831e"}
Sep 24 13:34:35 ipekatrin2 origin-node: I0924 13:34:35.177258 93115 roundrobin.go:310] LoadBalancerRR: Setting endpoints for services/mongodb:mongo to [10.130.0.91:27017]
Sep 24 13:34:35 ipekatrin2 origin-node: I0924 13:34:35.177323 93115 roundrobin.go:240] Delete endpoint 10.130.0.91:27017 for service "services/mongodb:mongo"
... Nothing about mongod on any node until the mass destruction ....
====
Sep 25 07:52:00 ipekatrin2 origin-node: I0925 07:52:00.422291 93115 kubelet.go:1796] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.448988393s ago; threshold is 3m0s]
Sep 25 07:52:31 ipekatrin2 origin-master-controllers: I0925 07:52:31.761961 109653 nodecontroller.go:617] Node is NotReady. Adding Pods on Node ipekatrin2.ipe.kit.edu to eviction queue
Sep 25 07:52:47 ipekatrin2 origin-master-controllers: I0925 07:52:47.584394 109653 controller_utils.go:89] Starting deletion of pod services/mongodb-2-6j5w7
Sep 25 07:56:04 ipekatrin2 origin-node: ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
Sep 25 08:07:41 ipekatrin2 systemd-logind: Failed to start session scope session-118144.scope: Connection timed out
====
Sep 26 08:53:19 ipekatrin2 origin-master-controllers: I0926 08:53:19.435468 109653 nodecontroller.go:644] Node is unresponsive. Adding Pods on Node ipekatrin3.ipe.kit.edu to eviction queues:
Sep 26 08:54:09 ipekatrin3 kernel: glustertimer invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=-999
Sep 26 08:54:27 ipekatrin3 kernel: Out of memory: Kill process 91288 (mysqld) score 1075 or sacrifice child
Sep 26 08:54:14 ipekatrin2 etcd: lost the TCP streaming connection with peer 2696c5f68f35c672 (stream MsgApp v2 reader)
Sep 26 08:55:02 ipekatrin2 etcd: established a TCP streaming connection with peer 2696c5f68f35c672 (stream MsgApp v2 writer)
Sep 26 08:57:54 ipekatrin3 origin-node: ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
Sep 26 09:34:20 ipekatrin2 origin-node: I0926 09:34:20.361306 93115 kubelet.go:1796] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 8m12.284528292s ago; threshold is 3m0s]

0. ipekatrin1 (and to a lesser degree ipekatrin2) was affected by a huge number of images, which slowed down Docker communication. Scheduling on ipekatrin1 was disabled for development purposes.
1. On 24.09 mongodb used more memory than allowed by the 'dc' configuration and was killed by the OpenShift/cgroup OOM killer.
2. For some reason, the service was not restarted, leaving rocketchat non-operational.
3. On 25.09 7:52 ipekatrin2 became unhealthy and unschedulable due to PLEG timeouts?
   * Pods migrated to ipekatrin3. Performance problems due to the mass migration caused systemd (and mount) problems.
   * The system recovered relatively quickly, but few pods were running on ipekatrin2 and ipekatrin3 was severely overloaded.
4. On 26.09 8:53 the system OOM killer was triggered on ipekatrin3 due to an overall lack of memory.
   * The node was marked unhealthy and pod eviction was triggered.
   * etcd problems were registered, causing real problems in the cluster fabric.
5. On 26.09 9:34 PLEG recovered for some reason.
   * Most of the pods were rescheduled automatically and the system recovered on its own.
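
The event types that drive the timeline above (cgroup OOM kills, system OOM kills, PLEG health failures, and nodes being queued for eviction) can be pulled out of a syslog dump automatically. The following is a minimal Python sketch, not a tool that was used during the incident. The names Event, classify and timeline are hypothetical, and the regular expressions are derived only from the log lines quoted in these notes, so they may need adjusting for other kernel or OpenShift versions.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Event(NamedTuple):
    timestamp: str  # syslog timestamp, e.g. "Sep 24 13:34:18" (no year in the source logs)
    host: str       # host that wrote the log line
    kind: str       # one of the keys in PATTERNS below
    detail: str     # process name, PLEG age, or affected node


# Syslog prefix as seen in the excerpts: "<Mon DD HH:MM:SS> <host> <unit>: <message>"
PREFIX = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<unit>[^:]+): (?P<msg>.*)$"
)

# Patterns are assumptions based solely on the quoted log lines.
PATTERNS = [
    ("cgroup_oom", re.compile(r"Memory cgroup out of memory: Kill process \d+ \((?P<d>[^)]+)\)")),
    ("system_oom", re.compile(r"^Out of memory: Kill process \d+ \((?P<d>[^)]+)\)")),
    ("pleg_unhealthy", re.compile(r"PLEG is not healthy: pleg was last seen active (?P<d>\S+) ago")),
    # For this kind, `host` is the master that logged the line and `detail` is the affected node.
    ("node_eviction", re.compile(
        r"Node is (?:NotReady|unresponsive)\. Adding Pods on Node (?P<d>\S+) to eviction queue")),
]


def classify(line: str) -> Optional[Event]:
    """Return an Event if the line matches one of the known patterns, else None."""
    m = PREFIX.match(line.strip())
    if m is None:
        return None
    for kind, pattern in PATTERNS:
        hit = pattern.search(m["msg"])
        if hit:
            return Event(m["ts"], m["host"], kind, hit["d"])
    return None


def timeline(lines: Iterable[str]) -> List[Event]:
    """Keep only the recognised events, in the order they appear in the input."""
    return [event for event in map(classify, lines) if event is not None]


if __name__ == "__main__":
    sample = [
        "Sep 24 13:34:18 ipekatrin2 kernel: Memory cgroup out of memory: "
        "Kill process 57372 (mongod) score 1984 or sacrifice child",
        "Sep 26 08:54:27 ipekatrin3 kernel: Out of memory: "
        "Kill process 91288 (mysqld) score 1075 or sacrifice child",
    ]
    for event in timeline(sample):
        print(event)
```

Events are returned in input order rather than sorted by timestamp, because the syslog timestamps carry no year and the merged logs from different hosts are not strictly ordered (see the etcd line at 08:54:14 following the kernel line at 08:54:27).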