Actions Required
================
 * The long-term solution to the 'rogue' interfaces problem is unclear; it may require an update to OpenShift 3.9 or later.
 However, the proposed work-around should suffice unless the job execution rate grows significantly.
 * All other problems found in the logs can be ignored.
 

Client Connection
=================
 * For some reason OpenShift requests client certificates. This is ignored by the majority of older browsers,
   but Firefox 70+ is able to offer installed user certificates. If a KIT certificate is selected, OpenShift
   fails to start. This can easily be circumvented by pressing 'Cancel' when the client-certificate selection
   box pops up.


Leaked resources after pod termination: Rogue network interfaces on OpenVSwitch bridge, unreclaimed IPs in pod-network, ...
=======================================
 Sometimes OpenShift fails to clean up properly after a terminated pod. The actual reason is unclear, but the
 severity of the problem increases if an extreme number of images is present in the local Docker storage.
 Several thousand images definitively intensify the problem.
  * The issues are discussed here:
        https://bugzilla.redhat.com/show_bug.cgi?id=1518684
        https://bugzilla.redhat.com/show_bug.cgi?id=1518912
  * The problem can be detected by inspecting the bridge (see also the count example below):
    ovs-vsctl show
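  * A rogue port typically shows up in this output as a veth port with a "could not open network device ...
    (No such device)" error. A quick, hedged way to count them (verify the exact message on the node first):
        ovs-vsctl show | grep -c 'No such device'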

 Problems:
  * As the number of rogue interfaces grows, it starts to impact performance. Operations with
  OVS slow down and at some point the pods scheduled to the affected node fail to start due to
  timeouts. This is indicated in 'oc describe' as: 'failed to create pod sandbox'.
  * With time, new rogue interfaces are created faster and faster. At some point this really
  slows the system down and causes pod failures (if many pods are re-scheduled in parallel) even
  while relatively few rogue interfaces are present.
  * Furthermore, there is a limited range of IPs allocated for the pod-network on each node. Whether
  it is caused by the lost bridges or by an unrelated resource-management problem in OpenShift,
  these IPs also start to leak. As the number of leaked IPs increases, it takes OpenShift longer
  to find an IP which is still free and pod scheduling slows down further. At some point the complete
  range of IPs gets exhausted and pods fail to start (after a long wait in 'Scheduling' state)
  on the affected node. The current allocation can be checked as shown below.
  * Even when pods do not fail, it takes several minutes to schedule a pod on the affected nodes.
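  * To gauge how far the IP leak has progressed on a node, the allocations in the CNI state directory can
  be counted (a rough check; the directory is the same one used in the clean-up below and contains one
  file per reserved IP):
        # count IP reservation files (names are plain IPv4 addresses); bookkeeping files are skipped
        ls /var/lib/cni/networks/openshift-sdn | grep -c '^[0-9]'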

 Cause:
  * Unclear, but it seems the periodic ADEI cron jobs cause the issue if many images are present
  in Docker.
  * Could be related to the 'container kill failed' problem explained in the section below:
     Cannot kill container ###: rpc error: code = 2 desc = no such process

 Solutions:
  * According to Red Hat, the temporary solution is to reboot the affected node (in my case this just temporarily
  reduces the rate at which new spurious interfaces appear, but does not prevent the problem completely). The
  problem should go away, but may re-appear after a while.
  * The simplest work-around is to just remove the rogue interfaces (see the sketch below). They will be re-created,
  but performance problems only start after hundreds accumulate.
    ovs-vsctl del-port br0 <iface>
  * Similarly, the unused IPs can be cleaned up in "/var/lib/cni/networks/openshift-sdn"; just check whether the
  docker container referenced in each IP file is still running with "docker ps". Afterwards, the 'origin-node'
  service should be restarted.
  * It also seems helpful to purge unused docker images to reduce the rate at which new interfaces appear.
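  * A minimal clean-up sketch along these lines (an illustrative example, not the exact cron job installed on
  the cluster; it assumes every 'veth*' port on br0 whose host-side device no longer exists is rogue, so review
  it on the node before automating):
        #!/bin/bash
        # iterate over all ports currently attached to the OpenShift SDN bridge
        for iface in $(ovs-vsctl list-ports br0); do
            # only veth ports belong to (former) pods; skip tun0/vxlan0 and other infrastructure ports
            case "$iface" in veth*) ;; *) continue ;; esac
            # if the veth device is gone from the host, the port is a leftover of a terminated pod
            if ! ip link show "$iface" > /dev/null 2>&1; then
                ovs-vsctl del-port br0 "$iface"
            fi
        done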
  
 Status:
   * A cron job is installed which cleans the rogue interfaces once their number hits 25.


Hung pods
=========
 Pod processes may get stuck. Normally, such processes are detected using the 'liveness' probe and are
 restarted by OpenShift if necessary. However, occasionally processes get stuck in syscalls (such processes
 are marked with 'D' in ps). These processes can't be killed with SIGKILL and OpenShift will not be able
 to terminate them, leaving the pods indefinitely in 'Terminating' status.
 
 Problems:
  * Pods stuck in 'Terminating' status prevent the start of new replicas. In the case of 'jobs', a large number
  of 'Terminating' pods can overload the OpenShift controllers.

 Cause:
  * One reason is spurious locks on the GlusterFS file system. On CentOS 7 it is impossible to interrupt a
  process waiting for a lock initiated by a blocking 'flock' call. It gets stuck in a syscall and is indicated
  by state 'D' in the ps output. Sometimes GlusterFS may keep files locked even though the processes holding these
  locks have already exited/crashed. I am not sure about the exact conditions under which this happens, but it
  seems, for instance, that a crashed Docker daemon may cause this effect if some of the running containers were
  holding locks on GlusterFS at the moment of the crash.
    - We can verify whether this is the case by checking if the process associated with the problematic pod is
     stuck in state 'D' and by analyzing its backtrace (/proc/<pid>/stack), as in the example below.
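    - A quick way to do that with standard tooling (nothing OpenShift-specific is assumed):
          # list processes in uninterruptible sleep ('D'), along with the kernel function they wait in
          ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
          # dump the kernel stack of a suspicious process; glusterfs/flock frames point to this cause
          cat /proc/<pid>/stack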
    

 Solutions: 
  * Avoid blocking flock calls on GlusterFS; use polling with a sleep instead (see the sketch below). To release
  already stuck pods, we need to find and destroy the problematic locks. GlusterFS allows debugging locks using
  'statedump'; check the GlusterFS documentation for details. There is also a mechanism to clear such locks, but
  it does not always work. An alternative is to remove the locked files AND keep them removed for a while until
  all blocked 'flock' syscalls are released.
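  * A minimal sketch of the polling approach (assuming the shell 'flock' utility and a hypothetical lock-file
  path; '-n' makes the call fail immediately instead of blocking inside a syscall):
        # open the lock file on a dedicated descriptor; the lock is released when fd 9 is closed
        exec 9> /adei/locks/maintain.lock
        # try to acquire the lock without blocking; retry every few seconds instead of hanging in 'D' state
        until flock -n 9; do
            sleep 5
        done
        # ... critical section ...
        flock -u 9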


Hung MySQL connection
=====================
 Stale MySQL locks may prevent new clients from connecting to certain tables in the MySQL database.
 
 Problems:
  * The problem may affect either only clients trying to obtain 'write' access or all usage patterns. In the first case,
  it will cause the ADEI 'caching' threads to hang indefinitely, while the 'maintain' threads will be terminated after the
  specified timeout, leaving administrative scripts unprocessed.
  
 Cause:
  * For whatever reason, some crashed clients may leave their locks behind. I believe a crashed 'docker'
  daemon is one possible cause. The problem can be found by executing 'SHOW PROCESSLIST' on the MySQL
  server (see the example below). More diagnostic possibilities are discussed in the MySQL notes.
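  * A hedged example of that check, run from inside the MySQL pod (assumes client credentials are already
  configured there):
        # list current connections; stale lock holders typically show a large 'Time' value
        mysql -e 'SHOW PROCESSLIST'
        # show tables currently locked by some connection
        mysql -e 'SHOW OPEN TABLES WHERE In_use > 0'
        # as a last resort, terminate the offending connection using its Id from the process list
        mysql -e 'KILL <id>'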
  
 Solutions:
  * Normally, restarting the MySQL pod should be enough, e.g. as shown below.
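  * A minimal example of such a restart (the namespace and pod name are placeholders; the deployment
  config re-creates the pod, dropping all stale connections and locks):
        oc delete pod -n <namespace> <mysql-pod>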


Orphaning / pod termination problems in the logs
================================================
 There are several classes of problems reported in the system log with unknown repercussions. Currently, I
 don't see any negative side effects, except that some of these issues may trigger the "rogue interfaces" problem.

 ! container kill failed because of 'container not found' or 'no such process': Cannot kill container ###: rpc error: code = 2 desc = no such process

   Despite the error, the containers are actually killed and the pods destroyed. However, this error likely triggers
   the problem with rogue interfaces staying on the OpenVSwitch bridge.
    
  Scenario:
    * Happens with short-lived containers; occurrences can be located in the journal as shown below.
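    * A hedged way to find these events on a node (assuming the node service logs to the journal as
      'origin-node' on this cluster; the same message may also appear in the docker unit's journal):
          journalctl -u origin-node --since '1 hour ago' | grep 'Cannot kill container'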

 - containerd: unable to save f7c3e6c02cdbb951670bc7ff925ddd7efd75a3bb5ed60669d4b182e5337dec23:d5b9394468235f7c9caca8ad4d97e7064cc49cd59cadd155eceae84545dc472a starttime: read /proc/81994/stat: no such process
   containerd: f7c3e6c02cdbb951670bc7ff925ddd7efd75a3bb5ed60669d4b182e5337dec23:d5b9394468235f7c9caca8ad4d97e7064cc49cd59cadd155eceae84545dc472a (pid 81994) has become an orphan, killing it
    
  Scenario:
    This happens every couple of minutes and is attributed to perfectly alive and running pods.
    * For instance, ipekatrin1 was complaining about some ADEI pod.
    * After I removed this pod, it immediately started complaining about the 'glusterfs' replica.
    * If the 'glusterfs' pod is re-created, the problem persists.
    * It seems only a single pod is affected at any given moment (at least this was always true
    on ipekatrin1 & ipekatrin2 while I was researching the problem).
    
  Relations:
    * This problem is not aligned with the 'container not found' problem above: that one happens with short-lived
    containers which actually get destroyed, while this one is triggered for persistent containers which keep running.
    In fact, this problem is triggered significantly more frequently.

  Cause:
    * Seems related to docker health checks due to a bug in docker 1.12* which is resolved in 1.13.0rc2
        https://github.com/moby/moby/issues/28336
        
  Problems:
    * It seems the only consequence is excessive logging, according to the discussion in the issue.

  Solution: Ignore for now
    * docker-1.13 had some problems with groups (I don't remember exactly) and it was decided not to run it with the current version of KaaS.
    * Only update docker after extensive testing on the development cluster, or not at all. The currently deployed
    version can be checked as shown below.
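    * A quick check of the installed version directly on a node (standard docker CLI):
          docker version --format '{{.Server.Version}}'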

 - W0625 03:49:34.231471   36511 docker_sandbox.go:337] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "...": Unexpected command output nsenter: cannot open /proc/63586/ns/net: No such file or directory
 - W0630 21:40:20.978177    5552 docker_sandbox.go:337] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "...": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "..."
  Scenario:
    * It seems this can be ignored; see the RH bug referenced below.
    * Happens with short-lived containers (adei cron jobs).

  Relations:
    * This is also not aligned with the 'container not found' problem; the timestamps in the logs differ significantly.
    * It is also not aligned with the 'orphan' problem.

  Cause:
    ? https://bugzilla.redhat.com/show_bug.cgi?id=1434950 

 - E0630 14:05:40.304042    5552 glusterfs.go:148] glusterfs: failed to get endpoints adei-cfg[an empty namespace may not be set when a resource name is provided]
   E0630 14:05:40.304062    5552 reconciler.go:367] Could not construct volume information: MountVolume.NewMounter failed for volume "kubernetes.io/glusterfs/4

    I guess this is some configuration issue; it can probably be ignored.

  Scenario:
    * Reported on long-running pods with persistent volumes (katrin, adai-db).
    * Also seems to be an unrelated set of problems.
    
    
Evicted Pods
============
 Pods are evicted if the node running the pod becomes unavailable or does not have enough resources to run the pod.
 - It is possible to look up which resource is likely triggering the eviction with:
    > oc describe node ipekatrin2.ipe.kit.edu
     Type                  Status  LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
     ----                  ------  -----------------                       ------------------                      ------                          -------
     OutOfDisk             False   Tue, 05 Apr 2022 03:24:54 +0200         Tue, 21 Dec 2021 19:09:33 +0100         KubeletHasSufficientDisk        kubelet has sufficient disk space available
     MemoryPressure        False   Tue, 05 Apr 2022 03:24:54 +0200         Tue, 21 Dec 2021 19:09:33 +0100         KubeletHasSufficientMemory      kubelet has sufficient memory available
     DiskPressure          False   Tue, 05 Apr 2022 03:24:54 +0200         Mon, 04 Apr 2022 10:00:23 +0200         KubeletHasNoDiskPressure        kubelet has no disk pressure
     Ready                 True    Tue, 05 Apr 2022 03:24:54 +0200         Tue, 21 Dec 2021 19:09:43 +0100         KubeletReady                    kubelet is posting ready status
  The latest transition is 'DiskPressure', which happened on Apr 04; so disk is likely the issue.
  
 - DiskPressure eviction
    * This might happen because a pod writes too much output to its logs (standard output). These logs are stored under '/var/lib/origin/openshift.local.volumes/pods/...'
    and, if they grow large, can use up all the space in the '/var' file system. OpenShift does not rotate these logs and has no other mechanism to prevent large output from
    eventually causing space issues. So pods have to rate-limit their output to stdout; otherwise we need to find the misbehaving pods which write too much (see the check below).
    * Another problem is 'inode' pressure. This can be checked with 'df'; anything above 80% is definitively a sign of a problem:
            df -i
      The particular folders with lots of inodes can be found with the following command:
            { find / -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n; } 2>/dev/null
      Likely there will be some OpenShift-related volume logs in '/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/'.
      Check particularly the cronJob logs mounting volumes, e.g. the various 'adei' stuff. These can be cleaned with:
            find /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/ -name '*.log' -delete
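    * To find the pods responsible for the space usage, the per-pod log directories mentioned above can be
      sized up directly (a simple check; paths as used on this cluster):
            du -sh /var/lib/origin/openshift.local.volumes/pods/* 2>/dev/null | sort -h | tail -n 20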

 - If a resource is not available for a long time, the node will become NotReady and all pods will be evicted. However, short-term problems caused by a pod itself likely cause the eviction
 of only that particular pod (once the pod is evicted, disk/memory space is reclaimed and the logs are deleted). So, it is possible to find the problematic pod by looking at which pod was
 evicted most frequently, as in the example below.
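 - A hedged example of such a look-up (evicted pods stay listed with status 'Evicted' until they are cleaned
 up; on older oc clients the '-a/--show-all' flag may be needed to include them):
    oc get pods --all-namespaces | grep Evicted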